I don't know, I think I used about 40 languages. The beauty is that zip compression captures rich statistical properties of a language, so representation-wise it should go a long way. But counting compressed output length discretises the language-to-language distance. For shorter text this could be troubling, since it easily results in ties. So, maybe. Perhaps I should try :).
I agree the task is neither impossible nor useless, and there's work to do: short passages should be supported. I do however think franc does a good job, and adds support for some languages that (I think) have never been supported before today. Franc certainly "attempts" to fix language detection, which I would argue is an AI-complete problem.
Anyway, you're completely right. Italian comes out as `und` because the input is 10 characters or fewer, and the others are slightly off due to short input too, but the demo (http://wooorm.github.io/franc/) does show the correct languages in second or third place, though!
No it doesn't: it still takes French for Catalan (French itself only comes in third place, after Italian), and Swedish for Dutch.
(Arguably those are close languages, but hey, this is why I'm using this, right?)
By `correct language` I mean the language you expect; by `second` and `third` I mean `2.` and `3.` in the previously mentioned demo (http://wooorm.github.io/franc/). I think we're talking about the same thing!
Anyway, yeah, franc is for language detection, but it's optimised for many languages and works best on longer text. It's a trade-off. For fewer languages and shorter texts, check out https://github.com/shuyo/ldig
No full-frequency data is kept; only the top 300 trigrams are identified. A quick look through the source also reveals wooorm/trigrams and wooorm/udhr as sources!
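The top-trigram approach described here can be sketched roughly as follows, in the spirit of the classic Cavnar & Trenkle "out of place" measure; the exact cleaning and scoring in wooorm/trigrams may differ, so treat this as an assumption-laden illustration:

```javascript
// Count all trigrams in a (lowercased, punctuation-stripped) string.
function trigrams(text) {
  const padded = ' ' + text.toLowerCase().replace(/[\u0021-\u0040]+/g, ' ').trim() + ' ';
  const counts = new Map();
  for (let i = 0; i < padded.length - 2; i++) {
    const gram = padded.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Keep only the `max` most frequent trigrams, as [gram, rank] pairs —
// this is the "top 300, no full frequencies" idea.
function profile(text, max = 300) {
  return [...trigrams(text).entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, max)
    .map(([gram], rank) => [gram, rank]);
}

// "Out of place" distance: sum of rank differences, with a maximum
// penalty for trigrams missing from the language profile.
function distance(inputProfile, langProfile) {
  const ranks = new Map(langProfile);
  let total = 0;
  for (const [gram, rank] of inputProfile) {
    total += ranks.has(gram) ? Math.abs(ranks.get(gram) - rank) : langProfile.length;
  }
  return total;
}
```

Because only ranks survive, the profiles stay tiny, which is exactly why short inputs hurt: a ten-character string yields too few trigrams for the ranks to mean much.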
It sucks, right? Currently it's good at long passages, but for shorter input the results are pretty poor. The amount of supported languages is just too damn high!
I'm not sure. I don't know any CJK languages myself. I'd like some test cases where the current method does not work, as the example in the readme seems to work pretty well: `এটি একটি ভাষা একক IBM স্ক্রিপ্ট` is classified as Bengali?
Some examples follow. I've been testing with arbitrary text from the Web, and I agree that these are somewhat marginal examples. (But I do think that franc's margin of error for CJK languages is quite wide.)
한국어 문서가 전 세계 웹에서 차지하는 비중은 2004년에 4.1%로, 이는 영어(35.8%), 중국어(14.1%), 일본어(9.6%), 스페인어(9%), 독일어(7%)에 이어 전 세계 6위이다. 한글 문서와 한국어 문서를 같은 것으로 볼 때, 웹상에서의 한국어 사용 인구는 전 세계 69억여 명의 인구 중 약 1%에 해당한다.
This text from the Korean Wikipedia is about the share of Korean documents among all documents on the Web. The digits distort the overall ratio, and franc doesn't give any candidates (not even "und").
現行の学校文法では、英語にあるような「目的語」「補語」などの成分はないとする。英語文法では "I read a book." の "a book" はSVO文型の一部をなす目的語であり、また、"I go to the library." の "the library" は前置詞とともに付け加えられた修飾語と考えられる。
This text from the Japanese Wikipedia concerns the distinction between objects and complements in English syntax. In this bilingual text it looks like Japanese should reach the 60% threshold, but by codepoint count it doesn't.
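The failure mode in these two samples can be made concrete with a per-script codepoint count. A hedged sketch in Node using Unicode script property escapes; the 60% figure and the script list follow this discussion, not franc's actual tables:

```javascript
// Unicode script classes relevant to the CJK examples above.
const scripts = {
  Hangul: /\p{Script=Hangul}/u,
  Hiragana: /\p{Script=Hiragana}/u,
  Katakana: /\p{Script=Katakana}/u,
  Han: /\p{Script=Han}/u,
  Latin: /\p{Script=Latin}/u
};

// Fraction of non-whitespace codepoints belonging to each script.
function scriptShares(text) {
  const chars = [...text].filter((c) => !/\s/.test(c));
  const shares = {};
  for (const [name, re] of Object.entries(scripts)) {
    shares[name] = chars.filter((c) => re.test(c)).length / chars.length;
  }
  return shares;
}

// Digits, punctuation, and embedded English all count toward the total but
// toward no CJK script, which is how a bilingual passage (or one full of
// percentages) can fall short of a 60% cut-off.
console.log(scriptShares('한국어 2004년 4.1%'));
```

Running this on the Korean sentence above shows the Hangul share dragged well below any plausible threshold by the digits alone.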
It does have several translations of the Bible, though. I guess it would be a lot of work to find Bible translations for all those languages - or was there another reason for using the Human Rights Declaration?
Thanks! Currently the UDHRs are crawled, and I'd rather not include exceptions and maintain their plain-text and XML/JSON versions by hand. If you're into growing the language support, I suggest contacting the UN's Office of the High Commissioner for Human Rights and the Unicode project, or forking wooorm/udhr and adding support; I'll merge :)