More

wooorm · on Sept 2, 2015

alex is open to suggestion. See http://alexjs.com/#contributing on how to contribute. So if you have a word to add or phrasing to remove let us know!

wooorm · on Oct 5, 2014

It’s a very interesting idea. Would it work accurate enough when scaled to 160+ languages?

breuderink · on Oct 5, 2014

I don't know, I think I used about 40 languages. The beauty is that zip-compression captures rich statistical properties of the languages, so representation-wise it should come a long way. But counting compressed output length discretises the lang-lang distance. For shorter text this might be troubling, since this could easily result in ties. So, maybe. Perhaps I should try :).

wooorm · on Oct 6, 2014

Perhaps you should ;) If, I’d be interest to know how it goes!

wooorm · on Oct 3, 2014

I agree the task is neither impossible nor useless. There’s work to do. Short passages should be supported. I do however think franc does a good job, and adds support for some languages which before today have never (I think) been supported. Franc, certainly, “attempt”s to fix language detection, which I would argue is an AI-complete problem.

wooorm · on Oct 3, 2014

Ha! Some very nice examples, I have to say :)

Anyway, You’re completely right. Italian is `und` due to LTE 10 characters, the others are slightly off due to short input too, but the demo (http://wooorm.github.io/franc/) shows the correct languages in the second or third place though!

jodent · on Oct 3, 2014

No it doesn't, still takes French for Catalan (French only comes at third place, after Italian), and Swedish for Dutch. (Arguably those are close languages, but hey, this is why I'm using this, right?)

wooorm · on Oct 3, 2014

By `correct language` I mean the language you expect, by `second` and `third` I mean `2.` and `3.` in the previously mentioned demo: http://wooorm.github.io/franc/). I think we’re talking about the same thing!

Anyway, yeah, franc is for language detecting, but it’s optimised for many languages and works best at longer text. It’s a trade-off. For less languages and shorter texts, check out https://github.com/shuyo/ldig

wooorm · on Oct 3, 2014

No full-frequency data is kept, only 300 top-trigrams are identified. A quick through the source also reveals wooorm/trigrams, and wooorm/udhr, as sources!

riffraff · on Oct 3, 2014

yes, I meant: keeping full frequency could have been avoided to save space/memory but having two classes high/low could be a good tradeoff.

wooorm · on Oct 3, 2014

It’s an interesting thought. I might fiddle on it, but I’m not sure it would work in practice (d’oh). Thanks!

wooorm · on Oct 3, 2014

It sucks, right? Currently, it’s good at long passages. But for shorter values, the results are pretty poor. The amount of supported languages is just too damn high!

wooorm · on Oct 3, 2014

That would be awesome :)

wooorm · on Oct 3, 2014

I’m not sure. I don’t know any CJK languages myself. I’d like some test-cases where the current methods do not work, as the example in the Readme seems to work pretty well: `এটি একটি ভাষা একক IBM স্ক্রিপ্ট` is classified as Bengali?

lifthrasiir · on Oct 3, 2014

Some examples follow. I've really tested with arbitrary text on the Web, and I agree that they are somewhat marginal examples. (But I do think that Franc's margin for CJK languages is way wide.)

한국어 문서가 전 세계 웹에서 차지하는 비중은 2004년에 4.1%로, 이는 영어(35.8%), 중국어(14.1%), 일본어(9.6%), 스페인어(9%), 독일어(7%)에 이어 전 세계 6위이다. 한글 문서와 한국어 문서를 같은 것으로 볼 때, 웹상에서의 한국어 사용 인구는 전 세계 69억여 명의 인구 중 약 1%에 해당한다.

This text from Korean Wikipedia is about the ratio of Korean documents over all documents in the Internet. Digits distort the overall ratio and Franc doesn't give any candidates (even no "und").

現行の学校文法では、英語にあるような「目的語」「補語」などの成分はないとする。英語文法では "I read a book." の "a book" はSVO文型の一部をなす目的語であり、また、"I go to the library." の "the library" は前置詞とともに付け加えられた修飾語と考えられる。

This text from Japanese Wikipedia concerns about the distinction of objectives and complements in the English syntax. In this bilingual text it looks like that Japanese has reached the 60% threshold but the codepoint count doesn't.

wooorm · on Oct 3, 2014

Oh you’re right. I think I have a fix in mind, will work on it. Thanks so much!

wooorm · on Oct 5, 2014

I pushed a fix, incorporating your suggestions, and your examples in the specs.

Thanks a lot!

wooorm · on Oct 3, 2014

And it doesn’t have a Universal Declaration of Human rights: http://www.unicode.org/udhr/index_by_name.html

Luc · on Oct 3, 2014

It does have several translations of the bible, though. I guess it would be a lot of work to find bible translations for all those languages - or was there another reason for using the Human Rights Declaration?

P.S. Kudos, very cool project!

EDIT: Frisian version should you want it: https://www.google.com/search?q=Yn+betinken+nommen+dat+it+er...

wooorm · on Oct 3, 2014

Thanks! Currently, the UDHRs are crawled, and I’d rather not include exceptions and maintain their plain-text and XML/JSON versions by hand. If you’re into growing the language, I suggest contacting the Office of the High Commissioner of Human Rights of the UN, and the Unicode project, or fork wooorm/udhr and add support, I’ll merge :)

wooorm · on Oct 3, 2014

Fries as in Frisian? I don’t think it has one million speakers (right?) :p