Language Detection
Polyglot depends on the pycld2 library, which in turn depends on the cld2 library, to detect the language(s) used in plain text.
from polyglot.detect import Detector
Example
arabic_text = u""" أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ". """
detector = Detector(arabic_text)
print(detector.language)
name: Arabic code: ar confidence: 99.0 read bytes: 907
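The printed summary above can also be read field by field. The following is a minimal sketch, assuming the object returned by detector.language exposes name, code and confidence attributes that match the printed labels. describe and detect_primary are hypothetical helper names, not part of polyglot.

```python
def describe(language):
    """Format a detected language as 'Name (code), NN% confidence'.

    Assumes `language` has `name`, `code` and `confidence` attributes,
    mirroring the fields printed by polyglot above.
    """
    return u"{0} ({1}), {2:.0f}% confidence".format(
        language.name, language.code, language.confidence
    )


def detect_primary(text):
    """Return a one-line description of the most probable language of `text`."""
    # Imported lazily so that describe() stays usable without polyglot installed.
    from polyglot.detect import Detector

    return describe(Detector(text).language)
```

Calling detect_primary(arabic_text) should then produce a single line built from the same fields as the printed output.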
Mixed Text
mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""
If the text contains snippets from different languages, the detector is able to find the most probable languages used in the text. For each language, we can query the model confidence level:
for language in Detector(mixed_text).languages:
    print(language)

name: English code: en confidence: 87.0 read bytes: 1154
name: Chinese code: zh_Hant confidence: 5.0 read bytes: 1755
name: un code: un confidence: 0.0 read bytes: 0
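The result is padded to three entries with un placeholders, as the output above shows. A small filter can drop them. This is a sketch, assuming each entry has code and confidence attributes; known_languages and detect_known are hypothetical helper names.

```python
def known_languages(languages):
    """Return (code, confidence) pairs, skipping the 'un' placeholder entries."""
    return [
        (language.code, language.confidence)
        for language in languages
        if language.code != "un"
    ]


def detect_known(text):
    """Detect the languages in `text` and drop the unknown placeholders."""
    # Lazy import: only this wrapper needs polyglot.
    from polyglot.detect import Detector

    return known_languages(Detector(text).languages)
```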
To take a closer look, we can inspect the text line by line. Notice that the detection confidence drops for the first line, which mixes English and Chinese:
for line in mixed_text.strip().splitlines():
    print(line + u"\n")
    for language in Detector(line).languages:
        print(language)
    print("\n")

China (simplified Chinese: 中国; traditional Chinese: 中國),

name: English code: en confidence: 71.0 read bytes: 887
name: Chinese code: zh_Hant confidence: 11.0 read bytes: 1755
name: un code: un confidence: 0.0 read bytes: 0

officially the People's Republic of China (PRC), is a sovereign state located in East Asia.

name: English code: en confidence: 98.0 read bytes: 1291
name: un code: un confidence: 0.0 read bytes: 0
name: un code: un confidence: 0.0 read bytes: 0
Best Effort Strategy
Sometimes there is not enough text to make a decision, for example when detecting the language of a single word. This forces the detector to switch to a best effort strategy: a warning is logged and the attribute reliable is set to False.
detector = Detector("pizza")
print(detector)
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
Prediction is reliable: False
Language 1: name: English code: en confidence: 85.0 read bytes: 1194
Language 2: name: un code: un confidence: 0.0 read bytes: 0
Language 3: name: un code: un confidence: 0.0 read bytes: 0
If the detection is not reliable even with the best effort strategy, an UnknownLanguage exception is raised:
---------------------------------------------------------------------------
UnknownLanguage                           Traceback (most recent call last)
<ipython-input-9-de43776398b9> in <module>()
----> 1 print(Detector("4"))

/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in __init__(self, text, quiet)
     63     self.quiet = quiet
     64     """If true, exceptions will be silenced."""
---> 65     self.detect(text)
     66
     67   @staticmethod

/usr/local/lib/python2.7/dist-packages/polyglot-15.04.17-py2.7.egg/polyglot/detect/base.pyc in detect(self, text)
     89
     90     if not reliable and not self.quiet:
---> 91       raise UnknownLanguage("Try passing a longer snippet of text")
     92     else:
     93       logger.warning("Detector is not able to detect the language reliably.")

UnknownLanguage: Try passing a longer snippet of text
Such an exception may not be desirable, especially for trivial cases like characters that could belong to many languages. In this case, we can silence the exception by setting quiet to True:
WARNING:polyglot.detect.base:Detector is not able to detect the language reliably.
Prediction is reliable: False
Language 1: name: un code: un confidence: 0.0 read bytes: 0
Language 2: name: un code: un confidence: 0.0 read bytes: 0
Language 3: name: un code: un confidence: 0.0 read bytes: 0
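The call that produces the quiet output above is not shown in this section. As a minimal sketch, assuming the quiet keyword argument visible in the traceback's __init__(self, text, quiet) signature and the reliable attribute described earlier, a caller could accept only reliable results. reliable_code, polyglot_quiet and the make_detector hook are hypothetical names.

```python
def reliable_code(text, make_detector):
    """Return the detected language code, or None if the result is unreliable.

    `make_detector` is any callable that builds a detector from text; the
    detector must expose `reliable` and `language.code`.
    """
    detector = make_detector(text)
    return detector.language.code if detector.reliable else None


def polyglot_quiet(text):
    """Build a polyglot Detector that logs a warning instead of raising."""
    # Lazy import keeps reliable_code() usable without polyglot installed.
    from polyglot.detect import Detector

    return Detector(text, quiet=True)
```

With these helpers, reliable_code("4", polyglot_quiet) is expected to return None rather than raise UnknownLanguage.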
Command Line
usage: polyglot detect [-h] [--input [INPUT [INPUT ...]]]

optional arguments:
  -h, --help            show this help message and exit
  --input [INPUT [INPUT ...]]
The subcommand detect tries to identify the language of each line in a text file. This is convenient when each line represents a document or a sentence, for example one produced by a tokenizer.
!polyglot detect --input testdata/cricket.txt
English Australia posted a World Cup record total of 417-6 as they beat Afghanistan by 275 runs.
English David Warner hit 178 off 133 balls, Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth.
English Afghanistan were then dismissed for 142, with Mitchell Johnson and Mitchell Starc taking six wickets between them.
English Australia's score surpassed the 413-5 India made against Bermuda in 2007.
English It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages, following South Africa's 408-5 and 411-4 against West Indies and Ireland respectively.
English The winning margin beats the 257-run amount by which India beat Bermuda in Port of Spain in 2007, which was equalled five days ago by South Africa in their victory over West Indies in Sydney.
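The same per-line labelling can be approximated from Python. This is a rough sketch of the idea, not the subcommand's actual implementation. label_lines and polyglot_name are hypothetical helpers, and quiet=True is assumed so that very short lines do not raise.

```python
def label_lines(lines, detect_name):
    """Yield (language_name, line) for every non-blank line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            yield detect_name(line), line


def polyglot_name(line):
    """Name of the most probable language of `line` according to polyglot."""
    # Lazy import: only needed when polyglot is the chosen backend.
    from polyglot.detect import Detector

    return Detector(line, quiet=True).language.name
```

Passing an open file and polyglot_name to label_lines gives one (name, line) pair per non-blank line, which can then be printed in the same layout as the command output.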
Supported Languages
cld2 can detect up to 165 languages.
from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))
1. Abkhazian 2. Afar 3. Afrikaans 4. Akan 5. Albanian 6. Amharic 7. Arabic 8. Armenian 9. Assamese 10. Aymara 11. Azerbaijani 12. Bashkir 13. Basque 14. Belarusian 15. Bengali 16. Bihari 17. Bislama 18. Bosnian 19. Breton 20. Bulgarian 21. Burmese 22. Catalan 23. Cebuano 24. Cherokee 25. Nyanja 26. Corsican 27. Croatian 28. Croatian 29. Czech 30. Chinese 31. Chinese 32. Chinese 33. Chinese 34. Chineset 35. Chineset 36. Chineset 37. Chineset 38. Chineset 39. Chineset 40. Danish 41. Dhivehi 42. Dutch 43. Dzongkha 44. English 45. Esperanto 46. Estonian 47. Ewe 48. Faroese 49. Fijian 50. Finnish 51. French 52. Frisian 53. Ga 54. Galician 55. Ganda 56. Georgian 57. German 58. Greek 59. Greenlandic 60. Guarani 61. Gujarati 62. Haitian_creole 63. Hausa 64. Hawaiian 65. Hebrew 66. Hebrew 67. Hindi 68. Hmong 69. Hungarian 70. Icelandic 71. Igbo 72. Indonesian 73. Interlingua 74. Interlingue 75. Inuktitut 76. Inupiak 77. Irish 78. Italian 79. Ignore 80. Javanese 81. Javanese 82. Japanese 83. Kannada 84. Kashmiri 85. Kazakh 86. Khasi 87. Khmer 88. Kinyarwanda 89. Krio 90. Kurdish 91. Kyrgyz 92. Korean 93. Laothian 94. Latin 95. Latvian 96. Limbu 97. Limbu 98. Limbu 99. Lingala 100. Lithuanian 101. Lozi 102. Luba_lulua 103. Luo_kenya_and_tanzania 104. Luxembourgish 105. Macedonian 106. Malagasy 107. Malay 108. Malayalam 109. Maltese 110. Manx 111. Maori 112. Marathi 113. Mauritian_creole 114. Romanian 115. Mongolian 116. Montenegrin 117. Montenegrin 118. Montenegrin 119. Montenegrin 120. Nauru 121. Ndebele 122. Nepali 123. Newari 124. Norwegian 125. Norwegian 126. Norwegian_n 127. Nyanja 128. Occitan 129. Oriya 130. Oromo 131. Ossetian 132. Pampanga 133. Pashto 134. Pedi 135. Persian 136. Polish 137. Portuguese 138. Punjabi 139. Quechua 140. Rajasthani 141. Rhaeto_romance 142. Romanian 143. Rundi 144. Russian 145. Samoan 146. Sango 147. Sanskrit 148. Scots 149. Scots_gaelic 150. Serbian 151. Serbian 152. Seselwa 153. Seselwa 154. Sesotho 155. Shona 156. Sindhi 157. 
Sinhalese 158. Siswant 159. Slovak 160. Slovenian 161. Somali 162. Spanish 163. Sundanese 164. Swahili 165. Swedish 166. Syriac 167. Tagalog 168. Tajik 169. Tamil 170. Tatar 171. Telugu 172. Thai 173. Tibetan 174. Tigrinya 175. Tonga 176. Tsonga 177. Tswana 178. Tumbuka 179. Turkish 180. Turkmen 181. Twi 182. Uighur 183. Ukrainian 184. Urdu 185. Uzbek 186. Venda 187. Vietnamese 188. Volapuk 189. Waray_philippines 190. Welsh 191. Wolof 192. Xhosa 193. Yiddish 194. Yoruba 195. Zhuang 196. Zulu
© Copyright 2014-2015, Rami Al-Rfou. Revision d0d2aa8d .
langdetect 1.0.9
Language detection library ported from Google’s language-detection.
Metadata
License: Apache Software License (MIT)
Tags: language, detection, library
langdetect
Port of Nakatani Shuyo’s language-detection library (version from 03/03/2014) to Python.
Installation
Install from PyPI with pip install langdetect.
Supported Python versions: 2.7, 3.4+.
Languages
langdetect supports 55 languages out of the box (ISO 639-1 codes):
af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
Basic usage
To detect the language of the text:
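A minimal sketch of such a call, assuming langdetect is installed. detect returns a language code such as those listed above. The wrapper name detect_or_default is hypothetical, and it assumes that detection failures surface as LangDetectException from langdetect.lang_detect_exception.

```python
def detect_or_default(text, default=None):
    """Return the language code detected for `text`, or `default` on failure."""
    # Lazy imports so this module loads even where langdetect is missing.
    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return detect(text)
    except LangDetectException:
        # Raised, for example, when the text has no usable features.
        return default
```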
To find out the probabilities for the top languages:
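A minimal sketch, assuming langdetect is installed and that each item returned by detect_langs has lang and prob attributes. as_pairs and language_probabilities are hypothetical helper names.

```python
def as_pairs(results):
    """Convert detection results into (code, probability) tuples."""
    return [(result.lang, result.prob) for result in results]


def language_probabilities(text):
    """Return the candidate languages for `text` with their probabilities."""
    # Lazy import: only this wrapper needs langdetect.
    from langdetect import detect_langs

    return as_pairs(detect_langs(text))
```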
The language detection algorithm is non-deterministic: if you run it on a text that is either too short or too ambiguous, you might get different results every time you run it.
To enforce consistent results, call the following code before the first language detection:
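A minimal sketch of the seeding step, assuming langdetect is installed and that its DetectorFactory class reads a class-level seed attribute. fix_seed is a hypothetical wrapper name.

```python
def fix_seed(seed=0):
    """Fix langdetect's random seed so that repeated detections agree."""
    # Lazy import so this helper can be defined without langdetect installed.
    from langdetect import DetectorFactory

    DetectorFactory.seed = seed
```

Call fix_seed() once at program start, before any detection call.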
To add support for a new language, you need to create a new language profile. The easiest way to do it is to use the langdetect.jar tool, which can generate language profiles from Wikipedia abstract database files or plain text.
Wikipedia abstract database files can be retrieved from "Wikipedia Downloads" (http://download.wikimedia.org/). Their names have the form '(language code)wiki-(version)-abstract.xml' (e.g. 'enwiki-20101004-abstract.xml').
usage: java -jar langdetect.jar --genprofile -d [directory path] [language codes]
- Specify the directory that contains the abstract databases with the -d option.
- This tool can handle gzip-compressed files.
Remark: The Chinese database filenames look like 'zhwiki-(version)-abstract-zh-cn.xml' or 'zhwiki-(version)-abstract-zh-tw.xml', so they must be renamed to 'zh-cnwiki-(version)-abstract.xml' or 'zh-twwiki-(version)-abstract.xml'.
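The renaming described in the remark can be scripted. This is a small sketch that only computes the new name; profile_name is a hypothetical helper, and the pattern assumes the filenames follow exactly the shape quoted above.

```python
import re

# Matches e.g. 'zhwiki-20101004-abstract-zh-cn.xml'.
_CHINESE_ABSTRACT = re.compile(
    r"^zhwiki-(?P<version>.+)-abstract-(?P<variant>zh-cn|zh-tw)\.xml$"
)


def profile_name(filename):
    """Return the filename the profile tool expects for a Chinese abstract file.

    Filenames that do not match the Chinese pattern are returned unchanged.
    """
    match = _CHINESE_ABSTRACT.match(filename)
    if match is None:
        return filename
    return "{0}wiki-{1}-abstract.xml".format(
        match.group("variant"), match.group("version")
    )
```

The result can be passed to os.rename together with the original name to perform the actual rename.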
To generate a language profile from plain text, use the genprofile-text command.
usage: java -jar langdetect.jar --genprofile-text -l [language code] [text file path]
Original project
This library is a direct port of Google’s language-detection library from Java to Python. All the classes and methods are unchanged, so for more information see the project’s website or wiki.