Improvements to in-client language detection of strings

Total Posts: 2
This is a feature request. Feature requests can be voted up by supporters.
Current Priority: +26
Topic Starter
xxxxen0n
Currently, it seems that only strings containing kana are treated as Japanese, so Japanese strings made up solely of kanji can end up rendered in a Chinese font. This is only mildly frustrating (hey, many people can't even tell the fonts apart...), but here is a list of jōyō kanji that are encoded by neither GB2312 nor Big5 and are therefore not commonly seen in Simplified or Traditional Chinese text. The detection algorithm could be extended to treat strings containing any of them as Japanese, which I imagine is fairly simple work.

method
The list was produced by round-tripping each kanji through both charsets (encode, then decode) and collecting the ones that came back as mojibake.

As for the charset chosen for Simplified Chinese: GBK is a little too inclusive for the purpose of filtering out encodable characters, since it correctly preserves characters such as 亜, 圧 and 壱, which are unused in Chinese. GB18030 is technically a UTF, so a round-trip test cannot reveal which characters are missing from it; by definition, a UTF has no such characters.
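For anyone who wants to reproduce the list, here is a minimal sketch of the round-trip test in Python 3. The function names are made up for illustration, and it assumes a failed strict encode counts the same as mojibake; the original run may have used a different error mode.

```python
def survives_round_trip(ch: str, codec: str) -> bool:
    """Return True if ch encodes to codec and decodes back unchanged."""
    try:
        return ch.encode(codec).decode(codec) == ch
    except UnicodeError:
        # Assumption: an unencodable character is treated like mojibake.
        return False


def not_in_chinese_charsets(chars, codecs=("gb2312", "big5")):
    """Collect the characters that survive in none of the given charsets."""
    return [
        ch for ch in chars
        if not any(survives_round_trip(ch, codec) for codec in codecs)
    ]


if __name__ == "__main__":
    # GBK keeps 亜, which is why it is too inclusive for this filter.
    print(survives_round_trip("\u4e9c", "gbk"))
    # GB18030 covers all of Unicode, so nothing ever fails the test.
    print(survives_round_trip("\U00020b9f", "gb18030"))
    # 中 is in both charsets; 𠮟 is outside the BMP and in neither.
    print(not_in_chinese_charsets("中\U00020b9f"))
```

Feeding the full jōyō list into `not_in_chinese_charsets` should give a list along the lines of the one below, though the exact result depends on the codec tables your Python ships with.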

characters
両乗亀亜仏仮価倂値倹働僞児円処剣剤剰労効勅勧勲匂単卽厳収呉呪咲啓喩営団囲図圏圧塀塁塡塩増壊壌壱売変奨奬妬姉姫娯嬢実寛対専峠巣巻帯帰庁広廃弐弾従徳徴応恵悩悪愼懐戦戯戸戻払抜択拝拠拡挙挿捜掲揺摂撃擧敍斉斎晩暁暦曽枠査栃栄桜桟検楽様槪権歓歩歯歳歴殻毎気氷汚沢浄涙渇済渉渋渓満滝澁瀬焼爲犠猟獣瑠甁産畑畳疎痩発県眞砕硏碁稲穂穏窓竜竝粋粛粧経絵絶継続総緑緖縁縄縦繊聴脇脳臓舎舗艶荘菓蔵薫薬蛍衆衞裏襃覇覚覧観訳説読謡譲賛転軽辺込逓遅遡郞郷鄕酔醸釈鉄鉢鉱銭鋭鋳錬録鎭鑛関閲闘陥険隠隣隷雑霊頼顔顕飜餠駄駅駆騒験髄髪鬭鶏鷄麺黒黙齢𠮟欄廊朗虜殺類隆塚神祥福諸都侮僧免勉勤卑喝嘆器塀墨層悔慨憎懲敏既暑梅海漢煮碑社祉祈祖祝禍穀突節練繁署者臭著褐視謁謹賓贈逸難響頻

code points (as a Python string)
u'\u4e21\u4e57\u4e80\u4e9c\u4ecf\u4eee\u4fa1\u5002\u5024\u5039\u50cd\u50de\u5150\u5186\u51e6\u5263\u5264\u5270\u52b4\u52b9\u52c5\u52e7\u52f2\u5302\u5358\u537d\u53b3\u53ce\u5449\u546a\u54b2\u5553\u55a9\u55b6\u56e3\u56f2\u56f3\u570f\u5727\u5840\u5841\u5861\u5869\u5897\u58ca\u58cc\u58f1\u58f2\u5909\u5968\u596c\u59ac\u59c9\u59eb\u5a2f\u5b22\u5b9f\u5bdb\u5bfe\u5c02\u5ce0\u5de3\u5dfb\u5e2f\u5e30\u5e81\u5e83\u5ec3\u5f10\u5f3e\u5f93\u5fb3\u5fb4\u5fdc\u6075\u60a9\u60aa\u613c\u61d0\u6226\u622f\u6238\u623b\u6255\u629c\u629e\u62dd\u62e0\u62e1\u6319\u633f\u635c\u63b2\u63fa\u6442\u6483\u64e7\u654d\u6589\u658e\u6669\u6681\u66a6\u66fd\u67a0\u67fb\u6803\u6804\u685c\u685f\u691c\u697d\u69d8\u69ea\u6a29\u6b53\u6b69\u6b6f\u6b73\u6b74\u6bbb\u6bce\u6c17\u6c37\u6c5a\u6ca2\u6d44\u6d99\u6e07\u6e08\u6e09\u6e0b\u6e13\u6e80\u6edd\u6f81\u702c\u713c\u7232\u72a0\u731f\u7363\u7460\u7501\u7523\u7551\u7573\u758e\u75e9\u767a\u770c\u771e\u7815\u784f\u7881\u7a32\u7a42\u7a4f\u7a93\u7adc\u7add\u7c8b\u7c9b\u7ca7\u7d4c\u7d75\u7d76\u7d99\u7d9a\u7dcf\u7dd1\u7dd6\u7e01\u7e04\u7e26\u7e4a\u8074\u8107\u8133\u81d3\u820e\u8217\u8276\u8358\u83d3\u8535\u85ab\u85ac\u86cd\u8846\u885e\u88cf\u8943\u8987\u899a\u89a7\u89b3\u8a33\u8aac\u8aad\u8b21\u8b72\u8cdb\u8ee2\u8efd\u8fba\u8fbc\u9013\u9045\u9061\u90de\u90f7\u9115\u9154\u91b8\u91c8\u9244\u9262\u9271\u92ad\u92ed\u92f3\u932c\u9332\u93ad\u945b\u95a2\u95b2\u95d8\u9665\u967a\u96a0\u96a3\u96b7\u96d1\u970a\u983c\u9854\u9855\u98dc\u9920\u99c4\u99c5\u99c6\u9a12\u9a13\u9ac4\u9aea\u9b2d\u9d8f\u9dc4\u9eba\u9ed2\u9ed9\u9f62\U00020b9f\uf91d\uf928\uf929\uf936\uf970\uf9d0\uf9dc\ufa10\ufa19\ufa1a\ufa1b\ufa22\ufa26\ufa30\ufa31\ufa32\ufa33\ufa34\ufa35\ufa36\ufa37\ufa38\ufa39\ufa3a\ufa3b\ufa3d\ufa3e\ufa3f\ufa40\ufa41\ufa42\ufa43\ufa44\ufa45\ufa47\ufa48\ufa4b\ufa4c\ufa4d\ufa4e\ufa50\ufa51\ufa52\ufa54\ufa55\ufa56\ufa57\ufa59\ufa5a\ufa5b\ufa5c\ufa5f\ufa60\ufa61\ufa62\ufa63\ufa64\ufa65\ufa67\ufa68\ufa69\ufa6a'
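To show how the list could plug into detection, here is a rough sketch. I don't know how the client actually implements its check, so `contains_kana` and `looks_japanese` are hypothetical names, the kana test is simply assumed to cover the Hiragana and Katakana blocks, and the set holds only a short excerpt of the list above.

```python
# Excerpt only: a real version would be built from the full string above.
JAPANESE_ONLY_KANJI = frozenset("両乗亀亜仏仮価値働円")


def contains_kana(text: str) -> bool:
    """Assumed stand-in for the existing check (U+3040 to U+30FF)."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def looks_japanese(text: str) -> bool:
    """Treat text as Japanese if it has kana or a kanji from the list."""
    return contains_kana(text) or any(
        ch in JAPANESE_ONLY_KANJI for ch in text
    )


if __name__ == "__main__":
    for sample in ("ひらがな", "仮面", "中文"):
        print(sample, looks_japanese(sample))
```

With this, a kanji-only string such as 仮面 would be picked up as Japanese, while a string like 中文 that contains neither kana nor a listed kanji would fall through to whatever the client does today.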
Tanomoshii Nekojou
Want this! :D