Hm, the problem with syllabic scripts is that you never have enough syllables on your bar, you'd need something massive. It would probably be a lot of extra work but if you used only the Hangul components, then it might work. As in, Unicode has this massive Hangul block but it's actually composed of 24 letters (it actually is an alphabetic script but because the letters are grouped in blocks of 1 to 4 symbols it comes across as being syllabic).
So 을 (ŭl) for example is composed of ㅇ and ㅡ and ㄹ. But how you would convert them is the tricky part. If there is a reliable converter, maybe.
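On the converter question: the precomposed Hangul syllables in Unicode are laid out arithmetically, so splitting them does not need a lookup table of all syllables. Below is a minimal sketch of that arithmetic, not anything taken from Scrabble3D. The names `decompose` and `compose` are made up for illustration, and the sketch stops at compound letters such as ㄲ or ㅘ rather than breaking them down further into the 24 basic letters.

```python
# Compatibility jamo, listed in the order the Unicode syllable formula uses.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"    # 21 vowels
# 27 final consonants, preceded by "" for "no final consonant".
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

SYLLABLE_BASE = 0xAC00                                  # code point of 가
SYLLABLE_COUNT = len(LEADS) * len(VOWELS) * len(TAILS)  # 11172


def decompose(syllable):
    """Split one precomposed Hangul syllable into its letters."""
    index = ord(syllable) - SYLLABLE_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    parts = [LEADS[lead], VOWELS[vowel]]
    if TAILS[tail]:
        parts.append(TAILS[tail])
    return parts


def compose(lead, vowel, tail=""):
    """Inverse of decompose: build the syllable block from its letters."""
    index = (LEADS.index(lead) * len(VOWELS) + VOWELS.index(vowel)) * len(TAILS)
    return chr(SYLLABLE_BASE + index + TAILS.index(tail))


print(decompose("을"))            # the letters of the example above
print(compose("ㅇ", "ㅡ", "ㄹ"))  # and back to the block
```

A rack of letters plus a composer like `compose` would be one way to place syllable blocks on the board, assuming the game can be taught to treat a block as one tile.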
I think of all the Asian scripts the only one which is realistic would be Japanese if you did a Hiragana based Scrabble.
******************************* Do, or do not. There is no try.
Some kind of letter composer between the rack and the actual tile that is placed... sounds feasible. But no, we need more flexibility on the board. Tricky. Some day I'll go ahead with Asian languages.
So why don't we do Japanese as a proof of concept for Asian languages? It should be fairly easy. There are three scripts: one is like Chinese, the other two are syllabic (Hiragana and Katakana) but much more limited in scope than Korean. To begin with, we can dispense with Katakana; the main difference is purpose, i.e. foreign terms are usually written with Katakana, native terms with Hiragana. There are around 100 or so kana that you need, but I don't think it needs much in the way of modifying the game itself. So for example sushi is written as すし (su + shi), manga is まんが (ma + n + ga) (n is the only consonant-only symbol, before you ask).
The main trick would be to find a Hiragana wordlist.
Ah, found one here. If we only take Hiragana, we would need lines 6 to 18230, though some lines containing Kanji (i.e. the complex characters borrowed from Chinese) would still have to be thrown out, and that is beyond me technically. One could perhaps build a table of the desired Hiragana characters and then throw out every line containing characters that are not in the list, or perhaps do it via the Unicode ranges.
All lines consisting of only a single character would have to be thrown out as well.
But otherwise we would then have a workable concept for Scrabble3D in Japanese (Hiragana). These are more thought experiments, though; I believe we'd need at least one speaker of Japanese for this. (Sorry, I had lapsed into German for this bit.)
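The filtering idea (keep only words made of Hiragana, drop single-character lines) could be sketched roughly as follows. This is only an illustration under assumptions: the word list is taken to be one word per line, the Hiragana range is taken as ぁ to ん, and the function names are invented here. That range includes the small kana and leaves out ゔ and the long-vowel mark ー.

```python
from collections import Counter


def is_hiragana_word(word):
    """True for words longer than one character made only of ぁ..ん."""
    return len(word) > 1 and all("ぁ" <= ch <= "ん" for ch in word)


def filter_hiragana(lines):
    """Keep the lines that are pure-Hiragana words, stripped of whitespace."""
    words = (line.strip() for line in lines)
    return [word for word in words if is_hiragana_word(word)]


def letter_distribution(words):
    """Count how often each kana occurs across the whole list."""
    return Counter(ch for word in words for ch in word)


sample = ["すし\n", "まんが\n", "虫歯\n", "あ\n", "カタカナ\n", "むしば\n"]
kept = filter_hiragana(sample)
print(kept)
print(letter_distribution(kept).most_common(3))
```

For a real file the same functions can be fed an open file object, e.g. `filter_hiragana(open(path, encoding="utf-8"))`, where `path` is whatever the word list is called.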
It would appear, however, that Hiragana is the tentative script of choice for Japanese Scrabble. I did some searching and found this image of someone who did a paper version, and Scrabble-ish games also use Hiragana. Maybe one could do a beta, put it out there and wait for a response?
I collected all words between the chars ぁ and ん with more than one char. It results in a list with 16277 words. The letter distribution, i.e. how many times a particular char is found in the list, is as follows:
It is from 1998, in EUC-JP coding. I opened it as Shift-JIS which seemed to work. /// EDICT 26JUN98 V98-002, Main Japanese-English Electronic Dictionary File, Copyright J.W. Breen - 1998
It has hiragana index with 53158 entries: katakana-english 13567 hiragana-(kanji)-en 1493+51665=53158
->for each hiragana entry there are kanji and english meaning.
EDICT can be freely used provided satisfactory acknowledgement is made in any software product, server, etc. that uses it. There are a few other conditions relating to distributing copies of EDICT with or without modification. Copyright is vested in the EDRG (Electronic Dictionary Research Group). You can see the specific licence statement at the Group's site. ...
If you drop the rarest hiragana from the letterset, for example all with less than 150 occurrences in the dictionary, the game will be more playable. The hiragana.zip with 16277 words, if all kanji entries are dropped from the original file, contains proportionally more particles, inflections and suffixes.
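Dropping the rare kana could look something like the sketch below. It is a single pass under stated assumptions: the threshold of 150 is the one suggested above, the function name is invented, and any word that uses a dropped kana is removed from the list as well. Removing words lowers the counts of the remaining kana, so a second pass might drop more; the sketch deliberately does not iterate.

```python
from collections import Counter


def prune_rare_kana(words, min_count=150):
    """Return (letterset, playable_words) after dropping rare kana.

    A kana stays in the letterset if it occurs at least min_count times
    in the list; a word stays playable if all its kana are in the letterset.
    """
    counts = Counter(ch for word in words for ch in word)
    letterset = {ch for ch, n in counts.items() if n >= min_count}
    playable = [w for w in words if all(ch in letterset for ch in w)]
    return sorted(letterset), playable


# Tiny demonstration with a threshold of 2 instead of 150.
letters, playable = prune_rare_kana(["すし", "しす", "すぬ"], min_count=2)
print(letters, playable)
```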
I was thinking along the lines of actually invoking a pre-war convention and switching all Hiragana to Katakana, because apart from increasing the list by using Katakana, they also tend to be longer. But the wordlist you listed seems a lot bigger. It might be an option to simply use 3 kana as a minimum length, though I'm not sure if that would fix the stair issue.
xyz, thanks for that. not sure about the copyright issues on that one - I think while we're trying to work out if it's feasible or not, we're better off sticking to the open source file.
Right, with the help of our code mage, I have done a merged list - all in Katakana. Here's in detail what I did:
* Converted all Hiragana only words to Katakana
* Converted mixed Hiragana/Kanji words to Katakana
* Converted mixed Katakana/Kanji words to Katakana
* Converted all Kanji to Katakana
There's a bit of cleaning up to do. I used an automatic converter (with some spot checking) so overall I'm fairly confident the quality is good.
- there are strings which contain a space, these need to be chucked out
- there are strings which contain items outside the Katakana range (some Hiragana, some Kanji, I think these are rare combinations the converter was not familiar with)
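The two clean-up rules can be folded into a single check, since a space is itself outside the Katakana range. A rough sketch, assuming one entry per list item and assuming that "Katakana range" means ァ to ヺ plus the long-vowel mark ー (the function names are made up here):

```python
def is_katakana(ch):
    """True for Katakana letters (ァ..ヺ) and the long-vowel mark ー."""
    return "ァ" <= ch <= "ヺ" or ch == "ー"


def split_clean(entries):
    """Separate pure-Katakana entries from the ones needing attention."""
    clean, rejected = [], []
    for entry in entries:
        if entry and all(is_katakana(ch) for ch in entry):
            clean.append(entry)
        else:
            rejected.append(entry)
    return clean, rejected


clean, rejected = split_clean(["スシ", "ラーメン", "ス シ", "スしシ", "虫歯"])
print(clean)
print(rejected)
```

Keeping the rejected entries in their own list rather than discarding them makes it easier to see which combinations the converter tripped over.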
At the moment the list is just short of 120k, I think once we've thrown out the messy ones, we should be on 100k or so, which seems reasonable.
If this gives us a reasonably balanced game, we could run it as a beta and see if we can attract some Japanese people to clean up the list etc etc.
Admin: Attachment deleted
Hiragana should be used, even though it's doubtful whether Japanese adults would play even that. Katakana is used only for foreign loan words and for emphasis. Kids learn Japanese with hiragana.
If it is possible to use the EDICT file and have/make a Hiragana->Kanji->English index for it, Kanji and English could then be shown in the tooltip for the word. That would be great even for learning Japanese!
The free use of EDICT seems clear to me from http://www.edrdg.org/ "EDICT can be freely used provided satisfactory acknowledgement is made in any software product, server, etc. that uses it."
I sent email to Jim Breen about the possible use.
There is a physical Scrabble in Japanese, it's in romaji. I'll try to find the photo.
That's the way Kana are used today but it wasn't always so. Prior to WW2, much of what is written today in Hiragana was written in Katakana only. Also, Hiragana words are sometimes written in Katakana for emphasis so it's not totally unheard of even today.
There are also practical reasons related to the use of the choonpu i.e. you can easily convert a Hiragana list to Katakana but not the other way round. Given the issues we have with stairs etc in the game, I think that is (for now) the key factor, having an unplayable game is not much use to anyone.
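To illustrate the asymmetry: in Unicode the Hiragana letters ぁ..ゖ sit exactly 0x60 code points below their Katakana counterparts ァ..ヶ, so Hiragana to Katakana is a fixed shift. The sketch below (names invented for illustration) also shows what a naive reverse shift does with the chōonpu: ー has no counterpart in the shifted range and is left as it is, because in Hiragana a long vowel is normally spelled out with a vowel kana that depends on the word.

```python
# Hiragana ぁ (U+3041) .. ゖ (U+3096) -> Katakana ァ (U+30A1) .. ヶ (U+30F6)
HIRAGANA_TO_KATAKANA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}
KATAKANA_TO_HIRAGANA = {kata: hira for hira, kata in HIRAGANA_TO_KATAKANA.items()}


def to_katakana(text):
    """Shift every Hiragana letter to Katakana; other characters pass through."""
    return text.translate(HIRAGANA_TO_KATAKANA)


def to_hiragana_naive(text):
    """Reverse shift only; the long-vowel mark ー is NOT resolved."""
    return text.translate(KATAKANA_TO_HIRAGANA)


print(to_katakana("すし"), to_katakana("まんが"))
print(to_hiragana_naive("ラーメン"))  # ー survives unchanged
```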
... Let me explain briefly what my dictionary files contain. I am sure you can reformat the contents to make an index file of the type you seek.
I'll illustrate this using a Japanese word for tooth cavities. The word is usually pronounced mushiba, and more rarely kushi or ushi. It's commonly written 虫歯, but is also written 齲歯 or 齲. (Yes, complicated, but Japanese is like that.)
My main dictionary distribution format is the XML version (JMdict). In this format the entry is:

<entry>
  <ent_seq>1604850</ent_seq>
  <k_ele>
    <keb>虫歯</keb>
    <ke_pri>ichi1</ke_pri>
    <ke_pri>news1</ke_pri>
    <ke_pri>nf17</ke_pri>
  </k_ele>
  <k_ele>
    <keb>齲歯</keb>
  </k_ele>
  <k_ele>
    <keb>齲</keb>
  </k_ele>
  <r_ele>
    <reb>むしば</reb>
    <re_pri>ichi1</re_pri>
    <re_pri>news1</re_pri>
    <re_pri>nf17</re_pri>
  </r_ele>
  <r_ele>
    <reb>うし</reb>
    <re_restr>齲歯</re_restr>
  </r_ele>
  <r_ele>
    <reb>くし</reb>
    <re_restr>齲歯</re_restr>
  </r_ele>
  <info>
    <audit>
      <upd_date>2012-09-05</upd_date>
      <upd_detl>Entry created</upd_detl>
    </audit>
    <audit>
      <upd_date>2012-09-05</upd_date>
      <upd_detl>Entry amended</upd_detl>
    </audit>
    <audit>
      <upd_date>2012-09-05</upd_date>
      <upd_detl>Entry amended</upd_detl>
    </audit>
  </info>
  <sense>
    <pos>&n;</pos>
    <pos>&adj-no;</pos>
    <gloss>cavity</gloss>
    <gloss>tooth decay</gloss>
    <gloss>decayed tooth</gloss>
    <gloss>caries</gloss>
  </sense>
</entry>
That's quite complex, but it can be parsed, etc.
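As a rough idea of how such an entry could be turned into a reading -> Kanji -> English index for tooltips, here is a sketch using Python's standard `xml.etree.ElementTree`. Several things in it are assumptions: the sample is a trimmed copy of the entry above wrapped in a `JMdict` root, the two entity declarations are shortened stand-ins for the ones the real file declares in its DOCTYPE, and `build_index` is a made-up name. It honours `re_restr` (a reading that only applies to some spellings) but ignores any sense-level restrictions.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; entities like &n; must be declared or parsing fails.
SAMPLE = """<!DOCTYPE JMdict [
<!ENTITY n "noun">
<!ENTITY adj-no "noun taking the particle no">
]>
<JMdict>
<entry>
  <ent_seq>1604850</ent_seq>
  <k_ele><keb>虫歯</keb></k_ele>
  <k_ele><keb>齲歯</keb></k_ele>
  <k_ele><keb>齲</keb></k_ele>
  <r_ele><reb>むしば</reb></r_ele>
  <r_ele><reb>うし</reb><re_restr>齲歯</re_restr></r_ele>
  <r_ele><reb>くし</reb><re_restr>齲歯</re_restr></r_ele>
  <sense>
    <pos>&n;</pos>
    <pos>&adj-no;</pos>
    <gloss>cavity</gloss>
    <gloss>tooth decay</gloss>
  </sense>
</entry>
</JMdict>
"""


def build_index(xml_text):
    """Map each kana reading to its Kanji spellings and English glosses."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("entry"):
        all_kanji = [keb.text for keb in entry.iter("keb")]
        glosses = [gloss.text for gloss in entry.iter("gloss")]
        for r_ele in entry.iter("r_ele"):
            reading = r_ele.findtext("reb")
            # re_restr limits a reading to the listed spellings only.
            restricted = [r.text for r in r_ele.findall("re_restr")]
            slot = index.setdefault(reading, {"kanji": [], "gloss": []})
            slot["kanji"].extend(restricted or all_kanji)
            slot["gloss"].extend(glosses)
    return index


index = build_index(SAMPLE)
for reading, info in index.items():
    print(reading, info["kanji"], info["gloss"])
```

Homophones from different entries end up under the same reading, which is probably what a tooltip wants. For the full dictionary file a streaming parse would be kinder on memory than loading everything at once.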
There are two simpler formats. One is the EDICT2 one:
The JMdict version is in UTF-8. The other two are in EUC-JP. You can convert them to UTF-8, e.g. iconv -f EUC-JP -t UTF-8 EDICT2 > EDICT2_UTF8 (on a Unix/Linux system).
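For anyone without iconv at hand, roughly the same conversion can be done with Python's built-in `euc_jp` codec. A small sketch, with an invented function name and placeholder file names:

```python
def convert_to_utf8(src_path, dst_path, src_encoding="euc_jp"):
    """Re-encode a text file to UTF-8, line by line, keeping line endings."""
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    import sys
    # Usage: python convert.py EDICT2 EDICT2_UTF8
    if len(sys.argv) == 3:
        convert_to_utf8(sys.argv[1], sys.argv[2])
```

A file that is not really EUC-JP will usually make this stop with a `UnicodeDecodeError`, which is a handy check on the encoding guess.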