As we (that means Linhart) are working on a Latin dictionary right now, we know that spell-checker word lists contain proper names and other words that are not valid in Scrabble. They often contain grammatically wrong forms as well, for example passive forms of intransitive verbs. If such a list is used only for spell checking, this is not really a problem, but in Scrabble games, and especially in Scrabble3D, it would be a big problem if the computer always placed a lot of wrong, invalid words when you play against it.
Of course, it is a good start to begin with a GNU spell-checker list, but if we want a dictionary of good quality, that spell-checker list has to be thoroughly revised and adapted for Scrabble3D purposes, which might take years of very accurate work. Gero also worked for several years on our deutsch.dic, because in special cases the grammar is not as easy as it seems. And I know how much work Linhart is putting into our future latin.dic right now. So we really know what we are talking about...
Of course, a Basque spell checker list is better than no list at all...
But, jmontane, you wrote: "a word list based on LibreOffice spellchecker (XUXEN dictionary)". So what do you mean exactly by saying "based"? Have you already revised/adapted that list for Scrabble purposes?
Quote from Bussinchen in post #6: But, jmontane, you wrote: "a word list based on LibreOffice spellchecker (XUXEN dictionary)". So what do you mean exactly by saying "based"? Have you already revised/adapted that list for Scrabble purposes?
I know the differences between a spell-checker word list and a Scrabble word list. :)
LibreOffice/OpenOffice/hunspell/myspell dictionaries use two files, eu_ES.dic and eu_ES.aff (for the Basque language). The first one (.dic) is a list of entries, each followed by the affix classes that apply to that entry. The second one (.aff) contains the affixes (prefixes and suffixes to apply to the entries).
So, some (easy) steps to adapt a Basque spell-checker dictionary to Scrabble purposes are:
1st: remove entries from eu_ES.dic
A. Words starting with an uppercase letter (proper nouns, trademarks, ...)
B. Words with hyphens (compound words)
C. Words with characters not present in the Scrabble tile distribution (ñ)
D. Words ending with a dot (abbreviations)
E. Symbols: m kg cm mm ...
2nd: remove affixes that generate undesired word forms.
3rd: generate all inflected forms, using the unmunch command from hunspell.
That's the general idea.
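The filtering in the 1st step could be sketched in Python like this. Note that the sample entries and the tile alphabet below are assumptions for illustration only; the real eu_ES.dic contents and the official Basque tile set may differ:

```python
def keep_entry(line, alphabet):
    """Return True if a .dic entry passes filters A-E above.

    Each .dic line looks like 'word/flags'; the flags part is optional.
    """
    word = line.split("/", 1)[0].strip()
    if not word:
        return False
    if word[0].isupper():       # A: proper nouns, trademarks, ...
        return False
    if "-" in word:             # B: hyphenated compound words
        return False
    if word.endswith("."):      # D: abbreviations
        return False
    # C + E: keep only words whose letters all exist as Scrabble tiles
    return all(ch in alphabet for ch in word)

# Hypothetical tile alphabet with no 'ñ'; replace with the real tile set.
BASQUE_TILES = set("abcdefghijklmnopqrstuvwxyz")

entries = ["aarondar/60", "Bilbo/12", "kale-garbitzaile", "cm.", "andereño/7"]
kept = [e for e in entries if keep_entry(e, BASQUE_TILES)]
# Only 'aarondar/60' survives; the others fall under rules A, B, D and C.
```

Rule E (bare symbols like "m" or "kg" without a trailing dot) still needs a manual pass, as jmontane notes below, since those look like ordinary short words to a filter like this.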
Some remarks:
The 1st step (A, B, C, D) is easy to do. 1-E is a little harder, but those are usually words of 2 to 3 characters. About the 2nd step: I know nothing about Basque morphology, so any advice? About the 3rd step: Basque is a highly inflected language. My first attempt to generate all inflected words failed :(. I have found an alternative to unmunch that is more robust for very large word lists.
Linhart has worked on such lists before, for example when he created our persian.dic. Since he already has that experience, I think he can give us good advice here.
Akerbeltzalba knows Basque (!), he speaks Basque (!) --> http://en.wikipedia.org/wiki/User:Akerbeltz. So I think that he is very competent to give us some advice here as well!
Bussinchen writes that I have experience with spell-checker lists. This is true, but unfortunately I will not be able to give you any advice on how to handle such lists with the hunspell program, since I wrote my own program for this purpose, and it is only suited for the Latin word list.
The Persian word list has a different structure: it does not consist of two files (.aff and .dic), but only of one.
It depends very much on the language, I'd say. Gaelic and Irish were relatively easy to build on the back of the spell-checker files. I haven't seen the Xuxen files; perhaps jmontane can post a sample of the words he has generated? On the whole I'm slightly concerned about autogenerating anything in Basque because it is a polysynthetic language, but on the other hand, most software the Basques have produced is good, so it could work. I know that the Basque Regional Government had a hand in the development of Xuxen anyway, so a guarded "maybe" at this stage :)
Quote from Scotty: Unbelievable! But I'd be more interested in his experiences with Chinese
As far as Scrabble goes, no chance. You would need a "letterset" that's about 6,000 "units" big. Not really feasible I'm afraid.
It's kinda as I expected. It's a dictionary file and a massive affix file. But it looks very well done so if someone could come up with some fancy code that generates all the words that the .aff file is supposed to generate, then we'd have a dictionary we can use.
What I mean by that is this: if you look in the .dic file, you'll see (for example)
1
aarondar/60
This means that the word aarondar can take all affixes from the .aff file which begin with SFX 60, for example
So this creates aarondarra, aarondarrago, aarondarragoak and so on. What I haven't quite figured out is what the /243 is for. I'll have to ask, but I think it's there to prevent recursion, i.e. it tells you what you may not stick onto the word once you have generated it, so aarondarrago + rago is illegal.
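For illustration, here is a minimal Python sketch of how such SFX rules expand a stem. The (strip, add) pairs below are invented so that they reproduce the forms mentioned above; the real class 60 in eu_ES.aff is much larger, and condition patterns and continuation flags like /243 are ignored here:

```python
def apply_sfx(word, rules):
    """Apply hunspell-style SFX rules given as (strip, add) pairs.

    A real implementation must also check each rule's condition
    pattern; this sketch assumes the condition always matches.
    """
    forms = []
    for strip, add in rules:
        if strip == "0":                      # '0' means: strip nothing
            forms.append(word + add)
        elif word.endswith(strip):            # strip the ending, then add
            forms.append(word[:-len(strip)] + add)
    return forms

# Invented subset of suffix class 60: (strip, add) pairs.
SFX_60 = [("0", "ra"), ("0", "rago"), ("0", "ragoak")]

forms = apply_sfx("aarondar", SFX_60)
# → ['aarondarra', 'aarondarrago', 'aarondarragoak']
```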
If someone is capable of doing this sort of code, then yes, we could use the Xuxen dictionary.
I've got good news. A month ago, a Scrabble-like game for smartphones and Facebook, called Apalabrados (Angry Words), added the Basque language :)))
The letter distribution is pretty similar. I think they changed it to avoid a Mattel lawsuit.
About the word list, I've contacted the player who provided it. I hope she will give us the word list under an open-source license.
So, I think we can wait a few weeks.
About the /243 feature... In the hunspell dictionary format, it's possible to compound affixes. In your example, the suffix class numbered 60 is applied first, and then the suffix class numbered 243 is applied to the word generated by the first one.
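That two-stage application could be sketched like this in Python. The contents of both suffix classes are invented stand-ins for Xuxen's real classes 60 and 243; only the ordering mirrors hunspell's behaviour:

```python
def expand(word, first_class, second_class):
    """Two-level hunspell-style expansion: apply the first suffix
    class to the stem, then apply the second class to each form
    the first stage generated. Stems and stage-1 forms are both
    kept, since they are valid words themselves."""
    stage1 = [word + s for s in first_class]
    stage2 = [w + s for w in stage1 for s in second_class]
    return stage1 + stage2

# Invented stand-ins for Xuxen's suffix classes 60 and 243.
CLASS_60 = ["ra", "rago"]
CLASS_243 = ["k"]

all_forms = expand("aarondar", CLASS_60, CLASS_243)
# ['aarondarra', 'aarondarrago', 'aarondarrak', 'aarondarragok']
```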