This paper reports on experimental work applying the unsupervised learning algorithm known as Linguistica v2.0.4 (Goldsmith 2002) to a corpus of approximately 460,000 alphanumeric tokens of the Eastern Nilotic language Ateso. Linguistica divides the morphological discovery process into two parts: a set of heuristics that guide the segmentation process, and a Minimum Description Length model (Rissanen 1989, Goldsmith 2001) that evaluates the outcome. So far it has been tested mostly on non-agglutinating languages. The results of Linguistica are compared with a manual analysis of three samples of 100 words each, randomly chosen from the Ateso corpus. A quantitative evaluation of Linguistica in terms of recall and precision is supplemented by a qualitative evaluation and a summary of the difficulties encountered in running this experiment on an under-documented language.
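As an illustration of the kind of quantitative evaluation mentioned above, the following is a minimal sketch of how precision and recall might be computed for a morphological segmentation against a manual analysis. It assumes boundary-based scoring (each predicted morpheme boundary is checked against the gold-standard boundaries), which is one common convention and not necessarily the one used in this paper. The function names `boundaries` and `precision_recall`, the `+` separator, and the toy English words are all hypothetical; they are not Ateso data and not Linguistica output.

```python
def boundaries(segmented, sep="+"):
    """Return the set of character offsets where a morpheme boundary falls.

    Offsets are counted in the unsegmented word, so "walk+ed" yields {4}.
    """
    cuts, offset = set(), 0
    for morph in segmented.split(sep)[:-1]:
        offset += len(morph)
        cuts.add(offset)
    return cuts


def precision_recall(predicted, gold, sep="+"):
    """Boundary-level precision and recall over paired word lists.

    `predicted` and `gold` are equal-length lists of segmented strings
    for the same words (assumed format, e.g. "un+do+ing").
    """
    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        pb, gb = boundaries(p, sep), boundaries(g, sep)
        tp += len(pb & gb)   # boundaries found in both analyses
        fp += len(pb - gb)   # boundaries proposed only by the system
        fn += len(gb - pb)   # gold boundaries the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example (hypothetical data, for illustration only).
system_output = ["walk+ed", "un+do+ing", "cats"]
manual_analysis = ["walk+ed", "undo+ing", "cat+s"]
p, r = precision_recall(system_output, manual_analysis)
```

In this toy case two of the three predicted boundaries are correct and two of the three gold boundaries are found, so both precision and recall come out at 2/3. Scoring whole morphemes or whole words instead of boundaries would give different figures for the same analyses.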