Some years ago I thought of the following toy comp ling (Computational Linguistics) problem. I thought of it before I had ever even been aware that there was such a thing as Computational Linguistics, and I trust the folks that actually do comp ling will forgive me if I make it appear that their problems are as trivial as the one I will now pose.
I wondered if there was some way to ‘discover’ the vowels. I mean, if we were Martians looking at a bunch of English texts, wondering what those Earthlings could have meant by those symbols, would we ever discover that special set of characters that are vowels? I know that this is a nearly impossibly vague question and that is part of the reason that I did nothing with it for years.
My vague notion was that the vowels, because of their special place in the language representing voiced components, must somehow leave some kind of statistical footprint which could be noticed by our Martian linguist.
One of our characterizations of the vowels is that ‘every word needs a vowel’. The vowels are an inescapable set of characters. It occurred to me that our comp ling Martian might well ask the question, “Is there a small inescapable set of letters, where every word must have one from that set?” (Yes, I am assuming that our Martians have already decided that space is a separator and that the chunks it separates are called words. While we are at it, let’s assume that they’ve also figured out punctuation and upper and lower case, and that the dot over the j character is actually part of the same character and not a period. I mean, after all, this is just a toy problem. Let’s not worry about the painful details that actually confront our Martian computational linguist.)
Now this is a problem that one could write some code to solve. That is of course the real reason that an old coder like me poses problems like this, just another excuse to think about how to write some more code.
Suppose I start with a block of text which I reduce to a list of words. By hypothesis, each word has a vowel, so I could choose one letter at random out of the first word, guess that it is a vowel, and throw it into the inescapable set that I am building. For each new word, I look and see if it is already explained by the inescapable set so far. Each time I find a word that has no characters from the inescapable set, I know that my set is incomplete, so I add another letter selected at random from that word.
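Here is a minimal sketch of that single pass in Python. It is not the code I originally wrote; the function name one_pass and the tiny sample sentence are made up purely for illustration.

```python
import random

def one_pass(words, rng=random):
    """Build one inescapable set in a single pass over the word list."""
    chosen = set()
    for word in words:
        # A word is "explained" if it already contains a chosen letter.
        if not chosen.intersection(word):
            # Otherwise guess: pick one of its letters at random.
            chosen.add(rng.choice(word))
    return chosen

# Hypothetical sample text, already lowercased and stripped of punctuation.
words = "the cat sat on a mat by my dog".split()
result = one_pass(words)

# Whatever letters were picked, every word now contains at least one of them.
print(sorted(result), all(set(w) & result for w in words))
```

The set printed will differ from run to run, but the second value should always be True, since a letter is added whenever a word is not yet explained.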
By the time I have worked through the word list, I will have constructed an inescapable set. Every single word in my text will have been explained by that set. Unfortunately, there is absolutely nothing that prevented me from making mistakes and including characters that did not need to be in the set. The inescapable set that I have just created is not necessarily minimal. (And of course, as our Martian well knows, there is no guarantee that any single unique minimal inescapable set even exists!)
What can we do now? Well, we could just do it again. And again and again, and each time we run it we could keep track of the best inescapable set that we have seen so far. The first pass might have 15 characters, the next only 10, the next 12. Eventually, by sheer luck alone, on some single pass we might just happen to choose the right characters, and we will have tripped over the actual vowels.
I wrote the code to do this. If I made just a couple of passes, it usually did not find the vowels. If I ran it for a hundred passes it did much better and nearly always found the vowels. If I ran it a thousand times, it always found the vowels.
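The repeat-and-keep-the-best loop might look something like the sketch below. Again, this is a reconstruction under my own assumptions: the names best_of and passes are hypothetical, the word order is left unchanged between passes, and the seed is there only to make runs repeatable.

```python
import random

def one_pass(words, rng):
    """One randomized pass: returns some inescapable set for the words."""
    chosen = set()
    for word in words:
        if not chosen.intersection(word):
            chosen.add(rng.choice(word))
    return chosen

def best_of(words, passes, seed=0):
    """Run several passes and keep the smallest set seen so far."""
    rng = random.Random(seed)
    best = None
    for _ in range(passes):
        candidate = one_pass(words, rng)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best

# Hypothetical sample text; a real run would use a much larger word list.
words = "the cat sat on a mat by my dog up in an odd zoo".split()
for passes in (1, 10, 1000):
    print(passes, sorted(best_of(words, passes)))
```

More passes can never make the best set larger, but nothing in the loop itself tells you when the best set is truly minimal.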
I thought that this was pretty cool. I had code that ‘discovered’ an inescapable set of characters in English words. I thought that I was done. But the best was yet to come.
I started thinking about how I could tell when to stop running passes. Of course, when you are doing a Monte Carlo app with randomization you might never win, but the more you can slant the odds in your favor, the better things will be. I decided that I needed to know the odds that on any given pass I might actually generate a true minimal set. Since I choose a character at random from my word and declare it to be a vowel, my odds of being correct are greatly enhanced if I start with short words. After all, with a one-letter word like ‘I’ or ‘A’ you are assured that you made the correct choice. With a two-letter word, you have a 50% chance of doing the right thing. If I stack the deck and start with short two-letter words, I can know that making 5 correct choices in a row has only a 1 in 32 (2 to the fifth) chance of happening. Without stacking the deck, my odds are even lower. This is why a hundred runs were not always enough but 1000 seemed to work.
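To put rough numbers on that, here is a small sketch of the arithmetic. It takes the 1 in 32 figure above as the best-case chance that a single pass is perfect, and it assumes the passes are independent; the function name chance_of_success is made up for this example.

```python
def chance_of_success(p_single_pass, passes):
    """Probability that at least one of `passes` independent passes is perfect."""
    return 1 - (1 - p_single_pass) ** passes

# Best case from the text: five 50/50 guesses in a row must all be right.
p = (1 / 2) ** 5

for n in (2, 100, 1000):
    print(n, round(chance_of_success(p, n), 3))
```

Since the real per-pass odds are lower than 1 in 32 when the deck is not stacked, the true chance at a hundred passes is lower than this best-case figure.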
So I started to rework my code to first sort my text to put the short words first and then the light bulb went off. I don’t need the computer at all to do this comp ling problem. Look at this:
Consider this list of words: be, fee, he, lee, me, see, tee, we
Every word on the list is explained by the single letter ‘e’. BUT if I DON’T choose ‘e’ for my inescapable set, I am forced to include all of ‘BFHLMSTW’. That is way more than the smallest set that my program produced. This is not proof that ‘e’ is a required character in our inescapable set, but it is strong evidence. Any set that does NOT have ‘e’ will have at least 8 letters.
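The same check is easy to express in a few lines of Python, even though the point is that you do not need the computer for it. This is only a sketch; forced_without is a name I made up, and it only looks at words that have exactly one other distinct letter.

```python
def forced_without(words, letter):
    """Letters that every inescapable set must contain if it omits `letter`."""
    forced = set()
    for word in words:
        rest = set(word) - {letter}
        # If only one other distinct letter remains, that letter is forced.
        # (Words with several other letters, or none, are ignored here.)
        if len(rest) == 1:
            forced |= rest
    return forced

e_words = "be fee he lee me see tee we".split()
print("".join(sorted(forced_without(e_words, "e"))).upper())  # BFHLMSTW
```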
I quickly produced the following short word lists that provide strong evidence for the other vowels.
A and I are explained by the single letter words. They must be in any inescapable set.
E is explained by the list above, without it we must include ‘BFHLMSTW’
O is explained by odd, of, go, loo, mom, no, or, so, to, zoo – ‘DFGLMNRSTZ’
U is explained by dud, lull, mum, nun, up, us – ‘DLMNPS’
Y is explained by by, cry, fly, my – absence of Y requires at least 4 other letters ‘BM(CR)(FL)’
So if you start with a list of ‘AEIOUY’, ripping out any single one of those letters costs you at least 4 additional letters. This tells us that if ‘AEIOUY’ actually covers all the words (no proof of that, you must run through all the words once and see), then it is actually minimal, and in fact both unique and minimal by quite a bit, because it costs so much to miss one of those letters.
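For anyone who does want the computer to confirm the bookkeeping, here is a sketch that checks both halves of the argument against the short word lists above. The names covers and min_size_without are hypothetical. The bound counts words whose leftover letters do not overlap, since each such word needs its own letter; it is a lower bound only, not an exact minimum.

```python
def covers(words, letters):
    """True if every word contains at least one of the letters."""
    return all(set(w) & set(letters) for w in words)

def min_size_without(words, letter):
    """Lower bound on the size of any inescapable set that omits `letter`.

    Returns None if some word consists only of `letter`, in which case
    no inescapable set can omit it at all.
    """
    remainders = sorted((set(w) - {letter} for w in words), key=len)
    if remainders and not remainders[0]:
        return None
    used, count = set(), 0
    for rest in remainders:
        # Words with non-overlapping leftovers each need a separate letter.
        if not (rest & used):
            used |= rest
            count += 1
    return count

# The short word lists from the text, lowercased.
evidence = {
    "a": ["a"],
    "i": ["i"],
    "e": "be fee he lee me see tee we".split(),
    "o": "odd of go loo mom no or so to zoo".split(),
    "u": "dud lull mum nun up us".split(),
    "y": "by cry fly my".split(),
}

all_words = [w for ws in evidence.values() for w in ws]
print(covers(all_words, "aeiouy"))
for letter, ws in evidence.items():
    print(letter, min_size_without(ws, letter))
```

On these lists the bounds come out as 8 for E, 10 for O, 6 for U and 4 for Y, while A and I return None because the one-letter words leave no alternative. Checking a real text still means running covers over its full word list.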
I thought that this was a nifty result. Comp Ling without any Ling and without any actual Comp needed once you think about it in the right way. I found it especially amusing that the role that the computer played in solving this problem was that it forced me to think about how to create an efficient algorithm and in fact, the solution ended up being so efficient that there was no need to use the computer at all.
So there you go. English does have vowels and they can be found! You can rest easy tonight.