Saturday, April 12, 2014

Finding a minimal set of English vocabulary with WordNet, Perl, and GLPK integer programming

This was my semester project for the class "Principles of Optimization", which was one of my absolute favorite classes that I took in college. It uses WordNet, a fantastic database for looking at relationships between English words (it seems to be the basis of many online tools, such as the constantly advertised Visual Thesaurus; it was fun to see how the graphs I made for this project exactly matched the content of Visual Thesaurus, a website that seems to be just begging for an open-source ripoff...). To access WordNet, I used the Perl library WordNet::QueryData, which seemed to work quite well. For the integer programming, I had a much harder time: I used the library Math::GLPK, which has bugs that confused me for quite a while (I wish I remembered what they were, but I don't; all I remember is a general feeling of frustration with the library). If I were to redo this project, I'd take a different approach, maybe writing LP text files and solving them with the GLPK command-line interface, or maybe using Python to interface with an IP solver. I used Cytoscape to generate the network graphs.
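For what it's worth, the LP-text-file route is pretty painless: glpsol reads the CPLEX LP format directly. A sketch of the kind of set-covering problem used in this project, on a made-up four-word, three-synset example (all the names here are invented for illustration):

```
\ One binary variable per word, one covering constraint per synset.
Minimize
 obj: x_big + x_large + x_major + x_adult
Subject To
 syn_large:     x_big + x_large >= 1
 syn_important: x_big + x_major >= 1
 syn_adult:     x_adult >= 1
Binary
 x_big
 x_large
 x_major
 x_adult
End
```

Saved as model.lp, this can be solved with `glpsol --lp model.lp --output solution.txt`.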

For those just here for example code for WordNet::QueryData or Math::GLPK,
you can find it in the Perl file that you can download here. All others, read on!

Finding a minimal set of English vocabulary

English is a language with a huge number of words. WordNet is a database of words, word senses, and other information about English lexicography and syntax. WordNet contains definitions for 147,306 words, with each word having one or more senses. Senses are grouped into 'synsets': a synset is a group of senses that share a definition. Because a word can have more than one sense, it can be a member of multiple synsets.

The set of all synsets comprises the lexical expressivity of the language. If all of the words contributing to a synset were to be removed from English, either a new word would have to be added, or the same meaning would have to be conveyed by a string of words, or by some non-lexical device (such as word order, word stress, or wild gesticulations). In this project, I develop a technique for using integer programming to find the minimal set of words from a vocabulary necessary for every synset to be represented by at least one word. I term such a subset a "synsetically complete lexical subset", or "SCLeSub". A language's SCLeSub contains a word for every idea that the whole language has a word for, but it has fewer total words; thus the SCLeSub has a higher ideas-per-word ratio than the parent vocabulary.

Goals:

To use integer programming to explore the properties of the English lexical space as modeled by the WordNet database, and to answer the following questions:

1. How many words can be eliminated from English without losing any senses (i.e. synsets)?
2. Does any part of speech contain more redundant vocabulary than other parts of speech?
3. Is the SCLeSub of English unique, or are there multiple alternative SCLeSubs?
4. Does translating ordinary prose into words found only in the SCLeSub affect the readability of the text?

The questions posed in the Goals section are certainly interesting from a linguistic perspective, and the answers will tell us something about the structure of English vocabulary. They may also be useful from a practical perspective. For example, a writer who restricts himself to only words in a SCLeSub will produce works with a greater number of puns and multiple interpretations than a writer not following a SCLeSub. Adherence to a SCLeSub is therefore an easy, even mechanical, way to make an author's writing seem more nuanced. Furthermore, starting from the output generated in this study, software could easily be developed to partially automate the process of translating existing English-language texts into language adhering to a SCLeSub.

The Integer Program:

To answer the questions listed in the Goals section, I wrote a Perl script to query the WordNet database, generate the IP matrix and constraints, call GLPK to solve the problem, and then format and print data about the solution. To query WordNet, I used the library "WordNet::QueryData"; to call GLPK from Perl, I used the library "Math::GLPK". (Math::GLPK is a little bit buggy and has some undocumented behavior, so I don't really recommend it; even so, I think that it's the best option for calling GLPK from Perl.)

For the first two questions, the data can be represented by a bipartite graph with the two node types being words and synsets, and edges only connecting words and synsets. An edge is drawn between a word and a synset if that word can mean the idea represented by the synset (fig. 1). In terms of this representation of the data, a SCLeSub can be found by removing as many word nodes as possible such that each synset node is still attached to at least one word node (fig. 2). The problem can be represented as an IP of the following form:
There is one decision variable per word.
The decision variables are Boolean: a value of 1 means that the word is kept in the network, and a value of 0 means that the word is thrown out of the network.
The objective function is to minimize the sum of all of the decision variables.
There is one constraint for each synset.
Each constraint says that the sum of the decision variables representing the words that can mean the idea represented by the synset must be at least 1.
Or, more formally:

Program 1:

    minimize    sum over n of x_n
    subject to  sum over n of r_jn * x_n >= 1    for each synset j
                x_n in {0, 1}                    for each word n

Where r_jn is 1 if word n can take the meaning of synset j, and 0 otherwise.
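On a toy example, Program 1 can even be brute-forced without a solver. Here's a short Python sketch (the word/synset data is invented for illustration; the real problem is far too large for exhaustive search and needs an IP solver like GLPK):

```python
from itertools import combinations

# Toy word -> synset data, invented for illustration (not from WordNet).
word_synsets = {
    "big":    {"large", "important"},
    "large":  {"large"},
    "major":  {"important"},
    "grown":  {"adult"},
    "adult":  {"adult", "mature"},
    "mature": {"mature"},
}
synsets = set().union(*word_synsets.values())

def find_sclesub(word_synsets, synsets):
    """Program 1 by exhaustive search: the smallest set of words such
    that every synset is still represented by at least one word."""
    words = sorted(word_synsets)
    for k in range(1, len(words) + 1):
        for subset in combinations(words, k):
            covered = set().union(*(word_synsets[w] for w in subset))
            if covered == synsets:
                return set(subset)

sclesub = find_sclesub(word_synsets, synsets)
print(len(sclesub), sorted(sclesub))  # -> 2 ['adult', 'big']
```

Four of the six toy words can be thrown out while every synset stays covered, which is the same kind of reduction the full IP finds for all of English.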

Figure 1: A section of a bipartite graph of adverbs and adverbial synsets. Red nodes are words. Blue nodes are synsets.

Figure 2: The same graph as figure 1, but with colors indicating membership in a SCLeSub. Green nodes are words that are retained in the SCLeSub. Red nodes are words that are thrown out of the network. Blue nodes are synsets.

1. How many words can be eliminated from English without losing any senses (i.e. synsets)?
To answer this question, it was necessary to solve five integer programs: one to find a SCLeSub for the entire language, and one for each part of speech (noun, adjective, verb, and adverb). Table 1-A shows a summary of the solutions to those optimizations. From the table we can observe a number of interesting phenomena. As we might expect, the number of synsets in the language is equal to the sum of the number of synsets in the individual parts of speech. This is logical because it seems impossible for a word sense to be, for example, both an object and an action at the same time. The sum of the number of words from each part of speech, however, is larger than the number of words in the complete language. This also is logical, because one can think of many words that can take one of several parts of speech. The word "duck", for example, is counted both under "Nouns", where it can mean "small wild or domesticated web-footed broad-billed swimming bird usually having a depressed body and short legs", and under "Verbs", where it can mean "to move the head or body quickly downwards or away". Table 1 shows that about 46 percent of the words in English can be eliminated without eliminating any senses.

2. Does any part of speech contain more redundant vocabulary than other parts of speech?
What is perhaps more interesting is that the sum of the sizes of the SCLeSubs for the individual parts of speech is about 6.3% larger than the SCLeSub calculated from all words. This could suggest that by merging word nodes to connect the individual part-of-speech networks and form the "All words" network, we are allowing more nodes to be eliminated, due to the increased connectivity of the merged nodes. We could naïvely conclude, because merging the networks decreases the number of nodes while keeping the number of edges constant, that the number of words eliminated will depend on the edges/node ratio. Table 1-B shows us that the situation is somewhat more complicated. Among the different parts of speech there is no consistent correlation between the percent of words retained in the SCLeSub and the ratios of the various graph features. The only remaining explanation is that there must be differences in the topologies of the word/synset graphs among the various parts of speech.

3. Is the SCLeSub of English unique, or are there multiple alternative SCLeSubs?
To arrive at an answer to this question I had to extend the original IP slightly, in a way analogous to the "flux variability analysis" technique used in metabolic modeling in biology. The goal here is to define the boundaries of the solution (SCLeSub) space. After solving the original optimization problem (Program 1), the original optimal objective value can be set as a constraint, and a series of new IPs can be solved where, for each one, the objective function is to minimize or maximize a single decision variable (Program 2). I deem this technique "SCLeSub variability analysis", or SVA. A convenient feature of SVA is that after solving the initial IP, the solution can be used as a feasible basis for the subsequent IPs; this means that it is quicker for the computer to solve the IPs in the SVA calculation than it is to solve the original problem. Nevertheless, the SVA problem is highly computationally intensive. On my home computer, a 2.8 GHz quad-core Intel i7, I started the execution of all five SVA problems (all words, nouns, adjectives, verbs, and adverbs) at the same time. The adverbs problem finished in about 2 hours. Interestingly, the adjectives problem finished second, after about 10 hours. The verbs problem, despite having a smaller constraint matrix and fewer decision variables than the adjectives problem, took about 18 hours. At 18 hours, the nouns problem had just finished 100 iterations, and the all-words problem had not even gotten that far, so I killed the processes and settled for what I had. The results of the SVA analysis (table 2) clearly show that the composition of a SCLeSub is not unique.
Program 2:

    minimize (or maximize)  x_k
    subject to  sum over n of r_jn * x_n >= 1    for each synset j
                sum over n of x_n = z*
                x_n in {0, 1}                    for each word n

Solve for each k = 1 to |x|. Maximize if x_k = 0 in the solution to the original IP; minimize if x_k = 1 in the original solution. r_jn is 1 if word n can take the meaning of synset j, and 0 otherwise. z* is the value of the objective function in the original problem.
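On a toy example, the SVA classification can be illustrated by simply enumerating every minimum-size cover, instead of re-solving an IP per word as Program 2 does. A Python sketch, with invented data chosen so that the minimum cover is not unique:

```python
from itertools import combinations

# Toy data, invented for illustration. "big" and "great" are
# interchangeable here, so minimum covers are not unique.
word_synsets = {
    "big":   {"large", "important"},
    "great": {"large", "important"},
    "adult": {"adult", "mature"},
    "grown": {"adult"},
}
synsets = set().union(*word_synsets.values())
words = sorted(word_synsets)

# Enumerate every cover of minimum size z* (the real SVA instead adds
# "sum of x_n = z*" as a constraint and re-solves one IP per word).
covers = []
for k in range(1, len(words) + 1):
    for subset in combinations(words, k):
        if set().union(*(word_synsets[w] for w in subset)) == synsets:
            covers.append(set(subset))
    if covers:
        break  # k is z*; all minimum covers found

in_every = set.intersection(*covers)      # in every SCLeSub
in_some = set.union(*covers) - in_every   # in some SCLeSubs, but not all
in_none = set(words) - set.union(*covers) # in no SCLeSub
print(in_every, in_some, in_none)
```

Here "adult" is in every minimum cover, "big" and "great" each appear in some but not all, and "grown" appears in none, which is exactly the three-way classification reported in table 2.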
                       Nouns   Adjectives    Verbs   Adverbs   All words   Sum of parts
                                                                            of speech
(A)
Word-synset edges    146,312      30,002    25,047     5,580     206,941      206,941
Synsets               82,115      18,156    13,767     3,621     117,659      117,659
Words total          117,798      21,479    11,529     4,481     147,306      155,287
Words in SCLeSub      62,364      12,700     6,294     2,833      79,230       84,191

(B)
Percent of words
  in SCLeSub           52.94       59.13     54.59     63.22       53.79        54.22
Edges per word          1.24        1.40      2.17      1.25        1.40         1.33
Edges per synset        1.78        1.65      1.82      1.54        1.76         1.76
Edges per node          0.73        0.76      0.99      0.69        0.78         0.76
Words per synset        1.43        1.18      0.84      1.24        1.25         1.32
Table 1: Comparison of SCLeSub problem and its solution among the various parts of speech. (A) Entries in the “Sum of parts of speech” column are the sum of the other numbers in that row, except for the number in the “All words” column. (B) Entries are derived from the data in (A).

                                        Adjectives    Verbs   Adverbs
Total words                                 21,479   11,529     4,481
In every SCLeSub             Count           9,591    5,058     2,178
                             Percent         44.65    43.87     48.61
In at least one SCLeSub,     Count           7,491    2,883     1,581
  but not in every one       Percent         34.88    25.01     35.28
Not in any SCLeSub           Count           4,397    3,588       722
                             Percent         20.47    31.12     16.11
Table 2: Comparison of the SCLeSub space among adjectives, verbs, and adverbs.

Most of the words in the SCLeSubs must be in every SCLeSub, but for every grammatical category, there is still a substantial population of words that are in some SCLeSubs, but not every one. Once again, if there are consistent patterns in the bulk statistics, they are obscure patterns, suggesting significant topological differences among the underlying graphs of the various parts of speech.

4. Does translating ordinary prose into words found only in the SCLeSub affect the readability of the text?
In light of the results discussed above, the concept of "the SCLeSub" no longer seems valid. Thus, to answer this last question, one of several approaches could be taken. The vocabulary could be restricted to only words that appear in at least one SCLeSub, or an arbitrary SCLeSub could be chosen and the vocabulary restricted to words that are members of that particular SCLeSub. Here, I follow the second approach, for no other reason than that it will result in more word replacements than the first. I used the optimal solution for Program 1 output by GLPK after inputting the "All words" network as the basis for these translations (see Supplemental data 1). The following two sentences are taken from the New York Times (underlined words are words that were changed):

Original: A tactical flight officer on board the helicopter fired multiple times into the vehicle in an attempt to shoot the tires.

Translated: A tactical flight officer on board the chopper fired multiple times into the vehicle in an endeavor to shoot the tyres.

Original: Floridians can cast their ballots into the ocean and hope they end up in a state that's more capable of counting them.

Translated: Floridians can cast their ballots into the ocean and hope they wind up in a state that's more capable of counting them.

I also tried translating the Christmas song “Silent Night”, and “Yellow Submarine” by the Beatles, but neither one seemed to have any words or phrases that were not already in the SCLeSub.

Making text conform to the SCLeSub does seem to make the text sound a little less natural, particularly in the first example, but it remains readable, and in many cases no changes are even needed. Many of the "words" in WordNet are actually phrases, so an automated translator would undoubtedly find more places to substitute words than I did searching manually.

While the results I have arrived at in this project are interesting, they also leave plenty of room for further research. For example, I conclude that the word-synset graphs for each part of speech must have distinct topological characteristics, but I do not attempt to explain what those characteristics might be. Future research in this area might start with comparing the frequencies of various network motifs among the graphs.
A striking feature of the SCLeSub space for every part of speech is how broad it is. There are thousands of different possible SCLeSubs. If someone wanted to use a SCLeSub-like subset of the language for some practical purpose, it would make sense to find some criteria for picking a specific SCLeSub out of the SCLeSub space. The total word-space could be iteratively reduced by solving a series of IPs, with the optimal objective function for the previous IP becoming a constraint in the subsequent IP, and a lower-priority criterion becoming the new objective function. Alternatively, a single IP including multiple criteria in the objective function, each criterion weighted with a constant coefficient, could be solved. One possibly useful additional optimization criterion could be the rank of words in a list of word frequencies. After the size of the SCLeSubs is found (by Program 1), a smaller area of SCLeSub space, perhaps a unique x, could be found by Program 3. If the goal is to find a set of common words, perhaps words to teach foreign students of English, the objective function would be a minimization; if the goal is to find a set of obscure words, perhaps for a politician or comedian, then the objective function would be a maximization.
So far, particularly based on the exploration of goal 4, I'm not convinced that there could ever really be a useful practical application of the SCLeSub concept, but perhaps if translations were attempted into a SCLeSub optimized for word obscurity (thus excluding as many common words as possible), the changes would be more dramatic. Contrariwise, a SCLeSub may find use not as a list of words to be used, but as a list of words to be avoided. Perhaps the words in the "Not in any SCLeSub" category are more sophisticated and erudite sounding than words in the other two categories; they are certainly less vague.
While working on this project, I learned about formulating potentially useful integer programs, I learned about how to integrate GLPK calls into computer programs, and I made some really kick-ass graphs in Cytoscape (see appendix and supplementary PDFs). I enjoyed working on this project, I hope you enjoyed reading about it.

Program 3:

    minimize (or maximize)  sum over i of F_i * x_i
    subject to  sum over n of r_jn * x_n >= 1    for each synset j
                sum over n of x_n = z*
                x_n in {0, 1}                    for each word n

Like Program 2, except for the objective function. F_i is the rank of word x_i in a word-frequency list, with the most often used word ranked 1 and the least often used word ranked n.
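On a toy example, Program 3 amounts to: among all covers of minimum size z*, pick the one with the best total frequency rank. A Python sketch with invented words and ranks (the real version would solve this as a single IP):

```python
from itertools import combinations

# Toy data, invented for illustration; freq_rank 1 = most common word.
word_synsets = {
    "big":   {"large", "important"},
    "great": {"large", "important"},
    "adult": {"adult", "mature"},
    "grown": {"adult"},
}
freq_rank = {"big": 1, "great": 2, "adult": 3, "grown": 4}
synsets = set().union(*word_synsets.values())
words = sorted(word_synsets)

# First find the minimum cover size z* (as in Program 1), then among the
# covers of that size minimize total frequency rank (Program 3, the
# "common words" variant; maximizing instead gives the obscure variant).
best = None
for k in range(1, len(words) + 1):
    covers = [set(s) for s in combinations(words, k)
              if set().union(*(word_synsets[w] for w in s)) == synsets]
    if covers:
        best = min(covers, key=lambda c: sum(freq_rank[w] for w in c))
        break

print(sorted(best))  # -> ['adult', 'big'] (the common-word choice)
```

Of the two equally small covers, {"big", "adult"} beats {"great", "adult"} because "big" outranks "great" in frequency.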


There's some supplemental data for this project, which you can download here.

Supplemental_Data_1_All_Words_Solution.txt : as alluded to in the text, this file contains the data describing one SCLeSub of the 'All words' network. If there is only one word on a line, then that word is a member of the SCLeSub. If there are multiple words, then the first word is not a member of the SCLeSub, but the subsequent words are in the SCLeSub and share a synset with the first word. To make a sentence conform to the SCLeSub, search for all the words and phrases in the sentence, and if one is not in the SCLeSub, choose the appropriate word to replace it with from among the words following it in the text file.

 : This is the Perl script I wrote to do all of the heavy lifting for this project. It's pretty sloppy, but it gets the job done.
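The replacement step described above is easy to mechanize. A Python sketch, using a few invented lines in the supplemental file's format (one word per line means SCLeSub member; on longer lines, the first word is outside the SCLeSub and the rest are in-SCLeSub synonyms sharing a synset with it):

```python
# Invented sample lines in the supplemental file's format.
sample = """\
chopper
helicopter chopper
attempt endeavor
tire tyre
"""

replacements = {}
for line in sample.splitlines():
    tokens = line.split()
    if len(tokens) > 1:
        replacements[tokens[0]] = tokens[1]  # take the first synonym

def translate(sentence):
    """Swap each non-SCLeSub word for an in-SCLeSub synonym."""
    return " ".join(replacements.get(w, w) for w in sentence.split())

print(translate("the helicopter made an attempt to land"))
# -> the chopper made an endeavor to land
```

A real translator would also need to match the multi-word phrases in WordNet and pick the contextually appropriate synonym, which is why I did it manually here.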

adverb_map.pdf : This is the entire graph for adverbs, including the data from the SVA analysis. Light blue nodes are synsets. Red nodes are words that are not in any SCLeSub. Yellow nodes are words that are in some SCLeSubs, but not in every SCLeSub. Green nodes are words that are in every SCLeSub.
I generated this graph in Cytoscape using data output by my Perl script. If you zoom in to about 1500 percent you can actually read the names of the words. Also, you can search for words in the PDF and see the network context around them. This is the best looking of the three, I think.

verb_map.pdf : Like above, but for verbs.

adjective_map.pdf : Like above, but for adjectives.

I tried to make ones for nouns and all words, but they're huge and I couldn't get them to turn out well.
