For more than 50 years, linguists and computer scientists have tried to get computers to understand human language by programming semantics as software. Driven by efforts to translate Russian texts during the Cold War (and more recently by the value of information retrieval and data analysis tools), these efforts have met with mixed success. IBM’s Jeopardy-winning Watson system and Google Translate are high-profile, successful applications of language technologies, but the humorous answers and mistranslations they sometimes produce are evidence of the problem.
Our ability to distinguish between multiple word meanings is rooted in a lifetime of experience. The context in which a word is used, an intrinsic understanding of syntax and logic, and a sense of the speaker's intention all help us interpret what another person is saying. A computer can't access these experiences, so it requires a lot of data to begin to "learn" the distinctions.
But language isn't always straightforward, even for humans. The multiple definitions in a dictionary can make it difficult even for people to choose the correct meaning of a word. Katrin Erk, a linguistics researcher in the College of Liberal Arts, refers to this as "semantic muck." Enabled by supercomputers at the Texas Advanced Computing Center, Erk has developed a new method for visualizing words in a high-dimensional space.
Instead of hard-coding human logic or deciphering dictionaries to try to teach computers language, Erk decided to try a different tactic: feed computers a vast body of texts (which are a reflection of human knowledge) and use the implicit connections between the words to create a map of relationships.
“An intuition for me was that you could visualize the different meanings of a word as points in space,” Erk says. “You could think of them as sometimes far apart, like a battery charge and criminal charges, and sometimes close together, like criminal charges and accusations (“the newspaper published charges…”). The meaning of a word in a particular context is a point in this space. Then we don’t have to say how many senses a word has. Instead we say: ‘This use of the word is close to this usage in another sentence, but far away from the third use.’ ”
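This intuition can be sketched in a few lines of code. The toy example below (not Erk's actual model, which is far more sophisticated) represents each occurrence of the ambiguous word "charges" by the counts of words appearing near it, then compares occurrences by cosine similarity: usages with overlapping contexts come out closer in the space. The sentences, window size, and stopword list are invented purely for illustration.

```python
from collections import Counter
from math import sqrt

STOPWORDS = {"the", "and", "those"}  # tiny hand-picked list for this toy example

def context_vector(tokens, target, window=3):
    """Sum the content words occurring within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            nearby = tokens[max(0, i - window):i + window + 1]
            counts.update(t for t in nearby if t != target and t not in STOPWORDS)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Three invented usages of "charges": legal, accusation, and battery senses.
legal  = "prosecutors filed criminal charges and the jury heard those charges".split()
accuse = "the newspaper published charges and reporters repeated the criminal charges".split()
power  = "the battery charges slowly and the phone charges overnight".split()

v_legal, v_accuse, v_power = (context_vector(s, "charges") for s in (legal, accuse, power))

# The legal and accusation usages share context ("criminal"), so their vectors
# land closer together than either does to the battery usage.
print(cosine(v_legal, v_accuse) > cosine(v_legal, v_power))  # True
```

Nothing here requires naming a fixed number of senses in advance: each usage is just a point, and "same meaning" becomes a matter of distance rather than a dictionary entry.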
Creating a model that accurately reproduces this intuitive ability to distinguish word meanings requires a lot of text and a lot of analytical horsepower.
"The lower end for this kind of research is a text collection of 100 million words," she explains. "If you can give me a few billion words, I'd be much happier. But how can we process all of that information? That's where supercomputers come in."