Linas NLP tasks
This page lists a set of half-baked ideas and tasks regarding NLP (natural language processing) within OpenCog. The goal of this page is to help User:Linas more clearly formulate his strategic direction.
Overview
The intermediate-term goal is to synthesize the best practices and best ideas from construction grammar and dependency grammar with OpenCog's current parser, which is based on link grammar, into a single system. The synthesis is to make use of the general principles of corpus linguistics, that is, mining statistical information to obtain concrete parse rules and construction rules.
The current focus is on using mutual information as a basic measure of statistical relatedness, and then using Markov chains, Markov random fields or Markov logic networks to obtain the colligations, head words, phrases, parse rules, co-reference resolution and ultimately, it is hoped, the semes/lexis of the text. The three Markovian systems are listed in order of increasing complexity; the intent is to apply only the simplest model needed to resolve a particular situation.
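As a rough illustration of the "basic measure of statistical relatedness", the sketch below computes pointwise mutual information for ordered word pairs from raw pair counts. The function name, the input layout (a dict keyed by word pairs) and the choice of base-2 logarithms are assumptions made for this example; the page does not say how its MI values are normalized.

```python
import math
from collections import Counter


def pair_mutual_information(pair_counts):
    """Return the pointwise MI (in bits) of each ordered word pair.

    pair_counts maps (left_word, right_word) -> observed count.
    Hypothetical helper: the input layout and the log base are assumptions.
    """
    total = sum(pair_counts.values())
    left = Counter()   # how often each word appears as the left member
    right = Counter()  # how often each word appears as the right member
    for (w1, w2), n in pair_counts.items():
        left[w1] += n
        right[w2] += n

    mi = {}
    for (w1, w2), n in pair_counts.items():
        # log2( P(w1, w2) / (P(w1, *) * P(*, w2)) ), written with raw counts
        mi[(w1, w2)] = math.log2(n * total / (left[w1] * right[w2]))
    return mi


# Toy counts, invented purely for illustration.
toy_counts = {
    ("motorcycle", "crash"): 4,
    ("motorcycle", "club"): 1,
    ("book", "club"): 5,
}
for pair, score in sorted(pair_mutual_information(toy_counts).items()):
    print(pair, round(score, 3))
```

Pairs that occur together more often than their individual frequencies would predict get a positive score; pairs that are statistically independent score zero.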
Rule statistics
- Collect statistics for how often each disjunct parsing rule in link-grammar is used.
- Collect statistics for how often pairs of such rules are used together.
- Search for clusters of words that use such rules and such pairs of rules.
The goal of the above is both to expand and also to refine the coverage of the parser. The dictionaries have large collections of words that have been pre-assigned to disjuncts of parse rules. Some of these collections seem to be over-broad. It is quite likely that some of these larger dictionary entries will split into smaller groups. However, these can only be found by means of corpus linguistics.
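The three collection steps listed above might be sketched as follows. This is only a sketch under stated assumptions: each parse is taken to be a list of (word, disjunct) tuples, the disjunct strings are illustrative, and "used together" is read as "occurring in the same parse". None of this reflects the actual link-grammar output format.

```python
from collections import Counter
from itertools import combinations


def collect_rule_statistics(parses):
    """Count disjunct usage, disjunct co-usage, and the words seen per disjunct.

    parses is a list of parses; each parse is a list of (word, disjunct)
    tuples. This layout is an assumption made for the example.
    """
    single = Counter()       # how often each disjunct is used
    pairs = Counter()        # how often two disjuncts share a parse
    words_by_disjunct = {}   # disjunct -> set of words that used it

    for parse in parses:
        disjuncts = [d for _, d in parse]
        single.update(disjuncts)
        # Sort first so that each unordered pair has one canonical key.
        for a, b in combinations(sorted(disjuncts), 2):
            pairs[(a, b)] += 1
        for word, d in parse:
            words_by_disjunct.setdefault(d, set()).add(word)
    return single, pairs, words_by_disjunct


# Two invented parses with made-up disjunct labels.
toy_parses = [
    [("the", "D+"), ("cat", "D- S+"), ("sat", "S-")],
    [("a", "D+"), ("dog", "D- S+"), ("ran", "S-")],
]
single, pairs, words_by_disjunct = collect_rule_statistics(toy_parses)
print(single.most_common())
print(words_by_disjunct["D- S+"])
```

The per-disjunct word sets are only a starting point for the clustering step: a dictionary entry that is over-broad would show up as a word set whose members rarely share the same disjuncts or disjunct pairs.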
- Add suffix printing to the file format.
WSD statistics
- Collect statistics on word-sense usage vs. parse rule employed.
The goal here is to discover whether parse rules are associated with particular word senses and, if so, which rules go with which senses.
Hierarchical Colligation
Yuret's (1998) algorithm tries to create a dependency tree by computing the mutual information (MI) between word pairs. The tree is discovered by computing the maximum spanning tree of the MI between all word pairs. There is an alternate approach: true hierarchical clustering. In hierarchical clustering, one creates an MI-based metric (MI alone is not a metric) and applies cluster analysis techniques. To get a hierarchy, one also computes the metric between clusters; that is, one computes the MI between a third word and a word pair as a whole, instead of between the third word and just the head word of the pair's phrase.
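The maximum-spanning-tree step can be sketched with Kruskal's algorithm run over edges sorted by decreasing MI. This is a plain maximum spanning tree only; it does not attempt any additional constraints the original algorithm may impose. The function name, the index-pair keys and the toy MI values are assumptions for illustration.

```python
def maximum_spanning_tree(num_words, mi):
    """Return the edges of a maximum spanning tree over word positions.

    mi maps (i, j) position pairs to an MI score. Kruskal's algorithm:
    take edges from highest to lowest score and keep every edge that
    joins two components which are not yet connected.
    """
    parent = list(range(num_words))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), score in sorted(mi.items(), key=lambda kv: kv[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j, score))
    return tree


# Invented scores for a three-word sentence.
words = ["the", "motorcycle", "crashed"]
toy_mi = {(0, 1): 2.0, (1, 2): 5.0, (0, 2): 0.5}
for i, j, score in maximum_spanning_tree(len(words), toy_mi):
    print(words[i], "--", words[j], score)
```

Kruskal's algorithm is used here because it works directly from a list of scored pairs; any other spanning-tree algorithm would give a tree of the same total weight.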
Synonym clusters
Consider the word pairs "motorcycle crash" (MI = 7.27962) and "motorcycle accident" (MI = 7.86758). Compare these to MI(motorcycle, [crash or accident]) = 7.66015, which is higher than the average (MI(motorcycle crash) + MI(motorcycle accident))/2 = 7.57361. So clustering the two words "crash" and "accident" together raises the MI just a tad over the average.
By contrast, this doesn't hold for morphological variants: MI(motorcycle, [club or clubs]) < (MI(motorcycle club) + MI(motorcycle clubs))/2. Hmmmm. That's odd ...
- Experiment was performed 4 Sept 2008, dead end. Source code in directory lexical-attr/src/cluster.
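For reference, the comparison made in this section can be written out as a short sketch. It is not the code from lexical-attr/src/cluster. It assumes that the MI of a word with a cluster [a or b] is obtained by pooling the counts of a and b, and all counts below are invented; only the two MI values in the final check are taken from the text above.

```python
import math


def pmi(n_joint, n_left, n_right, total):
    """Pointwise MI in bits from raw counts (the log base is an assumption)."""
    return math.log2(n_joint * total / (n_left * n_right))


def merged_pmi(n_wa, n_wb, n_w, n_a, n_b, total):
    """MI between word w and the cluster [a or b], pooling the counts of a and b."""
    return pmi(n_wa + n_wb, n_w, n_a + n_b, total)


# Invented counts: w is the left word, a and b are candidate synonyms.
total = 1_000_000
n_w, n_a, n_b = 200, 500, 300
n_wa, n_wb = 20, 15

mi_a = pmi(n_wa, n_w, n_a, total)
mi_b = pmi(n_wb, n_w, n_b, total)
mi_ab = merged_pmi(n_wa, n_wb, n_w, n_a, n_b, total)
average = (mi_a + mi_b) / 2
print(mi_a, mi_b, mi_ab, average)

# Arithmetic check of the average quoted above for "crash" and "accident".
quoted_average = (7.27962 + 7.86758) / 2
print(quoted_average)
```

Under the pooling assumption, the merged value always lies between the two pair values, because the pooled ratio (n_wa + n_wb)/(n_a + n_b) lies between n_wa/n_a and n_wb/n_b. Whether it lands above or below the simple average then depends on the relative frequencies of the two words rather than on whether they are synonyms.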

