OpenCog

Ideas

See also Development and Projects pages, and Publications for background reading.

Ideas listed here may be taken up as projects by anyone with the necessary skill and motivation (the ideas here are not just for GSoC students!).

Wikipedia good-article-quality edits to any page on the entire wiki are most welcome!

Google Summer of Code 2008 students:

Accepted projects are listed at GSoCProjects2008. Please create a WikiWord for your project and use the resulting page to keep a record of your work.

OpenCog projects are inter-related in many ways, although they vary considerably in the sub-fields of computer science they explore, the programming languages used, size and scope, and other properties. OpenCog teams overlap and complement each other to varying degrees. The formation of particular projects and teams is influenced primarily by the goals and needs of OpenCog as an integrated and coherent system, and by specific larger project goals, such as building a natural language conversational system.

Most tasks are given difficulty labels:

  1. Relatively straightforward AI R&D, though not easy
  2. Pretty difficult
  3. Rather difficult

(Note that these labels refer only to the AI aspects of the task; there may also be difficult software engineering or systems integration aspects, which haven't been labeled.)

If you add an idea as a Blueprint in Launchpad (under the respective project), please append a link to the blueprint after the idea text on this page.

RelEx

RelEx is an English-language semantic relationship extractor, built on the Carnegie-Mellon link parser. It can identify subject, object, indirect object and many other relationships between words in a sentence. It can also provide part-of-speech tagging, noun-number tagging, verb tense tagging, gender tagging, and so on. RelEx includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm. Optionally, it can use GATE for entity detection. RelEx also provides semantic relationship framing, similar to that of FrameNet.

Improved reference resolution

Pronominal reference resolution, or anaphora resolution, means automatically figuring out the antecedents of pronouns like "it", "him" and "her". The current RelEx implementation uses the Hobbs algorithm for this, which is an intelligent but crude mechanism that achieves about 60% accuracy. By combining the Hobbs algorithm with statistical measures (one of which is sometimes called the Hobbs score), it should be possible (according to literature results) to get up to 85% accuracy or so.

The project is then to implement a statistical learning algorithm to improve the accuracy of pronominal resolution in RelEx! Appropriate for anyone who is interested in getting some hands-on experience with statistical corpus linguistics. (difficulty level 1)

Learning simple grammars

RelEx uses the CMU Link Grammar as its underlying English-language parser. Link Grammar's weak point is short sentences, simple commands, directives and the like: sentences which typically occur in chat rooms. The project is to tinker with the automatic acquisition of language via some learning mechanism or another.

The paper John Lafferty, Daniel Sleator, and Davy Temperley, "Grammatical Trigrams: A Probabilistic Model of Link Grammar," Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October 1992, describes statistical learning algorithms that generate Link Grammar rules. There have been several implementations of statistical learning for Link Grammar; one recent one was used to generate 50K Link Grammar rules for biomedical terms. See P. Szolovits, "Adding a Medical Lexicon to an English Parser," Proc. AMIA 2003 Annual Symposium, pages 639-643, 2003. See also the link grammar bibliography for details.

It is not clear how adaptable these algorithms would be to the short, frequently ungrammatical sentences seen in chatrooms; and what mechanisms they suggest for keeping such learning from garbaging up the correct parses of more complex sentences. It is quite possible that link grammar needs new link types which would be used only in short sentences. (!) The idea of learning new link types seems unexplored.

The project is then to learn new Link Grammar links and rules, suitable for parsing the kind of speech often seen in chat rooms.

See also Natural Language Processing, below.

Statistical parse ranking

Right now, RelEx spits out between 1 and 100+ parses for each sentence you feed it. While all are syntactically correct, many are semantically wacky. Dekang Lin, in the context of his MiniPar parser, developed a technique for using corpus linguistics to statistically rank parses according to their semantic plausibility.

The idea here is to have the system learn, via statistical analysis of a corpus of sentences, which parses are "statistically more likely". See page 5 of Dekang Lin and Patrick Pantel. 2001. "Discovery of Inference Rules for Question Answering." Natural Language Engineering 7(4):343-360 for a description of how a similar idea was implemented in the MiniPar parser, to good effect...

We (Novamente, Ben, Murillo) have figured out how to adapt this idea to RelEx, but there is plenty of experimentation as well as implementation to be done. Again, appropriate for anyone who is interested in getting some hands-on experience with statistical corpus linguistics.
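
As a rough illustration of corpus-based parse ranking, the sketch below counts how often each relation triple appears in a corpus of trusted parses, then scores candidate parses by the average smoothed log-frequency of their triples. This is a simple frequency-based stand-in, not the Lin and Pantel method and not the RelEx design; the function names, the triple layout and the toy data are all assumptions made for the example.

```python
from collections import Counter
import math

def count_relations(corpus_parses):
    """Count occurrences of each (relation, head, dependent) triple."""
    counts = Counter()
    for parse in corpus_parses:
        counts.update(parse)
    return counts

def score_parse(parse, counts, total, vocab_size):
    """Average add-one-smoothed log-probability of the triples in a parse."""
    if not parse:
        return float("-inf")
    logp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in parse)
    return logp / len(parse)

def rank_parses(candidates, counts):
    """Return the candidate parses sorted from most to least plausible."""
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen triples
    return sorted(candidates,
                  key=lambda p: score_parse(p, counts, total, vocab_size),
                  reverse=True)

# Toy corpus of parses (hypothetical data; relation names are illustrative).
corpus = [
    [("_subj", "eat", "dog"), ("_obj", "eat", "bone")],
    [("_subj", "eat", "cat"), ("_obj", "eat", "fish")],
    [("_subj", "eat", "dog"), ("_obj", "eat", "meat")],
]
counts = count_relations(corpus)

plausible = [("_subj", "eat", "dog"), ("_obj", "eat", "bone")]
wacky = [("_subj", "eat", "bone"), ("_obj", "eat", "dog")]
ranked = rank_parses([wacky, plausible], counts)
```

With this toy corpus the parse whose triples were seen before is ranked first. A real system would need a far larger corpus and some backing-off from exact words to word classes.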

Inference on extracted semantic relationships

Currently RelEx takes in a sentence, and outputs a set of logical relationships expressing the semantics of the sentence. It is possible to take the logical relationships extracted from multiple sentences, and combine them using a logical reasoning engine, to see what conclusions can be derived.

Some prototype experiments along these lines were performed in 2006, using sentences contained in PubMed abstracts. But no systematic software approach was ever implemented.

These experiments could be done in a variety of rule engines, including the Probabilistic Logic Networks engine to be open-sourced in May as part of OpenCog, as well as more standard crisp rule engines such as SWIRLS.
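
To make the idea of combining relationships from several sentences concrete, here is a minimal crisp forward-chaining sketch. It is not PLN and not any particular rule engine; the fact layout `(relation, a, b)`, the `transitive` rule helper and the example facts are hypothetical and chosen only for illustration.

```python
def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Each rule returns a fresh set, so adding to `known` here is safe.
            for new_fact in rule(known):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known

def transitive(relation):
    """Build a rule stating that `relation` is transitive."""
    def rule(known):
        pairs = [(a, b) for (r, a, b) in known if r == relation]
        return {(relation, a, d) for (a, b) in pairs for (c, d) in pairs if b == c}
    return rule

# Relationships as they might be extracted from two separate sentences
# (hypothetical data, not actual RelEx output).
sentence_1 = [("is_a", "aspirin", "nsaid")]
sentence_2 = [("is_a", "nsaid", "drug")]

conclusions = forward_chain(sentence_1 + sentence_2, [transitive("is_a")])
```

Neither sentence alone states that aspirin is a drug; the conclusion only appears once the two sets of relationships are pooled. An uncertain reasoner such as PLN would additionally attach and propagate truth values.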

This project is appropriate for a student who is interested in both computational linguistics and logical inference, and has some knowledge of predicate logic.

(An alternate approach is to perform statistical inferencing, as described in Dekang Lin and Patrick Pantel. 2001. "Discovery of Inference Rules for Question Answering." Natural Language Engineering 7(4):343-360)

Grub Distributed Web Crawler

Write a RelEx plugin for the Grub distributed web crawler (particularly, the alpha Java version of GrubNG client here). The idea is really simple: not only crawl the page, but also extract and process the textual content, sending the results back to a shared repository.

Explore Landmark Transitivity in Link Grammar

The Link Grammar uses a constraint of "planar graphs" (i.e. no link crossings) to rule out unreasonable parses. It seems that it might be possible to replace this rule by the notion of "Landmark Transitivity" taken from Hudson's Word Grammar. The basic idea is this:

Each Link Grammar link is given a parent-child relationship: one end of the link is the parent, the other the child. Thus, for example, given a noun to noun-modifier link, the noun is the parent of the link. Then, parents are landmarks for children. Transitivity (in the mathematical sense of "transitive relation") is applied to these parent-child relationships. Specifically, the no-links-cross rule is replaced by two landmark transitivity rules:

  • If A is a landmark for B, and B is a landmark for C, then A is also a type-L landmark for C
  • If A is a landmark for B, and A is a landmark for C, then B is also a landmark for C

where type-L means either a right-going or left-going link.

Ben hypothesizes that adding Landmark transitivity might be able to eliminate most or all of Link Grammar's post-processing rules. See Ben's PROWL grammar for details. See also Natural Language Processing, below.
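
For reference, the existing no-links-cross constraint that landmark transitivity would replace is easy to state in code. The sketch below treats each link as a pair of word positions and checks whether any two links cross when drawn above the sentence. The representation and function names are assumptions for illustration and are not taken from the Link Grammar source.

```python
def links_cross(link_a, link_b):
    """True if two links, given as pairs of word positions, cross each other."""
    a1, a2 = sorted(link_a)
    b1, b2 = sorted(link_b)
    # Two arcs cross when exactly one endpoint of one lies strictly inside the other.
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2

def is_planar(links):
    """True if no pair of links in the parse crosses."""
    return not any(links_cross(x, y)
                   for i, x in enumerate(links)
                   for y in links[i + 1:])

# Word positions are 0-based; both parses are made-up examples.
planar_parse = [(0, 1), (1, 2), (2, 4), (3, 4)]
crossing_parse = [(0, 2), (1, 3)]
```

Links that merely share an endpoint, or that are nested one inside the other, do not count as crossing.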

Natural Language Generation

Given a set of RelEx-like relations, generate a syntactically correct English-language sentence.

Other NLP Ideas

The page Linas NLP tasks lists additional NLP-related ideas and tasks.

MOSES

Meta-optimizing semantic evolutionary search (MOSES) is a new approach to program evolution, based on representation-building and probabilistic modeling. MOSES has been successfully applied to solve hard problems in domains such as computational biology, sentiment evaluation, and agent control. Results tend to be more accurate, and require fewer objective function evaluations, in comparison to other program evolution systems. Best of all, the result of running MOSES is not a large nested structure or numerical vector, but a compact and comprehensible program written in a simple Lisp-like mini-language. More at http://metacog.org/doc.html.

Higher-order Programmatic Constructs

A very important project, appropriate for a student with some functional programming background, is to extend MOSES to handle higher-order programmatic constructs, including variable expression. This can be done in many ways, including using combinatory logic or lambda calculus; the route that seems best at the moment is Sinot's formalism of "director strings as combinators." There is opportunity for the student to assist with working out the details of the design as well as the implementation. Much of the work here is in Reduct and representation-building, which would be useful for both MOSES and Pleasure. (2)

Procedure Learning

Implement, test and explore the Pleasure Algorithm for program learning, implemented standalone and/or integrated with MOSES (1)

Cause MOSES to generalize across problem instances, so that what it has learned across multiple problem instances can be used to help prime its learning in new problem instances. This can be done by extending the probabilistic model building step to span multiple generations, but this poses a number of subsidiary problems, and requires integration of some sort of sophisticated attention allocation method into MOSES to tell it which patterns observed in which prior problem instances to pay attention to. (2)

Arbitrarily Complex Program Learning

More on the previous project suggestion: The motivation for the above is to allow MOSES to learn arbitrarily complex programs. For instance, we would like it to be able to easily learn O(n log n) sorting algorithms without any fancy data preparation or other "cheating." It is possible that integrating Sinot's formalism into MOSES will allow effective learning of moderately complex programs using recursive control, which is something no one has achieved before and which is of critical importance in automated program learning.

Action-Sequences Handling

The current version of MOSES does not elegantly or efficiently handle the learning of programs involving long sequences of actions. This is problematic for applications involving the control of robots or virtual agents. So, an important project is the extension of the Reduct and representation-building components of MOSES to effectively handle action sequences. This work will be testable by using MOSES to control agents in virtual worlds such as Multiverse or CrystalSpace.

Improved hBOA

MOSES consists of four critical aspects: deme management, program tree reduction, representation-building, and population modeling. For the latter, the hBOA algorithm (invented by Martin Pelikan in his 2002 PhD thesis) is currently used, but we've found it not to be optimal in this context. So there is room for experimentation in replacing hBOA with a different algorithm; for instance, a variant of simulated annealing has been suggested, as has been a pattern-recognition approach similar to LZ compression. A student with some familiarity with evolutionary learning, probability theory and machine learning may enjoy experimenting with alternatives to hBOA so as to help turn MOSES into a super-efficient automated program learning framework. It already works quite well, dramatically outperforming GP, but we believe that with some attention to improving the hBOA component it can be improved dramatically.

OpenBiomind Integration

OpenBiomind (see below) contains code for applying genetic programming to analyze gene expression microarray data and SNP data. This approach has been successfully used to learn diagnostic rules for cancer, Alzheimer's, Parkinson's and other diseases, as reflected in several publications by Biomind LLC staff. MOSES however is known to be generally more effective than GP, and so it's of interest to integrate MOSES with OpenBiomind, and explore its effectiveness at microarray and SNP data analysis. Some customization of MOSES as well as extensive parameter tuning will be helpful here in achieving optimal results and making a really powerful tool for bioinformaticians.

PLN

  • Language: advanced C++
  • Code: TBD
  • List: TBD
  • License: TBD

Probabilistic Logic Networks (PLN) are a novel conceptual, mathematical and computational approach to uncertain inference. In order to carry out effective reasoning in real-world circumstances, AI software must robustly handle uncertainty. However, previous approaches to uncertain inference do not have the breadth of scope required to provide an integrated treatment of the disparate forms of cognitively critical uncertainty as they manifest themselves within the various forms of pragmatic inference. Going beyond prior probabilistic approaches to uncertain inference, PLN is able to encompass within uncertain logic such ideas as induction, abduction, analogy, fuzziness and speculation, and reasoning about time and causality.

Probabilistic Logic

Implement and test intensional inference in PLN, and compare the results to data regarding human intensional inference (1)

Integrate MOSES-based supervised categorization into PLN, so that when PLN chaining hits a confusing point, it can launch MOSES to learn patterns in the members of the Atoms at the current end of the chain (which may then provide additional information useful in pruning). (1)

Cause PLN's backward and forward chaining inference to utilize history -- so that an inference step is more likely to be taken if similar steps have been taken in similar instances. (1)

AGI-Sim

AGISim is a 3D simulation world intended specifically for teaching Proto AGI Systems.

OpenSim

Modify AGISim so that it becomes, in effect, an OpenSim proxy for OpenCog. Modify the RealXTend user interface (for OpenSim) to support AI training/monitoring, as already prototyped in the AGISim client interface. Create a suite of testing/training tools and environments for AIs in OpenSim.

HypergraphDB

C++ port

Historically, HyperGraphDB's Java implementation started as a prototype for the "real" C++ version. The challenge is to make a blazing fast implementation that is compatible with the storage layout set forth by the Java version and that follows a similar architecture. Ideally, the C++ version should be used both for the short-term memory AtomTable and for long-term storage. It should be portable across platforms and easily updateable to a 64-bit version (at the moment, the underlying BerkeleyDB doesn't have a 64-bit version).

Distributed Version

Implement a distributed version of the database. The idea is to create a network of HyperGraphDB instances where data is not necessarily identical (though one should be able to configure nodes as replicas of each other). Then the network should be queryable as if it were a single instance. The challenge is to design and implement a protocol, possibly based on some standard knowledge exchange protocol using the JXTA P2P framework. The project can be phased, with only a simple version implemented initially, offering basic communication and the ability to aggregate query results from different instances.

  • Cluster TDB is a simple, mature distributed database, intended for high-speed, fault-tolerant high-availability cluster operation. It is used as the underlying database for the CIFS/Samba (Microsoft SMB) file system. It might be an ideal replacement for BerkeleyDB. CTDB hello world example, another example.
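
The simple first phase of the distributed version, aggregating query results from several instances so that the network answers like a single database, might look roughly like the sketch below. It is written in Python for brevity, although HyperGraphDB itself is Java; the `Instance` class, its `query` method and the in-memory data are hypothetical stand-ins and do not reflect the HyperGraphDB API or any network protocol.

```python
from concurrent.futures import ThreadPoolExecutor

class Instance:
    """Hypothetical stand-in for one database node holding a set of atoms."""

    def __init__(self, atoms):
        self.atoms = set(atoms)

    def query(self, predicate):
        """Return the local atoms that satisfy the predicate."""
        return {atom for atom in self.atoms if predicate(atom)}

def query_network(instances, predicate):
    """Send the same query to every instance and merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(lambda inst: inst.query(predicate), instances))
    # Taking the union removes duplicates that arise from replicated data.
    return set().union(*partial_results)

# Three nodes with partly overlapping contents (made-up data).
network = [
    Instance(["cat", "dog", "cow"]),
    Instance(["dog", "crow"]),
    Instance(["horse"]),
]
starts_with_c = query_network(network, lambda atom: atom.startswith("c"))
```

A real protocol would also have to handle unreachable nodes, partial answers and queries whose results span links stored on different instances.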

Query Optimization

Currently, HyperGraphDB is queried directly through its Java API. The query layer uses a few vanilla heuristics for better performance. The challenge is to develop and implement a query optimization algorithm tailored towards the specific data organization and usage patterns of HyperGraphDB.

Query Language

Implement a query language specific to HyperGraphDB. Some ideas, to be refined and expanded upon can be found here.

Graph Algorithms

Implement and thoroughly test graph algorithms such as spanning trees, subgraph matching etc. for HyperGraphDB.

OpenBiomind

OpenBiomind contains code for applying genetic programming to analyze gene expression microarray data and SNP data. This approach has been successfully used to learn diagnostic rules for cancer, Alzheimer's, Parkinson's and other diseases, as reflected in several publications.

Java GUI

Many biologists using microarrayers and other tools are command-line-phobic. A simple Java GUI (like Weka) could significantly improve OpenBiomind's usability and lower the barrier to entry. At a minimum, the Java GUI should allow users to select parameters, launch processes and manage pipelines.

Recursive feature selection

One important and interesting project involves recursive feature selection. OpenBiomind now contains innovative methods for finding the most important genes associated with a categorical (gene expression or SNP) dataset. One can try interpreting these important genes as a feature set, and then re-running the categorical analysis methods, presumably getting even higher accuracy. Lather, rinse, repeat. This may be a way of getting classification accuracies on gene expression and SNP datasets that beat current OpenBiomind results, which in turn beat all published results on many datasets. This is a good project for someone with an interest in exploring machine learning and supervised classification in a practical context. Experimentation could involve datasets on cancer, Alzheimer's Disease and calorie restriction which have already been analysed in the OpenBiomind system.
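
The "lather, rinse, repeat" loop can be sketched as follows: rank the features, keep the best fraction, re-run the classifier on the reduced set, and remember the subset with the best accuracy. The ranking score (spread of per-class means) and the nearest-centroid classifier with leave-one-out evaluation are simple stand-ins chosen for this example; they are not OpenBiomind's methods, and all names and data below are hypothetical.

```python
def class_means(samples, labels, features):
    """Per-class mean of every feature."""
    means = {}
    for y in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == y]
        means[y] = {f: sum(r[f] for r in rows) / len(rows) for f in features}
    return means

def rank_features(samples, labels, features):
    """Stand-in importance measure: spread of the per-class means."""
    means = class_means(samples, labels, features)
    def spread(f):
        values = [m[f] for m in means.values()]
        return max(values) - min(values)
    return sorted(features, key=spread, reverse=True)

def loo_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (s, y) in enumerate(zip(samples, labels)):
        means = class_means(samples[:i] + samples[i + 1:],
                            labels[:i] + labels[i + 1:], features)
        predicted = min(means, key=lambda c: sum((s[f] - means[c][f]) ** 2
                                                 for f in features))
        correct += predicted == y
    return correct / len(samples)

def recursive_select(samples, labels, features, keep_fraction=0.5):
    """Repeatedly shrink the feature set, keeping the best-scoring subset seen."""
    current = list(features)
    best, best_acc = current, loo_accuracy(samples, labels, current)
    while len(current) > 1:
        ranked = rank_features(samples, labels, current)
        current = ranked[:max(1, int(len(current) * keep_fraction))]
        acc = loo_accuracy(samples, labels, current)
        if acc >= best_acc:  # prefer the smaller set on ties
            best, best_acc = current, acc
    return best, best_acc

# Tiny made-up dataset: "g1" separates the classes, the rest is noise.
samples = [
    {"g1": 5.0, "g2": 1.0, "noise": 9.0},
    {"g1": 5.2, "g2": 1.1, "noise": 0.0},
    {"g1": 1.0, "g2": 1.2, "noise": 8.5},
    {"g1": 0.8, "g2": 0.9, "noise": 0.5},
]
labels = [1, 1, 0, 0]
selected, accuracy = recursive_select(samples, labels, ["g1", "g2", "noise"])
```

On real data the accuracy estimate should come from samples that played no part in the feature ranking, otherwise the reported improvement will be optimistic.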

Genetic profiling for predicting disease

The goal is to combine results from multiple SNP experiments to predict the real probability that a person with a given genetic profile will get a certain disease. 23andMe and a number of competing firms are utilizing tests that predict, for instance, a person's odds of getting prostate cancer based on their SNP profile. These predictions involve combination of results from various experiments done by various researchers. The way this combination is currently performed is quite crude and can be substantially improved by use of more sophisticated mathematical methods, which may be fruitfully done within the OpenBiomind framework. So this is a chance to implement some new bioinformatics that may have a real impact on how disease probabilities are assessed by the numerous commercial companies in this emerging space. Some understanding of genetics will be helpful here, as well as probability theory and Java coding skills.

MOSES Integration

MOSES is known to be generally more effective than GP, and so it's of interest to integrate MOSES with OpenBiomind, and explore its effectiveness at microarray and SNP data analysis. Some customization of MOSES as well as extensive parameter tuning will be helpful here in achieving optimal results and making a really powerful tool for bioinformaticians.

Neurobiological data analysis

Add the necessary data-type preprocessors (e.g. for MEG, fMRI, EEG and PET), analysis algorithm tweaks, documented methodologies and other facilities for analysis of neurobiological data. Many public neurobiological databases exist (falling under the umbrella of the Human Cognome Project), for example the Allen Brain Atlas.

OpenCog Framework

  • Language: advanced C++
  • Code: TBD
  • List: TBD
  • License: AGPLv3 + linking exception

Java Language Bindings

Using JNI, implement Java language bindings for the OpenCog Framework APIs, particularly the AtomSpace, and cogServer (scheduler), and subsequently for key AI modules like MOSES and PLN. The goal of creating Java bindings is to enable MindAgents to be written entirely in Java, with no knowledge of C++ necessary.

Python Language Bindings

Implement Python language bindings for the OpenCog Framework APIs, particularly the AtomSpace, and cogServer (scheduler), and subsequently for key AI modules like MOSES and PLN. The goal of creating Python bindings is to enable MindAgents to be written entirely in Python, with no knowledge of C++ necessary.

Attention Allocation

Implement economic attention allocation and test it as a means of controlling inference on knowledge derived from natural language (1)

Implement economic attention allocation on procedures and test it as a means of controlling inference regarding which actions to take in a simulation world (2)

Supercompilation

The Combo programming language, used to represent procedures internally in OpenCog, is elegant and highly appropriate for machine learning and automated reasoning. However, interpretation of Combo programs is not very fast. So there is a project to update the supercompilation approach for the new Combo interpreter, and tune it to work on complex Combo programs. This is important because in the long run we would like to write all OpenCog MindAgents in Combo, so that they can be adapted and understood by the system itself. This is a necessary prerequisite for making a strongly self-modifying OpenCog system. (Difficulty is 1-3 depending on how far you go.)

Multi-threading

Various work on many parts. Greater detail to follow soon.

Normalization of Weights across NodeTypes

Develop cross-node-type normalization generics for truth value weight/probability normalization, where the operator can choose to use heuristic approaches, a more compute-intensive conversion to explicit predicate logic form, or both. Also, after much experience, develop documented methodologies, examples, guidelines, and an attempt at a best practices categorization. This is a multi-project idea.

OpenCog Shell Enhancements

Add role-based access control and LDAP user authentication (for shell commands only; no deeper integration without much more forethought). Add a goosh-like interface. Add the ability to send commands to a remote instance with XML-RPC. Add the ability to transfer sets of atoms to/from a remote instance. Add the ability to synchronize sets of atoms to/from a remote instance. (Note: these last two potentially go much deeper than shell integration, but could be achieved through the shell module.)

OpenCog Prime

Projects related to building and testing the OpenCogPrime design for an AGI.

  • Language: advanced C++
  • Code: TBD
  • List: TBD
  • License: AGPLv3

Concept Formation

Implement conceptual blending as a heuristic for combining concepts, with a fitness function for a newly formed concept incorporating the quality of inferential conclusions derived from the concepts, and the quality of the MOSES classification rules learned using the concept. (1)

Implement "map formation," a process that uses frequent subgraph mining methods to find frequently co-occurring subnetworks of the AtomTable, and then embodies these as new nodes. This requires some extension and adaptation of current algorithms for frequent subgraph mining. It also requires functional attention allocation. (1)

Implement context formation, wherein an Atom C is defined as a context if it is important, and restricting other Atoms A to the context of C leads to conclusions that are significantly different from A's default truth value. (2)

OpenCog Collective

  • Language: advanced C++
    (until bindings are created for other languages, such as Java, Python, Ruby, Lisp, etc.)
  • License: any approved FOSS license

AI modules & MindAgents written for the OpenCog Framework, ranging from small stand-alone projects to system designs intended to address AGI.

NARS

Integrate NARS [ open-nars.googlecode.com | nars.wang.googlepages.com/ ] and OpenCog at various points, including the KR database, a standardized common input/output file format, and other areas of potential shared resources. Many approaches are possible, including wrapping NARS logic (Java) to function as an OpenCog MindAgent (C++) and making the two systems communicate with each other.

Statistics Gathering

Plan with the community and implement a statistics gathering and collection infrastructure, where 'statistics' means measurements of internal state indicators (i.e. 'feelings'), parameters, etc. This infrastructure could consist of a monitor MindAgent, which would ideally run in every instance of OpenCog, and a server (also an OpenCog instance) to collect data from multiple sources. Ideally, all development and production instances of OpenCog would voluntarily send a standard complement of data to a server run by the SIAI.

OpenCog Applications

Projects relating to using OpenCog Prime or OpenCog Collective components for specific applications.

  • Language: advanced C++
    (until bindings are created for other languages, such as Java, Python, Ruby, Lisp, etc.)
  • License: any approved FOSS license

Theorem Proving

Import Mizar into the OpenCog knowledge representation, and experiment with using PLN and data mining to inductively learn proof patterns. The subtle problem here is that Mizar is a very low-level, non-normalized representation of proofs. So it would be optimal to create a set of normalization rules mapping Mizar proof constructs into a more normalized form. This is a big job, and would best be prototyped in a fairly narrow domain like set theory. (3)

Toy Domains

Sokoban would be a good toy domain for experimenting with various OpenCog methods. So hooking up OpenCog to a simple Sokoban server would seem worthwhile. MOSES and PLN could be used for Sokoban on their own, but it's more interesting of course to take an integrative approach. (1)

Shultz's book "Computational Developmental Psychology" casts a number of Piagetian child-development learning tasks in a form suitable for computer algorithms. But his approach to solving the problems is fairly narrow-AI-ish, in that he custom-codes an algorithm for each problem. If there were one learning algorithm that, without specialized tuning, could handle all of them, that would be mighty nice. (1-2 depending on how deep you go.)

Robotics

Extend OpenCog to communicate with physical robots, using a toolkit like PyRo or something similar. As mechanisms for communicating with agents in virtual worlds will be provided, this is not a huge conceptual leap, but will doubtless lead to many practical complexities. (1-3 depending on how far you go)

Natural Language Processing

Use PLN to learn new grammatical rules. The simplest thing to do here is to use it to automatically assign link-parser link types to words based on studying their patterns of usage, using analogy to words whose link-parser link types are already known. (1)

Implement a Word Grammar parser that utilizes the link grammar dictionary, and utilizes PLN inference for semantic biasing of the parsing process. This gives a strong impression of being an approach to NL comprehension that is suitable for general intelligence. (2)

Extend the above Word Grammar approach to NL generation, extending our current link grammar approach to NL generation. (2)

Use PLN for semantic disambiguation and reference resolution, building on and generalizing current approaches that use statistical corpus analysis without semantic understanding. (2)

Use PLN to combine the hand-coded semantic normalization rules that exist in the RelExToFrame rule-base, to form new rules. Also, to generalize the rules to other words not currently covered by them, via combining the rules with semantically-based concept-similarity measures. (1)

Dialogue management: code an approach to controlling NL conversation in OpenCog (1)

Adaptive dialogue management: allow the system to use PLN and MOSES to adapt its conversation-control routines (2)