Modeling common sense with ConceptNet

To understand semantic connections between words in human language (e.g. what do apples have in common with oranges?), AI researchers have explored a wide range of strategies over the past decades. Ontologies such as WordNet were constructed to represent knowledge about the meaning of words; in a broader sense, these ontologies aim to represent the structure of the world itself. Such databases have supported classic AI applications like question-answering systems.

ConceptNet is a member of this family of semantic databases, but it takes a ‘fuzzier’ approach. It does not (necessarily) attempt to paint a true and scientifically valid picture of the world, but rather to represent the world as human subjects tend to understand it. As such, it seeks to model common sense. Its descriptions may sometimes be conflicting or paradoxical, but a typical human tends to agree more often with statements derived from ConceptNet than with those derived from other, more formal databases (see the original paper).

Some example statements which can be found inside ConceptNet are:

candle → AtLocation → church
candle → UsedFor → set mood
candle → CapableOf → fire hazard
cheese → RelatedTo → mozzarella
cheese → HasProperty → smelly
silence → DistinctFrom → noise
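
Each such statement is essentially a directed, labeled edge from one concept to another. As a minimal illustration, an assertion of this kind could be modeled in Java as follows (the class and field names here are our own, chosen for this sketch, and not taken from ConceptNet or our library):

// A ConceptNet-style assertion as a labeled edge between two concepts.
// Class and field names are illustrative only.
public final class Assertion {
   final String start;    // e.g. "/c/en/candle"
   final String relation; // e.g. "/r/AtLocation"
   final String end;      // e.g. "/c/en/church"

   Assertion(String start, String relation, String end) {
      this.start = start;
      this.relation = relation;
      this.end = end;
   }

   @Override
   public String toString() {
      return start + " → " + relation + " → " + end;
   }
}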

Word embeddings

Semantic databases have an obvious limitation: they can never hold sufficient knowledge about every single domain of interest. A more recent approach to modeling word relations is word vector representations (also called word embeddings). In this approach, word relations are learned from a large text corpus, primarily through so-called skip-gram analysis, which examines how frequently words occur in each other's proximity. The first widely successful algorithm of this kind was developed at Google and is called word2vec [https://en.wikipedia.org/wiki/Word2vec]. Because this is an unsupervised learning technique, far less human labor is needed to build these databases. Similarity between words is produced as a number, no longer as a list of labeled relationships. The successes of word2vec and its descendants have been significant, and they have helped solve several long-standing problems in natural language processing. Notably, the word2vec approach allows an AI system to make an educated guess at the meaning of words it has never even seen: a novel word that appears in the same contexts as known color words, for example, is probably itself a color.
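
As a minimal sketch of what ‘similarity as a number’ means, the cosine similarity of two word vectors can be computed in a few lines of Java. The three-dimensional vectors below are invented for the example; real embeddings typically have hundreds of dimensions:

// Toy cosine similarity between two word vectors. The vectors are
// made-up three-dimensional stand-ins; real embeddings are much larger.
public class CosineSimilarityDemo {
   static double cosineSimilarity(double[] a, double[] b) {
      double dot = 0, normA = 0, normB = 0;
      for (int i = 0; i < a.length; i++) {
         dot += a[i] * b[i];
         normA += a[i] * a[i];
         normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
   }

   public static void main(String[] args) {
      double[] apple = {0.8, 0.1, 0.3};
      double[] orange = {0.6, 0.4, 0.2};
      System.out.println(cosineSimilarity(apple, orange)); // a value between -1 and 1
   }
}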

There is a drawback to relying solely on vector representations of words: they reduce your ability to explain relations inside your dataset in a manner that is easy for humans to understand. An embedding can tell us that ‘politician’ is semantically close to ‘power’, but not what kind of relation connects them, namely, in ConceptNet's vocabulary, Desires. While word2vec does allow for some level of human-like explanation via the automatic production of analogies, the knowledge encoded inside ConceptNet is much more fine-grained. That is why much state-of-the-art work in natural language processing today, and indeed in artificial intelligence in general, tries to combine symbolic (model-based) intelligence with distributed (neural-based) intelligence.

For example, let’s say we want to investigate the semantic relationship between two concepts in natural language. We could express it numerically as cosine similarity (a measure of the geometric similarity between two vectors), or we could explore the graph structure of ConceptNet.

Apple has a cosine similarity of 0.346 with orange, which places them at a fair distance.

Using the path search function we designed for our Java ConceptNet exploration library, we get much more informative results:

apple → Antonym → orange
apple → HasA → peel, peel → RelatedTo → orange
apple → HasProperty → red, red → RelatedTo → orange
apple → RelatedTo → ball, ball → RelatedTo → orange
apple → IsA → fruit, fruit → RelatedTo → orange

ConceptNet shows that, while apples and oranges are regarded as very distinct (even antonyms), they share the property of being peelable fruit and, when the apple is red, a similar color and geometric shape. Curiously enough, the shared property of edibility is not explicitly produced by the ConceptNet graph searches in this example.
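
In code, such a query might look roughly as follows. Note that the findPaths method and Path type below are hypothetical names used for illustration; the actual path search API is described in the documentation linked under Software.

// Hypothetical sketch of a path query; findPaths and Path are illustrative
// names, not necessarily the library's actual API (see the documentation).
Concept apple = new Concept("/c/en/apple");
Concept orange = new Concept("/c/en/orange");
for (Path path : conceptnet.findPaths(apple, orange)) {
   System.out.println(path); // e.g. "apple → IsA → fruit, fruit → RelatedTo → orange"
}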

Path searches for seemingly distinct words often produce informative and amusing results. Querying with the word moon as the source and the word blood as the target yields the following semantic connection:

moon → RelatedTo → phase, phase → RelatedTo → period, period → RelatedTo → blood

ConceptNet has provided us with a link between the Earth's moon and the menstrual cycle of women via blood. This connection is both sensible and poetic. (In tribal societies, the menstrual cycle of women is often said to run parallel to the phases of the moon.)

Combining resources

The most recent releases of ConceptNet include pre-trained word embeddings for a large segment of its database. This word embedding component is called Numberbatch. The ConceptNet/Numberbatch approach lets you combine the above-mentioned methods of semantic representation, or choose between them, while remaining within a single framework.

The path search function in the ConceptNet Java library is extremely fast, allowing graph-wide searches in under a second. Because we construct the entire graph in memory when the application starts, this speed does come at the cost of a significant boot-up time when using the library.
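
For illustration, a start-up sequence could look like this. The constructor taking a path to the downloaded ConceptNet dump, and the file name itself, are assumptions made for this sketch; consult the Github documentation for the actual initialization steps.

// Hypothetical start-up sketch: the whole graph is built in memory once,
// after which queries are fast. Constructor and file name are assumptions.
long start = System.currentTimeMillis();
ConceptNet conceptnet = new ConceptNet("conceptnet-assertions-5.5.csv"); // slow one-time load
System.out.println("Graph loaded in " + (System.currentTimeMillis() - start) + " ms");
// From here on, path searches run against the in-memory graph in under a second.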

The ConceptNet database is published under a Creative Commons license. Our Java library is free software and distributed under a BSD license. InterTextueel is not yet using ConceptNet for its own services, but we may do so in the future. We are currently in the process of building unsupervised text mining tools which extract descriptive relations between key concepts in a dataset. Essentially, those relations will produce graph-like structures similar to ConceptNet, which serves as an inspiration. We would be happy to experiment with running our tools on public datasets like Wikipedia, or on anonymized Twitter datasets, and contribute what is useful back into ConceptNet by replicating its format.

Software

Download the binary JAR file here (and follow the installation/usage instructions on the Github link below).
Head to Github to find the most recent source code of the Java library.
You will need to manually download ConceptNet 5.5 to use the library. It is available here.

The Java documentation can be found here.

Code snippet for apples and oranges:

// Look up the two concepts by their ConceptNet URIs.
Concept apple = new Concept("/c/en/apple");
Concept orange = new Concept("/c/en/orange");
// 'conceptnet' is an initialized instance of the library's graph.
// getCosineSimilarity returns a java.util.Optional, so check for presence.
Optional<Double> similarity = conceptnet.getCosineSimilarity(apple, orange);
if (similarity.isPresent()) {
   System.out.println("Similarity between apple and orange is: " + similarity.get());
}

About us

InterTextueel provides robust and tailored text classification services, mainly to companies and organisations in the Netherlands. We can assist with topic modeling and other Natural Language Processing projects, big or small.