NodeBox

Create visual output with Python programming code

Linguistics

Description

With the Nodebox English Linguistics library you can do grammar inflection and semantic operations on English content. You can use the library to conjugate verbs, pluralize nouns, write out numbers, find dictionary descriptions and synonyms for words, summarise texts and parse grammatical structure from sentences.

The library bundles WordNet (using Oliver Steele's PyWordNet), NLTK, Damian Conway's pluralisation rules, Jason Wiener's Brill tagger and several algorithms adapted from Michael Granger's Ruby Linguistics module.

Download

linguistics.zip (13MB)
Last updated for NodeBox 1.0rc7.
Licensed under GPL

Documentation

 


How to get the library up and running

Put the en library folder in the same folder as your script so NodeBox can find the library. It takes some time to load all the data the first time.

import en

 


Categorise words as nouns, verbs, numbers and more

The is_number() command returns True when the given value is a number:

print en.is_number(12)
print en.is_number("twelve")
>>> True
>>> True

The is_noun() command returns True when the given string is a noun. You can also check for is_verb(), is_adjective() and is_adverb():

print en.is_noun("banana")
>>> True

The is_tag() command returns True when the given string is a tag, for example HTML or XML.

The is_html_tag() command returns True when the string is an HTML tag.

 


Guessing the (emotional) value of a word

The is_basic_emotion() command returns True if the given word expresses a basic emotion (anger, disgust, fear, joy, sadness, surprise):

print en.is_basic_emotion("cheerful")
>>> True

The is_persuasive() command returns True if the given word is a "magic" word (you, money, save, new, results, health, easy, ...):

print en.is_persuasive("money")
>>> True

The is_connective() command returns True if the word is a connective (nevertheless, whatever, secondly, ... and words like I, the, own, him which have little semantic value):

print en.is_connective("but")
>>> True

 


Converting between numbers and words

The number.ordinal() command returns the ordinal of the given number: 100 yields 100th, 3 yields 3rd and twenty-one yields twenty-first.

print en.number.ordinal(100)
print en.number.ordinal("twenty-one")
>>> 100th
>>> twenty-first
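
For plain integers the suffix rule is small enough to write out by hand. The following is a minimal standalone sketch, not the library's own code: the name ordinal_suffix is made up for illustration, and it only handles non-negative integers, not spelled-out numbers like twenty-one.

```python
def ordinal_suffix(n):
    """Return the non-negative integer n with its English ordinal suffix."""
    # 11, 12 and 13 (and 111, 212, ...) always take "th".
    if n % 100 in (11, 12, 13):
        suffix = "th"
    else:
        # Otherwise the last digit decides; anything but 1, 2, 3 takes "th".
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return str(n) + suffix
```

With this sketch, ordinal_suffix(100) gives 100th and ordinal_suffix(3) gives 3rd, matching the examples above.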

The number.spoken() command writes out the given number:

print en.number.spoken(25)
>>> twenty-five
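
To get a feel for what writing out a number involves, here is a small standalone sketch for integers from 0 to 999. It is an illustration only: the name spell_out and the word lists are assumptions, and the library's own wording for larger numbers (for example whether it inserts "and") may differ.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def spell_out(n):
    """Write out an integer between 0 and 999 in English words."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        # Compound numbers are hyphenated: 25 -> twenty-five.
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + spell_out(rest) if rest else "")
```

spell_out(25) returns twenty-five, like the number.spoken() example above.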

 


Quantification of numbers and lists

The number.quantify() command quantifies the given word:

print en.number.quantify(10, "chicken")
print en.number.quantify(800, "chicken")
>>> a number of chickens
>>> hundreds of chickens
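
Quantification boils down to mapping a count onto a band of vague quantities. The sketch below is standalone and hypothetical: the function name quantify_sketch and the band boundaries are assumptions chosen so that the examples on this page come out the same, not the thresholds the library actually uses. It also expects the noun to be pluralized already.

```python
# (upper bound, phrase) pairs; the boundaries are illustrative assumptions.
BANDS = [
    (2, "a pair of"),
    (9, "several"),
    (99, "a number of"),
    (999, "hundreds of"),
]

def quantify_sketch(count, plural_noun):
    """Describe count items with an approximate quantity phrase."""
    if count < 2:
        raise ValueError("this sketch only handles counts of 2 or more")
    for upper, phrase in BANDS:
        if count <= upper:
            return phrase + " " + plural_noun
    # Anything above the last band.
    return "thousands of " + plural_noun
```

quantify_sketch(10, "chickens") returns a number of chickens and quantify_sketch(800, "chickens") returns hundreds of chickens.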

The list.conjunction() command quantifies a list of words. Notice how goose is correctly pluralized and duck has the right article.

farm = ["goose", "goose", "duck", "chicken", "chicken", "chicken"]
print en.list.conjunction(farm)
>>> several chickens, a pair of geese and a duck
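
The idea behind this is to count each distinct word, describe each count, and join the descriptions with commas and a final "and". The following standalone sketch shows one way that could work. Everything in it is an assumption made for illustration: the name conjunction_sketch, the tiny PLURALS lookup (standing in for real pluralization) and the quantity phrases.

```python
from collections import Counter

# Hypothetical lookup standing in for a real pluralizer.
PLURALS = {"goose": "geese", "chicken": "chickens", "duck": "ducks"}

def describe(word, count):
    """Turn a word and its count into a short phrase."""
    if count == 1:
        article = "an" if word[0] in "aeiou" else "a"
        return article + " " + word
    if count == 2:
        return "a pair of " + PLURALS[word]
    return "several " + PLURALS[word]

def conjunction_sketch(words):
    """Summarize a list of words, most frequent first."""
    counts = Counter(words)
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    phrases = [describe(word, count) for word, count in ranked]
    if not phrases:
        return ""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]

farm = ["goose", "goose", "duck", "chicken", "chicken", "chicken"]
summary = conjunction_sketch(farm)
# summary == "several chickens, a pair of geese and a duck"
```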

You can also quantify the types of things in the given list, class or module:

print en.list.conjunction((1,2,3,4,5), generalize=True)
print en.list.conjunction(en, generalize=True)
>>> several integers
>>> a number of modules, a number of functions, a number of strings,
>>> a pair of lists, a pair of dictionaries, an en verb, an en sentence,
>>> an en number, an en noun, an en list, an en content, an en adverb,
>>> an en adjective, a None type and a DrawingPrimitives Context

 


Indefinite article: a or an

The noun.article() command returns the noun with its indefinite article:

print en.noun.article("university")
print en.noun.article("owl")
print en.noun.article("hour")
>>> a university
>>> an owl
>>> an hour
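
Choosing between a and an depends on the sound a word starts with, not the letter, which is why university and hour are interesting cases. Below is a rough standalone sketch of that idea. The name with_article and the two short exception patterns are assumptions for illustration; they cover only a handful of words and are not the rules the library uses.

```python
import re

# Words starting with a silent "h" take "an" (illustrative, incomplete).
SILENT_H = re.compile(r"(hour|honest|honou?r|heir)", re.I)
# Vowel letters that sound like a consonant take "a" (illustrative, incomplete).
CONSONANT_SOUND = re.compile(r"(uni|use|eu|one)", re.I)

def with_article(noun):
    """Prefix a non-empty noun with 'a' or 'an'."""
    if SILENT_H.match(noun):
        article = "an"
    elif CONSONANT_SOUND.match(noun):
        article = "a"
    elif noun[0].lower() in "aeiou":
        article = "an"
    else:
        article = "a"
    return article + " " + noun
```

The exception patterns are checked before the plain vowel test, so with_article("university") gives a university while with_article("owl") gives an owl.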

 


Pluralization of nouns

The noun.plural() command pluralizes the given noun:

print en.noun.plural("child")
print en.noun.plural("kitchen knife")
print en.noun.plural("wolf")
print en.noun.plural("part-of-speech")
>>> children
>>> kitchen knives
>>> wolves
>>> parts-of-speech

You can also do adjective.plural().

An optional classical parameter is True by default and determines whether classical or modern inflection is used (e.g. classical pluralization of octopus yields octopodes instead of octopuses).
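
Rule-based pluralization usually works by checking a list of exceptions first and then trying suffix rules in order. The sketch below illustrates that approach on a very small scale. It is standalone and heavily simplified: the name plural_sketch, the lookup tables and the four suffix rules are assumptions for illustration and come nowhere near the coverage of the rules bundled with the library.

```python
import re

IRREGULAR = {"child": "children", "goose": "geese", "man": "men"}
CLASSICAL = {"octopus": "octopodes"}
# (pattern, replacement) pairs, tried in order.
RULES = [
    (r"([^aeiou])y$", r"\1ies"),    # city -> cities
    (r"(s|x|z|ch|sh)$", r"\1es"),   # box -> boxes
    (r"lf$", "lves"),               # wolf -> wolves
    (r"ife$", "ives"),              # knife -> knives
]

def plural_sketch(noun, classical=True):
    """Pluralize a noun using a few illustrative rules."""
    # In "X-of-Y" compounds only the head inflects.
    if "-of-" in noun:
        head, rest = noun.split("-of-", 1)
        return plural_sketch(head, classical) + "-of-" + rest
    # In space-separated compounds only the last word inflects.
    if " " in noun:
        first, last = noun.rsplit(" ", 1)
        return first + " " + plural_sketch(last, classical)
    if classical and noun in CLASSICAL:
        return CLASSICAL[noun]
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    for pattern, replacement in RULES:
        if re.search(pattern, noun, flags=re.I):
            return re.sub(pattern, replacement, noun, flags=re.I)
    return noun + "s"
```

With these rules, plural_sketch("kitchen knife") gives kitchen knives, and plural_sketch("octopus", classical=False) falls through to the suffix rules and gives octopuses.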

 


Emotional value of a word

The noun.is_emotion() command guesses whether the given noun expresses an emotion by checking if there are synonyms of the word that are basic emotions. By default it returns True or False.

print en.noun.is_emotion("anger")
>>> True

Or you can return a string which provides some information with the boolean=False parameter.

print en.adjective.is_emotion("anxious", boolean=False)
>>> fear

An additional optional parameter shallow=True speeds up the lookup process but doesn't check as many synonyms. You can also use verb.is_emotion(), adjective.is_emotion() and adverb.is_emotion().

 


WordNet glossary, synonyms, antonyms, components

WordNet describes semantic relations between synonym sets.

The noun.gloss() command returns the dictionary description of a word:

print en.noun.gloss("book")
>>> a written work or composition that has been published (printed on pages
>>> bound together); "I am reading a good book on economics"

A word can have multiple senses, for example "tree" can mean a tree in a forest but also a tree diagram, or a person named Sir Herbert Beerbohm Tree:

print en.noun.senses("tree")
>>> [['tree'], ['tree', 'tree diagram'], ['Tree', 'Sir Herbert Beerbohm Tree']]
print en.noun.gloss("tree", sense=1)
>>> a figure that branches from a single root; "genealogical tree"

The noun.lexname() command returns a categorization for the given word:

print en.noun.lexname("book")
>>> communication

The noun.hyponym() command returns examples of the given word:

print en.noun.hyponym("vehicle")
>>> [['bumper car', 'Dodgem'], ['craft'], ['military vehicle'], ['rocket',
>>> 'projectile'], ['skibob'], ['sled', 'sledge', 'sleigh'], ['steamroller',
>>> 'road roller'], ['wheeled vehicle']]
print en.noun.hyponym("tree", sense=1)
>>> [['cladogram'], ['stemma']]

The noun.hypernym() command returns abstractions of the given word:

print en.noun.hypernym("earth")
print en.noun.hypernym("earth", sense=1)
>>> [['terrestrial planet']]
>>> [['material', 'stuff']]

You can also execute a deep query on hypernyms and hyponyms. Notice how returned values become more and more abstract:

print en.noun.hypernyms("vehicle", sense=0)
>>> [['vehicle'], ['conveyance', 'transport'],
>>> ['instrumentality', 'instrumentation'],
>>> ['artifact', 'artefact'], ['whole', 'unit'],
>>> ['object', 'physical object'],
>>> ['physical entity'], ['entity']]

The noun.holonym() command returns components of the given word:

print en.noun.holonym("computer")
>>> [['busbar', 'bus'], ['cathode-ray tube', 'CRT'],
>>> ['central processing unit', 'CPU', 'C.P.U.', 'central processor',
>>> 'processor', 'mainframe'] ...

The noun.meronym() command returns the collection in which the given word can be found:

print en.noun.meronym("tree")
>>> [['forest', 'wood', 'woods']]

The noun.antonym() command returns the semantic opposite of the word:

print en.noun.antonym("black")
>>> [['white', 'whiteness']]

Find out what two words have in common:

print en.noun.meet("cat", "dog", sense1=0, sense2=0)
>>> [['carnivore']]

The noun.absurd_gloss() command returns an absurd description for the word:

print en.noun.absurd_gloss("typography")
>>> a business deal on a trivial scale

The return value of a WordNet command is usually a list containing other lists of related words. You can use the en.list.flatten() command to flatten the list:

print en.list.flatten(en.noun.senses("tree"))
>>> ['tree', 'tree', 'tree diagram', 'Tree', 'Sir Herbert Beerbohm Tree']
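
Flattening a nested list is a short recursive routine. The sketch below is a standalone illustration of the idea, using the senses of "tree" shown above as sample data. The name flatten_sketch is made up, and this is not necessarily how en.list.flatten() is written.

```python
def flatten_sketch(nested):
    """Return a flat list of all items in arbitrarily nested lists."""
    flat = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            # Recurse into sublists and splice their items in place.
            flat.extend(flatten_sketch(item))
        else:
            flat.append(item)
    return flat

senses = [["tree"], ["tree", "tree diagram"],
          ["Tree", "Sir Herbert Beerbohm Tree"]]
words = flatten_sketch(senses)
# words == ['tree', 'tree', 'tree diagram', 'Tree', 'Sir Herbert Beerbohm Tree']
```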

If you want a list of all nouns/verbs/adjectives/adverbs, there are the wordnet.all_nouns(), wordnet.all_verbs() ... commands:

print len(en.wordnet.all_nouns())
>>> 117096

All of the commands shown here for nouns are also available for verbs, adjectives and adverbs: en.verb.hypernyms("run"), en.adjective.gloss("beautiful") and so on are valid commands.

 


Verb conjugation

NodeBox English Linguistics knows the verb tenses for about 10000 English verbs.

The verb.infinitive() command returns the infinitive form of a verb:

print en.verb.infinitive("swimming")
>>> swim

The verb.present() command returns the present tense for the given person. Known values for person are 1, 2, 3, "1st", "2nd", "3rd", "plural", "*". Just use the one you like most.

print en.verb.present("gave")
print en.verb.present("gave", person=3, negate=False)
>>> give
>>> gives

The verb.present_participle() command returns the present participle tense:

print en.verb.present_participle("be")
>>> being

Return the past tense:

print en.verb.past("give")
print en.verb.past("be", person=1, negate=True)
>>> gave
>>> wasn't

Return the past participle tense:

print en.verb.past_participle("be")
>>> been

A list of all possible tenses:

print en.verb.tenses()
>>> ['past', '3rd singular present', 'past participle', 'infinitive',
>>> 'present participle', '1st singular present', '1st singular past',
>>> 'past plural', '2nd singular present', '2nd singular past',
>>> '3rd singular past', 'present plural']

The verb.tense() command returns the tense of the given verb:

print en.verb.tense("was")
>>> 1st singular past

Return True if the given verb is in the given tense:

print en.verb.is_tense("wasn't", "1st singular past", negated=True)
print en.verb.is_present("does", person=1)
print en.verb.is_present_participle("doing")
print en.verb.is_past_participle("done")
>>> True
>>> False
>>> True
>>> True

 


Shallow parsing, the grammatical structure of a sentence

NodeBox English Linguistics is able to do sentence structure analysis using a combination of Jason Wiener's tagger and NLTK's chunker. The tagger assigns a part-of-speech tag to each word in the sentence using a (Brill's) lexicon. A postag is something like NN or VBP marking words as nouns, verbs, determiners, pronouns, etc. The chunker is then able to group syntactic units in the sentence. A syntactic unit is, for example, a determiner followed by adjectives followed by a noun: the tasty little chicken is a syntactic unit.

The sentence.tag() command tags the given sentence. The return value is a list of (word, tag) tuples. However, when you print it out it will look like a string.

print en.sentence.tag("this is so cool")
>>> this/DT is/VBZ so/RB cool/JJ
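
If you only have the printed word/TAG form and want (word, tag) tuples back, or the other way around, the conversion is a one-liner each way. The helpers below are a standalone sketch; parse_tagged and format_tagged are hypothetical names and not part of the library.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) tuples."""
    # rsplit on the last slash so words containing "/" keep their slash.
    return [tuple(token.rsplit("/", 1)) for token in text.split()]

def format_tagged(pairs):
    """Join (word, tag) tuples back into the 'word/TAG' notation."""
    return " ".join(word + "/" + tag for word, tag in pairs)

pairs = parse_tagged("this/DT is/VBZ so/RB cool/JJ")
# pairs == [('this', 'DT'), ('is', 'VBZ'), ('so', 'RB'), ('cool', 'JJ')]
```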

There are lots of part-of-speech tags and it takes some time getting to know them. The full list is here. The sentence.tag_description() command returns a (description, examples) tuple for a given tag:

print en.sentence.tag_description("NN")
>>> ('noun, singular or mass', 'tiger, chair, laughter')

The sentence.chunk() command returns the chunked sentence:

from pprint import pprint
pprint( en.sentence.chunk("we are going to school") )
>>> [['SP',
>>>   ['NP', ('we', 'PRP')],
>>>   ['AP',
>>>    ['VP', ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO')],
>>>    ['NP', ('school', 'NN')]]]]

Now what does all this mean?

A handy sentence.traverse(sentence, cmd) command lets you feed a chunked sentence to your own command chunk by chunk:

s = "we are going to school"
def callback(chunk, token, tag):
    # Called once per chunk label and once per (token, tag) pair.
    if chunk is not None:
        print en.sentence.tag_description(chunk)[0].upper()
    if chunk is None:
        print token, "("+en.sentence.tag_description(tag)[0]+")"
en.sentence.traverse(s, callback)
>>> SUBJECT PHRASE
>>> NOUN PHRASE
>>> we (pronoun, personal)
>>> VERB PHRASE AND ARGUMENTS
>>> VERB PHRASE
>>> are (verb, non-3rd person singular present)
>>> going (verb, gerund or present participle)
>>> to (infinitival to)
>>> NOUN PHRASE
>>> school (noun, singular or mass)
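
To see how a traversal like this could walk the nested structure returned by sentence.chunk(), here is a standalone sketch. It is an assumption about the general approach, not the library's source: traverse_sketch is a made-up name, and the chunked sentence is copied by hand from the pprint output above. Each chunk is a list whose first item is the chunk label, and each word is a (token, tag) tuple.

```python
CHUNKED = [["SP",
            ["NP", ("we", "PRP")],
            ["AP",
             ["VP", ("are", "VBP"), ("going", "VBG"), ("to", "TO")],
             ["NP", ("school", "NN")]]]]

def traverse_sketch(nodes, callback):
    """Call callback(chunk, token, tag) for every chunk and word, depth first."""
    for node in nodes:
        if isinstance(node, list):
            # A chunk: report its label, then descend into its children.
            callback(node[0], None, None)
            traverse_sketch(node[1:], callback)
        else:
            # A word: report the token and its part-of-speech tag.
            token, tag = node
            callback(None, token, tag)

events = []
traverse_sketch(CHUNKED, lambda chunk, token, tag: events.append(chunk or token))
# events == ['SP', 'NP', 'we', 'AP', 'VP', 'are', 'going', 'to', 'NP', 'school']
```

The order of events is the same as the order of the lines printed by the callback example above.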

Finally, if you feel up to it, you could feed the following command with a list of your own regular expression units to chunk. Mine are pretty basic, as I'm not a linguist.

print en.sentence.chunk_rules()

 


Summarisation of text to keywords

NodeBox English Linguistics is able to extract keywords from a given text.

en.content.keywords(txt, top=10, nouns=True, singularize=True, filters=[])

The content.keywords() command guesses a list of words that frequently occur in the given text. The return value is a list (length defined by top) of (count, word) tuples. When nouns is True, returns only nouns. The command furthermore ignores connectives, numbers and tags. When singularize is True, attempts to singularize nouns in the text. The optional filters parameter is a list of words which the command should ignore.
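
The counting part of this is easy to picture: drop tags, split the text into words, throw away the words to ignore, and count what is left. The sketch below is a standalone illustration of just that part. The name keywords_sketch and the short STOPWORDS set are assumptions standing in for the library's connectives list, and unlike content.keywords() it does not detect nouns or singularize anything.

```python
import re
from collections import Counter

# Tiny stand-in for a real list of connectives (illustrative only).
STOPWORDS = {"the", "a", "an", "and", "of", "but", "to", "in", "is"}

def keywords_sketch(text, top=10, filters=()):
    """Return up to `top` (count, word) tuples for frequent words in text."""
    text = re.sub(r"<[^>]+>", " ", text)          # remove HTML/XML tags
    words = re.findall(r"[a-z]+", text.lower())   # letters only, so numbers drop out
    ignore = STOPWORDS | set(word.lower() for word in filters)
    counts = Counter(word for word in words if word not in ignore)
    return [(count, word) for word, count in counts.most_common(top)]

sample = "<p>The war and the funeral. A funeral service; the funeral of 3 soldiers.</p>"
best = keywords_sketch(sample, top=1)
# best == [(3, 'funeral')]
```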

So, assuming you would want to summarise web content you can do the following:

from urllib import urlopen
html = urlopen("http://news.bbc.co.uk/").read()
meta = ["news", "health", "uk", "version", "weather",
        "video", "sport", "return", "read", "help"]
print en.content.keywords(html, filters=meta)
>>> [(6, 'funeral'), (5, 'beirut'), (3, 'war'), (3, 'service'), (3, 'radio'),
>>> (3, 'lebanon'), (3, 'islamist'), (3, 'function'), (3, 'female')]