2013-02-26

NLTK 1.3 - Computing with Language: Simple Statistics

Most of the remaining parts of the first chapter of the NLTK book serve as an introduction to Python in the context of text processing. I won't translate that to Lisp, because there are much better resources explaining how to use Lisp properly. First and foremost I'd refer anyone interested to the appropriate chapters of Practical Common Lisp.

It's only worth noting that Lisp has a different notion of lists than Python. Lisp's lists are linked lists, while Python's are essentially vectors. Lisp also has vectors as a separate data structure, as well as multidimensional arrays (something Python mostly lacks). And the set of Lisp's list operations is somewhat different from Python's. The list is the default sequence data structure, but you should understand its limitations and know when to switch to vectors (namely, when you have a lot of elements and often access them at random). Also, Lisp doesn't provide Python-style syntactic sugar for slicing and dicing lists, although all the operations are there in the form of functions. The only thing which isn't easily reproducible in Lisp is assigning to a slice:

>>> sent[1:9] = ['Second', 'Third']
>>> sent
['First', 'Second', 'Third', 'Last']

There's replace, but it can't shrink a sequence:

CL-USER> (defvar sent (list 1 2 3 4 5 6 7 8 9 0))
CL-USER> (replace sent '("Second" "Third") :start1 1 :end1 9)
(1 "Second" "Third" 4 5 6 7 8 9 0)

Ngrams

So, the only part worth discussing here is statistics.

Let's start with a frequency distribution. We have already used something similar in the previous part for text generation, but it was very basic and tailored to the task. Now, it's time to get into some serious language modeling and discuss a more general-purpose implementation.

Such modeling is accomplished by collecting large amounts of statistical data about the occurrence of words and their sequences in texts. These sequences are called ngrams. In a nutshell, you can think of an ngrams distribution as a table mapping ngram sequences to numbers. For example, in the phrase "the cat sat on the mat" the bigram ("the" "cat") occurs once, while the unigram ("the") occurs twice.

(defclass ngrams ()
  ((order :initarg :order :reader ngrams-order)
   (count :reader ngrams-count)
   (max-freq :reader ngrams-max-freq)
   (min-freq :reader ngrams-min-freq)
   (total-freq :reader ngrams-total-freq)))

The crucial parameter of this class is order, which defines the length of the sequences. In practice, ngrams of orders 1 to 5 are the ones commonly used.

ngrams is an abstract class. In Lisp you don't have to specify this property explicitly; you just don't implement methods for it. The simplest ngrams implementation, table-ngrams, uses an in-memory hash-table as its store. You can get ngram frequency and "probability" (the maximum likelihood estimate) from it, as well as the log of the probability, which is used more often in calculations because it avoids the floating-point rounding errors that occur when multiplying probabilities, which are rather small values.

NLTK> (freq (text-bigrams *moby*) "The whale")
Indexing bigrams...
Number of bigrams: 116727
14
NLTK> (logprob (text-bigrams *moby*) "The whale")
-14.255587
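
Under the hood these accessors can be little more than a hash-table lookup and a division. Here's a minimal sketch, not the actual CL-NLP code: the reader names, the base-2 log, and the assumption that the ngram argument is already in the same form as the hash-table keys are all mine.

(defclass table-ngrams (ngrams)
  ((table :initarg :table :reader ngrams-table)))

(defmethod freq ((ngrams table-ngrams) ngram)
  "Raw count of NGRAM (0 if it wasn't seen)."
  (gethash ngram (ngrams-table ngrams) 0))

(defmethod prob ((ngrams table-ngrams) ngram)
  "Maximum likelihood estimate: count divided by the total count."
  (/ (freq ngrams ngram)
     (ngrams-total-freq ngrams)))

(defmethod logprob ((ngrams table-ngrams) ngram)
  "Log of the probability; sums of logs replace products of small
   numbers and so avoid floating-point underflow."
  (log (float (prob ngrams ngram)) 2))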

So how do we get the bigrams of Moby Dick? For that we just have to count all of them in the text (this is a simplified version; some additional processing for sentence starts/ends is needed):

(defun index-ngrams (order words &key ignore-case)
  (make 'table-ngrams :order order
        :table
        (let ((ht (make-hash-table :test (if ignore-case 'equalp 'equal))))
          (do ((tail words (rest tail)))
               ((shorter? tail order))
            (incf (get# (if (= order 1)
                            (car tail)
                            (sub tail 0 order))
                        ht 0)))
          ht)))

table-ngrams will be useful for simple experimentation and prototyping, as in our NLTK examples.

NLTK> (defvar *1grams* (text-ugrams *moby*))
Indexing unigrams...
Number of unigrams: 1924
NLTK> (freq *1grams* "whale")
906
NLTK> (take 50 (vocab *1grams* :order-by '>))
("," "the" "<S>" "</S>" "." "of" "and" "-" "a" "to" ";" "in" "\"" "that" "'"
 "his" "it" "I" "!" "s" "is" "he" "with" "was" "as" "all" "for" "this" "at"
 "by" "but" "not" "him" "from" "be" "on" "?" "so" "whale" "one" "you" "had"
 "have" "there" "But" "or" "were" "now" "which" "me")

The strings "<S>" and "</S>" here denote special symbols for sentence start and end.

Here's a cumulative plot of them:

Cumulative Frequency Plot for 50 Most Frequent Words in Moby Dick

And here's just the counts graph:

Frequency Plot for 50 Most Frequent Words in Moby Dick

And, finally, here are the hapaxes (words that occur in the text only once):

NLTK> (take 50 (hapaxes (text-ugrams *moby*)))
("orphan" "retracing" "sheathed" "padlocks" "dirgelike" "Buoyed" "liberated"
 "Till" "Ixion" "closing" "suction" "halfspent" "THEE" "ESCAPED" "ONLY"
 "Epilogue" "thrill" "etherial" "intercept" "incommoding" "tauntingly"
 "backwardly" "coincidings" "ironical" "intermixingly" "whelmings" "inanimate"
 "animate" "lookouts" "infatuation" "Morgana" "Fata" "gaseous" "mediums"
 "bewildering" "bowstring" "mutes" "voicelessly" "THUS" "grapple"
 "unconquering" "comber" "foregone" "bullied" "uncracked" "unsurrendered"
 "Diving" "flume" "dislodged" "buttress")

The next Python feature showcased here is list comprehensions. The idea behind them is to resemble set-theoretic notation in list definitions. There's no such thing out of the box in Lisp (although you can implement a variant even closer to set notation in just 24 lines), and the general approach is to favor functional-style filtering with variants of map and remove-if.

NLTK> (sort (remove-if #`(< (length %) 15)
                       (uniq (text-words *moby*)))
            'string<)
("CIRCUMNAVIGATION" "Physiognomically" "apprehensiveness" "cannibalistically" "characteristically" "circumnavigating" "circumnavigation" "circumnavigations" "comprehensiveness" "hermaphroditical" "indiscriminately" "indispensableness" "irresistibleness" "physiognomically" "preternaturalness" "responsibilities" "simultaneousness" "subterraneousness" "supernaturalness" "superstitiousness" "uncomfortableness" "uncompromisedness" "undiscriminating" "uninterpenetratingly")

NLTK> (sort (remove-if #`(or (<= (length %) 7)
                             (<= (freq (text-ugrams *chat*) %) 7))
                       (vocab (text-ugrams *chat*)))
            'string<)
("20sUser104" <... another 130 users ...> "Question" "actually" "anything" "computer" "everyone" "football" "innocent" "listening" "remember" "seriously" "something" "talkcity_adults" "thinking" "together" "watching")

In the NLTK variant all the usernames are removed from the corpus with some pre-processing.

Language Modeling

But to be useful for real-world scenarios ngrams have to be large, really large (on the order of tens of gigabytes of data for trigrams). This means that you won't be able to simply store them in memory and will have to use some external storage: either a general-purpose data store, like a relational database, or special-purpose software.

One such ngrams service available on the internet is Microsoft Web N-gram Services. If you have a developer token you can query it over HTTP. The service only returns log-probabilities (and log-conditional-probabilities) and runs really slowly, but it is capable of serving batch requests, i.e. returning probabilities for several ngrams at once. The implementation of the ngrams interface for this service is provided in contrib/ms-ngrams.lisp.

We have already encountered conditional probabilities in the previous part. They have the following relationship with regular (so-called "joint") probabilities (shown here for bigrams):

p(A,B) = p(B|A) * p(A)
where p(A,B) is the joint probability and p(B|A) is the conditional one
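
In code this boils down to a ratio of counts taken from two ngrams objects of adjacent orders. A minimal sketch for the bigram case (the function name and signature here are mine, not CL-NLP's):

(defun cond-prob-2g (bigrams unigrams bigram)
  "Maximum likelihood P(B|A) = count(A B) / count(A),
   where BIGRAM is the list (A B)."
  (let ((prefix-count (freq unigrams (first bigram))))
    (if (zerop prefix-count)
        0
        (/ (freq bigrams bigram) prefix-count))))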

That is, conditional probabilities can be calculated from the current ngrams plus the ngrams of the preceding order. So this operation is performed not on a single ngrams object, but on a pair of such objects. And they serve an important role, as we'll see below. But first we need to talk about language models.

A language model is, basically, a collection of ngrams of different orders. By combining these ngrams we're able to obtain other measures beyond a simple frequency count or probability estimate. The biggest added value of such a model is the smoothing capability it implements. The problem smoothing solves is that you'll almost never be able to have all possible ngrams in your data store: there are just too many of them, and language users keep adding more. But it's very nasty to get a 0 probability for some ngram. The language model lets you find a balance between the number of ngrams you have to store and the possibility of getting meaningful probability numbers for any ngram. This is achieved with various smoothing techniques based on interpolation and discounting. Some of the smoothing methods are:

  • +1 smoothing
  • Kneser-Ney smoothing
  • Stupid Backoff

A good general compilation of various smoothing methods is assembled in this presentation.
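
To make the idea concrete, here's what the simplest of them, +1 (Laplace) smoothing, looks like for a bigram conditional probability. This is a generic illustration rather than CL-NLP code; vocab-size is the number of distinct words:

(defun laplace-cond-prob (bigrams unigrams bigram vocab-size)
  "Smoothed P(B|A) = (count(A B) + 1) / (count(A) + V).
   Unseen bigrams now get a small non-zero probability instead of 0."
  (/ (1+ (freq bigrams bigram))
     (+ (freq unigrams (first bigram)) vocab-size)))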

Let's look at the simplified implementation of scoring a sentence with the Stupid Backoff model:

(defmethod logprob ((model language-model) (sentence list))
  (with-slots (order) model
    (let ((rez 0)
          (s (append (cons "<S>" sentence) (list "</S>"))))
      (when (shorter? s order)
        (return-from logprob (logprob (get-ngrams (length s) model) s)))
      ;; start of the sentence: p(A|<S>) * p(B|<S>,A) * ...
      (do ((i 2 (1+ i)))
          ((= i order))
        (incf rez (cond-logprob model (sub s 0 i))))
      ;; middle of the sentence
      (do ((tail s (rest tail)))
          ((shorter? tail order))
        (incf rez (cond-logprob model (sub tail 0 order))))
      rez)))
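
The cond-logprob it relies on isn't shown above. Here's a rough sketch of the Stupid Backoff score it could be built on (this is my reconstruction, not the actual CL-NLP code; 0.4 is the commonly used backoff factor from the original Stupid Backoff paper):

(defun stupid-backoff-score (model ngram &optional (alpha 0.4))
  "Score of the last word of NGRAM given the preceding words: the raw
   frequency ratio when the ngram was seen, otherwise the score of a
   shorter context scaled by ALPHA. These scores are not probabilities
   and don't sum to 1."
  (let* ((order (length ngram))
         (ngrams (get-ngrams order model))
         (count (freq ngrams ngram)))
    (cond ((plusp count)
           (/ count
              (if (= order 1)
                  (ngrams-total-freq ngrams)
                  (freq (get-ngrams (1- order) model) (butlast ngram)))))
          ((= order 1)
           ;; completely unseen word: give it a tiny non-zero score
           (/ 1 (1+ (ngrams-total-freq ngrams))))
          (t (* alpha (stupid-backoff-score model (rest ngram) alpha))))))

cond-logprob would then just be the log of this score.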

Eventually, the language model is able to return the estimated probability of any sequence of words, not limited by the maximum order of ngrams in it. This is usually calculated using the Markov assumption with the following formula (shown here for a trigram language model):

p(s) = p(A) * p(B|A) * p(C|A,B) * p(D|B,C) * ... * p(Z|X,Y)
where s = A B ... Z

NLTK> (defvar *moby-lm2* (make-lm 'stupid-backoff-lm
                                  :1g (text-ugrams *moby*)
                                  :2g (text-bigrams *moby*)))
NLTK> (prob *moby-lm2* "This is a test sentence.")
6.139835e-20

That was, by the way, the probability of an unseen sentence, with the word "sentence" completely missing from the vocabulary.

NLTK> (prob *moby-lm2* '("<S>" "Moby" "Dick" "." "</S>"))
5.084481e-9
NLTK> (float (prob (text-bigrams *moby*) '("Moby" "Dick")))
3.0310333e-4

As you can see, it's much more likely to encounter the sentence "Moby Dick." in this text, although not as likely as the phrase "Moby Dick". :)

Also, such a model is able to generate random texts just like we did in the previous part. But because of the smoothing capability it's much more general, i.e. it can generate sequences containing any word from the vocabulary, even in phrases never seen before. At the same time it's much more computationally expensive: generating each new word now takes O(vocabulary size), while before it was O(average number of words following any particular word).
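
A sketch of what a single generation step could look like with a smoothed model (cond-logprob is the generic used in the scoring code above and the logs are assumed to be base 2; everything else here is mine and heavily simplified):

(defun sample-next-word (model context vocab)
  "Score every word in VOCAB as a continuation of CONTEXT and sample
   one proportionally to its score. This is why each step is
   O(vocabulary size): with smoothing no word can be ruled out."
  (let* ((scores (loop :for word :in vocab
                       :collect (expt 2 (cond-logprob
                                         model (append context (list word))))))
         (r (random (float (reduce #'+ scores))))
         (acc 0))
    (loop :for word :in vocab
          :for score :in scores
          :when (> (incf acc score) r)
          :do (return word))))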

NLTK> (princ (generate *genesis* :order 2 :n 93))
burial to judged eaten sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung sprung foreign longed them ought up Temanites Aran earth earth blessings surface surface surface surface surface surface surface surface surface floated Darkness Now homage earth Now In princes said vengeance It passed said divide In beginning earth Asenath said re The peaceful kind Calah said blameless mistress Chaldees said hunter said middle surface surface surface surface yonder earth rib said said smoking smoking smoking

And, as you can see, this example doesn't resemble the one in the previous part at all. Is this a bug? No, just a trick played on us because we aren't following basic mathematical principles. In the Stupid Backoff model the probabilities don't add up to 1, and the conditional probability of an unseen ngram may be larger than the largest probability of any recorded one! This is why we end up producing sequences of repeated words. The problem is much less obvious for the trigram model, although the text remains complete gibberish.

 NLTK> (princ (generate *genesis* :order 3 :n 93))
 brink time wagons fourth Besides south darkness listen foreigner Stay possessor lentils backwards be call dignity Kenizzites tar witness strained Yes appear colts bodies Reuel burn inheritance Galeed Hadar money touches conceal mighty foreigner spices Set pit straw son hurry yoke numbered gutters Dedan honest drove Magdiel Nod life assembly your Massa iniquity Tola still fifteen ascending wilderness everywhere shepherd harm bore Elah Jebusites Assyria butler Euphrates sinners gave Nephilim Stay garments find lifted communing closed Ir lights doing weeping shortly disobedience possessions drank peoples fifteen bless talked songs lamb far Shaveh heavens

This example shows us at least two things:

  • we should always check that the mathematical properties of our models still hold as we tweak them
  • although the major use case for a language model is scoring, you can get a feel for how well it will perform by looking at the texts it generates

Finding collocations

This is another interesting and useful NLP problem with a very elegant baseline solution, which is explained in this article. Hopefully, we'll get back to it in more detail in future chapters. For now, here are the results of implementing the algorithm from the article:

NLTK> (collocations *inaugural*)
(("United" "States") ("fellow" "citizens") ("four" "years") ("years" "ago")
 ("Federal" "Government") ("General" "Government") ("American" "people")
 ("Vice" "President") ("Old" "World") ("Almighty" "God") ("Fellow" "citizens")
 ("Chief" "Magistrate") ("Chief" "Justice") ("God" "bless") ("go" "forward")
 ("every" "citizen") ("Indian" "tribes") ("public" "debt") ("one" "another")
 ("foreign" "nations") ("political" "parties") ("State" "governments")
 ("National" "Government") ("United" "Nations") ("public" "money")
 ("national" "life") ("beloved" "country") ("upon" "us") ("fellow" "Americans")
 ("Western" "Hemisphere"))

I'm surprised at how similar they are to NLTK's, considering that I didn't look at their implementation. In fact, they are the same up to differences in the stopword list (the dirty secret of every NLP application :) The code for the collocation extraction function can be found in core/measures.lisp.

Other uses of ngrams

Ngrams are also sometimes applied to individual characters to build character-level language models. And here's another usage from NLTK: counting word lengths.

NLTK> (defvar *moby-lengths*
              (index-ngrams 1 (mapcar #'length (text-words *moby*))))
NLTK> (vocab *moby-lengths*)
(1 4 2 6 8 9 11 5 7 3 10 12 13 14 16 15 17 18 20)
NLTK> (ngrams-pairs *moby-lengths*)
((1 . 58368) (4 . 42273) (2 . 35726) (6 . 17111) (8 . 9966) (9 . 6428)
 (11 . 1873) (5 . 26595) (7 . 14399) (3 . 49633) (10 . 3528) (12 . 1053)
 (13 . 567) (14 . 177) (16 . 22) (15 . 70) (17 . 12) (18 . 1) (20 . 1))
NLTK> (ngrams-max-freq *moby-lengths*)
58368
NLTK> (freq *moby-lengths* 3)
49633

Final thoughts

Language modeling is really the foundation of any serious NLP work. Having access to ngrams expands your possibilities immensely, but the problem is that moving from a prototype to a production implementation becomes tricky due to the difficulties of collecting a representative data set and then storing it efficiently. Yet there are solutions: Google Books Ngrams and Google Web1T are examples of web-scale ngrams data sets, and there is also special-purpose software for storing large ngrams corpora and building language models from them, the notable examples being BerkeleyLM and KenLM.

Here's a link to some code for this chapter.

- Previous: NLTK 1.1 - Computing with Language: Texts and Words


2013-02-14

NLTK 1.1 - Computing with Language: Texts and Words

OK, let's get started with the NLTK book. Its first chapter tries to impress the reader with how simple it is to accomplish some neat things with texts using the toolkit. Actually, the underlying algorithms that make these results possible are mostly quite basic. We'll discuss them in this post, and the code for the first part of the chapter can be found in nltk/ch1-1.lisp.

Setting up texts for processing

For the purpose of this demonstration we'll need several texts which can be downloaded from NLTK data. Namely, we'll use the following 5 texts:

  • Moby Dick (can be found inside Project Gutenberg)
  • Sense and Sensibility (likewise from Project Gutenberg)
  • The Book of Genesis
  • Inaugural Address Corpus (this one comes as a collection of separate texts that you'll need to cat together into one file)
  • NPS Chat Corpus

These texts are in nltk/data/ directory in CL-NLP.

The NLTK authors have created a special Text class and have defined all the operations in this chapter as its methods. We'll employ a slightly simpler approach and implement them as ordinary functions. Yet we'll also have a special-purpose text class to cache reusable results of long-running operations, like tokenization.

NLTK> (load-nltk-texts "data/")
#<MOBY [Moby Dick by Herman... 1242986>
#<SENSE [Sense and Sensibili... 673019>
#<GENESIS In the beginning God... 188665>
...

As you've already guessed, we've just loaded all the texts. The number in the last column is each text's character count.

Now they are stored in *texts* hash-table. This is how we can access an individual text and name it for future usage:

(defparameter *sense* (get# :sense *texts*))

get# is one of the shorthand functions for operating on hash-tables defined in rutils.

Now we have a variable pointing to "Sense and Sensibility". If we examine it, this is what we'll see:

NLTK> (describe *sense*)
#<SENSE [Sense and Sensibili... 673019>
  [standard-object]
Slots with :INSTANCE allocation:
  NAME         = :SENSE
  RAW          = "[Sense and Sensibility by Jane Austen 1811]..
  WORDS        = #<unbound slot>
  CTXS         = #<unbound slot>
  TRANSITIONS  = #<unbound slot>
  DISPERSION   = #<unbound slot>

As you see, there are some unbound slots in this structure: words will hold every word in the text after tokenization, and ctxs will be a table of contexts for each word with their probabilities. By analogy, transitions will be a table of transition probabilities between words. Finally, dispersion will be a table of indices of word occurrences in the text. We'll use a lazy initialization strategy for them by defining slot-unbound CLOS methods that will be called on first access to each slot. For example, here's how words is initialized:

(defmethod slot-unbound (class (obj text) (slot (eql 'words)))
  (with-slots (raw words) obj
    (format t "~&Tokenizing text...~%")
    ;; <word-tokenizer> and <paragraph-splitter> are pre-defined tokenizer
    ;; singletons (their names here are an assumption)
    (prog1 (setf words (mapcan #`(cons "¶" (tokenize <word-tokenizer> %))
                               (tokenize <paragraph-splitter> raw)))
      (format t "Number of words: ~A~%" (length words)))))

First we split the raw text into paragraphs, because we'd like to preserve paragraph information. The splitting is slightly involved, as paragraphs are separated by double newlines, while single newlines end every line in the text, and we have to distinguish the two cases. We insert pilcrow signs at paragraph boundaries. Then we tokenize the paragraphs into separate words (real words, punctuation marks, symbols, etc.).
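
For illustration, the paragraph splitting could be as simple as a regex split on blank lines (a sketch assuming the cl-ppcre library under its re nickname, as used elsewhere in the code; the real splitter may treat whitespace more carefully):

(defun split-paragraphs (raw)
  "Split RAW text on runs of 2 or more newlines, i.e. on blank lines."
  (re:split "\\n\\n+" raw))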

NB. I consider tokenization the crucial function of the NLP toolkit, and we'll explore it in more detail in one of the future posts.

Implementing the examples

OK, now we are ready to start churning out examples from the first chapter.

The first one finds occurrences of certain words in the text. The NLTK guys perform the search on the tokenized texts, but I think it's quite OK to do it on the raw strings with regexes. This has the added benefit of preserving the text structure.

NLTK> (concordance *moby* "monstrous")
Displaying 11 of 11 matches
    former, one was of a most monstrous size. ...  This came towards
               "Touching that monstrous bulk of the whale or ork we h
                     array of monstrous clubs and spears.  Some were
 you gazed, and wondered what monstrous
 has survived the flood; most monstrous
                              monstrous fable, or still worse and mor
                       Of the Monstrous Pictures of Whales.
        In connexion with the monstrous pictures of whales, I am stro
o enter upon those still more monstrous stories of them
ave been rummaged out of this monstrous
 Whale-Bones; for Whales of a monstrous size are                     

With :pass-newlines on, we can get output similar to NLTK's. Let's try one of the homework tasks:

NLTK> (concordance *genesis* "lived" :pass-newlines t)
Displaying 75 of 75 matches
t from Yahweh's presence, and lived in the land of Nod, east of
 when they were created. Adam lived one hundred thirty years, and
...

Now let's try similar. Here we can't do without proper tokenization.

NLTK> (similar *moby* "monstrous")
Building word contexts...
Tokenizing text...
Number of words: 267803
Number of unique words: 19243
("mystifying" "subtly" "maddens" "impalpable" "modifies" "vexatious" "candid"
 "exasperate" "doleful" "delightfully" "trustworthy" "domineering" "abundant"
 "puzzled" "untoward" "contemptible" "gamesome" "reliable" "mouldy"
 "determined")
NLTK> (similar *sense* "monstrous")
Building word contexts...
Tokenizing text...
Number of words: 146926
Number of unique words: 6783
("amazingly" "vast" "heartily" "extremely" "remarkably" "great" "exceedingly"
 "sweet" "very" "so" "good" "a" "as")

We mostly get the same words as in NLTK's result, but with a slightly different ordering. It turns out that the reason for this is very simple. The function similar matches words based on the contexts in which they occur, in line with the famous quote by John Rupert Firth:

You shall know a word by the company it keeps

But if we look at the context overlap between the various words from our list, we'll see that the similarity relation between all these words is extremely weak: the decision is based on a single shared context in which both words appeared in the text. In fact, all the listed words are similar to the same extent.

NLTK> (common-contexts *moby* "monstrous" "loving")
("most_and")
NLTK> (common-contexts *moby* "monstrous" "mystifying")
("most_and")
NLTK> (apply #'common-contexts *moby* (similar *moby* "monstrous"))
("most_and")

Actually, the next NLTK example is probably the best context overlap you can get from those texts:

NLTK> (common-contexts *sense* "monstrous" "very")
("am_glad" "is_pretty" "a_pretty" "a_lucky")

Now let's draw a dispersion plot of the words from the inaugural corpus. This task may seem difficult to approach at first, because the authors use the Python library matplotlib to draw the graph. Fortunately, there's a language-agnostic tool to achieve similar goals called gnuplot. There are a couple of Lisp wrapper libraries for it, and the actual code you need to write to drive it amounts to 2 lines (not counting the code to format the data for consumption). There are, actually, numerous language-agnostic tools on the Unix platform: don't forget to look for them when you have this kind of specific need :)

gnuplot dispersion graph with cl-nlp

The next problem in this part also seems pretty hard. And, in fact, it is extremely hard if framed correctly: generating a meaningful text based on some other text. But the example solves an easier task, generating a somewhat meaningful text, and the approach it takes is a very simple one: it is the baseline method in this area, based on Markov chains. There was even a famous Markov-chain mock user in the times of Usenet called Mark V. Shaney. Markov models have one principal parameter: order. Mark V. Shaney was an order-2 chain.

Let's try to generate something with it:

NLTK> (princ (generate *genesis* :order 2 :n 93))
Building prefix transitions index of length 2...
Number of prefixes: 14326
In the marsh grass . Behold , the male and female . Of these , and that Jacob obeyed his father . I will return to the earth was dry . God set them in the earth . Noah built an altar there , and put me in the pit ; and that which was our father ' s hand , then the handmaids and their sin is very grievous in your flesh for an everlasting covenant between God and every man to see the nakedness of his flock and keep it .

And what if we raise the order?

NLTK> (princ (generate *genesis* :order 3 :n 93))
Building prefix transitions index of length 3...
Number of prefixes: 28206
In the beginning God created the large sea creatures , and every bird , whatever moves on the earth . He stayed yet another seven days , and sent over that which he had gained in Paddan Aram . Esau saw that the interpretation was good , he said to them , they conspired against him to kill him . They took captive all their little ones , and for days and years ; and let it divide the waters from the waters . God said to the younger , and his seed

The text starts to resemble the original more and more. You may also notice that the text will always start with "In". That's because Genesis isn't split into paragraphs, and our generation starts from paragraph beginnings, of which there's only one here.

OK, this seems to work, but with probabilities you never know for sure... ;)

Now, we're left with very simple tasks. Let's just do them:

NLTK> (length (text-words *genesis*))
44671

In the book they had a slightly different number: 44764. This is because of the different tokenization scheme. The differences can be seen in the next snippet (we have a cleaner version for this use case :)

NLTK> (take 20 (sort (remove-duplicates (text-words *genesis*) :test 'string=) 'string<))
("!" "\"" "'" "(" ")" "," "-" "." ":" ";" "?" "A" "Abel" "Abida" "Abimael"
 "Abimelech" "About" "Abraham" "Abram" "Accad")

What about the vocabulary size? Well, once again very similar to the NLTK number (2789).

NLTK> (length (remove-duplicates (text-words *genesis*) :test 'string=))
2634

Now, let's look for words:

NLTK> (count "smote" (text-words *genesis*) :test 'string=)
0

Hmm... What about some other word?

NLTK> (count "Abraham" (text-words *genesis*) :test 'string=)
134

This seems to work. So what's the problem with "smote"? It turns out there's no such word in the Genesis text: at least an examination of the text doesn't show any traces of it. Looks like we've found a bug in the book :)

(defun percentage (count total)
  (/ (* 100.0 count) total))
NLTK> (with-slots (words) *inaugural*
        (percentage (count "a" words :test 'string=) (length words)))
1.4597242
(defun lexical-diversity (text)
  (with-slots (words) text
    (/ (float (length words))
       (length (remove-duplicates words :test 'string=)))))
NLTK> (lexical-diversity *genesis*)
16.959377
NLTK> (lexical-diversity *chat*)
6.9837084

Interestingly, the results for the *chat* corpus differ from the NLTK ones, although they are calculated based on the tokens provided in the corpus, not ones extracted by our tokenization algorithms. This text is special because it is extracted from an XML-structured document which also contains the full tokenization. To use it we swap the words in the *chat* corpus:

NLTK> (setf (text-words *chat*)
            (mapcar #'token-word
                    (flatten (corpus-text-tokens ncorp:+nps-chat-corpus+))))

But first we need to get the corpus and extract the data from it - see corpora/nps-chat.lisp for details.

And, finally, we can examine the Brown Corpus.

       Genre        |  Tokens  |  Types  |  Lexical Diversity
--------------------+----------+---------+--------------------
PRESS-REPORTAGE     |  100554  |  14393  |    7.0
PRESS-EDITORIAL     |  61604   |  9889   |    6.2
PRESS-REVIEWS       |  40704   |  8625   |    4.7
RELIGION            |  39399   |  6372   |    6.2
SKILL-AND-HOBBIES   |  82345   |  11934  |    6.9
POPULAR-LORE        |  110299  |  14502  |    7.6
BELLES-LETTRES      |  173096  |  18420  |    9.4
MISCELLANEOUS-GOVER |  70117   |  8180   |    8.6
LEARNED             |  181888  |  16858  |   10.8
FICTION-GENERAL     |  68488   |  9301   |    7.4
FICTION-MYSTERY     |  57169   |  6981   |    8.2
FICTION-SCIENCE     |  14470   |  3232   |    4.5
FICTION-ADVENTURE   |  69342   |  8873   |    7.8
FICTION-ROMANCE     |  70022   |  8451   |    8.3
HUMOR               |  21695   |  5016   |    4.3

OK, seems like we're done with this chapter. So far there was no rocket science involved, but it was interesting...

Implementation details

So, what are the interesting bits we haven't discussed?

First, let's look at a small optimization trick for calculating lexical-diversity. Our initial variant uses the library function remove-duplicates, which is highly inefficient for this case.

NLTK> (time (lexical-diversity *chat*))
Evaluation took:
  9.898 seconds of real time
  9.888618 seconds of total run time (9.888618 user, 0.000000 system)
  99.91% CPU
  23,687,560,947 processor cycles
  229,392 bytes consed

What we'd like to do is something similar to the Python version, which puts everything into a set and calculates its size. A set is easily represented with a hash-table:

(defun uniq (list &key raw case-insensitive)
  "Return only unique elements from LIST either as a new list
   or as hash-table if RAW is set. Can be CASE-INSENSITIVE."
  (let ((uniqs (make-hash-table :test (if case-insensitive 'equalp 'equal))))
    (dolist (elt list)
      (set# elt uniqs t))
    (if raw uniqs (ht-keys uniqs))))

Here's the time of the same calculation using uniq:

NLTK> (time (lexical-diversity *chat*))
Evaluation took:
  0.014 seconds of real time
  0.012001 seconds of total run time (0.012001 user, 0.000000 system)
  85.71% CPU
  33,396,336 processor cycles
  613,568 bytes consed

A roughly 700x speed increase!

Now, let's return to text generation. It is accomplished with the following loop (a simplified version):

(loop :for i :from 1 :to length :do
  (let ((r (random 1.0))
        (total 0))
    (dotable (word prob
              (or (get# prefix transitions)
                  ;; no continuation - start anew
                  (prog1 (get# (setf prefix initial-prefix) transitions)
                    ;; add period unless one is already there
                    (unless (every #'period-char-p (car rez))
                      (push "." rez)
                      (incf i)))))
      (when (> (incf total prob) r)
        (push word rez)
        (setf prefix (cons word (butlast prefix)))
        (return)))))

On each iteration it places all possible continuations of the current prefix on a segment from 0 to 1 and generates a random number that points to one of the variants. If there's no continuation it starts anew.

The NLTK book actually uses a slightly more complicated model: first it builds a probability distribution on top of the transition frequencies, and then it generates the text from the probabilities. As of now I don't see why this is needed or whether it makes any difference in the results.

And, finally, here's how we draw the dispersion plot:

(defun plot-dispersion (words file)
  "Plot WORDS dispersion data from FILE."
  (cgn:start-gnuplot)
  (cgn:format-gnuplot "set title \"Lexical Dispersion Plot\"")
  (cgn:format-gnuplot "plot \"~A\" using 1:2:yticlabels(3) title \"\"" file))

It's just 1 line of gnuplot code, actually, but we also need to prepare the data in a tab-separated text file:

(defun dump-data (words dispersion-table)
  "Dump data from DISPERSION-TABLE for WORDS into a temporary file
   and return its name."
  (let ((filename (fmt "/tmp/~A" (gensym))))
    (with-out-file (out filename)
      (format out "0~t0~t~%")
      (doindex (i word words)
        (dolist (idx (get# word dispersion-table))
          (format out "~A~t~A~t~A~%" idx (1+ i) word)))
      (format out "0~t~A~t~%" (1+ (length words))))
    filename))

To wrap up, we've seen a demonstration of a lot of useful tools for text processing and also discussed how they can be built. Among all of them I want to highlight the utility of the seemingly simplistic concordance, which is actually a kind of grep tool that is indispensable for exploring large texts. I even used it a couple of times to debug issues in the more complex functions from this pack.

- Previous: Introduction


2013-02-02

Natural Language Meta Processing with Lisp

Recently I've started work on gathering and assembling a comprehensive suite of NLP tools for Lisp: CL-NLP. Something along the lines of OpenNLP or NLTK. There are actually quite a lot of NLP tools in Lisp accumulated over the years, but they are scattered over various libraries, internet sites and books. I'd like to have them in one place with a clean and concise API which would provide an easy starting point for anyone willing to do some NLP experiments or real work in Lisp. There are already a couple of NLP libraries, most notably langutils, but I don't find them very well structured, and their development isn't very active. So I see real value in creating CL-NLP.

Besides, I'm currently reading the NLTK book. I thought that implementing the examples from the book in Lisp could likewise be a great introduction to NLP and to Lisp, just as the book itself is an introduction to NLP and to Python. So I'm going to work through them using the CL-NLP toolset. I plan to cover 1 or 2 chapters per month. The goal is to implement pretty much everything meaningful, including the graphs; for them I'm going to use gnuplot driven by cgn, which I learned of while answering questions on StackOverflow. :) I'll try to implement the examples just from their description, not looking at the NLTK code, although I reckon that will sometimes be necessary if the results don't match. Also, in the process I'm going to discuss different stuff regarding NLP, Lisp, Python, and NLTK; that's why there's "meta" in the title. :)

CL-NLP overview

So here's a brief overview of CL-NLP 0.1.0. I think the overall structure of NLTK is very natural and simple to understand and use, and CL-NLP will be structured somewhat similarly. The packages I'm currently working on include: util, core, and corpora — the foundational tools. There will also be phonetics, syntax, semantics, generation, and learning (classification and other machine learning activities), as well as, probably, others.

Each package will export a small number of generic functions for the major operations, which will serve as the API entry points: tokenize, lemmatize, pos-tag, parse, etc. Each of these functions will take as its first argument an instance of a specific class corresponding to some concrete algorithm: regex-tokenizer, markov-chain-generator, porter-stemmer and so on. The instance may or may not have configurable properties and/or state, depending on the algorithm. So the algorithms themselves will be implemented in the generic functions' methods. This makes adding new algorithms for well-specified tasks, like stemming, really straightforward: define a subclass of stemmer and implement a method on stem for it, as in the sketch below. Finally, we'll have a way to easily access pre-defined instances of some frequently used algorithms. This will both provide an easy start with the library and a simple way to experiment with it and extend it.
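
For example, adding a (toy) stemming algorithm would look roughly like this. The stemmer class and the stem generic function are the ones mentioned above; the rule itself is just a placeholder:

(defclass naive-s-stemmer (stemmer)
  ()
  (:documentation "Toy stemmer that only strips a trailing #\s."))

(defmethod stem ((stemmer naive-s-stemmer) (word string))
  (if (and (> (length word) 1)
           (char= #\s (char word (1- (length word)))))
      (subseq word 0 (1- (length word)))
      word))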

Let's take a look at the tokenization module, which is one of the few non-vaporware ones so far. There are a couple of things I'd like to note:

  • Terminology: when we speak about tokenization we may assume that the result of this operation will be some kind of tokens. But in the NLP field a token is often thought of as a specific data structure that has some other properties beyond its string value, for example a POS tag. Also, tokenization can produce different kinds of results: letters, phonemes, words or sentences, for instance. NLTK tokenizers return strings by default, and they have some additional methods for returning spans, i.e. tuples of the corresponding beginning-end pairs. I support the approach that tokenization shouldn't produce opaque data structures, so as not to restrict the choice for higher-level tools. In general, I adhere to the principle that each tool in CL-NLP should produce the most raw data for its level of work. Yet I find it not very convenient to have different functions that return string tokens and the corresponding spans. This is easily amended using Lisp's multiple-values feature. So here's a definition of tokenize (a sketch of a concrete tokenizer method follows after this list):
    (defgeneric tokenize (tokenizer string)
      (:documentation
       "Tokenize STRING with TOKENIZER. Outputs 2 values:
        - list of words
        - list of spans as beg-end cons pairs"))
    
  • Using a generic function allows us to define an :around method for performing pre-processing of the input text by splitting it into lines and assembling each line's tokenization.
    (defclass tokenizer ()
      ()
      (:documentation
       "Base class for tokenizers."))
    
    (defmethod tokenize :around ((tokenizer tokenizer) string)
      "Pre-split text into lines and tokenize each line separately."
      (let ((offset 0)
            words spans)
        (loop :for line :in (split-sequence #\Newline string) :do
           (mv-bind (ts ss) (call-next-method tokenizer line)
             (setf words (nconc words ts)
                   spans (nconc spans (mapcar #`(cons (+ (car %) offset)
                                                      (+ (cdr %) offset))
                                              ss)))
             (incf offset (1+ (length line)))))
        (values words
                spans)))
    
  • And here's how we'll implement pre-defined instances of the tokenizers:
    (define-lazy-singleton word-chunker
        (make 'regex-word-tokenizer :regex (re:create-scanner "[^\\s]+"))
      "Dumb word tokenizer, that will not split punctuation from words.")
    
    It is done with a special macro that defines a lazily initialized singleton and special syntactic sugar for it. This facility is mainly intended for convenience in interactive experimentation, not for production deployment.
    (defmacro define-lazy-singleton (name init &optional docstring)
      "Define a function NAME, that will return a singleton object,
       initialized lazily with INIT on first call.
       Also define a symbol macro <NAME> that will expand to (NAME)."
      (with-gensyms (singleton)
        `(let (,singleton)
           (defun ,name ()
             ,docstring
             (or ,singleton
                 (setf ,singleton ,init)))
           (define-symbol-macro ,(mksym name :format "<~A>") (,name)))))
    
  • Finally, a very interesting direction for development is stream tokenization. It has a whole different set of optimization compromises, and I hope to return to it in some of the future posts.
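
As promised, here's a rough sketch of what a concrete regex-word-tokenizer method could look like (assuming cl-ppcre's do-matches; both the class definition and the method body are my guesses, not necessarily the actual CL-NLP code):

(defclass regex-word-tokenizer (tokenizer)
  ((regex :initarg :regex :accessor tokenizer-regex))
  (:documentation "Tokenizer that collects all matches of REGEX."))

(defmethod tokenize ((tokenizer regex-word-tokenizer) string)
  (let (words spans)
    (re:do-matches (beg end (tokenizer-regex tokenizer) string)
      (push (subseq string beg end) words)
      (push (cons beg end) spans))
    (values (nreverse words)
            (nreverse spans))))

With the "[^\\s]+" scanner from word-chunker above, this returns whitespace-separated chunks together with their character offsets.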

NB. I find it one of the biggest shortcomings of NLTK's design that the algorithms are implemented inside the concrete classes. There's a good saying by Joe Armstrong about this: you wanted to have a banana, but got a gorilla holding a banana. I think it very well characterizes the additional conceptual load you have to incur when looking at NLTK's code, with the algorithms scattered over various classes. If Python supported CLOS-like decoupling of classes and methods, it would be much easier to separate the algorithms from the other stuff in NLTK in a clean way. Well, in Lisp that's exactly what we're going to do.

Quickstart

To start working with CL-NLP you have to get it from my github account: vseloved/cl-nlp.

For those who are Lisp newbies: you'll also need Lisp, obviously. There are a lot of implementations of it (this is what a program that runs Lisp code is usually called), but all of them support the same standard, so you can choose any. The ones I recommend are SBCL and CCL. We'll call them Lisp from now on. You can interact with the implementation either by starting it and pasting commands at the presented prompt, or by using fancier tools: if you're already an Emacs user or think that you're savvy enough, I recommend getting SLIME and enjoying hacking with it. If you're on Vim, try SLIMV. Otherwise your choice is pretty limited, but take a look at ABLE, a very simple program for interacting with Lisp that has some useful features.

Also get quicklisp to make your work with 3rd-party libraries seamless and comfortable.

To load CL-NLP in your Lisp do the following:

(push "path/to/cl-nlp/" asdf:*central-registry*)
(ql:quickload "cl-nlp")

Here, "path/to/cl-nlp/" should point to a directory where you've downloaded the project, and the path should end with a slash. We use ql:quickload to load the project with quicklisp which will take care of loading its few dependencies.

And now if everything worked fine we're ready to start exploring the NLTK book!
