Errors in your Platform

One of the always trending topics in our pop-culture-like computer industry is dealing with errors. There are whole movements addressing it. Yet, in my humble opinion, these movements often see errors in a rather narrow, I would even say, simplistic manner: like the proverbial example of passing a string to a function which expects integers. But, in fact, the scope of errors that can occur in a program is indefinite. In a sense, the language of errors is Turing complete, and creating a program can be viewed from a certain point as devising a way of working around all relevant errors.

Today, I'd like to talk about a particularly nasty type of errors - those that occur in the platform on which we're building the program. And the reality is such that we base our programs on at least several layers of platforms: a hardware platform, an OS and its standard library, a language's VM and the language's standard library, some user-level framework(s) - to name a few. There are several aspects of this picture - some positive for us and some negative.

  • the lower we go, the more widely-used the platform is and thus the better it is tested and the more robust (positive)
  • at the same time it becomes, usually, more general and so more complex and feature-full, which means that some of the more obscure features may, actually, be not so well tested (negative)
  • the platform is usually heterogeneous: each level has it's own language with its semantics and concepts, often conflicting with the ones at the other levels (negative)

There's a revealing rant by node.js creator Ryan Dahl on the topic of such platforms - now removed, but available in the all-remembering Internet - "I hate almost all software".

The platform errors are so nasty, because of two things:

  • you learn to trust the platform (or you can easily go insane otherwise), so you first search for errors in your code
  • the majority of the platforms are not built with a priority on traceability, so it's an order of magnitude harder to find the problem, when you finally start looking for it IN the platform

And the thing is, there are severe bugs in all of the platforms we use. The Pentium FDIV bug is one acute example, but let's talk Java here, because there's so many people bragging how rock solid the JVM is. One of the problems of the JVM is jar hell - the issue that haunts almost every platform under different names (like DLL hell). We had an encounter with it recently, when after an update our server started hanging for no apparent reason. After an investigation we've found that part of our code was compiled with a more recent version of Gson library for parsing JSON, while the main server used the older version. The interesting thing is that there was no error condition signalled. I wonder, how many projects do the same mistake and don't even know what's the source of their problems. We had a similar but more visible issue with different versions of the servlet-api, but that time they were caught at compile-time because of the minor API inconsistency.

The other problem we'd fought fruitlessly and had to eventually work around with external tools was the JVM not properly closing network sockets. Yes, it leaks sockets in some scenarios and we couldn't find a way to prevent it. This is an example of another nasty platform error - the one which happens at layer boundaries. Also, many java libraries, like the ubiquitous jetty webserver leak memory in their default setting. Actually, I wonder, what percentage of Java libraries doesn't leak memory by default?.. This is no way to say, that JVM is exceptional in this regard - other platforms suffer from the same problems. At the same time to JVM's credit it has one of the tools to fight platform errors that many other platforms lack.

So, what are the ways to deal with those errors? Unfortunately, I don't have a lot of answers here, but I hope to discover some with this article. Here are some of the suggestions:

First of all, don't trust the platform.
Try to check everything and cross-validate your assumptions with external tools. This is much harder to do than say, and there's a tradeoff to make somewhere between slackness and paranoia. But if you think of it beforehand, you can see ways to set-up the system so that it would be easier to deal with the problem when it's necessary.
Have the platform's source code at hand.
This is one dimension of traceability - the static one. It won't help you easily find all the problems, but without it you're just at the mercy of the platform's creator. And I'd not trust those guys (see p.1)
Do proper error-handling.
There's a plethora of information on that, and I'd not repeat it here. See the error handling manifesto for one of the good examples.

But the biggest upside is not in fighting with the existing system, but in removing or diminishing the source of the problems - i.e. creating more programmer-friendly platforms. So the most important recommendations I see are for platform authors (and, in fact, most of us become them to some extent when we release a utility library or a framework).

Build platforms with traceability in mind.
One example of such traceability is JVM's VisualVM which is a great tool to get visibility into what's happening in the platform. Another example is Lisp images the state of which you can inspect interactively with Lisp commands. For instance, Lisp has built-in operators TRACE that allow to see the invocations of any interesting function, and INSPECT which is an interactive object browser. These things are really hard to add after-the-fact: look, for example, at the trouble of bringing DTrace into the Linux world.
Create homogenous platforms.
Why? Because in homogenous platforms the boundaries are not so sharp, and so there tend to be fewer bugs on the boundaries. And it's much easier to build in traceability: for example you don't need to deal with multiple stacks, multiple object representation etc. Once again, Lisp comes up here, because with its flexibility it allows to build systems of different levels of abstraction in the same language. And that was exemplified by the Lisp Machines - see the Kalman Reti's talk describing their inner workings.
Support component versioning at core.
This is, actually, one of the topics not addressed well by any mainstream platform (except, maybe, .Net), as far as I know: how to allow several versions of the same component to live together in one program image.

How can static typing help with these types of errors? I don't know. What about the Erlang's failure isolation. Once again, it's pretty orthogonal to the mentioned issues: if your Erlang library leaks memory, the whole machine will suffer - i.e. you can't isolate errors which touch the underlying layers of your platform which you don't fully control...

Please, share your approaches and stories about dealing with platform errors.


NLTK 2.3 - Working with Wordnet

I'm a little bit behind my schedule of implementing NLTK examples in Lisp with no posts on topic in March. It doesn't mean that work on CL-NLP has stopped - I've just had an unexpected vacation and also worked on parts, related to writing programs for the excellent Natural Language Processing by Michael Collins Coursera course.
Today we'll start looking at Chapter 2, but we'll do it from the end, first exploring the topic of Wordnet.

Wordnet and Lisp

Wordnet is a database of word meanings (senses) and word semantic relations (synonymy, hyponymy and various other ymy's). It is a really important freely available resource, so having easy access to it is very valuable. At the same time, getting Wordnet to work is somewhat involved technically. That's why I've decided to implement it's support in cl-nlp-contrib system, so that the basic cl-nlp functionality can be loaded without the additional hassle of having to ensure Wordnet (and other similar stuff) is loading.
The implementation of Wordnet interface in cl-nlp is incomplete in the sense, that it doesn't provide matching entities for all Wordnet tables and so doesn't support all the possible interactions with it out-of-the-box. Yet, it is sufficient to run all NLTK's examples and it is trivial to implement the rest as the need arises (I plan to do it later this year).
There are several other Wordnet interfaces for CL developed in the previous years, starting from as long ago as early nineties:
Non of them use my desired storage format (sqlite) and their overall support ranges from non-existing to little if any. Besides, one of the principles behind cl-nlp is to prefer uniform interface and ease of extensibility over performance and efficiency with the premise that a specialized efficient version can be easily implemented by restricting the more general one, while the opposite is often much harder. And adapting the existing packages to the desired interface didn't seem like a straightforward task, so I decided to implement Wordnet support from scratch.

Wordnet installation

Wordnet isn't a corpus, although NLTK categorizes it as such placing it in nltk.corpus module. It is a word database distributed in a custom format, as well as in binary format of several databases. I have picked sqlite3-based distribution as the easiest to work with — it's a single file of roughly 64 MB in size which can be downloaded from WNSQL project page. The default place where cl-nlp will search for Wordnet is the data/ directory. If you call (download :wordnet), it will fetch a copy and place it there.
There are several options to work with sqlite databases in Lisp, and I have chosen CLSQL as the most mature and comprehensive system — it's a goto library for doing simple SQL stuff in CL. Besides the low-level interface it provides the basic ORM functionality and some other goodies.
Also, to work with sqlite from Lisp a C client library is needed. It can be obtained in various ways depending on your OS and distribution (on most Linuxes with apt-get or similar package managers). Yet there's another issue of making Lisp find the installed library. So for Linux I decided to ship the appropriate binaries in the project's lib/ dir. If you are not on Linux and CLSQL can't find sqlite native library, you need to manually call the following command before you load cl-nlp-contrib with quicklisp or ASDF:
(clsql:push-library-path "path to sqlite3 or libsqlite3 directory")

Supporting Wordnet interaction

Using CLSQL we can define classes for Wordnet entities, such as: Synset, Sense or Lexlink.
(clsql:def-view-class synset ()
  ((synsetid :initarg :synsetid :reader synset-id
             :type integer :db-kind :key)
   (pos :initarg :pos :reader synset-pos
        :type char)
   (lexdomainid :initarg :lexdomain
                :type smallint :db-constraints (:not-null))
   (definition :initarg :definition :reader synset-def
               :type string)
   (word :initarg :word :db-kind :virtual)
   (sensenum :initarg :sensenum :db-kind :virtual))
  (:base-table "synsets"))
Notice the presence of virtual slots word and sensenum which are used to cache data from other tables in synset to display it like it is done in NLTK. They are populated lazily with slot-unbound method just like we've seen in Chapter 1.
All the DB entities are cached as well in the global cache, so that same requests don't produce different objects and Wordnet objects could be compared with eql.
There are some inconsistencies in Wordnet lexicon which have also migrated to NLTK interface (and then NLTK introduced some more). I've done a cleanup on that so some classes and slot accessors don't fully mimic the names of their DB counterparts or the NLTK's interface. For instance, in our dictionary word refers to a raw string and lemma — to its representation as a DB entity.
CLSQL also has special syntactic support for writing SQL queries in Lisp:
(clsql:select [item_id] [as [count [*]] 'num]
              :from [table]
              :where [is [null [sale_date]]]
              :group-by [item_id]
              :order-by '((num :desc))
              :limit 1)
Yet I've chosen to use another very simple custom solution for that, which I've kept in the back of my mind for a long time since I'd started working with CLSQL literal SQL syntax. Here's how we get all lemma object for a specific synset — they are connected through senses:
(query (select 'lemma
               `(:where wordid
                 :in ,(select '(wordid :from senses)
                              `(:where synsetid := ,(synset-id synset))))))
For the sake of efficiency we don't open new DB connection on every such request, but use a combination of the following:
  • there's a CLSQL special variable *default-database* that points to the current DB connection; it is used by all Wordnet interfacing functions
  • the connection can be established with connect-wordnet which is given an instance of sql-wordnet3 class (it has the usual default instance <wordnet>, but you can use any other instance which may connect to some othe Wordnet SQL DB like MySQL if needed)
  • ensure-db function checks, if the connection is present and opens it otherwise
  • it is assumed that all functions will perform Wordnet interaction inside with-wordnet macro that implements a standard Lisp resource-management call-with-* pattern for the Wordnet connection

Running NLTK's examples

Now, let's run the examples from the book to see how they work in our interface.
First, connect to Wordnet:
WORDNET> (connect-wordnet <wordnet>)
#<CLSQL-SQLITE3:SQLITE3-DATABASE /cl-nlp/data/wordnet30.sqlite OPEN {1011D1FAF3}>
Now we can run the functions with the existing connection passed implicitly through the special variable. wn is a special convenience package which implements the logic of implicitly relying on the default <wordnet>. There are functions with the same names in wordnet package that take a wordnet instance as first argument, similar to how other parts of cl-nlp are organized.
WORDNET> (wn:synsets "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
WORDNET> (synsets <wordnet> "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
The printing of synsets and other Wordnet objects is performed with custom print-object methods. Here's a print-object for synset that uses the built-in print-unreadable-object macro:
(defmethod print-object ((sample sample) stream)
  (print-unreadable-object (sample stream :type t :identity t)
    (with-slots (sample sampleid) sample
      (format stream "~A ~A" sample sampleid))))
Let's proceed further.
WORDNET> (wn:words (wn:synset "car.n.1"))
("auto" "automobile" "car" "machine" "motorcar")
This snippet is equivalent to the following NLTK code:
Definitions and examples:
WORDNET> (synset-def (wn:synset "car.n.1"))
"a motor vehicle with four wheels; usually propelled by an internal combustion engine"
WORDNET> (wn:examples (wn:synset "car.n.1"))
("he needs a car to get to work")
Next are lemmas.
WORDNET> (wn:lemmas (wn:synset "car.n.1"))
(#<LEMMA auto 9953 {100386BCC3}> #<LEMMA automobile 10063 {100386BCF3}>
 #<LEMMA car 20722 {100386BC03}> #<LEMMA machine 80312 {100386BD23}>
 #<LEMMA motorcar 86898 {1003250543}>)
As I've written earlier, there's some confusion in Wordnet between words and lemmas, and, IMHO, it is propagated in NLTK. The term lemma only appears once in Wordnet DB as a column in words table. And synsets are not directly related to words — they are linked through senses table. NLTK calls a synset to word pairing a lemma, which only adds to the confusion. I decided to call entities of words table lemmas. Now, how do you implement the equivalent of
We can do it like this:
WORDNET> (remove-if-not #`(string= "automobile" (lemma-word %))
                        (wn:lemmas (wn:synset "car.n.1")))
(#<LEMMA automobile 10063 {100386BCF3}>)
But, actually, we need not a raw lemma object, but a sense object, because sense is that mapping of word to its meaning (defined by a synset), and it's a proper Wordnet entity for this:
WORDNET> (wn:sense "car~automobile.n.1")
#<SENSE car~auto.n.1 28261 {1011F226C3}>
You can also get at the raw lemma by its DB id:
WORDNET> (wn:synset (wn:lemma 10063))
#<SYNSET auto.n.1 102958343 {100386BB33}>
Notice, that synset here is named auto. This is different from NLTK's 'car', and the reason for this is that it's unclear from the Wordnet DB, what is the "primary" lemma for a synset, so I just use the word which appears first in the DB. Probably, NLTK also uses it, but has a different ordering — compare the next output with the book's one:
WORDNET> (dolist (synset (wn:synsets "car"))
           (print (wn:words synset)))
("cable car" "car")
("auto" "automobile" "car" "machine" "motorcar")
("car" "railcar" "railroad car" "railway car")
("car" "elevator car")
("car" "gondola")
Also I've chosen a different naming scheme for senses — word~synset — as it better reflects the semantic meaning of this concept.

Wordnet relations

There're two types of Wordnet relations: semantic ones between synsets and lexical ones between senses. All of the relations or links can be found in *link-types* parameter. There are 28 of them in Wordnet 3.0.
To get at any relation there's a generic function related:
WORDNET> (defvar *motorcar* (wn:synset "car.n.1"))
WORDNET> (defvar *types-of-motorcar* (wn:related *motorcar* :hyponym))
WORDNET> (nth 0 *types-of-motorcar*)
#<SYNSET ambulance.n.1 102701002 {1011E35063}>
In our case "ambulance" is the first motorcar type.
WORDNET> (sort (mapcan #'wn:words *types-of-motorcar*) 'string<)
("ambulance" "beach waggon" "beach wagon" "bus" "cab" "compact" "compact car"
 "convertible" "coupe" "cruiser" "electric" "electric automobile"
 "electric car" "estate car" "gas guzzler" "hack" "hardtop" "hatchback" "heap"
 "horseless carriage" "hot rod" "hot-rod" "jalopy" "jeep" "landrover" "limo"
 "limousine" "loaner" "minicar" "minivan" "model t" "pace car" "patrol car"
 "phaeton" "police car" "police cruiser" "prowl car" "race car" "racer"
 "racing car" "roadster" "runabout" "s.u.v." "saloon" "secondhand car" "sedan"
 "sport car" "sport utility" "sport utility vehicle" "sports car" "squad car"
 "stanley steamer" "station waggon" "station wagon" "stock car" "subcompact"
 "subcompact car" "suv" "taxi" "taxicab" "tourer" "touring car" "two-seater"
 "used-car" "waggon" "wagon")
Hypernyms are more general entities in synset hierarchy:
WORDNET> (wn:related *motorcar* :hypernym)
(#<SYNSET automotive vehicle.n.1 103791235 {10125702C3}>)
Let's trace them up to root entity:
WORDNET> (defvar *paths* (wn:hypernym-paths *motorcar*))
WORDNET> (length *paths*)
WORDNET> (mapcar #'synset-name (nth 0 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "conveyance.n.3" "vehicle.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
WORDNET> (mapcar #'synset-name (nth 1 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "container.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
And here are just the root hypernyms:
WORDNET> (remove-duplicates (mapcar #'car *paths*))
(#<SYNSET entity.n.1 100001740 {101D016453}>)
Now, if we look at part-meronym, substance-meronym and member-holonym relations and try to get them with related, we'll get an empty set.
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym)
That's because the relation is actually one-way: a burl is part of a tree, but not vice versa. For this case there's a :reverse key to related:
WORDNET> (wn:related (wn:synset "tree.n.1") :part-meronym :reverse t)
(#<SYNSET stump.n.1 113111504 {10114220B3}>
 #<SYNSET crown.n.7 113128003 {1011423B73}>
 #<SYNSET limb.n.2 113163803 {1011425633}>
 #<SYNSET bole.n.2 113165815 {10114270F3}>
 #<SYNSET burl.n.2 113166044 {1011428BE3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym :reverse t)
(#<SYNSET sapwood.n.1 113097536 {10113E8BE3}>
 #<SYNSET duramen.n.1 113097752 {10113EA6A3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :member-holonym :reverse t)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
While the tree is member-meronym of forest (i.e. NLTK has it slightly the opposite way):
WORDNET> (wn:related (wn:synset "tree.n.1") :member-meronym)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
And here's the mint example:
WORDNET> (dolist (s (wn:synsets "mint" :pos #\n))
           (format t "~A: ~A~%" (synset-name s) (synset-def s)))
mint.n.6: a plant where money is coined by authority of the government
mint.n.5: a candy that is flavored with a mint oil
mint.n.4: the leaves of a mint plant used fresh or candied
mint.n.3: any member of the mint family of plants
mint.n.2: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
batch.n.2: (often followed by `of') a large number or amount or extent
WORDNET> (wn:related (wn:synset "mint.n.4") :part-holonym :reverse t)
(#<SYNSET mint.n.2 112855042 {1011CF3CF3}>)
WORDNET> (wn:related (wn:synset "mint.n.4") :substance-holonym :reverse t)
(#<SYNSET mint.n.5 107606278 {1011CEECA3}>)
WORDNET> (wn:related (wn:synset "walk.v.1") :entail)
(#<SYNSET step.v.1 201928838 {1004139FF3}>)
WORDNET> (wn:related (wn:synset "eat.v.1") :entail)
(#<SYNSET chew.v.1 201201089 {10041BDC33}>
 #<SYNSET get down.v.4 201201856 {10041BF6F3}>)
WORDNET> (wn:related (wn:synset "tease.v.3") :entail)
(#<SYNSET arouse.v.7 201762283 {10042495E3}>
 #<SYNSET disappoint.v.1 201798936 {100424B0A3}>)
And, finally, here's antonomy, which is a lexical, not semantic, relationship, and it takes place between senses:
WORDNET> (wn:related (wn:sense "supply~supply.n.2") :antonym)
(#<SENSE demand~demand.n.2 48880 {1003637C03}>)
WORDNET> (wn:related (wn:sense "rush~rush.v.1") :antonym)
(#<SENSE linger~dawdle.v.4 107873 {1010780DE3}>)
WORDNET> (wn:related (wn:sense "horizontal~horizontal.a.1") :antonym)
(#<SENSE inclined~inclined.a.2 94496 {1010A5E653}>)
WORDNET> (wn:related (wn:sense "staccato~staccato.r.1") :antonym)
(#<SENSE legato~legato.r.1 105844 {1010C18AF3}>)

Similarity measures

Let's now use the lowest-common-hypernyms function abbreviated to lch to see the semantic grouping of marine animals:
WORDNET> (defvar *right* (wn:synset "right whale.n.1"))
WORDNET> (defvar *orca* (wn:synset "orca.n.1"))
WORDNET> (defvar *minke* (wn:synset "minke whale.n.1"))
WORDNET> (defvar *tortoise* (wn:synset "tortoise.n.1"))
WORDNET> (defvar *novel* (wn:synset "novel.n.1"))
WORDNET> (wn:lch *right* *minke*)
(#<SYNSET baleen whale.n.1 102063224 {102A2E0003}>)
WORDNET> (wn:lch *right* *orca*)
(#<SYNSET whale.n.2 102062744 {102A373323}>)
WORDNET> (wn:lch *right* *tortoise*)
(#<SYNSET craniate.n.1 101471682 {102A3734A3}>)
WORDNET> (wn:lch *right* *novel*)
(#<SYNSET entity.n.1 100001740 {102A373653}>)
The second and third return values of lch come handy here, as they show the depth of the paths to the common ancestor and give some immediate data for estimating semantic relatedness, that we're going to explore more now.
WORDNET> (wn:min-depth (wn:synset "baleen whale.n.1"))
WORDNET> (wn:min-depth (wn:synset "whale.n.2"))
WORDNET> (wn:min-depth (wn:synset "vertebrate.n.1"))
WORDNET> (wn:min-depth (wn:synset "entity.n.1"))
By the way, there's another whale, that is much closer to the root:
WORDNET> (wn:min-depth (wn:synset "whale.n.1"))
Guess what it is?
WORDNET> (synset-def (wn:synset "whale.n.1"))
"a very large person; impressive in size or qualities"
Now, let's calculate different similarity measures:
WORDNET> (wn:path-similarity *right* *minke*)
WORDNET> (wn:path-similarity *right* *orca*)
WORDNET> (wn:path-similarity *right* *tortoise*)
WORDNET> (wn:path-similarity *right* *novel*)
The algorithm here is very simple — it just reuses the secondary return values of lch:
(defmethod path-similarity ((wordnet sql-wordnet3)
                            (synset1 synset) (synset2 synset))
  (mv-bind (_ l1 l2) (lowest-common-hypernyms wordnet synset1 synset2)
    (declare (ignore _))
    (when l1
      (/ 1 (+ l1 l2 1)))))
There are many more Wordnet similarity measures in NLTK, and they are also implemented in cl-nlp. For example, lch-similarity the name of which luckily coincides with lch function:
WORDNET> (wn:lch-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
Calculating taxonomy depth for #<SYNSET entity.n.1 100001740 {102A9F7A83}>
This measure depends on performing expensive computation of taxonomy depth which is done in a lazy manner in the max-depth function:
(defmethod max-depth ((wordnet sql-wordnet3) (synset synset))
  (reduce #'max (mapcar #`(or (get# % *taxonomy-depths*)
                                (format *debug-io*
                                        "Calculating taxonomy depth for ~A~%" %)
                                (set# % *taxonomy-depths*
                                      (calc-max-depth wordnet %))))
                        (mapcar #'car (hypernym-paths wordnet synset)))))
Other similarity measures include wup-similarity, res-similarity, lin-similarity and others. You can read about them in the NLTK Wordnet manual. Most of them depend on the words' information content database which is calculated for different corpora (e.g. BNC and Penn Treebank) and is not part of Wordnet. There's a WordNet::Similarity project that distributes pre-calculated databases for some popular corpora.
Lin similarity proposes a similar formula to LCH similarity for calculating the score, but only uses information contents instead of taxonomy depths as the arguments to it:
WORDNET> (wn:lin-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
To calculate it we first need to fetch the information content corpus with (download :wordnet-ic) and load the data for SemCor:
WORDNET> (setf *lc* (load-ic :filename-pattern "semcor"))
#<HASH-TABLE :TEST EQUAL :COUNT 213482 {100D7B80D3}>
The default :filename-pattern will load the combined information content scores for the following 5 corpora: BNC, Brown, SemCor and its raw version, all of Shakespeare's works and Penn Treebank. We'll talk more about various corpora in the next part of the series...
Well, this is all as far as Wordnet concerned for now. There's another useful Wordnet functionality — word morpher, but we'll return to it when we'll be talking about word forming, stemming and lemmatization. Meanwhile, here's a good schema showing the full potential of Wordnet:
Wordnet 3.0 UML diagram