2013-11-05

Another Concurrency FUD

Today I stumbled on an article entitled Locks, Actors, And STM In Pictures, which looks rather cool but is pure FUD. It tells a FoxNews kind of story about how troublesome using locks is compared to the newer synchronization primitives.

First of all, the author uses an example synchronization problem that is solved perfectly well with locks, and simply asserts that it is a bad solution without any concrete argument.

What's more important, though, is that he has to modify the original problem to fit the alternative solutions.

In the actor case we need to resort to a waiter. Can the problem be solved without one? Surely, but the code will probably be twice as big and much more complicated than the lock-based solution. Not to mention that it will also be prone to deadlock. Meanwhile, the lock-based philosophers solution with a waiter is even simpler: lock the waiter, tell it what to do, unlock it.

And in the STM case we take both forks at once, while in the original task they need to be taken one by one. This seems like a minor issue, but now try to add printing to the mix: can we print "Aristotle took left fork" right after he takes the left fork, and "Aristotle put left fork" if he couldn't take the right one?

PS. I like actors and STM and try to use them where appropriate. I also teach them to students as part of the Operating Systems course.

2013-09-12

NLTK 2.2 - Conditional Frequency Distributions

The summer is over, and it's time to go back to school and continue our exploration of the NLTK book. We're finally getting to the last part of the second chapter - conditional frequency distributions.

What is a conditional frequency distribution? From a statistical point of view, it is a function of 2 arguments - a condition and a concrete outcome - producing an integer result that is the frequency of the outcome's occurrence under the condition. The simplest variant of such a function is a table.

In object-oriented programming there's a tendency to put everything in a program into objects, including functions, and the NLTK authors took the same route by creating a ConditionalFreqDist class. Yet, in my view, this is quite dubious, and I have a hard time with such objects. What we really need here is just a protocol for working with tables. At the same time there's a classic separation-of-concerns issue here, as this protocol shouldn't mandate the internal representation or the way different values in the distribution are generated. This is where objects come to the rescue in OOP: they allow you to change the value storage/generation policy via subclassing. But, as Joe Armstrong has wittily pointed out, "you wanted to have a banana, but got a gorilla holding a banana"...

So, let's get to the protocol part, for which the proper concept in OOP languages would be an interface. Its main operations are: creating a distribution and examining it (tabulating, plotting, etc.)
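
Sketched as generic functions, the protocol boils down to something like the following (the signatures here are assumed from the usage below and may differ from the actual CL-NLP definitions; plotting is handled by the plot-table function shown later in this post):

(defgeneric make-cfd (raw &key &allow-other-keys)
  (:documentation "Create a conditional frequency distribution from RAW data."))

(defgeneric tabulate (cfd &key conditions samples cumulative)
  (:documentation "Print a 2D table of CFD counts for CONDITIONS and SAMPLES."))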

Creating a distribution

Let's create a CFD from the Brown corpus. Remember that its topics are the following (and they'll become the conditions of our CFD):

NLTK> (keys (corpus-groups +brown-corpus+))
(:PRESS-REPORTAGE :PRESS-EDITORIAL :PRESS-REVIEWS :RELIGION
 :SKILL-AND-HOBBIES :POPULAR-LORE :BELLES-LETTRES
 :MISCELLANEOUS-GOVERNMENT-HOUSE-ORGANS :LEARNED :FICTION-GENERAL
 :FICTION-MYSTERY :FICTION-SCIENCE :FICTION-ADVENTURE :FICTION-ROMANCE
 :HUMOR)

Let's create the simplest distribution:

NLTK> (make-cfd (maptable (lambda (topic texts)
                            (declare (ignore topic))
                            (mapcar #'token-word
                                    (flatten (mapcar #'text-tokens texts))))
                          (corpus-groups +brown-corpus+)))
#<HASH-TABLE :TEST EQL :COUNT 15 {101AA320A3}>
NLTK> (defvar *cfd* *)

We'll use a very concrete hash-table representation for the distribution, and we can go a very long way with it. Later, though, we'll see how to deal with other representations.
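
For reference, the make-cfd method for such hash-table input might look roughly like this - a sketch modeled on the file-list method we'll see at the end of this post, so the real definition may differ:

(defmethod make-cfd ((raw hash-table) &key (eq-test 'eql) &allow-other-keys)
  "Build a CFD from a table mapping conditions to lists of outcomes."
  (let ((rez (make-hash-table :test eq-test)))
    (dotable (condition outcomes raw)
      (set# condition rez
            (make 'table-ngrams :order 1
                  :table (count-ngram-freqs outcomes))))
    rez))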

There's a new utility here, maptable, which is the equivalent of mapcar in its simple form (when it works with a single list), but operating on a hash-table instead. (And, being a generic function, it can also operate on any other table-like structures.)
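
A minimal sketch of such a utility for the hash-table case might look like this (the real CL-NLP version is a generic function and may differ in details):

(defmethod maptable (fn (table hash-table))
  "Return a new hash-table with the same keys as TABLE and values
   obtained by calling FN on each key and value."
  (let ((rez (make-hash-table :test (hash-table-test table))))
    (maphash (lambda (k v)
               (setf (gethash k rez) (funcall fn k v)))
             table)
    rez))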

Each frequency distribution in a CFD is an instance of the ngrams class that we've worked with in the previous chapter:

NLTK> (~ *cfd* :fiction-romance)
#<TABLE-NGRAMS order:1 count:8451 outcomes:70022 {101C030083}>

Another new utility here, ~, is an alias to the generic element access operator generic-elt, which is meant to provide uniform access to the elements of the different linguistic structures defined throughout CL-NLP. It's the analogue of the [] access operator in Python and other C-like languages, with the upside that it's not special - it's just a function. And it also supports chaining, using the generic-function :around method-combination:

(defgeneric generic-elt (obj key &rest keys)
  (:documentation
   "Generic element access in OBJ by KEY.
    Supports chaining with KEYS.")
  (:method :around (obj key &rest keys)
    (reduce #'generic-elt keys :initial-value (call-next-method obj key))))

As you can see, a lot of utilities, including very generic ones, are defined along the way of our exploration. Most of them, or similar ones, come built into Python. At first sight this can be seen as a failure of the Lisp standard. Yet it may as well be considered a win, because it's so easy to add these things on top of the existing core with generic functions and macros, and there's no issue of "second-class citizenship". In Python the [] operator comes pre-built: you can redefine it with special hooks, but only in one way - you can't have different [] implementations for one class. The other benefit of Lisp's approach is more control: you can go with the fast functions that work directly with concrete data structures, like mapcar for lists or gethash for hash-tables, or you can use a generic operator and pay the price of generic dispatch - the standard map for abstract sequences, or the user-defined ~, etc. - when you need future extensibility.
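
For instance, with the *cfd* table defined above, both of the following return the same TABLE-NGRAMS object - the first via direct hash-table access, the second via the generic operator:

(gethash :fiction-romance *cfd*)  ; concrete and fast
(~ *cfd* :fiction-romance)        ; generic and extensible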

So, we can also use generic-elt to access an individual frequency in the CFD:

NLTK> (~ *cfd* :fiction-romance "could")
193

For the first key it is mapped to the hash-table accessor (gethash), and for the second to the freq method of the ngrams object.
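
Concretely, this dispatch could be provided by methods along these lines (a sketch - the actual CL-NLP methods may differ in details; ngrams is the class we saw above and freq its frequency accessor):

(defmethod generic-elt ((obj hash-table) key &rest keys)
  (declare (ignore keys))
  (gethash key obj))

(defmethod generic-elt ((obj ngrams) key &rest keys)
  (declare (ignore keys))
  (freq obj key))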

Examining the distribution

Now, we can move to the more interesting part: analysing the distribution.

The first way to explore the CFD shown in the NLTK book is tabulating: it prints the distribution values in a nice and readable 2D table.

NLTK> (tabulate *cfd*
                :conditions '(:press-reportage :religion :skill-and-hobbies
                              :finction-science :fiction-romance :humor)
                :samples '("can" "could" "may" "might" "must" "will"))

                     can  could  may  might  must  will
    PRESS-REPORTAGE   93     86   66     38    50   389
           RELIGION   82     59   78     12    54    71
  SKILL-AND-HOBBIES  268     58  131     22    83   264
    FICTION-ROMANCE   74    193   11     51    45    43
              HUMOR   16     30    8      8     9    13

And, by the way, here's how we can get the individual numbers for one category:

NLTK> (dolist (m '("can" "could" "may" "might" "must" "will"))
        (format t "~A: ~A " m (~ *cfd* :press-reportage m)))
can: 93 could: 86 may: 66 might: 38 must: 50 will: 389

Interestingly enough, these results differ from NLTK's ones:

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

What's the reason? The answer is case-sensitivity. Let's try the same with the case-insensitive version of the CFD:

NLTK> (let ((cfd (make-cfd (maptable (lambda (topic texts)
                                       (mapcar #`(string-downcase (token-word %))
                                               (flatten (mapcar #'text-tokens
                                                                texts))))
                                     (corpus-groups +brown-corpus+)))))
        (dolist (m '("can" "could" "may" "might" "must" "will"))
          (format t "~A: ~A " m (~ cfd :press-reportage m))))
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Now, there's an exact match :)

The printing of each row in tabulate is performed in the following loop:

(dotable (k v table)
  (when (or (null conditions)
            (member k conditions))
    (format stream "  ~V:@A" key-width k)
    (let ((total 0))
      (dolist (s samples)
        (format stream "  ~V:@A" (strlen s)
                (funcall (if cumulative
                             #`(incf total %)
                             #'identity)
                         (or (~ v s) 0)))))))

It has to account for the cases when conditions may or may not be provided, as well as handle cumulative output.
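
Assuming tabulate exposes the cumulative flag as a keyword argument (as plot-table does below), cumulative counts could be requested like this - only the call is shown; the output has the same layout as above, but with running totals:

NLTK> (tabulate *cfd*
                :conditions '(:press-reportage :religion)
                :samples '("can" "could" "may" "might" "must" "will")
                :cumulative t)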

And, finally, let's draw some nice pictures. But first we need to create the inaugural corpus from individual texts:

(defparameter *inaugural* (make-corpus-from-dir
                           "Inaugural speeches corpus"
                           (data-file "../nltk/data/inaugural/")
                           :test #`(re:scan "^\\d+" %)))

To do that, we introduce the function make-corpus-from-dir in the ncorpus package, which walks a directory of raw text files and creates a basic corpus from them.

;; the signature matches the call shown above
(defun make-corpus-from-dir (name dir &key (test (constantly t)))
  "Create a basic corpus named NAME from the raw text files in DIR
   whose file names satisfy TEST."
  (let (texts)
    (fad:walk-directory dir
                        #`(when (funcall test (pathname-name %))
                            (push (pair (pathname-name %)
                                        (read-file %))
                                  texts)))
    (make-corpus
     :desc name
     :texts (mapcar #`(make-text
                       :name (lt %) :raw (rt %)
                       :tokens
                       (mapcar #`(ncore:make-token :word %)
                               (mapcan #`(ncore:tokenize ncore:<word-tokenizer> %)
                                       (ncore:tokenize ncore:<sentence-splitter>
                                                       (rt %)))))
                    texts))))

(Here, pair, lt, and rt are yet another generic utility group — the equivalents of cons, car, and cdr).

Now, we need to make a CFD from the corpus. It is somewhat complicated, because the corpus is keyed by text names, and we want a CFD keyed by words:

(make-cfd (let ((ht (make-hash-table :test 'equalp)))
            (dolist (text (corpus-texts *inaugural*))
              (let ((year (sub (text-name text) 0 4)))
                (dolist (word (mapcar #'token-word (text-tokens text)))
                  (when-it (find word '("america" "citizen")
                                 :test #`(search %% % :test 'string-equal))
                    (set# it ht (cons year (get# it ht)))))))
            ht)
          :eq-test 'equalp)

And now, we can plot it:

NLTK> (plot-table *)

Plot of usage of 'america' and 'citizen' in Inaugural speeches

Once again, plotting is performed with gnuplot. This time, though, with a general plot-table function that is able to handle any table-like structure. Here is its code:

(defun plot-table (table &rest args
                   &key keys cols cumulative (order-by (constantly nil)))
  "Plot all or selected KEYS and COLS from a TABLE.
   CUMULATIVE counts may be used, as well as custom ordering with ORDER-BY."
  (mv-bind (file cols-count keys-count)
      (apply #'write-tsv table args)
    (let ((row-format (fmt "\"~A\" using ~~A:xtic(2) with lines ~
                            title columnheader(~~:*~~A)"
                           file)))
      (cgn:with-gnuplot (t)
        (cgn:format-gnuplot "set grid")
        (cgn:format-gnuplot "set xtics rotate 90 1")
        (cgn:format-gnuplot "set ylabel \"~@[Cumulative ~]Counts\"" cumulative)
        (cgn:format-gnuplot "plot [0:~A] ~A"
                            cols-count
                            (strjoin "," (mapcar #`(fmt row-format (+ 3 %))
                                                 (range 0 keys-count))))))
    (delete-file file)))

A generic function write-tsv creates a temporary file in tab-separated format. Here, we use its method for hash-table objects (which in our case represent a CFD — each condition is a key and each sample is a column). So, the TSV file's contents for our last distribution look like this:

No Label citizen america
0 1789 5 2
1 1793 1 1
...

Using different data sources

Finally, as promised, here's how to use the newly developed CFD machinery with other ways to obtain raw data for the distribution. Oftentimes, it's infeasible to read the whole contents of some files into memory, i.e. it makes sense to use line-by-line processing. There's a handy dolines macro that allows just that. Let's see how we can use it to build the distribution of male/female names' last letters from NLTK Figure 2.10.
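
For reference, dolines can be sketched as a thin wrapper around read-line (the actual utility may differ in details):

(defmacro dolines ((line file) &body body)
  "Execute BODY for each line of FILE, binding the line to LINE."
  (with-gensyms (in)
    `(with-open-file (,in ,file)
       (loop :for ,line := (read-line ,in nil) :while ,line
             :do (progn ,@body)))))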

First of all, we can just use it when we create the distribution:

(make-cfd (let ((ht (make-hash-table)))
            (dolist (gender '(:female :male))
              (dolines (word (fmt "../nltk/data/names/~(~A~).txt" gender))
                (let ((last-char (char word (1- (length word)))))
                  (when (alpha-char-p last-char)
                    (set# gender ht
                          (cons last-char (get# gender ht)))))))
            ht))

And here's the plot of the distribution for reference:

(plot-table * :order-by 'char<)

Plot of the distribution of last letters in male/female names

Another approach would be to use a proxy object that allows pulling data for a CFD from a list of files:

(defclass file-list ()
  ((files :initarg :files :reader files)))

and specialize the make-cfd generic function for it (adding a transform keyword argument to be able to manipulate the file data):

(defmethod make-cfd ((raw file-list)
                     &key (eq-test 'equal) (transform 'identity)
                     &allow-other-keys)
  (let ((rez (make-hash-table :test eq-test)))
    (dolist (file (files raw))
      (set# (mkeyw (pathname-name file)) rez
            (make 'table-ngrams :order 1
                  :table (count-ngram-freqs
                          (mapcar transform
                                  (split-sequence-if
                                   #'newline-char-p
                                   (read-file file)
                                   :remove-empty-subseqs t))))))
    rez))

And this is how we would call it:

(make-cfd (make 'file-list
                :files (mapcar #`(fmt "../nltk/data/names/~(~A~).txt" %)
                               '(:male :female)))
          :eq-test 'eql
          :transform #`(char % (1- (length %))))

This looks much heavier than the original approach, but sometimes it may be inevitable.

Yet another way is to use a different plotting function that dynamically builds the plot as it pulls data. This will require a different style of communication with gnuplot: not via a temporary file, but by giving it commands dynamically. This is left as an exercise for the reader :)

Parting words

This part was more a discussion of various rather abstract design issues than a demonstration of NLP techniques.

Overall, I don't see a lot of value in the special treatment of conditional frequency distributions as it is done in NLTK. Surely, it's one of many useful analytic tools, but I haven't seen such objects used in production code - unlike conditional frequencies per se, or ngrams, which are all over the place.

2013-09-04

Looking for a Lisper with Fire in Their Eyes

For 3 years now I've been working at Grammarly, where we're building (rather successfully) the most accurate service for correcting and improving English texts. Taken to the extreme, this task runs into full-fledged language understanding, but until we reach that level, we have to look for workarounds. :) Obviously, our team, which we call the "core team", has more work than it can handle. But this work differs quite a lot from ordinary software engineering; or rather, engineering is its final stage, preceded by a huge amount of experimentation and research.

I once called our system Grammar OS, because it somewhat resembles an OS: it has a low level of language services, intermediate layers of checkers, and user-facing applications on top. One of its key components is the grammar checking system, which is written in Lisp and constantly evolving. Besides the inherently complex problem, onto which Lisp's capabilities map very well, and the high load, this project is also interesting in that about 5 computational linguists work on it, adding code to it all the time...

What I'm getting at is that for more than half a year we've been looking for a Lisp developer willing to join the development of this system, and, further down the road, to work on the new Lisp projects we're thinking about. Besides Lisp, the core uses a lot of Java (even more, actually), and we generally love it too. And a bit of Python. And the main focus is shifting more and more towards various machine learning algorithms.

Finally, we have a great atmosphere at Grammarly and plenty of people to learn from.

All in all, as far as I'm concerned, it's practically a dream job for a programmer who likes text processing, R&D, or Lisp. A logical question: why haven't we been able to find the right person for so long? There are several answers:
  • First, we hire everyone very slowly (it took us almost half a year to find even a Java developer). Why? See above: team quality matters more to us than speed.
  • Second, there aren't that many Lispers with industrial experience: a couple dozen at most in Kyiv, more in Russia, but few of them are willing to relocate.
  • Third, even technically strong candidates often lack that fire in their eyes :(
In short, the situation is this:
  • There is complex and interesting work, with a good salary, a great team, company, and prospects.
  • We need a programmer who is passionate about their work, has a good grasp of algorithms, math, and statistics, and wants to write Lisp.
  • Lisp experience is required, but not necessarily from the real world: open-source or hobby projects will do as well. In that case, though, the desire and ability to grow quickly in this direction matter.
If interested, write to vseloved@gmail.com...

2013-06-21

Free "Lisp Hackers" Ebook

More than a year has passed since I started publishing a series of interviews with Lisp hackers. It was a great project, but I'm going to wrap it up for now. All of the interviews are collected in a free ebook that is available through Leanpub. You should share it with your friends and on the interwebs! ;)

Below are a few observations on the whole series that I think are interesting.

The Road to Lisp

There was a famous survey on the Association of Lisp Users site called "The Road to Lisp". I have somewhat replicated it in the interviews, and some of the answers were quite surprising, as far as I'm concerned.

First of all, people came to Lisp through friends who gave them a copy of SICP or showed them some fascinating Lisp programs. There are also mentions of online forums (usenet, irc, etc.)

So SICP is the most influential book, followed by Peter Norvig's PAIP. The other mentioned sources are "Writing GNU Emacs Extensions" by Bob Glickstein, "The UNIX-Haters Handbook", and Paul Graham's "Beating the Averages".

Two interesting things get notable mentions as leading to Lisp: computer graphics and Caml-light (for the French guys).

And receiving it as a legacy from one's dad is simply priceless!

Parens transfer

I'd also like to add my experience here: the 2 things which brought me to Lisp were Eric S. Raymond's "How to Become a Hacker" which mentioned that you need to learn Lisp to become a better programmer, and Peter Seibel's great "Practical Common Lisp", which came out after all the interviewed hackers were already into Lisp, so they couldn't have been influenced by it.

Dislikes

This was an interesting question, because it made most of the interviewees step out of their comfort zone a little: after all, most of them are fond of Common Lisp. And the answers are interesting, although expected.

Basically, it boils down to 3 things:

  • the arrogance of the community and/or outsiders' prejudice against it
  • lack of static typing and other ways to express constraints on the program
  • disdain for the platform it lives on, which includes problems with embedding into other systems, no good way to write one-liners, etc.

Also noted were:

  • necessity to struggle to achieve high performance
  • the unspecified corners, but this problem is becoming less and less acute

A Lisp project

This was a question proposed by Zach Beane: what Lisp project would you do if you weren't constrained by resources and other stuff? For the sake of completeness, I've also followed up with him on it, and here's his answer:

Right now I'd love to spend more time polishing Quicklisp. I'd love to make a good system for discovering the functionality you need (it's not easy to make the connection from "I need to fetch web pages" to "drakma"). I'd like to make interfaces for accepting user feedback and reviews. I'd like to make it easy to make connections from a project to its website, bug tracker, mailing list, documentation, etc. I'd like to publish my build tools and build logs. I'd like to make EVERYTHING a lot better, to make the system not just better than what CL had before but better than what other language ecosystems offer, too. It's all a lot of work.

As a fantasy project, I'd love to make a system for interesting visualization of complex data, something where it's easy to splat something quick and dirty on the screen/page, but which can grow in capability as the need arises. Or maybe just some tool for making audiovisual toys, with cool pictures and noises coming out.

The other interesting answers were:

  • another take on reflection
  • an OS, based on Linear Lisp, bootstrapped from Racket or Maru
  • Lisp bridges to other runtime systems
  • a Common Lisp object database
  • a Go-playing program
  • and, surely, one's own Lisp dialect or even a separate programming language :)

Lisp companies

It's one of the popular myths that it's impossible to find a Lisp job. Well, it's definitely harder than finding a Java one, but as for the job's quality, YMMV. Some of the Lisp companies were mentioned in the interviews:

  • ITA Software, which employed up to 50 Lisp developers in its Boston office and was bought by Google for almost $1B
  • Franz in the Silicon Valley
  • Teclo Networks in Switzerland
  • MSI in Japan
  • NovaSparks, which operates from Boston and Paris

But a lot more weren't mentioned. To name a few:

  • probably the biggest Lisp company in the world - the Portuguese SISCOG, with 70+ Lisp developers
  • Clozure Associates, a band of Lisp gurus from the US East Coast
  • Copyleft from Norway
  • RavenPack from Spain
  • the Australian division of Accenture
  • Agri-Esprit from France

I would say that most of them work in pretty interesting domains and on challenging problems. There are also many more one- or two-man Lisp shops scattered around the world. So, yes, Lisp companies are rare, but there's nothing wrong with relying on Lisp in a company: it won't fail you and may even bring some outstanding results. Not to mention the fun of the process itself...


2013-06-12

Lisp Hackers: Marc Battyani

Marc Battyani is one of the people who put Lisp at the foundation of their business, and he doesn't seem to regret that decision. And the business itself is exploring very interesting aspects of high-performance and reconfigurable computing. Besides, he has a notable open-source contribution in the cl-pdf/cl-typesetting libraries (which even I use, so thanks Marc!). He elaborates on all of this and more in much detail in the interview.

Tell us something interesting about yourself.

I think maybe the most unusual thing I do is that I work on very different application domains, which are sometimes even at completely opposite extremes both in electronics and in software. For instance, from ultra low power smart sensors based on $1 microprocessors, which will run continuously for 5+ years powered only by a small coin battery, up to the world's lowest latency supercomputers based on FPGAs costing thousands of dollars per chip. On the software side I use programming languages ranging from the lowest level languages like assembly, or even below with VHDL, up to really high level languages like Common Lisp.

I've always enjoyed mixing electronics design and higher level computer science, and all that diversity probably gives me a different and original view on, say, computing and programming in general.


What's your job? Tell us about your company.

I'm the CTO of NovaSparks, a startup I founded in 2008 to make ultra-low latency FPGA-based supercomputers for the financial markets. BTW, these things are really incredibly fast. For instance, on 10Gb/s Ethernet market data packets coming from exchanges like the NASDAQ, we process the IP/UDP/multicast network stack, extract the messages from the packets, parse/decode/filter/normalize those messages, maintain the indexed order book data structures, aggregate the price levels per stock, generate output messages and finally send them to a server through PCI-express or 10Gb/s Ethernet network stacks. The nice thing is that we do all that fully pipelined at a rate of one message every 12 nanoseconds! To get an idea of how fast that is, in 12 ns light will only travel 3.6 meters (11.8 ft). Another way to view this performance is that the system can process 83 million financial messages per second without any queuing.

As an aside it is interesting to note that the Domain Specific Language Compilers and various other tools written in Common Lisp have been key enabling factors for the creation of NovaSparks.

I'm also the CEO and CTO of Fractal Concept, which is the company where we were developing that technology before starting NovaSparks, but as NovaSparks has been taking more than 100% of my time for the last few years, Fractal Concept has been less active, with only the development and maintenance of the smart sensors going on.


Do you use Lisp at work? If yes, how have you made it happen? If not, why not?

I've always used Common Lisp for most of my work and even when I need to use other programming languages (like VHDL for NovaSpark's FPGAs or javascript for web stuff) I generally use Common Lisp at least for a lot of related tasks like prototyping, designing and testing algorithms, extracting statistics, performing simulations, generating test data, analysing test runs.

The next step is very often to generate some or all of the code in other languages by designing various domain-specific languages (DSLs) which take care of the tedious aspects of programming in less powerful languages. I really like it when, from a few hundred lines written in an easy-to-use high-level DSL, we generate 10000 to 60000+ lines of very low level VHDL code, saving months of development.

About that point, I think I would even say that I'm mostly doing Language Oriented Programming, trying to abstract the domain specific knowledge into various domain specific languages with very different syntaxes: s-expressions, C-like or other languages, or even sometimes adding to the mix some GUI or data-based inputs. Then those DSLs generate most, if not all, of the application code in whatever language is needed, be it VHDL, asm, C, javascript or Common Lisp.


What brought you to Lisp? What holds you?

A friend of mine gave me a copy of Le_Lisp, a version of Lisp used in the '80s by French universities and engineering schools to teach high level programming concepts. At that time I was mostly programming in Z80 assembly language on a TRS80, and the straight jump from ASM to Lisp was quite a shock and an eye opener.

Since then I've used Common Lisp for countless projects, ranging from a few hours of work to multi-year ones, and I still find it awesome. Where else can you find a language providing such powerful and multi-paradigm features as CLOS, generic functions, the MOP, macros, closures, s-expressions, lambdas, an interactive REPL (Read Eval Print Loop), native compilation of applications, on-the-fly native compilation of generated code, the condition system, interactive live programming, real time live debugging of running software and more!


What's the most exciting use of Lisp you had?

I've got so many of them that picking only one would be difficult, so here are a few:
  • softscan: real time driving and data acquisition of automated non-destructive testing installations with real time 3D display (a first in 1995)
  • my web application framework: automatic generation of really complex web applications (the whole thing, from the HTML/javascript front-end to the server back-end). It has been fully Ajax-based, with dynamic modification of the displayed pages without reloading. This seems obvious nowadays, but it was also a first in 2000, and IIRC the term "Ajax" was only coined 5 years later.
  • hpcc: the awesome DSL-to-VHDL compiler. In many respects VHDL is the complete opposite of Common Lisp. It's a hardware description language used to program FPGAs, and it deals with the lowest possible programming level, with data types like signals, clocks, and bits. Programming at that level is really tedious, time consuming and verbose, so it's really a relief to be able to generate tens of thousands of lines of highly optimized VHDL code from just a few hundred lines of some high level Domain Specific Language.
  • cl-pdf/cl-typesetting: Being able to generate the first PDF files from scratch in only 107 lines of Common Lisp was an aha moment.

What do you dislike the most about Lisp?

The language itself is somewhat good enough and anyway Common Lisp makes it really easy to change most of itself to add the new and cool stuff or ideas of the day.

In fact, what I dislike about Lisp is outside the language and more related to the (mis)perception that people have about it. Having to justify its use almost every time, and sometimes even to fight to be able to use it, is somewhat annoying and tiresome.

Of course, Lisp is not for everybody, but this is not a reason for nobody to use it. In fact, I view Lisp as some kind of amplifier which will give awesome results when used by brilliant developers, but will end up giving an incredible mess when used by people without any clue about what they are doing. That's one aspect that makes Lisp very different from other languages, which are specifically designed to try to normalize and average what people can do with them.


Describe your workflow, give some productivity tips to fellow programmers.

I obviously use emacs and all the nice stuff that works with it, like slime, org-mode, magit and lots of other packages. I switched from subversion (svn) to git for all my projects, and I now use git mostly from emacs with magit.

In general I have several instances of LispWorks running as standalone apps on my laptop, but connected to Slime through M-x slime-connect rather than being started by slime. I do a lot of exploratory programming and interactively play with the code while I write it, and if I have to optimize some code I very often use #'disassemble.

As for my Lisp coding style, I really do like generic functions, CLOS and the MOP. As mentioned earlier, I try to generate as much of the code as possible, using macros in simple cases and full blown DSLs in more complex ones. Sometimes I don't mind making some premature optimizations when I know that speed will be important for some project. BTW, when speed matters I like to generate and compile code on the fly when it's useful (and possible) and try to generate really optimized code. It's always cool when you have Common Lisp applications running much faster than C++ ones.

I also refactor a lot. Generally when I start on some new stuff I try to make a first version that works well enough but then once it's done or generally after some time I have some cool ideas to make something much better. At that point I'm really happy that Common Lisp makes refactoring very easy thanks to optional and key args, macros and generic functions with multiple dispatch because all that enables me to make even very deep modifications very easily. I really find those Common Lisp features make software very resilient to code modifications.

What are the advantages of using Lisp in the HPC field? What are the drawbacks? Are you happy with your technology choice?

The HPC field is huge, and I can only talk about the small corner of it I know, which is the ultra-low latency processing of vast amounts of data we do at NovaSparks. For that, the various DSL compilers generating VHDL code have really been a key enabling factor. Programming FPGAs is notoriously difficult and time consuming, and this is the major factor limiting the use of reconfigurable computing in the HPC field. The use of those DSL compilers has made it possible to use those FPGAs on a more practical basis for processing financial data.

BTW, I'm looking for other HPC domains in which those DSL-to-VHDL compilers could be used, so feel free to contact me if you have some ideas.


If you had all the time in the world for a Lisp project, what would it be?

Again, that's a difficult choice. Speaking about pure Lisp projects, I've been meaning for a long time to clean up and modernize my web app framework before releasing it as open-source, so maybe that would be a good project to start with.

Otherwise, as mentioned above, I'm looking for possible applications and opportunities for leveraging the DSL => hardware compilers and the FPGAs around some Big Data processing. I have a few ideas, but nothing specific for now.


Anything else I forgot to ask?

A conclusion? (Everybody wants a conclusion.) So here is mine:

Common Lisp is Awesome! It is much easier to use nowadays thanks to all the projects and libraries that Quicklisp makes available. If you do not know Common Lisp, then learn it; it will make you a better programmer anyway, even if you do not use it directly afterwards.



2013-06-09

NLTK 2.1 - Working with Text Corpora

Let's return to the start of chapter 2 and explore the tools needed to work easily and efficiently with various linguistic resources.

What are the most used and useful corpora? This is a difficult question to answer because different problems will likely require specific annotations and often a specific corpus. There are even special conferences dedicated to corpus linguistics.

Here's a list of the most well-known general-purpose corpora:

  • Brown Corpus - one of the first big corpora and the only one in the list really easily accessible - we've already worked with it in the first chapter
  • Penn Treebank - Treebank is a corpus of sentences annotated with their constituency parse trees so that they can be used to train and evaluate parsers
  • Reuters Corpus (not to be confused with the ApteMod version provided with NLTK)
  • British National Corpus (BNC) - a really huge corpus, but, unfortunately, not freely available

Another very useful resource, which isn't structured like the academic corpora mentioned above but has other dimensions of useful connections and annotations, is Wikipedia. And there's a lot of interesting linguistic research being performed with it.

Besides, there are two additional valuable language resources that can't be classified as text corpora at all, but rather as language databases: WordNet and Wiktionary. We have already discussed the CL-NLP interface to WordNet, and we'll touch on working with Wiktionary in this part.

Processing corpora

All of the text corpora we'll encounter come as sets of files in some format, either custom or standard (like XML). So to be able to access this data we'll need to implement reading and processing of individual files and assembling these files into groups. Besides, as some corpora are really large (for instance, a zipped instance of the Reuters corpus amounts to 1.5+ GB), it is also useful to be able to process the data one text at a time without retaining it all in memory. All this defines a rather simple protocol for corpora processing.

A corpus is just a list of texts that can be arbitrarily grouped.

(defstruct corpus
  desc texts groups)

A text has a name and several representations: basic texts may have a raw, a cleaned-up and a tokenized one. We'll also see texts with other representations, like the parse trees for treebanks - we'll use structure inheritance to describe them.

(defstruct text
  name raw clean tokens)

The protocol itself comprises 3 operations:

(defgeneric read-corpus (type path))

(defgeneric read-corpus-file (type source))

(defgeneric map-corpus (type path fn))

read-corpus creates a corpus object, and it uses read-corpus-file to read individual files and return their contents in the multiple forms potentially needed for further processing. map-corpus calls the function fn with every text produced by calling read-corpus-file on the corpus, in arbitrary order. It works more like maphash than mapcar, because it doesn't collect the results of applying fn.

The methods of these functions are usually boring. For text-based formats we implement read-corpus-file by reading each file's text either whole or line-by-line with the dolines macro, performing tokenization and some post-processing. For XML-based variants we may use SAX processing with the Closure XML library.
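
For illustration, a method for a hypothetical plain-text corpus might look roughly like this (a sketch: the :plain-text type and the exact set of return values are made up for this example, while the tokenizer calls mirror the ones used elsewhere in CL-NLP):

(defmethod read-corpus-file ((type (eql :plain-text)) source)
  (let* ((raw (read-file source))
         (clean (string-trim +white-chars+ raw)))
    (values raw
            clean
            (mapcar #`(ncore:make-token :word %)
                    (mapcan #`(ncore:tokenize ncore:<word-tokenizer> %)
                            (ncore:tokenize ncore:<sentence-splitter> clean))))))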

Processing the NPS Chat corpus

Let's look at the NPS Chat corpus that is provided with NLTK. It has a rather simple XML format. The entries are called Posts and have the following structure:

<Post class="Emotion" user="10-19-30sUser2">10-19-30sUser11 lol
    <terminals>
        <t pos="NNP" word="10-19-30sUser11"/>
        <t pos="UH" word="lol"/>
    </terminals>
</Post>

We process each file by calling cxml:parse with an instance of the nps-chat-sax class.

(defmethod read-corpus-file ((type (eql :nps-chat)) source)
  (cxml:parse source (make 'nps-chat-sax)))

For this object we'll specialize the parser methods. It will also serve as storage for the processing state, since SAX parsing operates with callbacks that have no access to the current context other than the parser object.

(defclass nps-chat-sax (sax:sax-parser-mixin)
  ((texts :initform nil)
   (tokens :initform nil)
   (classes :initform nil)
   (users :initform nil)
   (cur-tag :initform nil)
   (cur-tokens :initform nil)))

read-corpus-file will return a list of each post's contents and tokens, as well as the lists of posts' classes and users, which are indicated in the meta attributes. Technically, these values will be produced by the sax:end-document method that is called at the end of SAX processing:

(defmethod sax:end-document ((sax nps-chat-sax))
  (with-slots (texts tokens users classes) sax
    (values nil  ;; raw text isn't useful for this corpus
            (reverse texts)
            (reverse tokens)
            (reverse classes)
            (reverse users))))

In the sax:start-element method we store the current tag and record the attributes of each post or token, depending on the element.

(defmethod sax:start-element ((sax nps-chat-sax)
                              namespace-uri local-name qname attributes)
  (with-slots (classes users cur-tokens cur-tag) sax
    (case (setf cur-tag (mkeyw local-name))
      (:post (push (attr "class" attributes) classes)
             (push (attr "user" attributes) users))
      (:t (push (make-token :word (attr "word" attributes)
                            :tag (mkeyw (attr "pos" attributes)))
                cur-tokens)))))

In sax:characters we record current post's text:

(defmethod sax:characters ((sax nps-chat-sax) data)
  (with-slots (cur-tag texts) sax
    (when (eql :terminals cur-tag)
      (push data texts))))

And in sax:end-element we dump current tokens:

(defmethod sax:end-element ((sax nps-chat-sax) namespace-uri
                            local-name qname)
  (when (eql :terminals (mkeyw local-name))
    (with-slots (tokens cur-tokens) sax
      (push (reverse cur-tokens) tokens)
      (setf cur-tokens nil))))

If you have used SAX parsing in some other language, like Python, you should immediately recognize this approach and see how it can be translated to that language almost line-by-line.

Processing the Reuters corpus

Slightly more complex processing is required for the Reuters corpus. The reason is that, unlike the Chat corpus, which we assumed to reside in a filesystem directory, it's not always reasonable to extract this corpus: it's big and is stored as a two-level zip archive, with individual archives packed inside another big archive. Extracting all of them is, obviously, tedious.

For such a corpus it makes sense to resort to using map-corpus. Here's the definition of its method for this corpus:

(defmethod map-corpus ((type (eql :reuters)) path fn)
  (zip:with-zipfile (zip path)
    (zip:do-zipfile-entries (name entry zip)
      (unless (char= #\/ (last-char name))
        (with-zipped-zip (in entry :raw t)
          (mv-bind (_ text tokens __ ___ headline byline dateline)
              (read-corpus-file :reuters in)
            (declare (ignore _ __ ___))
            (funcall fn (make-reuters-text
                         :name name
                         :clean text
                         :tokens tokens
                         :headline headline
                         :byline byline
                         :dateline dateline))))))))

It relies on read-corpus-file, which processes individual XML documents just like we saw earlier for the NPS Chat corpus. But the documents are fed into it not by fad:walk-directory, but with the help of the utilities provided by David Lichteblau's excellent zip library.

(zip:with-zipfile (zip path)
  (zip:do-zipfile-entries (name entry zip)
    (unless (char= #\/ (last-char name))
      (with-zipped-zip (in entry :raw t)

In this snippet we open the zip file at path with with-zipfile and iterate over its entries with do-zipfile-entries. These are the usual Lisp patterns for handling such kinds of resources. Yet inside the zip file we find another layer of zip archives. To remove unnecessary hassle and boilerplate I've added an additional macro, with-zipped-zip. It has, to some extent, to replicate the functionality of with-zipfile, but it takes its data from the contents of one of the entries of the already opened zip file. Another twist is that it optionally gives access to the raw unzipped binary stream, not yet converted to a text representation - this is useful for CXML, which can work with such streams efficiently.

(defmacro with-zipped-zip ((name stream zipfile-entry
                                   &key (external-format :utf-8) raw)
                           &body body)
  (with-gensyms (zipstream v end entry entries zip x n)
    ;; here we transform the archives contents into an input stream
    ;; and then read it
    `(flex:with-input-from-sequence
          (,zipstream (zip:zipfile-entry-contents ,zipfile-entry))
        (let ((,v (make-array (zip:zipfile-entry-size ,zipfile-entry)
                              :element-type '(unsigned-byte 8))))
          (read-sequence ,v ,zipstream)

          ;; here we look for the central directory header
          ;; and move to it in the stream
          (if-it (search #(80 75 5 6) ,v :from-end t)
                 (file-position ,zipstream it)
                 (error "end of central directory header not found"))

          ;; here we create a corresponding zipfile object for the entry
          (let* ((,end (zip::make-end-header ,zipstream))
                 (,n (zip::end/total-files ,end))
                 (,entries (make-hash-table :test #'equal))
                 (,zip (zip::make-zipfile
                        :stream ,zipstream
                        :entries ,entries
                        :external-format ,external-format)))
            (file-position ,zipstream
                           (zip::end/central-directory-offset ,end))

            ;; and here we read individual entries from the archive
            (dotimes (,x ,n)
              (let ((,entry (zip::read-entry-object ,zipstream
                                                    ,external-format)))
                (set# (zip:zipfile-entry-name ,entry) ,entries ,entry)))

            ;; finally, we're able to access entries in the same manner
            ;; as we'll do for the ordinary archive
            (do-entries (,name ,stream ,zip
                         :external-format ,external-format :raw ,raw)
              ,@body))))))

As you see, this macro uses a lot of zip's internal symbols, as it implements functionality similar to what the library already does, but in a way not anticipated by its author...

Let's see how it works:

NLP> (map-corpus :reuters "Reuters.zip"
                 #`(print (text-name %)))
"Reuters/Eng Lang_Disk 1/19960824.zip/12536newsML.xml"
"Reuters/Eng Lang_Disk 1/19960824.zip/12537newsML.xml"
...

Processing the Penn Treebank

The Penn Treebank is a very important corpus which is also not easy to obtain: you have to order a CD from the Linguistic Data Consortium (LDC) and pay more than 1.5 thousand dollars for it. Yet, there are 2 workarounds:

  • NLTK data provides 5% of the treebank, so you can use it for experimenting with the data format
  • There's a more recent corpus, OntoNotes, which includes the Penn Treebank; you also have to obtain it from the LDC, but it costs only $50 (I learned about its existence on StackOverflow, ordered it, and it indeed has a treebank inside, as well as a lot of other useful linguistic annotations; the only thing I'm not sure about is whether it is an exact copy of the Penn Treebank)

Let's see how to load NLTK's treebank sample. The obvious thing about treebanks is that they are basically Lisp code, so it should be very easy to parse the data with Lisp. The only catch is that the treebank doesn't obey Lisp's formatting rules for strings and doesn't escape special characters like quote and pipe. So the task is to handle all that properly.

In CL-NLP we have several utilities for dealing with trees: the macros dotree and doleaves, which execute arbitrary code for each subtree or leaf in a tree for side effects, and their counterpart functions maptree and mapleaves, which allow creating isomorphic tree structures by applying a function to all of a tree's nodes or leaf nodes.
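
For instance, mapleaves can be sketched in a few lines (a minimal version; the actual CL-NLP utility may differ):

(defun mapleaves (fn tree)
  "Return a tree isomorphic to TREE with FN applied to each leaf."
  (if (atom tree)
      (funcall fn tree)
      (mapcar (lambda (subtree) (mapleaves fn subtree))
              tree)))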

So, reading a treebanked tree will be performed in 2 steps:

  • first, we prepare the tree for loading by properly escaping everything:

    (defun prepare-tree-for-reading (string)
      (strjoin " " (mapcar #`(cond-it
                               ((char= #\( (char % 0))
                                (cond-it
                                  ((member (sub % 1)
                                           '("." "," ";" ":" "#" "''" "``")
                                           :test 'string=)
                                   (fmt "(|~A|" (sub % 1))))
                                  ((position #\| %)
                                   (fmt "(|~A\\|~A|"
                                        (sub % 1 it) (sub % (1+ it))))
                                  (t %)))
                               ((position #\) %)
                                (if (zerop it)
                                    %
                                    (fmt "\"~A\"~A" (sub % 0 it) (sub % it))))
                               (t (fmt "\"~A\"" %)))
                           (split-sequence-if #`(member % +white-chars+)
                                              string :remove-empty-subseqs t))))
    
  • and then we read the tree with the Lisp reader and separately collect its tokens:

    (defmethod read-corpus-file ((type (eql :treebank)) file)
      (let ((raw (string-trim +white-chars+ (read-file file))))
        (with-input-from-string (in (prepare-tree-for-reading raw))
          (loop
             :for tree := (car (read in nil)) :while tree
             :collect raw :into raws
             :collect tree :into trees
             :collect (let ((pos 0)
                            toks)
                        (dotree (subtree tree)
                          (when (and (listp subtree)
                                     (single (cdr subtree))
                                     (atom (cadr subtree)))
                            (let ((word (second subtree)))
                              (push (make-token
                                     :beg pos
                                     :end (1- (incf pos (1+ (length word))))
                                     :word word
                                     :tag (first subtree))
                                    toks))))
                        (reverse toks))
             :into tokens
             :finally
             (return (values (strjoin #\Newline raws)
                             (mapcar #`(strjoin #\Space
                                                (mapcar #'token-word %))
                                     tokens)
                             tokens
                             trees))))))
    

Other corpora

Most of the other corpora use formats similar to the Brown Corpus, the Reuters Corpus, or the Penn Treebank, so their readers may easily be defined by analogy. Besides, the implementations of the read-corpus methods assume a certain way of presenting the corpus to the library: in a filesystem directory, in a zipfile, etc. Certainly, a corpus may come in a different form, but it will take just a couple of changed lines of code to re-purpose the methods to handle another representation.

Working with some other collections of documents presented in the NLTK book, like the Project Gutenberg collection of famous literary texts in the public domain or the Inaugural speeches collection, is just trivial - you can see how it's done in our implementation of Chapter 1.

For instance, if we have the Project Gutenberg collection in some directory dir, we can process it using the following pattern:

(fad:walk-directory
 dir
 #`(let ((text (string-trim +white-chars+ (read-file %))))
     ... do some processing on the raw text of the corpus ...
     ))

Working with Wikipedia and Wiktionary

Unlike most academic corpora, which are not so easy to obtain, Wikipedia and Wiktionary give you hassle-free access to full dumps of their data in various formats: SQL, XML, and text. For instance, we can obtain the latest dump of the English Wiktionary from http://dumps.wikimedia.org/enwiktionary/.

I found it most convenient to process them using the same XML SAX parsing approach that was utilized for the Reuters and NPS Chat corpora. The difference is that inside the XML documents a special Mediawiki markup is used to add metadata. It is, in fact, a heroic feat to parse this markup, because it doesn't have a clear, precise spec - its real spec is the PHP code of Wikipedia's parser - and different people tend to (ab)use different variations of annotations to express the same or similar things, as well as just use the markup in an untidy manner. I call such documents semi-structured, as opposed to unstructured raw text and structured XML markup. Still, there's a lot of value in this annotation because of its crowdsourced nature, which allows capturing much more information than most centralized efforts do. Basically, there are 2 ways to work with Mediawiki markup: using regexes or implementing a complete parser.

Below is a snippet of code that uses the regex-based approach to extract some information from the English Wiktionary - English word definitions. In Wiktionary markup the definitions are inside each language's section (denoted with a ==Lang== marker, where Lang can be English or some other language). The definitions themselves are placed on their own lines and start with one of these markers: #, #:, #*, #*:, #*::, * [[.

(defvar *defs* (list))

(defclass word-collector (sax:sax-parser-mixin)
  ((title :accessor title :initform nil)
   (tag :accessor tag :initform nil)
   (text :accessor text :initform ())
   (itemcount :accessor itemcount :initform 0)
   (tagcount :accessor tagcount :initform 0)))

(defmethod sax:end-element ((sax word-collector)
                            namespace-uri local-name qname)
  (with-slots (title tag text) sax
    (when (string= "text" local-name)
      (let ((lang-marker "==English==")
            (whole-text (strjoin "" (reverse text))))
        (when-it (search lang-marker whole-text)
          (let ((start (+ it (length lang-marker))))
            (dolist (line (split-sequence
                           #\Newline (sub whole-text start
                                          (re:scan "[^=]==[^=]" whole-text
                                                   :start start))))
              (when (some #`(starts-with % line)
                          '("# " "#: " "#* " "#*: " "#*:: " "* [["))
                (push (clean-words line) *defs*)
                (incf (itemcount sax))))))))))

(defun clean-words (line)
  "Return a list of downcased words (only alpha-chars or hyphens
   in the middle) found in LINE."
  (mapcar #`(string-downcase (string-trim "-" %))
          (remove-if #`(or (blankp %)
                           (loop :for char :across % :do
                             (unless (or (alpha-char-p char)
                                         (char= #\- char))
                               (return t))))
                     (re:split +punctuation+ (raw-text line)))))

(defun raw-text (line)
  "Return LINE without special-purpose Wiktionary chars."
  (re:regex-replace-all "({[^}]+})|(\\[\\[)|(\\]\\])|('''?)"
                        (sub line (1+ (position #\Space line)))
                        ""))

As the task is clearly defined and limited - extract just the text of definitions - it is possible to use regexes here:

  • one regex to spot the interesting lines
  • another one to filter out all the unnecessary markup.

Using corpora

Now that we can load and play with our corpora, let's take a look at some NLTK examples.

First, let's examine the Brown corpus. Here are its categories:

NLP> (ht-keys (corpus-groups *brown*))
(:PRESS-REPORTAGE :PRESS-EDITORIAL :PRESS-REVIEWS :RELIGION
 :SKILL-AND-HOBBIES :POPULAR-LORE :BELLES-LETTRES
 :MISCELLANEOUS-GOVERNMENT-HOUSE-ORGANS :LEARNED :FICTION-GENERAL
 :FICTION-MYSTERY :FICTION-SCIENCE :FICTION-ADVENTURE :FICTION-ROMANCE
 :HUMOR)

As you see, there is no "news" category which is mentioned in the NLTK example. Probably, what they refer to is :press-reportage.

NLP> (take 9 (mapcar #'token-word
                     (flatten (mapcar #'text-tokens
                                      (get# :press-reportage
                                            (corpus-groups *brown*))))))
("The" "Fulton" "County" "Grand" "Jury" "said" "Friday" "an" "investigation")

NLP> (take 9 (mapcar #'token-word
                     (flatten
                      (mapcar #'text-tokens
                              (remove-if-not #`(string= "cg22" (text-name %))
                                             (corpus-texts *brown*))))))
("Does" "our" "society" "have" "a" "runaway" "," "uncontrollable" "growth")

NLP> (setf *print-length* 3)
NLP> (mapcan #`(ncorpus::sentences-from-tokens (text-tokens %))
             (get# :press-reportage (corpus-groups *brown*)))
((#<The/at 0..3> #<Fulton/np-tl 4..10> #<County/nn-tl 11..17> ...)
 (#<The/at 166..169> #<jury/nn 170..174> #<further/rbr 175..182> ...)
 (#<The/at 409..412> #<September-October/np 413..430> #<term/nn 431..435> ...)
 ...)

The function sentences-from-tokens here operates under the assumption that every token with the . tag ends a sentence, and it splits the token list into sentences on one or more such tokens.
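
A sketch of such a splitter, assuming a token-tag accessor and that the sentence-final tag reads as the keyword :|.| (the real ncorpus function may differ, for instance in how it merges runs of such tokens):

(defun sentences-from-tokens (tokens)
  "Split TOKENS into sentences, ending a sentence at each token
   carrying the . tag."
  (let (sentences current)
    (dolist (tok tokens)
      (push tok current)
      (when (eql :|.| (token-tag tok))
        (push (reverse current) sentences)
        (setf current nil)))
    (when current
      (push (reverse current) sentences))
    (reverse sentences)))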

The presented code may seem much more elaborate than the NLTK version:

>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

Yet, it should be understood that it is easy to add a ton of various utility accessors, as is done in NLTK, but they will inevitably make the code more complex and harder to maintain. And the question is: how frequently are they going to be used, and what will we miss anyway? Coding such utilities is very pleasant and easy in the functional style, as shown above, so they are left outside of the scope of cl-nlp, at least until proven essential...

(NB. Don't forget to set *print-length* back to nil).

Next, we'll once again be looking at frequency distributions. First, we need to build an ngram index from the Brown corpus words in the "news" category. In chapter 1 we've already done a very similar thing:

NLP> (index-ngrams
      1 (mapcar #'token-word
                (flatten (mapcar #'text-tokens
                                 (get# :press-reportage
                                       (corpus-groups *brown*))))))
#<TABLE-NGRAMS order:1 count:14395 outcomes:108130 {101C2B57E3}>
NLP> (defvar *news-1grams* *)  ;; * is the previous returned value
NLP> (dolist (m '("can" "could" "may" "might" "must" "will"))
       (format t "~A: ~A " m (freq *news-1grams* m)))
can: 93 could: 86 may: 66 might: 38 must: 50 will: 389

Funnily enough, some numbers don't match those in NLTK:

can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Well, we can double-check them from a different direction:

NLP> (length (remove-if-not
              #`(string= "can" %)
              (mapcar #'token-word
                      (flatten (mapcar #'text-tokens
                                       (get# :press-reportage
                                             (corpus-groups *brown*)))))))
93

Looks like our calculation is correct.

Next is conditional frequency distribution, but we'll leave this topic for the next part that is fully dedicated to that.

So far so good. We have examined common approaches to handling various corpora. There are, basically, two main things you need to do with them: load & parse them, and filter them for interesting information. Loading, for most corpora, will be performed either with regexes over raw strings, with XML SAX parsing, or with the Lisp reader. And for filtering, Lisp offers a versatile toolset of higher-order functions: mapcar, remove-if-not, position, and reduce - to name a few.
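
For instance, here's a small sketch in that style (the helper name is made up) of counting the tokens in a corpus group that satisfy an arbitrary predicate:

(defun count-matching-tokens (pred group)
  "Count tokens across all texts in GROUP for which PRED holds."
  (reduce #'+ (mapcar #`(count-if pred (text-tokens %))
                      group)))

;; e.g. the "can" count from above could also be obtained with:
;; (count-matching-tokens #`(string= "can" (token-word %))
;;                        (get# :press-reportage (corpus-groups *brown*)))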

2013-04-29

Errors in your Platform

One of the perennially trending topics in our pop-culture-like computer industry is dealing with errors. There are whole movements addressing it. Yet, in my humble opinion, these movements often see errors in a rather narrow, I would even say simplistic, manner: like the proverbial example of passing a string to a function that expects an integer. But, in fact, the scope of errors that can occur in a program is indefinite. In a sense, the language of errors is Turing complete, and creating a program can, from a certain point of view, be seen as devising a way of working around all relevant errors.

Today, I'd like to talk about a particularly nasty type of errors - those that occur in the platform on which we're building the program. And the reality is such that we base our programs on at least several layers of platforms: a hardware platform, an OS and its standard library, a language's VM and the language's standard library, some user-level framework(s) - to name a few. There are several aspects of this picture - some positive for us and some negative.

  • the lower we go, the more widely used the platform is and thus the better it is tested and the more robust it is (positive)
  • at the same time it usually becomes more general and so more complex and feature-full, which means that some of the more obscure features may actually not be so well tested (negative)
  • the platform is usually heterogeneous: each level has its own language with its own semantics and concepts, often conflicting with the ones at the other levels (negative)

There's a revealing rant by node.js creator Ryan Dahl on the topic of such platforms - now removed, but available in the all-remembering Internet - "I hate almost all software".

Platform errors are so nasty because of two things:

  • you learn to trust the platform (or you can easily go insane otherwise), so you first search for errors in your own code
  • the majority of platforms are not built with a priority on traceability, so it's an order of magnitude harder to find the problem when you finally start looking for it IN the platform

And the thing is, there are severe bugs in all of the platforms we use. The Pentium FDIV bug is one acute example, but let's talk Java here, because there are so many people bragging about how rock solid the JVM is. One of the problems of the JVM is jar hell - the issue that haunts almost every platform under different names (like DLL hell). We had an encounter with it recently, when after an update our server started hanging for no apparent reason. After an investigation we found that part of our code was compiled against a more recent version of the Gson library for parsing JSON, while the main server used the older version. The interesting thing is that no error condition was signalled. I wonder how many projects make the same mistake and don't even know the source of their problems. We had a similar but more visible issue with different versions of the servlet-api, but that time it was caught at compile time because of a minor API inconsistency.

Another problem we fought fruitlessly and eventually had to work around with external tools was the JVM not properly closing network sockets. Yes, it leaks sockets in some scenarios, and we couldn't find a way to prevent it. This is an example of another nasty kind of platform error - the one that happens at layer boundaries. Also, many Java libraries, like the ubiquitous Jetty web server, leak memory in their default settings. Actually, I wonder what percentage of Java libraries doesn't leak memory by default?.. This is not to say that the JVM is exceptional in this regard - other platforms suffer from the same problems. At the same time, to the JVM's credit, it has some of the tooling for fighting platform errors that many other platforms lack.

So, what are the ways to deal with those errors? Unfortunately, I don't have a lot of answers here, but I hope to discover some with this article. Here are some of the suggestions:

First of all, don't trust the platform.
Try to check everything and cross-validate your assumptions with external tools. This is easier said than done, and there's a tradeoff to make somewhere between slackness and paranoia. But if you think of it beforehand, you can see ways to set up the system so that it is easier to deal with the problem when the time comes.
Have the platform's source code at hand.
This is one dimension of traceability - the static one. It won't help you easily find all the problems, but without it you're simply at the mercy of the platform's creator. And I wouldn't trust those guys (see point 1).
Do proper error handling.
There's a plethora of information on that, and I won't repeat it here. See the error handling manifesto for one of the good examples.

But the biggest win lies not in fighting the existing systems, but in removing or diminishing the source of the problems - i.e. creating more programmer-friendly platforms. So the most important recommendations I see are for platform authors (and, in fact, most of us become them to some extent when we release a utility library or a framework).

Build platforms with traceability in mind.
One example of such traceability is the JVM's VisualVM, which is a great tool for getting visibility into what's happening inside the platform. Another example is Lisp images, whose state you can inspect interactively with Lisp commands. For instance, Lisp has the built-in operator TRACE, which lets you see the invocations of any function of interest, and INSPECT, an interactive object browser (a short illustration follows right after these recommendations). These things are really hard to add after the fact: look, for example, at the trouble of bringing DTrace into the Linux world.
Create homogeneous platforms.
Why? Because in homogeneous platforms the boundaries are not so sharp, so there tend to be fewer bugs on them. And it's much easier to build in traceability: for example, you don't need to deal with multiple stacks, multiple object representations, etc. Once again, Lisp comes up here, because with its flexibility it allows building systems at different levels of abstraction in the same language. And that was exemplified by the Lisp Machines - see Kalman Reti's talk describing their inner workings.
Support component versioning at core.
This is, actually, one of the topics not addressed well by any mainstream platform (except, maybe, .Net), as far as I know: how to allow several versions of the same component to live together in one program image.
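
To illustrate the kind of built-in traceability mentioned above, here's a minimal sketch that uses only standard CL operators (the function itself is just a throwaway example):

(defun norm (x y)
  (sqrt (+ (* x x) (* y y))))

(trace norm)     ; subsequent calls to NORM are reported with their arguments and results
(norm 3 4)       ; => 5.0, with a trace line printed for the call
(untrace norm)   ; stop tracing
(inspect #'norm) ; interactively browse the function object (INSPECT works on any value)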

How can static typing help with these types of errors? I don't know. What about Erlang's failure isolation? Once again, it's pretty orthogonal to the issues mentioned: if your Erlang library leaks memory, the whole machine will suffer - i.e. you can't isolate errors that touch the underlying layers of your platform, which you don't fully control...

Please, share your approaches and stories about dealing with platform errors.

2013-04-11

NLTK 2.3 - Working with Wordnet

I'm a little bit behind my schedule of implementing NLTK examples in Lisp, with no posts on the topic in March. That doesn't mean that work on CL-NLP has stopped - I've just had an unexpected vacation and have also been working on the parts related to writing programs for the excellent Natural Language Processing Coursera course by Michael Collins.
Today we'll start looking at Chapter 2, but we'll do it from the end, first exploring the topic of Wordnet.

Wordnet and Lisp

Wordnet is a database of word meanings (senses) and word semantic relations (synonymy, hyponymy, and various other -nymies). It is a really important, freely available resource, so having easy access to it is very valuable. At the same time, getting Wordnet to work is somewhat involved technically. That's why I've decided to implement its support in the cl-nlp-contrib system, so that the basic cl-nlp functionality can be loaded without the additional hassle of having to ensure that Wordnet (and other similar stuff) loads properly.
The implementation of the Wordnet interface in cl-nlp is incomplete in the sense that it doesn't provide matching entities for all Wordnet tables and so doesn't support all the possible interactions with it out of the box. Yet it is sufficient to run all of NLTK's examples, and it is trivial to implement the rest as the need arises (I plan to do it later this year).
There are several other Wordnet interfaces for CL developed over the previous years, starting from as long ago as the early nineties.
None of them uses my desired storage format (sqlite), and their overall support ranges from non-existent to little if any. Besides, one of the principles behind cl-nlp is to prefer a uniform interface and ease of extensibility over performance and efficiency, with the premise that a specialized, efficient version can easily be implemented by restricting the more general one, while the opposite is often much harder. And adapting the existing packages to the desired interface didn't seem like a straightforward task, so I decided to implement Wordnet support from scratch.

Wordnet installation

Wordnet isn't a corpus, although NLTK categorizes it as such, placing it in the nltk.corpus module. It is a word database distributed in a custom format, as well as in the binary formats of several databases. I have picked the sqlite3-based distribution as the easiest to work with — it's a single file of roughly 64 MB in size, which can be downloaded from the WNSQL project page. The default place where cl-nlp will search for Wordnet is the data/ directory. If you call (download :wordnet), it will fetch a copy and place it there.
There are several options for working with sqlite databases in Lisp, and I have chosen CLSQL as the most mature and comprehensive system — it's a go-to library for doing simple SQL stuff in CL. Besides the low-level interface, it provides basic ORM functionality and some other goodies.
Also, to work with sqlite from Lisp, a C client library is needed. It can be obtained in various ways depending on your OS and distribution (on most Linuxes with apt-get or a similar package manager). Yet there's another issue of making Lisp find the installed library. So for Linux I decided to ship the appropriate binaries in the project's lib/ dir. If you are not on Linux and CLSQL can't find the sqlite native library, you need to manually call the following command before you load cl-nlp-contrib with quicklisp or ASDF:
(clsql:push-library-path "path to sqlite3 or libsqlite3 directory")

Supporting Wordnet interaction

Using CLSQL we can define classes for Wordnet entities, such as Synset, Sense, or Lexlink.
(clsql:def-view-class synset ()
  ((synsetid :initarg :synsetid :reader synset-id
             :type integer :db-kind :key)
   (pos :initarg :pos :reader synset-pos
        :type char)
   (lexdomainid :initarg :lexdomain
                :type smallint :db-constraints (:not-null))
   (definition :initarg :definition :reader synset-def
               :type string)
   (word :initarg :word :db-kind :virtual)
   (sensenum :initarg :sensenum :db-kind :virtual))
  (:base-table "synsets"))
Notice the presence of the virtual slots word and sensenum, which are used to cache data from other tables in the synset in order to display it the way NLTK does. They are populated lazily with a slot-unbound method, just like we've seen in Chapter 1.
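Here's a rough sketch of how such lazy population might look (the exact DB lookup is an assumption; the actual cl-nlp method may differ):
(defmethod slot-unbound (class (synset synset) (slot (eql 'word)))
  (declare (ignore class))
  ;; fetch the synset's first word from the DB and cache it in the virtual slot
  (setf (slot-value synset 'word)
        (lemma-word (first (lemmas <wordnet> synset)))))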
All the DB entities are also cached in a global cache, so that repeated requests don't produce different objects and Wordnet objects can be compared with eql.
There are some inconsistencies in the Wordnet lexicon which have also migrated to the NLTK interface (and then NLTK introduced some more). I've done a cleanup on that, so some classes and slot accessors don't fully mimic the names of their DB counterparts or NLTK's interface. For instance, in our terminology word refers to a raw string, and lemma to its representation as a DB entity.
CLSQL also has special syntactic support for writing SQL queries in Lisp:
(clsql:select [item_id] [as [count [*]] 'num]
              :from [table]
              :where [is [null [sale_date]]]
              :group-by [item_id]
              :order-by '((num :desc))
              :limit 1)
Yet I've chosen to use another, very simple custom solution for that, which I'd kept in the back of my mind for a long time, ever since I started working with CLSQL's literal SQL syntax. Here's how we get all lemma objects for a specific synset — they are connected through senses:
(query (select 'lemma
               `(:where wordid
                 :in ,(select '(wordid :from senses)
                              `(:where synsetid := ,(synset-id synset))))))
For the sake of efficiency we don't open a new DB connection on every such request, but use a combination of the following:
  • there's a CLSQL special variable *default-database* that points to the current DB connection; it is used by all Wordnet-interfacing functions
  • the connection can be established with connect-wordnet, which is given an instance of the sql-wordnet3 class (there's the usual default instance <wordnet>, but you can use any other instance, which may connect to some other Wordnet SQL DB, like MySQL, if needed)
  • the ensure-db function checks if the connection is present and opens it otherwise
  • it is assumed that all functions perform Wordnet interaction inside the with-wordnet macro, which implements the standard Lisp resource-management call-with-* pattern for the Wordnet connection (see the sketch below)
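Here's a minimal sketch of what such a macro might expand into (it assumes ensure-db returns the open connection; the actual cl-nlp implementation may differ):
(defmacro with-wordnet ((wordnet) &body body)
  "Run BODY with the connection to WORDNET opened (if needed) and
   bound as the default CLSQL database."
  `(let ((clsql:*default-database* (ensure-db ,wordnet)))
     ,@body))
;; usage sketch:
;; (with-wordnet (<wordnet>)
;;   (synsets <wordnet> "motorcar"))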

Running NLTK's examples

Now, let's run the examples from the book to see how they work in our interface.
First, connect to Wordnet:
WORDNET> (connect-wordnet <wordnet>)
#<CLSQL-SQLITE3:SQLITE3-DATABASE /cl-nlp/data/wordnet30.sqlite OPEN {1011D1FAF3}>
Now we can run the functions with the existing connection passed implicitly through the special variable. wn is a special convenience package which implements the logic of implicitly relying on the default <wordnet>. There are functions with the same names in the wordnet package that take a wordnet instance as the first argument, similar to how other parts of cl-nlp are organized.
WORDNET> (wn:synsets "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
WORDNET> (synsets <wordnet> "motorcar")
(#<SYNSET auto.n.1 102958343 {10116B0003}>)
The printing of synsets and other Wordnet objects is performed with custom print-object methods built on the standard print-unreadable-object macro. Here's the method for the sample entity (the ones for synsets and senses follow the same pattern):
(defmethod print-object ((sample sample) stream)
  (print-unreadable-object (sample stream :type t :identity t)
    (with-slots (sample sampleid) sample
      (format stream "~A ~A" sample sampleid))))
Let's proceed further.
WORDNET> (wn:words (wn:synset "car.n.1"))
("auto" "automobile" "car" "machine" "motorcar")
This snippet is equivalent to the following NLTK code:
wn.synset('car.n.01').lemma_names
Definitions and examples:
WORDNET> (synset-def (wn:synset "car.n.1"))
"a motor vehicle with four wheels; usually propelled by an internal combustion engine"
WORDNET> (wn:examples (wn:synset "car.n.1"))
("he needs a car to get to work")
Next are lemmas.
WORDNET> (wn:lemmas (wn:synset "car.n.1"))
(#<LEMMA auto 9953 {100386BCC3}> #<LEMMA automobile 10063 {100386BCF3}>
 #<LEMMA car 20722 {100386BC03}> #<LEMMA machine 80312 {100386BD23}>
 #<LEMMA motorcar 86898 {1003250543}>)
As I've written earlier, there's some confusion in Wordnet between words and lemmas, and, IMHO, it is propagated in NLTK. The term lemma only appears once in the Wordnet DB, as a column in the words table. And synsets are not directly related to words — they are linked through the senses table. NLTK calls a synset-to-word pairing a lemma, which only adds to the confusion. I decided to call the entities of the words table lemmas. Now, how do we implement the equivalent of
wn.lemma('car.n.01.automobile')
We can do it like this:
WORDNET> (remove-if-not #`(string= "automobile" (lemma-word %))
                        (wn:lemmas (wn:synset "car.n.1")))
(#<LEMMA automobile 10063 {100386BCF3}>)
But, actually, we don't need a raw lemma object but a sense object, because a sense is exactly that mapping of a word to its meaning (defined by a synset), and it's the proper Wordnet entity for this:
WORDNET> (wn:sense "car~automobile.n.1")
#<SENSE car~auto.n.1 28261 {1011F226C3}>
You can also get at the raw lemma by its DB id:
WORDNET> (wn:synset (wn:lemma 10063))
#<SYNSET auto.n.1 102958343 {100386BB33}>
Notice that the synset here is named auto. This is different from NLTK's 'car', and the reason is that it's unclear from the Wordnet DB what the "primary" lemma of a synset is, so I just use the word that appears first in the DB. NLTK probably does the same but has a different ordering — compare the next output with the book's:
WORDNET> (dolist (synset (wn:synsets "car"))
           (print (wn:words synset)))
("cable car" "car")
("auto" "automobile" "car" "machine" "motorcar")
("car" "railcar" "railroad car" "railway car")
("car" "elevator car")
("car" "gondola")
Also I've chosen a different naming scheme for senses — word~synset — as it better reflects the semantic meaning of this concept.

Wordnet relations

There are two types of Wordnet relations: semantic ones between synsets and lexical ones between senses. All of the relations, or links, can be found in the *link-types* parameter. There are 28 of them in Wordnet 3.0.
To get at any relation there's a generic function related:
WORDNET> (defvar *motorcar* (wn:synset "car.n.1"))
WORDNET> (defvar *types-of-motorcar* (wn:related *motorcar* :hyponym))
WORDNET> (nth 0 *types-of-motorcar*)
#<SYNSET ambulance.n.1 102701002 {1011E35063}>
In our case "ambulance" is the first motorcar type.
WORDNET> (sort (mapcan #'wn:words *types-of-motorcar*) 'string<)
("ambulance" "beach waggon" "beach wagon" "bus" "cab" "compact" "compact car"
 "convertible" "coupe" "cruiser" "electric" "electric automobile"
 "electric car" "estate car" "gas guzzler" "hack" "hardtop" "hatchback" "heap"
 "horseless carriage" "hot rod" "hot-rod" "jalopy" "jeep" "landrover" "limo"
 "limousine" "loaner" "minicar" "minivan" "model t" "pace car" "patrol car"
 "phaeton" "police car" "police cruiser" "prowl car" "race car" "racer"
 "racing car" "roadster" "runabout" "s.u.v." "saloon" "secondhand car" "sedan"
 "sport car" "sport utility" "sport utility vehicle" "sports car" "squad car"
 "stanley steamer" "station waggon" "station wagon" "stock car" "subcompact"
 "subcompact car" "suv" "taxi" "taxicab" "tourer" "touring car" "two-seater"
 "used-car" "waggon" "wagon")
Hypernyms are more general entities in the synset hierarchy:
WORDNET> (wn:related *motorcar* :hypernym)
(#<SYNSET automotive vehicle.n.1 103791235 {10125702C3}>)
Let's trace them up to the root entity:
WORDNET> (defvar *paths* (wn:hypernym-paths *motorcar*))
WORDNET> (length *paths*)
2
WORDNET> (mapcar #'synset-name (nth 0 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "conveyance.n.3" "vehicle.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
WORDNET> (mapcar #'synset-name (nth 1 *paths*))
("entity.n.1" "physical entity.n.1" "object.n.1" "unit.n.6" "artefact.n.1"
 "instrumentality.n.3" "container.n.1" "wheeled vehicle.n.1"
 "self-propelled vehicle.n.1" "automotive vehicle.n.1" "auto.n.1")
And here are just the root hypernyms:
WORDNET> (remove-duplicates (mapcar #'car *paths*))
(#<SYNSET entity.n.1 100001740 {101D016453}>)
Now, if we look at part-meronym, substance-meronym and member-holonym relations and try to get them with related, we'll get an empty set.
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym)
NIL
That's because the relation is actually one-way: a burl is part of a tree, but not vice versa. For this case there's a :reverse key to related:
WORDNET> (wn:related (wn:synset "tree.n.1") :part-meronym :reverse t)
(#<SYNSET stump.n.1 113111504 {10114220B3}>
 #<SYNSET crown.n.7 113128003 {1011423B73}>
 #<SYNSET limb.n.2 113163803 {1011425633}>
 #<SYNSET bole.n.2 113165815 {10114270F3}>
 #<SYNSET burl.n.2 113166044 {1011428BE3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :substance-meronym :reverse t)
(#<SYNSET sapwood.n.1 113097536 {10113E8BE3}>
 #<SYNSET duramen.n.1 113097752 {10113EA6A3}>)
WORDNET> (wn:related (wn:synset "tree.n.1") :member-holonym :reverse t)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
At the same time, the tree is a member-meronym of forest (i.e. NLTK has it slightly the other way around):
WORDNET> (wn:related (wn:synset "tree.n.1") :member-meronym)
(#<SYNSET forest.n.1 108438533 {10115A81C3}>)
And here's the mint example:
WORDNET> (dolist (s (wn:synsets "mint" :pos #\n))
           (format t "~A: ~A~%" (synset-name s) (synset-def s)))
mint.n.6: a plant where money is coined by authority of the government
mint.n.5: a candy that is flavored with a mint oil
mint.n.4: the leaves of a mint plant used fresh or candied
mint.n.3: any member of the mint family of plants
mint.n.2: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
batch.n.2: (often followed by `of') a large number or amount or extent
WORDNET> (wn:related (wn:synset "mint.n.4") :part-holonym :reverse t)
(#<SYNSET mint.n.2 112855042 {1011CF3CF3}>)
WORDNET> (wn:related (wn:synset "mint.n.4") :substance-holonym :reverse t)
(#<SYNSET mint.n.5 107606278 {1011CEECA3}>)
Verbs:
WORDNET> (wn:related (wn:synset "walk.v.1") :entail)
(#<SYNSET step.v.1 201928838 {1004139FF3}>)
WORDNET> (wn:related (wn:synset "eat.v.1") :entail)
(#<SYNSET chew.v.1 201201089 {10041BDC33}>
 #<SYNSET get down.v.4 201201856 {10041BF6F3}>)
WORDNET> (wn:related (wn:synset "tease.v.3") :entail)
(#<SYNSET arouse.v.7 201762283 {10042495E3}>
 #<SYNSET disappoint.v.1 201798936 {100424B0A3}>)
And, finally, here's antonymy, which is a lexical, not a semantic, relationship, and it holds between senses:
WORDNET> (wn:related (wn:sense "supply~supply.n.2") :antonym)
(#<SENSE demand~demand.n.2 48880 {1003637C03}>)
WORDNET> (wn:related (wn:sense "rush~rush.v.1") :antonym)
(#<SENSE linger~dawdle.v.4 107873 {1010780DE3}>)
WORDNET> (wn:related (wn:sense "horizontal~horizontal.a.1") :antonym)
(#<SENSE inclined~inclined.a.2 94496 {1010A5E653}>)
WORDNET> (wn:related (wn:sense "staccato~staccato.r.1") :antonym)
(#<SENSE legato~legato.r.1 105844 {1010C18AF3}>)

Similarity measures

Let's now use the lowest-common-hypernyms function, abbreviated to lch, to see the semantic grouping of marine animals:
WORDNET> (defvar *right* (wn:synset "right whale.n.1"))
WORDNET> (defvar *orca* (wn:synset "orca.n.1"))
WORDNET> (defvar *minke* (wn:synset "minke whale.n.1"))
WORDNET> (defvar *tortoise* (wn:synset "tortoise.n.1"))
WORDNET> (defvar *novel* (wn:synset "novel.n.1"))
WORDNET> (wn:lch *right* *minke*)
(#<SYNSET baleen whale.n.1 102063224 {102A2E0003}>)
2
1
WORDNET> (wn:lch *right* *orca*)
(#<SYNSET whale.n.2 102062744 {102A373323}>)
3
2
WORDNET> (wn:lch *right* *tortoise*)
(#<SYNSET craniate.n.1 101471682 {102A3734A3}>)
7
5
WORDNET> (wn:lch *right* *novel*)
(#<SYNSET entity.n.1 100001740 {102A373653}>)
15
7
The second and third return values of lch come in handy here, as they show the depths of the paths to the common ancestor and give some immediate data for estimating semantic relatedness, which we're going to explore more now.
WORDNET> (wn:min-depth (wn:synset "baleen whale.n.1"))
14
WORDNET> (wn:min-depth (wn:synset "whale.n.2"))
13
WORDNET> (wn:min-depth (wn:synset "vertebrate.n.1"))
8
WORDNET> (wn:min-depth (wn:synset "entity.n.1"))
0
By the way, there's another whale that is much closer to the root:
WORDNET> (wn:min-depth (wn:synset "whale.n.1"))
5
Guess what it is?
WORDNET> (synset-def (wn:synset "whale.n.1"))
"a very large person; impressive in size or qualities"
Now, let's calculate different similarity measures:
WORDNET> (wn:path-similarity *right* *minke*)
1/4
WORDNET> (wn:path-similarity *right* *orca*)
1/6
WORDNET> (wn:path-similarity *right* *tortoise*)
1/13
WORDNET> (wn:path-similarity *right* *novel*)
1/23
The algorithm here is very simple — it just reuses the secondary return values of lch:
(defmethod path-similarity ((wordnet sql-wordnet3)
                            (synset1 synset) (synset2 synset))
  (mv-bind (_ l1 l2) (lowest-common-hypernyms wordnet synset1 synset2)
    (declare (ignore _))
    (when l1
      (/ 1 (+ l1 l2 1)))))
There are many more Wordnet similarity measures in NLTK, and they are also implemented in cl-nlp. For example, lch-similarity, whose name happens to coincide with the lch function above:
WORDNET> (wn:lch-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
Calculating taxonomy depth for #<SYNSET entity.n.1 100001740 {102A9F7A83}>
2.0281482
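For reference, lch here stands for Leacock-Chodorow, whose score is commonly defined as -log(p / 2D), where p is the shortest path length between the two synsets and D is the maximum depth of the taxonomy. A toy sketch of just the formula (not the actual cl-nlp code):
(defun lch-score (path-length taxonomy-depth)
  ;; -log(p / 2D), using the natural logarithm
  (- (log (/ path-length (* 2 taxonomy-depth)))))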
This measure depends on the expensive computation of taxonomy depth, which is done lazily in the max-depth function:
(defmethod max-depth ((wordnet sql-wordnet3) (synset synset))
  (reduce #'max (mapcar #`(or (get# % *taxonomy-depths*)
                              (progn
                                (format *debug-io*
                                        "Calculating taxonomy depth for ~A~%" %)
                                (set# % *taxonomy-depths*
                                      (calc-max-depth wordnet %))))
                        (mapcar #'car (hypernym-paths wordnet synset)))))
Other similarity measures include wup-similarity, res-similarity, and lin-similarity. You can read about them in the NLTK Wordnet manual. Most of them depend on a database of words' information content, which is calculated for different corpora (e.g. BNC and Penn Treebank) and is not part of Wordnet. There's the WordNet::Similarity project that distributes pre-calculated databases for some popular corpora.
Lin similarity uses a formula similar to LCH similarity for calculating the score, but takes information contents instead of taxonomy depths as its arguments:
WORDNET> (wn:lin-similarity (wn:synset "dog.n.1") (wn:synset "cat.n.1"))
0.87035906
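For reference, Lin's measure is commonly defined as 2 * IC(lcs) / (IC(s1) + IC(s2)), where IC is a synset's information content (the negative log of its probability in a corpus) and lcs is the lowest common subsumer of the two synsets. A toy sketch of just the formula (not the actual cl-nlp code):
(defun lin-score (ic-lcs ic-s1 ic-s2)
  ;; 2 * IC(lcs) / (IC(s1) + IC(s2))
  (/ (* 2 ic-lcs) (+ ic-s1 ic-s2)))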
To calculate it we first need to fetch the information content corpus with (download :wordnet-ic) and load the data for SemCor:
WORDNET> (setf *lc* (load-ic :filename-pattern "semcor"))
#<HASH-TABLE :TEST EQUAL :COUNT 213482 {100D7B80D3}>
The default :filename-pattern will load the combined information content scores for the following 5 corpora: BNC, Brown, SemCor and its raw version, all of Shakespeare's works and Penn Treebank. We'll talk more about various corpora in the next part of the series...
Well, that's all as far as Wordnet is concerned for now. There's another useful piece of Wordnet functionality — the word morpher, but we'll return to it when we talk about word formation, stemming, and lemmatization. Meanwhile, here's a good schema showing the full potential of Wordnet:
Wordnet 3.0 UML diagram

2013-03-05

Lisp Hackers: Vladimir Sedach

Vladimir Sedach is an active open-source Common Lisp developer and proponent, as well as a computing philosopher to some extent. At his carcaddar blog he writes about decentralized social networks, forgotten bits of computer history and, surely, Lisp. He is the maintainer or originator of a few open-source libraries like parenscript and Eager-Future2, and works on the Vacietis C-to-Lisp compiler, which he describes in more detail in the interview. Together with Andrey Moskvitin, he was the driving force behind the 2012 cliki update effort (see cliki2).

Tell us something interesting about yourself.

In 2012 my best friend and I rode our bicycles across the USA.

What's your job? Tell us about your company.

Right now I work at ZestFinance in Hollywood. We're a relatively new company that's helping the underbanked receive access to credit on more reasonable terms than otherwise obtainable, by using machine learning. The only thing bad about my job is that I get so focused on programming at work that I don't have any desire to hack on Free Software projects when I come home, so a lot of my own projects are currently badly neglected.

Do you use Lisp at work? If yes, how did you make it happen? If not, why?

I've used Lisp for new commercial projects where people trust me with the technology choices. I've also come in to work on existing Lisp projects. I have used Lisp for one-off tasks at "non-Lisp" companies, but I've never started a Lisp project at a company that wasn't using Lisp already. Most computer programmers tend to make technical decisions based on cliches they read about on the Internet, and unfortunately I don't know the cure for being dumb.

What brought you to Lisp? What holds you?

As a teenager I got interested in computer graphics, and came across this really cool 3d graphics program called Mirai from Nichimen Graphics. Mirai was written in Common Lisp, and I read more about Lisp in the Slashdot interview with Kent Pitman, and started reading SICP based on Kent's recommendation. By the second chapter computer programming had finally made sense to me, and by the third I had decided to study mathematics in university.

As an aside, Mirai is directly descended from the S-Graphics software from Symbolics. Not many people know this, but Symbolics was actually the second producer of commercially available 3d computer animation software (the first was Wavefront). S-Graphics was a complete modeling, animation, and painting system, and it was all written in Lisp in the mid 1980s.

Common Lisp is still the best programming language I've come across. There is the feeling of freedom. No one is telling you what patterns or types to use. You can always just write code in a way that is most appropriate for what you're currently doing.

What's the most exciting use of Lisp you had?

Working with Daniel Gackle on Skysheet. We did some very awesome things with Parenscript.

What do you dislike the most about Lisp?

No one has yet invented a good way to write "one-liners" in Lisp. I would love to replace my Unix shell with a Lisp REPL someday.

Among the software projects you've participated in, what's your favorite?

Vacietis, for the sheer amount of hacks per line of code, and for the "I proved them wrong" factor. It was during the development of Vacietis that I came up with my current programming philosophy: "when stuck, do the stupidest thing possible"

Tell us about Vacietis: your vision for it, the project's progress and roadmap, what's lacking?..

Vacietis was originally intended to be a translator for a subset of C code to Common Lisp. I discussed the idea for Vacietis a few times online before starting to code, and people generally thought it wouldn't be doable. Scott Burson, who wrote the Zeta-C compiler for Lisp Machines in the 1980s, told me it would take at least a year of full-time of work.

As I worked on Vacietis, I realized that adding more and more of the "advanced" C functionality was actually easy. First came the C preprocessor (which is implemented as part of the Common Lisp readtable that also parses the C code), then large patches of the C standard library, and then one day Brit Butler came along and sent me a one-page patch that actually used Vacietis as a stand-alone C compiler! I hadn't even realized the project had gotten to that phase. The compiler comes as the "vacietis.vcc" ASDF system as part of Vacietis.

Some big things that need to be done are struct call-by-value, pointer scaling, arguments to main(), some libc stuff, setjmp, and making VCC produce linkable Lisp "object" files (fasls) for the different implementations. I would also like to change the pointer representation to be a fixnum offset into a sparse array (right now pointers are represented by structures that look like <offset, array>).

Overall, I am amazed by how much progress Vacietis has made given how little time I spent on it. I am now convinced that Common Lisp is mostly a superset of C, and have a newfound appreciation for gotos and PROG.

If you had all the time in the world for a Lisp project, what would it be?

A single-address space Common Lisp operating system based on capabilities (Jonathan Rees wrote a dissertation on doing this in Scheme in 1995) and a virtualized package system. I really like Azul Systems' hack of using the EPT/RVI virtualized page tables as a hardware read barrier for real-time garbage collection, and with a single-address space operating system I think you can do the same thing on hardware with a regular MMU.

If you follow the steps of the Viewpoints Research Institute's Fundamental New Computer Technologies project and keep the system as simple as possible, it should be a surmountable amount of work to realize a somewhat working system.

Vacietis is actually the first step in the Common Lisp operating system project. I'd like to have a C runtime onto which I can port hardware drivers from OpenBSD with the minimal amount of hand coding. The way hardware drivers are written in OpenBSD (a set of patterns it borrows from NetBSD, where they originated and are colloquially known as "bus_dma") is very pleasant. Theo de Raadt was also the first well-known Free Software operating system developer to take a position against binary hardware drivers, and I respect him tremendously for that. Linus Torvalds doesn't care, and as a result it is now impossible to get drivers for any of your ARM devices, even though 99% of them run Linux. OpenBSD will continue to have great, stable Free Software device drivers that are easily portable for different hardware architectures and to the other BSD-derived operating systems. I hope OpenBSD's model of Free Software hardware drivers catches on more widely.

Describe your workflow, give some productivity tips to fellow programmers.

I tend to design software systems in advance only at the level of the domain. I find that programmers who try to come up with APIs or draw class diagrams ahead of starting to code tend to write very poor code.

I rely heavily on interactive programming for doing mundane tasks, and since most programming time is spent dealing with mundane things, I spend most of my time in the REPL. Fortunately the Common Lisp REPL is the best programming environment I have come across. Things like the Rails console don't even come close. The only comparable environment is the Unix shell.

It's also very easy to turn Common Lisp REPL code into unit tests, which I tend to do a lot. That is something that's very hard to do with object-oriented code, which is why idiotic things like dependency injection and Test-Driven Development have to be invented.

For difficult problems, it's always better to step back from the keyboard and just think about your code. Hacking around in the debugger is usually a waste of time (the only exception is when you're having data structure issues, but then technically you're in the object inspector). I mostly tend to debug with print statements.

The best productivity tip I've come across is the "Seinfeld technique" that I learned about from reading Hacker News. It involves doing something, no matter how small, on your project every single consecutive day, without any gaps or interruptions. Apparently Jerry Seinfeld uses a big calendar and tries to mark up large "runs" of consecutive days on it. That really keeps you focused.

One thing I try to focus on in my code is giving unambiguous, unique (ie, easily grep-able) names to symbols. That makes it easy to go back to old code and quickly figure out what is going on. Common words like "status," "mapping," etc. make terrible names unless you qualify them.
