2017-04-17

Pretty-Printing Trees

  (or The Ugliest Code I've Ever Written)

In the last couple of days, I was ill and had to stay in bed, so I've used this time also to tidy up the work that accumulated over the past year in cl-nlp. That was especially timely, considering the interest that was expressed in using it by some people who I've met at the recent Lisp-related events.

I've even assembled a rough checklist of the things that need to be finished to get it to v.1.0 and beyond.

Besides, after finishing the basic cleaning, I've returned to one of the programming tasks that has racked my head for long: tree pretty-printing. In NLP, we constantly have to deal with various versions of parse trees, like the constituency or dependency ones, but the problem is that they are not easily visualized. And good visualization plays, at least for me, a critical role in effective debugging, ideation and programming. It's an essential part of a solid interactive experience that is one of the fundamental traits of Lisp development.

For instance, a constituency tree is usually presented as a Lisp list. Here's an infamous example from the Penn Treebank:

  (S 
    (NP-SBJ 
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,) 
      (ADJP 
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will) 
      (VP (VB join) 
        (NP (DT the) (NN board) )
        (PP-CLR (IN as) 
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) )

A dependency tree has several representations, all of which are not really intuitive to grasp. This is the Stanford format:

 amod(ideas-2, Colorless-0)
 amod(ideas-2, green-1)
 nsubj(sleep-3, ideas-2)
 root(sleep-3, sleep-3)
 advmod(sleep-3, furiously-4)
 punct(sleep-3, .-5)

And here's the CoNLL one:

0 Colorless _ _ ADJ 2
1 green _ _ ADJ 2
2 ideas _ _ NOUN 3
3 sleep _ _ NOUN 3
4 furiously _ _ ADV 3
5 . _ _ PUNCT 3

Also, Google's Parsey McParseface offers another - presumably, more visual - representation (using the asciitree lib). Still, it is not good enough, as it messes with the order of words in a sentence.

Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct

As you see, dependency trees are not trivial to visualize (or pretty-print) in ASCII. The authors of Spacy creatively approached solving this problem by using CSS in their displaCy tool:

However, it seems like an overkill to bring a browser to with you for such a small task. And it's also not very scalable:

I, in fact, was always interested in creative ways of text-based visualization. So, I thought of ways to represent parse trees in ASCII.

With constituency ones, it's rather trivial:

> (pprint-tree '(TOP (S (NP (NN <This:0 0..4>))
                        (VP (VBZ <is:1 5..7>)
                            (NP (DT <a:2 8..9>)
                                (JJ <simple:3 10..16>)
                                (NN <test:4 17..21>)))
                        (|.| <.:5 22..23>)))
           TOP            
            :             
            S             
  .-----------:---------. 
  :          VP         : 
  :   .---------.       : 
 NP   :        NP       : 
  :   :   .----:-----.  : 
 NN  VBZ DT   JJ    NN  . 
  :   :   :    :     :  : 
This  is  a simple test . 

The dependencies are trickier, but I managed to find a way to show them without compromising the sentence word order:

> (pprint-deps '(<This:0 0..4> <is:1 5..7> <a:2 8..9> <simple:3 10..16> <test:4 17..21> <.:5 22..23>)
               '(root(_ROOT_-0, is-1) nsubj(is-1, This-0) dobj(is-1, test-4) det(test-4, a-2) amod(test-4, simple-3) punct(is-1, .-5)))
Colorless green     ideas      sleep     furiously . 
    ^       ^        .^         .^.          ^     ^
    :       `. amod .´:         :::          :     :
    `..... amod .....´:         :::          :     :
                      `. nsubj .´::          :     :
                                 :`. advmod .´     :
                                 :`.... punct .....´
                               root

And it looks pretty neat even for longer sentences:

We        hold these   truths to       be  self -       evident , that all      men are        created     equal , that they are        endowed      by  their    Creator    with certain unalienable Rights , that among     these are      Life         , Liberty   and the    pursuit     of     Happiness . 
 ^         .^.   ^       .^    ^       .^.   ^  ^         .^    ^   ^   ^       .^   ^           .^.         ^   ^   ^    ^   ^           .^.         ^.   ^        .^.        ^.    ^         ^        .^   ^   ^    ^.        ^   .^.        ^.         ^    ^.      ^   ^       .^.        ^.        ^     ^
 `. nsubj .´::   `. det .´:    `. aux .´::   :  `. punct .´:    :   :   `. det .´:   `. auxpass .´::         :   :   :    :   `. auxpass .´::         ::   `. poss .´::        ::    :         `. amod .´:   :   :    :`. pobj .´   :::        :`. punct .´    :`. cc .´   `. det .´::        :`. pobj .´     :
            :`... dobj ...´             ::   `. npadvmod .´:    :   :            :               ::`. advcl .´   :   :    :               :::         ::             ::        ::    `...... amod ......´:   :   :    :             :::        ::              ::                   :`. prep .´               :
            ::                          :`..... acomp .....´    :   :            `.. nsubjpass ..´::             :   :    :               :::         ::             ::        :`......... pobj .........´   :   :    :             :::        ::              :`...... conj .......´                         :
            :`......... advcl ..........´                       :   :                            ::`... punct ...´   :    :               :::         ::             :`. prep .´                             :   :    :             :::        :`.... conj ....´                                              :
            :`..................... punct ......................´   `........... mark ...........´::                 :    :               :::         :`... pobj ....´                                       :   :    :             ::`. attr .´                                                              :
            ::                                                                                    ::                 :    :               ::`. agent .´                                                      :   :    `... prep ....´:                                                                        :
            ::                                                                                    ::                 :    `.. nsubjpass ..´::                                                                :   `...... mark ......´:                                                                        :
            ::                                                                                    ::                 `....... mark .......´::                                                                :                       :                                                                        :
            ::                                                                                    ::                                       :`............................ punct .............................´                       :                                                                        :
            ::                                                                                    ::                                       :`........................................ advcl .........................................´                                                                        :
            ::                                                                                    :`................ advcl ................´                                                                                                                                                                  :
            :`...................................... ccomp .......................................´                                                                                                                                                                                                           :
            :`............................................................................................................................................ punct .............................................................................................................................................´
          root

However, writing the visualization code was one of the most intimidating programming tasks I've ever encountered. One explanation is that trees are most naturally processed in depth-first order top-down, while the visualization requires bottom-up BFS approach. The other may be that pixel-perfect (or, in this case, character-perfect display is always tedious). As far as I'm concerned, this is not a sufficient explanation, but I couldn't find any other. The ugliest part of this machinery is deps->levels function that prints the dependency relations in a layered fashion. The problem is to properly calculate minimal space necessary to accommodate both tokens and dependency labels and to account for different cases when the token has outgoing dependency arcs or doesn't. In theory sounds pretty easy, but in practice, it turned out a nightmare.

And all of this assumes projective trees (non-intersecting arcs), as well as doesn't know how to show on one level two arcs going from one token in two directions. Finally, I still couldn't align the two trees (constituency and dependency) above and under the sentence. Here's the target:

           TOP            
            :             
            S             
  .----------------:--------------. 
  :               VP              : 
  :         .---------.           : 
 NP         :        NP           : 
  :         :   .----:---------.  : 
 NN        VBZ DT   JJ        NN  . 
This        is  a simple     test . 
  ^         .^. ^    ^        .^  ^
  `. nsubj .´:: :    `. amod .´:  :
             :: `.... det ....´:  :
             :`..... dobj .....´  :
             :`...... punct ......´
           root

and this is how it prints for now (one more challenge was to transfer additional offsets from dependencies into the constituency tree):

           TOP            
            :             
            S             
  .-----------:---------. 
  :          VP         : 
  :   .---------.       : 
 NP   :        NP       : 
  :   :   .----:-----.  : 
 NN  VBZ DT   JJ    NN  . 
This  is  a simple test . 
  ^         .^. ^    ^        .^  ^
  `. nsubj .´:: :    `. amod .´:  :
             :: `.... det ....´:  :
             :`..... dobj .....´  :
             :`...... punct ......´
           root

Well, the good news is that it is usable, but it still needs more work to be feature complete. I wonder what was I doing wrong: maybe, someone can come up with a clean and simple implementation of this functionality (in any language)? I consider it a great coding challenge, although it may require a week of your free time and a bunch of dead neurons to accomplish. But if you're willing to take it, I'd be glad to see the results... :D

2017-01-02

(m8n)ware Open for Business

Today, I want to announce (m8n)ware (the name is an i18n-abbreviation of "meditationware" with a mix of Lisp parens). This is a thing I always wanted to build. After parting ways with Grammarly almost a year ago, I had some time to rest and think about my next move. And this thought I couldn't let go so I figured: you can always go work somewhere, but you don't have a lot of stabs at realizing some of your own ideas. Maybe, two or three in a lifetime. I had tried this once already with fin-ack almost 8 years ago, and the concept behind it was, basically, the same — the implementation differed.

In theory

What is (m8n)ware? It is a company aimed at solving problems in the area of cognition-related computing, which will be built as a distributed network of mostly Lisp research-oriented engineers. This sounds rather complex, so let me try to explain a couple of points:

  • Cognition-related computing is the best term I came to after a long thinking about the area of CS that includes various tasks related to cognition, intelligence, knowledge, and associated logic. The common marketing buzzword is Artificial Intelligence, but it has a negative history and is quite misleading. All computer programs implement some form of "artificial" intelligent behavior. The defining feature of cognition-related computing is that it requires some transformation of raw data into structured computer-processable information and back, which is similar to human cognitive functions that arguably do the same transformation for our own internal processing.
  • The zest of the distributed network notion is that the primary focus of (m8n)ware is building not a localized corporate-like structure, in which people are bound primarily by legal contracts and payment obligations, but a loosely coupled group of like-minded people, who share the same values, interests, and approaches to technology. This organization will be seeking a perfect middle-ground between a corporation and an open-source community.
  • Research-oriented engineers is another "middle-ground" term that describes the main multidisciplinary role needed in this organization. We're not a scientific lab that is focused on fundamental research, neither are we an outsourcing shop that faithfully implements existing results according to a given spec. We're engineers in the sense that we deliver production-ready technology that may be useful to the end users in a straightforward manner. And, at the same time, we're researchers because we wield the methodology and are ready to experiment in new areas that don't have satisfactory state-of-the-art solutions.

I don't believe in the VC mantra of "build a startup, get rich, change the world." First of all, I don't believe in changing the material world (which implies a conviction that you know better). I believe in changing yourself. Also, getting rich and doing something good (to the world) are not the goals that are always aligned. Moreover, they are usually in conflict. I'm not a businessman in the sense that money is not my ultimate goal. But I like to see things grow and develop, things that are bigger than myself. Thus I'm interested not in market share but in mind share.

Considering all of the above, (m8n)ware is not going to be a product company in a traditional sense. It will be a technology company that will create and disseminate knowledge-based services and products. Also, it will not aim at rapid growth, but rather at sustainable development.

In the previous post, I've listed my motivations for moving in this particular direction and explained my values. I can't say that it got overwhelmingly positive feedback, but, in general, the results were better than I expected :) Now, I have several clues on how to cross the most challenging chasm of scaling its operation from a single-person endeavor to a productive and sustainable group. Meanwhile, I was testing if this approach may work in practice and doing market research. Now, I'm ready to go all in and devote at least the next year of my professional life to building this thing.

http://m8nware.com is oficially live. If you're interested in cooperation as a client, partner or co-worker, please, let me know...

In practice

There are several aspects that I'm betting on in (m8n)ware that are non-mainstream and may seem somewhat counter-intuitive. In this part, I want to provide more details about why and how, I think, this will work.

The first aspect is radical transparency. From this and the previous post, it should be clear that (m8n)ware originated and plans to continue functioning fully exposed to the outside world, not relying on any secret know-hows or clever tricks. I don't plan to conceal what we're going to do and why. Neither "fake it till you make it." Why this will work? First of all, my experience shows that in the current age of information overload, we're fighting primarily not for the purse but for the thoughts of our "customers" (in many possible markets: not only where you sell your product/services, but also in the labor market, and in the ecosystem of potential competitors, partners, and vendors). And this requires information sharing, not safekeeping and concealment. Secondly, in general, I'm not interested in competition — rather I'd like to find a unique niche in the market that will be served by the company in the best possible manner and will be big enough to sustain it. The good news is that the AI-market is, currently, growing very fast and this trend will last at least for a couple more years. So demand is greater than supply, and this means not a very harsh competitive environment. Another thing is Lisp: no one in their right mind will bet a company on Lisp, so I'm not really worried about the competition in the labor market. :) The final point about openness is that I personally endorse it, and as this is the company that aims to be as close to my ideal as possible it should endorse it.

Although it's not a classic product company, it's not going to be a typical outsourcing one either. Yes, initially, it will provide primarily consulting services, but the idea is that, with time, the share of these services will decrease in favor of supporting more general-purpose tools and technology developed in-house. And to ensure the constant priority of this goal, we'll be doing such work from day one. Currently, I see it in the following manner: the time of all engineers will be split in some proportion between for-pay consulting and developing open-source/research projects for free, and with time as some of these projects become important to the company, it will start paying the people who develop them for this work as well. This is a frugal approach, but I advocate it based on personal perspective: working at my previous gigs, I'd be eager to forfeit, say, 20% of my salary to be able to spend 20% of my time on open-source projects that matter to me personally. Actually, the percentage may be much bigger. Currently, I spend 50% of my time working on such projects and am quite happy with this. I deeply believe that such balance is more appealing to many programmers (especially, the kind of people I'd be willing to cooperate with) than a conventional approach.

Lisp again. From my experience working in cognitive problems domain, I can definitely say that it's not about coding. For several reasons. The obvious one is that 80% of resources are spent in other parts: thinking/learning, working with data, experiments, documentation. (The remaining 20% are still critical, especially since most of the solutions are resource-demanding and the code is algorithm-heavy). Then, current technology situation: the days of backend-only solutions are, unfortunately, gone. A lot of problems require heavy mobile or in-browser presence. And on the backend, thanks to microservices and other stuff, no one is developing in a single language and even on a single platform anymore. Finally, there's knowledge transfer. Programs may be not a bad medium to express concepts, but not the optimal one either: between scientific papers, blog posts, markdown documents, experiment notebooks, and production-optimized programs, there is no one-size-fits-all solution. All this creates conditions, in which the choice of a programming language becomes much less a constraint than it was just a a few years ago. On the other side, from the point of view of "internal" productivity (not concerned with integration into the bigger picture), Lisp has proven to be a great and rewarding environment very well suited for research work. Plus a great way to differentiate in the labor market... :)

Our value proposition

So, if you need to solve some cognitive computing problems, here's what (m8n)ware may offer.

  1. Provide small-scale consulting services: talk to your people and help them with their challenges, perform an audit, study feasibility of some serious project, help with gathering relevant data sets, etc.
  2. Develop a prototype solution for a particular problem, and deliver it as a set of data, documentation, and a working web-service to allow integration in your prototypes, testing in your environment, with your clients and data.
  3. Develop a turnkey solution and integrate it into your environment. This is rather tricky as we'll prefer to work in Lisp, and not every environment will be ready for this. We're also willing to compromise and develop some non-critical integration parts in other languages when necessary, provided that the core remains Lisp-based.

Why you should go to us instead of solving the problem on your own? The current situation in cognition-related computing is that such projects have high business value, but are not easy to complete: they require not just engineering, but a substantial/prevailing research component. Productive work in this area assumes a skill set of developers and managers that is different from conventional software development. Obviously, you want to develop this expertise in-house, but growing it, currently, is a slow and daunting process. Still, you should definitely do that for the long-term benefits, but this doesn't mean that you can say something in the lines of: in the next half a year I need to solve this complex AI problem, so I'll just hire a person/team in a couple of months and let them do it. It's a risky approach even in conventional software development, and in this field, it just doesn't work. The competition for AI researchers is insane and, moreover, if you're a regular company and not Google/Facebook or the latest-hottest startup your chances of hiring and retaining top talent are, basically, nil. (Why we'll be able to have the talented people while you won't? Because our particular focus — Cognitive+Tech+Distiributed+Lisp — will allow us to appeal to a portion of the talent pool that is not happy in mainstream environments).

Cognitive computing projects are risky and hard to predict. That's why for any serious long-term (longer than a couple of months) partnership we'll be dividing the work into reasonable chunks that will allow you to get at least part of the value at each checkpoint, see and assess progress, and pivot if your plans or conditions change.

We're open for business — write to info@m8nware.com if interested.