Transition-Based Dependency Parser
Hi folks, welcome to my blog! I have badly wanted to start my own blog for more than three or four years, but only now am I writing my first post. (Apparently this blog was created in 2010, but there have been no posts since ;-) ) Anyway, this post is for readers who have some prior knowledge of natural language processing, and dependency parsing in particular. Let's begin!
One of the key tasks in Natural Language Processing is identifying the linguistic structure of sentences. There are two main views of linguistic structure. One is phrase structure, where the phrases inside the sentence are analyzed by means of a context-free grammar. The other is dependency structure, where the relationships between the words of a sentence are analyzed in terms of dependents and their dependency types. In other words, dependency parsing simply means taking every word of the sentence and finding out which word it depends on. Figure 1 shows an example.
Figure 1: Dependency between words
The outcome of dependency parsing is a dependency tree, as shown in Figure 2. To make things consistent, a $ROOT$ node is added at the top of the tree so that every word in the sentence depends on some other word (or $ROOT$).
Figure 2: Dependency parse tree
A few things to note :
- The arrow goes from the head to the dependent.
- Every word has exactly one head.
- $ROOT$ has exactly one dependent.
- Any other word can have any number of dependents.
- The dependency graph must be a tree (i.e. connected and acyclic).
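These constraints are easy to check programmatically. Here's a minimal sketch (the head-array encoding and function name are mine, not from any particular library) that validates the single-ROOT-dependent and acyclicity conditions:

```python
def is_valid_tree(heads):
    """heads[i] is the head of word i+1; 0 denotes ROOT.
    Every word has exactly one head by construction of the array."""
    n = len(heads)
    # ROOT must have exactly one dependent.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Acyclicity: following head links from any word must reach ROOT.
    for i in range(1, n + 1):
        seen = set()
        node = i
        while node != 0:
            if node in seen:
                return False   # cycle detected
            seen.add(node)
            node = heads[node - 1]
    return True

print(is_valid_tree([2, 0, 2]))      # True: a valid little tree
print(is_valid_tree([2, 0, 4, 3]))   # False: words 3 and 4 form a cycle
```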
Methods of Dependency Parsing
Dependency parsing is the process of obtaining the dependency parse tree given a sentence as input. There are four main approaches to dependency parsing.
- Dynamic programming approach
- Graph-based approach (obtaining the minimum spanning tree)
- Constraint satisfaction problem (CSP) approach
- Transition-based parsing approach
Out of these four techniques, transition-based parsing algorithms are the fastest, running in linear time, whereas the other approaches usually have polynomial time complexity.
Transition Based Parsing
Transition-based parsing takes a greedy approach: at each step the parser makes the locally best choice, assuming that a sequence of locally optimal decisions yields a good overall solution. The parser predicts a sequence of transitions that goes from an initial configuration to a terminal configuration while deriving the dependency parse tree.
A configuration $c$ is defined as $c = (s, b, A)$ where $s$ is a stack, $b$ is a buffer and $A$ is a list of dependency arcs. Here arcs are dependencies between words (edges in the dependency parse tree).
The stack and buffer contain the words of the sentence. Initially, the stack contains only $ROOT$ and the buffer contains the words of the sentence in order.
Say the sentence is \[X = w_1, w_2, \ldots, w_n\] then the initial configuration is
\[s = [ROOT]\] \[b = [w_1, w_2, \ldots, w_n]\] \[A = \emptyset\]
From a given configuration, one of three transitions can be applied:
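As a small sketch in Python (the names are mine, purely illustrative), the initial configuration looks like this:

```python
ROOT = "ROOT"

def initial_config(words):
    stack = [ROOT]            # s: starts with ROOT only
    buffer = list(words)      # b: w_1 ... w_n in order
    arcs = []                 # A: empty list of (head, dependent) arcs
    return stack, buffer, arcs

s, b, A = initial_config(["He", "has", "good", "control"])
print(s, b, A)   # ['ROOT'] ['He', 'has', 'good', 'control'] []
```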
- $SHIFT$ : The first element of the buffer is popped and pushed onto the stack. \[s \rightarrow s + b_1\] \[b \rightarrow b - b_1\]
- $LEFT\_ARC$ : Make the second-top word of the stack ($s_2$) a dependent of the top word ($s_1$), then remove $s_2$ from the stack. \[s \rightarrow s - s_2\] \[A \rightarrow A + arc(s_1,s_2)\]
- $RIGHT\_ARC$ : Make the top word of the stack ($s_1$) a dependent of the second-top word ($s_2$), then remove $s_1$ from the stack. \[s \rightarrow s - s_1\] \[A \rightarrow A + arc(s_2,s_1)\]
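The three transitions above can be sketched as plain Python functions operating on a (stack, buffer, arcs) triple. This is an illustrative sketch, not a reference implementation:

```python
def shift(stack, buffer, arcs):
    """SHIFT: pop b1 from the buffer and push it onto the stack."""
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs):
    """LEFT_ARC: add arc(s1, s2) and remove s2 from the stack."""
    s1, s2 = stack[-1], stack[-2]
    arcs.append((s1, s2))
    del stack[-2]

def right_arc(stack, buffer, arcs):
    """RIGHT_ARC: add arc(s2, s1) and remove s1 from the stack."""
    s1, s2 = stack[-1], stack[-2]
    arcs.append((s2, s1))
    stack.pop()
```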
In labeled dependency parsing, the dependency type must also be recorded when an arc is created. In that case, the $LEFT\_ARC$ and $RIGHT\_ARC$ operations are subdivided into one operation per dependency label.
Example
Let's say the set of dependency types is $\{amod, tmod, nsubj\}$. Then the possible operations are:
- $SHIFT$, $LEFT\_ARC-amod$, $LEFT\_ARC-tmod$, $LEFT\_ARC-nsubj$, $RIGHT\_ARC-amod$, $RIGHT\_ARC-tmod$, $RIGHT\_ARC-nsubj$
Hence, if there are $N$ dependency labels, the total number of operations is $2N + 1$.
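For example, enumerating the operation set for the three labels above gives $2 \times 3 + 1 = 7$ operations:

```python
labels = ["amod", "tmod", "nsubj"]          # N = 3 dependency labels
ops = (["SHIFT"]
       + [f"LEFT_ARC-{l}" for l in labels]
       + [f"RIGHT_ARC-{l}" for l in labels])
print(len(ops))   # 2 * 3 + 1 = 7
```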
Starting from the initial configuration, a sequence of the above operations is performed until the terminal configuration is reached: an empty buffer with only $ROOT$ left on the stack, i.e. $s = [ROOT]$, $b = \emptyset$. At that point $A$ contains all the dependencies.
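To make this concrete, here's a small sketch that replays a given (unlabeled) transition sequence until the terminal configuration. The sentence and its gold sequence are my own illustrative example:

```python
def parse(words, transitions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT_ARC":                     # arc(s1, s2), drop s2
            arcs.append((stack[-1], stack[-2]))
            del stack[-2]
        elif t == "RIGHT_ARC":                    # arc(s2, s1), drop s1
            arcs.append((stack[-2], stack[-1]))
            stack.pop()
    assert stack == ["ROOT"] and not buffer       # terminal configuration
    return arcs

gold = ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "SHIFT",
        "LEFT_ARC", "RIGHT_ARC", "RIGHT_ARC"]
print(parse(["He", "has", "good", "control"], gold))
# [('has', 'He'), ('control', 'good'), ('has', 'control'), ('ROOT', 'has')]
```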
The next question is how to decide which operation to perform at each step. Under the greedy assumption, the next operation for a given configuration is chosen by running a discriminative classifier (e.g. an SVM) over all possible moves. The classifier takes features of the current configuration as input and outputs the operation to perform.
Some commonly used features are:
- The words and corresponding POS tags of the first two elements of the stack and buffer.
- Word-pair features.
- Three-word features.
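Such feature extraction can be sketched as below. The template names like `s1.w` are my own shorthand, and real parsers use many more templates:

```python
def features(stack, buffer, pos):
    """pos: dict mapping each word to its POS tag."""
    feats = []
    # words and POS tags of the top two stack items (s1 = top)
    for i, name in [(-1, "s1"), (-2, "s2")]:
        if len(stack) >= -i:
            w = stack[i]
            feats.append(f"{name}.w={w}")
            feats.append(f"{name}.t={pos.get(w, 'ROOT')}")
    # words and POS tags of the first two buffer items
    for i, name in [(0, "b1"), (1, "b2")]:
        if len(buffer) > i:
            w = buffer[i]
            feats.append(f"{name}.w={w}")
            feats.append(f"{name}.t={pos[w]}")
    # an example word-pair feature combining s1 and b1
    if stack and buffer:
        feats.append(f"s1.w|b1.w={stack[-1]}|{buffer[0]}")
    return feats

pos = {"He": "PRP", "has": "VBZ", "good": "JJ", "control": "NN"}
print(features(["ROOT", "He"], ["has", "good", "control"], pos))
```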
Then comes the problem of training the classifier. One important property of dependency treebanks helps here: a treebank can be converted deterministically into a sequence of the above operations. Hence, by extracting operation sequences from dependency treebanks, the classifier can be trained in a supervised manner.
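This conversion can be sketched as a so-called static oracle for the arc-standard system: given the gold head of each word, it deterministically emits the transition sequence. The sketch below is mine and assumes a projective gold tree, with words indexed 1..n and 0 for $ROOT$:

```python
def oracle(heads):
    """heads[i] is the gold head of word i+1 (0 means ROOT)."""
    n = len(heads)
    stack, buffer, seq = [0], list(range(1, n + 1)), []
    remaining = [0] * (n + 1)          # dependents not yet attached
    for h in heads:
        remaining[h] += 1
    while buffer or len(stack) > 1:
        if (len(stack) >= 2 and stack[-2] != 0
                and heads[stack[-2] - 1] == stack[-1]):
            seq.append("LEFT_ARC")     # s2's head is s1
            remaining[stack[-1]] -= 1
            del stack[-2]
        elif (len(stack) >= 2 and heads[stack[-1] - 1] == stack[-2]
                and remaining[stack[-1]] == 0):
            seq.append("RIGHT_ARC")    # s1's head is s2, s1 fully attached
            remaining[stack[-2]] -= 1
            stack.pop()
        else:
            seq.append("SHIFT")
            stack.append(buffer.pop(0))
    return seq

# "He has good control": He -> has, has -> ROOT,
# good -> control, control -> has
print(oracle([2, 0, 4, 2]))
```

Note the condition on `remaining`: a word may only be attached to its head via $RIGHT\_ARC$ once all of its own dependents have been collected, since it is removed from the stack afterwards.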
The problem with traditional classification algorithms is that they require a very large feature vector. For a classifier that uses the features mentioned above over a reasonably large vocabulary, the number of features can exceed $10^8$, so feature computation takes a lot of time. In fact, it has been observed that more than 95% of parsing time is spent on feature computation. Another problem is that the features are highly sparse.
To overcome these problems, a neural classifier with dense feature vectors can be used, as proposed by Chen & Manning (2014).
I'll discuss this neural model in a later blog post, where I'll also implement it and explain the code. Thank you for reading!

