Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme

This paper reminds me of a funny idea from several years ago: use a CRF tagger to do dependency parsing. Does it work? Well, the author claimed a high score that I have never been able to replicate. Casting a dependency tree into a sequence of tags doesn't sound reasonable.


In this paper, they also did a cast:

Straightforward, but it can't handle nested cases.

Bias Objective Function

The only interesting point is the loss function:

L=\max\sum_{j=1}^{\vert \mathcal{D}\vert}\sum_{t=1}^{L_j}\left(\log(p_t^{(j)}=y_t^{(j)}\vert x_j,\Theta) \cdot I(O) + \alpha\cdot\log(p_t^{(j)}=y_t^{(j)}\vert x_j,\Theta) \cdot \left(1-I(O)\right)\right)

where \(\vert \mathcal{D}\vert\) is the training set size, \(L_j\) is the length of sentence \(x_j\), \(p\) and \(y\) are the predicted and gold labels, \(I(O)\) is an indicator function that outputs \(1\) only if the tag \(y_t^{(j)}\) is O, and \(\alpha\) is the bias weight, which controls how important the non-O tags are. In their experiments, they set \(\alpha=10\).
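As a minimal pure-Python sketch (illustrative names; it assumes the per-token predicted probabilities are already available), the bias objective just scales the log-likelihood of non-O tokens by \(\alpha\):

```python
import math

def biased_loss(token_probs, gold_tags, alpha=10.0):
    """Bias objective: weight the log-likelihood of non-'O' tags by alpha.

    token_probs: list of dicts mapping tag -> predicted probability.
    gold_tags:   list of gold tags for the same tokens, e.g. 'O' or 'B-CP-1'.
    Returns the negated weighted log-likelihood, i.e. a loss to minimize.
    """
    total = 0.0
    for probs, gold in zip(token_probs, gold_tags):
        log_p = math.log(probs[gold])
        weight = 1.0 if gold == "O" else alpha  # I(O) vs. alpha * (1 - I(O))
        total += weight * log_p
    return -total
```

With \(\alpha=10\), a wrong non-O token hurts ten times as much as a wrong O token, which pushes the tagger to care about the (rare) entity tags.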

2018/2/15 posted in  NLP

TypeError: __init__() should return None, not 'NoneType'

2018/2/8 posted in  Python

Dynet Debug

When you hit an error like:

Assertion failed: (dimensions_match(m_leftImpl.dimensions(), m_rightImpl.dimensions())), function TensorEvaluator, file include/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorEvaluator.h, line 392.

Check whether you made a mistake in the transpose permutation, e.g. repeating an axis:

        t_rel_logits = dy.transpose(rel_logits, [1, 0, 0])
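A valid `dims` argument must be a permutation of the axes, and `[1, 0, 0]` repeats axis 0. A hypothetical helper (not a Dynet API) that catches this before Eigen's assertion does:

```python
def check_transpose_dims(dims, ndim):
    """Raise early if `dims` is not a valid axis permutation, e.g. [1, 0, 0]."""
    if sorted(dims) != list(range(ndim)):
        raise ValueError(
            f"{dims} is not a permutation of axes 0..{ndim - 1}; "
            "did you repeat an axis?")
```

For a 2-D expression the intended call was presumably `dy.transpose(rel_logits, [1, 0])`.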
2018/1/8 posted in  ML

What's Dropout Mask

I happened to see some APIs in Dynet about dropout masks, like `set_dropout_masks` on the LSTM builders:

    lstm_builder.set_dropout_masks(batch_size=1)
The document said that:

Set dropout masks at the beginning of a sequence for a specific batch size
If this function is not called on batched input, the same mask will be applied across all batch elements. Use this to apply different masks to each batch element.

What does that mean?

A dropout mask is nothing but a vector of random values \(\mathbf{d}\), with each \(d_i\in\{0, 1\}\) sampled from a Bernoulli distribution. To apply the mask to a data point \(\mathbf{x}\), take the elementwise product:

\tilde{\mathbf{x}} = \mathbf{d} \odot \mathbf{x}

(With inverted dropout, the kept values are additionally scaled by \(1/(1-p)\).)
Suppose you have a batched input of \(2\) elements, input1 and input2. By default the same dropout mask is applied across both batch elements; if you call the function above with batch_size=2, a separate mask is generated for each element.
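A minimal pure-Python sketch of sampling and applying an (inverted) dropout mask; the names are illustrative and this is not the Dynet implementation:

```python
import random

def dropout_mask(dim, p_drop, scale=True):
    """Sample a Bernoulli(1 - p_drop) mask; kept entries scaled by 1/(1-p)."""
    keep = 1.0 - p_drop
    s = 1.0 / keep if scale else 1.0
    return [s if random.random() < keep else 0.0 for _ in range(dim)]

def apply_mask(x, mask):
    """Elementwise product: zeroed entries are 'dropped', kept ones rescaled."""
    return [xi * mi for xi, mi in zip(x, mask)]
```

Sampling `dropout_mask` once and reusing it for both input1 and input2 corresponds to the default (shared-mask) behavior; sampling a fresh mask per element corresponds to per-batch-element masks.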

2018/1/7 posted in  ML

Stack Long Short-Term Memories

Dyer et al. (2015)1 added a stack pointer \(TOP\) to a conventional LSTM. The trick is to use the cell at \(TOP\) to provide \(h_{t-1}\) and \(c_{t-1}\): a push computes a new state on top of \(TOP\), and a pop simply moves \(TOP\) back to the previous state.

Not only are the two standard structures of transition-based dependency parsing (stack and buffer) implemented as stack LSTMs, but a third stack storing the history of actions is introduced and implemented the same way. The authors seem to favor the stack structure and hope it encodes configurations more thoroughly.

  1. C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith, “Transition-Based Dependency Parsing with Stack Long Short-Term Memory,” arXiv:1505.08075, 2015.

2017/11/22 posted in  NLP

Recurrent Neural Network Grammars

Dyer et al. (2016)1 adopted RNNs to do both parsing and language modeling.

RNNs have been argued to be inappropriate models of natural language, since relations between words live in compositional nested structures rather than in sequential surface order2.

The authors introduced a new generative probabilistic model of sentences that lets RNNs model the nested, hierarchical structures of natural language. Parsing operates in bottom-up fashion, while generation makes use of top-down grammar information.

RNNG defines a joint probability distribution over string terminals (words of a language) and phrase-structure nonterminals. It is motivated by the conventional transition system, an abstract state machine, but with one big difference: RNNG is a generative model (although it can be adapted to discriminative parsing). Formally, an RNNG is defined by a triple \(\langle N,\Sigma,\Theta \rangle\), where \(N\) denotes the nonterminal symbols (NP, VP, etc.), \(\Sigma\) the terminal symbols (\(N \cap \Sigma = \emptyset\)), and \(\Theta\) the model parameters. In implementation, RNNG consists of a stack storing partially completed constituents, a buffer storing already-generated terminals, and a list of past actions. It generates a sentence \(x\) and its phrase-structure tree \(y\) simultaneously. The action sequence \(\boldsymbol{a} = \langle a_1,\ldots,a_n \rangle\) that generates \((x, y)\) is called the oracle.
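To make the oracle concrete, here is a toy interpreter (no neural scoring, illustrative names) for the three RNNG actions NT(X), GEN(w), and REDUCE, which builds the bracketed tree as it executes:

```python
def run_oracle(actions):
    """Execute an RNNG-style oracle.

    NT(X) opens a nonterminal X on the stack, GEN(w) generates terminal w,
    REDUCE closes the topmost open nonterminal. Returns the generated
    terminals and the completed bracketed tree.
    """
    stack, terminals = [], []
    for act in actions:
        if act[0] == "NT":
            stack.append(["(" + act[1]])           # open constituent
        elif act[0] == "GEN":
            terminals.append(act[1])               # emit a word
            stack[-1].append(act[1])
        elif act[0] == "REDUCE":
            completed = " ".join(stack.pop()) + ")"
            if stack:
                stack[-1].append(completed)        # attach to parent
            else:
                stack.append([completed])          # tree is finished
    return terminals, stack[-1][0]
```

Running the oracle for "(S (NP the cat) sleeps)" yields both the sentence \(x\) and its tree \(y\), matching the joint generation described above.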

  1. C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith, “Recurrent Neural Network Grammars,” arXiv:1602.07776, 2016.

  2. N. Chomsky, Syntactic Structures. The Hague: Mouton and Company, 1957.

2017/11/4 posted in  NLP

Structured Prediction


Most NLP tasks produce structured decisions, such as chunking, parsing, and coreference resolution.

In classification, every output is a label independent of the others, so the learning algorithm is relatively simple.

In structured prediction, the overall output is a structured object such as a tree, a sequence, or an alignment, composed of intermediate output variables with inter-dependencies. Since the feature vector is extracted from both the input and the output structure, structured learning is considerably more difficult, involving repeated inference and search.

The final decisions are not independent; each depends on the others. To cooperate with each other, the global decision (the final one) must respect its mutual dependencies with the local decisions (the temporary ones).

Joint Inference with General Constraint Structure

Joint Inference

For instance, consider the entity and relation recognition problem, in which we want to recognize all entities and the relations between them. Here is a tiny example:

Suppose some classifiers output the \(5\) tables above. If Jane is a per, then her relation with Bernie cannot be born_in, even though that relation has the highest probability under the prediction.

This example motivates joint learning: if we learn to recognize entities and relations together, better results can be achieved. Most systems instead use the conventional pipeline design, which propagates errors. Although unifying the whole pipeline tends to be too aggressive, closely related tasks can be learned jointly.


Many "rules" act as constraints on prediction, such as:

  • The words pp., pages correspond to PAGE.
  • Four digits starting with 20xx and 19xx are DATE.
  • Quotations can appear only in TITLE.

By adding constraints, we get correct results without changing the model.

Constrained Conditional Models

One formula tells it all:

\hat{y} = \arg\max_{y}\; w^T\phi(x,y) - \sum_{k}\rho_k\, d_{C_k}(x,y)

The second term covers all the constraints: \(d_{C_k}(x,y)\) measures how far \(y\) is from satisfying constraint \(C_k\), and \(\rho_k\) is the penalty for violating it.


Learning fits the objective function, via max-margin or CRF log-likelihood.

Learning is thus driven by the attempt to find a weight vector \(w\) such that for each annotated example:

\forall y \quad w^T\phi(x_i,y_i)\geq w^T\phi(x_i,y)+\Delta(y,y_i)

Here \(\Delta(y,y_i)\) denotes the penalty (loss) incurred for predicting the structure \(y\) instead of the gold \(y_i\).
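For a tiny structure space, the margin condition can be checked by brute force. This sketch (illustrative names, not from any particular library) returns the candidate structures that violate it:

```python
def margin_violations(w, phi, delta, x_i, y_i, candidates):
    """Return candidates y violating w.phi(x_i, y_i) >= w.phi(x_i, y) + delta(y, y_i).

    w:     weight vector (list of floats)
    phi:   feature function (x, y) -> feature vector
    delta: structured loss delta(y, y_i)
    """
    def score(y):
        return sum(wk * fk for wk, fk in zip(w, phi(x_i, y)))

    gold = score(y_i)
    return [y for y in candidates if gold < score(y) + delta(y, y_i)]
```

A max-margin learner (e.g. structured SVM or perceptron variants) updates \(w\) whenever this list is non-empty.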


Inference is expressed as the maximization of a scoring function: output the structure with the highest score.


So far this is plain linear-model learning and prediction, which is a home truth. What about the constraints?

Constraints are formalized as an Integer Linear Program (ILP) in the literature: a set of rules expressed through indicator variables. For example, in POS tagging every word must receive one and only one label, so

\forall i \quad \sum_{y \in Y }1_{\{ y_i=y\}}=1
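With only a few positions and labels, the ILP can be emulated by brute-force search: score every label assignment, discard those violating the constraints, and return the best. A toy sketch (illustrative names, not a real ILP solver) mirroring the entity/relation example, where a born_in relation requires its second argument to be a loc:

```python
from itertools import product

def constrained_argmax(scores, constraint):
    """scores: one {label: score} dict per output position.
    constraint: predicate over a full label assignment (a tuple).
    Returns the highest-scoring assignment that satisfies the constraint.
    """
    labels = [list(s) for s in scores]
    best, best_score = None, float("-inf")
    for assign in product(*labels):      # enumerate all assignments
        if not constraint(assign):
            continue                     # violates a constraint: skip
        total = sum(s[l] for s, l in zip(scores, assign))
        if total > best_score:
            best, best_score = assign, total
    return best
```

The unconstrained argmax here would pick (per, per, born_in); the constraint forces the globally consistent (per, loc, born_in) instead, exactly the joint-inference behavior described above.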

2017/9/24 posted in  ML

Markdown syntax guide and writing on Markdown


Markdown is intended to be as easy-to-read and easy-to-write as is feasible.
Readability, however, is emphasized above all else. A Markdown-formatted document should be publishable as-is, as plain text, without looking like it's been marked up with tags or formatting instructions.
Markdown's syntax is intended for one purpose: to be used as a format for writing for the web.

Read more   2017/4/18 posted in  Other