Dynet Debug

When you hit an assertion failure like this one:

Assertion failed: (dimensions_match(m_leftImpl.dimensions(), m_rightImpl.dimensions())), function TensorEvaluator, file include/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorEvaluator.h, line 392.

Check whether the permutation you pass to transpose lists the same dimension more than once, as in:

        t_rel_logits = dy.transpose(rel_logits, [1, 0, 0])
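
A quick sanity check can catch this before the graph is evaluated. The sketch below is a minimal illustration in plain Python; `check_transpose_dims` is a hypothetical helper (not part of DyNet), and it assumes the tensor has 3 dimensions.

```python
def check_transpose_dims(dims, ndim):
    """Raise ValueError unless dims is a permutation of 0..ndim-1."""
    if sorted(dims) != list(range(ndim)):
        raise ValueError(
            f"dims={list(dims)} is not a permutation of {list(range(ndim))}"
        )
    return list(dims)


# A valid permutation for a 3-dimensional tensor passes through unchanged.
print(check_transpose_dims([1, 0, 2], ndim=3))

# The permutation from the snippet above repeats dimension 0 and is rejected.
try:
    check_transpose_dims([1, 0, 0], ndim=3)
except ValueError as err:
    print("invalid:", err)
```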
2018/1/8 posted in  ML

What's Dropout Mask

I came across some APIs in DyNet that deal with dropout masks.


The documentation says:

Set dropout masks at the beginning of a sequence for a specific batch size
If this function is not called on batched input, the same mask will be applied across all batch elements. Use this to apply different masks to each batch element.

What does that mean?

A dropout mask is nothing but a vector of random values \(\mathbf{d}_i\in\{0, 1\}\) sampled from a Bernoulli distribution. To apply the mask to a data point \(\mathbf{x}\), multiply the two element-wise:

\tilde{\mathbf{x}}=\mathbf{d}\odot\mathbf{x}

Suppose a batch holds \(2\) elements, input1 and input2. If masks are set for a batch size of \(2\), two masks are generated, one per element. Otherwise, as the documentation says, the same dropout mask is applied to every element of the batch.
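
The difference between a shared mask and per-element masks can be sketched in plain Python. This is only an illustration, not DyNet's implementation: `bernoulli_mask`, `apply_mask`, and the keep probability of 0.5 are assumptions, and the usual rescaling of the kept values is left out for brevity.

```python
import random


def bernoulli_mask(size, keep_prob, rng):
    """Sample a 0/1 mask; each entry is 1 with probability keep_prob."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(size)]


def apply_mask(x, mask):
    """Element-wise product of a data vector and a mask."""
    return [xi * mi for xi, mi in zip(x, mask)]


rng = random.Random(0)
input1 = [1.0, 2.0, 3.0, 4.0]
input2 = [5.0, 6.0, 7.0, 8.0]
batch = [input1, input2]

# One shared mask: every batch element is zeroed at the same positions.
shared = bernoulli_mask(4, keep_prob=0.5, rng=rng)
shared_out = [apply_mask(x, shared) for x in batch]

# One mask per batch element: positions are dropped independently.
per_element = [bernoulli_mask(4, keep_prob=0.5, rng=rng) for _ in batch]
per_element_out = [apply_mask(x, m) for x, m in zip(batch, per_element)]

print(shared_out)
print(per_element_out)
```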

2018/1/7 posted in  ML

Structured Prediction


Most NLP tasks, such as chunking, parsing, and coreference resolution, produce structured decisions.

In classification, each output is a label that is independent of the others, so the learning algorithm is relatively simple.

In structured prediction, the overall output is a structured object such as a tree, a sequence, or an alignment, composed of intermediate output variables that depend on one another. Because the feature vector is extracted from both the input and the output structure, structured learning is also considerably harder: it involves repeated inference and search.

The final decision is not independent; it depends on other decisions. For the decisions to be consistent with each other, the global decision (the final one) must depend on the local decisions (the intermediate ones).

Joint Inference with General Constraint Structure

Joint Inference

For instance, consider the entity and relation recognition problem, in which we want to recognize all entities and the relations between them. Here is a tiny example:

[Figure: screenshot, hankcs.com, 2017-09-24 4:14:35 PM]

Some classifiers may output the \(5\) tables above. If Jane is a per (person), then her relation with Bernie cannot be born_in, even though born_in has the highest predicted probability.
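
The conflict can be made concrete with a small sketch. Everything numeric here is made up for illustration, since the original tables are not reproduced: the probabilities, the `loc` type, the `spouse_of` and `no_rel` relations, and the type constraints in `allowed_args` are all assumptions. Joint decoding picks the best combination of entity and relation labels that is type-consistent, instead of taking each classifier's top label on its own.

```python
from itertools import product

# Hypothetical local classifier outputs (the numbers are made up).
entity_probs = {
    "Jane": {"per": 0.9, "loc": 0.1},
    "Bernie": {"per": 0.6, "loc": 0.4},
}
relation_probs = {"born_in": 0.5, "spouse_of": 0.45, "no_rel": 0.05}

# Assumed type constraints: which (arg1, arg2) entity types a relation allows.
allowed_args = {
    "born_in": {("per", "loc")},
    "spouse_of": {("per", "per")},
    "no_rel": {("per", "per"), ("per", "loc"), ("loc", "per"), ("loc", "loc")},
}


def joint_decode():
    """Return the highest-scoring (type1, type2, relation) that is consistent."""
    best, best_score = None, -1.0
    for e1, e2, rel in product(
        entity_probs["Jane"], entity_probs["Bernie"], relation_probs
    ):
        if (e1, e2) not in allowed_args[rel]:
            continue  # skip combinations that violate a type constraint
        s = entity_probs["Jane"][e1] * entity_probs["Bernie"][e2] * relation_probs[rel]
        if s > best_score:
            best, best_score = (e1, e2, rel), s
    return best


# Independent decoding takes the top relation regardless of entity types.
print(max(relation_probs, key=relation_probs.get))  # born_in
# Joint decoding keeps both entities as per and switches the relation.
print(joint_decode())  # ('per', 'per', 'spouse_of')
```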

This example gives us the motivation: if we learn to recognize entities and relations together, we will get better results. But most systems use the conventional pipeline design, which lets errors propagate from one stage to the next. Although unifying the whole pipeline tends to be too aggressive, some closely related tasks can be learnt jointly.


Many "rules" act as constraints in prediction, such as:

  • The words pp. and pages correspond to PAGE.
  • Four-digit numbers of the form 19xx or 20xx are DATE.
  • Quotations can appear only in TITLE.

By adding constraints, we get correct results without changing the model.
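
As a rough sketch of how such rules can be applied without touching the model, the snippet below filters the candidate labels for each token before taking the best-scoring one. It is a simplification that labels tokens one at a time; the label set (including AUTHOR), the scores, and the helper names are hypothetical.

```python
import re

LABELS = ["AUTHOR", "TITLE", "DATE", "PAGE"]


def allowed_labels(token):
    """Return the labels the rules above permit for a single token."""
    if token in ("pp.", "pages"):
        return ["PAGE"]
    if re.fullmatch(r"(19|20)\d\d", token):
        return ["DATE"]
    if '"' in token:
        return ["TITLE"]
    return LABELS


def constrained_label(token, scores):
    """Pick the highest-scoring label among those the rules allow."""
    return max(allowed_labels(token), key=lambda label: scores[label])


# Hypothetical model scores that wrongly prefer AUTHOR for every token.
scores = {"AUTHOR": 0.6, "TITLE": 0.1, "DATE": 0.25, "PAGE": 0.05}
print(constrained_label("1998", scores))   # DATE
print(constrained_label("Smith", scores))  # AUTHOR
```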

Constrained Conditional Models

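One formula tells all. In its standard form:

\hat{y}=\arg\max_{y}\; w^T\phi(x,y)-\sum_{k}\rho_k\, d_{C_k}(x,y)

The second term covers all the constraints: \(\rho_k\) is the penalty for violating constraint \(C_k\), and \(d_{C_k}(x,y)\) measures how far \(y\) is from satisfying it.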


Learning optimizes an objective function, such as the max-margin objective or the CRF log-likelihood.

Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example, we have:

\forall y \quad w^T\phi(x_i,y_i)\geq w^T\phi(x_i,y)+\Delta(y,y_i)

Here \(\Delta(y,y_i)\) denotes the penalty for predicting a structure \(y\) other than the annotated \(y_i\).
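
To make the margin condition concrete, here is a minimal sketch that enumerates every candidate structure for a tiny tagging example and lists those that violate the inequality. It assumes a Hamming distance for \(\Delta\) and one indicator feature per (word, tag) pair; the weights and helper names are made up.

```python
from itertools import product


def hamming(y, y_gold):
    """Delta(y, y_i): number of positions where y differs from the gold."""
    return sum(a != b for a, b in zip(y, y_gold))


def score(w, x, y):
    """w^T phi(x, y) with one indicator feature per (word, tag) pair."""
    return sum(w.get((word, tag), 0.0) for word, tag in zip(x, y))


def margin_violations(w, x, y_gold, tags):
    """List every y whose score plus Delta exceeds the gold score."""
    gold = score(w, x, y_gold)
    return [
        y for y in product(tags, repeat=len(x))
        if score(w, x, y) + hamming(y, y_gold) > gold
    ]


x, y_gold, tags = ("dogs", "bark"), ("N", "V"), ("N", "V")
w_good = {("dogs", "N"): 2.0, ("bark", "V"): 2.0}
w_bad = {("dogs", "N"): 0.5, ("bark", "V"): 0.5}

print(margin_violations(w_good, x, y_gold, tags))  # []
print(margin_violations(w_bad, x, y_gold, tags))
```

With `w_good` the gold structure beats every alternative by at least its Hamming distance, so nothing is reported; with `w_bad` the margin is too small and learning would have to keep adjusting the weights.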


Inference is expressed as the maximization of a scoring function: output the structure that gets the highest score, \(\hat{y}=\arg\max_{y} w^T\phi(x,y)\).
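
A brute-force version of this maximization fits in a few lines. The sketch below assumes the same kind of toy scoring function, one made-up weight per (word, tag) pair, and simply tries every tag sequence; real systems replace the enumeration with dynamic programming or a solver, since the number of candidates grows exponentially with the sentence length.

```python
from itertools import product


def score(w, x, y):
    """w^T phi(x, y) with one indicator feature per (word, tag) pair."""
    return sum(w.get((word, tag), 0.0) for word, tag in zip(x, y))


def predict(w, x, tags):
    """Return the tag sequence with the highest score (exhaustive search)."""
    return max(product(tags, repeat=len(x)), key=lambda y: score(w, x, y))


# Hypothetical weights for a two-word sentence.
w = {("dogs", "N"): 1.0, ("bark", "V"): 1.0, ("bark", "N"): 0.2}
print(predict(w, ("dogs", "bark"), ("N", "V")))  # ('N', 'V')
```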


So far this is just learning and prediction with a linear model, which is common knowledge. What about constraints?

In the literature, constraints are formalized with Integer Linear Programming (ILP): the rules are written as linear constraints over indicator variables. For example, in POS tagging every word has one and only one label, so

\forall i \quad \sum_{y \in Y }1_{\{ y_i=y\}}=1
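
In code, the indicator variables form a 0/1 matrix with one row per word and one column per label, and the constraint says each row sums to exactly 1. The check below is a minimal sketch with a hypothetical three-label tag set; it only verifies an assignment and does not solve the ILP.

```python
def one_label_per_word(indicators):
    """Check that each word's 0/1 indicator row sums to exactly 1."""
    return all(
        all(v in (0, 1) for v in row) and sum(row) == 1
        for row in indicators
    )


# Columns stand for the labels N, V, D (a hypothetical tag set).
valid = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
invalid = [[1, 1, 0], [0, 0, 0], [0, 1, 0]]

print(one_label_per_word(valid))    # True
print(one_label_per_word(invalid))  # False
```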

2017/9/24 posted in  ML