## Introduction

Most NLP tasks, such as chunking, parsing, and coreference resolution, produce structured decisions.

In classification, every output is a label that is independent of the others, so the learning algorithm is relatively simple.

In structured prediction, the overall output is a structured object such as a tree, a sequence, or an alignment, composed of interdependent intermediate output variables. Because the feature vector is extracted from both the input and the output structure, structured learning is considerably harder: it involves repeated inference and search.

The final decision is not independent; it depends on other decisions. For the decisions to be consistent with each other, the **global decision** (the final one) must be mutually dependent on the **local decisions** (the intermediate ones).

### Joint Inference with General Constraint Structure

#### Joint Inference

For instance, consider the *Entities and Relations Recognizing* problem, in which we want to recognize all entities and the relations between them. Here is a tiny example:

Suppose independent classifiers output the \(5\) tables above. If Jane is a *per*, then her relation with Bernie cannot be *born_in*, even though *born_in* has the highest probability under the local prediction.

This example motivates joint learning: if we learn to recognize entities and relations together, we can expect better results. Most systems, however, use the conventional pipeline design, which propagates errors from one stage to the next. Although unifying the whole pipeline may be too aggressive, some closely related tasks can be learned jointly.

#### Constraints

Many "rules" act as constraints in prediction, such as:

- The words *pp.* and *pages* correspond to PAGE.
- Four-digit numbers of the form 19xx or 20xx are DATE.
- Quotations can appear only in TITLE.

By enforcing such constraints at prediction time, we can obtain correct results without changing the learned model.
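As a rough illustration of the idea, the sketch below filters candidate labelings with the three rules listed above and keeps the best-scoring one that survives. The label set, the `scores` table, and the function names are all hypothetical, and reading "quotations" as tokens containing a quotation mark is an assumption. The exhaustive search is only workable for tiny inputs.

```python
import re
from itertools import product

# Hypothetical label set for a citation-field tagger.
LABELS = ["AUTHOR", "TITLE", "DATE", "PAGE"]

def violates(tokens, labels):
    """Return True if the labeling breaks any of the three rules."""
    for tok, lab in zip(tokens, labels):
        if tok in ("pp.", "pages") and lab != "PAGE":
            return True
        if re.fullmatch(r"(19|20)\d\d", tok) and lab != "DATE":
            return True
        # Assumption: a "quotation" is a token containing a quotation mark.
        if '"' in tok and lab != "TITLE":
            return True
    return False

def constrained_decode(tokens, scores):
    """Pick the highest-scoring labeling that satisfies all rules.

    scores[i][label] is a made-up local score for token i.
    """
    best, best_score = None, float("-inf")
    for labels in product(LABELS, repeat=len(tokens)):
        if violates(tokens, labels):
            continue
        total = sum(scores[i][lab] for i, lab in enumerate(labels))
        if total > best_score:
            best, best_score = labels, total
    return best

tokens = ["Smith", "1998", "pp."]
scores = [
    {"AUTHOR": 2.0, "TITLE": 0.5, "DATE": 0.1, "PAGE": 0.1},
    {"AUTHOR": 0.2, "TITLE": 0.3, "DATE": 1.0, "PAGE": 1.5},  # local model prefers PAGE
    {"AUTHOR": 0.1, "TITLE": 1.2, "DATE": 0.1, "PAGE": 1.0},  # local model prefers TITLE
]
prediction = constrained_decode(tokens, scores)
print(prediction)  # ('AUTHOR', 'DATE', 'PAGE')
```

The local scores alone would label `1998` as PAGE and `pp.` as TITLE; the constraints rule those labelings out while the scores themselves stay untouched.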

### Constrained Conditional Models

One formula tells all.

The second term covers all the constraints.
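For reference, a form of the objective commonly given for constrained conditional models in the literature is sketched below. The symbols \(\rho_k\) (the penalty weight of the \(k\)-th constraint) and \(d_{C_k}\) (a measure of how much \(y\) violates constraint \(C_k\)) are assumptions introduced here, not notation taken from this text.

\[
y^* = \arg\max_{y} \; w^T\phi(x,y) - \sum_{k} \rho_k \, d_{C_k}(x,y)
\]

Under this reading, the first term is the score of the learned linear model, and setting \(\rho_k = \infty\) turns the \(k\)-th constraint into a hard one.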

#### Training

Training learns the weights of the objective function, for example with a max-margin or a CRF log-likelihood criterion.

Learning is thus driven by the attempt to find a weight vector \(w\) such that, for each annotated example \((x_i, y_i)\), we have:

\[
\forall y \quad w^T\phi(x_i,y_i)\geq w^T\phi(x_i,y)+\Delta(y,y_i)
\]

Here \(\Delta(y,y_i)\) denotes the penalty for predicting a structure \(y\) other than the annotated structure \(y_i\).
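One simple way to work toward this inequality is a subgradient step on the structured hinge loss: find the structure that violates the margin the most and, if there is one, move \(w\) toward the annotated structure and away from the violator. The sketch below is not necessarily the training procedure intended here. The feature map `phi`, the Hamming distance used as \(\Delta\), and the enumeration of all outputs are illustrative assumptions.

```python
from itertools import product

LABELS = ["N", "V"]  # hypothetical tag set

def phi(x, y):
    """Hypothetical joint feature map: counts of (word, tag) and (tag, tag) pairs."""
    feats = {}
    for word, tag in zip(x, y):
        feats[("emit", word, tag)] = feats.get(("emit", word, tag), 0) + 1
    for prev, cur in zip(y, y[1:]):
        feats[("trans", prev, cur)] = feats.get(("trans", prev, cur), 0) + 1
    return feats

def score(w, x, y):
    """Linear score w^T phi(x, y) with w stored as a sparse dict."""
    return sum(w.get(f, 0.0) * v for f, v in phi(x, y).items())

def hamming(y, y_gold):
    """Delta(y, y_gold): number of positions where the tags differ."""
    return sum(a != b for a, b in zip(y, y_gold))

def most_violated(w, x, y_gold):
    """Loss-augmented inference by enumeration: argmax_y score + Delta."""
    return max(product(LABELS, repeat=len(x)),
               key=lambda y: score(w, x, y) + hamming(y, y_gold))

def update(w, x, y_gold, lr=1.0):
    """If some y violates the margin, move w toward phi(x, y_gold)."""
    y_hat = most_violated(w, x, y_gold)
    violation = score(w, x, y_hat) + hamming(y_hat, y_gold) - score(w, x, y_gold)
    if violation > 0:
        for f, v in phi(x, y_gold).items():
            w[f] = w.get(f, 0.0) + lr * v
        for f, v in phi(x, y_hat).items():
            w[f] = w.get(f, 0.0) - lr * v
    return w

x, y_gold = ("dogs", "bark"), ("N", "V")
w = {}
for _ in range(5):
    w = update(w, x, y_gold)

# After training, every candidate structure satisfies the margin constraint.
margin_ok = all(score(w, x, y_gold) >= score(w, x, y) + hamming(y, y_gold)
                for y in product(LABELS, repeat=len(x)))
```

Note that training already calls inference (`most_violated`) inside its loop, which is why structured learning is more expensive than plain classification.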

#### Inference

Inference is expressed as the maximization of a scoring function: the output is the structure that gets the highest score.
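For sequence outputs, this maximization is often done with dynamic programming rather than enumeration. The sketch below is a minimal Viterbi decoder, assuming the score decomposes into per-position terms and first-order transition terms; that decomposition, the function name, and the toy score tables are assumptions for illustration.

```python
def viterbi(n, labels, emit, trans):
    """Return the label sequence maximizing
    sum_i emit(i, y_i) + sum_{i>0} trans(y_{i-1}, y_i)."""
    # best[i][y] = best score of any prefix that ends at position i with label y
    best = [{y: emit(0, y) for y in labels}]
    back = []
    for i in range(1, n):
        col, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            col[y] = best[-1][prev] + trans(prev, y) + emit(i, y)
            ptr[y] = prev
        best.append(col)
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    y = max(labels, key=lambda l: best[-1][l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]

words = ["dogs", "bark"]
emit_table = {("dogs", "N"): 2.0, ("dogs", "V"): 0.5,
              ("bark", "N"): 1.0, ("bark", "V"): 0.8}
trans_table = {("N", "N"): -1.0, ("N", "V"): 1.0,
               ("V", "N"): 0.0, ("V", "V"): -1.0}

tags = viterbi(len(words), ["N", "V"],
               lambda i, y: emit_table[(words[i], y)],
               lambda p, y: trans_table[(p, y)])
print(tags)  # ['N', 'V']
```

Taken alone, the per-word scores would tag `bark` as `N`; the transition term is what makes `V` win, which is the sense in which the global decision depends on the local ones.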

#### Constraints

So far we have covered learning and prediction with a linear model, which is standard material. What about constraints?

In the literature, constraints are formalized with Integer Linear Programming (ILP): each rule is expressed as a linear constraint over indicator functions. For example, in POS tagging, every word has exactly one label, so

\[
\forall i \quad \sum_{y \in Y }1_{\{ y_i=y\}}=1
\]
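To make the indicator encoding concrete, the sketch below turns a tag sequence into 0/1 variables and checks the "exactly one label per word" constraint row by row. No ILP solver is involved; the helper names and the tag set are hypothetical, and a real solver would receive the same row sums as equality constraints.

```python
def to_indicators(tags, label_set):
    """Encode a tag sequence as 0/1 variables z[i][k] = 1 iff tags[i] == label_set[k]."""
    return [[1 if tag == y else 0 for y in label_set] for tag in tags]

def satisfies_one_label(z):
    """Check the constraint: for every position i, the indicators sum to 1."""
    return all(sum(row) == 1 for row in z)

label_set = ["DT", "NN", "VB"]
z = to_indicators(["DT", "NN", "VB"], label_set)
valid = satisfies_one_label(z)

# An assignment a solver must reject: word 0 has two labels, word 1 has none.
invalid = satisfies_one_label([[1, 1, 0], [0, 0, 0]])
```

Here `valid` is `True` and `invalid` is `False`. Any well-formed tag sequence satisfies the constraint by construction; the constraint matters once the 0/1 variables are free variables of an optimization problem.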