Assignment 3: Using Learning for real-world applications

The generic sequence labelling problem

You have a sequence of entities: e1, e2, e3, ..., en
and a sequence of corresponding labels: l1, l2, l3, ..., ln
The problem is: given a new sequence of entities, how can you come up with the most likely sequence of corresponding labels, L*?
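
To make the input and output of the problem concrete, here is a minimal sketch of the simplest possible labeller: it gives each entity the label it carried most often in the training sequences, ignoring the neighbouring entities entirely. This is only a baseline for illustration, not the method expected for the assignment. The function names (train_baseline, predict) and the toy entities and labels are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline(sequences):
    """Learn the most frequent label for each entity.

    sequences: list of (entities, labels) pairs, one label per entity.
    Returns (model, default_label), where default_label is the most
    frequent label overall and is used for unseen entities.
    """
    per_entity = defaultdict(Counter)
    overall = Counter()
    for entities, labels in sequences:
        if len(entities) != len(labels):
            raise ValueError("each entity needs exactly one label")
        for entity, label in zip(entities, labels):
            per_entity[entity][label] += 1
            overall[label] += 1
    model = {e: c.most_common(1)[0][0] for e, c in per_entity.items()}
    default_label = overall.most_common(1)[0][0]
    return model, default_label


def predict(model, default_label, entities):
    """Label a new entity sequence, one entity at a time."""
    return [model.get(e, default_label) for e in entities]


# Toy training data, invented purely for illustration.
train = [
    (["e1", "e2", "e3"], ["A", "B", "A"]),
    (["e2", "e3", "e1"], ["B", "B", "A"]),
    (["e3", "e1"], ["A", "A"]),
]
model, default_label = train_baseline(train)
labels = predict(model, default_label, ["e3", "e2", "e9"])
print(labels)
```

Because this baseline looks at each entity in isolation, it cannot use context to resolve ambiguous entities; models that score the whole label sequence address that.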

This problem applies to many real-world scenarios. For this assignment you will need to solve any of the following three problems:

Problem 1 – Part-of-speech (POS) annotation problem

Entities’ sequence: words in a sentence
Label sequence: corresponding POS tags.


For example,


For the entities’ sequence: time flies like an arrow
Possible label sequences are:
  N V P A N :L1
  V N P A N :L2
  N N V A N :L3


The most likely label sequence for the above sentence should be L1.
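
One way to make "most likely" precise is to score each candidate label sequence under a probabilistic model and keep the highest-scoring one. The sketch below assumes a first-order hidden Markov model, which is one possible choice and not necessarily the one intended for the assignment. All transition and emission probabilities are made-up toy values chosen for illustration; in practice they would be estimated from the tagged corpus. The names TRANS, EMIT and log_score are hypothetical.

```python
import math

# Toy parameters, invented for illustration only (not estimated from data).
# TRANS[(prev_tag, tag)] = P(tag | prev_tag); "<s>" marks the sentence start.
TRANS = {
    ("<s>", "N"): 0.6, ("<s>", "V"): 0.2,
    ("N", "V"): 0.4, ("N", "N"): 0.2, ("N", "P"): 0.2,
    ("V", "P"): 0.3, ("V", "N"): 0.2, ("V", "A"): 0.3,
    ("P", "A"): 0.6, ("A", "N"): 0.8,
}
# EMIT[(tag, word)] = P(word | tag)
EMIT = {
    ("N", "time"): 0.05, ("V", "time"): 0.01,
    ("V", "flies"): 0.05, ("N", "flies"): 0.02,
    ("P", "like"): 0.1, ("V", "like"): 0.05,
    ("A", "an"): 0.3, ("N", "arrow"): 0.02,
}


def log_score(words, tags):
    """Log-probability of (words, tags) under the toy HMM above."""
    total = 0.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        total += math.log(TRANS[(prev, tag)]) + math.log(EMIT[(tag, word)])
        prev = tag
    return total


words = "time flies like an arrow".split()
candidates = {
    "L1": "N V P A N".split(),
    "L2": "V N P A N".split(),
    "L3": "N N V A N".split(),
}
scores = {name: log_score(words, tags) for name, tags in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

With these toy values the highest-scoring candidate is L1. Enumerating candidates works only for a handful of sequences; for a full tag set, the Viterbi algorithm finds the best sequence without listing every possibility.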

Problem 2 – Protein sequence annotation problem

Entities’ sequence: P1, P2, ...
Label sequence: P, S, T, ...
The set of labels for protein structures is {Primary (P), Secondary (S), Tertiary (T)}.

Problem 3 – Gene sequence annotation problem

The details (corpus links, etc.) are on Moodle.