Acknowledgment: This lab assignment is based on Project 5: Classification, which is a part of a recent offering of CS188 at UC Berkeley. The code and resources provided here are almost entirely drawn from the Berkeley project. We thank the authors at Berkeley for making their project available to the public.
Optical Character Recognition (OCR) is the task of extracting text from image sources. In this assignment, you will design a multi-class classifier on a set of scanned handwritten digit images: specifically a Perceptron classifier. Subsequently, you will create your own enhanced feature extractor, which will supplement the simple features provided to further improve the performance of your classifier.
(Image source: https://inst.eecs.berkeley.edu/~cs188/sp08/projects/project8/webpage/img2.gif.)
The data set on which you will run your classifiers is a collection of handwritten numerical digits (\(0-9\)). OCR on digits is a commercially valuable technology: for instance, the US post office uses it to route mail by zip code. There are systems that can perform with over 99% classification accuracy (see LeNet-5 for an example system in action). Details about your tasks, and the files you must submit as a part of this assignment, are given in the sections below.
The base code for this assignment is available in this zip file. You need to use these files from the classification directory.
File Name | Description |
classificationMethod.py | Abstract super class for the classifiers you will write. You should read this file carefully to see how the infrastructure is set up. Do not edit this file. |
createData.sh | Bash script that runs the perceptron classifier on several different sizes of the training set. |
createGraphIterations.gnuplot | Gnuplot script to plot accuracy versus number of training examples seen. |
createGraphTraining.gnuplot | Gnuplot script to plot accuracy versus size of the training set. |
data.zip | Contains the training, validation and test data sets. |
dataClassifier.py | The wrapper code that will call your classifiers. You will also write your enhanced feature extractor here. |
graph_answers.txt | Blank text file for you to enter answers from Task 2. |
perceptron.py | The file where you will write your perceptron classifier. |
samples.py | I/O code to read in the classification data. Do not edit this file. |
util.py | Code defining some useful tools that will save you a lot of time. Do not edit this file. |
A Perceptron keeps a weight vector \(w^y\) corresponding to each class \(y\) (here \(y\) is an index identifying the class, not an exponent). Given a feature list \(f\) (a vector with the same number of dimensions as the weight vectors), the perceptron computes the class \(y\) whose weight vector is most similar to the input vector \(f\). Formally, given a feature vector \(f\) (in our case, a map from pixel locations to indicators of whether they are on), we score each class \(y\) with:
\( \mbox{score}(f,y) = \sum\limits_{i} f_i w^y_i \),
where \(i\) iterates over the dimensions in the data. Then we choose the class with the highest score as the predicted label for the data instance \(f\). In the code, we will represent \(w^y\) as a Counter (defined in the file util.py).
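Because the features are sparse (most pixels are off), the score is just a sparse dot product. The following is an illustrative sketch only, using plain Python dicts in place of the Counter class from util.py:

```python
# Illustrative sketch: plain dicts standing in for util.py's Counter.
# Features map pixel coordinates to 0/1; each class has its own weight dict.
def score(features, weights):
    """Compute sum_i f_i * w_i, treating missing weights as zero."""
    return sum(value * weights.get(key, 0) for key, value in features.items())

features = {(0, 0): 1, (0, 1): 0, (1, 1): 1}
weights_for_class = {(0, 0): 2.0, (1, 1): -1.0}
print(score(features, weights_for_class))  # 2.0 * 1 + (-1.0) * 1 = 1.0
```

The real Counter in util.py provides this dot product (and the missing-key-is-zero behavior) directly, so in the assignment you will not need to write it by hand.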
In the basic multi-class Perceptron, we scan over the training data one instance at a time. When we come to an instance \((f, y)\), we find the label with highest score: \( y' = arg \max\limits_{y''} score(f,y'') \), breaking ties arbitrarily. We compare \(y'\) to the true label \(y\). If \(y' = y\), we have correctly classified the instance, and we do nothing. Otherwise, we predicted \(y'\) when we should have predicted \(y\). That means that \(w^y\) has scored \(f\) lower, and/or \(w^{y'}\) has scored \(f\) higher, than what would have been ideal. To avert this error in the future, we update these two weight vectors accordingly:
\( w^y = w^y + f \), and
\( w^{y'} = w^{y'} - f \).
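The training pass described above can be sketched as follows. This is only an illustration of the update rule with plain dicts standing in for Counter; your actual implementation goes in perceptron.py and should use self.weights and the Counter arithmetic from util.py:

```python
def predict(features, weights_by_class):
    """Return the class whose weight vector scores the features highest
    (ties broken by class iteration order, i.e. arbitrarily)."""
    def score(w):
        return sum(v * w.get(k, 0) for k, v in features.items())
    return max(weights_by_class, key=lambda y: score(weights_by_class[y]))

def perceptron_update(features, true_label, weights_by_class):
    """One step of the multi-class perceptron: on a mistake,
    add f to the true class's weights and subtract f from the guess's."""
    guess = predict(features, weights_by_class)
    if guess != true_label:
        for k, v in features.items():
            weights_by_class[true_label][k] = weights_by_class[true_label].get(k, 0) + v
            weights_by_class[guess][k] = weights_by_class[guess].get(k, 0) - v
    return guess
```

A training iteration is simply one such update per training instance, in order.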
In this task, you have to implement the multi-class Perceptron classifier. Fill out code in the train() function at the location indicated in perceptron.py. Using the addition, subtraction, and multiplication functionality of the Counter class in util.py, the perceptron updates should be relatively easy to code. Certain implementation issues have been taken care of for you in perceptron.py, such as the setup of the weights data structure: it can be accessed using self.weights in the train() method, and each legal label has its own Counter of weights.

Run your code with the following command.
python dataClassifier.py -c perceptron -t 1000 -s 1000
This will print out a host of details, such as the classifier being trained, the training set size, whether enhanced features are being used (more on this in Task 5), and so on. After this, the classifier gets trained for a default of 3 iterations. dataClassifier.py then prints the accuracy of your model on the validation and test data sets.

The -t option passed to the Perceptron training code specifies the number of training data points to read in, while the -s option specifies the sizes of the validation and test sets (the train, validation, and test data sets are mutually disjoint). Since iterating over 1000 points may take some time, you can use 100 data points instead while developing and testing your code. However, note that all our evaluations will involve train, validation, and test data sets with 1000 points.
Evaluation: Your classifier will be evaluated for its accuracy on the test set after the Perceptron has been trained for the default 3 iterations. You will be awarded 3 marks if the accuracy exceeds 65%, 2 marks if it exceeds 55%, 1 mark if it exceeds 50%, and 0 marks otherwise.
One of the problems with the perceptron is that its performance is sensitive to several practical details, such as how many iterations you train it for, the order in which you use the training examples, and how many data points you use. The current code uses a default of 3 training iterations; you can change this with the -i iterations option. In practice, you would use the performance on the validation set to decide when to stop training, but you don't need to implement this stopping criterion for this task.
Instead, you need to analyse and comment on how the validation accuracy changes (1) as the total number of data points seen during training changes, and (2) as the size of the training set changes. See details below.
Variation with the number of training data points 'seen': In this sub-task, you need to plot the validation accuracy as the classifier 'sees' increasing numbers of data points. A point counts as 'seen' when it is examined to determine whether the Perceptron needs an update, regardless of whether the point turns out to be correctly classified already and no update is made.
You have been provided helper code for this purpose: the train() function in perceptron.py writes <examples seen>,<validation accuracy> as a comma-separated pair into the file perceptronIterations.csv twice in every iteration. The plotting script createGraphIterations.gnuplot uses this csv file to create the required plot and save it in a file named plot_iterations.png. You can modify either of the two code files mentioned above to change the plot being generated. Feel free to experiment with the various switches, and write a description of your observations and of the nature of the plot in the file graph_answers.txt in plain text.
Below are commands to generate the required graph.
python dataClassifier.py -c perceptron -t 1000 -s 1000 -v
gnuplot createGraphIterations.gnuplot
Variation with training set size in each iteration: Here, you will see the influence of the training set size on validation accuracy, when the classifier is trained for the same number of iterations.
Every time dataClassifier.py is run, it writes a comma-separated pair of numbers, <training set size>,<accuracy>, for the validation and the test sets to the files perceptron_valid.csv and perceptron_test.csv respectively. Use createData.sh (supplied in the zip file) to run the perceptron training on 100, 200, 300, ..., 1000 examples. Create a plot named plot_training.png by running createGraphTraining.gnuplot. Below are the commands to generate the required graph.
./createData.sh
gnuplot createGraphTraining.gnuplot
Append your observations and a description of the plot obtained to the file graph_answers.txt (to which you had already written in the previous sub-task). Additionally, answer the following questions.
Note: The bash script to generate the data for the graph might take a long time to run. You are advised to proceed to subsequent tasks while the script is running.
Evaluation: Sub-tasks 1 and 2 will be evaluated for a total of 4 marks, with the plots and the corresponding explanations and answers all taken into consideration.
Building classifiers is only a small part of getting a good system working for a task. Indeed, the main difference between a good classification system and a bad one is usually not the classifier itself, but rather the quality of the features used. So far, we have only used the simplest possible features: the identity of each pixel (being on/off).
To increase your classifier's accuracy further, you will need to extract more useful features from the data. The EnhancedFeatureExtractorDigit() function in dataClassifier.py is your new playground. When analysing your classifier's results, you should look at some of your errors and search for characteristics of the input that would give the classifier useful information about the label. You can add code to the analysis() function in dataClassifier.py to inspect what your classifier is doing.
As a concrete illustration, consider the digit data you have used. In each training point (a binary matrix corresponding to an image), consider the number of separate, connected regions of white pixels. This quantity would vary by digit type. 1, 2, 3, 5, and 7 tend to mostly have one contiguous region of white space, while the loops in 0, 6, 8, and 9 create more. The number of white regions in a 4 depends on the writer. This is an example of a feature that is not directly available to the classifier from the per-pixel information.
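As a sketch of how such a feature might be computed (a hypothetical helper, not part of the provided code), a simple breadth-first flood fill can count the 4-connected regions of white (off) pixels in a binary image:

```python
from collections import deque

def count_white_regions(image):
    """Count 4-connected regions of white (0) pixels in a binary 2D list.
    Illustrative sketch only; the provided datum objects expose pixels
    through their own accessors, not as a plain list of lists."""
    rows, cols = len(image), len(image[0])
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 0 and (r, c) not in seen:
                regions += 1            # found an unexplored white region
                seen.add((r, c))
                queue = deque([(r, c)])
                while queue:            # flood-fill the whole region
                    cr, cc = queue.popleft()
                    for nr, nc in ((cr + 1, cc), (cr - 1, cc),
                                   (cr, cc + 1), (cr, cc - 1)):
                        if (0 <= nr < rows and 0 <= nc < cols
                                and image[nr][nc] == 0 and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
    return regions
```

For example, a small image of a closed loop (a "0"-like shape) has two white regions: the background and the hole inside the loop.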
If your feature extractor adds new features such as the quantity described above (think of others, too!), the classifier will be able to exploit them. Note that some features may require non-trivial computation to extract, so write efficient and correct code. Add your new binary features for the digit data set in the EnhancedFeatureExtractorDigit() function. Note that you can encode a feature which takes 3 values [1, 2, 3] by using 3 binary features, of which exactly one is on for any given instance, indicating which of the original feature values has occurred.
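The one-hot encoding described above could look like the following sketch; the feature names and the clipping of counts above 3 are illustrative choices, not prescribed by the assignment:

```python
def one_hot_region_features(num_regions):
    """Encode a count taking values 1, 2, or 3 as three binary indicator
    features, exactly one of which is on. Counts above 3 are clipped
    into the last bucket (an illustrative design choice)."""
    bucket = min(num_regions, 3)
    return {"whiteRegions=%d" % k: int(k == bucket) for k in (1, 2, 3)}

print(one_hot_region_features(2))
# {'whiteRegions=1': 0, 'whiteRegions=2': 1, 'whiteRegions=3': 0}
```

Because exactly one indicator is on per instance, the perceptron can learn a separate weight for each possible value of the original feature, rather than a single weight for the raw count.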
We will test your classifier with the following command.
python dataClassifier.py -c perceptron -f -t 1000 -s 1000
Evaluation: If your new features result in any improvement in accuracy over your Perceptron classifier with the basic features, you will receive 1 mark. If your new features give you a test accuracy exceeding 80% using the command above, you will get 2 additional marks. (Note: we don't expect breaching the 80% mark to be easy!).
Note: Using comments in your code, briefly explain the working of your EnhancedFeatureExtractorDigit().
You are not done yet! Place all files in which you have written or modified code in a directory named la7-rollno, where rollno is your roll number (say 12345678). Tar and gzip the directory to produce a single compressed file (say la7-12345678.tar.gz). It must contain the following files.
perceptron.py
graph_answers.txt
dataClassifier.py
createData.sh
createGraphIterations.gnuplot
createGraphTraining.gnuplot
plot_iterations.png
plot_training.png
Submit this compressed file on Moodle, under Lab Assignment 7.