Schedule: Week, Date and Title, Summary

Week 1 (2008-07-24): Course overview
- Basic course information and introductory lecture slides
- Learning task, instances, features, labels, reward/loss, training, testing
- Overfitting and generalization in the context of polynomial curve-fitting
- Overfitting and generalization in the context of segmenting coin-toss sequences

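The week-1 overfitting point (training error keeps falling as the polynomial degree grows, even as generalization worsens) can be seen in a few lines. This is a minimal illustrative sketch, not course code; the data values are made up.

```python
import numpy as np

# Fixed 1-D dataset: a noisy quadratic trend (values hard-coded for a deterministic run).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([0.1, 0.4, 1.2, 2.1, 4.2, 6.0, 9.3, 12.0])

def train_error(degree):
    """Mean squared training error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((y - pred) ** 2))

errors = [train_error(d) for d in range(6)]
# Training error can only decrease (or stay equal) as the degree grows, because
# the hypothesis classes are nested -- yet high-degree fits generalize worse.
```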
Week 2 (2008-07-28): Bayesian learning; Regression
- Bayesian learning, priors and posteriors, MAP and other quantities of interest
- Smoothing and conjugate priors; the exponential family
- Review of Bernoulli and binomial distributions, gamma and beta functions
- Review of multinomial and Dirichlet distributions
- Laplace smoothing with a uniform prior over the coin parameter
- Linear least-squares fitting from a Bayesian perspective
- How the Bayesian approach leads to the Ridge penalty
- Solving least-squares and Ridge regression; the "hat matrix"

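A minimal sketch of the closed-form Ridge solution and the "hat matrix" mentioned above, on made-up data. With lam = 0 it reduces to ordinary least squares.

```python
import numpy as np

# Tiny design matrix (intercept column plus one feature) and targets, illustrative only.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.1, 2.9, 4.2])

def ridge_fit(X, y, lam):
    """Closed-form Ridge solution beta = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def hat_matrix(X, lam):
    """The 'hat matrix' H with fitted values y_hat = H y."""
    d = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

beta_ls = ridge_fit(X, y, 0.0)   # lam = 0 recovers ordinary least squares
beta_r = ridge_fit(X, y, 1.0)    # lam > 0 shrinks the weight vector
y_hat = hat_matrix(X, 1.0) @ y   # fitted values via the hat matrix
```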
Week 2 (2008-07-31): Ridge and Lasso; Nonlinear and kernel regression
- Alternative way to write Ridge regression as finding parameters within a disk
- Behavior of model weights as the disk radius is changed (demo)
- "Evidence-sharing" and geometric reasons for non-sparse solutions
- Lasso regression: replacing the L2 model penalty with an L1 model penalty
- Casting Lasso regression as a quadratic program
- Behavior of model weights as the L1 limit is changed (demo)
- Implicit feature-selection effect (sparser models) in Lasso
- Handling nonlinear regression by lifting data points to higher dimensions
- Local regression using kernels: Nadaraya-Watson

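The Nadaraya-Watson estimator in the last bullet is just a kernel-weighted local average. A minimal sketch with a Gaussian kernel and made-up points:

```python
import math

# Training points (illustrative): y is roughly sin-like in x.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.0, 0.48, 0.84, 1.0, 0.91, 0.6, 0.14]

def nadaraya_watson(x, xs, ys, bandwidth=0.3):
    """Prediction sum_i K((x - x_i)/h) y_i / sum_i K((x - x_i)/h), Gaussian K."""
    weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
    total = sum(weights)
    return sum(w * yi for w, yi in zip(weights, ys)) / total

pred = nadaraya_watson(1.2, xs, ys)  # a weighted average of nearby targets
```

As the bandwidth shrinks, the prediction at a training x approaches that point's own y, since its weight dominates.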
Week 3 (2008-08-04): Classification basics
- Fold-based and leave-one-out cross-validation
- Classification with small discrete label sets
- Nearest-neighbor classification
- Criteria to evaluate classifiers: accuracy, recall, precision, F1, AUC
- Why classification loss is fundamentally different from regression
- Bayes-optimal classification, Bayes risk, discriminants

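The evaluation criteria listed above (accuracy, precision, recall, F1) all derive from the confusion counts. A minimal sketch on made-up labels:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# tp=2, fp=1, fn=1, tn=2 for this toy case
acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```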
Week 3 (2008-08-07): Discriminants; Perceptron; Margin
- Classification based on class-conditional density
- Multivariate Gaussian (normal) distribution review, covariance
- Discriminants between Gaussian class-conditional densities
- Linear discriminant in the special case of equal covariance for all classes
- The special case of spherical Gaussian densities
- Discriminants are quadratic surfaces in the case of diverse covariances
- Interactive demo (play and view)
- Loss functions: 0/1 ("true"), perceptron, hinge, square

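The four loss functions in the last bullet can each be written as a function of the margin m = y f(x) with y in {-1, +1}. A minimal sketch (standard definitions, not course code):

```python
def zero_one_loss(m):
    """The 'true' 0/1 loss: 1 for a mistake, 0 otherwise."""
    return 1.0 if m <= 0 else 0.0

def perceptron_loss(m):
    """Penalizes only misclassified points, linearly in the (negative) margin."""
    return max(0.0, -m)

def hinge_loss(m):
    """SVM loss: penalizes any margin below 1."""
    return max(0.0, 1.0 - m)

def square_loss(m):
    """(1 - m)^2: penalizes even large correct margins, unlike the others."""
    return (1.0 - m) ** 2
```

All except 0/1 are convex upper bounds on it near m <= 0, which is why they are easier to optimize.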
Week 4 (2008-08-11): Support vector machines
- Fisher's linear discriminant
- Perceptron vs. hinge loss and margin

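Fisher's linear discriminant, mentioned above, projects data onto w = S_w^{-1}(mu1 - mu0), where S_w is the within-class scatter. A minimal sketch on made-up, well-separated classes:

```python
import numpy as np

# Two small classes (illustrative): separated mainly along the first coordinate.
X0 = np.array([[0.0, 0.0], [0.2, 0.4], [0.4, 0.1], [0.1, 0.3]])
X1 = np.array([[2.0, 0.1], [2.2, 0.5], [1.9, 0.2], [2.1, 0.4]])

def fisher_direction(X0, X1):
    """Fisher's discriminant direction w = S_w^{-1} (mu1 - mu0)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - mu0).T @ (X0 - mu0)   # within-class scatter of class 0
    S1 = (X1 - mu1).T @ (X1 - mu1)   # within-class scatter of class 1
    return np.linalg.solve(S0 + S1, mu1 - mu0)

w = fisher_direction(X0, X1)
# Projecting both classes onto w should separate them cleanly.
p0, p1 = X0 @ w, X1 @ w
```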
Week 4 (2008-08-14): Dual SVM, SMO, Kernels
- Gordan's theorem, dual feasibility, complementary slackness
- Example: linear programming
- Turning the SVM QP into a dual QP
- Demonstration in Scilab of support vectors

Week 5 (2008-08-18): Maximum entropy; Logistic regression
- Sequential minimal optimization (SMO) algorithm
- Support vector regression and ordinal regression
- General feature functions φ(x,y)
- Conditional probabilistic model for classification
- Relationship between maxent and logistic regression
- Loss function for LR compared to hinge, square-hinge, etc.

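The LR loss compared to hinge above is the smooth logistic loss log(1 + exp(-m)); unlike hinge it is differentiable everywhere and never exactly zero. A minimal gradient-descent sketch on a made-up separable 1-D dataset:

```python
import math

# Tiny 1-D dataset, labels in {-1, +1}, linearly separable.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]

def logistic_loss(w):
    """Average logistic loss log(1 + exp(-y w x)) -- the LR counterpart of hinge."""
    return sum(math.log1p(math.exp(-y * w * x)) for x, y in data) / len(data)

def gradient(w):
    return sum(-y * x / (1.0 + math.exp(y * w * x)) for x, y in data) / len(data)

w, lr = 0.0, 0.5
losses = [logistic_loss(w)]
for _ in range(50):
    w -= lr * gradient(w)
    losses.append(logistic_loss(w))
# The loss decreases monotonically here (convex, smooth objective, small step size).
```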
Week 5 (2008-08-20, 6:30-8pm): SVM variations; transduction
- Relationship between maxent/LR and SVM loss functions
- Keerthi and DeCoste linear SVM with square-slack regularizer
- Comparison of hinge and ramp loss
- Convex-concave SVM formulation and optimization (CCCP)

Week 6 (2008-08-25): Scilab tutorial
- Slides from Dhaval Manawana
Instructor away at SIGKDD Conference

Week 6 (2008-08-28): SVM variations
- SVM variants and methods for SVM optimization
Guest lecture by Prof. Ganesh Ramakrishnan; instructor away at SIGKDD Conference

Week 7 (2008-09-01): Kernels and embedding
- The kernel trick, common kernels in use
- Dual SVM with nonlinear kernels
- SVM dual: where to plug in the kernel
- Polynomial, Gaussian, RBF kernels
- Rules for creating valid kernels by combining known kernels
- Gram matrix and why it should be positive semidefinite
- Eigendecomposition demo (rotate chart to see separation)

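The positive-semidefiniteness requirement on the Gram matrix can be checked numerically. A minimal sketch with the Gaussian/RBF kernel on made-up points:

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

def gaussian_gram(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the Gaussian kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

K = gaussian_gram(points)
eigvals = np.linalg.eigvalsh(K)
# A valid kernel's Gram matrix is symmetric positive semidefinite:
# every eigenvalue is >= 0 (up to floating-point round-off).
```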
Week 7 (2008-09-04): RKHS, representer theorem
- Embedding in infinite-dimensional vector spaces
- Designing kernels for combinatorial objects like strings and trees
- Function norms, eigenfunctions
- Connection between min(norm(beta)) and the function norm in an RKHS
- Representer theorem (Kimeldorf and Wahba)
- Overflow topic: transductive and semi-supervised SVMs, non-convex problems
- CCCP approach to transduction

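The Kimeldorf-Wahba representer theorem in the list above can be stated compactly; this is the standard formulation, written out here for reference:

```latex
% Minimizing a regularized empirical risk over an RKHS \mathcal{H}_k,
\[
  f^{\star}
  \;=\;
  \arg\min_{f \in \mathcal{H}_k}
  \;\sum_{i=1}^{n} L\bigl(y_i, f(x_i)\bigr)
  \;+\; \lambda \,\lVert f \rVert_{\mathcal{H}_k}^{2},
\]
% the minimizer admits a finite expansion over the training points:
\[
  f^{\star}(x) \;=\; \sum_{i=1}^{n} \alpha_i \, k(x_i, x).
\]
```

So even when the RKHS is infinite-dimensional, the search reduces to finding n coefficients α_i, which is what makes kernel methods computable.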
Week 8 (2008-09-08): (Midterm week)
Week 8 (2008-09-11): (Midterm week)
Week 8 (2008-09-13): Midterm exam with some solutions
Week 9 (2008-09-15): Structured learning
- Structured prediction and learning problems
- Structured learning and prediction applications in NLP
- Motivation from F1, AUC, ranking
- Motivation from graphical models: HMM in StructSVM notation

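In the StructSVM view of an HMM, prediction is argmax over label sequences of a score that decomposes into emission and transition terms, computed by Viterbi. A minimal decoding sketch; the score tables are made up, not from the lecture:

```python
emission = [  # emission[t][s]: score of state s at position t (illustrative)
    [1.0, 0.2],
    [0.1, 1.5],
    [0.3, 0.9],
]
transition = [[0.5, 0.0],  # transition[s][s2]: score of moving s -> s2
              [0.0, 0.5]]

def viterbi(emission, transition):
    """Max-scoring state sequence for an additively decomposed chain score."""
    n, k = len(emission), len(emission[0])
    best = [emission[0][:]]          # best[t][s]: best score of a path ending in s
    back = []                        # backpointers for path recovery
    for t in range(1, n):
        row, brow = [], []
        for s in range(k):
            cands = [best[t - 1][p] + transition[p][s] for p in range(k)]
            p = max(range(k), key=lambda i: cands[i])
            row.append(cands[p] + emission[t][s])
            brow.append(p)
        best.append(row)
        back.append(brow)
    s = max(range(k), key=lambda i: best[-1][i])
    path = [s]
    for brow in reversed(back):
        s = brow[s]
        path.append(s)
    return list(reversed(path))

path = viterbi(emission, transition)
```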
Week 9 (2008-09-18): Structured learning, cont'd


Week 10 (2008-09-22): PCFG
- Case study: probabilistic context-free grammars
- Inference and max-margin learning
Guest lecture by Ganesh Ramakrishnan; instructor away at ISI Kolkata

Week 10 (2008-09-25): PCFG
- More about max-margin PCFG learning
Guest lecture by Ganesh Ramakrishnan; instructor on leave

Week 11 (2008-09-29): Decomposable loss
Week 11 (2008-10-02): (Mahatma Gandhi's birthday)
Week 11 (2008-10-03, 8-9pm): StructSVM applications

Week 12 (2008-10-06): Eigen, SVD, PCA
- Eigenvalue and eigenvector recap, demo
- SVD, with a low-rank-plus-noise matrix
- Connections between SVD, least squares, Ridge regression
- Eigen-SVD connection
- Connection between SVD and the pseudo-inverse of a matrix
- Principal component analysis (PCA)
- Relation between SVD and PCA

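The SVD-PCA relation in the last bullet is that the right singular vectors of the centered data matrix are the principal directions, and the squared singular values are the eigenvalues of the scatter matrix. A minimal sketch on made-up data:

```python
import numpy as np

# Illustrative data matrix, rows = samples.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0],
              [0.0, -1.0], [3.0, 0.5], [-3.0, -0.5]])

Xc = X - X.mean(axis=0)                  # PCA operates on centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal directions are the rows of Vt; scores are Xc @ Vt.T.
# Eigenvalues of the scatter matrix Xc^T Xc equal the squared singular values.
scatter_eigvals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]
```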
Week 12 (2008-10-09): (Dussehra)

Week 13 (2008-10-13): More dyadic models
- Using kernels in PCA
- Dyadic factors of boolean matrices, aspect model
- Cross-associations and co-clustering
- Non-negative matrix factorization
- Iterative scaling

Week 13 (2008-10-16): Mixture and aspect models
- Undirected graph Laplacian
- Spectral graph partitioning, ratio cuts, transduction
- Probabilistic model learning, expectation maximization
- Expectation maximization, variational interpretation
- EM for factor, aspect, and dyadic models
- Aspect model as non-negative matrix factorization

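Expectation maximization from the list above, in its simplest setting: a two-component 1-D Gaussian mixture with shared, fixed variance. The data values are made up; the point is that the log-likelihood never decreases across EM steps.

```python
import math

# 1-D data in two visibly separated clumps (hard-coded, illustrative).
data = [-2.1, -1.9, -2.3, -1.7, -2.0, 1.8, 2.2, 2.0, 1.9, 2.1]

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(data, pi, mu1, mu2, var):
    """One EM step: E-step computes responsibilities, M-step re-estimates pi, mu1, mu2."""
    r = [pi * normal_pdf(x, mu1, var) /
         (pi * normal_pdf(x, mu1, var) + (1 - pi) * normal_pdf(x, mu2, var))
         for x in data]
    n1 = sum(r)
    pi = n1 / len(data)
    mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - n1)
    return pi, mu1, mu2

def log_likelihood(data, pi, mu1, mu2, var):
    return sum(math.log(pi * normal_pdf(x, mu1, var) +
                        (1 - pi) * normal_pdf(x, mu2, var)) for x in data)

pi, mu1, mu2, var = 0.5, -1.0, 1.0, 1.0
lls = [log_likelihood(data, pi, mu1, mu2, var)]
for _ in range(20):
    pi, mu1, mu2 = em_step(data, pi, mu1, mu2, var)
    lls.append(log_likelihood(data, pi, mu1, mu2, var))
# EM monotonically improves the likelihood; the means settle near -2 and +2.
```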
Week 14 (2008-10-20): (Instructor on leave)
Week 14 (2008-10-23): (Instructor on leave)
Week 15 (2008-10-27): (Diwali)

Week 15 (2008-10-30): EM, Mean field, Sampling distributions
- Expectation of functions over complicated distributions: mean-field approximation
- Sampling the posterior if Pr(h|D) is simple
- Rejection sampling using upper bounds on density functions
- The Metropolis MCMC algorithm for complicated Pr(h|D)
- Metropolis-Hastings algorithm, proof of correctness
- The Gibbs sampler's proposal Q(h,h') for sampling

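The Metropolis algorithm above needs Pr(h|D) only up to its normalizing constant. A minimal random-walk sketch targeting an (illustrative) unnormalized standard normal:

```python
import math
import random

random.seed(0)

def unnorm_density(h):
    """Target Pr(h|D) known only up to a constant: here proportional to exp(-h^2/2)."""
    return math.exp(-h * h / 2.0)

def metropolis(n_steps, step=1.0):
    """Random-walk Metropolis: propose h' = h + noise, accept w.p. min(1, p(h')/p(h))."""
    h, samples = 0.0, []
    for _ in range(n_steps):
        proposal = h + random.uniform(-step, step)   # symmetric proposal
        if random.random() < unnorm_density(proposal) / unnorm_density(h):
            h = proposal                             # accept the move
        samples.append(h)                            # on reject, the old h repeats
    return samples

samples = metropolis(20000)
# The chain's samples approximate a standard normal: mean near 0, variance near 1.
```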
Week 16 (2008-11-03): Ensemble, bagging, boosting
- Ensemble and committee learners
- Bagging, benefits from uncorrelated errors
- Bagging with committee members making correlated mistakes
- Boosting, phased exponential loss minimization
- Bias-variance decomposition

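The "benefits from uncorrelated errors" bullet can be seen numerically: averaging k predictors with independent noise shrinks error variance roughly by 1/k. A minimal simulation sketch (all numbers illustrative):

```python
import random

random.seed(1)

def noisy_prediction():
    """A 'committee member': the true value 1.0 plus independent unit-variance noise."""
    return 1.0 + random.gauss(0.0, 1.0)

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

single = [noisy_prediction() for _ in range(2000)]
bagged = [sum(noisy_prediction() for _ in range(10)) / 10 for _ in range(2000)]

var_single = variance(single)   # about sigma^2 = 1
var_bagged = variance(bagged)   # about sigma^2 / 10 -- the uncorrelated-errors gain
```

With correlated mistakes the gain is smaller, which is the point of the following bullet in the list.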
Week 16 (2008-11-06): Risk and generalization bounds
- Risk, Bayes risk, empirical/training risk
- Risk and generalization bounds, intro
- Probably approximately correct (PAC) learning
- Risk bounds: hypothesis consistent with the training set
- Bounds on true risk minus empirical (training) risk

Week 16 (2008-11-08, 2-3:30pm): VC theory
- Empirical processes
- Risk bounds for finite hypothesis classes
- Also see the Scholkopf and Smola book
- Symmetrization, growth function

Week 17 (2008-11-10): Stability bounds
- The stability approach to proving generalization bounds
- Algorithmic stability
- Uniform stability for regression
- Uniform stability for classification

Week ? (2008-mm-dd): Rate distortion, information bottleneck
(Cancelled by popular lack of demand)
- Rate-distortion theory and relation to cross-association
- Information bottleneck and approximating distributions

Week 17 (2008-11-13): (Guru Nanak's birthday)

2008-11-19: Final exam with some solutions