Week | Date and Title | Summary |
1 | 2008-07-24
Course overview |
- Basic course information and introductory lecture slides
- Learning task, instances, features, labels, reward/loss, training, testing
- Overfitting and generalization in the context of polynomial curve-fitting
- Overfitting and generalization in the context of segmenting coin toss sequences
|
2 | 2008-07-28
Bayesian learning; Regression |
- Bayesian learning, priors and posteriors, MAP and other quantities of interest
- Smoothing and conjugate priors; exponential family
- Review of Bernoulli and binomial distributions, gamma and beta functions
- Review of multinomial distribution and Dirichlet distribution
- Laplace smoothing with uniform prior over coin parameter
- Linear least-squares fitting from a Bayesian perspective
- How the Bayesian approach leads to the Ridge penalty
- Solving least-squares and Ridge regression, the "hat matrix"
|
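The least-squares, Ridge, and hat-matrix computations above can be sketched in a few lines of numpy; a minimal sketch with synthetic data (variable names are mine, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # toy design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 0.1                                    # Ridge penalty strength
# Ordinary least squares: w = (X'X)^{-1} X'y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: w = (X'X + lam*I)^{-1} X'y -- the MAP estimate under a
# Gaussian prior on w
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# The "hat matrix" H maps observations y to fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
```

Ridge shrinks the weight vector (‖w_ridge‖ ≤ ‖w_ols‖), and H is a projection (H² = H).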
2 | 2008-07-31
Ridge and Lasso; Non-linear and kernel regression |
- Alternative way to write Ridge regression as finding parameters within a disk
- Behavior of model weights as the disk radius is changed (demo)
- "Evidence-sharing" and geometric reasons for non-sparse solutions
- Lasso regression: replacing the L2 model penalty with an L1 model penalty
- Casting Lasso regression as a quadratic program
- Behavior of model weights as the L1 limit is changed (demo)
- Implicit feature selection effect (sparser models) in Lasso
- Handling non-linear regression by lifting data points to higher dimensions
- Local regression using kernels -- Nadaraya-Watson
|
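The sparsity effect of the L1 penalty can be seen in the soft-thresholding operator, which solves the one-dimensional Lasso problem exactly; a small illustrative sketch (not course code):

```python
import numpy as np

def soft_threshold(z, t):
    """argmin_w 0.5*(w - z)**2 + t*|w|: shrink z toward zero and
    clip to exactly zero when |z| <= t -- the source of Lasso sparsity."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, -0.4, 1.2, 0.05])
w = soft_threshold(z, 0.5)    # small coefficients become exactly 0
```

An L2 (Ridge) penalty would instead scale every coefficient toward zero without ever making it exactly zero, which is the geometric point behind Lasso's implicit feature selection.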
3 | 2008-08-04
Classification basics |
- Fold-based and leave-one-out cross-validation
- Classification with small discrete label sets
- Nearest-neighbor classification
- Criteria to evaluate classifiers: accuracy, recall, precision, F1, AUC
- Why classification loss is fundamentally different from regression loss
- Bayes-optimal classification, Bayes risk, discriminants
|
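The evaluation criteria above can be computed directly from confusion-matrix counts; a sketch on a toy label sequence (assumed data):

```python
# Confusion-matrix counts for a binary classifier (toy labels)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)      # of predicted positives, fraction correct
recall = tp / (tp + fn)         # of true positives, fraction found
f1 = 2 * precision * recall / (precision + recall)
```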
3 | 2008-08-07
Discriminants; Perceptron; Margin |
- Classification based on class-conditional density
- Multivariate Gaussian (normal) distribution review, covariance
- Discriminants between Gaussian class-conditional densities
- Linear discriminant in the special case of equal covariance for all classes
- The special case of spherical Gaussian densities
- Discriminants are quadratic surfaces in the case of diverse covariances
- Loss functions: 0/1 ("true"), perceptron, hinge, square
|
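The equal-covariance special case above yields a linear discriminant; a minimal sketch, assuming two Gaussian classes with a shared covariance S (all numbers illustrative):

```python
import numpy as np

# With a shared covariance S, the Bayes discriminant between two
# Gaussian class-conditional densities reduces to the linear rule
#   w = S^{-1}(mu1 - mu0),  b = -0.5*(mu1 + mu0)'w + log(p1/p0)
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S = np.array([[1.0, 0.3], [0.3, 1.0]])
p0 = p1 = 0.5                        # equal class priors

w = np.linalg.solve(S, mu1 - mu0)
b = -0.5 * (mu1 + mu0) @ w + np.log(p1 / p0)

def classify(x):
    return int(x @ w + b > 0)        # 1 if class 1 is more probable
```

With unequal covariances the quadratic terms no longer cancel, and the decision surface becomes a quadratic surface as noted above.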
4 | 2008-08-11
Support vector machines |
- Fisher's linear discriminant
- Perceptron vs. hinge loss and margin
|
4 | 2008-08-14
Dual SVM, SMO, Kernels |
- Gordan's theorem, dual feasibility, complementary slackness
- Example: linear programming
- Turning the SVM QP into a dual QP
- Demonstration of support vectors in Scilab
|
5 | 2008-08-18
Maximum entropy; Logistic regression |
- Sequential minimal optimization (SMO) algorithm
- Support vector regression and ordinal regression
- General feature functions φ(x,y)
- Conditional probabilistic model for classification
- Relationship between maxent and logistic regression
- Loss function for LR compared to hinge, square-hinge, etc.
|
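The losses being compared above can all be written as functions of the margin z = y·f(x); a small sketch (the logistic loss upper-bounds the 0/1 loss only after scaling):

```python
import math

# Each loss is a function of the margin z = y * f(x)
def zero_one(z):  return 0.0 if z > 0 else 1.0         # the "true" loss
def hinge(z):     return max(0.0, 1.0 - z)             # SVM
def logistic(z):  return math.log(1.0 + math.exp(-z))  # LR / maxent

# All penalize misclassified points (z < 0) heavily, and hinge and
# logistic keep rewarding margin beyond the decision boundary
losses = [(hinge(z), logistic(z)) for z in (-1.0, 0.0, 2.0)]
```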
5 | 2008-08-20, 6:30--8pm
SVM variations; transduction |
- Relationship between maxent/LR and SVM loss functions
- Keerthi and DeCoste linear SVM with square-slack regularizer
- Comparison of hinge and ramp loss
- Convex-concave SVM formulation and optimization (CCCP)
|
6 | 2008-08-25
Scilab tutorial |
- Slides from Dhaval Manawana
Instructor away at SIGKDD Conference
|
6 | 2008-08-28
SVM variations |
- SVM variants
- SVM optimization methods
Guest lecture by Prof. Ganesh Ramakrishnan; instructor away at SIGKDD Conference
|
7 | 2008-09-01
Kernels and embedding |
- The kernel trick, common kernels in use
- Dual SVM with nonlinear kernels
- SVM dual, where to plug in the kernel
- Polynomial, Gaussian, RBF kernels
- Rules for creating valid kernels by combining known kernels
- Gram matrix and why it should be positive semidefinite
- Eigendecomposition (rotate chart to see separation)
|
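One way to see the positive-semidefinite requirement on the Gram matrix is to build one and inspect its eigenvalues; a sketch with an RBF kernel on random points (illustrative, not course code):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
K = rbf_gram(rng.normal(size=(20, 4)))
eigvals = np.linalg.eigvalsh(K)   # all >= 0 for a valid kernel
```

A symmetric matrix with a negative eigenvalue cannot be the Gram matrix of any embedding, which is why the combination rules for valid kernels all preserve positive semidefiniteness.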
7 | 2008-09-04
RKHS, representer theorem |
- Embedding in infinite-dimensional vector spaces
- Designing kernels for combinatorial objects like strings and trees
- Function norms, eigenfunctions
- Connection between min(norm(beta)) and the function norm in RKHS
- Representer theorem (Kimeldorf and Wahba)
- Overflow topic: transductive and semi-supervised SVMs, non-convex problems
- CCCP approach to transduction
|
8 | 2008-09-08 |
(Midterm week) |
8 | 2008-09-11 |
(Midterm week) |
8 | 2008-09-13 |
Midterm exam with some solutions |
9 | 2008-09-15
Structured learning |
- Structured prediction and learning problems
- Structured learning and prediction applications in NLP
- Motivation from F1, AUC, ranking
- Motivation from graphical models: HMM in StructSVM notation
|
9 | 2008-09-18
Structured learning, cont'd |
|
10 | 2008-09-22
PCFG |
- Case study: Probabilistic context-free grammars
- Inference and max-margin learning
Guest lecture by Ganesh Ramakrishnan; instructor away at ISI Kolkata |
10 | 2008-09-25
PCFG |
- More about max-margin PCFG learning
Guest lecture by Ganesh Ramakrishnan; instructor on leave
|
11 | 2008-09-29
Decomposable loss |
|
11 | 2008-10-02 |
(Mahatma Gandhi birthday) |
11 | 2008-10-03, 8--9pm
StructSVM applications |
|
12 | 2008-10-06
Eigen, SVD, PCA |
- Eigenvalue and eigenvector recap, demo
- SVD, with low-rank plus noise matrix
- Connections between SVD, least squares, Ridge regression
- Eigen-SVD connection
- Connection between SVD and the pseudoinverse of a matrix
- Principal component analysis (PCA)
- Relation between SVD and PCA
|
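The SVD-PCA and SVD-pseudoinverse connections above can be checked numerically; a sketch on a random matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 5))
Ac = A - A.mean(axis=0)                    # center columns for PCA

# PCA via the covariance matrix's eigenvalues ...
cov = Ac.T @ Ac / (len(A) - 1)
evals = np.linalg.eigvalsh(cov)[::-1]      # descending order

# ... matches PCA via the singular values of the centered data
s = np.linalg.svd(Ac, compute_uv=False)
evals_svd = s**2 / (len(A) - 1)

# SVD also yields the pseudoinverse: A^+ = V diag(1/s) U'
A_pinv = np.linalg.pinv(Ac)
```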
12 | 2008-10-09 |
(Dussehra) |
13 | 2008-10-13
More dyadic models |
- Using kernels in PCA
- Dyadic factors of boolean matrices, aspect model
- Cross-associations and co-clustering
- Non-negative matrix factorization
- Iterative scaling
|
13 | 2008-10-16
Mixture and aspect models |
- Undirected graph Laplacian
- Spectral graph partitioning, ratio cuts, transduction
- Probabilistic model learning, expectation maximization
- Expectation maximization, variational interpretation
- EM for factor, aspect, and dyadic models
- Aspect model as non-negative matrix factorization
|
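A minimal EM sketch for a two-coin mixture, illustrating the E-step/M-step alternation from the lecture (toy generated data; notation is mine):

```python
import numpy as np

# Toy data: each of 200 rows reports heads out of 10 tosses of one
# of two coins with unknown biases (0.2 and 0.8 here)
rng = np.random.default_rng(3)
true_bias = np.array([0.2, 0.8])
heads = rng.binomial(10, true_bias[rng.integers(0, 2, size=200)])

theta = np.array([0.4, 0.6])      # initial bias guesses
pi = np.array([0.5, 0.5])         # mixing weights
for _ in range(50):
    # E-step: posterior responsibility of each coin for each row
    like = (theta**heads[:, None]) * ((1 - theta)**(10 - heads[:, None])) * pi
    resp = like / like.sum(axis=1, keepdims=True)
    # M-step: re-estimate biases and mixing weights from responsibilities
    theta = (resp * heads[:, None]).sum(axis=0) / (10 * resp.sum(axis=0))
    pi = resp.mean(axis=0)
```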
14 | 2008-10-20 |
(Instructor on leave) |
14 | 2008-10-23 |
(Instructor on leave) |
15 | 2008-10-27 |
(Diwali) |
15 | 2008-10-30
EM, Mean field, Sampling distributions |
- Expectation of functions over complicated distributions: mean-field approximation
- Sampling the posterior if Pr(h|D) is simple
- Rejection sampling using upper bounds on density functions
- The Metropolis MCMC algorithm for complicated Pr(h|D)
- Metropolis-Hastings algorithm, proof of correctness
- The Gibbs sampler's Q(h,h') for sampling
|
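The Metropolis algorithm above needs the target density only up to its normalizing constant; a minimal random-walk sketch targeting a standard normal (illustrative parameters):

```python
import math, random

def metropolis(logp, x0, steps, scale=1.0):
    """Random-walk Metropolis: propose x' = x + N(0, scale^2), accept
    with probability min(1, p(x')/p(x)). Needs p only up to a constant."""
    x, out = x0, []
    for _ in range(steps):
        xp = x + random.gauss(0.0, scale)
        if random.random() < math.exp(min(0.0, logp(xp) - logp(x))):
            x = xp
        out.append(x)
    return out

random.seed(0)
samples = metropolis(lambda x: -0.5 * x * x, 0.0, 20000)  # target N(0,1)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```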
16 | 2008-11-03
Ensemble, bagging, boosting |
- Ensemble and committee learners
- Bagging, benefits from uncorrelated errors
- Bagging with committee members making correlated mistakes
- Boosting, phased exponential loss minimization
- Bias-variance decomposition
|
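The benefit of uncorrelated errors can be computed exactly for a majority vote of independent committee members (an idealized assumption; in practice bagged models make correlated mistakes, as the lecture notes):

```python
import math

def majority_vote_error(p, m):
    """Error of a majority vote over m classifiers that each err
    independently with probability p (m odd): P[Binomial(m, p) > m/2]."""
    return sum(math.comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m // 2 + 1, m + 1))

single = 0.3
committee = majority_vote_error(single, 11)   # far below 0.3
```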
16 | 2008-11-06
Generalization bounds |
- Risk, Bayes risk, empirical/training risk
- Risk and generalization bounds, intro
- Probably approximately correct (PAC) learning
- Risk bounds: hypothesis consistent with training set
- Bounds on true risk minus empirical (training) risk
|
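For a finite hypothesis class and a learner that outputs a hypothesis consistent with the training set, the standard PAC bound gives the sample size m ≥ (1/ε)(ln|H| + ln(1/δ)); a quick calculation with illustrative numbers:

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Smallest m with m >= (ln|H| + ln(1/delta)) / eps, which ensures a
    hypothesis consistent with m examples has true risk <= eps with
    probability at least 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

m = pac_sample_size(h_size=10**6, eps=0.05, delta=0.01)
```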
16 | 2008-11-08, 2--3:30pm
VC theory |
- Empirical processes
- Risk bounds for finite hypothesis classes
- Also see the Scholkopf and Smola book
- Symmetrization, growth function
|
17 | 2008-11-10
Algorithmic stability |
- The stability approach to proving generalization bounds
- Algorithmic stability
- Uniform stability for regression
- Uniform stability for classification
|
? | 2008-mm-dd
Rate distortion, information bottleneck |
(Cancelled by popular lack of demand)
- Rate-distortion theory and relation to cross-association
- Information bottleneck and approximating distributions
|
17 | 2008-11-13 |
(Guru Nanak birthday) |
| 2008-11-19 |
Final exam with some solutions |