CS705 Statistical Foundations of Machine Learning
Autumn 2008 Calendar

Week Date and Title Summary
1 2008-07-24
Course overview
  • Basic course information and introductory lecture slides
  • Learning task, instances, features, labels, reward/loss, training, testing
  • Overfitting and generalization in the context of polynomial curve-fitting (sketch below)
  • Overfitting and generalization in the context of segmenting coin toss sequences
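A minimal NumPy illustration of the polynomial overfitting story (data, degrees, and noise level are invented for the demo, not taken from the lecture):

    import numpy as np

    # Training error keeps falling as the polynomial degree grows, but
    # held-out error eventually rises: that gap is overfitting.
    rng = np.random.default_rng(7)
    x  = np.sort(rng.uniform(0, 1, 15))
    y  = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=15)
    xt = np.sort(rng.uniform(0, 1, 100))              # held-out test points
    yt = np.sin(2 * np.pi * xt) + 0.2 * rng.normal(size=100)
    for deg in (1, 3, 9):
        c = np.polyfit(x, y, deg)
        tr = np.mean((np.polyval(c, x) - y) ** 2)
        te = np.mean((np.polyval(c, xt) - yt) ** 2)
        print(f"degree {deg}: train MSE {tr:.3f}, test MSE {te:.3f}")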
2 2008-07-28
Bayesian learning; Regression
  • Bayesian learning, priors and posteriors, MAP and other quantities of interest
  • Smoothing and conjugate priors; exponential family
  • Review of Bernoulli, binomial distributions, gamma and beta functions
  • Review of multinomial distribution and Dirichlet distribution
  • Laplace smoothing with uniform prior over coin parameter
  • Linear least-squares fitting from a Bayesian perspective
  • How the Bayesian approach leads to Ridge penalty
  • Solving least-squares and Ridge regression, the "hat matrix" (sketch below)
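A minimal NumPy sketch of least-squares vs. Ridge in closed form, including the "hat matrix" that maps observations y to fitted values (synthetic data; the penalty strength lam is an arbitrary illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                     # 50 points, 3 features
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

    lam = 0.1                                        # Ridge penalty strength
    I = np.eye(X.shape[1])
    w_ls    = np.linalg.solve(X.T @ X, X.T @ y)            # least squares
    w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # Ridge / MAP with Gaussian prior

    # "Hat matrix" H: fitted values are y_hat = H y, a linear function of y
    H = X @ np.linalg.solve(X.T @ X + lam * I, X.T)
    print(np.allclose(H @ y, X @ w_ridge))           # True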
2 2008-07-31
Ridge and Lasso;
Non-linear and kernel regression
  • Alternative way to write Ridge regression: finding the parameter vector within a disk of bounded radius
  • Behavior of model weights as disk radius is changed (demo)
  • "Evidence-sharing" and geometric reasons for non-sparse solutions
  • Lasso regression: replacing L2 model penalty with L1 model penalty
  • Casting Lasso regression as a quadratic program
  • Behavior of model weights as L1 limit is changed (demo)
  • Implicit feature selection effect (sparser models) in Lasso
  • Handling non-linear regression by lifting data points to higher dimension
  • Local regression using kernels -- Nadaraya-Watson
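A sketch of Nadaraya-Watson local regression, assuming a Gaussian kernel and an illustrative bandwidth:

    import numpy as np

    def nadaraya_watson(x_query, X, y, h=0.5):
        """Prediction is a kernel-weighted average of the observed y values."""
        w = np.exp(-((X - x_query) ** 2) / (2 * h ** 2))   # Gaussian kernel weights
        return np.sum(w * y) / np.sum(w)

    X = np.linspace(0, 6, 40)
    y = np.sin(X) + 0.1 * np.random.randn(40)
    print(nadaraya_watson(3.0, X, y))                      # close to sin(3)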
3 2008-08-04
Classification basics
  • Fold-based and leave-one-out cross-validation (combined with 1-NN in the sketch below)
  • Classification with small discrete label sets
  • Nearest-neighbor classification
  • Criteria to evaluate classifiers: accuracy, recall, precision, F1, AUC
  • Why classification loss is fundamentally different from regression
  • Bayes-optimal classification, Bayes risk, discriminants
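A small sketch combining leave-one-out cross-validation with a 1-nearest-neighbor classifier and the listed evaluation criteria (the data and class layout are invented):

    import numpy as np

    def loo_1nn(X, y):
        """Leave-one-out CV: predict each point from its nearest other point."""
        preds = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                      # exclude the held-out point itself
            preds.append(y[np.argmin(d)])
        return np.array(preds)

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)
    p = loo_1nn(X, y)
    tp = np.sum((p == 1) & (y == 1)); fp = np.sum((p == 1) & (y == 0))
    fn = np.sum((p == 0) & (y == 1))
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    print("accuracy", np.mean(p == y), "F1", 2 * prec * rec / (prec + rec))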
3 2008-08-07
Discriminants;
Perceptron;
Margin
  • Classification based on class-conditional density
  • Multivariate Gaussian (normal) distribution review, covariance
  • Discriminants between Gaussian class-conditional densities
  • Linear discriminant in the special case of equal covariance for all classes (sketch below)
  • The special case of spherical Gaussian densities
  • Discriminants are quadratic surfaces in case of diverse covariances
  • Interactive demo: play with the Gaussian densities, view the resulting discriminant surfaces
  • Loss functions: 0/1 ("true"), perceptron, hinge, square
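A sketch of the equal-covariance case: fitting Gaussian class-conditional densities with a pooled covariance yields a linear discriminant w.x + b (differing covariances would give a quadratic surface instead); equal class priors are assumed for simplicity:

    import numpy as np

    def lda_fit(X0, X1):
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        S = (np.cov(X0.T) + np.cov(X1.T)) / 2          # pooled covariance estimate
        Sinv = np.linalg.inv(S)
        w = Sinv @ (mu1 - mu0)
        b = -0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0)
        return w, b                                    # predict class 1 iff w.x + b > 0

    rng = np.random.default_rng(2)
    C = [[1, 0.3], [0.3, 1]]
    X0 = rng.multivariate_normal([0, 0], C, 100)
    X1 = rng.multivariate_normal([2, 1], C, 100)
    print(lda_fit(X0, X1))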
4 2008-08-11
Support vector machines
  • Fisher's linear discriminant
  • Perceptron vs. hinge loss and margin
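A minimal perceptron training loop (mistake-driven updates; labels are assumed to be in {-1, +1}):

    import numpy as np

    def perceptron(X, y, epochs=100):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:          # misclassified (or on the boundary)
                    w += yi * xi                # mistake-driven update
                    mistakes += 1
            if mistakes == 0:                   # converged on separable data
                break
        return w

Tightening the update condition from yi * (w @ xi) <= 0 to < 1 turns the perceptron criterion into the hinge loss, which is what demands a margin.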
4 2008-08-14
Dual SVM, SMO, Kernels
  • Lagrangian duality: Gordan's theorem, dual feasibility, complementary slackness
  • Example: linear programming
  • Turning the SVM QP into a dual QP
  • Solving the SVM QP in Scilab, demonstration of support vectors (see the sketch below)
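A sketch of the dual in Python rather than Scilab: projected-gradient ascent on the dual objective, with the bias term dropped so that only the box constraints remain (SMO instead handles the equality constraint sum(alpha*y) = 0 by updating pairs of variables); the step size and iteration count are arbitrary:

    import numpy as np

    def dual_svm(X, y, C=1.0, lr=1e-3, iters=2000):
        K = (X @ X.T) * np.outer(y, y)       # y_i y_j <x_i, x_j>
        alpha = np.zeros(len(y))
        for _ in range(iters):
            grad = 1.0 - K @ alpha           # gradient of sum(alpha) - 0.5 a'Ka
            alpha = np.clip(alpha + lr * grad, 0.0, C)   # project onto the box
        w = (alpha * y) @ X                  # recover primal weights
        return alpha, w                      # points with alpha > 0 are support vectors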
5 2008-08-18
Maximum entropy;
Logistic regression 
  • Sequential minimal optimization (SMO) algorithm
  • Support vector regression and ordinal regression
  • General feature functions φ(x,y)
  • Conditional probabilistic model for classification
  • Relationship between maxent and logistic regression
  • Loss function for LR compared to hinge, square-hinge, etc.
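A sketch relating the two views: logistic regression fit by gradient descent, whose log-loss log(1 + exp(-y w.x)) is a smooth counterpart of the SVM's hinge loss max(0, 1 - y w.x); labels in {-1, +1}, step size illustrative:

    import numpy as np

    def logreg(X, y, lr=0.1, iters=500):
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            m = y * (X @ w)                               # margins
            grad = -(X.T @ (y / (1 + np.exp(m)))) / len(y)
            w -= lr * grad                                # descend the log-loss
        return w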
5 2008-08-20, 6:30--8pm
SVM variations; transduction
  • Relationship between maxent/LR and SVM loss functions
  • Keerthi and DeCoste linear SVM with square-slack regularizer
  • Comparison of hinge and ramp loss (sketch below)
  • Convex-concave SVM formulation and optimization (CCCP)
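A tiny numeric comparison of the two losses as functions of the margin m = y f(x); the cap value 2 corresponds to one common parameterization of the ramp loss, chosen here for illustration:

    import numpy as np

    m = np.linspace(-3, 3, 7)
    hinge = np.maximum(0, 1 - m)                  # convex, grows without bound
    ramp  = np.minimum(np.maximum(0, 1 - m), 2)   # clipped hinge: bounded, non-convex
    print(hinge)
    print(ramp)

The clipping is what makes the ramp loss robust to far outliers but non-convex, hence the convex-concave (CCCP) treatment.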
6 2008-08-25
Scilab tutorial
  • Scilab tutorial slides from Dhaval Manawana
Instructor away at SIGKDD Conference
6 2008-08-28
SVM variations
  • Variants of the SVM formulation
  • Methods for SVM optimization
Guest lecture by Prof. Ganesh Ramakrishnan; instructor away at SIGKDD Conference
7 2008-09-01
Kernels and embedding
  • The kernel trick, common kernels in use
  • Dual SVM with nonlinear kernels
  • SVM dual, where to plug in the kernel
  • Polynomial, Gaussian, RBF kernels
  • Rules for creating valid kernels by combining known kernels
  • Gram matrix and why it should be positive semidefinite (checked numerically in the sketch below)
  • Eigen decomposition demo (rotate chart to see separation)
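A quick numeric check of the positive-semidefiniteness claim, using an RBF Gram matrix on random points (the kernel width of 1 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise ||xi-xj||^2
    K = np.exp(-sq / 2.0)                                        # Gaussian/RBF Gram matrix
    print("min eigenvalue:", np.linalg.eigvalsh(K).min())        # >= 0 up to round-off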
7 2008-09-04
RKHS, representer theorem
  • Embedding in infinite-dimensional vector spaces
  • Designing kernels for combinatorial objects like strings and trees
  • Function norms, eigen functions
  • Connection between min(norm(beta)) and function norm in RKHS
  • Representer theorem (Kimeldorf and Wahba); see the sketch below
  • Overflow topic: Transductive and semi-supervised SVMs, non-convex problems
  • CCCP approach to transduction
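A kernel ridge regression sketch as a concrete instance of the representer theorem: the minimizer is f(x) = sum_i alpha_i k(x_i, x), with alpha solving (K + lam I) alpha = y (the kernel, data, and lam are illustrative):

    import numpy as np

    rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

    def kernel_ridge_fit(X, y, k, lam=0.1):
        K = np.array([[k(a, b) for b in X] for a in X])
        return np.linalg.solve(K + lam * np.eye(len(X)), y)

    X = np.linspace(0, 6, 30).reshape(-1, 1)
    y = np.sin(X).ravel()
    alpha = kernel_ridge_fit(X, y, rbf)
    f = lambda x: sum(a * rbf(xi, x) for a, xi in zip(alpha, X))
    print(f(np.array([3.0])), np.sin(3.0))        # the two should be close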
8 2008-09-08 (Midterm week)
8 2008-09-11 (Midterm week)
8 2008-09-13 Midterm exam with some solutions
9 2008-09-15
Structured learning
  • Structured prediction and learning problems
  • Structured learning and prediction applications in NLP
  • Motivation from F1, AUC, ranking
  • Motivation from graphical models: HMM in StructSVM notation
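A sketch of the inference step such chain-structured models need: Viterbi decoding of the highest-scoring label sequence (the score tables here are invented):

    import numpy as np

    def viterbi(node_score, trans_score):
        """node_score: (T, K) per-position scores; trans_score: (K, K)."""
        T, K = node_score.shape
        best = node_score[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = best[:, None] + trans_score            # cand[prev, cur]
            back[t] = np.argmax(cand, axis=0)             # best predecessor per label
            best = cand[back[t], np.arange(K)] + node_score[t]
        path = [int(np.argmax(best))]
        for t in range(T - 1, 0, -1):                     # follow back-pointers
            path.append(back[t][path[-1]])
        return path[::-1]

    scores = np.log(np.array([[.7, .3], [.4, .6], [.8, .2]]))
    trans  = np.log(np.array([[.9, .1], [.1, .9]]))
    print(viterbi(scores, trans))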
9 2008-09-18
Structured learning, cont'd
10 2008-09-22
PCFG
  • Case study: Probabilistic context free grammars
  • Inference and max-margin learning
Guest lecture by Ganesh Ramakrishnan; instructor away at ISI Kolkata
10 2008-09-25
PCFG
  • More about max-margin PCFG learning
Guest lecture by Ganesh Ramakrishnan; instructor on leave
11 2008-09-29
Decomposable loss
  • Decomposable loss functions in max-margin structured learning (see Ben Taskar's PhD thesis)
  • Linear-time linear SVM and RankSVM (a stochastic-subgradient sketch follows)
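One way to see a (near-)linear-time linear SVM: Pegasos-style stochastic subgradient descent on the primal objective lam/2 ||w||^2 + mean hinge loss. This is a generic sketch, not necessarily the specific algorithm covered in the lecture:

    import numpy as np

    def pegasos(X, y, lam=0.01, epochs=10):
        w, t = np.zeros(X.shape[1]), 0
        for _ in range(epochs):
            for i in np.random.permutation(len(y)):
                t += 1
                eta = 1.0 / (lam * t)                     # standard step-size schedule
                if y[i] * (X[i] @ w) < 1:                 # hinge loss active
                    w = (1 - eta * lam) * w + eta * y[i] * X[i]
                else:
                    w = (1 - eta * lam) * w               # shrink only
        return w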
11 2008-10-02 (Mahatma Gandhi's birthday)
11 2008-10-03, 8--9pm
StructSVM applications
12 2008-10-06
Eigen, SVD, PCA
  • Eigenvalue and eigenvector recap, demo
  • SVD, with low-rank plus noise matrix
  • Connections between SVD, least squares, Ridge regression
  • Eigen-SVD connection
  • Connection between SVD and the pseudoinverse of a matrix
  • Principal component analysis (PCA)
  • Relation between SVD and PCA (sketch below)
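A numeric check of the SVD-PCA relation: on centered data, the right singular vectors of X are the eigenvectors of the sample covariance, and the squared singular values give the component variances (matrix sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.1])   # anisotropic data
    Xc = X - X.mean(axis=0)                                     # center first
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
    print("variances via SVD:", s ** 2 / (len(X) - 1))
    print("variances via eig:", evals[::-1])                    # eigh sorts ascending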
12 2008-10-09 (Dussehra)
13 2008-10-13
More dyadic models
  • Using kernels in PCA
  • Dyadic factors of boolean matrices, aspect model
  • Cross-associations and co-clustering
  • Non-negative matrix factorization (multiplicative updates sketched below)
  • Iterative scaling
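A sketch of the Lee-Seung multiplicative updates for NMF under squared error; entries stay non-negative by construction (the rank r and iteration count are illustrative):

    import numpy as np

    def nmf(V, r, iters=200, eps=1e-9):
        rng = np.random.default_rng(0)
        n, m = V.shape
        W, H = rng.random((n, r)), rng.random((r, m))
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
        return W, H                                # V is approximated by W @ H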
13 2008-10-16
Mixture and aspect models
  • Undirected graph Laplacian
  • Spectral graph partitioning, ratio cuts, transduction
  • Probabilistic model learning, expectation maximization (EM; sketched below for a Gaussian mixture)
  • Expectation maximization, variational interpretation
  • EM for factor, aspect, and dyadic models
  • Aspect model as non-negative matrix factorization
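A sketch of EM for the simplest mixture case, a two-component 1-D Gaussian mixture (initialization and iteration count are arbitrary):

    import numpy as np

    def em_gmm(x, iters=50):
        mu  = np.array([x.min(), x.max()])
        var = np.array([1.0, 1.0])
        pi  = np.array([0.5, 0.5])
        for _ in range(iters):
            # E-step: responsibilities r[i,k] proportional to pi_k N(x_i | mu_k, var_k)
            d = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            r = pi * d
            r /= r.sum(axis=1, keepdims=True)
            # M-step: weighted maximum-likelihood re-estimates
            nk  = r.sum(axis=0)
            mu  = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
            pi  = nk / len(x)
        return mu, var, pi

    rng = np.random.default_rng(6)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
    print(em_gmm(x))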
14 2008-10-20 (Instructor on leave)
14 2008-10-23
(Instructor on leave)
15 2008-10-27
(Diwali)
15 2008-10-30
EM, Mean field, Sampling distributions
  • Expectation of functions over complicated distributions: Mean field approximation
  • Sampling the posterior if Pr(h|D) is simple
  • Rejection sampling using upper bounds on density functions
  • The Metropolis MCMC algorithm for complicated Pr(h|D) (sketch below)
  • Metropolis-Hastings algorithm, proof of correctness
  • The Gibbs sampler's proposal Q(h,h')
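A sketch of the Metropolis sampler for an unnormalized log-density logp (a symmetric Gaussian proposal, so the Hastings correction vanishes; the target and step scale are illustrative):

    import numpy as np

    def metropolis(logp, h0, steps=5000, scale=0.5):
        rng = np.random.default_rng(5)
        h, out = h0, []
        for _ in range(steps):
            h_new = h + scale * rng.normal()              # symmetric proposal
            if np.log(rng.random()) < logp(h_new) - logp(h):
                h = h_new                                 # accept with prob min(1, p'/p)
            out.append(h)
        return np.array(out)

    samples = metropolis(lambda h: -0.5 * h ** 2, 0.0)    # target: standard normal
    print(samples.mean(), samples.std())                  # near 0 and 1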
16 2008-11-03
Ensemble, bagging, boosting
  • Ensemble and committee learners
  • Bagging, benefits from uncorrelated errors
  • Bagging with committee members making correlated mistakes
  • Boosting, phased exponential loss minimization (one round sketched below)
  • Bias-variance decomposition
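One boosting round in code, to make the phased exponential-loss view concrete: the weak learner's weighted error sets its vote, and example weights grow where it erred (labels and predictions in {-1, +1}):

    import numpy as np

    def adaboost_round(D, y, pred):
        """D: current example weights; returns the learner's vote and new weights."""
        err = np.sum(D * (pred != y))                 # weighted training error
        alpha = 0.5 * np.log((1 - err) / err)         # vote of this weak learner
        D = D * np.exp(-alpha * y * pred)             # upweight its mistakes
        return alpha, D / D.sum()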
16 2008-11-06
Risk bounds, PAC learning
  • Risk, Bayes risk, empirical/training risk
  • Risk and generalization bounds, intro
  • Probably approximately correct (PAC) learning
  • Risk bounds: hypothesis consistent with training set
  • Bounds on true risk minus empirical (training) risk
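One standard bound of this type, stated here for concreteness (finite hypothesis class, i.i.d. samples, bounded 0/1 loss; Hoeffding's inequality plus a union bound):

    \Pr\Big[\,\sup_{h \in \mathcal{H}} \big|R(h) - \hat{R}_n(h)\big| > \epsilon\,\Big]
      \;\le\; 2\,|\mathcal{H}|\,e^{-2 n \epsilon^2},

so with probability at least 1 - \delta, every h \in \mathcal{H} satisfies

    R(h) \;\le\; \hat{R}_n(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}}.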
16 2008-11-08, 2--3:30pm
Uniform convergence, VC theory
  • Empirical processes
  • Risk bounds for finite hypothesis classes
  • Also see the Scholkopf and Smola book
  • Symmetrization, growth function, VC dimension
17 2008-11-10
Stability bounds
  • The stability approach to proving generalization bounds
  • Algorithmic stability
  • Uniform stability for regression
  • Uniform stability for classification
? 2008-mm-dd
Rate distortion, information bottleneck
(Cancelled by popular lack of demand)
  • Rate-distortion theory and relation to cross-association
  • Information bottleneck and approximating distributions
17 2008-11-13 (Guru Nanak's birthday)
2008-11-19 Final exam with some solutions