CS705 Statistical Foundations of Machine Learning
Autumn 2008 Calendar

Week Date and Title Summary
1 2008-07-24
Course overview
  • Basic course information and introductory lecture slides
  • Learning task, instances, features, labels, reward/loss, training, testing
  • Overfitting and generalization in the context of polynomial curve-fitting (sketch below)
  • Overfitting and generalization in the context of segmenting coin toss sequences
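A minimal NumPy illustration of the polynomial overfitting story (data, degrees, and noise level are invented for the demo, not taken from the lecture):

    import numpy as np

    # Training error keeps falling as the polynomial degree grows, but
    # held-out error eventually rises: that gap is overfitting.
    rng = np.random.default_rng(7)
    x  = np.sort(rng.uniform(0, 1, 15))
    y  = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=15)
    xt = np.sort(rng.uniform(0, 1, 100))              # held-out test points
    yt = np.sin(2 * np.pi * xt) + 0.2 * rng.normal(size=100)
    for deg in (1, 3, 9):
        c = np.polyfit(x, y, deg)
        tr = np.mean((np.polyval(c, x) - y) ** 2)
        te = np.mean((np.polyval(c, xt) - yt) ** 2)
        print(f"degree {deg}: train MSE {tr:.3f}, test MSE {te:.3f}")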
2 2008-07-28
Bayesian learning; Regression
  • Bayesian learning, priors and posteriors, MAP and other quantities of interest
  • Smoothing and conjugate priors; exponential family
  • Review of Bernoulli, binomial distributions, gamma and beta functions
  • Review of multinomial distribution and Dirichlet distribution
  • Laplace smoothing with uniform prior over coin parameter
  • Linear least-squares fitting from a Bayesian perspective
  • How the Bayesian approach leads to Ridge penalty
  • Solving least-squares and Ridge regression, the "hat matrix" (sketch below)
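A minimal NumPy sketch of least-squares vs. Ridge in closed form, including the "hat matrix" that maps observations y to fitted values (synthetic data; the penalty strength lam is an arbitrary illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                     # 50 points, 3 features
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

    lam = 0.1                                        # Ridge penalty strength
    I = np.eye(X.shape[1])
    w_ls    = np.linalg.solve(X.T @ X, X.T @ y)            # least squares
    w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # Ridge / MAP with Gaussian prior

    # "Hat matrix" H: fitted values are y_hat = H y, a linear function of y
    H = X @ np.linalg.solve(X.T @ X + lam * I, X.T)
    print(np.allclose(H @ y, X @ w_ridge))           # True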
2 2008-07-31
Ridge and Lasso;
Non-linear and kernel regression
  • Alternative way to write Ridge regression: finding the parameter vector within a disk of bounded radius
  • Behavior of model weights as disk radius is changed (demo)
  • "Evidence-sharing" and geometric reasons for non-sparse solutions
  • Lasso regression: replacing L2 model penalty with L1 model penalty
  • Casting Lasso regression as a quadratic program
  • Behavior of model weights as L1 limit is changed (demo)
  • Implicit feature selection effect (sparser models) in Lasso
  • Handling non-linear regression by lifting data points to higher dimension
  • Local regression using kernels -- Nadaraya-Watson
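A sketch of Nadaraya-Watson local regression, assuming a Gaussian kernel and an illustrative bandwidth:

    import numpy as np

    def nadaraya_watson(x_query, X, y, h=0.5):
        """Prediction is a kernel-weighted average of the observed y values."""
        w = np.exp(-((X - x_query) ** 2) / (2 * h ** 2))   # Gaussian kernel weights
        return np.sum(w * y) / np.sum(w)

    X = np.linspace(0, 6, 40)
    y = np.sin(X) + 0.1 * np.random.randn(40)
    print(nadaraya_watson(3.0, X, y))                      # close to sin(3)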
3 2008-08-04
Classification basics
  • Fold-based and leave-one-out cross-validation (combined with 1-NN in the sketch below)
  • Classification with small discrete label sets
  • Nearest-neighbor classification
  • Criteria to evaluate classifiers: accuracy, recall, precision, F1, AUC
  • Why classification loss is fundamentally different from regression
  • Bayes-optimal classification, Bayes risk, discriminants
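A small sketch combining leave-one-out cross-validation with a 1-nearest-neighbor classifier and the listed evaluation criteria (the data and class layout are invented):

    import numpy as np

    def loo_1nn(X, y):
        """Leave-one-out CV: predict each point from its nearest other point."""
        preds = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                      # exclude the held-out point itself
            preds.append(y[np.argmin(d)])
        return np.array(preds)

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)
    p = loo_1nn(X, y)
    tp = np.sum((p == 1) & (y == 1)); fp = np.sum((p == 1) & (y == 0))
    fn = np.sum((p == 0) & (y == 1))
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    print("accuracy", np.mean(p == y), "F1", 2 * prec * rec / (prec + rec))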
3 2008-08-07
Discriminants;
Perceptron;
Margin
  • Classification based on class-conditional density
  • Multivariate Gaussian (normal) distribution review, covariance
  • Discriminants between Gaussian class-conditional densities
  • Linear discriminant in the special case of equal covariance for all classes (sketch below)
  • The special case of spherical Gaussian densities
  • Discriminants are quadratic surfaces in case of diverse covariances
  • Interactive demo: play with the Gaussian densities, view the resulting discriminant surfaces
  • Loss functions: 0/1 ("true"), perceptron, hinge, square
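A sketch of the equal-covariance case: fitting Gaussian class-conditional densities with a pooled covariance yields a linear discriminant w.x + b (differing covariances would give a quadratic surface instead); equal class priors are assumed for simplicity:

    import numpy as np

    def lda_fit(X0, X1):
        mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
        S = (np.cov(X0.T) + np.cov(X1.T)) / 2          # pooled covariance estimate
        Sinv = np.linalg.inv(S)
        w = Sinv @ (mu1 - mu0)
        b = -0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0)
        return w, b                                    # predict class 1 iff w.x + b > 0

    rng = np.random.default_rng(2)
    C = [[1, 0.3], [0.3, 1]]
    X0 = rng.multivariate_normal([0, 0], C, 100)
    X1 = rng.multivariate_normal([2, 1], C, 100)
    print(lda_fit(X0, X1))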
4 2008-08-11
Support vector machines
  • Fisher's linear discriminant
  • Perceptron vs. hinge loss and margin
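A minimal perceptron training loop (mistake-driven updates; labels are assumed to be in {-1, +1}):

    import numpy as np

    def perceptron(X, y, epochs=100):
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:          # misclassified (or on the boundary)
                    w += yi * xi                # mistake-driven update
                    mistakes += 1
            if mistakes == 0:                   # converged on separable data
                break
        return w

Tightening the update condition from yi * (w @ xi) <= 0 to < 1 turns the perceptron criterion into the hinge loss, which is what demands a margin.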
4 2008-08-14
Dual SVM, SMO, Kernels
  • Lagrangian duality: Gordan's theorem, dual feasibility, complementary slackness
  • Example: linear programming
  • Turning the SVM QP into a dual QP
  • Solving the SVM QP in Scilab, demonstration of support vectors (see the sketch below)
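A sketch of the dual in Python rather than Scilab: projected-gradient ascent on the dual objective, with the bias term dropped so that only the box constraints remain (SMO instead handles the equality constraint sum(alpha*y) = 0 by updating pairs of variables); the step size and iteration count are arbitrary:

    import numpy as np

    def dual_svm(X, y, C=1.0, lr=1e-3, iters=2000):
        K = (X @ X.T) * np.outer(y, y)       # y_i y_j <x_i, x_j>
        alpha = np.zeros(len(y))
        for _ in range(iters):
            grad = 1.0 - K @ alpha           # gradient of sum(alpha) - 0.5 a'Ka
            alpha = np.clip(alpha + lr * grad, 0.0, C)   # project onto the box
        w = (alpha * y) @ X                  # recover primal weights
        return alpha, w                      # points with alpha > 0 are support vectors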
5 2008-08-18
Maximum entropy;
Logistic regression 
  • Sequential minimal optimization (SMO) algorithm
  • Support vector regression and ordinal regression
  • General feature functions φ(x,y)
  • Conditional probabilistic model for classification
  • Relationship between maxent and logistic regression
  • Loss function for LR compared to hinge, square-hinge, etc.
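A sketch relating the two views: logistic regression fit by gradient descent, whose log-loss log(1 + exp(-y w.x)) is a smooth counterpart of the SVM's hinge loss max(0, 1 - y w.x); labels in {-1, +1}, step size illustrative:

    import numpy as np

    def logreg(X, y, lr=0.1, iters=500):
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            m = y * (X @ w)                               # margins
            grad = -(X.T @ (y / (1 + np.exp(m)))) / len(y)
            w -= lr * grad                                # descend the log-loss
        return w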
5 2008-08-20, 6:30--8pm
SVM variations; transduction
  • Relationship between maxent/LR and SVM loss functions
  • Keerthi and DeCoste linear SVM with square-slack regularizer
  • Comparison of hinge and ramp loss (sketch below)
  • Convex-concave SVM formulation and optimization (CCCP)
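A tiny numeric comparison of the two losses as functions of the margin m = y f(x); the cap value 2 corresponds to one common parameterization of the ramp loss, chosen here for illustration:

    import numpy as np

    m = np.linspace(-3, 3, 7)
    hinge = np.maximum(0, 1 - m)                  # convex, grows without bound
    ramp  = np.minimum(np.maximum(0, 1 - m), 2)   # clipped hinge: bounded, non-convex
    print(hinge)
    print(ramp)

The clipping is what makes the ramp loss robust to far outliers but non-convex, hence the convex-concave (CCCP) treatment.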
6 2008-08-25
Scilab tutorial
  • Scilab tutorial slides from Dhaval Manawana
Instructor away at SIGKDD Conference
6 2008-08-28
SVM variations
  • Variants of the SVM formulation
  • Methods for SVM optimization
Guest lecture by Prof. Ganesh Ramakrishnan; instructor away at SIGKDD Conference
7 2008-09-01
Kernels and embedding
  • The kernel trick, common kernels in use
  • Dual SVM with nonlinear kernels
  • SVM dual, where to plug in the kernel
  • Polynomial, Gaussian, RBF kernels
  • Rules for creating valid kernels by combining known kernels
  • Gram matrix and why it should be positive semidefinite (checked numerically in the sketch below)
  • Eigen decomposition demo (rotate chart to see separation)
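A quick numeric check of the positive-semidefiniteness claim, using an RBF Gram matrix on random points (the kernel width of 1 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise ||xi-xj||^2
    K = np.exp(-sq / 2.0)                                        # Gaussian/RBF Gram matrix
    print("min eigenvalue:", np.linalg.eigvalsh(K).min())        # >= 0 up to round-off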
7 2008-09-04
RKHS, representer theorem
  • Embedding in infinite-dimensional vector spaces
  • Designing kernels for combinatorial objects like strings and trees
  • Function norms, eigen functions
  • Connection between min(norm(beta)) and function norm in RKHS
  • Representer theorem (Kimeldorf and Wahba); see the sketch below
  • Overflow topic: Transductive and semi-supervised SVMs, non-convex problems
  • CCCP approach to transduction
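A kernel ridge regression sketch as a concrete instance of the representer theorem: the minimizer is f(x) = sum_i alpha_i k(x_i, x), with alpha solving (K + lam I) alpha = y (the kernel, data, and lam are illustrative):

    import numpy as np

    rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)

    def kernel_ridge_fit(X, y, k, lam=0.1):
        K = np.array([[k(a, b) for b in X] for a in X])
        return np.linalg.solve(K + lam * np.eye(len(X)), y)

    X = np.linspace(0, 6, 30).reshape(-1, 1)
    y = np.sin(X).ravel()
    alpha = kernel_ridge_fit(X, y, rbf)
    f = lambda x: sum(a * rbf(xi, x) for a, xi in zip(alpha, X))
    print(f(np.array([3.0])), np.sin(3.0))        # the two should be close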
8 2008-09-08 (Midterm week)
8 2008-09-11 (Midterm week)
8 2008-09-13 Midterm exam with some solutions
9 2008-09-15
Structured learning
  • Structured prediction and learning problems
  • Structured learning and prediction applications in NLP
  • Motivation from F1, AUC, ranking
  • Motivation from graphical models: HMM in StructSVM notation
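A sketch of the inference step such chain-structured models need: Viterbi decoding of the highest-scoring label sequence (the score tables here are invented):

    import numpy as np

    def viterbi(node_score, trans_score):
        """node_score: (T, K) per-position scores; trans_score: (K, K)."""
        T, K = node_score.shape
        best = node_score[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = best[:, None] + trans_score            # cand[prev, cur]
            back[t] = np.argmax(cand, axis=0)             # best predecessor per label
            best = cand[back[t], np.arange(K)] + node_score[t]
        path = [int(np.argmax(best))]
        for t in range(T - 1, 0, -1):                     # follow back-pointers
            path.append(back[t][path[-1]])
        return path[::-1]

    scores = np.log(np.array([[.7, .3], [.4, .6], [.8, .2]]))
    trans  = np.log(np.array([[.9, .1], [.1, .9]]))
    print(viterbi(scores, trans))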
9 2008-09-18
Structured learning, cont'd
10 2008-09-22
PCFG
  • Case study: Probabilistic context free grammars
  • Inference and max-margin learning
Guest lecture by Ganesh Ramakrishnan; instructor away at ISI Kolkata
10 2008-09-25
PCFG
  • More about max-margin PCFG learning
Guest lecture by Ganesh Ramakrishnan; instructor on leave
11 2008-09-29
Decomposable loss
  • Decomposable loss functions in max-margin structured learning (see Ben Taskar's PhD thesis)
  • Linear-time linear SVM and RankSVM (a stochastic-subgradient sketch follows)
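One way to see a (near-)linear-time linear SVM: Pegasos-style stochastic subgradient descent on the primal objective lam/2 ||w||^2 + mean hinge loss. This is a generic sketch, not necessarily the specific algorithm covered in the lecture:

    import numpy as np

    def pegasos(X, y, lam=0.01, epochs=10):
        w, t = np.zeros(X.shape[1]), 0
        for _ in range(epochs):
            for i in np.random.permutation(len(y)):
                t += 1
                eta = 1.0 / (lam * t)                     # standard step-size schedule
                if y[i] * (X[i] @ w) < 1:                 # hinge loss active
                    w = (1 - eta * lam) * w + eta * y[i] * X[i]
                else:
                    w = (1 - eta * lam) * w               # shrink only
        return w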
11 2008-10-02 (Mahatma Gandhi's birthday)
11 2008-10-03, 8--9pm
StructSVM applications
12 2008-10-06
Eigen, SVD, PCA
  • Eigenvalue and eigenvector recap, demo
  • SVD, with low-rank plus noise matrix
  • Connections between SVD, least squares, Ridge regression
  • Eigen-SVD connection
  • Connection between SVD and the pseudoinverse of a matrix
  • Principal component analysis (PCA)
  • Relation between SVD and PCA (sketch below)
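A numeric check of the SVD-PCA relation: on centered data, the right singular vectors of X are the eigenvectors of the sample covariance, and the squared singular values give the component variances (matrix sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.1])   # anisotropic data
    Xc = X - X.mean(axis=0)                                     # center first
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
    print("variances via SVD:", s ** 2 / (len(X) - 1))
    print("variances via eig:", evals[::-1])                    # eigh sorts ascending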
12 2008-10-09 (Dussehra)
13 2008-10-13
More dyadic models
  • Using kernels in PCA
  • Dyadic factors of boolean matrices, aspect model
  • Cross-associations and co-clustering
  • Non-negative matrix factorization (multiplicative updates sketched below)
  • Iterative scaling
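A sketch of the Lee-Seung multiplicative updates for NMF under squared error; entries stay non-negative by construction (the rank r and iteration count are illustrative):

    import numpy as np

    def nmf(V, r, iters=200, eps=1e-9):
        rng = np.random.default_rng(0)
        n, m = V.shape
        W, H = rng.random((n, r)), rng.random((r, m))
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
        return W, H                                # V is approximated by W @ H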
13 2008-10-16
Mixture and aspect models
  • Undirected graph Laplacian
  • Spectral graph partitioning, ratio cuts, transduction
  • Probabilistic model learning, expectation maximization (EM; sketched below for a Gaussian mixture)
  • Expectation maximization, variational interpretation
  • EM for factor, aspect, and dyadic models
  • Aspect model as non-negative matrix factorization
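A sketch of EM for the simplest mixture case, a two-component 1-D Gaussian mixture (initialization and iteration count are arbitrary):

    import numpy as np

    def em_gmm(x, iters=50):
        mu  = np.array([x.min(), x.max()])
        var = np.array([1.0, 1.0])
        pi  = np.array([0.5, 0.5])
        for _ in range(iters):
            # E-step: responsibilities r[i,k] proportional to pi_k N(x_i | mu_k, var_k)
            d = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            r = pi * d
            r /= r.sum(axis=1, keepdims=True)
            # M-step: weighted maximum-likelihood re-estimates
            nk  = r.sum(axis=0)
            mu  = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
            pi  = nk / len(x)
        return mu, var, pi

    rng = np.random.default_rng(6)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
    print(em_gmm(x))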
14 2008-10-20 (Instructor on leave)
14 2008-10-23
(Instructor on leave)
15 2008-10-27
(Diwali)
15 2008-10-30
EM, Mean field, Sampling distributions
  • Expectation of functions over complicated distributions: Mean field approximation
  • Sampling the posterior if Pr(h|D) is simple
  • Rejection sampling using upper bounds on density functions
  • The Metropolis MCMC algorithm for complicated Pr(h|D) (sketch below)
  • Metropolis-Hastings algorithm, proof of correctness
  • The Gibbs sampler's proposal Q(h,h')
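A sketch of the Metropolis sampler for an unnormalized log-density logp (a symmetric Gaussian proposal, so the Hastings correction vanishes; the target and step scale are illustrative):

    import numpy as np

    def metropolis(logp, h0, steps=5000, scale=0.5):
        rng = np.random.default_rng(5)
        h, out = h0, []
        for _ in range(steps):
            h_new = h + scale * rng.normal()              # symmetric proposal
            if np.log(rng.random()) < logp(h_new) - logp(h):
                h = h_new                                 # accept with prob min(1, p'/p)
            out.append(h)
        return np.array(out)

    samples = metropolis(lambda h: -0.5 * h ** 2, 0.0)    # target: standard normal
    print(samples.mean(), samples.std())                  # near 0 and 1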
16 2008-11-03
Ensemble, bagging, boosting
  • Ensemble and committee learners
  • Bagging, benefits from uncorrelated errors
  • Bagging with committee members making correlated mistakes
  • Boosting, phased exponential loss minimization (one round sketched below)
  • Bias-variance decomposition
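One boosting round in code, to make the phased exponential-loss view concrete: the weak learner's weighted error sets its vote, and example weights grow where it erred (labels and predictions in {-1, +1}):

    import numpy as np

    def adaboost_round(D, y, pred):
        """D: current example weights; returns the learner's vote and new weights."""
        err = np.sum(D * (pred != y))                 # weighted training error
        alpha = 0.5 * np.log((1 - err) / err)         # vote of this weak learner
        D = D * np.exp(-alpha * y * pred)             # upweight its mistakes
        return alpha, D / D.sum()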
16 2008-11-06
Risk bounds, PAC learning
  • Risk, Bayes risk, empirical/training risk
  • Risk and generalization bounds, intro
  • Probably approximately correct (PAC) learning
  • Risk bounds: hypothesis consistent with training set
  • Bounds on true risk minus empirical (training) risk
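One standard bound of this type, stated here for concreteness (finite hypothesis class, i.i.d. samples, bounded 0/1 loss; Hoeffding's inequality plus a union bound):

    \Pr\Big[\,\sup_{h \in \mathcal{H}} \big|R(h) - \hat{R}_n(h)\big| > \epsilon\,\Big]
      \;\le\; 2\,|\mathcal{H}|\,e^{-2 n \epsilon^2},

so with probability at least 1 - \delta, every h \in \mathcal{H} satisfies

    R(h) \;\le\; \hat{R}_n(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}}.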
16 2008-11-08, 2--3:30pm
Uniform convergence, VC theory
  • Empirical processes
  • Risk bounds for finite hypothesis classes
  • Also see the Scholkopf and Smola book
  • Symmetrization, growth function, VC dimension
17 2008-11-10
Stability bounds
  • The stability approach to proving generalization bounds
  • Algorithmic stability
  • Uniform stability for regression
  • Uniform stability for classification
? 2008-mm-dd
Rate distortion, information bottleneck
(Cancelled by popular lack of demand)
  • Rate-distortion theory and relation to cross-association
  • Information bottleneck and approximating distributions
17 2008-11-13 (Guru Nanak's birthday)
2008-11-19 Final exam with some solutions