CS 215 - Data Interpretation and Analysis

Instructors: Ajit Rajwade and Suyash Awate
Office: SIA-218, KReSIT Building
Email:

Lecture Venue: LC 001 (Lecture Hall Complex)
Lecture Timings: Slot 11, Tuesday and Friday 3:30 to 5:00 pm

Instructor Office Hours (at the Lecture Venue): Tuesday and Friday, 5:00 to 6:00 pm (i.e., immediately after class), or by appointment via email. Feel free to send queries over email or moodle as well.

Teaching Assistants: Dibyangshu Mukherjee, Yadnyesh Patil, Jay Bansal, Aman Kansal, Pranay Reddy Samala, Anurag Maurya, Deepak Singh Baghel, Ashish Aggarwal
Emails: (dbnshu, yadny, jaybansal, amankansal, pranayr, anuragcse, deepakbaghel, ashishaggarwal ) AT CSE DOT iitb DOT ac DOT in

Topics to be covered (tentative list)


Intended Audience

2nd year BTech students from CSE

Learning Materials and Textbooks

Computational Resources


Grading Policy (tentative)


Other Policies


Tutorials

Quizzes

Lecture Schedule:

Please check moodle for links to lecture videos. You will be able to download these videos via your xyz@iitb.ac.in accounts.

Each entry in the schedule below lists the content of the lecture together with the associated assignments/readings/notes.

Lecture Video 1 (parts 1 and 2)
  • Introduction, course overview and course policies
  • Slides: Course Overview (see moodle)
Lecture Video 2: Descriptive Statistics
  • Descriptive statistics: key terminology
  • Methods to represent data: frequency tables, bar/line graphs, frequency polygon, pie-chart
  • Concept of frequency and relative frequency
  • Cumulative frequency plots
  • Interesting examples of histograms of intensity values in an image
  • Data summarization: mean and median
  • "Proof" that median minimizes the sum of absolute deviations - using calculus
  • Slides: Descriptive statistics (see moodle)
  • Readings: Sections 2.1 and 2.2 from the textbook by Sheldon Ross
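  • Note: a minimal sketch of the calculus argument referenced above (notation ours, not from the course slides). For data $x_1, \dots, x_n$, let
        \[
        f(c) = \sum_{i=1}^{n} |x_i - c|, \qquad
        f'(c) = \sum_{i=1}^{n} \operatorname{sign}(c - x_i) = \#\{i : x_i < c\} - \#\{i : x_i > c\}
        \]
        (valid away from the data points, where $f$ is differentiable). Setting $f'(c) = 0$ requires as many data points below $c$ as above it, which is precisely the defining property of the median.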
Lecture Video 3
  • Properties of the mean and median
  • "Proof" that median minimizes the sum of absolute deviations - using calculus
  • Proof that median minimizes the sum of absolute deviations, without using calculus
  • Concept of quantile/percentile
  • Calculation of the mean and median in different ways from histograms or cumulative plots (see the sketch below)
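  • Note: a minimal MATLAB sketch (in the spirit of the course demos, but not the actual demo code) of estimating the mean and median from a histogram alone; the bin edges and frequencies below are hypothetical.
        % Mean and median estimated from a histogram.
        edges   = [0 10 20 30 40 50];       % hypothetical bin edges
        freq    = [4 10 18 6 2];            % hypothetical bin frequencies
        centers = (edges(1:end-1) + edges(2:end)) / 2;
        relf    = freq / sum(freq);         % relative frequencies
        mu      = sum(centers .* relf);     % mean: frequency-weighted bin centers
        % Median: linearly interpolate the cumulative relative frequency at 0.5.
        F   = [0 cumsum(relf)];             % cumulative value at each bin edge
        k   = find(F >= 0.5, 1) - 1;        % bin containing the 0.5 crossing
        med = edges(k) + (0.5 - F(k)) / (F(k+1) - F(k)) * (edges(k+1) - edges(k));
        fprintf('mean = %.2f, median = %.2f\n', mu, med);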
Lecture Video 4
  • Standard deviation and variance, some applications
  • Two-sided Chebyshev inequality with proof; one-sided Chebyshev inequality (Chebyshev-Cantelli inequality)
  • Proof of the one-sided Chebyshev inequality (both inequalities are stated below for reference)
  • Slides: Descriptive statistics (see moodle)
  • Readings: Sections 2.1, 2.2, 2.3, 2.4, 2.6 from the textbook by Sheldon Ross
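  • Note: for reference, the two inequalities covered above, stated here in random-variable form for mean $\mu$, variance $\sigma^2$ and $k > 0$ (the data versions proved in class are analogous):
        \[
        P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2} \quad \text{(two-sided)}, \qquad
        P(X - \mu \ge k\sigma) \le \frac{1}{1 + k^2} \quad \text{(one-sided, Chebyshev-Cantelli)}.
        \]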
Lecture Video 5
  • Concept of the correlation coefficient and its formula; proof that its value lies between -1 and +1 (see the numerical check below)
  • Correlation coefficient: properties; uncentered correlation coefficient; limitations of correlation coefficient and Anscombe's quartet
  • Correlation and causation
  • Slides: Descriptive statistics (check moodle)
  • Readings: Sections 2.1, 2.2, 2.3, 2.4, 2.6 from the textbook by Sheldon Ross
  • The correlation versus causation debate: Link 1, Link 2, Link 3.
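  • Note: a minimal MATLAB check (not the course demo code) of the correlation-coefficient formula against the built-in corrcoef; the data below are arbitrary.
        % Correlation coefficient from its definition, vs. the built-in corrcoef.
        x  = randn(100, 1);
        y  = 2*x + 0.5*randn(100, 1);   % roughly linear relation, so r is near 1
        xc = x - mean(x);  yc = y - mean(y);
        r  = sum(xc .* yc) / sqrt(sum(xc.^2) * sum(yc.^2));
        R  = corrcoef(x, y);            % 2x2 matrix; R(1,2) should match r
        fprintf('r = %.4f, corrcoef = %.4f\n', r, R(1,2));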
Lecture Videos 6-8
  • MATLAB Demo Codes (check moodle): vector and matrix operations, very basic image input/output, basic statistical operations, plots of various types (scatterplot, plot, boxplot, surf, surfc)
Lecture Videos 9-11: Discrete Probability
  • Discrete probability: sample space, event, composition of events: union, intersection, complement, exclusive or, De Morgan's laws
  • Boole's and Bonferroni's inequalities
  • Conditional probability, Bayes rule, the False Positive Paradox (see the worked example below)
  • Independent and mutually exclusive events
  • Birthday paradox
  • Slides: Discrete Probability (check moodle)
  • Readings: Chapter 3 from the textbook by Sheldon Ross
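  • Note: a small worked example of the False Positive Paradox via Bayes rule; the prevalence, sensitivity and specificity below are hypothetical numbers chosen purely for illustration.
        % False Positive Paradox: a rare condition plus a good-but-imperfect test.
        prior = 0.001;                  % P(disease), hypothetical prevalence
        sens  = 0.99;                   % P(positive | disease)
        spec  = 0.95;                   % P(negative | no disease)
        pPos  = sens*prior + (1 - spec)*(1 - prior);  % total probability of a positive
        post  = sens*prior / pPos;      % Bayes rule: P(disease | positive)
        fprintf('P(disease | positive) = %.3f\n', post);  % about 0.019, despite a 99% sensitive test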
Lecture 12: Random Variables
  • Random variable: concept, discrete and continuous random variables
  • Probability mass function (pmf), cumulative distribution function (cdf) and probability density function (pdf)
  • Expected value for discrete and continuous random variables
Lecture 13
  • Law of the Unconscious Statistician (LOTUS): Expected value of a function of a random variable
  • Linearity of expectation
  • The mean and the median as minimizers of squared and absolute losses respectively (with proofs for both)
  • Variance and standard deviation, with alternate expressions
  • Properties of variance
  • Markov's inequality and its proof (sketched below)
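  • Note: for reference, a one-line version of the Markov-inequality proof above, for a nonnegative random variable $X$ and any $a > 0$:
        \[
        E[X] \ge E\big[X \cdot \mathbf{1}\{X \ge a\}\big] \ge a \, P(X \ge a)
        \quad \Longrightarrow \quad
        P(X \ge a) \le \frac{E[X]}{a}.
        \]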
Lecture 14
  • Proof of Chebyshev's inequality (two-sided) using Markov's inequality
  • Weak law of large numbers and its proof using Chebyshev's inequality (see the sketch below)
  • Statement of strong law of large numbers
  • Concept of joint PMF, PDF, CDF
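  • Note: the core step of the WLLN proof above, for i.i.d. $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, writing $\bar{X}_n = \frac{1}{n}\sum_{i} X_i$:
        \[
        \operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}
        \quad \Longrightarrow \quad
        P\big(|\bar{X}_n - \mu| \ge \epsilon\big) \le \frac{\sigma^2}{n\epsilon^2} \to 0
        \quad \text{as } n \to \infty, \text{ for every } \epsilon > 0.
        \]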
Lecture 15
  • Concept of conditional CDF and PDF, with verification/understanding of the stated formulas
  • Concept of covariance; concepts of mutual independence and pairwise independence
  • Properties of covariance; correlation versus independence
Lecture 16
  • Covariance: properties, correlation versus independence
  • Concept of the moment generating function; two different proofs of the uniqueness of the MGF for discrete random variables; properties of moment generating functions (the definition and moment property are recalled below)
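  • Note: for reference, the MGF definition used in the lectures that follow, and the moment property that gives it its name (valid when $M_X$ exists in a neighbourhood of $t = 0$):
        \[
        M_X(t) = E\big[e^{tX}\big], \qquad M_X^{(k)}(0) = E\big[X^k\big], \quad k = 1, 2, \dots
        \]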
Lecture 17: Families of Random Variables
  • Concept of families of random variables
  • Bernoulli PMF: mean, median, mode, variance, MGF
  • Binomial PMF: relation to the Bernoulli PMF, mean, median, mode, variance, plots, MGF, difference between the binomial and geometric distributions (see the simulation sketch below)
  • Slides: Families of Random variables (check moodle)
  • Readings: Section 5.1 from the textbook by Sheldon Ross
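  • Note: a minimal MATLAB simulation (not the course demo code) of the Binomial-as-sum-of-Bernoullis relation; n, p and the sample size are arbitrary choices.
        % A Binomial(n, p) draw is the sum of n independent Bernoulli(p) draws.
        n = 20;  p = 0.3;  N = 1e5;
        X = sum(rand(n, N) < p, 1);     % each column: n Bernoulli trials, summed
        fprintf('empirical mean %.3f vs np      = %.3f\n', mean(X), n*p);
        fprintf('empirical var  %.3f vs np(1-p) = %.3f\n', var(X), n*p*(1-p));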
Lecture 18
  • Multinomial PMF - generalization of the binomial, mean vector and covariance matrix for a multinomial random variable, MGF for multinomial
Lecture 19
  • Hypergeometric distribution: genesis, mean, variance
  • Applications of the hypergeometric distribution in counting of animals via the capture-recapture method
Lecture 20
  • Gaussian distribution: Derivation of mean, variance, MGF, median, mode
  • CDF of a Gaussian and its relation to the error function; probability that a Gaussian random variable takes values within mu +/- k sigma (see the formulas below)
  • Gaussian (normal) PDF: motivation from the central limit theorem
  • Illustration of central limit theorem, statement of central limit theorem
  • Slides: Families of Random variables (check moodle)
  • Readings: Sections 5.1, 5.2, 5.5, 6.1, 6.2 from the textbook by Sheldon Ross
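  • Note: for reference, the erf relation and the mu +/- k sigma probabilities covered above, for $X \sim N(\mu, \sigma^2)$:
        \[
        F_X(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right], \qquad
        P\big(|X - \mu| \le k\sigma\big) = \operatorname{erf}\!\left(\frac{k}{\sqrt{2}}\right),
        \]
        which evaluates to approximately 0.683, 0.954 and 0.997 for $k = 1, 2, 3$ (the 68-95-99.7 rule).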
Lecture 21
  • Statement of central limit theorem and its extensions; proof of CLT using MGF
  • Relation between CLT and the law of large numbers
  • Slides: Families of Random variables (check moodle)
  • Readings: Sections 5.1, 5.2, 5.5, 6.1, 6.2 from the textbook by Sheldon Ross
Lecture 22
  • Illustration of the central limit theorem: coin toss example (see the simulation sketch below)
  • Gaussian tail bounds
  • Distribution of the sample mean and the sample variance; Bessel's correction
  • Chi-squared distribution - definition, genesis, MGF
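  • Note: a minimal MATLAB sketch (not the course demo code) of the coin-toss CLT illustration above: standardized sums of fair coin tosses, whose histogram approaches the standard normal PDF.
        % CLT illustration: standardized sums of n fair coin tosses.
        n = 100;  N = 1e5;                      % tosses per sum, number of sums
        S = sum(rand(n, N) < 0.5, 1);           % S ~ Binomial(n, 1/2)
        Z = (S - n/2) / sqrt(n/4);              % standardize: mean n/2, variance n/4
        histogram(Z, 'Normalization', 'pdf');   % empirical density of Z
        hold on;
        z = linspace(-4, 4, 200);
        plot(z, exp(-z.^2/2) / sqrt(2*pi), 'LineWidth', 2);  % standard normal PDF
        hold off;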
Lecture 23
  • Chi-squared distribution - definition, genesis, MGF, properties; use of the chi-squared distribution in deriving the PDF of the sample variance
  • Uniform distribution: mean, variance, median, MGF; application in sampling from a pre-specified PMF (see the sketch below); application in generating a random permutation of a given set
  • Slides: Families of Random variables (check moodle)
  • Readings: Sections 5.1, 5.2, 5.5, 6.1, 6.2, 6.3, 6.4 from the textbook by Sheldon Ross
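  • Note: a minimal MATLAB sketch (not the course demo code) of sampling from a pre-specified PMF using Uniform(0,1) draws; the PMF below is arbitrary.
        % Inverse-CDF sampling: pick the first index whose cumulative
        % probability reaches the uniform draw.
        pmf = [0.2 0.5 0.1 0.2];        % arbitrary PMF over values 1..4
        cdf = cumsum(pmf);
        N   = 1e5;
        u   = rand(1, N);
        X   = arrayfun(@(v) find(v <= cdf, 1), u);
        histcounts(X, 0.5:1:4.5) / N    % empirical frequencies; compare to pmf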
Lecture 24
  • Poisson distribution: mean, variance, MGF, mode, addition of Poisson random variables, examples; derivation of Poisson from binomial
  • Relation between Poisson and Gaussian distributions, examples
  • Slides: Families of Random variables (check moodle)
  • Readings: Sections 5.2 and 5.6 from the textbook by Sheldon Ross
Lecture 25
  • Exponential distribution: mean, median, MGF, variance, the memorylessness property (derived below), minimum of exponential random variables
  • Slides: Families of Random variables (check moodle)
  • Readings: Section 5.6 from the textbook by Sheldon Ross
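  • Note: the memorylessness property above in one line, for $X \sim \text{Exp}(\lambda)$, using $P(X > t) = e^{-\lambda t}$ and $s, t \ge 0$:
        \[
        P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t).
        \]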
Lecture 26: Parameter Estimation
  • Concept of parameter estimation (or parametric PDF/PMF estimation)
  • Maximum likelihood estimation (MLE)
  • MLE for the parameters of the Bernoulli, Poisson, Gaussian and uniform distributions (a sample derivation is sketched below)
  • Least squares line fitting as an MLE problem
  • Slides and derivations: Parameter Estimation (check moodle)
  • Readings: Section 5.6 from the textbook by Sheldon Ross
  • Readings: Sections 7.1, 7.2, 7.5, 7.7, 9.2 (for least squares line fitting) of the textbook by Sheldon Ross
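  • Note: as a reference for the flavour of these derivations, the Bernoulli case: for i.i.d. observations $x_1, \dots, x_n \in \{0,1\}$ with $s = \sum_i x_i$,
        \[
        \log L(p) = s \log p + (n - s) \log(1 - p), \qquad
        \frac{d}{dp} \log L(p) = \frac{s}{p} - \frac{n - s}{1 - p} = 0
        \;\Longrightarrow\; \hat{p} = \frac{s}{n} = \bar{x}.
        \]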
Lecture 27
  • MLE for parameters of uniform distributions
  • Least squares line fitting as an MLE problem
  • Slides and derivations: Parameter Estimation (check moodle)
  • Readings: Section 5.6 from the textbook by Sheldon Ross
  • Readings: Sections 7.1, 7.2, 7.5, 7.7, 9.2 (for least squares line fitting) of the textbook by Sheldon Ross
Lecture 28
  • Concept of estimator bias, mean squared error, variance
  • Estimators for interval of uniform distribution: example of bias
  • Concept of two-sided confidence interval and one-sided confidence interval
  • Confidence interval for the mean of a Gaussian with known standard deviation (the standard interval is recalled below)
  • Confidence interval for variance of a Gaussian
  • Slides and derivations: Parameter Estimation (check moodle)
  • Readings: Sections 7.1, 7.2, 7.5, 7.7, 9.2 (for least squares line fitting) of the textbook by Sheldon Ross
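  • Note: for reference, the known-sigma interval from this lecture: for i.i.d. $N(\mu, \sigma^2)$ samples with $\sigma$ known,
        \[
        \bar{X} \pm z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}}
        \]
        is a $100(1 - \alpha)\%$ two-sided confidence interval for $\mu$, where $z_{\alpha/2}$ is the upper $\alpha/2$ point of the standard normal (e.g. $z_{0.025} \approx 1.96$ for a 95% interval).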
Lecture 29
  • Concept of two-sided confidence interval and one-sided confidence interval
  • Confidence interval for mean of a Gaussian with known standard deviation
  • Confidence interval for variance of a Gaussian
  • Slides and derivations: Parameter Estimation (check moodle)
  • Readings: Sections 7.1, 7.2, 7.5, 7.7, 9.2 (for least squares line fitting) of the textbook by Sheldon Ross
Lecture 30
  • Concept of nonparametric density estimation
  • Concept of histogram as a probability density estimator
  • Derivation of the bias, variance and MSE of the histogram estimator for a smooth density (with bounded first derivative) that is non-zero on a finite-sized interval; derivation of the optimal number of bins (equivalently, the optimal binwidth) and the resulting optimal MSE of O(n^{-2/3}) (the scaling argument is summarized below)
  • Readings: Section 6.1 only of the lecture notes by Prof. Yen-Chi Chen (Univ. of Washington, Seattle); a local copy of the pdf is here
  • Readings: Sections 7.1, 7.2, 7.5, 7.7, 9.2 (for least squares line fitting) of the textbook by Sheldon Ross
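  • Note: a compressed version of the scaling argument above: with binwidth $h$ and $n$ samples, for a density with bounded first derivative,
        \[
        \text{bias} = O(h), \qquad \text{variance} = O\!\left(\frac{1}{nh}\right), \qquad
        \text{MSE} = O(h^2) + O\!\left(\frac{1}{nh}\right),
        \]
        minimized by balancing the two terms: $h^* \propto n^{-1/3}$, giving the optimal MSE $O(n^{-2/3})$.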