CS 215 - Data Interpretation and Analysis

Instructors: Ajit Rajwade and Suyash Awate
Office: SIA-218, KReSIT Building
Email:

Lecture Venue: LC 001 (Lecture Hall Complex)
Lecture Timings: Slot 11, Tuesday and Friday 3:30 to 5:00 pm

Instructor Office Hours (at the Lecture Venue): Tuesday and Friday, 5:00 to 5:30 pm, i.e., immediately after class, or by appointment via email (feel free to also send queries over email or Moodle)

Teaching Assistants: R Sudarsanan, Sharvik Mittal, Sriram Balasubramanian, Rupesh, Sahil Pandey, Shreya Mahabala Alva
Emails: (sudarsanan, sharky, sriramb, rupesh, sahilpandey, shreya) AT CSE DOT iitb DOT ac DOT in

Topics to be covered (tentative list)


Intended Audience

2nd year BTech students from CSE

Learning Materials and Textbooks

Computational Resources


Grading Policy (tentative)


Other Policies


Tutorials

Quizzes

Lecture Schedule:


Date | Content of the Lecture | Assignments/Readings/Notes

30/07 (Tue)
  • Introduction, course overview and course policies
Descriptive Statistics
  • Descriptive statistics: key terminology
  • Methods to represent data: frequency tables, bar/line graphs, frequency polygon, pie-chart
  • Concept of frequency and relative frequency
  • Cumulative frequency plots
  • Interesting examples of histograms of intensity values in an image
  • Data summarization: mean and median
  • "Proof" that median minimizes the sum of absolute deviations - using calculus
02/08 (Fri)
  • Proofs that median minimizes the sum of absolute deviations without using calculus
  • Concept of quantile/percentile
  • Standard deviation and variance, some applications
  • Two-sided Chebyshev inequality with proof; One-sided Chebyshev inequality (Chebyshev-Cantelli inequality)
  • Concept of correlation coefficient and formula for it; proof that its value lies between -1 and +1 (see the sketch after this entry)
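
A minimal sketch of the correlation coefficient bullet above, on synthetic data (the coefficients 0.8 and 0.6 are arbitrary choices); the computed value always lies between -1 and +1.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    y = 0.8 * x + 0.6 * rng.normal(size=1000)   # linearly related plus noise

    # Sample correlation: covariance divided by the product of standard deviations.
    r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
    print("r =", r)                               # lies between -1 and +1
    print("matches np.corrcoef:", np.isclose(r, np.corrcoef(x, y)[0, 1]))
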
03/08 (Sat)
  • Correlation coefficient: properties; uncentered correlation coefficient; limitations of correlation coefficient and Anscombe's quartet
  • Correlation and causation
  • Proof of one-sided Chebyshev's inequality

06/08 (Tue) Discrete Probability
  • Discrete probability: sample space, event, composition of events: union, intersection, complement, exclusive or, De Morgan's laws
  • Boole's and Bonferroni's inequalities
  • Conditional probability, Bayes rule, False Positive Paradox (a worked example follows this entry)
  • Birthday paradox
  • Independent and mutually exclusive events
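
The False Positive Paradox mentioned above can be made concrete with Bayes' rule; the prevalence and test accuracies below are assumed numbers chosen only for illustration.

    # Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
    p_disease = 0.001          # assumed prevalence: 1 in 1000
    sensitivity = 0.99         # P(positive | disease)
    false_pos_rate = 0.05      # P(positive | no disease)

    p_positive = sensitivity * p_disease + false_pos_rate * (1 - p_disease)
    p_disease_given_pos = sensitivity * p_disease / p_positive
    print(p_disease_given_pos)  # ~0.019: even after a positive test, the disease is unlikely
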
09/08 (Fri) Random variables
  • Random variable: concept, discrete and continuous random variables
  • Probability mass function (pmf), cumulative distribution function (cdf) and probability density function (pdf)
  • Expected value for discrete and continuous random variables
  • Law of the Unconscious Statistician (LOTUS): Expected value of a function of a random variable
  • The mean and the median as minimizers of squared and absolute losses respectively (with proof for the former)
  • Variance and standard deviation, with alternate expressions (see the sketch after this entry)
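
A minimal sketch of LOTUS and the alternate expression for the variance, using a small made-up PMF.

    import numpy as np

    # Made-up PMF of a discrete random variable X.
    values = np.array([0, 1, 2, 3])
    probs  = np.array([0.1, 0.4, 0.3, 0.2])

    # E[X], and E[X^2] via LOTUS: sum over x of x^2 p(x).
    EX  = np.sum(values * probs)
    EX2 = np.sum(values**2 * probs)

    # Variance two ways: E[(X - E[X])^2] and E[X^2] - (E[X])^2.
    var_def = np.sum((values - EX)**2 * probs)
    var_alt = EX2 - EX**2
    print(var_def, var_alt)     # the two expressions agree
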
13/08 (Tue)
  • Concept of joint PMF, PDF, CDF
  • Concept of covariance, concept of mutual independence and pairwise independence
  • Properties of covariance; correlation versus independence (see the example after this entry)
  • Concept of moment generating function, two different proofs of uniqueness of moment generating function for discrete random variables, properties of moment generating functions
  • PDF/PMF of sum of random variables
  • Proof that median minimizes total absolute loss
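
For the covariance bullet above, the standard textbook-style example below (X uniform on {-1, 0, 1}, Y = X^2) shows that zero correlation does not imply independence.

    import numpy as np

    # X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X,
    # so X and Y are clearly dependent.
    xs = np.array([-1, 0, 1])
    px = np.array([1/3, 1/3, 1/3])
    ys = xs**2

    EX  = np.sum(xs * px)          # = 0
    EY  = np.sum(ys * px)          # = 2/3
    EXY = np.sum(xs * ys * px)     # = E[X^3] = 0 by symmetry

    print("Cov(X, Y) =", EXY - EX * EY)   # 0: uncorrelated, yet Y is determined by X
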
16/08 (Fri)
  • Conditional CDF, PMF, PDF; verification of definition

Families of random variables
  • Concept of families of random variables
  • Bernoulli PMF: mean, median, mode, variance, MGF
  • Binomial PMF: relation to Bernoulli PMF, mean, median, mode, variance, plots, MGF, difference between binomial and geometric distribution
  • Gaussian (normal) PDF: motivation from the central limit theorem
  • Illustration of the central limit theorem, statement of the central limit theorem (a simulation sketch follows this entry)
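
A minimal simulation of the CLT illustration mentioned above; the underlying Uniform(0,1) distribution and the sample sizes are assumed choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 50, 20000

    # Means of n i.i.d. Uniform(0,1) samples, repeated many times.
    means = rng.random((trials, n)).mean(axis=1)

    # Standardize: Uniform(0,1) has mean 1/2 and variance 1/12.
    z = (means - 0.5) / np.sqrt(1 / 12 / n)

    # The standardized means should be close to N(0,1): check a few summaries.
    print("mean ~ 0:", z.mean(), " variance ~ 1:", z.var())
    print("P(|Z| <= 1.96) ~ 0.95:", np.mean(np.abs(z) <= 1.96))
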
17/08 (Sat)
  • Derivation of the mean, variance, MGF, median and mode of the Gaussian
  • CDF of a Gaussian and its relation to the error function; probability that a Gaussian random variable takes values within mu +/- k sigma (see the sketch after this entry)
  • Statement of the central limit theorem and its extensions; proof of the CLT using MGFs
  • Application of the CLT to the binomial distribution: the de Moivre-Laplace theorem (without proof)
  • An application of the CLT; relation between the CLT and the law of large numbers
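
The relation between the Gaussian CDF and the error function, and the mu +/- k sigma probabilities mentioned above, can be evaluated directly; a minimal sketch.

    import math

    def gaussian_cdf(x, mu=0.0, sigma=1.0):
        # Phi(x) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    # P(mu - k*sigma <= X <= mu + k*sigma) for k = 1, 2, 3 (standard Gaussian).
    for k in (1, 2, 3):
        p = gaussian_cdf(k) - gaussian_cdf(-k)
        print("k =", k, ":", round(p, 4))   # ~0.6827, 0.9545, 0.9973
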
20/08 (Tue)
  • Gaussian tail bounds
  • Distribution of the sample mean and the sample variance; Bessel's correction (a simulation sketch follows this entry)
  • Chi-squared distribution: definition, genesis, MGF, properties; use of the chi-squared distribution to define the PDF of the sample variance
  • Uniform distribution: mean, variance, median, MGF; applications in sampling from a pre-specified PMF; application in generating a random permutation of a given set
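
A minimal simulation of Bessel's correction mentioned above; the Gaussian data, the true variance of 4 and the sample size are assumed choices.

    import numpy as np

    rng = np.random.default_rng(0)
    true_var = 4.0
    n, trials = 10, 100000

    samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

    # Sample variance with divisor n (biased) and divisor n-1 (Bessel-corrected).
    var_biased    = samples.var(axis=1, ddof=0).mean()
    var_corrected = samples.var(axis=1, ddof=1).mean()
    print("average with divisor n   :", var_biased)      # ~ (n-1)/n * 4 = 3.6
    print("average with divisor n-1 :", var_corrected)   # ~ 4.0
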
23/08 (Fri)
  • Poisson distribution: mean, variance, MGF, mode, addition of Poisson random variables, examples; derivation of Poisson from binomial (a numerical check follows this entry)
  • Relation between Poisson and Gaussian distributions, examples
  • Multinomial PMF - generalization of the binomial, mean vector and covariance matrix for a multinomial random variable, MGF for multinomial
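
The derivation of the Poisson from the binomial (first bullet above) can be sanity-checked numerically; the values lambda = 3 and n = 1000 are assumed for illustration.

    import math

    lam = 3.0                       # fixed mean
    n, p = 1000, lam / 1000         # binomial with large n, small p, and n*p = lambda

    def binom_pmf(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    def poisson_pmf(k, lam):
        return math.exp(-lam) * lam**k / math.factorial(k)

    for k in range(6):
        print(k, round(binom_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))
    # The two columns nearly coincide, as the binomial-to-Poisson limit suggests.
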
24/08 (Sat)
  • Exponential distribution: mean, median, MGF, variance, property of memorylessness, minimum of exponential random variables

Parameter Estimation
  • Concept of parameter estimation (or parametric PDF/PMF estimation)
  • Maximum likelihood estimation (MLE)
  • MLE for the parameters of Bernoulli, Poisson, Gaussian and uniform distributions (see the sketch after this entry)
  • Least squares line fitting as an MLE problem
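
A minimal sketch of Gaussian MLE from the bullet above, using the closed-form estimates (sample mean and divisor-n sample variance); the data are synthetic with assumed true parameters mu = 5 and sigma = 2.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=10000)

    # Closed-form MLE for a Gaussian: sample mean and (divisor-n) sample variance.
    mu_hat = data.mean()
    sigma2_hat = np.mean((data - mu_hat)**2)
    print("mu_hat ~ 5:", mu_hat, " sigma_hat ~ 2:", np.sqrt(sigma2_hat))
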
27/08 (Tue)
  • Least squares line fitting as an MLE problem (continued)
  • Concept of estimator bias, mean squared error, variance
  • Estimators for interval of uniform distribution: example of bias
  • Concept of two-sided confidence interval and one-sided confidence interval
  • Confidence interval for mean of a Gaussian with known standard deviation (a coverage check follows this entry)
  • Confidence interval for variance of a Gaussian
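
A minimal simulation checking the coverage of the two-sided 95% confidence interval for a Gaussian mean with known standard deviation (all parameter values below are assumed for illustration).

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 10.0, 2.0, 25, 20000
    z = 1.96                                     # 97.5th percentile of the standard normal

    covered = 0
    for _ in range(trials):
        x = rng.normal(mu, sigma, size=n)
        half = z * sigma / np.sqrt(n)            # half-width when sigma is known
        covered += (x.mean() - half <= mu <= x.mean() + half)
    print("empirical coverage ~ 0.95:", covered / trials)
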
30/08 (Fri)
  • Concept of nonparametric density estimation
  • Concept of histogram as a probability density estimator
  • Bias, variance and MSE for a histogram estimator for a smooth density (with bounded first derivative) which is non-zero on a finite-sized interval; derivation of the optimal number of bins (equivalently, the optimal binwidth) and optimal MSE O(n^{-2/3}) (see the sketch after this entry)
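
A minimal sketch of a histogram density estimator whose binwidth scales as n^{-1/3}, in line with the rate quoted above; the Beta(2,5) test density and the constant in the binwidth are assumed choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    data = rng.beta(2, 5, size=n)     # assumed smooth density supported on [0, 1]

    h = n ** (-1 / 3)                 # binwidth ~ n^{-1/3} (constant taken as 1)
    edges = np.arange(0.0, 1.0 + h, h)
    counts, edges = np.histogram(data, bins=edges)

    # Histogram density estimate: relative frequency divided by the binwidth.
    density = counts / (n * h)
    print("integrates to ~1:", np.sum(density * h))
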
13/09 (Fri)
  • Hypergeometric distribution: genesis, mean, variance
  • Applications of the hypergeometric distribution in counting of animals via the capture-recapture method
  • Concept of kernel density estimator
  • Bias, variance and MSE for a kernel density estimator for a smooth density (with bounded second derivative); derivation of the optimal bandwidth and optimal MSE O(n^{-4/5}) (see the sketch after this entry)
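
A minimal from-scratch Gaussian kernel density estimator with a bandwidth scaling as n^{-1/5}, matching the rate quoted above; the rule-of-thumb constant and the N(0,1) test data are assumed choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    data = rng.normal(size=n)                    # assumed test data

    h = 1.06 * data.std() * n ** (-1 / 5)        # rule-of-thumb bandwidth ~ n^{-1/5}

    def kde(x, data, h):
        # Gaussian kernel density estimate evaluated at the points in x.
        u = (x[:, None] - data[None, :]) / h
        k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        return k.mean(axis=1) / h

    grid = np.linspace(-4, 4, 9)
    print(np.round(kde(grid, data, h), 3))       # should track the N(0,1) density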