CS 215 - Data Interpretation and Analysis

Instructors: Ajit Rajwade and Pushpak Bhattacharya
Office: SIA-218, KReSIT Building
Email:

Instructor Office Hours (at the Lecture Venue): immediately after class, or by appointment via email (feel free to send queries over email or Moodle as well)

Teaching Assistants:
  • Piyush Sawarkar: piyushs AT cse DOT iitb.ac.in
  • Bhumika Khetan: bhumikakhetan AT cse DOT iitb.ac.in
  • Sankara Sri Raghava Ravindra Muddu: sriraghava AT cse DOT iitb.ac.in
  • Nagakalyani Goda: nagakalyani AT cse DOT iitb.ac.in
  • Dhananjay Kejriwal: dhankejriwal AT cse DOT iitb.ac.in
  • Vaibhav Raj: vaibhavraj AT cse DOT iitb.ac.in
  • Arpon Basu: abasu AT cse DOT iitb.ac.in
  • Yash Kumthekar: 23m0797 AT iitb DOT ac DOT in
  • Shubhajeet Dey: shubhajeet AT cse DOT iitb.ac.in
  • Drashthi Doshi: drashthi AT cse DOT iitb.ac.in

Topics to be covered (tentative list)


Intended Audience

2nd year BTech students from CSE

Learning Materials and Textbooks

Computational Resources


Grading Policy (tentative)


Other Policies


Tutorials

Quizzes

Lecture Schedule:


Date | Content of the Lecture | Assignments/Readings/Notes

31/7
  • Introduction, course overview and course policies

Descriptive Statistics
  • Terminology: population, sample, discrete and continuous valued attributes
  • Frequency tables, frequency polygons, line diagrams, pie charts, relative frequency tables
  • Histograms with examples for image intensity histograms, image gradient histograms
  • Histogram binning problem
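A minimal sketch of the histogram binning problem, assuming NumPy; the sample data, seed, and the use of Sturges' rule are illustrative choices, not part of the lecture:

```python
import numpy as np

# Hypothetical sample: 1000 draws from a standard normal (seeded for reproducibility).
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# The same data summarized with coarse and fine binning.
counts_coarse, edges_coarse = np.histogram(data, bins=5)
counts_fine, edges_fine = np.histogram(data, bins=50)

# Every histogram partitions the data: counts always sum to the sample size.
assert counts_coarse.sum() == len(data)
assert counts_fine.sum() == len(data)

# One common rule of thumb for the binning problem: Sturges' rule, ~ 1 + log2(n) bins.
sturges_bins = int(np.ceil(1 + np.log2(len(data))))
print(sturges_bins)  # 11 for n = 1000
```

Too few bins hide structure; too many bins make the histogram noisy, which is exactly the trade-off the binning problem formalizes.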
5/8
  • Data summarization: mean and median
  • "Proof" that median minimizes the sum of absolute deviations - using calculus
  • Proof that median minimizes the sum of absolute deviations, without using calculus
  • Concept of quantile/percentile
  • Calculation of mean and median in different ways from histogram or cumulative plots
  • Standard deviation and variance, some applications
  • Two-sided Chebyshev inequality with proof; One-sided Chebyshev inequality (Chebyshev-Cantelli inequality)
  • Concept of scatter plots
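A numerical check of the result proved above, that the median minimizes the sum of absolute deviations; a sketch assuming NumPy, with a small illustrative sample:

```python
import numpy as np

# Hypothetical small sample with an outlier.
x = np.array([1.0, 2.0, 3.0, 10.0, 50.0])
med = np.median(x)    # 3.0
mean = np.mean(x)     # 13.2

def sum_abs_dev(c, x):
    """Sum of absolute deviations of x about the point c."""
    return np.abs(x - c).sum()

# Scan candidate centers on a fine grid; the minimizer should sit at the median.
grid = np.linspace(x.min(), x.max(), 4901)
best = grid[np.argmin([sum_abs_dev(c, x) for c in grid])]

assert sum_abs_dev(med, x) <= sum_abs_dev(mean, x)
assert abs(best - med) < 0.02
```

Note that the mean, by contrast, minimizes the sum of *squared* deviations, which is why it is pulled toward the outlier here while the median is not.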
7/8
  • Concept of correlation coefficient and formula for it; proof that its value lies between -1 and +1
  • Correlation coefficient: properties; uncentered correlation coefficient; limitations of correlation coefficient and Anscombe's quartet
  • Correlation and causation
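A sketch of the correlation coefficient computed from its defining formula, assuming NumPy; the linear model and noise level are illustrative:

```python
import numpy as np

# Hypothetical data: y depends linearly on x plus Gaussian noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 200)

# Pearson correlation coefficient from the formula.
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean())**2)) * np.sqrt(np.sum((y - y.mean())**2)))

assert -1.0 <= r <= 1.0    # the bound proved in lecture
assert r > 0.95            # strong positive linear relationship
assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # matches NumPy's built-in
```

Anscombe's quartet is the standard warning here: four datasets with nearly identical r that look completely different when plotted, so r alone is never a substitute for a scatter plot.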
10/8 Discrete Probability
  • Discrete probability: sample space, event, composition of events: union, intersection, complement, exclusive or, De Morgan's laws
  • Boole's and Bonferroni's inequalities
  • Conditional probability, Bayes rule, False Positive Paradox
  • Independent and mutually exclusive events
  • Birthday paradox
14/8 Random Variables
  • Random variable: concept, discrete and continuous random variables
  • Probability mass function (pmf), cumulative distribution function (cdf) and probability density function (pdf)
  • Expected value for discrete and continuous random variables; Law of the Unconscious Statistician
  • Standard deviation, Markov's inequality, Chebyshev's inequality; proofs of these inequalities
  • Concept of covariance and its properties
17/8
  • Proof of the law of the unconscious statistician
  • Weak law of large numbers and its proof using Chebyshev's inequality; statement of strong law
  • Joint PMF, PDF, CDF with examples; marginals obtained by integration of joint PDFs, CDFs, PMFs
  • Concept of independence of random variables
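A simulation sketch of the weak law of large numbers for fair-coin flips, assuming NumPy; the sample size and seed are illustrative:

```python
import numpy as np

# Weak LLN illustration: the sample mean of n fair-coin flips concentrates
# around p = 0.5 as n grows (seeded for reproducibility).
rng = np.random.default_rng(2)
flips = rng.integers(0, 2, size=100_000)

running_mean = np.cumsum(flips) / np.arange(1, len(flips) + 1)

# Chebyshev's bound from the proof: P(|mean_n - p| >= eps) <= p(1-p) / (n eps^2).
eps = 0.01
n = len(flips)
chebyshev_bound = 0.25 / (n * eps**2)   # variance of Bernoulli(1/2) is 1/4

assert abs(running_mean[-1] - 0.5) < eps
print(chebyshev_bound)  # 0.025: a deviation of >= 0.01 at n = 100000 is unlikely
```

The bound is loose (the true deviation probability here is far smaller), but it is enough to prove convergence in probability, which is exactly the weak law.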
21/8
  • Moment generating functions: definition, genesis, properties, uniqueness proofs
  • Conditional CDF, PDF, PMF; conditional expectation; examples

Families of Random Variables
  • Bernoulli random variables: mean, median, mode, variance, MGF
  • Binomial random variables: definition
24/8
  • Binomial random variables: definition, mean, variance, mode, CDF, MGF
  • Gaussian distribution: definition, mean, variance, verification of integration to 1
  • Introduction to and basic statement of the central limit theorem, with examples
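A simulation sketch of the central limit theorem using sums of i.i.d. Uniform(0,1) variables, assuming NumPy; the choice of n = 30 and the trial count are illustrative:

```python
import numpy as np

# CLT sketch: standardized means of i.i.d. Uniform(0,1) variables look Gaussian.
rng = np.random.default_rng(3)
n, trials = 30, 50_000
samples = rng.uniform(0, 1, size=(trials, n))

# Uniform(0,1) has mean 1/2 and variance 1/12; standardize the sample means.
z = (samples.mean(axis=1) - 0.5) / np.sqrt(1/12 / n)

# For a standard normal, P(|Z| <= 1.96) ~ 0.95; the simulation should agree.
coverage = np.mean(np.abs(z) <= 1.96)
assert abs(coverage - 0.95) < 0.01
```

Nothing about the uniform distribution is Gaussian, yet the standardized mean already matches Gaussian tail probabilities at n = 30, which is the practical content of the theorem.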
28/8
  • Properties of Gaussian: CDF and error function, MGF
  • Proof of Central Limit Theorem
  • Relation between CLT and Law of Large Numbers
  • Gaussian tail bounds
  • Distribution of sample mean and sample variance, Bessel's correction
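A simulation sketch of Bessel's correction, assuming NumPy; the sample size n = 5 is deliberately small so the bias is visible:

```python
import numpy as np

# Bessel's correction: dividing by (n-1) rather than n makes the sample
# variance unbiased. Estimate E[s^2] by simulation from N(0,1), true var = 1.
rng = np.random.default_rng(4)
n, trials = 5, 200_000
x = rng.standard_normal((trials, n))

var_biased = x.var(axis=1, ddof=0).mean()      # divide by n
var_unbiased = x.var(axis=1, ddof=1).mean()    # divide by n - 1 (Bessel)

# The biased estimator underestimates by the factor (n-1)/n = 0.8 here.
assert abs(var_biased - 0.8) < 0.02
assert abs(var_unbiased - 1.0) < 0.02
```

The bias arises because deviations are measured from the sample mean, which is itself fitted to the data, rather than from the true mean.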
31/8
  • Distribution of sample variance given Gaussian random variables
  • Chi-square distribution: derivation for the case of n = 1, formula stated for general n, derivation of MGF, mean, variance
  • Poisson distribution: genesis, mean, variance, MGF, thinning property, sum of Poisson random variables, practical application as model for image noise
  • Multinomial PMF - generalization of the binomial, mean vector and covariance matrix for a multinomial random variable
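A simulation sketch of the Poisson thinning property, assuming NumPy; the rate, keep-probability, and seed are illustrative:

```python
import numpy as np

# Thinning: if N ~ Poisson(lam) and each of the N events is kept independently
# with probability p, the kept count is again Poisson, with rate lam * p.
rng = np.random.default_rng(5)
lam, p, trials = 10.0, 0.3, 200_000

n = rng.poisson(lam, size=trials)
kept = rng.binomial(n, p)          # thin each count with keep-probability p

# Mean and variance should both be close to lam * p = 3, the Poisson signature
# (equal mean and variance).
assert abs(kept.mean() - lam * p) < 0.05
assert abs(kept.var() - lam * p) < 0.05
```

This property is what makes the Poisson model convenient for image noise: subsampling photons (e.g. reduced exposure) leaves the noise model Poisson, just with a smaller rate.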
4/9
  • Multinomial PMF - derivation of MGF
  • Hypergeometric distribution: genesis, mean, variance
  • Applications of the hypergeometric distribution in counting of animals via the capture-recapture method
  • Uniform distribution: mean, variance, median, MGF; applications in sampling from a pre-specified PMF; application in generating a random permutation of a given set
  • Exponential distribution: mean, median, MGF, variance, property of memorylessness, minimum of exponential random variables
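Two short sketches tied to the exponential-distribution topics above, assuming NumPy; the rate lam = 2.0 and the conditioning times are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
lam = 2.0

# (1) Inverse-transform sampling: if U ~ Uniform(0,1), then -ln(U)/lam is
#     Exponential(lam), whose mean is 1/lam = 0.5.
u = rng.uniform(0, 1, 500_000)
x = -np.log(u) / lam
assert abs(x.mean() - 1 / lam) < 0.01

# (2) Memorylessness: P(X > s + t | X > s) = P(X > t).
s, t = 0.5, 0.7
cond = np.mean(x[x > s] > s + t)   # conditional survival estimate
uncond = np.mean(x > t)
assert abs(cond - uncond) < 0.01
```

Part (1) is an instance of the uniform distribution's role in sampling from a pre-specified distribution; part (2) checks empirically the algebraic identity e^{-lam(s+t)} / e^{-lam s} = e^{-lam t}.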
7/9 Parameter Estimation
  • Concept of parameter estimation (or parametric PDF/PMF estimation)
  • Maximum likelihood estimation (MLE)
  • MLE for parameters of Bernoulli, Poisson, Gaussian and uniform distributions
  • Least squares line fitting as an MLE problem
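A sketch of MLE for Gaussian data, assuming NumPy; the true parameters and seed are illustrative. The maximizers of the Gaussian likelihood have the closed forms derived in lecture:

```python
import numpy as np

# MLE for N(mu, sigma^2): the likelihood is maximized by the sample mean and
# the biased (divide-by-n) sample variance.
rng = np.random.default_rng(7)
true_mu, true_sigma = 3.0, 2.0
x = rng.normal(true_mu, true_sigma, 100_000)

mu_hat = x.mean()                       # MLE of mu
var_hat = np.mean((x - mu_hat) ** 2)    # MLE of sigma^2 (note: /n, not /(n-1))

assert abs(mu_hat - true_mu) < 0.05
assert abs(var_hat - true_sigma**2) < 0.1
```

Note that the MLE of the variance is the *biased* estimator: maximizing likelihood and being unbiased are different criteria, which connects back to Bessel's correction.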
11/9
  • Concept of estimator bias, mean squared error, variance
  • Estimators for interval of uniform distribution: example of bias
  • Concept of two-sided confidence interval and one-sided confidence interval
  • Confidence interval for mean of a Gaussian with known standard deviation
  • Confidence interval for variance of a Gaussian
  • Slides: Parameter Estimation
  • MLE derivations
  • Readings: Section 5.6 from the textbook by Sheldon Ross
  • Readings: Sections 7.1, 7.2, 7.3, 7.5, 7.7, 9.2 (for least squares line fitting) of the textbook by Sheldon Ross
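A simulation sketch of the two-sided 95% confidence interval for a Gaussian mean with known sigma, assuming NumPy; true mu = 0, sigma = 1, and n = 25 are illustrative:

```python
import numpy as np

# The 95% two-sided CI with KNOWN sigma is xbar +/- z * sigma / sqrt(n),
# where z = z_{0.025} ~ 1.96. Check empirical coverage by simulation.
rng = np.random.default_rng(8)
sigma, n, trials, z = 1.0, 25, 50_000, 1.959964

x = rng.normal(0.0, sigma, size=(trials, n))
xbar = x.mean(axis=1)
half = z * sigma / np.sqrt(n)

# Fraction of simulated intervals that contain the true mean (0 here).
covered = np.mean((xbar - half <= 0.0) & (0.0 <= xbar + half))
assert abs(covered - 0.95) < 0.005
```

The coverage statement is about the random interval, not the fixed parameter: across repeated experiments, about 95% of the constructed intervals capture the true mean.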
14/9
  • Concept of nonparametric density estimation
  • Concept of histogram as a probability density estimator
  • Bias, variance and MSE for a histogram estimator for a smooth density (with bounded first derivatives) which is non-zero on a finite-sized interval; derivation of optimal number of bins (equivalently, optimal binwidth) and optimal MSE O(n^{-2/3})
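A sketch of the histogram as a density estimator with the binwidth scaling h ~ n^(-1/3) from the MSE analysis above, assuming NumPy; the constant factor is ignored and the uniform test density is illustrative:

```python
import numpy as np

# Histogram density estimate: bin counts divided by (n * binwidth) integrate
# to 1, and the MSE-optimal binwidth scales as n^(-1/3).
rng = np.random.default_rng(9)
n = 10_000
data = rng.uniform(0, 1, n)           # a smooth density on a finite interval

h = n ** (-1 / 3)                     # optimal binwidth scaling (constant ignored)
bins = int(np.ceil(1 / h))            # 22 bins for n = 10000
counts, edges = np.histogram(data, bins=bins, range=(0.0, 1.0))
density = counts / (n * (edges[1] - edges[0]))

# The estimate integrates to 1 and is close to the true density f = 1.
assert np.isclose(density.sum() * (edges[1] - edges[0]), 1.0)
assert np.max(np.abs(density - 1.0)) < 0.25
```

With this binwidth the squared bias (order h^2) and the variance (order 1/(nh)) are balanced, which is what yields the overall MSE rate O(n^{-2/3}).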