Week |
Date, slides |
Summary, papers |
1 |
2009/07/23 Introduction |
- Administrative info
- Course introduction and preview
|
2 |
2009/07/27
|
- Tokenization, compound words, stemming
- Building a compound word dictionary
- The term-document matrix
|
2 |
2009/07/30
|
- Generative models for the term-document matrix
- Multivariate binary, multinomial, Poisson models
|
3 |
2009/08/03
|
- Word burstiness, non-parametric models
- Word burstiness, Dirichlet word generators
- Modeling multiple topic clusters in the term document matrix
- (Single cause) mixture model and expectation
maximization
|
3 |
2009/08/06
|
- Limitations of single-cause mixture models
- Multiple cause mixture models, aspect model
- Latent Dirichlet topic model
- SVD/LSI
- Nonnegative matrix factorization
,
Coclustering and cross-associations
|
4 |
2009/08/10
|
- Basic and positional inverted index construction
- Space taken by dictionary and posting lists
- Compressing posting lists -- gamma code, skip pointers
- Compressing the dictionary -- front coding
|
4 |
2009/08/13 |
Class postponed on account of swine flu advisory. |
5 |
2009/08/17 |
Class postponed on account of swine flu advisory. |
5 |
2009/08/20
|
- Indexing for parametric and zoned search
- Maintaining the inverted index as the corpus changes
- Recall, (interpolated) precision, F1, break-even
- Area under ROC curve, pair preference violation
- MRR, MAP, NDCG, BPREF
|
6 |
2009/08/24 |
- Vector space model
- Term frequency, collection frequency, document frequency, TFIDF
- BM25
- Bridging the syntax gap: no/poor word overlap between query and doc
|
6 |
2009/08/27
|
|
7 |
2009/08/31
|
- Cluster hypothesis in IR
- Clustering and corpus visualization techniques
- k-Means and self-organizing maps
- Hierarchical agglomerative clustering, dendrograms, scatter-gather
- Multidimensional scaling, FastMap
- About dynamic
clustering, Vivisimo, Clusty, Kosmix
|
7 |
2009/09/03 |
- Query processing in IR: How to order and
scan postings
- Fagin's threshold algorithm and
probabilistic top-k
- Index caching
|
8 |
2009/09/07
|
- Find-similar search, avoiding the quadratic bottleneck
- Jaccard similarity over shingles
- Pairwise and min-wise hash functions and families
- Reducing random bits needed to sample permutations
|
8 |
2009/09/09 |
|
9 |
2009/09/18 |
Midterm exam 16:30--19:00. |
10 |
2009/09/21
|
- Labeling tasks: examples and motivation
- Document representation for labeling --- are corpus models adequate?
- Art of feature vector design
- Joint models, naive Bayes
- Entropy model
- Conditional model, maximum entropy
|
10 |
2009/09/24 |
- Logistic regression
- Discriminative labeling: support vector machines
- Multiclass and multilabel applications
- Dealing with topic hierarchies
|
11 |
2009/09/28 |
Dussehra holiday |
11 |
2009/10/01 Learning to rank |
- Learning to rank -- representation, training, testing
- Itemwise, pairwise, listwise training paradigms
- Itemwise: Ordinal regression
- Itermwise: McRank
(Algorithm 6)
|
12 |
2009/10/05 Learning to rank |
- Guest lecture on learning to rank by Dr. Shivani Agarwal
- Pairwise: RankSVM
and RankNet
- Pairwise: RankBoost
|
12 |
2009/10/08 Learning to rank |
- Listwise: SoftRank
- The cutting plane recipe
- RankAUC
|
12 |
2009/10/10 Learning to rank |
|
13 |
2009/10/12 Social networks |
- Intro to social networks
- Hyperlinks: the first online social network?
- And now, Wikipedia,
Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
- Social networks as entity-relationship graph data models
- Power laws everywhere
- Diameter, hop-plots, bow-tie
- Generating realistic social networks
|
13 |
2009/10/15 |
- Preferential attachment
- Winner does not take all
- Copying model
|
14 |
2009/10/19 |
Diwali-related holiday
|
14 |
2009/10/22 |
|
15 |
2009/10/26 |
- PageRank
- Large-scale computation of PageRank (map-reduce?)
- Accelerating PageRank
C-H
Theorem
- PageRank linearity and decomposition
|
15 |
2009/10/29 |
- Topic-sensitive
and personalized
PageRank
- Asynchronous push algorithms
- Page staleness, link spam, trust/mistrust
|
16 |
2009/11/02 |
Guru Nanak birthday holiday
- Bipartite reinforcement: HITS
- Using host graph, page content, HTML layout, anchor text
- Score and rank stability of HITS and PageRank
|
16 |
2009/11/05 |
- PHITS and SALSA, effect on stability
- Maxent view of PageRank
- Learning to rank in graph models
NetRank
|
17 |
2009/11/09 Crawling |
- Basic crawler plumbing: DNS, frontier, work pool,
concurrency, page storage
- Systems and scaling issues
|
17 |
2009/11/12 |
- Frontier prioritization issues and considerations
- Focused crawling
ChakrabartiDB1999f,
- Refreshing crawls
|
|
2009/11/27 |
Final exam 14:30--18:30, SIC305. |