Date, slides |
Summary, papers |
2010/07/22 Introduction |
- Administrative info
- Course introduction and preview
|
2010/07/26 |
Class postponed.
|
2010/07/29
|
- Tokenization, compound words, stemming
- Building a compound word dictionary
- The term-document matrix
|
2010/08/02
|
- Generative models for the term-document matrix
- Multivariate binary, multinomial, Poisson models
|
2010/08/05
|
- Word burstiness, non-parametric models
- Word burstiness, Dirichlet word generators
- Modeling multiple topic clusters in the term document matrix
|
2010/08/09
|
|
2010/08/12
|
- Multiple cause mixture models, aspect model
- Latent Dirichlet topic model
|
2010/08/16
|
- Latent Dirichlet topic model
,
continued
- Coclustering and cross-associations
|
2010/08/19 |
|
2010/08/23
|
- Nonnegative matrix factorization
,
continued
- Basic and positional inverted index construction
- Space taken by dictionary and posting lists
- Compressing posting lists -- gamma code, skip pointers
- How to evaluate a query plan
|
2010/08/26
|
- Maintaining the inverted index as the corpus changes
- Indexing for zoned search, fields, word prefix match
- Compressing the dictionary -- front coding
|
2010/08/30
|
- Compressing the dictionary, continued
- Recall, (interpolated) precision, F1, break-even
- MRR, MAP, NDCG
- Area under ROC curve, pair preference violation
|
2010/09/02
|
- Scoring considerations: term frequency, collection/corpus frequency, document frequency, TFIDF
- Vector space model
- BM25
|
2010/09/04
|
- Query processing in IR: How to order and
scan postings
- Fagin's threshold algorithm and
probabilistic top-k
- Index caching
- FKS perfect hashing
|
2010/09/06
|
- Find-similar search, avoiding the quadratic bottleneck
- Jaccard similarity over shingles
- Pairwise and min-wise hash functions and families
- Reducing random bits needed to sample permutations
|
2010/09/09 |
|
2010/09/20 |
Midterm exam 15:30--18:30. |
2010/09/23
|
- Bridging the syntax gap: no/poor word overlap between query and doc
- Cluster hypothesis in IR
- Pseudorelevance feedback (PRF)
- Text search as translation
- Translation models via term-document random walk
|
2010/09/23
|
|
2010/09/30
|
- Clickthrough, abandonment, risk management
- Labeling tasks: examples and motivation
- Document representation for labeling --- are corpus models adequate?
- Joint models, naive Bayes
- Entropy model
|
2010/10/04
|
- Conditional model, maximum entropy principle
- Art of feature vector design
- Maxent and logistic regression
- Discriminative labeling: support vector machines
|
2010/10/11 |
- Multiclass and multilabel applications
- Dealing with topic hierarchies
- Learning to rank -- representation, training, testing
- Itemwise, pairwise, listwise training paradigms
|
2010/10/14 Learning to rank |
- Itemwise: Ordinal regression
- Pairwise: RankSVM
and RankNet
|
2010/10/18 Learning to rank |
- Itermwise: McRank
(Algorithm 6)
- Pairwise: RankBoost
|
2010/10/21 Learning to rank |
- Listwise: SoftRank
- The cutting plane recipe
- RankAUC
|
2010/10/25 Social networks |
- Listwise: SVMmap
- Listwise: DORM (for NDCG)
- SVMndcg,
SVMcombo
- Intro to social networks
- Hyperlinks: the first online social network?
- And now, Wikipedia,
Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
- Social networks as entity-relationship graph data models
|
2010/10/28 |
Yahoo! talks in lieu of regular lecture.
|
2010/11/01 |
- Power laws everywhere
- Diameter, hop-plots, bow-tie
- Generating realistic social networks
- Preferential attachment
- Winner does not take all
- Copying model
|
2010/11/04 |
|
2010/11/08 |
- PageRank
- Large-scale computation of PageRank (map-reduce?)
- Accelerating PageRank
C-H
Theorem
- PageRank linearity and decomposition
- Topic-sensitive
and personalized
PageRank
- Asynchronous push algorithms
- Page staleness, link spam, trust/mistrust
|
2010/11/11 |
- Bipartite reinforcement: HITS
- Using host graph, page content, HTML layout, anchor text
- Score and rank stability of HITS and PageRank
- PHITS and SALSA, effect on stability
- Maxent view of PageRank
- Learning to rank in graph models
NetRank
|
2010/11/13 Crawling |
- Basic crawler plumbing: DNS, frontier, work pool,
concurrency, page storage
- Systems and scaling issues
- Frontier prioritization issues and considerations
- Focused crawling
ChakrabartiDB1999f,
- Refreshing crawls
|
2010/11/27 |
Final exam,
14:30--18:30, SIC301. |