2010/07/22 Introduction 
 Course introduction and preview

2010/07/26 
2010/07/29

 Tokenization, compound words, stemming
 Building a compound word dictionary
 The term-document matrix

2010/08/02

 Generative models for the term-document matrix
 Multivariate binary, multinomial, Poisson models

2010/08/05

 Word burstiness, nonparametric models
 Word burstiness, Dirichlet word generators
 Modeling multiple topic clusters in the term-document matrix

2010/08/09


2010/08/12

 Multiple cause mixture models, aspect model
 Latent Dirichlet topic model

2010/08/16

 Latent Dirichlet topic model, continued
 Co-clustering and cross-associations

2010/08/19 

2010/08/23

 Nonnegative matrix factorization, continued
 Basic and positional inverted index construction
 Space taken by dictionary and posting lists
 Compressing posting lists: gamma code, skip pointers
 How to evaluate a query plan

2010/08/26

 Maintaining the inverted index as the corpus changes
 Indexing for zoned search, fields, word prefix match
 Compressing the dictionary: front coding

2010/08/30

 Compressing the dictionary, continued
 Recall, (interpolated) precision, F1, breakeven
 MRR, MAP, NDCG
 Area under ROC curve, pair preference violation

2010/09/02

 Scoring considerations: term frequency, collection/corpus frequency, document frequency, TF-IDF
 Vector space model
 BM25

2010/09/04

 Query processing in IR: how to order and scan postings
 Fagin's threshold algorithm and probabilistic top-k
 Index caching
 FKS perfect hashing

2010/09/06

 Find-similar search, avoiding the quadratic bottleneck
 Jaccard similarity over shingles
 Pairwise and min-wise hash functions and families
 Reducing random bits needed to sample permutations

2010/09/09 

Midterm exam 15:30-18:30.

2010/09/23

 Bridging the syntax gap: no/poor word overlap between query and doc
 Cluster hypothesis in IR
 Pseudo-relevance feedback (PRF)
 Text search as translation
 Translation models via term-document random walk

2010/09/30

 Clickthrough, abandonment, risk management
 Labeling tasks: examples and motivation
 Document representation for labeling: are corpus models adequate?
 Joint models, naive Bayes
 Entropy model

2010/10/04

 Conditional model, maximum entropy principle
 Art of feature vector design
 Maxent and logistic regression
 Discriminative labeling: support vector machines

2010/10/11 
 Multiclass and multilabel applications
 Dealing with topic hierarchies
 Learning to rank: representation, training, testing
 Itemwise, pairwise, listwise training paradigms

2010/10/14 Learning to rank 
 Itemwise: Ordinal regression
 Pairwise: RankSVM and RankNet

2010/10/18 Learning to rank 
 Itemwise: McRank (Algorithm 6)
 Pairwise: RankBoost

2010/10/21 Learning to rank 
 Listwise: SoftRank
 The cutting plane recipe
 RankAUC

2010/10/25 Social networks 
 Listwise: SVMmap
 Listwise: DORM (for NDCG)
 SVMndcg, SVMcombo
 Intro to social networks
 Hyperlinks: the first online social network?
 And now, Wikipedia, Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
 Social networks as entity-relationship graph data models

Yahoo! talks in lieu of regular lecture.

2010/11/01 
 Power laws everywhere
 Diameter, hop-plots, bow-tie
 Generating realistic social networks
 Preferential attachment
 Winner does not take all
 Copying model

2010/11/08 
 PageRank
 Large-scale computation of PageRank (MapReduce?)
 Accelerating PageRank, CH Theorem
 PageRank linearity and decomposition
 Topic-sensitive and personalized PageRank
 Asynchronous push algorithms
 Page staleness, link spam, trust/mistrust

2010/11/11 
 Bipartite reinforcement: HITS
 Using host graph, page content, HTML layout, anchor text
 Score and rank stability of HITS and PageRank
 PHITS and SALSA, effect on stability
 Maxent view of PageRank
 Learning to rank in graph models: NetRank

2010/11/13 Crawling 
 Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
 Systems and scaling issues
 Frontier prioritization issues and considerations
 Focused crawling (ChakrabartiDB1999f)
 Refreshing crawls

2010/11/27 
Final exam, 14:30-18:30, SIC301.