Week 
Date, slides 
Summary, papers 
1 
2009/01/02 Introduction 

2 
2009/01/06

 Tokenization, compound words, stemming
 Building a compound word dictionary
 The termdocument matrix
 Generative models for the termdocument matrix
 Multivariate binary, multinomial, Poisson models

2 
2009/01/09

 Word burstiness, nonparametric models
 Word burstiness, Dirichlet word generators
 Modeling multiple topic clusters in the term document matrix
 (Single cause) mixture model and expectation
maximization
 Limitations of singlecause mixture models

3 
2009/01/13

 Multiple cause mixture models, aspect model
 Latent Dirichlet topic model
 Basic and positional inverted index construction
 Space taken by dictionary and posting lists
 Compressing posting lists  gamma code, skip pointers
 Compressing the dictionary  front coding

3 
2009/01/16

 Recall, (interpolated) precision, F1, breakeven
 Area under ROC curve, pair preference violation
 MRR, MAP, NDCG, BPREF
 Vector space model: term frequency, inverse document
frequency
 BM25

4 
2009/01/20

 Query processing in IR: How to order and
scan postings
 Fagin's threshold algorithm and
probabilistic topk
 Index caching
 Text search as translation
 Probabilistic models for information retrieval
 Translation models via termdocument random walk

4 
2009/01/23 Text clustering 
 Latent semantic indexing (LSI/SVD)
 Applying PLSI/aspect model to search
 Nonnegative matrix factorization
,
Coclustering and crossassociations
 Cluster hypothesis in IR
 Clustering and corpus visualization techniques
 kMeans and selforganizing maps
 Hierarchical agglomerative clustering, dendrograms,
scattergather
 Multidimensional scaling, FastMap
 About dynamic
clustering, Vivisimo, Clusty, Kosmix

5 
2009/01/27 Similaritybased query processing 
 Findsimilar search, avoiding the quadratic bottleneck
 Jaccard similarity over shingles
 Pairwise and minwise hash functions and families

5 
2009/01/30 Similarity search, continued 
 Reducing random bits needed to sample permutations
 Cosine similarity
 Fast mirror site detection
 Random projections and findsimilar with cosine
similarity
 Locality
sensitive hashing

6 
2009/02/03 Labeling feature vectors 
 Labeling tasks: examples and motivation
 Document representation for labeling  are corpus models adequate?
 Art of feature vector design
 Probabilistic labeling
 Joint models, naive Bayes
 Conditional model, logistic regression
 Entropy model

6 
2009/02/06 
Lecture postponed 
7 
2009/02/10
and ranking 
 Discriminative labeling
 Support vector machines
 Dealing with topic hierarchies
 Learning to rank  representation, training, testing
 Itemwise, pairwise, listwise training paradigms
 Itemwise: Ordinal regression

7 
2009/02/12 Learning to rank 
 Itermwise: McRank
(Algorithm 6)
 Pairwise: RankSVM
and RankNet

7 
2009/02/13 Learning to rank 

8 
2009/02/15 Midterm 
Midterm exam, SIC301, 9:3012:30.
Exam with some solutions. 
9 
2009/02/24 Social networks 
 Midterm review
 Intro to social networks
 Hyperlinks: the first online social network?
 And now, Wikipedia,
Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
 Social networks as entityrelationship graph data models

9 
2009/02/27
Social
networks 
 Power laws everywhere
 Centrality and prestige
 Diameter, hopplots, eigenvalues
 Bowtie

10 
2009/03/03 Social networks 
 Preferential attachment
 Winner does not take all
 Copying model

10 
2009/03/06 Social networks 
 Compression and storage of graphs
,
Connectivity Server
 Generating realistic social networks
 PageRank

11 
2009/03/13 Random walks 

12 
2009/03/17 PageRank 
 PageRank linearity and decomposition
 Topicsensitive
and personalized
PageRank
 Page staleness, link spam, trust/mistrust,
viral marketing, clickthrough mining

12 
2009/03/20 PageRank 
 Asynchronous push algorithms
 Alternative decay profiles
 Score and rank stability of PageRank
 Maxent view of PageRank

13 
2009/03/24 HITS 
 Bipartite reinforcement: HITS
 Learning to rank in graph models
NetRank

13 
2009/03/27 
Holiday 
14 
2009/03/31 HITS, Crawling 
 Score and rank stability of HITS and related algorithms
 PHITS and SALSA
 Using host graph, page content, HTML layout, anchor text
 Basic crawler plumbing: DNS, frontier, work pool, concurrency,
page storage
 Frontier prioritization issues and considerations

14 
2009/04/03 Information extraction 
 Named entity recognition: HMM, CRF
 NE disambiguation
 Opendomain isa relation harvesting:
Hearst patterns ,
KnowItAll
 Binary relationship recognition:
DIPRE ,
Snowball ,
dependency parsing
 Opendomain binary relation harvesting:
TextRunner
 Extracting records from markedup sources

15 
2009/04/08 Semistructured search 

WW 
2009/04/26 Final 
Final exam, SIC301, 9:3013:00.
Exam with some solutions. 