Week |
Date, slides |
Summary, papers |
1 |
2009/01/02 Introduction |
|
2 |
2009/01/06
|
- Tokenization, compound words, stemming
- Building a compound word dictionary
- The term-document matrix
- Generative models for the term-document matrix
- Multivariate binary, multinomial, Poisson models
|
2 |
2009/01/09
|
- Word burstiness, non-parametric models
- Word burstiness, Dirichlet word generators
- Modeling multiple topic clusters in the term document matrix
- (Single cause) mixture model and expectation
maximization
- Limitations of single-cause mixture models
|
3 |
2009/01/13
|
- Multiple cause mixture models, aspect model
- Latent Dirichlet topic model
- Basic and positional inverted index construction
- Space taken by dictionary and posting lists
- Compressing posting lists -- gamma code, skip pointers
- Compressing the dictionary -- front coding
|
3 |
2009/01/16
|
- Recall, (interpolated) precision, F1, break-even
- Area under ROC curve, pair preference violation
- MRR, MAP, NDCG, BPREF
- Vector space model: term frequency, inverse document
frequency
- BM25
|
4 |
2009/01/20
|
- Query processing in IR: How to order and
scan postings
- Fagin's threshold algorithm and
probabilistic top-k
- Index caching
- Text search as translation
- Probabilistic models for information retrieval
- Translation models via term-document random walk
|
4 |
2009/01/23 Text clustering |
- Latent semantic indexing (LSI/SVD)
- Applying PLSI/aspect model to search
- Nonnegative matrix factorization
,
Coclustering and cross-associations
- Cluster hypothesis in IR
- Clustering and corpus visualization techniques
- k-Means and self-organizing maps
- Hierarchical agglomerative clustering, dendrograms,
scatter-gather
- Multidimensional scaling, FastMap
- About dynamic
clustering, Vivisimo, Clusty, Kosmix
|
5 |
2009/01/27 Similarity-based query processing |
- Find-similar search, avoiding the quadratic bottleneck
- Jaccard similarity over shingles
- Pairwise and min-wise hash functions and families
|
5 |
2009/01/30 Similarity search, continued |
- Reducing random bits needed to sample permutations
- Cosine similarity
- Fast mirror site detection
- Random projections and find-similar with cosine
similarity
- Locality
sensitive hashing
|
6 |
2009/02/03 Labeling feature vectors |
- Labeling tasks: examples and motivation
- Document representation for labeling --- are corpus models adequate?
- Art of feature vector design
- Probabilistic labeling
- Joint models, naive Bayes
- Conditional model, logistic regression
- Entropy model
|
6 |
2009/02/06 |
Lecture postponed |
7 |
2009/02/10
and ranking |
- Discriminative labeling
- Support vector machines
- Dealing with topic hierarchies
- Learning to rank -- representation, training, testing
- Itemwise, pairwise, listwise training paradigms
- Itemwise: Ordinal regression
|
7 |
2009/02/12 Learning to rank |
- Itermwise: McRank
(Algorithm 6)
- Pairwise: RankSVM
and RankNet
|
7 |
2009/02/13 Learning to rank |
|
8 |
2009/02/15 Midterm |
Midterm exam, SIC301, 9:30--12:30.
Exam with some solutions. |
9 |
2009/02/24 Social networks |
- Midterm review
- Intro to social networks
- Hyperlinks: the first online social network?
- And now, Wikipedia,
Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
- Social networks as entity-relationship graph data models
|
9 |
2009/02/27
Social
networks |
- Power laws everywhere
- Centrality and prestige
- Diameter, hop-plots, eigenvalues
- Bow-tie
|
10 |
2009/03/03 Social networks |
- Preferential attachment
- Winner does not take all
- Copying model
|
10 |
2009/03/06 Social networks |
- Compression and storage of graphs
,
Connectivity Server
- Generating realistic social networks
- PageRank
|
11 |
2009/03/13 Random walks |
|
12 |
2009/03/17 PageRank |
- PageRank linearity and decomposition
- Topic-sensitive
and personalized
PageRank
- Page staleness, link spam, trust/mistrust,
viral marketing, clickthrough mining
|
12 |
2009/03/20 PageRank |
- Asynchronous push algorithms
- Alternative decay profiles
- Score and rank stability of PageRank
- Maxent view of PageRank
|
13 |
2009/03/24 HITS |
- Bipartite reinforcement: HITS
- Learning to rank in graph models
NetRank
|
13 |
2009/03/27 |
Holiday |
14 |
2009/03/31 HITS, Crawling |
- Score and rank stability of HITS and related algorithms
- PHITS and SALSA
- Using host graph, page content, HTML layout, anchor text
- Basic crawler plumbing: DNS, frontier, work pool, concurrency,
page storage
- Frontier prioritization issues and considerations
|
14 |
2009/04/03 Information extraction |
- Named entity recognition: HMM, CRF
- NE disambiguation
- Open-domain is-a relation harvesting:
Hearst patterns ,
KnowItAll
- Binary relationship recognition:
DIPRE ,
Snowball ,
dependency parsing
- Open-domain binary relation harvesting:
TextRunner
- Extracting records from marked-up sources
|
15 |
2009/04/08 Semistructured search |
|
WW |
2009/04/26 Final |
Final exam, SIC301, 9:30--13:00.
Exam with some solutions. |