CS635 Web Search and Mining A
Autumn 2009 Calendar

Week Date, slides Summary, papers
1 2009/07/23
Introduction
  • Administrative info
  • Course introduction and preview
2 2009/07/27
  • Tokenization, compound words, stemming
  • Building a compound word dictionary
  • The term-document matrix
2 2009/07/30
  • Generative models for the term-document matrix
  • Multivariate binary, multinomial, Poisson models
3 2009/08/03
  • Word burstiness, non-parametric models
  • Word burstiness, Dirichlet word generators
  • Modeling multiple topic clusters in the term document matrix
  • (Single cause) mixture model and expectation maximization
3 2009/08/06
  • Limitations of single-cause mixture models
  • Multiple cause mixture models, aspect model
  • Latent Dirichlet topic model
  • SVD/LSI
  • Nonnegative matrix factorization , Coclustering and cross-associations
4 2009/08/10
  • Basic and positional inverted index construction
  • Space taken by dictionary and posting lists
  • Compressing posting lists -- gamma code, skip pointers
  • Compressing the dictionary -- front coding
4 2009/08/13 Class postponed on account of swine flu advisory.
5 2009/08/17 Class postponed on account of swine flu advisory.
5 2009/08/20
  • Indexing for parametric and zoned search
  • Maintaining the inverted index as the corpus changes
  • Recall, (interpolated) precision, F1, break-even
  • Area under ROC curve, pair preference violation
  • MRR, MAP, NDCG, BPREF
6 2009/08/24
  • Vector space model
  • Term frequency, collection frequency, document frequency, TFIDF
  • BM25
  • Bridging the syntax gap: no/poor word overlap between query and doc
6 2009/08/27
  • Pseudorelevance feedback (PRF)
  • Text search as translation
  • Probabilistic models for information retrieval
  • Translation models via term-document random walk
  • Latent semantic indexing (LSI/SVD)
  • Applying PLSI/aspect model to search
7 2009/08/31
  • Cluster hypothesis in IR
  • Clustering and corpus visualization techniques
  • k-Means and self-organizing maps
  • Hierarchical agglomerative clustering, dendrograms, scatter-gather
  • Multidimensional scaling, FastMap
  • About dynamic clustering, Vivisimo, Clusty, Kosmix
7 2009/09/03
  • Query processing in IR: How to order and scan postings
  • Fagin's threshold algorithm and probabilistic top-k
  • Index caching
8 2009/09/07
  • Find-similar search, avoiding the quadratic bottleneck
  • Jaccard similarity over shingles
  • Pairwise and min-wise hash functions and families
  • Reducing random bits needed to sample permutations
8 2009/09/09
  • Fast mirror site detection
  • More about hashing
  • Cosine similarity, random projections and find-similar
  • Locality sensitive hashing
9 2009/09/18 Midterm exam 16:30--19:00.
10 2009/09/21
  • Labeling tasks: examples and motivation
  • Document representation for labeling --- are corpus models adequate?
  • Art of feature vector design
  • Joint models, naive Bayes
  • Entropy model
  • Conditional model, maximum entropy
10 2009/09/24
  • Logistic regression
  • Discriminative labeling: support vector machines
  • Multiclass and multilabel applications
  • Dealing with topic hierarchies
11 2009/09/28 Dussehra holiday
11 2009/10/01
Learning to rank
  • Learning to rank -- representation, training, testing
  • Itemwise, pairwise, listwise training paradigms
  • Itemwise: Ordinal regression
  • Itermwise: McRank (Algorithm 6)
12 2009/10/05
Learning to rank
  • Guest lecture on learning to rank by Dr. Shivani Agarwal
  • Pairwise: RankSVM and RankNet
  • Pairwise: RankBoost
12 2009/10/08
Learning to rank
  • Listwise: SoftRank
  • The cutting plane recipe
  • RankAUC
12 2009/10/10
Learning to rank
  • Listwise: SVMmap
  • Listwise: DORM (for NDCG)
  • SVMndcg, SVMcombo
13 2009/10/12
Social networks
  • Intro to social networks
  • Hyperlinks: the first online social network?
  • And now, Wikipedia, Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
  • Social networks as entity-relationship graph data models
  • Power laws everywhere
  • Diameter, hop-plots, bow-tie
  • Generating realistic social networks
13 2009/10/15
  • Preferential attachment
  • Winner does not take all
  • Copying model
14 2009/10/19 Diwali-related holiday
14 2009/10/22
  • Compression and storage of graphs
  • Connectivity Server and WebGraph
  • Centrality and prestige; eigenvalues
  • Hitting time, commute time, escape probability
  • Random walks as a way to sample the Web HenzingerHMN2000sample.ps
15 2009/10/26
  • PageRank
  • Large-scale computation of PageRank (map-reduce?)
  • Accelerating PageRank C-H Theorem
  • PageRank linearity and decomposition
15 2009/10/29
  • Topic-sensitive and personalized PageRank
  • Asynchronous push algorithms
  • Page staleness, link spam, trust/mistrust
16 2009/11/02 Guru Nanak birthday holiday
  • Bipartite reinforcement: HITS
  • Using host graph, page content, HTML layout, anchor text
  • Score and rank stability of HITS and PageRank
16 2009/11/05
  • PHITS and SALSA, effect on stability
  • Maxent view of PageRank
  • Learning to rank in graph models NetRank
17 2009/11/09
Crawling
  • Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
  • Systems and scaling issues
17 2009/11/12
  • Frontier prioritization issues and considerations
  • Focused crawling ChakrabartiDB1999f,
  • Refreshing crawls
2009/11/27 Final exam 14:30--18:30, SIC305.