CS635 Web Search and Mining A
Autumn 2010 Calendar

Date, slides Summary, papers
2010/07/22
Introduction
  • Administrative info
  • Course introduction and preview
2010/07/26 Class postponed.
2010/07/29
  • Tokenization, compound words, stemming
  • Building a compound word dictionary
  • The term-document matrix
2010/08/02
  • Generative models for the term-document matrix
  • Multivariate binary, multinomial, Poisson models
2010/08/05
  • Word burstiness, non-parametric models
  • Word burstiness, Dirichlet word generators
  • Modeling multiple topic clusters in the term document matrix
2010/08/09
2010/08/12
  • Multiple cause mixture models, aspect model
  • Latent Dirichlet topic model
2010/08/16
  • Latent Dirichlet topic model , continued
  • Coclustering and cross-associations
2010/08/19
2010/08/23
  • Nonnegative matrix factorization , continued
  • Basic and positional inverted index construction
  • Space taken by dictionary and posting lists
  • Compressing posting lists -- gamma code, skip pointers
  • How to evaluate a query plan
2010/08/26
  • Maintaining the inverted index as the corpus changes
  • Indexing for zoned search, fields, word prefix match
  • Compressing the dictionary -- front coding
2010/08/30
  • Compressing the dictionary, continued
  • Recall, (interpolated) precision, F1, break-even
  • MRR, MAP, NDCG
  • Area under ROC curve, pair preference violation
2010/09/02
  • Scoring considerations: term frequency, collection/corpus frequency, document frequency, TFIDF
  • Vector space model
  • BM25
2010/09/04
  • Query processing in IR: How to order and scan postings
  • Fagin's threshold algorithm and probabilistic top-k
  • Index caching
  • FKS perfect hashing
2010/09/06
  • Find-similar search, avoiding the quadratic bottleneck
  • Jaccard similarity over shingles
  • Pairwise and min-wise hash functions and families
  • Reducing random bits needed to sample permutations
2010/09/09
  • Fast mirror site detection
  • More about hashing
  • Cosine similarity, random projections and find-similar
  • Locality sensitive hashing
2010/09/20 Midterm exam 15:30--18:30.
2010/09/23
  • Bridging the syntax gap: no/poor word overlap between query and doc
  • Cluster hypothesis in IR
  • Pseudorelevance feedback (PRF)
  • Text search as translation
  • Translation models via term-document random walk
2010/09/23
  • Probabilistic models for information retrieval
  • Latent semantic indexing (LSI/SVD)
  • Applying PLSI/aspect model to search
  • Clustering and embedding results, scatter-gather
2010/09/30
  • Clickthrough, abandonment, risk management
  • Labeling tasks: examples and motivation
  • Document representation for labeling --- are corpus models adequate?
  • Joint models, naive Bayes
  • Entropy model
2010/10/04
  • Conditional model, maximum entropy principle
  • Art of feature vector design
  • Maxent and logistic regression
  • Discriminative labeling: support vector machines
2010/10/11
  • Multiclass and multilabel applications
  • Dealing with topic hierarchies
  • Learning to rank -- representation, training, testing
  • Itemwise, pairwise, listwise training paradigms
2010/10/14
Learning to rank
  • Itemwise: Ordinal regression
  • Pairwise: RankSVM and RankNet
2010/10/18
Learning to rank
  • Itermwise: McRank (Algorithm 6)
  • Pairwise: RankBoost
2010/10/21
Learning to rank
  • Listwise: SoftRank
  • The cutting plane recipe
  • RankAUC
2010/10/25
Social networks
  • Listwise: SVMmap
  • Listwise: DORM (for NDCG)
  • SVMndcg, SVMcombo
  • Intro to social networks
  • Hyperlinks: the first online social network?
  • And now, Wikipedia, Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
  • Social networks as entity-relationship graph data models
2010/10/28 Yahoo! talks in lieu of regular lecture.
2010/11/01
  • Power laws everywhere
  • Diameter, hop-plots, bow-tie
  • Generating realistic social networks
  • Preferential attachment
  • Winner does not take all
  • Copying model
2010/11/04
  • Compression and storage of graphs
  • Connectivity Server and WebGraph
  • Centrality and prestige; eigenvalues
  • Hitting time, commute time, escape probability
  • Random walks as a way to sample the Web HenzingerHMN2000sample.ps
2010/11/08
  • PageRank
  • Large-scale computation of PageRank (map-reduce?)
  • Accelerating PageRank C-H Theorem
  • PageRank linearity and decomposition
  • Topic-sensitive and personalized PageRank
  • Asynchronous push algorithms
  • Page staleness, link spam, trust/mistrust
2010/11/11
  • Bipartite reinforcement: HITS
  • Using host graph, page content, HTML layout, anchor text
  • Score and rank stability of HITS and PageRank
  • PHITS and SALSA, effect on stability
  • Maxent view of PageRank
  • Learning to rank in graph models NetRank
2010/11/13
Crawling
  • Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
  • Systems and scaling issues
  • Frontier prioritization issues and considerations
  • Focused crawling ChakrabartiDB1999f,
  • Refreshing crawls
2010/11/27 Final exam, 14:30--18:30, SIC301.