CS635 Web Search and Mining
Spring 2009 Calendar

Week Date, slides Summary, papers
1 2009/01/02
Introduction
2 2009/01/06
  • Tokenization, compound words, stemming
  • Building a compound word dictionary
  • The term-document matrix
  • Generative models for the term-document matrix
  • Multivariate binary, multinomial, Poisson models
2 2009/01/09
  • Word burstiness, non-parametric models
  • Word burstiness, Dirichlet word generators
  • Modeling multiple topic clusters in the term document matrix
  • (Single cause) mixture model and expectation maximization
  • Limitations of single-cause mixture models
3 2009/01/13
  • Multiple cause mixture models, aspect model
  • Latent Dirichlet topic model

  • Basic and positional inverted index construction
  • Space taken by dictionary and posting lists
  • Compressing posting lists -- gamma code, skip pointers
  • Compressing the dictionary -- front coding
3 2009/01/16
  • Recall, (interpolated) precision, F1, break-even
  • Area under ROC curve, pair preference violation
  • MRR, MAP, NDCG, BPREF
  • Vector space model: term frequency, inverse document frequency
  • BM25
4 2009/01/20
  • Query processing in IR: How to order and scan postings
  • Fagin's threshold algorithm and probabilistic top-k
  • Index caching
  • Text search as translation
  • Probabilistic models for information retrieval
  • Translation models via term-document random walk
4 2009/01/23
Text clustering
  • Latent semantic indexing (LSI/SVD)
  • Applying PLSI/aspect model to search
  • Nonnegative matrix factorization , Coclustering and cross-associations

  • Cluster hypothesis in IR
  • Clustering and corpus visualization techniques
  • k-Means and self-organizing maps
  • Hierarchical agglomerative clustering, dendrograms, scatter-gather
  • Multidimensional scaling, FastMap
  • About dynamic clustering, Vivisimo, Clusty, Kosmix
5 2009/01/27
Similarity-based query processing
  • Find-similar search, avoiding the quadratic bottleneck
  • Jaccard similarity over shingles
  • Pairwise and min-wise hash functions and families
5 2009/01/30
Similarity search, continued
  • Reducing random bits needed to sample permutations
  • Cosine similarity
  • Fast mirror site detection
  • Random projections and find-similar with cosine similarity
  • Locality sensitive hashing
6 2009/02/03
Labeling feature vectors
  • Labeling tasks: examples and motivation
  • Document representation for labeling --- are corpus models adequate?
  • Art of feature vector design
  • Probabilistic labeling
    • Joint models, naive Bayes
    • Conditional model, logistic regression
    • Entropy model
6 2009/02/06 Lecture postponed
7 2009/02/10
and ranking
  • Discriminative labeling
    • Support vector machines
    • Dealing with topic hierarchies
  • Learning to rank -- representation, training, testing
  • Itemwise, pairwise, listwise training paradigms
  • Itemwise: Ordinal regression
7 2009/02/12
Learning to rank
  • Itermwise: McRank (Algorithm 6)
  • Pairwise: RankSVM and RankNet
7 2009/02/13
Learning to rank
  • Pairwise: RankBoost
  • Listwise: SoftRank , RankAUC , SVMmap , DORM , SVMndcg, SVMcombo
8 2009/02/15
Midterm
Midterm exam, SIC301, 9:30--12:30. Exam with some solutions.
9 2009/02/24
Social networks
  • Midterm review
  • Intro to social networks
  • Hyperlinks: the first online social network?
  • And now, Wikipedia, Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
  • Social networks as entity-relationship graph data models
9 2009/02/27
Social networks
  • Power laws everywhere
  • Centrality and prestige
  • Diameter, hop-plots, eigenvalues
  • Bow-tie
10 2009/03/03
Social networks
  • Preferential attachment
  • Winner does not take all
  • Copying model
10 2009/03/06
Social networks
  • Compression and storage of graphs , Connectivity Server
  • Generating realistic social networks
  • PageRank
11 2009/03/13
Random walks
  • Random walks as a way to sample the Web HenzingerHMN2000sample.ps
  • Accelerating PageRank C-H Theorem
  • Large-scale computation
12 2009/03/17
PageRank
  • PageRank linearity and decomposition
  • Topic-sensitive and personalized PageRank
  • Page staleness, link spam, trust/mistrust, viral marketing, clickthrough mining
12 2009/03/20
PageRank
  • Asynchronous push algorithms
  • Alternative decay profiles
  • Score and rank stability of PageRank
  • Maxent view of PageRank
13 2009/03/24
HITS
  • Bipartite reinforcement: HITS
  • Learning to rank in graph models NetRank
13 2009/03/27 Holiday
14 2009/03/31
HITS, Crawling
  • Score and rank stability of HITS and related algorithms
  • PHITS and SALSA
  • Using host graph, page content, HTML layout, anchor text
  • Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
  • Frontier prioritization issues and considerations
14 2009/04/03
Information extraction
  • Named entity recognition: HMM, CRF
  • NE disambiguation
  • Open-domain is-a relation harvesting: Hearst patterns , KnowItAll
  • Binary relationship recognition: DIPRE , Snowball , dependency parsing
  • Open-domain binary relation harvesting: TextRunner
  • Extracting records from marked-up sources
15 2009/04/08
Semistructured search
  • Convergence of DB+XML and IR+Web search
  • WHIRL
  • XIRQL and ELIXIR
  • XQuery, TeXQuery
  • DBXplorer , BANKS , DISCOVER
  • DBPedia, HBase, FreeBase, YAGO and NAGA
WW 2009/04/26
Final
Final exam, SIC301, 9:30--13:00. Exam with some solutions.