Mining the Web
(2nd edition)

Javascript is needed to view this page. Some links point to material that you can view only using a course password.

Slides accompanying Mining the Web are marked "MTW1" and "MTW2". Slides accompanying the Manning-Raghavan-Schuetze book are marked "MRS".

Text indexing

Abstract document model as set, bag, sequence of tokens.
Basic and positional indexing and search without scoring. [MRS MRS]
Posting list compression, dgap, pgap, gamma. [MRS MRS MRS MRS]
Maintaining the inverted index as the corpus changes.
Tokenization, vocabulary, dictionary. [MRS MRS MRS]
Indexing for zoned search, fields, word prefix match.
Compressing the dictionary, front coding.
Scalable high-performance indexing --- map-reduce.
Index caching .

Relevance ranking

MRS MRS
Recall, (interpolated) precision, F1, break-even.
MRR, MAP, NDCG.
Area under ROC curve, pair preference violation.
Scoring considerations: term frequency, collection/corpus frequency, document frequency, TFIDF.
Vector space model. [MRS MRS]
BM25.
DAAT and WAAT/KAAT query processing.
Early termination in scoring .
Fagin's threshold algorithm and probabilistic top-k .

Similarity search

Hash maps, FKS perfect hashing
Find-similar search, avoiding the quadratic bottleneck
Jaccard similarity over shingles
Pairwise and min-wise hash functions and families
Reducing random bits needed to sample permutations
Fast mirror site detection
More about hashing
Cosine similarity, random projections and find-similar
Locality sensitive hashing

Corpus models

.
Multivariate binary, Poisson, multinomial models .
Motivation for parameter smoothing.
Word burstiness, non-prametric model .
Word burstiness, Dirichlet word generator model .
Single cause mixture model and expectation maximization.
Limitations of single-cause mixture models.
Multiple cause mixture models, aspect model .
Latent Dirichlet topic model .
SVD and LSI.
Nonnegative matrix factorization .
Matrix compression: coclustering and cross associations .

Text clustering

Bridging the syntax gap: no/poor word overlap between query and doc
Cluster hypothesis in IR
Pseudorelevance feedback (PRF)
Text search as translation
Translation models via term-document random walk
Probabilistic models for information retrieval
Latent semantic indexing (LSI/SVD)
Applying PLSI/aspect model to search
Clustering and embedding results, scatter-gather
Diversity in ranking: MMR, subtopic coverage, submodular set selection.

Text classification

Labeling tasks: examples and motivation
Document representation for labeling --- are corpus models adequate?
Joint models, naive Bayes
Entropy model
Conditional model, maximum entropy principle
Art of feature vector design
Maxent and logistic regression
Discriminative labeling: support vector machines
Multiclass and multilabel applications
Dealing with topic hierarchies

Learning to rank

Learning to rank -- representation, training, testing
Itemwise, pairwise, listwise training paradigms
Itemwise: Ordinal regression
Pairwise: RankSVM and RankNet
Itermwise: McRank (Algorithm 6)
Pairwise: RankBoost
Listwise: SoftRank
The cutting plane recipe
RankAUC
Listwise: SVMmap
Listwise: DORM (for NDCG)
SVMndcg, SVMcombo
Another look at diversity

Social networks

Intro to social networks
Hyperlinks: the first online social network?
And now, Wikipedia, Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
Social networks as entity-relationship graph data models
Power laws everywhere
Diameter, hop-plots, bow-tie
Generating realistic social networks
Preferential attachment
Winner does not take all
Copying model
Centrality and prestige; eigenvalues
Hitting time, commute time, escape probability
Link prediction

Searching graphs

Compression and storage of graphs
Connectivity Server and WebGraph
Random walks as a way to sample the Web HenzingerHMN2000sample.ps
PageRank
Large-scale computation of PageRank (map-reduce?)
Accelerating PageRank C-H Theorem
PageRank linearity and decomposition
Topic-sensitive and personalized PageRank
Asynchronous push algorithms
Page staleness, link spam, trust/mistrust
Bipartite reinforcement: HITS
Using host graph, page content, HTML layout, anchor text
Score and rank stability of HITS and PageRank
PHITS and SALSA, effect on stability
Maxent view of PageRank
Learning to rank in graph models NetRank

Crawling, sampling and monitoring the Web

Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
Systems and scaling issues
Frontier prioritization issues and considerations
Focused crawling ChakrabartiDB1999f,
Refreshing crawls