Mining the Web
(2nd edition)
Javascript is needed to view this page. Some links point to
material that you can view only using a course password.
Slides accompanying Mining the Web
are marked "MTW1" and "MTW2". Slides accompanying the
Manning-Raghavan-Schuetze book are marked "MRS".
- Text indexing
-
-
- Abstract document model as set, bag, sequence of tokens.
- Basic and positional indexing and search without scoring.
[MRS
MRS]
- Posting list compression, dgap, pgap, gamma.
[MRS
MRS
MRS
MRS]
- Maintaining the inverted index as the corpus changes.
- Tokenization, vocabulary, dictionary.
[MRS
MRS
MRS]
- Indexing for zoned search, fields, word prefix match.
- Compressing the dictionary, front coding.
- Scalable high-performance indexing --- map-reduce.
- Index caching
.
- Relevance ranking
-
-
MRS
MRS
- Recall, (interpolated) precision, F1, break-even.
- MRR, MAP, NDCG.
- Area under ROC curve, pair preference violation.
- Scoring considerations: term frequency, collection/corpus frequency, document frequency, TFIDF.
- Vector space model.
[MRS
MRS]
- BM25.
- DAAT and WAAT/KAAT query processing.
- Early termination in scoring
.
- Fagin's threshold algorithm and
probabilistic top-k
.
- Similarity search
-
-
- Hash maps, FKS perfect hashing
- Find-similar search, avoiding the quadratic bottleneck
- Jaccard similarity over shingles
- Pairwise and min-wise hash functions and families
- Reducing random bits needed to sample permutations
- Fast mirror site detection
- More about hashing
- Cosine similarity, random projections and find-similar
- Locality
sensitive hashing
- Corpus models
-
- .
- Multivariate binary, Poisson, multinomial models
.
- Motivation for parameter smoothing.
- Word burstiness, non-prametric model
.
- Word burstiness, Dirichlet word generator model
.
- Single cause mixture model and expectation maximization.
- Limitations of single-cause mixture models.
- Multiple cause mixture models, aspect model
.
- Latent Dirichlet topic model
.
- SVD and LSI.
- Nonnegative matrix factorization
.
- Matrix compression: coclustering and cross associations
.
- Text clustering
-
- Bridging the syntax gap: no/poor word overlap between query and doc
- Cluster hypothesis in IR
- Pseudorelevance feedback (PRF)
- Text search as translation
- Translation models via term-document random walk
- Probabilistic models for information retrieval
- Latent semantic indexing (LSI/SVD)
- Applying PLSI/aspect model to search
- Clustering and embedding results, scatter-gather
- Diversity in ranking: MMR, subtopic coverage, submodular set selection.
- Text classification
-
- Labeling tasks: examples and motivation
- Document representation for labeling --- are corpus models adequate?
- Joint models, naive Bayes
- Entropy model
- Conditional model, maximum entropy principle
- Art of feature vector design
- Maxent and logistic regression
- Discriminative labeling: support vector machines
- Multiclass and multilabel applications
- Dealing with topic hierarchies
- Learning to rank
-
- Learning to rank -- representation, training, testing
- Itemwise, pairwise, listwise training paradigms
- Itemwise: Ordinal regression
- Pairwise: RankSVM
and RankNet
- Itermwise: McRank
(Algorithm 6)
- Pairwise: RankBoost
- Listwise: SoftRank
- The cutting plane recipe
- RankAUC
- Listwise: SVMmap
- Listwise: DORM (for NDCG)
- SVMndcg,
SVMcombo
- Another look at diversity
- Social networks
-
- Intro to social networks
- Hyperlinks: the first online social network?
- And now, Wikipedia,
Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
- Social networks as entity-relationship graph data models
- Power laws everywhere
- Diameter, hop-plots, bow-tie
- Generating realistic social networks
- Preferential attachment
- Winner does not take all
- Copying model
- Centrality and prestige; eigenvalues
- Hitting time, commute time, escape probability
- Link prediction
- Searching graphs
-
- Compression and storage of graphs
- Connectivity
Server and
WebGraph
- Random walks as a way to sample the Web
HenzingerHMN2000sample.ps
- PageRank
- Large-scale computation of PageRank (map-reduce?)
- Accelerating PageRank
C-H
Theorem
- PageRank linearity and decomposition
- Topic-sensitive
and personalized
PageRank
- Asynchronous push algorithms
- Page staleness, link spam, trust/mistrust
- Bipartite reinforcement: HITS
- Using host graph, page content, HTML layout, anchor text
- Score and rank stability of HITS and PageRank
- PHITS and SALSA, effect on stability
- Maxent view of PageRank
- Learning to rank in graph models
NetRank
- Crawling, sampling and monitoring the Web
-
- Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
- Systems and scaling issues
- Frontier prioritization issues and considerations
- Focused crawling
ChakrabartiDB1999f,
- Refreshing crawls