Mining the Web
material that you can view only using a course password.
Slides accompanying Mining the Web
are marked "MTW1" and "MTW2". Slides accompanying the
Manning-Raghavan-Schuetze book are marked "MRS".
- Text indexing
- Abstract document model as set, bag, sequence of tokens.
- Basic and positional indexing and search without scoring.
- Posting list compression, dgap, pgap, gamma.
- Maintaining the inverted index as the corpus changes.
- Tokenization, vocabulary, dictionary.
- Indexing for zoned search, fields, word prefix match.
- Compressing the dictionary, front coding.
- Scalable high-performance indexing --- map-reduce.
- Index caching
- Relevance ranking
- Recall, (interpolated) precision, F1, break-even.
- MRR, MAP, NDCG.
- Area under ROC curve, pair preference violation.
- Scoring considerations: term frequency, collection/corpus frequency, document frequency, TFIDF.
- Vector space model.
- DAAT and WAAT/KAAT query processing.
- Early termination in scoring
- Fagin's threshold algorithm and
- Similarity search
- Hash maps, FKS perfect hashing
- Find-similar search, avoiding the quadratic bottleneck
- Jaccard similarity over shingles
- Pairwise and min-wise hash functions and families
- Reducing random bits needed to sample permutations
- Fast mirror site detection
- More about hashing
- Cosine similarity, random projections and find-similar
- Corpus models
- Multivariate binary, Poisson, multinomial models
- Motivation for parameter smoothing.
- Word burstiness, non-prametric model
- Word burstiness, Dirichlet word generator model
- Single cause mixture model and expectation maximization.
- Limitations of single-cause mixture models.
- Multiple cause mixture models, aspect model
- Latent Dirichlet topic model
- SVD and LSI.
- Nonnegative matrix factorization
- Matrix compression: coclustering and cross associations
- Text clustering
- Bridging the syntax gap: no/poor word overlap between query and doc
- Cluster hypothesis in IR
- Pseudorelevance feedback (PRF)
- Text search as translation
- Translation models via term-document random walk
- Probabilistic models for information retrieval
- Latent semantic indexing (LSI/SVD)
- Applying PLSI/aspect model to search
- Clustering and embedding results, scatter-gather
- Diversity in ranking: MMR, subtopic coverage, submodular set selection.
- Text classification
- Labeling tasks: examples and motivation
- Document representation for labeling --- are corpus models adequate?
- Joint models, naive Bayes
- Entropy model
- Conditional model, maximum entropy principle
- Art of feature vector design
- Maxent and logistic regression
- Discriminative labeling: support vector machines
- Multiclass and multilabel applications
- Dealing with topic hierarchies
- Learning to rank
- Learning to rank -- representation, training, testing
- Itemwise, pairwise, listwise training paradigms
- Itemwise: Ordinal regression
- Pairwise: RankSVM
- Itermwise: McRank
- Pairwise: RankBoost
- Listwise: SoftRank
- The cutting plane recipe
- Listwise: SVMmap
- Listwise: DORM (for NDCG)
- Another look at diversity
- Social networks
- Intro to social networks
- Hyperlinks: the first online social network?
- And now, Wikipedia,
Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
- Social networks as entity-relationship graph data models
- Power laws everywhere
- Diameter, hop-plots, bow-tie
- Generating realistic social networks
- Preferential attachment
- Winner does not take all
- Copying model
- Centrality and prestige; eigenvalues
- Hitting time, commute time, escape probability
- Link prediction
- Searching graphs
- Compression and storage of graphs
- Random walks as a way to sample the Web
- Large-scale computation of PageRank (map-reduce?)
- Accelerating PageRank
- PageRank linearity and decomposition
- Asynchronous push algorithms
- Page staleness, link spam, trust/mistrust
- Bipartite reinforcement: HITS
- Using host graph, page content, HTML layout, anchor text
- Score and rank stability of HITS and PageRank
- PHITS and SALSA, effect on stability
- Maxent view of PageRank
- Learning to rank in graph models
- Crawling, sampling and monitoring the Web
- Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
- Systems and scaling issues
- Frontier prioritization issues and considerations
- Focused crawling
- Refreshing crawls