Mining the Web
(2nd edition)
Javascript is needed to view this page. Some links point to
material that you can view only using a course password.
- Document and corpus models
-
- .
- Multivariate binary, Poisson, multinomial models
.
- Motivation for parameter smoothing.
- Word burstiness, non-prametric model
.
- Word burstiness, Dirichlet word generator model
.
- Single cause mixture model and expectation maximization.
- Limitations of single-cause mixture models.
- Multiple cause mixture models, aspect model
.
- Latent Dirichlet topic model
.
- SVD and LSI.
- Nonnegative matrix factorization
.
- Matrix compression: coclustering and cross associations
.
- Indexing and search
-
- .
- Dictionary, posting list, dgap, pgap, gamma.
- Basic and positional query processing without scoring.
- Maintaining the inverted index as the corpus changes.
- Indexing for zoned search, fields, word prefix match.
- Compressing the dictionary -- front coding.
- Index caching
.
- Recall, (interpolated) precision, F1, break-even.
- MRR, MAP, NDCG.
- Area under ROC curve, pair preference violation.
- Scoring considerations: term frequency, collection/corpus frequency, document frequency, TFIDF.
- Vector space model.
- BM25.
- Early termination in scoring
.
- Fagin's threshold algorithm and
probabilistic top-k
.
- FKS perfect hashing
- Find-similar search, avoiding the quadratic bottleneck
- Jaccard similarity over shingles
- Pairwise and min-wise hash functions and families
- Reducing random bits needed to sample permutations
- Fast mirror site detection
- More about hashing
- Cosine similarity, random projections and find-similar
- Locality
sensitive hashing
- Bridging the syntax gap
-
- Bridging the syntax gap: no/poor word overlap between query and doc
- Cluster hypothesis in IR
- Pseudorelevance feedback (PRF)
- Text search as translation
- Translation models via term-document random walk
- Probabilistic models for information retrieval
- Latent semantic indexing (LSI/SVD)
- Applying PLSI/aspect model to search
- Clustering and embedding results, scatter-gather
- Labeling feature vectors
-
- Labeling tasks: examples and motivation
- Document representation for labeling --- are corpus models adequate?
- Joint models, naive Bayes
- Entropy model
- Conditional model, maximum entropy principle
- Art of feature vector design
- Maxent and logistic regression
- Discriminative labeling: support vector machines
- Multiclass and multilabel applications
- Dealing with topic hierarchies
- Scoring and ranking feature vectors
-
- Learning to rank -- representation, training, testing
- Itemwise, pairwise, listwise training paradigms
- Itemwise: Ordinal regression
- Pairwise: RankSVM
and RankNet
- Itermwise: McRank
(Algorithm 6)
- Pairwise: RankBoost
- Listwise: SoftRank
- The cutting plane recipe
- RankAUC
- Listwise: SVMmap
- Listwise: DORM (for NDCG)
- SVMndcg,
SVMcombo
- Measuring and modeling social networks
-
- Intro to social networks
- Hyperlinks: the first online social network?
- And now, Wikipedia,
Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ...
- Social networks as entity-relationship graph data models
- Power laws everywhere
- Diameter, hop-plots, bow-tie
- Generating realistic social networks
- Preferential attachment
- Winner does not take all
- Copying model
- Centrality and prestige; eigenvalues
- Hitting time, commute time, escape probability
- Indexing, searching and mining social networks
-
- Compression and storage of graphs
- Connectivity
Server and
WebGraph
- Random walks as a way to sample the Web
HenzingerHMN2000sample.ps
- PageRank
- Large-scale computation of PageRank (map-reduce?)
- Accelerating PageRank
C-H
Theorem
- PageRank linearity and decomposition
- Topic-sensitive
and personalized
PageRank
- Asynchronous push algorithms
- Page staleness, link spam, trust/mistrust
- Bipartite reinforcement: HITS
- Using host graph, page content, HTML layout, anchor text
- Score and rank stability of HITS and PageRank
- PHITS and SALSA, effect on stability
- Maxent view of PageRank
- Learning to rank in graph models
NetRank
- Crawling and monitoring the Web
-
- Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage
- Systems and scaling issues
- Frontier prioritization issues and considerations
- Focused crawling
ChakrabartiDB1999f,
- Refreshing crawls