Week

Date/Title

Contents

1

20080104 lecture

Course introduction and administrative issues
Earlier known by the mouthful "Information Retrieval and Mining for
Hypertext and the Web". As far as the academic section is concerned I
will never succeed in renaming the course, but hereafter, for our
convenience, it is "Web Search and Mining".
 Time and place
 Slot 14, 56:25pm Tue+Fri, SIC301; first lecture on 2008/01/04
 Teaching assistants
 Rajiv Khanna, Dhaval Makawana
 Prerequisites
 Undergrad
algebra, calculus, probability, statistics, data structures,
algorithms, databases are mandatory. CS705 is strongly recommended.
 Evaluation

 Homeworks
 Midterm exam
 Final exam
Resources
 BaezaYates, R. and RibeiroNeto, B. (1999). Modern Information Retrieval. Pearson Education.
 Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. MorganKaufman.
 Manning, C., Raghavan, P., and Schutze, H. (2007). An Introduction to Information Retrieval. Cambridge University Press.
 Witten,
I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes:
Compressing and Indexing Documents and Images. MorganKaufman.
 Grossman, D. A. and Frieder, O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer.

2

20080108 lecture

Tokenization, termdocument matrix, introduction to corpus models
 Tokenization, compound words, stemming
 Building a compound word dictionary
 The termdocument matrix
 Generative models, perplexity, curse of dimensionality
 Multivariate binary, multinomial, Poisson models
McCallumN1998event.pdf

3

20080115 lecture

Parametric and nonparametric document models, word burstiness


20080118 lecture

Termdocument random walk, translation models, latent semantic indexing

4

20080122 lecture

Modeling topics and clusters in a corpus
 LSI/SVD on almost lowrank matrices
 Principal component analysis (PCA) and issues with text analysis
 Modeling multiple topic clusters
 (Single cause) mixture model and Expectation maximization
 Limitations of singlecause mixture models
 Multiple cause mixture models, aspect model Hofmann1999plsa.ps.gz


20080125 lecture

Corpus models and clustering
 Latent Dirichlet topic model
BleiNJ2003lda.pdf
 Clustering and corpus visualization techniques
 kMeans and selforganizing maps
 Multidimensional scaling

5

20080129 lecture

Clustering; basic text indexing
 Multidimensional scaling

 Basic and positional inverted index construction


20080130 extra lecture

Text indexing and Boolean search
 Building the inverted index
 Space taken by dictionary and posting lists
 Compressing posting lists  gamma code
 Compressing the dictionary  front coding


20080201 lecture

Relevance ranking and the vector space model
 Recall, precision
 F1, breakeven
 Area under ROC curve
 Pair preference violation
 MRR, MAP, NDCG
 Vector space model: term frequency, inverse document frequency
 How to order and scan postings
 Fagin's threshold algorithm and probabilistic topk

6

20080205 lecture

Jaccard similarity, shingling, minwise hash functions
 Findsimilar search, avoiding the quadratic bottleneck
 Jaccard similarity
 Shingling
 Pairwise and minwise hash functions and families
 Reducing random bits needed to sample permutations
 Cosine similarity


20080206 rescheduled lecture

Findsimilar, nearneighbor search, locality sensitive hashing
 Fast mirror site detection
 Random projections and findsimilar with cosine similarity
 Locality sensitive hash functions


20080208 lecture

Supervised document labeling
 Labeling tasks: examples and motivation
 Document representation for labeling  are corpus models adequate?
 Art of feature vector design
 Probabilistic labeling
 Joint models, naive Bayes
 Conditional model, logistic regression
 Entropy model
 Topic hierarchies
 Discriminative labeling
 Support vector machines
 Topic hierarchies

8

20080223 midterm exam

Midterm exam 9:3012:30 A1Math
Paper with some solutions has been posted. Please check for bugs and report if you find any.

9

20080226 lecture

Learning to rank from relevance judgments
 Midterm exam review
 Discussion of next homework
 Linear scoring model
 Training inputs: ordinal scores, pair preferences

10

20080304 lecture

Learning to rank: itemwise, pairwise, listwise losses; McRank, RankSVM
 Itemwise, pairwise and listwise training paradigms
 Boosted regression trees (McRank)
 RankSVM, pairwise preferences and AUC


20080305 rescheduled lecture

StructSVM, SVMauc, RankNet, SoftRank
 StructSVM basics
 Nearlinear time SVMauc
 RankNet, SoftRank


20080307 lecture

More maxmargin listwise ranking
 Feature maps for ranking
 Pairwise feature map
 SVMmap
 SVMndcg, SVMmrr

11

20080311 lecture

Ranking; Social networks
 Types of social networks, nodes, edges
 Some statistics of the Web graph
 Power laws everywhere
 The Web is a bowtie


20080314 lecture

Power law, preferential attachment
 Connectivity server, link graph compression, reference encoding
 Power law degree distribution
 Rich gets richer and preferential attachment model; heuristic analysis
 Other properties not satisfied by preferential attachment model

12

20080318 lecture

Tweaks to preferential attachment; copying model, sampling Web pages
 Random links in addition to preferential attachment, heuristic analysis
 The copying model and its heuristic analysis
 Random walks to sample the Web: forward walk


20080319 rescheduled lecture

Sampling, PageRank, HITS and variants
 Bidirectional random walk
 Generating synthetic social networks: RMAT
 Basic PageRank
 Accelerating PageRank
 PageRank linearity and decomposition
 Asynchronous push algorithms

13

20080328 guest lecture

MapReduce (Amit Singh)
Amit's slides on MapReduce

14

20080401 lecture

HITS and PageRank; score and rank stability
 Basic HITS, bipartite reinforcement
 Score and rank stability of HITS
 Score and rank stability of PageRank


20080404 lecture

More HITS, PageRank, analysis and applications
 Complete PageRank stability analysis
 PHITS and SALSA
 Topicsensitive and personalized PageRank
 Using host graph info, page content and HTML layout, anchor text
 Applications: Page staleness, link spam, clickthrough mining

15

20080408 lecture

Bootstrapping knowledge bases from the unstructured Web
 Markov models
 Dependency parsing
 Hearst patterns
 DIPRE
 Snowball
 PMI
 KnowItAll
 TextRunner


20080409 extra lecture

Exploiting structure in text search
 Convergence of DB+XML and IR+Web search
 WHIRL
 XIRQL and ELIXIR
 XQuery, TeXQuery
 DBXplorer, BANKS, DISCOVER
 DBPedia


20080411 lecture

Crawling and monitoring the Web

17

20080424 final exam

Final exam A1Math 2:306pm
Paper with some solutions has been posted. Please check for bugs and report if you find any.
