This has been fossilized from Moodle on 2008/04/16. Many links may not work.

Week
Date/Title
Contents
1
2008-01-04 lecture
Course introduction and administrative issues

Earlier known by the mouthful "Information Retrieval and Mining for Hypertext and the Web". As far as the academic section is concerned I will never succeed in renaming the course, but hereafter, for our convenience, it is "Web Search and Mining".

Time and place
Slot 14, 5--6:25pm Tue+Fri, SIC301; first lecture on 2008/01/04
Teaching assistants
Rajiv Khanna, Dhaval Makawana
Prerequisites
Undergrad algebra, calculus, probability, statistics, data structures, algorithms, databases are mandatory. CS705 is strongly recommended.
Evaluation
  • Homeworks
  • Midterm exam
  • Final exam
Resources
  • Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Pearson Education.
  • Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. Morgan-Kaufman.
  • Manning, C., Raghavan, P., and Schutze, H. (2007). An Introduction to Information Retrieval. Cambridge University Press.
  • Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufman.
  • Grossman, D. A. and Frieder, O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer.
2
2008-01-08 lecture
Tokenization, term-document matrix, introduction to corpus models
  • Tokenization, compound words, stemming
  • Building a compound word dictionary
  • The term-document matrix
  • Generative models, perplexity, curse of dimensionality
  • Multivariate binary, multinomial, Poisson models McCallumN1998event.pdf
3
2008-01-15 lecture
Parametric and non-parametric document models, word burstiness

2008-01-18 lecture
Term-document random walk, translation models, latent semantic indexing
4
2008-01-22 lecture
Modeling topics and clusters in a corpus
  • LSI/SVD on almost low-rank matrices
  • Principal component analysis (PCA) and issues with text analysis
  • Modeling multiple topic clusters
  • (Single cause) mixture model and Expectation maximization
  • Limitations of single-cause mixture models
  • Multiple cause mixture models, aspect model Hofmann1999plsa.ps.gz

2008-01-25 lecture
Corpus models and clustering
  • Latent Dirichlet topic model BleiNJ2003lda.pdf
  • Clustering and corpus visualization techniques
  • k-Means and self-organizing maps
  • Multidimensional scaling
5
2008-01-29 lecture
Clustering; basic text indexing
  • Multidimensional scaling
  •  
  • Basic and positional inverted index construction

2008-01-30 extra lecture
Text indexing and Boolean search
  • Building the inverted index
  • Space taken by dictionary and posting lists
  • Compressing posting lists -- gamma code
  • Compressing the dictionary -- front coding

2008-02-01 lecture
Relevance ranking and the vector space model
  • Recall, precision
  • F1, break-even
  • Area under ROC curve
  • Pair preference violation
  • MRR, MAP, NDCG
  • Vector space model: term frequency, inverse document frequency
  • How to order and scan postings
  • Fagin's threshold algorithm and probabilistic top-k
6
2008-02-05 lecture
Jaccard similarity, shingling, min-wise hash functions
  • Find-similar search, avoiding the quadratic bottleneck
  • Jaccard similarity
  • Shingling
  • Pairwise and min-wise hash functions and families
  • Reducing random bits needed to sample permutations
  • Cosine similarity

2008-02-06 rescheduled lecture
Find-similar, near-neighbor search, locality sensitive hashing
  • Fast mirror site detection
  • Random projections and find-similar with cosine similarity
  • Locality sensitive hash functions

2008-02-08 lecture
Supervised document labeling
  • Labeling tasks: examples and motivation
  • Document representation for labeling --- are corpus models adequate?
  • Art of feature vector design
  • Probabilistic labeling
    • Joint models, naive Bayes
    • Conditional model, logistic regression
    • Entropy model
    • Topic hierarchies
  • Discriminative labeling
    • Support vector machines
    • Topic hierarchies
8
2008-02-23 midterm exam
Midterm exam 9:30--12:30 A1-Math
Paper with some solutions has been posted. Please check for bugs and report if you find any.
9
2008-02-26 lecture
Learning to rank from relevance judgments
  • Midterm exam review
  • Discussion of next homework
  • Linear scoring model
  • Training inputs: ordinal scores, pair preferences
10
2008-03-04 lecture
Learning to rank: itemwise, pairwise, listwise losses; McRank, RankSVM
  • Itemwise, pairwise and listwise training paradigms
  • Boosted regression trees (McRank)
  • RankSVM, pairwise preferences and AUC

2008-03-05 rescheduled lecture
StructSVM, SVMauc, RankNet, SoftRank
  • StructSVM basics
  • Near-linear time SVMauc
  • RankNet, SoftRank

2008-03-07 lecture
More max-margin listwise ranking
  • Feature maps for ranking
  • Pairwise feature map
  • SVMmap
  • SVMndcg, SVMmrr
11
2008-03-11 lecture
Ranking; Social networks
  • Types of social networks, nodes, edges
  • Some statistics of the Web graph
  • Power laws everywhere
  • The Web is a bow-tie

2008-03-14 lecture
Power law, preferential attachment
  • Connectivity server, link graph compression, reference encoding
  • Power law degree distribution
  • Rich gets richer and preferential attachment model; heuristic analysis
  • Other properties not satisfied by preferential attachment model
12
2008-03-18 lecture
Tweaks to preferential attachment; copying model, sampling Web pages
  • Random links in addition to preferential attachment, heuristic analysis
  • The copying model and its heuristic analysis
  • Random walks to sample the Web: forward walk

2008-03-19 rescheduled lecture
Sampling, PageRank, HITS and variants
  • Bidirectional random walk
  • Generating synthetic social networks: RMAT
  • Basic PageRank
  • Accelerating PageRank
  • PageRank linearity and decomposition
  • Asynchronous push algorithms
13
2008-03-28 guest lecture
MapReduce (Amit Singh)
Amit's slides on MapReduce
14
2008-04-01 lecture
HITS and PageRank; score and rank stability
  • Basic HITS, bipartite reinforcement
  • Score and rank stability of HITS
  • Score and rank stability of PageRank

2008-04-04 lecture
More HITS, PageRank, analysis and applications
  • Complete PageRank stability analysis
  • PHITS and SALSA
  • Topic-sensitive and personalized PageRank
  • Using host graph info, page content and HTML layout, anchor text
  • Applications: Page staleness, link spam, clickthrough mining
15
2008-04-08 lecture
Bootstrapping knowledge bases from the unstructured Web
  • Markov models
  • Dependency parsing
  • Hearst patterns
  • DIPRE
  • Snowball
  • PMI
  • KnowItAll
  • TextRunner

2008-04-09 extra lecture
Exploiting structure in text search
  • Convergence of DB+XML and IR+Web search
  • WHIRL
  • XIRQL and ELIXIR
  • XQuery, TeXQuery
  • DBXplorer, BANKS, DISCOVER
  • DBPedia

2008-04-11 lecture
Crawling and monitoring the Web
17
2008-04-24 final exam
Final exam A1-Math 2:30--6pm
Paper with some solutions has been posted. Please check for bugs and report if you find any.