Week
|
Date/Title
|
Contents
|
1
|
2008-01-04 lecture
|
Course introduction and administrative issues
Earlier known by the mouthful "Information Retrieval and Mining for
Hypertext and the Web". As far as the academic section is concerned I
will never succeed in renaming the course, but hereafter, for our
convenience, it is "Web Search and Mining".
- Time and place
- Slot 14, 5--6:25pm Tue+Fri, SIC301; first lecture on 2008/01/04
- Teaching assistants
- Rajiv Khanna, Dhaval Makawana
- Prerequisites
- Undergrad
algebra, calculus, probability, statistics, data structures,
algorithms, databases are mandatory. CS705 is strongly recommended.
- Evaluation
-
- Homeworks
- Midterm exam
- Final exam
Resources
- Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Pearson Education.
- Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. Morgan-Kaufman.
- Manning, C., Raghavan, P., and Schutze, H. (2007). An Introduction to Information Retrieval. Cambridge University Press.
- Witten,
I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes:
Compressing and Indexing Documents and Images. Morgan-Kaufman.
- Grossman, D. A. and Frieder, O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer.
|
2
|
2008-01-08 lecture
|
Tokenization, term-document matrix, introduction to corpus models
- Tokenization, compound words, stemming
- Building a compound word dictionary
- The term-document matrix
- Generative models, perplexity, curse of dimensionality
- Multivariate binary, multinomial, Poisson models
McCallumN1998event.pdf
|
3
|
2008-01-15 lecture
|
Parametric and non-parametric document models, word burstiness
|
|
2008-01-18 lecture
|
Term-document random walk, translation models, latent semantic indexing
|
4
|
2008-01-22 lecture
|
Modeling topics and clusters in a corpus
- LSI/SVD on almost low-rank matrices
- Principal component analysis (PCA) and issues with text analysis
- Modeling multiple topic clusters
- (Single cause) mixture model and Expectation maximization
- Limitations of single-cause mixture models
- Multiple cause mixture models, aspect model Hofmann1999plsa.ps.gz
|
|
2008-01-25 lecture
|
Corpus models and clustering
- Latent Dirichlet topic model
BleiNJ2003lda.pdf
- Clustering and corpus visualization techniques
- k-Means and self-organizing maps
- Multidimensional scaling
|
5
|
2008-01-29 lecture
|
Clustering; basic text indexing
- Multidimensional scaling
-
- Basic and positional inverted index construction
|
|
2008-01-30 extra lecture
|
Text indexing and Boolean search
- Building the inverted index
- Space taken by dictionary and posting lists
- Compressing posting lists -- gamma code
- Compressing the dictionary -- front coding
|
|
2008-02-01 lecture
|
Relevance ranking and the vector space model
- Recall, precision
- F1, break-even
- Area under ROC curve
- Pair preference violation
- MRR, MAP, NDCG
- Vector space model: term frequency, inverse document frequency
- How to order and scan postings
- Fagin's threshold algorithm and probabilistic top-k
|
6
|
2008-02-05 lecture
|
Jaccard similarity, shingling, min-wise hash functions
- Find-similar search, avoiding the quadratic bottleneck
- Jaccard similarity
- Shingling
- Pairwise and min-wise hash functions and families
- Reducing random bits needed to sample permutations
- Cosine similarity
|
|
2008-02-06 rescheduled lecture
|
Find-similar, near-neighbor search, locality sensitive hashing
- Fast mirror site detection
- Random projections and find-similar with cosine similarity
- Locality sensitive hash functions
|
|
2008-02-08 lecture
|
Supervised document labeling
- Labeling tasks: examples and motivation
- Document representation for labeling --- are corpus models adequate?
- Art of feature vector design
- Probabilistic labeling
- Joint models, naive Bayes
- Conditional model, logistic regression
- Entropy model
- Topic hierarchies
- Discriminative labeling
- Support vector machines
- Topic hierarchies
|
8
|
2008-02-23 midterm exam
|
Midterm exam 9:30--12:30 A1-Math
Paper with some solutions has been posted. Please check for bugs and report if you find any.
|
9
|
2008-02-26 lecture
|
Learning to rank from relevance judgments
- Midterm exam review
- Discussion of next homework
- Linear scoring model
- Training inputs: ordinal scores, pair preferences
|
10
|
2008-03-04 lecture
|
Learning to rank: itemwise, pairwise, listwise losses; McRank, RankSVM
- Itemwise, pairwise and listwise training paradigms
- Boosted regression trees (McRank)
- RankSVM, pairwise preferences and AUC
|
|
2008-03-05 rescheduled lecture
|
StructSVM, SVMauc, RankNet, SoftRank
- StructSVM basics
- Near-linear time SVMauc
- RankNet, SoftRank
|
|
2008-03-07 lecture
|
More max-margin listwise ranking
- Feature maps for ranking
- Pairwise feature map
- SVMmap
- SVMndcg, SVMmrr
|
11
|
2008-03-11 lecture
|
Ranking; Social networks
- Types of social networks, nodes, edges
- Some statistics of the Web graph
- Power laws everywhere
- The Web is a bow-tie
|
|
2008-03-14 lecture
|
Power law, preferential attachment
- Connectivity server, link graph compression, reference encoding
- Power law degree distribution
- Rich gets richer and preferential attachment model; heuristic analysis
- Other properties not satisfied by preferential attachment model
|
12
|
2008-03-18 lecture
|
Tweaks to preferential attachment; copying model, sampling Web pages
- Random links in addition to preferential attachment, heuristic analysis
- The copying model and its heuristic analysis
- Random walks to sample the Web: forward walk
|
|
2008-03-19 rescheduled lecture
|
Sampling, PageRank, HITS and variants
- Bidirectional random walk
- Generating synthetic social networks: RMAT
- Basic PageRank
- Accelerating PageRank
- PageRank linearity and decomposition
- Asynchronous push algorithms
|
13
|
2008-03-28 guest lecture
|
MapReduce (Amit Singh)
Amit's slides on MapReduce
|
14
|
2008-04-01 lecture
|
HITS and PageRank; score and rank stability
- Basic HITS, bipartite reinforcement
- Score and rank stability of HITS
- Score and rank stability of PageRank
|
|
2008-04-04 lecture
|
More HITS, PageRank, analysis and applications
- Complete PageRank stability analysis
- PHITS and SALSA
- Topic-sensitive and personalized PageRank
- Using host graph info, page content and HTML layout, anchor text
- Applications: Page staleness, link spam, clickthrough mining
|
15
|
2008-04-08 lecture
|
Bootstrapping knowledge bases from the unstructured Web
- Markov models
- Dependency parsing
- Hearst patterns
- DIPRE
- Snowball
- PMI
- KnowItAll
- TextRunner
|
|
2008-04-09 extra lecture
|
Exploiting structure in text search
- Convergence of DB+XML and IR+Web search
- WHIRL
- XIRQL and ELIXIR
- XQuery, TeXQuery
- DBXplorer, BANKS, DISCOVER
- DBPedia
|
|
2008-04-11 lecture
|
Crawling and monitoring the Web
|
17
|
2008-04-24 final exam
|
Final exam A1-Math 2:30--6pm
Paper with some solutions has been posted. Please check for bugs and report if you find any.
|