CS635 Web Search and Mining A
Autumn 2010 Calendar

Date, slides	Summary, papers
2010/07/22 Introduction	Administrative info Course introduction and preview
2010/07/26	Class postponed.
2010/07/29	Tokenization, compound words, stemming Building a compound word dictionary The term-document matrix
2010/08/02	Generative models for the term-document matrix Multivariate binary, multinomial, Poisson models
2010/08/05	Word burstiness, non-parametric models Word burstiness, Dirichlet word generators Modeling multiple topic clusters in the term document matrix
2010/08/09	(Single cause) mixture model and expectation maximization Limitations of single-cause mixture models
2010/08/12	Multiple cause mixture models, aspect model Latent Dirichlet topic model
2010/08/16	Latent Dirichlet topic model , continued Coclustering and cross-associations
2010/08/19	SVD/LSI Nonnegative matrix factorization
2010/08/23	Nonnegative matrix factorization , continued Basic and positional inverted index construction Space taken by dictionary and posting lists Compressing posting lists -- gamma code, skip pointers How to evaluate a query plan
2010/08/26	Maintaining the inverted index as the corpus changes Indexing for zoned search, fields, word prefix match Compressing the dictionary -- front coding
2010/08/30	Compressing the dictionary, continued Recall, (interpolated) precision, F1, break-even MRR, MAP, NDCG Area under ROC curve, pair preference violation
2010/09/02	Scoring considerations: term frequency, collection/corpus frequency, document frequency, TFIDF Vector space model BM25
2010/09/04	Query processing in IR: How to order and scan postings Fagin's threshold algorithm and probabilistic top-k Index caching FKS perfect hashing
2010/09/06	Find-similar search, avoiding the quadratic bottleneck Jaccard similarity over shingles Pairwise and min-wise hash functions and families Reducing random bits needed to sample permutations
2010/09/09	Fast mirror site detection More about hashing Cosine similarity, random projections and find-similar Locality sensitive hashing
2010/09/20	Midterm exam 15:30--18:30.
2010/09/23	Bridging the syntax gap: no/poor word overlap between query and doc Cluster hypothesis in IR Pseudorelevance feedback (PRF) Text search as translation Translation models via term-document random walk
2010/09/23	Probabilistic models for information retrieval Latent semantic indexing (LSI/SVD) Applying PLSI/aspect model to search Clustering and embedding results, scatter-gather
2010/09/30	Clickthrough, abandonment, risk management Labeling tasks: examples and motivation Document representation for labeling --- are corpus models adequate? Joint models, naive Bayes Entropy model
2010/10/04	Conditional model, maximum entropy principle Art of feature vector design Maxent and logistic regression Discriminative labeling: support vector machines
2010/10/11	Multiclass and multilabel applications Dealing with topic hierarchies Learning to rank -- representation, training, testing Itemwise, pairwise, listwise training paradigms
2010/10/14 Learning to rank	Itemwise: Ordinal regression Pairwise: RankSVM and RankNet
2010/10/18 Learning to rank	Itermwise: McRank (Algorithm 6) Pairwise: RankBoost
2010/10/21 Learning to rank	Listwise: SoftRank The cutting plane recipe RankAUC
2010/10/25 Social networks	Listwise: SVMmap Listwise: DORM (for NDCG) SVMndcg, SVMcombo Intro to social networks Hyperlinks: the first online social network? And now, Wikipedia, Orkut, Flickr, LinkedIn, Facebook, Twitter, Blogspot, Wordpress, ... Social networks as entity-relationship graph data models
2010/10/28	Yahoo! talks in lieu of regular lecture.
2010/11/01	Power laws everywhere Diameter, hop-plots, bow-tie Generating realistic social networks Preferential attachment Winner does not take all Copying model
2010/11/04	Compression and storage of graphs Connectivity Server and WebGraph Centrality and prestige; eigenvalues Hitting time, commute time, escape probability Random walks as a way to sample the Web HenzingerHMN2000sample.ps
2010/11/08	PageRank Large-scale computation of PageRank (map-reduce?) Accelerating PageRank C-H Theorem PageRank linearity and decomposition Topic-sensitive and personalized PageRank Asynchronous push algorithms Page staleness, link spam, trust/mistrust
2010/11/11	Bipartite reinforcement: HITS Using host graph, page content, HTML layout, anchor text Score and rank stability of HITS and PageRank PHITS and SALSA, effect on stability Maxent view of PageRank Learning to rank in graph models NetRank
2010/11/13 Crawling	Basic crawler plumbing: DNS, frontier, work pool, concurrency, page storage Systems and scaling issues Frontier prioritization issues and considerations Focused crawling ChakrabartiDB1999f, Refreshing crawls
2010/11/27	Final exam, 14:30--18:30, SIC301.

CS635 Web Search and Mining A Autumn 2010 Calendar

CS635 Web Search and Mining A
Autumn 2010 Calendar