A Non-negative Matrix Tri-factorization Approach to Sentiment Classification with Lexical Prior Knowledge

Tao Li, Yi Zhang
School of Computer Science, Florida International University
{taoli,yzhan004}@cs.fiu.edu

Vikas Sindhwani
Mathematical Sciences, IBM T.J. Watson Research Center
vsindhw@us.ibm.com

Abstract

Sentiment classification refers to the task of automatically identifying whether a given piece of text expresses positive or negative opinion towards a subject at hand. The proliferation of user-generated web content such as blogs, discussion forums and online review sites has made it possible to perform large-scale mining of public opinion. Sentiment modeling is thus becoming a critical component of market intelligence and social media technologies that aim to tap into the collective wisdom of crowds. In this paper, we consider the problem of learning high-quality sentiment models with minimal manual supervision. We propose a novel approach to learn from lexical prior knowledge in the form of domain-independent sentiment-laden terms, in conjunction with domain-dependent unlabeled data and a few labeled documents. Our model is based on a constrained non-negative tri-factorization of the term-document matrix which can be implemented using simple update rules. Extensive experimental studies demonstrate the effectiveness of our approach on a variety of real-world sentiment prediction tasks.

1 Introduction

Web 2.0 platforms such as blogs, discussion forums and other such social media have now given a public voice to every consumer. Recent surveys have estimated that a massive number of internet users turn to such forums to collect recommendations for products and services, guiding their own choices and decisions by the opinions that other consumers have publicly expressed. Gleaning insights by monitoring and analyzing large amounts of such user-generated data is thus becoming a key competitive differentiator for many companies. While tracking brand perceptions in traditional media is hardly a new challenge, handling the unprecedented scale of unstructured user-generated web content requires new methodologies. These methodologies are likely to be rooted in natural language processing and machine learning techniques.

Automatically classifying the sentiment expressed in a blog around selected topics of interest is a canonical machine learning task in this discussion. A standard approach would be to manually label documents with their sentiment orientation and then apply off-the-shelf text classification techniques. However, sentiment is often conveyed with subtle linguistic mechanisms such as the use of sarcasm and highly domain-specific contextual cues. This makes manual annotation of sentiment time-consuming and error-prone, presenting a bottleneck in learning high-quality models. Moreover, products and services of current focus, and the associated communities of bloggers with their idiosyncratic expressions, may rapidly evolve over time, causing models to lose performance and become stale. This motivates the problem of learning robust sentiment models from minimal supervision.

In their seminal work, (Pang et al., 2002) demonstrated that supervised learning significantly outperformed a competing body of work where hand-crafted dictionaries are used to assign sentiment labels based on relative frequencies of positive and negative terms.
As observed by (Ng et al., 2006), most semi-automated dictionary-based approaches yield unsatisfactory lexicons, with either high coverage and low precision or vice versa. However, the treatment of such dictionaries as forms of prior knowledge that can be incorporated in machine learning models is a relatively less explored topic, even less so in conjunction with semi-supervised models that attempt to utilize unlabeled data. This is the focus of the current paper.

Our models are based on a constrained non-negative tri-factorization of the term-document matrix, which can be implemented using simple update rules. Treated as a set of labeled features, the sentiment lexicon is incorporated as one set of constraints that enforce domain-independent prior knowledge. A second set of constraints introduces domain-specific supervision via a few document labels. Together these constraints enable learning from partial supervision along both dimensions of the term-document matrix, in what may be viewed more broadly as a framework for incorporating dual supervision in matrix factorization models. We provide empirical comparisons with several competing methodologies on four very different domains: blogs discussing enterprise software products, political blogs discussing US presidential candidates, amazon.com product reviews and IMDB movie reviews. Results demonstrate the effectiveness and generality of our approach.

The rest of the paper is organized as follows. We begin by discussing related work in Section 2. Section 3 gives a quick background on Non-negative Matrix Tri-factorization models. In Section 4, we present a constrained model and computational algorithm for incorporating lexical knowledge in sentiment analysis. In Section 5, we enhance this model by introducing document labels as additional constraints. Section 6 presents an empirical study on four datasets. Finally, Section 7 concludes the paper.

2 Related Work

We point the reader to a recent book (Pang and Lee, 2008) for an in-depth survey of the literature on sentiment analysis. In this section, we briskly cover related work to position our contributions appropriately in the sentiment analysis and machine learning literature.

Methods focusing on the use and generation of dictionaries capturing the sentiment of words have ranged from manual approaches of developing domain-dependent lexicons (Das and Chen, 2001) to semi-automated approaches (Hu and Liu, 2004; Zhuang et al., 2006; Kim and Hovy, 2004), and even an almost fully automated approach (Turney, 2002). Most semi-automated approaches have met with limited success (Ng et al., 2006), and supervised learning models have tended to outperform dictionary-based classification schemes (Pang et al., 2002). A two-tier scheme (Pang and Lee, 2004), where sentences are first classified as subjective versus objective and the sentiment classifier is then applied only to the subjective sentences, further improves performance. Results in these papers also suggest that more sophisticated linguistic models, incorporating parts-of-speech and n-gram language models, do not improve over the simple unigram bag-of-words representation. In keeping with these findings, we also adopt a unigram text model. A subjectivity classification phase before our models are applied may further improve the results reported in this paper, but our focus is on driving the polarity prediction stage with minimal manual effort.
In this regard, our model brings two inter-related but distinct themes from machine learning to bear on this problem: semi-supervised learning and learning from labeled features. The goal of the former is to learn from a few labeled examples by making use of unlabeled data, while the goal of the latter is to utilize weak prior knowledge about term-class affinities (e.g., the term "awful" indicates negative sentiment and therefore may be considered a negatively labeled feature). Empirical results in this paper demonstrate that simultaneously pursuing both goals in a single model leads to improvements over models that focus on either goal alone.

(Goldberg and Zhu, 2006) adapt semi-supervised graph-based methods for sentiment analysis but do not incorporate lexical prior knowledge in the form of labeled features. Most work in the machine learning literature on utilizing labeled features has focused on using them to generate weakly labeled examples that are then used for standard supervised learning: (Schapire et al., 2002) propose one such framework for boosting logistic regression; (Wu and Srihari, 2004) build a modified SVM; and (Liu et al., 2004) use a combination of clustering and EM-based methods to instantiate similar frameworks. By contrast, we incorporate lexical knowledge directly as constraints on our matrix factorization model. In recent work, (Druck et al., 2008) constrain the predictions of a multinomial logistic regression model on unlabeled instances in a Generalized Expectation formulation for learning from labeled features. Unlike their approach, which uses only unlabeled instances, our method uses both labeled and unlabeled documents in conjunction with labeled and unlabeled words.

The matrix tri-factorization models explored in this paper are closely related to the models proposed recently in (Li et al., 2008; Sindhwani et al., 2008). Though their techniques for proving algorithm convergence and correctness can be readily adapted for our models, (Li et al., 2008) do not incorporate dual supervision as we do. On the other hand, while (Sindhwani et al., 2008) do incorporate dual supervision in a non-linear kernel-based setting, they do not enforce non-negativity or orthogonality, aspects of matrix factorization models that have shown benefits in prior empirical studies, see e.g., (Ding et al., 2006).

We also note the very recent work of (Sindhwani and Melville, 2008), which proposes a dual-supervision model for semi-supervised sentiment analysis. In this model, bipartite graph regularization is used to diffuse label information along both sides of the term-document matrix. Conceptually, their model implements a co-clustering assumption closely related to Singular Value Decomposition (see also (Dhillon, 2001; Zha et al., 2001) for more on this perspective), while our model is based on Non-negative Matrix Factorization. In another recent paper (Sandler et al., 2008), standard regularization models are constrained using graphs of word co-occurrences. These are very recently proposed competing methodologies, and we have not been able to address empirical comparisons with them in this paper.

Finally, recent efforts have also looked at transfer learning mechanisms for sentiment analysis, e.g., see (Blitzer et al., 2007). While our focus is on single-domain learning in this paper, we note that cross-domain variants of our model can also be developed orthogonally.
3 Background

3.1 Basic Matrix Factorization Model

Our proposed models are based on non-negative matrix tri-factorization (Ding et al., 2006). In these models, an $m \times n$ term-document matrix $X$ is approximated by three factors that specify soft membership of terms and documents in one of $k$ classes:

$$X \approx F S G^T, \qquad (1)$$

where $F$ is an $m \times k$ non-negative matrix representing knowledge in the word space, i.e., the $i$-th row of $F$ represents the posterior probability of word $i$ belonging to the $k$ classes; $G$ is an $n \times k$ non-negative matrix representing knowledge in the document space, i.e., the $i$-th row of $G$ represents the posterior probability of document $i$ belonging to the $k$ classes; and $S$ is a $k \times k$ non-negative matrix providing a condensed view of $X$.

The matrix factorization model is similar to probabilistic latent semantic indexing (PLSI) (Hofmann, 1999). In PLSI, $X$ is treated as the joint distribution between words and documents via the scaling $X \rightarrow \bar{X} = X / \sum_{ij} X_{ij}$ (thus $\sum_{ij} \bar{X}_{ij} = 1$). $\bar{X}$ is factorized as

$$\bar{X} \approx W S D^T, \quad \sum_k W_{ik} = 1, \quad \sum_k D_{jk} = 1, \quad \sum_k S_{kk} = 1, \qquad (2)$$

where $\bar{X}$ is the $m \times n$ word-document semantic matrix, $W$ gives the word class-conditional probabilities, $D$ gives the document class-conditional probabilities, and $S$ is the class probability distribution.

PLSI provides a simultaneous solution for the word and document class-conditional distributions. Our model provides a simultaneous solution for clustering the rows and the columns of $X$. To avoid ambiguity, the orthogonality conditions

$$F^T F = I, \qquad G^T G = I \qquad (3)$$

can be imposed to enforce that each row of $F$ and $G$ possesses only one nonzero entry. Approximating the term-document matrix with a tri-factorization while imposing non-negativity and orthogonality constraints gives a principled framework for simultaneously clustering the rows (words) and columns (documents) of $X$. In the context of co-clustering, these models return excellent empirical performance, see e.g., (Ding et al., 2006). Our goal now is to bias these models with constraints incorporating (a) labels of features (coming from a domain-independent sentiment lexicon), and (b) labels of documents for the purposes of domain-specific adaptation. These enhancements are addressed in Sections 4 and 5 respectively.

4 Incorporating Lexical Knowledge

We used a sentiment lexicon generated by the IBM India Research Labs that was developed for other text mining applications (Ramakrishnan et al., 2003). It contains 2,968 words that have been human-labeled as expressing positive or negative sentiment. In total, there are 1,267 positive (e.g., "great") and 1,701 negative (e.g., "bad") unique terms after stemming. We eliminated terms that were ambiguous and dependent on context, such as "dear" and "fine". It should be noted that this list was constructed without a specific domain in mind, which is further motivation for using training examples and unlabeled data to learn domain-specific connotations.

Lexical knowledge in the form of the polarity of terms in this lexicon can be introduced in the matrix factorization model. By partially specifying term polarities via $F$, the lexicon influences the sentiment predictions $G$ over documents.

4.1 Representing Knowledge in Word Space

Let $F_0$ represent prior knowledge about sentiment-laden words in the lexicon, i.e., if word $i$ is a positive word, $(F_0)_{i1} = 1$, while if it is negative, $(F_0)_{i2} = 1$. Note that one may also use soft sentiment polarities, though our experiments are conducted with hard assignments.
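To make this concrete, the following Python/NumPy sketch builds $F_0$, together with the diagonal indicator matrix $C_1$ used in the loss term introduced next, from a vocabulary list and a polarity lexicon. This is an illustrative construction under our assumptions (the function and variable names are hypothetical), not the authors' code.

```python
import numpy as np

def build_lexicon_prior(vocab, positive_words, negative_words):
    """Construct the prior matrix F0 (m x 2) and the diagonal indicator
    matrix C1 (m x m) from a polarity lexicon (hypothetical helper).

    vocab          : list of the m (stemmed) vocabulary terms
    positive_words : set of lexicon terms labeled positive
    negative_words : set of lexicon terms labeled negative
    """
    m = len(vocab)
    F0 = np.zeros((m, 2))      # column 0 <-> positive class, column 1 <-> negative class
    c1_diag = np.zeros(m)      # 1 where the polarity of the word is known
    for i, word in enumerate(vocab):
        if word in positive_words:
            F0[i, 0] = 1.0
            c1_diag[i] = 1.0
        elif word in negative_words:
            F0[i, 1] = 1.0
            c1_diag[i] = 1.0
    return F0, np.diag(c1_diag)   # (C1)_ii = 1 iff word i appears in the lexicon
```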
This information is incorporated in the tri-factorization model via a squared loss term,

$$\min_{F,G,S} \|X - F S G^T\|^2 + a\, \mathrm{Tr}\left[(F - F_0)^T C_1 (F - F_0)\right], \qquad (4)$$

where the notation $\mathrm{Tr}(A)$ denotes the trace of the matrix $A$. Here, $a > 0$ is a parameter which determines the extent to which we enforce $F \approx F_0$, and $C_1$ is an $m \times m$ diagonal matrix whose entry $(C_1)_{ii} = 1$ if the category of the $i$-th word is known (i.e., specified by the $i$-th row of $F_0$) and $(C_1)_{ii} = 0$ otherwise. The squared loss term ensures that the solution for $F$ in the otherwise unsupervised learning problem stays close to the prior knowledge $F_0$. Note that if $C_1 = I$, then we know the class orientation of all the words and thus have a full specification of $F_0$; Eq. (4) then reduces to

$$\min_{F,G,S} \|X - F S G^T\|^2 + a \|F - F_0\|^2. \qquad (5)$$

The above model is generic and allows certain flexibility. For example, in some cases our prior knowledge of $F_0$ is not very accurate and we use a smaller $a$, so that the final results do not depend on $F_0$ very much, i.e., the results are mostly unsupervised learning results. In addition, the introduction of $C_1$ allows us to incorporate partial knowledge of word polarity information.

4.2 Computational Algorithm

The optimization problem in Eq. (4) can be solved using the following update rules:

$$G_{jk} \leftarrow G_{jk} \frac{(X^T F S)_{jk}}{(G G^T X^T F S)_{jk}}, \qquad (6)$$

$$S_{ik} \leftarrow S_{ik} \frac{(F^T X G)_{ik}}{(F^T F S G^T G)_{ik}}, \qquad (7)$$

$$F_{ik} \leftarrow F_{ik} \frac{(X G S^T + a C_1 F_0)_{ik}}{(F F^T X G S^T + a C_1 F)_{ik}}. \qquad (8)$$

The algorithm consists of an iterative procedure using the above three rules until convergence. We call this approach Matrix Factorization with Lexical Knowledge (MFLK) and outline the precise steps below.

Algorithm 1 Matrix Factorization with Lexical Knowledge (MFLK)
begin
1. Initialization: $F = F_0$; initialize $G$ to K-means clustering results; $S = (F^T F)^{-1} F^T X G (G^T G)^{-1}$.
2. Iteration until convergence:
   Update $G$: fixing $F$, $S$, update $G$ by Eq. (6)
   Update $F$: fixing $S$, $G$, update $F$ by Eq. (8)
   Update $S$: fixing $F$, $G$, update $S$ by Eq. (7)
end

4.3 Algorithm Correctness and Convergence

Updating $F$, $G$, $S$ using the rules above leads to asymptotic convergence to a local minimum. This can be proved using arguments similar to (Ding et al., 2006). We outline the proof of correctness for updating $F$, since the squared loss term that involves $F$ is a new component in our models.

Theorem 1 The above iterative algorithm converges.

Theorem 2 At convergence, the solution satisfies the Karush-Kuhn-Tucker optimality condition, i.e., the algorithm converges correctly to a local optimum.

Theorem 1 can be proved using the standard auxiliary function approach used in (Lee and Seung, 2001).

Proof of Theorem 2. Following the theory of constrained optimization (Nocedal and Wright, 1999), we minimize the following function:

$$L(F) = \|X - F S G^T\|^2 + a\, \mathrm{Tr}\left[(F - F_0)^T C_1 (F - F_0)\right].$$

The gradient of $L$ is

$$\frac{\partial L}{\partial F} = -2 X G S^T + 2 F S G^T G S^T + 2 a C_1 (F - F_0). \qquad (9)$$

The KKT complementarity condition for the non-negativity of $F_{ik}$ gives

$$\left[-2 X G S^T + 2 F S G^T G S^T + 2 a C_1 (F - F_0)\right]_{ik} F_{ik} = 0. \qquad (10)$$

This is the fixed-point relation that local minima for $F$ must satisfy. Given an initial guess of $F$, the successive update of $F$ using Eq. (8) will converge to a local minimum. At convergence, we have

$$F_{ik} = F_{ik} \frac{(X G S^T + a C_1 F_0)_{ik}}{(F F^T X G S^T + a C_1 F)_{ik}},$$

which is equivalent to the KKT condition of Eq. (10). The correctness of the updating rules for $G$ in Eq. (6) and $S$ in Eq. (7) has been proved in (Ding et al., 2006). ∎

Note that we do not enforce exact orthogonality in our updating rules, since this often implies softer class assignments.
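For concreteness, the sketch below implements the multiplicative updates of Eqs. (6)-(8) with dense NumPy arrays. It is an illustrative reimplementation under stated assumptions, not the authors' code: $G$ is initialized with random non-negative values rather than the K-means results of Algorithm 1, a small constant is added to denominators to avoid division by zero, and a fixed iteration budget stands in for a convergence test.

```python
import numpy as np

def mflk(X, F0, C1, a=0.5, n_iter=100, eps=1e-9, seed=0):
    """Matrix Factorization with Lexical Knowledge (MFLK), Eqs. (6)-(8).

    X  : (m x n) non-negative term-document matrix
    F0 : (m x k) prior word-polarity matrix
    C1 : (m x m) diagonal indicator of lexicon words
    a  : weight enforcing F close to F0
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = F0.shape[1]                        # number of classes (2: positive/negative)
    F = F0 + 0.01 * rng.random((m, k))     # start near the lexicon prior
    G = rng.random((n, k))                 # Algorithm 1 uses K-means results here
    # S = (F^T F)^{-1} F^T X G (G^T G)^{-1}, clipped to stay non-negative
    S = np.abs(np.linalg.pinv(F.T @ F) @ F.T @ X @ G @ np.linalg.pinv(G.T @ G))

    for _ in range(n_iter):
        # Eq. (6): update G with F, S fixed
        G *= (X.T @ F @ S) / (G @ G.T @ X.T @ F @ S + eps)
        # Eq. (8): update F with S, G fixed; the lexicon prior enters here
        F *= (X @ G @ S.T + a * C1 @ F0) / (F @ F.T @ X @ G @ S.T + a * C1 @ F + eps)
        # Eq. (7): update S with F, G fixed
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G
```

Row $i$ of the returned $G$ can then be read as the (unnormalized) class membership of document $i$, and row $i$ of $F$ as that of word $i$.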
5 Semi-Supervised Learning With Lexical Knowledge

So far our models have made no demands on human effort, other than unsupervised collection of the term-document matrix and a one-time effort in compiling a domain-independent sentiment lexicon. We now assume that a few documents are manually labeled for the purposes of capturing some domain-specific connotations, leading to a more domain-adapted model. The partial labels on documents can be described using $G_0$, where $(G_0)_{i1} = 1$ if document $i$ expresses positive sentiment and $(G_0)_{i2} = 1$ if it expresses negative sentiment. As with $F_0$, one can also use soft sentiment labeling for documents, though our experiments are conducted with hard assignments.

Semi-supervised learning with lexical knowledge can therefore be described as

$$\min_{F,G,S} \|X - F S G^T\|^2 + a\, \mathrm{Tr}\left[(F - F_0)^T C_1 (F - F_0)\right] + b\, \mathrm{Tr}\left[(G - G_0)^T C_2 (G - G_0)\right],$$

where $a > 0$ and $b > 0$ are parameters which determine the extent to which we enforce $F \approx F_0$ and $G \approx G_0$ respectively, and $C_1$ and $C_2$ are diagonal matrices indicating the entries of $F_0$ and $G_0$ that correspond to labeled entities. The squared loss terms ensure that the solutions for $F$ and $G$ in the otherwise unsupervised learning problem stay close to the prior knowledge $F_0$ and $G_0$.

5.1 Computational Algorithm

The optimization problem above can be solved using the following update rules:

$$G_{jk} \leftarrow G_{jk} \frac{(X^T F S + b C_2 G_0)_{jk}}{(G G^T X^T F S + b G G^T C_2 G_0)_{jk}}, \qquad (11)$$

$$S_{ik} \leftarrow S_{ik} \frac{(F^T X G)_{ik}}{(F^T F S G^T G)_{ik}}, \qquad (12)$$

$$F_{ik} \leftarrow F_{ik} \frac{(X G S^T + a C_1 F_0)_{ik}}{(F F^T X G S^T + a C_1 F)_{ik}}. \qquad (13)$$

The algorithm for semi-supervised learning with lexical knowledge based on our matrix factorization framework, referred to as SSMFLK, thus consists of an iterative procedure using the above three rules until convergence. The correctness and convergence of the algorithm can be proved using arguments similar to those outlined for MFLK in Section 4.3.

A quick word about computational complexity. The term-document matrix is typically very sparse, with $z \ll nm$ non-zero entries, while $k$ is typically much smaller than $n$ and $m$. By using sparse matrix multiplications and avoiding dense intermediate matrices, the updates can be implemented very efficiently and easily. In particular, updating $F$, $S$, $G$ each takes $O(k^2(m + n) + kz)$ time per iteration, which scales linearly with the dimensions and density of the data matrix. Empirically, the number of iterations before practical convergence is usually very small (less than 100). Thus, computationally, our approach scales to large datasets even though our experiments are run on relatively small datasets.
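The SSMFLK updates differ from the MFLK sketch above only in the $G$ rule, where the document-label term $b\, C_2 G_0$ of Eq. (11) appears. Below is a minimal sketch under the same assumptions as before (dense NumPy, random initialization, fixed iteration budget); it is an illustration, not the authors' code.

```python
import numpy as np

def ssmflk(X, F0, C1, G0, C2, a=0.5, b=1.0, n_iter=100, eps=1e-9, seed=0):
    """Semi-supervised MFLK (SSMFLK), Eqs. (11)-(13).

    G0 : (n x k) prior document-label matrix (column 0 positive, column 1 negative)
    C2 : (n x n) diagonal indicator of labeled documents
    a, b : weights enforcing F close to F0 and G close to G0
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = F0.shape[1]
    F = F0 + 0.01 * rng.random((m, k))
    G = G0 + 0.01 * rng.random((n, k))
    S = np.abs(np.linalg.pinv(F.T @ F) @ F.T @ X @ G @ np.linalg.pinv(G.T @ G))

    for _ in range(n_iter):
        # Eq. (11): document labels enter the G update through the b*C2*G0 terms
        G *= (X.T @ F @ S + b * C2 @ G0) / (G @ G.T @ X.T @ F @ S + b * G @ G.T @ C2 @ G0 + eps)
        # Eq. (13): same F update as MFLK
        F *= (X @ G @ S.T + a * C1 @ F0) / (F @ F.T @ X @ G @ S.T + a * C1 @ F + eps)
        # Eq. (12): same S update as MFLK
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G
```

At convergence, sentiment predictions for documents can be read off as the arg max over the rows of $G$, and domain-adapted word polarities from the rows of $F$.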
6 Experiments

6.1 Datasets Description

Four different datasets are used in our experiments.

Movies Reviews: This is a popular dataset in the sentiment analysis literature (Pang et al., 2002). It consists of 1000 positive and 1000 negative movie reviews drawn from the IMDB archive of the rec.arts.movies.reviews newsgroups.

Lotus blogs: This dataset is targeted at detecting sentiment around enterprise software, specifically pertaining to the IBM Lotus brand (Sindhwani and Melville, 2008). An unlabeled set of blog posts was created by randomly sampling 2000 posts from a universe of 14,258 blogs that discuss issues relevant to Lotus software. In addition to this unlabeled set, 145 posts were chosen for manual labeling. These posts came from 14 individual blogs, 4 of which are actively posting negative content on the brand, with the rest tending to write more positive or neutral posts. The data was collected by downloading the latest posts from each blogger's RSS feeds, or by accessing the blog's archives. Manual labeling resulted in 34 positive and 111 negative examples.

Political candidate blogs: For our second blog domain, we used data gathered from 16,742 political blogs, which contain over 500,000 posts. As with the Lotus dataset, an unlabeled set was created by randomly sampling 2000 posts. 107 posts were chosen for labeling. A post was labeled as having positive or negative sentiment about a specific candidate (Barack Obama or Hillary Clinton) if it explicitly mentioned the candidate in positive or negative terms. This resulted in 49 positively and 58 negatively labeled posts.

Amazon Reviews: This dataset contains product reviews taken from Amazon.com for 4 product types: Kitchen, Books, DVDs, and Electronics (Blitzer et al., 2007). It contains about 4000 positive reviews and 4000 negative reviews and can be obtained from http://www.cis.upenn.edu/~mdredze/datasets/sentiment/.

For all datasets, we picked the 5000 words with highest document frequency to generate the vocabulary. Stopwords were removed and a normalized term-frequency representation was used. Genuinely unlabeled posts for Political and Lotus were used for the semi-supervised learning experiments in Section 6.3; they were not used in Section 6.2 on the effect of lexical prior knowledge. In the experiments, we set $a$, the parameter determining the extent to which to enforce the feature labels, to 1/2, and $b$, the corresponding parameter for enforcing document labels, to 1.

6.2 Sentiment Analysis with Lexical Knowledge

Of course, one can remove all burden on human effort by simply using unsupervised techniques. Our interest in the first set of experiments is to explore the benefits of incorporating a sentiment lexicon over unsupervised approaches. Does a one-time effort in compiling a domain-independent dictionary and using it for different sentiment tasks pay off in comparison to simply using unsupervised methods? In our case, matrix tri-factorization and other co-clustering methods form the obvious unsupervised baseline for comparison, and so we start by comparing our method (MFLK) with the following methods:

* Four document clustering methods: K-means, Tri-Factor Non-negative Matrix Factorization (TNMF) (Ding et al., 2006), Information-Theoretic Co-clustering (ITCC) (Dhillon et al., 2003), and Euclidean Co-clustering (ECC) (Cho et al., 2004). These methods do not make use of the sentiment lexicon.

* Feature Centroid (FC): This is a simple dictionary-based baseline method. Recall that each word can be expressed as a "bag-of-documents" vector. In this approach, we compute the centroids of these vectors, one corresponding to positive words and another corresponding to negative words. This yields a two-dimensional representation for documents, on which we then perform K-means clustering (a minimal sketch of this baseline is given below).
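The sketch below illustrates the FC baseline just described (our own illustrative implementation; the function and variable names are hypothetical). Each word is taken as a row of the term-document matrix $X$, i.e., a "bag-of-documents" vector.

```python
import numpy as np

def feature_centroid_representation(X, pos_idx, neg_idx):
    """Feature Centroid (FC) baseline representation.

    X       : (m x n) term-document matrix; row i is the bag-of-documents vector of word i
    pos_idx : indices of vocabulary words labeled positive in the lexicon
    neg_idx : indices of vocabulary words labeled negative in the lexicon
    Returns an (n x 2) document representation to be clustered with K-means.
    """
    pos_centroid = X[pos_idx].mean(axis=0)   # length-n centroid of the positive word vectors
    neg_centroid = X[neg_idx].mean(axis=0)   # length-n centroid of the negative word vectors
    return np.vstack([pos_centroid, neg_centroid]).T

# Example usage (hypothetical):
# from sklearn.cluster import KMeans
# docs_2d = feature_centroid_representation(X, pos_idx, neg_idx)
# clusters = KMeans(n_clusters=2, n_init=10).fit_predict(docs_2d)
```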
Performance Comparison: Figure 1 shows the experimental results on the four datasets using accuracy as the performance measure. The results are obtained by averaging 20 runs. It can be observed that our MFLK method can effectively utilize the lexical knowledge to improve the quality of sentiment prediction.

[Figure 1: Accuracy results on the four datasets (Movies, Lotus, Political, Amazon) for MFLK, FC, TNMF, ECC, ITCC and K-means.]

Size of Sentiment Lexicon: We also investigate the effect of the size of the sentiment lexicon on the performance of our model. Figure 2 shows results with random subsets of the lexicon of increasing size. We observe that performance generally increases as more and more lexical supervision is provided.

[Figure 2: MFLK accuracy as the size of the sentiment lexicon (i.e., the number of words in the lexicon) increases, on the four datasets.]

Robustness to Vocabulary Size: High dimensionality and noise can have a profound impact on the comparative performance of clustering and semi-supervised learning algorithms. We simulate scenarios with different vocabulary sizes by selecting words based on information gain. It should, however, be kept in mind that in a truly unsupervised setting document labels are unavailable and therefore information gain cannot be practically computed. Figure 3 and Figure 4 show results for the Lotus and Amazon datasets respectively and are representative of performance on the other datasets. MFLK tends to retain its position as the best performing method even at different vocabulary sizes. ITCC performance is also noteworthy given that it is a completely unsupervised method.

[Figure 3: Accuracy results on the Lotus dataset with increasing vocabulary size.]

[Figure 4: Accuracy results on the Amazon dataset with increasing vocabulary size.]

6.3 Sentiment Analysis with Dual Supervision

We now assume that, together with labeled features from the sentiment lexicon, we also have access to a few labeled documents. The natural question is whether the presence of lexical constraints leads to better semi-supervised models. In this section, we compare our method (SSMFLK) with the following three semi-supervised approaches: (1) the algorithm proposed in (Zhou et al., 2003), which conducts semi-supervised learning with local and global consistency (Consistency Method); (2) Zhu et al.'s harmonic Gaussian field method coupled with Class Mass Normalization (Harmonic-CMN) (Zhu et al., 2003); and (3) the Green's function learning algorithm (Green's Function) proposed in (Ding et al., 2007).

We also compare the results of SSMFLK with those of two supervised classification methods: Support Vector Machines (SVM) and Naive Bayes. Both of these methods have been widely used in sentiment analysis. In particular, the use of SVMs in (Pang et al., 2002) initially sparked interest in using machine learning methods for sentiment classification. Note that none of these competing methods utilizes lexical knowledge.

The results are presented in Figure 5, Figure 6, Figure 7, and Figure 8. We note that our SSMFLK method either outperforms all other methods over the entire range of the number of labeled documents (Movies, Political), or ultimately outpaces the other methods (Lotus, Amazon) as a few document labels come in.

[Figure 5: Accuracy results with increasing number of labeled documents on the Movies dataset.]

[Figure 6: Accuracy results with increasing number of labeled documents on the Lotus dataset.]

[Figure 7: Accuracy results with increasing number of labeled documents on the Political dataset.]

[Figure 8: Accuracy results with increasing number of labeled documents on the Amazon dataset.]

Learning Domain-Specific Connotations: In our first set of experiments, we incorporated the sentiment lexicon in our models and learnt the sentiment orientation of words and documents via the F and G factors respectively.
In the second set of experiments, we additionally introduced labeled documents for domain-specific adjustments. Between these two sets of experiments, we can now look for words that switch sentiment polarity. These words are interesting because their domain-specific connotation differs from their lexical orientation. For Amazon reviews, the following words switched polarity from positive to negative: fan, important, learning, cons, fast, feature, happy, memory, portable, simple, small, work, while the following words switched polarity from negative to positive: address, finish, lack, mean, budget, rent, throw. Note that words like fan and memory probably refer to products or product components (i.e., a computer fan and memory) in the Amazon review context, but have a very different connotation, say, in the context of movie reviews, where they probably refer to movie fanfare and memorable performances. We were surprised to see happy switch polarity! Two examples of its negative-sentiment usage are: "I ended up buying a Samsung and I couldn't be more happy" and "BORING, not one single exciting thing about this book. I was happy when my lunch break ended so I could go back to work and stop reading."

7 Conclusion

The primary contribution of this paper is to propose and benchmark new methodologies for sentiment analysis. Non-negative Matrix Factorizations constitute a rich body of algorithms that have found applicability in a variety of machine learning applications, from recommender systems to document clustering. We have shown how to build effective sentiment models by appropriately constraining the factors using lexical prior knowledge and document annotations. To more effectively utilize unlabeled data and induce domain-specific adaptation of our models, several extensions are possible: facilitating learning from related domains, incorporating hyperlinks between documents, incorporating synonyms or co-occurrences between words, etc. As a topic of vigorous current activity, there are several very recently proposed competing methodologies for sentiment analysis that we would like to benchmark against. These are topics for future work.

Acknowledgement: The work of T. Li is partially supported by NSF grants DMS-0844513 and CCF-0830659.
We would also like to thank Prem Melville and Richard Lawrence for their support.

References

J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL, pages 440-447.

H. Cho, I. Dhillon, Y. Guan, and S. Sra. 2004. Minimum sum squared residue co-clustering of gene expression data. In Proceedings of the 4th SIAM Data Mining Conference, pages 22-24, April.

S. Das and M. Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the 8th Asia Pacific Finance Association (APFA).

I. S. Dhillon, S. Mallela, and D. S. Modha. 2003. Information-theoretic co-clustering. In Proceedings of ACM SIGKDD, pages 89-98.

I. S. Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of ACM SIGKDD.

C. Ding, T. Li, W. Peng, and H. Park. 2006. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of ACM SIGKDD, pages 126-135.

C. Ding, R. Jin, T. Li, and H. D. Simon. 2007. A learning framework using Green's function and kernel regularization with application to recommender system. In Proceedings of ACM SIGKDD, pages 260-269.

G. Druck, G. Mann, and A. McCallum. 2008. Learning from labeled features using generalized expectation criteria. In SIGIR.

A. Goldberg and X. Zhu. 2006. Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization. In HLT-NAACL 2006: Workshop on Textgraphs.

T. Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of SIGIR, pages 50-57.

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In KDD, pages 168-177.

S.-M. Kim and E. Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the International Conference on Computational Linguistics.

D. D. Lee and H. S. Seung. 2001. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13.

T. Li, C. Ding, Y. Zhang, and B. Shao. 2008. Knowledge transformation from word space to document space. In Proceedings of SIGIR, pages 187-194.

B. Liu, X. Li, W. S. Lee, and P. Yu. 2004. Text classification by labeling words. In AAAI.

V. Ng, S. Dasgupta, and S. M. Niaz Arifin. 2006. Examining the role of linguistic knowledge sources in the automatic identification and classification of reviews. In COLING & ACL.

J. Nocedal and S. J. Wright. 1999. Numerical Optimization. Springer-Verlag.

B. Pang and L. Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL.

B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135. http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html

B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In EMNLP.

G. Ramakrishnan, A. Jadhav, A. Joshi, S. Chakrabarti, and P. Bhattacharyya. 2003. Question answering via Bayesian inference on lexical relations. In ACL, pages 1-10.

T. Sandler, J. Blitzer, P. Talukdar, and L. Ungar. 2008. Regularized learning with networks of features. In NIPS.

R. E. Schapire, M. Rochery, M. G. Rahim, and N. Gupta. 2002. Incorporating prior knowledge into boosting. In ICML.

V. Sindhwani and P. Melville. 2008. Document-word co-regularization for semi-supervised sentiment analysis. In Proceedings of IEEE ICDM.
V. Sindhwani, J. Hu, and A. Mojsilovic. 2008. Regularized co-clustering with dual supervision. In Proceedings of NIPS.

P. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417-424.

X. Wu and R. Srihari. 2004. Incorporating prior knowledge with weighted margin support vector machines. In KDD.

H. Zha, X. He, C. Ding, M. Gu, and H. D. Simon. 2001. Bipartite graph partitioning and data clustering. In Proceedings of ACM CIKM.

D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. 2003. Learning with local and global consistency. In Proceedings of NIPS.

X. Zhu, Z. Ghahramani, and J. Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of ICML.

L. Zhuang, F. Jing, and X. Zhu. 2006. Movie review mining and summarization. In CIKM, pages 43-50.