A Non-negative Matrix Tri-factorization Approach to Sentiment Classification with Lexical Prior Knowledge

Tao Li, Yi Zhang
School of Computer Science, Florida International University
{taoli,yzhan004}@cs.fiu.edu

Vikas Sindhwani
Mathematical Sciences, IBM T.J. Watson Research Center
vsindhw@us.ibm.com

Abstract

Sentiment classification refers to the task of automatically identifying whether a given piece of text expresses positive or negative opinion towards a subject at hand. The proliferation of user-generated web content such as blogs, discussion forums and online review sites has made it possible to perform large-scale mining of public opinion. Sentiment modeling is thus becoming a critical component of market intelligence and social media technologies that aim to tap into the collective wisdom of crowds. In this paper, we consider the problem of learning high-quality sentiment models with minimal manual supervision. We propose a novel approach to learn from lexical prior knowledge in the form of domain-independent sentiment-laden terms, in conjunction with domain-dependent unlabeled data and a few labeled documents. Our model is based on a constrained non-negative tri-factorization of the term-document matrix which can be implemented using simple update rules. Extensive experimental studies demonstrate the effectiveness of our approach on a variety of real-world sentiment prediction tasks.

1 Introduction

Web 2.0 platforms such as blogs, discussion forums and other such social media have now given a public voice to every consumer. Recent surveys have estimated that a massive number of internet users turn to such forums to collect recommendations for products and services, guiding their own choices and decisions by the opinions that other consumers have publicly expressed. Gleaning insights by monitoring and analyzing large amounts of such user-generated data is thus becoming a key competitive differentiator for many companies. While tracking brand perceptions in traditional media is hardly a new challenge, handling the unprecedented scale of unstructured user-generated web content requires new methodologies. These methodologies are likely to be rooted in natural language processing and machine learning techniques.

Automatically classifying the sentiment expressed in a blog around selected topics of interest is a canonical machine learning task in this discussion. A standard approach would be to manually label documents with their sentiment orientation and then apply off-the-shelf text classification techniques. However, sentiment is often conveyed with subtle linguistic mechanisms such as the use of sarcasm and highly domain-specific contextual cues. This makes manual annotation of sentiment time-consuming and error-prone, presenting a bottleneck in learning high-quality models. Moreover, products and services of current focus, and the associated communities of bloggers with their idiosyncratic expressions, may rapidly evolve over time, causing models to lose performance and become stale. This motivates the problem of learning robust sentiment models from minimal supervision.

In their seminal work, (Pang et al., 2002) demonstrated that supervised learning significantly outperformed a competing body of work where hand-crafted dictionaries are used to assign sentiment labels based on relative frequencies of positive and negative terms.
As observed by (Ng et al., 2006), most semi-automated dictionary-based approaches yield unsatisfactory lexicons, with either high coverage and low precision or vice versa. However, the treatment of such dictionaries as forms of prior knowledge that can be incorporated in machine learning models is a relatively less explored topic, even less so in conjunction with semi-supervised models that attempt to utilize unlabeled data. This is the focus of the current paper.

Our models are based on a constrained non-negative tri-factorization of the term-document matrix, which can be implemented using simple update rules. Treated as a set of labeled features, the sentiment lexicon is incorporated as one set of constraints that enforce domain-independent prior knowledge. A second set of constraints introduces domain-specific supervision via a few document labels. Together these constraints enable learning from partial supervision along both dimensions of the term-document matrix, in what may be viewed more broadly as a framework for incorporating dual supervision in matrix factorization models. We provide empirical comparisons with several competing methodologies on four very different domains: blogs discussing enterprise software products, political blogs discussing US presidential candidates, amazon.com product reviews and IMDB movie reviews. Results demonstrate the effectiveness and generality of our approach.

The rest of the paper is organized as follows. We begin by discussing related work in Section 2. Section 3 gives a quick background on Non-negative Matrix Tri-factorization models. In Section 4, we present a constrained model and computational algorithm for incorporating lexical knowledge in sentiment analysis. In Section 5, we enhance this model by introducing document labels as additional constraints. Section 6 presents an empirical study on four datasets. Finally, Section 7 concludes the paper.

2 Related Work

We point the reader to a recent book (Pang and Lee, 2008) for an in-depth survey of the literature on sentiment analysis. In this section, we briskly cover related work to position our contributions appropriately in the sentiment analysis and machine learning literature.

Methods focusing on the use and generation of dictionaries capturing the sentiment of words have ranged from manual approaches of developing domain-dependent lexicons (Das and Chen, 2001) to semi-automated approaches (Hu and Liu, 2004; Zhuang et al., 2006; Kim and Hovy, 2004), and even an almost fully automated approach (Turney, 2002). Most semi-automated approaches have met with limited success (Ng et al., 2006), and supervised learning models have tended to outperform dictionary-based classification schemes (Pang et al., 2002). A two-tier scheme (Pang and Lee, 2004), where sentences are first classified as subjective versus objective and the sentiment classifier is then applied only to the subjective sentences, further improves performance. Results in these papers also suggest that more sophisticated linguistic models, incorporating parts-of-speech and n-gram language models, do not improve over the simple unigram bag-of-words representation. In keeping with these findings, we also adopt a unigram text model. A subjectivity classification phase before our models are applied may further improve the results reported in this paper, but our focus is on driving the polarity prediction stage with minimal manual effort.
In this regard, our model brings two inter-related but distinct themes from machine learning to bear on this problem: semi-supervised learning and learning from labeled features. The goal of the former is to learn from a few labeled examples by making use of unlabeled data, while the goal of the latter is to utilize weak prior knowledge about term-class affinities (e.g., the term "awful" indicates negative sentiment and therefore may be considered a negatively labeled feature). Empirical results in this paper demonstrate that simultaneously pursuing both goals in a single model leads to improvements over models that focus on either goal alone.

(Goldberg and Zhu, 2006) adapt semi-supervised graph-based methods for sentiment analysis but do not incorporate lexical prior knowledge in the form of labeled features. Most work in the machine learning literature on utilizing labeled features has focused on using them to generate weakly labeled examples that are then used for standard supervised learning: (Schapire et al., 2002) propose one such framework for boosting logistic regression; (Wu and Srihari, 2004) build a modified SVM; and (Liu et al., 2004) use a combination of clustering and EM-based methods to instantiate similar frameworks. By contrast, we incorporate lexical knowledge directly as constraints on our matrix factorization model. In recent work, (Druck et al., 2008) constrain the predictions of a multinomial logistic regression model on unlabeled instances in a Generalized Expectation formulation for learning from labeled features. Unlike their approach, which uses only unlabeled instances, our method uses both labeled and unlabeled documents in conjunction with labeled and unlabeled words.

The matrix tri-factorization models explored in this paper are closely related to the models proposed recently in (Li et al., 2008; Sindhwani et al., 2008). Though their techniques for proving algorithm convergence and correctness can be readily adapted for our models, (Li et al., 2008) do not incorporate dual supervision as we do. On the other hand, while (Sindhwani et al., 2008) do incorporate dual supervision in a non-linear kernel-based setting, they do not enforce non-negativity or orthogonality, aspects of matrix factorization models that have shown benefits in prior empirical studies, see e.g., (Ding et al., 2006).

We also note the very recent work of (Sindhwani and Melville, 2008), which proposes a dual-supervision model for semi-supervised sentiment analysis. In this model, bipartite graph regularization is used to diffuse label information along both sides of the term-document matrix. Conceptually, their model implements a co-clustering assumption closely related to Singular Value Decomposition (see also (Dhillon, 2001; Zha et al., 2001) for more on this perspective), while our model is based on Non-negative Matrix Factorization. In another recent paper (Sandler et al., 2008), standard regularization models are constrained using graphs of word co-occurrences. These are very recently proposed competing methodologies, and we have not been able to address empirical comparisons with them in this paper.

Finally, recent efforts have also looked at transfer learning mechanisms for sentiment analysis, e.g., see (Blitzer et al., 2007). While our focus is on single-domain learning in this paper, we note that cross-domain variants of our model can also be developed orthogonally.
3 Background

3.1 Basic Matrix Factorization Model

Our proposed models are based on non-negative matrix tri-factorization (Ding et al., 2006). In these models, an $m \times n$ term-document matrix $X$ is approximated by three factors that specify soft membership of terms and documents in one of $k$ classes:

$$X \approx F S G^T, \qquad (1)$$

where $F$ is an $m \times k$ non-negative matrix representing knowledge in the word space, i.e., the $i$-th row of $F$ represents the posterior probability of word $i$ belonging to the $k$ classes; $G$ is an $n \times k$ non-negative matrix representing knowledge in the document space, i.e., the $i$-th row of $G$ represents the posterior probability of document $i$ belonging to the $k$ classes; and $S$ is a $k \times k$ non-negative matrix providing a condensed view of $X$.

The matrix factorization model is similar to probabilistic latent semantic indexing (PLSI) (Hofmann, 1999). In PLSI, $X$ is treated as the joint distribution between words and documents via the scaling $X \rightarrow \bar{X} = X / \sum_{ij} X_{ij}$ (thus $\sum_{ij} \bar{X}_{ij} = 1$). $\bar{X}$ is factorized as

$$\bar{X} \approx W S D^T, \quad \sum_k W_{ik} = 1, \quad \sum_k D_{jk} = 1, \quad \sum_k S_{kk} = 1, \qquad (2)$$

where $\bar{X}$ is the $m \times n$ word-document semantic matrix, $W$ gives the word class-conditional probabilities, $D$ gives the document class-conditional probabilities, and $S$ is the class probability distribution.

PLSI provides a simultaneous solution for the word and document class-conditional distributions. Our model provides a simultaneous solution for clustering the rows and the columns of $X$. To avoid ambiguity, the orthogonality conditions

$$F^T F = I, \qquad G^T G = I \qquad (3)$$

can be imposed to enforce that each row of $F$ and $G$ possesses only one nonzero entry. Approximating the term-document matrix with a tri-factorization while imposing non-negativity and orthogonality constraints gives a principled framework for simultaneously clustering the rows (words) and columns (documents) of $X$. In the context of co-clustering, these models return excellent empirical performance, see e.g., (Ding et al., 2006). Our goal now is to bias these models with constraints incorporating (a) labels of features (coming from a domain-independent sentiment lexicon), and (b) labels of documents for the purposes of domain-specific adaptation. These enhancements are addressed in Sections 4 and 5 respectively.

4 Incorporating Lexical Knowledge

We used a sentiment lexicon generated by the IBM India Research Labs that was developed for other text mining applications (Ramakrishnan et al., 2003). It contains 2,968 words that have been human-labeled as expressing positive or negative sentiment. In total, there are 1,267 positive (e.g., "great") and 1,701 negative (e.g., "bad") unique terms after stemming. We eliminated terms that were ambiguous and dependent on context, such as "dear" and "fine". It should be noted that this list was constructed without a specific domain in mind, which is further motivation for using training examples and unlabeled data to learn domain-specific connotations.

Lexical knowledge in the form of the polarity of terms in this lexicon can be introduced in the matrix factorization model. By partially specifying term polarities via $F$, the lexicon influences the sentiment predictions $G$ over documents.

4.1 Representing Knowledge in Word Space

Let $F_0$ represent prior knowledge about sentiment-laden words in the lexicon, i.e., if word $i$ is a positive word, $(F_0)_{i1} = 1$, while if it is negative, $(F_0)_{i2} = 1$. Note that one may also use soft sentiment polarities, though our experiments are conducted with hard assignments.
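To make this concrete, the following Python/NumPy sketch builds $F_0$, together with the diagonal indicator matrix $C_1$ used in the loss term introduced next, from a vocabulary list and a polarity lexicon. This is an illustrative construction under our assumptions (the function and variable names are hypothetical), not the authors' code.

```python
import numpy as np

def build_lexicon_prior(vocab, positive_words, negative_words):
    """Construct the prior matrix F0 (m x 2) and the diagonal indicator
    matrix C1 (m x m) from a polarity lexicon (hypothetical helper).

    vocab          : list of the m (stemmed) vocabulary terms
    positive_words : set of lexicon terms labeled positive
    negative_words : set of lexicon terms labeled negative
    """
    m = len(vocab)
    F0 = np.zeros((m, 2))      # column 0 <-> positive class, column 1 <-> negative class
    c1_diag = np.zeros(m)      # 1 where the polarity of the word is known
    for i, word in enumerate(vocab):
        if word in positive_words:
            F0[i, 0] = 1.0
            c1_diag[i] = 1.0
        elif word in negative_words:
            F0[i, 1] = 1.0
            c1_diag[i] = 1.0
    return F0, np.diag(c1_diag)   # (C1)_ii = 1 iff word i appears in the lexicon
```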
This information is incorporated in the tri-factorization model via a squared loss term,

$$\min_{F,G,S} \|X - F S G^T\|^2 + a\, \mathrm{Tr}\left[(F - F_0)^T C_1 (F - F_0)\right], \qquad (4)$$

where the notation $\mathrm{Tr}(A)$ denotes the trace of the matrix $A$. Here, $a > 0$ is a parameter which determines the extent to which we enforce $F \approx F_0$, and $C_1$ is an $m \times m$ diagonal matrix whose entry $(C_1)_{ii} = 1$ if the category of the $i$-th word is known (i.e., specified by the $i$-th row of $F_0$) and $(C_1)_{ii} = 0$ otherwise. The squared loss term ensures that the solution for $F$ in the otherwise unsupervised learning problem stays close to the prior knowledge $F_0$. Note that if $C_1 = I$, then we know the class orientation of all the words and thus have a full specification of $F_0$; Eq. (4) then reduces to

$$\min_{F,G,S} \|X - F S G^T\|^2 + a \|F - F_0\|^2. \qquad (5)$$

The above model is generic and allows certain flexibility. For example, in some cases our prior knowledge of $F_0$ is not very accurate and we use a smaller $a$, so that the final results do not depend on $F_0$ very much, i.e., the results are mostly unsupervised learning results. In addition, the introduction of $C_1$ allows us to incorporate partial knowledge of word polarity information.

4.2 Computational Algorithm

The optimization problem in Eq. (4) can be solved using the following update rules:

$$G_{jk} \leftarrow G_{jk} \frac{(X^T F S)_{jk}}{(G G^T X^T F S)_{jk}}, \qquad (6)$$

$$S_{ik} \leftarrow S_{ik} \frac{(F^T X G)_{ik}}{(F^T F S G^T G)_{ik}}, \qquad (7)$$

$$F_{ik} \leftarrow F_{ik} \frac{(X G S^T + a C_1 F_0)_{ik}}{(F F^T X G S^T + a C_1 F)_{ik}}. \qquad (8)$$

The algorithm consists of an iterative procedure using the above three rules until convergence. We call this approach Matrix Factorization with Lexical Knowledge (MFLK) and outline the precise steps below.

Algorithm 1 Matrix Factorization with Lexical Knowledge (MFLK)
begin
1. Initialization: $F = F_0$; initialize $G$ to K-means clustering results; $S = (F^T F)^{-1} F^T X G (G^T G)^{-1}$.
2. Iteration until convergence:
   Update $G$: fixing $F$, $S$, update $G$ by Eq. (6)
   Update $F$: fixing $S$, $G$, update $F$ by Eq. (8)
   Update $S$: fixing $F$, $G$, update $S$ by Eq. (7)
end

4.3 Algorithm Correctness and Convergence

Updating $F$, $G$, $S$ using the rules above leads to asymptotic convergence to a local minimum. This can be proved using arguments similar to (Ding et al., 2006). We outline the proof of correctness for updating $F$, since the squared loss term that involves $F$ is a new component in our models.

Theorem 1 The above iterative algorithm converges.

Theorem 2 At convergence, the solution satisfies the Karush-Kuhn-Tucker optimality condition, i.e., the algorithm converges correctly to a local optimum.

Theorem 1 can be proved using the standard auxiliary function approach used in (Lee and Seung, 2001).

Proof of Theorem 2. Following the theory of constrained optimization (Nocedal and Wright, 1999), we minimize the following function:

$$L(F) = \|X - F S G^T\|^2 + a\, \mathrm{Tr}\left[(F - F_0)^T C_1 (F - F_0)\right].$$

The gradient of $L$ is

$$\frac{\partial L}{\partial F} = -2 X G S^T + 2 F S G^T G S^T + 2 a C_1 (F - F_0). \qquad (9)$$

The KKT complementarity condition for the non-negativity of $F_{ik}$ gives

$$\left[-2 X G S^T + 2 F S G^T G S^T + 2 a C_1 (F - F_0)\right]_{ik} F_{ik} = 0. \qquad (10)$$

This is the fixed-point relation that local minima for $F$ must satisfy. Given an initial guess of $F$, the successive update of $F$ using Eq. (8) will converge to a local minimum. At convergence, we have

$$F_{ik} = F_{ik} \frac{(X G S^T + a C_1 F_0)_{ik}}{(F F^T X G S^T + a C_1 F)_{ik}},$$

which is equivalent to the KKT condition of Eq. (10). The correctness of the updating rules for $G$ in Eq. (6) and $S$ in Eq. (7) has been proved in (Ding et al., 2006). ∎

Note that we do not enforce exact orthogonality in our updating rules, since this often implies softer class assignments.
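For concreteness, the sketch below implements the multiplicative updates of Eqs. (6)-(8) with dense NumPy arrays. It is an illustrative reimplementation under stated assumptions, not the authors' code: $G$ is initialized with random non-negative values rather than the K-means results of Algorithm 1, a small constant is added to denominators to avoid division by zero, and a fixed iteration budget stands in for a convergence test.

```python
import numpy as np

def mflk(X, F0, C1, a=0.5, n_iter=100, eps=1e-9, seed=0):
    """Matrix Factorization with Lexical Knowledge (MFLK), Eqs. (6)-(8).

    X  : (m x n) non-negative term-document matrix
    F0 : (m x k) prior word-polarity matrix
    C1 : (m x m) diagonal indicator of lexicon words
    a  : weight enforcing F close to F0
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = F0.shape[1]                        # number of classes (2: positive/negative)
    F = F0 + 0.01 * rng.random((m, k))     # start near the lexicon prior
    G = rng.random((n, k))                 # Algorithm 1 uses K-means results here
    # S = (F^T F)^{-1} F^T X G (G^T G)^{-1}, clipped to stay non-negative
    S = np.abs(np.linalg.pinv(F.T @ F) @ F.T @ X @ G @ np.linalg.pinv(G.T @ G))

    for _ in range(n_iter):
        # Eq. (6): update G with F, S fixed
        G *= (X.T @ F @ S) / (G @ G.T @ X.T @ F @ S + eps)
        # Eq. (8): update F with S, G fixed; the lexicon prior enters here
        F *= (X @ G @ S.T + a * C1 @ F0) / (F @ F.T @ X @ G @ S.T + a * C1 @ F + eps)
        # Eq. (7): update S with F, G fixed
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G
```

Row $i$ of the returned $G$ can then be read as the (unnormalized) class membership of document $i$, and row $i$ of $F$ as that of word $i$.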
5 Semi-Supervised Learning With Lexical Knowledge

So far our models have made no demands on human effort, other than unsupervised collection of the term-document matrix and a one-time effort in compiling a domain-independent sentiment lexicon. We now assume that a few documents are manually labeled for the purposes of capturing some domain-specific connotations, leading to a more domain-adapted model. The partial labels on documents can be described using $G_0$, where $(G_0)_{i1} = 1$ if document $i$ expresses positive sentiment and $(G_0)_{i2} = 1$ if it expresses negative sentiment. As with $F_0$, one can also use soft sentiment labeling for documents, though our experiments are conducted with hard assignments.

Semi-supervised learning with lexical knowledge can therefore be described as

$$\min_{F,G,S} \|X - F S G^T\|^2 + a\, \mathrm{Tr}\left[(F - F_0)^T C_1 (F - F_0)\right] + b\, \mathrm{Tr}\left[(G - G_0)^T C_2 (G - G_0)\right],$$

where $a > 0$ and $b > 0$ are parameters which determine the extent to which we enforce $F \approx F_0$ and $G \approx G_0$ respectively, and $C_1$ and $C_2$ are diagonal matrices indicating the entries of $F_0$ and $G_0$ that correspond to labeled entities. The squared loss terms ensure that the solutions for $F$ and $G$ in the otherwise unsupervised learning problem stay close to the prior knowledge $F_0$ and $G_0$.

5.1 Computational Algorithm

The optimization problem above can be solved using the following update rules:

$$G_{jk} \leftarrow G_{jk} \frac{(X^T F S + b C_2 G_0)_{jk}}{(G G^T X^T F S + b G G^T C_2 G_0)_{jk}}, \qquad (11)$$

$$S_{ik} \leftarrow S_{ik} \frac{(F^T X G)_{ik}}{(F^T F S G^T G)_{ik}}, \qquad (12)$$

$$F_{ik} \leftarrow F_{ik} \frac{(X G S^T + a C_1 F_0)_{ik}}{(F F^T X G S^T + a C_1 F)_{ik}}. \qquad (13)$$

The algorithm for semi-supervised learning with lexical knowledge based on our matrix factorization framework, referred to as SSMFLK, thus consists of an iterative procedure using the above three rules until convergence. The correctness and convergence of the algorithm can be proved using arguments similar to those outlined for MFLK in Section 4.3.

A quick word about computational complexity. The term-document matrix is typically very sparse, with $z \ll nm$ non-zero entries, while $k$ is typically much smaller than $n$ and $m$. By using sparse matrix multiplications and avoiding dense intermediate matrices, the updates can be implemented very efficiently and easily. In particular, updating $F$, $S$, $G$ each takes $O(k^2(m + n) + kz)$ time per iteration, which scales linearly with the dimensions and density of the data matrix. Empirically, the number of iterations before practical convergence is usually very small (less than 100). Thus, computationally, our approach scales to large datasets even though our experiments are run on relatively small datasets.
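The SSMFLK updates differ from the MFLK sketch above only in the $G$ rule, where the document-label term $b\, C_2 G_0$ of Eq. (11) appears. Below is a minimal sketch under the same assumptions as before (dense NumPy, random initialization, fixed iteration budget); it is an illustration, not the authors' code.

```python
import numpy as np

def ssmflk(X, F0, C1, G0, C2, a=0.5, b=1.0, n_iter=100, eps=1e-9, seed=0):
    """Semi-supervised MFLK (SSMFLK), Eqs. (11)-(13).

    G0 : (n x k) prior document-label matrix (column 0 positive, column 1 negative)
    C2 : (n x n) diagonal indicator of labeled documents
    a, b : weights enforcing F close to F0 and G close to G0
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k = F0.shape[1]
    F = F0 + 0.01 * rng.random((m, k))
    G = G0 + 0.01 * rng.random((n, k))
    S = np.abs(np.linalg.pinv(F.T @ F) @ F.T @ X @ G @ np.linalg.pinv(G.T @ G))

    for _ in range(n_iter):
        # Eq. (11): document labels enter the G update through the b*C2*G0 terms
        G *= (X.T @ F @ S + b * C2 @ G0) / (G @ G.T @ X.T @ F @ S + b * G @ G.T @ C2 @ G0 + eps)
        # Eq. (13): same F update as MFLK
        F *= (X @ G @ S.T + a * C1 @ F0) / (F @ F.T @ X @ G @ S.T + a * C1 @ F + eps)
        # Eq. (12): same S update as MFLK
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G
```

At convergence, sentiment predictions for documents can be read off as the arg max over the rows of $G$, and domain-adapted word polarities from the rows of $F$.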
6 Experiments

6.1 Datasets Description

Four different datasets are used in our experiments.

Movies Reviews: This is a popular dataset in the sentiment analysis literature (Pang et al., 2002). It consists of 1000 positive and 1000 negative movie reviews drawn from the IMDB archive of the rec.arts.movies.reviews newsgroups.

Lotus blogs: This dataset is targeted at detecting sentiment around enterprise software, specifically pertaining to the IBM Lotus brand (Sindhwani and Melville, 2008). An unlabeled set of blog posts was created by randomly sampling 2000 posts from a universe of 14,258 blogs that discuss issues relevant to Lotus software. In addition to this unlabeled set, 145 posts were chosen for manual labeling. These posts came from 14 individual blogs, 4 of which are actively posting negative content on the brand, with the rest tending to write more positive or neutral posts. The data was collected by downloading the latest posts from each blogger's RSS feeds, or by accessing the blog's archives. Manual labeling resulted in 34 positive and 111 negative examples.

Political candidate blogs: For our second blog domain, we used data gathered from 16,742 political blogs, which contain over 500,000 posts. As with the Lotus dataset, an unlabeled set was created by randomly sampling 2000 posts. 107 posts were chosen for labeling. A post was labeled as having positive or negative sentiment about a specific candidate (Barack Obama or Hillary Clinton) if it explicitly mentioned the candidate in positive or negative terms. This resulted in 49 positively and 58 negatively labeled posts.

Amazon Reviews: This dataset contains product reviews taken from Amazon.com for 4 product types: Kitchen, Books, DVDs, and Electronics (Blitzer et al., 2007). It contains about 4000 positive reviews and 4000 negative reviews and can be obtained from http://www.cis.upenn.edu/~mdredze/datasets/sentiment/.

For all datasets, we picked the 5000 words with highest document frequency to generate the vocabulary. Stopwords were removed and a normalized term-frequency representation was used. Genuinely unlabeled posts for Political and Lotus were used for the semi-supervised learning experiments in Section 6.3; they were not used in Section 6.2 on the effect of lexical prior knowledge. In the experiments, we set $a$, the parameter determining the extent to which to enforce the feature labels, to 1/2, and $b$, the corresponding parameter for enforcing document labels, to 1.

6.2 Sentiment Analysis with Lexical Knowledge

Of course, one can remove all burden on human effort by simply using unsupervised techniques. Our interest in the first set of experiments is to explore the benefits of incorporating a sentiment lexicon over unsupervised approaches. Does a one-time effort in compiling a domain-independent dictionary and using it for different sentiment tasks pay off in comparison to simply using unsupervised methods? In our case, matrix tri-factorization and other co-clustering methods form the obvious unsupervised baseline for comparison, and so we start by comparing our method (MFLK) with the following methods:

* Four document clustering methods: K-means, Tri-Factor Non-negative Matrix Factorization (TNMF) (Ding et al., 2006), Information-Theoretic Co-clustering (ITCC) (Dhillon et al., 2003), and Euclidean Co-clustering (ECC) (Cho et al., 2004). These methods do not make use of the sentiment lexicon.

* Feature Centroid (FC): This is a simple dictionary-based baseline method. Recall that each word can be expressed as a "bag-of-documents" vector. In this approach, we compute the centroids of these vectors, one corresponding to positive words and another corresponding to negative words. This yields a two-dimensional representation for documents, on which we then perform K-means clustering (a minimal sketch of this baseline is given below).
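The sketch below illustrates the FC baseline just described (our own illustrative implementation; the function and variable names are hypothetical). Each word is taken as a row of the term-document matrix $X$, i.e., a "bag-of-documents" vector.

```python
import numpy as np

def feature_centroid_representation(X, pos_idx, neg_idx):
    """Feature Centroid (FC) baseline representation.

    X       : (m x n) term-document matrix; row i is the bag-of-documents vector of word i
    pos_idx : indices of vocabulary words labeled positive in the lexicon
    neg_idx : indices of vocabulary words labeled negative in the lexicon
    Returns an (n x 2) document representation to be clustered with K-means.
    """
    pos_centroid = X[pos_idx].mean(axis=0)   # length-n centroid of the positive word vectors
    neg_centroid = X[neg_idx].mean(axis=0)   # length-n centroid of the negative word vectors
    return np.vstack([pos_centroid, neg_centroid]).T

# Example usage (hypothetical):
# from sklearn.cluster import KMeans
# docs_2d = feature_centroid_representation(X, pos_idx, neg_idx)
# clusters = KMeans(n_clusters=2, n_init=10).fit_predict(docs_2d)
```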
Performance Comparison: Figure 1 shows the experimental results on the four datasets using accuracy as the performance measure. The results are obtained by averaging 20 runs. It can be observed that our MFLK method can effectively utilize the lexical knowledge to improve the quality of sentiment prediction.

[Figure 1: Accuracy results on the four datasets (Movies, Lotus, Political, Amazon) for MFLK, FC, TNMF, ECC, ITCC and K-means.]

Size of Sentiment Lexicon: We also investigate the effect of the size of the sentiment lexicon on the performance of our model. Figure 2 shows results with random subsets of the lexicon of increasing size. We observe that performance generally increases as more and more lexical supervision is provided.

[Figure 2: MFLK accuracy as the size of the sentiment lexicon (i.e., the number of words in the lexicon) increases, on the four datasets.]

Robustness to Vocabulary Size: High dimensionality and noise can have a profound impact on the comparative performance of clustering and semi-supervised learning algorithms. We simulate scenarios with different vocabulary sizes by selecting words based on information gain. It should, however, be kept in mind that in a truly unsupervised setting document labels are unavailable and therefore information gain cannot be practically computed. Figure 3 and Figure 4 show results for the Lotus and Amazon datasets respectively and are representative of performance on the other datasets. MFLK tends to retain its position as the best performing method even at different vocabulary sizes. ITCC performance is also noteworthy given that it is a completely unsupervised method.

[Figure 3: Accuracy results on the Lotus dataset with increasing vocabulary size.]

[Figure 4: Accuracy results on the Amazon dataset with increasing vocabulary size.]

6.3 Sentiment Analysis with Dual Supervision

We now assume that, together with labeled features from the sentiment lexicon, we also have access to a few labeled documents. The natural question is whether the presence of lexical constraints leads to better semi-supervised models. In this section, we compare our method (SSMFLK) with the following three semi-supervised approaches: (1) the algorithm proposed in (Zhou et al., 2003), which conducts semi-supervised learning with local and global consistency (Consistency Method); (2) Zhu et al.'s harmonic Gaussian field method coupled with Class Mass Normalization (Harmonic-CMN) (Zhu et al., 2003); and (3) the Green's function learning algorithm (Green's Function) proposed in (Ding et al., 2007).

We also compare the results of SSMFLK with those of two supervised classification methods: Support Vector Machines (SVM) and Naive Bayes. Both of these methods have been widely used in sentiment analysis. In particular, the use of SVMs in (Pang et al., 2002) initially sparked interest in using machine learning methods for sentiment classification. Note that none of these competing methods utilizes lexical knowledge.

The results are presented in Figure 5, Figure 6, Figure 7, and Figure 8. We note that our SSMFLK method either outperforms all other methods over the entire range of the number of labeled documents (Movies, Political), or ultimately outpaces the other methods (Lotus, Amazon) as a few document labels come in.

[Figure 5: Accuracy results with increasing number of labeled documents on the Movies dataset.]

[Figure 6: Accuracy results with increasing number of labeled documents on the Lotus dataset.]

[Figure 7: Accuracy results with increasing number of labeled documents on the Political dataset.]

[Figure 8: Accuracy results with increasing number of labeled documents on the Amazon dataset.]

Learning Domain-Specific Connotations: In our first set of experiments, we incorporated the sentiment lexicon in our models and learnt the sentiment orientation of words and documents via the F and G factors respectively.
In the second set of experiments, we additionally introduced labeled documents for domain-specific adjustments. Between these two sets of experiments, we can now look for words that switch sentiment polarity. These words are interesting because their domain-specific connotation differs from their lexical orientation. For Amazon reviews, the following words switched polarity from positive to negative: fan, important, learning, cons, fast, feature, happy, memory, portable, simple, small, work, while the following words switched polarity from negative to positive: address, finish, lack, mean, budget, rent, throw. Note that words like fan and memory probably refer to products or product components (i.e., a computer fan and memory) in the Amazon review context, but have a very different connotation, say, in the context of movie reviews, where they probably refer to movie fanfare and memorable performances. We were surprised to see happy switch polarity! Two examples of its negative-sentiment usage are: "I ended up buying a Samsung and I couldn't be more happy" and "BORING, not one single exciting thing about this book. I was happy when my lunch break ended so I could go back to work and stop reading."

7 Conclusion

The primary contribution of this paper is to propose and benchmark new methodologies for sentiment analysis. Non-negative Matrix Factorizations constitute a rich body of algorithms that have found applicability in a variety of machine learning applications, from recommender systems to document clustering. We have shown how to build effective sentiment models by appropriately constraining the factors using lexical prior knowledge and document annotations. To more effectively utilize unlabeled data and induce domain-specific adaptation of our models, several extensions are possible: facilitating learning from related domains, incorporating hyperlinks between documents, incorporating synonyms or co-occurrences between words, etc. As a topic of vigorous current activity, there are several very recently proposed competing methodologies for sentiment analysis that we would like to benchmark against. These are topics for future work.

Acknowledgement: The work of T. Li is partially supported by NSF grants DMS-0844513 and CCF-0830659.
We would also like to thank Prem Melville and Richard Lawrence for their support.

References

J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL, pages 440-447.

H. Cho, I. Dhillon, Y. Guan, and S. Sra. 2004. Minimum sum squared residue co-clustering of gene expression data. In Proceedings of the 4th SIAM Data Mining Conference, pages 22-24, April.

S. Das and M. Chen. 2001. Yahoo! for Amazon: Extracting market sentiment from stock message boards. In Proceedings of the 8th Asia Pacific Finance Association (APFA).

I. S. Dhillon, S. Mallela, and D. S. Modha. 2003. Information-theoretic co-clustering. In Proceedings of ACM SIGKDD, pages 89-98.

I. S. Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of ACM SIGKDD.

C. Ding, T. Li, W. Peng, and H. Park. 2006. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of ACM SIGKDD, pages 126-135.

C. Ding, R. Jin, T. Li, and H. D. Simon. 2007. A learning framework using Green's function and kernel regularization with application to recommender system. In Proceedings of ACM SIGKDD, pages 260-269.

G. Druck, G. Mann, and A. McCallum. 2008. Learning from labeled features using generalized expectation criteria. In SIGIR.

A. Goldberg and X. Zhu. 2006. Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization. In HLT-NAACL 2006: Workshop on Textgraphs.

T. Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of SIGIR, pages 50-57.

M. Hu and B. Liu. 2004. Mining and summarizing customer reviews. In KDD, pages 168-177.

S.-M. Kim and E. Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the International Conference on Computational Linguistics.

D. D. Lee and H. S. Seung. 2001. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13.

T. Li, C. Ding, Y. Zhang, and B. Shao. 2008. Knowledge transformation from word space to document space. In Proceedings of SIGIR, pages 187-194.

B. Liu, X. Li, W. S. Lee, and P. Yu. 2004. Text classification by labeling words. In AAAI.

V. Ng, S. Dasgupta, and S. M. Niaz Arifin. 2006. Examining the role of linguistic knowledge sources in the automatic identification and classification of reviews. In COLING & ACL.

J. Nocedal and S. J. Wright. 1999. Numerical Optimization. Springer-Verlag.

B. Pang and L. Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL.

B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135. http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html

B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In EMNLP.

G. Ramakrishnan, A. Jadhav, A. Joshi, S. Chakrabarti, and P. Bhattacharyya. 2003. Question answering via Bayesian inference on lexical relations. In ACL, pages 1-10.

T. Sandler, J. Blitzer, P. Talukdar, and L. Ungar. 2008. Regularized learning with networks of features. In NIPS.

R. E. Schapire, M. Rochery, M. G. Rahim, and N. Gupta. 2002. Incorporating prior knowledge into boosting. In ICML.

V. Sindhwani and P. Melville. 2008. Document-word co-regularization for semi-supervised sentiment analysis. In Proceedings of IEEE ICDM.
V. Sindhwani, J. Hu, and A. Mojsilovic. 2008. Regularized co-clustering with dual supervision. In Proceedings of NIPS.

P. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417-424.

X. Wu and R. Srihari. 2004. Incorporating prior knowledge with weighted margin support vector machines. In KDD.

H. Zha, X. He, C. Ding, M. Gu, and H. D. Simon. 2001. Bipartite graph partitioning and data clustering. In Proceedings of ACM CIKM.

D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. 2003. Learning with local and global consistency. In Proceedings of NIPS.

X. Zhu, Z. Ghahramani, and J. Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of ICML.

L. Zhuang, F. Jing, and X. Zhu. 2006. Movie review mining and summarization. In CIKM, pages 43-50.