{"title": "Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1857, "page_last": 1864, "abstract": "The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such a case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce \"Semi-supervised Learning with Weakly-Related Unlabeled Data\" (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although SSLW could improve a wide range of classification tasks, in this paper we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tend to be consistent. To this end, SSLW estimates the optimal word-correlation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. 
For empirical evaluation, we present a direct comparison with a number of state-of-the-art methods for inductive semi-supervised learning and text categorization; and we show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test beds.", "full_text": "Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization

Liu Yang
Machine Learning Dept.
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
liuy@cs.cmu.edu

Rong Jin
Dept. of Computer Sci. and Eng.
Michigan State University
3115 Engineering Building
East Lansing, MI 48824
rongjin@cse.msu.edu

Rahul Sukthankar
Intel Research Pittsburgh
and Carnegie Mellon Univ.
4720 Forbes Avenue, #410
Pittsburgh, PA 15213
rahuls@cs.cmu.edu

Abstract

The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such a case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although SSLW could improve a wide range of classification tasks, in this paper we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tend to be consistent. 
To this end, SSLW estimates the optimal word-correlation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of state-of-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.

1 Introduction

Semi-supervised Learning (SSL) takes advantage of a large amount of unlabeled data to enhance classification accuracy. Its application to text categorization is stimulated by the easy availability of an overwhelming number of unannotated web pages, in contrast to the limited number of annotated ones. Intuitively, corpora with different topics may not be related content-wise; however, word usage exhibits consistent patterns within a language. The question, then, is: what would be an effective SSL strategy for extracting these valuable word usage patterns embedded in the unlabeled corpus? In this paper, we aim to identify a new data representation that is, on one hand, informative about the target class (category), and, on the other hand, consistent with the feature coherence patterns exhibited in the weakly-related unlabeled data. We further turn this into a convex optimization problem, and solve it efficiently by an approximate approach. In this section, we first review the two types of semi-supervised learning: transductive SSL and inductive SSL. Then we state SSL with weakly-related unlabeled data as a new challenge. Finally, we provide a strategy for addressing this challenge in the domain of text categorization, as well as a brief summary of related work in text categorization.

A variety of methods have been developed for transductive SSL [14, 21]. 
These methods can be grouped as: EM with generative mixture models; bootstrapping methods (Self-training, Co-training, and the Yarowsky Algorithm); discriminative models (Transductive Support Vector Machines (TSVM) [2]); and data-based methods, including Manifold Regularization [1], Information Regularization [17], and Low Density Separation (LDS) [11]. Specifically, TSVM extends the maximum margin principle of SVM to unlabeled data. It combines the regularization of SVMs on the labeled points with the cluster assumption on the unlabeled points, to force the decision boundary to lie in low density regions. Data-based methods discover an inherent geometry in the data and exploit it in finding a good classifier, to which additional regularization based on unlabeled data is added to avoid overfitting. Manifold Regularization uses the combinatorial Laplacian as a smoothness term. Based on the assumption that different classes usually form separate manifolds, it constructs decision functions that vary little along the data manifolds. Information Regularization seeks a good conditional Pr(y|x), assuming that the decision boundary lies in a low density area and that Pr(y|x) varies little in areas of high density. Low Density Separation makes assumptions similar to those of Manifold Regularization and Information Regularization. In addition, it further computes a new data representation based on the unlabeled data, which often results in better classification performance for SSL.

Not many inductive SSL approaches have been presented. In general, the essential distinction between transductive learning and inductive learning is that transductive learning produces labels only for the available unlabeled data, while inductive learning not only produces labels for the unlabeled data, but also learns a classifier that can be used to predict labels for new data. 
In this sense, some SSL algorithms, though named “transductive”, have an inductive nature. For example, TSVM is an inductive learner, because it learns a classifier from a mixture of labeled and unlabeled data. Similarly, as the inductive component of Low Density Separation (LDS) [11], ∇TSVM learns the SVM classification model in the primal, which can be used for predicting new data. However, the graph part of LDS is transductive, because the kernel and the graph distances are addressed by a prior eigen-decomposition and re-representation (MDS); thus, it is unclear how to make a prediction for a new test point other than by rebuilding the graph with the new test point. Manifold Regularization [1] also has an implementation with an inductive nature. Harmonic Mixtures [22] is a recent work that aims to overcome the limitations of non-inductive inference. It models the data by a generative mixture of Gaussians, and adds discriminative regularization using the graph Laplacian.

In this paper, we focus on inductive SSL. In contrast to previous work in this area, we focus on the following important problem that has been overlooked before. As stated in [11], either directly or indirectly, all successful semi-supervised algorithms typically make the cluster assumption, which puts the decision boundary in low density areas without crossing the high density regions. Note that the cluster assumption is only meaningful when the labeled and unlabeled data are somehow closely related. When the unlabeled data comes from arbitrary data sources, their input patterns may not be closely related to those of the labeled ones. As a result, the labeled and unlabeled data could be well separated, which makes it difficult, if not impossible, to exploit the cluster assumption. Hence, the key challenge is how to leverage the seemingly unrelated unlabeled data to improve the classification accuracy on the target classes. 
Analogous to transfer learning, in which information from one category may be generalized to others, we propose a scheme that helps the categorization of one data source by making use of information from other unlabeled data sources with little relevance. Our study stands in contrast to previous ones in that we aim to make maximum use of unlabeled data that is weakly related to the test bed. We refer to this problem as “SSL with weakly related unlabeled data”, or SSLW for short. We first build a maximum margin framework for SSL with weakly related unlabeled data. We then cast the framework into a Second Order Cone Programming (SOCP) problem that can be efficiently solved.

A typical approach for semi-supervised learning with weakly related unlabeled data, presented in the recent study [13], is to first derive a new data representation from the unlabeled data, and then apply a supervised learning technique to the derived representation. In [13], the authors proposed an SSL scheme termed self-taught learning, which essentially conducts unsupervised dimension reduction using sparse coding [10]. The new dimensions derived from the unlabeled data can then be used to represent the labeled data points for supervised learning. Notably, self-taught learning [13] performs coding and classification in two separate stages. In contrast, in our method, the construction of a good data representation is combined with the training of a maximum margin classifier under a unified framework. In particular, the data representation generated by our method exploits both labeled and unlabeled data, which differentiates the proposed framework from self-taught learning.

In general, SSLW could improve a wide range of classification tasks. However, in this study, we focus on text categorization with a small training set. 
Text categorization has been actively studied in the communities of Web data mining, information retrieval, and statistical learning [9, 20]. A number of statistical learning techniques have been applied to text categorization [19], including K Nearest Neighbor approaches, decision trees, Bayesian classifiers, inductive rule learning, neural networks, support vector machines (SVM), and logistic regression. Empirical studies [7] have shown that the support vector machine (SVM) is the leading technique for text categorization. Given the limited amount of labeled documents, the key to semi-supervised text categorization is to exploit the unlabeled documents. The popular implementations of semi-supervised SVMs in [8, 15] are considered to be state-of-the-art in text categorization.

For text categorization with a small training pool, it is very likely that a large portion of the words used by the testing documents are unseen in the training set, which can lead to a poor estimate of the similarity between documents. If we can identify the coherence information of words (e.g., word correlation) from both the labeled and unlabeled documents, we will be able to estimate document similarity more accurately, particularly for documents sharing few or no common words, thus improving the overall classification accuracy. A straightforward approach is to utilize the word co-occurrence information for computing document similarity. However, this straightforward approach may not capture word correlation well, because not all co-occurrence patterns are useful. Some co-occurrence patterns (e.g., co-occurrence with common words) do not reflect the semantic relations among words, and some are not related to the target class. Consequently, it is critical to identify the subset of co-occurrence patterns that are most informative for the target classification problems. 
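The effect described above can be sketched numerically: under a plain bag-of-words kernel, two documents that share no terms have zero similarity, while a word-correlation matrix built from co-occurrence in an unlabeled corpus bridges them. A minimal sketch with a toy four-word vocabulary; the vectors and corpus are invented for illustration, and the matrix G^T G used here is the naive co-occurrence form (the COR baseline below), not the correlation matrix that SSLW learns:

```python
import numpy as np

# Toy vocabulary: ["stock", "market", "shares", "trade"]
# Two labeled documents with disjoint vocabularies:
d1 = np.array([2.0, 1.0, 0.0, 0.0])   # uses "stock", "market"
d2 = np.array([0.0, 0.0, 1.0, 2.0])   # uses "shares", "trade"

# Plain linear kernel K = D^T D: zero similarity, no shared words.
assert d1 @ d2 == 0.0

# Unlabeled corpus (rows = documents) in which the two vocabularies
# co-occur, supplying the missing link between the words.
G = np.array([
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 1.0],
])

R = G.T @ G        # naive co-occurrence word-correlation (COR baseline)
sim = d1 @ R @ d2  # kernel entry d1^T R d2
print(sim)         # positive: correlated words bridge the documents
```

With R in place, the two documents become comparable even though they share no terms, which is exactly the gap that a learned word-correlation matrix is meant to close.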
To address this problem, SSLW explicitly estimates the optimal word-correlation matrix for the target document categorization problem. The rest of the paper is organized as follows. Section 2 introduces the basic notation and gives a brief review of the SVM dual formalism. In Section 3, we propose the framework of SSL with weakly-related unlabeled data, followed by an efficient algorithm for its computation in Section 4. Section 5 evaluates SSLW; and in Section 6 we provide some insights into the experimental evidence and discuss future work.

2 Preliminaries

We introduce the notation used throughout this paper and briefly review the SVM dual formulation. Denote L = {(x_1, y_1), . . . , (x_l, y_l)} as the collection of labeled documents, where y_i is +1 when document x_i belongs to a given document category and −1 when it does not (the text categorization problem for multi-labeled documents can be treated as a set of independent binary classification problems). Let U = {x_{l+1}, . . . , x_n} be the unlabeled collection of documents. Let V denote the size of the vocabulary. Importantly, as an SSL task with weakly-related unlabeled data, U comes from external resources that are weakly related to the test domain. To facilitate our discussion, we denote the document-word matrix on L by D = (d_1, d_2, . . . , d_l), where d_i ∈ N^V represents the word-frequency vector for document x_i. The word-document matrix on L + U is denoted by G = (g_1, g_2, . . . , g_V), where g_i = (g_{i,1}, g_{i,2}, . . . , g_{i,n}) represents the occurrences of the ith word in all the n documents. Recall the dual formalism for SVM:

max_α   α^T e − (1/2)(α ∘ y)^T K (α ∘ y)
s.t.    α^T y = 0,   0 ≤ α_i ≤ C,  i = 1, 2, . . . , n,        (1)

where α = (α_1, α_2, . . . , α_n) are the weights assigned to the training documents, e is a vector with all elements being 1, and the symbol ∘ denotes the element-wise product between two vectors. K ∈ R^{n×n} is the kernel matrix representing the pairwise document similarity, and K = D^T D.

3 The Framework of Semi-supervised Learning with Weakly-Related Unlabeled Data

In this section, we present the algorithm of Semi-supervised Learning with Weakly-Related Unlabeled Data (SSLW). As analyzed in Section 1, the kernel similarity measure in the standard SVM dual formalism, K = D^T D, is problematic in the sense that the similarity between two documents will be zero if they do not share any common words, even when a large collection of documents exhibits pairwise relationships between the seen words and the unseen ones. To solve this problem, we take a word-correlation matrix into account when computing the kernel similarity matrix, and we search for the word-correlation matrix that maximizes the categorization margin. Specifically, we define the kernel matrix as K = D^T R D, introducing the word-correlation matrix R ∈ R^{V×V}, where each element R_{i,j} represents the correlation between the ith and the jth words. Note that G^T G is not a desirable solution for R, because it is improper to assign a high correlation to two words simply because of their high co-occurrence; the two words may not be closely related as judged by the maximum-margin criterion. Therefore, it is important to search for the optimal word-correlation matrix R, in addition to the maximization in Eqn. (1), so as to maximize the categorization margin. We denote the optimal value of the objective function in Eqn. 
(1) as κ(K):

κ(K) = max_α   α^T e − (1/2)(α ∘ y)^T K (α ∘ y)        (2)

Given that κ(K) is inversely related to the categorization margin [4], minimizing κ(K) is equivalent to maximizing the categorization margin.

Now we consider how to make maximum use of the weakly-related source U. The matrix G is crucial in capturing the word correlation information from the weakly-related external source U. Thus, to incorporate the external source into the learning of the word-correlation matrix R, we regularize R according to G by introducing an internal representation of words W = (w_1, w_2, . . . , w_V), where vector w_i is the internal representation of the ith word (this idea is similar to non-negative matrix factorization (NMF) [6]). We expect W to carry an amount of information equivalent to that of G, i.e., G and W are roughly equivalent representations of words. As there exists a matrix U such that G can be recovered from W by a linear transformation G = UW, the word-correlation matrix can be computed as R = W^T W. Further, the constraints G = UW and R = W^T W can be combined to obtain the following positive semi-definite constraint

[ R    G^T ]
[ G    T   ]  ⪰ 0,        (3)

where T = UU^T [18]. Another strategy we use to involve the unlabeled data in the learning of word correlation is to construct the word-correlation matrix R as a non-negative linear combination of the top p right eigenvectors of G, i.e.,

R = ξ I_V + Σ_{i=1}^{p} (α_i − ξ) s_i s_i^T,        (4)

where {s_i, i = 1, 2, . . . , p} denote the right eigenvectors of matrix G, sorted in descending order of their eigenvalues θ_i. I_V is the V × V identity matrix, and α_i ≥ 0, i = 1, . . . , p and ξ ≥ 0 are non-negative combination weights. 
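The parameterization in Eqn. (4) can be sketched directly: build R from the top-p right singular vectors of G (the right eigenvectors of G^T G); with ξ ≥ 0 and α_i ≥ 0 the result is positive semi-definite by construction, and only p + 1 free parameters remain. A minimal numpy sketch, with the matrix sizes, seed, and combination weights invented for illustration (SSLW would learn the weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, V, p = 30, 10, 3

G = rng.random((n_docs, V))          # document-word matrix on L + U

# Top-p right singular vectors of G (right eigenvectors of G^T G).
_, _, Vt = np.linalg.svd(G, full_matrices=False)
S = Vt[:p].T                         # V x p, columns s_1, ..., s_p

xi = 0.1
alpha = np.array([2.0, 1.5, 0.5])    # combination weights (fixed here)

# Eqn. (4): R = xi * I_V + sum_i (alpha_i - xi) s_i s_i^T
R = xi * np.eye(V) + (S * (alpha - xi)) @ S.T

# With xi >= 0 and alpha_i >= 0, R is positive semi-definite:
assert np.linalg.eigvalsh(R).min() >= -1e-10
# Only p + 1 = 4 free parameters instead of O(V^2) = 100.
```

The eigenvalues of R are exactly the α_i on span{s_i} and ξ elsewhere, which is why the PSD constraint reduces to the simple non-negativity conditions stated above.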
Note that introducing ξ I_V ensures the non-singularity of the matrix R, which is important when computing the expression for the matrix T. This simplification of R allows us to effectively extract and utilize the word co-occurrence information in the external source U. Additionally, the positive semi-definite constraint R ⪰ 0 is converted into simple non-negative constraints, i.e., ξ ≥ 0 and {α_i ≥ 0}_{i=1}^{p}. The number of variables in R, which was originally O(V^2), is now reduced to p + 1. A closer look at the combination weights reveals that both the straightforward co-occurrence matrix G^T G and Manifold Regularization give predefined weights for the eigenvector combination, and thus can be seen as special cases of SSLW. Precisely speaking, the straightforward co-occurrence matrix G^T G directly uses the eigenvalues as the weights. Manifold Regularization does a slightly better job by defining the weights as a fixed function of the eigenvalues. Different from both, we give SSLW the entire freedom to learn the weights from data. In this sense, SSLW generalizes these two methods.

Based on the above analysis, we reformulate an extension of the SVM dual in Eqn. (1), to search for an optimal word-correlation matrix R, by exploiting the word co-occurrence information in the external source U under the maximum-margin criterion, i.e.,

min_{R∈Δ, U, W}   κ(D^T R D)        (5)

where the word-correlation matrix R is restricted to the domain Δ that is defined as

Δ = { R ∈ S_+^{V×V} : [ R  G^T ; G  T ] ⪰ 0 }        (6)

if we use Eqn. (3) for R, and

Δ = { R = ξ I_V + Σ_{i=1}^{p} (α_i − ξ) s_i s_i^T : ξ ≥ 0, α_i ≥ 0, i = 1, . . . , p }        (7)

if we use Eqn. (4) for R. Given the definition of κ in Eqn. (2), Eqn. 
(5) is the following min-max problem, which has no analytic solution:

min_{R∈Δ, U, W}  max_α   α^T e − (1/2)(α ∘ y)^T (D^T R D)(α ∘ y)        (8)

4 An Efficient Algorithm for SSLW

This section provides a computationally efficient and scalable algorithm for solving the min-max problem in Eqn. (8), with the domain Δ defined in Eqn. (6). We first rewrite the maximization problem in Eqn. (1) as a minimization problem by computing its dual form:

min_{t, η, δ, ρ}   t + 2C δ^T e
s.t.   [ K                ρ ∘ y + λe ]
       [ (ρ ∘ y + λe)^T        t     ]  ⪰ 0
       ρ = e + η − δ
       δ_i ≥ 0,  η_i ≥ 0,  i = 1, 2, . . . , n,        (9)

where λ is the dual variable associated with the equality constraint α^T y = 0 in Eqn. (1). Then, by plugging Eqn. (9) back into Eqn. (5), we transform the min-max problem in Eqn. (8) into the following minimization problem:

min_{t, η, δ, ρ, R}   t + 2C δ^T e + C_t tr(T) + C_r tr(R)
s.t.   [ D^T R D           ρ ∘ y + λe ]
       [ (ρ ∘ y + λe)^T         t     ]  ⪰ 0
       ρ = e + η − δ,   δ_i ≥ 0,  η_i ≥ 0,  i = 1, 2, . . . , n
       [ R  G^T ; G  T ]  ⪰ 0.        (10)

Note that, since our goal is to compute R and T, any valid (W, U) is sufficient, and no uniqueness constraints are imposed on W and U.

In Eqn. (10), C_t tr(T) and C_r tr(R) serve as sparse regularizers for T and R. They are added to improve the stability of the optimal solution, as well as to favor a simpler model over sophisticated ones. The parameters C_t and C_r weigh the importance of the two regularization terms. The trace heuristic has been widely used to encourage a low-rank matrix by minimizing its trace in place of its rank. In the generalization of the trace heuristic presented in [5], the dual of the spectral norm (i.e., the nuclear norm) is shown to be the convex envelope of the rank on the set of matrices with norm less than one. 
The rank objective can thus be replaced with the dual of the spectral norm for rank minimization. In other words, the best convex regularizer available for rank minimization is the trace function.

Eqn. (10) is a Semi-Definite Programming (SDP) problem [3], and in general can be solved using SDP packages such as SeDuMi [16]. However, solving an SDP problem is computationally expensive and does not easily scale to a large number of training examples. [18] recently provided an elegant scheme for rewriting an SDP problem as a Second Order Cone Programming (SOCP) problem, which can be solved much more efficiently [3]. Technically, we adopt this procedure and rewrite Eqn. (10) as a typical SOCP problem that can be efficiently solved. Given the estimated word-correlation matrix R and K = D^T R D, the example weights α in the SVM model can be estimated using the KKT conditions: α = (yy^T ∘ K)^{−1}(e + η − δ + λy). The threshold b in the SVM can then be obtained by solving the primal SVM using linear programming.

5 Evaluation

In this section, we evaluate SSLW on text categorization with limited training data. The experimental setup is purely inductive, i.e., the testing feature space is invisible in the training phase. As an SSL task with weakly-related unlabeled data, the provided unlabeled data have little relevance to the test domain. We show that SSLW can achieve noticeable gains over the state-of-the-art methods in both inductive SSL and text categorization, and we provide insight into why this happens. Following [18], our implementation of SSLW selects the top 200 right eigenvectors of the matrix G to construct the R matrix. 
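A short sketch of why the resulting model is inductive: since any feasible R is positive semi-definite, K = D^T R D is itself a valid kernel, and a new document is kernelized against the training documents through the same R, with no graph to rebuild. The matrices below are random placeholders, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
V, l = 12, 6
D = rng.integers(0, 3, size=(V, l)).astype(float)  # term-document matrix (training)

# Any PSD word-correlation matrix R (a random one here, standing in
# for the R that SSLW would estimate).
B = rng.random((V, V))
R = B @ B.T

K = D.T @ R @ D                      # kernel on the training documents

# R PSD implies K = (B^T D)^T (B^T D) is PSD too -- a valid kernel:
assert np.linalg.eigvalsh(K).min() >= -1e-8

# Inductive prediction: a new document is compared to the training
# documents through the same R, without rebuilding any graph.
d_new = rng.integers(0, 3, size=V).astype(float)
k_new = D.T @ R @ d_new              # similarities to the l training docs
assert k_new.shape == (l,)
```

This is the property that separates SSLW (and other inductive methods) from purely transductive schemes such as the graph part of LDS discussed in Section 1.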
As defined in Section 3, the matrix G covers both the training sets and the weakly-related external collection.

Evaluation datasets. Two standard datasets for text categorization are used as the evaluation test bed: the Reuters-21578 dataset and the WebKB dataset. For computational simplicity, 1000 documents are randomly selected from the TREC AP88 dataset and used as an external information source for both datasets. The AP88 dataset is a collection of news documents reported by the Associated Press in 1988. The same pre-processing and indexing procedure is applied to all three datasets, using the Lemur Toolkit (footnote 1). For the Reuters-21578 dataset, among the 135 TOPICS categories, the 10 categories with the largest number of documents are selected (see Table 1). This results in a collection of 9,400 documents. The WebKB dataset has seven categories: student, faculty, staff, department, course, project, and other; we discard the category “other” due to its unclear definition (see Table 2). This results in 4,518 data samples in the selected dataset. The Reuters-21578 dataset and the TREC AP88 dataset have very limited topical relevance, and the WebKB dataset and the TREC AP88 dataset are even less related content-wise.

Category  | earn | acq  | money-fx | crude | grain | trade | interest | wheat | ship | corn
# Samples | 3987 | 2448 | 801      | 634   | 628   | 552   | 513      | 306   | 305  | 254

Table 1: The ten categories of the Reuters-21578 dataset with the largest number of documents.

Category  | course | department | faculty | project | staff | student
# Samples | 930    | 182        | 1124    | 504     | 137   | 1641

Table 2: The six categories of the WebKB dataset.

Evaluation Methodology. We focus on binary classification. For each class, 4 positive samples and 4 negative samples are randomly selected to form the training set, and the rest of the data serve as the testing set. 
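The sampling protocol above (4 positive and 4 negative examples for training, everything else held out for testing) can be written as a small helper; the function name, label encoding, and seed are invented for illustration:

```python
import random

def small_pool_split(labels, n_pos=4, n_neg=4, seed=0):
    """Sample a tiny balanced training pool; the rest is the test set."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == +1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    train = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    held_out = set(train)
    test = [i for i in range(len(labels)) if i not in held_out]
    return train, test

labels = [+1] * 50 + [-1] * 950          # a rare positive class
train, test = small_pool_split(labels)
assert len(train) == 8 and len(test) == 992
```

Repeating this split with different seeds gives the ten random trials over which the reported AUR values are averaged.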
Because each category is rare relative to the whole collection, the testing data is highly unbalanced. Therefore, we adopt the area under the ROC curve (AUR) [12] as the quantitative measure of binary classification performance for text categorization. AUR is computed from the real-valued scores that the classifiers return for the testing documents. Each experiment is repeated ten times, and the AUR averaged over these trials is reported.

Baseline Methods. We use six baseline methods to demonstrate the strength of SSLW from different perspectives. The first two baselines are the standard SVM and the traditional TSVM. The third baseline is ∇TSVM (footnote 2), the inductive component of LDS, which delivers state-of-the-art SSL performance. The fourth baseline, Manifold Regularization (footnote 3) (ManifoldR for short), is included as a state-of-the-art SSL approach with an inductive nature and, more importantly, the ability to incorporate word relationships into the regularization. For the fifth baseline, we compare the word-correlation matrix estimated by SSLW with the trivial word-correlation matrix G^T G; we name this baseline COR. Finally, self-taught learning [13] serves as our sixth baseline method, named Self-taught. It uses the unlabeled data to find a low-dimensional representation, and then conducts standard classification in this new space.

Text Categorization with Limited Training Data. We describe the AUR results on both the Reuters-21578 dataset and the WebKB dataset, using the different methods. For the Reuters-21578 dataset, Table 3 summarizes the AUR comparison between the six baseline methods and SSLW. Both the mean and the variance of AUR are shown in the table. We observe that SSLW consistently outperforms the six baselines in AUR across most of the ten categories. In general, a t-test shows our performance gain is statistically significant compared to all the baselines at a significance level of 0.05. 
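The AUR used above is the probability that a randomly drawn positive test document is scored above a randomly drawn negative one. A minimal rank-based implementation (the function name is invented; ties count as one half):

```python
def area_under_roc(scores, labels):
    """AUR via the rank-sum (Mann-Whitney) statistic; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.3, 0.7, 0.2]
labels = [+1, +1, +1, -1, -1]
print(area_under_roc(scores, labels))   # 5/6: one pos/neg pair misordered
```

Because it depends only on the ranking of the scores, AUR is insensitive to the class imbalance of the rare-category test sets used here, which is why it is preferred over plain accuracy.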
Detailed analysis is provided below. First, TSVM and ∇TSVM overall perform even worse than the standard SVM. This observation reveals that if the unlabeled data are only weakly relevant to the target class, simply pushing the decision boundary towards the low density regions of the unlabeled data, and away from its high density areas, can harm the categorization accuracy. It also supports our hypothesis that the cluster assumption is not valid in this case. Second, the dramatic advantage of SSLW over the COR method confirms our previous analysis: learning a good word-correlation matrix that is jointly determined by the co-occurrence matrix and the classification margin (as SSLW does) achieves significant gains over simply using the trivial form G^T G. Third, we observe that SSLW consistently improves over Manifold Regularization, except on the “trade” category, where ManifoldR has a slight advantage. Most noticeably, on the “wheat” and “ship” categories, SSLW improves the AUR by more than 10%. These results demonstrate that SSLW is effective in improving text categorization accuracy with a small amount of training data. We also notice that ∇TSVM outperforms TSVM on some categories, but is slightly worse than TSVM on some others. The unstable performance of ∇TSVM can possibly be explained by its gradient descent nature. Finally, our method achieves gains over self-taught learning [13] on most categories. This shows that SSLW is more effective than self-taught learning in using unlabeled data to improve classification. 

Footnotes:
1 http://www.lemurproject.org/
2 http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/lds/
3 http://manifold.cs.uchicago.edu/manifold_regularization/software.html
The gains can be attributed to the fact that Self-taught performs coding and classification in two separate stages, while SSLW achieves the two purposes simultaneously.

A more careful examination indicates that SSLW also reduces the standard deviation of the classification accuracy. The standard deviations of SSLW are mostly below 2.5%, while those of the baseline methods are mostly above 2.5%. Over all ten categories except “money-fx”, SSLW delivers the lowest or the second lowest standard deviation among all the methods. We hypothesize that the large standard deviation of the baseline models is mainly due to the small number of training documents. In this situation, many words appear in only a few training documents. As a result, the association between these words and the class labels cannot be reliably established. In extreme cases, where such words do not appear in any of the training documents, no association can be established between them and the class labels. Evidently, test documents related to these unseen words are likely to be classified incorrectly. By contrast, SSLW can resolve this problem by estimating the word correlation: for a missing word, its association with the class label can be reliably estimated through its correlation with words that appear frequently in the training examples.

Table 4 shows the AUR results on the WebKB dataset, where we observe trends similar to those described above for the Reuters-21578 dataset. 
It is shown that SSLW maintains its clear advantage over the six baseline methods across all six categories.

Category    SVM          TSVM         ∇TSVM        ManifoldR    COR          Self-taught   SSLW
earn        82.3 ± 2.1   70.9 ± 4.1   86.4 ± 2.1   70.1 ± 5.2   62.6 ± 5.8   65.9 ± 3.5    89.3 ± 1.6
acq         69.7 ± 3.0   63.1 ± 3.3   59.2 ± 4.1   70.1 ± 3.0   51.2 ± 4.7   68.2 ± 2.6    73.5 ± 3.3
money-fx    71.3 ± 2.6   67.4 ± 3.1   70.0 ± 2.0   74.0 ± 2.6   76.5 ± 4.6   75.7 ± 3.9    82.1 ± 4.4
crude       69.7 ± 3.3   68.6 ± 3.2   59.9 ± 4.7   71.5 ± 3.3   56.0 ± 5.7   67.6 ± 3.1    77.5 ± 1.7
grain       70.7 ± 3.5   68.7 ± 2.3   66.4 ± 3.5   75.1 ± 3.5   62.1 ± 5.4   69.0 ± 2.9    82.7 ± 2.0
trade       82.7 ± 3.4   65.1 ± 5.0   71.5 ± 4.2   85.1 ± 3.4   78.8 ± 5.2   78.5 ± 4.4    84.4 ± 3.9
interest    79.3 ± 1.5   60.2 ± 3.9   70.4 ± 3.1   85.0 ± 1.5   69.4 ± 4.7   76.5 ± 2.5    89.4 ± 1.8
wheat       77.6 ± 3.8   61.9 ± 3.6   64.7 ± 4.6   79.1 ± 3.8   54.4 ± 5.7   67.1 ± 2.6    89.4 ± 1.6
ship        70.4 ± 2.6   64.5 ± 2.9   65.8 ± 3.9   72.3 ± 2.6   52.1 ± 5.0   68.0 ± 2.1    82.8 ± 1.4
corn        80.8 ± 2.9   65.4 ± 2.1   66.5 ± 5.3   77.0 ± 5.0   54.5 ± 5.6   66.8 ± 3.7    86.4 ± 2.3

Table 3: The AUR results (%) on the Reuters-21578 dataset with 8 training examples per category.

Category    SVM          TSVM         ∇TSVM        ManifoldR    COR          Self-taught   SSLW
course      66.8 ± 2.2   61.5 ± 2.0   68.4 ± 2.8   61.8 ± 2.9   63.3 ± 5.4   66.0 ± 3.9    76.2 ± 2.5
dept.       72.2 ± 2.8   58.8 ± 5.2   63.7 ± 3.5   73.4 ± 5.9   58.3 ± 5.1   70.8 ± 3.6    87.6 ± 2.2
faculty     56.7 ± 3.4   56.4 ± 2.6   54.2 ± 3.0   56.9 ± 2.8   53.1 ± 4.6   61.7 ± 3.3    61.6 ± 3.4
project     59.6 ± 2.9   57.0 ± 2.3   60.3 ± 1.4   61.8 ± 3.1   50.0 ± 5.9   58.7 ± 3.0    69.5 ± 3.2
staff       58.1 ± 1.6   53.0 ± 1.1   51.6 ± 1.3   52.9 ± 0.9   46.4 ± 1.6   59.9 ± 1.9    58.3 ± 1.5
student     59.2 ± 2.7   54.0 ± 2.3   55.3 ± 2.7   59.4 ± 3.1   56.0 ± 4.1   61.0 ± 1.9    67.7 ± 2.6

Table 4: The AUR results (%) on the WebKB dataset with 8 training examples per category.

6 Conclusion

This paper explores a new challenge in semi-supervised learning: how to leverage unlabeled information that is only weakly related to the target classes in order to improve classification performance. We propose the algorithm Semi-supervised Learning with Weakly-Related Unlabeled Data (SSLW) to address this challenge. SSLW extends the theory of support vector machines to effectively identify the co-occurrence patterns that are most informative to the categorization margin and to ignore those that are irrelevant to the categorization task. Applied to text categorization with a limited number of training samples, SSLW automatically estimates the word-correlation matrix by exploiting the word co-occurrence embedded in the weakly-related unlabeled corpus. Empirical studies show that SSLW significantly improves both the accuracy and the reliability of text categorization, given a small training pool and additional unlabeled data that is only weakly related to the test bed. Although SSLW is presented in the context of text categorization, it can potentially facilitate classification tasks in a variety of domains. In future work, we will evaluate the benefits of SSLW on larger data sets and in other domains.
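The co-occurrence idea underlying SSLW can be illustrated with a minimal sketch. This is not the SSLW optimization itself (which learns the correlation matrix jointly with the max-margin classifier); the toy corpus and all variable names below are hypothetical. The point is that two words which never co-occur directly can still become correlated through a shared context word, so a correlation-weighted similarity can connect documents that share no terms at all.

```python
# Illustrative sketch only: shows how word co-occurrence in an unlabeled
# corpus yields a word-correlation matrix, not the paper's max-margin method.
import numpy as np

def cooccurrence_correlation(doc_term: np.ndarray) -> np.ndarray:
    """Estimate a word-correlation matrix from a document-term count matrix.

    Each word is represented by its co-occurrence profile (a row of the
    first-order co-occurrence matrix); words with similar profiles become
    correlated even if they never appear in the same document.
    """
    c = doc_term.T @ doc_term                 # first-order co-occurrence counts
    norms = np.linalg.norm(c, axis=1).astype(float)
    norms[norms == 0] = 1.0                   # guard words with empty profiles
    p = c / norms[:, None]                    # unit-length co-occurrence profiles
    return p @ p.T                           # symmetric, diagonal equal to 1

# Toy unlabeled corpus: rows = documents, columns = word counts for the
# hypothetical vocabulary ["wheat", "corn", "harvest", "trade"].
unlabeled = np.array([
    [2, 0, 1, 0],   # document about wheat harvests
    [0, 3, 1, 0],   # document about corn harvests
    [0, 0, 0, 2],   # document about trade
])
R = cooccurrence_correlation(unlabeled)

# "wheat" and "corn" never co-occur directly, but both co-occur with
# "harvest", so the correlation-weighted similarity x^T R y is positive
# while the plain bag-of-words dot product x^T y is zero.
x = np.array([1, 0, 0, 0])  # labeled document mentioning only "wheat"
y = np.array([0, 1, 0, 0])  # test document mentioning only "corn"
print(x @ R @ y, x @ y)     # positive correlation-weighted similarity vs. 0
```

This sketch only shows why co-occurrence statistics from a weakly-related corpus can carry over to the target task: as the paper assumes, word usage patterns tend to be consistent across corpora even when the topics differ.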
We will also investigate SSLW's dependence on the number of eigenvectors used, and its behavior as the number of labeled training examples varies.

Acknowledgments

This work was supported by the National Science Foundation (IIS-0643494) and the National Institutes of Health (1R01GM079688-01). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or the NIH.