CN107943856A - Text classification method and system based on expanded labeled samples

Info

Publication number
CN107943856A
CN107943856A (application CN201711086110.4A)
Authority
CN
China
Prior art keywords
text
samples
sample
marked
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711086110.4A
Other languages
Chinese (zh)
Inventor
沈雅婷 (Shen Yating)
汪云云 (Wang Yunyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201711086110.4A
Publication of CN107943856A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/103 - Formatting, i.e. changing of presentation of documents
    • G06F40/117 - Tagging; Marking up; Designating a block; Setting of attributes

Abstract

The present invention proposes a text classification method based on expanded labeled samples. First, a real sample data set containing labeled and unlabeled text samples is collected. High-confidence samples are then found with the clustering method KFCM to obtain expanded labeled samples. A unified classification objective function over the labeled, unlabeled and expanded sample data is then formulated from the squared loss function, its regularization parameters, kernel function and other parameters are set, and a text classification function is learned. Finally, the text data to be classified is input and classified with the text classification function to obtain the category of the text. The invention also proposes a text classification system. Compared with other classical classification algorithms and related algorithms, the error rate on the test set is significantly reduced, solving the problem of low classification accuracy in the prior art on texts whose labeled samples are few and inaccurate; the features used are the words with the highest mutual information with the class variable.

Description

Text classification method and system based on expanded labeled samples
Technical Field
The invention belongs to the field of text classification processing, and particularly relates to application of pattern recognition and machine learning in the field of data mining.
Background
The text classification problem is not fundamentally different from other classification problems: the method can be summarized as matching the data to be classified against certain characteristics, and since a complete match is unlikely, an optimal matching result must be selected (according to some evaluation criterion) to complete the classification. The selection and training of the classifier, and the evaluation of and feedback on the classification results, are therefore very important. Text classification is a basic task in machine learning.
Text classification can be divided into two broad categories: supervised and semi-supervised. In supervised classification all text samples carry labels; in semi-supervised classification only some of the text samples do. In practice, unlabeled text is cheaper and easier to obtain than labeled text, so from the perspective of usable information, semi-supervised text classification is in strong demand in real-world applications and has attracted considerable attention: learning from labeled and unlabeled text together achieves better performance than using labeled text alone. The manifold regularization method MR is a deeply studied and frequently used semi-supervised classification method; through a regularization term it constrains samples that are similar on the manifold structure graph to have similar classification outputs.
However, the labeled samples are randomly selected and may fall, for example, in the border region, or even in the region of the opposite category. Labels propagating from such samples to their neighbors may mislead MR classification even though the structure of the unlabeled samples is taken into account. An example is shown in FIG. 1, where the unlabeled samples of the two classes are drawn with distinct markers, the corresponding labeled samples are highlighted, and the decision boundary of MR is drawn for comparison with the true boundary. It can readily be observed from FIG. 1 that the labeled samples in class 1 lie closer to the class boundary than those in class 2. In particular, point x1 lies in the overlapping region of the two classes. In this case the labeled samples may "mislead" the classification, so that the decision boundary of MR is pushed toward class 2 and thus deviates from the true boundary. MR does, of course, also take the structure of the unlabeled samples into account: assigning less weight to labeled samples and more weight to unlabeled samples may yield a more realistic boundary, but the choice of regularization parameters in semi-supervised learning remains an open problem. The location of the labeled samples is therefore crucial for MR classification, yet in semi-supervised classification labeled samples tend to be scarce and randomly selected. Once the labeled samples are somewhat misleading, MR performance may be unsatisfactory.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems and deficiencies of the prior art, the object of the invention is to perform text classification on a text data set by expanding the labeled sample set so as to reduce the influence of "misleading" labeled samples, thereby solving the problem of low classification accuracy in the prior art on texts whose labeled samples are few and inaccurate.
The invention adopts the following technical scheme for solving the technical problems:
a semi-supervised manifold regularization text classification method based on an expansion mark sample comprises the following steps:
step 1, collecting a text real sample data set which comprises marked text samples and unmarked text samples, wherein the marked text samples comprise text category labels;
step 2, acquiring the membership information of all texts through a clustering algorithm, selecting high-confidence text samples according to the clustering membership, and forming an expanded labeled text sample set from the high-confidence text samples and their class labels;
step 3, based on the manifold regularization method MR, uniformly setting an objective function for the labeled text samples, the unlabeled text samples and the expanded labeled text sample data according to the squared loss function, training the objective function with the expanded labeled samples obtained in step 2 to obtain an optimal regularization parameter and kernel function, and obtaining the final text classification function;
step 4, inputting text data to be classified, classifying by using the text classification function obtained in the step 3, and obtaining the text category: useful text and useless text.
Furthermore, in the method provided by the invention, in step 2 the kernel fuzzy clustering algorithm KFCM is adopted to obtain the text membership information. Let the clustering membership matrix obtained from KFCM be $U = \begin{bmatrix} u_{+1} & \cdots & u_{+n} \\ u_{-1} & \cdots & u_{-n} \end{bmatrix}$, with one row per cluster and one column per sample.

First, for either of the two rows of the matrix, the number of labeled samples whose class membership in that row matches their actual class is counted; this determines the class of that row, and hence the class of the other row.

Then, in the row whose class is positive, unlabeled text samples with $u_{+i} \ge \delta$ or $u_{+i} \le 1-\delta$ are selected as high-confidence text samples, where $\delta \in [0.5, 1]$ is a threshold, $i = 1, \ldots, n$, and n is the total number of samples. The expanded labeled text sample set is formed from the high-confidence text samples and their obtained cluster (category) labels.
Further, in the method provided by the present invention, in step 2, after clustering, for each cluster, if the number of positive labeled samples is greater than the number of negative labeled samples, the cluster is regarded as a positive cluster or a positive class, otherwise, the cluster is regarded as a negative cluster or a negative class; thus, positive and negative clusters are obtained consistent with the true positive and negative categories.
Further, in the method provided by the present invention, the objective function in step 3 is:

$$f^* = \arg\min_{f \in H_K} \frac{1}{n_l}\sum_{i=1}^{n_l}\big(y_i - f(x_i)\big)^2 + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f} + \gamma_P \sum_{i=n_l+1}^{n}\big(p_i - f(x_i)\big)^2$$

where f is the classification function to be solved, located in the reproducing kernel Hilbert space (RKHS) $H_K$ defined by the kernel function K; $\gamma_A$ and $\gamma_I$ are regularization parameters and $\gamma_P$ is the KFCM parameter; $x_i \in R^d$, $y_i \in \{+1,-1\}$, $n_u = n - n_l$, where $n_u$ is the number of unlabeled samples, $n_l$ the number of labeled samples and n the total number of samples; $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{\mathrm T}$; L is the graph Laplacian given by L = D - W, W is the weight matrix of the graph G, and D is the diagonal matrix formed from the diagonal components $D_{ii} = \sum_{j=1}^{n} W_{ij}$; the weight $W_{ij}$ represents the similarity between connected samples $x_i$ and $x_j$, and $f(x_i)$ is the text class assigned to a labeled sample after classification. The regularization term $\|f\|_K^2$ controls the complexity of the classification surface to avoid over-learning. The fourth term is the classification loss of the expanded labeled text samples, where $p_i$ denotes the label assigned to each unlabeled sample, defined as

$$p_i = \begin{cases} +1, & u_{+i} \ge \delta \\ -1, & u_{+i} \le 1-\delta \\ 0, & \text{otherwise} \end{cases}$$

where $u_{+i}$ and $u_{-i}$ are the positive and negative membership degrees of sample $x_i$, respectively, with $u_{+i} + u_{-i} = 1$, $i = 1, \ldots, n$, and n the total number of samples.
Further, in the method provided by the present invention, in step 3, for a given text sample x, the text classification function is:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

where

$$\alpha^* = \Big(JK + \gamma_A n_l I + \frac{\gamma_I n_l}{(n_l+n_u)^2} LK + \gamma_P n_l J_P K\Big)^{-1}\big(Y + \gamma_P n_l P\big)$$

in which K is the $(n_l+n_u)\times(n_l+n_u)$ kernel matrix; Y is the $(n_l+n_u)$-dimensional label vector given by $Y = [y_1, \ldots, y_{n_l}, 0, \ldots, 0]^{\mathrm T}$; P is the $(n_l+n_u)$-dimensional label vector given by $P = [0, \ldots, 0, p_{n_l+1}, \ldots, p_{n_l+n_u}]^{\mathrm T}$, where $0, \ldots, 0$ denotes a vector whose elements are all 0; I is the $(n_l+n_u)\times(n_l+n_u)$ identity matrix; J = diag(1, ..., 1, 0, ..., 0) is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix whose first $n_l$ diagonal entries are 1 and whose remaining entries are 0; and $J_P$ is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix obtained by taking abs(·), the absolute value of each element of P: $J_P = \mathrm{diag}(\mathrm{abs}(P))$.
Further, in the method provided by the present invention, optimal regularization parameters and kernel function are set in step 3: the regularization parameters $\gamma_I$ and $\gamma_A$ are set to 1 and 0.1, respectively, the parameter δ is set to 0.98, the kernel function is the Gaussian kernel (RBF) with its parameter set to 0.5, and the KFCM parameter is set to $\gamma_P = 0.3$; when $\gamma_P = 0$ the formula degenerates to MR, and the number of labeled text samples is fixed at 10.
Further, in the method provided by the invention, step 1 is to collect text real data in the UCI public data set and the benchmark data set.
Furthermore, in the method provided by the invention, the real text sample data set collected in step 1 comprises a number of web pages. First, only the text content of the web pages is used and the link information is ignored, and the bag-of-words vector representation of each document is constructed from the top 3000 words, i.e. skipping the HTML header; second, TFIDF mapping is applied and the feature vectors are normalized to unit length.
The invention also provides a text classification system, which comprises:
the sample acquisition module is used for acquiring a text real sample data set which comprises marked text samples and unmarked text samples, wherein the marked text samples comprise text category labels;
the extended marked sample acquisition module is used for acquiring all text membership degree information through a clustering algorithm, selecting high-reliability text samples according to the clustering membership degree, and forming an extended marked text sample set by using the high-reliability text samples and class labels thereof;
the classification function calculation module is used for uniformly setting a target function for the marked text sample, the unmarked text sample and the extended marked text sample data according to a square loss function based on a popular regularization method MR, training the target function by using the extended marked sample to obtain an optimal regularization parameter and a kernel function, and obtaining a final text classification function;
the text classification module is used for inputting text data to be classified, classifying the data by using a text classification function and outputting the classification of the text: useful text and useless text.
By adopting the technical scheme, compared with the prior art, the invention has the following technical effects:
the invention provides a novel label extended MR framework (LE _ MR for short), which introduces an extended label sample set into MR learning, namely constructs an optimization problem of the LE _ MR based on labeled, unlabeled and extended high-reliability sample sets, thereby relieving the problem of lack of labeled samples and improving the text classification performance.
Meanwhile, the invention carries out comparison experiments on a plurality of data sets with other classical classification algorithms and similar text classification algorithms, wherein the data sets comprise UCI public data sets (text characteristic data) and benchmark data sets. Experiments show that LE _ MR can improve the accuracy of text classification compared with other advanced semi-supervised classification methods at present to obtain encouraging results.
Drawings
FIG. 1 is a diagram comparing LE_MR and MR.
Fig. 2 is a flow chart of an implementation method of the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The invention provides a novel semi-supervised text classification method. Consider text classification on a text data set containing a number of web pages; the task is to classify these pages into two categories, useful and useless, i.e. positive and negative. Only the text content of each web page is used, ignoring the link information. The bag-of-words vector representation of a document is constructed from the top 3000 words (skipping the HTML header), chosen as the words with the highest mutual information with the class variable, followed by TFIDF mapping. The feature vectors are normalized to unit length.
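As a rough illustration of this preprocessing (the patent fixes no implementation), the following Python sketch builds the 3000-word TF-IDF representation with scikit-learn. Note that `max_features` selects words by corpus frequency rather than by mutual information with the class variable, so the feature selection here is only an approximation; the corpus and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# pages: plain-text bodies of the web pages (link information already discarded).
# The two strings below are stand-ins; any list of document strings works.
pages = ["useful page body text ...", "useless page body text ..."]

# Top 3000 words, TF-IDF weighting; norm="l2" scales each vector to unit length.
vectorizer = TfidfVectorizer(max_features=3000, norm="l2")
X = vectorizer.fit_transform(pages)   # sparse (n_pages x 3000) feature matrix
```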
The labeled samples are randomly selected and may fall, for example, in the border region, or even in the region of the opposite category; MR classification may then be misled by the propagation of labels from these samples to their neighbors. To solve this problem, the invention proposes a new label-expanded MR framework (LE_MR) for semi-supervised text classification. In LE_MR, the clustering method KFCM first finds high-confidence samples, e.g. samples in the cluster center regions. These samples, labeled from the cluster index, are then employed to expand the set of labeled text samples. Based on the squared loss function, the invention develops a unified objective function over the labeled, unlabeled and expanded high-confidence text samples to represent the problem.
As shown in the flowchart of FIG. 2, the invention provides a semi-supervised manifold regularization text classification method based on expanded labeled samples, which can be used for training, classification and recognition on texts whose labeled samples are few and inaccurate. The invention is realized by the following technical scheme, comprising the following steps:
the first step: collect a text data set under real conditions, comprising text data samples and corresponding text category labels, and divide it into a training set and a test set. The text data set comprises a number of web pages, and the task is to classify the pages into two classes, useful and useless, i.e. positive and negative. Only the text content of each page is used, ignoring link information; the bag-of-words vector representation of a document is constructed from the top 3000 words (skipping the HTML header), chosen as the words with the highest mutual information with the class variable, followed by TFIDF mapping; the feature vectors are normalized to unit length;
the second step is that: acquiring a high-reliability text sample through text membership information obtained by a fuzzy kernel clustering algorithm KFCM to expand a marked text sample set;
the third step: setting parameters such as regularization parameters and kernel functions in the LE _ MR, and training the LE _ MR by using the extended marked text sample set obtained in the second step; obtaining an optimal regularization parameter and a kernel function to obtain a final text classification function f (x);
the fourth step: inputting text data to be classified, classifying the text data by using a text classification function f (x), and outputting the classification of the text: useful text and useless text.
The first step is as follows:
the first step is simple, and the text data under the real condition is collected, wherein the text data comprises a UCI data set (text characteristic data), a benchmark data set and the like, the UCI data set comprises text data samples and corresponding text category labels, and a training set and a test set are divided.
The second step is specifically as follows:
firstly, acquiring text membership in a training set through FCM (KFCM) based on a kernel. FCM is a classical clustering method that divides data into clusters and instances in the same cluster are as similar as possible. In addition, a variation of FCM is proposed, called KFCM-I, which attempts to improve learning ability through kernel skills. In KFCM-I, example X = { X = 1 ,…,x N Is first mapped to feature space (kernel space) by Mercer mapping Φ. Then at the mapped sample { Φ (x) 1 ),…,Φ(x N ) And directly realizing clustering.
The formula of KFCM-I is:
where V is the cluster center, U is the membership matrix, c is the number of clusters, N is the total number of input samples, x k Is the input sample characteristic, m is the blur index,is the center of the cluster in the kernel space, andis description of x k Cluster membership of the likelihood of belonging to the ith cluster.
However, because the kernel map is difficult to express explicitly, researchers proposed the algorithm KFCM-II. KFCM-II replaces the similarity measure $\|\Phi(x_k) - v_i^{\Phi}\|^2$ with the kernel-induced distance $1 - K(x_k, v_i)$, where $K(x_k, v_i)$ measures, in the kernel space, the closeness between instance $x_k$ and cluster center $v_i$. The new objective function then becomes:

$$\min_{U,V} J_m(U, V) = 2\sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^m \big(1 - K(x_k, v_i)\big)$$

Based on the Lagrange method, the cluster membership of an instance and the cluster centers are, respectively:

$$u_{ik} = \frac{\big(1 - K(x_k, v_i)\big)^{-1/(m-1)}}{\sum_{j=1}^{c}\big(1 - K(x_k, v_j)\big)^{-1/(m-1)}}$$

and

$$v_i = \frac{\sum_{k=1}^{N} u_{ik}^m K(x_k, v_i)\, x_k}{\sum_{k=1}^{N} u_{ik}^m K(x_k, v_i)}$$

In KFCM-II, the cluster centers V from FCM are taken as the initial estimate, and the iteration terminates when $\|V(t+1) - V(t)\| \le \varepsilon$, where t is the iteration count, $\varepsilon > 0$, and t starts from 0.
Finally, the algorithm of KFCM-II is summarized as follows:
1) Take the cluster centers from FCM as the initial cluster centers of KFCM-II;
2) Compute $K_t(x_k, v_i)$ (the Gram matrix);
3) Compute the cluster membership matrix U(t) by

$$u_{ik}(t) = \frac{\big(1 - K_t(x_k, v_i)\big)^{-1/(m-1)}}{\sum_{j=1}^{c}\big(1 - K_t(x_k, v_j)\big)^{-1/(m-1)}}$$

4) Compute the cluster centers V(t+1) by

$$v_i(t+1) = \frac{\sum_{k=1}^{N} u_{ik}^m(t)\, K_t(x_k, v_i)\, x_k}{\sum_{k=1}^{N} u_{ik}^m(t)\, K_t(x_k, v_i)}$$

5) Terminate when $\|V(t+1) - V(t)\| \le \varepsilon$; otherwise set t → t+1 and return to step 2).
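The KFCM-II iteration above can be sketched in Python as follows. This is a minimal reading of steps 1)-5) assuming a Gaussian kernel (the kernel the patent later selects); the function and parameter names are illustrative, and the initial centers V0 are expected to come from a plain FCM run.

```python
import numpy as np

def gaussian_kernel(X, V, sigma=0.5):
    """K(x_k, v_i) for all sample-center pairs; rows index samples, columns centers."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kfcm_ii(X, V0, m=2.0, eps=1e-4, max_iter=100):
    """KFCM-II sketch. X: (N, d) samples; V0: (c, d) initial centers from FCM."""
    V = V0.copy()
    for _ in range(max_iter):
        K = gaussian_kernel(X, V)                  # step 2): Gram matrix K_t(x_k, v_i)
        d = np.clip(1.0 - K, 1e-12, None)          # kernel-induced distance 1 - K
        w = d ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=1, keepdims=True)       # step 3): memberships, rows sum to 1
        num = (U ** m * K).T @ X                   # step 4): weighted center update
        den = (U ** m * K).sum(axis=0)[:, None]
        V_new = num / den
        if np.linalg.norm(V_new - V) <= eps:       # step 5): ||V(t+1) - V(t)|| <= eps
            return U.T, V_new                      # U returned as (clusters x samples)
        V = V_new
    return U.T, V
```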
After clustering, for each cluster, if the number of positively labeled instances is greater than the number of negatively labeled instances, the cluster is regarded as a positive cluster or positive class; otherwise it is regarded as a negative class. The positive and negative clusters thus obtained are consistent with the true positive and negative categories. High-confidence samples are then selected according to the clustering membership.
Specifically, the text membership information is obtained with the kernel fuzzy clustering algorithm KFCM. Let the clustering membership matrix obtained from KFCM be $U = \begin{bmatrix} u_{+1} & \cdots & u_{+n} \\ u_{-1} & \cdots & u_{-n} \end{bmatrix}$, one row per cluster and one column per sample.
First, the positive and negative text classes of the upper and lower rows of the clustering membership matrix are determined by counting, for the labeled samples, how often the class membership in each row matches the actual class of the sample.
In the upper row, the number of labeled positive samples with membership ≥ 0.9 is denoted S; in the lower row, the number of labeled positive samples with membership ≥ 0.9 is denoted X. (Equivalently, in the upper row the number of labeled negative samples with membership ≤ 0.1 may be denoted S, and in the lower row the number of labeled negative samples with membership ≤ 0.1 denoted X.)
If S > X, the upper row holds the positive-class memberships $u_{+i}$; otherwise the lower row holds the positive-class memberships $u_{+i}$.
If the upper row is the positive row, only the memberships of that row are considered: an unlabeled sample with $u_{+i} \ge \delta$ is taken as a high-confidence text sample labeled +1, i.e. an expanded positive labeled sample; an unlabeled sample with $u_{+i} \le 1-\delta$ is taken as a high-confidence text sample labeled -1, i.e. an expanded negative labeled sample; an unlabeled sample with $1-\delta < u_{+i} < \delta$ remains unlabeled; and originally labeled samples remain labeled samples.
Similarly, if the lower row is the positive row, only the memberships of that row are considered, and the same rules are applied.
Finally, the expanded labeled text sample set is formed from the high-confidence text samples and their category labels, as sketched below. Here δ is a threshold in [0.5, 1].
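A minimal sketch of this expansion rule, assuming the 2 x n membership matrix U from the KFCM-II sketch above and the row-selection count described in the text (the function and variable names are illustrative):

```python
import numpy as np

def expand_labels(U, labeled_idx, y_labeled, delta=0.98):
    """U: 2 x n KFCM membership matrix (rows are the two clusters).
    Returns p with p_i = +1/-1 for expanded samples, 0 where still unlabeled."""
    # S, X: per row, how many known positive samples get membership >= 0.9 there.
    S = sum(1 for j, y in zip(labeled_idx, y_labeled) if y == +1 and U[0, j] >= 0.9)
    X = sum(1 for j, y in zip(labeled_idx, y_labeled) if y == +1 and U[1, j] >= 0.9)
    u_pos = U[0] if S > X else U[1]        # row holding the positive-class memberships

    p = np.zeros(U.shape[1])
    p[u_pos >= delta] = +1                 # expanded positive labeled samples
    p[u_pos <= 1 - delta] = -1             # expanded negative labeled samples
    p[labeled_idx] = y_labeled             # originally labeled samples keep their labels
    return p
```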
The third step is specifically as follows:
and training an improved semi-supervised manifold regularization framework LE _ MR by using the extended text mark samples acquired in the second step.
MR proposes a semi-supervised manifold regularization classification framework by Belkin et al, which provides a powerful framework for semi-supervised classification. It propagates labels by going from labeled samples to unlabeled samples so that similar samples based on manifolds have similar classification outputs.
Given labeled data $\{x_i\}_{i=1}^{n_l}$ with corresponding labels $\{y_i\}_{i=1}^{n_l}$ and unlabeled data $\{x_i\}_{i=n_l+1}^{n}$, where $x_i \in R^d$, $y_i \in \{+1,-1\}$ and $n_u = n - n_l$. Over the whole data set a Laplacian graph G is pre-assigned, in which each weight $W_{ij}$ represents the similarity between connected instances $x_i$ and $x_j$. The optimization objective of conventional MR is:

$$\min_{f \in H_K}\; \frac{1}{n_l}\sum_{i=1}^{n_l} V(x_i, y_i, f) \;+\; \gamma_A \|f\|_K^2 \;+\; \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f}$$

where the first term is the classification loss on the labeled samples; V(·,·,·) can be any form of loss function, such as the hinge loss max{0, 1 - y_i f(x_i)} used by the support vector machine (SVM) or the squared loss $(y_i - f(x_i))^2$ used by the regularized least-squares classifier (RLSC). The second term represents the smoothness and complexity of the classification function and is used to control the complexity of the classification surface to avoid over-fitting. The third term is a smoothness regularizer over all samples, describing the smoothness over all labeled and unlabeled samples, with $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{\mathrm T}$. Here L is the graph Laplacian given by L = D - W, W is the weight matrix of the graph G, and D is the diagonal matrix formed from the diagonal components $D_{ii} = \sum_{j=1}^{n} W_{ij}$, as sketched below.
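A small sketch of the graph construction, assuming a fully connected Gaussian-weighted graph (the patent states only that $W_{ij}$ is a similarity, so the weighting here is one common choice rather than the prescribed one):

```python
import numpy as np

def graph_laplacian(X, sigma=0.5):
    """L = D - W for a fully connected Gaussian-weighted similarity graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    W = np.exp(-d2 / (2 * sigma ** 2))                    # W_ij: similarity of x_i, x_j
    np.fill_diagonal(W, 0.0)                              # no self-loops
    D = np.diag(W.sum(axis=1))                            # D_ii = sum_j W_ij
    return D - W
```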
According to the Representer Theorem, the solution of the above problem can be expressed in the form

$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$$

where $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ are the expansion coefficients and $K: X \times X \to R$ is a Mercer kernel.
The labeled samples in semi-supervised learning are randomly selected and may fall in the border region, or even in the region of the opposite category. Hence in MR, when labels are propagated from these samples to their neighbors, the final classification may be misled even though the structure of the unlabeled samples is taken into account, and the classification effect may still be unsatisfactory. To solve this problem, the invention proposes a new text label-expanded MR framework (LE_MR for short), described in the remainder of this section. In LE_MR, the problems caused by misleading labeled instances are alleviated by expanding the labeled text sample set with high-confidence text samples obtained from cluster learning.
Given labeled data $\{x_i\}_{i=1}^{n_l}$ with corresponding labels $\{y_i\}_{i=1}^{n_l}$ and unlabeled data $\{x_i\}_{i=n_l+1}^{n}$, where $x_i \in R^d$, $y_i \in \{+1,-1\}$ and $n_u = n - n_l$; here $n_u$ is the number of unlabeled samples, $n_l$ the number of labeled samples and n the total number of samples. Expanded labeled samples are obtained from the unlabeled data by clustering, with the number of clusters set to 2. The selected instances are typically located in the cluster center regions, and they are used to expand the labeled text sample set.
Based on MR, a unified LE_MR objective function is formulated from the squared loss function over the labeled, unlabeled and expanded labeled text sample data:

$$f^* = \arg\min_{f \in H_K} \frac{1}{n_l}\sum_{i=1}^{n_l}\big(y_i - f(x_i)\big)^2 + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f} + \gamma_P \sum_{i=n_l+1}^{n}\big(p_i - f(x_i)\big)^2$$

where f is the classification function to be solved, located in the reproducing kernel Hilbert space (RKHS) $H_K$ defined by the kernel function K; $\gamma_A$ and $\gamma_I$ are regularization parameters and $\gamma_P$ is the KFCM parameter. The fourth term is the classification loss of the expanded labeled samples, where each $p_i$ is the label assigned to unlabeled sample $x_i$, defined as:

$$p_i = \begin{cases} +1, & u_{+i} \ge \delta \\ -1, & u_{+i} \le 1-\delta \\ 0, & \text{otherwise} \end{cases}$$

where $u_{+i}$ and $u_{-i}$ are the positive and negative membership degrees of sample $x_i$, respectively, with $u_{+i} + u_{-i} = 1$, $i = 1, \ldots, n$, and n the total number of samples. LE_MR thus removes the misleading effect caused by the limited labeled text samples by expanding the labeled text sample set with the high-membership text samples produced by clustering. When $\gamma_P = 0$, LE_MR cannot use the expanded labeled text samples and degenerates to MR.
From the Representer theorem, the solution of the above optimization objective can be characterized as:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

where $\alpha = [\alpha_1, \ldots, \alpha_{n_l+n_u}]^{\mathrm T}$. Substituting this expansion and solving the optimization problem by the Lagrange method, the objective function can be written as:

$$\alpha^* = \arg\min_{\alpha}\; \frac{1}{n_l}(Y - JK\alpha)^{\mathrm T}(Y - JK\alpha) + \gamma_A\, \alpha^{\mathrm T} K \alpha + \frac{\gamma_I}{(n_l+n_u)^2}\,\alpha^{\mathrm T} K L K \alpha + \gamma_P (P - J_P K\alpha)^{\mathrm T}(P - J_P K\alpha)$$

where K is the $(n_l+n_u)\times(n_l+n_u)$ kernel matrix over the labeled and unlabeled points; Y is the $(n_l+n_u)$-dimensional label vector $Y = [y_1, \ldots, y_{n_l}, 0, \ldots, 0]^{\mathrm T}$; P is the $(n_l+n_u)$-dimensional label vector $P = [0, \ldots, 0, p_{n_l+1}, \ldots, p_{n_l+n_u}]^{\mathrm T}$, where $0, \ldots, 0$ denotes a vector whose elements are all 0; J = diag(1, ..., 1, 0, ..., 0) is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix whose first $n_l$ diagonal entries are 1 and the rest 0; and $J_P = \mathrm{diag}(\mathrm{abs}(P))$ is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix obtained by taking the absolute value abs(·) of each element of P.

Setting the derivative of the objective function with respect to α to zero gives the minimum, namely:

$$\frac{1}{n_l} J(JK\alpha - Y) + \gamma_A \alpha + \frac{\gamma_I}{(n_l+n_u)^2} LK\alpha + \gamma_P J_P (J_P K \alpha - P) = 0$$

thus:

$$\alpha^* = \Big(JK + \gamma_A n_l I + \frac{\gamma_I n_l}{(n_l+n_u)^2} LK + \gamma_P n_l J_P K\Big)^{-1}\big(Y + \gamma_P n_l P\big)$$
and setting corresponding parameters therein, including a regularization parameter γ I And gamma A Set to 1 and 0.1, respectively, and the parameter δ is set to 0.98. The kernel function selects a Gaussian kernel rbf with the parameter set to 0.5, and the parameter gamma of KFCM is adjusted P =0.3, when gamma P The formula degenerates to MR when =0, the number of marked text samples is fixed to 10, and the operation mode is set as cross validation.
Finally, for a given text sample x, its classification function is:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

where $\alpha^*$ is given by the closed-form solution above.
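Since the closed-form expression above is reconstructed from damaged text, the following Python sketch should be read as the LapRLS-style linear system it implies, not a verbatim transcription; the function names are hypothetical, and the assembly of the system follows the definitions of K, L, Y, P, J and $J_P$ given above.

```python
import numpy as np

def le_mr_fit(K, L, y_l, p, gamma_A=0.1, gamma_I=1.0, gamma_P=0.3):
    """Solve the LE_MR linear system sketched above.
    K: (n x n) kernel matrix over labeled + unlabeled points (labeled first),
    L: graph Laplacian, y_l: labels of the first n_l points,
    p: expanded label vector (0 on labeled positions and where unassigned)."""
    n, n_l = K.shape[0], len(y_l)
    Y = np.concatenate([y_l, np.zeros(n - n_l)])      # Y = [y_1..y_nl, 0..0]
    J = np.diag(np.r_[np.ones(n_l), np.zeros(n - n_l)])
    J_p = np.diag(np.abs(p))                          # J_P = diag(abs(P))
    A = (J @ K + gamma_A * n_l * np.eye(n)
         + (gamma_I * n_l / n**2) * (L @ K)
         + gamma_P * n_l * (J_p @ K))
    return np.linalg.solve(A, Y + gamma_P * n_l * p)  # alpha*

def le_mr_predict(alpha, K_new):
    """f(x) = sum_i alpha_i K(x_i, x); K_new has one row per new sample and
    one column per training sample."""
    return K_new @ alpha
```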
To evaluate the performance of the LE_MR proposed by the invention, it is compared with currently advanced methods, including the support vector machine (SVM), the transductive support vector machine (TSVM) and manifold regularization (MR), on the UCI and benchmark data sets. The invention also reports experimental results for the label-expanded SVM (LE_SVM), an extension of the SVM that uses the expanded labeled text samples.
For LE_MR, the Gaussian kernel parameter is set to 0.5. With 10 labeled instances, the regularization parameters $\gamma_I$ and $\gamma_A$ are set to 1 and 0.1, respectively, in all compared methods; the parameter δ is set to 0.98 and the regularization parameter $\gamma_P$ to 0.3.
For each real text data set, the experiment is run with cross-validation. Specifically, each text data set is randomly and evenly divided into 10 groups; one group is then drawn at random as the test set and the rest serve as the training set. Within the training set, only 10 labeled samples are selected, the remainder being unlabeled. This learning process is repeated 10 times (a sketch of the protocol follows below) and the average results are recorded in Table 1, where the best performance in each row is shown in bold.
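A rough sketch of this evaluation protocol; the helper `evaluate_le_mr` is hypothetical and stands for fitting LE_MR on the training indices and measuring the test error.

```python
import numpy as np

def split_indices(n, rng, n_labeled=10):
    """One repetition: hold out a random tenth as the test set, keep labels on
    only n_labeled training points and treat the rest as unlabeled."""
    idx = rng.permutation(n)
    test, train = idx[: n // 10], idx[n // 10:]
    labeled = rng.choice(train, size=n_labeled, replace=False)
    return train, test, labeled

rng = np.random.default_rng(0)
# Repeat 10 times and average the test error, as in the protocol above:
# errors = [evaluate_le_mr(X, y, *split_indices(len(y), rng)) for _ in range(10)]
```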
TABLE 1
The results in Table 1 show that the new method proposed by the invention is very effective. Several conclusions can be drawn from these results:
1) LE_SVM performs better than SVM on most text data sets; in particular, it outperforms SVM on two text data sets. The labeled samples are thus effectively augmented by exploiting the membership information of the unlabeled samples: LE_SVM does improve the performance of SVM.
2) MR performs better than SVM on most text data sets; in particular, it outperforms SVM on three text data sets. By exploiting the membership of unlabeled samples, the semi-supervised learning method can therefore achieve better performance than the fully supervised method.
3) LE_MR performs best on nine text data sets, because it not only brings the expanded labeled samples into MR but also combines labeled and unlabeled samples in learning. The expanded labeled samples can thus improve MR learning, especially when the given labeled text samples are scarce or somewhat misleading.
The method provided by the invention effectively introduces expanded labeled samples into MR, giving a new semi-supervised text classification method. High-confidence samples are found from the membership information produced by KFCM, and the labeled sample set is then expanded with these high-confidence samples, thereby reducing the influence of "misleading" labeled samples.
The foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (9)

1. A semi-supervised manifold regularization text classification method based on expanded labeled samples, characterized by comprising the following steps:
step 1, collecting a text real sample data set, wherein the text real sample data set comprises a marked text sample and an unmarked text sample, and the marked text sample comprises a text category label;
step 2, acquiring all text membership information through a clustering algorithm, selecting high-reliability text samples according to the clustering membership, and forming an expansion mark text sample set by using the high-reliability text samples and class labels thereof;
step 3, based on the manifold regularization method MR, uniformly setting an objective function for the labeled text samples, the unlabeled text samples and the expanded labeled text sample data according to the squared loss function, training the objective function with the expanded labeled samples obtained in step 2 to obtain an optimal regularization parameter and kernel function, and obtaining the final text classification function;
step 4, inputting text data to be classified, classifying the text data by using the text classification function obtained in the step 3, and obtaining the category of the text: useful text and useless text.
2. The method of claim 1, wherein in step 2 the kernel fuzzy clustering algorithm KFCM is used to obtain the text membership information; let the clustering membership matrix obtained from KFCM be $U = \begin{bmatrix} u_{+1} & \cdots & u_{+n} \\ u_{-1} & \cdots & u_{-n} \end{bmatrix}$, with one row per cluster and one column per sample;
first, for either of the two rows of the matrix, the number of labeled samples whose class membership in that row matches their actual class is counted; this determines the class of that row, and hence the class of the other row;
then, in the row whose class is positive, unlabeled text samples with $u_{+i} \ge \delta$ or $u_{+i} \le 1-\delta$ are selected as high-confidence text samples, where $\delta \in [0.5, 1]$ is a threshold, $i = 1, \ldots, n$, and n is the total number of samples; the expanded labeled text sample set is formed from the high-confidence text samples and their obtained cluster or category labels.
3. The method according to claim 1 or 2, wherein in step 2, after clustering, for each cluster, if the number of positive labeled samples is greater than the number of negative labeled samples, the cluster is regarded as a positive cluster or a positive class, otherwise, the cluster is regarded as a negative cluster or a negative class; thus, the positive and negative clusters are obtained to be consistent with the real positive and negative categories.
4. The method of claim 1, wherein the objective function of step 3 is:

$$f^* = \arg\min_{f \in H_K} \frac{1}{n_l}\sum_{i=1}^{n_l}\big(y_i - f(x_i)\big)^2 + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f} + \gamma_P \sum_{i=n_l+1}^{n}\big(p_i - f(x_i)\big)^2$$

wherein f is the classification function to be solved, located in the reproducing kernel Hilbert space $H_K$ defined by the kernel function K; $\gamma_A$ and $\gamma_I$ are regularization parameters and $\gamma_P$ is the KFCM parameter; $x_i \in R^d$, $y_i \in \{+1,-1\}$, $n_u = n - n_l$, where $n_u$ is the number of unlabeled samples, $n_l$ the number of labeled samples and n the total number of samples; $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{\mathrm T}$; L is the graph Laplacian given by L = D - W, W is the weight matrix of the graph G, and D is the diagonal matrix formed from the diagonal components $D_{ii} = \sum_{j=1}^{n} W_{ij}$; the weight $W_{ij}$ represents the similarity between connected samples $x_i$ and $x_j$, and $f(x_i)$ is the text class assigned to a labeled sample after classification; the regularization term $\|f\|_K^2$ controls the complexity of the classification surface to avoid over-learning; the fourth term is the classification loss of the expanded labeled text samples, where $p_i$ denotes the label assigned to each unlabeled sample, defined as

$$p_i = \begin{cases} +1, & u_{+i} \ge \delta \\ -1, & u_{+i} \le 1-\delta \\ 0, & \text{otherwise} \end{cases}$$

wherein $u_{+i}$ and $u_{-i}$ are the positive and negative membership degrees of sample $x_i$, respectively, with $u_{+i} + u_{-i} = 1$, $i = 1, \ldots, n$, and n the total number of samples.
5. The method of claim 4, wherein in step 3, for a given text sample x, the text classification function is:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

wherein

$$\alpha^* = \Big(JK + \gamma_A n_l I + \frac{\gamma_I n_l}{(n_l+n_u)^2} LK + \gamma_P n_l J_P K\Big)^{-1}\big(Y + \gamma_P n_l P\big)$$

wherein K is the $(n_l+n_u)\times(n_l+n_u)$ kernel matrix; Y is the $(n_l+n_u)$-dimensional label vector given by $Y = [y_1, \ldots, y_{n_l}, 0, \ldots, 0]^{\mathrm T}$; P is the $(n_l+n_u)$-dimensional label vector given by $P = [0, \ldots, 0, p_{n_l+1}, \ldots, p_{n_l+n_u}]^{\mathrm T}$, where $0, \ldots, 0$ denotes a vector whose elements are all 0; I is the $(n_l+n_u)\times(n_l+n_u)$ identity matrix; J = diag(1, ..., 1, 0, ..., 0) is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix whose first $n_l$ diagonal entries are 1 and whose remaining entries are 0; and $J_P$ is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix obtained by taking abs(·), the absolute value of each element of P: $J_P = \mathrm{diag}(\mathrm{abs}(P))$.
6. The method according to claim 4, wherein optimal regularization parameters and kernel function are set in step 3: the regularization parameters $\gamma_I$ and $\gamma_A$ are set to 1 and 0.1, respectively, the parameter δ is set to 0.98, the kernel function is the Gaussian kernel (RBF) with its parameter set to 0.5, and the KFCM parameter is set to $\gamma_P = 0.3$; when $\gamma_P = 0$ the formula degenerates to MR, and the number of labeled text samples is fixed at 10.
7. The method of claim 1, wherein step 1 is collecting textual real data in the UCI public dataset and the benchmark dataset.
8. The method of claim 1, wherein the real text sample data set collected in step 1 comprises a number of web pages; first, only the text content of the web pages is used and the link information is ignored, and the bag-of-words vector representation of each document is constructed from the top 3000 words, i.e. skipping the HTML header; second, TFIDF mapping is applied to normalize the feature vectors to unit length.
9. A text classification system, comprising:
the sample acquisition module is used for acquiring a text real sample data set which comprises marked text samples and unmarked text samples, wherein the marked text samples comprise text category labels;
the extended marked sample acquisition module is used for acquiring all text membership degree information through a clustering algorithm, selecting high-reliability text samples according to the clustering membership degree, and forming an extended marked text sample set by using the high-reliability text samples and class labels thereof;
the classification function calculation module is used for uniformly setting an objective function for the labeled text samples, the unlabeled text samples and the expanded labeled text sample data according to the squared loss function based on the manifold regularization method MR, and training the objective function with the expanded labeled samples to obtain an optimal regularization parameter and kernel function, yielding the final text classification function;
the text classification module is used for inputting text data to be classified, classifying the data by using a text classification function and outputting the classification of the text: useful text and useless text.
CN201711086110.4A 2017-11-07 2017-11-07 Text classification method and system based on expanded labeled samples Pending CN107943856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711086110.4A CN107943856A (en) 2017-11-07 2017-11-07 Text classification method and system based on expanded labeled samples

Publications (1)

Publication Number Publication Date
CN107943856A true CN107943856A (en) 2018-04-20

Family

ID=61933485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711086110.4A Pending CN107943856A (en) 2017-11-07 2017-11-07 Text classification method and system based on expanded labeled samples

Country Status (1)

Country Link
CN (1) CN107943856A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222180A (en) * 2019-06-04 2019-09-10 江南大学 A kind of classification of text data and information mining method
CN110569856A (en) * 2018-08-24 2019-12-13 阿里巴巴集团控股有限公司 sample labeling method and device, and damage category identification method and device
CN110796262A (en) * 2019-09-26 2020-02-14 北京淇瑀信息科技有限公司 Test data optimization method and device of machine learning model and electronic equipment
CN111178042A (en) * 2019-12-31 2020-05-19 出门问问信息科技有限公司 Data processing method and device and computer storage medium
CN111310794A (en) * 2020-01-19 2020-06-19 北京字节跳动网络技术有限公司 Target object classification method and device and electronic equipment
CN111460156A (en) * 2020-03-31 2020-07-28 深圳前海微众银行股份有限公司 Sample expansion method, device, equipment and computer readable storage medium
CN111581380A (en) * 2020-04-29 2020-08-25 南京理工大学紫金学院 Single-point and double-point smooth combination manifold regularization semi-supervised text classification method
CN112363465A (en) * 2020-10-21 2021-02-12 北京工业大数据创新中心有限公司 Expert rule set training method, trainer and industrial equipment early warning system
CN112528030A (en) * 2021-02-09 2021-03-19 中关村科学城城市大脑股份有限公司 Semi-supervised learning method and system for text classification
CN113127605A (en) * 2021-06-17 2021-07-16 明品云(北京)数据科技有限公司 Method and system for establishing target recognition model, electronic equipment and medium
WO2022126810A1 (en) * 2020-12-14 2022-06-23 上海爱数信息技术股份有限公司 Text clustering method
CN115174251A (en) * 2022-07-19 2022-10-11 深信服科技股份有限公司 False alarm identification method and device for safety alarm and storage medium
US11809454B2 (en) 2020-11-21 2023-11-07 International Business Machines Corporation Label-based document classification using artificial intelligence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘宏 (Liu Hong): 通过标记样本与未标记样本学习文本分类规则 (Learning text classification rules from labeled and unlabeled samples), 《中国学位论文全文数据库》 (China Dissertations Full-text Database) *
尚耐丽 (Shang Naili): 半监督分类方法的研究 (Research on semi-supervised classification methods), 《计算机应用与软件》 (Computer Applications and Software) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20180420