CN107943856A - Text classification method and system based on expanded labeled samples

Info

Publication number
CN107943856A
CN107943856A (application CN201711086110.4A)
Authority
CN
China
Prior art keywords
text
samples
sample
marked
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711086110.4A
Other languages
Chinese (zh)
Inventor
沈雅婷 (Shen Yating)
汪云云 (Wang Yunyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201711086110.4A
Publication of CN107943856A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/103 - Formatting, i.e. changing of presentation of documents
    • G06F40/117 - Tagging; Marking up; Designating a block; Setting of attributes

Abstract

The present invention proposes a text classification method based on expanded labeled samples. First, a real sample data set containing labeled and unlabeled text samples is collected. High-confidence samples are then found with the clustering method KFCM to obtain expanded labeled samples. A unified classification objective function over the labeled, unlabeled and expanded sample data is then formulated from the squared loss function, its regularization parameters, kernel function and other parameters are set, and a text classification function is learned. Finally, the text data to be classified is input and classified with the text classification function to obtain the category of the text. The invention also proposes a text classification system. Compared with other classical classification algorithms and related algorithms, the error rate on the test set is significantly reduced, solving the problem of low classification accuracy in the prior art on texts whose labeled samples are few and inaccurate; the features used are the words with the highest mutual information with the class variable.

Description

Text classification method and system based on expanded labeled samples
Technical Field
The invention belongs to the field of text classification processing, and particularly relates to application of pattern recognition and machine learning in the field of data mining.
Background
The text classification problem is not fundamentally different from other classification problems: the method can be summarized as matching the data to be classified against certain characteristics, and since a complete match is unlikely, an optimal matching result must be selected (according to some evaluation criterion) to complete the classification. The selection and training of the classifier, and the evaluation of and feedback on the classification results, are therefore very important. Text classification is a basic task in machine learning.
Text classification can be divided into two broad categories: supervised and semi-supervised. In supervised classification all text samples carry labels; in semi-supervised classification only some of the text samples do. In practice, unlabeled text is cheaper and easier to obtain than labeled text, so from the perspective of usable information, semi-supervised text classification is in strong demand in real-world applications and has attracted considerable attention: learning from labeled and unlabeled text together achieves better performance than using labeled text alone. The manifold regularization method MR is a deeply studied and frequently used semi-supervised classification method; through a regularization term it constrains samples that are similar on the manifold structure graph to have similar classification outputs.
However, the labeled samples are randomly selected and may fall, for example, in the border region, or even in the region of the opposite category. Labels propagating from such samples to their neighbors may mislead MR classification even though the structure of the unlabeled samples is taken into account. An example is shown in FIG. 1, where the unlabeled samples of the two classes are drawn with distinct markers, the corresponding labeled samples are highlighted, and the decision boundary of MR is drawn for comparison with the true boundary. It can readily be observed from FIG. 1 that the labeled samples in class 1 lie closer to the class boundary than those in class 2. In particular, point x1 lies in the overlapping region of the two classes. In this case the labeled samples may "mislead" the classification, so that the decision boundary of MR is pushed toward class 2 and thus deviates from the true boundary. MR does, of course, also take the structure of the unlabeled samples into account: assigning less weight to labeled samples and more weight to unlabeled samples may yield a more realistic boundary, but the choice of regularization parameters in semi-supervised learning remains an open problem. The location of the labeled samples is therefore crucial for MR classification, yet in semi-supervised classification labeled samples tend to be scarce and randomly selected. Once the labeled samples are somewhat misleading, MR performance may be unsatisfactory.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems and deficiencies of the prior art, the object of the invention is to perform text classification on a text data set by expanding the labeled sample set so as to reduce the influence of "misleading" labeled samples, thereby solving the problem of low classification accuracy in the prior art on texts whose labeled samples are few and inaccurate.
The invention adopts the following technical scheme for solving the technical problems:
a semi-supervised manifold regularization text classification method based on an expansion mark sample comprises the following steps:
step 1, collecting a text real sample data set which comprises marked text samples and unmarked text samples, wherein the marked text samples comprise text category labels;
step 2, acquiring the membership information of all texts through a clustering algorithm, selecting high-confidence text samples according to the clustering membership, and forming an expanded labeled text sample set from the high-confidence text samples and their class labels;
step 3, based on the manifold regularization method MR, uniformly setting an objective function for the labeled text samples, the unlabeled text samples and the expanded labeled text sample data according to the squared loss function, training the objective function with the expanded labeled samples obtained in step 2 to obtain an optimal regularization parameter and kernel function, and obtaining the final text classification function;
step 4, inputting text data to be classified, classifying by using the text classification function obtained in the step 3, and obtaining the text category: useful text and useless text.
Furthermore, in the method provided by the invention, in step 2 the kernel fuzzy clustering algorithm KFCM is adopted to obtain the text membership information. Let the clustering membership matrix obtained from KFCM be $U = \begin{bmatrix} u_{+1} & \cdots & u_{+n} \\ u_{-1} & \cdots & u_{-n} \end{bmatrix}$, with one row per cluster and one column per sample.

First, for either of the two rows of the matrix, the number of labeled samples whose class membership in that row matches their actual class is counted; this determines the class of that row, and hence the class of the other row.

Then, in the row whose class is positive, unlabeled text samples with $u_{+i} \ge \delta$ or $u_{+i} \le 1-\delta$ are selected as high-confidence text samples, where $\delta \in [0.5, 1]$ is a threshold, $i = 1, \ldots, n$, and n is the total number of samples. The expanded labeled text sample set is formed from the high-confidence text samples and their obtained cluster (category) labels.
Further, in the method provided by the present invention, in step 2, after clustering, for each cluster, if the number of positive labeled samples is greater than the number of negative labeled samples, the cluster is regarded as a positive cluster or a positive class, otherwise, the cluster is regarded as a negative cluster or a negative class; thus, positive and negative clusters are obtained consistent with the true positive and negative categories.
Further, in the method provided by the present invention, the objective function in step 3 is:

$$f^* = \arg\min_{f \in H_K} \frac{1}{n_l}\sum_{i=1}^{n_l}\big(y_i - f(x_i)\big)^2 + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f} + \gamma_P \sum_{i=n_l+1}^{n}\big(p_i - f(x_i)\big)^2$$

where f is the classification function to be solved, located in the reproducing kernel Hilbert space (RKHS) $H_K$ defined by the kernel function K; $\gamma_A$ and $\gamma_I$ are regularization parameters and $\gamma_P$ is the KFCM parameter; $x_i \in R^d$, $y_i \in \{+1,-1\}$, $n_u = n - n_l$, where $n_u$ is the number of unlabeled samples, $n_l$ the number of labeled samples and n the total number of samples; $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{\mathrm T}$; L is the graph Laplacian given by L = D - W, W is the weight matrix of the graph G, and D is the diagonal matrix formed from the diagonal components $D_{ii} = \sum_{j=1}^{n} W_{ij}$; the weight $W_{ij}$ represents the similarity between connected samples $x_i$ and $x_j$, and $f(x_i)$ is the text class assigned to a labeled sample after classification. The regularization term $\|f\|_K^2$ controls the complexity of the classification surface to avoid over-learning. The fourth term is the classification loss of the expanded labeled text samples, where $p_i$ denotes the label assigned to each unlabeled sample, defined as

$$p_i = \begin{cases} +1, & u_{+i} \ge \delta \\ -1, & u_{+i} \le 1-\delta \\ 0, & \text{otherwise} \end{cases}$$

where $u_{+i}$ and $u_{-i}$ are the positive and negative membership degrees of sample $x_i$, respectively, with $u_{+i} + u_{-i} = 1$, $i = 1, \ldots, n$, and n the total number of samples.
Further, in the method provided by the present invention, in step 3, for a given text sample x, the text classification function is:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

where

$$\alpha^* = \Big(JK + \gamma_A n_l I + \frac{\gamma_I n_l}{(n_l+n_u)^2} LK + \gamma_P n_l J_P K\Big)^{-1}\big(Y + \gamma_P n_l P\big)$$

in which K is the $(n_l+n_u)\times(n_l+n_u)$ kernel matrix; Y is the $(n_l+n_u)$-dimensional label vector given by $Y = [y_1, \ldots, y_{n_l}, 0, \ldots, 0]^{\mathrm T}$; P is the $(n_l+n_u)$-dimensional label vector given by $P = [0, \ldots, 0, p_{n_l+1}, \ldots, p_{n_l+n_u}]^{\mathrm T}$, where $0, \ldots, 0$ denotes a vector whose elements are all 0; I is the $(n_l+n_u)\times(n_l+n_u)$ identity matrix; J = diag(1, ..., 1, 0, ..., 0) is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix whose first $n_l$ diagonal entries are 1 and whose remaining entries are 0; and $J_P$ is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix obtained by taking abs(·), the absolute value of each element of P: $J_P = \mathrm{diag}(\mathrm{abs}(P))$.
Further, in the method provided by the present invention, optimal regularization parameters and kernel function are set in step 3: the regularization parameters $\gamma_I$ and $\gamma_A$ are set to 1 and 0.1, respectively, the parameter δ is set to 0.98, the kernel function is the Gaussian kernel (RBF) with its parameter set to 0.5, and the KFCM parameter is set to $\gamma_P = 0.3$; when $\gamma_P = 0$ the formula degenerates to MR, and the number of labeled text samples is fixed at 10.
Further, in the method provided by the invention, step 1 is to collect text real data in the UCI public data set and the benchmark data set.
Furthermore, in the method provided by the invention, the real text sample data set collected in step 1 comprises a number of web pages. First, only the text content of the web pages is used and the link information is ignored, and the bag-of-words vector representation of each document is constructed from the top 3000 words, i.e. skipping the HTML header; second, TFIDF mapping is applied and the feature vectors are normalized to unit length.
The invention also provides a text classification system, which comprises:
the sample acquisition module is used for acquiring a text real sample data set which comprises marked text samples and unmarked text samples, wherein the marked text samples comprise text category labels;
the extended marked sample acquisition module is used for acquiring all text membership degree information through a clustering algorithm, selecting high-reliability text samples according to the clustering membership degree, and forming an extended marked text sample set by using the high-reliability text samples and class labels thereof;
the classification function calculation module is used for uniformly setting a target function for the marked text sample, the unmarked text sample and the extended marked text sample data according to a square loss function based on a popular regularization method MR, training the target function by using the extended marked sample to obtain an optimal regularization parameter and a kernel function, and obtaining a final text classification function;
the text classification module is used for inputting text data to be classified, classifying the data by using a text classification function and outputting the classification of the text: useful text and useless text.
By adopting the technical scheme, compared with the prior art, the invention has the following technical effects:
the invention provides a novel label extended MR framework (LE _ MR for short), which introduces an extended label sample set into MR learning, namely constructs an optimization problem of the LE _ MR based on labeled, unlabeled and extended high-reliability sample sets, thereby relieving the problem of lack of labeled samples and improving the text classification performance.
Meanwhile, the invention carries out comparison experiments on a plurality of data sets with other classical classification algorithms and similar text classification algorithms, wherein the data sets comprise UCI public data sets (text characteristic data) and benchmark data sets. Experiments show that LE _ MR can improve the accuracy of text classification compared with other advanced semi-supervised classification methods at present to obtain encouraging results.
Drawings
FIG. 1 is a diagram comparing LE_MR and MR.
Fig. 2 is a flow chart of an implementation method of the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The invention provides a novel semi-supervised text classification method. Consider text classification on a text data set containing a number of web pages; the task is to classify these pages into two categories, useful and useless, i.e. positive and negative. Only the text content of each web page is used, ignoring the link information. The bag-of-words vector representation of a document is constructed from the top 3000 words (skipping the HTML header), chosen as the words with the highest mutual information with the class variable, followed by TFIDF mapping. The feature vectors are normalized to unit length.
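As a rough illustration of this preprocessing (the patent fixes no implementation), the following Python sketch builds the 3000-word TF-IDF representation with scikit-learn. Note that `max_features` selects words by corpus frequency rather than by mutual information with the class variable, so the feature selection here is only an approximation; the corpus and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# pages: plain-text bodies of the web pages (link information already discarded).
# The two strings below are stand-ins; any list of document strings works.
pages = ["useful page body text ...", "useless page body text ..."]

# Top 3000 words, TF-IDF weighting; norm="l2" scales each vector to unit length.
vectorizer = TfidfVectorizer(max_features=3000, norm="l2")
X = vectorizer.fit_transform(pages)   # sparse (n_pages x 3000) feature matrix
```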
The labeled samples are randomly selected and may fall, for example, in the border region, or even in the region of the opposite category; MR classification may then be misled by the propagation of labels from these samples to their neighbors. To solve this problem, the invention proposes a new label-expanded MR framework (LE_MR) for semi-supervised text classification. In LE_MR, the clustering method KFCM first finds high-confidence samples, e.g. samples in the cluster center regions. These samples, labeled from the cluster index, are then employed to expand the set of labeled text samples. Based on the squared loss function, the invention develops a unified objective function over the labeled, unlabeled and expanded high-confidence text samples to represent the problem.
As shown in the flowchart of FIG. 2, the invention provides a semi-supervised manifold regularization text classification method based on expanded labeled samples, which can be used for training, classification and recognition on texts whose labeled samples are few and inaccurate. The invention is realized by the following technical scheme, comprising the following steps:
the first step: collect a text data set under real conditions, comprising text data samples and corresponding text category labels, and divide it into a training set and a test set. The text data set comprises a number of web pages, and the task is to classify the pages into two classes, useful and useless, i.e. positive and negative. Only the text content of each page is used, ignoring link information; the bag-of-words vector representation of a document is constructed from the top 3000 words (skipping the HTML header), chosen as the words with the highest mutual information with the class variable, followed by TFIDF mapping; the feature vectors are normalized to unit length;
the second step is that: acquiring a high-reliability text sample through text membership information obtained by a fuzzy kernel clustering algorithm KFCM to expand a marked text sample set;
the third step: setting parameters such as regularization parameters and kernel functions in the LE _ MR, and training the LE _ MR by using the extended marked text sample set obtained in the second step; obtaining an optimal regularization parameter and a kernel function to obtain a final text classification function f (x);
the fourth step: inputting text data to be classified, classifying the text data by using a text classification function f (x), and outputting the classification of the text: useful text and useless text.
The first step is as follows:
the first step is simple, and the text data under the real condition is collected, wherein the text data comprises a UCI data set (text characteristic data), a benchmark data set and the like, the UCI data set comprises text data samples and corresponding text category labels, and a training set and a test set are divided.
The second step is specifically as follows:
firstly, acquiring text membership in a training set through FCM (KFCM) based on a kernel. FCM is a classical clustering method that divides data into clusters and instances in the same cluster are as similar as possible. In addition, a variation of FCM is proposed, called KFCM-I, which attempts to improve learning ability through kernel skills. In KFCM-I, example X = { X = 1 ,…,x N Is first mapped to feature space (kernel space) by Mercer mapping Φ. Then at the mapped sample { Φ (x) 1 ),…,Φ(x N ) And directly realizing clustering.
The formula of KFCM-I is:
where V is the cluster center, U is the membership matrix, c is the number of clusters, N is the total number of input samples, x k Is the input sample characteristic, m is the blur index,is the center of the cluster in the kernel space, andis description of x k Cluster membership of the likelihood of belonging to the ith cluster.
However, because the kernel map is difficult to express explicitly, researchers proposed the algorithm KFCM-II. KFCM-II replaces the similarity measure $\|\Phi(x_k) - v_i^{\Phi}\|^2$ with the kernel-induced distance $1 - K(x_k, v_i)$, where $K(x_k, v_i)$ measures, in the kernel space, the closeness between instance $x_k$ and cluster center $v_i$. The new objective function then becomes:

$$\min_{U,V} J_m(U, V) = 2\sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^m \big(1 - K(x_k, v_i)\big)$$

Based on the Lagrange method, the cluster membership of an instance and the cluster centers are, respectively:

$$u_{ik} = \frac{\big(1 - K(x_k, v_i)\big)^{-1/(m-1)}}{\sum_{j=1}^{c}\big(1 - K(x_k, v_j)\big)^{-1/(m-1)}}$$

and

$$v_i = \frac{\sum_{k=1}^{N} u_{ik}^m K(x_k, v_i)\, x_k}{\sum_{k=1}^{N} u_{ik}^m K(x_k, v_i)}$$

In KFCM-II, the cluster centers V from FCM are taken as the initial estimate, and the iteration terminates when $\|V(t+1) - V(t)\| \le \varepsilon$, where t is the iteration count, $\varepsilon > 0$, and t starts from 0.
Finally, the algorithm of KFCM-II is summarized as follows:
1) Take the cluster centers from FCM as the initial cluster centers of KFCM-II;
2) Compute $K_t(x_k, v_i)$ (the Gram matrix);
3) Compute the cluster membership matrix U(t) by

$$u_{ik}(t) = \frac{\big(1 - K_t(x_k, v_i)\big)^{-1/(m-1)}}{\sum_{j=1}^{c}\big(1 - K_t(x_k, v_j)\big)^{-1/(m-1)}}$$

4) Compute the cluster centers V(t+1) by

$$v_i(t+1) = \frac{\sum_{k=1}^{N} u_{ik}^m(t)\, K_t(x_k, v_i)\, x_k}{\sum_{k=1}^{N} u_{ik}^m(t)\, K_t(x_k, v_i)}$$

5) Terminate when $\|V(t+1) - V(t)\| \le \varepsilon$; otherwise set t → t+1 and return to step 2).
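The KFCM-II iteration above can be sketched in Python as follows. This is a minimal reading of steps 1)-5) assuming a Gaussian kernel (the kernel the patent later selects); the function and parameter names are illustrative, and the initial centers V0 are expected to come from a plain FCM run.

```python
import numpy as np

def gaussian_kernel(X, V, sigma=0.5):
    """K(x_k, v_i) for all sample-center pairs; rows index samples, columns centers."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kfcm_ii(X, V0, m=2.0, eps=1e-4, max_iter=100):
    """KFCM-II sketch. X: (N, d) samples; V0: (c, d) initial centers from FCM."""
    V = V0.copy()
    for _ in range(max_iter):
        K = gaussian_kernel(X, V)                  # step 2): Gram matrix K_t(x_k, v_i)
        d = np.clip(1.0 - K, 1e-12, None)          # kernel-induced distance 1 - K
        w = d ** (-1.0 / (m - 1.0))
        U = w / w.sum(axis=1, keepdims=True)       # step 3): memberships, rows sum to 1
        num = (U ** m * K).T @ X                   # step 4): weighted center update
        den = (U ** m * K).sum(axis=0)[:, None]
        V_new = num / den
        if np.linalg.norm(V_new - V) <= eps:       # step 5): ||V(t+1) - V(t)|| <= eps
            return U.T, V_new                      # U returned as (clusters x samples)
        V = V_new
    return U.T, V
```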
After clustering, for each cluster, if the number of positively labeled instances is greater than the number of negatively labeled instances, the cluster is regarded as a positive cluster or positive class; otherwise it is regarded as a negative class. The positive and negative clusters thus obtained are consistent with the true positive and negative categories. High-confidence samples are then selected according to the clustering membership.
Specifically, the text membership information is obtained with the kernel fuzzy clustering algorithm KFCM. Let the clustering membership matrix obtained from KFCM be $U = \begin{bmatrix} u_{+1} & \cdots & u_{+n} \\ u_{-1} & \cdots & u_{-n} \end{bmatrix}$, one row per cluster and one column per sample.
First, the positive and negative text classes of the upper and lower rows of the clustering membership matrix are determined by counting, for the labeled samples, how often the class membership in each row matches the actual class of the sample.
In the upper row, the number of labeled positive samples with membership ≥ 0.9 is denoted S; in the lower row, the number of labeled positive samples with membership ≥ 0.9 is denoted X. (Equivalently, in the upper row the number of labeled negative samples with membership ≤ 0.1 may be denoted S, and in the lower row the number of labeled negative samples with membership ≤ 0.1 denoted X.)
If S > X, the upper row holds the positive-class memberships $u_{+i}$; otherwise the lower row holds the positive-class memberships $u_{+i}$.
If the upper row is the positive row, only the memberships of that row are considered: an unlabeled sample with $u_{+i} \ge \delta$ is taken as a high-confidence text sample labeled +1, i.e. an expanded positive labeled sample; an unlabeled sample with $u_{+i} \le 1-\delta$ is taken as a high-confidence text sample labeled -1, i.e. an expanded negative labeled sample; an unlabeled sample with $1-\delta < u_{+i} < \delta$ remains unlabeled; and originally labeled samples remain labeled samples.
Similarly, if the lower row is the positive row, only the memberships of that row are considered, and the same rules are applied.
Finally, the expanded labeled text sample set is formed from the high-confidence text samples and their category labels, as sketched below. Here δ is a threshold in [0.5, 1].
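A minimal sketch of this expansion rule, assuming the 2 x n membership matrix U from the KFCM-II sketch above and the row-selection count described in the text (the function and variable names are illustrative):

```python
import numpy as np

def expand_labels(U, labeled_idx, y_labeled, delta=0.98):
    """U: 2 x n KFCM membership matrix (rows are the two clusters).
    Returns p with p_i = +1/-1 for expanded samples, 0 where still unlabeled."""
    # S, X: per row, how many known positive samples get membership >= 0.9 there.
    S = sum(1 for j, y in zip(labeled_idx, y_labeled) if y == +1 and U[0, j] >= 0.9)
    X = sum(1 for j, y in zip(labeled_idx, y_labeled) if y == +1 and U[1, j] >= 0.9)
    u_pos = U[0] if S > X else U[1]        # row holding the positive-class memberships

    p = np.zeros(U.shape[1])
    p[u_pos >= delta] = +1                 # expanded positive labeled samples
    p[u_pos <= 1 - delta] = -1             # expanded negative labeled samples
    p[labeled_idx] = y_labeled             # originally labeled samples keep their labels
    return p
```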
The third step is specifically as follows:
and training an improved semi-supervised manifold regularization framework LE _ MR by using the extended text mark samples acquired in the second step.
MR proposes a semi-supervised manifold regularization classification framework by Belkin et al, which provides a powerful framework for semi-supervised classification. It propagates labels by going from labeled samples to unlabeled samples so that similar samples based on manifolds have similar classification outputs.
Given labeled data $\{x_i\}_{i=1}^{n_l}$ with corresponding labels $\{y_i\}_{i=1}^{n_l}$ and unlabeled data $\{x_i\}_{i=n_l+1}^{n}$, where $x_i \in R^d$, $y_i \in \{+1,-1\}$ and $n_u = n - n_l$. Over the whole data set a Laplacian graph G is pre-assigned, in which each weight $W_{ij}$ represents the similarity between connected instances $x_i$ and $x_j$. The optimization objective of conventional MR is:

$$\min_{f \in H_K}\; \frac{1}{n_l}\sum_{i=1}^{n_l} V(x_i, y_i, f) \;+\; \gamma_A \|f\|_K^2 \;+\; \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f}$$

where the first term is the classification loss on the labeled samples; V(·,·,·) can be any form of loss function, such as the hinge loss max{0, 1 - y_i f(x_i)} used by the support vector machine (SVM) or the squared loss $(y_i - f(x_i))^2$ used by the regularized least-squares classifier (RLSC). The second term represents the smoothness and complexity of the classification function and is used to control the complexity of the classification surface to avoid over-fitting. The third term is a smoothness regularizer over all samples, describing the smoothness over all labeled and unlabeled samples, with $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{\mathrm T}$. Here L is the graph Laplacian given by L = D - W, W is the weight matrix of the graph G, and D is the diagonal matrix formed from the diagonal components $D_{ii} = \sum_{j=1}^{n} W_{ij}$, as sketched below.
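A small sketch of the graph construction, assuming a fully connected Gaussian-weighted graph (the patent states only that $W_{ij}$ is a similarity, so the weighting here is one common choice rather than the prescribed one):

```python
import numpy as np

def graph_laplacian(X, sigma=0.5):
    """L = D - W for a fully connected Gaussian-weighted similarity graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    W = np.exp(-d2 / (2 * sigma ** 2))                    # W_ij: similarity of x_i, x_j
    np.fill_diagonal(W, 0.0)                              # no self-loops
    D = np.diag(W.sum(axis=1))                            # D_ii = sum_j W_ij
    return D - W
```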
According to the Representer Theorem, the solution of the above problem can be expressed in the form

$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$$

where $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ are the expansion coefficients and $K: X \times X \to R$ is a Mercer kernel.
The labeled samples in semi-supervised learning are randomly selected and may fall in the border region, or even in the region of the opposite category. Hence in MR, when labels are propagated from these samples to their neighbors, the final classification may be misled even though the structure of the unlabeled samples is taken into account, and the classification effect may still be unsatisfactory. To solve this problem, the invention proposes a new text label-expanded MR framework (LE_MR for short), described in the remainder of this section. In LE_MR, the problems caused by misleading labeled instances are alleviated by expanding the labeled text sample set with high-confidence text samples obtained from cluster learning.
Given labeled data $\{x_i\}_{i=1}^{n_l}$ with corresponding labels $\{y_i\}_{i=1}^{n_l}$ and unlabeled data $\{x_i\}_{i=n_l+1}^{n}$, where $x_i \in R^d$, $y_i \in \{+1,-1\}$ and $n_u = n - n_l$; here $n_u$ is the number of unlabeled samples, $n_l$ the number of labeled samples and n the total number of samples. Expanded labeled samples are obtained from the unlabeled data by clustering, with the number of clusters set to 2. The selected instances are typically located in the cluster center regions, and they are used to expand the labeled text sample set.
Based on MR, a unified LE_MR objective function is formulated from the squared loss function over the labeled, unlabeled and expanded labeled text sample data:

$$f^* = \arg\min_{f \in H_K} \frac{1}{n_l}\sum_{i=1}^{n_l}\big(y_i - f(x_i)\big)^2 + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f} + \gamma_P \sum_{i=n_l+1}^{n}\big(p_i - f(x_i)\big)^2$$

where f is the classification function to be solved, located in the reproducing kernel Hilbert space (RKHS) $H_K$ defined by the kernel function K; $\gamma_A$ and $\gamma_I$ are regularization parameters and $\gamma_P$ is the KFCM parameter. The fourth term is the classification loss of the expanded labeled samples, where each $p_i$ is the label assigned to unlabeled sample $x_i$, defined as:

$$p_i = \begin{cases} +1, & u_{+i} \ge \delta \\ -1, & u_{+i} \le 1-\delta \\ 0, & \text{otherwise} \end{cases}$$

where $u_{+i}$ and $u_{-i}$ are the positive and negative membership degrees of sample $x_i$, respectively, with $u_{+i} + u_{-i} = 1$, $i = 1, \ldots, n$, and n the total number of samples. LE_MR thus removes the misleading effect caused by the limited labeled text samples by expanding the labeled text sample set with the high-membership text samples produced by clustering. When $\gamma_P = 0$, LE_MR cannot use the expanded labeled text samples and degenerates to MR.
From the Representer theorem, the solution of the above optimization objective can be characterized as:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

where $\alpha = [\alpha_1, \ldots, \alpha_{n_l+n_u}]^{\mathrm T}$. Substituting this expansion and solving the optimization problem by the Lagrange method, the objective function can be written as:

$$\alpha^* = \arg\min_{\alpha}\; \frac{1}{n_l}(Y - JK\alpha)^{\mathrm T}(Y - JK\alpha) + \gamma_A\, \alpha^{\mathrm T} K \alpha + \frac{\gamma_I}{(n_l+n_u)^2}\,\alpha^{\mathrm T} K L K \alpha + \gamma_P (P - J_P K\alpha)^{\mathrm T}(P - J_P K\alpha)$$

where K is the $(n_l+n_u)\times(n_l+n_u)$ kernel matrix over the labeled and unlabeled points; Y is the $(n_l+n_u)$-dimensional label vector $Y = [y_1, \ldots, y_{n_l}, 0, \ldots, 0]^{\mathrm T}$; P is the $(n_l+n_u)$-dimensional label vector $P = [0, \ldots, 0, p_{n_l+1}, \ldots, p_{n_l+n_u}]^{\mathrm T}$, where $0, \ldots, 0$ denotes a vector whose elements are all 0; J = diag(1, ..., 1, 0, ..., 0) is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix whose first $n_l$ diagonal entries are 1 and the rest 0; and $J_P = \mathrm{diag}(\mathrm{abs}(P))$ is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix obtained by taking the absolute value abs(·) of each element of P.

Setting the derivative of the objective function with respect to α to zero gives the minimum, namely:

$$\frac{1}{n_l} J(JK\alpha - Y) + \gamma_A \alpha + \frac{\gamma_I}{(n_l+n_u)^2} LK\alpha + \gamma_P J_P (J_P K \alpha - P) = 0$$

thus:

$$\alpha^* = \Big(JK + \gamma_A n_l I + \frac{\gamma_I n_l}{(n_l+n_u)^2} LK + \gamma_P n_l J_P K\Big)^{-1}\big(Y + \gamma_P n_l P\big)$$
and setting corresponding parameters therein, including a regularization parameter γ I And gamma A Set to 1 and 0.1, respectively, and the parameter δ is set to 0.98. The kernel function selects a Gaussian kernel rbf with the parameter set to 0.5, and the parameter gamma of KFCM is adjusted P =0.3, when gamma P The formula degenerates to MR when =0, the number of marked text samples is fixed to 10, and the operation mode is set as cross validation.
Finally, for a given text sample x, its classification function is:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

where $\alpha^*$ is given by the closed-form solution above.
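Since the closed-form expression above is reconstructed from damaged text, the following Python sketch should be read as the LapRLS-style linear system it implies, not a verbatim transcription; the function names are hypothetical, and the assembly of the system follows the definitions of K, L, Y, P, J and $J_P$ given above.

```python
import numpy as np

def le_mr_fit(K, L, y_l, p, gamma_A=0.1, gamma_I=1.0, gamma_P=0.3):
    """Solve the LE_MR linear system sketched above.
    K: (n x n) kernel matrix over labeled + unlabeled points (labeled first),
    L: graph Laplacian, y_l: labels of the first n_l points,
    p: expanded label vector (0 on labeled positions and where unassigned)."""
    n, n_l = K.shape[0], len(y_l)
    Y = np.concatenate([y_l, np.zeros(n - n_l)])      # Y = [y_1..y_nl, 0..0]
    J = np.diag(np.r_[np.ones(n_l), np.zeros(n - n_l)])
    J_p = np.diag(np.abs(p))                          # J_P = diag(abs(P))
    A = (J @ K + gamma_A * n_l * np.eye(n)
         + (gamma_I * n_l / n**2) * (L @ K)
         + gamma_P * n_l * (J_p @ K))
    return np.linalg.solve(A, Y + gamma_P * n_l * p)  # alpha*

def le_mr_predict(alpha, K_new):
    """f(x) = sum_i alpha_i K(x_i, x); K_new has one row per new sample and
    one column per training sample."""
    return K_new @ alpha
```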
To evaluate the performance of the LE_MR proposed by the invention, it is compared with currently advanced methods, including the support vector machine (SVM), the transductive support vector machine (TSVM) and manifold regularization (MR), on the UCI and benchmark data sets. The invention also reports experimental results for the label-expanded SVM (LE_SVM), an extension of the SVM that uses the expanded labeled text samples.
For LE_MR, the Gaussian kernel parameter is set to 0.5. With 10 labeled instances, the regularization parameters $\gamma_I$ and $\gamma_A$ are set to 1 and 0.1, respectively, in all compared methods; the parameter δ is set to 0.98 and the regularization parameter $\gamma_P$ to 0.3.
For each real text data set, the experiment is run with cross-validation. Specifically, each text data set is randomly and evenly divided into 10 groups; one group is then drawn at random as the test set and the rest serve as the training set. Within the training set, only 10 labeled samples are selected, the remainder being unlabeled. This learning process is repeated 10 times (a sketch of the protocol follows below) and the average results are recorded in Table 1, where the best performance in each row is shown in bold.
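A rough sketch of this evaluation protocol; the helper `evaluate_le_mr` is hypothetical and stands for fitting LE_MR on the training indices and measuring the test error.

```python
import numpy as np

def split_indices(n, rng, n_labeled=10):
    """One repetition: hold out a random tenth as the test set, keep labels on
    only n_labeled training points and treat the rest as unlabeled."""
    idx = rng.permutation(n)
    test, train = idx[: n // 10], idx[n // 10:]
    labeled = rng.choice(train, size=n_labeled, replace=False)
    return train, test, labeled

rng = np.random.default_rng(0)
# Repeat 10 times and average the test error, as in the protocol above:
# errors = [evaluate_le_mr(X, y, *split_indices(len(y), rng)) for _ in range(10)]
```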
TABLE 1
The results in Table 1 show that the new method proposed by the invention is very effective. Several conclusions can be drawn from these results:
1) LE_SVM performs better than SVM on most text data sets; in particular, it outperforms SVM on two text data sets. The labeled samples are thus effectively augmented by exploiting the membership information of the unlabeled samples: LE_SVM does improve the performance of SVM.
2) MR performs better than SVM on most text data sets; in particular, it outperforms SVM on three text data sets. By exploiting the membership of unlabeled samples, the semi-supervised learning method can therefore achieve better performance than the fully supervised method.
3) LE_MR performs best on nine text data sets, because it not only brings the expanded labeled samples into MR but also combines labeled and unlabeled samples in learning. The expanded labeled samples can thus improve MR learning, especially when the given labeled text samples are scarce or somewhat misleading.
The method provided by the invention effectively introduces expanded labeled samples into MR, giving a new semi-supervised text classification method. High-confidence samples are found from the membership information produced by KFCM, and the labeled sample set is then expanded with these high-confidence samples, thereby reducing the influence of "misleading" labeled samples.
The foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (9)

1. A semi-supervised manifold regularization text classification method based on expanded labeled samples, characterized by comprising the following steps:
step 1, collecting a text real sample data set, wherein the text real sample data set comprises a marked text sample and an unmarked text sample, and the marked text sample comprises a text category label;
step 2, acquiring all text membership information through a clustering algorithm, selecting high-reliability text samples according to the clustering membership, and forming an expansion mark text sample set by using the high-reliability text samples and class labels thereof;
step 3, based on the manifold regularization method MR, uniformly setting an objective function for the labeled text samples, the unlabeled text samples and the expanded labeled text sample data according to the squared loss function, training the objective function with the expanded labeled samples obtained in step 2 to obtain an optimal regularization parameter and kernel function, and obtaining the final text classification function;
step 4, inputting text data to be classified, classifying the text data by using the text classification function obtained in the step 3, and obtaining the category of the text: useful text and useless text.
2. The method of claim 1, wherein in step 2 the kernel fuzzy clustering algorithm KFCM is used to obtain the text membership information; let the clustering membership matrix obtained from KFCM be $U = \begin{bmatrix} u_{+1} & \cdots & u_{+n} \\ u_{-1} & \cdots & u_{-n} \end{bmatrix}$, with one row per cluster and one column per sample;
first, for either of the two rows of the matrix, the number of labeled samples whose class membership in that row matches their actual class is counted; this determines the class of that row, and hence the class of the other row;
then, in the row whose class is positive, unlabeled text samples with $u_{+i} \ge \delta$ or $u_{+i} \le 1-\delta$ are selected as high-confidence text samples, where $\delta \in [0.5, 1]$ is a threshold, $i = 1, \ldots, n$, and n is the total number of samples; the expanded labeled text sample set is formed from the high-confidence text samples and their obtained cluster or category labels.
3. The method according to claim 1 or 2, wherein in step 2, after clustering, for each cluster, if the number of positive labeled samples is greater than the number of negative labeled samples, the cluster is regarded as a positive cluster or a positive class, otherwise, the cluster is regarded as a negative cluster or a negative class; thus, the positive and negative clusters are obtained to be consistent with the real positive and negative categories.
4. The method of claim 1, wherein the objective function of step 3 is:

$$f^* = \arg\min_{f \in H_K} \frac{1}{n_l}\sum_{i=1}^{n_l}\big(y_i - f(x_i)\big)^2 + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(n_l+n_u)^2}\,\mathbf{f}^{\mathrm T} L \mathbf{f} + \gamma_P \sum_{i=n_l+1}^{n}\big(p_i - f(x_i)\big)^2$$

wherein f is the classification function to be solved, located in the reproducing kernel Hilbert space $H_K$ defined by the kernel function K; $\gamma_A$ and $\gamma_I$ are regularization parameters and $\gamma_P$ is the KFCM parameter; $x_i \in R^d$, $y_i \in \{+1,-1\}$, $n_u = n - n_l$, where $n_u$ is the number of unlabeled samples, $n_l$ the number of labeled samples and n the total number of samples; $\mathbf{f} = [f(x_1), \ldots, f(x_n)]^{\mathrm T}$; L is the graph Laplacian given by L = D - W, W is the weight matrix of the graph G, and D is the diagonal matrix formed from the diagonal components $D_{ii} = \sum_{j=1}^{n} W_{ij}$; the weight $W_{ij}$ represents the similarity between connected samples $x_i$ and $x_j$, and $f(x_i)$ is the text class assigned to a labeled sample after classification; the regularization term $\|f\|_K^2$ controls the complexity of the classification surface to avoid over-learning; the fourth term is the classification loss of the expanded labeled text samples, where $p_i$ denotes the label assigned to each unlabeled sample, defined as

$$p_i = \begin{cases} +1, & u_{+i} \ge \delta \\ -1, & u_{+i} \le 1-\delta \\ 0, & \text{otherwise} \end{cases}$$

wherein $u_{+i}$ and $u_{-i}$ are the positive and negative membership degrees of sample $x_i$, respectively, with $u_{+i} + u_{-i} = 1$, $i = 1, \ldots, n$, and n the total number of samples.
5. The method of claim 4, wherein in step 3, for a given text sample x, the text classification function is:

$$f^*(x) = \sum_{i=1}^{n_l+n_u} \alpha_i^* K(x_i, x)$$

wherein

$$\alpha^* = \Big(JK + \gamma_A n_l I + \frac{\gamma_I n_l}{(n_l+n_u)^2} LK + \gamma_P n_l J_P K\Big)^{-1}\big(Y + \gamma_P n_l P\big)$$

wherein K is the $(n_l+n_u)\times(n_l+n_u)$ kernel matrix; Y is the $(n_l+n_u)$-dimensional label vector given by $Y = [y_1, \ldots, y_{n_l}, 0, \ldots, 0]^{\mathrm T}$; P is the $(n_l+n_u)$-dimensional label vector given by $P = [0, \ldots, 0, p_{n_l+1}, \ldots, p_{n_l+n_u}]^{\mathrm T}$, where $0, \ldots, 0$ denotes a vector whose elements are all 0; I is the $(n_l+n_u)\times(n_l+n_u)$ identity matrix; J = diag(1, ..., 1, 0, ..., 0) is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix whose first $n_l$ diagonal entries are 1 and whose remaining entries are 0; and $J_P$ is the $(n_l+n_u)\times(n_l+n_u)$ diagonal matrix obtained by taking abs(·), the absolute value of each element of P: $J_P = \mathrm{diag}(\mathrm{abs}(P))$.
6. The method according to claim 4, wherein optimal regularization parameters and kernel function are set in step 3: the regularization parameters $\gamma_I$ and $\gamma_A$ are set to 1 and 0.1, respectively, the parameter δ is set to 0.98, the kernel function is the Gaussian kernel (RBF) with its parameter set to 0.5, and the KFCM parameter is set to $\gamma_P = 0.3$; when $\gamma_P = 0$ the formula degenerates to MR, and the number of labeled text samples is fixed at 10.
7. The method of claim 1, wherein step 1 is collecting textual real data in the UCI public dataset and the benchmark dataset.
8. The method of claim 1, wherein the real text sample data set collected in step 1 comprises a number of web pages; first, only the text content of the web pages is used and the link information is ignored, and the bag-of-words vector representation of each document is constructed from the top 3000 words, i.e. skipping the HTML header; second, TFIDF mapping is applied to normalize the feature vectors to unit length.
9. A text classification system, comprising:
the sample acquisition module is used for acquiring a text real sample data set which comprises marked text samples and unmarked text samples, wherein the marked text samples comprise text category labels;
the extended marked sample acquisition module is used for acquiring all text membership degree information through a clustering algorithm, selecting high-reliability text samples according to the clustering membership degree, and forming an extended marked text sample set by using the high-reliability text samples and class labels thereof;
the classification function calculation module is used for uniformly setting an objective function for the labeled text samples, the unlabeled text samples and the expanded labeled text sample data according to the squared loss function based on the manifold regularization method MR, and training the objective function with the expanded labeled samples to obtain an optimal regularization parameter and kernel function, yielding the final text classification function;
the text classification module is used for inputting text data to be classified, classifying the data by using a text classification function and outputting the classification of the text: useful text and useless text.
CN201711086110.4A 2017-11-07 2017-11-07 Text classification method and system based on expanded labeled samples Pending CN107943856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711086110.4A CN107943856A (en) 2017-11-07 2017-11-07 Text classification method and system based on expanded labeled samples

Publications (1)

Publication Number Publication Date
CN107943856A true CN107943856A (en) 2018-04-20

Family

ID=61933485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711086110.4A Pending CN107943856A (en) 2017-11-07 2017-11-07 Text classification method and system based on expanded labeled samples

Country Status (1)

Country Link
CN (1) CN107943856A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222180A (en) * 2019-06-04 2019-09-10 江南大学 A kind of classification of text data and information mining method
CN110569856A (en) * 2018-08-24 2019-12-13 阿里巴巴集团控股有限公司 sample labeling method and device, and damage category identification method and device
CN110796262A (en) * 2019-09-26 2020-02-14 北京淇瑀信息科技有限公司 Test data optimization method and device of machine learning model and electronic equipment
CN111178042A (en) * 2019-12-31 2020-05-19 出门问问信息科技有限公司 Data processing method and device and computer storage medium
CN111310794A (en) * 2020-01-19 2020-06-19 北京字节跳动网络技术有限公司 Target object classification method and device and electronic equipment
CN111460156A (en) * 2020-03-31 2020-07-28 深圳前海微众银行股份有限公司 Sample expansion method, device, equipment and computer readable storage medium
CN111581380A (en) * 2020-04-29 2020-08-25 南京理工大学紫金学院 Single-point and double-point smooth combination manifold regularization semi-supervised text classification method
CN112363465A (en) * 2020-10-21 2021-02-12 北京工业大数据创新中心有限公司 Expert rule set training method, trainer and industrial equipment early warning system
CN112528030A (en) * 2021-02-09 2021-03-19 中关村科学城城市大脑股份有限公司 Semi-supervised learning method and system for text classification
CN113127605A (en) * 2021-06-17 2021-07-16 明品云(北京)数据科技有限公司 Method and system for establishing target recognition model, electronic equipment and medium
WO2022126810A1 (en) * 2020-12-14 2022-06-23 上海爱数信息技术股份有限公司 Text clustering method
CN115174251A (en) * 2022-07-19 2022-10-11 深信服科技股份有限公司 False alarm identification method and device for safety alarm and storage medium
US11809454B2 (en) 2020-11-21 2023-11-07 International Business Machines Corporation Label-based document classification using artificial intelligence

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350011A (en) * 2007-07-18 2009-01-21 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN104156438A (en) * 2014-08-12 2014-11-19 德州学院 Unlabeled sample selection method based on confidence coefficients and clustering
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
WO2017090051A1 (en) * 2015-11-27 2017-06-01 Giridhari Devanathan A method for text classification and feature selection using class vectors and the system thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘宏 (Liu Hong): 通过标记样本与未标记样本学习文本分类规则 (Learning text classification rules from labeled and unlabeled samples), 《中国学位论文全文数据库》 (China Dissertations Full-text Database) *
尚耐丽 (Shang Naili): 半监督分类方法的研究 (Research on semi-supervised classification methods), 《计算机应用与软件》 (Computer Applications and Software) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20180420