CN110795564B - Text classification method lacking negative cases - Google Patents

Text classification method lacking negative cases

Info

Publication number
CN110795564B
CN110795564B (application CN201911058163.4A)
Authority
CN
China
Prior art keywords
text
classifier
data
training
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911058163.4A
Other languages
Chinese (zh)
Other versions
CN110795564A (en)
Inventor
吴刚 (Wu Gang)
王楠 (Wang Nan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Jitu Data Technology Co ltd
Original Assignee
Nanjing Jitu Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2019-11-01
Publication date: 2022-02-22
Application filed by Nanjing Jitu Data Technology Co., Ltd.
Priority to CN201911058163.4A
Publication of CN110795564A
Application granted
Publication of CN110795564B
Legal status: Active

Classifications

    • G06F ELECTRIC DIGITAL DATA PROCESSING (within G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING)
    • G06F16/35 Information retrieval of unstructured textual data: Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23213 Non-hierarchical clustering techniques using statistics or function optimisation, with fixed number of clusters, e.g. K-means clustering
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method for data lacking negative examples, belonging to the technical fields of machine learning and text classification. The method first determines the data texts to be classified and lets the user customize the classification categories; it then trains a TF-IDF model and an LSI model on an acquired corpus. Feature vectors of each text are built with both trained models and merged into a combined text feature vector by an ensemble method. A Basic classifier is then trained with the combined ROC-SVM algorithm, refined with a k-means clustering method where needed, and a Label classifier is trained alongside it. Finally, the Basic classifier performs an initial classification of the texts to be classified, an Elasticsearch lookup screens the candidate classifications, and the Label classifier accurately assigns each document to one or more of the user-defined categories. The method effectively classifies text data lacking negative examples, with high accuracy, good results and high efficiency.

Description

Text classification method lacking negative cases
Technical Field
The invention belongs to the technical fields of machine learning and text classification, and particularly relates to a text classification method for data lacking negative examples.
Background
With the development of the internet, the number of internet texts has grown sharply, and so has the demand for classifying them. Manual classification is clearly infeasible in the face of massive text data, but the rise of machine learning has offered a way to meet this demand, and a large body of research has grown around the field: machine learning methods such as naive Bayes, decision trees, k-nearest neighbors and support vector machines have been successfully applied to text classification with good results. However, because text data differ widely across fields and the mechanisms of the various methods differ, it is often difficult to classify large volumes of text accurately, so classification accuracy and efficiency remain poor in many cases.
For example, when a supervised learning method is used for text classification, large amounts of positive example data and negative example data must be labeled; and when the classification task concerns a specific field, collecting negative example texts is very difficult, mainly because 1) the negative example texts must cover the entire collection of texts outside the positive examples, and 2) manually labeled negative examples may be biased, chiefly through unintended bias arising from the differing perceptions of the annotators. A shortage of negative example data, or erroneous items within it, degrades the performance of the classification model. For example, when classifying texts on a disease from diagnosis reports, a hospital database stores the diagnosis information of every patient confirmed to have the disease and can serve as the positive example data set; but the files of patients not diagnosed with the disease cannot serve as negative example data, because many patients have never been tested: some may have the disease without a diagnosis, and labeling them as having the disease is obviously impossible. Collecting negative example data is therefore very difficult, and the accuracy of supervised text classification in such fields is correspondingly low.
To address the difficulty of collecting negative example data, researchers turned to semi-supervised learning. Semi-supervised learning does not require completely mutually exclusive positive and negative data; it needs only positive example data plus large amounts of unlabeled data, which are comparatively easy to collect. The unlabeled data are a random sample of the whole text collection, so the category of each sample is arbitrary and the samples are unrelated to one another. For example, to train a classifier that recognizes "sports", we only need to label some sports-related texts as the positive example data set and then randomly sample all remaining texts (including sports texts) as the unlabeled data set; in the unlabeled data set, no text is manually labeled with any category.
Text classification is the process of assigning a piece of text to one or more of several predefined categories, and the task is generally divided into two parts: feature engineering and the classification model. Feature engineering based on the vector space model, combined with a classification model of strong generalization ability, can reach precision and recall above 85 percent on most classification tasks. Nevertheless, the existing text classification methods still have some problems:
(1) Feature engineering mainly relies on methods represented by the bag-of-words model and the TF-IDF model to map a continuous text to a vector in a high-dimensional feature space, so that different texts can be represented in the same feature space. The drawbacks are that the resulting feature vectors are too sparse, the semantic information they carry is incomplete, and the phenomena of polysemy (one word, many meanings) and synonymy (many words, one meaning) cannot be handled, which makes subsequent data processing inefficient and classification inaccurate. Another modeling idea uses topic models: each text is considered to contain many latent topics, and the aim is to obtain a probability distribution over the topics and use that topic distribution as the feature representing the text. Work in this direction, such as the LSI (latent semantic indexing) model, sometimes also called LSA (latent semantic analysis), can obtain highly compressed information about a text through SVD (singular value decomposition), but this type of method also has drawbacks: the resulting models cannot be interpreted probabilistically, lack support from statistical theory, and so on.
(2) When negative example data are absent, the training data for the classification model are hard to select. The classification methods proposed so far for data lacking negative examples first select reliable negative examples from the unlabeled data and then train a classifier on the positive and negative example data. Researchers have used methods such as PCA, decision trees, Bayesian frameworks and S-EM for the negative example selection, but these are not classifiers of strong generalization ability, so the final results fall somewhat short.
(3) When there are many categories to judge, scoring every category with the trained classification model costs a great deal of time, which seriously hinders use in a production environment. Traditional classification models mostly use a support vector machine (SVM) to classify texts, judging each category as correct or incorrect to obtain the fine-grained classification of a text. But deciding every single category is not only inefficient; it may also be unnecessary.
The analysis above shows that, for text data lacking negative examples, classification is difficult precisely because negative example data are missing, and accuracy, effectiveness and efficiency all suffer.
Disclosure of Invention
The technical problem is as follows: the invention provides a method for classifying texts lacking negative examples, which can classify text data lacking negative example data with high accuracy, good results and high efficiency.
The technical scheme is as follows: the invention provides a negative example-lacking text classification method, which comprises the following steps:
s1: determining classified text and classification category
Determining the data texts to be classified, and customizing the text classification categories, wherein the customized text classification categories are taken as the positive example categories;
s2: construction of vectorized model
Acquiring a corpus used for training the text vectorization models based on a public text database interface, performing text preprocessing on the corpus, and dividing a positive example data set P and an unlabeled data set U; respectively constructing a TF-IDF model and an LSI model, and training the TF-IDF model and the LSI model with the text data in the corpus;
s3: constructing text feature vectors
Traversing each text in the corpus, obtaining one feature vector of each text with the trained TF-IDF model and another with the trained LSI model, and merging the feature vector obtained from the TF-IDF model with the feature vector obtained from the LSI model into a single feature vector, obtaining the combined text feature vector of each text;
s4: training classifier
Training a Basic classifier based on the combined ROC-SVM algorithm to judge whether any input text belongs to the user-defined classification categories; and training a Label classifier with the traditional SVM algorithm, the Label classifier being used to judge which category or categories among the classification categories any text belongs to;
s5: inputting text to be classified and classifying
Initially classifying the data texts to be classified with the Basic classifier, judging whether each data text to be classified belongs to the positive example categories given by the user; screening the categories of the input data texts to be classified and determining their candidate classifications; and then determining with the Label classifier which category or categories among the classification categories each data text to be classified specifically belongs to.
Further, in step S3, when constructing the combined feature vector, the feature vector obtained from the LSI model and the feature vector obtained from the TF-IDF model are combined by using an ensemble method in a serial or weighted summation manner to obtain a combined text feature vector.
Further, when the combined feature vector is constructed by weighted summation, the chi-square test is used to perform feature selection on the feature vector obtained from the TF-IDF model, reducing its dimension to that of the feature vector obtained from the LSI model, after which the weighted summation is performed.
Further, in step S4, when training the Basic classifier, the method includes the following steps:
s4.1.1: establishing prototype vectors of the positive example data set P and the unlabeled data set U based on the ROC algorithm, taking the central plane between the two prototype vectors as a partition of the vector space, and assigning every instance in the space belonging to a certain prototype to the category of that prototype;
s4.1.2: calculating text similarity by comparing the feature vector of each text in the unlabeled data set U with the prototype vectors of the positive example data set P and of the unlabeled data set U; a text closer to the prototype vector built from P is regarded as a positive example text, while a text closer to the prototype vector built from U is regarded as a negative example text and put into a set, forming the negative example set RN;
s4.1.3: taking the union of the positive example data set P and the negative example set RN as the training set, with P as the positive examples and RN as the negative examples, and training a reference classifier C0 with the SVM algorithm; selecting a negative example set W from the difference set of the unlabeled data set U relative to RN, training a new classifier with the union of RN and W as the negative example training set, and iterating repeatedly until no negative examples remain in the difference set of U relative to RN or the number of iterations reaches a given threshold, obtaining the iterated classifier C1;
s4.1.4: and comparing the performances of the iterated classifier C1 and the reference classifier C0, and taking the classifier with better performance as a final Basic classifier.
Further, in step S4.1.2, when the text data are not linearly separable, the set RN is filtered and purified with a k-means clustering method to obtain a new negative example set RN', RN is replaced by RN', and steps S4.1.3 and S4.1.4 are carried out as before to obtain the final Basic classifier.
Further, in step S4.1.2, a cosine distance or a Euclidean distance is used to calculate the similarity of the texts.
Further, the training of the Label classifier in step S4 includes the following steps:
s4.2.1: using a one-to-many multi-classification strategy, and for each class, using the data of the class as positive example data and using the data of other classes as negative example data by a classifier;
s4.2.2: and constructing a binary classifier for each class by adopting an SVM algorithm to finally obtain a Label classifier, and training to obtain a plurality of Label classifiers when a plurality of classes exist.
Further, in step S5, when the input data texts to be classified are screened, Elasticsearch is used to search the positive example data set P for data texts similar to the text to be classified, and the categories of the similar data texts are taken as the candidate classifications.
Has the advantages that: compared with the prior art, the invention has the following advantages:
(1) When classifying texts, the method does not require labeling both a positive example data set and a negative example data set: only the positive example data set is labeled and the rest of the data remain unlabeled, which overcomes the problem that, for data sets lacking negative example data, an accurate classifier is hard to train and the texts cannot be classified accurately. Meanwhile, the feature vectors of the data texts are built with both the TF-IDF model and the LSI model, integrating the advantages of the two, so the text feature vectors are constructed accurately and quickly, the trained classifiers are more accurate and faster, and the data texts are classified more accurately and efficiently.
(2) The invention trains the Basic classifier with the combined ROC-SVM algorithm, which does not require labeling both positive and negative example texts: only part of the positive example data is labeled and the rest is left unlabeled. Negative example data are extracted with the ROC (Rocchio) algorithm and the classifier is then trained with an SVM, which solves the difficulty traditional methods have in classifying accurately without negative example data, while also avoiding the low accuracy caused by manually labeled negative examples; classification accuracy is thus effectively improved and the classification results are good.
(3) Whether data are linearly separable is hard to judge without accurate analysis, and if the data are linearly inseparable, the negative example data extracted by the ROC algorithm carry a certain unreliability, making the trained classifier inaccurate and the text classification poor. The method therefore filters and purifies the extracted negative example set with k-means clustering, so that a reliable negative example set and an accurate classifier are obtained even when the text data are linearly inseparable.
(4) In many cases there are numerous classifications, and judging which one or more of them a document belongs to consumes a great deal of time, making classification inefficient and also hurting accuracy. The method therefore first screens the candidate categories of each document with an Elasticsearch search, so the Label classifier only has to decide among a small number of candidates, which markedly improves classification efficiency.
Drawings
FIG. 1 is a flow chart of a method of text classification absent a negative example of the present invention;
FIG. 2 is a pseudo-code flow diagram of the ROC (Rocchio) algorithm;
FIG. 3 is a graph of f1 scores for four schemes ROC, PU-SVM, ROC-SVM and ROC-SVM with k-means.
Detailed Description
The invention is further described with reference to the following examples and the accompanying drawings.
The invention provides a text classification method for data lacking negative examples; the method is not limited to text classification in a single field and can be used in all fields. With reference to fig. 1, the method comprises the following specific steps:
s1: determining classified text and classification category
Determining the data texts to be classified, and customizing the text classification categories, the customized text classification categories being taken as the positive example categories.
Before classification, it must be determined which texts are to be classified; the user can customize the classification categories as needed and decide which categories the data texts should specifically be divided into, the given categories being taken as the positive example categories. For example, a hospital has many cancer patients, with cancers such as gastric, lung, esophageal and liver cancer, so the classification categories can be customized as gastric cancer, lung cancer, esophageal cancer and liver cancer, and the patients' medical record texts are then classified according to these customized categories. Because some patients may suffer from several cancers at once, a medical record may fall into several categories at the same time: a patient with both gastric and liver cancer has a record that must be classified under both gastric cancer and liver cancer.
The classification categories may be defined interactively or in batch mode, depending on the requirements of the computer software: some software lets the user define the classification categories interactively at use time; alternatively, the categories are defined directly in a background program during software development, and the user does not need to define them when using the software.
S2: construction of vectorized model
Acquiring a corpus used for training the text vectorization models based on a public text database interface, performing text preprocessing on the corpus, and dividing a positive example data set P and an unlabeled data set U; respectively constructing a TF-IDF model and an LSI model, and training both with the text data in the corpus.
To classify and process data texts, common practice in the field of machine learning is to vectorize the texts; to vectorize them well, a vectorization model is needed, and the feature vectors of the texts are constructed with it. Building the vectorization model requires training it, so a corpus for training the text vectorization model is obtained from a public text database interface, the corpus is preprocessed, and a positive example data set P and an unlabeled data set U are divided.
In the present invention, a TF-IDF model and an LSI model are adopted and trained on the selected text data of the entire corpus. The TF-IDF model is a word-frequency-based vectorization model, and the LSI model is a topic-based one. Both models require converting plain text into word lists, counting all words that appear in the data set, and filtering out words that appear in every text as well as words whose frequency over the entire data set is very low. The set of remaining words is the dictionary D, each word of which corresponds to an id. The two models are briefly described below.
1) TF-IDF (term frequency inverse document frequency) model
The feature dimension of the TF-IDF model is the size of the dictionary D, and the score on each feature is determined by two factors: the frequency of the word within the text, and the proportion of texts containing the word among all texts. The mathematical expression of the TF-IDF model is:
tf-idf(t, d) = tf(t, d) × idf(t)
idf(t) = log( n_d / df(t) )
The tf-idf value is the product of tf and idf, where t denotes a word in the dictionary, d denotes a document, tf(t, d) denotes the frequency with which word t occurs in document d, n_d is the total number of documents, df(t) denotes the number of documents in which word t occurs, and idf(t) denotes the inverse document frequency of word t. Traversing the corpus yields the tf-idf value of every word in the dictionary D. The TF-IDF model is a weight calculation method: each word is assigned a different weight, which accords with intuition.
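For illustration, the TF-IDF vectorization just described can be sketched with scikit-learn as below; the corpus lines and parameters are hypothetical, and library implementations may use a smoothed variant of the idf formula above.

```python
# Minimal TF-IDF sketch with scikit-learn; corpus contents are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "patient diagnosed with gastric cancer",      # hypothetical positive example
    "quarterly financial report of the company",  # hypothetical unlabeled text
    "patient shows no sign of liver disease",
]

# min_df mirrors the dictionary-construction step: words with very low
# document frequency are filtered out before weighting.
vectorizer = TfidfVectorizer(min_df=1)
X_tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per text
print(X_tfidf.shape)                        # (3, |D|): dimension = dictionary size
```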
2) LSI (latent semantic indexing) model
LSI is a topic model whose idea is to find the relationships between words across a large number of documents; it obtains the topics of a text through a singular value decomposition (SVD). An m × n matrix A can be decomposed into three matrices:
A = U Σ V^T
where the matrices U and V are orthogonal and the matrix Σ is a diagonal matrix composed of the singular values of the document matrix.
Assume the document data set contains m texts, each represented over n words, and let A_ij be the feature value of the j-th word of the i-th text (the tf-idf value obtained from the TF-IDF model can serve as the feature value); this yields the matrix A. After the SVD, U_il corresponds to the degree of correlation between the i-th text and the l-th topic, V_jm corresponds to the degree of correlation between the j-th word and the m-th word sense, and Σ_lm corresponds to the degree of correlation between the l-th topic and the m-th word sense.
Assume k is the chosen number of latent topics; k is generally much smaller than the number of features, i.e. words, in the TF-IDF and bag-of-words models. The singular values in Σ measure the variation of the matrix along each dimension and are arranged in Σ in descending order, so when the first k singular values are large, keeping only them can be regarded as a good approximation of the original matrix. A document can then be represented by k topics, and its correlations with these k topics form its feature vector. The principle of the LSI model is very simple: a single, straightforward decomposition yields the topic model, and it simultaneously alleviates the problems related to words and word senses.
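Under the same illustrative assumptions, the LSI step can be sketched as a truncated SVD over the TF-IDF matrix, keeping k latent topics; X_tfidf comes from the previous sketch.

```python
# Minimal LSI sketch: truncated SVD keeps the k largest singular values,
# and each text is represented by its correlations with the k topics.
from sklearn.decomposition import TruncatedSVD

k = 2                               # number of latent topics (normally larger)
lsi = TruncatedSVD(n_components=k)
X_lsi = lsi.fit_transform(X_tfidf)  # dense (n_texts, k) topic features
print(X_lsi.shape)
```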
S3: constructing text feature vectors
Traversing each text in the corpus, obtaining a feature vector of each text by adopting a trained TF-IDF model, obtaining a feature vector of each text by adopting a trained LSI model, and combining the feature vector obtained by adopting the TF-IDF model and the feature vector obtained by adopting the LSI model into one feature vector to obtain a combined text feature vector corresponding to each text.
In the invention, two different feature vectors are obtained with the TF-IDF model and the LSI model and then merged; combining them unites the advantages of both models, so features are extracted accurately and efficiently and classification performance improves.
When the feature vectors obtained from the two models are combined, an ensemble method merges the two vectors into one, either serially or by weighted summation. In the serial mode the dimensions need not be considered: the vectors are simply concatenated, so an r1-dimensional vector and an r2-dimensional vector yield an (r1 + r2)-dimensional vector after combination. In the weighted-summation mode, the dimensions of the two vectors must first be unified before the weighted sum is taken. In general, the feature vector obtained from the TF-IDF model has a much higher dimension than that from the LSI model, so the TF-IDF feature vector must be reduced in dimension; the chi-square test is used for this feature selection.
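The serial mode can be sketched as follows, reusing X_tfidf and X_lsi from the sketches above (the weighted-summation mode is sketched after the chi-square description below).

```python
# Serial (concatenation) combination: an r1-dimensional TF-IDF vector joined
# with an r2-dimensional LSI vector gives an (r1 + r2)-dimensional vector.
from scipy.sparse import csr_matrix, hstack

X_combined = hstack([X_tfidf, csr_matrix(X_lsi)])
print(X_combined.shape)             # (n_texts, r1 + r2)
```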
The chi-square test assumes that a word t is independent of a class c; if they are independent, word t contributes nothing to class c. The chi-square value (error e) is computed by comparing the theoretical value E (the mathematical expectation) with the observed values x = {x_1, x_2, ..., x_n} of n samples:
e = χ² = Σ_{i=1..n} (x_i - E)² / E
The selection procedure computes the chi-square value of every word with respect to class c, sorts the words in descending order of chi-square value, and keeps the first k words as the final features. Note that in the text classification task the error e need not be compared against a set threshold; it suffices to select the k highest-scoring features according to this ranking.
S4: training classifier
Training a Basic classifier based on an ROC-SVM combined algorithm, and judging whether any input text belongs to a classification category defined by a user; and training a label classifier by using a traditional SVM algorithm, wherein the label classifier is used for judging which type or types of the classification type any text belongs to.
In this step, two classifiers are trained in total: a Basic classifier and a Label classifier. The Basic classifier is the core of the text classification, and its quality directly determines the overall classification accuracy. The process of training the Basic classifier is described first; it proceeds by the following steps:
s4.1.1: prototype vectors of a normal data set P and an unlabeled data set U are constructed based on an ROC (Rocchio) algorithm, a central plane of the two prototype vectors is used as a partition of a vector space, and all instances in the space belonging to a certain prototype are divided into categories to which the prototype belongs. The ROC algorithm (rocchi o) is an algorithm built on a Vector Space Model (VSM).
S4.1.2: and calculating the similarity of the texts, respectively comparing the similarity of the text feature vectors in the unlabeled data set U with the prototype vectors of the positive example data set P and the unlabeled data set U, if the similarity is similar to the prototype vector constructed by the positive example data set P, the text is regarded as the positive example text, and if the similarity is similar to the prototype vector constructed by the unlabeled data set U, the text is regarded as the negative example text, and the negative example text is put into a set to form a negative example set RN. The similarity of the text can be calculated by adopting a cosine distance or Euclidean distance method, and in general, in a vector space model, the cosine distance is mostly adopted as a method for calculating the similarity, so in the preferred embodiment of the invention, the cosine distance can be adopted for calculating the similarity.
Steps S4.1.1 and S4.1.2 together complete the construction of the negative example set; the programming can follow the ROC (Rocchio) algorithm flow shown in fig. 2, which gives the pseudo-code of the algorithm. In the algorithm, P denotes the positive example data set, U denotes the unlabeled data set, and the three vector symbols denote, respectively, the feature vector of each document, the positive example prototype, and the negative example prototype; α and β are parameters that adjust the relative weight of the two sample sets, and studies in the prior art indicate that α = 4 and β = 4 are good parameter values.
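A minimal sketch of the prototype construction and negative-example extraction just described, assuming L2-normalized feature-vector matrices P_vecs and U_vecs; the function names and the exact prototype formula are illustrative.

```python
# Rocchio (ROC) sketch with alpha = beta = 4: each prototype strengthens its
# own set's mean vector and subtracts the other set's, then every unlabeled
# text is assigned to the nearer prototype by cosine similarity.
import numpy as np

def rocchio_prototypes(P_vecs, U_vecs, alpha=4.0, beta=4.0):
    pos_proto = alpha * P_vecs.mean(axis=0) - beta * U_vecs.mean(axis=0)
    neg_proto = alpha * U_vecs.mean(axis=0) - beta * P_vecs.mean(axis=0)
    return pos_proto, neg_proto

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def extract_rn(P_vecs, U_vecs):
    pos_proto, neg_proto = rocchio_prototypes(P_vecs, U_vecs)
    # texts closer to the negative prototype form the reliable-negative set RN
    return [i for i, d in enumerate(U_vecs)
            if cos(d, neg_proto) > cos(d, pos_proto)]
```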
S4.1.3: taking the union of the positive case data set P and the negative case set RN as a positive case set training set and training a reference classifier C0 by adopting an SVM algorithm; selecting a negative example set W from the difference set of the unlabeled data set U relative to the negative example set RN, training a new classifier by using the union of the negative example set RN and the negative example set W as a negative example training set, and repeating iteration until no negative example exists in the difference set of the unlabeled data set U relative to the negative example set RN or the iteration number reaches a given threshold value, thereby obtaining an iterated classifier C1. It should be noted that, the SVM algorithm is usually expressed as a training SVM, because the SVM itself is a classifier, and the training of the SVM is performed by using the SVM algorithm, so as to obtain a classifier suitable for the scene, the training of the SVM mentioned above is performed by using the SVM algorithm.
S4.1.4: and comparing the performances of the iterated classifier C1 and the reference classifier C0, and taking the classifier with better performance as a final Basic classifier.
Text data in many fields are not linearly separable, and continuing to treat them as linearly separable lowers classification accuracy. In the method of the invention, therefore, when the text data are not linearly separable, a k-means clustering method can be used to filter and purify the set RN to obtain a new negative example set RN'; RN is replaced by RN', and steps S4.1.3 and S4.1.4 are carried out as before to obtain the final Basic classifier.
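A sketch of the k-means purification; the patent does not fix the exact filtering rule, so the centroid-versus-prototype test below is an assumption (cos() is the helper from the Rocchio sketch above).

```python
# Cluster the candidate negatives RN and keep only clusters whose centroid
# lies closer to the negative prototype than to the positive one; the
# surviving texts form the purified set RN'.
import numpy as np
from sklearn.cluster import KMeans

def purify_rn(RN_vecs, pos_proto, neg_proto, n_clusters=8):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(RN_vecs)
    keep = [c for c in range(n_clusters)
            if cos(km.cluster_centers_[c], neg_proto) >
               cos(km.cluster_centers_[c], pos_proto)]
    return RN_vecs[np.isin(km.labels_, keep)]   # RN'
```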
The training of the label classifier is carried out according to the following steps:
s4.2.1: using a one-to-many multi-classification strategy, for each class, the classifier uses the data of that class as positive case data and the data of the other classes as negative case data. In order to ensure the balance of positive and negative example data, the data with the class ratio larger than 2:1 is downsampled, namely: firstly, storing all other types of data according to the sequence of the types, then extracting one document every K documents, and adding the documents into a new test set, so that the final training data can be ensured to be relatively balanced, and the classifier can be ensured to reach the due performance.
S4.2.2: and constructing a binary classifier for each class by adopting an SVM algorithm to finally obtain a label classifier, and training to obtain a plurality of label classifiers when a plurality of classes exist.
S5: inputting text to be classified and classifying
The Basic classifier first performs an initial classification of the data texts to be classified, judging whether each of them belongs to the positive example categories given by the user; the categories of the input data texts are then screened to determine their candidate classifications, and the Label classifier finally determines which category or categories among the classification categories each data text specifically belongs to.
After the Basic classifier and the Label classifier have been trained, the data texts to be classified are processed: following the initial classification by the Basic classifier, the input texts are screened to determine their candidate classifications. The invention uses Elasticsearch to search the positive example data set P for data texts similar to the text to be classified and takes the categories of those similar texts as the candidate categories. With the Elasticsearch (ES) method, the steps are as follows:
1) First, all labeled documents and their corresponding category labels are stored in the database of the ES. Using an inverted index, the ES constructs a word-document mapping table T = { t_ij | t_i ∈ D, d_j ∈ P }, where t_ij is a weight reflecting the number of occurrences and the weight of the i-th word in the j-th document.
2) For each document to be classified, its topK words are extracted and used as the conditions of the ES retrieval; the ES automatically looks these words up in the table T, scores the relevance of every document in the database, and sorts the retrieval results from high to low by score.
3) The K labeled documents of highest relevance are selected, and the set of their categories is taken as the candidate categories of the document to be classified. The number of candidate classifications obtained this way is generally much smaller than the total number of categories.
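A sketch of the candidate-category screening, assuming the elasticsearch-py 8.x client; the index name, field names and parameter values are assumptions, not fixed by the method.

```python
# Query the ES index of labeled documents with the topK words of the document
# to classify; the categories of the K most relevant hits become candidates.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")     # hypothetical endpoint

def candidate_categories(top_k_words, K=10):
    query = {"match": {"content": " ".join(top_k_words)}}
    hits = es.search(index="labeled_docs", query=query, size=K)["hits"]["hits"]
    return {hit["_source"]["category"] for hit in hits}
```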
Based on the above method flow, verification is performed with a concrete example, using the THUCNews Chinese text data set of Tsinghua University as the corpus. THUCNews was generated by filtering the historical data of the Sina News RSS subscription channel from 2005 to 2011 and contains 740,000 news documents (2.19 GB), all in UTF-8 plain text format. On the basis of the original Sina News classification system, the documents were re-integrated and divided into 14 categories: finance, lottery, real estate, stock, home, education, science and technology, society, fashion, current politics, sports, constellation, games and entertainment. To divide the positive example data set P and the unlabeled data set U, the documents of one category are taken as the positive example set and the documents of the other categories as the negative example set, constructing a positive example data set P and a negative example data set N; a% of the documents are randomly drawn from the positive and negative example sets and put into P and N respectively, and the remaining (1 - a%) of the documents of all categories form the unlabeled data set U. The sets P and U serve as the inputs; N is constructed only so that the sample-space distribution stays similar to the input distribution and for the comparison tests of the algorithms, and is not itself an input. Moreover, no separate test set is needed: the performance of an algorithm can be evaluated by how well the positive examples in U are identified.
1) Training a TF-IDF model and an LSI model based on the corpus;
2) The feature vector of each text is obtained from the TF-IDF model and from the LSI model, and the two are merged to construct the combined text feature vector. The combination uses the weighted-summation mode of the ensemble method, so the feature vector each text obtains from the TF-IDF model must undergo chi-square feature selection to bring its dimension down to that of the feature vector obtained from the LSI model.
3) The positive example data set P and the unlabeled data set U are input, the corresponding prototype vectors are constructed from them, and the classification plane is divided with the ROC (Rocchio) algorithm, setting the parameters α = 4 and β = 4; a simple division then yields the negative example set RN. Because the documents may be linearly inseparable, the variant using the k-means method is also run for comparison: the negative example set RN is filtered and refined by k-means clustering to obtain a new negative example set RN'. Classifiers C0 and C1 are trained with the SVM algorithm according to steps S4.1.3 and S4.1.4, and the better-performing one is selected: with the positive example data set P as the reference, the f1 scores of C0 and C1 are computed and the higher-scoring classifier is taken as the final Basic classifier.
4) The candidate categories of each document are screened out through the Elasticsearch (ES): the similarity between the document and every labeled document is computed and ranked from high to low by similarity score, and the categories of the topK labeled documents are taken as the candidate categories of the document.
5) The Label classifier determines which one or more of the 14 defined categories the document specifically belongs to.
The method of the invention is verified through experiments. For the text classification problem, the core is the performance of the Basic classifier, so only the performance comparison of the Basic classifier is carried out here.
Cases of a = 5, 15, 25, 35, 45, 55 and 65 are selected, the corresponding f1 scores are calculated, and the comparison verifies that the classifier finally obtained by the method has high accuracy. With the unlabeled data set U as the test set, the following schemes are compared:
scheme 1: directly dividing by using an ROC (Rocchio) algorithm;
scheme 2: training the SVM, called PU-SVM, by directly utilizing the regular data set P and the unlabeled data set U;
scheme 3: by adopting the method, firstly, an ROC (Rocchio) algorithm is used for obtaining a negative example set RN, and then the SVM is trained by using a positive example data set P and the negative example set RN, so that the training is called ROC-SVM;
scheme 4: and filtering by adopting a k-means method to obtain a negative example set RN ', and training the SVM by using the positive example data set P and the negative example set RN', wherein the training is called ROC-SVM with k-means.
The f1 score is used as the comparison index for the four schemes, as shown in fig. 3. ROC performs worst: when a% is below 25%, ROC (Rocchio) is essentially unable to find a reasonable dividing plane in the unlabeled data set U; as a% rises, ROC gradually manages a simple division of U, but the side divided as positive still contains many documents that actually belong to the negative examples, so the f1 score computed on the positive examples stays low. This is because ROC compares the two categories through computed prototypes: when the proportion of positive examples is small, almost all data in the unlabeled data set U are negative examples, so little data is divided to the positive side; as the positive proportion a% rises, the computed prototypes begin to reflect the difference between the two, and the f1 score gradually increases. However, because the high-dimensional text features are essentially linearly inseparable and cannot be fully separated by a single classification plane, ROC remains the worst of all schemes. The PU-SVM performs better because, compared with ROC-SVM, it omits the step of filtering positives out of the unlabeled data set U; only part of the data serves as support vectors during SVM training, and the support vectors in U at maximum margin from the positive example data set P are essentially documents that differ strongly from the document categories in P. The PU-SVM is therefore in general much better than ROC. Meanwhile, ROC-SVM and ROC-SVM with k-means perform best, better than PU-SVM, because the unlabeled data in their training sets are screened. The f1 score of the ROC-SVM is slightly higher when a% is below 25%, while the f1 score of ROC-SVM with k-means is slightly higher when a% exceeds 25%.
Because it is difficult to determine whether data are linearly separable, in practical use text classification is usually performed with the k-means clustering step included, i.e. scheme 4. Two representative values of a are therefore selected to test the effect of ROC-SVM with k-means, using the precision, recall and f1 score commonly used in the classification field as indexes.
The results for a = 15 and a = 45 are shown in table 1.
TABLE 1 ROC-kmeans-SVM test results vs. a
According to table 1, for most categories in the test data, 70% of the documents belonging to the positive example category in the unlabeled data U can be identified, with a recognition precision reaching 90%. As shown in fig. 3, when the proportion of positive examples is small the precision is low, only about 70%, but once the proportion reaches 45% or more, precision stays around 90% and f1 reaches about 80%.
The performance comparison of the Basic classifier shows that the classification method of the invention classifies texts with high accuracy, and following the steps of the method also improves efficiency effectively. The method can therefore classify data texts lacking negative examples accurately and efficiently.
The above examples are only preferred embodiments of the present invention, it should be noted that: it will be apparent to those skilled in the art that various modifications and equivalents can be made without departing from the spirit of the invention, and it is intended that all such modifications and equivalents fall within the scope of the invention as defined in the claims.

Claims (8)

1. A method for classifying text lacking negative examples is characterized by comprising the following steps:
s1: determining classified text and classification category
Determining the data texts to be classified, and customizing the text classification categories, wherein the customized text classification categories are taken as the positive example categories;
s2: construction of vectorized model
Acquiring a corpus used for training the text vectorization models based on a public text database interface, performing text preprocessing on the corpus, and dividing a positive example data set P and an unlabeled data set U; respectively constructing a TF-IDF model and an LSI model, and training the TF-IDF model and the LSI model with the text data in the corpus;
s3: constructing text feature vectors
Traversing each text in the corpus, obtaining a feature vector of each text by adopting a trained TF-IDF model, obtaining a feature vector of each text by adopting a trained LSI model, and combining the text feature vector obtained by adopting the TF-IDF model and the text feature vector obtained by adopting the LSI model into one feature vector to obtain a combined text feature vector corresponding to each text;
s4: training classifier
Training a Basic classifier based on the combined ROC-SVM algorithm to judge whether any input text belongs to the user-defined classification categories; training a Label classifier with the traditional SVM algorithm, the Label classifier being used to judge which category or categories among the classification categories any text belongs to;
s5: inputting text to be classified and classifying
Initially classifying the data texts to be classified with the Basic classifier, judging whether each data text to be classified belongs to the positive example categories given by the user, screening the categories of the input data texts to be classified and determining their candidate classifications, and then determining with the Label classifier which category or categories among the classification categories each data text to be classified specifically belongs to.
2. The method for classifying text lacking negative examples according to claim 1, wherein in step S3, when constructing the combined feature vector, the text feature vector obtained from the LSI model and the text feature vector obtained from the TF-IDF model are combined by concatenation or weighted summation using an ensemble method to obtain the combined text feature vector.
3. The method for classifying texts lacking negative examples according to claim 2, wherein when the feature vector combination is performed by weighted summation, feature selection is performed on the text feature vector obtained through the TF-IDF model by using Chi-Square test, dimension reduction is performed on the feature vector so that the dimension of the feature vector is the same as that of the text feature vector obtained through the LSI model, and then weighted summation is performed.
4. The method for classifying text lacking negative examples according to claim 1, wherein the step S4, when training the Basic classifier, comprises the steps of:
s4.1.1: establishing prototype vectors of the positive example data set P and the unlabeled data set U based on the ROC algorithm, taking the central plane between the two prototype vectors as a partition of the vector space, and assigning all instances in the space belonging to a certain prototype to the category of that prototype;
s4.1.2: calculating the similarity of texts, respectively comparing the similarity of text feature vectors in an unlabeled data set U with prototype vectors of a positive example data set P and the unlabeled data set U, if the similarity is similar to the prototype vectors constructed by the positive example data set P, the text feature vectors are regarded as positive example texts, and if the similarity is similar to the prototype vectors constructed by the unlabeled data set U, the text feature vectors are regarded as negative example texts, and the negative example texts are put into a set to form a negative example set RN;
s4.1.3: taking the union of the positive example data set P and the negative example set RN as the training set, with P as the positive examples and RN as the negative examples, and training a reference classifier C0 with the SVM algorithm; selecting a negative example set W from the difference set of the unlabeled data set U relative to the negative example set RN, training a new classifier with the union of RN and W as the negative example training set, and iterating repeatedly until no negative examples remain in the difference set of U relative to RN or the number of iterations reaches a given threshold, obtaining the iterated classifier C1;
s4.1.4: and comparing the performances of the iterated classifier C1 and the reference classifier C0, and taking the classifier with better performance as a final Basic classifier.
5. The method for classifying text lacking negative examples of claim 3, wherein in step S4.1.2, the set RN is filtered and purified with a k-means clustering method to obtain a new negative example set RN', RN is replaced by RN', and steps S4.1.3 and S4.1.4 are continued to obtain the final Basic classifier.
6. The method for classifying text without negative examples of claim 3, wherein in the step S4.1.2, the cosine distance or Euclidean distance is used to calculate the similarity of the text.
7. The method for classifying text lacking negative examples according to claim 1, wherein the step S4 of training the label classifier comprises the following steps:
s4.2.1: using a one-to-many multi-classification strategy, and for each class, using the data of the class as positive example data and using the data of other classes as negative example data by a classifier;
s4.2.2: and constructing a binary classifier for each class by adopting an SVM algorithm to finally obtain a label classifier, and training to obtain a plurality of label classifiers when a plurality of classes exist.
8. The method for classifying texts lacking negative examples according to claim 1, wherein in step S5, when the categories of the input data texts to be classified are selected, an Elasticsearch is used to search for data texts similar to the texts to be classified in the positive example data set P, and the categories of the similar data texts are used as candidate categories.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911058163.4A CN110795564B (en) 2019-11-01 2019-11-01 Text classification method lacking negative cases

Publications (2)

Publication Number Publication Date
CN110795564A CN110795564A (en) 2020-02-14
CN110795564B (en) 2022-02-22

Family

ID=69442429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911058163.4A Active CN110795564B (en) 2019-11-01 2019-11-01 Text classification method lacking negative cases

Country Status (1)

Country Link
CN (1) CN110795564B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210027339A1 (en) * 2019-07-24 2021-01-28 Walmart Apollo, Llc System and method for determining key words related to product safety issues
CN111429184A (en) * 2020-03-27 2020-07-17 北京睿科伦智能科技有限公司 User portrait extraction method based on text information
CN111524570B (en) * 2020-05-06 2024-01-16 万达信息股份有限公司 Ultrasonic follow-up patient screening method based on machine learning
CN111709247B (en) * 2020-05-20 2023-04-07 北京百度网讯科技有限公司 Data set processing method and device, electronic equipment and storage medium
CN112070138B (en) * 2020-08-31 2023-09-05 新华智云科技有限公司 Construction method of multi-label mixed classification model, news classification method and system
CN112667813B (en) * 2020-12-30 2022-03-01 北京华宇元典信息服务有限公司 Method for identifying sensitive identity information of referee document
CN113590677A (en) * 2021-07-14 2021-11-02 上海淇玥信息技术有限公司 Data processing method and device and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106570164A (en) * 2016-11-07 2017-04-19 中国农业大学 Integrated foodstuff safety text classification method based on deep learning
CN106897459A (en) * 2016-12-14 2017-06-27 中国电子科技集团公司第三十研究所 A kind of text sensitive information recognition methods based on semi-supervised learning
WO2019035765A1 (en) * 2017-08-14 2019-02-21 Dathena Science Pte. Ltd. Methods, machine learning engines and file management platform systems for content and context aware data classification and security anomaly detection
CN107609121A (en) * 2017-09-14 2018-01-19 深圳市玛腾科技有限公司 Newsletter archive sorting technique based on LDA and word2vec algorithms
CN107590134A (en) * 2017-10-26 2018-01-16 福建亿榕信息技术有限公司 Text sentiment classification method, storage medium and computer
CN109408641A (en) * 2018-11-22 2019-03-01 山东工商学院 It is a kind of based on have supervision topic model file classification method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Integrating Document Clustering and Topic Modeling";Pengtao Xie等;《Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence(UAI2013)》;20131231;1-10 *
"基于改进TF-IDF算法的文本分类方法研究";贺科达等;《广东工业大学学报》;20161026;第33卷(第5期);49-53 *

Also Published As

Publication number Publication date
CN110795564A (en) 2020-02-14

Similar Documents

Publication Publication Date Title
CN110795564B (en) Text classification method lacking negative cases
Guyon et al. An introduction to variable and feature selection
Li et al. Using discriminant analysis for multi-class classification: an experimental investigation
Santra et al. Genetic algorithm and confusion matrix for document clustering
CN112614538A (en) Antibacterial peptide prediction method and device based on protein pre-training characterization learning
CN109165383B (en) Data aggregation, analysis, mining and sharing method based on cloud platform
López-Sánchez et al. Hybridizing metric learning and case-based reasoning for adaptable clickbait detection
CN112732916A (en) BERT-based multi-feature fusion fuzzy text classification model
Jarvis Data mining with learner corpora
Wu et al. Chinese text classification based on character-level CNN and SVM
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
Abdollahi et al. An ontology-based two-stage approach to medical text classification with feature selection by particle swarm optimisation
Greensmith et al. An artificial immune system approach to semantic document classification
CN112579783B (en) Short text clustering method based on Laplace atlas
Nurhachita et al. A comparison between naïve bayes and the k-means clustering algorithm for the application of data mining on the admission of new students
Gonzalez et al. Unsupervised relation extraction by massive clustering
Preetham et al. Comparative Analysis of Research Papers Categorization using LDA and NMF Approaches
CN110580286A (en) Text feature selection method based on inter-class information entropy
Zmiycharov et al. Experiments in Authorship-Link Ranking and Complete Author Clustering.
Parhusip et al. Classification breast cancer revisited with machine learning
Alatas et al. Topic Detection using fuzzy c-means with nonnegative double singular value decomposition initialization
Deshmukh et al. An overview on implementation using hybrid naïve Bayes algorithm for text categorization
CN114298020B (en) Keyword vectorization method based on topic semantic information and application thereof
Masmoudi et al. A binarization strategy for modelling mixed data in multigroup classification
Ağduk et al. Classification of news texts from different languages with machine learning algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant