CN110750638A

CN110750638A - Multi-label corpus text classification method based on semi-supervised learning

Info

Publication number: CN110750638A
Application number: CN201910571367.1A
Authority: CN
Inventors: 肖清林
Original assignee: Central Mdt Infotech Ltd Of United States Of Xiamen
Current assignee: Central Mdt Infotech Ltd Of United States Of Xiamen
Priority date: 2019-06-28
Filing date: 2019-06-28
Publication date: 2020-02-04

Abstract

A multi-label corpus text classification method based on semi-supervised learning comprises the following steps: performing semi-supervised learning on multi-label corpus text to obtain a classification strategy knowledge base; preprocessing the corpus text to be classified; classifying the corpus text to be classified to obtain candidate categories, and determining a first text content identification set; determining a first text content set in a preset training data set, and selecting from the first text content set the text contents corresponding to the N candidate categories to determine a second text content set; and determining the target category of the text to be classified according to the similarity between the text feature words and each text content in the second text content set. The method reduces computational complexity and the amount of computation, and improves the efficiency of text classification.

Description

Multi-label corpus text classification method based on semi-supervised learning

Technical Field

The invention relates to the field of text classification of a corpus, in particular to a multi-label corpus text classification method based on semi-supervised learning.

Background

Text classification is an important part of text mining: each document in a document set is assigned a category according to predefined subject categories. Classifying documents with an automatic text classification system helps people find the information and knowledge they need. Classification is also one of the most basic ways in which people make sense of information.

With the rapid growth of textual information, and particularly the proliferation of online text on the Internet, automatic text classification has become a key technology for processing and organizing large amounts of document data. Text classification is now widely used in various fields. For example, on an Internet platform, a server may take a query sentence that a user submits through a client, classify the corresponding text information to determine its category, automatically answer the user's query according to that category, and push related information.

In prior-art text classification methods, as the amount of information grows, users' requirements for search precision, recall and the like keep rising, and the training set contains a very large number of samples. Computing the similarity against every sample in the training set by traversal consumes a large share of the server's capacity and is slow. As a result, the server's resources are heavily occupied and the computation takes too long, so that answering the user or pushing related information to the user takes a long time.

Disclosure of Invention

(I) Objects of the invention

In order to solve the technical problems described in the background art, the invention provides a multi-label corpus text classification method based on semi-supervised learning, which reduces computational complexity and the amount of computation and improves the efficiency of text classification.

(II) Technical scheme

In order to solve the above problems, the invention provides a multi-label corpus text classification method based on semi-supervised learning, which comprises the following steps:

S1, performing semi-supervised learning based on multi-label corpus text to obtain a classification strategy knowledge base;

S2, preprocessing the corpus text to be classified to obtain the text feature words in the corpus text;

S3, dividing the corpus text to be classified into categories according to the text feature words, to obtain a certain number of candidate categories for the corpus text to be classified;

S4, determining a first text content identification set in a pre-stored inverted index table according to the classification strategy knowledge base, wherein the first text content identification set comprises a plurality of text content identifications corresponding to text contents similar to the text feature words;

S5, determining a first text content set in a preset training data set according to the first text content identification set, wherein the training data set comprises sample text content identifications, sample text contents and the category corresponding to each sample text content;

S6, selecting, from the first text content set, the text contents corresponding to the N candidate categories according to the certain number of candidate categories, to determine a second text content set;

S7, determining the target category of the text to be classified according to the similarity between the text feature words and each text content in the second text content set.
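The narrowing performed in steps S4 to S6 can be illustrated with a minimal Python sketch. It is not the patented implementation: the sample data, the names `TRAINING_SET`, `INVERTED_INDEX`, `first_id_set` and `second_content_set`, and the rule that a sample counts as "similar" when it shares at least one feature word are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical training data set: id -> (sample text tokens, category).
TRAINING_SET: Dict[int, Tuple[List[str], str]] = {
    1: (["refund", "order", "payment"], "billing"),
    2: (["login", "password", "reset"], "account"),
    3: (["payment", "card", "declined"], "billing"),
    4: (["password", "email", "change"], "account"),
    5: (["delivery", "order", "late"], "shipping"),
}

# Hypothetical inverted index: feature word -> ids of samples containing it.
INVERTED_INDEX: Dict[str, Set[int]] = {
    "order": {1, 5},
    "payment": {1, 3},
    "password": {2, 4},
    "late": {5},
}


def first_id_set(feature_words: List[str]) -> Set[int]:
    """S4 (assumed rule): ids of samples sharing at least one feature word."""
    ids: Set[int] = set()
    for word in feature_words:
        ids |= INVERTED_INDEX.get(word, set())
    return ids


def second_content_set(
    feature_words: List[str], candidate_categories: Set[str]
) -> Dict[int, Tuple[List[str], str]]:
    """S5 + S6: fetch the samples, then keep only the candidate categories."""
    first_set = {i: TRAINING_SET[i] for i in first_id_set(feature_words)}
    return {i: s for i, s in first_set.items() if s[1] in candidate_categories}


# Only samples 1 and 3 remain for the similarity computation in S7.
narrowed = second_content_set(["order", "payment"], {"billing"})
print(sorted(narrowed))
```

In this toy run the similarity step only has to look at two of the five samples, which is the kind of reduction in traversed entries that the method aims for.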

Preferably, in S1, the semi-supervised learning includes the steps of:

S11, constructing a multi-label corpus text set and an unknown multi-label corpus text set;

S12, training on the samples in the multi-label corpus text set to obtain a classifier;

S13, constructing a subset U' of the unknown multi-label corpus text set, and using the classifier to judge the category of each unknown multi-label corpus text X' in the subset U';

S14, if the unknown multi-label corpus text X' is judged to be a multi-label corpus text, labeling X' and adding it to the multi-label corpus text set; if X' is judged to be an unknown multi-label corpus text, deleting X' from the unknown multi-label corpus text set;

and S15, iterating S11 to S14 until the unknown multi-label corpus text set is empty, and outputting the classification strategy knowledge base.
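The loop in S11 to S15 resembles a self-training scheme, and a minimal Python sketch of that reading follows. The function names `self_train` and `keyword_trainer`, the `batch_size` parameter, the toy keyword classifier, and the choice to return the expanded labeled set as a stand-in for the classification strategy knowledge base are all assumptions; the source does not specify them.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[str, str]                       # (text, label)
Classifier = Callable[[str], Optional[str]]    # returns None when rejected


def keyword_trainer(samples: List[Sample]) -> Classifier:
    """Toy stand-in for S12: remember which label each word first appeared with."""
    vocab = {}
    for text, label in samples:
        for word in text.split():
            vocab.setdefault(word, label)

    def classify(text: str) -> Optional[str]:
        for word in text.split():
            if word in vocab:
                return vocab[word]
        return None  # no known word: treated as "still unknown"

    return classify


def self_train(
    labeled: List[Sample],
    unlabeled: List[str],
    train: Callable[[List[Sample]], Classifier],
    batch_size: int = 1,
) -> List[Sample]:
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:                                     # S15: until empty
        classify = train(labeled)                        # S12
        subset = unlabeled[:batch_size]                  # S13: subset U'
        unlabeled = unlabeled[batch_size:]
        for text in subset:
            label = classify(text)
            if label is not None:                        # S14: accept and label
                labeled.append((text, label))
            # S14: otherwise the text is simply dropped
    return labeled


seed = [("refund payment", "billing"), ("reset password", "account")]
pool = ["payment declined", "declined card", "weather today"]
result = self_train(seed, pool, keyword_trainer)
print(result)
```

Because the classifier is retrained on each pass, "declined card" is accepted only after "payment declined" has been added to the labeled set, which shows how the labeled set grows from a small seed.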

Preferably, the inverted index table is constructed from the preset training data set of a nearest neighbor algorithm, and includes feature attribute index entries and at least one text content identifier corresponding to each feature attribute.
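As a rough illustration of such a table, the following Python sketch builds an inverted index from a small training data set. The function name `build_inverted_index`, the sample data, and the use of individual tokens as the "feature attributes" are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_inverted_index(
    training_set: Dict[int, Tuple[List[str], str]]
) -> Dict[str, Set[int]]:
    """Map each feature attribute (here: a token) to the ids of samples containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sample_id, (tokens, _category) in training_set.items():
        for token in set(tokens):
            index[token].add(sample_id)
    return dict(index)


# Hypothetical training data: id -> (tokens, category).
samples = {
    10: (["price", "discount"], "sales"),
    11: (["price", "invoice"], "billing"),
}
index = build_inverted_index(samples)
print(index["price"])
```

Looking up a feature word in this table returns sample identifiers directly, so the training data set does not have to be scanned for every query.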

Preferably, in S12, the training of the classifier includes the following steps:

S121, performing Chinese word segmentation and stop-word removal on the documents of the sensitive document set;

S122, performing feature representation on the processed sensitive document set by utilizing an SVM algorithm;

S123, selecting features by using an information gain method, and retaining the effective text features;

S124, training a classifier by using the libsvm tool;

S125, evaluating the classifier model and refining the trained classifier;

and S126, finishing the training and outputting the classifier.
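The information gain selection in S123 can be sketched in plain Python as follows. This is a textbook formulation over binary term presence, not necessarily the variant used by the inventors; the function names, the `top_k` parameter and the sample documents are assumptions, and the libsvm training of S124 is deliberately left out.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Doc = Tuple[Set[str], str]  # (set of tokens after segmentation, category)


def entropy(labels: List[str]) -> float:
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs: List[Doc], term: str) -> float:
    """IG(term) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]."""
    labels = [label for _, label in docs]
    with_term = [label for tokens, label in docs if term in tokens]
    without_term = [label for tokens, label in docs if term not in tokens]
    n = len(docs)
    conditional = (len(with_term) / n) * entropy(with_term) + (
        len(without_term) / n
    ) * entropy(without_term)
    return entropy(labels) - conditional


def select_features(docs: List[Doc], top_k: int) -> List[str]:
    """Keep the top_k terms by information gain (ties broken alphabetically)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, t), t))
    return ranked[:top_k]


docs = [
    ({"payment", "please"}, "billing"),
    ({"payment", "help"}, "billing"),
    ({"password", "please"}, "account"),
    ({"password", "help"}, "account"),
]
print(select_features(docs, 2))
```

In this toy corpus "payment" and "password" separate the two categories perfectly, while "please" and "help" appear equally in both and carry no information, so only the first two would be retained.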

Preferably, determining the target category of the text to be classified according to the similarity between the text feature words and each text content in the second text content set specifically includes:

respectively calculating the similarity between the text feature words and each text content in the second text content set;

determining at least one most similar text content according to the similarity;

scoring the category to which each text content among the at least one most similar text content belongs;

and selecting the category with the highest score as the target category of the text.
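The four sub-steps above can be sketched as a small nearest-neighbor vote in Python. The source does not name a similarity measure or a scoring rule, so cosine similarity over word counts, the parameter `k`, and scoring each category by the sum of its neighbors' similarities are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import List, Optional, Tuple


def cosine(a: List[str], b: List[str]) -> float:
    """Cosine similarity between two bags of words (an assumed measure)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def classify(
    feature_words: List[str],
    second_set: List[Tuple[List[str], str]],
    k: int = 3,
) -> Optional[str]:
    # Compute the similarity to every text content in the second set.
    scored = [(cosine(feature_words, tokens), cat) for tokens, cat in second_set]
    # Keep the k most similar text contents.
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Score each category by the summed similarity of its neighbors.
    votes = defaultdict(float)
    for similarity, category in nearest:
        votes[category] += similarity
    # The highest-scoring category becomes the target category.
    return max(votes, key=votes.get) if votes else None


second_set = [
    (["refund", "order", "payment"], "billing"),
    (["payment", "card", "declined"], "billing"),
    (["delivery", "order", "late"], "shipping"),
]
print(classify(["order", "payment", "late"], second_set))
```

Summing similarities rather than counting neighbors lets two moderately similar samples of one category outweigh a single close sample of another, which is one possible reading of "scoring" in the text.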

The technical scheme of the invention has the following beneficial technical effects. Semi-supervised learning improves the expandability and practicability of the multi-label corpus text. The resulting classification strategy knowledge base is used to classify and judge corpus text, effectively determining whether a corpus text is a multi-label corpus text. Preprocessing the corpus text to be classified extracts its text feature words, and a common fast classification component then performs a preliminary classification of the text according to those feature words to obtain candidate categories. Next, a set of text contents similar to the text feature words is screened out according to the feature words, the text contents whose categories fall outside the candidate categories are removed from that set, and the target category of the text to be classified is finally determined according to the similarity between the text feature words and each sample text content in the final set. This scheme greatly reduces the number of text entries that must be traversed during classification, which reduces computational complexity and the amount of computation and improves the efficiency of text classification.

Drawings

Fig. 1 is a schematic structural diagram of a multi-label corpus text classification method based on semi-supervised learning according to the present invention.

Fig. 2 is a schematic diagram of a semi-supervised learning process in the multi-label corpus text classification method based on semi-supervised learning according to the present invention.

Fig. 3 is a schematic flowchart of training classifiers in a multi-label corpus text classification method based on semi-supervised learning according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

As shown in fig. 1-3, the method for classifying texts in a multi-label corpus based on semi-supervised learning provided by the present invention includes the following steps:

In an alternative embodiment, in S1, the semi-supervised learning includes the steps of:

In an optional embodiment, the inverted index table is constructed from the preset training data set of a nearest neighbor algorithm, and includes feature attribute index entries and at least one text content identifier corresponding to each feature attribute.

In an alternative embodiment, in S12, the training of the classifier includes the steps of:

S124, training a classifier by using the libsvm tool;

S125, evaluating the classifier model and refining the trained classifier;

and S126, finishing the training and outputting the classifier.

In an optional embodiment, the determining the target category of the text to be classified according to the similarity between the text feature word and each text content in the second text set specifically includes:

determining at least one most similar text content according to the similarity;

In the invention, semi-supervised learning improves the expandability and practicability of the multi-label corpus text. The resulting classification strategy knowledge base is used to classify and judge corpus text, effectively determining whether a corpus text is a multi-label corpus text. Preprocessing the corpus text to be classified extracts its text feature words, and a common fast classification component then performs a preliminary classification of the text according to those feature words to obtain candidate categories. Next, a set of text contents similar to the text feature words is screened out according to the feature words, the text contents whose categories fall outside the candidate categories are removed from that set, and the target category of the text to be classified is finally determined according to the similarity between the text feature words and each sample text content in the final set. This scheme greatly reduces the number of text entries that must be traversed during classification, which reduces computational complexity and the amount of computation and improves the efficiency of text classification.

It is to be understood that the above-described embodiments of the present invention merely illustrate or explain the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims

1. A multi-label corpus text classification method based on semi-supervised learning is characterized by comprising the following steps:

2. The method for text classification of a multi-label corpus based on semi-supervised learning as claimed in claim 1, wherein in S1, semi-supervised learning comprises the following steps:

S13, constructing a subset U of the unknown multi-label corpus text set, and using the classifier to judge the category of each unknown multi-label corpus text X' in the subset U;

3. The method for classifying texts in a multi-label corpus based on semi-supervised learning according to claim 1, wherein the inverted index table is constructed from the preset training data set of a nearest neighbor algorithm, and comprises feature attribute index entries and at least one text content identifier corresponding to each feature attribute.

4. The method for text classification of a multi-label corpus based on semi-supervised learning as claimed in claim 2, wherein the step of training the classifier in S12 includes the following steps:

S124, training a classifier by using the libsvm tool;

S125, evaluating the classifier model and refining the trained classifier;

and S126, finishing the training and outputting the classifier.

5. The method for classifying texts of a multi-label corpus based on semi-supervised learning according to claim 1, wherein the determining the target class of the text to be classified according to the similarity between the text feature words and each text content in the second text set specifically comprises:

determining at least one most similar text content according to the similarity;