CN110750638A - Multi-label corpus text classification method based on semi-supervised learning - Google Patents

Publication number
CN110750638A
Authority
CN
China
Prior art keywords
text
corpus
label
label corpus
text content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910571367.1A
Other languages
Chinese (zh)
Inventor
肖清林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central Mdt Infotech Ltd Of United States Of Xiamen
Original Assignee
Central Mdt Infotech Ltd Of United States Of Xiamen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central Mdt Infotech Ltd Of United States Of Xiamen
Priority to CN201910571367.1A
Publication of CN110750638A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

A multi-label corpus text classification method based on semi-supervised learning comprises the following steps: performing semi-supervised learning based on multi-label corpus text to obtain a classification strategy knowledge base; preprocessing the corpus text to be classified; dividing the corpus text to be classified into categories and determining a first text content identification set; determining a first text content set in a preset training data set, and selecting from the first text content set the text contents corresponding to N candidate categories, according to the certain number of candidate categories, to determine a second text content set; and determining the target category of the text to be classified according to the similarity between the text feature words and each text content in the second text content set. The method reduces the computational complexity and the amount of computation and improves the efficiency of text classification.

Description

Multi-label corpus text classification method based on semi-supervised learning
Technical Field
The invention relates to the field of text classification of a corpus, in particular to a multi-label corpus text classification method based on semi-supervised learning.
Background
Text classification is an important part of text mining: it assigns a category to each document in a document set according to predefined subject categories. An automatic text classification system helps people find the information and knowledge they need. Classification is also one of the most basic ways in which people make sense of information.
With the rapid growth of textual information, and particularly the proliferation of online text on the Internet, automatic text classification has become a key technology for processing and organizing large amounts of document data, and it is now widely used in many fields. For example, on an Internet platform, a server may receive a query sentence that a user submits through a client, classify the text information corresponding to that query, determine the category of the text information, and then automatically answer the user's query and push related information according to that category.
In prior-art text classification methods, as the amount of information grows, users demand ever higher precision and recall in content search, and the training set therefore contains a huge number of samples. Computing the similarity against every sample in the training set by traversal consumes a large amount of server performance and is slow. As a result, the server's available resources are heavily occupied and the calculation takes too long, so answering the user or pushing related information to the user takes a long time.
Disclosure of Invention
(I) Objects of the invention
In order to solve the technical problems described in the background, the invention provides a multi-label corpus text classification method based on semi-supervised learning, which reduces the computational complexity and the amount of computation and improves the efficiency of text classification.
(II) Technical scheme
In order to solve the problems, the invention provides a multi-label corpus text classification method based on semi-supervised learning, which comprises the following steps:
S1, performing semi-supervised learning based on multi-label corpus text to obtain a classification strategy knowledge base;
S2, preprocessing the corpus text to be classified to obtain the text feature words in the corpus text;
S3, dividing the corpus text to be classified into categories according to the text feature words, to obtain a certain number of candidate categories for the text;
S4, determining a first text content identification set in a pre-stored inverted index table according to the classification strategy knowledge base, wherein the first text content identification set comprises a plurality of text content identifications corresponding to text contents similar to the text feature words;
S5, determining a first text content set in a preset training data set according to the first text content identification set, wherein the training data set comprises sample text content identifications, sample text contents, and the category corresponding to each sample text content;
S6, selecting, from the first text content set, the text contents corresponding to the N candidate categories according to the certain number of candidate categories, to determine a second text content set;
S7, determining the target category of the text to be classified according to the similarity between the text feature words and each text content in the second text content set.
Preferably, in S1, the semi-supervised learning includes the steps of:
S11, constructing a multi-label corpus text set and an unknown multi-label corpus text set;
S12, training a classifier on the samples in the multi-label corpus text set;
S13, constructing a subset U' of the unknown multi-label corpus text set, and using the classifier to judge the category of each unknown multi-label corpus text X' in the subset U';
S14, if X' is judged to be a multi-label corpus text, labeling X' and adding it to the multi-label corpus text set, and if X' is judged to be an unknown multi-label corpus text, deleting the document X' from the unknown multi-label corpus text set;
S15, iterating S11 to S14 until the unknown multi-label corpus text set is empty, and outputting the classification strategy knowledge base.
Preferably, the inverted index table is constructed from the preset training data set of a nearest-neighbor algorithm, and comprises feature attribute index entries and at least one text content identifier corresponding to each feature attribute.
Preferably, in S12, the training of the classifier includes the following steps:
S121, performing Chinese word segmentation and stop-word removal on the documents of the sensitive document set;
S122, performing feature representation on the processed sensitive document set by using an SVM algorithm;
S123, selecting features by using an information gain method and retaining the effective text features;
S124, training a classifier by using the libsvm tool;
S125, evaluating the classifier model and improving the trained classifier;
S126, finishing the training and outputting the classifier.
Preferably, the determining the target category of the text to be classified according to the similarity between the text feature word and each text content in the second text set specifically includes:
respectively calculating the similarity between the text characteristic words and each text content in the second text set;
determining at least one most similar text content according to the similarity;
scoring the category to which each text content belongs in the at least one most similar text content;
and selecting the category with the highest score as the target category of the text.
The technical scheme of the invention has the following beneficial technical effects. Semi-supervised learning improves the expandability and practicability of the multi-label corpus text. The classification strategy knowledge base that it produces is used to classify corpus text and to judge effectively whether a corpus text is a multi-label corpus text. The corpus text to be classified is preprocessed to extract its text feature words, and a common fast classification component then preliminarily classifies the text according to those feature words to obtain the candidate categories. Next, a set of text contents similar to the text feature words is screened out according to the feature words, and the text contents belonging to categories other than the candidate categories are removed from that set. Finally, the target category of the text to be classified is determined according to the similarity between the text feature words and each sample text content in the final set. This scheme reduces the large number of text entries that must be traversed when classifying a text, which reduces the computational complexity and the amount of computation and improves the efficiency of text classification.
Drawings
Fig. 1 is a schematic structural diagram of a multi-label corpus text classification method based on semi-supervised learning according to the present invention.
Fig. 2 is a schematic diagram of a semi-supervised learning process in the multi-label corpus text classification method based on semi-supervised learning according to the present invention.
Fig. 3 is a schematic flowchart of training classifiers in a multi-label corpus text classification method based on semi-supervised learning according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
As shown in fig. 1-3, the method for classifying texts in a multi-label corpus based on semi-supervised learning provided by the present invention includes the following steps:
S1, performing semi-supervised learning based on multi-label corpus text to obtain a classification strategy knowledge base;
S2, preprocessing the corpus text to be classified to obtain the text feature words in the corpus text;
S3, dividing the corpus text to be classified into categories according to the text feature words, to obtain a certain number of candidate categories for the text;
S4, determining a first text content identification set in a pre-stored inverted index table according to the classification strategy knowledge base, wherein the first text content identification set comprises a plurality of text content identifications corresponding to text contents similar to the text feature words;
S5, determining a first text content set in a preset training data set according to the first text content identification set, wherein the training data set comprises sample text content identifications, sample text contents, and the category corresponding to each sample text content;
S6, selecting, from the first text content set, the text contents corresponding to the N candidate categories according to the certain number of candidate categories, to determine a second text content set;
S7, determining the target category of the text to be classified according to the similarity between the text feature words and each text content in the second text content set.
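By way of non-limiting illustration, the narrowing performed in steps S5 and S6 can be sketched as follows. The sketch assumes that the identifiers returned by the inverted index and the candidate categories from S3 are already available; the names `Sample` and `select_second_set` and the example data are hypothetical and do not appear in the application.

```python
from typing import Dict, List, NamedTuple, Set


class Sample(NamedTuple):
    """One row of the preset training data set: identifier, content, category."""
    doc_id: int
    content: str
    category: str


def select_second_set(
    first_id_set: Set[int],
    training_set: Dict[int, Sample],
    candidate_categories: Set[str],
) -> List[Sample]:
    # S5: resolve the identifiers found in the inverted index into sample
    # text contents (identifiers missing from the training set are skipped).
    first_content_set = [
        training_set[i] for i in sorted(first_id_set) if i in training_set
    ]
    # S6: keep only the samples whose category is one of the N candidates.
    return [s for s in first_content_set if s.category in candidate_categories]


# Hypothetical data for illustration only.
training_set = {
    1: Sample(1, "refund for damaged order", "after_sales"),
    2: Sample(2, "track my parcel", "logistics"),
    3: Sample(3, "refund arrival time", "payment"),
}
second_set = select_second_set({1, 3, 99}, training_set, {"after_sales", "logistics"})
```

Only the samples that survive both filters are compared against the text feature words in S7, which is how the scheme avoids traversing the whole training set.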
In an alternative embodiment, in S1, the semi-supervised learning includes the steps of:
S11, constructing a multi-label corpus text set and an unknown multi-label corpus text set;
S12, training a classifier on the samples in the multi-label corpus text set;
S13, constructing a subset U' of the unknown multi-label corpus text set, and using the classifier to judge the category of each unknown multi-label corpus text X' in the subset U';
S14, if X' is judged to be a multi-label corpus text, labeling X' and adding it to the multi-label corpus text set, and if X' is judged to be an unknown multi-label corpus text, deleting the document X' from the unknown multi-label corpus text set;
S15, iterating S11 to S14 until the unknown multi-label corpus text set is empty, and outputting the classification strategy knowledge base.
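The iteration of S11 to S15 resembles a self-training loop, which can be sketched as below under stated assumptions. The word-overlap classifier is a toy stand-in for the classifier of S12 (the application itself trains with libsvm), and the function names, the `batch_size` parameter, and the example texts are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Labeled = Tuple[str, str]  # (text, label)


def train_overlap_classifier(labeled: List[Labeled]) -> Callable[[str], Optional[str]]:
    """Toy stand-in for S12: score each label by word overlap with its texts."""
    vocab: Dict[str, Counter] = {}
    for text, label in labeled:
        vocab.setdefault(label, Counter()).update(text.split())

    def predict(text: str) -> Optional[str]:
        words = text.split()
        scores = {lab: sum(cnt[w] for w in words) for lab, cnt in vocab.items()}
        best = max(scores, key=scores.get)
        # None means the text could not be assigned to any known label.
        return best if scores[best] > 0 else None

    return predict


def self_train(labeled: List[Labeled], unlabeled: List[str], batch_size: int = 2) -> List[Labeled]:
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:                                  # S15: stop when the set is empty
        predict = train_overlap_classifier(labeled)   # S12: (re)train the classifier
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]  # S13: subset U'
        for text in batch:                            # S14: label and add, or discard
            label = predict(text)
            if label is not None:
                labeled.append((text, label))
    return labeled


# Hypothetical seed set and unlabeled pool.
seed = [("refund money order", "payment"), ("parcel delivery late", "logistics")]
pool = ["refund money please", "delivery driver late", "hello world"]
result = self_train(seed, pool)
```

Because every text in U' is either added to the labeled set or discarded, the unknown set shrinks on each pass and the loop is guaranteed to terminate.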
In an optional embodiment, the inverted index table is constructed from the preset training data set of a nearest-neighbor algorithm, and comprises feature attribute index entries and at least one text content identifier corresponding to each feature attribute.
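As a minimal sketch of such a table, the example below maps each feature word to the set of text content identifiers that contain it, and looks up the identifiers matching a list of text feature words (the lookup of S4). It assumes that "similar" means sharing at least one feature word; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_inverted_index(docs: Dict[int, Iterable[str]]) -> Dict[str, Set[int]]:
    """Map each feature word to the identifiers of the texts that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return dict(index)


def lookup(index: Dict[str, Set[int]], feature_words: Iterable[str]) -> Set[int]:
    """Return the identifiers of all texts sharing at least one feature word."""
    ids: Set[int] = set()
    for word in feature_words:
        ids |= index.get(word, set())
    return ids


# Hypothetical training texts, already reduced to feature words.
docs = {
    1: ["refund", "order"],
    2: ["parcel", "track"],
    3: ["refund", "time"],
}
index = build_inverted_index(docs)
first_id_set = lookup(index, ["refund", "parcel"])
```

The lookup touches only the index entries for the query's feature words, so its cost depends on the number of matching texts rather than on the size of the training set.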
In an alternative embodiment, in S12, the training of the classifier includes the steps of:
S121, performing Chinese word segmentation and stop-word removal on the documents of the sensitive document set;
S122, performing feature representation on the processed sensitive document set by using an SVM algorithm;
S123, selecting features by using an information gain method and retaining the effective text features;
S124, training a classifier by using the libsvm tool;
S125, evaluating the classifier model and improving the trained classifier;
S126, finishing the training and outputting the classifier.
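The feature selection of S123 can be illustrated with the standard presence/absence form of information gain, IG(t) = H(C) - P(t)·H(C|t) - P(¬t)·H(C|¬t). The application does not state which formulation it uses, so this form, the helper names, and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Set, Tuple

Doc = Tuple[Set[str], str]  # (set of words after segmentation, category)


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs: List[Doc], term: str) -> float:
    """Reduction in category entropy from knowing whether `term` is present."""
    labels = [label for _, label in docs]
    with_term = [label for words, label in docs if term in words]
    without_term = [label for words, label in docs if term not in words]
    p_with = len(with_term) / len(docs)
    return (entropy(labels)
            - p_with * entropy(with_term)
            - (1 - p_with) * entropy(without_term))


def top_terms(docs: List[Doc], k: int) -> List[str]:
    """Keep the k terms with the highest information gain."""
    vocab = set().union(*(words for words, _ in docs))
    return sorted(vocab, key=lambda t: (-information_gain(docs, t), t))[:k]


# Hypothetical segmented documents.
docs = [
    ({"refund", "order"}, "payment"),
    ({"refund", "card"}, "payment"),
    ({"parcel", "order"}, "logistics"),
    ({"parcel", "late"}, "logistics"),
]
kept = top_terms(docs, 2)
```

In this toy data "refund" and "parcel" separate the two categories perfectly, while "order" appears equally in both and therefore carries no information gain.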
In an optional embodiment, the determining the target category of the text to be classified according to the similarity between the text feature word and each text content in the second text set specifically includes:
respectively calculating the similarity between the text characteristic words and each text content in the second text set;
determining at least one most similar text content according to the similarity;
scoring the category to which each text content belongs in the at least one most similar text content;
and selecting the category with the highest score as the target category of the text.
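The four sub-steps above can be sketched as a similarity-weighted vote over the most similar texts. The application does not fix a similarity measure or a scoring rule, so the bag-of-words cosine similarity, the summed-similarity score, the parameter `k`, and all names below are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[str], b: Sequence[str]) -> float:
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def classify(
    feature_words: Sequence[str],
    second_set: List[Tuple[Sequence[str], str]],
    k: int = 3,
) -> Optional[str]:
    # Compute the similarity to every text content in the second set.
    ranked = sorted(
        ((cosine(feature_words, words), category) for words, category in second_set),
        reverse=True,
    )
    # Score each category by the summed similarity of its top-k members.
    scores: Dict[str, float] = defaultdict(float)
    for similarity, category in ranked[:k]:
        scores[category] += similarity
    # The highest-scoring category is the target category.
    return max(scores, key=scores.get) if scores else None


# Hypothetical second text content set: (feature words, category).
second_set = [
    (["refund", "order"], "after_sales"),
    (["refund", "damaged"], "after_sales"),
    (["parcel", "track"], "logistics"),
]
target = classify(["refund", "damaged", "order"], second_set, k=2)
```

Since the second set has already been restricted to the candidate categories, this comparison runs over a small number of texts rather than the full training data set.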
In the invention, semi-supervised learning improves the expandability and practicability of the multi-label corpus text. The classification strategy knowledge base that it produces is used to classify corpus text and to judge effectively whether a corpus text is a multi-label corpus text. The corpus text to be classified is preprocessed to extract its text feature words, and a common fast classification component then preliminarily classifies the text according to those feature words to obtain the candidate categories. Next, a set of text contents similar to the text feature words is screened out according to the feature words, and the text contents belonging to categories other than the candidate categories are removed from that set. Finally, the target category of the text to be classified is determined according to the similarity between the text feature words and each sample text content in the final set. This scheme reduces the large number of text entries that must be traversed when classifying a text, which reduces the computational complexity and the amount of computation and improves the efficiency of text classification.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (5)

1. A multi-label corpus text classification method based on semi-supervised learning is characterized by comprising the following steps:
S1, performing semi-supervised learning based on multi-label corpus text to obtain a classification strategy knowledge base;
S2, preprocessing the corpus text to be classified to obtain the text feature words in the corpus text;
S3, dividing the corpus text to be classified into categories according to the text feature words, to obtain a certain number of candidate categories for the text;
S4, determining a first text content identification set in a pre-stored inverted index table according to the classification strategy knowledge base, wherein the first text content identification set comprises a plurality of text content identifications corresponding to text contents similar to the text feature words;
S5, determining a first text content set in a preset training data set according to the first text content identification set, wherein the training data set comprises sample text content identifications, sample text contents, and the category corresponding to each sample text content;
S6, selecting, from the first text content set, the text contents corresponding to the N candidate categories according to the certain number of candidate categories, to determine a second text content set;
S7, determining the target category of the text to be classified according to the similarity between the text feature words and each text content in the second text content set.
2. The method for text classification of a multi-label corpus based on semi-supervised learning as claimed in claim 1, wherein in S1, semi-supervised learning comprises the following steps:
S11, constructing a multi-label corpus text set and an unknown multi-label corpus text set;
S12, training a classifier on the samples in the multi-label corpus text set;
S13, constructing a subset U' of the unknown multi-label corpus text set, and using the classifier to judge the category of each unknown multi-label corpus text X' in the subset U';
S14, if X' is judged to be a multi-label corpus text, labeling X' and adding it to the multi-label corpus text set, and if X' is judged to be an unknown multi-label corpus text, deleting the document X' from the unknown multi-label corpus text set;
S15, iterating S11 to S14 until the unknown multi-label corpus text set is empty, and outputting the classification strategy knowledge base.
3. The method for classifying texts in a multi-label corpus based on semi-supervised learning according to claim 1, wherein the inverted index table is constructed from the preset training data set of a nearest-neighbor algorithm, and comprises feature attribute index entries and at least one text content identifier corresponding to each feature attribute.
4. The method for text classification of a multi-label corpus based on semi-supervised learning as claimed in claim 2, wherein the step of training the classifier in S12 includes the following steps:
S121, performing Chinese word segmentation and stop-word removal on the documents of the sensitive document set;
S122, performing feature representation on the processed sensitive document set by using an SVM algorithm;
S123, selecting features by using an information gain method and retaining the effective text features;
S124, training a classifier by using the libsvm tool;
S125, evaluating the classifier model and improving the trained classifier;
S126, finishing the training and outputting the classifier.
5. The method for classifying texts of a multi-label corpus based on semi-supervised learning according to claim 1, wherein the determining the target class of the text to be classified according to the similarity between the text feature words and each text content in the second text set specifically comprises:
respectively calculating the similarity between the text characteristic words and each text content in the second text set;
determining at least one most similar text content according to the similarity;
scoring the category to which each text content belongs in the at least one most similar text content;
and selecting the category with the highest score as the target category of the text.
CN201910571367.1A 2019-06-28 2019-06-28 Multi-label corpus text classification method based on semi-supervised learning Pending CN110750638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910571367.1A CN110750638A (en) 2019-06-28 2019-06-28 Multi-label corpus text classification method based on semi-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910571367.1A CN110750638A (en) 2019-06-28 2019-06-28 Multi-label corpus text classification method based on semi-supervised learning

Publications (1)

Publication Number Publication Date
CN110750638A true CN110750638A (en) 2020-02-04

Family

ID=69275784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910571367.1A Pending CN110750638A (en) 2019-06-28 2019-06-28 Multi-label corpus text classification method based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN110750638A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254599A (en) * 2021-06-28 2021-08-13 浙江大学 Multi-label microblog text classification method based on semi-supervised learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955856A (en) * 2012-11-09 2013-03-06 北京航空航天大学 Chinese short text classification method based on characteristic extension
CN105095223A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Method for classifying texts and server
CN106897459A (en) * 2016-12-14 2017-06-27 中国电子科技集团公司第三十研究所 A kind of text sensitive information recognition methods based on semi-supervised learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254599A (en) * 2021-06-28 2021-08-13 浙江大学 Multi-label microblog text classification method based on semi-supervised learning
CN113254599B (en) * 2021-06-28 2021-10-08 浙江大学 Multi-label microblog text classification method based on semi-supervised learning

Similar Documents

Publication Publication Date Title
CN109918673B (en) Semantic arbitration method and device, electronic equipment and computer-readable storage medium
CN106156365B (en) A kind of generation method and device of knowledge mapping
CN109460455B (en) Text detection method and device
KR101219366B1 (en) Classification of ambiguous geographic references
US20110112995A1 (en) Systems and methods for organizing collective social intelligence information using an organic object data model
CN106909669B (en) Method and device for detecting promotion information
US20090276378A1 (en) System and Method for Identifying Document Structure and Associated Metainformation and Facilitating Appropriate Processing
CN109271489B (en) Text detection method and device
CN108829661B (en) News subject name extraction method based on fuzzy matching
AU2013365452B2 (en) Document classification device and program
CN110910175B (en) Image generation method for travel ticket product
CN109033212A (en) A kind of file classification method based on similarity mode
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
KR20200127557A (en) A program recording midium for an automatic sentiment information labeling method to news articles for providing sentiment information
CN110196910B (en) Corpus classification method and apparatus
KR20200127553A (en) An automatic sentiment information labeling method to news articles for providing sentiment information
CN110750638A (en) Multi-label corpus text classification method based on semi-supervised learning
CN109241438B (en) Element-based cross-channel hot event discovery method and device and storage medium
JP2003281159A (en) Document processor, document processing method and document processing program
CN112183093A (en) Enterprise public opinion analysis method, device, equipment and readable storage medium
KR101692244B1 (en) Method for spam classfication, recording medium and device for performing the method
CN115168567B (en) Knowledge graph-based object recommendation method
CN107967299B (en) Agricultural public opinion-oriented automatic hot word extraction method and system
KR20200127555A (en) A program for an automatic sentiment information labeling to news articles for providing sentiment information
KR20200127636A (en) A program recording midium for an automatic sentiment information labeling to news articles for providing sentiment information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200204