CN110162629B - Text classification method based on multi-base model framework - Google Patents


Info

Publication number
CN110162629B
Authority
CN
China
Prior art keywords
training
name
text
fasttext
label
Prior art date
Legal status
Active
Application number
CN201910378450.7A
Other languages
Chinese (zh)
Other versions
CN110162629A (en)
Inventor
沈雅婷 (Shen Yating)
左志新 (Zuo Zhixin)
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN201910378450.7A priority Critical patent/CN110162629B/en
Publication of CN110162629A publication Critical patent/CN110162629A/en
Application granted granted Critical
Publication of CN110162629B publication Critical patent/CN110162629B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method based on a multi-base-model framework, comprising the following steps: 1) preprocess the input data; 2) perform fastText (a fast text classification algorithm) training on the training data processed in step 1, then evaluate and tune parameters to obtain two optimal fastText parameter sets, with the word N-gram parameter word_ngrams set to 1 and to 2 respectively; 3) shuffle the training data processed in step 1 to generate 15 training sample sets, perform fastText training on them 7 times with each of the 2 parameter sets generated in step 2 to produce 14 models, then perform one fastText training with a randomly chosen one of the 2 parameter sets to produce 1 model, finally obtaining a multi-base model composed of 15 fastText base models; 4) predict the test data processed in step 1 with the multi-base model obtained in step 3 and obtain the final text prediction label through a voting mechanism. The invention is characterized by fast training, high accuracy, and high prediction efficiency.

Description

Text classification method based on multi-base model framework
Technical Field
The invention relates to a text classification method based on a multi-base-model framework, and belongs to the technical field of supervised learning algorithms and text classification processing.
Background
The text classification problem is not fundamentally different from other classification problems: the data to be classified are matched against certain features, and since a perfect match is unlikely, the best match (according to some evaluation criterion) must be selected to complete the classification. The selection and training of the classifier, together with the evaluation of and feedback on the classification results, are therefore critical.
As one of the most classical tasks in the field of NLP (natural language processing), text classification has accumulated a large number of technical implementations. Taking the use of deep learning as the dividing line, these can be roughly divided into two categories: text classification based on traditional machine learning and text classification based on deep learning. However, existing methods of both kinds suffer from problems such as low accuracy and long training time.
fastText is a fast text classification algorithm that, on a standard multi-core CPU (central processing unit), can train on more than one billion words in less than ten minutes and classify half a million sentences among 312,000 categories in less than a minute. Compared with neural-network-based text classification algorithms, fastText has two advantages: first, it greatly accelerates training and testing while maintaining high precision; second, it needs no pre-trained word vectors, since fastText trains its own.
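For concreteness, a minimal sketch of supervised fastText training and prediction with the open-source fasttext Python package follows; the file name, hyperparameter values, and example sentence are illustrative assumptions rather than values taken from the patent.

```python
import fasttext

# train.txt (illustrative name) holds one sample per line:
#   __label__<class> <space-separated tokens>
model = fasttext.train_supervised(
    input="train.txt", lr=0.5, epoch=25, wordNgrams=2, bucket=2000000)

# predict() returns the top-k labels and their probabilities
labels, probs = model.predict("mobile phone protective case", k=1)
print(labels[0], probs[0])
```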
Bagging is a method for improving the accuracy of learning algorithms: a series of prediction functions is constructed and then combined, in a certain way, into a single prediction function. Bagging requires a classification method that is "unstable" (unstable meaning that small variations in the dataset cause significant variations in the classification results), such as decision trees or neural network algorithms. The basic idea is that, given a weak learning algorithm and a training set, a single run of the weak learner is not very accurate; running the learner multiple times to obtain a sequence of prediction functions and then voting improves the accuracy of the final result. Existing text classification methods based on traditional machine learning have low accuracy, while the neural-network training of conventional deep learning methods takes a long time. A fastText-based text classification algorithm is a step forward, but the effect attainable by a single model is limited: a single model's prediction on ambiguous data is somewhat arbitrary and may be inaccurate, whereas voting across multiple models can improve accuracy. The present invention solves these problems well.
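The bagging-and-vote idea the invention builds on can be sketched generically as follows; train_fn and predict_fn are placeholder callables for an arbitrary base learner, and the sketch follows the patent in shuffling a full copy of the training data for each base model rather than drawing bootstrap samples.

```python
import random
from collections import Counter

def bagging_predict(train_fn, predict_fn, data, x, k=15):
    """Train k base learners on independently shuffled copies of the
    training data and combine their predictions on x by majority vote."""
    models = []
    for _ in range(k):
        sample = data[:]
        random.shuffle(sample)          # shuffled copy, as in the patent
        models.append(train_fn(sample))
    votes = [predict_fn(m, x) for m in models]
    return Counter(votes).most_common(1)[0][0]
```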
Disclosure of Invention
The invention aims to provide, in view of the deficiencies of the prior art, a text classification method based on a multi-base-model framework. Based on the Bagging ensemble-learning idea, the method accurately realizes text classification with a multi-base-model framework, effectively addressing the low accuracy of text classification methods based on traditional machine learning and the long neural-network training times of conventional deep learning methods.
The technical solution adopted by the invention to solve the above technical problems is as follows. The invention provides a text classification method based on a multi-base-model framework, comprising the following steps:
Step 1: Preprocess the text data.
Step 2: Perform fastText training on the training data processed in step 1, then evaluate and tune parameters to obtain two optimal fastText parameter sets, with the parameter word_ngrams set to 1 and to 2 respectively.
Step 3: Shuffle the training data processed in step 1 to generate 15 training sample sets; perform fastText training on them 7 times with each of the 2 parameter sets generated in step 2, producing 14 models; then perform one fastText training with a randomly chosen one of the 2 parameter sets, producing 1 model; finally obtain the final multi-base model composed of 15 fastText base models.
Step 4: Predict the test data processed in step 1 with the multi-base model obtained in step 3, and obtain the final text prediction label through a voting mechanism.
Further, the specific steps of preprocessing the text data in step 1 of the present invention are as follows:
(1.1) Define T1 as the single-text information set of the training data and T2 as the single-text information set of the test data; define name and label as the name and label of a single text, satisfying T1 = {name, label} and T2 = {name};
(1.2) Define D1 as the training-data text-name dataset, D2 as the training-data text-label dataset, and D3 as the test-data text-name dataset: D1 = {name_1, name_2, …, name_n}, D2 = {label_1, label_2, …, label_n}, D3 = {name_1, name_2, …, name_m}, where name_a is the a-th text name in D1, label_a is the a-th text label in D2, and name_b is the b-th text name in D3, with a ∈ [1, n] and b ∈ [1, m];
(1.3) Apply word segmentation and stop-word removal to D1 and D3, and convert D2 to the fastText label format, obtaining new D1, D2, D3;
(1.4) Splice D1 and D2 together into the fastText text format, obtaining the training dataset Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, which is processed into a txt file; the test dataset is Test = D3.
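A minimal sketch of steps (1.1)-(1.4) follows, assuming the jieba library for Chinese word segmentation and a hypothetical stop-word file; all file names are illustrative.

```python
import jieba  # a common Chinese word-segmentation library (an assumed choice)

def load_stopwords(path="stopwords.txt"):  # hypothetical stop-word list
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f}

def segment(text, stopwords):
    """Word segmentation and stop-word removal for one text name (step 1.3)."""
    return " ".join(w for w in jieba.cut(text) if w.strip() and w not in stopwords)

def build_train_file(names, labels, stopwords, out_path="train.txt"):
    """Splice D1 and D2 into the fastText format (step 1.4):
    one '__label__<label> <segmented name>' line per sample."""
    with open(out_path, "w", encoding="utf-8") as f:
        for name, label in zip(names, labels):
            f.write(f"__label__{label} {segment(name, stopwords)}\n")
```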
Further, in step 2 of the invention, after the data preprocessing of step 1, the training data Train are divided into a new training set T and a new validation set V; fastText training is performed on T, and V is predicted to obtain an accuracy. With the fastText parameter word_ngrams set to 1 or to 2, the other parameters are adjusted over multiple cycles to obtain, and record, the two parameter sets with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively.
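The tuning loop might be sketched as follows, under these assumptions: the fasttext package's wordNgrams argument plays the role of the parameter word_ngrams, model.test() reports precision@1 on a validation file, and the small lr/epoch grids stand in for whatever parameter cycles the patent actually uses.

```python
import fasttext

def tune(train_path, valid_path, word_ngrams):
    """Fix wordNgrams to 1 or 2 and cycle over the other parameters,
    keeping the combination with the highest validation accuracy."""
    best_acc, best_params = 0.0, None
    for lr in (0.1, 0.5, 1.0):
        for epoch in (5, 25, 50):
            model = fasttext.train_supervised(
                input=train_path, lr=lr, epoch=epoch,
                wordNgrams=word_ngrams, bucket=2000000)
            _, precision, _ = model.test(valid_path)  # (N, P@1, R@1)
            if precision > best_acc:
                best_acc = precision
                best_params = dict(lr=lr, epoch=epoch,
                                   wordNgrams=word_ngrams, bucket=2000000)
    return best_params

params_1 = tune("train_split.txt", "valid_split.txt", word_ngrams=1)
params_2 = tune("train_split.txt", "valid_split.txt", word_ngrams=2)
```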
Further, in step 3 of the invention, the training data Train processed in step 1 are shuffled to generate 15 training sample sets; fastText training is performed on them 7 times with each of the 2 parameter sets generated in step 2, producing 14 models; one further fastText training is then performed with a randomly chosen one of the 2 parameter sets, producing 1 model; finally, the final multi-base model composed of 15 fastText base models is obtained (as shown in fig. 2).
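A sketch of the ensemble construction under the same assumptions; params_1 and params_2 stand for the two tuned parameter sets from step 2 (their values here are illustrative), and the sample file names are invented for the example.

```python
import random
import fasttext

# the two tuned parameter sets from step 2 (values illustrative)
params_1 = dict(lr=0.5, epoch=25, wordNgrams=1, bucket=2000000)
params_2 = dict(lr=0.5, epoch=25, wordNgrams=2, bucket=2000000)

with open("train.txt", encoding="utf-8") as f:
    train_lines = f.readlines()

models = []
for i in range(15):
    sample = train_lines[:]
    random.shuffle(sample)                    # one shuffled training sample set
    path = f"sample_{i}.txt"                  # illustrative file name
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(sample)
    if i < 7:
        params = params_1                     # models 1-7: word_ngrams = 1
    elif i < 14:
        params = params_2                     # models 8-14: word_ngrams = 2
    else:
        params = random.choice((params_1, params_2))  # model 15: either, at random
    models.append(fasttext.train_supervised(input=path, **params))
```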
Further, in step 4 of the invention, the multi-base model obtained in step 3 is used to predict the test data Test generated in step 1, and a voting mechanism is used to obtain the final text prediction label.
Beneficial effects:
1. Compared with a conventional TextCNN (text convolutional neural network) deep learning algorithm on the same dataset, the method B_f (Bagging_fastText) of the invention features fast training, high accuracy, and high prediction efficiency.
2. Compared with TextCNN and with a single independent fastText model, on commodity classification data from an online retail platform, the method of the invention clearly improves text classification accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a conceptual diagram of the final model.
FIG. 3 is a flow chart for generating a multi-base model.
Detailed Description
The present invention is further illustrated by the following examples, which are intended only to illustrate the invention and not to limit its scope; after reading this disclosure, those skilled in the art may modify the invention in various equivalent forms, all of which fall within the scope defined by the appended claims.
As shown in fig. 1, the present invention provides a text classification method based on a multi-base model framework, which specifically includes the following steps:
the first step is as follows: defining T as a single text information set of data, defining a name and a label as the name and the label of a single text respectively, and satisfying the relation T ═ name, label }; definition D1For data text name data sets, D2Tagging datasets for data text, D1={name1,name2,…,namen},D2={label1,label2,…,labeln},nameaIs D1The a-th text name data, labelaIs D2The a-th text label data, where the variable a ∈ [1, n ∈ [ ]](ii) a To D1Performing Chinese word segmentation to stop word processing, and performing word segmentation to D2Performing fastText format processing, and adding __ label __ before the label; will D1And D2Splicing, combining into a text format of fastText to obtain a training data set:
Train={{name1,label1},{name2,label2},…,{namen,labelnand processing the text into txt text and storing the text.
The second step: divide the training data Train into a new training set T and a new validation set V; perform fastText training on T and predict V to obtain an accuracy. With the fastText parameter word_ngrams set to 1 or to 2 and the hash bucket size bucket set to 2,000,000, adjust the other parameters over multiple passes to obtain the two parameter sets with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively, and record them.
The third step: apply cross validation. Specifically, the training data Train are randomly and evenly divided into 10 groups; one group is drawn as the test set and the remainder serve as the training set. The learning process is repeated 10 times, each run yielding a corresponding accuracy or error rate, and the average over the 10 runs is taken as the estimate of the algorithm's accuracy. The specific procedure is: shuffle the training set to generate 15 differently ordered sub-training sets; perform fastText training on them 7 times with each of the 2 parameter sets generated in step 2, producing 14 models; then perform one fastText training with a randomly chosen one of the 2 parameter sets, producing 1 model; finally obtain the multi-base model composed of 15 fastText base models (as shown in fig. 2). The multi-base model is then used to predict the test set, a voting mechanism yields the final text prediction label (the flow is shown in fig. 3), and this label is compared with the test-set labels to obtain the accuracy.
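The 10-fold protocol described above might be implemented as follows; the fold construction is an assumption consistent with "randomly and evenly divided into 10 groups", and the toy data are placeholders.

```python
import random

def ten_fold_splits(samples, k=10, seed=0):
    """Randomly and evenly divide the samples into k groups and yield
    each (train, test) pair in turn; the accuracy over the k runs is
    averaged to estimate the algorithm's accuracy."""
    pool = samples[:]
    random.Random(seed).shuffle(pool)
    folds = [pool[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for f in folds[:i] + folds[i + 1:] for s in f]
        yield train, test

# toy usage with placeholder fastText-format lines
data = [f"__label__{i % 3} sample text {i}" for i in range(100)]
for train, test in ten_fold_splits(data):
    pass  # train the multi-base model on `train`, evaluate it on `test`
```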
By processing 500,000 items of online retail platform commodity classification data, the method B_f (Bagging_fastText, a fast text classification algorithm based on bootstrap aggregation) achieves a classification accuracy of 86.65%, a further improvement over a single independent fastText model.
For the 500,000-item online retail platform commodity classification dataset, 100%, 50%, and 1% of the texts were randomly selected, and cross-validated comparison experiments were run between the invention (B_f), a TextCNN (text convolutional neural network) model, and a single fastText model. Specifically, each text dataset was randomly and evenly divided into 10 groups; one group was then randomly drawn as the test set and the rest used as the training set. In the training set, only 10 labeled samples were selected, the remainder being unlabeled. This learning process was repeated 10 times, and the average results are recorded in Table I, where the best performance in each row is shown in bold.
TABLE I (reproduced as an image in the original): classification accuracy of B_f, TextCNN, and a single fastText model at the 100%, 50%, and 1% data scales.
As can be seen from the data scales and accuracies in Table I, B_f attains a high accuracy level, and its advantage grows with the amount of data. Compared with the TextCNN model, B_f is clearly ahead at the 100% and 50% data scales, and compared with a single independent fastText model its accuracy improves at all 3 data scales. Meanwhile, at the 100% data scale the training time of B_f is clearly better than that of the TextCNN model: 366 seconds versus 3069 seconds.
The method B_f provided by the invention can thus classify texts effectively, with higher accuracy and higher efficiency.
The voting mechanism feeds the test data into the multi-base model composed of the 15 base models obtained in the third step above, producing 15 result labels, and takes the label that receives the largest number of identical votes as the final label.
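A sketch of this voting step, assuming models is the list of 15 trained fastText base models from the step-3 sketch: each model casts its top-1 predicted label as a vote, and the most frequent label wins.

```python
from collections import Counter

def vote(models, text):
    """Feed one test text to all 15 base models and return the label
    that receives the largest number of identical votes."""
    votes = [m.predict(text, k=1)[0][0] for m in models]  # top-1 label per model
    return Counter(votes).most_common(1)[0][0]

# final_labels = [vote(models, name) for name in test_names]  # test_names assumed
```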
As shown in fig. 2, models 1 to 7 are obtained by training on sample sets 1 to 7 from the third step above, using the optimal parameter set with n-gram feature value 1 obtained in the second step; models 8 to 14 are obtained by training on sample sets 8 to 14 using the optimal parameter set with n-gram feature value 2; and model 15 is obtained by training on sample set 15 using, chosen at random, the optimal parameter set with n-gram feature value 1 or 2. Together these form a multi-base model composed of 15 base models.
As shown in FIG. 3, the invention randomly shuffles the training data 15 times, obtaining 15 training sample sets, and trains on each of them: sample sets 1 to 7 with the optimal parameter set with n-gram feature value 1 from the second step, sample sets 8 to 14 with the optimal parameter set with n-gram feature value 2, and sample set 15 with a randomly chosen one of the two. This yields a multi-base model composed of 15 base models; at prediction time the test data's outputs are fused through a voting mechanism to obtain the final text prediction label.
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the invention, and these should also be regarded as falling within its scope of protection.

Claims (1)

1. A text classification method based on a multi-base-model framework, characterized by comprising the following steps:
Step 1: preprocessing the text data;
preprocessing the text data includes:
(1.1) defining T1 as the single-text information set of the training data and T2 as the single-text information set of the test data, and defining name and label as the name and label of a single text, satisfying T1 = {name, label} and T2 = {name};
(1.2) defining D1 as the training-data text-name dataset, D2 as the training-data text-label dataset, and D3 as the test-data text-name dataset: D1 = {name_1, name_2, …, name_n}, D2 = {label_1, label_2, …, label_n}, D3 = {name_1, name_2, …, name_m}, where name_a is the a-th text name in D1, label_a is the a-th text label in D2, and name_b is the b-th text name in D3, with a ∈ [1, n] and b ∈ [1, m];
(1.3) applying word segmentation and stop-word removal to D1 and D3, and converting D2 to the fastText label format, obtaining new D1, D2, D3;
(1.4) splicing D1 and D2 together into the fastText text format, obtaining the training dataset Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, which is processed into a txt file; the test dataset is Test = D3;
Step 2: performing fastText training on the training data processed in the step 1, evaluating and optimizing parameters to obtain two groups of optimal fastText parameters with the parameters word _ grams as 1 and 2, dividing the training data Train after preprocessing the data in the step 1 to obtain a new training set T and a new verification set V, performing fastText training on the training set T, predicting the verification set V to obtain the accuracy, setting the fastText parameter word _ grams as 1 or 2, adjusting other parameters for multiple cycles to obtain two groups of parameters with the highest accuracy when the fastText parameter word _ grams is 1 and 2 respectively, and recording;
Step 3: shuffling the training data Train processed in step 1 to generate 15 training sample sets, performing fastText training on them 7 times with each of the 2 parameter sets generated in step 2 to generate 14 models, then performing one fastText training with a randomly chosen one of the 2 parameter sets to generate 1 model, and finally obtaining the final multi-base model composed of 15 fastText base models;
Step 4: predicting the test data Test processed in step 1 with the multi-base model obtained in step 3, and obtaining the final text prediction label through a voting mechanism.
CN201910378450.7A 2019-05-08 2019-05-08 Text classification method based on multi-base model framework Active CN110162629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910378450.7A CN110162629B (en) 2019-05-08 2019-05-08 Text classification method based on multi-base model framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910378450.7A CN110162629B (en) 2019-05-08 2019-05-08 Text classification method based on multi-base model framework

Publications (2)

Publication Number Publication Date
CN110162629A CN110162629A (en) 2019-08-23
CN110162629B true CN110162629B (en) 2021-05-14

Family

ID=67633552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910378450.7A Active CN110162629B (en) 2019-05-08 2019-05-08 Text classification method based on multi-base model framework

Country Status (1)

Country Link
CN (1) CN110162629B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428103A (en) * 2020-03-19 2020-07-17 竹间智能科技(上海)有限公司 Method for constructing bill auditing model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154151A (en) * 2017-12-20 2018-06-12 南京邮电大学 A kind of quick multi-oriented text lines detection method
US10432789B2 (en) * 2017-02-09 2019-10-01 Verint Systems Ltd. Classification of transcripts by sentiment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10432789B2 (en) * 2017-02-09 2019-10-01 Verint Systems Ltd. Classification of transcripts by sentiment
CN108154151A (en) * 2017-12-20 2018-06-12 南京邮电大学 A kind of quick multi-oriented text lines detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于快速文本分类器与不平衡数据的研究";杜锦波;《中国优秀硕士学位论文全文数据库 社会科学Ⅱ辑》;20190115(第1期);正文第15-27页,第3.4、4.1-4.5节 *

Also Published As

Publication number Publication date
CN110162629A (en) 2019-08-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant