CN110162629B - Text classification method based on multi-base model framework - Google Patents
Text classification method based on multi-base model framework
- Publication number
- CN110162629B (application CN201910378450.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- name
- text
- fasttext
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text classification method based on a multi-base model framework, comprising the following steps: 1) preprocess the input data; 2) perform fastText (fast text classification) training on the training data from step 1, evaluate and tune parameters, obtaining the two optimal fastText parameter groups with the parameter word_ngrams (word n-gram order) set to 1 and to 2; 3) shuffle the training data from step 1 to generate 15 training samples, perform fastText training 7 times with each of the 2 parameter groups from step 2 to produce 14 models, then perform one further fastText training with a randomly chosen one of the 2 parameter groups to produce 1 model, finally obtaining a multi-base model of 15 fastText base models; 4) predict the test data from step 1 with the multi-base model from step 3, and obtain the final text prediction label by a voting mechanism. The invention is characterized by fast training, high accuracy and high prediction efficiency.
Description
Technical Field
The invention relates to a text classification method based on a multi-base model framework, and belongs to the technical field of supervised learning algorithms and text classification processing.
Background
The text classification problem is not substantially different from other classification problems: classification amounts to matching the data to be classified against certain features, and since a complete match is unlikely, the best match (under some evaluation criterion) must be selected. The selection and training of the classifier, and the evaluation of and feedback on the classification results, are therefore crucial.
As one of the most classical tasks in the field of NLP (natural language processing), text classification has accumulated a large number of implementation techniques which, taking the use of deep learning as the dividing line, can be roughly split into two categories: text classification based on traditional machine learning and text classification based on deep learning. However, existing methods of both kinds suffer from problems such as low accuracy and long training time.
fastText is a fast text classification algorithm that can train on more than a billion words in under ten minutes using a standard multi-core CPU (central processing unit), and classify half a million sentences among more than 310,000 categories in less than a minute. Compared with neural-network-based text classification algorithms, fastText has two advantages: first, it greatly accelerates training and testing while maintaining high accuracy; second, it does not require pre-trained word vectors, since fastText trains the word vectors itself.
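As a rough sketch of how such a fastText classifier is trained (assuming the official `fasttext` Python package; the hyperparameter values and file name below are illustrative, not the patent's tuned ones):

```python
def to_fasttext_line(name, label):
    """Format one (segmented text, label) pair into fastText's expected
    training-line format: '__label__<label> <text>'."""
    return f"__label__{label} {name}"

def train_fasttext(train_path, word_ngrams=1):
    """Train a supervised fastText classifier on a file of such lines."""
    # Imported lazily so the formatting helper above runs without the package.
    import fasttext
    return fasttext.train_supervised(
        input=train_path,
        wordNgrams=word_ngrams,  # the n-gram parameter this method tunes to 1 or 2
        epoch=25,                # illustrative value
        lr=0.5,                  # illustrative value
        bucket=2000000,          # hash-bucket size mentioned later in the description
    )

print(to_fasttext_line("red cotton shirt", "clothing"))
```

A trained model then answers `model.predict("some segmented text")`, returning a tuple of labels and probabilities.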
Bagging is a method for improving the accuracy of a learning algorithm: it constructs a series of prediction functions and combines them, in a certain way, into a single prediction function. Bagging requires an "unstable" classification method (unstable meaning that small variations in the data set can cause significant variations in the classification result), such as decision trees or neural network algorithms. The basic idea: given a weak learning algorithm and a training set, a single run of the weak learner may not be very accurate, so the learner is applied multiple times to obtain a sequence of prediction functions whose outputs are combined by voting, improving the accuracy of the final result. Existing text classification methods based on traditional machine learning have low accuracy, and the neural networks of conventional deep learning methods take long to train. The fastText-based text classification algorithm, while comparatively advanced, is limited in what a single model can achieve: for ambiguous data a single model's output is somewhat accidental and may be inaccurate, whereas voting over multiple models can improve accuracy. The present invention addresses these problems.
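The Bagging idea described above can be illustrated with a minimal, library-free sketch (the "weak learners" here are toy functions standing in for real base classifiers):

```python
from collections import Counter

def bagging_predict(models, x):
    """Combine base-model predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Three slightly different "weak learners" classifying the sign of a number;
# each is biased in its own way, but the vote smooths out individual errors.
weak_learners = [
    lambda x: "pos" if x > 0 else "neg",
    lambda x: "pos" if x > -1 else "neg",
    lambda x: "pos" if x > 1 else "neg",
]

print(bagging_predict(weak_learners, 0.5))   # two of three vote "pos"
```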
Disclosure of Invention
The invention aims to provide, against the defects of the prior art, a text classification method based on a multi-base model framework. The method builds on the Bagging ensemble-learning idea, accurately realizes text classification with a multi-base model framework, and effectively addresses the low accuracy of text classification methods based on traditional machine learning and the long neural-network training time of conventional deep learning methods.
The technical scheme adopted by the invention to solve the technical problem is as follows. The invention provides a text classification method based on a multi-base model framework, comprising the following steps:
Step 1: preprocess the text data.
Step 2: perform fastText training on the training data processed in step 1, evaluate and tune parameters, obtaining the two optimal fastText parameter groups with the parameter word_ngrams set to 1 and to 2.
Step 3: shuffle the training data processed in step 1 to generate 15 training samples; perform fastText training 7 times with each of the 2 parameter groups generated in step 2, producing 14 models; then perform one further fastText training with a randomly chosen one of the 2 parameter groups, producing 1 model, finally obtaining a multi-base model of 15 fastText base models.
Step 4: predict the test data processed in step 1 with the multi-base model obtained in step 3, and obtain the final text prediction label by a voting mechanism.
Further, the specific steps of preprocessing the text data in step 1 of the present invention are as follows:
(1.1) Define T1 as the single-text information set of the training data and T2 as the single-text information set of the test data, where name and label denote the name and the label of a single text, satisfying T1 = {name, label} and T2 = {name};
(1.2) Define D1 as the training-data text-name dataset, D2 as the training-data text-label dataset, and D3 as the test-data text-name dataset, with D1 = {name_1, name_2, …, name_n}, D2 = {label_1, label_2, …, label_n}, D3 = {name_1, name_2, …, name_m}, where name_a is the a-th text name in D1, label_a is the a-th text label in D2, and name_b is the b-th text name in D3, with a ∈ [1, n] and b ∈ [1, m];
(1.3) Apply word segmentation and stop-word removal to D1 and D3, and convert D2 to fastText label format, obtaining new D1, D2, D3;
(1.4) Concatenate D1 and D2 into fastText text format to obtain the training dataset Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, saved as a txt file; the test dataset is Test = D3.
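A sketch of steps (1.3)–(1.4), with a whitespace tokenizer standing in for a real Chinese segmenter (a production version would pass in something like `jieba.cut`; the example texts and the stop-word set are illustrative):

```python
def preprocess(names, labels, stopwords, tokenize=str.split):
    """Segment each text name, drop stop words, and join it with its label
    into a fastText training line ready to be written to the Train txt file."""
    lines = []
    for name, label in zip(names, labels):
        tokens = [t for t in tokenize(name) if t not in stopwords]
        lines.append(f"__label__{label} {' '.join(tokens)}")
    return lines

train_lines = preprocess(
    ["red cotton shirt", "usb charging cable"],
    ["clothing", "electronics"],
    stopwords={"usb"},  # illustrative stop-word set
)
print(train_lines)
```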
Further, in step 2 of the present invention, after the data preprocessing of step 1, the training data Train is split into a new training set T and a new validation set V; fastText training is performed on T, and the validation set V is predicted to obtain an accuracy. The fastText parameter word_ngrams is set to 1 or 2, and the other parameters are adjusted over multiple rounds to obtain the two parameter groups with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively, which are recorded.
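The tuning loop of step 2 might be sketched as follows; `train_and_score` is a caller-supplied callback (train on T, predict V, return accuracy), so the sketch itself does not depend on the fasttext package, and the hyperparameter grids are illustrative:

```python
import itertools

def tune(train_and_score, lr_grid=(0.1, 0.5, 1.0), epoch_grid=(5, 25)):
    """For word_ngrams fixed to 1 and then to 2, cycle over the other
    hyperparameters and keep the group with the highest validation accuracy."""
    best = {}
    for wn in (1, 2):
        scored = [
            (train_and_score({"wordNgrams": wn, "lr": lr, "epoch": ep}),
             {"wordNgrams": wn, "lr": lr, "epoch": ep})
            for lr, ep in itertools.product(lr_grid, epoch_grid)
        ]
        best[wn] = max(scored, key=lambda pair: pair[0])[1]
    return best  # the two recorded parameter groups, keyed by word_ngrams
```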
Further, in step 3 of the present invention, the training data Train processed in step 1 is shuffled to generate 15 training samples. fastText training is performed 7 times with each of the 2 parameter groups generated in step 2, producing 14 models; one further fastText training is then performed with a randomly chosen one of the 2 parameter groups, producing 1 model. This finally yields a multi-base model of 15 fastText base models (as shown in FIG. 2).
Further, in step 4 of the present invention, the multi-base model obtained in step 3 is used to predict the test data Test generated in step 1, and a voting mechanism yields the final text prediction label.
Advantageous effects:
1. Compared with a conventional TextCNN (text convolutional neural network) deep learning algorithm on the same dataset, the method B_f (Bagging_fastText) of the invention features fast training, high accuracy and high prediction efficiency.
2. Compared with TextCNN and with a single independent fastText model, the method of the invention clearly improves text classification accuracy on the experimental data, which is commodity-classification data from an online retail platform.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a conceptual diagram of the final model.
FIG. 3 is a flow chart for generating a multi-base model.
Detailed Description
The present invention is further illustrated by the following examples, which are intended purely to illustrate the invention and not to limit its scope. After reading the present disclosure, those skilled in the art may make various equivalent modifications, all of which fall within the scope defined by the appended claims.
As shown in fig. 1, the present invention provides a text classification method based on a multi-base model framework, which specifically includes the following steps:
The first step: define T as the single-text information set of the data, and define name and label as the name and the label of a single text, satisfying T = {name, label}. Define D1 as the data text-name dataset and D2 as the data text-label dataset, with D1 = {name_1, name_2, …, name_n} and D2 = {label_1, label_2, …, label_n}, where name_a is the a-th text name in D1 and label_a is the a-th text label in D2, with a ∈ [1, n]. Apply Chinese word segmentation and stop-word removal to D1, convert D2 to fastText format by prepending __label__ to each label, and concatenate D1 and D2 into fastText text format to obtain the training dataset:
Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, which is saved as a txt file.
The second step: split the training data Train into a new training set T and a new validation set V; train fastText on T and predict V to obtain an accuracy. The fastText parameter word_ngrams is set to 1 or 2, the other parameters are adjusted over multiple rounds, and the two parameter groups with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively are obtained, with the hash bucket size set to 2,000,000, and recorded.
The third step: cross validation was applied. Specifically, the training data Train is randomly and evenly divided into 10 groups, and then one group is extracted as a test set and the rest is extracted as a training set. The learning process is repeated 10 times, and each test results in a corresponding correct or error rate. The average of the accuracy or error rate of the 10 results is used as an estimate of the accuracy of the algorithm. The specific process comprises the following steps: the training set is disturbed to generate 15 sub-training sets with different sequences, 2 groups of parameters generated in the step 2 of the invention are used for performing the fastText training for 7 times respectively to generate 14 models, then 2 groups of parameters generated in the step 2 of the invention are used for performing the fastText training randomly to generate 1 model, and finally 15 final multi-base models (as shown in figure 2) taking the fastText model as a base model are obtained. Then, the multi-base model is used for predicting the test set, a voting mechanism is adopted to obtain a final text prediction label, the flow is shown in fig. 3, and the final text prediction label is compared with the test set label to obtain the accuracy.
On 500,000 items of online retail platform commodity-classification data, the method B_f (Bagging_fastText, a fast text classification algorithm based on bootstrap aggregating) achieves a classification accuracy of 86.65%, a further improvement over a single independent fastText model.
From the 500,000-item online retail commodity-classification dataset, 100%, 50% and 1% of the texts were randomly selected, and the invention (B_f) was compared under cross-validation against a TextCNN (text convolutional neural network) model and a single fastText model. Specifically, each text dataset was randomly and evenly divided into 10 groups; one group was then randomly drawn as the test set and the rest served as the training set. In the training set, only 10 labeled samples were selected, the remainder being unlabeled. This learning process was repeated 10 times, and the average results are recorded in Table I, where the best performance in each row is shown in bold.
TABLE I
As the data scales and accuracy figures in Table I show, B_f attains a higher accuracy level, and its advantage grows with the amount of data. Compared with the TextCNN model, B_f is clearly superior at the 100% and 50% data scales; compared with the single independent fastText model, its accuracy is clearly improved at all 3 data scales. Meanwhile, at the 100% data scale the training time of B_f is clearly better than that of the TextCNN model: 366 seconds versus 3069 seconds.
In summary, the method B_f provided by the invention classifies text effectively, with higher accuracy and higher efficiency.
The voting mechanism inputs the test data into the multi-base model composed of the 15 base models obtained in the third step of the example above, producing 15 result labels, and takes the label occurring most often as the final label.
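A sketch of that voting step, assuming each base model exposes a fastText-style `predict()` that returns labels prefixed with `__label__`:

```python
from collections import Counter

def predict_label(models, text):
    """Collect one label per base model and return the most frequent one."""
    votes = Counter(
        m.predict(text)[0][0].replace("__label__", "", 1)  # strip fastText's prefix
        for m in models
    )
    return votes.most_common(1)[0][0]
```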
As shown in FIG. 2, models 1 to 7 are obtained by training on sample sets 1 to 7 from the third step of the example above, using the optimal parameter group for n-gram feature value 1 obtained in the second step; models 8 to 14 are obtained by training on sample sets 8 to 14 with the optimal parameter group for n-gram feature value 2; and model 15 is obtained by training on sample set 15 with a randomly chosen one of the two optimal parameter groups (n-gram feature value 1 or 2). Together these form a multi-base model consisting of 15 base models.
As shown in FIG. 3, the invention randomly shuffles the training data 15 times, producing 15 training sample sets, each of which is trained separately: sample sets 1 to 7 with the optimal parameter group for n-gram feature value 1 from the second step of the example, sample sets 8 to 14 with the optimal parameter group for n-gram feature value 2, and sample set 15 with a randomly chosen one of the two groups. This yields a multi-base model of 15 base models; at prediction time, the predictions on the test data are fused by a voting mechanism to obtain the final text prediction label.
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also fall within the scope of protection of the invention.
Claims (1)
1. A text classification method based on a multi-base model framework, characterized by comprising the following steps:
step 1: preprocessing the text data;
preprocessing the text data includes:
(1.1) Define T1 as the single-text information set of the training data and T2 as the single-text information set of the test data, where name and label denote the name and the label of a single text, satisfying T1 = {name, label} and T2 = {name};
(1.2) Define D1 as the training-data text-name dataset, D2 as the training-data text-label dataset, and D3 as the test-data text-name dataset, with D1 = {name_1, name_2, …, name_n}, D2 = {label_1, label_2, …, label_n}, D3 = {name_1, name_2, …, name_m}, where name_a is the a-th text name in D1, label_a is the a-th text label in D2, and name_b is the b-th text name in D3, with a ∈ [1, n] and b ∈ [1, m];
(1.3) Apply word segmentation and stop-word removal to D1 and D3, and convert D2 to fastText label format, obtaining new D1, D2, D3;
(1.4) Concatenate D1 and D2 into fastText text format to obtain the training dataset Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, saved as a txt file, the test dataset Test being D3;
Step 2: perform fastText training on the training data processed in step 1, evaluate and tune parameters, obtaining the two optimal fastText parameter groups with the parameter word_ngrams set to 1 and to 2; after the data preprocessing of step 1, split the training data Train into a new training set T and a new validation set V, train fastText on T and predict V to obtain an accuracy, with the fastText parameter word_ngrams set to 1 or 2 and the other parameters adjusted over multiple rounds, obtain the two parameter groups with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively, and record them;
Step 3: shuffle the training data Train processed in step 1 to generate 15 training samples; perform fastText training 7 times with each of the 2 parameter groups generated in step 2, producing 14 models; then perform one further fastText training with a randomly chosen one of the 2 parameter groups, producing 1 model, finally obtaining a multi-base model of 15 fastText base models;
Step 4: predict the test data Test processed in step 1 with the multi-base model obtained in step 3, and obtain the final text prediction label by a voting mechanism.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910378450.7A CN110162629B (en) | 2019-05-08 | 2019-05-08 | Text classification method based on multi-base model framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910378450.7A CN110162629B (en) | 2019-05-08 | 2019-05-08 | Text classification method based on multi-base model framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110162629A CN110162629A (en) | 2019-08-23 |
CN110162629B true CN110162629B (en) | 2021-05-14 |
Family
ID=67633552
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910378450.7A Active CN110162629B (en) | 2019-05-08 | 2019-05-08 | Text classification method based on multi-base model framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110162629B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111428103A (en) * | 2020-03-19 | 2020-07-17 | 竹间智能科技(上海)有限公司 | Method for constructing bill auditing model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108154151A (en) * | 2017-12-20 | 2018-06-12 | 南京邮电大学 | A kind of quick multi-oriented text lines detection method |
US10432789B2 (en) * | 2017-02-09 | 2019-10-01 | Verint Systems Ltd. | Classification of transcripts by sentiment |
- 2019-05-08 CN CN201910378450.7A patent/CN110162629B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10432789B2 (en) * | 2017-02-09 | 2019-10-01 | Verint Systems Ltd. | Classification of transcripts by sentiment |
CN108154151A (en) * | 2017-12-20 | 2018-06-12 | 南京邮电大学 | A kind of quick multi-oriented text lines detection method |
Non-Patent Citations (1)
Title |
---|
"基于快速文本分类器与不平衡数据的研究";杜锦波;《中国优秀硕士学位论文全文数据库 社会科学Ⅱ辑》;20190115(第1期);正文第15-27页,第3.4、4.1-4.5节 * |
Also Published As
Publication number | Publication date |
---|---|
CN110162629A (en) | 2019-08-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||