CN110162629B - Text classification method based on multi-base model framework - Google Patents
Text classification method based on multi-base model framework
- Publication number
- CN110162629B (application CN201910378450.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- name
- text
- fasttext
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text classification method based on a multi-base model framework, comprising the following steps: 1) preprocess the input data; 2) perform fastText (fast text classification) training on the training data from step 1, evaluate and tune parameters, obtaining the two optimal fastText parameter groups with the parameter word_ngrams (word n-gram order) set to 1 and to 2; 3) shuffle the training data from step 1 to generate 15 training samples, perform fastText training 7 times with each of the 2 parameter groups from step 2 to produce 14 models, then perform one further fastText training with a randomly chosen one of the 2 parameter groups to produce 1 model, finally obtaining a multi-base model of 15 fastText base models; 4) predict the test data from step 1 with the multi-base model from step 3, and obtain the final text prediction label by a voting mechanism. The invention is characterized by fast training, high accuracy and high prediction efficiency.
Description
Technical Field
The invention relates to a text classification method based on a multi-base model framework, and belongs to the technical field of supervised learning algorithms and text classification processing.
Background
The text classification problem is not substantially different from other classification problems: classification amounts to matching the data to be classified against certain features, and since a complete match is unlikely, the best match (under some evaluation criterion) must be selected. The selection and training of the classifier, and the evaluation of and feedback on the classification results, are therefore crucial.
As one of the most classical tasks in the field of NLP (natural language processing), text classification has accumulated a large number of implementation techniques which, taking the use of deep learning as the dividing line, can be roughly split into two categories: text classification based on traditional machine learning and text classification based on deep learning. However, existing methods of both kinds suffer from problems such as low accuracy and long training time.
fastText is a fast text classification algorithm that can train on more than a billion words in under ten minutes using a standard multi-core CPU (central processing unit), and classify half a million sentences among more than 310,000 categories in less than a minute. Compared with neural-network-based text classification algorithms, fastText has two advantages: first, it greatly accelerates training and testing while maintaining high accuracy; second, it does not require pre-trained word vectors, since fastText trains the word vectors itself.
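As a rough sketch of how such a fastText classifier is trained (assuming the official `fasttext` Python package; the hyperparameter values and file name below are illustrative, not the patent's tuned ones):

```python
def to_fasttext_line(name, label):
    """Format one (segmented text, label) pair into fastText's expected
    training-line format: '__label__<label> <text>'."""
    return f"__label__{label} {name}"

def train_fasttext(train_path, word_ngrams=1):
    """Train a supervised fastText classifier on a file of such lines."""
    # Imported lazily so the formatting helper above runs without the package.
    import fasttext
    return fasttext.train_supervised(
        input=train_path,
        wordNgrams=word_ngrams,  # the n-gram parameter this method tunes to 1 or 2
        epoch=25,                # illustrative value
        lr=0.5,                  # illustrative value
        bucket=2000000,          # hash-bucket size mentioned later in the description
    )

print(to_fasttext_line("red cotton shirt", "clothing"))
```

A trained model then answers `model.predict("some segmented text")`, returning a tuple of labels and probabilities.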
Bagging is a method for improving the accuracy of a learning algorithm: it constructs a series of prediction functions and combines them, in a certain way, into a single prediction function. Bagging requires an "unstable" classification method (unstable meaning that small variations in the data set can cause significant variations in the classification result), such as decision trees or neural network algorithms. The basic idea: given a weak learning algorithm and a training set, a single run of the weak learner may not be very accurate, so the learner is applied multiple times to obtain a sequence of prediction functions whose outputs are combined by voting, improving the accuracy of the final result. Existing text classification methods based on traditional machine learning have low accuracy, and the neural networks of conventional deep learning methods take long to train. The fastText-based text classification algorithm, while comparatively advanced, is limited in what a single model can achieve: for ambiguous data a single model's output is somewhat accidental and may be inaccurate, whereas voting over multiple models can improve accuracy. The present invention addresses these problems.
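The Bagging idea described above can be illustrated with a minimal, library-free sketch (the "weak learners" here are toy functions standing in for real base classifiers):

```python
from collections import Counter

def bagging_predict(models, x):
    """Combine base-model predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Three slightly different "weak learners" classifying the sign of a number;
# each is biased in its own way, but the vote smooths out individual errors.
weak_learners = [
    lambda x: "pos" if x > 0 else "neg",
    lambda x: "pos" if x > -1 else "neg",
    lambda x: "pos" if x > 1 else "neg",
]

print(bagging_predict(weak_learners, 0.5))   # two of three vote "pos"
```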
Disclosure of Invention
The invention aims to provide, against the defects of the prior art, a text classification method based on a multi-base model framework. The method builds on the Bagging ensemble-learning idea, accurately realizes text classification with a multi-base model framework, and effectively addresses the low accuracy of text classification methods based on traditional machine learning and the long neural-network training time of conventional deep learning methods.
The technical scheme adopted by the invention to solve the technical problem is as follows. The invention provides a text classification method based on a multi-base model framework, comprising the following steps:
Step 1: preprocess the text data.
Step 2: perform fastText training on the training data processed in step 1, evaluate and tune parameters, obtaining the two optimal fastText parameter groups with the parameter word_ngrams set to 1 and to 2.
Step 3: shuffle the training data processed in step 1 to generate 15 training samples; perform fastText training 7 times with each of the 2 parameter groups generated in step 2, producing 14 models; then perform one further fastText training with a randomly chosen one of the 2 parameter groups, producing 1 model, finally obtaining a multi-base model of 15 fastText base models.
Step 4: predict the test data processed in step 1 with the multi-base model obtained in step 3, and obtain the final text prediction label by a voting mechanism.
Further, the specific steps of preprocessing the text data in step 1 of the present invention are as follows:
(1.1) Define T1 as the single-text information set of the training data and T2 as the single-text information set of the test data, where name and label denote the name and the label of a single text, satisfying T1 = {name, label} and T2 = {name};
(1.2) Define D1 as the training-data text-name dataset, D2 as the training-data text-label dataset, and D3 as the test-data text-name dataset, with D1 = {name_1, name_2, …, name_n}, D2 = {label_1, label_2, …, label_n}, D3 = {name_1, name_2, …, name_m}, where name_a is the a-th text name in D1, label_a is the a-th text label in D2, and name_b is the b-th text name in D3, with a ∈ [1, n] and b ∈ [1, m];
(1.3) Apply word segmentation and stop-word removal to D1 and D3, and convert D2 to fastText label format, obtaining new D1, D2, D3;
(1.4) Concatenate D1 and D2 into fastText text format to obtain the training dataset Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, saved as a txt file; the test dataset is Test = D3.
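A sketch of steps (1.3)–(1.4), with a whitespace tokenizer standing in for a real Chinese segmenter (a production version would pass in something like `jieba.cut`; the example texts and the stop-word set are illustrative):

```python
def preprocess(names, labels, stopwords, tokenize=str.split):
    """Segment each text name, drop stop words, and join it with its label
    into a fastText training line ready to be written to the Train txt file."""
    lines = []
    for name, label in zip(names, labels):
        tokens = [t for t in tokenize(name) if t not in stopwords]
        lines.append(f"__label__{label} {' '.join(tokens)}")
    return lines

train_lines = preprocess(
    ["red cotton shirt", "usb charging cable"],
    ["clothing", "electronics"],
    stopwords={"usb"},  # illustrative stop-word set
)
print(train_lines)
```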
Further, in step 2 of the present invention, after the data preprocessing of step 1, the training data Train is split into a new training set T and a new validation set V; fastText training is performed on T, and the validation set V is predicted to obtain an accuracy. The fastText parameter word_ngrams is set to 1 or 2, and the other parameters are adjusted over multiple rounds to obtain the two parameter groups with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively, which are recorded.
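The tuning loop of step 2 might be sketched as follows; `train_and_score` is a caller-supplied callback (train on T, predict V, return accuracy), so the sketch itself does not depend on the fasttext package, and the hyperparameter grids are illustrative:

```python
import itertools

def tune(train_and_score, lr_grid=(0.1, 0.5, 1.0), epoch_grid=(5, 25)):
    """For word_ngrams fixed to 1 and then to 2, cycle over the other
    hyperparameters and keep the group with the highest validation accuracy."""
    best = {}
    for wn in (1, 2):
        scored = [
            (train_and_score({"wordNgrams": wn, "lr": lr, "epoch": ep}),
             {"wordNgrams": wn, "lr": lr, "epoch": ep})
            for lr, ep in itertools.product(lr_grid, epoch_grid)
        ]
        best[wn] = max(scored, key=lambda pair: pair[0])[1]
    return best  # the two recorded parameter groups, keyed by word_ngrams
```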
Further, in step 3 of the present invention, the training data Train processed in step 1 is shuffled to generate 15 training samples. fastText training is performed 7 times with each of the 2 parameter groups generated in step 2, producing 14 models; one further fastText training is then performed with a randomly chosen one of the 2 parameter groups, producing 1 model. This finally yields a multi-base model of 15 fastText base models (as shown in FIG. 2).
Further, in step 4 of the present invention, the multi-base model obtained in step 3 is used to predict the test data Test generated in step 1, and a voting mechanism yields the final text prediction label.
Advantageous effects:
1. Compared with a conventional TextCNN (text convolutional neural network) deep learning algorithm on the same dataset, the method B_f (Bagging_fastText) of the invention features fast training, high accuracy and high prediction efficiency.
2. Compared with TextCNN and with a single independent fastText model, the method of the invention clearly improves text classification accuracy on the experimental data, which is commodity-classification data from an online retail platform.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a conceptual diagram of the final model.
FIG. 3 is a flow chart for generating a multi-base model.
Detailed Description
The present invention is further illustrated by the following examples, which are intended purely to illustrate the invention and not to limit its scope. After reading the present disclosure, those skilled in the art may make various equivalent modifications, all of which fall within the scope defined by the appended claims.
As shown in fig. 1, the present invention provides a text classification method based on a multi-base model framework, which specifically includes the following steps:
The first step: define T as the single-text information set of the data, and define name and label as the name and the label of a single text, satisfying T = {name, label}. Define D1 as the data text-name dataset and D2 as the data text-label dataset, with D1 = {name_1, name_2, …, name_n} and D2 = {label_1, label_2, …, label_n}, where name_a is the a-th text name in D1 and label_a is the a-th text label in D2, with a ∈ [1, n]. Apply Chinese word segmentation and stop-word removal to D1, convert D2 to fastText format by prepending __label__ to each label, and concatenate D1 and D2 into fastText text format to obtain the training dataset:
Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, which is saved as a txt file.
The second step: split the training data Train into a new training set T and a new validation set V; train fastText on T and predict V to obtain an accuracy. The fastText parameter word_ngrams is set to 1 or 2, the other parameters are adjusted over multiple rounds, and the two parameter groups with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively are obtained, with the hash bucket size set to 2,000,000, and recorded.
The third step: cross validation was applied. Specifically, the training data Train is randomly and evenly divided into 10 groups, and then one group is extracted as a test set and the rest is extracted as a training set. The learning process is repeated 10 times, and each test results in a corresponding correct or error rate. The average of the accuracy or error rate of the 10 results is used as an estimate of the accuracy of the algorithm. The specific process comprises the following steps: the training set is disturbed to generate 15 sub-training sets with different sequences, 2 groups of parameters generated in the step 2 of the invention are used for performing the fastText training for 7 times respectively to generate 14 models, then 2 groups of parameters generated in the step 2 of the invention are used for performing the fastText training randomly to generate 1 model, and finally 15 final multi-base models (as shown in figure 2) taking the fastText model as a base model are obtained. Then, the multi-base model is used for predicting the test set, a voting mechanism is adopted to obtain a final text prediction label, the flow is shown in fig. 3, and the final text prediction label is compared with the test set label to obtain the accuracy.
On 500,000 items of online retail platform commodity-classification data, the method B_f (Bagging_fastText, a fast text classification algorithm based on bootstrap aggregating) achieves a classification accuracy of 86.65%, a further improvement over a single independent fastText model.
From the 500,000-item online retail commodity-classification dataset, 100%, 50% and 1% of the texts were randomly selected, and the invention (B_f) was compared under cross-validation against a TextCNN (text convolutional neural network) model and a single fastText model. Specifically, each text dataset was randomly and evenly divided into 10 groups; one group was then randomly drawn as the test set and the rest served as the training set. In the training set, only 10 labeled samples were selected, the remainder being unlabeled. This learning process was repeated 10 times, and the average results are recorded in Table I, where the best performance in each row is shown in bold.
TABLE I
As the data scales and accuracy figures in Table I show, B_f attains a higher accuracy level, and its advantage grows with the amount of data. Compared with the TextCNN model, B_f is clearly superior at the 100% and 50% data scales; compared with the single independent fastText model, its accuracy is clearly improved at all 3 data scales. Meanwhile, at the 100% data scale the training time of B_f is clearly better than that of the TextCNN model: 366 seconds versus 3069 seconds.
In summary, the method B_f provided by the invention classifies text effectively, with higher accuracy and higher efficiency.
The voting mechanism inputs the test data into the multi-base model composed of the 15 base models obtained in the third step of the example above, producing 15 result labels, and takes the label occurring most often as the final label.
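A sketch of that voting step, assuming each base model exposes a fastText-style `predict()` that returns labels prefixed with `__label__`:

```python
from collections import Counter

def predict_label(models, text):
    """Collect one label per base model and return the most frequent one."""
    votes = Counter(
        m.predict(text)[0][0].replace("__label__", "", 1)  # strip fastText's prefix
        for m in models
    )
    return votes.most_common(1)[0][0]
```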
As shown in FIG. 2, models 1 to 7 are obtained by training on sample sets 1 to 7 from the third step of the example above, using the optimal parameter group for n-gram feature value 1 obtained in the second step; models 8 to 14 are obtained by training on sample sets 8 to 14 with the optimal parameter group for n-gram feature value 2; and model 15 is obtained by training on sample set 15 with a randomly chosen one of the two optimal parameter groups (n-gram feature value 1 or 2). Together these form a multi-base model consisting of 15 base models.
As shown in FIG. 3, the invention randomly shuffles the training data 15 times, producing 15 training sample sets, each of which is trained separately: sample sets 1 to 7 with the optimal parameter group for n-gram feature value 1 from the second step of the example, sample sets 8 to 14 with the optimal parameter group for n-gram feature value 2, and sample set 15 with a randomly chosen one of the two groups. This yields a multi-base model of 15 base models; at prediction time, the predictions on the test data are fused by a voting mechanism to obtain the final text prediction label.
The foregoing is only a partial embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the invention, and such improvements and refinements shall also fall within the scope of protection of the invention.
Claims (1)
1. A text classification method based on a multi-base model framework, characterized by comprising the following steps:
step 1: preprocessing the text data;
preprocessing the text data includes:
(1.1) Define T1 as the single-text information set of the training data and T2 as the single-text information set of the test data, where name and label denote the name and the label of a single text, satisfying T1 = {name, label} and T2 = {name};
(1.2) Define D1 as the training-data text-name dataset, D2 as the training-data text-label dataset, and D3 as the test-data text-name dataset, with D1 = {name_1, name_2, …, name_n}, D2 = {label_1, label_2, …, label_n}, D3 = {name_1, name_2, …, name_m}, where name_a is the a-th text name in D1, label_a is the a-th text label in D2, and name_b is the b-th text name in D3, with a ∈ [1, n] and b ∈ [1, m];
(1.3) Apply word segmentation and stop-word removal to D1 and D3, and convert D2 to fastText label format, obtaining new D1, D2, D3;
(1.4) Concatenate D1 and D2 into fastText text format to obtain the training dataset Train = {{name_1, label_1}, {name_2, label_2}, …, {name_n, label_n}}, saved as a txt file, the test dataset Test being D3;
Step 2: perform fastText training on the training data processed in step 1, evaluate and tune parameters, obtaining the two optimal fastText parameter groups with the parameter word_ngrams set to 1 and to 2; after the data preprocessing of step 1, split the training data Train into a new training set T and a new validation set V, train fastText on T and predict V to obtain an accuracy, with the fastText parameter word_ngrams set to 1 or 2 and the other parameters adjusted over multiple rounds, obtain the two parameter groups with the highest accuracy for word_ngrams = 1 and word_ngrams = 2 respectively, and record them;
Step 3: shuffle the training data Train processed in step 1 to generate 15 training samples; perform fastText training 7 times with each of the 2 parameter groups generated in step 2, producing 14 models; then perform one further fastText training with a randomly chosen one of the 2 parameter groups, producing 1 model, finally obtaining a multi-base model of 15 fastText base models;
Step 4: predict the test data Test processed in step 1 with the multi-base model obtained in step 3, and obtain the final text prediction label by a voting mechanism.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910378450.7A CN110162629B (en) | 2019-05-08 | 2019-05-08 | Text classification method based on multi-base model framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910378450.7A CN110162629B (en) | 2019-05-08 | 2019-05-08 | Text classification method based on multi-base model framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110162629A CN110162629A (en) | 2019-08-23 |
CN110162629B true CN110162629B (en) | 2021-05-14 |
Family
ID=67633552
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910378450.7A Active CN110162629B (en) | 2019-05-08 | 2019-05-08 | Text classification method based on multi-base model framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110162629B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111428103A (en) * | 2020-03-19 | 2020-07-17 | 竹间智能科技(上海)有限公司 | Method for constructing bill auditing model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108154151A (en) * | 2017-12-20 | 2018-06-12 | 南京邮电大学 | A kind of quick multi-oriented text lines detection method |
US10432789B2 (en) * | 2017-02-09 | 2019-10-01 | Verint Systems Ltd. | Classification of transcripts by sentiment |
- 2019-05-08 CN CN201910378450.7A patent/CN110162629B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10432789B2 (en) * | 2017-02-09 | 2019-10-01 | Verint Systems Ltd. | Classification of transcripts by sentiment |
CN108154151A (en) * | 2017-12-20 | 2018-06-12 | 南京邮电大学 | A kind of quick multi-oriented text lines detection method |
Non-Patent Citations (1)
Title |
---|
"基于快速文本分类器与不平衡数据的研究";杜锦波;《中国优秀硕士学位论文全文数据库 社会科学Ⅱ辑》;20190115(第1期);正文第15-27页,第3.4、4.1-4.5节 * |
Also Published As
Publication number | Publication date |
---|---|
CN110162629A (en) | 2019-08-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||