CN109918497A - Text classification method, apparatus and storage medium based on an improved textCNN model - Google Patents

Text classification method, apparatus and storage medium based on an improved textCNN model

Info

Publication number
CN109918497A
CN109918497A
Authority
CN
China
Prior art keywords
text
layer
convolution
improvement
textcnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811572759.1A
Other languages
Chinese (zh)
Inventor
马涛
栾江霞
章正道
俞碧洪
徐晓文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd filed Critical Xiamen Meiya Pico Information Co Ltd
Priority to CN201811572759.1A
Publication of CN109918497A
Legal status: Pending (current)


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text classification method, apparatus and storage medium based on an improved textCNN model. The method comprises: a training step, in which the improved textCNN model is trained using sample texts to obtain a trained improved textCNN model; and a text classification step, in which texts to be classified are classified using the trained improved textCNN model. By improving the traditional textCNN model, the present invention obtains a text classification algorithm based on an improved textCNN model. Because the word embedding layer is pre-trained, the training time and the amount of computation in the training stage are greatly reduced; because the convolutional layers are deepened and batch normalization layers are added, the classification accuracy is considerably improved, making the method particularly suitable for scenarios with demanding requirements on sample timeliness (i.e., the model must be updated frequently according to new samples) and on classification accuracy, such as text classification for Internet public opinion analysis.

Description

Text classification method, apparatus and storage medium based on an improved textCNN model
Technical field
The present invention relates to the technical field of data processing, and in particular to a text classification method, apparatus and storage medium based on an improved textCNN model.
Background art
The explosive growth of network data places ever higher demands on data analysis. Text analysis and mining is a widely used technology that extracts the semantic content of text with appropriate techniques and methods and then performs a series of operations on the text, such as classification and clustering; it is mainly applied in fields such as product recommendation, public opinion analysis, and text search.
In public opinion analysis, public opinion on the network needs to be organized and analyzed under different topics. For example, collected texts are classified so that texts of interest to users are identified automatically and uninteresting junk texts are filtered out. Automatic classification of collected texts is therefore a relatively important step in public opinion analysis.
Text classification algorithms based on the traditional vector space model cannot model the temporal order of words, nor can they model the semantics between different words, so the resulting classification performance is unsatisfactory. Text classification algorithms based on deep learning do not require cumbersome feature engineering and can model word order and semantics well; their classification performance far exceeds that of vector space models, so deep-learning-based text classification algorithms have become mainstream. In the field of public opinion analysis, however, both the categories and the samples are highly time-sensitive: categories change frequently with public opinion demands, and new public opinion hotspots continually generate new samples, so the model must be updated and iterated frequently. RNN-based text classification algorithms involve an enormous amount of computation, which makes training and prediction slow, and frequently updating and iterating such a model in a public opinion scenario wastes a great deal of computing resources.
Summary of the invention
In view of the above defects in the prior art, the present invention proposes the following technical solutions.
A text classification method based on an improved textCNN model, the method comprising:
a training step: training the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification step: classifying texts to be classified using the trained improved textCNN model.
Further, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
Further, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
Further, the training step comprises:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
Further, the text classification step comprises:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
The present invention also provides a text classification apparatus based on an improved textCNN model, the apparatus comprising:
a training unit, which trains the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification unit, which classifies texts to be classified using the trained improved textCNN model.
Further, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
Further, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
Further, the operations performed by the training unit comprise:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
Further, the operations performed by the text classification unit comprise:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
The present invention also provides a computer-readable storage medium on which computer program code is stored; when the computer program code is executed by a computer, any of the above methods is performed.
Technical effects of the invention: by improving the traditional textCNN model, the present invention obtains a text classification algorithm based on an improved textCNN model. Because the word embedding layer is pre-trained, the training time and the amount of computation in the training stage are greatly reduced; because the convolutional layers are deepened and batch normalization layers are added, the classification accuracy is considerably improved, making the method more suitable for scenarios with demanding requirements on sample timeliness (i.e., the model must be updated frequently according to new samples) and on classification accuracy, such as text classification for Internet public opinion.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent from the following detailed description of non-restrictive embodiments, read in conjunction with the accompanying drawings.
Fig. 1 is a flow chart of a text classification method based on an improved textCNN model according to an embodiment of the present invention.
Fig. 2 is a structural diagram of the improved textCNN model according to an embodiment of the present invention.
Fig. 3 is a flow chart of training the improved textCNN model according to an embodiment of the present invention.
Fig. 4 is a flow chart of performing text classification with the improved textCNN model according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a text classification apparatus based on an improved textCNN model according to an embodiment of the present invention.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features therein may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows a text classification method based on an improved textCNN model of the present invention, the method comprising:
a training step S101: training the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification step S102: classifying texts to be classified using the trained improved textCNN model.
An essential step of the present invention is constructing the improved textCNN model, i.e., obtaining the improved textCNN model through training; the structure of the improved textCNN model is an important inventive point of the present invention.
As shown in Fig. 2, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
In the present invention, the preprocessed text is input to the word embedding layer (Embedding) of the improved textCNN model. The weights of the Embedding layer are pre-trained on a large amount of unlabeled data with the skip-gram model in word2vec, and the weights of this layer are not updated when training the classification model. The number of parameters in this layer is the dictionary size multiplied by the word vector dimension (256), up to tens of millions of parameters, so using pre-trained word vectors greatly reduces the training time.
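By way of illustration, a minimal sketch of this pre-training step follows. The patent specifies word2vec's skip-gram model and a word vector dimension of 256 but names no library; gensim (version 4.x) is assumed here, and the corpus, vocabulary and file names are illustrative, not taken from the patent.

```python
import numpy as np
from gensim.models import Word2Vec

# corpus: list of token lists produced by the preprocessing step (placeholder data)
corpus = [["网络", "舆情", "分析"], ["文本", "分类", "模型"]]

# sg=1 selects the skip-gram model; vector_size=256 matches the patent
w2v = Word2Vec(sentences=corpus, vector_size=256, sg=1,
               window=5, min_count=1, workers=4)
w2v.save("word2vec_256.model")

# Build an embedding matrix indexed by a word-to-id dictionary; row 0 is
# reserved for padding/unknown words. The embedding layer that consumes this
# matrix is frozen, so its parameters are never updated while training the
# classifier.
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
embedding_matrix = np.zeros((len(vocab) + 1, 256), dtype="float32")
for word, idx in vocab.items():
    embedding_matrix[idx] = w2v.wv[word]
```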
In the present invention, four modules — the first, second, third and fourth convolution modules — process the input in parallel. The size of the convolution kernel differs in each module so that each module captures local features of a different span; one-dimensional convolution kernels of different sizes are used, and this method chooses the four sizes 4, 5, 6 and 7 to capture local features of the corresponding spans. In particular, this method attaches a batch normalization layer (BatchNorm) after each convolutional layer to standardize the data, which prevents the vanishing gradient problem and makes the model converge faster. A RELU activation function is used after each BatchNorm layer. In each convolution module the convolutional layers are stacked twice, deeper than the traditional textCNN model, which gives the model stronger expressive power and thus improves prediction accuracy. Of course, those skilled in the art will readily see that the network can be made deeper, for example 5 or 6 layers, and that more convolution modules can run in parallel, not limited to four, for example 6 or 8.
As shown in Fig. 2, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
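For concreteness, the following sketch expresses one such convolution module: two stacked Conv-BatchNorm-RELU blocks followed by max pooling. The patent names no deep learning framework; Keras/TensorFlow is assumed here, and the filter count of 128 is an illustrative choice not specified in the patent.

```python
from tensorflow.keras import layers

def conv_module(x, kernel_size, filters=128):
    """One convolution module: (Conv1D -> BatchNorm -> RELU) twice, then max pooling."""
    for _ in range(2):
        x = layers.Conv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)   # standardizes activations, counters vanishing gradients
        x = layers.Activation("relu")(x)
    # global max pooling stands in for the module's max pooling layer,
    # down-sampling each feature map to a single value
    return layers.GlobalMaxPooling1D()(x)
```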
The outputs of the four convolution modules with different kernel sizes each pass through a max-pool layer, which compresses the output dimension by down-sampling. The Concat layer splices the four outputs of the previous layer into a single one-dimensional vector, which then passes through a Dropout layer. (Dropout means that, during the training of a deep learning network, neural network units are temporarily dropped from the network with a certain probability. Note that this is temporary: under stochastic gradient descent, because units are dropped at random, each mini-batch trains a different network.) The dropout value is 0.5, which prevents overfitting. The fully connected layer (fc) with RELU activation maps the one-dimensional vector to a 128-dimensional vector. The classifier layer maps this vector to a category vector whose entries are the probabilities of belonging to each class; the classifier layer of the present invention generally uses the softmax function.
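Putting the pieces together, a minimal sketch of the full improved textCNN of Fig. 2 might look as follows, reusing conv_module from the sketch above; the sequence length and number of classes are illustrative, and embedding_matrix is the frozen pre-trained matrix built earlier.

```python
from tensorflow.keras import Model, layers

def build_improved_textcnn(embedding_matrix, seq_len=200, num_classes=10):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    # pre-trained word embedding layer, frozen during classifier training
    x = layers.Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                         weights=[embedding_matrix], trainable=False)(inputs)
    # four parallel convolution modules with kernel sizes 4, 5, 6 and 7
    branches = [conv_module(x, k) for k in (4, 5, 6, 7)]
    x = layers.Concatenate()(branches)            # Concat layer
    x = layers.Dropout(0.5)(x)                    # dropout value of 0.5 against overfitting
    x = layers.Dense(128, activation="relu")(x)   # fully connected layer with RELU, 128 dims
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # classifier layer
    return Model(inputs, outputs)
```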
As shown in Fig. 3, the training step comprises:
Preprocessing the labeled sample texts: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each sample text. The length of each text in the sample texts is counted, a uniform text length is determined based on the average length and experience, texts that are too long are truncated, and texts that are too short are padded.
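A minimal sketch of this preprocessing, assuming jieba for Chinese word segmentation (the patent names no segmenter) and using an illustrative stop-word list and uniform length:

```python
import re
import jieba

STOP_WORDS = {"的", "了", "是"}   # placeholder stop-word list
UNIFORM_LEN = 200                # illustrative uniform length from average length + experience

def preprocess(text, pad_token="<PAD>"):
    # remove junk characters: keep CJK characters, letters and digits only
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", " ", text)
    # word segmentation and stop-word removal
    words = [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]
    words = words[:UNIFORM_LEN]                           # truncate texts that are too long
    words += [pad_token] * (UNIFORM_LEN - len(words))     # pad texts that are too short
    return words
```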
The word-vector training corpus is preprocessed by word segmentation and stop-word removal, and training with the skip-gram model in word2vec yields trained word vectors of dimension 256.
The preprocessed sample texts are combined with the trained word vectors to obtain the feature matrix of each text, and the feature matrices are divided into a training set and a test set in a certain proportion.
The training set is input into the improved textCNN model with initial weights; categorical cross entropy is used as the loss function, the RMSProp optimizer adaptively changes the learning rate, and training yields the trained improved textCNN model. The principle of the RMSprop optimizer is similar to the momentum gradient descent algorithm: RMSprop limits oscillation in the vertical direction, so the algorithm can take larger steps in the horizontal direction and converge faster.
The test set is input into the trained improved textCNN model to obtain the classification results of the test set, which are compared with the test set labels to compute the prediction accuracy. By repeatedly adjusting hyperparameters and optimizing the preprocessing procedure, the prediction accuracy of the improved textCNN classification model is made optimal; the improved textCNN classification model at this point is the trained improved textCNN classification model.
As shown in Fig. 4, the text classification step comprises:
Preprocessing the text to be classified: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each text. This step is identical to the preprocessing applied to the labeled sample texts during model training.
The trained word vector model file is loaded to obtain word vectors, and the word vectors are used to represent the preprocessed text as a text feature vector matrix.
The trained improved textCNN model is loaded, the text feature vector is input into the improved textCNN model, and prediction yields the classification result of the text.
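A sketch of this classification step, reusing the illustrative preprocess function, vocab dictionary and file names from the earlier sketches; here the frozen Embedding layer inside the model performs the word-vector lookup that the patent describes as building the text feature vector matrix:

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.models import load_model

w2v = Word2Vec.load("word2vec_256.model")    # trained word vector model file
model = load_model("improved_textcnn.h5")    # trained improved textCNN model

def classify(text):
    words = preprocess(text)                     # same preprocessing as in training
    ids = [vocab.get(w, 0) for w in words]       # map words to ids; 0 = padding/unknown
    probs = model.predict(np.array([ids]))[0]    # probability of each category
    return int(np.argmax(probs))                 # predicted category index
```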
With further reference to Fig. 5, as an implementation of the method shown in Fig. 1, the present application provides an embodiment of a text classification apparatus based on an improved textCNN model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 1, and the apparatus may specifically be incorporated into various electronic devices.
Fig. 5 shows a text classification apparatus based on an improved textCNN model of the present invention, the apparatus comprising:
a training unit 501, which trains the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification unit 502, which classifies texts to be classified using the trained improved textCNN model.
An essential step of the present invention is constructing the improved textCNN model, i.e., obtaining the improved textCNN model through training; the structure of the improved textCNN model is an important inventive point of the present invention.
As shown in Fig. 2, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
In the present invention, the preprocessed text is input to the word embedding layer (Embedding) of the improved textCNN model. The weights of the Embedding layer are pre-trained on a large amount of unlabeled data with the skip-gram model in word2vec, and the weights of this layer are not updated when training the classification model. The number of parameters in this layer is the dictionary size multiplied by the word vector dimension (256), up to tens of millions of parameters, so using pre-trained word vectors greatly reduces the training time.
In the present invention, four modules — the first, second, third and fourth convolution modules — process the input in parallel. The size of the convolution kernel differs in each module so that each module captures local features of a different span; one-dimensional convolution kernels of different sizes are used, and the present apparatus chooses the four sizes 4, 5, 6 and 7 to capture local features of the corresponding spans. In particular, the present apparatus attaches a batch normalization layer (BatchNorm) after each convolutional layer to standardize the data, which prevents the vanishing gradient problem and makes the model converge faster. A RELU activation function is used after each BatchNorm layer. In each convolution module the convolutional layers are stacked twice, deeper than the traditional textCNN model, which gives the model stronger expressive power and thus improves prediction accuracy. Of course, those skilled in the art will readily see that the network can be made deeper, for example 5 or 6 layers, and that more convolution modules can run in parallel, not limited to four, for example 6 or 8.
As shown in Fig. 2, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
The outputs of the four convolution modules with different kernel sizes each pass through a max-pool layer, which compresses the output dimension by down-sampling. The Concat layer splices the four outputs of the previous layer into a single one-dimensional vector, which then passes through a Dropout layer. (Dropout means that, during the training of a deep learning network, neural network units are temporarily dropped from the network with a certain probability. Note that this is temporary: under stochastic gradient descent, because units are dropped at random, each mini-batch trains a different network.) The dropout value is 0.5, which prevents overfitting. The fully connected layer (fc) with RELU activation maps the one-dimensional vector to a 128-dimensional vector. The classifier layer maps this vector to a category vector whose entries are the probabilities of belonging to each class; the classifier layer of the present invention generally uses the softmax function.
As shown in Fig. 3, the operations performed by the training unit comprise:
Preprocessing the labeled sample texts: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each sample text. The length of each text in the sample texts is counted, a uniform text length is determined based on the average length and experience, texts that are too long are truncated, and texts that are too short are padded.
The word-vector training corpus is preprocessed by word segmentation and stop-word removal, and training with the skip-gram model in word2vec yields trained word vectors of dimension 256.
The preprocessed sample texts are combined with the trained word vectors to obtain the feature matrix of each text, and the feature matrices are divided into a training set and a test set in a certain proportion.
The training set is input into the improved textCNN model with initial weights; categorical cross entropy is used as the loss function, the RMSProp optimizer adaptively changes the learning rate, and training yields the trained improved textCNN model. The principle of the RMSprop optimizer is similar to the momentum gradient descent algorithm: RMSprop limits oscillation in the vertical direction, so the algorithm can take larger steps in the horizontal direction and converge faster.
The test set is input into the trained improved textCNN model to obtain the classification results of the test set, which are compared with the test set labels to compute the prediction accuracy. By repeatedly adjusting hyperparameters and optimizing the preprocessing procedure, the prediction accuracy of the improved textCNN classification model is made optimal; the improved textCNN classification model at this point is the trained improved textCNN classification model.
As shown in Fig. 4, the operations performed by the text classification unit comprise:
Preprocessing the text to be classified: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each text. This step is identical to the preprocessing applied to the labeled sample texts during model training.
The trained word vector model file is loaded to obtain word vectors, and the word vectors are used to represent the preprocessed text as a text feature vector matrix.
The trained improved textCNN model is loaded, the text feature vector is input into the improved textCNN model, and prediction yields the classification result of the text.
The present invention achieves the following technical effects: by improving the traditional textCNN model, a text classification algorithm based on an improved textCNN model is obtained. Because of the pre-trained embedding layer, the batch normalization layers and the RELU activation functions, the training time and the amount of computation are greatly reduced and the classification accuracy is greatly improved, making the method more suitable for scenarios with demanding requirements on sample timeliness (i.e., the model must be updated frequently according to new samples) and on classification accuracy, such as text classification in Internet public opinion analysis. Moreover, after text preprocessing, the trained word vector model file is loaded to obtain word vectors, and the word vectors are used to represent the preprocessed text as a text feature vector matrix for subsequent processing, which improves classification speed and accuracy.
For convenience of description, the above apparatus is described as being divided into various units by function. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments only illustrate, rather than limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the invention may still be modified or equivalently replaced without departing from its spirit and scope, and any such modifications or equivalent replacements shall be covered by the claims of the present invention.

Claims (11)

1. A text classification method based on an improved textCNN model, characterized in that the method comprises:
a training step: training the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification step: classifying texts to be classified using the trained improved textCNN model.
2. The method according to claim 1, characterized in that the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer; the input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
3. The method according to claim 2, characterized in that the sizes of the convolution kernels in the first convolution module, the second convolution module, the third convolution module and the fourth convolution module are all different, so as to capture local features of different spans; the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
4. The method according to claim 1, characterized in that the training step comprises:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
5. The method according to claim 4, characterized in that the text classification step comprises:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
6. A text classification apparatus based on an improved textCNN model, characterized in that the apparatus comprises:
a training unit, which trains the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification unit, which classifies texts to be classified using the trained improved textCNN model.
7. The apparatus according to claim 6, characterized in that the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer; the input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
8. The apparatus according to claim 7, characterized in that the sizes of the convolution kernels in the first convolution module, the second convolution module, the third convolution module and the fourth convolution module are all different, so as to capture local features of different spans; the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
9. The apparatus according to claim 6, characterized in that the operations performed by the training unit comprise:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
10. The apparatus according to claim 9, characterized in that the operations performed by the text classification unit comprise:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
11. A computer-readable storage medium, characterized in that computer program code is stored on the storage medium, and when the computer program code is executed by a computer, the method of any one of claims 1 to 5 is performed.
CN201811572759.1A 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model Pending CN109918497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811572759.1A CN109918497A (en) 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811572759.1A CN109918497A (en) 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model

Publications (1)

Publication Number Publication Date
CN109918497A true CN109918497A (en) 2019-06-21

Family

ID=66959953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811572759.1A Pending CN109918497A (en) 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model

Country Status (1)

Country Link
CN (1) CN109918497A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543629A (en) * 2019-08-01 2019-12-06 淮阴工学院 chemical equipment text classification method based on W-ATT-CNN algorithm
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN112242185A (en) * 2020-09-09 2021-01-19 山东大学 Medical image report automatic generation method and system based on deep learning
CN112270615A (en) * 2020-10-26 2021-01-26 西安邮电大学 Intelligent decomposition method for manufacturing BOM (Bill of Material) by complex equipment based on semantic calculation
WO2021051586A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Interview answer text classification method, device, electronic apparatus and storage medium
CN114207605A (en) * 2019-10-31 2022-03-18 深圳市欢太科技有限公司 Text classification method and device, electronic equipment and storage medium
CN114416213A (en) * 2022-03-29 2022-04-29 北京沃丰时代数据科技有限公司 Word vector file loading method and device and storage medium
CN114564942A (en) * 2021-09-06 2022-05-31 北京数美时代科技有限公司 Text error correction method, storage medium and device for supervision field
CN115936094A (en) * 2022-12-27 2023-04-07 北京百度网讯科技有限公司 Training method and device of text processing model, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310825A (en) * 1998-06-23 2001-08-29 微软公司 Methods and apparatus for classifying text and for building a text classifier
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks
AU2018101513A4 (en) * 2018-10-11 2018-11-15 Hui, Bo Mr Comprehensive Stock Prediction GRU Model: Emotional Index and Volatility Based

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310825A (en) * 1998-06-23 2001-08-29 微软公司 Methods and apparatus for classifying text and for building a text classifier
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks
AU2018101513A4 (en) * 2018-10-11 2018-11-15 Hui, Bo Mr Comprehensive Stock Prediction GRU Model: Emotional Index and Volatility Based

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
小简铺子: "Implementation of a convolutional neural network (TextCNN) for sentence classification" *
流川枫AI: "I Love NLP (4): Chinese text classification in practice based on the Text-CNN model", Jianshu *
谷宇: "Multimodal 3D convolutional neural network method for brain glioma segmentation", Science Technology and Engineering *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543629A (en) * 2019-08-01 2019-12-06 淮阴工学院 chemical equipment text classification method based on W-ATT-CNN algorithm
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN110717039B (en) * 2019-09-17 2023-10-13 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device, and computer-readable storage medium
WO2021051586A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Interview answer text classification method, device, electronic apparatus and storage medium
CN114207605A (en) * 2019-10-31 2022-03-18 深圳市欢太科技有限公司 Text classification method and device, electronic equipment and storage medium
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN112242185A (en) * 2020-09-09 2021-01-19 山东大学 Medical image report automatic generation method and system based on deep learning
CN112270615A (en) * 2020-10-26 2021-01-26 西安邮电大学 Intelligent decomposition method for manufacturing BOM (Bill of Material) by complex equipment based on semantic calculation
CN114564942A (en) * 2021-09-06 2022-05-31 北京数美时代科技有限公司 Text error correction method, storage medium and device for supervision field
CN114416213A (en) * 2022-03-29 2022-04-29 北京沃丰时代数据科技有限公司 Word vector file loading method and device and storage medium
CN115936094A (en) * 2022-12-27 2023-04-07 北京百度网讯科技有限公司 Training method and device of text processing model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109918497A (en) Text classification method, apparatus and storage medium based on an improved textCNN model
CN108334605B (en) Text classification method and device, computer equipment and storage medium
CN108170736B (en) Document rapid scanning qualitative method based on cyclic attention mechanism
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN110413786B (en) Data processing method based on webpage text classification, intelligent terminal and storage medium
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN106528528A (en) A text emotion analysis method and device
CN110188047A (en) A kind of repeated defects report detection method based on binary channels convolutional neural networks
CN112100377B (en) Text classification method, apparatus, computer device and storage medium
CN111858878B (en) Method, system and storage medium for automatically extracting answer from natural language text
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112215696A (en) Personal credit evaluation and interpretation method, device, equipment and storage medium based on time sequence attribution analysis
CN112507114A (en) Multi-input LSTM-CNN text classification method and system based on word attention mechanism
CN112925904A (en) Lightweight text classification method based on Tucker decomposition
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
CN110019796A (en) A kind of user version information analysis method and device
WO2021159099A1 (en) Searching for normalization-activation layer architectures
Sajeevan et al. An enhanced approach for movie review analysis using deep learning techniques
Swami et al. Resume classifier and summarizer
Ram et al. Supervised sentiment classification with cnns for diverse se datasets
CN116186506A (en) Automatic identification method for accessibility problem report based on BERT pre-training model
CN116089886A (en) Information processing method, device, equipment and storage medium
CN113806538B (en) Label extraction model training method, device, equipment and storage medium
US20230063686A1 (en) Fine-grained stochastic neural architecture search
CN114049522A (en) Garbage classification system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190621

RJ01 Rejection of invention patent application after publication