CN109918497A - Text classification method, apparatus, and storage medium based on an improved textCNN model - Google Patents
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The present invention provides a text classification method, apparatus, and storage medium based on an improved textCNN model. The method comprises: a training step, in which the improved textCNN model is trained with sample texts to obtain a trained improved textCNN model; and a text classification step, in which texts to be classified are classified with the trained improved textCNN model. By improving the traditional textCNN model, the invention obtains a text classification algorithm based on an improved textCNN model. Because the word embedding layer is pre-trained, the training time and computation of the training stage are greatly reduced; because the convolutional layers are deepened and batch normalization layers are added, classification accuracy is markedly improved. The method is therefore well suited to scenarios with strong requirements on sample timeliness (i.e., the model must be updated frequently with new samples) and classification accuracy, such as text classification for internet public opinion analysis.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a text classification method, apparatus, and storage medium based on an improved textCNN model.
Background
The explosive growth of network data places ever more and higher demands on data analysis. Text analysis and mining is a widely applied technology: it extracts the semantic content of texts with appropriate techniques and methods and then performs operations such as classification and clustering on them. It is mainly used in fields such as product recommendation, public opinion analysis, and text search.
In public opinion analysis, public opinion on the network must be organized and analyzed under different topics. For example, collected texts are classified so that texts of interest to the user are identified automatically and uninteresting junk texts are filtered out. Automatically classifying collected texts is therefore a relatively important link in public opinion analysis.
Text classification algorithms based on the traditional vector space model cannot model the temporal order of words, nor can they model the semantic relations between different words, so the classification results they obtain are unsatisfactory. Text classification algorithms based on deep learning do not require the cumbersome feature-engineering step and can model word order and semantics well, so their classification results far exceed those of the vector space model; deep-learning-based text classification has therefore become the mainstream.
In the field of public opinion analysis, however, both the categories and the samples are highly time-sensitive: categories change frequently with public opinion demands, and new hot topics generate new samples over time, so the model must be updated and iterated frequently. Text classification algorithms based on RNNs involve a huge amount of computation, which makes training and prediction slow; in a public opinion scenario, frequently updating and iterating such a model wastes enormous computing resources.
Summary of the invention
In view of the above defects in the prior art, the present invention proposes the following technical solution.
A text classification method based on an improved textCNN model, the method comprising:
a training step: training an improved textCNN model with sample texts to obtain a trained improved textCNN model;
a text classification step: classifying texts to be classified with the trained improved textCNN model.
Further, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a ReLU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer. The first, second, third, and fourth convolution modules process in parallel: their inputs are all connected to the output of the word embedding layer, and their outputs are all connected to the input of the Concat layer. The output of the Concat layer is connected to the Dropout layer, the output of the Dropout layer is connected to the input of the fully connected layer with the ReLU activation function, and the output of that fully connected layer is connected to the input of the classifier layer.
Further, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first ReLU activation function, a second convolutional layer, a second batch normalization layer, a second ReLU activation function, and a first max-pooling layer. The second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third ReLU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth ReLU activation function, and a second max-pooling layer. The third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth ReLU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth ReLU activation function, and a third max-pooling layer. The fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh ReLU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth ReLU activation function, and a fourth max-pooling layer.
Further, the operation of the training step comprises:
preprocessing the labeled sample texts — removing junk characters with regular expressions, performing word segmentation, and removing stop words — to obtain the word-level token set of each sample text; counting the length of each sample text and, combining the average length with experience, determining a uniform text length, truncating texts that are too long and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the text feature matrix, and dividing it into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, defining the loss function as the multi-class cross entropy, adapting the learning rate with the RMSProp optimizer, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain classification results for the test set, computing the prediction accuracy against the test-set labels, and repeatedly adjusting the hyperparameters and optimizing the preprocessing until the prediction accuracy of the improved textCNN classification model is optimal; the improved textCNN model at that point is the trained improved textCNN model.
Further, the operation of the text classification step comprises:
preprocessing the text to be classified — removing junk characters with regular expressions, performing word segmentation, and removing stop words — to obtain the word-level token set of each text;
loading the trained word-vector model file to obtain the word vectors, and representing the preprocessed text as a text feature vector matrix with the word vectors;
loading the trained improved textCNN model, inputting the text feature vectors into the model, and predicting to obtain the classification result of the text.
The invention also provides a text classification apparatus based on an improved textCNN model, the apparatus comprising:
a training unit, which trains an improved textCNN model with sample texts to obtain a trained improved textCNN model;
a text classification unit, which classifies texts to be classified with the trained improved textCNN model.
Further, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a ReLU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer. The first, second, third, and fourth convolution modules process in parallel: their inputs are all connected to the output of the word embedding layer, and their outputs are all connected to the input of the Concat layer. The output of the Concat layer is connected to the Dropout layer, the output of the Dropout layer is connected to the input of the fully connected layer with the ReLU activation function, and the output of that fully connected layer is connected to the input of the classifier layer.
Further, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first ReLU activation function, a second convolutional layer, a second batch normalization layer, a second ReLU activation function, and a first max-pooling layer. The second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third ReLU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth ReLU activation function, and a second max-pooling layer. The third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth ReLU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth ReLU activation function, and a third max-pooling layer. The fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh ReLU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth ReLU activation function, and a fourth max-pooling layer.
Further, the operation performed by the training unit comprises:
preprocessing the labeled sample texts — removing junk characters with regular expressions, performing word segmentation, and removing stop words — to obtain the word-level token set of each sample text; counting the length of each sample text and, combining the average length with experience, determining a uniform text length, truncating texts that are too long and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the text feature matrix, and dividing it into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, defining the loss function as the multi-class cross entropy, adapting the learning rate with the RMSProp optimizer, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain classification results for the test set, computing the prediction accuracy against the test-set labels, and repeatedly adjusting the hyperparameters and optimizing the preprocessing until the prediction accuracy of the improved textCNN classification model is optimal; the improved textCNN model at that point is the trained improved textCNN model.
Further, the operation performed by the text classification unit comprises:
preprocessing the text to be classified — removing junk characters with regular expressions, performing word segmentation, and removing stop words — to obtain the word-level token set of each text;
loading the trained word-vector model file to obtain the word vectors, and representing the preprocessed text as a text feature vector matrix with the word vectors;
loading the trained improved textCNN model, inputting the text feature vectors into the model, and predicting to obtain the classification result of the text.
The invention also provides a computer-readable storage medium on which computer program code is stored; when executed by a computer, the computer program code performs any of the methods described above.
The technical effects of the invention are as follows. By improving the traditional textCNN model, the invention obtains a text classification algorithm based on an improved textCNN model. Because the word embedding layer is pre-trained, the training time and computation of the training stage are greatly reduced; because the convolutional layers are deepened and batch normalization layers are added, classification accuracy is markedly improved, making the method better suited to scenarios with strong requirements on sample timeliness (i.e., the model must be updated frequently with new samples) and classification accuracy, such as text classification for internet public opinion analysis.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-restrictive embodiments, read with reference to the accompanying drawings.
Fig. 1 is a flowchart of a text classification method based on an improved textCNN model according to an embodiment of the present invention.
Fig. 2 is a structural diagram of the improved textCNN model according to an embodiment of the present invention.
Fig. 3 is a flowchart of training the improved textCNN model according to an embodiment of the present invention.
Fig. 4 is a flowchart of performing text classification with the improved textCNN model according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a text classification apparatus based on an improved textCNN model according to an embodiment of the present invention.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention and do not limit it. It should also be noted that, for convenience of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, where there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows a text classification method based on an improved textCNN model according to the invention, the method comprising:
a training step S101: training an improved textCNN model with sample texts to obtain a trained improved textCNN model;
a text classification step S102: classifying texts to be classified with the trained improved textCNN model.
The essential step of the invention is to construct the improved textCNN model, i.e., to obtain the improved textCNN model through training; the structure of the improved textCNN model is an important inventive point of the invention.
As shown in Fig. 2, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a ReLU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer. The first, second, third, and fourth convolution modules process in parallel: their inputs are all connected to the output of the word embedding layer, and their outputs are all connected to the input of the Concat layer. The output of the Concat layer is connected to the Dropout layer, the output of the Dropout layer is connected to the input of the fully connected layer with the ReLU activation function, and the output of that fully connected layer is connected to the input of the classifier layer.
In the present invention, the preprocessed text is input to the word embedding layer (Embedding) of the improved textCNN model. The weights of the Embedding layer are pre-trained on a large amount of unlabeled data with the skip-gram model in word2vec, and the weights of this layer are not updated when training the classification model. The number of parameters of this layer equals the dictionary size multiplied by the word-vector dimension (256), reaching the order of tens of millions; using the pre-trained word vectors therefore greatly reduces training time.
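The frozen, pre-trained embedding layer described above can be sketched as follows. This is a minimal PyTorch illustration, not the patent's implementation: the random matrix stands in for vectors actually trained with word2vec's skip-gram model, and the vocabulary size is an assumption.

```python
import torch
import torch.nn as nn

# Load pretrained 256-dimensional word vectors into an embedding layer and
# freeze it, so classifier training never updates this (dictionary_size x 256)
# parameter matrix. The random matrix below is a placeholder for real
# skip-gram vectors; vocab_size is an illustrative assumption.
vocab_size, embed_dim = 10000, 256
pretrained_vectors = torch.randn(vocab_size, embed_dim)

embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)

token_ids = torch.tensor([[3, 17, 42]])   # one preprocessed, indexed text
vectors = embedding(token_ids)            # shape: (1, 3, 256)
```

With `freeze=True`, the tens of millions of embedding parameters are excluded from gradient updates, which is what saves training time here.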
In the present invention, four modules — the first, second, third, and fourth convolution modules — process the input in parallel. The size of the one-dimensional convolution kernel differs from module to module so that local features of different spans are captured: in this method, for example, the four kernel sizes 4, 5, 6, and 7 are chosen, each capturing local features of the corresponding span. In particular, in this method a batch normalization layer (BatchNorm) follows each convolutional layer, standardizing the data to prevent the vanishing gradient problem and make the model converge faster, and a ReLU activation function follows each BatchNorm layer. Each convolution module contains two convolutional layers (and the corresponding BatchNorm and ReLU layers), deeper than the traditional textCNN model, which makes the model more expressive and improves prediction accuracy. Of course, those skilled in the art will readily appreciate that the network may be made deeper still, e.g., 5 or 6 convolutional layers, and that more than four convolution modules, e.g., 6 or 8, may process in parallel.
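One such convolution module can be sketched as follows: two (Conv1d → BatchNorm → ReLU) stages followed by max pooling. This is an illustrative PyTorch sketch; the channel count is an assumption, and `kernel_size` would take the values 4, 5, 6, and 7 across the four parallel modules.

```python
import torch
import torch.nn as nn

def conv_module(embed_dim=256, channels=128, kernel_size=4):
    """One convolution module: Conv1d -> BatchNorm -> ReLU, twice, then
    max pooling. channels=128 is an assumption, not from the patent."""
    return nn.Sequential(
        nn.Conv1d(embed_dim, channels, kernel_size),
        nn.BatchNorm1d(channels),
        nn.ReLU(),
        nn.Conv1d(channels, channels, kernel_size),
        nn.BatchNorm1d(channels),
        nn.ReLU(),
        nn.AdaptiveMaxPool1d(1),  # global max pooling over the sequence
    )

x = torch.randn(2, 256, 50)       # (batch, embedding_dim, sequence_length)
out = conv_module()(x)            # shape: (2, 128, 1)
```

Placing BatchNorm between each convolution and its ReLU matches the ordering the text describes and is what stabilizes the activations against vanishing gradients.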
As shown in Fig. 2, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first ReLU activation function, a second convolutional layer, a second batch normalization layer, a second ReLU activation function, and a first max-pooling layer. The second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third ReLU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth ReLU activation function, and a second max-pooling layer. The third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth ReLU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth ReLU activation function, and a third max-pooling layer. The fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh ReLU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth ReLU activation function, and a fourth max-pooling layer.
The outputs of the four convolution modules with different kernel sizes pass through max-pool layers, which compress the output dimension by down-sampling. The Concat layer splices the four outputs of the previous layer into a single one-dimensional vector, which then passes through the Dropout layer. (Dropout means that, during the training of a deep learning network, neural network units are temporarily dropped from the network with a certain probability. Note that the dropping is temporary: for stochastic gradient descent, because units are dropped at random, each mini-batch effectively trains a different network.) The dropout value is 0.5, which prevents overfitting. The fully connected layer (fc) with ReLU activation maps the one-dimensional vector to a 128-dimensional vector. The classifier layer then maps it to the category vector, whose components are the probabilities of belonging to each category; the classifier layer of the invention generally uses the softmax function.
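The overall wiring just described — frozen embedding, four parallel convolution modules with kernel sizes 4/5/6/7, Concat, Dropout(0.5), a 128-unit fully connected layer with ReLU, and a softmax classifier — can be sketched as one PyTorch module. The vocabulary size, channel count, and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedTextCNN(nn.Module):
    """Minimal sketch of the improved textCNN wiring described in the text.
    vocab_size, channels and num_classes are assumptions."""

    def __init__(self, vocab_size=10000, embed_dim=256,
                 channels=64, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.requires_grad = False  # pretrained, frozen

        def branch(k):  # one convolution module, kernel size k
            return nn.Sequential(
                nn.Conv1d(embed_dim, channels, k), nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.Conv1d(channels, channels, k), nn.BatchNorm1d(channels),
                nn.ReLU(), nn.AdaptiveMaxPool1d(1))

        self.branches = nn.ModuleList([branch(k) for k in (4, 5, 6, 7)])
        self.dropout = nn.Dropout(0.5)                 # dropout value 0.5
        self.fc = nn.Linear(4 * channels, 128)         # maps to 128 dims
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)  # (B, embed, seq)
        # Concat: splice the four pooled branch outputs into one vector.
        feats = torch.cat([b(x).squeeze(-1) for b in self.branches], dim=1)
        h = F.relu(self.fc(self.dropout(feats)))
        return F.softmax(self.classifier(h), dim=1)    # class probabilities

model = ImprovedTextCNN()
probs = model(torch.randint(0, 10000, (2, 40)))        # two 40-token texts
```

Each row of `probs` is a probability distribution over the categories, as produced by the softmax classifier layer.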
As shown in Fig. 3, the operation of the training step comprises:
preprocessing the labeled sample texts — removing junk characters with regular expressions, performing word segmentation, and removing stop words — to obtain the word-level token set of each sample text; counting the length of each sample text and, combining the average length with experience, determining a uniform text length, truncating texts that are too long and padding texts that are too short.
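The truncation-and-padding step above can be sketched as follows. This is a hypothetical illustration: the adjustment factor standing in for "experience" and the padding token are assumptions, not values from the patent.

```python
PAD = "<PAD>"  # hypothetical padding token

def choose_uniform_length(token_lists, factor=1.5):
    """Pick a uniform length from the average text length; `factor` is an
    assumed stand-in for the experience-based adjustment in the text."""
    average = sum(len(t) for t in token_lists) / len(token_lists)
    return int(average * factor)

def normalize_length(tokens, max_len):
    """Truncate long texts; pad short ones to exactly max_len tokens."""
    if len(tokens) >= max_len:
        return tokens[:max_len]                      # truncate
    return tokens + [PAD] * (max_len - len(tokens))  # pad

texts = [["price", "rise"], ["stock", "market", "news", "today"]]
uniform = choose_uniform_length(texts)               # avg 3 * 1.5 -> 4
normalized = [normalize_length(t, uniform) for t in texts]
```

After this step every sample has the same length, so the texts can be stacked into one feature matrix for the network.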
The word-vector training corpus is preprocessed by word segmentation and stop-word removal, and trained with the skip-gram model in word2vec to obtain trained word vectors of dimension 256.
The preprocessed sample texts are combined with the trained word vectors to obtain the text feature matrix, which is divided into a training set and a test set in a certain proportion.
The training set is input into the improved textCNN model with initial weights, the loss function is defined as the multi-class cross entropy, the learning rate is adapted with the RMSProp optimizer, and training yields the trained improved textCNN model. The principle of the RMSprop optimizer is similar to gradient descent with momentum: RMSprop limits the oscillation in the vertical direction, so the algorithm can take larger steps in the horizontal direction and converge faster.
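One training step with the described loss and optimizer can be sketched as follows. This is a minimal PyTorch illustration in which a small linear layer stands in for the improved textCNN, and the features and labels are random placeholders.

```python
import torch
import torch.nn as nn

# Stand-in classifier; in practice this would be the improved textCNN model.
model = nn.Linear(256, 5)
criterion = nn.CrossEntropyLoss()   # multi-class cross entropy
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

features = torch.randn(8, 256)      # a mini-batch of 8 vectorised texts
labels = torch.randint(0, 5, (8,))  # placeholder class labels

optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()                    # RMSProp adapts the effective step size
```

Repeating this step over mini-batches of the training set yields the trained model; RMSProp's per-parameter scaling is what damps the vertical oscillation mentioned above.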
The test set is input into the trained improved textCNN model to obtain classification results for the test set; the prediction accuracy is computed against the test-set labels, and the hyperparameters and preprocessing are repeatedly adjusted until the prediction accuracy of the improved textCNN classification model is optimal. The improved textCNN model at that point is the trained improved textCNN model.
As shown in Fig. 4, the operation of the text classification step comprises:
preprocessing the text to be classified — removing junk characters with regular expressions, performing word segmentation, and removing stop words — to obtain the word-level token set of each text. This step is identical to the preprocessing of the labeled sample texts during model training.
The trained word-vector model file is loaded to obtain the word vectors, and the preprocessed text is represented as a text feature vector matrix with the word vectors.
The trained improved textCNN model is loaded, the text feature vectors are input into it, and prediction yields the classification result of the text.
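The prediction step can be sketched as follows: run the loaded model in evaluation mode over the text feature vector and take the most probable class. A placeholder linear-plus-softmax model stands in here for the trained improved textCNN.

```python
import torch

# Placeholder for a loaded, trained classifier (an assumption for the sketch).
model = torch.nn.Sequential(torch.nn.Linear(256, 5), torch.nn.Softmax(dim=1))
model.eval()                                # inference mode: dropout disabled

feature_vector = torch.randn(1, 256)        # one vectorised text
with torch.no_grad():                       # no gradients needed at inference
    probs = model(feature_vector)           # per-class probabilities
predicted_class = int(probs.argmax(dim=1))  # index of the predicted category
```

Switching to `eval()` matters because the Dropout layer must be inactive at prediction time.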
With further reference to Fig. 5, as an implementation of the method shown in Fig. 1, the present application provides an embodiment of a text classification apparatus based on an improved textCNN model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 1, and the apparatus may specifically be included in various electronic devices.
Fig. 5 shows a text classification apparatus based on an improved textCNN model according to the invention, the apparatus comprising:
a training unit 501, which trains an improved textCNN model with sample texts to obtain a trained improved textCNN model;
a text classification unit 502, which classifies texts to be classified with the trained improved textCNN model.
The essential step of the invention is to construct the improved textCNN model, i.e., to obtain the improved textCNN model through training; the structure of the improved textCNN model is an important inventive point of the invention.
As shown in Fig. 2, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a ReLU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer. The first, second, third, and fourth convolution modules process in parallel: their inputs are all connected to the output of the word embedding layer, and their outputs are all connected to the input of the Concat layer. The output of the Concat layer is connected to the Dropout layer, the output of the Dropout layer is connected to the input of the fully connected layer with the ReLU activation function, and the output of that fully connected layer is connected to the input of the classifier layer.
In the present invention, the preprocessed text is input to the word embedding layer (Embedding) of the improved textCNN model. The weights of the Embedding layer are pre-trained on a large amount of unlabeled data with the skip-gram model in word2vec, and the weights of this layer are not updated when training the classification model. The number of parameters of this layer equals the dictionary size multiplied by the word-vector dimension (256), reaching the order of tens of millions; using the pre-trained word vectors therefore greatly reduces training time.
In the present invention, four modules — the first, second, third, and fourth convolution modules — process the input in parallel. The size of the one-dimensional convolution kernel differs from module to module so that local features of different spans are captured: in this apparatus, for example, the four kernel sizes 4, 5, 6, and 7 are chosen, each capturing local features of the corresponding span. In particular, in this apparatus a batch normalization layer (BatchNorm) follows each convolutional layer, standardizing the data to prevent the vanishing gradient problem and make the model converge faster, and a ReLU activation function follows each BatchNorm layer. Each convolution module contains two convolutional layers (and the corresponding BatchNorm and ReLU layers), deeper than the traditional textCNN model, which makes the model more expressive and improves prediction accuracy. Of course, those skilled in the art will readily appreciate that the network may be made deeper still, e.g., 5 or 6 convolutional layers, and that more than four convolution modules, e.g., 6 or 8, may process in parallel.
As shown in Fig. 2, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first ReLU activation function, a second convolutional layer, a second batch normalization layer, a second ReLU activation function, and a first max-pooling layer. The second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third ReLU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth ReLU activation function, and a second max-pooling layer. The third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth ReLU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth ReLU activation function, and a third max-pooling layer. The fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh ReLU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth ReLU activation function, and a fourth max-pooling layer.
The outputs of the four convolution modules with different kernel sizes pass through max-pool layers, which compress the output dimension by down-sampling. The Concat layer splices the four outputs of the previous layer into a single one-dimensional vector, which then passes through the Dropout layer. (Dropout means that, during the training of a deep learning network, neural network units are temporarily dropped from the network with a certain probability. Note that the dropping is temporary: for stochastic gradient descent, because units are dropped at random, each mini-batch effectively trains a different network.) The dropout value is 0.5, which prevents overfitting. The fully connected layer (fc) with ReLU activation maps the one-dimensional vector to a 128-dimensional vector. The classifier layer then maps it to the category vector, whose components are the probabilities of belonging to each category; the classifier layer of the invention generally uses the softmax function.
As shown in figure 3, the operation that the training unit executes includes:
The sample text marked is pre-processed, removes garbage character in conjunction with regular expression, segment, remove stop words
The set of the word level-one of each sample text is obtained, the length of every text in statistical sample text, in conjunction with average length and warp
The uniform length for determining a text is tested, too long text is truncated, polishing is done for too short text.
Term vector training corpus is segmented, stop words is gone to pre-process, and with the skip-gram mould in word2vec
Type training obtains the term vector that trained dimension is 256.
By the sample text pre-processed in conjunction with trained term vector, obtain the eigenmatrix of text, and by its
It is divided into training set and test set according to a certain percentage.
The training set is input into the improvement textCNN model with initial weights; the loss function is defined as the multi-class cross entropy, and the learning rate is adjusted adaptively with the RMSProp optimizer. Training yields the trained improvement textCNN model. The principle of the RMSProp optimizer is similar to that of momentum gradient descent: RMSProp limits the oscillation in the vertical direction, which lets the algorithm take larger steps in the horizontal direction and converge more quickly.
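A minimal sketch of the two ingredients named above (the multi-class cross-entropy loss and an RMSProp update step) is given below; the hyper-parameters lr and rho are common defaults, not values from the invention.

```python
import numpy as np

def categorical_cross_entropy(probs, onehot, eps=1e-12):
    # multi-class cross-entropy loss over predicted class probabilities
    return -np.sum(onehot * np.log(probs + eps))

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    # keep a moving average of squared gradients, then scale the step by it:
    # large oscillating gradients are damped, so a larger base rate is safe
    cache = rho * cache + (1 - rho) * grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```

Repeatedly applying the update to a simple quadratic shows the weight moving toward the minimum in small, stable steps.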
The test set is input into the trained improvement textCNN model to obtain its classification results, which are compared against the test-set labels to compute the prediction accuracy. By repeatedly adjusting the hyper-parameters and optimizing the preprocessing, the prediction accuracy of the improvement textCNN classification model is brought to its optimum; the improvement textCNN model at that point is the trained improvement textCNN model.
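The evaluation-and-tuning loop above can be reduced to two helpers: an accuracy score against the test-set labels, and a selection over candidate hyper-parameter configurations. Both are illustrative sketches; the scoring function would in practice train and evaluate the model for each configuration.

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    # fraction of test-set predictions that match the labels
    return float((np.asarray(pred_labels) == np.asarray(true_labels)).mean())

def select_best(configs, evaluate):
    # repeatedly adjust hyper-parameters, keeping the configuration whose
    # trained model scores highest on the test set
    return max(configs, key=evaluate)
```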
As shown in figure 4, the operations performed by the text classification unit include:
Pre-processing the text to be classified: garbage characters are removed with regular expressions, and the text is segmented and stripped of stop words to obtain its word-level set. This step is identical to the preprocessing of the labeled sample texts during model training.
The trained word-vector model file is loaded to obtain the word vectors, with which the pre-processed text is represented as a text feature-vector matrix.
The trained improvement textCNN model is loaded, the text feature vectors are input into it, and prediction yields the classification result of the text.
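The inference path (preprocess, embed, predict, take the most probable class) can be sketched as follows. The stub model stands in for the trained improvement textCNN model and is purely illustrative, as are the lower-case whitespace tokenizer and the three-class output.

```python
import numpy as np

rng = np.random.default_rng(2)

def preprocess(text):
    # same cleaning/segmentation/stop-word steps used at training time
    return text.lower().split()

def embed(tokens, vectors, dim=256):
    # look up each token's trained word vector; zeros for unknown words
    unk = np.zeros(dim)
    if not tokens:
        return np.zeros((1, dim))
    return np.stack([vectors.get(t, unk) for t in tokens])

def classify(text, vectors, model):
    feats = embed(preprocess(text), vectors)
    scores = model(feats)                  # trained model stands in here
    return int(np.argmax(scores))          # class with the highest probability

# stub model for illustration: mean-pool then a fixed linear layer
W = rng.standard_normal((256, 3))
model = lambda feats: feats.mean(axis=0) @ W
```

With an empty vocabulary every token embeds to the zero vector, so the stub's scores are all zero and argmax returns class 0; a trained model would of course produce informative scores.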
The present invention achieves the following technical effects. By improving the traditional textCNN model, a text classification algorithm based on the improvement textCNN model is obtained. Because it adds an embedding layer, batch normalization layers and RELU activation functions, the training time and amount of computation are reduced while the classification accuracy is greatly improved, which makes it better suited to scenarios with high requirements on sample timeliness (i.e. the model must be updated frequently with new samples) and on classification accuracy, such as text classification for internet public opinion. Moreover, after text preprocessing, the trained word-vector model file is loaded to obtain word vectors, and the pre-processed text is represented with them as a text feature-vector matrix for subsequent processing, which improves both classification speed and accuracy.
For convenience of description, the apparatus above is described as units divided by function. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the description of the embodiments above, those skilled in the art can clearly understand that the present application can be realized by means of software plus the necessary general hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
It should be noted finally that the above embodiments only illustrate, and do not limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the invention can still be modified or equivalently substituted without departing from its spirit and scope, and that any such modification or equivalent substitution shall fall within the scope of the claims of the present invention.
Claims (11)
1. A text classification method based on an improvement textCNN model, characterized in that the method comprises: a training step, in which the improvement textCNN model is trained with sample texts to obtain the trained improvement textCNN model; and a text classification step, in which the text to be classified is classified with the trained improvement textCNN model.
2. The method according to claim 1, characterized in that the improvement textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with RELU activation function, and a classifier layer; the input of the word embedding layer is connected with the output of the input layer; the first, second, third and fourth convolution modules process in parallel, and their inputs are all connected with the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected with the input of the Concat layer; the output of the Concat layer is connected with the Dropout layer; the output of the Dropout layer is connected with the input of the fully connected layer with RELU activation function; and the output of the fully connected layer with RELU activation function is connected with the input of the classifier layer.
3. The method according to claim 2, characterized in that the step lengths of the convolution kernels in the first, second, third and fourth convolution modules are all different, so as to capture local features of different step lengths respectively; the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function and a fourth max pooling layer.
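The convolution module of claim 3 (convolution, batch normalization, RELU, applied twice, then max pooling) and the four parallel modules of claim 2 can be sketched in NumPy as below. Reading "step length" as the kernel window size (as in standard textCNN), the window sizes 2, 3, 4 and 5, the 64-channel width and the 50-token input are assumptions for illustration; batch normalization is shown without its learned scale and shift.

```python
import numpy as np

rng = np.random.default_rng(3)

def conv1d(x, kernels):            # x: (seq, in_ch), kernels: (k, in_ch, out_ch)
    k = kernels.shape[0]
    return np.stack([np.tensordot(x[i:i + k], kernels, axes=([0, 1], [0, 1]))
                     for i in range(x.shape[0] - k + 1)])   # (seq-k+1, out_ch)

def batch_norm(x, eps=1e-5):
    # normalise each channel to zero mean / unit variance (scale+shift omitted)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(0.0, x)

def conv_module(x, k, ch=64):
    # conv -> batch normalization -> RELU, twice, then max pooling (claim 3)
    for shape in [(k, x.shape[1], ch), (k, ch, ch)]:
        x = relu(batch_norm(conv1d(x, rng.standard_normal(shape) * 0.1)))
    return x.max(axis=0)           # (ch,)

emb = rng.standard_normal((50, 256))                  # word-embedding output
merged = np.concatenate([conv_module(emb, k)          # four parallel modules,
                         for k in (2, 3, 4, 5)])      # different window sizes
```

Each module's max-pooled output is non-negative (it follows a RELU), and concatenating the four 64-channel branches gives the 256-dimensional vector passed on to the Concat layer's successors.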
4. The method according to claim 1, characterized in that the operation of the training step comprises: pre-processing the labeled sample texts, removing garbage characters with regular expressions, segmenting and removing stop words to obtain the word-level set of each sample text, counting the length of every text in the sample texts, determining a uniform text length from the average length and experience, truncating over-long texts and padding over-short texts; segmenting the word-vector training corpus and removing stop words, and training with the skip-gram model in word2vec to obtain word vectors of dimension 256; combining the pre-processed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the data into a training set and a test set according to a certain proportion; inputting the training set into the improvement textCNN model with initial weights, defining the loss function as the multi-class cross entropy, adjusting the learning rate adaptively with the RMSProp optimizer, and training to obtain the trained improvement textCNN model; and inputting the test set into the trained improvement textCNN model to obtain the classification results of the test set, comparing them against the test-set labels to compute the prediction accuracy, and repeatedly adjusting the hyper-parameters and optimizing the preprocessing until the prediction accuracy of the improvement textCNN classification model is optimal, the improvement textCNN model at that point being the trained improvement textCNN model.
5. The method according to claim 4, characterized in that the operation of the text classification step comprises: pre-processing the text to be classified, removing garbage characters with regular expressions, segmenting and removing stop words to obtain the word-level set of each text; loading the trained word-vector model file to obtain word vectors, and representing the pre-processed text as a text feature-vector matrix with the word vectors; and loading the trained improvement textCNN model, inputting the text feature vectors into the improvement textCNN model, and predicting to obtain the classification result of the text.
6. A text classification device based on an improvement textCNN model, characterized in that the device comprises: a training unit, which trains the improvement textCNN model with sample texts to obtain the trained improvement textCNN model; and a text classification unit, which classifies the text to be classified with the trained improvement textCNN model.
7. The device according to claim 6, characterized in that the improvement textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with RELU activation function, and a classifier layer; the input of the word embedding layer is connected with the output of the input layer; the first, second, third and fourth convolution modules process in parallel, and their inputs are all connected with the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected with the input of the Concat layer; the output of the Concat layer is connected with the Dropout layer; the output of the Dropout layer is connected with the input of the fully connected layer with RELU activation function; and the output of the fully connected layer with RELU activation function is connected with the input of the classifier layer.
8. The device according to claim 7, characterized in that the step lengths of the convolution kernels in the first, second, third and fourth convolution modules are all different, so as to capture local features of different step lengths respectively; the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function and a fourth max pooling layer.
9. The device according to claim 6, characterized in that the operations performed by the training unit comprise: pre-processing the labeled sample texts, removing garbage characters with regular expressions, segmenting and removing stop words to obtain the word-level set of each sample text, counting the length of every text in the sample texts, determining a uniform text length from the average length and experience, truncating over-long texts and padding over-short texts; segmenting the word-vector training corpus and removing stop words, and training with the skip-gram model in word2vec to obtain word vectors of dimension 256; combining the pre-processed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the data into a training set and a test set according to a certain proportion; inputting the training set into the improvement textCNN model with initial weights, defining the loss function as the multi-class cross entropy, adjusting the learning rate adaptively with the RMSProp optimizer, and training to obtain the trained improvement textCNN model; and inputting the test set into the trained improvement textCNN model to obtain the classification results of the test set, comparing them against the test-set labels to compute the prediction accuracy, and repeatedly adjusting the hyper-parameters and optimizing the preprocessing until the prediction accuracy of the improvement textCNN classification model is optimal, the improvement textCNN model at that point being the trained improvement textCNN model.
10. The device according to claim 9, characterized in that the operations performed by the text classification unit comprise: pre-processing the text to be classified, removing garbage characters with regular expressions, segmenting and removing stop words to obtain the word-level set of each text; loading the trained word-vector model file to obtain word vectors, and representing the pre-processed text as a text feature-vector matrix with the word vectors; and loading the trained improvement textCNN model, inputting the text feature vectors into the improvement textCNN model, and predicting to obtain the classification result of the text.
11. A computer-readable storage medium, characterized in that computer program code is stored on the storage medium, and when the computer program code is executed by a computer, the method of any one of claims 1-5 is performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811572759.1A CN109918497A (en) | 2018-12-21 | 2018-12-21 | A kind of file classification method, device and storage medium based on improvement textCNN model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811572759.1A CN109918497A (en) | 2018-12-21 | 2018-12-21 | A kind of file classification method, device and storage medium based on improvement textCNN model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109918497A true CN109918497A (en) | 2019-06-21 |
Family
ID=66959953
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811572759.1A Pending CN109918497A (en) | 2018-12-21 | 2018-12-21 | A kind of file classification method, device and storage medium based on improvement textCNN model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109918497A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110543629A (en) * | 2019-08-01 | 2019-12-06 | 淮阴工学院 | chemical equipment text classification method based on W-ATT-CNN algorithm |
CN110717039A (en) * | 2019-09-17 | 2020-01-21 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN111143551A (en) * | 2019-12-04 | 2020-05-12 | 支付宝(杭州)信息技术有限公司 | Text preprocessing method, classification method, device and equipment |
CN111930938A (en) * | 2020-07-06 | 2020-11-13 | 武汉卓尔数字传媒科技有限公司 | Text classification method and device, electronic equipment and storage medium |
CN112242185A (en) * | 2020-09-09 | 2021-01-19 | 山东大学 | Medical image report automatic generation method and system based on deep learning |
CN112270615A (en) * | 2020-10-26 | 2021-01-26 | 西安邮电大学 | Intelligent decomposition method for manufacturing BOM (Bill of Material) by complex equipment based on semantic calculation |
WO2021051586A1 (en) * | 2019-09-18 | 2021-03-25 | 平安科技(深圳)有限公司 | Interview answer text classification method, device, electronic apparatus and storage medium |
CN114207605A (en) * | 2019-10-31 | 2022-03-18 | 深圳市欢太科技有限公司 | Text classification method and device, electronic equipment and storage medium |
CN114416213A (en) * | 2022-03-29 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Word vector file loading method and device and storage medium |
CN114564942A (en) * | 2021-09-06 | 2022-05-31 | 北京数美时代科技有限公司 | Text error correction method, storage medium and device for supervision field |
CN115936094A (en) * | 2022-12-27 | 2023-04-07 | 北京百度网讯科技有限公司 | Training method and device of text processing model, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1310825A (en) * | 1998-06-23 | 2001-08-29 | 微软公司 | Methods and apparatus for classifying text and for building a text classifier |
CN108108351A (en) * | 2017-12-05 | 2018-06-01 | 华南理工大学 | A kind of text sentiment classification method based on deep learning built-up pattern |
CN108399230A (en) * | 2018-02-13 | 2018-08-14 | 上海大学 | A kind of Chinese financial and economic news file classification method based on convolutional neural networks |
AU2018101513A4 (en) * | 2018-10-11 | 2018-11-15 | Hui, Bo Mr | Comprehensive Stock Prediction GRU Model: Emotional Index and Volatility Based |
- 2018-12-21: CN CN201811572759.1A patent/CN109918497A/en, status: active, Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1310825A (en) * | 1998-06-23 | 2001-08-29 | 微软公司 | Methods and apparatus for classifying text and for building a text classifier |
CN108108351A (en) * | 2017-12-05 | 2018-06-01 | 华南理工大学 | A kind of text sentiment classification method based on deep learning built-up pattern |
CN108399230A (en) * | 2018-02-13 | 2018-08-14 | 上海大学 | A kind of Chinese financial and economic news file classification method based on convolutional neural networks |
AU2018101513A4 (en) * | 2018-10-11 | 2018-11-15 | Hui, Bo Mr | Comprehensive Stock Prediction GRU Model: Emotional Index and Volatility Based |
Non-Patent Citations (3)
Title |
---|
小简铺子: "Implementation of a Convolutional Neural Network (TextCNN) for Sentence Classification" *
流川枫AI: "I Love NLP (4): Chinese Text Classification in Practice Based on the Text-CNN Model", Jianshu *
谷宇: "Multimodal 3D Convolutional Neural Network Method for Brain Glioma Segmentation", Science Technology and Engineering *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110543629A (en) * | 2019-08-01 | 2019-12-06 | 淮阴工学院 | chemical equipment text classification method based on W-ATT-CNN algorithm |
CN110717039A (en) * | 2019-09-17 | 2020-01-21 | 平安科技(深圳)有限公司 | Text classification method and device, electronic equipment and computer-readable storage medium |
CN110717039B (en) * | 2019-09-17 | 2023-10-13 | 平安科技(深圳)有限公司 | Text classification method and apparatus, electronic device, and computer-readable storage medium |
WO2021051586A1 (en) * | 2019-09-18 | 2021-03-25 | 平安科技(深圳)有限公司 | Interview answer text classification method, device, electronic apparatus and storage medium |
CN114207605A (en) * | 2019-10-31 | 2022-03-18 | 深圳市欢太科技有限公司 | Text classification method and device, electronic equipment and storage medium |
CN111143551A (en) * | 2019-12-04 | 2020-05-12 | 支付宝(杭州)信息技术有限公司 | Text preprocessing method, classification method, device and equipment |
CN111930938A (en) * | 2020-07-06 | 2020-11-13 | 武汉卓尔数字传媒科技有限公司 | Text classification method and device, electronic equipment and storage medium |
CN112242185A (en) * | 2020-09-09 | 2021-01-19 | 山东大学 | Medical image report automatic generation method and system based on deep learning |
CN112270615A (en) * | 2020-10-26 | 2021-01-26 | 西安邮电大学 | Intelligent decomposition method for manufacturing BOM (Bill of Material) by complex equipment based on semantic calculation |
CN114564942A (en) * | 2021-09-06 | 2022-05-31 | 北京数美时代科技有限公司 | Text error correction method, storage medium and device for supervision field |
CN114416213A (en) * | 2022-03-29 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Word vector file loading method and device and storage medium |
CN115936094A (en) * | 2022-12-27 | 2023-04-07 | 北京百度网讯科技有限公司 | Training method and device of text processing model, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109918497A (en) | A kind of file classification method, device and storage medium based on improvement textCNN model | |
CN108334605B (en) | Text classification method and device, computer equipment and storage medium | |
CN108170736B (en) | Document rapid scanning qualitative method based on cyclic attention mechanism | |
CN111310476B (en) | Public opinion monitoring method and system using aspect-based emotion analysis method | |
CN110413786B (en) | Data processing method based on webpage text classification, intelligent terminal and storage medium | |
CN110502753A (en) | A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement | |
CN106528528A (en) | A text emotion analysis method and device | |
CN110188047A (en) | A kind of repeated defects report detection method based on binary channels convolutional neural networks | |
CN112100377B (en) | Text classification method, apparatus, computer device and storage medium | |
CN111858878B (en) | Method, system and storage medium for automatically extracting answer from natural language text | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN112215696A (en) | Personal credit evaluation and interpretation method, device, equipment and storage medium based on time sequence attribution analysis | |
CN112507114A (en) | Multi-input LSTM-CNN text classification method and system based on word attention mechanism | |
CN112925904A (en) | Lightweight text classification method based on Tucker decomposition | |
Nguyen et al. | An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis | |
CN110019796A (en) | A kind of user version information analysis method and device | |
WO2021159099A1 (en) | Searching for normalization-activation layer architectures | |
Sajeevan et al. | An enhanced approach for movie review analysis using deep learning techniques | |
Swami et al. | Resume classifier and summarizer | |
Ram et al. | Supervised sentiment classification with cnns for diverse se datasets | |
CN116186506A (en) | Automatic identification method for accessibility problem report based on BERT pre-training model | |
CN116089886A (en) | Information processing method, device, equipment and storage medium | |
CN113806538B (en) | Label extraction model training method, device, equipment and storage medium | |
US20230063686A1 (en) | Fine-grained stochastic neural architecture search | |
CN114049522A (en) | Garbage classification system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190621
RJ01 | Rejection of invention patent application after publication |