CN109918497A - Text classification method, apparatus and storage medium based on an improved textCNN model - Google Patents

Text classification method, apparatus and storage medium based on an improved textCNN model

Info

Publication number
CN109918497A
CN109918497A
Authority
CN
China
Prior art keywords
text
layer
convolution
improvement
textcnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811572759.1A
Other languages
Chinese (zh)
Inventor
马涛
栾江霞
章正道
俞碧洪
徐晓文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd filed Critical Xiamen Meiya Pico Information Co Ltd
Priority to CN201811572759.1A
Publication of CN109918497A
Legal status: Pending (current)


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a text classification method, apparatus and storage medium based on an improved textCNN model. The method comprises: a training step, in which the improved textCNN model is trained using sample texts to obtain a trained improved textCNN model; and a text classification step, in which texts to be classified are classified using the trained improved textCNN model. By improving the traditional textCNN model, the present invention obtains a text classification algorithm based on an improved textCNN model. Because the word embedding layer is pre-trained, the training time and the amount of computation in the training stage are greatly reduced; because the convolutional layers are deepened and batch normalization layers are added, the classification accuracy is considerably improved, making the method particularly suitable for scenarios with demanding requirements on sample timeliness (i.e., the model must be updated frequently according to new samples) and on classification accuracy, such as text classification for Internet public opinion analysis.

Description

Text classification method, apparatus and storage medium based on an improved textCNN model
Technical field
The present invention relates to the technical field of data processing, and in particular to a text classification method, apparatus and storage medium based on an improved textCNN model.
Background art
The explosive growth of network data places ever higher demands on data analysis. Text analysis and mining is a widely used technology that extracts the semantic content of text with appropriate techniques and methods and then performs a series of operations on the text, such as classification and clustering; it is mainly applied in fields such as product recommendation, public opinion analysis, and text search.
In public opinion analysis, public opinion on the network needs to be organized and analyzed under different topics. For example, collected texts are classified so that texts of interest to users are identified automatically and uninteresting junk texts are filtered out. Automatic classification of collected texts is therefore a relatively important step in public opinion analysis.
Text classification algorithms based on the traditional vector space model cannot model the temporal order of words, nor can they model the semantics between different words, so the resulting classification performance is unsatisfactory. Text classification algorithms based on deep learning do not require cumbersome feature engineering and can model word order and semantics well; their classification performance far exceeds that of vector space models, so deep-learning-based text classification algorithms have become mainstream. In the field of public opinion analysis, however, both the categories and the samples are highly time-sensitive: categories change frequently with public opinion demands, and new public opinion hotspots continually generate new samples, so the model must be updated and iterated frequently. RNN-based text classification algorithms involve an enormous amount of computation, which makes training and prediction slow, and frequently updating and iterating such a model in a public opinion scenario wastes a great deal of computing resources.
Summary of the invention
In view of the above defects in the prior art, the present invention proposes the following technical solutions.
A text classification method based on an improved textCNN model, the method comprising:
a training step: training the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification step: classifying texts to be classified using the trained improved textCNN model.
Further, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
Further, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
Further, the training step comprises:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
Further, the text classification step comprises:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
The present invention also provides a text classification apparatus based on an improved textCNN model, the apparatus comprising:
a training unit, which trains the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification unit, which classifies texts to be classified using the trained improved textCNN model.
Further, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
Further, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
Further, the operations performed by the training unit comprise:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
Further, the operations performed by the text classification unit comprise:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
The present invention also provides a computer-readable storage medium on which computer program code is stored; when the computer program code is executed by a computer, any of the above methods is performed.
Technical effects of the invention: by improving the traditional textCNN model, the present invention obtains a text classification algorithm based on an improved textCNN model. Because the word embedding layer is pre-trained, the training time and the amount of computation in the training stage are greatly reduced; because the convolutional layers are deepened and batch normalization layers are added, the classification accuracy is considerably improved, making the method more suitable for scenarios with demanding requirements on sample timeliness (i.e., the model must be updated frequently according to new samples) and on classification accuracy, such as text classification for Internet public opinion.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent from the following detailed description of non-restrictive embodiments, read in conjunction with the accompanying drawings.
Fig. 1 is a flow chart of a text classification method based on an improved textCNN model according to an embodiment of the present invention.
Fig. 2 is a structural diagram of the improved textCNN model according to an embodiment of the present invention.
Fig. 3 is a flow chart of training the improved textCNN model according to an embodiment of the present invention.
Fig. 4 is a flow chart of performing text classification with the improved textCNN model according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a text classification apparatus based on an improved textCNN model according to an embodiment of the present invention.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention, not to limit it. It should also be noted that, for ease of description, only the parts relevant to the related invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features therein may be combined with one another. The present application is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows a text classification method based on an improved textCNN model of the present invention, the method comprising:
a training step S101: training the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification step S102: classifying texts to be classified using the trained improved textCNN model.
An essential step of the present invention is constructing the improved textCNN model, i.e., obtaining the improved textCNN model through training; the structure of the improved textCNN model is an important inventive point of the present invention.
As shown in Fig. 2, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
In the present invention, the preprocessed text is input to the word embedding layer (Embedding) of the improved textCNN model. The weights of the Embedding layer are pre-trained on a large amount of unlabeled data with the skip-gram model in word2vec, and the weights of this layer are not updated when training the classification model. The number of parameters in this layer is the dictionary size multiplied by the word vector dimension (256), up to tens of millions of parameters, so using pre-trained word vectors greatly reduces the training time.
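By way of illustration, a minimal sketch of this pre-training step follows. The patent specifies word2vec's skip-gram model and a word vector dimension of 256 but names no library; gensim (version 4.x) is assumed here, and the corpus, vocabulary and file names are illustrative, not taken from the patent.

```python
import numpy as np
from gensim.models import Word2Vec

# corpus: list of token lists produced by the preprocessing step (placeholder data)
corpus = [["网络", "舆情", "分析"], ["文本", "分类", "模型"]]

# sg=1 selects the skip-gram model; vector_size=256 matches the patent
w2v = Word2Vec(sentences=corpus, vector_size=256, sg=1,
               window=5, min_count=1, workers=4)
w2v.save("word2vec_256.model")

# Build an embedding matrix indexed by a word-to-id dictionary; row 0 is
# reserved for padding/unknown words. The embedding layer that consumes this
# matrix is frozen, so its parameters are never updated while training the
# classifier.
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
embedding_matrix = np.zeros((len(vocab) + 1, 256), dtype="float32")
for word, idx in vocab.items():
    embedding_matrix[idx] = w2v.wv[word]
```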
In the present invention, four modules — the first, second, third and fourth convolution modules — process the input in parallel. The size of the convolution kernel differs in each module so that each module captures local features of a different span; one-dimensional convolution kernels of different sizes are used, and this method chooses the four sizes 4, 5, 6 and 7 to capture local features of the corresponding spans. In particular, this method attaches a batch normalization layer (BatchNorm) after each convolutional layer to standardize the data, which prevents the vanishing gradient problem and makes the model converge faster. A RELU activation function is used after each BatchNorm layer. In each convolution module the convolutional layers are stacked twice, deeper than the traditional textCNN model, which gives the model stronger expressive power and thus improves prediction accuracy. Of course, those skilled in the art will readily see that the network can be made deeper, for example 5 or 6 layers, and that more convolution modules can run in parallel, not limited to four, for example 6 or 8.
As shown in Fig. 2, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
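For concreteness, the following sketch expresses one such convolution module: two stacked Conv-BatchNorm-RELU blocks followed by max pooling. The patent names no deep learning framework; Keras/TensorFlow is assumed here, and the filter count of 128 is an illustrative choice not specified in the patent.

```python
from tensorflow.keras import layers

def conv_module(x, kernel_size, filters=128):
    """One convolution module: (Conv1D -> BatchNorm -> RELU) twice, then max pooling."""
    for _ in range(2):
        x = layers.Conv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)   # standardizes activations, counters vanishing gradients
        x = layers.Activation("relu")(x)
    # global max pooling stands in for the module's max pooling layer,
    # down-sampling each feature map to a single value
    return layers.GlobalMaxPooling1D()(x)
```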
The outputs of the four convolution modules with different kernel sizes each pass through a max-pool layer, which compresses the output dimension by down-sampling. The Concat layer splices the four outputs of the previous layer into a single one-dimensional vector, which then passes through a Dropout layer. (Dropout means that, during the training of a deep learning network, neural network units are temporarily dropped from the network with a certain probability. Note that this is temporary: under stochastic gradient descent, because units are dropped at random, each mini-batch trains a different network.) The dropout value is 0.5, which prevents overfitting. The fully connected layer (fc) with RELU activation maps the one-dimensional vector to a 128-dimensional vector. The classifier layer maps this vector to a category vector whose entries are the probabilities of belonging to each class; the classifier layer of the present invention generally uses the softmax function.
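Putting the pieces together, a minimal sketch of the full improved textCNN of Fig. 2 might look as follows, reusing conv_module from the sketch above; the sequence length and number of classes are illustrative, and embedding_matrix is the frozen pre-trained matrix built earlier.

```python
from tensorflow.keras import Model, layers

def build_improved_textcnn(embedding_matrix, seq_len=200, num_classes=10):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    # pre-trained word embedding layer, frozen during classifier training
    x = layers.Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                         weights=[embedding_matrix], trainable=False)(inputs)
    # four parallel convolution modules with kernel sizes 4, 5, 6 and 7
    branches = [conv_module(x, k) for k in (4, 5, 6, 7)]
    x = layers.Concatenate()(branches)            # Concat layer
    x = layers.Dropout(0.5)(x)                    # dropout value of 0.5 against overfitting
    x = layers.Dense(128, activation="relu")(x)   # fully connected layer with RELU, 128 dims
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # classifier layer
    return Model(inputs, outputs)
```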
As shown in Fig. 3, the training step comprises:
Preprocessing the labeled sample texts: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each sample text. The length of each text in the sample texts is counted, a uniform text length is determined based on the average length and experience, texts that are too long are truncated, and texts that are too short are padded.
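A minimal sketch of this preprocessing, assuming jieba for Chinese word segmentation (the patent names no segmenter) and using an illustrative stop-word list and uniform length:

```python
import re
import jieba

STOP_WORDS = {"的", "了", "是"}   # placeholder stop-word list
UNIFORM_LEN = 200                # illustrative uniform length from average length + experience

def preprocess(text, pad_token="<PAD>"):
    # remove junk characters: keep CJK characters, letters and digits only
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", " ", text)
    # word segmentation and stop-word removal
    words = [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]
    words = words[:UNIFORM_LEN]                           # truncate texts that are too long
    words += [pad_token] * (UNIFORM_LEN - len(words))     # pad texts that are too short
    return words
```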
The word-vector training corpus is preprocessed by word segmentation and stop-word removal, and training with the skip-gram model in word2vec yields trained word vectors of dimension 256.
The preprocessed sample texts are combined with the trained word vectors to obtain the feature matrix of each text, and the feature matrices are divided into a training set and a test set in a certain proportion.
The training set is input into the improved textCNN model with initial weights; categorical cross entropy is used as the loss function, the RMSProp optimizer adaptively changes the learning rate, and training yields the trained improved textCNN model. The principle of the RMSprop optimizer is similar to the momentum gradient descent algorithm: RMSprop limits oscillation in the vertical direction, so the algorithm can take larger steps in the horizontal direction and converge faster.
The test set is input into the trained improved textCNN model to obtain the classification results of the test set, which are compared with the test set labels to compute the prediction accuracy. By repeatedly adjusting hyperparameters and optimizing the preprocessing procedure, the prediction accuracy of the improved textCNN classification model is made optimal; the improved textCNN classification model at this point is the trained improved textCNN classification model.
As shown in Fig. 4, the text classification step comprises:
Preprocessing the text to be classified: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each text. This step is identical to the preprocessing applied to the labeled sample texts during model training.
The trained word vector model file is loaded to obtain word vectors, and the word vectors are used to represent the preprocessed text as a text feature vector matrix.
The trained improved textCNN model is loaded, the text feature vector is input into the improved textCNN model, and prediction yields the classification result of the text.
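A sketch of this classification step, reusing the illustrative preprocess function, vocab dictionary and file names from the earlier sketches; here the frozen Embedding layer inside the model performs the word-vector lookup that the patent describes as building the text feature vector matrix:

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.models import load_model

w2v = Word2Vec.load("word2vec_256.model")    # trained word vector model file
model = load_model("improved_textcnn.h5")    # trained improved textCNN model

def classify(text):
    words = preprocess(text)                     # same preprocessing as in training
    ids = [vocab.get(w, 0) for w in words]       # map words to ids; 0 = padding/unknown
    probs = model.predict(np.array([ids]))[0]    # probability of each category
    return int(np.argmax(probs))                 # predicted category index
```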
With further reference to Fig. 5, as an implementation of the method shown in Fig. 1, the present application provides an embodiment of a text classification apparatus based on an improved textCNN model. This apparatus embodiment corresponds to the method embodiment shown in Fig. 1, and the apparatus may specifically be incorporated into various electronic devices.
Fig. 5 shows a text classification apparatus based on an improved textCNN model of the present invention, the apparatus comprising:
a training unit 501, which trains the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification unit 502, which classifies texts to be classified using the trained improved textCNN model.
An essential step of the present invention is constructing the improved textCNN model, i.e., obtaining the improved textCNN model through training; the structure of the improved textCNN model is an important inventive point of the present invention.
As shown in Fig. 2, the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer. The input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
In the present invention, the preprocessed text is input to the word embedding layer (Embedding) of the improved textCNN model. The weights of the Embedding layer are pre-trained on a large amount of unlabeled data with the skip-gram model in word2vec, and the weights of this layer are not updated when training the classification model. The number of parameters in this layer is the dictionary size multiplied by the word vector dimension (256), up to tens of millions of parameters, so using pre-trained word vectors greatly reduces the training time.
In the present invention, four modules — the first, second, third and fourth convolution modules — process the input in parallel. The size of the convolution kernel differs in each module so that each module captures local features of a different span; one-dimensional convolution kernels of different sizes are used, and the present apparatus chooses the four sizes 4, 5, 6 and 7 to capture local features of the corresponding spans. In particular, the present apparatus attaches a batch normalization layer (BatchNorm) after each convolutional layer to standardize the data, which prevents the vanishing gradient problem and makes the model converge faster. A RELU activation function is used after each BatchNorm layer. In each convolution module the convolutional layers are stacked twice, deeper than the traditional textCNN model, which gives the model stronger expressive power and thus improves prediction accuracy. Of course, those skilled in the art will readily see that the network can be made deeper, for example 5 or 6 layers, and that more convolution modules can run in parallel, not limited to four, for example 6 or 8.
As shown in Fig. 2, the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
The outputs of the four convolution modules with different kernel sizes each pass through a max-pool layer, which compresses the output dimension by down-sampling. The Concat layer splices the four outputs of the previous layer into a single one-dimensional vector, which then passes through a Dropout layer. (Dropout means that, during the training of a deep learning network, neural network units are temporarily dropped from the network with a certain probability. Note that this is temporary: under stochastic gradient descent, because units are dropped at random, each mini-batch trains a different network.) The dropout value is 0.5, which prevents overfitting. The fully connected layer (fc) with RELU activation maps the one-dimensional vector to a 128-dimensional vector. The classifier layer maps this vector to a category vector whose entries are the probabilities of belonging to each class; the classifier layer of the present invention generally uses the softmax function.
As shown in Fig. 3, the operations performed by the training unit comprise:
Preprocessing the labeled sample texts: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each sample text. The length of each text in the sample texts is counted, a uniform text length is determined based on the average length and experience, texts that are too long are truncated, and texts that are too short are padded.
The word-vector training corpus is preprocessed by word segmentation and stop-word removal, and training with the skip-gram model in word2vec yields trained word vectors of dimension 256.
The preprocessed sample texts are combined with the trained word vectors to obtain the feature matrix of each text, and the feature matrices are divided into a training set and a test set in a certain proportion.
The training set is input into the improved textCNN model with initial weights; categorical cross entropy is used as the loss function, the RMSProp optimizer adaptively changes the learning rate, and training yields the trained improved textCNN model. The principle of the RMSprop optimizer is similar to the momentum gradient descent algorithm: RMSprop limits oscillation in the vertical direction, so the algorithm can take larger steps in the horizontal direction and converge faster.
The test set is input into the trained improved textCNN model to obtain the classification results of the test set, which are compared with the test set labels to compute the prediction accuracy. By repeatedly adjusting hyperparameters and optimizing the preprocessing procedure, the prediction accuracy of the improved textCNN classification model is made optimal; the improved textCNN classification model at this point is the trained improved textCNN classification model.
As shown in Fig. 4, the operations performed by the text classification unit comprise:
Preprocessing the text to be classified: junk characters are removed with regular expressions, and word segmentation and stop-word removal yield the word-level set of each text. This step is identical to the preprocessing applied to the labeled sample texts during model training.
The trained word vector model file is loaded to obtain word vectors, and the word vectors are used to represent the preprocessed text as a text feature vector matrix.
The trained improved textCNN model is loaded, the text feature vector is input into the improved textCNN model, and prediction yields the classification result of the text.
The present invention achieves the following technical effects: by improving the traditional textCNN model, a text classification algorithm based on an improved textCNN model is obtained. Because of the pre-trained embedding layer, the batch normalization layers and the RELU activation functions, the training time and the amount of computation are greatly reduced and the classification accuracy is greatly improved, making the method more suitable for scenarios with demanding requirements on sample timeliness (i.e., the model must be updated frequently according to new samples) and on classification accuracy, such as text classification in Internet public opinion analysis. Moreover, after text preprocessing, the trained word vector model file is loaded to obtain word vectors, and the word vectors are used to represent the preprocessed text as a text feature vector matrix for subsequent processing, which improves classification speed and accuracy.
For convenience of description, the above apparatus is described as being divided into various units by function. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments only illustrate, rather than limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the invention may still be modified or equivalently replaced without departing from its spirit and scope, and any such modifications or equivalent replacements shall be covered by the claims of the present invention.

Claims (11)

1. A text classification method based on an improved textCNN model, characterized in that the method comprises:
a training step: training the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification step: classifying texts to be classified using the trained improved textCNN model.
2. The method according to claim 1, characterized in that the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer; the input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
3. The method according to claim 2, characterized in that the sizes of the convolution kernels in the first convolution module, the second convolution module, the third convolution module and the fourth convolution module are all different, so as to capture local features of different spans; the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
4. The method according to claim 1, characterized in that the training step comprises:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
5. The method according to claim 4, characterized in that the text classification step comprises:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
6. A text classification apparatus based on an improved textCNN model, characterized in that the apparatus comprises:
a training unit, which trains the improved textCNN model using sample texts to obtain a trained improved textCNN model;
a text classification unit, which classifies texts to be classified using the trained improved textCNN model.
7. The apparatus according to claim 6, characterized in that the improved textCNN model comprises an input layer, a word embedding layer, a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a Concat layer, a Dropout layer, a fully connected layer with a RELU activation function, and a classifier layer; the input of the word embedding layer is connected to the output of the input layer; the first, second, third and fourth convolution modules operate in parallel, and their inputs are all connected to the output of the word embedding layer; the outputs of the first, second, third and fourth convolution modules are all connected to the input of the Concat layer; the output of the Concat layer is connected to the Dropout layer; the output of the Dropout layer is connected to the input of the fully connected layer with the RELU activation function; and the output of the fully connected layer with the RELU activation function is connected to the input of the classifier layer.
8. The apparatus according to claim 7, characterized in that the sizes of the convolution kernels in the first convolution module, the second convolution module, the third convolution module and the fourth convolution module are all different, so as to capture local features of different spans; the first convolution module comprises, connected in sequence, a first convolutional layer, a first batch normalization layer, a first RELU activation function, a second convolutional layer, a second batch normalization layer, a second RELU activation function, and a first max pooling layer; the second convolution module comprises, connected in sequence, a third convolutional layer, a third batch normalization layer, a third RELU activation function, a fourth convolutional layer, a fourth batch normalization layer, a fourth RELU activation function, and a second max pooling layer; the third convolution module comprises, connected in sequence, a fifth convolutional layer, a fifth batch normalization layer, a fifth RELU activation function, a sixth convolutional layer, a sixth batch normalization layer, a sixth RELU activation function, and a third max pooling layer; and the fourth convolution module comprises, connected in sequence, a seventh convolutional layer, a seventh batch normalization layer, a seventh RELU activation function, an eighth convolutional layer, an eighth batch normalization layer, an eighth RELU activation function, and a fourth max pooling layer.
9. The apparatus according to claim 6, characterized in that the operations performed by the training unit comprise:
preprocessing the labeled sample texts: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each sample text; counting the length of each text in the sample texts, determining a uniform text length based on the average length and experience, truncating texts that are too long, and padding texts that are too short;
segmenting the word-vector training corpus and removing stop words as preprocessing, then training with the skip-gram model in word2vec to obtain trained word vectors of dimension 256;
combining the preprocessed sample texts with the trained word vectors to obtain the feature matrix of each text, and dividing the feature matrices into a training set and a test set in a certain proportion;
inputting the training set into the improved textCNN model with initial weights, using categorical cross entropy as the loss function and the RMSProp optimizer to adaptively change the learning rate, and training to obtain the trained improved textCNN model;
inputting the test set into the trained improved textCNN model to obtain the classification results of the test set, comparing the results with the test set labels to compute the prediction accuracy, and repeatedly adjusting hyperparameters and optimizing the preprocessing procedure until the prediction accuracy of the improved textCNN classification model is optimal, the improved textCNN classification model at this point being the trained improved textCNN classification model.
10. The apparatus according to claim 9, characterized in that the operations performed by the text classification unit comprise:
preprocessing the text to be classified: removing junk characters with regular expressions, performing word segmentation, and removing stop words to obtain the word-level set of each text;
loading the trained word vector model file to obtain word vectors, and using the word vectors to represent the preprocessed text as a text feature vector matrix;
loading the trained improved textCNN model, inputting the text feature vector into the improved textCNN model, and predicting to obtain the classification result of the text.
11. A computer-readable storage medium, characterized in that computer program code is stored on the storage medium, and when the computer program code is executed by a computer, the method of any one of claims 1 to 5 is performed.
CN201811572759.1A 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model Pending CN109918497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811572759.1A CN109918497A (en) 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811572759.1A CN109918497A (en) 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model

Publications (1)

Publication Number Publication Date
CN109918497A true CN109918497A (en) 2019-06-21

Family

ID=66959953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811572759.1A Pending CN109918497A (en) 2018-12-21 2018-12-21 Text classification method, apparatus and storage medium based on an improved textCNN model

Country Status (1)

Country Link
CN (1) CN109918497A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543629A (en) * 2019-08-01 2019-12-06 淮阴工学院 chemical equipment text classification method based on W-ATT-CNN algorithm
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN112242185A (en) * 2020-09-09 2021-01-19 山东大学 Medical image report automatic generation method and system based on deep learning
CN112270615A (en) * 2020-10-26 2021-01-26 西安邮电大学 Intelligent decomposition method for manufacturing BOM (Bill of Material) by complex equipment based on semantic calculation
WO2021051586A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Interview answer text classification method, device, electronic apparatus and storage medium
CN114207605A (en) * 2019-10-31 2022-03-18 深圳市欢太科技有限公司 Text classification method and device, electronic equipment and storage medium
CN114416213A (en) * 2022-03-29 2022-04-29 北京沃丰时代数据科技有限公司 Word vector file loading method and device and storage medium
CN114564942A (en) * 2021-09-06 2022-05-31 北京数美时代科技有限公司 Text error correction method, storage medium and device for supervision field
CN115936094A (en) * 2022-12-27 2023-04-07 北京百度网讯科技有限公司 Training method and device of text processing model, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310825A (en) * 1998-06-23 2001-08-29 微软公司 Methods and apparatus for classifying text and for building a text classifier
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks
AU2018101513A4 (en) * 2018-10-11 2018-11-15 Hui, Bo Mr Comprehensive Stock Prediction GRU Model: Emotional Index and Volatility Based

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1310825A (en) * 1998-06-23 2001-08-29 微软公司 Methods and apparatus for classifying text and for building a text classifier
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks
AU2018101513A4 (en) * 2018-10-11 2018-11-15 Hui, Bo Mr Comprehensive Stock Prediction GRU Model: Emotional Index and Volatility Based

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
小简铺子: "Implementation of a convolutional neural network (TextCNN) for sentence classification" *
流川枫AI: "I Love NLP (4): Chinese text classification in practice based on the Text-CNN model", Jianshu *
谷宇: "Multimodal 3D convolutional neural network method for brain glioma segmentation", Science Technology and Engineering *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543629A (en) * 2019-08-01 2019-12-06 淮阴工学院 chemical equipment text classification method based on W-ATT-CNN algorithm
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN110717039B (en) * 2019-09-17 2023-10-13 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device, and computer-readable storage medium
WO2021051586A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Interview answer text classification method, device, electronic apparatus and storage medium
CN114207605A (en) * 2019-10-31 2022-03-18 深圳市欢太科技有限公司 Text classification method and device, electronic equipment and storage medium
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
CN111930938A (en) * 2020-07-06 2020-11-13 武汉卓尔数字传媒科技有限公司 Text classification method and device, electronic equipment and storage medium
CN112242185A (en) * 2020-09-09 2021-01-19 山东大学 Medical image report automatic generation method and system based on deep learning
CN112270615A (en) * 2020-10-26 2021-01-26 西安邮电大学 Intelligent decomposition method for manufacturing BOM (Bill of Material) by complex equipment based on semantic calculation
CN114564942A (en) * 2021-09-06 2022-05-31 北京数美时代科技有限公司 Text error correction method, storage medium and device for supervision field
CN114416213A (en) * 2022-03-29 2022-04-29 北京沃丰时代数据科技有限公司 Word vector file loading method and device and storage medium
CN115936094A (en) * 2022-12-27 2023-04-07 北京百度网讯科技有限公司 Training method and device of text processing model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109918497A (en) Text classification method, apparatus and storage medium based on an improved textCNN model
CN108334605B (en) Text classification method and device, computer equipment and storage medium
CN108170736B (en) Document rapid scanning qualitative method based on cyclic attention mechanism
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN110413786B (en) Data processing method based on webpage text classification, intelligent terminal and storage medium
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN106528528A (en) A text emotion analysis method and device
CN110188047A (en) A kind of repeated defects report detection method based on binary channels convolutional neural networks
CN112100377B (en) Text classification method, apparatus, computer device and storage medium
CN111858878B (en) Method, system and storage medium for automatically extracting answer from natural language text
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112215696A (en) Personal credit evaluation and interpretation method, device, equipment and storage medium based on time sequence attribution analysis
CN112507114A (en) Multi-input LSTM-CNN text classification method and system based on word attention mechanism
CN112925904A (en) Lightweight text classification method based on Tucker decomposition
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
CN110019796A (en) A kind of user version information analysis method and device
WO2021159099A1 (en) Searching for normalization-activation layer architectures
Sajeevan et al. An enhanced approach for movie review analysis using deep learning techniques
Swami et al. Resume classifier and summarizer
Ram et al. Supervised sentiment classification with cnns for diverse se datasets
CN116186506A (en) Automatic identification method for accessibility problem report based on BERT pre-training model
CN116089886A (en) Information processing method, device, equipment and storage medium
CN113806538B (en) Label extraction model training method, device, equipment and storage medium
US20230063686A1 (en) Fine-grained stochastic neural architecture search
CN114049522A (en) Garbage classification system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190621

RJ01 Rejection of invention patent application after publication