CN109840279A - Text classification method based on convolutional recurrent neural network - Google Patents


Info

Publication number
CN109840279A
Authority
CN
China
Prior art keywords
convolution
indicate
input
text
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910025175.0A
Other languages
Chinese (zh)
Inventor
Li Zhao (李钊)
Wang Ruishuang (王瑞霜)
Cao Jian (曹建)
Chen Tong (陈通)
Wang Lei (王磊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yi Yun Information Technology Co Ltd
Shandong Computer Science Center National Super Computing Center in Jinan
Shandong Computer Science Center
Original Assignee
Shandong Yi Yun Information Technology Co Ltd
Shandong Computer Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yi Yun Information Technology Co Ltd and Shandong Computer Science Center
Priority to CN201910025175.0A
Publication of CN109840279A
Legal status: Pending


Landscapes

  • Character Discrimination (AREA)

Abstract

The present invention discloses a text classification method based on a convolutional recurrent neural network. The method makes full use of the strength of convolutional neural networks at extracting local features to perform feature extraction on text, while using the memory capability of LSTM to link the extracted contextual features and better represent the semantic information of the text. The method not only achieves good classification results on English datasets but also achieves high classification accuracy on Chinese datasets.

Description

Text classification method based on convolutional recurrent neural network
Technical field
The present invention relates to text classification methods, and specifically to a text classification method based on a convolutional recurrent neural network.
Background art
With the rapid development of deep learning, convolutional neural networks and recurrent neural networks have achieved great success in a variety of machine learning tasks. For example, convolutional neural networks are widely used in the field of computer vision and handle such tasks relatively maturely, e.g. image classification, object detection, image segmentation, and speech recognition. Recurrent neural networks are another important branch of deep learning, used mainly to handle sequence problems. The long short-term memory network (LSTM) is a specific type of recurrent neural network that can capture the contextual information of a sequence and is widely used for time-series problems such as speech recognition and machine translation.
In recent years, for sequence data problems, more and more researchers have combined convolutional neural networks and recurrent neural networks. The hybrid model is called a convolutional recurrent neural network (CRNN); a CRNN can be simply described as a convolutional neural network followed by a recurrent neural network. In this model, the convolutional neural network is used mainly to extract features, while the recurrent neural network links the contextual feature information together. At present, the model has been applied to music classification, hyperspectral data classification, bird audio detection, and so on.
The convolutional recurrent neural network model is equally applicable to text classification. In text classification, convolutional neural networks can flexibly extract the features of the text; since the classification result is influenced by the entire text content, linking the extracted features with a long short-term memory network can better represent the text and in turn better realize text classification. Therefore, this document classifies text with a convolutional recurrent neural network and uses Chinese and English datasets as experimental data to compare it with other classification methods.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text classification method based on a convolutional recurrent neural network: first, several groups of features are extracted from the input text information by a convolutional network and pooled separately to extract the important features of the text; the extracted features are then fused, fed into an LSTM neural network, and the classification result is output through a fully connected layer.
In order to solve the above technical problem, the present invention adopts the following technical solution: a text classification method based on a convolutional recurrent neural network, characterized by comprising the following steps:
S01) converting the sample data of a text sequence into a word-vector matrix as the input of the convolutional layer;
S02) performing convolution on the input data with convolution kernels of multiple scales, the height of the feature map after convolution being computed with formula 1. During the convolution operation, each local feature of the input is first computed with a single convolution kernel (formula 2), the computed features are then concatenated vertically (formula 3), and finally an activation function is applied to the result to perform a nonlinear computation and obtain the final convolution features (formula 4):
H_2 = ⌊(H_1 - F + 2P) / S⌋ + 1 (1)
h1_F(i) = f(W_F · X(i:i+F-1) + b) (2)
h1_F = [h1_F(1); h1_F(2); …; h1_F(H_2)] (3)
hr1_F = relu(h1_F) (4)
In the formulas, H_2 is the height of the feature map after convolution, H_1 is the height of the input before convolution, F is the height of the convolution kernel, P is the padding size, S is the stride, ⌊·⌋ denotes rounding down, W_F is a convolution kernel of height F, X(i:i+F-1) is the local feature vector from the i-th to the (i+F-1)-th feature of the sample input vector, and b is the bias;
S03) pooling the convolution results with the max-pooling layer MaxPooling1D to extract the important features of the text, then joining the pooled results with the Concatenate function as the input of the LSTM layer, as shown in formulas 5 and 6:
hrp1_F = max(hr1_F) (5)
h1 = [hrp1_2; hrp1_3] (6)
S04) taking the text feature sequence processed by the different convolution kernels as the input of the LSTM network, which can represent the semantic information of the text more accurately and thus better achieve the classification of the text; the LSTM network computes at each time step as follows:
f_t = σ(W_f · [h_{t-1}, h1_t] + b_f) (7)
i_t = σ(W_i · [h_{t-1}, h1_t] + b_i) (8)
c̃_t = tanh(W_c · [h_{t-1}, h1_t] + b_c) (9)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t (10)
o_t = σ(W_o · [h_{t-1}, h1_t] + b_o) (11)
h_t = o_t ∘ tanh(c_t) (12)
where f_t is the forget gate; σ is the sigmoid function; W_f is the weight matrix of the forget gate; [h_{t-1}, h1_t] denotes combining two vectors into one longer vector; h_{t-1} is the output of the LSTM network at the previous time step; h1_t is the convolution-pooling output h1 at time t; b_f is the bias of the forget gate; i_t is the input gate, with weight matrix W_i and bias b_i; c̃_t is the candidate cell state of the current input, computed from the previous output and the current input, with weight matrix W_c and bias b_c; c_t is the cell state at the current time, obtained by multiplying the forget gate f_t with the cell state c_{t-1} of the previous time step and adding the input gate i_t multiplied by the candidate cell state c̃_t, so that the long-term memory c_{t-1} of the LSTM is combined with the current memory c̃_t to form the new cell state c_t; o_t is the output gate, with weight matrix W_o and bias b_o; h_t is the final output, determined jointly by the cell state c_t and the output gate o_t.
Further, the method also includes step S05): adding a fully connected layer whose output dimension is the number of classes in the training set; the probability that a sample belongs to each class is computed with the Softmax function:
Softmax(y(i)) = exp(y(i)) / Σ_k exp(y(k))
where y(i) is the value of the i-th neuron of the output layer, y(k) is the value of the k-th neuron of the output layer, and exp is the exponential function with base e.
Further, step S01 includes the following detailed steps: (1) performing word segmentation on the Chinese training dataset; (2) building a dictionary and the mapping between dictionary and indices; (3) mapping the text sequences to index sequences; (4) processing all samples to the same sequence length, realized by zero-padding or truncation; (5) performing word embedding with pre-trained word vectors: if the sample sequence length is M and the pre-trained word-vector dimension is N, then after word embedding each sample is converted into an M×N word-vector matrix and used as the input of the convolutional layer.
Further, in step S02, one-dimensional convolutional layers are used to perform the convolution on the input; the kernel heights take the two scales 2 and 3, the number of kernels is 256, and the activation function is the ReLU function.
Further, a Batch Normalization layer is added between steps S02 and S03 to normalize the data and accelerate the convergence of the model.
Further, a Dropout layer is added between steps S04 and S05, randomly disconnecting a specified proportion of neuron connections to prevent overfitting.
Beneficial effects of the present invention: based on convolutional neural networks and the recurrent neural network LSTM, the present invention proposes a text classification method based on a convolutional recurrent neural network. The method makes full use of the strength of convolutional neural networks at extracting local features, while using the memory capability of LSTM to link the extracted contextual features and better represent the semantic information of the text. The method not only achieves good classification results on English datasets but also achieves high classification accuracy on Chinese datasets.
Brief description of the drawings
Fig. 1 is the structure diagram of the convolutional recurrent neural network;
Fig. 2 is the structure diagram of the convolutional neural network;
Fig. 3 is the structure diagram of the LSTM network.
Specific embodiment
The present invention is further illustrated below with reference to the drawings and specific embodiments.
Embodiment 1
This embodiment discloses a text classification method based on a convolutional recurrent neural network. The method is based on the convolutional recurrent neural network model shown in Fig. 1, which includes an input layer, a word embedding layer, convolutional layers, pooling layers, a long short-term memory (LSTM) network layer, and fully connected layers. The model first uses convolutional networks to extract several groups of features from the input text and pools them separately to extract the important features of the text; the extracted features are then fused, fed into the LSTM network, and the classification result is output through a fully connected layer.
The specific steps of this method are as follows:
S01) converting the sample data of a text sequence into a word-vector matrix as the input of the convolutional layer;
In text classification a sample is usually a text sequence, so before being fed to the neural network it must be represented as a word-vector matrix. Since sample lengths vary in text classification, the samples must be processed to the same length before word embedding; the sample length depends on the size of the dataset (let the sample length be M). Here, pre-trained word vectors of dimension N are used for word embedding, so each sample can be represented as an M×N word-vector matrix and used as the input of the convolutional layer.
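As a minimal sketch of this preprocessing (following the detailed steps (1)-(5) of the method), the snippet below uses Keras utilities; the toy corpus, the `pretrained` vector table, and the concrete values M = 100, N = 100, and a 20000-word dictionary (taken from the parameter settings in section 4.3) are illustrative assumptions rather than part of the patent:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

M, N, VOCAB = 100, 100, 20000  # sample length, word-vector dim, dictionary size (section 4.3)

# (1) texts are assumed to be already word-segmented, space-separated tokens
texts = ["deep learning text classification", "convolutional recurrent network model"]

# (2)-(3) build the dictionary and map each text sequence to an index sequence
tokenizer = Tokenizer(num_words=VOCAB)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# (4) zero-pad or truncate so that every sample has the same length M
x = pad_sequences(sequences, maxlen=M, padding="post", truncating="post")

# (5) fill an embedding matrix from pre-trained word vectors;
# `pretrained` stands in for a loaded word2vec/GloVe table (assumption)
pretrained = {}  # e.g. {"text": np.random.rand(N), ...}
embedding_matrix = np.zeros((VOCAB, N))
for word, idx in tokenizer.word_index.items():
    if idx < VOCAB and word in pretrained:
        embedding_matrix[idx] = pretrained[word]
# after the Embedding layer, each padded sample becomes an M x N word-vector matrix
```

The embedding matrix built here is what the word embedding layer of the model would be initialized with before fine-tuning.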
S02) To represent the semantic features of the text more accurately, this embodiment performs convolution on the input data with convolution kernels of multiple scales, applies max pooling to the convolution results to extract the important features of the text, and then joins the pooled results as the input of the LSTM layer. The structure of the convolutional neural network is shown in Fig. 2.
In this embodiment, one-dimensional convolutional layers (Conv1D) are used to perform the convolution on the input; the kernel heights take the two scales 2 and 3, the number of kernels is 256, and the activation function is ReLU. The text length here is usually set to 100, so the heights of the feature maps after convolution are 99 and 98 respectively (computed with formula 1), and the feature map dimensions after convolution are (99, 256) and (98, 256) respectively.
H_2 = ⌊(H_1 - F + 2P) / S⌋ + 1 (1)
In formula (1), H_2 is the height of the feature map after convolution, H_1 is the height of the input before convolution, F is the height of the convolution kernel, P is the padding size (0 here), S is the stride (1 here), and ⌊·⌋ denotes rounding down. For example, with H_1 = 100, F = 2, P = 0, and S = 1, formula 1 gives H_2 = ⌊(100 - 2 + 0)/1⌋ + 1 = 99; with F = 3, H_2 = 98.
In the convolution feature extraction process, each local feature of the input is first computed with a single convolution kernel (formula 2), the computed features are then concatenated vertically (formula 3), and finally an activation function is applied to the result to perform a nonlinear computation and obtain the final convolution features (formula 4).
h1_F(i) = f(W_F · X(i:i+F-1) + b) (2)
h1_F = [h1_F(1); h1_F(2); …; h1_F(H_2)] (3)
hr1_F = relu(h1_F) (4)
where W_F is a convolution kernel of height F, X(i:i+F-1) is the local feature vector from the i-th to the (i+F-1)-th feature of the sample input vector, and b is the bias.
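The per-kernel computation of formulas 2-4 can be mimicked in a few lines of NumPy; the toy dimensions below are illustrative, and f in formula 2 is taken as the identity mapping since the nonlinearity is applied in formula 4:

```python
import numpy as np

H1, N, F, num_kernels = 100, 100, 2, 256  # input height, embedding dim, kernel height, kernel count
X = np.random.rand(H1, N)                 # one embedded sample (M x N word-vector matrix)
W = np.random.rand(num_kernels, F, N)     # 256 kernels of height F (W_F in formula 2)
b = np.zeros(num_kernels)                 # bias b

H2 = (H1 - F + 2 * 0) // 1 + 1            # formula 1 with P = 0, S = 1 -> 99 for F = 2

# formula 2: one convolution value per window position i, per kernel
h1 = np.empty((H2, num_kernels))
for i in range(H2):
    window = X[i:i + F]                   # X(i:i+F-1), the local feature block
    h1[i] = np.tensordot(W, window, axes=([1, 2], [0, 1])) + b

# formula 3 is the vertical stacking already realized by the h1 array;
# formula 4: the ReLU nonlinearity gives the final convolution features
hr1 = np.maximum(h1, 0.0)
print(hr1.shape)  # (99, 256), matching the feature-map dimensions stated in the text
```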
S03) pooling the convolution results with the max-pooling layer MaxPooling1D to extract the important features of the text, then joining the pooled results with the Concatenate function as the input of the LSTM layer, as shown in formulas 5 and 6:
hrp1_F = max(hr1_F) (5)
h1 = [hrp1_2; hrp1_3] (6)
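Continuing the sketch for formulas 5 and 6: the pooling window of 2 and the concatenation along the feature axis are assumptions, since the patent names MaxPooling1D and Concatenate without fixing these parameters:

```python
import numpy as np

hr1_2 = np.random.rand(99, 256)  # stands in for hr1 with F = 2 from the previous sketch
hr1_3 = np.random.rand(98, 256)  # stands in for hr1 with F = 3

# formula 5: 1-D max pooling along the time axis with an assumed window of 2
def max_pool_1d(feat, pool=2):
    steps = feat.shape[0] // pool
    return feat[:steps * pool].reshape(steps, pool, -1).max(axis=1)

hrp1_2 = max_pool_1d(hr1_2)  # shape (49, 256)
hrp1_3 = max_pool_1d(hr1_3)  # shape (49, 256)

# formula 6: join the pooled feature maps as the LSTM input sequence h1
h1_seq = np.concatenate([hrp1_2, hrp1_3], axis=-1)  # shape (49, 512)
```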
S04) Exploiting the ability of the long short-term memory (LSTM) network to capture contextual information, the text feature sequence processed by the different convolution kernels is used as the input of the LSTM network, which can represent the semantics of the text more accurately and in turn better achieve the classification of the text. The LSTM network structure is shown in Fig. 3.
The LSTM network computes at each time step as follows:
f_t = σ(W_f · [h_{t-1}, h1_t] + b_f) (7)
i_t = σ(W_i · [h_{t-1}, h1_t] + b_i) (8)
c̃_t = tanh(W_c · [h_{t-1}, h1_t] + b_c) (9)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t (10)
o_t = σ(W_o · [h_{t-1}, h1_t] + b_o) (11)
h_t = o_t ∘ tanh(c_t) (12)
where f_t is the forget gate; σ is the sigmoid function; W_f is the weight matrix of the forget gate; [h_{t-1}, h1_t] denotes combining two vectors into one longer vector; h_{t-1} is the output of the LSTM network at the previous time step; h1_t is the convolution-pooling output h1 at time t; b_f is the bias of the forget gate; i_t is the input gate, with weight matrix W_i and bias b_i; c̃_t is the candidate cell state of the current input, computed from the previous output and the current input, with weight matrix W_c and bias b_c; c_t is the cell state at the current time, obtained by multiplying the forget gate f_t with the cell state c_{t-1} of the previous time step and adding the input gate i_t multiplied by the candidate cell state c̃_t, so that the long-term memory c_{t-1} of the LSTM is combined with the current memory c̃_t to form the new cell state c_t; o_t is the output gate, with weight matrix W_o and bias b_o; h_t is the final output, determined jointly by the cell state c_t and the output gate o_t.
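The single time step of formulas 7-12 can be written directly in NumPy as below; the hidden size of 70 matches the parameter settings in section 4.3, while the random weights and the input dimension are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hid = 512, 70                        # input dim (pooled features), LSTM units (section 4.3)
rng = np.random.default_rng(0)
Wf, Wi, Wc, Wo = (rng.standard_normal((d_hid, d_hid + d_in)) * 0.01 for _ in range(4))
bf = bi = bc = bo = np.zeros(d_hid)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, h1_t]: two vectors joined into one
    f_t = sigmoid(Wf @ z + bf)               # formula 7, forget gate
    i_t = sigmoid(Wi @ z + bi)               # formula 8, input gate
    c_tilde = np.tanh(Wc @ z + bc)           # formula 9, candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # formula 10, new cell state
    o_t = sigmoid(Wo @ z + bo)               # formula 11, output gate
    h_t = o_t * np.tanh(c_t)                 # formula 12, final output
    return h_t, c_t

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in np.random.rand(49, d_in):         # the pooled feature sequence h1
    h, c = lstm_step(x_t, h, c)
```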
S05) To prevent overfitting, several Dropout layers with rate 0.5 are added to the model. The last layers of the model are fully connected; the output dimension of the final fully connected layer is the number of classes in the dataset, and the probability that a sample belongs to each class is computed with the softmax function, as in formula (13):
Softmax(y(i)) = exp(y(i)) / Σ_k exp(y(k)) (13)
where y(i) is the value of the i-th neuron of the output layer, y(k) is the value of the k-th neuron of the output layer, and exp is the exponential function with base e.
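Assembling the layers of Fig. 1, a minimal Keras sketch of the model could look as follows; the kernel heights 2 and 3, 256 kernels, ReLU, Batch Normalization between S02 and S03, 70 LSTM units, Dropout rate 0.5, and the softmax output follow the text, while the pool size, concatenation axis, and class count are assumptions:

```python
from tensorflow.keras import layers, models

M, N, VOCAB, NUM_CLASSES = 100, 100, 20000, 5

inp = layers.Input(shape=(M,))
emb = layers.Embedding(VOCAB, N)(inp)        # word embedding layer (pre-trained weights in practice)

branches = []
for F in (2, 3):                             # S02: multi-scale kernels of heights 2 and 3
    conv = layers.Conv1D(256, F, activation="relu")(emb)      # 256 kernels, ReLU
    conv = layers.BatchNormalization()(conv)                  # normalization between S02 and S03
    branches.append(layers.MaxPooling1D(pool_size=2)(conv))   # S03: max pooling

merged = layers.Concatenate()(branches)      # S03: join pooled feature maps (axis assumption)
lstm = layers.LSTM(70)(merged)               # S04: LSTM layer with 70 units
drop = layers.Dropout(0.5)(lstm)             # Dropout between S04 and S05, rate 0.5
out = layers.Dense(NUM_CLASSES, activation="softmax")(drop)   # S05: softmax over classes

model = models.Model(inp, out)
model.summary()
```

With valid-padded convolutions the two branches yield (99, 256) and (98, 256) feature maps, which pool to (49, 256) each, so the concatenated LSTM input is a sequence of 49 steps of 512 features under these assumptions.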
Embodiment 2
This embodiment selects 2 Chinese datasets and 5 commonly used English text classification datasets to evaluate the proposed convolutional recurrent neural network model. The Chinese datasets come from CNKI paper data collected by the authors, and the 5 English datasets come from Zhang et al.; the datasets cover different classification tasks such as sentiment analysis, topic classification, and news categorization. Training sample sizes range from 120K to 1.4M, and the number of classes in the classification tasks is between 4 and 14. Detailed dataset information is shown in the following table.
Table 1: Text classification dataset information
Dataset | Training samples | Test samples | Classes | Task | Language
Paper Data Set 1 | 160000 | 40000 | 5 | Document classification | CH
Paper Data Set 2 | 320000 | 80000 | 10 | Document classification | CH
AG's news | 120000 | 7600 | 4 | News categorization | EN
Sogou news | 450000 | 60000 | 5 | News categorization | EN
DBPedia | 560000 | 70000 | 14 | Ontology classification | EN
Yelp Review Full | 650000 | 50000 | 5 | Sentiment analysis | EN
Yahoo! Answers | 1400000 | 60000 | 10 | Topic classification | EN
Paper Data Set: the academic paper datasets come from academic papers in CNKI collected by the authors. Dataset 1 contains 5 document classes: clinical medicine, mathematics, electric power industry, biology, and vocational education. For each class, 40000 records are chosen as experimental data, of which 80% are used as training data and 20% as test data. Dataset 2 contains 10 document classes: chemistry, light industry and handicrafts, animal husbandry and veterinary medicine, pharmacy, news and media, railway transportation, pediatrics, sports, physics, and agricultural economics; likewise, 40000 records are chosen per class as experimental data, with 80% as training data and 20% as test data.
AG's news corpus: AG is a collection of more than 1 million news articles gathered by the ComeToMyHead activity over several years from more than 2000 news sources. The dataset is mainly intended for non-commercial uses such as data mining (classification, clustering) and information retrieval (ranking, search). The AG's news topic classification dataset was built by Zhang et al. from the above corpus for their character-level convolutional neural network text classification experiments. From the original corpus, the 4 largest classes were selected: World, Sports, Business, and Sci/Tech; each class has 30000 training samples and 1900 test samples. Each sample contains 3 columns: class index (1 to 4), title, and description.
Sogou news corpus: the Sogou news topic classification dataset was selected by Zhang et al. from SogouCA and SogouCS for their character-level convolutional neural network text classification experiments. The 5 largest classes were selected from the original corpus: sports, finance, entertainment, automobile, and technology; each class has 90000 samples for training and 12000 samples for testing. The dataset was originally a Chinese dataset, but Zhang et al. used the pypinyin package together with the jieba segmentation system to convert the Chinese data into pinyin text. Each sample likewise contains 3 columns: class index (1 to 5), title, and content.
DBPedia ontology dataset: DBpedia is a crowdsourced community aiming to extract structured content from Wikipedia [24]. The DBpedia ontology dataset was constructed by selecting 14 non-overlapping classes from DBpedia 2014: Company, EducationalInstitution, Artist, Athlete, OfficeHolder, MeanOfTransportation, Building, NaturalPlace, Village, Animal, Plant, Album, Film, and WrittenWork. From each of these 14 ontology classes, 40000 training samples and 5000 test samples were randomly selected. The fields of the dataset include the class index (1 to 14) and the title and abstract of each Wikipedia article.
Yelp Review Full: the Yelp review dataset was obtained from the 2015 Yelp Dataset Challenge. The original review dataset contains 5 star ratings, i.e. 1-5. The Yelp review dataset was built by randomly selecting 130000 training samples and 10000 test samples from each star rating. Each sample contains the star rating index (1 to 5) and the review content.
Yahoo! Answers dataset: the Yahoo! Answers dataset comes from the Yahoo! Webscope dataset. The Yahoo! Webscope corpus contains 4483032 questions and their answers. The Yahoo! Answers topic classification dataset was built by choosing the 10 largest classes from the original corpus; the topic classes are Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, and Politics & Government. Each class contains 140000 training samples and 6000 test samples. Each sample contains the class index (1 to 10), question title, question content, and best answer.
4.2 Benchmark models
Classic classification models of recent years are chosen as benchmarks against the proposed convolutional recurrent neural network classification model. On the 2 self-built Chinese academic paper datasets, the classic fastText and HAN classification models are used as benchmarks. On the 5 common English datasets, the selected benchmarks include both traditional classification models and neural-network-based models. The traditional models are mainly linear methods, with results reported by Zhang et al. The neural-network-based models include char-CNN, fastText, and VDCNN, whose results are reported by Zhang et al., Joulin et al., and Conneau et al. respectively. All these benchmarks used the same experimental datasets, so for a further evaluation the proposed model is likewise tested on the above datasets.
4.3 Model parameter settings
Pre-trained word vectors are used to embed the input text and are fine-tuned during model training; the word-vector dimension is 100; the length of each sample depends on the maximum sentence length of the text; the size of the dictionary varies with the dataset and is usually set to 20000; a 0.1 fraction of the data is selected as the cross-validation set; the Dropout rate is 0.5; the convolution kernel sizes are 2 and 3 and the number of kernels is 256; the number of neurons in the LSTM layer is 70; the Adam optimization method is used with a learning rate of 1e-4; the batch size is set to 256.
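Under these settings, training could be wired up as in the sketch below, which continues the earlier model sketch; `model`, `x`, and `y` are assumed from those sketches and the epoch count is an assumption, while the Adam learning rate of 1e-4, batch size 256, and 0.1 validation split follow the listed parameters:

```python
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=1e-4),   # Adam with learning rate 1e-4
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x: padded index sequences, y: one-hot class labels (assumed prepared as above)
history = model.fit(x, y,
                    batch_size=256,                  # batch size 256
                    epochs=10,                       # epoch count not specified in the text
                    validation_split=0.1)            # 0.1 of the data held out for validation
```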
4.4 Experimental results and analysis
The proposed convolutional recurrent neural network text classification model is evaluated on the above data and compared with the benchmark models. In addition, to allow the proposed model to achieve the best classification results, experiments are run with different numbers of convolution kernels: 64, 128, 256, and 512. The specific experimental results are shown in Tables 2 and 3.
Table 2: Experimental results with different numbers of convolution kernels
Table 3: Text classification experimental results
The experimental results in Table 2 show that, within a certain range, the accuracy of text classification improves as the number of convolution kernels increases; the classification effect is best with 256 kernels. In addition, the experimental results in Table 3 show that the proposed model not only achieves good classification results on the Chinese datasets, but its classification accuracy on the AG's news corpus and the DBPedia ontology dataset is also higher than that of the other benchmark models. In summary, the proposed model is applicable not only to the classification of Chinese datasets but equally to the classification of English datasets.
The present invention combines the strength of convolutional neural networks at extracting local features with the memory capability of the recurrent neural network LSTM to propose a text classification method based on a convolutional recurrent neural network, and evaluates the proposed model on 2 Chinese datasets and 5 commonly used English datasets. The experimental results show that the proposed model not only achieves high classification accuracy on the Chinese datasets but also performs well on the other English datasets.
The above describes only the basic principle and preferred embodiments of the present invention; improvements and substitutions made by those skilled in the art according to the present invention fall within the protection scope of the present invention.

Claims (6)

1. A text classification method based on a convolutional recurrent neural network, characterized by comprising the following steps:
S01) converting the sample data of a text sequence into a word-vector matrix as the input of the convolutional layer;
S02) performing convolution on the input data with convolution kernels of multiple scales, the height of the feature map after convolution being computed with formula 1; during the convolution operation, each local feature of the input is first computed with a single convolution kernel (formula 2), the computed features are then concatenated vertically (formula 3), and finally an activation function is applied to the result to perform a nonlinear computation and obtain the final convolution features (formula 4):
H_2 = ⌊(H_1 - F + 2P) / S⌋ + 1 (1)
h1_F(i) = f(W_F · X(i:i+F-1) + b) (2)
h1_F = [h1_F(1); h1_F(2); …; h1_F(H_2)] (3)
hr1_F = relu(h1_F) (4)
where H_2 is the height of the feature map after convolution, H_1 is the height of the input before convolution, F is the height of the convolution kernel, P is the padding size, S is the stride, ⌊·⌋ denotes rounding down, W_F is a convolution kernel of height F, X(i:i+F-1) is the local feature vector from the i-th to the (i+F-1)-th feature of the sample input vector, and b is the bias;
S03) pooling the convolution results with the max-pooling layer MaxPooling1D to extract the important features of the text, then joining the pooled results with the Concatenate function as the input of the LSTM layer, as shown in formulas 5 and 6:
hrp1_F = max(hr1_F) (5)
h1 = [hrp1_2; hrp1_3] (6)
S04) taking the text feature sequence processed by the different convolution kernels as the input of the LSTM network, which can represent the semantic information of the text more accurately and thus better achieve the classification of the text; the LSTM network computes at each time step as follows:
f_t = σ(W_f · [h_{t-1}, h1_t] + b_f) (7)
i_t = σ(W_i · [h_{t-1}, h1_t] + b_i) (8)
c̃_t = tanh(W_c · [h_{t-1}, h1_t] + b_c) (9)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t (10)
o_t = σ(W_o · [h_{t-1}, h1_t] + b_o) (11)
h_t = o_t ∘ tanh(c_t) (12)
where f_t is the forget gate; σ is the sigmoid function; W_f is the weight matrix of the forget gate; [h_{t-1}, h1_t] denotes combining two vectors into one longer vector; h_{t-1} is the output of the LSTM network at the previous time step; h1_t is the convolution-pooling output h1 at time t; b_f is the bias of the forget gate; i_t is the input gate, with weight matrix W_i and bias b_i; c̃_t is the candidate cell state of the current input, computed from the previous output and the current input, with weight matrix W_c and bias b_c; c_t is the cell state at the current time, obtained by multiplying the forget gate f_t with the cell state c_{t-1} of the previous time step and adding the input gate i_t multiplied by the candidate cell state c̃_t, so that the long-term memory c_{t-1} of the LSTM is combined with the current memory c̃_t to form the new cell state c_t; o_t is the output gate, with weight matrix W_o and bias b_o; h_t is the final output, determined jointly by the cell state c_t and the output gate o_t.
2. The text classification method based on a convolutional recurrent neural network according to claim 1, characterized by further comprising step S05): adding a fully connected layer whose output dimension is the number of classes in the training set, the probability that a sample belongs to each class being computed with the Softmax function:
Softmax(y(i)) = exp(y(i)) / Σ_k exp(y(k))
where y(i) is the value of the i-th neuron of the output layer, y(k) is the value of the k-th neuron of the output layer, and exp is the exponential function with base e.
3. The text classification method based on a convolutional recurrent neural network according to claim 1, characterized in that step S01 further comprises the following detailed steps: (1) performing word segmentation on the Chinese training dataset; (2) building a dictionary and the mapping between dictionary and indices; (3) mapping the text sequences to index sequences; (4) processing all samples to the same sequence length; (5) performing word embedding with pre-trained word vectors: if the sample sequence length is M and the pre-trained word-vector dimension is N, then after word embedding each sample is converted into an M×N word-vector matrix and used as the input of the convolutional layer.
4. The text classification method based on a convolutional recurrent neural network according to claim 1, characterized in that in step S02 the convolution is performed on the input with one-dimensional convolutional layers; the kernel heights take the two scales 2 and 3, the number of kernels is 256, and the activation function is the ReLU function.
5. The text classification method based on a convolutional recurrent neural network according to claim 1, characterized in that a Batch Normalization layer is added between steps S02 and S03 to normalize the data and accelerate the convergence of the model.
6. The text classification method based on a convolutional recurrent neural network according to claim 1, characterized in that a Dropout layer is added between steps S04 and S05, randomly disconnecting a specified proportion of neuron connections to prevent overfitting.
CN201910025175.0A 2019-01-10 2019-01-10 Text classification method based on convolutional recurrent neural network Pending CN109840279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910025175.0A CN109840279A (en) 2019-01-10 2019-01-10 Text classification method based on convolutional recurrent neural network


Publications (1)

Publication Number Publication Date
CN109840279A (en) 2019-06-04

Family

ID=66883776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910025175.0A Pending CN109840279A (en) Text classification method based on convolutional recurrent neural network

Country Status (1)

Country Link
CN (1) CN109840279A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169035A (en) * 2017-04-19 2017-09-15 华南理工大学 A kind of file classification method for mixing shot and long term memory network and convolutional neural networks
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN108763216A (en) * 2018-06-01 2018-11-06 河南理工大学 A kind of text emotion analysis method based on Chinese data collection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chunting Zhou et al.: "A C-LSTM Neural Network for Text Classification", Computer Science *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399455A (en) * 2019-06-05 2019-11-01 福建奇点时空数字科技有限公司 A kind of deep learning data digging method based on CNN and LSTM
CN110347826A (en) * 2019-06-17 2019-10-18 昆明理工大学 A method of Laos's words and phrases feature is extracted based on character
CN110569400A (en) * 2019-07-23 2019-12-13 福建奇点时空数字科技有限公司 Information extraction method for personnel information modeling based on CNN and LSTM
CN110569358A (en) * 2019-08-20 2019-12-13 上海交通大学 Model, method and medium for learning long-term dependency and hierarchical structure text classification
CN110765785A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Neural network-based Chinese-English translation method and related equipment thereof
CN110765785B (en) * 2019-09-19 2024-03-22 平安科技(深圳)有限公司 Chinese-English translation method based on neural network and related equipment thereof
CN110717330A (en) * 2019-09-23 2020-01-21 哈尔滨工程大学 Word-sentence level short text classification method based on deep learning
CN114207605A (en) * 2019-10-31 2022-03-18 深圳市欢太科技有限公司 Text classification method and device, electronic equipment and storage medium
CN111078833A (en) * 2019-12-03 2020-04-28 哈尔滨工程大学 Text classification method based on neural network
CN111078833B (en) * 2019-12-03 2022-05-20 哈尔滨工程大学 Text classification method based on neural network
CN111310801A (en) * 2020-01-20 2020-06-19 桂林航天工业学院 Mixed dimension flow classification method and system based on convolutional neural network
CN113378556A (en) * 2020-02-25 2021-09-10 华为技术有限公司 Method and device for extracting text keywords
CN113378556B (en) * 2020-02-25 2023-07-14 华为技术有限公司 Method and device for extracting text keywords
CN111459927A (en) * 2020-03-27 2020-07-28 中南大学 CNN-LSTM developer project recommendation method
CN111459927B (en) * 2020-03-27 2022-07-08 中南大学 CNN-LSTM developer project recommendation method
CN111460100A (en) * 2020-03-30 2020-07-28 中南大学 Criminal legal document and criminal name recommendation method and system
CN112597311B (en) * 2020-12-28 2023-07-11 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-orbit satellite communication
CN112597311A (en) * 2020-12-28 2021-04-02 东方红卫星移动通信有限公司 Terminal information classification method and system based on low-earth-orbit satellite communication
CN112989052A (en) * 2021-04-19 2021-06-18 北京建筑大学 Chinese news text classification method based on combined-convolutional neural network
CN112989052B (en) * 2021-04-19 2022-03-08 北京建筑大学 Chinese news long text classification method based on combination-convolution neural network
CN113297364A (en) * 2021-06-07 2021-08-24 吉林大学 Natural language understanding method and device for dialog system

Similar Documents

Publication Publication Date Title
CN109840279A (en) Text classification method based on convolutional recurrent neural network
CN107967318A (en) A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets
CN109740154A (en) A kind of online comment fine granularity sentiment analysis method based on multi-task learning
CN106445919A (en) Sentiment classifying method and device
CN110083700A (en) A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN106599933A (en) Text emotion classification method based on the joint deep learning model
CN113033610B (en) Multi-mode fusion sensitive information classification detection method
CN108090099B (en) Text processing method and device
CN110825850B (en) Natural language theme classification method and device
CN109977199A (en) A kind of reading understanding method based on attention pond mechanism
Pacha et al. Towards self-learning optical music recognition
CN112347766A (en) Multi-label classification method for processing microblog text cognition distortion
CN109062958B (en) Primary school composition automatic classification method based on TextRank and convolutional neural network
Fei et al. Beyond prompting: Making pre-trained language models better zero-shot learners by clustering representations
Smitha et al. Meme classification using textual and visual features
Jishan et al. Natural language description of images using hybrid recurrent neural network
CN113033180B (en) Automatic generation service system for Tibetan reading problem of primary school
Kasthuri et al. An artificial bee colony and pigeon inspired optimization hybrid feature selection algorithm for twitter sentiment analysis
CN114925198B (en) Knowledge-driven text classification method integrating character information
Li et al. Multilingual toxic text classification model based on deep learning
Mouri et al. An empirical study on bengali news headline categorization leveraging different machine learning techniques
Alvarado et al. Detecting Disaster Tweets using a Natural Language Processing technique
Rawat et al. A Systematic Review of Question Classification Techniques Based on Bloom's Taxonomy
Alsharhan Natural Language Generation and Creative Writing A Systematic Review
Alhabeeb et al. An Investigation into Indonesian Students' Opinions on Educational Reforms through the Use of Machine Learning and Sentiment Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190604