CN109582794A - Long article classification method based on deep learning - Google Patents

Long article classification method based on deep learning

Info

Publication number
CN109582794A
CN109582794A (application CN201811440171.0A)
Authority
CN
China
Prior art keywords
deep learning
text
long article
classification
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811440171.0A
Other languages
Chinese (zh)
Inventor
冯姣
姜恬静
何军
李鹏
刘�文
于正威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201811440171.0A priority Critical patent/CN109582794A/en
Publication of CN109582794A publication Critical patent/CN109582794A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a long-text classification method based on deep learning. The text is preprocessed and word vectors are generated, then valid sentences are sampled at random; the resulting data set is input into a model structure that combines a convolutional neural network with a long short-term memory network and is trained repeatedly for the set number of iterations, steadily reducing the loss function, finally yielding a trained deep-learning model. By randomly sampling valid sentences and building a combined CNN-LSTM model structure, the method extracts the features of a long text comprehensively, guarantees classification accuracy, accelerates training, and improves classification efficiency. The method obtains classification results quickly and correctly; the benefit is most pronounced for long texts of more than 7,000 words whose content is closely related and whose categories are fine-grained and complex.

Description

Long article classification method based on deep learning
Technical field
The present invention relates to long-text classification methods, and more particularly to a long-text classification method based on deep learning.
Background technique
With the continuous development of the internet and electronic technology, a large number of paper documents are stored on the internet in electronic form, and the internet has become the main platform through which people spread information. People hope to quickly obtain the information they want from massive document collections by keyword, which requires that paper documents be classified and labeled precisely.
In traditional classification problems, people generally choose to extract keywords or key sentences in order to classify text. For classification problems with clear features, such as spam filtering and automatic question answering, this approach has already been applied successfully. However, for paper documents, and especially for academic papers in similar research directions, the extracted key sentences may not be discriminative enough, so the resulting labels tend to be ambiguous. To extract more accurate features, the full text must be analyzed. A long short-term memory network (LSTM) can infer subsequent states from the states it has stored, but for articles of more than ten thousand words the data dimensionality is too large: analyzing and memorizing the full text with an LSTM alone easily exhausts resources, makes training very slow, and biases the analysis results. Convolutional neural networks (CNNs) have the advantages of local connectivity, weight sharing, and multi-feature extraction, which greatly reduce computational complexity and save training time and resources; however, CNNs are limited in learning long-range dependencies in sequence data. For long training texts, a CNN can extract local features but cannot memorize longer passages or associate sentences across a long article, so it cannot extract and analyze features comprehensively or guarantee the accuracy of long-text classification. A single CNN can hardly meet the needs of long-text classification. Therefore, in the field of long-text analysis, finding a training method that can learn full-text features to improve computation speed while guaranteeing accuracy and overcoming resource constraints becomes especially important.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to provide a long-text classification method based on deep learning that overcomes, in the field of long-text analysis, the problems that local feature extraction is not comprehensive enough and that memorizing an entire article involves too many dimensions. The method can learn full-text features to guarantee classification accuracy while effectively reducing computational complexity, shortening training time, and improving classification efficiency.
Technical solution: the long-text classification method based on deep learning according to the present invention comprises the following steps:
(1) Choose articles and obtain their text data, preprocess the data, generate a random word vector for each word and assign it a word-vector number, convert the words in the text into the corresponding word vectors, and obtain a purely numerical word-vector array file;
(2) Judge the average word length of each article, segment the long text by treating every L words as one valid sentence, randomly select K valid sentences to form an array X, and repeat this random selection N times to obtain the data set {X1, …, XN};
(3) Input the data set into a two-dimensional convolutional neural network for feature extraction, apply max pooling to the convolved data to obtain the max-pooled data yi, and repeat N times to obtain the N local feature values {y1, …, yN} of the long text;
(4) Input the local feature values into an LSTM network with N hidden layers, the input of each layer being a local feature value yi, and obtain the feature parameters of the whole text;
(5) Apply dropout to the feature parameters, discarding a certain proportion of the feature values;
(6) Obtain the predicted value ŷ from the data after dropout using the softmax algorithm, compute the loss function H between the true label y and the predicted value ŷ, and update the gradients backward with the Adam backpropagation algorithm to reduce the value of H;
(7) Repeat steps (3) to (6) to train the model, continuously reducing the value of H; save one deep-learning model after the set number of training iterations is reached, execute this repeatedly, and end the training after all articles have been traversed the set number of times;
(8) Obtain the trained deep-learning model.
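As a sanity check on the dimensions involved, the steps above can be traced at the level of array shapes. This sketch uses the embodiment's constants (L = 20, K = 20, N = 25, 4 classes); the embedding dimension D = 128 and the LSTM width are assumptions not stated in the patent.

```python
import numpy as np

# Constants from the method: L words per sentence, K sentences per group,
# N groups; D (embedding dim) and HIDDEN (LSTM width) are assumed values.
L_W, K_S, N_G, D, C, HIDDEN = 20, 20, 25, 128, 4, 128
FILTERS = {3: 64, 4: 64, 5: 64}          # step (3): kernel size -> filter depth

group_shape = (K_S * L_W, D)             # one sampled group X_i after embedding lookup
feat_len = sum(FILTERS.values())         # global max pooling keeps 1 value per filter
y = np.zeros((N_G, feat_len))            # the N local feature values y_1..y_N

h_T = np.zeros(HIDDEN)                   # step (4): LSTM output = whole-text features
logits = np.zeros(C)                     # step (6): dense layer + softmax over 4 classes
print(group_shape, feat_len, y.shape, h_T.shape, logits.shape)
```

With depth 64 per kernel size, each yi is a 192-dimensional vector, and the LSTM consumes the 25 of them as a length-25 sequence.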
Further, in step (6) the loss function H is the cross-entropy function, expressed as H = -Σᵢ yᵢ·log(ŷᵢ), where y is the true label of the input document and ŷ is the predicted value output by the model.
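The cross-entropy named above can be computed numerically. A minimal NumPy sketch follows; the logit values are arbitrary illustration values, not outputs of the patent's model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max before exponentiating."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    """H = -sum_i y_i * log(y_hat_i) for a one-hot true label y."""
    return -float(np.sum(y_true * np.log(y_pred)))

y_true = np.array([0.0, 1.0, 0.0, 0.0])        # true class 2 of the 4 categories
y_pred = softmax(np.array([0.5, 2.0, 0.1, -1.0]))
H = cross_entropy(y_true, y_pred)
print(round(H, 4))  # ≈ 0.3524
```

As the predicted probability of the true class approaches 1, H approaches 0, which is what the Adam updates in step (6) drive toward.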
Convolution kernels of different sizes extract text features more effectively: the convolutional layer of the two-dimensional convolutional neural network in step (3) contains filters of 3 sizes, 3×3, 4×4 and 5×5, each with a depth of 64 and a stride of 1.
In order to effectively delete useless information from the article, the preprocessing in step (1) includes deleting the document formatting, punctuation marks, special characters and numbers from the acquired text data.
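A minimal sketch of this preprocessing, assuming plain text already extracted from the PDF; the single regular expression shown is an illustration of the cleanup rule, not the patent's exact implementation.

```python
import re

def preprocess(text: str) -> list[str]:
    """Keep only alphabetic word tokens; formatting, punctuation,
    special characters and numbers are all removed, as in step (1)."""
    return re.findall(r"[A-Za-z]+", text.lower())

tokens = preprocess("Sec. 3.1: Quantum algebras & $q$-deformations (2018)!")
print(tokens)  # alphabetic words only
```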
In order to avoid over-fitting, the dropout ratio in step (5) is 0.5.
In order to balance complete feature extraction with optimization efficiency, the number of traversals (epochs) in step (7) is 5.
Beneficial effects: by randomly sampling valid sentences and building a model structure that combines a convolutional neural network with a long short-term memory network, this method extracts the features of a long text comprehensively, guarantees classification accuracy, accelerates training, and improves classification efficiency. The method obtains classification results quickly and correctly; the effect is most pronounced for long texts of more than 7,000 words whose content is closely related and whose categories are fine-grained and complex. In practical applications, the trained model is saved; after a long document is input, the system quickly and automatically outputs an accurate label for it.
Detailed description of the invention
Fig. 1 is the overall flow chart of this embodiment;
Fig. 2 is the overall framework of the convolutional neural network and long short-term memory network;
Fig. 3 is the accuracy comparison chart of different models.
Specific embodiment
This method is applicable not only to English text classification but also to the classification of texts in other languages. The embodiment takes English articles with 4 labels as an example. First, papers in 4 different categories are downloaded from the electronic preprint database operated by Cornell University for training: mathematics (quantum algebra), mathematics (metric geometry), mathematics (algebraic geometry) and mathematics (geometry), 38,309 papers in total.
As shown in Figure 1, the method for embodiment includes the next steps:
(1) Preprocess the downloaded English papers in PDF format. Preprocessing deletes the information in the text that will not be used, including document formatting, punctuation marks, special characters, and unconventional tokens such as numbers.
(2) Build a dictionary from the English words in the data, generate a random word vector for each word and assign it a word-vector number, and finally generate array files of pure numbers composed of the word-vector numbers. Each array file stores one article, and articles with the same class label are stored in a folder named after that label.
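The dictionary and random word vectors of this step can be sketched as follows; the embedding dimension of 128 is an assumption, since the patent does not state the vector size.

```python
import numpy as np

def build_vocab(tokens):
    """Assign each distinct word a number (its word-vector index)."""
    vocab = {}
    for w in tokens:
        vocab.setdefault(w, len(vocab))
    return vocab

def random_embeddings(vocab, dim=128, seed=0):
    """One random word vector per word, as described in step (2)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(vocab), dim))

tokens = ["metric", "geometry", "metric", "space"]
vocab = build_vocab(tokens)           # {'metric': 0, 'geometry': 1, 'space': 2}
ids = [vocab[w] for w in tokens]      # the pure-number article file
emb = random_embeddings(vocab)
print(ids, emb.shape)
```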
(3) Randomly sample the full text. According to the statistics of the data, every 20 words are treated on average as one valid sentence, so the 10,000 English words of one article are divided into 500 valid sentences, i.e. L = 20; articles shorter than 10,000 words are padded with zeros. According to the actual length of the articles in the data set, 20 valid sentences are randomly selected, i.e. K = 20, and the convolutional layer will capture the features of these 20 sentences simultaneously. The sentences of the same long document are sampled 25 times, i.e. N = 25. The random sampling process (see Fig. 3) divides the input text into N groups, denoted X1 to XN.
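The segmentation and sampling just described can be sketched as below, with L = 20, K = 20 and N = 25 as in the embodiment; padding with word id 0 is an assumed convention for the zero-fill of short articles.

```python
import numpy as np

L_WORDS, K_SENT, N_GROUPS = 20, 20, 25

def split_sentences(word_ids, L=L_WORDS, target_len=10_000):
    """Zero-pad the article up to target_len, then cut it into L-word sentences."""
    padded = word_ids + [0] * (target_len - len(word_ids))   # assumes len <= target_len
    return [padded[i:i + L] for i in range(0, target_len, L)]  # 500 sentences

def sample_groups(sentences, K=K_SENT, N=N_GROUPS, seed=0):
    """Randomly draw K valid sentences, repeated N times -> N groups X1..XN."""
    rng = np.random.default_rng(seed)
    return np.stack([
        np.array([sentences[j]
                  for j in rng.choice(len(sentences), size=K, replace=False)])
        for _ in range(N)
    ])

article = list(range(1, 7001))            # a 7,000-word article as dummy word ids
groups = sample_groups(split_sentences(article))
print(groups.shape)  # (25, 20, 20): N groups of K sentences of L word ids
```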
(4) Divide the data into a training set and a validation set at a ratio of 9:1. The 34,479 articles of the training set are used to train and optimize the model; the 3,830 articles of the validation set are used to verify the actual classification performance of the optimized model.
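A sketch of the 9:1 split over the 38,309 article files; the file-naming scheme and the shuffle seed are illustrative assumptions.

```python
import random

def split_dataset(paths, seed=0):
    """Shuffle the article files and split them 9:1 into training and validation."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_val = len(paths) // 10              # 3,830 of the 38,309 articles
    return paths[n_val:], paths[:n_val]   # (training set, validation set)

files = [f"doc_{i}.npy" for i in range(38309)]   # one array file per article (assumed names)
train, val = split_dataset(files)
print(len(train), len(val))  # 34479 3830, matching the embodiment
```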
(5) Input the training-set data into the deep-learning model. As shown in Fig. 2, the convolutional neural network first performs local feature extraction on the 25 groups of valid sentences of the input article. This embodiment sets 3 filters of different sizes, 3×3, 4×4 and 5×5, each with a depth of 64 and a stride of 1; convolution kernels of different sizes extract text features more effectively. Pooling is then applied to the convolved data by the max-pooling method, with yi denoting the max-pooled data of the differently sized convolution kernels. Integrating the pooled data of each group, every article yields 25 groups of local feature values, denoted y1 to y25.
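A minimal NumPy sketch of this local feature extraction: valid 2-D convolution with stride 1 for each kernel size, then global max pooling to one value per filter. To keep the example small, the depth is reduced from 64 to 2 filters per size and the group matrix is a toy stand-in for the real (K·L words × embedding) array.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid 2-D correlation with stride 1 (no padding)."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

def local_features(group, kernels):
    """Max-pool each filter's response map to one scalar; concatenate all filters."""
    return np.array([conv2d_valid(group, k).max() for k in kernels])

rng = np.random.default_rng(0)
group = rng.normal(size=(40, 16))    # toy stand-in for one group's word-vector matrix
kernels = [rng.normal(size=(s, s)) for s in (3, 4, 5) for _ in range(2)]
y_i = local_features(group, kernels)
print(y_i.shape)  # (6,): one max-pooled value per filter
```

With the embodiment's 64 filters per size, the same construction gives a 192-dimensional yi per group.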
(6) Input these local feature values into the LSTM network to memorize the full-text context and obtain higher-level abstract features. The input of each LSTM layer is a CNN local feature value yi, and the number of LSTM hidden layers is kept consistent with the value of N above, i.e. 25. Through the LSTM, the feature parameters of the whole text are obtained and input into a fully connected layer, where a dropout ratio of 0.5 randomly discards part of the data to avoid over-fitting. This random sampling and traversal of the full text ensures that the full content of the article is extracted effectively and improves the network's ability to analyze and memorize the whole text.
(7) Finally, obtain the predicted value ŷ from the data after dropout using the softmax algorithm and compare it with the true class label y of the input text. Compute the cross-entropy H = -Σᵢ yᵢ·log(ŷᵢ) and define H as the loss function, where y is the true label class of the input document and ŷ is the predicted value output by the model; y is a number. Four classes of documents were used in the experiments of this patent: mathematics (quantum algebra), mathematics (metric geometry), mathematics (algebraic geometry) and mathematics (geometry). In the experiments, the label class of mathematics (quantum algebra) is 1, mathematics (metric geometry) is 2, mathematics (algebraic geometry) is 3 and mathematics (geometry) is 4; the classes are simply numbered in order, converting text into numbers so the computer can read and compute them. The Adam backpropagation algorithm updates the gradients backward to reduce the value of H, narrowing the gap between the label y and the predicted value ŷ and improving accuracy. A deep-learning model is saved every 100 training steps, including the weight parameters w and bias parameters b of the convolutional neural network and the LSTM network, and the loss function and accuracy are computed at every 100th step. This embodiment ends the training after all articles have been traversed 5 times, i.e. after 172,395 repetitions of the above steps. The loss functions and classification accuracies of all saved networks are observed, and the network with the highest accuracy is chosen as the optimal model, giving the trained deep-learning model.
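The Adam update driving this training can be sketched for a single parameter tensor. The toy quadratic loss standing in for H and the hyperparameters are illustrative defaults, not the patent's values; the checkpoint-every-100-steps comment mirrors the embodiment.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy loss H(w) = 0.5 * ||w||^2 with gradient g = w, so H should keep shrinking.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
losses = []
for t in range(1, 501):
    g = w
    w, m, v = adam_step(w, g, m, v, t)
    losses.append(0.5 * float(w @ w))
    # the embodiment saves a model checkpoint every 100 training steps:
    # if t % 100 == 0: save_checkpoint(w)   (save_checkpoint is hypothetical)
print(losses[0] > losses[-1])  # True: the value of H keeps decreasing
```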
Finally, the validation data are input into the network to obtain the predicted classes of the validation articles, which are compared with the article labels; the classification accuracy is computed and saved to assess the model's classification ability. Fig. 3 compares the accuracy of different parameters and different models, plotting the accuracy on the training-set text as trend curves and comparing the results with those of other classical classification methods. On the one hand, full-text analysis is more effective than local keyword analysis; on the other hand, under the same number of training iterations, the classification accuracy of a single CNN or LSTM model is much lower than that of the deep-learning model of this patent. The accuracy of this method on the training-set text is about 99%, and the accuracy on the validation-set text reaches 94%. Experiments show that this method can obtain the classification label of a long text quickly and accurately.
In practical applications, the text to be classified is input into the trained deep-learning model above, through which the accurate classification label of the text can be obtained quickly and automatically.

Claims (6)

1. A long article classification method based on deep learning, characterized by comprising the following steps:
(1) choosing articles and obtaining their text data, preprocessing the data, generating a random word vector for each word and assigning it a word-vector number, converting the words in the text into the corresponding word vectors, and obtaining a purely numerical word-vector array file;
(2) judging the average word length of each article, segmenting the long text by treating every L words as one valid sentence, randomly selecting K valid sentences to form an array X, and repeating this random selection N times to obtain a data set;
(3) inputting the data set into a two-dimensional convolutional neural network for feature extraction, applying max pooling to the convolved data to obtain the max-pooled data yi, and repeating N times to obtain the N local feature values of the long text;
(4) inputting the local feature values into an LSTM network with N hidden layers, the input of each layer being a local feature value yi, and obtaining the feature parameters of the whole text;
(5) applying dropout to the feature parameters, discarding a certain proportion of the feature values;
(6) obtaining the predicted value ŷ from the data after dropout using the softmax algorithm, computing the loss function H between the true label y and the predicted value ŷ, and updating the gradients backward with the Adam backpropagation algorithm to reduce the value of H;
(7) repeating steps (3) to (6) to train the model, continuously reducing the value of H, saving one deep-learning model after the set number of training iterations is reached, executing this repeatedly, and ending the training after all articles have been traversed the set number of times;
(8) obtaining the trained deep-learning model.
2. The long article classification method based on deep learning according to claim 1, characterized in that in step (6) the loss function H is the cross-entropy function, expressed as H = -Σᵢ yᵢ·log(ŷᵢ), wherein y is the true label of the input document and ŷ is the predicted value output by the model.
3. The long article classification method based on deep learning according to claim 1, characterized in that the convolutional layer of the two-dimensional convolutional neural network in step (3) contains filters of 3 sizes, 3×3, 4×4 and 5×5, each with a depth of 64 and a stride of 1.
4. The long article classification method based on deep learning according to claim 1, characterized in that the preprocessing in step (1) includes deleting the document formatting, punctuation marks, special characters and numbers from the acquired text data.
5. The long article classification method based on deep learning according to claim 1, characterized in that the dropout ratio in step (5) is 0.5.
6. The long article classification method based on deep learning according to claim 1, characterized in that the number of traversals in step (7) is 5.
CN201811440171.0A 2018-11-29 2018-11-29 Long article classification method based on deep learning Pending CN109582794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811440171.0A CN109582794A (en) 2018-11-29 2018-11-29 Long article classification method based on deep learning


Publications (1)

Publication Number Publication Date
CN109582794A true CN109582794A (en) 2019-04-05

Family

ID=65925069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811440171.0A Pending CN109582794A (en) 2018-11-29 2018-11-29 Long article classification method based on deep learning

Country Status (1)

Country Link
CN (1) CN109582794A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169035A (en) * 2017-04-19 2017-09-15 华南理工大学 A kind of file classification method for mixing shot and long term memory network and convolutional neural networks
US20170308790A1 (en) * 2016-04-21 2017-10-26 International Business Machines Corporation Text classification by ranking with convolutional neural networks
CN108875021A (en) * 2017-11-10 2018-11-23 云南大学 A kind of sentiment analysis method based on region CNN-LSTM


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000411A1 (en) * 2019-07-04 2021-01-07 平安科技(深圳)有限公司 Neural network-based document classification method and apparatus, and device and storage medium
CN110532448A (en) * 2019-07-04 2019-12-03 平安科技(深圳)有限公司 Document Classification Method, device, equipment and storage medium neural network based
CN110457469A (en) * 2019-07-05 2019-11-15 中国平安财产保险股份有限公司 Information classification approach, device based on shot and long term memory network, computer equipment
CN110609898A (en) * 2019-08-19 2019-12-24 中国科学院重庆绿色智能技术研究院 Self-classification method for unbalanced text data
CN110633470A (en) * 2019-09-17 2019-12-31 北京小米智能科技有限公司 Named entity recognition method, device and storage medium
CN110879934A (en) * 2019-10-31 2020-03-13 杭州电子科技大学 Efficient Wide & Deep learning model
CN110879934B (en) * 2019-10-31 2023-05-23 杭州电子科技大学 Text prediction method based on Wide & Deep learning model
CN111538840A (en) * 2020-06-23 2020-08-14 基建通(三亚)国际科技有限公司 Text classification method and device
CN112069379A (en) * 2020-07-03 2020-12-11 中山大学 Efficient public opinion monitoring system based on LSTM-CNN
CN112133441A (en) * 2020-08-21 2020-12-25 广东省人民医院 Establishment method and terminal of MH post-operation fissure hole state prediction model
CN112133441B (en) * 2020-08-21 2024-05-03 广东省人民医院 Method and terminal for establishing MH postoperative crack state prediction model
CN112418354A (en) * 2020-12-15 2021-02-26 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN112418354B (en) * 2020-12-15 2022-07-15 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN113239190A (en) * 2021-04-27 2021-08-10 天九共享网络科技集团有限公司 Document classification method and device, storage medium and electronic equipment
CN113239190B (en) * 2021-04-27 2024-02-20 天九共享网络科技集团有限公司 Document classification method, device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN109582794A (en) Long article classification method based on deep learning
CN106502985B (en) neural network modeling method and device for generating titles
Song et al. Research on text classification based on convolutional neural network
CN109886020A (en) Software vulnerability automatic classification method based on deep neural network
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN109376242A (en) Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN106815369A (en) A kind of file classification method based on Xgboost sorting algorithms
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN108388554B (en) Text emotion recognition system based on collaborative filtering attention mechanism
CN112231562A (en) Network rumor identification method and system
CN107832458A (en) A kind of file classification method based on depth of nesting network of character level
CN108664512B (en) Text object classification method and device
CN110069627A (en) Classification method, device, electronic equipment and the storage medium of short text
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN111046183A (en) Method and device for constructing neural network model for text classification
CN105787121B (en) A kind of microblogging event summary extracting method based on more story lines
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN109918507B (en) textCNN (text-based network communication network) improved text classification method
CN107357895B (en) Text representation processing method based on bag-of-words model
CN112070139A (en) Text classification method based on BERT and improved LSTM
CN113590764A (en) Training sample construction method and device, electronic equipment and storage medium
CN112148868A (en) Law recommendation method based on law co-occurrence
CN114996467A (en) Knowledge graph entity attribute alignment algorithm based on semantic similarity
Huang A CNN model for SMS spam detection
CN111046177A (en) Automatic arbitration case prejudging method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210044 No. 219 Ningliu Road, Jiangbei New District, Nanjing City, Jiangsu Province

Applicant after: Nanjing University of Information Science and Technology

Address before: 211500 Yuting Square, 59 Wangqiao Road, Liuhe District, Nanjing City, Jiangsu Province

Applicant before: Nanjing University of Information Science and Technology

RJ01 Rejection of invention patent application after publication

Application publication date: 20190405
