CN109582794A - Long article classification method based on deep learning - Google Patents

Long article classification method based on deep learning

Info

Publication number
CN109582794A
CN109582794A (application CN201811440171.0A)
Authority
CN
China
Prior art keywords
deep learning
text
long article
classification
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811440171.0A
Other languages
Chinese (zh)
Inventor
冯姣
姜恬静
何军
李鹏
刘�文
于正威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201811440171.0A priority Critical patent/CN109582794A/en
Publication of CN109582794A publication Critical patent/CN109582794A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a long-text classification method based on deep learning. The text is preprocessed and word vectors are generated, then valid sentences are sampled at random; the resulting data set is input into a model structure that combines a convolutional neural network with a long short-term memory network and is trained repeatedly for the set number of iterations, steadily reducing the loss function, finally yielding a trained deep-learning model. By randomly sampling valid sentences and building a combined CNN-LSTM model structure, the method extracts the features of a long text comprehensively, guarantees classification accuracy, accelerates training, and improves classification efficiency. The method obtains classification results quickly and correctly; the benefit is most pronounced for long texts of more than 7,000 words whose content is closely related and whose categories are fine-grained and complex.

Description

Long article classification method based on deep learning
Technical field
The present invention relates to long-text classification methods, and more particularly to a long-text classification method based on deep learning.
Background technique
With the continuous development of the internet and electronic technology, a large number of paper documents are stored on the internet in electronic form, and the internet has become the main platform through which people spread information. People hope to quickly obtain the information they want from massive document collections by keyword, which requires that paper documents be classified and labeled precisely.
In traditional classification problems, people generally choose to extract keywords or key sentences in order to classify text. For classification problems with clear features, such as spam filtering and automatic question answering, this approach has already been applied successfully. However, for paper documents, and especially for academic papers in similar research directions, the extracted key sentences may not be discriminative enough, so the resulting labels tend to be ambiguous. To extract more accurate features, the full text must be analyzed. A long short-term memory network (LSTM) can infer subsequent states from the states it has stored, but for articles of more than ten thousand words the data dimensionality is too large: analyzing and memorizing the full text with an LSTM alone easily exhausts resources, makes training very slow, and biases the analysis results. Convolutional neural networks (CNNs) have the advantages of local connectivity, weight sharing, and multi-feature extraction, which greatly reduce computational complexity and save training time and resources; however, CNNs are limited in learning long-range dependencies in sequence data. For long training texts, a CNN can extract local features but cannot memorize longer passages or associate sentences across a long article, so it cannot extract and analyze features comprehensively or guarantee the accuracy of long-text classification. A single CNN can hardly meet the needs of long-text classification. Therefore, in the field of long-text analysis, finding a training method that can learn full-text features to improve computation speed while guaranteeing accuracy and overcoming resource constraints becomes especially important.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to provide a long-text classification method based on deep learning that overcomes, in the field of long-text analysis, the problems that local feature extraction is not comprehensive enough and that memorizing an entire article involves too many dimensions. The method can learn full-text features to guarantee classification accuracy while effectively reducing computational complexity, shortening training time, and improving classification efficiency.
Technical solution: the long-text classification method based on deep learning according to the present invention comprises the following steps:
(1) Choose articles and obtain their text data, preprocess the data, generate a random word vector for each word and assign it a word-vector number, convert the words in the text into the corresponding word vectors, and obtain a purely numerical word-vector array file;
(2) Judge the average word length of each article, segment the long text by treating every L words as one valid sentence, randomly select K valid sentences to form an array X, and repeat this random selection N times to obtain the data set {X1, …, XN};
(3) Input the data set into a two-dimensional convolutional neural network for feature extraction, apply max pooling to the convolved data to obtain the max-pooled data yi, and repeat N times to obtain the N local feature values {y1, …, yN} of the long text;
(4) Input the local feature values into an LSTM network with N hidden layers, the input of each layer being a local feature value yi, and obtain the feature parameters of the whole text;
(5) Apply dropout to the feature parameters, discarding a certain proportion of the feature values;
(6) Obtain the predicted value ŷ from the data after dropout using the softmax algorithm, compute the loss function H between the true label y and the predicted value ŷ, and update the gradients backward with the Adam backpropagation algorithm to reduce the value of H;
(7) Repeat steps (3) to (6) to train the model, continuously reducing the value of H; save one deep-learning model after the set number of training iterations is reached, execute this repeatedly, and end the training after all articles have been traversed the set number of times;
(8) Obtain the trained deep-learning model.
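As a sanity check on the dimensions involved, the steps above can be traced at the level of array shapes. This sketch uses the embodiment's constants (L = 20, K = 20, N = 25, 4 classes); the embedding dimension D = 128 and the LSTM width are assumptions not stated in the patent.

```python
import numpy as np

# Constants from the method: L words per sentence, K sentences per group,
# N groups; D (embedding dim) and HIDDEN (LSTM width) are assumed values.
L_W, K_S, N_G, D, C, HIDDEN = 20, 20, 25, 128, 4, 128
FILTERS = {3: 64, 4: 64, 5: 64}          # step (3): kernel size -> filter depth

group_shape = (K_S * L_W, D)             # one sampled group X_i after embedding lookup
feat_len = sum(FILTERS.values())         # global max pooling keeps 1 value per filter
y = np.zeros((N_G, feat_len))            # the N local feature values y_1..y_N

h_T = np.zeros(HIDDEN)                   # step (4): LSTM output = whole-text features
logits = np.zeros(C)                     # step (6): dense layer + softmax over 4 classes
print(group_shape, feat_len, y.shape, h_T.shape, logits.shape)
```

With depth 64 per kernel size, each yi is a 192-dimensional vector, and the LSTM consumes the 25 of them as a length-25 sequence.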
Further, in step (6) the loss function H is the cross-entropy function, expressed as H = -Σᵢ yᵢ·log(ŷᵢ), where y is the true label of the input document and ŷ is the predicted value output by the model.
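The cross-entropy named above can be computed numerically. A minimal NumPy sketch follows; the logit values are arbitrary illustration values, not outputs of the patent's model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max before exponentiating."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    """H = -sum_i y_i * log(y_hat_i) for a one-hot true label y."""
    return -float(np.sum(y_true * np.log(y_pred)))

y_true = np.array([0.0, 1.0, 0.0, 0.0])        # true class 2 of the 4 categories
y_pred = softmax(np.array([0.5, 2.0, 0.1, -1.0]))
H = cross_entropy(y_true, y_pred)
print(round(H, 4))  # ≈ 0.3524
```

As the predicted probability of the true class approaches 1, H approaches 0, which is what the Adam updates in step (6) drive toward.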
Convolution kernels of different sizes extract text features more effectively: the convolutional layer of the two-dimensional convolutional neural network in step (3) contains filters of 3 sizes, 3×3, 4×4 and 5×5, each with a depth of 64 and a stride of 1.
In order to effectively delete useless information from the article, the preprocessing in step (1) includes deleting the document formatting, punctuation marks, special characters and numbers from the acquired text data.
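A minimal sketch of this preprocessing, assuming plain text already extracted from the PDF; the single regular expression shown is an illustration of the cleanup rule, not the patent's exact implementation.

```python
import re

def preprocess(text: str) -> list[str]:
    """Keep only alphabetic word tokens; formatting, punctuation,
    special characters and numbers are all removed, as in step (1)."""
    return re.findall(r"[A-Za-z]+", text.lower())

tokens = preprocess("Sec. 3.1: Quantum algebras & $q$-deformations (2018)!")
print(tokens)  # alphabetic words only
```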
In order to avoid over-fitting, the dropout ratio in step (5) is 0.5.
In order to balance complete feature extraction with optimization efficiency, the number of traversals (epochs) in step (7) is 5.
Beneficial effects: by randomly sampling valid sentences and building a model structure that combines a convolutional neural network with a long short-term memory network, this method extracts the features of a long text comprehensively, guarantees classification accuracy, accelerates training, and improves classification efficiency. The method obtains classification results quickly and correctly; the effect is most pronounced for long texts of more than 7,000 words whose content is closely related and whose categories are fine-grained and complex. In practical applications, the trained model is saved; after a long document is input, the system quickly and automatically outputs an accurate label for it.
Detailed description of the invention
Fig. 1 is the overall flow chart of this embodiment;
Fig. 2 is the overall framework of the convolutional neural network and long short-term memory network;
Fig. 3 is the accuracy comparison chart of different models.
Specific embodiment
This method is applicable not only to English text classification but also to the classification of texts in other languages. The embodiment takes English articles with 4 labels as an example. First, papers in 4 different categories are downloaded from the electronic preprint database operated by Cornell University for training: mathematics (quantum algebra), mathematics (metric geometry), mathematics (algebraic geometry) and mathematics (geometry), 38,309 papers in total.
As shown in Figure 1, the method for embodiment includes the next steps:
(1) Preprocess the downloaded English papers in PDF format. Preprocessing deletes the information in the text that will not be used, including document formatting, punctuation marks, special characters, and unconventional tokens such as numbers.
(2) Build a dictionary from the English words in the data, generate a random word vector for each word and assign it a word-vector number, and finally generate array files of pure numbers composed of the word-vector numbers. Each array file stores one article, and articles with the same class label are stored in a folder named after that label.
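The dictionary and random word vectors of this step can be sketched as follows; the embedding dimension of 128 is an assumption, since the patent does not state the vector size.

```python
import numpy as np

def build_vocab(tokens):
    """Assign each distinct word a number (its word-vector index)."""
    vocab = {}
    for w in tokens:
        vocab.setdefault(w, len(vocab))
    return vocab

def random_embeddings(vocab, dim=128, seed=0):
    """One random word vector per word, as described in step (2)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(vocab), dim))

tokens = ["metric", "geometry", "metric", "space"]
vocab = build_vocab(tokens)           # {'metric': 0, 'geometry': 1, 'space': 2}
ids = [vocab[w] for w in tokens]      # the pure-number article file
emb = random_embeddings(vocab)
print(ids, emb.shape)
```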
(3) Randomly sample the full text. According to the statistics of the data, every 20 words are treated on average as one valid sentence, so the 10,000 English words of one article are divided into 500 valid sentences, i.e. L = 20; articles shorter than 10,000 words are padded with zeros. According to the actual length of the articles in the data set, 20 valid sentences are randomly selected, i.e. K = 20, and the convolutional layer will capture the features of these 20 sentences simultaneously. The sentences of the same long document are sampled 25 times, i.e. N = 25. The random sampling process (see Fig. 3) divides the input text into N groups, denoted X1 to XN.
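The segmentation and sampling just described can be sketched as below, with L = 20, K = 20 and N = 25 as in the embodiment; padding with word id 0 is an assumed convention for the zero-fill of short articles.

```python
import numpy as np

L_WORDS, K_SENT, N_GROUPS = 20, 20, 25

def split_sentences(word_ids, L=L_WORDS, target_len=10_000):
    """Zero-pad the article up to target_len, then cut it into L-word sentences."""
    padded = word_ids + [0] * (target_len - len(word_ids))   # assumes len <= target_len
    return [padded[i:i + L] for i in range(0, target_len, L)]  # 500 sentences

def sample_groups(sentences, K=K_SENT, N=N_GROUPS, seed=0):
    """Randomly draw K valid sentences, repeated N times -> N groups X1..XN."""
    rng = np.random.default_rng(seed)
    return np.stack([
        np.array([sentences[j]
                  for j in rng.choice(len(sentences), size=K, replace=False)])
        for _ in range(N)
    ])

article = list(range(1, 7001))            # a 7,000-word article as dummy word ids
groups = sample_groups(split_sentences(article))
print(groups.shape)  # (25, 20, 20): N groups of K sentences of L word ids
```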
(4) Divide the data into a training set and a validation set at a ratio of 9:1. The 34,479 articles of the training set are used to train and optimize the model; the 3,830 articles of the validation set are used to verify the actual classification performance of the optimized model.
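A sketch of the 9:1 split over the 38,309 article files; the file-naming scheme and the shuffle seed are illustrative assumptions.

```python
import random

def split_dataset(paths, seed=0):
    """Shuffle the article files and split them 9:1 into training and validation."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_val = len(paths) // 10              # 3,830 of the 38,309 articles
    return paths[n_val:], paths[:n_val]   # (training set, validation set)

files = [f"doc_{i}.npy" for i in range(38309)]   # one array file per article (assumed names)
train, val = split_dataset(files)
print(len(train), len(val))  # 34479 3830, matching the embodiment
```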
(5) Input the training-set data into the deep-learning model. As shown in Fig. 2, the convolutional neural network first performs local feature extraction on the 25 groups of valid sentences of the input article. This embodiment sets 3 filters of different sizes, 3×3, 4×4 and 5×5, each with a depth of 64 and a stride of 1; convolution kernels of different sizes extract text features more effectively. Pooling is then applied to the convolved data by the max-pooling method, with yi denoting the max-pooled data of the differently sized convolution kernels. Integrating the pooled data of each group, every article yields 25 groups of local feature values, denoted y1 to y25.
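A minimal NumPy sketch of this local feature extraction: valid 2-D convolution with stride 1 for each kernel size, then global max pooling to one value per filter. To keep the example small, the depth is reduced from 64 to 2 filters per size and the group matrix is a toy stand-in for the real (K·L words × embedding) array.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid 2-D correlation with stride 1 (no padding)."""
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

def local_features(group, kernels):
    """Max-pool each filter's response map to one scalar; concatenate all filters."""
    return np.array([conv2d_valid(group, k).max() for k in kernels])

rng = np.random.default_rng(0)
group = rng.normal(size=(40, 16))    # toy stand-in for one group's word-vector matrix
kernels = [rng.normal(size=(s, s)) for s in (3, 4, 5) for _ in range(2)]
y_i = local_features(group, kernels)
print(y_i.shape)  # (6,): one max-pooled value per filter
```

With the embodiment's 64 filters per size, the same construction gives a 192-dimensional yi per group.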
(6) Input these local feature values into the LSTM network to memorize the full-text context and obtain higher-level abstract features. The input of each LSTM layer is a CNN local feature value yi, and the number of LSTM hidden layers is kept consistent with the value of N above, i.e. 25. Through the LSTM, the feature parameters of the whole text are obtained and input into a fully connected layer, where a dropout ratio of 0.5 randomly discards part of the data to avoid over-fitting. This random sampling and traversal of the full text ensures that the full content of the article is extracted effectively and improves the network's ability to analyze and memorize the whole text.
(7) Finally, obtain the predicted value ŷ from the data after dropout using the softmax algorithm and compare it with the true class label y of the input text. Compute the cross-entropy H = -Σᵢ yᵢ·log(ŷᵢ) and define H as the loss function, where y is the true label class of the input document and ŷ is the predicted value output by the model; y is a number. Four classes of documents were used in the experiments of this patent: mathematics (quantum algebra), mathematics (metric geometry), mathematics (algebraic geometry) and mathematics (geometry). In the experiments, the label class of mathematics (quantum algebra) is 1, mathematics (metric geometry) is 2, mathematics (algebraic geometry) is 3 and mathematics (geometry) is 4; the classes are simply numbered in order, converting text into numbers so the computer can read and compute them. The Adam backpropagation algorithm updates the gradients backward to reduce the value of H, narrowing the gap between the label y and the predicted value ŷ and improving accuracy. A deep-learning model is saved every 100 training steps, including the weight parameters w and bias parameters b of the convolutional neural network and the LSTM network, and the loss function and accuracy are computed at every 100th step. This embodiment ends the training after all articles have been traversed 5 times, i.e. after 172,395 repetitions of the above steps. The loss functions and classification accuracies of all saved networks are observed, and the network with the highest accuracy is chosen as the optimal model, giving the trained deep-learning model.
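The Adam update driving this training can be sketched for a single parameter tensor. The toy quadratic loss standing in for H and the hyperparameters are illustrative defaults, not the patent's values; the checkpoint-every-100-steps comment mirrors the embodiment.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy loss H(w) = 0.5 * ||w||^2 with gradient g = w, so H should keep shrinking.
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
losses = []
for t in range(1, 501):
    g = w
    w, m, v = adam_step(w, g, m, v, t)
    losses.append(0.5 * float(w @ w))
    # the embodiment saves a model checkpoint every 100 training steps:
    # if t % 100 == 0: save_checkpoint(w)   (save_checkpoint is hypothetical)
print(losses[0] > losses[-1])  # True: the value of H keeps decreasing
```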
Finally, the validation data are input into the network to obtain the predicted classes of the validation articles, which are compared with the article labels; the classification accuracy is computed and saved to assess the model's classification ability. Fig. 3 compares the accuracy of different parameters and different models, plotting the accuracy on the training-set text as trend curves and comparing the results with those of other classical classification methods. On the one hand, full-text analysis is more effective than local keyword analysis; on the other hand, under the same number of training iterations, the classification accuracy of a single CNN or LSTM model is much lower than that of the deep-learning model of this patent. The accuracy of this method on the training-set text is about 99%, and the accuracy on the validation-set text reaches 94%. Experiments show that this method can obtain the classification label of a long text quickly and accurately.
In practical applications, the text to be classified is input into the trained deep-learning model above, through which the accurate classification label of the text can be obtained quickly and automatically.

Claims (6)

1. A long article classification method based on deep learning, characterized by comprising the following steps:
(1) choosing articles and obtaining their text data, preprocessing the data, generating a random word vector for each word and assigning it a word-vector number, converting the words in the text into the corresponding word vectors, and obtaining a purely numerical word-vector array file;
(2) judging the average word length of each article, segmenting the long text by treating every L words as one valid sentence, randomly selecting K valid sentences to form an array X, and repeating this random selection N times to obtain a data set;
(3) inputting the data set into a two-dimensional convolutional neural network for feature extraction, applying max pooling to the convolved data to obtain the max-pooled data yi, and repeating N times to obtain the N local feature values of the long text;
(4) inputting the local feature values into an LSTM network with N hidden layers, the input of each layer being a local feature value yi, and obtaining the feature parameters of the whole text;
(5) applying dropout to the feature parameters, discarding a certain proportion of the feature values;
(6) obtaining the predicted value ŷ from the data after dropout using the softmax algorithm, computing the loss function H between the true label y and the predicted value ŷ, and updating the gradients backward with the Adam backpropagation algorithm to reduce the value of H;
(7) repeating steps (3) to (6) to train the model, continuously reducing the value of H, saving one deep-learning model after the set number of training iterations is reached, executing this repeatedly, and ending the training after all articles have been traversed the set number of times;
(8) obtaining the trained deep-learning model.
2. The long article classification method based on deep learning according to claim 1, characterized in that in step (6) the loss function H is the cross-entropy function, expressed as H = -Σᵢ yᵢ·log(ŷᵢ), wherein y is the true label of the input document and ŷ is the predicted value output by the model.
3. The long article classification method based on deep learning according to claim 1, characterized in that the convolutional layer of the two-dimensional convolutional neural network in step (3) contains filters of 3 sizes, 3×3, 4×4 and 5×5, each with a depth of 64 and a stride of 1.
4. The long article classification method based on deep learning according to claim 1, characterized in that the preprocessing in step (1) includes deleting the document formatting, punctuation marks, special characters and numbers from the acquired text data.
5. The long article classification method based on deep learning according to claim 1, characterized in that the dropout ratio in step (5) is 0.5.
6. The long article classification method based on deep learning according to claim 1, characterized in that the number of traversals in step (7) is 5.
CN201811440171.0A 2018-11-29 2018-11-29 Long article classification method based on deep learning Pending CN109582794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811440171.0A CN109582794A (en) 2018-11-29 2018-11-29 Long article classification method based on deep learning


Publications (1)

Publication Number Publication Date
CN109582794A true CN109582794A (en) 2019-04-05

Family

ID=65925069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811440171.0A Pending CN109582794A (en) 2018-11-29 2018-11-29 Long article classification method based on deep learning

Country Status (1)

Country Link
CN (1) CN109582794A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169035A (en) * 2017-04-19 2017-09-15 华南理工大学 A kind of file classification method for mixing shot and long term memory network and convolutional neural networks
US20170308790A1 (en) * 2016-04-21 2017-10-26 International Business Machines Corporation Text classification by ranking with convolutional neural networks
CN108875021A (en) * 2017-11-10 2018-11-23 云南大学 A kind of sentiment analysis method based on region CNN-LSTM


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021000411A1 (en) * 2019-07-04 2021-01-07 平安科技(深圳)有限公司 Neural network-based document classification method and apparatus, and device and storage medium
CN110532448A (en) * 2019-07-04 2019-12-03 平安科技(深圳)有限公司 Document Classification Method, device, equipment and storage medium neural network based
CN110457469A (en) * 2019-07-05 2019-11-15 中国平安财产保险股份有限公司 Information classification approach, device based on shot and long term memory network, computer equipment
CN110609898A (en) * 2019-08-19 2019-12-24 中国科学院重庆绿色智能技术研究院 Self-classification method for unbalanced text data
CN110633470A (en) * 2019-09-17 2019-12-31 北京小米智能科技有限公司 Named entity recognition method, device and storage medium
CN110879934A (en) * 2019-10-31 2020-03-13 杭州电子科技大学 Efficient Wide & Deep learning model
CN110879934B (en) * 2019-10-31 2023-05-23 杭州电子科技大学 Text prediction method based on Wide & Deep learning model
CN111538840A (en) * 2020-06-23 2020-08-14 基建通(三亚)国际科技有限公司 Text classification method and device
CN112069379A (en) * 2020-07-03 2020-12-11 中山大学 Efficient public opinion monitoring system based on LSTM-CNN
CN112133441A (en) * 2020-08-21 2020-12-25 广东省人民医院 Establishment method and terminal of MH post-operation fissure hole state prediction model
CN112133441B (en) * 2020-08-21 2024-05-03 广东省人民医院 Method and terminal for establishing MH postoperative crack state prediction model
CN112418354A (en) * 2020-12-15 2021-02-26 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN112418354B (en) * 2020-12-15 2022-07-15 江苏满运物流信息有限公司 Goods source information classification method and device, electronic equipment and storage medium
CN113239190A (en) * 2021-04-27 2021-08-10 天九共享网络科技集团有限公司 Document classification method and device, storage medium and electronic equipment
CN113239190B (en) * 2021-04-27 2024-02-20 天九共享网络科技集团有限公司 Document classification method, device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN109582794A (en) Long article classification method based on deep learning
CN106502985B (en) neural network modeling method and device for generating titles
Song et al. Research on text classification based on convolutional neural network
CN109886020A (en) Software vulnerability automatic classification method based on deep neural network
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN109376242A (en) Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN106815369A (en) A kind of file classification method based on Xgboost sorting algorithms
CN110502753A (en) A kind of deep learning sentiment analysis model and its analysis method based on semantically enhancement
CN108388554B (en) Text emotion recognition system based on collaborative filtering attention mechanism
CN112231562A (en) Network rumor identification method and system
CN107832458A (en) A kind of file classification method based on depth of nesting network of character level
CN108664512B (en) Text object classification method and device
CN110069627A (en) Classification method, device, electronic equipment and the storage medium of short text
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN111046183A (en) Method and device for constructing neural network model for text classification
CN105787121B (en) A kind of microblogging event summary extracting method based on more story lines
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN109918507B (en) textCNN (text-based network communication network) improved text classification method
CN107357895B (en) Text representation processing method based on bag-of-words model
CN112070139A (en) Text classification method based on BERT and improved LSTM
CN113590764A (en) Training sample construction method and device, electronic equipment and storage medium
CN112148868A (en) Law recommendation method based on law co-occurrence
CN114996467A (en) Knowledge graph entity attribute alignment algorithm based on semantic similarity
Huang A CNN model for SMS spam detection
CN111046177A (en) Automatic arbitration case prejudging method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210044 No. 219 Ningliu Road, Jiangbei New District, Nanjing City, Jiangsu Province

Applicant after: Nanjing University of Information Science and Technology

Address before: 211500 Yuting Square, 59 Wangqiao Road, Liuhe District, Nanjing City, Jiangsu Province

Applicant before: Nanjing University of Information Science and Technology

RJ01 Rejection of invention patent application after publication

Application publication date: 20190405
