CN110347832A - Transformer-based long text quality analysis method - Google Patents

Transformer-based long text quality analysis method

Info

Publication number
CN110347832A
Authority
CN
China
Prior art keywords
data
text
sentence
analysis method
long text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910583213.4A
Other languages
Chinese (zh)
Inventor
田文洪
莫中勤
曾柯铭
张朝阳
舒展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910583213.4A
Publication of CN110347832A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 - Overlay text, e.g. embedded captions in a TV program
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a transformer-based long text quality analysis method. Its main characteristic is an end-to-end model analysis method designed with long-text analysis capability, which overcomes the slow training of current sequential text models and adapts to the characteristics of the data. Its specific steps include: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.

Description

Transformer-based long text quality analysis method
Technical field
The present invention relates to the field of computer technology, and in particular to a transformer-based long text quality analysis method.
Background technique
With the development of the information society, the massive amount of text on the web, such as Wikipedia entries of questioned authority and XML text, as well as the large number of domestic graduation theses, all suffer from problems such as incoherent sentences, substandard wording and excessive repetition; if all of this were edited manually, the workload would be enormous.
In 2017, an end-to-end neural network method based on bidirectional LSTM was proposed abroad for this task on Wikipedia. However, unstructured text data is relatively difficult for a computer to characterize, and overly long text causes existing models to suffer from vanishing or exploding gradients and thus lose the ability to extract text features. On the Wikipedia entry dataset they collected, with the maximum truncation length set to 2000, the accuracy value is currently 0.68.
Existing undergraduate and master's theses are longer than Wikipedia entries, while currently designed models mostly target short texts of around 300 words; long texts such as theses, with tens of thousands of words, are even harder to analyze. In our previous work on long text quality analysis, a CNN model achieved an F1 value of 0.92 on Chinese text, but its disadvantage is that it splits the long text into short texts for analysis and does not characterize the global features of the long text well.
Before 2018, CNN and LSTM were the main text feature extractors in natural language processing; with continuous technological development, the transformer has become the best feature extractor at present, characterized by fast computation and parallelizability.
From the above analysis, the main problems at present are as follows:
Most existing natural language processing models are designed for short text analysis and lack long-text analysis capability; problems such as exploding gradients may arise and harm the generalization ability of the final model.
In short text analysis, RNN structures already train slowly, and this problem is amplified when they are applied to long text.
Summary of the invention
In order to solve at least one of the above technical problems, the present invention provides a transformer-based long text quality analysis method, which solves the problem that long text quality is difficult to assess.
A transformer-based long text quality analysis method, comprising: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.
Further, in the data recognition step, OCR technology is used to extract the content of the text portions of the PDF.
Further, in the data representation step, the text is split into sentences and segmented into words, a token vocabulary is built, the text is mapped to vocabulary indices, and the special BOS and EOS indices are added before and after each sentence.
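A minimal sketch of this data representation step is given below. The patent does not name a word segmenter or specific sentence delimiters; the use of jieba and of Chinese end punctuation here are assumptions made only for illustration.

```python
# Sketch of the data representation step: sentence splitting, word
# segmentation, vocabulary indexing, and BOS/EOS markers.
# Assumptions: jieba for Chinese word segmentation, split on 。！？.
import re
import jieba

BOS, EOS, PAD, UNK = "<BOS>", "<EOS>", "<PAD>", "<UNK>"

def split_sentences(text):
    # Split on common Chinese sentence-ending punctuation.
    return [s for s in re.split(r"[。！？]", text) if s.strip()]

def build_vocab(sentences):
    vocab = {PAD: 0, UNK: 1, BOS: 2, EOS: 3}
    for sent in sentences:
        for tok in jieba.lcut(sent):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_sentence(sent, vocab):
    # Map tokens to vocabulary indices and frame the sentence with BOS/EOS.
    ids = [vocab.get(tok, vocab[UNK]) for tok in jieba.lcut(sent)]
    return [vocab[BOS]] + ids + [vocab[EOS]]
```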
Further, in the data labeling step, the upload time of each thesis is used to derive its quality grade: excellent, good, or poor.
Further, in the data characterization step, the matched models are, respectively: a long text model, used to extract features of the quality of the body content of the thesis; and a short text model, used to extract features from Chinese and English papers such as the research achievements produced during the master's program.
Further, the long text model is specifically composed of a transformer feature extraction module and a memory module. The transformer module extracts sentence features, and the memory module performs forgetting and selection on the sentence features.
Further, the transformer feature extraction module is mainly composed of a feed-forward network and self-attention: the feed-forward network extracts word-vector features, and self-attention extracts the features of the relations between words.
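The following is a sketch of such a transformer feature extraction block, combining self-attention over the words of a sentence with a position-wise feed-forward network. The hidden sizes, number of heads and residual/normalization layout are illustrative assumptions, not values specified by the patent.

```python
# Sketch of a transformer feature extraction block: self-attention models
# word-to-word relations, the feed-forward network refines per-word features.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # x: (batch, seq_len, d_model) word vectors of one sentence.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```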
Further, in the memory module, the features of the memory unit are used to weight the features of the sentence currently being analyzed.
Further, the memory unit mainly performs feature forgetting and selection on the features of the sentences analyzed before the current sentence.
Further, the feature weighting of the sentence currently being analyzed is mainly based on an attention mechanism: the current sentence features are used as the query vector and the value vector, the sentence vectors before the current sentence are used as the key vector, and the sentence vector after memory feature extraction is thus obtained.
Further, in the data classification step, the features output by the long text model and the short text model are weighted and used as the input of a fully connected layer to classify the text quality grade.
The present invention has the following advantages:
(1) A method with long-text analysis capability is designed.
(2) Its computation is fast and can be parallelized, overcoming the slow training of RNNs.
(3) By using an end-to-end neural network, the method saves a large amount of manual work.
(4) The method extracts features better than CNN or RNN, so the model has better generalization ability.
Detailed description of the invention
Fig. 1 Workflow diagram of the long text quality analysis method of the invention
Fig. 2 Training procedure of the long text model and the short text model
Fig. 3 Structure of the long text model
Fig. 4 Structure of the memory unit in the long text model
Specific embodiment
Illustrative embodiments of the disclosure are described more fully below with reference to the accompanying drawings. Although illustrative embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be realized in various forms and is not limited to the embodiments illustrated here. On the contrary, these embodiments are provided so that the disclosure can be thoroughly understood and its scope fully conveyed to those skilled in the art.
Fig. 1 shows the workflow of the transformer-based long text quality analysis method of the invention. The method includes: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.
In specific step S12, data recognition, OCR technology is used to complete work such as character feature extraction and text localization; a character recognition model is built on a convolutional neural network (CNN), and a statistical language model is finally combined to improve the result. This works better than converting the PDF with a traditional PDF parser.
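The patent's own pipeline builds a CNN-based character recognizer combined with a statistical language model; as a simplified stand-in, the sketch below only shows the surrounding plumbing, rendering each PDF page to an image and running an off-the-shelf OCR engine. The pdf2image and pytesseract libraries and the chi_sim language pack are assumptions for illustration, not components named by the patent.

```python
# Simplified stand-in for step S12: render PDF pages and OCR them.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path, dpi=300):
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    texts = [pytesseract.image_to_string(p, lang="chi_sim") for p in pages]
    return "\n".join(texts)
```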
Specific step S13, data representation, and step S14, data labeling, mainly preprocess the data into model input data. The main process is as follows. The extracted text is divided into the body text and the research-achievement text of the master's program. The body text is segmented into words, a vocabulary is built, and the words are mapped to vocabulary indices; the text is then split into sentences and special marks are added at the beginning and end of each sentence, forming the long text sequence. The research-achievement text of the master's program is segmented, a vocabulary is built, and the text is mapped to vocabulary indices, forming the short text sequence. In addition, the thesis upload time is used to infer the quality grade, excellent, good, or poor; a vocabulary of the three grades is built and mapped, forming the label. The long text sequence, the short text sequence and the label form a standard sample data structure, on top of which a further layer of data encapsulation is applied to wrap the samples into a data iterator for the models below to use.
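A minimal sketch of steps S13/S14 follows: deriving the grade label and wrapping (long sequence, short sequence, label) samples into a batch iterator. The upload-time thresholds are purely hypothetical (the patent only states that the grade is inferred from the upload time), and the use of torch DataLoader with a padding collate function is likewise an assumption.

```python
# Sketch: grade label derivation and a data iterator over standard samples.
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

GRADES = {"excellent": 0, "good": 1, "poor": 2}

def grade_from_upload_time(days_before_deadline):
    # Hypothetical rule: earlier upload -> higher assumed quality grade.
    if days_before_deadline >= 30:
        return GRADES["excellent"]
    if days_before_deadline >= 7:
        return GRADES["good"]
    return GRADES["poor"]

class ThesisDataset(Dataset):
    def __init__(self, samples):
        # Each sample: (long_seq: list[int], short_seq: list[int], label: int)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        long_seq, short_seq, label = self.samples[i]
        return (torch.tensor(long_seq), torch.tensor(short_seq),
                torch.tensor(label))

def collate(batch):
    # Pad variable-length sequences so they can be batched.
    longs, shorts, labels = zip(*batch)
    return (pad_sequence(longs, batch_first=True),
            pad_sequence(shorts, batch_first=True),
            torch.stack(labels))

samples = [([2, 10, 11, 3], [2, 20, 3], GRADES["good"])]  # toy example
loader = DataLoader(ThesisDataset(samples), batch_size=8,
                    shuffle=True, collate_fn=collate)
```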
In specific step S15, data characterization, the characterization part is divided into a long text model and a short text model.
The long text model, as shown in Fig. 3, specifically consists of an embedding layer, a memory unit over the preceding text, and a transformer extraction layer. The embedding layer converts the word indices of the long text sequence into word vectors and adds positional encoding features; the memory unit holds the abstract features of the preceding text and acts on the word vectors of the current sentence through an attention mechanism; the transformer then extracts the features of the current sentence. The memory unit is applied to the current sentence vector here, and as in Fig. 3 the memory-weighted sentence vector can be represented as

    softmax(Q K^T / sqrt(d_k)) V

In the above formula, T_{i-1} is the feature representation of the previous sentence and S_i is the feature representation of the current sentence; specifically, in Fig. 3, T_{i-1} is the query vector Q, while the key vector K and the value vector V are the vector S_i.
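The sketch below assembles the long text model of Fig. 3 under the formula above: per sentence, an embedding layer with positional encoding, a memory unit that attends from the previous memory T_{i-1} (query) over the current sentence S_i (key and value), and a transformer layer that extracts the current sentence feature. The layer sizes, the single-layer encoder, the additive injection of the memory-weighted vector, and the mean-pooled memory update are illustrative assumptions rather than details given by the patent.

```python
# Sketch of the long text model: embedding + positional encoding,
# memory-unit attention (Q = T_{i-1}, K = V = S_i), transformer extraction.
import math
import torch
import torch.nn as nn

class LongTextModel(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.memory_attn = nn.MultiheadAttention(d_model, n_heads,
                                                 batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=1024,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.d_model = d_model

    def positional_encoding(self, length, device):
        pos = torch.arange(length, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2, device=device)
                        * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(length, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, sentences):
        # sentences: list of 1-D LongTensors, one per sentence (token indices)
        memory = None  # T_{i-1}: abstract feature of the preceding text
        for sent in sentences:
            s = self.embed(sent).unsqueeze(0)               # (1, len, d)
            s = s + self.positional_encoding(s.size(1), s.device)
            if memory is not None:
                # Memory unit: Q = T_{i-1}, K = V = S_i.
                weighted, _ = self.memory_attn(memory, s, s)
                s = s + weighted   # inject memory-weighted features (assumption)
            h = self.encoder(s)                             # transformer extraction
            memory = h.mean(dim=1, keepdim=True)            # new T_i, (1, 1, d)
        return memory.squeeze(0).squeeze(0)                 # document-level feature
```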
The short text model specifically performs sequential analysis of the above short text using a BiLSTM to obtain its features.
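A small sketch of such a BiLSTM short text model is shown below; the embedding and hidden sizes and the use of the final hidden states of both directions as the feature are illustrative assumptions.

```python
# Sketch of the short text model: BiLSTM over the short text sequence.
import torch
import torch.nn as nn

class ShortTextModel(nn.Module):
    def __init__(self, vocab_size, d_emb=128, d_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb, padding_idx=0)
        self.bilstm = nn.LSTM(d_emb, d_hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) indices of the short text sequence
        x = self.embed(tokens)
        _, (h_n, _) = self.bilstm(x)                 # h_n: (2, batch, d_hidden)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * d_hidden)
```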
In specific step S16, data classification, the features extracted by the two models are weighted and then classified by a fully connected layer. In the training stage, as in Fig. 2, the output is compared with the true label, the loss is computed, and the parameters of the two models are updated so that the models improve. In the prediction stage, the classification result is output directly, completing the classification of the quality of the entire long text.
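The sketch below illustrates this step: the two feature vectors are weighted, concatenated and passed to a fully connected layer for the three-grade classification, with cross-entropy against the true label during training. The learnable scalar weights and the optimizer usage shown are assumptions; the patent only specifies weighting followed by a fully connected layer.

```python
# Sketch of step S16: weighted feature fusion + fully connected classifier.
import torch
import torch.nn as nn

class QualityClassifier(nn.Module):
    def __init__(self, d_long, d_short, n_classes=3):
        super().__init__()
        self.w_long = nn.Parameter(torch.tensor(0.5))
        self.w_short = nn.Parameter(torch.tensor(0.5))
        self.fc = nn.Linear(d_long + d_short, n_classes)

    def forward(self, long_feat, short_feat):
        fused = torch.cat([self.w_long * long_feat,
                           self.w_short * short_feat], dim=-1)
        return self.fc(fused)   # logits over excellent / good / poor

# Training step (illustrative):
# logits = classifier(long_model_out, short_model_out)
# loss = nn.CrossEntropyLoss()(logits, labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```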

Claims (10)

1. A transformer-based long text quality analysis method, characterized by: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.
2. The long text quality analysis method according to claim 1, characterized in that, in the data recognition step, OCR technology is used to extract the content of the text portions of the PDF.
3. The long text quality analysis method according to claim 1, characterized in that, in the data representation step, the text is split into sentences and segmented into words, a token vocabulary is built, the text is mapped to vocabulary indices, and the special BOS and EOS indices are added before and after each sentence.
4. The long text quality analysis method according to claim 1, characterized in that, in the data labeling step, the upload time of each thesis is used to derive its quality grade: excellent, good, or poor.
5. The long text quality analysis method according to claim 1, characterized in that, in the data characterization step, the matched models are, respectively:
a long text model, used to extract features of the quality of the body content of the thesis;
a short text model, used to extract features from Chinese and English papers such as the research achievements produced during the master's program.
6. The long text quality analysis method according to claim 5, characterized in that the long text model is specifically composed of a transformer feature extraction part and a memory part; the transformer extracts sentence features, and the memory part performs sentence-feature selection.
7. The long text quality analysis method according to claim 6, characterized in that the transformer feature extraction module is mainly composed of a feed-forward network and self-attention:
the feed-forward network extracts word-vector features, and self-attention extracts the features of the relations between words.
8. The long text quality analysis method according to claim 6, characterized in that the memory unit mainly performs feature forgetting and selection on the features of the sentences analyzed before the current sentence.
9. The long text quality analysis method according to claim 8, characterized in that the feature weighting of the sentence currently being analyzed is mainly based on an attention mechanism: the current sentence features are used as the query vector and the value vector, the sentence vectors before the current sentence are used as the key vector, and the sentence vector extracted through the memory features is thus obtained.
10. The long text quality analysis method according to claim 1, characterized in that, in the data classification step, the features output by the long text model and the short text model are weighted and used as the input of a fully connected layer to classify the text quality grade.
CN201910583213.4A 2019-07-01 2019-07-01 Transformer-based long text quality analysis method Pending CN110347832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910583213.4A CN110347832A (en) 2019-07-01 2019-07-01 Transformer-based long text quality analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910583213.4A CN110347832A (en) 2019-07-01 2019-07-01 Transformer-based long text quality analysis method

Publications (1)

Publication Number Publication Date
CN110347832A true CN110347832A (en) 2019-10-18

Family

ID=68177581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910583213.4A Pending CN110347832A (en) 2019-07-01 2019-07-01 Transformer-based long text quality analysis method

Country Status (1)

Country Link
CN (1) CN110347832A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN107133211A (en) * 2017-04-26 2017-09-05 中国人民大学 A kind of composition methods of marking based on notice mechanism
CN109543824A (en) * 2018-11-30 2019-03-29 腾讯科技(深圳)有限公司 A kind for the treatment of method and apparatus of series model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOMIN ZHANG et al.: "AHNN: An Attention-based Hybrid Neural Network for Sentence Modeling", Springer International Publishing AG *
张谦 et al.: "基于Word2vec的微博短文本分类研究" [Research on Weibo short text classification based on Word2vec], 《技术研究》 [Technology Research] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028934A (en) * 2019-12-23 2020-04-17 科大讯飞股份有限公司 Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium
CN111028934B (en) * 2019-12-23 2022-02-18 安徽科大讯飞医疗信息技术有限公司 Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium
CN111522946A (en) * 2020-04-22 2020-08-11 成都中科云集信息技术有限公司 Paper quality evaluation method based on attention long-short term memory recurrent neural network

Similar Documents

Publication Publication Date Title
CN109359293B (en) Mongolian name entity recognition method neural network based and its identifying system
CN107766371B (en) Text information classification method and device
CN109063159B (en) Entity relation extraction method based on neural network
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN106570179B (en) A kind of kernel entity recognition methods and device towards evaluation property text
CN108984745A (en) A kind of neural network file classification method merging more knowledge mappings
CN107133213A (en) A kind of text snippet extraction method and system based on algorithm
CN110019839A (en) Medical knowledge map construction method and system based on neural network and remote supervisory
CN105631479A (en) Imbalance-learning-based depth convolution network image marking method and apparatus
CN108664474A (en) A kind of resume analytic method based on deep learning
CN109492230A (en) A method of insurance contract key message is extracted based on textview field convolutional neural networks interested
CN107145573A (en) The problem of artificial intelligence customer service robot, answers method and system
CN107273295A (en) A kind of software problem reporting sorting technique based on text randomness
CN106529525A (en) Chinese and Japanese handwritten character recognition method
CN111274814A (en) Novel semi-supervised text entity information extraction method
CN108829823A (en) A kind of file classification method
CN110347832A (en) Transformer-based long text quality analysis method
CN105630772A (en) Method for extracting webpage comment content
CN105117740A (en) Font identification method and device
CN102880631A (en) Chinese author identification method based on double-layer classification model, and device for realizing Chinese author identification method
CN110083832A (en) Recognition methods, device, equipment and the readable storage medium storing program for executing of article reprinting relationship
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN111597328A (en) New event theme extraction method
CN105389303B (en) A kind of automatic fusion method of heterologous corpus
CN104992166A (en) Robust measurement based handwriting recognition method and system

Legal Events

Code  Event
PB01  Publication
SE01  Entry into force of request for substantive examination
WD01  Invention patent application deemed withdrawn after publication (Application publication date: 20191018)