CN110347832A - A kind of long text mass analysis method based on transformer - Google Patents
- Publication number
- CN110347832A
- Authority
- CN
- China
- Prior art keywords
- data
- text
- sentence
- analysis method
- long text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/635—Overlay text, e.g. embedded captions in a TV program
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a long-text quality analysis method based on the Transformer. Its main feature is an end-to-end model analysis method that has long-text analysis capability, overcomes the slow training of current text sequence models, and adapts to the characteristics of the data. The specific steps are: data acquisition, downloading theses from Hownet; data identification, extracting the text content of the PDFs; data representation, processing the text and mapping it to a data form that a computer can analyze; data labeling, obtaining quality-grade labels; data characterization, characterizing the data by designing models of comparable complexity; and data classification, weighting the different features produced by data characterization according to the data characteristics and classifying the data.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a long-text quality analysis method based on the Transformer.
Background technique
With the development of the information society, the authority of Wikipedia entries has been called into question. The massive volume of XML text on the web and the large number of domestic theses all exhibit problems such as disfluent sentences, substandard wording, and excessive repetition. Correcting these by manual editing would be a very large workload.
In 2017, work abroad on Wikipedia proposed an end-to-end neural network method for this field based on a bidirectional LSTM. However, unstructured data such as text is difficult to represent computationally, and overly long text causes existing models to suffer from vanishing or exploding gradients and thus lose the ability to extract text features. On the Wikipedia entry dataset they collected, the longest truncation length chosen was 2000, and the accuracy (acc) currently stands at 0.68.
Undergraduate and master's theses are longer than Wikipedia entries. Currently available models are designed for short texts of about 300 words, whereas a thesis runs to tens of thousands of words and is harder to analyze. The inventor has done some work on long-text quality analysis: a CNN model can reach an F1 value of 0.92 on Chinese text, but its drawback is that it splits the long text into short texts for analysis and does not represent the overall features of the long text well.
Before 2018, CNNs and LSTMs were the main text feature extractors in natural language processing. With continued technical development, the Transformer has become the best feature extractor currently available; it computes quickly and can be parallelized.
Based on the above analysis, the main problems at present are as follows:
Existing natural language processing models are mostly designed for short text and lack long-text analysis capability; gradient explosion can occur, which affects the generalization ability of the final model;
Even in short-text analysis, RNN structures train slowly, and this problem is amplified when they are applied to long text.
Summary of the invention
To solve at least one of the above technical problems, the present invention provides a long-text quality analysis method based on the Transformer, which addresses the difficulty of assessing long-text quality.
A long-text quality analysis method based on the Transformer, comprising: data acquisition, downloading theses from Hownet; data identification, extracting the text content of the PDFs; data representation, processing the text and mapping it to a data form that a computer can analyze; data labeling, obtaining quality-grade labels; data characterization, characterizing the data by designing models of comparable complexity; and data classification, weighting the different features produced by data characterization according to the data characteristics and classifying the data.
Further, in the data identification process, the content of the text portions of the PDF is extracted using OCR technology.
Further, in the data representation process, the text is split into sentences and segmented into words. A token vocabulary is built, the text is mapped to vocabulary indices, and special BOS and EOS indices are added before and after each sentence, respectively.
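The vocabulary mapping described here can be illustrated with a minimal sketch. It assumes the text has already been split into sentences and segmented into tokens; the special-token names, their index order, and the helper names `build_vocab` and `encode` are hypothetical and do not come from the patent.

```python
from collections import Counter

# Hypothetical special tokens; the names and their index order are assumptions.
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "<bos>", "<eos>"

def build_vocab(sentences, min_freq=1):
    """Count tokens and assign each one an integer index."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1, BOS: 2, EOS: 3}
    # Most frequent tokens first; ties broken alphabetically for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map a tokenized sentence to indices, wrapped in BOS ... EOS."""
    body = [vocab.get(tok, vocab[UNK]) for tok in sentence]
    return [vocab[BOS]] + body + [vocab[EOS]]

# Toy input: two already-segmented sentences.
sentences = [["the", "model", "works"], ["the", "text", "is", "long"]]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]
```

Tokens that are not in the vocabulary fall back to the unknown-token index, which is a common convention rather than something the text specifies.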
Further, in the data labeling process, the paper's upload time is used to derive the quality grade of the paper: excellent, good, or poor.
Further, in the data characterization process, the models are, respectively: a long-text model, used to extract features from the quality of the main-body content of the thesis; and a short-text model, used to extract features from the Chinese and English papers and other research achievements produced during the master's program.
Further, the long-text model consists of a Transformer feature extraction module and a memory module. The Transformer module extracts sentence features, and the memory module forgets and selects sentence features.
Further, the Transformer feature extraction module mainly consists of a feed-forward network and self-attention: the feed-forward network extracts word-vector features, and self-attention extracts features between words.
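As a rough illustration of these two components, the sketch below implements single-head scaled dot-product self-attention followed by a position-wise feed-forward network in plain Python. The scaling, the ReLU activation, the toy weights, and the omission of residual connections and layer normalization are all assumptions made for brevity; the patent does not specify these details.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as sequences of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: every word attends to every word."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, list(zip(*k)))  # q times k transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def feed_forward(x, w1, w2):
    """Position-wise feed-forward network applied to each word vector."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

# Toy sentence of three words with two-dimensional word vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
w1 = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]]   # 2 -> 3 hidden units
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 -> 2 output units

attended, weights = self_attention(x, identity, identity, identity)
out = feed_forward(attended, w1, w2)
```

Each row of `weights` shows how strongly one word attends to the others, which is the word-to-word feature the text refers to.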
Further, in the memory module, the memory unit is used to weight the features of the sentence currently being analyzed.
Further, the memory unit mainly forgets and selects features of previously analyzed sentences.
Further, the feature weighting of the sentence currently being analyzed is mainly based on the attention mechanism: the current sentence's features serve as the query vector and value vector, and the sentence vector preceding the current sentence serves as the key vector, yielding the sentence vector after memory-feature extraction.
Further, in the data classification, the output features of the long-text model and the short-text model are weighted and used as the input of a fully connected layer to classify the text quality.
The advantages of the present invention are:
(1) A method with long-text analysis capability is designed.
(2) Its computation overcomes the slow training of RNNs and can be parallelized.
(3) The method uses an end-to-end neural network, which saves a large amount of manual labor.
(4) The method extracts features better than a CNN or RNN, so the model generalizes better.
Brief description of the drawings
Fig. 1 is the workflow diagram of the long-text quality analysis method of the invention
Fig. 2 is the training procedure diagram of the long-text model and short-text model
Fig. 3 is the structure diagram of the long-text model
Fig. 4 is the structure diagram of the memory unit in the long-text model
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood thoroughly and its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows the workflow of the long-text quality analysis method based on the Transformer according to the invention. The method comprises: data acquisition, downloading theses from Hownet; data identification, extracting the text content of the PDFs; data representation, processing the text and mapping it to a data form that a computer can analyze; data labeling, obtaining quality-grade labels; data characterization, characterizing the data by designing models of comparable complexity; and data classification, weighting the different features produced by data characterization according to the data characteristics and classifying the data.
In step S12, data identification, OCR technology is used to perform character feature extraction, text localization, and related tasks; a character recognition model is built on a convolutional neural network (CNN), and a statistical language model is then combined with it to improve the result. This gives better results than converting the PDF with a traditional PDF interpreter.
Steps S13 (data representation) and S14 (data labeling) mainly preprocess the data into model input data. The main process is as follows. The full extracted text is divided into the main-body text and the text on research achievements during the master's program. The main-body text is segmented into words, a vocabulary is built, and the text is mapped to vocabulary indices; it is then split into sentences, and special markers are added at the start and end of each sentence, forming the long-text sequence. The research-achievement text is segmented into words, a vocabulary is built, and the text is mapped to vocabulary indices, forming the short-text sequence. In addition, the paper's upload time is used to infer the quality grade (excellent, good, or poor); a corresponding vocabulary is built and the three grades are mapped, forming the label. The long-text sequence, short-text sequence, and label are assembled into a standard sample data structure, which is then wrapped in a further layer as a data iterator for convenient use by the models below.
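One possible shape for the sample structure and the data iterator is sketched below. The field names, the `Sample` class, the label indices, and the `batches` helper are hypothetical; the patent only states that the three parts are combined into a sample and wrapped in an iterator.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Hypothetical label vocabulary for the three quality grades.
LABELS = {"excellent": 0, "good": 1, "poor": 2}

@dataclass
class Sample:
    long_text: List[List[int]]  # one list of token indices per body sentence
    short_text: List[int]       # token indices of the research-achievement text
    label: int                  # index of the quality grade

def batches(samples: List[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Yield consecutive batches of samples; the last batch may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# Five toy samples with made-up indices.
samples = [
    Sample(long_text=[[2, 4, 5, 3], [2, 6, 3]], short_text=[7, 8], label=LABELS["good"])
    for _ in range(5)
]
batch_sizes = [len(batch) for batch in batches(samples, batch_size=2)]
```

Keeping the long text as a list of sentences, rather than one flat sequence, preserves the sentence boundaries that the long-text model processes one at a time.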
In step S15, data characterization, the data characterization part is divided into a long-text model and a short-text model.
As shown in Fig. 3, the long-text model comprises an embedding layer, a memory unit for the preceding text, and a Transformer extraction layer. The embedding layer converts the word indices of the long-text sequence into word vectors and adds positional encoding features to them. The memory unit for the preceding text holds the abstract features of the preceding text and applies them to the word vectors of the current sentence through the attention mechanism; the Transformer then extracts the features of the current sentence. The memory unit is applied to the current sentence vector as expressed by the following formula:
As in Fig. 3, the sentence vector can be expressed as (formula not reproduced in this text)
In the formula above, T_{i-1} is the feature representation of the previous sentence and S_i is the feature representation of the current sentence. Specifically, in Fig. 3, T_{i-1} is the query vector Q, and the key vector K and the value vector V are both the S_i vector.
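The role assignment in this paragraph (T_{i-1} as the query, S_i as both key and value) can be sketched as follows. The sketch assumes standard scaled dot-product attention and assumes that T_{i-1} is a single vector summarizing the previous sentence; neither detail is confirmed by the text, and the function name `attend` is hypothetical.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors, one output dimension at a time.
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

t_prev = [1.0, 0.0]                            # T_{i-1}: previous-sentence feature (query)
s_cur = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # S_i: word vectors of the current sentence

# K and V are both S_i, as in the paragraph above.
memory_vec, weights = attend(t_prev, s_cur, s_cur)
```

Words of the current sentence that align with the previous sentence's feature receive larger weights, which is one way to read the "forgetting and selection" behavior attributed to the memory unit.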
The short-text model uses a BiLSTM to perform sequential analysis on the short text described above and obtain its features.
In step S16, data classification, the features extracted by the two models are weighted and then classified through a fully connected layer. As shown in Fig. 2, in the training stage the output is compared with the true label, the loss is computed, and the parameters of both models are updated to improve the models. In the prediction stage the classification result is output directly, completing the quality classification of the entire long text.
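A minimal sketch of this classification step is given below. It assumes that the weighting is a scalar-weighted sum of two equal-length feature vectors, that the output passes through a softmax, and that the loss is cross-entropy; the patent does not state the exact weighting scheme or loss, and the function names, the weight `alpha`, and the toy parameters are hypothetical.

```python
import math

def classify(long_feat, short_feat, alpha, weight, bias):
    """Weight the two feature vectors, then apply one fully connected layer."""
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(long_feat, short_feat)]
    logits = [sum(w * f for w, f in zip(row, fused)) + b for row, b in zip(weight, bias)]
    # Softmax turns the logits into probabilities over the three grades.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Training loss: negative log-probability of the true grade."""
    return -math.log(probs[label])

long_feat = [1.0, 0.0]     # toy output of the long-text model
short_feat = [0.0, 1.0]    # toy output of the short-text model
weight = [[2.0, 1.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 grades x 2 features
bias = [0.0, 0.0, 0.0]

probs = classify(long_feat, short_feat, alpha=0.5, weight=weight, bias=bias)
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, label=0)
```

During training the loss would be backpropagated into both models; at prediction time only `predicted` is needed.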
Claims (10)
1. A long-text quality analysis method based on the Transformer, characterized by comprising: data acquisition, downloading theses from Hownet; data identification, extracting the text content of the PDFs; data representation, processing the text and mapping it to a data form that a computer can analyze; data labeling, obtaining quality-grade labels; data characterization, characterizing the data by designing models of comparable complexity; and data classification, weighting the different features produced by data characterization according to the data characteristics and classifying the data.
2. The long-text quality analysis method according to claim 1, characterized in that, in the data identification process, the content of the text portions of the PDF is extracted using OCR technology.
3. The long-text quality analysis method according to claim 1, characterized in that, in the data representation process, the text is split into sentences and segmented into words, a token vocabulary is built, the text is mapped to vocabulary indices, and special BOS and EOS indices are added before and after each sentence, respectively.
4. In the data labeling process, the paper's upload time is used to derive the quality grade of the paper: excellent, good, or poor.
5. The long-text quality analysis method according to claim 1, characterized in that, in the data characterization process, the models are, respectively:
a long-text model, used to extract features from the quality of the main-body content of the thesis;
a short-text model, used to extract features from the Chinese and English papers and other research achievements produced during the master's program.
6. The long-text quality analysis method according to claim 5, characterized in that the long-text model consists of a Transformer feature extraction part and a memory part; the Transformer extracts sentence features, and the memory part selects sentence features.
7. The long-text quality analysis method according to claim 6, characterized in that the Transformer feature extraction module mainly consists of a feed-forward network and self-attention:
the feed-forward network extracts word-vector features, and self-attention extracts features between words.
8. The long-text quality analysis method according to claim 8, characterized in that the memory unit mainly forgets and selects features of previously analyzed sentences.
9. The long-text quality analysis method according to claim 8, characterized in that the feature weighting of the sentence currently being analyzed is mainly based on the attention mechanism: the current sentence's features serve as the query vector and value vector, and the sentence vector preceding the current sentence serves as the key vector, yielding the sentence vector after memory-feature extraction.
10. The long-text quality analysis method according to claim 1, characterized in that, in the data classification, the output features of the long-text model and the short-text model are weighted and used as the input of a fully connected layer to classify the text quality.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910583213.4A CN110347832A (en) | 2019-07-01 | 2019-07-01 | A kind of long text mass analysis method based on transformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910583213.4A CN110347832A (en) | 2019-07-01 | 2019-07-01 | A kind of long text mass analysis method based on transformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110347832A true CN110347832A (en) | 2019-10-18 |
Family
ID=68177581
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910583213.4A Pending CN110347832A (en) | 2019-07-01 | 2019-07-01 | A kind of long text mass analysis method based on transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110347832A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111028934A (en) * | 2019-12-23 | 2020-04-17 | 科大讯飞股份有限公司 | Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium |
CN111522946A (en) * | 2020-04-22 | 2020-08-11 | 成都中科云集信息技术有限公司 | Paper quality evaluation method based on attention long-short term memory recurrent neural network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107133211A (en) * | 2017-04-26 | 2017-09-05 | 中国人民大学 | A kind of composition methods of marking based on notice mechanism |
US20180240012A1 (en) * | 2017-02-17 | 2018-08-23 | Wipro Limited | Method and system for determining classification of text |
CN109543824A (en) * | 2018-11-30 | 2019-03-29 | 腾讯科技(深圳)有限公司 | A kind for the treatment of method and apparatus of series model |
-
2019
- 2019-07-01 CN CN201910583213.4A patent/CN110347832A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180240012A1 (en) * | 2017-02-17 | 2018-08-23 | Wipro Limited | Method and system for determining classification of text |
CN107133211A (en) * | 2017-04-26 | 2017-09-05 | 中国人民大学 | A kind of composition methods of marking based on notice mechanism |
CN109543824A (en) * | 2018-11-30 | 2019-03-29 | 腾讯科技(深圳)有限公司 | A kind for the treatment of method and apparatus of series model |
Non-Patent Citations (2)
Title |
---|
XIAOMIN ZHANG et al.: "AHNN: An Attention-based Hybrid Neural Network for Sentence Modeling", 《SPRINGER INTERNATIONAL PUBLISHING AG》 * |
张谦 et al.: "Research on microblog short-text classification based on Word2vec", 《技术研究》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111028934A (en) * | 2019-12-23 | 2020-04-17 | 科大讯飞股份有限公司 | Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium |
CN111028934B (en) * | 2019-12-23 | 2022-02-18 | 安徽科大讯飞医疗信息技术有限公司 | Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium |
CN111522946A (en) * | 2020-04-22 | 2020-08-11 | 成都中科云集信息技术有限公司 | Paper quality evaluation method based on attention long-short term memory recurrent neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20191018 |