CN110347832A - Transformer-based long text quality analysis method - Google Patents

Transformer-based long text quality analysis method

Info

Publication number
CN110347832A
Authority
CN
China
Prior art keywords
data
text
sentence
analysis method
long text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910583213.4A
Other languages
Chinese (zh)
Inventor
田文洪
莫中勤
曾柯铭
张朝阳
舒展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910583213.4A
Publication of CN110347832A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/62 - Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 - Overlay text, e.g. embedded captions in a TV program
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a transformer-based long text quality analysis method. Its main characteristic is an end-to-end model analysis method designed with long-text analysis capability, which overcomes the slow training of current sequential text models and adapts to the characteristics of the data. Its specific steps include: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.

Description

Transformer-based long text quality analysis method
Technical field
The present invention relates to the field of computer technology, and in particular to a transformer-based long text quality analysis method.
Background technique
With the development of the information society, the massive amount of text on the web, such as Wikipedia entries of questioned authority and XML text, as well as the large number of domestic graduation theses, all suffer from problems such as incoherent sentences, substandard wording and excessive repetition; if all of this were edited manually, the workload would be enormous.
In 2017, an end-to-end neural network method based on bidirectional LSTM was proposed abroad for this task on Wikipedia. However, unstructured text data is relatively difficult for a computer to characterize, and overly long text causes existing models to suffer from vanishing or exploding gradients and thus lose the ability to extract text features. On the Wikipedia entry dataset they collected, with the maximum truncation length set to 2000, the accuracy value is currently 0.68.
Existing undergraduate and master's theses are longer than Wikipedia entries, while currently designed models mostly target short texts of around 300 words; long texts such as theses, with tens of thousands of words, are even harder to analyze. In our previous work on long text quality analysis, a CNN model achieved an F1 value of 0.92 on Chinese text, but its disadvantage is that it splits the long text into short texts for analysis and does not characterize the global features of the long text well.
Before 2018, CNN and LSTM were the main text feature extractors in natural language processing; with continuous technological development, the transformer has become the best feature extractor at present, characterized by fast computation and parallelizability.
From the above analysis, the main problems at present are as follows:
Most existing natural language processing models are designed for short text analysis and lack long-text analysis capability; problems such as exploding gradients may arise and harm the generalization ability of the final model.
In short text analysis, RNN structures already train slowly, and this problem is amplified when they are applied to long text.
Summary of the invention
In order to solve at least one of the above technical problems, the present invention provides a transformer-based long text quality analysis method, which solves the problem that long text quality is difficult to assess.
A transformer-based long text quality analysis method, comprising: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.
Further, in the data recognition step, OCR technology is used to extract the content of the text portions of the PDF.
Further, in the data representation step, the text is split into sentences and segmented into words, a token vocabulary is built, the text is mapped to vocabulary indices, and the special BOS and EOS indices are added before and after each sentence.
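A minimal sketch of this data representation step is given below. The patent does not name a word segmenter or specific sentence delimiters; the use of jieba and of Chinese end punctuation here are assumptions made only for illustration.

```python
# Sketch of the data representation step: sentence splitting, word
# segmentation, vocabulary indexing, and BOS/EOS markers.
# Assumptions: jieba for Chinese word segmentation, split on 。！？.
import re
import jieba

BOS, EOS, PAD, UNK = "<BOS>", "<EOS>", "<PAD>", "<UNK>"

def split_sentences(text):
    # Split on common Chinese sentence-ending punctuation.
    return [s for s in re.split(r"[。！？]", text) if s.strip()]

def build_vocab(sentences):
    vocab = {PAD: 0, UNK: 1, BOS: 2, EOS: 3}
    for sent in sentences:
        for tok in jieba.lcut(sent):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode_sentence(sent, vocab):
    # Map tokens to vocabulary indices and frame the sentence with BOS/EOS.
    ids = [vocab.get(tok, vocab[UNK]) for tok in jieba.lcut(sent)]
    return [vocab[BOS]] + ids + [vocab[EOS]]
```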
Further, in the data labeling step, the upload time of each thesis is used to derive its quality grade: excellent, good, or poor.
Further, in the data characterization step, the matched models are, respectively: a long text model, used to extract features of the quality of the body content of the thesis; and a short text model, used to extract features from Chinese and English papers such as the research achievements produced during the master's program.
Further, the long text model is specifically composed of a transformer feature extraction module and a memory module. The transformer module extracts sentence features, and the memory module performs forgetting and selection on the sentence features.
Further, the transformer feature extraction module is mainly composed of a feed-forward network and self-attention: the feed-forward network extracts word-vector features, and self-attention extracts the features of the relations between words.
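The following is a sketch of such a transformer feature extraction block, combining self-attention over the words of a sentence with a position-wise feed-forward network. The hidden sizes, number of heads and residual/normalization layout are illustrative assumptions, not values specified by the patent.

```python
# Sketch of a transformer feature extraction block: self-attention models
# word-to-word relations, the feed-forward network refines per-word features.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # x: (batch, seq_len, d_model) word vectors of one sentence.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```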
Further, in the memory module, the features of the memory unit are used to weight the features of the sentence currently being analyzed.
Further, the memory unit mainly performs feature forgetting and selection on the features of the sentences analyzed before the current sentence.
Further, the feature weighting of the sentence currently being analyzed is mainly based on an attention mechanism: the current sentence features are used as the query vector and the value vector, the sentence vectors before the current sentence are used as the key vector, and the sentence vector after memory feature extraction is thus obtained.
Further, in the data classification step, the features output by the long text model and the short text model are weighted and used as the input of a fully connected layer to classify the text quality grade.
The present invention has the following advantages:
(1) A method with long-text analysis capability is designed.
(2) Its computation is fast and can be parallelized, overcoming the slow training of RNNs.
(3) By using an end-to-end neural network, the method saves a large amount of manual work.
(4) The method extracts features better than CNN or RNN, so the model has better generalization ability.
Detailed description of the invention
Fig. 1 Workflow diagram of the long text quality analysis method of the invention
Fig. 2 Training procedure of the long text model and the short text model
Fig. 3 Structure of the long text model
Fig. 4 Structure of the memory unit in the long text model
Specific embodiment
Illustrative embodiments of the disclosure are described more fully below with reference to the accompanying drawings. Although illustrative embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be realized in various forms and is not limited to the embodiments illustrated here. On the contrary, these embodiments are provided so that the disclosure can be thoroughly understood and its scope fully conveyed to those skilled in the art.
Fig. 1 shows the workflow of the transformer-based long text quality analysis method of the invention. The method includes: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.
In specific step S12, data recognition, OCR technology is used to complete work such as character feature extraction and text localization; a character recognition model is built on a convolutional neural network (CNN), and a statistical language model is finally combined to improve the result. This works better than converting the PDF with a traditional PDF parser.
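The patent's own pipeline builds a CNN-based character recognizer combined with a statistical language model; as a simplified stand-in, the sketch below only shows the surrounding plumbing, rendering each PDF page to an image and running an off-the-shelf OCR engine. The pdf2image and pytesseract libraries and the chi_sim language pack are assumptions for illustration, not components named by the patent.

```python
# Simplified stand-in for step S12: render PDF pages and OCR them.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path, dpi=300):
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    texts = [pytesseract.image_to_string(p, lang="chi_sim") for p in pages]
    return "\n".join(texts)
```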
Specific step S13, data representation, and step S14, data labeling, mainly preprocess the data into model input data. The main process is as follows. The extracted text is divided into the body text and the research-achievement text of the master's program. The body text is segmented into words, a vocabulary is built, and the words are mapped to vocabulary indices; the text is then split into sentences and special marks are added at the beginning and end of each sentence, forming the long text sequence. The research-achievement text of the master's program is segmented, a vocabulary is built, and the text is mapped to vocabulary indices, forming the short text sequence. In addition, the thesis upload time is used to infer the quality grade, excellent, good, or poor; a vocabulary of the three grades is built and mapped, forming the label. The long text sequence, the short text sequence and the label form a standard sample data structure, on top of which a further layer of data encapsulation is applied to wrap the samples into a data iterator for the models below to use.
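A minimal sketch of steps S13/S14 follows: deriving the grade label and wrapping (long sequence, short sequence, label) samples into a batch iterator. The upload-time thresholds are purely hypothetical (the patent only states that the grade is inferred from the upload time), and the use of torch DataLoader with a padding collate function is likewise an assumption.

```python
# Sketch: grade label derivation and a data iterator over standard samples.
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

GRADES = {"excellent": 0, "good": 1, "poor": 2}

def grade_from_upload_time(days_before_deadline):
    # Hypothetical rule: earlier upload -> higher assumed quality grade.
    if days_before_deadline >= 30:
        return GRADES["excellent"]
    if days_before_deadline >= 7:
        return GRADES["good"]
    return GRADES["poor"]

class ThesisDataset(Dataset):
    def __init__(self, samples):
        # Each sample: (long_seq: list[int], short_seq: list[int], label: int)
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        long_seq, short_seq, label = self.samples[i]
        return (torch.tensor(long_seq), torch.tensor(short_seq),
                torch.tensor(label))

def collate(batch):
    # Pad variable-length sequences so they can be batched.
    longs, shorts, labels = zip(*batch)
    return (pad_sequence(longs, batch_first=True),
            pad_sequence(shorts, batch_first=True),
            torch.stack(labels))

samples = [([2, 10, 11, 3], [2, 20, 3], GRADES["good"])]  # toy example
loader = DataLoader(ThesisDataset(samples), batch_size=8,
                    shuffle=True, collate_fn=collate)
```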
In specific step S15, data characterization, the characterization part is divided into a long text model and a short text model.
The long text model, as shown in Fig. 3, specifically consists of an embedding layer, a memory unit over the preceding text, and a transformer extraction layer. The embedding layer converts the word indices of the long text sequence into word vectors and adds positional encoding features; the memory unit holds the abstract features of the preceding text and acts on the word vectors of the current sentence through an attention mechanism; the transformer then extracts the features of the current sentence. The memory unit is applied to the current sentence vector here, and as in Fig. 3 the memory-weighted sentence vector can be represented as

    softmax(Q K^T / sqrt(d_k)) V

In the above formula, T_{i-1} is the feature representation of the previous sentence and S_i is the feature representation of the current sentence; specifically, in Fig. 3, T_{i-1} is the query vector Q, while the key vector K and the value vector V are the vector S_i.
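The sketch below assembles the long text model of Fig. 3 under the formula above: per sentence, an embedding layer with positional encoding, a memory unit that attends from the previous memory T_{i-1} (query) over the current sentence S_i (key and value), and a transformer layer that extracts the current sentence feature. The layer sizes, the single-layer encoder, the additive injection of the memory-weighted vector, and the mean-pooled memory update are illustrative assumptions rather than details given by the patent.

```python
# Sketch of the long text model: embedding + positional encoding,
# memory-unit attention (Q = T_{i-1}, K = V = S_i), transformer extraction.
import math
import torch
import torch.nn as nn

class LongTextModel(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.memory_attn = nn.MultiheadAttention(d_model, n_heads,
                                                 batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=1024,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.d_model = d_model

    def positional_encoding(self, length, device):
        pos = torch.arange(length, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2, device=device)
                        * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(length, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, sentences):
        # sentences: list of 1-D LongTensors, one per sentence (token indices)
        memory = None  # T_{i-1}: abstract feature of the preceding text
        for sent in sentences:
            s = self.embed(sent).unsqueeze(0)               # (1, len, d)
            s = s + self.positional_encoding(s.size(1), s.device)
            if memory is not None:
                # Memory unit: Q = T_{i-1}, K = V = S_i.
                weighted, _ = self.memory_attn(memory, s, s)
                s = s + weighted   # inject memory-weighted features (assumption)
            h = self.encoder(s)                             # transformer extraction
            memory = h.mean(dim=1, keepdim=True)            # new T_i, (1, 1, d)
        return memory.squeeze(0).squeeze(0)                 # document-level feature
```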
The short text model specifically performs sequential analysis of the above short text using a BiLSTM to obtain its features.
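A small sketch of such a BiLSTM short text model is shown below; the embedding and hidden sizes and the use of the final hidden states of both directions as the feature are illustrative assumptions.

```python
# Sketch of the short text model: BiLSTM over the short text sequence.
import torch
import torch.nn as nn

class ShortTextModel(nn.Module):
    def __init__(self, vocab_size, d_emb=128, d_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb, padding_idx=0)
        self.bilstm = nn.LSTM(d_emb, d_hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) indices of the short text sequence
        x = self.embed(tokens)
        _, (h_n, _) = self.bilstm(x)                 # h_n: (2, batch, d_hidden)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * d_hidden)
```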
In specific step S16, data classification, the features extracted by the two models are weighted and then classified by a fully connected layer. In the training stage, as in Fig. 2, the output is compared with the true label, the loss is computed, and the parameters of the two models are updated so that the models improve. In the prediction stage, the classification result is output directly, completing the classification of the quality of the entire long text.
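The sketch below illustrates this step: the two feature vectors are weighted, concatenated and passed to a fully connected layer for the three-grade classification, with cross-entropy against the true label during training. The learnable scalar weights and the optimizer usage shown are assumptions; the patent only specifies weighting followed by a fully connected layer.

```python
# Sketch of step S16: weighted feature fusion + fully connected classifier.
import torch
import torch.nn as nn

class QualityClassifier(nn.Module):
    def __init__(self, d_long, d_short, n_classes=3):
        super().__init__()
        self.w_long = nn.Parameter(torch.tensor(0.5))
        self.w_short = nn.Parameter(torch.tensor(0.5))
        self.fc = nn.Linear(d_long + d_short, n_classes)

    def forward(self, long_feat, short_feat):
        fused = torch.cat([self.w_long * long_feat,
                           self.w_short * short_feat], dim=-1)
        return self.fc(fused)   # logits over excellent / good / poor

# Training step (illustrative):
# logits = classifier(long_model_out, short_model_out)
# loss = nn.CrossEntropyLoss()(logits, labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```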

Claims (10)

1. A transformer-based long text quality analysis method, characterized by: data acquisition, downloading graduation theses from CNKI; data recognition, extracting the text content of the PDFs; data representation, processing the text and mapping it to a form the computer can analyze; data labeling, obtaining quality grade labels; data characterization, characterizing the data through models of comparable designed complexity; and data classification, weighting the different features of the data characterization according to the data characteristics to classify the data.
2. The long text quality analysis method according to claim 1, characterized in that, in the data recognition step, OCR technology is used to extract the content of the text portions of the PDF.
3. The long text quality analysis method according to claim 1, characterized in that, in the data representation step, the text is split into sentences and segmented into words, a token vocabulary is built, the text is mapped to vocabulary indices, and the special BOS and EOS indices are added before and after each sentence.
4. The long text quality analysis method according to claim 1, characterized in that, in the data labeling step, the upload time of each thesis is used to derive its quality grade: excellent, good, or poor.
5. The long text quality analysis method according to claim 1, characterized in that, in the data characterization step, the matched models are, respectively:
a long text model, used to extract features of the quality of the body content of the thesis;
a short text model, used to extract features from Chinese and English papers such as the research achievements produced during the master's program.
6. The long text quality analysis method according to claim 5, characterized in that the long text model is specifically composed of a transformer feature extraction part and a memory part; the transformer extracts sentence features, and the memory part performs sentence-feature selection.
7. The long text quality analysis method according to claim 6, characterized in that the transformer feature extraction module is mainly composed of a feed-forward network and self-attention:
the feed-forward network extracts word-vector features, and self-attention extracts the features of the relations between words.
8. The long text quality analysis method according to claim 6, characterized in that the memory unit mainly performs feature forgetting and selection on the features of the sentences analyzed before the current sentence.
9. The long text quality analysis method according to claim 8, characterized in that the feature weighting of the sentence currently being analyzed is mainly based on an attention mechanism: the current sentence features are used as the query vector and the value vector, the sentence vectors before the current sentence are used as the key vector, and the sentence vector extracted through the memory features is thus obtained.
10. The long text quality analysis method according to claim 1, characterized in that, in the data classification step, the features output by the long text model and the short text model are weighted and used as the input of a fully connected layer to classify the text quality grade.
CN201910583213.4A 2019-07-01 2019-07-01 Transformer-based long text quality analysis method Pending CN110347832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910583213.4A CN110347832A (en) 2019-07-01 2019-07-01 Transformer-based long text quality analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910583213.4A CN110347832A (en) 2019-07-01 2019-07-01 Transformer-based long text quality analysis method

Publications (1)

Publication Number Publication Date
CN110347832A true CN110347832A (en) 2019-10-18

Family

ID=68177581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910583213.4A Pending CN110347832A (en) 2019-07-01 2019-07-01 Transformer-based long text quality analysis method

Country Status (1)

Country Link
CN (1) CN110347832A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN107133211A (en) * 2017-04-26 2017-09-05 中国人民大学 A kind of composition methods of marking based on notice mechanism
CN109543824A (en) * 2018-11-30 2019-03-29 腾讯科技(深圳)有限公司 A kind for the treatment of method and apparatus of series model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOMIN ZHANG et al.: "AHNN: An Attention-based Hybrid Neural Network for Sentence Modeling", Springer International Publishing AG *
张谦 et al.: "基于Word2vec的微博短文本分类研究" [Research on Weibo short text classification based on Word2vec], 《技术研究》 [Technology Research] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028934A (en) * 2019-12-23 2020-04-17 科大讯飞股份有限公司 Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium
CN111028934B (en) * 2019-12-23 2022-02-18 安徽科大讯飞医疗信息技术有限公司 Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium
CN111522946A (en) * 2020-04-22 2020-08-11 成都中科云集信息技术有限公司 Paper quality evaluation method based on attention long-short term memory recurrent neural network

Similar Documents

Publication Publication Date Title
CN109359293B (en) Mongolian name entity recognition method neural network based and its identifying system
CN107766371B (en) Text information classification method and device
CN109063159B (en) Entity relation extraction method based on neural network
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN106570179B (en) A kind of kernel entity recognition methods and device towards evaluation property text
CN108984745A (en) A kind of neural network file classification method merging more knowledge mappings
CN107133213A (en) A kind of text snippet extraction method and system based on algorithm
CN110019839A (en) Medical knowledge map construction method and system based on neural network and remote supervisory
CN105631479A (en) Imbalance-learning-based depth convolution network image marking method and apparatus
CN108664474A (en) A kind of resume analytic method based on deep learning
CN109492230A (en) A method of insurance contract key message is extracted based on textview field convolutional neural networks interested
CN107145573A (en) The problem of artificial intelligence customer service robot, answers method and system
CN107273295A (en) A kind of software problem reporting sorting technique based on text randomness
CN106529525A (en) Chinese and Japanese handwritten character recognition method
CN111274814A (en) Novel semi-supervised text entity information extraction method
CN108829823A (en) A kind of file classification method
CN110347832A (en) Transformer-based long text quality analysis method
CN105630772A (en) Method for extracting webpage comment content
CN105117740A (en) Font identification method and device
CN102880631A (en) Chinese author identification method based on double-layer classification model, and device for realizing Chinese author identification method
CN110083832A (en) Recognition methods, device, equipment and the readable storage medium storing program for executing of article reprinting relationship
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN111597328A (en) New event theme extraction method
CN105389303B (en) A kind of automatic fusion method of heterologous corpus
CN104992166A (en) Robust measurement based handwriting recognition method and system

Legal Events

Code  Event
PB01  Publication
SE01  Entry into force of request for substantive examination
WD01  Invention patent application deemed withdrawn after publication (Application publication date: 20191018)