CN112417134A - Automatic abstract generation system and method based on voice text deep fusion features - Google Patents
- Publication number: CN112417134A
- Application number: CN202011198008.5A
- Authority: CN (China)
- Prior art keywords: abstract, text, voice, word, decoder
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/345—Summarisation of unstructured textual data for human users
- G06F16/3343—Query execution using phonetics
- G06F16/3344—Query execution using natural language analysis
- G06F18/253—Fusion techniques of extracted features
- G06F40/126—Character encoding
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G10L15/26—Speech to text systems
Abstract
The disclosed system comprises a preprocessing and speech-correspondence module, an encoder module, a decoder feature-fusion module, and a loss-function module. The preprocessing and speech-correspondence module comprises text acquisition and speech correspondence; the decoder feature-fusion module comprises intermediate-abstract generation, sound-feature fusion, and corrected-abstract generation; the loss-function module comprises an intermediate-abstract loss function and an evaluation function for the corrected abstract. For user speech data, text acquisition yields the text corresponding to the speech, and speech correspondence yields the sound features aligned with each character. The text data is passed through a pre-trained XLNet encoder to obtain a vector representation of the text. The text vectors and the speech features pass through the decoder's sound-feature fusion and intermediate-abstract generation, which is trained to produce an intermediate abstract. The intermediate abstract is then encoded by XLNet again to obtain a further understanding of the text, and the corrected-abstract generation step is trained to produce the final abstract.
Description
Technical Field
The invention belongs to the field of automatic abstract generation, and particularly relates to an automatic abstract generation method based on deep fusion features of speech and text.
Background
With the rapid development of the Internet and the upgrading of communication technology, the ways people communicate online have become diverse, and communicating through voice is a convenient and common one. According to statistics from DingTalk, more than 20 million online conferences are initiated on the platform every day; the Prospective Industry Research Institute estimated that by 2018 the number of users in China's online education market had exceeded 20 million. Voice is an important information carrier: it makes communication more convenient and rapid, but it also makes the information harder to process and retrieve. A method and device that automatically generate a text abstract from information-rich speech therefore have broad application scenarios and value. For example, a conference produces a large amount of valuable voice information; organizing the minutes manually consumes labor and easily introduces the organizers' subjective bias, whereas an automatic abstract system can produce an objective conference summary while saving cost. In a classroom scene, automatic abstract generation can extract the key knowledge of a lesson and enhance the teaching effect. The voice information generated in daily production and life is rich in value but highly redundant, so converting it into a concise text abstract is necessary.
Existing abstract-generation technology is mature for text-oriented scenes, but it cannot be applied directly to scenes whose raw data is speech. An abstract-generation task that starts from speech has the following characteristics: 1) the original text must be obtained through speech recognition, and the converted text is prone to errors such as homophone substitutions; 2) the speech itself carries cues to important information (such as pitch and intensity) that are lost in the text obtained by speech conversion.
Disclosure of Invention
With the sharp increase of online and offline meetings and lectures, efficiently extracting their key information has become an important problem. With the rapid development and application of deep learning, the pretrain-and-finetune paradigm has become the mainstream means of text summarization. Existing methods, however, start from the text alone: although they consider the semantic relations of the context, they ignore the speaker's own understanding and emphasis while speaking. The invention therefore proposes an automatic abstract generation method based on deep fusion of speech and text, which incorporates the speaker's subjective emphasis and makes the abstract better suited to the application scene.
The invention aims to disclose an automatic abstract generation method based on deep fusion of speech and text. Addressing the current situation of high speech-information output and difficult conversion and extraction, the work centers on improving the speech-text feature extraction and abstract-generation model. The results enrich the theory and methods of speech-data summarization, improve the quality of generated abstracts, can be widely applied in daily production and life, and create economic value.
The technical scheme is as follows:
The automatic abstract generation method based on deep fusion of speech and text consists of four main modules: a preprocessing and feature-extraction module, an encoder module, a decoder feature-fusion module, and a loss-function module. The preprocessing and feature-extraction module comprises speech recognition, text error correction, and speech correspondence. For the original speech data, a speech recognition function produces an intermediate text sequence, and a text error-correction model based on an n-gram model and a similar-pronunciation table produces the final text sequence. Exploiting the one-character-one-syllable property of Chinese, the sound-intensity information in the speech data is extracted in time order and matched to the characters of the text sequence, completing text preprocessing and speech feature extraction. The encoder module obtains the original embedded vector of the text features based on the XLNet framework. The decoder feature-fusion and loss-function modules comprise the decoder structure, sound-feature fusion, and the loss-function design. The decoder generates the abstract from the original embedded vector in two steps: 1) a Transformer-XL-based decoder decodes the vector and generates an intermediate abstract; 2) words are deleted from the intermediate abstract, the encoder re-encodes it into an intermediate-abstract embedded vector, and the decoder corrects it in combination with the original embedded vector to generate the final abstract. During intermediate-abstract generation, the sound features are fused into the Value matrix of the multi-head-attention decoder, realizing deep fusion of the speech and text features.
The loss-function design of the decoder combines maximum likelihood estimation, the ROUGE-L criterion, and an information-quantity estimate to evaluate the model.
The invention conducts deep and systematic research in the field of abstract generation. By fully exploiting a fusion mechanism for speech and text features and combining the attention mechanism with an encoder-decoder structure, it provides a new idea and method for summarization of speech data, extends abstract-generation theory, helps further improve the quality and effect of speech-oriented summarization, and promotes the application of automatic abstract generation.
Advantageous effects
1) Aiming at problems such as the uneven information quality of speech data and the difficulty of extracting key points, the invention provides a method for fusing deep features of speech and text. The sound-intensity features of the speech are extracted and added to the multi-layer attention network that extracts text features, realizing deep fusion of the speech and text features and improving the quality of the generated abstract.
2) The invention takes XLNet and Transformer-XL as the basic framework of the abstract-generation model, giving it a stronger capacity for summarizing long texts and enabling it to handle speech data from more scenes, such as large conferences.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 Overall framework of the invention
FIG. 2 is a schematic diagram of a decoder feature fusion mechanism
FIG. 3 is a schematic diagram of a modified network
Detailed Description
The embodiments of the invention are described in detail below with reference to the accompanying drawings and examples, so that the reader can fully understand and reproduce how the invention applies technical means to solve the technical problems and achieve the technical effects.
The invention discloses an abstract-generation method based on deep fusion of speech and text, which generates a corresponding text abstract from speech data. The k speech segments in the speech data are defined as V = {v1, v2, …, vk}; for given speech data, the task of summary generation is to produce the corresponding summary Y = {y1, y2, …, yt}.
First part, preprocessing and voice corresponding module
1.1 text acquisition
For the k speech segments {v1, v2, …, vk}, a speech recognition tool is first used to obtain the corresponding k text sequences {x1, x2, …, xk}. Text spelling correction is then performed based on the n-gram model and the similar-pronunciation table: each text sequence xi (1 ≤ i ≤ k) is segmented with a word segmentation tool into the word sequence Wi = {w1, w2, …, wn}; each word is looked up in the similar-pronunciation table to obtain a set of similar-sounding words; the set is traversed for replacement, and the text sequence t′i with the lowest perplexity is selected as the final result. The perplexity is calculated as follows:

PP(Wi) = P(w1, w2, …, wn)^(-1/n)

where n denotes the number of segmented words and P(w1, w2, …, wn), the probability of the text sequence, is computed by the probability multiplication formula:

P(w1, w2, …, wn) = p(w1) p(w2|w1) … p(wn|w1, w2, … wn-1)

The final k text sequences {x′1, x′2, …, x′k} are obtained through error correction.
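The error-correction step can be sketched in Python as follows. This is a minimal illustration, assuming a bigram language-model probability function `prob` and a small similar-pronunciation dictionary; the function names and the greedy position-by-position search are illustrative stand-ins, not the patent's exact implementation.

```python
import math

def perplexity(words, prob):
    """Perplexity of a word sequence under a (bigram) language model.
    prob(w, prev) returns p(w | prev); the model itself is assumed."""
    logp = 0.0
    prev = "<s>"
    for w in words:
        logp += math.log(prob(w, prev))
        prev = w
    return math.exp(-logp / len(words))

def correct_sentence(words, similar, prob):
    """Greedy homophone correction: at each position, try every
    similar-sounding candidate from the table and keep the variant
    with the lowest perplexity, mirroring the traversal-and-replace
    selection described above."""
    best = list(words)
    for i, w in enumerate(words):
        for cand in similar.get(w, []):
            trial = best[:i] + [cand] + best[i + 1:]
            if perplexity(trial, prob) < perplexity(best, prob):
                best = trial
    return best
```

For example, with a bigram model that assigns higher probability to "we meet" than to the homophone sequence "we meat", `correct_sentence` replaces "meat" with "meet".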
1.2 Speech correspondence
The acquired text is aligned with the sound-intensity features. Using the one-character-one-syllable property of Chinese, the number of characters per text {n1, n2, …, nk} is obtained from the text sequences {x′1, x′2, …, x′k}, where k is the ordinal number of the sequence. Each speech segment vi is processed by selecting its ni largest sound intensities (i.e. amplitudes) and arranging them in time order, giving a sound-intensity vector ai of length ni. Each element of the vector corresponds to the intensity of one character of the text sequence, which completes the correspondence between speech features and text.
After the preprocessing module of step 1, the text data {x′1, x′2, …, x′k} required by the abstract-extraction model and the corresponding sound features {a1, a2, …, ak} are obtained.
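The speech-correspondence step can be sketched as follows. Audio decoding and framing are omitted; `samples` is assumed to be a plain one-dimensional amplitude array and `n_chars` the character count of the recognized text, so the function only illustrates "take the n largest amplitudes, kept in time order".

```python
import numpy as np

def intensity_vector(samples, n_chars):
    """Pick the n_chars largest amplitudes of one speech segment and
    keep them in their original time order, so element i lines up with
    the i-th character of the recognized text."""
    samples = np.abs(np.asarray(samples, dtype=float))
    # indices of the n_chars largest amplitudes, then restore time order
    idx = np.sort(np.argpartition(samples, -n_chars)[-n_chars:])
    return samples[idx]
```

For a segment `[0.1, 0.9, 0.2, 0.8, 0.05]` recognized as two characters, this yields the length-2 intensity vector `[0.9, 0.8]`.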
Second part, encoder Module
The error-corrected text sequence x′i is segmented with SentencePiece and embedded, giving the word vectors {xi,1, xi,2, …, xi,m}, where m is the number of tokens in the sequence. These are fed into six serially connected encoder layers to obtain the hidden states {hi,1, hi,2, …, hi,m}. The calculation process is as follows:

Qi,j = Ki,j = Vi,j = W·xi,j

Si,j,k = softmax_k(Qi,j·Ki,k / sqrt(d))

Zi,j = Σk Si,j,k·Vi,k

hi,j = Norm(fc(Norm(Zi,j)))

where W is a weight-matrix parameter to be trained; Q, K and V are the query, key and value matrices of the Transformer structure; d is the dimension of the key vectors; Si,j,k is the attention of the k-th word to the j-th word of sentence x′i; Norm denotes the normalization operation; and fc denotes the fully connected operation in the neural network.

To process long texts better, relative position encoding is added following the idea of Transformer-XL, and the pre-softmax score in Si,j,k is modified as follows:

Si,j,k ∝ (Wq·xi,j)T(Wk·xi,k) + (Wq·xi,j)T·rj-k + pT(Wk·xi,k) + pT·rj-k

where the absolute position of word j is represented by a fixed vector p, rj-k denotes the relative position between words j and k, and Wq and Wk denote the weight matrices the model learns for computing the Q and K matrices.

Following the idea of Transformer-XL, the output of the previous sentence is stored in a cache; during training the next sentence is concatenated to the output of the previous sentence and they enter the attention mechanism together, but the cached content is excluded from the backward gradient computation.

Since multi-head attention is introduced, eight matrices W are learned, yielding {Zi,j,1, Zi,j,2, …, Zi,j,8}; these are concatenated, multiplied by W0, and then fed into the fully connected layer, where W0 is also a weight-matrix parameter to be trained.
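The per-layer computation (eight heads with Q = K = V per head, scaled dot-product attention, concatenation, multiplication by W0, then Norm, fc, Norm) can be sketched with NumPy. This is an illustrative sketch only: the relative-position terms and the Transformer-XL cache are omitted, and the weight shapes are toy assumptions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def norm(z, eps=1e-6):
    # simple layer normalization over the feature dimension
    return (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + eps)

def encoder_layer(x, Ws, W0, Wfc, d_head):
    """One encoder layer sketch: per head, Q = K = V = W @ x; scaled
    dot-product attention; concatenate the 8 heads; multiply by W0;
    then h = Norm(fc(Norm(Z)))."""
    heads = []
    for W in Ws:                                 # one weight matrix per head
        q = k = v = x @ W.T                      # (seq, d_head)
        s = softmax(q @ k.T / np.sqrt(d_head))   # attention scores S
        heads.append(s @ v)                      # per-head output Z
    z = np.concatenate(heads, axis=-1) @ W0.T    # concat, then W0
    return norm(norm(z) @ Wfc.T)                 # h = Norm(fc(Norm(Z)))
```

With a 4-token input of dimension 16 and 8 heads of dimension 4, the layer returns a (4, 16) hidden-state matrix.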
third part, decoder feature fusion module
The process of generating the text abstract by using the decoder is divided into three parts:
3.1 Intermediate abstract generation: the original embedded vector H of the text is obtained from the encoder and decoded in L steps by the Transformer-XL-based decoder. The decoding process is as follows:

H = XLNet(x′1, x′2, …, x′m)

oi = TransformerXLDecoder(o<i, H), 1 ≤ i ≤ L

generating, from left to right, an intermediate abstract O = {o1, o2, …, oL}, where {x′i | 1 ≤ i ≤ m} denotes the text sequence data obtained after preprocessing.
3.2 Fusion of sound features: the sound features are added into the multi-head attention decoder that generates the intermediate abstract. Specifically, the obtained sound features a = {a1, a2, …, ak} are first Min-Max standardized:

a′i = (ai - min(a)) / (max(a) - min(a))

Then, letting the Value matrix in the decoder be V = (v1, v2, …, vk)T, the fusion of the sound features is performed by:

V′ = (a′1·v1, a′2·v2, …, a′k·vk)T
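The fusion can be sketched with NumPy: Min-Max normalize the intensity vector and scale each row of the Value matrix by the corresponding character's intensity. The matrices here are toy stand-ins for the decoder's actual Value projection.

```python
import numpy as np

def fuse_intensity(V, a, eps=1e-8):
    """Scale row i of the Value matrix V by the Min-Max-normalized
    sound intensity of character i, i.e. V' = (a1*v1, ..., ak*vk)^T.
    Rows of V correspond to characters of the text sequence."""
    a = np.asarray(a, dtype=float)
    a = (a - a.min()) / (a.max() - a.min() + eps)  # Min-Max standardization
    return V * a[:, None]                           # broadcast over features
```

Characters spoken with greater intensity thus contribute more strongly to the attention output, which is how the speaker's emphasis enters the decoder.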
3.3 Generation of the corrected abstract: since the embedded encoding of the text sequence is produced by the XLNet-based encoder, re-encoding the intermediate abstract with XLNet lets the model understand the feature meaning of the original embedded vector more deeply and better capture contextual semantic information.

Each word of O is deleted in turn, giving L different copies {O′1, O′2, …, O′L}, O′i = O - oi (1 ≤ i ≤ L), each copy missing one word. Next, for each intermediate-abstract copy, its embedded vector H′i (1 ≤ i ≤ L) is obtained with XLNet:

H′i = XLNet(O′i)

The missing word yi (1 ≤ i ≤ L) of each copy is predicted by the Transformer-XL-based decoder combined with the source document information:

yi = TransformerXLDecoder(H′i, H)

The missing words of all the copies are supplemented, and the results are combined into the finally corrected abstract:

Y = {y1, y2, …, yL}
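The correction step can be sketched as a loop over deletion copies. `encode` and `decode` are stand-ins for the XLNet encoder and the Transformer-XL decoder, which are not reimplemented here; only the copy-and-fill control flow is shown.

```python
def corrected_summary(O, H, encode, decode):
    """Correction sketch: delete each word of the intermediate summary
    O in turn, re-encode the copy, and let the decoder predict the
    missing word from the copy embedding plus the source embedding H."""
    final = []
    for i in range(len(O)):
        copy = O[:i] + O[i + 1:]          # O'_i = O - o_i, one word missing
        h_copy = encode(copy)             # H'_i = XLNet(O'_i)
        final.append(decode(h_copy, H))   # y_i from decoder(H'_i, H)
    return final
```

With a perfect decoder that always recovers the deleted word, the output reproduces the intermediate summary; in the trained model, the decoder may instead substitute a better word given the source document.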
fourth, loss function module
4.1 Intermediate abstract loss function: in the formation of the intermediate abstract, let the generated intermediate abstract be a = {a1, a2, …, al}. The corresponding loss function adopts maximum likelihood estimation and the ROUGE-L criterion. The maximum-likelihood term is the negative log-likelihood of the reference summary:

Lmle = -Σi log p(ai | a<i, x)

and the ROUGE-L term scores a sampled summary:

Lrl = -R(a^s)·Σi log p(a^s_i | a^s_<i, x)

where a^s is an intermediate summary sampled from the predicted distribution and R(a^s) is the score obtained by comparing a^s with the real label. Combining the two gives the final loss function of the intermediate-abstract stage, with β a hyper-parameter controlling the proportion of the two loss terms:

Linter = β·Lmle + (1 - β)·Lrl

4.2 Evaluation of the corrected abstract: let the corrected abstract be y = {y1, …, yl}; it is evaluated with maximum likelihood estimation and mutual information.

The mutual information measures how much of the information of the real label y* is contained in the finally formed abstract y, and serves as a measure of the abstract's information coverage.

Adding the two gives the evaluation function of the corrected abstract:

Lcorr = Lmle(y) + I(y; y*)

Combining the loss functions, the overall loss function of the decoder model is obtained as:

L = Linter + Lcorr
the overall schematic diagram of this embodiment is shown in fig. 1.
While the foregoing specification shows and describes several preferred embodiments of this invention, it is to be understood, as noted above, that the invention is not limited to the forms disclosed herein; it may be used in various other combinations, modifications, and environments, and is capable of changes within the scope of the inventive concept described herein, commensurate with the above teachings or the skill and knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Innovation point
Innovation 1: deep fusion of speech and text features
Traditional abstract extraction considers only textual information. The invention fully exploits the emphasis features carried by speech: through an attention-based deep learning model, the speech features are deeply fused with the text features obtained from speech conversion, and a text abstract is generated from the speech information.
Innovation 2: abstract-generation model structure based on XLNet
The deep learning model adopted by the method is based on XLNet, with improvements to the decoder structure and the loss function, which helps increase the amount of information contained in the generated abstract and makes full use of the text features obtained by the encoder. Compared with the prior art, the invention yields a better abstract.
Claims (5)
1. An automatic abstract generation system based on deep fusion of speech and text, characterized by comprising a preprocessing and speech-correspondence module, an encoder module, a decoder feature-fusion module and a loss-function module; the preprocessing and speech-correspondence module comprises text acquisition and speech correspondence; the decoder feature-fusion module comprises intermediate-abstract generation, sound-feature fusion and corrected-abstract generation; the loss-function module comprises an intermediate-abstract loss function and an evaluation function of the corrected abstract;
the system performs the following functions:
for user audio, obtaining the corresponding text with a speech recognition tool, correcting misspellings during text acquisition, and obtaining the speech features corresponding to the text character by character during speech correspondence;
obtaining a vector representation of the text with a pre-trained XLNet encoder, and obtaining an intermediate abstract after the text vectors and speech features pass through the decoder's sound-feature fusion and intermediate-abstract generation;
encoding the intermediate abstract with XLNet again to obtain a further understanding of the text, and finally obtaining the final abstract through corrected-abstract generation and learning.
2. A method for automatically generating an abstract based on voice text deep fusion is characterized by comprising the following specific steps:
step 1, preprocessing and voice corresponding module
Step 1.1 text acquisition
For the k speech segments {v1, v2, …, vk}, a speech recognition tool is first used to obtain the corresponding k text sequences {x1, x2, …, xk}; text spelling correction is then performed based on the n-gram model and the similar-pronunciation table: each text sequence xi (1 ≤ i ≤ k) is segmented with a word segmentation tool into the word sequence Wi = {w1, w2, …, wn}; each word is looked up in the similar-pronunciation table to obtain a set of similar-sounding words; the set is traversed for replacement, and the text sequence t′i with the lowest perplexity is selected as the final result; the perplexity is calculated as follows:

PP(Wi) = P(w1, w2, …, wn)^(-1/n)

where n denotes the number of segmented words and P(w1, w2, …, wn), the probability of the text sequence, is computed by the probability multiplication formula:

P(w1, w2, …, wn) = p(w1) p(w2|w1) … p(wn|w1, w2, … wn-1)

the final k text sequences {x′1, x′2, …, x′k} are obtained through error correction;
Step 1.2 Speech correspondence
The acquired text is aligned with the sound-intensity features: using the one-character-one-syllable property of Chinese, the number of characters per text {n1, n2, …, nk} is obtained from the text sequences {x′1, x′2, …, x′k}, where k is the ordinal number of the sequence; each speech segment vi is processed by selecting its ni largest sound intensities (i.e. amplitudes) and arranging them in time order, giving a sound-intensity vector ai of length ni; each element of the vector corresponds to the intensity of one character of the text sequence, completing the correspondence between speech features and text;
after the preprocessing module of step 1, the text data {x′1, x′2, …, x′k} required by the abstract-extraction model and the corresponding sound features {a1, a2, …, ak} are obtained;
Step 2, coder module
After the preprocessing module is completed, word segmentation is performed first, and the text data is encoded with the XLNet pre-trained language model to obtain the corresponding text vectors;
step 3, a decoder feature fusion module and a loss function module
The sound features are integrated into the decoder's attention calculation to obtain an intermediate abstract, and a loss function is designed for learning; the intermediate abstract is then re-encoded with XLNet to learn the semantics further and decoded again, an evaluation function is designed to train the model, and the final result is the required abstract.
3. The method of claim 2, wherein step 2 specifically comprises:
the error-corrected text sequence x′i obtained after preprocessing is segmented with SentencePiece and embedded, giving the word vectors {xi,1, xi,2, …, xi,m}, where m is the number of tokens in the sequence; these are fed into six serially connected encoder layers to obtain the hidden states {hi,1, hi,2, …, hi,m}; the calculation process is as follows:

Qi,j = Ki,j = Vi,j = W·xi,j

Si,j,k = softmax_k(Qi,j·Ki,k / sqrt(d))

Zi,j = Σk Si,j,k·Vi,k

hi,j = Norm(fc(Norm(Zi,j)))

where W is a weight-matrix parameter to be trained; Q, K and V are the query, key and value matrices of the Transformer structure; d is the dimension of the key vectors; Si,j,k is the attention of the k-th word to the j-th word of sentence x′i; Norm denotes the normalization operation; and fc denotes the fully connected operation in the neural network;

to process long texts better, relative position encoding is added following the idea of Transformer-XL, and the pre-softmax score in Si,j,k is modified as follows:

Si,j,k ∝ (Wq·xi,j)T(Wk·xi,k) + (Wq·xi,j)T·rj-k + pT(Wk·xi,k) + pT·rj-k

where the absolute position of word j is represented by a fixed vector p, rj-k denotes the relative position between words j and k, and Wq and Wk denote the weight matrices the model learns for computing the Q and K matrices;

following the idea of Transformer-XL, the output of the previous sentence is stored in a cache; during training the next sentence is concatenated to the output of the previous sentence and they enter the attention mechanism together, but the cached content is excluded from the backward gradient computation;

since multi-head attention is introduced, eight matrices W are learned, yielding {Zi,j,1, Zi,j,2, …, Zi,j,8}; these are concatenated, multiplied by W0, and then fed into the fully connected layer, where W0 is also a weight-matrix parameter to be trained.
4. The method as claimed in claim 2, wherein the decoder in step 3 specifically operates as follows:
1) and (3) generation of a middle abstract: the original embedded coded vector H of the text is obtained by an encoder and is decoded by L decoders based on the transform-XL, and the decoding process is as follows:
oi=Transformerxldecoder(o<i,H),1≤i≤L
H=XLnet(x′1,x′2,...,x′m)。
generating an intermediate abstract O with the length L in a right-left mode1,o2,...,olWherein { x'iI is more than or equal to 1 and less than or equal to m represents the text sequence data obtained after the preprocessing;
2) and (3) fusion of sound characteristics: the sound features are added into a multi-head attention decoder generated by the intermediate abstract, and the specific adding method is as follows: firstly, the obtained sound characteristic a is ═ a1,a2,...,akMin-Max standardization, namely:
then, let the Value matrix in the decoder be V ═ V (V)1,v2,...,vk)TThe fusion of the sound features is performed by:
V' = (a_1* v_1, a_2* v_2, ..., a_k* v_k)^T
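The two fusion steps above, Min-Max normalization followed by row-wise scaling of the Value matrix, can be sketched as follows; the concrete feature values and matrix shapes are illustrative assumptions:

```python
import numpy as np

def min_max(a):
    """Min-Max normalization: rescale the sound features into [0, 1]."""
    a = np.asarray(a, dtype=float)
    return (a - a.min()) / (a.max() - a.min())

# Hypothetical sound features a_1..a_k, one scalar per value vector
a = np.array([3.0, 1.0, 2.0, 5.0])
a_star = min_max(a)                 # a* in [0, 1]

# Value matrix V = (v_1, ..., v_k)^T; scale each row v_i by a*_i
V = np.ones((4, 6))
V_prime = a_star[:, None] * V       # V' = (a*_1 v_1, ..., a*_k v_k)^T
```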
3) Generation of the revised abstract: because the embedded coding of the text sequence is generated by an XLNet-based encoder, re-encoding the intermediate abstract with XLNet allows the characteristic meaning of the original embedded vector to be understood more deeply and the contextual semantic information to be captured better;
each word in O is deleted in turn to obtain l different copies {O'_1, O'_2, ..., O'_l}, with O'_i = O - o_i (1 ≤ i ≤ l), so that each copy is missing one word; next, for each intermediate abstract copy, its embedding vector H'_i (1 ≤ i ≤ l) is obtained using XLNet:
H'_i = XLNet(O'_i)
the missing word y_i (1 ≤ i ≤ l) of each copy is predicted by a Transformer-XL-based decoder in combination with the source document information:
y_i = TransformerXLDecoder(H'_i, H)
after the missing words of all copies are supplemented, the obtained results are combined to form the finally corrected abstract:
Y = {y_1, y_2, ..., y_l}
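The delete-one-word construction of the copies O'_i can be sketched as follows; the example abstract and the absence of a real decoder are assumptions, since the actual prediction of each y_i would come from the Transformer-XL decoder:

```python
# Sketch of the delete-one-word correction step. A real system would pass
# each copy through XLNet and a Transformer-XL decoder to predict the
# removed word y_i; here we only build the copies themselves.
O = ["the", "model", "summarizes", "speech"]  # intermediate abstract o_1..o_l

def make_copies(summary):
    """O'_i = O - o_i: one copy per position, each missing exactly one word."""
    return [summary[:i] + summary[i + 1:] for i in range(len(summary))]

copies = make_copies(O)
print(copies[1])  # -> ['the', 'summarizes', 'speech'] (o_2 removed)
```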
5. The method as claimed in claim 2, wherein the loss function in step 3 is specifically designed as follows:
1) Intermediate abstract loss function: in the formation of the intermediate abstract, let the formed intermediate abstract be a = {a_1, a_2, ..., a_l}; the corresponding loss function adopts maximum likelihood estimation and the ROUGE-L rule, calculated as follows:
wherein a^s is an intermediate abstract sampled from the predicted distribution, and R(a^s) is the score obtained by comparing a^s with the true label; the two are combined to obtain the final loss function of the intermediate abstract formation process, where β is a hyper-parameter controlling the proportion of the two loss functions:
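The loss terms themselves appear only as images in the source; a common instantiation of a maximum-likelihood loss mixed with a ROUGE-L-based sampled-reward term, using the symbols defined above (any others are assumptions), is:

```latex
L_{ml} = -\sum_{t=1}^{l} \log p\left(a_t \mid a_{<t}, x\right), \qquad
L_{rl} = -R\left(a^{s}\right) \sum_{t=1}^{l} \log p\left(a^{s}_t \mid a^{s}_{<t}, x\right), \qquad
L_{mid} = \beta\, L_{rl} + (1-\beta)\, L_{ml}
```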
2) Evaluation of the revised abstract: let the revised abstract be y = {y_1, ..., y_l}; maximum likelihood estimation and the mutual information quantity are adopted for evaluation;
the mutual information quantity represents the amount of information of the true label y* contained in the finally formed abstract y, and is used to measure how informative the abstract is;
the two are added to obtain the evaluation function of the corrected abstract as follows:
combining the loss functions, the loss function of the decoder model is obtained as follows:
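The combined loss is likewise only an image in the source; under the usual convention of summing the stage losses (the weight λ and the term names are assumptions), it would take a form such as:

```latex
L = L_{mid} + \lambda\, L_{revise}
```

where L_{mid} is the intermediate abstract loss of step 1) and L_{revise} the evaluation function of the revised abstract from step 2).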
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011198008.5A CN112417134B (en) | 2020-10-30 | 2020-10-30 | Automatic abstract generation system and method based on voice text deep fusion features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112417134A true CN112417134A (en) | 2021-02-26 |
CN112417134B CN112417134B (en) | 2022-05-13 |
Family
ID=74828717
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011198008.5A Active CN112417134B (en) | 2020-10-30 | 2020-10-30 | Automatic abstract generation system and method based on voice text deep fusion features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112417134B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110029537A1 (en) * | 2008-03-25 | 2011-02-03 | Huawei Technologies Co., Ltd. | Method, device and system for categorizing content |
CN103646094A (en) * | 2013-12-18 | 2014-03-19 | 上海紫竹数字创意港有限公司 | System and method for automatic extraction and generation of audiovisual product content abstract |
CN108305632A (en) * | 2018-02-02 | 2018-07-20 | 深圳市鹰硕技术有限公司 | Method and system for forming a voice abstract of a meeting |
CN111061861A (en) * | 2019-12-12 | 2020-04-24 | 西安艾尔洛曼数字科技有限公司 | XLNET-based automatic text abstract generation method |
CN111739536A (en) * | 2020-05-09 | 2020-10-02 | 北京捷通华声科技股份有限公司 | Audio processing method and device |
Non-Patent Citations (1)
Title |
---|
王天课 (Wang Tianke): "Research on Automatic Summarization Technology for News Broadcast Speech", Wanfang Database * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113806520A (en) * | 2021-07-30 | 2021-12-17 | 合肥工业大学 | Text abstract generation method and system based on reinforcement learning |
CN113673248A (en) * | 2021-08-23 | 2021-11-19 | 中国人民解放军32801部队 | Named entity identification method for testing and identifying small sample text |
CN113851133A (en) * | 2021-09-27 | 2021-12-28 | 平安科技(深圳)有限公司 | Model training and calling method and device, computer equipment and storage medium |
CN113851133B (en) * | 2021-09-27 | 2024-09-24 | 平安科技(深圳)有限公司 | Model training and calling method and device, computer equipment and storage medium |
CN114118024A (en) * | 2021-12-06 | 2022-03-01 | 成都信息工程大学 | Conditional text generation method and generation system |
CN114118024B (en) * | 2021-12-06 | 2022-06-21 | 成都信息工程大学 | Conditional text generation method and generation system |
WO2023137913A1 (en) * | 2022-01-18 | 2023-07-27 | 平安科技(深圳)有限公司 | Video text summarization method based on multi-modal model, device and storage medium |
CN114398889A (en) * | 2022-01-18 | 2022-04-26 | 平安科技(深圳)有限公司 | Video text summarization method, device and storage medium based on multi-modal model |
CN115544244A (en) * | 2022-09-06 | 2022-12-30 | 内蒙古工业大学 | Cross fusion and reconstruction-based multi-mode generative abstract acquisition method |
CN115544244B (en) * | 2022-09-06 | 2023-11-17 | 内蒙古工业大学 | Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction |
CN115827854B (en) * | 2022-12-28 | 2023-08-11 | 数据堂(北京)科技股份有限公司 | Speech abstract generation model training method, speech abstract generation method and device |
CN115827854A (en) * | 2022-12-28 | 2023-03-21 | 数据堂(北京)科技股份有限公司 | Voice abstract generation model training method, voice abstract generation method and device |
CN118210383A (en) * | 2024-05-21 | 2024-06-18 | 南通亚森信息科技有限公司 | Information input system |
CN118210383B (en) * | 2024-05-21 | 2024-09-10 | 南通亚森信息科技有限公司 | Information input system |
Also Published As
Publication number | Publication date |
---|---|
CN112417134B (en) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112417134B (en) | Automatic abstract generation system and method based on voice text deep fusion features | |
CN108519890B (en) | Robust code abstract generation method based on self-attention mechanism | |
WO2019085779A1 (en) | Machine processing and text correction method and device, computing equipment and storage media | |
CN110413729B (en) | Multi-turn dialogue generation method based on clause-context dual attention model | |
CN109992669B (en) | Keyword question-answering method based on language model and reinforcement learning | |
CN112765345A (en) | Text abstract automatic generation method and system fusing pre-training model | |
CN111125333B (en) | Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism | |
CN114444481B (en) | Sentiment analysis and generation method of news comment | |
CN114942990A (en) | Few-sample abstract dialogue abstract generation system based on prompt learning | |
CN113743095B (en) | Chinese problem generation unified pre-training method based on word lattice and relative position embedding | |
CN114372140A (en) | Layered conference abstract generation model training method, generation method and device | |
CN117995197A (en) | Speech recognition method, device, related equipment and computer program product | |
CN116909435A (en) | Data processing method and device, electronic equipment and storage medium | |
CN113901172B (en) | Case-related microblog evaluation object extraction method based on keyword structural coding | |
CN115858736A (en) | Emotion text generation method based on emotion prompt fine adjustment | |
CN113239166B (en) | Automatic man-machine interaction method based on semantic knowledge enhancement | |
CN115310461A (en) | Low-resource speech translation method and system based on multi-modal data optimization | |
CN115659172A (en) | Generation type text summarization method based on key information mask and copy | |
CN114756679A (en) | Chinese medical text entity relation combined extraction method based on conversation attention mechanism | |
CN115422329A (en) | Knowledge-driven multi-channel screening fusion dialogue generation method | |
CN114492462A (en) | Dialogue generation method and system based on emotion analysis and generation type confrontation network | |
CN111709245A (en) | Chinese-Yuan pseudo parallel sentence pair extraction method based on semantic self-adaptive coding | |
CN112380836A (en) | Intelligent Chinese message question generating method | |
Sun et al. | Human-machine conversation based on hybrid neural network | |
CN118377883B (en) | Session type retrieval method for rewriting query through thinking chain strategy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||