CN112417134B - Automatic abstract generation system and method based on voice text deep fusion features - Google Patents

Automatic abstract generation system and method based on voice text deep fusion features

Info

Publication number
CN112417134B
CN112417134B (application CN202011198008.5A)
Authority
CN
China
Prior art keywords
abstract
text
word
voice
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011198008.5A
Other languages
Chinese (zh)
Other versions
CN112417134A (en)
Inventor
申树藩
张思琪
周逸伦
卫志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202011198008.5A priority Critical patent/CN112417134B/en
Publication of CN112417134A publication Critical patent/CN112417134A/en
Application granted granted Critical
Publication of CN112417134B publication Critical patent/CN112417134B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The system comprises a preprocessing and voice correspondence module, an encoder module, a decoder feature fusion module and a loss function module. The preprocessing and voice correspondence module comprises text acquisition and voice correspondence. The decoder feature fusion module comprises intermediate abstract generation, sound feature fusion and corrected abstract generation. The loss function module comprises an intermediate abstract loss function and an evaluation function for the corrected abstract. For user voice data, the corresponding text is obtained through text acquisition, and the voice features corresponding to the characters are obtained through voice correspondence; the text data is passed through a pre-trained XLNet encoder to obtain a vector representation of the text; the text vectors and the voice features are combined by the decoder through sound feature fusion and intermediate abstract generation to produce an intermediate abstract; the intermediate abstract is encoded by XLNet again to obtain a further understanding of the text, and finally the intermediate abstract is corrected through learning to generate the final abstract.

Description

Automatic abstract generation system and method based on voice text deep fusion features
Technical Field
The invention belongs to the field of automatic abstract generation, and particularly relates to an automatic abstract generation method based on speech-text deep fusion features.
Background
With the rapid development of the internet and the upgrading of communication technology, the ways in which people communicate over the internet have become diverse. Communicating with voice information is a convenient and common approach. According to DingTalk statistics, more than 20 million online conferences are initiated on the platform every day; according to estimates by the Qianzhan Industry Research Institute, by 2018 the number of users in China's online education market had exceeded 20 million. As an important information carrier, voice makes communication more convenient and faster, but it also makes information processing and retrieval more difficult. Therefore, automatically generating text abstracts for information-rich speech has wide application scenarios and value. For example, a conference scene produces a large amount of valuable voice information; manually compiling conference minutes consumes manpower and is easily affected by the subjective views of the organizers, whereas an automatic abstract system can produce objective conference minutes while saving cost. In a lecture scene, automatic abstract generation can extract the key knowledge of a class and enhance the teaching effect. The voice information generated in daily production and life is rich in information value but highly redundant, so it is necessary to convert it into concise text abstracts.
Existing abstract generation technology has mature methods for text-oriented scenarios, but these cannot be applied directly to scenarios in which speech is the original data. An abstract generation task that takes speech as the original data has the following characteristics: 1) the original text must be obtained through speech recognition, and the converted text is prone to errors such as homophone substitutions; 2) the speech itself carries cues about important information (such as pitch and intensity), and these cues are lost in the text obtained by speech conversion.
Disclosure of Invention
With the large increase in online and offline meetings and lectures, efficiently extracting key information from them has become an important problem. With the rapid development and application of deep learning, the pre-training and fine-tuning paradigm has become the mainstream approach to text summarization. However, existing methods start from the text alone: although they consider the semantic relations of the context, they ignore the speaker's own understanding and emphasis while speaking. The invention therefore provides an automatic abstract generation method based on deep fusion of speech and text, which incorporates the speaker's subjective emphasis and makes the abstract better suited to the application scenario.
The purpose of the invention is to disclose an automatic abstract generation method based on deep fusion of speech and text. Aimed at the current situation in which large amounts of voice information are produced but are difficult to convert and extract, the work focuses on improving speech-text feature extraction and the abstract generation model. The invention enriches the theory and methods of abstract generation for voice data, improves the quality of the generated abstracts, can be widely applied in daily production and life, and creates economic value.
The technical scheme is as follows:
The automatic abstract generation method based on deep fusion of speech and text comprises a preprocessing and feature extraction module, an encoder module, a decoder feature fusion module and a loss function module. The preprocessing and feature extraction module provides speech recognition, text error correction and speech correspondence functions. For the original voice data, an intermediate text sequence is obtained through speech recognition, and the final text sequence is obtained through a text error correction model based on an n-gram model and a similar-pronunciation word table. Exploiting the one-character-one-syllable property of Chinese, the sound intensity information in the voice data is extracted in time order and matched to the Chinese characters in the text sequence, completing text preprocessing and speech feature extraction. The encoder module obtains the original embedded vectors of the text features based on the XLNet framework. The decoder feature fusion and loss function module comprises the decoder structure, sound feature fusion and loss function design; the decoder generates the abstract from the original embedded vectors in two steps: 1) a Transformer-XL based decoder decodes the vectors and generates an intermediate abstract; 2) words are deleted from the intermediate abstract, the result is re-encoded by the encoder to obtain intermediate-abstract embedded vectors, and the decoder corrects the abstract in combination with the original embedded vectors to generate the final abstract. During generation of the intermediate abstract, the sound features are fused into the Value matrix of the multi-head attention decoder, realizing deep fusion of speech and text features. The loss function of the decoder combines maximum likelihood estimation, a ROUGE-L term and an information-content estimate to evaluate the model.
The invention carries out deep and systematic research in the field of abstract generation. It fully utilizes a speech-text feature fusion mechanism and combines the attention mechanism with an encoder-decoder structure, providing a new idea and method for abstract extraction from voice data, expanding abstract generation theory, helping to further improve the quality and effect of abstracts generated from voice data, and promoting the application of automatic abstract generation.
Advantageous effects
1) Aiming at problems such as uneven information quality of voice data and difficulty in extracting key points, the invention proposes a method for fusing deep features of speech and text. The sound intensity features extracted from the speech are added to the multi-layer attention network that extracts the text features, realizing deep fusion of speech and text features and improving the quality of the generated abstract.
2) The invention takes XLNet and Transformer-XL as the basic framework of the abstract generation model, which gives it a stronger ability to summarize long texts and allows the model to handle voice data from more scenarios such as large conferences.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 Overall framework of the invention
FIG. 2 is a schematic diagram of a decoder feature fusion mechanism
FIG. 3 is a schematic diagram of a modified network
Detailed Description
The following describes embodiments of the invention in detail with reference to the accompanying drawings and examples, so that the reader can fully understand and reproduce how the invention applies technical means to solve the technical problems and achieve the technical effects.
The invention discloses an abstract generation method based on deep fusion of speech and text, which generates a corresponding text abstract from voice data. The k speech segments in the voice data are defined as V = {v_1, v_2, …, v_k}; for the given voice data, the abstract generation task is to generate the corresponding abstract Y = {y_1, y_2, …, y_t}.
First part, preprocessing and voice corresponding module
1.1 text acquisition
For the k speech segments {v_1, v_2, …, v_k}, a speech recognition tool is first used to obtain the corresponding k text sequences {x_1, x_2, …, x_k}. Text spelling correction is then performed based on an n-gram model and a similar-pronunciation word table: each text sequence x_i (1 ≤ i ≤ k) is segmented with a word segmentation tool to obtain a word sequence W_i = {w_1, w_2, …, w_n}; each word in the sequence is looked up in the similar-pronunciation table to obtain a set of similar-sounding candidates, the set is traversed for replacement, and the candidate text sequence t'_i with the lowest perplexity is selected as the final result. The perplexity is calculated as follows:
PP(W) = P(w_1, w_2, …, w_n)^(-1/n)
where n is the number of word tokens and P(w_1, w_2, …, w_n) is the probability of the text sequence, computed by the chain rule of probability:
P(w_1, w_2, …, w_n) = p(w_1)·p(w_2|w_1)…p(w_n|w_1, w_2, …, w_{n-1})
Error correction yields the final k text sequences {x'_1, x'_2, …, x'_k}.
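To make the correction step concrete, the following is a minimal Python sketch of perplexity-based homophone correction. The toy bigram counts, the homophone table and the greedy word-by-word replacement strategy are illustrative assumptions; the patent only specifies an n-gram model, a similar-pronunciation word table and selection of the lowest-perplexity sequence.

```python
import math

# Toy bigram language model with add-one smoothing; a real system would
# estimate these counts from a large corpus.
BIGRAM_COUNTS = {("会议", "纪要"): 8, ("会议", "记要"): 1, ("生成", "会议"): 5}
UNIGRAM_COUNTS = {"会议": 10, "记要": 2, "纪要": 8, "生成": 6}
VOCAB_SIZE = len(UNIGRAM_COUNTS)

def sentence_prob(words):
    """Chain-rule probability P(w1..wn), approximated with bigrams."""
    p = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        num = BIGRAM_COUNTS.get((prev, cur), 0) + 1
        den = UNIGRAM_COUNTS.get(prev, 0) + VOCAB_SIZE
        p *= num / den
    return p

def perplexity(words):
    """PP(W) = P(w1..wn)^(-1/n)."""
    return sentence_prob(words) ** (-1.0 / max(len(words), 1))

def correct(words, homophones):
    """Try homophone replacements position by position; keep the lowest-perplexity sequence."""
    best = list(words)
    for i, w in enumerate(words):
        for cand in homophones.get(w, [w]):
            trial = best[:i] + [cand] + best[i + 1:]
            if perplexity(trial) < perplexity(best):
                best = trial
    return best

# Hypothetical homophone table: "记要" and "纪要" share the pronunciation "jiyao".
HOMOPHONES = {"记要": ["纪要"]}
print(correct(["生成", "会议", "记要"], HOMOPHONES))  # -> ['生成', '会议', '纪要']
```

A production system would likely search over all replacement combinations rather than greedily per position, but the selection criterion (lowest perplexity under the n-gram model) is the same.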
1.2 Speech correspondence
The acquired text is aligned with the sound intensity features. Exploiting the one-character-one-syllable property of Chinese, the number of characters {n_1, n_2, …, n_k} in each text sequence of {x'_1, x'_2, …, x'_k} is obtained, where k is the index of the text sequence. Each speech segment v_i is then processed to select its n_i largest sound intensities (i.e., amplitudes), which are arranged in time order into a sound intensity vector a_i of length n_i. Each element of the sound intensity vector corresponds to the sound intensity of one character in the text sequence, completing the correspondence between the speech features and the text;
after the preprocessing module of step 1, the text data {x'_1, x'_2, …, x'_k} required by the abstract extraction model and the corresponding sound features {a_1, a_2, …, a_k} are obtained;
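A small sketch of the speech correspondence step, under the assumption that each segment is available as a raw amplitude array. A real pipeline might use frame-level energy and forced alignment instead, but the patent describes selecting the n_i largest amplitudes and ordering them in time:

```python
import numpy as np

def char_intensity(waveform: np.ndarray, n_chars: int) -> np.ndarray:
    """Select the n_chars largest amplitudes of one speech segment and return
    them in time order, one value per character of the recognized text.

    waveform : 1-D array of amplitude samples for segment v_i
    n_chars  : number of characters n_i in the corrected text x'_i
    """
    amp = np.abs(waveform)
    # indices of the n_chars largest amplitudes (unordered) ...
    top_idx = np.argpartition(amp, -n_chars)[-n_chars:]
    # ... re-ordered chronologically so element j matches the j-th character
    top_idx = np.sort(top_idx)
    return amp[top_idx]

# Toy usage: a fake 16-sample segment recognized as a 3-character text.
rng = np.random.default_rng(0)
segment = rng.normal(size=16)
a_i = char_intensity(segment, n_chars=3)
print(a_i.shape)  # (3,) -> one intensity value per character
```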
Second part, encoder Module
For each error-corrected text sequence x'_i, SentencePiece tokenization followed by embedding yields the token vectors {x_{i,1}, x_{i,2}, …, x_{i,m}}, where m is the number of tokens in the sequence. These are fed into 6 serially connected encoder blocks to obtain the hidden states {h_{i,1}, h_{i,2}, …, h_{i,m}}; the computation is as follows:
Q_{i,j} = K_{i,j} = V_{i,j} = W · x_{i,j}
e_{i,j,k} = (Q_{i,j} · K_{i,k}) / sqrt(d_k)
S_{i,j,k} = exp(e_{i,j,k}) / Σ_k exp(e_{i,j,k})
Z_{i,j} = Σ_k S_{i,j,k} · V_{i,k}
h_{i,j} = Norm(fc(Norm(Z_{i,j})))
where W is a weight matrix parameter to be trained; Q, K, V are the query, key and value matrices in the Transformer structure; S_{i,j,k} is the attention of the k-th word to the j-th word in sentence x'_i; Norm denotes the normalization operation; and fc denotes the fully connected operation in the neural network;
to better handle long texts, relative position encoding is added and the ideas of Transformer-XL are incorporated, and the computation of S_{i,j,k} is modified as follows:
S_{i,j,k} ∝ x_{i,j}^T W_q^T W_k x_{i,k} + x_{i,j}^T W_q^T W_k r_{j-k} + p^T W_k x_{i,k} + p^T W_k r_{j-k}
where the absolute position of word j is represented by a fixed vector p, r_{j-k} denotes the relative position embedding between word j and word k, and W_q and W_k are the weight matrices to be learned by the model for computing the Q and K matrices;
following the idea of Transformer-XL, the output of the previous sentence is stored in a cache; during training the next sentence is concatenated with this cached output and the two enter the attention mechanism together, but the cached content is excluded from the backward gradient computation;
since multi-head attention is introduced, 8 weight matrices W are learned, yielding {Z_{i,j,1}, Z_{i,j,2}, …, Z_{i,j,8}}; before being input to the fully connected layer these are concatenated and multiplied by W_0, where W_0 is another weight matrix parameter to be trained;
third part, decoder feature fusion module
The process of generating the text abstract by using the decoder is divided into three parts:
3.1 Intermediate abstract generation: the original embedded vector H of the text is obtained from the encoder and decoded step by step by the Transformer-XL based decoder; the decoding process is:
o_i = TransformerXLDecoder(o_{<i}, H), 1 ≤ i ≤ L
H = XLNet(x'_1, x'_2, …, x'_m)
generating, from left to right, an intermediate abstract O = {o_1, o_2, …, o_L} of length L, where {x'_i | 1 ≤ i ≤ m} denotes the text sequence data obtained after preprocessing;
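A minimal sketch of this left-to-right decoding loop; `decoder_step` stands in for the Transformer-XL decoder and is a hypothetical callable, stubbed here so the loop structure is runnable:

```python
from typing import Callable, List

def generate_intermediate_summary(
    H,                                                  # encoder memory H = XLNet(x'_1..x'_m)
    decoder_step: Callable[[List[int], object], int],   # hypothetical Transformer-XL decoder step
    max_len: int,
    eos_id: int = 2,
) -> List[int]:
    """Left-to-right decoding: o_i = TransformerXLDecoder(o_<i, H), 1 <= i <= L."""
    summary: List[int] = []
    for _ in range(max_len):
        next_token = decoder_step(summary, H)   # attends to o_<i and the source memory H
        if next_token == eos_id:
            break
        summary.append(next_token)
    return summary

# Stub decoder for illustration: always emits tokens 10, 11, 12 and then EOS.
script = iter([10, 11, 12, 2])
print(generate_intermediate_summary(H=None, decoder_step=lambda o, H: next(script), max_len=8))
```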
3.2 Fusion of sound features: the sound features are injected into the multi-head attention of the decoder that generates the intermediate abstract. The specific method is as follows: first, the obtained sound features a = {a_1, a_2, …, a_k} are Min-Max normalized:
a_i' = (a_i − min(a)) / (max(a) − min(a))
then, let the Value matrix in the decoder be V ═ V (V)1,v2,…,vk)TThe fusion of the sound features is performed by:
V′=(a1*v1,a2*v2,…,ak*vk)T
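The fusion step itself is an element-wise rescaling of the Value rows, as the following sketch shows (array shapes are illustrative assumptions):

```python
import numpy as np

def fuse_sound_features(V: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Scale each row v_i of the decoder Value matrix by the Min-Max
    normalized sound intensity a_i:  V' = (a_1*v_1, ..., a_k*v_k)^T."""
    a_norm = (a - a.min()) / (a.max() - a.min() + 1e-8)   # Min-Max normalization
    return a_norm[:, None] * V                            # broadcast over the feature dimension

# Toy example: k = 4 source positions, value dimension 6.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 6))
a = np.array([0.2, 0.9, 0.5, 0.7])      # per-character sound intensities
print(fuse_sound_features(V, a).shape)  # (4, 6)
```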
3.3 Generation of the corrected abstract: since the embedded encoding of the text sequence is produced by the XLNet-based encoder, re-encoding the intermediate abstract with XLNet allows the model to understand the meaning of the original embedded vectors more deeply and to better capture the contextual semantic information.
Deleting one word at a time from O yields L different copies {O'_1, O'_2, …, O'_L}, with O'_i = O − o_i (1 ≤ i ≤ L), so that each copy is missing exactly one word. Next, for each intermediate-abstract copy, its embedded vector H'_i (1 ≤ i ≤ L) is obtained with XLNet:
H'_i = XLNet(O'_i)
The missing word y_i (1 ≤ i ≤ L) of each copy is then predicted by the Transformer-XL based decoder in combination with the source document information:
y_i = TransformerXLDecoder(H'_i, H)
Supplementing the missing words of all the copies and combining the results yields the final corrected abstract:
Y = {y_1, y_2, …, y_L}
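A sketch of the correction loop: one copy per deleted position, each refilled by a predictor. The `predict_missing` callable stands in for the XLNet re-encoding plus Transformer-XL decoding and is stubbed for illustration:

```python
from typing import Callable, List, Sequence

def refine_summary(
    O: Sequence[str],                                   # intermediate abstract o_1..o_L
    predict_missing: Callable[[List[str], int], str],   # hypothetical XLNet + decoder filler
) -> List[str]:
    """For each position i, drop o_i to get copy O'_i, re-encode it, and let the
    decoder predict the missing word y_i from (H'_i, H); collect Y = {y_1..y_L}."""
    Y: List[str] = []
    for i in range(len(O)):
        copy_i = list(O[:i]) + list(O[i + 1:])          # O'_i = O - o_i
        Y.append(predict_missing(copy_i, i))            # y_i = Decoder(H'_i, H)
    return Y

# Stub filler: pretend the model simply restores the dropped word from a reference.
reference = ["speech", "summaries", "need", "acoustic", "cues"]
filler = lambda copy, i: reference[i]
print(refine_summary(reference, filler))
```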
fourth, loss function module
4.1 Intermediate abstract loss function: during formation of the intermediate abstract, let the generated intermediate abstract be a = {a_1, a_2, …, a_L}. The corresponding loss function adopts maximum likelihood estimation and a ROUGE-L based term, computed as follows:
maximum likelihood estimation:
L_ML = −Σ_{t=1}^{L} log P(a_t = a*_t | a_{<t}, H)
where a_{<t} = {a_1, …, a_{t−1}}, H = XLNet(x'_1, x'_2, …, x'_m), and a*_t is the t-th word of the real abstract.
ROUGE-L rule:
L_RL = −R(a^s) · Σ_{t=1}^{L} log P(a^s_t | a^s_{<t}, H)
where a^s is an intermediate abstract sampled from the predicted distribution and R(a^s) is the score obtained by comparing a^s with the real label. The two are combined to give the final loss function of the intermediate-abstract stage, where β is a hyper-parameter controlling the proportion of the two loss terms:
L_inter = β·L_ML + (1 − β)·L_RL
4.2 Evaluation of the corrected abstract: let the corrected abstract be y = {y_1, …, y_L}; it is evaluated using maximum likelihood estimation and mutual information.
Maximum likelihood estimation:
L'_ML = −Σ_{t=1}^{L} log P(y_t = y*_t | a_{≠t}, H)
where a_{≠t} = {a_1, …, a_{t−1}, a_{t+1}, …, a_L}, H = XLNet(x'_1, x'_2, …, x'_m), and y*_t is the t-th word of the real abstract.
Mutual information evaluation:
L_MI = I(y; y*)
The mutual information measures how much information of the real label y* is contained in the final abstract y, and is used to measure the degree of information coverage of the abstract.
Adding the two gives the evaluation function of the corrected abstract:
L_corr = L'_ML + L_MI
combining the loss functions, the loss function of the decoder model is obtained as follows:
L = L_inter + L_corr
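A numerical sketch of how the loss terms might be combined. Because the equation images are not reproduced here, the exact form of the ROUGE-L term and the handling of the mutual-information term are assumptions; only the β mixture and the additive combination follow the text directly.

```python
import numpy as np

def mle_loss(token_logprobs: np.ndarray) -> float:
    """L_ML = -sum_t log P(a_t = a*_t | a_<t, H) over the reference tokens."""
    return float(-token_logprobs.sum())

def rouge_l_reward_loss(sample_logprobs: np.ndarray, reward: float) -> float:
    """REINFORCE-style surrogate: scale the sampled abstract's log-likelihood
    by its ROUGE-L reward R(a^s). (Exact form of the patent's L_RL is assumed.)"""
    return float(-reward * sample_logprobs.sum())

def decoder_loss(ref_lp, sample_lp, reward, corr_lp, mutual_info, beta=0.7):
    """L = L_inter + L_corr, with L_inter = beta*L_ML + (1-beta)*L_RL."""
    l_inter = beta * mle_loss(ref_lp) + (1.0 - beta) * rouge_l_reward_loss(sample_lp, reward)
    # "adding the two" per the text; how the MI term enters optimization is not specified
    l_corr = mle_loss(corr_lp) + mutual_info
    return l_inter + l_corr

# Toy numbers: 4-token abstracts, ROUGE-L reward 0.42, MI estimate 1.3 nats.
lp = np.log(np.array([0.6, 0.5, 0.7, 0.4]))
print(decoder_loss(lp, lp, reward=0.42, corr_lp=lp, mutual_info=1.3, beta=0.7))
```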
the overall schematic diagram of this embodiment is shown in fig. 1.
While the foregoing specification shows and describes several preferred embodiments of the invention, it is to be understood, as noted above, that the invention is not limited to the forms disclosed herein; it is not to be construed as excluding other embodiments, and it can be used in various other combinations, modifications and environments and can be changed within the scope of the inventive concept described herein, in accordance with the above teachings or the skill or knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Innovation point
The first innovation: deep fusion of speech and text features
In traditional abstract extraction, only the text information is considered. The method fully utilizes the emphasis features contained in the speech: through an attention-based deep learning model, the speech features are deeply fused with the text features obtained by speech conversion, and a text abstract is generated from the voice information.
The second innovation: abstract generation model structure design based on XLNet
The deep learning model adopted by the method is based on XLNet and is improved in the decoder structure and the loss function, which helps increase the amount of information contained in the generated abstract and makes full use of the text features obtained by the encoder. Compared with the prior art, the invention achieves a better abstract effect.

Claims (2)

1. A method for automatically generating an abstract based on voice text deep fusion is characterized by comprising the following specific steps:
step 1, preprocessing and voice corresponding module
Step 1.1 text acquisition
For the k speech segments {v_1, v_2, …, v_k}, a speech recognition tool is first used to obtain the corresponding k text sequences {x_1, x_2, …, x_k}; text spelling correction is then performed based on an n-gram model and a similar-pronunciation word table: each text sequence x_i (1 ≤ i ≤ k) is segmented with a word segmentation tool to obtain a word sequence W_i = {w_1, w_2, …, w_n}, each word in the sequence is looked up in the similar-pronunciation table to obtain a set of similar-sounding candidates, the set is traversed for replacement, and the candidate text sequence t'_i with the lowest perplexity is selected as the final result; the perplexity is calculated as follows:
PP(W) = P(w_1, w_2, …, w_n)^(-1/n)
where n is the number of word tokens and P(w_1, w_2, …, w_n) is the probability of the text sequence, computed by the chain rule of probability:
P(w_1, w_2, …, w_n) = p(w_1)·p(w_2|w_1)…p(w_n|w_1, w_2, …, w_{n-1})
error correction yields the final k text sequences {x'_1, x'_2, …, x'_k};
Step 1.2 Speech correspondence
The acquired text is aligned with the sound intensity features: exploiting the one-character-one-syllable property of Chinese, the number of characters {n_1, n_2, …, n_k} in each text sequence of {x'_1, x'_2, …, x'_k} is obtained, where k is the index of the text sequence; each speech segment v_i is processed to select its n_i largest sound intensities (i.e., amplitudes), which are arranged in time order into a sound intensity vector a_i of length n_i; each element of the sound intensity vector corresponds to the sound intensity of one character in the text sequence, completing the correspondence between the speech features and the text;
after the preprocessing module of step 1, the text data {x'_1, x'_2, …, x'_k} required by the abstract extraction model and the corresponding sound features {a_1, a_2, …, a_k} are obtained;
Step 2, coder module
After the preprocessing module is completed, word segmentation is performed first, and the text data is encoded with a language model pre-trained by XLNet to obtain the corresponding text vectors;
for the text sequence x 'obtained after pre-processing'iThe word vector { x is obtained by embedding after the Sentence piece divides the wordi,1,xi,2,...,xi,mM is the word number of the text sequence, and the word number is sent to 6 decoders connected in series to obtain a hidden state { h }i,1,hi,2,...,hi,mThe calculation process is as follows:
Q_{i,j} = K_{i,j} = V_{i,j} = W · x_{i,j}
e_{i,j,k} = (Q_{i,j} · K_{i,k}) / sqrt(d_k)
S_{i,j,k} = exp(e_{i,j,k}) / Σ_k exp(e_{i,j,k})
Z_{i,j} = Σ_k S_{i,j,k} · V_{i,k}
h_{i,j} = Norm(fc(Norm(Z_{i,j})))
where W is a weight matrix parameter to be trained; Q, K, V are the query, key and value matrices in the Transformer structure; S_{i,j,k} is the attention of the k-th word to the j-th word in sentence x'_i; Norm denotes the normalization operation; and fc denotes the fully connected operation in the neural network;
to better handle long texts, relative position encoding is added and the ideas of Transformer-XL are incorporated, and the computation of S_{i,j,k} is modified as follows:
S_{i,j,k} ∝ x_{i,j}^T W_q^T W_k x_{i,k} + x_{i,j}^T W_q^T W_k r_{j-k} + p^T W_k x_{i,k} + p^T W_k r_{j-k}
where the absolute position of word j is represented by a fixed vector p, r_{j-k} denotes the relative position embedding between word j and word k, and W_q and W_k are the weight matrices to be learned for computing the Q and K matrices;
following the idea of Transformer-XL, the output of the previous sentence is stored in a cache; during training the next sentence is concatenated with this cached output and the two enter the attention mechanism together, but the cached content is excluded from the backward gradient computation;
since multi-head attention is introduced, 8 weight matrices W are learned, yielding {Z_{i,j,1}, Z_{i,j,2}, …, Z_{i,j,8}}; before being input to the fully connected layer these are concatenated and multiplied by W_0, where W_0 is another weight matrix parameter to be trained;
step 3, a decoder feature fusion module and a loss function module
The sound features are incorporated into the attention computation of the decoder to obtain an intermediate abstract, and a loss function is designed for learning; the intermediate abstract is then re-encoded with XLNet to further learn its semantics and decoded again, an evaluation function is designed to train the model, and the final result is the required abstract;
the specific operation of the decoder is as follows:
1) Intermediate abstract generation: the original embedded vector H of the text is obtained from the encoder and decoded step by step by the Transformer-XL based decoder; the decoding process is:
o_i = TransformerXLDecoder(o_{<i}, H), 1 ≤ i ≤ L
H = XLNet(x'_1, x'_2, …, x'_m)
generating, from left to right, an intermediate abstract O = {o_1, o_2, …, o_L} of length L, where {x'_i | 1 ≤ i ≤ m} denotes the text sequence data obtained after preprocessing;
2) Fusion of sound features: the sound features are injected into the multi-head attention of the decoder that generates the intermediate abstract; the specific method is as follows: first, the obtained sound features a = {a_1, a_2, …, a_k} are Min-Max normalized:
a_i' = (a_i − min(a)) / (max(a) − min(a))
then, letting the Value matrix in the decoder be V = (v_1, v_2, …, v_k)^T, the sound features are fused by:
V' = (a_1·v_1, a_2·v_2, …, a_k·v_k)^T
3) Generation of the corrected abstract: since the embedded encoding of the text sequence is produced by the XLNet-based encoder, re-encoding the intermediate abstract with XLNet allows the model to understand the meaning of the original embedded vectors more deeply and to better capture the contextual semantic information;
each word in O is deleted in turn to obtain L different copies {O'_1, O'_2, …, O'_L}, with O'_i = O − o_i (1 ≤ i ≤ L), so that each copy is missing exactly one word; next, for each intermediate-abstract copy, its embedded vector H'_i (1 ≤ i ≤ L) is obtained with XLNet:
H'_i = XLNet(O'_i)
the missing word y_i (1 ≤ i ≤ L) of each copy is then predicted by the Transformer-XL based decoder in combination with the source document information:
y_i = TransformerXLDecoder(H'_i, H)
supplementing the missing words of all the copies and combining the results yields the final corrected abstract:
Y = {y_1, y_2, …, y_L};
the specific design of the loss function in step 3 is as follows:
1) Intermediate abstract loss function: during formation of the intermediate abstract, let the generated intermediate abstract be a = {a_1, a_2, …, a_L}; the corresponding loss function adopts maximum likelihood estimation and a ROUGE-L based term, computed as follows:
maximum likelihood estimation:
L_ML = −Σ_{t=1}^{L} log P(a_t = a*_t | a_{<t}, H)
where a_{<t} = {a_1, …, a_{t−1}}, H = XLNet(x'_1, x'_2, …, x'_m), and a*_t is the t-th word of the real abstract;
ROUGE-L rule:
L_RL = −R(a^s) · Σ_{t=1}^{L} log P(a^s_t | a^s_{<t}, H)
where a^s is an intermediate abstract sampled from the predicted distribution and R(a^s) is the score obtained by comparing a^s with the real label; the two are combined to give the final loss function of the intermediate-abstract stage, where β is a hyper-parameter controlling the proportion of the two loss terms:
L_inter = β·L_ML + (1 − β)·L_RL
2) Evaluation of the corrected abstract: let the corrected abstract be y = {y_1, …, y_L}; it is evaluated using maximum likelihood estimation and mutual information;
maximum likelihood estimation:
L'_ML = −Σ_{t=1}^{L} log P(y_t = y*_t | a_{≠t}, H)
where a_{≠t} = {a_1, …, a_{t−1}, a_{t+1}, …, a_L}, H = XLNet(x'_1, x'_2, …, x'_m), and y*_t is the t-th word of the real abstract;
mutual information evaluation:
L_MI = I(y; y*)
the mutual information measures how much information of the real label y* is contained in the final abstract y, and is used to measure the degree of information coverage of the abstract;
adding the two gives the evaluation function of the corrected abstract:
L_corr = L'_ML + L_MI
combining the above loss functions, the loss function of the decoder model is obtained as follows:
L = L_inter + L_corr
2. An automatic abstract generation system based on deep fusion of speech and text, designed according to the method of claim 1, characterized in that the system comprises a preprocessing and voice correspondence module, an encoder module, a decoder feature fusion module and a loss function module; the preprocessing and voice correspondence module comprises text acquisition and voice correspondence; the decoder feature fusion module comprises intermediate abstract generation, sound feature fusion and corrected abstract generation; and the loss function module comprises an intermediate abstract loss function and an evaluation function of the corrected abstract;
the system completes the following functions:
for user audio, the corresponding text is obtained through a speech recognition tool, misspellings are corrected during text acquisition, and the voice features corresponding to the characters are obtained through voice correspondence;
a vector representation of the text is obtained through the pre-trained XLNet encoder, and the text vectors and voice features are passed through the decoder's sound feature fusion and intermediate abstract generation to obtain an intermediate abstract;
the intermediate abstract is encoded by XLNet again to obtain a further understanding of the text, and finally the corrected abstract generation step is learned to produce the final abstract.
CN202011198008.5A 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features Active CN112417134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011198008.5A CN112417134B (en) 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011198008.5A CN112417134B (en) 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features

Publications (2)

Publication Number Publication Date
CN112417134A (en) 2021-02-26
CN112417134B (en) 2022-05-13

Family

ID=74828717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011198008.5A Active CN112417134B (en) 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features

Country Status (1)

Country Link
CN (1) CN112417134B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673248B (en) * 2021-08-23 2022-02-01 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN114118024B (en) * 2021-12-06 2022-06-21 成都信息工程大学 Conditional text generation method and generation system
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model
CN115544244B (en) * 2022-09-06 2023-11-17 内蒙古工业大学 Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction
CN115827854B (en) * 2022-12-28 2023-08-11 数据堂(北京)科技股份有限公司 Speech abstract generation model training method, speech abstract generation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646094A (en) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 System and method for automatic extraction and generation of audiovisual product content abstract
CN108305632A (en) * 2018-02-02 2018-07-20 深圳市鹰硕技术有限公司 A kind of the voice abstract forming method and system of meeting
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111739536A (en) * 2020-05-09 2020-10-02 北京捷通华声科技股份有限公司 Audio processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101547167A (en) * 2008-03-25 2009-09-30 华为技术有限公司 Content classification method, device and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646094A (en) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 System and method for automatic extraction and generation of audiovisual product content abstract
CN108305632A (en) * 2018-02-02 2018-07-20 深圳市鹰硕技术有限公司 A kind of the voice abstract forming method and system of meeting
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111739536A (en) * 2020-05-09 2020-10-02 北京捷通华声科技股份有限公司 Audio processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"新闻广播语音自动摘要技术研究";王天课;《万方数据库》;20140522;第22-38页 *

Also Published As

Publication number Publication date
CN112417134A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112417134B (en) Automatic abstract generation system and method based on voice text deep fusion features
CN108519890B (en) Robust code abstract generation method based on self-attention mechanism
WO2019085779A1 (en) Machine processing and text correction method and device, computing equipment and storage media
CN110413729B (en) Multi-turn dialogue generation method based on clause-context dual attention model
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
CN112765345A (en) Text abstract automatic generation method and system fusing pre-training model
CN113158665A (en) Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN112417092A (en) Intelligent text automatic generation system based on deep learning and implementation method thereof
Zhu et al. Robust spoken language understanding with unsupervised asr-error adaptation
CN114444481B (en) Sentiment analysis and generation method of news comment
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN114942990A (en) Few-sample abstract dialogue abstract generation system based on prompt learning
Chharia et al. Deep recurrent architecture based scene description generator for visually impaired
CN110197521B (en) Visual text embedding method based on semantic structure representation
CN116909435A (en) Data processing method and device, electronic equipment and storage medium
CN115858736A (en) Emotion text generation method based on emotion prompt fine adjustment
CN113239166B (en) Automatic man-machine interaction method based on semantic knowledge enhancement
CN115223549A (en) Vietnamese speech recognition corpus construction method
CN115310461A (en) Low-resource speech translation method and system based on multi-modal data optimization
CN114756679A (en) Chinese medical text entity relation combined extraction method based on conversation attention mechanism
CN111709245A (en) Chinese-Yuan pseudo parallel sentence pair extraction method based on semantic self-adaptive coding
CN112380836A (en) Intelligent Chinese message question generating method
CN117195915B (en) Information extraction method and device for session content, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant