CN112417134B - Automatic abstract generation system and method based on voice text deep fusion features - Google Patents

Automatic abstract generation system and method based on voice text deep fusion features

Info

Publication number
CN112417134B
CN112417134B (application CN202011198008.5A)
Authority
CN
China
Prior art keywords
abstract
text
word
voice
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011198008.5A
Other languages
Chinese (zh)
Other versions
CN112417134A (en)
Inventor
申树藩
张思琪
周逸伦
卫志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202011198008.5A priority Critical patent/CN112417134B/en
Publication of CN112417134A publication Critical patent/CN112417134A/en
Application granted granted Critical
Publication of CN112417134B publication Critical patent/CN112417134B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The system comprises a preprocessing and voice correspondence module, an encoder module, a decoder feature fusion module and a loss function module. The preprocessing and voice correspondence module comprises text acquisition and voice correspondence. The decoder feature fusion module comprises intermediate abstract generation, sound feature fusion and corrected abstract generation. The loss function module comprises an intermediate abstract loss function and an evaluation function for the corrected abstract. For user voice data, the corresponding text is obtained through text acquisition, and the voice features corresponding to the characters are obtained through voice correspondence; the text data is passed through a pre-trained XLNet encoder to obtain a vector representation of the text; the text vectors and the voice features are combined by the decoder through sound feature fusion and intermediate abstract generation to produce an intermediate abstract; the intermediate abstract is encoded by XLNet again to obtain a further understanding of the text, and finally the intermediate abstract is corrected through learning to generate the final abstract.

Description

Automatic abstract generation system and method based on voice text deep fusion features
Technical Field
The invention belongs to the field of automatic abstract generation, and particularly relates to an automatic abstract generation method based on speech-text deep fusion features.
Background
With the rapid development of the internet and the upgrading of communication technology, the ways in which people communicate over the internet have become diverse. Communicating with voice information is a convenient and common approach. According to DingTalk statistics, more than 20 million online conferences are initiated on the platform every day; according to estimates by the Qianzhan Industry Research Institute, by 2018 the number of users in China's online education market had exceeded 20 million. As an important information carrier, voice makes communication more convenient and faster, but it also makes information processing and retrieval more difficult. Therefore, automatically generating text abstracts for information-rich speech has wide application scenarios and value. For example, a conference scene produces a large amount of valuable voice information; manually compiling conference minutes consumes manpower and is easily affected by the subjective views of the organizers, whereas an automatic abstract system can produce objective conference minutes while saving cost. In a lecture scene, automatic abstract generation can extract the key knowledge of a class and enhance the teaching effect. The voice information generated in daily production and life is rich in information value but highly redundant, so it is necessary to convert it into concise text abstracts.
Existing abstract generation technology has mature methods for text-oriented scenarios, but these cannot be applied directly to scenarios in which speech is the original data. An abstract generation task that takes speech as the original data has the following characteristics: 1) the original text must be obtained through speech recognition, and the converted text is prone to errors such as homophone substitutions; 2) the speech itself carries cues about important information (such as pitch and intensity), and these cues are lost in the text obtained by speech conversion.
Disclosure of Invention
With the large increase in online and offline meetings and lectures, efficiently extracting key information from them has become an important problem. With the rapid development and application of deep learning, the pre-training and fine-tuning paradigm has become the mainstream approach to text summarization. However, existing methods start from the text alone: although they consider the semantic relations of the context, they ignore the speaker's own understanding and emphasis while speaking. The invention therefore provides an automatic abstract generation method based on deep fusion of speech and text, which incorporates the speaker's subjective emphasis and makes the abstract better suited to the application scenario.
The purpose of the invention is to disclose an automatic abstract generation method based on deep fusion of speech and text. Aimed at the current situation in which large amounts of voice information are produced but are difficult to convert and extract, the work focuses on improving speech-text feature extraction and the abstract generation model. The invention enriches the theory and methods of abstract generation for voice data, improves the quality of the generated abstracts, can be widely applied in daily production and life, and creates economic value.
The technical scheme is as follows:
The automatic abstract generation method based on deep fusion of speech and text comprises a preprocessing and feature extraction module, an encoder module, a decoder feature fusion module and a loss function module. The preprocessing and feature extraction module provides speech recognition, text error correction and speech correspondence functions. For the original voice data, an intermediate text sequence is obtained through speech recognition, and the final text sequence is obtained through a text error correction model based on an n-gram model and a similar-pronunciation word table. Exploiting the one-character-one-syllable property of Chinese, the sound intensity information in the voice data is extracted in time order and matched to the Chinese characters in the text sequence, completing text preprocessing and speech feature extraction. The encoder module obtains the original embedded vectors of the text features based on the XLNet framework. The decoder feature fusion and loss function module comprises the decoder structure, sound feature fusion and loss function design; the decoder generates the abstract from the original embedded vectors in two steps: 1) a Transformer-XL based decoder decodes the vectors and generates an intermediate abstract; 2) words are deleted from the intermediate abstract, the result is re-encoded by the encoder to obtain intermediate-abstract embedded vectors, and the decoder corrects the abstract in combination with the original embedded vectors to generate the final abstract. During generation of the intermediate abstract, the sound features are fused into the Value matrix of the multi-head attention decoder, realizing deep fusion of speech and text features. The loss function of the decoder combines maximum likelihood estimation, a ROUGE-L term and an information-content estimate to evaluate the model.
The invention carries out deep and systematic research in the field of abstract generation. It fully utilizes a speech-text feature fusion mechanism and combines the attention mechanism with an encoder-decoder structure, providing a new idea and method for abstract extraction from voice data, expanding abstract generation theory, helping to further improve the quality and effect of abstracts generated from voice data, and promoting the application of automatic abstract generation.
Advantageous effects
1) Aiming at problems such as uneven information quality of voice data and difficulty in extracting key points, the invention proposes a method for fusing deep features of speech and text. The sound intensity features extracted from the speech are added to the multi-layer attention network that extracts the text features, realizing deep fusion of speech and text features and improving the quality of the generated abstract.
2) The invention takes XLNet and Transformer-XL as the basic framework of the abstract generation model, which gives it a stronger ability to summarize long texts and allows the model to handle voice data from more scenarios such as large conferences.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 Overall framework of the invention
FIG. 2 is a schematic diagram of a decoder feature fusion mechanism
FIG. 3 is a schematic diagram of a modified network
Detailed Description
The following describes embodiments of the invention in detail with reference to the accompanying drawings and examples, so that the reader can fully understand and reproduce how the invention applies technical means to solve the technical problems and achieve the technical effects.
The invention discloses an abstract generation method based on deep fusion of speech and text, which generates a corresponding text abstract from voice data. The k speech segments in the voice data are defined as V = {v_1, v_2, …, v_k}; for the given voice data, the abstract generation task is to generate the corresponding abstract Y = {y_1, y_2, …, y_t}.
First part, preprocessing and voice corresponding module
1.1 text acquisition
For the k speech segments {v_1, v_2, …, v_k}, a speech recognition tool is first used to obtain the corresponding k text sequences {x_1, x_2, …, x_k}. Text spelling correction is then performed based on an n-gram model and a similar-pronunciation word table: each text sequence x_i (1 ≤ i ≤ k) is segmented with a word segmentation tool to obtain a word sequence W_i = {w_1, w_2, …, w_n}; each word in the sequence is looked up in the similar-pronunciation table to obtain a set of similar-sounding candidates, the set is traversed for replacement, and the candidate text sequence t'_i with the lowest perplexity is selected as the final result. The perplexity is calculated as follows:
PP(W) = P(w_1, w_2, …, w_n)^(-1/n)
where n is the number of word tokens and P(w_1, w_2, …, w_n) is the probability of the text sequence, computed by the chain rule of probability:
P(w_1, w_2, …, w_n) = p(w_1)·p(w_2|w_1)…p(w_n|w_1, w_2, …, w_{n-1})
Error correction yields the final k text sequences {x'_1, x'_2, …, x'_k}.
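To make the correction step concrete, the following is a minimal Python sketch of perplexity-based homophone correction. The toy bigram counts, the homophone table and the greedy word-by-word replacement strategy are illustrative assumptions; the patent only specifies an n-gram model, a similar-pronunciation word table and selection of the lowest-perplexity sequence.

```python
import math

# Toy bigram language model with add-one smoothing; a real system would
# estimate these counts from a large corpus.
BIGRAM_COUNTS = {("会议", "纪要"): 8, ("会议", "记要"): 1, ("生成", "会议"): 5}
UNIGRAM_COUNTS = {"会议": 10, "记要": 2, "纪要": 8, "生成": 6}
VOCAB_SIZE = len(UNIGRAM_COUNTS)

def sentence_prob(words):
    """Chain-rule probability P(w1..wn), approximated with bigrams."""
    p = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        num = BIGRAM_COUNTS.get((prev, cur), 0) + 1
        den = UNIGRAM_COUNTS.get(prev, 0) + VOCAB_SIZE
        p *= num / den
    return p

def perplexity(words):
    """PP(W) = P(w1..wn)^(-1/n)."""
    return sentence_prob(words) ** (-1.0 / max(len(words), 1))

def correct(words, homophones):
    """Try homophone replacements position by position; keep the lowest-perplexity sequence."""
    best = list(words)
    for i, w in enumerate(words):
        for cand in homophones.get(w, [w]):
            trial = best[:i] + [cand] + best[i + 1:]
            if perplexity(trial) < perplexity(best):
                best = trial
    return best

# Hypothetical homophone table: "记要" and "纪要" share the pronunciation "jiyao".
HOMOPHONES = {"记要": ["纪要"]}
print(correct(["生成", "会议", "记要"], HOMOPHONES))  # -> ['生成', '会议', '纪要']
```

A production system would likely search over all replacement combinations rather than greedily per position, but the selection criterion (lowest perplexity under the n-gram model) is the same.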
1.2 Speech correspondence
The acquired text is aligned with the sound intensity features. Exploiting the one-character-one-syllable property of Chinese, the number of characters {n_1, n_2, …, n_k} in each text sequence of {x'_1, x'_2, …, x'_k} is obtained, where k is the index of the text sequence. Each speech segment v_i is then processed to select its n_i largest sound intensities (i.e., amplitudes), which are arranged in time order into a sound intensity vector a_i of length n_i. Each element of the sound intensity vector corresponds to the sound intensity of one character in the text sequence, completing the correspondence between the speech features and the text;
after the preprocessing module of step 1, the text data {x'_1, x'_2, …, x'_k} required by the abstract extraction model and the corresponding sound features {a_1, a_2, …, a_k} are obtained;
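A small sketch of the speech correspondence step, under the assumption that each segment is available as a raw amplitude array. A real pipeline might use frame-level energy and forced alignment instead, but the patent describes selecting the n_i largest amplitudes and ordering them in time:

```python
import numpy as np

def char_intensity(waveform: np.ndarray, n_chars: int) -> np.ndarray:
    """Select the n_chars largest amplitudes of one speech segment and return
    them in time order, one value per character of the recognized text.

    waveform : 1-D array of amplitude samples for segment v_i
    n_chars  : number of characters n_i in the corrected text x'_i
    """
    amp = np.abs(waveform)
    # indices of the n_chars largest amplitudes (unordered) ...
    top_idx = np.argpartition(amp, -n_chars)[-n_chars:]
    # ... re-ordered chronologically so element j matches the j-th character
    top_idx = np.sort(top_idx)
    return amp[top_idx]

# Toy usage: a fake 16-sample segment recognized as a 3-character text.
rng = np.random.default_rng(0)
segment = rng.normal(size=16)
a_i = char_intensity(segment, n_chars=3)
print(a_i.shape)  # (3,) -> one intensity value per character
```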
Second part, encoder Module
For each error-corrected text sequence x'_i, SentencePiece tokenization followed by embedding yields the token vectors {x_{i,1}, x_{i,2}, …, x_{i,m}}, where m is the number of tokens in the sequence. These are fed into 6 serially connected encoder blocks to obtain the hidden states {h_{i,1}, h_{i,2}, …, h_{i,m}}; the computation is as follows:
Q_{i,j} = K_{i,j} = V_{i,j} = W · x_{i,j}
e_{i,j,k} = (Q_{i,j} · K_{i,k}) / sqrt(d_k)
S_{i,j,k} = exp(e_{i,j,k}) / Σ_k exp(e_{i,j,k})
Z_{i,j} = Σ_k S_{i,j,k} · V_{i,k}
h_{i,j} = Norm(fc(Norm(Z_{i,j})))
where W is a weight matrix parameter to be trained; Q, K, V are the query, key and value matrices in the Transformer structure; S_{i,j,k} is the attention of the k-th word to the j-th word in sentence x'_i; Norm denotes the normalization operation; and fc denotes the fully connected operation in the neural network;
to better handle long texts, relative position encoding is added and the ideas of Transformer-XL are incorporated, and the computation of S_{i,j,k} is modified as follows:
S_{i,j,k} ∝ x_{i,j}^T W_q^T W_k x_{i,k} + x_{i,j}^T W_q^T W_k r_{j-k} + p^T W_k x_{i,k} + p^T W_k r_{j-k}
where the absolute position of word j is represented by a fixed vector p, r_{j-k} denotes the relative position embedding between word j and word k, and W_q and W_k are the weight matrices to be learned by the model for computing the Q and K matrices;
following the idea of Transformer-XL, the output of the previous sentence is stored in a cache; during training the next sentence is concatenated with this cached output and the two enter the attention mechanism together, but the cached content is excluded from the backward gradient computation;
since multi-head attention is introduced, 8 weight matrices W are learned, yielding {Z_{i,j,1}, Z_{i,j,2}, …, Z_{i,j,8}}; before being input to the fully connected layer these are concatenated and multiplied by W_0, where W_0 is another weight matrix parameter to be trained;
third part, decoder feature fusion module
The process of generating the text abstract by using the decoder is divided into three parts:
3.1 Intermediate abstract generation: the original embedded vector H of the text is obtained from the encoder and decoded step by step by the Transformer-XL based decoder; the decoding process is:
o_i = TransformerXLDecoder(o_{<i}, H), 1 ≤ i ≤ L
H = XLNet(x'_1, x'_2, …, x'_m)
generating, from left to right, an intermediate abstract O = {o_1, o_2, …, o_L} of length L, where {x'_i | 1 ≤ i ≤ m} denotes the text sequence data obtained after preprocessing;
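A minimal sketch of this left-to-right decoding loop; `decoder_step` stands in for the Transformer-XL decoder and is a hypothetical callable, stubbed here so the loop structure is runnable:

```python
from typing import Callable, List

def generate_intermediate_summary(
    H,                                                  # encoder memory H = XLNet(x'_1..x'_m)
    decoder_step: Callable[[List[int], object], int],   # hypothetical Transformer-XL decoder step
    max_len: int,
    eos_id: int = 2,
) -> List[int]:
    """Left-to-right decoding: o_i = TransformerXLDecoder(o_<i, H), 1 <= i <= L."""
    summary: List[int] = []
    for _ in range(max_len):
        next_token = decoder_step(summary, H)   # attends to o_<i and the source memory H
        if next_token == eos_id:
            break
        summary.append(next_token)
    return summary

# Stub decoder for illustration: always emits tokens 10, 11, 12 and then EOS.
script = iter([10, 11, 12, 2])
print(generate_intermediate_summary(H=None, decoder_step=lambda o, H: next(script), max_len=8))
```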
3.2 Fusion of sound features: the sound features are injected into the multi-head attention of the decoder that generates the intermediate abstract. The specific method is as follows: first, the obtained sound features a = {a_1, a_2, …, a_k} are Min-Max normalized:
a_i' = (a_i − min(a)) / (max(a) − min(a))
then, let the Value matrix in the decoder be V ═ V (V)1,v2,…,vk)TThe fusion of the sound features is performed by:
V′=(a1*v1,a2*v2,…,ak*vk)T
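The fusion step itself is an element-wise rescaling of the Value rows, as the following sketch shows (array shapes are illustrative assumptions):

```python
import numpy as np

def fuse_sound_features(V: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Scale each row v_i of the decoder Value matrix by the Min-Max
    normalized sound intensity a_i:  V' = (a_1*v_1, ..., a_k*v_k)^T."""
    a_norm = (a - a.min()) / (a.max() - a.min() + 1e-8)   # Min-Max normalization
    return a_norm[:, None] * V                            # broadcast over the feature dimension

# Toy example: k = 4 source positions, value dimension 6.
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 6))
a = np.array([0.2, 0.9, 0.5, 0.7])      # per-character sound intensities
print(fuse_sound_features(V, a).shape)  # (4, 6)
```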
3.3 Generation of the corrected abstract: since the embedded encoding of the text sequence is produced by the XLNet-based encoder, re-encoding the intermediate abstract with XLNet allows the model to understand the meaning of the original embedded vectors more deeply and to better capture the contextual semantic information.
Deleting one word at a time from O yields L different copies {O'_1, O'_2, …, O'_L}, with O'_i = O − o_i (1 ≤ i ≤ L), so that each copy is missing exactly one word. Next, for each intermediate-abstract copy, its embedded vector H'_i (1 ≤ i ≤ L) is obtained with XLNet:
H'_i = XLNet(O'_i)
The missing word y_i (1 ≤ i ≤ L) of each copy is then predicted by the Transformer-XL based decoder in combination with the source document information:
y_i = TransformerXLDecoder(H'_i, H)
Supplementing the missing words of all the copies and combining the results yields the final corrected abstract:
Y = {y_1, y_2, …, y_L}
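A sketch of the correction loop: one copy per deleted position, each refilled by a predictor. The `predict_missing` callable stands in for the XLNet re-encoding plus Transformer-XL decoding and is stubbed for illustration:

```python
from typing import Callable, List, Sequence

def refine_summary(
    O: Sequence[str],                                   # intermediate abstract o_1..o_L
    predict_missing: Callable[[List[str], int], str],   # hypothetical XLNet + decoder filler
) -> List[str]:
    """For each position i, drop o_i to get copy O'_i, re-encode it, and let the
    decoder predict the missing word y_i from (H'_i, H); collect Y = {y_1..y_L}."""
    Y: List[str] = []
    for i in range(len(O)):
        copy_i = list(O[:i]) + list(O[i + 1:])          # O'_i = O - o_i
        Y.append(predict_missing(copy_i, i))            # y_i = Decoder(H'_i, H)
    return Y

# Stub filler: pretend the model simply restores the dropped word from a reference.
reference = ["speech", "summaries", "need", "acoustic", "cues"]
filler = lambda copy, i: reference[i]
print(refine_summary(reference, filler))
```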
fourth, loss function module
4.1 Intermediate abstract loss function: during formation of the intermediate abstract, let the generated intermediate abstract be a = {a_1, a_2, …, a_L}. The corresponding loss function adopts maximum likelihood estimation and a ROUGE-L based term, computed as follows:
maximum likelihood estimation:
L_ML = −Σ_{t=1}^{L} log P(a_t = a*_t | a_{<t}, H)
where a_{<t} = {a_1, …, a_{t−1}}, H = XLNet(x'_1, x'_2, …, x'_m), and a*_t is the t-th word of the real abstract.
ROUGE-L rule:
L_RL = −R(a^s) · Σ_{t=1}^{L} log P(a^s_t | a^s_{<t}, H)
where a^s is an intermediate abstract sampled from the predicted distribution and R(a^s) is the score obtained by comparing a^s with the real label. The two are combined to give the final loss function of the intermediate-abstract stage, where β is a hyper-parameter controlling the proportion of the two loss terms:
L_inter = β·L_ML + (1 − β)·L_RL
4.2 Evaluation of the corrected abstract: let the corrected abstract be y = {y_1, …, y_L}; it is evaluated using maximum likelihood estimation and mutual information.
Maximum likelihood estimation:
L'_ML = −Σ_{t=1}^{L} log P(y_t = y*_t | a_{≠t}, H)
where a_{≠t} = {a_1, …, a_{t−1}, a_{t+1}, …, a_L}, H = XLNet(x'_1, x'_2, …, x'_m), and y*_t is the t-th word of the real abstract.
Mutual information evaluation:
L_MI = I(y; y*)
The mutual information measures how much information of the real label y* is contained in the final abstract y, and is used to measure the degree of information coverage of the abstract.
Adding the two gives the evaluation function of the corrected abstract:
L_corr = L'_ML + L_MI
combining the loss functions, the loss function of the decoder model is obtained as follows:
L = L_inter + L_corr
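A numerical sketch of how the loss terms might be combined. Because the equation images are not reproduced here, the exact form of the ROUGE-L term and the handling of the mutual-information term are assumptions; only the β mixture and the additive combination follow the text directly.

```python
import numpy as np

def mle_loss(token_logprobs: np.ndarray) -> float:
    """L_ML = -sum_t log P(a_t = a*_t | a_<t, H) over the reference tokens."""
    return float(-token_logprobs.sum())

def rouge_l_reward_loss(sample_logprobs: np.ndarray, reward: float) -> float:
    """REINFORCE-style surrogate: scale the sampled abstract's log-likelihood
    by its ROUGE-L reward R(a^s). (Exact form of the patent's L_RL is assumed.)"""
    return float(-reward * sample_logprobs.sum())

def decoder_loss(ref_lp, sample_lp, reward, corr_lp, mutual_info, beta=0.7):
    """L = L_inter + L_corr, with L_inter = beta*L_ML + (1-beta)*L_RL."""
    l_inter = beta * mle_loss(ref_lp) + (1.0 - beta) * rouge_l_reward_loss(sample_lp, reward)
    # "adding the two" per the text; how the MI term enters optimization is not specified
    l_corr = mle_loss(corr_lp) + mutual_info
    return l_inter + l_corr

# Toy numbers: 4-token abstracts, ROUGE-L reward 0.42, MI estimate 1.3 nats.
lp = np.log(np.array([0.6, 0.5, 0.7, 0.4]))
print(decoder_loss(lp, lp, reward=0.42, corr_lp=lp, mutual_info=1.3, beta=0.7))
```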
the overall schematic diagram of this embodiment is shown in fig. 1.
While the foregoing specification shows and describes several preferred embodiments of the invention, it is to be understood, as noted above, that the invention is not limited to the forms disclosed herein; it is not to be construed as excluding other embodiments, and it can be used in various other combinations, modifications and environments and can be changed within the scope of the inventive concept described herein, in accordance with the above teachings or the skill or knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Innovation point
The first innovation: deep fusion of speech and text features
In traditional abstract extraction, only the text information is considered. The method fully utilizes the emphasis features contained in the speech: through an attention-based deep learning model, the speech features are deeply fused with the text features obtained by speech conversion, and a text abstract is generated from the voice information.
The second innovation: abstract generation model structure design based on XLNet
The deep learning model adopted by the method is based on XLNet and is improved in the decoder structure and the loss function, which helps increase the amount of information contained in the generated abstract and makes full use of the text features obtained by the encoder. Compared with the prior art, the invention achieves a better abstract effect.

Claims (2)

1. A method for automatically generating an abstract based on voice text deep fusion is characterized by comprising the following specific steps:
step 1, preprocessing and voice corresponding module
Step 1.1 text acquisition
For the k speech segments {v_1, v_2, …, v_k}, a speech recognition tool is first used to obtain the corresponding k text sequences {x_1, x_2, …, x_k}; text spelling correction is then performed based on an n-gram model and a similar-pronunciation word table: each text sequence x_i (1 ≤ i ≤ k) is segmented with a word segmentation tool to obtain a word sequence W_i = {w_1, w_2, …, w_n}, each word in the sequence is looked up in the similar-pronunciation table to obtain a set of similar-sounding candidates, the set is traversed for replacement, and the candidate text sequence t'_i with the lowest perplexity is selected as the final result; the perplexity is calculated as follows:
PP(W) = P(w_1, w_2, …, w_n)^(-1/n)
where n is the number of word tokens and P(w_1, w_2, …, w_n) is the probability of the text sequence, computed by the chain rule of probability:
P(w_1, w_2, …, w_n) = p(w_1)·p(w_2|w_1)…p(w_n|w_1, w_2, …, w_{n-1})
error correction yields the final k text sequences {x'_1, x'_2, …, x'_k};
Step 1.2 Speech correspondence
The acquired text is aligned with the sound intensity features: exploiting the one-character-one-syllable property of Chinese, the number of characters {n_1, n_2, …, n_k} in each text sequence of {x'_1, x'_2, …, x'_k} is obtained, where k is the index of the text sequence; each speech segment v_i is processed to select its n_i largest sound intensities (i.e., amplitudes), which are arranged in time order into a sound intensity vector a_i of length n_i; each element of the sound intensity vector corresponds to the sound intensity of one character in the text sequence, completing the correspondence between the speech features and the text;
after the preprocessing module of step 1, the text data {x'_1, x'_2, …, x'_k} required by the abstract extraction model and the corresponding sound features {a_1, a_2, …, a_k} are obtained;
Step 2, coder module
After the preprocessing module is completed, word segmentation is performed first, and the text data is encoded with a language model pre-trained by XLNet to obtain the corresponding text vectors;
for the text sequence x 'obtained after pre-processing'iThe word vector { x is obtained by embedding after the Sentence piece divides the wordi,1,xi,2,...,xi,mM is the word number of the text sequence, and the word number is sent to 6 decoders connected in series to obtain a hidden state { h }i,1,hi,2,...,hi,mThe calculation process is as follows:
Q_{i,j} = K_{i,j} = V_{i,j} = W · x_{i,j}
e_{i,j,k} = (Q_{i,j} · K_{i,k}) / sqrt(d_k)
S_{i,j,k} = exp(e_{i,j,k}) / Σ_k exp(e_{i,j,k})
Z_{i,j} = Σ_k S_{i,j,k} · V_{i,k}
h_{i,j} = Norm(fc(Norm(Z_{i,j})))
where W is a weight matrix parameter to be trained; Q, K, V are the query, key and value matrices in the Transformer structure; S_{i,j,k} is the attention of the k-th word to the j-th word in sentence x'_i; Norm denotes the normalization operation; and fc denotes the fully connected operation in the neural network;
to better handle long texts, relative position encoding is added and the ideas of Transformer-XL are incorporated, and the computation of S_{i,j,k} is modified as follows:
S_{i,j,k} ∝ x_{i,j}^T W_q^T W_k x_{i,k} + x_{i,j}^T W_q^T W_k r_{j-k} + p^T W_k x_{i,k} + p^T W_k r_{j-k}
where the absolute position of word j is represented by a fixed vector p, r_{j-k} denotes the relative position embedding between word j and word k, and W_q and W_k are the weight matrices to be learned for computing the Q and K matrices;
following the idea of Transformer-XL, the output of the previous sentence is stored in a cache; during training the next sentence is concatenated with this cached output and the two enter the attention mechanism together, but the cached content is excluded from the backward gradient computation;
since multi-head attention is introduced, 8 weight matrices W are learned, yielding {Z_{i,j,1}, Z_{i,j,2}, …, Z_{i,j,8}}; before being input to the fully connected layer these are concatenated and multiplied by W_0, where W_0 is another weight matrix parameter to be trained;
step 3, a decoder feature fusion module and a loss function module
The sound features are incorporated into the attention computation of the decoder to obtain an intermediate abstract, and a loss function is designed for learning; the intermediate abstract is then re-encoded with XLNet to further learn its semantics and decoded again, an evaluation function is designed to train the model, and the final result is the required abstract;
the specific operation of the decoder is as follows:
1) Intermediate abstract generation: the original embedded vector H of the text is obtained from the encoder and decoded step by step by the Transformer-XL based decoder; the decoding process is:
o_i = TransformerXLDecoder(o_{<i}, H), 1 ≤ i ≤ L
H = XLNet(x'_1, x'_2, …, x'_m)
generating, from left to right, an intermediate abstract O = {o_1, o_2, …, o_L} of length L, where {x'_i | 1 ≤ i ≤ m} denotes the text sequence data obtained after preprocessing;
2) Fusion of sound features: the sound features are injected into the multi-head attention of the decoder that generates the intermediate abstract; the specific method is as follows: first, the obtained sound features a = {a_1, a_2, …, a_k} are Min-Max normalized:
a_i' = (a_i − min(a)) / (max(a) − min(a))
then, letting the Value matrix in the decoder be V = (v_1, v_2, …, v_k)^T, the sound features are fused by:
V' = (a_1·v_1, a_2·v_2, …, a_k·v_k)^T
3) Generation of the corrected abstract: since the embedded encoding of the text sequence is produced by the XLNet-based encoder, re-encoding the intermediate abstract with XLNet allows the model to understand the meaning of the original embedded vectors more deeply and to better capture the contextual semantic information;
each word in O is deleted in turn to obtain L different copies {O'_1, O'_2, …, O'_L}, with O'_i = O − o_i (1 ≤ i ≤ L), so that each copy is missing exactly one word; next, for each intermediate-abstract copy, its embedded vector H'_i (1 ≤ i ≤ L) is obtained with XLNet:
H'_i = XLNet(O'_i)
the missing word y_i (1 ≤ i ≤ L) of each copy is then predicted by the Transformer-XL based decoder in combination with the source document information:
y_i = TransformerXLDecoder(H'_i, H)
supplementing the missing words of all the copies and combining the results yields the final corrected abstract:
Y = {y_1, y_2, …, y_L};
the specific design of the loss function in step 3 is as follows:
1) Intermediate abstract loss function: during formation of the intermediate abstract, let the generated intermediate abstract be a = {a_1, a_2, …, a_L}; the corresponding loss function adopts maximum likelihood estimation and a ROUGE-L based term, computed as follows:
maximum likelihood estimation:
L_ML = −Σ_{t=1}^{L} log P(a_t = a*_t | a_{<t}, H)
where a_{<t} = {a_1, …, a_{t−1}}, H = XLNet(x'_1, x'_2, …, x'_m), and a*_t is the t-th word of the real abstract;
ROUGE-L rule:
L_RL = −R(a^s) · Σ_{t=1}^{L} log P(a^s_t | a^s_{<t}, H)
where a^s is an intermediate abstract sampled from the predicted distribution and R(a^s) is the score obtained by comparing a^s with the real label; the two are combined to give the final loss function of the intermediate-abstract stage, where β is a hyper-parameter controlling the proportion of the two loss terms:
L_inter = β·L_ML + (1 − β)·L_RL
2) Evaluation of the corrected abstract: let the corrected abstract be y = {y_1, …, y_L}; it is evaluated using maximum likelihood estimation and mutual information;
maximum likelihood estimation:
L'_ML = −Σ_{t=1}^{L} log P(y_t = y*_t | a_{≠t}, H)
where a_{≠t} = {a_1, …, a_{t−1}, a_{t+1}, …, a_L}, H = XLNet(x'_1, x'_2, …, x'_m), and y*_t is the t-th word of the real abstract;
mutual information evaluation:
L_MI = I(y; y*)
the mutual information measures how much information of the real label y* is contained in the final abstract y, and is used to measure the degree of information coverage of the abstract;
adding the two gives the evaluation function of the corrected abstract:
L_corr = L'_ML + L_MI
combining the above loss functions, the loss function of the decoder model is obtained as follows:
L = L_inter + L_corr
2. An automatic abstract generation system based on deep fusion of speech and text, designed according to the method of claim 1, characterized in that the system comprises a preprocessing and voice correspondence module, an encoder module, a decoder feature fusion module and a loss function module; the preprocessing and voice correspondence module comprises text acquisition and voice correspondence; the decoder feature fusion module comprises intermediate abstract generation, sound feature fusion and corrected abstract generation; and the loss function module comprises an intermediate abstract loss function and an evaluation function of the corrected abstract;
the system completes the following functions:
for user audio, the corresponding text is obtained through a speech recognition tool, misspellings are corrected during text acquisition, and the voice features corresponding to the characters are obtained through voice correspondence;
a vector representation of the text is obtained through the pre-trained XLNet encoder, and the text vectors and voice features are passed through the decoder's sound feature fusion and intermediate abstract generation to obtain an intermediate abstract;
the intermediate abstract is encoded by XLNet again to obtain a further understanding of the text, and finally the corrected abstract generation step is learned to produce the final abstract.
CN202011198008.5A 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features Active CN112417134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011198008.5A CN112417134B (en) 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011198008.5A CN112417134B (en) 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features

Publications (2)

Publication Number Publication Date
CN112417134A (en) 2021-02-26
CN112417134B (en) 2022-05-13

Family

ID=74828717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011198008.5A Active CN112417134B (en) 2020-10-30 2020-10-30 Automatic abstract generation system and method based on voice text deep fusion features

Country Status (1)

Country Link
CN (1) CN112417134B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673248B (en) * 2021-08-23 2022-02-01 中国人民解放军32801部队 Named entity identification method for testing and identifying small sample text
CN114118024B (en) * 2021-12-06 2022-06-21 成都信息工程大学 Conditional text generation method and generation system
CN114398889A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Video text summarization method, device and storage medium based on multi-modal model
CN115544244B (en) * 2022-09-06 2023-11-17 内蒙古工业大学 Multi-mode generation type abstract acquisition method based on cross fusion and reconstruction
CN115827854B (en) * 2022-12-28 2023-08-11 数据堂(北京)科技股份有限公司 Speech abstract generation model training method, speech abstract generation method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646094A (en) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 System and method for automatic extraction and generation of audiovisual product content abstract
CN108305632A (en) * 2018-02-02 2018-07-20 深圳市鹰硕技术有限公司 A kind of the voice abstract forming method and system of meeting
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111739536A (en) * 2020-05-09 2020-10-02 北京捷通华声科技股份有限公司 Audio processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101547167A (en) * 2008-03-25 2009-09-30 华为技术有限公司 Content classification method, device and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646094A (en) * 2013-12-18 2014-03-19 上海紫竹数字创意港有限公司 System and method for automatic extraction and generation of audiovisual product content abstract
CN108305632A (en) * 2018-02-02 2018-07-20 深圳市鹰硕技术有限公司 A kind of the voice abstract forming method and system of meeting
CN111061861A (en) * 2019-12-12 2020-04-24 西安艾尔洛曼数字科技有限公司 XLNET-based automatic text abstract generation method
CN111739536A (en) * 2020-05-09 2020-10-02 北京捷通华声科技股份有限公司 Audio processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"新闻广播语音自动摘要技术研究";王天课;《万方数据库》;20140522;第22-38页 *

Also Published As

Publication number Publication date
CN112417134A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112417134B (en) Automatic abstract generation system and method based on voice text deep fusion features
CN108519890B (en) Robust code abstract generation method based on self-attention mechanism
WO2019085779A1 (en) Machine processing and text correction method and device, computing equipment and storage media
CN110413729B (en) Multi-turn dialogue generation method based on clause-context dual attention model
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
CN112765345A (en) Text abstract automatic generation method and system fusing pre-training model
CN113158665A (en) Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN112417092A (en) Intelligent text automatic generation system based on deep learning and implementation method thereof
Zhu et al. Robust spoken language understanding with unsupervised asr-error adaptation
CN114444481B (en) Sentiment analysis and generation method of news comment
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN114942990A (en) Few-sample abstract dialogue abstract generation system based on prompt learning
Chharia et al. Deep recurrent architecture based scene description generator for visually impaired
CN110197521B (en) Visual text embedding method based on semantic structure representation
CN116909435A (en) Data processing method and device, electronic equipment and storage medium
CN115858736A (en) Emotion text generation method based on emotion prompt fine adjustment
CN113239166B (en) Automatic man-machine interaction method based on semantic knowledge enhancement
CN115223549A (en) Vietnamese speech recognition corpus construction method
CN115310461A (en) Low-resource speech translation method and system based on multi-modal data optimization
CN114756679A (en) Chinese medical text entity relation combined extraction method based on conversation attention mechanism
CN111709245A (en) Chinese-Yuan pseudo parallel sentence pair extraction method based on semantic self-adaptive coding
CN112380836A (en) Intelligent Chinese message question generating method
CN117195915B (en) Information extraction method and device for session content, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant