CN111243578A - Chinese mandarin character-voice conversion method based on self-attention mechanism - Google Patents
- Publication number: CN111243578A
- Application number: CN202010027248.2A
- Authority
- CN
- China
- Prior art keywords: attention, layer, classification, output, binding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/26: Speech to text systems
Abstract
An embodiment of the invention provides an end-to-end speech recognition algorithm based on time-restricted self-attention connectionist temporal classification (CTC), which fuses a position-dependent attention mechanism with the CTC criterion. The attention window length is selected according to the influence of different window lengths on the recognition result, and a self-attention CTC criterion is further proposed.
Description
Technical Field
The invention relates to the field of speech recognition, and in particular to an end-to-end speech recognition method based on time-restricted self-attention CTC.
Background
Speech recognition technology converts input speech into text. Among its approaches, the end-to-end speech recognition framework has become an important research direction owing to its simple structure, strong generality, independence from linguistic knowledge, and fast inference.
Although traditional speech recognition algorithms based on hidden Markov models and deep neural networks have achieved high recognition accuracy, they suffer from a complex pipeline, inconsistent optimization targets, frame-level conditional independence assumptions, complex decoding, and a reliance on expert knowledge. End-to-end speech recognition has therefore become a research hotspot; it completes the conversion from speech to text with a single unified neural network. The current mainstream end-to-end recognition frameworks are: end-to-end speech recognition based on CTC, and end-to-end speech recognition based on attention-based encoder-decoder networks.
The end-to-end architecture of attention-based encoder-decoder networks treats speech recognition as a sequence-mapping problem, i.e., mapping input features into corresponding words. The decoding network uses an attention mechanism to find the correspondence between each output word and the encoder states. For each output word, the distribution of attention weights is calculated from the decoder state and the encoder state information, and the weighted sum of the encoder states serves as input to the decoder. Although this structure has the advantages of end-to-end speech recognition and makes no conditional independence assumption, the attention coefficients are insufficiently constrained, and discontinuous attention weights are learned in actual training. To better constrain the attention weights, researchers therefore add a CTC criterion to training for joint optimization, which greatly reduces the occurrence of irregular attention coefficients.
However, the end-to-end modeling framework based on the CTC criterion assumes that frames are mutually independent, whereas actual speech is a continuous time sequence that does not satisfy this assumption.
Disclosure of Invention
The invention provides an end-to-end speech recognition algorithm based on time-restricted self-attention CTC, which fuses a position-dependent attention mechanism with the CTC criterion, where the attention window length is selected according to the influence of different window lengths on the recognition result; a self-attention CTC criterion is further proposed.
The technical solution adopted by the present invention to solve the above technical problems is an end-to-end speech recognition method implemented by a neural network model comprising an encoding layer, a decoding layer, and an attention CTC layer, the method comprising:
inputting speech features into the encoding layer of the neural network model, wherein the encoding layer converts the speech features into a high-dimensional vector;
the decoding layer calculating an attention distribution probability over the high-dimensional vector and converting the high-dimensional vector into a first output symbol sequence representing characters;
the attention CTC layer converting the high-dimensional vector into a second output symbol sequence representing characters, using a connectionist temporal classifier together with an attention mechanism;
and combining the first output symbol sequence and the second output symbol sequence to obtain the character-representing output symbol sequence of the neural network model.
Preferably, the mathematical expression of the classification criterion of the neural network model is:

L_MTL = λ L_ctc + (1 − λ) L_attention

where λ is an interpolation coefficient, and L_ctc and L_attention are the classification criteria of the attention CTC layer and the decoding layer, respectively.
Specifically, the mathematical expression of the classification criterion of the attention CTC layer is:

L_CTC = −ln P(y | ph_u)
ph_u = W_proj c_u + b
a_ut = Attend(ph_{u−1}, a_{u−1}, h_t)

where W_proj and b are the weight and bias matrices of the output mapping layer of the CTC criterion, ph_u is the output of that mapping layer at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states, and τ is the attention window length. Attend() is the attention function, and the attention weight a_ut is computed from the score:

e_ut = Score(s_{u−1}, a_{u−1}, h_t)

where Score() is content-based or location-based attention; for location-based attention the score may be:

e_ut = v^T tanh(K s_{u−1} + Q(F ∗ a_{u−1}) + W h_t)
specifically, the mathematical expression of the classification criterion of the attention-binding meaning classification layer is as follows:
LCTC=-lnP(y|phu),
phu=Wprojcu+b
wherein, WprojAnd b represents the weight and bias matrix of the output mapping layer of the associative classification criterion, phuRepresenting u time binding meaning classification criterion inputOutput from the mapping layer, autRepresents the attention weight, cuRepresents the result of the weighted summation of the hidden layers, τ represents the window length of attention,
wherein the content of the first and second substances,
qt=Qbt,t=u
kt=Kbt,t=u-τ,...,u+τ
vt=Vbt,t=u-τ,...,u+τ
bt=Wembdht,t=u-τ,...,u+τ
bt is an input vector for mapping the input ht of the coding network into an attention mechanism through an input mapping matrix Wembd, k, v and q are keys, values and queries, and K, V, Q is a parameter matrix.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a structural diagram of an end-to-end speech recognition neural network model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, the structure of the end-to-end speech recognition neural network:
Fig. 1 is a structural diagram of an end-to-end speech recognition neural network model according to an embodiment of the present invention. As shown, it comprises an encoding layer (Shared Encoder), a decoding layer (Decoder), and an attention CTC layer (CTC Attention).
The encoding layer maps the input features into high-dimensional vectors: it receives the input speech features and converts them into high-dimensional vectors.
The decoding layer decodes the high-dimensional vectors into an output symbol sequence: it converts the high-dimensional vector into a first output symbol sequence representing characters, calculating the attention distribution probability over the speech features during the conversion.
The attention CTC layer converts the high-dimensional vector into a second output symbol sequence representing characters, using a connectionist temporal classifier together with an attention mechanism.
The first output symbol sequence and the second output symbol sequence are combined to obtain the character-representing output symbol sequence of the neural network model.
Second, detailed discussion and embodiments of the model:
The method aims to solve the problem that, in end-to-end speech recognition based on attention encoder-decoder networks, insufficiently constrained attention coefficients lead to discontinuous attention weights being learned during actual training. The invention therefore adopts a multi-task learning mechanism, i.e., joint optimization of the CTC criterion and the criterion of the encoder-decoder network.
Specifically, during training, the forward-backward algorithm of the CTC criterion is used to force a monotonic alignment between the input speech features and the output labels.
In one embodiment, the mathematical expression of the joint optimization criterion is:

L_MTL = λ L_ctc + (1 − λ) L_attention    (1)

where λ is an interpolation coefficient, and L_ctc and L_attention are respectively the CTC criterion and the attention-based encoder-decoder criterion (e.g., the classification criteria used by the CTC Attention layer and the Decoder layer in Fig. 1, respectively).
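A minimal sketch of the interpolation in equation (1); the default value λ = 0.3 is only an example, not taken from the source:

```python
def multitask_loss(l_ctc: float, l_attention: float, lam: float = 0.3) -> float:
    """L_MTL = lam * L_ctc + (1 - lam) * L_attention, equation (1)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation coefficient must lie in [0, 1]")
    return lam * l_ctc + (1.0 - lam) * l_attention

# lam = 0 recovers the pure attention criterion, lam = 1 the pure CTC criterion
assert multitask_loss(2.0, 1.0, lam=0.0) == 1.0
assert multitask_loss(2.0, 1.0, lam=1.0) == 2.0
assert multitask_loss(2.0, 1.0, lam=0.5) == 1.5
```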
For the CTC criterion, in order to handle the fact that the output sequence is shorter than the input sequence, a blank symbol is added to the output symbol set, and symbols (including the blank) are allowed to occur repeatedly in a frame-level path.
In another embodiment, the conditional probability with which the CTC criterion predicts the entire output sequence is:

P(y|x) = Σ_{π ∈ B^{−1}(y)} P(π_{1:T} | x)    (2)

By assuming mutual independence between frames, the above formula can be decomposed into:

P(π_{1:T} | x) = Π_{t=1}^{T} P(π_t | x)    (3)

where x represents the input speech features and y the output sequence; L is the output symbol set and T the total number of speech frames; π_{1:T} = (π_1, ..., π_T) is the frame-level output path with π_t ∈ L′, where L′ = L ∪ {blank}; P(π_t | x) is the conditional probability at time t; and B is the mapping function that maps an output path to an output symbol sequence.
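For a toy symbol set, the path sum of equations (2) and (3) and the mapping B can be checked by brute-force enumeration of all frame-level paths (feasible only at this scale; the 3-frame posteriors below are made up for illustration):

```python
import itertools
import numpy as np

BLANK = 0

def B(path):
    """Mapping function B: collapse repeated symbols, then remove blanks."""
    collapsed = [s for s, _ in itertools.groupby(path)]
    return tuple(s for s in collapsed if s != BLANK)

def ctc_prob(probs, y):
    """Equation (2): P(y|x) as the sum over all paths pi with B(pi) = y of the
    per-frame product in equation (3). Brute force over |L'|^T paths."""
    T, S = probs.shape
    total = 0.0
    for path in itertools.product(range(S), repeat=T):
        if B(path) == tuple(y):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t, s]
            total += p
    return total

# made-up 3-frame posteriors over L' = {blank, 'a'=1, 'b'=2}
probs = np.array([[0.2, 0.7, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.1, 0.7]])
assert B((1, 1, 0, 2)) == (1, 2)   # repeats collapsed, blank removed
p_ab = ctc_prob(probs, [1, 2])     # probability of output "ab"
```

Five length-3 paths map to "ab" here ((1,2,2), (1,2,0), (1,1,2), (1,0,2), (0,1,2)); their probabilities sum to 0.588. In practice this sum is computed by the CTC forward-backward algorithm rather than enumeration.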
For the attention-based encoder-decoder network, the final posterior probability is estimated directly, without any conditional independence assumption, using two networks: an encoding network (e.g., the Encoder layer in Fig. 1), whose role is to map the input features x into hidden-layer vectors h (the high-dimensional vectors), and a decoding network (e.g., the Decoder layer in Fig. 1), whose role is to decode the hidden-layer vectors h into the output symbol sequence y.
In one embodiment, the posterior probability can be expressed as:

P(y|x) = Π_{u=1}^{U} P(y_u | y_{1:u−1}, c_u)    (4)

where c_u is a context vector computed from the input features x, and U is the length of the output sequence, which need not equal the number of input frames. P(y_u | y_{1:u−1}, c_u) can be expressed as:

P(y_u | y_{1:u−1}, c_u) = Decoder(y_{u−1}, s_{u−1}, c_u)    (5)
c_u = Σ_{t=1}^{T} a_ut h_t    (6)
h_t = Encoder(x)    (7)
a_ut = Attend(s_{u−1}, a_{u−1}, h_t)    (8)

where Encoder() and Decoder() denote the encoding network and the decoding network respectively, s is the hidden state vector of the decoding network, h is the hidden state vector of the encoding network, and Attend() is the attention network. The attention weight a_ut is calculated as:

a_ut = exp(e_ut) / Σ_{t′} exp(e_ut′)    (9)
e_ut = Score(s_{u−1}, a_{u−1}, h_t)    (10)

where Score() may be either content-based or location-based attention. In another embodiment (content-based):

e_ut = v^T tanh(K s_{u−1} + W h_t)    (11)

In a still further embodiment (location-based):

e_ut = v^T tanh(K s_{u−1} + Q(F ∗ a_{u−1}) + W h_t)    (12)
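The location-based score of equation (12) can be sketched as follows; all dimensions, the convolution width, and the small random parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, A, C = 20, 32, 16, 8   # frames, encoder dim, attention dim, conv channels (assumed)

v = rng.normal(0, 0.1, A)
K = rng.normal(0, 0.1, (A, H))    # projects the decoder state s_{u-1}
W = rng.normal(0, 0.1, (A, H))    # projects the encoder state h_t
Q = rng.normal(0, 0.1, (A, C))    # projects the convolved previous weights
F = rng.normal(0, 0.1, (C, 5))    # 1-D conv filters over a_{u-1}, width 5 (assumed)

def location_attention(s_prev, a_prev, h):
    """e_ut = v^T tanh(K s_{u-1} + Q (F * a_{u-1}) + W h_t); a_ut = softmax over t."""
    # F * a_{u-1}: same-length 1-D convolution of the previous attention weights
    f = np.stack([np.convolve(a_prev, F[c], mode="same") for c in range(C)])  # (C, T)
    e = np.array([v @ np.tanh(K @ s_prev + Q @ f[:, t] + W @ h[t]) for t in range(T)])
    e -= e.max()                   # numerical stability before the softmax
    a = np.exp(e)
    return a / a.sum()

h = rng.normal(size=(T, H))
a = location_attention(rng.normal(size=H), np.full(T, 1 / T), h)
assert a.shape == (T,) and np.isclose(a.sum(), 1.0)
```

The convolution term lets the score at frame t see where attention was placed at the previous output step, which is what makes the mechanism location-aware.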
according to the above description, the learning of attention weight can be effectively restricted by adding joint-meaning classification criterion for joint optimization, so that the learned attention weight keeps monotonous characteristic, however, for the joint-meaning classification criterion, the joint probability is decomposed into products of a series of probabilities by the assumption that the frame condition is independent, and the actual speech does not satisfy the assumption that the frames are independent.
To solve this problem, the present invention proposes an end-to-end speech recognition algorithm based on time-limited self-attention-binding classification, as shown in fig. 1, which incorporates a time-limited attention module before binding-meaning classification criteria, so that the output is not only dependent on the encoded network output at the current moment, but also is related to the encoded network output over a period of time.
In one embodiment, L_CTC can be expressed as:

L_CTC = −ln P(y | ph)    (13)
ph = W_proj h + b    (14)

where W_proj and b are respectively the weight and bias matrices of the output mapping layer of the CTC criterion, and ph is the input to the CTC criterion.

In another embodiment, attention weights are added to the CTC criterion, and the mathematical expression becomes:

L_CTC = −ln P(y | ph_u)    (15)
ph_u = W_proj c_u + b    (16)
c_u = Σ_{t=u−τ}^{u+τ} a_ut h_t    (17)
a_ut = Attend(ph_{u−1}, a_{u−1}, h_t)    (18)

where ph_u is the output of the output mapping layer of the CTC criterion at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states within the window (computed inside the classification layer, i.e., a network layer with a classification function, such as the CTC Attention layer or the decoding layer in Fig. 1), and τ is the window length of the attention.
In one embodiment, the attention weight is a location-based attention weight, with the mathematical expressions shown in equations (9), (10), and (12). However, this attention mechanism must learn dependency relationships between sequences, which increases the modeling difficulty to some extent. To alleviate this problem, in another embodiment a CTC criterion based on a self-attention mechanism is presented.
First, the output of the encoding network is mapped into the input vectors of the attention mechanism through an input mapping matrix:

b_t = W_embd h_t,  t = u − τ, ..., u + τ    (19)

Secondly, b_t in equation (19) is mapped to keys, values, and queries through a linear mapping layer:

q_t = Q b_t,  t = u    (20)
k_t = K b_t,  t = u − τ, ..., u + τ    (21)
v_t = V b_t,  t = u − τ, ..., u + τ    (22)

Finally, the attention coefficients obtained by self-attention and the resulting context can be expressed as:

a_ut = exp(q_u^T k_t) / Σ_{t′} exp(q_u^T k_t′)    (23)
c_u = Σ_{t=u−τ}^{u+τ} a_ut v_t    (24)
as can be seen from the above embodiments, the embodiments of the present invention provide an end-to-end speech recognition algorithm for time-limited self-attention binding meaning classification, which fuses a position-dependent attention mechanism classification and a binding meaning classification, wherein the attention window length is taken according to the influence of different attention window lengths on a recognition result, and further provides a self-attention binding meaning classification criterion, and by combining the self-attention mechanism and the binding meaning classification criterion, the problem that the assumption that frames are independent from each other due to the binding meaning classification is not true is solved, and the performance of the end-to-end speech recognition system can be improved.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (4)
1. An end-to-end speech recognition method, the end-to-end speech recognition being performed by a neural network model comprising an encoding layer, a decoding layer, and an attention CTC (connectionist temporal classification) layer, the method comprising:
inputting speech features into the encoding layer of the neural network model, wherein the encoding layer converts the speech features into a high-dimensional vector;
the decoding layer calculating an attention distribution probability over the high-dimensional vector and converting the high-dimensional vector into a first output symbol sequence representing characters;
the attention CTC layer converting the high-dimensional vector into a second output symbol sequence representing characters, using a connectionist temporal classifier together with an attention mechanism;
and combining the first output symbol sequence and the second output symbol sequence to obtain the character-representing output symbol sequence of the neural network model.
2. The method of claim 1, wherein the mathematical expression of the classification criterion of the neural network model is:

L_MTL = λ L_ctc + (1 − λ) L_attention

where λ is an interpolation coefficient, and L_ctc and L_attention are the classification criteria of the attention CTC layer and the decoding layer, respectively.
3. The method of claim 2, wherein the mathematical expression of the classification criterion of the attention CTC layer is:

ph_u = W_proj c_u + b
a_ut = Attend(ph_{u−1}, a_{u−1}, h_t)

where W_proj and b are the weight and bias matrices of the output mapping layer of the CTC criterion, ph_u is the output of that mapping layer at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states, and τ is the attention window length; Attend() is the attention function, and the attention weight a_ut is computed from the score:

e_ut = Score(s_{u−1}, a_{u−1}, h_t)

where Score() is content-based or location-based attention; for location-based attention the score may be:

e_ut = v^T tanh(K s_{u−1} + Q(F ∗ a_{u−1}) + W h_t).
4. The method of claim 2, wherein the mathematical expression of the classification criterion of the attention CTC layer is:

ph_u = W_proj c_u + b

where W_proj and b are the weight and bias matrices of the output mapping layer of the CTC criterion, ph_u is the output of that mapping layer at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states, and τ is the attention window length, with:

q_t = Q b_t,  t = u
k_t = K b_t,  t = u − τ, ..., u + τ
v_t = V b_t,  t = u − τ, ..., u + τ
b_t = W_embd h_t,  t = u − τ, ..., u + τ

where b_t is the input vector obtained by mapping the encoder output h_t into the attention mechanism through the input mapping matrix W_embd; k, v, and q are the keys, values, and queries; and K, V, Q are parameter matrices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010027248.2A CN111243578A (en) | 2020-01-10 | 2020-01-10 | Chinese mandarin character-voice conversion method based on self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111243578A true CN111243578A (en) | 2020-06-05 |
Family
ID=70864134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010027248.2A Pending CN111243578A (en) | 2020-01-10 | 2020-01-10 | Chinese mandarin character-voice conversion method based on self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111243578A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107293291A (en) * | 2016-03-30 | 2017-10-24 | 中国科学院声学研究所 | A kind of audio recognition method end to end based on autoadapted learning rate |
US20170372200A1 (en) * | 2016-06-23 | 2017-12-28 | Microsoft Technology Licensing, Llc | End-to-end memory networks for contextual language understanding |
US20180330718A1 (en) * | 2017-05-11 | 2018-11-15 | Mitsubishi Electric Research Laboratories, Inc. | System and Method for End-to-End speech recognition |
US20180374486A1 (en) * | 2017-06-23 | 2018-12-27 | Microsoft Technology Licensing, Llc | Speaker recognition |
CN109272990A (en) * | 2018-09-25 | 2019-01-25 | 江南大学 | Audio recognition method based on convolutional neural networks |
US20190189115A1 (en) * | 2017-12-15 | 2019-06-20 | Mitsubishi Electric Research Laboratories, Inc. | Method and Apparatus for Open-Vocabulary End-to-End Speech Recognition |
CN110211574A (en) * | 2019-06-03 | 2019-09-06 | 哈尔滨工业大学 | Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism |
Non-Patent Citations (2)

Title |
---|
Watanabe, S.; Hori, T.; et al.: "ESPnet: End-to-End Speech Processing Toolkit", in Proceedings of Interspeech 2018. |
Wu, Long; et al.: "Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition", Applied Sciences. |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763933A (en) * | 2021-05-06 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Speech recognition method, and training method, device and equipment of speech recognition model |
CN113763933B (en) * | 2021-05-06 | 2024-01-05 | 腾讯科技(深圳)有限公司 | Speech recognition method, training method, device and equipment of speech recognition model |
CN113450761A (en) * | 2021-06-17 | 2021-09-28 | 清华大学深圳国际研究生院 | Parallel speech synthesis method and device based on variational self-encoder |
CN113450761B (en) * | 2021-06-17 | 2023-09-22 | 清华大学深圳国际研究生院 | Parallel voice synthesis method and device based on variation self-encoder |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200605 |