CN113077785B - End-to-end multi-language continuous voice stream voice content identification method and system - Google Patents

End-to-end multi-language continuous voice stream voice content identification method and system

Info

Publication number
CN113077785B
Authority
CN
China
Prior art keywords
language
vector
speech
level
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911300918.7A
Other languages
Chinese (zh)
Other versions
CN113077785A (en)
Inventor
徐及
林格平
刘丹阳
万辛
张鹏远
李娅强
刘发强
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN201911300918.7A
Publication of CN113077785A
Application granted
Publication of CN113077785B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Abstract

The invention belongs to the technical field of network communication, and particularly relates to an end-to-end multi-language continuous voice stream voice content recognition method, comprising the following steps: inputting the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model and extracting a sentence-level language state posterior probability distribution vector; and inputting the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model and outputting the speech recognition result of the corresponding language.

Description

End-to-end multi-language continuous voice stream voice content identification method and system
Technical Field
The invention belongs to the technical field of network communication and voice recognition, and particularly relates to an end-to-end multi-language continuous voice stream voice content recognition method and system.
Background
Currently, the end-to-end recognition framework has been widely applied to automatic speech recognition tasks. Because the end-to-end framework does not rely on a pronunciation dictionary when a speech recognition system is built, it is more flexible for building speech recognition systems for new languages as well as multi-language speech recognition systems. Furthermore, an end-to-end speech recognition model can directly model the mapping relationship between the acoustic feature sequence and the text modeling unit sequence. Compared with a traditional speech recognition system based on separate acoustic modeling and language modeling, the end-to-end framework unifies the acoustic and language modeling processes and effectively reduces the complexity of building a speech recognition system.
In the construction of a multi-language speech recognition system, although the end-to-end framework can reduce the complexity of system construction, it brings a new problem to multi-language speech recognition. The multi-language end-to-end framework models the modeling units of all languages under a unified framework, and because the pronunciation mechanisms and grammar rules of different languages differ greatly, the modeling units of different languages inevitably interfere with each other when modeled jointly, compared with a single-language speech recognition system. Existing voice content recognition methods therefore cannot effectively improve the language discriminability of a multi-language speech recognition system.
Disclosure of Invention
The invention aims to overcome the defects of existing voice recognition methods and provides an end-to-end multi-language continuous voice stream voice content recognition method and system, in particular an end-to-end multi-language voice recognition method based on a multi-attention mechanism.
In order to achieve the above object, the present invention provides an end-to-end method for recognizing the voice content of a multi-language continuous voice stream, the method comprising:
inputting speech frequency spectrum characteristics to be recognized into a pre-constructed deep neural network-based segment-level language classification model, and outputting a sentence-level language state posterior probability distribution vector;
inputting the to-be-recognized speech frequency spectrum characteristic sequence of each language type and the sentence level language state posterior probability distribution vector to a pre-constructed multi-language speech recognition model, and outputting the speech recognition result of the corresponding language type.
As an improvement of the above technical solution, the method further includes: obtaining the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combining the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtaining the corresponding decoding-network prediction sequence, and finally obtaining the multi-language speech recognition result.
As an improvement of the above technical solution, the method further includes a training step for the deep-neural-network-based segment-level language classification model, which specifically comprises:
extracting the frame-level voice frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting the frame-level voice frequency spectrum characteristics into the language classification model of the section level, carrying out long-term statistics on the output vector of the current hidden layer, and calculating the mean vector, the variance vector and the section-level statistical vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) Σ_{j=1}^{T} h_j
the variance vector is:
σ = (1/T) Σ_{j=1}^{T} (h_j − μ)²
the segment-level statistical vector is:
h_segment = Append(μ, σ)    (6)
where h_j is the output vector of the current hidden layer at time j; T is the long-term statistical period; μ is the mean vector of the long-term statistics; σ is the variance vector of the long-term statistics, computed element-wise; and h_segment is the segment-level statistical vector; the segment-level statistical vector is formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of h_j; Append(μ, σ) denotes splicing μ and σ into a higher-dimensional vector;
the segment-level statistical vector h_segment is used as the input of the next hidden layer; according to the segment-level language labels, the trained segment-level language classification model is obtained through error computation and back-propagation of gradients, completing the establishment of the segment-level language classification model.
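The segment-level statistics described above can be illustrated with a minimal PyTorch sketch of a statistics-pooling layer and a segment-level language classifier built around it. The layer sizes, layer names, and the surrounding frame-level/segment-level structure are illustrative assumptions and are not specified by the patent.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Long-term statistics over frame-level hidden outputs h_1..h_T:
    concatenates the mean and variance vectors (cf. the mean/variance/Append formulas)."""
    def forward(self, h):                          # h: (batch, T, hidden_dim)
        mu = h.mean(dim=1)                         # mean vector over the statistical period T
        sigma = h.var(dim=1, unbiased=False)       # element-wise variance vector
        return torch.cat([mu, sigma], dim=-1)      # Append(mu, sigma): 2 x hidden_dim

class SegmentLanguageClassifier(nn.Module):
    """Hypothetical segment-level language classifier: frame-level layers,
    statistics pooling, then a segment-level layer producing language logits."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_languages=4):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.pool = StatsPooling()
        self.segment_layers = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, x):                          # x: (batch, T, feat_dim) spectral features
        h = self.frame_layers(x)                   # frame-level hidden outputs
        return self.segment_layers(self.pool(h))   # segment-level language logits

# Training with segment-level language labels would proceed by ordinary
# cross-entropy loss and back-propagation, as described in the text.
```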
As an improvement of the above technical solution, the multi-language speech recognition model includes an encoding network, a plurality of attention mechanism modules, and a decoding network; a corresponding number of attention mechanism modules is set according to the number of language types to be recognized;
that is, the number of attention mechanism modules is set according to the number of language types contained in the speech frequency spectrum features to be recognized.
As an improvement of the above technical solution, the training step of the attention mechanism module specifically includes:
sequence h of states of speech featuresencInputting the input data to a corresponding attention mechanism module, and outputting a corresponding output state sequence;
according to equation (2), the corresponding output sequence is obtained:
el t,i=wTtanh(Wlhenc+Vlhdec i+Ul(Fl*al t,i-1)+bl) (2)
wherein l represents a language type label of multiple languages; e.g. of the typel t,iThe output state of the attention mechanism module represents the speech spectrum feature to be recognized in the t-th frame; w is aT,Wl,Vl,UlRespectively representing a first transformation matrix, a second transformation matrix, a third transformation matrix and a fourth transformation matrix; blRepresenting a bias vector; tanh () represents a nonlinear activation function; flRepresenting a convolution function;
Figure BDA0002321749550000031
representing the output state of the t frame coding network; h isdec iAn implied layer state representing an ith output modeling unit of the decoding network; a is al t,i-1A weight value corresponding to the attention weight vector of the ith language category in the t frame of the (i-1) th output modeling unit;
obtaining attention weight vectors of corresponding language types according to the corresponding output state sequences;
specifically, according to formula (3), the attention weight vector of the corresponding language category is obtained:
Figure BDA0002321749550000032
wherein, al t,iThe attention weight vector of the ith language category corresponds to the weight value of the t frame of the ith output modeling unit; e.g. of the typel t′,iThe output state of the attention mechanism module corresponding to the ith output modeling unit for the t' th frame to be recognized voice spectrum features; t' is more than or equal to 1 and less than or equal to T and is the corresponding frame of the voice characteristic sequence.
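The attention computation of equations (2)-(3) can be sketched in PyTorch as follows. This is a hedged illustration of a location-aware attention module per language; the class name LanguageAttention, the dimensions, and the convolution width are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttention(nn.Module):
    """Sketch of one language-specific attention module (Eqs. (2)-(3))."""
    def __init__(self, enc_dim, dec_dim, att_dim, conv_channels=10, conv_width=100):
        super().__init__()
        self.W = nn.Linear(enc_dim, att_dim, bias=False)          # W^l
        self.V = nn.Linear(dec_dim, att_dim, bias=False)          # V^l
        self.U = nn.Linear(conv_channels, att_dim, bias=False)    # U^l
        self.conv = nn.Conv1d(1, conv_channels, conv_width,
                              padding=conv_width // 2)            # F^l (convolution function)
        self.b = nn.Parameter(torch.zeros(att_dim))               # b^l
        self.w = nn.Linear(att_dim, 1, bias=False)                # w^T

    def forward(self, h_enc, h_dec_i, a_prev):
        # h_enc: (batch, T, enc_dim); h_dec_i: (batch, dec_dim); a_prev: (batch, T)
        f = self.conv(a_prev.unsqueeze(1))                        # F^l * a^l_{i-1}
        f = f[:, :, :h_enc.size(1)].transpose(1, 2)               # (batch, T, conv_channels)
        e = self.w(torch.tanh(self.W(h_enc)
                              + self.V(h_dec_i).unsqueeze(1)
                              + self.U(f) + self.b)).squeeze(-1)  # Eq. (2): (batch, T)
        return F.softmax(e, dim=-1)                               # Eq. (3): attention weights
```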
As one improvement of the above technical solution, the to-be-recognized speech frequency spectrum feature sequence and the sentence-level language state posterior probability distribution vector of each language category are input to a pre-constructed multi-language speech recognition model, and a speech recognition result of the corresponding language category is output; the method specifically comprises the following steps:
inputting the speech frequency spectrum characteristics to be recognized of each language type into a coding network, and outputting a state sequence of corresponding speech characteristics;
according to the formula (1), obtaining the state sequence h of the corresponding voice featuresenc
henc=Encoder(x) (1)
Wherein the content of the first and second substances,
Figure BDA0002321749550000041
a state sequence of speech features, namely a hidden state output sequence of a coding network; x is (x)1,x2,...,xt,...,xT) The method comprises the steps of inputting a speech frequency spectrum characteristic sequence to be recognized, namely an input characteristic; wherein, T is the total frame number of the input characteristic sequence; encoder () is a calculation function of a coding network based on a convolutional neural network/bidirectional long-and-short-term memory network;
carrying out weighted summation on the corresponding voice characteristic state sequence and the attention weight vector of the corresponding language type to obtain a corresponding attention context content vector;
specifically, according to formula (4), a corresponding attention context content vector is obtained;
Figure BDA0002321749550000042
wherein, cl iRepresenting a corresponding attention context content vector, namely an attention context content vector obtained by weighting and summing the coding network by the ith language class;
under the condition of multi-attention mechanism, distributing vector V by language statelAnd carrying out weighted summation with the corresponding attention context content vector to obtain a final attention context content vector:
Figure BDA0002321749550000043
wherein, VlFor language state distribution vectors, i.e. Vl=(wl 1,wl 2,...,wl n,...,wl N) (ii) a N is the number of the language types of the multiple languages to be identified;
and inputting the final attention context content vector to a decoding network to obtain a speech recognition result of the language category.
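A sketch of equation (1), the per-language context vectors of equation (4), and the language-weighted combination is given below, reusing the LanguageAttention sketch above. The BLSTM configuration of the encoder and the helper name language_weighted_context are illustrative assumptions rather than the patent's specified implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the CNN/BLSTM encoding network of Eq. (1); the actual layer
    configuration is not specified here, so a plain 2-layer BLSTM is assumed."""
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, T, feat_dim)
        h_enc, _ = self.blstm(x)               # hidden-state output sequence, (batch, T, 2*hidden)
        return h_enc

def language_weighted_context(h_enc, attentions, h_dec_i, a_prev_list, lang_posterior):
    """Per-language context vectors (Eq. (4)) combined with the sentence-level
    language posterior V_L (final context vector). `attentions` holds one
    LanguageAttention module per language."""
    contexts, weights = [], []
    for l, att in enumerate(attentions):
        a = att(h_enc, h_dec_i, a_prev_list[l])            # attention weights, Eq. (3)
        c_l = torch.bmm(a.unsqueeze(1), h_enc).squeeze(1)  # Eq. (4): weighted sum over frames
        contexts.append(c_l)
        weights.append(a)
    c_stack = torch.stack(contexts, dim=1)                 # (batch, N, enc_dim)
    c_i = (lang_posterior.unsqueeze(-1) * c_stack).sum(dim=1)  # weight contexts by V_L
    return c_i, weights
```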
The invention also provides an end-to-end multi-language continuous voice stream voice content recognition system, comprising an extraction module and a voice recognition module;
the extraction module is used for inputting the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model and extracting the sentence-level language state posterior probability distribution vector according to the segment-level language classification model;
the speech recognition module inputs the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model and outputs the speech recognition result of the corresponding language.
As an improvement of the above technical solution, the system further includes a voice result acquisition module, configured to obtain the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combine the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtain the corresponding decoding-network prediction sequence, and finally obtain the multi-language speech recognition result.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the computer program.
The invention further provides a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the above-mentioned method.
Compared with the prior art, the invention has the beneficial effects that:
the method is an end-to-end multi-language voice recognition method based on a multi-attention machine system, a specific attention machine module is constructed for each language under an end-to-end framework based on the attention machine system, and the attention machine module carries out language specific modeling on the mapping relation between an input spectrum characteristic sequence and an output annotation sequence of a specific language. In addition, language classification information is introduced into an end-to-end modeling process, and output information of the multi-semantic machine module is weighted, so that language distinctiveness of the multi-language voice recognition system can be effectively improved.
Drawings
Fig. 1 is a flow chart of a method for recognizing speech contents of end-to-end multilingual continuous speech streams according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides an end-to-end method for recognizing speech contents of continuous voice streams in multiple languages, the method comprising:
inputting the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model, and extracting, according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector V_L, thereby obtaining the language classification result of the corresponding language; the language classification result of the corresponding language is the sentence-level language state posterior probability distribution vector V_L; the speech frequency spectrum features to be recognized are the frequency-domain representation obtained by applying a Fourier transform to the multi-language continuous voice stream, where a multi-language continuous voice stream refers to a voice stream that contains only one language, whose language type, however, is unknown in advance.
Specifically, the speech frequency spectrum feature sequence to be recognized is input into the segment-level language classification model and propagated forward through the neural network; according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector V_L is extracted, and the language classification result of the corresponding language is obtained.
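As a minimal sketch of this forward pass, assuming the SegmentLanguageClassifier sketched earlier and that the sentence-level posterior V_L is obtained by a softmax over the language logits (the helper name language_posterior is illustrative):

```python
import torch
import torch.nn.functional as F

def language_posterior(classifier, features):
    """Forward pass producing the sentence-level language state posterior
    vector V_L and the index of the most likely language.
    features: (1, T, feat_dim) spectral features of one utterance."""
    with torch.no_grad():
        logits = classifier(features)          # (1, num_languages)
        V = F.softmax(logits, dim=-1)          # posterior distribution over languages
    return V.squeeze(0), int(V.argmax(dim=-1))
```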
The establishing of the segment-level language classification model based on the deep neural network specifically comprises the following steps:
extracting the frame-level voice frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting the frame-level voice frequency spectrum characteristics into the language classification model of the section level, carrying out long-term statistics on the output vector of the current hidden layer, and calculating the mean vector, the variance vector and the section-level statistical vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) Σ_{j=1}^{T} h_j
the variance vector is:
σ = (1/T) Σ_{j=1}^{T} (h_j − μ)²
the segment-level statistical vector is:
h_segment = Append(μ, σ)    (6)
where h_j is the output vector of the current hidden layer at time j; T is the long-term statistical period; μ is the mean vector of the long-term statistics; σ is the variance vector of the long-term statistics, computed element-wise; and h_segment is the segment-level statistical vector; the segment-level statistical vector is formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of h_j; Append(μ, σ) denotes splicing μ and σ into a higher-dimensional vector;
the segment-level statistical vector h_segment is used as the input of the next hidden layer; according to the segment-level language labels, the trained segment-level language classification model is obtained through error computation and back-propagation of gradients, completing the establishment of the segment-level language classification model. Here, a language label is a label carrying the language category.
Inputting the to-be-recognized speech frequency spectrum characteristic sequence of each language type and the sentence level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model, and outputting the speech recognition result of the corresponding language type.
As shown in fig. 1, the multilingual speech recognition model includes: an encoding network, a plurality of attention mechanism modules (attention mechanism module 1, attention mechanism module 2, …, attention mechanism module N) and a decoding network. Setting a corresponding number of attention mechanism modules according to the number of the language types to be identified;
specifically, according to the number of language types contained in the speech frequency spectrum feature to be recognized, a corresponding number of attention mechanism modules are set;
inputting the speech frequency spectrum characteristics to be recognized of each language type into a coding network, and outputting a state sequence of corresponding speech characteristics;
specifically, according to formula (1), a state sequence h of the corresponding speech feature is obtainedenc
henc=Encoder(x) (1)
Wherein h isenc=(henc 1,henc 2,...,henc t,...,henc T) A state sequence of speech features, namely a hidden state output sequence of a coding network; x is (x)1,x2,...,xt,...,xT) The method comprises the steps of inputting a speech frequency spectrum characteristic sequence to be recognized, namely an input characteristic; wherein, T is the total frame number of the input characteristic sequence; encoder () is a computational function of a convolutional neural network/bidirectional long-term memory network (CNN/BLSTM) -based coding network.
Corresponding state sequence h of voice characteristicsencInputting the input data to a corresponding attention mechanism module, and outputting a corresponding output state sequence;
specifically, according to equation (2), the corresponding output sequence is obtained:
el t,i=wTtanh(Wlhenc+Vlhdec i+Ul(Fl*al t,i-1)+bl) (2)
wherein l represents a language type label of multiple languages; e.g. of the typel t,iThe output state of the attention mechanism module of the speech spectrum feature to be recognized of the t frame is represented; w is aT,Wl,Vl,UlRespectively representing a first transformation matrix, a second transformation matrix, a third transformation matrix and a fourth transformation matrix; b is a mixture oflRepresenting a bias vector; tanh () represents a nonlinear activation function; flRepresenting a convolution function; h isenc tRepresenting the output state of the t frame coding network; h is a total ofdec iAn implied layer state representing an ith output modeling unit of the decoding network; a isl t,i-1A weight value corresponding to the attention weight vector of the ith language category in the t frame of the (i-1) th output modeling unit;
obtaining attention weight vectors of corresponding language types according to the corresponding output state sequences;
specifically, according to formula (3), the attention weight vector of the corresponding language category is obtained:
Figure BDA0002321749550000071
wherein, al t,iRepresenting a weight value corresponding to the attention weight vector representing the ith language category at the t frame of the ith output modeling unit; e.g. of the typel t′,iThe output state of the attention mechanism module corresponding to the ith output modeling unit is used for the t' th frame to-be-recognized speech frequency spectrum characteristic; t' is more than or equal to 1 and less than or equal to T and is the corresponding frame of the voice characteristic sequence;
carrying out weighted summation on the corresponding voice characteristic state sequence and the attention weight vector of the corresponding language type to obtain a corresponding attention context content vector;
specifically, according to formula (4), a corresponding attention context content vector is obtained;
c^l_i = Σ_{t=1}^{T} a^l_{t,i} h^enc_t    (4)
where c^l_i denotes the corresponding attention context content vector, i.e., the attention context content vector obtained by weighted summation of the encoding-network outputs for the l-th language;
under the multi-attention mechanism, the language state distribution vector V_L is used to weight and sum the corresponding attention context content vectors, obtaining the final attention context content vector:
c_i = Σ_{l=1}^{N} w_l c^l_i
where V_L is the language state distribution vector, i.e., V_L = (w_1, w_2, ..., w_n, ..., w_N), with w_l the sentence-level posterior probability of the l-th language; and N is the number of language types to be recognized;
and inputting the final attention context content vector to a decoding network to obtain a speech recognition result of the language category.
The method further comprises the following steps: and combining the language classification result of the language type with historical information of the speech recognition result of the corresponding language type output by the decoding network in the pre-constructed multi-language speech recognition model to obtain a corresponding decoding network prediction sequence, and finally obtaining the multi-language speech recognition result.
Specifically, to predict the i-th output modeling unit y_i of the decoding network (the output modeling units are the language-1 output modeling units, ..., language-N output modeling units shown in fig. 1), the hidden-layer state h^dec_i of the decoding network for the i-th output modeling unit is predicted first, where the inputs of the decoding network are the (i-1)-th output modeling unit and the attention context content vector c_i, as shown in equation (6); finally, combining the decoding network with the softmax function, the probability p(y_i | y_{1:i-1}, x) of the i-th output modeling unit y_i is predicted from the hidden-layer state h^dec_i, as shown in equation (7):
h^dec_i = Decoder(y_{i-1}, c_i)    (6)
p(y_i | y_{1:i-1}, x) = softmax(h^dec_i)    (7)
where x denotes the input speech frequency spectrum feature sequence to be recognized; y_{i-1} is the (i-1)-th output modeling unit of the decoding network; c_i is the final attention context content vector; y_{1:i-1} is the history information from the 1st to the (i-1)-th output of the decoding network; p(y_i | y_{1:i-1}, x) is the predicted probability of the i-th output modeling unit y_i of the decoding network; softmax(h^dec_i) applies the softmax function to the decoding-network hidden-layer state h^dec_i; y_i denotes the i-th output modeling unit of the decoding network; and Decoder() denotes the decoding network based on a long short-term memory network (LSTM).
By comparing the predicted probabilities p(y_i | y_{1:i-1}, x), the modeling unit y_i with the largest predicted probability is kept at the i-th prediction step; combining the results from the 1st to the I-th prediction steps yields the final speech recognition result y = (y_1, y_2, ..., y_i, ..., y_I).
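The decoding step of equations (6)-(7) and the greedy selection of the most probable modeling unit can be sketched as follows. The embedding size, hidden size, the helper names, and the simplification that one context vector per output step has already been produced by the attention modules are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the LSTM decoding network of Eqs. (6)-(7)."""
    def __init__(self, vocab_size, ctx_dim, embed_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm_cell = nn.LSTMCell(embed_dim + ctx_dim, hidden)
        self.proj = nn.Linear(hidden, vocab_size)

    def step(self, y_prev, c_i, state):
        """One decoding step: h^dec_i = Decoder(y_{i-1}, c_i), Eq. (6)."""
        inp = torch.cat([self.embed(y_prev), c_i], dim=-1)
        h, c = self.lstm_cell(inp, state)
        p = F.softmax(self.proj(h), dim=-1)        # Eq. (7): p(y_i | y_{1:i-1}, x)
        return p, (h, c)

def greedy_decode(decoder, contexts, sos_id, eos_id):
    """Keep the modeling unit with the largest predicted probability at each
    step, giving y = (y_1, ..., y_I); `contexts` holds one context vector per step."""
    y_prev = torch.tensor([sos_id])
    state, result = None, []
    for c_i in contexts:
        p, state = decoder.step(y_prev, c_i, state)
        y_i = int(p.argmax(dim=-1))
        if y_i == eos_id:
            break
        result.append(y_i)
        y_prev = torch.tensor([y_i])
    return result
```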
Because the time-step mapping between the input feature sequence and the output modeling unit sequence differs across languages, each attention module can be optimized according to the characteristics of its specific language, while model information is still shared among the multiple languages through the common encoding network and decoding network.
The invention also provides an end-to-end multi-language continuous voice stream voice content recognition system, implemented based on the above method and comprising:
an extraction module, configured to input the speech frequency spectrum features to be recognized into the pre-constructed deep-neural-network-based segment-level language classification model and to extract, according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector V_L;
a speech recognition module, configured to input the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector V_L into the pre-constructed multi-language speech recognition model and to output the speech recognition result of the corresponding language.
The system further comprises a voice result acquisition module, configured to obtain the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector V_L, combine the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtain the corresponding decoding-network prediction sequence, and finally obtain the multi-language speech recognition result.
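Tying the earlier sketches together, a hypothetical end-to-end flow of the extraction module and the recognition module could look like the following. The function name recognize_utterance and the dimension compatibility between the sketched components (encoder output size, decoder hidden size, attention dec_dim) are assumptions for illustration only, not the patent's concrete implementation.

```python
import torch

def recognize_utterance(classifier, encoder, attentions, decoder, features,
                        sos_id, eos_id, max_len=200):
    """Stage 1: the extraction module produces the sentence-level posterior V_L.
    Stage 2: the recognition module encodes the features, combines per-language
    attention contexts weighted by V_L, and decodes the text greedily."""
    V, _ = language_posterior(classifier, features)            # stage 1
    h_enc = encoder(features)                                  # Eq. (1)
    T = h_enc.size(1)
    a_prev = [torch.full((1, T), 1.0 / T) for _ in attentions] # uniform initial attention
    state, y_prev, result = None, torch.tensor([sos_id]), []
    for _ in range(max_len):
        # previous decoder hidden state (zeros before the first step)
        h_dec = state[0] if state is not None else torch.zeros(1, decoder.lstm_cell.hidden_size)
        c_i, a_prev = language_weighted_context(h_enc, attentions, h_dec,
                                                a_prev, V.unsqueeze(0))
        p, state = decoder.step(y_prev, c_i, state)            # Eqs. (6)-(7)
        y_i = int(p.argmax(dim=-1))
        if y_i == eos_id:
            break
        result.append(y_i)
        y_prev = torch.tensor([y_i])
    return result
```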
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the computer program.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the above-mentioned method.
The rationality and validity of the multi-language speech recognition system based on the multi-attention mechanism of the present invention have been verified in a real system, with the results shown in table 1:
TABLE 1 recognition results of the multilingual end-to-end recognition model (% word error Rate)
The method of the invention constructs a multi-language end-to-end speech recognition system using four languages: Tacari, Dorzol, Toki, and Haitian Creole. Tacari and Dorzol are variants of the same language used in different regions, while Toki and Haitian Creole are two different creole languages. A common feature of these four languages is that their annotation text consists of Latin letters and variants of Latin letters.
Therefore, the multi-language joint modeling based on the four languages can effectively share information and improve the performance of the multi-language speech recognition system. From table 1, compared to the single-language end-to-end recognition model and the multi-language end-to-end recognition system without the multi-attention mechanism module, the method of the present invention effectively reduces the word error rate of the multi-language recognition model from an average of 62.6% to 60.3% in four languages by merging language information into the multi-language recognition method and combining the multi-attention mechanism module.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that the technical solutions of the present invention may be modified or substituted with equivalents without departing from their spirit and scope, and all such modifications should be covered by the scope of the claims of the present invention.

Claims (10)

1. An end-to-end method for recognizing speech contents of a continuous speech stream with multiple languages, the method comprising:
inputting speech frequency spectrum characteristics to be recognized into a pre-constructed deep neural network-based segment-level language classification model, and outputting a sentence-level language state posterior probability distribution vector;
inputting the to-be-recognized speech frequency spectrum characteristic sequence of each language type and the sentence level language state posterior probability distribution vector to a pre-constructed multi-language speech recognition model, and outputting the speech recognition result of the corresponding language type.
2. The method of claim 1, further comprising: obtaining the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combining the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtaining the corresponding decoding-network prediction sequence, and finally obtaining the multi-language speech recognition result.
3. The method of claim 1, further comprising: the deep neural network-based segment-level language classification model training method specifically comprises the following steps:
extracting the frame-level voice frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting the frame-level voice frequency spectrum characteristics into the language classification model of the section level, carrying out long-term statistics on the output vector of the current hidden layer, and calculating the mean vector, the variance vector and the section-level statistical vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) Σ_{j=1}^{T} h_j
the variance vector is:
σ = (1/T) Σ_{j=1}^{T} (h_j − μ)²
the segment-level statistical vector is:
h_segment = Append(μ, σ)    (6)
where h_j is the output vector of the current hidden layer at time j; T is the long-term statistical period; μ is the mean vector of the long-term statistics; σ is the variance vector of the long-term statistics, computed element-wise; and h_segment is the segment-level statistical vector; the segment-level statistical vector is formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of h_j; Append(μ, σ) denotes splicing μ and σ into a higher-dimensional vector;
the segment-level statistical vector h_segment is used as the input of the next hidden layer; according to the segment-level language labels, the trained segment-level language classification model is obtained through error computation and back-propagation of gradients, completing the establishment of the segment-level language classification model.
4. The method of claim 1, wherein the multi-lingual speech recognition model comprises: the system comprises an encoding network, a plurality of attention mechanism modules and a decoding network; setting a corresponding number of attention mechanism modules according to the number of the language types to be identified;
and setting a corresponding number of attention mechanism modules according to the number of language types contained in the speech spectrum characteristics to be recognized.
5. The method according to claim 4, wherein the step of training the attention mechanism module comprises in particular:
sequence h of states of speech featuresencInputting the input data to a corresponding attention mechanism module, and outputting a corresponding output state sequence;
according to equation (2), the corresponding output sequence is obtained:
el t,i=wTtanh(Wlhenc+Vlhdec i+Ul(Fl*al t,i-1)+bl) (2)
wherein l represents a language type label of multiple languages; e.g. of a cylinderl t,iThe output state of the attention mechanism module represents the speech spectrum feature to be recognized in the t-th frame; w is aT,Wl,Vl,UlRespectively representing a first transformation matrix, a second transformation matrix, a third transformation matrix and a fourth transformation matrix; blRepresenting a bias vector; tanh () represents a nonlinear activation function; flRepresenting a convolution function;
Figure FDA0002321749540000021
representing a tth frame encoded netThe output state of the complex; h isdec iRepresenting the hidden layer states of the ith output modeling unit of the decoding network; a is al t,i-1A weight value corresponding to the attention weight vector of the ith language category in the t frame of the (i-1) th output modeling unit;
obtaining attention weight vectors of corresponding language types according to the corresponding output state sequences;
specifically, according to formula (3), the attention weight vector of the corresponding language category is obtained:
Figure FDA0002321749540000022
wherein, al t,iThe attention weight vector of the ith language category corresponds to the weight value of the t frame of the ith output modeling unit; e.g. of the typel t′,iThe output state of the attention mechanism module corresponding to the ith output modeling unit is used for the t' th frame to-be-recognized speech frequency spectrum characteristic; t' is more than or equal to 1 and less than or equal to T and is the corresponding frame of the voice characteristic sequence.
6. The method according to claim 1, wherein the speech spectrum feature sequence to be recognized and the sentence-level language state posterior probability distribution vector of each language category are input to a pre-constructed multi-language speech recognition model, and the speech recognition result of the corresponding language category is output; the method comprises the following specific steps:
inputting the speech frequency spectrum characteristics to be recognized of each language type into a coding network, and outputting a state sequence of corresponding speech characteristics;
according to the formula (1), obtaining the state sequence h of the corresponding voice featuresenc
henc=Encoder(x) (1)
Wherein the content of the first and second substances,
Figure FDA0002321749540000031
state sequences characterised by speech, i.e. coding networksThe hidden layer state output sequence of (1); x ═ x1,x2,...,xt,...,xT) The method comprises the steps of inputting a speech frequency spectrum characteristic sequence to be recognized, namely an input characteristic; wherein, T is the total frame number of the input characteristic sequence; encoder () is a calculation function of a coding network based on a convolutional neural network/bidirectional long-and-short-term memory network;
carrying out weighted summation on the corresponding voice characteristic state sequence and the attention weight vector of the corresponding language type to obtain a corresponding attention context content vector;
specifically, according to formula (4), a corresponding attention context content vector is obtained;
Figure FDA0002321749540000032
wherein, cl iRepresenting a corresponding attention context content vector, namely an attention context content vector obtained by weighting and summing the coding network by the ith language class;
under the condition of multi-attention mechanism, distributing vector V by language statelAnd carrying out weighted summation with the corresponding attention context content vector to obtain a final attention context content vector:
Figure FDA0002321749540000033
wherein, VlFor language state distribution vectors, i.e. Vl=(wl 1,wl 2,...,wl n,...,wl N) (ii) a N is the number of the language types of the multiple languages to be identified;
and inputting the final attention context content vector to a decoding network to obtain a speech recognition result of the language category.
7. An end-to-end multi-language continuous voice stream voice content recognition system, comprising an extraction module and a voice recognition module;
the extraction module is configured to input the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model and to extract, according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector;
the voice recognition module inputs the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model and outputs the speech recognition result of the corresponding language.
8. The system of claim 7, further comprising a voice result acquisition module, configured to obtain the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combine the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtain the corresponding decoding-network prediction sequence, and finally obtain the multi-language speech recognition result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-6 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of any of the preceding claims 1-6.
CN201911300918.7A 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system Active CN113077785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911300918.7A CN113077785B (en) 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911300918.7A CN113077785B (en) 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system

Publications (2)

Publication Number Publication Date
CN113077785A CN113077785A (en) 2021-07-06
CN113077785B true CN113077785B (en) 2022-07-12

Family

ID=76608263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911300918.7A Active CN113077785B (en) 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system

Country Status (1)

Country Link
CN (1) CN113077785B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10672388B2 (en) * 2017-12-15 2020-06-02 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for open-vocabulary end-to-end speech recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
CN107408111A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 End-to-end speech recognition
CN106782518A (en) * 2016-11-25 2017-05-31 深圳市唯特视科技有限公司 A kind of audio recognition method based on layered circulation neutral net language model
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dai Lirong et al., "Deep-learning-based speech recognition technology: current status and prospects," Journal of Data Acquisition and Processing, 2017, No. 02. *
Miao Xiaoxiao et al., "Duration extension method applied to short-utterance spoken language identification," Journal of Tsinghua University (Science and Technology), 2018, No. 03. *
Jin Ma et al., "Language identification system based on convolutional neural networks," Journal of Data Acquisition and Processing, 2019, No. 02. *

Also Published As

Publication number Publication date
CN113077785A (en) 2021-07-06

Similar Documents

Publication Publication Date Title
CN108647207B (en) Natural language correction method, system, device and storage medium
Toshniwal et al. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition
CN108804611B (en) Dialog reply generation method and system based on self comment sequence learning
KR20200086214A (en) Real-time speech recognition method and apparatus based on truncated attention, equipment and computer-readable storage medium
CN111199727B (en) Speech recognition model training method, system, mobile terminal and storage medium
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
Mangal et al. LSTM vs. GRU vs. Bidirectional RNN for script generation
CN110569505B (en) Text input method and device
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111144110A (en) Pinyin marking method, device, server and storage medium
CN111738006A (en) Commodity comment named entity recognition-based problem generation method
WO2020108545A1 (en) Statement processing method, statement decoding method and apparatus, storage medium and device
CN112308080A (en) Image description prediction method for directional visual understanding and segmentation
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN113077785B (en) End-to-end multi-language continuous voice stream voice content identification method and system
CN115630651B (en) Text generation method and training method and device of text generation model
WO2023116572A1 (en) Word or sentence generation method and related device
CN115270792A (en) Medical entity identification method and device
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN112364602B (en) Multi-style text generation method, device, equipment and readable storage medium
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium
CN112434143A (en) Dialog method, storage medium and system based on hidden state constraint of GRU (generalized regression Unit)
CN114818644B (en) Text template generation method, device, equipment and storage medium
CN116822498B (en) Text error correction processing method, model processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant