CN112489622A - Method and system for recognizing voice content of multi-language continuous voice stream - Google Patents

Method and system for recognizing voice content of multi-language continuous voice stream

Info

Publication number
CN112489622A
CN112489622A (application CN201910782981.2A; granted as CN112489622B)
Authority
CN
China
Prior art keywords
language
segment
level
state
level language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910782981.2A
Other languages
Chinese (zh)
Other versions
CN112489622B (en)
Inventor
徐及
刘丹阳
张鹏远
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201910782981.2A priority Critical patent/CN112489622B/en
Publication of CN112489622A publication Critical patent/CN112489622A/en
Application granted granted Critical
Publication of CN112489622B publication Critical patent/CN112489622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for recognizing the voice content of a multi-language continuous voice stream. The method comprises the following steps: inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors; inputting the segment-level language feature vectors into a segment-level language classification model, and outputting the posterior probability distribution of the segment-level language states; calculating the optimal language state path of the multi-language continuous voice stream with a Viterbi search algorithm according to the posterior probability distribution of the segment-level language states; segmenting the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain language state intervals; and sending the segmented language state intervals into a multi-language acoustic model and the corresponding multi-language decoders for decoding to obtain the content recognition result of the multi-language continuous voice stream. By fusing the language classification models with the Viterbi search algorithm, the invention solves the problem of dynamically detecting and recognizing the languages of multi-language content coexisting in a continuous voice stream.

Description

Method and system for recognizing voice content of multi-language continuous voice stream
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and a system for recognizing speech contents of a continuous speech stream in multiple languages.
Background
With the application of hidden Markov models, deep neural networks and other technologies in the field of automatic speech recognition, automatic speech recognition has achieved unprecedented development. For languages with large user populations such as Chinese and English, the performance of the corresponding single-language speech recognition systems can even reach human-level recognition. As economic trade and cultural exchange among countries accelerate, building a mixed multi-language speech recognition system has become a prerequisite for content detection of multi-language voice streams.
The traditional multi-language voice recognition system consists of a language recognition front end connected in series with several parallel single-language voice recognition back ends. Generally, the language recognition front end performs sentence-level classification of the language of an utterance based on the speech features of the whole utterance. In the multi-language recognition task over a multi-language continuous voice stream, such sentence-level language classification cannot cope with the language classification task when multiple languages coexist in the voice stream.
Disclosure of Invention
The invention aims to solve the problem that sentence-level language classification methods cannot cope with the language classification task when multiple languages coexist in a voice stream.
In order to achieve the above object, the present invention provides a method for recognizing voice contents of a multi-language continuous voice stream, comprising:
inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment-level language feature vector into a segment-level language classification model, and outputting posterior probability distribution of the segment-level language state;
calculating the optimal language state path of the multilingual continuous voice stream based on a Viterbi retrieval algorithm according to the posterior probability distribution of the segment-level language state;
segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval;
and sending the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language continuous voice stream.
As an improvement of the method, the method further comprises a training step of the multilingual acoustic model, and the specific steps are as follows:
step 1-1) constructing a multi-language acoustic model based on a multi-task learning framework, wherein the model comprises a plurality of shared hidden layers and language specific output layers;
step 1-2) extracting the frequency spectrum characteristics of the multi-language continuous voice stream of the training set based on the acoustic state label of the multi-language voice data, and inputting the frequency spectrum characteristics into a shared hidden layer for nonlinear transformation; outputting data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating an error loss function value from the data of the single language at a language specific output layer corresponding to the input spectral feature, wherein the error loss function is as follows:
$$F_{loss,i}=\begin{cases}-\,q_{label,L}\cdot\log p_{model,i}(x_L), & i=L\\ 0, & i\neq L\end{cases}\qquad(1)$$
wherein F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) back-propagating the error loss value F_loss,i; the parameters of each language-specific output layer are updated according to the data of the corresponding single language, and the gradient Δφ_i of the language-specific output layer parameters is calculated as:
$$\Delta\phi_i=\frac{\partial F_{loss,i}}{\partial\phi_i}\qquad(2)$$
wherein φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layer are updated using the error loss values F_loss,i returned by all of the language-specific output layers; the gradient ΔΦ of the shared hidden layer parameters is calculated as:
$$\Delta\Phi=\sum_{i=1}^{L}\frac{\partial F_{loss,i}}{\partial\Phi}\qquad(3)$$
wherein Φ denotes the parameters of the shared hidden layer, and L is the number of language categories, i.e. the number of language-specific output layers of the multilingual acoustic model;
step 1-5) when F_loss,i is greater than the set threshold, return to step 1-2); when F_loss,i is less than the set threshold, the trained multilingual acoustic model is obtained.
As an improvement of the method, the method further comprises a training step of a frame-level language classification model, and the specific steps are as follows:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, inputting the frame-level frequency spectrum characteristics into a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector is:
$$\mu=\frac{1}{T}\sum_{i=1}^{T}h_i\qquad(4)$$
the variance vector is:
$$\sigma=\frac{1}{T}\sum_{i=1}^{T}(h_i-\mu)\odot(h_i-\mu)\qquad(5)$$
the segment level language feature vector:
$$h_{segment}=\mathrm{Append}(\mu,\sigma)\qquad(6)$$
wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistical period, μ is the long-term mean vector, σ is the long-term variance vector, and h_segment is the segment-level language feature vector formed by splicing the mean vector and the variance vector together; its dimension is 2 times the dimension of h_i; Append(μ, σ) denotes splicing μ and σ into one high-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
As an improvement of the method, the method further comprises a training step of a segment-level language classification model, and the specific steps are as follows:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum feature of the multilingual continuous voice stream of the training set, inputting the frame level spectrum feature into the hidden layer of the trained frame level language classification model, and extracting the segment level language feature vector from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
As an improvement of the method, the multilingual continuous speech stream to be recognized is input into a frame-level language classification model, and segment-level language feature vectors are output; inputting the segment-level language feature vector into a segment-level language classification model to output the posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting the frequency spectrum characteristics of the frame level to be identified from the continuous voice stream of the multi-language to be identified;
inputting the frame-level frequency spectrum features to be identified into the trained frame-level language classification model according to a specific step length and a specific window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
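For illustration, a minimal sketch of this sliding-window inference is given below (Python/NumPy assumed; the helper callables frame_level_lid and segment_level_lid, as well as the window and step values, are hypothetical placeholders rather than details specified by the patent).

```python
import numpy as np

def segment_posteriors(features, frame_level_lid, segment_level_lid,
                       window=100, step=50):
    """Slide a fixed window over frame-level spectral features and return, for
    each window, the posterior distribution over language states.

    features          : (num_frames, feat_dim) array of frame-level spectral features
    frame_level_lid   : maps a (window, feat_dim) block to a segment-level feature h_segment
    segment_level_lid : maps h_segment to a posterior over the N languages
    """
    posteriors = []
    for start in range(0, features.shape[0] - window + 1, step):
        h_segment = frame_level_lid(features[start:start + window])
        posteriors.append(segment_level_lid(h_segment))
    return np.stack(posteriors)  # shape: (num_windows, N)
```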
As an improvement of the method, based on the viterbi search algorithm, calculating the optimal language state path of the multilingual continuous speech stream according to the posterior probability distribution of the language state specifically includes:
step 3-1) setting the self-loop probability p_loop and the jump probability p_skip of the language states for the Viterbi search according to the posterior probability distribution of the language states, and obtaining the transition matrix A of the language states as follows:
$$A=\begin{bmatrix}p_{loop}&p_{skip}&\cdots&p_{skip}\\ p_{skip}&p_{loop}&\cdots&p_{skip}\\ \vdots&\vdots&\ddots&\vdots\\ p_{skip}&p_{skip}&\cdots&p_{loop}\end{bmatrix}\qquad(7)$$
wherein p_loop denotes the self-loop probability of a language state and p_skip denotes the jump probability of a language state; the self-loop probability and jump probability values of each language are the same. Language state labels are set according to the language categories, the labels of the different language categories being the Arabic numerals 1, 2, ..., N, where N is the number of language states; the correspondence between the elements of the transition matrix A and the language state labels is:
$$A(m,n)=p_{trans}(s_{T+1}=n\mid s_T=m)=\begin{cases}p_{loop},&m=n\\ p_{skip},&m\neq n\end{cases},\quad m,n\in\{1,2,\ldots,N\}\qquad(8)$$
step 3-2) carrying out Viterbi retrieval on the predicted language state sequence, and calculating a target function based on Viterbi retrieval:
$$\hat{S}=\arg\max_{S=\{s_1,s_2,\ldots\}}\prod_{T}p_{trans}(s_{T+1}\mid s_T)\,p_{emit}(s_{T+1}\mid h_{segment})\qquad(9)$$
wherein p_trans(s_T+1 | s_T) denotes the transition probability of the multilingual continuous voice stream from the language state s_T at time T to the language state s_T+1 at time T+1:
$$p_{trans}(s_{T+1}\mid s_T)=\begin{cases}p_{loop},&s_{T+1}=s_T\\ p_{skip},&s_{T+1}\neq s_T\end{cases}\qquad(10)$$
wherein the language classification labels corresponding to the language state s_T and the language state s_T+1 lie within the range of the annotated language classification labels, and T is the statistical period corresponding to the segment-level language feature h_segment;
p_emit(s_T+1 | h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_T+1:
$$p_{emit}(s_{T+1}\mid h_{segment})=\mathrm{DNN\text{-}LID}_{segment\ level}(h_{segment})\qquad(11)$$
The DNN-LID is a segment level language classifier based on a deep neural network DNN;
step 3-3) the language state sequence that maximizes the objective function in formula (9) is the optimal language state sequence, and language state backtracking is carried out according to the optimal language state sequence to obtain the optimal language state path.
The invention also proposes a system for recognizing the speech content of a continuous speech stream in multiple languages, said system comprising:
the segment-level language feature extraction module is used for inputting the multilingual continuous voice stream to be identified into a frame-level language classification model and outputting segment-level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language state;
a language state path obtaining module, which is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the segment level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of the above items when executing the computer program.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the method of any of the above.
Compared with the prior art, the invention has the advantages that:
1. the method and the system for recognizing the voice content of the continuous voice stream with the multiple languages can solve the problem of dynamic detection of the language types with the coexistence of the multiple language contents in the continuous voice stream by fusing the language classification model with the Viterbi retrieval algorithm.
2. The method for recognizing the voice content of the multi-language continuous voice stream can perform dynamic language switching point judgment and corresponding multi-language content recognition on the multi-language content in the continuous voice stream.
Drawings
FIG. 1 is a diagram illustrating a method for recognizing speech contents of a multilingual continuous speech stream according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
The invention provides a method and a system for recognizing the voice content of a multi-language continuous voice stream, wherein the method comprises the following steps:
step 1) constructing a multi-language acoustic model based on multi-task learning; the acoustic model uniformly constructs acoustic modeling tasks of multiple languages under a neural network classification framework based on multi-task learning, and simultaneously performs joint optimization on the acoustic models of the multiple languages by using acoustic characteristics of the multiple languages; the method specifically comprises the following steps:
step 1-1) constructing a multi-language acoustic model of a neural network classification framework based on multi-task learning, wherein the model is composed of a plurality of shared hidden layers and language specific output layers; wherein the model parameters of the shared hidden layer are jointly optimized by multi-language data; the language specific output layer is optimized by data of each single language;
step 1-2) in the forward calculation process of the model, the shared hidden layer and the language specific output layer of the multilingual acoustic model perform nonlinear transformation on input multilingual frequency spectrum characteristic vectors, and all language specific output layers output information;
step 1-3) in the error loss function calculation process of model updating, according to the acoustic state label corresponding to the spectrum feature, calculating the error loss function value only in the language specific output layer corresponding to the spectrum feature, and calculating the error loss function value of other language specific output layers not corresponding to the spectrum feature language to be zero; the corresponding loss function calculation is as follows:
$$F_{loss,i}=\begin{cases}-\,q_{label,L}\cdot\log p_{model,i}(x_L), & i=L\\ 0, & i\neq L\end{cases}\qquad(1)$$
wherein F_loss,i is the error loss function value of the i-th language-specific output layer, p_model,i(x_L) is the acoustic model output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L;
step 1-4) in the backward propagation of the model classification error, the error loss value F_loss,i is propagated back, and the parameters of each language-specific output layer are trained with the data of the corresponding single language; the parameters of the shared hidden layer are updated from the error loss values F_loss,i returned by all of the language-specific output layers;
the language specific output layer parameter gradient calculation formula is as follows:
$$\Delta\phi_i=\frac{\partial F_{loss,i}}{\partial\phi_i}\qquad(2)$$
wherein φ_i denotes the parameters of the i-th language-specific output layer.
The gradient calculation formula for the shared hidden layer parameters is:
$$\Delta\Phi=\sum_{i=1}^{L}\frac{\partial F_{loss,i}}{\partial\Phi}\qquad(3)$$
where Φ denotes the parameters of the shared hidden layer, and L is the number of language categories, i.e. the number of language-specific output layers of the multilingual acoustic model.
Step 1-5) repeatedly executing the step 1-2) -the step 1-4) until the model parameters are converged.
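To make the multi-task structure of steps 1-1) to 1-5) concrete, the following PyTorch-style sketch shows shared hidden layers with one output layer per language, where only the output layer matching the language of the current batch receives a non-zero loss; the layer sizes, the optimizer and the shape of a training batch are illustrative assumptions and are not fixed by the patent.

```python
import torch
import torch.nn as nn

class MultilingualAM(nn.Module):
    """Shared hidden layers with one language-specific output layer per language."""
    def __init__(self, feat_dim, hidden_dim, num_states_per_lang):
        super().__init__()
        self.shared = nn.Sequential(                      # shared hidden layers
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(                       # language-specific output layers
            [nn.Linear(hidden_dim, n) for n in num_states_per_lang])

    def forward(self, x, lang_id):
        return self.heads[lang_id](self.shared(x))        # logits over acoustic states

def train_step(model, optimizer, batch):
    """One update: the loss is computed only on the output layer of the batch's language,
    so gradients reach that head and the shared layers while the other heads are untouched."""
    x, state_labels, lang_id = batch                      # x: (batch, feat_dim)
    optimizer.zero_grad()
    logits = model(x, lang_id)
    loss = nn.functional.cross_entropy(logits, state_labels)  # F_loss,i for i == lang_id
    loss.backward()
    optimizer.step()
    return loss.item()
```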
Step 2) constructing, based on a deep neural network model, a frame-level language classification model that fuses long-term statistical characteristics, and extracting language feature vectors representing language category features from this frame-level language classification model. In the forward computation of the frame-level language classification model, the long-term statistics component performs segment-level statistics on the output vector of the previous hidden layer, calculates the mean and variance statistics of that output vector, and takes the vector of mean and variance statistics as the input of the next hidden layer; finally, error calculation and backward gradient propagation of the language classification model are performed according to the frame-level language labels to update the model;
the specific steps of training the frame-level language classification model comprise:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, taking the frame-level frequency spectrum characteristics as input characteristics to input a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector is:
$$\mu=\frac{1}{T}\sum_{i=1}^{T}h_i\qquad(4)$$
the variance vector is:
$$\sigma=\frac{1}{T}\sum_{i=1}^{T}(h_i-\mu)\odot(h_i-\mu)\qquad(5)$$
the segment level language feature vector is:
$$h_{segment}=\mathrm{Append}(\mu,\sigma)\qquad(6)$$
wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistical period, μ is the long-term mean vector, σ is the long-term variance vector, and h_segment is the segment-level language feature vector formed by splicing the mean vector and the variance vector together; its dimension is 2 times the dimension of h_i; Append(μ, σ) denotes splicing μ and σ into one high-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
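The long-term statistics of step 2-2) amount to a mean/variance pooling layer inserted between two hidden layers of the frame-level classifier. A minimal PyTorch sketch is given below (layer sizes are illustrative and not prescribed by the patent), with h_segment exposed so that it can later be read out as the segment-level language feature.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Mean/variance pooling over the statistical period T (formulas (4)-(6))."""
    def forward(self, h):                                  # h: (batch, T, hidden_dim)
        mu = h.mean(dim=1)                                 # mean vector
        sigma = ((h - mu.unsqueeze(1)) ** 2).mean(dim=1)   # variance vector (element-wise)
        return torch.cat([mu, sigma], dim=-1)              # h_segment, dimension 2 * hidden_dim

class FrameLevelLID(nn.Module):
    """Frame-level language classifier with an embedded statistics-pooling layer."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_languages=3):
        super().__init__()
        self.frame_layers = nn.Sequential(                 # hidden layers before pooling
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.pool = StatsPooling()
        self.post_layers = nn.Sequential(                  # hidden layers after pooling
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_languages))

    def forward(self, x):                                  # x: (batch, T, feat_dim)
        h = self.frame_layers(x)
        h_segment = self.pool(h)                           # segment-level language feature vector
        return self.post_layers(h_segment), h_segment
```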
Based on a trained frame-level language classification model, extracting segment-level language feature vectors from a hidden layer of the frame-level language classification model, constructing segment-level language labels for each segment-level language feature vector, and training the segment-level language classification model according to the segment-level language feature vectors and the segment-level language labels. The method specifically comprises the following steps:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum characteristics of the multilingual continuous voice stream of the training set, inputting the hidden layer of the trained frame level language classification model by taking the frame level spectrum characteristics as input characteristics, and extracting segment level language characteristic vectors from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
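A possible realization of steps S2-1) to S2-3) follows, assuming a trained frame-level model shaped like the FrameLevelLID sketch above and a list of training segments with segment-level language labels; the classifier size and the training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_segment_level_lid(frame_lid, segments, labels, num_languages, epochs=10):
    """segments: list of (T, feat_dim) tensors; labels: list of segment-level language ids.
    h_segment is read from the frozen frame-level model, and a small DNN classifier is
    trained on it; the softmax of its output is p_emit(s | h_segment)."""
    frame_lid.eval()
    with torch.no_grad():                                   # the frame-level model stays fixed
        feats = torch.stack([frame_lid(seg.unsqueeze(0))[1].squeeze(0) for seg in segments])
    targets = torch.tensor(labels)

    clf = nn.Sequential(nn.Linear(feats.shape[-1], 256), nn.ReLU(),
                        nn.Linear(256, num_languages))
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(clf(feats), targets)
        loss.backward()
        opt.step()
    return clf
```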
Step 3) extracting segment-level language feature vectors by utilizing a trained frame-level language classification model for the voice of the continuous voice stream of the multi-language to be recognized, carrying out language classification on the segment-level language feature vectors according to the segment-level language classification model, and carrying out real-time detection on language switching points of the continuous voice stream of the multi-language by combining a Viterbi retrieval algorithm; and finally, according to the language detection result, segmenting the continuous voice stream and identifying the content of the multi-language voice stream through a multi-language acoustic model and a corresponding decoder. The method comprises the following specific steps:
step 3-1) extracting segment-level language feature vectors from the frame-level language classification model according to specific step length and window length by using the frequency spectrum features of the voice of the multi-language continuous voice stream to be recognized;
classifying the segment-level language feature vectors through a segment-level language classification model to obtain posterior probability distribution of the language states corresponding to the segment-level language feature vectors;
setting the self-loop probability and the jump probability of the language states for the Viterbi search, and reducing the language classification errors caused by inaccurate classification of the segment-level language classification model by increasing the self-loop probability; the method comprises the following steps:
based on the posterior probability distribution of the language states, the self-loop probability and the jump probability of the language states for the Viterbi search are set, and the transition matrix A of the language states is obtained as follows:
$$A=\begin{bmatrix}p_{loop}&p_{skip}&\cdots&p_{skip}\\ p_{skip}&p_{loop}&\cdots&p_{skip}\\ \vdots&\vdots&\ddots&\vdots\\ p_{skip}&p_{skip}&\cdots&p_{loop}\end{bmatrix}\qquad(7)$$
wherein p_loop denotes the self-loop probability of a language state and p_skip denotes the jump probability of a language state; the self-loop probability and jump probability values of each language are the same. Language state labels are set according to the language categories, the labels of the different language categories being the Arabic numerals 1, 2, ..., N; the correspondence between the elements of the transition matrix A and the language state labels is:
$$A(m,n)=p_{trans}(s_{T+1}=n\mid s_T=m)=\begin{cases}p_{loop},&m=n\\ p_{skip},&m\neq n\end{cases},\quad m,n\in\{1,2,\ldots,N\}\qquad(8)$$
step 3-2) calculating the posterior probability p_emit(s_T+1 | h_segment) of the predicted segment-level language state, and performing the Viterbi search on the predicted language states according to the preset self-loop probability p_loop and jump probability p_skip of the language states, specifically including:
calculating the optimal language state sequence of the continuous voice stream based on the target function of the Viterbi retrieval, wherein the target function is as follows:
$$\hat{S}=\arg\max_{S=\{s_1,s_2,\ldots\}}\prod_{T}p_{trans}(s_{T+1}\mid s_T)\,p_{emit}(s_{T+1}\mid h_{segment})\qquad(9)$$
wherein p_trans(s_T+1 | s_T) denotes the transition probability of the multilingual continuous voice stream from the language state s_T at time T to the language state s_T+1 at time T+1:
$$p_{trans}(s_{T+1}\mid s_T)=\begin{cases}p_{loop},&s_{T+1}=s_T\\ p_{skip},&s_{T+1}\neq s_T\end{cases}\qquad(10)$$
wherein the language classification labels corresponding to the language state s_T and the language state s_T+1 lie within the range of the annotated language classification labels, and T is the statistical period corresponding to the segment-level language feature h_segment;
p_emit(s_T+1 | h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_T+1:
$$p_{emit}(s_{T+1}\mid h_{segment})=\mathrm{DNN\text{-}LID}_{segment\ level}(h_{segment})\qquad(11)$$
The DNN-LID is a segment level language classifier based on a deep neural network DNN;
and 3-3) predicting the optimal language state for retrieval by the posterior probability of the segment-level language state predicted by the segment-level language classification model and the preset autorotation probability and the jump probability of the language state through the recursive formula, wherein the sequence with the maximum target function value is the optimal language state sequence corresponding to the continuous voice stream of multiple languages, and the optimal language state path can be obtained by performing language state backtracking through the optimal language state sequence.
step 4) segmenting the multi-language voice stream into language state intervals according to the optimal language state path, and sending the voice stream of each segmented language state interval into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content recognition result of the multi-language continuous voice stream.
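Step 4) then reduces to merging consecutive windows that share a language state into intervals and handing each interval to the single-language decoder for that language. The sketch below assumes the same window/step bookkeeping as the earlier sliding-window sketch and a hypothetical `decoders` mapping from language id to a decoding callable; neither is prescribed by the patent.

```python
def split_by_language(path, window=100, step=50):
    """Convert a per-window language-state path into (language, start_frame, end_frame)
    intervals on the original frame axis."""
    intervals, start, current = [], 0, path[0]
    for i in range(1, len(path)):
        if path[i] != current:                            # language switching point detected
            intervals.append((current, start * step, (i - 1) * step + window))
            start, current = i, path[i]
    intervals.append((current, start * step, (len(path) - 1) * step + window))
    return intervals

def recognize_stream(features, path, decoders, window=100, step=50):
    """Send each language-state interval to the decoder of the detected language."""
    return [(lang, decoders[lang](features[beg:end]))
            for lang, beg, end in split_by_language(path, window, step)]
```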
The invention also proposes a system for recognizing the speech content of a continuous speech stream in multiple languages, said system comprising:
the segment-level language feature extraction module is used for inputting the multilingual continuous voice stream to be identified into a frame-level language classification model and outputting segment-level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language state;
a language state path obtaining module, which is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the segment level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of the above items when executing the computer program.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the method of any of the above.
The rationality and validity of the speech recognition system based on the invention have been verified in real systems; the results are shown in Table 1:
TABLE 1
(The results table is provided as an image in the original publication and is not reproduced here; according to the accompanying text, it compares the language identification accuracy of the baseline system with that of the proposed method on Cantonese, Turkish and Vietnamese data.)
The method of the invention performs joint training of the multi-language acoustic model with Cantonese, Turkish and Vietnamese data, constructs frame-level and segment-level language classification models for the three languages, and performs language classification and voice content recognition on continuous multi-language voice using the Viterbi-based method for recognizing the voice content of a multi-language continuous voice stream. As can be seen from Table 1, the method of the invention improves the language identification accuracy from 82.1% to 92.4%, verifying that the Viterbi-based method for recognizing the voice content of a multi-language continuous voice stream can effectively improve language detection in a multi-language continuous voice stream.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A method of multi-lingual continuous speech stream speech content recognition, the method comprising:
inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment-level language feature vector into a segment-level language classification model, and outputting posterior probability distribution of the segment-level language state;
calculating the optimal language state path of the multilingual continuous voice stream based on a Viterbi retrieval algorithm according to the posterior probability distribution of the segment-level language state;
segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval;
and inputting the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language voice stream.
2. The method for recognizing the speech content of a multilingual continuous speech stream according to claim 1, further comprising a step of training a multilingual acoustic model, comprising the steps of:
step 1-1) constructing a multi-language acoustic model based on a multitask learning neural network, wherein the model comprises a plurality of shared hidden layers and language specific output layers;
step 1-2) extracting the frequency spectrum characteristics of the multi-language continuous voice stream of the training set based on the acoustic state label of the multi-language continuous voice data, and inputting the frequency spectrum characteristics into a shared hidden layer for nonlinear transformation; outputting data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating error loss function values of data of a single language at a language specific output layer corresponding to the input spectrum characteristics:
said error loss function F_loss,i being:
$$F_{loss,i}=\begin{cases}-\,q_{label,L}\cdot\log p_{model,i}(x_L), & i=L\\ 0, & i\neq L\end{cases}\qquad(1)$$
wherein F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) back-propagating the error loss value F_loss,i; the parameters of each language-specific output layer are updated according to the data of the corresponding single language, and the gradient Δφ_i of the language-specific output layer parameters is calculated as:
$$\Delta\phi_i=\frac{\partial F_{loss,i}}{\partial\phi_i}\qquad(2)$$
wherein φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layer are updated using the error loss values F_loss,i returned by all of the language-specific output layers; the gradient ΔΦ of the shared hidden layer parameters is calculated as:
$$\Delta\Phi=\sum_{i=1}^{L}\frac{\partial F_{loss,i}}{\partial\Phi}\qquad(3)$$
wherein Φ denotes the parameters of the shared hidden layer, and L is the number of language categories, i.e. the number of language-specific output layers of the multilingual acoustic model;
step 1-5) when F_loss,i is greater than the set threshold, return to step 1-2); when F_loss,i is less than the set threshold, the trained multilingual acoustic model is obtained.
3. The method according to claim 1, further comprising a step of training a frame-level language classification model, comprising the steps of:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, inputting the frame-level frequency spectrum characteristics into a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector μ is:
$$\mu=\frac{1}{T}\sum_{i=1}^{T}h_i\qquad(4)$$
the variance vector σ is:
$$\sigma=\frac{1}{T}\sum_{i=1}^{T}(h_i-\mu)\odot(h_i-\mu)\qquad(5)$$
the segment-level language feature vector h_segment being:
$$h_{segment}=\mathrm{Append}(\mu,\sigma)\qquad(6)$$
wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistical period, μ is the long-term mean vector, σ is the long-term variance vector, and h_segment is the segment-level language feature vector formed by splicing the mean vector and the variance vector together, its dimension being 2 times the dimension of h_i; Append(μ, σ) denotes splicing μ and σ into one high-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
4. The method for recognizing the speech content of a multilingual continuous speech stream according to claim 1, further comprising the step of training a segment-level language classification model, comprising the steps of:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum feature of the multilingual continuous voice stream of the training set, inputting the frame level spectrum feature into the hidden layer of the trained frame level language classification model, and extracting the segment level language feature vector from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
5. The method according to claim 1, wherein the continuous speech stream of multiple languages is input into a frame-level language classification model to output segment-level language feature vectors; inputting the segment-level language feature vector into a segment-level language classification model to output the posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting the frequency spectrum characteristics of the frame level to be identified from the continuous voice stream of the multi-language to be identified;
inputting the frame-level frequency spectrum features to be identified into the trained frame-level language classification model according to a specific step length and a specific window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
6. The method for recognizing speech content of continuous speech stream in multiple languages according to claim 5, wherein the step of calculating the optimal language state path of the continuous speech stream in multiple languages based on viterbi search algorithm according to the posterior probability distribution of language state comprises:
step 3-1) setting the self-loop probability p_loop and the jump probability p_skip of the language states for the Viterbi search according to the posterior probability distribution of the language states, and obtaining the transition matrix A of the language states as follows:
$$A=\begin{bmatrix}p_{loop}&p_{skip}&\cdots&p_{skip}\\ p_{skip}&p_{loop}&\cdots&p_{skip}\\ \vdots&\vdots&\ddots&\vdots\\ p_{skip}&p_{skip}&\cdots&p_{loop}\end{bmatrix}\qquad(7)$$
the autorotation probability and the skipping probability value of each language are the same, language state labels are set according to language categories, the language state labels are labels of different language categories, and Arabic numerals 1, 2. The corresponding relation between each element of the transition matrix A and the language state label is as follows:
$$A(m,n)=p_{trans}(s_{T+1}=n\mid s_T=m)=\begin{cases}p_{loop},&m=n\\ p_{skip},&m\neq n\end{cases},\quad m,n\in\{1,2,\ldots,N\}\qquad(8)$$
step 3-2) carrying out Viterbi retrieval on the predicted language state, and calculating a target function based on Viterbi retrieval:
$$\hat{S}=\arg\max_{S=\{s_1,s_2,\ldots\}}\prod_{T}p_{trans}(s_{T+1}\mid s_T)\,p_{emit}(s_{T+1}\mid h_{segment})\qquad(9)$$
wherein p_trans(s_T+1 | s_T) denotes the transition probability of the multilingual continuous voice stream from the language state s_T at time T to the language state s_T+1 at time T+1:
$$p_{trans}(s_{T+1}\mid s_T)=\begin{cases}p_{loop},&s_{T+1}=s_T\\ p_{skip},&s_{T+1}\neq s_T\end{cases}\qquad(10)$$
wherein the language classification labels corresponding to the language state s_T and the language state s_T+1 lie within the range of the annotated language classification labels, and T is the statistical period corresponding to the segment-level language feature h_segment;
p_emit(s_T+1 | h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_T+1:
$$p_{emit}(s_{T+1}\mid h_{segment})=\mathrm{DNN\text{-}LID}_{segment\ level}(h_{segment})\qquad(11)$$
The DNN-LID is a segment-level language classifier based on the deep neural network DNN;
step 3-3) the language state sequence that maximizes the objective function in formula (9) is the optimal language state sequence, and language state backtracking is carried out according to the optimal language state sequence to obtain the optimal language state path.
7. A multi-language continuous speech stream speech content recognition system, said system comprising:
the segment-level language feature extraction module is used for inputting the multilingual continuous voice stream to be identified into a frame-level language classification model and outputting segment-level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the section level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-6 when executing the computer program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of any one of claims 1-6.
CN201910782981.2A 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system Active CN112489622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Publications (2)

Publication Number Publication Date
CN112489622A true CN112489622A (en) 2021-03-12
CN112489622B CN112489622B (en) 2024-03-19

Family

ID=74920171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910782981.2A Active CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Country Status (1)

Country Link
CN (1) CN112489622B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870839A (en) * 2021-09-29 2021-12-31 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN114078468A (en) * 2022-01-19 2022-02-22 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
WO2023165538A1 (en) * 2022-03-03 2023-09-07 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and computer-readable medium and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030050779A1 (en) * 2001-08-31 2003-03-13 Soren Riis Method and system for speech recognition
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030050779A1 (en) * 2001-08-31 2003-03-13 Soren Riis Method and system for speech recognition
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AANCHAN MOHAN ET AL.: "《Multi-lingual speech recognition with low-rank multi-task deep neural networks》", 《ICASSP 2015》, 6 August 2015 (2015-08-06), pages 4994 - 4998 *
THOMAS NIESLER ET AL.: "《Language identification and multilingual speech recognition using discriminatively trained acoustic models》", 《COMPUTER SCIENCE, LINGUISTICS》, 31 December 2006 (2006-12-31), pages 1 - 6 *
姚海涛等: "《面向多语言的语音识别声学模型建模方法研究》", 《声学技术》, vol. 34, no. 6, 31 December 2015 (2015-12-31), pages 404 - 407 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870839A (en) * 2021-09-29 2021-12-31 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN113870839B (en) * 2021-09-29 2022-05-03 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN114078468A (en) * 2022-01-19 2022-02-22 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
WO2023165538A1 (en) * 2022-03-03 2023-09-07 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and computer-readable medium and electronic device
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112489622B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN112489622B (en) Multi-language continuous voice stream voice content recognition method and system
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN110727844B (en) Online commented commodity feature viewpoint extraction method based on generation countermeasure network
CN110968660A (en) Information extraction method and system based on joint training model
CN110532555B (en) Language evaluation generation method based on reinforcement learning
CN111339260A (en) BERT and QA thought-based fine-grained emotion analysis method
CN115292463B (en) Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN112069801A (en) Sentence backbone extraction method, equipment and readable storage medium based on dependency syntax
CN111581970B (en) Text recognition method, device and storage medium for network context
CN113204952A (en) Multi-intention and semantic slot joint identification method based on clustering pre-analysis
CN111078876A (en) Short text classification method and system based on multi-model integration
CN112528658B (en) Hierarchical classification method, hierarchical classification device, electronic equipment and storage medium
CN111739520A (en) Speech recognition model training method, speech recognition method and device
CN114860942B (en) Text intention classification method, device, equipment and storage medium
CN114417872A (en) Contract text named entity recognition method and system
CN112507124A (en) Chapter-level event causal relationship extraction method based on graph model
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN115146124A (en) Question-answering system response method and device, equipment, medium and product thereof
CN113239694B (en) Argument role identification method based on argument phrase
Lin et al. Ctc network with statistical language modeling for action sequence recognition in videos
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
Gong et al. Activity grammars for temporal action segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant