CN112489622B - Multi-language continuous voice stream voice content recognition method and system - Google Patents

Multi-language continuous voice stream voice content recognition method and system

Info

Publication number
CN112489622B
CN112489622B (granted publication of application CN201910782981.2A; pre-grant publication CN112489622A)
Authority
CN
China
Prior art keywords
language
segment
level
state
level language
Prior art date
Legal status
Active
Application number
CN201910782981.2A
Other languages
Chinese (zh)
Other versions
CN112489622A (en)
Inventor
徐及
刘丹阳
张鹏远
颜永红
Current Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd
Priority to CN201910782981.2A
Publication of CN112489622A
Application granted
Publication of CN112489622B
Active legal status
Anticipated expiration legal status

Classifications

    • G10L15/005 — Speech recognition; Language recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 — Creation of reference templates; Training of speech recognition systems
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/26 — Speech to text systems
    • G10L25/18 — Speech or voice analysis techniques; extracted parameters being spectral information of each sub-band
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for recognizing the speech content of a multi-language continuous voice stream. The method comprises the following steps: inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model and outputting segment-level language feature vectors; inputting the segment-level language feature vectors into a segment-level language classification model and outputting the posterior probability distribution of the segment-level language states; calculating the optimal language-state path of the multi-language continuous voice stream with a Viterbi search algorithm according to the posterior probability distribution of the segment-level language states; segmenting the multi-language continuous voice stream to be recognized according to the optimal language-state path to obtain language-state intervals; and sending the segmented language-state intervals to a multilingual acoustic model and the corresponding multilingual decoders for decoding to obtain the content recognition result of the multilingual continuous voice stream. By fusing the language classification models with the Viterbi search algorithm, the invention solves the problem of dynamically detecting and recognizing the language categories of concurrent multilingual content in continuous voice streams.

Description

Multi-language continuous voice stream voice content recognition method and system
Technical Field
The invention relates to the field of speech recognition, and in particular to a method and a system for recognizing the speech content of multi-language continuous voice streams.
Background
With the application of hidden Markov models, deep neural networks and other techniques in the field of automatic speech recognition, automatic speech recognition technology has made unprecedented progress. For widely spoken languages such as Chinese and English, the performance of the corresponding single-language speech recognition systems can even reach the recognition level of humans. As economic and cultural exchange among the countries of the world accelerates, building mixed multilingual speech recognition systems has become a necessity for detecting the content of multilingual voice streams.
A traditional multi-language speech recognition system connects a language identification front end in series with a back end of several parallel single-language speech recognition systems. In general, the language identification front end classifies and discriminates the language category of an utterance from the speech characteristics of the whole utterance. In the multi-language recognition task on a multi-language continuous voice stream, such sentence-level language classification cannot handle the language classification task when several languages coexist within one voice stream.
Disclosure of Invention
The invention aims to solve the problem that a sentence-level language classification method cannot handle the language classification task when multiple languages coexist in a voice stream.
To achieve the above object, the present invention provides a method for recognizing speech content of a multi-language continuous speech stream, the method comprising:
inputting the multi-language continuous voice stream to be identified into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment level language feature vector into a segment level language classification model, and outputting posterior probability distribution of the segment level language state;
calculating an optimal language state path of the multi-language continuous voice stream based on a Viterbi search algorithm according to posterior probability distribution of the segment-level language states;
dividing the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain a language state interval;
and sending the language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual continuous voice stream.
As an improvement of the method, the method further comprises a training step of the multilingual acoustic model, which comprises the following specific steps:
step 1-1) constructing a multi-language acoustic model based on a multi-task learning framework, wherein the model comprises a plurality of shared hidden layers and a plurality of language-specific output layers;
step 1-2) extracting spectral features of multi-language continuous voice streams of a training set based on acoustic state labels of multi-language voice data, and inputting the spectral features into a shared hidden layer for nonlinear transformation; outputting the data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating the error loss function value of the single-language data at the language-specific output layer corresponding to the input spectral features, where the error loss function is:
F_loss,i = −Σ q_label,L · log p_model,i(x_L)   (1)
where F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the i-th language-specific output layer for the spectral feature x_L of the L-th language (the loss is computed only when i corresponds to the language of x_L), and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) passing the error loss value F_loss,i backward; the parameters of each language-specific output layer are updated only from the data of its corresponding single language, and the gradient of the language-specific output layer parameters is
Δφ_i = ∂F_loss,i / ∂φ_i   (2)
where φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layers are computed from the error loss values F_loss,i returned by all language-specific output layers, and the gradient of the shared hidden layer parameters is
Δφ = Σ_{i=1}^{L} ∂F_loss,i / ∂φ   (3)
where φ denotes the parameters of the shared hidden layers and L is the number of languages corresponding to the language-specific output layers of the multilingual acoustic model;
step 1-5) if F_loss,i is greater than a given threshold, return to step 1-2);
when F_loss,i is smaller than the given threshold, the trained multilingual acoustic model is obtained.
As an improvement of the method, the method further comprises a training step of a frame-level language classification model, and the method comprises the following specific steps:
step 2-1), constructing a frame-level language classification model, wherein the frame-level language classification model is a deep neural network;
step 2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into a frame-level language classification model, carrying out long-time statistics on the output vector of the current hidden layer, and calculating a mean value vector, a variance vector and a segment-level language feature vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) · Σ_{i=1}^{T} h_i   (4)
the variance vector is:
σ = (1/T) · Σ_{i=1}^{T} (h_i − μ)²   (5)
the segment-level language feature vector is:
h_segment = Append(μ, σ)   (6)
where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector formed by concatenating the mean vector and the variance vector, whose dimension is 2 times the dimension of h_i; Append(μ, σ) denotes concatenating μ and σ into a higher-dimensional vector;
step 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, training according to the frame-level language labels through error calculation and reverse gradient feedback process, and enabling each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
As an improvement of the method, the method further comprises a training step of a segment-level language classification model, and the method comprises the following specific steps:
s2-1), constructing a segment level language classification model;
s2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into an implicit layer of a trained frame-level language classification model, and extracting segment-level language feature vectors from the implicit layer of the trained frame-level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vectors and their labels into the segment-level language classification model for training, and outputting the posterior probability distribution of the language state corresponding to each segment-level language feature vector, obtaining a trained segment-level language classification model.
As an improvement of the method, the multi-language continuous voice stream to be identified is input into a frame-level language classification model, and segment-level language feature vectors are output; inputting the segment level language feature vector into the segment level language classification model to output posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting to-be-identified frame-level spectrum features from a multi-language continuous voice stream to be identified;
inputting the frame-level spectral features to be recognized into the trained frame-level language classification model with a specific step length and window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
As an improvement of the method, the method calculates the optimal language state path of the multi-language continuous voice stream based on the viterbi search algorithm according to the posterior probability distribution of the language states, and specifically comprises the following steps:
step 3-1) setting, according to the posterior probability distribution of the language states, the self-loop probability p_loop and the skip probability p_skip of the language states for the Viterbi search; the resulting language-state transition matrix A is an N×N matrix with p_loop on the diagonal and p_skip everywhere else:
A = [a_ij], a_ij = p_loop if i = j, a_ij = p_skip if i ≠ j   (7)
where p_loop denotes the self-loop probability of a language state and p_skip the skip probability between language states; the self-loop probability and the skip probability are the same for every language. Language-state labels are set according to the language categories; the labels of the different language categories are the Arabic numerals 1, 2, ..., N, where N is the number of language states. Element a_ij of the transition matrix A is the transition probability from the language state labelled i to the language state labelled j.   (8)
step 3-2) performing the Viterbi search over the predicted language-state sequence and computing the objective function of the Viterbi search:
δ_{T+1}(s_{T+1}) = max_{s_T} [ δ_T(s_T) · p_trans(s_{T+1}|s_T) ] · p_emit(s_{T+1}|h_segment)   (9)
where p_trans(s_{T+1}|s_T) is the transition probability of the multilingual continuous speech stream from the language state s_T at time T to the language state s_{T+1} at time T+1:
p_trans(s_{T+1}|s_T) = p_loop if s_{T+1} = s_T, and p_skip otherwise   (10)
where the language classification labels corresponding to the language states s_T and s_{T+1} lie within the set of annotated language classification labels, and T is the statistics period corresponding to the segment-level language feature h_segment;
p_emit(s_{T+1}|h_segment) is the posterior probability of predicting the language state s_{T+1} from the segment-level language feature h_segment:
p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)
where DNN-LID_segment-level is the segment-level language classifier based on a deep neural network (DNN);
step 3-3) the language-state sequence that maximizes the objective function is taken as the optimal language-state sequence, and language-state backtracking is carried out over the optimal language-state sequence to obtain the optimal language-state path.
The invention also provides a multilingual continuous voice stream voice content recognition system, which comprises:
the segment level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame level language classification model and outputting segment level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model, and outputs posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating the optimal language state path of the multilingual voice stream based on a Viterbi retrieval algorithm according to posterior probability distribution of the segment-level language states;
the language state interval segmentation module is used for segmenting the multi-language continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content recognition module is used for sending the segmented language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual speech stream.
The invention also proposes a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the above when executing the computer program.
The invention also proposes a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of any of the above.
Compared with the prior art, the invention has the advantages that:
1. The method and system for recognizing the speech content of multi-language continuous voice streams disclosed by the invention fuse the language classification models with the Viterbi search algorithm, which solves the problem of dynamically detecting the language categories of concurrent multilingual content in a continuous voice stream.
2. The method can dynamically determine the language switching points of the multilingual content in a continuous voice stream and recognize the corresponding multilingual content.
Drawings
FIG. 1 is a schematic diagram of a multi-language continuous speech stream speech content recognition method according to the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
The invention provides a method and a system for recognizing multi-language continuous voice stream voice content, wherein the method comprises the following steps:
step 1) constructing a multilingual acoustic model based on multi-task learning; the acoustic modelling tasks of all languages are built uniformly under a multi-task-learning neural-network classification framework, and the multilingual acoustic model is jointly optimized with the acoustic features of the several languages; the specific steps are:
step 1-1) constructing a multilingual acoustic model under the multi-task-learning neural-network classification framework, the model being composed of a plurality of shared hidden layers and a plurality of language-specific output layers; the model parameters of the shared hidden layers are jointly optimized with the multilingual data, while each language-specific output layer is optimized with the data of its own single language;
step 1-2) during the forward computation of the model, the shared hidden layers and the language-specific output layers of the multilingual acoustic model apply nonlinear transformations to the input multilingual spectral feature vectors, and every language-specific output layer produces an output;
step 1-3) during the error-loss computation for the model update, the error loss function value is calculated only at the language-specific output layer corresponding to the language of the spectral features, using the acoustic state labels of those spectral features; the error loss function values of the language-specific output layers of the other languages are zero; the corresponding loss function is:
F_loss,i = −Σ q_label,L · log p_model,i(x_L)   (1)
where F_loss,i is the error loss function value of the i-th language-specific output layer, p_model,i(x_L) is the acoustic model output of the i-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L;
step 1-4) during the backward pass of the model classification error, the error loss value F_loss,i is propagated back; the parameters of each language-specific output layer are trained only on the data of its corresponding single language, while the parameters of the shared hidden layers are computed from the error loss values F_loss,i returned by all language-specific output layers;
the gradient of the language-specific output layer parameters is:
Δφ_i = ∂F_loss,i / ∂φ_i   (2)
where φ_i denotes the parameters of the i-th language-specific output layer.
The gradient of the shared hidden layer parameters is:
Δφ = Σ_{i=1}^{L} ∂F_loss,i / ∂φ   (3)
where φ denotes the parameters of the shared hidden layers and L is the number of languages corresponding to the language-specific output layers of the multilingual acoustic model.
Step 1-5) repeating steps 1-2) to 1-4) until the model parameters converge.
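To make the multi-task training procedure of steps 1-1) to 1-5) concrete, the following minimal Python sketch (not part of the patent; the layer sizes, feature dimension and the use of PyTorch with a cross-entropy loss are illustrative assumptions) shows a model with shared hidden layers and per-language output layers, where each mini-batch of single-language data updates only its own output layer together with the shared layers:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultilingualAcousticModel(nn.Module):
    def __init__(self, feat_dim, hidden_dim, state_nums):
        # state_nums[i] = number of acoustic states of language i
        super().__init__()
        self.shared = nn.Sequential(                      # shared hidden layers
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(                       # language-specific output layers
            [nn.Linear(hidden_dim, n) for n in state_nums])

    def forward(self, x, lang_id):
        return self.heads[lang_id](self.shared(x))

model = MultilingualAcousticModel(feat_dim=40, hidden_dim=512, state_nums=[3000, 2500, 2800])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(feats, state_labels, lang_id):
    # feats: (batch, feat_dim) spectral features of one language; state_labels: acoustic state labels
    logits = model(feats, lang_id)
    # the loss is computed only at the output layer of the matching language,
    # so the other output layers contribute zero loss for this mini-batch
    loss = F.cross_entropy(logits, state_labels)
    optimizer.zero_grad()
    loss.backward()        # the shared hidden layers accumulate gradients from every language over training
    optimizer.step()
    return loss.item()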
Step 2) constructing a frame-level language classification model that incorporates long-term statistical features on the basis of a deep neural network model, and extracting language feature vectors that characterize language-category information with this model. The frame-level language classification model contains a long-term statistics component: during the forward computation of the model, the component performs segment-level statistics on the output vector of the previous hidden layer, computes the mean and variance statistics of that output vector, and feeds the statistics vector to the next hidden layer; finally, the error of the language classification model is computed against the frame-level language labels and the model is updated through the backward gradient pass;
the training frame level language classification model specifically comprises the following steps:
step 2-1), constructing a frame-level language classification model, wherein the frame-level language classification model is a deep neural network;
step 2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting a frame-level language classification model by taking the frame-level spectrum features as input features, carrying out long-time statistics on the output vector of the current hidden layer, and calculating a mean value vector, a variance vector and a segment-level language feature vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) · Σ_{i=1}^{T} h_i   (4)
the variance vector is:
σ = (1/T) · Σ_{i=1}^{T} (h_i − μ)²   (5)
the segment-level language feature vector is:
h_segment = Append(μ, σ)   (6)
where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector formed by concatenating the mean vector and the variance vector, whose dimension is 2 times the dimension of h_i; Append(μ, σ) denotes concatenating μ and σ into a higher-dimensional vector;
step 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, training according to the frame-level language labels through error calculation and reverse gradient feedback process, and enabling each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
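A minimal sketch of the long-term statistics component of step 2-2) is given below (illustrative only; the window length, hidden size and the exact variance convention are assumptions): it pools T frame-level hidden vectors into their mean and variance and concatenates them into the segment-level feature h_segment, whose dimension is twice the hidden dimension:

import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    # long-term statistics over one window of frame-level hidden outputs
    def forward(self, h):
        # h: (T, hidden_dim) output vectors of the current hidden layer
        mu = h.mean(dim=0)                       # mean vector, equation (4)
        sigma = h.var(dim=0, unbiased=False)     # variance vector, equation (5)
        return torch.cat([mu, sigma], dim=-1)    # Append(mu, sigma), equation (6)

pool = StatsPooling()
h = torch.randn(100, 256)        # e.g. T = 100 frames of a 256-dimensional hidden layer
h_segment = pool(h)              # 512-dimensional segment-level language feature vector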
Based on the trained frame-level language classification model, segment-level language feature vectors are extracted from the implicit layer of the frame-level language classification model, segment-level language labels are built for each segment-level language feature vector, and the segment-level language classification model is trained according to the segment-level language feature vectors and the segment-level language labels. The method specifically comprises the following steps:
s2-1), constructing a segment level language classification model;
s2-2) extracting frame-level frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting hidden layers of the trained frame-level language classification model by taking the frame-level frequency spectrum characteristics as input characteristics, and extracting segment-level language feature vectors from the hidden layers of the trained frame-level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vectors and their labels into the segment-level language classification model for training, and outputting the posterior probability distribution of the language state corresponding to each segment-level language feature vector, obtaining a trained segment-level language classification model.
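The segment-level language classification model of steps S2-1) to S2-3) can be as small as a feed-forward network over h_segment; the sketch below is an illustrative assumption (the layer sizes and the single hidden layer are not specified by the patent) that outputs the posterior probability distribution over the language states:

import torch.nn as nn

class SegmentLevelLID(nn.Module):
    def __init__(self, seg_dim=512, hidden_dim=256, num_languages=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seg_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_languages))

    def forward(self, h_segment):
        # posterior probability distribution p_emit(s | h_segment) over the language states
        return self.net(h_segment).softmax(dim=-1)

During training, the logits of this classifier would be scored against the segment-level language labels with a cross-entropy loss, exactly as for any classifier.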
Step 3) extracting segment-level language feature vectors from the speech of the multi-language continuous voice stream to be recognized with the trained frame-level language classification model, classifying the segment-level language feature vectors with the segment-level language classification model, and detecting the language switching points of the multi-language continuous voice stream in real time in combination with the Viterbi search algorithm; finally, the continuous voice stream is segmented according to the language detection result, and the content of the multilingual voice stream is recognized by the multilingual acoustic model and the corresponding decoders. The specific steps are as follows:
step 3-1), extracting segment-level language feature vectors from the frame-level language classification model according to specific step length and window length by using the frequency spectrum features of the voices of the multi-language continuous voice stream to be recognized;
classifying the segment-level language feature vectors through a segment-level language classification model to obtain posterior probability distribution of language states corresponding to the segment-level language feature vectors;
setting the self-loop probability and the skip probability of the language states for the Viterbi search; raising the self-loop probability of the language states reduces the language classification errors caused by inaccurate classification by the segment-level language classification model; this comprises the following:
based on the posterior probability distribution of the language states, the self-loop probability and the skip probability of the language states for the Viterbi search are set, and the resulting language-state transition matrix A is an N×N matrix with p_loop on the diagonal and p_skip everywhere else:
A = [a_ij], a_ij = p_loop if i = j, a_ij = p_skip if i ≠ j   (7)
where p_loop denotes the self-loop probability of a language state and p_skip the skip probability between language states; the self-loop probability and the skip probability are the same for every language. Language-state labels are set according to the language categories; the labels of the different language categories are the Arabic numerals 1, 2, ..., N, where N is the number of language states. Element a_ij of the transition matrix A is the transition probability from the language state labelled i to the language state labelled j.   (8)
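As an illustration of equation (7), the transition matrix can be built as an N×N array with the self-loop probability on the diagonal and the skip probability elsewhere; the value p_loop = 0.95 below is only an example, chosen large to suppress spurious language switches caused by noisy segment-level posteriors:

import numpy as np

def make_transition_matrix(num_languages, p_loop=0.95):
    # each row sums to 1: one self-loop entry plus (N - 1) skip entries
    p_skip = (1.0 - p_loop) / (num_languages - 1)
    A = np.full((num_languages, num_languages), p_skip)
    np.fill_diagonal(A, p_loop)
    return A

A = make_transition_matrix(3)    # e.g. 3 language states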
step 3-2) calculating the posterior probability p_emit(s_{T+1}|h_segment) of the predicted segment-level language state and performing the Viterbi search over the predicted language states with the preset self-loop probability p_loop and skip probability p_skip of the language states; specifically:
calculating the optimal language-state sequence of the continuous voice stream with the objective function of the Viterbi search, where the objective function is:
δ_{T+1}(s_{T+1}) = max_{s_T} [ δ_T(s_T) · p_trans(s_{T+1}|s_T) ] · p_emit(s_{T+1}|h_segment)   (9)
where p_trans(s_{T+1}|s_T) is the transition probability of the multilingual continuous speech stream from the language state s_T at time T to the language state s_{T+1} at time T+1:
p_trans(s_{T+1}|s_T) = p_loop if s_{T+1} = s_T, and p_skip otherwise   (10)
where the language classification labels corresponding to the language states s_T and s_{T+1} lie within the set of annotated language classification labels, and T is the statistics period corresponding to the segment-level language feature h_segment;
p_emit(s_{T+1}|h_segment) is the posterior probability of predicting the language state s_{T+1} from the segment-level language feature h_segment:
p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)
where DNN-LID_segment-level is the segment-level language classifier based on a deep neural network (DNN);
step 3-3) with the posterior probabilities of the segment-level language states predicted by the segment-level language classification model and the preset self-loop and skip probabilities of the language states, the above recursive formula is evaluated to search for the best language states; the sequence with the largest final objective-function value is the optimal language-state sequence of the multi-language continuous voice stream, and language-state backtracking over this sequence yields the optimal language-state path.
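A minimal sketch of this Viterbi search follows (an assumption-level illustration, not the patent's implementation): it takes the matrix of segment-level posteriors p_emit, one row per statistics window, and the transition matrix A, runs the recursion of equation (9) in the log domain, and backtracks the optimal language-state path:

import numpy as np

def viterbi_language_path(emit_probs, A):
    # emit_probs: (num_segments, N) posteriors from the segment-level language classifier
    log_emit = np.log(emit_probs + 1e-12)
    log_A = np.log(A)
    num_segments, N = emit_probs.shape
    delta = np.zeros((num_segments, N))            # best log score ending in each state
    back = np.zeros((num_segments, N), dtype=int)  # backpointers for state backtracking
    delta[0] = log_emit[0]
    for t in range(1, num_segments):
        scores = delta[t - 1][:, None] + log_A     # scores[i, j]: come from state i into state j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = [int(delta[-1].argmax())]               # state with the largest final objective value
    for t in range(num_segments - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                              # optimal language-state path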
Step 4) segmenting the multilingual speech stream into language-state intervals according to the optimal language-state path, sending the speech of each segmented language-state interval to the multilingual acoustic model and the corresponding language-specific decoder for decoding, and obtaining the content recognition result of the multilingual continuous speech stream.
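The following sketch illustrates step 4) under stated assumptions (the one-second window length, the speech_stream.cut method and the decoders mapping are hypothetical placeholders, not from the patent): it collapses the optimal language-state path into contiguous language-state intervals and sends each interval to the decoder of its language:

def path_to_intervals(path, window_sec=1.0):
    # collapse the per-window language states into (start_s, end_s, language_id) intervals
    intervals, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            intervals.append((start * window_sec, t * window_sec, path[start]))
            start = t
    return intervals

def recognize(speech_stream, path, decoders, window_sec=1.0):
    # decoders[lang_id]: multilingual acoustic model plus the decoder of that language
    results = []
    for begin, end, lang_id in path_to_intervals(path, window_sec):
        segment = speech_stream.cut(begin, end)             # hypothetical segment extraction
        results.append((lang_id, decoders[lang_id].decode(segment)))
    return results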
The invention also provides a multilingual continuous voice stream voice content recognition system, which comprises:
the segment level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame level language classification model and outputting segment level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model, and outputs posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating the optimal language state path of the multilingual voice stream based on a Viterbi retrieval algorithm according to posterior probability distribution of the segment-level language states;
the language state interval segmentation module is used for segmenting the multi-language continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content recognition module is used for sending the segmented language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual speech stream.
The invention also proposes a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the above when executing the computer program.
The invention also proposes a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of any of the above.
The rationality and effectiveness of the speech recognition system based on the invention have been verified in practical systems, and the results are shown in table 1:
TABLE 1
In this experiment, Cantonese, Turkish and Vietnamese data are used for joint training of the multilingual acoustic model, frame-level and segment-level language classification models are built on the same three languages, and language classification and speech content recognition are performed on continuous multilingual speech with the Viterbi-based multi-language continuous voice stream speech content recognition method. Table 1 shows that the method of the invention improves the language identification accuracy from 82.1% to 92.4%, verifying that the Viterbi-based method for recognizing the speech content of multilingual continuous voice streams can effectively improve the result of language detection in continuous multilingual voice streams.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.

Claims (8)

1. A method of multi-lingual continuous voice stream voice content recognition, the method comprising:
inputting the multi-language continuous voice stream to be identified into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment level language feature vector into a segment level language classification model, and outputting posterior probability distribution of the segment level language state;
calculating an optimal language state path of the multi-language continuous voice stream based on a Viterbi search algorithm according to posterior probability distribution of the segment-level language states;
dividing the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain a language state interval;
inputting the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language continuous voice stream;
according to the posterior probability distribution of segment-level language states, based on a Viterbi search algorithm, calculating an optimal language state path of the multi-language continuous voice stream, wherein the method specifically comprises the following steps:
step 3-1) setting, according to the posterior probability distribution of the language states, the self-loop probability p_loop and the skip probability p_skip of the language states for the Viterbi search; the resulting language-state transition matrix A is an N×N matrix with p_loop on the diagonal and p_skip everywhere else:
A = [a_ij], a_ij = p_loop if i = j, a_ij = p_skip if i ≠ j   (7)
where the self-loop probability and the skip probability are the same for every language; language-state labels are set according to the language categories, the labels of the different language categories being the Arabic numerals 1, 2, ..., N, where N is the number of language states; element a_ij of the transition matrix A is the transition probability from the language state labelled i to the language state labelled j;   (8)
step 3-2) performing the Viterbi search over the predicted language states and computing the objective function of the Viterbi search:
δ_{T+1}(s_{T+1}) = max_{s_T} [ δ_T(s_T) · p_trans(s_{T+1}|s_T) ] · p_emit(s_{T+1}|h_segment)   (9)
where p_trans(s_{T+1}|s_T) is the transition probability of the multilingual continuous speech stream from the language state s_T at time T to the language state s_{T+1} at time T+1:
p_trans(s_{T+1}|s_T) = p_loop if s_{T+1} = s_T, and p_skip otherwise   (10)
where the language classification labels corresponding to the language states s_T and s_{T+1} lie within the set of annotated language classification labels, and T is the statistics period corresponding to the segment-level language feature h_segment;
p_emit(s_{T+1}|h_segment) is the posterior probability of predicting the language state s_{T+1} from the segment-level language feature h_segment:
p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)
where DNN-LID_segment-level is the segment-level language classifier based on a deep neural network (DNN);
step 3-3) the language-state sequence that maximizes the objective function is taken as the optimal language-state sequence, and language-state backtracking is carried out over the optimal language-state sequence to obtain the optimal language-state path.
2. The method for recognizing speech content in a multilingual continuous speech stream according to claim 1, further comprising a training step of the multilingual acoustic model, comprising the specific steps of:
step 1-1) constructing a multi-language acoustic model based on a multi-task-learning neural network, wherein the model comprises a plurality of shared hidden layers and a plurality of language-specific output layers;
step 1-2) extracting spectral features of multi-language continuous voice streams of a training set based on acoustic state labels of multi-language continuous voice data, and inputting the spectral features into a shared hidden layer for nonlinear transformation; outputting the data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating the error loss function value of the single-language data at the language-specific output layer corresponding to the input spectral features:
the error loss function F_loss,i is:
F_loss,i = −Σ q_label,L · log p_model,i(x_L)   (1)
where F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the i-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) passing the error loss value F_loss,i backward; the parameters of each language-specific output layer are updated only from the data of its corresponding single language, and the gradient of the language-specific output layer parameters is:
Δφ_i = ∂F_loss,i / ∂φ_i   (2)
where φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layers are updated from the error loss values F_loss,i returned by all language-specific output layers, the gradient of the shared hidden layer parameters being:
Δφ = Σ_{i=1}^{L} ∂F_loss,i / ∂φ   (3)
where φ denotes the parameters of the shared hidden layers and L is the number of languages corresponding to the language-specific output layers of the multilingual acoustic model;
step 1-5) if F_loss,i is greater than a given threshold, return to step 1-2);
when F_loss,i is smaller than the given threshold, the trained multilingual acoustic model is obtained.
3. The method of claim 1, further comprising the step of training a frame-level language classification model, comprising the steps of:
step 2-1), constructing a frame-level language classification model, wherein the frame-level language classification model is a deep neural network;
step 2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into a frame-level language classification model, carrying out long-time statistics on the output vector of the current hidden layer, and calculating a mean value vector, a variance vector and a segment-level language feature vector of the output vector of the current hidden layer;
the mean vector μ is:
μ = (1/T) · Σ_{i=1}^{T} h_i   (4)
the variance vector σ is:
σ = (1/T) · Σ_{i=1}^{T} (h_i − μ)²   (5)
the segment-level language feature vector h_segment is:
h_segment = Append(μ, σ)   (6)
where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector formed by concatenating the mean vector and the variance vector, whose dimension is 2 times the dimension of h_i; Append(μ, σ) denotes concatenating μ and σ into a higher-dimensional vector;
step 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, training according to the frame-level language labels through error calculation and reverse gradient feedback process, and enabling each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
4. The method of claim 1, further comprising the step of training a segment-level language classification model, comprising the steps of:
s2-1), constructing a segment level language classification model;
s2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into an implicit layer of a trained frame-level language classification model, and extracting segment-level language feature vectors from the implicit layer of the trained frame-level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vectors and their labels into the segment-level language classification model for training, and outputting the posterior probability distribution of the language state corresponding to each segment-level language feature vector, obtaining a trained segment-level language classification model.
5. The method for recognizing speech content according to claim 1, wherein the multi-language continuous speech stream to be recognized is input into a frame-level language classification model to output segment-level language feature vectors; inputting the segment level language feature vector into the segment level language classification model to output posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting to-be-identified frame-level spectrum features from a multi-language continuous voice stream to be identified;
inputting the frame-level spectral features to be recognized into the trained frame-level language classification model with a specific step length and window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
6. A system based on the multi-lingual continuous voice stream voice content recognition method of claim 1, the system comprising:
the segment level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame level language classification model and outputting segment level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model, and outputs posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating an optimal language state path of the multilingual voice stream based on a Viterbi retrieval algorithm according to posterior probability distribution of the segment-level language states;
the language state interval segmentation module is used for segmenting the multi-language continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content recognition module is used for sending the segmented language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual continuous voice stream.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-5 when executing the computer program.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-5.
CN201910782981.2A 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system Active CN112489622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Publications (2)

Publication Number Publication Date
CN112489622A CN112489622A (en) 2021-03-12
CN112489622B true CN112489622B (en) 2024-03-19

Family

ID=74920171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910782981.2A Active CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Country Status (1)

Country Link
CN (1) CN112489622B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870839B (en) * 2021-09-29 2022-05-03 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114582329A (en) * 2022-03-03 2022-06-03 北京有竹居网络技术有限公司 Voice recognition method and device, computer readable medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043431B2 (en) * 2001-08-31 2006-05-09 Nokia Corporation Multilingual speech recognition system using text derived recognition models
US10593321B2 (en) * 2017-12-15 2020-03-17 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for multi-lingual end-to-end speech recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Thomas Niesler et al., "Language identification and multilingual speech recognition using discriminatively trained acoustic models", Computer Science, Linguistics, 2006, pp. 1-6 *
Aanchan Mohan et al., "Multi-lingual speech recognition with low-rank multi-task deep neural networks", ICASSP 2015, 2015, pp. 4994-4998 *
Yao Haitao et al., "Research on acoustic modeling methods for multilingual speech recognition" (面向多语言的语音识别声学模型建模方法研究), Technical Acoustics (声学技术), 2015, Vol. 34, No. 6, pp. 404-407 *

Also Published As

Publication number Publication date
CN112489622A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112489622B (en) Multi-language continuous voice stream voice content recognition method and system
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN112528676B (en) Document-level event argument extraction method
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN110968660B (en) Information extraction method and system based on joint training model
Fonseca et al. A two-step convolutional neural network approach for semantic role labeling
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
Li et al. Text-to-text generative adversarial networks
CN111651998B (en) Weak supervision deep learning semantic analysis method under virtual reality and augmented reality scenes
CN111274804A (en) Case information extraction method based on named entity recognition
CN111581970B (en) Text recognition method, device and storage medium for network context
CN115292463B (en) Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN111476024A (en) Text word segmentation method and device and model training method
CN113204952A (en) Multi-intention and semantic slot joint identification method based on clustering pre-analysis
CN114860942B (en) Text intention classification method, device, equipment and storage medium
CN111435375A (en) Threat information automatic labeling method based on FastText
CN115168541A (en) Chapter event extraction method and system based on frame semantic mapping and type perception
CN114743143A (en) Video description generation method based on multi-concept knowledge mining and storage medium
CN112507124A (en) Chapter-level event causal relationship extraction method based on graph model
CN113239694B (en) Argument role identification method based on argument phrase
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN111444720A (en) Named entity recognition method for English text
Lin et al. Ctc network with statistical language modeling for action sequence recognition in videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant