CN112489622A - Method and system for recognizing voice content of multi-language continuous voice stream - Google Patents
- Publication number: CN112489622A
- Application number: CN201910782981.2A
- Authority
- CN
- China
- Prior art keywords
- language
- segment
- level
- state
- level language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/005 — Speech recognition; Language recognition
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/26 — Speech to text systems
- G10L25/18 — Speech or voice analysis techniques; the extracted parameters being spectral information of each sub-band
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a method and a system for recognizing the voice content of a multi-language continuous voice stream, wherein the method comprises the following steps: inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors; inputting the segment-level language feature vectors into a segment-level language classification model, and outputting the posterior probability distribution of the segment-level language states; calculating the optimal language state path of the multi-language continuous voice stream with a Viterbi search algorithm according to the posterior probability distribution of the segment-level language states; segmenting the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain language state intervals; and sending the segmented language state intervals into a multi-language acoustic model and the corresponding multi-language decoders for decoding to obtain the content recognition result of the multi-language continuous voice stream. By fusing the language classification models with the Viterbi search algorithm, the invention solves the problem of dynamically detecting and recognizing the language categories of multi-language content coexisting in a continuous voice stream.
Description
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and a system for recognizing speech contents of a continuous speech stream in multiple languages.
Background
With the application of hidden Markov models, deep neural networks and other technologies in the field of automatic speech recognition, automatic speech recognition technology has developed rapidly. For Chinese, English and other languages with large numbers of speakers, the performance of the corresponding single-language speech recognition system can even reach the recognition level of human beings. As economic and cultural exchange among the countries of the world accelerates, building a mixed multi-language speech recognition system has become a necessary condition for the content detection of multi-language voice streams.
The traditional multi-language speech recognition system consists of a language identification front end connected in series with several parallel single-language speech recognition back ends. Typically, the language identification front end classifies the language of an utterance at sentence level, from the speech features of the whole utterance. In the multi-language recognition task over a multi-language continuous voice stream, such sentence-level language classification cannot cope with the task of classifying the languages that coexist within one voice stream.
Disclosure of Invention
The invention aims to solve the problem that a sentence-level language classification method cannot cope with a language classification task in which multiple languages coexist within one voice stream.
In order to achieve the above object, the present invention provides a method for recognizing voice contents of a multi-language continuous voice stream, comprising:
inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment-level language feature vector into a segment-level language classification model, and outputting posterior probability distribution of the segment-level language state;
calculating the optimal language state path of the multilingual continuous voice stream based on a Viterbi retrieval algorithm according to the posterior probability distribution of the segment-level language state;
segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval;
and sending the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language continuous voice stream.
As an improvement of the method, the method further comprises a training step of the multilingual acoustic model, and the specific steps are as follows:
step 1-1) constructing a multi-language acoustic model based on a multi-task learning framework, wherein the model comprises a plurality of shared hidden layers and language specific output layers;
step 1-2) extracting the frequency spectrum characteristics of the multi-language continuous voice stream of the training set based on the acoustic state label of the multi-language voice data, and inputting the frequency spectrum characteristics into a shared hidden layer for nonlinear transformation; outputting data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating an error loss function value from the data of the single language at the language-specific output layer corresponding to the input spectral feature, wherein the error loss function is the cross entropy:

$$F_{loss,i} = -\sum q_{label,L} \log p_{model,i}(x_L) \tag{1}$$

wherein $F_{loss,i}$ is the error loss value of the $i$-th language-specific output layer, $p_{model,i}(x_L)$ is the output of the $L$-th language-specific output layer for the spectral feature $x_L$ of the $L$-th language, and $q_{label,L}$ is the acoustic state label corresponding to the spectral feature $x_L$; the error loss function values of the other output layers are zero;
step 1-4) feeding the error loss value $F_{loss,i}$ back through the network; the parameters of each language-specific output layer are updated with the data of the corresponding single language, and the gradient $\Delta\Phi_i$ of the language-specific output layer parameters is:

$$\Delta\Phi_i = \frac{\partial F_{loss,i}}{\partial \Phi_i} \tag{2}$$

wherein $\Phi_i$ is the parameter of the $i$-th language-specific output layer;

the parameters of the shared hidden layer are computed from the error loss values $F_{loss,i}$ returned by the several language-specific output layers, giving the gradient $\Delta\Phi$ of the shared hidden layer parameters:

$$\Delta\Phi = \sum_{i=1}^{L} \frac{\partial F_{loss,i}}{\partial \Phi} \tag{3}$$

wherein $\Phi$ is the parameter of the shared hidden layer, and $L$ is the number of language categories corresponding to the language-specific output layers of the multi-language acoustic model;
step 1-5) when $F_{loss,i}$ is greater than the given threshold, return to step 1-2); when $F_{loss,i}$ is less than the given threshold, the trained multi-language acoustic model is obtained.
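The multi-task update of steps 1-2) to 1-5) can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: the tiny layer sizes, the tanh shared layer, the softmax output, and the learning rate are all illustrative assumptions. The key property it demonstrates is that the loss is computed only at the output layer matching the input's language, while the shared hidden layer receives gradients from every language.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HID, STATES, N_LANG = 40, 32, 10, 2          # illustrative sizes

W_shared = rng.normal(0, 0.1, (HID, FEAT))         # shared hidden layer
W_out = [rng.normal(0, 0.1, (STATES, HID)) for _ in range(N_LANG)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(x, state_label, lang, lr=0.05):
    """One update: cross-entropy is computed only at the output layer of
    `lang`; the other language-specific output layers get zero loss/gradient."""
    h = np.tanh(W_shared @ x)                      # shared nonlinear transform
    p = softmax(W_out[lang] @ h)                   # language-specific output
    loss = -np.log(p[state_label])                 # F_loss,i for this language
    dz = p.copy(); dz[state_label] -= 1.0          # d loss / d logits
    dh = W_out[lang].T @ dz * (1 - h ** 2)         # backprop through tanh
    W_out[lang] -= lr * np.outer(dz, h)            # only this language's layer
    W_shared[...] -= lr * np.outer(dh, x)          # shared layer sees all languages
    return loss

x = rng.normal(size=FEAT)
losses = [train_step(x, state_label=3, lang=0) for _ in range(50)]
assert losses[-1] < losses[0]                      # loss decreases over updates
```

In a full system each mini-batch would mix data from several languages, so the shared layer accumulates the summed gradient of equation (3) across languages.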
As an improvement of the method, the method further comprises a training step of a frame-level language classification model, and the specific steps are as follows:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, inputting the frame-level frequency spectrum characteristics into a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector is:

$$\mu = \frac{1}{T}\sum_{i=1}^{T} h_i \tag{4}$$

the variance vector is:

$$\sigma = \sqrt{\frac{1}{T}\sum_{i=1}^{T}\left(h_i - \mu\right)^2} \tag{5}$$

the segment-level language feature vector is:

$$h_{segment} = \mathrm{Append}(\mu, \sigma) \tag{6}$$

wherein $h_i$ is the output vector of the current hidden layer at time $i$, $T$ is the long-term statistics period, $\mu$ is the long-term mean vector, and $\sigma$ is the long-term variance vector; $h_{segment}$ is the segment-level language feature vector, formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of $h_i$; $\mathrm{Append}(\mu,\sigma)$ denotes concatenating $\mu$ and $\sigma$ into one higher-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
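The long-term statistics of equations (4)-(6) amount to statistics pooling over one period. A minimal numpy sketch, with random stand-ins for the hidden-layer outputs (the period and hidden size are illustrative):

```python
import numpy as np

def segment_pooling(h):
    """h: (T, D) hidden-layer outputs over one statistics period T.
    Returns h_segment = Append(mu, sigma) with dimension 2*D (eq. 4-6)."""
    mu = h.mean(axis=0)                    # eq. (4): long-term mean vector
    sigma = h.std(axis=0)                  # eq. (5): long-term variance vector
    return np.concatenate([mu, sigma])     # eq. (6): Append(mu, sigma)

T, D = 100, 64                             # illustrative period and hidden size
h = np.random.default_rng(1).normal(size=(T, D))
h_segment = segment_pooling(h)
assert h_segment.shape == (2 * D,)         # twice the dimension of h_i
```

The pooled vector is what the next hidden layer consumes, which is how a frame-level network comes to emit one segment-level language feature vector per period.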
As an improvement of the method, the method further comprises a training step of a segment-level language classification model, and the specific steps are as follows:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum feature of the multilingual continuous voice stream of the training set, inputting the frame level spectrum feature into the hidden layer of the trained frame level language classification model, and extracting the segment level language feature vector from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
As an improvement of the method, the multilingual continuous speech stream to be recognized is input into a frame-level language classification model, and segment-level language feature vectors are output; inputting the segment-level language feature vector into a segment-level language classification model to output the posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting the frequency spectrum characteristics of the frame level to be identified from the continuous voice stream of the multi-language to be identified;
inputting the frame-level spectral features to be recognized into the trained frame-level language classification model with a specific step length and window length, and outputting the segment-level language feature vector $h_{segment}$;

inputting the segment-level language feature vector $h_{segment}$ into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
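The windowed extraction above can be sketched as a sliding window over the frame-level features. The window and step lengths (in frames) below are illustrative assumptions, and the frame-level model is stubbed by mean/std pooling; in the patent's pipeline each window would pass through the trained frame-level classifier's hidden layers instead.

```python
import numpy as np

def sliding_segments(frames, window=100, step=50):
    """Slice a stream of frame-level spectral features (T, D) into
    overlapping windows and pool each window into one segment-level
    vector (here a mean/std stub for the frame-level model)."""
    segs = []
    for start in range(0, len(frames) - window + 1, step):
        chunk = frames[start:start + window]
        segs.append(np.concatenate([chunk.mean(0), chunk.std(0)]))
    return np.stack(segs)

frames = np.random.default_rng(2).normal(size=(400, 40))   # 400 frames, 40-dim features
h_segments = sliding_segments(frames)
assert h_segments.shape == (7, 80)    # (400-100)/50 + 1 = 7 windows, 2*40 dims
```

Each row of `h_segments` is then scored by the segment-level classifier to produce one posterior distribution per window.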
As an improvement of the method, based on the viterbi search algorithm, calculating the optimal language state path of the multilingual continuous speech stream according to the posterior probability distribution of the language state specifically includes:
step 3-1) setting, according to the posterior probability distribution of the language state, the self-loop probability $p_{loop}$ and the skip probability $p_{skip}$ of the language states for the Viterbi search, and obtaining the transition matrix $A$ of the language states:

$$A = \begin{pmatrix} p_{loop} & p_{skip} & \cdots & p_{skip} \\ p_{skip} & p_{loop} & \cdots & p_{skip} \\ \vdots & \vdots & \ddots & \vdots \\ p_{skip} & p_{skip} & \cdots & p_{loop} \end{pmatrix} \tag{7}$$

wherein $p_{loop}$ denotes the self-loop probability of a language state and $p_{skip}$ the skip probability between language states; the self-loop and skip probability values are the same for every language; language state labels are set according to language category, each label marking a different language category with the Arabic numerals $1, 2, \ldots, N$, where $N$ is the number of language states; each element of the transition matrix $A$ corresponds to a pair of language state labels:

$$A_{mn} = \begin{cases} p_{loop}, & m = n \\ p_{skip}, & m \neq n \end{cases} \tag{8}$$
step 3-2) performing Viterbi search on the predicted language state sequence, and calculating the objective function of the Viterbi search:

$$S^* = \arg\max_{s_1,\ldots,s_T} \prod_{T} p_{trans}(s_{T+1}\mid s_T)\, p_{emit}(s_{T+1}\mid h_{segment}) \tag{9}$$

wherein $p_{trans}(s_{T+1}\mid s_T)$ denotes the transition probability of the multi-language continuous voice stream from the language state $s_T$ at time $T$ to the language state $s_{T+1}$ at time $T+1$:

$$p_{trans}(s_{T+1}\mid s_T) = A_{s_T,\,s_{T+1}} \tag{10}$$

wherein the language classification labels corresponding to the language states $s_T$ and $s_{T+1}$ lie within the range of the annotated language classification labels, and $T$ is the statistics period corresponding to the segment-level language feature $h_{segment}$;

$p_{emit}(s_{T+1}\mid h_{segment})$ denotes the posterior probability of the segment-level language feature $h_{segment}$ predicted in language state $s_{T+1}$:

$$p_{emit}(s_{T+1}\mid h_{segment}) = \mathrm{DNN\text{-}LID}_{segment\ level}(h_{segment}) \tag{11}$$

wherein DNN-LID is a segment-level language classifier based on a deep neural network (DNN);

step 3-3) the language state sequence that maximizes the objective function is the optimal language state sequence; language state backtracking is performed on the optimal language state sequence to obtain the optimal language state path.
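The Viterbi search of steps 3-1) to 3-3) can be sketched as follows. The value of $p_{loop}$ and the toy posteriors are illustrative assumptions; the point is that a high self-loop probability smooths away an isolated, weakly classified segment inside a run of one language.

```python
import numpy as np

def viterbi_language_path(log_post, p_loop=0.9):
    """log_post: (T, N) log posteriors p_emit of N language states for T
    segments.  A has p_loop on the diagonal and the remaining mass spread
    evenly over skips (eq. 7-8); returns the optimal language state path."""
    T, N = log_post.shape
    p_skip = (1.0 - p_loop) / (N - 1)
    logA = np.log(np.full((N, N), p_skip) + np.eye(N) * (p_loop - p_skip))
    delta = log_post[0].copy()                  # best score ending in each state
    back = np.zeros((T, N), dtype=int)          # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    path = [int(delta.argmax())]                # backtracking (step 3-3)
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# toy posteriors: a run of language 0 with one noisy segment, then language 1
post = np.array([[.9, .1], [.8, .2], [.4, .6], [.9, .1], [.2, .8], [.1, .9]])
path = viterbi_language_path(np.log(post))
assert path == [0, 0, 0, 0, 1, 1]   # the noisy third segment is smoothed to 0
```

A frame-by-frame argmax over the same posteriors would have flipped the third segment to language 1; the self-loop penalty in the transition matrix is what prevents such spurious switches.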
The invention also proposes a system for recognizing the speech content of a continuous speech stream in multiple languages, said system comprising:
the segment-level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame-level language classification model and outputting segment-level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language state;
a language state path obtaining module, which is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the segment level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of the above items when executing the computer program.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the method of any of the above.
Compared with the prior art, the invention has the advantages that:
1. the method and the system for recognizing the voice content of the continuous voice stream with the multiple languages can solve the problem of dynamic detection of the language types with the coexistence of the multiple language contents in the continuous voice stream by fusing the language classification model with the Viterbi retrieval algorithm.
2. The method for recognizing the voice content of the multi-language continuous voice stream can perform dynamic language switching point judgment and corresponding multi-language content recognition on the multi-language content in the continuous voice stream.
Drawings
FIG. 1 is a diagram illustrating a method for recognizing speech contents of a multilingual continuous speech stream according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
The invention provides a method and a system for recognizing the voice content of a multi-language continuous voice stream, wherein the method comprises the following steps:
step 1) constructing a multi-language acoustic model based on multi-task learning; the acoustic model uniformly constructs acoustic modeling tasks of multiple languages under a neural network classification framework based on multi-task learning, and simultaneously performs joint optimization on the acoustic models of the multiple languages by using acoustic characteristics of the multiple languages; the method specifically comprises the following steps:
step 1-1) constructing a multi-language acoustic model of a neural network classification framework based on multi-task learning, wherein the model is composed of a plurality of shared hidden layers and language specific output layers; wherein the model parameters of the shared hidden layer are jointly optimized by multi-language data; the language specific output layer is optimized by data of each single language;
step 1-2) in the forward calculation process of the model, the shared hidden layer and the language specific output layer of the multilingual acoustic model perform nonlinear transformation on input multilingual frequency spectrum characteristic vectors, and all language specific output layers output information;
step 1-3) in the error loss calculation of the model update, according to the acoustic state label corresponding to the spectral feature, the error loss function value is calculated only at the language-specific output layer corresponding to the spectral feature, and the error loss values of the other language-specific output layers not corresponding to the language of the spectral feature are set to zero; the corresponding loss function is:

$$F_{loss,i} = -\sum q_{label,L} \log p_{model,i}(x_L) \tag{1}$$

wherein $F_{loss,i}$ is the error loss function value of the $i$-th language-specific output layer, $p_{model,i}(x_L)$ is the acoustic model output of the $L$-th language-specific output layer for the spectral feature $x_L$ of the $L$-th language, and $q_{label,L}$ is the acoustic state label corresponding to the spectral feature $x_L$;
step 1-4) in the backward feedback of the model classification error, the error loss value $F_{loss,i}$ is propagated back; each language-specific output layer is trained on the data of the corresponding single language, while the parameters of the shared hidden layer are computed from the error loss values $F_{loss,i}$ returned by the several language-specific output layers;

the gradient of the language-specific output layer parameters is:

$$\Delta\Phi_i = \frac{\partial F_{loss,i}}{\partial \Phi_i} \tag{2}$$

wherein $\Phi_i$ is the parameter of the $i$-th language-specific output layer.

The gradient of the shared hidden layer parameters is:

$$\Delta\Phi = \sum_{i=1}^{L} \frac{\partial F_{loss,i}}{\partial \Phi} \tag{3}$$

where $\Phi$ is the parameter of the shared hidden layer, and $L$ is the number of language categories corresponding to the language-specific output layers of the multi-language acoustic model.
Step 1-5) repeatedly executing the step 1-2) -the step 1-4) until the model parameters are converged.
Step 2) constructing, on a deep neural network model, a frame-level language classification model that fuses long-term statistical features, and extracting language feature vectors characterizing the language category from this frame-level model. During the forward pass of the frame-level model, the long-term statistics component performs segment-level statistics on the output vectors of the previous hidden layer, computing their mean and variance statistics; the vector of mean and variance statistics is then used as the input of the next hidden layer. Finally, the model is updated through the error calculation and backward gradient propagation of the language classification model according to the frame-level language labels.
the specific steps of training the frame-level language classification model comprise:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, taking the frame-level frequency spectrum characteristics as input characteristics to input a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector is:

$$\mu = \frac{1}{T}\sum_{i=1}^{T} h_i \tag{4}$$

the variance vector is:

$$\sigma = \sqrt{\frac{1}{T}\sum_{i=1}^{T}\left(h_i - \mu\right)^2} \tag{5}$$

the segment-level language feature vector is:

$$h_{segment} = \mathrm{Append}(\mu, \sigma) \tag{6}$$

wherein $h_i$ is the output vector of the current hidden layer at time $i$, $T$ is the long-term statistics period, $\mu$ is the long-term mean vector, and $\sigma$ is the long-term variance vector; $h_{segment}$ is the segment-level language feature vector, formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of $h_i$; $\mathrm{Append}(\mu,\sigma)$ denotes concatenating $\mu$ and $\sigma$ into one higher-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
Based on a trained frame-level language classification model, extracting segment-level language feature vectors from a hidden layer of the frame-level language classification model, constructing segment-level language labels for each segment-level language feature vector, and training the segment-level language classification model according to the segment-level language feature vectors and the segment-level language labels. The method specifically comprises the following steps:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum characteristics of the multilingual continuous voice stream of the training set, inputting the hidden layer of the trained frame level language classification model by taking the frame level spectrum characteristics as input characteristics, and extracting segment level language characteristic vectors from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
Step 3) extracting segment-level language feature vectors by utilizing a trained frame-level language classification model for the voice of the continuous voice stream of the multi-language to be recognized, carrying out language classification on the segment-level language feature vectors according to the segment-level language classification model, and carrying out real-time detection on language switching points of the continuous voice stream of the multi-language by combining a Viterbi retrieval algorithm; and finally, according to the language detection result, segmenting the continuous voice stream and identifying the content of the multi-language voice stream through a multi-language acoustic model and a corresponding decoder. The method comprises the following specific steps:
step 3-1) extracting segment-level language feature vectors from the frame-level language classification model according to specific step length and window length by using the frequency spectrum features of the voice of the multi-language continuous voice stream to be recognized;
classifying the segment-level language feature vectors through a segment-level language classification model to obtain posterior probability distribution of the language states corresponding to the segment-level language feature vectors;
setting the self-loop probability and the skip probability of the language states for the Viterbi search; raising the self-loop probability reduces the language classification errors caused by inaccurate classification by the segment-level language classification model; the method comprises:
based on the posterior probability distribution of the language state, the self-loop probability and the skip probability of the language states for the Viterbi search are set, and the transition matrix $A$ of the language states is obtained:

$$A = \begin{pmatrix} p_{loop} & p_{skip} & \cdots & p_{skip} \\ p_{skip} & p_{loop} & \cdots & p_{skip} \\ \vdots & \vdots & \ddots & \vdots \\ p_{skip} & p_{skip} & \cdots & p_{loop} \end{pmatrix} \tag{7}$$

wherein $p_{loop}$ denotes the self-loop probability of a language state and $p_{skip}$ the skip probability between language states; the self-loop and skip probability values are the same for every language; language state labels are set according to language category, each label marking a different language category with the Arabic numerals $1, 2, \ldots, N$, where $N$ is the number of language states; each element of the transition matrix $A$ corresponds to a pair of language state labels:

$$A_{mn} = \begin{cases} p_{loop}, & m = n \\ p_{skip}, & m \neq n \end{cases} \tag{8}$$
step 3-2) calculating the posterior probability p of the predicted segment level language stateemit(sT+1|hsegment) According to the autorotation probability p of the preset language stateloopAnd a probability of hopping pskipPerforming viterbi search on the predicted language state, specifically including:
calculating the optimal language state sequence of the continuous voice stream based on the target function of the Viterbi retrieval, wherein the target function is as follows:
wherein p istrans(sT+1|sT) Language state s representing multilingual continuous speech stream from time TTLanguage state s by the time T +1T+1Transition probability of (2):
wherein, language state sTAnd language state sT+1Corresponding language classification label is in the labeled language classification label range, T is the segment level language characteristic hsegmentA corresponding statistical period;
p_emit(s_{T+1}|h_segment) denotes the posterior probability of language state s_{T+1} predicted from the segment-level language feature h_segment:
p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)
wherein DNN-LID is the segment-level language classifier based on a deep neural network (DNN);
and 3-3) searching for the optimal language states by applying the recursive formula above to the posterior probabilities of the segment-level language states predicted by the segment-level language classification model, together with the preset self-loop and skip probabilities of the language states; the sequence with the maximum objective-function value is the optimal language-state sequence of the multilingual continuous speech stream, and the optimal language-state path is obtained by backtracking through this sequence.
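The recursion and backtracking of steps 3-2) and 3-3) can be sketched as a standard Viterbi decode in log space. This is a minimal illustration under assumed names (the emission array stands in for the segment-level classifier's posteriors), not the patented implementation:

```python
import numpy as np

def viterbi_language_path(emissions: np.ndarray, A: np.ndarray) -> list:
    """emissions: (T, N) segment-level posteriors; A: (N, N) transition matrix.
    Returns the most likely language-state index for each segment."""
    T, N = emissions.shape
    log_e = np.log(emissions + 1e-12)
    log_A = np.log(A + 1e-12)
    score = log_e[0].copy()                  # flat prior over initial states
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_A        # cand[i, j]: from state i to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_e[t]
    path = [int(score.argmax())]             # best final state
    for t in range(T - 1, 0, -1):            # backtrack through pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

emissions = np.array([[0.90, 0.05, 0.05],
                      [0.80, 0.10, 0.10],
                      [0.10, 0.80, 0.10],
                      [0.05, 0.90, 0.05]])
A = np.full((3, 3), 0.05)
np.fill_diagonal(A, 0.9)
path = viterbi_language_path(emissions, A)
```

With a high self-loop probability the search keeps the language state stable until the posteriors clearly favor a switch, which is exactly the smoothing effect the text attributes to the self-loop setting.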
And 4) segmenting the multilingual speech stream into language-state intervals according to the optimal language-state path, and feeding each segmented language-state interval into the multilingual acoustic model and the corresponding language decoder for decoding, obtaining the content recognition result of the multilingual continuous speech stream.
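The segmentation in step 4) amounts to collapsing the per-segment state path into contiguous runs. A hypothetical helper (names are illustrative, not from the patent) might look like:

```python
def path_to_intervals(path):
    """Collapse a per-segment language path into (language, start, end)
    tuples, with end exclusive; each run would then be routed to the
    decoder of the matching language."""
    intervals = []
    start = 0
    for i in range(1, len(path) + 1):
        # Close the current run when the language changes or the path ends.
        if i == len(path) or path[i] != path[start]:
            intervals.append((path[start], start, i))
            start = i
    return intervals

intervals = path_to_intervals([0, 0, 1, 1, 1, 2])
```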
The invention also proposes a system for recognizing the speech content of a continuous speech stream in multiple languages, said system comprising:
the segment-level language feature extraction module, which feeds the multilingual continuous speech stream to be recognized into the frame-level language classification model and outputs segment-level language feature vectors;
the language-state posterior probability calculation module, which inputs the segment-level language feature vectors into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language states;
a language state path obtaining module, which is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the segment level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of the above items when executing the computer program.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the method of any of the above.
The rationality and validity of the speech recognition system based on the invention have been verified in real systems; the results are shown in Table 1:
TABLE 1
The method of the invention performs joint multilingual acoustic model training using Cantonese, Turkish, and Vietnamese data, constructs frame-level and segment-level language classification models over the three languages, and performs language classification and speech content recognition on continuous multilingual speech using the Viterbi-based method for recognizing the speech content of multilingual continuous speech streams. As Table 1 shows, the method improves the language identification accuracy from 82.1% to 92.4%, verifying that the Viterbi-based method can effectively improve language detection in continuous multilingual speech streams.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (9)
1. A method for recognizing the speech content of a multilingual continuous speech stream, the method comprising:
inputting a multilingual continuous speech stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment-level language feature vector into a segment-level language classification model, and outputting posterior probability distribution of the segment-level language state;
calculating the optimal language state path of the multilingual continuous voice stream based on a Viterbi retrieval algorithm according to the posterior probability distribution of the segment-level language state;
segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval;
and inputting the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language voice stream.
2. The method for recognizing the speech content of a multilingual continuous speech stream according to claim 1, further comprising a step of training a multilingual acoustic model, comprising the steps of:
step 1-1) constructing a multi-language acoustic model based on a multitask learning neural network, wherein the model comprises a plurality of shared hidden layers and language specific output layers;
step 1-2) extracting the frequency spectrum characteristics of the multi-language continuous voice stream of the training set based on the acoustic state label of the multi-language continuous voice data, and inputting the frequency spectrum characteristics into a shared hidden layer for nonlinear transformation; outputting data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating the error loss function value of the single-language data at the language-specific output layer corresponding to the input spectral features;

the error loss function F_loss,i is the cross-entropy:

F_loss,i = − Σ q_label,L · log p_model,i(x_L)

wherein F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output at the L-th language-specific output layer corresponding to the spectral feature x_L of the L-th language, and q_label,L is the acoustic-state label corresponding to x_L; the error loss function values of the other output layers are zero;
step 1-4) back-propagating the error loss value F_loss,i; each language-specific output layer updates its parameters according to the data of its corresponding single language, computing the language-specific output-layer parameter gradient ΔΦ_i:

ΔΦ_i = ∂F_loss,i / ∂Φ_i

wherein Φ_i are the parameters of the i-th language-specific output layer;

the parameters of the shared hidden layers are updated from the error loss values F_loss,i returned by the several language-specific output layers, computing the gradient ΔΦ of the shared hidden-layer parameters:

ΔΦ = Σ_{i=1}^{L} ∂F_loss,i / ∂Φ

wherein Φ are the parameters of the shared hidden layers and L is the number of language categories corresponding to the language-specific output layers of the multilingual acoustic model;
step 1-5) when F_loss,i is greater than a given threshold, returning to step 1-2); when F_loss,i is less than the given threshold, the trained multilingual acoustic model is obtained.
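The per-language loss masking in steps 1-3) and 1-4) can be illustrated with a small numpy sketch. This is an assumption-based illustration (function names and the explicit cross-entropy form are mine, not the patent's code): a batch from language i produces a loss only at output layer i, so only that layer and the shared layers would receive gradient from the batch.

```python
import numpy as np

def multitask_losses(logits_per_layer, labels, lang_id):
    """One loss per language-specific output layer: cross-entropy at the
    layer matching the batch's language, zero at every other layer."""
    losses = []
    for i, logits in enumerate(logits_per_layer):
        if i != lang_id:
            losses.append(0.0)           # F_loss is zero for other languages
            continue
        z = logits - logits.max(axis=1, keepdims=True)      # stable softmax
        log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        losses.append(float(-log_p[np.arange(len(labels)), labels].mean()))
    return losses

# A batch belonging to the second language (index 1) touches only layer 1.
logits_a = np.array([[2.0, 0.0], [0.0, 2.0]])
logits_b = np.array([[1.0, 0.0], [0.0, 1.0]])
losses = multitask_losses([logits_a, logits_b], np.array([0, 1]), lang_id=1)
```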
3. The method according to claim 1, further comprising a step of training a frame-level language classification model, comprising the steps of:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, inputting the frame-level frequency spectrum characteristics into a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector μ is:

μ = (1/T) Σ_{i=1}^{T} h_i   (4)

the variance vector σ is:

σ = (1/T) Σ_{i=1}^{T} (h_i − μ)²   (computed element-wise)   (5)

the segment-level language feature vector h_segment is:

h_segment = Append(μ, σ)   (6)

wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistical period, μ is the long-term mean vector, σ is the long-term variance vector, and h_segment is the segment-level language feature vector, formed by splicing the mean vector and the variance vector together; the dimension of the segment-level language feature vector is twice the dimension of h_i, and Append(μ, σ) denotes concatenating μ and σ into a higher-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and backward gradient propagation according to the frame-level language labels, so that each hidden layer outputs segment-level language feature vectors, obtaining the trained frame-level language classification model.
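The long-term statistics pooling of step 2-2) reduces to a mean and a variance over the window, concatenated into a vector of twice the hidden dimension. A minimal sketch (names assumed, not from the patent):

```python
import numpy as np

def segment_pooling(hidden):
    """hidden: (T, D) frame-level hidden outputs -> (2*D,) segment vector."""
    mu = hidden.mean(axis=0)               # long-term mean vector
    sigma = hidden.var(axis=0)             # element-wise variance vector
    return np.concatenate([mu, sigma])     # h_segment = Append(mu, sigma)

h_segment = segment_pooling(np.array([[1.0, 2.0], [3.0, 4.0]]))
```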
4. The method for recognizing the speech content of a multilingual continuous speech stream according to claim 1, further comprising the step of training a segment-level language classification model, comprising the steps of:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum feature of the multilingual continuous voice stream of the training set, inputting the frame level spectrum feature into the hidden layer of the trained frame level language classification model, and extracting the segment level language feature vector from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
5. The method according to claim 1, wherein inputting the multilingual continuous speech stream into the frame-level language classification model to output segment-level language feature vectors, and inputting the segment-level language feature vectors into the segment-level language classification model to output the posterior probability distribution of the language states, specifically comprises:
extracting the frequency spectrum characteristics of the frame level to be identified from the continuous voice stream of the multi-language to be identified;
inputting the frame-level spectral features to be recognized into the trained frame-level language classification model with a specific step length and window length, and outputting segment-level language feature vectors h_segment;
inputting the segment-level language feature vectors h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language states corresponding to the segment-level language feature vectors.
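The windowed inference of claim 5 simply slides a fixed window over the frame-level features with a fixed step. A hypothetical sketch (window and step values are illustrative assumptions):

```python
import numpy as np

def sliding_windows(frames, window, step):
    """frames: (T, D) frame-level spectral features; return the list of
    (window, D) chunks, one per classifier invocation."""
    return [frames[s:s + window]
            for s in range(0, len(frames) - window + 1, step)]

frames = np.arange(20.0).reshape(10, 2)    # 10 frames of 2-dim features
windows = sliding_windows(frames, window=4, step=2)
```

Each chunk would be fed to the frame-level model, whose pooled hidden statistics yield one segment-level vector h_segment per window.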
6. The method for recognizing speech content of continuous speech stream in multiple languages according to claim 5, wherein the step of calculating the optimal language state path of the continuous speech stream in multiple languages based on viterbi search algorithm according to the posterior probability distribution of language state comprises:
step 3-1) setting the self-loop probability p_loop and the skip probability p_skip of the language states for the Viterbi search according to the posterior probability distribution of the language states, obtaining the language-state transition matrix A:

A(i, j) = p_loop if i = j; A(i, j) = p_skip if i ≠ j; i, j ∈ {1, 2, ..., N}

the self-loop probability and the skip probability take the same values for every language; language-state labels are set according to the language categories, one label per category, using the Arabic numerals 1, 2, ..., N; each element of the transition matrix A is indexed by these language-state labels as above;
step 3-2) performing the Viterbi search over the predicted language states, and computing the objective function of the Viterbi search as the recursion:

F(s_{T+1}) = max over s_T of [ F(s_T) · p_trans(s_{T+1}|s_T) · p_emit(s_{T+1}|h_segment) ]

wherein p_trans(s_{T+1}|s_T) denotes the transition probability of the multilingual continuous speech stream from language state s_T at time T to language state s_{T+1} at time T+1:

p_trans(s_{T+1}|s_T) = p_loop if s_{T+1} = s_T, and p_trans(s_{T+1}|s_T) = p_skip otherwise;

wherein the language states s_T and s_{T+1} correspond to labels within the annotated set of language classification labels, and T is the statistical period corresponding to the segment-level language feature h_segment;

p_emit(s_{T+1}|h_segment) denotes the posterior probability of language state s_{T+1} predicted from the segment-level language feature h_segment:

p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)

wherein DNN-LID is the segment-level language classifier based on a deep neural network (DNN).
7. A system for recognizing the speech content of a multilingual continuous speech stream, the system comprising:
the segment-level language feature extraction module, which feeds the multilingual continuous speech stream to be recognized into the frame-level language classification model and outputs segment-level language feature vectors;
the language-state posterior probability calculation module, which inputs the segment-level language feature vectors into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language states;
the language state path acquisition module is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the section level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-6 when executing the computer program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910782981.2A CN112489622B (en) | 2019-08-23 | 2019-08-23 | Multi-language continuous voice stream voice content recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112489622A true CN112489622A (en) | 2021-03-12 |
CN112489622B CN112489622B (en) | 2024-03-19 |
Family
ID=74920171
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910782981.2A Active CN112489622B (en) | 2019-08-23 | 2019-08-23 | Multi-language continuous voice stream voice content recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112489622B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030050779A1 (en) * | 2001-08-31 | 2003-03-13 | Soren Riis | Method and system for speech recognition |
US9460711B1 (en) * | 2013-04-15 | 2016-10-04 | Google Inc. | Multilingual, acoustic deep neural networks |
CN104036774A (en) * | 2014-06-20 | 2014-09-10 | 国家计算机网络与信息安全管理中心 | Method and system for recognizing Tibetan dialects |
US20190189111A1 (en) * | 2017-12-15 | 2019-06-20 | Mitsubishi Electric Research Laboratories, Inc. | Method and Apparatus for Multi-Lingual End-to-End Speech Recognition |
CN108492820A (en) * | 2018-03-20 | 2018-09-04 | 华南理工大学 | Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model |
Non-Patent Citations (3)
Title |
---|
AANCHAN MOHAN ET AL.: "《Multi-lingual speech recognition with low-rank multi-task deep neural networks》", 《ICASSP 2015》, 6 August 2015 (2015-08-06), pages 4994 - 4998 * |
THOMAS NIESLER ET AL.: "《Language identification and multilingual speech recognition using discriminatively trained acoustic models》", 《COMPUTER SCIENCE, LINGUISTICS》, 31 December 2006 (2006-12-31), pages 1 - 6 * |
姚海涛等: "《面向多语言的语音识别声学模型建模方法研究》", 《声学技术》, vol. 34, no. 6, 31 December 2015 (2015-12-31), pages 404 - 407 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113870839A (en) * | 2021-09-29 | 2021-12-31 | 北京中科智加科技有限公司 | Language identification device of language identification model based on multitask |
CN113870839B (en) * | 2021-09-29 | 2022-05-03 | 北京中科智加科技有限公司 | Language identification device of language identification model based on multitask |
CN114078468A (en) * | 2022-01-19 | 2022-02-22 | 广州小鹏汽车科技有限公司 | Voice multi-language recognition method, device, terminal and storage medium |
CN114078468B (en) * | 2022-01-19 | 2022-05-13 | 广州小鹏汽车科技有限公司 | Voice multi-language recognition method, device, terminal and storage medium |
WO2023165538A1 (en) * | 2022-03-03 | 2023-09-07 | 北京有竹居网络技术有限公司 | Speech recognition method and apparatus, and computer-readable medium and electronic device |
CN115831094A (en) * | 2022-11-08 | 2023-03-21 | 北京数美时代科技有限公司 | Multilingual voice recognition method, system, storage medium and electronic equipment |
CN115831094B (en) * | 2022-11-08 | 2023-08-15 | 北京数美时代科技有限公司 | Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN112489622B (en) | 2024-03-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||