CN112489622A - Method and system for recognizing voice content of multi-language continuous voice stream - Google Patents

Method and system for recognizing voice content of multi-language continuous voice stream

Info

Publication number
CN112489622A
CN112489622A (application CN201910782981.2A; granted as CN112489622B)
Authority
CN
China
Prior art keywords
language
segment
level
state
level language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910782981.2A
Other languages
Chinese (zh)
Other versions
CN112489622B (en)
Inventor
徐及
刘丹阳
张鹏远
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201910782981.2A priority Critical patent/CN112489622B/en
Publication of CN112489622A publication Critical patent/CN112489622A/en
Application granted granted Critical
Publication of CN112489622B publication Critical patent/CN112489622B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/005 - Language recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for recognizing the voice content of a multi-language continuous voice stream. The method comprises the following steps: inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors; inputting the segment-level language feature vectors into a segment-level language classification model, and outputting the posterior probability distribution of the segment-level language states; calculating the optimal language state path of the multi-language continuous voice stream with a Viterbi search algorithm according to the posterior probability distribution of the segment-level language states; segmenting the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain language state intervals; and sending the segmented language state intervals into a multi-language acoustic model and the corresponding multi-language decoders for decoding to obtain the content recognition result of the multi-language continuous voice stream. By fusing the language classification models with the Viterbi search algorithm, the invention solves the problem of dynamically detecting and recognizing the languages of multi-language content coexisting in a continuous voice stream.

Description

Method and system for recognizing voice content of multi-language continuous voice stream
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a method and a system for recognizing speech contents of a continuous speech stream in multiple languages.
Background
With the application of hidden Markov models, deep neural networks and other technologies in the field of automatic speech recognition, automatic speech recognition has achieved unprecedented development. For languages with large user populations such as Chinese and English, the performance of the corresponding single-language speech recognition systems can even reach human-level recognition. As economic trade and cultural exchange among countries accelerate, building a mixed multi-language speech recognition system has become a prerequisite for content detection of multi-language voice streams.
The traditional multi-language voice recognition system consists of a language recognition front end connected in series with several parallel single-language voice recognition back ends. Generally, the language recognition front end performs sentence-level classification of the language of an utterance based on the speech features of the whole utterance. In the multi-language recognition task over a multi-language continuous voice stream, such sentence-level language classification cannot cope with the language classification task when multiple languages coexist in the voice stream.
Disclosure of Invention
The invention aims to solve the problem that sentence-level language classification methods cannot cope with the language classification task when multiple languages coexist in a voice stream.
In order to achieve the above object, the present invention provides a method for recognizing voice contents of a multi-language continuous voice stream, comprising:
inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment-level language feature vector into a segment-level language classification model, and outputting posterior probability distribution of the segment-level language state;
calculating the optimal language state path of the multilingual continuous voice stream based on a Viterbi retrieval algorithm according to the posterior probability distribution of the segment-level language state;
segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval;
and sending the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language continuous voice stream.
As an improvement of the method, the method further comprises a training step of the multilingual acoustic model, and the specific steps are as follows:
step 1-1) constructing a multi-language acoustic model based on a multi-task learning framework, wherein the model comprises a plurality of shared hidden layers and language specific output layers;
step 1-2) extracting the frequency spectrum characteristics of the multi-language continuous voice stream of the training set based on the acoustic state label of the multi-language voice data, and inputting the frequency spectrum characteristics into a shared hidden layer for nonlinear transformation; outputting data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating an error loss function value from the data of the single language at a language specific output layer corresponding to the input spectral feature, wherein the error loss function is as follows:
$$F_{loss,i}=\begin{cases}-\,q_{label,L}\cdot\log p_{model,i}(x_L), & i=L\\ 0, & i\neq L\end{cases}\qquad(1)$$
wherein F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) back-propagating the error loss value F_loss,i; the parameters of each language-specific output layer are updated according to the data of the corresponding single language, and the gradient Δφ_i of the language-specific output layer parameters is calculated as:
$$\Delta\phi_i=\frac{\partial F_{loss,i}}{\partial\phi_i}\qquad(2)$$
wherein φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layer are updated using the error loss values F_loss,i returned by all of the language-specific output layers; the gradient ΔΦ of the shared hidden layer parameters is calculated as:
$$\Delta\Phi=\sum_{i=1}^{L}\frac{\partial F_{loss,i}}{\partial\Phi}\qquad(3)$$
wherein Φ denotes the parameters of the shared hidden layer, and L is the number of language categories, i.e. the number of language-specific output layers of the multilingual acoustic model;
step 1-5) when F_loss,i is greater than the set threshold, return to step 1-2); when F_loss,i is less than the set threshold, the trained multilingual acoustic model is obtained.
As an improvement of the method, the method further comprises a training step of a frame-level language classification model, and the specific steps are as follows:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, inputting the frame-level frequency spectrum characteristics into a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector is:
$$\mu=\frac{1}{T}\sum_{i=1}^{T}h_i\qquad(4)$$
the variance vector is:
$$\sigma=\frac{1}{T}\sum_{i=1}^{T}(h_i-\mu)\odot(h_i-\mu)\qquad(5)$$
the segment level language feature vector:
$$h_{segment}=\mathrm{Append}(\mu,\sigma)\qquad(6)$$
wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistical period, μ is the long-term mean vector, σ is the long-term variance vector, and h_segment is the segment-level language feature vector formed by splicing the mean vector and the variance vector together; its dimension is 2 times the dimension of h_i; Append(μ, σ) denotes splicing μ and σ into one high-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
As an improvement of the method, the method further comprises a training step of a segment-level language classification model, and the specific steps are as follows:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum feature of the multilingual continuous voice stream of the training set, inputting the frame level spectrum feature into the hidden layer of the trained frame level language classification model, and extracting the segment level language feature vector from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
As an improvement of the method, the multilingual continuous speech stream to be recognized is input into a frame-level language classification model, and segment-level language feature vectors are output; inputting the segment-level language feature vector into a segment-level language classification model to output the posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting the frequency spectrum characteristics of the frame level to be identified from the continuous voice stream of the multi-language to be identified;
inputting the frame-level frequency spectrum features to be identified into the trained frame-level language classification model according to a specific step length and a specific window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
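For illustration, a minimal sketch of this sliding-window inference is given below (Python/NumPy assumed; the helper callables frame_level_lid and segment_level_lid, as well as the window and step values, are hypothetical placeholders rather than details specified by the patent).

```python
import numpy as np

def segment_posteriors(features, frame_level_lid, segment_level_lid,
                       window=100, step=50):
    """Slide a fixed window over frame-level spectral features and return, for
    each window, the posterior distribution over language states.

    features          : (num_frames, feat_dim) array of frame-level spectral features
    frame_level_lid   : maps a (window, feat_dim) block to a segment-level feature h_segment
    segment_level_lid : maps h_segment to a posterior over the N languages
    """
    posteriors = []
    for start in range(0, features.shape[0] - window + 1, step):
        h_segment = frame_level_lid(features[start:start + window])
        posteriors.append(segment_level_lid(h_segment))
    return np.stack(posteriors)  # shape: (num_windows, N)
```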
As an improvement of the method, based on the viterbi search algorithm, calculating the optimal language state path of the multilingual continuous speech stream according to the posterior probability distribution of the language state specifically includes:
step 3-1) setting the self-loop probability p_loop and the jump probability p_skip of the language states for the Viterbi search according to the posterior probability distribution of the language states, and obtaining the transition matrix A of the language states as follows:
$$A=\begin{bmatrix}p_{loop}&p_{skip}&\cdots&p_{skip}\\ p_{skip}&p_{loop}&\cdots&p_{skip}\\ \vdots&\vdots&\ddots&\vdots\\ p_{skip}&p_{skip}&\cdots&p_{loop}\end{bmatrix}\qquad(7)$$
wherein p_loop denotes the self-loop probability of a language state and p_skip denotes the jump probability of a language state; the self-loop probability and jump probability values of each language are the same. Language state labels are set according to the language categories, the labels of the different language categories being the Arabic numerals 1, 2, ..., N, where N is the number of language states; the correspondence between the elements of the transition matrix A and the language state labels is:
$$A(m,n)=p_{trans}(s_{T+1}=n\mid s_T=m)=\begin{cases}p_{loop},&m=n\\ p_{skip},&m\neq n\end{cases},\quad m,n\in\{1,2,\ldots,N\}\qquad(8)$$
step 3-2) carrying out Viterbi retrieval on the predicted language state sequence, and calculating a target function based on Viterbi retrieval:
$$\hat{S}=\arg\max_{S=\{s_1,s_2,\ldots\}}\prod_{T}p_{trans}(s_{T+1}\mid s_T)\,p_{emit}(s_{T+1}\mid h_{segment})\qquad(9)$$
wherein p_trans(s_T+1 | s_T) denotes the transition probability of the multilingual continuous voice stream from the language state s_T at time T to the language state s_T+1 at time T+1:
$$p_{trans}(s_{T+1}\mid s_T)=\begin{cases}p_{loop},&s_{T+1}=s_T\\ p_{skip},&s_{T+1}\neq s_T\end{cases}\qquad(10)$$
wherein the language classification labels corresponding to the language state s_T and the language state s_T+1 lie within the range of the annotated language classification labels, and T is the statistical period corresponding to the segment-level language feature h_segment;
p_emit(s_T+1 | h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_T+1:
$$p_{emit}(s_{T+1}\mid h_{segment})=\mathrm{DNN\text{-}LID}_{segment\ level}(h_{segment})\qquad(11)$$
The DNN-LID is a segment level language classifier based on a deep neural network DNN;
step 3-3) the language state sequence that maximizes the objective function in formula (9) is the optimal language state sequence, and language state backtracking is carried out according to the optimal language state sequence to obtain the optimal language state path.
The invention also proposes a system for recognizing the speech content of a continuous speech stream in multiple languages, said system comprising:
the segment-level language feature extraction module is used for inputting the multilingual continuous voice stream to be identified into a frame-level language classification model and outputting segment-level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language state;
a language state path obtaining module, which is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the segment level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of the above items when executing the computer program.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the method of any of the above.
Compared with the prior art, the invention has the advantages that:
1. the method and the system for recognizing the voice content of the continuous voice stream with the multiple languages can solve the problem of dynamic detection of the language types with the coexistence of the multiple language contents in the continuous voice stream by fusing the language classification model with the Viterbi retrieval algorithm.
2. The method for recognizing the voice content of the multi-language continuous voice stream can perform dynamic language switching point judgment and corresponding multi-language content recognition on the multi-language content in the continuous voice stream.
Drawings
FIG. 1 is a diagram illustrating a method for recognizing speech contents of a multilingual continuous speech stream according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
The invention provides a method and a system for recognizing the voice content of a multi-language continuous voice stream, wherein the method comprises the following steps:
step 1) constructing a multi-language acoustic model based on multi-task learning; the acoustic model uniformly constructs acoustic modeling tasks of multiple languages under a neural network classification framework based on multi-task learning, and simultaneously performs joint optimization on the acoustic models of the multiple languages by using acoustic characteristics of the multiple languages; the method specifically comprises the following steps:
step 1-1) constructing a multi-language acoustic model of a neural network classification framework based on multi-task learning, wherein the model is composed of a plurality of shared hidden layers and language specific output layers; wherein the model parameters of the shared hidden layer are jointly optimized by multi-language data; the language specific output layer is optimized by data of each single language;
step 1-2) in the forward calculation process of the model, the shared hidden layer and the language specific output layer of the multilingual acoustic model perform nonlinear transformation on input multilingual frequency spectrum characteristic vectors, and all language specific output layers output information;
step 1-3) in the error loss function calculation process of model updating, according to the acoustic state label corresponding to the spectrum feature, calculating the error loss function value only in the language specific output layer corresponding to the spectrum feature, and calculating the error loss function value of other language specific output layers not corresponding to the spectrum feature language to be zero; the corresponding loss function calculation is as follows:
$$F_{loss,i}=\begin{cases}-\,q_{label,L}\cdot\log p_{model,i}(x_L), & i=L\\ 0, & i\neq L\end{cases}\qquad(1)$$
wherein F_loss,i is the error loss function value of the i-th language-specific output layer, p_model,i(x_L) is the acoustic model output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L;
step 1-4) in the backward propagation of the model classification error, the error loss value F_loss,i is propagated back, and the parameters of each language-specific output layer are trained with the data of the corresponding single language; the parameters of the shared hidden layer are updated from the error loss values F_loss,i returned by all of the language-specific output layers;
the language specific output layer parameter gradient calculation formula is as follows:
$$\Delta\phi_i=\frac{\partial F_{loss,i}}{\partial\phi_i}\qquad(2)$$
wherein φ_i denotes the parameters of the i-th language-specific output layer.
The gradient calculation formula for the shared hidden layer parameters is:
$$\Delta\Phi=\sum_{i=1}^{L}\frac{\partial F_{loss,i}}{\partial\Phi}\qquad(3)$$
where Φ denotes the parameters of the shared hidden layer, and L is the number of language categories, i.e. the number of language-specific output layers of the multilingual acoustic model.
Step 1-5) repeatedly executing the step 1-2) -the step 1-4) until the model parameters are converged.
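To make the multi-task structure of steps 1-1) to 1-5) concrete, the following PyTorch-style sketch shows shared hidden layers with one output layer per language, where only the output layer matching the language of the current batch receives a non-zero loss; the layer sizes, the optimizer and the shape of a training batch are illustrative assumptions and are not fixed by the patent.

```python
import torch
import torch.nn as nn

class MultilingualAM(nn.Module):
    """Shared hidden layers with one language-specific output layer per language."""
    def __init__(self, feat_dim, hidden_dim, num_states_per_lang):
        super().__init__()
        self.shared = nn.Sequential(                      # shared hidden layers
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(                       # language-specific output layers
            [nn.Linear(hidden_dim, n) for n in num_states_per_lang])

    def forward(self, x, lang_id):
        return self.heads[lang_id](self.shared(x))        # logits over acoustic states

def train_step(model, optimizer, batch):
    """One update: the loss is computed only on the output layer of the batch's language,
    so gradients reach that head and the shared layers while the other heads are untouched."""
    x, state_labels, lang_id = batch                      # x: (batch, feat_dim)
    optimizer.zero_grad()
    logits = model(x, lang_id)
    loss = nn.functional.cross_entropy(logits, state_labels)  # F_loss,i for i == lang_id
    loss.backward()
    optimizer.step()
    return loss.item()
```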
Step 2) constructing, based on a deep neural network model, a frame-level language classification model that fuses long-term statistical characteristics, and extracting language feature vectors representing language category features from this frame-level language classification model. In the forward computation of the frame-level language classification model, the long-term statistics component performs segment-level statistics on the output vector of the previous hidden layer, calculates the mean and variance statistics of that output vector, and takes the vector of mean and variance statistics as the input of the next hidden layer; finally, error calculation and backward gradient propagation of the language classification model are performed according to the frame-level language labels to update the model;
the specific steps of training the frame-level language classification model comprise:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, taking the frame-level frequency spectrum characteristics as input characteristics to input a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector is:
$$\mu=\frac{1}{T}\sum_{i=1}^{T}h_i\qquad(4)$$
the variance vector is:
$$\sigma=\frac{1}{T}\sum_{i=1}^{T}(h_i-\mu)\odot(h_i-\mu)\qquad(5)$$
the segment level language feature vector is:
$$h_{segment}=\mathrm{Append}(\mu,\sigma)\qquad(6)$$
wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistical period, μ is the long-term mean vector, σ is the long-term variance vector, and h_segment is the segment-level language feature vector formed by splicing the mean vector and the variance vector together; its dimension is 2 times the dimension of h_i; Append(μ, σ) denotes splicing μ and σ into one high-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
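The long-term statistics of step 2-2) amount to a mean/variance pooling layer inserted between two hidden layers of the frame-level classifier. A minimal PyTorch sketch is given below (layer sizes are illustrative and not prescribed by the patent), with h_segment exposed so that it can later be read out as the segment-level language feature.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Mean/variance pooling over the statistical period T (formulas (4)-(6))."""
    def forward(self, h):                                  # h: (batch, T, hidden_dim)
        mu = h.mean(dim=1)                                 # mean vector
        sigma = ((h - mu.unsqueeze(1)) ** 2).mean(dim=1)   # variance vector (element-wise)
        return torch.cat([mu, sigma], dim=-1)              # h_segment, dimension 2 * hidden_dim

class FrameLevelLID(nn.Module):
    """Frame-level language classifier with an embedded statistics-pooling layer."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_languages=3):
        super().__init__()
        self.frame_layers = nn.Sequential(                 # hidden layers before pooling
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.pool = StatsPooling()
        self.post_layers = nn.Sequential(                  # hidden layers after pooling
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_languages))

    def forward(self, x):                                  # x: (batch, T, feat_dim)
        h = self.frame_layers(x)
        h_segment = self.pool(h)                           # segment-level language feature vector
        return self.post_layers(h_segment), h_segment
```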
Based on a trained frame-level language classification model, extracting segment-level language feature vectors from a hidden layer of the frame-level language classification model, constructing segment-level language labels for each segment-level language feature vector, and training the segment-level language classification model according to the segment-level language feature vectors and the segment-level language labels. The method specifically comprises the following steps:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum characteristics of the multilingual continuous voice stream of the training set, inputting the hidden layer of the trained frame level language classification model by taking the frame level spectrum characteristics as input characteristics, and extracting segment level language characteristic vectors from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
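A possible realization of steps S2-1) to S2-3) follows, assuming a trained frame-level model shaped like the FrameLevelLID sketch above and a list of training segments with segment-level language labels; the classifier size and the training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_segment_level_lid(frame_lid, segments, labels, num_languages, epochs=10):
    """segments: list of (T, feat_dim) tensors; labels: list of segment-level language ids.
    h_segment is read from the frozen frame-level model, and a small DNN classifier is
    trained on it; the softmax of its output is p_emit(s | h_segment)."""
    frame_lid.eval()
    with torch.no_grad():                                   # the frame-level model stays fixed
        feats = torch.stack([frame_lid(seg.unsqueeze(0))[1].squeeze(0) for seg in segments])
    targets = torch.tensor(labels)

    clf = nn.Sequential(nn.Linear(feats.shape[-1], 256), nn.ReLU(),
                        nn.Linear(256, num_languages))
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(clf(feats), targets)
        loss.backward()
        opt.step()
    return clf
```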
Step 3) extracting segment-level language feature vectors by utilizing a trained frame-level language classification model for the voice of the continuous voice stream of the multi-language to be recognized, carrying out language classification on the segment-level language feature vectors according to the segment-level language classification model, and carrying out real-time detection on language switching points of the continuous voice stream of the multi-language by combining a Viterbi retrieval algorithm; and finally, according to the language detection result, segmenting the continuous voice stream and identifying the content of the multi-language voice stream through a multi-language acoustic model and a corresponding decoder. The method comprises the following specific steps:
step 3-1) extracting segment-level language feature vectors from the frame-level language classification model according to specific step length and window length by using the frequency spectrum features of the voice of the multi-language continuous voice stream to be recognized;
classifying the segment-level language feature vectors through a segment-level language classification model to obtain posterior probability distribution of the language states corresponding to the segment-level language feature vectors;
setting the self-loop probability and the jump probability of the language states for the Viterbi search, and reducing the language classification errors caused by inaccurate classification of the segment-level language classification model by increasing the self-loop probability; the method comprises the following steps:
based on the posterior probability distribution of the language states, the self-loop probability and the jump probability of the language states for the Viterbi search are set, and the transition matrix A of the language states is obtained as follows:
$$A=\begin{bmatrix}p_{loop}&p_{skip}&\cdots&p_{skip}\\ p_{skip}&p_{loop}&\cdots&p_{skip}\\ \vdots&\vdots&\ddots&\vdots\\ p_{skip}&p_{skip}&\cdots&p_{loop}\end{bmatrix}\qquad(7)$$
wherein p_loop denotes the self-loop probability of a language state and p_skip denotes the jump probability of a language state; the self-loop probability and jump probability values of each language are the same. Language state labels are set according to the language categories, the labels of the different language categories being the Arabic numerals 1, 2, ..., N; the correspondence between the elements of the transition matrix A and the language state labels is:
$$A(m,n)=p_{trans}(s_{T+1}=n\mid s_T=m)=\begin{cases}p_{loop},&m=n\\ p_{skip},&m\neq n\end{cases},\quad m,n\in\{1,2,\ldots,N\}\qquad(8)$$
step 3-2) calculating the posterior probability p_emit(s_T+1 | h_segment) of the predicted segment-level language state, and performing the Viterbi search on the predicted language states according to the preset self-loop probability p_loop and jump probability p_skip of the language states, specifically including:
calculating the optimal language state sequence of the continuous voice stream based on the target function of the Viterbi retrieval, wherein the target function is as follows:
$$\hat{S}=\arg\max_{S=\{s_1,s_2,\ldots\}}\prod_{T}p_{trans}(s_{T+1}\mid s_T)\,p_{emit}(s_{T+1}\mid h_{segment})\qquad(9)$$
wherein p_trans(s_T+1 | s_T) denotes the transition probability of the multilingual continuous voice stream from the language state s_T at time T to the language state s_T+1 at time T+1:
$$p_{trans}(s_{T+1}\mid s_T)=\begin{cases}p_{loop},&s_{T+1}=s_T\\ p_{skip},&s_{T+1}\neq s_T\end{cases}\qquad(10)$$
wherein the language classification labels corresponding to the language state s_T and the language state s_T+1 lie within the range of the annotated language classification labels, and T is the statistical period corresponding to the segment-level language feature h_segment;
p_emit(s_T+1 | h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_T+1:
$$p_{emit}(s_{T+1}\mid h_{segment})=\mathrm{DNN\text{-}LID}_{segment\ level}(h_{segment})\qquad(11)$$
The DNN-LID is a segment level language classifier based on a deep neural network DNN;
and 3-3) predicting the optimal language state for retrieval by the posterior probability of the segment-level language state predicted by the segment-level language classification model and the preset autorotation probability and the jump probability of the language state through the recursive formula, wherein the sequence with the maximum target function value is the optimal language state sequence corresponding to the continuous voice stream of multiple languages, and the optimal language state path can be obtained by performing language state backtracking through the optimal language state sequence.
step 4) segmenting the multi-language voice stream into language state intervals according to the optimal language state path, and sending the voice stream of each segmented language state interval into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content recognition result of the multi-language continuous voice stream.
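Step 4) then reduces to merging consecutive windows that share a language state into intervals and handing each interval to the single-language decoder for that language. The sketch below assumes the same window/step bookkeeping as the earlier sliding-window sketch and a hypothetical `decoders` mapping from language id to a decoding callable; neither is prescribed by the patent.

```python
def split_by_language(path, window=100, step=50):
    """Convert a per-window language-state path into (language, start_frame, end_frame)
    intervals on the original frame axis."""
    intervals, start, current = [], 0, path[0]
    for i in range(1, len(path)):
        if path[i] != current:                            # language switching point detected
            intervals.append((current, start * step, (i - 1) * step + window))
            start, current = i, path[i]
    intervals.append((current, start * step, (len(path) - 1) * step + window))
    return intervals

def recognize_stream(features, path, decoders, window=100, step=50):
    """Send each language-state interval to the decoder of the detected language."""
    return [(lang, decoders[lang](features[beg:end]))
            for lang, beg, end in split_by_language(path, window, step)]
```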
The invention also proposes a system for recognizing the speech content of a continuous speech stream in multiple languages, said system comprising:
the segment-level language feature extraction module is used for inputting the multilingual continuous voice stream to be identified into a frame-level language classification model and outputting segment-level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language state;
a language state path obtaining module, which is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the segment level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of the above items when executing the computer program.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the method of any of the above.
The rationality and validity of the speech recognition system based on the invention have been verified in real systems; the results are shown in Table 1:
TABLE 1
(The results table is provided as an image in the original publication and is not reproduced here; according to the accompanying text, it compares the language identification accuracy of the baseline system with that of the proposed method on Cantonese, Turkish and Vietnamese data.)
The method of the invention performs joint training of the multi-language acoustic model with Cantonese, Turkish and Vietnamese data, constructs frame-level and segment-level language classification models for the three languages, and performs language classification and voice content recognition on continuous multi-language voice using the Viterbi-based method for recognizing the voice content of a multi-language continuous voice stream. As can be seen from Table 1, the method of the invention improves the language identification accuracy from 82.1% to 92.4%, verifying that the Viterbi-based method for recognizing the voice content of a multi-language continuous voice stream can effectively improve language detection in a multi-language continuous voice stream.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A method of multi-lingual continuous speech stream speech content recognition, the method comprising:
inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment-level language feature vector into a segment-level language classification model, and outputting posterior probability distribution of the segment-level language state;
calculating the optimal language state path of the multilingual continuous voice stream based on a Viterbi retrieval algorithm according to the posterior probability distribution of the segment-level language state;
segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval;
and inputting the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language voice stream.
2. The method for recognizing the speech content of a multilingual continuous speech stream according to claim 1, further comprising a step of training a multilingual acoustic model, comprising the steps of:
step 1-1) constructing a multi-language acoustic model based on a multitask learning neural network, wherein the model comprises a plurality of shared hidden layers and language specific output layers;
step 1-2) extracting the frequency spectrum characteristics of the multi-language continuous voice stream of the training set based on the acoustic state label of the multi-language continuous voice data, and inputting the frequency spectrum characteristics into a shared hidden layer for nonlinear transformation; outputting data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating error loss function values of data of a single language at a language specific output layer corresponding to the input spectrum characteristics:
said error loss function F_loss,i being:
$$F_{loss,i}=\begin{cases}-\,q_{label,L}\cdot\log p_{model,i}(x_L), & i=L\\ 0, & i\neq L\end{cases}\qquad(1)$$
wherein F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the L-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) back-propagating the error loss value F_loss,i; the parameters of each language-specific output layer are updated according to the data of the corresponding single language, and the gradient Δφ_i of the language-specific output layer parameters is calculated as:
$$\Delta\phi_i=\frac{\partial F_{loss,i}}{\partial\phi_i}\qquad(2)$$
wherein φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layer are updated using the error loss values F_loss,i returned by all of the language-specific output layers; the gradient ΔΦ of the shared hidden layer parameters is calculated as:
$$\Delta\Phi=\sum_{i=1}^{L}\frac{\partial F_{loss,i}}{\partial\Phi}\qquad(3)$$
wherein Φ denotes the parameters of the shared hidden layer, and L is the number of language categories, i.e. the number of language-specific output layers of the multilingual acoustic model;
step 1-5) when F_loss,i is greater than the set threshold, return to step 1-2); when F_loss,i is less than the set threshold, the trained multilingual acoustic model is obtained.
3. The method according to claim 1, further comprising a step of training a frame-level language classification model, comprising the steps of:
step 2-1), constructing a frame level language classification model, wherein the frame level language classification model is a deep neural network;
step 2-2) extracting frame-level frequency spectrum characteristics of multi-language continuous voice streams of a training set, inputting the frame-level frequency spectrum characteristics into a frame-level language classification model, carrying out long-term statistics on output vectors of a current hidden layer, and calculating a mean vector, a variance vector and a segment-level language characteristic vector of the output vectors of the current hidden layer;
the mean vector μ is:
$$\mu=\frac{1}{T}\sum_{i=1}^{T}h_i\qquad(4)$$
the variance vector σ is:
$$\sigma=\frac{1}{T}\sum_{i=1}^{T}(h_i-\mu)\odot(h_i-\mu)\qquad(5)$$
the segment-level language feature vector h_segment being:
$$h_{segment}=\mathrm{Append}(\mu,\sigma)\qquad(6)$$
wherein h_i is the output vector of the current hidden layer at time i, T is the long-term statistical period, μ is the long-term mean vector, σ is the long-term variance vector, and h_segment is the segment-level language feature vector formed by splicing the mean vector and the variance vector together, its dimension being 2 times the dimension of h_i; Append(μ, σ) denotes splicing μ and σ into one high-dimensional vector;
and 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, and training through error calculation and a reverse gradient return process according to the frame-level language labels to enable each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
4. The method for recognizing the speech content of a multilingual continuous speech stream according to claim 1, further comprising the step of training a segment-level language classification model, comprising the steps of:
step S2-1) constructing a segment level language classification model;
step S2-2) extracting the frame level spectrum feature of the multilingual continuous voice stream of the training set, inputting the frame level spectrum feature into the hidden layer of the trained frame level language classification model, and extracting the segment level language feature vector from the hidden layer of the trained frame level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vector into a segment-level language classification model, and training and outputting the posterior probability distribution of the language state corresponding to the segment-level language label to obtain the trained segment-level language classification model.
5. The method according to claim 1, wherein the continuous speech stream of multiple languages is input into a frame-level language classification model to output segment-level language feature vectors; inputting the segment-level language feature vector into a segment-level language classification model to output the posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting the frequency spectrum characteristics of the frame level to be identified from the continuous voice stream of the multi-language to be identified;
inputting the frame-level frequency spectrum features to be identified into the trained frame-level language classification model according to a specific step length and a specific window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
6. The method for recognizing speech content of continuous speech stream in multiple languages according to claim 5, wherein the step of calculating the optimal language state path of the continuous speech stream in multiple languages based on viterbi search algorithm according to the posterior probability distribution of language state comprises:
step 3-1) setting the self-loop probability p_loop and the jump probability p_skip of the language states for the Viterbi search according to the posterior probability distribution of the language states, and obtaining the transition matrix A of the language states as follows:
$$A=\begin{bmatrix}p_{loop}&p_{skip}&\cdots&p_{skip}\\ p_{skip}&p_{loop}&\cdots&p_{skip}\\ \vdots&\vdots&\ddots&\vdots\\ p_{skip}&p_{skip}&\cdots&p_{loop}\end{bmatrix}\qquad(7)$$
the autorotation probability and the skipping probability value of each language are the same, language state labels are set according to language categories, the language state labels are labels of different language categories, and Arabic numerals 1, 2. The corresponding relation between each element of the transition matrix A and the language state label is as follows:
$$A(m,n)=p_{trans}(s_{T+1}=n\mid s_T=m)=\begin{cases}p_{loop},&m=n\\ p_{skip},&m\neq n\end{cases},\quad m,n\in\{1,2,\ldots,N\}\qquad(8)$$
step 3-2) carrying out Viterbi retrieval on the predicted language state, and calculating a target function based on Viterbi retrieval:
$$\hat{S}=\arg\max_{S=\{s_1,s_2,\ldots\}}\prod_{T}p_{trans}(s_{T+1}\mid s_T)\,p_{emit}(s_{T+1}\mid h_{segment})\qquad(9)$$
wherein p_trans(s_T+1 | s_T) denotes the transition probability of the multilingual continuous voice stream from the language state s_T at time T to the language state s_T+1 at time T+1:
$$p_{trans}(s_{T+1}\mid s_T)=\begin{cases}p_{loop},&s_{T+1}=s_T\\ p_{skip},&s_{T+1}\neq s_T\end{cases}\qquad(10)$$
wherein the language classification labels corresponding to the language state s_T and the language state s_T+1 lie within the range of the annotated language classification labels, and T is the statistical period corresponding to the segment-level language feature h_segment;
p_emit(s_T+1 | h_segment) denotes the posterior probability predicted for the segment-level language feature h_segment in the language state s_T+1:
$$p_{emit}(s_{T+1}\mid h_{segment})=\mathrm{DNN\text{-}LID}_{segment\ level}(h_{segment})\qquad(11)$$
The DNN-LID is a segment-level language classifier based on the deep neural network DNN;
step 3-3) the language state sequence that maximizes the objective function in formula (9) is the optimal language state sequence, and language state backtracking is carried out according to the optimal language state sequence to obtain the optimal language state path.
7. A multi-language continuous speech stream speech content recognition system, said system comprising:
the segment-level language feature extraction module is used for inputting the multilingual continuous voice stream to be identified into a frame-level language classification model and outputting segment-level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model and outputs the posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating the optimal language state path of the multi-language voice stream based on the Viterbi retrieval algorithm according to the posterior probability distribution of the section level language state;
the language state interval segmentation module is used for segmenting the multilingual continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content identification module of the multi-language voice stream is used for sending the segmented language state intervals into the multi-language acoustic model and the corresponding multi-language decoder for decoding to obtain the content identification result of the multi-language voice stream.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-6 when executing the computer program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of any one of claims 1-6.
CN201910782981.2A 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system Active CN112489622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Publications (2)

Publication Number Publication Date
CN112489622A true CN112489622A (en) 2021-03-12
CN112489622B CN112489622B (en) 2024-03-19

Family

ID=74920171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910782981.2A Active CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Country Status (1)

Country Link
CN (1) CN112489622B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870839A (en) * 2021-09-29 2021-12-31 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN114078468A (en) * 2022-01-19 2022-02-22 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
WO2023165538A1 (en) * 2022-03-03 2023-09-07 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and computer-readable medium and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030050779A1 (en) * 2001-08-31 2003-03-13 Soren Riis Method and system for speech recognition
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030050779A1 (en) * 2001-08-31 2003-03-13 Soren Riis Method and system for speech recognition
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US20190189111A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Multi-Lingual End-to-End Speech Recognition
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AANCHAN MOHAN ET AL.: "《Multi-lingual speech recognition with low-rank multi-task deep neural networks》", 《ICASSP 2015》, 6 August 2015 (2015-08-06), pages 4994 - 4998 *
THOMAS NIESLER ET AL.: "《Language identification and multilingual speech recognition using discriminatively trained acoustic models》", 《COMPUTER SCIENCE, LINGUISTICS》, 31 December 2006 (2006-12-31), pages 1 - 6 *
姚海涛等: "《面向多语言的语音识别声学模型建模方法研究》", 《声学技术》, vol. 34, no. 6, 31 December 2015 (2015-12-31), pages 404 - 407 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870839A (en) * 2021-09-29 2021-12-31 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN113870839B (en) * 2021-09-29 2022-05-03 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN114078468A (en) * 2022-01-19 2022-02-22 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
WO2023165538A1 (en) * 2022-03-03 2023-09-07 北京有竹居网络技术有限公司 Speech recognition method and apparatus, and computer-readable medium and electronic device
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112489622B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN112489622B (en) Multi-language continuous voice stream voice content recognition method and system
CN108710704B (en) Method and device for determining conversation state, electronic equipment and storage medium
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN110727844B (en) Online commented commodity feature viewpoint extraction method based on generation countermeasure network
CN110968660A (en) Information extraction method and system based on joint training model
CN110532555B (en) Language evaluation generation method based on reinforcement learning
CN111339260A (en) BERT and QA thought-based fine-grained emotion analysis method
CN115292463B (en) Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN112069801A (en) Sentence backbone extraction method, equipment and readable storage medium based on dependency syntax
CN111581970B (en) Text recognition method, device and storage medium for network context
CN113204952A (en) Multi-intention and semantic slot joint identification method based on clustering pre-analysis
CN111078876A (en) Short text classification method and system based on multi-model integration
CN112528658B (en) Hierarchical classification method, hierarchical classification device, electronic equipment and storage medium
CN111739520A (en) Speech recognition model training method, speech recognition method and device
CN114860942B (en) Text intention classification method, device, equipment and storage medium
CN114417872A (en) Contract text named entity recognition method and system
CN112507124A (en) Chapter-level event causal relationship extraction method based on graph model
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN115146124A (en) Question-answering system response method and device, equipment, medium and product thereof
CN113239694B (en) Argument role identification method based on argument phrase
Lin et al. Ctc network with statistical language modeling for action sequence recognition in videos
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
Gong et al. Activity grammars for temporal action segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant