CN112489622B - Multi-language continuous voice stream voice content recognition method and system - Google Patents

Multi-language continuous voice stream voice content recognition method and system

Info

Publication number
CN112489622B
CN112489622B (granted publication of application CN201910782981.2A; pre-grant publication CN112489622A)
Authority
CN
China
Prior art keywords
language
segment
level
state
level language
Prior art date
Legal status
Active
Application number
CN201910782981.2A
Other languages
Chinese (zh)
Other versions
CN112489622A (en)
Inventor
徐及
刘丹阳
张鹏远
颜永红
Current Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd
Priority to CN201910782981.2A
Publication of CN112489622A
Application granted
Publication of CN112489622B
Active legal status
Anticipated expiration legal status

Classifications

    • G10L15/005 — Speech recognition; Language recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 — Creation of reference templates; Training of speech recognition systems
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/26 — Speech to text systems
    • G10L25/18 — Speech or voice analysis techniques; extracted parameters being spectral information of each sub-band
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for recognizing the speech content of a multi-language continuous voice stream. The method comprises the following steps: inputting the multi-language continuous voice stream to be recognized into a frame-level language classification model and outputting segment-level language feature vectors; inputting the segment-level language feature vectors into a segment-level language classification model and outputting the posterior probability distribution of the segment-level language states; calculating the optimal language-state path of the multi-language continuous voice stream with a Viterbi search algorithm according to the posterior probability distribution of the segment-level language states; segmenting the multi-language continuous voice stream to be recognized according to the optimal language-state path to obtain language-state intervals; and sending the segmented language-state intervals to a multilingual acoustic model and the corresponding multilingual decoders for decoding to obtain the content recognition result of the multilingual continuous voice stream. By fusing the language classification models with the Viterbi search algorithm, the invention solves the problem of dynamically detecting and recognizing the language categories of concurrent multilingual content in continuous voice streams.

Description

Multi-language continuous voice stream voice content recognition method and system
Technical Field
The invention relates to the field of speech recognition, and in particular to a method and a system for recognizing the speech content of multi-language continuous voice streams.
Background
With the application of hidden Markov models, deep neural networks and other techniques in the field of automatic speech recognition, automatic speech recognition technology has made unprecedented progress. For widely spoken languages such as Chinese and English, the performance of the corresponding single-language speech recognition systems can even reach the recognition level of humans. As economic and cultural exchange among the countries of the world accelerates, building mixed multilingual speech recognition systems has become a necessity for detecting the content of multilingual voice streams.
A traditional multi-language speech recognition system connects a language identification front end in series with a back end of several parallel single-language speech recognition systems. In general, the language identification front end classifies and discriminates the language category of an utterance from the speech characteristics of the whole utterance. In the multi-language recognition task on a multi-language continuous voice stream, such sentence-level language classification cannot handle the language classification task when several languages coexist within one voice stream.
Disclosure of Invention
The invention aims to solve the problem that a sentence-level language classification method cannot handle the language classification task when multiple languages coexist in a voice stream.
To achieve the above object, the present invention provides a method for recognizing speech content of a multi-language continuous speech stream, the method comprising:
inputting the multi-language continuous voice stream to be identified into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment level language feature vector into a segment level language classification model, and outputting posterior probability distribution of the segment level language state;
calculating an optimal language state path of the multi-language continuous voice stream based on a Viterbi search algorithm according to posterior probability distribution of the segment-level language states;
dividing the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain a language state interval;
and sending the language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual continuous voice stream.
As an improvement of the method, the method further comprises a training step of the multilingual acoustic model, which comprises the following specific steps:
step 1-1) constructing a multi-language acoustic model based on a multi-task learning framework, wherein the model comprises a plurality of shared hidden layers and a plurality of language-specific output layers;
step 1-2) extracting spectral features of multi-language continuous voice streams of a training set based on acoustic state labels of multi-language voice data, and inputting the spectral features into a shared hidden layer for nonlinear transformation; outputting the data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating the error loss function value of the single-language data at the language-specific output layer corresponding to the input spectral features, where the error loss function is:
F_loss,i = −Σ q_label,L · log p_model,i(x_L)   (1)
where F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the i-th language-specific output layer for the spectral feature x_L of the L-th language (the loss is computed only when i corresponds to the language of x_L), and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) passing the error loss value F_loss,i backward; the parameters of each language-specific output layer are updated only from the data of its corresponding single language, and the gradient of the language-specific output layer parameters is
Δφ_i = ∂F_loss,i / ∂φ_i   (2)
where φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layers are computed from the error loss values F_loss,i returned by all language-specific output layers, and the gradient of the shared hidden layer parameters is
Δφ = Σ_{i=1}^{L} ∂F_loss,i / ∂φ   (3)
where φ denotes the parameters of the shared hidden layers and L is the number of languages corresponding to the language-specific output layers of the multilingual acoustic model;
step 1-5) if F_loss,i is greater than a given threshold, return to step 1-2);
when F_loss,i is smaller than the given threshold, the trained multilingual acoustic model is obtained.
As an improvement of the method, the method further comprises a training step of a frame-level language classification model, and the method comprises the following specific steps:
step 2-1), constructing a frame-level language classification model, wherein the frame-level language classification model is a deep neural network;
step 2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into a frame-level language classification model, carrying out long-time statistics on the output vector of the current hidden layer, and calculating a mean value vector, a variance vector and a segment-level language feature vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) · Σ_{i=1}^{T} h_i   (4)
the variance vector is:
σ = (1/T) · Σ_{i=1}^{T} (h_i − μ)²   (5)
the segment-level language feature vector is:
h_segment = Append(μ, σ)   (6)
where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector formed by concatenating the mean vector and the variance vector, whose dimension is 2 times the dimension of h_i; Append(μ, σ) denotes concatenating μ and σ into a higher-dimensional vector;
step 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, training according to the frame-level language labels through error calculation and reverse gradient feedback process, and enabling each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
As an improvement of the method, the method further comprises a training step of a segment-level language classification model, and the method comprises the following specific steps:
s2-1), constructing a segment level language classification model;
s2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into an implicit layer of a trained frame-level language classification model, and extracting segment-level language feature vectors from the implicit layer of the trained frame-level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vectors and their labels into the segment-level language classification model for training, and outputting the posterior probability distribution of the language state corresponding to each segment-level language feature vector, obtaining a trained segment-level language classification model.
As an improvement of the method, the multi-language continuous voice stream to be identified is input into a frame-level language classification model, and segment-level language feature vectors are output; inputting the segment level language feature vector into the segment level language classification model to output posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting to-be-identified frame-level spectrum features from a multi-language continuous voice stream to be identified;
inputting the frame-level spectral features to be recognized into the trained frame-level language classification model with a specific step length and window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
As an improvement of the method, the method calculates the optimal language state path of the multi-language continuous voice stream based on the viterbi search algorithm according to the posterior probability distribution of the language states, and specifically comprises the following steps:
step 3-1) setting, according to the posterior probability distribution of the language states, the self-loop probability p_loop and the skip probability p_skip of the language states for the Viterbi search; the resulting language-state transition matrix A is an N×N matrix with p_loop on the diagonal and p_skip everywhere else:
A = [a_ij], a_ij = p_loop if i = j, a_ij = p_skip if i ≠ j   (7)
where p_loop denotes the self-loop probability of a language state and p_skip the skip probability between language states; the self-loop probability and the skip probability are the same for every language. Language-state labels are set according to the language categories; the labels of the different language categories are the Arabic numerals 1, 2, ..., N, where N is the number of language states. Element a_ij of the transition matrix A is the transition probability from the language state labelled i to the language state labelled j.   (8)
step 3-2) performing the Viterbi search over the predicted language-state sequence and computing the objective function of the Viterbi search:
δ_{T+1}(s_{T+1}) = max_{s_T} [ δ_T(s_T) · p_trans(s_{T+1}|s_T) ] · p_emit(s_{T+1}|h_segment)   (9)
where p_trans(s_{T+1}|s_T) is the transition probability of the multilingual continuous speech stream from the language state s_T at time T to the language state s_{T+1} at time T+1:
p_trans(s_{T+1}|s_T) = p_loop if s_{T+1} = s_T, and p_skip otherwise   (10)
where the language classification labels corresponding to the language states s_T and s_{T+1} lie within the set of annotated language classification labels, and T is the statistics period corresponding to the segment-level language feature h_segment;
p_emit(s_{T+1}|h_segment) is the posterior probability of predicting the language state s_{T+1} from the segment-level language feature h_segment:
p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)
where DNN-LID_segment-level is the segment-level language classifier based on a deep neural network (DNN);
step 3-3) the language-state sequence that maximizes the objective function is taken as the optimal language-state sequence, and language-state backtracking is carried out over the optimal language-state sequence to obtain the optimal language-state path.
The invention also provides a multilingual continuous voice stream voice content recognition system, which comprises:
the segment level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame level language classification model and outputting segment level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model, and outputs posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating the optimal language state path of the multilingual voice stream based on a Viterbi retrieval algorithm according to posterior probability distribution of the segment-level language states;
the language state interval segmentation module is used for segmenting the multi-language continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content recognition module is used for sending the segmented language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual speech stream.
The invention also proposes a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the above when executing the computer program.
The invention also proposes a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of any of the above.
Compared with the prior art, the invention has the advantages that:
1. The method and system for recognizing the speech content of multi-language continuous voice streams disclosed by the invention fuse the language classification models with the Viterbi search algorithm, which solves the problem of dynamically detecting the language categories of concurrent multilingual content in a continuous voice stream.
2. The method can dynamically determine the language switching points of the multilingual content in a continuous voice stream and recognize the corresponding multilingual content.
Drawings
FIG. 1 is a schematic diagram of a multi-language continuous speech stream speech content recognition method according to the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
The invention provides a method and a system for recognizing multi-language continuous voice stream voice content, wherein the method comprises the following steps:
step 1) constructing a multilingual acoustic model based on multi-task learning; the acoustic modelling tasks of all languages are built uniformly under a multi-task-learning neural-network classification framework, and the multilingual acoustic model is jointly optimized with the acoustic features of the several languages; the specific steps are:
step 1-1) constructing a multilingual acoustic model under the multi-task-learning neural-network classification framework, the model being composed of a plurality of shared hidden layers and a plurality of language-specific output layers; the model parameters of the shared hidden layers are jointly optimized with the multilingual data, while each language-specific output layer is optimized with the data of its own single language;
step 1-2) during the forward computation of the model, the shared hidden layers and the language-specific output layers of the multilingual acoustic model apply nonlinear transformations to the input multilingual spectral feature vectors, and every language-specific output layer produces an output;
step 1-3) during the error-loss computation for the model update, the error loss function value is calculated only at the language-specific output layer corresponding to the language of the spectral features, using the acoustic state labels of those spectral features; the error loss function values of the language-specific output layers of the other languages are zero; the corresponding loss function is:
F_loss,i = −Σ q_label,L · log p_model,i(x_L)   (1)
where F_loss,i is the error loss function value of the i-th language-specific output layer, p_model,i(x_L) is the acoustic model output of the i-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L;
step 1-4) during the backward pass of the model classification error, the error loss value F_loss,i is propagated back; the parameters of each language-specific output layer are trained only on the data of its corresponding single language, while the parameters of the shared hidden layers are computed from the error loss values F_loss,i returned by all language-specific output layers;
the gradient of the language-specific output layer parameters is:
Δφ_i = ∂F_loss,i / ∂φ_i   (2)
where φ_i denotes the parameters of the i-th language-specific output layer.
The gradient of the shared hidden layer parameters is:
Δφ = Σ_{i=1}^{L} ∂F_loss,i / ∂φ   (3)
where φ denotes the parameters of the shared hidden layers and L is the number of languages corresponding to the language-specific output layers of the multilingual acoustic model.
Step 1-5) repeating steps 1-2) to 1-4) until the model parameters converge.
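To make the multi-task training procedure of steps 1-1) to 1-5) concrete, the following minimal Python sketch (not part of the patent; the layer sizes, feature dimension and the use of PyTorch with a cross-entropy loss are illustrative assumptions) shows a model with shared hidden layers and per-language output layers, where each mini-batch of single-language data updates only its own output layer together with the shared layers:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultilingualAcousticModel(nn.Module):
    def __init__(self, feat_dim, hidden_dim, state_nums):
        # state_nums[i] = number of acoustic states of language i
        super().__init__()
        self.shared = nn.Sequential(                      # shared hidden layers
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(                       # language-specific output layers
            [nn.Linear(hidden_dim, n) for n in state_nums])

    def forward(self, x, lang_id):
        return self.heads[lang_id](self.shared(x))

model = MultilingualAcousticModel(feat_dim=40, hidden_dim=512, state_nums=[3000, 2500, 2800])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(feats, state_labels, lang_id):
    # feats: (batch, feat_dim) spectral features of one language; state_labels: acoustic state labels
    logits = model(feats, lang_id)
    # the loss is computed only at the output layer of the matching language,
    # so the other output layers contribute zero loss for this mini-batch
    loss = F.cross_entropy(logits, state_labels)
    optimizer.zero_grad()
    loss.backward()        # the shared hidden layers accumulate gradients from every language over training
    optimizer.step()
    return loss.item()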
Step 2) constructing a frame-level language classification model that incorporates long-term statistical features on the basis of a deep neural network model, and extracting language feature vectors that characterize language-category information with this model. The frame-level language classification model contains a long-term statistics component: during the forward computation of the model, the component performs segment-level statistics on the output vector of the previous hidden layer, computes the mean and variance statistics of that output vector, and feeds the statistics vector to the next hidden layer; finally, the error of the language classification model is computed against the frame-level language labels and the model is updated through the backward gradient pass;
the training frame level language classification model specifically comprises the following steps:
step 2-1), constructing a frame-level language classification model, wherein the frame-level language classification model is a deep neural network;
step 2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting a frame-level language classification model by taking the frame-level spectrum features as input features, carrying out long-time statistics on the output vector of the current hidden layer, and calculating a mean value vector, a variance vector and a segment-level language feature vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) · Σ_{i=1}^{T} h_i   (4)
the variance vector is:
σ = (1/T) · Σ_{i=1}^{T} (h_i − μ)²   (5)
the segment-level language feature vector is:
h_segment = Append(μ, σ)   (6)
where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector formed by concatenating the mean vector and the variance vector, whose dimension is 2 times the dimension of h_i; Append(μ, σ) denotes concatenating μ and σ into a higher-dimensional vector;
step 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, training according to the frame-level language labels through error calculation and reverse gradient feedback process, and enabling each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
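A minimal sketch of the long-term statistics component of step 2-2) is given below (illustrative only; the window length, hidden size and the exact variance convention are assumptions): it pools T frame-level hidden vectors into their mean and variance and concatenates them into the segment-level feature h_segment, whose dimension is twice the hidden dimension:

import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    # long-term statistics over one window of frame-level hidden outputs
    def forward(self, h):
        # h: (T, hidden_dim) output vectors of the current hidden layer
        mu = h.mean(dim=0)                       # mean vector, equation (4)
        sigma = h.var(dim=0, unbiased=False)     # variance vector, equation (5)
        return torch.cat([mu, sigma], dim=-1)    # Append(mu, sigma), equation (6)

pool = StatsPooling()
h = torch.randn(100, 256)        # e.g. T = 100 frames of a 256-dimensional hidden layer
h_segment = pool(h)              # 512-dimensional segment-level language feature vector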
Based on the trained frame-level language classification model, segment-level language feature vectors are extracted from the implicit layer of the frame-level language classification model, segment-level language labels are built for each segment-level language feature vector, and the segment-level language classification model is trained according to the segment-level language feature vectors and the segment-level language labels. The method specifically comprises the following steps:
s2-1), constructing a segment level language classification model;
s2-2) extracting frame-level frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting hidden layers of the trained frame-level language classification model by taking the frame-level frequency spectrum characteristics as input characteristics, and extracting segment-level language feature vectors from the hidden layers of the trained frame-level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vectors and their labels into the segment-level language classification model for training, and outputting the posterior probability distribution of the language state corresponding to each segment-level language feature vector, obtaining a trained segment-level language classification model.
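The segment-level language classification model of steps S2-1) to S2-3) can be as small as a feed-forward network over h_segment; the sketch below is an illustrative assumption (the layer sizes and the single hidden layer are not specified by the patent) that outputs the posterior probability distribution over the language states:

import torch.nn as nn

class SegmentLevelLID(nn.Module):
    def __init__(self, seg_dim=512, hidden_dim=256, num_languages=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seg_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_languages))

    def forward(self, h_segment):
        # posterior probability distribution p_emit(s | h_segment) over the language states
        return self.net(h_segment).softmax(dim=-1)

During training, the logits of this classifier would be scored against the segment-level language labels with a cross-entropy loss, exactly as for any classifier.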
Step 3) extracting segment-level language feature vectors from the speech of the multi-language continuous voice stream to be recognized with the trained frame-level language classification model, classifying the segment-level language feature vectors with the segment-level language classification model, and detecting the language switching points of the multi-language continuous voice stream in real time in combination with the Viterbi search algorithm; finally, the continuous voice stream is segmented according to the language detection result, and the content of the multilingual voice stream is recognized by the multilingual acoustic model and the corresponding decoders. The specific steps are as follows:
step 3-1), extracting segment-level language feature vectors from the frame-level language classification model according to specific step length and window length by using the frequency spectrum features of the voices of the multi-language continuous voice stream to be recognized;
classifying the segment-level language feature vectors through a segment-level language classification model to obtain posterior probability distribution of language states corresponding to the segment-level language feature vectors;
setting the self-loop probability and the skip probability of the language states for the Viterbi search; raising the self-loop probability of the language states reduces the language classification errors caused by inaccurate classification by the segment-level language classification model; this comprises the following:
based on the posterior probability distribution of the language states, the self-loop probability and the skip probability of the language states for the Viterbi search are set, and the resulting language-state transition matrix A is an N×N matrix with p_loop on the diagonal and p_skip everywhere else:
A = [a_ij], a_ij = p_loop if i = j, a_ij = p_skip if i ≠ j   (7)
where p_loop denotes the self-loop probability of a language state and p_skip the skip probability between language states; the self-loop probability and the skip probability are the same for every language. Language-state labels are set according to the language categories; the labels of the different language categories are the Arabic numerals 1, 2, ..., N, where N is the number of language states. Element a_ij of the transition matrix A is the transition probability from the language state labelled i to the language state labelled j.   (8)
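As an illustration of equation (7), the transition matrix can be built as an N×N array with the self-loop probability on the diagonal and the skip probability elsewhere; the value p_loop = 0.95 below is only an example, chosen large to suppress spurious language switches caused by noisy segment-level posteriors:

import numpy as np

def make_transition_matrix(num_languages, p_loop=0.95):
    # each row sums to 1: one self-loop entry plus (N - 1) skip entries
    p_skip = (1.0 - p_loop) / (num_languages - 1)
    A = np.full((num_languages, num_languages), p_skip)
    np.fill_diagonal(A, p_loop)
    return A

A = make_transition_matrix(3)    # e.g. 3 language states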
step 3-2) calculating the posterior probability p_emit(s_{T+1}|h_segment) of the predicted segment-level language state and performing the Viterbi search over the predicted language states with the preset self-loop probability p_loop and skip probability p_skip of the language states; specifically:
calculating the optimal language-state sequence of the continuous voice stream with the objective function of the Viterbi search, where the objective function is:
δ_{T+1}(s_{T+1}) = max_{s_T} [ δ_T(s_T) · p_trans(s_{T+1}|s_T) ] · p_emit(s_{T+1}|h_segment)   (9)
where p_trans(s_{T+1}|s_T) is the transition probability of the multilingual continuous speech stream from the language state s_T at time T to the language state s_{T+1} at time T+1:
p_trans(s_{T+1}|s_T) = p_loop if s_{T+1} = s_T, and p_skip otherwise   (10)
where the language classification labels corresponding to the language states s_T and s_{T+1} lie within the set of annotated language classification labels, and T is the statistics period corresponding to the segment-level language feature h_segment;
p_emit(s_{T+1}|h_segment) is the posterior probability of predicting the language state s_{T+1} from the segment-level language feature h_segment:
p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)
where DNN-LID_segment-level is the segment-level language classifier based on a deep neural network (DNN);
step 3-3) with the posterior probabilities of the segment-level language states predicted by the segment-level language classification model and the preset self-loop and skip probabilities of the language states, the above recursive formula is evaluated to search for the best language states; the sequence with the largest final objective-function value is the optimal language-state sequence of the multi-language continuous voice stream, and language-state backtracking over this sequence yields the optimal language-state path.
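A minimal sketch of this Viterbi search follows (an assumption-level illustration, not the patent's implementation): it takes the matrix of segment-level posteriors p_emit, one row per statistics window, and the transition matrix A, runs the recursion of equation (9) in the log domain, and backtracks the optimal language-state path:

import numpy as np

def viterbi_language_path(emit_probs, A):
    # emit_probs: (num_segments, N) posteriors from the segment-level language classifier
    log_emit = np.log(emit_probs + 1e-12)
    log_A = np.log(A)
    num_segments, N = emit_probs.shape
    delta = np.zeros((num_segments, N))            # best log score ending in each state
    back = np.zeros((num_segments, N), dtype=int)  # backpointers for state backtracking
    delta[0] = log_emit[0]
    for t in range(1, num_segments):
        scores = delta[t - 1][:, None] + log_A     # scores[i, j]: come from state i into state j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = [int(delta[-1].argmax())]               # state with the largest final objective value
    for t in range(num_segments - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                              # optimal language-state path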
Step 4) segmenting the multilingual speech stream into language-state intervals according to the optimal language-state path, sending the speech of each segmented language-state interval to the multilingual acoustic model and the corresponding language-specific decoder for decoding, and obtaining the content recognition result of the multilingual continuous speech stream.
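The following sketch illustrates step 4) under stated assumptions (the one-second window length, the speech_stream.cut method and the decoders mapping are hypothetical placeholders, not from the patent): it collapses the optimal language-state path into contiguous language-state intervals and sends each interval to the decoder of its language:

def path_to_intervals(path, window_sec=1.0):
    # collapse the per-window language states into (start_s, end_s, language_id) intervals
    intervals, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            intervals.append((start * window_sec, t * window_sec, path[start]))
            start = t
    return intervals

def recognize(speech_stream, path, decoders, window_sec=1.0):
    # decoders[lang_id]: multilingual acoustic model plus the decoder of that language
    results = []
    for begin, end, lang_id in path_to_intervals(path, window_sec):
        segment = speech_stream.cut(begin, end)             # hypothetical segment extraction
        results.append((lang_id, decoders[lang_id].decode(segment)))
    return results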
The invention also provides a multilingual continuous voice stream voice content recognition system, which comprises:
the segment level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame level language classification model and outputting segment level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model, and outputs posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating the optimal language state path of the multilingual voice stream based on a Viterbi retrieval algorithm according to posterior probability distribution of the segment-level language states;
the language state interval segmentation module is used for segmenting the multi-language continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content recognition module is used for sending the segmented language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual speech stream.
The invention also proposes a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the above when executing the computer program.
The invention also proposes a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the method of any of the above.
The rationality and effectiveness of the speech recognition system based on the invention have been verified in practical systems, and the results are shown in table 1:
TABLE 1
In this experiment, Cantonese, Turkish and Vietnamese data are used for joint training of the multilingual acoustic model, frame-level and segment-level language classification models are built on the same three languages, and language classification and speech content recognition are performed on continuous multilingual speech with the Viterbi-based multi-language continuous voice stream speech content recognition method. Table 1 shows that the method of the invention improves the language identification accuracy from 82.1% to 92.4%, verifying that the Viterbi-based method for recognizing the speech content of multilingual continuous voice streams can effectively improve the result of language detection in continuous multilingual voice streams.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.

Claims (8)

1. A method of multi-lingual continuous voice stream voice content recognition, the method comprising:
inputting the multi-language continuous voice stream to be identified into a frame-level language classification model, and outputting segment-level language feature vectors;
inputting the segment level language feature vector into a segment level language classification model, and outputting posterior probability distribution of the segment level language state;
calculating an optimal language state path of the multi-language continuous voice stream based on a Viterbi search algorithm according to posterior probability distribution of the segment-level language states;
dividing the multi-language continuous voice stream to be recognized according to the optimal language state path to obtain a language state interval;
inputting the language state interval into a multi-language acoustic model and a corresponding multi-language decoder for decoding to obtain a content identification result of the multi-language continuous voice stream;
according to the posterior probability distribution of segment-level language states, based on a Viterbi search algorithm, calculating an optimal language state path of the multi-language continuous voice stream, wherein the method specifically comprises the following steps:
step 3-1) setting, according to the posterior probability distribution of the language states, the self-loop probability p_loop and the skip probability p_skip of the language states for the Viterbi search; the resulting language-state transition matrix A is an N×N matrix with p_loop on the diagonal and p_skip everywhere else:
A = [a_ij], a_ij = p_loop if i = j, a_ij = p_skip if i ≠ j   (7)
where the self-loop probability and the skip probability are the same for every language; language-state labels are set according to the language categories, the labels of the different language categories being the Arabic numerals 1, 2, ..., N, where N is the number of language states; element a_ij of the transition matrix A is the transition probability from the language state labelled i to the language state labelled j;   (8)
step 3-2) performing the Viterbi search over the predicted language states and computing the objective function of the Viterbi search:
δ_{T+1}(s_{T+1}) = max_{s_T} [ δ_T(s_T) · p_trans(s_{T+1}|s_T) ] · p_emit(s_{T+1}|h_segment)   (9)
where p_trans(s_{T+1}|s_T) is the transition probability of the multilingual continuous speech stream from the language state s_T at time T to the language state s_{T+1} at time T+1:
p_trans(s_{T+1}|s_T) = p_loop if s_{T+1} = s_T, and p_skip otherwise   (10)
where the language classification labels corresponding to the language states s_T and s_{T+1} lie within the set of annotated language classification labels, and T is the statistics period corresponding to the segment-level language feature h_segment;
p_emit(s_{T+1}|h_segment) is the posterior probability of predicting the language state s_{T+1} from the segment-level language feature h_segment:
p_emit(s_{T+1}|h_segment) = DNN-LID_segment-level(h_segment)   (11)
where DNN-LID_segment-level is the segment-level language classifier based on a deep neural network (DNN);
step 3-3) the language-state sequence that maximizes the objective function is taken as the optimal language-state sequence, and language-state backtracking is carried out over the optimal language-state sequence to obtain the optimal language-state path.
2. The method for recognizing speech content in a multilingual continuous speech stream according to claim 1, further comprising a training step of the multilingual acoustic model, comprising the specific steps of:
step 1-1) constructing a multi-language acoustic model based on a multi-task-learning neural network, wherein the model comprises a plurality of shared hidden layers and a plurality of language-specific output layers;
step 1-2) extracting spectral features of multi-language continuous voice streams of a training set based on acoustic state labels of multi-language continuous voice data, and inputting the spectral features into a shared hidden layer for nonlinear transformation; outputting the data of a plurality of single languages to a plurality of language specific output layers;
step 1-3) calculating the error loss function value of the single-language data at the language-specific output layer corresponding to the input spectral features:
the error loss function F_loss,i is:
F_loss,i = −Σ q_label,L · log p_model,i(x_L)   (1)
where F_loss,i is the error loss value of the i-th language-specific output layer, p_model,i(x_L) is the output of the i-th language-specific output layer for the spectral feature x_L of the L-th language, and q_label,L is the acoustic state label corresponding to the spectral feature x_L; the error loss function values of the other output layers are zero;
step 1-4) passing the error loss value F_loss,i backward; the parameters of each language-specific output layer are updated only from the data of its corresponding single language, and the gradient of the language-specific output layer parameters is:
Δφ_i = ∂F_loss,i / ∂φ_i   (2)
where φ_i denotes the parameters of the i-th language-specific output layer;
the parameters of the shared hidden layers are updated from the error loss values F_loss,i returned by all language-specific output layers, the gradient of the shared hidden layer parameters being:
Δφ = Σ_{i=1}^{L} ∂F_loss,i / ∂φ   (3)
where φ denotes the parameters of the shared hidden layers and L is the number of languages corresponding to the language-specific output layers of the multilingual acoustic model;
step 1-5) if F_loss,i is greater than a given threshold, return to step 1-2);
when F_loss,i is smaller than the given threshold, the trained multilingual acoustic model is obtained.
3. The method of claim 1, further comprising the step of training a frame-level language classification model, comprising the steps of:
step 2-1), constructing a frame-level language classification model, wherein the frame-level language classification model is a deep neural network;
step 2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into a frame-level language classification model, carrying out long-time statistics on the output vector of the current hidden layer, and calculating a mean value vector, a variance vector and a segment-level language feature vector of the output vector of the current hidden layer;
the mean vector μ is:
μ = (1/T) · Σ_{i=1}^{T} h_i   (4)
the variance vector σ is:
σ = (1/T) · Σ_{i=1}^{T} (h_i − μ)²   (5)
the segment-level language feature vector h_segment is:
h_segment = Append(μ, σ)   (6)
where h_i is the output vector of the current hidden layer at time i, T is the long-term statistics period, μ is the mean vector of the long-term statistics, σ is the variance vector of the long-term statistics, and h_segment is the segment-level language feature vector formed by concatenating the mean vector and the variance vector, whose dimension is 2 times the dimension of h_i; Append(μ, σ) denotes concatenating μ and σ into a higher-dimensional vector;
step 2-3) taking the mean vector and the variance vector as the input of the next hidden layer, training according to the frame-level language labels through error calculation and reverse gradient feedback process, and enabling each hidden layer to output segment-level language feature vectors to obtain a trained frame-level language classification model.
4. The method of claim 1, further comprising the step of training a segment-level language classification model, comprising the steps of:
s2-1), constructing a segment level language classification model;
s2-2) extracting frame-level spectrum features of the multi-language continuous voice stream of the training set, inputting the frame-level spectrum features into an implicit layer of a trained frame-level language classification model, and extracting segment-level language feature vectors from the implicit layer of the trained frame-level language classification model;
step S2-3) setting a segment-level language label for each segment-level language feature vector, inputting the segment-level language feature vectors and their labels into the segment-level language classification model for training, and outputting the posterior probability distribution of the language state corresponding to each segment-level language feature vector, obtaining a trained segment-level language classification model.
5. The method for recognizing speech content according to claim 1, wherein the multi-language continuous speech stream to be recognized is input into a frame-level language classification model to output segment-level language feature vectors; inputting the segment level language feature vector into the segment level language classification model to output posterior probability distribution of the language state; the method specifically comprises the following steps:
extracting to-be-identified frame-level spectrum features from a multi-language continuous voice stream to be identified;
inputting the frame-level spectral features to be recognized into the trained frame-level language classification model with a specific step length and window length, and outputting the segment-level language feature vector h_segment;
inputting the segment-level language feature vector h_segment into the trained segment-level language classification model, and outputting the posterior probability distribution of the language state corresponding to the segment-level language feature vector.
6. A system based on the multi-lingual continuous voice stream voice content recognition method of claim 1, the system comprising:
the segment level language feature extraction module is used for inputting the multi-language continuous voice stream to be recognized into the frame level language classification model and outputting segment level language feature vectors;
the posterior probability calculation module of the language state inputs the segment-level language feature vector into the segment-level language classification model, and outputs posterior probability distribution of the segment-level language state;
the language state path acquisition module is used for calculating an optimal language state path of the multilingual voice stream based on a Viterbi retrieval algorithm according to posterior probability distribution of the segment-level language states;
the language state interval segmentation module is used for segmenting the multi-language continuous voice stream to be identified according to the optimal language state path to obtain a language state interval; and
and the content recognition module is used for sending the segmented language state interval into a multilingual acoustic model and a corresponding multilingual decoder for decoding to obtain a content recognition result of the multilingual continuous voice stream.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-5 when executing the computer program.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-5.
CN201910782981.2A 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system Active CN112489622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910782981.2A CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Publications (2)

Publication Number Publication Date
CN112489622A CN112489622A (en) 2021-03-12
CN112489622B true CN112489622B (en) 2024-03-19

Family

ID=74920171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910782981.2A Active CN112489622B (en) 2019-08-23 2019-08-23 Multi-language continuous voice stream voice content recognition method and system

Country Status (1)

Country Link
CN (1) CN112489622B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870839B (en) * 2021-09-29 2022-05-03 北京中科智加科技有限公司 Language identification device of language identification model based on multitask
CN114078468B (en) * 2022-01-19 2022-05-13 广州小鹏汽车科技有限公司 Voice multi-language recognition method, device, terminal and storage medium
CN114582329A (en) * 2022-03-03 2022-06-03 北京有竹居网络技术有限公司 Voice recognition method and device, computer readable medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043431B2 (en) * 2001-08-31 2006-05-09 Nokia Corporation Multilingual speech recognition system using text derived recognition models
US10593321B2 (en) * 2017-12-15 2020-03-17 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for multi-lingual end-to-end speech recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460711B1 (en) * 2013-04-15 2016-10-04 Google Inc. Multilingual, acoustic deep neural networks
CN104036774A (en) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 Method and system for recognizing Tibetan dialects
CN108492820A (en) * 2018-03-20 2018-09-04 华南理工大学 Chinese speech recognition method based on Recognition with Recurrent Neural Network language model and deep neural network acoustic model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Thomas Niesler et al., "Language identification and multilingual speech recognition using discriminatively trained acoustic models", Computer Science, Linguistics, 2006, pp. 1-6 *
Aanchan Mohan et al., "Multi-lingual speech recognition with low-rank multi-task deep neural networks", ICASSP 2015, 2015, pp. 4994-4998 *
Yao Haitao et al., "Research on acoustic modeling methods for multilingual speech recognition" (面向多语言的语音识别声学模型建模方法研究), Technical Acoustics (声学技术), 2015, Vol. 34, No. 6, pp. 404-407 *

Also Published As

Publication number Publication date
CN112489622A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN112489622B (en) Multi-language continuous voice stream voice content recognition method and system
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN112528676B (en) Document-level event argument extraction method
CN111709242B (en) Chinese punctuation mark adding method based on named entity recognition
CN110968660B (en) Information extraction method and system based on joint training model
Fonseca et al. A two-step convolutional neural network approach for semantic role labeling
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN106547737A (en) Based on the sequence labelling method in the natural language processing of deep learning
Li et al. Text-to-text generative adversarial networks
CN111651998B (en) Weak supervision deep learning semantic analysis method under virtual reality and augmented reality scenes
CN111274804A (en) Case information extraction method based on named entity recognition
CN111581970B (en) Text recognition method, device and storage medium for network context
CN115292463B (en) Information extraction-based method for joint multi-intention detection and overlapping slot filling
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN111476024A (en) Text word segmentation method and device and model training method
CN113204952A (en) Multi-intention and semantic slot joint identification method based on clustering pre-analysis
CN114860942B (en) Text intention classification method, device, equipment and storage medium
CN111435375A (en) Threat information automatic labeling method based on FastText
CN115168541A (en) Chapter event extraction method and system based on frame semantic mapping and type perception
CN114743143A (en) Video description generation method based on multi-concept knowledge mining and storage medium
CN112507124A (en) Chapter-level event causal relationship extraction method based on graph model
CN113239694B (en) Argument role identification method based on argument phrase
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN111444720A (en) Named entity recognition method for English text
Lin et al. Ctc network with statistical language modeling for action sequence recognition in videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant