CN113077785B - End-to-end multi-language continuous voice stream voice content identification method and system - Google Patents

End-to-end multi-language continuous voice stream voice content identification method and system

Info

Publication number
CN113077785B
Authority
CN
China
Prior art keywords
language
vector
speech
level
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911300918.7A
Other languages
Chinese (zh)
Other versions
CN113077785A (en)
Inventor
徐及
林格平
刘丹阳
万辛
张鹏远
李娅强
刘发强
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN201911300918.7A
Publication of CN113077785A
Application granted
Publication of CN113077785B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Abstract

The invention belongs to the technical field of network communication, and particularly relates to an end-to-end multi-language continuous voice stream voice content recognition method, comprising the following steps: inputting the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model and extracting a sentence-level language state posterior probability distribution vector; and inputting the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model and outputting the speech recognition result of the corresponding language.

Description

End-to-end multi-language continuous voice stream voice content identification method and system
Technical Field
The invention belongs to the technical field of network communication and voice recognition, and particularly relates to an end-to-end multi-language continuous voice stream voice content recognition method and system.
Background
Currently, the end-to-end recognition framework has been widely applied to automatic speech recognition tasks. Because the end-to-end framework does not rely on a pronunciation dictionary when a speech recognition system is built, it is more flexible for building speech recognition systems for new languages as well as multi-language speech recognition systems. Furthermore, an end-to-end speech recognition model can directly model the mapping relationship between the acoustic feature sequence and the text modeling unit sequence. Compared with a traditional speech recognition system based on separate acoustic modeling and language modeling, the end-to-end framework unifies the acoustic and language modeling processes and effectively reduces the complexity of building a speech recognition system.
In the construction of a multi-language speech recognition system, although the end-to-end framework can reduce the complexity of system construction, it brings a new problem to multi-language speech recognition. The multi-language end-to-end framework models the modeling units of all languages under a unified framework, and because the pronunciation mechanisms and grammar rules of different languages differ greatly, the modeling units of different languages inevitably interfere with each other when modeled jointly, compared with a single-language speech recognition system. Existing voice content recognition methods therefore cannot effectively improve the language discriminability of a multi-language speech recognition system.
Disclosure of Invention
The invention aims to overcome the defects of existing voice recognition methods and provides an end-to-end multi-language continuous voice stream voice content recognition method and system, in particular an end-to-end multi-language voice recognition method based on a multi-attention mechanism.
In order to achieve the above object, the present invention provides an end-to-end method for recognizing the voice content of a multi-language continuous voice stream, the method comprising:
inputting speech frequency spectrum characteristics to be recognized into a pre-constructed deep neural network-based segment-level language classification model, and outputting a sentence-level language state posterior probability distribution vector;
inputting the to-be-recognized speech frequency spectrum characteristic sequence of each language type and the sentence level language state posterior probability distribution vector to a pre-constructed multi-language speech recognition model, and outputting the speech recognition result of the corresponding language type.
As an improvement of the above technical solution, the method further includes: obtaining the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combining the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtaining the corresponding decoding-network prediction sequence, and finally obtaining the multi-language speech recognition result.
As an improvement of the above technical solution, the method further includes a training step for the deep-neural-network-based segment-level language classification model, which specifically comprises:
extracting the frame-level voice frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting the frame-level voice frequency spectrum characteristics into the language classification model of the section level, carrying out long-term statistics on the output vector of the current hidden layer, and calculating the mean vector, the variance vector and the section-level statistical vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) Σ_{j=1}^{T} h_j
the variance vector is:
σ = (1/T) Σ_{j=1}^{T} (h_j − μ)²
the segment-level statistical vector is:
h_segment = Append(μ, σ)    (6)
where h_j is the output vector of the current hidden layer at time j; T is the long-term statistical period; μ is the mean vector of the long-term statistics; σ is the variance vector of the long-term statistics, computed element-wise; and h_segment is the segment-level statistical vector; the segment-level statistical vector is formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of h_j; Append(μ, σ) denotes splicing μ and σ into a higher-dimensional vector;
the segment-level statistical vector h_segment is used as the input of the next hidden layer; according to the segment-level language labels, the trained segment-level language classification model is obtained through error computation and back-propagation of gradients, completing the establishment of the segment-level language classification model.
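The segment-level statistics described above can be illustrated with a minimal PyTorch sketch of a statistics-pooling layer and a segment-level language classifier built around it. The layer sizes, layer names, and the surrounding frame-level/segment-level structure are illustrative assumptions and are not specified by the patent.

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Long-term statistics over frame-level hidden outputs h_1..h_T:
    concatenates the mean and variance vectors (cf. the mean/variance/Append formulas)."""
    def forward(self, h):                          # h: (batch, T, hidden_dim)
        mu = h.mean(dim=1)                         # mean vector over the statistical period T
        sigma = h.var(dim=1, unbiased=False)       # element-wise variance vector
        return torch.cat([mu, sigma], dim=-1)      # Append(mu, sigma): 2 x hidden_dim

class SegmentLanguageClassifier(nn.Module):
    """Hypothetical segment-level language classifier: frame-level layers,
    statistics pooling, then a segment-level layer producing language logits."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_languages=4):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.pool = StatsPooling()
        self.segment_layers = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, x):                          # x: (batch, T, feat_dim) spectral features
        h = self.frame_layers(x)                   # frame-level hidden outputs
        return self.segment_layers(self.pool(h))   # segment-level language logits

# Training with segment-level language labels would proceed by ordinary
# cross-entropy loss and back-propagation, as described in the text.
```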
As an improvement of the above technical solution, the multi-language speech recognition model includes an encoding network, a plurality of attention mechanism modules, and a decoding network; a corresponding number of attention mechanism modules is set according to the number of language types to be recognized;
that is, the number of attention mechanism modules is set according to the number of language types contained in the speech frequency spectrum features to be recognized.
As an improvement of the above technical solution, the training step of the attention mechanism module specifically includes:
sequence h of states of speech featuresencInputting the input data to a corresponding attention mechanism module, and outputting a corresponding output state sequence;
according to equation (2), the corresponding output sequence is obtained:
el t,i=wTtanh(Wlhenc+Vlhdec i+Ul(Fl*al t,i-1)+bl) (2)
wherein l represents a language type label of multiple languages; e.g. of the typel t,iThe output state of the attention mechanism module represents the speech spectrum feature to be recognized in the t-th frame; w is aT,Wl,Vl,UlRespectively representing a first transformation matrix, a second transformation matrix, a third transformation matrix and a fourth transformation matrix; blRepresenting a bias vector; tanh () represents a nonlinear activation function; flRepresenting a convolution function;
Figure BDA0002321749550000031
representing the output state of the t frame coding network; h isdec iAn implied layer state representing an ith output modeling unit of the decoding network; a is al t,i-1A weight value corresponding to the attention weight vector of the ith language category in the t frame of the (i-1) th output modeling unit;
obtaining attention weight vectors of corresponding language types according to the corresponding output state sequences;
specifically, according to formula (3), the attention weight vector of the corresponding language category is obtained:
Figure BDA0002321749550000032
wherein, al t,iThe attention weight vector of the ith language category corresponds to the weight value of the t frame of the ith output modeling unit; e.g. of the typel t′,iThe output state of the attention mechanism module corresponding to the ith output modeling unit for the t' th frame to be recognized voice spectrum features; t' is more than or equal to 1 and less than or equal to T and is the corresponding frame of the voice characteristic sequence.
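The attention computation of equations (2)-(3) can be sketched in PyTorch as follows. This is a hedged illustration of a location-aware attention module per language; the class name LanguageAttention, the dimensions, and the convolution width are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAttention(nn.Module):
    """Sketch of one language-specific attention module (Eqs. (2)-(3))."""
    def __init__(self, enc_dim, dec_dim, att_dim, conv_channels=10, conv_width=100):
        super().__init__()
        self.W = nn.Linear(enc_dim, att_dim, bias=False)          # W^l
        self.V = nn.Linear(dec_dim, att_dim, bias=False)          # V^l
        self.U = nn.Linear(conv_channels, att_dim, bias=False)    # U^l
        self.conv = nn.Conv1d(1, conv_channels, conv_width,
                              padding=conv_width // 2)            # F^l (convolution function)
        self.b = nn.Parameter(torch.zeros(att_dim))               # b^l
        self.w = nn.Linear(att_dim, 1, bias=False)                # w^T

    def forward(self, h_enc, h_dec_i, a_prev):
        # h_enc: (batch, T, enc_dim); h_dec_i: (batch, dec_dim); a_prev: (batch, T)
        f = self.conv(a_prev.unsqueeze(1))                        # F^l * a^l_{i-1}
        f = f[:, :, :h_enc.size(1)].transpose(1, 2)               # (batch, T, conv_channels)
        e = self.w(torch.tanh(self.W(h_enc)
                              + self.V(h_dec_i).unsqueeze(1)
                              + self.U(f) + self.b)).squeeze(-1)  # Eq. (2): (batch, T)
        return F.softmax(e, dim=-1)                               # Eq. (3): attention weights
```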
As one improvement of the above technical solution, the to-be-recognized speech frequency spectrum feature sequence and the sentence-level language state posterior probability distribution vector of each language category are input to a pre-constructed multi-language speech recognition model, and a speech recognition result of the corresponding language category is output; the method specifically comprises the following steps:
inputting the speech frequency spectrum characteristics to be recognized of each language type into a coding network, and outputting a state sequence of corresponding speech characteristics;
according to the formula (1), obtaining the state sequence h of the corresponding voice featuresenc
henc=Encoder(x) (1)
Wherein the content of the first and second substances,
Figure BDA0002321749550000041
a state sequence of speech features, namely a hidden state output sequence of a coding network; x is (x)1,x2,...,xt,...,xT) The method comprises the steps of inputting a speech frequency spectrum characteristic sequence to be recognized, namely an input characteristic; wherein, T is the total frame number of the input characteristic sequence; encoder () is a calculation function of a coding network based on a convolutional neural network/bidirectional long-and-short-term memory network;
carrying out weighted summation on the corresponding voice characteristic state sequence and the attention weight vector of the corresponding language type to obtain a corresponding attention context content vector;
specifically, according to formula (4), a corresponding attention context content vector is obtained;
Figure BDA0002321749550000042
wherein, cl iRepresenting a corresponding attention context content vector, namely an attention context content vector obtained by weighting and summing the coding network by the ith language class;
under the condition of multi-attention mechanism, distributing vector V by language statelAnd carrying out weighted summation with the corresponding attention context content vector to obtain a final attention context content vector:
Figure BDA0002321749550000043
wherein, VlFor language state distribution vectors, i.e. Vl=(wl 1,wl 2,...,wl n,...,wl N) (ii) a N is the number of the language types of the multiple languages to be identified;
and inputting the final attention context content vector to a decoding network to obtain a speech recognition result of the language category.
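A sketch of equation (1), the per-language context vectors of equation (4), and the language-weighted combination is given below, reusing the LanguageAttention sketch above. The BLSTM configuration of the encoder and the helper name language_weighted_context are illustrative assumptions rather than the patent's specified implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the CNN/BLSTM encoding network of Eq. (1); the actual layer
    configuration is not specified here, so a plain 2-layer BLSTM is assumed."""
    def __init__(self, feat_dim=40, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, T, feat_dim)
        h_enc, _ = self.blstm(x)               # hidden-state output sequence, (batch, T, 2*hidden)
        return h_enc

def language_weighted_context(h_enc, attentions, h_dec_i, a_prev_list, lang_posterior):
    """Per-language context vectors (Eq. (4)) combined with the sentence-level
    language posterior V_L (final context vector). `attentions` holds one
    LanguageAttention module per language."""
    contexts, weights = [], []
    for l, att in enumerate(attentions):
        a = att(h_enc, h_dec_i, a_prev_list[l])            # attention weights, Eq. (3)
        c_l = torch.bmm(a.unsqueeze(1), h_enc).squeeze(1)  # Eq. (4): weighted sum over frames
        contexts.append(c_l)
        weights.append(a)
    c_stack = torch.stack(contexts, dim=1)                 # (batch, N, enc_dim)
    c_i = (lang_posterior.unsqueeze(-1) * c_stack).sum(dim=1)  # weight contexts by V_L
    return c_i, weights
```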
The invention also provides an end-to-end multi-language continuous voice stream voice content recognition system, comprising an extraction module and a voice recognition module;
the extraction module is used for inputting the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model and extracting the sentence-level language state posterior probability distribution vector according to the segment-level language classification model;
the speech recognition module inputs the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model and outputs the speech recognition result of the corresponding language.
As an improvement of the above technical solution, the system further includes a voice result acquisition module, configured to obtain the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combine the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtain the corresponding decoding-network prediction sequence, and finally obtain the multi-language speech recognition result.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the computer program.
The invention further provides a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the above-mentioned method.
Compared with the prior art, the invention has the beneficial effects that:
the method is an end-to-end multi-language voice recognition method based on a multi-attention machine system, a specific attention machine module is constructed for each language under an end-to-end framework based on the attention machine system, and the attention machine module carries out language specific modeling on the mapping relation between an input spectrum characteristic sequence and an output annotation sequence of a specific language. In addition, language classification information is introduced into an end-to-end modeling process, and output information of the multi-semantic machine module is weighted, so that language distinctiveness of the multi-language voice recognition system can be effectively improved.
Drawings
Fig. 1 is a flow chart of a method for recognizing speech contents of end-to-end multilingual continuous speech streams according to the present invention.
Detailed Description
The invention will now be further described with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides an end-to-end method for recognizing speech contents of continuous voice streams in multiple languages, the method comprising:
inputting the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model, and extracting, according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector V_L, thereby obtaining the language classification result of the corresponding language; the language classification result of the corresponding language is the sentence-level language state posterior probability distribution vector V_L; the speech frequency spectrum features to be recognized are the frequency-domain representation obtained by applying a Fourier transform to the multi-language continuous voice stream, where a multi-language continuous voice stream refers to a voice stream that contains only one language, whose language type, however, is unknown in advance.
Specifically, the speech frequency spectrum feature sequence to be recognized is input into the segment-level language classification model and propagated forward through the neural network; according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector V_L is extracted, and the language classification result of the corresponding language is obtained.
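As a minimal sketch of this forward pass, assuming the SegmentLanguageClassifier sketched earlier and that the sentence-level posterior V_L is obtained by a softmax over the language logits (the helper name language_posterior is illustrative):

```python
import torch
import torch.nn.functional as F

def language_posterior(classifier, features):
    """Forward pass producing the sentence-level language state posterior
    vector V_L and the index of the most likely language.
    features: (1, T, feat_dim) spectral features of one utterance."""
    with torch.no_grad():
        logits = classifier(features)          # (1, num_languages)
        V = F.softmax(logits, dim=-1)          # posterior distribution over languages
    return V.squeeze(0), int(V.argmax(dim=-1))
```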
The establishing of the segment-level language classification model based on the deep neural network specifically comprises the following steps:
extracting the frame-level voice frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting the frame-level voice frequency spectrum characteristics into the language classification model of the section level, carrying out long-term statistics on the output vector of the current hidden layer, and calculating the mean vector, the variance vector and the section-level statistical vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) Σ_{j=1}^{T} h_j
the variance vector is:
σ = (1/T) Σ_{j=1}^{T} (h_j − μ)²
the segment-level statistical vector is:
h_segment = Append(μ, σ)    (6)
where h_j is the output vector of the current hidden layer at time j; T is the long-term statistical period; μ is the mean vector of the long-term statistics; σ is the variance vector of the long-term statistics, computed element-wise; and h_segment is the segment-level statistical vector; the segment-level statistical vector is formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of h_j; Append(μ, σ) denotes splicing μ and σ into a higher-dimensional vector;
the segment-level statistical vector h_segment is used as the input of the next hidden layer; according to the segment-level language labels, the trained segment-level language classification model is obtained through error computation and back-propagation of gradients, completing the establishment of the segment-level language classification model. Here, a language label is a label carrying the language category.
Inputting the to-be-recognized speech frequency spectrum characteristic sequence of each language type and the sentence level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model, and outputting the speech recognition result of the corresponding language type.
As shown in fig. 1, the multilingual speech recognition model includes: an encoding network, a plurality of attention mechanism modules (attention mechanism module 1, attention mechanism module 2, …, attention mechanism module N) and a decoding network. Setting a corresponding number of attention mechanism modules according to the number of the language types to be identified;
specifically, according to the number of language types contained in the speech frequency spectrum feature to be recognized, a corresponding number of attention mechanism modules are set;
inputting the speech frequency spectrum characteristics to be recognized of each language type into a coding network, and outputting a state sequence of corresponding speech characteristics;
specifically, according to formula (1), a state sequence h of the corresponding speech feature is obtainedenc
henc=Encoder(x) (1)
Wherein h isenc=(henc 1,henc 2,...,henc t,...,henc T) A state sequence of speech features, namely a hidden state output sequence of a coding network; x is (x)1,x2,...,xt,...,xT) The method comprises the steps of inputting a speech frequency spectrum characteristic sequence to be recognized, namely an input characteristic; wherein, T is the total frame number of the input characteristic sequence; encoder () is a computational function of a convolutional neural network/bidirectional long-term memory network (CNN/BLSTM) -based coding network.
Corresponding state sequence h of voice characteristicsencInputting the input data to a corresponding attention mechanism module, and outputting a corresponding output state sequence;
specifically, according to equation (2), the corresponding output sequence is obtained:
el t,i=wTtanh(Wlhenc+Vlhdec i+Ul(Fl*al t,i-1)+bl) (2)
wherein l represents a language type label of multiple languages; e.g. of the typel t,iThe output state of the attention mechanism module of the speech spectrum feature to be recognized of the t frame is represented; w is aT,Wl,Vl,UlRespectively representing a first transformation matrix, a second transformation matrix, a third transformation matrix and a fourth transformation matrix; b is a mixture oflRepresenting a bias vector; tanh () represents a nonlinear activation function; flRepresenting a convolution function; h isenc tRepresenting the output state of the t frame coding network; h is a total ofdec iAn implied layer state representing an ith output modeling unit of the decoding network; a isl t,i-1A weight value corresponding to the attention weight vector of the ith language category in the t frame of the (i-1) th output modeling unit;
obtaining attention weight vectors of corresponding language types according to the corresponding output state sequences;
specifically, according to formula (3), the attention weight vector of the corresponding language category is obtained:
Figure BDA0002321749550000071
wherein, al t,iRepresenting a weight value corresponding to the attention weight vector representing the ith language category at the t frame of the ith output modeling unit; e.g. of the typel t′,iThe output state of the attention mechanism module corresponding to the ith output modeling unit is used for the t' th frame to-be-recognized speech frequency spectrum characteristic; t' is more than or equal to 1 and less than or equal to T and is the corresponding frame of the voice characteristic sequence;
carrying out weighted summation on the corresponding voice characteristic state sequence and the attention weight vector of the corresponding language type to obtain a corresponding attention context content vector;
specifically, according to formula (4), a corresponding attention context content vector is obtained;
c^l_i = Σ_{t=1}^{T} a^l_{t,i} h^enc_t    (4)
where c^l_i denotes the corresponding attention context content vector, i.e., the attention context content vector obtained by weighted summation of the encoding-network outputs for the l-th language;
under the multi-attention mechanism, the language state distribution vector V_L is used to weight and sum the corresponding attention context content vectors, obtaining the final attention context content vector:
c_i = Σ_{l=1}^{N} w_l c^l_i
where V_L is the language state distribution vector, i.e., V_L = (w_1, w_2, ..., w_n, ..., w_N), with w_l the sentence-level posterior probability of the l-th language; and N is the number of language types to be recognized;
and inputting the final attention context content vector to a decoding network to obtain a speech recognition result of the language category.
The method further comprises the following steps: and combining the language classification result of the language type with historical information of the speech recognition result of the corresponding language type output by the decoding network in the pre-constructed multi-language speech recognition model to obtain a corresponding decoding network prediction sequence, and finally obtaining the multi-language speech recognition result.
Specifically, to predict the i-th output modeling unit y_i of the decoding network (the output modeling units are the language-1 output modeling units, ..., language-N output modeling units shown in fig. 1), the hidden-layer state h^dec_i of the decoding network for the i-th output modeling unit is predicted first, where the inputs of the decoding network are the (i-1)-th output modeling unit and the attention context content vector c_i, as shown in equation (6); finally, combining the decoding network with the softmax function, the probability p(y_i | y_{1:i-1}, x) of the i-th output modeling unit y_i is predicted from the hidden-layer state h^dec_i, as shown in equation (7):
h^dec_i = Decoder(y_{i-1}, c_i)    (6)
p(y_i | y_{1:i-1}, x) = softmax(h^dec_i)    (7)
where x denotes the input speech frequency spectrum feature sequence to be recognized; y_{i-1} is the (i-1)-th output modeling unit of the decoding network; c_i is the final attention context content vector; y_{1:i-1} is the history information from the 1st to the (i-1)-th output of the decoding network; p(y_i | y_{1:i-1}, x) is the predicted probability of the i-th output modeling unit y_i of the decoding network; softmax(h^dec_i) applies the softmax function to the decoding-network hidden-layer state h^dec_i; y_i denotes the i-th output modeling unit of the decoding network; and Decoder() denotes the decoding network based on a long short-term memory network (LSTM).
By comparing the predicted probabilities p(y_i | y_{1:i-1}, x), the modeling unit y_i with the largest predicted probability is kept at the i-th prediction step; combining the results from the 1st to the I-th prediction steps yields the final speech recognition result y = (y_1, y_2, ..., y_i, ..., y_I).
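The decoding step of equations (6)-(7) and the greedy selection of the most probable modeling unit can be sketched as follows. The embedding size, hidden size, the helper names, and the simplification that one context vector per output step has already been produced by the attention modules are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the LSTM decoding network of Eqs. (6)-(7)."""
    def __init__(self, vocab_size, ctx_dim, embed_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm_cell = nn.LSTMCell(embed_dim + ctx_dim, hidden)
        self.proj = nn.Linear(hidden, vocab_size)

    def step(self, y_prev, c_i, state):
        """One decoding step: h^dec_i = Decoder(y_{i-1}, c_i), Eq. (6)."""
        inp = torch.cat([self.embed(y_prev), c_i], dim=-1)
        h, c = self.lstm_cell(inp, state)
        p = F.softmax(self.proj(h), dim=-1)        # Eq. (7): p(y_i | y_{1:i-1}, x)
        return p, (h, c)

def greedy_decode(decoder, contexts, sos_id, eos_id):
    """Keep the modeling unit with the largest predicted probability at each
    step, giving y = (y_1, ..., y_I); `contexts` holds one context vector per step."""
    y_prev = torch.tensor([sos_id])
    state, result = None, []
    for c_i in contexts:
        p, state = decoder.step(y_prev, c_i, state)
        y_i = int(p.argmax(dim=-1))
        if y_i == eos_id:
            break
        result.append(y_i)
        y_prev = torch.tensor([y_i])
    return result
```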
Because the time-step mapping between the input feature sequence and the output modeling unit sequence differs across languages, each attention module can be optimized according to the characteristics of its specific language, while model information is still shared among the multiple languages through the common encoding network and decoding network.
The invention also provides an end-to-end multi-language continuous voice stream voice content recognition system, implemented based on the above method and comprising:
an extraction module, configured to input the speech frequency spectrum features to be recognized into the pre-constructed deep-neural-network-based segment-level language classification model and to extract, according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector V_L;
a speech recognition module, configured to input the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector V_L into the pre-constructed multi-language speech recognition model and to output the speech recognition result of the corresponding language.
The system further comprises a voice result acquisition module, configured to obtain the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector V_L, combine the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtain the corresponding decoding-network prediction sequence, and finally obtain the multi-language speech recognition result.
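Tying the earlier sketches together, a hypothetical end-to-end flow of the extraction module and the recognition module could look like the following. The function name recognize_utterance and the dimension compatibility between the sketched components (encoder output size, decoder hidden size, attention dec_dim) are assumptions for illustration only, not the patent's concrete implementation.

```python
import torch

def recognize_utterance(classifier, encoder, attentions, decoder, features,
                        sos_id, eos_id, max_len=200):
    """Stage 1: the extraction module produces the sentence-level posterior V_L.
    Stage 2: the recognition module encodes the features, combines per-language
    attention contexts weighted by V_L, and decodes the text greedily."""
    V, _ = language_posterior(classifier, features)            # stage 1
    h_enc = encoder(features)                                  # Eq. (1)
    T = h_enc.size(1)
    a_prev = [torch.full((1, T), 1.0 / T) for _ in attentions] # uniform initial attention
    state, y_prev, result = None, torch.tensor([sos_id]), []
    for _ in range(max_len):
        # previous decoder hidden state (zeros before the first step)
        h_dec = state[0] if state is not None else torch.zeros(1, decoder.lstm_cell.hidden_size)
        c_i, a_prev = language_weighted_context(h_enc, attentions, h_dec,
                                                a_prev, V.unsqueeze(0))
        p, state = decoder.step(y_prev, c_i, state)            # Eqs. (6)-(7)
        y_i = int(p.argmax(dim=-1))
        if y_i == eos_id:
            break
        result.append(y_i)
        y_prev = torch.tensor([y_i])
    return result
```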
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the computer program.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the above-mentioned method.
The rationality and validity of the multi-language speech recognition system based on the multi-attention mechanism of the present invention have been verified in a real system, with the results shown in table 1:
TABLE 1 recognition results of the multilingual end-to-end recognition model (% word error Rate)
The method of the invention constructs a multi-language end-to-end speech recognition system using four languages: Tacari, Dorzol, Toki, and Haitian Creole. Tacari and Dorzol are variants of the same language used in different regions, while Toki and Haitian Creole are two different creole languages. A common feature of these four languages is that their annotation text consists of Latin letters and variants of Latin letters.
Therefore, the multi-language joint modeling based on the four languages can effectively share information and improve the performance of the multi-language speech recognition system. From table 1, compared to the single-language end-to-end recognition model and the multi-language end-to-end recognition system without the multi-attention mechanism module, the method of the present invention effectively reduces the word error rate of the multi-language recognition model from an average of 62.6% to 60.3% in four languages by merging language information into the multi-language recognition method and combining the multi-attention mechanism module.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that the technical solutions of the present invention may be modified or substituted with equivalents without departing from their spirit and scope, and all such modifications should be covered by the scope of the claims of the present invention.

Claims (10)

1. An end-to-end method for recognizing speech contents of a continuous speech stream with multiple languages, the method comprising:
inputting speech frequency spectrum characteristics to be recognized into a pre-constructed deep neural network-based segment-level language classification model, and outputting a sentence-level language state posterior probability distribution vector;
inputting the to-be-recognized speech frequency spectrum characteristic sequence of each language type and the sentence level language state posterior probability distribution vector to a pre-constructed multi-language speech recognition model, and outputting the speech recognition result of the corresponding language type.
2. The method of claim 1, further comprising: obtaining the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combining the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtaining the corresponding decoding-network prediction sequence, and finally obtaining the multi-language speech recognition result.
3. The method of claim 1, further comprising: the deep neural network-based segment-level language classification model training method specifically comprises the following steps:
extracting the frame-level voice frequency spectrum characteristics of the multi-language continuous voice stream of the training set, inputting the frame-level voice frequency spectrum characteristics into the language classification model of the section level, carrying out long-term statistics on the output vector of the current hidden layer, and calculating the mean vector, the variance vector and the section-level statistical vector of the output vector of the current hidden layer;
the mean vector is:
μ = (1/T) Σ_{j=1}^{T} h_j
the variance vector is:
σ = (1/T) Σ_{j=1}^{T} (h_j − μ)²
the segment-level statistical vector is:
h_segment = Append(μ, σ)    (6)
where h_j is the output vector of the current hidden layer at time j; T is the long-term statistical period; μ is the mean vector of the long-term statistics; σ is the variance vector of the long-term statistics, computed element-wise; and h_segment is the segment-level statistical vector; the segment-level statistical vector is formed by splicing the mean vector and the variance vector together, so its dimension is 2 times the dimension of h_j; Append(μ, σ) denotes splicing μ and σ into a higher-dimensional vector;
the segment-level statistical vector h_segment is used as the input of the next hidden layer; according to the segment-level language labels, the trained segment-level language classification model is obtained through error computation and back-propagation of gradients, completing the establishment of the segment-level language classification model.
4. The method of claim 1, wherein the multi-lingual speech recognition model comprises: the system comprises an encoding network, a plurality of attention mechanism modules and a decoding network; setting a corresponding number of attention mechanism modules according to the number of the language types to be identified;
and setting a corresponding number of attention mechanism modules according to the number of language types contained in the speech spectrum characteristics to be recognized.
5. The method according to claim 4, wherein the step of training the attention mechanism module comprises in particular:
sequence h of states of speech featuresencInputting the input data to a corresponding attention mechanism module, and outputting a corresponding output state sequence;
according to equation (2), the corresponding output sequence is obtained:
el t,i=wTtanh(Wlhenc+Vlhdec i+Ul(Fl*al t,i-1)+bl) (2)
wherein l represents a language type label of multiple languages; e.g. of a cylinderl t,iThe output state of the attention mechanism module represents the speech spectrum feature to be recognized in the t-th frame; w is aT,Wl,Vl,UlRespectively representing a first transformation matrix, a second transformation matrix, a third transformation matrix and a fourth transformation matrix; blRepresenting a bias vector; tanh () represents a nonlinear activation function; flRepresenting a convolution function;
Figure FDA0002321749540000021
representing a tth frame encoded netThe output state of the complex; h isdec iRepresenting the hidden layer states of the ith output modeling unit of the decoding network; a is al t,i-1A weight value corresponding to the attention weight vector of the ith language category in the t frame of the (i-1) th output modeling unit;
obtaining attention weight vectors of corresponding language types according to the corresponding output state sequences;
specifically, according to formula (3), the attention weight vector of the corresponding language category is obtained:
Figure FDA0002321749540000022
wherein, al t,iThe attention weight vector of the ith language category corresponds to the weight value of the t frame of the ith output modeling unit; e.g. of the typel t′,iThe output state of the attention mechanism module corresponding to the ith output modeling unit is used for the t' th frame to-be-recognized speech frequency spectrum characteristic; t' is more than or equal to 1 and less than or equal to T and is the corresponding frame of the voice characteristic sequence.
6. The method according to claim 1, wherein the speech spectrum feature sequence to be recognized and the sentence-level language state posterior probability distribution vector of each language category are input to a pre-constructed multi-language speech recognition model, and the speech recognition result of the corresponding language category is output; the method comprises the following specific steps:
inputting the speech frequency spectrum characteristics to be recognized of each language type into a coding network, and outputting a state sequence of corresponding speech characteristics;
according to the formula (1), obtaining the state sequence h of the corresponding voice featuresenc
henc=Encoder(x) (1)
Wherein the content of the first and second substances,
Figure FDA0002321749540000031
state sequences characterised by speech, i.e. coding networksThe hidden layer state output sequence of (1); x ═ x1,x2,...,xt,...,xT) The method comprises the steps of inputting a speech frequency spectrum characteristic sequence to be recognized, namely an input characteristic; wherein, T is the total frame number of the input characteristic sequence; encoder () is a calculation function of a coding network based on a convolutional neural network/bidirectional long-and-short-term memory network;
carrying out weighted summation on the corresponding voice characteristic state sequence and the attention weight vector of the corresponding language type to obtain a corresponding attention context content vector;
specifically, according to formula (4), a corresponding attention context content vector is obtained;
Figure FDA0002321749540000032
wherein, cl iRepresenting a corresponding attention context content vector, namely an attention context content vector obtained by weighting and summing the coding network by the ith language class;
under the condition of multi-attention mechanism, distributing vector V by language statelAnd carrying out weighted summation with the corresponding attention context content vector to obtain a final attention context content vector:
Figure FDA0002321749540000033
wherein, VlFor language state distribution vectors, i.e. Vl=(wl 1,wl 2,...,wl n,...,wl N) (ii) a N is the number of the language types of the multiple languages to be identified;
and inputting the final attention context content vector to a decoding network to obtain a speech recognition result of the language category.
7. An end-to-end multi-language continuous voice stream voice content recognition system, comprising an extraction module and a voice recognition module;
the extraction module is configured to input the speech frequency spectrum features to be recognized into a pre-constructed deep-neural-network-based segment-level language classification model and to extract, according to the segment-level language classification model, the sentence-level language state posterior probability distribution vector;
the voice recognition module inputs the speech frequency spectrum feature sequence to be recognized of each language and the sentence-level language state posterior probability distribution vector into a pre-constructed multi-language speech recognition model and outputs the speech recognition result of the corresponding language.
8. The system of claim 7, further comprising a voice result acquisition module, configured to obtain the language classification result of the corresponding language according to the sentence-level language state posterior probability distribution vector, combine the language classification result with the history information of the speech recognition results of the corresponding language output by the decoding network in the pre-constructed multi-language speech recognition model, obtain the corresponding decoding-network prediction sequence, and finally obtain the multi-language speech recognition result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-6 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out the method of any of the preceding claims 1-6.
CN201911300918.7A 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system Active CN113077785B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911300918.7A CN113077785B (en) 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911300918.7A CN113077785B (en) 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system

Publications (2)

Publication Number Publication Date
CN113077785A CN113077785A (en) 2021-07-06
CN113077785B true CN113077785B (en) 2022-07-12

Family

ID=76608263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911300918.7A Active CN113077785B (en) 2019-12-17 2019-12-17 End-to-end multi-language continuous voice stream voice content identification method and system

Country Status (1)

Country Link
CN (1) CN113077785B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10672388B2 (en) * 2017-12-15 2020-06-02 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for open-vocabulary end-to-end speech recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263036B1 (en) * 2012-11-29 2016-02-16 Google Inc. System and method for speech recognition using deep recurrent neural networks
CN107408111A (en) * 2015-11-25 2017-11-28 百度(美国)有限责任公司 End-to-end speech recognition
CN106782518A (en) * 2016-11-25 2017-05-31 深圳市唯特视科技有限公司 A kind of audio recognition method based on layered circulation neutral net language model
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109523993A (en) * 2018-11-02 2019-03-26 成都三零凯天通信实业有限公司 A kind of voice languages classification method merging deep neural network with GRU based on CNN
CN110428818A (en) * 2019-08-09 2019-11-08 中国科学院自动化研究所 The multilingual speech recognition modeling of low-resource, audio recognition method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dai Lirong et al., "Deep-learning-based speech recognition technology: current status and prospects," Journal of Data Acquisition and Processing, 2017, No. 02. *
Miao Xiaoxiao et al., "Duration extension method applied to short-utterance spoken language identification," Journal of Tsinghua University (Science and Technology), 2018, No. 03. *
Jin Ma et al., "Language identification system based on convolutional neural networks," Journal of Data Acquisition and Processing, 2019, No. 02. *

Also Published As

Publication number Publication date
CN113077785A (en) 2021-07-06

Similar Documents

Publication Publication Date Title
CN108647207B (en) Natural language correction method, system, device and storage medium
Toshniwal et al. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition
CN108804611B (en) Dialog reply generation method and system based on self comment sequence learning
KR20200086214A (en) Real-time speech recognition method and apparatus based on truncated attention, equipment and computer-readable storage medium
CN111199727B (en) Speech recognition model training method, system, mobile terminal and storage medium
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
Mangal et al. LSTM vs. GRU vs. Bidirectional RNN for script generation
CN110569505B (en) Text input method and device
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111144110A (en) Pinyin marking method, device, server and storage medium
CN111738006A (en) Commodity comment named entity recognition-based problem generation method
WO2020108545A1 (en) Statement processing method, statement decoding method and apparatus, storage medium and device
CN112308080A (en) Image description prediction method for directional visual understanding and segmentation
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN113077785B (en) End-to-end multi-language continuous voice stream voice content identification method and system
CN115630651B (en) Text generation method and training method and device of text generation model
WO2023116572A1 (en) Word or sentence generation method and related device
CN115270792A (en) Medical entity identification method and device
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN112364602B (en) Multi-style text generation method, device, equipment and readable storage medium
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium
CN112434143A (en) Dialog method, storage medium and system based on hidden state constraint of GRU (generalized regression Unit)
CN114818644B (en) Text template generation method, device, equipment and storage medium
CN116822498B (en) Text error correction processing method, model processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant