CN113838468A - Streaming voice recognition method, terminal device and medium - Google Patents

Streaming voice recognition method, terminal device and medium

Info

Publication number
CN113838468A
CN113838468A CN202111119338.5A
Authority
CN
China
Prior art keywords
sequence
audio
encoder
encoding
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111119338.5A
Other languages
Chinese (zh)
Inventor
蔡旭浦
张俊杰
彭朋
荣玉军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111119338.5A
Publication of CN113838468A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a streaming voice recognition method, a terminal device and a computer-readable storage medium. The method comprises the following steps: acquiring a word embedding feature sequence and an audio feature sequence corresponding to an audio stream; encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence through a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding; and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a label probability distribution result, and determining a recognition result according to the probability distribution result. The present invention aims to reduce the amount of computation required for speech recognition.

Description

Streaming voice recognition method, terminal device and medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a streaming speech recognition method, a terminal device, and a computer-readable storage medium.
Background
Speech recognition is the process of converting a speech signal into the corresponding text by a computer. As a key entry point for human-machine speech interaction, it is an important research direction in the field of artificial intelligence.
In the related art, the mainstream implementation of the end-to-end model is based on an attention mechanism, which can better capture the context information of audio and text and thereby improve recognition accuracy. However, in the conventional scheme that implements speech recognition based on the attention mechanism, the attention weights must be determined according to the absolute position of each speech feature, so for streaming speech the amount of computation over the speech features grows rapidly as the speech length increases.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a streaming voice recognition method, a terminal device and a computer readable storage medium, aiming at achieving the effect of reducing the calculation amount of voice recognition.
In order to achieve the above object, the present invention provides a streaming voice recognition method, which includes the following steps:
acquiring a word embedding characteristic sequence and an audio characteristic sequence corresponding to an audio stream;
encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label, and determining an identification result according to the probability distribution result.
Optionally, before the steps of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence by a tag encoder to obtain a text context sequence, the method further includes:
detecting whether a preset mask window is filled;
updating the position coding sequence when the mask window is filled;
and executing the step of encoding the audio feature sequence through an audio encoder based on the updated position encoding sequence to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence.
Optionally, the audio encoder takes the audio feature sequence as an input vector, and the tag encoder takes the word embedding sequence as an input vector.
Optionally, the step of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence by a tag encoder to obtain a text context sequence includes:
the audio encoder and the tag encoder determine attention weight coefficients according to the input vector and the position coding sequence;
performing weighted calculation according to the weight coefficient and the input vector to obtain an initial result;
and inputting the initial result into a corresponding feedforward network layer to obtain the audio context sequence and the text context sequence.
Optionally, the relative position between corresponding audio feature vectors in the audio feature sequence is determined according to the absolute position of the audio feature vector in the audio feature sequence.
Optionally, before the step of detecting whether the preset mask window is filled, the method further includes:
and acquiring and identifying the current load condition of the system, and determining the window size of the mask window according to the current load condition.
Optionally, before the step of obtaining the word embedding feature sequence and the audio feature sequence corresponding to the audio stream, the method further includes:
receiving an audio data stream;
the step of obtaining the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream comprises the following steps:
and generating the audio characteristic sequence based on the audio data stream, and acquiring the word embedding characteristic sequence, wherein the audio characteristic sequence is a Mel frequency cepstrum coefficient or a Mel filter bank coefficient.
In addition, to achieve the above object, the present invention further provides a terminal device, which includes a memory, a processor, and a streaming voice recognition program stored in the memory and operable on the processor, wherein the streaming voice recognition program, when executed by the processor, implements the steps of the streaming voice recognition method as described above.
In addition, to achieve the above object, the present invention also provides a terminal device, including:
the acquisition module is used for acquiring the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream;
the encoding module is used for encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and the decoding module is used for inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label and determining an identification result according to the probability distribution result.
Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium having a streaming voice recognition program stored thereon, which when executed by a processor implements the steps of the streaming voice recognition method as described above.
The embodiment of the invention provides a streaming voice recognition method, a terminal device and a computer-readable storage medium. A word embedding feature sequence and an audio feature sequence corresponding to an audio stream are obtained; the audio feature sequence is encoded by an audio encoder to obtain an audio context sequence, and the word embedding sequence is encoded by a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding; the audio context sequence and the text context sequence are input into a joint decoder to obtain a label probability distribution result, and a recognition result is determined according to the probability distribution result. By using self-attention encoders based on relative position encoding, repeated calculations in the self-attention mechanism are avoided, thereby reducing the amount of computation required for speech recognition.
Drawings
Fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method for streaming speech recognition according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main structure of the involved recognition algorithm;
FIG. 4 is a flow chart illustrating a flow of a streaming speech recognition process according to the present invention;
fig. 5 is a schematic block diagram of a terminal device according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the control terminal may include: a processor 1001, such as a CPU, a network interface 1003, a memory 1004, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The network interface 1003 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1004 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 1004 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1004, which is a type of computer storage medium, may include an operating system, a network communication module, and a streaming voice recognition program therein.
In the terminal shown in fig. 1, the processor 1001 may be configured to invoke a streaming voice recognition program stored in the memory 1004 and perform the following operations:
acquiring a word embedding characteristic sequence and an audio characteristic sequence corresponding to an audio stream;
encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label, and determining an identification result according to the probability distribution result.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
detecting whether a preset mask window is filled;
updating the position coding sequence when the mask window is filled;
and executing the step of encoding the audio feature sequence through an audio encoder based on the updated position encoding sequence to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
the audio encoder and the tag encoder determine attention weight coefficients according to the input vector and the position coding sequence;
performing weighted calculation according to the weight coefficient and the input vector to obtain an initial result;
and inputting the initial result into a corresponding feedforward network layer to obtain the audio context sequence and the text context sequence.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
and acquiring and identifying the current load condition of the system, and determining the window size of the mask window according to the current load condition.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
receiving an audio data stream;
the step of obtaining the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream comprises the following steps:
and generating the audio characteristic sequence based on the audio data stream, and acquiring the word embedding characteristic sequence, wherein the audio characteristic sequence is a Mel frequency cepstrum coefficient or a Mel filter bank coefficient.
Speech recognition is the process of converting a speech signal into the corresponding text by a computer. As a key entry point for human-machine speech interaction, it is an important research direction in the field of artificial intelligence. Classical speech recognition algorithms are based on the GMM-HMM model and achieve good results, but because the GMM is a shallow model, its expressive power is insufficient and it cannot accurately represent the distribution of speech signals. Research therefore turned to speech recognition technologies based on deep learning. The mainstream deep-learning model is the RNN-HMM model and its variants, which have achieved great success and are still widely used in industry today. However, the construction and training of the HMM model are complex and time-consuming, and joint optimization of speech and text is not possible. End-to-end speech recognition models were therefore proposed: instead of connecting an acoustic model, a language model, a pronunciation dictionary and other modules in series, the speech signal can be mapped directly to a text sequence, greatly simplifying the construction process.
In the related art, the mainstream implementation of the end-to-end model is based on an attention mechanism, which can better capture the context information of audio and text and thereby improve recognition accuracy. Streaming voice recognition is a voice recognition technology that supports real-time processing: it can continuously recognize a voice data stream and return recognition results without waiting for all voice input to finish before starting the recognition process, and therefore offers better response speed and user experience.
Based on the schemes given in the related art, the RNN-HMM model and its various variants need to construct separate acoustic and language models; the structure is complex, joint optimization of the two models cannot be achieved, and this adversely affects recognition accuracy. Constructing the pronunciation dictionary for the RNN-HMM model requires a certain command of linguistic theory, so the technical requirements are high. In addition, the acoustic model needs multiple rounds of iterative alignment during training, so training is slow. The end-to-end recognition model based on the attention mechanism is computationally a recursive process and cannot be trained in parallel. A self-attention mechanism capable of parallel computation was therefore proposed, but the self-attention mechanism cannot provide the correspondence between audio and text, and the boundary of the text context is difficult to determine during recognition, making streaming recognition difficult. The related-art approach of segmenting the audio by silence detection easily causes stalls in the feedback of recognition results and has difficulty coping with strongly noisy environments.
In order to realize end-to-end streaming recognition, the embodiment of the invention provides a streaming speech recognition method which constructs an audio encoder and a tag encoder based on a self-attention mechanism to encode audio feature vectors and word embedding vectors and generate high-order audio and text context vectors, then uses a joint decoder to decode the concatenated audio and text context vectors and output a label probability distribution. The method establishes a one-to-one correspondence between audio input and label output, thereby both exploiting the self-attention mechanism's ability to compute in parallel and to extract context information, and achieving streaming recognition capability.
The streaming speech recognition method proposed by the present invention is further explained below by specific embodiments.
In an embodiment, referring to fig. 2, the streaming speech recognition method includes the following steps:
step S10, acquiring a word embedding feature sequence and an audio feature sequence corresponding to the audio stream;
step S20, encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and step S30, inputting the audio context sequence and the text context sequence into a joint decoder to obtain a label probability distribution result, and determining an identification result according to the label probability distribution result.
In this embodiment, an externally input voice data stream may be received through a preset data interface, and when the voice data stream is received, an audio feature sequence may be generated by the audio feature sequence generation module. It is understood that the audio feature sequence usually consists of Mel-Frequency Cepstrum Coefficients (MFCC) or Mel filter bank coefficients (FBANK), and the MFCC or FBANK features are usually further down-sampled to reduce the sequence length or to extract local information; the down-sampling may be performed by a convolutional neural network or by feature stacking. After down-sampling, the frame interval of the audio feature sequence is typically 30-50 milliseconds. The word embedding sequence refers to the sequence of vectors obtained from a word embedding matrix according to the index of each word of the word sequence in the vocabulary.
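As an illustration of this step, the following minimal Python/NumPy sketch shows frame stacking as one possible down-sampling strategy and the vocabulary-index lookup that forms the word embedding sequence; the stack/skip factors, feature dimensions and helper names are illustrative assumptions, not values fixed by this embodiment.

```python
import numpy as np

def stack_and_downsample(fbank, stack=4, skip=3):
    """Down-sample a frame-level FBANK/MFCC sequence by stacking `stack`
    consecutive frames into one vector and keeping every `skip`-th stacked
    frame, so a 10 ms frame shift becomes roughly 30-40 ms per output frame."""
    T, d = fbank.shape
    frames = [fbank[t:t + stack].reshape(-1)          # concatenate neighbouring frames
              for t in range(0, T - stack + 1, skip)]
    return np.stack(frames)                           # (T', stack * d)

def embed_words(word_ids, embedding_matrix):
    """Look up word embedding vectors by vocabulary index."""
    return embedding_matrix[np.asarray(word_ids)]     # (U, d_model)

# toy usage: 100 frames of 40-dim FBANK, vocabulary of 5000 words
fbank = np.random.randn(100, 40).astype(np.float32)
audio_feats = stack_and_downsample(fbank)             # audio feature sequence
emb = np.random.randn(5000, 256).astype(np.float32)   # word embedding matrix
word_feats = embed_words([1, 57, 903], emb)           # word embedding sequence
```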
Further, after the word embedding feature sequence and the audio feature sequence are obtained, the audio feature sequence can be encoded by the audio encoder to obtain the audio context sequence, and the word embedding sequence can be encoded by the tag encoder to obtain the text context sequence. It should be noted that, referring to fig. 3, the main structure of the recognition algorithm according to an embodiment of the present invention is shown in fig. 3. The audio encoder takes the audio feature sequence as its input vectors, that is, each frame generated by the audio feature sequence generation module from the audio data stream is an input vector. The tag encoder takes the word embedding sequence as its input vectors. Further, the audio encoder and the tag encoder are self-attention encoders based on relative position encoding.
Further, after the audio context sequence and the text context sequence are obtained, they may be input to the joint decoder. The joint decoder expands and concatenates the audio context sequence and the text context sequence, and then generates the label probability distribution through a transformation, where the label set is the union of all words and the blank label.
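The expand-concatenate-transform step of the joint decoder can be sketched as follows; this is a minimal, assumption-laden example (a single linear layer after concatenation, log-softmax output over words plus a blank label at index 0), not the exact network prescribed by the embodiment.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def joint_decoder(audio_ctx, text_ctx, W, b):
    """audio_ctx: (T, d_a) audio context sequence
       text_ctx:  (U, d_t) text context sequence
       Returns log label probabilities of shape (T, U, V),
       where index 0 of the V axis is taken to be the blank label."""
    T, U = audio_ctx.shape[0], text_ctx.shape[0]
    # expand both sequences so every (t, u) pair is combined, then concatenate
    a = np.repeat(audio_ctx[:, None, :], U, axis=1)     # (T, U, d_a)
    t = np.repeat(text_ctx[None, :, :], T, axis=0)      # (T, U, d_t)
    joint = np.concatenate([a, t], axis=-1)             # (T, U, d_a + d_t)
    logits = joint @ W + b                               # (T, U, V)
    return log_softmax(logits)
```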
It can be understood that there are multiple possible alignment paths in the probability distribution network. During model training, a forward-backward algorithm traverses and sums over every path in the probability distribution network to obtain the probability of the text sequence given the input audio feature sequence; the gradients of the weights are then computed by error back-propagation, and the weight values are updated.
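For intuition, the forward half of this path summation can be sketched as below; it is a generic transducer-style forward recursion over the (T, U) alignment lattice written for clarity, not the patent's exact training code (the backward pass and gradient computation are omitted).

```python
import numpy as np

def alignment_log_prob(log_probs, targets, blank=0):
    """Sum the probabilities of all alignment paths with the forward algorithm.

    log_probs: (T, U + 1, V) log label distribution from the joint decoder,
               where T is the number of audio frames and U = len(targets).
    targets:   list of U label indices (the reference text sequence).
    Returns the log probability of the target sequence given the audio."""
    T, U1, _ = log_probs.shape
    U = U1 - 1
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            stay = emit = -np.inf
            if t > 0:   # consumed one more frame by emitting the blank label
                stay = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            if u > 0:   # emitted the next target label without advancing time
                emit = alpha[t, u - 1] + log_probs[t, u - 1, targets[u - 1]]
            alpha[t, u] = np.logaddexp(stay, emit)
    # finish by emitting a final blank from the last lattice node
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]
```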
At the beginning of the model decoding process (i.e., the recognition process), an SOS tag is set as the start flag of the text sequence, its word vector is generated, and the initial text context feature vector is calculated by the text encoder. The audio stream is then acquired continuously and the audio feature sequence is generated frame by frame; whenever a frame of audio feature vector is generated, the audio context feature vector corresponding to that frame is calculated and sent, together with the text context feature vector, to the joint decoder for prediction. When the prediction result is not blank, the predicted text is sent to the text encoder to generate an updated text context vector, and the prediction is repeated until the prediction result is blank. These steps are repeated until the audio stream ends.
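A minimal greedy version of this decoding loop might look like the following; the encoder and joint functions are placeholders for the components described above, and greedy label selection plus the per-frame symbol limit are simplifying assumptions (the embodiment only requires a prediction from the label distribution).

```python
import numpy as np

BLANK, SOS = 0, 1  # assumed label indices

def streaming_greedy_decode(audio_frames, audio_encoder, text_encoder, joint,
                            max_symbols_per_frame=10):
    """audio_frames yields one audio feature vector at a time; audio_encoder,
    text_encoder and joint are callables standing in for the relative-position
    self-attention encoders and the joint decoder described above."""
    hypothesis = [SOS]
    text_ctx = text_encoder(hypothesis)              # initial text context vector
    for frame in audio_frames:                       # frame-by-frame streaming
        audio_ctx = audio_encoder(frame)             # audio context for this frame
        for _ in range(max_symbols_per_frame):       # guard against endless emission
            log_probs = joint(audio_ctx, text_ctx)
            label = int(np.argmax(log_probs))
            if label == BLANK:                       # blank: move on to the next frame
                break
            hypothesis.append(label)                 # non-blank: extend the text
            text_ctx = text_encoder(hypothesis)      # and refresh the text context
    return hypothesis[1:]                            # drop the SOS marker
```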
Further, after the probability distribution result of the label is obtained, the identification result can be determined according to the probability distribution result.
In addition, in order to control the resource consumption of streaming recognition through a local self-attention mechanism, this embodiment provides a scheme in which a mask window is set so that the range of the self-attention mechanism is limited to a certain interval and context information with little relevance to the current calculation is ignored. In this way the data volume and calculation amount of streaming recognition can be held at a constant value, and continuous growth of resource consumption is avoided.
Further, to avoid repeated calculations in the self-attention mechanism, the mask window is slid backwards with the input signal, the absolute position of each frame in the input sequence changes during the sliding process, but the relative position between frames remains the same. By using the relative position coding, when the window slides, the context vector corresponding to each frame in the window, which has already been calculated, does not change, thereby achieving the effect of avoiding the repeated calculation in the self-attention mechanism.
To achieve a balance between accuracy and response speed, the mask window size can be dynamically adjusted. It can be understood that the larger the mask window is, the more contextual information that can be focused on is, which helps to improve the accuracy of model identification, but the larger the memory and the calculation amount is, the more resources are consumed, and the slower the response speed is. The smaller the mask window, the smaller the amount of resources that need to be consumed, but the recognition accuracy is also adversely affected. The model can dynamically adjust the size of the mask window in the identification process, and balance between response speed and accuracy.
It should be noted that, in the present embodiment, both the audio encoder and the tag encoder are stacks of self-attention layers and feed-forward network layers, differing only in the number of stacked layers. In the following, the audio encoder and the tag encoder are described collectively as an encoder to explain the working principle common to both.
The self-attention mechanism can effectively extract important features of sparse data and is good at capturing data or internal correlation of the features. In order to extract data features from more dimensions, a plurality of transformation matrices are usually used to transform a sequence, and then the transformation results are spliced, which is called a Multi-Head Attention mechanism (Multi-Head Attention), and the implementation formula of the classical Multi-Head Attention mechanism is as follows:
$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$ (1)

$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$ (2)

$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$ (3)

where h is the number of attention heads, $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are weight parameters, and $d_k$ is the dimension of the vector K. Q, K, V are feature matrices; each row of a matrix represents a feature vector, and K corresponds to the feature vectors of V. Each feature vector in Q is dot-multiplied with the K matrix, scaled and passed through a softmax transformation to obtain the attention weight coefficients; the V matrix is then weighted and summed according to these weight parameters, the multiple heads are concatenated and linearly transformed, and the result is transformed by the feed-forward network layer to obtain the context vector corresponding to each feature vector (the audio context feature vector for the audio encoder, and the text context feature vector for the tag encoder).
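A compact NumPy sketch of this classical multi-head self-attention computation is given below for reference; the square projection matrices and the absence of masking are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (L, d_model) input sequence; Wq/Wk/Wv/Wo: (d_model, d_model);
    h: number of heads (d_model must be divisible by h).
    Implements formulas (1)-(3)."""
    L, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split into h heads of width d_k
    split = lambda M: M.reshape(L, h, d_k).transpose(1, 0, 2)   # (h, L, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)          # (h, L, L)
    heads = softmax(scores) @ Vh                                 # (h, L, d_k)
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)        # Concat(head_1..head_h)
    return concat @ Wo                                           # (L, d_model)
```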
The classical self-attention mechanism is based on absolute position encoding: Q, K, V are the results of adding the position encoding vectors to the input vectors q, k, v, and the index of an input vector in the sequence is the index of its position encoding. When computing the weight coefficients, absolute position encoding has difficulty reflecting the relative positional relationship of different input vectors, and it causes repeated calculations in streaming recognition. The method therefore adopts a relative position encoding strategy: the term $QK^T$ in the numerator of formula (3) is expanded, the vectors indexed by absolute position are replaced with position encoding bias vectors and relative position encoding vectors, and the weight coefficient matrix W is split into an input-related weight coefficient and a position-related weight coefficient. After this transformation, the attention weight coefficient between $q_i$ and $k_j$ is calculated as:

$A_{i,j} = E_{x_i}^{T} W_q^{T} W_{k,E} E_{x_j} + E_{x_i}^{T} W_q^{T} W_{k,R} R_{i-j} + u^{T} W_{k,E} E_{x_j} + v^{T} W_{k,R} R_{i-j}$ (4)

In the above formula, E represents an input vector, $W_q$, $W_{k,R}$ and $W_{k,E}$ are weight parameters, u and v are position encoding bias vectors, and R is the relative position encoding vector. It can be seen from the subscript of R that the position encoding depends only on the difference of the absolute position indices of q and k in the sequence, i.e. only on the relative position of q and k. Formula (4) gives the attention weight coefficients of $q_i$ with respect to every vector in the k sequence; weighted summation is then carried out on v, and the subsequent operations are the same as in the conventional multi-head self-attention mechanism, so they are not described again here.
In addition, for better understanding of the present invention, the design of the sliding window and the local attention mechanism of the present embodiment will be further explained below.
In long-speech or streaming recognition, as the input audio feature sequence and the recognized text sequence keep growing, the required amount of computation and storage grows as well. To control resource consumption, a window can be set so that the self-attention mechanism is restricted to within the window. The window moves backwards as the input sequence grows, discarding older historical data and keeping the self-attention mechanism always working on the part of the sequence closest to the current input vector.

Illustratively, assume the input sequence is $X_L = [x_0, \dots, x_i, x_{i+1}, \dots, x_L]$ of length L, where each $x_i$ is an input vector. In attention formula (4), for each vector $x_j$ in the sequence $X_L$, both $W_{k,E} E_{x_j}$ and $W_{k,R} R_{i-j}$ have to be computed. Because relative position encoding is used, the relative position of each vector in $X_L$ always stays the same, and so does $R_{i-j}$; to avoid repeated operations, these two values can be saved as intermediate variables, named $E'_L$ and $R'_L$ respectively. When a new input vector $x_{L+1}$ arrives, the attention weight coefficients of $x_{L+1}$ with respect to each vector in the new input sequence $X_{L+1}$ have to be computed according to formula (4). Since the intermediate variables $E'_L$ and $R'_L$ have already been saved, only $W_{k,E} E_{x_{L+1}}$ and $W_{k,R} R$ need to be computed here, where R is the position encoding sequence formed by the relative position encoding vectors in reverse order, i.e. $[r_{L+1}, \dots, r_i, r_{i-1}, \dots, r_0]$; the reverse order is used because the farther a vector is from $x_{L+1}$, the larger its relative position. After $W_{k,E} E_{x_{L+1}}$ and $W_{k,R} R$ have been computed, the subsequent calculation proceeds according to the formula; at the same time these two vectors are concatenated onto $E'_L$ and $R'_L$ respectively to obtain the updated intermediate variables $E'_{L+1}$ and $R'_{L+1}$, which can serve as the intermediate variables when the next new input vector $x_{L+2}$ is processed. In fact, the position encoding sequence R does not need to be obtained anew each time: it is only necessary to obtain $r_{L+1}$ and concatenate it onto the leftmost side of the sequence.

Let the width of the mask window be $d_w$. When the length L of the input sequence exceeds $d_w$, the mask window is filled, and the self-attention mechanism no longer attends to vectors whose index is smaller than $(L - d_w)$, i.e. the index j in formula (4) starts from $(L - d_w + 1)$; this is the meaning of the local self-attention mechanism. Starting from $L = d_w$, once $W_{k,E} E_{x_{L+1}}$ and $W_{k,R} R$ have been computed they are first concatenated with $E'_L$ and $R'_L$ to generate $E'_{L+1}$ and $R'_{L+1}$, and then the vector with index 0 is removed from $E'_{L+1}$ and $R'_{L+1}$ so that the lengths of the two intermediate variables stay unchanged, which is equivalent to sliding the window one step to the right. While the window slides, the absolute position of each vector in the sequence changes, but the relative positions remain unchanged, so the saved intermediate variables remain valid.
The width of the mask window determines the length of the sequence participating in the calculation, once the window is filled, the length participating in the subsequent calculation is not changed, the calculation amount and the storage space are not changed as long as the window width is not adjusted, and the consumption of the calculation resources is maintained at a constant value.
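The caching and sliding described above could be organized as in the sketch below; the cache layout (per-frame content and position key projections held in fixed-size deques) is an illustrative assumption rather than the only possible arrangement, and R is assumed to hold at least d_w precomputed relative position encodings.

```python
import numpy as np
from collections import deque

class SlidingSelfAttentionCache:
    """Caches the intermediate variables E' (content key projections) and
    R' (position key projections) inside a mask window of width d_w, so that
    each new frame requires only one new projection of each kind."""

    def __init__(self, Wk_E, Wk_R, R, d_w):
        self.Wk_E, self.Wk_R, self.R = Wk_E, Wk_R, R   # R[k] encodes a distance of k
        self.E_cache = deque(maxlen=d_w)   # E'; appending past maxlen drops index 0
        self.R_cache = deque(maxlen=d_w)   # R'; ordered from largest distance down to 0

    def step(self, x_new, Wq, u, v):
        """Add a new input vector and return its attention weights over the window."""
        self.E_cache.append(x_new @ self.Wk_E)        # only the new content key is computed
        n = len(self.E_cache)
        if len(self.R_cache) < n:
            # window not yet full: splice the next larger distance onto the leftmost side
            self.R_cache.appendleft(self.R[n - 1] @ self.Wk_R)
        # once the window is full the set of distances {d_w-1, ..., 0} never changes,
        # so R' is reused as-is while E' slides along with the input
        q = x_new @ Wq
        scores = np.array([(q + u) @ e + (q + v) @ r
                           for e, r in zip(self.E_cache, self.R_cache)])
        return scores                                  # pre-softmax weights inside the window
```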
The sliding window and local self-attention mechanism are illustrated in fig. 4: the processing flow of the audio encoder is indicated by the dashed box in fig. 4. It should be noted that the text encoder also employs the sliding window and local attention mechanism, which is not repeated in fig. 4 for simplicity of illustration.
In addition, for the dynamic adjustment of the computing resources proposed in the present embodiment, the principle and design are as follows:
Speech recognition services are typically deployed on terminal devices that have a certain amount of computing and storage resources and serve multiple users simultaneously. Because user activity varies across time periods, both the average load and the peak load must be considered when configuring hardware resources, which easily leads either to wasted resources or to longer response times at peak periods. The sliding window mechanism described in this embodiment can control the consumption of computing resources by setting the window width, which provides a new way to make full use of computing resources. When a new user requests service, the system can dynamically set the window width according to the current load condition: when the system load is heavy, the window width is reduced as much as possible while keeping recognition accuracy usable; when the system load is light, a larger window width can be set to provide higher recognition accuracy.
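One simple way such a policy could be expressed is sketched below; the load range, window widths and linear interpolation are purely illustrative assumptions, not values given by this embodiment.

```python
def choose_window_width(cpu_load, min_width=32, max_width=256):
    """Pick a mask window width from the current system load (0.0 - 1.0):
    heavier load -> smaller window -> less computation per request."""
    cpu_load = min(max(cpu_load, 0.0), 1.0)
    # interpolate linearly between the widest and the narrowest window
    width = max_width - (max_width - min_width) * cpu_load
    return int(width)

# example: at 80 % load a new recognition session gets a narrower window
session_window = choose_window_width(cpu_load=0.8)
```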
In addition to balancing computing resource consumption, the window-size adjustment strategy can be extended to more scenarios, such as providing different window sizes for users of different priorities, providing customized services when computing resources are fixed, or allowing users to configure an appropriate trade-off between recognition accuracy and response time themselves.
In the technical scheme disclosed in this embodiment, a word embedding feature sequence and an audio feature sequence corresponding to an audio stream are obtained, the audio feature sequence is encoded by an audio encoder to obtain an audio context sequence, and the word embedding sequence is encoded by a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding, the audio context sequence and the text context sequence are input to a joint decoder to obtain a probability distribution result of a tag, and an identification result is determined according to the probability distribution result. By using a self-attention encoder based on relative position encoding, it is achieved to avoid repeated calculations in the self-attention mechanism, thereby achieving the effect of reducing the amount of calculations for speech recognition.
In addition, an embodiment of the present invention further provides a terminal device, where the terminal device includes: a memory, a processor and a streaming speech recognition program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the streaming speech recognition method as described in the various embodiments above.
In addition, referring to fig. 5, an embodiment of the present invention further provides a terminal device 100, where the terminal device 100 includes:
the acquiring module 101 is configured to acquire a word embedding feature sequence and an audio feature sequence corresponding to an audio stream;
an encoding module 102, configured to encode the audio feature sequence by an audio encoder to obtain an audio context sequence, and encode the word embedding sequence according to a tag encoder to obtain a text context sequence, where the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
a decoding module 103, configured to input the audio context sequence and the text context sequence into a joint decoder, obtain a probability distribution result of the tag, and determine an identification result according to the probability distribution result.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a streaming voice recognition program is stored on the computer-readable storage medium, and when being executed by a processor, the streaming voice recognition program implements the steps of the streaming voice recognition method according to the above embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a terminal device (e.g. PC or server) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A streaming speech recognition method, comprising:
acquiring a word embedding characteristic sequence and an audio characteristic sequence corresponding to an audio stream;
encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label, and determining an identification result according to the probability distribution result.
2. The streaming speech recognition method of claim 1, wherein the steps of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence and encoding the word embedding sequence by a tag encoder to obtain a text context sequence further comprise:
detecting whether a preset mask window is filled;
updating the position coding sequence when the mask window is filled;
and executing the step of encoding the audio feature sequence through an audio encoder based on the updated position encoding sequence to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence.
3. The streaming speech recognition method of claim 2, wherein the audio encoder uses the sequence of audio features as an input vector and the tag encoder uses the embedded sequence of words as an input vector.
4. The streaming speech recognition method of claim 3, wherein the steps of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence by a tag encoder to obtain a text context sequence comprise:
the audio encoder and the tag encoder determine attention weight coefficients according to the input vector and the position coding sequence;
performing weighted calculation according to the weight coefficient and the input vector to obtain an initial result;
and inputting the initial result into a corresponding feedforward network layer to obtain the audio context sequence and the text context sequence.
5. The streaming speech recognition method of claim 2, wherein the relative position between corresponding individual audio feature vectors in the sequence of audio features is determined from the absolute position of the audio feature vectors in the sequence of audio features.
6. The streaming speech recognition method of claim 2, wherein the step of detecting whether the predetermined mask window is filled further comprises:
and acquiring and identifying the current load condition of the system, and determining the window size of the mask window according to the current load condition.
7. The streaming speech recognition method of claim 1, wherein the step of obtaining the word-embedding feature sequence and the audio feature sequence corresponding to the audio stream is preceded by the step of:
receiving an audio data stream;
the step of obtaining the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream comprises the following steps:
and generating the audio characteristic sequence based on the audio data stream, and acquiring the word embedding characteristic sequence, wherein the audio characteristic sequence is a Mel frequency cepstrum coefficient or a Mel filter bank coefficient.
8. A terminal device, characterized in that the terminal device comprises: memory, processor and a streaming speech recognition program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the streaming speech recognition method according to any of claims 1 to 7.
9. A terminal device, characterized in that the terminal device comprises:
the acquisition module is used for acquiring the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream;
the encoding module is used for encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and the decoding module is used for inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label and determining an identification result according to the probability distribution result.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a streaming speech recognition program which, when executed by a processor, implements the steps of the streaming speech recognition method according to any one of claims 1 to 7.
CN202111119338.5A 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium Pending CN113838468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111119338.5A CN113838468A (en) 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111119338.5A CN113838468A (en) 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium

Publications (1)

Publication Number Publication Date
CN113838468A true CN113838468A (en) 2021-12-24

Family

ID=78969642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111119338.5A Pending CN113838468A (en) 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium

Country Status (1)

Country Link
CN (1) CN113838468A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism
CN111291183A (en) * 2020-01-16 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for carrying out classification prediction by using text classification model
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method
CN111859954A (en) * 2020-07-01 2020-10-30 腾讯科技(深圳)有限公司 Target object identification method, device, equipment and computer readable storage medium
CN112599122A (en) * 2020-12-10 2021-04-02 平安科技(深圳)有限公司 Voice recognition method and device based on self-attention mechanism and memory network
CN113270086A (en) * 2021-07-19 2021-08-17 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙水发 et al.: 《ImageJ图像处理与实践》 (ImageJ Image Processing and Practice), National Defense Industry Press (国防工业出版社), 31 December 2013, pages 20-21 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331660A (en) * 2022-08-09 2022-11-11 北京市商汤科技开发有限公司 Neural network training method, speech recognition method, apparatus, device and medium
CN115631275A (en) * 2022-11-18 2023-01-20 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN117116264A (en) * 2023-02-20 2023-11-24 荣耀终端有限公司 Voice recognition method, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN109785824B (en) Training method and device of voice translation model
CN113838468A (en) Streaming voice recognition method, terminal device and medium
JP7490804B2 (en) System and method for streaming end-to-end speech recognition with asynchronous decoders - Patents.com
Liu et al. Efficient lattice rescoring using recurrent neural network language models
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
EP1241661B1 (en) Speech recognition apparatus
CN112530403B (en) Voice conversion method and system based on semi-parallel corpus
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN113241075A (en) Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN115827854A (en) Voice abstract generation model training method, voice abstract generation method and device
CN113488028B (en) Speech transcription recognition training decoding method and system based on fast jump decoding
WO2022083165A1 (en) Transformer-based automatic speech recognition system incorporating time-reduction layer
US20210073645A1 (en) Learning apparatus and method, and program
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN110349570B (en) Speech recognition model training method, readable storage medium and electronic device
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
KR20200120595A (en) Method and apparatus for training language model, method and apparatus for recognizing speech
CN112133304A (en) Low-delay speech recognition model based on feedforward neural network and training method
CN117558263B (en) Speech recognition method, device, equipment and readable storage medium
JP2010224418A (en) Voice synthesizer, method, and program
Na et al. Learning adaptive downsampling encoding for online end-to-end speech recognition
CN112562686B (en) Zero-sample voice conversion corpus preprocessing method using neural network
US20230134942A1 (en) Apparatus and method for self-supervised training of end-to-end speech recognition model
CN117456999B (en) Audio identification method, audio identification device, vehicle, computer device, and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination