CN113838468A - Streaming voice recognition method, terminal device and medium - Google Patents

Streaming voice recognition method, terminal device and medium

Info

Publication number
CN113838468A
CN113838468A CN202111119338.5A
Authority
CN
China
Prior art keywords
sequence
audio
encoder
encoding
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111119338.5A
Other languages
Chinese (zh)
Inventor
蔡旭浦
张俊杰
彭朋
荣玉军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Hangzhou Information Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111119338.5A
Publication of CN113838468A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention discloses a streaming voice recognition method, a terminal device and a computer-readable storage medium. The method comprises the following steps: acquiring a word embedding feature sequence and an audio feature sequence corresponding to an audio stream; encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence through a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding; and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a label probability distribution result, and determining a recognition result according to the probability distribution result. The present invention aims to reduce the amount of computation required for speech recognition.

Description

Streaming voice recognition method, terminal device and medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a streaming speech recognition method, a terminal device, and a computer-readable storage medium.
Background
Speech recognition is the process of converting a speech signal into the corresponding text by a computer. As a key entry point for human-machine speech interaction, it is an important research direction in the field of artificial intelligence.
In the related art, the mainstream implementation of the end-to-end model is based on an attention mechanism, which can better capture the context information of audio and text and thereby improve recognition accuracy. However, in the conventional scheme that implements speech recognition based on the attention mechanism, the attention weights must be determined according to the absolute position of each speech feature, so for streaming speech the amount of computation over the speech features grows rapidly as the speech length increases.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a streaming voice recognition method, a terminal device and a computer readable storage medium, aiming at achieving the effect of reducing the calculation amount of voice recognition.
In order to achieve the above object, the present invention provides a streaming voice recognition method, which includes the following steps:
acquiring a word embedding characteristic sequence and an audio characteristic sequence corresponding to an audio stream;
encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label, and determining an identification result according to the probability distribution result.
Optionally, before the steps of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence by a tag encoder to obtain a text context sequence, the method further includes:
detecting whether a preset mask window is filled;
updating the position coding sequence when the mask window is filled;
and executing the step of encoding the audio feature sequence through an audio encoder based on the updated position encoding sequence to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence.
Optionally, the audio encoder takes the audio feature sequence as an input vector, and the tag encoder takes the word embedding sequence as an input vector.
Optionally, the step of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence by a tag encoder to obtain a text context sequence includes:
the audio encoder and the tag encoder determine attention weight coefficients according to the input vector and the position coding sequence;
performing weighted calculation according to the weight coefficient and the input vector to obtain an initial result;
and inputting the initial result into a corresponding feedforward network layer to obtain the audio context sequence and the text context sequence.
Optionally, the relative position between corresponding audio feature vectors in the audio feature sequence is determined according to the absolute position of the audio feature vector in the audio feature sequence.
Optionally, before the step of detecting whether the preset mask window is filled, the method further includes:
and acquiring and identifying the current load condition of the system, and determining the window size of the mask window according to the current load condition.
Optionally, before the step of obtaining the word embedding feature sequence and the audio feature sequence corresponding to the audio stream, the method further includes:
receiving an audio data stream;
the step of obtaining the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream comprises the following steps:
and generating the audio characteristic sequence based on the audio data stream, and acquiring the word embedding characteristic sequence, wherein the audio characteristic sequence is a Mel frequency cepstrum coefficient or a Mel filter bank coefficient.
In addition, to achieve the above object, the present invention further provides a terminal device, which includes a memory, a processor, and a streaming voice recognition program stored in the memory and operable on the processor, wherein the streaming voice recognition program, when executed by the processor, implements the steps of the streaming voice recognition method as described above.
In addition, to achieve the above object, the present invention also provides a terminal device, including:
the acquisition module is used for acquiring the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream;
the encoding module is used for encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and the decoding module is used for inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label and determining an identification result according to the probability distribution result.
Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium having a streaming voice recognition program stored thereon, which when executed by a processor implements the steps of the streaming voice recognition method as described above.
The embodiment of the invention provides a streaming voice recognition method, a terminal device and a computer-readable storage medium. A word embedding feature sequence and an audio feature sequence corresponding to an audio stream are obtained; the audio feature sequence is encoded by an audio encoder to obtain an audio context sequence, and the word embedding sequence is encoded by a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding; the audio context sequence and the text context sequence are input into a joint decoder to obtain a label probability distribution result, and a recognition result is determined according to the probability distribution result. By using self-attention encoders based on relative position encoding, repeated calculations in the self-attention mechanism are avoided, thereby reducing the amount of computation required for speech recognition.
Drawings
Fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a method for streaming speech recognition according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main structure of the involved recognition algorithm;
FIG. 4 is a flow chart illustrating a flow of a streaming speech recognition process according to the present invention;
fig. 5 is a schematic block diagram of a terminal device according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic terminal structure diagram of a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the control terminal may include: a processor 1001, such as a CPU, a network interface 1003, a memory 1004, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The network interface 1003 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1004 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 1004 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1004, which is a type of computer storage medium, may include an operating system, a network communication module, and a streaming voice recognition program therein.
In the terminal shown in fig. 1, the processor 1001 may be configured to invoke a streaming voice recognition program stored in the memory 1004 and perform the following operations:
acquiring a word embedding characteristic sequence and an audio characteristic sequence corresponding to an audio stream;
encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label, and determining an identification result according to the probability distribution result.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
detecting whether a preset mask window is filled;
updating the position coding sequence when the mask window is filled;
and executing the step of encoding the audio feature sequence through an audio encoder based on the updated position encoding sequence to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
the audio encoder and the tag encoder determine attention weight coefficients according to the input vector and the position coding sequence;
performing weighted calculation according to the weight coefficient and the input vector to obtain an initial result;
and inputting the initial result into a corresponding feedforward network layer to obtain the audio context sequence and the text context sequence.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
and acquiring and identifying the current load condition of the system, and determining the window size of the mask window according to the current load condition.
Further, the processor 1001 may call the streaming voice recognition program stored in the memory 1004, and further perform the following operations:
receiving an audio data stream;
the step of obtaining the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream comprises the following steps:
and generating the audio characteristic sequence based on the audio data stream, and acquiring the word embedding characteristic sequence, wherein the audio characteristic sequence is a Mel frequency cepstrum coefficient or a Mel filter bank coefficient.
Speech recognition is the process of converting a speech signal into the corresponding text by a computer. As a key entry point for human-machine speech interaction, it is an important research direction in the field of artificial intelligence. Classical speech recognition algorithms are based on the GMM-HMM model and achieve good results, but because the GMM is a shallow model, its expressive power is insufficient and it cannot accurately represent the distribution of speech signals. Research therefore turned to speech recognition technologies based on deep learning. The mainstream deep-learning model is the RNN-HMM model and its variants, which have achieved great success and are still widely used in industry today. However, the construction and training of the HMM model are complex and time-consuming, and joint optimization of speech and text is not possible. End-to-end speech recognition models were therefore proposed: instead of connecting an acoustic model, a language model, a pronunciation dictionary and other modules in series, the speech signal can be mapped directly to a text sequence, greatly simplifying the construction process.
In the related art, the mainstream implementation of the end-to-end model is based on an attention mechanism, which can better capture the context information of audio and text and thereby improve recognition accuracy. Streaming voice recognition is a voice recognition technology that supports real-time processing: it can continuously recognize a voice data stream and return recognition results without waiting for all voice input to finish before starting the recognition process, and therefore offers better response speed and user experience.
Based on the schemes given in the related art, the RNN-HMM model and its various variants need to construct separate acoustic and language models; the structure is complex, joint optimization of the two models cannot be achieved, and this adversely affects recognition accuracy. Constructing the pronunciation dictionary for the RNN-HMM model requires a certain command of linguistic theory, so the technical requirements are high. In addition, the acoustic model needs multiple rounds of iterative alignment during training, so training is slow. The end-to-end recognition model based on the attention mechanism is computationally a recursive process and cannot be trained in parallel. A self-attention mechanism capable of parallel computation was therefore proposed, but the self-attention mechanism cannot provide the correspondence between audio and text, and the boundary of the text context is difficult to determine during recognition, making streaming recognition difficult. The related-art approach of segmenting the audio by silence detection easily causes stalls in the feedback of recognition results and has difficulty coping with strongly noisy environments.
In order to realize end-to-end streaming recognition, the embodiment of the invention provides a streaming speech recognition method which constructs an audio encoder and a tag encoder based on a self-attention mechanism to encode audio feature vectors and word embedding vectors and generate high-order audio and text context vectors, then uses a joint decoder to decode the concatenated audio and text context vectors and output a label probability distribution. The method establishes a one-to-one correspondence between audio input and label output, thereby both exploiting the self-attention mechanism's ability to compute in parallel and to extract context information, and achieving streaming recognition capability.
The streaming speech recognition method proposed by the present invention is further explained below by specific embodiments.
In an embodiment, referring to fig. 2, the streaming speech recognition method includes the following steps:
step S10, acquiring a word embedding feature sequence and an audio feature sequence corresponding to the audio stream;
step S20, encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and step S30, inputting the audio context sequence and the text context sequence into a joint decoder to obtain a label probability distribution result, and determining an identification result according to the label probability distribution result.
In this embodiment, an externally input voice data stream may be received through a preset data interface, and when the voice data stream is received, an audio feature sequence may be generated by the audio feature sequence generation module. It is understood that the audio feature sequence usually consists of Mel-Frequency Cepstrum Coefficients (MFCC) or Mel filter bank coefficients (FBANK), and the MFCC or FBANK features are usually further down-sampled to reduce the sequence length or to extract local information; the down-sampling may be performed by a convolutional neural network or by feature stacking. After down-sampling, the frame interval of the audio feature sequence is typically 30-50 milliseconds. The word embedding sequence refers to the sequence of vectors obtained from a word embedding matrix according to the index of each word of the word sequence in the vocabulary.
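As an illustration of this step, the following minimal Python/NumPy sketch shows frame stacking as one possible down-sampling strategy and the vocabulary-index lookup that forms the word embedding sequence; the stack/skip factors, feature dimensions and helper names are illustrative assumptions, not values fixed by this embodiment.

```python
import numpy as np

def stack_and_downsample(fbank, stack=4, skip=3):
    """Down-sample a frame-level FBANK/MFCC sequence by stacking `stack`
    consecutive frames into one vector and keeping every `skip`-th stacked
    frame, so a 10 ms frame shift becomes roughly 30-40 ms per output frame."""
    T, d = fbank.shape
    frames = [fbank[t:t + stack].reshape(-1)          # concatenate neighbouring frames
              for t in range(0, T - stack + 1, skip)]
    return np.stack(frames)                           # (T', stack * d)

def embed_words(word_ids, embedding_matrix):
    """Look up word embedding vectors by vocabulary index."""
    return embedding_matrix[np.asarray(word_ids)]     # (U, d_model)

# toy usage: 100 frames of 40-dim FBANK, vocabulary of 5000 words
fbank = np.random.randn(100, 40).astype(np.float32)
audio_feats = stack_and_downsample(fbank)             # audio feature sequence
emb = np.random.randn(5000, 256).astype(np.float32)   # word embedding matrix
word_feats = embed_words([1, 57, 903], emb)           # word embedding sequence
```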
Further, after the word embedding feature sequence and the audio feature sequence are obtained, the audio feature sequence can be encoded by the audio encoder to obtain the audio context sequence, and the word embedding sequence can be encoded by the tag encoder to obtain the text context sequence. It should be noted that, referring to fig. 3, the main structure of the recognition algorithm according to an embodiment of the present invention is shown in fig. 3. The audio encoder takes the audio feature sequence as its input vectors, that is, each frame generated by the audio feature sequence generation module from the audio data stream is an input vector. The tag encoder takes the word embedding sequence as its input vectors. Further, the audio encoder and the tag encoder are self-attention encoders based on relative position encoding.
Further, after the audio context sequence and the text context sequence are obtained, they may be input to the joint decoder. The joint decoder expands and concatenates the audio context sequence and the text context sequence, and then generates the label probability distribution through a transformation, where the label set is the union of all words and the blank label.
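The expand-concatenate-transform step of the joint decoder can be sketched as follows; this is a minimal, assumption-laden example (a single linear layer after concatenation, log-softmax output over words plus a blank label at index 0), not the exact network prescribed by the embodiment.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def joint_decoder(audio_ctx, text_ctx, W, b):
    """audio_ctx: (T, d_a) audio context sequence
       text_ctx:  (U, d_t) text context sequence
       Returns log label probabilities of shape (T, U, V),
       where index 0 of the V axis is taken to be the blank label."""
    T, U = audio_ctx.shape[0], text_ctx.shape[0]
    # expand both sequences so every (t, u) pair is combined, then concatenate
    a = np.repeat(audio_ctx[:, None, :], U, axis=1)     # (T, U, d_a)
    t = np.repeat(text_ctx[None, :, :], T, axis=0)      # (T, U, d_t)
    joint = np.concatenate([a, t], axis=-1)             # (T, U, d_a + d_t)
    logits = joint @ W + b                               # (T, U, V)
    return log_softmax(logits)
```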
It can be understood that there are multiple possible alignment paths in the probability distribution network. During model training, a forward-backward algorithm traverses and sums over every path in the probability distribution network to obtain the probability of the text sequence given the input audio feature sequence; the gradients of the weights are then computed by error back-propagation, and the weight values are updated.
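For intuition, the forward half of this path summation can be sketched as below; it is a generic transducer-style forward recursion over the (T, U) alignment lattice written for clarity, not the patent's exact training code (the backward pass and gradient computation are omitted).

```python
import numpy as np

def alignment_log_prob(log_probs, targets, blank=0):
    """Sum the probabilities of all alignment paths with the forward algorithm.

    log_probs: (T, U + 1, V) log label distribution from the joint decoder,
               where T is the number of audio frames and U = len(targets).
    targets:   list of U label indices (the reference text sequence).
    Returns the log probability of the target sequence given the audio."""
    T, U1, _ = log_probs.shape
    U = U1 - 1
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            stay = emit = -np.inf
            if t > 0:   # consumed one more frame by emitting the blank label
                stay = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            if u > 0:   # emitted the next target label without advancing time
                emit = alpha[t, u - 1] + log_probs[t, u - 1, targets[u - 1]]
            alpha[t, u] = np.logaddexp(stay, emit)
    # finish by emitting a final blank from the last lattice node
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]
```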
At the beginning of the model decoding process (i.e., the recognition process), an SOS tag is set as the start flag of the text sequence, its word vector is generated, and the initial text context feature vector is calculated by the text encoder. The audio stream is then acquired continuously and the audio feature sequence is generated frame by frame; whenever a frame of audio feature vector is generated, the audio context feature vector corresponding to that frame is calculated and sent, together with the text context feature vector, to the joint decoder for prediction. When the prediction result is not blank, the predicted text is sent to the text encoder to generate an updated text context vector, and the prediction is repeated until the prediction result is blank. These steps are repeated until the audio stream ends.
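A minimal greedy version of this decoding loop might look like the following; the encoder and joint functions are placeholders for the components described above, and greedy label selection plus the per-frame symbol limit are simplifying assumptions (the embodiment only requires a prediction from the label distribution).

```python
import numpy as np

BLANK, SOS = 0, 1  # assumed label indices

def streaming_greedy_decode(audio_frames, audio_encoder, text_encoder, joint,
                            max_symbols_per_frame=10):
    """audio_frames yields one audio feature vector at a time; audio_encoder,
    text_encoder and joint are callables standing in for the relative-position
    self-attention encoders and the joint decoder described above."""
    hypothesis = [SOS]
    text_ctx = text_encoder(hypothesis)              # initial text context vector
    for frame in audio_frames:                       # frame-by-frame streaming
        audio_ctx = audio_encoder(frame)             # audio context for this frame
        for _ in range(max_symbols_per_frame):       # guard against endless emission
            log_probs = joint(audio_ctx, text_ctx)
            label = int(np.argmax(log_probs))
            if label == BLANK:                       # blank: move on to the next frame
                break
            hypothesis.append(label)                 # non-blank: extend the text
            text_ctx = text_encoder(hypothesis)      # and refresh the text context
    return hypothesis[1:]                            # drop the SOS marker
```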
Further, after the probability distribution result of the label is obtained, the identification result can be determined according to the probability distribution result.
In addition, in order to control the resource consumption of streaming recognition through a local self-attention mechanism, this embodiment provides a scheme in which a mask window is set so that the range of the self-attention mechanism is limited to a certain interval and context information with little relevance to the current calculation is ignored. In this way the data volume and calculation amount of streaming recognition can be held at a constant value, and continuous growth of resource consumption is avoided.
Further, to avoid repeated calculations in the self-attention mechanism, the mask window is slid backwards with the input signal, the absolute position of each frame in the input sequence changes during the sliding process, but the relative position between frames remains the same. By using the relative position coding, when the window slides, the context vector corresponding to each frame in the window, which has already been calculated, does not change, thereby achieving the effect of avoiding the repeated calculation in the self-attention mechanism.
To achieve a balance between accuracy and response speed, the mask window size can be dynamically adjusted. It can be understood that the larger the mask window is, the more contextual information that can be focused on is, which helps to improve the accuracy of model identification, but the larger the memory and the calculation amount is, the more resources are consumed, and the slower the response speed is. The smaller the mask window, the smaller the amount of resources that need to be consumed, but the recognition accuracy is also adversely affected. The model can dynamically adjust the size of the mask window in the identification process, and balance between response speed and accuracy.
It should be noted that, in the present embodiment, both the audio encoder and the tag encoder are stacks of self-attention layers and feed-forward network layers, differing only in the number of stacked layers. In the following, the audio encoder and the tag encoder are described collectively as an encoder to explain the working principle common to both.
The self-attention mechanism can effectively extract important features of sparse data and is good at capturing data or internal correlation of the features. In order to extract data features from more dimensions, a plurality of transformation matrices are usually used to transform a sequence, and then the transformation results are spliced, which is called a Multi-Head Attention mechanism (Multi-Head Attention), and the implementation formula of the classical Multi-Head Attention mechanism is as follows:
$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$ (1)

$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$ (2)

$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$ (3)

where h is the number of attention heads, $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are weight parameters, and $d_k$ is the dimension of the vector K. Q, K, V are feature matrices; each row of a matrix represents a feature vector, and K corresponds to the feature vectors of V. Each feature vector in Q is dot-multiplied with the K matrix, scaled and passed through a softmax transformation to obtain the attention weight coefficients; the V matrix is then weighted and summed according to these weight parameters, the multiple heads are concatenated and linearly transformed, and the result is transformed by the feed-forward network layer to obtain the context vector corresponding to each feature vector (the audio context feature vector for the audio encoder, and the text context feature vector for the tag encoder).
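A compact NumPy sketch of this classical multi-head self-attention computation is given below for reference; the square projection matrices and the absence of masking are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (L, d_model) input sequence; Wq/Wk/Wv/Wo: (d_model, d_model);
    h: number of heads (d_model must be divisible by h).
    Implements formulas (1)-(3)."""
    L, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split into h heads of width d_k
    split = lambda M: M.reshape(L, h, d_k).transpose(1, 0, 2)   # (h, L, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)          # (h, L, L)
    heads = softmax(scores) @ Vh                                 # (h, L, d_k)
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)        # Concat(head_1..head_h)
    return concat @ Wo                                           # (L, d_model)
```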
The classical self-attention mechanism is based on absolute position encoding: Q, K, V are the results of adding the position encoding vectors to the input vectors q, k, v, and the index of an input vector in the sequence is the index of its position encoding. When computing the weight coefficients, absolute position encoding has difficulty reflecting the relative positional relationship of different input vectors, and it causes repeated calculations in streaming recognition. The method therefore adopts a relative position encoding strategy: the term $QK^T$ in the numerator of formula (3) is expanded, the vectors indexed by absolute position are replaced with position encoding bias vectors and relative position encoding vectors, and the weight coefficient matrix W is split into an input-related weight coefficient and a position-related weight coefficient. After this transformation, the attention weight coefficient between $q_i$ and $k_j$ is calculated as:

$A_{i,j} = E_{x_i}^{T} W_q^{T} W_{k,E} E_{x_j} + E_{x_i}^{T} W_q^{T} W_{k,R} R_{i-j} + u^{T} W_{k,E} E_{x_j} + v^{T} W_{k,R} R_{i-j}$ (4)

In the above formula, E represents an input vector, $W_q$, $W_{k,R}$ and $W_{k,E}$ are weight parameters, u and v are position encoding bias vectors, and R is the relative position encoding vector. It can be seen from the subscript of R that the position encoding depends only on the difference of the absolute position indices of q and k in the sequence, i.e. only on the relative position of q and k. Formula (4) gives the attention weight coefficients of $q_i$ with respect to every vector in the k sequence; weighted summation is then carried out on v, and the subsequent operations are the same as in the conventional multi-head self-attention mechanism, so they are not described again here.
In addition, for better understanding of the present invention, the design of the sliding window and the local attention mechanism of the present embodiment will be further explained below.
In long-speech or streaming recognition, as the input audio feature sequence and the recognized text sequence keep growing, the required amount of computation and storage grows as well. To control resource consumption, a window can be set so that the self-attention mechanism is restricted to within the window. The window moves backwards as the input sequence grows, discarding older historical data and keeping the self-attention mechanism always working on the part of the sequence closest to the current input vector.

Illustratively, assume the input sequence is $X_L = [x_0, \dots, x_i, x_{i+1}, \dots, x_L]$ of length L, where each $x_i$ is an input vector. In attention formula (4), for each vector $x_j$ in the sequence $X_L$, both $W_{k,E} E_{x_j}$ and $W_{k,R} R_{i-j}$ have to be computed. Because relative position encoding is used, the relative position of each vector in $X_L$ always stays the same, and so does $R_{i-j}$; to avoid repeated operations, these two values can be saved as intermediate variables, named $E'_L$ and $R'_L$ respectively. When a new input vector $x_{L+1}$ arrives, the attention weight coefficients of $x_{L+1}$ with respect to each vector in the new input sequence $X_{L+1}$ have to be computed according to formula (4). Since the intermediate variables $E'_L$ and $R'_L$ have already been saved, only $W_{k,E} E_{x_{L+1}}$ and $W_{k,R} R$ need to be computed here, where R is the position encoding sequence formed by the relative position encoding vectors in reverse order, i.e. $[r_{L+1}, \dots, r_i, r_{i-1}, \dots, r_0]$; the reverse order is used because the farther a vector is from $x_{L+1}$, the larger its relative position. After $W_{k,E} E_{x_{L+1}}$ and $W_{k,R} R$ have been computed, the subsequent calculation proceeds according to the formula; at the same time these two vectors are concatenated onto $E'_L$ and $R'_L$ respectively to obtain the updated intermediate variables $E'_{L+1}$ and $R'_{L+1}$, which can serve as the intermediate variables when the next new input vector $x_{L+2}$ is processed. In fact, the position encoding sequence R does not need to be obtained anew each time: it is only necessary to obtain $r_{L+1}$ and concatenate it onto the leftmost side of the sequence.

Let the width of the mask window be $d_w$. When the length L of the input sequence exceeds $d_w$, the mask window is filled, and the self-attention mechanism no longer attends to vectors whose index is smaller than $(L - d_w)$, i.e. the index j in formula (4) starts from $(L - d_w + 1)$; this is the meaning of the local self-attention mechanism. Starting from $L = d_w$, once $W_{k,E} E_{x_{L+1}}$ and $W_{k,R} R$ have been computed they are first concatenated with $E'_L$ and $R'_L$ to generate $E'_{L+1}$ and $R'_{L+1}$, and then the vector with index 0 is removed from $E'_{L+1}$ and $R'_{L+1}$ so that the lengths of the two intermediate variables stay unchanged, which is equivalent to sliding the window one step to the right. While the window slides, the absolute position of each vector in the sequence changes, but the relative positions remain unchanged, so the saved intermediate variables remain valid.
The width of the mask window determines the length of the sequence participating in the calculation, once the window is filled, the length participating in the subsequent calculation is not changed, the calculation amount and the storage space are not changed as long as the window width is not adjusted, and the consumption of the calculation resources is maintained at a constant value.
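The caching and sliding described above could be organized as in the sketch below; the cache layout (per-frame content and position key projections held in fixed-size deques) is an illustrative assumption rather than the only possible arrangement, and R is assumed to hold at least d_w precomputed relative position encodings.

```python
import numpy as np
from collections import deque

class SlidingSelfAttentionCache:
    """Caches the intermediate variables E' (content key projections) and
    R' (position key projections) inside a mask window of width d_w, so that
    each new frame requires only one new projection of each kind."""

    def __init__(self, Wk_E, Wk_R, R, d_w):
        self.Wk_E, self.Wk_R, self.R = Wk_E, Wk_R, R   # R[k] encodes a distance of k
        self.E_cache = deque(maxlen=d_w)   # E'; appending past maxlen drops index 0
        self.R_cache = deque(maxlen=d_w)   # R'; ordered from largest distance down to 0

    def step(self, x_new, Wq, u, v):
        """Add a new input vector and return its attention weights over the window."""
        self.E_cache.append(x_new @ self.Wk_E)        # only the new content key is computed
        n = len(self.E_cache)
        if len(self.R_cache) < n:
            # window not yet full: splice the next larger distance onto the leftmost side
            self.R_cache.appendleft(self.R[n - 1] @ self.Wk_R)
        # once the window is full the set of distances {d_w-1, ..., 0} never changes,
        # so R' is reused as-is while E' slides along with the input
        q = x_new @ Wq
        scores = np.array([(q + u) @ e + (q + v) @ r
                           for e, r in zip(self.E_cache, self.R_cache)])
        return scores                                  # pre-softmax weights inside the window
```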
The sliding window and local self-attention mechanism are illustrated in fig. 4: the processing flow of the audio encoder is indicated by the dashed box in fig. 4. It should be noted that the text encoder also employs the sliding window and local attention mechanism, which is not repeated in fig. 4 for simplicity of illustration.
In addition, for the dynamic adjustment of the computing resources proposed in the present embodiment, the principle and design are as follows:
Speech recognition services are typically deployed on terminal devices that have a certain amount of computing and storage resources and serve multiple users simultaneously. Because user activity varies across time periods, both the average load and the peak load must be considered when configuring hardware resources, which easily leads either to wasted resources or to longer response times at peak periods. The sliding window mechanism described in this embodiment can control the consumption of computing resources by setting the window width, which provides a new way to make full use of computing resources. When a new user requests service, the system can dynamically set the window width according to the current load condition: when the system load is heavy, the window width is reduced as much as possible while keeping recognition accuracy usable; when the system load is light, a larger window width can be set to provide higher recognition accuracy.
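One simple way such a policy could be expressed is sketched below; the load range, window widths and linear interpolation are purely illustrative assumptions, not values given by this embodiment.

```python
def choose_window_width(cpu_load, min_width=32, max_width=256):
    """Pick a mask window width from the current system load (0.0 - 1.0):
    heavier load -> smaller window -> less computation per request."""
    cpu_load = min(max(cpu_load, 0.0), 1.0)
    # interpolate linearly between the widest and the narrowest window
    width = max_width - (max_width - min_width) * cpu_load
    return int(width)

# example: at 80 % load a new recognition session gets a narrower window
session_window = choose_window_width(cpu_load=0.8)
```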
In addition to balancing computing resource consumption, the window-size adjustment strategy can be extended to more scenarios, such as providing different window sizes for users of different priorities, providing customized services when computing resources are fixed, or allowing users to configure an appropriate trade-off between recognition accuracy and response time themselves.
In the technical scheme disclosed in this embodiment, a word embedding feature sequence and an audio feature sequence corresponding to an audio stream are obtained, the audio feature sequence is encoded by an audio encoder to obtain an audio context sequence, and the word embedding sequence is encoded by a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding, the audio context sequence and the text context sequence are input to a joint decoder to obtain a probability distribution result of a tag, and an identification result is determined according to the probability distribution result. By using a self-attention encoder based on relative position encoding, it is achieved to avoid repeated calculations in the self-attention mechanism, thereby achieving the effect of reducing the amount of calculations for speech recognition.
In addition, an embodiment of the present invention further provides a terminal device, where the terminal device includes: a memory, a processor and a streaming speech recognition program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the streaming speech recognition method as described in the various embodiments above.
In addition, referring to fig. 5, an embodiment of the present invention further provides a terminal device 100, where the terminal device 100 includes:
the acquiring module 101 is configured to acquire a word embedding feature sequence and an audio feature sequence corresponding to an audio stream;
an encoding module 102, configured to encode the audio feature sequence by an audio encoder to obtain an audio context sequence, and encode the word embedding sequence according to a tag encoder to obtain a text context sequence, where the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
a decoding module 103, configured to input the audio context sequence and the text context sequence into a joint decoder, obtain a probability distribution result of the tag, and determine an identification result according to the probability distribution result.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a streaming voice recognition program is stored on the computer-readable storage medium, and when being executed by a processor, the streaming voice recognition program implements the steps of the streaming voice recognition method according to the above embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a terminal device (e.g. PC or server) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A streaming speech recognition method, comprising:
acquiring a word embedding characteristic sequence and an audio characteristic sequence corresponding to an audio stream;
encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label, and determining an identification result according to the probability distribution result.
2. The streaming speech recognition method of claim 1, wherein the steps of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence and encoding the word embedding sequence by a tag encoder to obtain a text context sequence further comprise:
detecting whether a preset mask window is filled;
updating the position coding sequence when the mask window is filled;
and executing the step of encoding the audio feature sequence through an audio encoder based on the updated position encoding sequence to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence.
3. The streaming speech recognition method of claim 2, wherein the audio encoder uses the sequence of audio features as an input vector and the tag encoder uses the embedded sequence of words as an input vector.
4. The streaming speech recognition method of claim 3, wherein the steps of encoding the audio feature sequence by an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence by a tag encoder to obtain a text context sequence comprise:
the audio encoder and the tag encoder determine attention weight coefficients according to the input vector and the position coding sequence;
performing weighted calculation according to the weight coefficient and the input vector to obtain an initial result;
and inputting the initial result into a corresponding feedforward network layer to obtain the audio context sequence and the text context sequence.
5. The streaming speech recognition method of claim 2, wherein the relative position between corresponding individual audio feature vectors in the sequence of audio features is determined from the absolute position of the audio feature vectors in the sequence of audio features.
6. The streaming speech recognition method of claim 2, wherein the step of detecting whether the predetermined mask window is filled further comprises:
and acquiring and identifying the current load condition of the system, and determining the window size of the mask window according to the current load condition.
7. The streaming speech recognition method of claim 1, wherein the step of obtaining the word-embedding feature sequence and the audio feature sequence corresponding to the audio stream is preceded by the step of:
receiving an audio data stream;
the step of obtaining the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream comprises the following steps:
and generating the audio characteristic sequence based on the audio data stream, and acquiring the word embedding characteristic sequence, wherein the audio characteristic sequence is a Mel frequency cepstrum coefficient or a Mel filter bank coefficient.
8. A terminal device, characterized in that the terminal device comprises: memory, processor and a streaming speech recognition program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the streaming speech recognition method according to any of claims 1 to 7.
9. A terminal device, characterized in that the terminal device comprises:
the acquisition module is used for acquiring the word embedding characteristic sequence and the audio characteristic sequence corresponding to the audio stream;
the encoding module is used for encoding the audio feature sequence through an audio encoder to obtain an audio context sequence, and encoding the word embedding sequence according to a tag encoder to obtain a text context sequence, wherein the audio encoder and the tag encoder are self-attention encoders based on relative position encoding;
and the decoding module is used for inputting the audio context sequence and the text context sequence into a joint decoder to obtain a probability distribution result of the label and determining an identification result according to the probability distribution result.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a streaming speech recognition program which, when executed by a processor, implements the steps of the streaming speech recognition method according to any one of claims 1 to 7.
CN202111119338.5A 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium Pending CN113838468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111119338.5A CN113838468A (en) 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111119338.5A CN113838468A (en) 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium

Publications (1)

Publication Number Publication Date
CN113838468A true CN113838468A (en) 2021-12-24

Family

ID=78969642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111119338.5A Pending CN113838468A (en) 2021-09-24 2021-09-24 Streaming voice recognition method, terminal device and medium

Country Status (1)

Country Link
CN (1) CN113838468A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473529A (en) * 2019-09-09 2019-11-19 极限元(杭州)智能科技股份有限公司 A kind of Streaming voice transcription system based on from attention mechanism
CN111291183A (en) * 2020-01-16 2020-06-16 支付宝(杭州)信息技术有限公司 Method and device for carrying out classification prediction by using text classification model
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method
CN111859954A (en) * 2020-07-01 2020-10-30 腾讯科技(深圳)有限公司 Target object identification method, device, equipment and computer readable storage medium
CN112599122A (en) * 2020-12-10 2021-04-02 平安科技(深圳)有限公司 Voice recognition method and device based on self-attention mechanism and memory network
CN113270086A (en) * 2021-07-19 2021-08-17 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙水发 et al.: 《ImageJ图像处理与实践》 (ImageJ Image Processing and Practice), National Defense Industry Press (国防工业出版社), 31 December 2013, pages 20-21 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331660A (en) * 2022-08-09 2022-11-11 北京市商汤科技开发有限公司 Neural network training method, speech recognition method, apparatus, device and medium
CN115631275A (en) * 2022-11-18 2023-01-20 北京红棉小冰科技有限公司 Multi-mode driven human body action sequence generation method and device
CN117116264A (en) * 2023-02-20 2023-11-24 荣耀终端有限公司 Voice recognition method, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN109785824B (en) Training method and device of voice translation model
CN113838468A (en) Streaming voice recognition method, terminal device and medium
JP7490804B2 (en) System and method for streaming end-to-end speech recognition with asynchronous decoders - Patents.com
Liu et al. Efficient lattice rescoring using recurrent neural network language models
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
EP1241661B1 (en) Speech recognition apparatus
CN112530403B (en) Voice conversion method and system based on semi-parallel corpus
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN113241075A (en) Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN115827854A (en) Voice abstract generation model training method, voice abstract generation method and device
CN113488028B (en) Speech transcription recognition training decoding method and system based on fast jump decoding
WO2022083165A1 (en) Transformer-based automatic speech recognition system incorporating time-reduction layer
US20210073645A1 (en) Learning apparatus and method, and program
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN110349570B (en) Speech recognition model training method, readable storage medium and electronic device
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
KR20200120595A (en) Method and apparatus for training language model, method and apparatus for recognizing speech
CN112133304A (en) Low-delay speech recognition model based on feedforward neural network and training method
CN117558263B (en) Speech recognition method, device, equipment and readable storage medium
JP2010224418A (en) Voice synthesizer, method, and program
Na et al. Learning adaptive downsampling encoding for online end-to-end speech recognition
CN112562686B (en) Zero-sample voice conversion corpus preprocessing method using neural network
US20230134942A1 (en) Apparatus and method for self-supervised training of end-to-end speech recognition model
CN117456999B (en) Audio identification method, audio identification device, vehicle, computer device, and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination