CN117636845A - Speech recognition method, device, equipment and storage medium

Info

Publication number
CN117636845A
Authority: CN (China)
Prior art keywords: frame, audio, blank, label, tag
Legal status: Pending
Application number: CN202311595931.6A
Other languages: Chinese (zh)
Inventor
郭顺杰
宋亚楠
万根顺
熊世富
高建清
潘嘉
刘聪
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202311595931.6A
Publication of CN117636845A
Legal status: Pending

Classifications

    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 — Training of speech recognition systems
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/18 — Speech classification or search using natural language modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a speech recognition method, device, equipment and storage medium. The specific implementation scheme is as follows: determining a skip frame number based on the tag state of the i-th frame of audio, where i is a positive integer; performing frame-skipping decoding processing on the i-th frame of audio by using the skip frame number to obtain a non-blank tag feature corresponding to a target audio frame, where the target audio frame is an audio frame, preceding the (i+1)-th frame of audio, whose tag state is a non-blank tag; predicting the tag state of the (i+1)-th frame of audio based on the (i+1)-th frame of audio and the non-blank tag feature corresponding to the target audio frame; and determining a speech recognition result of the (i+1)-th frame of audio according to the tag state of the (i+1)-th frame of audio. This technical scheme can significantly improve the inference speed, thereby improving the efficiency of speech recognition.

Description

Speech recognition method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition is a machine learning technique that enables a machine to automatically convert speech into the corresponding text, giving the machine a capability analogous to human hearing. With continuous breakthroughs in artificial intelligence technology, voice input plays an increasingly important role in more and more scenarios and business fields.
In current end-to-end speech recognition systems, the input samples and data streams must be processed continuously and output as symbols. Although an end-to-end speech recognition model has good recognition performance, it is constrained by its model structure and memory footprint, so its inference speed is low and the efficiency of speech recognition suffers.
Disclosure of Invention
In order to solve the above problems, the present application provides a speech recognition method, apparatus, electronic device, and storage medium, which can significantly improve the inference speed and thereby the efficiency of speech recognition.
According to a first aspect of an embodiment of the present application, there is provided a speech recognition method, including:
determining a skip frame number based on a tag state of the i-th frame audio; wherein i is a positive integer;
performing frame-skipping decoding processing on the i-th frame of audio by using the skip frame number to obtain a non-blank tag feature corresponding to a target audio frame; the target audio frame is an audio frame, preceding the (i+1)-th frame of audio, whose tag state is a non-blank tag;
predicting the tag state of the (i+1)-th frame of audio based on the (i+1)-th frame of audio and the non-blank tag feature corresponding to the target audio frame;
and determining a speech recognition result of the (i+1)-th frame of audio according to the tag state of the (i+1)-th frame of audio.
Optionally, the determining the frame skip number based on the tag state of the ith frame of audio includes:
when the label state of the i-th frame of audio is a blank label, determining a blank frame number k based on the blank label of the i-th frame of audio, and taking the blank frame number k as the skip frame number; wherein k is a positive integer less than i;
and determining that the skip frame number is 0 in the case that the label state of the ith frame of audio is a non-blank label.
Optionally, in the case that the label state of the ith frame of audio is a blank label, determining k blank frames based on the blank label of the ith frame of audio includes:
in the case that the blank label is a single blank label, determining that the blank frame number is 1;
in the case that the blank label is a plurality of blank labels, determining the blank frame number k according to the type of the plurality of blank labels.
Optionally, when the tag state of the i-th frame audio is a blank tag, performing frame skip decoding processing on the i-th frame audio by using the frame skip number to obtain a non-blank tag feature corresponding to the target audio frame, where the method includes:
and determining the target audio frame as the ith-k frame audio based on the ith frame audio and the k blank frame numbers, and extracting non-blank label characteristics corresponding to the ith-k frame audio.
Optionally, in the case that the tag state of the i-th frame of audio is a non-blank tag, performing frame skip decoding processing on the i-th frame of audio by using the frame skip number to obtain a non-blank tag feature corresponding to the target audio frame, where the method includes:
determining the i-th frame audio as the target audio frame;
and decoding the non-blank label of the ith frame of audio to obtain the non-blank label characteristic corresponding to the ith frame of audio.
Optionally, the predicting the tag state of the i+1st frame audio based on the i+1st frame audio and the non-blank tag feature corresponding to the target audio frame includes:
fusion processing is carried out on the i+1st frame audio and the non-blank label characteristics corresponding to the target audio frame, so that joint audio characteristics are obtained;
and carrying out regression prediction according to the joint audio characteristics to obtain the tag state of the i+1st frame audio.
Optionally, the performing regression prediction according to the joint audio feature to obtain a tag state of the i+1st frame audio includes:
inputting the joint audio features into a preset regression prediction model to obtain the tag state of the i+1st frame audio; the preset regression prediction model is a model obtained by optimizing weight parameters of which the label states are a plurality of blank labels based on gradients of the weight parameters, and the gradients of the weight parameters are determined according to label states output by the model and losses of the label states output by the model.
Optionally, the optimization process of the preset regression prediction model further includes:
determining a corresponding transmitting time constraint function according to the label state output by the preset regression prediction model;
and constraining the gradient of the weight parameter by using the transmitting time constraint function.
According to a second aspect of embodiments of the present application, there is provided a voice recognition apparatus, including:
a determining module, configured to determine a frame skip number based on a tag state of the i-th frame of audio; wherein i is a positive integer;
the processing module is used for carrying out frame skipping decoding processing on the ith frame of audio by utilizing the frame skipping number to obtain non-blank tag characteristics corresponding to the target audio frame; the target audio frame represents an audio frame with a label state of non-blank labels before the (i+1) th frame of audio;
the prediction module is used for predicting the label state of the (i+1) th frame of audio based on the (i+1) th frame of audio and the non-blank label characteristics corresponding to the target audio frame;
and the recognition module is used for determining a voice recognition result of the (i+1) th frame of audio according to the tag state of the (i+1) th frame of audio.
A third aspect of the present application provides an electronic device, comprising:
a memory and a processor;
The memory is connected with the processor and used for storing programs;
the processor implements the above-mentioned voice recognition method by running the program in the memory.
A fourth aspect of the present application provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described speech recognition method.
One embodiment of the above application has the following advantages or benefits:
determining a skip frame number based on the tag state of the i-th frame of audio; performing frame-skipping decoding processing on the i-th frame of audio by using the skip frame number to obtain a non-blank tag feature corresponding to the target audio frame, where the non-blank tag feature is obtained by feature extraction of the non-blank tag when the tag state of the target audio frame is a non-blank tag; predicting the tag state of the (i+1)-th frame of audio based on the (i+1)-th frame of audio and the non-blank tag feature corresponding to the target audio frame; and determining the speech recognition result of the (i+1)-th frame of audio according to the tag state of the (i+1)-th frame of audio. In this way, the skip frame number is determined according to the tag state of the i-th frame of audio, the audio frame closest to the i-th frame whose tag state is a non-blank tag is located through frame-skipping decoding, and its non-blank tag feature is retrieved, so the non-blank tag does not need to be decoded again when predicting the tag state of the next frame of audio. This speeds up the prediction of the tag state of the next frame of audio and thereby improves the efficiency of speech recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the present application;
fig. 2 is a specific flowchart of step S130 of a voice recognition method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a model application according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of model training according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiment of the application is suitable for being applied to various speech recognition scenes, such as conference scenes, online education scenes and the like. By adopting the technical scheme of the embodiment of the application, the efficiency of voice recognition can be improved.
The technical scheme of the embodiments of the present application can be applied, by way of example, to hardware devices such as processors, electronic devices, and servers (including cloud servers), or can be packaged as a software program to be run. When the hardware device executes the processing procedure of the technical scheme, or the software program is run, the skip frame number is determined according to the tag state of the i-th frame of audio, the audio frame closest to the i-th frame whose tag state is a non-blank tag is located through frame-skipping decoding, and the non-blank tag feature of that audio frame is determined. The embodiments of the present application only introduce the specific processing procedure of the technical scheme by way of example and do not limit its specific implementation form; any technical implementation form capable of executing the processing procedure of the technical scheme may be adopted.
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the related art, an end-to-end speech recognition model is generally used to recognize speech content, so as to obtain a recognition result corresponding to the speech content. The end-to-end speech recognition model may be a CTC model, an RNN-T model, or the like. The RNN-T model is actually an improvement on the CTC model, has the outstanding advantages of end-to-end joint optimization, language modeling capability, convenience in realizing on-line voice recognition and the like, and is more suitable for voice recognition tasks.
The end-to-end speech recognition model consists of an acoustic encoder (Encoder), a decoder (Decoder), and a joint network (JointNet). The acoustic encoder corresponds to the acoustic model part, and any acoustic model structure can be used. The decoder corresponds to the language model part and is typically built with a unidirectional recurrent neural network, e.g., an RNN. The joint network is generally modeled with a feed-forward network; its function is to combine the states of the language model and the acoustic model, either by a concatenation operation or by direct addition, and considering that the language model and the acoustic model may carry different weights, the concatenation operation is generally adopted.
Specifically, the acoustic encoder converts the input audio features into a higher-level representation, the decoder extracts features from the tag state output by the model for the previous frame to capture historical context information, and the joint network combines the outputs of the acoustic encoder and the decoder and outputs a probability distribution over the tag state dictionary through a regression prediction model.
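To make the data flow described above concrete, the following is a minimal PyTorch sketch of such a transducer-style model; the module names, dimensions, and the use of concatenation in the joint network are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Converts input audio features into a higher-level representation."""
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, x):                   # x: (batch, T, feat_dim)
        h, _ = self.rnn(x)
        return h                            # (batch, T, hidden_dim)

class LabelDecoder(nn.Module):
    """Encodes the previously emitted non-blank label (historical context)."""
    def __init__(self, vocab_size=1000, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, y_prev, state=None):  # y_prev: (batch, 1) label ids
        h, state = self.rnn(self.embed(y_prev), state)
        return h, state                     # (batch, 1, hidden_dim)

class JointNet(nn.Module):
    """Combines encoder and decoder outputs; concatenation is assumed here."""
    def __init__(self, hidden_dim=256, num_label_states=1005):
        super().__init__()
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_label_states)  # incl. blank tags

    def forward(self, h_enc, h_dec):        # each: (batch, hidden_dim)
        z = torch.tanh(self.proj(torch.cat([h_enc, h_dec], dim=-1)))
        return self.out(z)                  # logits over the tag state dictionary
```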
Exemplary method
Fig. 1 is a flow chart of a speech recognition method according to an embodiment of the present application. In an exemplary embodiment, a method for speech recognition is provided, comprising:
s110, determining a skip frame number based on the tag state of the ith frame of audio; wherein i is a positive integer;
s120, performing frame skipping decoding processing on the ith frame of audio by using the frame skipping number to obtain non-blank tag features corresponding to the target audio frame; the target audio frame represents an audio frame with a label state of non-blank labels before the (i+1) th frame of audio;
s130, predicting the label state of the (i+1) -th frame audio based on the (i+1) -th frame audio and the non-blank label characteristics corresponding to the target audio frame;
s140, determining a voice recognition result of the (i+1) th frame of audio according to the tag state of the (i+1) th frame of audio.
In step S110, the i-th frame of audio may be audio data collected by the audio collection device when the user speaks, or may be audio data obtained from various open source voice libraries, which is not limited herein. The i-th frame of audio may be any frame of audio data, or may be a specified frame of audio data, which is not limited herein.
Illustratively, the tag state indicates whether a character is present in the audio frame. Tag states include blank tags and non-blank tags. A blank tag indicates that the current frame of audio is a blank symbol (Blank), i.e., no character is present in the audio frame. Optionally, blank tags include a single blank tag and a plurality of blank tags. A plurality of blank tags indicates that the blank corresponding to the current frame of audio is part of a run of multiple consecutive blanks. In the present embodiment, the plurality of blank tags may include several types of blank tags, for example, two blank tags, four blank tags, six blank tags, eight blank tags, and the like.
A non-blank tag indicates that the current frame of audio is not a blank; for example, a non-blank tag may indicate the probability that the current frame of audio contains a character, so that a specific character can be recognized from it. The skip frame number indicates the number of frames that need to be skipped. Optionally, a correspondence between tag states and skip frame numbers may be preset. For example, when the tag state is a non-blank tag, no frames need to be skipped and the corresponding skip frame number is 0; when the tag state is a blank tag, frames need to be skipped, and different types of blank tags may be assigned different skip frame numbers. For example, a single blank tag corresponds to a skip frame number of 1; two blank tags correspond to a skip frame number of 2; four blank tags correspond to a skip frame number of 4; eight blank tags correspond to a skip frame number of 8; and so on.
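As an illustration only, the preset correspondence between tag states and skip frame numbers can be kept in a small lookup table; the state names and frame counts below simply mirror the examples in this paragraph and are hypothetical, not prescribed by the patent.

```python
# Hypothetical mapping from tag state to skip frame number,
# mirroring the examples above (non-blank -> 0, single blank -> 1, ...).
SKIP_FRAMES = {
    "non_blank": 0,
    "blank_1": 1,   # single blank tag
    "blank_2": 2,   # two blank tags
    "blank_4": 4,   # four blank tags
    "blank_8": 8,   # eight blank tags
}

def skip_frame_number(tag_state: str) -> int:
    """Return the preset skip frame number for a given tag state."""
    return SKIP_FRAMES[tag_state]
```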
Alternatively, the tag state of the i-th frame audio may be directly obtained, for example, whether the tag state of the i-th frame audio belongs to a blank tag or a non-blank tag is preset.
Alternatively, the tag state of the i-th frame of audio may also be determined according to steps S110-S130. That is, the skip frame number is determined based on the tag state of the (i-1)-th frame of audio; frame-skipping decoding processing is performed using the (i-1)-th frame of audio and that skip frame number to obtain the non-blank tag feature corresponding to the target audio frame; and the tag state of the i-th frame of audio is predicted from the i-th frame of audio and the non-blank tag feature corresponding to the target audio frame.
Specifically, a correspondence between a tag state and a skip frame number may be preset, for example, the tag state is non-blank tag corresponding to skip frame number 0; the label state is that the number of the skip frames corresponding to a single blank label is 1; the label state is that the number of skip frames corresponding to two blank labels is 2, etc. After the label state of the ith frame of audio is determined, the corresponding skip frame number is searched in the corresponding relation between the label state and the skip frame number according to the label state of the ith frame of audio.
In step S120, illustratively, the frame-skipping decoding processing means skipping back a specified number of frames from the specified audio frame to reach the target audio frame and obtaining the decoding result for that audio frame. The target audio frame is the audio frame reached by skipping back, from the i-th frame of audio, the number of frames indicated by the skip frame number. Since the target audio frame precedes the (i+1)-th frame of audio, the non-blank tag feature of the target audio frame serves as the tag feature of the historical information used when predicting the tag state of the (i+1)-th frame of audio. Specifically, the target audio frame obtained by skipping back the skip frame number from the i-th frame of audio is determined, and the non-blank tag feature of that target audio frame is obtained.
Further, the non-blank label feature of the target audio frame may be decoded in advance according to the non-blank label of the target audio frame. The non-blank label feature of the target audio frame may also be obtained by decoding the non-blank label after the target audio frame is determined. In this embodiment, if the target audio frame is the i-th frame audio, the non-blank tag feature corresponding to the target audio frame is obtained by decoding the non-blank tag of the i-th frame audio by the decoder. If the target audio frame is the audio before the i-th frame audio, the non-blank label characteristic corresponding to the target audio frame is obtained by decoding the non-blank label corresponding to the target audio frame before the label state of the i-th frame audio is predicted.
Further, after the target audio frame is determined, judging whether the label state corresponding to the target audio frame is a non-blank label or not; if yes, acquiring non-blank label characteristics of the target audio frame; if not, predicting the label state of the ith frame of audio again, or continuously acquiring the label state of the previous frame of the target audio frame until the audio frame with the label state of non-blank label is found.
In step S130, the tag status of the i+1st frame audio is predicted, illustratively, based on the non-blank tag features corresponding to the i+1st frame audio and the target audio frame being input to a pre-trained end-to-end speech recognition model.
That is, the non-blank tag features corresponding to the target audio frame are decoded in advance according to the decoder. And performing feature conversion on the audio of the (i+1) th frame according to the acoustic encoder to obtain acoustic features of the audio of the (i+1) th frame. And inputting the acoustic characteristics of the i+1st frame audio and the non-blank label characteristics corresponding to the target audio frame into the combined network, and predicting the label state of the i+1st frame audio according to the output result of the combined network. The tag state of the i+1st frame audio is used for prediction of the tag state of the i+2nd frame audio.
In step S140, illustratively, in the case where the tag state of the i+1st frame audio is a non-blank tag, a corresponding character is determined according to the probability of the non-blank tag of the i+1st frame audio, thereby outputting the i+1st frame audio speech recognition result. And determining that the voice recognition result of the (i+1) -th frame audio is a blank symbol under the condition that the tag state of the (i+1) -th frame audio is a blank tag.
In the technical scheme of the present application, the skip frame number is determined based on the tag state of the i-th frame of audio; frame-skipping decoding processing is performed on the i-th frame of audio by using the skip frame number to obtain the non-blank tag feature corresponding to the target audio frame, where the non-blank tag feature is obtained by feature extraction of the non-blank tag when the tag state of the target audio frame is a non-blank tag; the tag state of the (i+1)-th frame of audio is predicted based on the (i+1)-th frame of audio and the non-blank tag feature corresponding to the target audio frame; and the speech recognition result of the (i+1)-th frame of audio is determined according to the tag state of the (i+1)-th frame of audio. In this way, the skip frame number is determined according to the tag state of the i-th frame of audio, the audio frame closest to the i-th frame whose tag state is a non-blank tag is located through frame-skipping decoding, and its non-blank tag feature is determined, so the non-blank tag does not need to be decoded again when predicting the tag state of the next frame of audio. This speeds up the prediction of the tag state of the next frame of audio and improves the efficiency of speech recognition.
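The following Python sketch illustrates the overall frame-skipping decoding loop summarized above. The helper callables (encode_frame, decode_label, joint_predict, skip_frames) are hypothetical stand-ins for the acoustic encoder, decoder, joint network, and the preset tag-state/skip-frame correspondence, so this is a simplified greedy illustration rather than the patent's exact algorithm.

```python
def recognize(frames, encode_frame, decode_label, joint_predict, skip_frames):
    """Greedy frame-skipping decoding sketch (illustrative assumptions only).

    frames:        per-frame audio features
    encode_frame:  frame features -> acoustic representation
    decode_label:  non-blank label -> non-blank tag feature (one decoder pass)
    joint_predict: (acoustic repr., tag feature) -> (tag_state, character)
    skip_frames:   tag_state -> skip frame number (0 for a non-blank tag)
    """
    results = []
    non_blank_feat = decode_label(None)      # start-of-sequence history
    state = "non_blank"
    for i in range(len(frames) - 1):
        k = skip_frames(state)               # S110: determine the skip frame number
        # S120: frame i-k is the target audio frame, the closest preceding frame
        # whose tag state is a non-blank tag; its feature is already cached, so
        # no decoder call is needed while blank frames are being skipped.
        target_feat = non_blank_feat
        # S130: predict the tag state of frame i+1 with the joint network.
        state, char = joint_predict(encode_frame(frames[i + 1]), target_feat)
        # S140: determine the recognition result of frame i+1.
        if state == "non_blank":
            results.append(char)
            non_blank_feat = decode_label(char)  # decoder runs only for non-blank tags
    return results
```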
In one embodiment, the step S110 includes:
when the label state of the i-th frame of audio is a blank label, determining a blank frame number k based on the blank label of the i-th frame of audio, and taking the blank frame number k as the skip frame number;
and determining that the skip frame number is 0 in the case that the label state of the ith frame of audio is a non-blank label.
Optionally, if the tag state of the i-th frame audio is a non-blank tag, predicting the tag state of the i+1-th frame audio uses the non-blank tag feature of the i-th frame audio. Therefore, the audio frame does not need to be skipped, so the skip frame number is 0. And then determining the non-blank tag characteristics of the ith frame of audio, and predicting the tag state of the ith+1st frame of audio according to the ith+1st frame of audio and the non-blank tag characteristics of the ith frame of audio.
Optionally, if the tag state of the i-th frame of audio is a blank tag, predicting the tag state of the (i+1)-th frame of audio requires an audio frame, preceding the i-th frame, whose tag state is a non-blank tag. The blank frame number k is determined according to the type of the blank tag. In order to reach the audio frame preceding the i-th frame whose tag state is a non-blank tag, the skip frame number is set to the blank frame number k. The non-blank tag feature of the (i-k)-th frame of audio is then determined, and the tag state of the (i+1)-th frame of audio is predicted from the (i+1)-th frame of audio and the non-blank tag feature of the (i-k)-th frame of audio. Computing the skip frame number for the different tag-state cases in this way allows the non-blank tag feature needed to predict the next frame of audio to be determined quickly, which speeds up the inference of tag states.
In this embodiment, the pre-trained end-to-end speech recognition model is trained for a tag state of a single blank tag, a plurality of blank tags and a non-blank tag, so that the trained end-to-end speech recognition model can directly predict the type of the blank tag, and thus can determine the tag state of an audio frame and determine the skip frame number.
Further, in the case that the label state of the ith frame of audio is a blank label, determining k blank frames based on the blank label of the ith frame of audio includes:
in the case that the blank label is a single blank label, determining that the blank frame number is 1;
in the case that the blank label is a plurality of blank labels, k blank frame numbers are determined according to the types of the plurality of blank labels.
Illustratively, the types of the plurality of blank labels include two blank labels, four blank labels, and eight blank labels. Correspondingly, two blank label types correspond to two blank frame numbers, four blank label types correspond to four blank frame numbers, and eight blank label types correspond to eight blank frame numbers.
Specifically, when the tag state of the i-th frame of audio is a single blank tag, the tag state of the (i-1)-th frame of audio is a non-blank tag, so the only blank frame is the i-th frame itself and the blank frame number is 1. The tag state of the (i+1)-th frame of audio is then predicted from the (i+1)-th frame of audio and the non-blank tag feature corresponding to the (i-1)-th frame of audio.
When the tag state of the i-th frame of audio is a plurality of blank tags, the type of the plurality of blank tags is determined. If the plurality of blank tags is of the k-blank-tag type, the blank frame number is k, which means that the tag state of the (i-k)-th frame of audio is a non-blank tag. The tag state of the (i+1)-th frame of audio is then predicted from the (i+1)-th frame of audio and the non-blank tag feature corresponding to the (i-k)-th frame of audio. In this way, the historical context information (i.e., the non-blank tag feature corresponding to the (i-k)-th frame of audio) can be extracted quickly on the label side.
In this embodiment, the pre-trained end-to-end speech recognition model is trained for a tag state of a single blank tag, two blank tags, four blank tags, eight blank tags and a non-blank tag, so that the trained end-to-end speech recognition model can directly predict specific conditions of various tags, and therefore can determine the number of frames according to the tag state of an audio frame.
In one embodiment, when the tag state of the i-th frame audio is a blank tag, the frame skip decoding process is performed on the i-th frame audio by using the frame skip number to obtain a non-blank tag feature corresponding to the target audio frame, and step S120 includes:
determining the target audio frame to be the (i-k)-th frame of audio based on the i-th frame of audio and the blank frame number k, and extracting the non-blank label feature corresponding to the (i-k)-th frame of audio.
Specifically, if the label state of the i-th frame of audio is a single blank label and the blank frame number is 1, the target audio frame is the (i-1)-th frame of audio, whose label state is a non-blank label. Because the non-blank label feature corresponding to the (i-1)-th frame of audio was already used when predicting the label state of the i-th frame of audio, it is extracted directly. Meanwhile, the (i+1)-th frame of audio is encoded to obtain the (i+1)-th frame audio feature, and the label state of the (i+1)-th frame of audio is predicted from the non-blank label feature corresponding to the (i-1)-th frame of audio and the (i+1)-th frame audio feature.
If the label state of the i-th frame of audio is a plurality of blank labels, the blank frame number is determined according to the type of the plurality of blank labels. Taking four blank labels as an example, the blank frame number is 4, the target audio frame is the (i-4)-th frame of audio, and the label state of the (i-4)-th frame of audio is a non-blank label. Because the non-blank label feature corresponding to the (i-4)-th frame of audio was already obtained by decoding when predicting the label state of the (i-3)-th frame of audio, it is extracted directly. Meanwhile, the (i+1)-th frame of audio is encoded to obtain the (i+1)-th frame audio feature, and the label state of the (i+1)-th frame of audio is predicted from the non-blank label feature corresponding to the (i-4)-th frame of audio and the (i+1)-th frame audio feature. In this way, the label feature of the historical information needed to predict the state of the next frame of audio is determined quickly through the blank frame number, which improves the inference speed of label-state prediction.
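Under the same assumptions, the reuse described in the two paragraphs above amounts to caching the decoder output of each frame whose label state is a non-blank label and simply fetching the entry for frame i-k; the class below is a hypothetical helper, not part of the patent.

```python
class LabelFeatureCache:
    """Caches the non-blank label feature produced when each frame was decoded."""

    def __init__(self):
        self._feats = {}                  # frame index -> decoder output

    def store(self, frame_idx, feature):
        """Record the non-blank label feature obtained for a given frame."""
        self._feats[frame_idx] = feature

    def target_feature(self, i, k):
        """Return the non-blank label feature of the target audio frame (frame i-k)."""
        return self._feats[i - k]
```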
In one embodiment, when the tag state of the i-th frame of audio is a non-blank tag, the frame-skipping decoding process is performed on the i-th frame of audio by using the frame-skipping number to obtain a non-blank tag feature corresponding to the target audio frame, and step S120 includes:
determining the i-th frame audio as the target audio frame;
and decoding the non-blank label of the ith frame of audio to obtain the non-blank label characteristic corresponding to the ith frame of audio.
If the label state of the ith frame of audio is a non-blank label, directly determining the ith frame of audio as a target audio frame, and then decoding the non-blank label of the ith frame of audio to obtain the non-blank label characteristic of the ith frame of audio. And simultaneously encoding the audio of the (i+1) th frame to obtain the audio characteristics of the (i+1) th frame. And predicting the label state of the i+1st frame of audio according to the non-blank label characteristic corresponding to the i frame of audio and the i+1st frame of audio characteristic. Thus, the label state of the ith frame of audio is a non-blank label, and the information of the non-blank label of the ith frame of audio is extracted so as to predict the label state of the (i+1) th frame of audio.
In one embodiment, as shown in fig. 2, the step S130 includes:
S1310, fusion processing is carried out on the non-blank label characteristics corresponding to the i+1st frame audio and the target audio frame, so as to obtain a joint audio characteristic;
s1320, carrying out regression prediction according to the joint audio characteristics to obtain the tag state of the i+1st frame audio.
Illustratively, the non-blank label features corresponding to the (i+1)-th frame of audio and the target audio frame are input into the joint network to obtain the joint audio feature, and the joint audio feature is then input into a preset regression prediction model (e.g., softmax).
Specifically, as shown in fig. 3, after the tag state of the i-th frame of audio is obtained, it is determined whether the tag state of the i-th frame of audio is a blank tag. If the tag state of the i-th frame of audio is a blank tag, the skip frame number k is determined according to the type of blank tag; for example, if the blank tag is two blank tags, the skip frame number is 2. The target audio frame is determined to be the (i-k)-th frame of audio, and the non-blank tag feature corresponding to the (i-k)-th frame of audio, obtained by decoding when the tag state of the i-th frame of audio was predicted, is extracted and input to the joint network. Meanwhile, the (i+1)-th frame of audio is encoded by the acoustic encoder, and the (i+1)-th frame audio feature is output and input to the joint network.
And then, combining the (i+1) th frame audio characteristics and the non-blank label characteristics corresponding to the (i-k) th frame audio in a combined network to obtain the combined audio characteristics. And then carrying out linear transformation on the combined audio features. Finally, the transformed joint audio characteristics are calculated through softmax, the probability of the i+1st frame audio under each tag state is output, and the tag state of the i+1st frame audio can be determined through beam search.
If the tag state of the i-th frame of audio is not a blank tag (i.e., it is a non-blank tag), the target audio frame is determined to be the i-th frame of audio, the non-blank tag of the i-th frame of audio is input to the decoder, and the corresponding non-blank tag feature is output and input to the joint network. Meanwhile, the (i+1)-th frame of audio is encoded by the acoustic encoder, and the (i+1)-th frame audio feature is output and input to the joint network.
And then, combining the (i+1) th frame of audio features and the non-blank label features corresponding to the (i) th frame of audio in a joint network to obtain joint audio features. And then carrying out linear transformation on the combined audio features. Finally, the transformed joint audio characteristics are calculated through softmax, the probability of the i+1st frame audio under each tag state is output, and the tag state of the i+1st frame audio can be determined through beam search.
Therefore, in the decoding process, the skip frame number is computed according to the tag states of the multiple beam candidates and frame-skipping decoding is performed, which greatly reduces the number of inference passes through the decoder and the joint network and speeds up inference.
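A minimal sketch of the fusion and regression-prediction step described above is given below; the tensor shapes and the use of concatenation followed by a tanh projection are assumptions consistent with the earlier description of the joint network, not an exact reproduction of the patent's implementation.

```python
import torch
import torch.nn.functional as F

def predict_tag_state(h_enc, label_feat, proj, out):
    """Fuse the (i+1)-th frame audio feature with the target frame's non-blank
    label feature, apply a linear transformation, and return the softmax
    distribution over tag states (blank tags of each type plus characters).

    h_enc, label_feat: (batch, hidden_dim) tensors
    proj, out:         linear layers of the joint network and the output layer
    """
    joint = torch.cat([h_enc, label_feat], dim=-1)   # fusion -> joint audio feature
    logits = out(torch.tanh(proj(joint)))            # linear transformation
    probs = F.softmax(logits, dim=-1)                # probability of each tag state
    return probs                                     # beam search can rank these
```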
In one embodiment, the performing regression prediction according to the joint audio feature to obtain the tag status of the i+1st frame audio includes:
inputting the joint audio features into a preset regression prediction model to obtain the tag state of the i+1st frame audio; the preset regression prediction model is a model obtained by optimizing weight parameters of which the label states are a plurality of blank labels based on gradients of the weight parameters, and the gradients of the weight parameters are determined according to the label states output by the model and the loss of the label states output by the model.
Illustratively, the preset regression prediction model is part of the end-to-end speech recognition model and is used to determine the probability distribution, over the tag state dictionary, of the joint audio features output by the joint network. A plurality of blank labels are introduced into the end-to-end speech recognition model, so training data for the plurality of blank labels are added to the training data, and the weight parameters of the tag states (such as the plurality of blank labels) in the regression prediction model are optimized according to the loss of the model's predictions, making the probabilities of the tag states output by the model more accurate. Further, the training data may be divided for two blank labels, four blank labels, and eight blank labels, so that the model can directly output the different types of blank labels.
Specifically, as shown in fig. 4, the training process of the model is as follows: the audio samples are input to the encoder, which outputs an acoustic representation; the label-state sample corresponding to the previous frame of audio is input to the decoder to obtain the corresponding representation; the outputs of the decoder and the encoder are input into the joint network and combined, and a softmax operation is applied to the output of the joint network to obtain the probability of the audio frame for each label state (i.e., the plurality of blank labels, the single blank label, and the non-blank labels). The prediction result is determined, and the label-state weights output by softmax are optimized according to the loss of the prediction result to obtain the trained model. The audio samples may be obtained from any open-source audio library or collected by any terminal device, which is not limited herein; the label state corresponding to each audio sample is annotated in advance.
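The training flow of fig. 4 can be sketched as the following single training step; the attribute names (model.encoder, model.decoder, model.joint) and the per-frame cross-entropy loss are assumptions made for illustration, since the patent only states that the tag-state weights are optimized from the loss of the prediction result.

```python
import torch

def training_step(model, audio_sample, prev_labels, target_states, optimizer, loss_fn):
    """One illustrative training step: encode the audio sample, decode the
    previous-frame label state, combine both in the joint network, and update
    the weight parameters from the gradient of the loss."""
    h_enc = model.encoder(audio_sample)               # acoustic representation
    h_dec, _ = model.decoder(prev_labels)             # label-state representation
    logits = model.joint(h_enc[:, -1], h_dec[:, -1])  # combine encoder/decoder outputs
    loss = loss_fn(logits, target_states)             # e.g. cross-entropy over tag states
    optimizer.zero_grad()
    loss.backward()                                   # gradients of the weight parameters
    optimizer.step()
    return loss.item()
```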
In this embodiment, the training process of the end-to-end speech recognition model is as follows. For any input audio feature sequence $X = \{x_1, \dots, x_T\}$ and label sequence $Y = \{y_1, \dots, y_U\}$, $y_u \in V$ is the emitted label (i.e., tag state) at position $u$ of the output sequence, $T$ is the length of the input sequence, and $U$ is the length of the output sequence. The extended dictionary $\bar{V}$ is obtained by adding the blank symbol (Blank) $\phi$ to the dictionary $V$, i.e., $\bar{V} = V \cup \{\phi\}$; after the large blanks (i.e., the plurality of blank labels) are added, the dictionary becomes $\bar{V} \cup \{\phi_m : m \in M\}$, where $M$ is the set of skip frame numbers, e.g., $M = [1, 2, 4, 8]$. The encoder in the end-to-end speech recognition model is analogous to an acoustic model, converting the acoustic feature $x_t$ into a high-level representation $h_t^{enc}$, where $t$ is the time index. The decoder, as in the standard end-to-end recognition model, conditions on the previously predicted non-blank label $y_{u-1}$ to produce a high-level representation $h_u^{dec}$, where $u$ is the output token index.
The joint network combines the encoder output $h_t^{enc}$ and the decoder output $h_u^{dec}$ through a feed-forward network: $z_{t,u} = \psi(W_{enc} h_t^{enc} + W_{dec} h_u^{dec} + b_z)$, where $W_{enc}$ and $W_{dec}$ are weight matrices, $b_z$ is a bias, and $\psi$ is a nonlinear function such as Tanh or ReLU. $z_{t,u}$ is connected to the output layer by a linear transformation $h_{t,u} = W_y z_{t,u} + b_y$.
Softmax (i.e., the preset regression prediction model) is then applied to obtain the final posterior of each output token $k$: $P(k \mid t, u) = \mathrm{softmax}(h_{t,u})$, abbreviated as $y(t,u) = P(y_{u+1} \mid t, u)$ and $\phi(t,u) = P(\phi \mid t, u)$. The forward variable $\alpha(t,u)$ is defined as the probability of outputting $y[0{:}u]$ over frames $[0{:}t]$; all forward variables for $1 \le t \le T$ and $0 \le u \le U$ can be computed recursively. The backward variable $\beta(t,u)$ is defined as the probability of outputting $y[u{:}U]$ over frames $[t{:}T]$. In a conventional end-to-end model, many blanks have to be filtered out, so inference on a CPU or NPU is slow; the blanks therefore need to be optimized: additional blank symbols are explicitly modeled with a duration, so that a single emission can advance the input time dimension $t$ by two or more frames. The forward variable $\alpha(t,u)$ and backward variable $\beta(t,u)$ are then computed with recursions extended accordingly.
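The recursion formulas themselves appear only as images in the original publication and are not reproduced in the text above. As a point of reference, the following block gives the standard transducer forward-backward recursions extended with an extra sum over the large-blank symbols $\phi_m$, $m \in M$; this is a reconstruction under stated assumptions, not the patent's exact formulas.

```latex
% Standard transducer recursions, extended (by assumption) with large blanks \phi_m
\alpha(t,u) = \alpha(t-1,u)\,\phi(t-1,u) + \alpha(t,u-1)\,y(t,u-1)
            + \sum_{m \in M,\, m > 1} \alpha(t-m,u)\,\phi_m(t-m,u)

\beta(t,u)  = \beta(t+1,u)\,\phi(t,u) + \beta(t,u+1)\,y(t,u)
            + \sum_{m \in M,\, m > 1} \beta(t+m,u)\,\phi_m(t,u)
```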
therefore, the gradient of the parameter based on the predictive update weight of the blank label is a loss (loss) pairThe bias of (2) is calculated as follows:
in one embodiment, the optimization process of the preset regression prediction model further includes:
determining a corresponding transmitting time constraint function according to the label state output by the preset regression prediction model;
and constraining the gradient of the weight parameter by using the transmitting time constraint function.
The transmission time constraint function is used for accelerating the transmission of the tag state, and thus, different transmission time constraint functions are set in advance for the tag state, that is, corresponding transmission time constraint functions are set according to a plurality of blank tags, a single blank tag, and a non-blank tag, respectively.
Specifically, the transmitting time constraint functions for the different tag states are generated from a preset expected delay and the partial derivative of the loss with respect to $h_{t,u}$. During model training, the corresponding transmitting time constraint function is determined according to the tag state output by the model; the transmitting time constraint speeds up label emission and moves the feature information in the encoder forward in time, and the gradient of the weight parameters is updated accordingly.
In this embodiment, the transmitting time constraint function is determined as follows. Let $d(t,u) \in \mathbb{R}_{\ge 0}$ be the delay at position $(t,u)$, defined relative to a reference alignment time on the line $t+u = n$ of the $(T,U)$ lattice. The reference alignment time is obtained from the forced-alignment (FA) boundary information of the audio, which can be generated by forced alignment with a hidden Markov model (Hidden Markov Model, HMM) hybrid system, and the expected delay on the line $t+u = n$ is then defined from the delays $d(t,u)$ on that line.
standard conventional end-to-end recognition models tend to be delay aligned to acquire more future context to improve the loss function, which is expected to be delayed to increase the loss function so that the model learns to predict tags accurately and quickly. Specifically, the end-to-end recognition system model total loss is defined as the original loss L and the expected delay The weighted sum of (2) is:
where lambda is the emission excitation factor.
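The expected-delay term is shown only as an image in the original publication. A plausible form, assuming the usual lattice-occupancy weighting of the per-node delays $d(t,u)$ by the forward and backward variables, is sketched below; this is an assumption for illustration, not the patent's exact definition.

```latex
% Assumed expected emission delay, weighting d(t,u) by the posterior
% probability of emitting y_{u+1} at lattice node (t,u)
\bar{D} = \frac{1}{P(Y \mid X)} \sum_{t,u} \alpha(t,u)\, y(t,u)\, \beta(t,u+1)\, d(t,u),
\qquad
L_{\text{total}} = L + \lambda \bar{D}
```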
In order to advance the transmitting time, the gradient update calculation is constrained according to the transmitting time constraint function described above.
thus, when characters and spaces are explicitly modeled for duration, time delay constraints are placed on the spaces and the time at which the characters are transmitted. Meanwhile, in the decoding process, the frame hopping number is calculated according to the transmission states of a plurality of beam candidates to carry out frame hopping decoding, so that the reasoning times of a decoder and a joint network are greatly reduced, and the reasoning speed is increased. The end-to-end voice recognition model can reduce the tag transmitting time delay and the reasoning time.
Exemplary apparatus
Accordingly, fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application. In an exemplary embodiment, there is provided a voice recognition apparatus including:
a determining module 510, configured to determine a frame skip number based on a tag state of the i-th frame of audio; wherein i is a positive integer;
the processing module 520 is configured to perform frame skipping decoding processing on the ith frame of audio by using the frame skipping number, so as to obtain a non-blank tag feature corresponding to the target audio frame; the target audio frame represents an audio frame with a label state of non-blank labels before the (i+1) th frame of audio;
A prediction module 530, configured to predict a tag state of the i+1st frame audio based on the i+1st frame audio and a non-blank tag feature corresponding to the target audio frame;
and the recognition module 540 is configured to determine a speech recognition result of the i+1st frame audio according to the tag state of the i+1st frame audio.
In one embodiment, the determining module 510 includes:
when the label state of the ith frame of audio is a blank label, determining k blank frames based on the blank label of the ith frame of audio, and determining the k blank frames as skip frames;
and determining that the skip frame number is 0 in the case that the label state of the ith frame of audio is a non-blank label.
In one embodiment, in the case that the label status of the ith frame of audio is a blank label, determining k blank frames based on the blank label of the ith frame of audio includes:
under the condition that the blank label is a single blank label, determining the blank frame number to be 1;
in the case that the blank tag is a transmitting plurality of blank tags, k blank frame numbers are determined according to the type of transmitting the plurality of blank tags.
In one embodiment, in the case that the tag status of the i-th frame of audio is a blank tag, the processing module 520 is further configured to:
determining the target audio frame to be the (i-k)-th frame of audio based on the i-th frame of audio and the blank frame number k, and extracting the non-blank label feature corresponding to the (i-k)-th frame of audio.
In one embodiment, in the case that the tag status of the i-th frame of audio is a non-blank tag, the processing module 520 is further configured to:
determining the i-th frame audio as the target audio frame;
and decoding the non-blank label of the ith frame of audio to obtain the non-blank label characteristic corresponding to the ith frame of audio.
In one embodiment, the prediction module 530 is further configured to:
fusion processing is carried out on the i+1st frame audio and the non-blank label characteristics corresponding to the target audio frame, so that joint audio characteristics are obtained;
and carrying out regression prediction according to the joint audio characteristics to obtain the tag state of the i+1st frame audio.
In one embodiment, the performing regression prediction according to the joint audio feature to obtain the tag status of the i+1st frame audio includes:
inputting the joint audio features into a preset regression prediction model to obtain the tag state of the i+1st frame audio; the preset regression prediction model is a model obtained by optimizing weight parameters of which the label states are a plurality of blank labels based on gradients of the weight parameters, and the gradients of the weight parameters are determined according to the label states output by the model and the loss of the label states output by the model.
In one embodiment, the optimization process of the preset regression prediction model further includes:
determining a corresponding transmitting time constraint function according to the label state output by the preset regression prediction model;
and constraining the gradient of the weight parameter by using the transmitting time constraint function.
The voice recognition device provided in this embodiment belongs to the same application conception as the voice recognition method provided in the foregoing embodiment of the present application, and may execute the voice recognition method provided in any of the foregoing embodiments of the present application, and has a functional module and beneficial effects corresponding to executing the voice recognition method. Technical details not described in detail in this embodiment may be referred to the specific processing content of the voice recognition method provided in the foregoing embodiment of the present application, and will not be described herein.
Exemplary electronic device
Another embodiment of the present application further proposes an electronic device, referring to fig. 6, including:
a memory 600 and a processor 610;
wherein the memory 600 is connected to the processor 610, and is used for storing a program;
the processor 610 is configured to implement the speech recognition method disclosed in any of the above embodiments by executing the program stored in the memory 600.
Specifically, the electronic device may further include: a bus, a communication interface 620, an input device 630, and an output device 640.
The processor 610, the memory 600, the communication interface 620, the input device 630, and the output device 640 are connected to each other by a bus. Wherein:
a bus may comprise a path that communicates information between components of a computer system.
The processor 610 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, or an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the scheme of the present invention. It may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The processor 610 may include a main processor, and may also include a baseband chip, a modem, and the like.
The memory 600 stores programs for implementing the technical scheme of the present invention, and may also store an operating system and other critical services. In particular, the program may include program code including computer-operating instructions. More specifically, memory 600 may include read-only memory (ROM), other types of static storage devices that may store static information and instructions, random access memory (random access memory, RAM), other types of dynamic storage devices that may store information and instructions, disk storage, flash, and the like.
The input device 630 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input means, touch screen, pedometer, or gravity sensor, among others.
Output device 640 may include means such as a display screen, printer, speakers, etc. that allow information to be output to a user.
The communication interface 620 may include devices using any transceiver or the like to communicate with other devices or communication networks, such as ethernet, radio Access Network (RAN), wireless Local Area Network (WLAN), etc.
The processor 610 executes programs stored in the memory 600 and invokes other devices that may be used to implement the steps of any of the speech recognition methods provided in the above-described embodiments of the present application.
Exemplary computer program product and storage Medium
In addition to the methods and apparatus described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a speech recognition method according to the various embodiments of the present application described in the "exemplary methods" section of the present specification.
The computer program product may write program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server.
In addition, embodiments of the present application may also be a storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps in the voice recognition method according to various embodiments of the present application described in the above "exemplary method" section of the present application, and specific working contents of the electronic device described above, and specific working contents of the computer program product described above and the computer program on the storage medium when executed by the processor, may refer to the contents of the above method embodiment, which are not repeated herein.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts described, as some acts may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the identical and similar parts between the embodiments may be referred to each other. For the apparatus embodiments, the description is relatively simple since they are substantially similar to the method embodiments, and reference may be made to the description of the method embodiments for relevant points.
The order of the steps in the methods of the embodiments of the application can be adjusted, and steps can be combined or deleted according to actual needs; likewise, the technical features described in the embodiments can be replaced or combined.
The modules and sub-modules in the device and the terminal of the embodiments of the present application may be combined, divided, and deleted according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of modules or sub-modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional module or sub-module in each embodiment of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be disposed in a random access memory (RAM), a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method of speech recognition, comprising:
determining a skip frame number based on a label state of an i-th frame audio, wherein i is a positive integer;
performing frame skip decoding processing on the i-th frame audio by using the skip frame number to obtain a non-blank label feature corresponding to a target audio frame, wherein the target audio frame represents an audio frame whose label state is a non-blank label and which precedes an (i+1)-th frame audio;
predicting a label state of the (i+1)-th frame audio based on the (i+1)-th frame audio and the non-blank label feature corresponding to the target audio frame;
and determining a speech recognition result of the (i+1)-th frame audio according to the label state of the (i+1)-th frame audio.
2. The method of claim 1, wherein the determining a skip frame number based on the label state of the i-th frame audio comprises:
in a case that the label state of the i-th frame audio is a blank label, determining a blank frame number k based on the blank label of the i-th frame audio and determining k as the skip frame number, wherein k is a positive integer less than i;
and in a case that the label state of the i-th frame audio is a non-blank label, determining that the skip frame number is 0.
3. The method of claim 2, wherein the determining a blank frame number k based on the blank label of the i-th frame audio in the case that the label state of the i-th frame audio is a blank label comprises:
determining that the blank frame number k is 1 in a case that the blank label is a single blank label;
and in a case that the blank labels are a plurality of blank labels, determining the blank frame number k according to types of the plurality of blank labels.
4. The method according to claim 2, wherein, in the case that the label state of the i-th frame audio is a blank label, the performing frame skip decoding processing on the i-th frame audio by using the skip frame number to obtain a non-blank label feature corresponding to a target audio frame comprises:
determining, based on the i-th frame audio and the blank frame number k, that the target audio frame is an (i-k)-th frame audio, and extracting a non-blank label feature corresponding to the (i-k)-th frame audio.
5. The method according to claim 2, wherein, in the case that the label state of the i-th frame audio is a non-blank label, the performing frame skip decoding processing on the i-th frame audio by using the skip frame number to obtain a non-blank label feature corresponding to a target audio frame comprises:
determining the i-th frame audio as the target audio frame;
and decoding the non-blank label of the i-th frame audio to obtain the non-blank label feature corresponding to the i-th frame audio.
6. The method of claim 1, wherein the predicting the label state of the (i+1)-th frame audio based on the (i+1)-th frame audio and the non-blank label feature corresponding to the target audio frame comprises:
performing fusion processing on the (i+1)-th frame audio and the non-blank label feature corresponding to the target audio frame to obtain a joint audio feature;
and performing regression prediction according to the joint audio feature to obtain the label state of the (i+1)-th frame audio.
7. The method of claim 6, wherein the performing regression prediction according to the joint audio feature to obtain the label state of the (i+1)-th frame audio comprises:
inputting the joint audio feature into a preset regression prediction model to obtain the label state of the (i+1)-th frame audio; wherein the preset regression prediction model is obtained by optimizing, based on gradients of weight parameters, the weight parameters whose label states are a plurality of blank labels, and the gradients of the weight parameters are determined according to the label states output by the model and the losses of the label states output by the model.
8. The method of claim 7, wherein optimizing the preset regression prediction model further comprises:
determining a corresponding emission time constraint function according to the label state output by the preset regression prediction model;
and constraining the gradients of the weight parameters by using the emission time constraint function.
9. A speech recognition apparatus, comprising:
a determining module, configured to determine a skip frame number based on a label state of an i-th frame audio, wherein i is a positive integer;
a processing module, configured to perform frame skip decoding processing on the i-th frame audio by using the skip frame number to obtain a non-blank label feature corresponding to a target audio frame, wherein the target audio frame represents an audio frame whose label state is a non-blank label and which precedes an (i+1)-th frame audio;
a prediction module, configured to predict a label state of the (i+1)-th frame audio based on the (i+1)-th frame audio and the non-blank label feature corresponding to the target audio frame;
and a recognition module, configured to determine a speech recognition result of the (i+1)-th frame audio according to the label state of the (i+1)-th frame audio.
10. An electronic device, comprising:
a memory and a processor;
the memory is connected with the processor and is configured to store a program;
and the processor implements the speech recognition method according to any one of claims 1 to 8 by running the program in the memory.
11. A storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method according to any one of claims 1 to 8.
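To make the control flow of claims 1 to 6 concrete, the following is a minimal Python sketch of a greedy frame-skip decoding loop. Everything model-related here (encode_label, fuse, predict), the typed blank symbols <blank_k>, and the <sos> start label are hypothetical stand-ins rather than the patented implementation; in the claimed scheme the model itself is assumed to emit typed blank labels, whereas this toy derives the type with simple bookkeeping so that the skip-frame arithmetic stays well defined.

```python
import re
import numpy as np

FEAT_DIM = 16  # assumed size of the non-blank label feature

def encode_label(label: str) -> np.ndarray:
    """Stand-in for decoding a non-blank label into its non-blank label feature."""
    rng = np.random.default_rng(abs(hash(label)) % (2 ** 32))
    return rng.standard_normal(FEAT_DIM).astype(np.float32)

def fuse(frame_feat: np.ndarray, label_feat: np.ndarray) -> np.ndarray:
    """Stand-in for the fusion step that yields the joint audio feature (claim 6)."""
    return np.concatenate([frame_feat, label_feat])

def predict(joint_feat: np.ndarray) -> str:
    """Stand-in for the regression prediction of the label state (claims 6-7)."""
    return "<blank>" if float(joint_feat.mean()) < 0.0 else "a"

def skip_frames(label: str) -> int:
    """Claims 2-3: a typed blank label encodes the blank frame number k; non-blank -> 0."""
    m = re.fullmatch(r"<blank_(\d+)>", label)
    return int(m.group(1)) if m else 0

def greedy_frame_skip_decode(frames: np.ndarray) -> list[str]:
    labels = ["<sos>"]                  # assumed non-blank start label for frame 0
    feats: dict[int, np.ndarray] = {}   # cached non-blank label features per frame
    blank_run = 0                       # consecutive blank frames seen so far
    for i in range(len(frames) - 1):
        k = skip_frames(labels[i])              # claims 1-2: skip frame number from frame i
        target = max(i - k, 0)                  # claims 4-5: target audio frame
        if target not in feats:                 # label decoding runs only for non-blank frames
            feats[target] = encode_label(labels[target])
        joint = fuse(frames[i + 1], feats[target])  # claim 6: joint audio feature
        out = predict(joint)                        # label state of frame i+1
        if out == "<blank>":
            blank_run += 1
            labels.append(f"<blank_{blank_run}>")   # typed blank; in the patent the model
        else:                                       # itself is assumed to emit such labels
            blank_run = 0
            labels.append(out)
    # claim 1: the recognition result keeps only the non-blank label states
    return [lab for lab in labels[1:] if skip_frames(lab) == 0]
```

Calling greedy_frame_skip_decode(np.zeros((50, 8), dtype=np.float32)) runs the loop over 50 dummy 8-dimensional frames. The point of the structure is that encode_label is invoked only when the target audio frame changes, which is where the claimed inference speed-up would come from.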
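Claims 7 and 8 describe the training side: the weight parameters tied to the plurality of blank labels are optimized under a constraint, applied to their gradients, that depends on the model's emission timing. The patent does not spell out the constraint function, so the snippet below is only one heavily hedged reading: the gradient rows of a hypothetical output weight matrix corresponding to the blank labels are rescaled by a factor that shrinks when non-blank labels are emitted late.

```python
import numpy as np

BLANK_ROWS = [0, 1, 2]   # assumed output-layer rows for the typed blank labels

def emission_time_constraint(pred_labels: list[int]) -> float:
    """Hypothetical constraint: the later the first non-blank emission,
    the smaller the factor applied to the blank-label gradients."""
    blank_ids = set(BLANK_ROWS)
    first_nonblank = next((t for t, y in enumerate(pred_labels) if y not in blank_ids),
                          len(pred_labels))
    return 1.0 / (1.0 + first_nonblank)

def constrained_blank_update(w_out: np.ndarray, grad_w: np.ndarray,
                             pred_labels: list[int], lr: float = 1e-3) -> np.ndarray:
    """Constrain only the gradients of the blank-label weight parameters (claim 8),
    then take a plain SGD step for illustration."""
    factor = emission_time_constraint(pred_labels)
    grad = grad_w.copy()
    grad[BLANK_ROWS] *= factor   # constrained gradients of the blank-label weights
    return w_out - lr * grad
```

The intent of such a constraint would be to keep the model from learning to delay non-blank emissions merely to create longer blank runs; the exact functional form used in the application is not disclosed in the claims.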
CN202311595931.6A 2023-11-23 2023-11-23 Speech recognition method, device, equipment and storage medium Pending CN117636845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311595931.6A CN117636845A (en) 2023-11-23 2023-11-23 Speech recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311595931.6A CN117636845A (en) 2023-11-23 2023-11-23 Speech recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117636845A true CN117636845A (en) 2024-03-01

Family

ID=90031546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311595931.6A Pending CN117636845A (en) 2023-11-23 2023-11-23 Speech recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117636845A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination