CN114822509A - Speech recognition method, speech recognition device, computer equipment and storage medium - Google Patents

Speech recognition method, speech recognition device, computer equipment and storage medium

Info

Publication number
CN114822509A
CN114822509A
Authority
CN
China
Prior art keywords
target
audio
features
network
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210587567.8A
Other languages
Chinese (zh)
Inventor
丁超越
宗道明
李家魁
李宝祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210587567.8A priority Critical patent/CN114822509A/en
Publication of CN114822509A publication Critical patent/CN114822509A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a voice recognition method, apparatus, computer device and storage medium, wherein the method comprises: acquiring target audio data to be identified, and performing audio feature extraction on the target audio data to obtain audio extraction features corresponding to the target audio data; inputting the audio extraction features into a pre-trained target encoder to obtain audio encoding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network during feature extraction based on a self-attention mechanism; and determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.

Description

Speech recognition method, speech recognition device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an apparatus, a computer device, and a storage medium.
Background
Voice wake-up plays an important role in intelligent voice interaction, and voice wake-up algorithms have great application value in scenarios such as smart vehicle cabins, smart homes and intelligent robots.
In related application scenarios, the audio data to be recognized for voice wake-up generally needs to be processed locally on a terminal device such as a smart speaker or a mobile phone, which requires a small model size and a fast processing speed.
Disclosure of Invention
The embodiment of the disclosure at least provides a voice recognition method, a voice recognition device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a speech recognition method, including:
acquiring target audio data to be identified, and performing audio feature extraction on the target audio data to obtain audio extraction features corresponding to the target audio data;
inputting the audio extraction features into a pre-trained target encoder to obtain audio encoding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network during feature extraction based on a self-attention mechanism;
and determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.
In this way, the audio extraction features corresponding to the target audio data to be recognized are input into a target encoder whose target query matrix is obtained based on a pooling processing operation, so as to obtain the audio coding features corresponding to the target audio data, and a speech recognition result corresponding to the target audio data can then be determined based on the audio coding features. By performing the pooling processing operation, the number of parameters of the generated target query matrix is reduced, and so is the size of the self-attention network output subsequently computed from the target query matrix, thereby compressing the output data of the self-attention network and improving the efficiency of voice wake-up.
In a possible implementation manner, the performing audio feature extraction on the target audio data to obtain an audio extracted feature corresponding to the target audio data includes:
performing initial feature extraction on the target audio data, and determining a Mel frequency cepstrum coefficient feature corresponding to the target audio data;
performing feature dimension conversion processing on the mel frequency cepstrum coefficient features to obtain the audio extraction features; wherein the dimension of the audio extraction features is higher than the dimension of the mel-frequency cepstrum coefficient features.
In a possible implementation, after determining the mel-frequency cepstrum coefficient feature corresponding to the target audio data, the method further includes:
performing feature enhancement processing on the mel-frequency cepstrum coefficient features to obtain enhanced mel-frequency cepstrum coefficient features;
the step of performing feature dimension conversion processing on the mel-frequency cepstrum coefficient features to obtain the audio extraction features comprises the following steps:
and performing feature dimension conversion processing on the enhanced mel-frequency cepstrum coefficient features based on a target convolutional neural network to obtain audio extraction features corresponding to the target audio data.
In this way, by performing the feature enhancement processing on the mel-frequency cepstrum coefficient features before the feature dimension conversion processing, it is possible to increase the feature information when performing the feature dimension conversion processing, and thus it is possible to improve the feature extraction effect when performing the feature dimension conversion processing.
In a possible implementation, the target encoder further includes a target feedforward neural network structure, and the target feedforward neural network structure includes a convolution layer and a normalization layer for processing the features output by the self-attention network.
In this way, compared with the more complex structure of the original feedforward neural network, setting the target feedforward neural network structure to a convolution layer and a normalization layer reduces the computation cost of this stage and improves the efficiency of speech recognition.
In a possible implementation, the self-attention network includes a relative position coding module;
and the relative position coding module is used for carrying out relative position coding processing on the target query matrix.
In this way, the target query matrix is subjected to relative position coding processing based on the relative position coding module, and position information can be introduced into a network structure constructed based on a multi-head self-attention mechanism, so that the accuracy of a final output result of the self-attention network can be ensured.
In a possible implementation, the determining, based on the audio coding feature, a speech recognition result corresponding to the target audio data includes:
based on a target pooling network, pooling the audio coding features to obtain pooled target audio features;
inputting the target audio features into a trained target classification network to obtain a target probability matrix output by the target classification network and aiming at preset keywords of each category;
and determining a voice recognition result corresponding to the target audio data based on the target probability matrix.
Thus, the audio coding features with the sequence length of n corresponding to the target audio data to be recognized can be converted into the target audio features without the sequence length by performing pooling processing on the audio coding features, so that voice recognition is facilitated; and processing the target audio features through the trained target classification network, so that a voice recognition result can be obtained.
In one possible embodiment, the method further comprises training the target classification network according to the following steps:
acquiring sample data and a target label corresponding to the sample data;
determining sample audio features corresponding to the sample data based on the sample data, a target encoder, and the target pooling network;
inputting the sample audio features to the target classification network to be trained to obtain a first prediction result output by the target classification network; inputting the sample audio features into a trained teacher neural network to obtain a second prediction result output by the teacher neural network;
and determining a target loss value of the training based on the first prediction result, the second prediction result and the target label, and adjusting network parameters of the target classification network based on the target loss value.
In this way, knowledge distillation is performed on the target classification network, so that the target classification network can take both network accuracy and network scale into account.
In one possible embodiment, the determining the target loss value of the current training based on the first prediction result, the second prediction result, and the target label includes:
determining a first loss value based on the first prediction result and the target label; and determining a second loss value based on the first prediction result and the second prediction result;
determining the target loss value based on the first loss value and the second loss value.
In this way, when the target loss value is determined, the second prediction result output by the teacher neural network is processed, the processed result is used as an approximate real label, and the cross entropy loss of the determined approximate real label and the target classification network is used as a second loss value for training the target classification network, so that the first prediction result output by the target classification network and the first loss value determined by the target label are supplemented, and the network precision of the target classification network is improved.
In a second aspect, an embodiment of the present disclosure further provides a speech recognition apparatus, including:
the acquisition module is used for acquiring target audio data to be identified and extracting audio features of the target audio data to obtain audio extraction features corresponding to the target audio data;
the input module is used for inputting the audio extraction features into a pre-trained target encoder to obtain audio coding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network during feature extraction based on a self-attention mechanism;
and the determining module is used for determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.
In a possible implementation manner, when performing audio feature extraction on the target audio data to obtain an audio extracted feature corresponding to the target audio data, the obtaining module is configured to:
performing initial feature extraction on the target audio data, and determining a Mel frequency cepstrum coefficient feature corresponding to the target audio data;
performing feature dimension conversion processing on the mel frequency cepstrum coefficient features to obtain the audio extraction features; wherein the dimension of the audio extraction features is higher than the dimension of the mel-frequency cepstrum coefficient features.
In a possible implementation manner, after determining the mel-frequency cepstrum coefficient feature corresponding to the target audio data, the obtaining module is further configured to:
performing feature enhancement processing on the mel-frequency cepstrum coefficient features to obtain enhanced mel-frequency cepstrum coefficient features;
the obtaining module is configured to, when performing feature dimension conversion processing on the mel-frequency cepstrum coefficient feature to obtain the audio extraction feature:
and performing feature dimension conversion processing on the enhanced mel-frequency cepstrum coefficient features based on a target convolutional neural network to obtain audio extraction features corresponding to the target audio data.
In a possible implementation, the target encoder further includes a target feedforward neural network structure, and the target feedforward neural network structure includes a convolution layer and a normalization layer for processing the feature output from the attention network.
In a possible implementation, the self-attention network includes a relative position coding module;
and the relative position coding module is used for coding the relative position of the target query matrix.
In a possible implementation manner, the determining module, when determining a speech recognition result corresponding to the target audio data based on the audio coding feature, is configured to:
based on a target pooling network, pooling the audio coding features to obtain pooled target audio features;
inputting the target audio features into a trained target classification network to obtain a target probability matrix output by the target classification network and aiming at preset keywords of each category;
and determining a voice recognition result corresponding to the target audio data based on the target probability matrix.
In a possible implementation, the input module is further configured to train the target classification network according to the following steps:
acquiring sample data and a target label corresponding to the sample data;
determining sample audio features corresponding to the sample data based on the sample data, a target encoder, and the target pooling network;
inputting the sample audio features to the target classification network to be trained to obtain a first prediction result output by the target classification network; inputting the sample audio features into a trained teacher neural network to obtain a second prediction result output by the teacher neural network;
and determining a target loss value of the training based on the first prediction result, the second prediction result and the target label, and adjusting network parameters of the target classification network based on the target loss value.
In one possible embodiment, the input module, when determining the target loss value of the current training based on the first prediction result, the second prediction result and the target label, is configured to:
determining a first loss value based on the first prediction result and the target label; and determining a second loss value based on the first prediction result and the second prediction result;
determining the target loss value based on the first loss value and the second loss value.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, the disclosed embodiments further provide a computer-readable storage medium, where a computer program is stored, and the computer program is executed by a processor to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.
For the description of the effects of the speech recognition apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the speech recognition method, which is not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings are incorporated in and form a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope, since those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 illustrates a flow chart of a speech recognition method provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating a network structure of a self-attention network in a speech recognition method provided by an embodiment of the present disclosure;
FIG. 3 illustrates an overall flow chart of a speech recognition method provided by the disclosed embodiments;
fig. 4 is a schematic diagram illustrating an architecture of a speech recognition apparatus provided in an embodiment of the present disclosure;
fig. 5 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Research shows that the audio data to be recognized for voice wake-up generally needs to be processed locally on terminal devices such as smart speakers or mobile phones, which requires a small model size and a high processing speed. However, to guarantee the accuracy of the recognition result, the models used in the related art often consume considerable computing resources, which affects the voice wake-up efficiency of the terminal device.
Early voice wake-up schemes often chose to build the voice wake-up model with hidden Markov models. In this approach, a hidden Markov model needs to be trained for each keyword or wake-up word that can trigger voice wake-up, and each model needs to be decoded with the Viterbi algorithm, so the recognition process requires considerable computing resources and is therefore inefficient.
With the rapid development of deep learning technology, neural network models such as the Transformer are gradually replacing traditional hidden Markov models for speech recognition tasks. Taking the Transformer model as an example, although it has strong representation and generalization capability, this strong performance comes with a large model size and a high computation cost, which makes it difficult to deploy the model on terminal devices. How to optimize the network structure of a neural network model so that it better suits the deployment scenario of a speech recognition task has therefore become a problem to be solved in this field.
Based on the above research, the present disclosure provides a speech recognition method, apparatus, computer device and storage medium, in which the audio extraction features corresponding to the target audio data to be recognized are input into a target encoder whose target query matrix is obtained based on a pooling processing operation, so as to obtain the audio coding features corresponding to the target audio data; a speech recognition result corresponding to the target audio data can then be determined based on the audio coding features. By performing the pooling processing operation, the number of parameters of the generated target query matrix is reduced, and so is the size of the self-attention network output subsequently computed from the target query matrix, thereby compressing the output data of the self-attention network and improving the efficiency of voice wake-up.
To facilitate understanding of the present embodiment, first, a speech recognition method disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the speech recognition method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the speech recognition method may be implemented by a processor invoking computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a speech recognition method provided in the embodiment of the present disclosure is shown, where the method includes S101 to S103, where:
s101: the method comprises the steps of obtaining target audio data to be identified, and carrying out audio feature extraction on the target audio data to obtain audio extraction features corresponding to the target audio data.
S102: inputting the audio extraction features into a pre-trained target encoder to obtain audio encoding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network in feature extraction based on a self-attention mechanism.
S103: and determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.
The following is a detailed description of the above steps.
For S101, the target audio data may be collected by an audio data collecting module deployed in the terminal device, where the audio data collecting module may be, for example, a microphone or the like.
In one possible implementation, when performing audio feature extraction on the target audio data, the audio feature extraction may be performed through the following steps a1 to a 2:
a1: and performing initial feature extraction on the target audio data, and determining a Mel frequency cepstrum coefficient feature corresponding to the target audio data.
A2: performing feature dimension conversion processing on the mel frequency cepstrum coefficient features to obtain the audio extraction features; wherein the dimension of the audio extraction features is higher than the dimension of the mel-frequency cepstrum coefficient features.
Here, the feature dimension conversion process may be used to convert a two-dimensional mel-frequency cepstrum coefficient feature into a three-dimensional audio extraction feature.
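For illustration only (this code is not part of the disclosure), the initial feature extraction of step A1 could be performed with an off-the-shelf library such as torchaudio; the sample rate, number of coefficients and frame parameters below are placeholder assumptions rather than values taken from this disclosure.

```python
import torch
import torchaudio

# Hypothetical settings: 16 kHz audio, 40 MFCC coefficients per frame.
mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)

waveform = torch.randn(1, 16000)       # one second of placeholder audio
mfcc = mfcc_transform(waveform)        # shape (1, 40, num_frames): a two-dimensional feature per utterance
```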
Specifically, when the feature dimension conversion processing is performed, a target convolutional neural network including a convolutional layer may be used to sequentially perform convolution processing, activation processing, and pooling on the mel-frequency cepstrum coefficient feature, so as to obtain an audio extraction feature after feature dimension conversion.
For example, taking the pooling process as the maximum pooling process and the convolution process as the two-dimensional convolution process as an example, the formula when performing the feature dimension conversion process may be:
x0 = MaxPool(ReLU(conv2d(x)))
where x represents the mel-frequency cepstrum coefficient features to be converted, conv2d represents the two-dimensional convolution processing, ReLU represents activation processing using a ReLU function, MaxPool represents the maximum pooling processing, and x0 represents the audio extraction features obtained after the conversion.
In practical application, the audio extraction features obtained after feature dimension conversion processing need to be input into an encoder of a target model, and in order to enable the audio extraction features to meet the input standard of the target model, when two-dimensional convolution processing is performed, the number of channels of the two-dimensional convolution operation needs to be the same as the embedding dimension of the encoder of the target model, so that the audio extraction features after feature dimension conversion processing can better meet the input standard of the target model.
In one possible implementation, before the audio extraction features are input to the pre-trained target encoder, a feature dimension conversion process may be performed multiple times to extract deeper audio features, so as to reduce the amount of parameters input to the target encoder.
Specifically, in order to make the input audio extraction features conform to the input standard of the target encoder, when feature dimension conversion processing is performed multiple times, the number of channels of two-dimensional convolution processing used in the feature dimension conversion processing performed last time may be set to the same value as the embedding dimension of the target encoder.
For example, taking the embedding dimension of the target encoder as 4, and performing feature dimension conversion processing on mel-frequency cepstrum coefficient features to be converted for 4 times as an example, when performing feature dimension conversion processing for the 1 st to 3 rd times, the number of channels for two-dimensional convolution processing may be set to be 3; when the feature dimension conversion processing of the 4 th time is performed, the number of channels of the two-dimensional convolution processing can be set to be 4, so that the audio extraction features obtained after the dimension conversion processing meet the input standard of the target encoder.
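A minimal PyTorch sketch of the repeated feature dimension conversion described above, following x0 = MaxPool(ReLU(conv2d(x))) and the four-stage example with an (illustrative) encoder embedding dimension of 4; the kernel sizes, pooling strides and the intermediate channel count of 3 are assumptions rather than values fixed by this disclosure.

```python
import torch
import torch.nn as nn

class DimConversion(nn.Module):
    """Stacks several conv2d -> ReLU -> max-pool stages, per x0 = MaxPool(ReLU(conv2d(x)))."""
    def __init__(self, embed_dim=4, num_stages=4, mid_channels=3):
        super().__init__()
        stages = []
        in_ch = 1                                  # the MFCC map is treated as a 1-channel 2-D input
        for i in range(num_stages):
            # Only the last stage uses a channel count equal to the encoder embedding dimension.
            out_ch = embed_dim if i == num_stages - 1 else mid_channels
            stages += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2),
            ]
            in_ch = out_ch
        self.stages = nn.Sequential(*stages)

    def forward(self, x):                          # x: (batch, 1, n_mfcc, num_frames)
        return self.stages(x)

x = torch.randn(2, 1, 40, 96)                      # placeholder MFCC batch
features = DimConversion()(x)                      # (2, 4, 2, 6): channel dim equals the embedding dim
```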
Further, after determining the mel-frequency cepstrum coefficient features corresponding to the target audio data, feature enhancement processing may be performed on the mel-frequency cepstrum coefficient features to obtain the mel-frequency cepstrum coefficient features after enhancement processing, and feature dimension conversion processing is performed on the mel-frequency cepstrum coefficient features after enhancement processing based on a target convolutional neural network to obtain audio extraction features corresponding to the target audio data.
Here, the mel-frequency cepstral coefficient features may be feature-enhanced using a data enhancement technique, which may be, for example, a SpecAugment technique.
In this way, by performing the feature enhancement processing on the mel-frequency cepstrum coefficient features before the feature dimension conversion processing, it is possible to increase the feature information when performing the feature dimension conversion processing, and thus it is possible to improve the feature extraction effect when performing the feature dimension conversion processing.
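As one possible realization of the feature enhancement processing (the disclosure names SpecAugment as an example technique), torchaudio's frequency and time masking transforms could be applied to the mel-frequency cepstrum coefficient features; the mask widths below are placeholder assumptions.

```python
import torch
import torchaudio

# Illustrative SpecAugment-style masking; mask widths are placeholder values.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=20)

mfcc = torch.randn(1, 40, 96)             # (channel, n_mfcc, frames)
augmented = time_mask(freq_mask(mfcc))    # enhanced features fed to the dimension conversion stage
```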
S102: inputting the audio extraction features into a pre-trained target encoder to obtain audio encoding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network in feature extraction based on a self-attention mechanism.
Here, the pre-trained target encoder may be an encoder in a pre-trained transform model, the target encoder may include a plurality of levels of encoders, network structures of the encoders of the levels may be the same, the encoder of the previous level outputs data to the encoder of the next level, and the output of the encoder of the last level is the audio coding feature; the Transformer model mainly comprises an encoder and a decoder, wherein the encoder and the decoder are respectively used for encoding and decoding input data.
For example, the self-attention network may be a network structure constructed based on a multi-head self-attention mechanism, and a schematic diagram of the network structure of each self-attention network may be as shown in fig. 2. In fig. 2, K, Q and V denote, in order, a target Key matrix, a target Query matrix and a target Value matrix; all three are feature representation matrices determined by the self-attention network when performing feature extraction based on the self-attention mechanism. The activation function used in the first activation layer may be, for example, a softmax activation function; the activation function used in the second activation layer may be, for example, a GELU activation function; and the size of the convolution kernel in the convolutional layer may be, for example, 1 × 1.
Among them, K and V are the feature representation matrices obtained after processing by the convolutional layer and the normalization layer, while Q is the feature representation matrix obtained after processing by the pooling layer, the convolutional layer and the normalization layer in sequence; during pooling, the pooling layer may perform an average pooling operation on the input feature information (i.e., the feature maps).
In this way, compared with the query matrix directly obtained based on convolution processing and normalization processing, the pooling processing operation is performed by adding the pooling layer, so that the parameter quantity of the generated target query matrix is less, and the subsequent parameter quantity of the output from the attention network obtained based on the target query matrix is less, thereby realizing the compression of the output data from the attention network and improving the efficiency of voice recognition.
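The following single-head sketch illustrates the idea of computing the target query matrix from a pooled input while K and V are computed directly, so that the attention output is shorter; it is an illustration under assumptions (1 × 1 convolutions, LayerNorm, average pooling with a stride of 2), not the patented network structure of fig. 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledQueryAttention(nn.Module):
    """Single-head sketch: K, V from conv + norm; Q from avg-pool -> conv -> norm (fewer query positions)."""
    def __init__(self, dim, pool_size=2):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=pool_size, stride=pool_size)
        self.to_q = nn.Conv1d(dim, dim, kernel_size=1)
        self.to_k = nn.Conv1d(dim, dim, kernel_size=1)
        self.to_v = nn.Conv1d(dim, dim, kernel_size=1)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_k = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (batch, seq_len, dim)
        x_t = x.transpose(1, 2)                  # (batch, dim, seq_len) for the 1x1 convolutions
        q = self.norm_q(self.to_q(self.pool(x_t)).transpose(1, 2))   # (batch, seq_len / pool, dim)
        k = self.norm_k(self.to_k(x_t).transpose(1, 2))              # (batch, seq_len, dim)
        v = self.norm_v(self.to_v(x_t).transpose(1, 2))
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                          # (batch, seq_len / pool, dim): compressed output

out = PooledQueryAttention(dim=64)(torch.randn(2, 100, 64))   # -> (2, 50, 64)
```

With a pooling stride of 2, the query sequence, and hence the block's output, is half the input length, which is where the parameter and computation savings described above come from.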
In a possible implementation, the target encoder may further include a target feedforward neural network structure, where the target feedforward neural network structure includes a convolution layer and a normalization layer, and is used to process features output by the self-attention network.
Here, the size of the convolution kernel in the convolution layer may be 1 × 1, and the expansion coefficient of the target feedforward neural network structure is 2; that is, after the input features are processed by the convolution layer and the normalization layer in the target feedforward neural network structure, the feature dimension first changes from the initial dimension (d) to twice the initial dimension (2d) and then back to the initial dimension (d).
Thus, by setting the expansion coefficient of the target feedforward neural network structure to 2, the computation cost of this stage can be reduced and the efficiency of speech recognition can be improved, compared with the larger expansion coefficient (e.g. 4, which is common in the original feedforward neural network).
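A sketch of such a target feedforward neural network structure under the stated settings (1 × 1 convolution, a normalization layer, expansion coefficient 2); the specific choice of BatchNorm and of a GELU activation between the two convolutions is an assumption, not something specified for this block in the disclosure.

```python
import torch
import torch.nn as nn

class TargetFeedForward(nn.Module):
    """Sketch of a conv + normalization feed-forward block with expansion factor 2 (d -> 2d -> d)."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim * expansion, kernel_size=1),   # 1x1 convolution expands d -> 2d
            nn.BatchNorm1d(dim * expansion),                  # normalization layer (this choice is an assumption)
            nn.GELU(),                                        # activation (assumed)
            nn.Conv1d(dim * expansion, dim, kernel_size=1),   # project back 2d -> d
        )

    def forward(self, x):                 # x: (batch, seq_len, dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

y = TargetFeedForward(dim=64)(torch.randn(2, 50, 64))   # -> (2, 50, 64)
```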
In practical applications, the self-attention network is often built on a multi-head self-attention mechanism: the split input data is processed in parallel by several identical network structures (i.e., multiple heads). Because the split data must later be merged, the position information corresponding to each head needs to be recorded, so a position encoding module is added to perform the corresponding processing when each head processes its data.
In one possible implementation, a relative position coding module may be included in the self-attention network; and the relative position coding module is used for carrying out relative position coding processing on the target query matrix.
For example, the schematic diagram of the relative position encoding module may be as shown in fig. 2, and in fig. 2, the position encoding module may be a relative position encoding module.
For example, the formula when performing the relative position encoding process may be:
(Formula image BDA0003662542210000141: relative position encoding, in which the position information B is added to the attention scores computed from Q and K; the formula is not reproduced here in text form.)
In the formula, Q and K are the target query matrix and the target key matrix, and B is the position information added during the relative position encoding processing; i and i' are positive integers less than or equal to the height h of the feature map; j and j' are positive integers less than or equal to the width w of the feature map; and n indexes the network structures (heads) constructed based on the multi-head self-attention mechanism, being a positive integer less than or equal to the number N of such network structures (heads).
In this way, the target query matrix is subjected to relative position coding processing based on the relative position coding module, and position information can be introduced into a network structure constructed based on a multi-head self-attention mechanism, so that the accuracy of a final output result of the self-attention network can be ensured.
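Since the exact formula is only given as an image, the sketch below shows one common way (an assumption here, not necessarily the disclosure's exact scheme) to realize relative position encoding for a feature map of height h and width w: a learned bias table indexed by the relative offsets (i - i', j - j') is added to each head's attention logits before the softmax.

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """One common realization (assumed) of a learned 2-D relative position bias B added to attention logits."""
    def __init__(self, num_heads, height, width):
        super().__init__()
        # One learnable bias per head for every relative offset: (2h - 1) x (2w - 1) entries.
        self.table = nn.Parameter(torch.zeros(num_heads, 2 * height - 1, 2 * width - 1))
        coords = torch.stack(torch.meshgrid(
            torch.arange(height), torch.arange(width), indexing="ij"), dim=-1).reshape(-1, 2)
        rel = coords[:, None, :] - coords[None, :, :]                 # (hw, hw, 2): (i - i', j - j')
        self.register_buffer("rel", rel + torch.tensor([height - 1, width - 1]))   # shift to non-negative indices

    def forward(self, attn_logits):                                   # attn_logits: (batch, heads, hw, hw)
        bias = self.table[:, self.rel[..., 0], self.rel[..., 1]]      # (heads, hw, hw)
        return attn_logits + bias.unsqueeze(0)                        # softmax is applied afterwards

logits = torch.randn(2, 4, 12, 12)                 # e.g. 4 heads over a 3 x 4 feature map (hw = 12)
biased = RelativePositionBias(4, 3, 4)(logits)
```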
S103: and determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.
In one possible implementation, when determining the speech recognition result corresponding to the target audio data, the following steps B1 to B3 may be performed:
b1: and based on a target pooling network, pooling the audio coding features to obtain pooled target audio features.
Here, the pooling process is used to transform the audio coding feature with the sequence length n corresponding to the target audio data to be recognized into the target audio feature without the sequence length, thereby facilitating the speech recognition.
Specifically, the formula when pooling the audio coding features may be:
z = softmax(g(x_s)^T) × x_s
where z represents the target audio features after pooling, whose dimension may be b × 99 × d, b being the batch size and d being the embedding dimension of the target encoder; softmax represents activation processing using a softmax activation function; g represents the linear processing performed by a linear layer; and x_s represents the audio coding features, whose dimension is b × n × d, n being the sequence length corresponding to the target audio data.
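A sketch of the pooling formula above, with g implemented as a linear layer that scores each time step. Note that, as written, the formula collapses the sequence dimension to a single vector of shape b × 1 × d, so the b × 99 × d dimension mentioned in the text may reflect additional details not captured in this simplified illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequencePooling(nn.Module):
    """Sketch of z = softmax(g(x_s)^T) x x_s, with g a linear layer scoring each time step."""
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Linear(dim, 1)

    def forward(self, x_s):                        # x_s: (batch, n, d) audio coding features
        weights = F.softmax(self.g(x_s).transpose(1, 2), dim=-1)   # (batch, 1, n) attention over time steps
        return weights @ x_s                       # (batch, 1, d): sequence length pooled away

z = SequencePooling(dim=64)(torch.randn(2, 120, 64))   # -> (2, 1, 64)
```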
B2: and inputting the target audio features into a trained target classification network to obtain a target probability matrix output by the target classification network and aiming at preset keywords of each category.
Here, the preset keyword may be a wake-up word capable of waking up the terminal device.
B3: and determining a voice recognition result corresponding to the target audio data based on the target probability matrix.
Here, the target probability matrix is used to represent an estimated probability that the target audio data contains preset keywords of each category.
For example, if the number of preset keywords is 25, the target probability matrix may contain 26 estimated probabilities: 25 of them are the estimated probabilities that the corresponding preset keyword is contained, and 1 is the estimated probability that no preset keyword is contained.
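For illustration, a minimal decision step over such a target probability matrix might look as follows; the convention that the last index denotes "no keyword" is an assumption made only for this sketch.

```python
import torch

probs = torch.softmax(torch.randn(1, 26), dim=-1)   # hypothetical target probability matrix: 25 keywords + "none"
keyword_id = probs.argmax(dim=-1).item()             # category with the highest estimated probability
woke_up = keyword_id != 25                           # assuming index 25 means "no preset keyword contained"
```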
Thus, the audio coding features with the sequence length of n corresponding to the target audio data to be recognized can be converted into the target audio features without the sequence length by performing pooling processing on the audio coding features, so that voice recognition is facilitated; and processing the target audio features through the trained target classification network, so that a voice recognition result can be obtained.
In practical application, since the target classification network is used to obtain the final speech recognition result, its performance has a great influence on the accuracy of that result, so the network accuracy of the target classification network needs to be guaranteed. At the same time, the network is constrained by the hardware bottleneck of the terminal device at deployment time, so the network scale (i.e., the number of parameters) of the target classification network cannot be too large. The target classification network can therefore be optimized accordingly so that it takes both network accuracy and network scale into account.
In one possible implementation, when training the target classification network, the training may be performed through the following steps C1-C4:
c1: and acquiring sample data and a target label corresponding to the sample data.
Here, the target tag is a tag corresponding to a true category of the sample data.
C2: based on the sample data, a target encoder, and the target pooling network, a sample audio feature corresponding to the sample data is determined.
Here, the sample data may be sequentially input to the target encoder and the target pooling network, thereby obtaining the sample audio features output by the target pooling network.
C3: inputting the sample audio features to the target classification network to be trained to obtain a first prediction result output by the target classification network; and inputting the sample audio features into the trained teacher neural network to obtain a second prediction result output by the teacher neural network.
Here, the teacher neural network may be a neural network with a higher precision trained in advance, and the type of the teacher neural network may be the same as that of the target classification network; the target classification network is used as a student neural network for knowledge distillation during training, and the network structure of the target classification network can be divided into two parts, namely a classification head network and a distillation head network.
After the sample audio features are input into the target classification network, a part of the sample audio features are input into the classification head network, and the rest of the sample audio features are input into the distillation head network, so that first prediction results output by the classification head network and the distillation head network respectively are obtained, the first prediction results output by the distillation head network are only used for knowledge distillation of the target classification network, and the first prediction results output by the classification head network in the target classification network can be used as finally obtained speech recognition results during specific deployment.
For example, taking the sample audio feature as b × 99 × d as an example, the dimension of the feature input to the classification head network may be b × 1 × d, and the dimension of the feature input to the distillation head network may be b × 98 × d, where b is the batch size and d is the embedding dimension of the target encoder, so that the data processing amount in actual prediction is smaller, and the efficiency of performing speech recognition may be improved.
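A sketch of how the pooled sample audio features could be split between the classification head and the distillation head, following the b × 99 × d example above; the use of single linear layers, mean pooling over each part, and the class count of 26 are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TargetClassificationNetwork(nn.Module):
    """Sketch: splits pooled features into a classification-head part and a distillation-head part."""
    def __init__(self, dim, num_classes=26):       # 25 keywords + 1 "no keyword" class, per the example
        super().__init__()
        self.classifier_head = nn.Linear(dim, num_classes)
        self.distillation_head = nn.Linear(dim, num_classes)

    def forward(self, features):                   # features: (b, 99, d), per the example in the text
        cls_part, dist_part = features[:, :1, :], features[:, 1:, :]   # (b, 1, d) and (b, 98, d)
        z_sc = self.classifier_head(cls_part.mean(dim=1))              # first prediction, classification head
        z_sd = self.distillation_head(dist_part.mean(dim=1))           # first prediction, distillation head
        return z_sc, z_sd                           # each: (b, num_classes)

z_sc, z_sd = TargetClassificationNetwork(dim=64)(torch.randn(4, 99, 64))
```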
In this way, knowledge distillation can be performed on the target classification network through a teacher neural network with high network precision, so as to optimize the target classification network, and a loss value used in the knowledge distillation is described in detail below and is not described herein again.
C4: and determining a target loss value of the training based on the first prediction result, the second prediction result and the target label, and adjusting network parameters of the target classification network based on the target loss value.
Therefore, the teacher neural network is used for training the target classification network to be trained, so that the knowledge distillation of the target classification network can be realized, and the target classification can take network precision and network scale into account at the same time.
In one possible embodiment, the following steps C41-C42 may be used in determining the target loss value for the current training:
c41: determining a first loss value based on the first prediction result and the target label; and determining a second loss value based on the first prediction and the second prediction.
Here, in determining the first loss value, the first loss value may be determined based on a preset first loss function, the first prediction result, and the target label, and the type of the first loss function may be, for example, a cross entropy loss function; in determining the second loss value, the second loss value may be determined based on a preset second loss function, the first prediction result, and the second prediction result, and the type of the second loss function may be, for example, a cross-entropy loss function.
C42: determining the target loss value based on the first loss value and the second loss value.
Here, in determining the target loss value, the target loss value may be determined based on weighting coefficients corresponding to the first loss value, the second loss value, and the first loss value and the second loss value, respectively.
For example, the formula for determining the target loss value may be:
(Formula image BDA0003662542210000171: the target loss, combining the cross-entropy loss between ψ(Z_sc) and y with the cross-entropy loss between ψ(Z_sd) and y_t according to their weighting coefficients; the formula is not reproduced here in text form.)
In the formula, L_CE represents the cross-entropy loss; ψ(Z_sc) represents the first prediction result output by the classification head network; y represents the target label; ψ(Z_sd) represents the first prediction result output by the distillation head network; and y_t = argmax_c Z_t(c), where y_t denotes the label corresponding to the second prediction result, Z_t(c) denotes the second prediction result, and argmax denotes processing the second prediction result with an argmax function.
In this way, when the target loss value is determined, the second prediction result output by the teacher neural network is processed, the processed result is used as an approximate real label, and the determined cross entropy loss of the approximate real label and the distillation head network is used as a second loss value for training the target classification network, so that the first prediction result output by the classification head network in the target classification network and the first loss value determined by the target label are supplemented, and the network precision of the target classification network is improved.
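Putting steps C41 and C42 together, a sketch of the target loss might look as follows; the equal 0.5/0.5 weighting of the two cross-entropy terms is a placeholder for the weighting coefficients mentioned above, not a value taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def target_loss(z_sc, z_sd, z_t, y, w1=0.5, w2=0.5):
    """Sketch of the distillation objective: a weighted sum of two cross-entropy terms.
    z_sc / z_sd: classification- and distillation-head logits; z_t: teacher logits; y: target labels."""
    y_t = z_t.argmax(dim=-1)                 # hard pseudo-label from the teacher: y_t = argmax_c Z_t(c)
    loss1 = F.cross_entropy(z_sc, y)         # first loss: classification head vs. ground-truth target label
    loss2 = F.cross_entropy(z_sd, y_t)       # second loss: distillation head vs. teacher's label
    return w1 * loss1 + w2 * loss2

loss = target_loss(torch.randn(4, 26), torch.randn(4, 26), torch.randn(4, 26), torch.randint(0, 26, (4,)))
```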
The above-described speech recognition method will be described in its entirety with reference to specific embodiments. Referring to fig. 3, an overall flowchart of a speech recognition method provided in the embodiment of the present disclosure is shown, where the flowchart mainly includes the following steps:
1. and audio feature extraction is carried out on target audio data to be identified to obtain audio extraction features corresponding to the target audio data.
Specifically, Mel-frequency cepstrum coefficient features of the target audio data may be extracted, and a Waveform map (Waveform) corresponding to the target audio data may be converted into a corresponding Mel spectrum (Mel spectrum).
2. And performing feature enhancement processing on the mel-frequency cepstrum coefficient features to obtain the mel-frequency cepstrum coefficient features after enhancement processing.
Specifically, when the feature enhancement processing is performed, the feature enhancement processing may be performed using the specattribute technique.
3. And carrying out convolution identification processing (Convolutional labeling) on the enhanced mel-frequency cepstrum coefficient characteristics to obtain audio extraction characteristics.
Specifically, the convolution identification processing may perform convolution subsampling (Convolution Subsampling) and maximum pooling (Max Pooling) based on the convolution layer and the pooling layer, respectively.
4. And inputting the audio extraction features into a pre-trained multi-level target encoder to obtain audio encoding features which are output by the target encoder and correspond to the target audio data.
Specifically, the target encoder may be the encoder (Transformer Encoders) in a Transformer model, and the number of levels of the target encoder may be L, where L is a positive integer greater than 1.
5. And performing Sequence Pooling (Sequence Pooling) on the audio coding features based on a target Pooling network to obtain pooled target audio features.
6. And inputting the target audio features into a trained classification head network (Classifier Head) to obtain a target probability matrix output by the classification head network for each category of preset keywords.
For example, the category (class) of the preset keywords may include Stop, Follow, Marvin, and the like, and the target probability matrix is used to represent the probability of including each preset keyword.
In practical application, since the voice wake-up task often suffers from scarce training data, a distillation head network (Distillation Head) can be added during training. The distillation head network and the classification head network together form the target classification network, and the distillation head assists the classification head through knowledge distillation, so that after training the classification head network can take both network accuracy and network scale into account.
According to the voice recognition method provided by the embodiment of the disclosure, the audio extraction features corresponding to the target audio data to be recognized are input into the target encoder containing the target query matrix obtained based on the pooling processing operation, so as to obtain the audio coding features corresponding to the target audio data, and therefore the voice recognition result corresponding to the target audio data can be determined based on the audio coding features. In this way, by performing the pooling processing operation, the parameter quantity of the generated target query matrix can be reduced, and the parameter quantity of the output from the attention network obtained subsequently based on the target query matrix can be reduced, so that the compression of the output data from the attention network is realized, and the efficiency of voice awakening is improved.
It will be understood by those skilled in the art that, in the above method, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, a speech recognition device corresponding to the speech recognition method is also provided in the embodiments of the present disclosure, and because the principle of solving the problem of the device in the embodiments of the present disclosure is similar to the speech recognition method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 4, there is shown a schematic architecture diagram of a speech recognition apparatus according to an embodiment of the present disclosure, the apparatus includes: an acquisition module 401, an input module 402, and a determination module 403; wherein:
the acquisition module 401 is configured to acquire target audio data to be identified, and perform audio feature extraction on the target audio data to obtain an audio extraction feature corresponding to the target audio data;
an input module 402, configured to input the audio extraction feature into a pre-trained target encoder, so as to obtain an audio encoding feature output by the target encoder and corresponding to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network during feature extraction based on a self-attention mechanism;
a determining module 403, configured to determine, based on the audio coding feature, a speech recognition result corresponding to the target audio data.
In a possible implementation manner, when performing audio feature extraction on the target audio data to obtain an audio extracted feature corresponding to the target audio data, the obtaining module 401 is configured to:
performing initial feature extraction on the target audio data, and determining a Mel frequency cepstrum coefficient feature corresponding to the target audio data;
performing feature dimension conversion processing on the mel frequency cepstrum coefficient features to obtain the audio extraction features; wherein the dimension of the audio extraction features is higher than the dimension of the mel-frequency cepstrum coefficient features.
In a possible implementation manner, after determining the mel-frequency cepstrum coefficient features corresponding to the target audio data, the acquisition module 401 is further configured to:
performing feature enhancement processing on the mel-frequency cepstrum coefficient features to obtain enhanced mel-frequency cepstrum coefficient features;
the acquisition module 401, when performing feature dimension conversion processing on the mel-frequency cepstrum coefficient features to obtain the audio extraction features, is configured to:
and performing feature dimension conversion processing on the enhanced mel-frequency cepstrum coefficient features based on a target convolutional neural network to obtain audio extraction features corresponding to the target audio data.
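A possible realization of the enhancement and conversion steps is sketched below; the SpecAugment-style frequency/time masking is only one plausible choice of feature enhancement (the disclosure does not name one), and the single Conv1d layer standing in for the target convolutional neural network, with its 40-to-144 dimension change, is likewise an assumption.

```python
import torch
import torch.nn as nn
import torchaudio

# Hypothetical feature enhancement: SpecAugment-style frequency/time masking.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=8)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=20)

# Hypothetical "target convolutional neural network": one 1-D convolution that lifts
# the 40-dimensional MFCC features to 144 dimensions (sizes are assumptions).
dim_conv = nn.Conv1d(in_channels=40, out_channels=144, kernel_size=3, padding=1)

mfcc = torch.randn(1, 40, 101)                         # (batch, n_mfcc, frames) MFCC features
enhanced = time_mask(freq_mask(mfcc))                  # enhanced MFCC features
audio_extraction = dim_conv(enhanced)                  # (1, 144, 101)
audio_extraction = audio_extraction.transpose(1, 2)    # (batch, frames, 144) for the encoder
```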
In a possible implementation, the target encoder further includes a target feedforward neural network structure, which includes a convolution layer and a normalization layer and is used to process the features output by the self-attention network.
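One way such a target feedforward neural network structure could look is sketched below; the ordering of the normalization layer and convolution layers, the hidden width, and the residual connection are assumptions beyond what the disclosure states.

```python
import torch
import torch.nn as nn

class TargetFeedForward(nn.Module):
    """Feed-forward structure combining a normalization layer with convolution layers."""
    def __init__(self, dim: int = 144, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv1 = nn.Conv1d(dim, hidden, kernel_size=1)
        self.act = nn.ReLU()
        self.conv2 = nn.Conv1d(hidden, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) features output by the self-attention network
        y = self.norm(x).transpose(1, 2)          # normalization layer, then (batch, dim, time)
        y = self.conv2(self.act(self.conv1(y)))   # convolution layers as the feed-forward body
        return x + y.transpose(1, 2)              # residual connection (an assumption)
```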
In a possible implementation, the self-attention network includes a relative position encoding module;
and the relative position coding module is used for carrying out relative position coding processing on the target query matrix.
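A minimal sketch of relative position coding applied to the (pooled) target query matrix is given below, loosely following the Transformer-XL style query-times-relative-embedding term; the embedding table size, maximum relative distance, and the way the resulting bias would be added to the attention logits are assumptions.

```python
import torch
import torch.nn as nn

class RelativePositionCoding(nn.Module):
    """Produces a position-aware score from the target query matrix."""
    def __init__(self, dim: int = 144, max_len: int = 512):
        super().__init__()
        self.rel_emb = nn.Embedding(2 * max_len - 1, dim)
        self.max_len = max_len

    def forward(self, q: torch.Tensor, key_len: int) -> torch.Tensor:
        # q: (batch, q_len, dim) pooled target query matrix
        q_len = q.size(1)
        # relative distance between each query position and each key position
        rel_pos = (torch.arange(key_len, device=q.device)[None, :]
                   - torch.arange(q_len, device=q.device)[:, None])
        rel_pos = rel_pos.clamp(-self.max_len + 1, self.max_len - 1) + self.max_len - 1
        r = self.rel_emb(rel_pos)                   # (q_len, key_len, dim)
        # bias to be added to the content-based attention logits
        return torch.einsum('bqd,qkd->bqk', q, r)   # (batch, q_len, key_len)
```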
In a possible implementation manner, the determining module 403, when determining the speech recognition result corresponding to the target audio data based on the audio coding feature, is configured to:
based on a target pooling network, pooling the audio coding features to obtain pooled target audio features;
inputting the target audio features into a trained target classification network to obtain a target probability matrix, output by the target classification network, for each category of preset keywords;
and determining a voice recognition result corresponding to the target audio data based on the target probability matrix.
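The decision stage described above might be sketched as follows, with mean pooling standing in for the target pooling network and a single linear layer for the target classification network; the number of preset keyword categories and the argmax read-out are assumptions.

```python
import torch
import torch.nn as nn

class KeywordHead(nn.Module):
    """Pool the audio coding features, then classify them into keyword categories."""
    def __init__(self, dim: int = 144, num_keywords: int = 5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)              # target pooling network
        self.classifier = nn.Linear(dim, num_keywords)   # target classification network

    def forward(self, audio_coding: torch.Tensor) -> torch.Tensor:
        # audio_coding: (batch, time, dim) features output by the target encoder
        pooled = self.pool(audio_coding.transpose(1, 2)).squeeze(-1)  # (batch, dim)
        return torch.softmax(self.classifier(pooled), dim=-1)         # target probability matrix

# The speech recognition result can then be read off the probability matrix,
# e.g. by taking the most probable preset keyword for each utterance.
probs = KeywordHead()(torch.randn(2, 50, 144))
keyword_id = probs.argmax(dim=-1)
```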
In a possible implementation, the input module 402 is further configured to train the target classification network according to the following steps:
acquiring sample data and a target label corresponding to the sample data;
determining sample audio features corresponding to the sample data based on the sample data, a target encoder, and the target pooling network;
inputting the sample audio features to the target classification network to be trained to obtain a first prediction result output by the target classification network; inputting the sample audio features into a trained teacher neural network to obtain a second prediction result output by the teacher neural network;
and determining a target loss value of the training based on the first prediction result, the second prediction result and the target label, and adjusting network parameters of the target classification network based on the target loss value.
In one possible embodiment, the input module 402, when determining the target loss value of the current training based on the first prediction result, the second prediction result and the target label, is configured to:
determining a first loss value based on the first prediction result and the target label; and determining a second loss value based on the first prediction result and the second prediction result;
determining the target loss value based on the first loss value and the second loss value.
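A hedged sketch of this distillation-style training is given below: the first loss compares the first prediction result with the target label, the second loss compares the first prediction result with the second prediction result from the teacher, and the target loss is their weighted combination. The cross-entropy/KL choice, the weighting factor, the temperature, and all function and module names are assumptions.

```python
import torch
import torch.nn.functional as F

def compute_target_loss(first_prediction, second_prediction, target_label,
                        alpha: float = 0.5, temperature: float = 2.0):
    # first_prediction / second_prediction are raw logits (an assumption)
    first_loss = F.cross_entropy(first_prediction, target_label)
    second_loss = F.kl_div(
        F.log_softmax(first_prediction / temperature, dim=-1),
        F.softmax(second_prediction / temperature, dim=-1),
        reduction="batchmean",
    )
    return alpha * first_loss + (1.0 - alpha) * second_loss

def train_step(sample_data, target_label, encoder, pooling, student, teacher, optimizer):
    with torch.no_grad():
        sample_features = pooling(encoder(sample_data))   # sample audio features
        second_prediction = teacher(sample_features)      # trained teacher neural network
    first_prediction = student(sample_features)           # target classification network
    loss = compute_target_loss(first_prediction, second_prediction, target_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the teacher and the shared encoder/pooling stages are frozen, which matches the description that only the network parameters of the target classification network are adjusted based on the target loss value.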
According to the speech recognition apparatus provided by the embodiments of the present disclosure, the audio extraction features corresponding to the target audio data to be recognized are input into a target encoder whose target query matrix is obtained through a pooling processing operation, so as to obtain the audio coding features corresponding to the target audio data; the speech recognition result corresponding to the target audio data can then be determined based on the audio coding features. By performing the pooling processing operation, the parameter quantity of the generated target query matrix is reduced, and so is the parameter quantity of the output subsequently produced by the self-attention network based on that target query matrix. The output data of the self-attention network is thereby compressed, which improves the efficiency of voice wake-up.
For the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the related description in the above method embodiments; details are not repeated here.
Based on the same technical concept, the embodiments of the present disclosure also provide a computer device. Referring to fig. 5, a schematic structural diagram of a computer device 500 provided in an embodiment of the present disclosure is shown; the computer device 500 includes a processor 501, a memory 502, and a bus 503. The memory 502 is used for storing execution instructions and includes an internal memory 5021 and an external storage 5022. The internal memory 5021 temporarily stores operation data of the processor 501 and data exchanged with the external storage 5022, such as a hard disk; the processor 501 exchanges data with the external storage 5022 through the internal memory 5021. When the computer device 500 runs, the processor 501 communicates with the memory 502 through the bus 503, so that the processor 501 executes the following instructions:
acquiring target audio data to be identified, and performing audio feature extraction on the target audio data to obtain audio extraction features corresponding to the target audio data;
inputting the audio extraction features into a pre-trained target encoder to obtain audio encoding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network during feature extraction based on a self-attention mechanism;
and determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.
Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the speech recognition method described in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product carrying program code; the instructions included in the program code may be used to execute the steps of the speech recognition method described in the foregoing method embodiments. For details, reference may be made to the foregoing method embodiments, which are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: those skilled in the art can still make modifications or changes to the embodiments described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the technical scope of the disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. A speech recognition method, comprising:
acquiring target audio data to be identified, and performing audio feature extraction on the target audio data to obtain audio extraction features corresponding to the target audio data;
inputting the audio extraction features into a pre-trained target encoder to obtain audio encoding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network during feature extraction based on a self-attention mechanism;
and determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.
2. The method according to claim 1, wherein the performing audio feature extraction on the target audio data to obtain the audio extraction features corresponding to the target audio data comprises:
performing initial feature extraction on the target audio data, and determining a Mel frequency cepstrum coefficient feature corresponding to the target audio data;
performing feature dimension conversion processing on the mel frequency cepstrum coefficient features to obtain the audio extraction features; wherein the dimension of the audio extraction features is higher than the dimension of the mel-frequency cepstrum coefficient features.
3. The method of claim 2, wherein after determining the mel-frequency cepstral coefficient features corresponding to the target audio data, the method further comprises:
performing feature enhancement processing on the mel-frequency cepstrum coefficient features to obtain enhanced mel-frequency cepstrum coefficient features;
the step of performing feature dimension conversion processing on the mel-frequency cepstrum coefficient features to obtain the audio extraction features comprises the following steps:
and performing feature dimension conversion processing on the enhanced mel-frequency cepstrum coefficient features based on a target convolutional neural network to obtain audio extraction features corresponding to the target audio data.
4. The method according to any one of claims 1 to 3, wherein the target encoder further comprises a target feedforward neural network structure, and the target feedforward neural network structure comprises a convolution layer and a normalization layer for processing the features output by the self-attention network.
5. The method according to any one of claims 1 to 4, wherein the self-attention network comprises a relative position coding module;
and the relative position coding module is used for carrying out relative position coding processing on the target query matrix.
6. The method according to any one of claims 1 to 5, wherein the determining the speech recognition result corresponding to the target audio data based on the audio coding features comprises:
based on a target pooling network, pooling the audio coding features to obtain pooled target audio features;
inputting the target audio features into a trained target classification network to obtain a target probability matrix, output by the target classification network, for each category of preset keywords;
and determining a voice recognition result corresponding to the target audio data based on the target probability matrix.
7. The method of claim 6, further comprising training the target classification network according to the steps of:
acquiring sample data and a target label corresponding to the sample data;
determining sample audio features corresponding to the sample data based on the sample data, a target encoder, and the target pooling network;
inputting the sample audio features to the target classification network to be trained to obtain a first prediction result output by the target classification network; inputting the sample audio features into a trained teacher neural network to obtain a second prediction result output by the teacher neural network;
and determining a target loss value of the training based on the first prediction result, the second prediction result and the target label, and adjusting network parameters of the target classification network based on the target loss value.
8. The method of claim 7, wherein determining the target loss value for the training based on the first predicted result, the second predicted result, and the target label comprises:
determining a first loss value based on the first prediction result and the target label; and determining a second loss value based on the first prediction result and the second prediction result;
determining the target loss value based on the first loss value and the second loss value.
9. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring target audio data to be identified and performing audio feature extraction on the target audio data to obtain audio extraction features corresponding to the target audio data;
the input module is used for inputting the audio extraction features into a pre-trained target encoder to obtain audio coding features which are output by the target encoder and correspond to the target audio data; wherein the target encoder comprises a self-attention network that employs pooling processing operations in determining a target query matrix; the target query matrix is one of a plurality of feature representation matrices determined by the self-attention network during feature extraction based on a self-attention mechanism;
and the determining module is used for determining a voice recognition result corresponding to the target audio data based on the audio coding characteristics.
10. A computer device, comprising: a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device is running, the processor and the memory communicate over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the speech recognition method according to any one of claims 1 to 8.
11. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the speech recognition method according to any one of claims 1 to 8.
CN202210587567.8A 2022-05-25 2022-05-25 Speech recognition method, speech recognition device, computer equipment and storage medium Pending CN114822509A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210587567.8A CN114822509A (en) 2022-05-25 2022-05-25 Speech recognition method, speech recognition device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210587567.8A CN114822509A (en) 2022-05-25 2022-05-25 Speech recognition method, speech recognition device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114822509A true CN114822509A (en) 2022-07-29

Family

ID=82519459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210587567.8A Pending CN114822509A (en) 2022-05-25 2022-05-25 Speech recognition method, speech recognition device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114822509A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333950A (en) * 2023-11-30 2024-01-02 苏州元脑智能科技有限公司 Action generation method, device, computer equipment and storage medium
CN117333950B (en) * 2023-11-30 2024-03-12 苏州元脑智能科技有限公司 Action generation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
CN110288665B (en) Image description method based on convolutional neural network, computer-readable storage medium and electronic device
US11961513B2 (en) Low-power automatic speech recognition device
CN111048082B (en) Improved end-to-end speech recognition method
CN110134946B (en) Machine reading understanding method for complex data
CN111767697B (en) Text processing method and device, computer equipment and storage medium
CN111814479B (en) Method and device for generating enterprise abbreviations and training model thereof
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN114579743A (en) Attention-based text classification method and device and computer readable medium
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
CN114822509A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN112580669A (en) Training method and device for voice information
CN117011943A (en) Multi-scale self-attention mechanism-based decoupled 3D network action recognition method
CN116978370A (en) Speech processing method, device, computer equipment and storage medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN112735469B (en) Low-memory voice keyword detection method, system, medium, equipment and terminal
CN110347813B (en) Corpus processing method and device, storage medium and electronic equipment
CN116364102A (en) Data processing method and device, equipment and storage medium
CN113870896A (en) Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
CN113095435A (en) Video description generation method, device, equipment and computer readable storage medium
CN114818644B (en) Text template generation method, device, equipment and storage medium
CN113436621B (en) GPU (graphics processing Unit) -based voice recognition method and device, electronic equipment and storage medium
CN111339782B (en) Sign language translation system and method based on multilevel semantic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination