CN113223502A - Speech recognition system optimization method, device, equipment and readable storage medium - Google Patents


Info

Publication number
CN113223502A
Authority
CN
China
Prior art keywords
voice
recognized
recognition system
speech
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110467147.1A
Other languages
Chinese (zh)
Other versions
CN113223502B (en)
Inventor
罗剑 (Luo Jian)
王健宗 (Wang Jianzong)
程宁 (Cheng Ning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110467147.1A priority Critical patent/CN113223502B/en
Publication of CN113223502A publication Critical patent/CN113223502A/en
Application granted granted Critical
Publication of CN113223502B publication Critical patent/CN113223502B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application belongs to the technical field of speech semantics and provides a method, an apparatus, a device and a readable storage medium for optimizing a speech recognition system. The method comprises the following steps: acquiring speech to be recognized and inputting it into the speech recognition system for classification recognition, predicting the predicted tag category corresponding to the speech to be recognized through a tag prediction model of the speech recognition system, and predicting the prediction loss value of the tag prediction model through an active learning loss prediction model of the speech recognition system; when the predicted tag category is determined to be inaccurate according to the prediction loss value, acquiring the actual tag category of the speech to be recognized and taking the speech to be recognized and its actual tag category as training data; counting the training data to establish a training set; and performing optimization training on the speech recognition system through the training set and calculating a target loss function until the target loss function converges, so as to obtain the optimized speech recognition system. The method and apparatus can improve the recognition accuracy and reliability of the speech recognition system.

Description

Speech recognition system optimization method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of speech semantic technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for optimizing a speech recognition system.
Background
Deep-learning-based voice assistants are widely used in daily life, for example Xiao AI from Xiaomi, Siri from Apple, Cortana from Microsoft and the like; people can use a voice assistant to query the weather, add reminders, set an alarm clock and so on. However, the speech recognition system configured in current voice assistants is trained on a limited amount of inefficiently, manually labeled speech data, and this limitation of the training data leaves the system with recognition blind spots. As a result, the voice assistant is prone to recognition errors in daily use, its reliability is low, and the user experience is greatly reduced.
Disclosure of Invention
The present application mainly aims to provide a method, an apparatus, a device and a readable storage medium for optimizing a speech recognition system, and aims to solve the technical problems of low recognition accuracy and reliability of the existing speech recognition system.
In a first aspect, the present application provides a method for optimizing a speech recognition system, the method comprising:
acquiring a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classified recognition, so as to obtain a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system;
when the prediction label category is determined to be inaccurate according to the prediction loss value, acquiring an actual label category corresponding to the voice to be recognized, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data;
counting training data, and establishing a training set according to the counted training data;
and inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, and calculating a target loss function until the target loss function is converged to obtain the optimized voice recognition system.
In a second aspect, the present application further provides a speech recognition system optimization apparatus, including:
the prediction module is used for acquiring a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classified recognition, so as to obtain a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system;
the determining module is used for acquiring the actual label category corresponding to the voice to be recognized when the predicted label category is determined to be inaccurate according to the prediction loss value, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data;
the establishing module is used for counting the training data and establishing a training set according to the counted training data;
and the optimization module is used for inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, calculating a target loss function until the target loss function is converged, and obtaining the optimized voice recognition system.
In a third aspect, the present application further provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the speech recognition system optimization method as described above.
In a fourth aspect, the present application further provides a readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech recognition system optimization method as described above.
The application discloses a speech recognition system optimization method, apparatus, device and readable storage medium. The method includes: acquiring speech to be recognized and inputting it into a speech recognition system for classification recognition, obtaining the predicted tag category corresponding to the speech to be recognized through prediction by the tag prediction model of the speech recognition system, and obtaining the prediction loss value of the tag prediction model through prediction by the active learning loss prediction model of the speech recognition system; when the predicted tag category is determined to be inaccurate according to the prediction loss value, acquiring the actual tag category corresponding to the speech to be recognized and taking the speech to be recognized and its actual tag category as training data; then counting the training data, establishing a training set from the counted training data, inputting the established training set into the speech recognition system for optimization training, and calculating a target loss function until it converges, thereby obtaining the optimized speech recognition system. In this way, while the speech recognition system is in operation, the loss value predicted by the active learning loss prediction model is used to find the speech data that the speech recognition system is prone to misrecognize, and that data is used as training data for optimizing the system. This realizes efficient collection of training data, and reusing that data to optimize and retrain the speech recognition system broadens its recognition coverage and enables updating and upgrading of the system, thereby improving its recognition accuracy and reliability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for optimizing a speech recognition system according to an embodiment of the present application;
FIG. 2 is a block diagram of a speech recognition system according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an architecture of an audio feature extraction module according to an embodiment of the present application;
FIG. 4 is a block diagram of a single self-attention decoder according to an embodiment of the present application;
fig. 5 is a schematic diagram of an architecture of an active learning module according to an embodiment of the present application;
FIG. 6 is an exemplary diagram for calculating an objective loss function of a speech recognition system according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of an apparatus for optimizing a speech recognition system according to an embodiment of the present application;
fig. 8 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of the present application provide a method, an apparatus, a device and a computer-readable storage medium for optimizing a speech recognition system. The speech recognition system optimization method is mainly applied to speech recognition system optimization equipment, which may be a device with data processing capability such as a mobile terminal, a personal computer (PC), a portable computer or a server, and the active-learning-based speech recognition system is deployed on this equipment.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for optimizing a speech recognition system according to an embodiment of the present disclosure.
As shown in fig. 1, the speech recognition system optimization method includes steps S101 to S104.
Step S101, obtaining a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, obtaining a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtaining a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system.
The speech recognition system may be implemented as part of an application having speech recognition capabilities, such as a voice assistant or the like.
As shown in fig. 2, fig. 2 is a schematic structural diagram of the speech recognition system, which is a speech recognition model initially trained on a small amount of labeled speech data. It mainly includes two parts, a tag prediction model and an active learning loss prediction model, which operate in parallel. The tag prediction model is an end-to-end neural network used for classifying and recognizing the speech to be recognized so as to predict its tag category; the active learning loss prediction model is a lightweight neural network used for predicting the loss of the tag prediction model's result on the speech to be recognized, i.e. for judging the probability that the tag prediction model makes a correct prediction of the tag category corresponding to the speech.
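For illustration only, the parallel arrangement of the two models can be sketched in PyTorch as follows; the class names, shapes and the idea of returning the decoders' hidden states alongside the logits are assumptions made for exposition, not details taken from the filing:

    import torch
    import torch.nn as nn

    class SpeechRecognitionSystem(nn.Module):
        """Two parallel parts: a tag (label) predictor and a loss predictor."""

        def __init__(self, tag_predictor: nn.Module, loss_predictor: nn.Module):
            super().__init__()
            self.tag_predictor = tag_predictor    # end-to-end classifier
            self.loss_predictor = loss_predictor  # lightweight loss-prediction net

        def forward(self, speech: torch.Tensor):
            # The tag predictor returns class logits plus the hidden feature
            # vectors of its self-attention decoders; the loss predictor
            # consumes those hidden vectors and emits one loss value per sample.
            logits, hidden_states = self.tag_predictor(speech)
            predicted_loss = self.loss_predictor(hidden_states)
            return logits, predicted_loss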
Taking the application to a voice assistant as an example, when a user sends a voice instruction to the voice assistant, the voice assistant acquires the voice instruction, inputs the voice instruction as a voice to be recognized into a voice recognition system for classification recognition, obtains a prediction tag class corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtains a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system, wherein the prediction loss value is used for representing whether the prediction tag class corresponding to the voice to be recognized is accurate or not.
In an embodiment, the predicting the predicted tag category corresponding to the speech to be recognized through the tag prediction model of the speech recognition system includes: inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized; decoding the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain hidden characteristic vectors; performing linear transformation on the hidden feature vector to obtain a decoding vector; and performing softmax logistic regression calculation on the decoding vector to obtain a prediction tag category corresponding to the voice to be recognized and output by a tag prediction model of the voice recognition system.
Continuing to refer to fig. 2, the dashed box on the left of fig. 2 is an architecture diagram of the tag prediction model, which mainly includes an audio feature extraction module and a self-attention decoder module, the latter formed by stacking a plurality of self-attention decoders. When the speech to be recognized is input into the speech recognition system for classification recognition, feature extraction is first performed on the speech by the audio feature extraction module of the tag prediction model to obtain the features corresponding to the speech; the position coding information corresponding to these features is then supplemented, and the features together with their position coding information are decoded by the self-attention decoder module. During decoding, the output of the i-th self-attention decoder is the input of the (i+1)-th self-attention decoder, and the hidden feature vector output by the last self-attention decoder is taken as the final output of the module, expressed as Z = [z_1, z_2, ..., z_n]. The output of the self-attention decoder module is then linearly transformed to obtain a decoding vector, and softmax logistic regression maps Z = [z_1, z_2, ..., z_n] to the one-dimensional class space [l_1, l_2, ..., l_m]. Based on this processing, the tag prediction model can output the predicted tag category corresponding to the speech to be recognized. Table 1 shows the predicted tag categories output after common voice instructions are predicted by the tag prediction model:

TABLE 1 Common voice instructions and their predicted tag categories

(Table 1 is provided only as an image in the original publication.)
In an embodiment, the extracting the features of the speech to be recognized to obtain the features of the speech to be recognized specifically includes: pre-reinforcing the voice to be recognized by taking a frame as a unit, and performing fast Fourier transform on the pre-reinforced voice to be recognized; processing the voice to be recognized after the fast Fourier transform through a Log Mel spectrum filter to obtain a filtering output value; and sequentially carrying out linear transformation and layer standardization on the filtering output value to obtain the characteristics of the voice to be recognized.
As shown in fig. 3, fig. 3 is a schematic diagram of the architecture of the audio feature extraction module of the tag prediction model, which is used to extract the features corresponding to the speech to be recognized. When extracting features, pre-emphasis is first performed on the speech to be recognized frame by frame, in order to strengthen the high frequencies, remove the influence of lip radiation, and improve the high-frequency signal-to-noise ratio. The formula is as follows:

s'(x) = s(x) - k * s(x - 1)

where k is the pre-emphasis coefficient, k ∈ [0,1], x is the frame index, and s(x) is the speech signal of frame x.
Fast Fourier transform (FFT) is then performed on the pre-emphasized speech to be recognized. The FFT decomposes a complex sound wave into component waves of different frequencies; specifically, a discrete Fourier transform may be applied to the pre-emphasized speech, that is, an n-point FFT is computed for each frame to obtain its spectrum, where n may be 256 or 512.

It should be noted that, before the fast Fourier transform, the pre-emphasized speech to be recognized may be framed, i.e. the variable-length speech is cut into fixed-length segments. The frame length can be chosen as 20 ms, so that the speech signal within a frame can be regarded as stationary, while the frame shift is set to 10 ms, i.e. the time difference between adjacent segments is 10 ms, to avoid losing speech information at the frame boundaries.
After the fast Fourier transform, the pre-emphasized speech to be recognized is processed by a Log Mel spectrum filter to obtain the filter output values. The Log Mel spectrum filter, also called a Filter Bank, processes audio in a manner similar to the human ear, which improves speech recognition performance. After the transformed speech passes through the Log Mel spectrum filter, a two-dimensional array X = [x_1, x_2, ..., x_n] is finally output, where x_n is the n-th truncated frame segment and each element of the array is a k-dimensional filter output vector, k being the number of filters, which can be set flexibly according to actual needs, e.g. k = 40.
In order for the feature matrix output by the audio feature extraction module to match the input size of the self-attention decoder module, the output of the Log Mel spectrum filter is further subjected to a linear transformation and layer normalization, finally yielding the features of the speech to be recognized.
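A minimal sketch of this front end, assuming a 16 kHz sampling rate and using the frame length (20 ms), frame shift (10 ms), FFT size (512) and filter count (k = 40) quoted above; the function names and the triangular Mel filter design are illustrative assumptions:

    import numpy as np

    def mel_filter_bank(k, n_fft, sr):
        """k triangular filters spaced evenly on the Mel scale (assumed design)."""
        mel_max = 2595.0 * np.log10(1.0 + (sr / 2) / 700.0)
        hz_pts = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, k + 2) / 2595.0) - 1.0)
        bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
        fb = np.zeros((k, n_fft // 2 + 1))
        for m in range(1, k + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            for f in range(left, center):
                fb[m - 1, f] = (f - left) / max(center - left, 1)
            for f in range(center, right):
                fb[m - 1, f] = (right - f) / max(right - center, 1)
        return fb

    def extract_features(signal, sr=16000, k=40, n_fft=512, coeff=0.97):
        # 1) Pre-emphasis: s'(x) = s(x) - coeff * s(x - 1)
        emphasized = np.append(signal[0], signal[1:] - coeff * signal[:-1])

        # 2) Framing: 20 ms frames with a 10 ms shift
        frame_len, shift = int(0.020 * sr), int(0.010 * sr)
        n_frames = 1 + max(0, (len(emphasized) - frame_len) // shift)
        frames = np.stack([emphasized[i * shift:i * shift + frame_len]
                           for i in range(n_frames)])

        # 3) n-point FFT per frame -> power spectrum
        power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

        # 4) Log Mel spectrum filter (Filter Bank): X = [x1, ..., xn], xi in R^k
        log_mel = np.log(power @ mel_filter_bank(k, n_fft, sr).T + 1e-10)

        # The linear transformation and layer normalization that match the
        # self-attention decoder's input size live inside the model itself.
        return log_mel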
In an embodiment, the decoding the feature of the speech to be recognized and the position code corresponding to the feature of the speech to be recognized to obtain a hidden feature vector specifically includes: and performing multi-head attention calculation on the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain multi-head attention output, and performing feedforward calculation on the multi-head attention output to obtain a hidden feature vector.
As can be seen from the foregoing, the self-attention decoder module is formed by stacking N (N ≥ 2) self-attention decoders. As shown in fig. 4, fig. 4 is a schematic structural diagram of a single self-attention decoder. Each self-attention decoder includes two sublayers: the first is multi-head attention, and the second is a fully connected feedforward neural network (the simplest fully connected structure). In addition, each of the two sublayers uses a residual connection followed by layer normalization; the residual connection addresses the difficulty of training deep multi-layer neural networks by letting the network focus only on the residual part during training, and layer normalization accelerates model training and convergence.
It should be noted that, since the self-attention formula may cause the loss of the position information during the calculation, when the feature of the speech to be recognized output by the audio feature extraction module is input into the self-attention decoder module, the position coding information corresponding to the feature of the speech to be recognized is supplemented first. Thus, the input from the attention decoder module is the feature of the speech to be recognized and the position coding information corresponding to the feature, as shown in fig. 4, the position coding information corresponding to the feature corresponds to Q in fig. 4, and K, V corresponds to the feature of the speech to be recognized output by the audio feature extraction module.
After the features corresponding to the speech to be recognized and their position coding information are input into the self-attention decoder module, the first self-attention decoder performs multi-head attention calculation on those features and their position coding information in its multi-head attention layer, and the output of that layer is fed into the feedforward neural network layer for feedforward calculation, giving the output of the first self-attention decoder, i.e. a hidden feature vector. Each subsequent self-attention decoder performs multi-head attention calculation on the output of the previous decoder in its multi-head attention layer and likewise feeds the result through its feedforward neural network layer. The output of the last self-attention decoder is taken as the final output of the self-attention decoder module.
In the self-attention decoder, multi-head attention is the most important transformation mapping, and it is built from a basic attention mapping. Scaled dot-product attention (SDA) maps a query (Q), a key (K) and a value (V) to a weighted sum, expressed as follows:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

where the query Q and the key K have the same dimension d_k, and the value V has dimension d_v. To obtain multiple different linear mappings, multi-head attention is introduced: in multi-head attention, the basic attention functions are executed in parallel, each basic attention head outputs a d_v-dimensional result, and the head outputs are concatenated along the dimension axis. The formulas are as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O
head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)

where h is the number of basic attention heads, and W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v) and W^O ∈ R^(h·d_v × d_model) are parameter matrices.
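To ground the two-sublayer structure, here is a hedged PyTorch sketch of a single self-attention decoder and the stack of N decoders; the dimensions, head counts and layer counts are assumptions, and PyTorch's built-in multi-head attention stands in for the formulas above:

    import torch
    import torch.nn as nn

    class SelfAttentionDecoder(nn.Module):
        """One decoder: multi-head attention + feedforward network, each
        sublayer with a residual connection and layer normalization."""

        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                     nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, q, kv):
            # Per fig. 4: Q carries the position coding, K and V the features.
            attn_out, _ = self.mha(q, kv, kv)
            h = self.norm1(q + attn_out)        # residual + layer norm
            return self.norm2(h + self.ffn(h))  # residual + layer norm

    class DecoderStack(nn.Module):
        """N >= 2 decoders; the i-th decoder's output feeds the (i+1)-th."""

        def __init__(self, n_layers=4, d_model=256):
            super().__init__()
            self.layers = nn.ModuleList(SelfAttentionDecoder(d_model)
                                        for _ in range(n_layers))

        def forward(self, features, pos_code):
            hidden_states = []
            z = self.layers[0](pos_code, features)  # first decoder
            hidden_states.append(z)
            for layer in self.layers[1:]:
                z = layer(z, z)                     # attend over previous output
                hidden_states.append(z)
            return z, hidden_states                 # final Z plus per-layer outputs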
in an embodiment, the predicting loss value of the tag prediction model obtained through prediction by the active learning loss prediction model of the speech recognition system specifically includes: inputting the hidden feature vector into an active learning loss prediction model of the speech recognition system, and performing global pooling on the hidden feature vector to obtain a global pooled feature vector; performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector; carrying out nonlinear mapping on the fully connected feature vector through a ReLU linear rectification function to obtain feature mapping; and carrying out full-connection operation on the feature mapping to obtain a prediction loss value output by an active learning loss prediction model of the voice recognition system.
Referring back to fig. 2, the dashed box on the right of fig. 2 is an architecture diagram of the active learning loss prediction model, which is formed by stacking a plurality of active learning modules. As shown in fig. 5, fig. 5 is a schematic diagram of the architecture of one active learning module. The active learning loss prediction model takes the hidden feature vectors output by the self-attention decoders as input; each hidden feature vector is processed in turn by a global pooling layer, a fully connected layer and a ReLU linear rectification layer to obtain the output of an active learning module, and the outputs of the active learning modules finally pass through a fully connected layer to obtain the output of the active learning loss prediction model, i.e. the predicted loss value (see fig. 2 and fig. 5). This value represents the probability that the tag prediction model makes a correct prediction; in particular, a high loss value indicates that the current input is difficult data for the speech recognition system and that the tag prediction model may make a wrong decision.
Compared with the tag prediction model, the loss prediction model is a lightweight network and can make fast predictions. Meanwhile, to improve network utilization, the input of each active learning module is the output of one self-attention decoder; having multiple information sources enables the loss prediction model to select the useful information, and the global pooling layer maps inputs of different dimensions to a fixed information dimension.
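A minimal sketch of one active learning module and the final fully connected head follows; the hidden sizes and the use of mean pooling as the global pooling operation are assumptions:

    import torch
    import torch.nn as nn

    class ActiveLearningModule(nn.Module):
        """Global pooling -> fully connected layer -> ReLU."""

        def __init__(self, d_model=256, d_out=128):
            super().__init__()
            self.fc = nn.Linear(d_model, d_out)

        def forward(self, hidden):              # hidden: (batch, time, d_model)
            pooled = hidden.mean(dim=1)         # global pooling over time
            return torch.relu(self.fc(pooled))  # fully connected + ReLU

    class LossPredictionModel(nn.Module):
        """One module per self-attention decoder; a final fully connected
        layer maps the concatenated module outputs to the predicted loss."""

        def __init__(self, n_decoders=4, d_model=256, d_out=128):
            super().__init__()
            self.blocks = nn.ModuleList(ActiveLearningModule(d_model, d_out)
                                        for _ in range(n_decoders))
            self.head = nn.Linear(n_decoders * d_out, 1)

        def forward(self, hidden_states):       # one tensor per decoder
            feats = [blk(h) for blk, h in zip(self.blocks, hidden_states)]
            return self.head(torch.cat(feats, dim=1)).squeeze(-1)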
And S102, when the predicted label category is determined to be inaccurate according to the predicted loss value, acquiring an actual label category corresponding to the voice to be recognized, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data.
As can be seen from the foregoing, the prediction loss value output by the active learning loss prediction model indicates whether the predicted tag category that the tag prediction model outputs for the speech to be recognized is accurate. Therefore, after obtaining the predicted tag category output by the tag prediction model and the prediction loss value output by the active learning loss prediction model, whether the predicted tag category is accurate is determined according to the prediction loss value. Specifically, the prediction loss value is compared with a preset threshold; if the prediction loss value is greater than or equal to the preset threshold, the predicted tag category output by the tag prediction model is determined to be inaccurate. The preset threshold serves as the critical value for judging whether the predicted tag category is accurate and can be set flexibly according to the actual situation.
When the prediction loss value indicates that the predicted tag category corresponding to the speech to be recognized is inaccurate, the speech is difficult data for the speech recognition system, and the speech together with its actual tag category can be used to optimize and update the system; the actual tag category corresponding to the speech therefore needs to be acquired. Taking a voice assistant as an example: when a user issues a voice instruction, the voice assistant acquires it and inputs it into the speech recognition system as the speech to be recognized. If the loss prediction by the active learning loss prediction model yields a relatively high prediction loss value, indicating that the predicted tag category is inaccurate, a prompt asking the user to select the correct tag category can be generated and displayed, and tag category options related to the speech are loaded for the user to choose from; the system then receives the user's selection instruction and takes the tag category corresponding to it as the actual tag category of the speech to be recognized.
After the actual label category corresponding to the speech to be recognized is obtained, the speech to be recognized and the actual label category corresponding to the speech to be recognized can be used as training data, so that the training data can be accumulated while the speech recognition system executes a speech recognition task, and the training data can be used for further optimizing and training the speech recognition system.
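Operationally, this collection step reduces to a threshold comparison; the sketch below is illustrative, with the threshold value and the user-labeling callback being assumptions:

    collected = []  # (speech, actual tag category) pairs accumulated in service

    def maybe_collect(speech, predicted_loss, ask_user_for_label, threshold=0.5):
        """Keep a sample as training data when the loss predictor flags it.

        `ask_user_for_label` stands in for the prompt that lets the user
        pick the correct tag category (hypothetical callback)."""
        if predicted_loss >= threshold:  # predicted tag category deemed inaccurate
            actual_tag = ask_user_for_label(speech)
            collected.append((speech, actual_tag))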
In conclusion, while the speech recognition system is working, the active learning loss prediction model finds the speech data that the system is prone to misrecognize, and this data is used as training data for optimizing the speech recognition system. Training data is thus collected efficiently and does not need to be obtained by manual labeling, saving labor cost.
And S103, counting the training data, and establishing a training set according to the counted training data.
The training data may then be counted, for example periodically, such as every month, and a training set is established from the counted training data. Illustratively:

training set = {training data 1, training data 2, ..., training data B}
             = {(speech data x_1, actual tag category y_1), (speech data x_2, actual tag category y_2), ..., (speech data x_B, actual tag category y_B)}
And step S104, inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, and calculating a target loss function until the target loss function is converged to obtain the optimized voice recognition system.
In an embodiment, the inputting the training set into the speech recognition system to perform optimization training on the speech recognition system, and calculating an objective loss function specifically includes: inputting each training data in the training set into the speech recognition system, predicting through a label prediction model of the speech recognition system to obtain a prediction label category of the speech in the training data, and predicting through an active learning loss prediction model of the speech recognition system to obtain a prediction loss value aiming at the speech in the training data; and calculating a target loss function according to the actual label category and the predicted label category corresponding to the voice in the training data and the predicted loss value of the voice in the training data.
The established training set is input into the speech recognition system to train it. During training, for the speech data x in any training data, the predicted tag category ŷ is obtained through the tag prediction model and the predicted loss value l̂ is obtained through the active learning loss prediction model, and the target loss function of the speech recognition system is calculated by combining these with the actual tag category y.
In an embodiment, the calculating a target loss function according to the actual tag class and the predicted tag class corresponding to the speech in the training data and the predicted loss value for the speech in the training data specifically includes: calculating an actual loss value according to the actual label category and the predicted label category corresponding to the voice in the training data; calculating a loss between the actual loss value and the predicted loss value for speech in the training data; and constructing a target loss function according to the calculated loss and the actual loss value.
As can be seen from the foregoing, during training, for the speech data x in any training data, the predicted tag category ŷ is obtained through the tag prediction model and the predicted loss value l̂ is obtained through the active learning loss prediction model. Thus, based on the predicted tag category ŷ and the actual tag category y, the actual loss value l is calculated; then the loss L_loss(l̂, l) between the actual loss value l and the predicted loss value l̂ is calculated; and the two losses are combined to obtain the target loss of the speech recognition system, as shown in fig. 6.
Specifically, the difference between the predicted tag category and the actual tag category, i.e. the actual loss value l, can be calculated by the cross-entropy loss function; this difference is also used for the comparison training of the active learning loss prediction model. The cross-entropy loss function is as follows:

L_target(ŷ, y) = -Σ_k p_k * log(q_k)

where p_k represents the actual tag value and q_k represents the predicted tag value.
Next, the loss L_loss(l̂, l) between the actual loss value l and the predicted loss value l̂ is calculated. The simplest choice of loss function between l̂ and l would be the mean square error, but it is unsuitable in this training scenario for two reasons. First, the actual loss decreases as training proceeds, and since the tag prediction model is updated during training, the target of the active learning loss module keeps changing and cannot be fitted. Second, the purpose of the active learning loss prediction model is to reflect the relative magnitude of the loss between different data, without needing to match the actual loss exactly; in other words, what is wanted is a ranking rather than the actual loss value. The whole training process and the corresponding loss function are therefore adjusted. Specifically, the speech data in the counted training data are matched pairwise; for example, pairwise matching the speech data in B counted training data yields B/2 speech data pairs {x_p = (x_i, x_j)}. A training set formed of speech data pairs is then input into the speech recognition system, and the loss L_loss(l̂, l) between the actual loss value l and the predicted loss value l̂ is constructed by comparing the relative predicted-loss relation and the relative actual-loss relation of each pair. The loss function is as follows:

L_loss(l̂_p, l_p) = max(0, -1(l_i, l_j) * (l̂_i - l̂_j) + ξ)

1(l_i, l_j) = +1 if l_i > l_j, and -1 otherwise

where l̂ represents the predicted loss value output by the active learning loss module; l represents the actual loss value, calculated from the predicted tag category and the actual tag category; (l̂_i - l̂_j) represents the predicted loss difference of the speech data pair (x_i, x_j); 1(l_i, l_j) represents the actual loss magnitude relationship of the pair (x_i, x_j); and ξ is a preset positive hyperparameter.

To understand the above loss function: when l_i ≥ l_j, the loss is 0 only when l̂_i exceeds l̂_j by at least ξ, and non-zero in all other cases, which drives the model to increase l̂_i and decrease l̂_j.
Combining the two loss functions finally yields the target loss function used to update the speech recognition system, summarized as follows:

L = L_target(ŷ, y) + λ * L_loss(l̂, l)

where (x, y) is the speech data used as training data together with its corresponding actual tag category; ŷ is the predicted tag category output by the tag prediction model; L_target is the cross-entropy loss function; l̂ is the predicted loss value predicted by the active learning loss prediction model; l is the actual loss value; and λ is another preset positive hyperparameter.
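Assuming a batch already arranged as B/2 pairs (each x_i adjacent to its x_j), a sketch of this combined objective in PyTorch could read as follows; the pairing layout and hyperparameter defaults are assumptions:

    import torch
    import torch.nn.functional as F

    def target_loss(logits, labels, pred_loss, xi=1.0, lam=1.0):
        """L = L_target(y_hat, y) + lam * L_loss(l_hat, l) over a paired batch."""
        # Actual per-sample loss l via cross entropy (kept unreduced for pairing)
        l = F.cross_entropy(logits, labels, reduction="none")

        # Split the batch into pairs (x_i, x_j)
        l_i, l_j = l[0::2], l[1::2]
        lhat_i, lhat_j = pred_loss[0::2], pred_loss[1::2]

        # Ranking loss: max(0, -1(l_i, l_j) * (lhat_i - lhat_j) + xi); the sign
        # is detached so it only orders the pair and never backpropagates
        sign = torch.where(l_i > l_j, torch.ones_like(l_i),
                           -torch.ones_like(l_i)).detach()
        l_loss = torch.clamp(-sign * (lhat_i - lhat_j) + xi, min=0.0).mean()

        return l.mean() + lam * l_loss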
The speech recognition system is then optimized according to this target loss function until the target loss function converges, thereby obtaining the optimized speech recognition system.
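As a usage sketch (the optimizer choice and fixed epoch count standing in for a convergence test are assumptions), the optimization training loops over the paired training set using the target_loss sketch above:

    import torch

    def optimize(system, loader, epochs=10, lr=1e-4):
        """system returns (logits, predicted_loss); loader yields paired batches."""
        opt = torch.optim.Adam(system.parameters(), lr=lr)
        for _ in range(epochs):  # or: iterate until the target loss stops improving
            for speech, labels in loader:
                logits, pred_loss = system(speech)
                loss = target_loss(logits, labels, pred_loss)
                opt.zero_grad()
                loss.backward()
                opt.step()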
The speech recognition system optimization method provided by this embodiment includes: acquiring speech to be recognized and inputting it into the speech recognition system for classification recognition, obtaining the predicted tag category corresponding to the speech through prediction by the tag prediction model of the speech recognition system, and obtaining the prediction loss value of the tag prediction model through prediction by the active learning loss prediction model of the speech recognition system; when the predicted tag category is determined to be inaccurate according to the prediction loss value, acquiring the actual tag category corresponding to the speech and taking the speech and its actual tag category as training data; then counting the training data, establishing a training set from the counted training data, inputting the established training set into the speech recognition system for optimization training, and calculating the target loss function until it converges, obtaining the optimized speech recognition system. In this way, while the speech recognition system is in operation, the loss value predicted by the active learning loss prediction model is used to find the speech data that the system is prone to misrecognize, and that data serves as training data for optimizing the system. Training data is thus collected efficiently and then reused to optimize and retrain the speech recognition system, broadening its recognition coverage and enabling updating and upgrading of the system, thereby improving its recognition accuracy and reliability.
Referring to fig. 7, fig. 7 is a schematic block diagram of an optimization apparatus of a speech recognition system according to an embodiment of the present disclosure.
As shown in fig. 7, the speech recognition system optimizing apparatus 400 includes: a prediction module 401, a determination module 402, a setup module 403, and an optimization module 404.
The prediction module 401 is configured to acquire speech to be recognized, input it into the speech recognition system for classification recognition, obtain the predicted tag category corresponding to the speech through prediction by the tag prediction model of the speech recognition system, and obtain the prediction loss value of the tag prediction model through prediction by the active learning loss prediction model of the speech recognition system;
the determining module 402 is configured to, when the predicted tag category is determined to be inaccurate according to the prediction loss value, acquire the actual tag category corresponding to the speech to be recognized, and determine the speech to be recognized and its corresponding actual tag category as training data;
the establishing module 403 is configured to count training data and establish a training set according to the counted training data;
and the optimization module 404 is configured to input the training set into the speech recognition system for optimization training, and calculate a target loss function until the target loss function converges, obtaining the optimized speech recognition system.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules and units described above may refer to the corresponding processes in the foregoing embodiment of the speech recognition system optimization method, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram illustrating a structure of a computer device according to an embodiment of the present disclosure. The computer device may be a Personal Computer (PC), a server, or the like having a data processing function.
As shown in fig. 8, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the speech recognition system optimization methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the speech recognition system optimization methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classified recognition, so as to obtain a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system; when the prediction label category is determined to be inaccurate according to the prediction loss value, acquiring an actual label category corresponding to the voice to be recognized, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data; counting training data, and establishing a training set according to the counted training data; and inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, and calculating a target loss function until the target loss function is converged to obtain the optimized voice recognition system.
In some embodiments, the processor implements the predicting, by a tag prediction model of the speech recognition system, to obtain a predicted tag category corresponding to the speech to be recognized, including:
inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized;
decoding the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain hidden characteristic vectors;
performing linear transformation on the hidden feature vector to obtain a decoding vector;
and performing softmax logistic regression calculation on the decoding vector to obtain a prediction tag category corresponding to the voice to be recognized and output by a tag prediction model of the voice recognition system.
In some embodiments, the processor implements the predicting of the predicted loss value of the tag prediction model by the active learning loss prediction model of the speech recognition system, including:
inputting the hidden feature vector into an active learning loss prediction model of the speech recognition system, and performing global pooling on the hidden feature vector to obtain a global pooled feature vector;
performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector;
carrying out nonlinear mapping on the fully connected feature vector through a ReLU linear rectification function to obtain feature mapping;
and carrying out full-connection operation on the feature mapping to obtain a prediction loss value output by an active learning loss prediction model of the voice recognition system.
In some embodiments, the processor performs the optimal training of the speech recognition system by inputting the training set into the speech recognition system, and calculating an objective loss function, including:
inputting each training data in the training set into the speech recognition system, predicting through a label prediction model of the speech recognition system to obtain a prediction label category of the speech in the training data, and predicting through an active learning loss prediction model of the speech recognition system to obtain a prediction loss value aiming at the speech in the training data;
and calculating a target loss function according to the actual label category and the predicted label category corresponding to the voice in the training data and the predicted loss value of the voice in the training data.
In some embodiments, the processor implements the calculating an objective loss function according to an actual tag class and a predicted tag class corresponding to speech in the training data and the predicted loss value for speech in the training data, including:
calculating an actual loss value according to the actual label category and the predicted label category corresponding to the voice in the training data;
calculating a loss between the actual loss value and the predicted loss value for speech in the training data;
and constructing a target loss function according to the calculated loss and the actual loss value.
In some embodiments, the performing, by the processor, the feature extraction on the speech to be recognized to obtain the feature of the speech to be recognized includes:
pre-reinforcing the voice to be recognized by taking a frame as a unit, and performing fast Fourier transform on the pre-reinforced voice to be recognized;
processing the voice to be recognized after the fast Fourier transform through a Log Mel spectrum filter to obtain a filtering output value;
and sequentially carrying out linear transformation and layer standardization on the filtering output value to obtain the characteristics of the voice to be recognized.
In some embodiments, the decoding, by the processor, the feature of the speech to be recognized and the position code corresponding to the feature of the speech to be recognized to obtain a hidden feature vector includes:
performing multi-head attention calculation on the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain multi-head attention output;
and performing feedforward calculation on the multi-head attention output to obtain a hidden feature vector.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and a method implemented when the program instructions are executed may refer to the embodiments of the speech recognition system optimization method.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
Blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information on a batch of network transactions and is used to verify the validity (tamper resistance) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or system that comprises that element.
The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments. Although the present application has been described with reference to specific embodiments, its protection scope is not limited thereto; equivalent modifications or substitutions that those skilled in the art can readily conceive within the technical scope disclosed herein are covered as well. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for optimizing a speech recognition system, said method comprising the steps of:
acquiring speech to be recognized, and inputting the speech to be recognized into a speech recognition system for classification and recognition, so as to obtain, through a label prediction model of the speech recognition system, a predicted label category corresponding to the speech to be recognized, and to obtain, through an active learning loss prediction model of the speech recognition system, a predicted loss value of the label prediction model;
when the predicted label category is determined to be inaccurate according to the predicted loss value, acquiring an actual label category corresponding to the speech to be recognized, and taking the speech to be recognized and its corresponding actual label category as training data;
collecting the training data, and establishing a training set from the collected training data;
and inputting the training set into the speech recognition system to perform optimization training on the speech recognition system, and calculating a target loss function until the target loss function converges, so as to obtain the optimized speech recognition system.
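A minimal sketch of the selection loop in this claim, assuming a model that returns class logits together with a predicted loss value; `request_annotation` and the fixed `threshold` are hypothetical stand-ins for the annotation step and the inaccuracy test, which the claim does not pin down.

```python
import torch

def active_learning_round(model, unlabeled_pool, train_set, threshold):
    for speech in unlabeled_pool:
        with torch.no_grad():
            _, predicted_loss = model(speech.unsqueeze(0))
        # A high predicted loss suggests the predicted label category
        # is inaccurate, so the sample is worth labeling.
        if predicted_loss.item() > threshold:
            actual_label = request_annotation(speech)  # hypothetical oracle
            train_set.append((speech, actual_label))   # new training data
    return train_set  # retrain on this set until the target loss converges
```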
2. The method for optimizing a speech recognition system according to claim 1, wherein the obtaining, through the label prediction model of the speech recognition system, the predicted label category corresponding to the speech to be recognized comprises:
inputting the speech to be recognized into the label prediction model of the speech recognition system, performing feature extraction on the speech to be recognized to obtain features of the speech to be recognized, and adding position encodings corresponding to the features of the speech to be recognized;
decoding the features of the speech to be recognized and the position encodings corresponding to the features of the speech to be recognized to obtain a hidden feature vector;
performing a linear transformation on the hidden feature vector to obtain a decoding vector;
and performing a softmax logistic regression calculation on the decoding vector to obtain the predicted label category, corresponding to the speech to be recognized, output by the label prediction model of the speech recognition system.
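The last two steps of this claim form a small classification head. A sketch under assumed dimensions; mean-pooling the hidden feature vector over frames before the linear transformation is an added assumption for obtaining one utterance-level category:

```python
import torch
import torch.nn as nn

class LabelHead(nn.Module):
    def __init__(self, d_model=256, n_classes=10):  # sizes are assumptions
        super().__init__()
        self.proj = nn.Linear(d_model, n_classes)   # linear transformation

    def forward(self, hidden):                      # (batch, frames, d_model)
        decoding = self.proj(hidden.mean(dim=1))    # decoding vector
        probs = torch.softmax(decoding, dim=-1)     # softmax logistic regression
        return probs.argmax(dim=-1)                 # predicted label category
```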
3. The method for optimizing a speech recognition system according to claim 2, wherein the obtaining, through the active learning loss prediction model of the speech recognition system, the predicted loss value of the label prediction model comprises:
inputting the hidden feature vector into the active learning loss prediction model of the speech recognition system, and performing global pooling on the hidden feature vector to obtain a globally pooled feature vector;
performing a fully connected operation on the globally pooled feature vector to obtain a fully connected feature vector;
performing a non-linear mapping on the fully connected feature vector through a ReLU (rectified linear unit) activation function to obtain a feature mapping;
and performing a fully connected operation on the feature mapping to obtain the predicted loss value output by the active learning loss prediction model of the speech recognition system.
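A sketch of this module, assuming global average pooling and an arbitrary hidden width:

```python
import torch
import torch.nn as nn

class LossPredictionModule(nn.Module):
    def __init__(self, d_model=256, d_hidden=128):  # widths are assumptions
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)     # fully connected operation
        self.fc2 = nn.Linear(d_hidden, 1)           # second fully connected layer

    def forward(self, hidden):                      # (batch, frames, d_model)
        pooled = hidden.mean(dim=1)                 # global pooling
        mapped = torch.relu(self.fc1(pooled))       # ReLU non-linear mapping
        return self.fc2(mapped).squeeze(-1)         # predicted loss value
```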
4. The method for optimizing a speech recognition system according to claim 1, wherein the inputting the training set into the speech recognition system to perform optimization training on the speech recognition system and calculating a target loss function comprises:
inputting each item of training data in the training set into the speech recognition system, obtaining a predicted label category for the speech in the training data through the label prediction model of the speech recognition system, and obtaining a predicted loss value for the speech in the training data through the active learning loss prediction model of the speech recognition system;
and calculating the target loss function according to the actual label category and the predicted label category corresponding to the speech in the training data and the predicted loss value for the speech in the training data.
5. The method for optimizing a speech recognition system according to claim 4, wherein the calculating the target loss function according to the actual label category and the predicted label category corresponding to the speech in the training data and the predicted loss value for the speech in the training data comprises:
calculating an actual loss value according to the actual label category and the predicted label category corresponding to the speech in the training data;
calculating a loss between the actual loss value and the predicted loss value for the speech in the training data;
and constructing the target loss function according to the calculated loss and the actual loss value.
6. The method for optimizing a speech recognition system according to claim 2, wherein the performing feature extraction on the speech to be recognized to obtain the features of the speech to be recognized comprises:
pre-emphasizing the speech to be recognized frame by frame, and performing a fast Fourier transform on the pre-emphasized speech to be recognized;
processing the fast-Fourier-transformed speech to be recognized through a Log Mel spectrum filter to obtain a filter output value;
and sequentially performing a linear transformation and layer normalization on the filter output value to obtain the features of the speech to be recognized.
7. The method for optimizing a speech recognition system according to claim 2, wherein the decoding the features of the speech to be recognized and the position encodings corresponding to the features of the speech to be recognized to obtain the hidden feature vector comprises:
performing a multi-head attention calculation on the features of the speech to be recognized and the position encodings corresponding to the features of the speech to be recognized to obtain a multi-head attention output;
and performing a feedforward calculation on the multi-head attention output to obtain the hidden feature vector.
8. A speech recognition system optimization apparatus, comprising:
a prediction module, configured to acquire speech to be recognized and input the speech to be recognized into a speech recognition system for classification and recognition, so as to obtain, through a label prediction model of the speech recognition system, a predicted label category corresponding to the speech to be recognized, and to obtain, through an active learning loss prediction model of the speech recognition system, a predicted loss value of the label prediction model;
a determining module, configured to acquire, when the predicted label category is determined to be inaccurate according to the predicted loss value, an actual label category corresponding to the speech to be recognized, and to take the speech to be recognized and its corresponding actual label category as training data;
an establishing module, configured to collect the training data and establish a training set from the collected training data;
and an optimization module, configured to input the training set into the speech recognition system to perform optimization training on the speech recognition system, and to calculate a target loss function until the target loss function converges, so as to obtain the optimized speech recognition system.
9. A computer device, characterized in that the computer device comprises a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the speech recognition system optimization method according to any one of claims 1 to 7.
10. A readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech recognition system optimization method according to any one of claims 1 to 7.
CN202110467147.1A 2021-04-28 2021-04-28 Speech recognition system optimization method, device, equipment and readable storage medium Active CN113223502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467147.1A CN113223502B (en) 2021-04-28 2021-04-28 Speech recognition system optimization method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110467147.1A CN113223502B (en) 2021-04-28 2021-04-28 Speech recognition system optimization method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113223502A true CN113223502A (en) 2021-08-06
CN113223502B CN113223502B (en) 2024-01-30

Family

ID=77089633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467147.1A Active CN113223502B (en) 2021-04-28 2021-04-28 Speech recognition system optimization method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113223502B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription
US20180226066A1 (en) * 2016-10-21 2018-08-09 Microsoft Technology Licensing, Llc Simultaneous dialogue state management using frame tracking
US20190130903A1 (en) * 2017-10-27 2019-05-02 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
CN110838286A (en) * 2019-11-19 2020-02-25 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN110853617A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN111145728A (en) * 2019-12-05 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN112002306A (en) * 2020-08-26 2020-11-27 阳光保险集团股份有限公司 Voice category identification method and device, electronic equipment and readable storage medium
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment
CN112232480A (en) * 2020-09-15 2021-01-15 深圳力维智联技术有限公司 Method, system and storage medium for training neural network model
CN112528679A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Intention understanding model training method and device and intention understanding method and device
CN112700768A (en) * 2020-12-16 2021-04-23 科大讯飞股份有限公司 Speech recognition method, electronic device and storage device
CN112712797A (en) * 2020-12-29 2021-04-27 平安科技(深圳)有限公司 Voice recognition method and device, electronic equipment and readable storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160171974A1 (en) * 2014-12-15 2016-06-16 Baidu Usa Llc Systems and methods for speech transcription
CN107077842A (en) * 2014-12-15 2017-08-18 百度(美国)有限责任公司 System and method for phonetic transcription
US20180226066A1 (en) * 2016-10-21 2018-08-09 Microsoft Technology Licensing, Llc Simultaneous dialogue state management using frame tracking
US20190130903A1 (en) * 2017-10-27 2019-05-02 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
CN109741736A (en) * 2017-10-27 2019-05-10 百度(美国)有限责任公司 The system and method for carrying out robust speech identification using confrontation network is generated
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN110838286A (en) * 2019-11-19 2020-02-25 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN110853617A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN111145728A (en) * 2019-12-05 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN112002306A (en) * 2020-08-26 2020-11-27 阳光保险集团股份有限公司 Voice category identification method and device, electronic equipment and readable storage medium
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment
CN112232480A (en) * 2020-09-15 2021-01-15 深圳力维智联技术有限公司 Method, system and storage medium for training neural network model
CN112700768A (en) * 2020-12-16 2021-04-23 科大讯飞股份有限公司 Speech recognition method, electronic device and storage device
CN112528679A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Intention understanding model training method and device and intention understanding method and device
CN112712797A (en) * 2020-12-29 2021-04-27 平安科技(深圳)有限公司 Voice recognition method and device, electronic equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138160A (en) * 2021-08-27 2022-03-04 苏州探寻文化科技有限公司 Learning equipment interacting with user based on multiple modules
CN113555005A (en) * 2021-09-22 2021-10-26 北京世纪好未来教育科技有限公司 Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113223502B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
Zonoozi et al. Periodic-CRN: A convolutional recurrent model for crowd density prediction with recurring periodic patterns.
CN109587713B (en) Network index prediction method and device based on ARIMA model and storage medium
CN110704588A (en) Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network
CN109816221A (en) Decision of Project Risk method, apparatus, computer equipment and storage medium
CN113223502B (en) Speech recognition system optimization method, device, equipment and readable storage medium
CN110276382B (en) Crowd classification method, device and medium based on spectral clustering
CN112084752B (en) Sentence marking method, device, equipment and storage medium based on natural language
CN112634992A (en) Molecular property prediction method, training method of model thereof, and related device and equipment
CN112580346A (en) Event extraction method and device, computer equipment and storage medium
CN110020739B (en) Method, apparatus, electronic device and computer readable medium for data processing
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
US20210239479A1 (en) Predicted Destination by User Behavior Learning
CN116402630A (en) Financial risk prediction method and system based on characterization learning
CN117033657A (en) Information retrieval method and device
CN116684330A (en) Traffic prediction method, device, equipment and storage medium based on artificial intelligence
CN112597292B (en) Question reply recommendation method, device, computer equipment and storage medium
CN114360520A (en) Training method, device and equipment of voice classification model and storage medium
CN114357171A (en) Emergency event processing method and device, storage medium and electronic equipment
WO2021217866A1 (en) Method and apparatus for ai interview recognition, computer device and storage medium
CN116777646A (en) Artificial intelligence-based risk identification method, apparatus, device and storage medium
CN116542783A (en) Risk assessment method, device, equipment and storage medium based on artificial intelligence
CN116310770A (en) Underwater sound target identification method and system based on mel cepstrum and attention residual error network
CN115062769A (en) Knowledge distillation-based model training method, device, equipment and storage medium
CN115358473A (en) Power load prediction method and prediction system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant