CN113223502B - Speech recognition system optimization method, device, equipment and readable storage medium


Info

Publication number
CN113223502B
Authority
CN
China
Prior art keywords
voice
recognition system
recognized
predicted
loss
Prior art date
Legal status
Active
Application number
CN202110467147.1A
Other languages
Chinese (zh)
Other versions
CN113223502A (en)
Inventor
罗剑
王健宗
程宁
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110467147.1A priority Critical patent/CN113223502B/en
Publication of CN113223502A publication Critical patent/CN113223502A/en
Application granted granted Critical
Publication of CN113223502B publication Critical patent/CN113223502B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Abstract

The application belongs to the technical field of voice semantics and provides a voice recognition system optimization method, apparatus, device, and readable storage medium. The method comprises: obtaining voice to be recognized and inputting it into a voice recognition system for classification recognition; predicting the predicted label class corresponding to the voice to be recognized through a label prediction model of the voice recognition system, and predicting a predicted loss value of the label prediction model through an active learning loss prediction model of the voice recognition system; when the predicted label class is determined to be inaccurate according to the predicted loss value, acquiring the actual label class of the voice to be recognized and taking the voice to be recognized together with its actual label class as training data; collecting the training data to establish a training set; and performing optimization training on the voice recognition system with the training set, calculating a target loss function until the target loss function converges, to obtain the optimized voice recognition system. The method and apparatus can improve the recognition accuracy and reliability of the voice recognition system.

Description

Speech recognition system optimization method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of speech semantic technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for optimizing a speech recognition system.
Background
Deep-learning-based voice assistants, such as Xiaomi's XiaoAI, Apple's Siri, and Microsoft's Cortana, are widely used in daily life; people use them to check the weather, add reminders, set alarm clocks, and so on. However, because the voice recognition system configured in current voice assistants is trained on a limited amount of voice data labeled through inefficient manual work, the limitation of the training data leaves the voice recognition system with recognition blind spots. As a result, the voice assistant is prone to misrecognition in everyday use, its reliability is low, and the user experience is greatly reduced.
Disclosure of Invention
The main purpose of the present application is to provide a voice recognition system optimization method, apparatus, device, and readable storage medium, aiming to solve the technical problem that existing voice recognition systems have low recognition accuracy and reliability.
In a first aspect, the present application provides a method for optimizing a speech recognition system, the method comprising:
The method comprises the steps of obtaining voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, predicting a predicted tag class corresponding to the voice to be recognized through a tag prediction model of the voice recognition system, and predicting a predicted loss value of the tag prediction model through an active learning loss prediction model of the voice recognition system;
when the predicted tag class is determined to be inaccurate according to the predicted loss value, acquiring an actual tag class corresponding to the voice to be recognized, and determining the voice to be recognized and the actual tag class corresponding to the voice to be recognized as training data;
collecting training data, and establishing a training set according to the collected training data;
and inputting the training set into the voice recognition system to perform optimization training on the voice recognition system, and calculating a target loss function until the target loss function converges to obtain an optimized voice recognition system.
In a second aspect, the present application further provides a voice recognition system optimizing apparatus, the apparatus comprising:
the prediction module is used for acquiring the voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, predicting a predicted tag class corresponding to the voice to be recognized through a tag prediction model of the voice recognition system, and predicting a predicted loss value of the tag prediction model through an active learning loss prediction model of the voice recognition system;
The determining module is used for acquiring an actual tag class corresponding to the voice to be recognized and determining the voice to be recognized and the actual tag class corresponding to the voice to be recognized as training data when the predicted tag class is determined to be inaccurate according to the predicted loss value;
the building module is used for collecting training data and establishing a training set according to the collected training data;
and the optimizing module is used for inputting the training set into the voice recognition system to perform optimizing training on the voice recognition system, calculating a target loss function until the target loss function converges, and obtaining the optimized voice recognition system.
In a third aspect, the present application also provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program when executed by the processor implements the steps of the speech recognition system optimization method as described above.
In a fourth aspect, the present application further provides a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements a method for optimizing a speech recognition system as described above.
The application discloses a voice recognition system optimization method, apparatus, device, and readable storage medium. The method obtains voice to be recognized, inputs it into a voice recognition system for classification recognition, predicts the predicted label class corresponding to the voice to be recognized through a label prediction model of the voice recognition system, and predicts a predicted loss value of the label prediction model through an active learning loss prediction model of the voice recognition system. When the predicted label class is determined to be inaccurate according to the predicted loss value, the actual label class corresponding to the voice to be recognized is acquired, and the voice to be recognized together with its actual label class is taken as training data. The training data are then collected, a training set is established from them, the established training set is input into the voice recognition system for optimization training, and a target loss function is calculated until it converges, yielding the optimized voice recognition system. In this way, while the voice recognition system is in operation, the loss values predicted by the active learning loss prediction model single out the voice data that the system is prone to misrecognize, and these data serve as training data for optimizing the system. This enables efficient acquisition of training data; optimizing and training the system with such data broadens its recognition coverage, keeps the system updated and upgraded, and improves its recognition accuracy and reliability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a method for optimizing a speech recognition system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a speech recognition system according to an embodiment of the present disclosure;
fig. 3 is a schematic architecture diagram of an audio feature extraction module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a single self-attention decoder architecture provided in an embodiment of the present application;
fig. 5 is a schematic architecture diagram of an active learning module according to an embodiment of the present application;
FIG. 6 is an exemplary diagram of calculating a target loss function for a speech recognition system according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of a speech recognition system optimizing apparatus according to an embodiment of the present application;
Fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Embodiments of the present application provide a method, apparatus, device, and computer-readable storage medium for optimizing a speech recognition system. The voice recognition system optimization method is mainly applied to voice recognition system optimization equipment, which may be a mobile terminal, a personal computer (PC), a portable computer, a server, or another device with data processing capability.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flow chart of a method for optimizing a speech recognition system according to an embodiment of the present application.
As shown in fig. 1, the voice recognition system optimizing method includes steps S101 to S104.
Step S101, obtaining voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, predicting a predicted tag class corresponding to the voice to be recognized through a tag prediction model of the voice recognition system, and predicting a predicted loss value of the tag prediction model through an active learning loss prediction model of the voice recognition system.
The speech recognition system may be implemented as part of an application program having voice recognition functionality, such as a voice assistant.
As shown in fig. 2, fig. 2 is a schematic diagram of the speech recognition system. The system is a speech recognition model that has completed initial training on a small amount of labeled voice data and mainly includes two parts, a label prediction model and an active learning loss prediction model, arranged in parallel. The label prediction model is an end-to-end neural network used to classify and recognize the voice to be recognized so as to predict its label class; the active learning loss prediction model is a lightweight neural network used to predict the loss of the label prediction model's result on the voice to be recognized, that is, to judge the probability that the label prediction model correctly predicts the label class corresponding to the voice to be recognized.
Taking a voice assistant as an example: when a user issues a voice command, the voice assistant obtains the command, inputs it into the voice recognition system as the voice to be recognized for classification recognition, predicts the predicted label class corresponding to the voice through the label prediction model of the voice recognition system, and predicts the predicted loss value of the label prediction model through the active learning loss prediction model of the voice recognition system, where the predicted loss value indicates whether the predicted label class corresponding to the voice to be recognized is accurate.
In an embodiment, the predicting, by the tag prediction model of the speech recognition system, the predicted tag class corresponding to the speech to be recognized is specifically: inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized; decoding the characteristics of the voice to be recognized and the position codes corresponding to the characteristics of the voice to be recognized to obtain hidden characteristic vectors; performing linear transformation on the hidden characteristic vector to obtain a decoding vector; and performing softmax logistic regression calculation on the decoding vector to obtain a predicted tag class corresponding to the voice to be recognized, which is output by a tag prediction model of the voice recognition system.
With continued reference to fig. 2, the left dashed box of fig. 2 shows the architecture of the label prediction model, which mainly includes an audio feature extraction module and a self-attention decoder module; the self-attention decoder module is formed by stacking a plurality of self-attention decoders. When the voice to be recognized is input into the voice recognition system for classification recognition, features are first extracted from it by the audio feature extraction module of the label prediction model. The position codes corresponding to these features are then supplemented, and the features together with their position codes are decoded by the self-attention decoder module, where the output of the i-th self-attention decoder is the input of the (i+1)-th, and the hidden feature vector output by the last self-attention decoder, denoted Z = [z_1, z_2, ..., z_n], is taken as the final output of the module. The output Z is then linearly transformed to obtain a decoding vector, and softmax logistic regression is applied to the decoding vector, mapping Z = [z_1, z_2, ..., z_n] to L = [l_1, l_2, ..., l_m] in a one-dimensional class space. Based on this processing, the label prediction model outputs the predicted label class corresponding to the voice to be recognized. Table 1 shows the predicted label classes output after common voice commands are predicted by the label prediction model:
Table 1 Common voice commands and their predicted label classes
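For concreteness, the following is a minimal PyTorch sketch of this label prediction pipeline. It is an illustration only: the layer sizes, the learned position codes, the use of nn.TransformerEncoderLayer as a stand-in for one self-attention decoder, and the mean-pooling used to obtain a single hidden feature vector are assumptions, not details fixed by this disclosure.

    import torch
    import torch.nn as nn

    class LabelPredictionModel(nn.Module):
        """Sketch: features -> position codes -> stacked self-attention
        decoders -> linear transform -> softmax over label classes."""

        def __init__(self, feat_extractor, d_model=256, n_decoders=4,
                     n_heads=4, num_classes=10, max_len=1000):
            super().__init__()
            self.feat_extractor = feat_extractor   # audio feature extraction module
            # Learned position codes (a simplifying assumption).
            self.pos_enc = nn.Parameter(torch.zeros(max_len, d_model))
            # One TransformerEncoderLayer stands in for one self-attention decoder
            # (multi-head attention + feedforward, residual + layer norm).
            self.decoders = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                 for _ in range(n_decoders)])
            self.out_proj = nn.Linear(d_model, num_classes)

        def forward(self, waveform):
            x = self.feat_extractor(waveform)      # (batch, frames, d_model)
            x = x + self.pos_enc[: x.size(1)]      # supplement position codes
            hidden = []                            # keep each decoder output for the loss predictor
            for dec in self.decoders:              # output of decoder i feeds decoder i+1
                x = dec(x)
                hidden.append(x)
            z = x.mean(dim=1)                      # hidden feature vector Z (mean over frames, assumed)
            probs = torch.softmax(self.out_proj(z), dim=-1)  # linear -> softmax -> class probabilities
            return probs, hidden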
In an embodiment, the feature extraction of the voice to be recognized to obtain features of the voice to be recognized specifically includes: pre-emphasizing the voice to be recognized frame by frame, and performing a fast Fourier transform on the pre-emphasized voice; processing the transformed voice through a Log Mel spectrum filter to obtain a filter output value; and sequentially applying a linear transformation and layer normalization to the filter output value to obtain the features of the voice to be recognized.
As shown in fig. 3, fig. 3 is a schematic architecture diagram of the audio feature extraction module of the label prediction model, which is used to extract the features corresponding to the voice to be recognized. When the audio feature extraction module performs feature extraction, pre-emphasis is first applied to the voice to be recognized frame by frame, boosting its high-frequency part to remove the influence of lip radiation and improve the high-frequency signal-to-noise ratio. The formula is as follows:
s′(x) = s(x) − k·s(x−1)
where k is the pre-emphasis coefficient, k ∈ [0, 1], x is the frame index, and s(x) is the speech signal of frame x.
The pre-emphasized voice to be recognized is then subjected to a fast Fourier transform (FFT). The FFT decomposes a complex sound wave into sound waves of various frequencies; specifically, a discrete Fourier transform can be applied to the pre-emphasized voice, that is, an n-point FFT is performed on each frame to compute its spectrum, where n can be 256 or 512.
It should be noted that, before the FFT, the pre-emphasized voice may be framed, i.e., the voice of indefinite length is cut into segments of fixed length. The frame length may be chosen as 20 ms, within which the speech signal can be regarded as stationary, while the frame shift is set to 10 ms, i.e., consecutive segments overlap with a time difference of 10 ms, to avoid losing speech information at frame boundaries.
After the pre-emphasized voice is fast-Fourier-transformed, it is processed by a Log Mel spectrum filter to obtain the filter output value. The Log Mel spectrum filter, also called a FilterBank, processes audio in a manner similar to the human ear, so as to improve speech recognition performance. After passing through the Log Mel spectrum filter, the transformed voice finally yields a two-dimensional array X = [x_1, x_2, ..., x_n], where x_n is the n-th frame segment and each element of the array satisfies x_i ∈ R^k, where k is the number of filters and can be set flexibly according to practical needs, for example k = 40.
In order to match the size of the feature matrix output by the audio feature extraction module to the input size of the self-attention decoder module, the output of the Log Mel spectrum filter is further subjected to a linear transformation and layer normalization, finally yielding the features of the voice to be recognized.
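A minimal NumPy sketch of this feature extraction pipeline is given below, assuming a 16 kHz mono signal at least one frame long and using librosa only to build the Mel filterbank; the final linear projection to the decoder input size is omitted.

    import numpy as np
    import librosa

    def extract_features(signal, sr=16000, k=0.97, n_fft=512, n_mels=40,
                         frame_ms=20, shift_ms=10):
        """Sketch: pre-emphasis -> 20 ms framing with 10 ms shift ->
        n-point FFT -> Log Mel filterbank -> layer normalization."""
        # Pre-emphasis: s'(x) = s(x) - k * s(x - 1)
        emphasized = np.append(signal[0], signal[1:] - k * signal[:-1])

        # Cut the signal of indefinite length into fixed-length frames.
        flen, fshift = sr * frame_ms // 1000, sr * shift_ms // 1000
        n_frames = 1 + max(0, (len(emphasized) - flen) // fshift)
        frames = np.stack([emphasized[i * fshift:i * fshift + flen]
                           for i in range(n_frames)])

        # n-point FFT per frame, then the power spectrum.
        power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft

        # Log Mel filterbank with k = n_mels filters -> X = [x_1, ..., x_n].
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        feats = np.log(power @ mel_fb.T + 1e-10)

        # Layer normalization per frame (the linear projection to the
        # decoder input size is omitted here).
        mean = feats.mean(axis=1, keepdims=True)
        std = feats.std(axis=1, keepdims=True)
        return (feats - mean) / (std + 1e-5)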
In an embodiment, the decoding of the features of the voice to be recognized and their corresponding position codes to obtain the hidden feature vector specifically includes: performing multi-head attention calculation on the features of the voice to be recognized and the position codes corresponding to the features to obtain a multi-head attention output, and performing feedforward calculation on the multi-head attention output to obtain the hidden feature vector.
As stated above, the self-attention decoder module is composed of N (N ≥ 2) stacked self-attention decoders. As shown in fig. 4, fig. 4 is a schematic diagram of a single self-attention decoder. Each self-attention decoder contains two sub-layers: the first is multi-head attention, and the second is a fully connected feedforward neural network (the simplest fully connected structure). In addition, each of the two sub-layers is wrapped in a residual connection followed by layer normalization. The residual connection addresses the difficulty of training multi-layer neural networks by letting the network focus only on the residual part during training, while layer normalization accelerates model training and thus convergence.
It should be noted that, since the self-attention formula may cause loss of position information during calculation, when the features of the speech to be recognized output by the audio feature extraction module are input to the self-attention decoder module, the position coding information corresponding to the features of the speech to be recognized is first supplemented. Thus, the input of the self-attention decoder module is the feature of the voice to be recognized and the position coding information corresponding to the feature, as shown in fig. 4, the position coding information corresponding to the feature corresponds to Q in fig. 4, and K, V corresponds to the feature of the voice to be recognized output by the audio feature extraction module.
After the features corresponding to the voice to be recognized and their position coding information are input into the self-attention decoder module, the first self-attention decoder performs multi-head attention calculation on the features and their position coding information in the multi-head attention layer to obtain that layer's output, and then feeds this output into the feedforward neural network layer for feedforward calculation, producing the output of the first self-attention decoder, namely a hidden feature vector. Each subsequent self-attention decoder likewise performs multi-head attention calculation on its input in the multi-head attention layer and feeds the result into its feedforward neural network layer for feedforward calculation to produce its output. The output of the last self-attention decoder in the module is taken as the final output of the self-attention decoder module.
In the self-attention decoder, multi-head attention is the most important transformation map, and it is built from a basic attention map. The scaled dot-product attention (SDA) formula maps a query Q, a key K, and a value V to a weighted sum:

Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V

where the query Q and the key K have the same dimension d_k and the value V has dimension d_v. To obtain multiple different linear mappings, multi-head attention is introduced: the basic attention functions are executed in parallel, the output of each basic attention head is computed, and the final output is obtained by concatenating the heads along the feature dimension:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)·W^O
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

where h is the number of basic attention heads and W_i^Q, W_i^K, W_i^V, and W^O are parameter matrices.
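The two formulas translate directly into code. The sketch below assumes the projection matrices W_q, W_k, W_v (one per head) and W_o are supplied externally; in practice a library implementation such as PyTorch's nn.MultiheadAttention would be used instead.

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V

    def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
        """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O, where
        head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
        heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
                 for i in range(len(W_q))]      # h parallel basic attention maps
        return torch.cat(heads, dim=-1) @ W_o   # concatenate, then project with W_O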
In an embodiment, the predicting a predicted loss value of the tag prediction model by the active learning loss prediction model of the speech recognition system specifically includes: inputting the hidden feature vector into an active learning loss prediction model of the voice recognition system, and carrying out global pooling on the hidden feature vector to obtain a global pooled feature vector; performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector; nonlinear mapping is carried out on the fully connected feature vector through a ReLU linear rectification function, so that feature mapping is obtained; and performing full-connection operation on the feature map to obtain a predicted loss value output by an active learning loss prediction model of the voice recognition system.
With continued reference to fig. 2, the right dashed box of fig. 2 shows the architecture of the active learning loss prediction model, which is formed by stacking a plurality of active learning modules. As shown in fig. 5, fig. 5 is a schematic architecture diagram of the active learning module. The active learning loss prediction model takes the hidden feature vector output by a self-attention decoder as input, processes it sequentially through a global pooling layer, a fully connected layer, and a ReLU linear rectification layer to obtain the output of the active learning module, and finally passes the module outputs through a fully connected layer to obtain the output of the active learning loss prediction model, namely the predicted loss value (shown in figures 2 and 5). This value represents the probability that the label prediction model's prediction is correct; in particular, a high loss value indicates that the current input is difficult data for the speech recognition system and the label prediction model may make an erroneous decision.
Compared with the label prediction model, the loss prediction module is a lightweight network and can make quick predictions. Meanwhile, to improve network utilization, the input of each active learning module is the output of the corresponding self-attention decoder; inputs from multiple information sources allow the loss prediction module to select useful information, and the global pooling layer maps information of different dimensions to a fixed dimension.
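The following PyTorch sketch illustrates such a lightweight loss prediction network, with one active learning module per decoder output; the use of average pooling and the hidden dimension are assumptions for illustration, not details fixed by this disclosure.

    import torch
    import torch.nn as nn

    class LossPredictionModel(nn.Module):
        """Sketch: one active learning module (global pooling -> fully
        connected -> ReLU) per self-attention decoder output, then a final
        fully connected layer producing the scalar predicted loss value."""

        def __init__(self, decoder_dims, hidden_dim=128):
            super().__init__()
            self.fcs = nn.ModuleList([nn.Linear(d, hidden_dim) for d in decoder_dims])
            self.out = nn.Linear(hidden_dim * len(decoder_dims), 1)

        def forward(self, decoder_outputs):
            # decoder_outputs: one (batch, frames, dim) tensor per decoder.
            feats = []
            for x, fc in zip(decoder_outputs, self.fcs):
                pooled = x.mean(dim=1)                # global (average) pooling, assumed
                feats.append(torch.relu(fc(pooled))) # full connection + ReLU mapping
            return self.out(torch.cat(feats, dim=-1)).squeeze(-1)  # predicted loss value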
Step S102, when the predicted tag class is determined to be inaccurate according to the predicted loss value, acquiring an actual tag class corresponding to the voice to be recognized, and determining the voice to be recognized and the actual tag class corresponding to the voice to be recognized as training data.
As described above, the predicted loss value output by the active learning loss prediction model indicates whether the predicted label class output by the label prediction model for the voice to be recognized is accurate. Therefore, after the predicted label class and the predicted loss value are obtained, whether the predicted label class is accurate is determined according to the predicted loss value. Specifically, the predicted loss value may be compared with a preset threshold; if the predicted loss value is greater than or equal to the preset threshold, the predicted label class output by the label prediction model can be determined to be inaccurate. The preset threshold serves as the critical value for judging whether the predicted label class is accurate and can be set flexibly according to practical needs.
When the predicted loss value indicates that the predicted label class corresponding to the voice to be recognized is inaccurate, the voice to be recognized is difficult data for the voice recognition system, so the voice to be recognized and its actual label class are used to optimize and update the system; the actual label class corresponding to the voice to be recognized therefore needs to be obtained. Taking a voice assistant as an example: when a user issues a voice command, the voice assistant obtains the command and inputs it into the voice recognition system as the voice to be recognized, and the active learning loss prediction model performs loss prediction. If the resulting predicted loss value is relatively high, meaning the predicted label class is inaccurate, the assistant can generate and display a prompt asking the user to select the correct label class, load the label class options related to the voice to be recognized for the user to choose from, receive the user's selection command, and take the label class corresponding to that selection as the actual label class of the voice to be recognized.
After the actual label class corresponding to the voice to be recognized is obtained, the voice together with its actual label class can be used as training data, so that the voice recognition system accumulates training data for further optimization training while performing its recognition task, as sketched below.
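Putting the pieces of step S102 together, a hypothetical sketch of the decision logic might look as follows; LOSS_THRESHOLD, the helper ask_user_for_label, and the batch-of-one convention are illustrative assumptions, not part of this disclosure.

    # Hypothetical glue code for step S102.
    LOSS_THRESHOLD = 0.5
    training_pool = []

    def handle_utterance(label_model, loss_model, waveform):
        probs, hidden = label_model(waveform)        # predicted label class probabilities
        predicted_loss = loss_model(hidden)          # predicted loss value
        # .item() assumes a batch of one utterance.
        if predicted_loss.item() >= LOSS_THRESHOLD:  # prediction deemed inaccurate
            # Prompt the user to pick the correct label class (hypothetical helper).
            actual_label = ask_user_for_label(waveform)
            training_pool.append((waveform, actual_label))  # accumulate training data
        return probs.argmax(dim=-1)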
In summary, while the voice recognition system is in operation, the active learning loss prediction model singles out the voice data that the system is prone to misrecognize, and these data serve as training data for optimizing the system. This enables efficient acquisition of training data without manual labeling, saving labor cost.
Step S103, the training data are collected, and a training set is established according to the collected training data.
The training data may be collected periodically, for example once a month, and the training set is then built from the collected training data. As an example:

training set = {training data 1, training data 2, ..., training data B}
= {(voice data x_1, actual label class y_1), (voice data x_2, actual label class y_2), ..., (voice data x_B, actual label class y_B)}
Step S104, inputting the training set into the voice recognition system to perform optimization training on the voice recognition system, and calculating a target loss function until the target loss function converges to obtain an optimized voice recognition system.
In an embodiment, the training set is input into the speech recognition system to perform optimization training on the speech recognition system, and a target loss function is calculated, specifically: inputting each training data in the training set into the voice recognition system, predicting the predicted label class of the voice in the training data through the label prediction model of the voice recognition system, and predicting a predicted loss value for the voice in the training data through the active learning loss prediction model of the voice recognition system; and calculating the target loss function according to the actual label class and the predicted label class corresponding to the voice in the training data, together with the predicted loss value for that voice.
The established training set is input into the voice recognition system to train it. During the training process, for the voice data x in any training data, the predicted label class ŷ can be obtained through the label prediction model and the predicted loss value l̂ through the active learning loss prediction model, and the target loss function of the voice recognition system is calculated by combining these with the actual label class y.
In an embodiment, the calculating a target loss function according to the actual label class and the predicted label class corresponding to the voice in the training data and the predicted loss value of the voice in the training data specifically includes: calculating an actual loss value according to the actual label class and the predicted label class corresponding to the voice in the training data; calculating a loss between the actual loss value and the predicted loss value for speech in the training data; and constructing an objective loss function according to the calculated loss and the actual loss value.
As stated above, during training, for the voice data x in any training data, the predicted label class ŷ is obtained through the label prediction model and the predicted loss value l̂ is obtained through the active learning loss prediction model. Thus, the actual loss value l can be calculated from the predicted label class ŷ and the actual label class y; the loss between the actual loss value l and the predicted loss value l̂ is then calculated, and the target loss of the voice recognition system is obtained by combining these two losses, as shown in fig. 6.
Specifically, the difference between the predicted label class and the actual label class, i.e., the actual loss value, can be calculated through the cross entropy loss function; this difference serves as the comparison target for training the active learning loss prediction model. The cross entropy loss function is as follows:

L_target = −Σ_{k=1}^{n} p_k·log(q_k)

where p_k represents the actual label value, q_k represents the predicted label value, and k ∈ (1, …, n).
Then, the loss between the actual loss value l and the predicted loss value l̂ is calculated. The simplest loss function between l and l̂ would be the mean square error, but it is unsuitable in this training scenario for two reasons. First, the actual loss decreases over the course of training, and the label prediction model itself is updated during training, so the targets of the active learning loss module keep changing and cannot be fitted. Second, the goal of the active learning loss prediction model is to reflect the relative magnitude of the loss between different data; it need not correspond exactly to the actual loss. In other words, what is wanted is a ranking rather than the actual loss value. The whole training process and the corresponding loss function are therefore adjusted. Specifically, the voice data in the collected training data are paired two by two; for example, pairing the voice data in B training data yields B/2 voice data pairs {x_p = (x_i, x_j)}. A training set formed from these voice data pairs is then input into the voice recognition system, and the loss between the actual loss value l and the predicted loss value l̂ is constructed by comparing the relative predicted-loss relation and the relative actual-loss relation of each voice data pair. The loss function is as follows:

L_loss(l̂_p, l_p) = max(0, −1(l_i, l_j)·(l̂_i − l̂_j) + ξ)

where 1(l_i, l_j) = +1 if l_i > l_j, and −1 otherwise;

l̂ represents the predicted loss value output by the active learning loss module;

l represents the actual loss value, calculated from the predicted label class and the actual label class;

(l̂_i − l̂_j) represents the prediction loss difference of the voice data pair (x_i, x_j);

(l_i, l_j) represents the actual loss magnitude relation of the voice data pair (x_i, x_j);

and ξ is a preset positive hyperparameter.
To understand the above loss function: when l_i ≥ l_j, the loss is 0 only if l̂_i exceeds l̂_j by at least the margin ξ; otherwise a positive loss is incurred, which drives the model to increase l̂_i and decrease l̂_j.
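A small worked example, assuming a margin ξ = 1 (written xi below), makes this behavior concrete:

    import torch

    def pairwise_ranking_loss(lhat_i, lhat_j, l_i, l_j, xi=1.0):
        """L_loss = max(0, -1(l_i, l_j) * (lhat_i - lhat_j) + xi),
        where 1(l_i, l_j) is +1 if l_i > l_j, else -1."""
        sign = 1.0 if l_i > l_j else -1.0
        return torch.clamp(-sign * (lhat_i - lhat_j) + xi, min=0.0)

    # Sample i is actually harder (l_i > l_j): the loss is zero only when the
    # predicted losses keep the same order with a margin of at least xi.
    print(pairwise_ranking_loss(torch.tensor(2.5), torch.tensor(1.0), 3.0, 0.5))  # tensor(0.)
    print(pairwise_ranking_loss(torch.tensor(1.0), torch.tensor(2.5), 3.0, 0.5))  # tensor(2.5000)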
Combining the two loss functions finally gives the target loss function used to update the voice recognition system, summarized as follows:

L = (1/B) Σ_{(x,y)∈B} L_target(ŷ, y) + λ·(2/B) Σ_{(x_p,y_p)∈B_S} L_loss(l̂_p, l_p)

where (x, y) is the voice data serving as training data and its corresponding actual label class;

ŷ is the predicted label class output by the label prediction model;

B is the batch, representing the batch of training data used for each round of optimization training;

x_p is the pairwise pairing of the voice data in batch B, x_p = (x_i, x_j);

y_p is the actual label class pairing corresponding to x_p, y_p = (y_i, y_j);

B_S is the data set after pairing of the batch B;

L_target is the cross entropy loss function;

l̂ is the predicted loss value predicted by the active learning loss prediction model;

l is the actual loss value;

and λ is another preset positive hyperparameter.
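Under the assumptions that pairs are formed by randomly permuting an even-sized batch and that averaging over the B/2 pairs realizes the 2/B factor, a PyTorch sketch of this target loss is:

    import torch
    import torch.nn.functional as F

    def target_loss(logits, labels, pred_losses, xi=1.0, lam=1.0):
        """Sketch of the combined objective: mean cross entropy L_target over
        the batch plus lam times the pairwise ranking loss L_loss over the
        B/2 pairs formed from the batch (assumes an even batch size B)."""
        # Per-sample actual loss l via cross entropy between prediction and label.
        ce = F.cross_entropy(logits, labels, reduction='none')

        # Pair the batch two by two (random pairing is an assumption).
        perm = torch.randperm(len(labels))
        i, j = perm[0::2], perm[1::2]

        # Relative actual-loss relation; detached so it acts as a fixed target.
        sign = torch.sign(ce[i] - ce[j]).detach()
        ranking = torch.clamp(-sign * (pred_losses[i] - pred_losses[j]) + xi, min=0.0)

        # Mean over B/2 pairs corresponds to the 2/B scaling in the formula.
        return ce.mean() + lam * ranking.mean()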
Optimization training is then performed on the voice recognition system according to the target loss function until the target loss function converges, yielding the optimized voice recognition system.
According to the voice recognition system optimization method above, the voice to be recognized is obtained and input into the voice recognition system for classification recognition; the predicted label class corresponding to the voice to be recognized is obtained through the label prediction model of the voice recognition system, and the predicted loss value of the label prediction model is obtained through the active learning loss prediction model of the voice recognition system. When the predicted label class is determined to be inaccurate according to the predicted loss value, the actual label class corresponding to the voice to be recognized is acquired, and the voice to be recognized together with its actual label class is taken as training data. The training data are then collected, a training set is established from them, the established training set is input into the voice recognition system for optimization training, and a target loss function is calculated until it converges, yielding the optimized voice recognition system. In this way, while the voice recognition system is in operation, the loss values predicted by the active learning loss prediction model single out the voice data that the system is prone to misrecognize, and these data serve as training data for optimizing the system. This enables efficient acquisition of training data; optimizing and training the system with such data broadens its recognition coverage, keeps it updated and upgraded, and improves its recognition accuracy and reliability.
Referring to fig. 7, fig. 7 is a schematic block diagram of a voice recognition system optimizing apparatus according to an embodiment of the present application.
As shown in fig. 7, the voice recognition system optimizing apparatus 400 includes: a prediction module 401, a determination module 402, an establishment module 403, and an optimization module 404.
The prediction module 401 is configured to obtain the voice to be recognized, input it into the voice recognition system for classification recognition, predict the predicted label class corresponding to the voice to be recognized through the label prediction model of the voice recognition system, and predict the predicted loss value of the label prediction model through the active learning loss prediction model of the voice recognition system;
the determination module 402 is configured to, when the predicted label class is determined to be inaccurate according to the predicted loss value, obtain the actual label class corresponding to the voice to be recognized and determine the voice to be recognized and its actual label class as training data;
the establishment module 403 is configured to collect training data and establish a training set according to the collected training data;
and the optimization module 404 is configured to input the training set into the voice recognition system to perform optimization training on it, and calculate the target loss function until the target loss function converges, obtaining the optimized voice recognition system.
It should be noted that, for convenience and brevity of description, specific working processes of the above-described apparatus and modules and units may refer to corresponding processes in the foregoing embodiments of the method for optimizing a speech recognition system, which are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a personal computer (personal computer, PC), a server, or the like having a data processing function.
As shown in fig. 8, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform any of a number of speech recognition system optimization methods.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by a processor, causes the processor to perform any of the speech recognition system optimization methods.
The network interface is used for network communication, such as transmitting assigned tasks. It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and does not limit the computer device to which the present application may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
the method comprises the steps of obtaining voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, predicting a predicted tag class corresponding to the voice to be recognized through a tag prediction model of the voice recognition system, and predicting a predicted loss value of the tag prediction model through an active learning loss prediction model of the voice recognition system; when the predicted tag class is determined to be inaccurate according to the predicted loss value, acquiring an actual tag class corresponding to the voice to be recognized, and determining the voice to be recognized and the actual tag class corresponding to the voice to be recognized as training data; counting training data, and establishing a training set according to the counted training data; and inputting the training set into the voice recognition system to perform optimization training on the voice recognition system, and calculating a target loss function until the target loss function converges to obtain an optimized voice recognition system.
In some embodiments, the processor implements the predicting, by using a tag prediction model of the speech recognition system, the predicted tag class corresponding to the speech to be recognized, including:
Inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized;
decoding the characteristics of the voice to be recognized and the position codes corresponding to the characteristics of the voice to be recognized to obtain hidden characteristic vectors;
performing linear transformation on the hidden characteristic vector to obtain a decoding vector;
and performing softmax logistic regression calculation on the decoding vector to obtain a predicted tag class corresponding to the voice to be recognized, which is output by a tag prediction model of the voice recognition system.
In some embodiments, the processor implements the prediction loss value of the tag prediction model predicted by the active learning loss prediction model of the speech recognition system, including:
inputting the hidden feature vector into an active learning loss prediction model of the voice recognition system, and carrying out global pooling on the hidden feature vector to obtain a global pooled feature vector;
performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector;
Nonlinear mapping is carried out on the fully connected feature vector through a ReLU linear rectification function, so that feature mapping is obtained;
and performing full-connection operation on the feature map to obtain a predicted loss value output by an active learning loss prediction model of the voice recognition system.
In some embodiments, the processor implements the inputting of the training set into the speech recognition system to perform optimization training on the speech recognition system, calculating a target loss function, comprising:
inputting each training data in the training set into the voice recognition system, predicting a predicted label category of the voice in the training data through a label prediction model of the voice recognition system, and predicting a predicted loss value for the voice in the training data through an active learning loss prediction model of the voice recognition system;
and calculating a target loss function according to the actual label class and the predicted label class corresponding to the voice in the training data and the predicted loss value aiming at the voice in the training data.
In some embodiments, the processor implements the calculating of a target loss function from the actual label class and the predicted label class corresponding to the speech in the training data, and the predicted loss value for the speech in the training data, including:
Calculating an actual loss value according to the actual label class and the predicted label class corresponding to the voice in the training data;
calculating a loss between the actual loss value and the predicted loss value for speech in the training data;
and constructing a target loss function according to the calculated loss and the actual loss value.
In some embodiments, the processor implements the feature extraction of the speech to be recognized to obtain features of the speech to be recognized, including:
pre-strengthening the voice to be recognized by taking a frame as a unit, and performing fast Fourier transform on the pre-strengthened voice to be recognized;
processing the voice to be recognized after the fast Fourier transform through a Log Mel spectrum filter to obtain a filtering output value;
and sequentially carrying out linear transformation and layer standardization on the filtering output value to obtain the characteristics of the voice to be recognized.
In some embodiments, the decoding the feature of the speech to be recognized and the position code corresponding to the feature of the speech to be recognized to obtain the hidden feature vector includes:
performing multi-head attention calculation on the characteristics of the voice to be recognized and the position codes corresponding to the characteristics of the voice to be recognized to obtain multi-head attention output;
And performing feedforward calculation on the multi-head attention output to obtain hidden characteristic vectors.
Embodiments of the present application also provide a computer readable storage medium, where a computer program is stored, where the computer program includes program instructions, and a method implemented when the program instructions are executed may refer to various embodiments of the method for optimizing a speech recognition system of the present application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments. While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method for optimizing a speech recognition system, the method comprising the steps of:
the method comprises the steps of obtaining voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, predicting a predicted tag class corresponding to the voice to be recognized through a tag prediction model of the voice recognition system, and predicting a predicted loss value of the tag prediction model through an active learning loss prediction model of the voice recognition system;
when the predicted tag class is determined to be inaccurate according to the predicted loss value, acquiring an actual tag class corresponding to the voice to be recognized, and determining the voice to be recognized and the actual tag class corresponding to the voice to be recognized as training data;
counting the training data, performing optimization training on the voice recognition system according to the counted training data, and calculating a target loss function until the target loss function converges, to obtain the optimized voice recognition system;
wherein the performing optimization training on the voice recognition system according to the counted training data and calculating the target loss function until the target loss function converges, to obtain the optimized voice recognition system, comprises the following steps:
predicting the predicted label category of the voice in the counted training data through the label prediction model of the voice recognition system;
predicting the predicted loss value of the voice in the counted training data through the active learning loss prediction model of the voice recognition system;
calculating an actual loss value according to the actual label category and the predicted label category of the voice in the counted training data, $l = \mathcal{L}(y, \hat{y})$, where $l$ represents the actual loss value, $y$ represents the actual label category, and $\hat{y}$ represents the predicted label category;
pairwise pairing the voices in the counted training data to obtain voice data pairs, and inputting the voice data pairs into the voice recognition system for training, so as to construct the loss between the actual loss value and the predicted loss value:
$$L_{loss}\bigl(\hat{l}^{p}, l^{p}\bigr) = \max\Bigl(0,\, -\mathbb{1}(l_i, l_j)\cdot\bigl(\hat{l}_i - \hat{l}_j\bigr) + \xi\Bigr), \qquad \mathbb{1}(l_i, l_j) = \begin{cases} +1, & l_i > l_j \\ -1, & \text{otherwise,} \end{cases}$$
where $\hat{l}_i - \hat{l}_j$ represents the predicted loss difference of the voice data pair $(x_i, x_j)$, $\mathbb{1}(l_i, l_j)$ represents the relation of the actual losses of the voice data pair $(x_i, x_j)$, $\xi$ is a preset first positive hyperparameter, $\hat{l}^{p} = (\hat{l}_i, \hat{l}_j)$ represents the predicted loss values of the voice data pair $(x_i, x_j)$, and $l^{p} = (l_i, l_j)$ represents the actual loss values of the voice data pair $(x_i, x_j)$;
constructing the target loss function based on the calculated actual loss values and the constructed loss:
$$L = \frac{1}{|B|} \sum_{(x, y) \in B} \mathcal{L}(y, \hat{y}) + \lambda \cdot \frac{2}{|B|} \sum_{(x^{p}, y^{p}) \in B^{S}} L_{loss}\bigl(\hat{l}^{p}, l^{p}\bigr),$$
where $(x, y)$ represents a voice in the training data and its corresponding actual label category, $B$ represents the training data batch used for the optimization training, $x^{p} = (x_i, x_j)$ represents a voice data pair, $y^{p} = (y_i, y_j)$ represents the actual label category pair corresponding to the voice data pair, $B^{S}$ represents the data set after pairing of the batch $B$, and $\lambda$ is a preset second positive hyperparameter.
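As a purely editorial illustration (not part of the claims), the sketch below assembles the target loss function above in PyTorch, assuming the label prediction model emits class logits, the active learning loss prediction model emits one scalar predicted loss per utterance, cross-entropy is used for $\mathcal{L}$, and pairs are formed by splitting the batch in half; all of these choices and the hyperparameter values are assumptions.

import torch
import torch.nn.functional as F

def objective_loss(logits, labels, predicted_losses, lam=1.0, xi=1.0):
    # Actual per-utterance losses l = L(y, y_hat) of the label prediction model.
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    target_term = per_sample.mean()  # first term, averaged over the batch B

    # Pair the utterances (first half of the batch against the second half).
    # The actual losses are detached so that only their ordering supervises
    # the loss prediction model.
    half = per_sample.size(0) // 2
    l_i = per_sample[:half].detach()
    l_j = per_sample[half:2 * half].detach()
    lh_i = predicted_losses[:half]
    lh_j = predicted_losses[half:2 * half]
    sign = (l_i > l_j).float() * 2.0 - 1.0  # +1 if l_i > l_j, else -1
    pair_term = F.relu(-sign * (lh_i - lh_j) + xi).mean()  # margin loss

    return target_term + lam * pair_term

logits = torch.randn(8, 10)             # 8 utterances, 10 label categories
labels = torch.randint(0, 10, (8,))
predicted_losses = torch.randn(8)
loss = objective_loss(logits, labels, predicted_losses)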
2. The method for optimizing a speech recognition system according to claim 1, wherein the predicting, by a label prediction model of the speech recognition system, the predicted label class corresponding to the speech to be recognized includes:
inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized;
decoding the characteristics of the voice to be recognized and the position codes corresponding to the characteristics of the voice to be recognized to obtain hidden characteristic vectors;
performing linear transformation on the hidden characteristic vector to obtain a decoding vector;
and performing softmax logistic regression calculation on the decoding vector to obtain a predicted tag class corresponding to the voice to be recognized, which is output by a tag prediction model of the voice recognition system.
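As a purely editorial illustration (not part of the claims), the sketch below renders the pipeline of claim 2 in PyTorch, using a Transformer encoder stack for the decoding step; the model dimension, layer and head counts, and the mean-pooling over frames before the linear transformation are assumptions.

import torch
import torch.nn as nn

class LabelPredictor(nn.Module):
    def __init__(self, d_model=256, n_classes=10, max_len=2000):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # position codes
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, n_classes)       # linear transformation

    def forward(self, feats):  # feats: (batch, frames, d_model)
        pos = torch.arange(feats.size(1), device=feats.device)
        hidden = self.decoder(feats + self.pos_emb(pos))  # hidden feature vectors
        decoding_vec = self.out(hidden.mean(dim=1))       # decoding vector
        return decoding_vec.softmax(dim=-1), hidden       # softmax over classes

Returning the hidden feature vectors alongside the class probabilities is convenient because, per claim 3, the same hidden feature vectors feed the active learning loss prediction model.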
3. The method for optimizing a speech recognition system according to claim 2, wherein the predicting the predicted loss value of the tag prediction model by the active learning loss prediction model of the speech recognition system comprises:
inputting the hidden feature vector into an active learning loss prediction model of the voice recognition system, and carrying out global pooling on the hidden feature vector to obtain a global pooled feature vector;
performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector;
performing nonlinear mapping on the fully connected feature vector through a ReLU (rectified linear unit) function to obtain a feature map;
and performing full-connection operation on the feature map to obtain a predicted loss value output by an active learning loss prediction model of the voice recognition system.
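As a purely editorial illustration (not part of the claims), the sketch below renders the head of claim 3, assuming average pooling for the global pooling step and hypothetical layer widths.

import torch
import torch.nn as nn

class LossPredictor(nn.Module):
    def __init__(self, d_model=256, d_hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)  # first full-connection operation
        self.fc2 = nn.Linear(d_hidden, 1)        # second full-connection operation

    def forward(self, hidden):                   # hidden: (batch, frames, d_model)
        pooled = hidden.mean(dim=1)              # global (average) pooling
        mapped = torch.relu(self.fc1(pooled))    # ReLU feature mapping
        return self.fc2(mapped).squeeze(-1)      # predicted loss value, (batch,)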
4. The method for optimizing a speech recognition system according to claim 2, wherein the feature extraction of the speech to be recognized to obtain the feature of the speech to be recognized includes:
pre-emphasizing the voice to be recognized in units of frames, and performing a fast Fourier transform on the pre-emphasized voice to be recognized;
processing the voice to be recognized after the fast Fourier transform through a Log Mel spectrum filter to obtain a filtering output value;
and sequentially carrying out linear transformation and layer standardization on the filtering output value to obtain the characteristics of the voice to be recognized.
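As a purely editorial illustration (not part of the claims), the sketch below renders the front end of claim 4 using librosa for the framing, FFT, and Mel filter bank; the sample rate, FFT size, hop length, Mel-band count, and pre-emphasis coefficient are assumptions, and the pre-emphasis is applied to the whole waveform for brevity rather than strictly frame by frame.

import numpy as np
import librosa

def extract_features(wave, sr=16000, n_fft=400, hop=160, n_mels=80, alpha=0.97):
    # Pre-emphasis of the waveform.
    emphasized = np.append(wave[0], wave[1:] - alpha * wave[:-1])
    # Frame-wise fast Fourier transform (magnitude-squared spectrogram).
    power = np.abs(librosa.stft(emphasized, n_fft=n_fft, hop_length=hop)) ** 2
    # Log Mel spectrum filtering -> filtering output values.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(mel_fb @ power + 1e-10)
    return log_mel.T  # (frames, n_mels)

The claimed linear transformation and layer standardization would then map each frame's filter outputs to the model dimension, for example an nn.Linear followed by an nn.LayerNorm.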
5. The method for optimizing a speech recognition system according to claim 2, wherein decoding the feature of the speech to be recognized and the position code corresponding to the feature of the speech to be recognized to obtain the hidden feature vector comprises:
performing multi-head attention calculation on the characteristics of the voice to be recognized and the position codes corresponding to the characteristics of the voice to be recognized to obtain multi-head attention output;
and performing feedforward calculation on the multi-head attention output to obtain hidden characteristic vectors.
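As a purely editorial illustration (not part of the claims), the sketch below renders the decoding block of claim 5 as self-attention over the position-coded features followed by a feed-forward computation; the residual connections and layer normalizations are assumptions (the claim names only the multi-head attention and feed-forward steps).

import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: features + position codes
        attn_out, _ = self.attn(x, x, x)   # multi-head attention output
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))  # hidden feature vectors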
6. A speech recognition system optimizing apparatus, characterized in that the speech recognition system optimizing apparatus comprises:
The prediction module is used for acquiring the voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, predicting a predicted tag class corresponding to the voice to be recognized through a tag prediction model of the voice recognition system, and predicting a predicted loss value of the tag prediction model through an active learning loss prediction model of the voice recognition system;
the determining module is used for acquiring an actual tag class corresponding to the voice to be recognized and determining the voice to be recognized and the actual tag class corresponding to the voice to be recognized as training data when the predicted tag class is determined to be inaccurate according to the predicted loss value;
the optimizing module is used for counting training data, performing optimization training on the voice recognition system according to the counted training data, and calculating a target loss function until the target loss function converges, to obtain an optimized voice recognition system;
the optimization module is specifically configured to predict, through the label prediction model of the voice recognition system, the predicted label category of the voice in the counted training data;
predict, through the active learning loss prediction model of the voice recognition system, the predicted loss value of the voice in the counted training data;
calculate an actual loss value according to the actual label category and the predicted label category of the voice in the counted training data, $l = \mathcal{L}(y, \hat{y})$, where $l$ represents the actual loss value, $y$ represents the actual label category, and $\hat{y}$ represents the predicted label category;
pair the voices in the counted training data pairwise to obtain voice data pairs, and input the voice data pairs into the voice recognition system for training, so as to construct the loss between the actual loss value and the predicted loss value:
$$L_{loss}\bigl(\hat{l}^{p}, l^{p}\bigr) = \max\Bigl(0,\, -\mathbb{1}(l_i, l_j)\cdot\bigl(\hat{l}_i - \hat{l}_j\bigr) + \xi\Bigr), \qquad \mathbb{1}(l_i, l_j) = \begin{cases} +1, & l_i > l_j \\ -1, & \text{otherwise,} \end{cases}$$
where $\hat{l}_i - \hat{l}_j$ represents the predicted loss difference of the voice data pair $(x_i, x_j)$, $\mathbb{1}(l_i, l_j)$ represents the relation of the actual losses of the voice data pair $(x_i, x_j)$, $\xi$ is a preset first positive hyperparameter, $\hat{l}^{p} = (\hat{l}_i, \hat{l}_j)$ represents the predicted loss values of the voice data pair $(x_i, x_j)$, and $l^{p} = (l_i, l_j)$ represents the actual loss values of the voice data pair $(x_i, x_j)$;
and construct the target loss function based on the calculated actual loss values and the constructed loss:
$$L = \frac{1}{|B|} \sum_{(x, y) \in B} \mathcal{L}(y, \hat{y}) + \lambda \cdot \frac{2}{|B|} \sum_{(x^{p}, y^{p}) \in B^{S}} L_{loss}\bigl(\hat{l}^{p}, l^{p}\bigr),$$
where $(x, y)$ represents a voice in the training data and its corresponding actual label category, $B$ represents the training data batch used for the optimization training, $x^{p} = (x_i, x_j)$ represents a voice data pair, $y^{p} = (y_i, y_j)$ represents the actual label category pair corresponding to the voice data pair, $B^{S}$ represents the data set after pairing of the batch $B$, and $\lambda$ is a preset second positive hyperparameter.
7. A computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program when executed by the processor implements the steps of the speech recognition system optimization method according to any one of claims 1 to 5.
8. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the speech recognition system optimization method according to any one of claims 1 to 5.
CN202110467147.1A 2021-04-28 2021-04-28 Speech recognition system optimization method, device, equipment and readable storage medium Active CN113223502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467147.1A CN113223502B (en) 2021-04-28 2021-04-28 Speech recognition system optimization method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113223502A CN113223502A (en) 2021-08-06
CN113223502B true CN113223502B (en) 2024-01-30

Family

ID=77089633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467147.1A Active CN113223502B (en) 2021-04-28 2021-04-28 Speech recognition system optimization method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113223502B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138160A (en) * 2021-08-27 2022-03-04 苏州探寻文化科技有限公司 Learning equipment interacting with user based on multiple modules
CN113555005B (en) * 2021-09-22 2021-12-28 北京世纪好未来教育科技有限公司 Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107077842A (en) * 2014-12-15 2017-08-18 百度(美国)有限责任公司 System and method for phonetic transcription
CN109741736A (en) * 2017-10-27 2019-05-10 百度(美国)有限责任公司 The system and method for carrying out robust speech identification using confrontation network is generated
CN110838286A (en) * 2019-11-19 2020-02-25 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN110853617A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Model training method, language identification method, device and equipment
CN111145728A (en) * 2019-12-05 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN112002306A (en) * 2020-08-26 2020-11-27 阳光保险集团股份有限公司 Voice category identification method and device, electronic equipment and readable storage medium
CN112185352A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Voice recognition method and device and electronic equipment
CN112232480A (en) * 2020-09-15 2021-01-15 深圳力维智联技术有限公司 Method, system and storage medium for training neural network model
CN112528679A (en) * 2020-12-17 2021-03-19 科大讯飞股份有限公司 Intention understanding model training method and device and intention understanding method and device
CN112700768A (en) * 2020-12-16 2021-04-23 科大讯飞股份有限公司 Speech recognition method, electronic device and storage device
CN112712797A (en) * 2020-12-29 2021-04-27 平安科技(深圳)有限公司 Voice recognition method and device, electronic equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10431202B2 (en) * 2016-10-21 2019-10-01 Microsoft Technology Licensing, Llc Simultaneous dialogue state management using frame tracking

Also Published As

Publication number Publication date
CN113223502A (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN110704588A (en) Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN113223502B (en) Speech recognition system optimization method, device, equipment and readable storage medium
CN112396613A (en) Image segmentation method and device, computer equipment and storage medium
CN112084752B (en) Sentence marking method, device, equipment and storage medium based on natural language
CN112101437A (en) Fine-grained classification model processing method based on image detection and related equipment thereof
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN111767697B (en) Text processing method and device, computer equipment and storage medium
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
US20210239479A1 (en) Predicted Destination by User Behavior Learning
CN116933124A (en) Time series data prediction method, device, equipment and storage medium
CN113283222B (en) Automatic report generation method and device, computer equipment and storage medium
CN111694936B (en) Method, device, computer equipment and storage medium for identification of AI intelligent interview
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN116777646A (en) Artificial intelligence-based risk identification method, apparatus, device and storage medium
CN116563034A (en) Purchase prediction method, device, equipment and storage medium based on artificial intelligence
CN113362852A (en) User attribute identification method and device
US11895004B2 (en) Systems and methods for heuristics-based link prediction in multiplex networks
CN113688762B (en) Face recognition method, device, equipment and medium based on deep learning
CN113555005B (en) Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
CN112417886A (en) Intention entity information extraction method and device, computer equipment and storage medium
CN114358023A (en) Intelligent question-answer recall method and device, computer equipment and storage medium
CN113689863B (en) Voiceprint feature extraction method, voiceprint feature extraction device, voiceprint feature extraction equipment and storage medium
CN113160795B (en) Language feature extraction model training method, device, equipment and storage medium
CN116708313B (en) Flow detection method, flow detection device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant