CN113223502A - Speech recognition system optimization method, device, equipment and readable storage medium - Google Patents
- Publication number
- CN113223502A (application number CN202110467147.1A)
- Authority
- CN
- China
- Prior art keywords
- voice
- recognized
- recognition system
- speech
- prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application belongs to the technical field of speech semantics and provides a method, an apparatus, a device and a readable storage medium for optimizing a speech recognition system. The method comprises the following steps: acquiring a speech to be recognized and inputting it into a speech recognition system for classification and recognition, where a label prediction model of the speech recognition system predicts the predicted label category corresponding to the speech to be recognized, and an active learning loss prediction model of the speech recognition system predicts the prediction loss value of the label prediction model; when the predicted label category is determined to be inaccurate according to the prediction loss value, acquiring the actual label category of the speech to be recognized, and taking the speech to be recognized together with its actual label category as training data; collecting the training data to establish a training set; and performing optimization training on the speech recognition system with the training set, calculating a target loss function until the target loss function converges, to obtain the optimized speech recognition system. The method and apparatus can improve the recognition accuracy and reliability of the speech recognition system.
Description
Technical Field
The present application relates to the field of speech semantic technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for optimizing a speech recognition system.
Background
Deep-learning-based voice assistants, such as Xiaomi's voice assistant, Apple's Siri and Microsoft's Cortana, are widely used in people's daily life: users rely on them to check the weather, add reminders, set alarms and so on. However, the speech recognition system behind a current voice assistant is trained on a limited amount of manually and inefficiently labeled speech data. Because of this limitation of the training data, the system has recognition blind spots, so recognition errors easily occur in daily use, reliability is low, and the user experience is greatly degraded.
Disclosure of Invention
The present application mainly aims to provide a method, an apparatus, a device and a readable storage medium for optimizing a speech recognition system, and aims to solve the technical problems of low recognition accuracy and reliability of the existing speech recognition system.
In a first aspect, the present application provides a method for optimizing a speech recognition system, the method comprising:
acquiring a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classified recognition, so as to obtain a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system;
when the prediction label category is determined to be inaccurate according to the prediction loss value, acquiring an actual label category corresponding to the voice to be recognized, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data;
counting training data, and establishing a training set according to the counted training data;
and inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, and calculating a target loss function until the target loss function is converged to obtain the optimized voice recognition system.
In a second aspect, the present application further provides a speech recognition system optimization apparatus, including:
the prediction module is used for acquiring a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classified recognition, so as to obtain a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system;
the determining module is used for acquiring the actual label category corresponding to the voice to be recognized when the predicted label category is determined to be inaccurate according to the prediction loss value, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data;
the establishing module is used for counting the training data and establishing a training set according to the counted training data;
and the optimization module is used for inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, calculating a target loss function until the target loss function is converged, and obtaining the optimized voice recognition system.
In a third aspect, the present application further provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the speech recognition system optimization method as described above.
In a fourth aspect, the present application further provides a readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the speech recognition system optimization method as described above.
The application discloses a voice recognition system optimization method, a device, equipment and a readable storage medium, wherein the voice recognition system optimization method comprises the steps of obtaining voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, obtaining a prediction label category corresponding to the voice to be recognized through label prediction model prediction of the voice recognition system, and obtaining a prediction loss value of a label prediction model through active learning loss prediction model prediction of the voice recognition system; when the prediction label category is determined to be inaccurate according to the prediction loss value, acquiring an actual label category corresponding to the voice to be recognized, and taking the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data; then, training data are counted, a training set is established according to the counted training data, the established training set is input into the voice recognition system to carry out optimization training on the voice recognition system, a target loss function is calculated until the target loss function is converged, and the optimized voice recognition system is obtained. 
Therefore, while the speech recognition system is in service, the loss value predicted by the active learning loss prediction model is used to find speech data that the speech recognition system easily misrecognizes. This speech data is taken as training data for optimizing the speech recognition system, realizing efficient collection of training data; the training data is then reused to optimize and retrain the speech recognition system, which broadens its recognition coverage and realizes updating and upgrading of the system, thereby improving its recognition accuracy and reliability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a method for optimizing a speech recognition system according to an embodiment of the present application;
FIG. 2 is a block diagram of a speech recognition system according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating an architecture of an audio feature extraction module according to an embodiment of the present application;
FIG. 4 is a block diagram of a single self-attention decoder according to an embodiment of the present application;
fig. 5 is a schematic diagram of an architecture of an active learning module according to an embodiment of the present application;
FIG. 6 is an exemplary diagram for calculating an objective loss function of a speech recognition system according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of an apparatus for optimizing a speech recognition system according to an embodiment of the present application;
fig. 8 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a method, a device and equipment for optimizing a voice recognition system and a computer-readable storage medium. The voice recognition system optimization method is mainly applied to voice recognition system optimization equipment, and can be equipment with a data processing function, such as a mobile terminal, a Personal Computer (PC), a portable computer, a server and the like, wherein the voice recognition system is loaded on the voice recognition system optimization equipment based on active learning.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for optimizing a speech recognition system according to an embodiment of the present disclosure.
As shown in fig. 1, the speech recognition system optimization method includes steps S101 to S104.
Step S101, obtaining a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classification recognition, obtaining a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtaining a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system.
The speech recognition system may be implemented as part of an application having speech recognition capabilities, such as a voice assistant.
As shown in fig. 2, fig. 2 is a schematic structural diagram of a speech recognition system, which is a speech recognition model that completes initial training by using small-magnitude labeled speech data, and mainly includes two parts, namely a label prediction model and an active learning loss prediction model, where the label prediction model and the active learning loss prediction model belong to a parallel relationship. The label prediction model is an end-to-end (end-to-end) neural network and is used for classifying and recognizing the speech to be recognized so as to predict the label category of the speech to be recognized; the active learning loss prediction model is a lightweight neural network and is used for predicting the loss of the prediction result of the to-be-recognized voice of the label prediction model, namely judging the probability that the label prediction model makes correct prediction on the label category corresponding to the to-be-recognized voice.
Taking the application to a voice assistant as an example, when a user sends a voice instruction to the voice assistant, the voice assistant acquires the voice instruction, inputs the voice instruction as a voice to be recognized into a voice recognition system for classification recognition, obtains a prediction tag class corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtains a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system, wherein the prediction loss value is used for representing whether the prediction tag class corresponding to the voice to be recognized is accurate or not.
In an embodiment, the predicting the predicted tag category corresponding to the speech to be recognized through the tag prediction model of the speech recognition system includes: inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized; decoding the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain hidden characteristic vectors; performing linear transformation on the hidden feature vector to obtain a decoding vector; and performing softmax logistic regression calculation on the decoding vector to obtain a prediction tag category corresponding to the voice to be recognized and output by a tag prediction model of the voice recognition system.
Continuing to refer to fig. 2, the dashed box on the left of fig. 2 is an architecture diagram of the label prediction model. The label prediction model mainly includes an audio feature extraction module and a self-attention decoder module, where the self-attention decoder module is formed by stacking a plurality of self-attention decoders. When the speech to be recognized is input into the speech recognition system for classification and recognition, the audio feature extraction module of the label prediction model first performs feature extraction on the speech to be recognized to obtain the features corresponding to the speech. The position coding information corresponding to these features is then supplemented, and the features together with their position coding information are decoded by the self-attention decoder module; during decoding, the output of the i-th self-attention decoder is the input of the (i+1)-th self-attention decoder, and the hidden feature vector output by the last self-attention decoder, expressed as Z = [z1, z2, ..., zn], is taken as the final output of the self-attention decoder module. This output is then processed by a linear transformation to obtain a decoding vector, and softmax logistic regression is applied to the decoding vector, mapping Z = [z1, z2, ..., zn] to the one-dimensional class space L = [l1, l2, ..., lm]. Based on this processing, the label prediction model outputs the predicted label category corresponding to the speech to be recognized. Table 1 shows the predicted label categories output after common voice instructions are predicted by the label prediction model:
TABLE 1 common Voice Instructions and their predictive tag classes
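The label prediction pipeline described above (features plus positional encoding, stacked self-attention decoders where each decoder feeds the next, a linear mapping, then softmax) can be sketched as follows. This is an illustrative sketch only, not part of the claimed implementation: the function names, the decoder internals (passed in as callables) and all dimensions are assumptions.

```python
import numpy as np

def label_prediction_forward(features, pos_encoding, decoders, W_lin):
    """Sketch of the label prediction model's forward pass.

    features:     (n, d) array output by the audio feature extraction module
    pos_encoding: (n, d) supplemented positional encoding
    decoders:     list of callables; output of decoder i is input of decoder i+1
    W_lin:        (d, m) linear transformation mapping Z to the m-class space
    """
    Z = features + pos_encoding
    for decoder in decoders:                  # stacked self-attention decoders
        Z = decoder(Z)
    logits = Z.mean(axis=0) @ W_lin           # map Z = [z1..zn] to class space [l1..lm]
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return e / e.sum()                        # predicted label category distribution
```

The predicted label category is then the argmax of the returned distribution.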
In an embodiment, the extracting the features of the speech to be recognized specifically includes: pre-emphasizing the speech to be recognized in units of frames, and performing a fast Fourier transform on the pre-emphasized speech; processing the transformed speech through a Log Mel spectrum filter to obtain a filter output value; and sequentially performing a linear transformation and layer normalization on the filter output value to obtain the features of the speech to be recognized.
As shown in fig. 3, fig. 3 is a schematic diagram of the architecture of the audio feature extraction module of the label prediction model, which extracts the features corresponding to the speech to be recognized. When extracting features, pre-emphasis is first performed on the speech to be recognized in units of frames, in order to strengthen the high-frequency part, remove the influence of lip radiation, and improve the high-frequency signal-to-noise ratio. The formula is as follows:

s′(x) = s(x) − k·s(x−1)

where k is the pre-emphasis coefficient with k ∈ [0, 1], x is the sample index, and s(x) is the speech signal value at x.
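A minimal implementation of the pre-emphasis formula above might look like this (the function name and the default coefficient k = 0.97 are assumptions; the text only requires k ∈ [0, 1]):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, k: float = 0.97) -> np.ndarray:
    """Apply the per-sample pre-emphasis filter s'(x) = s(x) - k*s(x-1)."""
    # The first sample has no predecessor, so it is kept unchanged.
    return np.append(signal[0], signal[1:] - k * signal[:-1])
```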
A Fast Fourier Transform (FFT) is then performed on the pre-emphasized speech to be recognized. The fast Fourier transform decomposes a complex sound wave into component waves of various frequencies; specifically, a discrete Fourier transform may be performed on the pre-emphasized speech, i.e. an n-point FFT is performed on each frame to calculate the frequency spectrum, where n may be 256 or 512.
It should be noted that, before performing the fast Fourier transform, the pre-emphasized speech to be recognized may first be framed, i.e. the speech of indefinite length is cut into segments of fixed length. The frame length can be chosen as 20 ms, within which the speech signal can be regarded as stationary, while the frame shift is set to 10 ms, i.e. consecutive frames overlap with a time offset of 10 ms, to avoid losing speech information at frame boundaries.
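The 20 ms frame length and 10 ms frame shift described above could be implemented as follows (the 16 kHz sample rate and the function name are assumptions):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Cut a 1-D signal into overlapping fixed-length frames.

    With the defaults: 20 ms frames (320 samples) every 10 ms (160 samples),
    so consecutive frames overlap by half a frame.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    return np.stack([signal[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])
```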
After the pre-emphasized speech to be recognized has undergone the fast Fourier transform, it is processed by a Log Mel spectrum filter to obtain a filter output value. The Log Mel spectrum filter, also called a Filter Bank, processes audio in a manner similar to the human ear, which improves speech recognition performance. After the transformed speech passes through the Log Mel spectrum filter, a two-dimensional array X = [x1, x2, ..., xn] is finally output, where xn is the n-th frame segment and each element of the array is a k-dimensional vector, with k the number of filters, which can be set flexibly according to actual needs, for example k = 40.
In order to enable the feature matrix size of the speech to be recognized output by the audio feature extraction module to be matched with the input size of the self-attention decoder module, the speech to be recognized after the fast Fourier transform is processed by the Log Mel spectrum filter, and then the linear transform and the layer standardization are further carried out, so that the features of the speech to be recognized are finally obtained.
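The FFT and Log Mel spectrum filtering steps can be sketched as follows. The triangular Mel filter construction here is the standard Filter Bank recipe rather than a formula given in the text, and the choices of k = 40 filters with a 512-point FFT follow the examples above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    """Triangular Mel filters mapping an FFT power spectrum to k filter outputs."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for j in range(left, center):                      # rising slope
            fbank[i - 1, j] = (j - left) / max(center - left, 1)
        for j in range(center, right):                     # falling slope
            fbank[i - 1, j] = (right - j) / max(right - center, 1)
    return fbank

def log_mel_features(frames, n_fft=512, n_filters=40, sample_rate=16000):
    """n-point FFT per frame -> power spectrum -> Mel filtering -> log."""
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    return np.log(power @ fbank.T + 1e-10)   # small epsilon avoids log(0)
```

The result is the two-dimensional array X described above: one k-dimensional vector per frame, to which the linear transformation and layer normalization would then be applied.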
In an embodiment, the decoding the feature of the speech to be recognized and the position code corresponding to the feature of the speech to be recognized to obtain a hidden feature vector specifically includes: and performing multi-head attention calculation on the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain multi-head attention output, and performing feedforward calculation on the multi-head attention output to obtain a hidden feature vector.
As can be seen from the foregoing, the self-attention decoder module is formed by stacking N (N ≥ 2) self-attention decoders. As shown in fig. 4, fig. 4 is a schematic structural diagram of a single self-attention decoder. Each self-attention decoder includes two sublayers: the first is multi-head attention, and the second is a fully connected feedforward neural network (the simplest fully connected structure). In addition, each of the two sublayers uses a residual connection followed by layer normalization. The residual connection addresses the difficulty of training deep multi-layer neural networks by letting the network focus only on the residual part during training, while layer normalization accelerates model training and convergence.
It should be noted that, since the self-attention formula may cause the loss of the position information during the calculation, when the feature of the speech to be recognized output by the audio feature extraction module is input into the self-attention decoder module, the position coding information corresponding to the feature of the speech to be recognized is supplemented first. Thus, the input from the attention decoder module is the feature of the speech to be recognized and the position coding information corresponding to the feature, as shown in fig. 4, the position coding information corresponding to the feature corresponds to Q in fig. 4, and K, V corresponds to the feature of the speech to be recognized output by the audio feature extraction module.
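The text does not specify how the supplemented position coding is computed; the standard Transformer sinusoidal encoding is one plausible choice and can be sketched as:

```python
import numpy as np

def sinusoidal_position_encoding(n_positions, d_model):
    """Standard sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
```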
After inputting the features corresponding to the speech to be recognized and the position coding information of the features into a self-attention decoder module, for a first self-attention decoder in the self-attention decoder module, firstly, performing multi-head attention calculation on the features corresponding to the speech to be recognized and the position coding information of the features in a multi-head attention layer to obtain the output of the multi-head attention layer, and then inputting the output of the multi-head attention layer into a feedforward neural network layer to perform feedforward calculation to obtain the output of the first self-attention decoder, namely a hidden feature vector; for other self-attention decoders except the first self-attention decoder in the self-attention decoder module, performing multi-head attention calculation on the output of the last self-attention decoder in a multi-head attention layer to obtain the output of the multi-head attention layer, and inputting the output of the multi-head attention layer into a feedforward neural network layer to perform feedforward calculation to obtain the outputs of other self-attention decoders; the output from the last of the attention decoder modules is taken as the final output from the attention decoder module.
In the self-attention decoder, multi-head attention is the most important transformation mapping, and it is composed of a basic attention mapping. Scaled dot-product attention (SDPA) maps a query (Q), keys (K) and values (V) to a weighted sum, expressed as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

where the query Q and key K have the same dimension d_k, and the value V has dimension d_v. To obtain multiple different linear mappings, multi-head attention is introduced: the basic attention functions are computed in parallel, each head outputs a d_v-dimensional result, and the head outputs are concatenated and projected to form the final output. The formulas are as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O

head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
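The two attention formulas above can be written out directly. The sketch below is illustrative: weight handling is simplified to one projection matrix per head, and all shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h):
    """MultiHead(Q,K,V) = Concat(head_1..head_h) W_O,
    head_i = Attention(Q Wq[i], K Wk[i], V Wv[i])."""
    heads = [scaled_dot_product_attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i])
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo
```

Note that with an all-zero query the softmax weights are uniform, so SDPA simply returns the mean of the value rows, which makes the weighted-sum interpretation easy to check.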
in an embodiment, the predicting loss value of the tag prediction model obtained through prediction by the active learning loss prediction model of the speech recognition system specifically includes: inputting the hidden feature vector into an active learning loss prediction model of the speech recognition system, and performing global pooling on the hidden feature vector to obtain a global pooled feature vector; performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector; carrying out nonlinear mapping on the fully connected feature vector through a ReLU linear rectification function to obtain feature mapping; and carrying out full-connection operation on the feature mapping to obtain a prediction loss value output by an active learning loss prediction model of the voice recognition system.
Continuing to refer to fig. 2, the dashed box on the right of fig. 2 is an architecture diagram of the active learning loss prediction model, which is formed by stacking a plurality of active learning modules. As shown in fig. 5, fig. 5 is a schematic diagram of the architecture of an active learning module. The active learning loss prediction model takes the hidden feature vectors output by the self-attention decoders as input; each hidden feature vector is processed in turn by a global pooling layer, a fully connected layer and a ReLU linear rectification layer to obtain the output of an active learning module, and the outputs of the active learning modules are finally processed by a fully connected layer to obtain the output of the active learning loss prediction model, i.e. the predicted loss value (see fig. 2 and fig. 5). This value represents the probability that the label prediction model makes a correct prediction; in particular, a high loss value indicates that the current input is difficult data for the speech recognition system and that the label prediction model may make a wrong decision.
Compared with the tag prediction model, the loss prediction module is a lightweight network and can therefore make predictions quickly. Meanwhile, to improve network utilization, the input of each active learning module is the output of a corresponding attention decoder layer; inputs from multiple information sources enable the loss prediction module to select the useful information, and the global pooling layer maps information of different dimensions to a fixed information dimension.
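As a rough sketch of the architecture just described, the following NumPy code chains global pooling, a fully connected layer, and a ReLU inside each active learning module, then a final fully connected layer producing the scalar prediction loss value. All dimensions, weight names, and the number of modules here are illustrative assumptions, not values from this disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def active_learning_module(hidden, w_fc, b_fc):
    """One active learning module: global pooling -> fully connected -> ReLU."""
    pooled = hidden.mean(axis=0)          # global (average) pooling over time -> fixed dimension
    fc = pooled @ w_fc + b_fc             # fully connected layer
    return np.maximum(fc, 0.0)            # ReLU linear rectification function

def loss_prediction_model(hiddens, params, w_out, b_out):
    """Concatenate the module outputs (one module per decoder layer) and map
    them through a final fully connected layer to a single scalar: the
    predicted loss value."""
    feats = [active_learning_module(h, w, b) for h, (w, b) in zip(hiddens, params)]
    concat = np.concatenate(feats)
    return float(concat @ w_out + b_out)

# Toy dimensions: two decoder layers, 5 time steps, hidden size 8, module width 4.
hiddens = [rng.normal(size=(5, 8)) for _ in range(2)]
params = [(rng.normal(size=(8, 4)), np.zeros(4)) for _ in range(2)]
w_out, b_out = rng.normal(size=(8,)), 0.0

predicted_loss = loss_prediction_model(hiddens, params, w_out, b_out)
```

Because each module pools over the time axis before its fully connected layer, hidden feature vectors of different time lengths are mapped to the same fixed dimension, matching the role of the global pooling layer described above.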
Step S102: when the prediction tag category is determined to be inaccurate according to the prediction loss value, acquiring the actual tag category corresponding to the speech to be recognized, and determining the speech to be recognized and the actual tag category corresponding to the speech to be recognized as training data.
As can be seen from the foregoing, the prediction loss value output by the active learning loss prediction model indicates whether the prediction tag category output by the tag prediction model for the speech to be recognized is accurate. Therefore, after obtaining the prediction tag category corresponding to the speech to be recognized output by the tag prediction model of the speech recognition system and the prediction loss value output by the active learning loss prediction model of the speech recognition system, whether the prediction tag category corresponding to the speech to be recognized is accurate is determined according to the prediction loss value. Specifically, the prediction loss value is compared with a preset threshold value; if the prediction loss value is greater than or equal to the preset threshold value, the prediction tag category output by the tag prediction model for the speech to be recognized is determined to be inaccurate. The preset threshold value serves as the critical value for judging whether the prediction tag category is accurate and can be set flexibly according to the actual situation.
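The threshold comparison can be sketched in a few lines; the threshold value 0.5 below is a placeholder, since as stated above it would be set flexibly according to the actual situation:

```python
def needs_annotation(predicted_loss, threshold=0.5):
    """A sample is flagged as difficult (its prediction tag category is deemed
    inaccurate) when its predicted loss reaches the preset threshold."""
    return predicted_loss >= threshold

assert needs_annotation(0.8)        # high predicted loss: collect the actual tag category
assert not needs_annotation(0.1)    # low predicted loss: the prediction is trusted
```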
When the prediction loss value indicates that the prediction tag category corresponding to the speech to be recognized is inaccurate, the speech to be recognized is difficult data for the speech recognition system; the speech to be recognized and its actual tag category are therefore used to optimize and update the speech recognition system, so the actual tag category corresponding to the speech to be recognized needs to be acquired. Taking application to a voice assistant as an example: when a user issues a voice instruction, the voice assistant acquires the instruction and inputs it into the speech recognition system as the speech to be recognized; loss prediction is performed through the active learning loss prediction model of the speech recognition system, and if the obtained prediction loss value is relatively high, indicating that the prediction tag category corresponding to the speech to be recognized is inaccurate, prompt information asking the user to select the correct tag category can be generated and displayed. Meanwhile, tag category options related to the speech to be recognized are loaded for the user to choose from; a selection instruction from the user on the tag category options is then received, and the tag category corresponding to the selection instruction is taken as the actual tag category corresponding to the speech to be recognized.
After the actual label category corresponding to the speech to be recognized is obtained, the speech to be recognized and the actual label category corresponding to the speech to be recognized can be used as training data, so that the training data can be accumulated while the speech recognition system executes a speech recognition task, and the training data can be used for further optimizing and training the speech recognition system.
In conclusion, while the speech recognition system is in operation, the active learning loss prediction module finds the speech data that the speech recognition system is liable to recognize incorrectly, and this data serves as training data for optimizing the speech recognition system. Training data is thus collected efficiently, without manual annotation, saving labor cost.
Step S103: counting the training data, and establishing a training set according to the counted training data.
The training data may then be counted, for example periodically, such as every month, and a training set is established according to the counted training data. Illustratively,
training set = {training data 1, training data 2, …, training data B}
= {(speech data x_1, actual tag category y_1), (speech data x_2, actual tag category y_2), …, (speech data x_B, actual tag category y_B)}
Step S104: inputting the training set into the speech recognition system to perform optimization training on the speech recognition system, and calculating a target loss function until the target loss function converges to obtain the optimized speech recognition system.
In an embodiment, the inputting the training set into the speech recognition system to perform optimization training on the speech recognition system, and calculating an objective loss function specifically includes: inputting each training data in the training set into the speech recognition system, predicting through a label prediction model of the speech recognition system to obtain a prediction label category of the speech in the training data, and predicting through an active learning loss prediction model of the speech recognition system to obtain a prediction loss value aiming at the speech in the training data; and calculating a target loss function according to the actual label category and the predicted label category corresponding to the voice in the training data and the predicted loss value of the voice in the training data.
The established training set is input into the speech recognition system to train the speech recognition system. In the training process, for the speech data x in any training data, the prediction tag category ŷ can be obtained through the tag prediction model and the prediction loss value l̂ can be obtained through the active learning loss prediction model, and the target loss function of the speech recognition system is calculated in combination with the actual tag category y.
In an embodiment, the calculating a target loss function according to the actual tag class and the predicted tag class corresponding to the speech in the training data and the predicted loss value for the speech in the training data specifically includes: calculating an actual loss value according to the actual label category and the predicted label category corresponding to the voice in the training data; calculating a loss between the actual loss value and the predicted loss value for speech in the training data; and constructing a target loss function according to the calculated loss and the actual loss value.
As can be seen from the foregoing, in the training process, for the speech data x in any training data, the prediction tag category ŷ is obtained through the tag prediction model and the prediction loss value l̂ is obtained through the active learning loss prediction model. Thus, based on the prediction tag category ŷ and the actual tag category y, an actual loss value l is calculated; then the loss L_loss(l̂, l) between the actual loss value l and the prediction loss value l̂ is calculated, and the two parts of loss are combined to obtain the target loss of the speech recognition system, as shown in fig. 6.
Specifically, the difference between the prediction tag category and the actual tag category, i.e., the actual loss value l, can be calculated through a cross entropy loss function; this difference is used for the comparison training of the active learning loss prediction model. The cross entropy loss function is as follows:

l = -Σ_k p_k · log(q_k)

where p_k represents the actual tag value and q_k represents the prediction tag value.
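A minimal sketch of this cross entropy computation (the function name and sample values are illustrative):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """l = -sum_k p_k * log(q_k); p is the actual (one-hot) tag distribution,
    q the predicted tag distribution. eps guards against log(0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))

p = [0.0, 1.0, 0.0]        # actual tag value: class 1
q = [0.1, 0.7, 0.2]        # predicted tag values
loss = cross_entropy(p, q) # equals -log(0.7)
```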
Then the loss L_loss(l̂, l) between the actual loss value l and the prediction loss value l̂ is calculated. The simplest loss function between the actual loss value l and the prediction loss value l̂ would be a mean square error loss function, but it is unsuitable in this training scenario for two reasons. First, the actual loss decreases as training proceeds, and the tag prediction model is updated during training, so the label fitted by the active learning loss module keeps changing and cannot be fitted. Second, the purpose of the active learning loss prediction model is to reflect the relative magnitude of the loss between different data, without corresponding accurately to the actual loss; in other words, what is wanted is a ranking rather than the actual loss value. Therefore, the whole training process and the corresponding loss function are adjusted. Specifically, the speech data in the counted training data are matched in pairs; for example, matching the speech data in the counted B training data in pairs yields B/2 speech data pairs {x^p = (x_i, x_j)}. A training set formed of the speech data pairs is then input into the speech recognition system, and the loss L_loss(l̂, l) between the actual loss value l and the prediction loss value l̂ is constructed by comparing the relative prediction loss relationship and the relative actual loss relationship of each speech data pair. The loss function is as follows:

L_loss(l̂^p, l^p) = max(0, −𝟙(l_i, l_j) · (l̂_i − l̂_j) + ξ)
where l̂ represents the prediction loss value output by the active learning loss module;

l represents the actual loss value, calculated from the prediction tag category and the actual tag category;

𝟙(l_i, l_j) represents the actual loss magnitude relationship of the speech data pair (x_i, x_j), taking the value +1 when l_i ≥ l_j and −1 otherwise;

ξ is a preset positive-valued hyperparameter.
To understand the above loss function: when l_i ≥ l_j, the loss value is 0 only when l̂_i is greater than l̂_j by at least ξ, and is non-zero in every other case, so that training increases l̂_i and reduces l̂_j.
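Under the reading above (indicator +1 when l_i ≥ l_j, margin ξ), the pairwise loss can be sketched as follows; the function name and sample values are illustrative:

```python
def pair_ranking_loss(lh_i, lh_j, l_i, l_j, xi=1.0):
    """max(0, -1(l_i, l_j) * (lh_i - lh_j) + xi): lh_* are predicted loss
    values, l_* actual loss values, xi a positive margin hyperparameter."""
    sign = 1.0 if l_i >= l_j else -1.0   # the indicator 1(l_i, l_j)
    return max(0.0, -sign * (lh_i - lh_j) + xi)

# l_i > l_j and lh_i exceeds lh_j by at least xi: loss is 0.
assert pair_ranking_loss(lh_i=3.0, lh_j=1.0, l_i=2.0, l_j=0.5) == 0.0
# Predicted ordering contradicts the actual ordering: positive loss,
# which pushes training to increase lh_i and decrease lh_j.
assert pair_ranking_loss(lh_i=1.0, lh_j=3.0, l_i=2.0, l_j=0.5) == 3.0
```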
Combining the two loss functions finally yields the target loss function used to update the speech recognition system, summarized as follows:

L = L_target(ŷ, y) + λ · L_loss(l̂^p, l^p)

where (x, y) is the speech data serving as training data and its corresponding actual tag category;

L_target is the cross entropy loss function;

λ is another preset positive-valued hyperparameter.
The speech recognition system is optimized and trained according to the target loss function until the target loss function converges, thereby obtaining the optimized speech recognition system.
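Putting the pieces together, a hedged sketch of the combined objective and the convergence check; the weighting, tolerance, and history values are invented for illustration:

```python
def target_loss(ce_losses, pair_losses, lam=1.0):
    """Overall objective: mean cross entropy task loss plus lambda times the
    mean pairwise loss of the active learning loss prediction model."""
    return sum(ce_losses) / len(ce_losses) + lam * sum(pair_losses) / len(pair_losses)

def converged(history, tol=1e-4):
    """Training stops once successive objective values stop changing."""
    return len(history) >= 2 and abs(history[-1] - history[-2]) < tol

history = [2.31, 1.10, 0.52, 0.5183, 0.51825]
assert not converged(history[:3])   # objective still decreasing sharply
assert converged(history)           # change below tolerance: stop training
```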
In the speech recognition system optimization method provided by this embodiment, a speech to be recognized is acquired and input into the speech recognition system for classification recognition; a prediction tag category corresponding to the speech to be recognized is obtained through prediction by the tag prediction model of the speech recognition system, and a prediction loss value of the tag prediction model is obtained through prediction by the active learning loss prediction model of the speech recognition system. When the prediction tag category is determined to be inaccurate according to the prediction loss value, the actual tag category corresponding to the speech to be recognized is acquired, and the speech to be recognized together with its actual tag category is taken as training data. The training data is then counted, a training set is established according to the counted training data, the established training set is input into the speech recognition system to perform optimization training on the speech recognition system, and a target loss function is calculated until the target loss function converges, thereby obtaining the optimized speech recognition system. In this way, while the speech recognition system is in operation, the loss value predicted by the active learning loss prediction module is used to find the speech data that the system is liable to recognize incorrectly, and this data serves as training data for optimizing the speech recognition system. Training data is thus collected efficiently and reused to optimize and train the speech recognition system, the recognition coverage of the speech recognition system can be improved, and updating and upgrading of the speech recognition system are realized, thereby improving the recognition accuracy and reliability of the speech recognition system.
Referring to fig. 7, fig. 7 is a schematic block diagram of an optimization apparatus of a speech recognition system according to an embodiment of the present disclosure.
As shown in fig. 7, the speech recognition system optimization apparatus 400 includes: a prediction module 401, a determination module 402, an establishing module 403, and an optimization module 404.
The prediction module 401 is configured to obtain a speech to be recognized, input the speech to be recognized to a speech recognition system for classification recognition, obtain a prediction tag category corresponding to the speech to be recognized through prediction of a tag prediction model of the speech recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the speech recognition system;
a determination module 402, configured to, when it is determined that the predicted tag category is inaccurate according to the prediction loss value, obtain an actual tag category corresponding to the speech to be recognized, and determine the speech to be recognized and the actual tag category corresponding to the speech to be recognized as training data;
an establishing module 403, configured to count training data and establish a training set according to the counted training data;
an optimization module 404, configured to input the training set into the speech recognition system to perform optimization training on the speech recognition system, and calculate a target loss function until the target loss function converges, so as to obtain an optimized speech recognition system.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules and units described above may refer to the corresponding processes in the foregoing embodiment of the speech recognition system optimization method, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram illustrating a structure of a computer device according to an embodiment of the present disclosure. The computer device may be a Personal Computer (PC), a server, or the like having a data processing function.
As shown in fig. 8, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the speech recognition system optimization methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the speech recognition system optimization methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed solution and does not limit the computer devices to which the disclosed solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like, wherein a general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classified recognition, so as to obtain a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system; when the prediction label category is determined to be inaccurate according to the prediction loss value, acquiring an actual label category corresponding to the voice to be recognized, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data; counting training data, and establishing a training set according to the counted training data; and inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, and calculating a target loss function until the target loss function is converged to obtain the optimized voice recognition system.
In some embodiments, the processor implements the predicting, by a tag prediction model of the speech recognition system, to obtain a predicted tag category corresponding to the speech to be recognized, including:
inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized;
decoding the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain hidden characteristic vectors;
performing linear transformation on the hidden feature vector to obtain a decoding vector;
and performing softmax logistic regression calculation on the decoding vector to obtain a prediction tag category corresponding to the voice to be recognized and output by a tag prediction model of the voice recognition system.
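The softmax step can be sketched as follows; the decoding vector values are illustrative:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the decoding vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

decoding_vector = [2.0, 0.5, -1.0]      # output of the linear transformation
probs = softmax(decoding_vector)        # one probability per tag category
predicted_tag = int(np.argmax(probs))   # index of the prediction tag category
```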
In some embodiments, the processor implements the predicting of the predicted loss value of the tag prediction model by the active learning loss prediction model of the speech recognition system, including:
inputting the hidden feature vector into an active learning loss prediction model of the speech recognition system, and performing global pooling on the hidden feature vector to obtain a global pooled feature vector;
performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector;
carrying out nonlinear mapping on the fully connected feature vector through a ReLU linear rectification function to obtain feature mapping;
and carrying out full-connection operation on the feature mapping to obtain a prediction loss value output by an active learning loss prediction model of the voice recognition system.
In some embodiments, the processor performs the optimal training of the speech recognition system by inputting the training set into the speech recognition system, and calculating an objective loss function, including:
inputting each training data in the training set into the speech recognition system, predicting through a label prediction model of the speech recognition system to obtain a prediction label category of the speech in the training data, and predicting through an active learning loss prediction model of the speech recognition system to obtain a prediction loss value aiming at the speech in the training data;
and calculating a target loss function according to the actual label category and the predicted label category corresponding to the voice in the training data and the predicted loss value of the voice in the training data.
In some embodiments, the processor implements the calculating an objective loss function according to an actual tag class and a predicted tag class corresponding to speech in the training data and the predicted loss value for speech in the training data, including:
calculating an actual loss value according to the actual label category and the predicted label category corresponding to the voice in the training data;
calculating a loss between the actual loss value and the predicted loss value for speech in the training data;
and constructing a target loss function according to the calculated loss and the actual loss value.
In some embodiments, the performing, by the processor, the feature extraction on the speech to be recognized to obtain the feature of the speech to be recognized includes:
pre-emphasizing the voice to be recognized by taking a frame as a unit, and performing fast Fourier transform on the pre-emphasized voice to be recognized;
processing the voice to be recognized after the fast Fourier transform through a Log Mel spectrum filter to obtain a filtering output value;
and sequentially carrying out linear transformation and layer standardization on the filtering output value to obtain the characteristics of the voice to be recognized.
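These steps can be sketched end to end in NumPy. The sampling rate, FFT size, filter count, pre-emphasis coefficient, window choice, and output dimension below are all assumptions for illustration, since the disclosure does not specify them:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filters mapping an FFT power spectrum to n_mels bands."""
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

def log_mel_features(signal, sr=16000, n_fft=256, hop=128, n_mels=8, alpha=0.97):
    """Pre-emphasis -> framed fast Fourier transform -> Log Mel spectrum filtering."""
    x = np.append(signal[0], signal[1:] - alpha * signal[:-1])   # pre-emphasis
    fb = mel_filterbank(n_mels, n_fft, sr)
    n_frames = 1 + (len(x) - n_fft) // hop
    feats = []
    for t in range(n_frames):
        frame = x[t * hop: t * hop + n_fft] * np.hamming(n_fft)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2           # FFT power spectrum
        feats.append(np.log(fb @ power + 1e-10))                 # filtering output value
    return np.asarray(feats)

def layer_norm(x, eps=1e-5):
    """Layer standardization over the feature dimension."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
audio = rng.normal(size=16000)            # one second of toy audio
logmel = log_mel_features(audio)          # (frames, n_mels)
w = rng.normal(size=(8, 16))              # linear transformation to model dimension
features = layer_norm(logmel @ w)         # features of the voice to be recognized
```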
In some embodiments, the decoding, by the processor, the feature of the speech to be recognized and the position code corresponding to the feature of the speech to be recognized to obtain a hidden feature vector includes:
performing multi-head attention calculation on the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain multi-head attention output;
and performing feedforward calculation on the multi-head attention output to obtain a hidden feature vector.
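A compact self-attention sketch of these two steps. Identity per-head projections and random weights stand in for learned parameters; a real implementation would learn query, key, value, and output projections:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, n_heads=2):
    """Self-attention over x of shape (time, d_model); each head attends over
    its own slice of the features (assumed projections, for brevity)."""
    t, d = x.shape
    dh = d // n_heads
    heads = []
    for h in range(n_heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]
        attn = softmax(q @ k.T / np.sqrt(dh))   # scaled dot-product attention
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1)       # multi-head attention output

def feed_forward(x, w1, w2):
    """Position-wise feedforward producing the hidden feature vectors."""
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(5, d_model))          # features plus position codes, summed
attn_out = multi_head_attention(x)         # (time, d_model)
w1, w2 = rng.normal(size=(d_model, 16)), rng.normal(size=(16, d_model))
hidden = feed_forward(attn_out, w1, w2)    # hidden feature vector per time step
```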
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions; for the method implemented when the program instructions are executed, reference may be made to the embodiments of the speech recognition system optimization method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a/an …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method for optimizing a speech recognition system, said method comprising the steps of:
acquiring a voice to be recognized, inputting the voice to be recognized into a voice recognition system for classified recognition, so as to obtain a prediction tag category corresponding to the voice to be recognized through prediction of a tag prediction model of the voice recognition system, and obtain a prediction loss value of the tag prediction model through prediction of an active learning loss prediction model of the voice recognition system;
when the prediction label category is determined to be inaccurate according to the prediction loss value, acquiring an actual label category corresponding to the voice to be recognized, and determining the voice to be recognized and the actual label category corresponding to the voice to be recognized as training data;
counting training data, and establishing a training set according to the counted training data;
and inputting the training set into the voice recognition system to carry out optimization training on the voice recognition system, and calculating a target loss function until the target loss function is converged to obtain the optimized voice recognition system.
2. The method according to claim 1, wherein the predicting the predicted tag category corresponding to the speech to be recognized by the tag prediction model of the speech recognition system comprises:
inputting the voice to be recognized into a label prediction model of the voice recognition system, extracting the characteristics of the voice to be recognized to obtain the characteristics of the voice to be recognized, and supplementing the position codes corresponding to the characteristics of the voice to be recognized;
decoding the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain hidden characteristic vectors;
performing linear transformation on the hidden feature vector to obtain a decoding vector;
and performing softmax logistic regression calculation on the decoding vector to obtain a prediction tag category corresponding to the voice to be recognized and output by a tag prediction model of the voice recognition system.
3. The method of claim 2, wherein the predicting the predicted loss value of the tag prediction model by the active learning loss prediction model of the speech recognition system comprises:
inputting the hidden feature vector into an active learning loss prediction model of the speech recognition system, and performing global pooling on the hidden feature vector to obtain a global pooled feature vector;
performing full-connection operation on the global pooling feature vector to obtain a full-connection feature vector;
carrying out nonlinear mapping on the fully connected feature vector through a ReLU linear rectification function to obtain feature mapping;
and carrying out full-connection operation on the feature mapping to obtain a prediction loss value output by an active learning loss prediction model of the voice recognition system.
4. The method of claim 1, wherein the inputting the training set into the speech recognition system optimizes the speech recognition system and calculates an objective loss function, comprising:
inputting each training data in the training set into the speech recognition system, predicting through a label prediction model of the speech recognition system to obtain a prediction label category of the speech in the training data, and predicting through an active learning loss prediction model of the speech recognition system to obtain a prediction loss value aiming at the speech in the training data;
and calculating a target loss function according to the actual label category and the predicted label category corresponding to the voice in the training data and the predicted loss value of the voice in the training data.
5. The method of claim 4, wherein the calculating an objective loss function according to the actual tag class and the predicted tag class corresponding to the speech in the training data and the predicted loss value for the speech in the training data comprises:
calculating an actual loss value according to the actual label category and the predicted label category corresponding to the voice in the training data;
calculating a loss between the actual loss value and the predicted loss value for speech in the training data;
and constructing a target loss function according to the calculated loss and the actual loss value.
6. The method for optimizing a speech recognition system according to claim 2, wherein the extracting the features of the speech to be recognized to obtain the features of the speech to be recognized comprises:
pre-emphasizing the voice to be recognized by taking a frame as a unit, and performing fast Fourier transform on the pre-emphasized voice to be recognized;
processing the voice to be recognized after the fast Fourier transform through a Log Mel spectrum filter to obtain a filtering output value;
and sequentially carrying out linear transformation and layer standardization on the filtering output value to obtain the characteristics of the voice to be recognized.
7. The method of claim 2, wherein the decoding the feature of the speech to be recognized and the position code corresponding to the feature of the speech to be recognized to obtain the hidden feature vector comprises:
performing multi-head attention calculation on the characteristics of the speech to be recognized and the position codes corresponding to the characteristics of the speech to be recognized to obtain multi-head attention output;
and performing feedforward calculation on the multi-head attention output to obtain a hidden feature vector.
8. A speech recognition system optimization apparatus, comprising:
a prediction module, used for acquiring speech to be recognized and inputting it into a speech recognition system for classification and recognition, so that a label prediction model of the speech recognition system predicts a predicted label category for the speech to be recognized, and an active-learning loss prediction model of the speech recognition system predicts a prediction loss value for the label prediction model;
a determining module, used for acquiring the actual label category of the speech to be recognized when the predicted label category is judged inaccurate according to the prediction loss value, and taking the speech to be recognized together with its actual label category as training data;
an establishing module, used for accumulating the training data and building a training set from the accumulated training data;
and an optimization module, used for inputting the training set into the speech recognition system to perform optimization training, calculating a target loss function until the target loss function converges, to obtain the optimized speech recognition system.
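Taken together, the four modules of claim 8 form a loss-prediction active-learning loop. A minimal sketch follows, in which the `system` object with `predict`/`train` methods, the `oracle_label` callback, and the fixed `threshold` are hypothetical stand-ins for the apparatus described:

```python
def optimize_recognizer(system, unlabeled_speech, oracle_label, threshold=1.0):
    """Select unreliable predictions, label them, and retrain the system."""
    training_set = []
    for speech in unlabeled_speech:
        # label prediction model + active-learning loss prediction model
        predicted_label, predicted_loss = system.predict(speech)
        if predicted_loss > threshold:  # prediction judged inaccurate
            # fetch the actual label and keep the pair as training data
            training_set.append((speech, oracle_label(speech)))
    if training_set:
        system.train(training_set)      # optimize until the target loss converges
    return system, training_set
```

The design choice here is the point of the patent: only utterances whose predicted loss is high are sent for manual labeling, so annotation effort concentrates on the samples the recognizer is likely getting wrong.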
9. A computer device, characterized in that the computer device comprises a processor, a memory, and a computer program stored in the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the speech recognition system optimization method according to any one of claims 1 to 7.
10. A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech recognition system optimization method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110467147.1A CN113223502B (en) | 2021-04-28 | 2021-04-28 | Speech recognition system optimization method, device, equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113223502A true CN113223502A (en) | 2021-08-06 |
CN113223502B CN113223502B (en) | 2024-01-30 |
Family
ID=77089633
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110467147.1A Active CN113223502B (en) | 2021-04-28 | 2021-04-28 | Speech recognition system optimization method, device, equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113223502B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113555005A (en) * | 2021-09-22 | 2021-10-26 | 北京世纪好未来教育科技有限公司 | Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium |
CN114138160A (en) * | 2021-08-27 | 2022-03-04 | 苏州探寻文化科技有限公司 | Learning equipment interacting with user based on multiple modules |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160171974A1 (en) * | 2014-12-15 | 2016-06-16 | Baidu Usa Llc | Systems and methods for speech transcription |
CN107077842A (en) * | 2014-12-15 | 2017-08-18 | 百度(美国)有限责任公司 | System and method for phonetic transcription |
US20180226066A1 (en) * | 2016-10-21 | 2018-08-09 | Microsoft Technology Licensing, Llc | Simultaneous dialogue state management using frame tracking |
US20190130903A1 (en) * | 2017-10-27 | 2019-05-02 | Baidu Usa Llc | Systems and methods for robust speech recognition using generative adversarial networks |
CN109741736A (en) * | 2017-10-27 | 2019-05-10 | 百度(美国)有限责任公司 | The system and method for carrying out robust speech identification using confrontation network is generated |
CN110853618A (en) * | 2019-11-19 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Language identification method, model training method, device and equipment |
CN110838286A (en) * | 2019-11-19 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Model training method, language identification method, device and equipment |
CN110853617A (en) * | 2019-11-19 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Model training method, language identification method, device and equipment |
CN111145728A (en) * | 2019-12-05 | 2020-05-12 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN112002306A (en) * | 2020-08-26 | 2020-11-27 | 阳光保险集团股份有限公司 | Voice category identification method and device, electronic equipment and readable storage medium |
CN112185352A (en) * | 2020-08-31 | 2021-01-05 | 华为技术有限公司 | Voice recognition method and device and electronic equipment |
CN112232480A (en) * | 2020-09-15 | 2021-01-15 | 深圳力维智联技术有限公司 | Method, system and storage medium for training neural network model |
CN112700768A (en) * | 2020-12-16 | 2021-04-23 | 科大讯飞股份有限公司 | Speech recognition method, electronic device and storage device |
CN112528679A (en) * | 2020-12-17 | 2021-03-19 | 科大讯飞股份有限公司 | Intention understanding model training method and device and intention understanding method and device |
CN112712797A (en) * | 2020-12-29 | 2021-04-27 | 平安科技(深圳)有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zonoozi et al. | Periodic-CRN: A convolutional recurrent model for crowd density prediction with recurring periodic patterns. | |
CN109587713B (en) | Network index prediction method and device based on ARIMA model and storage medium | |
CN110704588A (en) | Multi-round dialogue semantic analysis method and system based on long-term and short-term memory network | |
CN109816221A (en) | Decision of Project Risk method, apparatus, computer equipment and storage medium | |
CN113223502B (en) | Speech recognition system optimization method, device, equipment and readable storage medium | |
CN110276382B (en) | Crowd classification method, device and medium based on spectral clustering | |
CN112084752B (en) | Sentence marking method, device, equipment and storage medium based on natural language | |
CN112634992A (en) | Molecular property prediction method, training method of model thereof, and related device and equipment | |
CN112580346A (en) | Event extraction method and device, computer equipment and storage medium | |
CN110020739B (en) | Method, apparatus, electronic device and computer readable medium for data processing | |
CN112699213A (en) | Speech intention recognition method and device, computer equipment and storage medium | |
CN114332500A (en) | Image processing model training method and device, computer equipment and storage medium | |
US20210239479A1 (en) | Predicted Destination by User Behavior Learning | |
CN116402630A (en) | Financial risk prediction method and system based on characterization learning | |
CN117033657A (en) | Information retrieval method and device | |
CN116684330A (en) | Traffic prediction method, device, equipment and storage medium based on artificial intelligence | |
CN112597292B (en) | Question reply recommendation method, device, computer equipment and storage medium | |
CN114360520A (en) | Training method, device and equipment of voice classification model and storage medium | |
CN114357171A (en) | Emergency event processing method and device, storage medium and electronic equipment | |
WO2021217866A1 (en) | Method and apparatus for ai interview recognition, computer device and storage medium | |
CN116777646A (en) | Artificial intelligence-based risk identification method, apparatus, device and storage medium | |
CN116542783A (en) | Risk assessment method, device, equipment and storage medium based on artificial intelligence | |
CN116310770A (en) | Underwater sound target identification method and system based on mel cepstrum and attention residual error network | |
CN115062769A (en) | Knowledge distillation-based model training method, device, equipment and storage medium | |
CN115358473A (en) | Power load prediction method and prediction system based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||