CN113763934A - Training method and device of audio recognition model, storage medium and electronic equipment - Google Patents


Info

Publication number
CN113763934A
Authority
CN
China
Prior art keywords
audio
group
audio recognition
training
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110593500.0A
Other languages
Chinese (zh)
Inventor
林炳怀
王丽园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110593500.0A priority Critical patent/CN113763934A/en
Publication of CN113763934A publication Critical patent/CN113763934A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L2015/0631: Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a training method and apparatus for an audio recognition model, a storage medium, and an electronic device. The method comprises the following steps: training an audio recognition model to be trained with a first training sample set to obtain an initial audio recognition model; inputting the audio features of a second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results; inputting the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results; and screening, from the first group of predicted audio recognition results and according to the first group of uncertainty analysis results, a second group of predicted audio recognition results whose credibility meets a preset condition. The method can also be applied in artificial intelligence scenarios, and in particular relates to technologies such as speech recognition and machine learning. The invention solves the technical problem of low training efficiency of audio recognition models.

Description

Training method and device of audio recognition model, storage medium and electronic equipment
Technical Field
The invention relates to the field of computers, in particular to a training method and device of an audio recognition model, a storage medium and electronic equipment.
Background
In recent years, audio recognition technology has been applied ever more widely, for example in spoken-language evaluation and in security detection; however, how to improve the accuracy of audio recognition remains an open research question.
In the related art, the accuracy of audio recognition is usually improved by training an audio recognition model, but such training generally requires a large amount of manually labeled sample data; conversely, when little manually labeled sample data is available, the training effect of the audio recognition model is often difficult to guarantee.
However, manual labeling itself consumes considerable manpower and material resources. Obtaining a large amount of manually labeled sample data therefore incurs not only high labor and material costs but also a long waiting time, which reduces the training efficiency of the audio recognition model. That is, the prior art suffers from the technical problem of low training efficiency of audio recognition models.
No effective solution to the above problems has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a training method and device of an audio recognition model, a storage medium and electronic equipment, and at least solves the technical problem of low training efficiency of the audio recognition model.
According to an aspect of the embodiments of the present invention, there is provided a training method for an audio recognition model, including: training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, where the first training sample set includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used to determine a predicted audio recognition result from input audio features; inputting the audio features of a second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, where the second group of audio samples is not labeled with corresponding actual audio recognition results; inputting the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results, where the first group of uncertainty analysis results represents the credibility of the first group of predicted audio recognition results; screening, according to the first group of uncertainty analysis results, a second group of predicted audio recognition results whose credibility meets a preset condition from the first group of predicted audio recognition results, and screening a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples; and performing the current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, where the initial audio recognition model is configured to undergo multiple rounds of training until a preset convergence condition is satisfied.
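The claimed training flow amounts to a self-training loop with uncertainty-based sample screening. A minimal sketch, assuming generic stand-ins for the model, predictor, and uncertainty analysis (none of `train_model`, `predict`, `uncertainty`, or `select_confident` come from the patent, which does not fix a concrete model or uncertainty measure):

```python
def select_confident(preds, uncertainties, top_n):
    """Keep the top_n predictions with the LOWEST uncertainty scores
    (lower uncertainty = higher credibility)."""
    order = sorted(range(len(preds)), key=lambda i: uncertainties[i])
    keep = order[:top_n]
    return [preds[i] for i in keep], keep


def self_train(train_model, predict, uncertainty, labeled, unlabeled,
               top_n, max_rounds=10):
    """Self-training: train on labeled data, pseudo-label the most credible
    unlabeled samples each round, merge them in, and retrain."""
    model = train_model(labeled)  # initial audio recognition model
    for _ in range(max_rounds):
        if not unlabeled:
            break
        preds = [predict(model, x) for x in unlabeled]       # predicted results
        scores = [uncertainty(model, x) for x in unlabeled]  # uncertainty analysis
        pseudo, keep = select_confident(preds, scores, top_n)
        kept = set(keep)
        # screened predicted results are treated as actual labels and merged in
        labeled = labeled + [(unlabeled[i], p) for i, p in zip(keep, pseudo)]
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in kept]
        model = train_model(labeled)  # current round of training
    return model
```

With real components, `train_model` would fit, say, a scoring regressor on audio features, and `uncertainty` could be a dedicated uncertainty analysis model as in the claims; here they are left as callables so the loop structure stands on its own.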
According to another aspect of the embodiments of the present invention, there is provided an audio recognition method, including: acquiring an input target audio in a target application; obtaining, through a target audio recognition model, a target audio recognition result determined according to the audio features of the target audio, where the target audio recognition model is obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is satisfied, the initial audio recognition model is obtained by training an audio recognition model to be trained with a first training sample set, the first training sample set includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used to determine a predicted audio recognition result from input audio features; in each round of training, the audio recognition model obtained in the previous round is trained with the training sample set corresponding to the current round, where that training sample set includes the training sample set used in the previous round together with a training sample set obtained by the current round of screening; the training sample set obtained by the current round of screening includes a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, the group of audio samples is not labeled with corresponding actual audio recognition results, and the group of predicted audio recognition results are the predicted audio recognition results determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples; and displaying the target audio recognition result in the target application.
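The inference side of this method is simpler; a hypothetical sketch (`evaluate_target_audio` and `extract_features` are illustrative names, not from the patent) of scoring a user's audio with the trained target model:

```python
def evaluate_target_audio(target_model, extract_features, target_audio):
    """Score the input target audio with the trained target audio
    recognition model; the result is what the application displays."""
    features = extract_features(target_audio)  # audio features of the target audio
    return target_model(features)              # target audio recognition result
```

In a spoken-language evaluation scenario, the returned value would be the evaluation score shown in the target application after the user reads a reference text aloud.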
As an optional scheme, the method further includes: ending the training of the initial audio recognition model when the difference between the predicted audio recognition results output by the audio recognition model after the next round of training and the actual audio recognition results in the third training sample set satisfies the convergence condition, so as to obtain a target audio recognition model, where the group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
As an optional scheme, the screening, according to the first set of uncertainty analysis results, of a second set of predicted audio recognition results whose credibility meets a preset condition from the first set of predicted audio recognition results includes: when the first group of uncertainty analysis results includes a group of uncertainty scores, sorting the group of uncertainty scores in ascending order to obtain an uncertainty score sequence, where a higher uncertainty score indicates a lower credibility of the corresponding predicted audio recognition result; acquiring the top N uncertainty scores from the uncertainty score sequence, where the uncertainty score sequence includes M uncertainty scores and N < M; and screening, from the first group of predicted audio recognition results, the second group of predicted audio recognition results corresponding to the top N uncertainty scores.
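The ascending sort and top-N selection described above can be sketched as follows (illustrative Python; `screen_by_uncertainty` is a hypothetical helper name, not from the patent):

```python
def screen_by_uncertainty(results, scores, n):
    """Sort the M uncertainty scores in ascending order (lower score =
    more credible) and return the N results with the smallest scores."""
    assert n < len(scores)  # the claims require N < M
    pairs = sorted(zip(scores, results), key=lambda p: p[0])
    return [result for _, result in pairs[:n]]
```

For example, with scores [0.7, 0.2, 0.9, 0.4] and N = 2, the two results with scores 0.2 and 0.4 survive the screening.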
As an optional solution, the obtaining of the input target audio in the target application includes: under the condition that a reference text is displayed in the target application, acquiring the target audio generated by reading the reference text aloud in the target application, or acquiring the target audio generated by replying the reference text; the displaying the target audio recognition result in the target application includes: and displaying the evaluation score of the target audio determined by the target audio recognition model in the target application.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for training an audio recognition model, including: the audio recognition system comprises a first training unit and a second training unit, wherein the first training unit is used for training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining prediction audio recognition according to input audio features; a first input unit, configured to input audio features of a second group of audio samples to the initial audio recognition model to obtain a first group of predicted audio recognition results, where the second group of audio samples are not labeled with corresponding actual audio recognition results; inputting the audio features of the second set of audio samples into an uncertainty analysis model to obtain a first set of uncertainty analysis results, wherein the first set of uncertainty analysis results is used for representing the credibility of the first set of predicted audio recognition results; the first screening unit is used for screening a second group of predicted audio recognition results with the reliability meeting a preset condition from the first group of predicted audio recognition results according to the first group of uncertainty analysis results, and screening a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples; and a second training unit, configured to perform a current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, 
where the initial audio recognition model is set to undergo multiple rounds of training until a preset convergence condition is satisfied.
As an optional solution, the second training unit includes: a first merging module, configured to merge the third group of audio samples and the second group of predicted audio recognition results into the first training sample set to obtain a second training sample set, where the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results; and the first training module is used for performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training.
As an optional solution, the apparatus further includes: a first obtaining module, configured to obtain a set of audio samples to be used in a next round of training and a set of predicted audio recognition results corresponding to the set of audio samples when a difference between a predicted audio recognition result output by the audio recognition model after the current round of training and an actual audio recognition result in the second training sample set does not satisfy the convergence condition, where the set of audio samples to be used is not labeled with a corresponding actual audio recognition result, and the set of predicted audio recognition results is a predicted audio recognition result determined by the audio recognition model after the current round of training according to audio features of the set of audio samples; a second merging module, configured to merge the group of audio samples to be used and the corresponding group of predicted audio recognition results into the second training sample set to obtain a third training sample set; and the second training module is used for performing the next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training.
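The merge-and-check step performed between rounds can be sketched as follows (illustrative Python; the mean-absolute-error difference measure and its tolerance are assumptions, since the patent leaves the convergence condition open):

```python
def merge_training_sets(previous_set, samples, pseudo_labels):
    """Merge the screened pseudo-labeled samples into the previous round's
    training set; the predicted results are regarded as actual results."""
    return previous_set + list(zip(samples, pseudo_labels))


def mean_abs_error(predict, sample_set):
    """Average absolute gap between predictions and (pseudo-)actual labels."""
    return sum(abs(predict(x) - y) for x, y in sample_set) / len(sample_set)


def next_round_needed(predict, merged_set, tolerance=0.05):
    """Another round is needed while the model's predictions still differ
    from the labels in the merged set by more than the tolerance."""
    return mean_abs_error(predict, merged_set) > tolerance
```

When `next_round_needed` returns False, training ends and the model after the last round is taken as the target audio recognition model.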
As an optional scheme, the first obtaining module includes: an input submodule, configured to input the audio features of a fourth group of audio samples into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, where the fourth group of audio samples is not labeled with corresponding actual audio recognition results, and to input the audio features of the fourth group of audio samples into the uncertainty analysis model to obtain a second group of uncertainty analysis results, where the second group of uncertainty analysis results represents the credibility of the third group of predicted audio recognition results; and a screening submodule, configured to screen, according to the second group of uncertainty analysis results, a fourth group of predicted audio recognition results whose credibility meets a preset condition from the third group of predicted audio recognition results, and to screen, from the fourth group of audio samples, a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results.
As an optional solution, the apparatus further includes: a second obtaining module, configured to end the training of the initial audio recognition model when the difference between the predicted audio recognition results output by the audio recognition model after the next round of training and the actual audio recognition results in the third training sample set satisfies the convergence condition, so as to obtain a target audio recognition model, where the group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
As an optional solution, the first screening unit includes: a third obtaining module, configured to, when the first set of uncertainty analysis results includes a set of uncertainty scores, sort the set of uncertainty scores in ascending order to obtain an uncertainty score sequence, where a higher uncertainty score indicates a lower credibility of the corresponding predicted audio recognition result; a fourth obtaining module, configured to acquire, from the uncertainty score sequence, the top N uncertainty scores, where the uncertainty score sequence includes M uncertainty scores and N < M; and a screening module, configured to screen, from the first group of predicted audio recognition results, the second group of predicted audio recognition results corresponding to the top N uncertainty scores.
As an optional solution, the apparatus further includes: a first acquisition unit configured to acquire an input target audio in a target application; a second obtaining unit, configured to obtain a target audio recognition result determined according to an audio feature of the target audio by using a target audio recognition model, where the target audio recognition model is an audio recognition model obtained by performing multiple rounds of training on the initial audio recognition model until a preset convergence condition is met; and the first display unit is used for displaying the target audio recognition result in the target application.
As an alternative, the method comprises the following steps: the first acquiring unit includes: a target audio module, configured to, when a reference text is displayed in the target application, obtain the target audio generated by reading the reference text aloud in the target application, or obtain the target audio generated by replying to the reference text; the first display unit includes: and the first score module is used for displaying the evaluation score of the target audio determined by the target audio recognition model in the target application.
According to another aspect of the embodiments of the present invention, there is also provided an audio recognition apparatus, including: a third acquisition unit, configured to acquire an input target audio in a target application; a fourth obtaining unit, configured to obtain, through a target audio recognition model, a target audio recognition result determined according to the audio features of the target audio, where the target audio recognition model is an audio recognition model obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is satisfied, the initial audio recognition model is a model obtained by training an audio recognition model to be trained with a first training sample set, the first training sample set includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used to determine a predicted audio recognition result from input audio features; in each round of training, the audio recognition model obtained in the previous round is trained with the training sample set corresponding to the current round, where that training sample set includes the training sample set used in the previous round together with a training sample set obtained by the current round of screening; the training sample set obtained by the current round of screening includes a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, the group of audio samples is not labeled with corresponding actual audio recognition results, and the group of predicted audio recognition results are the predicted audio recognition results determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples; and a third display unit, configured to display the target audio recognition result in the target application.
As an optional solution, the apparatus further includes: a third training unit, configured to train an audio recognition model to be trained by using a first training sample set before an input target audio is acquired in the target application, to obtain an initial audio recognition model, where the first training sample set includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used to determine a predicted audio recognition result from input audio features; a second input unit, configured to input the audio features of a second group of audio samples into the initial audio recognition model before an input target audio is acquired in the target application, to obtain a first group of predicted audio recognition results, where the second group of audio samples is not labeled with corresponding actual audio recognition results, and to input the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results, where the first group of uncertainty analysis results represents the credibility of the first group of predicted audio recognition results; a second screening unit, configured to, before an input target audio is acquired in the target application, screen, according to the first group of uncertainty analysis results, a second group of predicted audio recognition results whose credibility meets a preset condition from the first group of predicted audio recognition results, and to screen, from the second group of audio samples, a third group of audio samples corresponding to the second group of predicted audio recognition results; and a fourth training unit, configured to perform the current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results before an input target audio is acquired in the target application, where the initial audio recognition model is configured to undergo multiple rounds of training until a preset convergence condition is satisfied.
As an optional solution, the apparatus further includes: a first merging unit, configured to merge the third group of audio samples and the second group of predicted audio recognition results into the first training sample set before an input target audio is acquired in the target application, so as to obtain a second training sample set, where the second group of predicted audio recognition results in the second training sample set is regarded as a second group of actual audio recognition results; and a fifth training unit, configured to perform the current round of training on the initial audio recognition model by using the second training sample set before an input target audio is acquired in the target application, so as to obtain the audio recognition model after the current round of training.
As an optional solution, the apparatus further includes: a fifth obtaining unit, configured to obtain a set of audio samples to be used in a next round of training and a set of predicted audio recognition results corresponding to the set of audio samples when the difference between the predicted audio recognition results output by the audio recognition model after the current round of training and the actual audio recognition results in the second training sample set does not satisfy the convergence condition, where the set of audio samples to be used is not labeled with corresponding actual audio recognition results, and the set of predicted audio recognition results are the predicted audio recognition results determined by the audio recognition model after the current round of training according to the audio features of the set of audio samples; a second merging unit, configured to merge the set of audio samples to be used and the corresponding set of predicted audio recognition results into the second training sample set to obtain a third training sample set; and a fourth training unit, configured to perform the next round of training on the audio recognition model after the current round of training by using the third training sample set, so as to obtain the audio recognition model after the next round of training.
As an optional solution, the apparatus further includes: a third input unit, configured to input the audio features of a fourth set of audio samples into the audio recognition model after the current round of training before an input target audio is acquired in the target application, to obtain a third set of predicted audio recognition results, where the fourth set of audio samples is not labeled with corresponding actual audio recognition results, and to input the audio features of the fourth set of audio samples into the uncertainty analysis model to obtain a second set of uncertainty analysis results, where the second set of uncertainty analysis results represents the credibility of the third set of predicted audio recognition results; and a third screening unit, configured to, before an input target audio is acquired in the target application, screen, according to the second set of uncertainty analysis results, a fourth set of predicted audio recognition results whose credibility meets a preset condition from the third set of predicted audio recognition results, and to screen, from the fourth set of audio samples, a fifth set of audio samples corresponding to the fourth set of predicted audio recognition results.
As an optional solution, the apparatus further includes: a sixth obtaining unit, configured to, before an input target audio is acquired in the target application, end the training of the initial audio recognition model when the difference between the predicted audio recognition results output by the audio recognition model after the next round of training and the actual audio recognition results in the third training sample set satisfies the convergence condition, so as to obtain a target audio recognition model, where the group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
As an optional solution, the apparatus further includes: a sorting unit, configured to, before an input target audio is acquired in the target application, sort the set of uncertainty scores in ascending order when the first set of uncertainty analysis results includes a set of uncertainty scores, so as to obtain an uncertainty score sequence, where a higher uncertainty score indicates a lower credibility of the corresponding predicted audio recognition result; a seventh obtaining unit, configured to obtain, before an input target audio is acquired in the target application, the top N uncertainty scores in the uncertainty score sequence, where the uncertainty score sequence includes M uncertainty scores and N < M; and a fourth screening unit, configured to, before an input target audio is acquired in the target application, screen, from the first group of predicted audio recognition results, the second group of predicted audio recognition results corresponding to the top N uncertainty scores.
As an optional solution, the third acquisition unit includes: a second audio module, configured to, when a reference text is displayed in the target application, obtain the target audio generated by reading the reference text aloud in the target application, or obtain the target audio generated by replying to the reference text; and the third display unit includes: a second score module, configured to display, in the target application, the evaluation score of the target audio determined by the target audio recognition model.
According to a further aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium in which a computer program is stored, where the computer program is configured to execute the above training method of the audio recognition model when run.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for training the audio recognition model through the computer program.
In the embodiment of the invention, a first training sample set is used to train an audio recognition model to be trained to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining a predicted audio recognition result according to input audio features; the audio features of a second group of audio samples are input into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples is not labeled with corresponding actual audio recognition results; the audio features of the second group of audio samples are input into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results is used for representing the credibility of the first group of predicted audio recognition results; according to the first group of uncertainty analysis results, a second group of predicted audio recognition results whose credibility meets a preset condition is screened out from the first group of predicted audio recognition results, and a third group of audio samples corresponding to the second group of predicted audio recognition results is screened out from the second group of audio samples; and the current round of training is performed on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to undergo multiple rounds of training until a preset convergence condition is met. In this way, the training of the audio recognition model can be completed without requiring all audio samples to be labeled, reducing the dependence of the training on labeled samples, thereby achieving the technical effect of improving the training efficiency of the audio recognition model and solving the technical problem of low training efficiency of the audio recognition model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an application environment of an alternative audio recognition model training method according to an embodiment of the invention;
FIG. 2 is a schematic diagram illustrating a flow of an alternative method for training an audio recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of an alternative audio recognition model training method according to an embodiment of the present invention;
FIG. 14 is a schematic diagram of an alternative training apparatus for audio recognition models, according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of an alternative audio recognition arrangement according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, in order to facilitate understanding of the embodiments of the present invention, some terms or nouns related to the present invention are explained as follows:
Automatic Speech Recognition (ASR): the process of converting audio into text.
Semi-Supervised Learning (SSL): a learning method that combines supervised and unsupervised learning, performing pattern recognition with a large amount of unlabeled data together with labeled data.
Pearson correlation coefficient: a measure of the linear correlation between two variables X and Y, with a value between -1 and 1.
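A minimal sketch of the coefficient just defined, computed from its textbook formula (covariance divided by the product of standard deviations); `pearson_r` is an illustrative helper, not part of the patent's scheme:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences.
    +1/-1 indicate a perfect linear relationship, 0 no linear correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear relationship yields r ≈ 1.0
pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
```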
Support Vector Regression (SVR): a regression algorithm based on the support vector machine.
The K-Nearest Neighbors algorithm (K-Nearest Neighbors, KNN for short): for a new example to be predicted, the algorithm finds its K nearest neighbors among the training samples and takes the average of the target values of these K samples as the predicted value of the new sample.
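The KNN regression just described can be sketched in a few lines; this sketch assumes one-dimensional features and Euclidean (absolute-difference) distance, and `knn_predict` is an illustrative name:

```python
def knn_predict(train_x, train_y, query, k):
    """Predict a target value for `query` as the mean target value of
    its k nearest training samples (1-D Euclidean distance here)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    neighbors = order[:k]  # indices of the k closest samples
    return sum(train_y[i] for i in neighbors) / k

# The two nearest samples to 2.5 are 2.0 and 3.0, so the prediction
# is the mean of their targets: (20 + 30) / 2 = 25.0
knn_predict([1.0, 2.0, 3.0, 10.0], [10, 20, 30, 100], query=2.5, k=2)  # → 25.0
```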
GBT: gradient boost tree, a regression algorithm based on a lifting tree. A regression tree is fitted using the value of the negative gradient of the loss function at the current model as an approximation of the residual.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to technologies such as artificial intelligence voice recognition and machine learning, and is specifically explained by the following embodiments:
according to an aspect of the embodiments of the present invention, there is provided a training method of an audio recognition model, and optionally, as an optional implementation manner, the training method of an audio recognition model may be applied, but not limited, to the environment shown in fig. 1. The system may include, but is not limited to, a user equipment 102, a network 110, and a server 112, wherein the user equipment 102 may include, but is not limited to, a display 108, a processor 106, and a memory 104.
The specific process comprises the following steps:
step S102, the user equipment 102 obtains a training instruction triggered by a virtual button 'start training', wherein the training instruction is used for instructing model training according to a first group of labeled audio samples 1022 and a second group of unlabeled audio samples 1024;
step S104-S106, the user equipment 102 sends a training instruction to the server 112 through the network 110;
step S108, the server 112 responds to the training instruction, and processes the first group of audio samples 1022 and the second group of audio samples 1024 through the processing engine 116, so as to obtain the trained audio recognition model and generate a corresponding training result;
steps S110-S112, the server 112 sends the training results to the user device 102 via the network 110, and the processor 106 in the user device 102 displays the training results in the display 108 and stores the training results in the memory 104.
In addition to the example shown in fig. 1, the above steps may be performed by the user device 102 independently, that is, the user device 102 performs the steps of processing the first set of audio samples 1022 and the second set of audio samples 1024, so as to relieve the processing pressure of the server. The user equipment 102 includes, but is not limited to, a handheld device (e.g., a mobile phone), a computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and the like, and the specific implementation manner of the user equipment 102 is not limited in the present invention.
Optionally, as an optional implementation manner, as shown in fig. 2, the method for training the audio recognition model includes:
S202, training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining a predicted audio recognition result according to input audio features;
s204, inputting the audio features of the second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results are used for representing the credibility of the first group of predicted audio identification results;
s206, according to the first group of uncertainty analysis results, screening out a second group of predicted audio recognition results with the credibility meeting preset conditions from the first group of predicted audio recognition results, and screening out a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples;
and S208, performing a current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until a preset convergence condition is met.
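Steps S202 to S208 can be sketched as a pseudo-labeling loop. Everything below is a schematic illustration under assumed scikit-learn-style `fit`/`predict` interfaces and an assumed `score` method on the uncertainty model; it is not the patent's concrete implementation:

```python
def train_with_uncertainty_filter(model, uncertainty_model, labeled,
                                  unlabeled, threshold, max_rounds=10,
                                  converged=lambda m: False):
    """Schematic of S202-S208: train on labeled data, pseudo-label the
    unlabeled samples, keep only predictions whose uncertainty score is
    below `threshold`, merge them into the training set, and retrain."""
    model.fit([f for f, _ in labeled], [y for _, y in labeled])  # S202: initial model
    for _ in range(max_rounds):
        preds = model.predict(unlabeled)                         # S204: predicted results
        scores = uncertainty_model.score(unlabeled)              # S204: uncertainty analysis
        kept = [(f, p) for f, p, s in zip(unlabeled, preds, scores)
                if s < threshold]                                # S206: credibility screening
        labeled = labeled + kept                                 # predictions become pseudo-labels
        model.fit([f for f, _ in labeled],
                  [y for _, y in labeled])                       # S208: current round of training
        if converged(model):                                     # stop when convergence is met
            break
    return model
```

A real uncertainty model would come from one of the methods named later (Gaussian process regression, Monte Carlo dropout, etc.); here any object exposing `score` suffices.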
Optionally, in this embodiment, the training method of the audio recognition model may be, but is not limited to being, applied in an automatic spoken language evaluation scenario: for example, an audio recognition model capable of recognizing spoken audio is trained by this training method, the audio input by the user is evaluated and recognized, and the output of the audio recognition model is displayed as an evaluation result, so that users can clearly know their own spoken language level.
Optionally, in this embodiment, the first group of audio samples and the second group of audio samples may both be, but are not limited to being, initially unlabeled audio samples, and the first group of actual audio recognition results is obtained by annotating the first group of audio samples.
Optionally, in this embodiment, the initial audio recognition model may be, but is not limited to, an audio recognition model obtained by training with a small number of labeled audio samples, and may be regarded as a semi-finished audio recognition model that has basic functionality but whose training effect does not yet meet the convergence condition.
Optionally, in this embodiment, the uncertainty analysis model may be, but is not limited to, a model that implements an uncertainty estimation method, where the uncertainty estimation method may include, but is not limited to, at least one of: Gaussian process regression, the Monte Carlo dropout method, a deep mixture density network, and the like. Gaussian process regression adopts the variance as the measure of uncertainty: the larger the variance, the larger the uncertainty. The Monte Carlo dropout method uses an ensemble of stochastic forward passes to analyze model uncertainty; it assumes that for uncertain data the outputs of the individual passes are diverse [8], and the greater the diversity of the outputs, the greater the uncertainty. The deep mixture density network is similar to Gaussian process modeling in that it models the mean and variance of the results [9]; this method also uses the variance as the measure of uncertainty, with a larger variance indicating greater uncertainty.
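The Monte Carlo dropout idea — keep the stochastic behavior active at inference, run several forward passes, and read the variance of the predictions as the uncertainty score — can be sketched generically. `forward_pass` stands for any stochastic predictor (e.g., a network with dropout enabled); the function name and interface are illustrative assumptions:

```python
def mc_dropout_uncertainty(forward_pass, x, n_samples=100):
    """Monte Carlo dropout sketch: run several stochastic forward
    passes on the same input and use the variance of the predictions
    as the uncertainty score (larger variance = less certain)."""
    preds = [forward_pass(x) for _ in range(n_samples)]
    mean = sum(preds) / n_samples
    var = sum((p - mean) ** 2 for p in preds) / n_samples
    return mean, var
```

With a deterministic predictor the variance is zero; diversity across passes is what signals uncertainty.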
It should be noted that an initial audio recognition model is obtained by training on the labeled first group of audio samples, and initial audio recognition is performed on the unlabeled second group of audio samples to obtain a first group of predicted audio recognition results; uncertainty analysis is performed on the unlabeled second group of audio samples using the trained uncertainty analysis model, and the analysis results are used to screen the first group of predicted audio recognition results to obtain a second group of predicted audio recognition results; then, a third group of audio samples corresponding to the second group of predicted audio recognition results is obtained from the second group of audio samples, and the initial audio recognition model is iteratively trained using the third group of audio samples until a preset convergence condition is met, yielding the trained audio recognition model.
To further illustrate, optionally, for example, as shown in fig. 3, an audio recognition model 304 to be trained is first trained by using a first group of audio samples 302 with labels, so as to obtain an initial audio recognition model 306; inputting the audio features of the second set of audio samples 308 into the initial audio recognition model 306 to obtain a first set of predicted audio recognition results 312; inputting the audio features of the second set of audio samples 308 into the uncertainty analysis model 310 to obtain a first set of uncertainty analysis results 314, wherein the first set of uncertainty analysis results 314 is used to represent the confidence level of the first set of predicted audio recognition results 312; according to the first group of uncertainty analysis results 314, a second group of predicted audio recognition results 316 with the credibility meeting a preset condition are screened out from the first group of predicted audio recognition results 312, and a third group of audio samples 318 corresponding to the second group of predicted audio recognition results 316 are screened out from the second group of audio samples 308; performing a current round of training on the initial audio recognition model 306 according to the third group of audio samples 318 and the second group of predicted audio recognition results 316, wherein the initial audio recognition model 306 is set to undergo multiple rounds of training until a preset convergence condition is met;
In addition, optionally, once the trained audio recognition model 320 is obtained, the input audio 322 (e.g., audio of the reference text being read aloud) in the spoken language evaluation scene is acquired, the input audio 322 is input into the audio recognition model 320, and the audio recognition result 324 (e.g., a spoken language evaluation score) is determined according to the output result of the audio recognition model 320.
For further example, an optional training process of the audio recognition model is shown in fig. 4, and the specific steps are as follows:
step S402, training an audio recognition model to be trained by using a sample with a label to obtain an initial audio recognition model;
Step S404, using unlabeled samples, perform initial training and iterative training on the initial audio recognition model with the aid of the uncertainty analysis model. The initial training includes screening the current round's samples with the uncertainty analysis model; the samples obtained after this screening are the samples used in the first round of the iterative training. In the first round, besides training the initial audio recognition model with the samples obtained from the initial training, the uncertainty analysis model is used to screen again, and the samples obtained after this screening are the samples used in the second round. In summary, during the iterative training, apart from the first round, whose samples come from the initial training, every subsequent round is trained on the samples screened out in the previous round;
and step S406, obtaining the trained audio recognition model through multiple rounds of training until a preset convergence condition is met.
According to the embodiment provided by the application, a first training sample set is used to train an audio recognition model to be trained to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining a predicted audio recognition result according to input audio features; the audio features of the second group of audio samples are input into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples is not labeled with corresponding actual audio recognition results; the audio features of the second group of audio samples are input into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results is used for representing the credibility of the first group of predicted audio recognition results; according to the first group of uncertainty analysis results, a second group of predicted audio recognition results whose credibility meets a preset condition is screened out from the first group of predicted audio recognition results, and a third group of audio samples corresponding to the second group of predicted audio recognition results is screened out from the second group of audio samples; and the current round of training is performed on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to undergo multiple rounds of training until a preset convergence condition is met. In this way, the training of the audio recognition model can be completed without requiring all audio samples to be labeled, reducing the dependence of the training on labeled samples and thereby achieving the technical effect of improving the training efficiency of the audio recognition model.
As an alternative, performing a current round of training on the initial audio recognition model according to the third set of audio samples and the second set of predicted audio recognition results includes:
s1, merging the third group of audio samples and the second group of predicted audio recognition results into the first training sample set to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results;
and S2, performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training.
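The merging described in S1 can be sketched as follows; the tuple representation of (sample, label) pairs and the function name are illustrative assumptions, not the patent's data structures:

```python
def merge_pseudo_labels(training_set, audio_samples, predicted_results):
    """Merge screened audio samples with their predicted recognition
    results into the training sample set, treating each prediction as
    if it were an actual (annotated) recognition result."""
    return training_set + list(zip(audio_samples, predicted_results))

# The two screened samples join the original labeled set.
merge_pseudo_labels([("a1", 0.8)], ["a2", "a3"], [0.6, 0.9])
# → [("a1", 0.8), ("a2", 0.6), ("a3", 0.9)]
```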
Optionally, in this embodiment, the third group of audio samples obtained through the screening may be merged into the first training sample set as labeled audio samples to jointly train the initial audio recognition model. Therefore, even if manpower, material resources, or time is limited and a large number of labeled audio samples cannot be obtained, the training of the audio recognition model can still be completed.
According to the embodiment provided by the application, a third group of audio samples and a second group of predicted audio recognition results are merged into a first training sample set to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results; and performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training, thereby realizing the effect of improving the training efficiency of the audio recognition model.
As an optional solution, the method further comprises:
s1, when the difference between the predicted audio recognition result output by the audio recognition model after the previous training and the actual audio recognition result in the second training sample set does not meet the convergence condition, obtaining a group of audio samples to be used in the next training and a group of predicted audio recognition results corresponding to the group of audio samples, wherein the group of audio samples to be used are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are the predicted audio recognition results determined by the audio recognition model after the previous training according to the audio features of the group of audio samples;
s2, merging a group of audio samples to be used and a corresponding group of predicted audio recognition results into a second training sample set to obtain a third training sample set;
and S3, performing next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training.
Optionally, in this embodiment, each round of training may, but is not limited to, use new audio samples, and these new audio samples do not need to be annotated. Based on this, in the training process of the audio recognition model, apart from the small number of annotated audio samples required at the beginning to construct the initial audio recognition model, the remaining steps can be trained directly with unannotated audio samples, which greatly saves annotation time and devotes the resources that annotation would consume to the training of the audio recognition model itself.
According to the embodiment provided by the application, when the difference between the predicted audio recognition result output by the audio recognition model after the previous training and the actual audio recognition result in the second training sample set does not meet the convergence condition, a group of audio samples to be used in the next training and a group of predicted audio recognition results corresponding to the group of audio samples are obtained, wherein the group of audio samples to be used are not marked with the corresponding actual audio recognition result, and the group of predicted audio recognition results are the predicted audio recognition results determined by the audio recognition model after the previous training according to the audio features of the group of audio samples; merging a group of audio samples to be used and a group of corresponding predicted audio recognition results into a second training sample set to obtain a third training sample set; and performing next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training, thereby realizing the effect of improving the training efficiency of the audio recognition model.
As an optional scheme, obtaining a set of audio samples to be used in a next round of training and a set of predicted audio recognition results corresponding to the set of audio samples includes:
s1, inputting the audio features of the fourth group of audio samples into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio characteristics of the fourth group of audio samples into an uncertainty analysis model to obtain a second group of uncertainty analysis results, wherein the second group of uncertainty analysis results are used for representing the credibility of the third group of predicted audio identification results;
and S2, according to the second group of uncertainty analysis results, screening out a fourth group of predicted audio recognition results with the credibility meeting the preset conditions from the third group of predicted audio recognition results, and screening out a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results from the fourth group of audio samples.
According to the embodiment provided by the application, the audio characteristics of the fourth group of audio samples are input into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio characteristics of the fourth group of audio samples into an uncertainty analysis model to obtain a second group of uncertainty analysis results, wherein the second group of uncertainty analysis results are used for representing the credibility of the third group of predicted audio identification results; and according to the second group of uncertainty analysis results, a fourth group of predicted audio recognition results with the credibility meeting the preset conditions are screened out from the third group of predicted audio recognition results, and a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results are screened out from the fourth group of audio samples, so that the effect of improving the training integrity of the audio recognition model is realized.
As an optional solution, the method further includes:
and when the difference between the predicted audio recognition result output by the audio recognition model after the next round of training and the actual audio recognition result in the third training sample set meets the convergence condition, finishing the training of the initial audio recognition model to obtain a target audio recognition model, wherein a group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
Optionally, in this embodiment, when the convergence condition is reached, the training of the initial audio recognition model is ended, so as to obtain a trained target audio recognition model.
By the embodiment provided by the application, when the difference between the predicted audio recognition result output by the audio recognition model after the next round of training and the actual audio recognition result in the third training sample set meets the convergence condition, the training of the initial audio recognition model is finished to obtain the target audio recognition model, wherein a group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results, and the effect of improving the training integrity of the audio recognition model is realized.
As an optional scheme, according to the first set of uncertainty analysis results, a second set of predicted audio recognition results with reliability meeting a preset condition is screened out from the first set of predicted audio recognition results, and the method includes:
S1, when the first group of uncertainty analysis results comprise a group of uncertainty scores, sorting the group of uncertainty scores from small to large to obtain an uncertainty score sequence, wherein the higher the uncertainty score is, the lower the credibility of the corresponding predicted audio recognition result is;
S2, acquiring the first N uncertainty scores in the uncertainty score sequence, wherein the uncertainty score sequence comprises M uncertainty scores, and N is less than M;
S3, screening out, from the first group of predicted audio recognition results, a second group of predicted audio recognition results corresponding to the first N uncertainty scores.
Optionally, in this embodiment, the uncertainty score may be used as one of the screening manners: the predicted audio recognition results are sorted by uncertainty score, and the second group of predicted audio recognition results is formed by taking the first N results, or by taking the results whose uncertainty score is less than or equal to a target threshold.
According to the embodiment provided by the application, when the first group of uncertainty analysis results comprise a group of uncertainty scores, the group of uncertainty scores are sorted from small to large to obtain an uncertainty score sequence, wherein the higher the uncertainty score is, the lower the credibility of the corresponding predicted audio recognition result is; the first N uncertainty scores are acquired from the uncertainty score sequence, wherein the uncertainty score sequence comprises M uncertainty scores, and N < M; and a second group of predicted audio recognition results corresponding to the first N uncertainty scores are screened out from the first group of predicted audio recognition results, thereby realizing the effect of improving the screening efficiency of the audio recognition results.
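The sort-and-screen steps above can be sketched as a small helper. This is an illustrative sketch only (function and variable names are assumptions, not the patent's), showing both the top-N variant and the threshold variant:

```python
def screen_by_uncertainty(predictions, scores, top_n=None, threshold=None):
    # Pair each predicted audio recognition result with its uncertainty
    # score; a lower score means higher credibility.
    ranked = sorted(zip(scores, predictions), key=lambda pair: pair[0])
    if top_n is not None:
        ranked = ranked[:top_n]  # keep the first N of the M results, N < M
    elif threshold is not None:
        # Threshold variant: keep results whose uncertainty is low enough.
        ranked = [pair for pair in ranked if pair[0] <= threshold]
    return [pred for _, pred in ranked]

preds = ["res_a", "res_b", "res_c", "res_d"]
scores = [0.9, 0.1, 0.4, 0.7]
print(screen_by_uncertainty(preds, scores, top_n=2))  # ['res_b', 'res_c']
```

The screened predictions can then be paired back with their audio samples (the "third group of audio samples") for the next round of training.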
As an optional solution, the method further comprises:
S1, acquiring the input target audio in the target application;
S2, obtaining a target audio recognition result determined by the target audio recognition model according to the audio features of the target audio, wherein the target audio recognition model is an audio recognition model obtained by performing multiple rounds of training on the initial audio recognition model until a preset convergence condition is met;
S3, displaying the target audio recognition result in the target application.
The obtaining of the input target audio in the target application may include, but is not limited to: under the condition that the reference text is displayed in the target application, acquiring target audio generated by reading the reference text aloud in the target application, or acquiring target audio generated by replying the reference text;
displaying the target audio recognition result in the target application may include, but is not limited to: and displaying the evaluation score of the target audio determined by the target audio recognition model in the target application.
Optionally, in this embodiment, each reference text may, but is not limited to, correspond to one or more reference audios, and in a scene of spoken language evaluation, an evaluation score of a target audio may be determined, but is not limited to, by comparing similarity between the acquired target audio and the reference audio.
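The patent does not specify how the similarity comparison between the target audio and the reference audio is computed. Purely as an illustration, the comparison could use cosine similarity over pooled feature vectors, mapped to an evaluation score; all names below are hypothetical assumptions, not the patent's method:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def evaluation_score(target_features, reference_features, max_score=100):
    # Map similarity in [-1, 1] to a score in [0, max_score].
    sim = cosine_similarity(target_features, reference_features)
    return round(max_score * max(0.0, sim))

# Identical pooled features give the full score.
print(evaluation_score([0.2, 0.5, 0.1], [0.2, 0.5, 0.1]))  # prints 100
```

When a reference text corresponds to several reference audios, the score could, for example, be taken against the closest one.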
To further illustrate, optionally, as shown in fig. 5, a reference text "Who are you?" and a prompt message "please read the text aloud" are displayed on the target application interface; further, as shown in (a) of fig. 5, a touch operation on the virtual button "start reading" is recognized, an audio signal in a target time period is collected and input as target audio into the target audio recognition model, and the target audio recognition model outputs a corresponding recognition result according to the target audio; the presentation of the recognition process in the foreground may be, but is not limited to, as shown in (b) of fig. 5. Further, after the output result of the target audio recognition model is obtained, the output result is converted into an evaluation result; the evaluation result shown in (c) of fig. 5 is an evaluation score of "85".
To further illustrate, optionally, as shown in fig. 6, a reference text "How are you?" and a prompt message "please answer the above text" are displayed on the target application interface 602; further, as shown in (a) of fig. 6, a touch operation on the virtual button "start answer" is recognized, an audio signal within the target time period is collected and input into the target audio recognition model as the target audio, and the target audio recognition model outputs a corresponding recognition result according to the target audio; the presentation of the recognition process in the foreground may be, but is not limited to, as shown in (b) of fig. 6. After the output result of the target audio recognition model is obtained, the output result is converted into an evaluation result; the evaluation result shown in (c) of fig. 6 is the recognized answer text, and an evaluation score (not shown in the figure) can be given based on whether the answer text is correct and whether the pronunciation is standard.
According to the embodiment provided by the application, under the condition that the reference text is displayed in the target application, the target audio generated by reading the reference text aloud is obtained in the target application, or the target audio generated by replying the reference text is obtained; and the evaluation score of the target audio determined by the target audio recognition model is displayed in the target application, so that the effect of improving the accuracy of audio evaluation is realized.
As an optional implementation manner, as shown in fig. 7, the audio recognition method includes:
S702, acquiring input target audio in a target application;
S704, obtaining a target audio recognition result determined by a target audio recognition model according to the audio features of the target audio, wherein the target audio recognition model is obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is met; the initial audio recognition model is obtained by training an audio recognition model to be trained by using a first training sample set, the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining predicted audio recognition results according to input audio features; in each round of training, the audio recognition model obtained by the previous round of training is trained by using a training sample set corresponding to the current round, the training sample set corresponding to each round comprises the training sample set used in the previous round of training and a training sample set obtained by the current round of screening, and the training sample set obtained by the current round of screening comprises a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, wherein the group of audio samples are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples;
S706, displaying the target audio recognition result in the target application.
Optionally, in this embodiment, the audio recognition method may be, but is not limited to, applied in an automatic spoken language evaluation scenario, for example, the audio recognition method is used to evaluate the spoken language of the audio input by the user, so that the user can clearly know the spoken language level of the user.
Optionally, in this embodiment, the first group of audio samples and the second group of audio samples may both be, but are not limited to being, initially unlabeled audio samples, and the first group of actual audio recognition results are a group of audio recognition results obtained by labeling the first group of audio samples.
Optionally, in this embodiment, the initial audio recognition model may be, but is not limited to, an audio recognition model obtained by training with audio samples with few labels, and the initial audio recognition model may be, but is not limited to, a semi-finished audio recognition model with a basic function and a training effect that does not meet a convergence condition.
Optionally, in this embodiment, the uncertainty analysis model may be, but is not limited to, a model that implements an uncertainty estimation method, wherein the uncertainty estimation method may include, but is not limited to, at least one of: Gaussian process regression, the Monte Carlo dropout method, a deep mixture density network, and the like. Gaussian process regression adopts the variance as the measure of uncertainty: the larger the variance, the larger the uncertainty. The Monte Carlo dropout method integrates multiple stochastic forward passes of the model to analyze uncertainty; it assumes that for uncertain data the outputs of the passes are diverse [8], and the more diverse the outputs, the greater the uncertainty. The deep mixture density network is similar to Gaussian process modeling in that the mean and variance of the results are modeled [9]; this method also uses the variance as the measure of uncertainty, and the larger the variance, the greater the uncertainty.
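As a toy illustration of the Monte Carlo dropout idea described above (run the same input through the model several times with dropout left active and use the output variance as the uncertainty score), with a stand-in model rather than the patent's network:

```python
import random
import statistics

def mc_dropout_uncertainty(model, x, n_passes=30, seed=0):
    # Run `n_passes` stochastic forward passes and return the
    # mean prediction and its variance (the uncertainty score).
    rng = random.Random(seed)
    outputs = [model(x, rng) for _ in range(n_passes)]
    return statistics.mean(outputs), statistics.pvariance(outputs)

def toy_model(x, rng, p_drop=0.5):
    # Stand-in for a network with dropout: each of two "units"
    # is independently dropped with probability p_drop.
    w = [0.8, 0.4]
    kept = [wi * x for wi in w if rng.random() > p_drop]
    return sum(kept) / (1.0 - p_drop)  # inverted-dropout rescaling

mean_pred, variance = mc_dropout_uncertainty(toy_model, 1.0)
```

A higher returned variance would mark the corresponding prediction as less credible in the screening step.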
According to the embodiment provided by the application, the input target audio is obtained in the target application; a target audio recognition result determined according to the audio features of the target audio is obtained through a target audio recognition model, wherein the target audio recognition model is obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is met; the initial audio recognition model is obtained by training an audio recognition model to be trained by using a first training sample set, the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining predicted audio recognition results according to input audio features; in each round of training, the audio recognition model obtained by the previous round of training is trained by using a training sample set corresponding to the current round, the training sample set corresponding to each round comprises the training sample set used in the previous round of training and a training sample set obtained by the current round of screening, and the training sample set obtained by the current round of screening comprises a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, wherein the group of audio samples are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples; and the target audio recognition result is displayed in the target application. In this way, an audio recognition model meeting the convergence condition can be obtained quickly for audio recognition through a model training mode that does not require a large number of labeled audio samples, thereby realizing the technical effect of improving the efficiency of audio recognition.
As an alternative, before the target audio input is acquired in the target application, the method includes:
S1, training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining predicted audio recognition results according to input audio features;
S2, inputting the audio features of the second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results are used for representing the credibility of the first group of predicted audio recognition results;
S3, according to the first group of uncertainty analysis results, screening out a second group of predicted audio recognition results with the credibility meeting a preset condition from the first group of predicted audio recognition results, and screening out a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples;
S4, performing a current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until the preset convergence condition is met.
It should be noted that, an initial audio recognition model is obtained based on training of a first group of audio samples with labels, and initial audio recognition is performed on a second group of audio samples without labels to obtain a first group of predicted audio recognition results; carrying out uncertainty analysis on the unmarked second group of audio samples by using the trained uncertainty analysis model, and screening the first group of predicted audio recognition results by using the analysis results to obtain a second group of predicted audio recognition results; and then, obtaining a third group of audio samples corresponding to the second group of predicted audio recognition results in the second group of audio samples, performing iterative training on the initial audio recognition model by using the third group of audio samples until a preset convergence condition is met, and obtaining the trained audio recognition model.
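The loop just described is a standard self-training (pseudo-labeling) scheme. A schematic single round, with the training, prediction, and uncertainty functions injected by the caller (all names below are hypothetical, not the patent's), might look like:

```python
def self_training_round(model, train_fn, predict_fn, uncertainty_fn,
                        labeled_set, unlabeled_samples, keep_n):
    # 1. Predict pseudo-labels for the unlabeled samples.
    preds = [predict_fn(model, s) for s in unlabeled_samples]
    # 2. Score each prediction; a lower score means it is more reliable.
    scores = [uncertainty_fn(model, s) for s in unlabeled_samples]
    # 3. Keep the keep_n most reliable (lowest-uncertainty) samples.
    order = sorted(range(len(scores)), key=lambda i: scores[i])[:keep_n]
    pseudo = [(unlabeled_samples[i], preds[i]) for i in order]
    # 4. Treat the surviving pseudo-labels as ground truth and retrain.
    new_set = labeled_set + pseudo
    return train_fn(model, new_set), new_set

# Toy check: uncertainty = |sample|, so small-magnitude samples survive.
new_model, new_set = self_training_round(
    "m0",
    train_fn=lambda m, data: m + "+",   # pretend retraining
    predict_fn=lambda m, s: s * 2,      # pretend pseudo-labeling
    uncertainty_fn=lambda m, s: abs(s),
    labeled_set=[(0, 0)], unlabeled_samples=[5, -1, 3], keep_n=2)
```

Repeating this round on the growing training set, as the patent describes, continues until the convergence condition is met.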
To further illustrate, optionally, for example, as shown in fig. 3, an audio recognition model 304 to be trained is first trained by using a first group of audio samples 302 with labels, so as to obtain an initial audio recognition model 306; inputting the audio features of the second set of audio samples 308 into the initial audio recognition model 306 to obtain a first set of predicted audio recognition results 312; inputting the audio features of the second set of audio samples 308 into the uncertainty analysis model 310 to obtain a first set of uncertainty analysis results 314, wherein the first set of uncertainty analysis results 314 is used to represent the confidence level of the first set of predicted audio recognition results 312; according to the first group of uncertainty analysis results 314, a second group of predicted audio recognition results 316 with the credibility meeting a preset condition are screened out from the first group of predicted audio recognition results 312, and a third group of audio samples 318 corresponding to the second group of predicted audio recognition results 316 are screened out from the second group of audio samples 308; performing a current round of training on the initial audio recognition model 306 according to the third group of audio samples 318 and the second group of predicted audio recognition results 316, wherein the initial audio recognition model 306 is set to undergo multiple rounds of training until a preset convergence condition is met;
In addition, optionally, after the trained audio recognition model 320 is acquired, the input audio 322 in a spoken language evaluation scene (e.g., audio of reading the reference text aloud) is acquired, the input audio 322 is input into the audio recognition model 320, and the audio recognition result 324 (e.g., a spoken language evaluation score) is determined according to the output result of the audio recognition model 320.
For further example, an optional training process of the audio recognition model is shown in fig. 4, and the specific steps are as follows:
step S402, training an audio recognition model to be trained by using a sample with a label to obtain an initial audio recognition model;
Step S404, using unlabeled samples and the uncertainty analysis model, performing initial training (round 0) and iterative training (rounds 1 to n) on the initial audio recognition model. The initial training comprises screening the samples of the current round by using the uncertainty analysis model; after the screening is finished, the obtained samples are the samples used by the first round of the iterative training. In the first round of training, besides training the initial audio recognition model with the samples obtained in the initial training, the uncertainty analysis model is used for screening again; after the screening is finished, the obtained samples are the samples used by the second round of the iterative training. In summary, in the iterative training, except that the samples used in the first round are the samples obtained in the initial training, the samples used in each remaining round are the samples screened out in the previous round, as shown in the multi-round training diagram on the right of fig. 4;
and step S406, obtaining the trained audio recognition model through multiple rounds of training until a preset convergence condition is met.
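The multi-round procedure of steps S402–S406 can be sketched as a driver loop that repeats rounds until convergence. The concrete criterion used here (the change in evaluation error falling below a tolerance) is an assumption for illustration, since the patent only speaks of a "preset convergence condition":

```python
def train_until_converged(model, run_round, evaluate, tol=1e-3, max_rounds=50):
    # Repeat rounds of training until the evaluation error stops
    # improving by more than `tol`, or max_rounds is reached.
    prev_err = float("inf")
    for rounds in range(1, max_rounds + 1):
        model = run_round(model)   # one round of (re)training
        err = evaluate(model)      # e.g. error on a held-out set
        if prev_err - err <= tol:  # improvement too small: converged
            break
        prev_err = err
    return model, rounds

# Toy check: the "model" is a number whose error halves every round.
final_model, n_rounds = train_until_converged(1.0, lambda m: m / 2, lambda m: m)
```

Here `run_round` would wrap one screening-plus-retraining step of the kind shown in step S404.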
According to the embodiment provided by the application, a first training sample set is used for training an audio recognition model to be trained to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining predicted audio recognition results according to input audio features; the audio features of the second group of audio samples are input into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; the audio features of the second group of audio samples are input into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results are used for representing the credibility of the first group of predicted audio recognition results; according to the first group of uncertainty analysis results, a second group of predicted audio recognition results with the credibility meeting a preset condition are screened out from the first group of predicted audio recognition results, and a third group of audio samples corresponding to the second group of predicted audio recognition results are screened out from the second group of audio samples; and the initial audio recognition model is subjected to a current round of training according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until a preset convergence condition is met. The training of the audio recognition model can thus be completed without ensuring that all audio samples are labeled, so that the purpose of reducing the dependence of the audio recognition model on labeled samples is achieved, and the technical effect of improving the training efficiency of the audio recognition model is realized.
As an alternative, performing a current round of training on the initial audio recognition model according to the third set of audio samples and the second set of predicted audio recognition results includes:
S1, merging the third group of audio samples and the second group of predicted audio recognition results into the first training sample set to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results;
S2, performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training.
Optionally, in this embodiment, the third group of audio samples obtained through the screening may be merged into the first training sample set as labeled audio samples to jointly train the initial audio recognition model. Therefore, even if manpower, material resources, or time are limited and a large number of labeled audio samples cannot be obtained, the training of the audio recognition model can still be completed.
According to the embodiment provided by the application, a third group of audio samples and a second group of predicted audio recognition results are merged into a first training sample set to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results; and performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training, thereby realizing the effect of improving the training efficiency of the audio recognition model.
As an optional solution, the method further comprises:
when the difference between the predicted audio recognition result output by the audio recognition model after the previous training and the actual audio recognition result in the second training sample set does not meet the convergence condition, obtaining a group of audio samples to be used in the next training and a group of predicted audio recognition results corresponding to the group of audio samples, wherein the group of audio samples to be used are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are predicted audio recognition results determined by the audio recognition model after the previous training according to the audio features of the group of audio samples;
merging a group of audio samples to be used and a group of corresponding predicted audio recognition results into a second training sample set to obtain a third training sample set;
and performing next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training.
Optionally, in this embodiment, each round of training may, but is not limited to, use new audio samples, and the new audio samples may be, but are not limited to being, unannotated. Based on this, in the training process of the audio recognition model, except that a small number of annotated audio samples are needed at first to construct the initial audio recognition model, the remaining steps can be trained directly with unannotated audio samples, which greatly saves the annotation time of the audio samples and saves the resources that annotation would otherwise consume for the training of the audio recognition model.
According to the embodiment provided by the application, when the difference between the predicted audio recognition result output by the audio recognition model after the previous training and the actual audio recognition result in the second training sample set does not meet the convergence condition, a group of audio samples to be used in the next training and a group of predicted audio recognition results corresponding to the group of audio samples are obtained, wherein the group of audio samples to be used are not marked with the corresponding actual audio recognition result, and the group of predicted audio recognition results are the predicted audio recognition results determined by the audio recognition model after the previous training according to the audio features of the group of audio samples; merging a group of audio samples to be used and a group of corresponding predicted audio recognition results into a second training sample set to obtain a third training sample set; and performing next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training, thereby realizing the effect of improving the training efficiency of the audio recognition model.
As an optional scheme, obtaining a set of audio samples to be used in a next round of training and a set of predicted audio recognition results corresponding to the set of audio samples includes:
inputting the audio features of the fourth group of audio samples into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio features of the fourth group of audio samples into an uncertainty analysis model to obtain a second group of uncertainty analysis results, wherein the second group of uncertainty analysis results are used for representing the credibility of the third group of predicted audio recognition results;
and according to the second group of uncertainty analysis results, screening a fourth group of predicted audio recognition results with the credibility meeting the preset conditions from the third group of predicted audio recognition results, and screening a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results from the fourth group of audio samples.
According to the embodiment provided by the application, the audio features of the fourth group of audio samples are input into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; the audio features of the fourth group of audio samples are input into an uncertainty analysis model to obtain a second group of uncertainty analysis results, wherein the second group of uncertainty analysis results are used for representing the credibility of the third group of predicted audio recognition results; and according to the second group of uncertainty analysis results, a fourth group of predicted audio recognition results with the credibility meeting the preset condition are screened out from the third group of predicted audio recognition results, and a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results are screened out from the fourth group of audio samples, so that the effect of improving the training integrity of the audio recognition model is realized.
As an optional scheme, the method further comprises:
and when the difference between the predicted audio recognition result output by the audio recognition model after the next round of training and the actual audio recognition result in the third training sample set meets the convergence condition, finishing the training of the initial audio recognition model to obtain a target audio recognition model, wherein a group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
Optionally, in this embodiment, when the convergence condition is reached, the training of the initial audio recognition model is ended, so as to obtain a trained target audio recognition model.
By the embodiment provided by the application, when the difference between the predicted audio recognition result output by the audio recognition model after the next round of training and the actual audio recognition result in the third training sample set meets the convergence condition, the training of the initial audio recognition model is finished to obtain the target audio recognition model, wherein a group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results, and the effect of improving the training integrity of the audio recognition model is realized.
As an optional scheme, according to the first set of uncertainty analysis results, a second set of predicted audio recognition results with reliability meeting a preset condition is screened out from the first set of predicted audio recognition results, and the method includes:
S1, when the first group of uncertainty analysis results comprise a group of uncertainty scores, sorting the group of uncertainty scores from small to large to obtain an uncertainty score sequence, wherein the higher the uncertainty score is, the lower the credibility of the corresponding predicted audio recognition result is;
S2, acquiring the first N uncertainty scores in the uncertainty score sequence, wherein the uncertainty score sequence comprises M uncertainty scores, and N is less than M;
S3, screening out, from the first group of predicted audio recognition results, a second group of predicted audio recognition results corresponding to the first N uncertainty scores.
Optionally, in this embodiment, the uncertainty score may be used as one of the screening manners: the audio recognition results are sorted according to the uncertainty score, and the second group of predicted audio recognition results is formed by taking the first N results (those with the lowest uncertainty scores) or, alternatively, the results whose uncertainty scores are less than or equal to a target threshold.
According to the embodiment provided by the application, when the first group of uncertainty analysis results includes a group of uncertainty scores, the group of uncertainty scores is sorted from small to large to obtain an uncertainty score sequence, where a higher uncertainty score indicates lower credibility of the corresponding predicted audio recognition result; the first N uncertainty scores are acquired from the uncertainty score sequence, where the sequence includes M uncertainty scores and N < M; and the second group of predicted audio recognition results corresponding to the first N uncertainty scores is screened out from the first group of predicted audio recognition results, thereby improving the screening efficiency of the audio recognition results.
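Steps s1 to s3 can be sketched as follows (a hedged illustration; the function name and the toy result/score lists are assumptions, not the patented implementation):

```python
def screen_by_uncertainty(predicted_results, uncertainty_scores, n):
    """Keep the n predictions whose uncertainty is lowest (most credible).

    Sorts the scores ascending (a higher uncertainty score means lower
    credibility) and returns the predictions paired with the first n scores.
    """
    order = sorted(range(len(uncertainty_scores)),
                   key=lambda i: uncertainty_scores[i])
    return [predicted_results[i] for i in order[:n]]

preds  = ["r0", "r1", "r2", "r3", "r4"]         # M = 5 predicted results
scores = [0.9, 0.1, 0.5, 0.05, 0.7]             # one uncertainty score each
print(screen_by_uncertainty(preds, scores, 3))  # -> ['r3', 'r1', 'r2']
```

The threshold variant described above would replace the `order[:n]` slice with a filter such as `score <= target_threshold`.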
As an alternative, the obtaining of the input target audio in the target application includes: under the condition that the reference text is displayed in the target application, acquiring target audio generated by reading the reference text aloud in the target application, or acquiring target audio generated by replying the reference text;
displaying the target audio recognition result in the target application, comprising: and displaying the evaluation score of the target audio determined by the target audio recognition model in the target application.
Optionally, in this embodiment, each reference text may, but is not limited to, correspond to one or more reference audios, and in a scene of spoken language evaluation, an evaluation score of a target audio may be determined, but is not limited to, by comparing similarity between the acquired target audio and the reference audio.
To further illustrate, optionally, as shown in fig. 5, a reference text "Who are you?" and a prompt message "please read the text message aloud" are displayed; as shown in (a) of fig. 5, a touch operation on the virtual button "start reading" is recognized, and an audio signal in a target time period is collected and input as the target audio into the target audio recognition model, so that the target audio recognition model outputs a corresponding recognition result according to the target audio; the presentation of the recognition process in the foreground may be, but is not limited to, as shown in (b) of fig. 5. Further, after the output result of the target audio recognition model is obtained, the output result is converted into an evaluation result; the evaluation result shown in (c) of fig. 5 is an evaluation score of "85".
To further illustrate, optionally, as shown in fig. 6, a reference text "How are you?" and a prompt message "please answer the above text" are displayed on the target application interface 602; as shown in (a) of fig. 6, a touch operation on the virtual button "start answer" is recognized, and an audio signal in the target time period is collected and input as the target audio into the target audio recognition model, so that the target audio recognition model outputs a corresponding recognition result according to the target audio; the presentation of the recognition process in the foreground may be, but is not limited to, as shown in (b) of fig. 6. After the output result of the target audio recognition model is obtained, the output result is converted into an evaluation result; the evaluation result shown in (c) of fig. 6 is the recognized answer text, and an evaluation score (not shown in the figure) can be given based on whether the answer text is correct and whether the pronunciation is standard.
According to the embodiment provided by the application, under the condition that the reference text is displayed in the target application, the target audio generated by reading the reference text aloud is obtained in the target application, or the target audio generated by replying the reference text is obtained; and the evaluation score of the target audio determined by the target audio recognition model is displayed in the target application, so that the effect of improving the accuracy of audio evaluation is realized.
As an alternative, for ease of understanding, the training method of the audio recognition model and the audio recognition method are described in an automatic spoken language evaluation scenario. The automatic spoken language evaluation scenario may include, but is not limited to, scenarios applied in a spoken language test, for example, objective question types such as reading aloud, and subjective question types such as describing a picture or spoken composition. The details are as follows:
Automatic spoken language evaluation often depends on a large amount of manually labeled data, and its effect is hard to guarantee when such data is scarce. Therefore, an algorithm that performs spoken language evaluation training with a semi-supervised pseudo-label approach is provided, which effectively relieves the demand for labeled data. First, a spoken language evaluation model is trained on a small amount of labeled spoken language evaluation data. The remaining unlabeled spoken audio is then scored by this model. Because the predicted scores contain incorrectly labeled or noisy data, an uncertainty analysis algorithm is used to obtain uncertainty parameters. The predicted data are screened based on the uncertainty analysis results, and the screened data are used to expand the training set. The spoken language evaluation model is then retrained on the expanded training set, improving the spoken language evaluation effect.
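The pseudo-label training loop described above can be sketched as a toy round (the "model", the uncertainty proxy, and the thresholds are stand-in assumptions for illustration, not the patented components):

```python
def train_model(samples):
    """Toy 'model': predicts around the mean score of its training set
    (a stand-in for the real spoken-language scorer)."""
    mean = sum(score for _, score in samples) / len(samples)
    return lambda audio_feat: mean + 0.1 * audio_feat

def pseudo_label_round(labeled, unlabeled, t1=0.0, t2=1.0):
    """One pseudo-label round: train on labeled data, score the unlabeled
    audio, keep predictions whose (proxy) uncertainty falls in (t1, t2),
    and return the expanded training set."""
    model = train_model(labeled)
    pseudo = []
    for feat in unlabeled:
        score = model(feat)
        uncertainty = abs(feat) * 0.1  # stand-in for the uncertainty model
        if t1 < uncertainty < t2:
            pseudo.append((feat, score))
    return labeled + pseudo

labeled = [(1, 3.0), (2, 4.0)]            # (audio feature, manual score)
expanded = pseudo_label_round(labeled, [5, 20])
print(expanded)   # the feature-5 audio is kept as a pseudo sample
```

A real system would repeat such rounds, merging in new pseudo samples and retraining until the convergence condition is met.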
Specifically, as shown in fig. 8 (a), the start follow-reading button 802 is clicked to start reading the sentence aloud; as shown in fig. 8 (b), the finish follow-reading button 804 is clicked to finish. As shown in fig. 9, the screen returns an evaluation result 902 to the user; for example, the sentence receives 4 stars in the evaluation result.
As another example, as shown in fig. 10, the start recording button 1002 is clicked to start answering a question, and the end recording button 1004 is clicked to finish answering. As shown in fig. 11, the screen returns an evaluation result 1102 to the user; for example, the answer receives 4 stars in the evaluation result.
In addition, the overall flow can be referred to as shown in fig. 12, and the specific steps are as follows:
s1202, the user opens the application 1202, and the question is displayed on the screen; the user clicks start recording in the application 1202 to answer the question;
s1204, the application 1202 sends the audio and the reading text to the server 1204;
s1206, the server 1204 sends the audio and the title information to the spoken language evaluation model 1206 based on the pseudo label;
s1208, the spoken language evaluation model 1206 returns the scoring result to the server 1204;
s1210, the server 1204 returns the final score to the application 1202 side, and the user views the final score at the application 1202 side.
Further, a spoken language evaluation model of pseudo tags combined with uncertainty analysis can be seen with reference to fig. 13, which includes the following steps:
First, the audio is input into an ASR (Automatic Speech Recognition) system to obtain the recognized text and the start and end times of each phoneme and each word in the audio. The audio, the alignment result, and the recognized text are input into a feature extraction module to extract acoustic features and text features. The features are input into the trained base model to predict the spoken score, and simultaneously into a score uncertainty analysis module to obtain an uncertainty analysis result. Finally, the predicted scores and the uncertainty analysis results are input into a pseudo-sample screening module to screen out usable pseudo samples. The pseudo sample set D_u (10 topics, 10 audios, and 10 model-predicted scores) is merged with the training data D_L of the base model (10 topics, 10 audios, and 10 manual scores) as shown in equation (1) below, and the spoken language evaluation model is retrained. This process can be repeated many times, continuously merging in new pseudo-label samples and retraining the spoken language evaluation model until convergence.
D′_L = D_u ∪ D_L (1)
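Equation (1) is a plain set union of the screened pseudo samples D_u and the base training data D_L. A small sketch (the tuple layout and the choice to prefer the manual score on overlap are assumptions for illustration):

```python
def union_training_sets(d_l, d_u):
    """D'_L = D_u ∪ D_L: merge the base training data and the screened
    pseudo samples, keeping the manual score when both sets contain the
    same (topic, audio) pair."""
    merged = {(t, a): s for t, a, s in d_u}        # pseudo scores first...
    merged.update({(t, a): s for t, a, s in d_l})  # ...manual scores win
    return [(t, a, s) for (t, a), s in merged.items()]

d_l = [("q1", "a1.wav", 5.0)]                       # manually scored
d_u = [("q1", "a1.wav", 3.0), ("q2", "a2.wav", 4.0)]  # model-predicted
print(union_training_sets(d_l, d_u))
```

The retraining step then simply fits the evaluation model on the merged list.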
The text features extracted based on the ASR recognition mainly comprise semantic features, pragmatic features, keyword features, and text disfluency features. The keyword features mainly involve extracting the keywords of the standard answer and the keywords of the answer content and computing precision, recall, and the like. The pragmatic features include the word diversity and sentence-pattern diversity of the answer content and the grammatical accuracy of the answer content analyzed with a language model. The semantic features include topic features of the answer content, tf-idf features, and the like. The text disfluency features are proportion statistics of disfluent components in the recognized text;
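As one hedged example of the keyword features named above (the function name and token sets are illustrative), precision and recall of the answer's keywords against the standard answer's keywords can be computed as:

```python
def keyword_features(answer_keywords, reference_keywords):
    """Keyword features from the text: precision and recall of the answer's
    keywords against the standard answer's keywords."""
    answer, reference = set(answer_keywords), set(reference_keywords)
    hits = answer & reference
    precision = len(hits) / len(answer) if answer else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall

p, r = keyword_features(["cat", "dog", "sun"], ["cat", "dog", "moon", "tree"])
print(p, r)   # 2 of 3 answer keywords hit; 2 of 4 reference keywords covered
```

Semantic and pragmatic features (tf-idf, diversity, language-model grammar scores) would be produced by analogous per-answer extractors.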
The acoustic features mainly fall into pronunciation accuracy, pronunciation fluency, pronunciation rhythm, and the like. Pronunciation accuracy is evaluated based on the confidence of speech recognition at the phoneme, word, and sentence levels in the corresponding pronunciation content. Pronunciation fluency includes speech-rate features during pronunciation and duration-statistics features, such as the average duration of voiced segments and the average pause duration between voiced segments. Pronunciation rhythm includes evaluation of rhythmic sense, correctness of word stress within sentences, sentence-boundary intonation, and the like;
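The duration-based fluency statistics mentioned above might be computed as in this sketch (the feature names and exact definitions are assumptions; a real system would derive segment and pause durations from the ASR alignment):

```python
def fluency_features(segment_durations, pause_durations, word_count):
    """Duration-statistics fluency features: speech rate (words per second of
    voiced speech), mean voiced-segment duration, mean inter-segment pause."""
    total_speech = sum(segment_durations)
    return {
        "speech_rate": word_count / total_speech if total_speech else 0.0,
        "mean_segment_dur": (total_speech / len(segment_durations)
                             if segment_durations else 0.0),
        "mean_pause_dur": (sum(pause_durations) / len(pause_durations)
                           if pause_durations else 0.0),
    }

feats = fluency_features([1.0, 2.0, 1.0], [0.5, 0.3], word_count=12)
print(feats)
```

Pronunciation-accuracy features would come from the recognizer's phoneme/word confidences rather than from durations.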
A regression model is constructed based on the extracted acoustic features and text features and fitted to the manual scores. The regression model can be a traditional regressor such as KNN, SVR, or a GBT tree model, or a deep neural network model whose final score is obtained through forward propagation of a multilayer network;
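Any of the listed regressors would serve; as a self-contained illustration, here is a minimal KNN regressor over the extracted feature vectors (the feature vectors and scores are made-up toy data, not from the patent):

```python
def knn_regress(train_feats, train_scores, query, k=3):
    """Minimal KNN regressor: the predicted score is the mean manual score of
    the k nearest training samples under Euclidean distance. A stand-in for
    the KNN/SVR/GBT/DNN options listed above."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(range(len(train_feats)),
                     key=lambda i: dist(train_feats[i], query))[:k]
    return sum(train_scores[i] for i in nearest) / k

feats  = [[0.9, 0.8], [0.85, 0.9], [0.2, 0.1], [0.3, 0.2]]  # acoustic+text
scores = [4.5, 5.0, 1.0, 1.5]                               # manual scores
print(knn_regress(feats, scores, [0.88, 0.85], k=2))        # -> 4.75
```

A production system would instead fit an SVR/GBT or train the multilayer network by backpropagation on the same feature vectors.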
An uncertainty analysis model is constructed based on the extracted text features and acoustic features. Various uncertainty methods currently exist; typical ones include Gaussian process regression, the Monte Carlo dropout method, deep mixture density networks, and the like. Gaussian process regression adopts the variance as the measure of uncertainty: the larger the variance, the larger the uncertainty. The Monte Carlo dropout method analyzes model uncertainty with multiple models, assuming that for uncertain data the outputs of the individual models are diverse [8]; the more diverse the outputs, the greater the uncertainty. The deep mixture density network, similar to Gaussian process modeling, models the mean and variance of the result; this method also uses the variance as the uncertainty measure, with larger variance meaning larger uncertainty;
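A sketch of the Monte Carlo dropout idea (the toy network and dropout rate are illustrative assumptions): the same input is passed through the network repeatedly with dropout left active, and the variance of the outputs serves as the uncertainty score.

```python
import random

def mc_dropout_uncertainty(forward_pass, features, passes=50, seed=0):
    """Monte Carlo dropout sketch: run one input through the network many
    times with dropout on; the variance of the outputs is the uncertainty
    score (larger variance -> larger uncertainty)."""
    rng = random.Random(seed)
    outputs = [forward_pass(features, rng) for _ in range(passes)]
    mean = sum(outputs) / passes
    return sum((o - mean) ** 2 for o in outputs) / passes  # variance

def noisy_net(feats, rng, p_drop=0.5):
    # Toy one-layer 'net' whose inputs are randomly dropped, as dropout would.
    kept = [f for f in feats if rng.random() > p_drop]
    return sum(kept)

var = mc_dropout_uncertainty(noisy_net, [1.0, 2.0, 0.5], passes=200)
print(var)   # stochastic outputs give a nonzero variance
```

In a deep mixture density network the mean and variance would instead be direct outputs of the network, with the same variance-as-uncertainty reading.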
Whether a pseudo sample is adopted for model retraining is determined based on the score output by the spoken language evaluation model and the uncertainty score output by the uncertainty analysis module. Suppose the predicted score of the i-th speech is p_i and its uncertainty score is C_i; whether the sample is finally selected, denoted R_i, is expressed as equation (2) below, where T1 and T2 are preset thresholds representing the minimum and maximum uncertainty of selected samples (T1 < T2), which may be determined manually or by a search algorithm;
R_i = I[C_i > T1 & C_i < T2] (2)
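Equation (2) translates directly into a filter over the pseudo samples (the threshold values and scores below are illustrative, not from the patent's experiments):

```python
def select_pseudo_samples(pred_scores, uncertainty_scores, t1, t2):
    """Equation (2): R_i = I[C_i > T1 and C_i < T2]. Keep the i-th pseudo
    sample only when its uncertainty score lies strictly between the two
    preset thresholds."""
    return [(p, c) for p, c in zip(pred_scores, uncertainty_scores)
            if t1 < c < t2]

preds = [4.2, 3.1, 4.8, 2.5]
uncs  = [0.02, 0.30, 0.90, 0.45]
print(select_pseudo_samples(preds, uncs, t1=0.1, t2=0.6))
# the 0.02-uncertainty and 0.90-uncertainty samples fall outside (T1, T2)
```

T1 and T2 would be tuned manually or by a search over held-out evaluation accuracy, as the text notes.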
In addition, in this embodiment, two data sets of answer data from spoken question types in a spoken language test may also be adopted, but the invention is not limited thereto. Each data set contains 1500 items in total, labeled by three experts. The final measurement mainly uses the Pearson correlation coefficient and the accuracy (i.e., the proportion of samples whose predicted score differs from the label by no more than 1). The results show that, on top of the pseudo-label algorithm, combining the uncertainty analysis results can greatly improve the effect of the spoken language evaluation model.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiment of the present invention, there is also provided an audio recognition model training apparatus for implementing the above audio recognition model training method. As shown in fig. 14, the apparatus includes:
a first training unit 1402, configured to train an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, where the first training sample set includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used to determine predicted audio recognition results according to input audio features;
a first input unit 1404, configured to input audio features of a second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, where the second group of audio samples are not labeled with corresponding actual audio recognition results; inputting the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results are used for representing the credibility of the first group of predicted audio identification results;
the first screening unit 1406 is configured to screen, according to the first group of uncertainty analysis results, a second group of predicted audio recognition results of which the reliability satisfies a preset condition from the first group of predicted audio recognition results, and screen, from the second group of audio samples, a third group of audio samples corresponding to the second group of predicted audio recognition results;
a second training unit 1408, configured to perform a current training round on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, where the initial audio recognition model is set to undergo multiple training rounds until a preset convergence condition is met.
Optionally, in this embodiment, the training device of the audio recognition model may be, but is not limited to, applied in an automatic spoken language evaluation scenario, for example, an audio recognition model capable of recognizing spoken language audio is trained by the training device of the audio recognition model, the audio input by the user is evaluated and recognized, and the output of the audio recognition model is displayed as an evaluation result, so that the user can clearly know his/her spoken language level.
Optionally, in this embodiment, the first group of audio samples and the second group of audio samples may be, but are not limited to, audio samples that are initially unlabeled, and the first group of actual audio recognition results is a group of recognition results obtained by labeling the first group of audio samples.
Optionally, in this embodiment, the initial audio recognition model may be, but is not limited to, an audio recognition model obtained by training with audio samples with few labels, and the initial audio recognition model may be, but is not limited to, a semi-finished audio recognition model with a basic function and a training effect that does not meet a convergence condition.
Optionally, in this embodiment, the uncertainty analysis model may be, but is not limited to, a model that can automatically execute an uncertainty analysis method, where the method may include, but is not limited to, at least one of: Gaussian process regression, Monte Carlo dropout, a deep mixture density network, and the like. Gaussian process regression adopts the variance as the measure of uncertainty: the larger the variance, the larger the uncertainty. Monte Carlo dropout analyzes model uncertainty with multiple models, assuming that for uncertain data the outputs of the individual models are diverse [8]; the more diverse the outputs, the greater the uncertainty. The deep mixture density network, similar to Gaussian process modeling, models the mean and variance of the result [9]; it also uses the variance as the uncertainty measure, with larger variance meaning larger uncertainty.
It should be noted that, an initial audio recognition model is obtained based on training of a first group of audio samples with labels, and initial audio recognition is performed on a second group of audio samples without labels to obtain a first group of predicted audio recognition results; carrying out uncertainty analysis on the unmarked second group of audio samples by using the trained uncertainty analysis model, and screening the first group of predicted audio recognition results by using the analysis results to obtain a second group of predicted audio recognition results; and then, obtaining a third group of audio samples corresponding to the second group of predicted audio recognition results in the second group of audio samples, performing iterative training on the initial audio recognition model by using the third group of audio samples until a preset convergence condition is met, and obtaining the trained audio recognition model.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
According to the embodiment provided by the application, a first training sample set is used to train an audio recognition model to be trained to obtain an initial audio recognition model, where the first training sample set includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used to determine predicted audio recognition results according to input audio features; the audio features of a second group of audio samples are input into the initial audio recognition model to obtain a first group of predicted audio recognition results, where the second group of audio samples is not labeled with corresponding actual audio recognition results; the audio features of the second group of audio samples are input into an uncertainty analysis model to obtain a first group of uncertainty analysis results, where the first group of uncertainty analysis results is used to represent the credibility of the first group of predicted audio recognition results; according to the first group of uncertainty analysis results, a second group of predicted audio recognition results whose credibility satisfies a preset condition is screened out from the first group of predicted audio recognition results, and a third group of audio samples corresponding to the second group of predicted audio recognition results is screened out from the second group of audio samples; and the current round of training is performed on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, where the initial audio recognition model is set to undergo multiple rounds of training until a preset convergence condition is met. Training of the audio recognition model can thus be completed without requiring that all audio samples be labeled, thereby achieving the purpose of reducing the dependence of the audio recognition model on labeled samples and the technical effect of improving the training efficiency of the audio recognition model.
As an alternative, the second training unit 1408 includes:
the first merging module is used for merging the third group of audio samples and the second group of predicted audio recognition results into the first training sample set to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results;
and the first training module is used for performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an optional scheme, the apparatus further comprises:
the first obtaining module is used for obtaining a group of audio samples to be used in the next round of training and a group of predicted audio recognition results corresponding to the group of audio samples when the difference between the predicted audio recognition result output by the audio recognition model after the current round of training and the actual audio recognition result in the second training sample set does not meet the convergence condition, wherein the group of audio samples to be used are not marked with the corresponding actual audio recognition result, and the group of predicted audio recognition results are predicted audio recognition results determined by the audio recognition model after the current round of training according to the audio features of the group of audio samples;
the second merging module is used for merging the group of audio samples to be used and the corresponding group of predicted audio recognition results into a second training sample set to obtain a third training sample set;
and the second training module is used for carrying out next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an optional scheme, the first obtaining module includes:
the input submodule is used for inputting the audio characteristics of the fourth group of audio samples into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio characteristics of the fourth group of audio samples into an uncertainty analysis model to obtain a second group of uncertainty analysis results, wherein the second group of uncertainty analysis results are used for representing the credibility of the third group of predicted audio identification results;
and the screening submodule is used for screening a fourth group of predicted audio recognition results with the reliability meeting the preset conditions from the third group of predicted audio recognition results according to the second group of uncertainty analysis results, and screening a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results from the fourth group of audio samples.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an alternative, the apparatus further includes:
and the second obtaining module is used for finishing the training of the initial audio recognition model when the difference between the predicted audio recognition result output by the audio recognition model after the next round of training and the actual audio recognition result in the third training sample set meets the convergence condition to obtain the target audio recognition model, wherein a group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an alternative, the first screening unit 1406 includes:
the third obtaining module is used for sequencing a group of uncertainty scores from small to large according to the scores when the first group of uncertainty analysis results comprise a group of uncertainty scores to obtain an uncertainty score sequence, wherein the higher the uncertainty score is, the lower the credibility of the corresponding prediction audio recognition result is;
a fourth obtaining module, configured to obtain the first N sorted uncertainty scores from an uncertainty score sequence, where the uncertainty score sequence includes M uncertainty scores, and N < M;
and the screening module is used for screening out, from the first group of predicted audio recognition results, the second group of predicted audio recognition results corresponding to the first N uncertainty scores.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an optional scheme, the apparatus further comprises:
a first acquisition unit configured to acquire an input target audio in a target application;
the second acquisition unit is used for acquiring a target audio recognition result determined by a target audio recognition model according to the audio features of the target audio, wherein the target audio recognition model is an audio recognition model obtained by performing multiple rounds of training on the initial audio recognition model until a preset convergence condition is met;
and the first display unit is used for displaying the target audio recognition result in the target application.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an alternative, the method comprises the following steps:
a first acquisition unit comprising: the target audio module is used for acquiring target audio generated by reading the reference text aloud in the target application or acquiring target audio generated by replying the reference text under the condition that the reference text is displayed in the target application;
a first display unit comprising: and the first score module is used for displaying the evaluation score of the target audio determined by the target audio recognition model in the target application.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
According to another aspect of the embodiment of the present invention, there is also provided an audio recognition apparatus for implementing the audio recognition method. As shown in fig. 15, the apparatus includes:
a third acquisition unit 1502 for acquiring an input target audio in a target application;
a fourth obtaining unit 1504, configured to obtain a target audio recognition result determined by a target audio recognition model according to the audio features of the target audio, where the target audio recognition model is an audio recognition model obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is met, the initial audio recognition model is a model obtained by training an audio recognition model to be trained with a first training sample set, the first training sample set includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, the initial audio recognition model is used to determine predicted audio recognition results according to input audio features, in each round of training the audio recognition model obtained by the previous round of training is trained with the training sample set corresponding to that round, the training sample set corresponding to each round includes the training sample set obtained by the previous round of training and a training sample set obtained by the current round of screening, the training sample set obtained by the current round of screening includes a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, the group of audio samples is not labeled with corresponding actual audio recognition results, and the group of predicted audio recognition results is determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples;
and a third display unit 1506 for displaying the target audio recognition result in the target application.
Optionally, in this embodiment, the audio recognition apparatus may be, but is not limited to, applied in an automatic spoken language evaluation scenario; for example, with the audio recognition method, the audio input by the user is given a spoken language evaluation, so that the user can clearly know his or her spoken language level.
Optionally, in this embodiment, the first group of audio samples and the second group of audio samples may be, but are not limited to, audio samples that are initially unlabeled, and the first group of actual audio recognition results is a group of recognition results obtained by labeling the first group of audio samples.
Optionally, in this embodiment, the initial audio recognition model may be, but is not limited to, an audio recognition model obtained by training with audio samples with few labels, and the initial audio recognition model may be, but is not limited to, a semi-finished audio recognition model with a basic function and a training effect that does not meet a convergence condition.
Optionally, in this embodiment, the uncertainty analysis model may be, but is not limited to, a model that automatically performs an uncertainty method, where the uncertainty method may include, but is not limited to, at least one of: Gaussian process regression, the Monte Carlo dropout method, a deep mixture density network, and the like. Gaussian process regression adopts the variance as the measure of uncertainty: the larger the variance, the greater the uncertainty. The Monte Carlo dropout method analyzes model uncertainty by integrating multiple models; it assumes that for uncertain data the outputs of the individual models are diverse, and the more diverse the outputs, the greater the uncertainty. The deep mixture density network is similar to Gaussian process modeling in that it models the mean and variance of the results; it likewise uses the variance as the measure of uncertainty, with a larger variance indicating greater uncertainty.
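As a concrete illustration of the Monte Carlo dropout idea above, the following sketch runs several stochastic forward passes and reads the per-sample variance as the uncertainty score. Here `predict_fn`, the number of passes, and the variance readout are illustrative assumptions rather than details from this application:

```python
import numpy as np

def mc_dropout_uncertainty(predict_fn, features, n_passes=20):
    """Estimate per-sample uncertainty via Monte Carlo dropout.

    predict_fn is assumed to be a model forward pass with dropout
    kept active at inference time, mapping an array of audio
    feature vectors of shape (num_samples, feat_dim) to one score
    per sample. Each stochastic pass acts as one member of an
    implicit ensemble; diverse outputs mean higher uncertainty.
    """
    preds = np.stack([predict_fn(features) for _ in range(n_passes)])
    # Mean is the aggregated prediction; variance is the uncertainty
    # score (the larger the variance, the greater the uncertainty).
    return preds.mean(axis=0), preds.var(axis=0)
```

With a deterministic `predict_fn` the variance collapses to zero, which corresponds to a fully credible prediction in the terminology above.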
According to the embodiment provided by this application, the input target audio is acquired in the target application; a target audio recognition result determined according to the audio features of the target audio is obtained through the target audio recognition model; and the target audio recognition result is displayed in the target application. The target audio recognition model is obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is met, where the initial audio recognition model is obtained by training an audio recognition model to be trained with a first training sample set that includes a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and is used for determining a predicted audio recognition result according to input audio features. In each round of training, the audio recognition model obtained by the previous round of training is trained with the training sample set corresponding to the current round, which includes the training sample set obtained by the previous round of training and a training sample set obtained by the current round of screening; the screened set includes a group of audio samples that are not labeled with corresponding actual audio recognition results, together with the predicted audio recognition results determined for them by the audio recognition model after the previous round of training according to their audio features. In this way, an audio recognition model meeting the convergence condition can be obtained quickly through a model training mode that does not require a large number of labeled audio samples, thereby achieving the technical effect of improving the efficiency of audio recognition.
As an optional solution, the apparatus further includes:
the third training unit is used for training the audio recognition model to be trained by using the first training sample set before the input target audio is acquired in the target application, so as to obtain the initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining a predicted audio recognition result according to the input audio features;
the second input unit is used for inputting the audio features of a second group of audio samples to the initial audio recognition model before the input target audio is acquired in the target application, so as to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; and inputting the audio features of the second group of audio samples into the uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results are used for representing the credibility of the first group of predicted audio recognition results;
the second screening unit is used for screening a second group of predicted audio recognition results with the credibility meeting the preset conditions from the first group of predicted audio recognition results according to the first group of uncertainty analysis results before the input target audio is acquired in the target application, and screening a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples;
and the fourth training unit is used for performing the current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results before the input target audio is acquired in the target application, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until a preset convergence condition is met.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an optional solution, the apparatus further includes:
the first merging unit is used for merging a third group of audio samples and a second group of predicted audio recognition results into a first training sample set before an input target audio is obtained in a target application to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results;
and the fifth training unit is used for performing the current round of training on the initial audio recognition model by using the second training sample set before the input target audio is acquired in the target application, so as to obtain the audio recognition model after the current round of training.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
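The behaviour of the first merging unit and the fifth training unit can be outlined as one round of pseudo-labeled training. In this sketch, `model.fit`, the `(features, label)` pair layout, and the function name are assumptions made for illustration only:

```python
def run_training_round(model, labeled_set, screened_samples, screened_preds):
    """One round of the semi-supervised loop described above.

    labeled_set is a list of (features, label) pairs; the screened
    predicted recognition results are treated as actual labels for
    the screened (unlabeled) audio samples, and the merged set is
    used for the current round of training.
    """
    # Merge the screened samples with their pseudo-labels into the
    # existing training sample set (first set -> second set).
    merged = labeled_set + list(zip(screened_samples, screened_preds))
    model.fit(merged)  # current round of training on the merged set
    return merged, model
```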
As an optional solution, the apparatus further includes:
a fifth obtaining unit, configured to obtain a group of audio samples to be used in a next round of training and a group of predicted audio recognition results corresponding to the group of audio samples when a difference between a predicted audio recognition result output by the audio recognition model after the current round of training and an actual audio recognition result in the second training sample set does not satisfy a convergence condition, where the group of audio samples to be used is not labeled with a corresponding actual audio recognition result, and the group of predicted audio recognition results are predicted audio recognition results determined by the audio recognition model after the current round of training according to audio features of the group of audio samples;
the second merging unit is used for merging the group of audio samples to be used and the corresponding group of predicted audio recognition results into a second training sample set to obtain a third training sample set;
and the sixth training unit is used for performing the next round of training on the audio recognition model after the current round of training by using the third training sample set, so as to obtain the audio recognition model after the next round of training.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an optional solution, the apparatus further includes:
the third input unit is used for inputting the audio features of a fourth group of audio samples to the audio recognition model after the current round of training before the input target audio is acquired in the target application, so as to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; and inputting the audio features of the fourth group of audio samples into the uncertainty analysis model to obtain a second group of uncertainty analysis results, wherein the second group of uncertainty analysis results are used for representing the credibility of the third group of predicted audio recognition results;
and the third screening unit is used for screening out a fourth group of predicted audio recognition results with the credibility meeting a preset condition from the third group of predicted audio recognition results according to the second group of uncertainty analysis results before the input target audio is acquired in the target application, and screening out a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results from the fourth group of audio samples.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
As an optional solution, the apparatus further includes:
and a sixth obtaining unit, configured to, before obtaining the input target audio in the target application, end training of the initial audio recognition model when a difference between a predicted audio recognition result output by the audio recognition model after the next round of training and an actual audio recognition result in the third training sample set satisfies a convergence condition, so as to obtain the target audio recognition model, where a group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
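The convergence condition used by the sixth obtaining unit can be illustrated roughly as follows. The mean-absolute-difference measure and the tolerance value are assumptions, since the application only requires some preset condition on the difference between predicted and actual recognition results:

```python
def has_converged(predicted, actual, tolerance=0.05):
    """Return True when the average absolute difference between the
    model's predicted recognition results (e.g. evaluation scores)
    and the actual results in the current training sample set falls
    within the tolerance, at which point training ends."""
    diff = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
    return diff <= tolerance
```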
As an optional solution, the apparatus further includes:
a sorting unit, configured to sort, before the input target audio is acquired in the target application, a group of uncertainty scores in ascending order to obtain an uncertainty score sequence, wherein the first group of uncertainty analysis results comprises the group of uncertainty scores, and a higher uncertainty score indicates lower credibility of the corresponding predicted audio recognition result;
a seventh obtaining unit, configured to obtain, before obtaining an input target audio in a target application, top N sorted uncertainty scores in an uncertainty score sequence, where the uncertainty score sequence includes M uncertainty scores, and N < M;
and the fourth screening unit is used for screening out, from the first group of predicted audio recognition results, a second group of predicted audio recognition results corresponding to the top N uncertainty scores in the sequence, before the input target audio is acquired in the target application.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
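Together, the sorting unit, seventh obtaining unit, and fourth screening unit amount to a sort-and-select step, sketched below. The names and the list-based layout are illustrative, and a lower uncertainty score is taken to mean higher credibility, as stated above:

```python
def screen_by_uncertainty(predicted_results, uncertainty_scores, n):
    """Sort M uncertainty scores in ascending order and return the
    predicted audio recognition results for the top n (n < M)
    positions, i.e. the n most credible predictions, along with
    their original indices."""
    order = sorted(range(len(uncertainty_scores)),
                   key=lambda i: uncertainty_scores[i])
    selected = order[:n]  # the n samples the model is most certain about
    return [predicted_results[i] for i in selected], selected
```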
As an optional scheme, the apparatus further comprises:
a third acquisition unit comprising: a second audio module, configured to acquire, in a case where a reference text is displayed in the target application, target audio generated by reading the reference text aloud in the target application, or target audio generated by replying to the reference text;
a third display unit comprising: and the second score module is used for displaying the evaluation score of the target audio determined by the target audio recognition model in the target application.
For a specific embodiment, reference may be made to the example shown in the above method for training an audio recognition model, which is not described herein again in this example.
According to yet another aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the method for training an audio recognition model, as shown in fig. 16, the electronic device includes a memory 1602 and a processor 1604, the memory 1602 stores therein a computer program, and the processor 1604 is configured to execute the steps in any one of the method embodiments through the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining a predicted audio recognition result according to input audio features;
S2, inputting the audio features of a second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; and inputting the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results are used for representing the credibility of the first group of predicted audio recognition results;
S3, according to the first group of uncertainty analysis results, screening out a second group of predicted audio recognition results with the credibility meeting a preset condition from the first group of predicted audio recognition results, and screening out a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples;
and S4, performing the current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until a preset convergence condition is met. Alternatively:
S1, acquiring the input target audio in the target application;
S2, obtaining a target audio recognition result determined by a target audio recognition model according to the audio features of the target audio, wherein the target audio recognition model is obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is met; the initial audio recognition model is obtained by training an audio recognition model to be trained by using a first training sample set, the first training sample set comprising a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples; the initial audio recognition model is used for determining a predicted audio recognition result according to the input audio features; in each round of training, the audio recognition model obtained by the previous round of training is trained by using the training sample set corresponding to the current round, the training sample set corresponding to each round comprising the training sample set obtained by the previous round of training and the training sample set obtained by the current round of screening, wherein the training sample set obtained by the current round of screening comprises a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, the group of audio samples are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are predicted audio recognition results determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples;
and S3, displaying the target audio recognition result in the target application.

Alternatively, it can be understood by those skilled in the art that the structure shown in fig. 16 is only an illustration, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, or a mobile Internet device (MID). Fig. 16 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., a network interface, etc.) than shown in fig. 16, or have a different configuration from that shown in fig. 16.
The memory 1602 may be configured to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for training an audio recognition model in the embodiments of the present invention, and the processor 1604 executes various functional applications and data processing by running the software programs and modules stored in the memory 1602, that is, implements the above-mentioned method for training an audio recognition model. The memory 1602 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memories, or other non-volatile solid-state memories. In some examples, the memory 1602 may further include memory located remotely from the processor 1604, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1602 may be used to store information such as the first group of audio samples, the second group of audio samples, and the audio recognition model. As an example, as shown in fig. 16, the memory 1602 may include, but is not limited to, the first training unit 1402, the input unit 1404, the first screening unit 1406, and the second training unit 1408 of the training apparatus for the audio recognition model. In addition, other module units in the training apparatus for the audio recognition model may also be included, but are not limited thereto, and are not described in detail in this example.
Optionally, the transmission device 1606 is configured to receive or transmit data via a network. Examples of the network may include wired networks and wireless networks. In one example, the transmission device 1606 includes a network interface controller (NIC), which can be connected to a router via a network cable so as to communicate with the Internet or a local area network. In another example, the transmission device 1606 is a radio frequency (RF) module, which is used for communicating with the Internet in a wireless manner.
In addition, the electronic device further includes: a display 1608 for displaying information such as the first set of audio samples, the second set of audio samples, and the audio recognition model; and a connection bus 1610 for connecting respective module components in the above-described electronic apparatus.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting a plurality of nodes through a network communication. The nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, and other electronic devices, may become a node in the blockchain system by joining the Peer-To-Peer network.
According to an aspect of the application, a computer program product or computer program is provided, including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the above method for training an audio recognition model or the above audio recognition method, where the computer program is arranged to perform the steps in any one of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining a predicted audio recognition result according to input audio features;
S2, inputting the audio features of a second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; and inputting the audio features of the second group of audio samples into an uncertainty analysis model to obtain a first group of uncertainty analysis results, wherein the first group of uncertainty analysis results are used for representing the credibility of the first group of predicted audio recognition results;
S3, according to the first group of uncertainty analysis results, screening out a second group of predicted audio recognition results with the credibility meeting a preset condition from the first group of predicted audio recognition results, and screening out a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples;
and S4, performing the current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until a preset convergence condition is met. Alternatively:
S1, acquiring the input target audio in the target application;
S2, obtaining a target audio recognition result determined by a target audio recognition model according to the audio features of the target audio, wherein the target audio recognition model is obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is met; the initial audio recognition model is obtained by training an audio recognition model to be trained by using a first training sample set, the first training sample set comprising a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples; the initial audio recognition model is used for determining a predicted audio recognition result according to the input audio features; in each round of training, the audio recognition model obtained by the previous round of training is trained by using the training sample set corresponding to the current round, the training sample set corresponding to each round comprising the training sample set obtained by the previous round of training and the training sample set obtained by the current round of screening, wherein the training sample set obtained by the current round of screening comprises a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, the group of audio samples are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are predicted audio recognition results determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples;
and S3, displaying the target audio recognition result in the target application.
Alternatively, in this embodiment, those skilled in the art can understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with a terminal device, and the program may be stored in a computer-readable storage medium, where the storage medium may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (15)

1. A method for training an audio recognition model, comprising:
training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining a predicted audio recognition result according to input audio features;
inputting the audio features of a second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio features of the second set of audio samples into an uncertainty analysis model to obtain a first set of uncertainty analysis results, wherein the first set of uncertainty analysis results is used for representing the credibility of the first set of predicted audio recognition results;
according to the first group of uncertainty analysis results, screening a second group of predicted audio recognition results with the credibility meeting a preset condition from the first group of predicted audio recognition results, and screening a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples;
and performing a current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until a preset convergence condition is met.
2. The method of claim 1, wherein the performing a current round of training on the initial audio recognition model according to the third set of audio samples and the second set of predicted audio recognition results comprises:
merging the third group of audio samples and the second group of predicted audio recognition results into the first training sample set to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results;
and performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training.
3. The method of claim 2, further comprising:
when the difference between the predicted audio recognition result output by the audio recognition model after the current round of training and the actual audio recognition result in the second training sample set does not meet the convergence condition, obtaining a group of audio samples to be used in the next round of training and a group of predicted audio recognition results corresponding to the group of audio samples, wherein the group of audio samples to be used are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are predicted audio recognition results determined by the audio recognition model after the current round of training according to the audio features of the group of audio samples;
merging the group of audio samples to be used and the corresponding group of predicted audio recognition results into the second training sample set to obtain a third training sample set;
and performing the next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training.
4. The method according to claim 3, wherein the obtaining a set of audio samples to be used in a next training round and a set of predicted audio recognition results corresponding to the set of audio samples comprises:
inputting the audio features of a fourth group of audio samples into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio features of the fourth set of audio samples into the uncertainty analysis model to obtain a second set of uncertainty analysis results, wherein the second set of uncertainty analysis results is used for representing the credibility of the third set of predicted audio recognition results;
and according to the second group of uncertainty analysis results, screening a fourth group of predicted audio recognition results with the reliability meeting a preset condition from the third group of predicted audio recognition results, and screening a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results from the fourth group of audio samples.
5. The method of claim 3, further comprising:
when the difference between the predicted audio recognition result output by the audio recognition model after the next round of training and the actual audio recognition result in the third training sample set meets the convergence condition, ending the training of the initial audio recognition model to obtain a target audio recognition model, wherein the group of predicted audio recognition results in the third training sample set is regarded as a group of actual audio recognition results.
6. The method according to any one of claims 1 to 5, wherein the step of screening out a second group of predicted audio recognition results with credibility meeting the preset condition from the first group of predicted audio recognition results according to the first group of uncertainty analysis results comprises:
when the first group of uncertainty analysis results comprises a group of uncertainty scores, sorting the group of uncertainty scores in ascending order to obtain an uncertainty score sequence, wherein a higher uncertainty score indicates a lower credibility of the corresponding predicted audio recognition result;
obtaining the top N uncertainty scores in the uncertainty score sequence, wherein the uncertainty score sequence comprises M uncertainty scores, N < M;
and screening out, from the first group of predicted audio recognition results, the second group of predicted audio recognition results corresponding to the top N uncertainty scores.
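The screening rule of claim 6 — sort the uncertainty scores in ascending order and keep the predictions behind the top N (most credible) scores — can be sketched as follows; the function name and data shapes are assumptions for illustration, not part of the claim:

```python
# Sketch of claim 6's top-N screening: lower uncertainty score means
# higher credibility, so keep the n lowest-scoring predictions.
def screen_by_uncertainty(predictions, uncertainty_scores, n):
    """Return (kept_predictions, kept_indices) for the n most credible
    predictions, where n < len(uncertainty_scores)."""
    # Indices ordered by ascending uncertainty score (the "sequence").
    order = sorted(range(len(uncertainty_scores)),
                   key=lambda i: uncertainty_scores[i])
    kept = order[:n]  # the top N scores of the ascending sequence
    return [predictions[i] for i in kept], kept
```

Returning the kept indices as well makes it easy to also screen out the corresponding audio samples, as the claims require.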
7. The method according to any one of claims 1 to 5, further comprising:
acquiring input target audio in a target application;
acquiring a target audio recognition result determined by a target audio recognition model according to the audio features of the target audio, wherein the target audio recognition model is an audio recognition model obtained by performing multiple rounds of training on the initial audio recognition model until the preset convergence condition is met;
and displaying the target audio recognition result in the target application.
8. The method of claim 7,
the acquiring of the input target audio in the target application comprises: under the condition that a reference text is displayed in the target application, acquiring the target audio generated by reading the reference text or acquiring the target audio generated by replying the reference text in the target application;
the displaying the target audio recognition result in the target application comprises: and displaying the evaluation score of the target audio determined by the target audio recognition model in the target application.
9. An audio recognition method, comprising:
acquiring input target audio in a target application;
obtaining a target audio recognition result determined according to the audio features of the target audio through a target audio recognition model, wherein the target audio recognition model is an audio recognition model obtained by performing multiple rounds of training on an initial audio recognition model until a preset convergence condition is met, the initial audio recognition model is a model obtained by training an audio recognition model to be trained by using a first training sample set, the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, the initial audio recognition model is used for determining a predicted audio recognition result according to input audio features, each round of training trains the audio recognition model obtained by the previous round of training by using a training sample set corresponding to that round, the training sample set corresponding to each round comprises the training sample set used in the previous round of training and a training sample set obtained by a current round of screening, and the training sample set obtained by the current round of screening comprises a group of audio samples and a group of predicted audio recognition results corresponding to the group of audio samples, wherein the group of audio samples is not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results is a group of predicted audio recognition results determined by the audio recognition model after the previous round of training according to the audio features of the group of audio samples;
and displaying the target audio recognition result in the target application.
10. The method of claim 9, wherein before the obtaining the input target audio in the target application, the method comprises:
training an audio recognition model to be trained by using a first training sample set to obtain an initial audio recognition model, wherein the first training sample set comprises a first group of audio samples and a first group of actual audio recognition results obtained by labeling the first group of audio samples, and the initial audio recognition model is used for determining prediction audio recognition according to input audio features;
inputting the audio features of a second group of audio samples into the initial audio recognition model to obtain a first group of predicted audio recognition results, wherein the second group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio features of the second set of audio samples into an uncertainty analysis model to obtain a first set of uncertainty analysis results, wherein the first set of uncertainty analysis results is used for representing the credibility of the first set of predicted audio recognition results;
according to the first group of uncertainty analysis results, screening a second group of predicted audio recognition results with the credibility meeting a preset condition from the first group of predicted audio recognition results, and screening a third group of audio samples corresponding to the second group of predicted audio recognition results from the second group of audio samples;
and performing a current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results, wherein the initial audio recognition model is set to be subjected to multiple rounds of training until a preset convergence condition is met.
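A single screening round of claim 10 — predict on unlabeled samples with the recognition model, score each sample with a separate uncertainty analysis model, and keep only sufficiently credible predictions — might look like the sketch below. A fixed threshold is one possible reading of "credibility meeting a preset condition" (claim 6 describes a top-N variant), and all names here are hypothetical:

```python
# Sketch of one screening round (claim 10): the recognizer labels each
# unlabeled sample, the uncertainty model scores it, and only samples
# whose uncertainty falls below a preset threshold are kept.
def one_screening_round(recognizer, uncertainty_model,
                        unlabeled_features, threshold):
    """Return (kept_samples, kept_predictions) for this round."""
    kept_samples, kept_predictions = [], []
    for x in unlabeled_features:
        pred = recognizer(x)            # first group of predicted results
        score = uncertainty_model(x)    # first group of uncertainty results
        if score < threshold:           # credibility meets the condition
            kept_samples.append(x)
            kept_predictions.append(pred)
    return kept_samples, kept_predictions
```

The kept pairs would then be merged into the first training sample set for the current round of training, as in claim 11.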
11. The method of claim 10, wherein performing the current round of training on the initial audio recognition model according to the third group of audio samples and the second group of predicted audio recognition results comprises:
merging the third group of audio samples and the second group of predicted audio recognition results into the first training sample set to obtain a second training sample set, wherein the second group of predicted audio recognition results in the second training sample set are regarded as a second group of actual audio recognition results;
and performing the current round of training on the initial audio recognition model by using the second training sample set to obtain the audio recognition model after the current round of training.
12. The method of claim 11, further comprising:
when the difference between the predicted audio recognition result output by the audio recognition model after the current round of training and the actual audio recognition result in the second training sample set does not meet the convergence condition, obtaining a group of audio samples to be used in the next round of training and a group of predicted audio recognition results corresponding to the group of audio samples, wherein the group of audio samples to be used are not marked with corresponding actual audio recognition results, and the group of predicted audio recognition results are predicted audio recognition results determined by the audio recognition model after the current round of training according to the audio features of the group of audio samples;
merging the group of audio samples to be used and the corresponding group of predicted audio recognition results into the second training sample set to obtain a third training sample set;
and performing the next round of training on the audio recognition model after the current round of training by using the third training sample set to obtain the audio recognition model after the next round of training.
13. The method according to claim 12, wherein obtaining the group of audio samples to be used in the next round of training and the group of predicted audio recognition results corresponding to the group of audio samples comprises:
inputting the audio features of a fourth group of audio samples into the audio recognition model after the current round of training to obtain a third group of predicted audio recognition results, wherein the fourth group of audio samples are not marked with corresponding actual audio recognition results; inputting the audio features of the fourth set of audio samples into the uncertainty analysis model to obtain a second set of uncertainty analysis results, wherein the second set of uncertainty analysis results is used for representing the credibility of the third set of predicted audio recognition results;
and according to the second group of uncertainty analysis results, screening a fourth group of predicted audio recognition results with the reliability meeting a preset condition from the third group of predicted audio recognition results, and screening a fifth group of audio samples corresponding to the fourth group of predicted audio recognition results from the fourth group of audio samples.
14. A computer-readable storage medium, comprising a stored program, wherein the program is operable to perform the method of any one of claims 1 to 11.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 11 by means of the computer program.
CN202110593500.0A 2021-05-28 2021-05-28 Training method and device of audio recognition model, storage medium and electronic equipment Pending CN113763934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110593500.0A CN113763934A (en) 2021-05-28 2021-05-28 Training method and device of audio recognition model, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110593500.0A CN113763934A (en) 2021-05-28 2021-05-28 Training method and device of audio recognition model, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113763934A true CN113763934A (en) 2021-12-07

Family

ID=78787263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110593500.0A Pending CN113763934A (en) 2021-05-28 2021-05-28 Training method and device of audio recognition model, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113763934A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030799A (en) * 2023-02-22 2023-04-28 深圳市友杰智新科技有限公司 Audio recognition model training method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN111444326B (en) Text data processing method, device, equipment and storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN110795543A (en) Unstructured data extraction method and device based on deep learning and storage medium
CN111949802B (en) Construction method, device and equipment of knowledge graph in medical field and storage medium
CN109271493A (en) A kind of language text processing method, device and storage medium
CN110377916B (en) Word prediction method, word prediction device, computer equipment and storage medium
CN111159407A (en) Method, apparatus, device and medium for training entity recognition and relation classification model
CN107729468A (en) Answer extracting method and system based on deep learning
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN113505198A (en) Keyword-driven generating type dialogue reply method and device and electronic equipment
CN111144097A (en) Modeling method and device for emotion tendency classification model of dialog text
CN113705191A (en) Method, device and equipment for generating sample statement and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN113342958A (en) Question-answer matching method, text matching model training method and related equipment
CN113486173A (en) Text labeling neural network model and labeling method thereof
CN114691864A (en) Text classification model training method and device and text classification method and device
CN114398909A (en) Question generation method, device, equipment and storage medium for dialogue training
CN113763934A (en) Training method and device of audio recognition model, storage medium and electronic equipment
CN110377706B (en) Search sentence mining method and device based on deep learning
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN116244277A (en) NLP (non-linear point) identification and knowledge base construction method and system
CN115130461A (en) Text matching method and device, electronic equipment and storage medium
CN113821610A (en) Information matching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination