CN111105786B - Multi-sampling-rate voice recognition method, device, system and storage medium - Google Patents


Info

Publication number
CN111105786B
CN111105786B (application CN201911363288.8A)
Authority
CN
China
Prior art keywords
audio
sampling rate
training
speech recognition
neural network
Prior art date
Legal status
Active
Application number
CN201911363288.8A
Other languages
Chinese (zh)
Other versions
CN111105786A (en)
Inventor
施雨豪
钱彦旻
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201911363288.8A
Publication of CN111105786A
Application granted
Publication of CN111105786B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks

Abstract

The invention discloses a multi-sampling-rate speech recognition method, device, system and storage medium. First, without changing the audio sampling rate, features are extracted from audio of different sampling rates using a configuration corresponding to each sampling rate, and a neural network model is trained on the extracted features. The neural network model carries both a general speech recognition label and a sampling-rate classification label; during training, adversarial training is applied to the sampling-rate classification label via gradient reversal, so that the trained multi-sampling-rate speech recognition model adapts to audio of different sampling rates. The model obtained by this training can then be used for speech recognition, achieving the goal of handling audio input at multiple sampling rates uniformly with a single speech recognition model.

Description

Multi-sampling-rate voice recognition method, device, system and storage medium
Technical Field
The invention relates to the field of artificial-intelligence voice interaction, and in particular to a multi-sampling-rate speech recognition method, device, system and storage medium.
Background
With the continuous development of artificial intelligence and electronic communication technology, intelligent voice interaction is increasingly applied across product areas, including intelligent customer service, call centers, smart speakers, smart watches and the like.
However, although the speech recognition task is the same, the audio sampling rate differs across application scenarios. If speech at multiple sampling rates must be processed in one system, one of the following schemes is typically adopted: 1) Unify the sampling rate of the audio through up-/down-sampling so that a single speech recognition system can be used. This approach may alter the properties of the original audio, reducing recognition accuracy. 2) Deploy multiple speech recognition systems and, after the results are output, select the most appropriate one by confidence or perplexity. This scheme suffers from low resource utilization and high operation and maintenance costs.
Disclosure of Invention
In view of the above problems, the inventors provide a multi-sampling-rate speech recognition method, apparatus, system and storage medium.
According to a first aspect of the embodiments of the present invention, a method for training a multi-sampling-rate speech recognition model includes: acquiring audio features of at least two different sampling rates; and training a neural network model with the audio features as input, wherein the audio features are marked with a speech recognition label and a sampling-rate classification label.
According to an embodiment of the present invention, acquiring audio features of at least two different sampling rates includes: receiving audio inputs of at least two different sampling rates; setting feature-extraction configuration information according to the sampling-rate class to which each audio input belongs; and performing feature extraction on the audio using the configuration information, obtaining audio features of at least two different sampling rates.
According to an embodiment of the present invention, training the neural network model includes: training the neural network model normally with respect to the speech recognition label, and adversarially with respect to the sampling-rate classification label.
According to an embodiment of the present invention, the adversarial training with respect to the sampling-rate classification label includes: performing adversarial training on the neural network model for the sampling-rate classification label according to a cross-entropy training criterion.
According to an embodiment of the present invention, performing the adversarial training includes: reversing the gradient and then propagating it backwards.
According to a second aspect of the embodiments of the present invention, a multi-sampling-rate speech recognition method includes: receiving audio features; and inputting the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling-rate speech recognition model is trained by any of the above training methods.
According to a third aspect of the embodiments of the present invention, an apparatus for training a multi-sampling-rate speech recognition model includes: an audio feature acquisition module for acquiring audio features of at least two different sampling rates; and a neural network model training module for training the neural network model with the audio features as input, wherein the audio features are marked with a speech recognition label and a sampling-rate classification label.
According to an embodiment of the present invention, the audio feature acquisition module includes: an audio input receiving unit for receiving audio inputs of at least two different sampling rates; a feature extraction configuration unit for setting feature-extraction configuration information according to the sampling-rate class to which the audio input belongs; and an audio feature extraction unit for extracting features from the audio using the configuration information, obtaining audio features of at least two different sampling rates.
According to an embodiment of the present invention, the neural network model training module includes: a speech recognition training unit for training the neural network model normally with respect to the speech recognition label; and a sampling-rate classification training unit for adversarially training the neural network model with respect to the sampling-rate classification label.
According to an embodiment of the present invention, the sampling-rate classification training unit is specifically configured to adversarially train the neural network model with respect to the sampling-rate classification labels according to a cross-entropy training criterion.
According to an embodiment of the present invention, the sampling-rate classification training unit includes: a gradient reversal subunit for performing the adversarial training by reversing the gradient and then propagating it backwards.
According to a fourth aspect of the embodiments of the present invention, there is provided a multi-sampling-rate speech recognition apparatus, including: an audio feature receiving module for receiving audio features; and a speech recognition module for inputting the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling-rate speech recognition model is trained by any of the above training methods.
According to a fifth aspect of the embodiments of the present invention, there is provided a multi-sampling-rate speech recognition system comprising a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform any of the methods above.
According to a sixth aspect of the embodiments of the present invention, there is provided a computer storage medium comprising a set of computer-executable instructions which, when executed, perform any of the methods above.
The embodiments of the invention provide a multi-sampling-rate speech recognition method, device, system and storage medium. First, without changing the audio sampling rate, features are extracted from audio of different sampling rates using a configuration corresponding to each sampling rate, and a neural network model is trained on the extracted features. The neural network model carries both a general speech recognition label and a sampling-rate classification label; during training, adversarial training is applied to the sampling-rate classification label via gradient reversal, so that the trained multi-sampling-rate speech recognition model adapts to audio of different sampling rates. The model so obtained can then be used for speech recognition, achieving the goal of handling audio input at multiple sampling rates uniformly with a single speech recognition model. In this way the properties of the original audio are preserved, and the training and maintenance costs of the speech recognition system are greatly reduced. In addition, audio data at different sampling rates can be fused with one another, further improving the diversity of the data and multiplying the amount of usable data.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic flow chart illustrating an implementation of a method for training a multi-sampling-rate speech recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific implementation flow of a training method for applying a multi-sampling rate speech recognition model according to the present invention;
FIG. 3 is a schematic flow chart illustrating an implementation of a multi-sampling rate speech recognition method according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating an embodiment of a method for speech recognition using multiple sampling rates according to the present invention;
FIG. 5 is a schematic diagram of a structure of a training apparatus for a multiple sampling rate speech recognition model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a multi-sampling rate speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
FIG. 1 shows a flow chart of an implementation of a method for training a multi-sampling-rate speech recognition model according to an embodiment of the present invention. Referring to fig. 1, the method includes: operation 110, acquiring audio features of at least two different sampling rates; and operation 120, training the neural network model with the audio features as input, wherein the audio features are marked with a speech recognition label and a sampling-rate classification label.
In operation 110, the audio features are the training data used to train the neural network model. They may be obtained by feature extraction from audio input, or acquired from an audio-feature supplier or an existing library. Note that, unlike data in actual applications, the training data has been processed by label annotation. A label is the content to be predicted, and its value is the expected value. By comparing the actual predicted value with the expected value, the model can be corrected continuously so that the error between them tends toward a minimum, yielding a model that is accurate enough for practical application. So that the trained speech recognition system can recognize audio data at multiple sampling rates, audio inputs of at least two different sampling rates are used when training the neural network model. Moreover, the training data preferably uses the same sampling rates as the audio to be recognized in practice, so that recognition accuracy on audio inputs of different sampling rates is higher and the application effect is better. With the training method for a multi-sampling-rate speech recognition model provided by this embodiment, no particular distribution of the amounts of data at the different sampling rates is required; the data may mix sampling rates in any proportion.
In operation 120, the neural network can be trained with the audio features obtained in operation 110. As mentioned above, the training data carries labels whose values are the expected values. By comparing the actual predicted value obtained in training with the expected value, the model is corrected continuously so that the error between them tends toward a minimum, until a model accurate enough for practical application is obtained; this is the process of training the neural network. Unlike other neural network models for speech recognition, the embodiment of the invention adds a sampling-rate classification label. This means the neural network model predicts not only the speech recognition result but also the sampling-rate class to which the audio input belongs. This label serves to fuse audio data of multiple sampling rates better, so that more data is available to the model; data from different sources also makes the data more diverse, and the predictions correspondingly more accurate.
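The dual-label setup described above can be pictured as a simple data structure: each training example carries both the expected speech recognition output and a sampling-rate class. The following is a minimal sketch; the class and field names are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    features: List[List[float]]  # frames of extracted audio features
    transcript: str              # speech recognition label (expected value)
    rate_class: int              # sampling-rate classification label

# Training data may mix sampling-rate classes in any proportion.
batch = [
    TrainingSample(features=[[0.1, 0.2]], transcript="hello", rate_class=1),
    TrainingSample(features=[[0.3, 0.4]], transcript="world", rate_class=2),
]
```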
According to an embodiment of the present invention, obtaining audio features of at least two different sampling rates includes: receiving audio inputs of at least two different sampling rates; setting configuration information extracted by the characteristic according to the sampling rate classification of the audio input; and performing feature extraction on the audio by using the configuration information to obtain audio features of at least two different sampling rates.
The audio input here may be raw audio collected by an audio-capture device, but is preferably audio that has undergone speech-signal processing: processed audio is clearer, easier to recognize, and yields a better training effect. These audio inputs may also be obtained from a data provider or an audio database. The sampling-rate classes are predetermined, and the class to which each audio input belongs is known; the class only needs to be specified through input parameters or configuration information. The sampling-rate classes are defined at modeling time, and the values specified here must be consistent with the values defined when the sampling-rate classification labels were established. For example, if the class of the first-sampling-rate audio is defined as 1 at modeling time, then the class specified when training with first-sampling-rate audio must be 1. The process and parameters for extracting features vary slightly across audio sampling rates. For this reason, some audio-processing parameters must be set per sampling rate so as to obtain the most representative features specific to that sampling rate; the values of these parameters constitute the configuration information for feature extraction. For example, feature extraction requires a highest-frequency parameter, which may be set to 4000 for audio sampled at 8 kHz and to 8000 for audio sampled at 16 kHz. The configuration used for a given sampling rate can be preset according to the audio properties of that rate, and in this operation the preset values simply need to be retrieved.
Any suitable method can be used to extract the audio features; the embodiment of the invention mainly adopts the widely used F-bank (filterbank) audio feature extraction method. The audio features extracted here include speech recognition features used for performing speech recognition, such as phonemes and words, and also include the sampling-rate classification value.
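The per-sampling-rate configuration lookup described above can be sketched as a table of presets, one per known sampling rate, with the highest frequency set at half the sampling rate as in the 8K/4000 and 16K/8000 example. The `FeatureConfig` class, its field names, and the class-label values are illustrative assumptions, not from the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureConfig:
    sample_rate: int  # audio sampling rate in Hz
    high_freq: int    # highest frequency used during feature extraction
    rate_class: int   # sampling-rate classification label value

# Preset configurations keyed by sampling rate; the highest frequency is
# half the sampling rate, matching the 8K -> 4000 / 16K -> 8000 example.
PRESETS = {
    8000:  FeatureConfig(sample_rate=8000,  high_freq=4000, rate_class=1),
    16000: FeatureConfig(sample_rate=16000, high_freq=8000, rate_class=2),
}

def config_for(sample_rate: int) -> FeatureConfig:
    """Return the preset feature-extraction config for a known sampling rate."""
    try:
        return PRESETS[sample_rate]
    except KeyError:
        raise ValueError(f"no feature-extraction preset for {sample_rate} Hz")
```

At training or recognition time, the configuration is obtained simply by looking up the preset, e.g. `config_for(8000)`.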
According to an embodiment of the present invention, the training of the neural network model includes: and normally training the neural network model according to the voice recognition label, and carrying out countermeasure training on the neural network model according to the sampling rate classification label.
Here, the normal training for the speech recognition label follows the same process as other speech recognition systems, generally based on the Connectionist Temporal Classification (CTC) criterion, the Maximum Mutual Information (MMI) criterion, or any other applicable criterion for neural networks. The distinguishing feature of the embodiment of the invention is the adversarial training of the neural network model with respect to the sampling-rate classification label. Adversarial training, which may also be called interference training, aims to make the model unable to distinguish the sampling-rate classes accurately, so that audio data at different sampling rates is fully utilized and predictions are based on more diverse data.
According to an embodiment of the present invention, the training of the neural network model against the sampling rate classification labels comprises: and carrying out countermeasure training on the neural network model according to the cross entropy training criterion and aiming at the sampling rate classification label.
Cross entropy measures the divergence between prior and posterior information and is widely used as a classification criterion; it is particularly suitable for adaptive implementation on a computer. By training the neural network model with a cross-entropy criterion on the sampling-rate classification labels, the resulting multi-sampling-rate speech recognition model can adapt automatically to data at multiple sampling rates.
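For one example, the cross-entropy criterion on the sampling-rate head reduces to the negative log-probability the model assigns to the true class. A minimal sketch, with the softmax applied to raw classifier scores; the function names are illustrative, not from the patent.

```python
import math

def softmax(logits):
    """Turn raw classifier scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, true_class):
    """Cross-entropy loss of a one-hot target: -log p(true class)."""
    probs = softmax(logits)
    return -math.log(probs[true_class])
```

With equal scores over three classes the loss is log 3; as the score of the true class grows, the loss falls toward zero.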
According to an embodiment of the present invention, wherein performing the confrontational training comprises: the countertraining is carried out by adopting a mode of reversing transmission after gradient reversal.
The gradient here can be understood simply as the error between the actual predicted value and the expected value, which serves as the basis for adjusting the neural network model parameters: the smaller the gradient, the more accurate the prediction is considered, and the more mature the model. When the neural network model predicts the sampling-rate class, a gradient-reversal operation is performed so that sampling-rate differences are ignored: the gradient is multiplied by minus one to obtain its opposite, and the negated gradient is propagated back to train the model. This is the interference employed by the adversarial training; it leaves the model unable to predict the sampling-rate class accurately, so audio cannot be treated differently because of its sampling rate. The contribution of data at different sampling rates is thereby greatly increased, and audio features of different sampling rates can be processed uniformly. This is the key point: through adversarial training the speech recognition system becomes agnostic to the sampling rate and achieves a good recognition effect.
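The reverse-then-propagate step described above follows the standard gradient-reversal-layer idea: identity in the forward pass, gradient multiplied by minus one (optionally scaled) in the backward pass. A minimal sketch under those assumptions; the class name and the `lam` scaling factor are illustrative, not from the patent.

```python
class GradientReversal:
    """Identity forward; multiplies the incoming gradient by -lam backward."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam  # scaling factor applied to the reversed gradient

    def forward(self, x):
        # The sampling-rate classifier sees the features unchanged.
        return x

    def backward(self, grad):
        # Reverse the gradient flowing back into the shared layers, pushing
        # them to confuse (not help) the sampling-rate classifier.
        return [-self.lam * g for g in grad]
```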
The following describes a specific flow of a method for training a multi-sampling-rate speech recognition model according to the present invention with reference to fig. 2. In the application scenario shown in fig. 2, the system receives audio data at three sampling rates: first-sampling-rate audio data, second-sampling-rate audio data, and third-sampling-rate audio data. When a multi-sampling-rate speech recognition model is trained for this system, the following steps are taken:
Step 210, receiving audio input;
the audio input here is training data with training labels and is training data of at least two different sampling rates.
Step 220, judging sampling rate classification;
the sampling-rate class to which the audio input belongs is known and can be obtained from a parameter or configuration information; here it is determined which class the audio input belongs to, and the next step is chosen accordingly. If the sampling rate of the audio input is the first sampling rate, proceed to step 230; if it is the second sampling rate, proceed to step 240; if it is the third sampling rate, proceed to step 250.
Step 230, extracting audio features at a first sampling rate;
the received audio input is first sample rate audio, and feature extraction is performed using a configuration used for the first sample rate audio. The configuration used by the first sampling rate audio can be preset according to the audio property of the first sampling rate audio, and in this step, the configuration is performed only by acquiring a preset value. After configuration, audio feature extraction can be carried out, and required audio features are obtained and input to the neural network model for training.
Step 240, extracting audio features at a second sampling rate;
the received audio input is second sample rate audio, and feature extraction is performed using a configuration used for the second sample rate audio. The configuration used by the second sampling rate audio can be preset according to the audio property of the second sampling rate audio, and in this step, the configuration is performed only by acquiring the preset value. After configuration, audio feature extraction can be carried out, and required audio features are obtained and input to the neural network model for training.
Step 250, extracting audio features at a third sampling rate;
the received audio input is a third sample rate audio, and feature extraction is performed using a configuration used for the third sample rate audio. The configuration used by the third sampling rate audio may be preset according to the audio property of the third sampling rate audio, and in this step, the configuration is performed only by obtaining the preset value. After configuration, audio feature extraction can be carried out, and required audio features are obtained and input to the neural network model for training.
Step 260, training a neural network model;
this step mainly trains the neural network with the audio features obtained above, adjusting the parameters of the neural network model according to the prediction results so that the prediction error keeps decreasing. In this embodiment of the invention, the gradient descent method is mainly used.
Step 270, carrying out gradient inversion and back transmission aiming at the sampling rate classification label;
as mentioned above, in order to fuse audio data of different sampling rates, the gradient for this label is reversed during training, following the adversarial learning approach.
Step 280, gradient back transmission is performed for the voice recognition tag.
For the speech recognition label, the gradient back-propagation is normal back-propagation; no reversal is performed.
It should be noted that training the neural network model is an iterative, looping process; the criterion for model maturity can be set according to the actual application requirements, and a multi-sampling-rate speech recognition model deemed mature is deployed to the actual production environment for speech recognition.
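Steps 260-280 can be sketched as one parameter update: the shared layers receive the normal gradient from the speech recognition head (step 280) plus the reversed gradient from the sampling-rate head (step 270), and gradient descent then adjusts the parameters (step 260). The function names and the `lam` factor are illustrative assumptions, not from the patent.

```python
def shared_gradient(asr_grad, rate_grad, lam=1.0):
    """Combine the normal ASR gradient with the reversed sampling-rate gradient."""
    # Step 280: asr_grad is propagated as-is; step 270: rate_grad is negated.
    return [ga - lam * gr for ga, gr in zip(asr_grad, rate_grad)]

def sgd_step(params, grad, lr=0.01):
    """Step 260: adjust model parameters by plain gradient descent."""
    return [p - lr * g for p, g in zip(params, grad)]
```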
After a mature multi-sampling-rate speech recognition model has been obtained by the above training method, the embodiment of the invention further provides a speech recognition method. As shown in fig. 3, the method includes: operation 310, receiving audio features; and operation 320, inputting the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling-rate speech recognition model is trained by any of the above training methods.
In operation 310, the audio features are extracted from new, unlabeled data in the actual production environment. The new data referred to here is audio data belonging to one of the sampling-rate classes used in the training data, e.g., first-sampling-rate audio data.
In operation 320, a trained multi-sample rate speech recognition model is used, and a prediction result, i.e., a result of speech recognition, is obtained directly from the input.
It should be noted that after prediction on newly input, unlabeled audio features in the actual production environment, no gradient back-propagation is performed; however, the predictions can be used to label the data, turning it into new training data for later semi-supervised learning.
Referring now to FIG. 4, a method for performing multi-sample rate speech recognition using a multi-sample rate speech recognition model trained using the steps shown in FIG. 2 according to an embodiment of the present invention will be described in detail. As shown in fig. 4, multi-sample rate speech recognition may be performed using the following steps:
step 410, receiving audio input;
the audio input here is audio data that is newly input in the production environment without a training label.
Step 420, judging sampling rate classification;
the sampling-rate class to which the audio input belongs is known information and can be obtained from a parameter or configuration information; here it is determined which class the audio input belongs to, and the next step is chosen accordingly. If the sampling rate of the audio input is the first sampling rate, proceed to step 430; if it is the second sampling rate, proceed to step 440; if it is the third sampling rate, proceed to step 450.
Step 430, extracting audio features at a first sampling rate;
the received audio input is first sample rate audio, and feature extraction is performed using a configuration used for the first sample rate audio. The configuration used by the first sampling rate audio can be preset according to the audio property of the first sampling rate audio, and in this step, the configuration is performed only by acquiring a preset value. After configuration, audio characteristic extraction can be carried out, required audio characteristics are obtained and input to a multi-sampling rate speech recognition model for speech recognition.
Step 440, extracting audio features at a second sampling rate;
the received audio input is second sample rate audio, and feature extraction is performed using a configuration used for the second sample rate audio. The configuration used by the second sampling rate audio can be preset according to the audio property of the second sampling rate audio, and in this step, the configuration is performed only by acquiring the preset value. After configuration, audio characteristic extraction can be carried out, required audio characteristics are obtained, and a multi-sampling rate speech recognition model is input for speech recognition.
Step 450, extracting audio features at a third sampling rate;
the received audio input is a third sample rate audio, and feature extraction is performed using a configuration used for the third sample rate audio. The configuration used by the third sampling rate audio may be preset according to the audio property of the third sampling rate audio, and in this step, the configuration is performed only by obtaining the preset value. After configuration, audio characteristic extraction can be carried out, required audio characteristics are obtained and input to a multi-sampling rate speech recognition model for speech recognition.
Step 460, performing speech recognition using the multi-sampling-rate speech recognition model;
The predictive model used here is the trained multi-sampling-rate speech recognition model, which can be used for speech recognition in a production environment.
Step 470, outputting the voice recognition result.
Here, prediction with the multi-sampling-rate speech recognition model produces a result from the audio features extracted in step 430, step 440, or step 450, and this prediction is the speech recognition result.
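The dispatch in steps 410 through 470 can be sketched as a lookup from the known sampling-rate class to a preset feature-extraction configuration, followed by a shared recognition model. The concrete rates (8 kHz / 16 kHz / 44.1 kHz) and the configuration fields below are illustrative assumptions, not taken from the patent:

```python
# Preset per-class feature-extraction configurations (hypothetical values);
# step 420 only needs to look up the class, since it is known a priori.
FEATURE_CONFIGS = {
    8000:  {"n_mels": 40, "fft_size": 256},   # first sampling rate
    16000: {"n_mels": 40, "fft_size": 512},   # second sampling rate
    44100: {"n_mels": 80, "fft_size": 1024},  # third sampling rate
}

def extract_features(audio, sample_rate):
    """Steps 430/440/450: pick the preset config for this rate class."""
    config = FEATURE_CONFIGS[sample_rate]
    # A real system would compute e.g. filterbank features here; this stub
    # just records which configuration was selected.
    return {"config": config, "n_frames": len(audio)}

def recognize(audio, sample_rate, model):
    """Steps 460-470: apply the shared multi-sampling-rate model."""
    features = extract_features(audio, sample_rate)
    return model(features)

# Toy model standing in for the trained multi-sampling-rate recognizer.
toy_model = lambda feats: "result(%d mels)" % feats["config"]["n_mels"]
out = recognize([0.0] * 100, 16000, toy_model)
```

The point of the table-lookup design is that all three branches feed one shared model; only the feature front-end differs per rate class.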
Further, an embodiment of the invention also provides a training apparatus for the multi-sampling-rate speech recognition model. As shown in fig. 5, the apparatus 50 includes: an audio feature obtaining module 501, configured to obtain audio features of at least two different sampling rates; and a neural network model training module 502, configured to train the neural network model with the audio features as input, where the audio features are labeled with a speech recognition label and a sampling rate classification label.
According to an embodiment of the present invention, the audio feature obtaining module 501 includes: an audio input receiving unit, configured to receive audio inputs of at least two different sampling rates; a feature-extraction configuration unit, configured to set the configuration information for feature extraction according to the sampling-rate class of the audio input; and an audio feature extraction unit, configured to perform feature extraction on the audio using the configuration information to obtain audio features of at least two different sampling rates.
According to an embodiment of the present invention, the neural network model training module 502 includes: a speech recognition training unit, configured to perform normal training of the neural network model against the speech recognition labels; and a sampling rate classification training unit, configured to perform adversarial training of the neural network model against the sampling rate classification labels.
According to an embodiment of the present invention, the sampling rate classification training unit is specifically configured to perform adversarial training of the neural network model against the sampling rate classification labels according to a cross-entropy training criterion.
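As a numeric illustration of the cross-entropy training criterion for the sampling-rate classifier: given a predicted distribution over the rate classes, the loss is the negative log-probability assigned to the true class. This is a minimal sketch, not the patent's implementation:

```python
import math

def cross_entropy(class_probs, true_class):
    """Cross-entropy loss for one example; class_probs should sum to 1."""
    return -math.log(class_probs[true_class])

# Three rate classes; the classifier puts 0.5 probability on the correct one,
# so the loss is -log(0.5) = log(2) ~ 0.693.
loss = cross_entropy([0.5, 0.3, 0.2], 0)
```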
According to an embodiment of the present invention, the sampling rate classification training unit includes: a gradient-reversal subunit, configured to perform the adversarial training by back-propagating gradients after gradient reversal.
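The gradient-reversal scheme can be illustrated with a toy layer that acts as the identity in the forward pass but multiplies gradients by a negative factor in the backward pass, so the shared encoder is trained to confuse the sampling-rate classifier while the recognition branch trains normally. This pure-Python sketch only mimics the mechanics; real systems implement it as a custom autograd function in the training framework:

```python
class GradientReversal:
    """Toy gradient-reversal layer (hypothetical names, illustrative only).

    Forward: identity, so the classifier sees the encoder's features.
    Backward: gradients are scaled by -lam before reaching the encoder,
    pushing the encoder to *increase* the classifier's loss.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return list(x)                       # identity pass-through

    def backward(self, grad_output):
        # Reverse and scale the incoming gradient.
        return [-self.lam * g for g in grad_output]

grl = GradientReversal(lam=0.5)
features = grl.forward([1.0, 2.0])           # unchanged in forward pass
reversed_grad = grl.backward([1.0, 2.0])     # negated and scaled backward
```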
In addition, an embodiment of the present invention further provides a multi-sampling-rate speech recognition apparatus. As shown in fig. 6, the apparatus 60 includes: an audio feature receiving module 601, configured to receive audio features; and a speech recognition module 602, configured to input the audio features into a multi-sampling-rate speech recognition model to obtain a speech recognition result, where the multi-sampling-rate speech recognition model is trained by performing any one of the above training methods of the multi-sampling-rate speech recognition model.
According to a fifth aspect of embodiments of the present invention there is provided a multiple sampling rate speech recognition system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for performing the method of any one of the above when the computer program instructions are executed by the processor.
According to a sixth aspect of embodiments of the present invention there is provided a computer storage medium comprising a set of computer executable instructions for performing the method of any one of the above when the instructions are executed.
Here, it should be noted that the above descriptions of the embodiments of the training apparatus for the multi-sampling-rate speech recognition model, the multi-sampling-rate speech recognition apparatus, the multi-sampling-rate speech recognition system, and the computer storage medium are similar to the descriptions of the foregoing method embodiments and have similar beneficial effects, and are therefore not repeated. For technical details not disclosed in these apparatus, system, and storage-medium embodiments, please refer to the descriptions of the foregoing method embodiments of the present invention; for brevity, no further description is provided.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another device, and some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or of other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be performed by hardware under the control of program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage media include various media capable of storing program code, such as a removable storage medium, a read-only memory (ROM), a magnetic disk, or an optical disk.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, or the portions thereof that contribute to the prior art, may be embodied in the form of a software product stored in a storage medium, which includes several instructions enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present invention. The aforementioned storage media include a removable storage medium, a ROM, a magnetic disk, an optical disk, and the like.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A method for training a multiple sampling rate speech recognition model, the method comprising:
receiving audio inputs of at least two different sampling rates; setting configuration information for feature extraction according to the sampling rate classification of the audio input; and performing feature extraction on the audio by using the configuration information to obtain audio features of at least two different sampling rates;
and training a neural network model with the audio features as input, wherein the audio features are labeled with a speech recognition label and a sampling rate classification label.
2. The method of claim 1, wherein training the neural network model comprises:
performing normal training of the neural network model against the speech recognition label, and performing adversarial training of the neural network model against the sampling rate classification label.
3. The method of claim 2, wherein the adversarial training of the neural network model against the sampling rate classification label comprises:
performing adversarial training of the neural network model against the sampling rate classification label according to a cross-entropy training criterion.
4. The method of claim 2 or 3, wherein the performing adversarial training comprises:
performing the adversarial training by back-propagating gradients after gradient reversal.
5. A method of multiple sample rate speech recognition, the method comprising:
receiving an audio feature;
inputting the audio features into a multi-sampling rate speech recognition model to obtain a speech recognition result, wherein the multi-sampling rate speech recognition model is obtained by performing the method of any one of claims 1 to 4.
6. An apparatus for training a multiple sampling rate speech recognition model, the apparatus comprising:
an audio feature acquisition module, configured to receive audio inputs of at least two different sampling rates; set configuration information for feature extraction according to the sampling rate classification of the audio input; and perform feature extraction on the audio by using the configuration information to obtain audio features of at least two different sampling rates;
and the neural network model training module is used for training the neural network model by taking the audio features as input, wherein the audio features are marked with voice recognition labels and sampling rate classification labels.
7. A multi-sample rate speech recognition apparatus, the apparatus comprising:
an audio feature receiving module, configured to receive audio features;
a speech recognition module, configured to input the audio features into a multi-sampling rate speech recognition model to obtain a speech recognition result, where the multi-sampling rate speech recognition model is trained by performing the method according to any one of claims 1 to 4.
8. A multi-sampling rate speech recognition system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for execution by the processor to perform the method of any of claims 1 to 4;
or to perform the method of claim 5.
9. A storage medium having stored thereon program instructions for performing, when executed, the method of any one of claims 1 to 4;
or, performing the method of claim 5.
CN201911363288.8A 2019-12-26 2019-12-26 Multi-sampling-rate voice recognition method, device, system and storage medium Active CN111105786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911363288.8A CN111105786B (en) 2019-12-26 2019-12-26 Multi-sampling-rate voice recognition method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911363288.8A CN111105786B (en) 2019-12-26 2019-12-26 Multi-sampling-rate voice recognition method, device, system and storage medium

Publications (2)

Publication Number Publication Date
CN111105786A CN111105786A (en) 2020-05-05
CN111105786B true CN111105786B (en) 2022-10-18

Family

ID=70425343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911363288.8A Active CN111105786B (en) 2019-12-26 2019-12-26 Multi-sampling-rate voice recognition method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN111105786B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859010B (en) * 2020-07-10 2022-06-03 浙江树人学院(浙江树人大学) Semi-supervised audio event identification method based on depth mutual information maximization
CN112259078A (en) * 2020-10-15 2021-01-22 上海依图网络科技有限公司 Method and device for training audio recognition model and recognizing abnormal audio
CN113257235B (en) * 2021-04-30 2023-01-03 平安科技(深圳)有限公司 Model training method, voice recognition method, device, server and storage medium
CN113345424B (en) * 2021-05-31 2024-02-27 平安科技(深圳)有限公司 Voice feature extraction method, device, equipment and storage medium
CN114420100B (en) * 2022-03-30 2022-06-21 中国科学院自动化研究所 Voice detection method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105513590A (en) * 2015-11-23 2016-04-20 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN108510979B (en) * 2017-02-27 2020-12-15 芋头科技(杭州)有限公司 Training method of mixed frequency acoustic recognition model and voice recognition method
US10347241B1 (en) * 2018-03-23 2019-07-09 Microsoft Technology Licensing, Llc Speaker-invariant training via adversarial learning
CN110517664B (en) * 2019-09-10 2022-08-05 科大讯飞股份有限公司 Multi-party identification method, device, equipment and readable storage medium
CN110459205B (en) * 2019-09-24 2022-04-12 京东科技控股股份有限公司 Speech recognition method and device, computer storage medium

Also Published As

Publication number Publication date
CN111105786A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN111105786B (en) Multi-sampling-rate voice recognition method, device, system and storage medium
CN111401609B (en) Prediction method and prediction device for traffic flow time series
CN110807515A (en) Model generation method and device
CN107358247B (en) Method and device for determining lost user
CN109087667B (en) Voice fluency recognition method and device, computer equipment and readable storage medium
CN108875059A (en) For generating method, apparatus, electronic equipment and the storage medium of document label
CN110555451A (en) information identification method and device
CN111462761A (en) Voiceprint data generation method and device, computer device and storage medium
CN111611390B (en) Data processing method and device
CN111582488A (en) Event deduction method and device
CN111383100A (en) Risk model-based full life cycle management and control method and device
CN111160959B (en) User click conversion prediction method and device
CN113986674A (en) Method and device for detecting abnormity of time sequence data and electronic equipment
CN112182281B (en) Audio recommendation method, device and storage medium
CN114360027A (en) Training method and device for feature extraction network and electronic equipment
CN108628863B (en) Information acquisition method and device
KR20140146437A (en) Apparatus and method for forecasting business performance based on patent information
CN113190746B (en) Recommendation model evaluation method and device and electronic equipment
CN115600818A (en) Multi-dimensional scoring method and device, electronic equipment and storage medium
CN115278757A (en) Method and device for detecting abnormal data and electronic equipment
CN110458383B (en) Method and device for realizing demand processing servitization, computer equipment and storage medium
CN113962216A (en) Text processing method and device, electronic equipment and readable storage medium
CN112231299A (en) Method and device for dynamically adjusting feature library
CN112070530A (en) Online evaluation method and related device of advertisement prediction model
CN115238805B (en) Training method of abnormal data recognition model and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215024 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215024 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant