CN113257235B - Model training method, voice recognition method, device, server and storage medium

Model training method, voice recognition method, device, server and storage medium

Info

Publication number
CN113257235B
CN113257235B
Authority
CN
China
Prior art keywords
voice data
model
feature extraction
feature
extraction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110484676.2A
Other languages
Chinese (zh)
Other versions
CN113257235A (en)
Inventor
王璐
魏韬
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110484676.2A priority Critical patent/CN113257235B/en
Publication of CN113257235A publication Critical patent/CN113257235A/en
Application granted granted Critical
Publication of CN113257235B publication Critical patent/CN113257235B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to model construction in artificial intelligence, and provides a model training method, a voice recognition method, a device, a server and a storage medium, wherein the method comprises the following steps: carrying out first signal processing on the voice data to obtain first voice data, and carrying out second signal processing on the voice data to obtain second voice data; inputting the first voice data and the second voice data into a feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data; calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector; updating model parameters of the feature extraction model according to mutual information between the first voice data and the second voice data until the feature extraction model converges; and fusing and fine-tuning the converged feature extraction model and the trained voice recognition model to obtain a target voice recognition model. The robustness of the speech recognition model can be improved.

Description

Model training method, voice recognition method, device, server and storage medium
Technical Field
The present application relates to the field of model construction technologies, and in particular, to a model training method, a speech recognition method and apparatus, a server, and a storage medium.
Background
With the continuous development of the new media industry, the channels through which voice data is produced are becoming increasingly diverse, with different bandwidths and encoding formats; for example, the voice data may be recordings at an 8 kHz or 16 kHz sampling rate, or in encoding formats such as μ-law, A-law, and AMR. In some cases, the voice data is also compressed during transmission. All of this presents difficulties and challenges for speech recognition.
Existing speech recognition models can only recognize voice data from a single channel. For application scenarios involving voice data from different channels, multiple speech recognition models, each matched to the voice data of one channel, must be trained, so the robustness of the speech recognition model is poor. Moreover, because training data cannot be shared across different speech recognition models, the accuracy of each model suffers, or more training data is required, which is a significant drawback.
Disclosure of Invention
The application mainly aims to provide a model training method, a voice recognition method, a device, a server and a storage medium, and aims to improve the robustness and the expansibility of a voice recognition model so as to improve the flexibility and the accuracy of voice recognition.
In a first aspect, the present application provides a model training method applied to a server, where a feature extraction model and a trained speech recognition model are stored in the server, the method including:
acquiring voice data serving as a training sample, performing first signal processing on the voice data to obtain first voice data, and performing second signal processing on the voice data to obtain second voice data;
inputting the first voice data and the second voice data into the feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data;
calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector;
determining whether the feature extraction model converges according to mutual information between the first voice data and the second voice data;
if the feature extraction model is not converged, updating model parameters of the feature extraction model, and continuing to train the feature extraction model after model parameters are updated through the training sample until the feature extraction model is converged;
fusing the converged feature extraction model and the trained voice recognition model to obtain a fusion model;
and fine-tuning the fusion model to obtain a target voice recognition model.
In a second aspect, the present application further provides a speech recognition method, including:
acquiring target voice data to be identified;
inputting the target voice data into a target voice recognition model to obtain text information corresponding to the target voice data;
wherein the target speech recognition model is trained according to the model training method.
In a third aspect, the present application further provides a model training apparatus, in which a feature extraction model and a trained speech recognition model are stored, the model training apparatus includes:
the acquisition module is used for acquiring voice data serving as a training sample, performing first signal processing on the voice data to obtain first voice data, and performing second signal processing on the voice data to obtain second voice data;
the extraction module is used for inputting the first voice data and the second voice data into the feature extraction model so as to extract a first feature vector of the first voice data and a second feature vector of the second voice data;
the calculation module is used for calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector;
the determining module is used for determining whether the feature extraction model converges or not according to mutual information between the first voice data and the second voice data;
the updating module is used for updating the model parameters of the feature extraction model if the feature extraction model is not converged, and continuing to train the feature extraction model after the model parameters are updated through the training sample until the feature extraction model is converged;
the fusion module is used for fusing the converged feature extraction model and the trained voice recognition model to obtain a fusion model;
and the fine tuning module is used for fine tuning the fusion model to obtain a target voice recognition model.
In a fourth aspect, the present application further provides a server storing a feature extraction model and a trained speech recognition model, the server comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the model training method or the speech recognition method as described above.
In a fifth aspect, the present application further provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the model training method or the speech recognition method as described above.
The application provides a model training method, a voice recognition device, a server and a storage medium, wherein voice data serving as training samples are obtained, first signal processing is carried out on the voice data to obtain first voice data, and second signal processing is carried out on the voice data to obtain second voice data; inputting the first voice data and the second voice data into a feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data; calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector; determining whether the feature extraction model converges according to mutual information between the first voice data and the second voice data; if the feature extraction model is not converged, updating the model parameters of the feature extraction model, and continuing to train the feature extraction model after updating the model parameters through the training sample until the feature extraction model is converged; fusing the converged feature extraction model with the trained voice recognition model to obtain a fused model; and fine-tuning the fusion model to obtain the target voice recognition model. The robustness and the expansibility of the target voice recognition model are greatly improved, and the method can be applied to different application scenes, so that the flexibility and the accuracy of voice recognition are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart illustrating steps of a model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of outputting a first feature vector and a second feature vector;
fig. 3 is a schematic flowchart illustrating steps of a speech recognition method according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a model training apparatus provided in an embodiment of the present application;
fig. 5 is a schematic block diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a block diagram schematically illustrating a structure of a server according to an embodiment of the present disclosure.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, although the division of the functional blocks is made in the device diagram, in some cases, it may be divided in blocks different from those in the device diagram.
The embodiment of the application provides a model training method, a voice recognition device, a server and a storage medium. The model training method can be applied to a server, and the server can be a single server or a server cluster consisting of a plurality of servers. The server stores untrained feature extraction models and trained speech recognition models. In some embodiments, the server stores an untrained feature extraction model and an untrained speech recognition model, and performs iterative training on the untrained speech recognition model through a plurality of speech data serving as training samples to obtain a trained speech recognition model, which is not specifically limited in the present application.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments and features of the embodiments described below can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating steps of a model training method according to an embodiment of the present disclosure.
As shown in fig. 1, the model training method includes steps S101 to S107.
Step S101, obtaining voice data serving as a training sample, performing first signal processing on the voice data to obtain first voice data, and performing second signal processing on the voice data to obtain second voice data.
The voice data is speech recorded by a user, for example 3000 utterances recorded by user A through a recording device. The first signal processing includes sampling rate adjustment, encoding format adjustment, compression and/or decompression; the second signal processing likewise includes sampling rate adjustment, encoding format adjustment, compression and/or decompression; the first signal processing is not identical to the second signal processing, and the present application is not limited in this respect. The sampling rate includes 8 kHz, 16 kHz, and the like, and sampling rate adjustment includes increasing or decreasing the current sampling rate of the voice data; the encoding format includes μ-law, A-law, AMR, and the like, and encoding format adjustment includes changing the current encoding format of the voice data; the compression format includes rar, zip, and the like, where compression applies compression processing to the voice data and decompression applies decompression processing to the voice data.
It should be noted that speech encountered in everyday use comes in multiple sampling rates and encoding formats, and a recording may additionally be compressed or otherwise processed during transmission. Current speech recognition models therefore struggle to recognize such voice data well; some models apply down-sampling and up-sampling to the training data, but this still cannot cope with the complicated real situations described above.
It should be noted that, by performing the first signal processing on the voice data, the first voice data of multiple channels can be obtained, and by performing the second signal processing on the voice data, the second voice data of multiple channels can be obtained, where the first voice data and the second voice data are used as training data, which is helpful for improving the robustness and the extensibility of the target voice recognition model, so that the target voice recognition model can recognize the voice data of different channels, and can be applied to different scenarios.
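As an illustration of the two signal processings, the following sketch (not taken from the patent; the helper names and parameter choices are assumptions) derives an 8 kHz μ-law view and a 16 kHz linear view from the same recording:

```python
# A minimal sketch, assuming the raw recording is a 16 kHz waveform in [-1, 1].
import numpy as np
from scipy.signal import resample_poly

def mu_law_companding(x: np.ndarray, mu: float = 255.0) -> np.ndarray:
    """Apply mu-law companding to a waveform in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def make_views(wave_16k: np.ndarray):
    # First signal processing: downsample 16 kHz -> 8 kHz, then mu-law compand.
    first_view = mu_law_companding(resample_poly(wave_16k, up=1, down=2))
    # Second signal processing: keep the 16 kHz linear waveform unchanged.
    second_view = wave_16k.copy()
    return first_view, second_view
```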
Step S102, inputting the first voice data and the second voice data into a feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data.
The feature extraction model is, for example, a neural network model configured to extract voice features of the first voice data and the second voice data, yielding a first feature vector of the first voice data and a second feature vector of the second voice data. Notably, the feature extraction model can use contrastive self-supervised learning to construct the representation, encoding the first voice data and the second voice data following the principles of contrastive learning. The first voice data and the second voice data are derived from the same voice data but represent different channel formats (different sampling rates, coding formats, and the like), so the feature extraction model can learn feature information from voice data in more channel formats.
Illustratively, as shown in fig. 2, the feature extraction model includes a first feature extractor 11 and a second feature extractor 12. First voice data obtained by performing first signal processing on voice data is input to the first feature extractor 11 to obtain a first feature vector of the first voice data, and second voice data obtained by performing second signal processing on the voice data is input to the second feature extractor 12 to obtain a second feature vector of the second voice data.
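A hedged sketch of such a two-branch feature extraction model is given below; the convolutional encoder and the layer sizes are assumptions rather than the patent's concrete architecture:

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Encodes a waveform into per-frame feature vectors."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=8, stride=4), nn.ReLU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> features: (batch, frames, feat_dim)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)

class FeatureExtractionModel(nn.Module):
    """Two feature extractors as in Fig. 2, one per signal-processed view."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.first_extractor = FrameEncoder(feat_dim)    # feature extractor 11
        self.second_extractor = FrameEncoder(feat_dim)   # feature extractor 12

    def forward(self, first_wav: torch.Tensor, second_wav: torch.Tensor):
        return self.first_extractor(first_wav), self.second_extractor(second_wav)
```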
Step S103, calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector.
Mutual information between the first voice data and the second voice data can be obtained from the first feature vector and the second feature vector through a preset mutual information calculation formula. The mutual information between the first voice data and the second voice data represents the correlation between them: the higher the mutual information, the higher the correlation. The mutual information can be used to adjust the model parameters of the feature extraction model, so that the feature extraction model learns voice features whose content is highly correlated.
Wherein, the mutual information calculation formula is as follows:
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
where x denotes the first feature vector, y denotes the second feature vector, p(x) denotes the prior probability of the first feature vector, p(y) denotes the prior probability of the second feature vector, and p(x, y) denotes the posterior probability that x was produced given that y has been received.
In one embodiment, determining feature information corresponding to each frame of voice data from a first feature vector to obtain a plurality of first frame feature information; determining feature information corresponding to each frame of the voice data from the second feature vectors to obtain a plurality of second frame feature information; and calculating mutual information between the first voice data and the second voice data according to the first frame characteristic information and the second frame characteristic information which correspond to each frame of the voice data. It should be noted that the voice data includes multiple frames of sub data, the first voice data subjected to the first signal processing includes multiple frames of first sub data, and the second voice data subjected to the second signal processing includes multiple frames of second sub data. Therefore, the first feature vector of the first voice data comprises multi-frame first frame feature information, the second feature vector of the second voice data comprises multi-frame second frame feature information, the subdata of each frame of the voice data corresponds to at least one first frame feature information and at least one second frame feature information, and the mutual information between the first voice data and the second voice data can be calculated by applying a preset mutual information calculation formula according to the first frame feature information and the second frame feature information corresponding to each frame of the voice data, so that the influence of a channel on a feature extraction model is reduced.
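The patent states the classical mutual information formula but does not fix an estimator; the sketch below uses a contrastive (InfoNCE-style) lower bound computed frame by frame, which is a common surrogate and an assumption here:

```python
import torch
import torch.nn.functional as F

def frame_mutual_information(first_feats: torch.Tensor,
                             second_feats: torch.Tensor) -> torch.Tensor:
    # first_feats, second_feats: (frames, feat_dim); truncate to a common length
    # so frame i of the first view is paired with frame i of the second view.
    n = min(first_feats.size(0), second_feats.size(0))
    a = F.normalize(first_feats[:n], dim=-1)
    b = F.normalize(second_feats[:n], dim=-1)
    logits = a @ b.t()                      # (n, n) cosine similarities
    targets = torch.arange(n)               # matching frames are the positive pairs
    # InfoNCE lower bound: I(X;Y) >= log(n) - cross-entropy, used as the MI estimate.
    return torch.log(torch.tensor(float(n))) - F.cross_entropy(logits, targets)
```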
And step S104, determining whether the feature extraction model converges according to mutual information between the first voice data and the second voice data.
In one embodiment, a loss value of the feature extraction model is calculated according to mutual information between the first voice data and the second voice data; if the loss value of the feature extraction model is smaller than or equal to the preset loss value, determining the feature extraction model to be converged; and if the loss value of the feature extraction model is larger than the preset loss value, determining that the feature extraction model is not converged. It should be noted that the loss value of the feature extraction model may be calculated according to the mutual information and the first weight representing the loss of the mutual information, and the preset loss value may be set according to an actual situation, which is not specifically limited in this embodiment. By calculating the loss value of the feature extraction model through mutual information, whether the feature extraction model converges or not can be accurately determined.
Calculating the loss value of the feature extraction model according to the mutual information between the first voice data and the second voice data includes: obtaining a first weight representing the mutual information loss and a second weight representing the classification loss; determining a first loss value of the feature extraction model from the mutual information between the first voice data and the second voice data and the first weight; determining probability distribution information of the first voice data and the second voice data, and determining a second loss value of the feature extraction model from the probability distribution information and the second weight; and adding the first loss value and the second loss value to obtain the loss value of the feature extraction model. The classification loss can be calculated using binary cross entropy.
Illustratively, let λ denote the weight of the mutual information loss and γ denote the weight of the classification loss, let L_global be the mutual information between the first voice data and the second voice data, and let L_label be the probability distribution information of the first voice data and the second voice data. The first loss value is λ·L_global, the second loss value is γ·L_label, and the loss value of the feature extraction model is L_total = λ·L_global + γ·L_label.
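A short sketch of this combined loss follows; the default weights and the use of a negated mutual-information estimate (so that minimizing the loss maximizes mutual information) are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(mutual_info: torch.Tensor,
               channel_logits: torch.Tensor,
               channel_labels: torch.Tensor,
               lambda_mi: float = 1.0,
               gamma_cls: float = 0.5) -> torch.Tensor:
    # First loss value: mutual-information loss, negated so lower loss means higher MI.
    l_global = -mutual_info
    # Second loss value: binary cross-entropy classification loss over the
    # probability distribution information (channel_labels is a float tensor).
    l_label = F.binary_cross_entropy_with_logits(channel_logits, channel_labels)
    # L_total = lambda * L_global + gamma * L_label
    return lambda_mi * l_global + gamma_cls * l_label
```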
And S105, if the feature extraction model is not converged, updating the model parameters of the feature extraction model, and continuing to train the feature extraction model after the model parameters are updated through the training sample until the feature extraction model is converged.
If the feature extraction model has not converged, the content correlation between the first voice data and the second voice data learned by the feature extraction model is still low, and the feature extraction model is easily influenced by voice data from different channels. Therefore, the model parameters of the feature extraction model can be updated according to the loss value of the feature extraction model, and the feature extraction model with updated parameters continues to be trained with the training sample; that is, the process returns to the step of obtaining the voice data serving as the training sample and the subsequent steps, so that the feature extraction model is trained with the first voice data and the second voice data until it converges.
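Putting steps S101 to S105 together, a minimal training-loop sketch might look as follows; the optimizer, the preset loss threshold, and the helpers make_views, FeatureExtractionModel, and frame_mutual_information are assumptions carried over from the earlier sketches:

```python
import torch

def train_feature_extractor(model, waves, optimizer, loss_threshold=0.05, max_steps=10000):
    for step in range(max_steps):
        wave = waves[step % len(waves)]                           # one training sample
        first_np, second_np = make_views(wave)                    # S101: two signal processings
        first_wav = torch.as_tensor(first_np, dtype=torch.float32).unsqueeze(0)
        second_wav = torch.as_tensor(second_np, dtype=torch.float32).unsqueeze(0)
        first_feats, second_feats = model(first_wav, second_wav)  # S102: feature vectors
        mi = frame_mutual_information(first_feats[0], second_feats[0])  # S103: mutual information
        loss = -mi                                                # simplified, MI term only
        if loss.item() <= loss_threshold:                         # S104: convergence check
            break                                                 # model has converged
        optimizer.zero_grad()
        loss.backward()                                           # S105: update model parameters
        optimizer.step()
    return model
```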
It should be noted that, if the feature extraction model is not converged, training the feature extraction model is continued, so that mutual information between the first speech data and the second speech data can be maximized, and similarity of high-dimensional distribution of the first speech data and the second speech data is improved, thereby achieving higher consistency than that obtained by using only frame-by-frame cross entropy loss, so as to improve robustness and expansibility of the speech recognition model.
In an embodiment, if the feature extraction model converges, the feature extraction model does not need to be trained continuously, and model parameters of the feature extraction model do not need to be updated, so that the converged feature extraction model is obtained.
And S106, fusing the converged feature extraction model and the trained voice recognition model to obtain a fusion model.
The feature extraction model is used for extracting voice features of the first voice data and the second voice data, and can comprise a first feature extractor and a second feature extractor. The trained voice recognition model can be obtained by training based on a plurality of voice data and is preset in the server, and the trained voice recognition model can perform voice recognition on the voice data so as to obtain text information corresponding to the voice data. By fusing the converged feature extraction model and the trained voice recognition model, the obtained fusion model has higher robustness and higher expansibility.
In one embodiment, the trained speech recognition model includes a feature extraction layer for extracting speech features of the speech data and a feature recognition layer for recognizing the extracted speech features as corresponding text information. Fusing the converged feature extraction model with the trained speech recognition model to obtain a fusion model includes: replacing the feature extraction layer of the speech recognition model with the converged feature extraction model to obtain the fusion model; or connecting the converged feature extraction model to the feature recognition layer of the speech recognition model, so that voice data can be fed through the converged feature extraction model into the feature recognition layer of the speech recognition model to complete speech recognition. The robustness and extensibility of the fusion model are thereby greatly improved, improving the flexibility and accuracy of speech recognition.
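As an illustration of the first fusion option, the sketch below wraps the converged extractor and the retained feature recognition layer in one module; the attribute name recognition_layer and the use of a single-branch encoder at this stage are assumptions:

```python
import torch.nn as nn

class FusionModel(nn.Module):
    """Converged feature extraction model + feature recognition layer of the trained recognizer."""
    def __init__(self, converged_extractor: nn.Module, trained_recognizer: nn.Module):
        super().__init__()
        self.feature_extractor = converged_extractor              # replaces the original feature extraction layer
        self.recognizer = trained_recognizer.recognition_layer    # kept feature recognition layer (assumed attribute)

    def forward(self, wav):
        feats = self.feature_extractor(wav)     # speech features
        return self.recognizer(feats)           # token posteriors / text information
```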
In one embodiment, the trained speech recognition model includes a feature recognition layer for recognizing the extracted speech features as corresponding text information. The converged feature extraction model is connected with the feature recognition layer of the voice recognition model, so that voice data can be input to the feature recognition layer through the converged feature extraction model, the robustness and the expansibility of the fusion model can be greatly improved, and the flexibility and the accuracy of voice recognition are improved.
And S107, fine-tuning the fusion model to obtain a target voice recognition model.
After the fusion model is obtained, the fusion model is finely adjusted to optimize model parameters of the fusion model to obtain the target voice recognition model, the target voice recognition model has better robustness, and the accuracy of voice recognition of the target voice recognition model is improved.
In one embodiment, the fusion model comprises a feature extraction sub-model and a speech recognition sub-model; and alternately fine-tuning the feature extraction submodel or the voice recognition submodel until the feature extraction submodel and the voice recognition submodel are converged to obtain a target voice recognition model. It should be noted that, the feature extraction submodel or the speech recognition submodel is alternately fine-tuned, that is, the model parameter of one of the feature extraction submodel and the speech recognition submodel is fixed, and the other model is fine-tuned, and then the processes are alternately performed. The method can reduce the channel difference between the voice data of different channels and improve the expansibility and the robustness of the target voice recognition model under the condition of changing the model structure as little as possible.
For example, firstly fixing the model parameters of the voice recognition submodel, and finely adjusting the model parameters of the feature extraction submodel; fixing the model parameters of the feature extraction submodel, and finely adjusting the model parameters of the voice recognition submodel; and repeating the steps of fixing the model parameters of the voice recognition submodel and finely adjusting the model parameters of the feature extraction submodel, and repeatedly and alternately finely adjusting until the finely adjusted feature extraction submodel and the finely adjusted voice recognition submodel are converged to obtain the target voice recognition model.
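A hedged sketch of this alternating schedule is shown below, reusing the FusionModel attribute names assumed earlier; the per-round fine-tuning routine and the convergence test are placeholders:

```python
def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def alternate_finetune(fusion_model, finetune_one_round, converged, max_rounds=20):
    for _ in range(max_rounds):
        # Fix the speech recognition sub-model, fine-tune the feature extraction sub-model.
        set_trainable(fusion_model.recognizer, False)
        set_trainable(fusion_model.feature_extractor, True)
        finetune_one_round(fusion_model)
        # Fix the feature extraction sub-model, fine-tune the speech recognition sub-model.
        set_trainable(fusion_model.feature_extractor, False)
        set_trainable(fusion_model.recognizer, True)
        finetune_one_round(fusion_model)
        if converged(fusion_model):
            break
    return fusion_model
```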
In one embodiment, a target model to be adjusted is determined, wherein the target model comprises a feature extraction sub-model and a voice recognition sub-model; and alternately fine-tuning the feature extraction submodel or the voice recognition submodel according to the target model until the feature extraction submodel and the voice recognition submodel are converged to obtain the target voice recognition model. It should be noted that the fine tuning of the feature extraction submodel or the voice recognition submodel includes training the feature extraction submodel or the voice recognition submodel through the first voice data and the second voice data, and updating the model parameters of the feature extraction submodel or the voice recognition submodel.
Exemplarily, if the target model is the feature extraction submodel, fixing model parameters of the voice recognition submodel, finely adjusting the model parameters of the feature extraction submodel, and determining whether the feature extraction submodel and the voice recognition submodel are converged; if the feature extraction submodel and the voice recognition submodel are not converged, fixing model parameters of the feature extraction submodel, finely adjusting model parameters of the voice recognition submodel, and determining whether the feature extraction submodel and the voice recognition submodel are converged or not; and if the feature extraction submodel and the voice recognition submodel are not converged, executing the steps of fixing the model parameters of the voice recognition submodel and finely adjusting the model parameters of the feature extraction submodel until the feature extraction submodel and the voice recognition submodel are converged to obtain the target voice recognition model. And if the target model is a voice recognition submodel, executing the fixed characteristic extraction submodel model parameters and fine-tuning the model parameters of the voice recognition submodel.
In the model training method provided in the above embodiment, the speech data serving as the training sample is obtained, the first signal processing is performed on the speech data to obtain the first speech data, and the second signal processing is performed on the speech data to obtain the second speech data; inputting the first voice data and the second voice data into a feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data; calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector; determining whether the feature extraction model converges according to mutual information between the first voice data and the second voice data; if the feature extraction model is not converged, updating the model parameters of the feature extraction model, and continuing to train the feature extraction model after updating the model parameters through the training sample until the feature extraction model is converged; fusing the converged feature extraction model with the trained voice recognition model to obtain a fused model; and fine-tuning the fusion model to obtain the target voice recognition model. The robustness and the expansibility of the target voice recognition model are greatly improved, and the method can be applied to different application scenes, so that the flexibility and the accuracy of voice recognition are improved.
Referring to fig. 3, fig. 3 is a flowchart illustrating steps of a speech recognition method according to an embodiment of the present application.
As shown in fig. 3, the voice recognition method includes steps S201 to S202.
Step S201, target voice data to be recognized is acquired.
The target voice data comprises voice data recorded by a user through a recording device, for example, the voice data recorded by the user through a mobile phone, and the user sends a voice recognition instruction through the mobile phone, so that the server acquires the recorded voice data and performs voice recognition on the voice data through a target voice recognition model.
In one embodiment, obtaining the target voice data to be recognized includes: obtaining voice data to be recognized, and performing signal processing on the voice data to obtain the target voice data. The signal processing includes sampling rate adjustment, encoding format adjustment, compression and/or decompression, which is not specifically limited in this application.
Step S202, inputting the target voice data into the target voice recognition model to obtain text information corresponding to the target voice data.
Target voice data are input into the target voice recognition model, and text information corresponding to the target voice data can be conveniently and accurately obtained. Wherein, the target speech recognition model is trained according to the model training method as the foregoing embodiment. The flexibility and accuracy of speech recognition is higher.
In one embodiment, the fusion model comprises a feature extraction sub-model and a speech recognition sub-model; inputting target voice data into a feature extraction submodel to obtain a voice feature vector; and inputting the voice characteristic vector into the voice recognition submodel to obtain text information corresponding to the target voice data. Through the feature extraction submodel and the voice recognition submodel, the output result of the text information is more accurate.
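A minimal inference sketch for steps S201 and S202 is given below; the greedy per-frame decoding and the id-to-character mapping are illustrative assumptions, not taken from the patent:

```python
import torch

@torch.no_grad()
def recognize(target_speech_model, wav: torch.Tensor, id_to_char: dict) -> str:
    wav = wav.unsqueeze(0)                        # (1, samples)
    logits = target_speech_model(wav)             # (1, frames, vocab) token posteriors
    ids = logits.argmax(dim=-1)[0].tolist()       # greedy decoding per frame
    return "".join(id_to_char.get(i, "") for i in ids)
```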
In the model training method provided by the embodiment, the target voice data to be recognized is acquired, and the target voice data is input into the target voice recognition model to obtain the text information corresponding to the target voice data, wherein the target voice recognition model is obtained by training according to the model training method of the embodiment, and can be applied to different application scenarios to perform voice recognition on the voice data of multiple channels, and the flexibility and the accuracy of the voice recognition are higher.
Referring to fig. 4, fig. 4 is a schematic block diagram of a model training apparatus according to an embodiment of the present disclosure, in which a feature extraction model and a trained speech recognition model are stored in the model training apparatus.
As shown in fig. 4, the model training apparatus 300 includes: an acquisition module 301, an extraction module 302, a calculation module 303, a determination module 304, an update module 305, a fusion module 306, and a fine tuning module 307.
An obtaining module 301, configured to obtain voice data serving as a training sample, perform first signal processing on the voice data to obtain first voice data, and perform second signal processing on the voice data to obtain second voice data;
an extracting module 302, configured to input the first voice data and the second voice data into the feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data;
a calculating module 303, configured to calculate mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector;
a determining module 304, configured to determine whether the feature extraction model converges according to mutual information between the first voice data and the second voice data;
an updating module 305, configured to update the model parameters of the feature extraction model if the feature extraction model is not converged, and continue to train the feature extraction model after updating the model parameters through the training sample until the feature extraction model is converged;
a fusion module 306, configured to fuse the converged feature extraction model and the trained speech recognition model to obtain a fusion model;
and a fine tuning module 307, configured to fine tune the fusion model to obtain a target speech recognition model.
In one embodiment, the first signal processing comprises sample rate adjustment, coding format adjustment, compression and/or decompression and the second signal processing comprises sample rate adjustment, coding format adjustment, compression and/or decompression, the first signal processing is not identical to the second signal processing.
In one embodiment, the calculation module 303 is further configured to:
determining feature information corresponding to each frame of the voice data from the first feature vector to obtain a plurality of first frame feature information;
determining feature information corresponding to each frame of the voice data from the second feature vector to obtain a plurality of second frame feature information;
and calculating mutual information between the first voice data and the second voice data according to the first frame characteristic information and the second frame characteristic information which correspond to each frame of the voice data.
In one embodiment, the determination module 304 is further configured to:
calculating a loss value of the feature extraction model according to mutual information between the first voice data and the second voice data;
if the loss value of the feature extraction model is smaller than or equal to a preset loss value, determining that the feature extraction model converges;
and if the loss value of the feature extraction model is larger than a preset loss value, determining that the feature extraction model is not converged.
In one embodiment, the determination module 304 is further configured to:
acquiring a first weight representing mutual information loss and a second weight representing classification loss;
determining a first loss value of the feature extraction model according to mutual information between the first voice data and the second voice data and the first weight;
determining probability distribution information of the first voice data and the second voice data, and determining a second loss value of the feature extraction model according to the probability distribution information and the second weight;
and adding the first loss value and the second loss value to obtain a loss value of the feature extraction model.
In one embodiment, the fusion model includes a feature extraction sub-model and a speech recognition sub-model; the fine tuning module 307 is further configured to:
and alternately fine-tuning the feature extraction submodel or the voice recognition submodel until the feature extraction submodel and the voice recognition submodel are converged to obtain a target voice recognition model.
Referring to fig. 5, fig. 5 is a schematic block diagram of a speech recognition apparatus according to an embodiment of the present application.
As shown in fig. 5, the speech recognition apparatus 400 includes:
an obtaining module 401, configured to obtain target speech data to be identified;
an input module 402, configured to input the target speech data into a target speech recognition model, so as to obtain text information corresponding to the target speech data.
The target speech recognition model is obtained by training according to the model training method in the embodiment.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the above-described apparatus and each module and unit may refer to the corresponding processes in the foregoing embodiment of the model training method, and are not described herein again.
The apparatus provided by the above embodiment may be implemented in a form of a computer program, and the computer program may be run on a server as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram of a server according to an embodiment of the present disclosure. The server may store a feature extraction model and a trained speech recognition model.
As shown in fig. 6, the server includes a processor, a memory and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of a model training method or a speech recognition method.
The processor is used for providing calculation and control capacity and supporting the operation of the whole server.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor, causes the processor to perform any one of a model training method or a speech recognition method.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 6 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the servers to which the subject application applies, as a particular server may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU), and the processor may also be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring voice data serving as a training sample, performing first signal processing on the voice data to obtain first voice data, and performing second signal processing on the voice data to obtain second voice data;
inputting the first voice data and the second voice data into the feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data;
calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector;
determining whether the feature extraction model converges according to mutual information between the first voice data and the second voice data;
if the feature extraction model is not converged, updating model parameters of the feature extraction model, and continuing to train the feature extraction model after model parameters are updated through the training sample until the feature extraction model is converged;
fusing the converged feature extraction model and the trained voice recognition model to obtain a fusion model;
and fine-tuning the fusion model to obtain a target voice recognition model.
In one embodiment, the first signal processing comprises sample rate adjustment, coding format adjustment, compression and/or decompression and the second signal processing comprises sample rate adjustment, coding format adjustment, compression and/or decompression, the first signal processing is not identical to the second signal processing.
In one embodiment, the processor, when implementing the calculating of the mutual information between the first speech data and the second speech data according to the first feature vector and the second feature vector, is configured to implement:
determining feature information corresponding to each frame of the voice data from the first feature vector to obtain a plurality of first frame feature information;
determining feature information corresponding to each frame of the voice data from the second feature vector to obtain a plurality of second frame feature information;
and calculating mutual information between the first voice data and the second voice data according to the first frame characteristic information and the second frame characteristic information which correspond to each frame of the voice data.
In one embodiment, the processor, when performing the determining whether the feature extraction model converges according to mutual information between the first speech data and the second speech data, is configured to perform:
calculating a loss value of the feature extraction model according to mutual information between the first voice data and the second voice data;
if the loss value of the feature extraction model is smaller than or equal to a preset loss value, determining that the feature extraction model converges;
and if the loss value of the feature extraction model is larger than a preset loss value, determining that the feature extraction model is not converged.
In one embodiment, the processor, when implementing the calculating the loss value of the feature extraction model according to the mutual information between the first voice data and the second voice data, is configured to implement:
acquiring a first weight representing mutual information loss and a second weight representing classification loss;
determining a first loss value of the feature extraction model according to mutual information between the first voice data and the second voice data and the first weight;
determining probability distribution information of the first voice data and the second voice data, and determining a second loss value of the feature extraction model according to the probability distribution information and the second weight;
and adding the first loss value and the second loss value to obtain a loss value of the feature extraction model.
In one embodiment, the fusion model includes a feature extraction sub-model and a speech recognition sub-model; the processor, when implementing the fine tuning of the fusion model to obtain a target speech recognition model, is configured to implement:
and alternately fine-tuning the feature extraction submodel or the voice recognition submodel until the feature extraction submodel and the voice recognition submodel are converged to obtain a target voice recognition model.
In one embodiment, the processor is configured to execute a computer program stored in the memory to perform the steps of:
acquiring target voice data to be identified;
inputting the target voice data into a target voice recognition model to obtain text information corresponding to the target voice data;
the target speech recognition model is obtained by training according to the model training method in the embodiment.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the server described above may refer to the corresponding process in the foregoing model training method or speech recognition method embodiment, and is not described herein again.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed, a method implemented by the computer-readable storage medium may refer to various embodiments of a model training method or a speech recognition method of the present application.
The computer-readable storage medium may be an internal storage unit of the server according to the foregoing embodiment, for example, a hard disk or a memory of the server. The computer readable storage medium may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the server.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A model training method is applied to a server, wherein the server stores a feature extraction model and a trained speech recognition model, and the method comprises the following steps:
acquiring voice data serving as a training sample, performing first signal processing on the voice data to obtain first voice data, and performing second signal processing on the voice data to obtain second voice data; wherein the first signal processing comprises sample rate adjustment, encoding format adjustment, compression and/or decompression, the second signal processing comprises sample rate adjustment, encoding format adjustment, compression and/or decompression, and the first signal processing is not identical to the second signal processing;
inputting the first voice data and the second voice data into the feature extraction model to extract a first feature vector of the first voice data and a second feature vector of the second voice data;
calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector;
determining whether the feature extraction model converges according to mutual information between the first voice data and the second voice data;
if the feature extraction model is not converged, updating model parameters of the feature extraction model, and continuing to train the feature extraction model after model parameters are updated through the training sample until the feature extraction model is converged;
fusing the converged feature extraction model and the trained voice recognition model to obtain a fusion model;
and fine-tuning the fusion model to obtain a target voice recognition model.
2. The model training method of claim 1, wherein the calculating mutual information between the first speech data and the second speech data based on the first feature vector and the second feature vector comprises:
determining feature information corresponding to each frame of the voice data from the first feature vectors to obtain a plurality of first frame feature information;
determining feature information corresponding to each frame of the voice data from the second feature vector to obtain a plurality of second frame feature information;
and calculating mutual information between the first voice data and the second voice data according to the first frame characteristic information and the second frame characteristic information which correspond to each frame of the voice data.
3. The model training method of any one of claims 1-2, wherein the determining whether the feature extraction model converges based on mutual information between the first speech data and the second speech data comprises:
calculating a loss value of the feature extraction model according to mutual information between the first voice data and the second voice data;
if the loss value of the feature extraction model is smaller than or equal to a preset loss value, determining that the feature extraction model converges;
and if the loss value of the feature extraction model is larger than a preset loss value, determining that the feature extraction model is not converged.
4. The model training method of claim 3, wherein the calculating the loss value of the feature extraction model based on the mutual information between the first speech data and the second speech data comprises:
acquiring a first weight representing mutual information loss and a second weight representing classification loss;
determining a first loss value of the feature extraction model according to mutual information between the first voice data and the second voice data and the first weight;
determining probability distribution information of the first voice data and the second voice data, and determining a second loss value of the feature extraction model according to the probability distribution information and the second weight;
and adding the first loss value and the second loss value to obtain a loss value of the feature extraction model.
5. The model training method of any one of claims 1-2, wherein the fusion model comprises a feature extraction submodel and a voice recognition submodel; and the fine-tuning of the fusion model to obtain a target voice recognition model comprises:
alternately fine-tuning the feature extraction submodel and the voice recognition submodel until both the feature extraction submodel and the voice recognition submodel converge, to obtain the target voice recognition model.
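A minimal sketch of the alternating fine-tuning in claim 5 follows, assuming PyTorch, a fusion model exposing `feature_extractor` and `recognizer` submodules, and a fixed number of alternation rounds in place of a real convergence test; these names, the criterion, and the schedule are illustrative assumptions rather than the patented implementation.

```python
import torch

def alternate_fine_tune(fusion_model, train_loader, criterion, rounds=4, lr=1e-5):
    """Freeze one submodel, fine-tune the other for one pass, then swap, for a fixed number of rounds."""
    parts = [fusion_model.feature_extractor, fusion_model.recognizer]
    for r in range(rounds):
        active, frozen = parts[r % 2], parts[(r + 1) % 2]
        for p in frozen.parameters():
            p.requires_grad_(False)
        for p in active.parameters():
            p.requires_grad_(True)
        optimiser = torch.optim.Adam(active.parameters(), lr=lr)
        for wav, text in train_loader:
            loss = criterion(fusion_model(wav), text)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return fusion_model
```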
6. A speech recognition method, comprising:
acquiring target voice data to be recognized;
inputting the target voice data into a target voice recognition model to obtain text information corresponding to the target voice data;
wherein the target voice recognition model is trained according to the model training method of any one of claims 1 to 5.
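For the recognition step in claim 6, a usage sketch is given below; the greedy per-frame argmax decoding and the hypothetical `id_to_char` vocabulary table are assumptions standing in for whatever decoder the target voice recognition model actually uses.

```python
import torch

def recognize(target_model, target_voice_data, id_to_char):
    """target_voice_data: 1-D waveform tensor; returns the decoded text information."""
    target_model.eval()
    with torch.no_grad():
        logits = target_model(target_voice_data.unsqueeze(0))   # (1, frames, vocab): assumed output shape
    ids = logits.argmax(dim=-1).squeeze(0).tolist()
    return "".join(id_to_char[i] for i in ids)
```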
7. A model training apparatus storing a feature extraction model and a trained voice recognition model, the apparatus comprising:
the acquisition module is used for acquiring voice data serving as a training sample, performing first signal processing on the voice data to obtain first voice data, and performing second signal processing on the voice data to obtain second voice data; wherein the first signal processing comprises sample rate adjustment, encoding format adjustment, compression and/or decompression, the second signal processing comprises sample rate adjustment, encoding format adjustment, compression and/or decompression, and the first signal processing is not identical to the second signal processing;
the extraction module is used for inputting the first voice data and the second voice data into the feature extraction model so as to extract a first feature vector of the first voice data and a second feature vector of the second voice data;
the calculation module is used for calculating mutual information between the first voice data and the second voice data according to the first feature vector and the second feature vector;
the determining module is used for determining whether the feature extraction model converges according to mutual information between the first voice data and the second voice data;
the updating module is used for updating the model parameters of the feature extraction model if the feature extraction model has not converged, and for continuing to train the updated feature extraction model on the training sample until the feature extraction model converges;
the fusion module is used for fusing the converged feature extraction model and the trained voice recognition model to obtain a fusion model;
and the fine tuning module is used for fine tuning the fusion model to obtain a target voice recognition model.
8. A server storing a feature extraction model and a trained voice recognition model, the server comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the model training method of any one of claims 1 to 5 or of the speech recognition method of claim 6.
9. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the model training method of any one of claims 1 to 5 or of the speech recognition method of claim 6.
CN202110484676.2A 2021-04-30 2021-04-30 Model training method, voice recognition method, device, server and storage medium Active CN113257235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110484676.2A CN113257235B (en) 2021-04-30 2021-04-30 Model training method, voice recognition method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110484676.2A CN113257235B (en) 2021-04-30 2021-04-30 Model training method, voice recognition method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN113257235A CN113257235A (en) 2021-08-13
CN113257235B true CN113257235B (en) 2023-01-03

Family

ID=77223461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110484676.2A Active CN113257235B (en) 2021-04-30 2021-04-30 Model training method, voice recognition method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113257235B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105786A (en) * 2019-12-26 2020-05-05 苏州思必驰信息科技有限公司 Multi-sampling-rate voice recognition method, device, system and storage medium
CN111243576A (en) * 2020-01-16 2020-06-05 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111816165A (en) * 2020-07-07 2020-10-23 北京声智科技有限公司 Voice recognition method and device and electronic equipment
CN112435656A (en) * 2020-12-11 2021-03-02 平安科技(深圳)有限公司 Model training method, voice recognition method, device, equipment and storage medium
WO2021057029A1 (en) * 2019-09-24 2021-04-01 京东数字科技控股有限公司 Voice recognition method and apparatus, and computer-readale storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107221320A (en) * 2017-05-19 2017-09-29 百度在线网络技术(北京)有限公司 Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model
US11335329B2 (en) * 2019-08-28 2022-05-17 Tata Consultancy Services Limited Method and system for generating synthetic multi-conditioned data sets for robust automatic speech recognition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"基于DNN声学模型的说话人自适应方法研究";闫贝贝;《中国优秀硕士学位论文全文数据库》;20200215;I136-418 *
"面向语音识别的跨信道模型研究";杜文强;《中国优秀硕士学位论文全文数据库》;20210315;I136-93 *

Also Published As

Publication number Publication date
CN113257235A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN110162669B (en) Video classification processing method and device, computer equipment and storage medium
US20230021306A9 (en) Neural network method and apparatus
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
CN111488985B (en) Deep neural network model compression training method, device, equipment and medium
CN111104954A (en) Object classification method and device
CN116580257A (en) Feature fusion model training and sample retrieval method and device and computer equipment
CN111950295A (en) Method and system for training natural language processing model
KR20220098991A (en) Method and apparatus for recognizing emtions based on speech signal
CN115083435B (en) Audio data processing method and device, computer equipment and storage medium
CN115359314A (en) Model training method, image editing method, device, medium and electronic equipment
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN114360520A (en) Training method, device and equipment of voice classification model and storage medium
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN113762503A (en) Data processing method, device, equipment and computer readable storage medium
CN113988156A (en) Time series clustering method, system, equipment and medium
CN110570877B (en) Sign language video generation method, electronic device and computer readable storage medium
CN113257235B (en) Model training method, voice recognition method, device, server and storage medium
CN116935166A (en) Model training method, image processing method and device, medium and equipment
CN116741155A (en) Speech recognition method, training method, device and equipment of speech recognition model
CN115204366A (en) Model generation method and device, computer equipment and storage medium
CN111444895B (en) Video processing method, device, electronic equipment and storage medium
CN110780850B (en) Requirement case auxiliary generation method and device, computer equipment and storage medium
CN113378866A (en) Image classification method, system, storage medium and electronic device
CN113469197A (en) Image-text matching method, device, equipment and storage medium
CN112668343A (en) Text rewriting method, electronic device and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant