CN117012181A - Speech recognition method and device and computer equipment - Google Patents

Speech recognition method and device and computer equipment

Info

Publication number
CN117012181A
CN117012181A CN202210647855.8A
Authority
CN
China
Prior art keywords
information
recognition model
decoding
coding
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210647855.8A
Other languages
Chinese (zh)
Inventor
朱紫薇
杨文文
潘逸倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210647855.8A priority Critical patent/CN117012181A/en
Publication of CN117012181A publication Critical patent/CN117012181A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/26 Speech to text systems
    • G10L 2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application discloses a voice recognition method, a voice recognition device and computer equipment; the embodiment of the application can acquire the first voice recognition model, the second voice recognition model and the voice training sample; encoding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information and second encoding information; decoding the first encoded information by using the first voice recognition model to obtain first decoded information, and decoding the second encoded information by using the second voice recognition model to obtain second decoded information; calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information; and adjusting model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model, so that the efficiency of adapting the voice recognition model to various voice scenes can be improved.

Description

Speech recognition method and device and computer equipment
Technical Field
The application relates to the technical field of computers, in particular to a voice recognition method, a voice recognition device and computer equipment.
Background
Speech recognition aims at converting lexical content in human speech into computer-readable inputs, such as controls, binary codes or character sequences. With the rapid development of artificial intelligence technology, speech recognition can be performed using artificial intelligence models. In addition, with the progress of data processing technology and the rapid popularization of the mobile internet, speech recognition is widely used in various fields of society. The speech scenes corresponding to different fields can differ, and a plurality of different speech scenes can exist within the same field, which brings great challenges to applying artificial intelligence technology to speech recognition.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, a voice recognition device and computer equipment, which can improve the efficiency of adapting a voice recognition model to various voice scenes and improve the accuracy of voice recognition.
The embodiment of the application provides a voice recognition method, which comprises the following steps:
acquiring a first voice recognition model, a second voice recognition model and a voice training sample for recognizing voice data of a first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene;
Encoding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information corresponding to the second voice recognition model;
decoding the first encoded information by using the first speech recognition model to obtain first decoded information, and decoding the second encoded information by using the second speech recognition model to obtain second decoded information;
calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information;
and adjusting model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
Correspondingly, the embodiment of the application also provides a voice recognition device, which comprises:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a first voice recognition model, a second voice recognition model and a voice training sample for recognizing voice data of a first voice scene, and the voice training sample comprises a voice training sample corresponding to the second voice scene;
The coding unit is used for coding the voice training sample by utilizing the first voice recognition model and the second voice recognition model to obtain first coding information corresponding to the first voice recognition model and second coding information corresponding to the second voice recognition model;
the decoding unit is used for decoding the first coding information by utilizing the first voice recognition model to obtain first decoding information, and decoding the second coding information by utilizing the second voice recognition model to obtain second decoding information;
a calculation unit configured to calculate coding loss information between the first coding information and the second coding information, and to calculate decoding loss information between the first decoding information and the second decoding information;
and the adjusting unit is used for adjusting the model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
In an embodiment, the computing unit may include:
a first calculation subunit configured to calculate alignment decoding loss information between the first alignment decoding information and the second alignment decoding information;
A second calculation subunit operable to calculate attention decoding loss information between the first and second attention decoding information;
and the first fusion subunit is used for fusing the aligned decoding loss information and the attention decoding loss information to obtain the decoding loss information.
In an embodiment, the first computing subunit may include:
the dividing module is used for dividing the first alignment decoding information and the second alignment decoding information to obtain dividing information;
the logarithmic operation module is used for carrying out logarithmic operation on the division information to obtain information after operation;
and the multiplication module is used for calculating the multiplication of the calculated information and the second alignment decoding information to obtain the alignment decoding loss information.
In an embodiment, the computing unit may include:
a third calculation subunit, configured to calculate coding loss sub-information between each first coding sub-information and the corresponding second coding sub-information;
and the second fusion subunit is used for fusing the plurality of coding loss sub-information to obtain the coding loss information.
In an embodiment, the second fusion subunit may include:
The sorting module is used for sorting each coding loss sub-information based on the numerical value of each coding loss information to obtain sorted coding loss sub-information;
the determining module is used for determining fusion parameters corresponding to each coding loss sub-information according to the ordered coding loss sub-information;
and the first arithmetic operation module is used for carrying out arithmetic operation on each coding loss sub-information and the corresponding fusion parameter thereof to obtain the coding loss information.
In an embodiment, the adjusting unit may include:
a fourth calculating subunit, configured to calculate label loss information between second decoding information corresponding to the second speech recognition model and a preset label;
the third fusion subunit is used for fusing the coding loss information and the decoding loss information to obtain model loss information;
the fourth fusion subunit is used for fusing the model loss information and the label loss information to obtain target fusion information;
and the adjusting subunit is used for adjusting the model parameters of the second voice recognition model based on the target fusion information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
In an embodiment, the third fusion subunit may include:
the comparison module is used for comparing the coding loss information, the alignment decoding loss information and the attention decoding loss information to obtain a comparison result;
the generation module is used for respectively generating fusion parameters corresponding to the coding loss information, the alignment decoding loss information and the attention decoding loss information based on the comparison result;
and the second arithmetic operation module is used for respectively carrying out arithmetic operation on the coding loss information, the alignment decoding loss information, the attention decoding loss information and the corresponding fusion parameters thereof to obtain the model loss information.
In an embodiment, the encoding unit may include:
the forward processing subunit is used for performing forward processing on the voice training sample by utilizing the first voice recognition model and the second voice recognition model to obtain first forward information corresponding to the first voice recognition model and second forward information corresponding to the second voice recognition model;
the self-attention feature extraction subunit is configured to perform self-attention feature extraction on the first forward information by using the first speech recognition model to obtain first self-attention feature information, and perform self-attention feature extraction on the second forward information by using the second speech recognition model to obtain second self-attention feature information;
The convolution operation subunit is configured to perform convolution operation on the first self-attention feature information by using the first speech recognition model to obtain first encoded information corresponding to the first speech recognition model, and perform convolution operation on the second self-attention information by using the second speech recognition model to obtain second encoded information corresponding to the second speech recognition model.
In an embodiment, the forward processing subunit may include:
the normalization processing module is used for carrying out normalization processing on the voice training sample by utilizing the first voice recognition model and the second voice recognition model to obtain first normalized information and second normalized information;
the linear conversion module is used for carrying out linear conversion on the first normalized information by utilizing the first voice recognition model to obtain first linear converted information, and carrying out linear conversion on the second normalized information by utilizing the second voice recognition model to obtain second linear converted information;
and the nonlinear activation module is used for carrying out nonlinear activation on the first linear converted information by utilizing the first voice recognition model to obtain the first forward information, and carrying out nonlinear activation on the second linear converted information by utilizing the second voice recognition model to obtain the second forward information.
In an embodiment, the decoding unit may include:
an alignment processing subunit, configured to perform alignment processing on the first encoded information by using the first speech recognition model to obtain first aligned information, and perform alignment processing on the second encoded information by using the second speech recognition model to obtain second aligned information;
the distribution fitting subunit is used for carrying out distribution fitting on the first coding information by utilizing the first voice recognition model to obtain first distribution information, and carrying out distribution fitting on the second coding information by utilizing the second voice recognition model to obtain second distribution information;
and the correction subunit is used for correcting the first distribution information by using the first aligned information to obtain the first decoding information, and correcting the second distribution information by using the second aligned information to obtain the second decoding information.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the various alternative implementations of the above aspect.
Correspondingly, the embodiment of the application also provides a storage medium, and the storage medium stores instructions which are executed by a processor to realize the voice recognition method provided by any one of the embodiments of the application.
The embodiment of the application can acquire a first voice recognition model, a second voice recognition model and a voice training sample for recognizing the voice data of the first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene; encoding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information of the second voice recognition model; decoding the first encoded information by using the first speech recognition model to obtain first decoded information, and decoding the second encoded information by using the second speech recognition model to obtain second decoded information; calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information; the model parameters of the second voice recognition model are adjusted based on the coding loss information and the decoding loss information, so that a target voice recognition model for recognizing the voice data in the second voice scene is obtained, the efficiency of adapting the voice recognition model to various voice scenes can be improved, and the accuracy of voice recognition can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a scenario based on a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a voice recognition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another scenario based on a speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort shall fall within the scope of protection of the application.
The embodiment of the application provides a voice recognition method, which can be executed by a voice recognition device, and the voice recognition device can be integrated in a computer device. The computer device may include a terminal and a server.
The terminal may be a notebook computer, a personal computer (Personal Computer, PC), an on-board computer, or the like.
The server may be an interworking server or a background server among a plurality of heterogeneous systems, an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data and artificial intelligence platforms.
In an embodiment, as shown in fig. 1, the voice recognition apparatus may be integrated on a computer device such as a terminal or a server, so as to implement the voice recognition method according to the embodiment of the present application. Specifically, the computer device may obtain a first speech recognition model, a second speech recognition model, and a speech training sample for recognizing speech data of a first speech scene, where the speech training sample includes a speech training sample corresponding to the second speech scene; encoding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information of the second voice recognition model; decoding the first encoded information by using the first speech recognition model to obtain first decoded information, and decoding the second encoded information by using the second speech recognition model to obtain second decoded information; calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information; and adjusting model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
The following detailed description is given, respectively, of the embodiments, and the description sequence of the following embodiments is not to be taken as a limitation of the preferred sequence of the embodiments.
The embodiments of the present application will be described in terms of a speech recognition device that may be integrated into a computer device, which may be a server or a terminal, etc.
As shown in fig. 2, a voice recognition method is provided, and the specific process includes:
101. the method comprises the steps of obtaining a first voice recognition model, a second voice recognition model and a voice training sample for recognizing voice data of a first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene.
Wherein the first speech recognition model and the second speech recognition model may be artificial intelligence models.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning, and the like.
For example, the first and second speech recognition models may be at least one of a convolutional neural network (Convolutional Neural Networks, CNN), a deconvolution neural network (De-Convolutional Networks, DN), a deep neural network (Deep Neural Networks, DNN), a deep convolutional inverse graphics network (Deep Convolutional Inverse Graphics Networks, DCIGN), a region-based convolutional network (Region-based Convolutional Networks, RCNN), a faster region-based convolutional network (Faster Region-based Convolutional Networks, Faster RCNN), a Bidirectional Encoder Representations from Transformers (BERT) model, a Transformer model, a Conformer model, and the like.
In an embodiment, the first speech recognition model may be used to recognize speech data of the first speech scene. For example, the first speech recognition model may be a model trained by using speech data of a first speech scene, where the first speech recognition model has a better recognition effect on the speech data of the first speech recognition scene, and has a worse recognition effect on the speech data of other speech scenes.
In an embodiment, the second speech recognition model may be a model to be trained, and by training the second speech recognition model, a target speech recognition model for recognizing speech data in the second speech scene may be obtained.
Wherein the speech scene may be an abstract overview of the same type of speech data. For example, the speech scene may include a general speech scene, a conference speech scene, a variety speech scene, a movie speech scene, a television show speech scene, a learning speech scene, and so on.
For example, a general speech scene may include all speech scenes. When the speech scene corresponding to a speech recognition model is the general speech scene, this means the model was trained with speech data of various scenes; such a model has a good recognition effect on speech data of various scenes, but compared with a speech recognition model targeted at one particular speech scene, its recognition effect on that scene is relatively poorer. For example, suppose the speech scene targeted by speech recognition model A is the general speech scene, while the speech scene targeted by speech recognition model B is a conference speech scene. Although speech recognition model A can achieve a good recognition effect on speech data of various scenes, its recognition effect on the conference speech scene is worse than that of speech recognition model B.
As another example, the voice data of the conference voice scene may include a dialog of the conference. As another example, the speech data of a movie speech scene may include a dialogue or a score in a movie, etc. As another example, the voice data of the television voice scene may include a dialogue or a score in a television film, and so on.
In an embodiment, the first speech scene and the second speech scene are different speech scenes. For example, the first speech scene may be a general speech scene, while the second speech scene may be a variety speech scene, and so on.
In an embodiment, the first speech recognition model and the target speech recognition model may each be directed at multiple speech scenes. For example, the first speech scene corresponding to the first speech recognition model may include a general speech scene and a variety speech scene, and so on. For example, the speech scenes for which the target speech recognition model is directed may include a general speech scene, a variety speech scene and a movie speech scene, among others.
In an embodiment, the model structure and model parameters of the first speech recognition model and the second speech recognition model may be the same. For example, the second speech recognition model may be the first speech recognition model, i.e. the second speech recognition model may be a model that already has the "knowledge" of the first speech recognition model. Then, when the second speech recognition model is trained with the speech training sample, it already has a certain foundation, so the target speech recognition model can be obtained after a small number of iterations, which improves the training efficiency. In addition, the second speech recognition model can reuse the knowledge learned by the first speech recognition model, so the range of speech scenes to which the second speech recognition model is adapted can be expanded. For example, the first speech scene corresponding to the first speech recognition model may include a general speech scene and a variety speech scene, while the training purpose of the second speech recognition model is to accurately recognize the speech data of a movie speech scene. By setting the model structure and model parameters of the second speech recognition model to be the same as those of the first speech recognition model, the target speech recognition model obtained through training can accurately recognize the speech data of the movie speech scene as well as the speech data of the general speech scene and the variety speech scene.
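For illustration, the following sketch (in PyTorch; the helper name is hypothetical) shows one way the second speech recognition model could be initialized with the structure and parameters of the first speech recognition model, with the first model kept fixed during the subsequent training:

    import copy
    import torch.nn as nn

    def init_student_from_teacher(teacher: nn.Module) -> nn.Module:
        # The second (student) model shares the first (teacher) model's structure
        # and starts from its parameters, so it already carries the teacher's "knowledge".
        student = copy.deepcopy(teacher)
        # The teacher only assists training; freeze it so that only the student is updated.
        for p in teacher.parameters():
            p.requires_grad = False
        return student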
In an embodiment, the voice training samples may include voice training samples corresponding to the second voice scene. For example, when the second speech scene is a movie speech scene, the speech training sample may include speech data corresponding to the movie scene. For another example, when the second speech scene includes a general speech scene and a movie speech scene, the speech training sample may include speech data corresponding to the general speech scene and the movie speech scene.
In an embodiment, when the model structures and model parameters of the first speech recognition model and the second speech recognition model are the same, if there is an overlap between the speech scenes targeted by the first speech recognition model and the second speech recognition model, the proportions of the speech data corresponding to different speech scenes in the speech training sample may be tilted accordingly.
For example, the first speech scene may be a general scene and an A scene, while the second speech scene may be a general scene, an A scene and a B scene. At this time, the first speech scene and the second speech scene overlap. Because the model structure and model parameters of the second speech recognition model are the same as those of the first speech recognition model, the second speech recognition model already has the ability to recognize speech data of the general scene and the A scene. Training can therefore focus on the ability of the second speech recognition model to recognize speech data of the B scene, and the speech training sample can include more speech data corresponding to the B scene. For example, 70% of the data in the speech training sample may be speech data corresponding to the B scene, while the remaining 30% may be speech data of the general scene and the A scene. By setting the speech training sample in this way, the second speech recognition model can accurately recognize the speech data of the B scene as well as the speech data of the general scene and the A scene.
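As a hedged illustration of the tilted sample composition described above (the helper and the 70%/30% split are only examples, not values required by the application):

    import random

    def build_training_mix(scene_b_samples, general_and_a_samples, ratio_b=0.7, total=10000):
        # Tilt the mix toward the new B scene while keeping some general-scene and
        # A-scene data so the abilities inherited from the first model are preserved.
        n_b = int(total * ratio_b)
        mix = random.choices(scene_b_samples, k=n_b)
        mix += random.choices(general_and_a_samples, k=total - n_b)
        random.shuffle(mix)
        return mix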
102. And coding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first coding information corresponding to the first voice recognition model and second coding information of the second voice recognition model.
In one embodiment, the speech training sample is typically a section of sound wave. When the computer device processes the speech training sample, the speech training sample needs to be converted into a form that the computer device can process, so as to obtain the encoded information. For example, the speech training sample may be converted into a binary character string, or into a character sequence, and so on. The process of converting the speech training sample into a form that can be processed by the computer device may be referred to as encoding. The encoded information obtained by encoding can characterize the acoustic features of the speech training sample.
Acoustic features refer to physical quantities representing the acoustic properties of speech, such as the energy concentration regions, formant frequencies, formant intensities and bandwidths that represent timbre, and the duration, fundamental frequency and average speech power that represent the prosodic features of speech.
In an embodiment, the first speech recognition model and the second speech recognition model may be used to encode the speech training sample to obtain first encoded information corresponding to the first speech recognition model and second encoded information corresponding to the second speech recognition model.
The speech training sample can be encoded by using the first speech recognition model and the second speech recognition model in various modes, so as to obtain first encoding information corresponding to the first speech recognition model and second encoding information of the second speech recognition model.
In an embodiment, the step of "encoding the speech training samples using the first speech recognition model and the second speech recognition model to obtain the first encoded information corresponding to the first speech recognition model and the second encoded information corresponding to the second speech recognition model" may include:
performing forward processing on the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first forward information corresponding to the first voice recognition model and second forward information corresponding to the second voice recognition model;
performing self-attention feature extraction on the first forward information by using the first voice recognition model to obtain first self-attention feature information, and performing self-attention feature extraction on the second forward information by using the second voice recognition model to obtain second self-attention feature information;
and performing convolution operation on the first self-attention characteristic information by using the first voice recognition model to obtain first coding information corresponding to the first voice recognition model, and performing convolution operation on the second self-attention information by using the second voice recognition model to obtain second coding information of the second voice recognition model.
In one embodiment, forward processing the speech training samples may include: carrying out normalization (normalization) processing on the voice training sample to obtain normalized information; performing linear conversion on the normalized information to obtain linear converted information; and performing nonlinear activation on the linearly converted information to obtain forward information.
For example, the first speech recognition model may normalize the speech training sample to obtain first normalized information; performing linear conversion on the first normalized information to obtain first linear converted information; and performing nonlinear activation on the first linear converted information to obtain first forward information.
For another example, the second speech recognition model may normalize the speech training sample to obtain second normalized information; performing linear conversion on the second normalized information to obtain second linear converted information; and carrying out nonlinear activation on the second linearly converted information to obtain second forward information.
In one embodiment, the speech training samples may be normalized in a number of ways. For example, the speech training samples may be normalized using layer normalization (Layer Normalization). For another example, the speech training samples may be normalized using batch normalization (Batch Normalization), which normalizes based on a standard normal distribution.
In an embodiment, the normalized information may be linearly converted by using a linear function to obtain linear converted information, and the linear converted information may be non-linearly activated by using a non-linear activation function to obtain forward information.
For example, the linear converted information may be non-linearly activated by using a Swish non-linear activation function, a Sigmoid non-linear activation function, or a Tanh non-linear activation function, to obtain the forward information.
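A minimal sketch of this forward processing (normalization, linear conversion, Swish nonlinear activation), assuming PyTorch; the dimensions and the projection back to the model width are illustrative assumptions:

    import torch
    import torch.nn as nn

    class ForwardModule(nn.Module):
        def __init__(self, dim: int = 256, hidden: int = 1024):
            super().__init__()
            self.norm = nn.LayerNorm(dim)         # normalization
            self.linear = nn.Linear(dim, hidden)  # linear conversion
            self.act = nn.SiLU()                  # Swish nonlinear activation
            self.proj = nn.Linear(hidden, dim)    # assumed projection back to the model width

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, frames, dim) -> forward information of the same shape
            return self.proj(self.act(self.linear(self.norm(x))))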
In an embodiment, the self-attention feature information may be obtained by performing self-attention feature extraction on the forward information based on a self-attention mechanism.
For example, a self-attention mechanism may be included in the first speech recognition model, and then the first speech recognition model may perform self-attention feature extraction on the first forward information based on the self-attention mechanism to obtain the first self-attention feature information.
For another example, a self-attention mechanism may be included in the second speech recognition model, and then the second speech recognition model may perform self-attention feature extraction on the second forward information based on the self-attention mechanism to obtain second self-attention feature information.
The attention mechanism mimics the internal process of biological observation behavior: it is a mechanism that aligns internal experience with external sensation to increase the observation precision of a partial region. The self-attention mechanism is an improvement of the attention mechanism that reduces the reliance on external information and is better at capturing the internal dependencies of data or features. The accuracy of information processing can be further improved by the self-attention mechanism.
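As an illustration, the self-attention feature extraction could be realized with a standard multi-head self-attention layer; the sketch below uses assumed dimensions and is not the application's exact operator:

    import torch
    import torch.nn as nn

    dim = 256
    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

    forward_info = torch.randn(8, 120, dim)  # (batch, frames, dim) forward information
    # Queries, keys and values all come from the same sequence, so the layer
    # captures dependencies inside the forward information itself.
    self_attn_features, _ = attn(forward_info, forward_info, forward_info)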
In one embodiment, the self-attention feature information may be convolved to obtain the encoded information.
For example, point-by-point convolution (Pointwise Conv) processing may be performed on the self-attention feature information to obtain point-by-point convolution information; the point-by-point convolution information may be operated on based on a gating mechanism to obtain operated information; and depth-separated convolution may be performed on the operated information to obtain the encoded information.
For example, the first speech recognition model may perform a point-by-point convolution (Pointwise Conv) process on the first self-attention feature information to obtain first point-by-point convolution information; the first voice recognition model can operate the first point-by-point convolution information based on a gating mechanism to obtain first operated information; the first speech recognition model performs a depth separation convolution (Depthwise Conv) on the first operated information to obtain first encoded information.
For example, the second speech recognition model may perform a point-by-point convolution (Pointwise Conv) process on the second self-attention feature information to obtain second point-by-point convolution information; the second speech recognition model can operate the second point-by-point convolution information based on a gating mechanism to obtain second operated information; the second speech recognition model performs a depth separation convolution (Depthwise Conv) on the second calculated information to obtain second encoded information.
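A sketch of this convolution operation (point-by-point convolution, a gating mechanism, then depth-separated convolution), following the usual Conformer-style convolution block; the channel and kernel sizes are assumptions:

    import torch
    import torch.nn as nn

    class ConvModule(nn.Module):
        def __init__(self, dim: int = 256, kernel_size: int = 15):
            super().__init__()
            self.pointwise = nn.Conv1d(dim, 2 * dim, kernel_size=1)  # point-by-point convolution
            self.gate = nn.GLU(dim=1)                                # gating mechanism
            self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                       padding=kernel_size // 2, groups=dim)  # depth-separated convolution

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, frames, dim); Conv1d expects (batch, channels, frames)
            y = x.transpose(1, 2)
            y = self.gate(self.pointwise(y))  # gating halves the doubled channels back to dim
            y = self.depthwise(y)
            return y.transpose(1, 2)          # encoded information, same shape as the input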
In an embodiment, the first speech recognition model may include an encoder, and the speech training sample may be encoded by using the encoder in the first speech recognition model to obtain first encoded information corresponding to the first speech recognition model.
Wherein the encoder may also be an artificial intelligence model. For example, the encoder may be a Transformer model or a Conformer model, or the like.
In an embodiment, the second speech recognition model may include an encoder, and the speech training sample may be encoded by using the encoder in the second speech recognition model to obtain second encoded information corresponding to the second speech recognition model.
In an embodiment, the encoder in the first speech recognition model may comprise a number of Conformer modules. Similarly, the encoder in the second speech recognition model may also include a number of Conformer modules. For example, the encoder in the first speech recognition model may include n Conformer modules, and similarly, the encoder in the second speech recognition model may also include n Conformer modules. Wherein n is a positive integer greater than or equal to 1.
For example, when encoding the speech training samples with an encoder in the first speech recognition model, the n Conformer modules may be used to encode the speech training samples. For example, the 1 st Conformer module may be used to encode the speech training samples to obtain encoded sub-information. Then, the 2 nd Conformer module can be utilized to encode the encoded sub-information output by the 1 st Conformer module, so as to obtain the encoded sub-information corresponding to the 2 nd Conformer module, and so on until the nth Conformer module outputs so as to obtain the first encoded information.
For another example, when the speech training samples are encoded with an encoder in the second speech recognition model, the speech training samples may be encoded with the n Conformer modules. For example, the 1 st Conformer module may be used to encode the speech training samples to obtain encoded sub-information. Then, the 2 nd Conformer module can be utilized to encode the encoded sub-information output by the 1 st Conformer module to obtain the encoded sub-information corresponding to the 2 nd Conformer module, and so on until the nth Conformer module outputs to obtain the second encoded information.
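A sketch of chaining the n encoder modules so that each module encodes the output of the previous one while every module's encoded sub-information is kept (the inner blocks are simple placeholders standing in for Conformer modules):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, num_modules: int = 3, dim: int = 256):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.SiLU())
                for _ in range(num_modules)
            )

        def forward(self, features: torch.Tensor):
            sub_infos = []
            x = features
            for block in self.blocks:
                x = block(x)         # module i encodes the output of module i-1
                sub_infos.append(x)  # encoded sub-information at this information depth
            return x, sub_infos      # final encoded information plus per-module outputs

Keeping the per-module outputs is what later allows one coding loss sub-information term to be computed for each information depth.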
103. And decoding the first encoded information by using the first voice recognition model to obtain first decoded information, and decoding the second encoded information by using the second voice recognition model to obtain second decoded information.
In an embodiment, after converting the voice training samples into the encoded information, what the labels corresponding to the voice training samples are may be determined based on the encoded information, for example, what the text information corresponding to the voice training samples are may be determined based on the encoded information, and so on. This process of determining what the label corresponding to the speech training samples is based on the encoded information may be referred to as a decoding process.
For example, decoding the encoded information may refer to identifying what text information corresponding to the speech training samples is based on the encoded information, where the text information may be the decoded information.
There are various ways to decode the encoded information to obtain decoded information.
In an embodiment, the step of "decoding the first encoded information using the first speech recognition model to obtain first decoded information and decoding the second encoded information using the second speech recognition model to obtain second decoded information" may include:
performing alignment processing on the first coding information by using a first voice recognition model to obtain first aligned information, and performing alignment processing on the second coding information by using a second voice recognition model to obtain second aligned information;
performing distribution fitting on the first coding information by using a first voice recognition model to obtain first distribution information, and performing distribution fitting on the second coding information by using a second voice recognition model to obtain second distribution information;
and correcting the first distribution information by using the first aligned information to obtain the first decoding information, and correcting the second distribution information by using the second aligned information to obtain the second decoding information.
In one embodiment, the speech recognition model may align the encoded information based on an alignment mechanism (Connectionist Temporal Classification, CTC) to obtain aligned information. For example, the first speech recognition model may perform alignment processing on the first encoded information based on the CTC mechanism to obtain first aligned information, and the second speech recognition model may perform alignment processing on the second encoded information based on the CTC mechanism to obtain second aligned information.
In one embodiment, the speech recognition model may perform a distribution fit on the encoded information based on an Attention (Attention) mechanism to obtain distribution information.
For example, a speech recognition model may model encoded information from left to right to represent past context information when performing a distributed fit on the encoded information based on an attention mechanism. In addition, the speech recognition model may also model the encoded information from right to left based on the attention mechanism to represent future context information. Then, the information of both directions may be mapped to distribution information corresponding to the encoded information.
In an embodiment, the distribution information of the encoded information may also be determined in combination with local and global features of the encoded information. Specifically, the step of performing distribution fitting on the first encoded information by using the first speech recognition model to obtain first distribution information, and performing distribution fitting on the second encoded information by using the second speech recognition model to obtain second distribution information may include:
extracting global features of the first encoded information by using the first speech recognition model, and extracting global features of the second encoded information by using the second speech recognition model;
extracting local features of the first coding information by using the first voice recognition model, and extracting local features of the second coding information by using the second voice recognition model;
fusing the global features and the local features of the first coding information to obtain target characterization features of the first coding information, and fusing the global features and the local features of the second coding information to obtain target characterization features of the second coding information;
mapping the target characterization features of the first coding information into a preset distribution space to obtain first distribution information, and mapping the target characterization features of the second coding information into the preset distribution space to obtain second distribution information.
For example, the first speech recognition model may extract global features of the first encoded information using convolution kernels. Further, the first speech recognition model may extract local features of the first encoded information based on the attention mechanism. Then, the global feature and the local feature of the first coding information can be fused to obtain the target characterization feature of the first coding information. And then, mapping the target characterization features of the first coding information into a preset distribution space to obtain first distribution information.
For example, the second speech recognition model may extract global features of the second encoded information using convolution kernels. Further, the second speech recognition model may extract local features of the second encoded information based on the attention mechanism. Then, the global features and local features of the second encoded information can be fused to obtain the target characterization features of the second encoded information. Then, the target characterization features of the second encoded information are mapped into the preset distribution space to obtain the second distribution information.
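For illustration, this distribution fitting could be sketched as follows (assuming PyTorch; using a convolution kernel for one feature view and attention for the other, with additive fusion and a softmax mapping as illustrative choices, not values mandated by the application):

    import torch
    import torch.nn as nn

    class DistributionFit(nn.Module):
        def __init__(self, dim: int = 256, vocab: int = 5000):
            super().__init__()
            self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)              # convolution-kernel features
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # attention-based features
            self.out = nn.Linear(dim, vocab)

        def forward(self, enc: torch.Tensor) -> torch.Tensor:
            # enc: (batch, frames, dim) encoded information
            conv_feat = self.conv(enc.transpose(1, 2)).transpose(1, 2)
            attn_feat, _ = self.attn(enc, enc, enc)
            fused = conv_feat + attn_feat                  # fuse the two feature views into target characterization features
            return torch.softmax(self.out(fused), dim=-1)  # distribution information over output tokens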
In an embodiment, the speech recognition model may include an alignment mechanism based decoder and an attention mechanism based decoder.
A decoder based on the alignment mechanism (Connectionist Temporal Classification, CTC) can decode the encoded information into alignment decoding information, and a decoder based on the attention (Attention) mechanism can decode the encoded information into attention decoding information. The attention decoding information may then be corrected by combining the alignment decoding information and the attention decoding information, to obtain the decoding information.
For example, the first speech recognition model may include a decoder based on the alignment mechanism and a decoder based on the attention mechanism. The first speech recognition model may decode the first encoded information into first alignment decoding information through the alignment-mechanism-based decoder, and decode the first encoded information into first attention decoding information through the attention-mechanism-based decoder. Then, the first speech recognition model may correct the first attention decoding information by combining the first alignment decoding information and the first attention decoding information, to obtain the first decoding information.
For example, the second speech recognition model may include a decoder based on the alignment mechanism and a decoder based on the attention mechanism. The second speech recognition model may decode the second encoded information into second alignment decoding information through the alignment-mechanism-based decoder, and decode the second encoded information into second attention decoding information through the attention-mechanism-based decoder. Then, the second speech recognition model may correct the second attention decoding information by combining the second alignment decoding information and the second attention decoding information, to obtain the second decoding information.
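A sketch of producing alignment decoding information and attention decoding information from the same encoded information and then correcting the latter with the former; the linear heads and the interpolation weight are assumptions, not values from the application:

    import torch
    import torch.nn as nn

    dim, vocab = 256, 5000
    ctc_head = nn.Linear(dim, vocab)   # stands in for the alignment-mechanism (CTC) decoder
    attn_head = nn.Linear(dim, vocab)  # stands in for the attention-mechanism decoder

    encoded = torch.randn(8, 120, dim)
    align_decoded = torch.log_softmax(ctc_head(encoded), dim=-1)  # alignment decoding information
    attn_decoded = torch.log_softmax(attn_head(encoded), dim=-1)  # attention decoding information

    # Correct the attention decoding information by combining it with the
    # alignment decoding information (0.3 is only an illustrative weight).
    decoded = 0.3 * align_decoded + 0.7 * attn_decoded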
104. Coding loss information between the first coding information and the second coding information is calculated, and decoding loss information between the first decoding information and the second decoding information is calculated.
In an embodiment, when the second speech recognition model is trained, the first speech recognition model may be used to assist the second speech recognition model in training, so that the target speech recognition model may be used to recognize both speech data in the second speech scene and speech data in the first speech scene.
Therefore, it is possible to calculate coding loss information between the first coding information and the second coding information, and to calculate decoding loss information between the first decoding information and the second decoding information, and then adjust model parameters of the second speech recognition model based on the coding loss information and the decoding loss information, to obtain a target speech recognition model for recognizing the second speech scene and the speech data in the first speech scene.
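Putting these pieces together, one parameter-update step could look like the hedged sketch below: the frozen first (teacher) model and the trainable second (student) model encode and decode the same sample, the coding loss and decoding loss are computed between their outputs, and only the student's parameters are adjusted. The encode/decode methods and loss functions are placeholders, not APIs defined by the application:

    import torch

    def train_step(teacher, student, optimizer, batch, coding_loss_fn, decoding_loss_fn):
        with torch.no_grad():
            t_enc = teacher.encode(batch)  # first encoding information
            t_dec = teacher.decode(t_enc)  # first decoding information
        s_enc = student.encode(batch)      # second encoding information
        s_dec = student.decode(s_enc)      # second decoding information

        loss = coding_loss_fn(t_enc, s_enc) + decoding_loss_fn(t_dec, s_dec)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                   # adjusts only the student's model parameters
        return loss.item()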
In an embodiment, the first encoded information may comprise a plurality of first encoded sub-information characterizing different information depths, and the second encoded information may comprise a plurality of second encoded sub-information characterizing different information depths. The step of "calculating coding loss information between the first coding information and the second coding information" may include:
calculating coding loss sub-information between each first coding sub-information and the corresponding second coding sub-information;
and fusing the plurality of coding loss sub-information to obtain the coding loss information.
In an embodiment, in order to improve accuracy of the second speech recognition model, when the speech training sample is encoded, the speech training sample may be encoded multiple times under multiple different information depths, so as to obtain encoded sub-information of multiple different information depths.
For example, the first speech recognition model and the second speech recognition model are two models having the same model structure. Wherein the first speech recognition model and the second speech recognition model both comprise encoders, which are formed by a plurality of encoding modules, wherein each encoding module can be a deep learning network. For example, each coding module may be a Conformer model, or the like.
In an embodiment, the depth of structure in each coding module may be different, thus resulting in different information depths of the encoded sub-information output by each module. For example, the encoder in the speech recognition model comprises 3 encoding modules, wherein the first encoding module is composed of neurons of layer 3, the second encoding module is composed of neurons of layer 4, and the third encoding module is composed of neurons of layer 5. Since the structure depth of each coding module is different, the information depth of the coding sub-information outputted by each module is different.
In an embodiment, the structural depth in each coding module may be the same, but due to the information transfer relationship between the coding modules, the information depth of the encoded sub information output by each module may be different. For example, the encoder in the speech recognition model includes 3 encoding modules, where the encoded sub-information output by the first encoding module is transferred to the second encoding module to continue encoding, and the encoded sub-information output by the second encoding module is also transferred to the third encoding module to continue encoding, so as to obtain a plurality of encoded sub-information with different information depths.
In an embodiment, it is assumed that the model structures of the first speech recognition model and the second speech recognition model are identical: the encoder in the first speech recognition model comprises 3 coding modules, and the encoder in the second speech recognition model also comprises 3 coding modules. The first coding sub-information output by the coding modules in the first speech recognition model may be denoted e1, e2 and e3, and the second coding sub-information output by the coding modules in the second speech recognition model may be denoted e'1, e'2 and e'3.
Then, the coding loss sub-information between each first coding sub-information and its corresponding second coding sub-information may be calculated. For example, the second coding sub-information corresponding to e1 is e'1, the second coding sub-information corresponding to e2 is e'2, and the second coding sub-information corresponding to e3 is e'3. The coding loss sub-information c1 between e1 and e'1, the coding loss sub-information c2 between e2 and e'2, and the coding loss sub-information c3 between e3 and e'3 can then be calculated.
Wherein the coding loss sub-information between the first coding sub-information and the second coding sub-information may be calculated in a number of ways. For example, the coding loss sub-information between the first coding sub-information and the second coding sub-information may be calculated based on cross entropy or KL divergence (Kullback-Leibler divergence), etc.
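A sketch of computing one coding loss sub-information term per pair of corresponding encoded sub-information with KL divergence (the softmax over the feature dimension and the reduction mode are assumptions):

    import torch
    import torch.nn.functional as F

    def coding_loss_sub_info(first_subs, second_subs):
        # One KL-divergence term per pair of corresponding encoded sub-information.
        losses = []
        for first, second in zip(first_subs, second_subs):
            log_q = F.log_softmax(second, dim=-1)  # second (student) encoded sub-information, log space
            p = F.softmax(first, dim=-1)           # first (teacher) encoded sub-information, as the target
            losses.append(F.kl_div(log_q, p, reduction="batchmean"))
        return losses                              # e.g. [c1, c2, c3]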
In an embodiment, a plurality of coding loss sub-information may be fused to obtain coding loss information. In order to balance the influence of each coding loss sub-information on the second speech recognition model, the fusion parameters corresponding to each coding loss sub-information can be determined according to the coding loss sub-information. And then, carrying out arithmetic operation on each coding loss sub-information and the corresponding fusion parameters thereof to obtain the coding loss information.
Specifically, the step of "fusing a plurality of coding loss sub-information to obtain coding loss information" may include:
sorting each coding loss sub-information based on the numerical value of each coding loss sub-information to obtain sorted coding loss sub-information;
determining fusion parameters corresponding to each coding loss sub-information according to the ordered coding loss sub-information;
and carrying out arithmetic operation on each coding loss sub-information and the corresponding fusion parameters thereof to obtain the coding loss information.
For example, each coding loss sub-information may be sorted from large to small according to its numerical value, so as to obtain the sorted coding loss sub-information.
Then, according to the sorted coding loss sub-information, a fusion parameter corresponding to each coding loss sub-information is determined. For example, when a piece of coding loss sub-information is larger, the fusion parameter corresponding to it is smaller; and when a piece of coding loss sub-information is smaller, the fusion parameter corresponding to it is larger.
Then, each coding loss sub-information and the corresponding fusion parameter thereof can be subjected to arithmetic operation to obtain the coding loss information. For example, the multiplied loss information may be obtained by multiplying each coding loss sub-information and its corresponding fusion parameter. Then, the multiplied loss information is added to obtain coding loss information.
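By way of illustration only, the fusion procedure described above can be sketched in Python/PyTorch as follows. The function name fuse_coding_losses and the inverse-rank weighting scheme are assumptions made for the sketch; the text only states that larger coding loss sub-information receives a smaller fusion parameter and that the weighted sub-losses are then added.

```python
import torch

def fuse_coding_losses(loss_subs):
    """Fuse per-layer coding loss sub-information into a single coding loss.

    loss_subs: list of scalar tensors, one per encoding module (c1 ... cn).
    Larger sub-losses receive smaller fusion parameters, so that no single
    coding module dominates the coding loss information.
    """
    values = torch.stack(loss_subs)                    # shape: (n,)
    order = torch.argsort(values, descending=True)     # sort from large to small
    n = len(loss_subs)
    # Hypothetical scheme: the k-th largest sub-loss gets weight k / (1 + 2 + ... + n),
    # so the largest sub-loss receives the smallest fusion parameter.
    ranks = torch.empty(n)
    ranks[order] = torch.arange(1, n + 1, dtype=torch.float32)
    weights = ranks / ranks.sum()
    # Arithmetic operation: multiply each sub-loss by its fusion parameter, then add.
    return (values * weights).sum()

# Usage: three per-layer losses between the first and second model's encoder outputs.
coding_loss = fuse_coding_losses([torch.tensor(0.9), torch.tensor(0.4), torch.tensor(0.2)])
```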
In an embodiment, the first decoding information may include first alignment decoding information and first attention decoding information, and the second decoding information may include second alignment decoding information and second attention decoding information. The step of "calculating decoding loss information between the first decoding information and the second decoding information" may include:
calculating alignment decoding loss information between the first alignment decoding information and the second alignment decoding information;
calculating attention decoding loss information between the first attention decoding information and the second attention decoding information;
And fusing the aligned decoding loss information and the attention decoding loss information to obtain the decoding loss information.
In an embodiment, the alignment decoding loss information between the first alignment decoding information and the second alignment decoding information may be calculated based on cross entropy or the KL Divergence, etc. For example, the alignment decoding loss information between the first alignment decoding information and the second alignment decoding information may be calculated according to the KL Divergence.
Wherein the KL Divergence operator can be as follows:
KL(p||q) = Σ_x p(x)*log(p(x)/q(x))
where p (x) may represent first alignment decoding information and q (x) may represent second alignment decoding information.
Specifically, the step of calculating the alignment decoding loss information between the first alignment decoding information and the second alignment decoding information may include:
dividing the first alignment decoding information and the second alignment decoding information to obtain divided information;
carrying out logarithmic operation on the division information to obtain information after operation;
and multiplying the calculated information by the second alignment decoding information to obtain the alignment decoding loss information.
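As an illustrative sketch only, the three steps above can be realized in Python/PyTorch as below. The tensor names are assumptions; the sketch weights the log-ratio by the first alignment decoding information, which matches the KL Divergence operator given above, and which of the two distributions weights the log-ratio is a choice of the concrete implementation.

```python
import torch

def alignment_decoding_loss(p_first, q_second, eps=1e-8):
    """KL-style loss between first and second alignment decoding information.

    p_first, q_second: tensors of identical shape holding probability
    distributions, e.g. per-frame posteriors of the first and second
    speech recognition models.
    """
    ratio = p_first.clamp_min(eps) / q_second.clamp_min(eps)   # step 1: divide
    logged = torch.log(ratio)                                  # step 2: logarithm
    return (p_first * logged).sum(dim=-1).mean()               # step 3: weight and aggregate

# Usage with dummy 4-class distributions over 10 frames:
p = torch.softmax(torch.randn(10, 4), dim=-1)
q = torch.softmax(torch.randn(10, 4), dim=-1)
loss_kl_ctc = alignment_decoding_loss(p, q)
```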
In an embodiment, the attention decoding loss information between the first and second attention decoding information may be calculated based on cross entropy or the KL Divergence, etc. For example, the attention decoding loss information between the first and second attention decoding information may be calculated according to the KL Divergence.
In an embodiment, the aligned decoding loss information and the attention decoding loss information may be fused to obtain decoding loss information.
For example, the aligned decoding loss information and the attention decoding loss information may be added to obtain the decoding loss information. For another example, the aligned decoding loss information and the attention decoding loss information may be multiplied by weight coefficients and added to each other to obtain the decoding loss information.
105. And adjusting model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
In an embodiment, after the coding loss information and the decoding loss information are calculated, model parameters of the second speech recognition model may be adjusted based on the coding loss information and the decoding loss information to obtain a target speech recognition model for recognizing the speech data in the second speech scene.
In an embodiment, tag loss information between the second decoding information corresponding to the second speech recognition model and the preset tag may also be calculated. And then, adjusting model parameters of the second voice recognition model by combining the label loss information, the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene. Specifically, the step of "adjusting the model parameters of the second speech recognition model based on the coding loss information and the decoding loss information to obtain the target speech recognition model for recognizing the speech data in the second speech scene" may include:
Calculating label loss information between second decoding information corresponding to the second voice recognition model and a preset label;
fusing the coding loss information and the decoding loss information to obtain model loss information;
fusing the model loss information and the label loss information to obtain target fusion information;
and adjusting model parameters of the second voice recognition model based on the target fusion information to obtain a target voice recognition model for recognizing voice data in the second voice scene.
In an embodiment, the label loss information between the second decoding information corresponding to the second speech recognition model and the preset label may be calculated.
For example, the second decoding information may include second alignment decoding information and second attention decoding information. Then, first loss information between the second alignment decoding information and the preset tag and second loss information between the second attention decoding information and the preset tag may be calculated. And then, fusing the first loss information and the second loss information to obtain the label loss information.
For example, the tag loss information may be represented by the following formula:
loss_ori=α*loss_CTC+(1-α)*loss_ATT
where loss_ctc may represent first loss information, loss_att may represent second loss information, and α may represent a fusion weight.
In an embodiment, the coding loss information and decoding loss information may be fused to obtain model loss information.
Wherein, in order to balance the influence of the coding loss information and the decoding loss information on the second speech recognition model, the fusion parameters corresponding to the coding loss information and the decoding loss information can be determined according to the coding loss information and the decoding loss information. And then, carrying out arithmetic operation on each piece of coding loss information, decoding loss information and corresponding fusion parameters to obtain model loss information. For example, the decoding loss information may include alignment decoding loss information and attention decoding loss information. The step of fusing the coding loss information and the decoding loss information to obtain model loss information may include:
comparing the coding loss information, the alignment decoding loss information and the attention decoding loss information to obtain a comparison result;
based on the comparison result, respectively generating fusion parameters corresponding to the coding loss information, the alignment decoding loss information and the attention decoding loss information;
and respectively carrying out arithmetic operation on the coding loss information, the alignment decoding loss information and the attention decoding loss information and the corresponding fusion parameters thereof to obtain model loss information.
For example, the fusion mode is specifically as follows:
loss_kl = weight_kl_ctc*loss_kl_ctc + weight_kl_att*loss_kl_att + weight_kl_layer*Σ_k loss_kl_layerk
where loss_kl may represent model loss information, loss_kl_ctc may represent aligned decoding loss information, loss_kl_att may represent attention decoding loss information, and loss_kl_layerk may represent the coding loss sub-information of the k-th coding module, so that Σ_k loss_kl_layerk may represent the coding loss information. weight_kl_ctc may represent the fusion weight corresponding to the aligned decoding loss information, weight_kl_att may represent the fusion weight corresponding to the attention decoding loss information, and weight_kl_layer may represent the fusion weight corresponding to the coding loss information.
In one embodiment, in order to balance the losses of all aspects in the training process, the size of each fusion weight may be determined based on the size of the corresponding loss in training. For example, because the CTC peak positions of different models differ greatly, the value of loss_kl_ctc is large while the value of loss_kl_att is small; therefore, weight_kl_att may be set to about ten times weight_kl_ctc, thereby equalizing the losses of all aspects.
In an embodiment, the model loss information and the tag loss information may be fused to obtain target fusion information. For example, it may be as follows:
loss_total=weight*loss_ori+(1-weight)*loss_kl
where loss_total may represent target fusion information and weight may represent fusion weight.
Then, model parameters of the second speech recognition model can be adjusted based on the target fusion information to obtain a target speech recognition model for recognizing the speech data in the second speech scene.
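As an illustrative sketch only, the way in which the label loss, the model loss information and the target fusion information might be assembled in one training step is shown below in Python/PyTorch; all numerical weights and tensor values are made-up placeholders, not values prescribed by this document.

```python
import torch

# Dummy scalar losses standing in for the quantities computed earlier;
# in a real training step they come from the second (student) model.
loss_ctc = torch.tensor(1.2, requires_grad=True)   # loss_CTC against the preset labels
loss_att = torch.tensor(0.8, requires_grad=True)   # loss_ATT against the preset labels
loss_kl_ctc, loss_kl_att = torch.tensor(2.5), torch.tensor(0.3)
loss_kl_layers = [torch.tensor(0.4), torch.tensor(0.3), torch.tensor(0.2)]

alpha, weight = 0.3, 0.5                                    # illustrative fusion weights
weight_kl_ctc, weight_kl_att, weight_kl_layer = 0.01, 0.1, 0.05

loss_ori = alpha * loss_ctc + (1 - alpha) * loss_att        # tag (label) loss information
loss_kl = (weight_kl_ctc * loss_kl_ctc                      # aligned decoding loss
           + weight_kl_att * loss_kl_att                    # attention decoding loss
           + weight_kl_layer * sum(loss_kl_layers))         # coding loss information
loss_total = weight * loss_ori + (1 - weight) * loss_kl     # target fusion information
loss_total.backward()                                       # drives the adjustment of the second model's parameters
```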
The embodiment of the application provides a voice recognition method, which can acquire a first voice recognition model for recognizing voice data of a first voice scene, a second voice recognition model and a voice training sample, wherein the voice training sample comprises a voice training sample corresponding to a second voice scene; encode the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information of the second voice recognition model; decode the first encoded information by using the first voice recognition model to obtain first decoded information, and decode the second encoded information by using the second voice recognition model to obtain second decoded information; calculate coding loss information between the first coding information and the second coding information, and calculate decoding loss information between the first decoding information and the second decoding information; and adjust model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene. In the embodiment of the application, the first voice recognition model and the voice training sample are utilized to train the second voice recognition model, so there is no need to train a model from scratch without any prior knowledge; the second voice recognition model can therefore carry out scene self-adaptation rapidly and stably, which improves the training efficiency of the model. For example, the second voice recognition model is a model based on the first voice recognition model, i.e. the second voice recognition model already has a knowledge reserve before being trained, so that the training efficiency can be improved when the second voice recognition model is trained, giving the method the advantage of rapid adaptation. For example, the training time of the second voice recognition model is approximately one half of the original training time, which reflects this advantage of rapid adaptation.
Secondly, the embodiment of the application trains the second voice recognition model by combining multiple losses, so that the recognition performance of the target voice recognition model can be effectively improved. For example, when a song scene is optimized, the performance in general scenes and video scenes is improved by 0%-1%, and the performance in complex song scenes (mixed with non-song speech) and pure song scenes is improved by 5%. In addition, the embodiment of the application also introduces a loss calculated based on the KL Divergence, and the following problems can be solved through the KL Divergence:
(1) The distribution difference between the first voice scene and the second voice scene is large. The KL Divergence can keep the distribution gap between the second voice recognition model and the first voice recognition model as small as possible while the second voice recognition model fits the distribution of the second voice scene during training, so that a distribution satisfying both the second voice scene and the first voice scene can be found as soon as possible.
(2) The introduction of the KL Divergence can meet the requirement of multi-scene performance, thereby meeting the requirement of rapid self-adaptation.
(3) If the voice scene comprises a general voice scene, the target voice recognition model needs to meet the data distribution of the general voice scene. Through the KL Divergence, the model can be trained stably even if erroneous samples exist in the training samples of the general voice scene.
According to the method described in the above embodiments, examples are described in further detail below.
The embodiment of the application will be described by taking the example of integrating the voice recognition method on a server.
In one embodiment, as shown in fig. 3, a voice recognition method is used, and the specific flow is as follows:
201. the server acquires a first voice recognition model, a second voice recognition model and a voice training sample for recognizing voice data of a first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene.
In an embodiment, the first speech scene may be a general scene and an A scene. For example, the first speech recognition model may recognize speech data of the general scene and the A scene.
In an embodiment, the first speech recognition model and the second speech recognition model may be the same artificial intelligence model. For example, the model structures and model parameters of the first speech recognition model and the second speech recognition model are the same.
In an embodiment, the second speech scene may be a different scene than the first speech scene. For example, the second speech scene may be a general scene, an A scene, and a B scene.
In an embodiment, the target speech recognition model for recognizing speech data in the second speech scene may be obtained by training the second speech recognition model with a speech training sample comprising the second speech scene.
202. The server encodes the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information of the second voice recognition model.
In an embodiment, when the first speech scene and the second speech scene have overlapping portions, the first speech recognition model may be used to assist the second speech recognition model in training in addition to training the second speech recognition model by using the speech training sample, so that the speech recognition effect of the target speech recognition model is better.
For example, the first speech scene may be a general scene and an A scene. The second speech scene may be a general scene, an A scene, and a B scene. Because the first speech scene and the second speech scene have overlapping scenes, the first speech recognition model may be utilized to assist the second speech recognition model in training.
Therefore, the server can encode the voice training sample by utilizing the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information of the second voice recognition model; the server decodes the first encoded information by using the first voice recognition model to obtain first decoded information, and decodes the second encoded information by using the second voice recognition model to obtain second decoded information; the server calculates coding loss information between the first coding information and the second coding information, and calculates decoding loss information between the first decoding information and the second decoding information; then, the server adjusts model parameters of the second speech recognition model based on the coding loss information and the decoding loss information to obtain a target speech recognition model for recognizing the speech data in the second speech scene.
In an embodiment, the first speech recognition model may include an encoder, and the second speech recognition model may also include an encoder. Wherein the encoder in the first speech recognition model and the encoder in the second speech recognition model may be encoders constructed based on Conformer.
In an embodiment, the server may encode the voice training samples by using an encoder in the first voice recognition model to obtain first encoded information corresponding to the first voice recognition model.
In an embodiment, the server may encode the speech training samples using an encoder in the second speech recognition model to obtain second encoded information of the second speech recognition model.
In an embodiment, the encoder in the first speech recognition model may comprise a number of Conformer modules. Similarly, the encoder in the second speech recognition model may also include a number of Conformer modules. For example, the encoder in the first speech recognition model may include n Conformer modules, and similarly, the encoder in the second speech recognition model may also include n Conformer modules. Wherein n is a positive integer greater than or equal to 1.
For example, when encoding the speech training samples with an encoder in the first speech recognition model, the n Conformer modules may be used to encode the speech training samples. For example, as shown in fig. 4, the 1 st Conformer module may be used to encode the voice training samples to obtain encoded sub-information. Then, the 2 nd Conformer module can be utilized to encode the encoded sub-information output by the 1 st Conformer module, so as to obtain the encoded sub-information corresponding to the 2 nd Conformer module, and so on until the nth Conformer module outputs so as to obtain the first encoded information.
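For illustration only, an encoder of this kind can be sketched in Python/PyTorch as below. The class name EncoderStack and the plain feed-forward blocks are placeholders standing in for real Conformer modules; the point of the sketch is merely that each module encodes the output of the previous one and that every intermediate output is kept for the later coding loss calculation.

```python
import torch
import torch.nn as nn

class EncoderStack(nn.Module):
    """Chain of n encoding modules; returns the output of every module."""

    def __init__(self, dim=256, num_modules=3):
        super().__init__()
        # Stand-in blocks; a real system would use Conformer modules here.
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.ReLU())
            for _ in range(num_modules)
        )

    def forward(self, x):
        outputs = []
        for block in self.blocks:
            x = block(x)          # module k encodes the output of module k-1
            outputs.append(x)     # keep the encoded sub-information of each module
        return outputs            # the last element is the overall encoded information

# Usage: the same speech features go through the first and the second model's encoder;
# corresponding intermediate outputs are then compared to form coding loss sub-information.
feats = torch.randn(8, 120, 256)              # (batch, frames, feature dim)
first_outs = EncoderStack()(feats)
second_outs = EncoderStack()(feats)
```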
203. The server decodes the first encoded information by using the first speech recognition model to obtain first decoded information, and decodes the second encoded information by using the second speech recognition model to obtain second decoded information.
In an embodiment, the first speech recognition model may include a decoder for a speech alignment mechanism and a decoder based on an attention mechanism, for example, as shown in fig. 4. The first speech recognition model may decode the first encoded information into first alignment decoded information by an alignment mechanism based decoder. The first speech recognition model may decode the first encoded information into first attention-decoded information by an attention-mechanism based decoder. Then, the first speech recognition model may modify the first attention decoding information in combination with the first alignment decoding information and the first attention decoding information to obtain first decoding information.
In an embodiment, the second speech recognition model may include a decoder for a speech alignment mechanism and a decoder based on an attention mechanism. The second speech recognition model may decode the second encoded information into second aligned decoded information by an alignment mechanism based decoder. The second speech recognition model may decode the second encoded information into second attention-decoded information by an attention-mechanism based decoder. Then, the second speech recognition model may modify the second attention decoding information in combination with the second alignment decoding information and the second attention decoding information to obtain second decoding information.
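Again for illustration only, the two decoding branches can be sketched as below. This is a highly simplified stand-in: a real system would use a genuine CTC head over the encoder frames and an autoregressive attention decoder conditioned on previously predicted tokens, and would combine the two outputs as described above; all class and variable names here are assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchDecoder(nn.Module):
    """Produces alignment decoding information and attention decoding information."""

    def __init__(self, dim=256, vocab=5000):
        super().__init__()
        self.ctc_head = nn.Linear(dim, vocab)                        # alignment-mechanism branch
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.attn_head = nn.Linear(dim, vocab)                       # attention-mechanism branch

    def forward(self, enc):
        align_dec = self.ctc_head(enc).log_softmax(dim=-1)           # per-frame alignment decoding info
        ctx, _ = self.attn(enc, enc, enc)                            # simplified attention over the encoder output
        attn_dec = self.attn_head(ctx).log_softmax(dim=-1)           # attention decoding info
        return align_dec, attn_dec

enc = torch.randn(8, 120, 256)
align_dec, attn_dec = TwoBranchDecoder()(enc)
```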
204. The server calculates coding loss information between the first coding information and the second coding information, and calculates decoding loss information between the first decoding information and the second decoding information.
In an embodiment, the first encoded information may include first encoded sub-information output by each encoding module in the encoder of the first speech recognition model. The second coding information may include second coding sub-information output by each coding module in the encoder of the second speech recognition model. When the coding loss information between the first coding information and the second coding information is calculated, as shown in fig. 4, the coding loss sub information between each first coding sub information and its corresponding second coding sub information may be calculated. Then, the coding loss sub-information is fused to obtain the coding loss information.
For example, the first encoded information includes n first encoded sub-information, and the second encoded information includes n second encoded sub-information. Coding loss sub-information between each first coding sub-information and its corresponding second coding sub-information may then be calculated based on the KL Divergence. Then, the plurality of coding loss sub-information may be added to obtain the coding loss information.
In an embodiment, the first decoding information may include first alignment decoding information and first attention decoding information, and the second decoding information may include second alignment decoding information and second attention decoding information. When calculating the decoding loss information between the first decoding information and the second decoding information, it is possible to calculate aligned decoding loss information between the first alignment decoding information and the second alignment decoding information, and to calculate attention decoding loss information between the first attention decoding information and the second attention decoding information. Then, the aligned decoding loss information and the attention decoding loss information are fused to obtain the decoding loss information.
For example, the alignment decoding loss information between the first alignment decoding information and the second alignment decoding information may be calculated based on the KL Divergence, and the attention decoding loss information between the first attention decoding information and the second attention decoding information may likewise be calculated based on the KL Divergence. Then, the aligned decoding loss information and the attention decoding loss information are added to obtain the decoding loss information.
205. The server adjusts model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing voice data in the second voice scene.
In an embodiment, the coding loss information and decoding loss information may be fused to obtain model loss information. The fusion mode is specifically as follows:
loss_kl = weight_kl_ctc*loss_kl_ctc + weight_kl_att*loss_kl_att + weight_kl_layer*Σ_k loss_kl_layerk
where loss_kl may represent model loss information, loss_kl_ctc may represent aligned decoding loss information, loss_kl_att may represent attention decoding loss information, and loss_kl_layerk may represent the coding loss sub-information of the k-th coding module, so that Σ_k loss_kl_layerk may represent the coding loss information. weight_kl_ctc may represent the fusion weight corresponding to the aligned decoding loss information, weight_kl_att may represent the fusion weight corresponding to the attention decoding loss information, and weight_kl_layer may represent the fusion weight corresponding to the coding loss information.
In one embodiment, in order to balance the losses of all aspects in the training process, the size of each fusion weight may be determined based on the size of the corresponding loss in training. For example, because the CTC peak positions of different models differ greatly, the value of loss_kl_ctc is large while the value of loss_kl_att is small; therefore, weight_kl_att may be set to about ten times weight_kl_ctc, thereby equalizing the losses of all aspects.
In an embodiment, tag loss information between the second decoding information corresponding to the second speech recognition model and the preset tag may also be calculated.
For example, the second decoding information may include second alignment decoding information and second attention decoding information. Then, first loss information between the second alignment decoding information and the preset tag and second loss information between the second attention decoding information and the preset tag may be calculated. And then, fusing the first loss information and the second loss information to obtain the label loss information.
For example, the tag loss information may be represented by the following formula:
loss_ori=α*loss_CTC+(1-α)*loss_ATT
where loss_ctc may represent first loss information, loss_att may represent second loss information, and α may represent a fusion weight.
Then, the model loss information and the label loss information can be fused to obtain target fusion information. For example, it may be as follows:
loss_total=weight*loss_ori+(1-weight)*loss_kl
where loss_total may represent target fusion information and weight may represent fusion weight.
Then, model parameters of the second speech recognition model can be adjusted based on the target fusion information to obtain a target speech recognition model for recognizing the speech data in the second speech scene. For example, the target voice recognition model obtained through training by the method provided by the application can recognize not only the voice data in the B scene, but also the voice data in the general scene and the A scene.
In addition, the embodiment of the application combines multiple losses to train the second voice recognition model, so that the recognition performance of the target voice recognition model can be effectively improved. For example, in the case of optimizing song scenes, the performance of the general scenes and video scenes is improved by 0% -1%, and the performance of complex song scenes (mixed with non-song speech) and pure song scenes is improved by 5%.
In addition, the second voice recognition model is based on the first voice recognition model, namely the second voice recognition model has knowledge reserves before being trained, so that training efficiency can be improved when the second voice recognition model is trained, and the method has the advantage of rapid self-adaption. For example, the training time of the second speech recognition model is approximately one half of the original training time, and the method has the advantage of rapid adaptation.
In the embodiment of the application, the server can acquire a first voice recognition model, a second voice recognition model and a voice training sample for recognizing the voice data of the first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene; the server encodes the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information of the second voice recognition model; the server decodes the first encoded information by using the first voice recognition model to obtain first decoded information, and decodes the second encoded information by using the second voice recognition model to obtain second decoded information; the server calculates coding loss information between the first coding information and the second coding information, and calculates decoding loss information between the first decoding information and the second decoding information; the server adjusts model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing voice data in the second voice scene. The method provided by the application can improve the efficiency of adapting the voice recognition model to various voice scenes and improve the accuracy of voice recognition.
In order to better implement the speech recognition method provided by the embodiment of the application, in an embodiment, a speech recognition apparatus is also provided, and the speech recognition apparatus can be integrated in a computer device. The meanings of the terms are the same as those in the speech recognition method described above; for specific implementation details, reference may be made to the description of the method embodiments.
In an embodiment, a speech recognition apparatus is provided, which may specifically be integrated in a computer device. As shown in fig. 5, the speech recognition apparatus comprises: an acquisition unit 301, an encoding unit 302, a decoding unit 303, a calculation unit 304, and an adjustment unit 305, which are specifically as follows:
an obtaining unit 301, configured to obtain a first speech recognition model, a second speech recognition model, and a speech training sample for recognizing speech data of a first speech scene, where the speech training sample includes a speech training sample corresponding to the second speech scene;
the encoding unit 302 is configured to encode the speech training sample by using the first speech recognition model and the second speech recognition model, so as to obtain first encoding information corresponding to the first speech recognition model and second encoding information of the second speech recognition model;
A decoding unit 303, configured to decode the first encoded information by using the first speech recognition model to obtain first decoded information, and decode the second encoded information by using the second speech recognition model to obtain second decoded information;
a calculation unit 304 for calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information;
and an adjusting unit 305, configured to adjust model parameters of the second speech recognition model based on the coding loss information and the decoding loss information, so as to obtain a target speech recognition model for recognizing the speech data in the second speech scene.
In an embodiment, the computing unit 304 may include:
a first calculation subunit configured to calculate alignment decoding loss information between the first alignment decoding information and the second alignment decoding information;
a second calculation subunit operable to calculate attention decoding loss information between the first and second attention decoding information;
And the first fusion subunit is used for fusing the aligned decoding loss information and the attention decoding loss information to obtain the decoding loss information.
In an embodiment, the first computing subunit may include:
the dividing module is used for dividing the first alignment decoding information and the second alignment decoding information to obtain dividing information;
the logarithmic operation module is used for carrying out logarithmic operation on the division information to obtain information after operation;
and the multiplication module is used for multiplying the calculated information by the second alignment decoding information to obtain the alignment decoding loss information.
In an embodiment, the computing unit 304 may include:
a third calculation subunit, configured to calculate coding loss sub-information between each first coding sub-information and the corresponding second coding sub-information;
and the second fusion subunit is used for fusing the plurality of coding loss sub-information to obtain the coding loss information.
In an embodiment, the second fusion subunit may include:
the sorting module is used for sorting each coding loss sub-information based on the numerical value of each coding loss information to obtain sorted coding loss sub-information;
The determining module is used for determining fusion parameters corresponding to each coding loss sub-information according to the ordered coding loss sub-information;
and the first arithmetic operation module is used for carrying out arithmetic operation on each coding loss sub-information and the corresponding fusion parameter thereof to obtain the coding loss information.
In an embodiment, the adjusting unit 305 may include:
a fourth calculating subunit, configured to calculate label loss information between second decoding information corresponding to the second speech recognition model and a preset label;
the third fusion subunit is used for fusing the coding loss information and the decoding loss information to obtain model loss information;
the fourth fusion subunit is used for fusing the model loss information and the label loss information to obtain target fusion information;
and the adjusting subunit is used for adjusting the model parameters of the second voice recognition model based on the target fusion information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
In an embodiment, the third fusion subunit may include:
the comparison module is used for comparing the coding loss information, the alignment decoding loss information and the attention decoding loss information to obtain a comparison result;
The generation module is used for respectively generating fusion parameters corresponding to the coding loss information, the alignment decoding loss information and the attention decoding loss information based on the comparison result;
and the second arithmetic operation module is used for respectively carrying out arithmetic operation on the coding loss information, the alignment decoding loss information, the attention decoding loss information and the corresponding fusion parameters thereof to obtain the model loss information.
In an embodiment, the encoding unit 302 may include:
the forward processing subunit is used for performing forward processing on the voice training sample by utilizing the first voice recognition model and the second voice recognition model to obtain first forward information corresponding to the first voice recognition model and second forward information corresponding to the second voice recognition model;
the self-attention feature extraction subunit is configured to perform self-attention feature extraction on the first forward information by using the first speech recognition model to obtain first self-attention feature information, and perform self-attention feature extraction on the second forward information by using the second speech recognition model to obtain second self-attention feature information;
the convolution operation subunit is configured to perform convolution operation on the first self-attention feature information by using the first speech recognition model to obtain first coding information corresponding to the first speech recognition model, and perform convolution operation on the second self-attention information by using the second speech recognition model to obtain second coding information of the second speech recognition model.
In an embodiment, the forward processing subunit may include:
the normalization processing module is used for carrying out normalization processing on the voice training sample by utilizing the first voice recognition model and the second voice recognition model to obtain first normalized information and second normalized information;
the linear conversion module is used for carrying out linear conversion on the first normalized information by utilizing the first voice recognition model to obtain first linear converted information, and carrying out linear conversion on the second normalized information by utilizing the second voice recognition model to obtain second linear converted information;
and the nonlinear activation module is used for carrying out nonlinear activation on the first linear converted information by utilizing the first voice recognition model to obtain the first forward information, and carrying out nonlinear activation on the second linear converted information by utilizing the second voice recognition model to obtain the second forward information.
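For illustration only, the three operations performed by this subunit can be sketched as follows; the dimensions and the choice of SiLU as the nonlinear activation are assumptions, since the text does not specify a particular activation function.

```python
import torch
import torch.nn as nn

class ForwardProcessing(nn.Module):
    """Illustrative forward processing: normalization, then linear conversion, then nonlinear activation."""

    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(dim)          # normalization processing
        self.linear = nn.Linear(dim, hidden)   # linear conversion
        self.act = nn.SiLU()                   # nonlinear activation

    def forward(self, x):
        return self.act(self.linear(self.norm(x)))   # forward information

# The same kind of block would be applied inside both the first and the second model.
forward_info = ForwardProcessing()(torch.randn(8, 120, 256))
```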
In an embodiment, the decoding unit 303 may include:
an alignment processing subunit, configured to perform alignment processing on the first encoded information by using the first speech recognition model to obtain first aligned information, and perform alignment processing on the second encoded information by using the second speech recognition model to obtain second aligned information;
The distribution fitting subunit is used for carrying out distribution fitting on the first coding information by utilizing the first voice recognition model to obtain first distribution information, and carrying out distribution fitting on the second coding information by utilizing the second voice recognition model to obtain second distribution information;
and the correction subunit is used for correcting the first distribution information by using the first aligned information to obtain the first decoded information, and correcting the second distribution information by using the second aligned information to obtain the second decoded information.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
The efficiency of adapting the voice recognition model to various voice scenes and the accuracy of voice recognition can be improved through the voice recognition device.
The embodiment of the application also provides a computer device, which can comprise a terminal or a server, for example, the computer device can be used as a voice recognition-based terminal, and the terminal can be a mobile phone, a tablet computer and the like; for another example, the computer device may be a server, such as a speech recognition based server, or the like. As shown in fig. 6, a schematic structural diagram of a terminal according to an embodiment of the present application is shown, specifically:
The computer device may include one or more processors 401 of a processing core, memory 402 of one or more computer readable storage media, a power supply 403, and an input unit 404, among other components. Those skilled in the art will appreciate that the computer device structure shown in FIG. 6 is not limiting of the computer device and may include more or fewer components than shown, or may be combined with certain components, or a different arrangement of components. Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user page, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by executing the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of charge, discharge, and power consumption management may be performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The computer device may also include an input unit 404, which input unit 404 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
acquiring a first voice recognition model, a second voice recognition model and a voice training sample for recognizing voice data of a first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene;
encoding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information corresponding to the second voice recognition model;
Decoding the first encoded information by using the first speech recognition model to obtain first decoded information, and decoding the second encoded information by using the second speech recognition model to obtain second decoded information;
calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information;
and adjusting model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations of the above embodiments.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the various methods of the above embodiments may be performed by a computer program, or by computer program control related hardware, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application further provides a storage medium in which a computer program is stored, the computer program being capable of being loaded by a processor to perform any of the steps of the speech recognition based method provided by the embodiment of the present application. For example, the computer program may perform the steps of:
acquiring a first voice recognition model, a second voice recognition model and a voice training sample for recognizing voice data of a first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene;
encoding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information corresponding to the second voice recognition model;
Decoding the first encoded information by using the first speech recognition model to obtain first decoded information, and decoding the second encoded information by using the second speech recognition model to obtain second decoded information;
calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information;
and adjusting model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
Because the computer program stored in the storage medium can execute any step in the voice recognition method provided by the embodiment of the present application, the beneficial effects that any one of the voice recognition method provided by the embodiment of the present application can achieve can be achieved, which are detailed in the previous embodiments and are not described herein.
The foregoing has described in detail a speech recognition method, apparatus, computer device and storage medium according to embodiments of the present application, and specific examples have been applied to illustrate the principles and embodiments of the present application, and the above description of the embodiments is only for aiding in understanding the method and core idea of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present application, the present description should not be construed as limiting the present application.

Claims (15)

1. A method of speech recognition, comprising:
acquiring a first voice recognition model, a second voice recognition model and a voice training sample for recognizing voice data of a first voice scene, wherein the voice training sample comprises a voice training sample corresponding to the second voice scene;
encoding the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first encoding information corresponding to the first voice recognition model and second encoding information corresponding to the second voice recognition model;
decoding the first encoded information by using the first speech recognition model to obtain first decoded information, and decoding the second encoded information by using the second speech recognition model to obtain second decoded information;
calculating coding loss information between the first coding information and the second coding information, and calculating decoding loss information between the first decoding information and the second decoding information;
and adjusting model parameters of the second voice recognition model based on the coding loss information and the decoding loss information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
2. The method of claim 1, wherein the first decoding information comprises first alignment decoding information and first attention decoding information, and the second decoding information comprises second alignment decoding information and second attention decoding information; the calculating decoding loss information between the first decoding information and the second decoding information includes:
calculating alignment decoding loss information between the first alignment decoding information and the second alignment decoding information;
calculating attention decoding loss information between the first and second attention decoding information;
and fusing the aligned decoding loss information and the attention decoding loss information to obtain the decoding loss information.
3. The method of claim 2, wherein the calculating alignment decoding loss information between the first alignment decoding information and the second alignment decoding information comprises:
dividing the first alignment decoding information and the second alignment decoding information to obtain divided information;
carrying out logarithmic operation on the division information to obtain information after operation;
and multiplying the calculated information by the second alignment decoding information to obtain the alignment decoding loss information.
4. The method of claim 1, wherein the first encoded information comprises a plurality of first encoded sub-information characterizing different information depths and the second encoded information comprises a plurality of second encoded sub-information characterizing different information depths; the calculating coding loss information between the first coding information and the second coding information includes:
calculating coding loss sub-information between each first coding sub-information and the corresponding second coding sub-information;
and fusing the plurality of coding loss sub-information to obtain the coding loss information.
5. The method of claim 4, wherein the fusing the coding loss sub-information to obtain the coding loss information comprises:
sorting each coding loss sub-information based on the numerical value of each coding loss information to obtain sorted coding loss sub-information;
determining fusion parameters corresponding to each coding loss sub-information according to the ordered coding loss sub-information;
and carrying out arithmetic operation on each coding loss sub-information and the corresponding fusion parameters thereof to obtain the coding loss information.
6. The method of claim 1, wherein adjusting model parameters of the second speech recognition model based on the coding loss information and the decoding loss information to obtain a target speech recognition model for recognizing speech data in the second speech scene comprises:
Calculating label loss information between second decoding information corresponding to the second voice recognition model and a preset label;
fusing the coding loss information and the decoding loss information to obtain model loss information;
fusing the model loss information and the label loss information to obtain target fusion information;
and adjusting model parameters of the second voice recognition model based on the target fusion information to obtain a target voice recognition model for recognizing the voice data in the second voice scene.
7. The method of claim 6, wherein the decoding penalty information comprises aligned decoding penalty information and attention decoding penalty information; the step of fusing the coding loss information and the decoding loss information to obtain model loss information includes:
comparing the coding loss information, the alignment decoding loss information and the attention decoding loss information to obtain a comparison result;
based on the comparison result, respectively generating fusion parameters corresponding to the coding loss information, the alignment decoding loss information and the attention decoding loss information;
and respectively carrying out arithmetic operation on the coding loss information, the alignment decoding loss information and the attention decoding loss information and the corresponding fusion parameters thereof to obtain the model loss information.
8. The method of claim 1, wherein the encoding the speech training samples using the first speech recognition model and the second speech recognition model to obtain first encoded information corresponding to the first speech recognition model and second encoded information corresponding to the second speech recognition model comprises:
performing forward processing on the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first forward information corresponding to the first voice recognition model and second forward information corresponding to the second voice recognition model;
performing self-attention feature extraction on the first forward information by using the first voice recognition model to obtain first self-attention feature information, and performing self-attention feature extraction on the second forward information by using the second voice recognition model to obtain second self-attention feature information;
and performing convolution operation on the first self-attention characteristic information by using the first voice recognition model to obtain first coding information corresponding to the first voice recognition model, and performing convolution operation on the second self-attention information by using the second voice recognition model to obtain second coding information corresponding to the second voice recognition model.
9. The method of claim 8, wherein performing forward processing on the speech training samples using the first speech recognition model and the second speech recognition model to obtain first forward information corresponding to the first speech recognition model and second forward information corresponding to the second speech recognition model, comprises:
normalizing the voice training sample by using the first voice recognition model and the second voice recognition model to obtain first normalized information and second normalized information;
performing linear conversion on the first normalized information by using the first voice recognition model to obtain first linear converted information, and performing linear conversion on the second normalized information by using the second voice recognition model to obtain second linear converted information;
and performing nonlinear activation on the first linear converted information by using the first voice recognition model to obtain the first forward information, and performing nonlinear activation on the second linear converted information by using the second voice recognition model to obtain the second forward information.
10. The method of claim 1, wherein decoding the first coding information using the first speech recognition model to obtain first decoding information, and decoding the second coding information using the second speech recognition model to obtain second decoding information, comprises:
performing alignment processing on the first coding information by using the first speech recognition model to obtain first aligned information, and performing alignment processing on the second coding information by using the second speech recognition model to obtain second aligned information;
performing distribution fitting on the first coding information by using the first speech recognition model to obtain first distribution information, and performing distribution fitting on the second coding information by using the second speech recognition model to obtain second distribution information;
and correcting the first distribution information by using the first aligned information to obtain the first decoding information, and correcting the second distribution information by using the second aligned information to obtain the second decoding information.
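In claim 10, decoding combines an alignment branch with a distribution-fitting branch, and the aligned information corrects the fitted distribution. The sketch below reads the alignment branch as a CTC-style projection and the correction as a log-space interpolation; both readings, and all names used, are assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn


class HybridDecoder(nn.Module):
    """Illustrative decoder: alignment branch + distribution fitting, fused by interpolation."""

    def __init__(self, d_model: int = 256, vocab: int = 4000, mix: float = 0.3):
        super().__init__()
        self.align_proj = nn.Linear(d_model, vocab)  # alignment processing
        self.fit_proj = nn.Linear(d_model, vocab)    # distribution fitting
        self.mix = mix

    def forward(self, coding_info: torch.Tensor) -> torch.Tensor:
        aligned = torch.log_softmax(self.align_proj(coding_info), dim=-1)
        fitted = torch.log_softmax(self.fit_proj(coding_info), dim=-1)
        # correct the fitted distribution with the aligned information
        return self.mix * aligned + (1.0 - self.mix) * fitted   # decoding information
```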
11. The method of claim 10, wherein performing distribution fitting on the first coding information using the first speech recognition model to obtain first distribution information, and performing distribution fitting on the second coding information using the second speech recognition model to obtain second distribution information, comprises:
extracting global features of the first coding information using the first speech recognition model, and extracting global features of the second coding information using the second speech recognition model;
extracting local features of the first coding information using the first speech recognition model, and extracting local features of the second coding information using the second speech recognition model;
fusing the global features and the local features of the first coding information to obtain target characterization features of the first coding information, and fusing the global features and the local features of the second coding information to obtain target characterization features of the second coding information;
mapping the target characterization features of the first coding information into a preset distribution space to obtain first distribution information, and mapping the target characterization features of the second coding information into the preset distribution space to obtain second distribution information.
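Claim 11 decomposes distribution fitting into global feature extraction, local feature extraction, fusion into a target characterization feature, and a mapping into a preset distribution space. One way to realise this is sketched below, under the assumptions that global features come from self-attention, local features from a 1-D convolution, and the preset distribution space is a softmax over the vocabulary; none of these choices are fixed by the claim.

```python
import torch
import torch.nn as nn


class DistributionFitting(nn.Module):
    """Illustrative distribution fitting: global + local features -> fusion -> distribution space."""

    def __init__(self, d_model: int = 256, vocab: int = 4000, n_heads: int = 4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.to_dist = nn.Linear(d_model, vocab)

    def forward(self, coding_info: torch.Tensor) -> torch.Tensor:
        glob, _ = self.global_attn(coding_info, coding_info, coding_info)       # global features
        loc = self.local_conv(coding_info.transpose(1, 2)).transpose(1, 2)      # local features
        fused = self.fuse(torch.cat([glob, loc], dim=-1))      # target characterization features
        return torch.softmax(self.to_dist(fused), dim=-1)      # distribution information
```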
12. A speech recognition apparatus, comprising:
an acquisition unit, configured to acquire a first speech recognition model for recognizing speech data of a first speech scene, a second speech recognition model, and a speech training sample, the speech training sample comprising a speech training sample corresponding to a second speech scene;
an encoding unit, configured to encode the speech training sample by using the first speech recognition model and the second speech recognition model to obtain first coding information corresponding to the first speech recognition model and second coding information corresponding to the second speech recognition model;
a decoding unit, configured to decode the first coding information by using the first speech recognition model to obtain first decoding information, and to decode the second coding information by using the second speech recognition model to obtain second decoding information;
a calculation unit, configured to calculate coding loss information between the first coding information and the second coding information, and to calculate decoding loss information between the first decoding information and the second decoding information;
and an adjusting unit, configured to adjust model parameters of the second speech recognition model based on the coding loss information and the decoding loss information to obtain a target speech recognition model for recognizing speech data in the second speech scene.
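The units of claim 12 map naturally onto one training step: both models encode and decode the sample, losses are computed between the two models' outputs, and only the second model is adjusted. A compact sketch of such a step follows; the encode()/decode() interfaces and the choice of MSE and KL losses are illustrative assumptions, not an API prescribed by the patent.

```python
import torch
import torch.nn.functional as F


def distillation_step(teacher, student, batch, optimizer):
    """One training step: encode and decode with both models, then adjust only the student.

    Assumes both models expose encode()/decode() and that decode() returns
    per-token probability distributions; these interfaces are hypothetical.
    """
    with torch.no_grad():                       # the first (teacher) model stays fixed
        t_enc = teacher.encode(batch)
        t_dec = teacher.decode(t_enc)
    s_enc = student.encode(batch)
    s_dec = student.decode(s_enc)

    coding_loss = F.mse_loss(s_enc, t_enc)      # coding loss information
    decoding_loss = F.kl_div(torch.log(s_dec + 1e-8), t_dec,
                             reduction="batchmean")  # decoding loss information

    optimizer.zero_grad()
    (coding_loss + decoding_loss).backward()
    optimizer.step()                            # adjust the second (student) model only
    return coding_loss.item(), decoding_loss.item()
```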
13. A computer device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations in the speech recognition method according to any one of claims 1 to 11.
14. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the speech recognition method of any one of claims 1 to 11.
15. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the speech recognition method according to any one of claims 1 to 11.
CN202210647855.8A 2022-06-08 2022-06-08 Speech recognition method and device and computer equipment Pending CN117012181A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210647855.8A CN117012181A (en) 2022-06-08 2022-06-08 Speech recognition method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN117012181A true CN117012181A (en) 2023-11-07

Family

ID=88566059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210647855.8A Pending CN117012181A (en) 2022-06-08 2022-06-08 Speech recognition method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN117012181A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination