CN115376495A - Speech recognition model training method, speech recognition method and device - Google Patents


Info

Publication number
CN115376495A
Authority
CN
China
Prior art keywords
data
training
voice
speech
feature vector
Prior art date
Legal status
Pending
Application number
CN202210928842.8A
Other languages
Chinese (zh)
Inventor
张一珂
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210928842.8A priority Critical patent/CN115376495A/en
Publication of CN115376495A publication Critical patent/CN115376495A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/26 Speech to text systems
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech recognition model training method, a speech recognition method and a speech recognition device. In any iteration, a first training sample set is obtained, in which each piece of training data comprises voice data, a transcription text of the voice data and historical voice data of the voice data; the voice recognition model comprises a feature extraction model and a recognition model. The voice data in the training data and its historical voice data are respectively used as inputs of a pre-trained feature extraction model, which outputs a context feature vector of the voice data and a context feature vector of the historical voice data. The voice data, the context feature vector of the voice data and the context feature vector of the historical voice data are then used as inputs of the recognition model, which outputs the recognition text of the voice data. Parameters of the recognition model are adjusted according to the recognition text and the transcription text of each piece of voice data obtained in each iteration until a training stopping condition is met.

Description

Speech recognition model training method, speech recognition method and device
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a speech recognition model training method, a speech recognition method and a speech recognition device.
Background
Automatic Speech Recognition (ASR) is the process of converting audio into text. With the development of computer technology and artificial intelligence technology, speech recognition is applied in more and more scenarios, and in ASR tasks end-to-end speech recognition is a current research hotspot.
Existing speech recognition technology has low recognition accuracy for long-tail entities (i.e. words with a low frequency of occurrence, or specialized terms of a particular field) in long speech recognition scenarios such as intelligent customer service systems, smartphone assistants, automatic video subtitle generation, voice-to-text conversion in instant messaging software, and online voice interaction.
Disclosure of Invention
The application provides a speech recognition model training method, a speech recognition method and a speech recognition device, which can improve the accuracy of speech recognition, particularly the recognition accuracy of a long-tail entity in a long speech recognition scene.
In a first aspect, the present application provides a method for training a speech recognition model, including:
in any iteration process of speech recognition model training, obtaining a first training sample set, wherein the first training sample set comprises a plurality of training data, each training data comprises speech data, a transcription text of the speech data and historical speech data of the speech data, and the speech recognition model comprises a feature extraction model and a recognition model;
for each training data in the first training sample set, outputting a context feature vector of the speech data by taking speech data in the training data as input of the feature extraction model, and outputting a context feature vector of the historical speech data by taking historical speech data of the speech data as input of the feature extraction model, wherein the feature extraction model is obtained by pre-training;
outputting a recognition text of the voice data by taking the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data as the input of the recognition model;
and adjusting parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until a training stopping condition is met, and determining the voice recognition model determined in the iteration process meeting the training stopping condition as the trained voice recognition model.
In a second aspect, the present application provides a speech recognition method, including:
acquiring a voice signal;
and inputting the voice signal and the historical voice signal of the voice signal into a pre-trained voice recognition model, and outputting a voice recognition result of the voice signal, wherein the voice recognition model is obtained by training according to the method of the first aspect.
In a third aspect, the present application provides a speech recognition model training apparatus, including:
the system comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a first training sample set in any iteration process of voice recognition model training, the first training sample set comprises a plurality of training data, each training data comprises voice data, a transcription text of the voice data and historical voice data of the voice data, and the voice recognition model comprises a feature extraction model and a recognition model;
a first processing module, configured to, for each training data in the first training sample set, take voice data in the training data as an input of the feature extraction model, output a context feature vector of the voice data, take historical voice data of the voice data as an input of the feature extraction model, and output the context feature vector of the historical voice data, where the feature extraction model is obtained through pre-training;
the second processing module is used for taking the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data as the input of the recognition model and outputting the recognition text of the voice data;
and the parameter adjusting module is used for adjusting the parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until a training stopping condition is met, and determining the voice recognition model determined in the iteration process meeting the training stopping condition as the trained voice recognition model.
In a fourth aspect, the present application provides a speech recognition apparatus comprising:
the acquisition module is used for acquiring a voice signal;
a speech recognition module, configured to input the speech signal and a historical speech signal of the speech signal into a pre-trained speech recognition model, and output a speech recognition result of the speech signal, where the speech recognition model is obtained by training according to the method of the first aspect.
In a fifth aspect, the present application provides a computer device comprising: a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of the first aspect or the second aspect.
In a sixth aspect, the present application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform a method according to the first or second aspect.
In a seventh aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform a method according to the first or second aspect.
In summary, in the present application, by training a speech recognition model including a feature extraction model and a recognition model, in any iteration process, a first training sample set for model training includes a plurality of training data, each of the training data includes speech data, a transcription text of the speech data, and historical speech data of the speech data, the feature extraction model is obtained by pre-training, the feature extraction model directly extracts a context feature vector of the historical speech data from the historical speech data of the speech data, and directly extracts a context feature vector of the speech data from the speech data, and then outputs a recognition text of the speech data with the speech data, the context feature vector of the speech data, and the context feature vector of the historical speech data of the speech data as inputs of the recognition model. And then adjusting parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until the training stopping condition is met, and determining the voice recognition model determined in the iteration process meeting the training stopping condition as the trained voice recognition model. The pre-trained feature extraction model can directly extract the context feature vector of the voice data from the voice data and extract the context feature vector of the historical voice data from the historical voice data of the voice data, so that the problem that the accuracy of voice recognition is low due to accumulation of recognition errors caused by inputting the recognition result of the historical voice into a language model to extract the context feature in the prior art can be solved, the extracted context feature vector is a high-level information representation related to context, and the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data are jointly used as the input of the recognition model, so that the recognition model can effectively utilize the context information in the learning process, and the trained voice recognition model can improve the accuracy of voice recognition, particularly improve the recognition accuracy of a long-tail entity in a long voice recognition scene.
Further, in the present application, the feature extraction model includes a speech encoder, a text encoder, and a cross-modal encoder, wherein at least one of the speech encoder and the text encoder may be a pre-training model, and by using the pre-training model, model parameters of the speech encoder and the text encoder may be obtained by training a large amount of unlabeled training data, and the training data may come from various fields, so the pre-training model has good robustness and generalization, and further, the speech recognition model has good robustness and generalization. By adopting the pre-training model, the feature extraction model can obtain the effective context feature vector only by a small amount of training data, so that the speech recognition method of the embodiment of the application can be applied to low-resource speech recognition scenes.
Drawings
Fig. 1 is a schematic view of a speech recognition model training method and an implementation scenario of a speech recognition method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a method for training a speech recognition model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a speech recognition model according to an embodiment of the present application;
fig. 4 is a flowchart of a feature extraction model training method provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a feature extraction model provided in an embodiment of the present application;
fig. 6 is a schematic diagram illustrating obtaining a second context feature vector of training data according to an embodiment of the present disclosure;
fig. 7 is a flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a speech recognition model according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a speech recognition model training apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 11 is a schematic block diagram of a computer device 700 provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Before the technical scheme of the application is introduced, the related knowledge of the application is introduced as follows:
1. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
2. Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer can simulate or implement human learning behaviour so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and formal education learning.
3. Deep Learning (DL): a branch of machine learning, it is an algorithm that attempts to perform high-level abstraction of data using multiple processing layers that contain complex structures or consist of multiple non-linear transformations. Deep learning learns the intrinsic rules and representation levels of training sample data, and the information obtained in the learning process is very helpful for interpreting data such as text, images and sound. Its final goal is to enable machines to have human-like analytical learning ability and to recognize data such as text, images and sound. Deep learning is a complex machine learning algorithm that achieves results in speech and image recognition far exceeding those of earlier related techniques.
4. Pre-training: a process in which a neural network model is trained on a large data set so that it learns features common to the data. Pre-training is intended to provide good initial model parameters for the subsequent training of the neural network model on a specific data set. At least one of the speech encoder and the text encoder in embodiments of the present application may be a pre-trained model.
The speech recognition model training method provided by the embodiments of the application mainly relates to the Speech Technology in artificial intelligence technology, and in particular to ASR technology, as illustrated by the examples below. The key technologies of speech technology are ASR, speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising human-computer interaction modes.
In the prior art, context features are usually adopted to improve the accuracy of speech recognition, and an existing context feature extraction method is to extract context features by using a language model in a speech recognition system, wherein the context features are extracted from an artificial transcribed text (which is a correct transcribed text) corresponding to a historical speech in the training process of the speech recognition model, and when the method is actually used, the speech recognition model cannot obtain the correct transcribed text of the historical speech, a recognition result of the historical speech is usually input into the language model to extract the context features, and the recognition result of the historical speech usually contains recognition errors, so that the context features extracted by the language model are noisy, and therefore, a deviation is introduced in the recognition process of the current speech, and the accuracy of the speech recognition is low. In order to solve the technical problem, in the embodiment of the application, a speech recognition model comprising a feature extraction model and a recognition model is trained, in any iteration process of model training, a first training sample set for model training comprises a plurality of training data, each training data comprises speech data, transcription text of the speech data and historical speech data of the speech data, the feature extraction model is obtained by pre-training, the feature extraction model directly extracts context feature vectors of the historical speech data from the historical speech data of the speech data, directly extracts context feature vectors of the speech data from the speech data, and outputs the recognition text of the speech data by taking the speech data, the context feature vectors of the speech data and the context feature vectors of the historical speech data of the speech data as input of the recognition model. And then adjusting parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until the training stopping condition is met, and obtaining the trained voice recognition model. The pre-trained feature extraction model can directly extract the context feature vector of the voice data from the voice data and extract the context feature vector of the historical voice data from the historical voice data of the voice data, so that the problem that the accuracy of voice recognition is low due to accumulation of recognition errors caused by inputting the recognition result of the historical voice into a language model to extract the context feature in the prior art can be solved, the extracted context feature vector is a high-level information representation related to context, and the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data are jointly used as the input of the recognition model, so that the recognition model can effectively utilize the context information in the learning process, and the trained voice recognition model can improve the accuracy of voice recognition, particularly improve the recognition accuracy of a long-tail entity in a long voice recognition scene.
Further, the feature extraction model in the embodiment of the present application includes a speech coder, a text coder and a cross-modal coder, wherein at least one of the speech coder and the text coder may be a pre-training model, and by using the pre-training model, model parameters of the speech coder and the text coder may be obtained by training a large amount of unlabeled training data, and the training data may come from various fields, so that the pre-training model has good robustness and generalization, and further, the speech recognition model has good robustness and generalization. By adopting the pre-training model, the feature extraction model can obtain the effective context feature vector only by a small amount of training data, so that the speech recognition method of the embodiment of the application can be applied to low-resource speech recognition scenes.
The speech recognition model training method and the speech recognition method provided by the embodiments of the application can be applied to various long speech recognition scenarios, such as intelligent customer service systems, smartphone assistants, automatic video subtitle generation, voice-to-text conversion in instant messaging software, and online voice interaction. The methods can significantly improve the recognition accuracy of long-tail entities in long speech recognition scenarios and improve the user experience. They can also be applied to other speech recognition scenarios, which is not limited in this application.
Fig. 1 is a schematic view of an application scenario of a speech recognition model training method and a speech recognition method provided in an embodiment of the present application, as shown in fig. 1, an implementation scenario of an embodiment of the present application relates to a server 1 and a terminal device 2, and the terminal device 2 may perform data communication with the server 1 through a communication network.
In some implementations, the terminal device 2 is a device with rich human-computer interaction capabilities, Internet access capability, typically carrying various operating systems, and strong processing capability. The terminal device may be, for example, a smartphone, a tablet computer, a portable notebook computer, a desktop computer, a portable wearable device, a smart speaker or a vehicle-mounted terminal, but is not limited thereto. Optionally, in this embodiment of the application, a client of speech recognition software is installed on the terminal device 2, and a user may input the corresponding speech information to be recognized through the client.
In some implementations, the terminal device 2 includes, but is not limited to, a smartphone, a tablet computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. Illustratively, the intelligent voice interaction device may be a smart speaker, a smart TV box, an online voice interaction system, an intelligent voice assistant, a vehicle-mounted intelligent voice device, or an intelligent voice device with a simultaneous interpretation function or with a voice input method installed.
The server 1 in fig. 1 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. This is not limited by the present application.
Illustratively, in some implementations, the server 1 is configured to lay out and train the speech recognition models, deploy the trained speech recognition models in corresponding terminal devices, and process the speech information in the use environment, such as performing speech recognition, by using the deployed speech recognition models through the terminal devices (e.g., the terminal device 2).
It can be understood that before the speech information in the use environment is processed by the speech recognition model, the speech recognition model needs to be trained, and specifically, the speech recognition model training method provided in the embodiment of the present application may be used. The speech recognition model training method provided by the embodiment of the application can improve the accuracy of speech recognition, especially the recognition accuracy of long-tail entities in a long speech recognition scene.
In some implementations, fig. 1 exemplarily shows one terminal device and one server; in practice, other numbers of terminal devices and servers may be included, which is not limited in this application.
In the speech recognition model training method provided by the embodiment of the present application, the execution subject may be the speech recognition model training apparatus provided by the embodiment of the present application, or a computer device integrated with the speech recognition model training apparatus, where the speech recognition model training apparatus may be implemented in a hardware or software manner. The computer device may be the terminal device 2 or the server 1 in fig. 1.
In some implementation manners, the speech recognition model training method provided in this embodiment of the present application may use a server or a workstation that includes computing hardware such as a CPU, a GPU, or a TPU, to train the speech recognition model, may also use a server cluster or a distributed system that is formed by a plurality of physical servers to train the speech recognition model, and may also use a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, a cloud computing, a cloud function, a cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform to train the speech recognition model.
The technical solutions provided in the embodiments of the present application are explained in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart of a speech recognition model training method provided in an embodiment of the present application, where an execution subject of the method may be a speech recognition model training apparatus, and as shown in fig. 2, the method may include:
s101, in any iteration process of voice recognition model training, a first training sample set is obtained, the first training sample set comprises a plurality of training data, each training data comprises voice data, a transcription text of the voice data and historical voice data of the voice data, and the voice recognition model comprises a feature extraction model and a recognition model.
Specifically, in this embodiment, training the speech recognition model requires multiple iterations of the model parameters until a training stopping condition is satisfied, for example until the model converges. In any iteration of the speech recognition model training, the training processes of S101-S104 are executed until the training stopping condition is satisfied, so as to obtain the trained speech recognition model. The first training sample set is the set of training samples used for one iteration; it may be obtained by selecting a preset number of training samples from a first original training sample set, where the preset number may be set according to actual training needs and is not limited here.
The first training sample set comprises a plurality of training data, each training data comprising three parts, i.e. speech data, the transcribed text of the speech data and the historical speech data of the speech data. For example, the first training sample set D1 = {(S_1, Y_1, S_0), (S_2, Y_2, S_1), ..., (S_T, Y_T, S_{T-1})} comprises T "audio-text-audio" training data, where the t-th training data comprises speech data S_t, the transcribed text Y_t of S_t, and the historical speech data S_{t-1} of S_t. The historical speech data of a piece of speech data is the speech data immediately preceding it; optionally, the speech data and its historical speech data may be two speech signals that are continuous in time.
The speech data S_t is the speech feature sequence corresponding to the speech signal, i.e. S_t = (s_t^1, s_t^2, ..., s_t^N), where s_t^n denotes the n-th frame speech feature of S_t (one frame of speech features may correspond to, e.g., 10 ms). The embodiment of the present application does not limit the specific form of the speech features; for example, effective speech features such as Fbank, Mel Frequency Cepstral Coefficients (MFCC), PLP, PNCC and PCEN may be adopted.
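To make the data layout described above concrete, the following is a minimal Python sketch of the "audio-text-audio" triples; the class and function names, feature types and frame length are illustrative assumptions, not part of the original disclosure.

from dataclasses import dataclass
from typing import List

@dataclass
class TrainingTriple:
    speech: List[List[float]]          # S_t: N frames of speech features (e.g. Fbank), one frame per ~10 ms
    transcript: str                    # Y_t: transcribed text of S_t
    history_speech: List[List[float]]  # S_{t-1}: feature sequence of the immediately preceding utterance

def build_first_training_set(utterances: List[List[List[float]]], transcripts: List[str]) -> List[TrainingTriple]:
    # Pair each utterance with its transcript and the preceding utterance, yielding
    # the "audio-text-audio" triples (S_t, Y_t, S_{t-1}) described above.
    return [
        TrainingTriple(speech=utterances[t], transcript=transcripts[t], history_speech=utterances[t - 1])
        for t in range(1, len(utterances))
    ]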
S102, aiming at each training data in the first training sample set, taking the voice data in the training data as the input of a feature extraction model, outputting a context feature vector of the voice data, taking the historical voice data of the voice data as the input of the feature extraction model, and outputting the context feature vector of the historical voice data, wherein the feature extraction model is obtained through pre-training.
S103, the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data are used as input of a recognition model, and a recognition text of the voice data is output.
Specifically, the recognition model predicts a recognition text of the speech data based on the input speech data, the context feature vector of the speech data, and the context feature vector of the historical speech data.
Specifically, in an implementable manner, the recognition model includes an encoder and a decoder, and the outputting the recognition text of the speech data in S103 with the speech data, the context feature vector of the speech data, and the context feature vector of the historical speech data as inputs of the recognition model may specifically include:
and S1031, taking the voice data as the input of the encoder, and outputting the voice feature vector of the voice data.
And S1032, outputting the recognition text of the voice data by taking the voice feature vector of the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data as the input of a decoder, wherein the decoder is used for predicting the recognition text of the voice data according to the voice feature vector of the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data through an attention mechanism.
For example, fig. 3 is a schematic structural diagram of a speech recognition model provided in an embodiment of the present application. As shown in fig. 3, the speech recognition model includes a feature extraction model and a recognition model, and the recognition model includes an encoder and a decoder. For the t-th training data (including speech data S_t, the transcribed text Y_t of S_t and the historical speech data S_{t-1} of S_t), during training the speech data S_t in the training data is used as the input of the feature extraction model, which outputs the context feature vector H_t of the speech data. H_t can be expressed by the following formula:
H_t = Encoder_c(A_t) = Encoder_c(Encoder_s(S_t))
where Encoder_c is the cross-modal encoder and Encoder_s is the speech encoder.
The historical speech data S_{t-1} of the speech data is used as the input of the feature extraction model, which outputs the context feature vector H_{t-1} of the historical speech data. H_{t-1} can be expressed by the following formula:
H_{t-1} = Encoder_c(A_{t-1}) = Encoder_c(Encoder_s(S_{t-1})).
Then the speech data S_t is used as the input of the encoder, which outputs the speech feature vector Z_t of the speech data. Z_t can be expressed by the following formula:
Z_t = FFN(Conv(MHSA(S_t))),
where MHSA denotes a multi-head self-attention mechanism, Conv denotes a convolution operation, and FFN denotes an affine (feed-forward) transformation.
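As an illustration of the encoder computation Z_t = FFN(Conv(MHSA(S_t))) described above, the following is a minimal sketch assuming PyTorch; the layer sizes, the input projection and the omission of residual connections and layer normalization are simplifying assumptions rather than details taken from this disclosure.

import torch
import torch.nn as nn

class RecognitionEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_heads=4, kernel_size=3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # map frame-level speech features into the model dimension
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, speech):                               # speech: (batch, frames, feat_dim), i.e. S_t
        x = self.proj(speech)
        x, _ = self.mhsa(x, x, x)                            # MHSA: multi-head self-attention over frames
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)     # Conv: convolution along the time axis
        return self.ffn(x)                                   # FFN: feed-forward (affine) transform -> Z_t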
Then, the speech feature vector Z_t of the speech data, the context feature vector H_t of the speech data and the context feature vector H_{t-1} of the historical speech data are used as the input of the decoder, which outputs the recognized text Ŷ_t of the speech data. Specifically, the decoder predicts the recognized text Ŷ_t of the speech data S_t according to the speech feature vector Z_t, the context feature vector H_t and the context feature vector H_{t-1}. Optionally, the prediction can be expressed by formulas of the following form:
ŷ_t^k = Softmax(FFN(MHA(Z_t, [H_t; H_{t-1}])))
Ŷ_t^k = argmax(ŷ_t^k)
where MHA denotes a multi-head attention mechanism (multi-head cross attention), and [H_t; H_{t-1}] denotes the splicing of H_t and H_{t-1} in the feature dimension. ŷ_t^k is the predicted distribution of the recognition model's output character at time k, i.e. the probability of each character in the character space V to which Y_t belongs. Assume, for example, that the character space V contains three characters a, b and c; one possible value of ŷ_t^k is then (0.1, 0.7, 0.2). Ŷ_t^k is obtained by applying an argmax operation to the predicted value ŷ_t^k at time k, i.e. selecting the character with the highest probability, where argmax denotes taking the dimension (subscript) of the vector ŷ_t^k that corresponds to its maximum value. Ŷ_t^k is thus the predicted value of the character y_t^k at time k of the character sequence Y_t; in the example, the character b corresponding to the probability 0.7 is selected as the output of the recognition model at that time, that is, the recognized text output by the recognition model is the character b.
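The following is a minimal sketch, under the same PyTorch assumptions, of one decoding step that attends over the speech feature vector Z_t together with the context vectors H_t and H_{t-1} spliced on the feature dimension, and then selects the highest-probability character (the argmax step); all names and shapes are illustrative assumptions.

import torch
import torch.nn as nn

class RecognitionDecoderStep(nn.Module):
    def __init__(self, d_model=256, n_heads=4, vocab_size=5000):
        super().__init__()
        self.ctx_proj = nn.Linear(2 * d_model, d_model)   # fuse [H_t; H_{t-1}] spliced on the feature dimension
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_state, z_t, h_t, h_prev):
        # prev_state: (batch, 1, d_model) decoder state at step k
        # z_t: (batch, frames, d_model); h_t, h_prev: (batch, ctx_len, d_model) with equal ctx_len
        ctx = self.ctx_proj(torch.cat([h_t, h_prev], dim=-1))    # [H_t; H_{t-1}] projected back to d_model
        memory = torch.cat([z_t, ctx], dim=1)                    # attend jointly over acoustic and context features
        attended, _ = self.mha(prev_state, memory, memory)       # multi-head cross attention
        probs = self.out(attended.squeeze(1)).softmax(dim=-1)    # distribution over the character space V
        return probs.argmax(dim=-1)                              # character index with the highest probability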
S104, adjusting parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until a training stopping condition is met, and determining the voice recognition model determined in the iteration process meeting the training stopping condition as the trained voice recognition model.
Specifically, as an implementable manner, adjusting parameters of the recognition model according to the recognition text of each piece of speech data and the transcription text of each piece of speech data in the first training sample set may specifically include:
s1041, constructing a loss function according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set.
For example, the loss function may be a cross-entropy loss function. The loss function constructed from the recognized text of each piece of speech data and the transcribed text of each piece of speech data in the first training sample set may be:
L_ASR = Σ_t CrossEntropy(Ŷ_t, Y_t)
where Y_t is the transcribed text of the speech data S_t and Ŷ_t is the recognized text of the speech data S_t. The loss function L_ASR is the sum of the cross-entropy losses between the recognized text and the transcribed text of the speech data over all training data in the first training sample set; for example, if there are N training data in the first training sample set, the loss function L_ASR is the sum of N cross-entropy losses.
S1042, according to the loss function, the parameters of the recognition model are adjusted through back propagation.
Specifically, taking the above L_ASR as the loss function as an example, the parameters of the recognition model are adjusted through back propagation according to the sum of the N cross-entropy losses, so that the sum of the cross-entropy losses falls within a preset range.
Specifically, after multiple rounds of the above iterative training, the training is stopped once the training stopping condition is satisfied; the training stopping condition is, for example, that the loss value of the loss function has decreased to a first value and no longer changes.
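A minimal sketch of the parameter adjustment described in S1041 and S1042, assuming PyTorch; the optimizer, tensor shapes and helper name are assumptions made for illustration only. The optimizer would be constructed over the recognition model's parameters only, since the pre-trained feature extraction model is not adjusted at this stage.

import torch
import torch.nn.functional as F

def training_step(optimizer, batch_logits, batch_targets):
    # optimizer: built over the recognition model's parameters only (feature extraction model is frozen)
    # batch_logits: per-utterance (num_chars_i, vocab_size) prediction tensors (the recognized text)
    # batch_targets: per-utterance (num_chars_i,) tensors of transcribed-text character ids
    loss = sum(F.cross_entropy(logits, targets, reduction="sum")
               for logits, targets in zip(batch_logits, batch_targets))   # L_ASR over the sample set
    optimizer.zero_grad()
    loss.backward()          # back-propagate and adjust the recognition model's parameters
    optimizer.step()
    return loss.item()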
In the speech recognition model training method provided by this embodiment, a speech recognition model including a feature extraction model and a recognition model is trained, in any iteration process of model training, a first training sample set used for model training includes a plurality of training data, each of the training data includes speech data, a transcription text of the speech data, and historical speech data of the speech data, the feature extraction model is obtained by pre-training, the feature extraction model directly extracts a context feature vector of the historical speech data from the historical speech data of the speech data, and directly extracts a context feature vector of the speech data from the speech data, and then outputs a recognition text of the speech data by taking the speech data, the context feature vector of the speech data, and the context feature vector of the historical speech data of the speech data as input of the recognition model. And then adjusting parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until the training stopping condition is met, and determining the voice recognition model determined in the iteration process meeting the training stopping condition as the trained voice recognition model. The pre-trained feature extraction model can directly extract the context feature vector of the voice data from the voice data and extract the context feature vector of the historical voice data from the historical voice data of the voice data, so that the problem that the recognition error accumulation causes lower accuracy of voice recognition in the prior art because the recognition result of the historical voice is input into a language model to extract the context feature can be avoided, the extracted context feature vector is a high-level information representation related to the context, and the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data are jointly used as the input of the recognition model, so that the recognition model can effectively utilize the context information in the learning process, and the trained voice recognition model can improve the accuracy of the voice recognition, particularly the recognition accuracy of a long-tail entity in a long voice recognition scene.
In the embodiment of the present application, the feature extraction model is obtained by pre-training, and the training process of the feature extraction model is described in detail below with reference to fig. 4. On the basis of the method shown in fig. 2, the method of this embodiment may further include the steps shown in fig. 4 before the speech recognition model training is performed, that is, a feature extraction model training process.
Fig. 4 is a flowchart of a feature extraction model training method provided in an embodiment of the present application, where an execution subject of the method may be a speech recognition model training apparatus, and as shown in fig. 4, the method may include:
s201, in any iteration process of feature extraction model training, a second training sample set is obtained, wherein the second training sample set comprises a plurality of training data, and each training data comprises voice data and a transcription text of the voice data.
Specifically, in this embodiment, training the feature extraction model requires multiple iterations of the feature extraction model parameters until a training stopping condition is satisfied, for example until the model converges. In any iteration, the training processes of S201-S203 are executed until the training stopping condition is satisfied, and the feature extraction model determined in the iteration that satisfies the training stopping condition is determined as the trained feature extraction model. The second training sample set is the set of training samples used for one iteration; it may be obtained by selecting a preset number of training samples from a second original training sample set, where the preset number may be set according to actual training needs and is not limited here.
Optionally, the first original training sample set and the second original training sample set may be the same or different.
In the training process of the feature extraction model, each training data in the second training sample set includes speech data and the transcribed text of the speech data. For example, the second training sample set D2 = {(S_1, Y_1), (S_2, Y_2), ..., (S_T, Y_T)} comprises T "audio-text" training data, where the t-th training data comprises speech data S_t and the transcribed text Y_t of S_t. The transcribed text Y_t of S_t is the converted text corresponding to the speech signal.
The speech data S_t is the speech feature sequence corresponding to the speech signal, i.e. S_t = (s_t^1, s_t^2, ..., s_t^N), where s_t^n denotes the n-th frame speech feature of S_t (one frame of speech features may correspond to, e.g., 10 ms). The embodiment of the present application does not limit the specific form of the speech features; for example, effective speech features such as Fbank, MFCC, PLP, PNCC and PCEN may be used.
S202, aiming at each training data in the second training sample set, the training data is used as the input of the feature extraction model, the first context feature vector of the training data is output, the training data after modal masking processing is used as the input of the feature extraction model, and the second context feature vector of the training data is output.
Specifically, for each training data in the second training sample set, a first context feature vector of the training data and a second context feature vector of the training data are obtained through the feature extraction model, the first context feature vector is output by inputting the training data to the feature extraction model, and the second context feature vector is output by inputting the training data after modal masking processing to the feature extraction model.
Optionally, the feature extraction model includes a speech coder, a text coder and a cross-modal coder.
In an implementation manner, the step S202 of taking the training data as an input of the feature extraction model and outputting the first context feature vector of the training data may specifically include:
s2021, using the speech data in the training data as an input of the speech encoder, and outputting a first speech feature vector of the training data.
S2022, the transcribed text of the voice data is used as input of a text encoder, and a first text feature vector of the training data is output.
S2023, using the first speech feature vector of the training data and the first text feature vector of the training data as input of a cross-modal encoder, and outputting the first context feature vector of the training data, where the cross-modal encoder is configured to associate the first speech feature vector of the training data and the first text feature vector of the training data.
Fig. 5 is a schematic structural diagram of a feature extraction model provided in an embodiment of the present application. As shown in fig. 5, the feature extraction model includes a speech encoder, a text encoder and a cross-modal encoder. Taking the t-th training data (including speech data S_t and the transcribed text Y_t of S_t) as an example, the speech data S_t in the training data is used as the input of the speech encoder, which outputs the first speech feature vector A_t of the training data, and the transcribed text Y_t of the speech data is used as the input of the text encoder, which outputs the first text feature vector E_t of the training data. The first speech feature vector A_t of the training data and the first text feature vector E_t of the training data are used as the input of the cross-modal encoder, which outputs the first context feature vector H_t of the training data. The cross-modal encoder associates the first speech feature vector A_t with the first text feature vector E_t so that they can characterize each other, i.e. the first speech feature vector A_t (also referred to as the implicit feature representation of the speech modality) can characterize the information of the first text feature vector E_t (also referred to as the implicit feature representation of the text modality), and the implicit feature representation of the text modality can characterize the information of the implicit feature representation of the speech modality. The first context feature vector H_t can be expressed as:
H_t = Encoder_c(A_t; E_t) = Encoder_c(Encoder_s(S_t); Encoder_y(Y_t)),
where Encoder_c is the cross-modal encoder, Encoder_s is the speech encoder and Encoder_y is the text encoder.
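The following is a minimal sketch of the feature extraction model of fig. 5 (speech encoder, text encoder and cross-modal encoder), assuming PyTorch; the placeholder encoders and the Transformer-based cross-modal encoder are assumptions made for illustration and are not the concrete architectures of this disclosure. When only the speech input is provided (text=None), the same forward pass corresponds to H_t = Encoder_c(Encoder_s(S_t)) as used during recognition-model training.

import torch
import torch.nn as nn

class FeatureExtractionModel(nn.Module):
    def __init__(self, speech_encoder, text_encoder, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.speech_encoder = speech_encoder   # e.g. a pre-trained model mapping S_t -> A_t, shape (batch, frames', d_model)
        self.text_encoder = text_encoder       # e.g. a pre-trained model mapping Y_t -> E_t, shape (batch, chars, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_modal_encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, speech=None, text=None):
        # At least one modality must be provided; a masked modality is passed as None (the empty input).
        parts = []
        if speech is not None:
            parts.append(self.speech_encoder(speech))   # A_t: implicit representation of the speech modality
        if text is not None:
            parts.append(self.text_encoder(text))       # E_t: implicit representation of the text modality
        joint = torch.cat(parts, dim=1)                  # associate the two modalities in one sequence
        return self.cross_modal_encoder(joint)           # context feature vector H_t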
Optionally, the training data after modal masking processing includes speech data after modal masking processing and transcribed text data after modal masking processing. Before the training data after modal masking processing is used as the input of the feature extraction model in S202 to output the second context feature vector of the training data, the method of this embodiment may further include: acquiring the training data after modal masking processing. Fig. 6 is a schematic diagram of obtaining the second context feature vector of the training data provided in an embodiment of the present application; with reference to fig. 6, the following S1 to S4 are a specific implementation of acquiring the training data after modal masking processing.
S1, carrying out frame random masking on a voice feature sequence corresponding to voice data in training data to obtain a voice feature sequence after frame masking.
In particular, taking the speech data S_t as an example, S_t is the speech feature sequence corresponding to the speech signal, i.e. S_t = (s_t^1, s_t^2, ..., s_t^N), where s_t^n denotes the n-th frame speech feature of S_t (one frame of speech features may correspond to, e.g., 10 ms). Frame-level random masking is applied to the speech feature sequence corresponding to the speech data: each frame speech feature is masked with a preset probability p_s, yielding the frame-masked speech feature sequence, where the mask m_t^n of the n-th frame feature is a random variable obeying a Bernoulli distribution, i.e. m_t^n ~ Bernoulli(p_s). For example, for a feature sequence of length 5, (s_t^1, ..., s_t^5), a corresponding mask sequence such as (1, 0, 1, ..., 0) yields the masked speech feature sequence. As shown in fig. 6, the speech data S_t is subjected to frame-level random masking to obtain the frame-masked speech feature sequence.
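A minimal sketch of the frame-level random masking of step S1, assuming PyTorch; the batch-first shape and the 0/1 convention of the Bernoulli mask are assumptions made for illustration.

import torch

def mask_frames(speech, p_s=0.3):
    # speech: (batch, frames, feat_dim) feature sequences S_t; each frame is zeroed with probability p_s.
    keep = torch.bernoulli(torch.full(speech.shape[:2] + (1,), 1.0 - p_s))   # 1 = keep frame, 0 = mask frame
    return speech * keep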
And S2, determining the product of the masked voice feature sequence and the voice modal mask as voice data after modal masking.
Specifically, speech modality masking means that each piece of speech data in the second training data set is masked as a whole with a preset probability p_m, with the speech modality mask m_cs ~ Bernoulli(p_m). When the m_cs corresponding to a certain piece of speech data is 0, the speech modality is masked; in this case the input of the speech encoder is the empty feature sequence Φ, and the input of the feature extraction model is only the text sequence Y_t.
As shown in fig. 6, the speech data S_t is subjected to frame-level random masking to obtain the frame-masked speech feature sequence, and then to modality-level random masking to obtain the speech data after modal masking processing.
And S3, randomly masking the character sequence corresponding to the transcription text of the voice data to obtain a masked character sequence.
Specifically, the transcribed text Y_t of the speech data can generally be represented as a character sequence. For English, Y_t is usually represented as a sequence of valid subwords, such as BPE (byte pair encoding) units or WordPiece units. For Chinese, Y_t is usually represented as a sequence of Chinese characters. The form of the character sequence is not limited in this application; Y_t can be represented as any valid character sequence, i.e. Y_t = (y_t^1, y_t^2, ..., y_t^M).
Similarly to the speech data, the transcribed text Y_t is first subjected to character-level random masking: each character is masked with a preset probability p_t, giving the character-masked character sequence, where the mask m_t^m of the m-th character is a random variable obeying a Bernoulli distribution, i.e. m_t^m ~ Bernoulli(p_t). As shown in fig. 6, the transcribed text Y_t of the speech data is subjected to character-level random masking to obtain the masked character sequence.
And S4, determining the product of the masked character sequence and the text mode mask as the transcribed text data after the mode masking processing.
Specifically, text modality masking means that the transcribed text Y_t of each piece of speech data in the second training data set is masked as a whole with a preset probability p_m, with the text modality mask m_cy ~ Bernoulli(p_m). When the m_cy corresponding to the transcribed text of a certain piece of speech data is 0, the text modality is masked; in this case the input of the text encoder is the empty feature sequence Φ, and the input of the feature extraction model is only the speech feature sequence S_t. As shown in fig. 6, the transcribed text Y_t of the speech data is subjected to character-level random masking to obtain the masked character sequence, and then to modality-level random masking to obtain the transcribed text data after modal masking processing.
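A minimal sketch of the character-level random masking of step S3 and the modality-level masking of steps S2 and S4, assuming PyTorch; the mask id, the probability convention and the use of None to stand for the empty sequence Φ are assumptions made for illustration.

import torch

def mask_characters(char_ids, p_t=0.3, mask_id=0):
    # char_ids: (batch, chars) integer tensor of character ids for the transcribed text Y_t;
    # each character is replaced by mask_id with probability p_t.
    keep = torch.bernoulli(torch.full(char_ids.shape, 1.0 - p_t)).long()
    return char_ids * keep + mask_id * (1 - keep)

def apply_modality_masking(masked_speech, masked_chars, p_m=0.25):
    # With probability p_m the whole speech (resp. text) modality is dropped and replaced by None,
    # standing in for the empty feature sequence described in steps S2 and S4.
    if torch.bernoulli(torch.tensor(p_m)).item() == 1.0:
        masked_speech = None
    if torch.bernoulli(torch.tensor(p_m)).item() == 1.0:
        masked_chars = None
    return masked_speech, masked_chars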
In an implementable manner, the outputting the second context feature vector of the training data by using the training data after the modal masking processing as the input of the feature extraction model in S202 may specifically include:
S2021', the speech data after the modal masking processing is used as an input of the speech encoder, and a second speech feature vector of the training data is output.
Specifically, as shown in Fig. 6, the modality-masked speech data $m_{cs}\tilde{S}_t$ (or $\Phi$) is the input of the speech encoder, and the output second speech feature vector of the training data is $\tilde{A}_t$.
The speech encoder maps the speech signal from a feature space to an implicit feature space.
S2022', the transcribed text data after the modal masking processing is used as an input of the text encoder, and a second text feature vector of the training data is output.
Specifically, as shown in Fig. 6, the modality-masked transcribed text data $m_{cy}\tilde{Y}_t$ (or $\Phi$) is the input of the text encoder, and the output second text feature vector of the training data is $\tilde{E}_t$.
The text encoder maps the transcribed text from a character space to an implicit feature space.
S2023', using the second speech feature vector of the training data and the second text feature vector of the training data as input of the cross-modal encoder, and outputting the second context feature vector of the training data, where the cross-modal encoder is configured to associate the second speech feature vector of the training data and the second text feature vector of the training data.
Specifically, the cross-modal encoder associates the second speech feature vector $\tilde{A}_t$ of the training data with the second text feature vector $\tilde{E}_t$ of the training data so that they can characterize each other: the second speech feature vector $\tilde{A}_t$ (also referred to as the implicit feature representation of the speech modality) can characterize the second text feature vector $\tilde{E}_t$ (also referred to as the implicit feature representation of the text modality), and the implicit feature representation of the text modality can characterize the information of the implicit feature representation of the speech modality. The second context feature vector $\tilde{H}_t$ can be represented by:

$\tilde{H}_t = \mathrm{Encoder}_c(\tilde{A}_t; \tilde{E}_t) = \mathrm{Encoder}_c(\mathrm{Encoder}_s(m_{cs}\tilde{S}_t); \mathrm{Encoder}_y(m_{cy}\tilde{Y}_t))$
where $\mathrm{Encoder}_c$ is the cross-modal encoder, $\mathrm{Encoder}_s$ is the speech encoder, and $\mathrm{Encoder}_y$ is the text encoder. As shown in Fig. 6, the second speech feature vector $\tilde{A}_t$ of the training data and the second text feature vector $\tilde{E}_t$ of the training data are the input of the cross-modal encoder, and the second context feature vector $\tilde{H}_t$ of the training data is output.
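The cooperation of the three encoders can be sketched with a small PyTorch module. The Transformer-based encoder choices, layer counts, dimensions, and the concatenation used to fuse the two modalities before the cross-modal encoder are assumptions for illustration; the description above only fixes the roles of $\mathrm{Encoder}_s$, $\mathrm{Encoder}_y$ and $\mathrm{Encoder}_c$.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Speech encoder + text encoder + cross-modal encoder (illustrative sketch only)."""

    def __init__(self, feat_dim: int = 80, vocab_size: int = 5000, d_model: int = 256):
        super().__init__()
        self.speech_proj = nn.Linear(feat_dim, d_model)
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.encoder_s = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.encoder_y = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.encoder_c = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

    def forward(self, speech: torch.Tensor, char_ids: torch.Tensor) -> torch.Tensor:
        a = self.encoder_s(self.speech_proj(speech))     # A_t (or A~_t when the input is masked)
        e = self.encoder_y(self.char_embed(char_ids))    # E_t (or E~_t)
        # The cross-modal encoder attends over both modalities jointly,
        # so the speech and text representations can characterize each other.
        return self.encoder_c(torch.cat([a, e], dim=1))  # H_t (or H~_t)
```

Feeding the unmasked pair $(S_t, Y_t)$ gives the first context feature vector, while feeding the modality-masked pair gives the second one.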
S203, adjusting parameters of the feature extraction model according to the first context feature vector of each training data and the second context feature vector of each training data in the second training sample set obtained in each iteration process until the training stopping condition is met, and determining the feature extraction model determined in the iteration process meeting the training stopping condition as the trained feature extraction model.
In an implementation manner, in S203, adjusting parameters of the feature extraction model according to the first context feature vector of each training data in the second training sample set and the second context feature vector of each training data may specifically include:
S2031, a first loss function is constructed according to the first context feature vector of each training data in the second training sample set and the second context feature vector of each training data.
Optionally, the first loss function $L_s$ can be expressed, for example, by a contrastive formula of the following form:

$L_s = -\log \dfrac{\exp(\mathrm{sim}(H_t, \tilde{H}_t))}{\sum_{j} \exp(\mathrm{sim}(H_t, \tilde{H}_j))}$

where $\mathrm{sim}$ represents the cosine similarity function, $\tilde{H}_t$ is the second context feature vector of the training data $(S_t, Y_t)$, $H_t = \mathrm{Encoder}_c(A_t; E_t) = \mathrm{Encoder}_c(\mathrm{Encoder}_s(S_t); \mathrm{Encoder}_y(Y_t))$ is its first context feature vector, and $\tilde{H}_j$ (any $j \neq t$) represents the second context feature vectors of the training data in the second training sample set other than the training data $(S_t, Y_t)$.
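Since the exact expression of $L_s$ is only available as an image in the filing, the sketch below uses a standard InfoNCE-style contrastive loss over a batch, with cosine similarity and the other samples in the batch acting as the negatives $\tilde{H}_j$; the pooling of each context sequence to a single vector and the temperature are additional assumptions.

```python
import torch
import torch.nn.functional as F

def first_loss(h: torch.Tensor, h_tilde: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Assumed InfoNCE-style realisation of L_s.

    h:       (B, D) pooled first context feature vectors H_t of a batch.
    h_tilde: (B, D) pooled second context feature vectors H~_t of the same batch.
    """
    h = F.normalize(h, dim=-1)
    h_tilde = F.normalize(h_tilde, dim=-1)
    sims = h @ h_tilde.T / temperature          # cosine similarities sim(H_t, H~_j)
    targets = torch.arange(h.size(0))           # the diagonal pairs are the positives
    return F.cross_entropy(sims, targets)
```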
S2032, according to the masked characters in the character sequence corresponding to the transcription text of the voice data in each training data and the masked characters in the character sequence predicted by the feature extraction model, a second loss function is constructed, and the masked characters in the predicted character sequence are output of a second context feature vector of the training data through an output layer.
In particular, the second loss function $L_y$ can be expressed, for example, by the following formula:

$L_y = -\sum_{m:\, m^y_m = 0} \log P(y_m \mid \tilde{H}_t)$

where $y_m$ represents a masked character in the character sequence corresponding to the transcribed text of the speech data in the training data, i.e. a character whose mask $m^y_m$ has the value 0, which is known from the masking process, and $P(\cdot \mid \tilde{H}_t)$ is the character distribution predicted by the feature extraction model, from which the masked characters in the character sequence are predicted. As shown in Fig. 6, the masked characters in the predicted character sequence are obtained from the second context feature vector $\tilde{H}_t$ of the training data through the output layer. The output layer in Fig. 6 is only used for calculating the loss function $L_y$ during the training of the feature extraction model and is discarded after the training of the feature extraction model is completed.
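A minimal realisation of $L_y$, assuming the output layer is a linear projection over the vocabulary and that only positions whose character mask is 0 contribute to a cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def second_loss(h_text: torch.Tensor, char_ids: torch.Tensor,
                char_mask: torch.Tensor, output_layer: torch.nn.Linear) -> torch.Tensor:
    """Masked-character prediction loss L_y (assumed cross-entropy form).

    h_text:       (B, M, D) text-position part of the second context vectors H~_t.
    char_ids:     (B, M) original character sequence Y_t.
    char_mask:    (B, M) character masks; 0 marks a masked position.
    output_layer: linear layer D -> vocab_size, discarded after pre-training.
    """
    logits = output_layer(h_text)                              # (B, M, vocab)
    # Unmasked positions (mask value 1) are ignored by the loss.
    targets = char_ids.masked_fill(char_mask.bool(), -100)
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=-100)
```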
S2033, a cross-modal loss function is constructed according to the second context feature vector of each training data and the transcription text of the voice data in each training data.
Alternatively, the cross-modal loss function may employ a CTC (Connectionist Temporal Classification) loss function $L_m$, i.e. the unmasked character sequence $Y_t$ (the transcribed text of the speech data in the training data) is recovered from the second context feature vector $\tilde{H}_t$ of the training data.
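With PyTorch's built-in CTC loss, the cross-modal loss can be sketched as follows; the projection of $\tilde{H}_t$ to vocabulary logits and the shape conventions are assumptions.

```python
import torch
import torch.nn as nn

def cross_modal_loss(h_tilde: torch.Tensor, vocab_proj: nn.Linear,
                     targets: torch.Tensor, input_lens: torch.Tensor,
                     target_lens: torch.Tensor) -> torch.Tensor:
    """CTC-based L_m: recover the unmasked character sequence Y_t from H~_t.

    h_tilde:    (B, T, D) second context feature vectors.
    vocab_proj: linear layer D -> vocab_size + 1 (index 0 reserved for the CTC blank).
    targets:    (B, S) padded target character ids Y_t.
    """
    log_probs = vocab_proj(h_tilde).log_softmax(-1).transpose(0, 1)  # (T, B, C) as CTCLoss expects
    return nn.CTCLoss(blank=0, zero_infinity=True)(log_probs, targets, input_lens, target_lens)
```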
S2034, the weighted sum of the first loss function, the second loss function and the cross-modal loss function is determined as the target loss function.
Specifically, the weighted sum of the first loss function, the second loss function, and the cross-modal loss function can be represented by the following formula:
$L_{ctx} = \alpha L_s + \beta L_y + (1 - \alpha - \beta) L_m$

where $\alpha$ and $\beta$ ($0 < \alpha, \beta < 1$) are weight coefficients used to control the effect of the different loss functions.
S2035, according to the target loss function, the parameters of the feature extraction model are adjusted through back propagation.
Optionally, at least one of the speech coder and the text coder is a pre-trained model.
It can be appreciated that if the speech coder and the text coder are pre-trained models, the parameters of the feature extraction model, in particular, the parameters of the cross-modal coder, are adjusted by back propagation.
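The weighted objective of S2034 and the back-propagation of S2035, together with the optional freezing of pre-trained encoders, can be sketched as follows. The optimizer, learning rate, the example weights $\alpha = 0.4$, $\beta = 0.3$, and the attribute names (taken from the FeatureExtractor sketch above) are illustrative choices, not values fixed by this description.

```python
import torch

def freeze_pretrained(model) -> None:
    """If Encoder_s / Encoder_y are pre-trained, only the cross-modal encoder is adapted."""
    for p in model.encoder_s.parameters():
        p.requires_grad = False
    for p in model.encoder_y.parameters():
        p.requires_grad = False

def train_step(model, optimizer, loss_s, loss_y, loss_m,
               alpha: float = 0.4, beta: float = 0.3) -> torch.Tensor:
    """One update on the target loss L_ctx = alpha*L_s + beta*L_y + (1-alpha-beta)*L_m."""
    loss_ctx = alpha * loss_s + beta * loss_y + (1.0 - alpha - beta) * loss_m
    optimizer.zero_grad()
    loss_ctx.backward()          # back-propagate through the trainable parameters only
    optimizer.step()
    return loss_ctx.detach()
```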
Optionally, the training stopping condition may be that the loss value of $L_{ctx}$ falls to a first value and no longer changes, at which point the training is stopped.
It should be noted that fig. 3 shows a training process of the feature extraction model, when the training of the feature extraction model is completed, the feature extraction model obtained by training is used for the training of the speech recognition model shown in fig. 2, and when the training of the speech recognition model is performed, parameters of the feature extraction model do not change any more, and are only used for extracting context feature vectors to assist the training of the recognition model. And when the speech recognition model is trained and actually used, the text encoder branch and all masking operations in the feature extraction model are not carried out (discarded), only the speech encoder and the cross-mode encoder are reserved, and the context feature vectors are integrated into the encoder part of the recognition model through an attention mechanism.
In the speech recognition model training method provided by this embodiment, in any iteration of the feature extraction model training, a second training sample set containing a plurality of training data is obtained, each training data including speech data and the transcribed text of the speech data. For each training data in the second training sample set, the training data is used as the input of the feature extraction model to output a first context feature vector of the training data, and the training data after modality masking is used as the input of the feature extraction model to output a second context feature vector of the training data. The parameters of the feature extraction model are adjusted according to the first context feature vector and the second context feature vector of each training data in the second training sample set obtained in each iteration until the training stopping condition is satisfied, and the feature extraction model determined in the iteration that satisfies the training stopping condition is determined as the trained feature extraction model. By training the feature extraction model in this way, the feature extraction model can extract the context feature vector of speech data directly from the speech data and extract the context feature vector of historical speech data directly from the historical speech data. This avoids the accumulation of recognition errors caused in the prior art by feeding the recognition result of the historical speech into a language model to extract context features, and therefore improves the recognition accuracy of entities in specific fields.
The embodiment of the present application further provides a speech recognition model training method, which can adopt two-stage training: in the first stage, the training method shown in fig. 4 may be specifically adopted to train the feature extraction model first, and then the model parameters of the feature extraction model are fixed, and in the second stage, the training method shown in fig. 2 is adopted to train the speech recognition model, and in this stage, the model parameters of the feature extraction model are not updated, and are only used for extracting context feature vectors to assist in training the recognition model. For a specific process, reference may be made to the description in the embodiment shown in fig. 2 and fig. 4, which is not described herein again.
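The two-stage schedule itself reduces to the ordering sketched below; the per-batch update functions stand in for the procedures of Fig. 4 (first stage) and Fig. 2 (second stage) and are supplied by the caller, so they are placeholders rather than parts of the patented method.

```python
import torch

def two_stage_training(feat_model, recog_model, stage1_step, stage2_step,
                       stage1_data, stage2_data) -> None:
    """Stage 1: train the feature extraction model; freeze it; Stage 2: train the recognition model."""
    for batch in stage1_data:                 # first stage (training method of Fig. 4)
        stage1_step(feat_model, batch)
    for p in feat_model.parameters():         # parameters are no longer updated afterwards
        p.requires_grad = False
    feat_model.eval()
    for batch in stage2_data:                 # second stage (training method of Fig. 2)
        stage2_step(feat_model, recog_model, batch)
```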
Fig. 7 is a flowchart of a speech recognition method provided in an embodiment of the present application, where an execution subject of the method may be a terminal device or other computer device, and as shown in fig. 7, the method may include:
S401, a voice signal is obtained.
S402, inputting the voice signal and the historical voice signal of the voice signal into a pre-trained voice recognition model, and outputting a voice recognition result of the voice signal.
Wherein the speech recognition model is trained according to the method shown in fig. 2.
Further, the speech recognition model includes a feature extraction model and a recognition model, and the speech recognition model inputs the speech signal and the historical speech signal of the speech signal into a pre-trained speech recognition model and outputs the speech recognition result of the speech signal, which may specifically include:
S4021, inputting the voice signal and the historical voice signal of the voice signal into the feature extraction model, and outputting the context feature vector of the voice signal.
S4022, inputting the voice signal and the context feature vector of the voice signal into the recognition model, and outputting the voice recognition result of the voice signal.
Fig. 8 is a schematic structural diagram of a speech recognition model provided in an embodiment of the present application. As shown in Fig. 8, the speech recognition model includes a feature extraction model and a recognition model. Taking the speech signal $S_t$ and its historical speech signal $S_{t-1}$ as an example, the speech signal $S_t$ and the historical speech signal $S_{t-1}$ are input into the feature extraction model, which outputs the context feature vector $H_t$ of the speech signal and the context feature vector $H_{t-1}$ of the historical speech signal $S_{t-1}$; then the speech signal $S_t$, the context feature vector $H_t$ of the speech signal and the context feature vector $H_{t-1}$ of the historical speech signal $S_{t-1}$ are input into the recognition model, which outputs the speech recognition result $Y_t^*$ of the speech signal $S_t$.
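At recognition time only the speech-encoder and cross-modal-encoder branches of the feature extraction model are kept, so inference for one utterance can be sketched as below; the model call signatures are hypothetical.

```python
import torch

@torch.no_grad()
def recognize(feat_model, recog_model, s_t: torch.Tensor, s_prev: torch.Tensor):
    """Recognise speech signal S_t using its own and its predecessor's context vectors."""
    h_t = feat_model(s_t)        # context feature vector H_t of the current utterance
    h_prev = feat_model(s_prev)  # context feature vector H_{t-1} of the historical utterance
    return recog_model(s_t, h_t, h_prev)   # speech recognition result Y_t*
```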
According to the speech recognition method provided by this embodiment, a speech recognition model including a feature extraction model and a recognition model is adopted. During speech recognition, the context feature vector of the speech signal is extracted from the speech signal and the historical speech signal of the speech signal, and the speech recognition result is obtained from the speech signal and the context feature vector of the speech signal.
Illustratively, the recognition word error rate of the speech recognition method provided by the embodiment of the present application is compared, on two test sets, with that of an existing speech recognition method based on a long-context language model, as shown in Table 1 below:
Table 1. Comparison of recognition word error rates
As can be seen from Table 1, a lower recognition word error rate indicates better speech recognition performance, and the speech recognition method provided by the embodiment of the present application achieves higher recognition accuracy.
Fig. 9 is a schematic structural diagram of a speech recognition model training apparatus according to an embodiment of the present application, and as shown in fig. 9, the apparatus may include: an acquisition module 11, a first processing module 12, a second processing module 13 and a parameter adjustment module 14, wherein,
the obtaining module 11 is configured to obtain a first training sample set in any iteration process of the speech recognition model training, where the first training sample set includes multiple training data, each training data includes speech data, a transcription text of the speech data, and historical speech data of the speech data, and the speech recognition model includes a feature extraction model and a recognition model;
the first processing module 12 is configured to, for each piece of training data in the first training sample set, take voice data in the training data as input of the feature extraction model, output a context feature vector of the voice data, take historical voice data of the voice data as input of the feature extraction model, and output the context feature vector of the historical voice data, where the feature extraction model is obtained through pre-training;
the second processing module 13 is configured to take the voice data, the context feature vector of the voice data, and the context feature vector of the historical voice data as inputs of the recognition model, and output a recognition text of the voice data;
the parameter adjusting module 14 is configured to adjust parameters of the recognition model according to the recognition text of each piece of speech data and the transcription text of each piece of speech data in the first training sample set obtained in each iteration process until a training stopping condition is met, and determine the speech recognition model determined in the iteration process that meets the training stopping condition as the trained speech recognition model.
Optionally, the obtaining module 11 is further configured to:
in any iteration process of the feature extraction model training, a second training sample set is obtained, the second training sample set comprises a plurality of training data, and each training data comprises voice data and a transcription text of the voice data;
the first processing module 12 is further configured to: and aiming at each training data in the second training sample set, taking the training data as the input of the feature extraction model, outputting a first context feature vector of the training data, taking the training data after the modal masking processing as the input of the feature extraction model, and outputting a second context feature vector of the training data.
The parameter adjustment module 14 is further configured to: and adjusting parameters of the feature extraction model according to the first context feature vector of each training data and the second context feature vector of each training data in the second training sample set obtained in each iteration process until a training stopping condition is met, and determining the feature extraction model determined in the iteration process meeting the training stopping condition as the trained feature extraction model.
Optionally, the feature extraction model includes a speech coder, a text coder and a cross-modal coder.
Optionally, the first processing module 12 is specifically configured to: taking the voice data in the training data as the input of a voice coder, and outputting a first voice feature vector of the training data;
taking a transcription text of the voice data as an input of a text encoder, and outputting a first text feature vector of the training data;
and outputting the first context feature vector of the training data by taking the first voice feature vector of the training data and the first text feature vector of the training data as the input of a cross-mode encoder, wherein the cross-mode encoder is used for correlating the first voice feature vector of the training data and the first text feature vector of the training data.
Optionally, the training data after the modal masking processing includes speech data after the modal masking processing and transcribed text data after the modal masking processing, and the first processing module 12 is further configured to, before taking the training data after the modal masking processing as an input of the feature extraction model and outputting a second context feature vector of the training data:
carrying out frame random masking on a voice feature sequence corresponding to voice data in training data to obtain a voice feature sequence after frame masking;
determining the product of the masked voice feature sequence and the voice modal mask as voice data after modal masking;
carrying out character random masking on a character sequence corresponding to a transcription text of voice data to obtain a masked character sequence;
and determining the product of the masked character sequence and the mode mask of the training data transcribed text as the transcribed text data after the mode masking processing.
Optionally, the first processing module 12 is specifically configured to:
the voice data after the modal masking processing is used as the input of a voice coder, and a second voice feature vector of the training data is output;
using the transcribed text data after the modal masking processing as the input of a text encoder, and outputting a second text feature vector of the training data;
and outputting a second context feature vector of the training data by taking the second voice feature vector of the training data and the second text feature vector of the training data as the input of a cross-mode encoder, wherein the cross-mode encoder is used for associating the second voice feature vector of the training data with the second text feature vector of the training data.
Optionally, the parameter adjusting module 14 is specifically configured to:
constructing a first loss function according to the first context feature vector of each training data in the second training sample set and the second context feature vector of each training data;
constructing a second loss function according to the masked characters in the character sequence corresponding to the transcription text of the voice data in each training data and the masked characters in the character sequence predicted by the feature extraction model, wherein the masked characters in the predicted character sequence are output of a second context feature vector of the training data through an output layer;
constructing a cross-modal loss function according to the second context feature vector of each training data and the transcription text of the voice data in each training data;
weighting and summing the first loss function, the second loss function and the cross-modal loss function to determine a target loss function;
and according to the target loss function, reversely propagating and adjusting the parameters of the feature extraction model.
Optionally, at least one of the speech coder and the text coder is a pre-trained model.
Optionally, the recognition model includes an encoder and a decoder, and the second processing module 13 is specifically configured to:
taking the voice data as the input of the encoder, and outputting the voice characteristic vector of the voice data;
and the decoder is used for predicting the recognition text of the voice data according to the voice feature vector of the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data through an attention mechanism.
Optionally, the parameter adjusting module 14 is specifically configured to:
constructing a loss function according to the recognition text of each voice data in the first training sample set and the transcription text of each voice data;
the parameters of the recognition model are adjusted according to the loss function by back propagation.
Fig. 10 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application, and as shown in fig. 10, the apparatus may include: an acquisition module 21 and a speech recognition model 22, wherein,
the obtaining module 21 is configured to obtain a voice signal;
the speech recognition model 22 is used to input the speech signal and the historical speech signal of the speech signal into a pre-trained speech recognition model, and output the speech recognition result of the speech signal, and the speech recognition model is obtained by training according to the method shown in fig. 2.
Optionally, the speech recognition model 22 is used to: inputting the voice signal and the historical voice signal of the voice signal into a feature extraction model, and outputting a context feature vector of the voice signal and a context feature vector of the historical voice signal;
and inputting the context feature vector of the voice signal and the context feature vector of the historical voice signal into a recognition model, and outputting a voice recognition result of the voice signal.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to the method embodiments. To avoid repetition, further description is omitted here. Specifically, the speech recognition model training apparatus shown in fig. 9 or the speech recognition apparatus shown in fig. 10 may execute the method embodiment corresponding to the computer device, and the foregoing and other operations and/or functions of each module in the apparatus are respectively for implementing the method embodiment corresponding to the computer device, and are not described herein again for brevity.
The speech recognition model training apparatus and the speech recognition apparatus of the embodiments of the present application are described above from the perspective of functional blocks in conjunction with the drawings. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 11 is a schematic block diagram of a computer device 700 provided by an embodiment of the present application.
As shown in fig. 11, the computer device 700 may include:
a memory 710 and a processor 720, the memory 710 being configured to store a computer program and to transfer the program code to the processor 720. In other words, the processor 720 may call and run a computer program from the memory 710 to implement the method in the embodiment of the present application.
For example, the processor 720 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the present application, the processor 720 may include, but is not limited to:
general purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 710 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), enhanced Synchronous SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program can be divided into one or more modules, which are stored in the memory 710 and executed by the processor 720 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing certain functions, the instruction segments describing the execution of the computer program in the electronic device.
As shown in fig. 11, the computer apparatus may further include:
a transceiver 730, the transceiver 730 being connectable to the processor 720 or the memory 710.
The processor 720 may control the transceiver 730 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 730 may include a transmitter and a receiver. The transceiver 730 may further include an antenna, and the number of antennas may be one or more.
It should be understood that the various components in the electronic device are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.
When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the present application occur, in whole or in part, when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A method for training a speech recognition model, the method comprising:
in any iteration process of speech recognition model training, obtaining a first training sample set, wherein the first training sample set comprises a plurality of training data, each training data comprises speech data, a transcription text of the speech data and historical speech data of the speech data, and the speech recognition model comprises a feature extraction model and a recognition model;
for each training data in the first training sample set, outputting a context feature vector of the speech data by taking speech data in the training data as input of the feature extraction model, and outputting a context feature vector of the historical speech data by taking historical speech data of the speech data as input of the feature extraction model, wherein the feature extraction model is obtained by pre-training;
outputting a recognition text of the voice data by taking the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data as the input of the recognition model;
and adjusting parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until a training stopping condition is met, and determining the voice recognition model determined in the iteration process meeting the training stopping condition as the trained voice recognition model.
2. The method of claim 1, further comprising:
in any iteration process of feature extraction model training, obtaining a second training sample set, wherein the second training sample set comprises a plurality of training data, and each training data comprises voice data and a transcription text of the voice data;
for each training data in the second training sample set, taking the training data as the input of the feature extraction model, outputting a first context feature vector of the training data, taking the training data after modal masking as the input of the feature extraction model, and outputting a second context feature vector of the training data;
and adjusting parameters of the feature extraction model according to the first context feature vector of each training data and the second context feature vector of each training data in the second training sample set obtained in each iteration process until a training stopping condition is met, and determining the feature extraction model determined in the iteration process meeting the training stopping condition as the trained feature extraction model.
3. The method of claim 2, wherein the feature extraction model comprises a speech coder, a text coder, and a cross-modal coder.
4. The method of claim 3, wherein the outputting the first context feature vector of the training data with the training data as an input of the feature extraction model comprises:
taking voice data in the training data as input of the voice coder, and outputting a first voice feature vector of the training data;
taking the transcription text of the voice data as the input of the text encoder, and outputting a first text feature vector of the training data;
and outputting the first context feature vector of the training data by taking the first voice feature vector of the training data and the first text feature vector of the training data as the input of the cross-modal encoder, wherein the cross-modal encoder is used for associating the first voice feature vector of the training data with the first text feature vector of the training data.
5. The method according to claim 4, wherein the training data after the modal masking process includes speech data after the modal masking process and transcript text data after the modal masking process, and before outputting the second context feature vector of the training data with the training data after the modal masking process as the input of the feature extraction model, the method further includes:
carrying out frame random masking on a voice feature sequence corresponding to the voice data in the training data to obtain a voice feature sequence after frame masking;
determining the product of the masked voice feature sequence and a voice modal mask as voice data after modal masking;
carrying out character random masking on a character sequence corresponding to the transcription text of the voice data to obtain a masked character sequence;
and determining the product of the masked character sequence and a text mode mask as the transcribed text data after the mode masking processing.
6. The method according to claim 5, wherein the outputting the second context feature vector of the training data by taking the training data after the modal masking processing as the input of the feature extraction model comprises:
taking the voice data after the modal masking processing as the input of the voice coder, and outputting a second voice feature vector of the training data;
using the transcribed text data after the modal masking processing as the input of the text encoder, and outputting a second text feature vector of the training data;
and outputting a second context feature vector of the training data by taking the second speech feature vector of the training data and the second text feature vector of the training data as the input of the cross-modal encoder, wherein the cross-modal encoder is used for associating the second speech feature vector of the training data with the second text feature vector of the training data.
7. The method of claim 2, wherein the adjusting the parameters of the feature extraction model according to the first contextual feature vector of each training data in the second training sample set and the second contextual feature vector of each training data comprises:
constructing a first loss function according to the first context feature vector of each training data in the second training sample set and the second context feature vector of each training data;
constructing a second loss function according to the masked characters in the character sequence corresponding to the transcription text of the voice data in each training data and the masked characters in the character sequence predicted by the feature extraction model, wherein the predicted masked characters in the character sequence are output of a second context feature vector of the training data through an output layer;
constructing a cross-modal loss function according to the second context feature vector of each training data and the transcription text of the voice data in each training data;
determining a weighted sum of the first loss function, the second loss function and the cross-modal loss function as a target loss function;
and according to the target loss function, reversely propagating and adjusting the parameters of the feature extraction model.
8. The method of claim 3, wherein at least one of the speech coder and the text coder is a pre-trained model.
9. The method of claim 1, wherein the recognition model comprises an encoder and a decoder, and wherein outputting the recognized text of the speech data with the speech data, the context feature vector of the speech data, and the context feature vector of the historical speech data as inputs of the recognition model comprises:
taking the voice data as the input of the encoder, and outputting the voice feature vector of the voice data;
and outputting the recognition text of the voice data by taking the voice feature vector of the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data as the input of the decoder, wherein the decoder is used for predicting the recognition text of the voice data according to the voice feature vector of the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data through an attention mechanism.
10. The method of claim 1, wherein the adjusting parameters of the recognition model according to the recognized text of each speech data and the transcribed text of each speech data in the first training sample set comprises:
constructing a loss function according to the recognition text of each voice data in the first training sample set and the transcription text of each voice data;
and adjusting the parameters of the recognition model by back propagation according to the loss function.
11. A speech recognition method, comprising:
acquiring a voice signal;
inputting the speech signal and the historical speech signal of the speech signal into a pre-trained speech recognition model, and outputting a speech recognition result of the speech signal, wherein the speech recognition model is obtained by training according to the method of any one of claims 1-10.
12. The method of claim 11, wherein the speech recognition model comprises a feature extraction model and a recognition model, and the inputting the speech signal and the historical speech signal of the speech signal into a pre-trained speech recognition model and outputting the speech recognition result of the speech signal comprises:
inputting the speech signal and a historical speech signal of the speech signal into the feature extraction model, and outputting a context feature vector of the speech signal and a context feature vector of the historical speech signal;
and inputting the context feature vector of the voice signal and the context feature vector of the historical voice signal into the recognition model, and outputting a voice recognition result of the voice signal.
13. A speech recognition model training apparatus, comprising:
the system comprises an acquisition module, a recognition module and a processing module, wherein the acquisition module is used for acquiring a first training sample set in any iteration process of voice recognition model training, the first training sample set comprises a plurality of training data, each training data comprises voice data, a transcription text of the voice data and historical voice data of the voice data, and the voice recognition model comprises a feature extraction model and a recognition model;
a first processing module, configured to, for each training data in the first training sample set, take voice data in the training data as an input of the feature extraction model, output a context feature vector of the voice data, take historical voice data of the voice data as an input of the feature extraction model, and output the context feature vector of the historical voice data, where the feature extraction model is obtained through pre-training;
the second processing module is used for taking the voice data, the context feature vector of the voice data and the context feature vector of the historical voice data as the input of the recognition model and outputting the recognition text of the voice data;
and the parameter adjusting module is used for adjusting the parameters of the recognition model according to the recognition text of each voice data and the transcription text of each voice data in the first training sample set obtained in each iteration process until a training stopping condition is met, and determining the voice recognition model determined in the iteration process meeting the training stopping condition as the trained voice recognition model.
14. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring a voice signal;
a speech recognition module, configured to input the speech signal and a historical speech signal of the speech signal into a pre-trained speech recognition model, and output a speech recognition result of the speech signal, where the speech recognition model is obtained by training according to the method of any one of claims 1 to 10.
15. A computer device, comprising:
a processor and a memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any one of claims 1 to 10 or 11 to 12.
16. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 10 or 11 to 12.
17. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 10 or 11 to 12.
CN202210928842.8A 2022-08-03 2022-08-03 Speech recognition model training method, speech recognition method and device Pending CN115376495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210928842.8A CN115376495A (en) 2022-08-03 2022-08-03 Speech recognition model training method, speech recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210928842.8A CN115376495A (en) 2022-08-03 2022-08-03 Speech recognition model training method, speech recognition method and device

Publications (1)

Publication Number Publication Date
CN115376495A true CN115376495A (en) 2022-11-22

Family

ID=84064508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210928842.8A Pending CN115376495A (en) 2022-08-03 2022-08-03 Speech recognition model training method, speech recognition method and device

Country Status (1)

Country Link
CN (1) CN115376495A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109991A (en) * 2022-12-07 2023-05-12 北京百度网讯科技有限公司 Constraint parameter determination method and device of model and electronic equipment
CN116109991B (en) * 2022-12-07 2024-01-09 北京百度网讯科技有限公司 Constraint parameter determination method and device of model and electronic equipment
CN115862601A (en) * 2023-03-01 2023-03-28 贝壳找房(北京)科技有限公司 Data generation method, electronic device and readable storage medium
CN115862601B (en) * 2023-03-01 2023-05-02 贝壳找房(北京)科技有限公司 Data generation method, electronic device and readable storage medium
CN116913266A (en) * 2023-09-13 2023-10-20 腾讯科技(深圳)有限公司 Voice detection method, device, equipment and storage medium
CN116913266B (en) * 2023-09-13 2024-01-05 腾讯科技(深圳)有限公司 Voice detection method, device, equipment and storage medium
CN117912456A (en) * 2023-11-28 2024-04-19 广州视声智能科技有限公司 Voice recognition method and system based on data prediction

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
EP3504703B1 (en) A speech recognition method and apparatus
CN111312245B (en) Voice response method, device and storage medium
CN115376495A (en) Speech recognition model training method, speech recognition method and device
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN113837299B (en) Network training method and device based on artificial intelligence and electronic equipment
Deena et al. Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment
CN112837669B (en) Speech synthesis method, device and server
CN116861995A (en) Training of multi-mode pre-training model and multi-mode data processing method and device
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN112767927A (en) Method, device, terminal and storage medium for extracting voice features
CN109637527A (en) The semantic analytic method and system of conversation sentence
CN118043885A (en) Contrast twin network for semi-supervised speech recognition
CN116665675B (en) Voice transcription method, system, electronic equipment and storage medium
CN116564330A (en) Weak supervision voice pre-training method, electronic equipment and storage medium
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
CN115132170A (en) Language classification method and device and computer readable storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN112150103B (en) Schedule setting method, schedule setting device and storage medium
CN115101075A (en) Voice recognition method and related device
CN114333790A (en) Data processing method, device, equipment, storage medium and program product
Tanaka et al. End-to-end rich transcription-style automatic speech recognition with semi-supervised learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination