CN116092485A - Training method and device of voice recognition model, and voice recognition method and device - Google Patents

Info

Publication number
CN116092485A
Authority
CN
China
Prior art keywords
data
network
voice
fusion
decoding
Legal status: Pending
Application number
CN202310081310.XA
Other languages
Chinese (zh)
Inventor
李盛强
Current Assignee
Shanghai Anting Horizon Intelligent Transportation Technology Co ltd
Original Assignee
Shanghai Anting Horizon Intelligent Transportation Technology Co ltd
Application filed by Shanghai Anting Horizon Intelligent Transportation Technology Co ltd
Priority to CN202310081310.XA
Publication of CN116092485A

Classifications

    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N3/08 Neural networks; Learning methods
    • G06F7/588 Random number generators, i.e. based on natural stochastic processes
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L2015/0631 Creating reference templates; Clustering
    • Y02T10/40 Engine management systems

Abstract

The embodiments of the disclosure disclose a training method and apparatus for a speech recognition model, a speech recognition method and apparatus, a computer-readable storage medium, and an electronic device. The training method of the speech recognition model includes the following steps: performing data masking processing on sample video data and sample audio data based on a generated random number to obtain masked video data and masked audio data; performing fusion encoding and decoding on the masked video data and the masked audio data by using an initial speech recognition model to be trained to obtain speech prediction data; and training the initial speech recognition model based on a loss function and the speech prediction data to obtain a pre-trained speech recognition model. According to the embodiments of the disclosure, because the data volumes of the sample video data and the sample audio data are no longer balanced, the capability of the model to process unbalanced multi-modal data can be improved, the trained speech recognition model can adapt to various noise scenes, and the recognition accuracy of the speech recognition model is improved.

Description

Training method and device of voice recognition model, and voice recognition method and device
Technical Field
The disclosure relates to the field of computer technology, and in particular relates to a training method and device for a speech recognition model, a speech recognition method and device, a computer readable storage medium and electronic equipment.
Background
Multi-modal speech recognition assists speech recognition with visual information such as lip-motion video, face-motion video, and eye-motion video, which improves recognition accuracy in high-noise scenes to a certain extent.
In the training stage of a multi-modal speech recognition model, data of both modalities (video data and audio data) generally need to be input at the same time, and the model must process video features and audio features simultaneously. The trained multi-modal speech recognition model recognizes well when the input audio and video are both available. When the data volumes of the two modalities are unbalanced, that is, when data of one modality is missing, the recognition accuracy of the model drops.
Disclosure of Invention
The present disclosure has been made in order to solve the above technical problems. Embodiments of the present disclosure provide a training method and apparatus for a speech recognition model, a speech recognition method and apparatus, a computer-readable storage medium, and an electronic device.
The embodiment of the disclosure provides a training method of a voice recognition model, which comprises the following steps: generating a random number in a preset numerical value interval; carrying out data masking processing on the sample video data and the sample audio data based on the random number to obtain masked video data and masked audio data; fusion coding is carried out on the masked video data and the masked audio data by utilizing a fusion coding network of an initial speech recognition model to be trained, so as to obtain fusion coding data; decoding the fusion encoded data by utilizing a decoding network of the initial voice recognition model to obtain voice prediction data; determining a loss value representing an error between the speech prediction data and a predetermined speech tag sequence based on a predetermined loss function and the speech prediction data; based on the loss value, adjusting parameters of the initial speech recognition model to obtain an adjusted speech recognition model; and determining the adjusted speech recognition model as a pre-trained speech recognition model in response to determining that the adjusted speech recognition model meets a preset training end condition.
According to another aspect of an embodiment of the present disclosure, there is provided a voice recognition method including: acquiring video data to be identified and audio data to be identified; fusion coding is carried out on the video data to be identified and the audio data to be identified by utilizing a fusion coding network of a pre-trained voice identification model, so that fusion coding data are obtained; decoding the fusion encoded data by utilizing a decoding network of the voice recognition model to obtain voice prediction data; based on the speech prediction data, speech recognition text is generated.
According to another aspect of an embodiment of the present disclosure, there is provided a training apparatus of a speech recognition model, the apparatus including: the first generation module is used for generating a random number in a preset numerical value interval; the masking module is used for carrying out data masking processing on the sample video data and the sample audio data based on the random number to obtain masked video data and masked audio data; the first fusion module is used for carrying out fusion coding on the masked video data and the masked audio data by utilizing a fusion coding network of an initial voice recognition model to be trained to obtain fusion coding data; the first decoding module is used for decoding the fusion encoded data by utilizing a decoding network of the initial voice recognition model to obtain voice prediction data; a first determining module for determining a loss value representing an error between the speech prediction data and a predetermined speech tag sequence based on a predetermined loss function and the speech prediction data; the adjusting module is used for adjusting parameters of the initial voice recognition model based on the loss value to obtain an adjusted voice recognition model; and the second determining module is used for determining the adjusted voice recognition model as a pre-trained voice recognition model in response to determining that the adjusted voice recognition model meets a preset training ending condition.
According to another aspect of an embodiment of the present disclosure, there is provided a speech recognition apparatus, the apparatus including: the acquisition module is used for acquiring the video data to be identified and the audio data to be identified; the second fusion module is used for carrying out fusion coding on the video data to be identified and the audio data to be identified by utilizing a fusion coding network of the pre-trained voice recognition model to obtain fusion coding data; the second decoding module is used for decoding the fusion encoded data by utilizing a decoding network of the voice recognition model to obtain voice prediction data; and the second generation module is used for generating voice recognition text based on the voice prediction data.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described training method of a speech recognition model or the above-described speech recognition method.
According to another aspect of an embodiment of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; and the processor is used for reading the executable instructions from the memory and executing the instructions to realize the training method or the voice recognition method of the voice recognition model.
According to the training method and apparatus for a speech recognition model, the speech recognition method and apparatus, the computer-readable storage medium, and the electronic device provided by the embodiments of the present disclosure, a random number is generated, data masking processing is performed on sample video data and sample audio data based on the random number to obtain masked video data and masked audio data, fusion encoding is then performed on the masked video data and the masked audio data by using an initial speech recognition model to obtain fusion encoded data, the fusion encoded data is then decoded to obtain speech prediction data, a loss value representing the error between the speech prediction data and a preset speech tag sequence is then determined based on a preset loss function, and finally the parameters of the initial speech recognition model are adjusted based on the loss value to obtain a pre-trained speech recognition model. According to the embodiments of the disclosure, a random factor is introduced into the multi-modal training sample data in the training stage of the speech recognition model, so that the data volumes of the sample video data and the sample audio data are no longer balanced, the capability of the model to process unbalanced multi-modal data can be improved, the trained speech recognition model can adapt to various noise scenes, and the recognition accuracy of the speech recognition model is improved.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing embodiments thereof in more detail with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. In the drawings, like reference numerals generally refer to like parts or steps;
FIG. 1 is a system diagram to which the present disclosure is applicable;
FIG. 2 is a flowchart of a method of training a speech recognition model provided in an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart of a method of training a speech recognition model provided in another exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart of a method of training a speech recognition model provided in another exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart of a method of training a speech recognition model provided in another exemplary embodiment of the present disclosure;
FIG. 6 is a flowchart of a method of training a speech recognition model provided in another exemplary embodiment of the present disclosure;
FIG. 7 is a flowchart of a method of training a speech recognition model provided in another exemplary embodiment of the present disclosure;
FIG. 8 is a flowchart of a speech recognition method provided by an exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart of a speech recognition method provided by another exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart of a speech recognition method provided by another exemplary embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a training device for a speech recognition model according to an exemplary embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of a training device of a speech recognition model according to another exemplary embodiment of the present disclosure;
FIG. 13 is a schematic diagram of a voice recognition apparatus according to an exemplary embodiment of the present disclosure;
FIG. 14 is a schematic structural diagram of a voice recognition apparatus provided in another exemplary embodiment of the present disclosure;
FIG. 15 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure may be applicable to electronic devices such as terminal devices, computer systems, servers, etc., which may operate with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, server, or other electronic device include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing technology environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Summary of the application
The current multimode voice recognition model generally needs to input data of two modes of video data and audio data at the same time in a training stage, the model needs to process video features and audio features at the same time, and the recognition effect of the trained multimode voice recognition model is good under the condition that the input audio and video can be obtained at the same time.
Because the conventional recognition scene is complex, the situation that data of one mode is missing or noise is large can occur, so that when the data quantity of two modes is unbalanced, the recognition accuracy of the model is reduced.
The embodiment of the disclosure aims to solve the problems, and the data masking processing is carried out on sample video data and sample audio data randomly in the training stage of a model, so that the data volume of training samples of two modes is unbalanced, the capability of the model for processing unbalanced multi-mode data is improved, and the recognition accuracy of the trained model is greatly improved.
Exemplary System
FIG. 1 illustrates an exemplary system architecture 100 to which the training method or training apparatus of a speech recognition model, and the speech recognition method or speech recognition apparatus, of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, a server 103, a video capture device 104, and an audio capture device 105. Network 102 is a medium used to provide communication links between terminal device 101 and server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
The video capture device 104 and the audio capture device 105 may be disposed within a target space, which may be various types of spaces, such as an in-vehicle space, an in-house space, and the like. The video capture device 104 and the audio capture device 105 are used to capture video data and audio data. The collected video data and audio data may be saved to the terminal device 101 or transmitted by the terminal device 101 to the server 103.
A user may interact with the server 103 via the network 102 using the terminal device 101 to receive or send messages or the like. The terminal device 101 may have various communication client applications installed thereon, such as a search class application, a web browser application, a multimedia application, an instant messaging tool, and the like.
The terminal device 101 may be various electronic devices including, but not limited to, mobile terminals such as in-vehicle terminals, mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and the like, and fixed terminals such as digital TVs, desktop computers, and the like.
The server 103 may be a server providing various services, such as a background model training server that trains a speech recognition model using sample video data and sample audio data. The server 103 may also be a speech recognition server, which performs speech recognition on the audio data and video data uploaded from the terminal device 101 using the trained speech recognition model and feeds back the recognition result to the terminal device 101.
It should be noted that, the training method or the speech recognition method of the speech recognition model provided by the embodiments of the present disclosure may be executed by the server 103 or may be executed by the terminal device 101, and accordingly, the training apparatus or the speech recognition apparatus of the speech recognition model may be provided in the server 103 or may be provided in the terminal device 101.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the sample video data and the sample audio data, or the video data to be identified and the audio data to be identified, do not need to be acquired from a remote place, the above system architecture may not include a network, but include only a server or a terminal device.
Exemplary method
Fig. 2 is a flow chart of a training method of a speech recognition model according to an exemplary embodiment of the present disclosure. The present embodiment is applicable to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 2, the method includes the steps of:
step 201, generating a random number in a preset value interval.
In this embodiment, the electronic device may generate the random number in the preset numerical range. The preset value interval may be arbitrarily set, for example, a [0,1] interval.
And 202, carrying out data masking processing on the sample video data and the sample audio data based on the random numbers to obtain masked video data and masked audio data.
In this embodiment, the electronic device may perform data masking processing on the sample video data and the sample audio data based on the random number, to obtain masked video data and masked audio data.
Specifically, a correspondence relation between the random number and the masked data may be set in advance, and the masked data may be determined from the sample video data and the sample audio data according to the correspondence relation. That is, when the generated random number corresponds to the sample video data, the sample video data is determined as masked data; when the generated random number corresponds to the sample audio data, the sample audio is determined as masked data. The masked data may be set as preset data, and the unmasked data is left unchanged, thereby obtaining masked video data and masked audio data. When the generated random number does not correspond to both the sample audio data and the sample video data, the sample audio data and the sample video data are kept unchanged, namely the sample audio data and the sample video data are respectively determined to be the masked audio data and the masked video data.
And 203, performing fusion coding on the masked video data and the masked audio data by using a fusion coding network of the initial speech recognition model to be trained to obtain fusion coding data.
In this embodiment, the electronic device may perform fusion encoding on the masked video data and the masked audio data by using a fusion encoding network of the initial speech recognition model to be trained, to obtain fusion encoded data.
The fusion coding network can extract video feature data and audio feature data from the masked video data and the masked audio data respectively, then fuse the video feature data and the audio feature data to obtain fusion feature data, and then code the fusion feature data to obtain coding feature data arranged in time sequence.
The video feature data and the audio feature data may be extracted using existing video and audio feature extraction methods; for example, audio feature extraction and video feature extraction may be performed with a neural network, which may include, but is not limited to, an RNN (recurrent neural network), LSTM (Long Short-Term Memory), UNet (U-shaped network), complex UNet, Transformer, and the like.
The video feature data and the audio feature data may be fused using, for example, a concat feature fusion method, an element-wise add feature fusion method, an attention feature fusion method, or the like.
The fusion feature data may be encoded using a neural network of a Conformer, Transformer, or similar structure.
It should be appreciated that the above-described network for extracting video feature data and audio feature data, the network for fusing video feature data and audio feature data, and the network for encoding the fused feature data may be integrated into a fused encoding network, and the functions of these networks may be performed by the fused encoding network.
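As a rough illustration only (the disclosure does not prescribe any framework, and PyTorch, the module names, and the feature dimensions below are all assumptions), a fusion encoding network chaining the three stages described above might be sketched as follows:

```python
import torch
import torch.nn as nn


class FusionEncodingNetwork(nn.Module):
    """Hypothetical sketch: extract per-modality features, fuse them, then encode the fused sequence."""

    def __init__(self, video_dim=512, audio_dim=80, hidden_dim=256):
        super().__init__()
        # Per-modality feature extractors (simple GRUs stand in for the CNN/Conformer front ends).
        self.video_encoder = nn.GRU(video_dim, hidden_dim, batch_first=True)
        self.audio_encoder = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        # Concat-style fusion followed by dimension reduction through a fully connected layer.
        self.fusion_fc = nn.Linear(2 * hidden_dim, hidden_dim)
        # Encoder over the fused, time-ordered features (a Transformer stands in for Conformer).
        self.fusion_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, masked_video, masked_audio):
        # The sketch assumes both streams are already aligned to the same number of time steps T.
        v, _ = self.video_encoder(masked_video)            # (B, T, H)
        a, _ = self.audio_encoder(masked_audio)            # (B, T, H)
        fused = self.fusion_fc(torch.cat([v, a], dim=-1))  # fusion feature data, (B, T, H)
        return self.fusion_encoder(fused)                  # fusion encoded data, (B, T, H)
```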
And 204, decoding the fusion encoded data by utilizing a decoding network of the initial speech recognition model to obtain speech prediction data.
In this embodiment, the electronic device may decode the fusion encoded data using the decoding network of the initial speech recognition model to obtain the speech prediction data.
The decoding network implements a decoding algorithm; it may include a Transformer network and may further include a CTC (Connectionist Temporal Classification) decoder implementing a CTC decoding algorithm. The obtained speech prediction data represents the probability that the predicted speech tag sequence output by the speech recognition model is consistent with the preset speech tag sequence.
Step 205, determining a loss value representing an error between the speech prediction data and the predetermined speech tag sequence based on the predetermined loss function and the speech prediction data.
In this embodiment, the electronic device may determine a loss value representing an error between the voice prediction data and the preset voice tag sequence based on the preset loss function and the voice prediction data.
The preset voice tag sequence is a real voice tag sequence marked in advance. Since the speech prediction data represents the probability that the predicted speech tag sequence output by the speech recognition model matches the actual speech tag sequence, the training goal of the model may be to maximize that probability. As an example, when the decoding network includes a CTC decoder, the loss function may be represented by the following formula (1):
L = -log p_ctc(y | X_e)    (1)
where X_e represents the fusion encoded data, y represents the output predicted speech tag sequence, and p_ctc(y | X_e) represents the probability that y is consistent with the preset speech tag sequence when the fusion encoded data input to the CTC decoder is X_e. The loss takes the negative logarithm of this probability value, and minimizing L during training is equivalent to maximizing the probability value.
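For illustration, the negative log-probability in formula (1) is the standard CTC objective; a minimal sketch using PyTorch's built-in CTC loss (an assumed implementation choice, with toy shapes) could look like this:

```python
import torch
import torch.nn as nn

# Toy shapes: T time steps, B utterances, C speech tags (index 0 reserved for the CTC blank).
T, B, C = 50, 2, 30
# Log-probabilities produced by the CTC decoder from the fusion encoded data X_e (random here).
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 10))                 # preset speech tag sequences y (toy data)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

# nn.CTCLoss already returns -log p_ctc(y | X_e), i.e. the L of formula (1).
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```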
And step 206, adjusting parameters of the initial speech recognition model based on the loss value to obtain an adjusted speech recognition model.
In this embodiment, the electronic device may adjust parameters of the initial speech recognition model based on the loss value, and obtain an adjusted speech recognition model.
Specifically, a gradient descent method and a back propagation method may be adopted to iteratively update the parameters of the initial speech recognition model so as to gradually reduce the loss value. Typically, steps 201-206 are performed once per iteration, and the model is iteratively trained with multiple groups of training samples (each including sample video data and sample audio data). The adjusted speech recognition model obtained after each iteration is used as the initial speech recognition model for the next iteration, and the iterations are repeated until the adjusted speech recognition model meets the preset training ending condition.
In step 207, in response to determining that the adjusted speech recognition model meets a preset training end condition, the adjusted speech recognition model is determined to be a pre-trained speech recognition model.
In this embodiment, the electronic device may determine the adjusted speech recognition model as the pre-trained speech recognition model in response to determining that the adjusted speech recognition model satisfies a preset training end condition.
Specifically, the steps 201-206 are repeatedly performed by using multiple sets of training samples, and whether the current model meets the training ending condition is determined after each training. And when the training ending condition is met, the model after the parameters are currently adjusted is the pre-trained voice recognition model. Wherein the training end condition may include, but is not limited to, at least one of: the loss value of the loss function converges, the training time exceeds the preset duration, and the training times exceed the preset times.
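Purely as a schematic of steps 201-206 and the training-end check (the optimizer, learning rate, and the `mask_fn` and `loss_fn` callables are assumptions, not details given by the disclosure), the iteration might be organized as:

```python
import torch


def train(model, data_loader, loss_fn, mask_fn, max_epochs=50, target_loss=0.1):
    """Hypothetical loop: repeat steps 201-206 until a preset training-end condition holds."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(max_epochs):                          # end condition: training count limit
        for sample_video, sample_audio, tag_seq in data_loader:
            masked_video, masked_audio = mask_fn(sample_video, sample_audio)  # steps 201-202
            prediction = model(masked_video, masked_audio)   # steps 203-204: fusion encoding + decoding
            loss = loss_fn(prediction, tag_seq)              # step 205: loss value
            optimizer.zero_grad()
            loss.backward()                                  # back propagation
            optimizer.step()                                 # gradient-descent parameter update (step 206)
        if loss.item() < target_loss:                        # end condition: loss has converged (simplified)
            break
    return model
```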
According to the method provided by the embodiment of the disclosure, the masked video data and the masked audio data are obtained by generating random numbers and carrying out data masking processing on the sample video data and the sample audio data based on the random numbers, then fusion encoding is carried out on the masked video data and the masked audio data by utilizing an initial speech recognition model to obtain fusion encoded data, then the fusion encoded data is decoded to obtain speech prediction data, then a loss value representing an error between the speech prediction data and a preset speech tag sequence is determined based on a preset loss function, and finally parameters of the initial speech recognition model are adjusted based on the loss value to obtain a pre-trained speech recognition model. According to the embodiment of the disclosure, random factors are introduced into the multi-mode training sample data in the training stage of the voice recognition model, so that the data volume of the sample video data and the sample audio data is no longer balanced, the capability of the model for processing unbalanced multi-mode data can be improved, the trained voice recognition model can adapt to various noise scenes, and the recognition accuracy of the voice recognition model is improved.
In some alternative implementations, as shown in fig. 3, step 202 includes:
In response to determining that the random number is within the first preset interval, the sample video data is reset based on the first preset value, resulting in masked video data, and the sample audio data is determined to be masked audio data, step 2021.
In response to determining that the random number is within the second preset interval, the sample audio data is reset based on the second preset value, resulting in masked audio data, and the sample video data is determined to be masked video data, step 2022.
In response to determining that the random number is within the third preset interval, the sample video data is determined to be masked video data and the sample audio data is determined to be masked audio data, step 2023.
The first preset interval, the second preset interval and the third preset interval can be set arbitrarily. For example, the total interval where the random number is located is [0,1], the first preset interval is (0.75,1), the second preset interval is (0.5, 0.75), and the third preset interval is [0,0.5], and the first preset value and the second preset value may be the same or different, for example, the first preset value and the second preset value are both 0.
The sample video data is reset, i.e. each value in the sample video data is set to a first preset value, e.g. the elements in the matrix or vector comprised by the sample video data are set to 0. Similarly, the sample audio data is reset, i.e. each value in the sample audio data is set to a second preset value, e.g. the elements in the matrix or vector comprised by the sample audio data are set to 0.
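A minimal sketch of the interval-based masking (using the example intervals and a preset value of 0; PyTorch tensors are an assumption) might be:

```python
import torch


def mask_by_random_number(sample_video, sample_audio, first_value=0.0, second_value=0.0):
    """Hypothetical sketch of steps 2021-2023 with the example intervals:
    (0.75, 1] -> reset video, (0.5, 0.75] -> reset audio, [0, 0.5] -> keep both."""
    r = torch.rand(1).item()            # random number in the preset interval [0, 1]
    if r > 0.75:                        # first preset interval: reset the sample video data
        masked_video = torch.full_like(sample_video, first_value)
        masked_audio = sample_audio
    elif r > 0.5:                       # second preset interval: reset the sample audio data
        masked_video = sample_video
        masked_audio = torch.full_like(sample_audio, second_value)
    else:                               # third preset interval: keep both modalities unchanged
        masked_video, masked_audio = sample_video, sample_audio
    return masked_video, masked_audio
```

A helper of this kind could be passed as the masking step of a training loop such as the one sketched above.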
According to this embodiment, by judging the interval in which the random number falls and either resetting the sample video data or the sample audio data or keeping both unchanged, the probability with which the sample video data or the sample audio data is reset can be set flexibly. This makes it convenient to introduce random factors into the training process of the model, better simulates the data actually input to the speech recognition model in a real recognition scene, and improves the accuracy of speech recognition in complex noise scenes.
In some alternative implementations, as shown in fig. 4, step 203 includes:
step 2031, coding the masked video data and the masked audio data by using the video coding sub-network and the audio coding sub-network of the fusion coding network to obtain video feature data to be fused and audio feature data to be fused.
The video encoding sub-network and the audio encoding sub-network may employ a neural network with a structure such as Conformer or Transformer to encode the masked video data and the masked audio data. The obtained video feature data to be fused may be several groups of feature data arranged in time, and each group of feature data may represent changes in features such as the action and appearance of a target part (e.g., lips or face) of a photographed target object (e.g., a person). The obtained audio feature data to be fused may likewise be several groups of feature data arranged in time, and each group of feature data may represent the pronunciation features of one audio frame.
Step 2032, fusing the video feature data to be fused and the audio feature data to be fused by using a feature fusion sub-network of the fusion coding network to obtain fusion feature data.
Specifically, the feature fusion sub-network may be constructed based on various feature fusion methods. For example, the feature fusion sub-network may include a LayerNorm layer that normalizes the video feature data to be fused and the audio feature data to be fused; the normalized video and audio feature data are then spliced, and the spliced feature data is finally reduced in dimension through a fully connected layer to obtain the fusion feature data.
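For illustration, a LayerNorm-plus-concatenation fusion sub-network of the kind described here could be sketched as follows (PyTorch and the 256-dimensional features are assumptions):

```python
import torch
import torch.nn as nn


class FeatureFusionSubNetwork(nn.Module):
    """Hypothetical sketch of step 2032: normalize, concatenate, then reduce dimension."""

    def __init__(self, feature_dim=256):
        super().__init__()
        self.video_norm = nn.LayerNorm(feature_dim)
        self.audio_norm = nn.LayerNorm(feature_dim)
        self.reduce = nn.Linear(2 * feature_dim, feature_dim)  # fully connected dimension reduction

    def forward(self, video_features, audio_features):
        v = self.video_norm(video_features)      # (B, T, D)
        a = self.audio_norm(audio_features)      # (B, T, D)
        fused = torch.cat([v, a], dim=-1)        # splice the normalized features
        return self.reduce(fused)                # fusion feature data, (B, T, D)
```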
Step 2033, coding the fusion characteristic data by using the fusion characteristic coding sub-network of the fusion coding network to obtain fusion coding data.
The fusion feature encoding sub-network may encode the fusion feature data with a neural network of a Conformer, Transformer, or similar structure to obtain the fusion encoded data.
According to this embodiment, by arranging the video encoding sub-network, the audio encoding sub-network, the feature fusion sub-network, and the fusion feature encoding sub-network in the fusion encoding network, the fusion encoding network can be refined, its structure can be constructed more conveniently, and the speech recognition model can be trained more efficiently.
In some alternative implementations, as shown in fig. 5, step 2031 includes:
step 20311, performing feature extraction on the masked video data by using a video feature extraction layer of the video coding sub-network to obtain basic video feature data.
Specifically, the video feature extraction layer may extract, as the base video feature data, data representing features such as a position, an outline, a size, and the like of the target portion from each video frame included in the masked video data. As an example, the video feature extraction layer may be constructed from CNNs, which may convolve, pool, fully connect, etc., video frames input thereto to obtain base video feature data.
In step 20312, the basic video feature data is encoded by using the video feature encoding layer of the video encoding sub-network, so as to obtain the video feature data to be fused.
Specifically, the video feature encoding layer may encode the basic video feature data, and the obtained video feature data to be fused may represent changes in the position, shape, size, and the like of the target part at different moments, that is, the action state of the target part. Alternatively, the video feature encoding layer may be constructed from a neural network with a structure such as RNN, LSTM, Transformer, or Conformer. As an example, when the target part is a lip, the video feature encoding layer may be constructed following a construction method of a neural network for lip recognition, so that the video feature encoding layer encodes the basic video feature data according to the lip recognition method.
Step 20313, performing feature extraction on the masked audio data by using the audio feature extraction layer of the audio coding sub-network to obtain basic audio feature data.
Wherein the underlying audio feature data is an acoustic feature representing the masked audio data. As examples, acoustic features are Filter-Bank (FBank), mel-frequency cepstral coefficient (Mel-frequency Cepstral Coefficient, MFCC), and the like. The audio feature extraction layer may be constructed based on algorithms for extracting acoustic features of FBank, MFCC, etc.
Step 20314, encoding the basic audio feature data by using the audio feature encoding layer of the audio encoding sub-network to obtain audio feature data to be fused.
Specifically, the audio feature encoding layer may encode the basic audio feature data, and the obtained audio feature data to be fused may be a sequence formed by the encoded data corresponding to each audio frame included in the sample audio data. Alternatively, the audio feature encoding layer may be constructed from a neural network with a structure such as RNN, LSTM, Transformer, or Conformer.
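As a sketch of the layered structure described in steps 20311-20314 (shown for the video branch only; the audio branch would pair an FBank/MFCC front end with a similar encoding layer), assuming PyTorch, grayscale frames, and illustrative dimensions:

```python
import torch
import torch.nn as nn


class VideoEncodingSubNetwork(nn.Module):
    """Hypothetical sketch: a CNN feature extraction layer followed by a sequence-encoding layer."""

    def __init__(self, out_dim=256):
        super().__init__()
        # Video feature extraction layer: a small CNN applied to each grayscale frame.
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(32 * 4 * 4, out_dim))
        # Video feature encoding layer: an LSTM stands in for RNN/Transformer/Conformer.
        self.encoder = nn.LSTM(out_dim, out_dim, batch_first=True)

    def forward(self, masked_video):                     # (B, T, 1, H, W)
        B, T = masked_video.shape[:2]
        frames = masked_video.flatten(0, 1)              # (B*T, 1, H, W)
        basic = self.frame_cnn(frames).view(B, T, -1)    # basic video feature data
        encoded, _ = self.encoder(basic)                 # video feature data to be fused
        return encoded
```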
According to the embodiment, the video characteristic extraction layer and the video characteristic coding layer are arranged in the video coding sub-network, and the audio characteristic extraction layer and the audio characteristic coding layer are arranged in the audio coding sub-network, so that further refinement of the video coding sub-network and the audio coding sub-network can be realized, the structure of the video coding sub-network and the audio coding sub-network can be constructed more conveniently, and further the voice recognition model can be trained more efficiently.
In some alternative implementations, as shown in fig. 6, step 204 includes:
step 2041, decoding the fusion encoded data according to a first arrangement sequence of a preset voice tag sequence by using a first decoding sub-network of the decoding network to obtain first voice prediction data.
The first arrangement sequence may be a forward time sequence, the first decoding sub-network uses the voice tag sequence as a labeled expected recognition result, decodes the input fusion encoded data according to the forward time sequence, and the obtained first voice prediction data may reflect a result of predicting the input video data and the audio data according to the forward time sequence by the voice recognition model.
Step 2042, decoding the fusion encoded data according to a second arrangement sequence of the preset voice tag sequence by using a second decoding sub-network of the decoding network to obtain second voice prediction data.
The second arrangement sequence may be a reverse time sequence, the second decoding sub-network uses the voice tag sequence as a labeled expected recognition result, decodes the input fusion encoded data according to the reverse time sequence, and the obtained second voice prediction data may reflect a result of predicting the input video data and the audio data according to the reverse time sequence by the voice recognition model.
The network structures of the first decoding sub-network and the second decoding sub-network may be the same (for example, both are Transformer structures) or different (for example, one is a Transformer structure and the other is another structure such as Conformer). Because the two sub-networks predict against opposite orderings of the speech tag sequence, the trained network parameters of the first decoding sub-network and the second decoding sub-network are different.
And 2043, decoding the fusion encoded data by using a third decoding sub-network of the decoding network to obtain third voice prediction data.
Wherein the third decoding sub-network is a sub-network having a different structure from the first decoding sub-network and the second decoding sub-network, and an algorithm for decoding by the third decoding sub-network is different from the first decoding sub-network and the second decoding sub-network. As an example, the first decoding sub-network and the second decoding sub-network may be networks of a Transformer structure, and the third decoding sub-network is a network (also referred to as CTC decoder) running a CTC decoding algorithm.
The first voice prediction data, the second voice prediction data and the third voice prediction data all represent probabilities that the corresponding predicted voice tag sequences are consistent with the preset voice tag sequences.
Existing speech recognition methods generally provide only a decoder that decodes the speech tag sequence in forward temporal order. Here, the first decoding sub-network and the second decoding sub-network each take the speech tag sequence as the expected recognition result and decode according to the forward and reverse arrangement orders of the speech tag sequence, respectively, so that the fusion encoded data is predicted in both forward temporal order and reverse temporal order; combined with the third speech prediction data obtained by the third decoding sub-network, the input data is thus analyzed from more aspects.
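A highly simplified sketch of such a decoding network (omitting causal masks, start/end tokens, and teacher-forcing shifts; PyTorch, the layer sizes, and the vocabulary size are assumptions) might be:

```python
import torch
import torch.nn as nn


class DecodingNetwork(nn.Module):
    """Hypothetical sketch: a forward attention decoder, a reverse attention decoder,
    and a CTC head all consume the same fusion encoded data X_e."""

    def __init__(self, hidden_dim=256, vocab_size=30):
        super().__init__()
        self.forward_decoder = nn.TransformerDecoder(                       # first decoding sub-network
            nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=4, batch_first=True), num_layers=2)
        self.reverse_decoder = nn.TransformerDecoder(                       # second decoding sub-network
            nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=4, batch_first=True), num_layers=2)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.ctc_head = nn.Linear(hidden_dim, vocab_size)                   # third decoding sub-network

    def forward(self, x_e, tag_seq):
        fwd_in = self.embed(tag_seq)                          # tag sequence in forward order
        rev_in = self.embed(tag_seq.flip(dims=[1]))           # tag sequence in reverse order
        first = self.out(self.forward_decoder(fwd_in, x_e))   # first speech prediction data
        second = self.out(self.reverse_decoder(rev_in, x_e))  # second speech prediction data
        third = self.ctc_head(x_e).log_softmax(-1)            # third (CTC) speech prediction data
        return first, second, third
```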
In some alternative implementations, based on the corresponding embodiment of fig. 6, as shown in fig. 7, step 205 includes:
step 2051, based on a predetermined loss function, determines a first loss value representing an error between the first speech prediction data and the speech tag sequence, a second loss value representing an error between the second speech prediction data and the speech tag sequence, and a third loss value representing an error between the third speech prediction data and the speech tag sequence.
Specifically, the first loss function may be represented by the following formula (2):
L1 = -log p_l(y | X_e) = -log ∏_{l=1}^{L} p(y_l | y_{1:l-1}, X_e)    (2)
where X_e represents the fusion encoded data, y represents the preset speech tag sequence (i.e., the pre-labeled real speech tag sequence, which may be arranged in forward temporal order), L represents the total length of the preset speech tag sequence, l represents the serial number of a speech tag, y_l represents the l-th speech tag in the preset speech tag sequence, and y_{1:l-1} represents the 1st to (l-1)-th speech tags in the preset speech tag sequence. p(y_l | y_{1:l-1}, X_e) represents the probability that the l-th predicted speech tag is consistent with the l-th real speech tag, given that the fusion encoded data is X_e and the first l-1 predicted speech tags are consistent with the first l-1 real speech tags. p_l(y | X_e) is therefore the product of the probabilities of the predicted speech tags obtained by predicting on the fusion encoded data in forward temporal order.
The second loss function may be represented by the following formula (3):
L2 = -log p_r(y | X_e) = -log ∏_{l=L}^{1} p(y_l | y_{L:l+1}, X_e)    (3)
where y represents the preset speech tag sequence, and p_r(y | X_e) represents the product of the probabilities of the predicted speech tags obtained by predicting on the fusion encoded data in reverse temporal order, given that the fusion encoded data is X_e. y_{L:l+1} represents the L-th to (l+1)-th speech tags in the preset speech tag sequence, and p(y_l | y_{L:l+1}, X_e) represents the probability that the l-th predicted speech tag is consistent with the l-th real speech tag, given that the fusion encoded data is X_e and the L-th to (l+1)-th predicted speech tags are consistent with the L-th to (l+1)-th real speech tags. It should be understood that the meaning of each parameter in formula (3) is the same as that of the corresponding parameter in formula (2), i.e., the probabilities are computed with the same real speech tag sequence as the expected result, except that the order in which the probability of each predicted speech tag is computed is reversed.
The third loss function may be set according to the type of the third decoding sub-network. For example, when the third decoding sub-network is a CTC decoder, the third loss function may be represented by the above formula (1), and the details of the description of the corresponding embodiment of fig. 2 may be referred to, which will not be described herein.
Step 2052, determining a loss value of the loss function based on the first loss value, the second loss value, and the third loss value.
Specifically, the first loss value, the second loss value, and the third loss value may be added to obtain the loss value. Alternatively, the first loss value, the second loss value, and the third loss value may be weighted and summed based on a preset weight to obtain the loss value.
As an example, the above-described loss function may be represented by the following formula (4):
L = L1 + L2 + L3 = -α log p_l(y | X_e) - β log p_r(y | X_e) - (1 - α - β) log p_ctc(y | X_e)    (4)
wherein L1, L2, L3 are used to calculate a first loss value, a second loss value, and a third loss value, respectively, and α and β are preset weights.
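For instance, the weighted combination in formula (4) is only a few lines of code (the weight values below are illustrative, not values given by the disclosure):

```python
def combined_loss(loss_forward, loss_reverse, loss_ctc, alpha=0.3, beta=0.3):
    """Hypothetical sketch of formula (4): weighted sum of the first, second, and third loss values."""
    return alpha * loss_forward + beta * loss_reverse + (1.0 - alpha - beta) * loss_ctc
```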
According to the embodiment, the first loss value, the second loss value and the third loss value are calculated by using the first voice prediction data, the second voice prediction data and the third voice prediction data respectively, and the loss value used for representing the error between the voice prediction data and the preset voice tag sequence is calculated based on the first loss value, the second loss value and the third loss value, so that the error can be accurately determined according to the recognition results in more aspects during model training, and the prediction precision of the trained model is improved.
Fig. 8 is a flow chart of a speech recognition method provided by an exemplary embodiment of the present disclosure. The present embodiment is applicable to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 8, the method includes the following steps:
step 801, obtaining video data to be identified and audio data to be identified.
In this embodiment, the electronic device may acquire the video data to be identified and the audio data to be identified. The video data to be identified and the audio data to be identified may be original video data and original audio data acquired by the video acquisition device 104 and the audio acquisition device 105 in the target space as shown in fig. 1, or may be generated video data and audio data. The target space may be various types of spaces, such as a vehicle interior space, a room interior space, and the like.
Step 802, fusion encoding is performed on the video data to be identified and the audio data to be identified by using a fusion encoding network of the pre-trained voice recognition model, so as to obtain fusion encoded data.
In this embodiment, the electronic device may perform fusion encoding on the video data to be identified and the audio data to be identified by using a fusion encoding network of the pre-trained speech recognition model, so as to obtain fusion encoded data.
The pre-trained speech recognition model may be previously trained according to the method described in any of the embodiments of fig. 2-7. The structure and function of the fusion coding network in this embodiment are the same as those of the fusion coding network described in any of the embodiments of fig. 2 to 7, and are not described here again.
Step 803, decoding the fusion encoded data by using the decoding network of the speech recognition model to obtain speech prediction data.
In this embodiment, the electronic device may decode the fusion encoded data using a decoding network of the speech recognition model to obtain speech prediction data.
The decoding network in this embodiment has the same structure and function as the decoding network described in any of the embodiments of fig. 2 to 7, and is not described here again.
In step 804, speech recognition text is generated based on the speech prediction data.
In this embodiment, the electronic device may generate the speech recognition text based on the speech prediction data.
Specifically, the speech prediction data may include a probability for each predicted speech tag in the predicted speech tag sequence, i.e., for each pronunciation tag representing a syllable or phoneme of a segment of speech. In general, the decoding network may output a plurality of speech prediction data, each corresponding to one predicted speech tag sequence. The electronic device may determine the predicted speech tag sequence with the maximum probability (which may be the product of the probabilities of the speech tags in that predicted speech tag sequence) as the recognized speech tag sequence, and then convert that speech tag sequence into the speech recognition text.
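As an illustration of this selection step (the hypothesis format and the `tag_to_text` mapping are hypothetical; a real system might instead rely on beam-search scores), a sketch could be:

```python
def pick_recognition_text(hypotheses, tag_to_text):
    """Hypothetical sketch: choose the predicted speech tag sequence whose per-tag
    probabilities have the largest product, then map the tags to text."""
    best_tags, best_score = None, float("-inf")
    for tags, tag_probs in hypotheses:        # each hypothesis: (tag sequence, per-tag probabilities)
        score = 1.0
        for p in tag_probs:
            score *= p                        # product of the probabilities of each speech tag
        if score > best_score:
            best_tags, best_score = tags, score
    return "".join(tag_to_text[t] for t in best_tags)
```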
Based on the voice recognition method provided by the embodiment of the disclosure, the pre-trained voice recognition model for carrying out multi-modal voice recognition is used for recognizing the video data to be recognized and the audio data to be recognized, so that the capability of the voice recognition model for processing unbalanced multi-modal data is effectively utilized, and the accuracy of carrying out voice recognition under different noise scenes is improved.
In some alternative implementations, as shown in fig. 9, step 801 includes:
in step 8011, data to be identified is obtained, where the data to be identified includes initial video data and/or initial audio data.
The initial video data may be video data collected by the video collecting device 104 shown in fig. 1, and the initial audio data may be audio data collected by the audio collecting device 105 shown in fig. 1.
In response to determining that the data to be identified includes initial video data and initial audio data, the initial video data is determined to be the video data to be identified and the initial audio data is determined to be the audio data to be identified, step 8012.
In step 8013, in response to determining that the data to be identified includes the initial audio data and does not include the initial video data, the initial audio data is determined to be the audio data to be identified, and the video data to be identified is generated based on the first preset value.
In step 8014, in response to determining that the data to be identified includes the initial video data and does not include the initial audio data, the initial video data is determined to be the video data to be identified, and the audio data to be identified is generated based on the second preset value.
The first preset value and the second preset value may be the same or different, for example, the first preset value and the second preset value are both 0. In general, video data to be recognized, which is composed of a corresponding dimension and a corresponding number of first preset values, may be generated according to the dimension and the data amount of video data that can be processed by the speech recognition model. And generating the audio data to be recognized, which is composed of the corresponding dimension and the corresponding number of second preset values, according to the dimension and the data quantity of the audio data which can be processed by the voice recognition model.
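A minimal sketch of this completion step (the tensor shapes are purely illustrative assumptions; in practice they would match the dimensions the speech recognition model was trained on):

```python
import torch


def complete_modalities(initial_video=None, initial_audio=None,
                        video_shape=(1, 75, 1, 96, 96), audio_shape=(1, 300, 80),
                        first_value=0.0, second_value=0.0):
    """Hypothetical sketch of steps 8012-8014: if one modality is missing,
    fill it with a preset value so the model always receives both inputs."""
    video = initial_video if initial_video is not None else torch.full(video_shape, first_value)
    audio = initial_audio if initial_audio is not None else torch.full(audio_shape, second_value)
    return video, audio
```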
According to this embodiment, when the audio data to be recognized or the video data to be recognized is missing, the missing audio or video data is generated so that the data types and data amounts input to the speech recognition model are consistent with those used during training. The model can therefore still recognize single-modality audio or video data with the multi-modal speech recognition method, which expands the application scenarios of the speech recognition method and improves the accuracy of speech recognition using the model in different scenes.
In some alternative implementations, step 803 may be performed as follows:
First, the fusion encoded data is decoded according to a first arrangement sequence of the preset voice tag sequence by using a first decoding sub-network of the decoding network to obtain first voice prediction data.
Then, the fusion encoded data is decoded according to a second arrangement sequence of the preset voice tag sequence by using a second decoding sub-network of the decoding network to obtain second voice prediction data.
It should be understood that the first decoding sub-network and the second decoding sub-network in this embodiment have the same structures and functions as those of the first decoding sub-network and the second decoding sub-network described in the above-described embodiment corresponding to fig. 6, and are not described herein.
In this embodiment, when the speech recognition model is used for speech recognition, the first decoding sub-network and the second decoding sub-network each decode the fusion encoded data. Because these two sub-networks were trained on the forward and reverse arrangement orders of the real voice tag sequence, respectively, they can predict from the fusion encoded data in both forward and reverse temporal order, that is, analyze the input video data and audio data from more perspectives, which further improves speech recognition accuracy.
In some alternative implementations, as shown in fig. 10, step 804 includes:
step 8041, decoding the fusion encoded data by using a third decoding sub-network of the decoding network to obtain at least two predicted voice tag sequences.
The third decoding sub-network has a structure different from that of the first decoding sub-network and the second decoding sub-network, and its decoding algorithm also differs from theirs. As an example, the first and second decoding sub-networks may be networks with a Transformer structure, while the third decoding sub-network is a network running a CTC decoding algorithm (also referred to as a CTC decoder). When decoding the fusion encoded data, the third decoding sub-network can generate at least two predicted voice tag sequences (also referred to as an N-best list).
Step 8042, determining a score for each of the at least two predicted voice tag sequences based on the first voice prediction data and the second voice prediction data.
The first decoding sub-network and the second decoding sub-network may each score the at least two predicted voice tag sequences; the first voice prediction data and the second voice prediction data are thus two groups of scores obtained by scoring the predicted voice tag sequences in the two arrangement directions.
For a given predicted voice tag sequence, when the first decoding sub-network decodes the fusion encoded data to obtain the first voice prediction data, it may generate a confidence for each predicted voice tag in the sequence. A first score for that sequence under the first decoding sub-network can then be derived from these confidences, for example by taking their product. Similarly, a second score can be derived for the same sequence under the second decoding sub-network. The overall score of the predicted voice tag sequence may be obtained by adding the first score and the second score, by averaging them, or the like.
Step 8043 selects a target predicted voice tag sequence from the at least two predicted voice tag sequences based on the score obtained for each predicted voice tag sequence.
In general, the highest scoring predicted voice tag sequence may be determined as the target predicted voice tag sequence.
Step 8044 generates speech recognition text based on the target predicted speech tag sequence.
Each predicted voice tag in the target predicted voice tag sequence is a pronunciation mark representing a syllable or phoneme of a segment of speech, and the target predicted voice tag sequence can be converted into the speech recognition text according to these pronunciation marks.
In this embodiment, at least two predicted voice tag sequences are generated by the third decoding sub-network, and the target predicted voice tag sequence is then selected from them based on the first voice prediction data and the second voice prediction data. Generating the candidate sequences and scoring each of them are thus separated operations, which facilitates efficient scoring of every predicted voice tag sequence while yielding highly accurate speech recognition text from the scores.
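A sketch of this selection logic (Python, illustrative only): it assumes the CTC-style third decoding sub-network has already produced an N-best list of label-id sequences, and that a scoring helper returns a log-probability for a hypothesis under the first (forward-order) or second (reverse-order) decoding sub-network; the helper and the equal weighting are assumptions, not details fixed by the disclosure.

    def rescore_nbest(nbest, fusion_encoded, forward_decoder, backward_decoder,
                      score_fn, weight=0.5):
        # nbest: list of candidate label-id sequences from the third decoding sub-network.
        # score_fn(decoder, fusion_encoded, labels) -> log-probability of the sequence
        # under that decoder (teacher-forced); its implementation is assumed here.
        best_seq, best_score = None, float("-inf")
        for labels in nbest:
            s_forward = score_fn(forward_decoder, fusion_encoded, labels)
            s_backward = score_fn(backward_decoder, fusion_encoded, list(reversed(labels)))
            # Combine the two scores, e.g. a weighted sum (an average when weight = 0.5).
            score = weight * s_forward + (1.0 - weight) * s_backward
            if score > best_score:
                best_score, best_seq = score, labels
        return best_seq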
Exemplary apparatus
Fig. 11 is a schematic structural diagram of a training device for a speech recognition model according to an exemplary embodiment of the present disclosure. The present embodiment may be applied to an electronic device, as shown in fig. 11, where the training apparatus for a speech recognition model includes: a first generation module 1101, configured to generate a random number in a preset value interval; a masking module 1102, configured to perform data masking processing on the sample video data and the sample audio data based on the random number, to obtain masked video data and masked audio data; the first fusion module 1103 is configured to fusion-encode the masked video data and the masked audio data by using a fusion encoding network of an initial speech recognition model to be trained, so as to obtain fusion encoded data; a first decoding module 1104, configured to decode the fusion encoded data using a decoding network of the initial speech recognition model to obtain speech prediction data; a first determining module 1105, configured to determine a loss value representing an error between the speech prediction data and a preset speech tag sequence based on a preset loss function and the speech prediction data; an adjustment module 1106, configured to adjust parameters of the initial speech recognition model based on the loss value, to obtain an adjusted speech recognition model; a second determining module 1107 is configured to determine the adjusted speech recognition model as a pre-trained speech recognition model in response to determining that the adjusted speech recognition model meets a preset training end condition.
Since the training device of the speech recognition model provided in this embodiment corresponds to the training method of the speech recognition model provided in the embodiment shown in fig. 2, the functions of each module in this embodiment correspond to steps 201 to 207 in the embodiment shown in fig. 2 one by one, and are not described here again.
Referring to fig. 12, fig. 12 is a schematic structural view of a training apparatus for a speech recognition model according to another exemplary embodiment of the present disclosure.
In some alternative implementations, masking module 1102 includes: a first masking unit 11021, configured to reset the sample video data based on the first preset value in response to determining that the random number is in the first preset interval, obtain masked video data, and determine the sample audio data as masked audio data; a second masking unit 11022, configured to reset the sample audio data based on the second preset value in response to determining that the random number is in the second preset interval, obtain masked audio data, and determine the sample video data as masked video data; the third masking unit 11023 is configured to determine the sample video data as masked video data and the sample audio data as masked audio data in response to determining that the random number is in a third preset interval.
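A minimal sketch of this masking logic (Python/PyTorch, illustrative only): the preset value interval is taken as [0, 1), and the sub-interval boundaries and zero preset values are assumptions.

    import random
    import torch

    def mask_modalities(sample_video, sample_audio,
                        p_mask_video=0.25, p_mask_audio=0.25,
                        first_preset=0.0, second_preset=0.0):
        r = random.random()  # random number in the preset value interval [0, 1)
        if r < p_mask_video:
            # First preset interval: reset the sample video data, keep the audio.
            return torch.full_like(sample_video, first_preset), sample_audio
        if r < p_mask_video + p_mask_audio:
            # Second preset interval: reset the sample audio data, keep the video.
            return sample_video, torch.full_like(sample_audio, second_preset)
        # Third preset interval: keep both modalities unmasked.
        return sample_video, sample_audio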
In some alternative implementations, the first fusion module 1103 includes: a first coding unit 11031, configured to encode the masked video data and the masked audio data by using a video encoding sub-network and an audio encoding sub-network of the fusion encoding network, to obtain video feature data to be fused and audio feature data to be fused; a fusion unit 11032, configured to fuse the video feature data to be fused and the audio feature data to be fused by using a feature fusion sub-network of the fusion encoding network, so as to obtain fusion feature data; and the second coding unit 11033 is configured to encode the fused feature data by using a fused feature encoding sub-network of the fused encoding network to obtain fused encoded data.
In some alternative implementations, the first encoding unit 11031 includes: a first extraction subunit 110311, configured to perform feature extraction on the masked video data by using a video feature extraction layer of the video coding sub-network, so as to obtain basic video feature data; the first coding subunit 110312 is configured to encode the basic video feature data by using a video feature encoding layer of the video coding sub-network to obtain video feature data to be fused; a second extraction subunit 110313, configured to perform feature extraction on the masked audio data by using an audio feature extraction layer of the audio coding sub-network to obtain basic audio feature data; the second encoding subunit 110314 is configured to encode the basic audio feature data by using the audio feature encoding layer of the audio encoding sub-network to obtain audio feature data to be fused.
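A compact sketch of such a fusion encoding network (Python/PyTorch, illustrative only): the concrete layer types, dimensions, and the concatenation-based fusion are assumptions standing in for the video/audio coding sub-networks, the feature fusion sub-network, and the fused feature encoding sub-network.

    import torch
    from torch import nn

    class FusionEncoder(nn.Module):
        def __init__(self, video_dim, audio_dim, hidden=256):
            super().__init__()
            # Video coding sub-network: feature extraction layer + feature encoding layer.
            self.video_extract = nn.Linear(video_dim, hidden)
            self.video_encode = nn.GRU(hidden, hidden, batch_first=True)
            # Audio coding sub-network: feature extraction layer + feature encoding layer.
            self.audio_extract = nn.Linear(audio_dim, hidden)
            self.audio_encode = nn.GRU(hidden, hidden, batch_first=True)
            # Feature fusion sub-network (here: concatenate and project).
            self.fuse = nn.Linear(2 * hidden, hidden)
            # Fused feature encoding sub-network.
            self.fused_encode = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
                num_layers=2)

        def forward(self, masked_video, masked_audio):
            # Assumes both inputs are (batch, frames, dim) and already time-aligned.
            v, _ = self.video_encode(torch.relu(self.video_extract(masked_video)))
            a, _ = self.audio_encode(torch.relu(self.audio_extract(masked_audio)))
            fused = torch.relu(self.fuse(torch.cat([v, a], dim=-1)))
            return self.fused_encode(fused)  # fusion encoded data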
In some alternative implementations, the first decoding module 1104 includes: a first decoding unit 11041, configured to decode the fusion encoded data according to a first arrangement sequence of a preset voice tag sequence by using a first decoding sub-network of the decoding network, so as to obtain first voice prediction data; a second decoding unit 11042, configured to decode the fusion encoded data according to a second arrangement sequence of a preset voice tag sequence by using a second decoding sub-network of the decoding network, so as to obtain second voice prediction data; the third decoding unit 11043 is configured to decode the fusion encoded data by using a third decoding sub-network of the decoding network, so as to obtain third speech prediction data.
In some alternative implementations, the first determining module 1105 includes: a first determining unit 11051 configured to determine a first loss value representing an error between the first speech prediction data and the speech tag sequence, a second loss value representing an error between the second speech prediction data and the speech tag sequence, and a third loss value representing an error between the third speech prediction data and the speech tag sequence, based on a preset loss function; the second determining unit 11052 is configured to determine a loss value of the loss function based on the first loss value, the second loss value, and the third loss value.
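A sketch of the combined objective (Python/PyTorch, illustrative only): it assumes cross-entropy losses for the first and second decoding sub-networks over the forward and reversed label orders, a CTC loss for the third, and an assumed weighted sum; padding handling and label smoothing are omitted.

    import torch
    import torch.nn.functional as F

    def combined_loss(first_pred, second_pred, third_log_probs,
                      labels, input_lengths, label_lengths,
                      w1=0.4, w2=0.4, w3=0.2):
        # first_pred / second_pred: (batch, seq, vocab) logits from the first and
        # second decoding sub-networks; labels: (batch, seq) voice tag ids.
        labels_reversed = torch.flip(labels, dims=[1])  # reverse arrangement order
        loss1 = F.cross_entropy(first_pred.transpose(1, 2), labels)
        loss2 = F.cross_entropy(second_pred.transpose(1, 2), labels_reversed)
        # third_log_probs: (time, batch, vocab) log-probabilities for the CTC branch.
        loss3 = F.ctc_loss(third_log_probs, labels, input_lengths, label_lengths)
        return w1 * loss1 + w2 * loss2 + w3 * loss3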
According to the training device for the voice recognition model, provided by the embodiment of the disclosure, the random number is generated, the sample video data and the sample audio data are subjected to data masking processing based on the random number to obtain the masked video data and the masked audio data, then the masked video data and the masked audio data are subjected to fusion coding by using the initial voice recognition model to obtain fusion coding data, the fusion coding data are decoded to obtain voice prediction data, then a loss value representing an error between the voice prediction data and a preset voice tag sequence is determined based on a preset loss function, and finally parameters of the initial voice recognition model are adjusted based on the loss value to obtain the pre-trained voice recognition model. According to the embodiment of the disclosure, random factors are introduced into the multi-mode training sample data in the training stage of the voice recognition model, so that the data volume of the sample video data and the sample audio data is no longer balanced, the capability of the model for processing unbalanced multi-mode data can be improved, the trained voice recognition model can adapt to various noise scenes, and the recognition accuracy of the voice recognition model is improved.
Fig. 13 is a schematic structural view of a voice recognition apparatus according to an exemplary embodiment of the present disclosure. The present embodiment may be applied to an electronic device, as shown in fig. 13, where the voice recognition apparatus includes: an acquiring module 1301, configured to acquire video data to be identified and audio data to be identified; the second fusion module 1302 is configured to perform fusion encoding on the video data to be identified and the audio data to be identified by using a fusion encoding network of the pre-trained speech recognition model, so as to obtain fusion encoded data; the second decoding module 1303 is configured to decode the fusion encoded data by using a decoding network of the speech recognition model to obtain speech prediction data; a second generation module 1304 for generating speech recognition text based on the speech prediction data.
Since the voice recognition apparatus provided in this embodiment corresponds to the voice recognition method provided in the embodiment shown in fig. 8, the functions of each module in this embodiment correspond to steps 801 to 804 in the embodiment shown in fig. 8 one by one, and are not repeated here.
Referring to fig. 14, fig. 14 is a schematic structural view of a voice recognition apparatus provided in another exemplary embodiment of the present disclosure.
In some alternative implementations, the acquiring module 1301 includes: an obtaining unit 13011, configured to obtain data to be identified, where the data to be identified includes initial video data and/or initial audio data; a third determining unit 13012 for determining the initial video data as video data to be recognized and the initial audio data as audio data to be recognized in response to determining that the data to be recognized includes the initial video data and the initial audio data; a fourth determining unit 13013 configured to determine the initial audio data as the audio data to be recognized and generate the video data to be recognized based on the first preset value in response to determining that the data to be recognized includes the initial audio data and does not include the initial video data; a fifth determining unit 13014 for determining the initial video data as the video data to be identified and generating the audio data to be identified based on the second preset value in response to determining that the data to be identified comprises the initial video data and does not comprise the initial audio data.
In some alternative implementations, the second decoding module 1303 includes: a fourth decoding unit 13031, configured to decode the fusion encoded data according to a first arrangement sequence of a preset voice tag sequence by using a first decoding sub-network of the decoding network, to obtain first voice prediction data; and a fifth decoding unit 13032, configured to decode the fusion encoded data according to a second arrangement sequence of a preset voice tag sequence by using a second decoding sub-network of the decoding network, so as to obtain second voice prediction data.
In some alternative implementations, the second generation module 1304 includes: a sixth decoding unit 13041, configured to decode the fusion encoded data by using a third decoding sub-network of the decoding network to obtain at least two predicted voice tag sequences; a sixth determining unit 13042 for determining a score for each of the at least two predicted voice tag sequences based on the first voice prediction data and the second voice prediction data; a selection unit 13043 for selecting a target predicted voice tag sequence from at least two predicted voice tag sequences based on the obtained score of each predicted voice tag sequence; a third generating unit 13044 for generating a speech recognition text based on the target predicted speech tag sequence.
According to the voice recognition device provided by the embodiment of the disclosure, the pre-trained voice recognition model for carrying out multi-modal voice recognition is used for recognizing the video data to be recognized and the audio data to be recognized, so that the capability of the voice recognition model for processing unbalanced multi-modal data is effectively utilized, and the accuracy of carrying out voice recognition under different noise scenes is improved.
Exemplary electronic device
Next, an electronic device according to an embodiment of the present disclosure is described with reference to fig. 15. The electronic device may be either or both of the terminal device 101 and the server 103 as shown in fig. 1, or a stand-alone device independent thereof, which may communicate with the terminal device 101 and the server 103 to receive the acquired input signals therefrom.
Fig. 15 shows a block diagram of an electronic device according to an embodiment of the disclosure.
As shown in fig. 15, electronic device 1500 includes one or more processors 1501 and memory 1502.
The processor 1501 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 1500 to perform desired functions.
Memory 1502 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM) and/or cache memory. Non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1501 may execute the program instructions to implement the training method of the speech recognition model or the speech recognition method of the various embodiments of the present disclosure described above, and/or other desired functions. Various contents such as sample video data, sample audio data, video data to be recognized, and audio data to be recognized may also be stored in the computer-readable storage medium.
In one example, the electronic device 1500 may further include: an input device 1503 and an output device 1504, interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, when the electronic device is the terminal device 101 or the server 103, the input means 1503 may be a camera, a microphone, a mouse, a keyboard, or the like for inputting video, audio, various commands, or the like. When the electronic device is a stand-alone device, the input means 1503 may be a communication network connector for receiving inputted video, audio, various commands, and the like from the terminal device 101 and the server 103.
The output device 1504 can output various information to the outside, including the trained speech recognition model, recognized text, and the like. The output device 1504 may include, for example, a display, speakers, a printer, and a communication network and the remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 1500 that are relevant to the present disclosure are shown in fig. 15; components such as buses and input/output interfaces are omitted. In addition, the electronic device 1500 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a training method or a speech recognition method of a speech recognition model according to various embodiments of the present disclosure described in the above "exemplary methods" section of the present description.
The computer program product may include program code for performing the operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the steps in a training method or a speech recognition method of a speech recognition model according to various embodiments of the present disclosure described in the above "exemplary methods" section of the present description.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to mutually. Since the system embodiments essentially correspond to the method embodiments, their description is relatively brief, and reference may be made to the description of the method embodiments for relevant details.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by those skilled in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended, mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that, in the apparatuses, devices, and methods of the present disclosure, components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be regarded as equivalent solutions of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (14)

1. A method of training a speech recognition model, comprising:
generating a random number in a preset numerical value interval;
carrying out data masking processing on the sample video data and the sample audio data based on the random number to obtain masked video data and masked audio data;
carrying out fusion coding on the masked video data and the masked audio data by utilizing a fusion coding network of an initial speech recognition model to be trained to obtain fusion coding data;
decoding the fusion encoded data by utilizing a decoding network of the initial voice recognition model to obtain voice prediction data;
determining a loss value representing an error between the speech prediction data and a preset speech tag sequence based on a preset loss function and the speech prediction data;
based on the loss value, adjusting parameters of the initial speech recognition model to obtain an adjusted speech recognition model;
and in response to determining that the adjusted speech recognition model meets a preset training ending condition, determining the adjusted speech recognition model as a pre-trained speech recognition model.
2. The method of claim 1, wherein the performing data masking processing on the sample video data and the sample audio data based on the random number to obtain masked video data and masked audio data comprises:
resetting the sample video data based on a first preset value in response to determining that the random number is in a first preset interval to obtain the masked video data, and determining the sample audio data as the masked audio data;
resetting the sample audio data based on a second preset value in response to determining that the random number is in a second preset interval to obtain the masked audio data, and determining the sample video data as the masked video data;
in response to determining that the random number is within a third preset interval, the sample video data is determined to be the masked video data and the sample audio data is determined to be the masked audio data.
3. The method of claim 1, wherein the fusion encoding of the masked video data and the masked audio data using the fusion encoding network of the initial speech recognition model to be trained to obtain fusion encoded data comprises:
the video coding sub-network and the audio coding sub-network of the fusion coding network are utilized to respectively code the masked video data and the masked audio data to obtain video characteristic data to be fused and audio characteristic data to be fused;
fusing the video feature data to be fused and the audio feature data to be fused by utilizing a feature fusion sub-network of the fusion coding network to obtain fusion feature data;
and coding the fusion characteristic data by utilizing a fusion characteristic coding sub-network of the fusion coding network to obtain the fusion coding data.
4. The method according to claim 3, wherein the encoding the masked video data and the masked audio data with the video encoding sub-network and the audio encoding sub-network of the fusion encoding network respectively to obtain video feature data to be fused and audio feature data to be fused includes:
performing feature extraction on the masked video data by utilizing a video feature extraction layer of the video coding sub-network to obtain basic video feature data;
coding the basic video characteristic data by utilizing a video characteristic coding layer of the video coding sub-network to obtain the video characteristic data to be fused;
performing feature extraction on the masked audio data by utilizing an audio feature extraction layer of the audio coding sub-network to obtain basic audio feature data;
and coding the basic audio feature data by utilizing the audio feature coding layer of the audio coding sub-network to obtain the audio feature data to be fused.
5. The method of claim 1, wherein the decoding the fusion encoded data using the decoding network of the initial speech recognition model to obtain speech prediction data comprises:
decoding the fusion encoded data according to a first arrangement sequence of the preset voice tag sequence by using a first decoding sub-network of the decoding network to obtain first voice prediction data;
decoding the fusion encoded data according to a second arrangement sequence of the preset voice tag sequence by using a second decoding sub-network of the decoding network to obtain second voice prediction data;
and decoding the fusion encoded data by using a third decoding sub-network of the decoding network to obtain third voice prediction data.
6. The method of claim 5, wherein the determining a loss value representing an error between the speech prediction data and a predetermined sequence of speech tags based on a predetermined loss function and the speech prediction data comprises:
determining a first loss value representing an error between the first speech prediction data and the speech tag sequence, a second loss value representing an error between the second speech prediction data and the speech tag sequence, and a third loss value representing an error between the third speech prediction data and the speech tag sequence based on a preset loss function;
a loss value of the loss function is determined based on the first loss value, the second loss value, and the third loss value.
7. A method of speech recognition, comprising:
acquiring video data to be identified and audio data to be identified;
performing fusion coding on the video data to be identified and the audio data to be identified by utilizing a fusion coding network of a pre-trained voice identification model to obtain fusion coding data;
decoding the fusion encoded data by utilizing a decoding network of the voice recognition model to obtain voice prediction data;
and generating voice recognition text based on the voice prediction data.
8. The method of claim 7, wherein the acquiring video data and audio data to be identified comprises:
acquiring data to be identified, wherein the data to be identified comprises initial video data and/or initial audio data;
in response to determining that the data to be identified includes initial video data and initial audio data, determining the initial video data as the video data to be identified and the initial audio data as the audio data to be identified;
in response to determining that the data to be identified includes initial audio data and does not include initial video data, determining the initial audio data as the audio data to be identified, and generating video data to be identified based on a first preset value;
in response to determining that the data to be identified includes initial video data and does not include initial audio data, determining the initial video data as the video data to be identified, and generating audio data to be identified based on a second preset value.
9. The method of claim 7, wherein the decoding the fused encoded data using the decoding network of the speech recognition model to obtain speech prediction data comprises:
decoding the fusion encoded data according to a first arrangement sequence of a preset voice tag sequence by using a first decoding sub-network of the decoding network to obtain first voice prediction data;
and decoding the fusion encoded data according to a second arrangement sequence of the preset voice tag sequence by using a second decoding sub-network of the decoding network to obtain second voice prediction data.
10. The method of claim 9, wherein the generating speech recognition text based on the speech prediction data comprises:
decoding the fusion encoded data by using a third decoding sub-network of the decoding network to obtain at least two predicted voice tag sequences;
determining a score for each of the at least two predicted voice tag sequences based on the first voice prediction data and the second voice prediction data;
selecting a target predicted voice tag sequence from the at least two predicted voice tag sequences based on the score of each predicted voice tag sequence obtained;
and generating the speech recognition text based on the target predicted speech tag sequence.
11. A training device for a speech recognition model, comprising:
the first generation module is used for generating a random number in a preset numerical value interval;
the masking module is used for carrying out data masking processing on the sample video data and the sample audio data based on the random number to obtain masked video data and masked audio data;
the first fusion module is used for carrying out fusion coding on the masked video data and the masked audio data by utilizing a fusion coding network of an initial voice recognition model to be trained to obtain fusion coding data;
the first decoding module is used for decoding the fusion encoded data by utilizing a decoding network of the initial voice recognition model to obtain voice prediction data;
a first determining module, configured to determine a loss value representing an error between the speech prediction data and a preset speech tag sequence based on a preset loss function and the speech prediction data;
the adjusting module is used for adjusting parameters of the initial voice recognition model based on the loss value to obtain an adjusted voice recognition model;
and the second determining module is used for determining the adjusted voice recognition model as a pre-trained voice recognition model in response to determining that the adjusted voice recognition model meets a preset training ending condition.
12. A speech recognition apparatus comprising:
the acquisition module is used for acquiring the video data to be identified and the audio data to be identified;
the second fusion module is used for carrying out fusion coding on the video data to be identified and the audio data to be identified by utilizing a fusion coding network of a pre-trained voice identification model to obtain fusion coding data;
the second decoding module is used for decoding the fusion encoded data by utilizing a decoding network of the voice recognition model to obtain voice prediction data;
and the second generation module is used for generating voice recognition text based on the voice prediction data.
13. A computer readable storage medium storing a computer program for execution by a processor to implement the method of any one of the preceding claims 1-10.
14. An electronic device, the electronic device comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of the preceding claims 1-10.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination