CN111968647B - Voice recognition method, device, medium and electronic equipment

Voice recognition method, device, medium and electronic equipment

Info

Publication number: CN111968647B
Application number: CN202010873809.0A
Authority: CN (China)
Other versions: CN111968647A (Chinese, zh)
Inventor: 殷翔
Assignee: Beijing ByteDance Network Technology Co Ltd
Legal status: Active (granted)
Prior art keywords: model, text data, output, description information, information generation
Events: application filed by Beijing ByteDance Network Technology Co Ltd; priority to CN202010873809.0A; publication of CN111968647A; application granted; publication of CN111968647B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 - Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440236 - Reformatting operations by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a voice recognition method, apparatus, medium, and electronic device. The method includes: acquiring target video data, where the target video data includes target audio data and target image data; extracting first text data corresponding to the target audio data; extracting feature information of the target image data, and generating second text data describing the target image data according to the feature information; and correcting the first text data according to the second text data to obtain corrected first text data. In this way, the influence of noise or background music in the target video data on speech recognition accuracy can be avoided, and the accuracy of the text content corresponding to the target audio data is improved.

Description

Voice recognition method, device, medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, medium, and electronic device.
Background
With the development of artificial intelligence technology, automatic speech recognition (ASR) has made great progress and has begun to enter various fields such as home appliances, communications, automobiles, and medical care. ASR is commonly used to obtain the text content corresponding to the audio in a video. However, when there is noise or background music in the video, the recognition accuracy of ASR is degraded.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a speech recognition method, including:
acquiring target video data, wherein the target video data comprises target audio data and target image data;
extracting first text data corresponding to the target audio data;
extracting feature information of the target image data, and generating second text data for describing the target image data according to the feature information;
and correcting the first text data according to the second text data to obtain the corrected first text data.
In a second aspect, the present disclosure provides a speech recognition apparatus comprising:
an acquisition module, used for acquiring target video data, where the target video data includes target audio data and target image data;
the first extraction module is used for extracting first text data corresponding to the target audio data acquired by the acquisition module;
the second extraction module is used for extracting the feature information of the target image data acquired by the acquisition module and generating second text data for describing the target image data according to the feature information;
and the correcting module is used for correcting the first text data extracted by the first extracting module according to the second text data extracted by the second extracting module to obtain the corrected first text data.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method provided by the first aspect of the present disclosure.
In the above technical solution, noise or background music in the target video data may cause the first text data extracted from the target audio data in the target video data to be inaccurate. Therefore, after the first text data is extracted, it is not used directly as the speech recognition result; instead, it is corrected by the second text data describing the target image data in the target video data, and the corrected first text data is used as the speech recognition result. In this way, the influence of noise or background music in the target video data on speech recognition accuracy can be avoided, and the accuracy of the text content corresponding to the target audio data is improved.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating a method of speech recognition according to an example embodiment.
FIG. 2 is a flow diagram illustrating a method of model training in accordance with an exemplary embodiment.
FIG. 3 is a schematic diagram illustrating a model training process in accordance with an exemplary embodiment.
FIG. 4 is a flow chart illustrating a method of model training in accordance with another exemplary embodiment.
FIG. 5 is a schematic diagram illustrating a pre-training process of a speech recognition model and a description information generation model, according to an example embodiment.
FIG. 6 is a diagram illustrating a preliminary training of a speech recognition model, a speech synthesis model, an image generation model, and a description information generation model, respectively, according to an example embodiment.
FIG. 7 is a block diagram illustrating a speech recognition apparatus according to an example embodiment.
FIG. 8 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
FIG. 1 is a flow diagram illustrating a method of speech recognition according to an example embodiment. As shown in fig. 1, the method includes S101 to S104.
In S101, target video data including target audio data and target image data is acquired.
In S102, first text data corresponding to the target audio data is extracted.
In S103, feature information of the target image data is extracted, and second text data describing the target image data is generated based on the feature information.
In the present disclosure, the feature information of the target image data may include global feature information corresponding to the target image data and feature information of a target object (e.g., a person, a backpack, the sun, etc.) included in the target image data.
In S104, the first text data is corrected according to the second text data, and the corrected first text data is obtained.
Illustratively, the first text data is "a group of people carrying ice sculptures on their backs are walking on a path between mountains", while the second text data is "several people carrying backpacks on their backs are walking on a path between mountains". From this it can be determined that "ice sculptures" in the first text data is an error, and it can be corrected to "backpacks" according to the second text data. Thus, the corrected first text data is "a group of people carrying backpacks on their backs are walking on a path between mountains".
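The patent does not pin down a particular correction algorithm for S104. As a minimal illustrative sketch, one could align the two texts word by word and let the caption overwrite the segments where they disagree; the function below (using the standard-library `difflib`, with the word-level rule as an assumption rather than the claimed method) shows the idea:

```python
import difflib

def correct_first_text(first_text: str, second_text: str) -> str:
    """Correct the ASR output (first text data) using the image caption (second text data).

    Illustrative sketch: segments of the first text that disagree with the aligned
    segments of the second text are replaced by the caption's wording.
    """
    first_words = first_text.split()
    second_words = second_text.split()
    matcher = difflib.SequenceMatcher(a=first_words, b=second_words)
    corrected = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            corrected.extend(second_words[j1:j2])   # adopt the caption's wording
        elif tag in ("equal", "delete"):
            corrected.extend(first_words[i1:i2])    # keep the ASR wording
        # "insert" segments exist only in the caption and are not forced into the ASR text
    return " ".join(corrected)

# English rendering of the example above:
asr_text = "a group of people carrying ice sculptures on their backs are walking on a path between mountains"
caption = "several people carrying backpacks on their backs are walking on a path between mountains"
print(correct_first_text(asr_text, caption))  # "ice sculptures" is replaced by "backpacks"
```

Note that this naive rule also adopts the caption's phrasing for segments that merely differ in wording (here "a group of" becomes "several"); a practical system would restrict the replacement to words whose acoustic confidence is low.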
In the above technical solution, noise or background music in the target video data may cause the first text data extracted from the target audio data in the target video data to be inaccurate. Therefore, after the first text data is extracted, it is not used directly as the speech recognition result; instead, it is corrected by the second text data describing the target image data in the target video data, and the corrected first text data is used as the speech recognition result. In this way, the influence of noise or background music in the target video data on speech recognition accuracy can be avoided, and the accuracy of the text content corresponding to the target audio data is improved.
A detailed description will be given below of a specific embodiment of extracting the first text data corresponding to the target audio data in S102.
In one embodiment, the first text data corresponding to the target audio data may be obtained in a manner of manual annotation.
In another embodiment, the target audio data may be input into a speech recognition model to obtain the first text data corresponding to the target audio data. The speech recognition model may be, for example, a Deep Feed-forward Sequential Memory Network (DFSMN) model, a Long Short-Term Memory (LSTM) network model, or the like.
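As a rough sketch of what such a speech recognition model could look like in code, the following is an illustrative bidirectional LSTM acoustic model with greedy CTC decoding (the architecture, feature dimensions, and vocabulary size are assumptions for illustration, not the specific DFSMN/LSTM configuration of the disclosure):

```python
import torch
import torch.nn as nn

class SpeechRecognizer(nn.Module):
    """Illustrative LSTM acoustic model: mel-spectrogram frames -> per-frame token logits."""
    def __init__(self, n_mels: int = 80, hidden: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, mel: torch.Tensor) -> torch.Tensor:        # mel: (batch, time, n_mels)
        encoded, _ = self.encoder(mel)
        return self.classifier(encoded)                           # (batch, time, vocab_size + 1)

def greedy_ctc_decode(logits: torch.Tensor, blank_id: int = 0) -> list:
    """Collapse repeated tokens and drop blanks to obtain the first text data as token ids."""
    ids = logits.argmax(dim=-1).squeeze(0).tolist()
    out, prev = [], None
    for token in ids:
        if token != blank_id and token != prev:
            out.append(token)
        prev = token
    return out

# Usage sketch: roughly 3 seconds of audio represented as 300 mel frames.
model = SpeechRecognizer()
mel_frames = torch.randn(1, 300, 80)
first_text_token_ids = greedy_ctc_decode(model(mel_frames))
```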
The following is a detailed description of a specific embodiment of extracting feature information of target image data in S103 and generating second text data for describing the target image data according to the feature information.
In one embodiment, feature information of the target image data may be extracted through an image feature extraction model (e.g., a pre-trained convolutional neural network), and the feature information is then input into a description information generation sub-model (e.g., a Mask R-CNN network) to generate the second text data describing the target image data.
In another embodiment, the target image data may be input into a description information generation model, which extracts the feature information of the target image data and generates, according to the feature information, the second text data describing the target image data. The description information generation model may include a multi-layer convolutional neural network and a Mask R-CNN network.
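A minimal sketch of such a description information generation model is given below: a small convolutional encoder standing in for the multi-layer CNN, feeding an LSTM caption decoder (the Mask R-CNN region branch mentioned in the text is omitted for brevity, and all dimensions are assumptions):

```python
import torch
import torch.nn as nn

class CaptionGenerator(nn.Module):
    """Illustrative captioning model: image -> global feature -> caption token logits."""
    def __init__(self, vocab_size: int = 5000, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(                    # stand-in for the multi-layer CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, image: torch.Tensor, caption_in: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(image)                       # (batch, hidden) global feature information
        h0 = feat.unsqueeze(0)                           # initialize the decoder with the image feature
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(caption_in), (h0, c0))
        return self.out(dec_out)                         # (batch, seq_len, vocab_size)

# Usage sketch (teacher forcing): one 224x224 frame and a shifted reference caption.
model = CaptionGenerator()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 5000, (1, 12)))
```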
The speech recognition model and the description information generation model can be obtained by training through S201 and S202 shown in fig. 2.
In S201, first reference text data is acquired.
Illustratively, a plurality of pieces of first reference text data may be obtained from a local text repository or downloaded from a network.
In S202, model training is performed in the following manner to obtain the speech recognition model and the description information generation model: the first reference text data, the output of the speech recognition model, and the output of the description information generation model are used as inputs of the speech synthesis model; the output of the speech synthesis model is used as the input of the speech recognition model; the first reference text data is used as the target output of the speech recognition model; the first reference text data, the output of the speech recognition model, and the output of the description information generation model are also used as inputs of the image generation model; the output of the image generation model is used as the input of the description information generation model; and the first reference text data is used as the target output of the description information generation model.
In the present disclosure, the speech recognition model has a high requirement on the diversity of its training samples; for example, it requires audio data with different signal-to-noise ratios and from speakers of different ages (e.g., elderly people, children, adults) as training samples, while the audio data that can actually be collected is relatively limited and hardly satisfies this diversity requirement. Therefore, audio data with different signal-to-noise ratios and from speakers of different ages can be generated by the speech synthesis model (e.g., an end-to-end speech synthesis model) to serve as training samples for the speech recognition model.
Similarly, the description information generation model has a high requirement on the diversity of its training samples; for example, it requires image data of different objects and under different lighting conditions as training samples, while the image data that can actually be collected is relatively limited and hardly satisfies this diversity requirement. Therefore, image data of different objects and under different lighting conditions can be generated by the image generation model (e.g., a multi-layer convolutional neural network) to serve as training samples for the description information generation model.
As shown in fig. 3, the first reference text data is input into the speech synthesis model to obtain first generated audio data, and the first generated audio data is then input into the speech recognition model to obtain predicted text data corresponding to the first generated audio data. At the same time, the first reference text data is input into the image generation model to obtain first generated image data, and the first generated image data is then input into the description information generation model to obtain predicted text data corresponding to the first generated image data. Next, the predicted text data corresponding to the first generated audio data and the predicted text data corresponding to the first generated image data (collectively referred to as first generated text data) are input, together with the first reference text data, into the speech synthesis model to obtain new first generated audio data, and the new first generated audio data is input into the speech recognition model to obtain predicted text data corresponding to the new first generated audio data. Meanwhile, the first generated text data and the first reference text data are input into the image generation model to obtain new first generated image data, and the new first generated image data is input into the description information generation model to obtain predicted text data corresponding to the new first generated image data. The predicted text data corresponding to the new first generated audio data, the predicted text data corresponding to the new first generated image data, and the first reference text data are then input together into the speech synthesis model and the image generation model, and the process is repeated until the recognition accuracy of the speech recognition model and of the description information generation model no longer increases. New first reference text data is then acquired, and model training continues based on the new first reference text data until the recognition accuracy of the speech recognition model is greater than a first preset accuracy threshold (e.g., 90%) and the recognition accuracy of the description information generation model is greater than a second preset accuracy threshold (e.g., 90%). During model training, the model parameters of the speech recognition model are updated according to the comparison result between the predicted text data corresponding to the first generated audio data and the first reference text data, and the model parameters of the description information generation model are updated according to the comparison result between the predicted text data corresponding to the first generated image data and the first reference text data.
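Condensed into code, one round of this closed loop could be sketched as follows; the callables `tts`, `asr`, `img_gen`, and `captioner`, the way the fed-back predictions are passed as extra inputs, and `text_loss_fn` are all illustrative assumptions rather than interfaces fixed by the disclosure:

```python
def closed_loop_step(ref_text, tts, asr, img_gen, captioner,
                     asr_optimizer, cap_optimizer, text_loss_fn,
                     prev_asr_pred=None, prev_cap_pred=None):
    """One round of the joint training loop of fig. 3 (illustrative sketch)."""
    # Audio branch: reference text (plus previous predictions) -> synthesized audio -> ASR prediction.
    generated_audio = tts(ref_text, prev_asr_pred, prev_cap_pred)
    asr_pred = asr(generated_audio)
    asr_loss = text_loss_fn(asr_pred, ref_text)          # compare the prediction with the reference text

    # Image branch: reference text (plus previous predictions) -> generated image -> caption prediction.
    generated_image = img_gen(ref_text, prev_asr_pred, prev_cap_pred)
    cap_pred = captioner(generated_image)
    cap_loss = text_loss_fn(cap_pred, ref_text)

    # Update the speech recognition model and the description information generation model.
    asr_optimizer.zero_grad(); asr_loss.backward(); asr_optimizer.step()
    cap_optimizer.zero_grad(); cap_loss.backward(); cap_optimizer.step()

    # Both predictions are fed back as additional generator inputs in the next round.
    return asr_pred, cap_pred
```

The outer loop simply calls this step repeatedly, switching to new first reference text data once the two recognition accuracies stop improving, as described above.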
During model training, the loss function of the speech recognition model may adopt a neural-network-based Connectionist Temporal Classification (CTC) loss to improve the recognition accuracy of the speech recognition model, and the loss function of the description information generation model may adopt a confusion loss together with a Mask R-CNN loss, where the Mask R-CNN loss enables the description information generation model to fully utilize local information of the image, so that the generated description information is more accurate.
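For the CTC part, the standard PyTorch loss can be used directly; the sketch below pairs it with a plain cross-entropy loss as a stand-in for the caption-side losses (the confusion and Mask R-CNN terms are not reproduced here, and all shapes are assumed):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)        # loss for the speech recognition model
caption_ce = nn.CrossEntropyLoss()    # simple stand-in for the description-model losses

# Assumed shapes: 300 acoustic frames, batch of 4, vocabulary of 5000 tokens plus the blank.
log_probs = torch.randn(300, 4, 5001).log_softmax(dim=-1)   # (time, batch, classes) as CTCLoss expects
targets = torch.randint(1, 5001, (4, 20))                    # reference token ids (no blanks)
input_lengths = torch.full((4,), 300, dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)
asr_loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)

# Description information generation model: per-position token logits vs. reference tokens.
caption_logits = torch.randn(4, 20, 5000)
caption_targets = torch.randint(0, 5000, (4, 20))
caption_loss = caption_ce(caption_logits.reshape(-1, 5000), caption_targets.reshape(-1))
```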
Generating the training samples of the speech recognition model through the speech synthesis model and the training samples of the description information generation model through the image generation model satisfies the diversity requirements of both models on training samples, so the recognition accuracy of the speech recognition model and of the description information generation model can be improved. In addition, during model training, the inputs of the speech synthesis model and the image generation model include not only the first reference text data but also the outputs of the speech recognition model and of the description information generation model, while the output of the speech synthesis model serves as the input of the speech recognition model and the output of the image generation model serves as the input of the description information generation model. A closed loop is therefore formed among the four models, enabling joint updating of the speech recognition model and the description information generation model and further improving their recognition accuracy. Moreover, the first generated image data generated by the image generation model from the first reference text data contains all the information of the first reference text data, so the training samples of the description information generation model carry richer information, which can further improve the recognition accuracy of the description information generation model.
In addition, in order to improve the recognition accuracy of the speech recognition model and the description information generation model and reduce the time overhead of model training, as shown in fig. 4, before the above S201, the above method may further include the following S203.
In S203, the speech recognition model and the description information generation model are pre-trained.
In one embodiment, the speech recognition model may be pre-trained first, and then the description information generation model may be pre-trained based on the last output of the speech recognition model during its pre-training. Specifically, this can be achieved by the following steps:
(1) second reference text data is acquired.
The second reference text data may be the same as or different from the first reference text data, and is not specifically limited in this disclosure.
(2) The speech recognition model is pre-trained by taking the second reference text data and the output of the speech recognition model as the input of the speech synthesis model, taking the output of the speech synthesis model as the input of the speech recognition model, and taking the second reference text data as the target output of the speech recognition model.
(3) And pre-training the description information generation model by taking the second reference text data, the output of the description information generation model and the output of the voice recognition model obtained after pre-training as the input of the image generation model, taking the output of the image generation model as the input of the description information generation model and taking the second reference text data as the target output of the description information generation model.
As shown in fig. 5, the second reference text data is input into the speech synthesis model to obtain second generated audio data, and the second generated audio data is then input into the speech recognition model to obtain second generated text data. The second generated text data and the second reference text data are then input together into the speech synthesis model to obtain new second generated audio data, and the new second generated audio data is input into the speech recognition model to obtain new second generated text data. Next, the new second generated text data and the second reference text data are input into the speech synthesis model, and so on, until the recognition accuracy of the speech recognition model no longer increases; new second reference text data is then acquired, and model pre-training continues based on the new second reference text data until the recognition accuracy of the speech recognition model is greater than a third preset accuracy threshold (e.g., 85%), which is less than the first preset accuracy threshold. During model pre-training, the model parameters of the speech recognition model are updated according to the comparison result between the second generated text data and the second reference text data.
After the pre-training of the speech recognition model is completed, the second generated text data output last during that pre-training and the second reference text data can be input together into the image generation model to obtain second generated image data, and the second generated image data is then input into the description information generation model to obtain third generated text data. The third generated text data and the second reference text data are then input together into the image generation model to obtain new second generated image data, and the new second generated image data is input into the description information generation model to obtain new third generated text data. Next, the new third generated text data and the second reference text data are input into the image generation model, and the process is repeated until the recognition accuracy of the description information generation model no longer increases; new second reference text data is then acquired, and model pre-training continues based on the new second reference text data until the recognition accuracy of the description information generation model is greater than a fourth preset accuracy threshold (e.g., 85%), which is less than the second preset accuracy threshold. During model pre-training, the model parameters of the description information generation model are updated according to the comparison result between the third generated text data and the second reference text data.
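In code, this two-stage ordering could be outlined as follows; `run_text_audio_loop`, `run_text_image_loop`, and the `accuracy()` helpers are assumed placeholders for the loops described above, and the 85% thresholds simply follow the example values in the text:

```python
def pretrain(ref_texts, tts, asr, img_gen, captioner,
             run_text_audio_loop, run_text_image_loop):
    """Illustrative two-stage pre-training: speech recognition first, then description generation."""
    # Stage 1: text -> speech synthesis -> speech recognition, repeated with fresh reference
    # texts until the ASR accuracy exceeds the third preset accuracy threshold.
    last_asr_output = None
    for text in ref_texts:
        last_asr_output = run_text_audio_loop(text, tts, asr)
        if asr.accuracy() > 0.85:
            break

    # Stage 2: the last ASR output plus the reference text drive the image generation model,
    # whose output pre-trains the description information generation model.
    for text in ref_texts:
        run_text_image_loop(text, last_asr_output, img_gen, captioner)
        if captioner.accuracy() > 0.85:   # fourth preset accuracy threshold (example value)
            break
```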
In another embodiment, the description information generation model may be pre-trained first, and then the speech recognition model may be pre-trained based on the last output of the description information generation model during its pre-training. Specifically, this can be achieved by the following steps:
(1) second reference text data is acquired.
(2) The description information generation model is pre-trained by taking the output of the second reference text data and the description information generation model as the input of the image generation model, taking the output of the image generation model as the input of the description information generation model, and taking the second reference text data as the target output of the description information generation model.
(3) And pre-training the voice recognition model by taking the second reference text data, the output of the voice recognition model and the output of the description information generation model obtained after pre-training as the input of the voice synthesis model, taking the output of the voice synthesis model as the input of the voice recognition model and taking the second reference text data as the target output of the voice recognition model.
In addition, in order to further improve the recognition accuracy of the speech recognition model and the description information generation model and reduce the time overhead of model training, before S203, the method may further include the following steps:
acquiring reference video data, wherein the reference video data comprises reference image data, reference audio data, and third reference text data corresponding to the reference audio data; and preliminarily training the speech recognition model, the speech synthesis model, the image generation model, and the description information generation model respectively according to the reference video data.
As shown in fig. 6, the reference audio data may be used as the input of the speech recognition model and the third reference text data as its target output to preliminarily train the speech recognition model; the reference image data may be used as the input of the description information generation model and the third reference text data as its target output to preliminarily train the description information generation model; the third reference text data may be used as the input of the speech synthesis model and the reference audio data as its target output to preliminarily train the speech synthesis model; and the third reference text data may be used as the input of the image generation model and the reference image data as its target output to preliminarily train the image generation model.
For example, during model training, the loss function of the speech synthesis model may adopt a mean square error loss, a smoothed L1 norm loss, and a stop loss for determining when the output sequence of the speech synthesis model ends, where the smoothed L1 norm loss can improve the smoothness of the speech synthesized by the speech synthesis model. The loss function of the image generation model may adopt a pixel-level mean square error loss and the loss-sensitivity compensation loss of a loss-sensitive generative adversarial network (LSGAN), where the pixel-level mean square error loss can make the image generated by the image generation model more accurate.
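A sketch of how these generator-side loss terms could be combined with standard PyTorch building blocks (the LSGAN adversarial term is omitted and the equal weighting of the terms is an assumption):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
smooth_l1 = nn.SmoothL1Loss()          # smoothed L1 norm loss
stop_bce = nn.BCEWithLogitsLoss()      # stop loss deciding when the synthesized sequence ends

def tts_loss(pred_mel, target_mel, stop_logits, stop_targets):
    """Speech synthesis model loss: MSE plus smoothed L1 on mel frames, plus the stop-token loss."""
    return mse(pred_mel, target_mel) + smooth_l1(pred_mel, target_mel) + stop_bce(stop_logits, stop_targets)

def image_gen_loss(pred_image, target_image):
    """Image generation model loss: pixel-level mean squared error (LSGAN term left out here)."""
    return mse(pred_image, target_image)

# Usage sketch with assumed shapes: 200 mel frames of 80 bins, and one 3x64x64 image.
pred_mel, target_mel = torch.randn(1, 200, 80), torch.randn(1, 200, 80)
stop_logits, stop_targets = torch.randn(1, 200), torch.zeros(1, 200)
loss_tts = tts_loss(pred_mel, target_mel, stop_logits, stop_targets)
loss_img = image_gen_loss(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```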
FIG. 7 is a block diagram illustrating a speech recognition apparatus according to an example embodiment. As shown in fig. 7, the apparatus 700 includes: an obtaining module 701, configured to obtain target video data, where the target video data includes target audio data and target image data; a first extracting module 702, configured to extract first text data corresponding to the target audio data acquired by the acquiring module 701; a second extracting module 703, configured to extract feature information of the target image data acquired by the acquiring module 701, and generate second text data for describing the target image data according to the feature information; a correcting module 704, configured to correct the first text data extracted by the first extracting module 702 according to the second text data extracted by the second extracting module 703, so as to obtain corrected first text data.
In the above technical solution, noise or background music in the target video data may cause the first text data extracted from the target audio data in the target video data to be inaccurate. Therefore, after the first text data is extracted, it is not used directly as the speech recognition result; instead, it is corrected by the second text data describing the target image data in the target video data, and the corrected first text data is used as the speech recognition result. In this way, the influence of noise or background music in the target video data on speech recognition accuracy can be avoided, and the accuracy of the text content corresponding to the target audio data is improved.
In one embodiment, the first extraction module 702 is configured to obtain the first text data corresponding to the target audio data in a manner of manual annotation.
In another embodiment, the first extraction module 702 is configured to input the target audio data into a speech recognition model to obtain first text data corresponding to the target audio data, where a training sample of the speech recognition model is generated by a speech synthesis model.
In one embodiment, the second extraction module 703 is configured to extract feature information of the target image data through an image feature extraction model, and then input the feature information into the description information generation sub-model to generate second text data for describing the target image data.
In another embodiment, the second extraction module 703 is configured to input the target image data into a description information generation model, so as to extract feature information of the target image data through the description information generation model, and generate second text data for describing the target image data according to the feature information; wherein the training sample describing the information generation model is generated by an image generation model.
Optionally, the speech recognition model and the description information generation model are obtained by training in the following manner: acquiring first reference text data; and performing model training by using the first reference text data, the output of the speech recognition model, and the output of the description information generation model as inputs of the speech synthesis model, using the output of the speech synthesis model as the input of the speech recognition model, using the first reference text data as the target output of the speech recognition model, using the first reference text data, the output of the speech recognition model, and the output of the description information generation model as inputs of the image generation model, using the output of the image generation model as the input of the description information generation model, and using the first reference text data as the target output of the description information generation model, so as to obtain the speech recognition model and the description information generation model.
Optionally, before performing model training, the apparatus 700 further includes: and the pre-training module is used for pre-training the voice recognition model and the description information generation model.
Optionally, the pre-training module includes: the first obtaining submodule is used for obtaining second reference text data; a first pre-training sub-module, configured to pre-train the speech recognition model by using the second reference text data and the output of the speech recognition model as the input of the speech synthesis model, using the output of the speech synthesis model as the input of the speech recognition model, and using the second reference text data as the target output of the speech recognition model; and the second pre-training sub-module is used for pre-training the description information generation model by taking the second reference text data, the output of the description information generation model and the output of the pre-trained voice recognition model as the input of the image generation model, taking the output of the image generation model as the input of the description information generation model and taking the second reference text data as the target output of the description information generation model.
Optionally, the pre-training module comprises: the second obtaining submodule is used for obtaining second reference text data; a third pre-training sub-module, configured to pre-train the description information generation model by using the second reference text data and the output of the description information generation model as the input of the image generation model, using the output of the image generation model as the input of the description information generation model, and using the second reference text data as the target output of the description information generation model; and the fourth pre-training sub-module is used for pre-training the voice recognition model in a mode that the second reference text data, the output of the voice recognition model and the output of the description information generation model obtained after pre-training are used as the input of the voice synthesis model, the output of the voice synthesis model is used as the input of the voice recognition model, and the second reference text data is used as the target output of the voice recognition model.
Referring now to fig. 8, a schematic diagram of an electronic device (e.g., a terminal device or server) 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring target video data, wherein the target video data comprises target audio data and target image data; extracting first text data corresponding to the target audio data; extracting feature information of the target image data, and generating second text data for describing the target image data according to the feature information; and correcting the first text data according to the second text data to obtain the corrected first text data.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Here, the name of the module does not constitute a limitation to the module itself in some cases, for example, the first extraction module may also be described as a "module that extracts first text data corresponding to the target audio data".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides, in accordance with one or more embodiments of the present disclosure, a speech recognition method including: acquiring target video data, wherein the target video data comprises target audio data and target image data; extracting first text data corresponding to the target audio data; extracting feature information of the target image data, and generating second text data for describing the target image data according to the feature information; and correcting the first text data according to the second text data to obtain the corrected first text data.
Example 2 provides the method of example 1, the extracting first text data corresponding to the target audio data, including: inputting the target audio data into a speech recognition model to obtain first text data corresponding to the target audio data; the extracting feature information of the target image data and generating second text data for describing the target image data according to the feature information includes: inputting the target image data into a description information generation model, extracting feature information of the target image data through the description information generation model, and generating second text data for describing the target image data according to the feature information; wherein the training samples of the speech recognition model are generated by a speech synthesis model, and the training samples of the description information generation model are generated by an image generation model.
Example 3 provides the method of example 2, wherein the speech recognition model and the description information generation model are trained in the following manner: acquiring first reference text data; and performing model training by using the first reference text data, the output of the speech recognition model, and the output of the description information generation model as inputs of the speech synthesis model, using the output of the speech synthesis model as the input of the speech recognition model, using the first reference text data as the target output of the speech recognition model, using the first reference text data, the output of the speech recognition model, and the output of the description information generation model as inputs of the image generation model, using the output of the image generation model as the input of the description information generation model, and using the first reference text data as the target output of the description information generation model, so as to obtain the speech recognition model and the description information generation model.
Example 4 provides the method of example 2, further comprising, prior to performing model training: and pre-training the voice recognition model and the description information generation model.
Example 5 provides the method of example 4, the pre-training the speech recognition model and the description information generation model, comprising: acquiring second reference text data; pre-training the speech recognition model by taking the second reference text data and the output of the speech recognition model as the input of the speech synthesis model, taking the output of the speech synthesis model as the input of the speech recognition model, and taking the second reference text data as the target output of the speech recognition model; and pre-training the description information generation model by taking the second reference text data, the output of the description information generation model and the output of the pre-trained voice recognition model as the input of the image generation model, taking the output of the image generation model as the input of the description information generation model and taking the second reference text data as the target output of the description information generation model.
Example 6 provides the method of example 4, the pre-training the speech recognition model and the description information generation model, comprising: acquiring second reference text data; pre-training the description information generation model by taking the output of the second reference text data and the description information generation model as the input of the image generation model, taking the output of the image generation model as the input of the description information generation model, and taking the second reference text data as the target output of the description information generation model; and pre-training the voice recognition model by taking the second reference text data, the output of the voice recognition model and the output of the description information generation model obtained after pre-training as the input of the voice synthesis model, taking the output of the voice synthesis model as the input of the voice recognition model and taking the second reference text data as the target output of the voice recognition model.
Example 7 provides, in accordance with one or more embodiments of the present disclosure, a speech recognition apparatus comprising: an acquisition module, used for acquiring target video data, where the target video data includes target audio data and target image data; a first extraction module, used for extracting first text data corresponding to the target audio data acquired by the acquisition module; a second extraction module, used for extracting feature information of the target image data acquired by the acquisition module and generating, according to the feature information, second text data for describing the target image data; and a correction module, used for correcting the first text data extracted by the first extraction module according to the second text data extracted by the second extraction module to obtain corrected first text data.
Example 8 provides the apparatus of example 7, the first extraction module to input the target audio data into a speech recognition model to obtain first text data corresponding to the target audio data; the second extraction module is used for inputting the target image data into a description information generation model, extracting feature information of the target image data through the description information generation model, and generating second text data for describing the target image data according to the feature information; wherein the training samples of the speech recognition model are generated by a speech synthesis model, and the training samples of the description information generation model are generated by an image generation model.
Example 9 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-6, in accordance with one or more embodiments of the present disclosure.
Example 10 provides, in accordance with one or more embodiments of the present disclosure, an electronic device comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to carry out the steps of the method of any of examples 1-6.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and the technical principles employed. It should be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the above features, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
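It may also help to see the joint training flow recited in claim 1 below written out as one possible, non-authoritative Python sketch: the first reference text data plus the current outputs of both models drive the speech synthesis model and the image generation model, and the reference text is the target output on both branches. All function names and signatures are hypothetical placeholders, as is the choice to carry each model's previous output forward as the feedback input.

```python
from typing import Callable, Sequence


def joint_training_epoch(
    texts: Sequence[str],
    synthesize: Callable[[str, str, str], object],      # (reference text, ASR output, caption) -> audio
    generate_image: Callable[[str, str, str], object],  # (reference text, ASR output, caption) -> image
    transcribe: Callable[[object], str],                # speech recognition model
    describe: Callable[[object], str],                  # description information generation model
    update_recognizer: Callable[[str, str], None],      # (predicted text, target text)
    update_captioner: Callable[[str, str], None],       # (predicted caption, target text)
) -> None:
    asr_output, caption_output = "", ""                 # assumed empty before the first sample
    for text in texts:
        # Speech branch: the synthesized audio feeds the speech recognition model,
        # whose target output is the reference text.
        audio = synthesize(text, asr_output, caption_output)
        asr_output = transcribe(audio)
        update_recognizer(asr_output, text)

        # Image branch: the generated image feeds the description information
        # generation model, whose target output is again the reference text.
        image = generate_image(text, asr_output, caption_output)
        caption_output = describe(image)
        update_captioner(caption_output, text)
```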

Claims (7)

1. A speech recognition method, comprising:
acquiring target video data, wherein the target video data comprises target audio data and target image data;
inputting the target audio data into a speech recognition model to obtain first text data corresponding to the target audio data;
inputting the target image data into a description information generation model, extracting feature information of the target image data through the description information generation model, and generating second text data for describing the target image data according to the feature information;
according to the second text data, correcting the first text data to obtain corrected first text data;
the speech recognition model and the description information generation model are obtained by training in the following way:
acquiring first reference text data;
model training is performed to obtain the speech recognition model and the description information generation model by taking the first reference text data, the output of the speech recognition model and the output of the description information generation model as inputs of a speech synthesis model, taking the output of the speech synthesis model as an input of the speech recognition model, taking the first reference text data as a target output of the speech recognition model, taking the first reference text data, the output of the speech recognition model and the output of the description information generation model as inputs of an image generation model, taking the output of the image generation model as an input of the description information generation model, and taking the first reference text data as a target output of the description information generation model.
2. The method of claim 1, wherein prior to performing model training, the method further comprises:
and pre-training the voice recognition model and the description information generation model.
3. The method of claim 2, wherein the pre-training the speech recognition model and the description information generation model comprises:
acquiring second reference text data;
pre-training the speech recognition model by taking the second reference text data and the output of the speech recognition model as the input of the speech synthesis model, taking the output of the speech synthesis model as the input of the speech recognition model, and taking the second reference text data as the target output of the speech recognition model;
and pre-training the description information generation model by taking the second reference text data, the output of the description information generation model and the output of the pre-trained speech recognition model as the input of the image generation model, taking the output of the image generation model as the input of the description information generation model, and taking the second reference text data as the target output of the description information generation model.
4. The method of claim 2, wherein the pre-training the speech recognition model and the description information generation model comprises:
acquiring second reference text data;
pre-training the description information generation model by taking the second reference text data and the output of the description information generation model as the input of the image generation model, taking the output of the image generation model as the input of the description information generation model, and taking the second reference text data as the target output of the description information generation model;
and pre-training the speech recognition model by taking the second reference text data, the output of the speech recognition model and the output of the pre-trained description information generation model as the input of the speech synthesis model, taking the output of the speech synthesis model as the input of the speech recognition model, and taking the second reference text data as the target output of the speech recognition model.
5. A speech recognition apparatus, comprising:
an acquisition module, which is used for acquiring target video data, wherein the target video data comprises target audio data and target image data;
the first extraction module is used for inputting the target audio data into a voice recognition model so as to obtain first text data corresponding to the target audio data;
the second extraction module is used for inputting the target image data into a description information generation model, extracting feature information of the target image data through the description information generation model, and generating, according to the feature information, second text data for describing the target image data;
the correction module is used for correcting the first text data extracted by the first extraction module according to the second text data extracted by the second extraction module to obtain corrected first text data;
the speech recognition model and the description information generation model are obtained by training in the following way:
acquiring first reference text data;
model training is performed to obtain the speech recognition model and the description information generation model by taking the first reference text data, the output of the speech recognition model and the output of the description information generation model as inputs of a speech synthesis model, taking the output of the speech synthesis model as an input of the speech recognition model, taking the first reference text data as a target output of the speech recognition model, taking the first reference text data, the output of the speech recognition model and the output of the description information generation model as inputs of an image generation model, taking the output of the image generation model as an input of the description information generation model, and taking the first reference text data as a target output of the description information generation model.
6. A computer-readable medium, on which a computer program is stored, characterized in that the program, when executed by processing means, carries out the steps of the method of any one of claims 1-4.
7. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 4.