CN111161724B - Method, system, equipment and medium for Chinese audio-visual combined speech recognition - Google Patents

Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Info

Publication number
CN111161724B
CN111161724B
Authority
CN
China
Prior art keywords
character sequence
sequence
audio
recognition model
Chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911297060.3A
Other languages
Chinese (zh)
Other versions
CN111161724A (en)
Inventor
郭永亮
张坤雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aiways Automobile Co Ltd
Original Assignee
Aiways Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aiways Automobile Co Ltd filed Critical Aiways Automobile Co Ltd
Priority to CN201911297060.3A priority Critical patent/CN111161724B/en
Publication of CN111161724A publication Critical patent/CN111161724A/en
Application granted granted Critical
Publication of CN111161724B publication Critical patent/CN111161724B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a method, a system, a device, and a medium for Chinese audio-visual combined speech recognition, wherein the method comprises the following steps: respectively receiving a video signal and an audio signal to be recognized; inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the pinyin character sequence recognition model; and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the Chinese character sequence recognition model. The invention provides an end-to-end audio-visual combined speech recognition scheme for Chinese at the sentence level; it combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio and video signals, which helps improve the recognition capability of the speech recognition system.

Description

Method, system, equipment and medium for Chinese audio-visual combined speech recognition
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method, system, device, and medium for Chinese audio-visual combined speech recognition.
Background
Speech recognition technology allows a machine to convert speech signals into corresponding text or commands through recognition and understanding. Over the last two decades, speech recognition has made remarkable progress and has begun to move from the laboratory to the market; in recent years in particular, driven by advances in artificial intelligence, speech recognition research has achieved major breakthroughs. Speech recognition technology is now widely applied in vehicle-mounted systems, social chat, smart homes, and other fields, bringing great convenience to people's lives and demonstrating strong practicality.
In general, the input of speech recognition is only an audio signal. If the recognition process can use audio and video signals simultaneously, the two signals complement each other, making the input information richer and improving recognition accuracy.
For audio-visual combined speech recognition that uses audio and video signals simultaneously, there is currently almost no solution designed specifically for Chinese: most solutions target English speech recognition or do not specify a language. Chinese, however, has its own particularities; for example, Chinese words have no strict morphological inflection, and the number of commonly used Chinese characters is large (about 3,500). These characteristics make the audio-visual Chinese speech recognition task challenging, so solutions designed for English or for no particular language cannot be applied directly to Chinese. In addition, some existing audio-visual speech recognition schemes only solve word-level recognition and cannot handle the sentence-level recognition of continuous speech; others use traditional machine learning methods that require manual feature extraction, which is tedious and limits the final recognition accuracy.
Disclosure of Invention
To address the problems in the prior art, the invention provides a deep-learning-based method, system, device, and medium for Chinese audio-visual combined speech recognition: an end-to-end audio-visual combined speech recognition scheme for Chinese at the sentence level.
The embodiment of the invention provides a Chinese audio-visual combined speech recognition method, comprising the following steps:
respectively receiving a video signal and an audio signal to be recognized;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
Optionally, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism to obtain a fused characteristic sequence;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
Optionally, the video encoder extracting a feature sequence of the video signal includes the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image characteristics of each frame of image output by the convolutional neural network;
inputting the image features into a first recurrent neural network, extracting time sequence features among the image frame sequences, and taking an output sequence of the first recurrent neural network as a feature sequence of the video signal.
Optionally, the audio encoder extracting a feature sequence of the audio signal includes:
calculating mel-frequency cepstrum coefficient values of the audio signal;
and inputting the Mel frequency cepstrum coefficient value into a second recurrent neural network, and extracting the characteristic sequence of the audio signal.
Optionally, the second recurrent neural network comprises three long-short term memory layers;
the audio encoder fusing the feature sequence of the video signal and the feature sequence of the audio signal through the attention mechanism comprises: the audio encoder fusing, in the top long short-term memory layer of the second recurrent neural network and based on the attention mechanism, the state data of that top layer with the feature sequence of the video signal.
Optionally, the first decoder outputs a pinyin character sequence according to the fused feature sequence, and includes the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
Optionally, the chinese character sequence recognition model includes an encoder and a second decoder;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model, and the method comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence;
inputting the features of the pinyin character sequence into a fifth recurrent neural network by the second decoder, wherein the fifth recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the Chinese character sequence from the output of the classification layer of the fifth recurrent neural network.
Optionally, the fourth recurrent neural network and the fifth recurrent neural network each include two gated recurrent unit layers.
Optionally, the method further includes training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model.
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition system, applied to the above Chinese audio-visual combined speech recognition method, the system comprising:
the signal receiving module is used for respectively receiving a video signal and an audio signal to be recognized;
the pinyin identification module is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
and the Chinese character recognition module is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The embodiment of the invention also provides a device for recognizing Chinese audio-visual combined speech, which comprises:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method via execution of the executable instructions.
The embodiment of the invention also provides a computer-readable storage medium for storing a program which, when executed, implements the steps of the Chinese audio-visual combined speech recognition method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The method, system, device, and medium for Chinese audio-visual combined speech recognition provided by the invention have the following advantages:
The invention solves the problems in the prior art by providing an end-to-end audio-visual speech recognition scheme for Chinese at the sentence level, filling a gap in this technical field. Furthermore, the audio-visual combined speech recognition model is built with deep neural networks, so no manual feature extraction is needed anywhere in the process; compared with traditional machine learning approaches, this extracts the features of the audio and video signals well and helps improve the recognition capability of the speech recognition system. Finally, the invention combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio signal and the video signal, so the recognition accuracy is greatly improved compared with simple multi-modal feature concatenation.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a Chinese audio-visual combined speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of recognizing sentences from audio and video signals according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a pinyin character sequence recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a Chinese character sequence recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Chinese audio-visual combined speech recognition system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a Chinese audio-visual combined speech recognition device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As shown in fig. 1, to solve the problems of the prior art, the present invention provides a Chinese audio-visual combined speech recognition method, which comprises the following steps:
S100: respectively receiving a video signal and an audio signal to be recognized;
s200: inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
s300: and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The invention provides an end-to-end audio-visual combined speech recognition method: given a video signal and an audio signal as input, it directly obtains the corresponding Chinese character sequence, i.e., a sentence containing the spoken content. Moreover, the invention is designed specifically for Chinese character recognition: the pinyin character sequence is obtained from the video and audio signals in step S200, and Chinese character recognition is then performed in step S300, so the audio and video signals are fully mined and fused and the recognition accuracy is greatly improved.
As shown in fig. 2, the inputs to the pinyin character sequence recognition model are the audio signal of a person speaking and the picture-frame sequence of the speaker's lip motion, i.e., the video signal. The pinyin character sequence recognition model fuses the input audio and video signals and finally outputs a pinyin character sequence; this process can be expressed by the following formula:
p = av2p(a, v) (1)

where a = (a_1, a_2, …, a_n) denotes the audio signal sequence, v = (v_1, v_2, …, v_m) denotes the picture-frame sequence of the lip motion, p = (p_1, p_2, …, p_k) denotes the recognized pinyin character sequence, and av2p(·) denotes the pinyin character sequence recognition model that converts the audio and video signals into a pinyin character sequence.
The input of the Chinese character sequence recognition model is the pinyin character sequence output by the pinyin character sequence recognition model; by learning the features of the pinyin character sequence, it outputs the Chinese character sequence, i.e., the spoken content. This process can be expressed by the following formula:
s = p2s(p) (2)

where p = (p_1, p_2, …, p_k) denotes the input pinyin character sequence, s = (c_1, c_2, …, c_l) denotes the recognized Chinese character sequence, i.e., the sentence, c_i denotes the i-th Chinese character in the sentence s, and p2s(·) denotes the Chinese character sequence recognition model that converts the pinyin character sequence into a Chinese character sequence.
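For illustration, the following minimal Python sketch shows the two-stage pipeline implied by formulas (1) and (2); the function and model names (recognize, av2p_model, p2s_model) are ours, not the patent's, and the internals of the two models are described below.

```python
# Minimal sketch of the two-stage pipeline of formulas (1) and (2).
# av2p_model and p2s_model are placeholders for the two trained models
# described in the following sections.

def recognize(audio_sequence, lip_frame_sequence, av2p_model, p2s_model):
    """End-to-end recognition: (a, v) -> p -> s."""
    # Formula (1): p = av2p(a, v), pinyin character sequence recognition
    pinyin_sequence = av2p_model(audio_sequence, lip_frame_sequence)
    # Formula (2): s = p2s(p), Chinese character sequence recognition
    hanzi_sequence = p2s_model(pinyin_sequence)
    return hanzi_sequence
```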
The following describes the working process of the pinyin character sequence recognition model and the Chinese character sequence recognition model in an embodiment, with reference to figs. 3 and 4.
As shown in fig. 3, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder. Specifically, in this embodiment, the pinyin character sequence recognition model may be a sequence-to-sequence (Seq2Seq) model based on the attention mechanism.
The step S200: inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
s210: the video encoder extracts a characteristic sequence of the video signal;
s220: the audio encoder extracts a characteristic sequence of the audio signal;
s230: the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism to obtain a fused characteristic sequence;
s240: and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
The step S210: the video encoder extracting the characteristic sequence of the video signal comprises the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image features of each frame output by the convolutional neural network. The convolutional neural network may be a residual network (ResNet), whose input may be a three-channel RGB image. A convolutional neural network (CNN) is a class of feed-forward neural networks that performs convolution computations and has a deep structure; a residual network is easy to optimize and can gain accuracy from considerably increased depth, because its internal residual blocks use skip connections, which alleviates the vanishing-gradient problem caused by increasing depth in deep neural networks;
inputting the image features into a first recurrent neural network to extract the temporal features across the image frame sequence, and taking the output sequence of the first recurrent neural network as the feature sequence of the video signal. In this embodiment, the first recurrent neural network may be a three-layer long short-term memory (LSTM) network that extracts the temporal features across the image frames, and the output sequence of the top LSTM layer is taken as the feature sequence of the video signal output by the video encoder. A recurrent neural network (RNN) takes sequence data as input, recurses along the direction of the sequence, and chains all of its recurrent units together; an LSTM network is a special kind of RNN that can learn long-term dependencies.
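As a concrete illustration, the following PyTorch sketch implements a video encoder of the kind described above: a ResNet backbone extracts per-frame image features and a three-layer LSTM extracts the temporal features across frames. The choice of ResNet-18 and the layer widths are our assumptions, not fixed by the patent.

```python
import torch.nn as nn
from torchvision.models import resnet18

class VideoEncoder(nn.Module):
    """Sketch: ResNet per-frame features + three-layer LSTM over time."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        cnn = resnet18(weights=None)   # residual network, 3-channel RGB input
        cnn.fc = nn.Identity()         # keep the 512-d pooled feature
        self.cnn = cnn
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=3,
                            batch_first=True)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W) RGB lip-region image frames
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))   # (b*t, 512) per-frame features
        x = x.view(b, t, -1)                 # back to a sequence
        out, _ = self.lstm(x)                # temporal features across frames
        return out                           # top-layer output = video feature sequence
```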
The step S220: the audio encoder extracts the characteristic sequence of the audio signal, and comprises the following steps:
calculating the Mel-frequency cepstral coefficient (MFCC) values of the audio signal; Mel-frequency cepstral coefficients are a feature widely used in automatic speech and speaker recognition;
inputting the Mel-frequency cepstral coefficient values into a second recurrent neural network to extract the feature sequence of the audio signal; in this embodiment, the second recurrent neural network may also be a three-layer long short-term memory network, and the state data output by its top layer is used as the feature sequence of the audio signal.
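A matching sketch of the audio encoder follows, assuming torchaudio's MFCC transform and a 16 kHz mono waveform; the MFCC order and layer widths are illustrative assumptions.

```python
import torch.nn as nn
import torchaudio

class AudioEncoder(nn.Module):
    """Sketch: MFCC features + three-layer LSTM; the top-layer state
    sequence is the audio feature sequence (the attention query in S230)."""

    def __init__(self, sample_rate=16000, n_mfcc=13, hidden_dim=256):
        super().__init__()
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate,
                                               n_mfcc=n_mfcc)
        self.lstm = nn.LSTM(n_mfcc, hidden_dim, num_layers=3,
                            batch_first=True)

    def forward(self, waveform):
        # waveform: (batch, samples), 16 kHz mono audio
        feats = self.mfcc(waveform)      # (batch, n_mfcc, frames)
        feats = feats.transpose(1, 2)    # (batch, frames, n_mfcc)
        out, _ = self.lstm(feats)        # top-layer state sequence
        return out
```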
The step S230: the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism, and the method comprises the following steps:
the audio encoder fuses, in the top long short-term memory layer of the second recurrent neural network and based on the attention mechanism, the state data of that top layer with the feature sequence of the video signal. Specifically, the fusion employs the following formulas (3) and (4):

a_ij = score(value_j, query_i) (3)

C_i = Σ_j a_ij · value_j (4)

where query denotes the state data of the top LSTM of the audio encoder, value denotes the output of the top LSTM of the video encoder, a_ij denotes the attention score, and C_i denotes the fused feature; after fusion through the attention mechanism, the output of the top LSTM of the audio encoder is the fused feature of the audio and video signals.
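The following sketch implements formulas (3) and (4) over the top-layer sequences of the two encoders. The patent leaves the score(·) function unspecified, so a dot product is assumed here; and for simplicity the fusion is applied to the top-layer outputs rather than inside the LSTM layer itself, as the patent describes.

```python
import torch
import torch.nn.functional as F

def attention_fuse(query, value):
    """Formulas (3) and (4): attention fusion of audio and video features.

    query: (batch, Tq, d) top-layer LSTM states of the audio encoder
    value: (batch, Tv, d) top-layer LSTM outputs of the video encoder
    A dot-product score is assumed, so both sequences share width d.
    """
    # Formula (3): a_ij = score(value_j, query_i), normalized over j
    scores = torch.bmm(query, value.transpose(1, 2))   # (batch, Tq, Tv)
    a = F.softmax(scores, dim=-1)
    # Formula (4): C_i = sum_j a_ij * value_j
    context = torch.bmm(a, value)                      # (batch, Tq, d)
    # Each audio step is enriched with its video context; concatenation
    # is one simple way to expose both to the decoder.
    return torch.cat([query, context], dim=-1)         # (batch, Tq, 2d)
```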
The step S240: the first decoder outputs a pinyin character sequence according to the fused feature sequence, comprising the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer; in this embodiment, the feature extraction layer of the third recurrent neural network may adopt a single long short-term memory layer with a four-head attention mechanism to improve performance, and the classification layer may adopt a softmax layer, which outputs the predicted pinyin character sequence p = (p_1, p_2, …, p_k);
the pinyin character sequence is obtained from the output of the classification layer of the third recurrent neural network.
When the pinyin character sequence recognition model is trained, the input of the decoder is g = (g_1, g_2, …, g_k), where g_i denotes the ground-truth sample label; after an LSTM layer and a softmax layer, the predicted pinyin character sequence p = (p_1, p_2, …, p_k) is finally obtained. In the test phase, g_i denotes the output p_i of the network at time i.
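A sketch of the first decoder under the same assumptions: a single LSTM feature-extraction layer, four-head attention over the fused feature sequence, and a softmax classification layer. The vocabulary size and widths are illustrative (fused_dim=512 matches the 2d-wide output of the attention_fuse sketch above), and the input g is the ground-truth label sequence during training or the model's own previous outputs at test time, as described above.

```python
import torch.nn as nn

class PinyinDecoder(nn.Module):
    """Sketch: LSTM feature extraction + 4-head attention + softmax layer."""

    def __init__(self, vocab_size, d_model=256, fused_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          kdim=fused_dim, vdim=fused_dim,
                                          batch_first=True)
        self.classifier = nn.Linear(d_model, vocab_size)  # softmax layer

    def forward(self, g, fused):
        # g: (batch, k) pinyin ids -- ground truth g_i when training
        # (teacher forcing), the previous output p_i when testing.
        # fused: (batch, T, fused_dim) fused audio-visual feature sequence.
        x = self.embed(g)
        h, _ = self.lstm(x)                   # feature extraction layer
        ctx, _ = self.attn(h, fused, fused)   # attend over fused features
        logits = self.classifier(h + ctx)
        return logits.log_softmax(dim=-1)     # distribution over pinyin chars
```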
As shown in fig. 4, in this embodiment, the Chinese character sequence recognition model includes an encoder and a second decoder. Specifically, in this embodiment, the Chinese character sequence recognition model is also implemented based on the Seq2Seq framework. The step S300: inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the Chinese character sequence recognition model, comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence; the input of the coder is a pinyin character sequence p = (p) 1 ,p 2 ,…,p k ),p i In this embodiment, the fourth Recurrent neural network includes two layers of Gated Recurrent Unit (GRU) networks, and the output sequence is O = (O) 1 ,O 2 ,…,O k ) Output at time i i For parameterizing the input p at the next moment i+1 The predicted distribution of (2); the gated cyclic unit is a commonly used gated cyclic neural network, and the gated cyclic neural network is proposed to better capture the dependence relationship with larger time step distance in a time sequence.
The second decoder inputs the characteristics of the pinyin character sequence into a fifth recurrent neural network, wherein the fifth recurrent neural network comprises a characteristic extraction layer and a classification layer; in this embodiment, the fifth recurrent neural network may include two layers of gated recurrent units, and the classification layer may be a softmax layer;
The Chinese character sequence s = (c_1, c_2, …, c_l) is obtained from the output of the classification layer of the fifth recurrent neural network.
In the training stage of the Chinese character sequence recognition model, the input of the decoder is the Chinese character sequence y = (y_1, y_2, …, y_l), where y_i denotes the ground-truth sample label; it is input into the fifth recurrent neural network for computation.
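A sketch of the Chinese character sequence recognition model under the Seq2Seq framing above: a two-layer GRU encoder over the pinyin characters and a two-layer GRU decoder feeding a softmax classification layer. Vocabulary sizes and widths are illustrative assumptions.

```python
import torch.nn as nn

class PinyinToHanzi(nn.Module):
    """Sketch: 2-layer GRU encoder + 2-layer GRU decoder + softmax layer."""

    def __init__(self, pinyin_vocab, hanzi_vocab, d_model=256):
        super().__init__()
        self.src_embed = nn.Embedding(pinyin_vocab, d_model)
        self.tgt_embed = nn.Embedding(hanzi_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, num_layers=2,
                              batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, num_layers=2,
                              batch_first=True)
        self.classifier = nn.Linear(d_model, hanzi_vocab)  # softmax layer

    def forward(self, p, y):
        # p: (batch, k) pinyin character ids p_1..p_k
        # y: (batch, l) Chinese character ids y_1..y_l (training labels)
        _, state = self.encoder(self.src_embed(p))    # pinyin features
        out, _ = self.decoder(self.tgt_embed(y), state)
        return self.classifier(out).log_softmax(dim=-1)
```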
The Chinese audio-visual combined speech recognition system constructed in the invention comprises two models, the pinyin character sequence recognition model and the Chinese character sequence recognition model, and is therefore a multi-model structure; to obtain the final recognition system, the problem of multi-model training must be solved. In this embodiment, the Chinese audio-visual combined speech recognition method further comprises training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model until the Chinese character sequence recognition model converges;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model until the pinyin character sequence recognition model is converged;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model to converge, thus obtaining the whole Chinese audio-visual combined speech recognition system.
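The three-stage schedule can be sketched as follows; train_one_epoch and has_converged are hypothetical placeholders standing in for the usual optimization loop and stopping criterion, and only the freeze/unfreeze pattern follows the steps above.

```python
def set_trainable(model, flag):
    # Freeze or unfreeze all parameters of a sub-model.
    for param in model.parameters():
        param.requires_grad = flag

def staged_training(av2p_model, p2s_model, train_one_epoch, has_converged):
    # Stage 1: fix the pinyin model, adjust the Chinese character model.
    set_trainable(av2p_model, False)
    set_trainable(p2s_model, True)
    while not has_converged(p2s_model):
        train_one_epoch(av2p_model, p2s_model)

    # Stage 2: fix the Chinese character model, adjust the pinyin model.
    set_trainable(p2s_model, False)
    set_trainable(av2p_model, True)
    while not has_converged(av2p_model):
        train_one_epoch(av2p_model, p2s_model)

    # Stage 3: train the whole network jointly until convergence.
    set_trainable(av2p_model, True)
    set_trainable(p2s_model, True)
    while not has_converged(av2p_model, p2s_model):
        train_one_epoch(av2p_model, p2s_model)
```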
As shown in fig. 5, an embodiment of the present invention further provides a Chinese audio-visual combined speech recognition system, applied to the above Chinese audio-visual combined speech recognition method, the system comprising:
a signal receiving module M100, configured to respectively receive a video signal and an audio signal to be recognized;
the pinyin identification module M200 is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
and the Chinese character recognition module M300 is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The invention provides an end-to-end audio-visual combined speech recognition system: given a video signal and an audio signal as input, the method of the invention directly obtains the corresponding Chinese character sequence. Moreover, the system is designed specifically for Chinese character recognition: the pinyin recognition module M200 obtains a pinyin character sequence from the video and audio signals, and the Chinese character recognition module M300 then performs Chinese character recognition, so the audio and video signals are fully mined and fused and the recognition accuracy is greatly improved.
The pinyin character sequence recognition model of the invention may have the structure shown in fig. 3, and the Chinese character sequence recognition model may have the structure shown in fig. 4. The pinyin recognition module M200 may obtain the pinyin character sequence through the specific implementation of steps S210 to S240 described above, but the invention is not limited thereto. The Chinese character recognition module M300 may obtain the Chinese character sequence through the specific implementation of step S300, but the invention is not limited thereto.
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition device, comprising a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general-purpose computing device. Components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610), a display unit 640, and so on.
The storage unit stores program code executable by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the method sections above of this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 can be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiment of the invention also provides a computer-readable storage medium storing a program which, when executed, implements the steps of the Chinese audio-visual combined speech recognition method. In some possible embodiments, aspects of the present invention may also be implemented as a program product comprising program code which, when run on a terminal device, causes the terminal device to perform the steps according to the various exemplary embodiments of the invention described in the method sections above of this specification.
Referring to fig. 7, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In summary, compared with the prior art, the method, system, device, and medium for Chinese audio-visual combined speech recognition provided by the invention have the following advantages:
The invention solves the problems in the prior art by providing an end-to-end audio-visual speech recognition scheme for Chinese at the sentence level, filling a gap in this technical field. Furthermore, the audio-visual combined speech recognition model is built with deep neural networks, so no manual feature extraction is needed anywhere in the process; compared with traditional machine learning approaches, this extracts the features of the audio and video signals well and helps improve the recognition capability of the speech recognition system. Finally, the invention combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio signal and the video signal, so the recognition accuracy is greatly improved compared with simple multi-modal feature concatenation.
The foregoing describes the invention in further detail with reference to specific preferred embodiments, but the invention is not limited to these details. Those skilled in the art can make several simple deductions or substitutions without departing from the spirit of the invention, and all of these shall be considered to fall within the protection scope of the invention.

Claims (11)

1. A Chinese audio-visual combined speech recognition method is characterized by comprising the following steps:
respectively receiving a video signal and an audio signal to be recognized;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model;
the pinyin character sequence identification model comprises a video encoder, an audio encoder and a first decoder;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism to obtain a fused feature sequence, wherein the fusion adopts the following formulas (1) and (2):

a_ij = score(value_j, query_i) (1)

C_i = Σ_j a_ij · value_j (2)

wherein a_ij denotes the attention score, C_i denotes the fused feature obtained after fusion, query denotes the state data of the top LSTM of the audio encoder, value denotes the output of the top LSTM of the video encoder, and after fusion through the attention mechanism the output of the top LSTM of the audio encoder is the fused feature of the audio and video signals;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
2. The method of claim 1, wherein the video encoder extracts the feature sequence of the video signal, and comprises the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image characteristics of each frame of image output by the convolutional neural network;
inputting the image features into a first recurrent neural network, extracting time sequence features among the image frame sequences, and taking an output sequence of the first recurrent neural network as a feature sequence of the video signal.
3. The method of claim 1, wherein the audio encoder extracts the feature sequence of the audio signal, and comprises the following steps:
calculating mel-frequency cepstrum coefficient values of the audio signal;
and inputting the Mel frequency cepstrum coefficient value into a second recurrent neural network, and extracting the characteristic sequence of the audio signal.
4. The method of claim 3, wherein the second recurrent neural network comprises three layers of long-short term memory;
the audio encoder fusing the feature sequence of the video signal and the feature sequence of the audio signal through the attention mechanism comprises: the audio encoder fusing, in the top long short-term memory layer of the second recurrent neural network and based on the attention mechanism, the state data of that top layer with the feature sequence of the video signal.
5. The method of claim 1, wherein the first decoder outputs a Pinyin character sequence according to the fused feature sequence, comprising the steps of:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
6. The method of claim 1, wherein the Chinese character sequence recognition model comprises an encoder and a second decoder;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model, and the method comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence;
inputting the features of the pinyin character sequence into a fifth recurrent neural network by the second decoder, wherein the fifth recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the Chinese character sequence from the output of the classification layer of the fifth recurrent neural network.
7. The method of claim 6, wherein the fourth recurrent neural network and the fifth recurrent neural network each comprise two gated recurrent unit layers.
8. The method of claim 1, further comprising training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model.
9. A Chinese audio-visual combined speech recognition system, applied to the Chinese audio-visual combined speech recognition method of any one of claims 1 to 8, the system comprising:
the signal receiving module is used for respectively receiving a video signal and an audio signal to be recognized;
the pinyin identification module is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
the Chinese character recognition module is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model;
the pinyin character sequence identification model comprises a video encoder, an audio encoder and a first decoder;
the pinyin identification module is specifically configured to:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism to obtain a fused feature sequence, wherein the fusion adopts the following formulas (1) and (2):

a_ij = score(value_j, query_i) (1)

C_i = Σ_j a_ij · value_j (2)

wherein a_ij denotes the attention score, C_i denotes the fused feature obtained after fusion, query denotes the state data of the top LSTM of the audio encoder, value denotes the output of the top LSTM of the video encoder, and after fusion through the attention mechanism the output of the top LSTM of the audio encoder is the fused feature of the audio and video signals;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
10. A Chinese audio-visual combined speech recognition device, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method of any one of claims 1 to 8 via execution of the executable instructions.
11. A computer-readable storage medium storing a program, wherein the program, when executed, implements the steps of the Chinese audio-visual combined speech recognition method of any one of claims 1 to 8.
CN201911297060.3A 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition Active CN111161724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911297060.3A CN111161724B (en) 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911297060.3A CN111161724B (en) 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Publications (2)

Publication Number Publication Date
CN111161724A CN111161724A (en) 2020-05-15
CN111161724B true CN111161724B (en) 2022-12-13

Family

ID=70557201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911297060.3A Active CN111161724B (en) 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Country Status (1)

Country Link
CN (1) CN111161724B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102830B (en) * 2020-09-14 2023-07-25 广东工业大学 Coarse granularity instruction identification method and device
CN112786052B (en) * 2020-12-30 2024-05-31 科大讯飞股份有限公司 Speech recognition method, electronic equipment and storage device
CN113033538B (en) * 2021-03-25 2024-05-10 北京搜狗科技发展有限公司 Formula identification method and device

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101825953A (en) * 2010-04-06 2010-09-08 朱建政 Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN102347026A (en) * 2011-07-04 2012-02-08 深圳市子栋科技有限公司 Audio/video on demand method and system based on natural voice recognition
CN108073875A (en) * 2016-11-14 2018-05-25 广东技术师范学院 A kind of band noisy speech identifying system and method based on monocular cam
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US7587318B2 (en) * 2002-09-12 2009-09-08 Broadcom Corporation Correlating video images of lip movements with audio signals to improve speech recognition

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101825953A (en) * 2010-04-06 2010-09-08 朱建政 Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN102347026A (en) * 2011-07-04 2012-02-08 深圳市子栋科技有限公司 Audio/video on demand method and system based on natural voice recognition
CN108073875A (en) * 2016-11-14 2018-05-25 广东技术师范学院 A kind of band noisy speech identifying system and method based on monocular cam
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information

Non-Patent Citations (1)

Title
An audio-video continuous speech recognition system based on data sieving; Xie Lei et al.; 《计算机应用》 (Computer Applications); 2003-07-28 (Issue 07); pp. 1-3 *

Also Published As

Publication number Publication date
CN111161724A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
US11741355B2 (en) Training of student neural network with teacher neural networks
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN111402861B (en) Voice recognition method, device, equipment and storage medium
US11610108B2 (en) Training of student neural network with switched teacher neural networks
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
US20150325240A1 (en) Method and system for speech input
WO2022134894A1 (en) Speech recognition method and apparatus, computer device, and storage medium
CN111161724B (en) Method, system, equipment and medium for Chinese audio-visual combined speech recognition
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
KR20170022445A (en) Apparatus and method for speech recognition based on unified model
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN114596844B (en) Training method of acoustic model, voice recognition method and related equipment
CN114676234A (en) Model training method and related equipment
CN109726397B (en) Labeling method and device for Chinese named entities, storage medium and electronic equipment
CN110263218B (en) Video description text generation method, device, equipment and medium
CN110991175B (en) Method, system, equipment and storage medium for generating text in multi-mode
JP7178394B2 (en) Methods, apparatus, apparatus, and media for processing audio signals
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
WO2023082931A1 (en) Method for punctuation recovery in speech recognition, and device and storage medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN115828889A (en) Text analysis method, emotion classification model, device, medium, terminal and product
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PP01 Preservation of patent right
Effective date of registration: 20230131
Granted publication date: 20221213
PD01 Discharge of preservation of patent
Date of cancellation: 20240108
Granted publication date: 20221213
PP01 Preservation of patent right
Effective date of registration: 20240227
Granted publication date: 20221213