CN111161724B - Method, system, equipment and medium for Chinese audio-visual combined speech recognition - Google Patents

Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Info

Publication number
CN111161724B
CN111161724B
Authority
CN
China
Prior art keywords
character sequence
sequence
audio
recognition model
Chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911297060.3A
Other languages
Chinese (zh)
Other versions
CN111161724A (en)
Inventor
郭永亮
张坤雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aiways Automobile Co Ltd
Original Assignee
Aiways Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aiways Automobile Co Ltd filed Critical Aiways Automobile Co Ltd
Priority to CN201911297060.3A priority Critical patent/CN111161724B/en
Publication of CN111161724A publication Critical patent/CN111161724A/en
Application granted granted Critical
Publication of CN111161724B publication Critical patent/CN111161724B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a method, a system, a device, and a medium for Chinese audio-visual combined speech recognition, wherein the method comprises the following steps: respectively receiving a video signal and an audio signal to be recognized; inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain the pinyin character sequence output by the pinyin character sequence recognition model; and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the Chinese character sequence recognition model. The invention provides an end-to-end audio-visual combined speech recognition scheme for Chinese at the sentence level; it combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio and video signals, which helps improve the recognition capability of the speech recognition system.

Description

Method, system, equipment and medium for Chinese audio-visual combined speech recognition
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method, system, device, and medium for Chinese audio-visual combined speech recognition.
Background
Speech recognition technology allows a machine to convert speech signals into corresponding text or commands through recognition and understanding. Over the last two decades, speech recognition has made remarkable progress and has begun to move from the laboratory to the market; in recent years in particular, driven by advances in artificial intelligence, speech recognition research has achieved major breakthroughs. Speech recognition technology is now widely applied in vehicle-mounted systems, social chat, smart homes, and other fields, bringing great convenience to people's lives and demonstrating strong practicality.
In general, the input of speech recognition is only an audio signal. If the recognition process can use audio and video signals simultaneously, the two signals complement each other, making the input information richer and improving recognition accuracy.
For audio-visual combined speech recognition that uses audio and video signals simultaneously, there is currently almost no solution designed specifically for Chinese: most solutions target English speech recognition or do not specify a language. Chinese, however, has its own particularities; for example, Chinese words have no strict morphological inflection, and the number of commonly used Chinese characters is large (about 3,500). These characteristics make the audio-visual Chinese speech recognition task challenging, so solutions designed for English or for no particular language cannot be applied directly to Chinese. In addition, some existing audio-visual speech recognition schemes only solve word-level recognition and cannot handle the sentence-level recognition of continuous speech; others use traditional machine learning methods that require manual feature extraction, which is tedious and limits the final recognition accuracy.
Disclosure of Invention
To address the problems in the prior art, the invention provides a deep-learning-based method, system, device, and medium for Chinese audio-visual combined speech recognition: an end-to-end audio-visual combined speech recognition scheme for Chinese at the sentence level.
The embodiment of the invention provides a Chinese audio-visual combined speech recognition method, comprising the following steps:
respectively receiving a video signal and an audio signal to be recognized;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
Optionally, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism to obtain a fused characteristic sequence;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
Optionally, the video encoder extracting a feature sequence of the video signal includes the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image characteristics of each frame of image output by the convolutional neural network;
inputting the image features into a first recurrent neural network, extracting time sequence features among the image frame sequences, and taking an output sequence of the first recurrent neural network as a feature sequence of the video signal.
Optionally, the audio encoder extracting a feature sequence of the audio signal includes:
calculating mel-frequency cepstrum coefficient values of the audio signal;
and inputting the Mel frequency cepstrum coefficient value into a second recurrent neural network, and extracting the characteristic sequence of the audio signal.
Optionally, the second recurrent neural network comprises three long-short term memory layers;
the audio encoder fusing the feature sequence of the video signal and the feature sequence of the audio signal through the attention mechanism comprises: the audio encoder fusing, in the top long short-term memory layer of the second recurrent neural network and based on the attention mechanism, the state data of that top layer with the feature sequence of the video signal.
Optionally, the first decoder outputs a pinyin character sequence according to the fused feature sequence, and includes the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
Optionally, the chinese character sequence recognition model includes an encoder and a second decoder;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model, and the method comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence;
inputting the features of the pinyin character sequence into a fifth recurrent neural network by the second decoder, wherein the fifth recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the Chinese character sequence from the output of the classification layer of the fifth recurrent neural network.
Optionally, the fourth recurrent neural network and the fifth recurrent neural network each include two gated recurrent unit layers.
Optionally, the method further includes training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model.
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition system, applied to the above Chinese audio-visual combined speech recognition method, the system comprising:
the signal receiving module is used for respectively receiving a video signal and an audio signal to be recognized;
the pinyin identification module is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
and the Chinese character recognition module is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The embodiment of the invention also provides a device for recognizing Chinese audio-visual combined speech, which comprises:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method via execution of the executable instructions.
The embodiment of the invention also provides a computer-readable storage medium for storing a program which, when executed, implements the steps of the Chinese audio-visual combined speech recognition method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
The method, system, device, and medium for Chinese audio-visual combined speech recognition provided by the invention have the following advantages:
The invention solves the problems in the prior art by providing an end-to-end audio-visual speech recognition scheme for Chinese at the sentence level, filling a gap in this technical field. Furthermore, the audio-visual combined speech recognition model is built with deep neural networks, so no manual feature extraction is needed anywhere in the process; compared with traditional machine learning approaches, this extracts the features of the audio and video signals well and helps improve the recognition capability of the speech recognition system. Finally, the invention combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio signal and the video signal, so the recognition accuracy is greatly improved compared with simple multi-modal feature concatenation.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a Chinese audio-visual combined speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of recognizing sentences from audio and video signals according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a pinyin character sequence recognition model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a Chinese character sequence recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Chinese audio-visual combined speech recognition system according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a Chinese audio-visual combined speech recognition device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As shown in fig. 1, to solve the problems of the prior art, the present invention provides a Chinese audio-visual combined speech recognition method, which comprises the following steps:
S100: respectively receiving a video signal and an audio signal to be recognized;
s200: inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
s300: and inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The invention provides an end-to-end audio-visual combined speech recognition method: given a video signal and an audio signal as input, it directly obtains the corresponding Chinese character sequence, i.e., a sentence containing the spoken content. Moreover, the invention is designed specifically for Chinese character recognition: the pinyin character sequence is obtained from the video and audio signals in step S200, and Chinese character recognition is then performed in step S300, so the audio and video signals are fully mined and fused and the recognition accuracy is greatly improved.
As shown in fig. 2, the inputs to the pinyin character sequence recognition model are the audio signal of a person speaking and the picture-frame sequence of the speaker's lip motion, i.e., the video signal. The pinyin character sequence recognition model fuses the input audio and video signals and finally outputs a pinyin character sequence; this process can be expressed by the following formula:
p = av2p(a, v) (1)

where a = (a_1, a_2, …, a_n) denotes the audio signal sequence, v = (v_1, v_2, …, v_m) denotes the picture-frame sequence of the lip motion, p = (p_1, p_2, …, p_k) denotes the recognized pinyin character sequence, and av2p(·) denotes the pinyin character sequence recognition model that converts the audio and video signals into a pinyin character sequence.
The input of the Chinese character sequence recognition model is the pinyin character sequence output by the pinyin character sequence recognition model; by learning the features of the pinyin character sequence, it outputs the Chinese character sequence, i.e., the spoken content. This process can be expressed by the following formula:
s = p2s(p) (2)

where p = (p_1, p_2, …, p_k) denotes the input pinyin character sequence, s = (c_1, c_2, …, c_l) denotes the recognized Chinese character sequence, i.e., the sentence, c_i denotes the i-th Chinese character in the sentence s, and p2s(·) denotes the Chinese character sequence recognition model that converts the pinyin character sequence into a Chinese character sequence.
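For illustration, the following minimal Python sketch shows the two-stage pipeline implied by formulas (1) and (2); the function and model names (recognize, av2p_model, p2s_model) are ours, not the patent's, and the internals of the two models are described below.

```python
# Minimal sketch of the two-stage pipeline of formulas (1) and (2).
# av2p_model and p2s_model are placeholders for the two trained models
# described in the following sections.

def recognize(audio_sequence, lip_frame_sequence, av2p_model, p2s_model):
    """End-to-end recognition: (a, v) -> p -> s."""
    # Formula (1): p = av2p(a, v), pinyin character sequence recognition
    pinyin_sequence = av2p_model(audio_sequence, lip_frame_sequence)
    # Formula (2): s = p2s(p), Chinese character sequence recognition
    hanzi_sequence = p2s_model(pinyin_sequence)
    return hanzi_sequence
```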
The following describes the working process of the pinyin character sequence recognition model and the Chinese character sequence recognition model in an embodiment, with reference to figs. 3 and 4.
As shown in fig. 3, the pinyin character sequence recognition model includes a video encoder, an audio encoder, and a first decoder. Specifically, in this embodiment, the pinyin character sequence recognition model may be a sequence-to-sequence (Seq2Seq) model based on the attention mechanism.
The step S200: inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
s210: the video encoder extracts a characteristic sequence of the video signal;
s220: the audio encoder extracts a characteristic sequence of the audio signal;
s230: the audio encoder fuses the characteristic sequence of the video signal and the characteristic sequence of the audio signal through an attention mechanism to obtain a fused characteristic sequence;
s240: and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
The step S210: the video encoder extracting the characteristic sequence of the video signal comprises the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image features of each frame output by the convolutional neural network. The convolutional neural network may be a residual network (ResNet), whose input may be a three-channel RGB image. A convolutional neural network (CNN) is a class of feed-forward neural networks that performs convolution computations and has a deep structure; a residual network is easy to optimize and can gain accuracy from considerably increased depth, because its internal residual blocks use skip connections, which alleviates the vanishing-gradient problem caused by increasing depth in deep neural networks;
inputting the image features into a first recurrent neural network to extract the temporal features across the image frame sequence, and taking the output sequence of the first recurrent neural network as the feature sequence of the video signal. In this embodiment, the first recurrent neural network may be a three-layer long short-term memory (LSTM) network that extracts the temporal features across the image frames, and the output sequence of the top LSTM layer is taken as the feature sequence of the video signal output by the video encoder. A recurrent neural network (RNN) takes sequence data as input, recurses along the direction of the sequence, and chains all of its recurrent units together; an LSTM network is a special kind of RNN that can learn long-term dependencies.
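As a concrete illustration, the following PyTorch sketch implements a video encoder of the kind described above: a ResNet backbone extracts per-frame image features and a three-layer LSTM extracts the temporal features across frames. The choice of ResNet-18 and the layer widths are our assumptions, not fixed by the patent.

```python
import torch.nn as nn
from torchvision.models import resnet18

class VideoEncoder(nn.Module):
    """Sketch: ResNet per-frame features + three-layer LSTM over time."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        cnn = resnet18(weights=None)   # residual network, 3-channel RGB input
        cnn.fc = nn.Identity()         # keep the 512-d pooled feature
        self.cnn = cnn
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=3,
                            batch_first=True)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W) RGB lip-region image frames
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))   # (b*t, 512) per-frame features
        x = x.view(b, t, -1)                 # back to a sequence
        out, _ = self.lstm(x)                # temporal features across frames
        return out                           # top-layer output = video feature sequence
```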
The step S220: the audio encoder extracts the characteristic sequence of the audio signal, and comprises the following steps:
calculating the Mel-frequency cepstral coefficient (MFCC) values of the audio signal; Mel-frequency cepstral coefficients are a feature widely used in automatic speech and speaker recognition;
inputting the Mel-frequency cepstral coefficient values into a second recurrent neural network to extract the feature sequence of the audio signal; in this embodiment, the second recurrent neural network may also be a three-layer long short-term memory network, and the state data output by its top layer is used as the feature sequence of the audio signal.
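A matching sketch of the audio encoder follows, assuming torchaudio's MFCC transform and a 16 kHz mono waveform; the MFCC order and layer widths are illustrative assumptions.

```python
import torch.nn as nn
import torchaudio

class AudioEncoder(nn.Module):
    """Sketch: MFCC features + three-layer LSTM; the top-layer state
    sequence is the audio feature sequence (the attention query in S230)."""

    def __init__(self, sample_rate=16000, n_mfcc=13, hidden_dim=256):
        super().__init__()
        self.mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate,
                                               n_mfcc=n_mfcc)
        self.lstm = nn.LSTM(n_mfcc, hidden_dim, num_layers=3,
                            batch_first=True)

    def forward(self, waveform):
        # waveform: (batch, samples), 16 kHz mono audio
        feats = self.mfcc(waveform)      # (batch, n_mfcc, frames)
        feats = feats.transpose(1, 2)    # (batch, frames, n_mfcc)
        out, _ = self.lstm(feats)        # top-layer state sequence
        return out
```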
The step S230: the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism, and the method comprises the following steps:
the audio encoder fuses, in the top long short-term memory layer of the second recurrent neural network and based on the attention mechanism, the state data of that top layer with the feature sequence of the video signal. Specifically, the fusion employs the following formulas (3) and (4):

a_ij = score(value_j, query_i) (3)

C_i = Σ_j a_ij · value_j (4)

where query denotes the state data of the top LSTM of the audio encoder, value denotes the output of the top LSTM of the video encoder, a_ij denotes the attention score, and C_i denotes the fused feature; after fusion through the attention mechanism, the output of the top LSTM of the audio encoder is the fused feature of the audio and video signals.
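The following sketch implements formulas (3) and (4) over the top-layer sequences of the two encoders. The patent leaves the score(·) function unspecified, so a dot product is assumed here; and for simplicity the fusion is applied to the top-layer outputs rather than inside the LSTM layer itself, as the patent describes.

```python
import torch
import torch.nn.functional as F

def attention_fuse(query, value):
    """Formulas (3) and (4): attention fusion of audio and video features.

    query: (batch, Tq, d) top-layer LSTM states of the audio encoder
    value: (batch, Tv, d) top-layer LSTM outputs of the video encoder
    A dot-product score is assumed, so both sequences share width d.
    """
    # Formula (3): a_ij = score(value_j, query_i), normalized over j
    scores = torch.bmm(query, value.transpose(1, 2))   # (batch, Tq, Tv)
    a = F.softmax(scores, dim=-1)
    # Formula (4): C_i = sum_j a_ij * value_j
    context = torch.bmm(a, value)                      # (batch, Tq, d)
    # Each audio step is enriched with its video context; concatenation
    # is one simple way to expose both to the decoder.
    return torch.cat([query, context], dim=-1)         # (batch, Tq, 2d)
```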
The step S240: the first decoder outputs a pinyin character sequence according to the fused feature sequence, comprising the following steps:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer; in this embodiment, the feature extraction layer of the third recurrent neural network may adopt a single long short-term memory layer with a four-head attention mechanism to improve performance, and the classification layer may adopt a softmax layer, which outputs the predicted pinyin character sequence p = (p_1, p_2, …, p_k);
the pinyin character sequence is obtained from the output of the classification layer of the third recurrent neural network.
When the pinyin character sequence recognition model is trained, the input of the decoder is g = (g_1, g_2, …, g_k), where g_i denotes the ground-truth sample label; after an LSTM layer and a softmax layer, the predicted pinyin character sequence p = (p_1, p_2, …, p_k) is finally obtained. In the test phase, g_i denotes the output p_i of the network at time i.
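A sketch of the first decoder under the same assumptions: a single LSTM feature-extraction layer, four-head attention over the fused feature sequence, and a softmax classification layer. The vocabulary size and widths are illustrative (fused_dim=512 matches the 2d-wide output of the attention_fuse sketch above), and the input g is the ground-truth label sequence during training or the model's own previous outputs at test time, as described above.

```python
import torch.nn as nn

class PinyinDecoder(nn.Module):
    """Sketch: LSTM feature extraction + 4-head attention + softmax layer."""

    def __init__(self, vocab_size, d_model=256, fused_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          kdim=fused_dim, vdim=fused_dim,
                                          batch_first=True)
        self.classifier = nn.Linear(d_model, vocab_size)  # softmax layer

    def forward(self, g, fused):
        # g: (batch, k) pinyin ids -- ground truth g_i when training
        # (teacher forcing), the previous output p_i when testing.
        # fused: (batch, T, fused_dim) fused audio-visual feature sequence.
        x = self.embed(g)
        h, _ = self.lstm(x)                   # feature extraction layer
        ctx, _ = self.attn(h, fused, fused)   # attend over fused features
        logits = self.classifier(h + ctx)
        return logits.log_softmax(dim=-1)     # distribution over pinyin chars
```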
As shown in fig. 4, in this embodiment, the Chinese character sequence recognition model includes an encoder and a second decoder. Specifically, in this embodiment, the Chinese character sequence recognition model is also implemented based on the Seq2Seq framework. The step S300: inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain the Chinese character sequence output by the Chinese character sequence recognition model, comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence; the input of the coder is a pinyin character sequence p = (p) 1 ,p 2 ,…,p k ),p i In this embodiment, the fourth Recurrent neural network includes two layers of Gated Recurrent Unit (GRU) networks, and the output sequence is O = (O) 1 ,O 2 ,…,O k ) Output at time i i For parameterizing the input p at the next moment i+1 The predicted distribution of (2); the gated cyclic unit is a commonly used gated cyclic neural network, and the gated cyclic neural network is proposed to better capture the dependence relationship with larger time step distance in a time sequence.
The second decoder inputs the characteristics of the pinyin character sequence into a fifth recurrent neural network, wherein the fifth recurrent neural network comprises a characteristic extraction layer and a classification layer; in this embodiment, the fifth recurrent neural network may include two layers of gated recurrent units, and the classification layer may be a softmax layer;
The Chinese character sequence s = (c_1, c_2, …, c_l) is obtained from the output of the classification layer of the fifth recurrent neural network.
In the training stage of the Chinese character sequence recognition model, the input of the decoder is the Chinese character sequence y = (y_1, y_2, …, y_l), where y_i denotes the ground-truth sample label; it is input into the fifth recurrent neural network for computation.
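A sketch of the Chinese character sequence recognition model under the Seq2Seq framing above: a two-layer GRU encoder over the pinyin characters and a two-layer GRU decoder feeding a softmax classification layer. Vocabulary sizes and widths are illustrative assumptions.

```python
import torch.nn as nn

class PinyinToHanzi(nn.Module):
    """Sketch: 2-layer GRU encoder + 2-layer GRU decoder + softmax layer."""

    def __init__(self, pinyin_vocab, hanzi_vocab, d_model=256):
        super().__init__()
        self.src_embed = nn.Embedding(pinyin_vocab, d_model)
        self.tgt_embed = nn.Embedding(hanzi_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, num_layers=2,
                              batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, num_layers=2,
                              batch_first=True)
        self.classifier = nn.Linear(d_model, hanzi_vocab)  # softmax layer

    def forward(self, p, y):
        # p: (batch, k) pinyin character ids p_1..p_k
        # y: (batch, l) Chinese character ids y_1..y_l (training labels)
        _, state = self.encoder(self.src_embed(p))    # pinyin features
        out, _ = self.decoder(self.tgt_embed(y), state)
        return self.classifier(out).log_softmax(dim=-1)
```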
The Chinese audio-visual combined speech recognition system constructed in the invention comprises two models, the pinyin character sequence recognition model and the Chinese character sequence recognition model, and is therefore a multi-model structure; to obtain the final recognition system, the problem of multi-model training must be solved. In this embodiment, the Chinese audio-visual combined speech recognition method further comprises training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model until the Chinese character sequence recognition model converges;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model until the pinyin character sequence recognition model is converged;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model to converge, thus obtaining the whole Chinese audio-visual combined speech recognition system.
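The three-stage schedule can be sketched as follows; train_one_epoch and has_converged are hypothetical placeholders standing in for the usual optimization loop and stopping criterion, and only the freeze/unfreeze pattern follows the steps above.

```python
def set_trainable(model, flag):
    # Freeze or unfreeze all parameters of a sub-model.
    for param in model.parameters():
        param.requires_grad = flag

def staged_training(av2p_model, p2s_model, train_one_epoch, has_converged):
    # Stage 1: fix the pinyin model, adjust the Chinese character model.
    set_trainable(av2p_model, False)
    set_trainable(p2s_model, True)
    while not has_converged(p2s_model):
        train_one_epoch(av2p_model, p2s_model)

    # Stage 2: fix the Chinese character model, adjust the pinyin model.
    set_trainable(p2s_model, False)
    set_trainable(av2p_model, True)
    while not has_converged(av2p_model):
        train_one_epoch(av2p_model, p2s_model)

    # Stage 3: train the whole network jointly until convergence.
    set_trainable(av2p_model, True)
    set_trainable(p2s_model, True)
    while not has_converged(av2p_model, p2s_model):
        train_one_epoch(av2p_model, p2s_model)
```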
As shown in fig. 5, an embodiment of the present invention further provides a Chinese audio-visual combined speech recognition system, applied to the above Chinese audio-visual combined speech recognition method, the system comprising:
a signal receiving module M100, configured to respectively receive a video signal and an audio signal to be recognized;
the pinyin identification module M200 is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
and the Chinese character recognition module M300 is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model.
The invention provides an end-to-end audio-visual combined speech recognition system: given a video signal and an audio signal as input, the method of the invention directly obtains the corresponding Chinese character sequence. Moreover, the system is designed specifically for Chinese character recognition: the pinyin recognition module M200 obtains a pinyin character sequence from the video and audio signals, and the Chinese character recognition module M300 then performs Chinese character recognition, so the audio and video signals are fully mined and fused and the recognition accuracy is greatly improved.
The pinyin character sequence recognition model of the invention may have the structure shown in fig. 3, and the Chinese character sequence recognition model may have the structure shown in fig. 4. The pinyin recognition module M200 may obtain the pinyin character sequence through the specific implementation of steps S210 to S240 described above, but the invention is not limited thereto. The Chinese character recognition module M300 may obtain the Chinese character sequence through the specific implementation of step S300, but the invention is not limited thereto.
The embodiment of the invention also provides a Chinese audio-visual combined speech recognition device, comprising a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method via execution of the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general-purpose computing device. Components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610), a display unit 640, and so on.
The storage unit stores program code executable by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the method sections above of this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 can be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiment of the invention also provides a computer-readable storage medium storing a program which, when executed, implements the steps of the Chinese audio-visual combined speech recognition method. In some possible embodiments, aspects of the present invention may also be implemented as a program product comprising program code which, when run on a terminal device, causes the terminal device to perform the steps according to the various exemplary embodiments of the invention described in the method sections above of this specification.
Referring to fig. 7, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In summary, compared with the prior art, the method, system, device, and medium for Chinese audio-visual combined speech recognition provided by the invention have the following advantages:
The invention solves the problems in the prior art by providing an end-to-end audio-visual speech recognition scheme for Chinese at the sentence level, filling a gap in this technical field. Furthermore, the audio-visual combined speech recognition model is built with deep neural networks, so no manual feature extraction is needed anywhere in the process; compared with traditional machine learning approaches, this extracts the features of the audio and video signals well and helps improve the recognition capability of the speech recognition system. Finally, the invention combines deep neural networks with an attention mechanism to fully mine and fuse the features of the audio signal and the video signal, so the recognition accuracy is greatly improved compared with simple multi-modal feature concatenation.
The foregoing describes the invention in further detail with reference to specific preferred embodiments, but the invention is not limited to these details. Those skilled in the art can make several simple deductions or substitutions without departing from the spirit of the invention, and all of these shall be considered to fall within the protection scope of the invention.

Claims (11)

1. A Chinese audio-visual combined speech recognition method is characterized by comprising the following steps:
respectively receiving a video signal and an audio signal to be recognized;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model;
the pinyin character sequence identification model comprises a video encoder, an audio encoder and a first decoder;
inputting the video signal and the audio signal into a trained pinyin character sequence recognition model to obtain a pinyin character sequence output by the pinyin character sequence recognition model, and the method comprises the following steps:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism to obtain a fused feature sequence, wherein the fusion adopts the following formulas (1) and (2):

a_ij = score(value_j, query_i) (1)

C_i = Σ_j a_ij · value_j (2)

wherein a_ij denotes the attention score, C_i denotes the fused feature obtained after fusion, query denotes the state data of the top LSTM of the audio encoder, value denotes the output of the top LSTM of the video encoder, and after fusion through the attention mechanism the output of the top LSTM of the audio encoder is the fused feature of the audio and video signals;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
2. The method of claim 1, wherein the video encoder extracts the feature sequence of the video signal, and comprises the following steps:
inputting the image frame sequence of the video signal into a convolutional neural network to obtain the image characteristics of each frame of image output by the convolutional neural network;
inputting the image features into a first recurrent neural network, extracting time sequence features among the image frame sequences, and taking an output sequence of the first recurrent neural network as a feature sequence of the video signal.
3. The method of claim 1, wherein the audio encoder extracts the feature sequence of the audio signal, and comprises the following steps:
calculating mel-frequency cepstrum coefficient values of the audio signal;
and inputting the Mel frequency cepstrum coefficient value into a second recurrent neural network, and extracting the characteristic sequence of the audio signal.
4. The method of claim 3, wherein the second recurrent neural network comprises three layers of long-short term memory;
the audio encoder fusing the feature sequence of the video signal and the feature sequence of the audio signal through the attention mechanism comprises: the audio encoder fusing, in the top long short-term memory layer of the second recurrent neural network and based on the attention mechanism, the state data of that top layer with the feature sequence of the video signal.
5. The method of claim 1, wherein the first decoder outputs a Pinyin character sequence according to the fused feature sequence, comprising the steps of:
the first decoder inputs the fused feature sequence into a third recurrent neural network based on an attention mechanism, wherein the third recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the pinyin character sequence from the output of the classification layer of the third recurrent neural network.
6. The method of claim 1, wherein the Chinese character sequence recognition model comprises an encoder and a second decoder;
inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model, and the method comprises the following steps:
the encoder inputs the pinyin character sequence into a fourth recurrent neural network to extract the characteristics of the pinyin character sequence;
inputting the features of the pinyin character sequence into a fifth recurrent neural network by the second decoder, wherein the fifth recurrent neural network comprises a feature extraction layer and a classification layer;
and obtaining the Chinese character sequence from the output of the classification layer of the fifth recurrent neural network.
7. The method of claim 6, wherein the fourth recurrent neural network and the fifth recurrent neural network each comprise two gated recurrent unit layers.
8. The method of claim 1, further comprising training the pinyin character sequence recognition model and the Chinese character sequence recognition model through the following steps:
fixing the parameters of the pinyin character sequence recognition model, and training by adjusting the parameters of the Chinese character sequence recognition model;
fixing the parameters of the Chinese character sequence recognition model, and training by adjusting the parameters of the pinyin character sequence recognition model;
and training the whole network of the pinyin character sequence recognition model and the Chinese character sequence recognition model.
9. A Chinese audio-visual combined speech recognition system, applied to the Chinese audio-visual combined speech recognition method of any one of claims 1 to 8, the system comprising:
the signal receiving module is used for respectively receiving a video signal and an audio signal to be recognized;
the pinyin identification module is used for inputting the video signal and the audio signal into a trained pinyin character sequence identification model to obtain a pinyin character sequence output by the pinyin character sequence identification model;
the Chinese character recognition module is used for inputting the pinyin character sequence into a trained Chinese character sequence recognition model to obtain a Chinese character sequence output by the Chinese character sequence recognition model;
the pinyin character sequence identification model comprises a video encoder, an audio encoder and a first decoder;
the pinyin identification module is specifically configured to:
the video encoder extracts a characteristic sequence of the video signal;
the audio encoder extracts a characteristic sequence of the audio signal;
the audio encoder fuses the feature sequence of the video signal and the feature sequence of the audio signal through an attention mechanism to obtain a fused feature sequence, wherein the fusion adopts the following formulas (1) and (2):

a_ij = score(value_j, query_i) (1)

C_i = Σ_j a_ij · value_j (2)

wherein a_ij denotes the attention score, C_i denotes the fused feature obtained after fusion, query denotes the state data of the top LSTM of the audio encoder, value denotes the output of the top LSTM of the video encoder, and after fusion through the attention mechanism the output of the top LSTM of the audio encoder is the fused feature of the audio and video signals;
and the first decoder outputs a pinyin character sequence according to the fusion characteristic sequence.
10. A Chinese audio-visual combined speech recognition device, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the Chinese audio-visual combined speech recognition method of any one of claims 1 to 8 via execution of the executable instructions.
11. A computer-readable storage medium storing a program, wherein the program, when executed, implements the steps of the Chinese audio-visual combined speech recognition method of any one of claims 1 to 8.
CN201911297060.3A 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition Active CN111161724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911297060.3A CN111161724B (en) 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911297060.3A CN111161724B (en) 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Publications (2)

Publication Number Publication Date
CN111161724A CN111161724A (en) 2020-05-15
CN111161724B true CN111161724B (en) 2022-12-13

Family

ID=70557201

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911297060.3A Active CN111161724B (en) 2019-12-16 2019-12-16 Method, system, equipment and medium for Chinese audio-visual combined speech recognition

Country Status (1)

Country Link
CN (1) CN111161724B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102830B (en) * 2020-09-14 2023-07-25 广东工业大学 Coarse granularity instruction identification method and device
CN112786052B (en) * 2020-12-30 2024-05-31 科大讯飞股份有限公司 Speech recognition method, electronic equipment and storage device
CN113033538B (en) * 2021-03-25 2024-05-10 北京搜狗科技发展有限公司 Formula identification method and device

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101825953A (en) * 2010-04-06 2010-09-08 朱建政 Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN102347026A (en) * 2011-07-04 2012-02-08 深圳市子栋科技有限公司 Audio/video on demand method and system based on natural voice recognition
CN108073875A (en) * 2016-11-14 2018-05-25 广东技术师范学院 A kind of band noisy speech identifying system and method based on monocular cam
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US7587318B2 (en) * 2002-09-12 2009-09-08 Broadcom Corporation Correlating video images of lip movements with audio signals to improve speech recognition

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101825953A (en) * 2010-04-06 2010-09-08 朱建政 Chinese character input product with combined voice input and Chinese phonetic alphabet input functions
CN102347026A (en) * 2011-07-04 2012-02-08 深圳市子栋科技有限公司 Audio/video on demand method and system based on natural voice recognition
CN108073875A (en) * 2016-11-14 2018-05-25 广东技术师范学院 A kind of band noisy speech identifying system and method based on monocular cam
CN109410918A (en) * 2018-10-15 2019-03-01 百度在线网络技术(北京)有限公司 For obtaining the method and device of information

Non-Patent Citations (1)

Title
An audio-video continuous speech recognition system based on data sieving; Xie Lei et al.; 《计算机应用》 (Computer Applications); 2003-07-28 (Issue 07); pp. 1-3 *

Also Published As

Publication number Publication date
CN111161724A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
US10380996B2 (en) Method and apparatus for correcting speech recognition result, device and computer-readable storage medium
CN108985358B (en) Emotion recognition method, device, equipment and storage medium
US11741355B2 (en) Training of student neural network with teacher neural networks
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN111402861B (en) Voice recognition method, device, equipment and storage medium
US11610108B2 (en) Training of student neural network with switched teacher neural networks
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
US20150325240A1 (en) Method and system for speech input
WO2022134894A1 (en) Speech recognition method and apparatus, computer device, and storage medium
CN111161724B (en) Method, system, equipment and medium for Chinese audio-visual combined speech recognition
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
KR20170022445A (en) Apparatus and method for speech recognition based on unified model
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN114596844B (en) Training method of acoustic model, voice recognition method and related equipment
CN114676234A (en) Model training method and related equipment
CN109726397B (en) Labeling method and device for Chinese named entities, storage medium and electronic equipment
CN110263218B (en) Video description text generation method, device, equipment and medium
CN110991175B (en) Method, system, equipment and storage medium for generating text in multi-mode
JP7178394B2 (en) Methods, apparatus, apparatus, and media for processing audio signals
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
WO2023082931A1 (en) Method for punctuation recovery in speech recognition, and device and storage medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN115828889A (en) Text analysis method, emotion classification model, device, medium, terminal and product
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PP01 Preservation of patent right
Effective date of registration: 20230131
Granted publication date: 20221213
PD01 Discharge of preservation of patent
Date of cancellation: 20240108
Granted publication date: 20221213
PP01 Preservation of patent right
Effective date of registration: 20240227
Granted publication date: 20221213