CN114974215A - Audio and video dual-mode-based voice recognition method and system

Audio and video dual-mode-based voice recognition method and system

Info

Publication number
CN114974215A
CN114974215A (application CN202210515512.6A)
Authority
CN
China
Prior art keywords
audio
video
layer
encoder
transform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210515512.6A
Other languages
Chinese (zh)
Inventor
赵鹏
唐宝威
韩莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202210515512.6A priority Critical patent/CN114974215A/en
Publication of CN114974215A publication Critical patent/CN114974215A/en
Pending legal-status Critical Current

Classifications

    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V40/161 - Human faces: Detection; Localisation; Normalisation
    • G06V40/168 - Human faces: Feature extraction; Face representation
    • G10L15/063 - Training of speech recognition systems
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G10L25/03 - Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/24 - Extracted parameters being the cepstrum
    • G10L25/30 - Analysis technique using neural networks
    • G10L25/57 - Specially adapted for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of speech and discloses an audio-video dual-modal speech recognition method and system. The method comprises: acquiring audio data and video data to be processed; extracting features from the audio data to obtain audio features; extracting features from the video data with 3D and 2D convolutional networks to obtain video features; encoding the audio features and the video features with a Transformer-based bidirectional information-interaction encoder; predicting, with a Transformer-based audio-video decoder, the state code of the character at the current time step to obtain the predicted state sequence corresponding to the audio and video; and mapping the state sequence to text, item by item, to obtain the text information. The method is effective and robust, can satisfy the user's speech recognition needs in a noisy environment, improves the accuracy of the speech recognition result and enhances the user experience.

Description

Audio and video dual-mode-based voice recognition method and system
Technical Field
The invention relates to the technical field of speech recognition, and in particular to an audio-video dual-modal speech recognition method and system.
Background
With the popularization of speech recognition, the technology has been applied in many fields: relying on it, a user can achieve intelligent input and complete text entry, command control and the like by voice alone, which greatly facilitates production and daily life. However, speech recognition products on the market today are limited to the speech signal as the only input source. In real-world scenes the speech signal is severely interfered with under certain conditions, such as far-field reverberation, several people speaking at once, or background noise, which greatly degrades recognition performance. Speech recognition technology based on a single-modality speech signal therefore urgently needs improvement.
In the related art, the invention patent application with publication number CN111754992A discloses a noise-robust audio-video bimodal speech recognition method and system. The Mel-frequency cepstral coefficients of the audio and their first- and second-order dynamic coefficients are extracted as audio features; the video is split into frames, faces are detected and aligned, and a fixed lip region is cropped and fed into a residual network to obtain video features. An attention mechanism is introduced to align and correct the high-level feature information of the audio and the video, yielding a feature representation that fuses both modalities and realizing early fusion of the features; late fusion is realized through two independent attention mechanisms for audio and video, after which the recognition result is decoded and output. Modal attention is added during late fusion: the weights of the audio and video features are assigned adaptively according to the information content of each modality before the features are fused, which makes the model more stable.
However, that method uses GRU/LSTM recurrent neural networks, whose encoding and decoding in both the training stage and the prediction stage are computed step by step (frame by frame), i.e. recurrently, so the computation is inefficient. In addition, the method encodes the audio and the video independently and only afterwards concatenates or weights and fuses the encoded features; no information interaction between the two modalities takes place during their respective encoding, i.e. the audio encoding features cannot directly influence the extraction of the video encoding features in the encoding stage, and likewise the video encoding features cannot directly influence the extraction of the audio encoding features.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a speech recognition method and system based on audio and video dual modes, and aims to improve the speech recognition effect.
The invention provides an audio-video dual-modal speech recognition method, which comprises the following steps:
acquiring audio-video data to be processed, wherein the audio-video data comprise paired audio data and video data;
performing feature extraction on the audio data to obtain audio features;
performing feature extraction on the video data with a video preprocessing method and 3D and 2D convolutional networks to obtain video features;
encoding the audio features and the video features with a pre-trained Transformer-based bidirectional information-interaction encoder, and performing bidirectional information interaction between the audio features and the video features with an information interaction submodule in the encoding stage to obtain audio encoding features and video encoding features, wherein during training of the encoder a cross-reconstruction submodule is used to obtain a cross-reconstruction loss for the training, and the network parameters of the Transformer-based bidirectional information-interaction encoder are the parameters to be trained;
iteratively predicting the state code of the character at each time step with a pre-trained Transformer-based audio-video decoder, combining the audio encoding features, the video encoding features and the character state code sequence predicted up to the previous time step, to obtain the predicted state sequence corresponding to the audio and video, wherein during training the network parameters of the Transformer-based audio-video decoder are the parameters to be trained;
and converting the predicted state sequence into text information.
Further, performing feature extraction on the audio data to obtain audio Fbank features comprises:
processing the audio data with a first-order high-pass filter to obtain an enhanced audio signal;
framing and windowing the enhanced audio signal to obtain a speech signal in units of frames;
and performing feature extraction on the speech signal with a short-time Fourier transform and a triangular filter bank to obtain the audio Fbank features.
Further, performing feature extraction on the video data with a video preprocessing method and 3D and 2D convolutional networks to obtain the video features comprises:
screening the image frames that contain a human face and lips in the video data with a face detection tool and retaining them in order;
cropping the face region in the image frames to form consecutive frames as the preprocessing result;
and feeding the preprocessing result into pre-trained 3D and 2D convolutional neural networks to extract the visual features.
Further, the Transformer-based bidirectional information-interaction encoder comprises a Transformer video encoder, a Transformer audio encoder and an information interaction submodule;
the Transformer video encoder comprises N layers of video coding blocks, and the Transformer audio encoder comprises N layers of audio coding blocks;
the output of the i-th layer video coding block of the Transformer video encoder and the output of the i-th layer audio coding block of the Transformer audio encoder are connected to the information interaction submodule, which outputs the audio complementary information and the video complementary information of the i-th layer respectively;
the audio complementary information of the i-th layer output by the information interaction submodule is added to the output of the i-th layer audio coding block of the Transformer audio encoder and used as the input of the (i+1)-th layer audio coding block of the Transformer audio encoder;
the video complementary information of the i-th layer output by the information interaction submodule is added to the output of the i-th layer video coding block of the Transformer video encoder and used as the input of the (i+1)-th layer video coding block of the Transformer video encoder;
and 1 ≤ i < N, wherein N denotes the number of layers of the audio encoder or the video encoder, and the last-layer video coding block of the Transformer video encoder and the last-layer audio coding block of the Transformer audio encoder are connected to the Transformer-based audio-video decoder.
Further, the information interaction submodule comprises a common spatial mapping layer and a self-attention computation layer;
the output of the i-th layer video coding block of the Transformer video encoder is concatenated with the output of the i-th layer audio coding block of the Transformer audio encoder to obtain the splicing feature of the i-th layer, which is used as the input of the common spatial mapping layer;
the common spatial mapping layer linearly maps the splicing feature of the i-th layer into a multi-modal vector of the i-th layer, which is used as the input Key of the self-attention computation layer;
the self-attention computation layer computes the audio complementary information of the i-th layer and the video complementary information of the i-th layer respectively, using the multi-modal vector of the i-th layer, the output of the i-th layer coding block of the Transformer audio encoder and the output of the i-th layer coding block of the Transformer video encoder;
the self-attention computation layer takes the output of the i-th layer coding block of the Transformer audio encoder as the input Query and the output of the i-th layer coding block of the Transformer video encoder as the input Value, obtains the audio complementary information of the i-th layer through the self-attention mechanism, and adds it to the output of the i-th layer coding block of the Transformer audio encoder as the input of the (i+1)-th layer coding block of the Transformer audio encoder;
the self-attention computation layer takes the output of the i-th layer coding block of the Transformer video encoder as the input Query and the output of the i-th layer coding block of the Transformer audio encoder as the input Value, obtains the video complementary information of the i-th layer through the self-attention mechanism, and adds it to the output of the i-th layer coding block of the Transformer video encoder as the input of the (i+1)-th layer coding block of the Transformer video encoder;
wherein the input of the first-layer coding block in the Transformer audio encoder comes from the audio Fbank features, and the input of the first-layer coding block in the Transformer video encoder comes from the preprocessed video features;
and the output of the last-layer audio coding block is used as the output of the Transformer audio encoder, while the output of the last-layer video coding block is used as the output of the Transformer video encoder.
Further, the output of the last layer of the Transformer video encoder and the output of the last layer of the Transformer audio encoder are both connected to a cross-reconstruction submodule, which is used to obtain a cross-reconstruction loss for model training;
the cross-reconstruction submodule comprises an audio reconstruction branch network and a video reconstruction branch network, each of which adopts a Transformer model consisting of three self-attention layers;
the audio reconstruction branch network is a Transformer-based audio reconstruction network that takes the video encoding features as input and outputs reconstructed audio features;
the video reconstruction branch network is a Transformer-based video reconstruction network that takes the audio encoding features as input and outputs reconstructed video features;
the reconstruction loss of the audio reconstruction branch network is defined as the L1 norm of the difference between the reconstructed audio features and the initial audio features, the reconstruction loss of the video reconstruction branch network is defined as the L1 norm of the difference between the reconstructed video features and the initial video features, and the sum of the two reconstruction losses is the cross-reconstruction loss.
Further, the decoding process of the decoder is performed iteratively, where the decoder at each iteration step comprises several layers of decoding blocks and a Softmax prediction layer, and each decoding block comprises a multi-head self-attention layer, an implicit multi-head self-attention layer and a feed-forward neural network connected in sequence;
in the iteration of the first time step, the input of the decoder is a predicted character state code sequence consisting of the start character; the decoder outputs decoding features, which pass through the Softmax prediction layer to obtain a probability vector;
the position code of the maximum value in the probability vector is taken as the state code of the character predicted at the current time step;
in the iteration of the next time step of the decoder, the state code predicted at the previous time step is spliced onto the predicted character state code sequence and used as the new input of the decoder; this is repeated until the end character is predicted, the whole decoding process ends, and the predicted state sequence corresponding to the whole audio and video is obtained;
in the iteration of the t-th time step of the decoder, the input sequence of the first-layer decoding block is the character state code sequence predicted up to the previous time step, and during decoding the decoding blocks combine the audio encoding features finally output by the audio encoder and the video encoding features finally output by the video encoder to output decoding features, wherein the output of the i-th layer decoding block is the decoding feature of the i-th layer, the input of the (i+1)-th layer decoding block is the decoding feature of the i-th layer, and the output of the last-layer decoding block is used as the decoding feature of the decoder;
in the multi-head self-attention layer of the i-th layer decoding block, self-attention is computed with the decoding feature of the (i-1)-th layer decoding block as Key, Query and Value to obtain the output vector of the multi-head self-attention layer;
the implicit multi-head self-attention layer in the i-th layer decoding block contains two parallel self-attention layers, which handle the audio modality and the video modality respectively;
the first self-attention layer of the implicit multi-head self-attention layer takes the audio encoding features as Key and Value and the output vector of the multi-head self-attention layer as Query, and computes self-attention to obtain the audio decoding vector Ac;
the second self-attention layer of the implicit multi-head self-attention layer takes the video encoding features as Key and Value and the output vector of the multi-head self-attention layer as Query, and computes self-attention to obtain the video decoding vector Vc;
after the audio decoding vector Ac and the video decoding vector Vc are spliced, they are input into the feed-forward neural network for multi-modal feature fusion, and the obtained fused features are taken as the decoding feature of the i-th layer decoding block;
and the decoding features output by the last-layer decoding block are input into the Softmax prediction layer as the prediction vector to complete the prediction of the state code of the character at the t-th time step.
Further, the method further comprises: model training of network parameters of the encoder based on the transform bidirectional information interaction and the audio/video decoder based on the transform is specifically as follows:
acquiring a training data set, wherein the training data set comprises audio and video data and corresponding real text labels, and preprocessing all audio and video in the audio data and the video data of the training set;
labeling the corresponding real text by the audio and video data by using a dictionary, converting the labeled real text into a state sequence, and using the state sequence as a label of training data in the training set;
training the encoder based on the Transformer bidirectional information interaction and the audio/video decoder based on the Transformer by using the training data set, comparing a predicted state sequence with a corresponding state sequence marked by a real text, calculating cross entropy loss of decoding, calculating the gradient of an objective function by combining the cross reconstruction loss, finely adjusting the network towards the gradient descending direction until the objective function is converged, and obtaining the pre-trained encoder based on the Transformer bidirectional information interaction and the audio/video decoder based on the Transformer.
Further, the method further comprises:
performing an augmentation operation on the audio data and the video data to obtain augmented audio-video data.
In addition, the invention also provides a speech recognition system using the audio-video dual-modal speech recognition method, the system comprising:
an acquisition module for acquiring audio-video data to be processed, the audio-video data comprising paired audio data and video data;
a first feature extraction module for performing feature extraction on the audio data to obtain audio features;
a second feature extraction module for performing feature extraction on the video data with a video preprocessing method and 3D and 2D convolutional networks to obtain video features;
an encoding module for encoding the audio features and the video features with a pre-trained Transformer-based bidirectional information-interaction encoder, performing bidirectional information interaction between the audio features and the video features with an information interaction submodule in the encoding stage to obtain audio encoding features and video encoding features, wherein during training of the encoder a cross-reconstruction submodule is used to obtain a cross-reconstruction loss for the training, and the network parameters of the Transformer-based bidirectional information-interaction encoder are the parameters to be trained;
a decoding module for iteratively predicting the state code of the character at each time step with a pre-trained Transformer-based audio-video decoder, combining the audio encoding features, the video encoding features and the character state code sequence predicted up to the previous time step, to obtain the predicted state sequence corresponding to the audio and video, wherein during training the network parameters of the Transformer-based audio-video decoder are the parameters to be trained;
and a text prediction module for converting the predicted state sequence into text information.
The audio-video dual-modal speech recognition method and system provided by the invention have the following technical advantages:
(1) An information interaction submodule is added. The audio modality can provide good semantic information in a clean background but has difficulty doing so in a noisy background, whereas the video modality is unaffected by background noise and can provide complementary semantic information for the audio modality in noisy scenes. Conversely, the audio modality can provide complementary semantic information for the visual modality in a clean background, reducing the ambiguity of the lip-reading process. Because the audio and video modalities complement each other's semantic information during encoding, the encoding features are more robust and semantically richer, which leads to a better speech recognition effect.
(2) A cross-reconstruction submodule is added. During model training, the cross-reconstruction submodule performs cross-modal reconstruction, and through this reconstruction it guides the information interaction submodule to obtain effective complementary interaction information. Meanwhile, the cross-reconstruction submodule slows the marginalization of the video modality and accelerates model convergence.
(3) The method and system, based on Transformer bidirectional information interaction, obtain robust encoding and decoding features with richer semantic information and recognize audio-video data effectively and accurately. They can satisfy the user's speech recognition needs in a noisy environment, improve the accuracy of the speech recognition result and enhance the user experience.
Drawings
FIG. 1 is a flowchart of the audio-video dual-modal speech recognition method according to a first embodiment of the present invention;
FIG. 2 is a structural diagram of the Transformer-based bidirectional information-interaction encoder in the present invention;
FIG. 3 is a diagram of the information interaction submodule in the present invention;
FIG. 4 is a schematic diagram of the operation of the Transformer-based audio-video decoder in the present invention;
FIG. 5 is a diagram illustrating a single decoding step of the decoder in the present invention;
FIG. 6 is an internal structural view of each decoding block in the decoder of the present invention;
FIG. 7 is a schematic block diagram of the audio-video dual-modal speech recognition in the present invention;
FIG. 8 is a structural diagram of the audio-video dual-modal speech recognition system in a second embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
First embodiment
Referring to fig. 1 to 2, the audio/video dual-modality based speech recognition method includes the following steps:
s10, acquiring audio and video data to be processed, wherein the audio and video data comprise paired audio data and video data;
it should be noted that, in this embodiment, the audio/video data is from a training set, and the training set is composed of pairs of audio data and video data and corresponding text labels.
It should be understood that the audio data in this embodiment is audio containing a speaker's voice, and the video data is video containing the speaker's lip region. The lip region captured visually must be distinguishable and continuous; recognizable lips are the basis on which the visual model is applied.
It should be noted that, in this embodiment, a dictionary may be used to query the state code corresponding to each text label to form a state sequence;
and S20, performing feature extraction on the audio data to obtain audio features.
It should be noted that, in this embodiment, a voice data processing tool may be used to perform feature extraction on the audio data to obtain the audio Fbank feature.
And S30, extracting the features of the video data by adopting a video preprocessing method and a 3D and 2D convolution network to obtain video features.
It should be noted that, in this embodiment, a face processing tool and pre-trained 3D and 2D convolutional neural networks may be used to perform feature extraction on the video data.
S40, encoding the audio features and the video features with a pre-trained Transformer-based bidirectional information-interaction encoder, where in the encoding stage an information interaction submodule performs bidirectional information interaction between the audio features and the video features to obtain the audio encoding features and the video encoding features; during training of the encoder, a cross-reconstruction submodule is used to obtain a cross-reconstruction loss for the training, and the network parameters of the Transformer-based bidirectional information-interaction encoder are the parameters to be trained.
S50, with a pre-trained Transformer-based audio-video decoder, iteratively predicting the state code of the character at each time step by combining the audio encoding features, the video encoding features and the character state code sequence predicted up to the previous time step, to obtain the predicted state sequence corresponding to the audio and video; during training, the network parameters of the Transformer-based audio-video decoder are the parameters to be trained.
In this embodiment, the Transformer-based bidirectional information-interaction encoder and the Transformer-based audio-video decoder are trained with the audio-video training set and the labels corresponding to the training set. In the application stage, the trained encoder and decoder are used directly for audio-video recognition.
And S60, converting the prediction state sequence into text information.
Specifically, in this embodiment the gradient of the objective function is computed from the error of the predicted state sequence and the cross-reconstruction loss. The sum of the cross-entropy loss of the predicted state sequence and the cross-reconstruction loss is defined as the total target loss function; the gradient of this objective function is computed and the network is fine-tuned in the direction of gradient descent until the objective function converges, yielding the pre-trained encoder and decoder. In practical application, the trained encoder and decoder are used for speech recognition.
It should be noted that the scheme described in the invention patent application with publication number CN111754992A uses a GRU/LSTM recurrent neural network, and the encoding and decoding of the LSTM in both the training stage and the prediction stage are computed time step by time step (frame by frame), i.e. recurrently. The present embodiment uses a Transformer model instead: the Transformer can capture long-range feature dependencies and has stronger encoding capability; during training it computes all time steps (frames) of the encoding and decoding processes simultaneously, and in the prediction stage the encoding is likewise computed over all frames in parallel. Parallel computation is clearly more efficient than recurrent computation.
In an embodiment, the step S20 includes the following steps:
and S21, processing the audio data by utilizing a first-order high-pass filter to obtain an enhanced audio signal.
And S22, performing frame division and windowing on the enhanced audio signal to obtain a voice signal with a frame as a unit.
And S23, extracting the features of the voice signal by using a short-time Fourier transform and a triangular filter group to obtain the audio Fbank features.
In this embodiment, the Kaldi speech data processing tool is used to high-pass filter the audio to enhance the high-frequency components, then to frame and window it so that the speech signal is converted from units of sampling points to units of frames and smoothed by the windowing; stable audio Fbank features are then extracted with the short-time Fourier transform and a logarithm.
It should be noted that the Fbank feature is used for two reasons: 1) it converts the audio from the time domain to the frequency domain, so stable audio features can be extracted; 2) the Fbank spectrum is a two-dimensional feature that can be processed by deep neural networks such as convolutional networks and Transformers.
It should also be noted that the audio feature extraction method is not limited to this embodiment: Fbank features, MFCC features and other audio features can all serve as stable representations of the audio.
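As a concrete illustration of such an Fbank front end, the following sketch uses torchaudio's Kaldi-compatible routine; the parameter values (80 mel bins, 25 ms frames, 10 ms shift, 0.97 pre-emphasis) are illustrative assumptions, not values taken from the patent.

```python
# Hedged sketch of the audio front end (steps S21-S23); parameter values are assumptions.
import torch
import torchaudio

def extract_fbank(wav_path: str, num_mel_bins: int = 80) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)   # (channels, samples)
    # Pre-emphasis (the first-order high-pass filter), framing/windowing, the STFT and
    # the triangular mel filter bank are all applied inside the Kaldi-compatible routine.
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=num_mel_bins,
        frame_length=25.0,             # ms
        frame_shift=10.0,              # ms
        preemphasis_coefficient=0.97,  # first-order high-pass emphasis
        sample_frequency=sample_rate,
    )
    return feats  # (num_frames, num_mel_bins) log mel filter-bank features
```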
In one embodiment, step S30 comprises the following steps:
S31, screening the image frames that contain human lips in the video data with a face detection tool and retaining them in order.
S32, cropping the face region in the image frames to form consecutive frames as the preprocessing result.
S33, feeding the preprocessing result into pre-trained 3D and 2D convolutional neural networks to extract the visual features.
Specifically, in this embodiment, after the user's facial video is acquired, the face detection tool is used to screen discriminative face video frames and retain them in order; the face region of each frame is then cropped to form a lip video as the preprocessing result, and the preprocessed video is passed through pre-trained 3D and 2D convolutional neural networks to extract the visual features of the video.
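A minimal sketch of such a 3D + 2D convolutional visual front end is given below; the channel counts, the choice of a ResNet-18 trunk and the feature dimension are assumptions made for illustration, not details disclosed by the patent.

```python
# Hedged sketch of a 3D + 2D convolutional visual front end over cropped lip frames.
import torch
import torch.nn as nn
import torchvision

class VisualFrontEnd(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # The 3D convolution captures short-range lip motion across neighbouring frames.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # A 2D trunk (ResNet-18 here, as an assumption) then encodes each frame.
        resnet = torchvision.models.resnet18(weights=None)
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        resnet.fc = nn.Linear(resnet.fc.in_features, feat_dim)
        self.conv2d = resnet

    def forward(self, lips: torch.Tensor) -> torch.Tensor:
        # lips: (batch, 1, time, height, width) grayscale lip crops
        x = self.conv3d(lips)                          # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch
        x = self.conv2d(x)                             # (B*T, feat_dim)
        return x.reshape(b, t, -1)                     # (B, T, feat_dim) frame-level features
```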
In this embodiment, the visual modality and the audio modality are regarded as complementary. Fusing visual and audio information reduces the ambiguity of speech recognition based on single-modality information and improves recognition accuracy, and introducing bidirectional information interaction between the audio and visual modalities into speech recognition makes the recognition more robust.
In one embodiment, step S40 comprises the following steps:
S41, inputting the audio features into the audio encoder to obtain the audio encoding features.
S42, inputting the video features into the video encoder to obtain the video encoding features.
S43, while the audio encoder obtains the audio encoding features and the video encoder obtains the video encoding features, using the information interaction submodule to realize audio-video modal fusion within the encoding process.
S44, in the training stage, the video reconstruction network reconstructs the initial video features from the audio encoding features finally output by the audio encoder, and the audio reconstruction network reconstructs the initial audio features from the video encoding features finally output by the video encoder.
In an embodiment, referring to FIG. 3 and FIG. 4, the Transformer-based bidirectional information-interaction encoder comprises a Transformer video encoder, a Transformer audio encoder and an information interaction submodule.
The Transformer video encoder comprises N layers of video coding blocks, and the Transformer audio encoder comprises N layers of audio coding blocks.
The output of the i-th layer video coding block of the Transformer video encoder and the output of the i-th layer audio coding block of the Transformer audio encoder are connected to the information interaction submodule, which outputs the audio complementary information and the video complementary information of the i-th layer respectively.
The audio complementary information of the i-th layer output by the information interaction submodule is added to the output of the i-th layer audio coding block of the Transformer audio encoder and used as the input of the (i+1)-th layer audio coding block of the Transformer audio encoder.
The video complementary information of the i-th layer output by the information interaction submodule is added to the output of the i-th layer video coding block of the Transformer video encoder and used as the input of the (i+1)-th layer video coding block of the Transformer video encoder.
Here 1 ≤ i < N, where N denotes the number of layers of the audio encoder or the video encoder; the last-layer video coding block of the Transformer video encoder and the last-layer audio coding block of the Transformer audio encoder are connected to the decoder.
It should be noted that the Transformer is a deep learning model with a self-attention-based encoder-decoder framework, and the network comprises several coding blocks and decoding blocks. ReLU is used as the activation function to avoid vanishing gradients; a normalization layer keeps the data distribution consistent; a multi-head mechanism lets the model learn in different subspaces; and the bidirectional information exchange blocks complete the information interaction between the modalities.
It should also be noted that, in the scheme described in the invention patent application with publication number CN111754992A, the audio and the video are encoded independently to obtain features, after which the encoded features are concatenated or weighted and fused; no information interaction between the two modalities takes place during their respective encoding, i.e. the audio encoding features cannot directly influence the extraction of the video encoding features in the encoding stage, and likewise the video encoding features cannot directly influence the extraction of the audio encoding features.
In this embodiment, by contrast, the output of every layer of video coding blocks and the output of every layer of audio coding blocks are connected to the information interaction submodule, so that every layer in the respective encoding processes of the audio and the video fuses information from the opposite modality. This realizes deep cross-fusion of information across modalities: every layer obtains complementary information from the opposite modality through the self-attention mechanism and fuses it, achieving deep information fusion in the encoding stage.
In an embodiment, referring to FIG. 4, the Transformer-based bidirectional information-interaction encoder comprises the information interaction submodule. The information interaction submodule comprises a common spatial mapping layer and a self-attention computation layer; the output of the i-th layer video coding block of the Transformer video encoder is concatenated with the output of the i-th layer audio coding block of the Transformer audio encoder to obtain the splicing feature of the i-th layer.
The common spatial mapping layer linearly maps the splicing feature of the i-th layer into a multi-modal vector of the i-th layer, which is used as the input Key of the self-attention computation layer.
The self-attention computation layer computes the audio complementary information of the i-th layer and the video complementary information of the i-th layer respectively, using the multi-modal vector of the i-th layer, the output of the i-th layer coding block of the Transformer audio encoder and the output of the i-th layer coding block of the Transformer video encoder.
The self-attention computation layer takes the output of the i-th layer coding block of the Transformer audio encoder as the input Query and the output of the i-th layer coding block of the Transformer video encoder as the input Value, obtains the audio complementary information of the i-th layer through the self-attention mechanism, and adds it to the output of the i-th layer coding block of the Transformer audio encoder as the input of the (i+1)-th layer coding block of the Transformer audio encoder.
The self-attention computation layer takes the output of the i-th layer coding block of the Transformer video encoder as the input Query and the output of the i-th layer coding block of the Transformer audio encoder as the input Value, obtains the video complementary information of the i-th layer through the self-attention mechanism, and adds it to the output of the i-th layer coding block of the Transformer video encoder as the input of the (i+1)-th layer coding block of the Transformer video encoder.
The input of the first-layer coding block in the audio encoder comes from the audio Fbank features, and the input of the first-layer coding block in the video encoder comes from the preprocessed video features.
The output of the last-layer audio coding block is taken as the output of the audio encoder, i.e. the audio encoding features; the output of the last-layer video coding block is taken as the output of the video encoder, i.e. the video encoding features.
This embodiment adds an information interaction submodule at every corresponding layer of the Transformer video encoder and the Transformer audio encoder; that is, the output of the i-th layer of the Transformer video encoder and the output of the i-th layer of the Transformer audio encoder are input into the information interaction submodule.
The self-attention mechanism operates according to the following equation:
SelfAttention(Query, Key, Value) = Softmax(Mat(Query, Key)) * Value
where Mat(Query, Key) denotes the matrix (dot-product) similarity between the Query and the Key.
in the embodiment, through the audio and video multi-mode fusion method, interactive feature fusion is performed in the process of extracting the video coding features and the audio coding features, so that the independence among all modes is ensured, and the multi-mode information complementation is fully utilized. The active role of the video features in the identification process is fully utilized from bottom to top, and the modal difference between the audio modality and the visual modality is spanned.
In an embodiment, referring to FIG. 3, the Transformer-based bidirectional information-interaction encoder further comprises a cross-reconstruction submodule: the output of the last layer of the Transformer video encoder and the output of the last layer of the Transformer audio encoder are both connected to the cross-reconstruction submodule, which is used to obtain a cross-reconstruction loss for model training.
The cross-reconstruction submodule comprises an audio reconstruction branch network and a video reconstruction branch network, each of which adopts a Transformer model consisting of three self-attention layers.
The audio reconstruction branch network is a Transformer-based audio reconstruction network that takes the video encoding features as input and outputs reconstructed audio features.
The video reconstruction branch network is a Transformer-based video reconstruction network that takes the audio encoding features as input and outputs reconstructed video features.
The reconstruction loss of the audio reconstruction branch network is defined as the L1 norm of the difference between the reconstructed audio features and the initial audio features, and the reconstruction loss of the video reconstruction branch network is defined as the L1 norm of the difference between the reconstructed video features and the initial video features; the sum of the two reconstruction losses is the cross-reconstruction loss loss_cross.
The cross-reconstruction submodule is added in the training process in order to alleviate the marginalization of the visual modality caused by an overly strong audio modality, to balance the performance between the modalities, and to supervise effective information interaction between the audio and video modalities.
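A sketch of the cross-reconstruction loss is given below; the three-layer Transformer branches follow the description above, while the layer width, head count and the assumption that encoder outputs and initial features share one dimension are illustrative.

```python
# Hedged sketch of the cross-reconstruction submodule used only during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossReconstruction(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 3):
        super().__init__()
        # Each branch: a small Transformer with three self-attention layers.
        self.audio_recon = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.video_recon = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)

    def forward(self, audio_enc, video_enc, audio_init, video_init):
        recon_audio = self.audio_recon(video_enc)   # video encoding features -> audio features
        recon_video = self.video_recon(audio_enc)   # audio encoding features -> video features
        # L1 terms between reconstructed and initial features, summed into loss_cross.
        loss_cross = F.l1_loss(recon_audio, audio_init) + F.l1_loss(recon_video, video_init)
        return loss_cross
```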
In one embodiment, step S50 comprises the following steps:
S51, forming a prediction state sequence from the start character.
S52, inputting the prediction state sequence into the decoder and computing layer by layer to obtain the decoding features.
S53, in each layer's decoding block, performing audio-video multi-modal fusion and feature decoding with the multi-head self-attention layer and the implicit multi-head self-attention layer.
S54, the Softmax prediction layer obtains a probability vector from the decoding features finally output by the decoder and predicts the state code of the character at the current step.
S55, if the character predicted in step S54 is not the end character, splicing the state code of the predicted character onto the prediction state sequence and returning to step S52.
S56, if the character predicted in step S54 is the end character, the decoding ends.
In an embodiment, referring to FIG. 5, FIG. 6 and FIG. 7, the decoding process of the Transformer-based audio-video decoder is performed iteratively. The decoder at each iteration step comprises several layers of decoding blocks and a Softmax prediction layer, as shown in FIG. 6; each decoding block comprises a multi-head self-attention layer, an implicit multi-head self-attention layer and a feed-forward neural network connected in sequence, as shown in FIG. 7.
As shown in FIG. 5, in the iteration of the first time step the input of the decoder is a predicted character state code sequence consisting of the start character; the decoder outputs decoding features, which pass through the Softmax prediction layer to obtain a probability vector. The position code of the maximum value in the probability vector is taken as the state code of the character predicted at the current time step. In the iteration of the next time step, the state code predicted at the previous time step is spliced onto the predicted character state code sequence, which then serves as the new input of the decoder. This is repeated until the end character is predicted, at which point the whole decoding process ends.
In the iteration of the t-th time step of the decoder, the input sequence of the first-layer decoding block is the character state code sequence predicted up to the previous time step, and during decoding the decoding blocks combine the audio encoding features finally output by the audio encoder and the video encoding features finally output by the video encoder. The output of the i-th layer decoding block is the decoding feature of the i-th layer; the input of the (i+1)-th layer decoding block is the decoding feature of the i-th layer; and the output of the last-layer decoding block is used as the decoding feature of the decoder.
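The iterative loop just described (steps S51 to S56 and FIG. 5) can be sketched as a greedy decoding routine; the token ids for the start and end characters, the maximum length and the decoder call signature are assumptions for illustration.

```python
# Hedged sketch of the greedy iterative decoding loop.
import torch

@torch.no_grad()
def greedy_decode(decoder, audio_enc, video_enc, sos_id=1, eos_id=2, max_len=200):
    seq = torch.tensor([[sos_id]], dtype=torch.long)     # prediction state sequence: start character
    for _ in range(max_len):
        # The decoder attends to the final audio and video encoding features and returns
        # per-position logits; only the last time step is needed for the next character.
        logits = decoder(seq, audio_enc, video_enc)       # (1, len(seq), vocab_size)
        probs = torch.softmax(logits[:, -1, :], dim=-1)   # Softmax prediction layer
        next_id = probs.argmax(dim=-1, keepdim=True)      # position of the maximum value
        seq = torch.cat([seq, next_id], dim=1)            # splice onto the sequence
        if next_id.item() == eos_id:                      # stop once the end character appears
            break
    return seq[0, 1:]  # predicted state sequence (start character removed)
```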
in the multi-head self-attention layer in the i-th layer decoding block, performing self-attention calculation by taking the decoding characteristics of the i-1-th layer decoding block as Key, Query and Value to obtain an output vector of the multi-head self-attention layer;
in the i-th layer decoding block, there are two parallel self-attention layers for computing two modalities of audio and video, respectively, as shown in fig. 7.
The first self-attention layer of the implicit multi-head self-attention layer takes the audio coding features as Key and Value, takes the output vector of the multi-head self-attention layer as Query, and carries out self-attention calculation to obtain an audio decoding vector Ac;
the second self-attention layer of the implicit multi-head self-attention layer takes the video coding features as Key and Value, and the output vector of the multi-head self-attention layer as Query to perform self-attention calculation to obtain a video decoding vector Vc;
after the audio decoding vector Ac and the video decoding vector Vc are subjected to splicing operation, inputting the audio decoding vector Ac and the video decoding vector Vc into the feed-forward neural network for multi-mode feature fusion, and taking obtained fusion features as decoding features of an i-th layer decoding block;
and the decoding features output by the decoding block of the last layer are input into a Softmax prediction layer as a prediction vector to complete the prediction of the state code of the character of the t-th time step.
In this embodiment, the multi-modal feature fusion and the character prediction take place within the decoding process of the decoder. The decoder decodes iteratively, predicting one character per time step, as follows:
The input sequence of the decoder at the first time step is a predicted character state code sequence consisting of the start character; at each subsequent time step, the state code of the character predicted at the previous time step is spliced onto the predicted character state code sequence and used as the input of the decoder at the next time step, until the decoder predicts the end character and the whole decoding process ends. The predicted character state codes, after label smoothing and position embedding, are used as the input of the first decoding block of the decoder. A decoding block consists of two self-attention layers: a multi-head self-attention layer and an implicit multi-head self-attention layer. In the multi-head self-attention layer, the output of the previous decoding block is used as Key, Query and Value for the self-attention computation, giving the output vector of the multi-head self-attention layer. In the following implicit self-attention layer, two parallel self-attention layers handle the audio modality and the video modality respectively. For the audio modality, the audio encoding features finally output by the audio encoder serve as Key and Value and the output vector of the multi-head self-attention layer serves as Query; self-attention is computed to obtain the audio decoding vector Ac. For the video modality, the video encoding features finally output by the video encoder serve as Key and Value and the output vector of the multi-head self-attention layer serves as Query; self-attention is computed to obtain the video decoding vector Vc. The audio decoding vector Ac and the video decoding vector Vc are spliced and then input into the feed-forward neural network for multi-modal feature fusion. The fused multi-modal features serve as the input of the next-layer decoding block, and so on up to the last decoding block of the decoder. The output of the last decoding block is input into the Softmax prediction layer as the prediction vector, and the character state code of the current time step is predicted.
In an embodiment, the method further comprises: model training of network parameters of the encoder based on the transform bidirectional information interaction and the audio/video decoder based on the transform is specifically as follows:
s1, a training data set is obtained, the training data set comprises audio and video data and corresponding real text labels, and all audio and video in the audio data and the video data of the training set are preprocessed.
And S2, labeling labels in the training stage, wherein the labeled content is corresponding text information, and converting the corresponding text information into a corresponding real text state sequence according to the dictionary.
It should be noted that the format of the dictionary is fixed, for example the entry "i 1", so that each character has a unique corresponding integer identifier; a sentence such as "i love china" then translates into a state sequence such as "1 32 100 125". The characters in the dictionary are a set of commonly used Chinese characters.
It should be noted that the start character is spliced to the head of the state sequence to indicate the start of the sequence.
Specifically, in this embodiment, the audio/video data to be input into the network is acquired; in the training process, the text label of the audio/video data is converted into the corresponding state sequence through the dictionary and used as the training tag.
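As a hedged illustration of this label conversion, the sketch below maps text to a state code sequence through a character dictionary; the concrete characters, integer identifiers and the "<sos>"/"<eos>" marker names are assumptions made for the example.

```python
# Hypothetical character dictionary; a real dictionary maps a set of commonly
# used Chinese characters to unique integer identifiers.
char2id = {"<sos>": 0, "我": 1, "爱": 32, "中": 100, "国": 125, "<eos>": 2}
id2char = {v: k for k, v in char2id.items()}

def text_to_state_sequence(text):
    # Prepend the start character and append the end character.
    return [char2id["<sos>"]] + [char2id[c] for c in text] + [char2id["<eos>"]]

def state_sequence_to_text(states):
    return "".join(id2char[s] for s in states
                   if id2char[s] not in ("<sos>", "<eos>"))

print(text_to_state_sequence("我爱中国"))   # e.g. [0, 1, 32, 100, 125, 2]
```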
S3, the Transformer-based bidirectional information interaction encoder and the Transformer-based audio/video decoder are trained by using the training data set: the predicted state sequence is compared with the corresponding state sequence labeled from the real text, the cross entropy loss of decoding is calculated, the gradient of the objective function is calculated in combination with the cross reconstruction loss, and the network is fine-tuned along the direction of gradient descent until the objective function converges, so that the pre-trained Transformer-based bidirectional information interaction encoder and Transformer-based audio/video decoder are obtained.
Specifically, in the training phase, the sum of the cross entropies between each state code in the predicted state sequence and the corresponding state code in the state sequence labeled from the real text is calculated as the cross entropy loss loss_CE; if it is not the training phase, this step ends directly. Then, the cross entropy loss and the cross reconstruction loss are weighted and added to obtain the total loss of the training model: Loss = loss_CE + α·loss_cross, where α is a hyper-parameter that can be adjusted during the training process. Finally, the network parameters of the Transformer-based bidirectional information interaction encoder and the Transformer-based audio/video decoder are optimized by a gradient descent method. Training is repeated until the network parameters of the encoder and the decoder converge.
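A minimal sketch of this total training objective is given below, assuming a PyTorch implementation; the function name, tensor layouts and the default value of α are illustrative assumptions, not values fixed by this embodiment.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_logits, target_states, rec_audio, audio_feat,
               rec_video, video_feat, alpha=0.1):
    """Loss = loss_CE + alpha * loss_cross, as described above.
    pred_logits:   (batch, T, vocab) decoder outputs before the Softmax layer
    target_states: (batch, T) real-text state codes
    rec_* / *_feat: reconstructed vs. initial audio/video features."""
    # Cross entropy between each predicted state code and the labeled state code.
    loss_ce = F.cross_entropy(pred_logits.transpose(1, 2), target_states)
    # Cross reconstruction loss: L1 between reconstructed and initial features.
    loss_cross = F.l1_loss(rec_audio, audio_feat) + F.l1_loss(rec_video, video_feat)
    return loss_ce + alpha * loss_cross
```

In a training loop this loss would simply be followed by the usual zero_grad / backward / step sequence of whichever optimizer is chosen.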
It should be noted that, in the training process, the audio/video data pairs in the training set are used as the input of the network and the text labels are used as the expected output of the network; during training, the text label is first converted into a state sequence, the prediction output of the model is a state sequence, and in the inference process the predicted state sequence is converted back into text.
The training process is a supervised learning process. In this process, assume a 5 s speaker audio/video clip from which 30 frames are obtained through preprocessing; the Transformer uses the attention mechanism to extract audio and video features layer by layer, the Transformer encoder comprises six layers, and the output of the sixth layer is finally used as the encoding features. Combining the encoding features and the past prediction results, the decoder iteratively predicts the character of the current time step. The gradient of the objective function is calculated from the error of the prediction sequence and the cross reconstruction loss, and the network is accordingly fine-tuned along the direction of gradient descent until the objective function converges.
In an embodiment, the method further comprises:
and performing augmentation operation on the audio data and the video data.
It should be noted that the augmentation operation is performed in the data preprocessing stage. The augmentation operations used are speed perturbation and spectral perturbation. Speed perturbation slows down or speeds up the original audio signal. Spectral perturbation applies time-domain masking, frequency-domain masking and time warping to the spectrogram of the audio Fbank features. The augmentation operation expands the data volume and enhances the generalization capability of the model.
Specifically, in this embodiment, the audio and video are augmented during training in order to ensure the robustness of the model and to adapt to various complex environments as far as possible. The original audio is first speed-perturbed, using the Kaldi tool to produce 0.9×, 1.0× and 1.1× speed variants. The Fbank feature spectrogram is then augmented: 10 to 20 frames are randomly warped, i.e. the current frame is shifted forwards or backwards; or 10 to 20 frames are randomly masked in the time domain; or 10 to 20 consecutive frequency bins are masked; each of these operations may be randomly repeated once or twice on the spectrogram. The same operations can similarly be applied to the video frames.
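The spectrogram-masking part of this augmentation could look like the following sketch; the pure-PyTorch masking (in place of the Kaldi-based speed perturbation and time warping, which are omitted here) and the mask value of zero are assumptions for illustration.

```python
import random
import torch

def spec_augment(fbank, n_mask=(1, 2), width=(10, 20)):
    """fbank: (T, F) Fbank spectrogram.  Randomly masks 10-20 consecutive frames
    in the time domain and 10-20 consecutive frequency bins, each repeated once
    or twice, following the description in this embodiment."""
    T, Fdim = fbank.shape
    out = fbank.clone()
    for _ in range(random.randint(*n_mask)):          # time-domain masking
        w = random.randint(*width)
        t0 = random.randint(0, max(0, T - w))
        out[t0:t0 + w, :] = 0.0
    for _ in range(random.randint(*n_mask)):          # frequency-domain masking
        w = min(random.randint(*width), Fdim)
        f0 = random.randint(0, max(0, Fdim - w))
        out[:, f0:f0 + w] = 0.0
    return out
```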
It should be noted that the data enhancement strategy is only an example and is not a limitation to the present invention, and in some other embodiments, the data enhancement method may also be set according to actual requirements, including but not limited to a rotation enhancement mode.
Second embodiment
In addition, referring to fig. 8, a second embodiment of the present invention provides an audio/video dual-modality based speech recognition system, including:
the acquisition module 10 is configured to acquire audio and video data to be processed, where the audio and video data includes paired audio data and video data.
The first feature extraction module 20 is configured to perform feature extraction on the audio data to obtain audio Fbank features.
And a second feature extraction module 30, configured to perform feature extraction on the video data by using a video preprocessing method and a 3D and 2D convolutional network, to obtain video features.
The encoding module 40 is configured to encode the audio features and the video features by using a pre-trained Transformer-based bidirectional information interaction encoder, and to perform bidirectional information interaction between the audio features and the video features by using the information interaction submodule in the encoding stage, so as to obtain the audio encoding features and the video encoding features, wherein a cross reconstruction loss is obtained by using the cross reconstruction submodule during training of the encoder and is used for encoder training; in the training process, the network parameters of the Transformer-based bidirectional information interaction encoder are to be trained.
And the decoding module 50 is configured to use a pre-trained Transformer-based audio/video decoder, and combine the audio coding features, the video coding features, and the character state code sequence predicted at the previous time step to iteratively predict a state code of a character at each time step, so as to obtain a predicted state sequence corresponding to the audio/video, where in the training process, the network parameters of the Transformer-based audio/video decoder are to be trained.
And a text prediction module 60 for converting the prediction state sequence into text information.
In the embodiment, for audio and video bimodal data, a speech and face processing tool is utilized to extract speech characteristics and video characteristics, then a trained two-way information interaction encoder based on a Transformer and an audio and video decoder based on the Transformer are used to fuse and extract characteristics of two modal information, a state sequence is obtained, and finally text information expressed by a user is obtained.
As a further application, the speech recognition system of this embodiment is used to transcribe the audio and video information into the text information expressed by the user, and the text information is delivered to a downstream application or used directly by the user; from the perspective of downstream tasks such as voice instructions, voice locks and voice recognition, the present application serves as a lower-layer application.
In one embodiment, the first feature extraction module 20 includes:
The dividing unit is used for dividing the audio data into audio segments of different lengths by using a voice endpoint detection method.
And the enhancement unit is used for processing the audio with different lengths by utilizing a first-order high-pass filter to obtain an enhanced audio signal.
And the frame windowing unit is used for performing frame windowing on the enhanced audio signal to obtain a voice signal taking a frame as a unit.
And the first feature extraction unit is used for extracting features of the voice signal by utilizing short-time Fourier transform and a triangular filter group to obtain Fbank audio features.
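A hedged sketch of this audio front-end, using torchaudio's Kaldi-compatible Fbank computation, is shown below; the voice endpoint detection step is omitted, mono audio is assumed, and the window length, frame shift, pre-emphasis coefficient and number of mel bins are typical values assumed for illustration.

```python
import torchaudio

def extract_fbank(wav_path):
    waveform, sr = torchaudio.load(wav_path)          # (channels, samples), mono assumed
    # First-order high-pass (pre-emphasis) filtering, framing/windowing, the
    # short-time Fourier transform and the triangular (mel) filter bank are all
    # handled inside the Kaldi-compatible fbank routine.
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sr,
        frame_length=25.0,               # ms, framing window length (assumed)
        frame_shift=10.0,                # ms, frame shift (assumed)
        num_mel_bins=80,                 # triangular filter bank size (assumed)
        preemphasis_coefficient=0.97)    # first-order high-pass filtering
    return feats                          # (frames, 80) Fbank features
```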
In one embodiment, the second feature extraction module 30 includes:
and the screening unit is used for screening out the image frames containing the human face lips in the video data by using a human face detection tool and reserving the image frames in sequence.
And the cutting unit is used for cutting the human face area in the image frame to form continuous frames as a preprocessing result.
And the second feature extraction unit is used for sending the preprocessing result into a pre-trained 3D and 2D convolutional neural network to extract and obtain the visual features.
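The 3D-plus-2D convolutional visual front-end described by these units might be sketched as follows; the channel sizes, kernel shapes and the lightweight 2D trunk (a ResNet-style network would be typical) are assumptions rather than the exact pre-trained network.

```python
import torch
import torch.nn as nn

class VisualFrontEnd(nn.Module):
    """3D convolution over stacked lip-region frames followed by a 2D CNN
    applied per frame; outputs one feature vector per video frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # The 3D convolution captures short-range temporal dynamics of the lips.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)))
        # Lightweight 2D trunk applied frame by frame.
        self.conv2d = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))

    def forward(self, frames):
        # frames: (batch, 1, T, H, W) grayscale lip-region crops
        x = self.conv3d(frames)                        # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch
        x = self.conv2d(x).flatten(1)                  # (B*T, feat_dim)
        return x.reshape(b, t, -1)                     # (B, T, feat_dim) video features
```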
In an embodiment, the encoding module 40 is specifically configured to:
and respectively acquiring the audio coding features and the video coding features by utilizing the two branches of the encoder, namely the Transformer audio encoder and the Transformer video encoder, wherein bidirectional information interaction is carried out between corresponding network layers of the Transformer audio encoder and the Transformer video encoder through the information interaction blocks, as sketched after this list.
Fusing and decoding the audio coding features and the video coding features by using the Transformer decoder to obtain a prediction sequence;
and converting the prediction sequence into a predicted character sequence according to the dictionary.
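A hedged sketch of one information interaction block between corresponding encoder layers is given below, following the common-spatial-mapping-plus-attention arrangement described in this document; the module names, dimensions and the assumption that the audio and video sequences are aligned to the same length are illustrative, not part of the original disclosure.

```python
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    """For layer i: splice the audio and video coding-block outputs, map them
    into a shared multi-modal vector used as Key, then attend with the audio
    output as Query / video output as Value (and vice versa) to obtain the audio
    and video complementary information, which is added back to each branch."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.common_map = nn.Linear(2 * d_model, d_model)   # common spatial mapping
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio_i, video_i):
        # audio_i, video_i: (batch, T, d_model), assumed aligned to the same length T
        multimodal = self.common_map(torch.cat([audio_i, video_i], dim=-1))      # Key
        audio_comp, _ = self.audio_attn(audio_i, multimodal, video_i)  # Q=audio, K=mm, V=video
        video_comp, _ = self.video_attn(video_i, multimodal, audio_i)  # Q=video, K=mm, V=audio
        # Complementary information is added to each branch and fed to layer i+1.
        return audio_i + audio_comp, video_i + video_comp
```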
It should be noted that other processes of the audio/video dual-modality based speech recognition system provided in this embodiment refer to the specific implementation steps of the audio/video dual-modality based speech recognition method, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g., Read Only Memory (ROM)/RAM, a magnetic disk, an optical disk), and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a node packaging device, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech recognition method based on audio and video dual modes is characterized by comprising the following steps:
acquiring audio and video data to be processed, wherein the audio and video data comprises paired audio data and video data;
performing feature extraction on the audio data to obtain audio features;
performing feature extraction on the video data by adopting a video preprocessing method and a 3D and 2D convolution network to obtain video features;
coding audio features and video features by adopting a pre-trained two-way information interaction encoder based on a Transformer, and performing two-way information interaction on the audio features and the video features by utilizing an information interaction submodule in a coding stage to obtain the audio coding features and the video coding features, wherein a cross reconstruction loss is obtained by utilizing a cross reconstruction submodule in the training process of the encoder for the encoder training, wherein in the training process, network parameters of the Transformer-based two-way information interaction encoder are to be trained;
iteratively predicting a state code of a character of each time step by adopting a pre-trained audio and video decoder based on a Transformer and combining the audio coding characteristics, the video coding characteristics and a character state code sequence predicted by a previous time step to obtain a predicted state sequence corresponding to the audio and video, wherein in the training process, network parameters of the audio and video decoder based on the Transformer are to be trained;
and converting the prediction state sequence into text information.
2. The audio-video bimodal-based speech recognition method of claim 1, wherein the performing feature extraction on the audio data to obtain audio features comprises:
processing the audio data by utilizing a first-order high-pass filter to obtain an enhanced audio signal;
performing frame division and windowing on the enhanced audio signal to obtain a voice signal taking a frame as a unit;
and performing feature extraction on the voice signal by using a short-time Fourier transform and a triangular filter group to obtain audio Fbank features.
3. The audio-video bimodal-based speech recognition method of claim 1, wherein the extracting the features of the video data by using a video preprocessing method and a 3D and 2D convolutional network to obtain video features comprises:
screening image frames containing human face lips in the video data by using a human face detection tool and sequentially retaining the image frames;
cutting the human face area in the image frame to form continuous frames as a preprocessing result;
and sending the preprocessing result into a pre-trained 3D and 2D convolutional neural network, and extracting to obtain video characteristics.
4. The dual mode audio-video based speech recognition method according to claim 1, wherein the Transformer based bi-directional information interaction encoder comprises a Transformer video encoder, a Transformer audio encoder, and an information interaction sub-module;
the Transformer video encoder comprises N layers of video coding blocks, and the Transformer audio encoder comprises N layers of audio coding blocks;
the output of the i-th layer video coding block of the Transformer video encoder and the output of the i-th layer audio coding block of the Transformer audio encoder are connected to an information interaction submodule, and the information interaction submodule respectively outputs the audio complementary information and the video complementary information of the i-th layer;
adding the audio complementary information of the i-th layer output by the information interaction submodule and the output of the i-th layer audio coding block of the Transformer audio encoder to be used as the input of the (i+1)-th layer audio coding block of the Transformer audio encoder;
adding the video complementary information of the i-th layer output by the information interaction submodule and the output of the i-th layer video coding block of the Transformer video encoder to be used as the input of the (i+1)-th layer video coding block of the Transformer video encoder;
and 1 ≤ i < N, where N represents the number of layers of the audio encoder or the video encoder, and the last-layer video coding block of the Transformer video encoder and the last-layer audio coding block of the Transformer audio encoder are connected with the Transformer-based audio/video decoder.
5. The audio-visual bimodal speech recognition method according to claim 4, wherein said information interaction submodule comprises a common spatial mapping layer and a self-attention computation layer;
connecting the output of the i-th layer video coding block of the Transformer video encoder with the output of the i-th layer audio coding block of the Transformer audio encoder to obtain the splicing feature of the i-th layer, which is taken as the input of the common spatial mapping layer;
the common spatial mapping layer linearly maps the splicing feature of the i-th layer into a multi-modal vector of the i-th layer, and the multi-modal vector is used as an input Key of the self-attention calculation layer;
the self-attention calculation layer respectively calculates the audio complementary information of the i-th layer and the video complementary information of the i-th layer by utilizing the multi-modal vector of the i-th layer, the output of the i-th layer coding block of the Transformer audio encoder and the output of the i-th layer coding block of the Transformer video encoder;
the self-attention calculation layer is used for taking the output of the i-th layer coding block of the Transformer audio encoder as an input Query, taking the output of the i-th layer coding block of the Transformer video encoder as an input Value, acquiring the audio complementary information of the i-th layer through a self-attention mechanism, and adding the audio complementary information of the i-th layer and the output of the i-th layer coding block of the Transformer audio encoder as the input of the (i+1)-th layer coding block of the Transformer audio encoder;
the self-attention calculation layer is used for taking the output of the i-th layer coding block of the Transformer video encoder as an input Query, taking the output of the i-th layer coding block of the Transformer audio encoder as an input Value, acquiring the video complementary information of the i-th layer through a self-attention mechanism, and adding the video complementary information of the i-th layer and the output of the i-th layer coding block of the Transformer video encoder as the input of the (i+1)-th layer coding block of the Transformer video encoder;
wherein the input of the first-layer coding block in the Transformer audio encoder comes from the audio Fbank features, and the input of the first-layer coding block in the Transformer video encoder comes from the preprocessed video features;
and the output of the last-layer audio coding block is used as the output of the Transformer audio encoder, and the output of the last-layer video coding block is used as the output of the Transformer video encoder.
6. The bimodal audio-visual based speech recognition method according to claim 4, wherein the output of the last layer of the Transformer video encoder and the output of the last layer of the Transformer audio encoder are both connected to the cross reconstruction submodule, and the cross reconstruction submodule is configured to obtain the cross reconstruction loss for model training;
the cross reconstruction submodule comprises an audio reconstruction branch network and a video reconstruction branch network, and the audio reconstruction branch network and the video reconstruction branch network both adopt a Transformer model consisting of three layers of self-attention mechanisms;
the audio reconstruction branch network is a Transformer-based audio reconstruction network and is used for taking the video coding features as input and outputting reconstructed audio features;
the video reconstruction branch network is a Transformer-based video reconstruction network and is used for taking the audio coding features as input and outputting reconstructed video features;
the reconstruction loss of the audio reconstruction branch network is defined as the L1 norm of the difference between the reconstructed audio features and the initial audio features, the reconstruction loss of the video reconstruction branch network is defined as the L1 norm of the difference between the reconstructed video features and the initial video features, and the sum of the reconstruction loss of the audio reconstruction branch network and the reconstruction loss of the video reconstruction branch network is the cross reconstruction loss.
7. The dual-mode audio-video-based speech recognition method of claim 1, wherein the decoding process of the decoder is performed iteratively, wherein the decoder of each iteration comprises a plurality of decoding blocks and a Softmax prediction layer, and the decoding blocks comprise a multi-head self-attention layer, an implicit multi-head self-attention layer and a feedforward neural network which are connected in sequence;
in the iterative process of the first time step, the input of the decoder is a predicted character state code sequence consisting of starting characters, the decoder outputs decoding characteristics, and the decoding characteristics pass through a Softmax prediction layer to obtain a probability vector;
taking the position code of the maximum value in the probability vector as the state code of the character predicted at the current time step;
in the iterative process of the next time step of the decoder, splicing the state code predicted at the previous time step into a predicted character state code sequence, and then using the state code as a new input of the decoder, and so on until an ending character is predicted, ending the whole decoding process, and obtaining a predicted state sequence corresponding to the whole audio and video;
in the iterative process of the t time step of the decoder, the input sequence of a first layer decoding block is a character state code sequence predicted by the previous time step, and the decoding block outputs decoding characteristics in the decoding process by combining the audio coding characteristics finally output by the audio encoder and the video coding characteristics finally output by the video encoder, wherein the output of the decoding block of the i layer is the decoding characteristics of the i layer, the input of the decoding block of the i +1 layer is the decoding characteristics of the i layer, and the output of the decoding block of the last layer is used as the decoding characteristics of the decoder;
in the multi-head self-attention layer in the i-th layer decoding block, performing self-attention calculation by taking the decoding characteristics of the i-1-th layer decoding block as Key, Query and Value to obtain an output vector of the multi-head self-attention layer;
two parallel self-attention layers exist in the implicit multi-head self-attention layer in an i-th layer decoding block and are respectively used for calculating two modes of audio and video;
the first self-attention layer of the implicit multi-head self-attention layer takes the audio coding features as Key and Value, takes the output vector of the multi-head self-attention layer as Query, and performs self-attention calculation to obtain an audio decoding vector Ac;
the second self-attention layer of the implicit multi-head self-attention layer takes the video coding features as Key and Value, and the output vector of the multi-head self-attention layer as Query to perform self-attention calculation to obtain a video decoding vector Vc;
after the audio decoding vector Ac and the video decoding vector Vc are subjected to splicing operation, inputting the audio decoding vector Ac and the video decoding vector Vc into the feed-forward neural network for multi-mode feature fusion, and taking obtained fusion features as decoding features of an i-th layer decoding block;
and the decoding features output by the decoding block of the last layer are input into a Softmax prediction layer as a prediction vector to complete the prediction of the state code of the character of the t-th time step.
8. The audio-visual bimodal-based speech recognition method of any one of claims 2-7, wherein the method further comprises: model training of network parameters of the encoder based on the transform bidirectional information interaction and the audio/video decoder based on the transform is specifically as follows:
acquiring a training data set, wherein the training data set comprises audio and video data and corresponding real text labels, and preprocessing all audio and video in the audio data and the video data of the training set;
labeling the corresponding real text by the audio and video data by using a dictionary, converting the labeled real text into a state sequence, and using the state sequence as a label of training data in the training set;
training the Transformer-based bidirectional information interaction encoder and the Transformer-based audio/video decoder by using the training data set, comparing a predicted state sequence with a corresponding state sequence labeled by a real text, calculating cross entropy loss of decoding, calculating the gradient of a target function by combining the cross reconstruction loss, finely adjusting the network in the direction of gradient decrease until the target function is converged, and obtaining the pre-trained Transformer-based bidirectional information interaction encoder and the Transformer-based audio/video decoder.
9. An audiovisual bimodal based speech recognition method according to any of claims 2-7, characterized in that the method further comprises:
and carrying out augmentation operation on the audio data and the video data to obtain augmented audio and video data.
10. An audio-video dual-modality based speech recognition system, the system comprising:
the acquisition module is used for acquiring audio and video data to be processed, and the audio and video data comprises paired audio data and video data;
the first feature extraction module is used for extracting features of the audio data to obtain audio features;
the second feature extraction module is used for extracting features of the video data by adopting a video preprocessing method and a 3D and 2D convolution network to obtain video features;
the encoding module is used for encoding audio features and video features by adopting a pre-trained two-way information interaction-based encoder, and performing two-way information interaction on the audio features and the video features by utilizing an information interaction submodule in an encoding stage to obtain the audio encoding features and the video encoding features, wherein a cross reconstruction loss is obtained by utilizing a cross reconstruction module in the training process of the encoder for the encoder training, and in the training process, network parameters of the two-way information interaction-based encoder based on the Transformer are to be trained;
the decoding module is used for iteratively predicting a state code of a character of each time step by adopting a pre-trained audio and video decoder based on a Transformer and combining the audio coding characteristics, the video coding characteristics and a character state code sequence predicted at a previous time step to obtain a predicted state sequence corresponding to the audio and video, wherein in the training process, network parameters of the audio and video decoder based on the Transformer are to be trained;
and the text prediction module is used for converting the prediction state sequence into text information.
CN202210515512.6A 2022-05-11 2022-05-11 Audio and video dual-mode-based voice recognition method and system Pending CN114974215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210515512.6A CN114974215A (en) 2022-05-11 2022-05-11 Audio and video dual-mode-based voice recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210515512.6A CN114974215A (en) 2022-05-11 2022-05-11 Audio and video dual-mode-based voice recognition method and system

Publications (1)

Publication Number Publication Date
CN114974215A true CN114974215A (en) 2022-08-30

Family

ID=82982192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210515512.6A Pending CN114974215A (en) 2022-05-11 2022-05-11 Audio and video dual-mode-based voice recognition method and system

Country Status (1)

Country Link
CN (1) CN114974215A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089654A (en) * 2023-04-07 2023-05-09 杭州东上智能科技有限公司 Audio supervision-based transferable audio-visual text generation method and system
CN116628473A (en) * 2023-05-17 2023-08-22 国网上海市电力公司 Power equipment state trend prediction method based on multi-factor neural network algorithm
CN117219067A (en) * 2023-09-27 2023-12-12 北京华星酷娱文化传媒有限公司 Method and system for automatically generating subtitles by short video based on speech understanding
CN117219067B (en) * 2023-09-27 2024-04-09 北京华星酷娱文化传媒有限公司 Method and system for automatically generating subtitles by short video based on speech understanding


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination