CN113033452B - Lip language identification method fusing channel attention and selective feature fusion mechanism - Google Patents

Lip language identification method fusing channel attention and selective feature fusion mechanism

Info

Publication number
CN113033452B
CN113033452B (application CN202110366767.6A)
Authority
CN
China
Prior art keywords
network
lip
fusion
layer
selective
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110366767.6A
Other languages
Chinese (zh)
Other versions
CN113033452A (en)
Inventor
薛峰
杨添
王文博
洪自坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202110366767.6A priority Critical patent/CN113033452B/en
Publication of CN113033452A publication Critical patent/CN113033452A/en
Application granted granted Critical
Publication of CN113033452B publication Critical patent/CN113033452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lip language identification method fusing channel attention and a selective feature fusion mechanism, which comprises the following steps: 1. downloading the GRID data set for training the model and preprocessing it; 2. building a lip language recognition network and selecting a suitable objective function to optimize the model parameters; 3. evaluating the model with corresponding evaluation indexes; 4. performing lip language recognition on videos with the trained model. The invention uses a stacked 3D convolutional neural network, a selective feature fusion network and a bidirectional GRU network to encode the input video frames, with a channel attention mechanism inserted between the 3D convolution layers, and finally adopts a CTC decoder to generate the output text, so that the features of the speaker's lip region can be learned better and more accurate lip reading is achieved.

Description

Lip language recognition method fusing channel attention and selective feature fusion mechanism
Technical Field
The invention belongs to the technical field of computer machine learning and artificial intelligence, and mainly relates to a lip language identification method based on deep neural networks.
Background
Lip movements play a crucial role in human communication and speech understanding; however, research shows that human lip-reading ability is poor. Good lip language recognition technology can complement audio-based speech recognition and can be used to improve hearing aids and to acquire speech information in silent, secure and noisy environments, among other uses; it therefore has great practical value and has become a field of growing interest. Before the advent of deep learning, most lip-reading work was based on manually designed features, and such methods are computationally intensive and less accurate. Deep learning methods developed in recent years are used to extract static features of the speaker's lip region or to build end-to-end architectures. 3D convolutional neural networks can effectively learn the motion information of the lips; recurrent neural networks are well suited to processing sequence information; and the CTC (Connectionist Temporal Classification) training approach removes the need to align the inputs with the target outputs, so that sequence modeling can be trained in an end-to-end fashion. On the basis of these deep learning methods, lip language recognition technology has developed considerably.
Lip language recognition can be divided into two categories, word-level and sentence-level, depending on whether the modeling task is to classify words or phonemes or to predict a complete sentence sequence. Word-level methods predict only single isolated words; the prediction object is usually a short video of about 0.5 s, and the information that context provides for word prediction is ignored. Sentence-level methods take video segments of several seconds or even longer as the prediction object and can make full use of context information to help predict each word, so the latter is of greater practical significance. In recent years, word-level lip language recognition methods have developed rapidly, and the accuracy of single-word classification can exceed 86% (on the LRW dataset). Sentence-level lip language recognition, which predicts complete sentence sequences, has been studied relatively less; existing models extract some features of the lip region insufficiently, the accuracy of lip language recognition is still low, and there is room for improvement.
Disclosure of Invention
Aiming at these problems in existing lip language recognition, the invention provides a lip language recognition method fusing channel attention and a selective feature fusion mechanism, so as to better extract the features of the speaker's lip region, thereby realizing more accurate lip reading and a better lip language recognition effect.
The invention adopts the following technical scheme for solving the technical problems:
the invention relates to a lip language identification method fusing channel attention and selective feature fusion mechanisms, which is characterized by comprising the following steps of:
step 1, a sentence-level lip language recognition video data set is obtained, face feature detection is carried out on each video in the lip language recognition video data set, lip region images are extracted, so that a lip region image set of each video is obtained, and a lip region image data set L is formed;
step 2, dividing the lip region image data set L into a training set L1 and a test set L2, and dividing the training set L1 into a plurality of batches, wherein each batch contains the lip region image sets corresponding to B videos, serving as B training samples; each training sample comprises T frames of lip region images; each frame of lip region image has C channels, a height of H and a width of W;
step 3, the real texts corresponding to the lip region image sets of the videos contained in the training set L1 and the test set L2 are respectively denoted as G1 and G2;
Step 4, constructing a lip language identification network fusing a channel attention and selective feature fusion mechanism;
step 4.1, constructing a front-end network HN of a fusion channel attention mechanism;
the front-end network HN is formed by connecting three same sub-modules CAN in series, and each sub-module CAN sequentially comprises a 3D convolution layer, a 3D batch regularization layer, a ReLU activation function, a 3D Dropout layer, a 3D maximum pooling layer and a channel attention network layer CA; the output of the channel attention network CA and the input of the channel attention network CA are multiplied element by element to obtain a result which is used as the output of each sub-module CAN;
the channel attention network CA comprises two branches, a first branch comprising in sequence: a 3D global maximum pooling layer, a 3D convolutional layer for reducing the number of input feature channels by r times, a ReLU activation function and a 3D convolutional layer for increasing the number of input feature channels by r times; the other branch is the same as the first branch except that the 3D global maximum pooling layer is changed into a 3D global average pooling layer; adding the outputs of the two branches element by element, and obtaining the output of the attention network CA through a Sigmoid activation function;
step 4.2, constructing a selective feature fusion network SKN;
the selective characteristic fusion network SKN is formed by connecting n identical selective fusion sub-modules SK in series, and each selective fusion sub-module SK is processed according to the formula (1):
Z = G(U) ⊙ Tanh(X) + (1-G(U)) ⊙ Tanh(Y)    (1)

in formula (1), Z represents the output of each selective fusion submodule SK; ⊙ represents element-by-element multiplication of feature matrices; Tanh is the Tanh activation function; X and Y are two different feature matrices obtained from the input of the selective fusion sub-module SK through two fusion branches, and each fusion branch comprises a full connection layer; G(U) represents the result obtained by adding the two different feature matrices X and Y obtained by the two fusion branches element by element to obtain U, and then passing U sequentially through a full connection layer for reducing the input dimension by r times, a ReLU activation function, a full connection layer for increasing the input dimension by r times, and a Sigmoid activation function;
step 4.3, constructing a back-end network TN for extracting long-term temporal information;
the back-end network TN sequentially comprises two bidirectional GRU layers, a full connection layer and a CTC loss layer; the input of the back-end network TN is the output of the selective characteristic fusion network SKN;
step 4.4, using the training set L1 as the input of the lip language recognition network and the real text set G1 corresponding to the training set L1 as the labels, taking CTC loss as the loss function, training the lip language recognition network with the Adam optimization algorithm, and verifying it on the test set L2, thereby obtaining the final lip language recognition network, which is used for recognizing the movement of the speaker's lips in a video, namely realizing machine lip reading.
Compared with the existing lip language recognition technology, the invention has the following advantages:
1. The invention integrates the channel attention mechanism into the lip language recognition model by adding it to the 3D convolutional neural network at the front end of the model that extracts short-term information and spatial features, so that the model can, according to its degree of dependence on each channel, make full use of the most informative features and suppress useless ones. The lip language recognition network with the channel attention mechanism achieves a better lip reading effect.
2. In the method, a batch normalization layer is added between each convolutional layer and the activation layer to optimize the model training process. This not only greatly accelerates model training, but also helps keep the model from overfitting on a limited data set, and improves the lip language recognition performance of the model to a certain extent.
3. The present invention employs a selective feature fusion mechanism. Whereas the Highway Network provides only a dynamic, nonlinear fusion weight, the proposed selective feature fusion mechanism not only alleviates the difficulty of training deeper networks, but also adaptively and selectively learns information from different feature spaces, thereby extracting richer semantic information and greatly improving the lip language recognition performance of the model.
Drawings
FIG. 1 is a block diagram of a model of a network according to the present invention;
FIG. 2 is a block diagram of the selective feature fusion module according to the present invention;
FIG. 3 is a flow chart of the present invention.
Detailed Description
In this embodiment, a lip language recognition method fusing channel attention and a selective feature fusion mechanism recognizes the content expressed by a speaker from the motion of the speaker's lip region in a video and maps it to text, thereby implementing lip reading based on deep learning. The sentence-level lip language recognition data set GRID is downloaded, and images of the speaker's lip region are obtained after face feature detection; a complete lip language recognition model is constructed, and batch normalization is used to accelerate model training; fusing the channel attention mechanism improves the performance of the model; the Adam optimization algorithm is adopted to update and optimize the model parameters; and the finally trained model recognizes the content expressed by the speaker from the motion of the speaker's lips in the video and converts it into text, completing the whole lip language recognition function. Specifically, as shown in fig. 3, the method comprises the following steps:
step 1, a sentence-level lip language recognition video data set is obtained, face feature detection is carried out on each video in the lip language recognition video data set, and lip region images are extracted, so that a lip region image set of each video is obtained, and a lip region image data set L is formed;
step 2, dividing the lip region image data set L into a training set L1 and a test set L2, and dividing the training set L1 into a plurality of batches, wherein each batch contains the lip region image sets corresponding to B videos, serving as B training samples; each training sample comprises T frames of lip region images; each frame of lip region image has C channels, a height of H and a width of W; in the specific example, B is 50, T is 75, C is 3, H is 64, and W is 128;
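For concreteness, a minimal sketch (assuming PyTorch; the tensor names are illustrative) of how one such batch can be laid out for 3D convolution is given below; PyTorch's Conv3d expects the channel axis before the time axis, so the frames are permuted accordingly.

```python
import torch

# Illustrative batch with the dimensions of the specific example:
# B = 50 videos, T = 75 frames, C = 3 channels, H = 64, W = 128.
B, T, C, H, W = 50, 75, 3, 64, 128
frames = torch.rand(B, T, C, H, W)        # frames as loaded per video

# Conv3d expects (batch, channels, depth, height, width); the time axis T
# plays the role of the depth dimension.
batch = frames.permute(0, 2, 1, 3, 4)     # (B, C, T, H, W) = (50, 3, 75, 64, 128)
print(batch.shape)
```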
step 3, the real texts corresponding to the lip region image sets of the videos contained in the training set L1 and the test set L2 are respectively denoted as G1 and G2;
Step 4, constructing a lip language identification network fusing a channel attention and selective feature fusion mechanism, wherein the network structure is shown in figure 1;
step 4.1, constructing a front-end network HN fusing the channel attention mechanism, to extract short-term information and spatial features from the input image set;
the front-end network HN is composed of three identical sub-modules CAN in series, each sub-module sequentially comprises a 3D convolution layer, a 3D batch regularization layer (BN layer), a ReLU activation function, a 3D Dropout, a 3D maximum pooling layer and a channel attention network layer CA, and the output of each sub-module is a result of element-by-element multiplication of the output of the channel attention network CA and the input of the channel attention network CA.
In the embodiment, the 3D convolutional layers in the three sub-modules change the number of channels of the input features into 32, 64 and 96 in turn, and each 3D maximum pooling layer halves the height and width of the input feature map; the 3D convolutional neural network and the 3D maximum pooling layers reduce the computational complexity and extract spatial features and short-term information from the input lip region image set. Adding a 3D batch regularization layer (BN layer) after each 3D convolutional layer greatly accelerates model training, leads to fast convergence, and can improve the accuracy of the model.
The channel attention network CA comprises two branches, a first branch comprising in sequence: a 3D global maximum pooling layer, a 3D convolutional layer for reducing the number of input feature channels by r times, a ReLU activation function and a 3D convolutional layer for increasing the number of input feature channels by r times; in the specific example, r is 16; the other branch is the same as the first branch except that the 3D global maximum pooling layer is changed into a 3D global average pooling layer; and adding the outputs of the two branches element by element, and obtaining the output of the attention network CA through a Sigmoid activation function. The channel attention mechanism can improve the representation capability of the network by modeling the dependency of each channel, and can adjust the characteristics channel by channel, so that the network can selectively strengthen the learning of the characteristics containing useful information and inhibit useless characteristics;
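A minimal PyTorch-style sketch of one CAN sub-module with its channel attention network CA is given below. The channel widths 32/64/96, the reduction ratio r = 16, and the halving of H and W per block follow the embodiment; the 3D convolution kernel sizes, the dropout rate, and the sharing of the two 1x1x1 convolutions between the max-pooling and average-pooling branches are assumptions not specified here.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Channel attention CA: a max-pool branch and an avg-pool branch, each
    squeezing the channels by r and expanding them back; the two outputs are
    added and passed through a Sigmoid to give per-channel weights."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool3d(1)
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(                      # shared between branches (assumption)
            nn.Conv3d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // r, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return torch.sigmoid(self.fc(self.max_pool(x)) + self.fc(self.avg_pool(x)))

class CANBlock(nn.Module):
    """One CAN sub-module: Conv3d -> BN -> ReLU -> Dropout -> MaxPool -> CA,
    with the CA weights multiplied back onto the CA input element by element."""
    def __init__(self, in_ch, out_ch, dropout=0.5):   # dropout rate assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(3, 5, 5), padding=(1, 2, 2)),  # kernel assumed
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Dropout3d(dropout),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),      # halves H and W, keeps T
        )
        self.ca = ChannelAttention3D(out_ch)

    def forward(self, x):
        feat = self.features(x)
        return feat * self.ca(feat)

# Front-end HN: three CAN sub-modules with 32, 64 and 96 output channels.
front_end = nn.Sequential(CANBlock(3, 32), CANBlock(32, 64), CANBlock(64, 96))
out = front_end(torch.rand(2, 3, 75, 64, 128))
print(out.shape)   # (2, 96, 75, 8, 16) after three 2x spatial reductions
```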
step 4.2, constructing a selective feature fusion network SKN;
the selective characteristic fusion network SKN is formed by connecting n identical selective fusion submodules SK in series, the selective fusion submodules SK are shown in fig. 2, and experiments show that the model can finally achieve the best effect when n is 2, so that n is 2 in a specific embodiment. Each selective fusion submodule SK is processed according to equation (1):
Z = G(U) ⊙ Tanh(X) + (1-G(U)) ⊙ Tanh(Y)    (1)

In equation (1), Z represents the output of each selective fusion submodule SK; ⊙ represents element-by-element multiplication of feature matrices; Tanh is the Tanh activation function; X and Y are two different feature matrices obtained from the input of the selective fusion submodule SK through two branches, and each branch comprises a full connection layer; G(U) represents: the result of element-by-element addition of the two different feature matrices X and Y obtained by the two branches is recorded as U, and U then sequentially passes through a full connection layer for reducing the input dimension by r times, a ReLU activation function, a full connection layer for increasing the input dimension by r times, and a Sigmoid activation function.
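The following sketch illustrates one selective fusion sub-module SK in the same PyTorch style. The gating form follows the reconstruction of equation (1) above, and the per-frame feature dimension d is illustrative; only the two fully connected fusion branches, the gate G(U) with reduction ratio r, and the Tanh/Sigmoid activations are taken from the description.

```python
import torch
import torch.nn as nn

class SKBlock(nn.Module):
    """One selective fusion sub-module SK: two fully connected fusion branches
    produce X and Y, a gate G(U) is computed from U = X + Y, and the output is
    Z = G(U) * Tanh(X) + (1 - G(U)) * Tanh(Y), per the reconstructed Eq. (1)."""
    def __init__(self, d, r=16):
        super().__init__()
        self.branch_x = nn.Linear(d, d)      # fusion branch producing X
        self.branch_y = nn.Linear(d, d)      # fusion branch producing Y
        self.gate = nn.Sequential(           # G(U): reduce by r, ReLU, expand by r, Sigmoid
            nn.Linear(d, d // r),
            nn.ReLU(inplace=True),
            nn.Linear(d // r, d),
            nn.Sigmoid(),
        )

    def forward(self, inp):
        x, y = self.branch_x(inp), self.branch_y(inp)
        g = self.gate(x + y)                 # U = X + Y, then G(U)
        return g * torch.tanh(x) + (1 - g) * torch.tanh(y)

# SKN: n = 2 identical SK sub-modules in series, as in the embodiment.
skn = nn.Sequential(SKBlock(d=512), SKBlock(d=512))
feats = torch.rand(4, 75, 512)               # (batch, time, per-frame feature dim)
print(skn(feats).shape)                      # (4, 75, 512); Linear acts on the last dim
```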
step 4.3, constructing a back-end network TN for extracting long-term temporal information;
The back-end network TN sequentially comprises two bidirectional GRU (Gated Recurrent Unit) layers, a full connection layer and a CTC (Connectionist Temporal Classification) loss layer; the input of the back-end network TN is the output of the selective feature fusion network SKN. In the specific example, each GRU layer contains 256 hidden neurons; using two stacked bidirectional GRU layers further aggregates the output of the front-end network HN effectively and captures long-term information in the input feature sequence. The CTC loss function is chosen because it removes the need for the training data to align the input with the target output, eliminating many cumbersome post-processing operations.
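Under these settings, the back-end network TN might be sketched as follows; the two bidirectional GRU layers with 256 hidden units each follow the embodiment, while the feature dimension, the vocabulary size and the blank index used for CTC are assumptions.

```python
import torch
import torch.nn as nn

class BackEndTN(nn.Module):
    """Back-end TN sketch: two stacked bidirectional GRU layers (256 hidden
    units each), a fully connected layer over the output vocabulary, and a
    CTC loss layer applied to the per-frame log-probabilities during training."""
    def __init__(self, feat_dim, vocab_size, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)        # 2x for the two directions
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True) # blank index assumed

    def forward(self, feats):                 # feats: (B, T, feat_dim) from the SKN
        out, _ = self.gru(feats)
        return self.fc(out).log_softmax(dim=-1)            # (B, T, vocab_size)

tn = BackEndTN(feat_dim=512, vocab_size=28)   # feature and vocab sizes illustrative
print(tn(torch.rand(4, 75, 512)).shape)       # (4, 75, 28)
```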
Step 4.4, the training set L1 obtained by dividing the lip region image data set is used as the input of the lip language recognition network, the real text set G1 corresponding to the training set L1 is used as the labels, CTC loss is used as the loss function, the lip language recognition network is trained with the Adam optimization algorithm and then verified on the test set L2, and the final lip language recognition network is obtained, which is used for recognizing the motion of the speaker's lips in a video, namely realizing machine lip reading.
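For illustration, one CTC/Adam training step could look as sketched below; the stand-in model, learning rate and vocabulary size are placeholders, and in the method above the model would be the full pipeline of front-end HN, SKN and back-end TN producing per-frame log-probabilities.

```python
import torch
import torch.nn as nn

# Stand-in network so the snippet runs on its own; vocab size is an assumption
# (e.g. letters plus space and a CTC blank).
B, T, feat, vocab = 2, 75, 512, 28
model = nn.Sequential(nn.Linear(feat, vocab), nn.LogSoftmax(dim=-1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate assumed
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.rand(B, T, feat)                # per-frame features for one batch
targets = torch.randint(1, vocab, (B, 20))    # padded label index sequences (no blanks)
target_lens = torch.tensor([20, 15])          # true lengths of the two label sequences
input_lens = torch.full((B,), T, dtype=torch.long)

log_probs = model(feats)                      # (B, T, vocab)
loss = ctc_loss(log_probs.permute(1, 0, 2),   # CTCLoss expects (T, B, vocab)
                targets, input_lens, target_lens)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```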

Claims (1)

1. A lip language recognition method fusing channel attention and selective feature fusion mechanisms is characterized by comprising the following steps of:
step 1, a sentence-level lip language recognition video data set is obtained, face feature detection is carried out on each video in the lip language recognition video data set, lip region images are extracted, so that a lip region image set of each video is obtained, and a lip region image data set L is formed;
step 2, dividing the lip region image data set L into a training set L1 and a test set L2, and dividing the training set L1 into a plurality of batches, wherein each batch contains the lip region image sets corresponding to B videos, serving as B training samples; each training sample comprises T frames of lip region images; each frame of lip region image has C channels, a height of H and a width of W;
step 3, the real texts corresponding to the lip region image sets of the videos contained in the training set L1 and the test set L2 are respectively denoted as G1 and G2;
Step 4, constructing a lip language identification network fusing a channel attention and selective feature fusion mechanism;
step 4.1, constructing a front-end network HN of a fusion channel attention mechanism;
the front-end network HN is formed by connecting three same sub-modules CAN in series, and each sub-module CAN sequentially comprises a 3D convolution layer, a 3D batch regularization layer, a ReLU activation function, a 3D Dropout layer, a 3D maximum pooling layer and a channel attention network layer CA; the output of the channel attention network CA and the input of the channel attention network CA are multiplied element by element to obtain a result which is used as the output of each sub-module CAN;
the channel attention network CA comprises two branches, a first branch comprising in sequence: a 3D global maximum pooling layer, a 3D convolutional layer for reducing the number of input feature channels by r times, a ReLU activation function and a 3D convolutional layer for increasing the number of input feature channels by r times; the other branch is the same as the first branch except that the 3D global maximum pooling layer is changed into a 3D global average pooling layer; adding the outputs of the two branches element by element, and obtaining the output of the attention network CA through a Sigmoid activation function;
step 4.2, constructing a selective feature fusion network SKN;
the selective characteristic fusion network SKN is formed by connecting n identical selective fusion sub-modules SK in series, and each selective fusion sub-module SK is processed according to the formula (1):
Z = G(U) ⊙ Tanh(X) + (1-G(U)) ⊙ Tanh(Y)    (1)

in formula (1), Z represents the output of each selective fusion submodule SK; ⊙ represents element-by-element multiplication of feature matrices; Tanh is the Tanh activation function; X and Y are two different feature matrices obtained from the input of the selective fusion sub-module SK through two fusion branches, and each fusion branch comprises a full connection layer; G(U) represents the result obtained by adding the two different feature matrices X and Y obtained by the two fusion branches element by element to obtain U, and then passing U sequentially through a full connection layer for reducing the input dimension by r times, a ReLU activation function, a full connection layer for increasing the input dimension by r times, and a Sigmoid activation function;
step 4.3, constructing a back-end network TN for extracting long-term temporal information;
the back-end network TN sequentially comprises two bidirectional GRU layers, a full connection layer and a CTC loss layer; the input of the back-end network TN is the output of the selective characteristic fusion network SKN;
step 4.4, using the training set L1 as the input of the lip language recognition network and the real text set G1 corresponding to the training set L1 as the labels, taking CTC loss as the loss function, training the lip language recognition network with the Adam optimization algorithm, and verifying it on the test set L2, thereby obtaining the final lip language recognition network, which is used for recognizing the motion of the speaker's lips in a video, namely realizing machine lip reading.
CN202110366767.6A 2021-04-06 2021-04-06 Lip language identification method fusing channel attention and selective feature fusion mechanism Active CN113033452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110366767.6A CN113033452B (en) 2021-04-06 2021-04-06 Lip language identification method fusing channel attention and selective feature fusion mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110366767.6A CN113033452B (en) 2021-04-06 2021-04-06 Lip language identification method fusing channel attention and selective feature fusion mechanism

Publications (2)

Publication Number Publication Date
CN113033452A (en) 2021-06-25
CN113033452B (en) 2022-09-16

Family

ID=76453770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110366767.6A Active CN113033452B (en) 2021-04-06 2021-04-06 Lip language identification method fusing channel attention and selective feature fusion mechanism

Country Status (1)

Country Link
CN (1) CN113033452B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343937B (en) * 2021-07-15 2022-09-02 北华航天工业学院 Lip language identification method based on deep convolution and attention mechanism
CN114118142A (en) * 2021-11-05 2022-03-01 西安晟昕科技发展有限公司 Method for identifying radar intra-pulse modulation type
CN114694255B (en) * 2022-04-01 2023-04-07 合肥工业大学 Sentence-level lip language recognition method based on channel attention and time convolution network


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019204147A (en) * 2018-05-21 2019-11-28 株式会社デンソーアイティーラボラトリ Learning apparatus, learning method, program, learnt model and lip reading apparatus
CN110633683A (en) * 2019-09-19 2019-12-31 华侨大学 Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM
CN111104884A (en) * 2019-12-10 2020-05-05 电子科技大学 Chinese lip language identification method based on two-stage neural network model
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Can We Read Speech Beyond the Lips? Rethinking ROI Selection for Deep Visual Speech Recognition"; Zhang YH et al.; IEEE; 2021-01-18; full text *
"Research on Lip Reading Recognition Based on Deep Learning" (in Chinese); Wu Dajiang; China Master's Theses Full-text Database, Information Science and Technology Series; 2020-06-15 (No. 06, 2020); full text *

Also Published As

Publication number Publication date
CN113033452A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN113033452B (en) Lip language identification method fusing channel attention and selective feature fusion mechanism
Xu et al. Sequential video VLAD: Training the aggregation locally and temporally
He et al. Asymptotic soft filter pruning for deep convolutional neural networks
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
CN108288015B (en) Human body action recognition method and system in video based on time scale invariance
CN110675860A (en) Voice information identification method and system based on improved attention mechanism and combined with semantics
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
CN111526434B (en) Converter-based video abstraction method
CN113496217A (en) Method for identifying human face micro expression in video image sequence
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN112861524A (en) Deep learning-based multilevel Chinese fine-grained emotion analysis method
CN113627266A (en) Video pedestrian re-identification method based on Transformer space-time modeling
CN112989920A (en) Electroencephalogram emotion classification system based on frame-level feature distillation neural network
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
Hou et al. Confidence-guided self refinement for action prediction in untrimmed videos
CN113221683A (en) Expression recognition method based on CNN model in teaching scene
CN112766368A (en) Data classification method, equipment and readable storage medium
CN112528077A (en) Video face retrieval method and system based on video embedding
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN111242114A (en) Character recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant