CN116580440A - Lightweight lip language recognition method based on a vision Transformer
- Publication number
- CN116580440A (application CN202310592100.7A)
- Authority
- CN
- China
- Prior art keywords
- lip
- layer
- inputting
- lip language
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a lightweight lip language recognition method based on a vision Transformer, comprising the following steps: acquiring a lip language data set and preprocessing it to obtain lip recognition images; constructing a 3D convolutional neural network and extracting spatiotemporal features from the lip recognition images; inputting the spatiotemporal features into an improved convolutional vision Transformer network and extracting local and global spatial features of the lip recognition images; inputting the local and global spatial features into a bidirectional gated recurrent unit (Bi-GRU) and extracting long- and short-term feature sequences of the lip recognition images; and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for all categories, then obtaining recognition probability values from those scores through a cross-entropy loss function with a label smoothing mechanism, thereby completing lightweight lip language recognition. The application addresses the large parameter count and long computation and inference time of existing models, as well as the performance degradation caused by model compression.
Description
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to a lightweight lip language recognition method based on a vision Transformer.
Background
Lip language recognition, also called visual speech recognition, determines what a speaker is saying from the movement of the speaker's lips; research in this area draws on computer vision, natural language processing and related technologies. Lip language recognition is widely applied in identity authentication, speech recognition, talking-face synthesis, improving communication for deaf-mute people, public safety and other areas. According to the literature, several problems in the field need attention. First, recognition focuses mainly on the lips and their immediate surroundings, so the ability of the network model to extract fine-grained features is particularly important. Second, lip language recognition must process the temporal and spatial information between adjacent frames of a continuous video, which makes recognition difficult; at the same time, the convolutional and fully connected layers in current deep learning models greatly increase the number of model parameters and demand higher hardware performance. Furthermore, as the network becomes deeper, the resolution decreases and information is lost. Because of these problems and the inherent complexity of lip reading, the accuracy of lip reading has remained low, the models have many parameters, and computation and inference take a long time. Therefore, a lightweight lip language recognition method based on a vision Transformer is needed.
Disclosure of Invention
To solve the above technical problems, the application provides a lightweight lip language recognition method based on a vision Transformer. The method effectively extracts high-dimensional features of the image sequence and enhances the semantic representation among video key frames, thereby reducing the loss caused by global averaging of the image sequence. It has clear advantages in parameter count and computational cost, and addresses the sharp drop in recognition accuracy caused by model compression.
To achieve the above purpose, the application provides a lightweight lip language recognition method based on a vision Transformer, comprising the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatiotemporal features from the lip recognition images;
inputting the spatiotemporal features into an improved convolutional vision Transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit (Bi-GRU) and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for all categories, then obtaining recognition probability values from those scores through a cross-entropy loss function with a label smoothing mechanism, thereby completing lightweight lip language recognition.
Optionally, the method for preprocessing the lip language data set to obtain the lip recognition images comprises:
resizing the input video frames of the lip language data set to a first preset size, cropping them to a second preset size, expanding the data with data augmentation techniques, flipping each video frame with a preset probability, converting the video frames into grayscale images, and normalizing the grayscale images to obtain the lip recognition images.
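The crop, flip, grayscale and normalization steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the implementation of the application: the function names, the use of a center crop, the BT.601 luminance weights for grayscale conversion and the 8-bit input range are all assumptions, and the resize and Mixup/CutMix augmentation steps are omitted.

```python
import random

def center_crop(frame, size):
    """Crop a 2D frame (list of rows) to size x size around its center."""
    h, w = len(frame), len(frame[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]

def to_gray(rgb_frame):
    """Convert rows of (r, g, b) tuples to grayscale values.
    The BT.601 luminance weights are an assumption."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_frame]

def preprocess(rgb_frame, crop_size=88, flip_prob=0.5, rng=random):
    """Crop, randomly flip, convert to grayscale and normalize to [0, 1]."""
    frame = center_crop(rgb_frame, crop_size)
    if rng.random() < flip_prob:              # horizontal flip
        frame = [row[::-1] for row in frame]
    gray = to_gray(frame)
    return [[v / 255.0 for v in row] for row in gray]  # assumes 8-bit pixels
```

For a 96 x 96 input frame, `preprocess(frame)` returns an 88 x 88 grid of values in [0, 1].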
Optionally, the method for constructing the 3D convolutional neural network and extracting spatiotemporal features from the lip recognition images comprises:
setting one 3D convolution layer with kernel size (5, 7), stride (1, 2) and padding (2, 3), whose output passes through batch normalization and then an activation function, and is finally sent to a max pooling layer with kernel size (1, 3) and stride (1, 2).
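To see how these settings change the tensor size, the output length along each axis can be computed with the usual convolution arithmetic. The sketch below rests on assumptions: the two-entry tuples in the text are read as three-axis (time, height, width) settings, namely kernel (5, 7, 7), stride (1, 2, 2), padding (2, 3, 3) for the convolution and kernel (1, 3, 3), stride (1, 2, 2) for the pooling; the pooling layer is assumed to use no padding; and the 29-frame clip length in the usage note is hypothetical.

```python
def out_size(n, kernel, stride, padding=0):
    """Output length along one axis for a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

def stem_output_shape(t, h, w):
    """Return (shape after 3D convolution, shape after max pooling)."""
    # Assumed 3-axis reading of the convolution settings.
    conv = [out_size(n, k, s, p) for n, k, s, p in
            zip((t, h, w), (5, 7, 7), (1, 2, 2), (2, 3, 3))]
    # Assumed 3-axis reading of the pooling settings, without padding.
    pool = [out_size(n, k, s) for n, k, s in
            zip(conv, (1, 3, 3), (1, 2, 2))]
    return tuple(conv), tuple(pool)
```

Under these assumptions, `stem_output_shape(29, 88, 88)` gives `((29, 44, 44), (29, 21, 21))`: the temporal length is preserved while each spatial axis is reduced.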
Optionally, the 3D convolution that extracts spatiotemporal features from the lip recognition images is computed as:
$$v_{ij}^{xyz} = \mathrm{relu}\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\Big)$$
where $v_{ij}^{xyz}$ is the value at position (x, y, z) in the j-th feature map of the i-th layer, relu is the activation function, $b_{ij}$ is the bias, m indexes the feature maps of layer i-1 connected to the current feature map, $w_{ijm}^{pqr}$ is the kernel weight at position (p, q, r) for the m-th feature map, and $P_i$, $Q_i$, $R_i$ are the width, height and temporal dimensions of the convolution kernel, respectively.
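The 3D convolution described here, a bias plus a sum over connected feature maps and kernel positions followed by relu, can be written out directly as nested loops. The following is a minimal sketch for a single output value; the function name and the nested-list data layout are assumptions made for illustration, and practical implementations use optimized library kernels instead.

```python
def conv3d_value(prev_maps, weights, bias, x, y, z):
    """Value at (x, y, z) of one output feature map.

    prev_maps[m][x][y][z] : m-th feature map of the previous layer
    weights[m][p][q][r]   : kernel connected to the m-th input map
    """
    total = bias
    for m, fmap in enumerate(prev_maps):
        kernel = weights[m]
        for p in range(len(kernel)):              # first kernel axis
            for q in range(len(kernel[0])):       # second kernel axis
                for r in range(len(kernel[0][0])):  # third kernel axis
                    total += kernel[p][q][r] * fmap[x + p][y + q][z + r]
    return max(0.0, total)  # relu activation
```

With one 2 x 2 x 2 input map of ones, a kernel filled with 0.5 and a bias of 1.0, the value at (0, 0, 0) is 5.0.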
Optionally, the improved convolutional vision Transformer network comprises an SE-Conve-Embedding layer and three convolutional vision Transformer modules.
Optionally, the method for inputting the spatiotemporal features into the improved convolutional vision Transformer network and obtaining the local and global spatial features of the lip recognition images comprises:
inputting the spatiotemporal features into the SE-Conve-Embedding layer, compressing them with a squeeze function to obtain channel statistics, inputting the statistics into an excitation function to obtain the channel correlations, and inputting the correlations into a scale function to obtain a new feature map;
and inputting the new feature map into a convolutional vision Transformer module, obtaining local context information through the convolutional projection layer of the module, inputting the local context information into the multi-head attention layer of the module, normalizing the result, and inputting the normalized result into the multi-layer perceptron (MLP) layer of the module to obtain the local and global spatial features of the lip recognition images.
Optionally, the new feature map is obtained as:
$$\tilde{x} = F_{scale}\big(F_{ex}(F_{sq}(x)),\, x\big)$$
where x is the input feature map, $F_{sq}$ is the squeeze function, $F_{ex}$ is the excitation function, $F_{scale}$ is the scale function, and $\tilde{x}$ is the new feature map.
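The squeeze, excitation and scale functions can be illustrated with a small plain-Python sketch. It assumes the common squeeze-and-excitation design, which the text does not spell out: global average pooling for the squeeze, a two-layer bottleneck with ReLU and sigmoid for the excitation, and channel-wise multiplication for the scale. All function and parameter names are hypothetical.

```python
import math

def squeeze(x):
    """F_sq: global average of each channel; x[c][h][w] -> list of C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def dense(v, weight, bias):
    """Fully connected layer: weight has one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weight, bias)]

def excite(s, w1, b1, w2, b2):
    """F_ex: bottleneck with ReLU, then sigmoid; one gate per channel."""
    hidden = [max(0.0, h) for h in dense(s, w1, b1)]
    return [1.0 / (1.0 + math.exp(-o)) for o in dense(hidden, w2, b2)]

def scale(x, gates):
    """F_scale: multiply every value of a channel by that channel's gate."""
    return [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]

def se_block(x, w1, b1, w2, b2):
    """New feature map: scale(x, excite(squeeze(x)))."""
    return scale(x, excite(squeeze(x), w1, b1, w2, b2))
```

With all-zero excitation weights every gate equals 0.5, so the block simply halves the input; trained weights instead emphasize informative channels and suppress the rest.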
Optionally, the convolutional projection layer in the convolutional vision Transformer module obtains local context information as:
$$x_i^{q/k/v} = \mathrm{Flatten}\big(\mathrm{Conv2d}(\mathrm{Reshape2D}(x_i),\, s)\big)$$
where $x_i^{q/k/v}$ is the input to the Q/K/V matrices of the i-th layer, $x_i$ is the output of the SE-Conve-Embedding layer, and s is the convolution kernel size.
Optionally, the method for inputting the local and global spatial features into the bidirectional gated recurrent unit and extracting the long- and short-term feature sequences of the lip recognition images comprises:
setting the input dimension to 512 and the hidden layer dimension to 1024, with 3 layers in total and an output dimension of 2048;
the gated recurrent unit computes its gates as:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1})$$
where z is the update gate, r is the reset gate, h is the hidden state, and W and U are the input and hidden weight matrices, respectively.
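A single gated recurrent unit step and its bidirectional use can be sketched in plain Python as follows. The update and reset gates follow the formulas given in the text; the candidate state and the final interpolation are not reproduced there, so the standard GRU form is assumed, and the convention h_t = (1 - z_t) * h_{t-1} + z_t * candidate is an assumption (some references swap the roles of z_t and 1 - z_t). Biases are omitted, and the parameter names are hypothetical.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x, h_prev, p):
    """One GRU step; p holds matrices W_z, U_z, W_r, U_r, W_h, U_h."""
    z = sigmoid(add(matvec(p["W_z"], x), matvec(p["U_z"], h_prev)))  # update gate
    r = sigmoid(add(matvec(p["W_r"], x), matvec(p["U_r"], h_prev)))  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a)
            for a in add(matvec(p["W_h"], x), matvec(p["U_h"], gated))]
    # Assumed interpolation convention (see the note above).
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def bi_gru(seq, h0, p_fwd, p_bwd):
    """Run forward and backward passes; concatenate states per time step."""
    fwd, h = [], h0
    for x in seq:
        h = gru_step(x, h, p_fwd)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(seq):
        h = gru_step(x, h, p_bwd)
        bwd.append(h)
    bwd.reverse()
    return [f + b for f, b in zip(fwd, bwd)]
```

Because the forward and backward states are concatenated, each output vector is twice the hidden size, which matches the hidden dimension of 1024 and output dimension of 2048 stated above.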
Optionally, the method for inputting the long- and short-term feature sequences into the multi-layer perceptron and obtaining the confidence scores of all categories comprises:
inputting the extracted long- and short-term feature sequences into the multi-layer perceptron, which receives them flattened into one-dimensional tensors and multiplies the tensors by a weight matrix to generate the output features, yielding the confidence scores of all categories.
The technical effects of the application are as follows. The application discloses a lightweight lip language recognition method based on a vision Transformer that makes advances in the data preprocessing, data augmentation and feature extraction stages, improving the recognition accuracy of the model. Weight sharing is combined with weight distillation: applying a weight transformation after weight sharing reduces model instability and performance degradation, and Prediction-Logit distillation is used as the main distillation loss, with weighted Self-Attention distillation loss and Hidden-State distillation loss added. This reduces the model parameters, which improves training and testing speed, lowers memory consumption, improves model convergence and speeds up model operation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a schematic flow chart of the lightweight lip language recognition method based on a vision Transformer according to an embodiment of the application;
FIG. 2 is a block diagram of feature extraction and recognition, in which images are input to the 3D convolutional network and the vision Transformer, according to an embodiment of the application;
FIG. 3 is a basic architecture diagram of the CvT module according to an embodiment of the application;
FIG. 4 is a structure diagram of the improved Transformer Block according to an embodiment of the application;
FIG. 5 is a diagram of the overall structure of Mini-3DCvT according to an embodiment of the present application;
FIG. 6 is a schematic diagram of weight transformation in the attention layer and feed forward network according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in FIG. 1, the lightweight lip language recognition method based on a vision Transformer in this embodiment comprises the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatiotemporal features from the lip recognition images;
inputting the spatiotemporal features into an improved convolutional vision Transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit (Bi-GRU) and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for all categories, then obtaining recognition probability values from those scores through a cross-entropy loss function with a label smoothing mechanism, thereby completing lightweight lip language recognition.
As shown in FIG. 2, the input size of the front-end network is (256×1×88×88), and the input data is obtained by preprocessing the original data set. The data first passes through the 3D Conv layer and then enters the Transformer Block. The parameter settings of the Transformer Block differ across the three stages, and the parameter values of each stage can be adjusted to suit different input data. After the front-end network has extracted the feature information, word boundary information is added, and the size of the data is adjusted to (256×513×5×5). To better capture the global correlations and the key recognition information of the feature sequence, the feature information is sent to the back-end network, a Bi-GRU. The Bi-GRU concatenates, along the channel dimension, the feature vector output by the front-end network with the feature vector processed in the reverse direction. Finally, a dropout layer is applied to effectively avoid the vanishing gradient problem and to prevent the model from overfitting.
In the lip language recognition method based on deep convolution and the attention mechanism, large-scale lip reading data sets are preprocessed to obtain the lip recognition images as follows:
a unified preprocessing strategy is adopted for the data sets LRW and LRW-1000. We resize the initial input video frames to 96 x 96, then crop them to 88 x 88, and apply the Mixup and CutMix data augmentation techniques to expand the data used as the final input. We use a batch size of 256 video frames in each training epoch. Each video frame is then flipped with a probability of 0.5, converted to a grayscale image, and finally normalized to [0, 1]. Furthermore, before the extracted feature information enters the back end, we expand the data dimension from 512 to 513; the added dimension is called the word boundary. It provides contextual and environmental information that helps classify the lip reading samples. In addition, the label values of the data samples are smoothed, and label smoothing is incorporated into the loss function to guide the training of the network model and reduce overfitting.
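Label smoothing inside the cross-entropy loss can be sketched as follows. This is a minimal illustration, not the training code of the application: the smoothing factor `eps=0.1` is an assumed value that the text does not specify, and the function names are hypothetical.

```python
import math

def smooth_labels(target, num_classes, eps=0.1):
    """Move eps of the probability mass from the true class to a uniform prior."""
    return [(1.0 - eps) * (1.0 if c == target else 0.0) + eps / num_classes
            for c in range(num_classes)]

def log_softmax(scores):
    """Numerically stable log-softmax of a list of confidence scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def label_smoothing_ce(scores, target, eps=0.1):
    """Cross-entropy between the smoothed label distribution and softmax(scores)."""
    logp = log_softmax(scores)
    q = smooth_labels(target, len(scores), eps)
    return -sum(qi * lp for qi, lp in zip(q, logp))
```

With `eps=0` the function reduces to the ordinary cross-entropy. A positive `eps` penalizes over-confident predictions, which is how label smoothing reduces overfitting.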
In the lip language recognition method based on deep convolution and the attention mechanism, the 3D convolutional neural network extracts spatiotemporal features as follows: one 3D convolution layer is set with kernel size (5, 7), stride (1, 2) and padding (2, 3); its output passes through batch normalization and then an activation function, and is finally sent to a max pooling layer with kernel size (1, 3) and stride (1, 2). The 3D convolution is computed as:
$$v_{ij}^{xyz} = \mathrm{relu}\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\Big)$$ (1)
where $v_{ij}^{xyz}$ is the value at position (x, y, z) in the j-th feature map of the i-th layer, relu is the activation function, $b_{ij}$ is the bias, m indexes the feature maps of layer i-1 connected to the current feature map, $w_{ijm}^{pqr}$ is the kernel weight at position (p, q, r) for the m-th feature map, and $P_i$, $Q_i$, $R_i$ are the width, height and temporal dimensions of the convolution kernel, respectively. The features are then processed by the convolutional token embedding layer, whose embedding kernel size is (7, 7), stride is (2, 2), and number of kernels is 128.
The spatiotemporal feature vector generated in the previous step is input into an improved lightweight three-stage convolutional vision Transformer. This part consists of three Transformer stages, whose architecture is shown in FIG. 3.
The module consists of a convolutional projection layer, a multi-head attention layer and a fully connected layer, stacked in three stages, wherein: the first-stage convolutional projection layer has kernel size (3, 3), 128 kernels, 3 attention heads and depth 128; the second-stage convolutional projection layer has kernel size (3, 3), 256 kernels, 12 attention heads and depth 256; the third-stage convolutional projection layer has kernel size (3, 3), 512 kernels, 16 attention heads and depth 512.
To improve the recognition performance of the algorithm, the following modifications are made:
(1) Stacking Transformer blocks and deepening the network reduces the image resolution and weakens the ability of the network to capture spatial feature information. To solve this problem, the convolutional token embedding layer of the Transformer block is improved by adding a squeeze-and-excitation (SE) structure, and the new network layer is named SE-Conve-Embedding. The structure of SE-Conve-Embedding is shown in FIG. 4: the feature map is input into Squeeze to obtain the channel statistics, the statistics are input into Excitation to obtain the channel correlations, and the correlations are finally input into Scale to obtain a new feature map, as shown in formula (2):
$$\tilde{x} = F_{scale}\big(F_{ex}(F_{sq}(x)),\, x\big)$$ (2)
where x is the input feature map, $F_{sq}$ is the squeeze function, $F_{ex}$ is the excitation function, $F_{scale}$ is the scale function, and $\tilde{x}$ is the new feature map. By changing its parameters, this layer can increase the feature dimension of the tokens and shorten the token sequence, thereby compensating for the information loss caused by deepening the network and reducing the resolution.
The features mapped by SE-Conve-Embedding are sent to the Conv-project layer in the Transformer Block. The purpose of this layer is to better capture local context information. The formula is as follows:
$$x_i^{q/k/v} = \mathrm{Flatten}\big(\mathrm{Conv2d}(\mathrm{Reshape2D}(x_i),\, s)\big)$$ (3)
where $x_i^{q/k/v}$ is the input to the Q/K/V matrices of the i-th layer, $x_i$ is the output of the SE-Conve-Embedding layer, and s is the convolution kernel size. The feature matrix is then passed from the Conv-project layer into the multi-head attention layer, whose number of heads increases at each stage. The values are then normalized by layer normalization, and the output is finally obtained through a multi-layer perceptron (MLP) layer. This result is the output of the front-end network and is input into the back-end network for sequence modeling. These operations allow features to be extracted well at different scales. Meanwhile, the ReLU activation function can reduce the difference between the output value and the expected value. As shown by the improved Transformer block in FIG. 4, the convolution and attention mechanisms extract and model the feature information of the input feature map well, so that the regularities in the data are learned better.
(2) The module is made lightweight as follows. Starting from the base model, the vision Transformer is processed in two steps:
The first step generates a compact architecture with weight transformation. Given a pre-trained large vision Transformer model, parameters are first shared between every k adjacent Transformer layers, except for LayerNorm. The weights of each layer are then transformed by inserting a tiny linear layer before and after the softmax layer. In addition, a depth-wise convolutional layer is introduced for the MLP. These linear layers and transformation blocks are not shared.
The second step trains the compressed model with weight distillation. In this step, the target loss function is defined as the sum of the proposed loss functions, and the proposed weight distillation method transfers knowledge from the large pre-trained model to the small model. By distilling inside the Transformer module, the student can closely imitate the behavior of the teacher network, thereby extracting more useful knowledge from the large-scale pre-trained model. Note that this distillation inside the Transformer module is performed only when both the teacher and student models are Transformer architectures. In other cases, where the student and teacher architectures are heterogeneous, only the complete Prediction-Logit distillation is retained; when the architectures match, the weighted Self-Attention distillation loss and Hidden-State distillation loss are added to it.
The overall structure of the resulting Mini-3DCvT is shown in FIG. 5. It extracts lip movement features with a three-dimensional convolution and the lightweight vision Transformer. The extracted feature information is then processed by a bidirectional gated recurrent unit (Bi-GRU) and a fully connected (FC) layer for sequence modeling and classification.
The two-step process for compressing the vision Transformer block is described in detail below. The first step applies the weight transformation to both the multi-head self-attention (MSA) block and the multi-layer perceptron (MLP). This transformation allows each layer to be different, thereby improving parameter diversity and the representation capability of the model. As shown in FIG. 6, the parameters of the transformation kernels are not shared across layers, while all other blocks of the original Transformer except LayerNorm are shared. Since the shared blocks account for most of the model parameters, the model size increases only slightly after weight transformation.
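The effect of sharing weights across every k adjacent layers on the parameter count can be illustrated with simple arithmetic. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the per-layer counts in the usage note are made-up round numbers, not figures from the application.

```python
import math

def compressed_param_count(num_layers, k, shared_per_layer, private_per_layer):
    """Parameters after sharing weights across each group of k adjacent layers.

    shared_per_layer  : attention/MLP weights stored once per group
    private_per_layer : LayerNorm and transformation weights kept per layer
    """
    groups = math.ceil(num_layers / k)
    return groups * shared_per_layer + num_layers * private_per_layer
```

For example, `compressed_param_count(12, 4, 1000, 10)` returns 3120, compared with 12120 when nothing is shared (`k=1`); the unshared transformation weights add only the small per-layer term.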
Attention layer weight transformation
The multi-head self-attention (MSA) layer in the Transformer Block uses weight transformation, with layer normalization and a residual connection before and after each block. The MSA layer is described in detail as follows:
Let M be the number of heads in the MSA, also referred to as the self-attention module. Given the input sequence $Z_0 \in \mathbb{R}^{N \times D}$, in the k-th head the queries, keys and values are generated by linear projection and denoted $Q_k$, $K_k$ and $V_k \in \mathbb{R}^{N \times d}$, where N is the number of tokens. The dimensions of the Conv-project input and of the Q-K-V matrices are D and d, respectively. Next, a weighted average of all values is calculated for each position. The weights are given by the attention matrix $A_k$, obtained from the similarity between different elements. With $h_k$ denoting the output of the attention layer, formulas (4) and (5) are:
$$h_k = A_k V_k$$ (4)
$$A_k = \mathrm{softmax}\big(Q_k K_k^{T} / \sqrt{d}\big)$$ (5)
where the softmax operation is applied to each row of the input matrix. Finally, the outputs of all heads are combined by one fully connected layer.
To increase parameter diversity, the application inserts two linear transformations, one before and one after the softmax of the self-attention module, defined as follows:
$$h_k = A'_k V_k = \sum_{n=1}^{M} F^{(2)}_{kn} A_n V_k$$ (6)
$$A_n = \mathrm{softmax}\Big(\sum_{m=1}^{M} F^{(1)}_{nm}\, Q_m K_m^{T} / \sqrt{d}\Big)$$ (7)
where $F^{(1)}, F^{(2)} \in \mathbb{R}^{M \times M}$ are the linear transformation kernels before and after the softmax, respectively. These linear transformations make each attention matrix $A'_n$ different while combining information across attention heads, which increases parameter variance.
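The two head-mixing transformations around the softmax can be sketched in plain Python as below. The sketch assumes one plausible reading of the description: F1 forms linear combinations of the scaled attention logits across heads before the softmax, and F2 forms linear combinations of the resulting attention matrices across heads after it. The function names and the nested-list layout (one N x d matrix per head) are hypothetical.

```python
import math

def softmax_rows(mat):
    """Apply softmax to each row of a matrix."""
    out = []
    for row in mat:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def mix_heads(mats, f):
    """Linear combination across heads: out[n] = sum_m f[n][m] * mats[m]."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[[sum(f[n][m] * mats[m][i][j] for m in range(len(mats)))
              for j in range(cols)] for i in range(rows)]
            for n in range(len(f))]

def transformed_attention(qs, ks, vs, f1, f2):
    """qs, ks, vs: one N x d matrix per head; f1, f2: M x M mixing kernels."""
    d = len(qs[0][0])
    logits = [[[v / math.sqrt(d) for v in row]
               for row in matmul(q, [list(c) for c in zip(*k)])]
              for q, k in zip(qs, ks)]
    attn = [softmax_rows(m) for m in mix_heads(logits, f1)]  # mix before softmax
    attn = mix_heads(attn, f2)                                # mix after softmax
    return [matmul(a, v) for a, v in zip(attn, vs)]
```

When `f1` and `f2` are identity matrices the function reduces to ordinary multi-head attention, so the small M x M kernels are the only parameters that differ between layers sharing the same attention weights.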
Multi-layer perceptron weight transformation
The multi-layer perceptron (MLP) consists of two fully connected layers, with an activation function denoted $\sigma$, typically GELU. Let $Y \in \mathbb{R}^{N \times d}$ be the input to the MLP. The output of the MLP is expressed as:
$H = \sigma(Y W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$ (8)
where $W^{(1)} \in \mathbb{R}^{d \times d'}$, $b^{(1)} \in \mathbb{R}^{d'}$, $W^{(2)} \in \mathbb{R}^{d' \times d}$ and $b^{(2)} \in \mathbb{R}^{d}$ are the weights and biases of the first and second layers, respectively. Usually $d' > d$.
The MLP is then given a lightweight transformation to improve parameter diversity. Specifically, let the input be $Y = [y_1, \dots, y_d]$, where $y_l$ denotes the $l$-th position of the embedding vectors of all tokens. Then $d$ linear transformations are introduced to convert $Y$ into $Y' = [C^{(1)} y_1, \dots, C^{(d)} y_d]$, where $C^{(1)}, \dots, C^{(d)} \in \mathbb{R}^{N \times N}$ are the independent weight matrices of the linear layers. Equation (8) is then rewritten as:
$H = \sigma(Y' W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$ (9)
To reduce the number of parameters and introduce locality into the transformation, depthwise convolution [41] is employed to sparsify and share the weights in each weight matrix, resulting in only $K^2 d$ parameters instead of $N^2 d$ ($K \ll N$), where $K$ is the kernel size of the convolution. After the transformation, the output of the MLP is more diverse and parameter efficiency is improved.
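The saving from replacing the dense per-channel matrices with a depthwise convolution can be checked with simple arithmetic. The sketch below only counts parameters. The values N = 196, d = 384 and K = 3 are hypothetical toy settings, not figures from the application.

```python
def dense_transform_params(num_tokens: int, dim: int) -> int:
    """d independent N x N matrices C^(1)..C^(d): N^2 * d parameters."""
    return num_tokens ** 2 * dim

def depthwise_conv_params(kernel_size: int, dim: int) -> int:
    """One K x K kernel per channel (depthwise convolution): K^2 * d parameters."""
    return kernel_size ** 2 * dim

# Hypothetical sizes chosen only for illustration.
N, d, K = 196, 384, 3
dense = dense_transform_params(N, d)     # 14751744
sparse = depthwise_conv_params(K, d)     # 3456
```

Under these assumed sizes the transformation costs a few thousand parameters per layer rather than millions, which is why the model grows only slightly.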
Through these transformations, the weight-shared layers can recover the behavior of the pre-trained model, similar to a multiplexing process. This alleviates the unstable training and reduced performance that are the main disadvantages of the weight sharing method.
To compress the pre-trained large model and address the performance degradation caused by weight sharing, weight distillation is further used to transfer knowledge from the large model to the small, compact model. Three types of distillation are applied to the transformation blocks in the present application, namely Prediction-Logit distillation, Self-Attention distillation and Hidden-State distillation.
Prediction-Logit distillation
Hinton et al. first demonstrated that a deep learning model can achieve better performance by mimicking the output behavior of a well-performing teacher model during training. Using this concept, the prediction loss is introduced as follows:
where $z_s$ and $z_t$ are the logits predicted by the student model and the teacher model, respectively, and $T$ is a temperature value that controls the smoothness of the logits. In the experiments, $T = 1$. CE denotes the cross-entropy loss.
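A minimal sketch of such a prediction loss is given below. Because the formula itself is missing from this text, the form used here is an assumption: the cross entropy between the temperature-softened teacher and student distributions, CE(softmax(z_s / T), softmax(z_t / T)). The function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    e = [math.exp(z - m) for z in scaled]
    s = sum(e)
    return [v / s for v in e]

def prediction_logit_loss(student_logits, teacher_logits, temperature=1.0):
    """Assumed form: cross entropy of the student distribution against the
    teacher distribution, both softened by the same temperature T."""
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    return -sum(t * math.log(s) for s, t in zip(p_s, p_t))
```

The loss is smallest when the student reproduces the teacher distribution, so minimizing it pulls the student logits towards the teacher logits. With T = 1, as stated in the text, no extra smoothing is applied.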
Self-Attention distillation
It is beneficial to use the attention maps in the Transformer layers to guide the training of the student model. To solve the problem of inconsistent dimensions between the student and teacher models caused by their different numbers of heads, a cross-entropy loss is applied to the relations among the queries, keys and values in the MSA. First, the matrices of all heads are concatenated. For example, define $Q = [Q_1, \dots, Q_M] \in \mathbb{R}^{N \times Md}$, and likewise $K, V \in \mathbb{R}^{N \times Md}$. To simplify the notation, $Q$, $K$ and $V$ are denoted by $S_1$, $S_2$ and $S_3$, respectively. Then 9 different relation matrices $R_{ij}$ can be generated. Note that $R_{12}$ is the attention matrix $A$. The Self-Attention distillation loss can be expressed as:
where $R_{ij,n}$ denotes the $n$-th row of $R_{ij}$.
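The reason relation matrices sidestep the dimension mismatch is that each one is N x N regardless of the number of heads or the embedding width. The sketch below illustrates this under stated assumptions, since the defining formulas are not reproduced here: each relation matrix is taken as a row-wise softmax of a scaled product $S_i S_j^T$, and the loss averages the row-wise cross entropy over the 9 relations and N rows. All names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def relation(a, b):
    """Assumed relation matrix: row-wise softmax of a @ b.T / sqrt(width).
    The result is N x N whatever the width of a and b."""
    scale = math.sqrt(len(a[0]))
    return [softmax([sum(x * y for x, y in zip(ra, rb)) / scale for rb in b])
            for ra in a]

def attention_distill_loss(student_qkv, teacher_qkv):
    """student_qkv, teacher_qkv: tuples (Q, K, V), each N x (M*d) with the heads
    concatenated. The student and teacher widths may differ."""
    n_tokens = len(student_qkv[0])
    total = 0.0
    for i in range(3):
        for j in range(3):
            r_s = relation(student_qkv[i], student_qkv[j])
            r_t = relation(teacher_qkv[i], teacher_qkv[j])
            for row_s, row_t in zip(r_s, r_t):
                # cross entropy of the student row against the teacher row
                total -= sum(t * math.log(s) for s, t in zip(row_s, row_t))
    return total / (9 * n_tokens)
```

In this sketch a student of width 2 and a teacher of width 4 can be compared directly, because only the N x N relations enter the loss.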
Hidden-State distillation
Similarly, a relation matrix can be generated for the hidden states, i.e., the features output by the MLP [45]. The hidden state of a Transformer layer is denoted by $H \in \mathbb{R}^{N \times d}$, and the Hidden-State distillation loss based on the relation matrix is defined as:
where $R_{H,n}$ is the $n$-th row of $R_H$, and the calculation formula is
On the basis of the Prediction-Logit distillation, the application adds a combination of the other two knowledge distillation modes to form the final loss, so the final distillation objective is expressed as:
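The formula itself is not reproduced in this text. As a sketch only, and assuming the combination is a weighted sum with the prediction loss as the main term, the objective would take the following form. The symbols $\mathcal{L}_{\text{pred}}$, $\mathcal{L}_{\text{att}}$ and $\mathcal{L}_{\text{hid}}$ and the weights $\beta$ and $\gamma$ are illustrative names, not taken from the application.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{pred}}
  + \beta \, \mathcal{L}_{\text{att}}
  + \gamma \, \mathcal{L}_{\text{hid}},
\qquad \beta \ge 0,\ \gamma \ge 0
```

Setting $\beta = \gamma = 0$ in this form recovers plain Prediction-Logit distillation.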
The extracted spatial features are input into a bidirectional gated recurrent unit (Bi-GRU) to extract long-term and short-term feature sequences:
The bidirectional gated recurrent unit is configured as follows: the input dimension is 512, the hidden-layer dimension is 1024, there are 3 layers in total, and the output dimension is 2048. The gated recurrent unit is computed with $z_t = \sigma(W_z x_t + U_z h_{t-1})$ and $r_t = \sigma(W_r x_t + U_r h_{t-1})$, where $z$ is the update gate, $r$ is the reset gate, $h$ is the hidden value, and $W$ and $U$ are the input and hidden weight matrices, respectively.
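The gate equations above can be sketched as a single-layer bidirectional GRU in plain Python. Only the update gate and reset gate follow the text. The candidate state and the final interpolation use the common textbook form and are assumptions here, as are the function names, the parameter dictionary keys and the omission of bias terms. The sketch uses one layer and toy sizes rather than the 3 layers and 512/1024 dimensions of the application.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gru_step(x, h_prev, p):
    """One GRU step. p maps hypothetical keys Wz, Uz, Wr, Ur, Wh, Uh to matrices."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Assumed candidate state: tanh(Wh x + Uh (r * h_prev))
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    # Assumed interpolation: h = (1 - z) * h_prev + z * candidate
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

def bidirectional_gru(xs, fwd, bwd, hidden):
    """Run one forward and one backward pass and concatenate the states per step."""
    h = [0.0] * hidden
    forward = []
    for x in xs:
        h = gru_step(x, h, fwd)
        forward.append(h)
    h = [0.0] * hidden
    backward = []
    for x in reversed(xs):
        h = gru_step(x, h, bwd)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]  # width 2 * hidden
```

Concatenating the two directions doubles the feature width, which is consistent with a hidden dimension of 1024 yielding an output dimension of 2048.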
The extracted long-term and short-term feature sequences are input into a multi-layer perceptron to obtain the confidence scores of all categories. Specifically, the multi-layer perceptron has an input dimension of 2048 and an output dimension of 1000; it receives the feature sequence flattened into a one-dimensional tensor and multiplies it by a weight matrix to generate the output features, which give the confidence scores of all categories. Based on these confidence scores, recognition probability values are output through a cross-entropy loss function with a label smoothing mechanism: the output features and the ground-truth labels are fed into the cross-entropy loss function with label smoothing to output the recognition probability values.
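A cross-entropy loss with label smoothing can be sketched as follows. The smoothing scheme shown, which places 1 - epsilon on the true class and spreads epsilon uniformly over all classes, is one common formulation and is an assumption here. The value epsilon = 0.1 and the function names are hypothetical, since the text does not state them.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of confidence scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def smoothed_cross_entropy(scores, label, epsilon=0.1):
    """Cross entropy against a smoothed target: epsilon / C on every class,
    plus 1 - epsilon on the true class (assumed smoothing scheme)."""
    n = len(scores)
    logp = log_softmax(scores)
    target = [epsilon / n + (1.0 - epsilon if k == label else 0.0) for k in range(n)]
    return -sum(t * lp for t, lp in zip(target, logp))

def probabilities(scores):
    """Recognition probabilities obtained from the confidence scores."""
    return [math.exp(lp) for lp in log_softmax(scores)]
```

With epsilon = 0 the function reduces to the ordinary cross entropy, and larger values penalize over-confident predictions on the true class.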
The application provides a lightweight lip language identification method based on a visual Transformer. It makes advances in the data preprocessing, data enhancement and feature extraction stages, which improves the recognition accuracy of the model. It combines weight sharing with weight distillation: applying transformations after weight sharing reduces model instability and performance degradation, and Prediction-Logit distillation is taken as the main loss, to which weighted Self-Attention distillation and Hidden-State distillation losses are added. This reduces the model parameters, which improves training and testing speed, lowers memory consumption, improves model convergence and speeds up model operation. Experiments show that the Mini-3DCvT provided by the application achieves high accuracy while greatly reducing the number of model parameters, the amount of computation and the training time, and performs well in terms of accuracy, acceleration and compression. Meanwhile, beyond the recognition accuracy required of lip reading methods, making the model lightweight has gradually become an important problem restricting the development of the field.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.
Claims (10)
1. A lightweight lip language identification method based on a visual Transformer, characterized by comprising the following steps:
acquiring a lip language data set, preprocessing based on the lip language data set, and acquiring a lip identification image;
constructing a 3D convolutional neural network, and extracting space-time characteristics from the lip recognition image;
inputting the space-time features into an improved convolution visual transformation network to obtain local space features and global space features of the lip identification image;
inputting the local spatial features and the global spatial features into a bidirectional gated recurrent unit, and extracting long-term and short-term feature sequences of the lip identification image;
and inputting the long-term and short-term feature sequences into a multi-layer perceptron, obtaining confidence scores of all categories, and obtaining recognition probability values through a cross-entropy loss function with a label smoothing mechanism based on the confidence scores of all categories, to complete lightweight lip language recognition.
2. The visual Transformer-based lightweight lip language identification method as defined in claim 1, wherein the method for preprocessing based on the lip language data set and acquiring the lip identification image is as follows:
adjusting the size of the video frames input from the lip language data set to a first preset size, cropping the video frames to a second preset size, expanding the data by means of a data enhancement technique in which each video frame is flipped with a preset probability, converting the video frames into grayscale images, and normalizing the grayscale images to obtain the lip identification images.
3. The visual Transformer-based lightweight lip language identification method of claim 1, wherein the method for constructing the 3D convolutional neural network and extracting spatiotemporal features from the lip recognition image comprises:
setting one 3D convolution layer with a convolution kernel size of (5, 7), a stride of (1, 2) and a padding of (2, 3), followed by batch normalization, then an activation function, and finally a max pooling layer with a kernel size of (1, 3) and a stride of (1, 2).
4. The visual Transformer-based lightweight lip language identification method of claim 1, wherein, in constructing the 3D convolutional neural network, the spatiotemporal features are extracted from the lip recognition image by the following calculation:
where the computed quantity is the value at position (x, y, z) in the j-th feature map of the i-th layer, ReLU is the activation function, b is the bias, m is the index of the feature maps of layer i-1 that are connected to the current feature map, and $P_i$, $Q_i$ and $R_i$ are the width, height and time dimensions of the convolution kernel, respectively.
5. The visual Transformer-based lightweight lip language identification method of claim 1, wherein the improved convolutional visual transformation network comprises an SE-Conv-Embedding layer and three convolutional visual transformation modules.
6. The visual Transformer-based lightweight lip language identification method of claim 5, wherein the method for inputting the spatiotemporal features into the improved convolutional visual transformation network and obtaining the local spatial features and global spatial features of the lip recognition image comprises:
inputting the spatiotemporal features into the SE-Conv-Embedding layer, compressing the spatiotemporal features with a compression function to obtain the statistical information of the channels, inputting the statistical information into an excitation function to obtain the correlation between the channels, and inputting the correlation into a scale function to obtain a new feature map;
and inputting the new feature map into a convolution visual transformation module, acquiring local context information through a convolution projection layer in the convolution visual transformation module, inputting the local context information into a multi-head attention layer in the convolution visual transformation module for normalization processing, acquiring a normalization result, and inputting the normalization result into a multi-layer perceptron MLP layer in the convolution visual transformation module to acquire local spatial features and global spatial features of the lip identification image.
7. The visual Transformer-based lightweight lip language identification method of claim 6, wherein the new feature map is obtained as follows:
where x is the input feature map, $F_{sq}$ is the compression function, $F_{ex}$ is the excitation function, $F_{scale}$ is the scale function, and the result is the new feature map.
8. The visual Transformer-based lightweight lip language identification method according to claim 6, wherein the local context information is acquired through the convolution projection layer in the convolution visual transformation module as follows:
where the left-hand side is the input to the Q/K/V matrices of the i-th layer, $x_i$ is the output of the SE-Conv-Embedding layer, and $s$ is the convolution kernel size.
9. The visual Transformer-based lightweight lip language identification method according to claim 1, wherein the method for inputting the local spatial features and the global spatial features into the bidirectional gated recurrent unit and extracting the long-term and short-term feature sequences of the lip identification image comprises:
setting the input dimension to 512 and the hidden-layer dimension to 1024, with 3 layers in total, and the output dimension to 2048;
the gated recurrent unit is calculated as follows:
where $z_t = \sigma(W_z x_t + U_z h_{t-1})$, $r_t = \sigma(W_r x_t + U_r h_{t-1})$, $z$ is the update gate, $r$ is the reset gate, $h$ is the hidden value, and $W$ and $U$ are the input and hidden weight matrices, respectively.
10. The visual Transformer-based lightweight lip language identification method according to claim 1, wherein the method for inputting the long-term and short-term feature sequences into the multi-layer perceptron and obtaining the confidence scores of the respective categories comprises:
inputting the extracted long-term and short-term feature sequences into the multi-layer perceptron, which receives them flattened into a one-dimensional tensor and multiplies the one-dimensional tensor by a weight matrix to generate the output features, thereby obtaining the confidence scores of all categories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310592100.7A CN116580440B (en) | 2023-05-24 | 2023-05-24 | Lightweight lip language identification method based on visual transducer |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116580440A (en) | 2023-08-11 |
CN116580440B (en) | 2024-01-26 |
Family
ID=87541097
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310592100.7A Active CN116580440B (en) | 2023-05-24 | 2023-05-24 | Lightweight lip language identification method based on visual transducer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116580440B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117237856A (en) * | 2023-11-13 | 2023-12-15 | 腾讯科技(深圳)有限公司 | Image recognition method, device, computer equipment and storage medium |
CN117689044A (en) * | 2024-02-01 | 2024-03-12 | 厦门大学 | Quantification method suitable for vision self-attention model |
CN117952869A (en) * | 2024-03-27 | 2024-04-30 | 西南石油大学 | Drilling fluid rock debris counting method based on weak light image enhancement |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401250A (en) * | 2020-03-17 | 2020-07-10 | 东北大学 | Chinese lip language identification method and device based on hybrid convolutional neural network |
WO2020232867A1 (en) * | 2019-05-21 | 2020-11-26 | 平安科技(深圳)有限公司 | Lip-reading recognition method and apparatus, computer device, and storage medium |
CN113343937A (en) * | 2021-07-15 | 2021-09-03 | 北华航天工业学院 | Lip language identification method based on deep convolution and attention mechanism |
US20210280191A1 (en) * | 2018-01-02 | 2021-09-09 | Boe Technology Group Co., Ltd. | Lip language recognition method and mobile terminal |
CN114359786A (en) * | 2021-12-07 | 2022-04-15 | 重庆邮电大学 | Lip language identification method based on improved space-time convolutional network |
CN114973412A (en) * | 2022-05-31 | 2022-08-30 | 华中科技大学 | Lip language identification method and system |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210280191A1 (en) * | 2018-01-02 | 2021-09-09 | Boe Technology Group Co., Ltd. | Lip language recognition method and mobile terminal |
WO2020232867A1 (en) * | 2019-05-21 | 2020-11-26 | 平安科技(深圳)有限公司 | Lip-reading recognition method and apparatus, computer device, and storage medium |
CN111401250A (en) * | 2020-03-17 | 2020-07-10 | 东北大学 | Chinese lip language identification method and device based on hybrid convolutional neural network |
CN113343937A (en) * | 2021-07-15 | 2021-09-03 | 北华航天工业学院 | Lip language identification method based on deep convolution and attention mechanism |
CN114359786A (en) * | 2021-12-07 | 2022-04-15 | 重庆邮电大学 | Lip language identification method based on improved space-time convolutional network |
CN114973412A (en) * | 2022-05-31 | 2022-08-30 | 华中科技大学 | Lip language identification method and system |
Non-Patent Citations (1)
Title |
---|
任玉强 等: "高安全性人脸识别系统中的唇语识别算法研究", 计算机应用研究, vol. 34, no. 4 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117237856A (en) * | 2023-11-13 | 2023-12-15 | 腾讯科技(深圳)有限公司 | Image recognition method, device, computer equipment and storage medium |
CN117237856B (en) * | 2023-11-13 | 2024-03-01 | 腾讯科技(深圳)有限公司 | Image recognition method, device, computer equipment and storage medium |
CN117689044A (en) * | 2024-02-01 | 2024-03-12 | 厦门大学 | Quantification method suitable for vision self-attention model |
CN117952869A (en) * | 2024-03-27 | 2024-04-30 | 西南石油大学 | Drilling fluid rock debris counting method based on weak light image enhancement |
Also Published As
Publication number | Publication date |
---|---|
CN116580440B (en) | 2024-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116580440B (en) | Lightweight lip language identification method based on visual transducer | |
CN111985369B (en) | Course field multi-modal document classification method based on cross-modal attention convolution neural network | |
CN113887610B (en) | Pollen image classification method based on cross-attention distillation transducer | |
CN112507898B (en) | Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN | |
CN113343937B (en) | Lip language identification method based on deep convolution and attention mechanism | |
CN112328767A (en) | Question-answer matching method based on BERT model and comparative aggregation framework | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN116524593A (en) | Dynamic gesture recognition method, system, equipment and medium | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN114780767A (en) | Large-scale image retrieval method and system based on deep convolutional neural network | |
Jiang et al. | Cross-level reinforced attention network for person re-identification | |
CN117238019A (en) | Video facial expression category identification method and system based on space-time relative transformation | |
CN115240713B (en) | Voice emotion recognition method and device based on multi-modal characteristics and contrast learning | |
CN116246305A (en) | Pedestrian retrieval method based on hybrid component transformation network | |
CN116543289A (en) | Image description method based on encoder-decoder and Bi-LSTM attention model | |
CN115527064A (en) | Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning | |
CN116167014A (en) | Multi-mode associated emotion recognition method and system based on vision and voice | |
CN115188079A (en) | Continuous sign language identification method based on self-adaptive multi-scale attention time sequence network | |
CN113128456B (en) | Pedestrian re-identification method based on combined picture generation | |
CN113761106B (en) | Self-attention-strengthening bond transaction intention recognition system | |
CN117831138B (en) | Multi-mode biological feature recognition method based on third-order knowledge distillation | |
CN116229234A (en) | Image recognition method based on fusion attention mechanism | |
CN116665192A (en) | Fatigue detection method based on LSTM and SSD lightweight network combination | |
Wang et al. | Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer | |
CN118172705A (en) | Cross-architecture video action recognition method and device based on knowledge distillation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||