CN116580440A - Lightweight lip language identification method based on visual transformer - Google Patents

Lightweight lip language identification method based on visual transformer

Info

Publication number
CN116580440A
CN116580440A (application CN202310592100.7A)
Authority
CN
China
Prior art keywords
lip
layer
inputting
lip language
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310592100.7A
Other languages
Chinese (zh)
Other versions
CN116580440B (en)
Inventor
王慧娟
袁全波
邢艺兰
谢佳飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Institute of Aerospace Engineering
Original Assignee
North China Institute of Aerospace Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Institute of Aerospace Engineering filed Critical North China Institute of Aerospace Engineering
Priority to CN202310592100.7A priority Critical patent/CN116580440B/en
Publication of CN116580440A publication Critical patent/CN116580440A/en
Application granted granted Critical
Publication of CN116580440B publication Critical patent/CN116580440B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a lightweight lip language recognition method based on a vision transformer, which comprises the following steps: acquiring a lip language data set and preprocessing it to obtain lip recognition images; constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images; inputting the spatio-temporal features into an improved convolutional vision transformer network to extract local and global spatial features of the lip recognition images; inputting the local and global spatial features into a bidirectional gated recurrent unit to extract long- and short-term feature sequences of the lip recognition images; and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, from which recognition probability values are obtained through a cross entropy loss function with a label smoothing mechanism, completing lightweight lip language recognition. The application addresses the problems of large model parameter counts, long training and inference times, and performance degradation caused by model compression.

Description

Lightweight lip language identification method based on visual transformer
Technical Field
The application belongs to the technical field of image recognition, and particularly relates to a lightweight lip language recognition method based on a vision transformer.
Background
Lip recognition, also called visual speech recognition, refers to judging the speaking content from the lip movements of a speaker; research on it involves technologies such as computer vision and natural language processing. Lip language recognition is widely applied in identity authentication, speech recognition, speaker face synthesis, improving communication for people with hearing and speech impairments, public safety, and other areas. According to the literature, several problems in the field of lip recognition require attention. First, recognition focuses mainly on the lips and their immediate surroundings, which makes fine-grained feature extraction by the network model particularly important. Second, lip language recognition must process the temporal and spatial information between adjacent frames of continuous video, so recognition is difficult; at the same time, the convolution and fully connected layers of current deep learning models greatly increase the number of parameters and place high demands on computer hardware. Furthermore, as the network goes deeper, the decreasing resolution leads to information loss. Because of these problems and the complexity of lip reading, the accuracy of the lip reading task has remained low, model parameters are large, and training and inference times are long. Therefore, there is a need for a lightweight lip language recognition method based on a vision transformer.
Disclosure of Invention
In order to solve the above technical problems, the application provides a lightweight lip language recognition method based on a vision transformer, which effectively extracts high-dimensional features of the image sequence and enhances the semantic representation among video key frames, thereby reducing the loss caused by global averaging of the image sequence; it has obvious advantages in parameter count and computation, and addresses the sharp drop in recognition accuracy caused by model compression.
In order to achieve the above purpose, the application provides a lightweight lip language recognition method based on a vision transformer, which comprises the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images;
inputting the spatio-temporal features into an improved convolutional vision transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, and obtaining recognition probability values through a cross entropy loss function with a label smoothing mechanism based on these confidence scores, completing lightweight lip language recognition.
Optionally, the method for preprocessing the lip language data set to obtain the lip recognition images comprises:
adjusting the video frames in the lip language data set to a first preset size, cropping them to a second preset size, performing data augmentation and expansion with data augmentation techniques, flipping each video frame with a preset probability, converting the frames to grayscale images, and normalizing the grayscale images to obtain the lip recognition images.
Optionally, the method for constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images comprises:
setting one 3D convolutional layer with kernel size (5, 7, 7), stride (1, 2, 2) and padding (2, 3, 3), followed by batch normalization, an activation function, and finally a max pooling layer with kernel size (1, 3, 3) and stride (1, 2, 2).
Optionally, a 3D convolutional neural network is constructed, and the calculation for extracting spatiotemporal features from the lip recognition image is as follows:
v_ij^xyz = ReLU(b_ij + Σ_m Σ_{p=0..P_i−1} Σ_{q=0..Q_i−1} Σ_{r=0..R_i−1} w_ijm^pqr · v_(i−1)m^((x+p)(y+q)(z+r)))
where v_ij^xyz is the value at position (x, y, z) of the j-th feature map in the i-th layer, ReLU is the activation function, b is the bias, m indexes the feature maps of layer i−1 connected to the current feature map, w_ijm^pqr is the kernel weight, and P_i, Q_i, R_i are the width, height and temporal extent of the convolution kernel.
Optionally, the improved convolutional vision transformer network comprises an SE-Conv-Embedding layer and three convolutional vision transformer modules.
Optionally, the method for inputting the spatio-temporal features into the improved convolutional vision transformer network and obtaining the local and global spatial features of the lip recognition images comprises:
inputting the spatio-temporal features into the SE-Conv-Embedding layer, compressing them with a squeeze function to obtain per-channel statistics, inputting the statistics into an excitation function to obtain the channel correlations, and inputting the correlations into a scale function to obtain a new feature map;
and inputting the new feature map into a convolutional vision transformer module, obtaining local context information through the convolutional projection layer of the module, feeding it into the multi-head attention layer of the module with layer normalization, and inputting the normalized result into the multi-layer perceptron (MLP) layer of the module to obtain the local and global spatial features of the lip recognition images.
Optionally, the method for obtaining the new feature map is as follows:
x̃ = F_scale(F_ex(F_sq(x)), x)
where x is the input feature map, F_sq is the squeeze (compression) function, F_ex is the excitation function, F_scale is the scale function, and x̃ is the new feature map.
Optionally, the local context information is obtained through the convolutional projection layer of the convolutional vision transformer module as:
x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s))
where x_i^{Q/K/V} is the token input to the Q/K/V matrices of the i-th layer, x_i is the output of the SE-Conv-Embedding layer, and s is the convolution kernel size.
Optionally, the method for inputting the local and global spatial features into the bidirectional gated recurrent unit and extracting the long- and short-term feature sequences of the lip recognition images comprises:
setting the input dimension to 512, the hidden layer dimension to 1024, 3 layers in total, and the output dimension to 2048;
the gated recurrent unit is computed as:
z_t = σ(W_z x_t + U_z h_{t−1}), r_t = σ(W_r x_t + U_r h_{t−1}), h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})), h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
where z is the update gate, r is the reset gate, h̃ is the candidate hidden state, h is the hidden state, and W and U are the input and hidden weight matrices, respectively.
Optionally, the method for inputting the long- and short-term feature sequences into a multi-layer perceptron and obtaining the confidence scores of each category comprises:
inputting the extracted long- and short-term feature sequences into the multi-layer perceptron, which receives them flattened into one-dimensional tensors and multiplies them by a weight matrix to generate the output features, yielding the confidence scores of each category.
The application has the following technical effects: the disclosed lightweight lip language recognition method based on a vision transformer makes improvements in the data preprocessing, data augmentation and feature extraction stages, raising the recognition accuracy of the model. Weight sharing is combined with weight distillation, and transformation after weight sharing is used to reduce model instability and performance degradation; Prediction-Logit distillation is taken as the main component, and weighted Self-Attention distillation loss and Hidden-State distillation loss are added to reduce model parameters, thereby improving training and testing speed, lowering memory consumption, improving model convergence, and accelerating model operation.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
fig. 1 is a schematic flow chart of the lightweight lip language recognition method based on a vision transformer according to an embodiment of the application;
FIG. 2 is a block diagram of feature extraction and recognition by the 3D convolutional network and vision transformer applied to the input images according to an embodiment of the application;
FIG. 3 is a basic architecture diagram of the CvT module according to an embodiment of the application;
FIG. 4 is a diagram of the improved Transformer Block structure according to an embodiment of the application;
FIG. 5 is a diagram of the overall structure of Mini-3DCvT according to an embodiment of the present application;
FIG. 6 is a schematic diagram of weight transformation in the attention layer and feed forward network according to an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in fig. 1, the lightweight lip language recognition method based on a vision transformer in this embodiment includes the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images;
inputting the spatio-temporal features into an improved convolutional vision transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, obtaining recognition probability values through a cross entropy loss function with a label smoothing mechanism based on these confidence scores, and completing lightweight lip language recognition.
As shown in fig. 2, the input size of the front-end network is (256×1×88×88), the input data being obtained by preprocessing the original data set. The data first pass through the 3D convolution layer and then enter the Transformer Blocks. The parameter settings of the Transformer Block differ across the three stages, and the parameter values of each stage can be adjusted to fit different input data. After the front-end network has extracted the feature information, word boundary information is added and the data size becomes (256×513×5×5). To better capture the global correlations of the feature sequence and identify key information, the features are sent to the back-end Bi-GRU network, which concatenates the feature vectors output by the front-end network with their direction-reversed counterparts along the channel dimension; finally, a dropout layer effectively alleviates the vanishing-gradient problem while preventing the model from over-fitting.
The lip language recognition method based on deep convolution and an attention mechanism preprocesses the large lip reading data sets; the lip recognition images are obtained as follows:
A unified preprocessing strategy is adopted for the LRW and LRW-1000 data sets. We resize the initial input video frames to 96×96, then crop them to 88×88, and use both the Mixup and CutMix data augmentation techniques for expansion as the final input. A batch of 256 video frames is selected in each training epoch. Each video frame is then flipped horizontally with probability 0.5 and converted to a grayscale image, and finally normalized to [0, 1]. Furthermore, before the extracted feature information enters the back end, the feature dimension is expanded from 512 to 513 by adding word boundaries, which provide context and environmental information to aid lip reading classification. In addition, the label values of the data samples are smoothed, and the smoothed labels are added to the loss function to guide the training of the network model, reducing over-fitting.
The lip language recognition method based on deep convolution and an attention mechanism uses the following 3D convolution front end: one 3D convolutional layer with kernel size (5, 7, 7), stride (1, 2, 2) and padding (2, 3, 3), followed by batch normalization, an activation function, and finally a max pooling layer with kernel size (1, 3, 3) and stride (1, 2, 2). The 3D convolution is computed as follows:
v_ij^xyz = ReLU(b_ij + Σ_m Σ_{p=0..P_i−1} Σ_{q=0..Q_i−1} Σ_{r=0..R_i−1} w_ijm^pqr · v_(i−1)m^((x+p)(y+q)(z+r)))
where v_ij^xyz is the value at position (x, y, z) of the j-th feature map in the i-th layer, ReLU is the activation function, b is the bias, m indexes the feature maps of layer i−1 connected to the current feature map, w_ijm^pqr is the kernel weight, and P_i, Q_i, R_i are the width, height and temporal extent of the convolution kernel. The output is then processed by the convolutional token embedding layer, whose embedding kernel size is (7, 7), stride is (2, 2), and number of kernels is 128.
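For illustration, a PyTorch-style sketch of this 3D front end under the parameter values just listed; the output channel count (64) and the pooling padding are assumptions not stated in the text.

import torch
import torch.nn as nn

class FrontEnd3D(nn.Module):
    """Sketch of the 3D convolutional front end: Conv3d + BatchNorm + ReLU + MaxPool3d."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(1, out_channels,
                              kernel_size=(5, 7, 7),
                              stride=(1, 2, 2),
                              padding=(2, 3, 3),
                              bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2),
                                 padding=(0, 1, 1))   # padding is an assumption

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, 88, 88) -> (batch, C, frames, 22, 22)
        return self.pool(self.act(self.bn(self.conv(x))))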
The spatio-temporal feature vectors generated in the previous step are input into the improved lightweight three-stage convolutional vision transformer, which consists of three Transformer stages; the specific architecture is shown in fig. 3.
Each module consists of a convolutional projection layer, a multi-head attention layer and a fully connected layer, stacked in three stages. In the first stage, the convolutional projection layer has kernel size (3, 3), 128 kernels, 3 attention heads and depth 128; in the second stage, kernel size (3, 3), 256 kernels, 12 attention heads and depth 256; in the third stage, kernel size (3, 3), 512 kernels, 16 attention heads and depth 512.
In order to improve the recognition performance of the algorithm, the following modifications are made:
(1) Stacking Transformer blocks and deepening the network reduce the image resolution and weaken the network's ability to capture spatial feature information. To solve this problem, the convolutional token embedding layer of the Transformer block is improved by adding a squeeze-and-excitation (SE) structure, and the new network layer is defined as SE-Conv-Embedding. The specific structure of SE-Conv-Embedding is shown in fig. 4: the feature map is input into the Squeeze operation to obtain per-channel statistics, the statistics are input into the Excitation operation to obtain the channel correlations, and finally the correlations are input into the Scale operation to obtain a new feature map, as shown in formula (2).
x̃ = F_scale(F_ex(F_sq(x)), x)  (2)
where x is the input feature map, F_sq is the squeeze function, F_ex is the excitation function, F_scale is the scale function, and x̃ is the new feature map. This layer can increase the feature dimension of the tokens by changing parameters and reduce the length of the token sequence, thereby compensating for the information loss caused by deepening the network and reducing the resolution.
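A minimal squeeze-and-excitation block matching formula (2) could look as follows; the reduction ratio of 16 is an assumption.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """F_sq pools each channel, F_ex models channel correlations, F_scale reweights the input."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                 # F_sq
        self.excite = nn.Sequential(                           # F_ex
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.squeeze(x).view(b, c)
        w = self.excite(s).view(b, c, 1, 1)
        return x * w                                           # F_scale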
The mapped SE-Conv-Embedding features are then sent to the convolutional projection (Conv-Projection) layer in the Transformer Block, whose purpose is to better capture local context information. The specific formula is as follows:
x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s))
where x_i^{Q/K/V} is the token input to the Q/K/V matrices of the i-th layer, x_i is the output of the SE-Conv-Embedding layer, and s is the convolution kernel size. The feature matrix is then passed from the Conv-Projection layer into the multi-head attention layer. The number of heads increases at each stage; the values are normalized by layer normalization, and finally the output is obtained through the multi-layer perceptron (MLP) layer; this result, as the output of the front-end network, is fed into the back-end network for sequence modelling. These operations allow features to be extracted well at different scales. Meanwhile, the ReLU activation function reduces the difference between the output value and the expected value. As shown in the improved Transformer block of fig. 4, the convolution and attention mechanisms extract and model the feature information of the input feature map well and better learn the regular patterns among the data.
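A sketch of such a convolutional projection is given below, using a depthwise 3×3 convolution with batch normalization as in CvT; the stride-1 choice and the batch-first token layout are assumptions.

import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Implements x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s)) for one of Q, K or V."""
    def __init__(self, dim: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, stride,
                      padding=kernel_size // 2, groups=dim, bias=False),  # depthwise conv
            nn.BatchNorm2d(dim),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) -> reshape to a 2D map, convolve, flatten back to tokens
        b, n, d = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, d, h, w)    # Reshape2D
        x = self.proj(x)                                   # Conv2d(., s)
        return x.flatten(2).transpose(1, 2)                # Flatten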
(2) The module is modified for light weight as follows. The vision transformer is processed in two steps on the basis of the base model:
the first step generates a compact architecture with weight conversion. Given a pre-trained large visual transducer model, parameters are first shared between every k adjacent transducer layers except LayerNorm. Each layer is then weight transformed by inserting a tiny linear layer before and after the softmax layer. In addition, a deep convolutional layer for MLP is introduced. These linear layers and conversion blocks are not shared.
In the second step, weight-distillation training is performed on the compressed model. The target loss function is defined as the sum of several proposed loss functions, and knowledge is transferred from the large pre-trained model to the small model with the proposed weight-distillation method. By distilling inside the Transformer modules, the student can imitate the behaviour of the teacher network and extract more useful knowledge from the large-scale pre-trained model. Note that this is performed only when both the teacher and the student are Transformer architectures; when the student and teacher architectures are heterogeneous, only the full Prediction-Logit distillation is kept, with weighted Self-Attention distillation loss and Hidden-State distillation loss added.
The overall structure of the resulting Mini-3DCvT is shown in FIG. 5; it extracts lip movement features by a three-dimensional convolution and a lightweight, modified vision transformer. The extracted feature information is then processed by a bidirectional gated recurrent unit (BiGRU) and a fully connected layer (FC) for sequence modelling and classification.
The two-step process of compressing the vision transformer blocks is described in detail below. The first step applies the weight transformation to both the multi-head self-attention (MSA) blocks and the multi-layer perceptrons (MLP). This transformation allows each layer to be different, improving parameter diversity and model representation capability. As shown in fig. 6, the parameters of the transformation kernels are not shared across layers, while all other blocks in the original Transformer except LayerNorm are shared. Since the shared blocks account for most of the model parameters, the model size increases only slightly after weight transformation.
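The following sketch illustrates only the sharing scheme (layer counts and dimensions are placeholders): every k adjacent layers point at one shared block, while LayerNorm and the small per-layer transformation kernels stay private; how those kernels are wired into attention is shown in the next sketch.

import torch.nn as nn

def build_weight_shared_stack(num_layers: int = 12, k: int = 2,
                              dim: int = 256, heads: int = 4):
    """Return (shared_block, private_parts) pairs; layer i reuses the weights of group i // k."""
    num_groups = (num_layers + k - 1) // k
    shared_blocks = nn.ModuleList([
        nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        for _ in range(num_groups)
    ])
    per_layer = nn.ModuleList([
        nn.ModuleDict({
            "norm": nn.LayerNorm(dim),                          # LayerNorm is never shared
            "pre_softmax": nn.Linear(heads, heads, bias=False),  # F(1), private to the layer
            "post_softmax": nn.Linear(heads, heads, bias=False), # F(2), private to the layer
        })
        for _ in range(num_layers)
    ])
    return [(shared_blocks[i // k], per_layer[i]) for i in range(num_layers)]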
Attention layer weight transformation
The multi-head self-attention (MSA) layer in the Transformer Block uses weight transformation, with layer normalization and residual connections applied before and after each block. The MSA layer is elaborated as follows:
let M be the number of heads in the MSA, also referred to as the self-attention module. Given the input sequence Z0 ε RNxD, at the firstOf the k heads, query, key and value are generated by linear projection, respectively using Q k 、K k and Vk E RN x d, where N is the number of tags. The dimensions of the Conv-project input and Q-K-V matrices are D and D, respectively. Next, a weighted average of all values for each location is calculated. For similarity calculation between different elements, attention matrix A is obtained, wherein A is a weight matrix and is obtained by h k To represent the output of the attention layer, i.e. formulas (4) and (5):
h_k = A_k V_k  (4)
A_k = Softmax(Q_k K_k^⊤ / √d)  (5)
where the softmax operation is performed on each row of the input matrix. Finally, the outputs of all heads are concatenated and passed through one fully connected layer.
In order to improve the diversity of parameters, the application inserts two linear transformations before and after the softmax of the self-attention module, defined as follows:
A'_k = Σ_{m=1..M} F^(2)_{k,m} · Softmax(Σ_{m'=1..M} F^(1)_{m,m'} · Q_{m'} K_{m'}^⊤ / √d)
where F^(1), F^(2) ∈ R^{M×M} are the linear transformation kernels before and after the softmax, acting across the M heads. Such linear transformations make each attention matrix A'_k different while combining information across attention heads, increasing parameter diversity.
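The sketch below wires such M×M kernels around the softmax of a standard multi-head attention layer; the einsum mixing over the head dimension and the identity initialisation are assumptions about how the transformation could be implemented.

import math
import torch
import torch.nn as nn

class TransformedAttention(nn.Module):
    """Multi-head attention with small head-mixing kernels F1 (pre-softmax) and F2 (post-softmax)."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.f1 = nn.Parameter(torch.eye(heads))   # F(1), initialised to identity
        self.f2 = nn.Parameter(torch.eye(heads))   # F(2)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        q, k, v = self.qkv(x).reshape(b, n, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.d)      # (b, heads, n, n)
        logits = torch.einsum("km,bmij->bkij", self.f1, logits)   # mix heads before softmax
        attn = logits.softmax(dim=-1)
        attn = torch.einsum("km,bmij->bkij", self.f2, attn)       # mix heads after softmax
        out = (attn @ v).transpose(1, 2).reshape(b, n, self.h * self.d)
        return self.out(out)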
Multi-layer perceptron weight transformation
The multi-layer perceptron (MLP) consists of two fully connected layers with an activation function σ, typically GELU. Let Y ∈ R^{N×d} be the input to the MLP. The output of the MLP is expressed as:
H = σ(YW^(1) + b^(1))W^(2) + b^(2)  (8)
where W^(1) ∈ R^{d×d'}, b^(1) ∈ R^{d'}, W^(2) ∈ R^{d'×d} and b^(2) ∈ R^d are the weights and biases of the first and second layers, respectively. Note that d' is usually set larger than d (d' > d).
The MLP is then further transformed for light weight to improve parameter diversity. Specifically, let the input be Y = [y_1, …, y_d], where y_l ∈ R^N collects the l-th dimension of all token embedding vectors. Then d linear transformations are introduced to convert Y into Y' = [C^(1)y_1, …, C^(d)y_d], where C^(1), …, C^(d) ∈ R^{N×N} are independent weight matrices of the linear layers. Formula (8) is then rewritten as:
H = σ(Y'W^(1) + b^(1))W^(2) + b^(2)  (9)
To reduce the number of parameters and introduce locality into the transformation, depthwise convolution is employed [41] to sparsify and share the weights within each weight matrix, yielding only K²·d parameters instead of N²·d (K << N), where K is the kernel size of the convolution. After the transformation, the output of the MLP is more diversified and the parameter efficiency is improved.
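The MLP transformation can be sketched as below, with the per-dimension token mixing realised as a depthwise K×K convolution over the token grid; the hidden width 4×dim and the token-grid layout are assumptions.

import torch
import torch.nn as nn

class TransformedMLP(nn.Module):
    """Sparse token mixing (stand-in for the dense C(l) matrices) followed by the two-layer MLP."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.token_mix = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)  # K*K*d parameters
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, d = tokens.shape                       # n == h * w tokens
        y = tokens.transpose(1, 2).reshape(b, d, h, w)
        y = self.token_mix(y)                        # Y -> Y' (per-dimension token mixing)
        y = y.flatten(2).transpose(1, 2)
        return self.mlp(y)                           # H = sigma(Y'W1 + b1)W2 + b2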
Through these transformations, the weight-sharing layers can recover the behaviour of the pre-trained model, similar to a multiplexing process, which alleviates the problems of unstable training and degraded performance and avoids the drawbacks of the weight-sharing method.
To compress the pre-trained large model and address the performance degradation caused by weight sharing, weight distillation is further used to transfer knowledge from the large model to the small, compact model. Three types of distillation are applied to the Transformer blocks, namely Prediction-Logit distillation, Self-Attention distillation and Hidden-State distillation.
Prediction-Logit distillation
Hinton et al. first demonstrated that a deep learning model can achieve better performance by mimicking the output behaviour of a well-performing teacher model during training. Following this idea, the prediction loss is introduced as follows:
L_pred = CE(z_s / T, z_t / T)
where z_s and z_t are the prediction logits of the student model and the teacher model, respectively, T is a temperature value controlling the smoothness of the logits, and CE denotes the cross entropy loss. In the experiments, T = 1 is used.
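A minimal sketch of this prediction loss, assuming the usual softened-softmax cross entropy form (the T² factor is the conventional scaling and has no effect at T = 1):

import torch
import torch.nn.functional as F

def prediction_logit_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """Cross entropy of the student prediction under the temperature-scaled teacher distribution."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean() * temperature ** 2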
Self-Attention distillation
It is beneficial to use the attention maps in the Transformer layers to guide the training of the student model. To solve the dimension mismatch between the student and teacher models caused by their different numbers of heads, the cross entropy loss is applied to the relations between the queries, keys and values in the MSA. The matrices of all heads are first concatenated; for example, Q = [Q_1, …, Q_M] ∈ R^{N×Md}, and K, V ∈ R^{N×Md} likewise. For simplicity of notation, Q, K and V are denoted by S_1, S_2 and S_3, respectively. Then 9 different relation matrices can be generated, defined as R_{i,j} = Softmax(S_i S_j^⊤ / √(Md)), i, j ∈ {1, 2, 3}. Note that R_{1,2} is the attention matrix A, and the Self-Attention distillation loss can be expressed as:
L_attn = (1/(9N)) Σ_{i=1..3} Σ_{j=1..3} Σ_{n=1..N} CE(R^S_{i,j,n}, R^T_{i,j,n})
where R_{i,j,n} denotes the n-th row of R_{i,j}, and the superscripts S and T refer to the student and the teacher, respectively.
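The relation-matrix distillation above can be sketched as follows; the √(Md) scaling and the use of the teacher rows as soft targets are assumptions consistent with the description.

import math
import torch
import torch.nn.functional as F

def relation_matrices(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """Build the 9 relation matrices R_ij = softmax(S_i S_j^T / sqrt(M*d)) from head-concatenated Q, K, V."""
    s = [q, k, v]                      # each of shape (N, M*d), or batched (B, N, M*d)
    d_total = q.shape[-1]
    return [F.softmax(si @ sj.transpose(-2, -1) / math.sqrt(d_total), dim=-1)
            for si in s for sj in s]

def self_attention_distill_loss(student_qkv, teacher_qkv) -> torch.Tensor:
    """Row-wise cross entropy between student and teacher relation matrices, averaged over 9 relations."""
    loss = 0.0
    for r_s, r_t in zip(relation_matrices(*student_qkv), relation_matrices(*teacher_qkv)):
        loss = loss + (-(r_t * torch.log(r_s + 1e-8)).sum(dim=-1).mean())
    return loss / 9.0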
Hidden-State distillation
Similarly, a relation matrix can be generated for the hidden states, i.e. the features output by the MLP [45]. Let H ∈ R^{N×d} denote the hidden state of the Transformer layer. The Hidden-State distillation loss based on the relation matrix is defined as:
L_hidden = (1/N) Σ_{n=1..N} CE(R^S_{H,n}, R^T_{H,n})
where R_{H,n} is the n-th row of R_H, computed as R_H = Softmax(H H^⊤ / √d), and the superscripts S and T again denote the student and the teacher.
On the basis of the Prediction-Logit distillation, the application adds a weighted mixture of the other two knowledge distillation terms as the final loss, so the final distillation target loss function is expressed as:
L_total = L_pred + λ_1 · L_attn + λ_2 · L_hidden
where λ_1 and λ_2 are the weights of the Self-Attention and Hidden-State distillation losses.
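A compact sketch of the remaining distillation term and the weighted total loss is given below; the relation definition R_H = softmax(H H^T / √d) follows the text, while the weight values are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

def hidden_state_distill_loss(h_student: torch.Tensor,
                              h_teacher: torch.Tensor) -> torch.Tensor:
    """Row-wise cross entropy between teacher and student hidden-state relation matrices (H of shape (N, d))."""
    def relation(h):
        return F.softmax(h @ h.transpose(-2, -1) / math.sqrt(h.shape[-1]), dim=-1)
    r_s, r_t = relation(h_student), relation(h_teacher)
    return -(r_t * torch.log(r_s + 1e-8)).sum(dim=-1).mean()

def total_distillation_loss(l_pred, l_attn, l_hidden,
                            w_attn: float = 1.0, w_hidden: float = 1.0):
    """Weighted mixture used as the final distillation target; the weights are assumptions."""
    return l_pred + w_attn * l_attn + w_hidden * l_hidden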
The extracted spatial features are input into the bidirectional gated recurrent unit to extract the long- and short-term feature sequences:
The bidirectional gated recurrent unit is configured as follows: the input dimension is set to 512, the hidden layer dimension to 1024, with 3 layers in total, and the output dimension is 2048. The gated recurrent unit is computed as:
z_t = σ(W_z x_t + U_z h_{t−1}), r_t = σ(W_r x_t + U_r h_{t−1}), h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})), h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
where z is the update gate, r is the reset gate, h̃ is the candidate hidden state, h is the hidden state, and W and U are the input and hidden weight matrices, respectively.
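A sketch of this back end with the stated dimensions (input 512, hidden 1024, 3 bidirectional layers, 2048-dimensional output per time step); the dropout rate is an assumption.

import torch
import torch.nn as nn

class BackEndBiGRU(nn.Module):
    """Bidirectional GRU back end: (batch, frames, 512) -> (batch, frames, 2*1024 = 2048)."""
    def __init__(self, input_size: int = 512, hidden_size: int = 1024,
                 num_layers: int = 3, dropout: float = 0.2):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, bidirectional=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.gru(x)          # forward and backward features concatenated per time step
        return self.dropout(out)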
The extracted long- and short-term feature sequences are input into the multi-layer perceptron to obtain the confidence scores of each category, specifically as follows: the multi-layer perceptron has input dimension 2048 and output dimension 1000; it receives the sequences flattened into one-dimensional tensors and multiplies them by a weight matrix to generate the output features, yielding the confidence scores of each category. Based on these confidence scores, the recognition probability values are output through a cross entropy loss function with a label smoothing mechanism: the obtained output features and the true labels are fed into the cross entropy loss function with label smoothing to output the recognition probability values.
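A sketch of the classification head and the label-smoothed cross entropy described above; the temporal averaging before the linear layer and the smoothing value 0.1 are assumptions.

import torch
import torch.nn as nn

class Classifier(nn.Module):
    """Linear head from 2048-d features to 1000 class scores, with label-smoothed cross entropy."""
    def __init__(self, in_dim: int = 2048, num_classes: int = 1000):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)
        self.criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    def forward(self, features: torch.Tensor, labels: torch.Tensor = None):
        # features: (batch, frames, 2048) -> average over time -> class scores
        scores = self.fc(features.mean(dim=1))
        probs = scores.softmax(dim=-1)            # recognition probability values
        if labels is None:
            return probs
        return probs, self.criterion(scores, labels)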
The application provides a lightweight lip language recognition method based on a vision transformer that makes improvements in the data preprocessing, data augmentation and feature extraction stages, raising the recognition accuracy of the model. Weight sharing is combined with weight distillation, and transformation after weight sharing is used to reduce model instability and performance degradation; Prediction-Logit distillation is taken as the main component, and weighted Self-Attention distillation loss and Hidden-State distillation loss are added to reduce model parameters, thereby improving training and testing speed, lowering memory consumption, improving model convergence, and accelerating model operation. Experiments show that the proposed Mini-3DCvT achieves high accuracy while greatly reducing the number of model parameters, the computation and the training time, performing well in improving model accuracy, acceleration and compression. Meanwhile, beyond the recognition-accuracy requirements of lip reading methods, the light weight of the model has gradually become an important issue restricting the development of the field.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. The lightweight lip language identification method based on a vision transformer is characterized by comprising the following steps:
acquiring a lip language data set and preprocessing it to obtain lip recognition images;
constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images;
inputting the spatio-temporal features into an improved convolutional vision transformer network to obtain local and global spatial features of the lip recognition images;
inputting the local and global spatial features into a bidirectional gated recurrent unit and extracting long- and short-term feature sequences of the lip recognition images;
and inputting the long- and short-term feature sequences into a multi-layer perceptron to obtain confidence scores for each category, and obtaining recognition probability values through a cross entropy loss function with a label smoothing mechanism based on these confidence scores, completing lightweight lip language recognition.
2. The vision transformer-based lightweight lip language identification method as defined in claim 1, wherein the lip language data set is preprocessed and the lip recognition images are acquired as follows:
adjusting the video frames in the lip language data set to a first preset size, cropping them to a second preset size, performing data augmentation and expansion with data augmentation techniques, flipping each video frame with a preset probability, converting the frames to grayscale images, and normalizing the grayscale images to obtain the lip recognition images.
3. The vision transformer-based lightweight lip language identification method of claim 1, wherein constructing a 3D convolutional neural network and extracting spatio-temporal features from the lip recognition images comprises:
setting one 3D convolutional layer with kernel size (5, 7, 7), stride (1, 2, 2) and padding (2, 3, 3), followed by batch normalization, an activation function, and finally a max pooling layer with kernel size (1, 3, 3) and stride (1, 2, 2).
4. The vision transformer-based lightweight lip language identification method of claim 1, wherein the 3D convolution used to extract spatio-temporal features from the lip recognition images is computed as:
v_ij^xyz = ReLU(b_ij + Σ_m Σ_{p=0..P_i−1} Σ_{q=0..Q_i−1} Σ_{r=0..R_i−1} w_ijm^pqr · v_(i−1)m^((x+p)(y+q)(z+r)))
where v_ij^xyz is the value at position (x, y, z) of the j-th feature map in the i-th layer, ReLU is the activation function, b is the bias, m indexes the feature maps of layer i−1 connected to the current feature map, w_ijm^pqr is the kernel weight, and P_i, Q_i, R_i are the width, height and temporal extent of the convolution kernel.
5. The vision transformer-based lightweight lip language identification method of claim 1, wherein the improved convolutional vision transformer network comprises an SE-Conv-Embedding layer and three convolutional vision transformer modules.
6. The vision transformer-based lightweight lip language identification method of claim 5, wherein inputting the spatio-temporal features into the improved convolutional vision transformer network and obtaining the local and global spatial features of the lip recognition images comprises:
inputting the spatio-temporal features into the SE-Conv-Embedding layer, compressing them with a squeeze function to obtain per-channel statistics, inputting the statistics into an excitation function to obtain the channel correlations, and inputting the correlations into a scale function to obtain a new feature map;
and inputting the new feature map into a convolutional vision transformer module, obtaining local context information through the convolutional projection layer of the module, feeding it into the multi-head attention layer of the module with layer normalization, and inputting the normalized result into the multi-layer perceptron (MLP) layer of the module to obtain the local and global spatial features of the lip recognition images.
7. The vision transformer-based lightweight lip language identification method of claim 6, wherein the new feature map is obtained as:
x̃ = F_scale(F_ex(F_sq(x)), x)
where x is the input feature map, F_sq is the squeeze (compression) function, F_ex is the excitation function, F_scale is the scale function, and x̃ is the new feature map.
8. The vision transformer-based lightweight lip language identification method of claim 6, wherein the local context information is obtained through the convolutional projection layer in the convolutional vision transformer module as:
x_i^{Q/K/V} = Flatten(Conv2d(Reshape2D(x_i), s))
where x_i^{Q/K/V} is the token input to the Q/K/V matrices of the i-th layer, x_i is the output of the SE-Conv-Embedding layer, and s is the convolution kernel size.
9. The vision transformer-based lightweight lip language identification method of claim 1, wherein inputting the local and global spatial features into the bidirectional gated recurrent unit and extracting the long- and short-term feature sequences of the lip recognition images comprises:
setting the input dimension to 512, the hidden layer dimension to 1024, 3 layers in total, and the output dimension to 2048;
the gated recurrent unit is computed as:
z_t = σ(W_z x_t + U_z h_{t−1}), r_t = σ(W_r x_t + U_r h_{t−1}), h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})), h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
where z is the update gate, r is the reset gate, h̃ is the candidate hidden state, h is the hidden state, and W and U are the input and hidden weight matrices, respectively.
10. The vision transformer-based lightweight lip language identification method of claim 1, wherein the long- and short-term feature sequences are input into a multi-layer perceptron to obtain the confidence scores of each category, specifically comprising:
inputting the extracted long- and short-term feature sequences into the multi-layer perceptron, which receives them flattened into one-dimensional tensors and multiplies them by a weight matrix to generate the output features, yielding the confidence scores of each category.
CN202310592100.7A 2023-05-24 2023-05-24 Lightweight lip language identification method based on visual transformer Active CN116580440B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310592100.7A CN116580440B (en) Lightweight lip language identification method based on visual transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310592100.7A CN116580440B (en) Lightweight lip language identification method based on visual transformer

Publications (2)

Publication Number Publication Date
CN116580440A true CN116580440A (en) 2023-08-11
CN116580440B CN116580440B (en) 2024-01-26

Family

ID=87541097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310592100.7A Active CN116580440B (en) Lightweight lip language identification method based on visual transformer

Country Status (1)

Country Link
CN (1) CN116580440B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237856A (en) * 2023-11-13 2023-12-15 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium
CN117689044A (en) * 2024-02-01 2024-03-12 厦门大学 Quantification method suitable for vision self-attention model
CN117952869A (en) * 2024-03-27 2024-04-30 西南石油大学 Drilling fluid rock debris counting method based on weak light image enhancement

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN113343937A (en) * 2021-07-15 2021-09-03 北华航天工业学院 Lip language identification method based on deep convolution and attention mechanism
US20210280191A1 (en) * 2018-01-02 2021-09-09 Boe Technology Group Co., Ltd. Lip language recognition method and mobile terminal
CN114359786A (en) * 2021-12-07 2022-04-15 重庆邮电大学 Lip language identification method based on improved space-time convolutional network
CN114973412A (en) * 2022-05-31 2022-08-30 华中科技大学 Lip language identification method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210280191A1 (en) * 2018-01-02 2021-09-09 Boe Technology Group Co., Ltd. Lip language recognition method and mobile terminal
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
CN113343937A (en) * 2021-07-15 2021-09-03 北华航天工业学院 Lip language identification method based on deep convolution and attention mechanism
CN114359786A (en) * 2021-12-07 2022-04-15 重庆邮电大学 Lip language identification method based on improved space-time convolutional network
CN114973412A (en) * 2022-05-31 2022-08-30 华中科技大学 Lip language identification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
任玉强 (REN Yuqiang) et al., "Research on lip-reading recognition algorithm in a high-security face recognition system", Application Research of Computers, vol. 34, no. 4

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237856A (en) * 2023-11-13 2023-12-15 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium
CN117237856B (en) * 2023-11-13 2024-03-01 腾讯科技(深圳)有限公司 Image recognition method, device, computer equipment and storage medium
CN117689044A (en) * 2024-02-01 2024-03-12 厦门大学 Quantification method suitable for vision self-attention model
CN117952869A (en) * 2024-03-27 2024-04-30 西南石油大学 Drilling fluid rock debris counting method based on weak light image enhancement

Also Published As

Publication number Publication date
CN116580440B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN116580440B (en) Lightweight lip language identification method based on visual transformer
CN111985369B (en) Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN113887610B (en) Pollen image classification method based on cross-attention distillation transformer
CN112507898B (en) Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
CN113343937B (en) Lip language identification method based on deep convolution and attention mechanism
CN112328767A (en) Question-answer matching method based on BERT model and comparative aggregation framework
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Jiang et al. Cross-level reinforced attention network for person re-identification
CN117238019A (en) Video facial expression category identification method and system based on space-time relative transformation
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN116246305A (en) Pedestrian retrieval method based on hybrid component transformation network
CN116543289A (en) Image description method based on encoder-decoder and Bi-LSTM attention model
CN115527064A (en) Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning
CN116167014A (en) Multi-mode associated emotion recognition method and system based on vision and voice
CN115188079A (en) Continuous sign language identification method based on self-adaptive multi-scale attention time sequence network
CN113128456B (en) Pedestrian re-identification method based on combined picture generation
CN113761106B (en) Self-attention-strengthening bond transaction intention recognition system
CN117831138B (en) Multi-mode biological feature recognition method based on third-order knowledge distillation
CN116229234A (en) Image recognition method based on fusion attention mechanism
CN116665192A (en) Fatigue detection method based on LSTM and SSD lightweight network combination
Wang et al. Mini-3DCvT: a lightweight lip-reading method based on 3D convolution visual transformer
CN118172705A (en) Cross-architecture video action recognition method and device based on knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant