CN112818950B - Lip language recognition method based on generative adversarial network and temporal convolutional network - Google Patents

Lip language recognition method based on generative adversarial network and temporal convolutional network

Info

Publication number
CN112818950B
Authority
CN
China
Prior art keywords
lip
angle
network
image
corrected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110262815.7A
Other languages
Chinese (zh)
Other versions
CN112818950A (en)
Inventor
张成伟
赵昊天
张满囤
齐畅
崔时雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202110262815.7A priority Critical patent/CN112818950B/en
Publication of CN112818950A publication Critical patent/CN112818950A/en
Application granted granted Critical
Publication of CN112818950B publication Critical patent/CN112818950B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lip language recognition method based on a generative adversarial network (GAN) and a temporal convolutional network (TCN). The method first judges the lip deflection angle with a ResNet angle classifier, then corrects the lip image with a GAN two-stage converter, and finally feeds the corrected sequence into a TCN for feature recognition and classification to produce a high-precision lip reading result. The method overcomes the problem, unsolved by conventional convolutional models, that lip feature extraction is disturbed by the uncertainties of the real environment such as illumination intensity, illumination angle, viewing angle and speaker identity, and markedly improves lip reading accuracy. The method also designs dense multi-angle lip change raw data, which preserves the continuity of the images of a single camera and, to the largest possible extent, the continuity of the lip images over the whole observation range, effectively solving the inability of existing multi-angle models to handle the continuously changing lip images of real environments and thereby improving recognition accuracy.

Description

Lip language recognition method based on generative adversarial network and temporal convolutional network
Technical Field
The invention belongs to the fields of artificial intelligence and deep learning, and particularly relates to a lip language recognition method based on a generative adversarial network and a temporal convolutional network.
Background
With the development of science and technology and the improvement of hardware manufacturing, the amount of information that computers can process has grown exponentially, and artificial intelligence technology based on deep learning has entered a stage of rapid development. It is widely applied in daily life, has subtly changed the way people produce and live, and has become one of the indispensable technologies of human society. Its application scenarios cover all aspects of production and life, including speech recognition, intelligent medical care, machine vision, intelligent question-answering systems, autonomous driving and so on. The success and accumulated experience of artificial intelligence in these fields further raise social attention to the new technology and accelerate its development.
Lip language recognition (lip reading) is an important application field of artificial intelligence. It plays a key role in many areas of social production and life and has very broad application prospects, for example:
1. Liveness detection based on lip features: in scenarios requiring identity authentication, the real physiological characteristics of a subject often need to be verified, and the subject is asked to complete a series of specified actions such as turning the head, blinking and reading a given sentence aloud. Common spoofing attacks such as photos, videos, face swapping and face occlusion can be effectively rejected by combining facial keypoint detection with lip reading recognition, which helps protect users from fraud and safeguards their rights and interests.
2. Helping hearing-impaired people communicate: hearing-impaired people, whose hearing loss is congenital or acquired, cannot hear or speak and find it inconvenient to communicate with others in daily life. A communication aid equipped with lip reading recognition technology can meet their communication needs.
Current lip reading models are divided by recognition level into letter-level, word-level and sentence-level models. Most of them adopt sequence-to-sequence (seq2seq) models for sequence modeling and recognition, and use the Connectionist Temporal Classification (CTC) algorithm as the criterion for measuring the accuracy of the predicted result. A seq2seq model takes a continuous sequence of lip features as input and encodes and decodes the sequence over time with an encoder and a decoder. A major difficulty of the lip reading task is that the contextual dependence between lip images is strong, whereas the sequential processing mechanism of seq2seq models cannot handle the contextual relations within a lip change sequence well.
Improved sequence models based on the attention mechanism appeared later and achieved satisfactory results in application scenarios with short-sentence context dependence, such as machine translation and intelligent question-answering systems. However, the lip reading task processes long continuous image sequences whose contextual relations are tighter and whose temporal span is larger, so the accuracy of attention mechanisms on this task still needs to be improved. Another difficulty is that lip characteristics are strongly affected by viewing angle, illumination and speaker identity, so feature extraction faces great uncertainty. Most recognition models use a feature extractor based on a Residual Network (ResNet), which works well under laboratory conditions but performs poorly when applied directly to real environments.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the technical problem to be solved by the invention is to provide a lip language recognition method based on a generative adversarial network and a temporal convolutional network.
The technical scheme for solving this problem is a lip language recognition method based on a generative adversarial network and a temporal convolutional network, characterized by comprising the following steps:
s1, making original data; the raw data comprises identification network raw data and intensive multi-angle lip change raw data;
s2, respectively labeling the characteristic points of the human face for each frame or each image of the original data by using a human face labeling algorithm to obtain two characteristic point position arrays;
s3, respectively carrying out face alignment on the face in each frame or each image of the original data according to the feature point matrix of the original data and the average face feature point matrix in the feature point position array to obtain aligned original data;
s4, after the face alignment is finished, lip feature points are selected from the face feature points obtained in the S2, and the coordinates of the centers of the lip feature points in each frame or each image of the aligned original data are obtained through calculation according to the lip feature points; dividing lip areas in each frame or each image of the aligned original data into lip images with fixed sizes according to coordinates of centers of the lip feature points, and further respectively obtaining a dense multi-angle lip change data set and an identification network data set;
s5, training a GAN two-stage converter and a ResNet angle classifier by using a dense multi-angle lip change data set;
s6, correcting the recognition network data set by using a trained ResNet angle classifier and a trained GAN two-stage converter, and correcting the lip image deflected in the recognition network data set: splitting the identification network data set into a plurality of lip images to be corrected frame by frame, and then inputting the lip images to a trained ResNet angle classifier for angle classification to obtain the respective lip deflection angle theta of each lip image to be corrected;
determining the number i of the used GAN two-stage converter according to the deflection angle theta of the lip;
then sending the lip image to be corrected of the lip deflection angle theta into a trained GAN two-stage converter with corresponding numbers for lip correction to obtain a corrected lip image; the lip deflection angle theta of the corrected lip image is 0 degree; combining the corrected lip images into a corrected lip image sequence;
s7, training a TCN temporal convolutional network by using the corrected lip image sequence;
and S8, performing feature recognition and classification through the trained TCN temporal convolutional network to generate a lip language recognition result.
Compared with the prior art, the invention has the beneficial effects that:
(1) the invention is a high-precision lip language recognition method: it first judges the lip deflection angle with a ResNet angle classifier, then corrects the lip image with a GAN two-stage converter, and finally feeds the result into a TCN for feature recognition and classification to generate the lip reading result. The method overcomes the problem, unsolved by conventional convolutional models, that lip feature extraction is disturbed by the uncertainties of the real environment such as illumination intensity, illumination angle, viewing angle and speaker identity, and markedly improves the accuracy of lip language recognition.
(2) The method adopts a Temporal Convolutional Network (TCN) that models both the temporal and spatial dimensions, abandons the gating mechanism used in traditional sequence models, and realizes sequence modeling with dilated convolutions, which resolves the dilemma faced by lip reading and further improves its accuracy. A TCN processes sequences in parallel and does not need to step through a sentence in order the way a sequence model does. Moreover, the receptive field of a TCN is flexible: it is determined by the number of convolutional layers, the convolution kernel size and the dilation coefficient, and can be set freely for different tasks. The gradient vanishing and explosion problems frequently caused by parameters shared across time steps in sequence models are unlikely to occur when training a TCN. Finally, unlike a sequence model, a TCN does not need to store the information of every step during training, and because the convolution kernel is shared within a layer, its memory usage is lower.
(3) Converting a lip image with a large deflection angle into a frontal image in a single step gives very poor results, because a single converter cannot handle such a large structural change. The GAN network of the invention therefore adopts a two-stage converter and splits the conversion into two steps. In the first step, a lip image within a given angle range (at most ±10° around a fixed conversion point) is converted to that conversion point; in the second step, the image at the conversion point is converted again to 0°. Although the second stage may span up to 60°, the variation inside the input and output sets of each converter is small (the inputs of the second stage deviate only about ±2° from the conversion point, and the output is 0°), so the correction quality is greatly improved.
(4) The invention designs dense multi-angle lip change raw data, which preserves the continuity of the images of a single camera and, to the largest possible extent, the continuity of the lip images over the whole observation range. This effectively solves the inability of existing multi-angle models to handle the continuously changing lip images of real environments and thereby improves lip reading accuracy.
Drawings
FIG. 1 is a diagram of the placement positions of high definition cameras when the present invention produces dense multi-angle lip variation raw data;
FIG. 2 is a diagram of the conversion process of the GAN two-stage converter of the present invention to correct lip deflection;
FIG. 3 is a block diagram of the TCN time convolutional network of the present invention;
FIG. 4 is a camera shot at an observation angle of 35° from the dense multi-angle lip variation raw data of the present invention;
FIG. 5 is the corrected lip image with a lip deflection angle of 0° produced by correcting FIG. 4 with the No. 6 GAN two-stage converter of the present invention;
FIG. 6 is a camera shot at an observation angle of 0° from the dense multi-angle lip variation raw data of the present invention.
Detailed Description
Specific examples of the present invention are given below. The specific examples are only intended to illustrate the invention in further detail and do not limit the scope of protection of the claims of the present application.
The invention provides a lip language recognition method (hereinafter referred to as the method) based on a generative adversarial network and a temporal convolutional network, characterized by comprising the following steps:
s1, making original data; the raw data comprises identification network raw data and intensive multi-angle lip variation raw data;
preferably, in S1, the raw data of the recognition network are made as follows: a source video and its subtitle file are obtained from the internet by a Python web crawler, the face regions in the source video are located with the YOLOv5 face detection algorithm, the faces are cropped out and matched to the subtitle file, and the raw data of the recognition network, whose data type is video, are obtained;
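For illustration, a minimal Python sketch of this face-cropping step is given below. The detect_faces helper is a hypothetical stand-in for the YOLOv5 face detector, whose weights and exact interface are not specified by the invention; the file naming is also an assumption.

    import cv2

    def detect_faces(frame):
        """Hypothetical stand-in for the YOLOv5 face detector.
        Should return a list of (x, y, w, h) face boxes for the frame."""
        raise NotImplementedError

    def crop_faces_from_video(video_path, out_prefix):
        # Walk through the source video frame by frame and save the cropped face regions.
        cap = cv2.VideoCapture(video_path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for (x, y, w, h) in detect_faces(frame):
                cv2.imwrite(f"{out_prefix}_{idx:06d}.png", frame[y:y + h, x:x + w])
            idx += 1
        cap.release()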
preferably, to ensure a sufficient sample size and improve the network's ability to cope with variable real environments, the method draws the raw data of the recognition network from several sources, including indoor samples (such as broadcast television programs) and outdoor samples (such as street-interview videos shot at different times of day). To keep the data samples of the recognition network stable, 5000 clips covering different illumination conditions, different lip angles and multiple speakers were selected manually after a large amount of data had been collected, and were randomly split into a training set and a test set at a ratio of 8:2; the training set is used to train the network and the test set to compare experimental results. Each labeled video of the recognition network raw data is 1 second long at 25 frames/second and is stored as RGB video, which completes the acquisition of the recognition network raw data.
Preferably, in S1, the dense multi-angle lip change raw data are made as follows: to further improve lip reading accuracy, high-definition cameras are placed in front of the subject; while each subject reads the specified text aloud, the cameras record the lip movement under different observation angles α, and the illumination intensity and illumination angle are varied during recording to simulate a real environment, yielding the dense multi-angle lip change raw data. The observation angle of a person looking straight ahead is defined as 0°, at which the face is perpendicular to the optical axis of the camera; since a face normally does not rotate by more than 70°, the observation angle α ranges from −70° to 70°.
Preferably, 71 high-definition cameras are arranged 1 meter in front of the subject's lips on one side (the right side in this embodiment), covering observation angles α from 0° to 70° with a spacing of 1° between adjacent cameras, as shown in fig. 1. Each subject is recorded for 20 seconds at 25 frames/second, and the recording is stored as continuously changing RGB lip images. After the videos on one side have been recorded, they are flipped horizontally to serve as the videos of the other side (observation angles α from −70° to 0°) and are likewise stored as continuously changing RGB lip images, giving images under all 141 observation angles, with many images per angle.
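A minimal sketch of the horizontal mirroring used to synthesize the −70° to 0° views from the recorded 0° to 70° views is given below; the directory layout and file naming are assumptions.

    import cv2
    import glob
    import os

    def mirror_angle_folder(src_dir, dst_dir):
        # Flip every recorded frame horizontally so that a view at +a degrees
        # becomes the corresponding view at -a degrees.
        os.makedirs(dst_dir, exist_ok=True)
        for path in glob.glob(os.path.join(src_dir, "*.png")):
            img = cv2.imread(path)
            flipped = cv2.flip(img, 1)  # flipCode=1 -> horizontal flip
            cv2.imwrite(os.path.join(dst_dir, os.path.basename(path)), flipped)

    # Example: frames shot at +35 degrees reused as the -35 degree view.
    # mirror_angle_folder("raw/angle_+35", "raw/angle_-35")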
S2, because the face occupies only a small area of each frame of the recognition network raw data and lip reading is independent of the background, the position of the face in the background must be determined before features are extracted. Because the amount of recognition network raw data is very large and manual annotation is time-consuming and laborious, the method uses a face landmark algorithm (the dlib open-source tool in this embodiment) to label the facial feature points of every frame of the recognition network raw data and every image of the dense multi-angle lip change raw data, obtaining two feature point position arrays that are stored as separate files; the two arrays and the raw data are stored separately for later computation and retrieval;
the feature point position array is the sequence formed by the feature point matrices of every frame or image of the raw data; labeling the facial feature points with the dlib open-source tool yields 68 facial feature points in total.
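A minimal sketch of this labeling step with dlib is given below; the path of the 68-landmark predictor file is an assumption.

    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    # Standard dlib 68-point model; the file path is an assumption.
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def landmark_matrix(gray_image):
        """Return a 68x2 matrix of facial feature points for the first detected face."""
        faces = detector(gray_image, 1)
        if not faces:
            return None
        shape = predictor(gray_image, faces[0])
        return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])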
S3, face alignment is performed on the face in every frame or image of the raw data according to its feature point matrix in the feature point position array and the average face feature point matrix, giving the aligned raw data: the average face feature point matrix is computed with the dlib tool from the existing data; each raw-data feature point matrix is taken from its feature point position array, its offset from the average face feature point matrix is computed with Procrustes analysis, the minimum offset is found for each case by gradient descent, and the face in each frame or image is translated and rotated into alignment according to its minimum offset, giving the aligned raw data. This completes the face alignment, straightens the face, and reduces the influence of head tilt on the extraction of lip features;
preferably, in S3, the offset is calculated by Procrustes analysis as shown in formula (1):
diff = Σ_i ||s·R·p_i^T + E − q_i^T||^2 (1)
in formula (1), diff denotes the difference between a raw-data feature point matrix and the average face feature point matrix, R is a 2 × 2 orthogonal matrix, s is a scalar, E is a two-dimensional vector, p_i denotes the feature point matrix of the raw data, and q_i denotes the average face feature point matrix.
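A minimal sketch of the offset of formula (1) between a frame's landmarks and the average face is given below; it uses the closed-form orthogonal Procrustes solution via SVD instead of the gradient-descent search described above, which is a simplification.

    import numpy as np

    def procrustes_offset(p, q):
        """Align landmark matrix p (68x2) to the average-face landmarks q (68x2).

        Returns the scale s, the 2x2 rotation R, the translation E and the
        residual diff = sum_i || s*R*p_i + E - q_i ||^2 (formula (1))."""
        mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)
        p0, q0 = p - mu_p, q - mu_q
        sp, sq = p0.std(), q0.std()
        p0 /= sp
        q0 /= sq
        u, _, vt = np.linalg.svd(p0.T @ q0)
        r = (u @ vt).T                      # 2x2 orthogonal matrix
        s = sq / sp                         # scalar scale
        e = mu_q - s * (r @ mu_p)           # translation vector
        diff = np.sum((s * (p @ r.T) + e - q) ** 2)
        return s, r, e, diff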
S4, after face alignment, because backgrounds other than the face only increase the difficulty of the recognition task, the method keeps only a fixed-size image of the lip region. Lip feature points are selected from the facial feature points obtained in S2, and the coordinates of the center of the lip feature points in each frame or image of the aligned raw data are computed from them; the lip region of each frame or image is then cropped to a fixed-size lip image (96 × 96 pixels in this embodiment) around that center, giving the dense multi-angle lip change data set and the recognition network data set respectively;
preferably, in S4, the coordinates of the center of the lip feature points in each frame or image of the aligned raw data are calculated as shown in formula (2):
x_i = (1/N)·Σ_{j=1}^{N} x_{i,j}, y_i = (1/N)·Σ_{j=1}^{N} y_{i,j} (2)
in formula (2), x_i and y_i denote the abscissa and ordinate of the center of the lip feature points in the i-th frame or image, and N denotes the number of lip feature points; since there are 20 lip feature points in total, N is 20 in this embodiment. The lip feature points are the 20 of the 68 facial feature points defined by the dlib open-source tool that lie in the lip region; these 20 points are referred to as the lip feature points.
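A minimal sketch of the center computation of formula (2) and the 96 × 96 crop follows; the dlib convention that landmarks 48 to 67 are the mouth points is used here.

    import numpy as np

    LIP_IDX = list(range(48, 68))  # dlib's 20 mouth landmarks

    def crop_lip(frame, landmarks, size=96):
        """Crop a fixed-size lip image centred on the mean of the 20 lip landmarks."""
        lips = landmarks[LIP_IDX]                 # 20 x 2
        cx, cy = lips.mean(axis=0).astype(int)    # formula (2): centre of the lip points
        half = size // 2
        return frame[cy - half:cy + half, cx - half:cx + half]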
S5, a GAN two-stage converter (a Generative Adversarial Network with U-Net as the generator and a patch discriminator, Patch-D) and a ResNet angle classifier are trained with the dense multi-angle lip change data set: lip reading in a real environment is strongly affected by factors such as illumination and viewing angle; the GAN two-stage converter and the ResNet angle classifier of this method are designed to solve this problem, so they must be trained first;
the dense multi-angle lip change data set is divided into 2α+1 classes, each class representing one observation angle, to train the ResNet angle classifier and obtain the trained ResNet angle classifier;
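A minimal sketch of such an angle classifier is shown below, assuming the torchvision ResNet-18 backbone with a 141-way output (one class per degree from −70° to 70°); the choice of ResNet-18 is an assumption, since the invention only specifies "ResNet".

    import torch.nn as nn
    from torchvision.models import resnet18

    NUM_ANGLES = 141  # 2*70 + 1 observation angles, one class per degree

    def build_angle_classifier():
        # ResNet-18 backbone with a 141-way classification head.
        return resnet18(num_classes=NUM_ANGLES)

    def predicted_angle(logits):
        # Map the predicted class index 0..140 back to a deflection angle in degrees.
        return logits.argmax(dim=1) - 70

    # Training uses ordinary cross-entropy on (lip image, angle class) pairs:
    # loss = nn.CrossEntropyLoss()(model(images), angles + 70)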
the dense multi-angle lip change data set is also divided into 2K−1 parts, namely K first-stage conversion sets and K−1 second-stage conversion sets, which are fed respectively into the first-stage and second-stage converters of the GAN two-stage converter for training, giving the trained GAN two-stage converter; there are K trained first-stage converters and K−1 trained second-stage converters; K denotes the number of divided angle ranges, and K is 7 in this embodiment;
each first-stage conversion set has an input and an output: the input is one of the K angle ranges into which the observation-angle range is divided, and the output is the conversion point of that angle range, the angle range containing 0° having no conversion point;
each second-stage conversion set has an input and an output: the input is one of the K−1 conversion points, and the output is the 0° observation angle.
The method is as follows: the observation-angle range of −70° to 70° is divided into K angle ranges that serve as the inputs of the K first-stage converters, and the K−1 conversion points corresponding to these ranges (a conversion point is a fixed angle within its range, preferably the midpoint) serve as their outputs to train the first-stage converters, the range containing 0° having no conversion point; the K−1 conversion points then serve as the inputs of the K−1 second-stage converters, with the 0° observation angle as output, to train the second-stage converters;
as shown in fig. 2, the small lower arrows indicate the first-stage conversion and the large upper arrows the second-stage conversion. In this embodiment, the angle range [−70°, −50°] belongs to first-stage converter No. 1, (−50°, −30°] to No. 2, (−30°, −10°] to No. 3, (−10°, 10°] to No. 4, (10°, 30°] to No. 5, (30°, 50°] to No. 6 and (50°, 70°] to No. 7. The conversion point of second-stage converter No. 1 is −60°, of No. 2 is −40°, of No. 3 is −20°, of No. 5 is 20°, of No. 6 is 40° and of No. 7 is 60°.
When training the GAN two-stage converter, first-stage converter No. 6, for example, takes the angle range (30°, 50°] as input and the 40° conversion point as output. During the experiments it was found that, because the ResNet angle classifier has a classification error of about ±5°, a classification error could cause the wrong neighbouring first-stage converter to be used and degrade the conversion. To prevent this, the coverage of each first-stage converter is enlarged from 20° to 30°, i.e. the angle range extended by ±5° is used as input and the corresponding conversion point as output; first-stage converter No. 6 then takes (25°, 55°] as input and the 40° conversion point as output, converting any image in this range to the 40° conversion point, whose image is in turn converted to 0° by second-stage converter No. 6. Considering the conversion error of the first stage, each second-stage converter is trained with its conversion point ±2° as input and 0° as output; second-stage converter No. 6, for example, uses images at 38° to 42° and at 0°.
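A minimal sketch of how the angle ranges, conversion points and the ±5°/±2° tolerances described above can be turned into per-converter training subsets is given below; the data set is assumed to be a list of (image, angle) pairs, which is a simplification.

    K = 7
    HALF_RANGE = 10           # each first-stage converter nominally covers 20 degrees
    # Target angle of each first-stage converter; No. 4 maps its range straight to 0.
    CONVERSION_POINTS = {1: -60, 2: -40, 3: -20, 4: 0, 5: 20, 6: 40, 7: 60}

    def first_stage_pairs(dataset, conv_no, slack=5):
        """(inputs, targets) for first-stage converter `conv_no` (1..7).

        Inputs are images whose angle lies in the converter's nominal 20-degree
        range widened by +/- `slack` degrees; targets are images at the
        conversion point (0 degrees for converter No. 4)."""
        point = CONVERSION_POINTS[conv_no]
        lo, hi = point - HALF_RANGE - slack, point + HALF_RANGE + slack
        inputs = [img for img, angle in dataset if lo < angle <= hi]
        targets = [img for img, angle in dataset if angle == point]
        return inputs, targets

    def second_stage_pairs(dataset, conv_no, slack=2):
        """(inputs, targets) for second-stage converter `conv_no` (1..7 except 4):
        inputs lie within +/- `slack` degrees of the conversion point, targets at 0."""
        assert conv_no != 4, "the 0-degree range needs no second-stage converter"
        point = CONVERSION_POINTS[conv_no]
        inputs = [img for img, angle in dataset if abs(angle - point) <= slack]
        targets = [img for img, angle in dataset if angle == 0]
        return inputs, targets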
S6, the recognition network data set is rectified with the trained ResNet angle classifier and the trained GAN two-stage converter, correcting the deflected lip images it contains: the recognition network data set is split frame by frame with the OpenCV-python library into lip images to be corrected, which are fed to the trained ResNet angle classifier for angle classification by formula (3), giving the lip deflection angle θ of each lip image to be corrected:
θ=classify(image) (3)
in formula (3), image denotes a lip image to be corrected and classify denotes the trained ResNet angle classifier;
the GAN two-stage converter number i used is then determined by equation (4) according to the lip deflection angle θ:
i = [θ/20°] + 4 (4)
in the formula (4), i represents the number of the GAN two-stage converter, and i is more than or equal to 1 and less than or equal to K; the operator [ ] represents rounding;
the lip image to be corrected with lip deflection angle θ is then sent to the trained GAN two-stage converter of the corresponding number for lip correction by formula (5), giving the corrected lip image, whose lip deflection angle θ is 0°; the corrected lip images are converted to grayscale, combined into the corrected lip image sequence, and stored as an NPZ file; this process takes 3 hours;
out = second_i(first_i(image)) (5)
in formula (5), out denotes the output, i.e. the corrected lip image; first_i is the i-th first-stage converter and second_i the i-th second-stage converter;
the rectification process of the trained GAN two-stage converter is as follows: the lip image to be corrected with lip deflection angle θ is sent to the first-stage converter of the trained GAN two-stage converter with the corresponding number, which converts it to the conversion point of the second-stage converter of the same number; that second-stage converter then corrects it into a corrected lip image with a lip deflection angle θ of 0°. For example, if the trained ResNet angle classifier determines that the lip deflection angle of a given lip image is 35°, GAN two-stage converter No. 6 is selected; first-stage converter No. 6 adjusts the lip from 35° to the 40° conversion point, completing the first-stage conversion, and the 40° image is then fed to second-stage converter No. 6 and converted to a 0° frontal view, completing the two-stage correction and giving the corrected lip image. As a special case, when the GAN two-stage converter number is i = 4, i.e. the angle range (−10°, 10°], the lips are corrected to 0° using only first-stage converter No. 4.
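A minimal sketch of this S6 correction pipeline is shown below; classify, first_stage and second_stage stand in for the trained networks, whose interfaces the invention does not spell out, and the index computation follows formula (4).

    import cv2
    import numpy as np

    def correct_sequence(frames, classify, first_stage, second_stage):
        """Correct every deflected lip frame to a frontal (0 degree) view.

        classify(frame) -> deflection angle theta in degrees (formula (3));
        first_stage[i] / second_stage[i] are the trained converters No. i."""
        corrected = []
        for frame in frames:
            theta = classify(frame)
            i = int(round(theta / 20.0)) + 4      # formula (4): converter number 1..7
            out = first_stage[i](frame)           # -> conversion point of converter i
            if i != 4:                            # range (-10, 10] needs no second stage
                out = second_stage[i](out)        # -> 0 degree frontal view
            corrected.append(cv2.cvtColor(out, cv2.COLOR_BGR2GRAY))
        return np.stack(corrected)

    # The corrected grayscale sequence is saved as an NPZ file, e.g.:
    # np.savez("corrected.npz", frames=correct_sequence(frames, classify, first, second))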
S7, the TCN temporal convolutional network is trained with the corrected lip image sequence: the corrected lip image sequence is imported and converted into a tensor of dimension B × T × H × W, where B is the batch size, T the number of frames, H the height and W the width. It is fed into a ResNet18 lip feature encoder; after encoding, a feature vector of dimension B × C × T is output, where C is the number of channels (set to 512 in this embodiment), and this is fed into the TCN temporal convolutional network for training to obtain the prediction result. CE loss is then used as the loss function, and the loss value loss of the TCN is calculated from the prediction result and the label value of the label. When loss no longer decreases or the specified number of iterations is reached, training ends and the trained TCN is obtained; if loss is still decreasing and the iteration limit has not been reached, the output errors of the neurons of each layer are calculated backwards, layer by layer from the output layer to the input layer, and every weight and bias of the TCN is adjusted by gradient descent until loss no longer decreases or the iteration limit is reached, so that the TCN reaches its optimum; training then ends and the trained TCN temporal convolutional network is obtained;
preferably, the TCN temporal convolutional network structure, shown in fig. 3, comprises an input layer (first layer), a first hidden layer (second layer), a second hidden layer (third layer), an output layer (fourth layer), an upsampling layer and a SoftMax layer, executed in sequence; the B × C × T feature vector enters the input layer, passes through the first hidden layer of dilated convolution with dilation coefficient 1, the second hidden layer of dilated convolution with dilation coefficient 2 and the output layer of dilated convolution with dilation coefficient 4, is restored to the original size by the upsampling layer, and is passed through the logistic regression of the SoftMax layer to obtain the prediction result; cross-layer connections are added between the input layer and the first hidden layer, between the first hidden layer and the second hidden layer, and between the second hidden layer and the output layer;
preferably, the dilated convolution is calculated as shown in equation (6):
F(s) = Σ_{ind=0}^{filt−1} f(ind)·x_{s−d·ind} (6)
in equation (6), F(s) denotes the dilated convolution applied at position s of the corresponding TCN layer, d denotes the dilation coefficient, ind the convolution kernel index, filt the filter size, f(ind) the ind-th filter weight, and x the layer input;
preferably, the calculation process of the cross-layer connection is as shown in formula (7):
o=Activation(base+F(base)) (7)
in formula (7), o denotes the output after the cross-layer connection, Activation denotes the activation function, F denotes the dilated convolution applied to base, and base denotes the underlying input (namely the input of the input layer or of one of the two hidden layers);
preferably, the calculation process of the SoftMax layer is as shown in formula (8):
S_r = e^r / Σ_g e^g (8)
in formula (8), S_r denotes the prediction result, r denotes the r-th output value of the upsampling layer, and the sum runs over all output values g of the upsampling layer;
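A minimal PyTorch sketch of a TCN head with the structure described above (dilated 1-D convolutions with dilation coefficients 1, 2 and 4, cross-layer residual connections, upsampling and SoftMax) is given below; the kernel size, padding scheme, channel counts and the temporal mean pooling before SoftMax are assumptions not fixed by the invention.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TCNHead(nn.Module):
        """Input layer, two hidden layers and output layer with dilations 1, 2 and 4,
        cross-layer (residual) connections, upsampling back to T and SoftMax."""

        def __init__(self, in_ch=512, hid_ch=256, num_classes=400, kernel=3):
            super().__init__()
            def dconv(cin, cout, d):
                # 'same' padding so residual additions line up along the time axis.
                return nn.Conv1d(cin, cout, kernel, dilation=d, padding=d * (kernel - 1) // 2)
            self.inp = dconv(in_ch, hid_ch, 1)            # input layer
            self.hid1 = dconv(hid_ch, hid_ch, 1)          # first hidden layer, d = 1
            self.hid2 = dconv(hid_ch, hid_ch, 2)          # second hidden layer, d = 2
            self.out = dconv(hid_ch, num_classes, 4)      # output layer, d = 4
            self.proj = nn.Conv1d(hid_ch, num_classes, 1) # channel projection for the last skip

        def forward(self, x):                             # x: B x C x T from the ResNet18 encoder
            b = self.inp(x)
            b = torch.relu(b + self.hid1(b))              # formula (7): o = Activation(base + F(base))
            b = torch.relu(b + self.hid2(b))
            y = torch.relu(self.proj(b) + self.out(b))    # cross-layer connection into the output layer
            y = F.interpolate(y, size=x.shape[-1])        # upsampling layer ('same' padding already keeps T)
            return F.softmax(y.mean(dim=-1), dim=1)       # formula (8) over the word classes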
preferably, the CE loss is calculated as shown in equation (9):
loss = −Σ_label p(label)·log q(label) (9)
in formula (9), loss denotes the loss value, p(label) denotes the label value of label, and q(label) denotes the predicted probability of label;
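A minimal sketch of the S7 training loop follows, combining a per-frame ResNet18 lip feature encoder with the TCN head sketched above and the CE loss of formula (9); the optimizer, learning rate, grayscale-to-RGB replication and data loader are assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class LipReader(nn.Module):
        def __init__(self, num_classes=400, feat_dim=512):
            super().__init__()
            self.encoder = resnet18(num_classes=feat_dim)  # ResNet18 lip feature encoder
            self.tcn = TCNHead(in_ch=feat_dim, num_classes=num_classes)

        def forward(self, clips):                          # clips: B x T x H x W (grayscale)
            b, t, h, w = clips.shape
            frames = clips.reshape(b * t, 1, h, w).repeat(1, 3, 1, 1)  # ResNet18 expects 3 channels
            feats = self.encoder(frames).reshape(b, t, -1)             # B x T x C
            return self.tcn(feats.transpose(1, 2))                     # B x C x T -> class probabilities

    def train(model, loader, epochs=30, lr=1e-4):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        nll = nn.NLLLoss()                                 # CE loss of formula (9) on log-probabilities
        for _ in range(epochs):
            for clips, labels in loader:
                probs = model(clips)
                loss = nll(torch.log(probs + 1e-8), labels)
                opt.zero_grad()
                loss.backward()                            # back-propagate the errors layer by layer
                opt.step()                                 # adjust weights and biases by gradient descent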
and S8, feature recognition and classification are performed by the trained TCN temporal convolutional network to generate the lip language recognition result.
Experiments show that the two-stage algorithm of the GAN two-stage converter has obvious advantages. The method uses the Structural Similarity (SSIM) index to measure the conversion quality of the pictures. The results show that two-stage conversion improves SSIM by 7% over one-stage conversion, reaching 89% structural similarity, which further improves lip reading accuracy. As can be seen from fig. 5 and fig. 6, the difference between the frontal face image converted from fig. 4 by the two-stage converter (fig. 5) and the real image (fig. 6) is very small.
The accuracy of the method is further described in two specific experimental examples.
In this experimental example, a prediction is counted as correct when its predicted class is consistent with the class of the label. The raw data contain 400 words after screening. Several comparative experiments were set up, each using the ResNet18 lip feature encoder for encoding; the test results are shown in Table 1:
TABLE 1
Network architecture | Image processing method | Test accuracy
TCN | GAN two-stage converter | 95.6%
TCN | Original image | 92.9%
LSTM | GAN two-stage converter | 94.9%
LSTM | Original image | 91.7%
It can be seen that, compared with preprocessing that only centres the lips (the "Original image" rows of Table 1), the GAN two-stage converter adopted in the invention brings a significant accuracy improvement for both recognition networks (TCN and LSTM), and that the TCN outperforms the LSTM (long short-term memory network).
Matters not described in detail in this specification belong to the prior art known to those skilled in the art.

Claims (10)

1. A lip language recognition method based on a generative adversarial network and a temporal convolutional network, characterized by comprising the following steps:
s1, making original data; the raw data comprises identification network raw data and intensive multi-angle lip change raw data;
s2, respectively labeling the characteristic points of the human face for each frame or each image of the original data by using a human face labeling algorithm to obtain two characteristic point position arrays;
s3, respectively carrying out face alignment on the face in each frame or each image of the original data according to the feature point matrix of the original data and the average face feature point matrix in the feature point position array to obtain aligned original data;
s4, after the face alignment is finished, lip feature points are selected from the face feature points obtained in the S2, and the coordinates of the centers of the lip feature points in each frame or each image of the aligned original data are obtained through calculation according to the lip feature points; dividing lip areas in each frame or each image of the aligned original data into lip images with fixed sizes according to coordinates of centers of the lip feature points, and further respectively obtaining a dense multi-angle lip change data set and an identification network data set;
s5, training a GAN two-stage converter and a ResNet angle classifier by using a dense multi-angle lip change data set;
dividing a dense multi-angle lip change data set into 2 x K-1 parts including K first-stage conversion sets and K-1 second-stage conversion sets, and respectively and correspondingly inputting the parts into a first-stage converter and a second-stage converter of a GAN two-stage converter for training to obtain a trained GAN two-stage converter; the number of the first-stage converters after training is K, and the number of the second-stage converters after training is K-1; k represents the number of the divided areas;
each first stage conversion set includes an input and an output; inputting one of K angle ranges divided into an observation angle range, and outputting a conversion point corresponding to the angle range, wherein the angle range containing 0 degrees has no conversion point;
each second stage conversion set includes an input and an output; the input is one of K-1 conversion points, and the output is an observation angle of 0 degrees;
s6, correcting the recognition network data set by using a trained ResNet angle classifier and a trained GAN two-stage converter, and correcting the lip image deflected in the recognition network data set: splitting the identification network data set into a plurality of lip images to be corrected frame by frame, and then inputting the lip images to a trained ResNet angle classifier for angle classification to obtain the respective lip deflection angle theta of each lip image to be corrected;
determining the number i of the used GAN two-stage converter according to the deflection angle theta of the lip;
then sending the lip image to be corrected of the lip deflection angle theta into a trained GAN two-stage converter with corresponding numbers for lip correction to obtain a corrected lip image; the lip deflection angle theta of the corrected lip image is 0 degree; combining the corrected lip images into a corrected lip image sequence;
s7, training a TCN temporal convolutional network by using the corrected lip image sequence;
and S8, performing feature recognition and classification through the trained TCN temporal convolutional network to generate a lip language recognition result.
2. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S1, the raw data of the recognition network are made as follows: a source video and its subtitle file are obtained from the internet by a web crawler, the face regions in the source video are located with a face detection algorithm, the faces are cropped out and matched to the subtitle file, and the raw data of the recognition network are obtained.
3. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S1, the dense multi-angle lip change raw data are made as follows: high-definition cameras are placed in front of the subject; while each subject reads the specified text aloud, the cameras record the lip movement under different observation angles α, and the illumination intensity and illumination angle of the subject are varied during recording to simulate a real environment, giving the dense multi-angle lip change raw data; the observation angle of a person looking straight ahead is defined as 0°; the observation angle α ranges from −70° to 70°;
in S1, 71 high-definition cameras are arranged 1 meter in front of the subject's lips within the observation-angle range of 0° to 70°, with a spacing of 1° between adjacent cameras; the videos are recorded and stored as continuously changing images, then flipped horizontally to serve as the videos with observation angles α from −70° to 0° and likewise stored as continuously changing images, giving images under all 141 observation angles, with several images per observation angle.
4. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S3, an average face feature point matrix is calculated from the existing data; each raw-data feature point matrix is taken from its feature point position array, its offset from the average face feature point matrix is calculated, the minimum offset is found for each case by gradient descent, and the face in each frame or image of the raw data is translated and rotated into alignment according to its minimum offset, giving the aligned raw data;
in S3, the offset is calculated by Procrustes analysis as shown in formula (1):
diff = Σ_i ||s·R·p_i^T + E − q_i^T||^2 (1)
in formula (1), diff denotes the difference between the raw-data feature point matrix and the average face feature point matrix, R is a 2 × 2 orthogonal matrix, s is a scalar, E is a two-dimensional vector, p_i denotes the feature point matrix of the raw data, and q_i denotes the average face feature point matrix.
5. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S4, the coordinates of the center of the lip feature points in each frame or image of the aligned raw data are calculated as shown in formula (2):
x_i = (1/N)·Σ_{j=1}^{N} x_{i,j}, y_i = (1/N)·Σ_{j=1}^{N} y_{i,j} (2)
in formula (2), x_i denotes the abscissa and y_i the ordinate of the center of the lip feature points in the i-th frame or image; N denotes the number of lip feature points.
6. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein S5 specifically comprises: dividing the dense multi-angle lip change data set into 2α+1 classes, each class representing one observation angle, to train the ResNet angle classifier and obtain the trained ResNet angle classifier; α denotes the observation angle.
7. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 6, wherein in S6, when K is 7, specifically: the recognition network data set is split frame by frame into lip images to be corrected, which are fed to the trained ResNet angle classifier for angle classification by formula (3), giving the lip deflection angle θ of each lip image to be corrected:
θ=classify(image) (3)
in formula (3), image denotes a lip image to be corrected and classify denotes the trained ResNet angle classifier;
then, the GAN two-stage converter number i used is determined by equation (4) according to the lip deflection angle θ:
i = [θ/20°] + 4 (4)
in formula (4), i denotes the number of the GAN two-stage converter, 1 ≤ i ≤ K; the operator [ ] denotes rounding;
then sending the lip image to be corrected of the lip deflection angle theta into a trained GAN two-stage converter with corresponding numbers to carry out lip correction by the formula (5) to obtain a corrected lip image; the lip deflection angle theta of the corrected lip image is 0 degree; combining the corrected lip images into a corrected lip image sequence;
out = second_i(first_i(image)) (5)
in formula (5), out is the corrected lip image; first_i is the i-th first-stage converter and second_i the i-th second-stage converter.
8. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S6 the correction process of the trained GAN two-stage converter is: the lip image to be corrected with lip deflection angle θ is sent to the first-stage converter of the trained GAN two-stage converter with the corresponding number and converted to the conversion point of the second-stage converter of the same number, and that second-stage converter then corrects it into a corrected lip image with a lip deflection angle θ of 0°.
9. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S7 the corrected lip image sequence is imported and converted into a tensor of dimension B × T × H × W, where B is the batch size, T the number of frames, H the height and W the width; it is fed into a ResNet18 lip feature encoder, and after encoding a feature vector of dimension B × C × T is output, where C is the number of channels, which is fed into the TCN temporal convolutional network for training to obtain the prediction result; CE loss is then used as the loss function, and the loss value loss of the TCN temporal convolutional network is calculated from the prediction result and the label value of the label; when the loss value loss no longer decreases or the specified number of iterations is reached, the training ends and the trained TCN temporal convolutional network is obtained; if loss is still decreasing and the specified number of iterations has not been reached, the output errors of the neurons of each layer are calculated backwards layer by layer from the output layer to the input layer, and every weight and bias of the TCN temporal convolutional network is adjusted by gradient descent until loss no longer decreases or the specified number of iterations is reached, so that the TCN temporal convolutional network reaches its optimum; the training then ends and the trained TCN temporal convolutional network is obtained.
10. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 9, wherein in S7 the TCN temporal convolutional network structure comprises an input layer, a first hidden layer, a second hidden layer, an output layer, an upsampling layer and a SoftMax layer, executed in sequence; the B × C × T feature vector enters the input layer, passes through the first hidden layer of dilated convolution with dilation coefficient 1, the second hidden layer of dilated convolution with dilation coefficient 2 and the output layer of dilated convolution with dilation coefficient 4, is restored to the original size by the upsampling layer, and is passed through the logistic regression of the SoftMax layer to obtain the prediction result; cross-layer connections are added between the input layer and the first hidden layer, between the first hidden layer and the second hidden layer, and between the second hidden layer and the output layer;
the dilated convolution is calculated as shown in equation (6):
F(s) = Σ_{ind=0}^{filt−1} f(ind)·x_{s−d·ind} (6)
in equation (6), F(s) denotes the dilated convolution applied at position s of the corresponding TCN layer, d denotes the dilation coefficient, ind the convolution kernel index, filt the filter size, f(ind) the ind-th filter weight, and x the layer input;
the calculation process of cross-layer connection is shown as formula (7):
o=Activation(base+F(base)) (7)
in formula (7), o denotes the output after the cross-layer connection, Activation denotes the activation function, F denotes the dilated convolution applied to base, and base denotes the underlying input;
the SoftMax layer is calculated as shown in formula (8):
S_r = e^r / Σ_g e^g (8)
in formula (8), S_r denotes the prediction result, r denotes the r-th output value of the upsampling layer, and the sum runs over all output values g of the upsampling layer;
the CE loss is calculated as shown in formula (9):
loss = −Σ_label p(label)·log q(label) (9)
in formula (9), loss denotes the loss value, p(label) denotes the label value of label, and q(label) denotes the predicted probability of label.
CN202110262815.7A 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network Active CN112818950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262815.7A CN112818950B (en) 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110262815.7A CN112818950B (en) 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network

Publications (2)

Publication Number Publication Date
CN112818950A CN112818950A (en) 2021-05-18
CN112818950B true CN112818950B (en) 2022-08-23

Family

ID=75863117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262815.7A Active CN112818950B (en) 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network

Country Status (1)

Country Link
CN (1) CN112818950B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239902B (en) * 2021-07-08 2021-09-28 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109858412A (en) * 2019-01-18 2019-06-07 东北大学 A kind of lip reading recognition methods based on mixing convolutional neural networks
CN111291669A (en) * 2020-01-22 2020-06-16 武汉大学 Two-channel depression angle human face fusion correction GAN network and human face fusion correction method
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN112084927A (en) * 2020-09-02 2020-12-15 中国人民解放军军事科学院国防科技创新研究院 Lip language identification method fusing multiple visual information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339806B (en) * 2018-12-19 2021-04-13 马上消费金融股份有限公司 Training method of lip language recognition model, living body recognition method and device
CN110059602B (en) * 2019-04-10 2022-03-15 武汉大学 Forward projection feature transformation-based overlook human face correction method
CN110276274B (en) * 2019-05-31 2023-08-04 东南大学 Multitasking depth feature space gesture face recognition method
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109858412A (en) * 2019-01-18 2019-06-07 东北大学 A kind of lip reading recognition methods based on mixing convolutional neural networks
CN111291669A (en) * 2020-01-22 2020-06-16 武汉大学 Two-channel depression angle human face fusion correction GAN network and human face fusion correction method
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN112084927A (en) * 2020-09-02 2020-12-15 中国人民解放军军事科学院国防科技创新研究院 Lip language identification method fusing multiple visual information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Combining DC-GAN with ResNet for blood cell image classification; Li Ma et al.; Medical & Biological Engineering & Computing; 2020-03-27; pp. 1-14 *
EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning; Kamyar Nazeri et al.; arXiv:1901.00212v3; 2019-01-11; pp. 1-17 *
Design of a multi-angle face recognition system based on the SE-ResNet model; Chen Xuemin; Journal of Guiyang University (Natural Science Edition); 2020-12-31; vol. 15, no. 4, pp. 10-13 *
Frontalization of deflected faces based on a generative adversarial network; Hu Huiya et al.; Journal of Zhejiang University (Engineering Science); 2021-01-31; vol. 55, no. 1, pp. 116-123 *

Also Published As

Publication number Publication date
CN112818950A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN108986140B (en) Target scale self-adaptive tracking method based on correlation filtering and color detection
CN108648197B (en) Target candidate region extraction method based on image background mask
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
Savvides et al. Efficient design of advanced correlation filters for robust distortion-tolerant face recognition
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
CN111310676A (en) Video motion recognition method based on CNN-LSTM and attention
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN111738363B (en) Alzheimer disease classification method based on improved 3D CNN network
CN110827304B (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolution network and level set method
CN113012172A (en) AS-UNet-based medical image segmentation method and system
Huynh-The et al. NIC: A robust background extraction algorithm for foreground detection in dynamic scenes
CN107767358B (en) Method and device for determining ambiguity of object in image
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN112084927B (en) Lip language identification method fusing multiple visual information
WO2024109374A1 (en) Training method and apparatus for face swapping model, and device, storage medium and program product
CN116051560B (en) Embryo dynamics intelligent prediction system based on embryo multidimensional information fusion
CN112070685A (en) Method for predicting dynamic soft tissue motion of HIFU treatment system
CN112818950B (en) Lip language identification method based on generation of countermeasure network and time convolution network
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN115439927A (en) Gait monitoring method, device, equipment and storage medium based on robot
CN111539320A (en) Multi-view gait recognition method and system based on mutual learning network strategy
CN113283334B (en) Classroom concentration analysis method, device and storage medium
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN112488165A (en) Infrared pedestrian identification method and system based on deep learning model
CN111080754A (en) Character animation production method and device for connecting characteristic points of head and limbs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant