CN112818950B - Lip language recognition method based on generative adversarial network and temporal convolutional network - Google Patents

Lip language recognition method based on generative adversarial network and temporal convolutional network

Info

Publication number
CN112818950B
Authority
CN
China
Prior art keywords
lip
angle
network
image
corrected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110262815.7A
Other languages
Chinese (zh)
Other versions
CN112818950A (en)
Inventor
张成伟
赵昊天
张满囤
齐畅
崔时雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202110262815.7A priority Critical patent/CN112818950B/en
Publication of CN112818950A publication Critical patent/CN112818950A/en
Application granted granted Critical
Publication of CN112818950B publication Critical patent/CN112818950B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lip language recognition method based on a generative adversarial network (GAN) and a temporal convolutional network (TCN). The method first judges the lip deflection angle with a ResNet angle classifier, then corrects the lip image with a GAN two-stage converter, and finally feeds the corrected sequence into a TCN for feature recognition and classification to produce a high-precision lip reading result. The method overcomes the problem, unsolved by conventional convolutional models, that lip feature extraction is disturbed by the uncertainties of the real environment such as illumination intensity, illumination angle, viewing angle and speaker identity, and markedly improves lip reading accuracy. The method also designs dense multi-angle lip change raw data, which preserves the continuity of the images of a single camera and, to the largest possible extent, the continuity of the lip images over the whole observation range, effectively solving the inability of existing multi-angle models to handle the continuously changing lip images of real environments and thereby improving recognition accuracy.

Description

Lip language recognition method based on generative adversarial network and temporal convolutional network
Technical Field
The invention belongs to the fields of artificial intelligence and deep learning, and particularly relates to a lip language recognition method based on a generative adversarial network and a temporal convolutional network.
Background
With the development of science and technology and the improvement of hardware manufacturing, the amount of information that computers can process has grown exponentially, and artificial intelligence technology based on deep learning has entered a stage of rapid development. It is widely applied in daily life, has subtly changed the way people produce and live, and has become one of the indispensable technologies of human society. Its application scenarios cover all aspects of production and life, including speech recognition, intelligent medical care, machine vision, intelligent question-answering systems, autonomous driving and so on. The success and accumulated experience of artificial intelligence in these fields further raise social attention to the new technology and accelerate its development.
Lip language recognition (lip reading) is an important application field of artificial intelligence. It plays a key role in many areas of social production and life and has very broad application prospects, for example:
1. Liveness detection based on lip features: in scenarios requiring identity authentication, the real physiological characteristics of a subject often need to be verified, and the subject is asked to complete a series of specified actions such as turning the head, blinking and reading a given sentence aloud. Common spoofing attacks such as photos, videos, face swapping and face occlusion can be effectively rejected by combining facial keypoint detection with lip reading recognition, which helps protect users from fraud and safeguards their rights and interests.
2. Helping hearing-impaired people communicate: hearing-impaired people, whose hearing loss is congenital or acquired, cannot hear or speak and find it inconvenient to communicate with others in daily life. A communication aid equipped with lip reading recognition technology can meet their communication needs.
Current lip reading models are divided by recognition level into letter-level, word-level and sentence-level models. Most of them adopt sequence-to-sequence (seq2seq) models for sequence modeling and recognition, and use the Connectionist Temporal Classification (CTC) algorithm as the criterion for measuring the accuracy of the predicted result. A seq2seq model takes a continuous sequence of lip features as input and encodes and decodes the sequence over time with an encoder and a decoder. A major difficulty of the lip reading task is that the contextual dependence between lip images is strong, whereas the sequential processing mechanism of seq2seq models cannot handle the contextual relations within a lip change sequence well.
Improved sequence models based on the attention mechanism appeared later and achieved satisfactory results in application scenarios with short-sentence context dependence, such as machine translation and intelligent question-answering systems. However, the lip reading task processes long continuous image sequences whose contextual relations are tighter and whose temporal span is larger, so the accuracy of attention mechanisms on this task still needs to be improved. Another difficulty is that lip characteristics are strongly affected by viewing angle, illumination and speaker identity, so feature extraction faces great uncertainty. Most recognition models use a feature extractor based on a Residual Network (ResNet), which works well under laboratory conditions but performs poorly when applied directly to real environments.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the technical problem to be solved by the invention is to provide a lip language recognition method based on a generative adversarial network and a temporal convolutional network.
The technical scheme for solving this problem is a lip language recognition method based on a generative adversarial network and a temporal convolutional network, characterized by comprising the following steps:
s1, making original data; the raw data comprises identification network raw data and intensive multi-angle lip change raw data;
s2, respectively labeling the characteristic points of the human face for each frame or each image of the original data by using a human face labeling algorithm to obtain two characteristic point position arrays;
s3, respectively carrying out face alignment on the face in each frame or each image of the original data according to the feature point matrix of the original data and the average face feature point matrix in the feature point position array to obtain aligned original data;
s4, after the face alignment is finished, lip feature points are selected from the face feature points obtained in the S2, and the coordinates of the centers of the lip feature points in each frame or each image of the aligned original data are obtained through calculation according to the lip feature points; dividing lip areas in each frame or each image of the aligned original data into lip images with fixed sizes according to coordinates of centers of the lip feature points, and further respectively obtaining a dense multi-angle lip change data set and an identification network data set;
s5, training a GAN two-stage converter and a ResNet angle classifier by using a dense multi-angle lip change data set;
s6, correcting the recognition network data set by using a trained ResNet angle classifier and a trained GAN two-stage converter, and correcting the lip image deflected in the recognition network data set: splitting the identification network data set into a plurality of lip images to be corrected frame by frame, and then inputting the lip images to a trained ResNet angle classifier for angle classification to obtain the respective lip deflection angle theta of each lip image to be corrected;
determining the number i of the used GAN two-stage converter according to the deflection angle theta of the lip;
then sending the lip image to be corrected of the lip deflection angle theta into a trained GAN two-stage converter with corresponding numbers for lip correction to obtain a corrected lip image; the lip deflection angle theta of the corrected lip image is 0 degree; combining the corrected lip images into a corrected lip image sequence;
s7, training a TCN temporal convolutional network by using the corrected lip image sequence;
and S8, performing feature recognition and classification through the trained TCN temporal convolutional network to generate a lip language recognition result.
Compared with the prior art, the invention has the beneficial effects that:
(1) the invention is a high-precision lip language recognition method: it first judges the lip deflection angle with a ResNet angle classifier, then corrects the lip image with a GAN two-stage converter, and finally feeds the result into a TCN for feature recognition and classification to generate the lip reading result. The method overcomes the problem, unsolved by conventional convolutional models, that lip feature extraction is disturbed by the uncertainties of the real environment such as illumination intensity, illumination angle, viewing angle and speaker identity, and markedly improves the accuracy of lip language recognition.
(2) The method adopts a Temporal Convolutional Network (TCN) that models both the temporal and spatial dimensions, abandons the gating mechanism used in traditional sequence models, and realizes sequence modeling with dilated convolutions, which resolves the dilemma faced by lip reading and further improves its accuracy. A TCN processes sequences in parallel and does not need to step through a sentence in order the way a sequence model does. Moreover, the receptive field of a TCN is flexible: it is determined by the number of convolutional layers, the convolution kernel size and the dilation coefficient, and can be set freely for different tasks. The gradient vanishing and explosion problems frequently caused by parameters shared across time steps in sequence models are unlikely to occur when training a TCN. Finally, unlike a sequence model, a TCN does not need to store the information of every step during training, and because the convolution kernel is shared within a layer, its memory usage is lower.
(3) Converting a lip image with a large deflection angle into a frontal image in a single step gives very poor results, because a single converter cannot handle such a large structural change. The GAN network of the invention therefore adopts a two-stage converter and splits the conversion into two steps. In the first step, a lip image within a given angle range (at most ±10° around a fixed conversion point) is converted to that conversion point; in the second step, the image at the conversion point is converted again to 0°. Although the second stage may span up to 60°, the variation inside the input and output sets of each converter is small (the inputs of the second stage deviate only about ±2° from the conversion point, and the output is 0°), so the correction quality is greatly improved.
(4) The invention designs dense multi-angle lip change raw data, which preserves the continuity of the images of a single camera and, to the largest possible extent, the continuity of the lip images over the whole observation range. This effectively solves the inability of existing multi-angle models to handle the continuously changing lip images of real environments and thereby improves lip reading accuracy.
Drawings
FIG. 1 is a diagram of the placement positions of high definition cameras when the present invention produces dense multi-angle lip variation raw data;
FIG. 2 is a diagram of the conversion process of the GAN two-stage converter of the present invention to correct lip deflection;
FIG. 3 is a block diagram of the TCN time convolutional network of the present invention;
FIG. 4 is a camera shot at an observation angle of 35° from the dense multi-angle lip variation raw data of the present invention;
FIG. 5 is the corrected lip image with a lip deflection angle of 0° produced by correcting FIG. 4 with the No. 6 GAN two-stage converter of the present invention;
FIG. 6 is a camera shot at an observation angle of 0° from the dense multi-angle lip variation raw data of the present invention.
Detailed Description
Specific examples of the present invention are given below. The specific examples are only intended to illustrate the invention in further detail and do not limit the scope of protection of the claims of the present application.
The invention provides a lip language recognition method (hereinafter referred to as the method) based on a generative adversarial network and a temporal convolutional network, characterized by comprising the following steps:
s1, making original data; the raw data comprises identification network raw data and intensive multi-angle lip variation raw data;
preferably, in S1, the raw data of the recognition network are made as follows: a source video and its subtitle file are obtained from the internet by a Python web crawler, the face regions in the source video are located with the YOLOv5 face detection algorithm, the faces are cropped out and matched to the subtitle file, and the raw data of the recognition network, whose data type is video, are obtained;
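For illustration, a minimal Python sketch of this face-cropping step is given below. The detect_faces helper is a hypothetical stand-in for the YOLOv5 face detector, whose weights and exact interface are not specified by the invention; the file naming is also an assumption.

    import cv2

    def detect_faces(frame):
        """Hypothetical stand-in for the YOLOv5 face detector.
        Should return a list of (x, y, w, h) face boxes for the frame."""
        raise NotImplementedError

    def crop_faces_from_video(video_path, out_prefix):
        # Walk through the source video frame by frame and save the cropped face regions.
        cap = cv2.VideoCapture(video_path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for (x, y, w, h) in detect_faces(frame):
                cv2.imwrite(f"{out_prefix}_{idx:06d}.png", frame[y:y + h, x:x + w])
            idx += 1
        cap.release()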
preferably, to ensure a sufficient sample size and improve the network's ability to cope with variable real environments, the method draws the raw data of the recognition network from several sources, including indoor samples (such as broadcast television programs) and outdoor samples (such as street-interview videos shot at different times of day). To keep the data samples of the recognition network stable, 5000 clips covering different illumination conditions, different lip angles and multiple speakers were selected manually after a large amount of data had been collected, and were randomly split into a training set and a test set at a ratio of 8:2; the training set is used to train the network and the test set to compare experimental results. Each labeled video of the recognition network raw data is 1 second long at 25 frames/second and is stored as RGB video, which completes the acquisition of the recognition network raw data.
Preferably, in S1, the dense multi-angle lip change raw data are made as follows: to further improve lip reading accuracy, high-definition cameras are placed in front of the subject; while each subject reads the specified text aloud, the cameras record the lip movement under different observation angles α, and the illumination intensity and illumination angle are varied during recording to simulate a real environment, yielding the dense multi-angle lip change raw data. The observation angle of a person looking straight ahead is defined as 0°, at which the face is perpendicular to the optical axis of the camera; since a face normally does not rotate by more than 70°, the observation angle α ranges from −70° to 70°.
Preferably, 71 high-definition cameras are arranged 1 meter in front of the subject's lips on one side (the right side in this embodiment), covering observation angles α from 0° to 70° with a spacing of 1° between adjacent cameras, as shown in fig. 1. Each subject is recorded for 20 seconds at 25 frames/second, and the recording is stored as continuously changing RGB lip images. After the videos on one side have been recorded, they are flipped horizontally to serve as the videos of the other side (observation angles α from −70° to 0°) and are likewise stored as continuously changing RGB lip images, giving images under all 141 observation angles, with many images per angle.
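A minimal sketch of the horizontal mirroring used to synthesize the −70° to 0° views from the recorded 0° to 70° views is given below; the directory layout and file naming are assumptions.

    import cv2
    import glob
    import os

    def mirror_angle_folder(src_dir, dst_dir):
        # Flip every recorded frame horizontally so that a view at +a degrees
        # becomes the corresponding view at -a degrees.
        os.makedirs(dst_dir, exist_ok=True)
        for path in glob.glob(os.path.join(src_dir, "*.png")):
            img = cv2.imread(path)
            flipped = cv2.flip(img, 1)  # flipCode=1 -> horizontal flip
            cv2.imwrite(os.path.join(dst_dir, os.path.basename(path)), flipped)

    # Example: frames shot at +35 degrees reused as the -35 degree view.
    # mirror_angle_folder("raw/angle_+35", "raw/angle_-35")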
S2, because the face occupies only a small area of each frame of the recognition network raw data and lip reading is independent of the background, the position of the face in the background must be determined before features are extracted. Because the amount of recognition network raw data is very large and manual annotation is time-consuming and laborious, the method uses a face landmark algorithm (the dlib open-source tool in this embodiment) to label the facial feature points of every frame of the recognition network raw data and every image of the dense multi-angle lip change raw data, obtaining two feature point position arrays that are stored as separate files; the two arrays and the raw data are stored separately for later computation and retrieval;
the feature point position array is the sequence formed by the feature point matrices of every frame or image of the raw data; labeling the facial feature points with the dlib open-source tool yields 68 facial feature points in total.
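A minimal sketch of this labeling step with dlib is given below; the path of the 68-landmark predictor file is an assumption.

    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    # Standard dlib 68-point model; the file path is an assumption.
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def landmark_matrix(gray_image):
        """Return a 68x2 matrix of facial feature points for the first detected face."""
        faces = detector(gray_image, 1)
        if not faces:
            return None
        shape = predictor(gray_image, faces[0])
        return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])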
S3, face alignment is performed on the face in every frame or image of the raw data according to its feature point matrix in the feature point position array and the average face feature point matrix, giving the aligned raw data: the average face feature point matrix is computed with the dlib tool from the existing data; each raw-data feature point matrix is taken from its feature point position array, its offset from the average face feature point matrix is computed with Procrustes analysis, the minimum offset is found for each case by gradient descent, and the face in each frame or image is translated and rotated into alignment according to its minimum offset, giving the aligned raw data. This completes the face alignment, straightens the face, and reduces the influence of head tilt on the extraction of lip features;
preferably, in S3, the offset is calculated by Procrustes analysis as shown in formula (1):
diff = Σ_i ||s·R·p_i^T + E − q_i^T||^2 (1)
in formula (1), diff denotes the difference between a raw-data feature point matrix and the average face feature point matrix, R is a 2 × 2 orthogonal matrix, s is a scalar, E is a two-dimensional vector, p_i denotes the feature point matrix of the raw data, and q_i denotes the average face feature point matrix.
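A minimal sketch of the offset of formula (1) between a frame's landmarks and the average face is given below; it uses the closed-form orthogonal Procrustes solution via SVD instead of the gradient-descent search described above, which is a simplification.

    import numpy as np

    def procrustes_offset(p, q):
        """Align landmark matrix p (68x2) to the average-face landmarks q (68x2).

        Returns the scale s, the 2x2 rotation R, the translation E and the
        residual diff = sum_i || s*R*p_i + E - q_i ||^2 (formula (1))."""
        mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)
        p0, q0 = p - mu_p, q - mu_q
        sp, sq = p0.std(), q0.std()
        p0 /= sp
        q0 /= sq
        u, _, vt = np.linalg.svd(p0.T @ q0)
        r = (u @ vt).T                      # 2x2 orthogonal matrix
        s = sq / sp                         # scalar scale
        e = mu_q - s * (r @ mu_p)           # translation vector
        diff = np.sum((s * (p @ r.T) + e - q) ** 2)
        return s, r, e, diff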
S4, after face alignment, because backgrounds other than the face only increase the difficulty of the recognition task, the method keeps only a fixed-size image of the lip region. Lip feature points are selected from the facial feature points obtained in S2, and the coordinates of the center of the lip feature points in each frame or image of the aligned raw data are computed from them; the lip region of each frame or image is then cropped to a fixed-size lip image (96 × 96 pixels in this embodiment) around that center, giving the dense multi-angle lip change data set and the recognition network data set respectively;
preferably, in S4, the coordinates of the center of the lip feature points in each frame or image of the aligned raw data are calculated as shown in formula (2):
x_i = (1/N)·Σ_{j=1}^{N} x_{i,j}, y_i = (1/N)·Σ_{j=1}^{N} y_{i,j} (2)
in formula (2), x_i and y_i denote the abscissa and ordinate of the center of the lip feature points in the i-th frame or image, and N denotes the number of lip feature points; since there are 20 lip feature points in total, N is 20 in this embodiment. The lip feature points are the 20 of the 68 facial feature points defined by the dlib open-source tool that lie in the lip region; these 20 points are referred to as the lip feature points.
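A minimal sketch of the center computation of formula (2) and the 96 × 96 crop follows; the dlib convention that landmarks 48 to 67 are the mouth points is used here.

    import numpy as np

    LIP_IDX = list(range(48, 68))  # dlib's 20 mouth landmarks

    def crop_lip(frame, landmarks, size=96):
        """Crop a fixed-size lip image centred on the mean of the 20 lip landmarks."""
        lips = landmarks[LIP_IDX]                 # 20 x 2
        cx, cy = lips.mean(axis=0).astype(int)    # formula (2): centre of the lip points
        half = size // 2
        return frame[cy - half:cy + half, cx - half:cx + half]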
S5, a GAN two-stage converter (a Generative Adversarial Network with U-Net as the generator and a patch discriminator, Patch-D) and a ResNet angle classifier are trained with the dense multi-angle lip change data set: lip reading in a real environment is strongly affected by factors such as illumination and viewing angle; the GAN two-stage converter and the ResNet angle classifier of this method are designed to solve this problem, so they must be trained first;
the dense multi-angle lip change data set is divided into 2α+1 classes, each class representing one observation angle, to train the ResNet angle classifier and obtain the trained ResNet angle classifier;
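A minimal sketch of such an angle classifier is shown below, assuming the torchvision ResNet-18 backbone with a 141-way output (one class per degree from −70° to 70°); the choice of ResNet-18 is an assumption, since the invention only specifies "ResNet".

    import torch.nn as nn
    from torchvision.models import resnet18

    NUM_ANGLES = 141  # 2*70 + 1 observation angles, one class per degree

    def build_angle_classifier():
        # ResNet-18 backbone with a 141-way classification head.
        return resnet18(num_classes=NUM_ANGLES)

    def predicted_angle(logits):
        # Map the predicted class index 0..140 back to a deflection angle in degrees.
        return logits.argmax(dim=1) - 70

    # Training uses ordinary cross-entropy on (lip image, angle class) pairs:
    # loss = nn.CrossEntropyLoss()(model(images), angles + 70)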
the dense multi-angle lip change data set is also divided into 2K−1 parts, namely K first-stage conversion sets and K−1 second-stage conversion sets, which are fed respectively into the first-stage and second-stage converters of the GAN two-stage converter for training, giving the trained GAN two-stage converter; there are K trained first-stage converters and K−1 trained second-stage converters; K denotes the number of divided angle ranges, and K is 7 in this embodiment;
each first-stage conversion set has an input and an output: the input is one of the K angle ranges into which the observation-angle range is divided, and the output is the conversion point of that angle range, the angle range containing 0° having no conversion point;
each second-stage conversion set has an input and an output: the input is one of the K−1 conversion points, and the output is the 0° observation angle.
The method is as follows: the observation-angle range of −70° to 70° is divided into K angle ranges that serve as the inputs of the K first-stage converters, and the K−1 conversion points corresponding to these ranges (a conversion point is a fixed angle within its range, preferably the midpoint) serve as their outputs to train the first-stage converters, the range containing 0° having no conversion point; the K−1 conversion points then serve as the inputs of the K−1 second-stage converters, with the 0° observation angle as output, to train the second-stage converters;
as shown in fig. 2, the small lower arrows indicate the first-stage conversion and the large upper arrows the second-stage conversion. In this embodiment, the angle range [−70°, −50°] belongs to first-stage converter No. 1, (−50°, −30°] to No. 2, (−30°, −10°] to No. 3, (−10°, 10°] to No. 4, (10°, 30°] to No. 5, (30°, 50°] to No. 6 and (50°, 70°] to No. 7. The conversion point of second-stage converter No. 1 is −60°, of No. 2 is −40°, of No. 3 is −20°, of No. 5 is 20°, of No. 6 is 40° and of No. 7 is 60°.
When training the GAN two-stage converter, first-stage converter No. 6, for example, takes the angle range (30°, 50°] as input and the 40° conversion point as output. During the experiments it was found that, because the ResNet angle classifier has a classification error of about ±5°, a classification error could cause the wrong neighbouring first-stage converter to be used and degrade the conversion. To prevent this, the coverage of each first-stage converter is enlarged from 20° to 30°, i.e. the angle range extended by ±5° is used as input and the corresponding conversion point as output; first-stage converter No. 6 then takes (25°, 55°] as input and the 40° conversion point as output, converting any image in this range to the 40° conversion point, whose image is in turn converted to 0° by second-stage converter No. 6. Considering the conversion error of the first stage, each second-stage converter is trained with its conversion point ±2° as input and 0° as output; second-stage converter No. 6, for example, uses images at 38° to 42° and at 0°.
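A minimal sketch of how the angle ranges, conversion points and the ±5°/±2° tolerances described above can be turned into per-converter training subsets is given below; the data set is assumed to be a list of (image, angle) pairs, which is a simplification.

    K = 7
    HALF_RANGE = 10           # each first-stage converter nominally covers 20 degrees
    # Target angle of each first-stage converter; No. 4 maps its range straight to 0.
    CONVERSION_POINTS = {1: -60, 2: -40, 3: -20, 4: 0, 5: 20, 6: 40, 7: 60}

    def first_stage_pairs(dataset, conv_no, slack=5):
        """(inputs, targets) for first-stage converter `conv_no` (1..7).

        Inputs are images whose angle lies in the converter's nominal 20-degree
        range widened by +/- `slack` degrees; targets are images at the
        conversion point (0 degrees for converter No. 4)."""
        point = CONVERSION_POINTS[conv_no]
        lo, hi = point - HALF_RANGE - slack, point + HALF_RANGE + slack
        inputs = [img for img, angle in dataset if lo < angle <= hi]
        targets = [img for img, angle in dataset if angle == point]
        return inputs, targets

    def second_stage_pairs(dataset, conv_no, slack=2):
        """(inputs, targets) for second-stage converter `conv_no` (1..7 except 4):
        inputs lie within +/- `slack` degrees of the conversion point, targets at 0."""
        assert conv_no != 4, "the 0-degree range needs no second-stage converter"
        point = CONVERSION_POINTS[conv_no]
        inputs = [img for img, angle in dataset if abs(angle - point) <= slack]
        targets = [img for img, angle in dataset if angle == 0]
        return inputs, targets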
S6, the recognition network data set is rectified with the trained ResNet angle classifier and the trained GAN two-stage converter, correcting the deflected lip images it contains: the recognition network data set is split frame by frame with the OpenCV-python library into lip images to be corrected, which are fed to the trained ResNet angle classifier for angle classification by formula (3), giving the lip deflection angle θ of each lip image to be corrected:
θ=classify(image) (3)
in formula (3), image denotes a lip image to be corrected and classify denotes the trained ResNet angle classifier;
the GAN two-stage converter number i used is then determined by equation (4) according to the lip deflection angle θ:
i = [θ/20°] + 4 (4)
in the formula (4), i represents the number of the GAN two-stage converter, and i is more than or equal to 1 and less than or equal to K; the operator [ ] represents rounding;
the lip image to be corrected with lip deflection angle θ is then sent to the trained GAN two-stage converter of the corresponding number for lip correction by formula (5), giving the corrected lip image, whose lip deflection angle θ is 0°; the corrected lip images are converted to grayscale, combined into the corrected lip image sequence, and stored as an NPZ file; this process takes 3 hours;
out = second_i(first_i(image)) (5)
in formula (5), out denotes the output, i.e. the corrected lip image; first_i is the i-th first-stage converter and second_i the i-th second-stage converter;
the rectification process of the trained GAN two-stage converter is as follows: the lip image to be corrected with lip deflection angle θ is sent to the first-stage converter of the trained GAN two-stage converter with the corresponding number, which converts it to the conversion point of the second-stage converter of the same number; that second-stage converter then corrects it into a corrected lip image with a lip deflection angle θ of 0°. For example, if the trained ResNet angle classifier determines that the lip deflection angle of a given lip image is 35°, GAN two-stage converter No. 6 is selected; first-stage converter No. 6 adjusts the lip from 35° to the 40° conversion point, completing the first-stage conversion, and the 40° image is then fed to second-stage converter No. 6 and converted to a 0° frontal view, completing the two-stage correction and giving the corrected lip image. As a special case, when the GAN two-stage converter number is i = 4, i.e. the angle range (−10°, 10°], the lips are corrected to 0° using only first-stage converter No. 4.
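A minimal sketch of this S6 correction pipeline is shown below; classify, first_stage and second_stage stand in for the trained networks, whose interfaces the invention does not spell out, and the index computation follows formula (4).

    import cv2
    import numpy as np

    def correct_sequence(frames, classify, first_stage, second_stage):
        """Correct every deflected lip frame to a frontal (0 degree) view.

        classify(frame) -> deflection angle theta in degrees (formula (3));
        first_stage[i] / second_stage[i] are the trained converters No. i."""
        corrected = []
        for frame in frames:
            theta = classify(frame)
            i = int(round(theta / 20.0)) + 4      # formula (4): converter number 1..7
            out = first_stage[i](frame)           # -> conversion point of converter i
            if i != 4:                            # range (-10, 10] needs no second stage
                out = second_stage[i](out)        # -> 0 degree frontal view
            corrected.append(cv2.cvtColor(out, cv2.COLOR_BGR2GRAY))
        return np.stack(corrected)

    # The corrected grayscale sequence is saved as an NPZ file, e.g.:
    # np.savez("corrected.npz", frames=correct_sequence(frames, classify, first, second))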
S7, the TCN temporal convolutional network is trained with the corrected lip image sequence: the corrected lip image sequence is imported and converted into a tensor of dimension B × T × H × W, where B is the batch size, T the number of frames, H the height and W the width. It is fed into a ResNet18 lip feature encoder; after encoding, a feature vector of dimension B × C × T is output, where C is the number of channels (set to 512 in this embodiment), and this is fed into the TCN temporal convolutional network for training to obtain the prediction result. CE loss is then used as the loss function, and the loss value loss of the TCN is calculated from the prediction result and the label value of the label. When loss no longer decreases or the specified number of iterations is reached, training ends and the trained TCN is obtained; if loss is still decreasing and the iteration limit has not been reached, the output errors of the neurons of each layer are calculated backwards, layer by layer from the output layer to the input layer, and every weight and bias of the TCN is adjusted by gradient descent until loss no longer decreases or the iteration limit is reached, so that the TCN reaches its optimum; training then ends and the trained TCN temporal convolutional network is obtained;
preferably, the TCN temporal convolutional network structure, shown in fig. 3, comprises an input layer (first layer), a first hidden layer (second layer), a second hidden layer (third layer), an output layer (fourth layer), an upsampling layer and a SoftMax layer, executed in sequence; the B × C × T feature vector enters the input layer, passes through the first hidden layer of dilated convolution with dilation coefficient 1, the second hidden layer of dilated convolution with dilation coefficient 2 and the output layer of dilated convolution with dilation coefficient 4, is restored to the original size by the upsampling layer, and is passed through the logistic regression of the SoftMax layer to obtain the prediction result; cross-layer connections are added between the input layer and the first hidden layer, between the first hidden layer and the second hidden layer, and between the second hidden layer and the output layer;
preferably, the dilated convolution is calculated as shown in equation (6):
F(s) = Σ_{ind=0}^{filt−1} f(ind)·x_{s−d·ind} (6)
in equation (6), F(s) denotes the dilated convolution applied at position s of the corresponding TCN layer, d denotes the dilation coefficient, ind the convolution kernel index, filt the filter size, f(ind) the ind-th filter weight, and x the layer input;
preferably, the calculation process of the cross-layer connection is as shown in formula (7):
o=Activation(base+F(base)) (7)
in formula (7), o denotes the output after the cross-layer connection, Activation denotes the activation function, F denotes the dilated convolution applied to base, and base denotes the underlying input (namely the input of the input layer or of one of the two hidden layers);
preferably, the calculation process of the SoftMax layer is as shown in formula (8):
S_r = e^r / Σ_g e^g (8)
in formula (8), S_r denotes the prediction result, r denotes the r-th output value of the upsampling layer, and the sum runs over all output values g of the upsampling layer;
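A minimal PyTorch sketch of a TCN head with the structure described above (dilated 1-D convolutions with dilation coefficients 1, 2 and 4, cross-layer residual connections, upsampling and SoftMax) is given below; the kernel size, padding scheme, channel counts and the temporal mean pooling before SoftMax are assumptions not fixed by the invention.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TCNHead(nn.Module):
        """Input layer, two hidden layers and output layer with dilations 1, 2 and 4,
        cross-layer (residual) connections, upsampling back to T and SoftMax."""

        def __init__(self, in_ch=512, hid_ch=256, num_classes=400, kernel=3):
            super().__init__()
            def dconv(cin, cout, d):
                # 'same' padding so residual additions line up along the time axis.
                return nn.Conv1d(cin, cout, kernel, dilation=d, padding=d * (kernel - 1) // 2)
            self.inp = dconv(in_ch, hid_ch, 1)            # input layer
            self.hid1 = dconv(hid_ch, hid_ch, 1)          # first hidden layer, d = 1
            self.hid2 = dconv(hid_ch, hid_ch, 2)          # second hidden layer, d = 2
            self.out = dconv(hid_ch, num_classes, 4)      # output layer, d = 4
            self.proj = nn.Conv1d(hid_ch, num_classes, 1) # channel projection for the last skip

        def forward(self, x):                             # x: B x C x T from the ResNet18 encoder
            b = self.inp(x)
            b = torch.relu(b + self.hid1(b))              # formula (7): o = Activation(base + F(base))
            b = torch.relu(b + self.hid2(b))
            y = torch.relu(self.proj(b) + self.out(b))    # cross-layer connection into the output layer
            y = F.interpolate(y, size=x.shape[-1])        # upsampling layer ('same' padding already keeps T)
            return F.softmax(y.mean(dim=-1), dim=1)       # formula (8) over the word classes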
preferably, the CE loss is calculated as shown in equation (9):
loss = −Σ_label p(label)·log q(label) (9)
in formula (9), loss denotes the loss value, p(label) denotes the label value of label, and q(label) denotes the predicted probability of label;
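A minimal sketch of the S7 training loop follows, combining a per-frame ResNet18 lip feature encoder with the TCN head sketched above and the CE loss of formula (9); the optimizer, learning rate, grayscale-to-RGB replication and data loader are assumptions.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class LipReader(nn.Module):
        def __init__(self, num_classes=400, feat_dim=512):
            super().__init__()
            self.encoder = resnet18(num_classes=feat_dim)  # ResNet18 lip feature encoder
            self.tcn = TCNHead(in_ch=feat_dim, num_classes=num_classes)

        def forward(self, clips):                          # clips: B x T x H x W (grayscale)
            b, t, h, w = clips.shape
            frames = clips.reshape(b * t, 1, h, w).repeat(1, 3, 1, 1)  # ResNet18 expects 3 channels
            feats = self.encoder(frames).reshape(b, t, -1)             # B x T x C
            return self.tcn(feats.transpose(1, 2))                     # B x C x T -> class probabilities

    def train(model, loader, epochs=30, lr=1e-4):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        nll = nn.NLLLoss()                                 # CE loss of formula (9) on log-probabilities
        for _ in range(epochs):
            for clips, labels in loader:
                probs = model(clips)
                loss = nll(torch.log(probs + 1e-8), labels)
                opt.zero_grad()
                loss.backward()                            # back-propagate the errors layer by layer
                opt.step()                                 # adjust weights and biases by gradient descent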
and S8, feature recognition and classification are performed by the trained TCN temporal convolutional network to generate the lip language recognition result.
Experiments show that the two-stage algorithm of the GAN two-stage converter has obvious advantages. The method uses the Structural Similarity (SSIM) index to measure the conversion quality of the pictures. The results show that two-stage conversion improves SSIM by 7% over one-stage conversion, reaching 89% structural similarity, which further improves lip reading accuracy. As can be seen from fig. 5 and fig. 6, the difference between the frontal face image converted from fig. 4 by the two-stage converter (fig. 5) and the real image (fig. 6) is very small.
The accuracy of the method is further described in two specific experimental examples.
In this experimental example, a prediction is counted as correct when its predicted class is consistent with the class of the label. The raw data contain 400 words after screening. Several comparative experiments were set up, each using the ResNet18 lip feature encoder for encoding; the test results are shown in Table 1:
TABLE 1
Network architecture | Image processing method | Test accuracy
TCN | GAN two-stage converter | 95.6%
TCN | Original image | 92.9%
LSTM | GAN two-stage converter | 94.9%
LSTM | Original image | 91.7%
It can be seen that, compared with preprocessing that only centres the lips (the "Original image" rows of Table 1), the GAN two-stage converter adopted in the invention brings a significant accuracy improvement for both recognition networks (TCN and LSTM), and that the TCN outperforms the LSTM (long short-term memory network).
Matters not described in detail in this specification belong to the prior art known to those skilled in the art.

Claims (10)

1. A lip language recognition method based on a generative adversarial network and a temporal convolutional network, characterized by comprising the following steps:
s1, making original data; the raw data comprises identification network raw data and intensive multi-angle lip change raw data;
s2, respectively labeling the characteristic points of the human face for each frame or each image of the original data by using a human face labeling algorithm to obtain two characteristic point position arrays;
s3, respectively carrying out face alignment on the face in each frame or each image of the original data according to the feature point matrix of the original data and the average face feature point matrix in the feature point position array to obtain aligned original data;
s4, after the face alignment is finished, lip feature points are selected from the face feature points obtained in the S2, and the coordinates of the centers of the lip feature points in each frame or each image of the aligned original data are obtained through calculation according to the lip feature points; dividing lip areas in each frame or each image of the aligned original data into lip images with fixed sizes according to coordinates of centers of the lip feature points, and further respectively obtaining a dense multi-angle lip change data set and an identification network data set;
s5, training a GAN two-stage converter and a ResNet angle classifier by using a dense multi-angle lip change data set;
dividing a dense multi-angle lip change data set into 2 x K-1 parts including K first-stage conversion sets and K-1 second-stage conversion sets, and respectively and correspondingly inputting the parts into a first-stage converter and a second-stage converter of a GAN two-stage converter for training to obtain a trained GAN two-stage converter; the number of the first-stage converters after training is K, and the number of the second-stage converters after training is K-1; k represents the number of the divided areas;
each first stage conversion set includes an input and an output; inputting one of K angle ranges divided into an observation angle range, and outputting a conversion point corresponding to the angle range, wherein the angle range containing 0 degrees has no conversion point;
each second stage conversion set includes an input and an output; the input is one of K-1 conversion points, and the output is an observation angle of 0 degrees;
s6, correcting the recognition network data set by using a trained ResNet angle classifier and a trained GAN two-stage converter, and correcting the lip image deflected in the recognition network data set: splitting the identification network data set into a plurality of lip images to be corrected frame by frame, and then inputting the lip images to a trained ResNet angle classifier for angle classification to obtain the respective lip deflection angle theta of each lip image to be corrected;
determining the number i of the used GAN two-stage converter according to the deflection angle theta of the lip;
then sending the lip image to be corrected of the lip deflection angle theta into a trained GAN two-stage converter with corresponding numbers for lip correction to obtain a corrected lip image; the lip deflection angle theta of the corrected lip image is 0 degree; combining the corrected lip images into a corrected lip image sequence;
s7, training a TCN temporal convolutional network by using the corrected lip image sequence;
and S8, performing feature recognition and classification through the trained TCN temporal convolutional network to generate a lip language recognition result.
2. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S1, the raw data of the recognition network are made as follows: a source video and its subtitle file are obtained from the internet by a web crawler, the face regions in the source video are located with a face detection algorithm, the faces are cropped out and matched to the subtitle file, and the raw data of the recognition network are obtained.
3. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S1, the dense multi-angle lip change raw data are made as follows: high-definition cameras are placed in front of the subject; while each subject reads the specified text aloud, the cameras record the lip movement under different observation angles α, and the illumination intensity and illumination angle of the subject are varied during recording to simulate a real environment, giving the dense multi-angle lip change raw data; the observation angle of a person looking straight ahead is defined as 0°; the observation angle α ranges from −70° to 70°;
in S1, 71 high-definition cameras are arranged 1 meter in front of the subject's lips within the observation-angle range of 0° to 70°, with a spacing of 1° between adjacent cameras; the videos are recorded and stored as continuously changing images, then flipped horizontally to serve as the videos with observation angles α from −70° to 0° and likewise stored as continuously changing images, giving images under all 141 observation angles, with several images per observation angle.
4. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S3, an average face feature point matrix is calculated from the existing data; each raw-data feature point matrix is taken from its feature point position array, its offset from the average face feature point matrix is calculated, the minimum offset is found for each case by gradient descent, and the face in each frame or image of the raw data is translated and rotated into alignment according to its minimum offset, giving the aligned raw data;
in S3, the offset is calculated by Procrustes analysis as shown in formula (1):
diff = Σ_i ||s·R·p_i^T + E − q_i^T||^2 (1)
in formula (1), diff denotes the difference between the raw-data feature point matrix and the average face feature point matrix, R is a 2 × 2 orthogonal matrix, s is a scalar, E is a two-dimensional vector, p_i denotes the feature point matrix of the raw data, and q_i denotes the average face feature point matrix.
5. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S4, the coordinates of the center of the lip feature points in each frame or image of the aligned raw data are calculated as shown in formula (2):
x_i = (1/N)·Σ_{j=1}^{N} x_{i,j}, y_i = (1/N)·Σ_{j=1}^{N} y_{i,j} (2)
in formula (2), x_i denotes the abscissa and y_i the ordinate of the center of the lip feature points in the i-th frame or image; N denotes the number of lip feature points.
6. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein S5 specifically comprises: dividing the dense multi-angle lip change data set into 2α+1 classes, each class representing one observation angle, to train the ResNet angle classifier and obtain the trained ResNet angle classifier; α denotes the observation angle.
7. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 6, wherein in S6, when K is 7, specifically: the recognition network data set is split frame by frame into lip images to be corrected, which are fed to the trained ResNet angle classifier for angle classification by formula (3), giving the lip deflection angle θ of each lip image to be corrected:
θ=classify(image) (3)
in formula (3), image denotes a lip image to be corrected and classify denotes the trained ResNet angle classifier;
then, the GAN two-stage converter number i used is determined by equation (4) according to the lip deflection angle θ:
i = [θ/20°] + 4 (4)
in formula (4), i denotes the number of the GAN two-stage converter, 1 ≤ i ≤ K; the operator [ ] denotes rounding;
then sending the lip image to be corrected of the lip deflection angle theta into a trained GAN two-stage converter with corresponding numbers to carry out lip correction by the formula (5) to obtain a corrected lip image; the lip deflection angle theta of the corrected lip image is 0 degree; combining the corrected lip images into a corrected lip image sequence;
out = second_i(first_i(image)) (5)
in formula (5), out is the corrected lip image; first_i is the i-th first-stage converter and second_i the i-th second-stage converter.
8. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S6 the correction process of the trained GAN two-stage converter is: the lip image to be corrected with lip deflection angle θ is sent to the first-stage converter of the trained GAN two-stage converter with the corresponding number and converted to the conversion point of the second-stage converter of the same number, and that second-stage converter then corrects it into a corrected lip image with a lip deflection angle θ of 0°.
9. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 1, wherein in S7 the corrected lip image sequence is imported and converted into a tensor of dimension B × T × H × W, where B is the batch size, T the number of frames, H the height and W the width; it is fed into a ResNet18 lip feature encoder, and after encoding a feature vector of dimension B × C × T is output, where C is the number of channels, which is fed into the TCN temporal convolutional network for training to obtain the prediction result; CE loss is then used as the loss function, and the loss value loss of the TCN temporal convolutional network is calculated from the prediction result and the label value of the label; when the loss value loss no longer decreases or the specified number of iterations is reached, the training ends and the trained TCN temporal convolutional network is obtained; if loss is still decreasing and the specified number of iterations has not been reached, the output errors of the neurons of each layer are calculated backwards layer by layer from the output layer to the input layer, and every weight and bias of the TCN temporal convolutional network is adjusted by gradient descent until loss no longer decreases or the specified number of iterations is reached, so that the TCN temporal convolutional network reaches its optimum; the training then ends and the trained TCN temporal convolutional network is obtained.
10. The lip language recognition method based on the generative adversarial network and the temporal convolutional network as claimed in claim 9, wherein in S7 the TCN temporal convolutional network structure comprises an input layer, a first hidden layer, a second hidden layer, an output layer, an upsampling layer and a SoftMax layer, executed in sequence; the B × C × T feature vector enters the input layer, passes through the first hidden layer of dilated convolution with dilation coefficient 1, the second hidden layer of dilated convolution with dilation coefficient 2 and the output layer of dilated convolution with dilation coefficient 4, is restored to the original size by the upsampling layer, and is passed through the logistic regression of the SoftMax layer to obtain the prediction result; cross-layer connections are added between the input layer and the first hidden layer, between the first hidden layer and the second hidden layer, and between the second hidden layer and the output layer;
the dilated convolution is calculated as shown in equation (6):
F(s) = Σ_{ind=0}^{filt−1} f(ind)·x_{s−d·ind} (6)
in equation (6), F(s) denotes the dilated convolution applied at position s of the corresponding TCN layer, d denotes the dilation coefficient, ind the convolution kernel index, filt the filter size, f(ind) the ind-th filter weight, and x the layer input;
the calculation process of cross-layer connection is shown as formula (7):
o=Activation(base+F(base)) (7)
in formula (7), o denotes the output after the cross-layer connection, Activation denotes the activation function, F denotes the dilated convolution applied to base, and base denotes the underlying input;
the SoftMax layer is calculated as shown in formula (8):
S_r = e^r / Σ_g e^g (8)
in formula (8), S_r denotes the prediction result, r denotes the r-th output value of the upsampling layer, and the sum runs over all output values g of the upsampling layer;
the CE loss is calculated as shown in formula (9):
loss = −Σ_label p(label)·log q(label) (9)
in formula (9), loss denotes the loss value, p(label) denotes the label value of label, and q(label) denotes the predicted probability of label.
CN202110262815.7A 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network Active CN112818950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110262815.7A CN112818950B (en) 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110262815.7A CN112818950B (en) 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network

Publications (2)

Publication Number Publication Date
CN112818950A CN112818950A (en) 2021-05-18
CN112818950B true CN112818950B (en) 2022-08-23

Family

ID=75863117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110262815.7A Active CN112818950B (en) 2021-03-11 2021-03-11 Lip language recognition method based on generative adversarial network and temporal convolutional network

Country Status (1)

Country Link
CN (1) CN112818950B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239902B (en) * 2021-07-08 2021-09-28 中国人民解放军国防科技大学 Lip language identification method and device for generating confrontation network based on double discriminators

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109858412A (en) * 2019-01-18 2019-06-07 东北大学 A kind of lip reading recognition methods based on mixing convolutional neural networks
CN111291669A (en) * 2020-01-22 2020-06-16 武汉大学 Two-channel depression angle human face fusion correction GAN network and human face fusion correction method
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN112084927A (en) * 2020-09-02 2020-12-15 中国人民解放军军事科学院国防科技创新研究院 Lip language identification method fusing multiple visual information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339806B (en) * 2018-12-19 2021-04-13 马上消费金融股份有限公司 Training method of lip language recognition model, living body recognition method and device
CN110059602B (en) * 2019-04-10 2022-03-15 武汉大学 Forward projection feature transformation-based overlook human face correction method
CN110276274B (en) * 2019-05-31 2023-08-04 东南大学 Multitasking depth feature space gesture face recognition method
CN110443129A (en) * 2019-06-30 2019-11-12 厦门知晓物联技术服务有限公司 Chinese lip reading recognition methods based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109858412A (en) * 2019-01-18 2019-06-07 东北大学 A kind of lip reading recognition methods based on mixing convolutional neural networks
CN111291669A (en) * 2020-01-22 2020-06-16 武汉大学 Two-channel depression angle human face fusion correction GAN network and human face fusion correction method
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN112084927A (en) * 2020-09-02 2020-12-15 中国人民解放军军事科学院国防科技创新研究院 Lip language identification method fusing multiple visual information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Combining DC-GAN with ResNet for blood cell image classification; Li Ma et al.; Medical & Biological Engineering & Computing; 2020-03-27; pp. 1-14 *
EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning; Kamyar Nazeri et al.; arXiv:1901.00212v3; 2019-01-11; pp. 1-17 *
Design of a multi-angle face recognition system based on the SE-ResNet model; Chen Xuemin; Journal of Guiyang University (Natural Science Edition); 2020-12-31; vol. 15, no. 4, pp. 10-13 *
Frontalization of deflected faces based on a generative adversarial network; Hu Huiya et al.; Journal of Zhejiang University (Engineering Science); 2021-01-31; vol. 55, no. 1, pp. 116-123 *

Also Published As

Publication number Publication date
CN112818950A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN108986140B (en) Target scale self-adaptive tracking method based on correlation filtering and color detection
CN108648197B (en) Target candidate region extraction method based on image background mask
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
Savvides et al. Efficient design of advanced correlation filters for robust distortion-tolerant face recognition
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
CN111310676A (en) Video motion recognition method based on CNN-LSTM and attention
CN109035172B (en) Non-local mean ultrasonic image denoising method based on deep learning
CN111738363B (en) Alzheimer disease classification method based on improved 3D CNN network
CN110827304B (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolution network and level set method
CN113012172A (en) AS-UNet-based medical image segmentation method and system
Huynh-The et al. NIC: A robust background extraction algorithm for foreground detection in dynamic scenes
CN107767358B (en) Method and device for determining ambiguity of object in image
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN112084927B (en) Lip language identification method fusing multiple visual information
WO2024109374A1 (en) Training method and apparatus for face swapping model, and device, storage medium and program product
CN116051560B (en) Embryo dynamics intelligent prediction system based on embryo multidimensional information fusion
CN112070685A (en) Method for predicting dynamic soft tissue motion of HIFU treatment system
CN112818950B (en) Lip language identification method based on generation of countermeasure network and time convolution network
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN115439927A (en) Gait monitoring method, device, equipment and storage medium based on robot
CN111539320A (en) Multi-view gait recognition method and system based on mutual learning network strategy
CN113283334B (en) Classroom concentration analysis method, device and storage medium
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN112488165A (en) Infrared pedestrian identification method and system based on deep learning model
CN111080754A (en) Character animation production method and device for connecting characteristic points of head and limbs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant