CN109086707A - An expression tracking method based on a DCNNs-LSTM model - Google Patents

An expression tracking method based on a DCNNs-LSTM model

Info

Publication number
CN109086707A
Authority
CN
China
Prior art keywords: face, expression, LSTM, head, model
Legal status: Pending
Application number
CN201810823018.XA
Other languages
Chinese (zh)
Inventor
饶云波
宋佳丽
吉普照
苟苗
范柏江
杨攀
郑雨嘉
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Application filed by University of Electronic Science and Technology of China
Priority to CN201810823018.XA
Publication of CN109086707A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the field of expression tracking technology, and specifically relates to an expression tracking method based on a DCNNs-LSTM model. Considering the effectiveness of convolutional neural networks in image feature extraction and the advantage of long short-term memory networks in processing time-series information, the invention proposes a deep neural network model combining DCNN and LSTM to extract the 68 facial feature points together with gaze and head feature information. The dataset covers facial images of the elderly, adults, and juveniles, both Chinese and foreign, with corresponding label files, and the trained model can simultaneously extract the three-dimensional feature information of the face, gaze, and head. The extracted three-dimensional features are rebuilt into a triangular mesh, the Euclidean distances between the mesh vertices are computed, and a data translation model converts the distances into deformation coefficients recognizable by an animated character, realizing an expression and head transfer effect.

Description

An expression tracking method based on a DCNNs-LSTM model
Technical field
The invention belongs to the field of expression tracking technology, and specifically relates to an expression tracking method based on a DCNNs-LSTM model.
Background art
With the rapid development of hardware, software, and artificial intelligence, facial expression, gaze, and head tracking have gradually become popular directions in the field of video and image research, hinting at the future of human-computer interaction. In virtual reality, gaze tracking follows the user's point of fixation and adjusts the focal length to realize depth-of-field effects in the virtual scene. In the game world, game characters can be controlled according to the user's expression and body motion, improving the immersion of the game. The technology also has considerable application value and market prospects in fields such as applied psychology, identity authentication, and secure access control. Facial expression, gaze, and head tracking technology therefore has very practical research significance.
Acquiring the facial feature points and the head and gaze motion information is the key link of tracking technology, and the precision of the extracted information directly affects the tracking effect. The main pipeline is as follows: face information is input through a device or an existing channel; after the input face undergoes grayscale conversion, noise reduction, detection, and other processing, feature localization techniques predict and extract the feature information of the face, contour, head, and gaze positions of the input face; the extracted feature information is then combined with other related techniques to realize various tracking effects. Different feature localization algorithms are mainly assessed on three aspects: positional accuracy, running speed, and robustness.
Current feature localization techniques fall into two main categories:
1) Conventional methods: an initial face is first constructed with a cascaded regression algorithm, and multiple weak regressors are trained to progressively approximate the real face shape, thereby realizing facial feature localization. If the constructed initial face deviates too far from the real face, the subsequent regression optimization also shows large errors. In addition, gaze is predicted by combining head pose and eyeball localization; because of the complexity of head motion, multiple cameras are needed for multi-angle shooting, or a complicated face model must be used, and the final prediction result is affected by environmental constraints and by the accuracy of the head and eyeball information.
2) Deep-learning-based methods: a face dataset is collected and prepared, and a neural network realizing facial feature localization and head and gaze prediction is built and trained. Since head and gaze prediction involves temporal correlation, recurrent neural networks are currently the main tool for prediction and tracking, combined with other neural network models for feature localization. This approach mainly depends on the preprocessing of the training data and on building and optimizing the network model; it needs no complicated input device, and because the whole face is the input, the error introduced by an initial face shape is avoided. Moreover, the accuracy of gaze estimation is no longer affected by head pose and eyeball localization. Most current research therefore realizes feature localization with deep-learning-based methods.
Summary of the invention
In view of the above problems, the present invention proposes an expression tracking method based on a DCNNs-LSTM model. The main idea of the facial expression, gaze, and head tracking method of the invention is to combine the advantage of the deep convolutional neural network DCNN in feature extraction with the effectiveness of the long short-term memory network LSTM in processing and predicting events that involve a time series, so as to realize facial three-dimensional feature extraction and gaze and head pose prediction, and to process the extracted information into deformation coefficients matching the structure of an animated character, thereby realizing a tracking effect. The rise of tracking technology has brought many interesting applications, such as life analysis systems based on facial aging speed, vision-based human-computer interaction, and solving dizziness and sharpness problems in virtual reality. As artificial intelligence develops, research on face, gaze, and head tracking becomes ever more important; in conventional methods, however, gaze information is predicted jointly from face and head information, and the feature information cannot be extracted simultaneously. In view of this situation, the present invention considers the effectiveness of convolutional neural networks in image feature extraction and the advantage of long short-term memory networks in processing time-series information, and proposes a deep neural network model combining DCNN and LSTM to extract the 68 facial feature points together with gaze and head feature information. The dataset covers facial images of the elderly, adults, and juveniles, both Chinese and foreign, with corresponding label files, and the trained model can simultaneously extract the three-dimensional feature information of the face, gaze, and head. The extracted three-dimensional features are rebuilt into a triangular mesh, the Euclidean distances between the mesh vertices are computed, and a data translation model converts the distances into deformation coefficients recognizable by an animated character, realizing an expression and head transfer effect.
For ease of understanding, the techniques used in the present invention are first described in detail:
The present invention completes feature extraction with deep neural networks. Convolutional neural networks are a particularly effective means of extracting image features; most current research, however, concentrates on using deeper and larger networks to improve model accuracy while ignoring the problems of running speed and model size. The present invention therefore exploits the advantage of depthwise separable convolution, which can learn richer features with fewer parameters and less computational overhead, and builds the convolutional network from multiple depthwise separable convolution layers. The network structure is shown in Table 1: the first layer is a 3×3 standard convolution, followed by 9 depthwise separable convolutions, after which an average pooling layer normalizes the feature map to a size of 1×1×1024. (A code sketch of the depthwise separable building block follows Table 1.)
Table 1. The improved CNN network architecture

Type / Stride    Filter Shape    Input Size
Conv / s2        3×3×3×64        96×96×3
Conv dw / s1     3×3×64 dw       48×48×64
Conv / s1        1×1×64×64       48×48×64
Conv dw / s2     3×3×64 dw       48×48×64
Conv / s1        1×1×64×128      24×24×64
Conv dw / s1     3×3×128 dw      24×24×128
Conv / s1        1×1×128×128     24×24×128
Conv dw / s1     3×3×128 dw      12×12×128
Conv / s1        1×1×128×256     12×12×128
Conv dw / s2     3×3×256 dw      12×12×256
Conv / s1        1×1×256×256     12×12×256
Conv dw / s2     3×3×256 dw      12×12×256
Conv / s1        1×1×256×512     6×6×256
Conv dw / s1     3×3×512 dw      6×6×512
Conv / s1        1×1×512×512     6×6×512
Conv dw / s1     3×3×512 dw      6×6×512
Conv / s1        1×1×512×512     6×6×512
Conv dw / s2     3×3×512 dw      6×6×512
Conv / s1        1×1×512×1024    3×3×512
Avg Pool / s1    Pool 7×7        7×7×1024
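For illustration, the following is a minimal PyTorch sketch of the depthwise separable building block that Table 1 repeats (each "Conv dw" row followed by a 1×1 "Conv" row). The BatchNorm/ReLU placement is our assumption, since the patent does not specify activation layers:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """One 'Conv dw' + 1x1 'Conv' pair from Table 1."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # pointwise: 1x1 convolution that mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn_dw = nn.BatchNorm2d(in_ch)
        self.bn_pw = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn_dw(self.depthwise(x)))
        return self.relu(self.bn_pw(self.pointwise(x)))
```

Relative to a standard 3×3 convolution, this factorization costs roughly a fraction 1/out_ch + 1/9 of the multiply-accumulates, which is the parameter and computation saving referred to above.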
Convolutional neural networks perform well in image feature extraction, but they can only learn one picture at a time and cannot handle the extraction of information that forms a time series. For feature information such as gaze and head pose, the state at time t is associated with the states before time t. The traditional recurrent neural network RNN can process and predict sequence data by connecting previous information to the current task, for example using past video segments to infer an understanding of the current segment; however, an overly long sequence causes the vanishing-gradient problem, so the traditional RNN can only handle short-term memory.
The long short-term memory network LSTM is a special kind of RNN that transforms the repeating module of the standard RNN structure into a structure with three "gates", where a gate is an operation combining a sigmoid neural network layer with a pointwise multiplication. As shown in Fig. 2, the LSTM network module contains a forget gate, an input gate, and an output gate. The forget gate decides what information to discard from the cell state, the input gate decides which values to update, and the output gate decides what to finally output. Through this gate structure, the LSTM protects and controls the cell state and lets it flow through time, thereby ensuring long-term memory.
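For reference, the three gates correspond to the standard LSTM update equations (textbook form, not specific to this patent):

f_t = σ(W_f·[h_(t−1), x_t] + b_f)   (forget gate)
i_t = σ(W_i·[h_(t−1), x_t] + b_i)   (input gate)
c̃_t = tanh(W_c·[h_(t−1), x_t] + b_c)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ c̃_t
o_t = σ(W_o·[h_(t−1), x_t] + b_o)   (output gate)
h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function, ⊙ is pointwise multiplication, c_t is the cell state, and h_t is the hidden output.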
Therefore, the present invention combines the deep convolutional neural network DCNN and the LSTM network to jointly extract face, head, and gaze features: the DCNN learns spatial features and the LSTM learns temporal features. As shown in Fig. 3, the pictures of 5 consecutive frames are input simultaneously; 5 CNNs first extract the facial feature information of the corresponding pictures in parallel, and the CNN outputs serve as the input values of the LSTM network. In the LSTM layer, the output values of the first 5 time steps are used for the feature values of the 6th time step, and the output gate finally determines the network output. The prediction result is 213 coordinate values, including the three-dimensional coordinates of the 68 feature points, the 3D coordinates of the left and right gaze points, and the 3D head coordinates.
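A minimal sketch of this combination follows (illustrative only: the class name, the hidden size, and the use of a single nn.LSTM layer are our assumptions, and cnn stands for the Table 1 feature extractor ending in the 1024-dimensional pooled feature):

```python
import torch
import torch.nn as nn

class DCNNLSTM(nn.Module):
    """Per-frame CNN features fed to an LSTM; the last step predicts 213 values."""
    def __init__(self, cnn: nn.Module, feat_dim=1024, hidden=256, n_out=213):
        super().__init__()
        self.cnn = cnn                              # shared Table-1 feature extractor
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_out)          # 68*3 + 2*3 + 1*3 = 213

    def forward(self, frames):                      # frames: (B, T=5, C, 96, 96)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))      # (B*T, feat_dim)
        out, _ = self.lstm(feats.view(b, t, -1))    # (B, T, hidden)
        return self.fc(out[:, -1])                  # predict from the last time step
```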
The technical solution of the present invention is as follows:
An expression tracking method based on a DCNNs-LSTM model, comprising the following steps:
S1, generating a training dataset: in videos containing faces, mark the 3D coordinates of the 68 facial feature points, the positions of the left and right gaze points, and the 3D head-offset information of the faces in the video; cut the video into single-frame pictures and package the annotation information frame by frame, obtaining picture sequences of faces that are continuous in time; the multiple face picture sets extracted from multiple videos form the training dataset;
S2, jointly extracting face, head, and gaze features by combining a deep convolutional neural network DCNN with an LSTM network, where the DCNN learns spatial features and the LSTM learns temporal features, specifically:
multiple consecutive frames are input simultaneously and the corresponding DCNNs extract the facial feature information of the frames in parallel; the DCNN outputs serve as the input of the LSTM network; in the LSTM layer, the outputs of multiple time steps are used for the feature values of the next time step, and the output gate determines the network output; the prediction result is 213 coordinate values, comprising the 3D coordinates of the 68 feature points, the 3D coordinates of the left and right gaze points, and the 3D head coordinates;
the DCNNs-LSTM model is trained with the training dataset obtained in step S1;
S3, using the trained DCNNs-LSTM model to extract the face, gaze, and head 3D information of the faces in a video;
S4, performing mesh reconstruction on the three-dimensional coordinates extracted by the model and generating deformation coefficients through a data translation model, which are used to control a 3D face model and realize expression tracking.
The beneficial effect of the present invention is that, unlike conventional methods, the extraction of the face, gaze, and head three-dimensional features is realized by a single network structure, and the model structure accounts for computational cost and for sequences with longer time intervals, giving the method strong practicability.
Brief description of the drawings
Fig. 1 shows 4 arbitrary faces from the dataset;
Fig. 2 is a schematic diagram of the logical structure of the LSTM neural network module;
Fig. 3 is the network framework of the invention;
Fig. 4 shows the training results;
Fig. 5 shows a single-picture test result;
Fig. 6 illustrates the face, gaze, and head tracking effect.
Specific embodiment
The technical scheme of the invention is described in detail below with reference to the accompanying drawings and embodiments.
In the present invention, the preparation and preprocessing of the dataset are the basis of model training. Since few such datasets are currently available, the invention built its own three-dimensional face dataset of tens of thousands of training images. We collected 100 videos containing faces from YouTube and domestic video websites. Because lighting differs across videos, each video is first converted to grayscale and the pixel size of each video unified; the 3D coordinates of the 68 facial feature points, the positions of the left and right gaze points, and the three-dimensional head-offset information are then annotated. Next, the videos are cut into single pictures by frame, and the three-dimensional coordinates are likewise packaged per frame. Since gaze and head motion form time series, different faces must be distinguished, so each video and its labels are named in a "letter_number" format, such as "A_0" (a sketch of this frame-slicing step is given below). This finally yields tens of thousands of face pictures of unified size with their corresponding labels, as shown in Fig. 1. To train a more robust model, the invention collected faces of different countries and age groups in different scenes, with relatively rich expressions and angles in the videos.
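A minimal sketch of this frame-slicing and naming step, assuming OpenCV for video I/O (the function and path names are our own):

```python
import os
import cv2

def slice_video(video_path: str, face_id: str, out_dir: str) -> int:
    """Cut one face video into single grayscale frames named '<face_id>_<n>.jpg',
    following the 'letter_number' convention (e.g. 'A_0', 'A_1', ...)."""
    cap = cv2.VideoCapture(video_path)
    n = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # unify lighting across videos
        cv2.imwrite(os.path.join(out_dir, f"{face_id}_{n}.jpg"), gray)
        n += 1
    cap.release()
    return n  # number of frames written
```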
The specific method of the invention is as follows:
Step 1: to enhance the robustness of the model, the dataset of the invention covers child, youth, middle-aged, and elderly faces of different nationalities; the facial angles in the images vary, and the expressions and head movements are sufficiently rich. Since the prediction of gaze and head information involves time series, the pictures and labels of different faces are distinguished by name: the files of face A are named "A_0...", those of face B are named "B_0...", and the picture count of each face is kept at 400. In total we prepared nearly thirty thousand pictures, with the training and test sets split at a ratio of 7:3; the test pictures are prepared separately and include various degrees of clarity and deviation angles.
Step 2: before training the model, the dataset is preprocessed. The pictures are first grayscaled, to prevent different faces from coming with different lighting conditions. The Haar feature detector of OpenCV then frames the face region in each picture, and this region serves as the model input, improving the accuracy of model feature extraction. To reduce the latency of picture processing, the picture size (height h, width w) is uniformly scaled to 96×96, and the corresponding x, y coordinates in the dataset are scaled proportionally while the z coordinate stays unchanged, as follows:
h_r = 224/h, w_r = 224/w (1)
new_x = x × h_r, new_y = y × w_r (2)
where h_r is the compression ratio of the picture height, w_r is the compression ratio of the picture width, new_x is the compressed x coordinate, and new_y is the compressed y coordinate.
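A preprocessing sketch under these conventions (the helper names are hypothetical; the Haar cascade file ships with OpenCV, the ratios use the 96×96 network input of Table 1 rather than the literal 224 of equation (1), the annotations are assumed to be given relative to the cropped face region, and the x↔h_r pairing follows equation (2) as filed):

```python
import cv2
import numpy as np

def crop_face(gray):
    """Frame the face region with OpenCV's Haar detector (first hit wins)."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return None                            # skip pictures without a detection
    x, y, w, h = faces[0]
    return gray[y:y + h, x:x + w]

def scale_inputs(face, pts, target=96):
    """Resize the face picture and compress the labels per eqs. (1)-(2).
    pts: (N, 3) array of (x, y, z) annotations; z is left unchanged."""
    h, w = face.shape
    h_r, w_r = target / h, target / w          # compression ratios, eq. (1)
    out = pts.astype(np.float32).copy()
    out[:, 0] *= h_r                           # new_x = x * h_r, eq. (2)
    out[:, 1] *= w_r                           # new_y = y * w_r
    return cv2.resize(face, (target, target)), out
```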
The present invention trains the model with the PyTorch framework, which provides Tensors supporting both CPU and GPU and can greatly accelerate computation. In model training, each batch contains 16 pictures and the epoch value per run is set to 20; the model is saved once every 5 epochs during training. Since the network structure has relatively few parameters, we set the weight decay value to 1e-4. To assess the gap between the feature information extracted by the trained model and the ground truth, the invention uses PyTorch's SmoothL1Loss function as the loss function of the model, defined as:
smooth_L1(x_i) = 0.5·x_i^2 if |x_i| < 1, and |x_i| − 0.5 otherwise,
that is, the error is a squared loss on (−1, 1) and an L1 loss in other cases, where the subscript i refers to the i-th element of x. We train iteratively on the dataset on an NVIDIA GeForce GPU; the result of each training run (20 epochs) is shown in Fig. 4.
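A hedged training-loop sketch with the stated settings (batch size 16, 20 epochs per run, weight decay 1e-4, SmoothL1Loss, saving every 5 epochs); the choice of the Adam optimizer and of an exponential scheduler implementing the lr = 1e-3 and 0.95 decay mentioned in the validation step below are our assumptions:

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, device="cuda", epochs=20):
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
    criterion = torch.nn.SmoothL1Loss()
    model.to(device).train()
    for epoch in range(epochs):
        for frames, targets in loader:   # frames: (B, 5, C, 96, 96); targets: (B, 213)
            optimizer.zero_grad()
            loss = criterion(model(frames.to(device)), targets.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
        if (epoch + 1) % 5 == 0:         # save every 5 epochs, as stated above
            torch.save(model.state_dict(), f"dcnn_lstm_epoch{epoch + 1}.pt")
```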
After several rounds of iterative training, the saved models are verified in turn on the validation set, with val_size set to 4. After repeated parameter tuning and comparison, the optimal model is obtained; its parameter settings are: weight decay 1e-4, lr 1e-3, and lr decay 0.95.
Step 3: after training yields the optimal model, faces under different conditions are selected for prediction, including clear frontal faces, clear profile faces (30, 60, and 90 degrees), half-clear frontal faces, and half-clear profile faces (30, 60, and 90 degrees). Suppose 5 input pictures are tested; as shown in Fig. 5, the test result is an array structure. Since the pictures are resized on input, the final output is also proportionally scaled down, so the invention enlarges the data as X = X_i/h_r, Y = Y_i/w_r for use in the subsequent mesh reconstruction.
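A small inference sketch for this step (a hypothetical helper; it assumes the model from the earlier sketch and undoes the label compression of equations (1)-(2)):

```python
import torch

@torch.no_grad()
def predict(model, frames, h_r, w_r, device="cuda"):
    """Run the trained model on 5 consecutive frames and rescale the output.
    frames: (5, C, 96, 96) tensor; returns a (71, 3) array of 3D points."""
    model.eval()
    out = model(frames.unsqueeze(0).to(device))   # (1, 213)
    pts = out.view(71, 3).cpu()
    pts[:, 0] /= h_r                              # X = X_i / h_r, as above
    pts[:, 1] /= w_r                              # Y = Y_i / w_r; Z unchanged
    return pts.numpy()
```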
Step 4: through the above steps, we can use the trained DCNN-LSTM model to extract the face (68×3), gaze (2×3), and head (1×3) information of a face in video, denoted S = (X_1, Y_1, Z_1, X_2, ..., Y_71, Z_71)^T ∈ R^(3n). To realize an engaging tracking effect, the invention maps the extracted facial feature information to an animated character. Expression mapping methods are generally divided into methods based on an intermediate model and methods based on feature-point difference vectors; the present invention uses the latter. First, the extracted feature points are rebuilt into a triangular mesh, as shown in Fig. 5, and the distances between the vertices of the triangular mesh are computed; the resulting 178 distance values serve as the input of the data translation model, a simple feedforward neural network that converts the vertex distances of the triangular mesh into deformation coefficients controlling the feature changes of the animated model. After processing by this model, the output deformation coefficients are placed into a Unity3D project prepared with the animated character and run; the tracking effect is shown in Fig. 6.
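A sketch of this conversion step (illustrative only: the patent does not list which 178 vertex pairs are measured, so the pair list is a placeholder, and the layer widths and output dimension of the feedforward translation network are assumptions):

```python
import torch
import torch.nn as nn

def pair_distances(vertices, pairs):
    """Euclidean distances between selected mesh-vertex pairs.
    vertices: (71, 3) predicted 3D points; pairs: list of 178 (i, j) index pairs."""
    v = torch.as_tensor(vertices, dtype=torch.float32)
    i, j = zip(*pairs)
    return (v[list(i)] - v[list(j)]).norm(dim=1)      # (178,)

class DataTranslationModel(nn.Module):
    """Simple feedforward net: 178 vertex distances -> deformation coefficients."""
    def __init__(self, n_coeff=52):                   # n_coeff is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(178, 128), nn.ReLU(),
            nn.Linear(128, n_coeff), nn.Sigmoid())    # coefficients in [0, 1]

    def forward(self, d):
        return self.net(d)
```

At runtime the 178 distances from pair_distances feed DataTranslationModel, and the resulting coefficients are what the Unity3D project consumes.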
The key point of the invention is to propose a network structure combining the deep convolutional network DCNN and the long short-term memory network LSTM to extract a person's face, gaze, and head three-dimensional feature information, and to convert the feature information into deformation coefficients acting on an animated character so as to realize a mapping effect.
Extraction of the face, gaze, and head three-dimensional feature information: the invention uses the high efficiency of DCNN in image feature extraction and the advantage of LSTM networks in processing information with time series, fusing the two for feature information extraction. The DCNN network here differs from a traditional CNN in that its convolutional layers are mainly built from depthwise separable convolutions, so that rich features are learned with fewer parameters and less computation. Nearly 30,000 self-made facial images and their corresponding label files train the network, and the optimal network model obtained through repeated training and parameter tuning is used for the three-dimensional feature extraction of the invention.
Realizing a tracking effect with the extracted three-dimensional features: after obtaining the three-dimensional feature information of the face, gaze, and head, the invention first performs triangular mesh reconstruction on the feature points and computes the Euclidean distances between the mesh vertices, then inputs the distance values to the data translation model to obtain deformation coefficients that can act on an animated character, and finally places the deformation coefficients into a Unity3D project to realize the tracking effect.
Unlike conventional methods, the present invention realizes the extraction of the face, gaze, and head three-dimensional features with a single network structure, and the model structure accounts for computational cost and for sequences with longer time intervals, giving it strong practicability. With the development of high technologies such as virtual reality and artificial intelligence, the demand for human tracking technology keeps rising, mainly as follows: in social networking, face tracking lets users map their own face onto another animated model and show their facial expression changes in real time without exposing a personal portrait, improving social efficiency while protecting personal privacy; in the virtual reality world, traditional interaction cannot meet the demand, interaction based on gestures, head movement, or voice control easily causes fatigue, while interaction based on eyeball tracking responds quickly, tracks accurately, and does not easily tire the user, which can promote technological breakthroughs in the field of human-computer interaction. Such tracking technology therefore has very practical research significance. The present invention uses a network structure combining the deep convolutional network DCNN and the long short-term memory network LSTM to extract the three-dimensional features of the face, gaze, and head, and processes the extracted three-dimensional feature information into deformation coefficients that control the feature points of an animated character, thereby realizing a tracking effect. The above analysis shows that the invention can be used in multiple fields and has strong commercial value.
In the prior art, the extraction of facial three-dimensional features can be classified into conventional methods and deep-learning-based methods. Facial feature point extraction in conventional methods is mainly based on cascaded regression models, which first construct an initial face and then train multiple weak regressors to progressively approximate the true face shape; once the constructed initial face deviates considerably from the real face, however, the features extracted by the subsequent operations also show large errors. In addition, gaze prediction is usually inferred jointly from head pose and eyeball localization; given the complexity of head motion, multiple cameras are generally needed for simultaneous multi-angle shooting, and the final prediction result is vulnerable to the accuracy of the initial head and eyeball positions.
Furthermore, among deep-learning-based methods, the traditional CNN shows great advantages in image feature extraction, but it has too many parameters and needs large computational overhead, accurate feature extraction is often achieved by building deeper models, and the final model is therefore large and unsuitable for mobile devices. As for the prediction of gaze and head information, the information at time t is associated with the features at time t−n, which a CNN cannot handle. Recurrent neural networks can process feature extraction with time series, but an overly long sequence causes the vanishing-gradient problem, so a traditional RNN can only handle short-term memory.
The present invention overcomes the localization error that the cascaded regression structure of conventional methods may incur when the initial face deviates too much, and the situations where gaze and head prediction requires complicated equipment and the prediction structure is influenced by other feature information; it also overcomes, in deep-learning algorithms, the problems that the traditional CNN has too many parameters, large computational cost, and a model too large for mobile devices, and that the common recurrent neural network RNN cannot predict motion information across larger time intervals. It proposes a three-dimensional feature extraction method based on DCNN and LSTM networks and realizes a tracking effect that maps the extracted feature information to an animated character. To improve the running speed of the network, the input pictures are uniformly processed to 96×96. Since few face datasets suit the invention, we built our own face 3D dataset: the label file of each face picture contains the 3D coordinates of the 68 facial feature points, the 3D coordinates of the left and right gaze points, and a 3D coordinate representing the head offset. The tracking effect realized in the invention is well suited to emerging fields such as artificial intelligence and virtual reality, and the related technologies mentioned in the invention can be expected to have good application prospects in the coming years.

Claims (3)

1. An expression tracking method based on a DCNNs-LSTM model, characterized by comprising the following steps:
S1, generating a training dataset: in videos containing faces, mark the 3D coordinates of the 68 facial feature points, the positions of the left and right gaze points, and the 3D head-offset information of the faces in the video; cut the video into single-frame pictures and package the annotation information frame by frame, obtaining picture sequences of faces that are continuous in time; the multiple face picture sets extracted from multiple videos form the training dataset;
S2, jointly extracting face, head, and gaze features by combining a deep convolutional neural network DCNN with an LSTM network, wherein the DCNN learns spatial features and the LSTM learns temporal features, specifically:
multiple consecutive frames are input simultaneously and the corresponding DCNNs extract the facial feature information of the frames in parallel; the DCNN outputs serve as the input of the LSTM network; in the LSTM layer, the outputs of multiple time steps are used for the feature values of the next time step, and the output gate determines the network output; the prediction result is 213 coordinate values, comprising the 3D coordinates of the 68 feature points, the 3D coordinates of the left and right gaze points, and the 3D head coordinates;
the DCNNs-LSTM model is trained with the training dataset obtained in step S1;
S3, using the trained DCNNs-LSTM model to extract the face, gaze, and head 3D information of the faces in a video;
S4, performing mesh reconstruction on the three-dimensional coordinates extracted by the model and generating deformation coefficients through a data translation model, which are used to control a 3D face model and realize expression tracking.
2. The expression tracking method based on a DCNNs-LSTM model according to claim 1, characterized in that the structure of the deep convolutional neural network DCNN in step S2 is: the first layer is a 3×3 standard convolution, followed by 9 depthwise separable convolutions, after which an average pooling layer normalizes the feature map to a size of 1×1×1024.
3. The expression tracking method based on a DCNNs-LSTM model according to claim 2, characterized in that in step S2, the LSTM network predicts the information at time t from the information of the preceding 5 frames.
CN201810823018.XA 2018-07-25 2018-07-25 An expression tracking method based on a DCNNs-LSTM model Pending CN109086707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810823018.XA CN109086707A (en) 2018-07-25 2018-07-25 An expression tracking method based on a DCNNs-LSTM model


Publications (1)

Publication Number Publication Date
CN109086707A (en) 2018-12-25

Family

ID=64838355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810823018.XA Pending CN109086707A (en) An expression tracking method based on a DCNNs-LSTM model

Country Status (1)

Country Link
CN (1) CN109086707A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650777A (en) * 2009-09-07 2010-02-17 东南大学 Three-dimensional face recognition method based on dense point correspondence
CN106919903A (en) * 2017-01-19 2017-07-04 中国科学院软件研究所 A robust deep-learning-based continuous emotion tracking method
CN107423707A (en) * 2017-07-25 2017-12-01 深圳帕罗人工智能科技有限公司 A facial emotion recognition method for complex environments

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHANGWEI LUO et al., "Video Based Face Tracking and Animation", Proceedings of IEEE Conference on Computer Vision & Pattern Recognition *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363081A (en) * 2019-06-05 2019-10-22 深圳云天励飞技术有限公司 Face identification method, device, equipment and computer readable storage medium
CN110363081B (en) * 2019-06-05 2022-01-11 深圳云天励飞技术有限公司 Face recognition method, device, equipment and computer readable storage medium
CN110705413A (en) * 2019-09-24 2020-01-17 清华大学 Emotion prediction method and system based on sight direction and LSTM neural network
CN110705413B (en) * 2019-09-24 2022-09-20 清华大学 Emotion prediction method and system based on sight direction and LSTM neural network
CN111696178A (en) * 2020-05-06 2020-09-22 广东康云科技有限公司 Method, device and medium for generating portrait three-dimensional model and simulated portrait animation
CN111696181A (en) * 2020-05-06 2020-09-22 广东康云科技有限公司 Method, device and storage medium for generating super meta model and virtual dummy


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2018-12-25)