CN112766145A - Method and device for identifying dynamic facial expressions of artificial neural network - Google Patents

Method and device for identifying dynamic facial expressions of artificial neural network

Info

Publication number
CN112766145A
Authority
CN
China
Prior art keywords
dynamic expression
person
data
image data
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110057226.5A
Other languages
Chinese (zh)
Other versions
CN112766145B (en)
Inventor
彭保
姚智
段迟
高洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Information Technology
Original Assignee
Shenzhen Institute of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Information Technology filed Critical Shenzhen Institute of Information Technology
Priority to CN202110057226.5A priority Critical patent/CN112766145B/en
Publication of CN112766145A publication Critical patent/CN112766145A/en
Application granted granted Critical
Publication of CN112766145B publication Critical patent/CN112766145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

The application provides a method and a device for recognizing dynamic facial expressions based on an artificial neural network. The method is applied to predicting the identity and/or expression-intensity category of a person's expression in a video captured in a scene with fixed personnel. The method comprises the following steps: acquiring original video data of a person to be tested and determining image data of a preset number of frames in the original video data; generating a feature group from the image data of the preset number of frames; establishing, by means of the self-learning capability of artificial intelligence, a correspondence between the feature group of the person to be tested and the dynamic expression classification of the person to be tested; acquiring the current feature group of the current person to be tested; and determining, according to the correspondence, the current dynamic expression classification corresponding to the current feature group. By analyzing a facial dynamic expression video taken in a fixed scene, the identity and expression-intensity category of the expression are predicted, so the method is mainly applicable to scenes with fixed personnel such as offices, factories and classrooms.

Description

Method and device for identifying dynamic facial expressions of artificial neural network
Technical Field
The application relates to the field of facial expression detection, and in particular to a method and a device for recognizing dynamic facial expressions with an artificial neural network.
Background
With the development of science and technology, face recognition is applied ever more widely, from face-scanning clock-in at work to the popularization of face-scanning payment. Face recognition has long been a research hotspot in computer vision and pattern recognition, and besides identity verification it is widely used in video surveillance and network retrieval. With the development of deep learning and the improvement of computing performance, face recognition technology has made great progress in recent research, achieving reasonably good recognition rates on multiple data sets. How to make use of the face data obtained by recognition has therefore become a major issue in recent years.
Expression analyzers applied to interrogation estimate the facial expressions of the person being questioned by analyzing facial features, so as to infer the credibility of their statements. Real-time monitoring of workers such as drivers and operators through video surveillance can reveal mental states such as fatigue and stress, and early warning can help avoid accidents. In many public places, the expressions of people can be observed through surveillance systems; by analyzing whether someone appears flustered or shows abnormal expressions, they can be checked and investigated in advance, which to a certain extent can prevent activities that endanger public order. Expression recognition technology therefore plays an increasingly important role in industrial work, daily entertainment, human-computer interaction and other fields.
Up to now, fast localization of faces and effective recognition of expressions have been in a golden period of research and application. The research spans multiple fields and disciplines such as deep learning, machine learning, psychological analysis and physiological analysis. The way a facial expression is displayed differs with each person's psychology and emotion, and depends on complex influences such as personality and facial structure. Dynamic expressions, which analyze how facial feature points change over time, are more universally applicable than a single expression image.
At present, research on facial expressions is extensive and of high quality, but because each person's facial structure and psychological response to a single expression image differ, the use of a single static expression in practical applications is rather limited.
Disclosure of Invention
In view of the above problems, the present application proposes a method and an apparatus for recognizing dynamic facial expressions with an artificial neural network that overcome the above problems or at least partially solve them, comprising:
a facial dynamic expression recognition method based on an artificial neural network, applied to predicting the identity and/or expression-intensity category of a person's expression in a video captured in a scene with fixed personnel;
the method comprises the following steps:
acquiring original video data of a person to be detected, and determining image data of a preset frame number in the original video data;
generating a characteristic group according to the image data of the preset frame number;
establishing a corresponding relation between the characteristic group of the person to be tested and the dynamic expression classification of the person to be tested by utilizing the artificial intelligent self-learning capability;
acquiring a current characteristic group of a current person to be tested;
determining the current dynamic expression classification corresponding to the current feature group according to the correspondence; specifically, the dynamic expression classification associated in the correspondence with the feature group identical to the current feature group is determined as the current dynamic expression classification.
Further, the step of acquiring original video data of a person to be tested and determining image data of a preset frame number in the original video data includes:
acquiring the video frame rate and the video duration of the original video data;
and determining image data of a preset frame number in the original video data according to the video frame rate and the video duration.
Further, the step of determining image data with a preset frame number in the original video data according to the video frame rate and the video duration includes:
averagely dividing the original video data into a preset number of video segments according to the video frame rate and the video duration;
and extracting image data with the same frame number position from each video segment as the image data of the preset frame number.
Further, the step of generating a feature group according to the image data of the preset number of frames includes:
generating an enhanced image group according to a background region and a non-background region in the gray image data group;
generating an optical flow motion information image group containing motion information of the face of the person to be detected in the X-axis direction and the Y-axis direction according to the gray image data corresponding to the adjacent video segments;
generating a gradient output image group containing edges in 4 directions of the image data with the preset frame number according to the image data with the preset frame number;
and generating the feature set according to the enhanced image set, the optical flow motion information image set and the gradient output image set.
Further, the step of establishing a corresponding relationship between the feature group corresponding to the person to be tested and the dynamic expression classification of the person to be tested includes:
acquiring sample data for establishing a corresponding relation between the feature group and the dynamic expression classification;
analyzing the characteristics and the rules of the characteristic group, and determining the network structure and the network parameters of the artificial neural network according to the characteristics and the rules;
and training and testing the network structure and the network parameters by using the sample data, and determining the corresponding relation between the feature group and the dynamic expression classification.
Further, the step of obtaining sample data for establishing a correspondence between the feature group and the dynamic expression classification includes:
collecting the feature sets and the dynamic expression classifications of different samples;
analyzing the feature group, and selecting data related to the dynamic expression classification as the feature group by combining with prestored expert experience information;
and taking the data pair formed by the dynamic expression classification and the selected feature group as sample data.
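Purely as an illustrative sketch (not part of the original disclosure), the following Python fragment shows one possible way to pair selected feature groups with their dynamic expression classifications as sample data; the record layout and the name relevant_keys are assumptions of this sketch, with relevant_keys standing in for the expert-experience-based selection.

    def build_sample_data(raw_records, relevant_keys):
        # Pair each selected feature group with its dynamic expression classification.
        samples = []
        for record in raw_records:
            feature_group = {k: record["features"][k] for k in relevant_keys}
            samples.append((feature_group, record["classification"]))
        return samples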
Further,
training the network structure and the network parameters, including:
selecting a part of data in the sample data as a training sample, inputting the feature group in the training sample into the network structure, and training through a loss function of the network structure, an activation function and the network parameters to obtain an actual training result;
determining whether an actual training error between the actual training result and a corresponding dynamic expression classification in the training sample meets a preset training error;
determining that the training of the network structure and the network parameters is completed when the actual training error meets the preset training error;
and/or,
testing the network structure and the network parameters, comprising:
selecting another part of data in the sample data as a test sample, inputting the feature group in the test sample into the trained network structure, and testing by using the loss function, the activation function and the trained network parameters to obtain an actual test result;
determining whether an actual test error between the actual test result and the corresponding dynamic expression classification in the test sample meets a set test error;
and when the actual test error meets the set test error, determining that the test on the network structure and the network parameters is finished.
Further,
training the network structure and the network parameters, further comprising:
when the actual training error does not meet the set training error, updating the network parameters through an error loss function of the network structure;
retraining through the loss function of the network structure, the activation function and the updated network parameters until the retrained actual training error meets the set training error;
and/or,
testing the network structure and the network parameters, further comprising:
and when the actual test error does not meet the set test error, retraining the network structure and the network parameters until the retrained actual test error meets the set test error.
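The training and testing procedure above can be summarized, purely as an illustrative sketch that is not part of the original disclosure, by the following Python driver; the callables train_fn and eval_fn, the error thresholds and max_rounds are all placeholders assumed for this illustration.

    def fit_until_threshold(net, train_fn, eval_fn, train_data, test_data,
                            train_err_target=0.05, test_err_target=0.10, max_rounds=100):
        # Keep training while the actual training error exceeds the preset training error;
        # once training passes, test, and retrain if the actual test error fails its threshold.
        for _ in range(max_rounds):
            train_err = train_fn(net, train_data)   # one round of training, returning its error
            if train_err > train_err_target:
                continue                            # parameters are updated inside train_fn via the loss function
            test_err = eval_fn(net, test_data)      # evaluate on the held-out test samples
            if test_err <= test_err_target:
                return net                          # both thresholds met: training and testing completed
        return net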
A human face dynamic expression recognition device of an artificial neural network is applied to identity and/or expression intensity category prediction of human expressions in videos of human fixed scenes;
the method specifically comprises the following steps:
a preset-frame-number image data generation module, configured to acquire original video data of a person to be tested and determine image data of a preset number of frames in the original video data;
the characteristic group generating module is used for generating a characteristic group according to the image data of the preset frame number;
the corresponding relation establishing module is used for establishing a corresponding relation between the characteristic group of the person to be tested and the dynamic expression classification of the person to be tested by utilizing the artificial intelligence self-learning capability;
the current characteristic group acquisition module is used for acquiring a current characteristic group of a current person to be detected;
a current dynamic expression classification determining module, configured to determine, according to the correspondence, the current dynamic expression classification corresponding to the current feature group; specifically, the dynamic expression classification associated in the correspondence with the feature group identical to the current feature group is determined as the current dynamic expression classification.
An apparatus comprising a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program when executed by the processor implementing the steps of the method for dynamic facial expression recognition by an artificial neural network as described above.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for dynamic facial expression recognition of an artificial neural network as described above.
The application has the following advantages:
in the embodiment of the application, original video data of a person to be tested is acquired, and image data of a preset number of frames in the original video data is determined; a feature group is generated from the image data of the preset number of frames; a correspondence between the feature group of the person to be tested and the dynamic expression classification of the person to be tested is established by means of the self-learning capability of artificial intelligence; the current feature group of the current person to be tested is acquired; and the current dynamic expression classification corresponding to the current feature group is determined according to the correspondence, namely the dynamic expression classification associated in the correspondence with the feature group identical to the current feature group is determined as the current dynamic expression classification. By analyzing a facial dynamic expression video taken in a fixed scene, the identity and expression-intensity category of the expression are predicted, so the method is mainly applicable to scenes with fixed personnel such as offices, factories and classrooms; the analysis and learning of the input dynamic expression features are completed through prior knowledge and the adaptive characteristics of deep learning; and when the personnel at the application site are fixed, the method allows accurate statistics on and management of their emotional state, so that abnormal situations can be handled in time.
Drawings
Fig. 1 is a flowchart illustrating steps of a method for identifying dynamic facial expressions of an artificial neural network according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a normalized segmentation principle of a method for identifying a dynamic facial expression of an artificial neural network according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a normalized segmentation method of a method for identifying a dynamic facial expression of an artificial neural network according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating a convolution principle of an image by a conventional 2DCNN of a method for identifying a dynamic facial expression of an artificial neural network according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a convolution principle of an image by 3DCNN according to a method for identifying a dynamic facial expression of an artificial neural network according to an embodiment of the present application;
fig. 6 is a schematic diagram of an overall structure of a TP-3DCNN network of a method for identifying dynamic facial expressions of an artificial neural network according to an embodiment of the present application;
fig. 7 is a block diagram illustrating a structure of a device for recognizing dynamic facial expressions of an artificial neural network according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a method for identifying a dynamic facial expression based on an artificial neural network, which is applied to identity and/or expression intensity category prediction of a human expression in a video of a human fixed scene, is shown in an embodiment of the present application;
the method comprises the following steps:
s110, acquiring original video data of a person to be tested, and determining image data of a preset frame number in the original video data;
s120, generating a feature group according to the image data of the preset frame number;
s130, establishing a corresponding relation between the feature group of the person to be tested and the dynamic expression classification of the person to be tested by utilizing the artificial intelligent self-learning capability;
s140, acquiring a current feature group of a current person to be tested;
s150, determining the current dynamic expression classification corresponding to the current feature group according to the corresponding relation; specifically, determining the current dynamic expression classification corresponding to the current feature group includes: and classifying the dynamic expression corresponding to the feature group which is the same as the current feature group in the corresponding relation, and determining the dynamic expression as the current dynamic expression classification.
In the embodiment of the application, original video data of a person to be tested is acquired, and image data of a preset number of frames in the original video data is determined; a feature group is generated from the image data of the preset number of frames; a correspondence between the feature group of the person to be tested and the dynamic expression classification of the person to be tested is established by means of the self-learning capability of artificial intelligence; the current feature group of the current person to be tested is acquired; and the current dynamic expression classification corresponding to the current feature group is determined according to the correspondence, namely the dynamic expression classification associated in the correspondence with the feature group identical to the current feature group is determined as the current dynamic expression classification. By analyzing a facial dynamic expression video taken in a fixed scene, the identity and expression-intensity category of the expression are predicted, so the method is mainly applicable to scenes with fixed personnel such as offices, factories and classrooms; the analysis and learning of the input dynamic expression features are completed through prior knowledge and the adaptive characteristics of deep learning; and when the personnel at the application site are fixed, the method allows accurate statistics on and management of their emotional state, so that abnormal situations can be handled in time.
Next, a face dynamic expression recognition method of the artificial neural network in the present exemplary embodiment will be further described.
In step S110, original video data of a person to be tested is obtained, and image data of a preset number of frames in the original video data is determined.
It should be noted that the original video data may be a dynamic expression video in an application scene, such as monitoring, video recording, and the like, obtained through a stable camera.
In an embodiment of the present invention, a specific process of "acquiring the original video data of the person to be tested and determining the image data of the preset frame number in the original video data" in step S110 may be further described with reference to the following description.
The following steps are described: acquiring the video frame rate and the video duration of the original video data;
the following steps are described: and determining image data of a preset frame number in the original video data according to the video frame rate and the video duration.
In an advanced embodiment of the present invention, a specific process of "determining image data of a preset frame number in the original video data according to the video frame rate and the video duration" may be further described with reference to the following description.
The following steps are described: averagely dividing the original video data into a preset number of video segments according to the video frame rate and the video duration;
the following steps are described: and extracting image data with the same frame number position from each video segment as the image data of the preset frame number.
As an example, the image data of the preset number of frames may be the pixel information of 7 frames extracted evenly from the original video data according to the relationship between the video frame rate and the video duration; in addition to the pixel information of these 7 frames, the identity tag and the dynamic-expression-intensity category tag corresponding to the two frame image groups used as input are also obtained and temporarily stored.
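As a purely illustrative sketch (not part of the original disclosure), the following Python code extracts one frame from the same position of each of 7 equal video segments; the use of OpenCV and the choice of the first frame of each segment are assumptions of this sketch.

    import cv2

    def sample_frames(video_path, num_frames=7):
        # Divide the video evenly into num_frames segments and take the frame at the
        # same relative position (here: the first frame) of every segment.
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # frame rate x duration
        segment_len = total // num_frames
        frames = []
        for i in range(num_frames):
            cap.set(cv2.CAP_PROP_POS_FRAMES, i * segment_len)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames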
Generating a feature group according to the image data of the preset frame number as stated in the step S120;
in an embodiment of the present invention, the specific process of "generating the feature set according to the image data of the preset frame number" in step S120 can be further described with reference to the following description.
The following steps are described: generating an enhanced image group according to a background region and a non-background region in the gray image data group;
as an example, the enhanced image group may be obtained by segmenting the background of each image in the preset-frame-number image group with a normalized segmentation method, setting the pixel values of the background to 0 and leaving the remaining pixel values unchanged. The principle of normalized segmentation (normalized cut) is as follows:
the frame image shot by the camera can be regarded as a graph formed by a large number of pixel points after graying, so the idea of realizing image segmentation is to calculate a weight graph (weighted graph) among the pixel points and segment the picture into regions with the same characteristics (texture, color, brightness and the like) through the connected compactness.
Before describing the normalized segmentation procedure, some basic concepts are needed. A graph is represented by the relationship between its edges and vertices, defined as:
G=(V,E)
where V denotes the vertices and E the edges connecting points of the picture; the information in a picture can be fully reflected through G. The concept of weight is further introduced, that is, the edge connecting two points carries a weight value representing how closely the two points are related, as shown in fig. 2. By inspection, the graph can be divided into 2 regions: from the perspective of the weights reflecting connection tightness, two values clearly different from the other weights appear between the left and right regions, namely the 2 edges with weight 0.1; the resulting region A consists of 4 vertices and region B of 5 vertices.
The following defines the weight similarity between points as follows:
$w_{ij} = e^{-\frac{dist(i, j)^{2}}{\sigma^{2}}}$
where σ is the variance of a Gaussian convolution kernel and dist is a difference measure between two pixel points; σ mainly controls how strongly dist affects w and is tuned empirically during algorithm design, while dist is computed in this patent from the (R, G, B, X, Y) vector of each pixel point.
As shown in fig. 3, the dotted line is a desirable segmentation, yet it is obvious from fig. 3 that two edge points would be cut off, because an edge point has the smallest connection weights to the other points. Considering only the smallest weight when cutting is therefore not sufficient, and the normalized cut criterion is proposed, with the following discriminant:
$Ncut(A, B) = \frac{cut(A, B)}{w(A, V)} + \frac{cut(A, B)}{w(B, V)}$
On the basis of the originally computed cut weight, the total weight between each region and all other points is introduced. Taking region A as an example, when a point is too far from the other points so that w(A, V) becomes too small, the discriminant increases; combining the analogous effect for region B, it can be seen that a point at an edge position no longer yields the lowest value of the criterion, so it is not split off and the final region division is not distorted. To find the extreme value of this formula, its matrix expression on the image is obtained through mathematical derivation:
$\min_{y} Ncut = \min_{y}\frac{y^{T}(D - W)\,y}{y^{T} D\, y}$
where W is the similarity matrix defined above, D is a diagonal matrix whose diagonal entries are the corresponding row sums of W, and y is a category-discrimination vector of the form [x1, x2, x3, ...] whose dimension equals the number of pixels. The quantity to be solved is the vector y at the extreme value of this formula, which partitions the elements of the picture into regions. Applying Lagrangian analysis, the objective function can be converted into:
$y^{T}(D - W)\,y - \lambda\left(y^{T} D\, y - 1\right)$
from which the relation for y is derived:
(D-W)y=λDy
where y is the eigenvector and λ the eigenvalue; the required solution is the eigenvector corresponding to the second smallest eigenvalue, since the solution for the smallest eigenvalue, 0, is not needed. After the eigenvalue is computed, substituting it back into the original expression gives the eigenvector, and each pixel is assigned a category according to the values within the eigenvector, completing the normalized segmentation.
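For illustration only (not part of the original disclosure), the following Python sketch solves the generalized eigenproblem (D - W)y = λDy for a small set of pixels described by (R, G, B, X, Y) vectors, using NumPy and SciPy; the dense similarity matrix limits it to very small images, and the value of σ is an assumption.

    import numpy as np
    from scipy.linalg import eigh

    def ncut_bipartition(features, sigma=10.0):
        # features: (n_pixels, 5) array of (R, G, B, X, Y) vectors; returns a boolean region label.
        diff = features[:, None, :] - features[None, :, :]
        dist2 = np.sum(diff ** 2, axis=-1)
        W = np.exp(-dist2 / sigma ** 2)        # similarity w(i, j)
        D = np.diag(W.sum(axis=1))             # diagonal matrix of row sums of W
        vals, vecs = eigh(D - W, D)            # generalized eigenproblem (D - W) y = lambda D y
        y = vecs[:, 1]                         # eigenvector of the second smallest eigenvalue
        return y > 0                           # the sign of y assigns each pixel to one of two regions

Pixels assigned to the background region this way would then have their values set to 0 to form the enhanced image group.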
The following steps are described: generating an optical flow motion information image group containing motion information of the face of the person to be detected in the X-axis direction and the Y-axis direction according to the gray image data corresponding to the adjacent video segments;
it should be noted that the optical flow is used to describe the instantaneous speed of the pixel motion of a spatially moving object on the observation imaging plane, and is a method for finding the correspondence between the previous frame and the current frame by using the change of the pixels in the image sequence in the time domain and the correlation between the adjacent frames, so as to calculate the motion information of the object between the adjacent frames.
The optical flow method for describing features refers to a method for calculating motion information of an object between adjacent frames by finding a corresponding relation existing between a previous frame and a current frame by using changes of pixels in an image sequence in a time domain and correlation between the adjacent frames.
The method mainly comprises two descriptions, namely a sparse optical flow method and a dense optical flow method, and the difference of the two descriptions lies in a selection principle of characteristic points used for research when the image motion information is described.
Regarding feature-point selection, traditional Harris corner points are not invariant to illumination, scale and the like and are only extracted well at edge changes, while SIFT feature points, although better in these respects, tend to be pixel-value mutation points inside the target window: very few such points are extracted on a face, and unnecessary points such as glasses may be extracted. The method in this patent therefore uses the regression-tree model in the Dlib library to extract 68 facial expression feature points as the feature points required by the sparse optical flow method.
The sparse optical flow method LK is used under the following three assumption conditions:
constant brightness: pixels of a target image in a scene appear unchanged from frame to frame movement. For grayscale images (as well as for color images) this means that the grayscale values of the pixels do not change as the frame is tracked.
Time duration (micro movement): the movement of the camera on the image varies slowly with time. In practice this means that temporal variations do not cause a drastic change in the position of the pixel, so that the grey value of the pixel can be used to derive the corresponding partial derivative of the position.
Spatial consistency: adjacent points of the same surface in the scene have similar motion and are projected at a relatively close distance onto the image plane.
For a two-dimensional image, let the pixel value at position (x, y) at time t be I(x, y, t). After a time δt the position has changed, namely:
I(x,y,t)=I(x+δx,y+δy,t+δt)
Assuming that the camera capture frequency is high and the motion between adjacent frame images is small enough, the first-order Taylor expansion of the right-hand side about (x, y, t) is:
$I(x + \delta x,\ y + \delta y,\ t + \delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t + R(x, y, t)$
where R(x, y, t) is the higher-order remainder of the Taylor expansion, which is approximately 0. Combining the two equations gives:
$\frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t = 0$
the equivalence is as follows:
$\frac{\partial I}{\partial x}\frac{\delta x}{\delta t} + \frac{\partial I}{\partial y}\frac{\delta y}{\delta t} + \frac{\partial I}{\partial t} = 0$
where $u = \frac{\delta x}{\delta t}$ and $v = \frac{\delta y}{\delta t}$ denote the velocities of the pixel along the x and y directions. The above formula can therefore be abbreviated as:
$I_{x} u + I_{y} v + I_{t} = 0$
since the above equations have two unknowns u and v, they cannot be solved, and then, based on a third assumption, some other equations are obtained to be solved simultaneously. It can be assumed that within a window of size m, the optical flow of the image is a constant value. Then the following system of equations can be obtained:
$\begin{cases} I_{x_{1}} u + I_{y_{1}} v = -I_{t_{1}} \\ \quad\vdots \\ I_{x_{m^{2}}} u + I_{y_{m^{2}}} v = -I_{t_{m^{2}}} \end{cases}$
to solve the above over-constrained system, a least squares method may be used, and the above equation is expressed in a matrix form as:
$\begin{bmatrix} I_{x_{1}} & I_{y_{1}} \\ \vdots & \vdots \\ I_{x_{m^{2}}} & I_{y_{m^{2}}} \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} I_{t_{1}} \\ \vdots \\ I_{t_{m^{2}}} \end{bmatrix}$
record as
$A\,\vec{u} = -b$, with $\vec{u} = \begin{bmatrix} u & v \end{bmatrix}^{T}$
Obtaining by using a least square method:
$A^{T}A\,\vec{u} = A^{T}(-b)$
then finally all solved optical flows (velocity vectors) are
$\vec{u} = \begin{bmatrix} u \\ v \end{bmatrix} = \left(A^{T}A\right)^{-1} A^{T}(-b)$
The $\begin{bmatrix} u & v \end{bmatrix}^{T}$ obtained in this way is the optical flow computed by the LK algorithm.
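As an illustrative sketch only (not part of the original disclosure), the following Python code tracks 68 Dlib facial landmarks between two grayscale frames with the sparse LK method; OpenCV, the 68-landmark model file name and the window parameters are assumptions of this sketch.

    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    # the 68-landmark model path below is a placeholder assumption
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def landmark_flow(prev_gray, next_gray):
        # Detect the face, take its 68 landmarks as the sparse feature points,
        # and track them into the next frame; the returned columns are u (x motion) and v (y motion).
        faces = detector(prev_gray)
        if not faces:
            return None
        shape = predictor(prev_gray, faces[0])
        pts = np.array([[p.x, p.y] for p in shape.parts()],
                       dtype=np.float32).reshape(-1, 1, 2)
        next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None,
                                                       winSize=(15, 15), maxLevel=2)
        return (next_pts - pts).reshape(-1, 2)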
The following steps are described: generating a gradient output image group containing edges in 4 directions of the image data with the preset frame number according to the image data with the preset frame number;
the Gabor filter is a filter that performs Gabor conversion on a signal. The idea source of the transformation is to improve the traditional fourier transformation, and extract the frequency information of the time series by introducing a window function capable of extracting local time information, and the fourier transformation improved by the windowing way is also called Gabor transformation.
The one-dimensional Gabor transform works by dividing the signal into many small time intervals and analyzing each interval to determine the frequencies present in it. To extract these small intervals, a sliding window is applied to the signal, and by sliding the window a Fourier transform can be carried out on different time segments of the signal. Let f(t) be the original signal with $f \in L^{2}(\mathbb{R})$; the Gabor transform can then be defined as:
$G_{f}(a, b, \omega) = \int_{-\infty}^{\infty} f(t)\, g_{a}(t - b)\, e^{-j\omega t}\, dt$
where $g_{a}(t - b)$ is the sliding window function applied to the signal and the parameter b translates the window; integrating the transform over b gives the following result:
$\int_{-\infty}^{\infty} G_{f}(a, b, \omega)\, db = \hat{f}(\omega)\int_{-\infty}^{\infty} g_{a}(t)\, dt$
The function g(t) is usually chosen to be a Gaussian: firstly, the Fourier transform of a Gaussian is still a Gaussian, which makes it possible to apply the same windowing approach when performing the inverse Fourier transform in the frequency domain; secondly, its bell shape is well suited to analysing local signals.
Therefore, when analyzing the Gabor transform of the signal, the next step is to define the kernel function, which is defined as follows:
$g(t) = \frac{1}{2\sqrt{\pi a}}\, e^{-\frac{t^{2}}{4a}}$
Its dual function γ(t) is determined by the biorthogonality condition:
$\int_{-\infty}^{\infty} g(t - mT)\, e^{\,jn\Omega t}\, \gamma^{*}(t)\, dt = \delta_{m}\,\delta_{n}$
the discrete Gabor transform can be written as:
$f(t) = \sum_{m=-\infty}^{\infty}\sum_{n=-\infty}^{\infty} c_{mn}\, g(t - mT)\, e^{\,jn\Omega t}$
$c_{mn} = \int_{-\infty}^{\infty} f(t)\, \gamma^{*}(t - mT)\, e^{-jn\Omega t}\, dt$
Once the dual function of g(t) has been obtained, the computation is simplified and the Gabor transform can be evaluated.
The one-dimensional Gabor transform can be extended to the two-dimensional spatial domain; experiments show that the frequency and orientation responses of Gabor filters are close to those of the human visual system, so image texture is represented well. In the two-dimensional spatial domain the transform is used to extract edge features of the image in different directions. Each Gabor filter is the product of a sinusoidal plane wave and a Gaussian kernel, so the filters are self-similar: all of them can be generated from one mother wavelet by dilation and rotation, and Gabor filters can also detect different scales.
g(x,y)=s(x,y)w(x,y)
$s(x, y) = e^{\,j\left(2\pi\left(u_{0}x + v_{0}y\right) + \varphi\right)}$
$w(x, y) = K\, e^{-\pi\left(a^{2}(x - x_{0})_{\theta}^{2} + b^{2}(y - y_{0})_{\theta}^{2}\right)}$
where $x_{0}$ and $y_{0}$ are the centre of the Gabor transform in the image, the subscript θ indicates that $x - x_{0}$ and $y - y_{0}$ are rotated by the detected angle θ through a polar-coordinate transformation, and K is the scale factor of the Gaussian envelope.
the filter window function can be set through K and theta, and the extraction of the texture features of the image can be completed through sliding.
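For illustration only (not part of the original disclosure), the following Python sketch filters a grayscale frame with Gabor kernels at four orientations to obtain the 4-direction gradient images; OpenCV and all parameter values are assumptions of this sketch.

    import cv2
    import numpy as np

    def gabor_edges(gray, ksize=15, sigma=4.0, lambd=10.0, gamma=0.5):
        # Apply Gabor kernels at 0, 45, 90 and 135 degrees and stack the responses.
        outputs = []
        for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
            kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma, 0)
            outputs.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
        return np.stack(outputs)               # shape (4, H, W): one edge image per direction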
The following steps are described: and generating the feature set according to the enhanced image set, the optical flow motion information image set and the gradient output image set.
In step S130, a correspondence between the feature group corresponding to the person to be tested and the dynamic expression classification of the person to be tested is established by utilizing the artificial intelligence self-learning capability; the dynamic expression classification here covers the identity and the dynamic-expression-intensity category of the person to be tested.
For example: and analyzing the appearance state rules of the personnel to be tested corresponding to different dynamic expression classifications by utilizing an artificial neural network algorithm, and finding out the mapping rule between the characteristic group corresponding to the personnel to be tested and the dynamic expression classification of the personnel to be tested through the self-learning and self-adaptive characteristics of the artificial neural network.
For example: the method can utilize an artificial neural network algorithm to collect a large number of feature groups corresponding to the staff to be tested under different conditions (including but not limited to one or more of gender, skin color, age and the like), select the feature groups and dynamic expression classifications corresponding to the staff to be tested under a plurality of conditions as sample data, learn and train the neural network, and fit the relationship between the feature groups and the dynamic expression classifications corresponding to the staff to be tested by adjusting the weight between the network structure and the network nodes, so that the neural network can accurately fit the corresponding relationship between the feature groups and the dynamic expression classifications corresponding to the staff to be tested under different conditions.
In an embodiment, the correspondence includes: and (4) functional relation.
Preferably, the feature set is an input parameter of the functional relationship, and the dynamic expression is classified as an output parameter of the functional relationship;
determining a current dynamic expression classification corresponding to the current feature set, further comprising:
and when the corresponding relation comprises a functional relation, inputting the current feature group into the functional relation, and determining the output parameter of the functional relation as the current dynamic expression classification.
Therefore, the flexibility and convenience of determining the current feature group can be improved through the corresponding relations in various forms.
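Purely as an illustrative sketch (not part of the original disclosure), the following Python fragment shows the two forms of the correspondence mentioned above: a functional relation (for example a trained network) called on the current feature group, or a stored table of feature-group/classification pairs searched for an identical feature group; the names are placeholders.

    def classify_current(correspondence, current_feature_group):
        # If the correspondence is a functional relation, its output parameter is
        # the current dynamic expression classification; otherwise look up the
        # stored feature group identical to the current one.
        if callable(correspondence):
            return correspondence(current_feature_group)
        for feature_group, classification in correspondence:
            if feature_group == current_feature_group:
                return classification
        return None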
In the deep learning field, a 2DCNN (2-Dimensional Convolutional Neural Network) operates on a video by convolving each frame separately with two-dimensional kernels to extract features; this kind of convolution does not take the inter-frame motion information along the time dimension into account. The way a conventional 2DCNN convolves an image is shown in fig. 4.
In the operation shown in fig. 4, the image is convolved with a convolution kernel of fixed template size that slides smoothly over the image. Taking a 3 × 3 kernel as an example, image features are extracted through the sliding template; layer by layer, feature maps at different levels of detail and nonlinear units describe the nonlinear relationship between the image and its label, and the network parameters are continuously optimized by back-propagating a loss function between the predicted output and the label value, so that the optimal convolution-kernel parameters for feature extraction are obtained and training is completed. A classical convolutional neural network, by passing information through simulated neurons, can construct the nonlinear relationship between an image that cannot be described in words and its output; however, the convolution in a 2DCNN is performed on a single image and cannot describe the relationship between images, so problems such as video classification and regression are not handled well. The 3DCNN (3-Dimensional Convolutional Neural Network) was introduced to solve this problem.
The 3DCNN mainly solves the time feature extraction problem which cannot be solved by the 2 DCNN. The principle is that a plurality of pictures are convoluted simultaneously through convolution kernel, the space-time characteristics of pixel value change between continuous frame images in a video can be learned through training optimization, and the convolution principle is shown in figure 5.
When processing several input frame images, a 3DCNN regards a stack of consecutive frames as a cube and convolves inside this cube with a convolution template, so the kernel moves in three directions during convolution: the spatio-temporal direction across frames, the x direction of each frame and the y direction of each frame. From the structure it can be seen that each feature map in a convolutional layer is connected to several adjacent consecutive frames of the previous layer, so spatio-temporal motion information in the images can be extracted. As the schematic shows, the value at each position of the convolved image is obtained by convolving local pixel information at the same position of three consecutive frames of the previous layer. Another important property of 3DCNN is that, when extracting spatio-temporal features between frames, only one kind of feature can be extracted from the cube at a time: as shown, the same convolution kernel is used while the selected cube (3 × h × w) moves through the whole larger cube (4 × h × w), because the weights are shared, so one convolution kernel learns only one feature of the whole cube; by arranging several sets of convolution kernels, more features can be learned from a large number of inputs.
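The difference between 2D and 3D convolution can be illustrated with the following short PyTorch sketch (PyTorch and the tensor sizes are assumptions of this illustration, not part of the original disclosure): the 2D kernel sees one frame at a time, while the 3D kernel also slides across adjacent frames.

    import torch
    import torch.nn as nn

    clip = torch.randn(1, 1, 7, 112, 112)            # (batch, channels, frames, height, width)

    conv2d = nn.Conv2d(1, 3, kernel_size=5)          # convolves a single frame
    conv3d = nn.Conv3d(1, 3, kernel_size=(3, 5, 5))  # convolves across time as well

    per_frame = conv2d(clip[:, :, 0])                # (1, 3, 108, 108): no inter-frame information
    spatio_temporal = conv3d(clip)                   # (1, 3, 5, 108, 108): mixes adjacent frames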
In the field of dynamic expression classification, a 3DCNN classifies the features of an input video by analyzing the changes between its frame images. However, when a face video is fed into the network, one wants not only to judge the current emotion but also to determine the identity in the dynamic video, because the recognized expression has wider application only if the identity of the input video is preserved. This patent therefore proposes TP-3DCNN, a two-path three-dimensional convolutional neural network that outputs identity and dynamic expression category simultaneously.
The principle of TP-3DCNN is that two videos are input into the 3DCNN network at the same time; the two channels are trained in parallel, sharing the network parameters θ of the convolutional neural network, which include the weights, biases and convolution template values. Meanwhile, the loss feedback in the network is not the conventional single loss function: the whole network has four loss functions, which respectively serve identity comparison, dynamic expression comparison, identity judgement and dynamic expression judgement. The overall network structure is shown in fig. 6.
As described for the above network structure, two labelled pieces of dynamic expression data are fed into the network simultaneously, and each training step back-propagates and optimizes the four loss functions. With a large number of training videos an ideal network model is obtained; for subsequent testing only one 3DCNN network is needed, and the final two classification results are output directly through the last fully connected layer, as shown in the figure.
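The parameter sharing between the two paths can be sketched as follows (an illustrative PyTorch fragment, not part of the original disclosure; the class name is a placeholder and the shared unit is assumed to return an identity output and an expression output): one and the same 3DCNN unit, i.e. one set of weights, biases and convolution templates, processes both input clips.

    import torch.nn as nn

    class TwoPath(nn.Module):
        def __init__(self, unit):
            super().__init__()
            self.unit = unit                   # a single shared 3DCNN unit (shared parameters theta)

        def forward(self, clip_a, clip_b):
            id_a, exp_a = self.unit(clip_a)    # path 1
            id_b, exp_b = self.unit(clip_b)    # path 2 reuses exactly the same parameters
            return (id_a, exp_a), (id_b, exp_b)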
Each 3DCNN network structure unit comprises an input layer, a preprocessing feature extraction layer, a 3D convolution layer, a 2D pooling layer, a 3D convolution layer, a 3D pooling layer, two full-connection layers and an output layer. The principle of these layers is explained separately below.
An input layer: dynamic expression data of a group of 7-frame images are input, and the image specification size of the dynamic expression data is normalized to 112 × 112 before input.
Pre-processing feature extraction layer: the input layer initially provides only the frame images; taking the figure as an example, each group of dynamic expressions consists of 7 frames. Experiments show that extracting pre-processed features from the input image with certain prior knowledge, and feeding the network coarse but targeted images that summarize the overall characteristics of the frame, yields a more comprehensive learning effect. Therefore 3 sets of features are extracted for each group of images: the normalized face colour images with the background cut away, the 4-direction gradient feature images extracted from the grayscale images with a Gabor filter, and the x- and y-direction sparse optical flow images of the facial feature points of the original images. These three parts are shown in the figure as the orange, blue and green image groups respectively. Identity information and dynamic expression information focus on different things: the former focuses on the overall facial features after normalization removes the background influence, while the latter focuses on the motion information of the facial feature points, which is a feature common to faces. Therefore, in the following training, the first two feature image groups serve as the input for identity information, and the optical flow information of the feature points serves as the input for dynamic expression classification. Since the first 2 sets of features are obtained by processing single frames while the last set of optical flow features is computed between pairs of images, the size of the final output image set is 47 = 1 × 7 + 4 × 7 + 2 × 6.
First 3D convolutional layer: 3 convolution kernels with template size 3 × 5 × 5 are selected to perform sliding convolution over the 3 input feature sets, 47 images in total. Choosing 3 convolution templates is expected to extract three sets of coarse image features in the first convolution, and 5 × 5 is the spatial template size designed for the 112 × 112 input; after convolution the data become 5 × 3 × 5 × 108 × 108 and 2 × 3 × 4 × 108 × 108. Analyzing the first set of parameters: 3 refers to the 3 sets of convolution kernels, the first 5 refers to the 5 channels of the first 2 feature sets, the second 5 is the number of images per channel after convolution, and 108 × 108 is the image size after convolution. Given the size of the input image set, the convolution stride is set to 1 in both small cubes. The convolution size is computed as:
Output = (Input - Kernel) + 1
where Input refers to the number of images per channel and to the height and width of the images, Output to the corresponding output quantities, and Kernel to the template size of the convolution; from this equation the second 5 = 7 - 3 + 1 and 108 = 112 - 5 + 1 are obtained. For the second set of parameters, 2 refers to the 2 channels in the x and y directions of the optical flow information, and 4 results from the input number being 6.
Second 2D pooling layer: the main objective of this layer is to reduce the number of parameters in the network and speed up training and testing, so 2D max pooling is applied to the convolution output of the previous layer. The template size used here is 2 × 2, i.e. a moving window covering an area of 2 × 2 = 4 traverses the convolution output image and the maximum of the four values in the window is taken as the output value of the new image; the height and width (h, w) therefore become half of the original, trading a small loss for much higher efficiency. After this layer the outputs become 5 × 3 × 5 × 54 × 54 and 2 × 3 × 4 × 54 × 54.
Third 3D convolutional layer: the input is convolved with 3 convolution kernels of template size 3 × 3 × 3. Choosing 3 convolution templates is expected to extract finer spatio-temporal and spatial features on the basis of the first convolution, and 3 × 3 is the spatial template size designed for the 54 × 54 input; the outputs become 5 × 3 × 3 × 52 × 52 and 2 × 3 × 2 × 52 × 52.
For the first set of parameters, the convolution principle is similar to the first layer, except that the corresponding per-channel input count changes from 5 × 7 at the first convolution to 3 × 5 after the previous convolution with 3 kernels, where the two parameters represent the results produced by the 3 convolution kernels of the first layer over the 5 channels of the first 2 feature sets, and 5 is the per-channel output count after the first convolution. The calculation method in this layer is therefore unchanged; because 3 convolution templates are added, the total count is multiplied by 3, and the temporal depth changes from the original 5 images per channel to 3 through the spatio-temporal convolution. The second set of output parameters can be analyzed in the same way.
Fourth 3D pooling layer: 3D pooling rather than 2D pooling is chosen mainly because the fully connected layers follow; after this pooling the flattened parameters preserve the spatio-temporal characteristics better while the number of network parameters is reduced. Since a 2 × 2 × 2 pooling kernel is used in this layer, the outputs become 5 × 3 × 2 × 26 × 26 and 2 × 3 × 1 × 26 × 26.
Fifth fully connected layer: the previous outputs are flattened into one-dimensional vectors, giving input vectors of size 3 × 5 × 2 × 26 × 26 and 3 × 2 × 1 × 26 × 26 respectively, and the fully connected layers output vectors of lengths 1024 and 204.
Sixth fully connected layer: the vectors of lengths 1024 and 204 are mapped through full connections to vectors of lengths 128 and 32.
Output layer: when the method is applied in a fixed environment, the number of identity classes and the number of dynamic-expression-intensity classes are output through the last fully connected layer; in this method they are preset to 20 and 5 respectively.
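For illustration only (not part of the original disclosure), the following simplified PyTorch sketch follows the layer sequence described above for the identity branch; grouped convolutions approximate the "3 kernels per channel" behaviour, the ReLU activations are assumptions, and the exact tensor sizes may differ slightly from the text.

    import torch.nn as nn

    class IdentityBranch3DCNN(nn.Module):
        # Input: (batch, 5, 7, 112, 112), the 5 channels of the first two feature sets over 7 frames.
        def __init__(self, num_identities=20):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(5, 15, kernel_size=(3, 5, 5), groups=5),  # first 3D conv: 3 kernels per channel
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),                # "2D" pooling: spatial dimensions only
                nn.Conv3d(15, 45, kernel_size=3, groups=15),        # second 3D conv: 3 x 3 x 3 templates
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=2),                        # 3D pooling with a 2 x 2 x 2 kernel
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(45 * 1 * 26 * 26, 1024),                  # fifth layer: flatten plus full connection
                nn.ReLU(inplace=True),
                nn.Linear(1024, 128),                               # sixth fully connected layer
                nn.ReLU(inplace=True),
                nn.Linear(128, num_identities),                     # output layer: 20 identity classes
            )

        def forward(self, x):
            return self.classifier(self.features(x))

The expression branch would be built analogously from the optical-flow channels, with its final layer sized to the 5 intensity classes.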
Having analyzed the network structure, the loss function required for back-propagation and optimization is now explained. It consists of four parts; the first part is the identity recognition loss, measured with a softmax function:
$L_{i} = -\frac{1}{N}\sum_{n=1}^{N} y_{i}\,\log\left(y_{pi}\right)$
where $y_{i}$ represents the true identity to which the input dynamic expression belongs, i.e. its specific identity label, $y_{pi}$ represents the predicted value for the identity at the output layer of the neural network, and N is the input batch size. The loss computed with this softmax loss function is back-propagated to optimize the network parameters.
The second part is a dynamic expression classification loss function, and the softmax function is adopted in the method for measuring:
$L_{e} = -\frac{1}{N}\sum_{n=1}^{N} y_{e}\,\log\left(y_{pe}\right)$
where $y_{e}$ indicates the true category to which the input dynamic expression belongs, i.e. its specific intensity-category label, $y_{pe}$ represents the predicted dynamic expression category at the output layer of the neural network, and N is the input batch size. The loss computed with this softmax loss function is back-propagated to optimize the network parameters.
The third part is the identity comparison loss function; experiments show that good results are obtained by measuring the similarity of the two paths with the Euclidean distance. The Euclidean distance between the identity-related parts of the outputs of the upper and lower paths is therefore defined as follows:
$d(i_{1}, i_{2}) = \left\| y_{pi_{1}} - y_{pi_{2}} \right\|_{2}$
where $i_{1}$ and $i_{2}$ are the identities of the dynamic expressions input on the two paths, and $y_{pi_{1}}$ and $y_{pi_{2}}$ are the predicted values of the identities of the two paths' dynamic expressions. The identity difference is measured by the Euclidean distance between these predicted values, and in order to distinguish the losses for the cases of identical and different identity labels, the final identity-related loss function is defined as follows:
$L_{i12} = \frac{1}{2N}\sum_{n=1}^{N}\left[\, y_{i}\, d^{2} + (1 - y_{i})\,\max\left(thres - d,\ 0\right)^{2} \right]$
where N is the batch size of each training step, $y_{i}$ is the preset label comparison value, which takes the value 1 when the two dynamic expressions belong to the same person, i.e. have the same identity, and 0 otherwise, and thres is a threshold set in advance, whose value is set to 2.5 after repeated training on the training set.
The fourth part is the loss function for the dynamic expression category; as above, the similarity of the two network outputs with respect to the dynamic expression category is analyzed through the Euclidean distance:
d(e_1, e_2) = \| y_{pe1} - y_{pe2} \|_2
where e_1 is the category of the dynamic expression intensity input to the first path; e_2 is the category of the dynamic expression intensity input to the second path; y_pe1 is the predicted intensity category from the upper path; and y_pe2 is the predicted intensity category from the lower path. To distinguish the losses for the same and different dynamic expression intensity categories, the final loss function for the dynamic expression intensity category is defined as follows:
L_{e12} = \frac{1}{N}\sum_{n=1}^{N}\left[ y_e\, d(e_1,e_2)^2 + (1-y_e)\,\max\big(thres - d(e_1,e_2),\ 0\big)^2 \right]
where y_e takes the value 1 when the two dynamic expression intensities belong to the same category and 0 otherwise, and thres is a threshold set in advance, set to 1.5 after multiple training runs on the training set.
The total loss function for each training batch is finally defined as:
L_o = \lambda_i L_i + \lambda_e L_e + \lambda_{i12} L_{i12} + \lambda_{e12} L_{e12}
where each λ is a regularization coefficient that adjusts the influence of the corresponding loss term.
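To make the four-part loss concrete, here is a minimal PyTorch sketch that assumes the two softmax terms are standard cross-entropy losses and the two Euclidean-distance terms take the usual contrastive-loss form with the thresholds 2.5 and 1.5 given above; the tensor names and the equal default λ weights are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(yp_i, y_i, yp_e, y_e,
               yp_i1, yp_i2, same_identity,
               yp_e1, yp_e2, same_intensity,
               lambdas=(1.0, 1.0, 1.0, 1.0), thres_i=2.5, thres_e=1.5):
    # yp_i: (N, 20) identity logits, y_i: (N,) identity labels
    # yp_e: (N, 5) intensity logits, y_e: (N,) intensity-category labels
    L_i = F.cross_entropy(yp_i, y_i)   # identity recognition loss (softmax loss)
    L_e = F.cross_entropy(yp_e, y_e)   # expression-intensity classification loss

    # contrastive terms over the paired outputs of the two paths;
    # same_identity / same_intensity are float tensors of 1 (same) or 0 (different)
    d_i = torch.norm(yp_i1 - yp_i2, p=2, dim=1)
    L_i12 = torch.mean(same_identity * d_i.pow(2)
                       + (1 - same_identity) * torch.clamp(thres_i - d_i, min=0).pow(2))

    d_e = torch.norm(yp_e1 - yp_e2, p=2, dim=1)
    L_e12 = torch.mean(same_intensity * d_e.pow(2)
                       + (1 - same_intensity) * torch.clamp(thres_e - d_e, min=0).pow(2))

    l_i, l_e, l_i12, l_e12 = lambdas
    return l_i * L_i + l_e * L_e + l_i12 * L_i12 + l_e12 * L_e12
```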
In an embodiment of the present invention, the specific process of step S110, "establishing the correspondence between the feature group of the person to be tested and the dynamic expression classification of the person to be tested by using the artificial intelligence self-learning capability", may be further explained in conjunction with the following description; wherein the dynamic expression classification covers the identity and the dynamic expression intensity category of the person to be tested.
The following steps are described: acquiring sample data for establishing a corresponding relation between the feature group and the dynamic expression classification;
in a further embodiment, the specific process of acquiring sample data for establishing the correspondence between the feature group and the dynamic expression classification may be further described in conjunction with the following description.
The following steps are described: collecting the feature sets and the dynamic expression classifications of different samples;
for example: data collection: collecting the feature groups and the corresponding dynamic expression classifications of different persons to be tested, and collecting the feature groups and the corresponding dynamic expression classifications of the same persons under varied capture conditions.
In this way the data are collected through multiple channels, which increases the amount of data, improves the learning capability of the artificial neural network, and thereby improves the accuracy and reliability of the determined correspondence.
The following steps are described: analyzing the feature group, and selecting data related to the dynamic expression classification as the feature group by combining with prestored expert experience information (for example, selecting the feature group influencing the dynamic expression classification as an input parameter and using a specified parameter as an output parameter);
for example: and taking the feature group in the relevant data of the person to be tested after the dynamic expression classification is determined as an input parameter, and taking the dynamic expression classification in the relevant data as an output parameter.
The following steps are described: and taking the data pair formed by the dynamic expression classification and the selected feature group as sample data.
For example: part of the obtained input-and-output parameter pairs is used as training sample data, and the remaining part is used as test sample data.
Therefore, the collected feature groups are analyzed and processed to obtain sample data, the operation process is simple, and the reliability of the operation result is high.
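As a simple illustration of forming the sample data, the sketch below pairs each selected feature group with its dynamic expression classification and splits the pairs into training and test samples at the 4:1 ratio used later in the concrete implementation; the record format and the shuffling seed are assumptions for illustration.

```python
import random

def build_sample_data(records, train_ratio=0.8, seed=0):
    """records: list of (feature_group, dynamic_expression_label) pairs."""
    samples = list(records)
    random.Random(seed).shuffle(samples)       # random split of the data set
    split = int(len(samples) * train_ratio)    # 4:1 -> 80% training, 20% test
    return samples[:split], samples[split:]    # (training samples, test samples)
```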
The following steps are described: analyzing the characteristics and the rules of the characteristic group, and determining the network structure and the network parameters of the artificial neural network according to the characteristics and the rules;
for example: analyzing the feature group corresponding to the person to be tested and the dynamic expression classification of the person to be tested, and preliminarily determining the basic structure of the network, the input and output node number of the network, the number of hidden nodes, the initial weight of the network and the like.
Optionally, the specific process of training the network structure and the network parameters in the step "training and testing the network structure and the network parameters and determining the corresponding relationship between the feature set and the dynamic expression classification" may be further described in conjunction with the following description.
The following steps are described: selecting a part of data in the sample data as a training sample, inputting the feature group in the training sample into the network structure, and training through a loss function of the network structure, an activation function and the network parameters to obtain an actual training result;
specifically, a loss function is minimized through a gradient descent algorithm, network parameters are updated, a current neural network model is trained, and an actual training result is obtained;
determining whether an actual training error between the actual training result and a corresponding dynamic expression classification in the training sample meets a preset training error; determining that the training of the network structure and the network parameters is completed when the actual training error meets the preset training error;
specifically, when the actual training error satisfies the preset training error, and the currently trained model converges, it is determined that the training of the network structure and the network parameters is completed.
More optionally, training the network structure and the network parameters further includes:
when the actual training error does not meet the set training error, updating the network parameters through the error loss function of the network structure, and retraining through the loss function of the network structure, the activation function and the updated network parameters until the retrained actual training error meets the set training error;
for example: if the test error meets the requirement, the network training and testing are finished.
In this way, the test samples are used to test the network structure and the network parameters obtained by training, further verifying their reliability.
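The train-until-the-error-bound-is-met logic above can be sketched as follows; the SGD optimizer, the learning rate and the use of the mean batch loss as the "actual training error" are assumptions for illustration rather than the patented procedure.

```python
import torch

def train_until_threshold(model, loss_fn, train_loader,
                          max_epochs=100, err_threshold=0.05, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        total, batches = 0.0, 0
        for features, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)   # actual training result vs. labels
            loss.backward()                           # back-propagate the error loss
            optimizer.step()                          # gradient-descent parameter update
            total, batches = total + loss.item(), batches + 1
        if total / batches <= err_threshold:          # preset training error satisfied
            return True                               # training completed
    return False                                      # otherwise keep retraining / adjust
```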
Optionally, the specific process of testing the network structure and the network parameters in the step "training and testing the network structure and the network parameters and determining the corresponding relationship between the feature set and the dynamic expression classification" may be further described in conjunction with the following description.
The following steps are described: selecting another part of data in the sample data as a test sample, inputting the feature group in the test sample into the trained network structure, and testing by using the loss function, the activation function and the trained network parameters to obtain an actual test result; determining whether an actual test error between the actual test result and the corresponding dynamic expression classification in the test sample meets a set test error; and when the actual test error meets the set test error, determining that the test on the network structure and the network parameters is finished.
And step S140, acquiring a current feature group of the current person to be tested.
Determining the current dynamic expression classification corresponding to the current feature group according to the corresponding relationship in the step S150; specifically, determining the current dynamic expression classification corresponding to the current feature group includes: and classifying the dynamic expression corresponding to the feature group which is the same as the current feature group in the corresponding relation, and determining the dynamic expression as the current dynamic expression classification.
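A minimal inference sketch for steps S140 and S150 could look as follows, assuming the trained network returns identity and intensity logits for the current feature group and that the highest-scoring classes are taken as the current dynamic expression classification; all names are illustrative.

```python
import torch

@torch.no_grad()
def classify_current(model, current_feature_group):
    model.eval()
    identity_logits, intensity_logits = model(current_feature_group)
    identity = identity_logits.argmax(dim=1)     # predicted identity label
    intensity = intensity_logits.argmax(dim=1)   # predicted expression-intensity class
    return identity, intensity
```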
In a concrete implementation, the scheme adopted in this patent requires a large number of samples of good definition, so the application scene must be determined in advance before the network is trained, and the scene is required to use a relatively stable camera with good definition. Moreover, because of the principle of the method, it is suitable for situations in which the persons in the scene are fixed; if new persons join the application scene, the network needs to be retrained.
Step 1: a certain amount of dynamic expression video is acquired in the application scene through the stable camera.
Step 2: the obtained dynamic expression videos are screened, videos with occlusion, non-frontal faces or severe backlight are removed, and each video is labeled with its identity and expression intensity category using prior knowledge.
Step 3: the obtained dynamic expression videos are randomly partitioned into a data set, and the data set is divided into a training set and a test set at a ratio of 4:1.
Step 4: the training set is again divided at a ratio of 4:1 to obtain a training set and a validation set.
Step 5: each video in the data set is divided into 7 frames of images that represent the dynamic expression.
Step 6: the batch size is set to 32, i.e. 32 image groups are input to the network each time; each image in an image group is resized to 112×112, so the input becomes [32, 3, 7, 112, 112], where 3 represents the 3 color channels of each image.
Step 7: the input image groups are grayed to obtain [32, 1, 7, 112, 112].
Step 8: features are extracted from the grayed image groups; through a normalization segmentation method, a Gabor filter with feature point extraction, and an optical-flow description in the x and y directions, outputs of [32, 1, 7, 112, 112], [32, 4, 7, 112, 112] and [32, 2, 6, 112, 112] are obtained respectively.
Step 9: because of the input requirements of the PyTorch network, the three groups of outputs obtained in step 8 are dimension-adjusted and spliced at input time to obtain [32, 47, 112, 112] (a preprocessing sketch for steps 5-9 is given after this step list).
Step 10: the multi-dimensional vector obtained in step 9 is input into the first 3D convolution layer; this layer uses 3 templates with a 3×5 kernel, and because the feature groups have different numbers of features, the convolutions yield 5 outputs of [32, 3, 5, 108, 108] and 2 outputs of [32, 3, 4, 108, 108].
Step 11: 2D max pooling with a 2×2 pooling kernel is applied to the output obtained in step 10, giving 5 outputs of [32, 3, 5, 54, 54] and 2 outputs of [32, 3, 4, 54, 54].
Step 12: the output obtained in step 11 is input into the second 3D convolution layer; because of the PyTorch framework's constraints on the structure of the input vectors, the 1st dimension, i.e. the dimension with value 3, needs to be extracted from the vectors obtained in step 11 at this layer, changing the input to 15 outputs of [32, 5, 54, 54] and 6 outputs of [32, 4, 54, 54]. These are then fed into the second convolution layer using 3 templates with a 3×3 kernel, thus yielding 15 outputs of [32, 3, 3, 54, 54] and 6 outputs of [32, 3, 2, 52, 52].
Step 13: the result obtained in step 12 is input into a 3D max pooling layer with a 2×2 pooling kernel, giving 15 outputs of [32, 3, 2, 26, 26] and 6 outputs of [32, 3, 1, 26, 26].
Step 14: the results obtained in step 13 are flattened into one-dimensional vectors, giving [32, 15×3×2×26] and [32, 6×3×1×26].
Step 15: the result obtained in step 14 is input into the first fully connected layer to obtain [32, 1024] and [32, 204].
Step 16: the result obtained in step 15 is input into the second fully connected layer to obtain [32, 128] and [32, 24].
Step 17: step 16 yields 32 vectors of length 128 and 32 vectors of length 24; according to a random pairing rule they are divided into 2 groups, becoming [16, 2, 128] and [16, 2, 24], and the identity contrast loss function Li12 and the dynamic expression category contrast loss function Le12 are computed respectively.
Step 18: from the result obtained in step 16 and the true label values, the identity recognition loss function Li and the dynamic expression classification loss function Le are computed respectively.
Step 19: the loss functions obtained in steps 17 and 18 are added, in batch units, to form Lo, and Lo is back-propagated to optimize the network parameters.
Step 20: assuming the number of acquired videos is 12832, batch processing with N = 32 means the process of steps 1-19 needs to be repeated 401 times for each traversal of the training set.
Step 21: when one traversal of the training set has been completed, the validation set obtained by the earlier division is input into the neural network for verification, and the identity recognition loss function Li and the dynamic expression classification loss function Le are calculated and stored.
Step 22: the loss function values after each traversal of the training set are stored and compared iteratively; when the loss value increases, training is stopped and the network parameters are output.
Step 23: the network obtained in step 22 is used to test and judge the input test set.
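As referenced in step 9, the following is a hedged sketch of the per-video preprocessing in steps 5-9, assuming OpenCV for frame sampling, Gabor filtering and Farneback optical flow; the normalization-segmentation and feature-point steps are not reproduced (the plain grayscale frames stand in for the first feature group), and all filter parameters are illustrative assumptions.

```python
import cv2
import numpy as np
import torch

def sample_frames(video_path, n_frames=7, size=112):
    """Steps 5-7: pick 7 evenly spaced frames, resize to 112x112 and gray them."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in np.linspace(0, max(total - 1, 0), n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frame = cv2.resize(frame, (size, size))
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames                                    # list of 7 grayscale images

def build_feature_group(frames, size=112):
    """Steps 8-9: build the 1x7, 4x7 and 2x6 feature maps and splice to 47 channels."""
    gray = np.stack(frames).astype(np.float32)       # (7, 112, 112)

    gabor = []                                       # 4 Gabor orientations, 7 frames each
    for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        k = cv2.getGaborKernel((9, 9), 2.0, theta, 8.0, 0.5)
        gabor.append(np.stack([cv2.filter2D(f, -1, k) for f in gray]))
    gabor = np.stack(gabor)                          # (4, 7, 112, 112)

    flows = [cv2.calcOpticalFlowFarneback(frames[i], frames[i + 1], None,
                                          0.5, 3, 15, 3, 5, 1.2, 0)
             for i in range(len(frames) - 1)]        # x/y flow between adjacent frames
    flow = np.stack(flows).transpose(3, 0, 1, 2)     # (2, 6, 112, 112)

    parts = [gray.reshape(-1, size, size),           # 1*7 = 7 channels
             gabor.reshape(-1, size, size),          # 4*7 = 28 channels
             flow.reshape(-1, size, size)]           # 2*6 = 12 channels
    return torch.from_numpy(np.concatenate(parts))   # (47, 112, 112) per video
```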
In the process of identifying dynamic expressions, once the application scene is determined, the environment and the data set required by the network can be obtained by preparing the test environment and collecting the training data in advance; a large number of given test dynamic expressions can then be classified and ordered in time through pre-training, which improves processing efficiency while guaranteeing the processing effect and lays good groundwork for further data analysis of the dynamic expressions. The 3D convolutional network can be optimized by learning the spatiotemporal characteristics of the input video; the overall parameter count is small and the running speed is high, so a good classification effect is obtained while efficiency is guaranteed. In this scheme, the extracted groups of image features are learned through the TP-3DCNN, and the judgment of identity and dynamic expression intensity category is completed with both efficiency and accuracy taken into account; thanks to the visualization and operability of the network, it has strong application prospects in public scenes where a fixed set of persons is covered by cameras.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Referring to fig. 7, a device for identifying dynamic facial expressions of an artificial neural network according to an embodiment of the present application is shown; the device is applied to distinguishing the dynamic expression classification of a person to be detected by detecting real-time images of the person to be detected, wherein the dynamic expression classification covers the identity and the dynamic expression intensity category;
the device specifically comprises:
the image data generation module 710 for presetting a frame number is used for acquiring original video data of a person to be tested and determining image data of the preset frame number in the original video data;
a feature group generating module 720, configured to generate a feature group according to the image data with the preset frame number;
the corresponding relation establishing module 730 is used for establishing a corresponding relation between the feature group of the person to be tested and the dynamic expression classification of the person to be tested by utilizing the artificial intelligence self-learning capability;
a current feature group obtaining module 740, configured to obtain a current feature group of a current person to be tested;
a current dynamic expression classification determining module 750, configured to determine, according to the corresponding relationship, a current dynamic expression classification corresponding to the current feature group; specifically, determining the current dynamic expression classification corresponding to the current feature group includes: and classifying the dynamic expression corresponding to the feature group which is the same as the current feature group in the corresponding relation, and determining the dynamic expression as the current dynamic expression classification.
In an embodiment of the present invention, the image data generating module 710 with the preset frame number includes:
the frequency frame rate and video duration acquisition submodule is used for acquiring the video frequency frame rate and the video duration of the original video data;
and the image data determining submodule of the preset frame number is used for determining the image data of the preset frame number in the original video data according to the video frame rate and the video duration.
In an embodiment of the present invention, the sub-module for determining image data with a preset frame number includes:
the video segment segmentation submodule is used for averagely segmenting the original video data into a preset number of video segments according to the video frame rate and the video duration;
and the image data extraction submodule with the preset frame number is used for extracting the image data with the same frame number position from each video segment as the image data with the preset frame number.
In an embodiment of the present invention, the feature group generation module 720 includes:
the grayed image data generation submodule is used for performing graying processing on the image data with the preset frame number to generate a grayed image data group corresponding to the image data with the preset frame number; the grayed image data group comprises grayed image data with the same number as the preset frame number;
the enhanced image group generation submodule is used for generating an enhanced image group according to the background region and the non-background region in the gray image data group;
the optical flow motion information image group generation submodule is used for generating an optical flow motion information image group containing motion information of the face of the person to be detected in the X-axis direction and the Y-axis direction according to the gray image data corresponding to the adjacent video segments;
the gradient output image group generation submodule is used for generating a gradient output image group containing edges of 4 directions of the grayed image data according to the grayed image data group;
and the feature group generation submodule is used for generating the feature group according to the enhanced image group, the optical flow motion information image group and the gradient output image group.
In an embodiment of the present invention, the corresponding relationship establishing module 730 includes:
the obtaining submodule is used for obtaining sample data for establishing a corresponding relation between the feature group and the dynamic expression classification;
the analysis submodule is used for analyzing the characteristics and the rules of the characteristic group and determining the network structure and the network parameters of the artificial neural network according to the characteristics and the rules;
and the training submodule is used for training and testing the network structure and the network parameters by using the sample data and determining the corresponding relation between the feature group and the dynamic expression classification.
In an embodiment of the present invention, the obtaining sub-module includes:
the collection submodule is used for collecting the feature groups and the dynamic expression classifications of different samples;
the analysis submodule is used for analyzing the feature group and selecting data related to the dynamic expression classification as the feature group by combining prestored expert experience information;
and the sample data generation submodule is used for classifying the dynamic expressions and selecting a data pair formed by the characteristic groups as sample data.
In an embodiment of the present invention, the training submodule includes:
a training result generation submodule, configured to select a part of the sample data as a training sample, input the feature set in the training sample to the network structure, and train through a loss function of the network structure, an activation function, and the network parameters to obtain an actual training result;
a training result error judgment submodule for determining whether an actual training error between the actual training result and the corresponding dynamic expression classification in the training sample satisfies a preset training error;
a training completion determination submodule configured to determine that the training of the network structure and the network parameters is completed when the actual training error satisfies the preset training error;
and/or,
the test submodule is used for testing the network structure and the network parameters, and comprises:
a test result generation submodule, configured to select another part of the sample data as a test sample, input the feature set in the test sample into the trained network structure, and perform a test with the loss function, the activation function, and the trained network parameter to obtain an actual test result;
the test result error judgment submodule is used for determining whether the actual test error between the actual test result and the corresponding dynamic expression classification in the test sample meets a set test error or not;
and the test completion judging submodule is used for determining that the test on the network structure and the network parameters is completed when the actual test error meets the set test error.
In an embodiment of the present invention, the training submodule further includes:
a network parameter updating submodule, configured to update the network parameter through an error loss function of the network structure when the actual training error does not meet the set training error;
the first retraining submodule is used for retraining the activation function and the updated network parameters through the loss function of the network structure until the actual training error after retraining meets the set training error;
and/or,
the test submodule further comprises:
and the second retraining submodule is used for retraining the network structure and the network parameters when the actual test error does not meet the set test error until the retrained actual test error meets the set test error.
In an embodiment of the present invention, the present invention further provides an apparatus, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor, where the computer program, when executed by the processor, implements the steps of the above-mentioned method for identifying dynamic facial expressions of an artificial neural network.
In an embodiment of the present invention, the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method for identifying dynamic facial expressions of an artificial neural network as described above.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method and the device for identifying the dynamic facial expressions of the artificial neural network are introduced in detail, specific examples are applied in the method to explain the principle and the implementation mode of the method, and the description of the embodiments is only used for helping to understand the method and the core idea of the method; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A face dynamic expression recognition method based on an artificial neural network is characterized in that the method is applied to identity and/or expression intensity category prediction of a person expression in a video under a person fixed scene;
the method comprises the following steps:
acquiring original video data of a person to be detected, and determining image data of a preset frame number in the original video data;
generating a characteristic group according to the image data of the preset frame number;
establishing a corresponding relation between the characteristic group of the person to be tested and the dynamic expression classification of the person to be tested by utilizing the artificial intelligent self-learning capability;
acquiring a current characteristic group of a current person to be tested;
determining the current dynamic expression classification corresponding to the current feature group according to the corresponding relation; specifically, determining the current dynamic expression classification corresponding to the current feature group includes: and classifying the dynamic expression corresponding to the feature group which is the same as the current feature group in the corresponding relation, and determining the dynamic expression as the current dynamic expression classification.
2. The method according to claim 1, wherein the step of obtaining the original video data of the person to be tested and determining the preset number of frames of image data in the original video data comprises:
acquiring the video frame rate and the video duration of the original video data;
and determining image data of a preset frame number in the original video data according to the video frame rate and the video duration.
3. The method of claim 2, wherein the step of determining a preset number of frames of image data in the original video data according to the video frame rate and the video duration comprises:
averagely dividing the original video data into a preset number of video segments according to the video frame rate and the video duration;
and extracting image data with the same frame number position from each video segment as the image data of the preset frame number.
4. The method according to claim 1, wherein the step of generating the feature group from the image data of the preset number of frames comprises:
carrying out graying processing on the image data with the preset frame number to generate a grayed image data group corresponding to the image data with the preset frame number; the grayed image data group comprises grayed image data with the same number as the preset frame number;
generating an enhanced image group according to a background region and a non-background region in the gray image data group;
generating an optical flow motion information image group containing motion information of the face of the person to be detected in the X-axis direction and the Y-axis direction according to the gray image data corresponding to the adjacent video segments;
generating a gradient output image group comprising edges of the gray image data in 4 directions according to the gray image data group;
and generating the feature set according to the enhanced image set, the optical flow motion information image set and the gradient output image set.
5. The method according to claim 1, wherein the step of establishing the corresponding relationship between the feature group corresponding to the person to be tested and the dynamic expression classification of the person to be tested comprises:
acquiring sample data for establishing a corresponding relation between the feature group and the dynamic expression classification;
analyzing the characteristics and the rules of the characteristic group, and determining the network structure and the network parameters of the artificial neural network according to the characteristics and the rules;
and training and testing the network structure and the network parameters by using the sample data, and determining the corresponding relation between the feature group and the dynamic expression classification.
6. The method of claim 5, wherein the step of obtaining sample data for establishing correspondence between the feature set and the dynamic expression classification comprises:
collecting the feature sets and the dynamic expression classifications of different samples;
analyzing the feature group, and selecting data related to the dynamic expression classification as the feature group by combining with prestored expert experience information;
and taking the data pair formed by the dynamic expression classification and the selected feature group as sample data.
7. The method according to any one of claims 5 to 6,
training the network structure and the network parameters, including:
selecting a part of data in the sample data as a training sample, inputting the feature group in the training sample into the network structure, and training through a loss function of the network structure, an activation function and the network parameters to obtain an actual training result;
determining whether an actual training error between the actual training result and a corresponding dynamic expression classification in the training sample meets a preset training error;
determining that the training of the network structure and the network parameters is completed when the actual training error meets the preset training error;
and/or,
testing the network structure and the network parameters, comprising:
selecting another part of data in the sample data as a test sample, inputting the feature group in the test sample into the trained network structure, and testing by using the loss function, the activation function and the trained network parameters to obtain an actual test result;
determining whether an actual test error between the actual test result and the corresponding dynamic expression classification in the test sample meets a set test error;
and when the actual test error meets the set test error, determining that the test on the network structure and the network parameters is finished.
8. A device for recognizing dynamic facial expressions of an artificial neural network, characterized in that the device is applied to the prediction of the identity and/or expression intensity category of a person expression in a video of a person fixed scene;
the device specifically comprises:
the system comprises an image data generation module with preset frame numbers, a data acquisition module and a data processing module, wherein the image data generation module is used for acquiring original video data of a person to be detected and determining image data with preset frame numbers in the original video data;
the characteristic group generating module is used for generating a characteristic group according to the image data of the preset frame number;
the corresponding relation establishing module is used for establishing a corresponding relation between the characteristic group of the person to be tested and the dynamic expression classification of the person to be tested by utilizing the artificial intelligence self-learning capability;
the current characteristic group acquisition module is used for acquiring a current characteristic group of a current person to be detected;
a current dynamic expression classification determining module, configured to determine, according to the correspondence, a current dynamic expression classification corresponding to the current feature group; specifically, determining the current dynamic expression classification corresponding to the current feature group includes: and classifying the dynamic expression corresponding to the feature group which is the same as the current feature group in the corresponding relation, and determining the dynamic expression as the current dynamic expression classification.
9. An apparatus comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, the computer program when executed by the processor implementing the method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110057226.5A 2021-01-15 2021-01-15 Method and device for identifying dynamic facial expressions of artificial neural network Active CN112766145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110057226.5A CN112766145B (en) 2021-01-15 2021-01-15 Method and device for identifying dynamic facial expressions of artificial neural network

Publications (2)

Publication Number Publication Date
CN112766145A true CN112766145A (en) 2021-05-07
CN112766145B CN112766145B (en) 2021-11-26

Family

ID=75701992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110057226.5A Active CN112766145B (en) 2021-01-15 2021-01-15 Method and device for identifying dynamic facial expressions of artificial neural network

Country Status (1)

Country Link
CN (1) CN112766145B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764207A (en) * 2018-06-07 2018-11-06 厦门大学 A kind of facial expression recognizing method based on multitask convolutional neural networks
CN109753950A (en) * 2019-02-11 2019-05-14 河北工业大学 Dynamic human face expression recognition method
CN109766766A (en) * 2018-12-18 2019-05-17 深圳壹账通智能科技有限公司 Employee work condition monitoring method, device, computer equipment and storage medium
CN111931630A (en) * 2020-08-05 2020-11-13 重庆邮电大学 Dynamic expression recognition method based on facial feature point data enhancement
CN112084944A (en) * 2020-09-09 2020-12-15 清华大学 Method and system for identifying dynamically evolved expressions

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408389A (en) * 2021-06-10 2021-09-17 西华大学 Method for intelligently recognizing drowsiness action of driver
CN113642429A (en) * 2021-07-29 2021-11-12 海南大学 Marine fish identification method based on TPP-TCCNN
CN113642429B (en) * 2021-07-29 2023-07-14 海南大学 Marine fish identification method based on TPP-TCCNN

Also Published As

Publication number Publication date
CN112766145B (en) 2021-11-26

Similar Documents

Publication Publication Date Title
Wang et al. Automatic laser profile recognition and fast tracking for structured light measurement using deep learning and template matching
Lin et al. Estimation of number of people in crowded scenes using perspective transformation
CN103632132B (en) Face detection and recognition method based on skin color segmentation and template matching
CN107016357B (en) Video pedestrian detection method based on time domain convolutional neural network
CN108447078B (en) Interference perception tracking algorithm based on visual saliency
CN111914664A (en) Vehicle multi-target detection and track tracking method based on re-identification
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
CN111611874B (en) Face mask wearing detection method based on ResNet and Canny
Hu Design and implementation of abnormal behavior detection based on deep intelligent analysis algorithms in massive video surveillance
CN106951870A (en) The notable event intelligent detecting prewarning method of monitor video that active vision notes
CN107330371A (en) Acquisition methods, device and the storage device of the countenance of 3D facial models
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN110298297A (en) Flame identification method and device
CN106778474A (en) 3D human body recognition methods and equipment
CN112001241B (en) Micro-expression recognition method and system based on channel attention mechanism
CN112766145B (en) Method and device for identifying dynamic facial expressions of artificial neural network
CN107909081A (en) The quick obtaining and quick calibrating method of image data set in a kind of deep learning
Avula et al. A novel forest fire detection system using fuzzy entropy optimized thresholding and STN-based CNN
CN106127812A (en) A kind of passenger flow statistical method of non-gate area, passenger station based on video monitoring
CN107590427A (en) Monitor video accident detection method based on space-time interest points noise reduction
CN106570490A (en) Pedestrian real-time tracking method based on fast clustering
CN116343330A (en) Abnormal behavior identification method for infrared-visible light image fusion
CN106909883A (en) A kind of modularization hand region detection method and device based on ROS
CN106611158A (en) Method and equipment for obtaining human body 3D characteristic information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant