CN111210415A - Method for detecting facial expression hypomimia of a Parkinson's patient - Google Patents

Method for detecting facial expression hypomimia of a Parkinson's patient

Info

Publication number
CN111210415A
CN111210415A (application number CN202010010215.7A)
Authority
CN
China
Prior art keywords
face
video
facial expression
facial
parkinson
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010010215.7A
Other languages
Chinese (zh)
Other versions
CN111210415B (en)
Inventor
苏鸽
尹建伟
林博
罗巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010010215.7A priority Critical patent/CN111210415B/en
Publication of CN111210415A publication Critical patent/CN111210415A/en
Application granted granted Critical
Publication of CN111210415B publication Critical patent/CN111210415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G06V 40/176 Dynamic expression
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10024 Color image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing

Abstract

The invention discloses a method for detecting facial expression hypomimia (reduced facial expression) in Parkinson's patients based on a VS-C3D network, which comprises the following specific steps: first, faces are captured from a facial video, regions irrelevant to the face are cropped out, and the remaining image sequence containing the face is used as input data; the input image sequence is then divided into two channels: one channel consists of the RGB color image sequence, and the other channel consists of optical flow images extracted from the RGB color images; then, the VEL algorithm in the VS-C3D network segments video clips containing facial activity and removes the expressionless regions of the video; finally, the VGGV network extracts spatio-temporal features of facial activity, digitizes the mimic representation of facial activity, and uses these spatio-temporal features to distinguish Parkinson's patients with hypomimia symptoms from normal control subjects. In conclusion, the hypomimia symptom of Parkinson's disease can be identified with high accuracy.

Description

Method for detecting facial expression hypomimia of a Parkinson's patient
Technical Field
The invention relates to the technical field of medical decision support for Parkinson's disease, in particular to a method for detecting hypomimia (reduced facial expression) of Parkinson's patients based on a VS-C3D network.
Background
Detection of hypomimia (reduced facial expression) in Parkinson's disease is a typical application in the field of computer vision and plays an important role in intelligent healthcare, disease monitoring, remote diagnosis and related fields. However, because recent work has been developed on still images, hypomimia detection technology for Parkinson's disease has stagnated over the last two years. The present invention enables a shift from identifying Parkinson's hypomimia from static images to exploring hypomimia patterns from dynamic video segments. This is an extremely difficult task: the invention must not only generate dynamic variable-length video segments and extract a spatio-temporal representation that captures the progressive rigidification process of Parkinson's hypomimia, but also visualize some of the pathological phenomena. Most existing methods, however, still rely on earlier geometric features, HOG features and the like, feeding these hand-crafted features into classifiers for training and classification. Obviously, these conventional methods have the following disadvantages: 1) First, they do not recognize that Parkinson's hypomimia is a progressive rigidification process. They do not use dynamic video clips, but still rely on static images to quantify hypomimia-related activity. Once such a model encounters an expressionless picture, conventional methods easily produce false positives, resulting in poor detection performance. 2) Traditional hand-crafted representations focus on local regions rather than on the whole face, and today a large number of representations rely on facial organ structure. Such representations may not be optimal because they ignore regions relevant to hypomimia, such as the cheeks. 3) Designing these hand-crafted representations requires a great deal of expertise, which raises the bar too high for hypomimia detection researchers. The following briefly reviews conventional methods for detecting Parkinson's hypomimia. Conventional detection starts with the mathematical model of facial contours by Katsikitis et al.; after their work, a variety of facial features were developed, and most later detection methods improve on their approach. Katsikitis et al. constructed a reference model of the face as a control and measured the movement of facial muscles relative to this reference model in the vertical and horizontal directions. To simplify this model, Grammitkopoulo et al. replaced the lines in the earlier facial contour model with facial keypoints. Based on facial keypoints, Andrea adopted Procrustes analysis to construct an average facial template as a neutral reference model and calculated the distance between facial expressions in static images and the neutral reference model. By statistical analysis, Andrea demonstrated that healthy persons show a wider range of facial expression variation than Parkinson's patients. However, some of the hand-crafted features designed by Andrea move only a small distance when hypomimia is being detected and contribute little to detection. Therefore, Shinde selected the most expressive facial organ, the eyes, as the detection target and constructed a histogram to explore the eye opening and closing process.
Shinde found that healthy people blink approximately twenty times per minute, whereas Parkinson's patients blink ten times or fewer per minute.
Obviously, the above methods mathematically model static images with hand-crafted features and rarely mine the progressive rigidification process of Parkinson's hypomimia. Moreover, previous approaches focus on quantifying local regions such as facial organs without capturing a global facial representation. Detection based on facial activity patterns would therefore complement static-image detection. In addition, existing methods mainly combine static images with traditional hand-crafted features and rarely adopt methods such as deep networks. This is because Parkinson's hypomimia detection belongs to clinical medical decision support, where the interpretation of pathological mechanisms is very important, so researchers rarely adopt poorly interpretable methods such as deep networks. The present invention may be the first attempt to use a deep learning method and video clips to detect Parkinson's hypomimia and to give a visual interpretation.
Disclosure of Invention
The invention provides a method for detecting hypomimia of Parkinson's patients based on a VS-C3D network; the VS-C3D network detects hypomimia from both the spatial and the temporal aspects of a facial video. First, the proposed VS-C3D network captures variable, content-dependent video segments; it then extracts spatio-temporal representations of facial activity; finally, the invention gives a visual illustration of the VS-C3D network's decisions.
A method for detecting hypomimia of a Parkinson's patient based on a VS-C3D network (VS-C3D: Variable Segment Convolutional 3D Network) mainly uses the VEL algorithm to obtain variable-length video segments, learns highly expressive features from these variable segments, and detects hypomimia. The method comprises the following steps:
1) detecting a human face;
2) after face detection, inputting the captured face image sequence into a two-stream channel, wherein the first channel of the two-stream channel directly receives the face image sequence to form a spatial RGB image sequence, and the second channel of the two-stream channel converts the face image sequence into a temporal optical flow image sequence through an optical flow image extractor;
3) segmenting the spatial RGB image sequence obtained in step 2) through the VEL algorithm to obtain spatially variable-length video segments, forming the starting point of each facial activity, and segmenting the temporal optical flow image sequence obtained in step 2) using the starting point of each facial activity to obtain variable-length optical flow video segments;
the VEL algorithm is shown as formula (1):
L_{N×M} = α · Min_Seg · A_{N×M} + (1 − α) · (L_FA .* key_value)    (1)
In formula (1), L is the final video segment length; L_{N×M} is an N × M matrix representing the segment lengths of N videos, each containing at most M facial activities; Min_Seg represents the lower limit of the video length, Min_Seg = 16 × t / f, where t is the default public dataset frame rate and f is the frame rate of the video to be detected; A_{N×M} is an N × M matrix whose elements are all 1, used as an auxiliary matrix; α is an adjustment coefficient; L_FA represents the set of all video segment lengths; key_value is a set containing 1 and/or 0 elements, and .* denotes the element-wise product. A key_value element of 1 indicates that the image sequence in the corresponding facial activity segment is a valid activity occurrence region, and a key_value element of 0 indicates that the image sequence in the corresponding facial activity segment contains no valid activity occurrence region.
4) training on the spatially variable-length video segments and the variable-length optical flow video segments obtained in step 3) through a 3D convolutional neural network, to obtain a trained network for judging whether facial expression hypomimia is present;
5) detecting the facial expression hypomimia condition of the Parkinson's patient using the trained network of step 4), to obtain a detection result. A high-level sketch of this five-step pipeline is given below.
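The five steps above can be wired together as in the following Python sketch; every function name in it (detect_and_crop_faces, extract_optical_flow, vel_segment, load_trained_vggv) is a hypothetical placeholder for a component described later in this disclosure, and the parameter values are illustrative only.

```python
# Hypothetical end-to-end pipeline for the five steps above (illustrative only;
# the helper functions are placeholders for the components described below).
def detect_hypomimia(video_path, t_public=25, f_video=30, alpha=0.5):
    frames = detect_and_crop_faces(video_path)                 # step 1: MTCNN face detection
    rgb_seq = frames                                           # step 2a: spatial RGB channel
    flow_seq = extract_optical_flow(frames)                    # step 2b: temporal optical-flow channel
    segments = vel_segment(rgb_seq, t_public, f_video, alpha)  # step 3: VEL variable-length segments
    rgb_clips = [rgb_seq[s:e] for s, e in segments]
    flow_clips = [flow_seq[s:e] for s, e in segments]
    model = load_trained_vggv()                                # step 4: trained two-stream 3D CNN
    prob = model.predict(rgb_clips, flow_clips)                # step 5: fused hypomimia probability
    return prob > 0.5
```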
First, the invention uses the MTCNN algorithm to capture faces from the facial video, cuts out regions irrelevant to the face, and keeps the image sequence containing the face as input data; the input image sequence is then divided into two channels: one channel consists of the RGB color image sequence, and the other channel consists of optical flow images extracted from the RGB color images; then, the VEL algorithm in the VS-C3D network segments video clips containing facial activity and removes the expressionless regions of the video; finally, the VGGV network extracts spatio-temporal features of facial activity, digitizes the mimic representation of facial activity, and uses these spatio-temporal features to distinguish Parkinson's patients with hypomimia symptoms from normal control subjects. The invention thus identifies the hypomimia symptom of Parkinson's disease with high accuracy.
In step 1), the face detection includes:
Face detection extracts the face region from the original facial video, removes objects irrelevant to the detection target, such as background, clothes and chairs, and converts the facial activity video into a serial image sequence containing only the face (namely, a serial image sequence of faces). Face detection adopts the MTCNN algorithm.
In step 2), the two-stream channel (two-stream) directly takes the serial image sequence extracted by face detection as its first channel, namely the spatial RGB image sequence; the second channel of the two-stream channel, namely the temporal optical flow image sequence, is then obtained by extracting optical flow maps from the same serial image sequence;
In step 3), the VEL algorithm, which extracts video segments from the facial muscle movement map of the spatial RGB image sequence, is specifically as follows:
a) the invention considers that the movement of a certain facial muscle can represent the change of the overall facial activity, and selects a muscle with such representative power;
b) along the timeline of the video, the selected facial muscle produces a definite motion trajectory over time, which is recorded as the facial muscle movement map;
c) according to the facial muscle movement map, a sliding window is slid from the first frame to detect regions with obvious amplitude change, and the start point and end point of each such region are recorded, namely the timestamps mentioned above;
d) the region duration determined by the timestamps and the conventional video activity detection length are linearly mapped into the final facial activity length.
In step 4), the VGG16 network is taken as the basic framework, the 2D convolutions are expanded into 3D convolutions, and the classification layer of the fully connected layers is modified into two classes, namely Parkinson's patients with hypomimia and healthy subjects, forming the VGGV network. Using the VGGV network, the image sequences of the two channels above are characterized as two 4096-dimensional variable representations. This process is trained on a labeled dataset, in which Parkinson's patients with hypomimia are labeled 1 and healthy control subjects are labeled 0; supervised training is performed on these labeled data, and the trained result can be used to represent a video segment containing facial activity.
The VGGV network is used for facial activity representation extraction and detection, specifically:
A) the VGG16 network is used as the basic framework, which comprises five convolutional feature layers and three fully connected layers;
B) to process multi-frame image data, the network is extended to a 3D convolutional network;
C) the present invention classifies Parkinson's patients with hypomimia and normal control subjects into two classes; therefore, the final fully connected layer is modified into a two-class output.
The method for detecting hypomimia of a Parkinson's patient based on the VS-C3D network further comprises the following step 6):
visualizing the detection result obtained in the step 5).
Compared with the prior art, the invention has the following advantages:
The present invention enables a shift from identifying Parkinson's hypomimia from static images to exploring hypomimia patterns from dynamic video segments. The method not only generates dynamic variable-length video segments and extracts a spatio-temporal representation that captures the progressive rigidification process of Parkinson's hypomimia, but also visualizes some pathological phenomena.
The invention, based on the VS-C3D network, detects Parkinson's hypomimia symptoms from both the temporal and the spatial dimensions, and experiments show that the method can detect Parkinson's patients with hypomimia among people with unknown disease status with 94.74% accuracy. The method thus realizes high-accuracy identification of the hypomimia symptom of Parkinson's disease.
Drawings
Fig. 1 is an overall framework diagram of the method for detecting hypomimia of Parkinson's patients based on the VS-C3D network, wherein MTCNN is a face detection algorithm for extracting faces from an image sequence, the VEL algorithm is an algorithm for extracting variable-length video segments, VGGV is a 3D convolutional network adapted from the VGG16 basic framework, softmax denotes the softmax function, which maps its inputs to real numbers between 0 and 1 normalized so that they sum to 1, and ReLU denotes the linear rectification function (Rectified Linear Unit, ReLU);
FIG. 2 is a facial muscle movement map required for the execution of the VEL algorithm in the method of the present invention; the abscissa is the video frame number, which represents the flow of time, and the ordinate is the FECF, which records the local facial muscle movement amplitude, so the figure shows how the FECF changes over time; the occluded part in FIG. 2 does not affect the meaning of the whole image;
FIG. 3 is the VGGV network framework diagram for facial video spatio-temporal feature extraction and Parkinson detection in the method of the present invention, wherein L(i, j) represents the length of the j-th facial activity in the i-th video, "Conv3D 3 × 3 × 3 stride 1" represents a 3D convolution with a 3 × 3 × 3 kernel and a stride of 1, "maxpool3D 2 × 2 × 2 stride 2" represents a 3D pooling layer with a 2 × 2 × 2 kernel and a stride of 2, FC represents a fully connected layer, and ReLU represents the linear rectification function (Rectified Linear Unit, ReLU);
FIG. 4 is a thermodynamic diagram of decision support for healthy control subjects;
fig. 5 is a thermodynamic diagram of parkinson patient decision support.
Detailed Description
As shown in fig. 1, a method for detecting hypomimia of a Parkinson's patient based on a VS-C3D network comprises the following steps:
1) face detection
The MTCNN algorithm is adopted to extract the face region from the original facial video, remove objects irrelevant to the detection target, such as background, clothes and chairs, and convert the facial activity video into a serial image sequence containing only the face;
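As one possible implementation of this step, the MTCNN model from the facenet-pytorch package can crop the face from every frame; the patent names the MTCNN algorithm but no specific library, so the library choice and the margin/size parameters below are assumptions.

```python
# Face detection and cropping with facenet-pytorch's MTCNN (library choice and
# parameters are assumptions; the patent only specifies the MTCNN algorithm).
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

def crop_face_sequence(video_path, size=224):
    mtcnn = MTCNN(image_size=size, margin=20, post_process=False)
    cap = cv2.VideoCapture(video_path)
    faces = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        face = mtcnn(rgb)          # cropped 3 x size x size face tensor, or None
        if face is not None:
            faces.append(face)     # keep only frames in which a face was found
    cap.release()
    return faces                   # serial image sequence containing only the face
```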
2) two-stream channel (two-stream)
The method detects the Parkinson's hypomimia symptom from two aspects through a two-stream framework: the serial image sequence extracted by face detection is used directly as the first channel of the two-stream channel, namely the spatial RGB image sequence; the second channel of the two-stream channel, namely the temporal optical flow image sequence, is then obtained by extracting optical flow maps from the same serial image sequence, where each optical flow map represents the pixel changes between two consecutive images;
3) VEL algorithm extraction of variable length video clips
The invention proposes the VEL algorithm, which detects the timestamps of facial activity according to the facial muscle movement map (fig. 2) of the spatial RGB image sequence of the first channel, and uses these timestamps to divide the spatial RGB image sequence and the temporal optical flow image sequence of the two channels into a series of video segments containing facial activity. The specific process is as follows:
the VEL algorithm is built based on the following four settings:
setting 1: the video data set frame rates disclosed by the deep network are all the same, and the accuracy of synchronization is one frame, corresponding to 1/t second.
Setting 2: the ratio of the public dataset frame rate to the frame rate of the compared video is t/f, where f is the frame rate of the compared video.
The VEL algorithm dynamically captures the video region of facial activity, taking the conventional 3D convolution input as a reference. The conventional 3D convolution input has a fixed length of 16 frames, from which it can be inferred that, in a video of t frames per second, a video segment of at least 16 frames can record one behavioral activity. Following this idea, a lower limit on the video segment length is set with 16 frames as the reference. According to settings 1 and 2, the frame rate of the public dataset is t frames per second, with a lower limit of 16 frames on the video length, while the frame rate of the video dataset to be detected is f frames per second, as the compared video in setting 2. Thus, by a linear transformation, the lower limit of the length of a video containing facial activity is 16 × t/f, which ensures that at least one complete facial activity can be accommodated in a segment.
The minimum video length can be formulated as:
Min_Seg = 16 × t / f
where Min_Seg represents the lower limit of the video length, t is the default public dataset frame rate, and f is the frame rate of the video to be detected.
Setting 3: hypomimia is not a calm, neutral expression; it can be accompanied by synchronous movements of the facial muscles that produce changes of a certain magnitude.
Hypomimia, a weakly expressed facial expression, means that PD patients show reduced expression amplitude and expression velocity when they display an expression, not that they show no expression at all. According to setting 3, hypomimia is not a calm neutral expression; it occurs together with movements of the facial muscles and produces changes of a certain amplitude. This means that the course of change of hypomimia can be quantitatively expressed and captured by supervising some simple muscle movements. Fig. 2 shows such a graph of eye muscle movement: a peak region with larger fluctuation (in the yellow box) means that a facial activity is occurring, for example, the rising process indicates opening the eyes and the falling process indicates closing the eyes. Meanwhile, small-amplitude jitter (the red box region) is regarded as irregular twitching of the facial muscles while an unchanged neutral expression is maintained. The purpose of the VEL algorithm is to remove the expressionless regions and to capture the start and end points of facial activity accurately. Assume that the set of segment lengths containing facial activity is denoted L_FA ∈ R^{M×1} and that the paired time points of facial activity are denoted (mstart_j, mend_j). Then, in a long video, the number of video frames F(Seg(FECF), j) occupied by the j-th facial activity is formulated as the p-norm of the pair (mstart_j, mend_j):
F(Seg(FECF), j) = ||mend_j − mstart_j||_p
L_FA = [F(Seg(FECF), 1), F(Seg(FECF), 2), …, F(Seg(FECF), M)]^T
where mstart_j and mend_j are the frame numbers at which the j-th activity starts and ends respectively, p denotes a norm, FECF denotes the facial expression change factor, Seg() denotes a video segment, Seg(FECF) denotes the set of start and end point pairs of facial activities detected from the motion amplitude map of a given FECF in the segment, F(Seg(FECF), j) represents the length of the image sequence occupied by the j-th facial activity, L_FA ∈ R^{M×1} is the set of image-sequence lengths occupied by the M facial activities, FA stands for facial activity, and M is the number of facial activities contained in one video segment.
Setting 4: the changes of the facial muscles are coordinated and uniform, and the change of a local region is consistent with the change of the face as a whole.
According to setting 4, all facial muscles move simultaneously when a facial activity occurs, so the interval of the overall facial movement can be determined from a local muscle movement map. A long video may contain several facial activities arranged in time order. Starting from the first frame, a sliding window of size γ slides forward; when the whole region inside the window is detected to be in a rising or falling trend, the first point of the window is taken as the starting point, because the region where a facial activity occurs is rising or falling. When an activity has ended and no new facial activity follows, only small change amplitudes are detected in the window, and that point is taken as the end point. Seg(FECF) is the resulting set of paired time points.
[The original contains an equation image here that defines Seg(FECF) and key_value via this sliding window: mstart_j is the first frame in the window at which the change of the FECF exceeds the threshold μ, and mend_j is the first subsequent frame at which the change falls back below μ.]
In these definitions, γ is the size of the sliding window, M is the number of facial activities contained in a motion amplitude map, N+ is the set of positive integers, j is the index of the facial activity currently being detected, FECF is the facial expression change factor recording the local facial muscle movement amplitude, m is the left edge position of the current sliding window, k is the index of the k-th frame within the sliding window, Fir(x) returns the first time point satisfying condition x, and μ is a threshold. A key_value entry of 1 indicates that the image sequence after this frame is a valid activity occurrence region, until a key_value entry of 0 is encountered; similarly, a key_value entry of 0 indicates that no valid activity occurs. f_i denotes the i-th frame of the video and F_max the total number of frames in the video clip.
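A minimal sketch of this sliding-window detection, assuming the FECF curve is available as a one-dimensional array; the concrete trend test standing in for Fir() and the example values of γ and μ are assumptions made for illustration.

```python
# Sliding-window detection of facial-activity start/end points on the FECF curve
# (the mean-of-differences trend test and the gamma/mu values are assumptions).
import numpy as np

def detect_facial_activities(fecf, gamma=8, mu=0.05):
    diffs = np.abs(np.diff(fecf))                 # frame-to-frame change of the FECF
    segments, start = [], None
    for m in range(len(diffs) - gamma + 1):
        active = diffs[m:m + gamma].mean() > mu   # obvious rise/fall inside the window
        if active and start is None:
            start = m                             # first point of a facial activity
        elif not active and start is not None:
            segments.append((start, m + gamma - 1))   # activity ended: record the pair
            start = None
    if start is not None:
        segments.append((start, len(fecf) - 1))
    return segments                               # Seg(FECF): paired (mstart_j, mend_j)
```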
L_FA ∈ R^{M×1} is an M × 1 matrix; it is transposed and replicated along the N dimension to form an N × M matrix L_FA ∈ R^{N×M}, which is used to represent the segment lengths of several videos. At the same time, key_value is also extended to the N dimension; the element-wise product of L_FA and key_value filters out video segments with no apparent activity:
L_FA = [L_FA(1), L_FA(2), …, L_FA(N)]^T ∈ R^{N×M}
where L_FA denotes the set of all video segment lengths, N is the total number of videos, T denotes the transpose, M is the number of facial activities contained in one video, and i indexes the currently processed video; the other symbols are as explained in the formulas above.
Finally, Min_Seg guarantees the minimum video segment length, ensuring that a segment contains enough motion information, while L_FA is the set of lengths of the video clips that contain facial activity. Considering both factors, the variable video length is defined as:
L_{N×M} = α · Min_Seg · A_{N×M} + (1 − α) · (L_FA .* key_value)
where L is the final video segment length, L_{N×M} is an N × M matrix representing the lengths of video segments for N videos, each containing at most M facial activities, A_{N×M} is an N × M all-ones auxiliary matrix, α is an adjustment coefficient, and .* denotes the element-wise product.
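A small numeric sketch of how formula (1) could be evaluated; the frame rates, α and the example activity lengths are illustrative values, not values fixed by the patent.

```python
# Evaluating the variable segment length L_{N*M} of formula (1)
# (t, f, alpha and the example lengths are illustrative assumptions).
import numpy as np

def variable_segment_lengths(l_fa, key_value, t=25, f=30, alpha=0.5):
    # l_fa, key_value: N x M arrays of detected activity lengths and 0/1 validity flags
    min_seg = int(np.ceil(16 * t / f))        # Min_Seg = 16*t/f (rounding up is an assumption)
    a = np.ones_like(l_fa)                    # auxiliary all-ones matrix A_{N*M}
    return alpha * min_seg * a + (1 - alpha) * l_fa * key_value   # element-wise product

# Example: one video (N=1) with three detected activities of 20, 34 and 12 frames,
# the last of which is flagged as containing no valid activity.
lengths = variable_segment_lengths(np.array([[20.0, 34.0, 12.0]]),
                                   np.array([[1.0, 1.0, 0.0]]))
```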
4) Video segment feature representation training
The method takes the VGG16 network as the basic framework, expands the 2D convolutions into 3D convolutions, and modifies the classification layer of the fully connected layers into two classes, namely Parkinson's patients with hypomimia and healthy subjects, forming the VGGV network. Using the VGGV network, the image sequences of the two channels above are characterized as two 4096-dimensional variable representations. This process is trained on a labeled dataset, in which Parkinson's patients with hypomimia are labeled 1 and healthy control subjects are labeled 0; supervised training is performed on these labeled data, and the trained result can represent a video segment containing facial activity;
5) Parkinson's hypomimia detection
The trained network is applied to the input video segments to detect whether the subject to be examined shows the Parkinson's hypomimia symptom, specifically described as follows:
To characterize variable-length video segments, the invention proposes a two-stream VGGV network (fig. 3), which uniformly represents variable-length segments as a fixed-dimension deep representation. In order to extract highly expressive activity features, a deep convolutional neural network is selected to train on the segments, with the VGG16 network as the basic framework. A typical VGG16 framework with 3 × 3 kernels can only perform operations such as convolution and pooling on two-dimensional planes. However, to explore the rigidification process of hypomimia, the present invention also convolves along the time dimension, so the input is a multi-frame image sequence that includes the time dimension. To address this, all convolution kernels in the first through fifth layers of VGG16 are extended to 3 × 3 × 3, and the corresponding max pooling kernels are extended to 2 × 2 × 2, in order to extract features of the facial activity process; the convolution stride is 1 and the pooling stride is 2. Passing through the VGGV convolutional layers, an image sequence is converted from 3 × L(i,j) × 224 × 224 dimensions into a compressed spatio-temporal feature tensor (the exact output dimensions are given by equation images in the original), where 3 denotes the three RGB channels, 224 × 224 denotes the image height and width, and L(i,j) is the element of L_{N×M} representing the length of the image sequence occupied by the j-th activity in the i-th video. In addition, an optical flow input is added to the network in parallel, forming a two-stream neural network. The optical flow map contains detailed motion information between two consecutive frames, which provides extremely important information about the evolution of hypomimia. Combined with the ordinary RGB image sequence, the two-stream VGGV network extracts complementary information from the spatial RGB image sequence and the temporal optical flow sequence respectively; the two-stream framework not only encodes the facial structure but also effectively captures activity information by exploiting the continuity of the video. In order to let the temporal and spatial representations share one feature extraction model, the invention fuses the two deep convolutional networks into one. Finally, the VGGV_layer7 fully connected layer outputs a 4096-dimensional vector as the temporal or spatial representation of the facial activity. Combining the two representations, a variable-length video segment is represented in a 4096 × 2 = 8192-dimensional vector space. Finally, the method fuses the two channels at the softmax layer, makes a prediction on the fused result, and judges whether the subject has Parkinson's disease.
6) Visualization of the Parkinson's hypomimia detection decision
The Grad-CAM algorithm is applied to the VGGV network to visualize VGGV network decisions, which can be used to analyze the distinction between Parkinson's patients and normal persons.
The VGGV network structure consists of two parts, namely a feature layer and a classifier layer, and the specific steps are as follows:
a) the basic structure of the feature layer composite component is formed by stacking the following components in sequence: a 3 × 3 × 3 3D convolution kernel, a batch normalization layer (BatchNorm), a ReLU activation layer, and a 3 × 3 × 3 max pooling layer; this composite component is denoted Block 1;
b) the basic structure of the feature layer composite component is formed by stacking the following components in sequence: a 3 × 3 × 3 3D convolution kernel, a batch normalization layer (BatchNorm), a ReLU activation layer, and a 3 × 3 × 3 max pooling layer; this composite component is denoted Block 2;
c) the basic structure of the classifier layer composite component is formed by stacking the following components in sequence: a linear mapping layer and a ReLU activation layer; this composite component is denoted Block 3;
d) the basic structure of the classifier layer composite component 2 is formed by stacking the following components in sequence: a linear mapping layer and a Softmax function; this composite component is denoted Block 4.
e) The VGGV network is formed by sequentially connecting the following components: the feature layer composite components, the classifier layer composite component, and the classifier layer composite component 2.
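A condensed PyTorch sketch of a network assembled from such blocks follows; the channel widths, the 2 × 2 × 2 pooling taken from fig. 3, the time-preserving first pooling, and the average pooling before the classifier are assumptions made to keep the example short, whereas the full VGGV keeps the five convolutional stages of VGG16 and a 4096-dimensional fully connected representation.

```python
# Minimal VGG16-style 3D network in the spirit of Blocks 1-4 above (layer sizes
# and pooling choices are assumptions for brevity, not the exact VGGV).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool_time=2):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        # fig. 3 uses 2x2x2 pooling with stride 2; ceil_mode keeps short clips from collapsing
        nn.MaxPool3d(kernel_size=(pool_time, 2, 2), stride=(pool_time, 2, 2), ceil_mode=True),
    )

class VGGV(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, pool_time=1),    # keep the time dimension in the first stage
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
            conv_block(512, 512),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),           # collapse the remaining spatio-temporal size
            nn.Flatten(),
            nn.Linear(512, 4096), nn.ReLU(inplace=True),   # 4096-d segment representation
            nn.Linear(4096, num_classes),      # Block 4's softmax is applied at fusion time
        )

    def forward(self, x):                      # x: (batch, 3, L, 224, 224)
        return self.classifier(self.features(x))
```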
Specifically, the method comprises the following steps:
first, data preprocessing
1) Face detection
In a practical implementation, the trained MTCNN algorithm is first called to perform face detection on the video frames and to crop away all regions other than the face. This excludes the influence of factors other than the face on the VGGV network. In fact, experiments show that if factors other than the face are not removed, the VGGV network easily learns from these adverse factors. For example, if the VGGV network learns background knowledge during training, it will not detect Parkinson's hypomimia according to the facial expression activity of the face, but will instead make its decision from the background.
2) Optical flow map extraction
The optical flow map is calculated from the relative movement of pixels at the same positions in two consecutive RGB images; N − 1 consecutive optical flow maps can be extracted from a sequence of N consecutive spatial images.
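For example, the N − 1 optical flow maps could be computed with OpenCV's dense Farneback estimator; the patent does not name a particular optical-flow algorithm, so this estimator and its parameters are assumptions.

```python
# Dense optical flow between consecutive RGB face frames using OpenCV's Farneback
# method (algorithm choice and parameters are assumptions).
import cv2

def optical_flow_sequence(rgb_frames):
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in rgb_frames]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)         # H x W x 2 per-pixel displacement between the two frames
    return flows                   # N - 1 flow maps for N input frames
```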
3) Video segment segmentation
Video segment segmentation is performed according to the VEL algorithm provided above.
Second, model training
Based on the video clips segmented by the VEL algorithm, Parkinson's patients are labeled 1 and healthy control subjects are labeled 0. These labeled video segments are then input to the VGGV network and trained until the final model converges.
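A minimal supervised training loop over the VEL-segmented clips might look as follows; the optimizer, learning rate and number of epochs are assumptions, since the patent only specifies the 1/0 labels and training until convergence.

```python
# Supervised training on labeled clips (1 = Parkinson's patient with hypomimia,
# 0 = healthy control); optimizer and hyperparameters are assumptions.
import torch
import torch.nn as nn

def train(model, loader, epochs=30, lr=1e-4, device="cuda"):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for clips, labels in loader:           # clips: (B, 3, L, 224, 224), labels: 0/1
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
    return model
```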
Third, detection of hypomimia in Parkinson's disease
The facial expression video of the person to be examined is input into the network trained in the previous step; a person with a prediction result of 1 is a Parkinson's patient with the hypomimia symptom, otherwise the person is healthy.
Fourth, decision visualization
As shown in FIG. 1, the VS-C3D network decision visualization is performed by incorporating the Grad-CAM algorithm. In order to visualize the key decision regions, the invention back-propagates the maximum probability of the classification layer, sets the prediction probabilities of the other classes to zero, and then multiplies the feature maps of the last convolutional layer by the average weights of their gradients to obtain the corresponding thermodynamic (heat) map. Experiments show that the heat map reveals the range of flexible facial activity: for healthy people the red regions of the heat map cover the facial area fairly uniformly, whereas for Parkinson's patients with hypomimia the red regions cover only part of the face.
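A compact sketch of this Grad-CAM step using PyTorch hooks on the last convolutional layer: only the maximum-probability class is back-propagated and the gradients are averaged into per-channel weights, as described above; the final normalization is an assumption added for display.

```python
# Grad-CAM over the last 3D feature map (e.g. target_layer = model.features[-1][0]
# for the VGGV sketch above); normalization at the end is an added assumption.
import torch
import torch.nn.functional as F

def grad_cam(model, clip, target_layer):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    logits = model(clip.unsqueeze(0))              # clip: (3, L, 224, 224)
    cls = logits[0].argmax()                       # keep only the max-probability class;
    model.zero_grad()                              # the other class's score is left at zero
    logits[0, cls].backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3, 4), keepdim=True)    # average gradient per channel
    cam = F.relu((weights * acts["a"]).sum(dim=1)).detach()   # (1, L', H', W') heat map
    return cam / (cam.max() + 1e-8)
```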
Fig. 4 is the decision-support heat map for a healthy control subject; the first to sixth panels are facial heat maps of six consecutive facial activities in a video, where the heat overlap region indicates the salient region of facial activity and the six consecutive images show how this salient region changes over time. As can be seen, the heat regions of the healthy control subject are distributed fairly uniformly, basically extending over the whole face, and change obviously over time rather than being fixed in one area, which shows that the facial muscle activity of the healthy control subject is relatively flexible, the expression changes are obvious, and there is no hypomimia.
Fig. 5 is the decision-support heat map for a Parkinson's patient; the first to sixth panels are facial heat maps of six consecutive facial activities in a video, the heat overlap region in each panel indicating the salient region of facial activity, and the six consecutive images showing how this salient region changes over time. As can be seen, the heat regions in the Parkinson's patient's heat map are more concentrated, usually on a fixed facial region, and do not change or shift much over time, indicating that the facial muscles of the Parkinson's patient are stiff and cannot be flexibly mobilized to generate effective facial activity.

Claims (6)

1. A method for detecting facial expression hypomimia of a Parkinson's patient, characterized by comprising the following steps:
1) detecting a human face;
2) after face detection, inputting the captured face image sequence into a two-stream channel, the first channel of the two-stream channel directly receiving the face image sequence to form a spatial RGB image sequence, and the second channel of the two-stream channel converting the face image sequence into a temporal optical flow image sequence through an optical flow image extractor;
3) segmenting the spatial RGB image sequence obtained in step 2) through a VEL algorithm to obtain spatially variable-length video segments, forming the starting point of each facial activity, and segmenting the temporal optical flow image sequence obtained in step 2) using the starting point of each facial activity to obtain variable-length optical flow video segments;
4) training on the spatially variable-length video segments and the variable-length optical flow video segments obtained in step 3) through a 3D convolutional neural network, to obtain a trained network for judging whether facial expression hypomimia is present;
5) detecting the facial expression hypomimia condition of the Parkinson's patient using the trained network of step 4), to obtain a detection result.
2. The method for detecting facial expression hypomimia of a Parkinson's patient as claimed in claim 1, wherein in step 1), the face detection comprises:
extracting the face region from the original facial video and converting the facial activity video into a sequence containing only face images.
3. The method for detecting facial expression hypomimia of a Parkinson's patient as claimed in claim 1, wherein in step 1), the face detection adopts the MTCNN algorithm.
4. The method for detecting facial expression hypomimia of a Parkinson's patient as claimed in claim 1, wherein in step 3), the VEL algorithm is represented by formula (1):
L_{N×M} = α · Min_Seg · A_{N×M} + (1 − α) · (L_FA .* key_value)    (1)
In formula (1), L_{N×M} represents the lengths of N video segments, each containing at most M facial activities; Min_Seg represents the lower limit of the video length, Min_Seg = 16 × t / f, where t is the default public dataset frame rate and f is the frame rate of the video to be detected; A_{N×M} is an N × M matrix whose elements are all 1, used as an auxiliary matrix; α is an adjustment coefficient; L_FA represents the set of all video segment lengths; key_value is a set containing 1 and/or 0 elements; and .* denotes the element-wise product.
5. The method for detecting facial expression hypomimia of a Parkinson's patient as claimed in claim 4, wherein an element of key_value equal to 1 indicates that the image sequence in the corresponding facial activity segment is a valid activity occurrence region, and an element of key_value equal to 0 indicates that the image sequence in the corresponding facial activity segment contains no valid activity occurrence region.
6. The method for detecting facial expression hypomimia of a Parkinson's patient as claimed in claim 1, further comprising step 6): visualizing the detection result obtained in step 5).
CN202010010215.7A 2020-01-06 2020-01-06 Method for detecting facial expression hypomimia of a Parkinson's patient Active CN111210415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010010215.7A CN111210415B (en) Method for detecting facial expression hypomimia of a Parkinson's patient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010010215.7A CN111210415B (en) Method for detecting facial expression hypomimia of a Parkinson's patient

Publications (2)

Publication Number Publication Date
CN111210415A true CN111210415A (en) 2020-05-29
CN111210415B CN111210415B (en) 2022-08-23

Family

ID=70789713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010010215.7A Active CN111210415B (en) Method for detecting facial expression hypomimia of a Parkinson's patient

Country Status (1)

Country Link
CN (1) CN111210415B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130300900A1 (en) * 2012-05-08 2013-11-14 Tomas Pfister Automated Recognition Algorithm For Detecting Facial Expressions
US20180075306A1 (en) * 2016-09-14 2018-03-15 Canon Kabushiki Kaisha Temporal segmentation of actions using context features
WO2018119807A1 (en) * 2016-12-29 2018-07-05 浙江工商大学 Depth image sequence generation method based on convolutional neural network and spatiotemporal coherence
US20190311188A1 (en) * 2018-12-05 2019-10-10 Sichuan University Face emotion recognition method based on dual-stream convolutional neural network
CN110175596A (en) * 2019-06-04 2019-08-27 重庆邮电大学 The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033359A (en) * 2021-03-12 2021-06-25 西北大学 Self-supervision-based pre-training and facial paralysis grading modeling and grading method and system
CN113033359B (en) * 2021-03-12 2023-02-24 西北大学 Self-supervision-based pre-training and facial paralysis grading modeling and grading method and system
CN113901915A (en) * 2021-10-08 2022-01-07 无锡锡商银行股份有限公司 Expression detection method for light-weight network and Magface in video
CN113901915B (en) * 2021-10-08 2024-04-02 无锡锡商银行股份有限公司 Expression detection method of light-weight network and MagFace in video
WO2023229991A1 (en) * 2022-05-23 2023-11-30 Aic Innovations Group, Inc. Neural network architecture for movement analysis
CN116392086A (en) * 2023-06-06 2023-07-07 浙江多模医疗科技有限公司 Method, system, terminal and storage medium for detecting stimulus
CN116392086B (en) * 2023-06-06 2023-08-25 浙江多模医疗科技有限公司 Method, terminal and storage medium for detecting stimulation

Also Published As

Publication number Publication date
CN111210415B (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN111210415B (en) Method for detecting facial expression hypomimia of a Parkinson's patient
Liao et al. Deep facial spatiotemporal network for engagement prediction in online learning
CN106778687B (en) Fixation point detection method based on local evaluation and global optimization
CN109886986A (en) A kind of skin lens image dividing method based on multiple-limb convolutional neural networks
CN110826389B (en) Gait recognition method based on attention 3D frequency convolution neural network
CN111539331B (en) Visual image reconstruction system based on brain-computer interface
CN112115775A (en) Smoking behavior detection method based on computer vision in monitoring scene
CN112395442A (en) Automatic identification and content filtering method for popular pictures on mobile internet
Yue et al. Deep super-resolution network for rPPG information recovery and noncontact heart rate estimation
CN113158905A (en) Pedestrian re-identification method based on attention mechanism
Dutta Facial pain expression recognition in real-time videos
CN111881818B (en) Medical action fine-grained recognition device and computer-readable storage medium
Syu et al. Psoriasis detection based on deep neural network
CN112488165A (en) Infrared pedestrian identification method and system based on deep learning model
CN115546491A (en) Fall alarm method, system, electronic equipment and storage medium
Liang et al. An adaptive viewpoint transformation network for 3D human pose estimation
Zhang An intelligent and fast dance action recognition model using two-dimensional convolution network method
Sun et al. Faketransformer: Exposing face forgery from spatial-temporal representation modeled by facial pixel variations
CN115147636A (en) Lung disease identification and classification method based on chest X-ray image
Panicker et al. Cardio-pulmonary resuscitation (CPR) scene retrieval from medical simulation videos using local binary patterns over three orthogonal planes
Angusamy et al. Human Emotion Detection using Machine Learning Techniques
Yang et al. Model-agnostic Method: Exposing Deepfake using Pixel-wise Spatial and Temporal Fingerprints
Wang et al. Hierarchical Style-Aware Domain Generalization for Remote Physiological Measurement
CN113408389A (en) Method for intelligently recognizing drowsiness action of driver
Kang et al. Research on a microexpression recognition technology based on multimodal fusion

Legal Events

Code	Description
PB01	Publication
SE01	Entry into force of request for substantive examination
GR01	Patent grant