CN115909438A - Pain expression recognition system based on depth time-space domain convolutional neural network - Google Patents


Info

Publication number
CN115909438A
Authority
CN
China
Prior art keywords: layer, feature map, dimensional, neural network, pain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211313169.3A
Other languages
Chinese (zh)
Inventor
缪长虹
陈万坤
吴晗
陈昭媛
蒋怡
高沈佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongshan Hospital Fudan University
Original Assignee
Zhongshan Hospital Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongshan Hospital Fudan University filed Critical Zhongshan Hospital Fudan University
Priority to CN202211313169.3A
Publication of CN115909438A
Legal status: Pending


Abstract

The invention discloses a pain expression recognition system based on a deep time-space domain convolutional neural network, comprising: a video library of facial expressions of pain patients; a video clip preprocessing unit; and a two-channel three-dimensional convolutional neural network. Building on a variety of pain-specific feature maps, the invention combines facial expression recognition with a deep time-space domain convolutional neural network model to construct a more accurate pain expression recognition model. This overcomes the problems that optimization training of a DCNN model requires the support of a large-scale labeled data set while existing facial pain expression image data sets contain too few samples, makes full use of existing large-scale labeled image data to automatically learn feature representations of images, and effectively improves the efficiency and accuracy of pain degree assessment.

Description

Pain expression recognition system based on depth time-space domain convolutional neural network
Technical Field
The invention relates to an expression recognition system, in particular to a pain expression recognition system based on a deep time-space domain convolutional neural network.
Background
Pain assessment is an important component of pain management and mainly involves two mainstream approaches: self-assessment and observer assessment. Self-assessment is convenient but subjective, and is currently the most widely used assessment method. However, self-assessment cannot guarantee that every assessment is accurate and reliable, and some special groups (such as dementia patients, neonates, patients with mental impairment or patients in intensive care) cannot accurately express their degree of pain. For such special populations, observer assessment may be more effective than self-assessment. However, observer assessment depends on continuous observation and identification by professionals; it is inefficient and imposes a heavy burden on hospital staff. Therefore, a method that automatically recognizes pain from the patient's expression is of great importance for pain assessment.
In recent years, with the development of machine learning and computer vision, the accuracy and efficiency of facial expression recognition have improved continuously. In particular, the Facial Action Coding System (FACS) was established and six basic human expressions were defined: happiness (Happy), anger (Angry), surprise (Surprise), fear (Fear), disgust (Disgust) and sadness (Sad). Describing the relationship between facial movements and facial expressions with the FACS system brought breakthrough progress in research on basic facial expressions. Although the pain expression is not one of the basic expressions, pain is nevertheless reflected in facial expressions; treating the pain expression as a special, more complex facial expression makes recognizing pain from facial information a feasible approach. In 1991, Craig et al. pioneered the study of facial reactions caused by sharply exacerbated chronic low back pain, opening the door to the field of facial pain expression recognition. Prkachin and Solomon studied the relationship between pain and facial action units and proposed the Prkachin and Solomon Pain Intensity (PSPI) metric, further promoting the development of facial pain expression recognition.
Disclosure of Invention
The technical problem to be solved by the invention is as follows. In current clinical practice, assessment tools such as the Neonatal Facial Coding System (NFCS) and the Neonatal Infant Pain Scale (NIPS) are applied manually by trained medical personnel, and "facial expression" serves as an important monitoring index in these tools. However, manual assessment is not only time-consuming and labor-intensive, but its results also depend on the experience of the medical staff and are influenced by subjective factors such as individual mood.
In order to solve the above technical problem, the technical solution of the invention is to provide a pain expression recognition system based on a deep time-space domain convolutional neural network, comprising the following components:
a pain patient facial expression video library, used for training the two-channel three-dimensional convolutional neural network, in which all video segments of injured patients in different states are divided into N types of expressions according to pain degree, so that each video segment is marked with an expression label representing its pain degree;
a video clip preprocessing unit, used for cutting a video clip into a frame sequence of length l frames, graying each frame image, and extracting a depth global likelihood value mode QCLP-n feature map from each grayed image, where n is a positive integer;
a two-channel three-dimensional convolutional neural network, comprising a feature extraction part and a feature fusion and classification recognition part, wherein:
the feature extraction part comprises two mutually independent three-dimensional convolutional neural network channels; one channel processes the l-frame grayscale image sequence obtained by the video clip preprocessing unit, the other channel processes the l-frame QCLP-n feature map sequence obtained by the video clip preprocessing unit, and the two channels each output an n4-dimensional feature vector;
the feature fusion and classification recognition part concatenates the two n4-dimensional feature vectors into a 2n4-dimensional feature vector and inputs it into the classifier; the classifier outputs an n-dimensional column vector in which each component represents the probability that the input image data belongs to the corresponding class, and the dimension with the maximum probability is the predicted class of the image data input into the two-channel three-dimensional convolutional neural network.
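As an illustrative sketch only, and not part of the claimed invention, the following Python/OpenCV code shows how the video clip preprocessing unit could cut a clip into an l-frame grayed sequence. The clip length l = 5, the frame size 112 × 112 and the helper names preprocess_clip and extract_qclp are assumptions, and the patent-specific QCLP-n descriptor is left as a placeholder stub.

```python
# Illustrative sketch (not the patented implementation): cut a clip into an
# l-frame grayscale sequence. The QCLP-n descriptor is specific to this patent,
# so it is only stubbed here.
import cv2
import numpy as np

def preprocess_clip(video_path: str, l: int = 5, size=(112, 112)):
    """Return an (l, H, W) grayscale stack and its QCLP-n placeholder stack."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < l:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # graying processing
        frames.append(cv2.resize(gray, size))            # normalize to M x N pixels
    cap.release()
    gray_seq = np.stack(frames) if frames else np.zeros((0, *size), np.uint8)
    qclp_seq = extract_qclp(gray_seq)    # placeholder hook for the QCLP-n feature map
    return gray_seq, qclp_seq

def extract_qclp(gray_seq: np.ndarray) -> np.ndarray:
    # Placeholder: the "depth global likelihood value mode QCLP-n" feature map is
    # defined by the patent and not reproduced here; identity is used as a stub.
    return gray_seq.copy()
```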
Preferably, the video segments in the pain patient facial expression video library characterize the states of injured patients under different degrees of pain caused by pain-inducing procedures.
Preferably, the two three-dimensional convolutional neural network channels are identical in structure.
Preferably, the three-dimensional convolutional neural network channel comprises an input layer, convolutional layer one, pooling layer one, convolutional layer two, pooling layer two, convolutional layer three, pooling layer three, convolutional layer four, pooling layer four, convolutional layer five, pooling layer five, and a fully connected layer, wherein:
each frame of the grayscale image or QCLP-n feature map input at the input layer is normalized to M × N pixels, where M and N are positive integers;
in convolutional layer one, the l-frame grayscale image sequence or QCLP-n feature map sequence is convolved with n1 three-dimensional convolution kernels of size d1 × k1 × k1, and n1 feature map groups are output, each containing l1 feature maps of size h1 × w1, where d1 is the temporal size, k1 × k1 is the spatial size, and n1, d1, k1, l1, h1 and w1 are positive integers;
in pooling layer one, the feature map groups output by convolutional layer one are downsampled with a d2 × k2 × k2 pooling kernel, and n1 feature map groups are output, each containing l2 feature maps of size h2 × w2, where d2 is the temporal size, k2 × k2 is the spatial size, and d2, k2, l2, h2 and w2 are positive integers;
in convolutional layer two, the feature map groups output by pooling layer one are convolved with n2 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n2 feature map groups are output, each containing l2 feature maps of size h2 × w2, where n2 is a positive integer;
in pooling layer two, the feature map groups output by convolutional layer two are downsampled with a d2 × k2 × k2 pooling kernel, and n2 feature map groups are output, each containing l3 feature maps of size h3 × w3, where l3, h3 and w3 are positive integers;
in convolutional layer three, the feature map groups output by pooling layer two are convolved with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n3 feature map groups are output, each containing l3 feature maps of size h3 × w3, where n3 is a positive integer;
in pooling layer three, the feature map groups output by convolutional layer three are downsampled with a d2 × k2 × k2 pooling kernel, and n3 feature map groups are output, each containing l4 feature maps of size h4 × w4, where l4, h4 and w4 are positive integers;
in convolutional layer four, the feature map groups output by pooling layer three are convolved with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n3 feature map groups are output, each containing l4 feature maps of size h4 × w4;
in pooling layer four, the feature map groups output by convolutional layer four are downsampled with a d2 × k2 × k2 pooling kernel, and n3 feature map groups are output, each containing l5 feature maps of size h5 × w5, where l5, h5 and w5 are positive integers;
in convolutional layer five, the feature map groups output by pooling layer four are convolved with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n3 feature map groups are output, each containing l5 feature maps of size h5 × w5;
in pooling layer five, the feature map groups output by convolutional layer five are downsampled with a d2 × k2 × k2 pooling kernel, and n3 feature maps of size h6 × w6 are output, where h6 and w6 are positive integers;
in the fully connected layer, the output of pooling layer five is fully connected to the layer's n4 neurons, and an n4-dimensional feature vector is output, where n4 is a positive integer.
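The following PyTorch sketch is illustrative only and does not claim to be the patented configuration; the channel counts n1 = 32, n2 = 64, n3 = 128, n4 = 256 and kernel sizes d1 = k1 = 3, d2 = 1, k2 = 2 are example values taken from the ranges given in the detailed description, and nn.LazyLinear is used merely to avoid hard-coding the flattened size.

```python
# A minimal sketch of one three-dimensional CNN channel described above, assuming
# example values n1=32, n2=64, n3=128, n4=256, d1=k1=3, d2=1, k2=2.
import torch
import torch.nn as nn

class C3DChannel(nn.Module):
    def __init__(self, n1=32, n2=64, n3=128, n4=256):
        super().__init__()
        conv = lambda ci, co, pad: nn.Conv3d(ci, co, kernel_size=3, padding=pad)
        pool = lambda: nn.MaxPool3d(kernel_size=(1, 2, 2))   # d2 x k2 x k2 pooling kernel
        self.features = nn.Sequential(
            conv(1, n1, 0),  pool(),          # convolutional layer one / pooling layer one
            conv(n1, n2, 1), pool(),          # layer two (with zero padding)
            conv(n2, n3, 1), pool(),          # layer three
            conv(n3, n3, 1), pool(),          # layer four
            conv(n3, n3, 1), pool(),          # layer five
        )
        self.fc = nn.LazyLinear(n4)           # fully connected layer -> n4-dim vector

    def forward(self, x):                     # x: (batch, 1, l, M, N)
        x = self.features(x)
        return self.fc(torch.flatten(x, 1))   # n4-dimensional feature vector

# Example: C3DChannel()(torch.randn(2, 1, 5, 112, 112)).shape -> torch.Size([2, 256])
```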
Preferably, the classifier is a K-SoftMif regression classifier with n output nodes, each of which is fully connected to the 2n4-dimensional feature vector.
Preferably, when the two-channel three-dimensional convolutional neural network is trained, an l-frame grayscale image sequence and the corresponding QCLP-n feature map sequence are input into the network, and the network is trained and optimized with the back-propagation algorithm; training ends when the loss function value output by the classifier has decreased and converged, and the trained network model is saved.
Building on a variety of pain-specific feature maps, the invention combines facial expression recognition with a deep time-space domain convolutional neural network model to construct a more accurate pain expression recognition model. This overcomes the problems that optimization training of a DCNN model requires the support of a large-scale labeled data set while existing facial pain expression image data sets contain too few samples, makes full use of existing large-scale labeled image data to automatically learn feature representations of images, and effectively improves the efficiency and accuracy of pain degree assessment.
Drawings
FIG. 1 illustrates a network structure of a deep convolutional neural network;
FIG. 2 illustrates the receptive fields of nodes;
FIG. 3 illustrates an application scenario of the present invention;
FIG. 4 illustrates one particular implementation of the present invention;
FIG. 5 illustrates a specific structure of a three-dimensional convolutional neural network;
FIG. 6 illustrates the confusion matrix of the model of the invention on the MMD data set;
FIG. 7 illustrates a comparison of the average accuracy of the model of the invention with 3 other algorithms;
FIG. 8a illustrates the accuracy of the present invention;
FIG. 8b illustrates the recall of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
A deep convolutional neural network learns data features layer by layer through several alternating convolutional layers and pooling layers. A convolutional layer scans the whole image with a convolution kernel smaller than the image, computing at each position the weighted sum of the kernel and the local image patch. When the input is an image with a two-dimensional structure, the convolution operation processes the two-dimensional topology directly, reduces the number of weights, lowers the network complexity, and facilitates feature extraction and pattern classification. The outputs of a convolutional layer, often after discretization and normalization, are called feature maps, with one feature map per convolution kernel. The feature maps are then fed to the pooling layer for spatial subsampling; a straightforward approach is to average over the neighborhood around each input location, with a stride between 1 and the maximum neighborhood size. The pooling layer reduces the resolution of the output feature maps and lowers the sensitivity of the convolutional neural network to positional changes of the object to be recognized in the input image, giving the network a certain robustness to distortion. The network structure of the convolutional neural network is shown in FIG. 1.
In the deep convolutional neural network shown in fig. 1:
the input layer takes 2 adjacent frames X and Y as input, enabling the network to capture both the dynamic features between them in the time domain and the static image features in the space domain.
In the convolutional layer:
the convolution kernels are divided into 4 groups, with 2 groups of kernels corresponding to each frame. Writing one kernel from each group in matrix form as F_X, F̃_X, F_Y and F̃_Y, after training the pairs F_X, F̃_X and F_Y, F̃_Y automatically form pairs of orthogonal basis functions. The corresponding 4 feature maps can be written as X ∗ F_X, X ∗ F̃_X, Y ∗ F_Y and Y ∗ F̃_Y, where ∗ denotes convolution.
If the input image is N × N and the convolution kernel is K × K, the feature map produced by a valid convolution is of size (N − K + 1) × (N − K + 1). The three RGB channels of a color image are handled with a multi-channel convolution operation (i.e., 3D convolution). Bias parameters can also be added, turning the linear mapping into an affine one, and strides can be used to reduce the number of parameters.
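A quick numerical check of the (N − K + 1) rule; the use of PyTorch and the values N = 64, K = 5 are our own choices for illustration.

```python
# Check of the valid-convolution size rule (N - K + 1): N=64, K=5 -> 60.
import torch
import torch.nn as nn

N, K = 64, 5
x = torch.randn(1, 3, N, N)                      # RGB image: 3 input channels
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=K, bias=True, stride=1)
print(conv(x).shape)                             # torch.Size([1, 1, 60, 60]) = (N-K+1)
```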
The multiplication layer computes the element-wise product of 2 feature maps. The two feature maps taking part in the operation must come from the 2 different sets of feature maps and correspond to the adjacent frames X and Y respectively. The outputs of the multiplication layer are called product maps, and there are 2 groups of them.
The addition layer computes the element-wise sum of the 2 product maps, namely:
S = (X ∗ F_X) ⊙ (Y ∗ F_Y) + (X ∗ F̃_X) ⊙ (Y ∗ F̃_Y)
where ⊙ denotes the element-wise product.
In other words, summation over frames is replaced by a product of the filter responses of the different frames. This multiplication can be viewed as the outer product of the 2 vectorized images, i.e., the correlation of the 2 images, and can also be seen as a variant of the energy model. It is this correlation analysis that provides the time-space domain convolutional neural network with information about the transformation between adjacent frames.
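A sketch of the multiplication and addition layers described above; random kernels stand in for the trained pairs F_X, F̃_X, F_Y, F̃_Y, and the frame and kernel sizes are arbitrary.

```python
# Sketch of the multiplication and addition layers for adjacent frames X and Y,
# assuming one quadrature-like kernel pair per frame (Fx, Fx_t, Fy, Fy_t).
import torch
import torch.nn.functional as F

K = 9
X = torch.randn(1, 1, 64, 64)                    # frame X
Y = torch.randn(1, 1, 64, 64)                    # frame Y
Fx, Fx_t = torch.randn(1, 1, K, K), torch.randn(1, 1, K, K)
Fy, Fy_t = torch.randn(1, 1, K, K), torch.randn(1, 1, K, K)

rx, rx_t = F.conv2d(X, Fx), F.conv2d(X, Fx_t)    # filter responses of frame X
ry, ry_t = F.conv2d(Y, Fy), F.conv2d(Y, Fy_t)    # filter responses of frame Y

prod1, prod2 = rx * ry, rx_t * ry_t              # multiplication layer: 2 product maps
S = prod1 + prod2                                # addition layer: element-wise sum
print(S.shape)                                   # (1, 1, 56, 56): valid-convolution size
```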
According to the structure of the time-space domain convolution neural network, the network can give a plurality of response values on nodes mapped by inputting 2 continuous frames. Considering 1 node, the size of the receptive field of this node in X and Y is K, see FIG. 2.
In FIG. 2, the image inside the small rectangular box of the input layer is the visible range of the node. The node S_l is a scalar, which can be written as
S_l = (Σ_{i,j} F_X(i,j) X(i,j)) · (Σ_{i,j} F_Y(i,j) Y(i,j)) + (Σ_{i,j} F̃_X(i,j) X(i,j)) · (Σ_{i,j} F̃_Y(i,j) Y(i,j))    (1)
where i and j index positions inside the receptive field of node S_l. The convolution operation in equation (1) can also be written as a matrix-vector multiplication, since a two-dimensional discrete circular convolution can be implemented with a block circulant matrix.
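A small numerical check of the circulant-matrix view of circular convolution, shown in 1-D for brevity (the 2-D case uses a block circulant matrix); the signal length and kernel support are arbitrary.

```python
# Circular convolution by a kernel equals multiplication by a circulant matrix (1-D case).
import numpy as np
from scipy.linalg import circulant

n = 6
x = np.random.randn(n)
h = np.zeros(n); h[:3] = np.random.randn(3)       # kernel zero-padded to length n

C = circulant(h)                                  # circulant matrix built from the kernel
y_matrix = C @ x                                  # matrix-vector form of the convolution
y_direct = np.array([sum(h[k] * x[(i - k) % n] for k in range(n)) for i in range(n)])
print(np.allclose(y_matrix, y_direct))            # True
```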
As shown in FIG. 3, with the technical solution of the invention, facial expressions of the patient at different angles and different times can be captured automatically during a remote video call between doctor and patient, and the patient's pain level can be judged automatically by the facial pain recognition model and the analysis and matching results. Data mining techniques are used to explore, by algorithm, the target information hidden in a large amount of multi-source data. Key parts of the face (such as the eyebrow corners, nose tip and mouth corners) are captured and located with an ordinary webcam; 34 detection pixel points are distributed in total, the change in pixel position of each region is analyzed, and, combined with the facial expression activity measured in each region, a machine learning algorithm reflects the person's real emotional change. At present the invention models 16 base points that can trigger the pain expression, forming on this basis a library of 65536 pain expressions, and expresses the degree of the pain expression with a value in the range 0-100. An ROC curve is used to evaluate detection precision; the ROC score lies between 0 and 1, and the closer the value is to 1, the more accurate the detection. After sampling, an accurate comparison is formed and then matched against the pain level standard to produce a judgment of the pain condition.
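The arithmetic behind the library size can be illustrated as follows; this is a hypothetical illustration only, since the actual base-point encoding and the mapping to the 0-100 scale are not disclosed here.

```python
# Hypothetical illustration only: 16 binary base-point activations index a
# 2**16 = 65536 entry pain-expression library; the score mapping is an assumption.
base_points = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]   # 16 base points
index = sum(bit << i for i, bit in enumerate(base_points))        # library index 0..65535
score = 100 * sum(base_points) / len(base_points)                 # illustrative 0-100 scale
print(len(base_points), 2 ** 16, index, round(score, 1))          # -> 16 65536 8585 31.2
```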
As shown in fig. 4, the implementation of the present invention specifically includes the following steps:
Step 1: collect video segments characterizing the states of injured patients under different conditions such as mild pain and severe pain caused by pain-inducing procedures; medical workers divide the videos into N expressions such as calm, crying, mild pain and severe pain according to the pain degree, and a pain patient facial expression video library is established.
Step 2: cut each video segment in the pain patient facial expression video library into a frame sequence of length l frames, gray each frame image, and extract the depth global likelihood value mode QCLP-n feature map, where n is a positive integer selected from the values 22, 34 and 56.
Step 3: construct the two-channel three-dimensional convolutional neural network. In this embodiment, the constructed two-channel three-dimensional convolutional neural network is divided into two parts: the first part performs feature extraction; the second part performs feature fusion and classification recognition. The specific structures of the two parts are as follows:
the first part is composed of two mutually independent three-dimensional convolution neural network channels, the first three-dimensional convolution neural network channel processes a gray scale image sequence of the length of the frame l, and the second three-dimensional convolution neural network channel processes a QCLP-n characteristic image sequence of the length of the frame l. The two three-dimensional convolutional neural network channels have the same network structure and respectively consist of an input layer, a convolutional layer I, a pooling layer I, a convolutional layer II, a pooling layer II, a convolutional layer III, a pooling layer III, a convolutional layer IV, a pooling layer IV, a convolutional layer V, a pooling layer V and a full-connection layer, but the network model parameters of the two three-dimensional convolutional neural network channels are different, and the specific structures of the three-dimensional convolutional neural networks of the two channels are shown in fig. 5.
The processing of an l-frame grayscale image sequence or QCLP-n feature map sequence by the three-dimensional convolutional neural network comprises the following steps:
Step 301: normalize each frame of the grayscale image or QCLP-n feature map input at the input layer to M × N pixels, where M and N are positive integers with value range [58, 1074].
Step 302: in convolutional layer one, convolve the l-frame grayscale image sequence or QCLP-n feature map sequence with n1 three-dimensional convolution kernels of size d1 × k1 × k1 (d1 is the time dimension and k1 × k1 the space dimension), and output n1 feature map groups, each containing l1 feature maps of size h1 × w1, where d1, k1, l1, h1 and w1 are positive integers, n1 is selected from the values 32, 64 and 128, d1 and k1 from the values 3, 5 and 7, l1 from the values 16, 24 and 32, and h1 and w1 have value range [32, 1074].
Step 303, in the pooling layer one, performing downsampling operation on a feature map group output by the convolution layer one by using a pooling kernel of d2 × k2 (d 2 is a time dimension, and k2 × k2 is a space dimension), and outputting n1 feature map groups, where each feature map group includes l2 feature maps with a size of h2 × w2, d2, k2, l2, h2, and w2 are positive integers, d2 and k2 are selected from values 1, 2, and 3, l2 is selected from values 16, 24, and 32, and the range of values of h2 and w2 is [32, 128].
Step 304, in the second convolution layer, performing convolution operation on a feature map group output by the first pooling layer by using n2 d1 × k1 × k1 three-dimensional convolution kernels, and performing Zero Padding (Zero Padding) operation at the same time to output n2 feature map groups, wherein each feature map group includes l2 feature maps with the size of h2 × w2, and n2 is a positive integer and is selected from 64, 128 and 256 values.
305, in the second pooling layer, performing downsampling operation on the feature map groups output by the convolution layer 2 by using a d2 xk 2 pooling kernel, and outputting n2 feature map groups, wherein each feature map group comprises l3 feature maps with the size of h3 xw 3, l3, h3 and w3 are positive integers, l3 is selected from 8, 12 and 16 numerical values, and the dereferencing range of h3 and w3 is [16, 64];
step 306, in the convolutional layer three, performing convolution operation on the feature map groups output by the pooling layer two by using n 3d 1 × k1 × k1 three-dimensional convolution kernels, and performing zero padding operation at the same time to output n3 feature map groups, wherein each feature map group comprises l3 feature maps with the size of h3 × w3, wherein n3 is a positive integer and is selected from 128, 256 and 512 values.
Step 307: in pooling layer three, downsample the feature map groups output by convolutional layer three with a d2 × k2 × k2 pooling kernel, and output n3 feature map groups, each containing l4 feature maps of size h4 × w4, where l4, h4 and w4 are positive integers, l4 is selected from the values 4, 6 and 8, and h4 and w4 have value range [8, 32].
Step 308: in convolutional layer four, convolve the feature map groups output by pooling layer three with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and output n3 feature map groups, each containing l4 feature maps of size h4 × w4.
Step 309: in pooling layer four, downsample the feature map groups output by convolutional layer four with a d2 × k2 × k2 pooling kernel, and output n3 feature map groups, each containing l5 feature maps of size h5 × w5, where l5, h5 and w5 are positive integers, l5 is selected from the values 2, 3 and 4, and h5 and w5 have value range [4, 16].
Step 310: in convolutional layer five, convolve the feature map groups output by pooling layer four with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and output n3 feature map groups, each containing l5 feature maps of size h5 × w5.
Step 311: in pooling layer five, downsample the feature map groups output by convolutional layer five with a d2 × k2 × k2 pooling kernel, and output n3 feature maps of size h6 × w6, where h6 and w6 are positive integers with value range [2, 8].
Step 312: in the fully connected layer, fully connect the output of pooling layer five to the layer's n4 neurons and output an n4-dimensional feature vector, where n4 is a positive integer selected from the values 256, 512 and 1024.
Step 4: concatenate the n4-dimensional feature vectors output by the fully connected layers of the two three-dimensional convolutional neural networks to obtain a 2n4-dimensional feature vector.
Step 5: in the classification layer, use a K-SoftMif regression classifier with n output nodes, each node fully connected to the 2n4-dimensional feature vector; the classifier outputs an n-dimensional column vector in which each component represents the probability that the input sample belongs to the corresponding class, and the dimension with the maximum probability is the predicted class of the input sample.
Step 6: train the two-channel three-dimensional convolutional neural network by inputting an l-frame grayscale image sequence and the corresponding QCLP-n feature map sequence, and optimize the network with the back-propagation algorithm; training ends when the loss function value output by the K-SoftMif classification layer has decreased and converged, and the trained network model is saved.
Step 7: input a test video clip into the trained two-channel three-dimensional convolutional neural network for pain expression classification, and output the recognition result.
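The fusion, classification and training of steps 4-6 can be sketched as follows; this is a heavily simplified illustration, not the claimed implementation: the per-channel backbone is a stub rather than the five-conv/five-pool structure, the optimizer, learning rate and epoch count are assumptions, and a standard softmax/cross-entropy head stands in for the K-SoftMif regression classifier.

```python
# Minimal training-loop sketch for the two-channel network: concatenation fusion,
# classification, back-propagation with cross-entropy loss, and model saving.
import torch
import torch.nn as nn

class TwoChannel3DCNN(nn.Module):
    def __init__(self, n4=256, n_classes=5):
        super().__init__()
        make_channel = lambda: nn.Sequential(          # stand-in for the 5-conv/5-pool channel
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, n4))
        self.gray_channel, self.qclp_channel = make_channel(), make_channel()
        self.head = nn.Linear(2 * n4, n_classes)       # maps the 2*n4-dim vector to n classes

    def forward(self, gray_seq, qclp_seq):
        fused = torch.cat([self.gray_channel(gray_seq),
                           self.qclp_channel(qclp_seq)], dim=1)  # 2*n4-dim fused vector
        return self.head(fused)                        # logits; softmax is inside the loss

model = TwoChannel3DCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                                # iterate until the loss converges
    gray = torch.randn(4, 1, 5, 112, 112)              # dummy l-frame grayscale batch
    qclp = torch.randn(4, 1, 5, 112, 112)              # dummy QCLP-n feature batch
    labels = torch.randint(0, 5, (4,))
    opt.zero_grad()
    loss = loss_fn(model(gray, qclp), labels)
    loss.backward()                                    # back propagation
    opt.step()

torch.save(model.state_dict(), "pain_model.pt")        # store the trained network model
```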
The expression recognition model based on the two-channel three-dimensional convolutional neural network was tested with the MMD facial expression data set (extended Cohn-Kanade facial expressions). The MMD data set contains about 2000 videos of 10 classes of pain expression (from class 1 pain to class 10 pain) from 210 individuals; the experiment used only the subset of videos with classification labels. The spatial resolution of the videos is 640 × 490 or 640 × 480, in black-and-white or color, and each expression of each person in the database comprises a series of facial activities, forming an expression sequence from onset to a very strong expression. To simplify calibration, the experiment converted the color videos to black-and-white, reduced the spatial size of the videos to 160 × 120, and also preprocessed the videos by subtracting the pixel mean.
After training the expression recognition model based on the two-channel three-dimensional convolutional neural network, inspection of 16 randomly selected filters corresponding to adjacent frames X and Y shows that the features learned by the time-space domain deep convolutional neural network on natural data are similar in form to Gabor filters, i.e., different filters are selective for different sizes, positions, frequencies, orientations and phases. The morphology of these filters is very similar to the responses of simple cells, found neurologically in the V1 region of the human brain, to external stimuli.
FIG. 6 shows the confusion matrix of the model of the invention on the MMD data set, where rows represent the correct categories and columns represent the model's classification results. As can be seen from FIG. 6, the model provided by the invention has a high overall recognition rate; the error rate is relatively high between class 2 pain and class 4 pain, which matches human intuition, since it is sometimes difficult even for people to distinguish these two classes correctly.
FIG. 7 shows a comparison of the average accuracy of the model with 3 other algorithms. As can be seen from FIG. 7, the average accuracy of the model disclosed by the invention over the 5 levels is 92.3%, higher than the average accuracy of the AAM, CLA and TMS algorithms.
Because the number of consecutive frames input to the model at one time is the basis on which the model extracts time-space domain features, it is an important parameter determining model performance. Furthermore, each frame requires one parallel convolutional layer, so the number of consecutive frames also determines the computational complexity of the model. In general, to reduce the computational complexity and the probability of overfitting without affecting model performance, the number of consecutive frames should be as small as possible. To study its influence on the performance of the disclosed model, the number of consecutive frames was varied from 1 to 10, the number of parallel convolutional layers of the recognition model was adjusted accordingly, and, with all other parameters unchanged, the relationship between the number of consecutive frames and the model's accuracy on 3 measures was computed. These 3 measures are: the accuracy on class 2 pain, the accuracy on class 4 pain, and the average accuracy over the other 3 classes. FIG. 8a shows the experimental results.
As can be seen from FIG. 8a, as the number of consecutive frames increases, the accuracy of all 3 pain categories first rises steadily: when the number of expression frames seen by the model increases from 1 to 2, the accuracy of the 3 pain categories improves greatly; when the number of expression frames reaches 5, the accuracy of the 3 pain levels gradually stabilizes. Considering the computational complexity of the model, a number of 5 consecutive frames is suggested as a good compromise.
Unlike accuracy, recall is the ratio of the number of relevant videos identified to the number of all relevant videos in the database; it measures how completely the model covers the relevant videos. To study the recall of the disclosed model under different numbers of consecutive frames, the recall of the model's 5 pain expression categories was computed for 1 to 6 consecutive frames; FIG. 8b shows the experimental results.
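The recall definition above amounts to the following two-line computation; the video identifiers are made up for illustration.

```python
# Recall as described above: identified-and-relevant divided by all relevant videos.
relevant = {"v1", "v2", "v3", "v4", "v5"}           # all relevant videos in the database
retrieved = {"v1", "v3", "v5", "v9"}                # videos the model identified
recall = len(retrieved & relevant) / len(relevant)  # 3 / 5 = 0.6
print(recall)
```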
From FIG. 8b it can be seen that the recall of the model trends upward as the number of consecutive frames increases, but not steadily. Similar to the accuracy, most pain expression categories cover the samples in the database well once the number of consecutive frames reaches 5; however, if the number of consecutive frames keeps increasing, the recall of some pain expression categories tends to decline, so based on the experimental results a value of at most 6 consecutive frames is suggested.

Claims (6)

1. A pain expression recognition system based on a deep time-space domain convolutional neural network, comprising:
a pain patient facial expression video library, used for training the two-channel three-dimensional convolutional neural network, in which all video segments of injured patients in different states are divided into N types of expressions according to pain degree, so that each video segment is marked with an expression label representing its pain degree;
a video clip preprocessing unit, used for cutting a video clip into a frame sequence of length l frames, graying each frame image, and extracting a depth global likelihood value mode QCLP-n feature map from each grayed image, where n is a positive integer;
a two-channel three-dimensional convolutional neural network comprising a feature extraction part and a feature fusion and classification recognition part, wherein:
the feature extraction part comprises two mutually independent three-dimensional convolutional neural network channels; one channel processes the l-frame grayscale image sequence obtained by the video clip preprocessing unit, the other channel processes the l-frame QCLP-n feature map sequence obtained by the video clip preprocessing unit, and the two channels each output an n4-dimensional feature vector; and
the feature fusion and classification recognition part concatenates the two n4-dimensional feature vectors into a 2n4-dimensional feature vector and inputs it into the classifier; the classifier outputs an n-dimensional column vector in which each component represents the probability that the input image data belongs to the corresponding class, and the dimension with the maximum probability is the predicted class of the image data input into the two-channel three-dimensional convolutional neural network.
2. The system according to claim 1, wherein the video segments in the pain patient facial expression video library characterize the states of injured patients under different degrees of pain caused by pain-inducing procedures.
3. The system according to claim 1, wherein the two three-dimensional convolutional neural network channels have the same structure.
4. The system of claim 3, wherein the three-dimensional convolutional neural network channel comprises an input layer, a convolutional layer one, a pooling layer one, a convolutional layer two, a pooling layer two, a convolutional layer three, a pooling layer three, a convolutional layer four, a pooling layer four, a convolutional layer five, a pooling layer five, and a fully-connected layer, wherein:
each frame of the grayscale image or QCLP-n feature map input at the input layer is normalized to M × N pixels, where M and N are positive integers;
in convolutional layer one, the l-frame grayscale image sequence or QCLP-n feature map sequence is convolved with n1 three-dimensional convolution kernels of size d1 × k1 × k1, and n1 feature map groups are output, each containing l1 feature maps of size h1 × w1, where d1 is the temporal size, k1 × k1 is the spatial size, and n1, d1, k1, l1, h1 and w1 are positive integers;
in pooling layer one, the feature map groups output by convolutional layer one are downsampled with a d2 × k2 × k2 pooling kernel, and n1 feature map groups are output, each containing l2 feature maps of size h2 × w2, where d2 is the temporal size, k2 × k2 is the spatial size, and d2, k2, l2, h2 and w2 are positive integers;
in convolutional layer two, the feature map groups output by pooling layer one are convolved with n2 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n2 feature map groups are output, each containing l2 feature maps of size h2 × w2, where n2 is a positive integer;
in pooling layer two, the feature map groups output by convolutional layer two are downsampled with a d2 × k2 × k2 pooling kernel, and n2 feature map groups are output, each containing l3 feature maps of size h3 × w3, where l3, h3 and w3 are positive integers;
in convolutional layer three, the feature map groups output by pooling layer two are convolved with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n3 feature map groups are output, each containing l3 feature maps of size h3 × w3, where n3 is a positive integer;
in pooling layer three, the feature map groups output by convolutional layer three are downsampled with a d2 × k2 × k2 pooling kernel, and n3 feature map groups are output, each containing l4 feature maps of size h4 × w4, where l4, h4 and w4 are positive integers;
in convolutional layer four, the feature map groups output by pooling layer three are convolved with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n3 feature map groups are output, each containing l4 feature maps of size h4 × w4;
in pooling layer four, the feature map groups output by convolutional layer four are downsampled with a d2 × k2 × k2 pooling kernel, and n3 feature map groups are output, each containing l5 feature maps of size h5 × w5, where l5, h5 and w5 are positive integers;
in convolutional layer five, the feature map groups output by pooling layer four are convolved with n3 three-dimensional convolution kernels of size d1 × k1 × k1 together with zero padding, and n3 feature map groups are output, each containing l5 feature maps of size h5 × w5;
in pooling layer five, the feature map groups output by convolutional layer five are downsampled with a d2 × k2 × k2 pooling kernel, and n3 feature maps of size h6 × w6 are output, where h6 and w6 are positive integers;
in the fully connected layer, the output of pooling layer five is fully connected to the layer's n4 neurons, and an n4-dimensional feature vector is output, where n4 is a positive integer.
5. The system for recognizing pain expressions based on a deep time-space domain convolutional neural network according to claim 1, wherein the classifier is a K-SoftMif regression classifier with n output nodes, each of which is fully connected to the 2n4-dimensional feature vector.
6. The system of claim 1, wherein, when the two-channel three-dimensional convolutional neural network is trained, an l-frame grayscale image sequence and the corresponding QCLP-n feature map sequence are input into the network, the network is trained and optimized with the back-propagation algorithm, training ends when the loss function value output by the classifier has decreased and converged, and the trained network model is saved.
CN202211313169.3A 2022-10-25 2022-10-25 Pain expression recognition system based on depth time-space domain convolutional neural network Pending CN115909438A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211313169.3A CN115909438A (en) 2022-10-25 2022-10-25 Pain expression recognition system based on depth time-space domain convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211313169.3A CN115909438A (en) 2022-10-25 2022-10-25 Pain expression recognition system based on depth time-space domain convolutional neural network

Publications (1)

Publication Number Publication Date
CN115909438A true CN115909438A (en) 2023-04-04

Family

ID=86490586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211313169.3A Pending CN115909438A (en) 2022-10-25 2022-10-25 Pain expression recognition system based on depth time-space domain convolutional neural network

Country Status (1)

Country Link
CN (1) CN115909438A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117653042A (en) * 2024-01-31 2024-03-08 中船凌久高科(武汉)有限公司 Multi-mode-based cared person pain level judging method and testing device
CN117653042B (en) * 2024-01-31 2024-04-26 中船凌久高科(武汉)有限公司 Multi-mode-based cared person pain level judging method and testing device

Similar Documents

Publication Publication Date Title
CN106682616B (en) Method for recognizing neonatal pain expression based on two-channel feature deep learning
CN108491077B (en) Surface electromyographic signal gesture recognition method based on multi-stream divide-and-conquer convolutional neural network
JP6522161B2 (en) Medical data analysis method based on deep learning and intelligent analyzer thereof
CN109544518B (en) Method and system applied to bone maturity assessment
CN109543526B (en) True and false facial paralysis recognition system based on depth difference characteristics
CN111476161A (en) Somatosensory dynamic gesture recognition method fusing image and physiological signal dual channels
CN107967686B (en) Epilepsy recognition device combining dynamic brain network and long-time and short-time memory network
CN112465905A (en) Characteristic brain region positioning method of magnetic resonance imaging data based on deep learning
CN112418166B (en) Emotion distribution learning method based on multi-mode information
WO2021212715A1 (en) Schizophrenia classification and identification method, operation control apparatus, and medical equipment
CN112733774A (en) Light-weight ECG classification method based on combination of BiLSTM and serial-parallel multi-scale CNN
CN115496953A (en) Brain network classification method based on space-time graph convolution
CN116350222A (en) Emotion recognition method and device based on electroencephalogram signals
Zhong et al. Exploring features and attributes in deep face recognition using visualization techniques
CN113951883B (en) Gender difference detection method based on electroencephalogram signal emotion recognition
CN117137488B (en) Auxiliary identification method for depression symptoms based on electroencephalogram data and facial expression images
CN116645721B (en) Sitting posture identification method and system based on deep learning
Kwaśniewska et al. Real-time facial features detection from low resolution thermal images with deep classification models
CN117036288A (en) Tumor subtype diagnosis method for full-slice pathological image
Ye et al. See what you see: Self-supervised cross-modal retrieval of visual stimuli from brain activity
CN115909438A (en) Pain expression recognition system based on depth time-space domain convolutional neural network
CN112561935B (en) Intelligent classification method, device and equipment for brain images
CN112489012A (en) Neural network architecture method for CT image recognition
Olmos et al. An oculomotor digital parkinson biomarker from a deep riemannian representation
CN116665906B (en) Resting state functional magnetic resonance brain age prediction method based on similarity twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination