CN108596069A - Neonatal pain expression recognition method and system based on depth 3D residual error networks - Google Patents
- Publication number
- CN108596069A (application CN201810346075.3A, filed by CN201810346075A)
- Authority
- CN
- China
- Prior art keywords
- convolutional layer
- convolution
- residual
- branch
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Abstract
The invention discloses a neonatal pain expression recognition method and system based on a deep 3D residual network. The method includes: building a neonatal expression video library containing pain expression class labels, and dividing the samples in the library into a training set and a validation set; constructing a deep 3D residual network for neonatal pain expression recognition, pre-training the network on a public large-scale labeled video database to obtain initial weight values, and then fine-tuning the network with the training and validation samples from the neonatal expression video library to obtain a trained network model; and inputting a neonatal expression video clip to be tested into the trained model for expression classification, yielding the pain expression recognition result. By using a deep 3D residual network to extract spatio-temporal features from video that reflect temporal information, the invention better characterizes changes in facial expression and thereby improves classification accuracy.
Description
Technical field
The present invention relates to the fields of facial expression recognition and machine learning, and in particular to a neonatal pain expression recognition method and system based on a deep 3D residual network.
Background technology
Scientific research has shown that newborns are capable of sensing pain. Neonatal pain mainly arises from painful clinical procedures, including heel-prick blood sampling, arterial and venous puncture, tracheal intubation, and subcutaneous and intramuscular injection. Repeated or sustained painful stimulation has a series of serious short-term and long-term effects on a newborn's growth and development, leading to harms such as slowed intellectual development, central nervous system injury, and emotional disorders. Pain assessment is a key link in pain control, so correctly assessing pain and taking appropriate analgesic measures in time to relieve neonatal pain has important clinical value and far-reaching significance for improving the quality of China's population.
In current clinical practice, pain assessment is performed manually by specially trained medical staff. Manual assessment is not only time-consuming and labor-intensive, but its results also depend on the experience of the staff and are affected by subjective factors such as personal mood. Moreover, because medical resources in China are unevenly distributed, small cities and remote rural areas are relatively short of healthcare resources, and in particular lack pediatric specialists, so the degree of neonatal pain cannot be objectively assessed there. There is therefore an urgent need for a computer-assisted automatic neonatal pain assessment system that provides diagnostic assistance to parents and medical staff, so that appropriate analgesic measures can be taken in time to relieve neonatal pain.
Some research already exists on automatic neonatal pain assessment. For example, the Chinese patent application "Neonatal pain recognition method based on facial expression analysis" (application No. 201710628847.8, publication No. CN107491740A) extracts facial dynamic geometric features and facial dynamic texture features from a video sequence, fuses the features, and then performs dimensionality reduction and classification; however, automatically and accurately extracting those feature parameters is very difficult. The Chinese patent "Classification method for neonatal pain and non-pain expressions based on sparse representation" (patent No. ZL201210077351.3) builds the over-complete dictionary of a sparse representation model from the feature vectors of training samples, treats a test sample as a linear combination of the training samples in the over-complete dictionary, and exploits the inherent sparsity to classify pain versus non-pain expressions; however, this method requires a carefully designed over-complete dictionary satisfying the sparsity constraints, performs only binary classification of pain versus non-pain, and does not assess the degree of pain.
To extract facial expression features automatically and avoid the limitations and subjectivity of hand-crafted features, the present inventors have proposed several neural-network-based neonatal pain expression recognition methods, such as "Neonatal pain expression classification method based on convolutional neural networks" (application No. CN201611233381.3, publication No. CN106778657A) and "Neonatal pain expression recognition method based on deep neural networks" (application No. CN201710497593.0, publication No. CN107392109A). The performance of convolutional neural networks in image classification tasks benefits largely from deeper network models. However, if network depth is increased merely by stacking convolutional layers, classification accuracy actually declines once the number of layers exceeds a certain value. On the other hand, using a 2D convolutional neural network to extract facial expression features from still images ignores the dynamics between adjacent video frames and cannot characterize changes in facial expression well.
To extract features in both the temporal and spatial domains of a video with a deep neural network, one direct approach is to expand the 2D convolutions used for image feature learning into 3D convolutions that operate in the spatial and temporal dimensions simultaneously. A 3D convolutional neural network built from such 3D convolution operations can obtain per-frame image features while also expressing how adjacent frames relate and change over time. In practice, however, this design faces certain difficulties: first, introducing the time dimension substantially increases the number of parameters, the running time, and the memory required for training; second, the randomly initialized 3D convolution kernels require a large number of labeled video samples for training.
Summary of the invention
Object of the invention: In view of the problems in the prior art, the present invention aims to provide a neonatal pain expression recognition method and system based on a deep 3D residual network. Applying the residual unit structure to a deep convolutional neural network effectively alleviates the vanishing-gradient problem in back-propagation during training, and thus solves the problems that deep networks are hard to train and suffer performance degradation. At the same time, the 3D convolution operation is realized as a combination of 2D and 1D convolutions; compared with a 2D convolutional neural network of the same depth, this adds only a limited number of 1D convolutions and does not cause excessive growth in parameter count or running time.
Technical solution: To achieve the above object, the present invention adopts the following technical solution:
A neonatal pain expression recognition method based on a deep 3D residual network includes the following steps:
(1) Collect the required neonatal expression video clip samples, trim each video clip into a frame sequence of equal length, build a neonatal expression video library containing pain expression class labels, and divide the samples in the library into a training set and a validation set;
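Trimming each clip to an equal-length frame sequence can be sketched, for example, as uniform index sampling. The patent only requires equal length (16 frames in the embodiment below); the uniform-sampling strategy and the handling of short clips here are illustrative assumptions.

```python
def trim_to_length(frames, target_len=16):
    """Return target_len frames sampled uniformly from a decoded clip.
    Clips shorter than target_len repeat their last frame (an assumption)."""
    n = len(frames)
    if n >= target_len:
        # evenly spaced indices across the whole clip
        idx = [i * n // target_len for i in range(target_len)]
    else:
        idx = list(range(n)) + [n - 1] * (target_len - n)
    return [frames[i] for i in idx]

clip = list(range(40))            # stand-in for 40 decoded frames
seq = trim_to_length(clip)
print(len(seq), seq[0], seq[-1])  # 16 0 37
```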
(2) Construct a deep 3D residual network for neonatal pain expression recognition, comprising, connected in sequence: an input layer, a first convolutional layer, a first pooling layer, a 3D residual sub-network, a 2D residual sub-network, a fully connected layer, and a Softmax classification layer;
The input layer receives the video sequence and normalizes every frame image in it;
The first convolutional layer convolves the normalized video sequence output by the input layer with several 3D convolution kernels and outputs several feature-map sequences;
The first pooling layer applies max pooling in the spatial and temporal domains to the output of the first convolutional layer using a 3D pooling kernel, and outputs several feature-map sequences;
The 3D residual sub-network comprises three kinds of 3D residual units with different structures, connected in several alternating cycles, together with pooling layers interspersed along the connection path of the 3D residual units; all three kinds of 3D residual units realize the spatio-temporal 3D convolution operation with a combination of 2D and 1D convolutions, the combinations being, respectively, a serial mode without a shortcut branch, a parallel mode, and a serial mode with a shortcut branch;
The 2D residual sub-network comprises at least three identically structured 2D residual units connected in sequence and one pooling layer;
The fully connected layer fully connects the output of the 2D residual sub-network to its n output neurons and outputs an n-dimensional feature vector; and the Softmax classification layer fully connects the feature vector output by the fully connected layer to n output nodes corresponding to the expression classes and outputs an n-dimensional vector, in which the value of each dimension represents the probability that the input sample belongs to that class, n being the number of expression classes;
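The Softmax layer turns the n raw class scores into a probability distribution, and the predicted class is the dimension with the highest probability. A minimal sketch for the n = 4 expression classes of the embodiment (the score values are hypothetical):

```python
import math

def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the 4 classes (calm, cry, mild pain, severe pain).
probs = softmax([0.5, 0.1, 2.3, 1.1])
pred = probs.index(max(probs))  # predicted class label
print(pred)                     # 2 -> mild pain has the highest probability
```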
(3) Pre-train the constructed deep 3D residual network on a public large-scale labeled video database to obtain initial weight values; then, starting from these initial weights, train the network by fine-tuning with the training and validation samples from the neonatal expression video library, optimizing the model parameters to obtain a trained network model;
(4) Input the neonatal expression video clip to be tested into the trained network model and perform expression classification to obtain the pain expression recognition result.
As a further optimization of the present invention, the 3D residual sub-network comprises a first sub-network, a second sub-network and a third sub-network; each sub-network includes at least three 3D residual units with different structures and one pooling layer.
As a further optimization of the present invention, the 3D residual units in the 3D residual sub-network are 3D residual unit A, 3D residual unit B and 3D residual unit C. 3D residual unit A comprises a first branch and a second branch: the first branch includes, connected in sequence, convolutional layer A1, 3D convolution module A and convolutional layer A4; the second branch is a shortcut connection branch; the outputs of the two branches are added pixel-wise and then passed through a ReLU nonlinear activation layer.
3D residual unit B comprises a first branch and a second branch: the first branch includes, connected in sequence, convolutional layer B1, 3D convolution module B and convolutional layer B4; the second branch is a shortcut connection branch; the outputs of the two branches are added pixel-wise and then passed through a ReLU nonlinear activation layer.
3D residual unit C comprises a first branch and a second branch: the first branch includes, connected in sequence, convolutional layer C1, 3D convolution module C and convolutional layer C4; the second branch is a shortcut connection branch; the outputs of the two branches are added pixel-wise and then passed through a ReLU nonlinear activation layer.
As a further optimization of the present invention, the 2D residual unit comprises a first branch and a second branch: the first branch includes three convolutional layers connected in sequence (convolutional layer 1, convolutional layer 2 and convolutional layer 3); the second branch is a shortcut connection branch; the outputs of the two branches are added pixel-wise and then passed through a ReLU nonlinear activation layer.
As a further optimization of the present invention, in 3D residual unit A, convolutional layer A1 convolves the input with m1 kernels of size 1 × 1 × 1; 3D convolution module A includes convolutional layer A2 and convolutional layer A3, which realize the spatio-temporal 3D convolution in serial mode: convolutional layer A2 convolves the output of convolutional layer A1 in the spatial domain with m1 kernels of size 1 × k × k, and convolutional layer A3 convolves the output of convolutional layer A2 in the temporal domain with m1 kernels of size d × 1 × 1; convolutional layer A4 convolves the output of 3D convolution module A with m2 kernels of size 1 × 1 × 1; here m1 is chosen from {64, 128, 256}, k and d are chosen from {1, 3}, and m2 is chosen from {256, 512, 1024};
In 3D residual unit B, convolutional layer B1 convolves the input with m1 kernels of size 1 × 1 × 1; 3D convolution module B includes convolutional layer B2 and convolutional layer B3, which realize the spatio-temporal 3D convolution in parallel mode: convolutional layer B2 convolves the output of convolutional layer B1 in the spatial domain with m1 kernels of size 1 × k × k, while convolutional layer B3 convolves the output of convolutional layer B1 in the temporal domain with m1 kernels of size d × 1 × 1; the outputs of convolutional layers B2 and B3 are added pixel-wise and passed through a ReLU nonlinear activation layer, forming the input of convolutional layer B4; convolutional layer B4 convolves the output of 3D convolution module B with m2 kernels of size 1 × 1 × 1;
In 3D residual unit C, convolutional layer C1 convolves the input with m1 kernels of size 1 × 1 × 1; 3D convolution module C includes convolutional layer C2 and convolutional layer C3, which realize the spatio-temporal 3D convolution in serial mode with a shortcut branch: convolutional layer C2 convolves the output of convolutional layer C1 in the spatial domain with m1 kernels of size 1 × k × k, and convolutional layer C3 convolves the output of convolutional layer C2 in the temporal domain with m1 kernels of size d × 1 × 1; the outputs of convolutional layers C2 and C3 are added pixel-wise and passed through a ReLU nonlinear activation layer, forming the input of convolutional layer C4; convolutional layer C4 convolves the output of 3D convolution module C with m2 kernels of size 1 × 1 × 1.
As a further optimization of the present invention, in the 2D residual unit the three sequentially connected convolutional layers 1, 2 and 3 each convolve their input with m kernels of size k × k, where m is chosen from {512, 2048}.
As a further optimization of the present invention, in step (3) the fully connected layer of the deep 3D residual network is fine-tuned from the initial weight values with a learning rate larger than a preset threshold, while every other layer of the deep residual convolutional neural network is fine-tuned from the initial weight values with a learning rate smaller than the preset threshold.
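The fine-tuning scheme above — a larger learning rate for the freshly attached fully connected layer, a smaller one for the pre-trained layers — can be expressed as a per-layer assignment. In this sketch the layer names, base rate, and multipliers are illustrative assumptions, not values from the patent:

```python
def finetune_lrs(layer_names, fc_name="fc", base_lr=1e-3,
                 fc_mult=10.0, body_mult=0.1):
    """Assign a learning rate per layer: large for the fully connected
    layer, small for all pre-trained layers (values are assumptions)."""
    return {name: base_lr * (fc_mult if name == fc_name else body_mult)
            for name in layer_names}

lrs = finetune_lrs(["conv1", "res3d", "res2d", "fc"])
for name, lr in lrs.items():
    print(name, lr)  # "fc" gets a rate 100x larger than the pre-trained layers
```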
Another aspect of the present invention provides a neonatal pain expression recognition system based on a deep 3D residual network, comprising:
a sample processing module for collecting the required neonatal expression video clip samples, trimming each video clip into a frame sequence of equal length, building a neonatal expression video library containing pain expression class labels, and dividing the samples in the library into a training set and a validation set;
a network construction module for building the deep 3D residual network applied to neonatal pain expression recognition, comprising, connected in sequence: an input layer, a first convolutional layer, a first pooling layer, a 3D residual sub-network, a 2D residual sub-network, a fully connected layer and a Softmax classification layer;
the input layer receiving the video sequence and normalizing every frame image in it;
the first convolutional layer convolving the normalized video sequence output by the input layer with several 3D convolution kernels and outputting several feature-map sequences;
the first pooling layer applying max pooling in the spatial and temporal domains to the output of the first convolutional layer using a 3D pooling kernel and outputting several feature-map sequences;
the 3D residual sub-network comprising three kinds of 3D residual units with different structures, connected in several alternating cycles, together with pooling layers interspersed along the connection path of the 3D residual units, all three kinds of 3D residual units realizing the spatio-temporal 3D convolution with combinations of 2D and 1D convolutions, the combinations being, respectively, a serial mode without a shortcut branch, a parallel mode, and a serial mode with a shortcut branch;
the 2D residual sub-network comprising at least three identically structured 2D residual units connected in sequence and one pooling layer;
the fully connected layer fully connecting the output of the 2D residual sub-network to n output neurons and outputting an n-dimensional feature vector; and the Softmax classification layer fully connecting the feature vector output by the fully connected layer to n output nodes corresponding to the expression classes and outputting an n-dimensional vector, in which the value of each dimension represents the probability that the input sample belongs to that class, n being the number of expression classes;
a model training module for pre-training the constructed deep 3D residual network on a public large-scale labeled video database to obtain initial weight values, and, starting from these initial weights, training the network by fine-tuning with the training and validation samples from the neonatal expression video library, optimizing the model parameters to obtain a trained network model;
and a test module for inputting the neonatal expression video clip to be tested into the trained network model and performing expression classification to obtain the pain expression recognition result.
Advantageous effects: Compared with the prior art, the present invention has the following advantages:
(1) The 3D convolution of size d × k × k is realized by combining a 2D convolution of size 1 × k × k with a 1D convolution of size d × 1 × 1. Compared with a 2D convolutional neural network of the same depth, the deep 3D residual network adds only a limited number of 1D convolutions and does not cause excessive growth in parameter count or running time.
(2) When the constructed deep 3D residual network is pre-trained on a public large-scale labeled video database, the spatial 2D convolution kernels can be pre-trained on a public labeled large-scale image database, and only the temporal 1D convolution kernels need random initialization; this reduces the training difficulty of the network and speeds up its training.
(3) The deep 3D residual network is an extension of the deep residual convolutional neural network: its basic residual unit is formed by adding a shortcut connection branch, which effectively alleviates the vanishing-gradient problem in back-propagation as the network deepens, and thus solves the problems that deep networks are hard to train and suffer performance degradation.
(4) The temporal and spatial features of a video clip are extracted with the deep 3D residual network, extending feature extraction from still images to video sequences. Dynamic features reflecting temporal information can be extracted autonomously, and the extracted expression features better characterize changes in facial expression; compared with traditional hand-crafted features they have stronger representation and generalization power, ultimately improving classification accuracy.
Description of the drawings
Fig. 1 is a schematic flow chart of the neonatal pain expression recognition method based on a deep 3D residual network provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the deep 3D residual network provided by an embodiment of the present invention;
Fig. 3 shows the structures of the 4 sub-networks in the deep 3D residual network provided by an embodiment of the present invention, in which (a)-(c) are the 3 sub-networks of the 3D residual sub-network and (d) is the 2D residual sub-network;
Fig. 4 shows the structures of the 3 kinds of 3D residual units and the 2D residual unit provided by an embodiment of the present invention, in which (a)-(c) are 3D residual units A-C and (d) is the 2D residual unit.
Detailed description of the embodiments
Specific embodiments of the present invention are further described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, a neonatal pain expression recognition method based on a deep 3D residual network provided by an embodiment of the present invention mainly includes the following steps:
Step 1: Collect the required neonatal expression video clip samples, trim each video clip into a frame sequence of equal length, build a neonatal expression video library containing pain expression class labels, and divide the samples in the library into a training set and a validation set.
Pain expression videos of newborns during routine painful procedures such as intravenous injection and blood sampling are collected, together with expression video clips in non-pain states: a calm state, and crying caused by reasons such as hunger. Professionals assess the collected videos on a 1-10 pain scoring standard using the internationally recognized neonatal pain assessment tool, the Neonatal Facial Coding System (NFCS), combined with other medical physical signs; expressions with scores of 1-5 are classified as mild pain expressions, and those with scores of 6-10 as severe pain expressions. Finally, video clips of the 4 typical expression classes with high scoring consistency (calm, crying, mild pain, severe pain) are selected, each video clip is trimmed to a sequence of 16 frames, a neonatal expression video library containing pain expression class labels is built, and the samples in the library are divided into a training set and a validation set at a ratio of 7:3. In this embodiment, the calm expression is marked with label 0, the crying-without-pain expression with label 1, the mild pain expression with label 2, and the severe pain expression with label 3.
Step 2: Construct the deep 3D residual network applied to neonatal pain expression recognition. The constructed network comprises, connected in sequence: an input layer, a first convolutional layer, a first pooling layer, a 3D residual sub-network, a 2D residual sub-network, a fully connected layer and a Softmax classification layer. The core of the network is the 3D residual sub-network, which comprises three kinds of 3D residual units with different structures, connected in several alternating cycles, and pooling layers interspersed along their connection path. These 3D residual units realize the spatio-temporal 3D convolution with combinations of 2D and 1D convolutions to reduce computational complexity; the combinations include a serial mode without a shortcut branch, a parallel mode, and a serial mode with a shortcut branch.
The constructed deep 3D residual network structure is described in detail below using the concrete scenario of this embodiment. It should be understood that those skilled in the art can make appropriate adjustments on the basis of this specific implementation to suit particular applications.
As shown in Fig. 2, the deep 3D residual network in this embodiment mainly includes: an input layer, a first convolutional layer, a first pooling layer, a 3D residual sub-network, a 2D residual sub-network, a fully connected layer and a Softmax classification layer, where the 3D residual sub-network is divided into a first sub-network, a second sub-network and a third sub-network.
The input layer normalizes each frame image in the input 16-frame video sequence to 160 × 160 pixels.
The first convolutional layer convolves the normalized video sequence output by the input layer with 64 convolution kernels of size 1 × 7 × 7, followed by batch normalization (Batch Normalization, BN) and the nonlinear activation function ReLU, and outputs 64 feature-map sequences of size 16 × 80 × 80.
The first pooling layer applies max pooling in the spatial and temporal domains to the output of the first convolutional layer with a 2 × 3 × 3 pooling kernel, and outputs 64 feature-map sequences of size 8 × 39 × 39.
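The stated feature-map sizes can be reproduced with the standard output-size formula floor((n + 2p − k)/s) + 1. The strides and paddings below are assumptions inferred from the sizes given in the embodiment (spatial stride 2 with padding 3 in the first convolution; stride 2 on every axis, no padding, in the pooling):

```python
def out_dim(n, k, s, p=0):
    """Output size along one axis of a convolution or pooling:
    n = input size, k = kernel, s = stride, p = padding."""
    return (n + 2 * p - k) // s + 1

# Input clip: 16 frames of 160 x 160 pixels.
t, h, w = 16, 160, 160

# First convolutional layer: 1 x 7 x 7 kernels, assumed spatial stride 2, padding 3.
t1, h1, w1 = out_dim(t, 1, 1), out_dim(h, 7, 2, 3), out_dim(w, 7, 2, 3)
print(t1, h1, w1)  # 16 80 80, matching the stated 16 x 80 x 80 maps

# First pooling layer: 2 x 3 x 3 kernel, assumed stride 2 everywhere, no padding.
t2, h2, w2 = out_dim(t1, 2, 2), out_dim(h1, 3, 2), out_dim(w1, 3, 2)
print(t2, h2, w2)  # 8 39 39, matching the stated 8 x 39 x 39 maps
```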
As shown in Fig. 3(a), the first sub-network includes, connected in sequence, 3D residual unit A, 3D residual unit B, 3D residual unit C and a second pooling layer.
As shown in Fig. 4(a), 3D residual unit A comprises a first branch and a second branch. The first branch comprises, connected in sequence, convolutional layer A1, 3D convolution module A and convolutional layer A4; the second branch is a shortcut connection. The outputs of the two branches are added pixel-by-pixel and then mapped by the nonlinear activation function ReLU, yielding 256 feature-map sequences of size 8 × 39 × 39. Convolutional layer A1 applies 64 convolution kernels of size 1 × 1 × 1 to the output of the first pooling layer, followed by batch normalization (BN) and a ReLU mapping, and outputs 64 feature-map sequences of size 8 × 39 × 39. 3D convolution module A comprises convolutional layers A2 and A3, which are connected serially to realize the 3D convolution over the spatial and temporal domains: convolutional layer A2 applies 64 kernels of size 1 × 3 × 3 to the output of A1 (a spatial convolution), followed by BN and a ReLU mapping, and outputs 64 feature-map sequences of size 8 × 39 × 39; convolutional layer A3 applies 64 kernels of size 3 × 1 × 1 to the output of A2 (a temporal convolution), followed by BN, and outputs 64 feature-map sequences of size 8 × 39 × 39. Convolutional layer A4 applies 256 kernels of size 1 × 1 × 1 to the output of 3D convolution module A, followed by BN, and outputs 256 feature-map sequences of size 8 × 39 × 39.
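The serial design of 3D convolution module A factorizes a full 3 × 3 × 3 spatio-temporal convolution into a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution. A minimal sketch of the weight savings this factorization yields (plain Python arithmetic, bias terms ignored; illustrative only, not taken from the patent):

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weight count of a 3D convolution layer, ignoring bias terms."""
    return c_in * c_out * kt * kh * kw

# Full 3D convolution: 64 -> 64 channels with a 3x3x3 kernel.
full = conv3d_params(64, 64, 3, 3, 3)

# Serial (2D + 1D) factorization used in module A:
# layer A2 (spatial 1x3x3) followed by layer A3 (temporal 3x1x1).
factored = conv3d_params(64, 64, 1, 3, 3) + conv3d_params(64, 64, 3, 1, 1)

print(full, factored)   # 110592 49152
print(factored / full)  # under half the weights of the full 3D kernel
```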
As shown in Fig. 4(b), 3D residual unit B comprises a first branch and a second branch. The first branch comprises, connected in sequence, convolutional layer B1, 3D convolution module B and convolutional layer B4; the second branch is a shortcut connection. The outputs of the two branches are added pixel-by-pixel and then mapped by the nonlinear activation function ReLU, yielding 256 feature-map sequences of size 8 × 39 × 39. Convolutional layer B1 applies 64 convolution kernels of size 1 × 1 × 1 to the output of 3D residual unit A, followed by batch normalization (BN) and a ReLU mapping, and outputs 64 feature-map sequences of size 8 × 39 × 39. 3D convolution module B comprises convolutional layers B2 and B3, which are connected in parallel to realize the 3D convolution over the spatial and temporal domains: convolutional layer B2 applies 64 kernels of size 1 × 3 × 3 to the output of B1 (a spatial convolution), followed by BN and a ReLU mapping, and outputs 64 feature-map sequences of size 8 × 39 × 39; convolutional layer B3 applies 64 kernels of size 3 × 1 × 1 to the output of B1 (a temporal convolution), followed by BN, and outputs 64 feature-map sequences of size 8 × 39 × 39; the outputs of B2 and B3 are added pixel-by-pixel and then mapped by ReLU, yielding 64 feature-map sequences of size 8 × 39 × 39. Convolutional layer B4 applies 256 kernels of size 1 × 1 × 1 to the output of 3D convolution module B, followed by BN, and outputs 256 feature-map sequences of size 8 × 39 × 39.
As shown in Fig. 4(c), 3D residual unit C comprises a first branch and a second branch. The first branch comprises, connected in sequence, convolutional layer C1, 3D convolution module C and convolutional layer C4; the second branch is a shortcut connection. The outputs of the two branches are added pixel-by-pixel and then mapped by the nonlinear activation function ReLU, yielding 256 feature-map sequences of size 8 × 39 × 39. Convolutional layer C1 applies 64 convolution kernels of size 1 × 1 × 1 to the output of 3D residual unit B, followed by batch normalization (BN) and a ReLU mapping, and outputs 64 feature-map sequences of size 8 × 39 × 39. 3D convolution module C comprises convolutional layers C2 and C3, which are connected serially with a shortcut branch to realize the 3D convolution over the spatial and temporal domains: convolutional layer C2 applies 64 kernels of size 1 × 3 × 3 to the output of C1 (a spatial convolution), followed by BN and a ReLU mapping, and outputs 64 feature-map sequences of size 8 × 39 × 39; convolutional layer C3 applies 64 kernels of size 3 × 1 × 1 to the output of C2 (a temporal convolution), followed by BN, and outputs 64 feature-map sequences of size 8 × 39 × 39; the outputs of C2 and C3 are added pixel-by-pixel and then mapped by ReLU, yielding 64 feature-map sequences of size 8 × 39 × 39. Convolutional layer C4 applies 256 kernels of size 1 × 1 × 1 to the output of 3D convolution module C, followed by BN, and outputs 256 feature-map sequences of size 8 × 39 × 39.
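The three units described above differ only in how the spatial operation S (the 1 × 3 × 3 convolution) and the temporal operation T (the 3 × 1 × 1 convolution) are combined inside the module: serially (unit A), in parallel (unit B), or serially with an inner shortcut (unit C). A toy sketch with scalar stand-ins for the feature maps (the functions S and T below are made up for illustration; they are not the actual convolutions):

```python
# Toy stand-ins for the spatial (1x3x3) and temporal (3x1x1) convolutions.
S = lambda x: 2 * x   # "spatial" operation
T = lambda x: x + 1   # "temporal" operation

def module_a(x):  # serial:               T(S(x))
    return T(S(x))

def module_b(x):  # parallel:             S(x) + T(x)
    return S(x) + T(x)

def module_c(x):  # serial with shortcut: S(x) + T(S(x))
    return S(x) + T(S(x))

x = 3
print(module_a(x), module_b(x), module_c(x))  # 7 10 13
```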
The second pooling layer applies a 2 × 1 × 1 pooling kernel to the output of 3D residual unit C for max pooling over the temporal and spatial domains, and outputs 256 feature-map sequences of size 4 × 39 × 39.
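The frame counts and spatial sizes quoted throughout this section follow the standard convolution/pooling output-size formula. A small helper, sketched in plain Python under the assumption that the spatial convolutions use stride 1 with padding 1 and the pooling layer uses a stride equal to its kernel size (strides and padding are not stated explicitly in the text):

```python
def out_len(n, k, stride, pad):
    """Output length along one axis for input length n, kernel k,
    the given stride, and symmetric padding pad."""
    return (n + 2 * pad - k) // stride + 1

# A 1x3x3 spatial convolution with stride 1, padding 1 preserves 39x39.
assert out_len(39, 3, 1, 1) == 39

# The 2x1x1 max pooling after residual unit C (temporal stride 2 assumed)
# halves the 8-frame sequence to 4 frames and leaves 39x39 untouched.
t = out_len(8, 2, 2, 0)
h = out_len(39, 1, 1, 0)
w = out_len(39, 1, 1, 0)
print(t, h, w)  # 4 39 39
```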
As shown in Fig. 3(b), the second sub-network comprises, connected in sequence, 3D residual unit A, 3D residual unit B, 3D residual unit C, 3D residual unit A, 3D residual unit B, 3D residual unit C, 3D residual unit A, 3D residual unit B and the third pooling layer, and outputs 512 feature-map sequences of size 2 × 20 × 20.
As shown in Fig. 3(c), the third sub-network is composed of 36 3D residual units and 1 pooling layer, connected in the order: 3D residual unit C → 3D residual unit A → 3D residual unit B → … → 3D residual unit A → 3D residual unit B → the fourth pooling layer, and outputs 1024 feature-map sequences of size 1 × 10 × 10.
As shown in Fig. 3(d), the 2D residual sub-network comprises 3 sequentially connected 2D residual units and the fifth pooling layer, and outputs 2048 feature maps of size 1 × 1.
As shown in Fig. 4(d), the 2D residual unit comprises a first branch and a second branch. The first branch comprises 3 sequentially connected convolutional layers: convolutional layer 1, convolutional layer 2 and convolutional layer 3; the second branch is a shortcut connection. The outputs of the two branches are added pixel-by-pixel and then mapped by the nonlinear activation function ReLU, yielding 2048 feature maps of size 5 × 5. Convolutional layer 1 performs a convolution using 512 kernels of size 1 × 1, followed by batch normalization (BN) and a ReLU mapping, and outputs 512 feature maps of size 5 × 5; convolutional layer 2 applies 512 kernels of size 3 × 3 to the output of convolutional layer 1, followed by BN and a ReLU mapping, and outputs 512 feature maps of size 5 × 5; convolutional layer 3 applies 2048 kernels of size 1 × 1 to the output of convolutional layer 2, followed by BN, and outputs 2048 feature maps of size 5 × 5.
The fifth pooling layer performs average pooling on the output of the third 2D residual unit, adjusting the connection weights with the dropout method, and outputs 2048 feature maps of size 1 × 1.
The fully connected layer fully connects the output of the 2D residual sub-network to its 4 output neurons and outputs a 4-dimensional feature vector.
The Softmax classification layer fully connects the feature vector output by the fully connected layer to 4 output nodes corresponding to the expression classes and outputs a 4-dimensional vector, in which the number in each dimension represents the probability that the input sample belongs to that class.
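The probability interpretation of the Softmax layer's 4-dimensional output can be sketched in plain Python (the raw class scores below are made-up values, not taken from the patent):

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-dimensional output of the fully connected layer,
# one score per expression class.
scores = [0.5, 1.2, 3.1, 0.2]
probs = softmax(scores)

# The dimension with the largest probability gives the predicted class.
predicted = probs.index(max(probs))
print(predicted)  # 2
```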
In step 3, the constructed deep 3D residual network is pre-trained on the publicly available, class-labeled Kinetics video database to obtain initial weight parameter values; based on these initial weight parameter values, the constructed deep 3D residual network is trained by fine-tuning, using the training-set and validation-set samples in the neonatal expression video library, and the network model parameters are optimized to obtain a trained network model.
In this step, the deep 3D residual network is trained on the training set using a transfer-learning method and tested on the validation set, yielding a trained deep 3D residual network. The idea of transfer learning is to first pre-train the constructed deep 3D residual network on a data set with sufficient training samples and then transfer the pre-trained model parameters to the target model, so that the target model starts from good initial weight parameter values and already has the ability to extract features from video images. The deep 3D residual network is then fine-tuned (Fine-Tuning) on the established neonatal expression video library; that is, the pre-trained deep 3D residual network is further trained on the training set.
In this example, the deep 3D residual network is first pre-trained on the Kinetics video database (videos from a number of label classes equal to the number of neonatal pain expression classes may be selected for training). Through pre-training, the constructed deep 3D residual network acquires the ability to extract features from video images, and the network obtains good initial weight parameter values. In addition, to further reduce the training difficulty of the network and accelerate training, when the constructed deep 3D residual network is pre-trained on the publicly available large-scale labeled video database, the spatial-domain 2D convolution kernels in the 3D residual network may first be pre-trained on a publicly available large-scale labeled image database (such as ImageNet), so that only the temporal-domain 1D convolution kernels need random initialization.
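Because the spatial kernels of the 3D residual units have shape 1 × k × k, a k × k filter pre-trained on an image database maps into them directly as a single time slice, leaving only the d × 1 × 1 temporal kernels to be randomly initialized. A toy sketch of that weight transfer, with nested Python lists standing in for tensors (not the patent's actual implementation):

```python
import random

def lift_2d_kernel(kernel_2d):
    """Wrap a k x k spatial filter into a 1 x k x k 3D kernel
    by treating it as a single time slice."""
    return [kernel_2d]

# A hypothetical 3x3 filter from a 2D network pre-trained on image data.
pretrained_2d = [[0.1, 0.0, -0.1],
                 [0.2, 0.0, -0.2],
                 [0.1, 0.0, -0.1]]

spatial_3d = lift_2d_kernel(pretrained_2d)  # 1x3x3: reused as-is

# The temporal 3x1x1 kernel has no 2D counterpart: random-initialize it.
temporal_3d = [[[random.gauss(0.0, 0.01)]] for _ in range(3)]

print(len(spatial_3d), len(spatial_3d[0]), len(spatial_3d[0][0]))      # 1 3 3
print(len(temporal_3d), len(temporal_3d[0]), len(temporal_3d[0][0]))   # 3 1 1
```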
After the initial weight parameter values are obtained, the constructed deep 3D residual network is fine-tuned on the neonatal expression video library. The update rule for the weight parameters during fine-tuning is: the fully connected layer of the deep 3D residual network is trained, on the basis of the initial weight parameter values, with a learning rate larger than a preset threshold, while every other layer of the deep 3D residual network is trained, on the basis of the initial weight parameter values, with a learning rate smaller than the preset threshold; the preset threshold is determined according to the actual training conditions. For example, apart from the last fully connected layer, the other layers of the deep 3D residual network are trained with a learning rate of 0.001 on the basis of their original parameters; that is, the weight parameters of every layer except the fully connected layer are updated only by small amounts from the initial weight parameter values obtained by pre-training, while the last fully connected layer is trained with a learning rate of 0.01. Specifically, training uses the Softmax loss function, which is optimized by gradient descent to update the network parameters.
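The two-rate update rule described above amounts to plain gradient descent with a per-layer learning rate. A minimal single-step sketch (plain Python; the weights and gradients are illustrative values, not taken from the patent):

```python
# One gradient-descent step, w <- w - lr * grad, with the learning rate
# chosen per layer: 0.01 for the final fully connected layer, 0.001 elsewhere.
def sgd_step(weights, grads, lr):
    return [w - lr * g for w, g in zip(weights, grads)]

body_w, body_g = [0.50, -0.30], [1.0, -2.0]  # a pre-trained "body" layer
fc_w, fc_g = [0.10, 0.20], [1.0, -2.0]       # the final fully connected layer

body_w = sgd_step(body_w, body_g, lr=0.001)  # small update: stays near init
fc_w = sgd_step(fc_w, fc_g, lr=0.01)         # 10x larger update

print([round(w, 3) for w in body_w])  # [0.499, -0.298]
print([round(w, 3) for w in fc_w])    # [0.09, 0.22]
```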
In step 4, the neonatal expression video clip to be tested is input into the trained network model for expression classification and recognition, and the pain expression recognition result is obtained.
Based on the same inventive concept, another embodiment of the present invention discloses a neonatal pain expression recognition system based on a deep 3D residual network, comprising: a sample processing module for collecting the required neonatal expression video clip samples, trimming each video clip into a frame sequence of equal length, establishing a neonatal expression video library containing pain expression class labels, and dividing the samples in the neonatal expression video library into a training set and a validation set; a network construction module for constructing a deep 3D residual network applied to neonatal pain expression recognition, comprising, connected in sequence: an input layer, a first convolutional layer, a first pooling layer, a 3D residual sub-network, a 2D residual sub-network, a fully connected layer and a Softmax classification layer; a model training module for pre-training the constructed deep 3D residual network using a publicly available large-scale video database with class labels to obtain initial weight parameter values, and, based on the initial weight parameter values, training the constructed deep 3D residual network by fine-tuning using the training-set and validation-set samples in the neonatal expression video library, optimizing the network model parameters and obtaining a trained network model; and a test module for inputting the neonatal expression video clip to be tested into the trained network model, performing expression classification and recognition, and obtaining a pain expression recognition result. For the specific implementation details of this embodiment, please refer to the method embodiment above; they are not repeated here.
Those skilled in the art will understand that the modules in the embodiments can be adaptively changed and arranged in one or more systems different from the embodiments. The modules, units or components in the embodiments can be combined into one module, unit or component, and can furthermore be divided into a plurality of sub-modules, sub-units or sub-components.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any transformation or replacement that a person familiar with the art can readily conceive of within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope defined by the claims.
Claims (8)
1. A neonatal pain expression recognition method based on a deep 3D residual network, characterized by comprising the following steps:
(1) collecting the required neonatal expression video clip samples, trimming each video clip into a frame sequence of equal length, establishing a neonatal expression video library containing pain expression class labels, and dividing the samples in the neonatal expression video library into a training set and a validation set;
(2) constructing a deep 3D residual network applied to neonatal pain expression recognition, comprising, connected in sequence: an input layer, a first convolutional layer, a first pooling layer, a 3D residual sub-network, a 2D residual sub-network, a fully connected layer and a Softmax classification layer;
the input layer is used to input a video sequence and normalize every frame image in the video sequence;
the first convolutional layer performs a convolution operation on the normalized video sequence output by the input layer using several 3D convolution kernels, and outputs several feature-map sequences;
the first pooling layer performs a max pooling operation over the spatial and temporal domains on the output of the first convolutional layer using a 3D pooling kernel, and outputs several feature-map sequences;
the 3D residual sub-network comprises 3 kinds of 3D residual units of different structures connected in several alternating cycles, and pooling layers interposed in the connection paths of the 3D residual units; the 3 kinds of 3D residual units all realize the 3D convolution over the spatial and temporal domains by combining a 2D convolution and a 1D convolution, the combinations being, respectively, a serial mode without a shortcut branch, a parallel mode, and a serial mode with a shortcut branch;
the 2D residual sub-network comprises at least three sequentially connected 2D residual units of identical structure and 1 pooling layer;
the fully connected layer fully connects the output of the 2D residual sub-network to its n output neurons and outputs an n-dimensional feature vector;
and the Softmax classification layer fully connects the feature vector output by the fully connected layer to n output nodes corresponding to the expression classes and outputs an n-dimensional vector, in which the number in each dimension represents the probability that the input sample belongs to that class, where n is the number of expression classes;
(3) pre-training the constructed deep 3D residual network using a publicly available large-scale video database with class labels to obtain initial weight parameter values; based on the initial weight parameter values, training the constructed deep 3D residual network by fine-tuning using the training-set and validation-set samples in the neonatal expression video library, and optimizing the network model parameters to obtain a trained network model;
(4) inputting the neonatal expression video clip to be tested into the trained network model, performing expression classification and recognition, and obtaining a pain expression recognition result.
2. The neonatal pain expression recognition method based on a deep 3D residual network according to claim 1, characterized in that the 3D residual sub-network comprises a first sub-network, a second sub-network and a third sub-network; each sub-network comprises at least 3 3D residual units of different structures and 1 pooling layer.
3. The neonatal pain expression recognition method based on a deep 3D residual network according to claim 1, characterized in that the 3D residual units in the 3D residual sub-network are, respectively, 3D residual unit A, 3D residual unit B and 3D residual unit C;
3D residual unit A comprises a first branch and a second branch, the first branch comprising, connected in sequence, convolutional layer A1, 3D convolution module A and convolutional layer A4, and the second branch being a shortcut connection branch; after the outputs of the first branch and the second branch are added pixel-by-pixel, the result is output through a ReLU nonlinear activation function layer;
3D residual unit B comprises a first branch and a second branch, the first branch comprising, connected in sequence, convolutional layer B1, 3D convolution module B and convolutional layer B4, and the second branch being a shortcut connection branch; after the outputs of the first branch and the second branch are added pixel-by-pixel, the result is output through a ReLU nonlinear activation function layer;
3D residual unit C comprises a first branch and a second branch, the first branch comprising, connected in sequence, convolutional layer C1, 3D convolution module C and convolutional layer C4, and the second branch being a shortcut connection branch; after the outputs of the first branch and the second branch are added pixel-by-pixel, the result is output through a ReLU nonlinear activation function layer.
4. The neonatal pain expression recognition method based on a deep 3D residual network according to claim 1, characterized in that the 2D residual unit comprises a first branch and a second branch, the first branch comprising 3 sequentially connected convolutional layers 1, 2 and 3, and the second branch being a shortcut connection branch; after the outputs of the first branch and the second branch are added pixel-by-pixel, the result is output through a ReLU nonlinear activation function layer.
5. The neonatal pain expression recognition method based on a deep 3D residual network according to claim 3, characterized in that in 3D residual unit A, convolutional layer A1 performs a convolution operation on the input using m1 convolution kernels of size 1 × 1 × 1; 3D convolution module A comprises convolutional layer A2 and convolutional layer A3, connected serially to realize the 3D convolution over the spatial and temporal domains, wherein convolutional layer A2 performs a spatial convolution on the output of convolutional layer A1 using m1 convolution kernels of size 1 × k × k, and convolutional layer A3 performs a temporal convolution on the output of convolutional layer A2 using m1 convolution kernels of size d × 1 × 1; convolutional layer A4 performs a convolution operation on the output of 3D convolution module A using m2 convolution kernels of size 1 × 1 × 1; wherein m1 is selected from the values 64, 128 and 256, k and d are selected from the values 1 and 3, and m2 is selected from the values 256, 512 and 1024;
in 3D residual unit B, convolutional layer B1 performs a convolution operation on the input using m1 convolution kernels of size 1 × 1 × 1; 3D convolution module B comprises convolutional layer B2 and convolutional layer B3, connected in parallel to realize the 3D convolution over the spatial and temporal domains, wherein convolutional layer B2 performs a spatial convolution on the output of convolutional layer B1 using m1 convolution kernels of size 1 × k × k, and convolutional layer B3 performs a temporal convolution on the output of convolutional layer B1 using m1 convolution kernels of size d × 1 × 1; after the outputs of convolutional layer B2 and convolutional layer B3 are added pixel-by-pixel, the result is output through a ReLU nonlinear activation function layer and serves as the input of convolutional layer B4; convolutional layer B4 performs a convolution operation on the output of 3D convolution module B using m2 convolution kernels of size 1 × 1 × 1;
in 3D residual unit C, convolutional layer C1 performs a convolution operation on the input using m1 convolution kernels of size 1 × 1 × 1; 3D convolution module C comprises convolutional layer C2 and convolutional layer C3, connected serially with a shortcut branch to realize the 3D convolution over the spatial and temporal domains, wherein convolutional layer C2 performs a spatial convolution on the output of convolutional layer C1 using m1 convolution kernels of size 1 × k × k, and convolutional layer C3 performs a temporal convolution on the output of convolutional layer C2 using m1 convolution kernels of size d × 1 × 1; after the outputs of convolutional layer C2 and convolutional layer C3 are added pixel-by-pixel, the result is output through a ReLU nonlinear activation function layer and serves as the input of convolutional layer C4; convolutional layer C4 performs a convolution operation on the output of 3D convolution module C using m2 convolution kernels of size 1 × 1 × 1.
6. The neonatal pain expression recognition method based on a deep 3D residual network according to claim 4, characterized in that in the 2D residual unit, the 3 sequentially connected convolutional layers 1, 2 and 3 each perform a convolution operation on their input using m convolution kernels of size k × k, wherein m is selected from the values 512 and 2048.
7. The neonatal pain expression recognition method based on a deep 3D residual network according to claim 1, characterized in that in step (3), the fully connected layer of the deep 3D residual network is fine-tuned, on the basis of the initial weight parameter values, using a learning rate larger than a preset threshold, while, apart from the fully connected layer, each other layer of the deep 3D residual network is fine-tuned, on the basis of the initial weight parameter values, using a learning rate smaller than the preset threshold.
8. A neonatal pain expression recognition system based on a deep 3D residual network, characterized by comprising:
a sample processing module for collecting the required neonatal expression video clip samples, trimming each video clip into a frame sequence of equal length, establishing a neonatal expression video library containing pain expression class labels, and dividing the samples in the neonatal expression video library into a training set and a validation set;
a network construction module for constructing a deep 3D residual network applied to neonatal pain expression recognition, comprising, connected in sequence: an input layer, a first convolutional layer, a first pooling layer, a 3D residual sub-network, a 2D residual sub-network, a fully connected layer and a Softmax classification layer;
the input layer is used to input a video sequence and normalize every frame image in the video sequence;
the first convolutional layer performs a convolution operation on the normalized video sequence output by the input layer using several 3D convolution kernels, and outputs several feature-map sequences;
the first pooling layer performs a max pooling operation over the spatial and temporal domains on the output of the first convolutional layer using a 3D pooling kernel, and outputs several feature-map sequences;
the 3D residual sub-network comprises 3 kinds of 3D residual units of different structures connected in several alternating cycles, and pooling layers interposed in the connection paths of the 3D residual units; the 3 kinds of 3D residual units all realize the 3D convolution over the spatial and temporal domains by combining a 2D convolution and a 1D convolution, the combinations being, respectively, a serial mode without a shortcut branch, a parallel mode, and a serial mode with a shortcut branch;
the 2D residual sub-network comprises at least three sequentially connected 2D residual units of identical structure and 1 pooling layer;
the fully connected layer fully connects the output of the 2D residual sub-network to n output neurons and outputs an n-dimensional feature vector;
and the Softmax classification layer fully connects the feature vector output by the fully connected layer to n output nodes corresponding to the expression classes and outputs an n-dimensional vector, in which the number in each dimension represents the probability that the input sample belongs to that class, where n is the number of expression classes;
a model training module for pre-training the constructed deep 3D residual network using a publicly available large-scale video database with class labels to obtain initial weight parameter values, and, based on the initial weight parameter values, training the constructed deep 3D residual network by fine-tuning using the training-set and validation-set samples in the neonatal expression video library, optimizing the network model parameters and obtaining a trained network model;
and a test module for inputting the neonatal expression video clip to be tested into the trained network model, performing expression classification and recognition, and obtaining a pain expression recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810346075.3A CN108596069A (en) | 2018-04-18 | 2018-04-18 | Neonatal pain expression recognition method and system based on depth 3D residual error networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108596069A true CN108596069A (en) | 2018-09-28 |
Family
ID=63613362
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810346075.3A Pending CN108596069A (en) | 2018-04-18 | 2018-04-18 | Neonatal pain expression recognition method and system based on depth 3D residual error networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108596069A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110348420A (en) * | 2019-07-18 | 2019-10-18 | 腾讯科技(深圳)有限公司 | Sign Language Recognition Method, device, computer readable storage medium and computer equipment |
CN110674488A (en) * | 2019-09-06 | 2020-01-10 | 深圳壹账通智能科技有限公司 | Verification code identification method and system based on neural network and computer equipment |
WO2020098257A1 (en) * | 2018-11-14 | 2020-05-22 | 平安科技(深圳)有限公司 | Image classification method and device and computer readable storage medium |
CN111222457A (en) * | 2020-01-06 | 2020-06-02 | 电子科技大学 | Detection method for identifying video authenticity based on depth separable convolution |
CN111310516A (en) * | 2018-12-11 | 2020-06-19 | 杭州海康威视数字技术股份有限公司 | Behavior identification method and device |
CN111428771A (en) * | 2019-11-08 | 2020-07-17 | 腾讯科技(深圳)有限公司 | Video scene classification method and device and computer-readable storage medium |
CN111462049A (en) * | 2020-03-09 | 2020-07-28 | 西南交通大学 | Automatic lesion area form labeling method in mammary gland ultrasonic radiography video |
WO2020248841A1 (en) * | 2019-06-13 | 2020-12-17 | 平安科技(深圳)有限公司 | Au detection method and apparatus for image, and electronic device and storage medium |
CN112800894A (en) * | 2021-01-18 | 2021-05-14 | 南京邮电大学 | Dynamic expression recognition method and system based on attention mechanism between space and time streams |
CN113180594A (en) * | 2021-03-09 | 2021-07-30 | 山西三友和智慧信息技术股份有限公司 | Method for evaluating postoperative pain of newborn through multidimensional space-time deep learning |
CN113313056A (en) * | 2021-06-16 | 2021-08-27 | 中国科学技术大学 | Compact 3D convolution-based lip language identification method, system, device and storage medium |
CN116796818A (en) * | 2022-03-15 | 2023-09-22 | 生物岛实验室 | Model training method, device, equipment, storage medium and program product |
CN110674488B (en) * | 2019-09-06 | 2024-04-26 | 深圳壹账通智能科技有限公司 | Verification code identification method, system and computer equipment based on neural network |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2002366825A1 (en) * | 2001-12-20 | 2003-07-09 | Koninklijke Philips Electronics N.V. | Video encoding and decoding method and device |
JP2008310775A (en) * | 2007-06-18 | 2008-12-25 | Canon Inc | Expression recognition device and method and imaging apparatus |
CN106570474A (en) * | 2016-10-27 | 2017-04-19 | 南京邮电大学 | Micro expression recognition method based on 3D convolution neural network |
CN107392109A (en) * | 2017-06-27 | 2017-11-24 | 南京邮电大学 | A kind of neonatal pain expression recognition method based on deep neural network |
Legal event: 2018-04-18 — application CN201810346075.3A filed; status: Pending
Non-Patent Citations (1)
Title |
---|
ZHAOFAN QIU et al.: "Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks", Computer Vision Foundation * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020098257A1 (en) * | 2018-11-14 | 2020-05-22 | 平安科技(深圳)有限公司 | Image classification method and device and computer readable storage medium |
CN111310516A (en) * | 2018-12-11 | 2020-06-19 | 杭州海康威视数字技术股份有限公司 | Behavior identification method and device |
CN111310516B (en) * | 2018-12-11 | 2023-08-29 | 杭州海康威视数字技术股份有限公司 | Behavior recognition method and device |
WO2020248841A1 (en) * | 2019-06-13 | 2020-12-17 | 平安科技(深圳)有限公司 | Au detection method and apparatus for image, and electronic device and storage medium |
CN110348420A (en) * | 2019-07-18 | 2019-10-18 | 腾讯科技(深圳)有限公司 | Sign Language Recognition Method, device, computer readable storage medium and computer equipment |
CN110674488B (en) * | 2019-09-06 | 2024-04-26 | 深圳壹账通智能科技有限公司 | Verification code identification method, system and computer equipment based on neural network |
CN110674488A (en) * | 2019-09-06 | 2020-01-10 | 深圳壹账通智能科技有限公司 | Verification code identification method and system based on neural network and computer equipment |
CN111428771B (en) * | 2019-11-08 | 2023-04-18 | 腾讯科技(深圳)有限公司 | Video scene classification method and device and computer-readable storage medium |
CN111428771A (en) * | 2019-11-08 | 2020-07-17 | 腾讯科技(深圳)有限公司 | Video scene classification method and device and computer-readable storage medium |
CN111222457A (en) * | 2020-01-06 | 2020-06-02 | 电子科技大学 | Detection method for identifying video authenticity based on depth separable convolution |
CN111222457B (en) * | 2020-01-06 | 2023-06-16 | 电子科技大学 | Detection method for identifying authenticity of video based on depth separable convolution |
CN111462049A (en) * | 2020-03-09 | 2020-07-28 | 西南交通大学 | Automatic lesion area form labeling method in mammary gland ultrasonic radiography video |
CN111462049B (en) * | 2020-03-09 | 2022-05-17 | 西南交通大学 | Automatic lesion area form labeling method in mammary gland ultrasonic radiography video |
CN112800894A (en) * | 2021-01-18 | 2021-05-14 | 南京邮电大学 | Dynamic expression recognition method and system based on attention mechanism between space and time streams |
CN112800894B (en) * | 2021-01-18 | 2022-08-26 | 南京邮电大学 | Dynamic expression recognition method and system based on attention mechanism between space and time streams |
CN113180594A (en) * | 2021-03-09 | 2021-07-30 | 山西三友和智慧信息技术股份有限公司 | Method for evaluating postoperative pain of newborn through multidimensional space-time deep learning |
CN113313056A (en) * | 2021-06-16 | 2021-08-27 | 中国科学技术大学 | Compact 3D convolution-based lip language identification method, system, device and storage medium |
CN116796818A (en) * | 2022-03-15 | 2023-09-22 | 生物岛实验室 | Model training method, device, equipment, storage medium and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108596069A (en) | Neonatal pain expression recognition method and system based on depth 3D residual error networks | |
US20220148191A1 (en) | Image segmentation method and apparatus and storage medium | |
CN108388890A (en) | A kind of neonatal pain degree assessment method and system based on human facial expression recognition | |
CN105740612B (en) | Disease treatment system based on tcm clinical practice case | |
CN109820525A (en) | A kind of driving fatigue recognition methods based on CNN-LSTM deep learning model | |
CN111631688B (en) | Algorithm for automatic sleep staging | |
CN107799165A (en) | A kind of psychological assessment method based on virtual reality technology | |
CN107392109A (en) | A kind of neonatal pain expression recognition method based on deep neural network | |
CN109303560A (en) | A kind of atrial fibrillation recognition methods of electrocardiosignal in short-term based on convolution residual error network and transfer learning | |
CN108198620A (en) | A kind of skin disease intelligent auxiliary diagnosis system based on deep learning | |
CN106778014A (en) | A kind of risk Forecasting Methodology based on Recognition with Recurrent Neural Network | |
CN106682616A (en) | Newborn-painful-expression recognition method based on dual-channel-characteristic deep learning | |
CN107016438A (en) | A kind of system based on Chinese medical discrimination artificial neural network algorithm model | |
CN110322962A (en) | A kind of method automatically generating diagnostic result, system and computer equipment | |
CN108363979A (en) | Neonatal pain expression recognition method based on binary channels Three dimensional convolution neural network | |
CN112489769A (en) | Intelligent traditional Chinese medicine diagnosis and medicine recommendation system for chronic diseases based on deep neural network | |
CN106955112A (en) | Brain wave Emotion recognition method based on Quantum wavelet neural networks model | |
CN106909938A (en) | Viewing angle independence Activity recognition method based on deep learning network | |
CN106959946A (en) | A kind of text semantic feature generation optimization method based on deep learning | |
CN110175510A (en) | Multi-mode Mental imagery recognition methods based on brain function network characterization | |
CN109359610A (en) | Construct method and system, the data characteristics classification method of CNN-GB model | |
CN110659420A (en) | Personalized catering method based on deep neural network Monte Carlo search tree | |
CN106355574A (en) | Intra-abdominal adipose tissue segmentation method based on deep learning | |
CN112932501A (en) | Method for automatically identifying insomnia based on one-dimensional convolutional neural network | |
CN114145745B (en) | Graph-based multitasking self-supervision emotion recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180928 |