CN108229268A - Expression recognition and convolutional neural network model training method, apparatus and electronic device - Google Patents
Expression recognition and convolutional neural network model training method, apparatus and electronic device
- Publication number
- CN108229268A CN108229268A CN201611268009.6A CN201611268009A CN108229268A CN 108229268 A CN108229268 A CN 108229268A CN 201611268009 A CN201611268009 A CN 201611268009A CN 108229268 A CN108229268 A CN 108229268A
- Authority
- CN
- China
- Prior art keywords
- expression
- roi
- facial image
- convolutional neural
- neural networks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G06V40/175—Static expression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Abstract
Embodiments of the present invention provide an expression recognition method, a convolutional neural network model training method, corresponding apparatuses, and an electronic device. The method includes: performing facial expression feature extraction on a face image to be detected through the convolutional layer portion of a convolutional neural network model and according to the face key points in the acquired face image to be detected, to obtain a facial expression feature map; determining the ROI corresponding to each face key point in the facial expression feature map; performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and obtaining the expression recognition result of the face image according to at least the ROI feature maps. Embodiments of the present invention can effectively capture subtle expression changes, better handle the differences introduced by different facial poses, and make full use of the detailed information of changes in multiple facial regions, thereby recognizing faces with subtle expression changes and different poses more accurately.
Description
Technical field
Embodiments of the present invention relate to the field of artificial intelligence, and in particular to an expression recognition method, apparatus and electronic device, and to a convolutional neural network model training method, apparatus and electronic device.
Background technology
Facial expression recognition technology assigns an expression category to a given face image, for example: angry, disgusted, happy, sad, frightened, surprised, and so on. At present, facial expression recognition is gradually showing broad application prospects in fields such as human-computer interaction, clinical diagnosis, distance education, and investigation and interrogation, and is a popular research direction in computer vision and artificial intelligence.
One existing facial expression recognition technology is based on a conventional machine learning framework. Expression recognition under such a framework typically includes four basic steps: face detection, facial feature extraction, feature dimensionality reduction, and classification according to the features. However: first, facial feature extraction requires hand-engineered features designed with specialized domain knowledge; second, compared with deep features (feature maps), classical hand-crafted features such as Gabor filters and SIFT are weaker in their level of abstraction and expressive power; third, conventional machine learning methods have difficulty exploiting ever-growing amounts of training data, training takes a long time, and the training process is fragmented and complicated.
As a result, existing expression recognition is costly and its recognition accuracy is relatively low.
Summary of the invention
Embodiments of the present invention provide expression recognition technical solutions.
According to a first aspect of the embodiments of the present invention, an expression recognition method is provided, including: performing facial expression feature extraction on a face image to be detected through the convolutional layer portion of a convolutional neural network model and according to the face key points in the acquired face image to be detected, to obtain a facial expression feature map; determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and obtaining the expression recognition result of the face image according to at least the ROI feature maps.
Optionally, the face image includes a static face image.
Optionally, the face image includes a face image in a video frame sequence.
Optionally, obtaining the expression recognition result of the face image according to at least the ROI feature maps includes: obtaining a preliminary expression recognition result of the face image of the current frame according to the ROI feature maps of the face image of the current frame; and obtaining the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one prior frame.
Optionally, obtaining the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one prior frame includes: performing weighted processing on the preliminary facial expression recognition result of the face image of the current frame and the facial expression recognition result of the face image of the at least one prior frame to obtain the expression recognition result of the face image of the current frame, where the weight of the preliminary expression recognition result of the face image of the current frame is greater than the weight of the expression recognition result of the face image of any prior frame.
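The weighted processing described above can be sketched as follows. This is a minimal illustration only: the particular weight split (0.6 for the current frame, the remainder shared evenly among prior frames) and the six-category score vectors are assumptions of the sketch, not values fixed by this disclosure.

```python
import numpy as np

def fuse_expression_scores(current_scores, prior_scores_list, current_weight=0.6):
    """Weighted fusion: the current frame's preliminary result carries a
    weight larger than that given to any single prior frame's result."""
    current_scores = np.asarray(current_scores, dtype=float)
    priors = [np.asarray(s, dtype=float) for s in prior_scores_list]
    if not priors:
        return current_scores
    prior_weight = (1.0 - current_weight) / len(priors)  # shared evenly
    fused = current_weight * current_scores
    for s in priors:
        fused = fused + prior_weight * s
    return fused

# Hypothetical six-category score vectors (e.g. angry ... surprised).
current = [0.05, 0.05, 0.70, 0.10, 0.05, 0.05]
priors = [[0.10, 0.10, 0.50, 0.10, 0.10, 0.10]]
fused = fuse_expression_scores(current, priors)
print(int(np.argmax(fused)))  # -> 2
```

Because the current frame dominates, the fused category follows the current frame while prior frames smooth out single-frame jitter.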
Optionally, before obtaining the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one prior frame, the method further includes: determining that the position of the current frame in the video frame sequence is greater than or equal to a set position threshold.
Optionally, the method further includes: in response to the position of the current frame in the video frame sequence being less than the set position threshold, outputting the facial expression recognition result of the face image of the current frame, and/or saving the facial expression recognition result of the face image of the current frame.
Optionally, performing facial expression feature extraction on the face image to be detected through the convolutional layer portion of the convolutional neural network model and according to the face key points in the acquired face image to be detected, to obtain the facial expression feature map, includes: performing face key point detection on the face image to be detected to obtain the face key points in the face image; and performing facial expression feature extraction on the face image through the convolutional layer portion of the convolutional neural network model according to the face key points, to obtain the facial expression feature map.
Optionally, before performing facial expression feature extraction on the face image to be detected through the convolutional layer portion of the convolutional neural network model and according to the face key points in the acquired face image to be detected, the method further includes: obtaining sample images for training, and training the convolutional neural network model using the sample images, where the sample images include information on face key points and annotation information on facial expressions.
Optionally, obtaining the sample images for training and training the convolutional neural network model using the sample images includes: obtaining the sample images for training; performing facial expression feature extraction on the sample images through the convolutional layer portion of the convolutional neural network model to obtain facial expression feature maps; determining the ROI corresponding to each face key point in the facial expression feature maps; performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and adjusting the network parameters of the convolutional neural network model according to at least the ROI feature maps.
Optionally, determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map includes: determining, in the facial expression feature map, the position corresponding to each face key point according to the coordinates of that key point; and, taking each determined position as a reference point, obtaining a region of a corresponding set extent and determining each obtained region as the corresponding ROI.
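The two steps above (locating, on the feature map, the position corresponding to each key point's coordinates, then taking a region of set extent around it as the ROI) can be sketched as follows. The feature-map stride, the half-window size, and the clipping at the feature-map borders are assumptions of this sketch, not values specified by the disclosure.

```python
import numpy as np

def keypoint_rois(feature_map, keypoints_xy, stride=4, half=3):
    """For each face key point, map its image coordinates onto the feature
    map (dividing by the assumed cumulative convolution stride) and take a
    square region of set extent around that position, clipped at borders."""
    h, w = feature_map.shape
    rois = []
    for x, y in keypoints_xy:
        cx, cy = int(round(x / stride)), int(round(y / stride))
        x0, x1 = max(cx - half, 0), min(cx + half + 1, w)
        y0, y1 = max(cy - half, 0), min(cy + half + 1, h)
        rois.append(feature_map[y0:y1, x0:x1])
    return rois

fmap = np.arange(32 * 32, dtype=float).reshape(32, 32)   # toy feature map
rois = keypoint_rois(fmap, [(48.0, 60.0), (2.0, 2.0)])
print([r.shape for r in rois])  # -> [(7, 7), (4, 4)]
```

Key points near the image border yield smaller clipped ROIs, which is one reason a subsequent pooling step to a set output size is useful.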
Optionally, performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain the pooled ROI feature maps includes: performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps of a set size; and adjusting the network parameters of the convolutional neural network model according to at least the ROI feature maps includes: feeding the ROI feature maps of the set size into a loss layer to obtain the expression classification result error of performing expression classification on the sample image; and adjusting the network parameters of the convolutional neural network model according to the expression classification result error.
Optionally, feeding the ROI feature maps of the set size into the loss layer to obtain the expression classification result error of performing expression classification on the sample image includes: feeding the ROI feature maps of the set size into the loss layer, and computing and outputting the expression classification result error through the logistic regression loss function of the loss layer.
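As a sketch of the loss-layer computation named above, the following computes a multi-class logistic regression (softmax cross-entropy) loss over a set number of expression categories. The raw per-category scores fed in are assumed to come from a linear scoring step over the pooled, fixed-size ROI features; that intermediate step is an assumption of this sketch, not spelled out by the disclosure.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Multi-class logistic-regression loss over expression categories.

    logits: raw per-category scores; label: ground-truth expression
    category index from the annotation information."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()            # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -float(np.log(probs[label])), probs

# Hypothetical scores for 4 expression categories; ground truth is index 0.
loss, probs = softmax_cross_entropy([2.0, 0.5, 0.1, -1.0], label=0)
print(probs.argmax(), loss > 0.0)  # -> 0 True
```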
Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression categories.
Optionally, the facial expression sample images for training are sample images of a video frame sequence.
Optionally, before obtaining the sample images for training and the information on the corresponding face key points, the method further includes: detecting the sample images for training to obtain the information on the face key points.
According to a second aspect of the embodiments of the present invention, an expression recognition apparatus is provided, including: a first determining module, configured to perform facial expression feature extraction on a face image to be detected through the convolutional layer portion of a convolutional neural network model and according to the face key points in the acquired face image to be detected, to obtain a facial expression feature map; a second determining module, configured to determine the region of interest (ROI) corresponding to each face key point in the facial expression feature map; a third determining module, configured to perform pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and a fourth determining module, configured to obtain the expression recognition result of the face image according to at least the ROI feature maps.
Optionally, the face image includes a static face image.
Optionally, the face image includes a face image in a video frame sequence.
Optionally, the third determining module includes: a first obtaining submodule, configured to obtain a preliminary expression recognition result of the face image of the current frame according to the ROI feature maps of the face image of the current frame; and a second obtaining submodule, configured to obtain the expression recognition result of the face image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one prior frame.
Optionally, the second obtaining submodule is configured to perform weighted processing on the preliminary facial expression recognition result of the face image of the current frame and the facial expression recognition result of the face image of the at least one prior frame to obtain the expression recognition result of the face image of the current frame, where the weight of the preliminary expression recognition result of the face image of the current frame is greater than the weight of the expression recognition result of the face image of any prior frame.
Optionally, the apparatus further includes: a fifth determining module, configured to determine that the position of the current frame in the video frame sequence is greater than or equal to a set position threshold.
Optionally, the apparatus further includes: a response module, configured to, in response to the position of the current frame in the video frame sequence being less than the set position threshold, output the facial expression recognition result of the face image of the current frame, and/or save the facial expression recognition result of the face image of the current frame.
Optionally, the first determining module is configured to perform face key point detection on the face image to be detected to obtain the face key points in the face image, and to perform facial expression feature extraction on the face image through the convolutional layer portion of the convolutional neural network model according to the face key points, to obtain the facial expression feature map.
Optionally, the first determining module is configured to perform face key point extraction on the face image to be detected through the convolutional layer portion of the convolutional neural network model, and to perform facial expression feature extraction on the face image to be detected according to the extracted face key points, to obtain the facial expression feature map.
Optionally, the apparatus further includes: a training module, configured to obtain sample images for training and train the convolutional neural network model using the sample images, where the sample images include information on face key points and annotation information on facial expressions.
Optionally, the training module includes: a first submodule, configured to obtain the sample images for training and perform facial expression feature extraction on the sample images through the convolutional layer portion of the convolutional neural network model to obtain facial expression feature maps; a second submodule, configured to determine the region of interest (ROI) corresponding to each face key point in the facial expression feature maps; a third submodule, configured to perform pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and a fourth submodule, configured to adjust the network parameters of the convolutional neural network model according to at least the ROI feature maps.
Optionally, the second submodule is configured to determine, in the facial expression feature map, the position corresponding to each face key point according to the coordinates of that key point, and, taking each determined position as a reference point, to obtain a region of a corresponding set extent and determine each obtained region as the corresponding ROI.
Optionally, the third submodule is configured to perform pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps of a set size; and the fourth submodule is configured to feed the ROI feature maps of the set size into a loss layer to obtain the expression classification result error of performing expression classification on the sample image, and to adjust the network parameters of the convolutional neural network model according to the expression classification result error.
Optionally, the fourth submodule is configured to feed the ROI feature maps of the set size into the loss layer, and to compute and output the expression classification result error through the logistic regression loss function of the loss layer.
Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression categories.
Optionally, the facial expression sample images for training are sample images of a video frame sequence.
Optionally, the apparatus further includes: a sixth determining module, configured to detect the sample images for training to obtain the information on the face key points.
According to a third aspect of the embodiments of the present invention, an electronic device is provided, including: a processor, a memory, a communication element and a communication bus, where the processor, the memory and the communication element communicate with one another through the communication bus; and the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform any expression recognition method of the first aspect.
According to a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, which stores: an executable instruction for performing facial expression feature extraction on a face image to be detected through the convolutional layer portion of a convolutional neural network model and according to the face key points in the acquired face image to be detected, to obtain a facial expression feature map; an executable instruction for determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map; an executable instruction for performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and an executable instruction for obtaining the expression recognition result of the face image according to at least the ROI feature maps.
According to a fifth aspect of the embodiments of the present invention, a convolutional neural network model training method is provided, including: obtaining sample images for training and information on the corresponding face key points, where the sample images include annotation information on facial expressions; performing facial expression feature extraction on the sample images through the convolutional layer portion of a convolutional neural network model to obtain facial expression feature maps; determining the region of interest (ROI) corresponding to each face key point in the facial expression feature maps; performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and adjusting the network parameters of the convolutional neural network model according to at least the ROI feature maps.
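The final step of the training method, adjusting the network parameters according to the classification error, is ordinarily implemented with gradient descent via backpropagation (cf. the G06N3/084 classification above). A minimal runnable illustration on a toy linear classifier, with random features standing in for the pooled ROI features, might look like this; the dimensions, learning rate, and two-category labels are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 64 pooled-ROI feature vectors (8-dim) with binary expression labels.
features = rng.normal(size=(64, 8))
labels = (features[:, 0] > 0).astype(int)

W = np.zeros((8, 2))   # the "network parameters" being adjusted
b = np.zeros(2)

def forward(X):
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def mean_loss(X, y):
    return float(-np.log(forward(X)[np.arange(len(y)), y]).mean())

before = mean_loss(features, labels)
for _ in range(200):                                 # gradient-descent updates
    probs = forward(features)
    probs[np.arange(len(labels)), labels] -= 1.0     # dLoss/dLogits
    grad = probs / len(labels)
    W -= 0.5 * (features.T @ grad)                   # adjust parameters
    b -= 0.5 * grad.sum(axis=0)
after = mean_loss(features, labels)
print(after < before)  # -> True
```

Each update moves the parameters against the gradient of the classification error, which is the sense in which the error "adjusts the network parameters" in the method above.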
Optionally, determining the region of interest (ROI) corresponding to each face key point in the facial expression feature map includes: determining, in the facial expression feature map, the position corresponding to each face key point according to the coordinates of that key point; and, taking each determined position as a reference point, obtaining a region of a corresponding set extent and determining each obtained region as the corresponding ROI.
Optionally, performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain the pooled ROI feature maps includes: performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps of a set size; and adjusting the network parameters of the convolutional neural network model according to at least the ROI feature maps includes: feeding the ROI feature maps of the set size into a loss layer to obtain the expression classification result error of performing expression classification on the sample image; and adjusting the network parameters of the convolutional neural network model according to the expression classification result error.
Optionally, feeding the ROI feature maps of the set size into the loss layer to obtain the expression classification result error of performing expression classification on the sample image includes: feeding the ROI feature maps of the set size into the loss layer, and computing and outputting the expression classification result error through the logistic regression loss function of the loss layer.
Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression categories.
Optionally, the facial expression sample images for training are sample images of a video frame sequence.
Optionally, before obtaining the sample images for training and the information on the corresponding face key points, the method further includes: detecting the sample images for training to obtain the information on the face key points.
According to a sixth aspect of the embodiments of the present invention, a convolutional neural network model training apparatus is provided, including: a first obtaining module, configured to obtain sample images for training and information on the corresponding face key points, where the sample images include annotation information on facial expressions; a second obtaining module, configured to perform facial expression feature extraction on the sample images through the convolutional layer portion of a convolutional neural network model to obtain facial expression feature maps; a third obtaining module, configured to determine the region of interest (ROI) corresponding to each face key point in the facial expression feature maps; a fourth obtaining module, configured to perform pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and a fifth obtaining module, configured to adjust the network parameters of the convolutional neural network model according to at least the ROI feature maps.
Optionally, the third obtaining module is configured to determine, in the facial expression feature map, the position corresponding to each face key point according to the coordinates of that key point, and, taking each determined position as a reference point, to obtain a region of a corresponding set extent and determine each obtained region as the corresponding ROI.
Optionally, the fourth obtaining module is configured to perform pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps of a set size; and the fifth obtaining module includes: a first obtaining submodule, configured to feed the ROI feature maps of the set size into a loss layer to obtain the expression classification result error of performing expression classification on the sample image; and an adjusting submodule, configured to adjust the network parameters of the convolutional neural network model according to the expression classification result error.
Optionally, the first obtaining submodule is configured to feed the ROI feature maps of the set size into the loss layer, and to compute and output the expression classification result error through the logistic regression loss function of the loss layer.
Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression categories.
Optionally, the facial expression sample images for training are sample images of a video frame sequence.
Optionally, the apparatus further includes: a sixth obtaining module, configured to detect the sample images for training to obtain the information on the face key points.
According to a seventh aspect of the embodiments of the present invention, an electronic device is provided, including: a processor, a memory, a communication element and a communication bus, where the processor, the memory and the communication element communicate with one another through the communication bus; and the memory is configured to store at least one executable instruction, the executable instruction causing the processor to perform any convolutional neural network model training method of the fifth aspect.
According to an eighth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, which stores: an executable instruction for obtaining sample images for training and information on the corresponding face key points, where the sample images include annotation information on facial expressions; an executable instruction for performing facial expression feature extraction on the sample images through the convolutional layer portion of a convolutional neural network model to obtain facial expression feature maps; an executable instruction for determining the region of interest (ROI) corresponding to each face key point in the facial expression feature maps; an executable instruction for performing pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and an executable instruction for adjusting the network parameters of the convolutional neural network model according to at least the ROI feature maps.
According to the technical solutions provided by the embodiments of the present invention, after facial expression feature extraction is performed according to the face key points and a facial expression feature map is obtained, the ROI (Region of Interest) corresponding to each face key point is determined from the facial expression feature map according to the face key points; after each ROI is processed by the ROI pooling layer, ROI feature maps are obtained; then, the facial expression is determined according to the ROI feature maps. By selecting the regions corresponding to the face key points as ROIs, subtle expression changes can be effectively captured while the differences introduced by different facial poses are better handled, the detailed information of changes in multiple facial regions is fully used, and faces with subtle expression changes and different poses are recognized more accurately.
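The ROI pooling step summarized above reduces each ROI, whatever its size, to a feature map of a set size so that downstream layers see fixed-shape input. A minimal sketch in the spirit of Fast R-CNN-style ROI max pooling follows; the 3x3 bin grid and the binning scheme are assumptions of the sketch.

```python
import numpy as np

def roi_max_pool(roi, out_h=3, out_w=3):
    """Max-pool an arbitrarily sized ROI down to a set (out_h, out_w) size
    by dividing it into a grid of bins and taking the max within each bin."""
    h, w = roi.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = roi[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

# ROIs of different sizes all come out 3x3, ready for downstream layers.
a = roi_max_pool(np.arange(49.0).reshape(7, 7))
b = roi_max_pool(np.arange(20.0).reshape(4, 5))
print(a.shape, b.shape)  # -> (3, 3) (3, 3)
```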
Description of the drawings
Fig. 1 is a flowchart of the steps of an expression recognition method according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the steps of an expression recognition method according to Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the steps of an expression recognition method according to Embodiment 3 of the present invention;
Fig. 4 is a flowchart of the steps of a convolutional neural network model training method according to Embodiment 4 of the present invention;
Fig. 5 is a structural block diagram of an expression recognition apparatus according to Embodiment 5 of the present invention;
Fig. 6 is a structural block diagram of a convolutional neural network model training apparatus according to Embodiment 6 of the present invention;
Fig. 7 is a structural block diagram of an electronic device according to Embodiment 7 of the present invention;
Fig. 8 is a structural block diagram of an electronic device according to Embodiment 8 of the present invention.
Specific embodiment
Specific implementations of the embodiments of the present invention are described in further detail below with reference to the accompanying drawings (in which identical reference numerals denote identical elements) and the embodiments. The following embodiments are intended to illustrate the present invention, not to limit its scope.
Those skilled in the art will understand that terms such as "first" and "second" in the embodiments of the present invention are used only to distinguish different steps, devices or modules, and denote neither any particular technical meaning nor any necessary logical order between them.
Embodiment 1
Referring to Fig. 1, a flow chart of the steps of an expression recognition method according to Embodiment 1 of the present invention is shown.
The expression recognition method of the present embodiment includes the following steps:
Step S102: Perform facial expression feature extraction on a face image to be detected through the convolutional layer portion of a convolutional neural network model and the face key points in the acquired face image to be detected, obtaining a facial expression feature map.
A trained convolutional neural network model has a facial expression recognition function and includes at least an input layer portion, a convolutional layer portion, a pooling layer portion, a fully connected layer portion, and so on. The input layer portion is used to input images; the convolutional layer portion performs feature extraction; the pooling layer portion performs pooling on the processing results of the convolutional layer portion, for example down-sampling the feature maps obtained by the convolutional layer portion; the fully connected layer portion can be used for classification, etc.
In this embodiment, facial expression feature extraction is performed by the convolutional layer portion of the convolutional neural network model to obtain a facial expression feature map. As for acquiring the face key points, in one feasible manner they can be obtained by performing face key point detection on the face image to be detected before it is input into the convolutional neural network model; in another feasible manner, they can be extracted by the convolutional layer portion of the convolutional neural network model, that is, the convolutional layer portion first extracts the face key points in the face image to be detected and then performs further facial expression feature extraction based on the extracted face key points to obtain the facial expression feature map; in yet another feasible manner, they can be obtained by manually annotating the face key points on the face image to be detected before it is input into the convolutional neural network model.
Step S104: Determine the ROI corresponding to each face key point in the facial expression feature map.
The facial expression feature map output by the convolutional layer portion contains the processing result for the whole image, which involves a large amount of data; if facial expression recognition were performed directly on this basis, a large amount of data would have to be processed and the system processing load would be heavy. For this reason, in the solution of the embodiment of the present invention, the ROI (Region Of Interest) corresponding to each face key point is first determined according to the face key points. For example, when the ROI corresponding to each face key point is determined according to the information of the face key points, the corresponding position in the facial expression feature map is determined according to the coordinates of the face key point; a region of a set range centered on the determined position is then obtained, and the obtained region is determined as the ROI.
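As a minimal illustrative sketch (not the patent's own code), the mapping from an image-space key point to a fixed-size ROI on the feature map might look like the following; the stride value and the 3×3 region size are assumptions of this sketch:

```python
def keypoint_to_roi(x, y, stride=8, half=1):
    """Map an image-space key point (x, y) onto the feature map
    (dividing by the assumed cumulative stride of the convolutional
    layers) and return a region centered on it, as inclusive
    (x1, y1, x2, y2) coordinates; half=1 gives a 3x3 ROI."""
    fx, fy = x // stride, y // stride   # position on the feature map
    return (fx - half, fy - half, fx + half, fy + half)

# A key point at (100, 60) on the input maps to (12, 7) on the
# feature map, giving the ROI (11, 6, 13, 8).
print(keypoint_to_roi(100, 60))
```

In practice the stride depends on how many stride-2 convolutions and pooling layers precede the feature map, and the ROI boundary would be clipped to the feature map extent.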
Step S106: Perform pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps.
Here, the pooling includes but is not limited to down-sampling.
Step S108: Obtain the expression recognition result of the face image according to at least the ROI feature maps.
The ROI feature maps contain the feature information of the facial expression, and the expression recognition result for the facial expression in the face image to be detected can be obtained according to the ROI feature maps.
With this embodiment, after facial expression feature extraction is performed according to the face key points to obtain a facial expression feature map, the ROI corresponding to each face key point is determined from the facial expression feature map according to the face key points; after each ROI is processed by the ROI pooling layer, an ROI feature map is obtained; the facial expression is then determined according to the ROI feature maps. By selecting the regions corresponding to the face key points as ROIs, subtle expression changes can be captured effectively, the differences introduced by different facial poses can be handled better, and the detail information of changes in multiple facial regions is fully exploited, so that faces with subtle expression changes and different poses are recognized more accurately.
Embodiment 2
Referring to Fig. 2, a flow chart of the steps of an expression recognition method according to Embodiment 2 of the present invention is shown.
In this embodiment, a convolutional neural network model with a facial expression recognition function is first trained, and facial expression recognition is then performed on images based on that model. However, those skilled in the art should understand that, in actual use, a convolutional neural network model trained by a third party may also be used to perform facial expression recognition.
The expression recognition method of the present embodiment includes the following steps:
Step S202: Obtain sample images for training, and train the convolutional neural network model using the sample images.
The sample images may be static images, or sample images from a video frame sequence. The sample images contain the information of the face key points and the annotation information of the facial expressions. In this embodiment, the information of the face key points is obtained by performing detection on the training sample images.
In one feasible manner of implementing this step, the sample images for training are obtained; facial expression feature extraction is performed on the sample images through the convolutional layer portion of the convolutional neural network model to obtain facial expression feature maps; the ROI corresponding to each face key point in the facial expression feature maps is determined; pooling is performed on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain pooled ROI feature maps; and the network parameters of the convolutional neural network model are adjusted according to at least the ROI feature maps.
In one feasible manner, determining the ROI corresponding to each face key point in the facial expression feature map includes: determining, in the facial expression feature map, the corresponding position according to the coordinates of each face key point; taking each determined position as a reference point, obtaining a region of a corresponding set range; and determining each obtained region as the corresponding ROI.
Pooling each determined ROI through the pooling layer portion of the convolutional neural network model yields pooled ROI feature maps of a set size. When the network parameters of the convolutional neural network model are adjusted according to the ROI feature maps, the ROI feature maps of the set size can be fed into a loss layer to obtain the expression classification result error of performing expression classification on the sample images; the network parameters of the convolutional neural network model are then adjusted according to the expression classification result error. The adjusted network parameters include but are not limited to the weight parameters (weight), the bias parameters (bias), and so on.
The expression classification result error can be obtained by feeding the ROIs of the set size into the loss layer, where it is calculated and output by the logistic regression loss function of the loss layer. The logistic regression loss function may be a logistic regression loss function with a set number of expression classes.
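A logistic regression (softmax cross-entropy) loss with a set number of expression classes can be sketched in plain Python as follows; this is an illustrative reimplementation with made-up scores, not the patent's actual loss layer:

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logistic_regression_loss(scores, label):
    """Cross-entropy of the softmax output against the annotated class."""
    probs = softmax(scores)
    return -math.log(probs[label])

# Ten expression classes; the annotated class here is index 4.
# A higher score on the correct class gives a lower loss.
scores = [0.1, 0.0, 0.2, 0.1, 2.5, 0.0, 0.1, 0.3, 0.0, 0.2]
loss = logistic_regression_loss(scores, label=4)
print(round(loss, 4))
```

During training this loss would be back-propagated to update the network parameters; at inference time only the softmax probabilities are needed.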
Through the above process, the training of the convolutional neural network model with an expression recognition function is realized; expression detection on faces can then be performed based on the trained convolutional neural network model.
Step S204: Obtain a face image to be detected.
The face image to be detected may be a static face image, or a face image from a video frame sequence.
Step S206: Perform face key point detection on the face image to be detected to obtain the face key points in the face image.
In this embodiment, face image detection is performed first to obtain the face key points. However, as described above, if the convolutional neural network model has a face key point detection function, the face image to be detected can be input directly into the convolutional neural network model, and the convolutional layer portion of the convolutional neural network model extracts the face key points from the face image to be detected; facial expression feature extraction is then performed on the face image to be detected according to the extracted face key points to obtain the facial expression feature map.
Step S208: Perform facial expression feature extraction on the face image to be detected through the convolutional layer portion of the convolutional neural network model and the face key points in the acquired face image to be detected, obtaining a facial expression feature map.
Step S210: Determine the ROI corresponding to each face key point in the facial expression feature map.
Step S212: Perform pooling on each determined ROI through the pooling layer portion of the convolutional neural network model to obtain a pooled ROI feature map corresponding to each ROI.
Step S214: Obtain the expression recognition result of the face image according to at least the ROI feature maps.
After the ROI feature maps are obtained, expression recognition can be performed according to them.
In a preferred implementation, when the convolutional neural network model is used to detect facial expression images in a continuous video frame sequence, taking the current frame as the reference, the convolutional neural network model can first be used to detect the current frame in the video frame sequence, and a preliminary expression recognition result of the face image of the current frame is obtained according to the ROI feature maps of the face image of the current frame; the expression recognition result of the face image of the current frame is then obtained according to the preliminary expression recognition result of the current frame and the expression recognition result of the face image of at least one preceding frame. For example, after the preliminary facial expression recognition result of the current frame is obtained, it can further be judged whether the position of the current frame in the video frame sequence is greater than or equal to a set position threshold. If not, that is, the position of the current frame in the video frame sequence is less than the set position threshold, the preliminary facial expression recognition result of the current frame is output as the final facial expression recognition result of the face image of the current frame, and/or the facial expression recognition result of the face image of the current frame is saved. If so, the facial expression recognition results of a set number of video frames preceding the current frame are obtained, and linear weighting is performed on the preliminary expression recognition result of the face image of the current frame and the obtained facial expression recognition results of the face images of the at least one preceding frame, obtaining the expression recognition result of the face image of the current frame. The at least one preceding frame may be one or more continuous frames before the current frame, or one or more discontinuous frames before the current frame. Through the above process, the expression recognition result of the current frame can be determined according to the detection results of multiple continuous frames, avoiding the error of single-frame detection and making the detection result more accurate.
When linear weighting is performed on the facial expression recognition result of the current frame and the obtained facial expression recognition result of the at least one preceding frame, weights can be set respectively for the preliminary facial expression recognition result of the current frame and the obtained facial expression recognition results of the preceding frames; when the weights are set, the weight of the preliminary facial expression recognition result of the current frame is greater than the weight of the facial expression recognition result of any of the obtained preceding frames. Then, according to the set weights, linear weighting is performed on the preliminary facial expression recognition result of the current video frame and the obtained facial expression recognition results of the preceding frames. Because expression recognition is performed mainly for the current video frame, a heavier weight is set for the detection result of the current video frame; while the detection results of the related video frames serve as reference, this effectively ensures that the current video frame remains the detection target.
It should be noted that, in the above process, the set position threshold, the set number of video frames preceding the current frame and the set weights can all be set appropriately by those skilled in the art according to actual conditions. Preferably, the set number of video frames is 3.
With this embodiment, a convolutional neural network model that can accurately recognize facial expressions is used, so that subtle expression changes of the face can be captured and expression recognition is more accurate and faster. Moreover, for a continuous video frame sequence, fusing the detection results of multiple continuous frames effectively avoids the error of single-frame detection and further improves the accuracy of expression detection.
Embodiment 3
Referring to Fig. 3, a flow chart of the steps of an expression recognition method according to Embodiment 3 of the present invention is shown.
This embodiment illustrates the expression recognition method of the embodiment of the present invention in the form of a specific example. The expression recognition method of this embodiment includes both a convolutional neural network model training part and a part that performs expression recognition using the trained convolutional neural network model.
The expression recognition method of the present embodiment includes the following steps:
Step S302: Collect facial expression images and annotate the expressions, forming a set of sample images to be trained.
For example, ten kinds of expressions are annotated manually, namely: angry, calm, confused, disgusted, happy, sad, scared, surprised, squinting and screaming.
Step S304: Detect the face and its key points in each sample image using a face detection algorithm, and align the face using the key points.
In this step, a conventional face detection algorithm can be used to detect the face and its key points in each sample image, for example 21 face key points covering the eyes, mouth and so on; the face is then aligned using the 21 face key points.
Step S306: Train the CNN model using the expression-annotated sample images and the face key points.
In this embodiment, a brief example structure of a CNN model is as follows:
// Part 1
1. Data input layer
// Part 2
2. <= 1 Convolutional layer 1_1 (3x3x4/2)
3. <= 2 Nonlinear response ReLU layer
4. <= 3 Pooling layer // ordinary pooling layer
5. <= 4 Convolutional layer 1_2 (3x3x6/2)
6. <= 5 Nonlinear response ReLU layer
7. <= 6 Pooling layer
8. <= 7 Convolutional layer 1_3 (3x3x6)
9. <= 8 Nonlinear response ReLU layer
10. <= 9 Pooling layer
11. <= 10 Convolutional layer 2_1 (3x3x12/2)
12. <= 11 Nonlinear response ReLU layer
13. <= 12 Pooling layer
14. <= 13 Convolutional layer 2_2 (3x3x12)
15. <= 14 Nonlinear response ReLU layer
16. <= 15 Pooling layer
17. <= 16 Nonlinear response ReLU layer
18. <= 17 Convolutional layer 5_4 (3x3x16)
// Part 3
19. <= 18 ROI Pooling layer // performs ROI pooling
20. <= 19 Fully connected layer
21. <= 20 Loss layer
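As a rough illustration of how the spatial resolution shrinks through such a stack, the output-size arithmetic for a few of the listed blocks can be sketched as follows. Note the assumptions: "kxkxc/2" is read as kernel k, c output channels, stride 2 (stride 1 where no "/2" is given), and the 2x2 stride-2 pooling and padding of 1 are choices of this sketch, since the listing does not state them:

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Spatial output size of a convolution on a square input."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of an ordinary 2x2 stride-2 pooling layer."""
    return (size - kernel) // stride + 1

size = 128  # hypothetical input resolution
for name, stride in [("conv1_1", 2), ("conv1_2", 2), ("conv1_3", 1)]:
    size = conv_out(size, stride=stride)
    size = pool_out(size)  # each conv block above is followed by pooling
    print(name, "->", size)
# Under these assumptions: conv1_1 -> 32, conv1_2 -> 8, conv1_3 -> 4
```

The same formulas explain why stride-2 convolutions enlarge the receptive field of upper-layer features without adding computation, as noted later in this embodiment.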
In the above CNN model structure, the expression-annotated sample images and the face key points are input into the CNN model through the input layer of the first part for training; they are then processed by the conventional convolutional layer portion of the second part; based on the processing result of the second part, ROI feature maps are obtained according to the face key points, and the obtained ROI feature maps are input into the ROI Pooling layer for ROI pooling, obtaining pooled ROI feature maps; the pooled ROI feature maps are then input into the fully connected layer and the loss layer in turn; how to adjust the network parameters of the CNN model is determined according to the processing result of the loss layer, and the CNN model is thus trained.
When the ROI feature maps are obtained according to the face key points based on the processing result of the second part, for the ROIs corresponding to the 21 face key points, the coordinates of the 21 key points can first be mapped back onto the feature map output by the last convolutional layer of the CNN model (the 32nd layer in this embodiment); that is, the 21 face key points detected on the original sample image are mapped onto the feature map output by the 32nd layer, and 21 small regions centered on these key points (for example 3×3 regions, or irregular regions) are extracted from the feature map. The feature maps of these 21 regions are then used as the input of the ROI Pooling layer to obtain the ROI feature maps; the ROI feature maps are input into the fully connected layer, followed by a ten-class logistic regression loss function layer (e.g., a SoftmaxWithLoss layer) that calculates the error between the result and the annotated facial expression, and the error is back-propagated so as to update the parameters of the CNN model (including the parameters of the fully connected layer). This cycle is repeated until the error no longer decreases and the CNN model converges, obtaining the trained model.
Because the 21 ROI regions cover all the positions relevant to facial expression without redundancy, the CNN model can focus more on learning these regions well and can more easily capture the subtle changes of the facial muscles; the ROI Pooling layer yields fixed-length ROI feature representations, so that the same network structure can be used even when ROI regions of different sizes are input; these fixed-length feature representations are then input into the fully connected layer and the loss layer in turn, obtaining the final expression classification result.
The ROI Pooling layer is a pooling layer for the ROI feature maps. For example, if the coordinates of an ROI region are (x1, y1, x2, y2), the input size is (y2-y1) × (x2-x1); if the output size of the ROI Pooling layer is pooled_height × pooled_width, then the output of each grid cell is the pooling result of a region of size [(y2-y1)/pooled_height] × [(x2-x1)/pooled_width].
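A simplified, pure-Python sketch of this grid computation (max pooling per bin; the real layer operates per channel on the feature maps and handles fractional bin boundaries, both omitted here):

```python
def roi_pool(feature, roi, pooled_h=2, pooled_w=2):
    """Max-pool the ROI (x1, y1, x2, y2) of a 2-D feature map into a
    fixed pooled_h x pooled_w grid, regardless of the ROI's input size."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    out = []
    for i in range(pooled_h):
        row = []
        for j in range(pooled_w):
            # Each output cell covers roughly (h / pooled_h) x (w / pooled_w)
            # input cells, matching the grid formula in the text above.
            ys, ye = y1 + i * h // pooled_h, y1 + (i + 1) * h // pooled_h
            xs, xe = x1 + j * w // pooled_w, x1 + (j + 1) * w // pooled_w
            row.append(max(feature[y][x]
                           for y in range(ys, max(ye, ys + 1))
                           for x in range(xs, max(xe, xs + 1))))
        out.append(row)
    return out

feature = [[r * 10 + c for c in range(6)] for r in range(6)]  # 6x6 toy map
print(roi_pool(feature, (1, 1, 5, 5)))  # a 4x4 ROI pooled to 2x2
```

Because the output grid is fixed at pooled_h × pooled_w, ROIs of different sizes all produce feature representations of the same length, which is what allows a single fully connected layer to follow.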
Furthermore, it should be noted that, in the description of the above convolutional network structure, "2. <= 1" indicates that the current layer is the second layer and its input is the first layer; the bracket after a convolutional layer gives the convolutional layer parameters, for example (3x3x16) indicates that the convolution kernel size is 3x3 and the number of channels is 16. The rest can be deduced by analogy and is not repeated here.
In the above convolutional network structure, there is one nonlinear response unit (ReLU) after each convolutional layer. Preferably, the ReLU may be a PReLU (Parametric Rectified Linear Unit), to effectively improve the detection accuracy of the CNN model.
In addition, setting the convolution kernels of the convolutional layers to 3x3 allows local information to be integrated better; setting the stride of the convolutional layers allows upper-layer features to obtain a larger field of view without increasing the amount of computation.
However, it should be clear to those skilled in the art that the above convolution kernel sizes, numbers of channels and numbers of convolutional layers are exemplary; in practical applications, those skilled in the art can adapt them according to actual needs, and the embodiment of the present invention is not limited in this respect. In addition, all combinations of layers and parameters in the convolutional network model of this embodiment are optional and can be combined arbitrarily.
Step S308: Perform expression recognition on the aligned facial expression images through the trained CNN model, and obtain the recognition results.
The difference from CNN model training is that, when the trained CNN model is used for expression recognition, the fully connected layer of the CNN model is followed by a ten-class logistic regression layer rather than a logistic regression loss function layer, so as to obtain the recognition result directly.
For a single image, expression recognition can be performed directly by the CNN model trained as described above.
For a video frame sequence, each frame can be recognized as a single image. However, in order to improve the accuracy of expression recognition for a video frame sequence, multiple frames can be fused. For example, let t=1 be the first frame of the video; when t>=3, that is, when the position of the current frame is at or after the third frame, the current frame and the two frames before it are recognized at the same time, obtaining the recognition results of these three frames. If the three input frames are denoted as Xt-2, Xt-1 and Xt, and their recognition results as Yt-2, Yt-1 and Yt, the recognition results of these three frames are linearly weighted; if the weight of the current frame is 0.5 and the weights of the two frames before it are both 0.25, then the final prediction result is Y = 0.25 × Yt-2 + 0.25 × Yt-1 + 0.5 × Yt. When t<3, that is, when the current frame position is before the third frame, Y = Yt. It should be clear to those skilled in the art that the above weights for each frame are merely exemplary; in practical applications, those skilled in the art can set the weight for each frame appropriately according to actual needs, with the weight of the current frame greater than those of the other frames.
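The linear weighting above can be sketched as follows; the per-class probability vectors are made up for illustration, while the weights 0.25/0.25/0.5 are the example weights from the text:

```python
def fuse(results, weights):
    """Linearly weight per-class recognition results of consecutive frames.
    results: per-class probability vectors [Yt-2, Yt-1, Yt];
    weights: one weight per frame, heaviest on the current frame."""
    n_classes = len(results[0])
    return [sum(w * r[c] for w, r in zip(weights, results))
            for c in range(n_classes)]

def predict(frame_results, t, weights=(0.25, 0.25, 0.5)):
    """Y = Yt for t < 3; otherwise the weighted sum over the last 3 frames."""
    if t < 3:                        # before the third frame: no fusion
        return frame_results[t - 1]
    return fuse(frame_results[t - 3:t], weights)

# Three frames of made-up 3-class results (e.g. angry / calm / happy).
frames = [[0.2, 0.7, 0.1], [0.1, 0.6, 0.3], [0.1, 0.2, 0.7]]
print(predict(frames, t=3))
```

Because the weights sum to 1, the fused result remains a probability distribution, and the argmax over it gives the final expression class for the current frame.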
With this embodiment, by selecting the regions corresponding to the 21 key points of the face as ROIs, the detail information of changes in multiple facial regions is fully exploited and subtle expression changes of the face can be captured, making the recognition more accurate and fast; by fusing multiple frames, the solution of this embodiment can be effectively applied to video-based expression recognition.
The expression recognition method of this embodiment can be performed by any suitable device with data processing capability, including but not limited to: mobile terminals, PCs, servers, vehicle-mounted devices, advertising machines, face check-in devices, etc.
Embodiment 4
Referring to Fig. 4, a flow chart of the steps of a convolutional neural network model training method according to Embodiment 4 of the present invention is shown.
The convolutional neural network model training method of this embodiment includes the following steps:
Step S402: Obtain sample images for training and the information of the corresponding face key points.
The sample images contain the annotation information of the facial expressions; the sample images can be annotated in advance before CNN training is performed. In this embodiment, the corresponding facial expression information is annotated in the sample images, so that whether the CNN training results are accurate can subsequently be determined according to the annotation information.
In addition, for each sample image, the information of the corresponding face key points also needs to be obtained. Therefore, in practical applications, before this step, the method can further include: detecting the sample images for training to obtain the information of the face key points. To improve the training effect, after the information of the face key points is obtained, face alignment can also be performed according to these key points, and the face-aligned sample images are input into the CNN for sample training. Face alignment can improve the training effect of the sample images.
In the above process, detecting the face key points in the images and aligning the faces can be realized by those skilled in the art using any suitable relevant means. Detection of the face key points in the images can be realized by, but is not limited to: a CNN with a face key point locating function; the ASM (Active Shape Model) method; or G-EBGM (elastic bunch graph matching based on Gabor features), etc. Face alignment can be realized by, but is not limited to: AAM (Active Appearance Model), CLM (Constrained Local Models), etc.
In addition, conventional key points, such as 68 face key points, can be used as the face key points, but the key points are not limited thereto. In the embodiment of the present invention, 21 face key points can be used, which respectively include: 3 key points for each eyebrow (inner end, outer end and peak), 3 key points for each eye (inner corner, outer corner and pupil center), 4 key points for the nose region (the outermost points of the nostril wings on both sides, the nose tip and the lowest point of the nose), and 5 key points for the mouth region (the two mouth corners, the depression point of the upper lip, the depression point of the lower lip, and the middle point of the contact line between the lower lip and the upper lip). On the one hand, these 21 key points represent the key positions of the face and can efficiently characterize the facial features; on the other hand, the training of the convolutional neural network model of the embodiment of the present invention can be accomplished with these 21 key points, reducing the data volume and training cost.
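A tiny sanity check of the key point layout just described, with one count per facial region:

```python
# Key point groups as described above: counts per facial region
# (both sides listed separately for eyebrows and eyes).
keypoint_groups = {
    "left eyebrow": 3, "right eyebrow": 3,  # inner end, outer end, peak
    "left eye": 3, "right eye": 3,          # inner corner, outer corner, pupil center
    "nose": 4,                              # two nostril-wing points, tip, lowest point
    "mouth": 5,                             # corners, lip depression points, mid contact point
}
total = sum(keypoint_groups.values())
print(total)  # 21 key points in all
```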
Step S404: Perform facial expression feature extraction on the sample images through the convolutional layer portion of the CNN model, obtaining facial expression feature maps.
In this embodiment, the conventional portion of the CNN model can adopt the convolutional structure of conventional CNN models; the processing of the sample images can follow the processing of the convolutional layer portion of relevant CNN models and is not described again here. After the processing of the convolutional layer portion, the corresponding facial expression feature map is obtained (a given processing result of the convolutional layer portion can be understood as the output of the CNN model in a given training pass).
Step S406: Determine, in the facial expression feature map, the ROI corresponding to each face key point.
The facial expression feature map output by the convolutional layers contains the processing result for the whole image. If this result were used directly for subsequent expression training, on the one hand, the amount of data to be processed would be large; on the other hand, the training could not be targeted at facial expressions, making the training results inaccurate.
For this reason, in the solution of the embodiment of the present invention, when the ROI corresponding to each face key point is determined according to the information of the face key points, the corresponding position in the facial expression feature map is determined according to the coordinates of each face key point; taking each determined position as a reference point (for example as a center point; in practical applications, a small deviation of the center point is allowed), a region of a set range is obtained, and each obtained region is determined as the corresponding ROI. Taking 21 face key points as an example, when the ROIs are determined, the 21 face key points can first be mapped, according to their coordinates, back onto the facial expression feature map output by the last convolutional layer of the CNN; then, centered on each key point on the facial expression feature map, a region of a certain range (the extraction range is generally 3×3 to 7×7, preferably 3×3) is extracted, and the feature maps of these 21 regions are used as the input of the ROI Pooling layer. These 21 regions cover all the positions relevant to facial expression without redundancy, so that the network can focus more on learning these regions well and can more easily capture the subtle changes of the facial muscles.
Step S408: Perform pooling on each determined ROI through the pooling layer portion of the CNN, obtaining pooled ROI feature maps.
In CNN models, pooling layers often follow convolutional layers; pooling reduces the feature vectors output by the convolutional layers while improving the results, making the results less prone to over-fitting. For different images, the size and stride of the pooling window can be computed dynamically according to the size of the image, so as to obtain pooling results of the same size.
In the embodiment of the present invention, the ROIs are input into the pooling layer; after the ROI pooling of the pooling layer, fixed-length feature representations of the ROIs, that is, ROI feature maps of a unified size, can be obtained.
Step S410: Adjust the network parameters of the CNN model according to at least the pooled ROI feature maps.
In one feasible manner, the ROI feature maps of a set size obtained after ROI pooling can be input into the fully connected layer for corresponding processing; the processed ROIs of the set size are then fed into the loss layer to obtain the expression classification result error of performing expression classification on the sample images (for example, the error can be the output result of the loss layer); the network parameters of the CNN model are adjusted according to the expression classification result error.
After the processing result of the ROI pooling layer is obtained, the result may be input into a fully connected layer, which converts images of different sizes into features of the same dimension; the features output by the fully connected layer are then input into the loss layer to obtain a loss result; and whether to continue training by adjusting the network parameters of the CNN model is decided according to the loss result. Specifically, in this embodiment, after the ROI feature maps of the ROI pooling layer are obtained, the ROI feature maps are input into the fully connected layer to obtain ROI features of a set dimension, where the set dimension may be set appropriately by those skilled in the art according to actual demand, which is not limited by the embodiment of the present invention. The ROI features of the set dimension are input into the loss layer, and a loss result is calculated through a loss function; whether the training output of the CNN model satisfies a convergence condition is then judged according to the loss result. If the convergence condition is satisfied, the training of the CNN model is terminated; if the convergence condition is not satisfied, the parameters of the CNN model training (including but not limited to the weight parameters and the bias parameters) are adjusted according to the loss result, and the training of the CNN model is continued with the adjusted parameters until the training result satisfies the convergence condition. The convergence condition may be set appropriately by those skilled in the art according to actual needs, which is not limited by the embodiment of the present invention.
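The loop above — compute a loss, test a convergence condition, otherwise adjust the weight and bias parameters and continue training — can be illustrated with a deliberately simplified NumPy sketch. This is a toy logistic-regression trainer, not the CNN of the patent; the learning rate, tolerance, iteration cap, and random initialization are all assumptions made for the illustration.

```python
import numpy as np

def train_until_converged(X, y, lr=0.5, tol=1e-3, max_iter=5000):
    """Minimal training loop: compute loss, check a convergence
    condition, otherwise adjust weight (w) and bias (b) parameters."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1]) * 0.01   # weight parameters
    b = 0.0                                  # bias parameter
    loss = np.inf
    for _ in range(max_iter):
        z = X @ w + b
        p = 1.0 / (1.0 + np.exp(-z))         # sigmoid prediction
        loss = -np.mean(y * np.log(p + 1e-12) +
                        (1 - y) * np.log(1 - p + 1e-12))
        if loss < tol:                       # convergence condition met
            break
        grad = p - y                         # dL/dz for the logistic loss
        w -= lr * (X.T @ grad) / len(y)      # adjust weights
        b -= lr * grad.mean()                # adjust bias
    return w, b, loss
```

The structure (loss, convergence check, parameter adjustment, repeat) is the same as in the CNN training described above, only applied to a far simpler model.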
In this embodiment, the loss function of the loss layer is a logistic regression loss function. In this case, the ROIs of the set size output by the fully connected layer are input into the loss layer, the error of the expression classification result is calculated through the logistic regression loss function of the loss layer, and the error calculation result is output. Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression classes. For example, if ten types of expressions are annotated in the sample images and the training goal of the CNN model is to recognize and classify these ten types of expressions, the logistic regression loss function may be a ten-class logistic regression loss function, where "ten-class" means that the logistic regression loss function can detect and recognize the ten annotated types of expressions.
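A multi-class ("N-class") logistic regression loss of the kind described is conventionally realized as softmax cross-entropy. The following NumPy sketch is illustrative only (the patent does not specify its exact formulation); the ten-entry logit vector mirrors the ten-expression example above.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Multi-class logistic-regression (softmax) loss for one sample.

    `logits` has one entry per expression class (e.g. 10 classes);
    `label` is the index of the annotated expression."""
    z = logits - logits.max()                  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax
    return -log_probs[label]
```

With uniform logits over ten classes the loss equals ln(10), and it shrinks as the logit of the annotated class grows, which is exactly the error signal used to adjust the network parameters.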
In addition, in practical CNN model training, facial expression sample images from a video frame sequence may also be used as the sample images for training. There is a certain connection between the video frames in a video frame sequence, and using the video frame sequence as the training samples is more conducive to the trained CNN model recognizing facial expressions in consecutive video frames during actual detection.
Through the above process, the expression recognition training of the CNN model in the embodiment of the present invention is realized. Different from conventional training methods, after the facial expression sample images are processed by the convolutional layer part of the CNN model, the ROIs are determined in the convolutional layer processing result according to the face keypoints, the ROIs are input into the ROI pooling layer for processing, and finally the training of the convolutional neural network model is determined according to the processing result of the ROI pooling layer. By selecting the regions corresponding to the face keypoints as ROIs, the training can be more targeted, and the detailed information of variations in multiple facial regions can be fully utilized, so that faces with subtle expression changes and different postures are recognized more accurately. This CNN model training method, which uses the facial regions corresponding to the face keypoints as ROIs, can effectively capture subtle expression changes and better handle the differences brought about by different facial postures, thereby improving the prediction accuracy and robustness of the CNN model. Moreover, compared with performing expression recognition with a conventional machine learning framework, the CNN constructed in the embodiment of the present invention can, owing to its own structural characteristics, not only be trained with samples of large data volume, but also achieves high training efficiency at relatively low training cost.
The convolutional neural network model training method of this embodiment may be performed by any suitable device having data-processing capability, including but not limited to: a mobile terminal, a PC, and the like.
Embodiment five
Referring to Fig. 5, a structural block diagram of an expression recognition apparatus according to embodiment five of the present invention is shown. It specifically includes the following modules:
A first determining module 502, configured to perform facial expression feature extraction on a facial image to be detected through the convolutional layer part of a convolutional neural network model and the acquired face keypoints in the facial image to be detected, to obtain a facial expression feature map.
A second determining module 504, configured to determine regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature map.
A third determining module 506, configured to perform pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps.
A fourth determining module 508, configured to obtain an expression recognition result of the facial image according at least to the ROI feature maps.
Optionally, the facial image includes a static facial image.
Optionally, the facial image includes a facial image in a video frame sequence.
Optionally, the third determining module 506 includes: a first acquisition submodule 5062, configured to obtain a preliminary expression recognition result of the facial image of the current frame according to the ROI feature maps of the facial image of the current frame; and a second acquisition submodule 5064, configured to obtain the expression recognition result of the facial image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the facial image of at least one prior frame.
Optionally, the second acquisition submodule 5064 is configured to perform weighted processing on the preliminary facial expression recognition result of the facial image of the current frame and the facial expression recognition result of the facial image of at least one prior frame, to obtain the expression recognition result of the facial image of the current frame, where the weight of the preliminary expression recognition result of the facial image of the current frame is greater than the weight of the expression recognition result of the facial image of any prior frame.
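The weighted processing just described can be sketched as follows. This NumPy fragment is illustrative only: the weight value 0.6 and the even split of the remaining weight over prior frames are assumptions, chosen so that the current frame's preliminary result always outweighs any single prior frame.

```python
import numpy as np

def fuse_expression_scores(current, priors, current_weight=0.6):
    """Weighted fusion of per-class expression scores across frames.

    The current frame's preliminary scores get `current_weight`; the
    remaining weight is split evenly over the prior frames, so with any
    current_weight > 0.5 the current frame outweighs each prior frame."""
    priors = np.asarray(priors, dtype=float)
    prior_w = (1.0 - current_weight) / len(priors)
    return current_weight * np.asarray(current, dtype=float) \
        + prior_w * priors.sum(axis=0)
```

For example, fusing current scores [0.9, 0.1] with two prior frames [0.2, 0.8] and [0.4, 0.6] yields [0.66, 0.34]: the current frame dominates, but the prior frames smooth the result across the sequence.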
Optionally, the apparatus further includes: a fifth determining module 510, configured to determine that the position of the current frame in the video frame sequence is greater than or equal to a set position threshold.
Optionally, the apparatus further includes: a response module 512, configured to, in response to the position of the current frame in the video frame sequence being less than the set position threshold, output the facial expression recognition result of the facial image of the current frame, and/or save the facial expression recognition result of the facial image of the current frame.
Optionally, the first determining module 502 is configured to perform face keypoint detection on the facial image to be detected to obtain the face keypoints in the facial image, and, according to the face keypoints, perform facial expression feature extraction on the facial image through the convolutional layer part of the convolutional neural network model to obtain the facial expression feature map.
Optionally, the first determining module 502 is configured to perform face keypoint extraction on the facial image to be detected through the convolutional layer part of the convolutional neural network model, and perform facial expression feature extraction on the facial image to be detected according to the extracted face keypoints to obtain the facial expression feature map.
Optionally, the apparatus further includes: a training module 514, configured to obtain sample images for training and train the convolutional neural network model using the sample images, where the sample images contain the information of the face keypoints and the annotation information of the facial expressions.
Optionally, the training module 514 includes: a first submodule 5142, configured to obtain the sample images for training and perform facial expression feature extraction on the sample images through the convolutional layer part of the convolutional neural network model, to obtain facial expression feature maps; a second submodule 5144, configured to determine regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature maps; a third submodule 5146, configured to perform pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and a fourth submodule 5148, configured to adjust the network parameters of the convolutional neural network model according at least to the ROI feature maps.
Optionally, the second submodule 5144 is configured to, in the facial expression feature map, determine the corresponding positions according to the coordinates of the face keypoints, take the determined positions as reference points, obtain the corresponding regions of a set range, and determine each obtained region as a corresponding ROI.
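The keypoint-to-ROI step (take each keypoint coordinate as a reference point and form a region of a set range around it) can be sketched in plain Python. The half-window size and the clipping to the feature-map borders are assumptions for the illustration; the patent only specifies a "set range" around each reference point.

```python
def rois_from_keypoints(feat_h, feat_w, keypoints, half_size=2):
    """For each face keypoint (x, y) on the expression feature map,
    take the point as a reference and form a square region of a set
    range around it, clipped to the feature-map borders.
    Returns (x0, y0, x1, y1) boxes with exclusive upper bounds."""
    rois = []
    for x, y in keypoints:
        x0 = max(0, x - half_size)
        y0 = max(0, y - half_size)
        x1 = min(feat_w, x + half_size + 1)
        y1 = min(feat_h, y + half_size + 1)
        rois.append((x0, y0, x1, y1))
    return rois
```

A keypoint near a border simply yields a smaller clipped region, which the subsequent ROI pooling step still maps to the fixed output size.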
Optionally, the third submodule 5146 is configured to perform pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps of a set size; and the fourth submodule 5148 is configured to input the ROIs of the set size into a loss layer to obtain the expression classification result error of performing expression classification on the sample image, and adjust the network parameters of the convolutional neural network model according to the expression classification result error.
Optionally, the fourth submodule 5148 is configured to input the ROIs of the set size into the loss layer, and calculate and output the expression classification result error through the logistic regression loss function of the loss layer.
Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression classes.
Optionally, the facial expression sample images to be trained are sample images of a video frame sequence.
Optionally, a sixth determining module 516 is configured to detect the sample images for training, to obtain the information of the face keypoints.
The expression recognition apparatus of this embodiment can perform any one of the expression recognition methods in embodiments one to three and achieve the advantageous effects of those methods, which are not repeated here.
Embodiment six
Referring to Fig. 6, a structural block diagram of a convolutional neural network model training apparatus according to embodiment six of the present invention is shown. It specifically includes the following modules:
A first acquisition module 602, configured to obtain sample images for training and the information of the corresponding face keypoints, where the sample images contain the annotation information of the facial expressions.
A second acquisition module 604, configured to perform facial expression feature extraction on the sample images through the convolutional layer part of a convolutional neural network model, to obtain facial expression feature maps.
A third acquisition module 606, configured to determine regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature maps.
A fourth acquisition module 608, configured to perform pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps.
A fifth acquisition module 610, configured to adjust the network parameters of the convolutional neural network model according at least to the ROI feature maps.
Optionally, the third acquisition module 606 is configured to, in the facial expression feature map, determine the corresponding positions according to the coordinates of the face keypoints, take the determined positions as reference points, obtain the corresponding regions of a set range, and determine each obtained region as a corresponding ROI.
Optionally, the fourth acquisition module 608 is configured to perform pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps of a set size; and the fifth acquisition module 610 includes: a first acquisition submodule 6102, configured to input the ROIs of the set size into a loss layer to obtain the expression classification result error of performing expression classification on the sample image; and an adjustment submodule 6104, configured to adjust the network parameters of the convolutional neural network model according to the expression classification result error.
Optionally, the first acquisition submodule 6102 is configured to input the ROIs of the set size into the loss layer, and calculate and output the expression classification result error through the logistic regression loss function of the loss layer.
Optionally, the logistic regression loss function is a logistic regression loss function with a set number of expression classes.
Optionally, the facial expression sample images to be trained are sample images of a video frame sequence.
Optionally, the apparatus further includes: a sixth acquisition module 612, configured to detect the sample images for training, to obtain the information of the face keypoints.
The convolutional neural network model training apparatus of this embodiment can perform the convolutional neural network model training method in embodiment four and achieve the advantageous effects of that method, which are not repeated here.
Embodiment seven
Embodiment seven of the present invention provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Referring now to Fig. 7, a structural block diagram of an electronic device 700 suitable for implementing a terminal device or a server of the embodiment of the present invention is shown. As shown in Fig. 7, the electronic device 700 includes one or more processors, a communication element, and the like. The one or more processors are, for example, one or more central processing units (CPUs) 701 and/or one or more graphics processing units (GPUs) 713. The processors may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded from a storage section 708 into a random access memory (RAM) 703. The communication element includes a communication component 712 and/or a communication interface 709. The communication component 712 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card; the communication interface 709 includes a communication interface of a network card such as a LAN card or a modem, and performs communication processing via a network such as the Internet.
The processors may communicate with the read-only memory 702 and/or the random access memory 703 to execute the executable instructions, are connected with the communication component 712 through a communication bus 704, and communicate with other target devices through the communication component 712, thereby completing the operations corresponding to any one of the expression recognition methods provided by the embodiments of the present invention, for example: performing facial expression feature extraction on a facial image to be detected through the convolutional layer part of a convolutional neural network model and the acquired face keypoints in the facial image to be detected, to obtain a facial expression feature map; determining regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature map; performing pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and obtaining the expression recognition result of the facial image according at least to the ROI feature maps.
In addition, the RAM 703 may also store various programs and data required for the operation of the apparatus. The CPU 701 or GPU 713, the ROM 702, and the RAM 703 are connected with each other through the communication bus 704. In the presence of the RAM 703, the ROM 702 is an optional module. The RAM 703 stores executable instructions, or executable instructions are written into the ROM 702 at runtime, and the executable instructions cause the processors to perform the operations corresponding to the above communication method. An input/output (I/O) interface 705 is also connected to the communication bus 704. The communication component 712 may be integrally disposed, or may be provided with multiple submodules (for example, multiple IB network cards) and linked on the communication bus.
The I/O interface 705 is connected to the following components: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), a loudspeaker, and the like; a storage section 708 including a hard disk and the like; and the communication interface 709 of a network card including a LAN card, a modem, and the like. A drive 710 is also connected to the I/O interface 705 as needed. A detachable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage section 708 as needed.
It should be noted that the architecture shown in Fig. 7 is only an optional implementation. In specific practice, the number and types of the components in Fig. 7 may be selected, deleted, added, or replaced according to actual needs. In the arrangement of different functional components, separate or integrated arrangements and other implementations may also be adopted; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication element may be arranged separately or integrally on the CPU or GPU, and so on. These interchangeable implementations all fall within the protection scope of the present invention.
In particular, according to the embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present invention include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium. The computer program includes program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the execution of the method steps provided by the embodiments of the present invention, for example: performing facial expression feature extraction on a facial image to be detected through the convolutional layer part of a convolutional neural network model and the acquired face keypoints in the facial image to be detected, to obtain a facial expression feature map; determining regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature map; performing pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and obtaining the expression recognition result of the facial image according at least to the ROI feature maps. In such embodiments, the computer program may be downloaded and installed from a network through the communication element, and/or installed from the detachable medium 711. When the computer program is executed by the processors, the above functions defined in the method of the embodiment of the present invention are performed.
Embodiment eight
Embodiment eight of the present invention provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, a server, or the like. Referring now to Fig. 8, a structural block diagram of an electronic device 800 suitable for implementing a terminal device or a server of the embodiment of the present invention is shown. As shown in Fig. 8, the electronic device 800 includes one or more processors, a communication element, and the like. The one or more processors are, for example, one or more central processing units (CPUs) 801 and/or one or more graphics processing units (GPUs) 813. The processors may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 802 or executable instructions loaded from a storage section 808 into a random access memory (RAM) 803. The communication element includes a communication component 812 and/or a communication interface 809. The communication component 812 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card; the communication interface 809 includes a communication interface of a network card such as a LAN card or a modem, and performs communication processing via a network such as the Internet.
The processors may communicate with the read-only memory 802 and/or the random access memory 803 to execute the executable instructions, are connected with the communication component 812 through a communication bus 804, and communicate with other target devices through the communication component 812, thereby completing the operations corresponding to any one of the convolutional neural network model training methods provided by the embodiments of the present invention, for example: obtaining sample images for training and the information of the corresponding face keypoints, where the sample images contain the annotation information of the facial expressions; performing facial expression feature extraction on the sample images through the convolutional layer part of a convolutional neural network model, to obtain facial expression feature maps; determining regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature maps; performing pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and adjusting the network parameters of the convolutional neural network model according at least to the ROI feature maps.
In addition, the RAM 803 may also store various programs and data required for the operation of the apparatus. The CPU 801 or GPU 813, the ROM 802, and the RAM 803 are connected with each other through the communication bus 804. In the presence of the RAM 803, the ROM 802 is an optional module. The RAM 803 stores executable instructions, or executable instructions are written into the ROM 802 at runtime, and the executable instructions cause the processors to perform the operations corresponding to the above communication method. An input/output (I/O) interface 805 is also connected to the communication bus 804. The communication component 812 may be integrally disposed, or may be provided with multiple submodules (for example, multiple IB network cards) and linked on the communication bus.
The I/O interface 805 is connected to the following components: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a loudspeaker, and the like; a storage section 808 including a hard disk and the like; and the communication interface 809 of a network card including a LAN card, a modem, and the like. A drive 810 is also connected to the I/O interface 805 as needed. A detachable medium 811, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage section 808 as needed.
It should be noted that the architecture shown in Fig. 8 is only an optional implementation. In specific practice, the number and types of the components in Fig. 8 may be selected, deleted, added, or replaced according to actual needs. In the arrangement of different functional components, separate or integrated arrangements and other implementations may also be adopted; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU, and the communication element may be arranged separately or integrally on the CPU or GPU, and so on. These interchangeable implementations all fall within the protection scope of the present invention.
In particular, according to the embodiments of the present invention, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present invention include a computer program product, which includes a computer program tangibly embodied on a machine-readable medium. The computer program includes program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the execution of the method steps provided by the embodiments of the present invention, for example: obtaining sample images for training and the information of the corresponding face keypoints, where the sample images contain the annotation information of the facial expressions; performing facial expression feature extraction on the sample images through the convolutional layer part of a convolutional neural network model, to obtain facial expression feature maps; determining regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature maps; performing pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and adjusting the network parameters of the convolutional neural network model according at least to the ROI feature maps. In such embodiments, the computer program may be downloaded and installed from a network through the communication element, and/or installed from the detachable medium 811. When the computer program is executed by the processors, the above functions defined in the method of the embodiment of the present invention are performed.
It may be noted that, according to the needs of implementation, each component/step described in the embodiments of the present invention may be split into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into a new component/step, to achieve the purpose of the embodiments of the present invention.
The above methods according to the embodiments of the present invention may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code that is originally stored in a remote recording medium or a non-volatile machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the methods described herein may be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It may be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, the processor, or the hardware, the processing methods described herein are realized. In addition, when a general-purpose computer accesses the code for implementing the processing shown herein, the execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown herein.
Those of ordinary skill in the art may realize that the units and method steps of each example described with reference to the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Professional technicians may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of the present invention.
The above implementations are only used to illustrate the embodiments of the present invention and are not limitations on the embodiments of the present invention. Those of ordinary skill in the relevant technical field may also make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention. Therefore, all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the patent protection scope of the embodiments of the present invention shall be defined by the claims.
Claims (10)
1. An expression recognition method, characterized by comprising:
performing facial expression feature extraction on a facial image to be detected through the convolutional layer part of a convolutional neural network model and the acquired face keypoints in the facial image to be detected, to obtain a facial expression feature map;
determining regions of interest (ROIs) respectively corresponding to the face keypoints in the facial expression feature map;
performing pooling on each determined ROI through the pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps;
obtaining an expression recognition result of the facial image according at least to the ROI feature maps.
2. The method according to claim 1, characterized in that the facial image comprises a static facial image.
3. The method according to claim 1, characterized in that the facial image comprises a facial image in a video frame sequence.
4. The method according to claim 3, characterized in that obtaining the expression recognition result of the facial image according at least to the ROI feature maps comprises:
obtaining a preliminary expression recognition result of the facial image of the current frame according to the ROI feature maps of the facial image of the current frame;
obtaining the expression recognition result of the facial image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the facial image of at least one prior frame.
5. The method according to claim 4, wherein obtaining the expression recognition result of the facial image of the current frame according to the preliminary expression recognition result of the current frame and the expression recognition result of the facial image of the at least one previous frame comprises:
performing weighting processing on the preliminary facial expression recognition result of the facial image of the current frame and the facial expression recognition result of the facial image of the at least one previous frame, to obtain the expression recognition result of the facial image of the current frame, wherein a weight of the preliminary expression recognition result of the facial image of the current frame is greater than a weight of the expression recognition result of the facial image of any previous frame.
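The weighting of claim 5 amounts to a temporal smoothing of per-class expression scores in which the current frame dominates. A small sketch (the weight values and the three-class score vectors are hypothetical):

```python
import numpy as np

def fuse_expression_scores(current, previous, w_current=0.6):
    """Weighted combination of the current frame's preliminary class
    scores with previous frames' scores; the current frame's weight
    exceeds the weight given to any single previous frame."""
    previous = np.asarray(previous, dtype=float)
    w_prev = (1.0 - w_current) / len(previous)  # remainder shared equally
    return w_current * np.asarray(current, dtype=float) + w_prev * previous.sum(axis=0)

cur = [0.1, 0.7, 0.2]            # preliminary scores for 3 expression classes
prev = [[0.2, 0.5, 0.3],         # recognition results of two previous frames
        [0.3, 0.4, 0.3]]
print(fuse_expression_scores(cur, prev))  # → [0.16 0.6  0.24]
```

With w_current = 0.6 and two previous frames, each previous frame carries weight 0.2, satisfying the claim's constraint that the current frame's weight is the largest; the fused result still favors the class the current frame favors while damping single-frame jitter.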
6. A convolutional neural network model training method, characterized by comprising:
acquiring a sample image for training and information of corresponding face key points, wherein the sample image contains annotation information of a facial expression;
performing facial expression feature extraction on the sample image through a convolutional layer part of a convolutional neural network model, to obtain a facial expression feature map;
determining regions of interest (ROIs) corresponding to each face key point in the facial expression feature map;
performing pooling processing on each determined ROI through a pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and
adjusting network parameters of the convolutional neural network model according to at least the ROI feature maps.
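The final step of claim 6, adjusting network parameters from the ROI features and the annotated expression, is a supervised gradient update. The claim does not fix a loss or optimizer; the sketch below assumes softmax cross-entropy and plain gradient descent on a linear classifier head over a flattened ROI feature vector, with all shapes invented for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, roi_feat, label, lr=0.05):
    """One gradient-descent step of softmax cross-entropy for a linear
    expression-classifier head; returns the adjusted parameters."""
    probs = softmax(W @ roi_feat)
    grad = np.outer(probs - np.eye(len(probs))[label], roi_feat)
    return W - lr * grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 3 expression classes, 4-dim ROI feature
x = rng.normal(size=4)        # one flattened ROI feature vector
before = softmax(W @ x)[1]
W = train_step(W, x, label=1)
after = softmax(W @ x)[1]
print(after > before)  # → True: the annotated class becomes more probable
```

In the full method the gradient would flow back through the pooling and convolutional layer parts as well; this head-only update just illustrates what "adjusting the network parameters according to at least the ROI feature maps" accomplishes.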
7. An expression recognition apparatus, characterized by comprising:
a first determining module, configured to perform facial expression feature extraction on a facial image to be detected through a convolutional layer part of a convolutional neural network model and face key points in the acquired facial image to be detected, to obtain a facial expression feature map;
a second determining module, configured to determine regions of interest (ROIs) corresponding to each face key point in the facial expression feature map;
a third determining module, configured to perform pooling processing on each determined ROI through a pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and
a fourth determining module, configured to obtain an expression recognition result of the facial image according to at least the ROI feature maps.
8. A convolutional neural network model training apparatus, characterized by comprising:
a first acquisition module, configured to acquire a sample image for training and information of corresponding face key points, wherein the sample image contains annotation information of a facial expression;
a second acquisition module, configured to perform facial expression feature extraction on the sample image through a convolutional layer part of the convolutional neural network model, to obtain a facial expression feature map;
a third acquisition module, configured to determine regions of interest (ROIs) corresponding to each face key point in the facial expression feature map;
a fourth acquisition module, configured to perform pooling processing on each determined ROI through a pooling layer part of the convolutional neural network model, to obtain pooled ROI feature maps; and
a fifth acquisition module, configured to adjust network parameters of the convolutional neural network model according to at least the ROI feature maps.
9. An electronic device, characterized by comprising: a processor, a memory, a communication device, and a communication bus, wherein the processor, the memory, and the communication device communicate with one another through the communication bus; and
the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the expression recognition method according to any one of claims 1-5.
10. An electronic device, characterized by comprising: a processor, a memory, a communication device, and a communication bus, wherein the processor, the memory, and the communication device communicate with one another through the communication bus; and
the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the convolutional neural network model training method according to claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611268009.6A CN108229268A (en) | 2016-12-31 | 2016-12-31 | Expression Recognition and convolutional neural networks model training method, device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108229268A true CN108229268A (en) | 2018-06-29 |
Family
ID=62656512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611268009.6A Pending CN108229268A (en) | 2016-12-31 | 2016-12-31 | Expression Recognition and convolutional neural networks model training method, device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108229268A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102339391A (en) * | 2010-07-27 | 2012-02-01 | 株式会社理光 | Multiobject identification method and device |
CN103258204A (en) * | 2012-02-21 | 2013-08-21 | 中国科学院心理研究所 | Automatic micro-expression recognition method based on Gabor features and edge orientation histogram (EOH) features |
CN104715249A (en) * | 2013-12-16 | 2015-06-17 | 株式会社理光 | Object tracking method and device |
CN105095827A (en) * | 2014-04-18 | 2015-11-25 | 汉王科技股份有限公司 | Facial expression recognition device and facial expression recognition method |
CN105654049A (en) * | 2015-12-29 | 2016-06-08 | 中国科学院深圳先进技术研究院 | Facial expression recognition method and device |
CN105787867A (en) * | 2016-04-21 | 2016-07-20 | 华为技术有限公司 | Method and apparatus for processing video images based on neural network algorithm |
CN106096557A (en) * | 2016-06-15 | 2016-11-09 | 浙江大学 | A kind of semi-supervised learning facial expression recognizing method based on fuzzy training sample |
Non-Patent Citations (1)
Title |
---|
ROSS GIRSHICK: "Fast R-CNN", 2015 IEEE International Conference on Computer Vision *
Cited By (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110728168B (en) * | 2018-07-17 | 2022-07-22 | 广州虎牙信息科技有限公司 | Part recognition method, device, equipment and storage medium |
CN110728168A (en) * | 2018-07-17 | 2020-01-24 | 广州虎牙信息科技有限公司 | Part recognition method, device, equipment and storage medium |
CN112911393B (en) * | 2018-07-24 | 2023-08-01 | 广州虎牙信息科技有限公司 | Method, device, terminal and storage medium for identifying part |
CN112911393A (en) * | 2018-07-24 | 2021-06-04 | 广州虎牙信息科技有限公司 | Part recognition method, device, terminal and storage medium |
CN109389030A (en) * | 2018-08-23 | 2019-02-26 | 平安科技(深圳)有限公司 | Facial feature points detection method, apparatus, computer equipment and storage medium |
CN109255827A (en) * | 2018-08-24 | 2019-01-22 | 太平洋未来科技(深圳)有限公司 | Three-dimensional face image generation method, device and electronic equipment |
CN109190564A (en) * | 2018-09-05 | 2019-01-11 | 厦门集微科技有限公司 | Image analysis method, apparatus, computer storage medium and terminal |
CN109409262A (en) * | 2018-10-11 | 2019-03-01 | 北京迈格威科技有限公司 | Image processing method, image processing apparatus, computer readable storage medium |
CN109522818B (en) * | 2018-10-29 | 2021-03-30 | 中国科学院深圳先进技术研究院 | Expression recognition method and device, terminal equipment and storage medium |
CN109522818A (en) * | 2018-10-29 | 2019-03-26 | 中国科学院深圳先进技术研究院 | Expression recognition method, apparatus, terminal device and storage medium |
US11151363B2 (en) | 2018-10-30 | 2021-10-19 | Baidu Online Network Technology (Beijing) Co., Ltd. | Expression recognition method, apparatus, electronic device, and storage medium |
EP3564854A1 (en) * | 2018-10-30 | 2019-11-06 | Baidu Online Network Technology (Beijing) Co., Ltd. | Facial expression recognition method, apparatus, electronic device, and storage medium |
CN109544537A (en) * | 2018-11-26 | 2019-03-29 | 中国科学技术大学 | Fast automatic analysis method for hip joint X-ray images |
CN111259689A (en) * | 2018-11-30 | 2020-06-09 | 百度在线网络技术(北京)有限公司 | Method and apparatus for transmitting information |
CN111259689B (en) * | 2018-11-30 | 2023-04-25 | 百度在线网络技术(北京)有限公司 | Method and device for transmitting information |
CN109711356A (en) * | 2018-12-28 | 2019-05-03 | 广州海昇教育科技有限责任公司 | Expression recognition method and system |
CN109711356B (en) * | 2018-12-28 | 2023-11-10 | 广州海昇教育科技有限责任公司 | Expression recognition method and system |
CN109840485B (en) * | 2019-01-23 | 2021-10-08 | 科大讯飞股份有限公司 | Micro-expression feature extraction method, device, equipment and readable storage medium |
CN109840485A (en) * | 2019-01-23 | 2019-06-04 | 科大讯飞股份有限公司 | Micro-expression feature extraction method, apparatus, equipment and readable storage medium |
CN111723926B (en) * | 2019-03-22 | 2023-09-12 | 北京地平线机器人技术研发有限公司 | Training method and training device for neural network model for determining image parallax |
CN111723926A (en) * | 2019-03-22 | 2020-09-29 | 北京地平线机器人技术研发有限公司 | Training method and training device for neural network model for determining image parallax |
CN110110611A (en) * | 2019-04-16 | 2019-08-09 | 深圳壹账通智能科技有限公司 | Portrait attribute model construction method, device, computer equipment and storage medium |
CN109977925A (en) * | 2019-04-22 | 2019-07-05 | 北京字节跳动网络技术有限公司 | Expression determination method, apparatus and electronic equipment |
CN109977925B (en) * | 2019-04-22 | 2020-11-27 | 北京字节跳动网络技术有限公司 | Expression determination method and device and electronic equipment |
CN110135476A (en) * | 2019-04-28 | 2019-08-16 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of detection method of personal safety equipment, device, equipment and system |
CN110097004B (en) * | 2019-04-30 | 2022-03-29 | 北京字节跳动网络技术有限公司 | Facial expression recognition method and device |
CN110070076A (en) * | 2019-05-08 | 2019-07-30 | 北京字节跳动网络技术有限公司 | Method and apparatus for selecting training samples |
US11023716B2 (en) | 2019-05-27 | 2021-06-01 | Beijing Bytedance Network Technology Co., Ltd. | Method and device for generating stickers |
CN110162670A (en) * | 2019-05-27 | 2019-08-23 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating expression packages |
CN110162670B (en) * | 2019-05-27 | 2020-05-08 | 北京字节跳动网络技术有限公司 | Method and device for generating expression package |
CN110176012B (en) * | 2019-05-28 | 2022-12-13 | 腾讯科技(深圳)有限公司 | Object segmentation method in image, pooling method, device and storage medium |
CN110176012A (en) * | 2019-05-28 | 2019-08-27 | 腾讯科技(深圳)有限公司 | Object segmentation method in images, pooling method, apparatus and storage medium |
CN110490164A (en) * | 2019-08-26 | 2019-11-22 | 北京达佳互联信息技术有限公司 | Method, apparatus, equipment and medium for generating virtual expressions |
CN110705419A (en) * | 2019-09-24 | 2020-01-17 | 新华三大数据技术有限公司 | Emotion recognition method, early warning method, model training method and related device |
CN112733574A (en) * | 2019-10-14 | 2021-04-30 | 中移(苏州)软件技术有限公司 | Face recognition method and device and computer readable storage medium |
CN112733574B (en) * | 2019-10-14 | 2023-04-07 | 中移(苏州)软件技术有限公司 | Face recognition method and device and computer readable storage medium |
CN112825122A (en) * | 2019-11-20 | 2021-05-21 | 北京眼神智能科技有限公司 | Ethnicity judgment method, ethnicity judgment device, ethnicity judgment medium and ethnicity judgment equipment based on two-dimensional face image |
CN111259753A (en) * | 2020-01-10 | 2020-06-09 | 杭州飞步科技有限公司 | Method and device for processing key points of human face |
CN111325190A (en) * | 2020-04-01 | 2020-06-23 | 京东方科技集团股份有限公司 | Expression recognition method and device, computer equipment and readable storage medium |
CN111325190B (en) * | 2020-04-01 | 2023-06-30 | 京东方科技集团股份有限公司 | Expression recognition method and device, computer equipment and readable storage medium |
WO2021196928A1 (en) * | 2020-04-01 | 2021-10-07 | 京东方科技集团股份有限公司 | Expression recognition method and apparatus, computer device, and readable storage medium |
US12002289B2 (en) | 2020-04-01 | 2024-06-04 | Boe Technology Group Co., Ltd. | Expression recognition method and apparatus, computer device, and readable storage medium |
CN111508495A (en) * | 2020-05-02 | 2020-08-07 | 北京花兰德科技咨询服务有限公司 | Artificial intelligent robot cooperating with human and communication method |
CN112084953A (en) * | 2020-09-10 | 2020-12-15 | 济南博观智能科技有限公司 | Method, system and equipment for identifying face attributes and readable storage medium |
CN112084953B (en) * | 2020-09-10 | 2024-05-10 | 济南博观智能科技有限公司 | Face attribute identification method, system, equipment and readable storage medium |
CN113537124A (en) * | 2021-07-28 | 2021-10-22 | 平安科技(深圳)有限公司 | Model training method, device and storage medium |
WO2023142886A1 (en) * | 2022-01-28 | 2023-08-03 | 华为技术有限公司 | Expression transfer method, model training method, and device |
CN114627218A (en) * | 2022-05-16 | 2022-06-14 | 成都市谛视无限科技有限公司 | Human face fine expression capturing method and device based on virtual engine |
CN116302294A (en) * | 2023-05-18 | 2023-06-23 | 安元科技股份有限公司 | Method and system for automatically identifying component attribute through interface |
CN116302294B (en) * | 2023-05-18 | 2023-09-01 | 安元科技股份有限公司 | Method and system for automatically identifying component attribute through interface |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108229268A (en) | Expression recognition and convolutional neural network model training method, apparatus and electronic device | |
CN110569795B (en) | Image recognition method and device, and related equipment | |
CN108229269A (en) | Face detection method, device and electronic equipment | |
CN108038469B (en) | Method and apparatus for detecting human body | |
CN108875708A (en) | Video-based behavior analysis method, device, equipment, system and storage medium | |
CN106407889A (en) | Video human interaction action recognition method based on an optical flow graph deep learning model | |
CN106326857A (en) | Gender identification method and device based on face image | |
CN107358157A (en) | Face liveness detection method, device and electronic equipment | |
CN105469376B (en) | Method and device for determining picture similarity | |
CN110443189A (en) | Face attribute recognition method based on a multi-task multi-label learning convolutional neural network | |
CN105426850A (en) | Face-recognition-based related information pushing device and method | |
CN106068514A (en) | System and method for identifying faces in free media | |
CN108717663A (en) | Micro-expression-based face tag fraud judgment method, device, equipment and medium | |
CN109344759A (en) | Kinship recognition method based on an angular loss neural network | |
CN108629326A (en) | Action behavior recognition method and device for a target body | |
CN106295591A (en) | Gender identification method and device based on face image | |
CN110532925B (en) | Driver fatigue detection method based on a spatio-temporal graph convolutional network | |
CN110197729A (en) | Resting-state fMRI data classification method and device based on deep learning | |
CN110503081A (en) | Violent behavior detection method, system, equipment and medium based on inter-frame difference | |
CN106897659A (en) | Blink motion recognition method and device | |
CN113723530B (en) | Intelligent psychological assessment system based on video analysis and an electronic psychological sandbox | |
CN104517097A (en) | Kinect-based moving human body posture recognition method | |
CN107292229A (en) | Image recognition method and device | |
CN109034134A (en) | Abnormal driving behavior detection method based on a multi-task deep convolutional neural network | |
CN107316029A (en) | Liveness verification method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180629 |
|
RJ01 | Rejection of invention patent application after publication |