CN111597955A - Smart home control method and device based on expression emotion recognition of deep learning

Smart home control method and device based on expression emotion recognition of deep learning

Info

Publication number
CN111597955A
Authority
CN
China
Prior art keywords
face
layer
target
image
emotion
Prior art date
Legal status
Pending
Application number
CN202010397040.XA
Other languages
Chinese (zh)
Inventor
谭永锐
丁宁
刘训玺
Current Assignee
Bokang Yunxin Science & Technology Co ltd
Original Assignee
Bokang Yunxin Science & Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Bokang Yunxin Science & Technology Co ltd


Classifications

    • G06V 40/174 - Facial expression recognition (image or video recognition or understanding; human faces)
    • G05B 15/02 - Systems controlled by a computer, electric
    • G05B 19/418 - Total factory control, i.e. centrally controlling a plurality of machines
    • G06N 3/045 - Neural networks; combinations of networks
    • G06N 3/08 - Neural networks; learning methods
    • G05B 2219/2642 - Domotique, domestic, home control, automation, smart house


Abstract

The invention provides a smart home control method and device based on expression emotion recognition of deep learning, wherein the method comprises the following steps: s1: presetting application scenes of the smart home equipment, wherein different application scenes are associated with different expression emotions and output different control instructions; s2: acquiring a target image, detecting the target face image, performing face feature analysis on the target face image, and identifying the identity information corresponding to the face feature information; s3: inputting the target face image into a preset deep learning model and identifying the expression emotion; s4: controlling the smart home equipment over the Zigbee protocol according to the identified identity information and expression emotion and the control instruction corresponding to the associated application scene. By recognizing expression emotion with deep learning, the method improves the accuracy of expression emotion recognition; by combining expression emotion recognition with smart home control, the home becomes more intelligent, the method is simple and convenient, and the safety of the application site is improved.

Description

Smart home control method and device based on expression emotion recognition of deep learning
Technical Field
The invention relates to the field of smart home, in particular to a smart home control method and device based on expression emotion recognition of deep learning.
Background
The intelligent home is an important practical application of the Internet of Things. It connects the various devices in a home (such as audio and video equipment, lighting systems, curtain control, air conditioning control and networked home appliances) through Internet of Things technology, and provides functions such as home appliance control, lighting control, anti-theft alarm, environment monitoring, and heating and ventilation control. The development of smart homes can be divided into four major stages: first-generation mobile phone control, second-generation scene linkage, third-generation voice interaction, and fourth-generation artificial intelligence.
With the development of smart home technology, pricing and the market, smart homes are entering ordinary families at an astonishing speed. Most of the smart home products popular on the market at the present stage belong to the second and third generations: common human-computer interaction is operated through a mobile phone, i.e. a mobile phone APP is installed and the intelligent devices in the home are controlled through software instructions, while some devices optimize instruction control through voice recognition. In actual use, smart homes are still not intelligent enough.
Expression recognition is an important direction for computers to understand and analyze human emotion and an important way to realize human-computer interaction. Expression recognition technology originated in the research of the psychologists Ekman and Friesen in the 1970s, which suggested that humans have six basic emotions, namely anger, happiness, sadness, surprise, disgust and fear, each of which expresses psychological activity through a different expression. Therefore, analysis and recognition of emotion can be realized through expression recognition. Expression recognition has broad research prospects in human-computer interaction, emotion analysis, smart homes and the like.
At present, the expression recognition field mainly uses hand-designed features to recognize and extract the appearance features of images; the feature extraction methods of traditional expression recognition technology mainly include histograms of oriented gradients, local binary patterns, wavelet transforms, locally linear embedding and the like. These hand-designed methods are effective on specific samples, but they are difficult to adapt to new changes and are constrained by the designed algorithm: when new face images are recognized, the recognition rate drops significantly; designing the algorithm is time-consuming and labor-intensive; and feature extraction and classification are separate and cannot be fused into a unified model.
Disclosure of Invention
The application provides an intelligent home control method and device based on expression emotion recognition of deep learning, and aims to solve the problem that intelligent home is not intelligent enough in the prior art.
According to a first aspect, an embodiment provides a smart home control method based on emotion recognition of expressions in deep learning, and the method includes:
s1: presetting application scenes of the intelligent home equipment, wherein different application scenes are associated with different expressions and emotions, and different application scenes output different control instructions;
s2: acquiring a target image, preprocessing the target image, extracting a target face image from the target image through an MTCNN (Multi-task Cascaded Convolutional Networks) model, and performing face feature analysis through an InsightFace model to identify face feature information and corresponding identity information;
s3: inputting the target face image into a preset deep learning model, and identifying expression emotion;
s4: and calling an application scene associated with the expressive emotion according to the identified identity information and the identified expressive emotion, and controlling the intelligent home equipment according to a control instruction corresponding to the associated application scene by using a Zigbee protocol.
Further preferably, in step S2, the method for acquiring a target image and preprocessing the target image includes: and normalizing the environment and the illumination in the acquired target image.
Further preferably, in the step S2, the method for extracting the target face image in the target image through the MTCNN model further includes: detecting the target face region in the target image using the cascade of the three network structures P-Net, R-Net and O-Net, extracting the target face image, detecting the face key points, performing affine transformation with the face key points, and aligning and correcting the target face image.
Further preferably, in step S2, the method for performing face feature analysis through the insight face model to identify face feature information and corresponding identity information includes:
presetting a face database, wherein face feature vectors and corresponding identity information are preset in the face database;
acquiring a target face image;
then inputting the target face image into an Insightface model, and identifying target face characteristic information;
traversing the target face feature information through the face database, and identifying the identity information corresponding to the face feature vector as the identity information of the target face image when the similarity value between the target face feature information and the face feature vector in the face database meets a preset condition.
Further preferably, the step 3, before inputting the target face image into a preset deep learning model and recognizing the expression emotion in the target face image, further includes:
collecting a plurality of facial feature images and expression emotion images corresponding to the facial feature images, and creating a data set by using the facial feature images and the expression emotion images, wherein the data set comprises a training set and a verification set;
constructing a deep neural network;
and inputting the training set into a deep neural network, performing data training learning, and performing verification by using the verification set to generate a deep learning model.
Further preferably, before creating the data set using the facial feature image and the expression emotion image, the method further includes: performing data set enhancement processing, including flipping, rotation and cropping transforms, on the collected face feature images.
Further preferably, the deep neural network adopts a VGG19 convolutional neural network; the VGG19 convolutional neural network comprises: 19 hidden layers, 5 pooling layers and 1 classification layer, wherein the hidden layers comprise: 16 convolutional layers, 3 fully-connected layers and 1 dropout layer, wherein the convolutional layers adopt small convolutional kernels.
Further preferably, the method for creating the VGG19 convolutional neural network comprises:
creating a first layer of convolution, wherein the first layer of convolution comprises two layers of convolution networks and a maxpool layer, and adding nonlinear correction to transmit the obtained feature map into a lower layer for processing;
creating a second layer of convolution, wherein the second layer of convolution comprises two layers of convolution networks and a maxpool layer, and the obtained feature map is transmitted to a lower layer for processing by adding nonlinear correction;
creating a third layer of convolution, wherein the third layer of convolution comprises four layers of convolution networks and a maxpool layer, and the nonlinear correction is added to transmit the obtained feature map into a lower layer for processing;
creating a fourth layer of convolution, wherein the fourth layer of convolution comprises a four-layer convolution network and a maxpool layer, and the obtained feature map is transmitted to a lower layer for processing by adding nonlinear correction;
creating a fifth layer convolution comprising a four-layer convolution network and a maxpool layer, and adding nonlinear correction to transmit the obtained feature map into a lower layer for processing;
creating a full connection layer which comprises a dropout layer, three full connection layers and a softmax layer, stopping the work of an activation value of a certain neuron in propagation at a certain probability through the dropout layer, and then transmitting the obtained feature graph to a lower layer for processing;
creating a loss function, wherein the loss function adopts a cross entropy loss function, and the cross entropy loss function calculation formula is as follows:
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log\hat{y}_{i,c}
where N is the number of training samples, C is the number of expression classes, y_{i,c} is the true (one-hot) label of sample i for class c, and \hat{y}_{i,c} is the predicted softmax probability.
according to a second aspect, an embodiment provides a smart home control device based on emotion recognition of expressions in deep learning, the device includes: the system comprises an association setting module, a face recognition module, an emotion recognition module and an instruction calling module;
the correlation setting module is used for presetting application scenes of the intelligent home equipment, wherein different application scenes are correlated with different expression emotions, and different application scenes output different control instructions;
the face recognition module is used for acquiring a target image, preprocessing the target image, extracting a target face image from the target image through an MTCNN (Multi-task Cascaded Convolutional Networks) model, and performing face feature analysis through an InsightFace model to recognize face feature information and corresponding identity information;
the emotion recognition module is used for inputting the target face image into a preset deep learning model and recognizing expression emotion;
the instruction calling module is used for calling an application scene related to the expressive emotion according to the identified identity information and the identified expressive emotion, and then controlling the intelligent home equipment according to a control instruction corresponding to the related application scene by using a Zigbee protocol.
Compared with the prior art, the invention has the beneficial effects that:
(1) the invention analyzes the expression emotion by utilizing expression recognition, combines the expression emotion to be applied to the intelligent home control application, and realizes more intelligent home control application.
(2) The invention utilizes the image preprocessing to improve the accuracy of face and expression emotion recognition.
(3) The invention utilizes the deep neural network to improve the accuracy and efficiency of face and facial expression emotion recognition.
(4) The invention reuses the face recognition already performed in the intermediate stage of the deep neural network pipeline to also carry out face identity recognition for the smart home application, thereby improving safety.
Drawings
Fig. 1 is a flowchart of an intelligent home control method based on emotion recognition of deep learning in an embodiment of the present invention;
fig. 2 is a flowchart of an intelligent home control method in an embodiment of the present invention;
FIG. 3 is a flow chart of a face recognition method according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method for identifying identity information in an embodiment of the present invention;
FIG. 5 is a flowchart of an expression recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a VGG19 deep neural network in an embodiment of the present invention;
fig. 7 is a block diagram of a structure of an intelligent home control device for emotion recognition based on deep learning in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings.
The embodiment of the application adopts a deep network to identify and analyze facial expressions: facial feature points are extracted through face recognition, deep neural network learning and training is performed, and the features are then classified so that the network learns independently. Once design and verification are completed, the model basically does not need much modification, its adaptability to new data is improved, feature extraction and classification are completed in one model, and the limitations of manually designed algorithms are avoided. At present, a number of deep neural networks are used for expression recognition, such as the recurrent neural network (RNN) and the convolutional neural network (CNN).
Example 1
Referring to fig. 1-2, an embodiment of the present application provides a smart home control method based on emotion recognition of expressions and emotions through deep learning, which includes the following steps.
Step S100: the method comprises the steps of presetting application scenes of the intelligent household equipment, wherein different application scenes are associated with different expressions and emotions, and different application scenes output different control instructions.
In step S100, different control instructions are executed for different application scenarios, where the application scenarios at least include: the system comprises a music playing scene, a light control scene, an air conditioner control scene, a temperature and humidity control scene, a curtain control scene and a household appliance control scene. And setting the association relation of the corresponding application scenes according to different expression emotions.
Different application scenarios manage different emotions, which in some embodiments may include: happiness, anger, sadness and surprise, categories of emotion with a relatively obvious phenotype that are comparatively easy to detect. Further, five more emotions may also be included: fear, disgust, contempt, confusion and neutral. A face database is preset for the different expression emotions so as to identify the face and the expression emotion on the face.
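As an illustration of the preset association described above, the following minimal Python sketch shows one way the emotion-to-scene mapping of step S100 could be represented; the scene names, device identifiers and command strings are assumptions made for illustration and are not specified by the patent.

# Hypothetical preset of step S100: each expression emotion is associated with one
# application scene and the control instructions that scene outputs.
SCENES = {
    "happy":    {"scene": "music_playing",     "commands": [("speaker", "play_relaxing_playlist")]},
    "angry":    {"scene": "light_control",     "commands": [("light", "dim_to_30_percent")]},
    "sad":      {"scene": "curtain_control",   "commands": [("curtain", "open"), ("light", "warm_white")]},
    "surprise": {"scene": "appliance_control", "commands": [("camera", "take_snapshot")]},
    "disgust":  {"scene": "air_conditioner",   "commands": [("ac", "ventilate")]},
    "fear":     {"scene": "security",          "commands": [("alarm", "arm"), ("light", "full_brightness")]},
}

def lookup_scene(emotion):
    # Returns the application scene and control instructions associated with an emotion, or None.
    return SCENES.get(emotion)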
Step S200: the method comprises the steps of obtaining a target image, preprocessing the target image, extracting a target face image in the target image through an MTCNN model, analyzing face characteristics through an insight face model, and identifying face characteristic information and corresponding identity information.
Further, referring to fig. 3, the method for recognizing a human face includes:
step S210: acquiring a target image;
step S220: preprocessing an image;
step S230: extracting a target face image by using an MTCNN model;
step S240: and performing face feature analysis on the target face image by using the Insightface model to identify face feature information.
In step S200, the method mainly includes acquiring a target image and performing target face detection and related processing on it: a target face image is detected in the target image, and high-precision target face frame coordinates and face key point coordinates are returned. In general, face recognition extracts the face key point image implied in each target face and compares it with known target faces, so as to recognize the identity of each face. The application scenes of face detection and recognition have gradually evolved from indoor to outdoor and from single restricted scenes to scenes such as squares, stations and subway exits, so the requirements on face detection and recognition are higher and higher: face sizes vary, faces are numerous, poses are diverse, faces may be occluded by other faces or by hats and masks, expressions may be exaggerated, makeup may disguise the face, illumination conditions may be poor, and resolution may be so low that faces are difficult to distinguish even with the naked eye.
In this embodiment, referring to fig. 3, a target image is acquired; the target image may be acquired by capturing an image with a camera. After the target image is obtained, the method for preprocessing it includes: normalizing the environment and the illumination in the acquired target image. In addition, after the target image is preprocessed, the quality of the acquired target image can be improved through a GAN-based deep learning model.
In this embodiment, environmental factors such as ambient light are kept uniform when feature images are collected. However, in order to further realize efficient and accurate target recognition and to eliminate the influence of the environment and illumination present in the images, a normalization technique is used to normalize the ambient illumination, and a GAN-based depth model is used to correct the state of the target face image.
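The patent does not fix a concrete illumination-normalization algorithm; the sketch below shows one common choice (CLAHE histogram equalization of the lightness channel with OpenCV), offered only as an assumed example of this preprocessing step.

import cv2

def normalize_illumination(bgr_image):
    # Assumed preprocessing: equalize the lightness channel so that ambient
    # illumination differences between captures are reduced.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)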
GAN stands for Generative Adversarial Nets, usually translated as generative adversarial network. A GAN is composed of a generative model and a discriminative model (a support vector machine or a multilayer neural network is commonly used as the discriminative model), and its optimization process is to find a Nash equilibrium between the generative model and the discriminative model.
The learning framework established by the GAN is in fact a game played between the generative model and the discriminative model. The purpose of the generative model is to imitate, model and learn the distribution of the real data as closely as possible, while the discriminative model judges whether an input comes from the real data distribution or from the generative model. Through continuous competition between these two internal models, the generative ability and the discriminative ability of both are improved. If the generative model is compared to a counterfeiter, then the discriminative model plays the role of the police: the counterfeiter improves its ability to disguise through continuous learning so that the data it provides can better deceive the discriminative model, and the discriminative model improves its discriminating ability through continuous training so that it can judge the data source more accurately. This is used here to correct the real data in the target image.
In step S200, the method for extracting the target face image from the target image through the MTCNN model further includes: detecting the target face region in the target image using the cascade of the three network structures P-Net, R-Net and O-Net, extracting the target face image, detecting the face key points, performing affine transformation with the face key points, and aligning and correcting the target face image.
In this embodiment, the method for extracting the target face image from the target image through the MTCNN model may further specifically include: detecting the target face region and the target face key points in the video information using the cascade of the three network structures P-Net, R-Net and O-Net; after the target face frame is detected, performing affine transformation with the face key points, aligning and correcting the face image, acquiring the target face image, and detecting the face key points in the target face image. In this embodiment, the input size of P-Net is 12 x 12; it has the fewest parameters and the smallest model, so it obtains the face candidate regions the fastest, and NMS is then used to merge the candidate boxes. The candidate boxes are then resized to 24 x 24 and sent to R-Net, which rejects a large number of candidate boxes and, after NMS, sends the remainder to O-Net. O-Net is very similar to R-Net, except that in this step the candidate boxes are identified more accurately, and for each candidate box 5 landmarks are also given, i.e. 2 eyes, 1 nose and 2 mouth corners. Furthermore, after O-Net, a difficulty is how to align the detected faces for rectification, because the detected faces may not be level in the actual pictures. Different affine transformation methods can give very different results; generally, affine transformation is carried out with OpenCV's cv2.warpAffine API, which of course only transforms within the two-dimensional plane, i.e. parallel edges remain parallel after the transformation, as distinguished from the perspective transformation cv2.warpPerspective.
In this embodiment, affine transformation is performed using the face key points, for which a transformation matrix M is required. The simplest method is to use the getAffineTransform() API provided by OpenCV: only 3 points on the original plane and 3 points on the target plane are needed to obtain a transformation matrix M, i.e. M = cv2.getAffineTransform(pts1, pts2).
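A minimal sketch of this alignment step, using the OpenCV calls named above; the three chosen landmarks and the reference coordinates in the aligned crop are illustrative assumptions.

import cv2
import numpy as np

def align_face(image, keypoints, size=112):
    # Warp the face so that the left eye, right eye and nose land on fixed reference points.
    pts1 = np.float32([keypoints["left_eye"], keypoints["right_eye"], keypoints["nose"]])
    pts2 = np.float32([[0.30 * size, 0.35 * size],   # assumed reference positions
                       [0.70 * size, 0.35 * size],
                       [0.50 * size, 0.60 * size]])
    M = cv2.getAffineTransform(pts1, pts2)           # 2 x 3 affine matrix from 3 point pairs
    return cv2.warpAffine(image, M, (size, size))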
Face detection uses the MTCNN (Multi-task Cascaded Convolutional Networks) model, a multi-task target face detection framework. It uses a cascade of three network structures, i.e. a 3-layer cascaded CNN algorithm structure, to detect faces and facial features at the same time, performing regression of the target face bounding box and detection of the face key points simultaneously.
In this embodiment, the target image is input into the MTCNN model. The image is scaled to different sizes according to different scaling ratios to form an image pyramid. P-Net obtains the candidate windows and the bounding-box regression vectors of the face regions; the candidate windows are calibrated using bounding-box regression, and highly overlapping candidate boxes are then merged by non-maximum suppression (NMS). R-Net processes the P-Net candidate boxes in the R-Net network, fine-tunes the candidate windows using the bounding-box regression values, and removes overlapping windows with NMS. The function of O-Net is similar to that of R-Net, except that while removing overlapping candidate windows it also outputs the locations of the face key points. The network structure of P-Net is a fully convolutional neural network.
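As a hedged illustration of this detection cascade, the sketch below uses the open-source mtcnn Python package (an implementation choice assumed here, not named by the patent) to obtain the face boxes and the five landmarks.

import cv2
from mtcnn import MTCNN   # assumed implementation: pip install mtcnn

detector = MTCNN()

def detect_faces(bgr_image):
    # Run the P-Net / R-Net / O-Net cascade; each result carries a bounding box
    # and the five key points (left_eye, right_eye, nose, mouth_left, mouth_right).
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    faces = []
    for r in detector.detect_faces(rgb):
        x, y, w, h = r["box"]
        faces.append({
            "box": (x, y, w, h),
            "keypoints": r["keypoints"],
            "crop": bgr_image[y:y + h, x:x + w],
        })
    return faces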
The training data for face detection in the MTCNN model in this embodiment is labeled in the following format,
# File name
File name
# Number of marker boxes
Number of bounding boxes
# where x1 and y1 are the coordinates of the top-left corner of the mark box, w and h are the width and height of the mark box, and blur, expression, illumination, invalid, occlusion and pose are attributes of the mark box, such as whether it is blurred, the lighting conditions, whether it is occluded, whether it is valid, and the pose.
Further its format is as follows:
x1,y1,w,h,blur,expression,illumination,invalid,occlusion,pose。
therefore, the training data for face feature detection is in the following format:
the # first data is a file name,
the second and third data are the coordinates of the upper left corner of the marker box,
the fourth and fifth data are mark box length and width,
the sixth and seventh data are left-eye mark points,
the eighth and ninth data are right eye mark points,
the tenth and eleventh data are left mouth marker points,
the last two coordinates are the right mouth mark points.
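A small sketch showing how one annotation line in the face-feature format above could be parsed; whitespace-separated fields and the helper name are assumptions made for illustration.

def parse_landmark_annotation(line):
    # Field order follows the format above: file name, mark box (x1, y1, w, h),
    # then left eye, right eye, left mouth corner and right mouth corner points.
    fields = line.split()
    nums = list(map(float, fields[1:]))
    return {
        "file": fields[0],
        "box": nums[0:4],
        "left_eye": nums[4:6],
        "right_eye": nums[6:8],
        "mouth_left": nums[8:10],
        "mouth_right": nums[10:12],
    }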
In step S200, the method for performing face feature analysis through the insight face model to identify face feature information and corresponding identity information includes:
step S241: and presetting a face database, wherein the face database is preset with face feature vectors and corresponding identity information.
Step S242: and acquiring a target face image.
Step S243: and inputting the target face image into the Insightface model, and identifying the characteristic information of the target face.
Step S244: and traversing the target face feature information through the face database, and identifying the identity information corresponding to the face feature vector as the identity information of the target face image when the similarity value between the face feature information and the face feature vector in the face database meets a preset condition.
In this step, the target face image detected through the MTCNN model is obtained, the target face is input into the InsightFace model, and the face feature information corresponding to the face key points in the target face image is identified. The target face image is input into the InsightFace model and the feature vector of the embedding layer is calculated. The Euclidean distances between feature vectors are compared to judge whether two faces belong to the same person; for example, when the feature distance is less than 1 the faces are considered to be the same person, and when it is greater than 1 they are considered to be different persons.
Face feature recognition and analysis are performed through the InsightFace model. The InsightFace model does not use the traditional softmax classification for learning; instead, it takes a certain layer as the feature, learns an encoding from the image into a Euclidean space, and then performs face recognition, face verification, face clustering and the like based on this encoding. The face recognition effect can be compared using the Euclidean distance between faces: faces whose Euclidean distance is less than 1.06 can be considered the same person. Further, the InsightFace model is mainly used to verify whether two faces are the same person and to identify who a person is from the face. Its main idea is to map a face image into a multidimensional space and represent face similarity by spatial distance: the spatial distance between images of the same face is small, and the spatial distance between images of different faces is large. Therefore, face recognition can be realized through the spatial mapping of face images. The InsightFace model trains the neural network with an image mapping method based on a deep neural network and a triplet-based loss function, and the network directly outputs a 128-dimensional vector space.
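The identity-matching rule described above (two faces belong to the same person when the Euclidean distance between their embeddings is below a threshold of roughly 1.0 to 1.06) can be sketched as follows; the embedding extraction itself is left abstract because the patent does not define a concrete API for it.

import numpy as np

DISTANCE_THRESHOLD = 1.06   # faces closer than this are treated as the same person

def identify(target_embedding, face_database):
    # Traverse the preset face database {identity: embedding} and return the
    # identity of the closest match, or None when no entry is close enough.
    best_identity, best_distance = None, float("inf")
    for identity, stored_embedding in face_database.items():
        d = float(np.linalg.norm(np.asarray(target_embedding) - np.asarray(stored_embedding)))
        if d < best_distance:
            best_identity, best_distance = identity, d
    return best_identity if best_distance < DISTANCE_THRESHOLD else None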
The organization of the training data is shown below, where the directory name is the identity information and the files under the directory are the photos of the corresponding person.
Aaron_Eckhart
Aaron_Eckhart_0001.jpg
Aaron_Guiel
Aaron_Guiel_0001.jpg
Aaron_Patterson
Aaron_Patterson_0001.jpg
Aaron_Peirsol
Aaron_Peirsol_0001.jpg
Aaron_Peirsol_0002.jpg
Aaron_Peirsol_0003.jpg
Aaron_Peirsol_0004.jpg
...
Then, each picture in the training data is preprocessed, a human face is detected through an MTCNN model, training data of the insight face is generated, and a corresponding data structure is formed as follows:
Aaron_Eckhart
Aaron_Eckhart_0001_face.jpg
Aaron_Guiel
Aaron_Guiel_0001_face.jpg
……
In the InsightFace model's network structure, batch represents the training data of faces; next a deep convolutional neural network is applied, then an L2 normalization operation is used to obtain the face image feature representation, and finally the triplet loss (Triplet Loss) function is computed.
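A minimal sketch of the triplet loss mentioned above, written with PyTorch as an assumed framework (the patent does not name one); the margin value is illustrative.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # L2-normalize the embeddings (as in the pipeline above), then pull the
    # anchor-positive pair together and push the anchor-negative pair apart.
    anchor = F.normalize(anchor, p=2, dim=1)
    positive = F.normalize(positive, p=2, dim=1)
    negative = F.normalize(negative, p=2, dim=1)
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()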
Step S300: and inputting the target face image into a preset deep learning model, and identifying the expression emotion.
The expression emotion is identified on the basis of the identity information corresponding to the recognized target face image. Because identity is confirmed through face recognition, intelligent control by a non-owner can be avoided and safety is improved.
Further, the target face image identified in step S200 is input into a preset deep learning model, and an expression emotion in the target face image is identified.
Referring to fig. 4, in step 300, inputting the target face image into a preset deep learning model, before recognizing the expressive emotion in the target face image, the method further includes:
collecting a plurality of facial feature images and expression emotion images corresponding to the facial feature images, and creating a data set by using the facial feature images and the expression emotion images, wherein the data set comprises a training set and a verification set; constructing a deep neural network; and inputting the training set into a deep neural network, performing data training learning, and verifying by using a verification set to generate a deep learning model.
In this embodiment, in order to meet the requirement of deep neural network learning and training for emotion recognition, a data set is created and divided into a training set and a verification set. In this example, 80% of the data set is used for training and 20% for testing. The data collection requires uniform lighting and a soft background; the clothes of the collected persons are required to be of a uniform color, and there must be no factors such as beards, earrings or glasses that would hinder training.
In this embodiment, the smart home device has 6 preset application scenes, associated with 6 different emotions. The collected expressions are classified according to the 6 expressions anger, happiness, sadness, surprise, disgust and fear; each expression is required to contain 5 different gaze directions and is captured simultaneously from different angles using 5 cameras; the data set contains at least 1000 people, half male and half female, between the ages of 5 and 70. In this embodiment, the size of the collected pictures is 224 x 224 pixels.
Further, before creating the data set using the facial feature images and the expression emotion images, the method further includes: performing data set enhancement processing, including flipping, rotation and cropping transforms, on the collected face feature images. Enhancing the collected data set alleviates the problem of the network overfitting too quickly. In this example, each image is flipped, rotated and cropped; during cropping, the data are cropped randomly, and the images are also randomly flipped, rotated and so on. Through these data enhancement operations the data volume of the data set is greatly increased, and the trained network is more robust.
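A sketch of the flip / rotate / crop enhancement described above, using torchvision transforms as an assumed toolchain; the rotation angle and crop scale are illustrative parameters.

from torchvision import transforms

# Assumed augmentation pipeline for the 224 x 224 expression images.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # random flipping
    transforms.RandomRotation(degrees=15),                 # random rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random cropping back to 224 x 224
    transforms.ToTensor(),
])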
In this embodiment, the deep neural network adopts a VGG19 convolutional neural network; the VGG19 convolutional neural network includes: 19 hidden layers, 5 pooling layers and 1 classification layer, wherein the hidden layers comprise: 16 convolutional layers, 3 fully-connected layers and 1 dropout layer, wherein the convolutional layers adopt small convolutional kernels.
Referring to fig. 5, the method for creating the VGG19 convolutional neural network structure includes:
creating a first layer of convolution, including two layers of convolution networks and a maxpool layer, adding nonlinear correction, and transmitting the obtained feature map into a lower layer for processing; creating a second layer of convolution, wherein the second layer of convolution comprises two layers of convolution networks and a maxpool layer, adding nonlinear correction, and transmitting the obtained feature map into a lower layer for processing; creating a third layer of convolution, including four layers of convolution networks and one layer of maxpool layer, adding nonlinear correction, and transmitting the obtained feature map into a lower layer for processing; creating a fourth layer of convolution, including a four-layer convolution network and a maxpool layer, adding nonlinear correction, and transmitting the obtained feature map into a lower layer for processing; creating a fifth layer convolution comprising a four-layer convolution network and a maxpool layer, and transmitting the obtained feature map into a lower layer for processing; creating a full connection layer which comprises three full connection layers and a softmax layer, and transmitting the obtained feature graph into a lower layer for processing; creating a loss function, wherein the loss function adopts a cross entropy loss function, and the cross entropy loss function calculation formula is as follows:
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log\hat{y}_{i,c}
further, for the neural network structure of expression recognition, the VGG19 network used in the embodiment of the present application, VGG19, is one of the convolutional neural networks CNN, and the model includes 19 hidden layers, including 16 convolutional layers, 3 fully-connected layers, and 1 dropout layer; the network also comprises 5 pooling layers, 1 classification softmax classification layer. The VGG19 network adopts a plurality of small convolution kernels, the performance of the VGG19 network is superior to that of a large convolution kernel, the complexity of network depth guarantee learning is increased, and meanwhile, the calculation amount is reduced by almost half.
The method of creating the VGG19 convolutional neural network includes the further steps of.
(1) A first convolution block is created, comprising two convolution networks and one maxpool layer, where the convolution kernels are all of size 3 x 3 with stride 1, padding 1 and 64 kernels, and a 224 x 224 x 64 matrix is output. After the maxpool layer with size 2 x 2 and stride 2, the spatial size of the matrix is halved and the number of channels is unchanged, so a 112 x 112 x 64 feature matrix is finally output. The pooling layer greatly reduces the size of the model and improves the calculation speed and the robustness of the extracted features. Finally, a nonlinear correction (ReLU) is added, and the obtained feature map is passed on to the lower layer for processing.
(2) A second convolution block is created, comprising two convolution networks and one maxpool layer; the size and stride of the convolution kernels are the same as in step 1, the number of kernels becomes 128, and a 112 x 112 x 128 matrix is output. After the maxpool layer the output feature matrix has size 56 x 56 x 128. Finally, a nonlinear correction (ReLU) is added, and the obtained feature map is passed on to the lower layer for processing.
(3) A third convolution block is created, comprising 4 convolution layers and one maxpool layer; the size and stride of the convolution kernels are the same as in step 1, the number of kernels becomes 256, and a 56 x 56 x 256 matrix is output. After the maxpool layer the output feature matrix has size 28 x 28 x 256. Finally, a nonlinear correction (ReLU) is added, and the obtained feature map is passed on to the lower layer for processing.
(4) A fourth convolution block is created, comprising 4 convolution layers and one maxpool layer; the size and stride of the convolution kernels are the same as in step 1, the number of kernels becomes 512, and a 28 x 28 x 512 matrix is output. After the maxpool layer the output feature matrix has size 14 x 14 x 512. Finally, a nonlinear correction (ReLU) is added, and the obtained feature map is passed on to the lower layer for processing.
(5) A fifth convolution block is created, comprising 4 convolution layers and one maxpool layer; the size and stride of the convolution kernels are the same as in step 1, the number of kernels remains 512, and a 14 x 14 x 512 matrix is output. After the maxpool layer the output feature matrix has size 7 x 7 x 512. Finally, a nonlinear correction (ReLU) is added, and the obtained feature map is passed on to the lower layer for processing.
(6) A fully-connected block is created, comprising three fully-connected layers and one softmax layer. The role of the fully-connected layers is to map the distributed feature representation to the sample label space. After the convolution of step 5, the obtained 7 x 7 x 512 feature vector is linked to two fully-connected layers of size 1 x 4096 and one fully-connected layer of size 1 x 1000. After the fully-connected layers the model obtains the output probabilities of the six expression classifications, which are normalized by the softmax classifier so that they sum to 1, facilitating data processing.
In order to effectively alleviate overfitting and give the model stronger generalization, in this embodiment a dropout layer is applied before the fully-connected layers: the activation of a given neuron is stopped with a certain probability during propagation, so that the propagation process does not depend too much on particular local features, which increases the robustness of the model; the obtained feature map is then passed to the next layer for processing.
(7) And creating a loss function, wherein the loss function adopts a cross entropy loss function. The cross entropy loss function calculation formula is as follows:
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log\hat{y}_{i,c}
where N is the number of training samples, C is the number of expression classes, y_{i,c} is the true (one-hot) label of sample i for class c, and \hat{y}_{i,c} is the predicted softmax probability.
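The layer-by-layer construction above can be summarized in the following PyTorch sketch (an assumed framework). The 3 x 3 convolutions, block sizes, dropout before the fully-connected layers and the two 4096-unit layers follow the description; mapping the final layer directly to the six expression classes (instead of the intermediate 1 x 1000 layer) is a simplification made here for illustration, and nn.CrossEntropyLoss applies the softmax internally.

import torch.nn as nn

def conv_block(in_ch, out_ch, num_convs):
    # num_convs 3x3 convolutions (stride 1, padding 1) with ReLU, then 2x2 max pooling.
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class VGG19Expression(nn.Module):
    def __init__(self, num_classes=6, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, 2),      # block 1: 224 -> 112
            conv_block(64, 128, 2),    # block 2: 112 -> 56
            conv_block(128, 256, 4),   # block 3: 56 -> 28
            conv_block(256, 512, 4),   # block 4: 28 -> 14
            conv_block(512, 512, 4),   # block 5: 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout),                    # dropout before the fully-connected layers
            nn.Linear(7 * 7 * 512, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),           # six expression classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = VGG19Expression()
criterion = nn.CrossEntropyLoss()   # cross-entropy loss, softmax applied internally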
step S400: and calling an application scene associated with the expressive emotion according to the identified identity information and the identified expressive emotion, and controlling the intelligent home equipment according to a control instruction corresponding to the associated application scene by using a Zigbee protocol.
In step S400, the smart home control uses the ZigBee protocol to realize the control of the different scenes. Generally, the smart home system is composed of four kinds of products, namely a multifunctional gateway, a human body sensor, a door and window sensor and a wireless switch, whose common characteristic is that they all support the ZigBee protocol. The ZigBee protocol is a low-power local area network protocol based on the IEEE 802.15.4 standard (2.4 GHz band); it is a short-distance, low-power wireless communication technology. Applied to smart homes, the ZigBee protocol has the characteristics of low power consumption, low cost, short delay, short range, high capacity, high reliability and high security.
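To tie steps S200 to S400 together, the sketch below shows how the recognized identity and expression emotion could select a scene and forward its control instructions to the ZigBee gateway; the gateway client class and its send method are hypothetical, since the patent does not define a software API for the ZigBee devices.

class ZigbeeGateway:
    # Hypothetical client for the multifunctional gateway; the actual transport is left abstract.
    def send(self, device_id, command):
        print(f"[zigbee] {device_id} <- {command}")   # placeholder for a real gateway call

def control_smart_home(identity, emotion, scenes, gateway, authorized_users):
    # Step S400: identity check first (avoids control by a non-owner), then dispatch
    # the control instructions of the application scene associated with the emotion.
    if identity not in authorized_users:
        return
    scene = scenes.get(emotion)
    if scene is None:
        return
    for device_id, command in scene["commands"]:
        gateway.send(device_id, command)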
Example 2
The embodiment of the application provides a smart home control device based on expression emotion recognition of deep learning. Referring to fig. 7, the device includes: an association setting module 100, a face recognition module 200, an emotion recognition module 300 and an instruction calling module 400;
the association setting module 100 is configured to preset application scenes of the smart home device, where different application scenes are associated with different emotions, and different application scenes output different control instructions.
The face recognition module 200 is configured to obtain a target image, pre-process the target image, extract a target face image in the target image through the MTCNN model, perform face feature analysis through the insight face model, and recognize face feature information and corresponding identity information.
The emotion recognition module 300 is configured to input the target face image into a preset deep learning model, and recognize an expression emotion.
The instruction calling module 400 is configured to call an application scene associated with the expressive emotion according to the identified identity information and the identified expressive emotion, and then control the smart home device according to a control instruction corresponding to the associated application scene by using a Zigbee protocol.
Example 3
An embodiment of the present application provides a server. The server includes a memory and a processor, and when the processor executes the smart home control program stored in the memory it implements the steps of the smart home control method based on expression emotion recognition of deep learning in embodiment 1, including:
presetting application scenes of the smart home equipment, wherein different application scenes are associated with different expression emotions and output different control instructions; acquiring a target image, preprocessing the target image, extracting a target face image from the target image through an MTCNN (Multi-task Cascaded Convolutional Networks) model, and performing face feature analysis through an InsightFace model to identify face feature information and corresponding identity information; inputting the target face image into a preset deep learning model and identifying the expression emotion; and calling the application scene associated with the expression emotion according to the identified identity information and expression emotion, and controlling the smart home equipment over the Zigbee protocol according to the control instruction corresponding to the associated application scene.
Example 4
An embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores a smart home control program based on expression emotion recognition of deep learning, and when this program is executed by a processor it implements the steps of the smart home control method based on expression emotion recognition of deep learning in embodiment 1, including:
presetting application scenes of the smart home equipment, wherein different application scenes are associated with different expression emotions and output different control instructions; acquiring a target image, preprocessing the target image, extracting a target face image from the target image through an MTCNN (Multi-task Cascaded Convolutional Networks) model, and performing face feature analysis through an InsightFace model to identify face feature information and corresponding identity information; inputting the target face image into a preset deep learning model and identifying the expression emotion; and calling the application scene associated with the expression emotion according to the identified identity information and expression emotion, and controlling the smart home equipment over the Zigbee protocol according to the control instruction corresponding to the associated application scene.
The server in this embodiment may include, but is not limited to, a memory, a processor, and a network interface, which may be communicatively connected to each other through a system bus. The server may be a rack server, a blade server, a tower server, or a rack server, and the like, and the server may be an independent server or a server cluster composed of a plurality of servers.
The memory includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage may be an internal storage unit of the application server, such as a hard disk or a memory of the application server. In other embodiments, the memory may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the server. Of course, the memory may also include both the internal storage unit of the server and its external storage devices. In this embodiment, the memory is generally used to store an operating system installed in the server and various types of application software, such as emotion recognition program code. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output.
The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the server. In this embodiment, the processor is configured to run program codes stored in the memory or process data, for example, each step in the running smart home control method based on emotion recognition of expressions and emotions through deep learning. The network interface may include a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the server and other electronic devices.
The functional units in the embodiments of the present invention may be integrated together to form an independent part, each unit may exist separately, or two or more units may be integrated to form an independent part. If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing description of the preferred embodiments of the present invention has been presented only for the purpose of facilitating understanding of the invention and is not intended to limit the invention thereto. Various modifications and alterations to this invention will become apparent to those skilled in the art to which this invention pertains. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. The intelligent home control method based on expression emotion recognition of deep learning is characterized by comprising the following steps of:
s1: presetting application scenes of the intelligent home equipment, wherein different application scenes are associated with different expressions and emotions, and different application scenes output different control instructions;
s2: acquiring a target image, preprocessing the target image, extracting a target face image from the target image through an MTCNN (Multi-task Cascaded Convolutional Networks) model, and performing face feature analysis through an InsightFace model to identify face feature information and corresponding identity information;
s3: inputting the target face image into a preset deep learning model, and identifying expression emotion;
s4: and calling an application scene associated with the expressive emotion according to the identified identity information and the identified expressive emotion, and controlling the intelligent home equipment according to a control instruction corresponding to the associated application scene by using a Zigbee protocol.
2. The smart home control method for emotion recognition based on deep learning of claim 1, wherein in step S2, the method for preprocessing the target image includes: and normalizing the environment and the illumination in the acquired target image.
3. The smart home control method for emotion recognition based on deep learning of expression and emotion, as claimed in claim 2, wherein, in step S2, the method for extracting the target facial image in the target image through MTCNN model further comprises: and (3) detecting a target face region in the target image by utilizing the cascade of three-layer network structures of P-Net, R-Net and O-Net, extracting the target face image, detecting face key points, carrying out affine transformation by utilizing the face key points, and aligning and correcting the target face image.
4. The intelligent home control method for emotion recognition based on deep learning of claim 3, wherein in step S2, the method for performing facial feature analysis through an insight face model to recognize facial feature information and corresponding identity information includes:
presetting a face database, wherein face feature vectors and corresponding identity information are preset in the face database;
acquiring a target face image;
inputting the target face image into the InsightFace model, and recognizing the target face feature information;
traversing the face database with the target face feature information, and, when the similarity between the target face feature information and a face feature vector in the face database meets a preset condition, taking the identity information corresponding to that face feature vector as the identity information of the target face image.
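The database traversal can be pictured as a nearest-neighbour search over feature vectors. The sketch below assumes cosine similarity, a fixed threshold of 0.5, and embeddings produced elsewhere (for example by an InsightFace model); all of these specifics are illustrative assumptions.

```python
# Sketch of the identity lookup: compare the target feature vector against a
# preset database and accept the best match above a threshold.
# The 0.5 threshold and the random placeholder vectors are assumptions.
import numpy as np

FACE_DB = {
    # identity -> face feature vector (e.g. a 512-d embedding from an InsightFace model)
    "alice": np.random.randn(512),
    "bob":   np.random.randn(512),
}


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def identify(target_feature: np.ndarray, threshold: float = 0.5):
    """Traverse the database and return the identity whose stored vector is
    most similar to the target feature, if the preset condition is met."""
    best_name, best_score = None, -1.0
    for name, vec in FACE_DB.items():
        score = cosine_similarity(target_feature, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```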
5. The smart home control method based on deep learning expression emotion recognition according to claim 4, wherein in step S3, before the target face image is input into the preset deep learning model and the expression emotion is recognized, the method further comprises:
collecting a plurality of face feature images and the expression emotion images corresponding to them, and creating a data set from the face feature images and the expression emotion images, wherein the data set comprises a training set and a validation set;
constructing a deep neural network;
inputting the training set into the deep neural network for training, and validating with the validation set to generate the deep learning model.
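A compact training/validation sketch of this claim is given below in Keras. The synthetic 48x48 grayscale data, the seven emotion classes, and the small stand-in CNN are placeholders (a fuller VGG19 sketch follows claim 8); none of these specifics are fixed by the patent.

```python
# Minimal training/validation sketch. Data and model are placeholders.
import numpy as np
import tensorflow as tf

NUM_CLASSES = 7  # e.g. seven basic expressions (assumption)

# Placeholder data set: 48x48 grayscale faces with one-hot emotion labels.
x = np.random.rand(200, 48, 48, 1).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, NUM_CLASSES, 200), NUM_CLASSES)
x_train, x_val = x[:160], x[160:]  # split into training and validation sets
y_train, y_val = y[:160], y[160:]

# Small stand-in network; claim 7 specifies a VGG19-style architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(48, 48, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Train on the training set, validate on the validation set, and keep the
# resulting model as the "preset deep learning model" used in step S3.
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=32)
model.save("emotion_model.h5")
```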
6. The smart home control method based on deep learning expression emotion recognition according to claim 5, wherein before creating the data set from the face feature images and the expression emotion images, the method further comprises: performing data augmentation, including flipping, rotation and cropping transformations, on the collected face feature images.
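The three transformations named in this claim could be implemented as in the sketch below; the probability and parameter ranges (50% flip, ±15° rotation, 90% crop) are assumptions chosen only for illustration.

```python
# Illustrative augmentation for claim 6: horizontal flip, small random
# rotation, and random crop. Parameter ranges are assumptions.
import random
import cv2
import numpy as np


def augment(image: np.ndarray) -> np.ndarray:
    # Flip left-right with 50% probability.
    if random.random() < 0.5:
        image = np.fliplr(image).copy()

    # Rotate by a small random angle about the image center.
    h, w = image.shape[:2]
    angle = random.uniform(-15, 15)
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, rot, (w, h))

    # Random crop to 90% of the original size, then resize back.
    ch, cw = int(h * 0.9), int(w * 0.9)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    crop = image[top:top + ch, left:left + cw]
    return cv2.resize(crop, (w, h))
```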
7. The smart home control method based on deep learning expression emotion recognition according to claim 5 or 6, wherein the deep neural network employs a VGG19 convolutional neural network; the VGG19 convolutional neural network comprises 19 hidden layers, 5 pooling layers and 1 classification (softmax) layer, wherein the hidden layers comprise 16 convolutional layers, 3 fully connected layers and 1 dropout layer, and the convolutional layers use small convolution kernels.
8. The smart home control method based on deep learning expression emotion recognition, wherein the method for creating the VGG19 convolutional neural network comprises the following steps:
creating a first convolutional block comprising two convolutional layers and one maxpool layer, applying non-linear rectification (ReLU), and passing the resulting feature map to the next layer for processing;
creating a second convolutional block comprising two convolutional layers and one maxpool layer, applying non-linear rectification, and passing the resulting feature map to the next layer for processing;
creating a third convolutional block comprising four convolutional layers and one maxpool layer, applying non-linear rectification, and passing the resulting feature map to the next layer for processing;
creating a fourth convolutional block comprising four convolutional layers and one maxpool layer, applying non-linear rectification, and passing the resulting feature map to the next layer for processing;
creating a fifth convolutional block comprising four convolutional layers and one maxpool layer, applying non-linear rectification, and passing the resulting feature map to the next layer for processing;
creating a fully connected head comprising one dropout layer, three fully connected layers and one softmax layer, wherein the dropout layer deactivates the activation of a neuron with a certain probability during propagation, and the resulting feature map is passed to the next layer for processing;
creating a loss function, wherein the loss function adopts a cross entropy loss function, and the cross entropy loss function calculation formula is as follows:
L = -∑_{i=1}^{N} y_i log(ŷ_i), where N is the number of expression classes, y_i is the true label indicator for class i, and ŷ_i is the probability predicted for class i by the softmax layer.
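Putting claims 7 and 8 together, the network could be sketched in Keras as below: five convolutional blocks of 2, 2, 4, 4 and 4 conv layers (16 in total, 3x3 kernels) each followed by max pooling and ReLU, then dropout, three fully connected layers with the softmax folded into the last one, trained with a cross-entropy loss. The 48x48 grayscale input and 7 output classes are assumptions.

```python
# Sketch of the VGG19-style network described in claims 7-8.
# Input size and number of classes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models


def build_vgg19(input_shape=(48, 48, 1), num_classes=7):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # (filters per block, number of conv layers per block): 2+2+4+4+4 = 16 conv layers
    for filters, n_convs in [(64, 2), (128, 2), (256, 4), (512, 4), (512, 4)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))          # 5 pooling layers in total
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))                      # deactivate activations at random
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))  # classification layer
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",      # cross-entropy loss of claim 8
                  metrics=["accuracy"])
    return model


if __name__ == "__main__":
    build_vgg19().summary()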
9. A smart home control device based on deep learning expression emotion recognition, characterized in that the device comprises: an association setting module, a face recognition module, an emotion recognition module and an instruction calling module;
the association setting module is used for presetting application scenes of the smart home devices, wherein different application scenes are associated with different expression emotions, and different application scenes output different control instructions;
the face recognition module is used for acquiring a target image, preprocessing the target image, extracting a target face image from the target image through the MTCNN (Multi-task Cascaded Convolutional Network) model, performing face feature analysis through the InsightFace model, and recognizing face feature information and corresponding identity information;
the emotion recognition module is used for inputting the target face image into a preset deep learning model and recognizing expression emotion;
the instruction calling module is used for calling the application scene associated with the recognized identity information and expression emotion, and then controlling the smart home devices over the Zigbee protocol according to the control instruction corresponding to the associated application scene.
CN202010397040.XA 2020-05-12 2020-05-12 Smart home control method and device based on expression emotion recognition of deep learning Pending CN111597955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010397040.XA CN111597955A (en) 2020-05-12 2020-05-12 Smart home control method and device based on expression emotion recognition of deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010397040.XA CN111597955A (en) 2020-05-12 2020-05-12 Smart home control method and device based on expression emotion recognition of deep learning

Publications (1)

Publication Number Publication Date
CN111597955A true CN111597955A (en) 2020-08-28

Family

ID=72191143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010397040.XA Pending CN111597955A (en) 2020-05-12 2020-05-12 Smart home control method and device based on expression emotion recognition of deep learning

Country Status (1)

Country Link
CN (1) CN111597955A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104582187A (en) * 2015-01-14 2015-04-29 山东大学 Recording and lamplight control system and method based on face recognition and facial expression recognition
US20190347285A1 (en) * 2017-03-31 2019-11-14 Samsung Electronics Co., Ltd. Electronic device for determining emotion of user and method for controlling same
CN110188615A (en) * 2019-04-30 2019-08-30 中国科学院计算技术研究所 A kind of facial expression recognizing method, device, medium and system
CN110826538A (en) * 2019-12-06 2020-02-21 合肥科大智能机器人技术有限公司 Abnormal off-duty identification system for electric power business hall

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329665A (en) * 2020-11-10 2021-02-05 上海大学 Face snapshot system
CN112533328A (en) * 2020-12-25 2021-03-19 广东技术师范大学 Household lighting device with artificial intelligence and implementation method thereof
CN112686156A (en) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Emotion monitoring method and device, computer equipment and readable storage medium
CN112651363A (en) * 2020-12-31 2021-04-13 沈阳康泰电子科技股份有限公司 Micro-expression fitting method and system based on multiple characteristic points
CN112732591A (en) * 2021-01-15 2021-04-30 杭州中科先进技术研究院有限公司 Edge computing framework for cache deep learning
CN113057633B (en) * 2021-03-26 2022-11-01 华南理工大学 Multi-modal emotional stress recognition method and device, computer equipment and storage medium
CN113057633A (en) * 2021-03-26 2021-07-02 华南理工大学 Multi-modal emotional stress recognition method and device, computer equipment and storage medium
CN113643584A (en) * 2021-08-16 2021-11-12 中国人民解放军陆军特色医学中心 Robot for training doctor-patient communication ability and working method thereof
CN113947798A (en) * 2021-10-28 2022-01-18 平安科技(深圳)有限公司 Background replacing method, device and equipment of application program and storage medium
CN114265319A (en) * 2021-11-11 2022-04-01 珠海格力电器股份有限公司 Control method, control device, electronic equipment and storage medium
CN116863529A (en) * 2023-09-05 2023-10-10 诚峰智能光环境科技(江苏)有限公司 Intelligent lamp control method based on facial expression recognition
CN116863529B (en) * 2023-09-05 2023-11-07 诚峰智能光环境科技(江苏)有限公司 Intelligent lamp control method based on facial expression recognition
CN117784630A (en) * 2023-12-28 2024-03-29 广东启现智能家具有限公司 Intelligent home control method and system based on emotion recognition
CN117896740A (en) * 2024-03-13 2024-04-16 武汉科技大学 Intelligent household wireless sensor network deployment method
CN117896740B (en) * 2024-03-13 2024-05-28 武汉科技大学 Intelligent household wireless sensor network deployment method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination