CN114120422A - Expression recognition method and device based on local image data fusion - Google Patents

Expression recognition method and device based on local image data fusion

Info

Publication number
CN114120422A
Authority
CN
China
Prior art keywords
image
region
fusion
mouth
target face
Prior art date
Legal status
Pending
Application number
CN202111459474.9A
Other languages
Chinese (zh)
Inventor
杨华千
韦鹏程
冯伟
邹晓兵
Current Assignee
Chongqing University of Education
Original Assignee
Chongqing University of Education
Priority date
Filing date
Publication date
Application filed by Chongqing University of Education
Priority to CN202111459474.9A
Publication of CN114120422A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Abstract

The application provides an expression recognition method and device based on local image data fusion, comprising the following steps: acquiring a grayscale image and a depth image containing a target face; preprocessing the grayscale image and determining the image mouth length of the target face; comparing the image mouth length with a reference mouth length corresponding to the target face to obtain a comparison result; determining a precise fusion region and a rough fusion region based on the comparison result, wherein both regions are partial regions of the target face; performing precision-differentiated image fusion on the preprocessed grayscale image and the preprocessed depth image based on the precise fusion region and the rough fusion region; and performing expression recognition on the fused face image to determine the expression of the target face at that moment. In this way, the accuracy and stability of facial expression recognition can be improved.

Description

Expression recognition method and device based on local image data fusion
Technical Field
The application relates to the technical field of face recognition, in particular to an expression recognition method and device based on local image data fusion.
Background
In the field of human-computer interaction, accurate recognition of facial expressions helps a machine or computer infer a person's likely cognition and thinking patterns, so that human emotions can be understood accurately, which in turn improves the quality of human-computer interaction. Facial expression data are high-dimensional and complex, and the data describing a facial expression occupy a large spatial dimension.
At present, expression recognition based on face grayscale images is easily affected by illumination changes, while expression recognition based on face depth images is easily affected by factors such as data acquisition precision and dynamic changes (for example, movement of the facial features and changes of position), which hinders the stability and accuracy of facial expression recognition systems.
Disclosure of Invention
The embodiment of the application aims to provide an expression recognition method and device based on local image data fusion so as to improve the accuracy and stability of facial expression recognition.
In order to achieve the above object, embodiments of the present application are implemented as follows:
In a first aspect, an embodiment of the present application provides an expression recognition method based on local image data fusion, including: acquiring a grayscale image and a depth image containing a target face, wherein the grayscale image and the depth image are obtained by capturing the target face at the same moment; preprocessing the grayscale image, detecting the preprocessed grayscale image, and determining the image mouth length of the target face; comparing the image mouth length with a reference mouth length corresponding to the target face, and determining a comparison result, wherein the comparison result is any one of a first result, a second result and a third result, the first result indicating that the image mouth length is greater than the reference mouth length, the second result indicating that the image mouth length is consistent with the reference mouth length, and the third result indicating that the image mouth length is smaller than the reference mouth length; determining a precise fusion region and a rough fusion region based on the comparison result, wherein the precise fusion region and the rough fusion region both belong to partial regions of the target face; performing precision-differentiated image fusion on the preprocessed grayscale image and the preprocessed depth image based on the precise fusion region and the rough fusion region; and performing expression recognition on the fused face image to determine the expression of the target face at that moment.
In the embodiment of the application, a grayscale image and a depth image containing the target face, captured at the same moment, are acquired; the grayscale image is preprocessed to determine the image mouth length of the target face, and the corresponding precise fusion region and rough fusion region are determined based on the comparison result between the image mouth length and the reference mouth length. Then, precision-differentiated image fusion is performed on the preprocessed grayscale image and the preprocessed depth image using the precise fusion region and the rough fusion region, and expression recognition is performed on the fused face image to determine the expression of the target face at that moment. In this way, before the grayscale image and the depth image are fused, the image mouth length identified in the grayscale image is used as the basis for determining the precise fusion region and the rough fusion region. A given mouth length may correspond to several possible expressions, but expressions in which the mouth corners are stretched call for different emphasis regions than expressions in which they are not. For example, when the mouth corners are stretched, the face may be smiling (happy) or crying (sad); a smile is usually accompanied by changes in the eye-corner wrinkles and the cheek muscles, while crying is accompanied by pronounced, relatively distinctive changes in the eyebrows and eyelids. Therefore, using the image mouth length as the basis for determining the precise fusion region and the rough fusion region allows fine fusion of the relevant local regions to be performed in a targeted manner while relatively irrelevant regions are fused roughly, so that the accuracy of facial expression recognition can be greatly improved; moreover, combining the grayscale image and the depth image produces a complementary effect, which further improves the accuracy and stability of facial expression recognition.
With reference to the first aspect, in a first possible implementation manner of the first aspect, preprocessing the grayscale image includes: determining the line connecting the center points of the two eyes of the target face in the grayscale image, and adjusting the angle of the grayscale image based on this line so as to align the target face; and determining the nose tip vertex from the grayscale image. Correspondingly, the expression recognition method based on local image data fusion further includes preprocessing the depth image, which specifically includes:
determining the line connecting the center points of the two eyes of the target face in the depth image, and adjusting the angle of the depth image based on this line so as to align the target face; and determining the nose tip vertex from the depth image.
In this implementation, the line connecting the center points of the two eyes of the target face is determined in the grayscale image (depth image), and the angle of the grayscale image (depth image) is adjusted based on this line so as to align the target face; in addition, the nose tip vertex is determined from the grayscale image (depth image). Determining these points allows the grayscale image and the depth image to be put into good correspondence, which helps ensure the fusion precision. Aligning the target face with the eye-center line before fusion further guarantees the precision of the image fusion.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, performing precision-differentiated image fusion on the preprocessed grayscale image and the preprocessed depth image based on the precise fusion region and the rough fusion region includes: establishing a mapping relation by placing the eye center points and the nose tip vertex of the target face in the preprocessed grayscale image in one-to-one correspondence with the eye center points and the nose tip vertex of the target face in the preprocessed depth image; performing feature extraction on the precise fusion region of the target face in the preprocessed grayscale image, and performing feature extraction on the precise fusion region of the target face in the preprocessed depth image; matching the image feature points extracted from the grayscale image with the depth feature points extracted from the depth image, and establishing a mapping relation between the matched image feature points and depth feature points to realize registration of the grayscale image and the depth image in the precise fusion region; determining the contour of the rough fusion region of the target face in the preprocessed grayscale image, and determining the contour of the rough fusion region of the target face in the preprocessed depth image; and matching the image region contour determined from the grayscale image with the depth region contour determined from the depth image, and establishing a mapping relation between the matched image region contour and depth region contour to realize registration of the grayscale image and the depth image in the rough fusion region.
In this implementation, the eye center points and the nose tip vertex of the target face in the preprocessed grayscale image are placed in one-to-one correspondence with those in the preprocessed depth image to establish a mapping relation, which can serve as the reference for fusing the grayscale image and the depth image. For the image fusion of the precise fusion region, feature extraction is performed on the precise fusion region of the target face in the preprocessed grayscale image and on the precise fusion region of the target face in the preprocessed depth image; the image feature points extracted from the grayscale image are matched with the depth feature points extracted from the depth image, and a mapping relation is established between the matched feature points, realizing registration of the grayscale image and the depth image in the precise fusion region. The fusion of the precise fusion region is thus carried out through the matching of many feature points, which guarantees the precision of the fusion. For the registration of the rough fusion region, the contour of the rough fusion region of the target face in the preprocessed grayscale image and the contour of the same region in the preprocessed depth image are matched, and image fusion within the region is performed on that basis, which greatly improves the fusion efficiency of the rough fusion region. Therefore, this precision-differentiated image fusion approach effectively balances the precision and efficiency of image fusion, improving both the stability and accuracy of expression recognition and the running efficiency of the expression recognition method.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the target face includes a frontal area, an eyebrow area, an orbit area, an ear area, a temporal area, a cheek area, a nose area, an oral area, and a jaw area, and based on the comparison result, an accurate fusion area and a coarse fusion area are determined, including: if the comparison result is the first result, determining that the eyebrow area, the orbit area, the cheek area and the mouth area are precise fusion areas, and the rest are rough fusion areas; if the comparison result is the second result, determining that the forehead area, the eyebrow area, the orbit area, the cheek area and the mouth area are precise fusion areas, and the rest are rough fusion areas; and if the comparison result is the third result, determining that the eyebrow area, the orbit area, the mouth area and the jaw area are accurate fusion areas, and the rest are rough fusion areas.
In this implementation, the target face is divided into a frontal region, an eyebrow region, an orbit region, an ear region, a temporal region, a cheek region, a nose region, a mouth region and a jaw region. If the comparison result is the first result (the image mouth length is greater than the reference mouth length), the eyebrow region, the orbit region, the cheek region and the mouth region are determined to be precise fusion regions, and the rest are rough fusion regions. When the image mouth length is greater than the reference mouth length, the expression at that moment is usually accompanied by features at the eye corners, eyelids, eyebrows and cheeks. For example, when the face is smiling, the mouth corners rise, usually accompanied by the appearance of eye-corner wrinkles, and part of the cheek bulges, causing changes in the contour and depth information; when the face is crying sadly, the eyebrows usually take on a particular form (for example, the eyebrow tails hang down), the eyelids draw together, and the mouth as a whole presents an upward-bulging arc. Therefore, when the comparison result is the first result, determining the eyebrow region, orbit region, cheek region and mouth region as precise fusion regions and the rest as rough fusion regions yields precise fusion regions that benefit expression recognition on the subsequently fused face image. Similarly, if the comparison result is the second result (the image mouth length is consistent with the reference mouth length, where consistency means the difference lies within a set range), the frontal region, eyebrow region, orbit region, cheek region and mouth region are determined to be precise fusion regions, and the rest are rough fusion regions. When the image mouth length is consistent with the reference mouth length, the uncertainty about the category of the expression is high, so several key parts are determined to be precise fusion regions. For example, the expression may be calm, surprised or puzzled, and the specific expression can be determined jointly from the features of multiple regions such as the forehead (for example, whether forehead wrinkles are present), the eyebrows (whether they are raised), the eyes and the mouth shape, which ensures the accuracy of expression recognition. If the comparison result is the third result (the image mouth length is smaller than the reference mouth length), the eyebrow region, the orbit region, the mouth region and the jaw region are determined to be precise fusion regions, and the rest are rough fusion regions. In this way, the regions requiring precise fusion can be selected with emphasis according to the expression categories that may occur when the image mouth length is smaller than the reference mouth length. For example, when the expression is angry, it is often accompanied by tightly pressed lips, a closed mouth, the eyebrow heads pressed down, the eyebrow tails raised, and the eyeballs biased toward the upper eyelids; when the expression is aggrieved, it is usually accompanied by a tightening of the lower jaw, with the skin on the jaw surface showing a finely pitted texture.
Therefore, by determining the corresponding precise fusion regions according to the different comparison results, the facial expression can be recognized accurately.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, performing expression recognition on the fused face image where the comparison result is the first result includes: acquiring eyebrow features of the eyebrow region, mouth features of the mouth region, eye features of the orbit region and cheek features of the cheek region in the face image, wherein the eyebrow features comprise the eyebrow contour and eyebrow depth information, the mouth features comprise the mouth length and lip contour, the eye features comprise the eyelid contour and canthus lines, and the cheek features comprise the cheek contour and cheek depth information; and determining the expression state of the target face at that moment based on the eyebrow features, the mouth features, the eye features and the cheek features.
With reference to the third possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, performing expression recognition on the fused face image where the comparison result is the second result includes: acquiring forehead features of the forehead region, eyebrow features of the eyebrow region, mouth features of the mouth region, eye features of the orbit region and cheek features of the cheek region in the face image, wherein the forehead features comprise the forehead contour and forehead line information, the eyebrow features comprise the eyebrow contour and eyebrow depth information, the mouth features comprise the mouth length and lip contour, the eye features comprise the eyelid contour and canthus lines, and the cheek features comprise the cheek contour and cheek depth information; and determining the expression state of the target face at that moment based on the forehead features, the eyebrow features, the mouth features, the eye features and the cheek features.
With reference to the third possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, performing expression recognition on the fused face image where the comparison result is the third result includes: acquiring eyebrow features of the eyebrow region, mouth features of the mouth region, eye features of the orbit region and jaw features of the jaw region in the face image, wherein the eyebrow features comprise the eyebrow contour and eyebrow depth information, the mouth features comprise the mouth length and lip contour, the eye features comprise the eyelid contour and canthus lines, and the jaw features comprise the jaw contour and jaw lines; and determining the expression state of the target face at that moment based on the eyebrow features, the mouth features, the eye features and the jaw features.
In a second aspect, an embodiment of the present application provides an expression recognition apparatus based on local image data fusion, including: an image acquisition unit configured to acquire a grayscale image and a depth image containing a target face, wherein the grayscale image and the depth image are obtained by capturing the target face at the same moment; and a processing unit configured to preprocess the grayscale image, detect the preprocessed grayscale image, and determine the image mouth length of the target face; the processing unit is further configured to compare the image mouth length with a reference mouth length corresponding to the target face and determine a comparison result, wherein the comparison result is any one of a first result, a second result and a third result, the first result indicating that the image mouth length is greater than the reference mouth length, the second result indicating that the image mouth length is consistent with the reference mouth length, and the third result indicating that the image mouth length is smaller than the reference mouth length; the processing unit is further configured to determine a precise fusion region and a rough fusion region based on the comparison result, wherein the precise fusion region and the rough fusion region both belong to partial regions of the target face; the processing unit is further configured to perform precision-differentiated image fusion on the preprocessed grayscale image and the preprocessed depth image based on the precise fusion region and the rough fusion region; and the processing unit is further configured to perform expression recognition on the fused face image and determine the expression of the target face at that moment.
In a third aspect, an embodiment of the present application provides a storage medium, where the storage medium includes a stored program, where, when the program runs, a device in which the storage medium is located is controlled to execute the expression recognition method based on local image data fusion according to the first aspect or any one of possible implementation manners of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store information including program instructions, and the processor is configured to control execution of the program instructions, where the program instructions are loaded and executed by the processor to implement the expression recognition method based on local image data fusion according to the first aspect or any one of possible implementation manners of the first aspect.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of an expression recognition method based on local image data fusion according to an embodiment of the present application.
Fig. 2 is a schematic diagram of region division of a target face according to an embodiment of the present application.
Fig. 3 is a schematic diagram of an expression recognition apparatus based on local image data fusion according to an embodiment of the present application.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Reference numerals: 1 - forehead region; 2 - eyebrow region; 3 - nose region; 4 - mouth region; 5 - jaw region; 6 - cheek region; 7 - orbit region; 8 - ear region; 9 - temporal region; 10 - expression recognition device based on local image data fusion; 11 - image acquisition unit; 12 - processing unit; 20 - electronic device; 21 - memory; 22 - communication module; 23 - bus; 24 - processor.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In this embodiment, the expression recognition method based on local image data fusion may be executed by an electronic device. The electronic device may be a terminal such as a laptop, a personal computer, a tablet computer or a smart phone; it may also be a server, such as a cloud server or a server cluster. Of course, the electronic device may also be a dedicated device, for example a specially developed device that includes a processing apparatus and an image acquisition apparatus, where the image acquisition apparatus includes a real-time image acquisition component (e.g., a high-definition camera) and a depth image acquisition component (e.g., a depth camera), used respectively to capture, synchronously and in real time, a real-time image (from which the corresponding grayscale image can be obtained after grayscale processing) and a depth image.
Referring to fig. 1, fig. 1 is a flowchart of an expression recognition method based on local image data fusion according to an embodiment of the present application.
In the present embodiment, the expression recognition method based on local image data fusion may include step S10, step S20, step S30, step S40, step S50, and step S60.
In order to improve the quality of the human-computer interaction, the facial expression may be recognized during the interaction, and based on this, the electronic device may perform step S10.
Step S10: the method comprises the steps of obtaining a gray level image and a depth image which comprise a target face, wherein the gray level image and the depth image are obtained after shooting the target face at the same time.
In this embodiment, the electronic device may obtain a real-time image captured by the real-time image capturing component, and perform gray processing on the real-time image to obtain a gray image; and the electronic device may acquire the depth images captured by the depth image acquiring means at the same time. The gray level image and the depth image both comprise a target human face.
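As a minimal illustration of this acquisition step (assuming OpenCV is available; the file names below merely stand in for frames delivered by the two synchronized camera components):

```python
import cv2

# Illustrative only: in practice the two frames come from synchronized
# camera components; here they are read from files for demonstration.
color_frame = cv2.imread("face_color.png")                        # real-time image
depth_frame = cv2.imread("face_depth.png", cv2.IMREAD_UNCHANGED)  # e.g. 16-bit depth map

# Grayscale processing of the real-time image yields the grayscale image
# used in the rest of the method.
gray_image = cv2.cvtColor(color_frame, cv2.COLOR_BGR2GRAY)
```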
After acquiring the grayscale image and the depth image, the electronic device may perform step S20.
Step S20: and preprocessing the gray level image, detecting the preprocessed gray level image, and determining the length of the image mouth of the target face.
In this embodiment, the electronic device may pre-process the grayscale image.
For example, the electronic device may determine a binocular central point connection line of the target face in the grayscale image, and perform angle adjustment on the grayscale image based on the binocular central point connection line to correct the target face.
Specifically, the electronic device can perform contour detection on the grayscale image to determine the positions of the two eyes of the target face and, based on the center point of each eye, determine the line connecting the center points of the two eyes of the target face in the grayscale image. This line can then be adjusted to the horizontal (the horizontal direction is taken as an example here, but is not a limitation), thereby adjusting the angle of the grayscale image so that the target face is held upright. The electronic device can then detect the contour of the nose from the grayscale image and determine the nose tip vertex.
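A sketch of this angle adjustment based on the eye-center line, assuming the two eye center coordinates have already been obtained by the contour detection above (OpenCV and NumPy are assumed; the function name is illustrative):

```python
import cv2
import numpy as np

def align_by_eye_line(image, left_eye, right_eye):
    """Rotate the image so the line joining the two eye centers is horizontal.

    `left_eye` and `right_eye` are (x, y) pixel coordinates assumed to have
    been obtained by a prior contour/eye detection step.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))          # tilt of the eye line
    center = ((left_eye[0] + right_eye[0]) / 2.0,
              (left_eye[1] + right_eye[1]) / 2.0)   # rotate about the midpoint
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    h, w = image.shape[:2]
    return cv2.warpAffine(image, rot, (w, h))
```

The same routine can be applied to the depth image so that both images are aligned in the same way before fusion.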
Similarly, the electronic device may also pre-process the depth image.
For example, the electronic device may determine a binocular central point connection line of the target face in the depth image, and perform angle adjustment on the depth image based on the binocular central point connection line to correct the target face.
Specifically, the electronic device can also perform contour detection on the depth image to determine the positions of the two eyes of the target face, and based on the center point of each eye, the line of the center points of the two eyes of the target face in the depth image can be determined. Then, the connecting line of the central points of the two eyes can be adjusted to be horizontal (taking the horizontal direction as an example, but not limited to the horizontal direction), so that the angle of the depth image is adjusted, and the target face is kept in the right position. Then, the electronic device can detect the tip of the nose from the depth image (the tip of the nose is shallower and closer to the depth camera than other parts of the target face).
The line connecting the center points of the two eyes of the target face is determined in the grayscale image (depth image), and the angle of the grayscale image (depth image) is adjusted based on this line so as to align the target face; in addition, the nose tip vertex is determined from the grayscale image (depth image). Determining these points allows the grayscale image and the depth image to be put into good correspondence, which helps ensure the fusion precision. Aligning the target face with the eye-center line before fusion further guarantees the precision of the image fusion.
After the grayscale image has been preprocessed, the electronic device can detect the preprocessed grayscale image to determine the image mouth length of the target face. The grayscale image is chosen for determining the image mouth length because this approach is more stable: the difference between the mouth and the surrounding facial skin is obvious, which benefits contour detection. Of course, in some other possible implementations, the mouth detection may also be performed on the depth image, which is not limited herein.
One way of determining the image mouth length of the target face is to take the distance between the mouth corners as the image mouth length, which is not limited herein. Specifically, the contour of the mouth can be detected to obtain the distance between the mouth corners; then, the distance between the inner eye corners (inner canthi) of the target face in the grayscale image is detected, and the mouth-corner distance is compared with this inner-canthus distance to obtain the image mouth length (a relative value).
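A sketch of this relative mouth-length computation, assuming the mouth-corner and inner-eye-corner coordinates have already been located (the function and argument names are illustrative):

```python
import math

def image_mouth_length(left_mouth_corner, right_mouth_corner,
                       left_inner_canthus, right_inner_canthus):
    """Relative image mouth length: mouth-corner distance divided by the
    inner-canthus (inner eye corner) distance detected in the same image.

    All arguments are (x, y) coordinates assumed to come from the contour
    detection described above; the ratio is a relative value, so it is
    insensitive to the face's distance from the camera.
    """
    mouth_dist = math.dist(left_mouth_corner, right_mouth_corner)
    canthus_dist = math.dist(left_inner_canthus, right_inner_canthus)
    return mouth_dist / canthus_dist
```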
After determining the image mouth length of the target face, the electronic device may perform step S30.
Step S30: comparing the image mouth length with a reference mouth length corresponding to the target face, and determining a comparison result, wherein the comparison result is any one of a first result, a second result and a third result, the first result indicates that the image mouth length is greater than the reference mouth length, the second result indicates that the image mouth length is consistent with the reference mouth length, and the third result indicates that the image mouth length is smaller than the reference mouth length.
In this embodiment, the electronic device may compare the length of the mouth of the image with a reference mouth length corresponding to the target face, and determine a comparison result.
The reference mouth length here may be determined based on an image of the target face with a calm expression; for example, in such an image, the mouth length at that moment, measured relative to the inner-canthus distance between the two eyes of the target face, is taken as the reference mouth length (this reference mouth length is a relative value rather than an absolute value; for example, if the inner-canthus distance is 1, the reference mouth length may be 1.4 or 0.8, a value that varies from person to person but remains stable for a particular person).
Specifically, the comparison of the image mouth length and the reference mouth length may be to determine whether the difference between the image mouth length and the reference mouth length exceeds a set value (e.g., 5%).
If the image mouth length is longer than the reference mouth length by a set value or more, it can be determined that the image mouth length is longer than the reference mouth length, and thus the comparison result can be determined to be the first result.
If the difference between the image mouth portion length and the reference mouth portion length is within the set value, it can be determined that the image mouth portion length and the reference mouth portion length are identical, and thus the comparison result can be determined as the second result.
If the image mouth length is shorter than the reference mouth length by a set value or more, it can be determined that the image mouth length is smaller than the reference mouth length, and thus the comparison result can be determined to be the third result.
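The three-way comparison can be sketched as follows; the 5% tolerance mirrors the set value mentioned above and is a design choice:

```python
def compare_mouth_length(image_len, reference_len, tolerance=0.05):
    """Return 'first', 'second' or 'third' per the comparison rule above.

    `tolerance` is the set value (e.g. 5%) within which the two lengths are
    considered consistent; the exact value is a design choice.
    """
    if image_len > reference_len * (1.0 + tolerance):
        return "first"    # image mouth length greater than reference
    if image_len < reference_len * (1.0 - tolerance):
        return "third"    # image mouth length smaller than reference
    return "second"       # lengths consistent within the set value
```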
After determining the comparison of the image mouth length to the reference mouth length, the electronic device may perform step S40.
Step S40: and determining a precise fusion area and a rough fusion area based on the comparison result, wherein the precise fusion area and the rough fusion area both belong to partial areas in the target face.
In order to facilitate accurate recognition of facial expressions, the target face may be divided into a plurality of regions, referring to fig. 2, the target face may include: frontal region 1, brow region 2, orbital region 7, ear region 8, temporal region 9, cheek region 6, nasal region 3, oral region 4, jaw region 5.
In this embodiment, the electronic device may determine the corresponding precise fusion region and the rough fusion region according to different comparison results.
Illustratively, for the case where the comparison result is the first result:
the electronic device may determine that the eyebrow region, the orbit region, the cheek region, and the mouth region are precise fusion regions, and the rest (forehead region, ear region, temple region, nose region, jaw region) are rough fusion regions.
When the image mouth length is greater than the reference mouth length, the expression at that moment is usually accompanied by features at the eye corners, eyelids, eyebrows and cheeks. For example, when the face is smiling, the mouth corners rise, usually accompanied by the appearance of eye-corner wrinkles, and part of the cheek bulges, causing changes in the contour and depth information; when the face is crying sadly, the eyebrows usually take on a particular form (for example, the eyebrow tails hang down), the eyelids draw together, and the mouth as a whole presents an upward-bulging arc. Therefore, when the comparison result is the first result, determining the eyebrow region, orbit region, cheek region and mouth region as precise fusion regions and the rest as rough fusion regions yields precise fusion regions that benefit expression recognition on the subsequently fused face image.
Illustratively, for the case where the comparison result is the second result:
the electronic device may determine that the frontal area, the eyebrow area, the orbital area, the cheek area, the mouth area are precise fusion areas, and the rest (ear area, temporal area, nose area, jaw area) are coarse fusion areas.
When the image mouth length is consistent with the reference mouth length, the uncertainty about the category of the expression is high, so several key parts are determined to be precise fusion regions. For example, the expression may be calm, surprised or puzzled, and the specific expression can be determined jointly from the features of multiple regions such as the forehead (for example, whether forehead wrinkles are present), the eyebrows (whether they are raised), the eyes and the mouth shape, which ensures the accuracy of expression recognition.
Illustratively, for the case where the comparison result is the third result:
the electronic device may determine that the eyebrow region, the orbit region, the mouth region, and the jaw region are precise fusion regions, and the rest (forehead region, ear region, temple region, cheek region, nose region) are coarse fusion regions.
In this way, the regions requiring precise fusion can be selected with emphasis according to the expression categories that may occur when the image mouth length is smaller than the reference mouth length. For example, when the expression is angry, it is often accompanied by tightly pressed lips, a closed mouth, the eyebrow heads pressed down, the eyebrow tails raised, and the eyeballs biased toward the upper eyelids; when the expression is aggrieved, it is usually accompanied by a tightening of the lower jaw, with the skin on the jaw surface showing a finely pitted texture. Therefore, by determining the corresponding precise fusion regions according to the different comparison results, the facial expression can be recognized accurately.
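The region selection for the three comparison results can be captured in a small lookup table; the region names are illustrative labels for the nine regions of fig. 2:

```python
ALL_REGIONS = {"forehead", "eyebrow", "orbit", "ear", "temple",
               "cheek", "nose", "mouth", "jaw"}

PRECISE_REGIONS = {
    "first":  {"eyebrow", "orbit", "cheek", "mouth"},
    "second": {"forehead", "eyebrow", "orbit", "cheek", "mouth"},
    "third":  {"eyebrow", "orbit", "mouth", "jaw"},
}

def fusion_regions(comparison_result):
    """Split the nine face regions into precise and rough fusion regions."""
    precise = PRECISE_REGIONS[comparison_result]
    rough = ALL_REGIONS - precise
    return precise, rough
```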
After determining the precise blending region and the coarse blending region, the electronic device may perform step S50.
Step S50: and performing precision-differentiated image fusion on the preprocessed gray-scale image and the preprocessed depth image based on the precise fusion region and the rough fusion region.
In this embodiment, the electronic device may perform precision-differentiated image fusion on the preprocessed grayscale image and the preprocessed depth image based on the corresponding precise fusion region and the rough fusion region.
For example, the electronic device may establish a mapping relationship between the center point of the eyes and the vertex of the nose tip of the target face in the preprocessed gray-scale image and the center point of the eyes and the vertex of the nose tip of the target face in the preprocessed depth image in a one-to-one correspondence manner. The mapping relation established by the method can be used as a reference for image fusion for carrying out precision differentiation on the preprocessed gray-scale image and the preprocessed depth image subsequently.
Firstly, fusing images of a preprocessed gray image and a preprocessed depth image in a precise fusion area:
the electronic equipment can extract the features of the accurate fusion region of the target face in the preprocessed gray-scale image, and extract the features of the accurate fusion region of the target face in the preprocessed depth image. Feature extraction is performed on the accurate fusion region of the target face in the grayscale image and the accurate fusion region of the target face in the depth image, and a consistent feature extraction mode is adopted, so that two types of feature point sets (sets of image feature points and depth feature points) including a plurality of feature points are obtained.
Then, the image feature points extracted from the grayscale image are matched with the depth feature points extracted from the depth image. The matching mainly relies on the positional correspondence of the two kinds of feature points in their images, which can be understood as the correspondence between the image coordinates of a given precise fusion region in the grayscale image and the image coordinates of the same precise fusion region in the depth image. The mapping relation established from the eye center points and the nose tip vertex serves as the reference, adjustment parameters are determined from the coordinate correspondence between the contour of the precise fusion region in the grayscale image and its contour in the depth image, and a high-precision coordinate correspondence is then computed. Each matched pair of feature points is used as an anchor point to adjust the image fusion within the precise fusion region, so that the fusion precision in the precise fusion region is higher. In this way, high-precision matching of the image feature points and the depth feature points is achieved, a mapping relation is established between the matched image feature points and depth feature points, and high-precision image fusion of the preprocessed grayscale image and the preprocessed depth image in the precise fusion region (that is, registration of the grayscale image and the depth image in the precise fusion region) is realized.
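A sketch of the precise-region registration step, with ORB features standing in for the unspecified "consistent feature extraction mode"; the region mask and the 8-bit scaling of the depth image are assumptions made for illustration:

```python
import cv2

def register_precise_region(gray_img, depth_img, region_mask):
    """Extract and match feature points inside one precise fusion region.

    `region_mask` is a binary uint8 mask of the precise fusion region,
    assumed to come from the region division; the depth image is scaled to
    8 bits so the same detector can be applied to both images.
    """
    depth_8u = cv2.normalize(depth_img, None, 0, 255,
                             cv2.NORM_MINMAX).astype("uint8")
    orb = cv2.ORB_create()
    kp_gray, des_gray = orb.detectAndCompute(gray_img, region_mask)
    kp_depth, des_depth = orb.detectAndCompute(depth_8u, region_mask)

    # Match image feature points with depth feature points.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_gray, des_depth), key=lambda m: m.distance)

    # Each matched pair contributes one entry of the mapping relation used
    # to register the two images inside this precise fusion region.
    mapping = [(kp_gray[m.queryIdx].pt, kp_depth[m.trainIdx].pt) for m in matches]
    return mapping
```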
And for the image fusion of the preprocessed gray-scale image and the preprocessed depth image in the rough fusion area:
the electronic device may determine the contour of the coarse fusion region of the target face in the preprocessed gray-scale image, and determine the contour of the coarse fusion region of the target face in the preprocessed depth image. In the method, the contour detection of the rough fusion region of the target face in the gray level image and the rough fusion region of the target face in the depth image can be performed in a differentiated contour detection mode, so that the contour detection of the rough fusion region can be quickly and accurately realized. For example, for the contour detection of the rough fusion region of the target face in the gray-scale image, an edge detection operator can be adopted to detect the target contour; and for the profile detection of the rough fusion region of the target face in the depth image, the edge detection operator can be adopted to detect the target profile to realize the detection, and because the precision requirement of the rough fusion region is relatively low, the processing efficiency is preferentially improved, therefore, a simpler detection mode can be adopted, the processing time is reduced to a certain extent, and the detection efficiency of the rough fusion region is improved.
Then, the electronic device may match the image region contour determined from the grayscale image with the depth region contour determined from the depth image, using the mapping relation established from the eye center points and the nose tip vertex as the reference and combining contour similarity with the coordinate difference (the smaller the coordinate difference, the better). Using this mapping relation as the reference ensures, to a large extent, the matching accuracy between the image region contour determined from the grayscale image and the depth region contour determined from the depth image. On this basis, the electronic device can establish a mapping relation between the matched image region contour and depth region contour, thereby realizing registration of the grayscale image and the depth image in the rough fusion region.
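A sketch of the rough-region contour matching, using Canny edges, cv2.matchShapes for contour similarity and the centroid distance as the coordinate difference; these concrete operators are illustrative choices, not prescribed by the application:

```python
import cv2

def rough_region_match_scores(gray_img, depth_img, region_mask):
    """Return (shape difference, centroid distance) for one rough fusion
    region; both values are 'smaller is better' when pairing contours."""

    def largest_contour(img):
        # Edge detection restricted to the rough fusion region.
        edges = cv2.Canny(cv2.bitwise_and(img, img, mask=region_mask), 50, 150)
        contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        return max(contours, key=cv2.contourArea)

    depth_8u = cv2.normalize(depth_img, None, 0, 255,
                             cv2.NORM_MINMAX).astype("uint8")
    c_gray, c_depth = largest_contour(gray_img), largest_contour(depth_8u)

    # Shape similarity between the two region contours.
    shape_diff = cv2.matchShapes(c_gray, c_depth, cv2.CONTOURS_MATCH_I1, 0.0)

    def centroid(c):
        m = cv2.moments(c)
        return (m["m10"] / m["m00"], m["m01"] / m["m00"])

    cx_g, cy_g = centroid(c_gray)
    cx_d, cy_d = centroid(c_depth)
    coord_diff = ((cx_g - cx_d) ** 2 + (cy_g - cy_d) ** 2) ** 0.5
    return shape_diff, coord_diff
```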
The eye center points and the nose tip vertex of the target face in the preprocessed grayscale image are placed in one-to-one correspondence with those in the preprocessed depth image to establish a mapping relation, which serves as the reference for fusing the grayscale image and the depth image. For the image fusion of the precise fusion region, feature extraction is performed on the precise fusion region of the target face in the preprocessed grayscale image and on the precise fusion region of the target face in the preprocessed depth image; the image feature points extracted from the grayscale image are matched with the depth feature points extracted from the depth image, and a mapping relation is established between the matched feature points, realizing registration of the grayscale image and the depth image in the precise fusion region. The fusion of the precise fusion region is thus carried out through the matching of many feature points, which guarantees its precision. For the registration of the rough fusion region, the contour of the rough fusion region of the target face in the preprocessed grayscale image and the contour of the same region in the preprocessed depth image are matched, and image fusion within the region is performed on that basis, which greatly improves the fusion efficiency of the rough fusion region. Therefore, this precision-differentiated image fusion approach effectively balances the precision and efficiency of image fusion, improving both the stability and accuracy of expression recognition and the running efficiency of the expression recognition method.
The preprocessed gray-scale image and the preprocessed depth image are subjected to image fusion in the precise fusion region and image fusion in the rough fusion region, so that the precision differentiation image fusion of the preprocessed gray-scale image and the preprocessed depth image is realized, and the accuracy, reliability and operating efficiency of expression recognition can be considered.
After implementing the image fusion for precision differentiation of the pre-processed grayscale image and the pre-processed depth image, the electronic device may perform step S60.
Step S60: and performing expression recognition on the fused face image to determine the expression of the target face at the moment.
In this embodiment, the electronic device may perform expression recognition on the fused face image.
For example, for the case that the comparison result is the first result, the electronic device may obtain an eyebrow feature of an eyebrow region, a mouth feature of an oral region, an eye feature of an eye socket region, and a cheek feature of a cheek region in the face image, where the eyebrow feature includes an eyebrow contour and eyebrow depth information, the mouth feature includes a mouth length and a lip contour, the eye feature includes an eyelid contour and an eye corner line, and the cheek feature includes a cheek contour and cheek depth information.
When the comparison result is the first result, the expression may be smiling, sad crying, and so on, and the distinguishing features of these expressions are relatively obvious. For example, when smiling, the mouth corners rise, the mouth is stretched, the cheek muscles bulge, and eye-corner wrinkles appear; when crying sadly, the mouth corners are pulled down, the mouth is stretched, the opening between the upper and lower eyelids decreases, and the eyebrow tails hang down. Corresponding features are therefore obtained from the regions with greater distinguishing power, such as the eyebrow contour (from which the form and direction of the eyebrows can be judged) and eyebrow depth information (used to judge whether the brows are furrowed), the mouth length (mainly used to judge whether the mouth is stretched) and lip contour (used to judge whether the mouth corners are pulled up or down), the eyelid contour (which determines how far the eyelids are open and whether the pupil is biased toward the upper or lower eyelid) and eye-corner lines (used to judge whether eye-corner wrinkles, i.e., crow's feet, are present), and the cheek contour (used to judge the degree of change in the cheek's form) and cheek depth information (used to judge whether the cheek, mainly its upper part, also called the upper cheek area, is bulging). Acquiring these features helps to obtain the expression features of the current face accurately.
Then, the electronic device may determine an expression state of the target face at the time based on the eyebrow feature, the mouth feature, the eye feature, and the cheek feature.
Similarly, for the case that the comparison result is the second result, the electronic device may obtain a forehead feature of a forehead region, an eyebrow feature of an eyebrow region, a mouth feature of an oral region, an eye feature of an eye socket region, and a cheek feature of a cheek region in the face image, where the forehead feature includes a forehead contour and forehead pattern information, the eyebrow feature includes an eyebrow contour and eyebrow depth information, the mouth feature includes a mouth length and a lip contour, the eye feature includes an eyelid contour and an eye corner pattern, and the cheek feature includes a cheek contour and cheek depth information.
Since the comparison result is the second result, the expression may be any of various expressions such as calm, puzzled or serious, and distinguishing among them requires as many features as possible so that the expression can be identified accurately. Therefore, these key regions are selected to obtain the corresponding features, such as the forehead contour (mainly revealing whether the forehead is exposed) and forehead line information (when the forehead is exposed, whether forehead wrinkles are present; raising the eyebrows is usually accompanied by the appearance of forehead wrinkles), the eyebrow contour (from which the form and direction of the eyebrows can be judged) and eyebrow depth information (used to judge whether the brows are furrowed), the mouth length and lip contour (used to judge whether the two mouth corners are at the same height), the eyelid contour (which determines whether the eyes are widened or narrowed, whether the lower eyelid is raised, and whether the pupil is biased toward the upper or lower eyelid) and eye-corner lines (used to judge whether eye-corner wrinkles are present), and the cheek contour (used to judge the degree of change in the cheek's form) and cheek depth information (used to judge whether the cheeks are bulging and whether they are symmetrical). Acquiring these features helps to obtain the expression features of the current face accurately.
Then, the electronic device may determine an expression state of the target face at the time based on the forehead feature, the eyebrow feature, the mouth feature, the eye feature, and the cheek feature.
For example, for the case that the comparison result is the third result, the electronic device may obtain an eyebrow feature of an eyebrow region, a mouth feature of an oral region, an eye feature of an eye socket region, and a jaw feature of a jaw region in the face image, where the eyebrow feature includes an eyebrow contour and eyebrow depth information, the mouth feature includes a mouth length and a lip contour, the eye feature includes an eyelid contour and an eye corner line, and the jaw feature includes a jaw contour and a jaw line.
Since the comparison result is the third result, the possible expressions are varied, for example pouting, sulking, anger, fright and the like, and they can be distinguished mainly through the eyebrow features, mouth features, eye features and jaw features. For example, the eyebrow contour (from which the form and direction of the eyebrows can be judged) and eyebrow depth information (used to judge whether the brows are furrowed), the mouth length (mainly used to judge whether the mouth is shortened) and lip contour (used to judge whether the lips are pursed), the eyelid contour (which determines whether the eyes are widened, whether the lower eyelid is raised, and whether the pupil is biased toward the upper or lower eyelid) and eye-corner lines (used to judge whether eye-corner wrinkles, i.e., crow's feet, are present), and the jaw contour (whether the jaw is deformed) and jaw lines (patches of finely pitted skin texture, usually accompanied by depressed mouth corners, often appearing in aggrieved expressions). Acquiring these features helps to obtain the expression features of the current face accurately.
Then, the electronic device may determine an expression state of the target face at the moment based on the eyebrow feature, the mouth feature, the eye feature, and the jaw feature.
It should be noted that the electronic device may determine the expression state of the target face at that moment in the following way: each expression corresponds to multiple features, and different features may carry different weights; the target face at that moment is scored by determining whether each feature is present, the similarity between the target face and each candidate expression is computed, and the expression with the highest similarity is taken as the expression state of the target face at that moment. Of course, the recognition of a static expression (i.e., the expression state recognized from a single image) may also be performed by a trained model, which is not limited herein.
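A minimal sketch of this weighted-feature scoring; the feature names, expression templates and weights below are invented for illustration only:

```python
# Illustrative weights and expression templates; the actual features,
# expressions and proportions are design choices, not values from the application.
EXPRESSION_TEMPLATES = {
    "happy": {"mouth_corner_up": 0.4, "eye_corner_lines": 0.3, "cheek_raised": 0.3},
    "sad":   {"mouth_corner_down": 0.4, "brow_tail_down": 0.3, "eyelids_narrowed": 0.3},
}

def recognize_expression(detected_features):
    """Score each candidate expression by the weighted presence of its
    features and return the most similar one.

    `detected_features` is a set of feature names judged to be present in
    the fused face image (e.g. {"mouth_corner_up", "cheek_raised"}).
    """
    scores = {
        name: sum(w for feat, w in template.items() if feat in detected_features)
        for name, template in EXPRESSION_TEMPLATES.items()
    }
    return max(scores, key=scores.get)

# Example: features extracted from the fused image suggest a smile.
print(recognize_expression({"mouth_corner_up", "cheek_raised"}))  # -> "happy"
```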
After the expression state of the target face at the moment is determined, the electronic equipment can further determine the expression of the target face at the moment.
For example, in order to ensure the accuracy of expression recognition, the electronic device may further analyze a dynamic change process of the expression state of the target face in combination with the expression states of the target face at a plurality of previous moments, so as to more accurately recognize the expression of the target face at the moment.
Referring to fig. 3, fig. 3 is a schematic diagram of an expression recognition apparatus 10 based on local image data fusion according to an embodiment of the present disclosure.
In this embodiment, the expression recognition apparatus 10 based on local image data fusion may include:
the image acquiring unit 11 is configured to acquire a grayscale image and a depth image that include a target face, where the grayscale image and the depth image are obtained by performing shooting processing on the target face based on the same time.
And the processing unit 12 is configured to pre-process the grayscale image, detect the pre-processed grayscale image, and determine the length of the image mouth of the target face.
The processing unit 12 is further configured to compare the image mouth length with a reference mouth length corresponding to the target human face, and determine a comparison result, where the comparison result is any one of a first result, a second result, and a third result, where the first result indicates that the image mouth length is greater than the reference mouth length, the second result indicates that the image mouth length is consistent with the reference mouth length, and the third result indicates that the image mouth length is smaller than the reference mouth length.
The processing unit 12 is further configured to determine, based on the comparison result, a precise fusion region and a rough fusion region, where the precise fusion region and the rough fusion region both belong to partial regions of a target face.
The processing unit 12 is further configured to perform precision-differentiated image fusion on the preprocessed gray-scale image and the preprocessed depth image based on the precise fusion region and the rough fusion region.
The processing unit 12 is further configured to perform expression recognition on the fused face image, and determine an expression of the target face at the moment.
In this embodiment, the processing unit 12 is further configured to determine a binocular central point connecting line of a target face in the grayscale image, and perform angle adjustment on the grayscale image based on the binocular central point connecting line to correct the target face; determining the apex of the nose tip from the gray level image; the processing unit 12 is further configured to perform preprocessing on the depth image, and specifically configured to: determining a binocular central point connecting line of a target face in the depth image, and carrying out angle adjustment on the depth image based on the binocular central point connecting line so as to align the target face; and determining the apex of the nose tip from the depth image.
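The angle adjustment based on the binocular central point connecting line can be illustrated with the following OpenCV-based sketch, which rotates an image so that the line between the two eye centers becomes horizontal; the eye-center coordinates are assumed to come from an external landmark detector, and the function signature is an assumption rather than the implementation specified by this application.

```python
# Illustrative sketch: rotate an image so that the line connecting the two eye
# centers becomes horizontal. Eye-center coordinates are assumed to come from an
# external landmark detector; this is not the implementation specified here.
import math
import cv2

def correct_by_eye_line(image, left_eye, right_eye):
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))        # tilt of the binocular line
    center = ((left_eye[0] + right_eye[0]) / 2.0,   # rotate about the midpoint
              (left_eye[1] + right_eye[1]) / 2.0)
    rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
    height, width = image.shape[:2]
    return cv2.warpAffine(image, rotation, (width, height))
```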
In this embodiment, the processing unit 12 is further configured to establish a mapping relationship between the center point of the two eyes and the vertex of the nose tip of the target face in the preprocessed gray-scale image and the center point of the two eyes and the vertex of the nose tip of the target face in the preprocessed depth image in a one-to-one correspondence manner; performing feature extraction on the accurate fusion region of the target face in the preprocessed gray level image, and performing feature extraction on the accurate fusion region of the target face in the preprocessed depth image; matching the image characteristic points extracted based on the gray level image with the depth characteristic points extracted based on the depth image, and establishing a mapping relation between the matched image characteristic points and the depth characteristic points to realize the registration of the gray level image and the depth image in a precise fusion area; determining the outline of the rough fusion area of the target face in the preprocessed gray level image, and determining the outline of the rough fusion area of the target face in the preprocessed depth image; and matching the image region contour determined based on the gray level image with the depth region contour determined based on the depth image, and establishing a mapping relation between the matched image region contour and the depth region contour to realize the registration of the gray level image and the depth image in the rough fusion region.
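As a rough illustration of the registration of the precise fusion region, the sketch below extracts and matches feature points between the grayscale region and the corresponding depth region rendered as an 8-bit image; ORB keypoints and brute-force Hamming matching are assumed stand-ins, since this application does not name a specific feature extractor or matcher.

```python
# Illustrative sketch: register the precise fusion region of the grayscale image
# with the corresponding region of the depth image (rendered as 8-bit). ORB
# keypoints and brute-force Hamming matching are assumed stand-ins; the patent
# text does not name a specific feature extractor or matcher.
import cv2

def register_precise_region(gray_roi, depth_roi):
    orb = cv2.ORB_create(nfeatures=500)
    kp_gray, des_gray = orb.detectAndCompute(gray_roi, None)
    kp_depth, des_depth = orb.detectAndCompute(depth_roi, None)
    if des_gray is None or des_depth is None:
        return []  # no features found in one of the regions
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_gray, des_depth), key=lambda m: m.distance)
    # Each match maps one grayscale feature point to one depth feature point.
    return [(kp_gray[m.queryIdx].pt, kp_depth[m.trainIdx].pt) for m in matches]
```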
In this embodiment, the target face includes a frontal region, an eyebrow region, an orbit region, an ear region, a temporal region, a cheek region, a nose region, an oral region, and a jaw region, and the processing unit 12 is further configured to: when the comparison result is the first result, determine that the eyebrow region, the orbit region, the cheek region, and the oral region are precise fusion regions and the rest are rough fusion regions; when the comparison result is the second result, determine that the frontal region, the eyebrow region, the orbit region, the cheek region, and the oral region are precise fusion regions and the rest are rough fusion regions; and when the comparison result is the third result, determine that the eyebrow region, the orbit region, the oral region, and the jaw region are precise fusion regions and the rest are rough fusion regions.
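The region selection just described can be summarized by the following sketch, in which every region not chosen as a precise fusion region falls into the rough fusion set; the English region names are merely labels for the regions listed above.

```python
# Illustrative sketch of the region selection described above. Every region not
# selected as a precise fusion region falls into the rough fusion set.
ALL_REGIONS = {"frontal", "eyebrow", "orbit", "ear", "temporal",
               "cheek", "nose", "oral", "jaw"}

PRECISE_BY_RESULT = {
    "greater":    {"eyebrow", "orbit", "cheek", "oral"},             # first result
    "consistent": {"frontal", "eyebrow", "orbit", "cheek", "oral"},  # second result
    "smaller":    {"eyebrow", "orbit", "oral", "jaw"},               # third result
}

def select_fusion_regions(comparison_result):
    precise = PRECISE_BY_RESULT[comparison_result]
    rough = ALL_REGIONS - precise
    return precise, rough

precise, rough = select_fusion_regions("smaller")  # third result
```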
In this embodiment, the comparison result is the first result, and the processing unit 12 is further configured to obtain an eyebrow feature of an eyebrow region, a mouth feature of an oral region, an eye feature of an eye socket region, and a cheek feature of a cheek region in the face image, where the eyebrow feature includes an eyebrow contour and eyebrow depth information, the mouth feature includes a mouth length and a lip contour, the eye feature includes an eyelid contour and an eye corner line, and the cheek feature includes a cheek contour and cheek depth information; and determining the expression state of the target face at the moment based on the eyebrow feature, the mouth feature, the eye feature and the cheek feature.
In this embodiment, the comparison result is the second result, and the processing unit 12 is further configured to obtain a forehead feature of a forehead region, an eyebrow feature of an eyebrow region, a mouth feature of an oral region, an eye feature of an eye socket region, and a cheek feature of a cheek region in the face image, where the forehead feature includes a forehead contour and forehead pattern information, the eyebrow feature includes an eyebrow contour and eyebrow depth information, the mouth feature includes a mouth length and a lip contour, the eye feature includes an eyelid contour and an eye corner pattern, and the cheek feature includes a cheek contour and cheek depth information; and determining the expression state of the target face at the moment based on the forehead feature, the eyebrow feature, the mouth feature, the eye feature and the cheek feature.
In this embodiment, the comparison result is the third result, and the processing unit 12 is further configured to obtain an eyebrow feature of an eyebrow region, a mouth feature of an oral region, an eye feature of an eye socket region, and a jaw feature of a jaw region in the face image, where the eyebrow feature includes an eyebrow contour and eyebrow depth information, the mouth feature includes a mouth length and a lip contour, the eye feature includes an eyelid contour and an eye corner texture, and the jaw feature includes a jaw contour and a jaw texture; and determining the expression state of the target face at the moment based on the eyebrow feature, the mouth feature, the eye feature and the jaw feature.
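Tying the three comparison results to the per-region features described above, a dispatch table such as the following sketch could be used; the extraction callables are hypothetical placeholders rather than routines specified by this application.

```python
# Illustrative sketch: which per-region features are gathered for each comparison
# result before the expression state is scored. The extraction callables are
# hypothetical placeholders, not routines specified by this application.
FEATURES_BY_RESULT = {
    "greater":    ("eyebrow", "mouth", "eye", "cheek"),
    "consistent": ("forehead", "eyebrow", "mouth", "eye", "cheek"),
    "smaller":    ("eyebrow", "mouth", "eye", "jaw"),
}

def gather_features(fused_face, comparison_result, extractors):
    """extractors maps a feature group name to a callable applied to the fused image."""
    return {group: extractors[group](fused_face)
            for group in FEATURES_BY_RESULT[comparison_result]}
```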
Referring to fig. 4, fig. 4 is a block diagram of an electronic device 20 according to an embodiment of the present disclosure.
In this embodiment, the electronic device 20 may be a server, such as a cloud server, a server cluster, or the like; but may also be a terminal, such as a personal computer, a smart phone, etc., without limitation.
Illustratively, the electronic device 20 may include: a communication module 22 connected to the outside world via a network, one or more processors 24 for executing program instructions, a bus 23, and memory 21 in different forms, such as a disk, a ROM, or a RAM, or any combination thereof. The memory 21, the communication module 22, and the processor 24 may be connected by the bus 23.
Illustratively, the memory 21 stores programs. The processor 24 may call and execute these programs from the memory 21, so that the expression recognition method based on local image data fusion can be realized by executing the programs.
The embodiment of the present application further provides a storage medium, where the storage medium includes a stored program, and when the program runs, the device on which the storage medium is located is controlled to execute the expression recognition method based on local image data fusion in the embodiment.
To sum up, the embodiments of the present application provide an expression recognition method and apparatus based on local image data fusion. A grayscale image and a depth image that contain a target face and are captured at the same moment are acquired; after the grayscale image is preprocessed, the image mouth length of the target face is determined, and the corresponding precise fusion region and rough fusion region are determined based on the comparison result between the image mouth length and the reference mouth length. Then, precision-differentiated image fusion is performed on the preprocessed grayscale image and the preprocessed depth image by using the precise fusion region and the rough fusion region, and expression recognition is performed on the fused face image to determine the expression of the target face at the moment. In this way, before the grayscale image and the depth image are fused, the image mouth length identified in the grayscale image serves as the basis for determining the precise fusion region and the rough fusion region. Different mouth lengths correspond to different groups of expressions, and the regions that matter most for telling these expressions apart also differ: when the mouth corners are stretched, the face may be smiling (happy) or crying (sad); a smile is usually accompanied by changes in the crow's feet and the cheek muscles, while crying produces obvious and relatively sharp changes in the eyebrows and eyelids. Therefore, using the image mouth length as the basis for determining the precise fusion region and the rough fusion region allows fine fusion to be performed on the most relevant local regions in a targeted manner while relatively irrelevant regions are fused only roughly, which greatly improves the accuracy of facial expression recognition; moreover, combining the grayscale image and the depth image forms a complementary effect, further improving the accuracy and stability of facial expression recognition.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. An expression recognition method based on local image data fusion is characterized by comprising the following steps:
acquiring a gray level image and a depth image containing a target face, wherein the gray level image and the depth image are obtained after shooting the target face at the same time;
preprocessing the gray level image, detecting the preprocessed gray level image, and determining the length of an image mouth of a target face;
comparing the image mouth length with a reference mouth length corresponding to the target face, and determining a comparison result, wherein the comparison result is any one of a first result, a second result and a third result, the first result indicates that the image mouth length is greater than the reference mouth length, the second result indicates that the image mouth length is consistent with the reference mouth length, and the third result indicates that the image mouth length is smaller than the reference mouth length;
determining a precise fusion area and a rough fusion area based on the comparison result, wherein the precise fusion area and the rough fusion area both belong to partial areas of the target face;
performing precision-differentiated image fusion on the preprocessed gray-scale image and the preprocessed depth image based on the precise fusion region and the rough fusion region;
and performing expression recognition on the fused face image to determine the expression of the target face at the moment.
2. The expression recognition method based on local image data fusion of claim 1, wherein preprocessing the grayscale image comprises:
determining a binocular central point connecting line of a target face in the gray level image, and carrying out angle adjustment on the gray level image based on the binocular central point connecting line so as to correct the target face;
determining the apex of the nose tip from the gray level image;
correspondingly, the expression recognition method based on local image data fusion further comprises the following steps: preprocessing the depth image, specifically comprising:
determining a binocular central point connecting line of a target face in the depth image, and carrying out angle adjustment on the depth image based on the binocular central point connecting line so as to align the target face;
and determining the apex of the nose tip from the depth image.
3. The expression recognition method based on local image data fusion of claim 2, wherein the image fusion of precision differentiation is performed on the preprocessed gray-scale image and the preprocessed depth image based on the precise fusion region and the rough fusion region, and the method comprises:
establishing a mapping relation between the center points of the eyes and the vertex of the nose tip of the target face in the preprocessed gray level image and the center points of the eyes and the vertex of the nose tip of the target face in the preprocessed depth image in a one-to-one correspondence manner;
performing feature extraction on the accurate fusion region of the target face in the preprocessed gray level image, and performing feature extraction on the accurate fusion region of the target face in the preprocessed depth image;
matching the image characteristic points extracted based on the gray level image with the depth characteristic points extracted based on the depth image, and establishing a mapping relation between the matched image characteristic points and the depth characteristic points to realize the registration of the gray level image and the depth image in a precise fusion area;
determining the outline of the rough fusion area of the target face in the preprocessed gray level image, and determining the outline of the rough fusion area of the target face in the preprocessed depth image;
and matching the image region contour determined based on the gray level image with the depth region contour determined based on the depth image, and establishing a mapping relation between the matched image region contour and the depth region contour to realize the registration of the gray level image and the depth image in the rough fusion region.
4. The facial expression recognition method based on local image data fusion of claim 3, wherein the target face comprises a frontal region, an eyebrow region, an orbit region, an ear region, a temporal region, a cheek region, a nose region, an oral region and a jaw region, and based on the comparison result, a precise fusion region and a rough fusion region are determined, comprising:
if the comparison result is the first result, determining that the eyebrow area, the orbit area, the cheek area and the mouth area are precise fusion areas, and the rest are rough fusion areas;
if the comparison result is the second result, determining that the forehead area, the eyebrow area, the orbit area, the cheek area and the mouth area are precise fusion areas, and the rest are rough fusion areas;
and if the comparison result is the third result, determining that the eyebrow area, the orbit area, the mouth area and the jaw area are accurate fusion areas, and the rest are rough fusion areas.
5. The expression recognition method based on local image data fusion of claim 4, wherein the comparison result is the first result, and the expression recognition is performed on the fused facial image, and comprises:
acquiring eyebrow features of an eyebrow region, mouth features of an oral region, eye features of an orbit region and cheek features of a cheek region in the face image, wherein the eyebrow features comprise eyebrow contours and eyebrow depth information, the mouth features comprise mouth length and lip contours, the eye features comprise eyelid contours and canthus lines, and the cheek features comprise cheek contours and cheek depth information;
and determining the expression state of the target face at the moment based on the eyebrow feature, the mouth feature, the eye feature and the cheek feature.
6. The expression recognition method based on local image data fusion of claim 4, wherein the comparison result is the second result, and the expression recognition is performed on the fused facial image, and comprises:
acquiring forehead features of a forehead region, eyebrow features of an eyebrow region, mouth features of an oral region, eye features of an orbit region and cheek features of a cheek region in the face image, wherein the forehead features comprise forehead contour and forehead pattern information, the eyebrow features comprise eyebrow contour and eyebrow depth information, the mouth features comprise mouth length and lip contour, the eye features comprise eyelid contour and canthus pattern, and the cheek features comprise cheek contour and cheek depth information;
and determining the expression state of the target face at the moment based on the forehead feature, the eyebrow feature, the mouth feature, the eye feature and the cheek feature.
7. The expression recognition method based on local image data fusion of claim 4, wherein the comparison result is the third result, and the expression recognition is performed on the fused facial image, and comprises:
acquiring eyebrow characteristics of an eyebrow region, mouth characteristics of an oral region, eye characteristics of an orbit region and jaw characteristics of a jaw region in the face image, wherein the eyebrow characteristics comprise eyebrow contours and eyebrow depth information, the mouth characteristics comprise mouth length and lip contours, the eye characteristics comprise eyelid contours and canthus grains, and the jaw characteristics comprise jaw contours and jaw grains;
and determining the expression state of the target face at the moment based on the eyebrow feature, the mouth feature, the eye feature and the jaw feature.
8. An expression recognition device based on local image data fusion is characterized by comprising:
the system comprises an image acquisition unit, a processing unit and a processing unit, wherein the image acquisition unit is used for acquiring a gray level image and a depth image which comprise a target face, and the gray level image and the depth image are obtained by shooting the target face based on the same moment;
the processing unit is used for preprocessing the gray level image, detecting the preprocessed gray level image and determining the length of an image mouth of the target face;
the processing unit is further configured to compare the image mouth length with a reference mouth length corresponding to the target face, and determine a comparison result, where the comparison result is any one of a first result, a second result, and a third result, where the first result indicates that the image mouth length is greater than the reference mouth length, the second result indicates that the image mouth length is consistent with the reference mouth length, and the third result indicates that the image mouth length is smaller than the reference mouth length;
the processing unit is further configured to determine a precise fusion region and a rough fusion region based on the comparison result, where the precise fusion region and the rough fusion region both belong to partial regions of a target face;
the processing unit is further configured to perform precision-differentiated image fusion on the preprocessed gray-scale image and the preprocessed depth image based on the precise fusion region and the rough fusion region;
and the processing unit is also used for carrying out expression recognition on the fused face image and determining the expression of the target face at the moment.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the expression recognition method based on local image data fusion according to any one of claims 1 to 7.
10. An electronic device, comprising a memory for storing information including program instructions and a processor for controlling execution of the program instructions, the program instructions being loaded and executed by the processor to implement the method for facial expression recognition based on local image data fusion according to any one of claims 1 to 7.
CN202111459474.9A 2021-12-01 2021-12-01 Expression recognition method and device based on local image data fusion Pending CN114120422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111459474.9A CN114120422A (en) 2021-12-01 2021-12-01 Expression recognition method and device based on local image data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111459474.9A CN114120422A (en) 2021-12-01 2021-12-01 Expression recognition method and device based on local image data fusion

Publications (1)

Publication Number Publication Date
CN114120422A true CN114120422A (en) 2022-03-01

Family

ID=80366228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111459474.9A Pending CN114120422A (en) 2021-12-01 2021-12-01 Expression recognition method and device based on local image data fusion

Country Status (1)

Country Link
CN (1) CN114120422A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115409953A (en) * 2022-11-02 2022-11-29 汉斯夫(杭州)医学科技有限公司 Multi-camera color consistency-based maxillofacial reconstruction method, equipment and medium
CN115409953B (en) * 2022-11-02 2023-03-28 汉斯夫(杭州)医学科技有限公司 Multi-camera color consistency-based maxillofacial reconstruction method, equipment and medium

Similar Documents

Publication Publication Date Title
TWI714225B (en) Method, device and electronic apparatus for fixation point judgment and computer storage medium thereof
EP3885965B1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN108229278B (en) Face image processing method and device and electronic equipment
Tian Evaluation of face resolution for expression analysis
TWI383325B (en) Face expressions identification
US8116517B2 (en) Action analysis apparatus
CN110287790B (en) Learning state hybrid analysis method oriented to static multi-user scene
CN105809144A (en) Gesture recognition system and method adopting action segmentation
JP2020047237A (en) Method for generating facial expression using data fusion
CN107798318A (en) The method and its device of a kind of happy micro- expression of robot identification face
CN101266647A (en) Eyelid detection apparatus, eyelid detection method and program therefor
JP2005242567A (en) Movement evaluation device and method
US6975763B2 (en) Shade component removing apparatus and shade component removing method for removing shade in image
CN113139439B (en) Online learning concentration evaluation method and device based on face recognition
CN109410138B (en) Method, device and system for modifying double chin
CN112883867A (en) Student online learning evaluation method and system based on image emotion analysis
JP7075237B2 (en) Operation judgment device and operation judgment method
CN114120422A (en) Expression recognition method and device based on local image data fusion
CN113344837B (en) Face image processing method and device, computer readable storage medium and terminal
JP3970573B2 (en) Facial image recognition apparatus and method
KR100815209B1 (en) The Apparatus and Method for Abstracting Peculiarity of Two-Dimensional Image ? The Apparatus and Method for Creating Three-Dimensional Image Using Them
CN113705466A (en) Human face facial feature occlusion detection method used for occlusion scene, especially under high-imitation occlusion
CN108399358B (en) Expression display method and system for video chat
JP2020107038A (en) Information processing apparatus, information processing method, and program
JP2767814B2 (en) Face image detection method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination