CN110543813B - Scene-based face portrait and gaze counting method and system - Google Patents

Scene-based face portrait and gaze counting method and system

Info

Publication number
CN110543813B
CN110543813B (application CN201910660630.4A)
Authority
CN
China
Prior art keywords
face
specified object
image
person
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910660630.4A
Other languages
Chinese (zh)
Other versions
CN110543813A (en)
Inventor
杨志明 (Yang Zhiming)
The other inventors have requested that their names not be disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Original Assignee
Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ideepwise Artificial Intelligence Robot Technology Beijing Co., Ltd.
Priority to CN201910660630.4A
Publication of CN110543813A
Application granted
Publication of CN110543813B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06N 3/045 — Computing arrangements based on biological models; neural networks; architectures, e.g. interconnection topology; combinations of networks
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06T 7/80 — Image analysis; analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06V 40/161 — Recognition of human faces, e.g. facial parts, sketches or expressions; detection; localisation; normalisation
    • G06V 40/166 — Recognition of human faces; detection, localisation, normalisation using acquisition arrangements
    • G06V 40/172 — Recognition of human faces; classification, e.g. identification

Abstract

The invention provides a scene-based face portrait and gaze counting method and system. The scene-based face portrait method comprises the following steps: acquiring an image captured by a camera placed at a specified object; preprocessing the image to obtain data matching the input of a pre-established deep learning model; and feeding the data into the pre-trained deep learning model, which simultaneously outputs the face positions, the positions of several face key points, and the social-attribute classification results of the persons in the image. The scene-based gaze counting method builds on the deep learning model to count gaze dwell time and thereby obtain each person's degree of interest in the specified object. For people within a certain range of an object in a real scene, the method can effectively determine each person's interest in the object together with social attributes of the person such as age and gender, and by combining user portraits with interest levels measured in the physical scene it provides an objective and effective data basis for subsequent information recommendation.

Description

Scene-based face portrait and gaze counting method and system
Technical Field
The invention relates to the field of image processing, and in particular to a scene-based face portrait and gaze counting method and system.
Background
A customer portrait, i.e. a customer model, is a characterization of a customer built from labels derived from collected customer information, including the customer's basic social attributes, living habits, consumption level, and so on. Customer portraits are widely used in many fields of real life, for example in precision marketing: by analyzing behavioural data, an enterprise can form effective insights about its users and predict their future behaviour and decisions from their past behaviour.
Because real scenes are highly diverse, gaze counting is easily affected by factors such as illumination, viewing angle, distance, face occlusion and image motion blur, which makes gaze counting in real scenes a significant research problem. In addition, because customer portrait data have scene-specific requirements, such data are currently obtained mostly from the internet and mobile terminals, and obtaining user portraits in physical scenes remains an unsolved problem.
Disclosure of Invention
The present invention aims to overcome the above technical drawbacks and provides a method for portraying human faces in a real scene. Using deep learning, image understanding and three-dimensional modeling techniques, the method analyzes, in the actual scene, how long a person's gaze dwells on a real object together with the portrait of that person: for a given specified object, the gaze dwell time of each person on the object is computed by gaze counting, and the degree of interest in the specified object is then estimated for people with different customer portraits, so that both a person's interest in the object and social attributes such as age and gender can be analyzed.
To achieve the above object, the present invention provides a scene-based face portrait method, the method comprising:
acquiring an image of a specified object shot by a camera;
preprocessing the image to obtain data adaptive to the input of a pre-established deep learning model;
and inputting the data into a pre-trained deep learning model, and simultaneously outputting the face position, the positions of a plurality of face key points and the social attribute classification result of the person.
As an improvement of the above method, the building and training step of the deep learning model specifically includes:
establishing a deep learning model, wherein the deep learning model comprises four cascaded convolutional neural networks: the first stage of convolutional neural network, the second stage of convolutional neural network, the third stage of convolutional neural network and the fourth stage of convolutional neural network; the fourth-stage convolutional neural network is a multitask neural network;
and training the deep learning model by a deep learning training method based on a large amount of data acquired in a real scene to obtain the trained deep learning model.
As an improvement of the above method, the inputting data into a pre-trained deep learning model, and outputting face positions, positions of a plurality of face key points, and social attribute classification results of people specifically includes:
inputting data into a first-stage convolutional neural network and outputting a plurality of candidate regions;
inputting the candidate regions into a second-stage convolutional neural network, screening non-face candidate regions from the candidate regions, and outputting the remaining candidate regions;
inputting the remaining candidate regions into a third-level convolutional neural network, and screening out non-face candidate regions from the remaining candidate regions again to obtain a final candidate region and a plurality of face key points;
and inputting the final candidate region and the plurality of face key points into a fourth-level convolutional neural network, determining the face region, and simultaneously outputting face positions, the positions of the plurality of face key points and social attribute classification results of people.
As an improvement of the above method, the face key points include: two eyes, nose, two mouth corner edge points, and cheekbones; the number of the face key points is not less than 4.
As an improvement of the above method, the social attributes are age and gender.
The invention also provides a scene-based face portrait system, which comprises:
the image preprocessing module is used for preprocessing the image acquired by the camera to obtain data matched with the input of the pre-established deep learning model;
the candidate frame extraction module is used for inputting data into a first-stage convolutional neural network to generate a plurality of candidate areas;
the face preliminary screening module is used for inputting the candidate areas into a second-level convolutional neural network, screening most of the non-face candidate areas from the candidate areas and outputting the rest candidate areas;
the face fine screening module is used for inputting the remaining candidate regions into a third-level convolutional neural network, and screening out non-face candidate regions from the remaining candidate regions again to obtain a final candidate region and a plurality of face key points; and
and the classification module is used for inputting the final candidate region and the plurality of face key points into a fourth-level convolutional neural network, determining a face region, and simultaneously outputting face positions, the positions of the plurality of face key points and a classification result of the face region.
The invention provides a scene-based gaze counting method, which is implemented based on the above scene-based face portrait method and comprises the following steps:
acquiring an image of a specified object shot by a camera;
judging from the image, based on the deep learning model, whether the person's gaze point is on the specified object;
and counting the stay time of the gaze on the specified object, thereby realizing gaze counting and obtaining the degree of interest of the person on the specified object.
As an improvement of the above method, when the camera is a color camera, determining from the image, based on the deep learning model, whether the person's gaze point is on the specified object specifically includes:
step 101) obtaining, through the deep learning model, the face position in the image and the coordinates of several face key points within that face position;
step 102) establishing a face plane coordinate system with the midpoint between the person's two eyes as the origin, the direction towards the right eye as the positive x axis and the direction towards the nose as the negative y axis, and then obtaining, through statistics over real-scene data, the average positions of the face key points in this face plane coordinate system;
step 103) calibrating the color camera with an offline calibration toolkit to obtain the camera intrinsic matrix K;
step 104) computing the projective mapping from the face plane in the world coordinate system to the face plane in the camera coordinate system, using the average key-point positions of the face in the real-world coordinate system and the corresponding face key-point coordinates in the camera coordinate system, to obtain a homography matrix H;
step 105) obtaining, from the inter-plane homography matrix and the camera intrinsic matrix K, the rotation matrix R and translation matrix T of the face plane relative to the plane of the specified object in reality, which gives the face orientation; the normal direction through the midpoint of the two eyes is taken as the gaze direction of the human eyes;
step 106) computing the intersection point of the gaze direction with the specified object, and determining whether the person is looking at the specified object by judging whether the intersection point lies within the specified object (the homography-to-pose relation used in steps 104 and 105 is restated in standard notation below).
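For reference, and not spelled out in the original text, the relation implied by steps 104 and 105 can be written in standard notation. Taking the face plane as $Z = 0$ in its own coordinate frame, a key point $(X, Y)$ projects to the image point $\mathbf{m}$ according to

$$ s\,\mathbf{m} = K\,[\,\mathbf{r}_1 \;\; \mathbf{r}_2 \;\; \mathbf{t}\,]\,(X,\,Y,\,1)^{\top}, \qquad H \simeq K\,[\,\mathbf{r}_1 \;\; \mathbf{r}_2 \;\; \mathbf{t}\,], $$

so normalizing the first column of $K^{-1}H$ to unit length yields $\mathbf{r}_1$, $\mathbf{r}_2$ and $\mathbf{t}$, and $\mathbf{r}_3 = \mathbf{r}_1 \times \mathbf{r}_2$ completes the rotation matrix R that fixes the face orientation.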
As an improvement of the above method, when the camera is a depth camera, determining from the image whether the person's gaze point is on the specified object specifically includes:
step 201) obtaining, through the deep learning model, the face position in the image and the positions of the centers of the eyes and the mouth;
step 202) converting this information to the corresponding points in the real scene, i.e. the spatial positions of each face's eye and mouth center points relative to the depth camera; if any center point still holds the original initialization value 0, go to step 203); otherwise, go to step 204);
step 203) filling the missing value in from surrounding neighbourhood pixels; if it is still 0 after filling, it is determined that the depth camera has not found the face position;
step 204) calculating the face plane and its normal direction from the spatial coordinates of the eye and mouth center points, and then taking the normal direction through the midpoint of the two eyes as the face orientation;
step 205) computing the intersection point of the face orientation with the specified object, and determining whether the person is looking at the specified object by judging whether the intersection point lies within the specified object.
As an improvement of the above method, counting the gaze dwell time on the specified object comprises: counting the number of video frames in which the gaze stays on the specified object, and dividing that number by the frame rate to obtain the dwell time on the specified object.
As an improvement of the above method, the method further comprises: associating the person's degree of interest in the specified object with the person's social-attribute classification result, and recommending related information in the person's areas of interest.
The invention provides a scene-based gaze counting system, comprising:
the image acquisition module is used for acquiring an image of a specified object shot by the camera;
the judging module is used for judging whether the attention point of the person is on the specified object or not from the image;
and the gaze counting module is used for counting the stay time of gaze on the specified object, thereby realizing gaze counting and obtaining the degree of interest of people on the specified object.
As an improvement of the above system, the system further comprises: and the association module is used for associating the degree of interest of the person in the specified object with the classification result of the social attribute of the person and recommending related information in the region of interest of the person.
The invention has the beneficial effects that:
1. For people within a certain range of an object in a real scene, the method can effectively determine each person's degree of interest in the object together with the person's social attributes such as age and gender, filling the current gap in combining customer portraits with interest levels in physical scenes;
2. With a camera placed beside the specified object in the actual scene, the method can count the gaze dwell time of each person in view on the object together with that person's attributes, so as to analyze how interested people with different attributes are in the object, providing an objective and effective data basis for subsequent information recommendation.
Drawings
FIG. 1 is a flow chart of the scene-based face portrait method of the present invention;
FIG. 2 is a flow chart of the present invention for determining whether a person is looking at a specified object based on a color camera;
FIG. 3 is a flow chart of the present invention for determining whether a person is looking at a specified object based on a depth camera.
Detailed Description
The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand and will define the scope of protection of the invention more clearly.
The gaze counting and face portrait analysis technique uses deep learning, image understanding, three-dimensional modeling and related techniques. A camera is placed beside or behind a specified object, and the system judges whether the gaze of each person seen by the camera is on the object; when the camera captures several people, each person is judged separately. Whenever the gaze is judged to be on the object, the dwell time of the gaze on the object is counted and used as the measure of that person's interest in the object. At the same time, deep learning is used to analyze each person's portrait attributes, which are combined with the gaze count so as to obtain statistics on the social attributes, such as age and gender, of the people interested in the object.
Example 1
Training is carried out, by deep learning, on a large amount of data collected in real scenes. Face position detection, face key-point detection, and gender and age classification are integrated into one multi-task deep learning model for prediction, combining face detection with key-point detection. Given an input image of an actual scene containing faces, the model outputs the position of each face box in the image, the coordinates of several face key points, and the gender and age classification results. The face key points include the two eyes, the nose, the two mouth-corner points, the cheekbones, and the like; the number of face key points is at least 4.
As shown in FIG. 1, Embodiment 1 of the present invention provides a scene-based face portrait method, where the method includes:
acquiring an image of a specified object shot by a camera, and preprocessing the image to obtain a data format matched with the input of a deep learning model;
the deep learning model is 4 cascaded convolutional neural networks, can detect and position the face from coarse to fine, and simultaneously obtains the positions of a plurality of key points of the face and classifies the gender and the age of the person. The algorithm consists of four phases:
Step S1) uses a shallow CNN (convolutional neural network) to generate a large number of candidate regions quickly;
Step S2) uses a more complex CNN to reject a large number of non-face candidate regions;
Step S3) uses an even more complex CNN to screen out the small number of remaining non-face candidate regions and outputs several key points at the same time;
Step S4) uses a still more powerful CNN to determine the final face regions and simultaneously outputs, for each face region, the positions of several facial key points and the gender and age classification.
Each of the above convolutional neural networks may itself be a single convolutional neural network or a cascade of convolutional neural networks.
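By way of illustration only, the following sketch shows how the four cascaded stages could be chained at inference time. The stage networks and the helper functions (non-maximum suppression, cropping/resizing) are hypothetical placeholders supplied by the caller, and the thresholds and crop sizes are arbitrary example values; none of these are fixed by the patent.

```python
import numpy as np

def cascade_inference(image, stage1, stage2, stage3, stage4, crop_resize, nms):
    """Sketch of the coarse-to-fine cascade of Embodiment 1.

    stage1..stage4, crop_resize and nms are caller-supplied callables
    standing in for the four CNNs and their helpers (hypothetical
    placeholders); the thresholds below are example values only.
    """
    # Stage S1: a shallow CNN quickly proposes many candidate regions.
    boxes, scores = stage1(image)                  # boxes: (N, 4) array
    boxes = nms(boxes, scores, iou_threshold=0.7)

    # Stage S2: a more complex CNN rejects most non-face candidates.
    scores = stage2(crop_resize(image, boxes, size=24))
    boxes = boxes[scores > 0.6]

    # Stage S3: an even more complex CNN filters the remaining candidates
    # again and outputs preliminary face key points.
    scores, keypoints = stage3(crop_resize(image, boxes, size=48))
    keep = scores > 0.7
    boxes, keypoints = boxes[keep], keypoints[keep]

    # Stage S4: a multi-task CNN confirms the final face regions and jointly
    # outputs the face box, refined key points, and gender/age classes.
    results = []
    for box, kps in zip(boxes, keypoints):
        face_box, kps_refined, gender, age = stage4(
            crop_resize(image, box[np.newaxis, :], size=96), kps)
        results.append({"box": face_box, "keypoints": kps_refined,
                        "gender": gender, "age": age})
    return results
```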
Example 2
Embodiment 2 of the present invention provides a scene-based face portrait system, the system including:
the image preprocessing module is used for preprocessing the image acquired by the camera to obtain data matched with the input of the pre-established deep learning model;
the candidate frame extraction module is used for inputting data into a first-stage convolutional neural network to generate a plurality of candidate areas;
the face preliminary screening module is used for inputting the candidate areas into a second-level convolutional neural network, screening most of the non-face candidate areas from the candidate areas and outputting the rest candidate areas;
the face fine screening module is used for inputting the remaining candidate regions into a third-level convolutional neural network, and screening out non-face candidate regions from the remaining candidate regions again to obtain a final candidate region and a plurality of face key points; and
and the classification module is used for inputting the final candidate region and the plurality of face key points into a fourth-level convolutional neural network, determining a face region, and simultaneously outputting face positions, the positions of the plurality of face key points and a classification result of the face region.
Example 3
Embodiment 3 of the present invention provides a scene-based gaze counting method: a camera is placed beside a specified object, computer-vision techniques are used to determine whether a person's gaze point is on the object, and the time the gaze dwells on the object is counted to realize gaze counting. Gaze counting mainly involves the following steps:
step 1) acquiring image data through a camera;
the camera is a color camera or a depth camera;
step 2) judging whether a person looks at a specified object or not through image data;
as shown in figure 2 of the drawings, in which,
the method comprises the steps of judging whether a person looks at the billboard or not by taking the billboard in the elevator as an example according to the image data of the color camera
Step 2-1) obtaining, through the deep learning model, the face box coordinates in the image together with the image coordinates of several face key points (eyes, nose and other parts);
Step 2-2) establishing a face plane coordinate system with the midpoint between the two eyes as the origin, the direction towards the right eye as the positive x axis and the direction towards the nose as the negative y axis, and obtaining the average positions of the face key points (eyes, nose and so on) in this face plane coordinate system from statistics over a large amount of real-scene data;
Step 2-3) calibrating the color camera with an offline calibration toolkit to obtain the camera intrinsic matrix K;
Step 2-4) computing the projective mapping from the face plane in the world coordinate system to the face plane in the camera coordinate system, using the average key-point positions of the face in the real-world coordinate system and the corresponding key-point coordinates in the camera coordinate system (i.e. in the image), to obtain the homography matrix H;
Step 2-5) obtaining, from the inter-plane homography matrix and the camera intrinsic matrix, the rotation matrix R and translation matrix T of the face plane relative to the plane of the specified object in reality, which gives the face orientation; the normal direction through the midpoint of the two eyes is taken as the gaze direction of the human eyes;
Step 2-6) computing the intersection point of the gaze direction with the specified object, and determining whether the person is looking at the specified object by judging whether the intersection point lies within the specified object.
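The following numerical sketch, for illustration only, corresponds to steps 2-2 through 2-6 above: the planar face pose is recovered from the key-point homography and the intrinsic matrix K, and the gaze ray through the eye midpoint is intersected with the billboard plane. The average key-point layout, the billboard geometry and the sign conventions are assumptions chosen for the example, not values fixed by the patent.

```python
import numpy as np
import cv2

def gaze_hits_billboard(img_pts, plane_pts, K, board_center, board_u, board_v,
                        half_w, half_h):
    """Sketch of steps 2-2 .. 2-6 (example values and conventions assumed).

    img_pts   : (N, 2) detected key points in the image, in pixels
    plane_pts : (N, 2) average key-point layout in the face-plane frame
                (origin at the eye midpoint), same point order as img_pts
    K         : (3, 3) camera intrinsic matrix from offline calibration
    board_*   : billboard centre, in-plane unit axes and half-extents,
                expressed in the camera coordinate system (assumed known)
    """
    board_center, board_u, board_v = map(np.asarray,
                                         (board_center, board_u, board_v))

    # Step 2-4: homography mapping the planar face model to the image.
    H, _ = cv2.findHomography(np.asarray(plane_pts, dtype=np.float32),
                              np.asarray(img_pts, dtype=np.float32))

    # Step 2-5: closed-form planar pose, using H ~ K [r1 r2 t].
    M = np.linalg.inv(K) @ H
    lam = 1.0 / np.linalg.norm(M[:, 0])
    if lam * M[2, 2] < 0:          # keep the face in front of the camera
        lam = -lam
    r1, r2, t = lam * M[:, 0], lam * M[:, 1], lam * M[:, 2]
    R = np.column_stack([r1, r2, np.cross(r1, r2)])
    U, _, Vt = np.linalg.svd(R)    # re-orthonormalize against noise
    R = U @ Vt

    gaze_origin = t                # face-frame origin = eye midpoint
    gaze_dir = R[:, 2]             # normal of the face plane
    if gaze_dir[2] > 0:            # assumed convention: the person looks back
        gaze_dir = -gaze_dir       # towards the camera/billboard side

    # Step 2-6: intersect the gaze ray with the billboard plane and check
    # whether the hit point lies inside the billboard rectangle.
    n = np.cross(board_u, board_v)
    denom = gaze_dir @ n
    if abs(denom) < 1e-9:
        return False               # gaze parallel to the billboard plane
    s = ((board_center - gaze_origin) @ n) / denom
    if s <= 0:
        return False               # billboard lies behind the person
    hit = gaze_origin + s * gaze_dir
    d = hit - board_center
    return abs(d @ board_u) <= half_w and abs(d @ board_v) <= half_h
```

In this sketch the billboard is described by its centre, two in-plane unit axes and half-width/half-height in camera coordinates; in a deployment these values would come from measuring the installation, and `plane_pts` from the real-scene key-point statistics of step 2-2.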
As shown in FIG. 3, for depth-camera image data, again taking the billboard in the elevator as an example, judging whether a person is looking at the billboard comprises the following steps (an illustrative sketch follows the steps):
Step 201) locating, through the deep learning model, the face box in the color image and the positions of the centers of the eyes and the mouth;
Step 202) converting this information to the corresponding points in the actual scene, i.e. the spatial positions of each face's eye and mouth centers relative to the depth camera; if any center point still holds the original initialization value 0, go to step 203); otherwise go to step 204);
Step 203) filling the missing value in from surrounding neighbourhood pixels; if it is still 0 after filling, it is determined that the depth camera has not found the face position;
Step 204) modeling the face orientation from the obtained eye and mouth information: computing the face plane and its normal from the spatial coordinates of the eyes and the mouth, and taking the normal direction through the midpoint of the two eyes as the face orientation;
Step 205) computing the intersection point of the face orientation with the plane of the billboard, and determining whether the person is looking at the billboard by judging whether the intersection point lies within the billboard.
In particular, when several faces appear in the picture, the same processing is performed for each face, distinguished by its ID information.
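A hedged sketch of steps 203-205 follows. The neighbourhood-filling radius, the billboard description (centre, in-plane unit axes, half-extents in depth-camera coordinates) and the sign convention for the face normal are assumptions made for illustration; when several faces are present, the same functions are simply called once per face ID.

```python
import numpy as np

def fill_missing_depth(depth_map, x, y, radius=3):
    """Step 203 (sketch): replace a zero depth reading at pixel (x, y) with
    the mean of the non-zero depths in a small neighbourhood; returns 0.0
    if nothing valid is found (face position treated as not found)."""
    h, w = depth_map.shape
    patch = depth_map[max(0, y - radius):min(h, y + radius + 1),
                      max(0, x - radius):min(w, x + radius + 1)]
    valid = patch[patch > 0]
    return float(valid.mean()) if valid.size else 0.0

def face_looks_at_billboard(left_eye, right_eye, mouth, board_center,
                            board_u, board_v, half_w, half_h):
    """Steps 204-205 (sketch): face plane and normal from the 3-D eye and
    mouth centre points, then a ray/plane intersection test against the
    billboard. All points are 3-D vectors in depth-camera coordinates."""
    left_eye, right_eye, mouth, board_center, board_u, board_v = map(
        np.asarray, (left_eye, right_eye, mouth, board_center, board_u, board_v))

    eye_mid = 0.5 * (left_eye + right_eye)
    normal = np.cross(right_eye - left_eye, mouth - eye_mid)
    normal = normal / np.linalg.norm(normal)
    if normal[2] > 0:              # assumed convention: face normal points
        normal = -normal           # back towards the camera/billboard side

    n = np.cross(board_u, board_v)
    denom = normal @ n
    if abs(denom) < 1e-9:
        return False               # face plane parallel to the billboard
    s = ((board_center - eye_mid) @ n) / denom
    if s <= 0:
        return False               # billboard behind the person
    hit = eye_mid + s * normal
    d = hit - board_center
    return abs(d @ board_u) <= half_w and abs(d @ board_v) <= half_h
```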
Step 3) counting the dwell time of each person's gaze on the specified object, thereby determining the person's degree of interest in the object;
For each person, the number of video frames in which the gaze stays on the specified object is counted and divided by the frame rate to obtain the dwell time on the specified object, which serves as the basis for the person's degree of interest in the object (see the sketch below).
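As a small illustration of this counting, assuming each video frame yields a mapping from face ID to whether that person's gaze was judged to be on the object (a hypothetical data layout, not prescribed by the patent):

```python
from collections import defaultdict

def dwell_times(frames, fps):
    """frames: iterable of {face_id: bool} dictionaries, one per video frame,
    where the bool says whether that person's gaze was on the object.
    Returns the dwell time in seconds for each face ID.

    Example: dwell_times(per_frame_results, fps=25.0)
    """
    counts = defaultdict(int)
    for frame in frames:
        for face_id, on_object in frame.items():
            if on_object:
                counts[face_id] += 1
    return {face_id: n / fps for face_id, n in counts.items()}
```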
Step 4) associating the person's degree of interest in the specified object with the person's social-attribute classification result, and recommending related information in the person's areas of interest.
Example 4
Embodiment 4 of the present invention provides a scene-based gaze counting system, including:
the image acquisition module is used for acquiring an image of a specified object shot by the camera;
the judging module is used for judging whether the attention point of the person is on the specified object or not from the image;
the gaze counting module is used for counting the stay time of gaze on the specified object, so that gaze counting is realized, and the degree of interest of people on the specified object is obtained;
and the association module is used for associating the degree of interest of the person in the specified object with the classification result of the social attribute of the person and recommending related information in the region of interest of the person.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A scene-based gaze counting method, implemented according to a scene-based face portrait method,
wherein the scene-based face portrait method comprises the following steps:
acquiring an image of a specified object shot by a camera;
preprocessing the image to obtain data adaptive to the input of a pre-established deep learning model;
inputting data into a pre-trained deep learning model, and simultaneously outputting face positions, positions of a plurality of face key points and social attribute classification results of people;
the eye count method comprises the following steps:
acquiring an image of a specified object shot by a camera;
judging whether the eye attention point of the person is on a specified object from the image based on the deep learning model;
counting the stay time of the gaze on the specified object, thereby realizing gaze counting and obtaining the degree of interest of the person on the specified object;
when the camera is a depth camera, judging from the image whether the person's gaze point is on the specified object specifically comprises the following steps:
step 201) obtaining, through the deep learning model, the face position in the image and the positions of the centers of the eyes and the mouth;
step 202) acquiring the positions of the corresponding points in the real scene through information conversion, namely the spatial positions of each face's eye and mouth center points relative to the depth camera; if any center point still holds the original initialization value 0, going to step 203); otherwise, going to step 204);
step 203) filling the value in from surrounding neighbourhood pixels; if it is still 0 after filling, determining that the depth camera has not found the face position;
step 204) calculating a face plane and normal orientation according to the space coordinates of the center points of the eyes and the mouth of the person, and then setting the normal orientation passing through the centers of the eyes as the face orientation;
step 205) calculates the intersection point of the face orientation and the specified object, and determines whether the human eyes are looking at the specified object by judging whether the intersection point is in the specified object.
2. The scene-based gaze counting method of claim 1, wherein the steps of building and training the deep learning model specifically comprise:
establishing a deep learning model, wherein the deep learning model comprises four cascaded convolutional neural networks: the first stage of convolutional neural network, the second stage of convolutional neural network, the third stage of convolutional neural network and the fourth stage of convolutional neural network; the fourth-stage convolutional neural network is a multitask neural network;
and training the deep learning model by a deep learning training method based on a large amount of data acquired in a real scene to obtain the trained deep learning model.
3. The scene-based gaze counting method of claim 2, wherein the inputting of data into a pre-trained deep learning model and the outputting of face positions, positions of a plurality of face key points and social attribute classification results comprises:
inputting data into a first-stage convolutional neural network and outputting a plurality of candidate regions;
inputting the candidate regions into a second-stage convolutional neural network, screening non-face candidate regions from the candidate regions, and outputting the remaining candidate regions;
inputting the remaining candidate regions into a third-level convolutional neural network, and screening out non-face candidate regions from the remaining candidate regions again to obtain a final candidate region and a plurality of face key points;
and inputting the final candidate region and the plurality of face key points into a fourth-level convolutional neural network, determining the face region, and simultaneously outputting face positions, the positions of the plurality of face key points and social attribute classification results of people.
4. The scene-based gaze counting method of one of claims 1 to 3, wherein the face key points comprise: two eyes, nose, two mouth corner edge points, and cheekbones; the number of the face key points is not less than 4.
5. The scene-based gaze counting method of claim 1, wherein the social attributes are age and gender.
6. The scene-based gaze counting method according to claim 1, wherein counting the dwell time of the gaze on the specified object comprises: counting the number of video frames in which the gaze stays on the specified object, and dividing that number by the frame rate to obtain the dwell time on the specified object.
7. The scene-based gaze counting method of claim 1, further comprising: and associating the degree of interest of the person in the specified object with the classification result of the social attribute of the person, and recommending related information in the region of interest of the person.
8. A scene-based gaze counting system, the system comprising:
the image acquisition module is used for acquiring an image of a specified object shot by the camera;
the judging module is used for judging whether the attention point of the person is on the specified object or not from the image;
the gaze counting module is used for counting the stay time of gaze on the specified object, so that gaze counting is realized, and the degree of interest of people on the specified object is obtained;
when the camera is a depth camera, judging from the image whether the person's gaze point is on the specified object specifically comprises the following steps:
step 201) obtaining, through the deep learning model, the face position in the image and the positions of the centers of the eyes and the mouth;
step 202) acquiring the positions of the corresponding points in the real scene through information conversion, namely the spatial positions of each face's eye and mouth center points relative to the depth camera; if any center point still holds the original initialization value 0, going to step 203); otherwise, going to step 204);
step 203) filling the value in from surrounding neighbourhood pixels; if it is still 0 after filling, determining that the depth camera has not found the face position;
step 204) calculating a face plane and normal orientation according to the space coordinates of the center points of the eyes and the mouth of the person, and then setting the normal orientation passing through the centers of the eyes as the face orientation;
step 205) calculating the intersection point of the face orientation and the specified object, and determining whether the human eyes see the specified object by judging whether the intersection point is in the specified object;
the face image method based on the scene comprises the following steps:
acquiring an image of a specified object shot by a camera;
preprocessing the image to obtain data adaptive to the input of a pre-established deep learning model;
and inputting the data into a pre-trained deep learning model, and simultaneously outputting the face position, the positions of a plurality of face key points and the social attribute classification result of the person.
9. The scene-based gaze counting system of claim 8, further comprising: and the association module is used for associating the degree of interest of the person in the specified object with the classification result of the social attribute of the person and recommending related information in the region of interest of the person.
CN201910660630.4A 2019-07-22 2019-07-22 Scene-based face portrait and gaze counting method and system Active CN110543813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910660630.4A CN110543813B (en) 2019-07-22 2019-07-22 Scene-based face portrait and gaze counting method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910660630.4A CN110543813B (en) 2019-07-22 2019-07-22 Scene-based face portrait and gaze counting method and system

Publications (2)

Publication Number Publication Date
CN110543813A CN110543813A (en) 2019-12-06
CN110543813B (en) 2022-03-15

Family

ID=68710273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910660630.4A Active CN110543813B (en) 2019-07-22 2019-07-22 Scene-based face portrait and gaze counting method and system

Country Status (1)

Country Link
CN (1) CN110543813B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488003A (en) * 2020-12-03 2021-03-12 深圳市捷顺科技实业股份有限公司 Face detection method, model creation method, device, equipment and medium
CN112540676B (en) * 2020-12-15 2021-06-18 广州舒勇五金制品有限公司 Projection system-based variable information display device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105094337A (en) * 2015-08-19 2015-11-25 华南理工大学 Three-dimensional gaze estimation method based on irises and pupils
CN105912990A (en) * 2016-04-05 2016-08-31 深圳先进技术研究院 Face detection method and face detection device
CN108805910A (en) * 2018-06-01 2018-11-13 海信集团有限公司 More mesh Train-borne recorders, object detection method, intelligent driving system and automobile
CN109492514A (en) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 A kind of method and system in one camera acquisition human eye sight direction
CN109874054A (en) * 2019-02-14 2019-06-11 深兰科技(上海)有限公司 A kind of advertisement recommended method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101383055B (en) * 2008-09-18 2010-09-29 北京中星微电子有限公司 Three-dimensional human face constructing method and system
CN107193383B (en) * 2017-06-13 2020-04-07 华南师范大学 Secondary sight tracking method based on face orientation constraint
CN109753857A (en) * 2017-11-07 2019-05-14 北京虹图吉安科技有限公司 A kind of three-dimensional face identification apparatus and system based on photometric stereo visual imaging
CN109446892B (en) * 2018-09-14 2023-03-24 杭州宇泛智能科技有限公司 Human eye attention positioning method and system based on deep neural network
CN109902630B (en) * 2019-03-01 2022-12-13 上海像我信息科技有限公司 Attention judging method, device, system, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105094337A (en) * 2015-08-19 2015-11-25 华南理工大学 Three-dimensional gaze estimation method based on irises and pupils
CN105912990A (en) * 2016-04-05 2016-08-31 深圳先进技术研究院 Face detection method and face detection device
CN108805910A (en) * 2018-06-01 2018-11-13 海信集团有限公司 More mesh Train-borne recorders, object detection method, intelligent driving system and automobile
CN109492514A (en) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 A kind of method and system in one camera acquisition human eye sight direction
CN109874054A (en) * 2019-02-14 2019-06-11 深兰科技(上海)有限公司 A kind of advertisement recommended method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on an Active Visual Localization Method for Indoor Mobile Robots Based on Point Features; Tian Rongtang; China Master's Theses Full-text Database, Information Science and Technology; 2017-05-15 (No. 05); full text *

Also Published As

Publication number Publication date
CN110543813A (en) 2019-12-06

Similar Documents

Publication Publication Date Title
Lebreton et al. GBVS360, BMS360, ProSal: Extending existing saliency prediction models from 2D to omnidirectional images
CN112418095B (en) Facial expression recognition method and system combined with attention mechanism
CN107545302B (en) Eye direction calculation method for combination of left eye image and right eye image of human eye
CN105472434B (en) It is implanted into method and system of the content into video display
CN110807364B (en) Modeling and capturing method and system for three-dimensional face and eyeball motion
EP3814985A1 (en) Video background subtraction using depth
WO2022156640A1 (en) Gaze correction method and apparatus for image, electronic device, computer-readable storage medium, and computer program product
CN109948447B (en) Character network relation discovery and evolution presentation method based on video image recognition
CN110210276A (en) A kind of motion track acquisition methods and its equipment, storage medium, terminal
CN105005777A (en) Face-based audio and video recommendation method and face-based audio and video recommendation system
CN106599800A (en) Face micro-expression recognition method based on deep learning
WO2014187223A1 (en) Method and apparatus for identifying facial features
CN110741377A (en) Face image processing method and device, storage medium and electronic equipment
CN111008971B (en) Aesthetic quality evaluation method of group photo image and real-time shooting guidance system
CN111507592A (en) Evaluation method for active modification behaviors of prisoners
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
CN110163567A (en) Classroom roll calling system based on multitask concatenated convolutional neural network
CN110543813B (en) Face image and gaze counting method and system based on scene
CN113747640A (en) Intelligent central control method and system for digital exhibition hall lamplight
Kim A personal identity annotation overlay system using a wearable computer for augmented reality
JP2021026744A (en) Information processing device, image recognition method, and learning model generation method
RU2768797C1 (en) Method and system for determining synthetically modified face images on video
CN115862128A (en) Human body skeleton-based customer abnormal behavior identification method
Bonetto et al. Image processing issues in a social assistive system for the blind
Ando et al. A low complexity algorithm for eye detection and tracking in energy-constrained applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant