CN114387679A - System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network - Google Patents

System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network

Info

Publication number
CN114387679A
Authority
CN
China
Prior art keywords
sight
sight line
feature
head
camera
Prior art date
Legal status
Pending
Application number
CN202210040206.1A
Other languages
Chinese (zh)
Inventor
杨傲雷 (Yang Aolei)
郭帅 (Guo Shuai)
徐昱琳 (Xu Yulin)
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202210040206.1A
Publication of CN114387679A


Classifications

    • G06F 18/253: Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N 3/044: Computing arrangements based on biological models; Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Computing arrangements based on biological models; Neural networks; Combinations of networks

Abstract

The invention discloses a system for realizing sight line estimation and attention analysis based on a recurrent convolutional neural network, comprising a sight line feature extraction module, a sight line regression module, a sight line drop point mapping module, and an attention visualization and analysis module. The method extracts binocular appearance features and head pose features simultaneously and fuses them in the spatial domain; for the sight line features of consecutive frames, it jointly encodes the temporal features of the gazing behavior through a Bi-LSTM network layer to complete time-domain feature fusion, and then regresses the sight line vector of the intermediate frame. The method can acquire the coordinates of the sight line drop point in real time and is not limited to a particular scene. The lower-level modules provide data support, and the attention visualization and analysis module provides rich real-time gaze tracking visualization and visualization of the related sight line parameters. The technical scheme meets the accuracy and stability requirements of the usage scene, and its range of application scenarios is wide.

Description

System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a system and a method for realizing sight line estimation and attention analysis based on a recurrent convolutional neural network.
Background
The eyes are the main organs through which humans acquire information from the outside world, and the direction of a person's gaze carries rich latent information. By detecting the gaze direction, the location where a person's attention is concentrated can be identified, and the regions and objects of interest in the current space can then be estimated; such information is of great value in commercial evaluation and disease diagnosis. In addition, the gaze direction is important information for a machine to understand human intention, and accurate estimation of the gaze direction can provide strong technical support for human-computer interaction and bring users a contactless interaction experience. Sight line estimation technology is already widely applied in fields such as driver assistance and virtual reality, and has gradually become an important research branch in computer vision.
Current research on sight line estimation technology falls mainly into two categories: feature-based gaze estimation methods and appearance-based gaze estimation methods.
Feature-based methods first detect visual features of the human eye in the acquired image, such as the positions of the pupil center, iris center and eye corners; they then extract the relevant sight line parameters from the image, and finally establish a mapping model between these features and the sight line direction. This approach was used earliest. However, feature-based gaze tracking depends heavily on the detection accuracy of the eye's visual feature points, which requires the vision acquisition device to be close to the eye, so intrusive gaze tracking solutions are mostly adopted. In addition, point light sources and multi-camera systems need to be introduced; such solutions include Chinese patent CN 104113680 (application published 2014.10.22, titled "Sight tracking system and sight tracking method"), which uses multiple groups of point light sources and multiple cameras, so the hardware configuration of the system is complex, limiting its application and popularization. Moreover, the models constructed by this approach require user calibration to solve the model parameters, the number of calibration points is large, and user operation is complicated. More importantly, feature-based sight line estimation considers only eye features when establishing the mapping model; when the calibrated model is used and the user's head departs from the calibration position and moves freely, large prediction errors appear and the sight line estimation accuracy drops markedly. Therefore, sight line estimation systems adopting this approach are severely limited when handling sight line estimation under free head movement.
With the development of deep learning, appearance-based sight line estimation has gradually become a research hotspot. The main idea is to train a machine learning model with a large amount of sample data consisting of eye or face pictures labeled with fixation points or gaze directions, thereby establishing a mapping model from the face or eye appearance to the sight line drop point on the screen. Examples include Chinese patent CN 106599994 (application published 2017.04.26, titled "Sight line estimation method based on a deep regression network"), which estimates the gaze direction by taking a single eye image as the input of a 5-layer deep regression neural network, and Chinese patent CN 110795982 (published 2019.01.04), which discloses an appearance-based sight line estimation method based on human body posture analysis. Chinese patent CN 111680546 A (application published 2020.09.18, titled "Attention detection method, attention detection device, electronic equipment and storage medium") uses a two-branch CNN to separately extract and then fuse head pose features and eye features, but it ultimately builds a binary classification model for attention detection, i.e., it can only give a binary result on whether the detected person is gazing at a specific target area. Converting the sight line estimation problem into binary classification simplifies the computation compared with a regression model, but the sight line drop point position cannot be estimated in real time, which greatly limits the usage scenarios of that scheme and prevents it from completing a full attention analysis task. Moreover, it provides no sight line regression or drop point calculation; it directly establishes a binary relation between the face appearance and a specific target area, so the model depends heavily on the camera arrangement and the choice of the gazed target position, which makes training and deployment difficult for the complex and changeable target areas of real scenes, and practicality is low. In addition, most existing methods extract single-frame face or eye features with a convolutional neural network and then compute the sight line direction with a regression model; because sight line estimation is dynamic and variable, such methods have poor robustness and their results can be unstable in use. In the sight line estimation task, the final goal is to determine the position of the user's sight line drop point in the current scene for subsequent analysis; however, current three-dimensional sight line estimation solutions only give the sight line direction represented by a two-dimensional angle vector and do not establish the mapping between the sight line direction and the final sight line drop point, so they cannot form an end-to-end system.
Disclosure of Invention
The invention aims to provide a better solution to the above problems in sight line estimation methods, and provides a system and method for realizing sight line estimation and attention analysis based on a recurrent convolutional neural network. Its main characteristics are as follows: in the sight line estimation task, the head pose and the appearance features of both eyes are fused and used as the input of the sight line regression module; in the sight line regression part, the dynamic sight line features of the head and eyes during gazing behavior are further fused, the sight line features of 15 consecutive frames are used as the input of a bidirectional long short-term memory network (Bi-LSTM), and the gaze direction of the intermediate frame is then regressed, so that jointly encoding the temporal information of the gazing behavior improves sight line estimation accuracy while ensuring the stability of the result; the sight line starting point is located based on the EPnP algorithm and a three-dimensional face model, the sight line equation and the gaze plane equation are fitted in a unified camera space, and the sight line drop point is calculated, thereby establishing the mapping between the sight line vector in three-dimensional space and the sight line drop point in the gazed scene; and the underlying sight line estimation model provides data support, together with sight line estimation result visualization and attention analysis functions, forming a complete end-to-end sight line estimation and attention analysis system.
The system for realizing sight line estimation and attention analysis based on the recurrent convolutional neural network mainly comprises 4 parts:
(1) a single-frame-image spatial-domain feature extraction and feature fusion module based on a convolutional neural network and the EPnP algorithm, hereinafter referred to as the sight line feature extraction module;
(2) a module that performs sight line regression by fusing time-domain information over the features of consecutive multi-frame pictures, hereinafter referred to as the sight line regression module;
(3) a module that maps between sight line vectors and sight line drop points based on a space geometric model, hereinafter referred to as the sight line drop point mapping module;
(4) a module that visually displays the sight line estimation results and performs attention analysis, hereinafter referred to as the attention visualization and analysis module.
The functions of each module in the system, together with their hardware components and characteristics, are described separately below.
(I) Sight line feature extraction module
The sight line feature extraction module extracts sight line features from the RGB original images collected by a monocular camera and provides them as the input vector of the sight line regression module. In hardware it requires only one monocular camera. Its processing logic mainly comprises 5 parts: image acquisition, face recognition, facial feature point detection, head pose calculation, and binocular feature extraction.
In the sight line feature extraction module, the invention extracts both the eye appearance features and the head pose features. For the eye appearance features, the eye regions are first determined by the facial feature point detection algorithm; the binocular region is then converted into a standardized feature space by affine transformation, and the eye image, after graying and histogram equalization, is input into the eye feature extraction network Gaze-DenseNet to extract the sight line features of the eye region. For the head pose, the two-dimensional coordinates of the corresponding feature points obtained by the facial feature point detection algorithm and a pre-acquired universal three-dimensional face model containing 68 feature points are optimized jointly by the EPnP algorithm to obtain the head pose features. The pose features also undergo the feature space standardization step, and the 6D pose is reduced to a two-dimensional head rotation angle vector. Finally, the two groups of features are fused by concatenation and used as the input of the next-level sight line regression module.
(II) sight line regression module
The sight line regression module is the core processing module of the whole sight line estimation system; it constructs a mapping model from the fused binocular-appearance and head-pose features to the sight line direction. The module performs the three-dimensional sight line estimation task: it estimates, in real time from the captured face video, the person's sight line direction {pitch, yaw} in camera space, representing the offset angles of the sight line in the vertical and horizontal directions respectively. The sight line vector is defined as the unit vector, in the camera coordinate system, pointing from the midpoint of the line connecting the inner corners of both eyes to the gazed target point. Structurally, the sight line regression module consists of a bidirectional long short-term memory network layer with 15 memory units and a fully connected layer.
The invention fuses the dynamic characteristics of gazing behavior in the sight line regression module, making full use of the sequence information of eye and head movement in the time dimension, and predicts the gaze direction of the current frame given the eye appearance and head pose features of consecutive frames. In the bidirectional long short-term memory network layer, the overall information of a video segment is propagated to each frame through forward and backward passes of the features, and the local sight line features extracted from each frame are integrated with the overall information of the video to optimize and screen the sight line features of the current frame. Fully connected layers then learn from the resulting sight line features to regress an accurate and stable sight line angle. The distinguishing characteristic of the sight line regression module is that, in addition to the sight line features extracted from each frame, it incorporates the dynamic characteristics of the image sequence in the time dimension, which improves the global expressive power of the features, reduces feature instability caused by image noise, and improves the accuracy and robustness of the sight line estimation task.
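The structure just described (a bidirectional LSTM over 15 frames of fused sight line features, followed by a fully connected layer that regresses the pitch/yaw of the intermediate frame) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the class name GazeRegressor and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    """Bi-LSTM over a 15-frame sequence of 60-D fused sight line features,
    regressing the (pitch, yaw) sight line angle of the middle frame."""
    def __init__(self, feat_dim=60, hidden=128):   # hidden size is an assumption
        super().__init__()
        self.bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                              num_layers=1, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 2)          # -> (pitch, yaw)

    def forward(self, seq):                         # seq: (batch, 15, 60)
        out, _ = self.bilstm(seq)                   # (batch, 15, 2*hidden)
        mid = out[:, seq.shape[1] // 2, :]          # features of the intermediate frame
        return self.fc(mid)                         # (batch, 2)

# Example: one batch of four 15-frame sequences of fused features
angles = GazeRegressor()(torch.randn(4, 15, 60))    # -> shape (4, 2)
```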
(III) sight line drop point mapping module
The sight line drop point mapping module realizes the conversion of the sight line direction dimension, namely, the sight line vector in the three-dimensional space is converted into the sight line drop point coordinate on the two-dimensional plane, thereby providing effective data support for the subsequent attention analysis module. The module specifically comprises a sight starting point positioning module based on an EPnP algorithm and a sight vector and sight falling point conversion module based on a space geometric model. In the above sight line regression module, a representation of the sight line direction in the camera coordinate system is obtained, which may be converted into a three-dimensional sight line vector. Under the condition that the relative position relation between a camera coordinate system and a screen coordinate system is calibrated, the representation of the three-dimensional sight line vector in the screen coordinate system can be obtained, but the position of a sight line starting point in the screen coordinate system is usually difficult to obtain through a monocular camera, and the existing method usually adopts a depth camera as a solution.
In the invention, based on the established universal three-dimensional face model and the pixel coordinates of the facial feature points detected by the face alignment algorithm, the coordinates of the origin of the head coordinate system in the camera coordinate system (whose intrinsic parameters are known) are obtained by iterative optimization with the EPnP algorithm. Because the relative position of the sight line starting point and the head coordinate origin is uniquely determined by the three-dimensional face model, the sight line starting point in the camera coordinate system is thereby located, and it is then converted into the sight line starting point in the screen coordinate system through the calibrated positional relationship. Finally, the representation of the sight line drop point in the screen coordinate system is obtained through the established space geometric model. In this module, the EPnP-based sight line starting point localization and the constructed space geometric model complete the conversion from the sight line vector in three-dimensional space to the sight line drop point on the two-dimensional gaze plane using only a monocular camera, which reduces the hardware cost of the system while ensuring solution accuracy.
(IV) attention visualization and analysis module
The attention visualization and analysis module is the upper-level module of the whole sight line estimation system. Its hardware requires a display; its software includes real-time display of sight line estimation data, graphical displays such as the eye movement hotspot map, attention analysis, and data storage and export. The module presents the calculation results of the lower and middle modules visually, including real-time display of the user's sight line drop point on the gazed screen, dynamic plots of head pose data, and the sight line hotspot map generated from statistics of gaze data over a period of time. The sight line hotspot map highlights the regions where the user's attention is concentrated, clearly indicating the user's gaze duration at each position in the whole scene, and provides rich and reliable data support for further attention or interest-point analysis.
Among the four modules of the system for realizing sight line estimation and attention analysis based on the recurrent convolutional neural network, the sight line feature extraction module realizes efficient extraction of sight line features: a convolutional neural network extracts the eye features, and the head pose is solved by feature-point-matching optimization. The sight line regression module is the functional core of the whole system; it establishes the mapping from the feature extraction results of the sight line feature extraction module to the sight line vector and is realized with deep learning. These two modules are distinguished functionally for clarity of the system description, but they are logically tightly connected: the feature extraction network and the sight line regression network together form a complete bidirectional recurrent convolutional neural network, called the sight line estimation network. The network is trained offline on a pre-collected dataset, iteratively optimizing the model parameters under the constraint of a loss function; after multiple training runs, the model with the smallest test error is taken as the final sight line estimation model and deployed in the system for real-time sight line estimation. The sight line drop point mapping module receives the sight line angle output by the upstream network, calibrates the relative position of the camera and the screen, and uses the proposed sight line drop point calculation method to acquire the sight line drop point coordinates in the gazed scene in real time. The attention visualization and analysis module completes data visualization based on the calculation results of the above modules, visually displays the prediction performance of the whole system, supports data export, and enables subsequent in-depth analysis in combination with different application scenarios.
Under the system framework, the invention realizes the sight line estimation and attention analysis method based on the recurrent convolutional neural network, which comprises the following steps:
Step 1: perform face detection and facial feature point extraction;
Step 2: extract the head pose features from the acquired facial feature point data and the three-dimensional face model, combined with the EPnP algorithm;
Step 3: extract the sight line features in a standardized space based on the designed binocular feature extraction network Gaze-DenseNet;
Step 4: perform spatial-domain feature fusion on the obtained sight line feature data, then perform time-domain feature extraction and fusion, and carry out three-dimensional sight line estimation based on the fused features;
Step 5: construct a space geometric model and combine it with the EPnP algorithm to realize real-time conversion between the three-dimensional sight line vector and the two-dimensional sight line drop point;
Step 6: visualize the sight line estimation results and perform attention analysis.
the face detection and feature point extraction in the step 1 mainly includes two aspects of face region detection and 68 face feature points in the face region detection and positioning, and the specific steps are as follows:
step 1-1: the method comprises the steps of receiving RGB pictures collected by a monocular camera at a certain frequency in real time, setting the resolution and the shooting angle of the camera, and ensuring that the collected pictures contain complete user face areas.
Step 1-2: and detecting a region where the human face is in the received input image by using a human face recognition model pre-trained in a Dlib machine learning library, and marking the region by using a rectangular frame.
Step 1-3: and positioning 68 facial feature points in the facial region detected in the previous step by adopting a facial feature point detection algorithm based on a Continuous Conditional Neural Field (CCNF) model, and acquiring and storing the pixel coordinates of each point.
In step 2, the head pose features are extracted from the acquired facial feature point data and the three-dimensional face model, combined with the EPnP algorithm. The specific steps are as follows:
Step 2-1: construct a general three-dimensional face model M_F and establish a head coordinate system on the model. The model consists of the three-dimensional coordinates, expressed in the head coordinate system, of 68 facial feature points whose positions, after imaging by the camera, correspond one-to-one with the feature points detected in step 1-3.
Step 2-2: taking the head coordinate system of the constructed general three-dimensional face model M_F as the world coordinate system, and given a camera with known intrinsic parameters, the 68 3D feature points in the world coordinate system and their corresponding two-dimensional projection coordinates, solving the head pose becomes a Perspective-n-Point (PnP) problem.
Step 2-3: solve this PnP problem for the 6D head pose using the EPnP algorithm with Gauss-Newton optimization; the pose comprises the rotation matrix R of the head coordinate system relative to the camera coordinate system and the displacement t of the head coordinate system relative to the camera coordinate system.
In step 3, sight line feature extraction is performed in a standardized space based on the designed binocular feature extraction network Gaze-DenseNet. This mainly comprises constructing the Gaze-DenseNet feature extraction network module, feature space standardization, cropping and preprocessing of the binocular image, and binocular feature extraction. The specific steps are as follows:
Step 3-1: first construct the feature extraction network for the binocular picture. In the invention, based on the feature extraction mechanism of the DenseNet network and taking into account the small image size (40 × 150) in this task, a binocular feature extraction network module Gaze-DenseNet containing 19 convolutional layers is designed; the specific network structure is shown in the following table:
[Table: layer configuration of the Gaze-DenseNet binocular feature extraction network (19 convolutional layers); not reproduced here.]
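Purely as an illustration of the DenseNet-style feature reuse that Gaze-DenseNet builds on, the sketch below shows one densely connected layer in which each convolution's output is concatenated with its input. The layer counts, channel widths and growth rate of the actual 19-layer Gaze-DenseNet are given in the table above and are not assumed here.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One DenseNet-style layer: BN -> ReLU -> 3x3 conv, with the output
    concatenated to the input so later layers see all earlier feature maps."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        new_features = self.conv(torch.relu(self.bn(x)))
        return torch.cat([x, new_features], dim=1)

# Stacking such layers on a grayscale 40x150 eye image grows the channel count
# by growth_rate per layer; a final pooling/flattening stage would reduce the
# maps to the 58-D feature vector used downstream.
x = torch.randn(1, 1, 40, 150)
x = DenseLayer(1, 16)(x)   # -> (1, 17, 40, 150)
```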
step 3-2: and (3) cutting the original image into a binocular image through a characteristic space standardization technology and converting the binocular image into a standardized camera space.
The feature space standardization technique addresses the problem that, because people's head poses, shooting angles and distances differ, the binocular pictures cropped around the feature points differ in size and resolution, which degrades the network performance when they are used as input. A normalization technique is therefore introduced: the image is converted by affine transformation into a standardized camera space in which the distance between the person's head and the camera is fixed and the virtual camera pose follows changes of the head pose, so that a complete binocular image can be cropped at the prescribed resolution. The specific implementation comprises the following 5 steps:
(1) Take the camera coordinate system as the world coordinate system. Knowing the coordinates e_z of the sight line starting point (i.e., the midpoint of the line connecting the inner corners of both eyes) and the head pose rotation matrix R, rotate the camera coordinate system so that its z-axis points at the center of the two eyes. The z-axis of the rotated virtual camera is c_z = e_z/‖e_z‖;
(2) Rotate the camera about the z-axis until the x-axis of the virtual camera coordinate system and the x-axis of the head coordinate system lie in the same plane. The x-axis of the head coordinate system is a known quantity, namely the first row R_x of the head rotation matrix R. This step requires the rotated virtual camera x-axis c_x and the head coordinate system x-axis R_x to lie in the same plane, which is satisfied when the rotated virtual camera y-axis c_y is perpendicular to that plane. Since c_y is also perpendicular to the virtual camera z-axis c_z, it can be obtained as the cross product c_y = R_x × c_z, and the camera coordinate system x-axis then follows as the cross product c_x = c_y × c_z, from which the camera rotation matrix R_c = [c_x c_y c_z] is obtained;
(3) Normalize the distance from the center of the two eyes to the camera center. Define a scaling matrix S = diag(1, 1, d/‖e_z‖) that scales the z-axis of the camera coordinate system, where d is the standardized distance from the center of the two eyes to the camera center; in the invention d is taken as 60 cm;
(4) The above steps yield the camera conversion matrix M = SR_c, which represents the conversion from the real camera coordinate system to the virtual camera coordinate system in the standardized space, and the head rotation matrix converted into the standardized space, R_n = RR_c;
(5) The standardized binocular image is obtained by warping with the affine transformation matrix W = C_s M C_r^(-1), where C_r is the real intrinsic matrix of the camera, obtained by camera calibration, and C_s is the intrinsic matrix of the virtual camera in the standardized space. With the C_s values chosen in this patent (given in the original figure, not reproduced here), the resolution of the resulting binocular picture is 40 × 150.
The feature space standardization operation is completed through the steps. In the following description, the references to "feature space normalization technique" all indicate the process of converting the features in the original camera space into the normalized camera space using the description in step 3-2, and the detailed processing steps are not repeated.
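A compact sketch of steps (1) to (5), under the definitions stated above (sight line starting point e_z in camera coordinates, head rotation R with first row R_x, d = 60 cm, real intrinsics C_r, virtual intrinsics C_s). The function name, the assumption that e_z is expressed in metres, the row convention chosen for R_c, and the warp W = C_s·M·C_r^(-1) are assumptions consistent with common normalization practice rather than values taken verbatim from the patent.

```python
import cv2
import numpy as np

def normalize_eye_image(img, e_z, R, C_r, C_s, d=0.60, out_size=(150, 40)):
    """Warp the original camera image into the standardized camera space.
    e_z : 3D sight-line starting point (midpoint of the inner eye corners), camera coords
    R   : head rotation matrix from the EPnP step
    C_r : real camera intrinsics; C_s : virtual (standardized) camera intrinsics
    """
    c_z = e_z / np.linalg.norm(e_z)                 # (1) virtual camera z-axis
    R_x = R[0, :]                                   # x-axis of the head coordinate system
    c_y = np.cross(R_x, c_z)                        # (2) virtual camera y-axis
    c_y /= np.linalg.norm(c_y)
    c_x = np.cross(c_y, c_z)                        #     virtual camera x-axis
    # The patent writes R_c = [c_x c_y c_z]; rows are assumed here so that R_c
    # maps e_z onto the virtual optical axis.
    R_c = np.stack([c_x, c_y, c_z])
    S = np.diag([1.0, 1.0, d / np.linalg.norm(e_z)])  # (3) z-axis scaling, d in metres
    M = S @ R_c                                       # (4) real -> virtual conversion
    W = C_s @ M @ np.linalg.inv(C_r)                  # (5) image warp matrix (assumed form)
    eyes = cv2.warpPerspective(img, W, out_size)      # 40 x 150 binocular patch
    return eyes, R_c
```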
Step 3-3: the binocular image with the resolution of 40 × 150 is subjected to graying processing and histogram equalization processing.
Step 3-4: input the preprocessed binocular images into the trained Gaze-DenseNet module and extract the binocular features; the output feature dimension is 58 × 1.
In step 4, spatial-domain feature fusion is performed on the obtained sight line feature data, then time-domain feature extraction and fusion are performed, and three-dimensional sight line estimation is carried out based on the fused features. The specific steps are as follows:
Step 4-1: to ensure the uniformity of the feature space, the head rotation matrix from step 2-3 is also converted into the standardized feature space, giving the standardized-space head rotation matrix R_n = R × R_c, where R_c is the camera rotation matrix obtained by the feature space standardization process of step 3-2. The head rotation matrix R_n is then further reduced to the two-dimensional head pose in Euler angle form, h_n = (h_θ, h_φ), where h_θ and h_φ are respectively the pitch and yaw angles of the head in the standardized space.
Step 4-2: this completes the extraction of the spatial-domain features, comprising the binocular appearance features and the head pose features; the two are fused by concatenation into a 60-dimensional sight line feature vector.
Step 4-3: load a sequence of 15 consecutive frames with the f-th frame as the intermediate frame, apply the processing of steps 1-1 through 4-2 to obtain the sight line features extracted from each frame, and form the feature sequence in the time dimension L = {L_{f-7}, …, L_{f-1}, L_f, L_{f+1}, …, L_{f+7}}.
Step 4-4: input the feature sequence L into a Bi-LSTM network layer containing 15 units for time-domain feature extraction and fusion, and, after processing by the fully connected layer, regress the sight line angle of the intermediate frame f in the standardized camera space, g_n^f = (pitch, yaw), where pitch denotes the pitch angle of the line of sight and yaw denotes the yaw angle.
Step 4-5: convert the standardized-space sight line angle g_n^f into a three-dimensional sight line vector, then convert it into the original camera space to obtain the corresponding sight line angle of the current frame image and the camera-space sight line vector g_f, thereby completing the sight line angle estimation for one frame of image.
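The conversion in step 4-5 is sketched below using the pitch/yaw-to-vector relation common in appearance-based gaze work and an orthonormal camera rotation R_c; the sign convention and the assumed direction of the R_c mapping are assumptions, since the original formulas appear only as figures.

```python
import numpy as np

def angles_to_vector(pitch, yaw):
    """Unit sight line vector from (pitch, yaw); common convention, assumed here."""
    return np.array([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)])

def vector_to_angles(g):
    g = g / np.linalg.norm(g)
    return np.arcsin(-g[1]), np.arctan2(-g[0], -g[2])    # (pitch, yaw)

def denormalize_gaze(pitch_n, yaw_n, R_c):
    """Map the standardized-space sight line back to the original camera space,
    assuming R_c maps original-camera directions into the standardized space."""
    g_n = angles_to_vector(pitch_n, yaw_n)
    g = R_c.T @ g_n          # inverse rotation (R_c is orthonormal)
    return g, vector_to_angles(g)
```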
Step 5 constructs a space geometric model and combines it with the EPnP algorithm to realize real-time conversion between the three-dimensional sight line vector and the two-dimensional sight line drop point. The specific steps are as follows:
Step 5-1: using existing screen-camera calibration techniques, calibrate the relative position relationship between the camera coordinate system C and the screen coordinate system S, i.e., the rotation and translation from C to S.
Step 5-2: normalize the camera-space sight line vector g_f obtained in step 4-5 to obtain the unit sight line vector ĝ_f = g_f/‖g_f‖.
Step 5-3: because the position of the sight line starting point p_0 is fixed in the head coordinate system and is given by the universal face model M_F, its position coordinates in the camera coordinate system are obtained from the head pose solution computed with the EPnP algorithm in step 2-3.
Step 5-4: using the calibrated transformation, convert the unit sight line vector ĝ_f and the sight line starting point p_0 uniformly into the screen coordinate system.
Step 5-5: in the screen coordinate system, the sight line direction vector and the sight line starting point determine the straight-line equation of the sight line; the sight line drop point coordinates P(x, y) on the screen are then calculated from the space geometric relationship of a line intersecting a plane in three-dimensional space.
Step 6, visualization of the sight line estimation results and attention analysis, mainly includes the following aspects:
Step 6-1: the data analysis and display platform acquires the sight line drop point coordinates P(x, y), the sight line starting point coordinates p_0 and the head pose h_n from the lower-layer modules, and dynamically displays these data in real time.
Step 6-2: during the recognition process, the sight line drop point is dynamically displayed superimposed on the image of the user's field of view, tracking the sight line direction in real time. Statistics accumulated over a period of time are visualized, and an eye movement hotspot map is generated and shown on the display device based on the gaze duration and the distribution of sight line drop point positions.
Step 6-3: accurate and rich sight line estimation data are stored in a background database, and the specific data comprise: the binocular image with the time stamp, the coordinates of the sight line starting point, the coordinates of the sight line falling point, the fixation time, the blinking times and the like.
Step 6-4: based on the data, an attention detection report is generated. Meanwhile, the system is provided with a diversified data export interface, and provides data support for research in related fields.
Compared with the prior art, the system and the method for realizing the sight line estimation and the attention analysis based on the recurrent convolutional neural network have the following outstanding technical effects:
(1) In the spatial domain, the head pose features and the binocular appearance features of a single frame are fused, realizing a sight line estimation system with unconstrained head movement. The concrete implementation takes the binocular image as input, which overcomes the incompleteness and instability of monocular features. For binocular feature extraction, a convolutional neural network based on the DenseNet mechanism is designed for the small-size images of this task; the head pose is obtained by an optimization solution based on feature point matching.
(2) In the time domain, a sight line regression network that fuses temporal information is constructed. Considering the dynamic nature of the sight line estimation task, i.e., the dynamics of the head pose and eye movement and their continuity in the time dimension, the dynamic sight line features are further jointly encoded through a Bi-LSTM network layer before the sight line angle is regressed, ensuring both the accuracy and the stability of the sight line estimation results.
(3) In the sight line drop point mapping module, a sight line drop point calculation method based on a monocular camera is proposed. The prior art usually performs two-dimensional sight line drop point regression directly with a deep learning method, or realizes it with a binocular camera or a depth camera. In the invention, a monocular camera is first used to locate the sight line starting point based on the EPnP algorithm and the three-dimensional face model M_F; a space geometric model is then established, the sight line vector and the gaze plane equation are fitted in a unified camera space, and the sight line drop point is calculated. A real-time conversion relationship is thus established between the three-dimensional sight line vector in camera space and the two-dimensional sight line drop point in the gaze plane. Deployment and use of the method are not limited to specific scenes, and the hardware cost of the system is reduced.
(4) The data display and attention analysis module of the sight line estimation system, supported by the data from the back-end modules, provides diversified data acquisition and display functions such as time-stamped binocular images, sight line starting point coordinates, sight line drop point coordinates, gaze duration, blink count and head pose, and offers rich data output interfaces for further scientific research and analysis.
Drawings
Fig. 1 is a schematic structural diagram of the overall architecture of the system for implementing gaze estimation and attention analysis based on the recurrent convolutional neural network according to the present invention.
Fig. 2(a) and fig. 2(b) are schematic diagrams of the sight line estimation scene based on screen stimulation according to the invention.
Fig. 3 is a flowchart of the overall architecture implementation of the method for implementing gaze estimation and attention analysis based on the recurrent convolutional neural network according to the present invention.
Fig. 4(a) and 4(b) are matching diagrams of the three-dimensional face model and the feature points according to the present invention.
Fig. 5 is a network framework diagram of the gaze estimation network of the present invention.
FIG. 6 is a schematic diagram of a geometric model of a sight-line landing point mapping space according to the present invention.
Fig. 7 is a structural diagram of a sight line estimation result visualization and attention analysis module according to the present invention.
Fig. 8(a) and 8(b) are schematic diagrams of the sight line estimation scene based on the physical stimulation according to the invention.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings, but the scope of the present invention is not limited to the following embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. Unless otherwise defined, all terms of art used hereinafter have the same meaning as commonly understood by one of ordinary skill in the art and are used herein for the purpose of describing particular embodiments only and are not intended to limit the scope of the present invention.
The first embodiment is as follows: sight estimation system based on screen stimulation
The sight line estimation system based on screen stimulation means that a user only receives information on a screen of a terminal display as a visual stimulation source, and therefore only pays attention to information such as the gazing point position, the gazing time and the like in the screen range.
The application scenario of the system is shown schematically in fig. 2. Referring to the reference numbers in the figures, the system hardware mainly comprises three parts: 1. a vision acquisition device (a monocular camera, optionally a USB camera or a webcam); 2. a terminal display; and 3. a controller. The camera can be mounted in two ways: attached, i.e. mounted at the middle of the upper or lower edge of the terminal display screen, or independent, i.e. mounted beside the screen on a camera bracket; the two ways correspond to fig. 2(a) and fig. 2(b) respectively. The vision acquisition device collects face pictures of the target user and is connected to the controller. The controller provides the sight line estimation and data storage functions, offers an open interface, and can export a visual analysis report and the raw data. The display is connected to the controller and serves two purposes: it presents gaze targets to give the user visual stimulation from the screen, and it outputs the controller's calculation results for visual display. The sight line recognition system is mainly intended for human-computer interaction, attention analysis, consumer behavior analysis for online shopping malls, and similar fields.
The overall block diagram of the embodiment is shown in fig. 3, and the operation flow of the sight line estimation system based on screen stimulation is described in detail with reference to the drawing, and the specific steps of the system operation are as follows.
Step 1: perform face detection and facial feature point extraction.
Step 1-1: install and connect the hardware devices as required, and adjust the camera angle and resolution so that the face in the region to be recognized is clearly imaged in the camera's field of view.
Step 1-2: the vision acquisition device acquires in real time a video stream I_RGB containing the target user's head region and inputs it to the controller over a wired connection.
Step 1-3: a face detection module in the controller receives the RGB picture input, runs the face detection algorithm, and identifies the region R_face where the face is located.
Step 1-4: within the face region R_face, detect the 68 facial feature points using the facial feature point detection algorithm based on the Continuous Conditional Neural Field (CCNF) model, and obtain and store the pixel coordinates of each point, P = {P_1, P_2, …, P_68}.
Step 2: extract the head pose features from the acquired facial feature point data and the three-dimensional face model, combined with the EPnP algorithm.
Step 2-1: the head pose calculation module in the controller combines the pre-constructed universal three-dimensional face model M_face, the pixel coordinates P of the facial feature points acquired in step 1-4 and the camera intrinsic parameters K, and solves the head position and pose through the EPnP algorithm. The matching between the universal three-dimensional face model M_face and the feature points is shown in fig. 4(a) and 4(b), where fig. 4(a) indicates the face region and the detected 68 feature points and fig. 4(b) is the corresponding three-dimensional face model.
Step 2-2: through iterative optimization, obtain the head rotation matrix R corresponding to the f-th frame picture, together with the displacement t of the head coordinate system origin in the camera coordinate system.
Step 3: extract sight line features in a standardized space based on the designed binocular feature extraction network Gaze-DenseNet.
Step 3-1: the recursive convolutional sight line estimation network model M_gaze, which has completed offline training, is deployed in the controller; the overall architecture of the sight line estimation network is shown in fig. 5.
Step 3-2: perform image preprocessing: crop the binocular images from the original images acquired by the vision acquisition device using the feature space standardization technique and convert them into the standardized camera space, with a binocular image size of 40 × 150.
Step 3-3: gray the binocular RGB image output in step 3-2 and then apply histogram equalization to obtain the binocular image to be input to the feature extraction network.
Step 3-4: input the preprocessed binocular images into the Gaze-DenseNet binocular feature extraction module of the sight line estimation network model and extract the 58-dimensional binocular appearance features I_f by convolution.
Step 4: perform spatial-domain feature fusion on the acquired sight line feature data, then perform time-domain feature extraction and fusion, and carry out three-dimensional sight line estimation based on the fused features.
Step 4-1: to unify the binocular appearance features and the head pose features in the standardized feature space, convert the head rotation matrix from step 2-2 into the standardized feature space, obtaining the standardized-space head rotation matrix R_n = R × R_c, where R_c is the camera rotation matrix obtained by the feature space standardization process of step 3-2; then further reduce the head rotation matrix R_n to the two-dimensional head pose in Euler angle form, h_n = (h_θ, h_φ), where h_θ and h_φ are respectively the pitch and yaw angles of the head in the standardized space.
Step 4-2: this completes the spatial-domain feature extraction for a single frame, comprising the binocular appearance features I_f and the head pose features h_f. Apply the processing of steps 1-1 through 4-1 to 15 consecutive original frames to obtain the feature sequence of the binocular pictures in the standardized space, I = {I_{f-7}, …, I_{f-1}, I_f, I_{f+1}, …, I_{f+7}}, and the head pose sequence h = {h_{f-7}, …, h_{f-1}, h_f, h_{f+1}, …, h_{f+7}}.
Step 4-3: first fuse the two feature sequences frame by frame in the spatial dimension by concatenation, then form a group of time-dimension feature sequences from every 15 consecutive frames, L = {L_{f-7}, …, L_{f-1}, L_f, L_{f+1}, …, L_{f+7}}, where each element of the sequence represents all the sight line features extracted from one frame of image.
Step 4-4: input the feature sequence L into a Bi-LSTM network containing 15 units; through time-domain feature fusion and regression layer processing, regress the sight line angle of the intermediate frame f in the standardized camera space, g_n^f = (pitch, yaw), where pitch denotes the pitch angle of the line of sight and yaw denotes the yaw angle.
Step 4-5: convert the standardized-space sight line angle g_n^f into a three-dimensional sight line vector, then convert it into the original camera space to obtain the corresponding sight line angle of the current frame image and the camera-space sight line vector g_f, thereby completing the sight line angle estimation for one frame of image.
Step 5: construct the space geometric model and combine it with the EPnP algorithm to realize real-time conversion between the three-dimensional sight line vector and the two-dimensional sight line drop point.
Step 5-1: using existing screen-camera calibration techniques, calibrate the relative position relationship between the camera coordinate system C and the screen coordinate system S, i.e., the rotation and translation from C to S.
Step 5-2: normalize the camera-space sight line vector g_f obtained in step 4-5 to obtain the unit sight line vector ĝ_f = g_f/‖g_f‖.
Step 5-3: because the position of the sight line starting point p_0 is fixed in the head coordinate system and is given by the universal face model M_face, its position coordinates in the camera coordinate system are obtained from the head pose solution of step 2-2.
Step 5-4: convert the unit sight line vector ĝ_f and the sight line starting point p_0 uniformly into the screen coordinate system, obtaining the converted sight line vector ĝ_f^S and sight line starting point p_0^S.
Step 5-5: in the screen coordinate system S, the known sight line direction vector ĝ_f^S and sight line starting point p_0^S uniquely determine the straight-line equation of the sight line; combining this line equation with the screen plane equation, the sight line drop point coordinates P_gaze(x, y) on the screen are solved from the space geometric relationship.
Step 6: visualize the sight line estimation results and perform attention analysis. The structure of this module is shown in fig. 7 and comprises four parts: data display of the sight line parameters, chart display, attention analysis, and data storage and export.
Step 6-1, data display: during real-time sight line estimation, data such as the gaze duration t, the blink count n, the sight line starting point coordinates p_0, the sight line drop point coordinates P_gaze and the head pose are dynamically refreshed and displayed in the data display area in real time, and stored in the controller database.
Step 6-2, chart display: during real-time sight line recognition, the sight line drop point tracks the sight line direction on the screen in real time as a red bright spot, dynamically changing position and displayed superimposed on the screen content. After sight line recognition is completed, an eye movement hotspot map is generated from the sight line data stored in the controller data storage area over the statistical period; the map shows the regions where the user's attention is concentrated in a highlighted form, clearly indicating the user's gaze duration at various positions on the screen. Graphical display forms such as a blink frequency chart and a head pose change curve are also included.
Step 6-3 attention analysis: and generating an attention analysis report based on the rich data and the chart information, and analyzing the region in which the attention of the current user is focused.
Step 6-4: and a data export interface is provided while data and diagram display are carried out, and export of the original data is supported for further analysis and research.
Example two: sight estimation system based on physical stimulation
In the sight line tracking system based on physical stimulation, the user's gazing scene is a three-dimensional space: it can be a gaze plane in the space, or people and objects at different depths of field. Because the scene providing the visual stimulation is not restricted, the range of application is wider. The application scenarios of the system are shown schematically in fig. 8(a) and 8(b), where fig. 8(a) shows a gazing scene that is a plane in three-dimensional space and fig. 8(b) shows a gazing scene consisting of a person and objects in the space.
Referring to the labels in the scene schematic diagrams, the sight line tracking system based on physical stimulation mainly comprises the following hardware: 1. a vision acquisition device facing the user; a common USB monocular camera or a webcam can be selected, with the camera resolution set to no less than 640 × 480 (the higher the resolution, the longer the supported recognition distance). 2. A vision acquisition device facing the gazing scene, with the same selection requirements as above. 3. A controller, which deploys the sight line estimation network model that has completed offline training and performs the image processing and sight line estimation calculations. 4. A terminal display device, whose displayed content includes the gazing scene image collected by the scene-facing vision acquisition device, the real-time sight line drop point coordinates, a dynamic plot of the real-time sight line drop point within the gazing scene image, the head pose, the blink count and other information, as well as visual charts obtained by re-analysis of the raw data.
In the present embodiment, the installation of the hardware devices is subject to the following requirements: 1. The user-facing vision acquisition device collects facial pictures of the user; its lens faces the user when installed, so that a complete facial image is captured throughout the user's range of activity. 2. The scene-facing vision acquisition device collects pictures of the gazing scene within the user's field of view; the main gazing scene must lie at the center of the camera's field of view and be clearly visible, for example with the camera fixed about 50 cm above the head and the lens facing 45° downward. 3. The input interface of the controller is connected to the two vision acquisition devices, and its output interface is connected to the terminal display device. 4. There are no special requirements for installing the terminal display device, provided the sight line results can be displayed intuitively; data collection personnel can interact with the interface by mouse to store and visualize data. 5. After the system is installed, the relative positions of the user-facing and scene-facing cameras are fixed; the relative coordinate transformation between the two cameras must then be obtained by dual-camera calibration, entered through the interface, and uploaded to the model to take part in the calculation.
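As an illustrative sketch of the dual-camera calibration mentioned in item 5, assuming OpenCV is used (the patent does not prescribe a calibration tool), the relative rotation R and translation t between the two cameras can be recovered from synchronized views of a common checkerboard:

```python
# Illustrative sketch, assuming OpenCV: estimating the relative pose between the
# user-facing camera and the scene-facing camera from synchronized checkerboard
# views. Intrinsics K_* / dist_* are assumed to have been calibrated beforehand.
import cv2

def calibrate_camera_pair(obj_points, pts_user, pts_scene,
                          K_user, dist_user, K_scene, dist_scene, image_size):
    """obj_points: list of (N, 3) checkerboard corner coordinates per view;
    pts_user / pts_scene: matching (N, 1, 2) image corners from each camera;
    image_size: (width, height). Returns the relative rotation R, translation t
    and the RMS reprojection error."""
    rms, _, _, _, _, R, t, _, _ = cv2.stereoCalibrate(
        obj_points, pts_user, pts_scene,
        K_user, dist_user, K_scene, dist_scene,
        image_size, flags=cv2.CALIB_FIX_INTRINSIC)
    return R, t, rms
```

The corner correspondences can be gathered beforehand with cv2.findChessboardCorners on synchronized frame pairs; the returned R and t then define the coordinate transformation that is entered through the interface and used by the model.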
When this embodiment performs the sight line estimation task, the core recognition algorithm is exactly the same as in embodiment one. The specific difference in the working process is that the gazing scene in embodiment one comes from a picture displayed on a screen, whereas in this embodiment it is real-time video collected by the scene-facing vision acquisition device; the software flow during operation is common to both embodiments, so the operating steps of this embodiment can be carried out with reference to embodiment one. As for the field of application, unlike embodiment one, this embodiment is mainly directed at scenarios such as commodity attention collection, advertisement placement, and attention detection in classroom teaching.
The sight line estimation method used in the above embodiments can be generalized to any sight line estimation scenario in which the human face can be fully detected within a reasonable depth-of-field range. The hardware composition of the system is simple: only a monocular camera is used as the sight line acquisition sensor, which avoids cumbersome auxiliary devices, improves the flexibility of deployment and saves hardware cost. On the software side, the sight line estimation algorithm based on the recurrent convolutional neural network is optimized in the time dimension, overcoming the instability of conventional deep-learning sight line estimation, so the estimation results are more accurate and more robust. In addition, by fusing binocular appearance features with head posture features, sight line estimation can be performed while the head moves freely within the range in which the face remains detectable. This improves the practicality of the sight line estimation system and broadens its range of application.
Compared with the prior art, the system and the method for realizing the sight line estimation and the attention analysis based on the recurrent convolutional neural network have the following outstanding technical effects:
(1) In the spatial domain, the head posture feature and the binocular appearance feature of a single-frame image are fused, realizing sight line estimation without constraining head movement. Using binocular images as input solves the problem that monocular features are incomplete and unstable. For binocular feature extraction, a convolutional neural network based on the DenseNet mechanism is designed specifically for the small-size images in this task; the head posture is obtained by an optimization solution based on feature point matching.
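As a minimal sketch of the head-posture branch, assuming OpenCV's EPnP solver is used as the PnP back end (the generic 68-point 3D face model and the camera intrinsics are inputs obtained elsewhere, and the LM refinement is only a stand-in for the Gauss-Newton style optimization mentioned in the text):

```python
# Illustrative sketch: head pose from 2D facial landmarks and a generic 3D face
# model via EPnP, assuming OpenCV.
import cv2
import numpy as np

def estimate_head_pose(landmarks_2d, model_points_3d, camera_matrix, dist_coeffs=None):
    """landmarks_2d: (68, 2) detected pixel coordinates;
    model_points_3d: (68, 3) points of the generic face model in the head frame;
    returns rotation matrix R and translation t of the head w.r.t. the camera."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)            # assume an undistorted image
    pts3d = np.asarray(model_points_3d, dtype=np.float64)
    pts2d = np.asarray(landmarks_2d, dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_EPNP)   # EPnP solution
    if not ok:
        raise RuntimeError("EPnP solution failed")
    # Local refinement of the EPnP result (available in recent OpenCV versions).
    rvec, tvec = cv2.solvePnPRefineLM(pts3d, pts2d, camera_matrix, dist_coeffs,
                                      rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)               # axis-angle -> 3x3 rotation matrix
    return R, tvec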
(2) In the time domain, a sight line regression network that fuses temporal information is constructed. Considering the dynamic nature of the sight line estimation task, namely that the head posture and the eye movement process vary dynamically and are continuous in the time dimension, the dynamic sight line features are jointly encoded by a Bi-LSTM network layer before the gaze angle is regressed, ensuring the accuracy and stability of the sight line estimation results.
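A minimal sketch of this temporal joint-encoding step, assuming PyTorch (the patent does not name a framework); the 15-frame window and the 60-dimensional fused per-frame feature (58-dimensional binocular appearance plus 2-dimensional head pose) follow this system's design, while the hidden size is an arbitrary illustrative choice:

```python
# Illustrative sketch, assuming PyTorch: a Bi-LSTM consumes a 15-frame sequence
# of fused gaze features and a fully connected head regresses the (pitch, yaw)
# gaze angle of the middle frame.
import torch
import torch.nn as nn

class TemporalGazeRegressor(nn.Module):
    def __init__(self, feat_dim: int = 60, hidden: int = 128, seq_len: int = 15):
        super().__init__()
        self.mid = seq_len // 2                    # index of the middle frame
        self.bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 2)         # -> (pitch, yaw)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, 15, 60) per-frame fused spatial-domain gaze features
        out, _ = self.bilstm(seq)                  # (batch, 15, 2*hidden)
        return self.fc(out[:, self.mid, :])        # gaze angle of the middle frame

# Example: a batch of 4 sequences of 15 frames with 60-dimensional features
angles = TemporalGazeRegressor()(torch.randn(4, 15, 60))   # shape (4, 2)
```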
(3) In the sight line drop point mapping module, a drop point calculation method based on a monocular camera is proposed. The prior art usually regresses the two-dimensional drop point directly with a deep learning method, or relies on a binocular or depth camera. In the present invention, the sight line starting point is first located using a monocular camera, the EPnP algorithm and the three-dimensional face model MF; a spatial geometric model is then established, and the sight line vector and the gazing plane equation are fitted in a unified camera space to calculate the drop point. A real-time conversion is thus established between the three-dimensional sight line vector in camera space and the two-dimensional drop point in the gazing plane; deployment and use are not limited to specific scenes, and the hardware cost of the system is reduced.
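A minimal sketch of the line-plane intersection that yields the two-dimensional drop point, under the assumption that the gazing plane is the z = 0 plane of the screen coordinate system (the patent states only that a plane-line intersection is solved after converting the gaze ray into the screen frame):

```python
# Illustrative sketch: intersect the gaze ray with the gazing plane, assumed
# here to be z = 0 of the screen coordinate system.
import numpy as np

def gaze_point_on_plane(origin_s: np.ndarray, direction_s: np.ndarray):
    """origin_s, direction_s: gaze starting point and unit gaze vector expressed
    in the screen coordinate system. Returns (x, y) on the plane z = 0, or None
    if the ray is (nearly) parallel to the plane or points away from it."""
    dz = direction_s[2]
    if abs(dz) < 1e-9:
        return None                       # gaze parallel to the screen plane
    lam = -origin_s[2] / dz               # solve origin_z + lam * dir_z = 0
    if lam < 0:
        return None                       # intersection behind the viewer
    p = origin_s + lam * direction_s
    return float(p[0]), float(p[1])

# Example: eye 0.4 m in front of the plane, looking slightly down and right
d = np.array([0.1, -0.05, -0.99])
print(gaze_point_on_plane(np.array([0.0, 0.0, 0.4]), d / np.linalg.norm(d)))
```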
(4) The data display and attention analysis module of the sight line estimation system, supported by data from the back-end modules, provides diversified data acquisition and display functions, such as time-stamped binocular images, sight line starting point coordinates, sight line drop point coordinates, fixation time, number of blinks and head posture, and offers rich data output interfaces for further scientific research and analysis.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A system for performing gaze estimation and attention analysis based on a recursive convolutional neural network, the system comprising from bottom to top:
the sight line feature extraction module is used for extracting and processing the human face eye features and the head posture features;
the sight line regression module is connected with the sight line feature extraction module and is used for carrying out spatial domain fusion processing on the acquired feature data, extracting and fusing time domain sight line features and further carrying out regression processing on the sight line direction of the user;
the sight line drop point mapping module is connected with the sight line regression module and used for resolving the sight line drop point of the user in real time; and
the attention visualization and analysis module is connected with the sight line drop point mapping module and is used for analyzing and displaying the user attention data obtained by calculation.
2. A method for performing line-of-sight estimation and attention analysis based on a recursive convolutional neural network using the system of claim 1, the method comprising the steps of:
step 1: carrying out face detection and face characteristic point extraction;
step 2: extracting head posture features by combining an EPnP algorithm based on the acquired face feature point data and the three-dimensional face model;
step 3: based on the designed binocular feature extraction network Gaze-DenseNet, carrying out binocular feature extraction in a standardized feature space;
step 4: performing spatial domain feature fusion on the obtained sight line feature data, then performing time domain feature extraction and fusion, and performing three-dimensional sight line regression based on the fused sight line features;
step 5: constructing a spatial geometric model and combining the EPnP algorithm to realize real-time conversion between the three-dimensional sight line vector and the two-dimensional sight line drop point;
step 6: carrying out visualization and attention analysis on the sight line estimation results.
3. The method for realizing line-of-sight estimation and attention analysis based on a recursive convolutional neural network as claimed in claim 2, wherein the face detection and feature point extraction in step 1 mainly comprises the detection of a face region and the detection and positioning of face feature points in the face region, and the specific steps are as follows:
step 1-1: receiving RGB pictures acquired by a monocular camera at a certain frequency in real time, setting the resolution and shooting angle of the camera, and ensuring that the acquired pictures contain complete user face areas;
step 1-2: detecting a region where a human face is located in a received input image by using a human face recognition model pre-trained in a Dlib machine learning library, marking the region by using a rectangular frame and storing data in the human face region;
step 1-3: positioning the facial feature points within the face region detected in the previous step by adopting a facial feature point detection algorithm based on a continuous conditional neural field model, and acquiring and storing the pixel coordinates of each point.
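The following sketch illustrates steps 1-1 to 1-3 in Python. Dlib's frontal face detector matches the Dlib-based detection of step 1-2; Dlib's 68-point shape predictor is used here only as a stand-in for the continuous conditional neural field landmark detector of step 1-3, and the model file path is an assumption:

```python
# Illustrative sketch, assuming OpenCV and Dlib: face detection plus 68-point
# landmark extraction for each received frame.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed path

def detect_face_landmarks(frame_bgr: np.ndarray):
    """Return the face rectangle and (68, 2) landmark pixel coordinates, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)            # upsample once to catch smaller faces
    if len(faces) == 0:
        return None
    rect = faces[0]                      # assume a single user in front of the camera
    shape = predictor(gray, rect)
    pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float64)
    return rect, pts
```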
4. The method for realizing line-of-sight estimation and attention analysis based on a recursive convolutional neural network as claimed in claim 3, wherein the specific steps of the step 2 are as follows:
step 2-1: constructing a general three-dimensional face model MF and establishing a head coordinate system on the model, wherein the three-dimensional face model MF consists of the three-dimensional coordinates of 68 feature points of the human face, the coordinates of the feature points are expressed in the head coordinate system, and after imaging by the camera the feature point positions correspond one-to-one to the two-dimensional feature points detected in step 1-3;
step 2-2: combining the constructed general three-dimensional face model MF and regarding the head coordinate system as the world coordinate system, so that, for a camera with known intrinsic parameters, the 68 3D feature points in the known world coordinate system and their corresponding two-dimensional projection coordinates convert the head pose solution into a PnP problem, which is then solved;
step 2-3: for the above PnP problem, solving the 6D pose of the head with the EPnP algorithm based on Gauss-Newton optimization, the pose comprising the rotation matrix R of the head coordinate system relative to the camera coordinate system and the displacement t of the head coordinate system relative to the camera coordinate system.
5. The method for realizing line-of-sight estimation and attention analysis based on a recursive convolutional neural network as claimed in claim 4, wherein the binocular feature extraction in a standardized space based on the binocular feature extraction network Gaze-DenseNet in step 3 mainly comprises: constructing the binocular feature extraction network module Gaze-DenseNet, standardizing the feature space, cropping and preprocessing the binocular image, and extracting features; the specific steps are as follows:
step 3-1: firstly, constructing a feature extraction network for the binocular pictures: combining the characteristics of the small-size 40 × 150 images in this task, a lightweight binocular feature extraction network module Gaze-DenseNet is designed based on the feature extraction mechanism of the DenseNet network;
step 3-2: cropping the binocular image from the original image by the feature space standardization technique and converting it into the standardized camera space;
step 3-3: performing graying processing on the binocular RGB image with a resolution of 40 × 150, followed by histogram equalization;
step 3-4: inputting the preprocessed binocular image into the Gaze-DenseNet module, extracting the binocular appearance features, and outputting a feature of dimension 58 × 1.
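A minimal sketch of the preprocessing in steps 3-3 and 3-4, assuming OpenCV; the final scaling to [0, 1] is an assumed normalization, not stated in the claim:

```python
# Illustrative sketch, assuming OpenCV: graying and histogram equalization of the
# 40x150 binocular crop before feeding it to the Gaze-DenseNet feature extractor.
import cv2
import numpy as np

def preprocess_eye_patch(eye_rgb: np.ndarray) -> np.ndarray:
    """eye_rgb: (40, 150, 3) uint8 binocular image crop in the standardized space."""
    gray = cv2.cvtColor(eye_rgb, cv2.COLOR_RGB2GRAY)   # graying
    eq = cv2.equalizeHist(gray)                        # histogram equalization
    return eq.astype(np.float32) / 255.0               # assumed scaling to [0, 1]

# Example with a dummy patch of the size stated in the claim
patch = preprocess_eye_patch(np.random.randint(0, 256, (40, 150, 3), dtype=np.uint8))
```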
6. The method for realizing line-of-sight estimation and attention analysis based on the recurrent convolutional neural network as claimed in claim 5, wherein the specific steps of step 4 are as follows:
step 4-1: converting the head rotation matrix R from step 2-3 into the standardized feature space to obtain the head rotation matrix of the standardized space Rn = R × Rc, wherein Rc is the camera rotation matrix obtained by the feature space standardization process of step 3-2; the head rotation matrix Rn is further reduced to the Euler-angle form of a two-dimensional head pose hn = (hθ, hφ), wherein hθ and hφ respectively represent the pitch angle and the yaw angle of the head in the standardized space;
step 4-2: after the spatial-domain features of the single-frame image have been extracted, namely the binocular appearance features and the head posture features, fusing them in a cascaded manner to form a 60-dimensional sight line feature vector;
step 4-3: loading a continuous 15-frame image sequence with the f-th frame as the intermediate frame, performing the processing of steps 1-1 to 4-2, acquiring the standardized-space sight line feature extracted from each frame image, and forming a sight line feature sequence with a time dimension L = {Lf-7, ..., Lf-1, Lf, Lf+1, ..., Lf+7};
step 4-4: inputting the sight line feature sequence L into a Bi-LSTM network layer containing 15 memory units; through the extraction and fusion of time-domain features and the processing of a fully connected layer, the sight angle (pitch, yaw) of the intermediate frame f in the standardized camera space n is regressed, wherein pitch represents the pitch angle of the line of sight and yaw represents the yaw angle;
step 4-5: converting the output sight angle (pitch, yaw) of the standardized space into a three-dimensional sight line vector, and then converting it into the original camera space to obtain the sight line vector gf corresponding to the current frame image, thereby completing the estimation of the sight angle of one frame of image.
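A sketch of the angle-to-vector conversion and de-normalization in step 4-5. The axis and sign convention below is a common one from appearance-based gaze estimation and is an assumption, since the claim's exact formula is not reproduced here; Rc is the normalization rotation from step 3-2:

```python
# Illustrative sketch: convert the regressed (pitch, yaw) of the standardized
# space into a 3-D gaze vector and rotate it back into the original camera
# space. Conventions are assumptions, not the patent's exact formulas.
import numpy as np

def angles_to_vector(pitch: float, yaw: float) -> np.ndarray:
    """Unit gaze vector for a camera looking along -z (assumed convention)."""
    return np.array([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)])

def denormalize_gaze(pitch_n: float, yaw_n: float, R_c: np.ndarray) -> np.ndarray:
    """Map the standardized-space gaze back into the original camera space,
    assuming R_c rotates camera-space directions into the normalized space;
    if the opposite convention is used, apply R_c instead of its transpose."""
    g_n = angles_to_vector(pitch_n, yaw_n)
    g = R_c.T @ g_n                     # undo the normalization rotation
    return g / np.linalg.norm(g)
```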
7. the method for realizing line-of-sight estimation and attention analysis based on a recursive convolutional neural network as claimed in claim 6, wherein the specific steps of step 5 are:
step 5-1: calibrating the relative position relationship between the camera coordinate system C and the gazed screen coordinate system S, namely the rotation-translation transformation between the two coordinate systems, by using existing screen-camera calibration techniques;
Step 5-2: the sight line vector g in the camera space to be acquired by the steps 4-5fUnitizing to obtain a unit sight line vector
Figure FDA00034698813200000310
Step 5-3: defining the middle point of the inner canthus connecting line of the eyes as the starting point of the sight line, and expressing as p in the head coordinate system0The point is defined by a universal three-dimensional face model M in the head coordinate systemFUnique determination; combining the head rotation amount R obtained based on the EPnP algorithm in the step 2-3 to obtain the position coordinates of the sight starting point in the camera coordinate system
Figure FDA00034698813200000311
Step 5-4: based on calibration results
Figure FDA00034698813200000312
The unit sight line vector
Figure FDA00034698813200000313
And a starting point p of sight0Uniformly converting the images into a screen coordinate system;
step 5-5: under a screen coordinate system, knowing a sight line direction vector and a sight line starting point, solving a linear equation where the sight line is located, and further calculating a sight line falling point coordinate P (x, y) in the screen according to a space geometric relation of intersection of a plane and a line in a three-dimensional space.
8. The method for realizing line-of-sight estimation and attention analysis based on a recursive convolutional neural network as claimed in claim 7, wherein the specific steps of the step 6 are as follows:
step 6-1: the data analysis and display platform in the attention visualization and analysis module acquires the sight line drop point coordinates P(x, y), the sight line starting point coordinates p0 and the head pose hn, and displays them dynamically in real time;
step 6-2: during recognition, the data analysis and display platform dynamically superimposes the sight line drop point on the image of the user's field of view, tracks the sight line direction in real time, supports visualization of statistical data over a period of time, generates an eye-movement heat map based on fixation time and drop point distribution, and shows it on the display device;
step 6-3: storing accurate and rich sight line estimation data in a background database, the specific data including but not limited to: time-stamped binocular images, sight line starting point coordinates, sight line drop point coordinates, fixation time and number of blinks;
step 6-4: generating an attention detection analysis report based on the acquired data, and providing diversified data export interfaces to provide data support for research in related fields.
CN202210040206.1A 2022-01-14 2022-01-14 System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network Pending CN114387679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210040206.1A CN114387679A (en) 2022-01-14 2022-01-14 System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210040206.1A CN114387679A (en) 2022-01-14 2022-01-14 System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network

Publications (1)

Publication Number Publication Date
CN114387679A true CN114387679A (en) 2022-04-22

Family

ID=81202185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210040206.1A Pending CN114387679A (en) 2022-01-14 2022-01-14 System and method for realizing sight line estimation and attention analysis based on recursive convolutional neural network

Country Status (1)

Country Link
CN (1) CN114387679A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648805A (en) * 2022-05-18 2022-06-21 华中科技大学 Course video sight correction model, training method thereof and sight drop point estimation method
CN116704572A (en) * 2022-12-30 2023-09-05 荣耀终端有限公司 Eye movement tracking method and device based on depth camera

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination