WO2020216054A1 - Method for training gaze tracking model, and gaze tracking method and apparatus - Google Patents

Method for training gaze tracking model, and gaze tracking method and apparatus

Info

Publication number
WO2020216054A1
WO2020216054A1 · PCT/CN2020/083486 · CN2020083486W
Authority
WO
WIPO (PCT)
Prior art keywords
eye
image
predicted
sight
vector
Prior art date
Application number
PCT/CN2020/083486
Other languages
English (en)
French (fr)
Inventor
周正
季兴
王一同
朱晓龙
罗敏
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2020216054A1
Priority to US17/323,827 (US11797084B2)

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013 Eye tracking input arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/90 Dynamic range modification of images or parts thereof
    • G06T 5/92 Dynamic range modification of images or parts thereof based on global image properties
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/19 Sensors therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/193 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/197 Matching; Classification

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method for training a gaze tracking model, and a gaze tracking method, apparatus, device, and storage medium.
  • Gaze tracking technology, also known as eye tracking technology, uses software algorithms, mechanical, electronic, optical, and other detection methods to obtain a subject's current direction of visual attention. It is widely used in human-computer interaction, assisted driving, psychological research, virtual reality, and the military field.
  • Line-of-sight estimation is usually realized with geometric methods. Geometric methods often require peripherals: based on a camera or an eye tracker, the line of sight is estimated in three dimensions with the help of dual light sources.
  • The embodiments of this application provide a method for training a gaze tracking model that needs no peripherals: the cosine distance between the predicted value and the labeled value is used as the model loss to train the gaze tracking model, so that the trained model can subsequently be used for gaze tracking.
  • the embodiments of the present application also provide corresponding devices, equipment, and storage media.
  • A first aspect of this application provides a method for training a gaze tracking model, including:
  • acquiring a training sample set, the training sample set including training sample pairs, where each training sample pair includes an eye sample image and an annotated gaze vector corresponding to the eye sample image;
  • processing the eye sample image through an initial gaze tracking model to obtain a predicted gaze vector of the eye sample image; determining a model loss according to the cosine distance between the predicted gaze vector and the annotated gaze vector; and
  • iteratively adjusting the reference parameters of the initial gaze tracking model until the model loss meets a convergence condition, to obtain a target gaze tracking model.
  • A second aspect of this application provides a gaze tracking method, including:
  • acquiring a target eye image; processing the target eye image using a target gaze tracking model to determine a predicted gaze vector of the target eye image, where the target gaze tracking model is a gaze tracking model trained by the method described in the first aspect; and performing gaze tracking according to the predicted gaze vector.
  • a third aspect of the present application provides a device for training a gaze tracking model, including:
  • an acquiring module, configured to acquire a training sample set, the training sample set including training sample pairs, where each training sample pair includes an eye sample image and an annotated gaze vector corresponding to the eye sample image;
  • a training module, configured to process, through an initial gaze tracking model, the eye sample image acquired by the acquiring module, to obtain a predicted gaze vector of the eye sample image;
  • a first processing module, configured to determine a model loss according to the cosine distance between the predicted gaze vector obtained by the training module and the annotated gaze vector;
  • the second processing module is configured to iteratively adjust the reference parameters of the initial line-of-sight tracking model until the model loss processed by the first processing module meets a convergence condition, so as to obtain a target line-of-sight tracking model.
  • a fourth aspect of the present application provides a sight tracking device, including:
  • the acquisition module is used to acquire the target eye image
  • a processing module, configured to process, using a target gaze tracking model, the target eye image acquired by the acquisition module, to determine a predicted gaze vector of the target eye image, where the target gaze tracking model is a gaze tracking model trained by the method of the first aspect;
  • the line of sight tracking module is used to track the line of sight according to the predicted line of sight vector obtained by the processing module.
  • a fifth aspect of the present application provides a computer device, the computer device including a processor and a memory:
  • the memory is used for storing program code; the processor is used for executing the method for training the gaze tracking model described in the first aspect according to the instructions in the program code.
  • a sixth aspect of the present application provides a computer device, the computer device including a processor and a memory:
  • the memory is used to store a target gaze tracking model, where the target gaze tracking model is a gaze tracking model trained according to the method for training a gaze tracking model described in the first aspect; the processor is used to run the target gaze tracking model to perform gaze tracking.
  • A seventh aspect of the present application provides a computer-readable storage medium, including instructions, which, when run on a computer, cause the computer to execute the method for training a gaze tracking model described in the first aspect.
  • An eighth aspect of the present application provides a computer-readable storage medium, including instructions, which when run on a computer, cause the computer to execute the method for gaze tracking described in the second aspect.
  • In the embodiments of this application, the predicted gaze vector is obtained, and the cosine distance between the predicted gaze vector and the annotated gaze vector is used as the model loss for model training to obtain the target gaze tracking model. For subsequent gaze tracking there is no need for peripherals: the collected eye image is simply input into the target gaze tracking model, which simplifies the gaze tracking process. Moreover, training with the cosine distance as the model loss better represents the difference between the predicted value and the labeled value, thereby improving the prediction accuracy of the trained gaze tracking model.
  • FIG. 1 is a schematic diagram of an example of an application scenario of gaze tracking in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a scene of gaze tracking model training in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an embodiment of a method for training a gaze tracking model in an embodiment of the present application
  • FIG. 4 is a schematic diagram of another embodiment of the method for training a gaze tracking model in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of the feature processing process of the anti-residual block in the embodiment of the present application.
  • FIG. 6 is a schematic diagram of an embodiment of a method for gaze tracking provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a processing result of a third-order Bezier curve in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an embodiment of an apparatus for training a gaze tracking model in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another embodiment of the device for training a gaze tracking model in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an embodiment of a device for gaze tracking in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of an embodiment of a server provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an embodiment of a terminal device provided by an embodiment of the present application.
  • The embodiments of this application provide a method for training a gaze tracking model that takes the cosine of the angle between the predicted value and the labeled value as the loss function. This better represents the difference between the predicted value and the labeled value, and ensures that the gaze tracking model has higher prediction accuracy.
  • the embodiments of the present application also provide corresponding devices, equipment, and storage media. Detailed descriptions are given below.
  • Gaze tracking is a machine vision technology: an image sensor captures images of the eyeball, image processing extracts eyeball features, and the user's gaze point is computed in real time from these features.
  • Once the user's gaze point is known, it can be determined that the user is interested in the content at the gaze point, and the information at the gaze point can be enlarged. For example, if the gaze point falls on a small picture, the small picture can be enlarged into a large one. Taking FIG. 1 as an example, the process of control through the eyeball is briefly introduced.
  • If the user's gaze stays at a certain point on the mobile phone for more than a preset duration, the phone collects the user's eye image, and the user's gaze point can be determined by analyzing that eye image.
  • The content being gazed at can then be acted upon. For example, if the user has been staring at a small picture on the phone and the gaze duration reaches the duration threshold, the phone can enlarge the display size of the picture on the screen, which makes it easier for the user to read the information of interest.
  • the driving assistance system can collect the driver's eye images in real time, and analyze the eye images to determine the driver's eye gaze point. If the gaze point deviates from the road, the driving assistance system can give a reminder (for example, sound an audible alarm) to improve safety during driving.
  • Gaze tracking technology has changed how people interact with computer devices: manual operation is no longer necessary, and computer devices can also be controlled by eye movement.
  • Eyeball control is very similar to mouse click-and-select operations: by looking, the eyeball can make a selection and then activate controls such as buttons, icons, links, or text.
  • The eyeball can trigger a selection by looking at a point for longer than a certain time: for example, if the gaze hovers on a selectable target and stays essentially still for a predetermined period, say 800 milliseconds, control of that selectable target is achieved. A rough illustration of this dwell-based selection follows below.
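  • A non-authoritative sketch of such dwell-based selection (not part of the patent text; the 0.8 s threshold, the gaze-sample format, and the helper name are illustrative assumptions):

```python
import time

DWELL_SECONDS = 0.8  # roughly the 800 ms dwell threshold mentioned above (assumed value)

def dwell_select(gaze_samples, target_rect, now=None):
    """Return True once the gaze has stayed inside target_rect for DWELL_SECONDS.

    gaze_samples: iterable of (timestamp, x, y) gaze points, oldest first.
    target_rect: (x0, y0, x1, y1) bounding box of the selectable control.
    """
    now = time.time() if now is None else now
    x0, y0, x1, y1 = target_rect
    dwell_start = None
    for ts, x, y in gaze_samples:
        if x0 <= x <= x1 and y0 <= y <= y1:
            dwell_start = ts if dwell_start is None else dwell_start  # keep earliest in-target time
        else:
            dwell_start = None  # gaze left the target, reset the dwell timer
    return dwell_start is not None and (now - dwell_start) >= DWELL_SECONDS
```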
  • There may be many other examples of eyeball control, which are not listed one by one in the embodiments of this application.
  • Determination of the user's gaze point is realized based on a target gaze tracking model, which may be a deep learning model such as a convolutional neural network.
  • the target gaze tracking model is trained through a large amount of sample data.
  • To this end, an embodiment of this application provides a method for training a gaze tracking model, which can train a target gaze tracking model with higher prediction accuracy.
  • FIG. 2 is a schematic diagram of an application scenario of gaze tracking model training provided by an embodiment of the application.
  • This scenario includes a database 101 and a server 102 for training a gaze tracking model, and the database 101 and the server 102 are connected through a network.
  • the database 101 can also be integrated on the server 102.
  • the database is located on an independent device as an example.
  • The training sample set in the database 101 includes multiple training sample pairs, and each training sample pair includes an eye sample image and the annotated gaze vector corresponding to the eye sample image. These training sample pairs can be specially produced by developers, or reported by users through rewarded participation; of course, they can also be obtained in other ways.
  • the method for obtaining the training sample pair is not specifically limited in this application.
  • the database 101 can provide a training sample set for the server.
  • After the server 102 obtains the training sample set from the database 101 through the network, it inputs the eye sample images into the initial gaze tracking model.
  • the server 102 processes the eye sample image through the initial gaze tracking model to obtain the predicted gaze vector of the eye sample image;
  • The server 102 takes the cosine between the predicted gaze vector and the annotated gaze vector as the loss function, to determine the similarity between the predicted gaze vector and the annotated gaze vector;
  • the server 102 iteratively adjusts the reference parameters of the initial line-of-sight tracking model until the similarity satisfies the convergence condition to obtain the target line-of-sight tracking model.
  • the server 102 may further send the target gaze tracking model to the terminal device to run the target gaze tracking model on the terminal device, and use these target gaze tracking models to implement corresponding functions.
  • The server 102 uses the cosine distance between the predicted gaze vector and the annotated gaze vector as the model loss, which better represents the difference between the predicted value and the labeled value and ensures that the gaze tracking model has higher prediction accuracy.
  • the application scenario shown in FIG. 2 is only an example.
  • The gaze tracking model training process provided by the embodiments of this application can also be applied to other application scenarios; no restriction is placed here on the application scenarios of the model training process.
  • the gaze tracking model training process can be applied to devices with a model training function, such as terminal devices, servers, etc.
  • the terminal device can be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), a tablet computer, etc.
  • the server can be an application server or a web server.
  • The server can be a standalone server or a cluster server.
  • the terminal device and the server can train the gaze tracking model separately or interact with each other to train the gaze tracking model.
  • For example, the terminal device can obtain the training sample set from the server and then use the training sample set to train the gaze tracking model, or the server may obtain the training sample set from the terminal and use it to train the gaze tracking model.
  • the terminal device or server executes the gaze tracking model training process provided in the embodiments of the present application.
  • After training, the target gaze tracking model can be sent to other terminal devices so that it runs on those terminal devices and realizes the corresponding functions; the target gaze tracking model can also be sent to other servers to run on them, with the corresponding functions realized through those servers.
  • FIG. 3 is a schematic diagram of an embodiment of a method for training a gaze tracking model provided by an embodiment of the application.
  • the following embodiments are described with a server as the execution subject. It should be understood that the execution subject of the method for gaze tracking model training is not limited to the server, and can also be applied to terminal equipment and other devices with model training functions.
  • the method for training the gaze tracking model includes the following steps:
  • The training sample set includes training sample pairs, where each training sample pair includes an eye sample image and an annotated gaze vector corresponding to the eye sample image.
  • the labeled gaze vector is the real data label of the eyeball gaze direction in the eye sample image, which is used to supervise the training result during the training process, and can also be called ground-truth.
  • the labeled sight vector is a three-dimensional space vector, including three dimensions of xyz.
  • The training sample set in the embodiments of this application may include real eye images and the annotated gaze vectors corresponding to those images, and may also include synthetic eye images and the annotated gaze vectors corresponding to the synthetic eye images. A real eye image is an eye image captured directly by a camera or similar device, while a synthetic eye image is an eye image synthesized by software tools.
  • the training sample set in the embodiment of the present application includes real eye images and synthetic eye images, which can improve the robustness of the gaze tracking model.
  • the server uses a deep learning method to construct an initial gaze tracking model, and uses the model to predict the gaze vector corresponding to the eye sample image to obtain the predicted gaze vector.
  • the predicted sight vector is also a three-dimensional space vector.
  • the initial line-of-sight tracking model includes a feature extraction network (used to extract image features of the eye sample image) and a regression network (used to regress the extracted image features to obtain the line of sight vector).
  • The server determines the model loss based on the cosine distance between the annotated gaze vector and the predicted gaze vector, and performs model training.
  • The cosine distance characterizes the angle formed between space vectors: the smaller the angle between the vectors (that is, the larger the cosine value), the higher their similarity; conversely, the larger the angle (that is, the smaller the cosine value), the lower their similarity.
  • The cosine between the predicted gaze vector and the annotated gaze vector is cos(θ), and the model loss is 1 - cos(θ).
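  • A minimal NumPy sketch of this loss (illustrative only; the vector values and the small epsilon guard are assumptions, not taken from the patent):

```python
import numpy as np

def cosine_loss(predicted, labeled, eps=1e-8):
    """Model loss 1 - cos(theta) between a predicted and an annotated gaze vector."""
    predicted = np.asarray(predicted, dtype=np.float64)
    labeled = np.asarray(labeled, dtype=np.float64)
    cos_theta = predicted.dot(labeled) / (np.linalg.norm(predicted) * np.linalg.norm(labeled) + eps)
    return 1.0 - cos_theta  # smaller angle -> cos closer to 1 -> smaller loss

# Example: two nearly parallel gaze vectors give a loss close to 0.
print(cosine_loss([0.10, -0.20, -0.97], [0.12, -0.18, -0.98]))
```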
  • The server adopts the stochastic gradient descent (SGD) algorithm to adjust the reference parameters of the initial gaze tracking model (also called model parameters or network weights), and uses the parameter-adjusted model to re-predict, until the model loss meets the convergence condition.
  • The process of adjusting the model parameters so that the model loss meets the convergence condition is the process of making the predicted gaze vector tend toward the annotated gaze vector.
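  • A schematic PyTorch-style training loop for this step, assuming a model and data loader that yield eye images and annotated gaze vectors (the names, learning rate, and convergence threshold are placeholder assumptions, not the patent's settings):

```python
import torch
import torch.nn.functional as F

def train_gaze_model(model, loader, epochs=50, lr=0.01, tol=1e-3):
    """Iteratively adjust the model parameters with SGD until the 1 - cos(theta) loss converges."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        running = 0.0
        for eye_images, gaze_labels in loader:            # (N, 3, H, W), (N, 3)
            pred = model(eye_images)                      # predicted gaze vectors, (N, 3)
            loss = (1.0 - F.cosine_similarity(pred, gaze_labels, dim=1)).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        if running / max(len(loader), 1) < tol:           # simple convergence condition
            break
    return model
```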
  • In the embodiments of this application, the predicted gaze vector is obtained, and the cosine distance between the predicted gaze vector and the annotated gaze vector is used as the model loss. Model training is then performed to obtain the target gaze tracking model. For subsequent gaze tracking there is no need for peripherals: the collected eye image is simply input into the target gaze tracking model, which simplifies the gaze tracking process. Using the cosine distance as the model loss also better represents the difference between the predicted value and the labeled value, thereby improving the prediction accuracy of the trained gaze tracking model.
  • the training sample pair further includes annotated coordinates of the eyeball in the eye sample image;
  • processing the eye sample image through the initial gaze tracking model to obtain the predicted gaze vector of the eye sample image may include: processing the eye sample image through the initial gaze tracking model to obtain the predicted gaze vector of the eye sample image and the predicted coordinates of the eyeball;
  • the method may also include: determining the model loss according to the Euclidean distance between the predicted coordinates of the eyeball and the annotated coordinates of the eyeball.
  • In this embodiment, in addition to the branch for predicting the gaze vector, a branch for predicting the eyeball coordinates is also trained, so as to achieve multi-task learning (MTL). Therefore, the training sample pair also contains the annotated coordinates of the eyeball in the eye sample image.
  • When the initial gaze tracking model processes the eye sample images, it also outputs the predicted coordinates of the eyeballs.
  • the predicted coordinates or label coordinates of the eyeball refer to the position coordinates of the center point of the pupil of the eyeball; and, the predicted coordinates and the label coordinates are two-dimensional spatial coordinates, including two xy dimensions.
  • For the position coordinates, the server uses the Euclidean distance to represent the difference between the predicted value and the labeled value, and uses the Euclidean distance as part of the model loss for model training; that is, the model loss of the gaze tracking model is composed of the cosine distance and the Euclidean distance.
  • When training the gaze tracking model, not only the predicted gaze vector but also the predicted coordinates of the eyeball are considered. This further improves the robustness of the gaze tracking model and also realizes multi-task learning. A sketch of such a combined loss is given below.
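  • A hedged sketch of how the two terms might be combined (the weighting factor lambda_coord is an assumption; the patent only states that the model loss is composed of a cosine term and a Euclidean term):

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_gaze, gt_gaze, pred_coord, gt_coord, lambda_coord=1.0):
    """Cosine-distance loss for the gaze vector plus Euclidean (L2) loss for the pupil-center coordinates."""
    gaze_loss = (1.0 - F.cosine_similarity(pred_gaze, gt_gaze, dim=1)).mean()
    coord_loss = torch.norm(pred_coord - gt_coord, dim=1).mean()  # per-sample Euclidean distance
    return gaze_loss + lambda_coord * coord_loss
```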
  • Optionally, the annotated gaze vector is a direction vector based on a unit circle (i.e., a unit vector).
  • Determining the model loss according to the cosine distance between the predicted gaze vector and the annotated gaze vector may include: normalizing the predicted gaze vector to obtain a normalized gaze vector; and
  • determining the model loss according to the cosine distance between the normalized gaze vector and the annotated gaze vector.
  • In this embodiment, before taking the cosine between the predicted gaze vector and the annotated gaze vector as the loss function, the predicted gaze vector is first normalized to obtain a normalized gaze vector, and the cosine distance is then computed between the normalized gaze vector and the annotated gaze vector. Normalizing the feature values keeps the computed loss within the unit circle and ultimately makes the prediction more robust.
  • processing the eye sample image through the initial gaze tracking model to obtain the predicted gaze vector of the eye sample image may include:
  • performing at least one of the following processing on the eye sample image: affine transformation, white balance, automatic contrast, or Gaussian blur;
  • flipping a first eye sample image in the training sample set into a second eye sample image and correspondingly flipping the annotated gaze vector corresponding to the first eye sample image, where the second eye sample image is an image of the eye in the target orientation and the initial gaze tracking model is used to process images of the eye in the target orientation (if the first eye sample image is a right-eye sample image, the second eye sample image is a left-eye sample image; if the first eye sample image is a left-eye sample image, the second eye sample image is a right-eye sample image);
  • using the inverted residual blocks in the initial gaze tracking model to map the standard image to obtain the predicted gaze vector of the standard image.
  • Processing such as affine transformation, white balance, automatic contrast, or Gaussian blur is performed on the eye sample image first, which can improve the generalization of the gaze tracking model.
  • Gaussian blur adjusts pixel color values according to a Gaussian curve and can selectively blur an image: the color values of the pixels around a given point are weighted according to the Gaussian curve, and a weighted average is taken to obtain the new color value of that point.
  • Auto contrast refers to the measurement of different brightness levels between the brightest white and the darkest black in an image. The larger the difference range, the greater the contrast, and the smaller the difference range, the smaller the contrast.
  • An affine transformation is geometrically defined as an affine mapping between two vector spaces, consisting of a non-singular linear transformation followed by a translation.
  • Each affine transformation can be given by a matrix A and a vector b, and can be written as A with an additional column b.
  • the server may also use other methods to preprocess the image to improve the generalization of the gaze tracking model obtained by training, which is not limited in this embodiment.
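  • The following PIL/NumPy sketch illustrates these kinds of preprocessing (illustrative only: the parameter values are arbitrary, a small rotation plus translation stands in for a general affine transform, and the gray-world method is just one possible white-balance approach):

```python
import numpy as np
from PIL import Image, ImageFilter, ImageOps

def gray_world_white_balance(img):
    """Simple gray-world white balance: rescale each channel toward the global mean."""
    arr = np.asarray(img, dtype=np.float32)
    channel_means = arr.reshape(-1, 3).mean(axis=0)
    gain = channel_means.mean() / (channel_means + 1e-6)
    return Image.fromarray(np.clip(arr * gain, 0, 255).astype(np.uint8))

def augment_eye_image(img):
    """Apply an affine-like transform, white balance, automatic contrast, and Gaussian blur."""
    img = img.rotate(5, translate=(2, 1))                     # small rotation + shift (affine-like)
    img = gray_world_white_balance(img)                       # white balance
    img = ImageOps.autocontrast(img)                          # automatic contrast
    img = img.filter(ImageFilter.GaussianBlur(radius=1.0))    # Gaussian blur
    return img

# eye = Image.open("eye_sample.png").convert("RGB")  # hypothetical file name
# augmented = augment_eye_image(eye)
```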
  • In this embodiment, the gaze tracking model processes only images of the eye in the target orientation and obtains the predicted gaze direction, where the target-orientation eye can be the left eye or the right eye.
  • the gaze tracking model in the embodiments of the present application can be trained only for the left eye, or only for the right eye.
  • For example, if the model is trained only for the left eye, right-eye images can be flipped into left-eye images and used for model training, and the corresponding annotated gaze vectors are correspondingly flipped into the annotated gaze vectors of the left-eye images.
  • When predicting, the server first crops an eye image of the size required by the model from the image containing the face, according to the key points of the left and right eye corners; a right-eye image is flipped into a left-eye image before being input into the model for prediction.
  • If the gaze tracking model can predict the gaze direction of both left-eye and right-eye images, the flipping of images and annotated gaze vectors can be omitted from the training process; this is not repeated in this embodiment.
  • the method may further include:
  • the predicted sight vector of the standard image is flipped back to the space corresponding to the first eye sample image.
  • That is, if the model is a model that requires left-eye image input, a right-eye image is flipped into a left-eye image and input into the model for prediction, and the obtained prediction result then needs to be flipped back into the space of the right eye.
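  • A small sketch of the flip and flip-back described above (the sign convention, i.e. that a horizontal mirror negates the x component of the gaze vector and mirrors the pupil-center abscissa, is an assumption for illustration):

```python
import numpy as np
from PIL import Image, ImageOps

def flip_right_eye_sample(image, gaze_vector, pupil_xy=None):
    """Mirror a right-eye crop into left-eye space and flip the labels accordingly."""
    flipped_img = ImageOps.mirror(image)                      # horizontal flip
    gx, gy, gz = gaze_vector
    flipped_gaze = np.array([-gx, gy, gz])                    # x direction of the gaze is negated
    flipped_xy = None
    if pupil_xy is not None:
        width, _ = image.size
        flipped_xy = (width - 1 - pupil_xy[0], pupil_xy[1])   # pupil-center abscissa is mirrored
    return flipped_img, flipped_gaze, flipped_xy

def unflip_prediction(pred_gaze):
    """Map a prediction made in left-eye space back into the original right-eye space."""
    px, py, pz = pred_gaze
    return np.array([-px, py, pz])
```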
  • Optionally, when the target-orientation eye is the left eye, the method further includes: obtaining a first abscissa value of the predicted gaze vector of the left eye and a second abscissa value of the predicted gaze vector of the right eye; and
  • if the first abscissa value indicates that the left eye is looking to the left and the second abscissa value indicates that the right eye is looking to the right, correcting the first abscissa value and the second abscissa value.
  • Correcting the first abscissa value and the second abscissa value may include: determining an average value according to the first abscissa value and the second abscissa value, and adjusting the predicted gaze vector of the right eye to be parallel to the predicted gaze vector of the left eye to obtain a third abscissa value; and
  • determining a fourth abscissa value for the abscissa of the right eye according to the average value and the third abscissa value.
  • In this embodiment, the server corrects the (right eye's) gaze vector.
  • Specifically, the server first determines the average value of the abscissas of the left eye and the right eye according to the first abscissa value and the second abscissa value, and adjusts the right eye's predicted gaze vector to be parallel to the left eye's predicted gaze vector. The abscissa of the right-eye predicted gaze vector after this parallel processing is then corrected using the average value, so that the corrected right-eye predicted gaze vector and the left-eye predicted gaze vector are consistent in the x-axis direction.
  • Similarly, when the target-orientation eye is the right eye, the server corrects the (left eye's) gaze vector; this embodiment does not repeat the correction process.
  • In this way, the predicted left-eye and right-eye gaze vectors are reasonably corrected to obtain the final result; a sketch of one possible correction follows below.
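  • One plausible reading of this correction is sketched below (the sign convention that a negative abscissa means looking left, and the use of a plain average for both eyes, are assumptions; the patent's exact parallel-adjustment step is not reproduced):

```python
import numpy as np

def correct_divergent_gaze(left_gaze, right_gaze):
    """If the left eye appears to look left (x < 0) while the right eye looks right (x > 0),
    replace both abscissas with their average so the two gaze vectors agree along the x axis."""
    left = np.asarray(left_gaze, dtype=np.float64).copy()
    right = np.asarray(right_gaze, dtype=np.float64).copy()
    if left[0] < 0 and right[0] > 0:                      # physically implausible divergence
        mean_x = (left[0] + right[0]) / 2.0
        left[0] = mean_x
        right[0] = mean_x
    return left, right
```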
  • Optionally, the number of inverted residual blocks is less than 19.
  • In order to make the trained target gaze tracking model applicable to mobile terminals, the gaze tracking model is trimmed: the number of inverted residual blocks can be cut to only 5, thereby reducing the size of the target gaze tracking model and making it easy to deploy on a mobile terminal.
  • The value 5 here is just an example; there may be 6, 4, or another number.
  • As shown in FIG. 4, this embodiment of the present application uses MobileNet V2 as the backbone of the gaze tracking model.
  • MobileNet V2 includes a series of inverted residual blocks that improve the performance of the model, enhance the expressiveness of the model features, and reduce the amount of computation.
  • The structure of the inverted residual block is shown in FIG. 5. As can be seen from FIG. 5, the inverted residual block first uses a 1x1 convolution 51 to expand the dimensions of the input feature map, then uses a 3x3 depthwise convolution 52 to obtain more expressive features, and finally uses a 1x1 convolution 53 to reduce the channel dimension, after which the initial input features and the output features are joined. The 1x1 convolution increases the input dimension of the depthwise convolution, which effectively alleviates feature degradation.
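  • A minimal PyTorch sketch of such an inverted residual block (the expansion ratio, ReLU6 activation, and batch normalization follow common MobileNet V2 practice and are assumptions rather than details taken from the patent):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 expansion conv -> 3x3 depthwise conv -> 1x1 projection conv, with a skip connection."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                              # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),   # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),                             # 1x1 projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

# x = torch.randn(1, 16, 56, 56)
# y = InvertedResidual(16, 16)(x)   # same shape, skip connection applied
```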
  • The MobileNet V2 structure provided by this embodiment is a trimmed MobileNet V2: the number of inverted residual blocks is reduced to 5, and the number of channels output by each layer is correspondingly reduced, so that the model can be deployed on a mobile terminal.
  • The structure of the trimmed MobileNet V2 can be understood with reference to Table 1, where t represents the expansion factor, c is the output channel dimension of the current sequence, n is the number of times the layer is repeated, and s is the stride.
  • the initial line-of-sight tracking model in the embodiments of the present application is not limited to the MobileNet v2 model provided above, and may also be other structures or other types of models.
  • The MobileNet V2 model first processes the input eye sample images with operations such as affine transformation, white balance, automatic contrast, and Gaussian blur to augment the data and improve the generalization ability of the model.
  • The feature representation obtained by multi-level mapping of the eye sample image is then used to build regressors that predict the gaze vector and the eyeball coordinates.
  • The annotated gaze vector of the eye sample image can be expressed as three spatial components (x1, y1, z1). The predicted gaze vector of the eye sample image, i.e., the output of the fully connected layer, is the three values (x2, y2, z2), and the output for the eyeball coordinates is the two values (x', y'). Predicting the z2 value of the gaze vector mainly serves the normalization of the vector.
  • The gaze regressor in the embodiments of this application uses the cosine distance between the predicted gaze vector and the annotated gaze vector as the loss, because the gaze vector is a direction vector based on the unit circle; taking the cosine represents well the angular difference between the learned predicted gaze vector and the annotated gaze vector, so that the prediction is closer to the true value.
  • For the eyeball coordinates, the Euclidean distance (L2 distance loss) is used as the loss function.
  • In addition, this application adds a normalization layer, which normalizes the feature values so that the computed loss lies within the unit circle, finally making the prediction result more robust.
  • x2' = (x2 - μ)/σ, y2' = (y2 - μ)/σ, z2' = (z2 - μ)/σ, where μ is the average of the three values (x2, y2, z2) and σ is the variance of the three values (x2, y2, z2).
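  • A small sketch of this normalization step (μ is taken as the mean of the three components; the text calls σ the variance, while the sketch uses the standard deviation so that the result stays on the scale of a unit vector; this interpretation is an assumption):

```python
import numpy as np

def normalize_prediction(pred):
    """Standardize the predicted (x2, y2, z2) components as in the formulas above."""
    pred = np.asarray(pred, dtype=np.float64)
    mu = pred.mean()
    sigma = pred.std() + 1e-8
    return (pred - mu) / sigma
```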
  • The cosine similarity is cos(θ) = (a·b)/(‖a‖‖b‖), where a is the annotated gaze vector and b is the predicted gaze vector. This formula computes the similarity between the two vectors, so the larger the value, the closer the two vectors are. The network actually uses 1 - cos(θ) to calculate the loss between the two vectors; the smaller this value, the closer they are.
  • The input in the embodiment of this application can be a left-eye image of 112px × 112px.
  • During training, all right-eye images are flipped into left-eye images, and the annotated gaze vectors are flipped in the same way; if annotated eyeball coordinates are also available, they need to be flipped as well.
  • At prediction time, the face image is first cropped into an eye image of the model input size according to the key points of the left and right eye corners, and a right-eye image is flipped into a left-eye image and input into the model for prediction.
  • The obtained prediction result is then flipped back into the space of the right eye, and the left-eye and right-eye gaze vectors predicted by the network are reasonably corrected to obtain the final result.
  • the target gaze tracking model After the target gaze tracking model is trained, the target gaze tracking model can be applied to different actual scenes. No matter what kind of scene it is applied to, it is necessary to obtain the predicted sight vector, so as to realize the corresponding sight tracking process.
  • an embodiment of the method for gaze tracking provided in the embodiment of the present application may include:
  • acquiring a target eye image; using a target gaze tracking model to process the target eye image to determine a predicted gaze vector of the target eye image; and performing gaze tracking according to the predicted gaze vector.
  • the target gaze tracking model is a gaze tracking model obtained according to the method for training the gaze tracking model described above.
  • Optionally, processing the target eye image using the target gaze tracking model and determining the predicted gaze vector of the target eye image may further include: determining the coordinates of the eyeball in the target eye image.
  • Performing gaze tracking according to the predicted gaze vector may then include: taking the coordinates of the eyeball as the starting point of the line of sight, and tracking the line of sight in the direction indicated by the predicted gaze vector.
  • the process of determining the predicted line of sight vector of the target eye image can be understood by referring to the previous process of determining the predicted line of sight vector of the eye sample image, and the details are not repeated here.
  • When the target gaze tracking model is used to track the gaze of a human eye in a video stream, the position of the eye region cropped from each video frame jitters, and the target gaze tracking model processes the eye image of each video frame separately with no context awareness (that is, the processing result is not affected by the result of the previous video frame), so the predicted gaze direction will also jitter.
  • Therefore, the predicted gaze vectors corresponding to the eye images in the video frames preceding the current target eye image can be used, together with a smoothing algorithm, to smooth the predicted gaze vector of the current target eye image.
  • Specifically, a reference eye image corresponding to the target eye image is determined, where the reference eye image and the target eye image are images in consecutive video frames of the video stream; the predicted gaze vector corresponding to the target eye image is then smoothed according to the predicted gaze vector corresponding to the reference eye image.
  • For example, when the target eye image is in the i-th video frame, the terminal determines at least one video frame before the i-th video frame (such as the (i-1)-th, (i-2)-th, and (i-3)-th video frames) as reference video frames, and smooths the predicted gaze vector of the i-th video frame according to the predicted gaze vectors corresponding to the eye images in the reference video frames.
  • a Bezier curve can be used in the smoothing process, and the Bezier curve can be a first-order, second-order, third-order Bezier curve, etc., which is not limited in this embodiment.
  • The smoothing formula of the third-order (cubic) Bezier curve is: B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3, where B(t) is the smoothed predicted gaze vector corresponding to the current target eye image, Pi (the control points) are the predicted gaze vectors corresponding to the reference eye images, and t is the introduced parameter, ranging from 0 to 1.
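  • A NumPy sketch of cubic Bezier smoothing over gaze vectors (the choice of the four control points, i.e. the predictions of frames i-3 through i, and the value of t are illustrative assumptions):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t=0.8):
    """B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3, applied component-wise."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=np.float64) for p in (p0, p1, p2, p3))
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 + 3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3

# Example: smooth the prediction of frame i with the predictions of the three preceding frames.
# smoothed_i = cubic_bezier(pred[i - 3], pred[i - 2], pred[i - 1], pred[i], t=0.8)
```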
  • a weighted moving average and exponential smoothing algorithm can also be used for smoothing processing, which is not limited in this embodiment.
  • this application also provides a corresponding gaze tracking model training device, so that the gaze tracking model training method described above can be applied and realized in practice.
  • FIG. 8 is a schematic diagram of an embodiment of an apparatus 40 for training a gaze tracking model provided by an embodiment of the present application.
  • The obtaining module 401 is configured to obtain a training sample set, where the training sample set includes training sample pairs, and each training sample pair includes an eye sample image and an annotated gaze vector corresponding to the eye sample image;
  • the training module 402 is configured to process, through the initial gaze tracking model, the eye sample images acquired by the obtaining module 401, to obtain the predicted gaze vectors of the eye sample images;
  • the first processing module 403 is configured to determine the model loss according to the cosine distance between the predicted gaze vector obtained by the training module 402 and the annotated gaze vector;
  • the second processing module 404 is configured to iteratively adjust the reference parameters of the initial line-of-sight tracking model until the model loss processed by the first processing module 403 meets the convergence condition, so as to obtain the target line-of-sight tracking model.
  • In the embodiments of this application, the predicted gaze vector is obtained, and the cosine distance between the predicted gaze vector and the annotated gaze vector is used as the model loss for model training to obtain the target gaze tracking model. For subsequent gaze tracking there is no need for peripherals: the collected eye image is simply input into the target gaze tracking model, which simplifies the gaze tracking process. Moreover, training with the cosine distance as the model loss better represents the difference between the predicted value and the labeled value, thereby improving the prediction accuracy of the trained gaze tracking model.
  • Optionally, when the training sample pair also includes the annotated coordinates of the eyeball in the eye sample image, the training module 402 is configured to process the eye sample image through the initial gaze tracking model to obtain the predicted gaze vector of the eye sample image and the predicted coordinates of the eyeball;
  • the first processing module 403 is further configured to determine the model loss according to the Euclidean distance between the predicted coordinates of the eyeball and the marked coordinates of the eyeball.
  • the device 40 further includes:
  • the third processing module 405 is used for normalizing the predicted sight vector to obtain a normalized sight vector
  • the first processing module 403 is configured to determine the model loss according to the cosine distance of the normalized line-of-sight vector and the labeled line-of-sight vector.
  • the training module 402 is used to:
  • perform at least one of the following processing on the eye sample image: affine transformation, white balance, automatic contrast, or Gaussian blur;
  • flip a first eye sample image in the training sample set into a second eye sample image and correspondingly flip the annotated gaze vector corresponding to the first eye sample image, where the second eye sample image is an image of the eye in the target orientation and the initial gaze tracking model is used to process images of the eye in the target orientation (if the first eye sample image is a right-eye sample image, the second eye sample image is a left-eye sample image; if the first eye sample image is a left-eye sample image, the second eye sample image is a right-eye sample image); and
  • use the inverted residual blocks in the initial gaze tracking model to perform mapping processing on the standard image to obtain the predicted gaze vector of the standard image.
  • Optionally, the training module 402 is further configured to, when the standard image is obtained from the first eye sample image, flip the predicted gaze vector of the standard image back to the space corresponding to the first eye sample image.
  • Optionally, when the target-orientation eye is the left eye, the training module 402 is further configured to: obtain a first abscissa value of the predicted gaze vector of the left eye and a second abscissa value of the predicted gaze vector of the right eye; and
  • if the first abscissa value indicates that the left eye is looking to the left and the second abscissa value indicates that the right eye is looking to the right, correct the first abscissa value and the second abscissa value.
  • Optionally, the training module 402 is configured to: determine an average value according to the first abscissa value and the second abscissa value, and adjust the predicted gaze vector of the right eye to be parallel to the predicted gaze vector of the left eye to obtain a third abscissa value; and
  • determine a fourth abscissa value for the abscissa of the right eye according to the average value and the third abscissa value.
  • Optionally, the number of inverted residual blocks is less than 19.
  • this application also provides a corresponding device for gaze tracking, so that the above method of gaze tracking can be applied and realized in practice.
  • FIG. 10 is a schematic diagram of an embodiment of an apparatus 50 for gaze tracking provided by an embodiment of the application.
  • the obtaining module 501 is used to obtain an image of the target eye
  • the processing module 502 is configured to use a target gaze tracking model to process the target eye image acquired by the acquisition module 501, and determine the predicted gaze vector of the target eye image;
  • the sight tracking module 503 is configured to perform sight tracking according to the predicted sight vector obtained by the processing module 502.
  • processing module 502 is further configured to determine the coordinates of the eyeball in the target eye image
  • the line-of-sight tracking module 503 is configured to use the coordinates of the eyeball as the starting point of the line of sight, and perform line-of-sight tracking in the direction indicated by the predicted line of sight vector.
  • the device 50 for gaze tracking may further include a smoothing processing module, and the smoothing processing module is used for:
  • a reference eye image corresponding to the target eye image is determined, where the reference eye image and the target eye image are images in consecutive video frames of the video stream.
  • Smoothing is performed on the predicted sight vector corresponding to the target eye image according to the predicted sight vector corresponding to the reference eye image.
  • FIG. 11 is a schematic diagram of a server structure for gaze tracking model training provided by an embodiment of the application.
  • The server 700 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 722 (for example, one or more processors), a memory 732, and one or more storage media 730 (for example, one or more mass storage devices) for storing application programs 742 or data 744.
  • the memory 732 and the storage medium 730 may be short-term storage or persistent storage.
  • the program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server. Furthermore, the central processing unit 722 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the server 700.
  • the server 700 may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input and output interfaces 758, and/or one or more operating systems 741, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the steps performed by the server in the foregoing embodiment may be based on the server structure shown in FIG. 11.
  • the CPU 722 is used to execute the neural network model training process described in the above-mentioned Figures 1 to 6.
  • In addition, the present application also provides a server with a structure similar to that of the server shown in FIG. 11. Its memory is used to store a target gaze tracking model, where the target gaze tracking model is a gaze tracking model trained by the gaze tracking model training method provided in the embodiments of this application; its processor is used to run the target gaze tracking model for gaze tracking.
  • the embodiment of the application also provides another device for gaze tracking.
  • the device may be a terminal device, as shown in FIG. 12.
  • the terminal can be any terminal device including mobile phone, tablet computer, personal digital assistant (English full name: Personal Digital Assistant, English abbreviation: PDA), sales terminal (English full name: Point of Sales, English abbreviation: POS), on-board computer, etc. Take the terminal as a mobile phone as an example:
  • FIG. 12 shows a block diagram of a part of the structure of a mobile phone related to a terminal provided in an embodiment of the present application.
  • the mobile phone includes: radio frequency (English full name: Radio Frequency, English abbreviation: RF) circuit 810, memory 820, input unit 830, display unit 840, sensor 850, audio circuit 860, wireless fidelity (full English name: wireless fidelity , English abbreviation: WiFi) module 870, processor 880, and power supply 890 and other components.
  • The RF circuit 810 can be used to receive and send signals during information transmission and reception or during a call. In particular, downlink information from the base station is received and handed to the processor 880 for processing; in addition, uplink data is sent to the base station.
  • the RF circuit 810 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (English full name: Low Noise Amplifier, English abbreviation: LNA), a duplexer, and the like.
  • the RF circuit 810 can also communicate with the network and other devices through wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communications (English full name: Global System of Mobile communication, English abbreviation: GSM), General Packet Radio Service (English full name: General Packet Radio Service, GPRS) ), Code Division Multiple Access (English name: Code Division Multiple Access, English abbreviation: CDMA), Wideband Code Division Multiple Access (English name: Wideband Code Division Multiple Access, English abbreviation: WCDMA), Long Term Evolution (English name: Long Term Evolution, English abbreviation: LTE), email, short message service (English full name: Short Messaging Service, SMS), etc.
  • the memory 820 may be used to store software programs and modules.
  • the processor 880 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 820.
  • the memory 820 may mainly include a program storage area and a data storage area.
  • The program storage area may store an operating system, application programs required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book), and the like.
  • The memory 820 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The input unit 830 can be used to receive input control instructions and to generate key signal inputs related to user settings and function control of the mobile phone.
  • the input unit 830 may include a touch panel 831 and other input devices 832.
  • The touch panel 831, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 831 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 831 may include two parts: a touch detection device and a touch controller.
  • Optionally, the touch detection device detects the user's touch position and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and sends them to the processor 880, and can receive and execute commands sent by the processor 880.
  • the touch panel 831 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 830 may also include other input devices 832.
  • other input devices 832 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, and joystick.
  • The display unit 840 can be used to display the result of gaze tracking.
  • the display unit 840 may include a display panel 841.
  • Optionally, the display panel 841 may be configured in the form of a liquid crystal display (English full name: Liquid Crystal Display, English abbreviation: LCD), an organic light-emitting diode (English full name: Organic Light-Emitting Diode, English abbreviation: OLED), or the like.
  • Further, the touch panel 831 can cover the display panel 841. When the touch panel 831 detects a touch operation on or near it, it transmits the operation to the processor 880 to determine the type of the touch event, and the processor 880 then provides a corresponding visual output on the display panel 841 according to the type of the touch event.
  • Although in FIG. 12 the touch panel 831 and the display panel 841 are used as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 831 and the display panel 841 can be integrated to realize the input and output functions of the mobile phone.
  • the mobile phone may also include at least one sensor 850, and the target eye image can be collected through the sensor 850, of course, the target eye image can also be collected through a camera, or the target eye image can be collected through an eye tracker.
  • the sensor 850 is, for example, a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor.
  • the ambient light sensor can adjust the brightness of the display panel 841 according to the brightness of the ambient light.
  • The proximity sensor can turn off the display panel 841 and/or the backlight when the mobile phone is moved to the ear.
  • As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes) and the magnitude and direction of gravity when stationary; it can be used for applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer or tapping). Other sensors that can be configured in the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here again.
  • the audio circuit 860, the speaker 861, and the microphone 862 can provide an audio interface between the user and the mobile phone.
  • the audio circuit 860 can convert received audio data into an electrical signal and transmit it to the speaker 861, which converts it into a sound signal for output; conversely, the microphone 862 converts a collected sound signal into an electrical signal, which the audio circuit 860 receives and converts into audio data; after the audio data is processed by the processor 880, it is sent via the RF circuit 810 to, for example, another mobile phone, or output to the memory 820 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • the mobile phone can help the user send and receive e-mails, browse web pages, and access streaming media through the WiFi module 870, which provides the user with wireless broadband Internet access.
  • although FIG. 12 shows the WiFi module 870, it is understandable that it is not an essential component of the mobile phone and can be omitted as needed without changing the essence of the invention.
  • the processor 880 is the control center of the mobile phone; it connects the various parts of the entire mobile phone through various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 820 and calling the data stored in the memory 820, thereby monitoring the mobile phone as a whole.
  • the processor 880 may include one or more processing units; preferably, the processor 880 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may alternatively not be integrated into the processor 880.
  • the mobile phone also includes a power supply 890 (such as a battery) for supplying power to various components.
  • the power supply can be logically connected to the processor 880 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
  • the mobile phone may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • the processor 880 included in the terminal has the function of performing the corresponding gaze tracking based on the target gaze tracking model described above.
  • the embodiments of the present application also provide a computer-readable storage medium for storing program code, where the program code is used to execute any one of the implementations of the method for training a gaze tracking model described in the foregoing embodiments, or to execute the method for gaze tracking described in the foregoing embodiments.
  • the embodiments of the present application also provide a computer program product including instructions which, when run on a computer, cause the computer to execute any one of the implementations of the method for training a gaze tracking model described in the foregoing embodiments, or to execute the method for gaze tracking described in the foregoing embodiments.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for making a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (English full name: Read-Only Memory, English abbreviation: ROM), a random access memory (English full name: Random Access Memory, English abbreviation: RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Ophthalmology & Optometry (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Eye Examination Apparatus (AREA)

Abstract

本申请公开了一种视线追踪模型训练的方法,包括:获取训练样本集合,通过初始视线追踪模型对眼部样本图像进行处理,以得到眼部样本图像的预测视线向量,根据预测视线向量和标注视线向量的余弦距离确定模型损失,对初始视线追踪模型的参考参数进行迭代调整直到模型损失满足收敛条件,以得到目标视线追踪模型。采用本申请提供的方案,进行视线追踪时,无需借助外设,只需要将采集到的眼部图像输入目标视线追踪模型即可,简化了视线追踪的流程,并且以余弦距离作为模型损失训练模型,能够更好地表现预测值与标注值之间的差异性,进而提高了训练得到的视线追踪模型的预测准确度。

Description

视线追踪模型训练的方法、视线追踪的方法及装置
本申请实施例要求于2019年04月24日提交,申请号为201910338224.6、发明名称为“视线追踪模型训练的方法、视线追踪的方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请实施例中。
技术领域
本申请涉及人工智能技术领域,具体涉及一种视线追踪模型训练的方法、视线追踪的方法、装置、设备及存储介质。
背景技术
视觉追踪技术也称为眼动追踪技术,是利用软件算法、机械、电子、光学等各种检测手段获取受试者当前视觉注意方向的技术,它广泛应用于人机交互、辅助驾驶、心理研究、虚拟现实和军事等多个领域。
相关技术中,通常采用几何方法实现视线估计。几何方法往往需要借助外设,基于摄像机或眼动仪,通过双光源来对视线做三维估计。
相关技术中,采用几何方法实现视线估计时需要借助额外的设备,实现过程复杂且成本较高,进而导致视线估计的应用场景受限。
发明内容
本申请实施例提供一种视线追踪模型训练的方法,在不借助外设的情况下,采用预测值和标注值之间的余弦距离作为模型损失训练视线追踪模型,以便后续利用视线追踪模型进行视线追踪。本申请实施例还提供了相应的装置、设备及存储介质。
本申请第一方面提供一种视线追踪模型训练的方法,包括:
获取训练样本集合,所述训练样本集合包括训练样本对,其中,所述训练样本对包括眼部样本图像和所述眼部样本图像对应的标注视线向量;
通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量;
根据所述预测视线向量和所述标注视线向量的余弦距离确定模型损失;
对所述初始视线追踪模型的参考参数进行迭代调整直到所述模型损失满足收敛条件,以得到目标视线追踪模型。
本申请第二方面提供一种视线追踪的方法,包括:
获取目标眼部图像;
采用目标视线追踪模型对所述目标眼部图像进行处理,确定所述目标眼部图像的预测视线向量,所述目标视线追踪模型为采用上述第一方面所述的方法训练得到的视线追踪模型;
根据所述预测视线向量进行视线追踪。
本申请第三方面提供一种视线追踪模型训练的装置,包括:
获取模块,用于获取训练样本集合,所述训练样本集合包括训练样本对,其中,训练样本对包括眼部样本图像和所述眼部样本图像对应的标注视线向量;
训练模块,用于通过初始视线追踪模型对所述获取模块获取的所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量;
第一处理模块,用于根据所述训练模块得到的预测视线向量和所述标注视线向量的余弦距离确定模型损失;
第二处理模块,用于对所述初始视线追踪模型的参考参数进行迭代调整直到所述第一处理模块处理得到的所述模型损失满足收敛条件,以得到目标视线追踪模型。
本申请第四方面提供一种视线追踪的装置,包括:
获取模块,用于获取目标眼部图像;
处理模块,用于采用目标视线追踪模型对所述获取模块获取的目标眼部图像进行处理,确定所述目标眼部图像的预测视线向量,所述目标视线追踪模型为采用上述第一方面所述的方法训练得到的视线追踪模型;
视线追踪模块,用于根据所述处理模块得到的预测视线向量进行视线追踪。
本申请第五方面提供一种计算机设备,所述计算机设备包括处理器以及存储器:
所述存储器用于存储程序代码;所述处理器用于根据所述程序代码中的指令执行上述第一方面所述的视线追踪模型训练的方法。
本申请第六方面提供一种计算机设备,所述计算机设备包括处理器以及存储器:
所述存储器用于存储目标视线追踪模型,所述目标视线追踪模型是根据上述第一方面所述的视线追踪模型训练的方法训练得到的视线追踪模型;所述处理器用于运行所述目标视线追踪模型,以进行视线追踪。
本申请第七方面提供一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如上述第一方面所述的视线追踪模型训练的方法。
本申请第八方面提供一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如上述第二方面所述的视线追踪的方法。
从以上技术方案可以看出,本申请实施例至少具有以下优点:
本申请实施例中,通过获取包含眼部样本图像和对应标注视线向量的训练样本对,并利用初始视线追踪模型对眼部样本图像进行处理,得到预测视线向量,进而以预测视线向量与标注视线向量之间的余弦距离为模型损失进行模型训练,得到目标视线追踪模型;后续进行视线追踪时,无需借助外设,只需要将采集到的眼部图像输入目标视线追踪模型即可,简化了视线追踪的流程,并且以余弦距离作为模型损失训练模型,能够更好地表现预测值与标注值之间的差异性,进而提高了训练得到的视线追踪模型的预测准确度。
附图说明
图1是本申请实施例中视线追踪的一应用场景的一示例示意图;
图2是本申请实施例中视线追踪模型训练的一场景示意图;
图3是本申请实施例中视线追踪模型训练的方法一实施例示意图;
图4是本申请实施例中视线追踪模型训练的方法另一实施例示意图;
图5是本申请实施例中反残差区块的特征处理过程的一实施例示意图;
图6是本申请实施例提供的视线追踪的方法的一实施例示意图;
图7是本申请实施例中三阶Bezier曲线的一处理结果示意图;
图8是本申请实施例中视线追踪模型训练的装置的一实施例示意图;
图9是本申请实施例中视线追踪模型训练的装置的另一实施例示意图;
图10是本申请实施例中视线追踪的装置的一实施例示意图;
图11是本申请实施例提供的服务器的一实施例示意图;
图12是本申请实施例提供的终端设备的一实施例示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
本申请实施例提供一种视线追踪模型训练的方法,采用对预测值和标注值之间的损失函数取余弦的方案,可以更好的表示预测值与标注值之间的差异性,可以确保得到的视线追踪模型的预测准确度更高。本申请实施例还提供了相应的装置、设备及存储介质。以下分别进行详细说明。
视觉追踪技术属于机器视觉的一种技术,它是通过图像传感器捕捉到眼球的图像,根据对图像的处理来识别眼球的特征,通过这些特征实时地反算出用户的注视点。
在实际应用中,若可以获知用户的注视点,则可以确定用户对该注视点的内容感兴趣,则可以将注视点处的信息放大,如:注视点处是一个小图片,则可以将该小图片放大为大图片。以图1为例,简单介绍通过眼球进行控制的过程。
如图1所示,用户的视线注视在手机的某个点上超过预设时间,手机的图像获取装置获取到这段时间内的眼部图像后,可以通过对眼部图像的分析确定用户所注视的内容,如:用户一直在盯着手机上的一个小图片看,且注视时长达到时长阈值,手机可以放大该图片在屏幕上的显示尺寸,从而更有利于用户读取所关注的信息。
在另一种应用场景中,将视觉追踪技术应用于辅助驾驶系统后,辅助驾驶 系统可以实时采集驾驶者的眼部图像,并对眼部图像进行分析,确定驾驶者眼球的注视点。若注视点偏离道路,辅助驾驶系统则可以进行提醒(比如发出声音警报),提高驾驶过程中的安全性。
视觉追踪技术改变了人与计算机设备之间的交互路径,不再必须通过手动操作才能与计算机设备进行交互,也可以通过眼球运动来控制计算机设备。
在实际操作过程中,眼球与鼠标的点击选择操作很相似,眼球可以实现观看选择,进而激活按钮、图标、链接或文本等控件。眼球对选择的控制可以是注视一个点超过一定时间,如在一段预定的时间内悬停在一个可选择的目标上,例如静止800毫秒,则可以实现对该可选择的目标的控制。
通过眼球控制的示例可以有很多,本申请实施例中不做一一列举。
在本申请实施例中,无论是图1中的手机、还是其他终端,能分析出用户的注视点都是基于目标视线追踪模型实现的,该目标视线追踪模型可以是具有卷积神经网络的深度学习模型。该目标视线追踪模型是通过大量的样本数据训练得到的,为了准确的确定用户的注视点,本申请实施例提供了一种视线追踪模型训练的方法,可以训练得到预测准确度更高的目标视线追踪模型。
下面结合图2,介绍本申请实施例中的视线追踪模型的训练过程。
参见图2,图2为本申请实施例提供的视线追踪模型训练的一应用场景示意图。
该场景中包括数据库101和用于训练视线追踪模型的服务器102,数据库101和服务器102通过网络连接。当然,该数据库101也可以集成在服务器102上,该场景中以数据库位于独立的设备上为例进行说明,数据库101上的训练样本集合中包括多个训练样本对,其中每个训练样本对包括眼部样本图像和眼部样本图像对应的标注视线向量;这些训练样本对可以是开发人员专门制作的,也可以是通过有奖参与的方式由用户参与上报的,当然,还可以通过其他方式获得本申请的训练样本对,本申请中对训练样本对的获取方式不做具体限定。
其中,数据库101能够为服务器提供训练样本集合。
服务器102通过网络从数据库101处获取到训练样本集合后,将眼部样本图像输入到初始视线追踪模型。
服务器102通过初始视线追踪模型对所述眼部样本图像进行处理,以得到所述眼部样本图像的预测视线向量;
服务器102对所述预测视线向量和所述标注视线向量的损失函数取余弦,以确定所述预测视线向量和所述标注视线向量的相似度;
服务器102对所述初始视线追踪模型的参考参数进行迭代调整直到所述相似度满足收敛条件,以得到目标视线追踪模型。
服务器102生成目标视线追踪模型后,可以进一步将该目标视线追踪模型发送至终端设备,以在终端设备上运行该目标视线追踪模型,利用这些目标视线追踪模型实现相应的功能。
需要说明的是,服务器102在训练视线追踪模型的过程中,采用预测视线向量和标注视线向量的余弦距离作为模型损失,可以更好的表现预测值与标注值之间的差异性,可以确保得到的视线追踪模型的预测准确度更高。
需要说明的是,上述图2所示的应用场景仅为一种示例,在实际应用中,本申请实施例提供的视线追踪模型训练的过程还可以应用于其他应用场景,在此不对该视线追踪模型训练的过程的应用场景做任何限定。
应理解,本申请实施例提供的视线追踪模型训练的过程可以应用于具备模型训练功能的设备,如终端设备、服务器等。其中,终端设备具体可以为智能手机、计算机、个人数字助理(Personal Digital Assistant,PDA)、平板电脑等;服务器具体可以为应用服务器,也可以为Web服务器,在实际应用部署时,该服务器可以为独立服务器,也可以为集群服务器。
在实际应用中,终端设备和服务器可以单独训练视线追踪模型,也可以彼此交互训练视线追踪模型,二者交互训练视线追踪模型时,终端设备可以从服务器处获取训练样本集,进而利用该训练样本集对视线追踪模型进行训练,或者,服务器可以从终端处获取训练样本集,利用该训练样本集对视线追踪模型进行训练。
应理解,终端设备或服务器执行本申请实施例提供的视线追踪模型训练的过程,训练得到目标视线追踪模型后,可以将该目标视线追踪模型发送至其他终端设备,以在这些终端设备上运行上述目标视线追踪模型,实现相应的功能;也可以将该目标视线追踪模型发送至其他服务器,以在其他服务器上运行上述目标视线追踪模型,通过这些服务器实现相应的功能。
下面通过实施例对本申请提供的神经网络模型训练方法进行介绍。
参见图3,图3为本申请实施例提供的一种视线追踪模型训练的方法的一实施例示意图。为了便于描述,下述实施例以服务器作为执行主体进行描述, 应理解,该视线追踪模型训练的方法的执行主体并不仅限于服务器,还可以应用于终端设备等具备模型训练功能的设备。如图3所示,该视线追踪模型训练的方法包括以下步骤:
201、获取训练样本集合,所述训练样本集合包括训练样本对,其中,训练样本对包括眼部样本图像和所述眼部样本图像对应的标注视线向量。
其中,标注视线向量是眼部样本图像中眼球注视方向的真实数据标注,用于在训练过程中对训练结果进行监督,也可以被称为真值(Ground-truth)。本申请实施例中,标注视线向量为三维空间向量,包含xyz三个维度。
本申请实施例中的训练样本集合可以包括真实眼部图像和该图像对应的标注视线向量,还可以包括合成眼部图像以及该合成眼部图像对应的标注视线向量,其中,真实眼部图像指的是通过摄像机等设备直接拍摄获取的眼部图像,合成眼部图像指的是通过软件工具合成的眼部图像。本申请实施例中的训练样本集合包括真实眼部图像和合成眼部图像,可以提高视线追踪模型的鲁棒性。
202、通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量。
本申请实施例中,服务器采用深度学习方法构建初始视线追踪模型,利用该模型对眼部样本图像对应的视线向量进行预测,得到预测视线向量。其中,预测视线向量同样为三维空间向量。
可选的,该初始视线追踪模型包括特征提取网络(用于对眼部样本图像进行图像特征提取)以及回归网络(用于对提取到的图像特征进行回归,得到视线向量)。
203、根据所述预测视线向量和所述标注视线向量的余弦距离确定模型损失。
由于标注视线向量和预测视线向量均为三维空间向量,为了更加直观体现出预测值与标注值之间的差异性,本申请实施例中,服务器根据标注视线向量和预测视线向量的余弦距离确定模型损失,并进行模型训练。其中余弦距离用于表征空间向量之间所成的夹角,空间向量夹角越小(即余弦距离越大),表明空间向量相似度越高,相反的,空间向量夹角越大(即余弦距离越小),表明空间向量相似度越低。
可选的,预测视线向量和标注视线向量的余弦距离为cos(θ),模型损失为1-cos(θ)。
204、对所述初始视线追踪模型的参考参数进行迭代调整直到所述模型损失满足收敛条件,得到目标视线追踪模型。
在一种可能的实施方式中,当模型损失不满足收敛条件时,服务器采用随机梯度下降算法(SGD,Stochastic Gradient Descent)调整初始视线追踪模型的参考参数(或称为模型参数或网络权重),并利用参数调整后的模型进行重新预测,直至模型损失满足收敛条件。其中,调整模型参数使模型损失满足收敛条件的过程,即是使预测视线向量趋向于标注视线向量的过程。
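As an illustrative sketch only (not the patent's reference implementation), the iterative adjustment described above could look like the following PyTorch-style training loop; the model, dataloader, epoch count and learning rate used here are hypothetical placeholders.

```python
import torch

def train_gaze_model(model, dataloader, epochs=10, lr=0.01):
    # Stochastic gradient descent over the reference parameters (network weights).
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for eye_images, gt_gaze in dataloader:           # labelled 3D gaze vectors
            pred_gaze = model(eye_images)                # predicted 3D gaze vectors
            cos = torch.nn.functional.cosine_similarity(pred_gaze, gt_gaze, dim=1)
            loss = (1.0 - cos).mean()                    # model loss = 1 - cos(theta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                             # iterate until the loss converges
    return model
```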
综上所述,本申请实施例中,通过获取包含眼部样本图像和对应标注视线向量的训练样本对,并利用初始视线追踪模型对眼部样本图像进行处理,得到预测视线向量,进而以预测视线向量与标注视线向量之间的余弦距离为模型损失进行模型训练,得到目标视线追踪模型;后续进行视线追踪时,无需借助外设,只需要将采集到的眼部图像输入目标视线追踪模型即可,简化了视线追踪的流程,并且以余弦距离作为模型损失训练模型,能够更好地表现预测值与标注值之间的差异性,进而提高了训练得到的视线追踪模型的预测准确度。
可选地,所述训练样本对还包括眼部样本图像中眼球的标注坐标;所述通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量,可以包括:
通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量和眼球的预测坐标;
所述方法还可以包括:
根据所述眼球的预测坐标和所述眼球的标注坐标之间的欧式距离确定所述模型损失。
在一种可能的实施方式中,训练初始视线追踪模型过程中,除了训练预测视线向量的分支外,同时训练预测眼球坐标的分支,从而实现多任务学习(Multi-Task Learning,MTL)。因此,训练样本对中还包含眼部样本图像中眼球的标注坐标,相应的,初始视线追踪模型对眼部样本图像进行处理后,还输出眼球的预测坐标。
可选的,眼球的预测坐标或者标注坐标指的是眼球瞳孔中心点的位置坐标;并且,预测坐标和标注坐标为二维空间坐标,包括xy两个维度。
不同于采用余弦距离表征视线向量预测值与标注值之间的差异性,服务器采用欧式距离表征位置坐标预测值与标注值之间的差异性,并将欧式距离作为模型损失的一部分,对模型进行训练,即视线追踪模型的模型损失由余弦距离和欧式距离构成。
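A minimal sketch of how the two loss terms could be combined for multi-task training, assuming the model returns both a 3D gaze vector and a 2D pupil-centre coordinate; the weighting factor `coord_weight` is a hypothetical choice, not a value specified in the text.

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_gaze, gt_gaze, pred_pupil, gt_pupil, coord_weight=1.0):
    # Gaze branch: cosine distance between predicted and labelled 3D vectors.
    gaze_loss = (1.0 - F.cosine_similarity(pred_gaze, gt_gaze, dim=1)).mean()
    # Pupil branch: Euclidean (L2) distance between predicted and labelled 2D coordinates.
    coord_loss = torch.norm(pred_pupil - gt_pupil, dim=1).mean()
    return gaze_loss + coord_weight * coord_loss
```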
本申请实施例中,在训练视线追踪模型时,不光考虑了预测视线向量,还考虑了眼球的预测坐标,不仅可以进一步提高视线追踪模型的鲁棒性,而且还可以实现多任务学习。
可选地,所述标注视线向量为基于单位圆的方向向量(即单位向量),本申请实施例提供的视线追踪模型训练的方法的另一实施例中,还可以包括:
对所述预测视线向量进行归一化处理,以得到归一化视线向量;
所述根据所述预测视线向量和所述标注视线向量的余弦距离确定模型损失,可以包括:
根据所述归一化视线向量和所述标注视线向量的余弦距离确定所述模型损失。
本申请实施例中,在对所述预测视线向量和所述标注视线向量的损失函数取余弦之前,先对预测视线向量进行归一化处理,以得到归一化视线向量,然后再对所述归一化视线向量和所述标注视线向量进行余弦距离计算,可以将特征值归一化,使计算损失时的取值处于单位圆之内,最终让预测的结果更加鲁棒。
可选地,所述通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量,可以包括:
对所述眼部样本图像进行如下至少一项的处理,所述至少一项包括仿射变换、白平衡、自动对比度或高斯模糊;
将所述训练样本集合中的第一眼部样本图像翻转为第二眼部样本图像,并将所述第一眼部样本图像对应的标注视线向量进行对应翻转,所述第二眼部样本图像为目标方位眼部的图像,所述初始视线追踪模型用于对所述目标方位眼部的图像进行处理;所述第一眼部样本图像为右眼样本图像时,所述第二眼部样本图像为左眼样本图像,所述第一眼部样本图像为左眼样本图像时,所述第二眼部样本图像为右眼样本图像;
对每个眼部样本图像做裁剪处理,得到标准图像;
采用所述初始视线追踪模型中的反残差区块对所述标准图像进行映射处理,以得到所述标准图像的预测视线向量。
本申请实施例中,先对眼部样本图像做仿射变换、白平衡、自动对比度或高斯模糊等的处理,可以提高视线追踪模型的泛化性。
其中,高斯模糊(Gaussian Blur)可以根据高斯曲线调节像素色值,可以有选择地模糊图像。换句话说,就是高斯模糊能够把某一点周围的像素色值按高斯曲线统计起来,采用数学上加权平均的计算方法得到这条曲线的色值。
自动对比度指的是一幅图像中明暗区域最亮的白和最暗的黑之间不同亮度层级的测量,差异范围越大代表对比越大,差异范围越小代表对比越小。
仿射变换是在几何上定义为两个向量空间之间的一个仿射变换或者仿射映射,由一个非奇异的线性变换接上一个平移变换组成。在有限维的情况,每个仿射变换可以由一个矩阵A和一个向量b给出,它可以写作A和一个附加的列b。
当然,除了上述图像预处理方式外,服务器还可以采用其他方式对图像进行预处理,以提高训练得到的视线追踪模型的泛化性,本实施例对此并不构成限定。
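A rough sketch of the preprocessing/augmentation steps named above (affine transform, white balance, auto-contrast, Gaussian blur), assuming Pillow (PIL) is available; the specific parameter ranges and the channel-gain style of white balance are arbitrary illustrations, not values from the text.

```python
import random
from PIL import Image, ImageFilter, ImageOps

def augment_eye_image(img: Image.Image) -> Image.Image:
    # Assumes an RGB eye image. Small random affine (translation) transform.
    dx, dy = random.randint(-3, 3), random.randint(-3, 3)
    img = img.transform(img.size, Image.AFFINE, (1, 0, dx, 0, 1, dy))
    # Simple white-balance-style jitter: rescale each RGB channel independently.
    r, g, b = img.split()
    gains = [random.uniform(0.9, 1.1) for _ in range(3)]
    img = Image.merge("RGB", [ch.point(lambda v, k=k: min(255, int(v * k)))
                              for ch, k in zip((r, g, b), gains)])
    # Auto-contrast followed by a light Gaussian blur.
    img = ImageOps.autocontrast(img)
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))
    return img
```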
在一种可能的实现方式中,视线追踪模型仅对目标方位眼部的图像进行处理,并得到预测视线方向,其中,该目标方位眼部可以为左眼或右眼。
相应的,本申请实施例中的视线追踪模型可以只对左眼进行训练,或者只对右眼进行训练,例如:只针对左眼进行训练,那么针对右眼的图像就可以通过翻转成为左眼的图像用于模型训练,当右眼图像翻转为左眼图像后,相应的标注视线向量也要对应的翻转成左眼图像的标注视线向量。
在预测时,服务器首先根据左右眼角关键点,从包含人脸的图片中剪裁(wrap)出模型需要大小的眼睛图片,而右眼图片会翻转成左眼图片输入模型进行预测。
需要说明的是,当视线追踪模型可以对左右眼图像进行视线方向预测时,在训练过程中可以省略翻转图像以及标注视线向量的过程,本实施例在此不再赘述。
可选地,所述方法还可以包括:
当所述标准图像是通过第一眼部样本图像得到时,将所述标准图像的预测视线向量翻转回所述第一眼部样本图像对应的空间。
本申请实施例中,当模型是需要输入左眼图像的模型时,右眼图像会翻转成左眼图片输入模型进行预测,得到的预测结果需要同时翻转回右眼的空间。
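A minimal numpy sketch of the flip-and-flip-back idea: mirroring a right-eye image into left-eye space mirrors the x component of the gaze label, and a prediction made on the mirrored image is mirrored back. The (x, y, z) component ordering is an assumption.

```python
import numpy as np

def flip_right_eye_to_left(eye_image: np.ndarray, gaze: np.ndarray):
    # Mirror the image horizontally and negate the x component of the gaze vector.
    flipped_image = eye_image[:, ::-1].copy()
    flipped_gaze = gaze.copy()
    flipped_gaze[0] = -flipped_gaze[0]        # gaze = (x, y, z)
    return flipped_image, flipped_gaze

def flip_prediction_back(pred_gaze: np.ndarray) -> np.ndarray:
    # A prediction made on a flipped right-eye image is mirrored back to right-eye space.
    restored = pred_gaze.copy()
    restored[0] = -restored[0]
    return restored
```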
可选地,所述目标方位眼部为左眼,所述方法还包括:
获取左眼的预测视线向量中的第一横坐标值,以及右眼的预测视线向量中的第二横坐标值,所述左眼和所述右眼属于同一用户对象;
当所述第一横坐标值表征所述左眼向左看,所述第二横坐标值表征右眼向右看时,对所述第一横坐标值和所述第二横坐标值进行矫正。
其中,所述对所述第一横坐标值和所述第二横坐标值进行矫正,可以包括:
根据所述第一横坐标值和所述第二横坐标值,确定所述左眼和所述右眼的横坐标的平均值;
调整所述右眼的预测视线向量与所述左眼的预测视线向量相互平行,其中,平行处理后,所述右眼的横坐标为第三横坐标值;
根据所述平均值和所述第三横坐标值确定所述右眼的横坐标的第四横坐标值。
在一种可能的场景下,当同一对象左眼和右眼图像对应的预测视线向量中x值的符号相反时,存在如下两种情况:左右眼分别朝两侧看和左右眼向中间看。显然,前者并不符合人眼的正常观看习惯,需要对视线向量进行矫正。
可选的,当目标方位眼部为左眼时,若左眼对应的预测视线方向表征左眼朝左看,而右眼对应的预测视线方向表征右眼朝右看时,服务器则对(右眼的)视线向量进行矫正。
在矫正过程中,服务器首先根据第一横坐标值和第二横坐标值,确定左眼和右眼的横坐标的平均值,并调整右眼的预测视线向量与左眼的预测视线向量平行,从而利用平均值对平行处理后右眼的预测视线向量的横坐标进行修正,使得修正后右眼的预测视线向量与左眼的预测视线向量在x轴方向一致。
需要说明的是,当目标方位眼部为右眼时,若右眼对应的预测视线方向表征右眼朝右看,而左眼对应的预测视线方向表征左眼朝左看时,服务器则对(左眼的)视线向量进行矫正。本实施例对矫正过程不再赘述。
本申请实施例中,预测得到的左右眼视线向量会进行合理性矫正得到最终结果。
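One possible reading of the divergence correction described above, sketched in numpy with heavy caveats: the sign convention for "looking left/right", and the way the average x value and the parallel-adjusted x value are combined into the final x value, are assumptions, since the text does not fully specify them.

```python
import numpy as np

def correct_divergent_gaze(left_gaze: np.ndarray, right_gaze: np.ndarray):
    # left_gaze / right_gaze: predicted 3D vectors (x, y, z) for the same person.
    x_left, x_right = left_gaze[0], right_gaze[0]
    # Divergence check: left eye looking left while the right eye looks right
    # (the sign convention here is an assumption).
    if x_left < 0 and x_right > 0:
        x_avg = (x_left + x_right) / 2.0
        # Make the right-eye vector parallel to the left-eye vector ...
        parallel_right = left_gaze / np.linalg.norm(left_gaze)
        x_third = parallel_right[0]
        # ... then derive a final x value from the average and the parallel x value
        # (one plausible combination; not specified by the text).
        parallel_right[0] = (x_avg + x_third) / 2.0
        right_gaze = parallel_right
    return left_gaze, right_gaze
```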
可选地,所述反残差区块的数量小于19。
本申请实施例中,为了使训练得到的目标视线追踪模型能够应用于移动端,将视线追踪模型做了剪裁,可以将反残差区块的数量剪裁到只有5个,从而缩小目标视线追踪模型的模型尺寸,方便部署在移动端。当然,这里5个只是举例,也可以有6个或4个,或者其他数值。
为了进一步理解本申请实施例所提供的方案,参阅图4,对本申请实施例提供的另一视线追踪模型训练的方法进行介绍:
如图4所示,本申请实施例使用了MobileNet V2作为视线追踪模型的脊柱(backbone)。MobileNet V2包含一系列的反残差区块(inverted residual block)来提升模型的性能,增强模型特征的表现力,并且减少了计算量。
反残差区块的结构图如图5所示,从图5中可以看出,反残差区块先用1x1的卷积51将输入的特征(feature map)维度放大,然后使用3x3深度卷积52(depthwise convolution)计算得到更有表达力的特征,最后用1x1的卷积53将通道(channel)维度缩小,最终将初始输入的特征与输出的特征进行特征拼接。通过1x1的卷积将深度卷积的输入维度增加,能有效缓解特征退化的情况。
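A minimal PyTorch-style sketch of an inverted residual block following the structure described above (1x1 expansion, 3x3 depthwise convolution, 1x1 projection). The input and output features are merged here with the residual addition conventional in MobileNet V2, and the channel counts, expansion factor and activation choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, expansion=6, stride=1):
        super().__init__()
        hidden = in_ch * expansion
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),               # 1x1 expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                  # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),              # 1x1 project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        # Merge the block input with its output when the shapes allow it.
        return x + out if self.use_shortcut else out
```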
本申请实施例提供的MobileNet v2的结构是经过裁剪的MobileNet V2,将反残差区块减少至5个,并且对应减少了每一层输出的channel数,以便将模型部署在移动端。
经过剪裁的MobileNet v2的结构可以参阅表1进行理解。
表1:剪裁后的MobileNet v2的结构
(表1在原文中以图片形式给出,具体的层配置数值未在文本中恢复。)
其中t代表了膨胀因素,c是当前序列的输出通道的维度,n为本层重复的次数,s为步长(stride)。
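Since Table 1 is only given as an image, the concrete (t, c, n, s) rows below are hypothetical placeholders; the sketch merely illustrates how a trimmed, five-block configuration of this kind could be assembled, reusing the `InvertedResidual` block sketched above.

```python
import torch.nn as nn

# Hypothetical (t, c, n, s) rows standing in for Table 1:
# expansion factor, output channels, number of repeats, stride.
TRIMMED_CONFIG = [
    (1, 8, 1, 1),
    (6, 12, 1, 2),
    (6, 16, 1, 2),
    (6, 24, 1, 2),
    (6, 32, 1, 2),
]

def build_trimmed_backbone(in_ch=3, stem_ch=8):
    layers = [nn.Conv2d(in_ch, stem_ch, 3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(stem_ch), nn.ReLU6(inplace=True)]
    ch = stem_ch
    for t, c, n, s in TRIMMED_CONFIG:
        for i in range(n):
            layers.append(InvertedResidual(ch, c, expansion=t,
                                           stride=s if i == 0 else 1))
            ch = c
    return nn.Sequential(*layers)
```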
需要说明的是,本申请实施例中的初始视线追踪模型不限于上述所提供的MobileNet v2的模型,还可以是其他结构或者其他类型的模型。
MobileNet v2模型会先对输入的眼部样本图像进行处理,例如:通过仿射变换,白平衡,自动对比度,高斯模糊等对图像进行处理来进行数据增强,提高模型的泛化能力。
MobileNet v2模型会得到眼部样本图像经多层次映射后的特征表示,用于构建预测视线向量和眼球坐标的回归器。
其中,该眼部样本图像的标注视线向量在空间的三个方向上可以表示为(x1,y1,z1),该眼部样本图像的预测视线向量,也就是全连接层的输出为(x2,y2,z2)三个值,而眼球坐标的输出为(x’,y’)两个值。预测视线向量的z2值主要是为了做向量的归一化。
本申请实施例中的视线回归器对所述预测视线向量和所述标注视线向量的损失函数取余弦(cosine distance loss),这是因为考虑到标注视线向量是基于单位圆的一个方向向量,取余弦可以很好地表示学习出来的预测视线向量与标注视线向量之间角度的差异性,从而使预测结果更加接近真实值。
由于眼球的坐标与角度并无直接联系,并且是2D的坐标,所以采用欧式距离(L2 distance loss)作为损失函数。在cosine distance loss之前,本申请增加了归一化层(Normalization Layer),将特征值归一化,使计算损失时的取值处于单位圆之内,最终让预测的结果更加鲁棒。
本申请实施例提供的归一化的方法可以参阅如下公式进行理解:
x2’=(x2-μ)/σ,y2’=(y2-μ)/σ,z2’=(z2-μ)/σ;其中,μ为(x2,y2,z2)三个值的平均值,σ为(x2,y2,z2)三个值的方差。
在归一化后,确定预测视线向量与标注视线向量的余弦距离,用公式可以表示为:
cos(θ)=(a·b)/(|a||b|)
其中a为标注视线向量,b为预测视线向量。该公式计算两个向量之间的相似度,所以值越大,表示两个向量越接近。网络实际上使用的是1-cos(θ)来计算两个向量之间的损失,值越小越接近。
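A small numeric sketch tying together the normalization formula and the 1-cos(θ) loss described above, written in numpy; the example vectors are arbitrary and the per-vector standardization follows the (x2', y2', z2') formula in the text.

```python
import numpy as np

def cosine_gaze_loss(pred, label):
    # Normalize the prediction: (x2', y2', z2') = ((x2, y2, z2) - mu) / sigma.
    pred = (pred - pred.mean()) / pred.std()
    cos_theta = np.dot(pred, label) / (np.linalg.norm(pred) * np.linalg.norm(label))
    return 1.0 - cos_theta            # smaller value = vectors closer in angle

pred_gaze = np.array([0.30, -0.10, -0.95])   # arbitrary predicted vector
gt_gaze = np.array([0.25, -0.05, -0.97])     # arbitrary unit-circle label
print(cosine_gaze_loss(pred_gaze, gt_gaze))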
本申请实施例输入的可以为112px×112px的左眼图片,在训练时,所有的右眼图片会翻转为左眼,并将标注视线向量做同样的翻转操作。若还有眼球的标注坐标,也需要做翻转操作。
在预测时,有人脸的图片会首先将左眼、右眼根据左右眼角关键点剪裁为模型输入大小的眼睛图片,而右眼会翻转成左眼图片输入到模型中进行预测。得到的预测结果需要同时翻转回右眼的空间,网络预测得到的左眼和右眼视线向量会进行合理性矫正得到最终结果。
以上多个实施例描述了视线追踪模型训练的方法,训练好目标视线追踪模型后,便可以将该目标视线追踪模型应用到实际的不同场景中。无论是应用到哪种场景中,都需要得到预测视线向量,从而才能实现相应的视线追踪过程。
如图6所示,本申请实施例提供的视线追踪的方法的一实施例可以包括:
301、获取目标眼部图像。
302、采用目标视线追踪模型对所述目标眼部图像进行处理,确定所述目标眼部图像的预测视线向量。
所述目标视线追踪模型为按照前述所描述的视线追踪模型训练的方法所得到的视线追踪模型。
303、根据所述预测视线向量进行视线追踪。
可选地,所述采用目标视线追踪模型对所述眼部图像进行处理,确定所述眼部图像的预测视线方向向量时,还可以包括:
确定所述目标眼部图像中眼球的坐标;
所述根据所述预测视线方向向量进行视线追踪,可以包括:
将所述眼球的坐标作为视线的起点,按照所述预测视线向量所指示的方向进行视线追踪。
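A toy sketch of using the eyeball coordinate as the ray origin and following the predicted direction, for example to find where the gaze ray meets an assumed screen plane at z = 0; the coordinate conventions, the eye-to-screen depth and the units are assumptions, not specified by the text.

```python
import numpy as np

def gaze_point_on_screen(pupil_xy, gaze_dir, eye_depth=50.0):
    # Ray origin: pupil position lifted to an assumed depth in front of the z = 0 screen plane.
    origin = np.array([pupil_xy[0], pupil_xy[1], eye_depth], dtype=float)
    d = np.asarray(gaze_dir, dtype=float)
    if abs(d[2]) < 1e-6:
        return None                    # gaze parallel to the screen plane
    t = -origin[2] / d[2]              # solve origin.z + t * d.z == 0
    if t < 0:
        return None                    # gaze points away from the screen
    hit = origin + t * d
    return hit[:2]                     # (x, y) gaze point on the screen

print(gaze_point_on_screen((56.0, 60.0), (0.1, -0.05, -1.0)))
```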
本申请实施例中,在确定所述目标眼部图像的预测视线向量的过程可以参阅前面确定眼部样本图像的预测视线向量的过程进行理解,本处不再重复赘述。
在一种可能的应用场景下,当使用目标视线追踪模型对视频流中人眼的视线进行追踪时,由于从视频帧中裁剪出的眼部区域的位置会发生抖动,而目标视线追踪模型是对每帧视频帧中的眼部图像进行单独处理,并没有上下文感知能力(即处理结果并不受之前视频帧对应处理结果的影响),因此后续预测到的视线方向也会发生抖动。
在不复杂化模型的前提下,为了缓解视线方向抖动问题,可以利用当前目标眼部图像之前的视频帧中眼部图像对应的视线向量预测结果,通过平滑处理算法对当前目标眼部图像对应的预测视线向量进行平滑处理。
在一种可能的实施方式中,确定目标眼部图像的预测视线向量之后,当目标眼部图像属于视频流中的视频帧时,确定目标眼部图像对应的参考眼部图像,参考眼部图像和目标眼部图像是视频流中连续视频帧中的图像;根据参考眼部图像对应的预测视线向量,对目标眼部图像对应的预测视线向量进行平滑处理。
在一个示例性的例子中,当目标眼部图像为第i帧视频帧时,终端将第i帧视频帧之前至少一帧视频帧(比如第i-1帧视频帧、第i-2帧视频帧和第i-3帧视频帧)确定为参考视频帧,并根据该参考视频帧中眼部图像对应的预测视线向量,对第i帧视频帧的预测视线向量进行平滑处理。
其中,进行平滑处理时可以采用的贝塞尔(Bezier)曲线,且Bezier曲线可以为一阶、二阶、三阶Bezier曲线等等,本实施例对此不作限定。
以三阶Bezier曲线为例,三阶Bezier曲线平滑公式如下:
B(t)=P0·(1-t)^3+3·P1·t·(1-t)^2+3·P2·t^2·(1-t)+P3·t^3
其中,B(t)为平滑处理后当前目标眼部图像对应的预测视线向量,Pi为参考眼部图像对应的预测视线向量,t为引入的参数,范围在0到1之间。
从图7可以看出,经过Bezier曲线平滑后,预测视线向量中x,y值的抖动越来越小,视线方向向量更加稳定。
当然,除了使用Bezier曲线进行平滑处理外,还可以采用加权移动平均和指数平滑算法进行平滑处理,本实施例对此不作限定。
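A minimal sketch of smoothing the current frame's predicted gaze vector with the cubic Bezier formula above, using the three preceding frames' predictions as the first control points; the choice of t is a hypothetical tuning value.

```python
import numpy as np

def bezier_smooth(prev_vectors, current_vector, t=0.75):
    # prev_vectors: gaze vectors from the three preceding frames (oldest first),
    # used together with the current prediction as control points P0..P3.
    p0, p1, p2 = (np.asarray(v, dtype=float) for v in prev_vectors)
    p3 = np.asarray(current_vector, dtype=float)
    return ((1 - t) ** 3 * p0
            + 3 * t * (1 - t) ** 2 * p1
            + 3 * t ** 2 * (1 - t) * p2
            + t ** 3 * p3)
```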
针对上文描述的神经网络模型训练方法,本申请还提供了对应的视线追踪模型训练的装置,以使上述视线追踪模型训练的方法在实际中得以应用和实现。
参见图8,图8是本申请实施例提供的视线追踪模型训练的装置40的一实施例示意图。
获取模块401,用于获取训练样本集合,所述训练样本集合包括训练样本对,其中,训练样本对包括眼部样本图像和所述眼部样本图像对应的标注视线向量;
训练模块402,用于通过初始视线追踪模型对所述获取模块401获取的所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量;
第一处理模块403,用于根据所述训练模块402得到的预测视线向量和所述标注视线向量的余弦距离确定模型损失;
第二处理模块404,用于对所述初始视线追踪模型的参考参数进行迭代调整直到所述第一处理模块403处理得到的所述模型损失满足收敛条件,以得到目标视线追踪模型。
本申请实施例中,通过获取包含眼部样本图像和对应标注视线向量的训练样本对,并利用初始视线追踪模型对眼部样本图像进行处理,得到预测视线向量,进而以预测视线向量与标注视线向量之间的余弦距离为模型损失进行模型训练,得到目标视线追踪模型;后续进行视线追踪时,无需借助外设,只需要将采集到的眼部图像输入目标视线追踪模型即可,简化了视线追踪的流程,并且以余弦距离作为模型损失训练模型,能够更好地表现预测值与标注值之间的差异性,进而提高了训练得到的视线追踪模型的预测准确度。
可选地,训练模块402,用于在所述训练样本对还包括眼部样本图像中眼球的标注坐标时,通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量和眼球的预测坐标;
所述第一处理模块403,还用于根据所述眼球的预测坐标和所述眼球的标注坐标之间的欧式距离确定所述模型损失。
可选地,参阅图9,该装置40还包括:
第三处理模块405:用于对所述预测视线向量进行归一化处理,以得到归一化视线向量;
所述第一处理模块403,用于根据所述归一化视线向量和所述标注视线向量的余弦距离确定所述模型损失。
可选地,所述训练模块402用于:
对所述眼部样本图像进行如下至少一项的处理,所述至少一项包括仿射变换、白平衡、自动对比度或高斯模糊;
将所述训练样本集合中的第一眼部样本图像翻转为第二眼部样本图像,并将所述第一眼部样本图像对应的标注视线向量进行对应翻转,所述第二眼部样本图像为目标方位眼部的图像,所述初始视线追踪模型用于对所述目标方位眼部的图像进行处理;所述第一眼部样本图像为右眼样本图像时,所述第二眼部样本图像为左眼样本图像,所述第一眼部样本图像为左眼样本图像时,所述第二眼部样本图像为右眼样本图像;
对每个眼部样本图像做裁剪处理,得到标准图像;
采用所述初始视线追踪模型中的反残差区块对所述标准图像进行映射处理,以得到所述标准图像的预测视线向量。
可选地,所述训练模块402,还用于当所述标准图像是通过第一眼部样本图像得到时,将所述标准图像的预测视线向量翻转回所述第一眼部样本图像对应的空间。
可选地,所述目标方位眼部为左眼,所述训练模块402还用于:
获取左眼的预测视线向量中的第一横坐标值,以及右眼的预测视线向量中的第二横坐标值,所述左眼和所述右眼属于同一用户对象;
当所述第一横坐标值表征所述左眼向左看,所述第二横坐标值表征右眼向右看时,对所述第一横坐标值和所述第二横坐标值进行矫正。
可选地,所述训练模块402用于:
根据所述第一横坐标值和所述第二横坐标值确定所述左眼和所述右眼的横坐标的平均值;
调整所述右眼的预测视线向量与所述左眼的预测视线向量相互平行,其中,平行处理后,所述右眼的横坐标为第三横坐标值;
根据所述平均值和所述第三横坐标值确定所述右眼的横坐标的第四横坐标值。
可选地,所述反残差区块的数量小于19。
针对上文描述的视线追踪的方法,本申请还提供了对应的视线追踪的装置,以使上述视线追踪的方法在实际中得以应用和实现。
图10为本申请实施例提供的视线追踪的装置50的一实施例示意图。
获取模块501,用于获取目标眼部图像;
处理模块502,用于采用目标视线追踪模型对所述获取模块501获取的目标眼部图像进行处理,确定所述目标眼部图像的预测视线向量;
视线追踪模块503,用于根据所述处理模块502得到的预测视线向量进行视线追踪。
可选地,所述处理模块502,还用于确定所述目标眼部图像中眼球的坐标;
视线追踪模块503,用于将所述眼球的坐标作为视线的起点,按照所述预测视线向量所指示的方向进行视线追踪。
可选的,该视线追踪的装置50还可以包括平滑处理模块,平滑处理模块用于:
当所述目标眼部图像属于视频流中的视频帧时,确定所述目标眼部图像对应的参考眼部图像,所述参考眼部图像和所述目标眼部图像是所述视频流中连续视频帧中的图像;
根据所述参考眼部图像对应的预测视线向量,对所述目标眼部图像对应的预测视线向量进行平滑处理。
本申请还提供了一种用于视线追踪模型训练的设备,该设备具体可以为服务器,参见图11,图11是本申请实施例提供的一种用于视线追踪模型训练的服务器结构示意图,该服务器700可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)722(例如,一个或一个以上处理器)和存储器732,一个或一个以上存储应用程序742或数据744的存储介质730(例如一个或一个以上海量存储设备)。其中,存储器732和存储介质730可以是短暂存储或持久存储。存储在存储介质730的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器722可以设置为与存储介质730通信,在服务器700上执行存储介质730中的一系列指令操作。
服务器700还可以包括一个或一个以上电源726,一个或一个以上有线或无线网络接口750,一个或一个以上输入输出接口758,和/或,一个或一个以上操作系统741,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
上述实施例中由服务器所执行的步骤可以基于该图11所示的服务器结构。
其中,CPU 722用于执行上述图1至图6部分所描述的神经网络模型训练的过程。
此外,本申请还提供了一种服务器,该服务器与上述图11所示的服务器的结构相类似,其存储器用于存储目标视线追踪模型,该目标视线追踪模型是根据本申请实施例提供的视线追踪模型训练的方法训练得到的;其处理器用于运行该目标视线追踪模型,以进行视线追踪。
本申请实施例还提供了另一种用于视线追踪的设备,该设备可以为终端设备,如图12所示,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该终端可以为包括手机、平板电脑、个人数字助理(英文全称:Personal Digital Assistant,英文缩写:PDA)、销售终端(英文全称:Point of Sales,英文缩写:POS)、车载电脑等任意终端设备,以终端为手机为例:
图12示出的是与本申请实施例提供的终端相关的手机的部分结构的框图。参考图12,手机包括:射频(英文全称:Radio Frequency,英文缩写:RF)电路810、存储器820、输入单元830、显示单元840、传感器850、音频电路860、无线保真(英文全称:wireless fidelity,英文缩写:WiFi)模块870、处理器880、以及电源890等部件。本领域技术人员可以理解,图12中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图12对手机的各个构成部件进行具体的介绍:
RF电路810可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器880处理;另外,将设计上行的数据发送给基站。通常,RF电路810包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(英文全称:Low Noise Amplifier,英文缩写:LNA)、双工器等。此外,RF电路810还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(英文全称:Global System of Mobile communication,英文缩写:GSM)、通用分组无线服务(英文全称:General Packet Radio Service,GPRS)、码分多址(英文全称:Code Division Multiple Access,英文缩写:CDMA)、宽带码分多址(英文全称:Wideband Code Division Multiple Access,英文缩写:WCDMA)、长期演进(英文全称:Long Term Evolution,英文缩写:LTE)、电子邮件、短消息服务(英文全称:Short Messaging Service,SMS)等。
存储器820可用于存储软件程序以及模块,处理器880通过运行存储在存储器820的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器820可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器820可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元830可用于控制指令,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元830可包括触控面板831以及其他输入设备832。触控面板831,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板831上或在触控面板831附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板831可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器880,并能接收处理器880发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板831。除了触控面板831,输入单元830还可以包括其他输入设备832。具体地,其他输入设备832可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元840可用于视线追踪的结果。显示单元840可包括显示面板841,可选的,可以采用液晶显示器(英文全称:Liquid Crystal Display,英文缩写:LCD)、有机发光二极管(英文全称:Organic Light-Emitting Diode,英文缩写:OLED)等形式来配置显示面板841。进一步的,触控面板831可覆盖显示面板841,当触控面板831检测到在其上或附近的触摸操作后,传送给处理器880以确定触摸事件的类型,随后处理器880根据触摸事件的类型在显示面板841上提供相应的视觉输出。虽然在图12中,触控面板831与显示面板841是作为两个独立的部件来实现手机的输入和输入功能,但是在某些实施例中,可以将触控面板831与显示面板841集成而实现手机的输入和输出功能。
手机还可包括至少一种传感器850,可以通过传感器850采集目标眼部图 像,当然也可以通过摄像头采集目标眼部图像,或者通过眼动仪采集目标眼部图像。传感器850比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板841的亮度,接近传感器可在手机移动到耳边时,关闭显示面板841和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路860、扬声器861,传声器862可提供用户与手机之间的音频接口。音频电路860可将接收到的音频数据转换后的电信号,传输到扬声器861,由扬声器861转换为声音信号输出;另一方面,传声器862将收集的声音信号转换为电信号,由音频电路860接收后转换为音频数据,再将音频数据输出处理器880处理后,经RF电路810以发送给比如另一手机,或者将音频数据输出至存储器820以便进一步处理。
WiFi属于短距离无线传输技术,手机通过WiFi模块870可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图12示出了WiFi模块870,但是可以理解的是,其并不属于手机的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。
处理器880是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器820内的软件程序和/或模块,以及调用存储在存储器820内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器880可包括一个或多个处理单元;优选的,处理器880可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器880中。
手机还包括给各个部件供电的电源890(比如电池),优选的,电源可以通过电源管理系统与处理器880逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
尽管未示出,手机还可以包括摄像头、蓝牙模块等,在此不再赘述。
在本申请实施例中,该终端所包括的处理器880具有上述所描述的基于目标视线追踪模型进行相应的视线追踪的功能。
本申请实施例还提供一种计算机可读存储介质,用于存储程序代码,该程序代码用于执行前述各个实施例所述的一种视线追踪模型训练方法中的任意一种实施方式,或者执行前述实施例所述的一种视线追踪的方法。
本申请实施例还提供一种包括指令的计算机程序产品,当其在计算机上运行时,使得计算机执行前述各个实施例所述的一种视线追踪模型训练的方法中的任意一种实施方式,或者执行前述实施例所述的一种视线追踪方法。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述 的存储介质包括:U盘、移动硬盘、只读存储器(英文全称:Read-Only Memory,英文缩写:ROM)、随机存取存储器(英文全称:Random Access Memory,英文缩写:RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (15)

  1. 一种视线追踪模型训练的方法,其特征在于,包括:
    获取训练样本集合,所述训练样本集合包括训练样本对,其中,所述训练样本对包括眼部样本图像和所述眼部样本图像对应的标注视线向量;
    通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量;
    根据所述预测视线向量和所述标注视线向量的余弦距离确定模型损失;
    对所述初始视线追踪模型的参考参数进行迭代调整直到所述模型损失满足收敛条件,得到目标视线追踪模型。
  2. 根据权利要求1所述的方法,其特征在于,所述训练样本对还包括所述眼部样本图像中眼球的标注坐标;所述通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量,包括:
    通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量和眼球的预测坐标;
    所述方法还包括:
    根据所述眼球的预测坐标和所述眼球的标注坐标之间的欧式距离确定所述模型损失。
  3. 根据权利要求1所述的方法,其特征在于,所述标注视线向量为基于单位圆的方向向量,所述方法还包括:
    对所述预测视线向量进行归一化处理,以得到归一化视线向量;
    所述根据所述预测视线向量和所述标注视线向量的余弦距离确定模型损失,包括:
    根据所述归一化视线向量和所述标注视线向量的余弦距离确定所述模型损失。
  4. 根据权利要求1-3任一所述的方法,其特征在于,所述通过初始视线追踪模型对所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量,包括:
    对所述眼部样本图像进行如下至少一项的处理,所述至少一项包括仿射变换、白平衡、自动对比度或高斯模糊;
    将所述训练样本集合中的第一眼部样本图像翻转为第二眼部样本图像,并将所述第一眼部样本图像对应的标注视线向量进行对应翻转,所述第二眼部样本图像为目标方位眼部的图像,所述初始视线追踪模型用于对所述目标方位眼部的图像进行处理;所述第一眼部样本图像为右眼样本图像时,所述第二眼部样本图像为左眼样本图像,所述第一眼部样本图像为左眼样本图像时,所述第二眼部样本图像为右眼样本图像;
    对每个眼部样本图像做裁剪处理,得到标准图像;
    采用所述初始视线追踪模型中的反残差区块对所述标准图像进行映射处理,得到所述标准图像的预测视线向量。
  5. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    当所述标准图像是通过第一眼部样本图像得到时,将所述标准图像的预测视线向量翻转回所述第一眼部样本图像对应的空间。
  6. 根据权利要求4所述的方法,其特征在于,所述目标方位眼部为左眼,所述方法还包括:
    获取左眼的预测视线向量中的第一横坐标值,以及右眼的预测视线向量中的第二横坐标值,所述左眼和所述右眼属于同一用户对象;
    当所述第一横坐标值表征所述左眼向左看,所述第二横坐标值表征右眼向右看时,对所述第一横坐标值和所述第二横坐标值进行矫正。
  7. 根据权利要求6所述的方法,其特征在于,所述对所述第一横坐标值和所述第二横坐标值进行矫正,包括:
    根据所述第一横坐标值和所述第二横坐标值,确定所述左眼和所述右眼的横坐标的平均值;
    调整所述右眼的预测视线向量与所述左眼的预测视线向量相互平行,其中,平行处理后,所述右眼的横坐标为第三横坐标值;
    根据所述平均值和所述第三横坐标值,确定所述右眼的横坐标的第四横坐标值。
  8. 根据权利要求4所述的方法,其特征在于,所述反残差区块的数量小于19。
  9. 一种视线追踪的方法,其特征在于,包括:
    获取目标眼部图像;
    采用目标视线追踪模型对所述目标眼部图像进行处理,确定所述目标眼部图像的预测视线向量,所述目标视线追踪模型为采用权利要求1-8任一所述的方法训练得到的视线追踪模型;
    根据所述预测视线向量进行视线追踪。
  10. 根据权利要求9所述的方法,其特征在于,所述采用目标视线追踪模型对所述眼部图像进行处理,确定所述眼部图像的预测视线方向向量时,还包括:
    确定所述目标眼部图像中眼球的坐标;
    所述根据所述预测视线方向向量进行视线追踪,包括:
    将所述眼球的坐标作为视线的起点,按照所述预测视线向量所指示的方向进行视线追踪。
  11. 根据权利要求9或10所述的方法,其特征在于,所述采用目标视线追踪模型对所述目标眼部图像进行处理,确定所述目标眼部图像的预测视线向量之后,所述方法还包括:
    当所述目标眼部图像属于视频流中的视频帧时,确定所述目标眼部图像对应的参考眼部图像,所述参考眼部图像和所述目标眼部图像是所述视频流中连续视频帧中的图像;
    根据所述参考眼部图像对应的预测视线向量,对所述目标眼部图像对应的预测视线向量进行平滑处理。
  12. 一种视线追踪模型训练的装置,其特征在于,包括:
    获取模块,用于获取训练样本集合,所述训练样本集合包括训练样本对,其中,所述训练样本对包括眼部样本图像和所述眼部样本图像对应的标注视线向量;
    训练模块,用于通过初始视线追踪模型对所述获取模块获取的所述眼部样本图像进行处理,得到所述眼部样本图像的预测视线向量;
    第一处理模块,用于根据所述训练模块得到的预测视线向量和所述标注视线向量的余弦距离确定模型损失;
第二处理模块,用于对所述初始视线追踪模型的参考参数进行迭代调整直到所述第一处理模块处理得到的所述模型损失满足收敛条件,得到目标视线追踪模型。
  13. 一种视线追踪的装置,其特征在于,包括:
    获取模块,用于获取目标眼部图像;
    处理模块,用于采用目标视线追踪模型对所述获取模块获取的目标眼部图像进行处理,确定所述目标眼部图像的预测视线向量,所述目标视线追踪模型为采用权利要求1-8任一所述的方法训练得到的视线追踪模型;
    视线追踪模块,用于根据所述处理模块得到的预测视线向量进行视线追踪。
  14. 一种计算机设备,其特征在于,所述计算机设备包括处理器以及存储器:
    所述存储器用于存储程序代码;所述处理器用于根据所述程序代码中的指令执行权利要求1至8任一项所述的视线追踪模型训练的方法,或者,
    所述存储器用于存储目标视线追踪模型,所述目标视线追踪模型是根据上述权利要求1至8任一项所述的视线追踪模型训练的方法训练得到的视线追踪模型;所述处理器用于运行所述目标视线追踪模型,以进行视线追踪。
  15. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如上述权利要求1至8任一项所述的视线追踪模型训练的方法,或者执行如上述权利要求9至11任一所述的视线追踪的方法。
PCT/CN2020/083486 2019-04-24 2020-04-07 视线追踪模型训练的方法、视线追踪的方法及装置 WO2020216054A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/323,827 US11797084B2 (en) 2019-04-24 2021-05-18 Method and apparatus for training gaze tracking model, and method and apparatus for gaze tracking

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910338224.6A CN110058694B (zh) 2019-04-24 2019-04-24 视线追踪模型训练的方法、视线追踪的方法及装置
CN201910338224.6 2019-04-24

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/323,827 Continuation US11797084B2 (en) 2019-04-24 2021-05-18 Method and apparatus for training gaze tracking model, and method and apparatus for gaze tracking

Publications (1)

Publication Number Publication Date
WO2020216054A1 true WO2020216054A1 (zh) 2020-10-29

Family

ID=67320663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083486 WO2020216054A1 (zh) 2019-04-24 2020-04-07 视线追踪模型训练的方法、视线追踪的方法及装置

Country Status (3)

Country Link
US (1) US11797084B2 (zh)
CN (1) CN110058694B (zh)
WO (1) WO2020216054A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541433A (zh) * 2020-12-11 2021-03-23 中国电子技术标准化研究院 一种基于注意力机制的两阶段人眼瞳孔精确定位方法
CN114500839A (zh) * 2022-01-25 2022-05-13 青岛根尖智能科技有限公司 一种基于注意力跟踪机制的视觉云台控制方法及系统

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110058694B (zh) 2019-04-24 2022-03-25 腾讯科技(深圳)有限公司 视线追踪模型训练的方法、视线追踪的方法及装置
CN110503068A (zh) * 2019-08-28 2019-11-26 Oppo广东移动通信有限公司 视线估计方法、终端及存储介质
CN111259713B (zh) * 2019-09-16 2023-07-21 浙江工业大学 一种基于自适应加权的视线跟踪方法
CN110706283B (zh) * 2019-11-14 2022-07-29 Oppo广东移动通信有限公司 用于视线追踪的标定方法、装置、移动终端及存储介质
CN111145087B (zh) * 2019-12-30 2023-06-30 维沃移动通信有限公司 一种图像处理方法及电子设备
CN111580665B (zh) * 2020-05-11 2023-01-10 Oppo广东移动通信有限公司 注视点预测方法、装置、移动终端及存储介质
CN112766097B (zh) * 2021-01-06 2024-02-13 中国科学院上海微系统与信息技术研究所 视线识别模型的训练方法、视线识别方法、装置及设备
JP7219787B2 (ja) * 2021-04-09 2023-02-08 本田技研工業株式会社 情報処理装置、情報処理方法、学習方法、およびプログラム
US11704814B2 (en) * 2021-05-13 2023-07-18 Nvidia Corporation Adaptive eye tracking machine learning model engine
US11606544B2 (en) * 2021-06-08 2023-03-14 Black Sesame Technologies Inc. Neural network based auto-white-balancing
CN113379644A (zh) * 2021-06-30 2021-09-10 北京字跳网络技术有限公司 基于数据增强的训练样本获取方法、装置和电子设备
CN113506328A (zh) * 2021-07-16 2021-10-15 北京地平线信息技术有限公司 视线估计模型的生成方法和装置、视线估计方法和装置
CN113805695B (zh) * 2021-08-26 2024-04-05 深圳静美大健康科技有限公司 阅读理解水平的预测方法及装置、电子设备和存储介质
CN113673479A (zh) * 2021-09-03 2021-11-19 济南大学 基于视觉关注点识别物体的方法
CN113900519A (zh) * 2021-09-30 2022-01-07 Oppo广东移动通信有限公司 注视点获取方法、装置以及电子设备
CN113807330B (zh) * 2021-11-19 2022-03-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) 面向资源受限场景的三维视线估计方法及装置
CN114449162B (zh) * 2021-12-22 2024-04-30 天翼云科技有限公司 一种播放全景视频的方法、装置、计算机设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181737A1 (en) * 2014-08-28 2018-06-28 Facetec, Inc. Facial Recognition Authentication System Including Path Parameters
CN108229284A (zh) * 2017-05-26 2018-06-29 北京市商汤科技开发有限公司 视线追踪及训练方法和装置、系统、电子设备和存储介质
CN109492514A (zh) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 一种单相机采集人眼视线方向的方法及系统
CN109508679A (zh) * 2018-11-19 2019-03-22 广东工业大学 实现眼球三维视线跟踪的方法、装置、设备及存储介质
CN110058694A (zh) * 2019-04-24 2019-07-26 腾讯科技(深圳)有限公司 视线追踪模型训练的方法、视线追踪的方法及装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7686451B2 (en) * 2005-04-04 2010-03-30 Lc Technologies, Inc. Explicit raytracing for gimbal-based gazepoint trackers
EP2659480B1 (en) * 2010-12-30 2016-07-27 Dolby Laboratories Licensing Corporation Repetition detection in media data
US9179833B2 (en) * 2013-02-28 2015-11-10 Carl Zeiss Meditec, Inc. Systems and methods for improved ease and accuracy of gaze tracking
US9852337B1 (en) * 2015-09-30 2017-12-26 Open Text Corporation Method and system for assessing similarity of documents
CN107103293B (zh) * 2017-04-13 2019-01-29 西安交通大学 一种基于相关熵的注视点估计方法
US10534982B2 (en) * 2018-03-30 2020-01-14 Tobii Ab Neural network training for three dimensional (3D) gaze prediction with calibration parameters
US11262839B2 (en) * 2018-05-17 2022-03-01 Sony Interactive Entertainment Inc. Eye tracking with prediction and late update to GPU for fast foveated rendering in an HMD environment
CN108805078A (zh) * 2018-06-11 2018-11-13 山东大学 基于行人平均状态的视频行人再识别方法及系统
US10929665B2 (en) * 2018-12-21 2021-02-23 Samsung Electronics Co., Ltd. System and method for providing dominant scene classification by semantic segmentation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181737A1 (en) * 2014-08-28 2018-06-28 Facetec, Inc. Facial Recognition Authentication System Including Path Parameters
CN108229284A (zh) * 2017-05-26 2018-06-29 北京市商汤科技开发有限公司 视线追踪及训练方法和装置、系统、电子设备和存储介质
CN109492514A (zh) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 一种单相机采集人眼视线方向的方法及系统
CN109508679A (zh) * 2018-11-19 2019-03-22 广东工业大学 实现眼球三维视线跟踪的方法、装置、设备及存储介质
CN110058694A (zh) * 2019-04-24 2019-07-26 腾讯科技(深圳)有限公司 视线追踪模型训练的方法、视线追踪的方法及装置

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541433A (zh) * 2020-12-11 2021-03-23 中国电子技术标准化研究院 一种基于注意力机制的两阶段人眼瞳孔精确定位方法
CN112541433B (zh) * 2020-12-11 2024-04-19 中国电子技术标准化研究院 一种基于注意力机制的两阶段人眼瞳孔精确定位方法
CN114500839A (zh) * 2022-01-25 2022-05-13 青岛根尖智能科技有限公司 一种基于注意力跟踪机制的视觉云台控制方法及系统
CN114500839B (zh) * 2022-01-25 2024-06-07 青岛根尖智能科技有限公司 一种基于注意力跟踪机制的视觉云台控制方法及系统

Also Published As

Publication number Publication date
US11797084B2 (en) 2023-10-24
CN110058694B (zh) 2022-03-25
CN110058694A (zh) 2019-07-26
US20210271321A1 (en) 2021-09-02

Similar Documents

Publication Publication Date Title
WO2020216054A1 (zh) 视线追踪模型训练的方法、视线追踪的方法及装置
US11989350B2 (en) Hand key point recognition model training method, hand key point recognition method and device
WO2020177582A1 (zh) 视频合成的方法、模型训练的方法、设备及存储介质
US20220076000A1 (en) Image Processing Method And Apparatus
CN108491775B (zh) 一种图像修正方法及移动终端
US20210343041A1 (en) Method and apparatus for obtaining position of target, computer device, and storage medium
WO2020192465A1 (zh) 一种三维对象重建方法和装置
CN108712603B (zh) 一种图像处理方法及移动终端
CN108076290B (zh) 一种图像处理方法及移动终端
WO2019233216A1 (zh) 一种手势动作的识别方法、装置以及设备
US11997422B2 (en) Real-time video communication interface with haptic feedback response
US20240184372A1 (en) Virtual reality communication interface with haptic feedback response
CN110969060A (zh) 神经网络训练、视线追踪方法和装置及电子设备
US20220317775A1 (en) Virtual reality communication interface with haptic feedback response
CN111080747B (zh) 一种人脸图像处理方法及电子设备
CN110807769B (zh) 图像显示控制方法及装置
CN111556337A (zh) 一种媒体内容植入方法、模型训练方法以及相关装置
CN113255396A (zh) 图像处理模型的训练方法及装置、图像处理方法及装置
CN112733673B (zh) 一种内容显示方法、装置、电子设备及可读存储介质
CN107798662B (zh) 一种图像处理方法及移动终端
CN110443752B (zh) 一种图像处理方法和移动终端
CN111385481A (zh) 图像处理方法及装置、电子设备及存储介质
CN110930372A (zh) 一种图像处理方法、电子设备及计算机可读存储介质
CN112585673A (zh) 信息处理设备、信息处理方法及程序
CN114973347B (zh) 一种活体检测方法、装置及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20796221

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20796221

Country of ref document: EP

Kind code of ref document: A1