CN117197612A - Training method, detection device, electronic equipment and storage medium - Google Patents

Training method, detection device, electronic equipment and storage medium

Info

Publication number
CN117197612A
CN117197612A (application CN202311142090.3A)
Authority
CN
China
Prior art keywords
trained
image
model
images
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311142090.3A
Other languages
Chinese (zh)
Inventor
The inventor has requested that the name not be published
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202311142090.3A priority Critical patent/CN117197612A/en
Publication of CN117197612A publication Critical patent/CN117197612A/en
Pending legal-status Critical Current


Abstract

The present disclosure relates to the field of information processing technologies, and in particular to a training method, a detection method, a device, an electronic device, and a storage medium. The training method includes: acquiring a first sample; generating a head posture label and an eye position label corresponding to each image to be trained; and, based on the plurality of images to be trained, obtaining a second sample by using the sight line label, the head posture label and the eye position label corresponding to each image to be trained, and training the detection model with the second sample to obtain a trained detection model. According to the embodiments of the present disclosure, the head posture label and the eye position label are added to the first sample, so that the trained detection model can simultaneously learn the head posture information, the eye position information and the line-of-sight information in the training samples, which improves the accuracy of the detection model while increasing the detection speed of the model and the convergence speed of training.

Description

Training method, detection device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of information processing, and in particular to a training method, a detection method, an apparatus, an electronic device and a storage medium.
Background
Line-of-sight detection technology determines the gaze direction of a target object from images captured by an image acquisition device, and can be applied to technical fields such as attention analysis of the target object and functional support for somatosensory games. The line-of-sight detection result is usually generated by a detection model, and the accuracy and inference time of the detection model directly determine the accuracy and time consumption of the upper-layer task. Therefore, how to better train the detection model is a technical problem that developers need to solve.
Disclosure of Invention
The present disclosure provides a technical solution for training and detection.
According to an aspect of the present disclosure, there is provided a training method applied to a detection model, the training method including: acquiring a first sample; wherein the first sample comprises: the plurality of images to be trained and the sight line labels respectively corresponding to the images to be trained; the sight line tag is used for representing the sight line direction of a target person in the image to be trained; generating a head posture label and an eye position label which correspond to each image to be trained respectively; the head gesture label is used for representing the head direction of a target person in the image to be trained, and the eye position label is used for representing the eye position of the target person in the image to be trained; based on the multiple images to be trained, a second sample is obtained by using the sight line label, the head posture label and the eye position label corresponding to each image to be trained, and the detection model is trained by using the second sample to obtain a trained detection model.
In a possible implementation manner, the generating a head pose label and an eye position label corresponding to each image to be trained respectively includes: obtaining head area images corresponding to the images to be trained respectively according to the images to be trained; and generating head posture labels and eye position labels respectively corresponding to the images to be trained according to the head area images respectively corresponding to the images to be trained.
In a possible implementation manner, the generating, according to the head area images corresponding to the to-be-trained images respectively, a head pose tag and an eye position tag corresponding to the to-be-trained images respectively includes: generating head posture angles corresponding to the images to be trained respectively according to the target images, and taking the angles of the head posture angles as head posture labels corresponding to the images to be trained respectively; generating eye key points corresponding to the images to be trained respectively according to the target images, and taking the position information of the eye key points in the head area images corresponding to the images to be trained respectively as eye position labels corresponding to the images to be trained respectively; the target image is an image to be trained or an image obtained by normalizing the image to be trained based on a preset normalization rule.
In one possible implementation manner, the preset normalization rule is obtained by the following manner: according to a plurality of images to be trained in the first sample, determining the mean value and standard deviation of pixel values among the plurality of images to be trained; and determining a preset normalization rule according to the mean value and the standard deviation.
In a possible implementation manner, the training the detection model using the second sample to obtain a trained detection model includes: training the detection model by using the second sample, and adjusting model parameters of the detection model based on the overall loss during training to obtain a trained detection model; wherein the overall loss comprises: a sight line estimation loss corresponding to the sight line label, a head posture estimation loss corresponding to the head posture label, and an eye position estimation loss corresponding to the eye position label; the value of the overall loss is positively correlated with any one of the sight line estimation loss, the head posture estimation loss, and the eye position estimation loss.
In one possible embodiment, the line-of-sight estimation loss includes a line-of-sight estimation regression loss, a line-of-sight estimation classification loss; the vision estimation regression loss comprises a difference value between a vision label corresponding to the second sample and a predicted vision label; the sight line estimation classification loss comprises a difference value of a predicted sight line label on two image coordinate axes of an image in a second sample; the head pose estimation loss comprises a difference value between a head pose label corresponding to the second sample and a predicted head pose label; the eye position estimation loss includes a difference between an eye position tag corresponding to the second sample and a predicted eye position tag.
In a possible embodiment, the trained detection model is used to obtain a line of sight detection result, or a line of sight detection result and a head posture detection result, or a line of sight detection result and an eye position detection result, or a line of sight detection result and a head posture detection result and an eye position detection result of an image.
According to an aspect of the present disclosure, there is provided a detection method including: acquiring an image to be detected; generating a model detection result corresponding to the image to be detected according to the image to be detected and the trained detection model; determining a target detection result corresponding to the image to be detected according to a model detection result corresponding to the image to be detected; the trained detection model is obtained by training the training method.
In a possible implementation manner, the image to be detected is a frame of image in a video acquired in real time, and the determining, according to a model detection result corresponding to the image to be detected, a target detection result corresponding to the image to be detected includes: and determining a target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected, the target detection result corresponding to the image of the previous frame of the image to be detected and the preset weight corresponding to each of the model detection result and the target detection result.
In a possible implementation manner, the generating, according to the image to be detected and the trained detection model, a model detection result corresponding to the image to be detected includes: generating a head region image corresponding to the image to be detected according to the image to be detected; carrying out normalization processing on the head region image according to a preset normalization rule; and inputting the normalized head region image into a trained detection model to obtain a model detection result corresponding to the image to be detected.
According to an aspect of the present disclosure, there is provided a training apparatus applied to a detection model, the training apparatus including: the sample acquisition module is used for acquiring a first sample; wherein the first sample comprises: the plurality of images to be trained and the sight line labels respectively corresponding to the images to be trained; the sight line tag is used for representing the sight line direction of a target person in the image to be trained; the label generating module is used for generating a head gesture label and an eye position label which correspond to the images to be trained respectively; the head gesture label is used for representing the head direction of a target person in the image to be trained, and the eye position label is used for representing the eye position of the target person in the image to be trained; the model training module is used for obtaining a second sample by utilizing the sight line label, the head posture label and the eye position label corresponding to each image to be trained based on the plurality of images to be trained, and training the detection model by utilizing the second sample to obtain a trained detection model.
According to an aspect of the present disclosure, there is provided a detection apparatus including: the image acquisition module is used for acquiring an image to be detected; the model detection result generation module is used for generating a model detection result corresponding to the image to be detected according to the image to be detected and the trained detection model; the target detection result generation module is used for determining a target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected; the trained detection model is obtained by training the training device.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiments of the disclosure, a first sample may be acquired, and then a head posture label and an eye position label corresponding to each image to be trained are generated. Finally, based on the plurality of images to be trained, a second sample is obtained by using the sight line label, the head posture label and the eye position label corresponding to each image to be trained, and the detection model is trained by using the second sample to obtain a trained detection model. According to the embodiments of the disclosure, the head posture label and the eye position label are added to the first sample, so that the trained detection model can simultaneously learn the head posture information, the eye position information and the line-of-sight information in the training samples; the generation processes of the head posture detection result, the eye position detection result and the sight line detection result can therefore remain parallel, and the detection speed of the model and the convergence speed of training can be increased while the accuracy of the detection model is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 shows a flowchart of a training method provided according to an embodiment of the present disclosure.
Fig. 2 shows a flowchart of a detection method provided according to an embodiment of the present disclosure.
Fig. 3 shows a block diagram of a training apparatus provided in accordance with an embodiment of the present disclosure.
Fig. 4 shows a block diagram of a detection apparatus provided according to an embodiment of the present disclosure.
Fig. 5 shows a block diagram of an electronic device provided in accordance with an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
In the related art, the line-of-sight detection model is usually trained on an existing training set, which consists of images to be trained and the line-of-sight labels corresponding to those images. In one example, to improve the accuracy of the line-of-sight detection result, it is generally generated in combination with a head posture detection result. The generation of the head posture detection result and the generation of the sight line detection result are mutually independent and can be regarded as produced by two detection models: the head posture detection model outputs the head posture detection result, which is then fed into the sight line detection model, and the sight line detection model outputs the sight line detection result. This tends to cause the following problems: 1. The labels of the existing training set are limited to a single type and are difficult to combine with practical application scenarios. 2. If the line-of-sight detection precision is to be improved, additional processing needs to be performed on the line-of-sight detection result; this is a serial process, which makes the line-of-sight detection flow too slow in actual deployment.
In view of this, the embodiments of the disclosure provide a training method, which may acquire a first sample and then generate a head posture label and an eye position label corresponding to each image to be trained. Finally, based on the plurality of images to be trained, a second sample is obtained by using the sight line label, the head posture label and the eye position label corresponding to each image to be trained, and the detection model is trained by using the second sample to obtain a trained detection model. According to the embodiments of the disclosure, the head posture label and the eye position label are added to the first sample, so that the trained detection model can simultaneously learn the head posture information, the eye position information and the line-of-sight information in the training samples; the generation processes of the head posture detection result, the eye position detection result and the sight line detection result can therefore remain parallel, and the detection speed of the model and the convergence speed of training can be increased while the accuracy of the detection model is improved.
Referring to fig. 1, fig. 1 shows a flowchart of a training method provided according to an embodiment of the disclosure, and in conjunction with fig. 1, the training method may be applied to a detection model, where the training method includes:
Step S100, a first sample is acquired. Wherein the first sample comprises: the plurality of images to be trained and the sight line labels respectively corresponding to the images to be trained; the sight line label is used for representing the sight line direction of a target person in the image to be trained. In one example, the first sample may be selected according to actual requirements of a developer, and may include a data set existing in the related art.
Step S200, generating a head posture label and an eye position label corresponding to each image to be trained. The head posture label is used for representing the head direction of the target person in the image to be trained, and the eye position label is used for representing the eye position of the target person in the image to be trained.
In one possible implementation, step S200 may include: obtaining head region images corresponding to the images to be trained according to the images to be trained, and then generating the head posture labels and the eye position labels corresponding to the images to be trained according to those head region images. Illustratively, the head region image may be obtained by a head detection algorithm or a head detection model in the related art. In one example, the head region image may be generated by one of the sub-models in the detection model. For example, the sub-model may be trained with images to be trained and head region labels corresponding to those images (which may be represented as the coordinate information of the vertices of the head region detection frame in the image), so that the trained sub-model can generate a head region image from an input image; the specific training manner is not limited herein. According to the embodiments of the present disclosure, extracting the head region image from the current image allows the subsequent sight line detection task to focus on the head region and reduces the influence of irrelevant image features on the sight line detection precision.
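In one example, the head region extraction step may be sketched in Python as follows. This is only an illustrative sketch: the present disclosure does not specify a particular detector, so an OpenCV Haar cascade face detector is used here purely as a stand-in for the head detection algorithm or sub-model, and the box expansion factor is an assumption.

# Illustrative sketch only: OpenCV's Haar cascade face detector stands in for
# the unspecified head detection algorithm / sub-model of the disclosure.
import cv2

def crop_head_region(image_bgr, scale=1.3):
    """Return a head-region crop around the largest detected face, or None."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face box
    # Expand the face box slightly so the crop covers the whole head (assumed factor).
    cx, cy = x + w / 2.0, y + h / 2.0
    half = scale * max(w, h) / 2.0
    x0, y0 = int(max(cx - half, 0)), int(max(cy - half, 0))
    x1 = int(min(cx + half, image_bgr.shape[1]))
    y1 = int(min(cy + half, image_bgr.shape[0]))
    return image_bgr[y0:y1, x0:x1]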
In a possible implementation manner, the generating, according to the head region images corresponding to the images to be trained, a head posture label and an eye position label corresponding to each image to be trained includes: generating head posture angles corresponding to the images to be trained according to the target images, and taking the angles of the head posture angles as the head posture labels corresponding to the images to be trained. For example, the head posture angles may include a pitch angle (pitch), a yaw angle (yaw), and a roll angle (roll) as used in the related art. In one example, after the head posture labels are obtained, manual calibration may be performed before the subsequent model training, so as to improve the accuracy of the sight line detection result and the head posture detection result finally generated in the model application stage. In one example, the head posture angles may be obtained by an algorithm or model in the related art, which the embodiments of the present disclosure do not limit; for example, they may be obtained by 6DRepNet, a head posture angle detection model that uses a six-dimensional rotation representation of the head posture. The method further includes: generating eye key points corresponding to the images to be trained according to the target images, and taking the position information of the eye key points in the head region images corresponding to the images to be trained as the eye position labels corresponding to the images to be trained. The target image is an image to be trained, or an image obtained by normalizing the image to be trained based on a preset normalization rule. For example, the eye key point may be a preset key point among a plurality of eye key points, such as an eye center key point, and may be determined in combination with the eye key point recognition algorithm. If the recognition algorithm does not provide a center key point, the median of the coordinate information of the plurality of eye key points may be calculated and used as the eye position label. In one example, the key points may be obtained by an algorithm or model in the related art, which the embodiments of the present disclosure do not limit. For example, the coordinates of the facial key points in the image may be obtained by SLPT (full name: Sparse Local Patch Transformer), a facial key point detection model, and the key points corresponding to the eyes may then be selected to obtain the position information of the eye key points. Illustratively, the position information of the eye key points may include the coordinates of the left-eye key point and the coordinates of the right-eye key point (which may be represented as coordinates in the pixel coordinate system of the image). The normalization rule may be adjusted according to the actual needs of the developer. In one example, the mean value and standard deviation of the pixel values among the plurality of images to be trained in the first sample may be determined according to those images, and the preset normalization rule may then be determined according to the mean value and the standard deviation.
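In one example, the eye position label generation described above may be sketched as follows. The landmark index sets, the choice of the median as the eye-centre estimate when no centre key point is available, and the scaling by the crop size are assumptions introduced only for illustration.

# Illustrative sketch only: a facial landmark model (e.g. SLPT) is assumed to
# return an (N, 2) array of (x, y) key points in head-crop pixel coordinates.
import numpy as np

LEFT_EYE_IDX = [60, 61, 62, 63, 64, 65, 66, 67]   # assumed landmark indices
RIGHT_EYE_IDX = [68, 69, 70, 71, 72, 73, 74, 75]  # assumed landmark indices

def eye_position_label(landmarks, crop_w, crop_h):
    """Median of the eye key points for each eye, scaled by the head-crop size."""
    left = np.median(landmarks[LEFT_EYE_IDX], axis=0)
    right = np.median(landmarks[RIGHT_EYE_IDX], axis=0)
    # Scaling by the crop size is an assumption; raw pixel coordinates also work.
    return np.array([left[0] / crop_w, left[1] / crop_h,
                     right[0] / crop_w, right[1] / crop_h], dtype=np.float32)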
Illustratively, if the mean is denoted mean, the standard deviation std, and the image before normalization X1, the normalized image X2 may be expressed as X2 = (X1/255 - mean)/std; this formula can be regarded as the above-described preset normalization rule. It should be understood that the developer may also adjust the rule according to the actual situation, and the embodiments of the present disclosure are not limited herein.
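In one example, the preset normalization rule may be sketched as follows. Computing the statistics per colour channel and assuming the images have already been resized to a common shape are assumptions; the disclosure does not state whether the mean and standard deviation are scalar or per-channel.

import numpy as np

def fit_normalization(images):
    """Mean/std over the first-sample images, with pixel values first scaled to [0, 1].
    Assumes the images to be trained have been resized to a common shape."""
    stacked = np.stack([img.astype(np.float32) / 255.0 for img in images])
    return stacked.mean(axis=(0, 1, 2)), stacked.std(axis=(0, 1, 2))  # per channel

def normalize(image, mean, std):
    """Apply the preset rule X2 = (X1 / 255 - mean) / std."""
    return (image.astype(np.float32) / 255.0 - mean) / std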
With continued reference to fig. 1, step S300 includes: based on the plurality of images to be trained, obtaining a second sample by using the sight line label, the head posture label and the eye position label corresponding to each image to be trained, and training the detection model by using the second sample to obtain a trained detection model. In one possible embodiment, the trained detection model is used to obtain a sight line detection result of an image; or a sight line detection result and a head posture detection result; or a sight line detection result and an eye position detection result; or a sight line detection result, a head posture detection result and an eye position detection result. The output of the trained detection model can be configured by adding or deleting detection heads of the detection model. For example, each detection result may correspond to one detection head, and removing a detection head removes the corresponding detection result. It should be understood that even if the detection head corresponding to a detection result is removed, the trained detection model has still learned the features associated with that result, which helps improve the accuracy of the remaining detection results. According to the embodiments of the present disclosure, outputting multiple detection results at once can meet the requirements of some practical application scenarios, the detection heads can be added or deleted according to the requirements of the upper-layer task, and the adaptability of the trained detection model is improved. For example, the network structure of the detection model may refer to model structures such as MobileNet, ShuffleNet or ResNet in the related art, or a lightweight structure such as MobileNetV3-Small may be used, with the number and dimensions of the detection heads modified accordingly on the basis of that structure.
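In one example, such a multi-head detection model may be sketched as follows, assuming a MobileNetV3-Small backbone from torchvision. The number of yaw/pitch bins, the dimensions of the detection heads and the output names are assumptions; a detection head can be added or removed simply by adding or deleting the corresponding branch.

# Illustrative sketch only; requires torchvision >= 0.13 for the weights= argument.
import torch
import torch.nn as nn
import torchvision

class MultiHeadGazeNet(nn.Module):
    """Shared backbone with parallel detection heads (head sizes are assumptions)."""
    def __init__(self, yaw_bins=90, pitch_bins=90):
        super().__init__()
        backbone = torchvision.models.mobilenet_v3_small(weights=None)
        self.features = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        feat_dim = 576  # output channels of the MobileNetV3-Small feature extractor
        self.gaze_yaw_head = nn.Linear(feat_dim, yaw_bins)      # yaw_vector
        self.gaze_pitch_head = nn.Linear(feat_dim, pitch_bins)  # pitch_vector
        self.head_pose_head = nn.Linear(feat_dim, 3)            # pitch / yaw / roll
        self.eye_pos_head = nn.Linear(feat_dim, 4)               # two eye centres

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)
        return {
            "yaw_vector": self.gaze_yaw_head(f),
            "pitch_vector": self.gaze_pitch_head(f),
            "head_pose": self.head_pose_head(f),
            "eye_position": self.eye_pos_head(f),
        }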
In one example, the training the detection model using the second sample to obtain a trained detection model includes: training the detection model by using the second sample, and adjusting the model parameters of the detection model based on the overall loss during training to obtain a trained detection model; wherein the overall loss comprises: a sight line estimation loss corresponding to the sight line label, a head posture estimation loss corresponding to the head posture label, and an eye position estimation loss corresponding to the eye position label; the value of the overall loss is positively correlated with any one of the sight line estimation loss, the head posture estimation loss, and the eye position estimation loss. The detection model can be trained in a supervised manner through the above labels; the model parameters of the detection model and the specific training process are not limited in the embodiments of the present disclosure. Each label may correspond to one loss function, so that multiple prediction losses can be back-propagated to automatically adjust the model parameters of the detection model.
In one example, the overall loss of the detection model may include: a sight line estimation regression loss loss1 (one of the sight line estimation losses), a sight line estimation classification loss loss2 (one of the sight line estimation losses), an eye position estimation loss loss3, and a head posture estimation loss loss4. The sight line estimation regression loss includes the difference between the sight line label corresponding to the second sample and the predicted sight line label; illustratively, it may be computed over a training batch as a function of the difference between Y_pred_gaze and Y_gaze for each image, where N is the batch size during model training, i is the sequence number of the current image, Y_pred_gaze represents the sight line label predicted by the model (which may be represented as a vector), and Y_gaze represents the sight line label corresponding to the image to be trained. Illustratively, Y_pred_gaze may be decoded from the network output by weighting each element of softmax(yaw_vector) by its sequence number in index_vector and adding yaw_offset, and likewise for the pitch direction. Here softmax is a normalization function in the related art that maps its output to the range 0 to 1, so that different values can be compared on one scale; yaw_vector represents the component of the sight line vector predicted by the detection model in the yaw direction (it may include a plurality of elements, the specific number of which can be set by the developer), and index_vector is the sequence number of each element in that component; yaw_offset represents the offset between the yaw axis corresponding to the component and the reference yaw axis (an image coordinate axis). Similarly, pitch_vector represents the component of the sight line vector predicted by the detection model in the pitch direction, index_vector is the sequence number of each element in that component, and pitch_offset represents the offset between the pitch axis corresponding to the component and the reference pitch axis (an image coordinate axis). The sight line estimation classification loss (loss2) includes the difference of the predicted sight line label (for example, a label expressed in vector form) on the two image coordinate axes of the image in the second sample, where Y_pred_gaze_yaw = softmax(yaw_vector) represents the normalized component of the sight line vector predicted by the detection model in the yaw direction, Y_gaze_yaw represents the component of the vector corresponding to the normalized sight line label in the yaw direction, Y_pred_gaze_pitch = softmax(pitch_vector) represents the normalized component of the predicted sight line vector in the pitch direction, and Y_gaze_pitch represents the component of the vector corresponding to the normalized sight line label in the pitch direction.
The eye position estimation loss (loss3) includes the difference between the eye position label corresponding to the second sample and the predicted eye position label, where Y_pred_eye_position represents the eye position label predicted by the detection model and Y_eye_position represents the eye position label corresponding to the image to be trained.
The head posture estimation loss (loss4) includes the difference between the head posture label corresponding to the second sample and the predicted head posture label, where Y_pred_headpose represents the head posture label predicted by the detection model and Y_headpose represents the head posture label corresponding to the image to be trained.
The overall loss loss of the detection model can be expressed as: loss = w1 × loss1 + w2 × loss2 + w3 × loss3 + w4 × loss4, where w1, w2, w3 and w4 represent the weights corresponding to loss1, loss2, loss3 and loss4 respectively, and the sum of the four weights is 1; the embodiments of the disclosure are not limited thereto.
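In one example, the overall loss may be sketched as follows. The concrete distance functions (L1 for the regression terms and a soft cross-entropy over the yaw/pitch bins for the classification term), the bin decoding, and the example weights are assumptions: the disclosure only describes the losses as differences between predicted and ground-truth labels, and the exact formulas are not reproduced here.

# Illustrative sketch only; pred is the output dictionary of the multi-head
# model sketched above, target holds the labels of the second sample.
import torch
import torch.nn.functional as F

def decode_gaze(yaw_vector, pitch_vector, yaw_offset=0.0, pitch_offset=0.0):
    """Soft decoding of the predicted sight line: each bin's sequence number is
    weighted by its softmax probability and the axis offset is added (assumed form)."""
    idx_yaw = torch.arange(yaw_vector.shape[1], dtype=torch.float32,
                           device=yaw_vector.device)
    idx_pitch = torch.arange(pitch_vector.shape[1], dtype=torch.float32,
                             device=pitch_vector.device)
    yaw = (F.softmax(yaw_vector, dim=1) * idx_yaw).sum(dim=1) + yaw_offset
    pitch = (F.softmax(pitch_vector, dim=1) * idx_pitch).sum(dim=1) + pitch_offset
    return torch.stack([yaw, pitch], dim=1)

def overall_loss(pred, target, w=(0.4, 0.2, 0.2, 0.2)):
    """loss = w1*loss1 + w2*loss2 + w3*loss3 + w4*loss4, with example weights summing to 1."""
    # loss1: sight line estimation regression loss (L1 distance assumed)
    gaze_pred = decode_gaze(pred["yaw_vector"], pred["pitch_vector"])
    loss1 = F.l1_loss(gaze_pred, target["gaze"])
    # loss2: sight line estimation classification loss on the two axes
    # (soft cross-entropy against binned ground-truth distributions assumed)
    loss2 = (-(target["gaze_yaw_bins"] *
               F.log_softmax(pred["yaw_vector"], dim=1)).sum(dim=1).mean()
             - (target["gaze_pitch_bins"] *
                F.log_softmax(pred["pitch_vector"], dim=1)).sum(dim=1).mean())
    # loss3: eye position estimation loss
    loss3 = F.l1_loss(pred["eye_position"], target["eye_position"])
    # loss4: head posture estimation loss
    loss4 = F.l1_loss(pred["head_pose"], target["head_pose"])
    w1, w2, w3, w4 = w
    return w1 * loss1 + w2 * loss2 + w3 * loss3 + w4 * loss4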
In one possible implementation, step S300 may include: taking the normalized head region image, the corresponding sight line label, the corresponding head posture label and the corresponding eye position label of each image to be trained in the plurality of images to be trained as a new first sample, and training the detection model according to the new first sample to obtain a trained detection model. According to the embodiments of the present disclosure, the image to be trained can be normalized to obtain the normalized head region image, and the detection model is then trained based on the new first sample, which can reduce the influence of the image size on feature extraction and improve the detection precision of the detection model. The training process may refer to the above and is not repeated here.
In a possible implementation manner, the training the detection model using the second sample to obtain a trained detection model includes: splitting the second sample into a training set, a verification set and a test set according to a preset rule. In one example, the proportions of the training set, the verification set and the test set may be set by the developer as the case may be, and the embodiments of the present disclosure are not limited herein. A plurality of machine learning models are then trained according to the training set; the types of the plurality of machine learning models may be the same or different. The target model with the highest precision among the trained machine learning models is determined according to the verification set; for example, the images in the verification set may be input into the machine learning models, and the precision of each model may be obtained by comparison with the labels corresponding to the verification set. The target model is then retrained according to the training set and the verification set to obtain a trained target model, which is taken as the trained detection model. Finally, the precision of the trained detection model is determined according to the test set. The training method provided by the embodiments of the present disclosure can thus also provide the developer with the precision of the trained detection model, which helps the developer evaluate the trained detection model.
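In one example, the splitting and model selection procedure may be sketched as follows. The split ratios, the fit/evaluate interface of the candidate models, and the precision metric are assumptions.

import random

def split_and_select(second_sample, candidate_builders, ratios=(0.7, 0.15, 0.15)):
    """Split the second sample, pick the most accurate candidate on the verification
    set, retrain it on training + verification data, and report its test precision."""
    data = list(second_sample)
    random.shuffle(data)
    n = len(data)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    train, val, test = (data[:n_train],
                        data[n_train:n_train + n_val],
                        data[n_train + n_val:])
    candidates = [build() for build in candidate_builders]
    for model in candidates:
        model.fit(train)                                  # assumed training interface
    target_model = max(candidates, key=lambda m: m.evaluate(val))  # highest precision
    target_model.fit(train + val)                         # retrain the target model
    return target_model, target_model.evaluate(test)      # trained model + its precision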
According to the training method provided by the embodiments of the present disclosure, using the head posture label, the eye position label and the sight line label together as supervision information for training can accelerate the convergence of the detection model, improve its comprehensiveness and stability, and help improve its detection precision.
Referring to fig. 2, fig. 2 shows a flowchart of a detection method provided according to an embodiment of the present disclosure, and in conjunction with fig. 2, an embodiment of the present disclosure further provides a detection method, where the detection method includes: step S600, an image to be detected is acquired. For example, the image to be detected may be acquired by an image acquisition device in the related art.
And step S700, generating a model detection result corresponding to the image to be detected according to the image to be detected and the trained detection model. For example, the image to be detected may be input to a trained detection model to obtain a model detection result. In one example, the image to be detected may also be processed according to an image processing algorithm in the related art, and then the processed image to be detected is input into a trained detection model.
In one possible implementation, step S700 may include: generating a head region image corresponding to the image to be detected according to the image to be detected. Illustratively, the head region image may be obtained by a head detection algorithm or a head detection model in the related art. In one example, the head region image may be generated by one of the sub-models in the detection model. For example, the sub-model may be trained with images to be trained and head region labels corresponding to those images (which may be represented as the coordinate information of the vertices of the head region detection frame in the image), so that the trained sub-model can generate a head region image from an input image; the specific training manner is not limited herein. According to the embodiments of the present disclosure, extracting the head region image from the current image allows the subsequent sight line detection task to focus on the head region and reduces the influence of irrelevant image features on the sight line detection precision. Next, the head region image is normalized according to a preset normalization rule. Finally, the normalized head region image is input into the trained detection model to obtain a model detection result corresponding to the image to be detected. Illustratively, the normalization rule is not limited in the embodiments of the present disclosure and may be consistent with the normalization rule in the training method above.
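In one example, the inference path of step S700 may be sketched as follows, reusing the crop_head_region and normalize helpers sketched for the training method above; the input resolution and the tensor layout are assumptions.

import cv2
import torch

def detect(image_bgr, model, mean, std):
    """Crop the head region, normalize it with the same preset rule as training,
    and run the trained detection model to obtain the model detection result."""
    head = crop_head_region(image_bgr)
    if head is None:
        return None
    head = cv2.resize(head, (224, 224))                      # assumed input size
    x = normalize(head, mean, std)                           # (H, W, C) float32
    x = torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)    # to (1, C, H, W)
    model.eval()
    with torch.no_grad():
        return model(x)                                      # model detection result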
With continued reference to fig. 2, step S800 determines a target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected. For example, the model detection result corresponding to the image to be detected can be directly used as the target detection result corresponding to the image to be detected. In one example, the model detection result may also be subjected to some subsequent processing in the related art, so as to improve the accuracy of the target detection result or adapt to a specific application scenario, and the embodiments of the disclosure are not limited herein.
In one possible implementation, the model detection result and/or the target detection result includes: a sight line detection result; or a sight line detection result and a head posture detection result; or a sight line detection result and an eye position detection result; or a sight line detection result, a head posture detection result and an eye position detection result. The output of the trained detection model can be configured by adding or deleting detection heads of the detection model. For example, each detection result may correspond to one detection head, and removing a detection head removes the corresponding detection result. It should be understood that even if the detection head corresponding to a detection result is removed, the trained detection model has still learned the features associated with that result, which helps improve the accuracy of the sight line detection result. According to the embodiments of the present disclosure, outputting multiple detection results at once can meet the requirements of some practical application scenarios, the detection heads can be added or deleted according to the requirements of the upper-layer task, and the adaptability of the trained detection model is improved.
In one possible implementation manner, the image to be detected is a frame of image in a video acquired in real time, and step S800 includes: determining the target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected, the target detection result corresponding to the previous frame of the image to be detected, and the preset weights corresponding to the model detection result and the target detection result respectively. The embodiments of the present disclosure do not limit the specific values of the preset weights; the developer can determine them according to the actual situation, with the sum of the preset weights being 1. In one example, the preset weight of the model detection result may be greater than the preset weight of the target detection result corresponding to the previous frame, so as to improve the accuracy of the target detection result corresponding to the image to be detected.
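In one example, the weighted fusion with the previous frame's target detection result may be sketched as follows. The weight value 0.7 is only an example consistent with the preceding paragraph (the model detection result weighted more heavily), and representing the results as dictionaries of numeric values is an assumption.

def smooth_result(model_result, prev_target_result, w_model=0.7):
    """Weighted combination of the current model detection result with the previous
    frame's target detection result; the two preset weights sum to 1."""
    if prev_target_result is None:
        return model_result
    w_prev = 1.0 - w_model
    return {key: w_model * model_result[key] + w_prev * prev_target_result[key]
            for key in model_result}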
Referring to fig. 3, fig. 3 shows a block diagram of a training apparatus provided according to an embodiment of the disclosure, and in conjunction with fig. 3, an embodiment of the disclosure further provides a training apparatus 100 applied to a detection model, where the training apparatus 100 includes: a sample acquisition module 110 for acquiring a first sample; wherein the first sample comprises: the plurality of images to be trained and the sight line labels respectively corresponding to the images to be trained; the sight line tag is used for representing the sight line direction of a target person in the image to be trained; the label generating module 120 is configured to generate a head pose label and an eye position label corresponding to each image to be trained; the head gesture label is used for representing the head direction of a target person in the image to be trained, and the eye position label is used for representing the eye position of the target person in the image to be trained; the model training module 130 is configured to obtain a second sample based on the multiple images to be trained by using a line-of-sight tag, a corresponding head gesture tag, and a corresponding eye position tag corresponding to each image to be trained, and train the detection model by using the second sample to obtain a trained detection model.
In a possible implementation manner, the generating a head pose label and an eye position label corresponding to each image to be trained respectively includes: obtaining head area images corresponding to the images to be trained respectively according to the images to be trained; and generating head posture labels and eye position labels respectively corresponding to the images to be trained according to the head area images respectively corresponding to the images to be trained.
In a possible implementation manner, the generating, according to the head area images corresponding to the to-be-trained images respectively, a head pose tag and an eye position tag corresponding to the to-be-trained images respectively includes: generating head posture angles corresponding to the images to be trained respectively according to the target images, and taking the angles of the head posture angles as head posture labels corresponding to the images to be trained respectively; generating eye key points corresponding to the images to be trained respectively according to the target images, and taking the position information of the eye key points in the head area images corresponding to the images to be trained respectively as eye position labels corresponding to the images to be trained respectively; the target image is an image to be trained or an image obtained by normalizing the image to be trained based on a preset normalization rule.
In one possible implementation manner, the preset normalization rule is obtained by the following manner: according to a plurality of images to be trained in the first sample, determining the mean value and standard deviation of pixel values among the plurality of images to be trained; and determining a preset normalization rule according to the mean value and the standard deviation.
In a possible implementation manner, the training the detection model using the second sample to obtain a trained detection model includes: training the detection model by using the second sample, and adjusting model parameters of the detection model based on the overall loss during training to obtain a trained detection model; wherein the overall loss comprises: a sight line estimation loss corresponding to the sight line label, a head posture estimation loss corresponding to the head posture label, and an eye position estimation loss corresponding to the eye position label; the value of the overall loss is positively correlated with any one of the sight line estimation loss, the head posture estimation loss, and the eye position estimation loss.
In one possible embodiment, the line-of-sight estimation loss includes a line-of-sight estimation regression loss, a line-of-sight estimation classification loss; the vision estimation regression loss comprises a difference value between a vision label corresponding to the second sample and a predicted vision label; the sight line estimation classification loss comprises a difference value of a predicted sight line label on two image coordinate axes of an image in a second sample; the head pose estimation loss comprises a difference value between a head pose label corresponding to the second sample and a predicted head pose label; the eye position estimation loss includes a difference between an eye position tag corresponding to the second sample and a predicted eye position tag.
In a possible embodiment, the trained detection model is used to obtain a line of sight detection result, or a line of sight detection result and a head posture detection result, or a line of sight detection result and an eye position detection result, or a line of sight detection result and a head posture detection result and an eye position detection result of an image.
Referring to fig. 4, fig. 4 shows a block diagram of a detection apparatus provided according to an embodiment of the disclosure, and in conjunction with fig. 4, an embodiment of the disclosure further provides a detection apparatus 200, where the detection apparatus 200 includes: an image acquisition module 210, configured to acquire an image to be detected; the model detection result generating module 220 is configured to generate a model detection result corresponding to the image to be detected according to the image to be detected and the trained detection model; the target detection result generating module 230 is configured to determine a target detection result corresponding to the image to be detected according to a model detection result corresponding to the image to be detected; the trained detection model is obtained by training the training device.
In a possible implementation manner, the image to be detected is a frame of image in a video acquired in real time, and the determining, according to a model detection result corresponding to the image to be detected, a target detection result corresponding to the image to be detected includes: and determining a target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected, the target detection result corresponding to the image of the previous frame of the image to be detected and the preset weight corresponding to each of the model detection result and the target detection result.
In a possible implementation manner, the generating, according to the image to be detected and the trained detection model, a model detection result corresponding to the image to be detected includes: generating a head region image corresponding to the image to be detected according to the image to be detected; carrying out normalization processing on the head region image according to a preset normalization rule; and inputting the normalized head region image into a trained detection model to obtain a model detection result corresponding to the image to be detected.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from their principles and logic; for brevity, the combinations are not described in the present disclosure. It will also be appreciated by those skilled in the art that in the above methods of the embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the disclosure further provides an electronic device, a computer readable storage medium, and a program, each of which may be used to implement any one of the training method and the detection method provided by the disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions of the method parts, which are not repeated here.
The method has a specific technical association with the internal structure of the computer system and can solve technical problems of improving hardware operation efficiency or execution effect (including reducing the amount of data stored, reducing the amount of data transmitted, increasing the hardware processing speed, and the like), thereby obtaining the technical effect of improving the internal performance of the computer system in conformity with the laws of nature.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
The electronic device may be provided as a terminal device, a server or other form of device.
Referring to fig. 5, fig. 5 illustrates a block diagram of an electronic device 1900 provided in accordance with an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server or terminal device. Referring to FIG. 5, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple operating system (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, so that the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be implemented in hardware, software, or a combination of the two. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) or the like.
The foregoing descriptions of the various embodiments each emphasize the differences from the other embodiments; for the parts that are the same or similar, the embodiments may be referred to one another, and those parts are not repeated here for the sake of brevity.
It will be appreciated by those skilled in the art that, in the methods of the specific embodiments described above, the written order of the steps does not imply a strict order of execution; the actual order of execution should be determined by the functions of the steps and their possible internal logic.
If the technical solution of the present application involves personal information, any product applying the technical solution of the present application clearly informs individuals of the personal information processing rules and obtains their voluntary consent before processing the personal information. If the technical solution of the present application involves sensitive personal information, any product applying the technical solution of the present application obtains the individual's separate consent before processing the sensitive personal information, and also satisfies the requirement of "explicit consent". For example, a clear and prominent sign is set up at a personal information collection device, such as a camera, to inform people that they are entering a personal information collection area and that personal information will be collected, so that an individual who voluntarily enters the collection area is regarded as consenting to the collection of his or her personal information; alternatively, on a device that processes personal information, where the personal information processing rules are communicated by means of obvious signs or notices, personal authorization is obtained through a pop-up message or by requesting the individual to upload his or her personal information, and so on. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A training method applied to a detection model, the training method comprising:
acquiring a first sample; wherein the first sample comprises: a plurality of images to be trained and sight line labels respectively corresponding to the images to be trained; the sight line label is used for representing a sight line direction of a target person in the image to be trained;
generating a head posture label and an eye position label respectively corresponding to each image to be trained; wherein the head posture label is used for representing a head direction of the target person in the image to be trained, and the eye position label is used for representing an eye position of the target person in the image to be trained;
obtaining, based on the plurality of images to be trained, a second sample by using the sight line label, the head posture label and the eye position label corresponding to each image to be trained, and training the detection model by using the second sample to obtain a trained detection model.
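By way of non-limiting illustration only, the following is a minimal sketch of how the second sample of claim 1 might be assembled and used to train the detection model. The record fields, the assumption that the model returns sight line, head posture and eye position predictions from a single forward pass, the optimizer and all hyperparameters are choices of the sketch, not features recited in the claim.

```python
# Illustrative sketch only: pairing each image to be trained with its sight
# line, head posture and eye position labels (the "second sample"), then
# training a multi-output detection model on it.
from dataclasses import dataclass
from typing import List, Tuple

import torch
from torch.utils.data import DataLoader, Dataset


@dataclass
class TrainingRecord:
    image: torch.Tensor            # C x H x W head-region image
    gaze_label: torch.Tensor       # e.g. (yaw, pitch) of the sight line
    head_pose_label: torch.Tensor  # e.g. (yaw, pitch, roll) of the head
    eye_pos_label: torch.Tensor    # e.g. normalized eye-centre coordinates


class SecondSample(Dataset):
    """Images to be trained plus the three label types of claim 1."""

    def __init__(self, records: List[TrainingRecord]):
        self.records = records

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, ...]:
        r = self.records[idx]
        return r.image, r.gaze_label, r.head_pose_label, r.eye_pos_label


def train(model: torch.nn.Module, dataset: Dataset, epochs: int = 10) -> None:
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for image, gaze, head_pose, eye_pos in loader:
            pred_gaze, pred_pose, pred_eye = model(image)  # assumed 3 outputs
            # The overall loss rises with each of the three per-task losses.
            loss = (torch.nn.functional.l1_loss(pred_gaze, gaze)
                    + torch.nn.functional.l1_loss(pred_pose, head_pose)
                    + torch.nn.functional.l1_loss(pred_eye, eye_pos))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```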
2. The training method according to claim 1, wherein the generating a head pose label and an eye position label respectively corresponding to each image to be trained includes:
obtaining head area images corresponding to the images to be trained respectively according to the images to be trained;
and generating head posture labels and eye position labels respectively corresponding to the images to be trained according to the head area images respectively corresponding to the images to be trained.
3. The training method according to claim 2, wherein the generating a head pose label and an eye position label respectively corresponding to the images to be trained according to the head region images respectively corresponding to the images to be trained includes:
generating head posture angles corresponding to the images to be trained respectively according to the target images, and taking the head posture angles as the head posture labels corresponding to the images to be trained respectively;
generating eye key points corresponding to the images to be trained respectively according to the target images, and taking position information of the eye key points in the head area images corresponding to the images to be trained respectively as the eye position labels corresponding to the images to be trained respectively; wherein a target image is an image to be trained, or an image obtained by normalizing an image to be trained based on a preset normalization rule.
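As a hedged illustration of claim 3, the sketch below derives one head posture label and one eye position label from a target image; estimate_head_pose and detect_eye_keypoints are hypothetical helpers (for example, a pretrained pose network and a facial-landmark detector) that are not named anywhere in the claims, and the label layouts are illustrative choices.

```python
# Minimal sketch under stated assumptions: the two helper functions below are
# hypothetical placeholders for any head-pose estimator and eye-landmark
# detector.
import numpy as np


def make_labels(head_area: np.ndarray, target_image: np.ndarray):
    """Derive head posture and eye position labels for one image to be trained.

    `target_image` is the image to be trained or its normalized version, per
    claim 3; `head_area` is the cropped head area image whose size defines the
    coordinate frame of the eye position label.
    """
    yaw, pitch, roll = estimate_head_pose(target_image)        # hypothetical
    head_posture_label = np.array([yaw, pitch, roll], dtype=np.float32)

    keypoints = detect_eye_keypoints(target_image)              # hypothetical
    h, w = head_area.shape[:2]
    # Express the eye key points as positions within the head area image.
    eye_position_label = np.array(
        [[x / w, y / h] for (x, y) in keypoints], dtype=np.float32)
    return head_posture_label, eye_position_label
```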
4. The training method of claim 3, wherein the preset normalization rule is obtained by:
according to a plurality of images to be trained in the first sample, determining the mean value and standard deviation of pixel values among the plurality of images to be trained;
and determining a preset normalization rule according to the mean value and the standard deviation.
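A minimal sketch of claim 4, assuming that "pixel values among the plurality of images to be trained" refers to per-channel statistics computed over the whole first sample; the epsilon guard and the per-channel granularity are choices of the sketch.

```python
# Illustrative only: fit a normalization rule (mean and standard deviation)
# from the first sample, then apply it to an image.
import numpy as np


def fit_normalization_rule(images):
    """images: iterable of H x W x C uint8 arrays from the first sample."""
    stacked = np.concatenate(
        [img.reshape(-1, img.shape[-1]).astype(np.float64) for img in images],
        axis=0)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0) + 1e-8   # guard against division by zero
    return mean, std


def apply_normalization_rule(image, mean, std):
    return (image.astype(np.float64) - mean) / std
```

In such a sketch the same (mean, std) pair fitted at training time would also be reused for the normalization in claim 10, so that inference inputs match the training distribution.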
5. The training method of claim 1, wherein the training the detection model by using the second sample to obtain a trained detection model comprises: training the detection model by using the second sample, and adjusting model parameters of the detection model based on an overall loss during training to obtain the trained detection model; wherein the overall loss comprises: a sight line estimation loss corresponding to the sight line label, a head posture estimation loss corresponding to the head posture label, and an eye position estimation loss corresponding to the eye position label; and the value of the overall loss is positively correlated with each of the sight line estimation loss, the head posture estimation loss and the eye position estimation loss.
6. The training method of claim 5, wherein the sight line estimation loss comprises a sight line estimation regression loss and a sight line estimation classification loss; the sight line estimation regression loss comprises a difference value between the sight line label corresponding to the second sample and a predicted sight line label; the sight line estimation classification loss comprises a difference value of the predicted sight line label on two image coordinate axes of an image in the second sample;
the head posture estimation loss comprises a difference value between the head posture label corresponding to the second sample and a predicted head posture label;
the eye position estimation loss comprises a difference value between the eye position label corresponding to the second sample and a predicted eye position label.
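The loss structure of claims 5 and 6 could be sketched as follows. It is assumed here, without support in the claim text, that the sight line is predicted as a (yaw, pitch) pair and that the classification part is realized by binning the two angular components and applying cross-entropy, a common construction in gaze estimation; the bin count, bin width and unit loss weights are illustrative only.

```python
# Hedged sketch: an overall loss that grows with the sight line estimation
# loss (regression + classification parts), the head posture estimation loss
# and the eye position estimation loss.
import torch
import torch.nn.functional as F


def overall_loss(pred_gaze, gaze_label,               # (B, 2) yaw/pitch, degrees
                 gaze_logits_yaw, gaze_logits_pitch,  # (B, n_bins) each
                 pred_pose, pose_label,               # (B, 3)
                 pred_eye, eye_label,                 # (B, K)
                 n_bins=90, bin_width=2.0):
    # Sight line estimation regression loss: label vs. predicted sight line.
    gaze_reg = F.l1_loss(pred_gaze, gaze_label)

    # Sight line estimation classification loss over the two angular axes
    # (one possible reading of the claim).
    yaw_bins = (gaze_label[:, 0] / bin_width + n_bins // 2).long().clamp(0, n_bins - 1)
    pitch_bins = (gaze_label[:, 1] / bin_width + n_bins // 2).long().clamp(0, n_bins - 1)
    gaze_cls = (F.cross_entropy(gaze_logits_yaw, yaw_bins)
                + F.cross_entropy(gaze_logits_pitch, pitch_bins))

    # Head posture and eye position estimation losses.
    pose_loss = F.l1_loss(pred_pose, pose_label)
    eye_loss = F.l1_loss(pred_eye, eye_label)

    # Positively correlated with every component, as required by claim 5.
    return gaze_reg + gaze_cls + pose_loss + eye_loss
```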
7. The training method of claim 1, wherein the trained detection model is used to obtain, for an image, a sight line detection result; or a sight line detection result and a head posture detection result; or a sight line detection result and an eye position detection result; or a sight line detection result, a head posture detection result and an eye position detection result.
8. A method of detection, the method comprising:
acquiring an image to be detected;
generating a model detection result corresponding to the image to be detected according to the image to be detected and the trained detection model;
determining a target detection result corresponding to the image to be detected according to a model detection result corresponding to the image to be detected; the trained detection model is a detection model obtained by training according to the training method of any one of claims 1 to 7.
9. The method of claim 8, wherein the image to be detected is a frame of image in a video acquired in real time, and the determining the target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected comprises:
determining the target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected, a target detection result corresponding to a previous frame image of the image to be detected, and preset weights respectively corresponding to the model detection result and the target detection result of the previous frame image.
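One plain reading of claim 9, shown only as a hedged sketch, is an exponential-moving-average-style fusion of the current frame's model detection result with the previous frame's target detection result; the specific weight value and the tuple representation of a result are assumptions of the sketch.

```python
# Illustrative temporal fusion for real-time video: blend the model detection
# result of the current frame with the target detection result of the
# previous frame using preset weights.
def fuse_detection(model_result, prev_target_result, model_weight=0.6):
    """Results are, e.g., (yaw, pitch) tuples; the two weights sum to 1."""
    if prev_target_result is None:        # first frame: nothing to smooth
        return model_result
    prev_weight = 1.0 - model_weight
    return tuple(model_weight * m + prev_weight * p
                 for m, p in zip(model_result, prev_target_result))
```

A larger model_weight would track fast gaze changes more closely, while a smaller one would smooth jitter more aggressively; the 0.6 above is only a placeholder.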
10. The detection method according to claim 8 or 9, wherein the generating a model detection result corresponding to the image to be detected according to the image to be detected and the trained detection model includes:
generating a head region image corresponding to the image to be detected according to the image to be detected;
carrying out normalization processing on the head region image according to a preset normalization rule;
and inputting the normalized head region image into a trained detection model to obtain a model detection result corresponding to the image to be detected.
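A minimal sketch of the claim 10 pipeline follows: crop the head region, normalize it with the preset rule, and feed it to the trained detection model. detect_head_region and resize are hypothetical helpers, the input size is illustrative, and the assumption that the model returns three outputs mirrors the sketch after claim 1.

```python
# Illustrative inference path: head-region cropping, preset normalization,
# then a forward pass through the trained detection model.
import numpy as np
import torch


def run_detection(model, frame, mean, std, input_size=(224, 224)):
    x0, y0, x1, y1 = detect_head_region(frame)        # hypothetical helper
    head = frame[y0:y1, x0:x1]
    head = resize(head, input_size)                    # hypothetical helper
    head = (head.astype(np.float32) - mean) / std      # preset normalization
    tensor = torch.from_numpy(head).permute(2, 0, 1).unsqueeze(0).float()
    with torch.no_grad():
        gaze, head_posture, eye_position = model(tensor)
    return gaze, head_posture, eye_position
```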
11. A training device for use with a detection model, the training device comprising:
the sample acquisition module is used for acquiring a first sample; wherein the first sample comprises: a plurality of images to be trained and sight line labels respectively corresponding to the images to be trained; the sight line label is used for representing a sight line direction of a target person in the image to be trained;
the label generating module is used for generating a head posture label and an eye position label respectively corresponding to each image to be trained; wherein the head posture label is used for representing a head direction of the target person in the image to be trained, and the eye position label is used for representing an eye position of the target person in the image to be trained;
the model training module is used for obtaining a second sample by utilizing the sight line label, the head posture label and the eye position label corresponding to each image to be trained based on the plurality of images to be trained, and training the detection model by utilizing the second sample to obtain a trained detection model.
12. A detection device, characterized in that the detection device comprises:
the image acquisition module is used for acquiring an image to be detected;
the model detection result generation module is used for generating a model detection result corresponding to the image to be detected according to the image to be detected and the trained detection model;
the target detection result generation module is used for determining a target detection result corresponding to the image to be detected according to the model detection result corresponding to the image to be detected; wherein the trained detection model is a detection model obtained through training by the training device according to claim 11.
13. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the training method of any of claims 1 to 7 or the detection method of any of claims 8 to 10.
14. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the training method of any one of claims 1 to 7 or the detection method of any one of claims 8 to 10.
CN202311142090.3A 2023-09-05 2023-09-05 Training method, detection device, electronic equipment and storage medium Pending CN117197612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311142090.3A CN117197612A (en) 2023-09-05 2023-09-05 Training method, detection device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311142090.3A CN117197612A (en) 2023-09-05 2023-09-05 Training method, detection device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117197612A true CN117197612A (en) 2023-12-08

Family

ID=88982874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311142090.3A Pending CN117197612A (en) 2023-09-05 2023-09-05 Training method, detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117197612A (en)

Similar Documents

Publication Publication Date Title
US10991074B2 (en) Transforming source domain images into target domain images
CN108898186B (en) Method and device for extracting image
CN108256479B (en) Face tracking method and device
CN108197618B (en) Method and device for generating human face detection model
CN111598164B (en) Method, device, electronic equipment and storage medium for identifying attribute of target object
JP2021532434A (en) Face feature extraction model Training method, face feature extraction method, device, equipment and storage medium
JP6997369B2 (en) Programs, ranging methods, and ranging devices
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN111753746B (en) Attribute recognition model training method, recognition method, electronic device, and storage medium
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CN111832581B (en) Lung feature recognition method and device, computer equipment and storage medium
CN113705362A (en) Training method and device of image detection model, electronic equipment and storage medium
CN117315758A (en) Facial expression detection method and device, electronic equipment and storage medium
KR20220130567A (en) Methods, apparatuses, devices, and storage medium for detecting correlated objects included in an image
CN111382791B (en) Deep learning task processing method, image recognition task processing method and device
CN113255819B (en) Method and device for identifying information
CN114494782B (en) Image processing method, model training method, related device and electronic equipment
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN114170484B (en) Picture attribute prediction method and device, electronic equipment and storage medium
KR20230068989A (en) Method and electronic device for performing learning of multi-task model
CN117197612A (en) Training method, detection device, electronic equipment and storage medium
CN115457365A (en) Model interpretation method and device, electronic equipment and storage medium
CN114240770A (en) Image processing method, device, server and storage medium
CN114186039A (en) Visual question answering method and device and electronic equipment
CN117173509A (en) Training method, detection device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination