CN112101123B - Attention detection method and device

Info

Publication number: CN112101123B (granted publication of application CN202010845697.8A; earlier published as CN112101123A)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: original, image, human body, original image, key point
Legal status: Active (granted)
Inventors: 周鲁平, 胡晓华
Applicant and current assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority: CN202010845697.8A

Classifications

    • G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06Q50/205: ICT specially adapted for education; education administration or guidance
    • G06V20/41: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes
    • G06V20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/52: Context or environment of the image; surveillance or monitoring of activities, e.g. for recognising suspicious objects


Abstract

This application belongs to the technical field of image processing and provides an attention detection method and device. The method comprises: acquiring an original video of a user, the original video comprising multiple frames of original images; importing each frame of original image into a key point extraction network and outputting key point images; importing the key point images into a posture recognition network and outputting posture information; determining a user state from the posture information corresponding to all original images in the original video; and generating an attention detection result based on the original video and the user state, and outputting the attention detection result. By taking consecutive frames of original images of the user as the basis for judgement and using a key point extraction network together with a posture recognition network, the application detects whether the user's attention is focused and outputs an attention detection result, so that the user can be reminded to correct in time.

Description

Attention detection method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and apparatus for detecting attention.
Background
With the development of education informatization, more and more students study without supervision, for example in online classes or during self-study at home. When studying without supervision, a student's learning outcome is easily harmed by mind-wandering and other distractions: the student cannot stay focused throughout the session, and once the mind wanders it is difficult to refocus in time, so the learning effect suffers. An attention detection method is therefore needed to detect whether a student's attention is focused, so that the student can be helped to correct in time when attention lapses.
Disclosure of Invention
Therefore, embodiments of the present application provide an attention detection method and device that identify whether a student's attention is focused during learning, so that the student can be reminded to correct in time.
In a first aspect, an embodiment of the present application provides an attention detection method, comprising: acquiring an original video of a user, the original video comprising multiple frames of original images; importing each frame of original image into a key point extraction network and outputting key point images; importing the key point images into a posture recognition network and outputting posture information; determining a user state from the posture information corresponding to all original images in the original video; and determining an attention detection result based on the original video and the user state, and outputting the attention detection result.
In a possible implementation of the first aspect, acquiring the original video of the user comprises: acquiring the original video corresponding to each acquisition period according to a preset acquisition period.
The duration of an acquisition period is a preset duration, for example one minute. The original video comprises multiple frames of original images; the original video corresponding to one acquisition period may, for example, contain sixty frames of original images, i.e. one frame of original image per second of the acquisition period. The user state determined later is then the user state corresponding to that acquisition period, and the attention detection result determined later is the attention detection result corresponding to that acquisition period.
It should be understood that, by adjusting the preset duration of the acquisition period, the method provided by the embodiments of the application can be applied to scenarios in which the user's attention is monitored in real time, so that the user can be prompted to correct inattention promptly. In addition, by collecting the attention detection results of all acquisition periods, an attention analysis report can be generated that characterizes the user's attention over the whole period during which the original videos were acquired; the report is output so that the user (e.g. a student) or another user (e.g. the student's parent or teacher) can see how attentive the user was during that period (e.g. one class).
In a second aspect, an embodiment of the present application provides an attention detection device, comprising: an original video acquisition module, configured to acquire an original video of a user, the original video comprising multiple frames of original images; a key point extraction module, configured to import each frame of original image into a key point extraction network and output key point images; a posture recognition module, configured to import the key point images into a posture recognition network and output posture information; a state determination module, configured to determine the user state from the posture information corresponding to all original images in the original video; a detection result generation module, configured to generate an attention detection result based on the original video and the user state; and a detection result output module, configured to output the attention detection result.
In a third aspect, an embodiment of the present application provides a terminal device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any implementation of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to perform the method of any implementation of the first aspect.
It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.
Compared with the prior art, the embodiments of the application have the following beneficial effects:
The attention detection method provided by the application takes key point images of the user as the input of a posture recognition network and outputs posture information; based on the change between the posture information corresponding to consecutive frames of key point images, it determines whether the user's attention is focused and outputs an attention detection result, so that the user can be reminded to correct in time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of the detection method provided in the first embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a key point extraction network provided in the second embodiment of the present application;
FIG. 4 is a flowchart of the detection method provided in the third embodiment of the present application;
FIG. 5 is a schematic diagram of a rotation vector provided in the third embodiment of the present application;
FIG. 6 is a flowchart of the detection method provided in the fourth embodiment of the present application;
FIG. 7 is a flowchart of the detection method provided in the fifth embodiment of the present application;
FIG. 8 is a flowchart of determining a user state provided in the fifth embodiment of the present application;
FIG. 9 is a flowchart of an implementation of the detection method provided in the sixth embodiment of the present application;
FIG. 10 is a schematic structural diagram of a detection device provided in an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a terminal device provided in an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In the embodiments of the present application, the execution subject of the flow is a terminal device. The terminal device includes, but is not limited to, a server, a computer, a smartphone, a tablet computer, or any other device capable of executing the method provided by the application. Preferably, the terminal device is an intelligent education terminal device capable of acquiring an original video of the user. Fig. 1 shows a flowchart of the implementation of the method provided in the first embodiment of the present application, which is described in detail below:
In S101, an original video of a user is acquired.
In this embodiment, the original video comprises multiple frames of original images. Typically, the original video of the user is captured by a camera. For example, to obtain an original video of the user in a sitting posture, the camera should be placed where the seated user can be photographed, e.g. on the desk, or on a device such as the display screen the user watches while studying or a fixed book stand holding the book the user is reading.
In one possible implementation, acquiring the original video of the user may specifically be acquiring the original video corresponding to each acquisition period according to a preset acquisition period. The duration of an acquisition period is a preset duration, for example one minute; the original video comprises multiple frames of original images, and the original video of one acquisition period may, for example, contain sixty frames of original images, i.e. one frame per second of the acquisition period. The user state determined later is then the user state corresponding to that acquisition period, and the attention detection result determined later is the attention detection result corresponding to that acquisition period.
It should be understood that, in the above possible implementation, by adjusting the preset duration of the acquisition period, the method provided by the embodiments of the application can be applied to scenarios in which the user's attention is monitored in real time, so that the user can be prompted to correct inattention promptly. In addition, by collecting the attention detection results of all acquisition periods, an attention analysis report can be generated that characterizes the user's attention over the whole period during which the original videos were acquired; the report is output so that the user (e.g. a student) or another user (e.g. the student's parent or teacher) can see how attentive the user was during that period (e.g. one class).
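By way of an illustrative sketch (not taken from the patent), the acquisition-period logic above can be implemented with OpenCV, assuming the camera is sampled at one frame per second for a one-minute period; all function and parameter names below are illustrative:
```python
import cv2

def capture_period_frames(camera_index=0, period_seconds=60, frames_per_second=1):
    """Collect one acquisition period of original images: one frame per second."""
    cap = cv2.VideoCapture(camera_index)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if the camera reports 0
    step = int(native_fps // frames_per_second)
    frames, timestamps = [], []
    grabbed = 0
    while len(frames) < period_seconds * frames_per_second:
        ok, frame = cap.read()
        if not ok:
            break
        if grabbed % step == 0:
            frames.append(frame)
            timestamps.append(grabbed / native_fps)  # seconds since the period started
        grabbed += 1
    cap.release()
    return frames, timestamps
```
Each returned list of frames then corresponds to one acquisition period, and the per-period attention detection results can be accumulated into the analysis report described above.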
In S102, each frame of original image is imported into a key point extraction network, and key point images are output.
In this embodiment, the key point extraction network is configured to extract key point feature information about the user from the original image; the key point image contains all the key point feature information of that original image. The key point extraction network may be a trained key point recognition network for extracting a target object from an image; for example, it may be an OpenPose human body key point recognition model, where the key points include a left eye key point, a right eye key point, a nose key point, a left ear key point, a right ear key point, a left shoulder key point, a right shoulder key point and a middle (neck) key point.
In one possible implementation, importing each frame of original image into the key point extraction network and outputting the key point image may specifically be: extracting feature information about the user's left eye, right eye, nose, left ear, right ear, left shoulder, right shoulder and middle (neck) key points from the original image, and obtaining the key point image based on that feature information.
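The following sketch illustrates the key point feature extraction described above; run_openpose stands in for whatever OpenPose binding is actually deployed and is assumed to return one dictionary of named key points per person, which is an assumption rather than the patent's specified interface:
```python
import numpy as np

# The eight key points the method relies on. The names follow common OpenPose
# conventions, but the exact mapping depends on the OpenPose model deployed.
KEYPOINT_NAMES = ["nose", "neck", "right_shoulder", "left_shoulder",
                  "right_eye", "left_eye", "right_ear", "left_ear"]

def extract_keypoint_features(original_image, run_openpose):
    """run_openpose is assumed to return {name: (x, y, confidence)} for one person."""
    detected = run_openpose(original_image)
    features = {}
    for name in KEYPOINT_NAMES:
        x, y, conf = detected.get(name, (0.0, 0.0, 0.0))  # missing points default to zero
        features[name] = np.array([x, y, conf], dtype=np.float32)
    return features
```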
In S103, the key point image is imported into a posture recognition network, and posture information is output.
In this embodiment, the posture recognition network is an algorithm model trained with a deep learning algorithm; it takes the key point image as input and the posture information as output, and determines the posture information based on the feature information of each key point in the key point image. The posture information characterizes the user's posture in the original image and may include, for example, head down, normal, head up, etc., describing the user's sitting posture in the original image.
In one possible implementation, importing the key point image into the posture recognition network and outputting the posture information may specifically be: extracting the feature information of each key point in the key point image, and computing the posture information from that feature information and the internal parameters of the posture recognition network.
In S104, a user state is determined according to the posture information corresponding to all the original images in the original video.
In this embodiment, the original video comprises multiple frames of original images together with the timestamp of each frame, and each frame of original image corresponds to the posture information obtained in S103; the user state characterizes whether the user's attention in the original video is focused or not.
In one possible implementation, determining the user state according to the posture information corresponding to all the original images in the original video may specifically be: based on the timestamp of each frame of original image, judging whether the posture information of two adjacent frames is the same; if the posture information of an original image differs from that of the previous frame, marking that original image as inattentive, and otherwise not marking it; and if the ratio of the number of original images marked as inattentive to the number of frames of the original video is greater than or equal to a preset ratio, determining the user state as inattentive, and otherwise determining it as attentive.
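A minimal sketch of this adjacent-frame comparison, assuming the posture information of each frame is stored as a comparable value and that the preset ratio is the 40% figure used as an example later in the description:
```python
def determine_user_state(pose_sequence, ratio_threshold=0.4):
    """pose_sequence: posture information per frame, ordered by timestamp.
    A frame counts as changed when its posture differs from the previous frame;
    the user is judged inattentive when changed frames reach ratio_threshold."""
    if len(pose_sequence) < 2:
        return "attentive"
    changed = sum(1 for prev, cur in zip(pose_sequence, pose_sequence[1:]) if cur != prev)
    return "inattentive" if changed / len(pose_sequence) >= ratio_threshold else "attentive"
```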
In S105, an attention detection result is generated based on the original video and the user state, and the attention detection result is output.
In this embodiment, the attention detection result characterizes the user's attention during the time period in which the original video was acquired. Specifically, the acquisition time period of the original video is determined from its timestamps, and the attention detection result indicates whether the user state during that period is attentive or inattentive. For example, if the start timestamp of the original video is 12:00:00 and the end timestamp is 12:01:00, the acquisition time period is determined to be 12:00:00-12:01:00; if the value of the user state is 1, meaning inattentive (0 indicating attentive), the generated attention detection result is, for example, "between 12:00:00 and 12:01:00 the user was not paying attention".
In one possible implementation, outputting the attention detection result may specifically be: displaying the attention detection result on a display module of the terminal device, or sending it to a user terminal, to inform the user of how attentive the user was during the acquisition time period of the original video.
In this embodiment, the key point images of the user are taken as the input of the posture recognition network and posture information is output; based on the change between the posture information corresponding to consecutive frames of key point images, whether the user's attention is focused is determined and an attention detection result is output, so that the user can be reminded to correct in time.
Fig. 2 shows a schematic diagram of an application scenario provided in an embodiment of the present application. Referring to fig. 2, in one possible application scenario, the human body in the figure is a student sitting on a chair and attending a class. A terminal device containing a camera is placed on the desk so that it can carry out the detection method provided by the application: the camera acquires the original video of the student, and the student's attention during the acquisition of the original video is determined from that video. For example, with one minute as an acquisition period, an original video of the student is acquired in each period, the attention detection result for each period is determined, and the result is sent to the teacher's terminal to inform the teacher. With a forty-minute class, the method provided by the application thus lets the teacher supervise the student's attention throughout the class, and in particular see in which minutes the student was attentive and in which minutes not.
Fig. 3 shows a schematic diagram of a key point extraction network according to a second embodiment of the present application. Referring to fig. 3, with respect to the embodiment illustrated in fig. 1, the method S102 provided in this embodiment includes S301 to S302, which are specifically described as follows:
Further, importing each frame of original image into the key point extraction network and outputting key point images includes:
In S301, the original image is imported into a human body recognition layer, and a human body image is cropped from the original image.
In this embodiment, referring to fig. 3, when the original image is imported into the key point extraction network, it is first imported into the human body recognition layer in order to determine a human body image of the user within the original image.
In one possible implementation, importing the original image into the human body recognition layer and cropping the human body image from the original image may specifically be: preprocessing the original image, determining human body edge contour information in the original image from the preprocessed image, and cropping from the original image, according to that contour information, a human body image containing the user's face and upper body. Preprocessing the original image may specifically be applying image processing that highlights edge contours, such as image sharpening, to obtain the preprocessed original image. Determining the human body edge contour information from the preprocessed image may specifically be importing the preprocessed image into a trained human body recognition model that determines the human body edge contour. Cropping the human body image containing the user's face and upper body may specifically be: determining the edge contour of the target human body on the original image from the contour information, and cropping out the area enclosed by that contour as the human body image. It should be appreciated that the human body recognition model may be any existing trained model for determining the human body edge contour information contained in an image, and is not described in detail here.
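The following sketch illustrates S301 under stated assumptions: the trained human body recognition model is replaced by a simple sharpen-and-contour heuristic purely for illustration, and OpenCV is assumed to be available:
```python
import cv2
import numpy as np

def crop_human_body(original_image):
    """Sharpen the image, find the largest contour, and crop its bounding region."""
    sharpen_kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    sharpened = cv2.filter2D(original_image, -1, sharpen_kernel)
    gray = cv2.cvtColor(sharpened, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return original_image  # nothing detected; fall back to the full frame
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    return original_image[y:y + h, x:x + w]
```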
In S302, the human body image is imported into a key point recognition layer, a plurality of key points are extracted from the human body image, and a key point image containing those key points is output.
In this embodiment, referring to fig. 3, the key point extraction network includes a key point recognition layer; when the human body recognition layer outputs the human body image, the human body image is imported into the key point recognition layer to determine a key point image of the plurality of key points within the human body image.
In this embodiment, the key point recognition layer is configured to recognize the user's key points on the human body image, for example a left eye key point, a right eye key point, a nose key point, a left ear key point, a right ear key point, a left shoulder key point, a right shoulder key point and a middle (neck) key point. Optionally, the key point recognition layer may be an OpenPose human body key point recognition model; its specific implementation is not described here.
In one possible implementation, importing each frame of human body image into the key point recognition layer and outputting the key point image may specifically be: extracting feature information about the user's left eye, right eye, nose, left ear, right ear, left shoulder, right shoulder and middle (neck) key points from the human body image, and obtaining the key point image based on that feature information; specifically, the key points are connected according to a preset connection relation, and the key points together with their connecting lines are extracted from the human body image to obtain a key point image containing the plurality of key points (as shown in fig. 3).
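A sketch of assembling the key point image from the extracted key points; the connection relation listed below is hypothetical (the patent only says a preset connection relation is used), and the features dictionary is assumed to have the form produced by the extraction sketch above:
```python
import cv2
import numpy as np

# Hypothetical preset connection relation between the eight key points.
CONNECTIONS = [("left_eye", "nose"), ("right_eye", "nose"),
               ("left_ear", "left_eye"), ("right_ear", "right_eye"),
               ("nose", "neck"), ("left_shoulder", "neck"), ("right_shoulder", "neck")]

def build_keypoint_image(features, image_shape):
    """Draw the key points and their connecting lines on a blank canvas."""
    canvas = np.zeros(image_shape[:2] + (3,), dtype=np.uint8)
    for name, (x, y, conf) in features.items():
        if conf > 0:
            cv2.circle(canvas, (int(x), int(y)), 3, (0, 255, 0), -1)
    for a, b in CONNECTIONS:
        xa, ya, ca = features[a]
        xb, yb, cb = features[b]
        if ca > 0 and cb > 0:
            cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)), (255, 255, 255), 2)
    return canvas
```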
In this embodiment, placing a human body recognition layer in the key point extraction network removes the feature information of the unimportant background in the original image and keeps, as far as possible, only the feature information of the target human body. This amounts to preprocessing the original image and reduces the amount of image information to be processed in later steps (i.e. the subsequent computation), which improves the efficiency of the later attention detection. Placing a key point recognition layer allows the key points of different target human bodies (in various postures or wearing various clothes) to be recognized, so extracting key points on the human body image widens the population to which attention detection can be applied; it also further simplifies the feature information to be processed later by retaining only the feature information of the key points of the human body image, which helps improve both the efficiency of subsequent attention detection and the training efficiency of the subsequent posture recognition network.
Fig. 4 shows a flowchart of an implementation of the method provided by the third embodiment of the application. Referring to fig. 4, with respect to the embodiment illustrated in fig. 1, the method S103 provided in this embodiment includes S401 to S402, which are specifically described as follows:
Further, importing the key point image into the posture recognition network and outputting the posture information includes:
In this embodiment, the posture information includes a head rotation vector and a human body rotation vector; fig. 5 shows an example of the head rotation vector. Referring to fig. 5, the head rotation vector is the rotation vector between a head three-dimensional coordinate system (the coordinate system formed by the x', y' and z' axes in the figure), established from the head orientation of the target object, and a standard three-dimensional coordinate system (the coordinate system formed by the x, y and z axes in the figure), established from the ground; the two coordinate systems share the same origin (point O in the figure). Illustratively, the head rotation vector is (a, b, c), where a is the angle rotated about the x-axis of the standard coordinate system, b is the angle rotated about its y-axis, and c is the angle rotated about its z-axis; it will be appreciated that after these three rotations the standard coordinate system coincides with the head coordinate system.
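The three rotation angles can be checked numerically by composing the corresponding rotations; the sketch below assumes the rotations are applied in x, y, z order, which the patent does not specify:
```python
import numpy as np

def rotation_matrix(a_deg, b_deg, c_deg):
    """Compose rotations about x, then y, then z (angles in degrees; order assumed)."""
    a, b, c = np.radians([a_deg, b_deg, c_deg])
    rx = np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])
    ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    rz = np.array([[np.cos(c), -np.sin(c), 0], [np.sin(c), np.cos(c), 0], [0, 0, 1]])
    return rz @ ry @ rx

# Example: a head rotation vector of (30, 0, 0) tilts the head frame 30 degrees about x.
print(rotation_matrix(30, 0, 0))
```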
In S401, a head rotation vector is determined based on face feature information in the key point image.
In this embodiment, generally, only the key point feature information located on the head of the target object, that is, the face feature information, is necessary information in the key point image at the time of determining the head rotation vector. In one possible implementation manner, the determining the head rotation vector based on the face feature information in the keypoint image may specifically be: extracting face feature information in the key point image, namely integrating feature information of each face key point positioned on the head of the target object in the key point image, wherein the face key point can be specifically: left eye keypoints, right eye keypoints, nose keypoints, left ear keypoints, and right ear keypoints; and calculating based on the face characteristic information and the internal parameters of the gesture recognition network to obtain the head rotation vector.
In S402, a human body rotation vector is determined based on human body feature information in the key point image.
In this embodiment, the information needed from the key point image when determining the human body rotation vector, i.e. the human body feature information, generally consists of the human body key points of the target object in the key point image, for example the feature information of the nose key point (used to describe the longitudinal direction of the human body, and replaceable by another key point), the left shoulder key point, the right shoulder key point and the middle (neck) key point. The implementation of determining the human body rotation vector from the human body feature information in the key point image is analogous to the description of S401 and is not repeated here.
In this embodiment, the head rotation vector and the human body rotation vector that can represent the user's posture are computed separately by the posture recognition network, so the user's posture can be described in more detail: the head rotation vector describes the posture of the user's head, and the human body rotation vector describes the posture of the user's body. This provides a more detailed basis when the user state is determined later and improves detection accuracy. On the other hand, because the head rotation vector and the human body rotation vector are each computed from only part of the feature information in the key point image, the amount of computation is reduced and the output efficiency of the posture recognition network is improved.
Fig. 6 shows a flowchart of an implementation of the method provided by the fourth embodiment of the application. Referring to fig. 6, with respect to the embodiment described in fig. 4, the method provided in this embodiment includes S601 to S604, which are specifically described as follows:
Further, before the original video of the user is acquired, the method further includes:
In S601, a training image set is acquired.
In this embodiment, the training image set includes a plurality of training images; typically, the training images are captured by a camera. It should be appreciated that, to ensure that the trained posture recognition network determines the user state accurately from the original video, the relative position between this camera and the sample object should be consistent with the relative position between the camera that acquires the original video and the target object.
It should also be appreciated that, when each training image in the training image set is captured, the posture information of the sample object at the moment of capture should be recorded, so that true-value posture information can later be configured for each training image.
In S602, true-value posture information is configured for each training image.
In this embodiment, the true-value posture information includes a head true-value rotation vector and a human body true-value rotation vector. When each training image in the training image set is captured in S601, the head rotation vector and the human body rotation vector of the sample object at the moment of capture should be recorded as the head true-value rotation vector and the human body true-value rotation vector.
In one possible implementation, the sample object is instructed to adopt a sitting posture with head rotation vector (a, b, c) and human body rotation vector (i, j, k); a training image of the sample object is then captured, and (a, b, c) is taken as the head true-value rotation vector and (i, j, k) as the human body true-value rotation vector corresponding to that training image.
In S603, each training image is imported into the key point extraction network, and key point training images are output.
In this embodiment, since the implementation of S603 is identical to that of S102 in the embodiment described in fig. 1, reference may be made to the description of S102, which is not repeated here.
In S604, the key point training images are used as input and the true-value posture information as output, and the posture recognition network is trained based on a deep learning algorithm.
In this embodiment, the true-value posture information includes the head true-value rotation vector and the human body true-value rotation vector; the deep learning algorithm may be implemented with the keras deep learning framework; the posture recognition network may be a preset neural network. Training the posture recognition network based on the deep learning algorithm, with the key point training images as input and the true-value posture information as output, may specifically be: taking a preset neural network as the posture recognition network, taking the key point training images as input, and outputting a head predicted rotation vector and a human body predicted rotation vector; then, with the head true-value rotation vector and the human body true-value rotation vector as ground truth and the head predicted rotation vector and the human body predicted rotation vector as predictions, updating the internal parameters of the posture recognition network using the keras-based deep learning algorithm.
It should be understood that the posture recognition network should include two parallel computation layers, namely a head rotation vector computation layer and a human body rotation vector computation layer. Referring to the embodiment shown in fig. 4, the training process includes: taking the face feature information of the key point training image as input, the head predicted rotation vector as the prediction and the head true-value rotation vector as the ground truth, and updating the parameters of the head rotation vector computation layer with the keras-based deep learning algorithm; and taking the human body feature information of the key point training image as input, the human body predicted rotation vector as the prediction and the human body true-value rotation vector as the ground truth, and updating the parameters of the human body rotation vector computation layer with the keras-based deep learning algorithm.
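A minimal keras sketch in the spirit of the two parallel computation layers described above; here the two branches share one flattened key-point input and are trained jointly, and the input encoding (x, y, confidence per key point), the layer sizes and the mean-squared-error loss are all assumptions rather than details taken from the patent:
```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_posture_network(num_keypoints=8):
    # Flattened key point features (x, y, confidence per key point) as input.
    inputs = keras.Input(shape=(num_keypoints * 3,), name="keypoint_features")
    shared = layers.Dense(64, activation="relu")(inputs)
    head_branch = layers.Dense(32, activation="relu")(shared)   # head rotation vector layer
    body_branch = layers.Dense(32, activation="relu")(shared)   # body rotation vector layer
    head_vec = layers.Dense(3, name="head_rotation_vector")(head_branch)
    body_vec = layers.Dense(3, name="body_rotation_vector")(body_branch)
    model = keras.Model(inputs, [head_vec, body_vec])
    model.compile(optimizer="adam", loss="mse")
    return model

# Training: key point training images (flattened features) as input,
# true-value head/body rotation vectors as the two regression targets.
model = build_posture_network()
x = np.random.rand(16, 24).astype("float32")        # placeholder features
y_head = np.random.rand(16, 3).astype("float32")    # placeholder true-value vectors
y_body = np.random.rand(16, 3).astype("float32")
model.fit(x, [y_head, y_body], epochs=1, verbose=0)
```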
It should be understood that part of the training image set obtained above may be set aside as a validation image set, used to determine the accuracy of the posture recognition network after each training period; when the accuracy of the posture recognition network is higher than or equal to a preset standard accuracy, training of the posture recognition network is complete.
In this embodiment, the posture recognition network is built and trained from the acquired training image set with a deep learning algorithm; the trained posture recognition network can then carry out step S103, i.e. take the key point image as input and output the head rotation vector and the human body rotation vector, which facilitates the subsequent determination of the user state.
Fig. 7 shows a flowchart of an implementation of the method provided by the fifth embodiment of the application. Referring to fig. 7, with respect to the embodiment illustrated in fig. 4, the method S104 provided in this embodiment includes S701 to S703, which are specifically described as follows:
Further, determining the user state according to the posture information corresponding to all the original images in the original video includes:
In S701, a head posture is determined based on the head rotation vector.
In this embodiment, in order to distinguish the user's head postures based on the head rotation vector, thresholds are set in advance for the rotation angles of the head rotation vector in each direction. In one possible implementation, determining the head posture based on the head rotation vector may specifically be: determining, from the rotation angle corresponding to each direction of the head rotation vector, the sub-head posture corresponding to that direction; and taking the set of sub-head postures corresponding to the three directions as the head posture. Illustratively, for a head rotation vector (a, b, c), consider the rotation angle a corresponding to the x-axis: two thresholds a1 and a2 are set for the x-axis rotation angle, with -90 < a1 < 0 < a2 < 90. When a lies in the interval [-90, a1), the sub-head posture A corresponding to the x-axis is determined to be head-up (which may be represented by A = -1); when a lies in the interval [a1, a2], the sub-head posture A is determined to be normal (A = 0); and when a lies in the interval (a2, 90], the sub-head posture A is determined to be head-down (A = 1). Similarly, for the rotation angle b corresponding to the y-axis, two thresholds b1 and b2 are set and a sub-head posture B is determined, and for the rotation angle c corresponding to the z-axis, two thresholds c1 and c2 are set and a sub-head posture C is determined; the procedure follows the steps above and is not repeated here. The set of sub-head postures corresponding to the three directions is then taken as the head posture, i.e. the head posture is (A, B, C). It will be understood that each of the three sub-head postures has 3 possible values, so the head posture has 27 possible values in total.
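A sketch of the thresholding just described; the threshold values used below are placeholders for the preset a1, a2, b1, b2, c1 and c2:
```python
def sub_posture(angle, low, high):
    """Map one rotation angle to -1 (head-up), 0 (normal) or 1 (head-down / turned)."""
    if angle < low:
        return -1
    if angle <= high:
        return 0
    return 1

def head_posture(head_vector, thresholds=((-15, 15), (-20, 20), (-20, 20))):
    """thresholds are the (a1, a2), (b1, b2), (c1, c2) pairs; the values are illustrative."""
    return tuple(sub_posture(angle, lo, hi)
                 for angle, (lo, hi) in zip(head_vector, thresholds))

# 27 possible head postures in total: 3 sub-postures per axis over three axes.
print(head_posture((40, 5, -30)))   # e.g. (1, 0, -1)
```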
In S702, a human body posture is determined based on the human body rotation vector.
In this embodiment, since determining the human body posture based on the human body rotation vector is implemented in the same way as S701, reference is made to the description of determining the head posture based on the head rotation vector in S701, and the details are not repeated here. In general, a change of the human body posture obtained by rotating about the x-axis is already reflected to some extent in the head posture; for example, bending over is usually accompanied by a head-down head posture. Therefore, in S702 the sub-posture corresponding to the x-axis may be ignored, i.e. the human body posture consists of 2 sub-postures, each with 3 possible values, giving 9 possible values for the human body posture in total.
It should be appreciated that, if a user posture for the original video needs to be output, a user posture may be generated based on the head posture and human body posture of each original image within the original video and then output. Optionally, the head posture value that occurs most frequently among the head postures corresponding to all original images in the original video is taken as the head posture corresponding to the original video; the human body posture value that occurs most frequently among the human body postures corresponding to all original images is taken as the human body posture corresponding to the original video; and the user posture is generated from this head posture and human body posture and output. Following the implementation above, the head posture has 27 possible values and the human body posture 9, so the user posture has 243 possible values.
In S703, the user state is determined according to the head rotation vectors, head postures, human body rotation vectors and human body postures corresponding to all the original images in the original video.
In this embodiment, the user state characterizes whether the user's attention in the original video is focused or not; each frame of original image in the original video corresponds to one head rotation vector, one head posture, one human body rotation vector and one human body posture.
In one possible implementation, S703 may specifically be implemented as follows: based on the timestamp of each frame of original image, the amounts of change of the head rotation vector and of the human body rotation vector between two adjacent frames of original images are compared, to judge whether there is a change between the two adjacent frames large enough to constitute inattention. Specifically, let the head rotation vector and the human body rotation vector of the current original image be (a_n, b_n, c_n) and (i_n, j_n, k_n), and those of the previous frame be (a_{n-1}, b_{n-1}, c_{n-1}) and (i_{n-1}, j_{n-1}, k_{n-1}), where n is the frame number of the original image in the original video. When judging whether there is a change large enough to constitute inattention, the rotation angle i corresponding to the x-axis of the human body rotation vector is generally not considered; that is, only the changes of the rotation angles in the 5 dimensions formed by the x, y and z axes of the head rotation vector and the y and z axes of the human body rotation vector are considered. See the description of S702 above for the reason.
Optionally, a change threshold is set in advance for each dimension; if the change in any one dimension is greater than or equal to the change threshold for that dimension, it is judged that there is a change between the two adjacent frames of original images large enough to constitute inattention. Specifically, taking the dimension corresponding to the x-axis of the head rotation vector as an example: if |a_n - a_{n-1}| is greater than or equal to the change threshold for the x-axis of the head rotation vector, the current original image is marked as a changed frame image; otherwise it is not marked. A changed frame image indicates that the user's attention is not focused in that frame.
Optionally, the average change over all dimensions is calculated and an average change threshold is preset; if the average change is greater than or equal to the average change threshold, it is judged that there is a change between the two adjacent frames large enough to constitute inattention, and the current original image is marked as a changed frame image; otherwise it is not marked. The average change is the mean of the change values in each dimension.
In another possible implementation, the user state is determined from the head rotation vectors, head postures, human body rotation vectors and human body postures corresponding to all original images in the original video; fig. 8 is a flowchart of determining the user state according to the fifth embodiment of the present application. Referring to fig. 8, S703 includes S7031 to S7034, which are described as follows:
Further, determining the user state according to the head rotation vectors, head postures, human body rotation vectors and human body postures corresponding to all the original images in the original video includes: S7031 and/or S7032 and/or S7033, and S7034.
In S7031, if the difference between the head rotation vectors of the current original image and the previous frame of original image is greater than or equal to a first threshold, or the difference between their human body rotation vectors is greater than or equal to the first threshold, the current original image is marked as a changed frame image.
In this embodiment, specifically, the head rotation vector and the human body rotation vector of the current original image are (a_n, b_n, c_n) and (i_n, j_n, k_n) respectively, and those of the previous frame are (a_{n-1}, b_{n-1}, c_{n-1}) and (i_{n-1}, j_{n-1}, k_{n-1}) respectively, where n is the frame number of the original image in the original video and is greater than 0.
In general, in this embodiment the rotation angle i corresponding to the x-axis of the human body rotation vector is not considered; that is, only the changes of the rotation angles in the 5 dimensions formed by the x, y and z axes of the head rotation vector and the y and z axes of the human body rotation vector are considered (see fig. 8).
Preferably, a first threshold is preset (e.g. 30); if any of |a_n - a_{n-1}|, |b_n - b_{n-1}|, |c_n - c_{n-1}|, |j_n - j_{n-1}| or |k_n - k_{n-1}| is greater than or equal to the first threshold, the current original image is marked as a changed frame image; otherwise it is not marked.
In S7032, if the head posture corresponding to the current original image differs from that of the previous frame of original image, and the difference between their head rotation vectors is greater than or equal to a second threshold, the current original image is marked as a changed frame image.
In this embodiment, the second threshold is smaller than the first threshold. Specifically, the head posture has 27 different values; referring to fig. 8, the head posture of the current original image is p_n and that of the previous frame is p_{n-1}.
Preferably, a second threshold is preset (e.g. 20); if p_n ≠ p_{n-1} and any of |a_n - a_{n-1}|, |b_n - b_{n-1}| or |c_n - c_{n-1}| is greater than or equal to the second threshold, the current original image is marked as a changed frame image; otherwise it is not marked.
In S7033, if the human body posture corresponding to the current original image differs from that of the previous frame of original image, and the difference between their human body rotation vectors is greater than or equal to the second threshold, the current original image is marked as a changed frame image.
In this embodiment, specifically, the human body posture has 9 different values; referring to fig. 8, the human body posture of the current original image is q_n and that of the previous frame is q_{n-1}.
As above, the rotation angle i corresponding to the x-axis of the human body rotation vector is not considered in this embodiment; only the changes of the rotation angles in the 5 dimensions formed by the x, y and z axes of the head rotation vector and the y and z axes of the human body rotation vector are considered.
Preferably, if q_n ≠ q_{n-1} and either |j_n - j_{n-1}| or |k_n - k_{n-1}| is greater than or equal to the second threshold, the current original image is marked as a changed frame image; otherwise it is not marked.
In S7034, if the proportion of changed frame images among all the original images is greater than or equal to a preset ratio, the user state is identified as inattentive.
In this embodiment, specifically, the original video contains N frames of original images; if the number of changed frame images in the original video is M, the proportion is M/N, where N is greater than M. If M/N is greater than or equal to a preset ratio (e.g. 40%), the user state is identified as inattentive; otherwise it is identified as attentive.
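A consolidated sketch of S7031 to S7034, using the example values given above (first threshold 30, second threshold 20, preset ratio 40%) and ignoring the x-axis angle of the human body rotation vector as described:
```python
def is_changed_frame(cur, prev, first_threshold=30, second_threshold=20):
    """cur/prev each hold: head vector (a, b, c), body vector (i, j, k), head posture p, body posture q."""
    (a, b, c), (i, j, k), p, q = cur
    (a0, b0, c0), (i0, j0, k0), p0, q0 = prev
    head_diffs = [abs(a - a0), abs(b - b0), abs(c - c0)]
    body_diffs = [abs(j - j0), abs(k - k0)]          # the x-axis angle i is not considered
    if any(d >= first_threshold for d in head_diffs + body_diffs):    # S7031
        return True
    if p != p0 and any(d >= second_threshold for d in head_diffs):    # S7032
        return True
    if q != q0 and any(d >= second_threshold for d in body_diffs):    # S7033
        return True
    return False

def user_state(frames, ratio=0.4):                                    # S7034
    changed = sum(is_changed_frame(cur, prev) for prev, cur in zip(frames, frames[1:]))
    return "inattentive" if changed / len(frames) >= ratio else "attentive"
```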
In this embodiment, changed frame images that characterize the user's inattention are identified within the original video, and the user state is determined from the proportion of changed frame images, which facilitates the subsequent generation of the attention detection result.
In this embodiment, the head posture and human body posture are determined first, and changed frame images characterizing the user's inattention are then identified within the original video from the head posture and human body posture; this adds further criteria for identifying changed frame images, improves the accuracy of the subsequently determined user state, and thus yields a more accurate attention detection result.
Fig. 9 shows a flowchart of an implementation of the method provided by the sixth embodiment of the present application. Referring to fig. 9, with respect to any of the foregoing embodiments, the method S105 provided in this embodiment includes S901 to S902, which are specifically described as follows:
Further, outputting the attention detection result includes:
in S901, the attention detection result is transmitted to the user terminal.
In this embodiment, a connection is established with the user terminal, and the attention detection result output in S105 is sent to the user terminal. The connection with the user terminal may be established by searching for a user terminal within the connectable range, or via a relay server.
In S902, the user terminal is instructed to display the attention detection result.
In this embodiment, through the connection established with the user terminal, a request to display the attention detection result sent in S901 is transmitted, and the user terminal is instructed to display the attention detection result on its display module so as to inform the user.
In one possible implementation, the user terminal may be a monitor terminal, i.e. a terminal device used by the user's supervisor; the attention detection result is sent to the monitor terminal, which is instructed to display it so as to inform the supervisor.
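As an illustration of S901, assuming the user or monitor terminal exposes an HTTP endpoint (the URL and payload fields below are hypothetical):
```python
import requests

def send_attention_result(result, terminal_url="http://user-terminal.local/attention"):
    """POST the attention detection result to the (hypothetical) terminal endpoint."""
    payload = {
        "period_start": result["period_start"],   # e.g. "12:00:00"
        "period_end": result["period_end"],       # e.g. "12:01:00"
        "user_state": result["user_state"],       # "attentive" / "inattentive"
    }
    response = requests.post(terminal_url, json=payload, timeout=5)
    response.raise_for_status()
```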
In this embodiment, the attention detection result is sent to the user terminal to inform the user, so that the user can adjust his or her state in time according to the attention detection result. The user terminal may store the attention detection result so that the user can view it on the user terminal at any time, and the user terminal may also perform integrated analysis on all the received attention detection results to obtain an analysis report on the user's attention that is easier for the user to understand.
Corresponding to the method described in the above embodiments, fig. 10 shows a schematic structural diagram of a detection device according to an embodiment of the present application, and for convenience of explanation, only the portion related to the embodiment of the present application is shown.
Referring to fig. 10, the attention detection device includes:
the original video acquisition module, which is used for acquiring an original video about a user, the original video comprising a plurality of frames of original images;
the key point extraction module, which is used for respectively importing the plurality of frames of original images into a key point extraction network and outputting key point images;
the gesture recognition module, which is used for importing the key point images into a gesture recognition network and outputting gesture information;
the state determining module, which is used for determining the state of the user according to the posture information corresponding to all the original images in the original video;
the detection result generation module, which is used for generating an attention detection result based on the original video and the user state;
and the detection result output module, which is used for outputting the attention detection result.
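As a structural sketch of how these modules could cooperate, the class below wires an assumed key point extraction callable, pose recognition callable and frame-comparison rule into the pipeline from original video to attention detection result; all names and the returned fields are assumptions used only to show the data flow, not the device's actual implementation.

```python
class AttentionDetectionDevice:
    """Illustrative wiring of the modules shown in fig. 10 (assumed interfaces)."""

    def __init__(self, keypoint_net, pose_net, is_changed_frame, preset_ratio=0.4):
        self.keypoint_net = keypoint_net          # key point extraction module backbone
        self.pose_net = pose_net                  # gesture recognition module backbone
        self.is_changed_frame = is_changed_frame  # frame-to-frame comparison rule
        self.preset_ratio = preset_ratio

    def detect(self, original_frames):
        keypoint_images = [self.keypoint_net(f) for f in original_frames]   # key point extraction
        poses = [self.pose_net(kp) for kp in keypoint_images]               # gesture recognition
        changed = [self.is_changed_frame(prev, cur)                         # state determination
                   for prev, cur in zip(poses, poses[1:])]
        ratio = sum(changed) / len(original_frames) if original_frames else 0.0
        user_state = "inattentive" if ratio >= self.preset_ratio else "attentive"
        return {"user_state": user_state,                                   # result generation
                "changed_frame_ratio": ratio,
                "total_frames": len(original_frames)}
```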
It should be noted that, because the information interaction and execution processes between the above devices are based on the same concept as the method embodiments of the present application, for their specific functions and technical effects, reference may be made to the method embodiment section, and details are not repeated herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 11, the terminal device 11 of this embodiment includes: at least one processor 110 (only one shown in fig. 11), a memory 111, and a computer program 112 stored in the memory 111 and executable on the at least one processor 110, the processor 110 implementing the steps in any of the various method embodiments described above when executing the computer program 112.
The terminal device 11 may be a computing device such as a desktop computer, a notebook computer, a palm computer, a cloud server, etc. The terminal device may include, but is not limited to, a processor 110, a memory 111. It will be appreciated by those skilled in the art that fig. 11 is merely an example of the terminal device 11 and is not meant to be limiting as to the terminal device 11, and may include more or fewer components than shown, or may combine certain components, or may include different components, such as input-output devices, network access devices, etc.
It should be understood that, when the terminal device 11 is specifically a computing device such as a cloud server that does not have a function of acquiring the original video, the original video uploaded from another device may be acquired, and the detection method of the present application may be implemented based on the original video uploaded from the other device.
The processor 110 may be a central processing unit (Central Processing Unit, CPU); the processor 110 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 111 may in some embodiments be an internal storage unit of the terminal device 11, such as a hard disk or a memory of the terminal device 11. The memory 111 may also be an external storage device of the terminal device 11 in other embodiments, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) provided on the terminal device 11. Further, the memory 111 may include both an internal storage unit and an external storage device of the terminal device 11. The memory 111 is used to store an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 111 may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the various method embodiments described above.
Embodiments of the present application also provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform the steps of the method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing the related hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing device/terminal apparatus, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), electrical carrier signals, telecommunication signals, and software distribution media, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, computer readable media may not include electrical carrier signals and telecommunication signals in accordance with legislation and patent practice.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (7)

1. A method of detecting attention, comprising:
Acquiring an original video about a user; the original video comprises a plurality of frames of original images;
Respectively importing a plurality of frames of original images into a key point extraction network, and outputting key point images;
importing the key point image into a gesture recognition network and outputting gesture information;
determining a user state according to all the gesture information corresponding to the original images in the original video;
generating an attention detection result based on the original video and the user state, and outputting the attention detection result;
The gesture information comprises a head rotation vector and a human body rotation vector; the step of importing the key point image into a gesture recognition network and outputting gesture information comprises the following steps:
determining a head rotation vector based on the face feature information in the key point image;
Determining a human body rotation vector based on the human body characteristic information in the key point image;
the determining the user state according to the gesture information corresponding to all the original images in the original video includes:
determining a head pose based on the head rotation vector;
determining a human body pose based on the human body rotation vector;
Determining the user state according to head rotation vectors, head postures, human body rotation vectors and human body postures corresponding to all the original images in the original video;
The determining the user state according to the head rotation vector, the head gesture, the human body rotation vector and the human body gesture corresponding to all the original images in the original video includes:
If the difference value of the head rotation vector of the original image and the original image of the previous frame of the original image is larger than or equal to a first threshold value, or the difference value of the human body rotation vector of the original image and the original image of the previous frame of the original image is larger than or equal to the first threshold value, identifying the original image as a changed frame image;
and/or,
If the head gesture corresponding to the original image is different from the head gesture corresponding to the original image of the previous frame of the original image, and the difference value of the head rotation vectors of the original image and the original image of the previous frame of the original image is greater than or equal to a second threshold value, identifying the original image as a changed frame image; the second threshold is less than the first threshold;
and/or,
If the human body posture corresponding to the original image is different from the human body posture corresponding to the original image of the previous frame of the original image, and the difference value of the human body rotation vectors of the original image and the original image of the previous frame of the original image is larger than or equal to the second threshold value, identifying the original image as a changed frame image;
and if the ratio of all the changed frame images in all the original images is larger than or equal to a preset ratio, identifying the user state as inattention.
2. The detection method according to claim 1, wherein the key point extraction network includes a human body recognition layer and a key point recognition layer; the step of respectively importing the original images into a key point extraction network to output key point images, which comprises the following steps:
importing the original image into the human body recognition layer, and intercepting a human body image from the original image;
and importing the human body image into the key point identification layer, extracting a plurality of key points from the human body image, and outputting a key point image containing the plurality of key points.
3. The detection method according to claim 1, wherein before the acquiring of the original video about the user, the method further comprises:
Acquiring a training image set; the training image set comprises a plurality of training images;
Configuring true-value posture information for each training image; the truth posture information comprises a head truth rotation vector and a human body truth rotation vector;
respectively importing each training image into a key point extraction network, and outputting a key point training image;
and training the gesture recognition network based on a deep learning algorithm by taking the key point training image as input and the true gesture information as output.
4. A detection method according to any one of claims 1 to 3, wherein said outputting the attention detection result includes:
transmitting the attention detection result to a user terminal;
and indicating the user terminal to display the attention detection result.
5. An attention detection device, characterized by comprising:
the original video acquisition module is used for acquiring an original video about a user; the original video comprises a plurality of frames of original images;
The key point extraction module is used for respectively importing a plurality of frames of original images into a key point extraction network and outputting key point images;
the gesture recognition module is used for importing the key point images into a gesture recognition network and outputting gesture information;
The state determining module is used for determining the state of the user according to the posture information corresponding to all the original images in the original video;
The detection result generation module is used for generating an attention detection result based on the original video and the user state;
the detection result output module is used for outputting the attention detection result;
The gesture information comprises a head rotation vector and a human body rotation vector; the step of importing the key point image into a gesture recognition network and outputting gesture information comprises the following steps:
determining a head rotation vector based on the face feature information in the key point image;
Determining a human body rotation vector based on the human body characteristic information in the key point image;
the determining the user state according to the gesture information corresponding to all the original images in the original video includes:
determining a head pose based on the head rotation vector;
determining a human body pose based on the human body rotation vector;
Determining the user state according to head rotation vectors, head postures, human body rotation vectors and human body postures corresponding to all the original images in the original video;
The determining the user state according to the head rotation vector, the head gesture, the human body rotation vector and the human body gesture corresponding to all the original images in the original video includes:
If the difference value of the head rotation vector of the original image and the original image of the previous frame of the original image is larger than or equal to a first threshold value, or the difference value of the human body rotation vector of the original image and the original image of the previous frame of the original image is larger than or equal to the first threshold value, identifying the original image as a changed frame image;
and/or,
If the head gesture corresponding to the original image is different from the head gesture corresponding to the original image of the previous frame of the original image, and the difference value of the head rotation vectors of the original image and the original image of the previous frame of the original image is greater than or equal to a second threshold value, identifying the original image as a changed frame image; the second threshold is less than the first threshold;
and/or,
If the human body posture corresponding to the original image is different from the human body posture corresponding to the original image of the previous frame of the original image, and the difference value of the human body rotation vectors of the original image and the original image of the previous frame of the original image is larger than or equal to the second threshold value, identifying the original image as a changed frame image;
and if the ratio of all the changed frame images in all the original images is larger than or equal to a preset ratio, identifying the user state as inattention.
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the computer program.
7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 4.
CN202010845697.8A 2020-08-20 2020-08-20 Attention detection method and device Active CN112101123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845697.8A CN112101123B (en) 2020-08-20 2020-08-20 Attention detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010845697.8A CN112101123B (en) 2020-08-20 2020-08-20 Attention detection method and device

Publications (2)

Publication Number Publication Date
CN112101123A CN112101123A (en) 2020-12-18
CN112101123B true CN112101123B (en) 2024-05-28

Family

ID=73753329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845697.8A Active CN112101123B (en) 2020-08-20 2020-08-20 Attention detection method and device

Country Status (1)

Country Link
CN (1) CN112101123B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597935A (en) * 2020-12-29 2021-04-02 北京影谱科技股份有限公司 Attention level detection method and device, computing equipment and storage medium
CN112817550B (en) * 2021-02-07 2023-08-22 联想(北京)有限公司 Data processing method and device
CN112699857A (en) * 2021-03-24 2021-04-23 北京远鉴信息技术有限公司 Living body verification method and device based on human face posture and electronic equipment
CN114288684B (en) * 2021-12-31 2024-05-28 深圳数联天下智能科技有限公司 Control method and device of intelligent toy, intelligent toy and medium
CN117333927B (en) * 2023-12-01 2024-04-16 厦门磁北科技有限公司 Vehicle-mounted face recognition alcohol detection method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609517A (en) * 2017-09-15 2018-01-19 华中科技大学 A kind of classroom behavior detecting system based on computer vision
WO2020077822A1 (en) * 2018-10-17 2020-04-23 深圳壹账通智能科技有限公司 Image feature configuration and verification method and apparatus, computer device and medium
CN111382714A (en) * 2020-03-13 2020-07-07 Oppo广东移动通信有限公司 Image detection method, device, terminal and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609517A (en) * 2017-09-15 2018-01-19 华中科技大学 A kind of classroom behavior detecting system based on computer vision
WO2020077822A1 (en) * 2018-10-17 2020-04-23 深圳壹账通智能科技有限公司 Image feature configuration and verification method and apparatus, computer device and medium
CN111382714A (en) * 2020-03-13 2020-07-07 Oppo广东移动通信有限公司 Image detection method, device, terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Learning Attention Discrimination Based on Head Pose; Guo Yun et al.; Science Technology and Engineering; pp. 2-5 *
Guo Yun et al. Research on Learning Attention Discrimination Based on Head Pose. Science Technology and Engineering. 2020, pp. 2-5. *

Also Published As

Publication number Publication date
CN112101123A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112101123B (en) Attention detection method and device
CN106897658B (en) Method and device for identifying human face living body
US11321583B2 (en) Image annotating method and electronic device
CN112101124B (en) Sitting posture detection method and device
CN109902659B (en) Method and apparatus for processing human body image
CN108805047B (en) Living body detection method and device, electronic equipment and computer readable medium
US10318797B2 (en) Image processing apparatus and image processing method
CN109359548A (en) Plurality of human faces identifies monitoring method and device, electronic equipment and storage medium
CN105612533A (en) In-vivo detection method, in-vivo detection system and computer programe products
CN112949437B (en) Gesture recognition method, gesture recognition device and intelligent equipment
CN111325107B (en) Detection model training method, device, electronic equipment and readable storage medium
CN108388889B (en) Method and device for analyzing face image
CN108229375B (en) Method and device for detecting face image
CN111639702A (en) Multimedia data analysis method, equipment, server and readable storage medium
CN111738199B (en) Image information verification method, device, computing device and medium
CN113160231A (en) Sample generation method, sample generation device and electronic equipment
CN113763348A (en) Image quality determination method and device, electronic equipment and storage medium
CN111881740A (en) Face recognition method, face recognition device, electronic equipment and medium
CN102783174B (en) Image processing equipment, content delivery system, image processing method and program
CN111783677B (en) Face recognition method, device, server and computer readable medium
CN115376198A (en) Gaze direction estimation method, gaze direction estimation device, electronic apparatus, medium, and program product
CN114745592A (en) Bullet screen message display method, system, device and medium based on face recognition
CN113327020A (en) Teaching quality evaluation system
CN112580462A (en) Feature point selection method, terminal and storage medium
CN113435357A (en) Voice broadcasting method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant