CN112101123A - Attention detection method and device - Google Patents

Attention detection method and device

Info

Publication number
CN112101123A
Authority
CN
China
Prior art keywords: original, image, key point, human body, user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010845697.8A
Other languages
Chinese (zh)
Other versions
CN112101123B (en)
Inventor
周鲁平
胡晓华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202010845697.8A
Publication of CN112101123A
Application granted
Publication of CN112101123B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06V40/20 Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06Q50/205 Information and communication technology specially adapted for education; education administration or guidance
    • G06V20/41 Scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/52 Context or environment of the image; surveillance or monitoring of activities, e.g. for recognising suspicious objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Multimedia (AREA)
  • Educational Administration (AREA)
  • Tourism & Hospitality (AREA)
  • General Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Educational Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present application is applicable to the technical field of image processing and provides an attention detection method and device. The method includes the following steps: acquiring an original video about a user, the original video comprising multiple frames of original images; importing the multiple frames of original images into a key point extraction network respectively, and outputting key point images; importing the key point images into a gesture recognition network, and outputting posture information; determining the user state according to the posture information corresponding to all the original images in the original video; and generating an attention detection result based on the original video and the user state, and outputting the attention detection result. The method and device use consecutive multi-frame original images of the user as the basis for judgment, use the key point extraction network and the gesture recognition network to detect whether the user's attention is focused, and output an attention detection result to remind the user to correct their attention in time.

Description

Attention detection method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for detecting attention.
Background
With the development of educational informatization, more and more students study without supervision, for example by taking online lessons or studying at home. However, students are easily distracted when learning without supervision, or their attention, and therefore their learning results, are affected by other factors. Without supervision, a student may not stay focused throughout a study session and may not refocus in time after becoming distracted, leading to poor learning results. An attention detection method is therefore urgently needed that can detect whether a student's attention is focused, so as to help the student correct it in time when attention is not focused.
Disclosure of Invention
In view of this, the embodiments of the present application provide an attention detection method and device, which solve the problem of identifying whether a student's attention is focused during learning, so as to remind the student to correct it in time.
In a first aspect, an embodiment of the present application provides an attention detection method, including: acquiring an original video about a user, the original video comprising multiple frames of original images; importing the multiple frames of original images into a key point extraction network respectively, and outputting key point images; importing the key point images into a gesture recognition network, and outputting posture information; determining the user state according to the posture information corresponding to all the original images in the original video; and determining an attention detection result based on the original video and the user state, and outputting the attention detection result.
In a possible implementation manner of the first aspect, the acquiring of an original video about a user includes: acquiring the original video corresponding to each acquisition period according to a preset acquisition period.
Illustratively, the duration of the acquisition period is a preset duration, and specifically may be one minute; the original video comprises a plurality of frames of original images; the original video corresponding to one acquisition period may specifically include sixty frames of original images, that is, one second in the acquisition period corresponds to one frame of the original image. The user state determined subsequently is the user state corresponding to the acquisition period; and the subsequently determined attention detection result is the attention detection result corresponding to the acquisition period.
It should be understood that by adjusting the preset duration of the acquisition period, the method provided by the embodiment of the application can be applied to an application scene for monitoring the attention of the user in real time, so as to prompt the user to correct the problem of inattention. In addition, an attention analysis report can be generated by collecting the attention detection results corresponding to each acquisition cycle, and the attention analysis report is used for representing the attention condition of the user during the period of acquiring all original videos; the attention analysis report is output to let the user (e.g., student) or other users (e.g., parent or teacher of the student) know the attention situation of the user during the acquisition of all original videos (e.g., the time of a class).
In a second aspect, an embodiment of the present application provides an attention detection device, including: an original video acquisition module for acquiring an original video about a user; the original video comprises a plurality of frames of original images; the key point extraction module is used for respectively importing the original images of the multiple frames into a key point extraction network and outputting key point images; the gesture recognition module is used for importing the key point image into a gesture recognition network and outputting gesture information; the state determining module is used for determining the state of a user according to the corresponding posture information of all the original images in the original video; the detection result generation module is used for generating an attention detection result based on the original video and the user state; and the detection result output module is used for outputting the attention detection result.
In a third aspect, an embodiment of the present application provides a terminal device, including: a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method of any of the above first aspects when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, including: the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of the first aspects described above.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method of any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Compared with the prior art, the embodiment of the application has the advantages that:
compared with the prior art, the attention detection method provided by the application takes the key point image of the user as the input of the gesture recognition network and outputs the gesture information; and determining whether the attention of the user is concentrated or not based on the change between the posture information corresponding to the key point images of the continuous multiple frames, and outputting an attention detection result to remind the user to correct in time.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a flowchart of an implementation of a detection method provided in a first embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a key point extraction network according to a second embodiment of the present application;
FIG. 4 is a flow chart of an implementation of a detection method provided in the third embodiment of the present application;
FIG. 5 is a schematic view of a rotation vector provided in a third embodiment of the present application;
FIG. 6 is a flow chart of an implementation of a detection method provided in the fourth embodiment of the present application;
fig. 7 is a flowchart of an implementation of a detection method provided in a fifth embodiment of the present application;
fig. 8 is a schematic flowchart of determining a user status according to a fifth embodiment of the present application;
FIG. 9 is a flowchart of an implementation of a detection method according to a sixth embodiment of the present application;
FIG. 10 is a schematic structural diagram of a detection apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
In the embodiments of the present application, the main execution body of the flow is a terminal device. Terminal devices include, but are not limited to, servers, computers, smart phones, tablet computers, and other devices that can execute the method provided by the present application. Preferably, the terminal device is an intelligent education terminal device capable of acquiring an original video about a user. Fig. 1 shows a flowchart of an implementation of the method provided in the first embodiment of the present application, which is detailed as follows:
in S101, an original video about a user is acquired.
In this embodiment, the original video includes multiple frames of original images. Typically, the original video about the user is captured by a camera. Illustratively, to acquire the original video of the user in a sitting state, the camera is arranged at a position from which the user in the sitting state can be captured, for example on the target desk or on a device used by the user (a display screen the user watches while learning, or a fixed book stand on which the user places a book while reading).
In a possible implementation manner, the obtaining of the original video related to the user may specifically be obtaining the original video corresponding to each acquisition cycle according to a preset acquisition cycle. Illustratively, the duration of the acquisition period is a preset duration, and specifically may be one minute; the original video comprises a plurality of frames of original images; the original video corresponding to one acquisition period may specifically include sixty frames of original images, that is, one second in the acquisition period corresponds to one frame of the original image. The user state determined subsequently is the user state corresponding to the acquisition period; and the subsequently determined attention detection result is the attention detection result corresponding to the acquisition period.
It should be understood that, in the above possible implementation manners, by adjusting the preset duration of the acquisition period, the method provided in the embodiment of the present application may be applicable to an application scenario for monitoring the attention of the user in real time, so as to prompt the user to correct the problem of inattention in time. In addition, an attention analysis report can be generated by collecting the attention detection results corresponding to each acquisition cycle, and the attention analysis report is used for representing the attention condition of the user during the period of acquiring all original videos; the attention analysis report is output to let the user (e.g., student) or other users (e.g., parent or teacher of the student) know the attention situation of the user during the acquisition of all original videos (e.g., the time of a class).
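As a purely illustrative sketch (not part of the claimed method), the periodic acquisition described above could be organised as follows, assuming an OpenCV-accessible camera; the one-minute period and one-frame-per-second rate follow the example above, and the function name is hypothetical.

```python
# Minimal sketch of per-period acquisition (assumption: an OpenCV-accessible camera;
# period length and sampling rate follow the one-minute / one-frame-per-second example).
import time
import cv2

def acquire_original_video(camera_index=0, period_s=60, frames_per_s=1):
    """Collect one acquisition period's worth of timestamped original images."""
    cap = cv2.VideoCapture(camera_index)
    frames = []
    try:
        start = time.time()
        while time.time() - start < period_s:
            ok, image = cap.read()
            if ok:
                frames.append((time.time(), image))   # (timestamp, original image)
            time.sleep(1.0 / frames_per_s)
    finally:
        cap.release()
    return frames   # the "original video" for this acquisition period
```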
In S102, the multiple frames of original images are respectively imported into a key point extraction network, and key point images are output.
In this embodiment, the key point extraction network is configured to extract key point feature information about the user in the original image; the key point image contains all the key point feature information in the original image. The key point extraction network may be a trained key point recognition network for extracting a target object from an image; exemplarily, it may be an OpenPose human body key point recognition model, where the key points include a left eye key point, a right eye key point, a nose key point, a left ear key point, a right ear key point, a left shoulder key point, a right shoulder key point, and a middle (neck) key point.
In a possible implementation manner, the importing multiple frames of the original images into a key point extraction network, and outputting the key point images may specifically be: extracting feature information of the left eye key point, the right eye key point, the nose key point, the left ear key point, the right ear key point, the left shoulder key point, the right shoulder key point and the middle part (neck) key point of the user in the original image, and obtaining the key point image based on the feature information.
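For illustration only, the eight key points listed above can be represented as a mapping from key point name to image coordinates. The wrapper below is a hypothetical sketch in which `detector` stands for any trained key point recognition model (for example an OpenPose-style model); its actual interface is not specified by this application.

```python
# Hypothetical sketch: keep only the eight key points used by this method.
from typing import Callable, Dict, Tuple

KEYPOINT_NAMES = ("left_eye", "right_eye", "nose", "left_ear",
                  "right_ear", "left_shoulder", "right_shoulder", "neck")

Keypoints = Dict[str, Tuple[float, float]]   # key point name -> (x, y) image coordinates

def extract_keypoints(original_image, detector: Callable[..., Keypoints]) -> Keypoints:
    all_points = detector(original_image)     # assumed detector call; interface not specified here
    return {name: all_points[name] for name in KEYPOINT_NAMES if name in all_points}
```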
In S103, the keypoint image is imported into a gesture recognition network, and gesture information is output.
In this embodiment, the gesture recognition network is an algorithm model trained based on a deep learning algorithm; it takes the key point image as input and the posture information as output, and determines the posture information based on the feature information of each key point in the key point image. The posture information is used to represent the posture of the user in the original image; exemplarily, the posture information may include head-down, normal, head-up, and the like, and represents the sitting posture of the user's body in the original image.
In a possible implementation manner, the importing the keypoint image into a pose recognition network and outputting pose information may specifically be extracting feature information of each keypoint in the keypoint image, and performing calculation according to an internal parameter of the pose recognition network and the feature information to obtain the pose information.
In S104, a user state is determined according to the pose information corresponding to all the original images in the original video.
In this embodiment, the original video includes multiple frames of the original images and timestamps of the frames of the original images, and each frame of the original image corresponds to one of the pose information obtained in S103; the user state is used to characterize the user's concentration or lack thereof in the original video.
In a possible implementation manner, the determining the user state according to the posture information corresponding to all the original images in the original video may specifically be: judging, based on the timestamp of each frame of original image, whether the posture information of two adjacent frames of original images is the same; if the posture information of an original image differs from that of its previous frame of original image, marking the original image as attention-unfocused, otherwise not marking it; and if the ratio of the number of original images marked as attention-unfocused to the number of frames of the original video is greater than or equal to a preset ratio, determining that the user state is inattentive, otherwise determining that the user state is attentive.
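A minimal sketch of this implementation follows; the 40% preset ratio is only an illustrative value, and the posture labels are assumed to be given per frame in timestamp order.

```python
# Sketch of S104 for this implementation: mark a frame whose posture label differs
# from the previous frame, then compare the marked fraction with a preset ratio.
def determine_user_state(posture_labels, preset_ratio=0.4):
    unfocused = sum(1 for prev, cur in zip(posture_labels, posture_labels[1:]) if cur != prev)
    return "inattentive" if unfocused / len(posture_labels) >= preset_ratio else "attentive"
```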
In S105, an attention detection result is generated based on the original video and the user state, and the attention detection result is output.
In this embodiment, the attention detection result is used to characterize the attention of the user during the time period in which the original video is acquired. Specifically, the acquisition time period of the original video is determined based on the timestamps of the original video, and the attention detection result indicates whether the user's attention was focused or not focused during that acquisition time period. Illustratively, the starting timestamp of the original video is 12:00:00 and the ending timestamp is 12:01:00, so the acquisition time period of the original video is determined to be 12:00:00-12:01:00; the value of the user state is 1, indicating that the user's attention is focused (0 indicates that the attention is not focused), and the generated attention detection result is specifically "within 12:00:00-12:01:00, the user's attention is focused".
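Continuing the example, assembling the attention detection result could look like the sketch below; the string format and the "1 = focused" convention are illustrative assumptions taken from the example above.

```python
# Sketch of S105: combine the acquisition time period with the user state.
def generate_detection_result(start_ts: str, end_ts: str, user_state: int) -> str:
    focus = "attention is focused" if user_state == 1 else "attention is not focused"
    return f"Within {start_ts}-{end_ts}, the user's {focus}"

# e.g. generate_detection_result("12:00:00", "12:01:00", 1)
```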
In a possible implementation manner, the outputting the attention detection result may specifically be: and displaying the attention detection result on a display module of the terminal equipment, or sending the attention detection result to a user terminal to inform a user of the specific condition of the attention of the user in the acquisition time period of the original video.
In the embodiment, the key point image about the user is used as the input of the gesture recognition network, and the gesture information is output; and determining whether the attention of the user is concentrated or not based on the change between the posture information corresponding to the key point images of the continuous multiple frames, and outputting an attention detection result to remind the user to correct in time.
Fig. 2 shows a schematic view of an application scenario provided in an embodiment of the present application. Referring to fig. 2, in one possible application scenario, the human body in the figure is a student who is sitting on a chair and attending a class. The terminal device, which includes a camera, is set on the desk so that it can implement the detection method provided by the present application: the camera acquires the original video of the student, and the student's attention during the acquisition of the original video is determined from the original video. Exemplarily, taking one minute as an acquisition period, the original video of the student is acquired in each acquisition period, the attention detection result of the student in each acquisition period is determined, and the attention detection result is sent to a teacher terminal to inform the teacher. Taking a forty-minute class as an example, the method provided by the present application enables the teacher to supervise the student's attention within the forty minutes of the class, specifically which minutes were focused and which minutes were not.
Fig. 3 shows a schematic diagram of a keypoint extraction network provided in a second embodiment of the present application. Referring to fig. 3, with respect to the embodiment shown in fig. 1, the method S102 provided in this embodiment includes S301 to S302, which are detailed as follows:
further, the importing multiple frames of the original images into a key point extraction network, and outputting a key point image includes:
in S301, the original image is imported into the human body recognition layer, and a human body image is captured from the original image.
In this embodiment, referring to fig. 3, the key point extraction network includes a human body recognition layer, and when the original image is imported into the key point extraction network, the original image is first imported into the human body recognition layer to determine a human body image of a user in the original image.
In a possible implementation manner, the importing the original image into the human body recognition layer and capturing a human body image from the original image may specifically be: preprocessing the original image, determining human body edge contour information in the original image according to the preprocessed original image, and cropping, from the original image, a human body image containing the user's face and upper body according to the human body edge contour information. The preprocessing may specifically be applying image processing operations that highlight edge contours, such as image sharpening, to obtain the preprocessed original image. Determining the human body edge contour information may specifically be importing the preprocessed original image into a trained human body recognition model for determining the human body edge contour, to obtain the human body edge contour information. Cropping the human body image may specifically be determining the edge contour of the target human body on the original image according to the human body edge contour information, and cropping the area enclosed by that edge contour, which is identified as the human body image. It should be understood that the human body recognition model may be an existing trained model for determining human body edge contour information in an image containing a human body, which is not described in detail here.
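The application relies on a trained human body recognition model; as a rough, purely illustrative stand-in for the "sharpen, find the body contour, crop it" steps, an OpenCV edge/contour crop could look like the sketch below (this is not the actual recognition layer).

```python
# Crude illustrative stand-in for the human body recognition layer (assumption:
# the user is the largest contour in the frame; the real layer uses a trained model).
import cv2

def crop_human_body(original_image):
    gray = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                        # highlight edge contours
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return original_image
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    return original_image[y:y + h, x:x + w]                 # the "human body image"
```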
In S302, the human body image is imported into the key point identification layer, a plurality of key points are extracted from the human body image, and a key point image including the plurality of key points is output.
In this embodiment, referring to fig. 3, the key point extraction network includes a key point identification layer; when the human body recognition layer outputs the human body image, the human body image is imported into the key point identification layer to determine the key point image related to a plurality of key points in the human body image.
In this embodiment, the key point identification layer is configured to identify key points of the user on the human body image; for example, the key points include a left eye key point, a right eye key point, a nose key point, a left ear key point, a right ear key point, a left shoulder key point, a right shoulder key point, and a middle (neck) key point. Optionally, the key point identification layer may be an OpenPose human body key point recognition model, whose specific implementation is not described here again.
In a possible implementation manner, the importing each frame of the human body image into the key point identification layer and outputting a key point image may specifically be: extracting feature information about the user's left eye key point, right eye key point, nose key point, left ear key point, right ear key point, left shoulder key point, right shoulder key point, and middle (neck) key point in the human body image, and obtaining the key point image based on the feature information; specifically, the key points are connected according to a preset connection relationship, and the key points and their connecting lines are extracted from the human body image to obtain a key point image containing a plurality of key points (as shown in fig. 3).
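A sketch of building such a key point image is shown below; the connection list is an assumption for illustration and reuses the key point names from the earlier sketch.

```python
# Sketch: draw the key points and a preset set of connections on a blank canvas.
import numpy as np
import cv2

CONNECTIONS = [("left_eye", "nose"), ("right_eye", "nose"),
               ("left_ear", "left_eye"), ("right_ear", "right_eye"),
               ("nose", "neck"), ("left_shoulder", "neck"), ("right_shoulder", "neck")]

def render_keypoint_image(keypoints, height, width):
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in CONNECTIONS:
        if a in keypoints and b in keypoints:
            pa = tuple(int(v) for v in keypoints[a])
            pb = tuple(int(v) for v in keypoints[b])
            cv2.line(canvas, pa, pb, (0, 255, 0), 2)              # connection line
    for x, y in keypoints.values():
        cv2.circle(canvas, (int(x), int(y)), 4, (0, 0, 255), -1)  # key point
    return canvas   # the key point image
```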
In this embodiment, arranging a human body recognition layer in the key point extraction network removes feature information of the unimportant background environment in the original image and keeps, as far as possible, only the feature information of the target human body. This is equivalent to preprocessing the original image and reduces the amount of image information to be processed in subsequent steps (i.e., the subsequent computation), thereby improving the efficiency of the subsequent attention detection. Arranging a key point identification layer allows the key points of different target human bodies (in various postures or wearing various clothes) to be identified; extracting key points from the human body image broadens the range of users to whom attention detection applies, and further simplifies the feature information to be processed subsequently, since only the feature information of the key points of the human body image is retained. This improves the efficiency of attention detection and also the training efficiency of the subsequent gesture recognition network.
Fig. 4 shows a flowchart of an implementation of the method provided in the third embodiment of the present application. Referring to fig. 4, with respect to the embodiment shown in fig. 1, the method S103 provided in this embodiment includes S401 to S402, which are detailed as follows:
further, the importing the keypoint image into a gesture recognition network and outputting gesture information includes:
in this embodiment, the posture information includes a head rotation vector and a body rotation vector; fig. 5 shows an example of the head rotation vector. Referring to fig. 5, the head rotation vector refers to a rotation vector between a head three-dimensional coordinate system (a coordinate system formed by an x ' axis, a y ' axis, and a z ' axis in the drawing) established based on the head orientation of the target object and a standard three-dimensional coordinate system (a coordinate system formed by an x axis, a y axis, and a z axis in the drawing) established based on the ground, the head three-dimensional coordinate system being the same as the center (point O in the drawing) of the standard three-dimensional coordinate system. Exemplarily, the head rotation vector is (a, b, c), where a is an angle value rotated by using an x-axis of a standard three-dimensional coordinate system as a rotation axis, b is an angle value rotated by using a y-axis of the standard three-dimensional coordinate system as a rotation axis, and c is an angle value rotated by using a z-axis of the standard three-dimensional coordinate system as a rotation axis; it should be understood that the standard three-dimensional coordinate system coincides with the head three-dimensional coordinate system after the three rotations.
In S401, a head rotation vector is determined based on the face feature information in the keypoint image.
In this embodiment, generally, the only necessary information in the keypoint image when determining the head rotation vector is the keypoint feature information located on the head of the target object, that is, the face feature information. In a possible implementation manner, the determining a head rotation vector based on the face feature information in the keypoint image may specifically be: extracting the face feature information in the key point image, that is, integrating the feature information of each face key point on the head of the target object in the key point image, where the face key points may specifically be: a left eye key point, a right eye key point, a nose key point, a left ear key point, and a right ear key point; and calculating based on the human face feature information and the internal parameters of the gesture recognition network to obtain the head rotation vector.
In S402, a human body rotation vector is determined based on human body feature information in the keypoint image.
In this embodiment, generally, the necessary information in the keypoint image when determining the human body rotation vector, that is, the human body feature information, should include human body keypoints located on the human body of the target object in the keypoint image, and the human body keypoints include, for example, feature information of a nose keypoint (for describing the longitudinal direction of the human body, which may be replaced by other human body keypoints), a left shoulder keypoint, a right shoulder keypoint, and a middle (neck) keypoint. The implementation manner for determining the human body rotation vector based on the human body feature information in the keypoint image may specifically refer to the related description of S401, and is not described herein again.
In this embodiment, the head rotation vector and the human body rotation vector, which can be used for representing a user gesture, are respectively calculated by the gesture recognition network, and the user gesture can be described more specifically, that is, the head rotation vector is used for describing the head gesture of the user, and the human body rotation vector is used for describing the human body gesture of the user, so that a more detailed basis is provided when a user state is determined later, and detection accuracy is improved. On the other hand, the head rotation vector and the human body rotation vector are respectively calculated through partial feature information in the key point image, so that the calculation amount can be reduced, and the output efficiency of the gesture recognition network is improved.
Fig. 6 shows a flowchart of an implementation of the method provided in the fourth embodiment of the present application. Referring to fig. 6, in comparison with the embodiment shown in fig. 4, the method provided in this embodiment includes steps S601 to S604, which are detailed as follows:
further, before the obtaining of the original video about the user, the method further includes:
in S601, a training image set is acquired.
In this embodiment, the training image set includes a plurality of training images; typically, the training image set is acquired by a camera. It should be appreciated that, in order to ensure the accuracy of the trained gesture recognition network when determining the user state based on the original video, the relative position between this camera and the sample object should be consistent with the relative position between the camera acquiring the original video and the target object.
It should be appreciated that, when each training image in the set of training images is acquired, pose information of the sample object at the time of acquisition of the training image should be recorded so as to subsequently configure true pose information for each training image.
In S602, true pose information is configured for each training image.
In this embodiment, the true posture information includes a head true rotation vector and a human true rotation vector. When each training image in the training image set is acquired in S601, a head rotation vector and a human body rotation vector of a sample object at the time of acquiring the training image are recorded as the head truth rotation vector and the human body truth rotation vector.
In one possible implementation, the sample object is instructed to perform a sitting motion with a head rotation vector (a, b, c) and a body rotation vector (i, j, k), and then a training image of the sample object is acquired, and (a, b, c) is used as the head true value rotation vector corresponding to the training image, and (i, j, k) is used as the body true value rotation vector corresponding to the training image.
In S603, each training image is imported into the keypoint extraction network, and the keypoint training images are output.
In this embodiment, since the implementation manner of S603 is completely the same as the implementation manner of S102 in the embodiment described in fig. 1, for specific description, reference may be made to related description of S102, and details are not repeated here.
In S604, the keypoint training image is used as input, the true-value posture information is used as output, and the posture recognition network is trained based on a deep learning algorithm.
In this embodiment, the true-value posture information includes a head true-value rotation vector and a human body true-value rotation vector; the deep learning algorithm may be a keras deep learning algorithm; the gesture recognition network may be a preset neural network; the above-mentioned training of the gesture recognition network based on the deep learning algorithm with the key point training image as input and the true value gesture information as output may specifically be: presetting a neural network as the posture recognition network, taking the key point training image as input, and outputting a head predicted rotation vector and a human body predicted rotation vector; and updating the internal parameters of the posture identification network based on a keras deep learning algorithm by taking the head true value rotation vector and the human body true value rotation vector as true values and taking the head predicted rotation vector and the human body predicted rotation vector as predicted values.
It should be understood that the gesture recognition network should include two parallel calculation layers, namely a head rotation vector calculation layer and a human body rotation vector calculation layer. With reference to the embodiment shown in fig. 4, the training process includes: updating the parameters of the head rotation vector calculation layer based on the keras deep learning algorithm, with the face feature information of the key point training image as input, the head predicted rotation vector as the predicted value, and the head true-value rotation vector as the true value; and updating the parameters of the human body rotation vector calculation layer based on the keras deep learning algorithm, with the human body feature information of the key point training image as input, the human body predicted rotation vector as the predicted value, and the human body true-value rotation vector as the true value.
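A minimal Keras sketch of such a two-branch network is given below; the layer sizes, the loss, and the flattened (x, y) key point input encoding are assumptions, not the application's exact network.

```python
# Sketch of a two-branch gesture recognition network trained with Keras
# (assumed layer sizes and input encoding; not the application's exact network).
import tensorflow as tf

def build_pose_network():
    face_in = tf.keras.Input(shape=(10,), name="face_features")   # 5 face key points x (x, y)
    body_in = tf.keras.Input(shape=(8,), name="body_features")    # 4 body key points x (x, y)
    head_vec = tf.keras.layers.Dense(3, name="head_rotation")(
        tf.keras.layers.Dense(32, activation="relu")(face_in))    # head rotation vector branch
    body_vec = tf.keras.layers.Dense(3, name="body_rotation")(
        tf.keras.layers.Dense(32, activation="relu")(body_in))    # human body rotation vector branch
    model = tf.keras.Model(inputs=[face_in, body_in], outputs=[head_vec, body_vec])
    model.compile(optimizer="adam", loss="mse")
    return model

# Training against the recorded true-value rotation vectors:
# model = build_pose_network()
# model.fit([x_face, x_body], [y_head_true, y_body_true], epochs=50, validation_split=0.2)
```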
It should be understood that a part of the training images in the training image set may be set aside as a verification image set, which is used to determine the accuracy of the gesture recognition network after each training period; if the accuracy of the gesture recognition network is higher than or equal to a preset standard accuracy, the training of the gesture recognition network is finished.
In this embodiment, the gesture recognition network is constructed and trained based on the acquired training image set and the deep learning algorithm, and the trained gesture recognition network can implement the step S103, in particular, the head rotation vector and the human body rotation vector can be output by using the key point image as input, so as to subsequently determine the user state.
Fig. 7 shows a flowchart of an implementation of the method provided in the fifth embodiment of the present application. Referring to fig. 7, with respect to the embodiment shown in fig. 4, the method S104 provided in this embodiment includes S701 to S703, which are detailed as follows:
further, the determining the user state according to the posture information corresponding to all the original images in the original video includes:
in S701, a head pose is determined based on the head rotation vector.
In this embodiment, in order to distinguish the head postures of the user based on the head rotation vector, thresholds are set in advance for the rotation angle values of the head rotation vector in each direction. In a possible implementation manner, the determining the head pose based on the head rotation vector may specifically be: determining the sub-head pose corresponding to each direction based on the rotation angle value of the head rotation vector in that direction, and identifying the set of sub-head poses corresponding to the three directions as the head pose. Exemplarily, the head rotation vector is (a, b, c); taking the rotation angle value a corresponding to the x-axis as an example: two thresholds a1 and a2 are set for the rotation angle value corresponding to the x-axis, where a1 and a2 satisfy -90 < a1 < 0 < a2 < 90. When a is in the interval [-90, a1], the sub-head pose A corresponding to the x-axis is determined to be head-up (which may specifically be represented as A = 1); when a is in the interval [a1, a2], the sub-head pose A corresponding to the x-axis is determined to be normal (which may be represented as A = 0); when a is in the interval [a2, 90], the sub-head pose A corresponding to the x-axis is determined to be head-down (which may be represented as A = -1). It should be understood that, for the rotation angle value b corresponding to the y-axis, two thresholds b1 and b2 are set and the sub-head pose B is determined, and for the rotation angle value c corresponding to the z-axis, two thresholds c1 and c2 are set and the sub-head pose C is determined, in the same way as above, which is not repeated here. The set of sub-head poses corresponding to the three directions is then identified as the head pose, i.e. the head pose is specifically (A, B, C). It should be understood that the head pose has 3 sub-head poses, each with 3 possible values, so the head pose has 27 possible values in total.
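The discretisation above can be sketched as follows; the threshold values and the value coding are placeholders, not values given by the application.

```python
# Sketch of S701: discretise each rotation angle into a sub-head pose.
def sub_pose(angle, low, high):
    """1 = head-up interval, 0 = normal, -1 = head-down interval (placeholder coding)."""
    if angle <= low:
        return 1
    if angle <= high:
        return 0
    return -1

def head_pose(head_rotation_vector, thresholds=((-20, 20), (-30, 30), (-30, 30))):
    return tuple(sub_pose(v, lo, hi)
                 for v, (lo, hi) in zip(head_rotation_vector, thresholds))
    # 3 sub-poses x 3 values each = 27 possible head poses
```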
In S702, a human body posture is determined based on the human body rotation vector.
In this embodiment, since the implementation of determining the human body posture based on the human body rotation vector is exactly the same as that of S701, reference may be made to the description of determining the head pose based on the head rotation vector in S701, which is not repeated here. In general, a change of the human body posture produced by rotation about the x-axis is already reflected to some extent in the head posture; for example, the head posture is generally head-down when the human body bends forward. Therefore, exemplarily, the sub-body posture corresponding to the x-axis may be ignored in S702; that is, the human body posture has 2 sub-body postures, each with 3 values, so the human body posture has 9 possible values in total.
It should be understood that, if a user posture for the original video needs to be output, the user posture may be generated based on the head postures and the human body postures of all the original images in the original video, and then output. Optionally, the head pose value accounting for the highest proportion among the head poses corresponding to all the original images in the original video is identified as the head pose corresponding to the original video; the human body pose value accounting for the highest proportion among the human body poses corresponding to all the original images is identified as the human body pose corresponding to the original video; and the user posture is generated based on this head pose and human body pose and output. Referring to the possible implementations above, the head pose has 27 values and the human body pose has 9 values, so the user posture has 243 possible values in total.
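A sketch of this optional per-video user posture is shown below (majority vote over frames; tie-breaking is arbitrary and not specified by the application).

```python
# Sketch: most frequent head pose and body pose across all original images.
from collections import Counter

def video_user_posture(head_poses, body_poses):
    head = Counter(head_poses).most_common(1)[0][0]   # one of 27 head pose values
    body = Counter(body_poses).most_common(1)[0][0]   # one of 9 body pose values
    return head, body                                 # 27 x 9 = 243 possible user postures
```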
In S703, the user state is determined according to the head rotation vectors, the head postures, the human body rotation vectors, and the human body postures corresponding to all the original images in the original video.
In this embodiment, the user state is used to characterize whether the user's attention is focused or not in the original video; each frame of original image in the original video corresponds to one head rotation vector, one head pose, one human body rotation vector and one human body pose.
In a possible implementation manner, the specific implementation of S703 may be: comparing, based on the timestamp of each frame of original image, the variation of the head rotation vector and the human body rotation vector between two adjacent frames of original images, to judge whether there is a change between the two adjacent frames large enough to constitute inattention. Specifically, the head rotation vector and the human body rotation vector of the original image are (a_n, b_n, c_n) and (i_n, j_n, k_n) respectively, and those of the previous frame of original image are (a_(n-1), b_(n-1), c_(n-1)) and (i_(n-1), j_(n-1), k_(n-1)) respectively, where n is the frame number of the original image in the original video. When judging whether two adjacent frames of original images exhibit a change large enough to constitute inattention, the rotation angle value i corresponding to the x-axis of the human body rotation vector is generally not considered; that is, only the changes of the rotation angle values in 5 dimensions are considered, namely the x, y and z axes of the head rotation vector and the y and z axes of the human body rotation vector. For the specific reason, see the description of S702 above.
Optionally, a change threshold is set in advance for each dimension; if the change in any one dimension is greater than or equal to the change threshold corresponding to that dimension, it is determined that there is a change between the two adjacent frames of original images large enough to constitute inattention. Specifically, taking the dimension corresponding to the x-axis of the head rotation vector as an example, if |a_n - a_(n-1)| is greater than or equal to the change threshold corresponding to the x-axis of the head rotation vector, the original image is identified as a changed frame image; otherwise it is not identified. The changed frame image is used to indicate that the user's attention is not focused in that original image.
Optionally, an average change value in all dimensions is calculated, an average change threshold is preset, if the average change value is greater than or equal to the average change threshold, it is determined that there is enough change between the two adjacent frames of the original images to constitute inattentive attention, and the original images are identified as changed frame images, otherwise, the original images are not identified. The average variation value is an average value of variation values in each dimension.
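A sketch of this average-change criterion over the five considered dimensions follows; the average change threshold value is a placeholder.

```python
# Sketch: average absolute change over the x/y/z head angles and y/z body angles.
def is_changed_frame_avg(head_cur, head_prev, body_cur, body_prev, avg_threshold=15.0):
    deltas = [abs(head_cur[d] - head_prev[d]) for d in range(3)]
    deltas += [abs(body_cur[d] - body_prev[d]) for d in (1, 2)]   # body x-axis angle ignored
    return sum(deltas) / len(deltas) >= avg_threshold
```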
In another possible implementation manner, the user state is determined according to the head rotation vectors, the head pose, the human body rotation vectors, and the human body pose corresponding to all the original images in the original video, specifically, refer to fig. 8, and fig. 8 shows a schematic flow chart for determining the user state provided in a fifth embodiment of the present application. Referring to fig. 8, S703 includes S7031 to S7034, which are detailed as follows:
further, the determining the user state according to the head rotation vectors, the head postures, the human body rotation vectors and the human body postures corresponding to all the original images in the original video includes: s7031 and/or S7032 and/or S7033, and S7034.
In S7031, if the difference between the head rotation vectors of the original image and the previous frame of original image of the original image is greater than or equal to a first threshold, or the difference between the body rotation vectors of the original image and the previous frame of original image of the original image is greater than or equal to a first threshold, the original image is identified as a changed frame image.
In this embodiment, specifically, the head rotation vector and the human body rotation vector of the original image are (a_n, b_n, c_n) and (i_n, j_n, k_n) respectively, and those of the previous frame of original image are (a_(n-1), b_(n-1), c_(n-1)) and (i_(n-1), j_(n-1), k_(n-1)) respectively, where n is the frame number of the original image in the original video and n is greater than 0.
Generally, in this embodiment, the rotation angle value i corresponding to the x-axis of the human body rotation vector is not considered, that is, only the change of the rotation angle values corresponding to 5 dimensions, i.e., the xyz-axis of the head rotation vector and the yz-axis of the human body rotation vector, is considered, and specifically, see fig. 8.
Preferably, a first threshold (e.g., 30) is preset; if any of |a_n - a_(n-1)|, |b_n - b_(n-1)|, |c_n - c_(n-1)|, |j_n - j_(n-1)| or |k_n - k_(n-1)| is greater than or equal to the first threshold, the original image is identified as a changed frame image; otherwise it is not identified.
In S7032, if the head pose corresponding to the original image is different from the head pose corresponding to the original image of the previous frame of the original image, and the difference between the head rotation vectors of the original image and the original image of the previous frame of the original image is greater than or equal to a second threshold, the original image is identified as a changed frame image.
In this embodiment, the second threshold is smaller than the first threshold. Specifically, the head pose has 27 different values; referring to fig. 8, the head pose of the original image is p_n, and the head pose of the previous frame of original image is p_(n-1).
Preferably, a second threshold (e.g., 20) is preset; if p_n ≠ p_(n-1) and any of |a_n - a_(n-1)|, |b_n - b_(n-1)| or |c_n - c_(n-1)| is greater than or equal to the second threshold, the original image is identified as a changed frame image; otherwise it is not identified.
In S7033, if the human body posture corresponding to the original image is different from the human body posture corresponding to the previous original image of the original image, and the difference between the human body rotation vectors of the original image and the previous original image of the original image is greater than or equal to the second threshold, the original image is identified as a changed frame image.
In this embodiment, specifically, the human body posture has 9 different values; referring to fig. 8, the human body posture of the original image is q_n, and the human body posture of the previous frame of original image is q_(n-1).
Generally, in this embodiment, the rotation angle value i corresponding to the x-axis of the human body rotation vector is not considered, that is, only the change of the rotation angle values corresponding to 5 dimensions, i.e., the xyz-axis of the head rotation vector and the yz-axis of the human body rotation vector, is considered.
Preferably, if q_n ≠ q_(n-1) and either |j_n - j_(n-1)| or |k_n - k_(n-1)| is greater than or equal to the second threshold, the original image is identified as a changed frame image; otherwise it is not identified.
In S7034, if the ratio of all the changed frame images in all the original images is greater than or equal to a preset ratio, the user state is identified as inattentive.
In this embodiment, specifically, the original video includes N frames of the original image, and if the number of the changed frame images in the original video is M, the ratio is M/N, where N is greater than M. And if the M/N is larger than or equal to a preset proportion (for example, 40%), identifying the user state as inattentive, otherwise identifying the user state as attentive.
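The three change conditions and the final ratio test can be combined as in the sketch below; the first/second thresholds and the 40% ratio reuse the example values given above, and the per-frame dictionary format is an assumption.

```python
# Sketch of S7031-S7034: a frame is a changed frame if any of the three conditions
# holds; the user state then follows from the changed-frame ratio.
def is_changed_frame(cur, prev, t1=30, t2=20):
    """cur/prev: dicts with 'head_vec', 'body_vec', 'head_pose', 'body_pose'."""
    dh = [abs(cur["head_vec"][d] - prev["head_vec"][d]) for d in range(3)]
    db = [abs(cur["body_vec"][d] - prev["body_vec"][d]) for d in (1, 2)]   # body x-axis ignored
    if max(dh + db) >= t1:                                                 # S7031
        return True
    if cur["head_pose"] != prev["head_pose"] and max(dh) >= t2:            # S7032
        return True
    if cur["body_pose"] != prev["body_pose"] and max(db) >= t2:            # S7033
        return True
    return False

def determine_user_state_from_poses(frames, preset_ratio=0.4):
    changed = sum(is_changed_frame(cur, prev) for prev, cur in zip(frames, frames[1:]))
    return "inattentive" if changed / len(frames) >= preset_ratio else "attentive"   # S7034
```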
In the embodiment, a change frame image for representing the inattention of the user is identified from the original video, and the user state is determined based on the proportion of the change frame image, so as to generate an attention detection result subsequently.
In this embodiment, the head pose and the human body pose are determined first, and then the change frame image for representing the inattention of the user is identified from the original video according to the head pose and the human body pose, so that the basis for identifying the change frame image is increased, the accuracy for subsequently determining the user state is improved, and a more accurate attention detection result is generated subsequently.
Fig. 9 shows a flowchart of an implementation of the method provided in the sixth embodiment of the present application. Referring to fig. 9, with respect to any one of the above embodiments, the method S105 provided in this embodiment includes S901 to S902, which are detailed as follows:
further, the outputting the attention detection result includes:
in S901, the attention detection result is sent to the user terminal.
In this embodiment, a connection is established with the user terminal, and the attention detection result output in S105 is sent to the user terminal. The connection with the user terminal may be established by searching for a user terminal within a connectable range, or may be established with the user terminal through a transit server.
In S902, the user terminal is instructed to display the attention detection result.
In this embodiment, a connection is established with the user terminal, a request for displaying the attention detection result sent in S901 is sent, and the user terminal is instructed to display the attention detection result through a display module of the user terminal, so as to notify the user.
In a possible implementation manner, the user terminal may be a supervising terminal, and the supervising terminal may be a terminal device used by a supervisor of the user, and the attention detection result is sent to the supervising terminal and is instructed to be displayed by the supervising terminal to inform the supervisor.
In this embodiment, the attention detection result is sent to the user terminal to inform the user, so that the user can adjust his or her state in time according to the result. The user terminal may also store the attention detection results so that the user can review them at any time, and may perform an integrated analysis on all received attention detection results to obtain an analysis report about the user's attention that is easier for the user to understand.
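Purely as an illustrative sketch of S901 and S902, the snippet below pushes an attention detection result to a user terminal over HTTP and asks the terminal to display it. The endpoint path, the payload fields and the choice of HTTP are assumptions; the embodiment does not prescribe a particular transport protocol.

```python
# Illustrative only: the URL scheme, endpoint path and payload schema are assumed.
import requests


def push_result(terminal_url: str, detection_result: dict) -> bool:
    """Send the attention detection result to the user terminal (S901) and
    request that it be displayed through the terminal's display module (S902)."""
    payload = {
        "type": "attention_detection_result",
        "result": detection_result,
        "display": True,  # instruct the terminal to show the result to the user
    }
    resp = requests.post(f"{terminal_url}/attention/result", json=payload, timeout=5)
    return resp.status_code == 200
```

The same pattern applies when the result is forwarded to a supervising terminal instead of the user's own terminal.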
Fig. 10 shows a schematic structural diagram of a detection apparatus provided in an embodiment of the present application, corresponding to the method described in the above embodiments; for convenience of description, only the parts related to the embodiments of the present application are shown.
Referring to fig. 10, the attention detecting device includes: an original video acquisition module for acquiring an original video about a user; the original video comprises a plurality of frames of original images; the key point extraction module is used for respectively importing the original images of the multiple frames into a key point extraction network and outputting key point images; the gesture recognition module is used for importing the key point image into a gesture recognition network and outputting gesture information; the state determining module is used for determining the state of a user according to the corresponding posture information of all the original images in the original video; the detection result generation module is used for generating an attention detection result based on the original video and the user state; and the detection result output module is used for outputting the attention detection result.
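For illustration, a minimal skeleton of this module layout might look as follows; the class and method names are hypothetical and only mirror the module responsibilities listed above.

```python
# Hypothetical skeleton mirroring the modules shown in fig. 10.
class AttentionDetectionDevice:
    def __init__(self, keypoint_net, pose_net):
        self.keypoint_net = keypoint_net  # key point extraction network
        self.pose_net = pose_net          # gesture (posture) recognition network

    def run(self, original_video, output_fn=print):
        # Original video acquisition module: original_video is a list of frames.
        keypoint_images = [self.keypoint_net(f) for f in original_video]  # key point extraction module
        poses = [self.pose_net(kp) for kp in keypoint_images]             # gesture recognition module
        state = self.determine_state(poses)                               # state determining module
        result = {"video": original_video, "user_state": state}           # detection result generation module
        output_fn(result)                                                 # detection result output module
        return result

    def determine_state(self, poses):
        # Placeholder for the changed-frame / proportion logic of the method embodiments.
        raise NotImplementedError
```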
It should be noted that the information interaction and execution processes between the above apparatus and its modules are based on the same concept as the method embodiments of the present application; for their specific functions and technical effects, reference may be made to the method embodiments, and details are not repeated here.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 11 shows a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 11, the terminal device 11 of this embodiment includes: at least one processor 110 (only one shown in fig. 11), a memory 111, and a computer program 112 stored in the memory 111 and operable on the at least one processor 110, the processor 110 implementing the steps in any of the various method embodiments described above when executing the computer program 112.
The terminal device 11 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, the processor 110 and the memory 111. Those skilled in the art will appreciate that fig. 11 is only an example of the terminal device 11 and does not constitute a limitation on the terminal device 11, which may include more or fewer components than those shown, combine some components, or use different components; for example, it may further include an input/output device, a network access device, and the like.
It should be understood that, when the terminal device 11 is a computing device such as a cloud server that does not itself have a function of capturing the original video, it may acquire an original video uploaded by another device and implement the detection method of the present application based on that uploaded original video.
The processor 110 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 111 may, in some embodiments, be an internal storage unit of the terminal device 11, such as a hard disk or an internal memory of the terminal device 11. In other embodiments, the memory 111 may also be an external storage device of the terminal device 11, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal device 11. Further, the memory 111 may include both an internal storage unit and an external storage device of the terminal device 11. The memory 111 is used for storing an operating system, application programs, a boot loader, data, and other programs, such as the program code of the computer program. The memory 111 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application also provide a computer program product which, when run on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing apparatus/terminal apparatus, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, a computer-readable medium may not be an electrical carrier signal or a telecommunications signal.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. An attention detection method, comprising:
acquiring an original video about a user; the original video comprises a plurality of frames of original images;
respectively importing multiple frames of original images into a key point extraction network, and outputting key point images;
importing the key point image into a gesture recognition network, and outputting gesture information;
determining the user state according to the gesture information corresponding to all the original images in the original video;
and generating an attention detection result based on the original video and the user state, and outputting the attention detection result.
2. The detection method of claim 1, wherein the key point extraction network comprises a human body recognition layer and a key point recognition layer; the step of respectively importing multiple frames of the original images into a key point extraction network and outputting key point images comprises the following steps:
importing the original image into the human body recognition layer, and intercepting a human body image from the original image;
and importing the human body image into the key point identification layer, extracting a plurality of key points on the human body image, and outputting a key point image containing the plurality of key points.
3. The detection method of claim 1, wherein the pose information includes a head rotation vector and a body rotation vector; the importing the key point image into a gesture recognition network and outputting gesture information includes:
determining a head rotation vector based on the face feature information in the key point image;
and determining a human body rotation vector based on the human body feature information in the key point image.
4. The detection method of claim 3, wherein, prior to the acquiring of the original video about the user, the method further comprises:
acquiring a training image set; the training image set comprises a plurality of training images;
configuring true value attitude information for each training image; the truth posture information comprises a head truth rotation vector and a human body truth rotation vector;
respectively importing each training image into a key point extraction network, and outputting key point training images;
and training the posture recognition network based on a deep learning algorithm by taking the key point training image as input and the true value posture information as output.
5. The detection method according to claim 3, wherein the determining the user state according to the pose information corresponding to all the original images in the original video comprises:
determining a head pose based on the head rotation vector;
determining a human body posture based on the human body rotation vector;
and determining the user state according to the head rotation vectors, the head postures, the human body rotation vectors and the human body postures corresponding to all the original images in the original video.
6. The detection method according to claim 5, wherein the determining the user state according to the head rotation vector, the head pose, the human body rotation vector and the human body pose corresponding to all the original images in the original video comprises:
if the difference value of the head rotation vectors of the original image and the previous frame of original image of the original image is greater than or equal to a first threshold value, or the difference value of the human body rotation vectors of the original image and the previous frame of original image of the original image is greater than or equal to a first threshold value, identifying the original image as a changed frame image;
and/or,
if the head pose corresponding to the original image is different from the head pose corresponding to the original image of the previous frame of the original image, and the difference value of the head rotation vectors of the original image and the original image of the previous frame of the original image is greater than or equal to a second threshold value, identifying the original image as a changed frame image; the second threshold is less than the first threshold;
and/or,
if the human body posture corresponding to the original image is different from the human body posture corresponding to the previous frame of original image of the original image, and the difference value of the human body rotation vectors of the original image and the previous frame of original image of the original image is greater than or equal to the second threshold value, identifying the original image as a changed frame image;
and if the ratio of all the changed frame images in all the original images is greater than or equal to a preset ratio, identifying the user state as inattentive.
7. The detection method of any one of claims 1-6, wherein said outputting the attention detection result comprises:
sending the attention detection result to a user terminal;
and instructing the user terminal to display the attention detection result.
8. An attention detection device, comprising:
an original video acquisition module for acquiring an original video about a user; the original video comprises a plurality of frames of original images;
the key point extraction module is used for respectively importing the original images of the multiple frames into a key point extraction network and outputting key point images;
the gesture recognition module is used for importing the key point image into a gesture recognition network and outputting gesture information;
the state determining module is used for determining the state of a user according to the corresponding posture information of all the original images in the original video;
the detection result generation module is used for generating an attention detection result based on the original video and the user state;
and the detection result output module is used for outputting the attention detection result.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202010845697.8A 2020-08-20 2020-08-20 Attention detection method and device Active CN112101123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845697.8A CN112101123B (en) 2020-08-20 2020-08-20 Attention detection method and device

Publications (2)

Publication Number Publication Date
CN112101123A true CN112101123A (en) 2020-12-18
CN112101123B CN112101123B (en) 2024-05-28

Family

ID=73753329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845697.8A Active CN112101123B (en) 2020-08-20 2020-08-20 Attention detection method and device

Country Status (1)

Country Link
CN (1) CN112101123B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609517A (en) * 2017-09-15 2018-01-19 华中科技大学 A kind of classroom behavior detecting system based on computer vision
WO2020077822A1 (en) * 2018-10-17 2020-04-23 深圳壹账通智能科技有限公司 Image feature configuration and verification method and apparatus, computer device and medium
CN111382714A (en) * 2020-03-13 2020-07-07 Oppo广东移动通信有限公司 Image detection method, device, terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭 赟等: "基于头部姿态的学习注意力判别研究", 科学技术与工程, pages 2 - 5 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597935A (en) * 2020-12-29 2021-04-02 北京影谱科技股份有限公司 Attention level detection method and device, computing equipment and storage medium
CN112817550A (en) * 2021-02-07 2021-05-18 联想(北京)有限公司 Data processing method and device
CN112817550B (en) * 2021-02-07 2023-08-22 联想(北京)有限公司 Data processing method and device
CN112699857A (en) * 2021-03-24 2021-04-23 北京远鉴信息技术有限公司 Living body verification method and device based on human face posture and electronic equipment
CN114288684A (en) * 2021-12-31 2022-04-08 深圳数联天下智能科技有限公司 Control method and device of intelligent toy, intelligent toy and medium
CN114288684B (en) * 2021-12-31 2024-05-28 深圳数联天下智能科技有限公司 Control method and device of intelligent toy, intelligent toy and medium
CN117333927A (en) * 2023-12-01 2024-01-02 厦门磁北科技有限公司 Vehicle-mounted face recognition alcohol detection method and system
CN117333927B (en) * 2023-12-01 2024-04-16 厦门磁北科技有限公司 Vehicle-mounted face recognition alcohol detection method and system

Also Published As

Publication number Publication date
CN112101123B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN112101123B (en) Attention detection method and device
CN106897658B (en) Method and device for identifying human face living body
US11074436B1 (en) Method and apparatus for face recognition
CN109902659B (en) Method and apparatus for processing human body image
WO2018028546A1 (en) Key point positioning method, terminal, and computer storage medium
US20180308107A1 (en) Living-body detection based anti-cheating online research method, device and system
CN112101124B (en) Sitting posture detection method and device
CN109359548A (en) Plurality of human faces identifies monitoring method and device, electronic equipment and storage medium
CN109670444B (en) Attitude detection model generation method, attitude detection device, attitude detection equipment and attitude detection medium
CN110476141A (en) Sight tracing and user terminal for executing this method
CN105654033A (en) Face image verification method and device
CN112712053B (en) Sitting posture information generation method and device, terminal equipment and storage medium
CN111144356B (en) Teacher sight following method and device for remote teaching
CN113870395A (en) Animation video generation method, device, equipment and storage medium
CN111107278B (en) Image processing method and device, electronic equipment and readable storage medium
JP2024508403A (en) Data encryption methods, devices, computer equipment and computer programs
CN104036169A (en) Biometric authentication method and biometric authentication device
CN113867532A (en) Evaluation system and evaluation method based on virtual reality skill training
US20160110909A1 (en) Method and apparatus for creating texture map and method of creating database
CN111881740A (en) Face recognition method, face recognition device, electronic equipment and medium
CN110476180A (en) User terminal for providing the method for the reward type advertising service based on text reading and for carrying out this method
CN102783174B (en) Image processing equipment, content delivery system, image processing method and program
CN111738199A (en) Image information verification method, image information verification device, image information verification computing device and medium
RU2005100267A (en) METHOD AND SYSTEM OF AUTOMATIC VERIFICATION OF THE PRESENCE OF A LIVING FACE OF A HUMAN IN BIOMETRIC SECURITY SYSTEMS
CN112069931A (en) State report generation method and state monitoring system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant