CN115457617A - Concentration degree identification method based on artificial intelligence education platform - Google Patents

Concentration degree identification method based on artificial intelligence education platform

Info

Publication number
CN115457617A
Authority
CN
China
Prior art keywords
user
face
attending
class
video frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210061937.4A
Other languages
Chinese (zh)
Inventor
谢天明
陈哲
杨怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Jiegao Education Technology Co ltd
Original Assignee
Chengdu Jiegao Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Jiegao Education Technology Co ltd filed Critical Chengdu Jiegao Education Technology Co ltd
Priority to CN202210061937.4A priority Critical patent/CN115457617A/en
Publication of CN115457617A publication Critical patent/CN115457617A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a concentration degree identification method based on an artificial intelligence education platform, which comprises the following steps: capturing input video frames of a plurality of attending users; detecting the face regions of the attending users; computing a pixel-average image over the cropped face image windows to build a face appearance model of each attending user; generating paths of the attending users across the input video frames; estimating the direction of each detected face to calculate concentration; detecting the number of faces in a frontal posture, and thereby detecting attending users who have gazed in the direction of the displayed content for a predefined length of time; calculating the time each attending user spends watching the displayed content to obtain that user's concentration degree; associating observed body behavior with one of a plurality of emotion type labels; and using features extracted from the video frame data to train a classifier, which is then used to detect the emotional feedback of attending users. The method is well suited to low-resolution imaging scenarios, combines visual recognition with emotion recognition, and helps the artificial intelligence education platform obtain the concentration distribution of attending users in real time.

Description

Concentration degree identification method based on artificial intelligence education platform
Technical Field
The invention relates to intelligent education, in particular to a concentration degree identification method based on an artificial intelligence education platform.
Background
In recent years, image recognition has been combined with education-related scenarios and gradually applied to personalized education, automatic scoring, speech recognition assessment and the like. Students receive customized learning support, forming adaptive education oriented toward the future. To obtain students' concentration, frontal videos of students in class can be collected by a camera; extracting the face regions from the video images makes it possible to determine how many students are listening attentively and what their facial expressions are, providing data support for evaluating educational effect. The prior art measures a person's attention with eye-gaze-based techniques, but measuring eye gaze typically requires close-range, high-resolution images and is prone to error with long-range, low-resolution images.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a concentration degree identification method based on an artificial intelligence education platform, which comprises the following steps:
capturing, by an image capture device, a plurality of input video frames of a plurality of attending users in an area where display content is located;
segmenting the plurality of input video frames to obtain regions with skin-color pixel values, and detecting the face regions of the attending users in the plurality of input video frames with a machine-learning-based face detection method;
establishing a face appearance model of each attending user by computing a pixel-average image over the cropped face image windows;
individually tracking each detected face and maintaining the identity assigned to the attending user by generating a path for the attending user across the plurality of input video frames, wherein when the face of an attending user is detected, a path is generated for that user and the detected face is assigned to the generated path;
estimating the direction of each detected face to calculate concentration; detecting the number of faces in a frontal posture so as to identify attending users who have gazed in the direction of the displayed content for a predefined length of time;
calculating the concentration degree of each attending user from the time that user spends watching the displayed content;
processing the video frame data to detect body behaviors of the attending users in the sequence of video frames; associating each observed body behavior with one of a plurality of emotion type labels, wherein each type label corresponds to a respective emotional feedback; and using features extracted from the video frame data to train a classifier, which is then used to detect the emotional feedback of attending users in the sequence of video frames.
Preferably, the method further comprises:
potential attending users of the displayed content are determined by tracking a plurality of behaviors of a plurality of users around the displayed content.
Preferably, each emotional feedback is a predicted facial expression expressing an emotional state of the attending user, and the method further comprises capturing second video frame data of the attending user; applying features extracted from the second video frame data to the classifier to determine an emotional state of the attending user.
Preferably, the method further comprises:
applying a Viola-Jones face detector algorithm to the input video frames to determine a face region; applying a deformable-part-based model to determine an ROI region corresponding to the facial landmarks of the attending user within the face region; extracting features in the ROI region; associating the features with emotion types; and training a classifier using the association results.
Preferably, a feature histogram is generated from the extracted features; coordinate transformation is performed on the ROI region across a plurality of video frames;
the extracted features are concatenated to generate feature descriptors;
and the classifier is trained using the final feature descriptors and the feature histograms.
Compared with the prior art, the invention has the following advantages:
the invention provides a concentration degree identification method based on an artificial intelligence education platform, which is better suitable for application scenes of low-resolution images, combines visual identification and emotion identification and helps the artificial intelligence education platform to acquire the concentration degree distribution state of a class-attending user in real time.
Drawings
Fig. 1 is a flowchart of a concentration recognition method based on an artificial intelligence education platform according to an embodiment of the present invention.
Detailed Description
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details.
One aspect of the invention provides a concentration degree identification method based on an artificial intelligence education platform. Fig. 1 is a flowchart of a concentration recognition method based on an artificial intelligence education platform according to an embodiment of the present invention.
The invention automatically measures the concentration of attending users on the displayed content by counting the number of attending users and the duration of their gaze. Concentration also covers the concentration level of the attending users, the amount of concentration (for example, how many people actually looked at the display), the average length of concentration, the distribution of concentration time, and a score based on the attending users' responses. The potential audience of the displayed content is measured by tracking the behavior of attending users around a given piece of displayed content. A means for capturing images is employed to gather information about the proximity of attending users to the displayed content.
The actual number of users attending to the displayed content is measured with a forward-facing means for capturing images that detects when people are looking at the screen. The calculation of concentration time starts once a user has looked toward the screen for a predefined minimum length of time. The total count of concentrating viewers constitutes the number of actual attending users of the displayed content.
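By way of a non-limiting illustration, the dwell-time counting rule above can be sketched in Python as follows; the frame rate, the minimum gaze duration and the data structures are assumptions introduced here for illustration and are not prescribed by the method.

```python
# Hypothetical sketch of the dwell-time counting rule; FRAME_RATE and
# MIN_GAZE_SECONDS are assumed values, not taken from the disclosure.
FRAME_RATE = 15                  # frames per second of the capture device
MIN_GAZE_SECONDS = 2.0           # predefined minimum gaze length

def update_viewer_count(path_ids, frontal_flags, gaze_frames, viewers):
    """path_ids: ids of the paths active in this frame;
    frontal_flags: dict path_id -> bool, face frontal toward the screen;
    gaze_frames: dict path_id -> consecutive frontal-frame count;
    viewers: set of path ids already counted as actual attending users."""
    for pid in path_ids:
        if frontal_flags.get(pid, False):
            gaze_frames[pid] = gaze_frames.get(pid, 0) + 1
            # concentration time starts only after the minimum gaze length
            if gaze_frames[pid] >= MIN_GAZE_SECONDS * FRAME_RATE:
                viewers.add(pid)
        else:
            gaze_frames[pid] = 0     # gaze interrupted: reset the counter
    return len(viewers)              # total number of actual attending users
```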
The invention combines skin color detection with pattern-based face detection to correctly detect faces against complex backgrounds, so that the tracking method can accurately mark entry and exit times. Continuity of each path is achieved by combining face detection with face matching. The degree of attention is determined from a three-dimensional pose estimate of the overall appearance change of the face, yielding a measurement that is more meaningful for concentration. Attending users who actually watch the displayed content are distinguished from other users who appear near the displayed content but do not actually watch it.
When a plurality of users are present in the gazing area, the image capturing apparatus captures a multi-person image. The captured images are processed by the control and processing system of a computer system, which applies face detection, face tracking and three-dimensional face pose estimation to the captured visual information of the multiple users. In an exemplary embodiment, the invention also measures the effectiveness of the displayed content for the attending users. Because users view the displayed content within a limited spatial range, robust face detection/tracking and face pose estimation can be exploited. Summing the concentrating viewers of the displayed content gives the number of actual attending users of the displayed content.
The artificial intelligence education platform comprises a skin color detection module, a face detection module, a user path management module, a three-dimensional face pose estimation module and a data collection module. The user path management module further comprises a geometric matching module, a path generation module, a path maintenance module and a path termination module. The skin color detection module determines the regions of a video frame that resemble facial skin color. The face detection module then runs a face detection window over the regions determined by the skin color detection module. Each detected face is first processed by the geometric matching module to determine whether the face belongs to an existing path or to a new user. If the face belongs to a new user, the path generation module is activated to generate a new path and place it in the path queue. If the face belongs to an existing path, the path maintenance module acquires the path data and activates the three-dimensional face pose estimation module. If the geometric matching module cannot find a subsequent face belonging to a given path, the path termination module is activated to store the path data and remove the path from the path queue. The data collection module then records the path data as well as the estimated face pose data.
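A hypothetical per-frame control flow over these modules is sketched below; the callables standing in for the individual modules, and the path and log structures, are placeholders for the processing described in this paragraph, not an implementation of it.

```python
# Hypothetical per-frame control flow over the modules named above.
# 'modules' is a dict of callables that stand in for the real module implementations.
def process_frame(frame, frame_idx, modules, path_queue, data_log):
    skin_regions = modules["skin_color_detect"](frame)          # skin color detection module
    faces = modules["face_detect"](frame, skin_regions)         # face detection module

    for face in faces:
        path = modules["geometric_match"](face, path_queue)     # existing path or None
        if path is None:                                         # new user: generate a path
            path_queue.append(modules["generate_path"](face, frame_idx))
        else:                                                    # existing user: maintain the path
            modules["maintain_path"](path, face, frame_idx)
            pose = modules["estimate_pose"](face)                # 3D face pose estimation module
            data_log.append((path["id"], frame_idx, pose))       # data collection module

    # path termination: no subsequent face found for this path for too long
    for path in list(path_queue):
        if modules["is_stale"](path, frame_idx):
            data_log.append((path["id"], frame_idx, "terminated"))
            path_queue.remove(path)
```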
The artificial intelligence education platform automatically calculates the concentration on the displayed content by processing input video frames from image capture devices placed near the displayed content. The method takes live video as input, detects the users' faces in the video, tracks each user independently while maintaining the user's identity, estimates the three-dimensional face pose, records the time stamps of appearance and disappearance, and collects the data so as to record both the occurrence of concentration and its duration. The perspective offset between the camera and the displayed content is automatically corrected by the three-dimensional pose estimation method.
In the face detection process, skin color segmentation is performed first. In the skin color segmentation step, the regions of a video frame where a face may exist, i.e. the detected skin regions, are segmented using color information. A color space transform is used in which skin color forms a compact region, and using skin color detection as a means of accelerating face detection also markedly reduces faces falsely detected in the background. The output of this step is a collection of mask regions in the video frame. The face detection process then follows. A machine-learning-based approach is used to detect faces within the skin color regions determined in the previous step. The detector operates on a grayscale-converted image. This step provides the location and size of each detected face in a given video frame.
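As a rough sketch of this two-stage detection (skin color segmentation followed by face detection on the grayscale image), the OpenCV fragment below uses a YCrCb threshold and a Haar cascade; both the color thresholds and the choice of detector are illustrative assumptions rather than the exact detectors disclosed here.

```python
# Illustrative sketch, not the patented implementation: skin-color segmentation
# followed by face detection restricted to the skin-color mask regions.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces_in_skin_regions(frame_bgr):
    # color space transform: skin tones form a compact region in YCrCb
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    skin_mask = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    skin_mask = cv2.morphologyEx(skin_mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = []
    contours, _ = cv2.findContours(skin_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w < 24 or h < 24:          # skip regions too small to contain a face
            continue
        roi = gray[y:y + h, x:x + w]
        for (fx, fy, fw, fh) in face_cascade.detectMultiScale(roi, 1.1, 5):
            faces.append((x + fx, y + fy, fw, fh))   # location and size in frame coordinates
    return faces
```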
In the face tracking process, once a face is detected, an automatic face geometry correction step is entered. The estimated face geometry is used to generate a corrected face from the detected face image, so that facial features are placed at standard locations in the cropped face image window. This creates a reliable face appearance model, constructed by computing the per-pixel average image over the entire face image window in the path each time a face is added to the user's path.
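The pixel-average appearance model can be maintained incrementally, as in the following sketch; the fixed window size and the use of grayscale crops are assumptions made for illustration.

```python
# Sketch of a per-path face appearance model built as a running pixel average.
import cv2
import numpy as np

WINDOW = (64, 64)   # assumed standard size of the corrected face image window

class FaceAppearanceModel:
    """Running pixel-average of the corrected face crops added to one path."""
    def __init__(self):
        self.mean = None
        self.count = 0

    def add(self, corrected_face_bgr):
        gray = cv2.cvtColor(corrected_face_bgr, cv2.COLOR_BGR2GRAY)
        crop = cv2.resize(gray, WINDOW).astype(np.float32)
        self.count += 1
        if self.mean is None:
            self.mean = crop
        else:
            # incremental update of the per-pixel average over the whole window
            self.mean += (crop - self.mean) / self.count

    def distance(self, corrected_face_bgr):
        gray = cv2.cvtColor(corrected_face_bgr, cv2.COLOR_BGR2GRAY)
        crop = cv2.resize(gray, WINDOW).astype(np.float32)
        return float(np.mean(np.abs(crop - self.mean)))  # appearance difference
```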
The tracking step is used to monitor the identity of the users in the scene so as to derive a measure of how long each attending user gazes at the displayed content. Tracking uses two measurements: the geometric match between the tracking history and a newly detected face, and the appearance match against the average face appearance stored in the path. Path management is used to generate a path when a new face appears in the scene, assign a path to each detected face so as to monitor the identity of the users in the scene, and terminate the path when a user leaves the scene.
When new faces are detected in the current video frame, a mapping table of faces and paths is constructed. A geometric match score is then calculated for each face-path pair, measuring the likelihood that a given face belongs to a given path. The geometric match score is based on the position, size and time differences between the corrected face and the last face in the path, as well as the difference between the average face appearance stored in the path and the corrected face. If the total score is below a predefined threshold, the pair is excluded from the mapping table. This process is repeated until all faces have been assigned matching paths. A path is terminated if it receives no new face for more than a predefined period of time.
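A simplified sketch of this scoring and assignment step is given below; the weighting of the position, size, time and appearance differences, the thresholds, and the attributes assumed on the path object are placeholders, since the description does not fix them.

```python
# Sketch of the face-to-path matching; all weights and thresholds are assumed.
import numpy as np

MATCH_THRESHOLD = 0.5          # assumed; pairs scoring below this are excluded
MAX_IDLE_FRAMES = 30           # assumed; paths with no new face this long are terminated

def geometric_match_score(face, path, frame_idx):
    """face: dict with 'center', 'size', 'appearance' (corrected crop);
    path: object assumed to hold last_frame, last_center, last_size and an
    appearance_model with a distance() method (see the sketch above)."""
    dt = max(frame_idx - path.last_frame, 1)
    pos_diff = np.linalg.norm(np.subtract(face["center"], path.last_center)) / dt
    size_diff = abs(face["size"] - path.last_size) / max(path.last_size, 1)
    app_diff = path.appearance_model.distance(face["appearance"])
    # convert the combined differences into a likelihood-style score in (0, 1]
    return float(np.exp(-(0.02 * pos_diff + 1.0 * size_diff + 0.05 * app_diff)))

def assign_faces_to_paths(faces, paths, frame_idx):
    table = [(geometric_match_score(f, p, frame_idx), f, p)
             for f in faces for p in paths]
    table = [t for t in table if t[0] >= MATCH_THRESHOLD]     # prune weak pairs
    assignments, used_faces, used_paths = [], set(), set()
    for score, f, p in sorted(table, key=lambda t: -t[0]):    # greedy best-first
        if id(f) not in used_faces and id(p) not in used_paths:
            assignments.append((f, p))
            used_faces.add(id(f)); used_paths.add(id(p))
    new_faces = [f for f in faces if id(f) not in used_faces]  # these start new paths
    return assignments, new_faces
```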
Further, the concentration degree of an attending user during the viewing period is measured accurately by calculating the proportion of the time the user pays attention to the displayed content relative to the total duration for which the user's face is present. Whether a face is in a frontal direction is determined from the estimated face direction, and the ratio of the number of frontal faces to the number of detected faces is then calculated.
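In other words, per-user concentration in this step reduces to the ratio of frontal-pose frames to all frames in which the face was detected; a small sketch follows, with the yaw/pitch limits that define a frontal face assumed for illustration.

```python
# Sketch of the frontal-ratio concentration measure; the angle limits are assumed.
YAW_LIMIT_DEG = 15.0     # assumed threshold on estimated yaw for a "frontal" face
PITCH_LIMIT_DEG = 15.0   # assumed threshold on estimated pitch

def is_frontal(yaw_deg, pitch_deg):
    return abs(yaw_deg) <= YAW_LIMIT_DEG and abs(pitch_deg) <= PITCH_LIMIT_DEG

def concentration_degree(pose_sequence):
    """pose_sequence: list of (yaw_deg, pitch_deg) for one path,
    one entry per frame in which the user's face was detected."""
    if not pose_sequence:
        return 0.0
    frontal = sum(1 for yaw, pitch in pose_sequence if is_frontal(yaw, pitch))
    return frontal / len(pose_sequence)   # attention time over total face duration
```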
In a preferred embodiment, after the concentration identification is completed, the video frame data is further processed to detect body behaviors of the attending users in the sequence of video frames; each observed body behavior is associated with one of a plurality of emotion type labels, wherein each type label corresponds to a respective emotional feedback; and features extracted from the video frame data are used to train a classifier, with which the emotional feedback of attending users in a sequence of video frames is detected.
Wherein each emotional feedback is a predicted facial expression expressing an emotional state of the lecture-attending user, and the method further comprises capturing second video frame data of the lecture-attending user; applying features extracted from the second video frame data to the classifier to determine an emotional state of the lecture attending user.
Wherein the detection of face regions of attending users in the plurality of input video frames further comprises:
applying a Viola-Jones face detector algorithm to the input video frames to determine a face region; applying a deformable-part-based model to determine an ROI region corresponding to the facial landmarks of the attending user within the face region; extracting features in the ROI region; associating the features with emotion types; and training the classifier using the association results. A feature histogram is generated from the extracted features; coordinate transformation is performed on the ROI region across a plurality of video frames; the extracted features are concatenated to generate feature descriptors; and the classifier is trained using the final feature descriptors and the feature histograms.
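A hedged sketch of this training pipeline is shown below; the HOG features, the histogram binning and the SVM are illustrative choices standing in for the unspecified feature extractor and classifier.

```python
# Sketch of ROI feature extraction and emotion classifier training (illustrative choices).
import numpy as np
import cv2
from sklearn.svm import SVC

hog = cv2.HOGDescriptor((32, 32), (16, 16), (8, 8), (8, 8), 9)  # assumed HOG parameters

def roi_descriptor(gray_frame, rois):
    """rois: list of (x, y, w, h) regions around facial landmarks,
    already transformed into a common coordinate frame across video frames."""
    parts = []
    for (x, y, w, h) in rois:
        patch = cv2.resize(gray_frame[y:y + h, x:x + w], (32, 32))
        parts.append(hog.compute(patch).ravel())
    descriptor = np.concatenate(parts)                            # concatenated feature descriptor
    hist, _ = np.histogram(descriptor, bins=32, range=(0.0, 1.0))  # feature histogram
    return np.concatenate([descriptor, hist.astype(np.float32)])

def train_emotion_classifier(samples, labels):
    """samples: list of descriptors; labels: one emotion type label per sample."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(np.vstack(samples), np.array(labels))
    return clf
```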
To implement a gaze detection process that determines the point of regard, in a further embodiment an attending user is photographed by a capturing device having a zoom function, and the captured image and the zoom value are output; the image of the user's iris is distinguished from the background of the image; the center of the user's eyeball is then specified based on the iris image, and the intersection of a perpendicular line from the eyeball center with the surface in front of the user is specified as a reference point; a zoom value indicating a predetermined size of the iris image is set, and the distance to the iris is specified based on that zoom value; the offset of the iris is determined from the offset of the iris image, and the gaze offset is specified based on the offset of the iris and the distance to the iris; the gaze point is then calculated from the reference point and the gaze offset.
The distance measuring step and the reference point determining step are performed again each time a change in the position of the attending user is detected. The distance measuring step further comprises: obtaining the size of the iris image, a zoom value, and the distance from the iris to the displayed content as reference values, and storing them in advance; controlling the zoom function so that the iris image size equals the reference iris image size; and determining the distance from the iris to the displayed content based on the zoom value used. The gaze offset is specified by using a pre-stored distance from the center of the eyeball to the iris.
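The geometry of this gaze-point calculation can be sketched as follows, assuming the pre-stored reference values described above and a simple proportional (similar-triangles) model that the description does not spell out; all parameter names are illustrative.

```python
# Sketch of the gaze-point geometry; the proportional models are assumptions.
def iris_distance(zoom_value, ref_zoom, ref_distance):
    """Distance from the iris to the displayed content, inferred from the zoom value
    needed to make the iris image match its reference size (proportional model)."""
    return ref_distance * (ref_zoom / zoom_value)

def gaze_point(reference_point, iris_offset_px, px_per_mm,
               eyeball_to_iris_mm, iris_to_display_mm):
    """reference_point: (x, y) on the display directly in front of the eyeball center.
    iris_offset_px: shift of the iris image from its centered position, in pixels.
    The gaze offset is scaled by similar triangles: a small iris displacement around
    the eyeball center maps to a larger displacement on the display plane."""
    offset_mm_x = iris_offset_px[0] / px_per_mm
    offset_mm_y = iris_offset_px[1] / px_per_mm
    scale = iris_to_display_mm / eyeball_to_iris_mm
    return (reference_point[0] + offset_mm_x * scale,
            reference_point[1] + offset_mm_y * scale)
```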
In conclusion, the invention provides a concentration degree identification method based on an artificial intelligence education platform that is well suited to low-resolution imaging scenarios, combines visual recognition with emotion recognition, and helps the artificial intelligence education platform obtain the concentration distribution of attending users in real time.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented in a general purpose computing system, centralized on a single computing system, or distributed across a network of computing systems, and optionally implemented in program code that is executable by a computing system, such that the modules or steps may be stored in a storage system and executed by a computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It is to be understood that the above-described embodiments of the present invention are merely illustrative of or explaining the principles of the invention and are not to be construed as limiting the invention. Therefore, any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the present invention should be included in the protection scope of the present invention. Further, it is intended that the appended claims cover all such variations and modifications as fall within the scope and boundaries of the appended claims or the equivalents of such scope and boundaries.

Claims (5)

1. A concentration degree identification method based on an artificial intelligence education platform is characterized by comprising the following steps:
capturing, by an image capture device, a plurality of input video frames of a plurality of attending users in an area where display content is located;
segmenting the plurality of input video frames to obtain regions with skin-color pixel values, and detecting the face regions of the attending users in the plurality of input video frames with a machine-learning-based face detection method;
establishing a face appearance model of each attending user by computing a pixel-average image over the cropped face image windows;
individually tracking each detected face and maintaining the identity assigned to the attending user by generating a path for the attending user across the plurality of input video frames, wherein when the face of an attending user is detected, a path is generated for that user and the detected face is assigned to the generated path;
estimating the direction of each detected face to calculate concentration; detecting the number of faces in a frontal posture so as to identify attending users who have gazed in the direction of the displayed content for a predefined length of time;
calculating the concentration degree of each attending user from the time that user spends watching the displayed content;
processing the video frame data to detect body behaviors of the attending users in the sequence of video frames; associating each observed body behavior with one of a plurality of emotion type labels, wherein each type label corresponds to a respective emotional feedback; and using features extracted from the video frame data to train a classifier, which is then used to detect the emotional feedback of attending users in the sequence of video frames.
2. The method of claim 1, further comprising:
potential attending users of the displayed content are determined by tracking a plurality of behaviors of a plurality of users around the displayed content.
3. The method of claim 1, wherein each emotional feedback is a predicted facial expression expressing an emotional state of the attending user, and further comprising capturing second video frame data of the attending user; applying features extracted from the second video frame data to the classifier to determine an emotional state of the attending user.
4. The method of claim 1, wherein detecting the face regions of the attending users in the plurality of input video frames further comprises:
applying a Viola-Jones face detector algorithm to the input video frames to determine a face region; applying a deformable-part-based model to determine an ROI region corresponding to the facial landmarks of the attending user within the face region; extracting features in the ROI region; associating the features with emotion types; and training a classifier using the association results.
5. The method of claim 4, further comprising:
generating a feature histogram from the extracted features; performing coordinate transformation on the ROI area in a plurality of video frames;
concatenating the extracted features to generate feature descriptors;
training the classifier using the final feature descriptors and the feature histograms.
CN202210061937.4A 2022-01-19 2022-01-19 Concentration degree identification method based on artificial intelligence education platform Pending CN115457617A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210061937.4A CN115457617A (en) 2022-01-19 2022-01-19 Concentration degree identification method based on artificial intelligence education platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210061937.4A CN115457617A (en) 2022-01-19 2022-01-19 Concentration degree identification method based on artificial intelligence education platform

Publications (1)

Publication Number Publication Date
CN115457617A true CN115457617A (en) 2022-12-09

Family

ID=84296386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210061937.4A Pending CN115457617A (en) 2022-01-19 2022-01-19 Concentration degree identification method based on artificial intelligence education platform

Country Status (1)

Country Link
CN (1) CN115457617A (en)

Similar Documents

Publication Publication Date Title
CN112184705B (en) Human body acupuncture point identification, positioning and application system based on computer vision technology
US8401248B1 (en) Method and system for measuring emotional and attentional response to dynamic digital media content
US8462996B2 (en) Method and system for measuring human response to visual stimulus based on changes in facial expression
US20210248356A1 (en) Method and apparatus for face recognition
US9443144B2 (en) Methods and systems for measuring group behavior
WO2020020022A1 (en) Method for visual recognition and system thereof
CN110073363B (en) Tracking the head of an object
CN104112209A (en) Audience statistical method of display terminal, and audience statistical system of display terminal
CN109409199B (en) Micro-expression training method and device, storage medium and electronic equipment
JP2013058060A (en) Person attribute estimation device, person attribute estimation method and program
CN107862240A (en) A kind of face tracking methods of multi-cam collaboration
CN112153269B (en) Picture display method, device and medium applied to electronic equipment and electronic equipment
KR101817773B1 (en) An Advertisement Providing System By Image Processing of Depth Information
CN109194952B (en) Head-mounted eye movement tracking device and eye movement tracking method thereof
CN113920563A (en) Online examination cheating identification method and device, computer equipment and storage medium
CN113139491A (en) Video conference control method, system, mobile terminal and storage medium
CN111222374A (en) Lie detection data processing method and device, computer equipment and storage medium
CN111274854A (en) Human body action recognition method and vision enhancement processing system
CN114419711B (en) Concentration degree identification method based on AI (artificial intelligence) education system
CN115457617A (en) Concentration degree identification method based on artificial intelligence education platform
CN113723306B (en) Push-up detection method, push-up detection device and computer readable medium
CN111507124A (en) Non-contact video lie detection method and system based on deep learning
CN112613436B (en) Examination cheating detection method and device
CN115116136A (en) Abnormal behavior detection method, device and medium
Zhang et al. An approach of region of interest detection based on visual attention and gaze tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination