CN116030519A - Learning attention detection and assessment method for live broadcast teaching platform - Google Patents

Learning attention detection and assessment method for live broadcast teaching platform

Info

Publication number
CN116030519A
CN116030519A (application CN202211743625.8A)
Authority
CN
China
Prior art keywords
sight
line
live
learner
teaching platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211743625.8A
Other languages
Chinese (zh)
Inventor
刘雄华
黄凯伦
何顶新
邓伟明
李曼娜
吴悦
刘婷婷
刘海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Technology and Business University
Original Assignee
Wuhan Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Technology and Business University filed Critical Wuhan Technology and Business University
Priority to CN202211743625.8A priority Critical patent/CN116030519A/en
Publication of CN116030519A publication Critical patent/CN116030519A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

To address the problem that learners cannot be monitored in an online learning environment, the invention discloses a learning attention detection and assessment method for a live broadcast teaching platform. RGB multi-frame images and TOF multi-frame images acquired in the live broadcast teaching platform are used together in the learner gaze estimation task, the collected WTBU-Gaze dataset is used to match qualified faces, and a novel TransHGE deep neural network model estimates the learner's gaze direction, which greatly improves the accuracy of gaze estimation when learners differ individually, for example by wearing glasses or having different eyeball sizes.

Description

Learning attention detection and assessment method for live broadcast teaching platform
Technical Field
The invention relates to the fields of computer vision and online education, and in particular to a learning attention detection and assessment method for a live broadcast teaching platform.
Background
With the gradual adoption of live lesson teaching, learners have moved from traditional classroom study to study on live broadcast teaching platforms. Learning on a live broadcast teaching platform avoids the infection risk of gathering in a classroom and keeps learners from falling behind when they cannot attend in person. However, the traditional offline classroom relies on the teacher monitoring the learners' condition in real time, which is not possible on a live broadcast teaching platform. Existing live broadcast teaching platforms record the whole session on video so that learners can be reviewed after class, but this approach requires a great deal of time afterwards to judge each learner's attention, and it is difficult to achieve satisfactory real-time performance and detection accuracy.
Gaze is an important expression of a learner's attention. By estimating the learner's gaze direction while studying on the live broadcast teaching platform and analyzing whether the gaze direction and the gaze point fall within the platform screen, these key elements can be combined to detect and evaluate attention during live lesson teaching. However, estimating the learner's gaze direction on a live broadcast teaching platform still faces serious challenges, for example: (1) the learner's eyes may be partially occluded by glasses; (2) eyeball size varies across individual learners; (3) the distance between the learner and the screen of the learning device is difficult to determine. To address these challenges, the invention estimates the learner's 3D gaze direction with a TransHGE deep neural network, collects a large dataset of learner faces on a live broadcast teaching platform, and introduces depth information such as the distance between the learner and the learning device screen, largely resolving the difficulties in the real-time performance and accuracy of gaze estimation. At present there is no method that uses 3D gaze estimation to detect a learner's attention while studying on a live broadcast teaching platform, so the learning attention detection and assessment method for a live broadcast teaching platform proposed here is worthy of in-depth study.
Disclosure of Invention
In view of one or more of the above defects of the prior art or demands for improvement, the invention provides a learning attention detection and assessment method for a live broadcast teaching platform, which comprises the following steps:
Step S1: acquire, in real time, video resources of the learner from the RGB camera and the TOF depth imaging camera of the live broadcast teaching platform, and divide the video resources into multi-frame RGB images and multi-frame TOF images in temporal order;
Step S2: extract face feature maps from the multi-frame RGB images and the multi-frame TOF images with a convolutional neural network (CNN), where each feature map contains information of a local face region;
Step S3: input the preprocessed face feature maps into a visual self-attention encoder to obtain the learner's gaze directions at different moments on the live broadcast teaching platform;
Step S4: from the obtained gaze directions of the learner at different moments, apply a gaze point resolving method based on the TOF depth imaging camera so that gaze point coordinates can be extracted in real time;
Step S5: evaluate the learner's attention during online learning by comparing the number of times the extracted gaze point coordinates fall outside the screen coordinate range of the live broadcast teaching platform with the set detection number f.
Preferably, in the step S1, video resources of a learner under an RGB camera and a TOF depth imaging camera in a live-broadcast class teaching platform are respectively obtained in real time, and the video resources are divided into a multi-frame RGB image and a multi-frame TOF image according to time sequence, which specifically comprises the following steps:
step S1.1: setting the resolution and shooting angle of the RGB camera and the TOF depth imaging camera, and ensuring that the video resource received in real time contains a complete face area of a learner;
step S1.2: video resource of set RGB camera and TOF depth imaging camera
Figure SMS_1
Wherein L, H, W, C represents video length, height and width and channel number, respectively;
step S1.3: dividing the video resource V obtained in real time in the step S1.2 into a plurality of frames of face image sequences according to time sequence
Figure SMS_2
Wherein t is 0 The number of frames for video, H, W, C corresponding to video resources, represent the image height and width and the number of channels, respectively. .
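For illustration, a minimal Python sketch of step S1.3, assuming OpenCV (cv2) can read both the RGB stream and the TOF stream as ordinary capture sources; the function name and the capture sources are illustrative, not taken from the patent:

```python
import cv2  # OpenCV; assumed to expose the RGB and TOF streams as capture sources


def split_video_into_frames(source, max_frames=None):
    """Read a video resource V in temporal order and return the frame list
    I_1 ... I_t0, each frame an H x W x C array."""
    capture = cv2.VideoCapture(source)
    frames = []
    while capture.isOpened():
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
        if max_frames is not None and len(frames) >= max_frames:
            break
    capture.release()
    return frames


# Illustrative usage: camera index 0 for the RGB stream, a recorded file for the TOF stream.
rgb_frames = split_video_into_frames(0, max_frames=300)
tof_frames = split_video_into_frames("tof_stream.avi", max_frames=300)
```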
Preferably, in the step S2, the convolutional neural network CNN is used to extract the face feature map for the multi-frame RGB image and the multi-frame TOF image, which specifically includes the steps of:
step S2.1: face standardized using WTBU-Gaze datasetA recognition model for performing face attribute calculation for each face satisfying conditions in the received image, and storing the data as =f (I i ) Wherein A is a slave face image I i Eye parameter attribute and head posture attribute A= { a extracted from the above eye ,a head pose };
Step S2.2: for each qualified face f in the received image d Adopting a face feature point detection algorithm based on a CNN model, and locating the qualified face f in the step S2.1 through CNN with two layers of convolution kernels d Is characterized by (a) feature pattern
Figure SMS_3
And a head posture rotation matrix R, wherein h, w and c respectively represent the length, the width and the channel number of the characteristic spectrum.
Step S2.3: the face selected in the WTBU-Gaze dataset is called f c We calculate f d And f c The difference between the eye parameters and the head posture of the face image is f through a scoring function c Calculating a matching score for each face image of: s (f) c ,f d )=∑ m∈{eye,heed pose} σ m |a m,d -a m,c I, wherein the parameter sigma m Is determined empirically by comparing the matching results.
Preferably, the WTBU-Gaze dataset used in step S2.1 is acquired as follows:
Step S2.1.1: recruit 2N volunteers (N men and N women) and collect their face information while they study on the live broadcast teaching platform, yielding standardized single faces with gaze label information, distance label information and bounding box information;
Step S2.1.2: use the standardized single faces to pre-train the proposed TransHGE deep neural network model, thereby obtaining a WTBU-Gaze dataset matched to the gaze estimation task.
Preferably, in the step S3, the preprocessed face feature pattern is input to a visual self-attention encoder to obtain the directions of sight of the learner at different moments in the live-broadcast class teaching platform, and the specific steps are as follows:
step S3.1: and (2) obtaining the face feature map f in the step S2.2 map Remodelling 2D patches
Figure SMS_4
Wherein l=h×w;
step S3.2: the 2D patch at step S3.1
Figure SMS_5
On the basis of this, an additional marking matrix is added>
Figure SMS_6
And a position embedding +.>
Figure SMS_7
The final feature matrix is obtained as follows: f (f) p =[f to ;f map ]+f po
Step S3.3: the feature matrix f obtained in the step S3.2 p The visual self-attention encoder is input to acquire the sight direction of a learner in the live-broadcast course teaching platform as follows:
g f =MLP(Trans(f)[0,:])。
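A hedged PyTorch sketch of steps S3.1 to S3.3 is shown below. The patent's TransHGE model is not published, so this only mirrors the structure described above; the channel size c, patch grid h × w, number of heads and the MLP head are illustrative assumptions:

```python
import torch
import torch.nn as nn


class GazeHead(nn.Module):
    """Sketch of steps S3.1-S3.3: reshape the CNN feature map into patches, prepend a
    gaze token f_to, add a position embedding f_po, run a 6-layer transformer encoder
    and regress g_f = (alpha, beta) from the first output token."""

    def __init__(self, h=7, w=7, c=256, depth=6, heads=8):
        super().__init__()
        self.L = h * w
        self.gaze_token = nn.Parameter(torch.zeros(1, 1, c))          # f_to
        self.pos_embed = nn.Parameter(torch.zeros(1, self.L + 1, c))  # f_po
        layer = nn.TransformerEncoderLayer(d_model=c, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # 6-layer encoder
        self.mlp = nn.Sequential(nn.Linear(c, c // 2), nn.ReLU(), nn.Linear(c // 2, 2))

    def forward(self, f_map):                         # f_map: (B, c, h, w) from the CNN
        B, c, h, w = f_map.shape
        assert h * w == self.L, "patch grid must match the constructor arguments"
        patches = f_map.flatten(2).transpose(1, 2)    # (B, L, c) with L = h * w
        f_p = torch.cat([self.gaze_token.expand(B, -1, -1), patches], dim=1) + self.pos_embed
        f_trans = self.encoder(f_p)                   # (B, L + 1, c)
        return self.mlp(f_trans[:, 0, :])             # g_f = (alpha, beta)
```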
Preferably, in step S3.3, the learner's gaze direction on the live broadcast teaching platform is obtained as follows:
Step S3.3.1: process the feature matrix f_p = [f_to; f_map] + f_po with a 6-layer standard visual self-attention encoder and output a new feature matrix in R^(P²×D), where P² is the length of the feature vector and D is the dimension of each feature vector;
Step S3.3.2: select the first feature vector as the gaze token and use a 2-layer multi-layer perceptron to regress the gaze g_f = (α, β) from the gaze token, where α is the pitch angle of the line of sight and β is the yaw angle;
Step S3.3.3: convert the regressed gaze g_f into a 3D gaze vector g_F in the normalized space via the spherical-to-Cartesian conversion determined by the pitch angle α and the yaw angle β;
Step S3.3.4: the loss of the 3D gaze estimation task is computed in two parts, the head pose loss L_head and the gaze estimation loss L_gaze, and the total loss is minimized as Loss = min{δ_1 L_head + δ_2 L_gaze}, where the parameters δ_1, δ_2 ∈ [0, 1] adjust the two loss terms.
Preferably, the training step of the standard single-layer visual self-attention encoder model in step S3.3.1 is as follows:
step S3.3.1.1: the single-layer visual self-attention encoder is a self-attention module, and consists of three parts, namely a multi-head self-attention mechanism MSA, a multi-layer perception mechanism MLP and a layer normalization mechanism LN, and the characteristic matrix f obtained in the step S3.2 is obtained p The output of the self-attention module is calculated as the input mapped to the query q, the key k and the value v:
Figure SMS_12
wherein d is 0 For each feature dimension.
Step S3.3.1.2: the multi-headed self-attention mechanism MSA in step S3.3.1.1 can be represented by the formula wherein f p As an input to the current module,
Figure SMS_13
output and f of MSA for multi-head self-attention mechanism p Sum of (1) is
Figure SMS_14
Step S3.3.1.3: the multi-layer perceptron MLP of step S3.3.1.1 may be represented by the formula wherein
Figure SMS_15
For the output of the current self-attention encoder +.>
Figure SMS_16
Preferably, in the step S4, according to the obtained line-of-sight direction results of the learner at different moments, a line-of-sight gaze point resolving method based on a TOF depth imaging camera is provided, and line-of-sight gaze point coordinates can be obtained in real time, which specifically includes the steps of:
step S4.1: the relative position relationship between the learner head coordinate system H and the gazing screen coordinate system G can be calibrated by using the existing TOF depth imaging camera to measure the distance measurement technology by shining light on the target object and measuring the transmission time of the light between the lens and the object
Figure SMS_17
Step S4.2: the three-dimensional line-of-sight vector g to be acquired by said step S4.1 F Unitizing to obtain a unit sight line vector as follows:
Figure SMS_18
step S4.3: defining the midpoint of the connecting line of the eyes and the corners of eyes as the starting point of the sight, and representing p in the head coordinate system H of the learner 0 Based on the calibrated relative positional relationship in the step S4.1
Figure SMS_19
The unit sight line vector +.>
Figure SMS_20
And a line of sight departure point p 0 (x 0 ,y 0 ,z 0 ) Uniformly converting into a screen coordinate system;
step S4.4: under the screen coordinate system, knowing the sight line direction vector and the sight line departure point, solving a linear equation of the sight line, and further calculating the sight line point coordinate P (x, y) by the space geometrical relationship of the intersection of the plane and the line in the three-dimensional space.
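A NumPy sketch of steps S4.1 to S4.4, assuming the screen plane is z = 0 in the screen coordinate system G and that a calibrated head-to-screen rotation R_hg and translation t_hg are available from the TOF calibration; these names and conventions are illustrative:

```python
import numpy as np


def gaze_point_on_screen(R_hg, t_hg, p0_head, g_head):
    """Transform the gaze origin p0 and the unit gaze vector from the head coordinate
    system H into the screen coordinate system G, then intersect the gaze ray with
    the screen plane z = 0 to obtain the gaze point P(x, y)."""
    p0 = R_hg @ np.asarray(p0_head) + np.asarray(t_hg)   # gaze origin in G
    d = R_hg @ np.asarray(g_head)                        # gaze direction in G
    d = d / np.linalg.norm(d)
    if abs(d[2]) < 1e-6:
        return None          # gaze (almost) parallel to the screen plane
    s = -p0[2] / d[2]        # solve p0_z + s * d_z = 0
    if s < 0:
        return None          # gaze points away from the screen
    x, y, _ = p0 + s * d
    return float(x), float(y)
```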
Preferably, in the step S5, the attention condition of the learner during online learning is evaluated by comparing the number of times that the extracted gaze point coordinate is out of the screen coordinate range of the live-broadcast teaching platform with the set detection number f, which specifically includes the steps of:
step S5.1: recording the number of times n of the sight-line fixation point coordinates P (x, y) outside the screen coordinate range in real time and dynamically, and circularly carrying out the steps S1 to S4 every time the fixation point is recorded;
step S5.2: if the number of times n of out-of-screen fixation points in the step S5.1 is greater than the set detection number of times f, the system judges that the attention of the learner is out of focus, and the system carries out popup reminding, updates the number of times of fixation points after popup reminding, and resumes detection; if the number of out-of-screen fixation points n in step S5.1 is always smaller than the set detection number f, the system operates normally until the course is finished.
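A plain-Python sketch of the evaluation loop in step S5, with the pop-up reminder replaced by an illustrative placeholder; the stream of gaze points and the screen dimensions are assumed inputs:

```python
def monitor_attention(gaze_point_stream, screen_width, screen_height, f_threshold):
    """Count gaze points that fall outside the screen rectangle and trigger a pop-up
    reminder once the count exceeds the set detection number f. `gaze_point_stream`
    yields P(x, y) or None per detection cycle; the reminder hook is a placeholder."""
    n_off_screen = 0
    for point in gaze_point_stream:
        off_screen = point is None or not (0.0 <= point[0] <= screen_width
                                           and 0.0 <= point[1] <= screen_height)
        if off_screen:
            n_off_screen += 1
        if n_off_screen > f_threshold:
            print("Pop-up reminder: learner attention appears unfocused.")  # placeholder hook
            n_off_screen = 0   # reset the count after the reminder and resume detection
    return n_off_screen
```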
In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects:
(1) The learning attention detection and assessment method for the live broadcast teaching platform uses the RGB multi-frame images and the TOF multi-frame images obtained from the live broadcast teaching platform together in the learner gaze estimation task, matches qualified faces with the collected WTBU-Gaze dataset, and estimates the learner's gaze direction with a novel TransHGE deep neural network model, greatly improving the accuracy of gaze estimation when learners differ individually, for example by wearing glasses or having different eyeball sizes.
(2) Compared with the large errors that arise when estimating the gaze direction from 2D images alone, the invention provides a gaze point resolving method based on a TOF depth imaging camera: the distance between the learner and the learning device screen is acquired in real time, and the gaze point coordinates are extracted in real time by converting the gaze estimation result.
Drawings
FIG. 1 is a flow chart of a learning attention detection and assessment method for a live-broadcast teaching platform of the present invention;
FIG. 2 is a schematic diagram of learner face data acquisition under a live lesson teaching platform;
FIG. 3 is a network framework diagram of a learner view estimation model of the present invention;
FIG. 4 is a schematic diagram of a learner's view estimation result conversion according to the present invention;
FIG. 5 is a schematic view of a distance scenario between a learner and a learning device screen according to the present invention;
fig. 6 is a schematic diagram of the spatial geometry of the learner gaze point map of the present invention.
Like reference numerals denote like technical features throughout the drawings, in particular:
1. learner, 2, learning equipment screen, 3, RGB camera, 4, TOF camera.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Examples:
as shown in fig. 1, the embodiment of the invention is a learning attention detection and assessment method for a live broadcast teaching platform, which comprises the following steps:
Step S1: acquire, in real time, video resources of the learner from the RGB camera and the TOF depth imaging camera of the live broadcast teaching platform, and divide the video resources into multi-frame images in temporal order;
Step S2: extract face feature maps from the RGB multi-frame images and the TOF multi-frame images of the live broadcast teaching platform with a CNN, where each feature map contains information of a local face region;
Step S3: input the preprocessed face feature maps into a visual self-attention encoder to obtain the learner's gaze directions at different moments on the live broadcast teaching platform;
Step S4: from the obtained gaze directions of the learner at different moments, apply a gaze point resolving method based on the TOF depth imaging camera so that gaze point coordinates can be extracted in real time;
Step S5: evaluate the learner's attention during online learning by comparing the number of times the extracted gaze point coordinates fall outside the screen coordinate range of the live broadcast teaching platform with the set detection number f.
As shown in fig. 2, a learner is studying online on a live broadcast teaching platform. In this scenario, the RGB camera and the TOF depth imaging camera of the platform acquire the learner's face video resources in real time, and the video resources are divided into multi-frame images in temporal order; the RGB multi-frame images and TOF multi-frame images collected here provide an important data source for step S3.
As shown in fig. 3, in this embodiment, CNN is used to extract a face feature map for the RGB multi-frame image and the TOF multi-frame image in the live-broadcast class teaching platform, which specifically includes the following steps:
step S2.1: face attribute calculation is performed on each face meeting the conditions in the received image by using a face recognition model standardized in a WTBU-Gaze dataset, and the data is stored as A=F (I i ) Wherein A is a slave face image I i Eye parameter attribute and head posture attribute A= { a extracted from the above eye ,a head pose }。
Step S2.2: for each qualified face f in the received image d Adopting a face feature point detection algorithm based on a CNN model, and locating the qualified face f in the step S2.1 through CNN with two layers of convolution kernels d Is characterized by (a) feature pattern
Figure SMS_21
And a head posture rotation matrix R, wherein h, w and c respectively represent the length, the width and the characteristic spectrumNumber of channels.
Step S2.3: the face selected in the WTBU-Gaze dataset is called f c We calculate f d And f c The difference between the eye parameters and the head posture of the face image is f through a scoring function c Calculating a matching score for each face image of: s (f) c ,f d )=∑ m∈{eye,head pose} σ m |a m,d -a m,c I, wherein the parameter sigma m Is determined empirically by comparing the matching results.
According to the above scheme, the WTBU-Gaze dataset used in step S2.1 is acquired as follows:
Step S2.1.1: 50 volunteers were recruited, 25 men and 25 women, and their face information was collected while they studied on the live broadcast teaching platform, yielding standardized single faces with depth information such as gaze labels, distance labels and bounding boxes.
Step S2.1.2: the standardized single faces are used to pre-train the proposed TransHGE deep neural network model (the TransHGE deep neural network consists of step S1 and step S2), thereby obtaining a WTBU-Gaze dataset matched to the gaze estimation task.
According to the above scheme, in the step S3, the preprocessed face feature pattern is input into the visual self-attention encoder to obtain the line of sight directions of the learner at different moments in the live-broadcast course teaching platform, and the specific steps are as follows:
step S3.1: and (2) obtaining the face feature map f in the step S2.2 map Remodelling 2D patches
Figure SMS_22
Where l=h×w.
Step S3.2: the 2D patch at step S3.1
Figure SMS_23
On the basis of this, an additional marking matrix is added>
Figure SMS_24
And a position embedding +.>
Figure SMS_25
The final feature matrix is obtained as follows: />
Figure SMS_26
Step S3.3: the feature matrix f obtained in the step S3.2 p The visual self-attention encoder is input to acquire the sight direction of a learner in the live-broadcast course teaching platform as follows:
g f =MLP(Trans(f)[0,:]) (13)
In this embodiment, as shown in fig. 4, the learner's gaze direction on the live broadcast teaching platform is obtained as follows:
Step S3.3.1: process the feature matrix f_p = [f_to; f_map] + f_po with a 6-layer standard visual self-attention encoder and output a new feature matrix in R^(P²×D), where P² is the length of the feature vector and D is the dimension of each feature vector.
Step S3.3.2: select the first feature vector (the one corresponding to the token matrix f_to) as the gaze token and use a 2-layer multi-layer perceptron to regress the gaze g_f = (α, β) from the gaze token, where α is the pitch angle of the line of sight and β is the yaw angle.
Step S3.3.3: convert the regressed gaze g_f into a 3D gaze vector g_F in the normalized space via the spherical-to-Cartesian conversion determined by the pitch angle α and the yaw angle β.
Step S3.3.4: for the loss problem of the 3D vision estimation task, the loss L from the head posture is respectively carried out head And line of sight estimation loss L gaze The two parts perform the calculation. Head pose loss L head The method comprises the following steps:
Figure SMS_32
wherein the parameter gamma 1 ,γ 2 ∈[0,1]For adjusting losses; z p ,z b ,z l Respectively representing the probability of the predicted existence of the face image, the boundary box and the position of the landmark;
Figure SMS_33
respectively correspond to z p ,z b ,z l Ground truth of (2). Line of sight estimation loss L gaze The method comprises the following steps:
Figure SMS_34
where s.epsilon. { F, T, S }, parameter λ 1 ,λ 2 ∈[0,1]For adjusting losses, p s Is a trainable weight. The 3D line-of-sight estimation task loss is minimized by the two-part loss:
Loss=min{δ 1 L head2 L gaze }, (17)
wherein the parameter delta 1 ,δ 2 ∈[0,1]For adjusting the losses.
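Since the exact forms of L_head and L_gaze are only given as images, the sketch below shows just the weighted combination of equation (17), together with a common angular-error loss as an illustrative stand-in for L_gaze; the δ values and the stand-in loss are assumptions, not the patent's definitions:

```python
import torch
import torch.nn.functional as F


def angular_gaze_loss(pred_vec, true_vec):
    """Mean angular error between predicted and ground-truth 3D gaze vectors.
    Used only as an illustrative stand-in for the patent's L_gaze."""
    cos = F.cosine_similarity(pred_vec, true_vec, dim=-1).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.acos(cos).mean()


def total_gaze_loss(l_head, l_gaze, delta1=0.6, delta2=0.4):
    """Weighted combination Loss = delta_1 * L_head + delta_2 * L_gaze of equation (17);
    the delta values in [0, 1] are illustrative assumptions."""
    return delta1 * l_head + delta2 * l_gaze


# Illustrative usage with a placeholder head-pose loss:
pred = torch.randn(8, 3)
true = torch.randn(8, 3)
l_head = torch.tensor(0.12)   # stand-in for the image-only L_head formula
loss = total_gaze_loss(l_head, angular_gaze_loss(pred, true))
```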
According to the above scheme, the training steps of the standard single-layer visual self-attention encoder model in the step S3.3.1 are as follows:
step S3.3.1.1: the single-layer visual self-attention encoder is a self-attention module and mainly comprises three parts of a multi-head self-attention mechanism MSA, a multi-layer perception mechanism MLP and a layer normalization mechanism LN, and the characteristic matrix f obtained in the step S3.2 is obtained p Mapping as input to query q, key k and value v, self-attention moduleThe output of (2) is calculated as:
Figure SMS_35
wherein d is 0 For each feature dimension.
Step S3.3.1.2: the multi-headed self-attention mechanism MSA in step S3.3.1.1 can be represented by the formula wherein f p As an input to the current module,
Figure SMS_36
output and f of MSA for multi-head self-attention mechanism p A kind of electronic device.
Figure SMS_37
Step S3.3.1.3: the multi-layer perceptron MLP of step S3.3.1.1 may be represented by the formula wherein
Figure SMS_38
Is the output of the current self-attention encoder.
Figure SMS_39
As shown in fig. 5 and 6, in the embodiment, step S4 provides a method for resolving a gaze point based on a TOF depth imaging camera according to the gaze direction results of a learner at different moments, which can extract gaze point coordinates in real time, and specifically includes the steps of:
step S4.1: the relative position relationship between the learner head coordinate system H and the gazing screen coordinate system G can be calibrated by using the existing TOF depth imaging camera to measure the distance measurement technology by shining light on the target object and measuring the transmission time of the light between the lens and the object
Figure SMS_40
Step S4.2: the three-dimensional line-of-sight vector g to be acquired by said step S4.1 F Unitizing to obtain a unit sight line vector as follows:
Figure SMS_41
step S4.3: defining the midpoint of the connecting line of the eyes and the corners of eyes as the starting point of the sight, and representing p in the head coordinate system H of the learner 0 Based on the calibrated relative positional relationship in the step S4.1
Figure SMS_42
The unit sight line vector +.>
Figure SMS_43
And a line of sight departure point p 0 (x 0 ,y 0 ,z 0 ) Uniformly converting into a screen coordinate system;
step S4.4: under the screen coordinate system, knowing the sight line direction vector and the sight line departure point, solving a linear equation of the sight line, and further calculating the sight line point coordinate P (x, y) by the space geometrical relationship of the intersection of the plane and the line in the three-dimensional space.
According to the above scheme, in step S5, the learner's attention during online learning is evaluated by comparing the number of times the extracted gaze point coordinates fall outside the screen coordinate range of the live broadcast teaching platform (a 24-inch screen, 53.15 cm wide by 29.90 cm high) with the set detection number f, as follows:
Step S5.1: record, in real time and dynamically, the number of times n that the gaze point coordinates P(x, y) fall outside the screen coordinate range, and repeat steps S1 to S4 in a loop each time a gaze point is recorded;
Step S5.2: the learning attention detection and evaluation rules for the learner on the live broadcast teaching platform are shown in a table that appears only as an image in the original publication.
it will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A learning attention detection and assessment method of a live broadcast teaching platform comprises the following steps:
step S1, respectively acquiring video resources of a learner under an RGB camera and a TOF depth imaging camera in the live-broadcast class teaching platform in real time, and dividing the video resources into a multi-frame RGB image and a multi-frame TOF image according to time sequence;
s2, extracting face feature patterns from the multi-frame RGB image and the multi-frame TOF image by using a convolutional neural network CNN, wherein each feature pattern contains information of a face local area;
s3, inputting the preprocessed human face characteristic patterns into a visual self-attention encoder to acquire the sight directions of learners at different moments in a live-broadcast class teaching platform;
step S4, according to the obtained sight line direction results of the learner at different moments, a sight-line fixation point resolving method based on a TOF depth imaging camera is provided, and the sight-line fixation point coordinates can be extracted in real time;
and S5, comparing the number of times that the extracted gaze point coordinates are out of the screen coordinate range of the live-broadcast teaching platform with the set detection number f to evaluate the attention condition of the learner during online learning.
2. The learning attention detection and assessment method of a live teaching platform according to claim 1, wherein in the step S1, the video resources of the learner under the RGB camera and the TOF depth imaging camera in the live teaching platform are respectively obtained in real time, and the video resources are divided into a multi-frame RGB image and a multi-frame TOF image according to time sequence, and the specific steps are as follows:
step S1.1: setting the resolution and shooting angle of the RGB camera and the TOF depth imaging camera, and ensuring that the video resource received in real time contains a complete face area of a learner;
step S1.2: video resource of set RGB camera and TOF depth imaging camera
Figure FDA0004031512310000011
Wherein L, H, W, C represents video length, height and width and channel number, respectively;
step S1.3: dividing the video resource V obtained in real time in the step S1.2 into a plurality of frames of face image sequences according to time sequence
Figure FDA0004031512310000021
i∈{1,2,…,t 0 }, t is 0 The number of frames for video, H, W, C corresponding to video resources, represent the image height and width and the number of channels, respectively. .
3. The learning attention detection and assessment method of a live teaching platform according to claim 1, wherein in the step S2, the face feature map is extracted from the multi-frame RGB image and the multi-frame TOF image by using a convolutional neural network CNN, and the specific steps are as follows:
step S2.1: face attribute calculation is performed for each eligible face in the received image using a face recognition model standardized in the WTBU-size dataset, and the data is saved as =f (I i ) Wherein A is a slave face image I i Eye parameter attribute and head posture attribute A= { a extracted from the above eye ,a headpose };
Step S2.2: for each qualified face f in the received image d Adopting a face feature point detection algorithm based on a CNN model, and locating the qualified face f in the step S2.1 through CNN with two layers of convolution kernels d Is characterized by (a) feature pattern
Figure FDA0004031512310000022
And a head pose rotation matrix R, where h,w and c respectively represent the length, width and channel number of the characteristic map.
Step S2.3: the face selected in the WTBU-Gaze dataset is called f c We calculate f d And f c The difference between the eye parameters and the head posture of the face image is f through a scoring function c Calculating a matching score for each face image of: s (f) c ,f d )=∑ m∈{eye,headpose} σ m |a m,d -a m,c I, wherein the parameter sigma m Is determined empirically by comparing the matching results.
4. The learning attention detection and assessment method of a live teaching platform as claimed in claim 3, wherein the WTBU-Gaze dataset used in step S2.1 is acquired as follows:
Step S2.1.1: recruit 2N volunteers (N men and N women) and collect their face information while they study on the live broadcast teaching platform, yielding standardized single faces with gaze label information, distance label information and bounding box information;
Step S2.1.2: use the standardized single faces to pre-train the proposed TransHGE deep neural network model, thereby obtaining a WTBU-Gaze dataset matched to the gaze estimation task.
5. The learning attention detection and assessment method of a live-broadcast teaching platform as claimed in claim 1, wherein in the step S3, the preprocessed face feature pattern is input into a visual self-attention encoder to obtain the directions of the eyes of the learner at different moments in the live-broadcast teaching platform, and the specific steps are as follows:
step S3.1: and (2) obtaining the face feature map f in the step S2.2 map Remodelling 2D patches
Figure FDA0004031512310000031
Wherein l=h×w;
step S3.2: the 2D patch at step S3.1
Figure FDA0004031512310000032
On the basis of this, an additional marking matrix is added>
Figure FDA0004031512310000033
And a position embedding +.>
Figure FDA0004031512310000034
The final feature matrix is obtained as follows: f (f) p =[f to ;f map ]+f po
Step S3.3: the feature matrix f obtained in the step S3.2 p The visual self-attention encoder is input to acquire the sight direction of a learner in the live-broadcast course teaching platform as follows:
g f =MLP(Trans(f)[0,:])。
6. the learning attention detection and assessment method of a live-broadcast teaching platform as claimed in claim 5, wherein the step S3.3 is to obtain the line of sight direction of the learner in the live-broadcast teaching platform, and the specific steps are as follows:
step S3.3.1: processing feature matrix f using a 6-layer standard visual self-attention encoder p =[f to ;f map ]+f po And outputs a new feature matrix
Figure FDA0004031512310000035
Wherein P is 2 Is the length of the feature vector, D is the dimension of each feature vector;
step S3.3.2: selecting a first feature vector as a line-of-sight marker and returning a line-of-sight g from the line-of-sight marker using a 2-layer multi-layer perceptron f = (α, β), where α represents the pitch angle of the line of sight and β represents the yaw angle;
step S3.3.3: regression line of sight g to be output f Conversion to 3D line-of-sight vectors in normalized space by
Figure FDA0004031512310000036
Thereby obtaining the 3D sight vector +.>
Figure FDA0004031512310000041
Figure FDA0004031512310000042
Step S3.3.4: for the loss problem of the 3D vision estimation task, the loss L from the head posture is respectively carried out head And line of sight estimation loss L gaze Two parts calculate, through two part loss, minimize 3D sight estimation task loss: loss=min { delta } 1 L head2 L gaze }, wherein the parameter delta 12 ∈[0,1]For adjusting the losses.
7. The learning attention detection and assessment method for live-action teaching platform as claimed in claim 6, wherein the training step of the standard single-layer visual self-attention encoder model in step S3.3.1 is as follows:
step S3.3.1.1: the single-layer visual self-attention encoder is a self-attention module, and consists of three parts, namely a multi-head self-attention mechanism MSA, a multi-layer perception mechanism MLP and a layer normalization mechanism LN, and the characteristic matrix f obtained in the step S3.2 is obtained p The output of the self-attention module is calculated as the input mapped to the query q, the key k and the value v:
Figure FDA0004031512310000043
wherein d is 0 For each feature dimension.
Step S3.3.1.2: the multi-headed self-attention mechanism MSA in step S3.3.1.1 can be represented by the formula wherein f p As an input to the current module,
Figure FDA0004031512310000044
output and f of MSA for multi-head self-attention mechanism p Sum of (1) is
Figure FDA0004031512310000045
Step S3.3.1.3: the multi-layer perceptron MLP of step S3.3.1.1 may be represented by the formula wherein
Figure FDA0004031512310000046
For the output of the current self-attention encoder +.>
Figure FDA0004031512310000047
Figure FDA0004031512310000048
8. The learning attention detection and assessment method of a live broadcast teaching platform as claimed in claim 1, wherein the step S4 provides a line-of-sight gaze point resolving method based on a TOF depth imaging camera according to the line-of-sight direction results of the learner at different moments, and the line-of-sight gaze point coordinates can be obtained in real time, specifically comprising the steps of:
step S4.1: the relative position relationship between the learner head coordinate system H and the gazing screen coordinate system G can be calibrated by using the existing TOF depth imaging camera to measure the distance measurement technology by shining light on the target object and measuring the transmission time of the light between the lens and the object
Figure FDA0004031512310000051
Step S4.2: the three-dimensional line-of-sight vector g to be acquired by said step S4.1 F Unitizing to obtain a unit sight line vector as follows:
Figure FDA0004031512310000052
step S4.3: definition of doubleThe midpoint of the connecting line of the inner canthus is the line of sight departure point and is expressed as p in the head coordinate system H of the learner 0 Based on the calibrated relative positional relationship in the step S4.1
Figure FDA0004031512310000053
The unit sight line vector +.>
Figure FDA0004031512310000054
And a line of sight departure point p 0 (x 0 ,y 0 ,z 0 ) Uniformly converting into a screen coordinate system;
step S4.4: under the screen coordinate system, knowing the sight line direction vector and the sight line departure point, solving a linear equation of the sight line, and further calculating the sight line point coordinate P (x, y) by the space geometrical relationship of the intersection of the plane and the line in the three-dimensional space.
9. The learning attention detection and assessment method for a live-broadcast teaching platform according to claim 1, wherein in the step S5, the attention condition of the learner during online learning is assessed by comparing the number of times that the extracted gaze point coordinates are out of the screen coordinate range of the live-broadcast teaching platform with the set detection number f, and the specific steps are as follows:
step S5.1: recording the number of times n of the sight-line fixation point coordinates P (x, y) outside the screen coordinate range in real time and dynamically, and circularly carrying out the steps S1 to S4 every time the fixation point is recorded;
step S5.2: if the number of times n of out-of-screen fixation points in the step S5.1 is greater than the set detection number of times f, the system judges that the attention of the learner is out of focus, and the system carries out popup reminding, updates the number of times of fixation points after popup reminding, and resumes detection; if the number of out-of-screen fixation points n in step S5.1 is always smaller than the set detection number f, the system operates normally until the course is finished.
CN202211743625.8A 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform Pending CN116030519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211743625.8A CN116030519A (en) 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211743625.8A CN116030519A (en) 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform

Publications (1)

Publication Number Publication Date
CN116030519A true CN116030519A (en) 2023-04-28

Family

ID=86078888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211743625.8A Pending CN116030519A (en) 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform

Country Status (1)

Country Link
CN (1) CN116030519A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453384A (en) * 2023-06-19 2023-07-18 江西德瑞光电技术有限责任公司 Immersion type intelligent learning system based on TOF technology and control method
CN117636341A (en) * 2024-01-26 2024-03-01 中国海洋大学 Multi-frame seaweed microscopic image enhancement recognition method and model building method thereof
CN117636341B (en) * 2024-01-26 2024-04-26 中国海洋大学 Multi-frame seaweed microscopic image enhancement recognition method and model building method thereof

Similar Documents

Publication Publication Date Title
CN111709409B (en) Face living body detection method, device, equipment and medium
CN107423730B (en) Human gait behavior active detection and recognition system and method based on semantic folding
CN105913487B (en) One kind is based on the matched direction of visual lines computational methods of iris edge analysis in eye image
US9545217B2 (en) Movement correction in MRI using a camera
CN116030519A (en) Learning attention detection and assessment method for live broadcast teaching platform
CN112040834A (en) Eyeball tracking method and system
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
CN109584290A (en) A kind of three-dimensional image matching method based on convolutional neural networks
CN109782902A (en) A kind of operation indicating method and glasses
US20200193607A1 (en) Object shape regression using wasserstein distance
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN106156714A (en) The Human bodys' response method merged based on skeletal joint feature and surface character
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN112016497A (en) Single-view Taijiquan action analysis and assessment system based on artificial intelligence
CN111524183A (en) Target row and column positioning method based on perspective projection transformation
CN114333046A (en) Dance action scoring method, device, equipment and storage medium
CN109993116B (en) Pedestrian re-identification method based on mutual learning of human bones
CN104063689B (en) Face image identification method based on binocular stereoscopic vision
CN109886780B (en) Commodity target detection method and device based on eyeball tracking
CN110796699B (en) Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system
CN110738123B (en) Method and device for identifying densely displayed commodities
Ehinger et al. Local depth edge detection in humans and deep neural networks
CN110400333A (en) Coach's formula binocular stereo vision device and High Precision Stereo visual pattern acquisition methods
CN113012201B (en) Ground unmanned platform personnel tracking method based on deep learning
CN112099330B (en) Holographic human body reconstruction method based on external camera and wearable display control equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination