CN116030519A - Learning attention detection and assessment method for live broadcast teaching platform - Google Patents

Learning attention detection and assessment method for live broadcast teaching platform

Info

Publication number
CN116030519A
CN116030519A (application CN202211743625.8A)
Authority
CN
China
Prior art keywords
sight
line
live
learner
teaching platform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211743625.8A
Other languages
Chinese (zh)
Inventor
刘雄华
黄凯伦
何顶新
邓伟明
李曼娜
吴悦
刘婷婷
刘海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Technology and Business University
Original Assignee
Wuhan Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Technology and Business University filed Critical Wuhan Technology and Business University
Priority to CN202211743625.8A priority Critical patent/CN116030519A/en
Publication of CN116030519A publication Critical patent/CN116030519A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

To address the problem that learners cannot be monitored in an online learning environment, the invention discloses a learning attention detection and assessment method for a live broadcast teaching platform. RGB multi-frame images and TOF multi-frame images acquired in the live broadcast teaching platform are used together in the learner gaze estimation task, the collected WTBU-Gaze dataset is used to match qualified faces, and a novel TransHGE deep neural network model estimates the learner's gaze direction, which greatly improves the accuracy of gaze estimation when learners differ individually, for example by wearing glasses or having different eyeball sizes.

Description

Learning attention detection and assessment method for live broadcast teaching platform
Technical Field
The invention relates to the fields of computer vision and online education, and in particular to a learning attention detection and assessment method for a live broadcast teaching platform.
Background
With the gradual adoption of live lesson teaching, learners have moved from traditional classroom study to study on live broadcast teaching platforms. Learning on a live broadcast teaching platform avoids the infection risk of gathering in a classroom and keeps learners from falling behind when they cannot attend in person. However, the traditional offline classroom relies on the teacher monitoring the learners' condition in real time, which is not possible on a live broadcast teaching platform. Existing live broadcast teaching platforms record the whole session on video so that learners can be reviewed after class, but this approach requires a great deal of time afterwards to judge each learner's attention, and it is difficult to achieve satisfactory real-time performance and detection accuracy.
Gaze is an important expression of a learner's attention. By estimating the learner's gaze direction while studying on the live broadcast teaching platform and analyzing whether the gaze direction and the gaze point fall within the platform screen, these key elements can be combined to detect and evaluate attention during live lesson teaching. However, estimating the learner's gaze direction on a live broadcast teaching platform still faces serious challenges, for example: (1) the learner's eyes may be partially occluded by glasses; (2) eyeball size varies across individual learners; (3) the distance between the learner and the screen of the learning device is difficult to determine. To address these challenges, the invention estimates the learner's 3D gaze direction with a TransHGE deep neural network, collects a large dataset of learner faces on a live broadcast teaching platform, and introduces depth information such as the distance between the learner and the learning device screen, largely resolving the difficulties in the real-time performance and accuracy of gaze estimation. At present there is no method that uses 3D gaze estimation to detect a learner's attention while studying on a live broadcast teaching platform, so the learning attention detection and assessment method for a live broadcast teaching platform proposed here is worthy of in-depth study.
Disclosure of Invention
In view of one or more of the above defects of the prior art or demands for improvement, the invention provides a learning attention detection and assessment method for a live broadcast teaching platform, which comprises the following steps:
Step S1: acquire, in real time, video resources of the learner from the RGB camera and the TOF depth imaging camera of the live broadcast teaching platform, and divide the video resources into multi-frame RGB images and multi-frame TOF images in temporal order;
Step S2: extract face feature maps from the multi-frame RGB images and the multi-frame TOF images with a convolutional neural network (CNN), where each feature map contains information of a local face region;
Step S3: input the preprocessed face feature maps into a visual self-attention encoder to obtain the learner's gaze directions at different moments on the live broadcast teaching platform;
Step S4: from the obtained gaze directions of the learner at different moments, apply a gaze point resolving method based on the TOF depth imaging camera so that gaze point coordinates can be extracted in real time;
Step S5: evaluate the learner's attention during online learning by comparing the number of times the extracted gaze point coordinates fall outside the screen coordinate range of the live broadcast teaching platform with the set detection number f.
Preferably, in the step S1, video resources of a learner under an RGB camera and a TOF depth imaging camera in a live-broadcast class teaching platform are respectively obtained in real time, and the video resources are divided into a multi-frame RGB image and a multi-frame TOF image according to time sequence, which specifically comprises the following steps:
step S1.1: setting the resolution and shooting angle of the RGB camera and the TOF depth imaging camera, and ensuring that the video resource received in real time contains a complete face area of a learner;
step S1.2: video resource of set RGB camera and TOF depth imaging camera
Figure SMS_1
Wherein L, H, W, C represents video length, height and width and channel number, respectively;
step S1.3: dividing the video resource V obtained in real time in the step S1.2 into a plurality of frames of face image sequences according to time sequence
Figure SMS_2
Wherein t is 0 The number of frames for video, H, W, C corresponding to video resources, represent the image height and width and the number of channels, respectively. .
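For illustration, a minimal Python sketch of step S1.3, assuming OpenCV (cv2) can read both the RGB stream and the TOF stream as ordinary capture sources; the function name and the capture sources are illustrative, not taken from the patent:

```python
import cv2  # OpenCV; assumed to expose the RGB and TOF streams as capture sources


def split_video_into_frames(source, max_frames=None):
    """Read a video resource V in temporal order and return the frame list
    I_1 ... I_t0, each frame an H x W x C array."""
    capture = cv2.VideoCapture(source)
    frames = []
    while capture.isOpened():
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
        if max_frames is not None and len(frames) >= max_frames:
            break
    capture.release()
    return frames


# Illustrative usage: camera index 0 for the RGB stream, a recorded file for the TOF stream.
rgb_frames = split_video_into_frames(0, max_frames=300)
tof_frames = split_video_into_frames("tof_stream.avi", max_frames=300)
```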
Preferably, in the step S2, the convolutional neural network CNN is used to extract the face feature map for the multi-frame RGB image and the multi-frame TOF image, which specifically includes the steps of:
step S2.1: face standardized using WTBU-Gaze datasetA recognition model for performing face attribute calculation for each face satisfying conditions in the received image, and storing the data as =f (I i ) Wherein A is a slave face image I i Eye parameter attribute and head posture attribute A= { a extracted from the above eye ,a head pose };
Step S2.2: for each qualified face f in the received image d Adopting a face feature point detection algorithm based on a CNN model, and locating the qualified face f in the step S2.1 through CNN with two layers of convolution kernels d Is characterized by (a) feature pattern
Figure SMS_3
And a head posture rotation matrix R, wherein h, w and c respectively represent the length, the width and the channel number of the characteristic spectrum.
Step S2.3: the face selected in the WTBU-Gaze dataset is called f c We calculate f d And f c The difference between the eye parameters and the head posture of the face image is f through a scoring function c Calculating a matching score for each face image of: s (f) c ,f d )=∑ m∈{eye,heed pose} σ m |a m,d -a m,c I, wherein the parameter sigma m Is determined empirically by comparing the matching results.
Preferably, the WTBU-Gaze dataset used in step S2.1 is acquired as follows:
Step S2.1.1: recruit 2N volunteers (N men and N women) and collect their face information while they study on the live broadcast teaching platform, yielding standardized single faces with gaze label information, distance label information and bounding box information;
Step S2.1.2: use the standardized single faces to pre-train the proposed TransHGE deep neural network model, thereby obtaining a WTBU-Gaze dataset matched to the gaze estimation task.
Preferably, in the step S3, the preprocessed face feature pattern is input to a visual self-attention encoder to obtain the directions of sight of the learner at different moments in the live-broadcast class teaching platform, and the specific steps are as follows:
step S3.1: and (2) obtaining the face feature map f in the step S2.2 map Remodelling 2D patches
Figure SMS_4
Wherein l=h×w;
step S3.2: the 2D patch at step S3.1
Figure SMS_5
On the basis of this, an additional marking matrix is added>
Figure SMS_6
And a position embedding +.>
Figure SMS_7
The final feature matrix is obtained as follows: f (f) p =[f to ;f map ]+f po
Step S3.3: the feature matrix f obtained in the step S3.2 p The visual self-attention encoder is input to acquire the sight direction of a learner in the live-broadcast course teaching platform as follows:
g f =MLP(Trans(f)[0,:])。
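A hedged PyTorch sketch of steps S3.1 to S3.3 is shown below. The patent's TransHGE model is not published, so this only mirrors the structure described above; the channel size c, patch grid h × w, number of heads and the MLP head are illustrative assumptions:

```python
import torch
import torch.nn as nn


class GazeHead(nn.Module):
    """Sketch of steps S3.1-S3.3: reshape the CNN feature map into patches, prepend a
    gaze token f_to, add a position embedding f_po, run a 6-layer transformer encoder
    and regress g_f = (alpha, beta) from the first output token."""

    def __init__(self, h=7, w=7, c=256, depth=6, heads=8):
        super().__init__()
        self.L = h * w
        self.gaze_token = nn.Parameter(torch.zeros(1, 1, c))          # f_to
        self.pos_embed = nn.Parameter(torch.zeros(1, self.L + 1, c))  # f_po
        layer = nn.TransformerEncoderLayer(d_model=c, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # 6-layer encoder
        self.mlp = nn.Sequential(nn.Linear(c, c // 2), nn.ReLU(), nn.Linear(c // 2, 2))

    def forward(self, f_map):                         # f_map: (B, c, h, w) from the CNN
        B, c, h, w = f_map.shape
        assert h * w == self.L, "patch grid must match the constructor arguments"
        patches = f_map.flatten(2).transpose(1, 2)    # (B, L, c) with L = h * w
        f_p = torch.cat([self.gaze_token.expand(B, -1, -1), patches], dim=1) + self.pos_embed
        f_trans = self.encoder(f_p)                   # (B, L + 1, c)
        return self.mlp(f_trans[:, 0, :])             # g_f = (alpha, beta)
```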
Preferably, in step S3.3, the learner's gaze direction on the live broadcast teaching platform is obtained as follows:
Step S3.3.1: process the feature matrix f_p = [f_to; f_map] + f_po with a 6-layer standard visual self-attention encoder and output a new feature matrix in R^(P²×D), where P² is the length of the feature vector and D is the dimension of each feature vector;
Step S3.3.2: select the first feature vector as the gaze token and use a 2-layer multi-layer perceptron to regress the gaze g_f = (α, β) from the gaze token, where α is the pitch angle of the line of sight and β is the yaw angle;
Step S3.3.3: convert the regressed gaze g_f into a 3D gaze vector g_F in the normalized space via the spherical-to-Cartesian conversion determined by the pitch angle α and the yaw angle β;
Step S3.3.4: the loss of the 3D gaze estimation task is computed in two parts, the head pose loss L_head and the gaze estimation loss L_gaze, and the total loss is minimized as Loss = min{δ_1 L_head + δ_2 L_gaze}, where the parameters δ_1, δ_2 ∈ [0, 1] adjust the two loss terms.
Preferably, the training step of the standard single-layer visual self-attention encoder model in step S3.3.1 is as follows:
step S3.3.1.1: the single-layer visual self-attention encoder is a self-attention module, and consists of three parts, namely a multi-head self-attention mechanism MSA, a multi-layer perception mechanism MLP and a layer normalization mechanism LN, and the characteristic matrix f obtained in the step S3.2 is obtained p The output of the self-attention module is calculated as the input mapped to the query q, the key k and the value v:
Figure SMS_12
wherein d is 0 For each feature dimension.
Step S3.3.1.2: the multi-headed self-attention mechanism MSA in step S3.3.1.1 can be represented by the formula wherein f p As an input to the current module,
Figure SMS_13
output and f of MSA for multi-head self-attention mechanism p Sum of (1) is
Figure SMS_14
Step S3.3.1.3: the multi-layer perceptron MLP of step S3.3.1.1 may be represented by the formula wherein
Figure SMS_15
For the output of the current self-attention encoder +.>
Figure SMS_16
Preferably, in the step S4, according to the obtained line-of-sight direction results of the learner at different moments, a line-of-sight gaze point resolving method based on a TOF depth imaging camera is provided, and line-of-sight gaze point coordinates can be obtained in real time, which specifically includes the steps of:
step S4.1: the relative position relationship between the learner head coordinate system H and the gazing screen coordinate system G can be calibrated by using the existing TOF depth imaging camera to measure the distance measurement technology by shining light on the target object and measuring the transmission time of the light between the lens and the object
Figure SMS_17
Step S4.2: the three-dimensional line-of-sight vector g to be acquired by said step S4.1 F Unitizing to obtain a unit sight line vector as follows:
Figure SMS_18
step S4.3: defining the midpoint of the connecting line of the eyes and the corners of eyes as the starting point of the sight, and representing p in the head coordinate system H of the learner 0 Based on the calibrated relative positional relationship in the step S4.1
Figure SMS_19
The unit sight line vector +.>
Figure SMS_20
And a line of sight departure point p 0 (x 0 ,y 0 ,z 0 ) Uniformly converting into a screen coordinate system;
step S4.4: under the screen coordinate system, knowing the sight line direction vector and the sight line departure point, solving a linear equation of the sight line, and further calculating the sight line point coordinate P (x, y) by the space geometrical relationship of the intersection of the plane and the line in the three-dimensional space.
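A NumPy sketch of steps S4.1 to S4.4, assuming the screen plane is z = 0 in the screen coordinate system G and that a calibrated head-to-screen rotation R_hg and translation t_hg are available from the TOF calibration; these names and conventions are illustrative:

```python
import numpy as np


def gaze_point_on_screen(R_hg, t_hg, p0_head, g_head):
    """Transform the gaze origin p0 and the unit gaze vector from the head coordinate
    system H into the screen coordinate system G, then intersect the gaze ray with
    the screen plane z = 0 to obtain the gaze point P(x, y)."""
    p0 = R_hg @ np.asarray(p0_head) + np.asarray(t_hg)   # gaze origin in G
    d = R_hg @ np.asarray(g_head)                        # gaze direction in G
    d = d / np.linalg.norm(d)
    if abs(d[2]) < 1e-6:
        return None          # gaze (almost) parallel to the screen plane
    s = -p0[2] / d[2]        # solve p0_z + s * d_z = 0
    if s < 0:
        return None          # gaze points away from the screen
    x, y, _ = p0 + s * d
    return float(x), float(y)
```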
Preferably, in the step S5, the attention condition of the learner during online learning is evaluated by comparing the number of times that the extracted gaze point coordinate is out of the screen coordinate range of the live-broadcast teaching platform with the set detection number f, which specifically includes the steps of:
step S5.1: recording the number of times n of the sight-line fixation point coordinates P (x, y) outside the screen coordinate range in real time and dynamically, and circularly carrying out the steps S1 to S4 every time the fixation point is recorded;
step S5.2: if the number of times n of out-of-screen fixation points in the step S5.1 is greater than the set detection number of times f, the system judges that the attention of the learner is out of focus, and the system carries out popup reminding, updates the number of times of fixation points after popup reminding, and resumes detection; if the number of out-of-screen fixation points n in step S5.1 is always smaller than the set detection number f, the system operates normally until the course is finished.
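A plain-Python sketch of the evaluation loop in step S5, with the pop-up reminder replaced by an illustrative placeholder; the stream of gaze points and the screen dimensions are assumed inputs:

```python
def monitor_attention(gaze_point_stream, screen_width, screen_height, f_threshold):
    """Count gaze points that fall outside the screen rectangle and trigger a pop-up
    reminder once the count exceeds the set detection number f. `gaze_point_stream`
    yields P(x, y) or None per detection cycle; the reminder hook is a placeholder."""
    n_off_screen = 0
    for point in gaze_point_stream:
        off_screen = point is None or not (0.0 <= point[0] <= screen_width
                                           and 0.0 <= point[1] <= screen_height)
        if off_screen:
            n_off_screen += 1
        if n_off_screen > f_threshold:
            print("Pop-up reminder: learner attention appears unfocused.")  # placeholder hook
            n_off_screen = 0   # reset the count after the reminder and resume detection
    return n_off_screen
```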
In general, compared with the prior art, the above technical solutions conceived by the present invention have the following beneficial effects:
(1) The learning attention detection and assessment method for the live broadcast teaching platform uses the RGB multi-frame images and the TOF multi-frame images obtained from the live broadcast teaching platform together in the learner gaze estimation task, matches qualified faces with the collected WTBU-Gaze dataset, and estimates the learner's gaze direction with a novel TransHGE deep neural network model, greatly improving the accuracy of gaze estimation when learners differ individually, for example by wearing glasses or having different eyeball sizes.
(2) Compared with the large errors that arise when estimating the gaze direction from 2D images alone, the invention provides a gaze point resolving method based on a TOF depth imaging camera: the distance between the learner and the learning device screen is acquired in real time, and the gaze point coordinates are extracted in real time by converting the gaze estimation result.
Drawings
FIG. 1 is a flow chart of a learning attention detection and assessment method for a live-broadcast teaching platform of the present invention;
FIG. 2 is a schematic diagram of learner face data acquisition under a live lesson teaching platform;
FIG. 3 is a network framework diagram of a learner view estimation model of the present invention;
FIG. 4 is a schematic diagram of a learner's view estimation result conversion according to the present invention;
FIG. 5 is a schematic view of a distance scenario between a learner and a learning device screen according to the present invention;
fig. 6 is a schematic diagram of the spatial geometry of the learner gaze point map of the present invention.
Like reference numerals denote like technical features throughout the drawings, in particular:
1. learner, 2, learning equipment screen, 3, RGB camera, 4, TOF camera.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Examples:
as shown in fig. 1, the embodiment of the invention is a learning attention detection and assessment method for a live broadcast teaching platform, which comprises the following steps:
Step S1: acquire, in real time, video resources of the learner from the RGB camera and the TOF depth imaging camera of the live broadcast teaching platform, and divide the video resources into multi-frame images in temporal order;
Step S2: extract face feature maps from the RGB multi-frame images and the TOF multi-frame images of the live broadcast teaching platform with a CNN, where each feature map contains information of a local face region;
Step S3: input the preprocessed face feature maps into a visual self-attention encoder to obtain the learner's gaze directions at different moments on the live broadcast teaching platform;
Step S4: from the obtained gaze directions of the learner at different moments, apply a gaze point resolving method based on the TOF depth imaging camera so that gaze point coordinates can be extracted in real time;
Step S5: evaluate the learner's attention during online learning by comparing the number of times the extracted gaze point coordinates fall outside the screen coordinate range of the live broadcast teaching platform with the set detection number f.
As shown in fig. 2, a learner is studying online on a live broadcast teaching platform. In this scenario, the RGB camera and the TOF depth imaging camera of the platform acquire the learner's face video resources in real time, and the video resources are divided into multi-frame images in temporal order; the RGB multi-frame images and TOF multi-frame images collected here provide an important data source for step S3.
As shown in fig. 3, in this embodiment, CNN is used to extract a face feature map for the RGB multi-frame image and the TOF multi-frame image in the live-broadcast class teaching platform, which specifically includes the following steps:
step S2.1: face attribute calculation is performed on each face meeting the conditions in the received image by using a face recognition model standardized in a WTBU-Gaze dataset, and the data is stored as A=F (I i ) Wherein A is a slave face image I i Eye parameter attribute and head posture attribute A= { a extracted from the above eye ,a head pose }。
Step S2.2: for each qualified face f in the received image d Adopting a face feature point detection algorithm based on a CNN model, and locating the qualified face f in the step S2.1 through CNN with two layers of convolution kernels d Is characterized by (a) feature pattern
Figure SMS_21
And a head posture rotation matrix R, wherein h, w and c respectively represent the length, the width and the characteristic spectrumNumber of channels.
Step S2.3: the face selected in the WTBU-Gaze dataset is called f c We calculate f d And f c The difference between the eye parameters and the head posture of the face image is f through a scoring function c Calculating a matching score for each face image of: s (f) c ,f d )=∑ m∈{eye,head pose} σ m |a m,d -a m,c I, wherein the parameter sigma m Is determined empirically by comparing the matching results.
According to the above scheme, the WTBU-Gaze dataset used in step S2.1 is acquired as follows:
Step S2.1.1: 50 volunteers were recruited, 25 men and 25 women, and their face information was collected while they studied on the live broadcast teaching platform, yielding standardized single faces with depth information such as gaze labels, distance labels and bounding boxes.
Step S2.1.2: the standardized single faces are used to pre-train the proposed TransHGE deep neural network model (the TransHGE deep neural network consists of step S1 and step S2), thereby obtaining a WTBU-Gaze dataset matched to the gaze estimation task.
According to the above scheme, in the step S3, the preprocessed face feature pattern is input into the visual self-attention encoder to obtain the line of sight directions of the learner at different moments in the live-broadcast course teaching platform, and the specific steps are as follows:
step S3.1: and (2) obtaining the face feature map f in the step S2.2 map Remodelling 2D patches
Figure SMS_22
Where l=h×w.
Step S3.2: the 2D patch at step S3.1
Figure SMS_23
On the basis of this, an additional marking matrix is added>
Figure SMS_24
And a position embedding +.>
Figure SMS_25
The final feature matrix is obtained as follows: />
Figure SMS_26
Step S3.3: the feature matrix f obtained in the step S3.2 p The visual self-attention encoder is input to acquire the sight direction of a learner in the live-broadcast course teaching platform as follows:
g f =MLP(Trans(f)[0,:]) (13)
In this embodiment, as shown in fig. 4, the learner's gaze direction on the live broadcast teaching platform is obtained as follows:
Step S3.3.1: process the feature matrix f_p = [f_to; f_map] + f_po with a 6-layer standard visual self-attention encoder and output a new feature matrix in R^(P²×D), where P² is the length of the feature vector and D is the dimension of each feature vector.
Step S3.3.2: select the first feature vector (the one corresponding to the token matrix f_to) as the gaze token and use a 2-layer multi-layer perceptron to regress the gaze g_f = (α, β) from the gaze token, where α is the pitch angle of the line of sight and β is the yaw angle.
Step S3.3.3: convert the regressed gaze g_f into a 3D gaze vector g_F in the normalized space via the spherical-to-Cartesian conversion determined by the pitch angle α and the yaw angle β.
Step S3.3.4: for the loss problem of the 3D vision estimation task, the loss L from the head posture is respectively carried out head And line of sight estimation loss L gaze The two parts perform the calculation. Head pose loss L head The method comprises the following steps:
Figure SMS_32
wherein the parameter gamma 1 ,γ 2 ∈[0,1]For adjusting losses; z p ,z b ,z l Respectively representing the probability of the predicted existence of the face image, the boundary box and the position of the landmark;
Figure SMS_33
respectively correspond to z p ,z b ,z l Ground truth of (2). Line of sight estimation loss L gaze The method comprises the following steps:
Figure SMS_34
where s.epsilon. { F, T, S }, parameter λ 1 ,λ 2 ∈[0,1]For adjusting losses, p s Is a trainable weight. The 3D line-of-sight estimation task loss is minimized by the two-part loss:
Loss=min{δ 1 L head2 L gaze }, (17)
wherein the parameter delta 1 ,δ 2 ∈[0,1]For adjusting the losses.
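Since the exact forms of L_head and L_gaze are only given as images, the sketch below shows just the weighted combination of equation (17), together with a common angular-error loss as an illustrative stand-in for L_gaze; the δ values and the stand-in loss are assumptions, not the patent's definitions:

```python
import torch
import torch.nn.functional as F


def angular_gaze_loss(pred_vec, true_vec):
    """Mean angular error between predicted and ground-truth 3D gaze vectors.
    Used only as an illustrative stand-in for the patent's L_gaze."""
    cos = F.cosine_similarity(pred_vec, true_vec, dim=-1).clamp(-1.0 + 1e-7, 1.0 - 1e-7)
    return torch.acos(cos).mean()


def total_gaze_loss(l_head, l_gaze, delta1=0.6, delta2=0.4):
    """Weighted combination Loss = delta_1 * L_head + delta_2 * L_gaze of equation (17);
    the delta values in [0, 1] are illustrative assumptions."""
    return delta1 * l_head + delta2 * l_gaze


# Illustrative usage with a placeholder head-pose loss:
pred = torch.randn(8, 3)
true = torch.randn(8, 3)
l_head = torch.tensor(0.12)   # stand-in for the image-only L_head formula
loss = total_gaze_loss(l_head, angular_gaze_loss(pred, true))
```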
According to the above scheme, the training steps of the standard single-layer visual self-attention encoder model in the step S3.3.1 are as follows:
step S3.3.1.1: the single-layer visual self-attention encoder is a self-attention module and mainly comprises three parts of a multi-head self-attention mechanism MSA, a multi-layer perception mechanism MLP and a layer normalization mechanism LN, and the characteristic matrix f obtained in the step S3.2 is obtained p Mapping as input to query q, key k and value v, self-attention moduleThe output of (2) is calculated as:
Figure SMS_35
wherein d is 0 For each feature dimension.
Step S3.3.1.2: the multi-headed self-attention mechanism MSA in step S3.3.1.1 can be represented by the formula wherein f p As an input to the current module,
Figure SMS_36
output and f of MSA for multi-head self-attention mechanism p A kind of electronic device.
Figure SMS_37
Step S3.3.1.3: the multi-layer perceptron MLP of step S3.3.1.1 may be represented by the formula wherein
Figure SMS_38
Is the output of the current self-attention encoder.
Figure SMS_39
As shown in fig. 5 and 6, in the embodiment, step S4 provides a method for resolving a gaze point based on a TOF depth imaging camera according to the gaze direction results of a learner at different moments, which can extract gaze point coordinates in real time, and specifically includes the steps of:
step S4.1: the relative position relationship between the learner head coordinate system H and the gazing screen coordinate system G can be calibrated by using the existing TOF depth imaging camera to measure the distance measurement technology by shining light on the target object and measuring the transmission time of the light between the lens and the object
Figure SMS_40
Step S4.2: the three-dimensional line-of-sight vector g to be acquired by said step S4.1 F Unitizing to obtain a unit sight line vector as follows:
Figure SMS_41
step S4.3: defining the midpoint of the connecting line of the eyes and the corners of eyes as the starting point of the sight, and representing p in the head coordinate system H of the learner 0 Based on the calibrated relative positional relationship in the step S4.1
Figure SMS_42
The unit sight line vector +.>
Figure SMS_43
And a line of sight departure point p 0 (x 0 ,y 0 ,z 0 ) Uniformly converting into a screen coordinate system;
step S4.4: under the screen coordinate system, knowing the sight line direction vector and the sight line departure point, solving a linear equation of the sight line, and further calculating the sight line point coordinate P (x, y) by the space geometrical relationship of the intersection of the plane and the line in the three-dimensional space.
According to the above scheme, in step S5, the learner's attention during online learning is evaluated by comparing the number of times the extracted gaze point coordinates fall outside the screen coordinate range of the live broadcast teaching platform (a 24-inch screen, 53.15 cm wide by 29.90 cm high) with the set detection number f, as follows:
Step S5.1: record, in real time and dynamically, the number of times n that the gaze point coordinates P(x, y) fall outside the screen coordinate range, and repeat steps S1 to S4 in a loop each time a gaze point is recorded;
Step S5.2: the learning attention detection and evaluation rules for the learner on the live broadcast teaching platform are shown in a table that appears only as an image in the original publication.
it will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A learning attention detection and assessment method of a live broadcast teaching platform comprises the following steps:
step S1, respectively acquiring video resources of a learner under an RGB camera and a TOF depth imaging camera in the live-broadcast class teaching platform in real time, and dividing the video resources into a multi-frame RGB image and a multi-frame TOF image according to time sequence;
s2, extracting face feature patterns from the multi-frame RGB image and the multi-frame TOF image by using a convolutional neural network CNN, wherein each feature pattern contains information of a face local area;
s3, inputting the preprocessed human face characteristic patterns into a visual self-attention encoder to acquire the sight directions of learners at different moments in a live-broadcast class teaching platform;
step S4, according to the obtained sight line direction results of the learner at different moments, a sight-line fixation point resolving method based on a TOF depth imaging camera is provided, and the sight-line fixation point coordinates can be extracted in real time;
and S5, comparing the number of times that the extracted gaze point coordinates are out of the screen coordinate range of the live-broadcast teaching platform with the set detection number f to evaluate the attention condition of the learner during online learning.
2. The learning attention detection and assessment method of a live teaching platform according to claim 1, wherein in the step S1, the video resources of the learner under the RGB camera and the TOF depth imaging camera in the live teaching platform are respectively obtained in real time, and the video resources are divided into a multi-frame RGB image and a multi-frame TOF image according to time sequence, and the specific steps are as follows:
step S1.1: setting the resolution and shooting angle of the RGB camera and the TOF depth imaging camera, and ensuring that the video resource received in real time contains a complete face area of a learner;
step S1.2: video resource of set RGB camera and TOF depth imaging camera
Figure FDA0004031512310000011
Wherein L, H, W, C represents video length, height and width and channel number, respectively;
step S1.3: dividing the video resource V obtained in real time in the step S1.2 into a plurality of frames of face image sequences according to time sequence
Figure FDA0004031512310000021
i∈{1,2,…,t 0 }, t is 0 The number of frames for video, H, W, C corresponding to video resources, represent the image height and width and the number of channels, respectively. .
3. The learning attention detection and assessment method of a live teaching platform according to claim 1, wherein in the step S2, the face feature map is extracted from the multi-frame RGB image and the multi-frame TOF image by using a convolutional neural network CNN, and the specific steps are as follows:
step S2.1: face attribute calculation is performed for each eligible face in the received image using a face recognition model standardized in the WTBU-size dataset, and the data is saved as =f (I i ) Wherein A is a slave face image I i Eye parameter attribute and head posture attribute A= { a extracted from the above eye ,a headpose };
Step S2.2: for each qualified face f in the received image d Adopting a face feature point detection algorithm based on a CNN model, and locating the qualified face f in the step S2.1 through CNN with two layers of convolution kernels d Is characterized by (a) feature pattern
Figure FDA0004031512310000022
And a head pose rotation matrix R, where h,w and c respectively represent the length, width and channel number of the characteristic map.
Step S2.3: the face selected in the WTBU-Gaze dataset is called f c We calculate f d And f c The difference between the eye parameters and the head posture of the face image is f through a scoring function c Calculating a matching score for each face image of: s (f) c ,f d )=∑ m∈{eye,headpose} σ m |a m,d -a m,c I, wherein the parameter sigma m Is determined empirically by comparing the matching results.
4. The learning attention detection and assessment method of a live teaching platform as claimed in claim 3, wherein the WTBU-Gaze dataset used in step S2.1 is acquired as follows:
Step S2.1.1: recruit 2N volunteers (N men and N women) and collect their face information while they study on the live broadcast teaching platform, yielding standardized single faces with gaze label information, distance label information and bounding box information;
Step S2.1.2: use the standardized single faces to pre-train the proposed TransHGE deep neural network model, thereby obtaining a WTBU-Gaze dataset matched to the gaze estimation task.
5. The learning attention detection and assessment method of a live-broadcast teaching platform as claimed in claim 1, wherein in the step S3, the preprocessed face feature pattern is input into a visual self-attention encoder to obtain the directions of the eyes of the learner at different moments in the live-broadcast teaching platform, and the specific steps are as follows:
step S3.1: and (2) obtaining the face feature map f in the step S2.2 map Remodelling 2D patches
Figure FDA0004031512310000031
Wherein l=h×w;
step S3.2: the 2D patch at step S3.1
Figure FDA0004031512310000032
On the basis of this, an additional marking matrix is added>
Figure FDA0004031512310000033
And a position embedding +.>
Figure FDA0004031512310000034
The final feature matrix is obtained as follows: f (f) p =[f to ;f map ]+f po
Step S3.3: the feature matrix f obtained in the step S3.2 p The visual self-attention encoder is input to acquire the sight direction of a learner in the live-broadcast course teaching platform as follows:
g f =MLP(Trans(f)[0,:])。
6. the learning attention detection and assessment method of a live-broadcast teaching platform as claimed in claim 5, wherein the step S3.3 is to obtain the line of sight direction of the learner in the live-broadcast teaching platform, and the specific steps are as follows:
step S3.3.1: processing feature matrix f using a 6-layer standard visual self-attention encoder p =[f to ;f map ]+f po And outputs a new feature matrix
Figure FDA0004031512310000035
Wherein P is 2 Is the length of the feature vector, D is the dimension of each feature vector;
step S3.3.2: selecting a first feature vector as a line-of-sight marker and returning a line-of-sight g from the line-of-sight marker using a 2-layer multi-layer perceptron f = (α, β), where α represents the pitch angle of the line of sight and β represents the yaw angle;
step S3.3.3: regression line of sight g to be output f Conversion to 3D line-of-sight vectors in normalized space by
Figure FDA0004031512310000036
Thereby obtaining the 3D sight vector +.>
Figure FDA0004031512310000041
Figure FDA0004031512310000042
Step S3.3.4: for the loss problem of the 3D vision estimation task, the loss L from the head posture is respectively carried out head And line of sight estimation loss L gaze Two parts calculate, through two part loss, minimize 3D sight estimation task loss: loss=min { delta } 1 L head2 L gaze }, wherein the parameter delta 12 ∈[0,1]For adjusting the losses.
7. The learning attention detection and assessment method for live-action teaching platform as claimed in claim 6, wherein the training step of the standard single-layer visual self-attention encoder model in step S3.3.1 is as follows:
step S3.3.1.1: the single-layer visual self-attention encoder is a self-attention module, and consists of three parts, namely a multi-head self-attention mechanism MSA, a multi-layer perception mechanism MLP and a layer normalization mechanism LN, and the characteristic matrix f obtained in the step S3.2 is obtained p The output of the self-attention module is calculated as the input mapped to the query q, the key k and the value v:
Figure FDA0004031512310000043
wherein d is 0 For each feature dimension.
Step S3.3.1.2: the multi-headed self-attention mechanism MSA in step S3.3.1.1 can be represented by the formula wherein f p As an input to the current module,
Figure FDA0004031512310000044
output and f of MSA for multi-head self-attention mechanism p Sum of (1) is
Figure FDA0004031512310000045
Step S3.3.1.3: the multi-layer perceptron MLP of step S3.3.1.1 may be represented by the formula wherein
Figure FDA0004031512310000046
For the output of the current self-attention encoder +.>
Figure FDA0004031512310000047
Figure FDA0004031512310000048
8. The learning attention detection and assessment method of a live broadcast teaching platform as claimed in claim 1, wherein the step S4 provides a line-of-sight gaze point resolving method based on a TOF depth imaging camera according to the line-of-sight direction results of the learner at different moments, and the line-of-sight gaze point coordinates can be obtained in real time, specifically comprising the steps of:
step S4.1: the relative position relationship between the learner head coordinate system H and the gazing screen coordinate system G can be calibrated by using the existing TOF depth imaging camera to measure the distance measurement technology by shining light on the target object and measuring the transmission time of the light between the lens and the object
Figure FDA0004031512310000051
Step S4.2: the three-dimensional line-of-sight vector g to be acquired by said step S4.1 F Unitizing to obtain a unit sight line vector as follows:
Figure FDA0004031512310000052
step S4.3: definition of doubleThe midpoint of the connecting line of the inner canthus is the line of sight departure point and is expressed as p in the head coordinate system H of the learner 0 Based on the calibrated relative positional relationship in the step S4.1
Figure FDA0004031512310000053
The unit sight line vector +.>
Figure FDA0004031512310000054
And a line of sight departure point p 0 (x 0 ,y 0 ,z 0 ) Uniformly converting into a screen coordinate system;
step S4.4: under the screen coordinate system, knowing the sight line direction vector and the sight line departure point, solving a linear equation of the sight line, and further calculating the sight line point coordinate P (x, y) by the space geometrical relationship of the intersection of the plane and the line in the three-dimensional space.
9. The learning attention detection and assessment method for a live-broadcast teaching platform according to claim 1, wherein in the step S5, the attention condition of the learner during online learning is assessed by comparing the number of times that the extracted gaze point coordinates are out of the screen coordinate range of the live-broadcast teaching platform with the set detection number f, and the specific steps are as follows:
step S5.1: recording the number of times n of the sight-line fixation point coordinates P (x, y) outside the screen coordinate range in real time and dynamically, and circularly carrying out the steps S1 to S4 every time the fixation point is recorded;
step S5.2: if the number of times n of out-of-screen fixation points in the step S5.1 is greater than the set detection number of times f, the system judges that the attention of the learner is out of focus, and the system carries out popup reminding, updates the number of times of fixation points after popup reminding, and resumes detection; if the number of out-of-screen fixation points n in step S5.1 is always smaller than the set detection number f, the system operates normally until the course is finished.
CN202211743625.8A 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform Pending CN116030519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211743625.8A CN116030519A (en) 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211743625.8A CN116030519A (en) 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform

Publications (1)

Publication Number Publication Date
CN116030519A true CN116030519A (en) 2023-04-28

Family

ID=86078888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211743625.8A Pending CN116030519A (en) 2022-12-30 2022-12-30 Learning attention detection and assessment method for live broadcast teaching platform

Country Status (1)

Country Link
CN (1) CN116030519A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453384A (en) * 2023-06-19 2023-07-18 江西德瑞光电技术有限责任公司 Immersion type intelligent learning system based on TOF technology and control method
CN117636341A (en) * 2024-01-26 2024-03-01 中国海洋大学 Multi-frame seaweed microscopic image enhancement recognition method and model building method thereof
CN117636341B (en) * 2024-01-26 2024-04-26 中国海洋大学 Multi-frame seaweed microscopic image enhancement recognition method and model building method thereof

Similar Documents

Publication Publication Date Title
CN111709409B (en) Face living body detection method, device, equipment and medium
CN107423730B (en) Human gait behavior active detection and recognition system and method based on semantic folding
CN105913487B (en) One kind is based on the matched direction of visual lines computational methods of iris edge analysis in eye image
US9545217B2 (en) Movement correction in MRI using a camera
CN116030519A (en) Learning attention detection and assessment method for live broadcast teaching platform
CN112040834A (en) Eyeball tracking method and system
CN111144207B (en) Human body detection and tracking method based on multi-mode information perception
CN109584290A (en) A kind of three-dimensional image matching method based on convolutional neural networks
CN109782902A (en) A kind of operation indicating method and glasses
US20200193607A1 (en) Object shape regression using wasserstein distance
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN106156714A (en) The Human bodys' response method merged based on skeletal joint feature and surface character
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN112016497A (en) Single-view Taijiquan action analysis and assessment system based on artificial intelligence
CN111524183A (en) Target row and column positioning method based on perspective projection transformation
CN114333046A (en) Dance action scoring method, device, equipment and storage medium
CN109993116B (en) Pedestrian re-identification method based on mutual learning of human bones
CN104063689B (en) Face image identification method based on binocular stereoscopic vision
CN109886780B (en) Commodity target detection method and device based on eyeball tracking
CN110796699B (en) Optimal view angle selection method and three-dimensional human skeleton detection method for multi-view camera system
CN110738123B (en) Method and device for identifying densely displayed commodities
Ehinger et al. Local depth edge detection in humans and deep neural networks
CN110400333A (en) Coach's formula binocular stereo vision device and High Precision Stereo visual pattern acquisition methods
CN113012201B (en) Ground unmanned platform personnel tracking method based on deep learning
CN112099330B (en) Holographic human body reconstruction method based on external camera and wearable display control equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination