CN117173758A - Learning attention state assessment method based on multidimensional feature fusion network


Info

Publication number
CN117173758A
CN117173758A (application CN202211662783.0A)
Authority
CN
China
Prior art keywords
feature
learner
module
graph
attention state
Prior art date
Legal status
Pending
Application number
CN202211662783.0A
Other languages
Chinese (zh)
Inventor
田斌
李少义
黎曦
罗芷萱
侯常辉
刘婷婷
刘海
肖振华
Current Assignee
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202211662783.0A
Publication of CN117173758A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

To address the difficulty a learner faces in assessing his or her own attention state in real time, the invention discloses a learning attention state assessment method based on a multidimensional feature fusion network. The method comprises the following steps: 1) acquire learner video with a binocular imaging device (a short-wave infrared camera and a lidar scanner) mounted on the desk and divide it into multiple frames of images, while a hand-worn wearable device collects the learner's blood oxygen saturation and heart rate signals; 2) locate the face region and facial feature points in the learner's SWIR image and segment the 3D point cloud set of the head region; 3) input the face-region SWIR image and the head 3D point cloud set into the corresponding feature extraction networks to obtain feature topology graphs, fuse them with a self-attention weighting module, and input the result into a Cauchy label distribution regression module to obtain the learner's head pose angles; at the same time, extract blood oxygen saturation and heart rate variability features and determine the learner's fatigue level; 4) comprehensively evaluate the attention state from the learner's head pose angles, the changes in facial feature points and the fatigue level, and remind the learner if attention is not focused; 5) compile statistics on the attention state over the learning session and feed back a statistical analysis report. By comprehensively evaluating the learner's attention state and providing statistical feedback, the invention helps the learner improve concentration and develop good learning habits.

Description

Learning attention state assessment method based on multidimensional feature fusion network
Technical Field
The invention relates to the field of computer vision and behavior analysis, in particular to a learning attention state evaluation method based on a multidimensional feature fusion network.
Background
As learning resources become ever easier and more comprehensively available, improving personal ability through self-study is becoming the trend of future learning. In an unsupervised environment, however, and especially at home, many learners are easily distracted and learn inefficiently. In recent years, artificial intelligence technology has been widely applied in many fields thanks to its convenience and efficiency. In a self-study environment a learner cannot always notice and correct his or her own behaviour in time, so artificial intelligence, and in particular pose recognition technology, can be used to supervise the learner's state in real time and judge whether the learner's attention is focused, thereby better helping the learner develop good learning habits.
Head pose is an important cue for a learner's attention. By analysing changes in the learner's head pose angles during learning, combined with changes in facial feature points, the learner's attention can be judged effectively: if the head deflection angle points outside the desk or screen area, or if the learner frequently yawns or even closes his or her eyes, the system can detect this in time and remind the learner to concentrate. In addition, blood oxygen saturation and heart rate variability features reflect the learner's degree of fatigue well, and fatigue directly affects concentration. However, current head pose estimation still faces several challenges:
the learner has the problems that hands are blocked, hair styles are blocked, heads are blocked by clothes and the like easily in the learning process, and in addition, the problem of insufficient illumination of indoor scenes easily occurs. These all result in poor quality images that are acquired and complete information cannot be obtained. There is therefore a need for an image acquisition device that can acquire head pose information in multiple dimensions and is immune to illumination variations.
In the data sets currently used for training head pose angle estimation, the distribution of training samples is extremely unbalanced: there are not enough large-pose samples, and many images carry mislabelled head pose angles. As a result, robust network parameters cannot be trained.
Most existing head pose estimation methods regress the head pose angle from RGB images or from head 3D point cloud data alone. The accuracy of two-dimensional image-based methods is hard to improve further, while methods based on three-dimensional point clouds are usually too computationally expensive. A lightweight head pose estimation method that combines two-dimensional and three-dimensional features is therefore needed.
Disclosure of Invention
To meet this need for improvement over the prior art, the invention uses a binocular imaging device consisting of a short-wave infrared camera and a lidar scanner and provides a learning attention state assessment method based on a multidimensional feature fusion network. Combined with the learner's fatigue level, the method monitors in real time whether the head deflects towards the area outside the desk or screen, which indicates distraction, prompts the learner to concentrate in time, and generates a concentration report for the learning session to help the learner develop good learning habits.
The technical scheme adopted for solving the technical problems is as follows: a learning attention state evaluation method based on a multidimensional feature fusion network comprises the following steps:
acquiring learner video captured by a binocular imaging device (a short-wave infrared camera and a lidar scanner) on the desk and dividing it into multiple frames of images; acquiring the learner's blood oxygen saturation and heart rate signals with a hand-worn wearable device;
locating the face region and facial feature points in the learner's SWIR image, and segmenting the 3D point cloud set of the head region;
inputting the face-region SWIR image and the head 3D point cloud set into the corresponding feature extraction networks to obtain feature topology graphs, fusing them with a self-attention weighting module, and inputting the result into a Cauchy label distribution regression module to obtain the learner's head pose angles; extracting blood oxygen saturation and heart rate variability features and determining the learner's fatigue level;
comprehensively evaluating the attention state according to the learner's head pose angles, the changes in facial feature point positions and the fatigue level, and reminding the learner if attention is not focused;
compiling statistics on the attention state during the learning process and feeding back a statistical analysis report.
According to the scheme, the face region and facial feature point positioning module works as follows:
Step 1.1.1: each frame of the SWIR image of the subject is resized to 624×624 pixels and input into a lightweight Mask R-CNN network pre-trained on a face data set to obtain the face region (I_x, I_y, m, n);
Step 1.2.1: the cropped face-region SWIR image is input into the global coarse feature extraction network RG-Net, whose structure can be expressed as {conv1-res1-res2-res3-glDSC-fc}, where conv1 denotes a convolution layer, res a residual connection layer, glDSC a global channel-separable convolution and fc a fully connected layer; the network regresses the global coarse feature point coordinate vector P_0;
Step 1.2.2: the output feature map of the res1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the first-level refinement feature map F_R1; F_R1 is input into the local refinement network FL-Net to extract a feature vector and regress the first-level refined facial feature point coordinate vector P_1;
Step 1.2.3: the output feature map of the conv1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the second-level refinement feature map F_R2; F_R2 is input into the local refinement network FL-Net to extract a feature vector and regress the second-level refined facial feature point coordinate vector P_2^T, which is the final sparse facial feature point coordinate vector.
The head pose two-dimensional feature extraction model comprises a channel-separable convolution module, a pixel-space Transformer module, a fusion feature topology graph construction module and an adaptive graph convolution module. The channel-separable convolution module extracts local features from the pixel space of the preprocessed SWIR face-region image. The pixel-space Transformer extracts global pixel-space feature relationships from the local feature map. The adaptive graph convolution module updates the values of the graph vertices to obtain a head pose fusion feature topology graph of new dimension.
According to the scheme, the channel-separable convolution module works as follows:
Step 2.1.1: a batch of cropped 328×328 SWIR face-region images I_swir ∈ R^(N×H×W×C) is input into a dual-branch channel-separable convolution network to extract local image features;
Step 2.1.2: branch I has the structure {SC_MAX(16)-SC_1(32)-SC_MAX(32)}, where the SC_1 module has the structure [SC, BN, RL]. SC denotes channel-separable convolution, which extracts local features for each channel by point-wise convolution. BN denotes batch normalization of the batch of input images, applied to each of the C channels separately: for a channel with element (pixel-value) set Γ = {Γ_1, ..., Γ_{N×H×W}} and mean Γ̄, the batch-normalized value of an element Γ_i can be expressed as BN(Γ_i) = a·(Γ_i − Γ̄)/sqrt(Var(Γ) + ζ) + b, where ζ is a very small positive number that prevents the denominator (the standard deviation) from being 0, and a, b are trainable network parameters that scale and shift the normalized result. The RL activation function replaces negative elements with zero, making the feature map values easier to converge. SC_MAX applies local max pooling (Patch_Max) on top of SC_1, yielding the head pose local feature map I_s1_1 ∈ R^(N×H′×W′×C′);
Step 2.1.3: branch II has the structure {SC_AVE(16)-SC_2(32)-SC_AVE(32)}, where the SC_2 module has the structure [SC, BN, TH]. The TH activation function normalizes element values to the range (−1, 1), making the network easier to converge. SC_AVE applies local average pooling on top of SC_2, yielding the head pose local feature map I_s2_1 ∈ R^(N×H′×W′×C′).
According to the scheme, the pixel-space Transformer module is trained as follows:
Step 2.2.1: the local feature maps I_s1_1 and I_s2_1 are input into a dual-branch, two-stage pixel-space Transformer network to extract global pixel-space features and generate fusion feature maps;
Step 2.2.2: I_s1_1 is input into branch I, whose first stage has the structure {SC_MAX(32)-Transformer-Patch_Max} and whose second stage has the structure {SC_MAX(32)-Transformer}. The pixel-space Transformer layer is a cascade of three pixel-space Transformer encoders and extracts the global feature relationships of the pixel space;
Step 2.2.3: the pixel-space Transformer encoder stretches the output feature map I_sc ∈ R^(N×H″×W″×C′) of the channel-separable convolution layer SC into a three-dimensional embedding vector I_emb ∈ R^(N×A′×C′), where A′ = H″ × W″;
Step 2.2.4: for the embedding vector of each image (i ∈ [1, N]) a position code I_P is added to every element, i.e. every pixel point, with indices m ∈ [0, A′−1] and n ∈ [0, (C′−1)/2]; the embedding vector I_emb is updated to I_emb + I_P, which is input into the multi-head self-attention mapping module;
Step 2.2.5: the multi-head self-attention mapping module contains 8 self-attention heads. Each head obtains a self-mapping weight matrix from the input; the input is dot-multiplied with this weight matrix and passed through a nonlinear transformation to give the self-attention map of that head. The outputs of the 8 heads are combined to form the final output I_A of the multi-head self-attention mapping module;
Step 2.2.6: the output I_A of the multi-head self-attention mapping module passes through a residual normalization layer and a fully connected layer to give the output of the pixel-space Transformer encoder. Three pixel-space Transformer encoders are cascaded to finally obtain the head pose fusion feature map MAP_1 ∈ R^(N×H″′×W″′×C′);
Step 2.2.7: I_s2_1 is input into branch II, whose structure is similar to branch I but extracts different feature maps based on local averaging; its first stage is {SC_AVE(32)-Transformer-Patch_Ave} and its second stage is {SC_AVE(32)-Transformer}, finally yielding the head pose fusion feature map MAP_2 ∈ R^(N×H″′×W″′×C′).
According to the above scheme, the fusion feature topology graph construction module comprises construction of the fusion feature graph vertices V_M and of the topological connection matrix T. MAP_1 and MAP_2 are multiplied element-wise to give the overall fusion feature map MAP, which is mapped to a low-dimensional fusion feature vector M; N denotes the number of images in a batch, and a fusion feature topology graph is built for each frame separately. The value of a fusion feature graph vertex V_M is the fusion feature vector M of the single image, and the fusion feature topology graph shares the topological connection matrix T with the 3D point cloud topology graph. The fusion feature topology graph is constructed as G_2 = (V_M, T).
The head point cloud segmentation module works as follows: according to the two-dimensional coordinate information (I_x, I_y, m, n) provided by the face region detection box, each frame of the point cloud image is compared with the corresponding dense point cloud set pic, and the points falling in the range [I_x, I_y] to [I_x+m, I_y+n] are selected, giving the dense point cloud set of the head region pic_1 = {(x_1, y_1, z_1), ..., (x_n, y_n, z_n)}.
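A minimal NumPy sketch of this bounding-box filtering step (assuming the point cloud is already registered so that its first two coordinates align with the SWIR image plane; variable names are illustrative):

```python
import numpy as np

def segment_head_cloud(points, face_box):
    """Keep points whose projected (x, y) falls inside the face detection box.

    points   : (P, 3) array of (x, y, z) coordinates aligned with the SWIR image plane
    face_box : (I_x, I_y, m, n) - top-left corner and width/height of the face region
    """
    ix, iy, m, n = face_box
    mask = (points[:, 0] >= ix) & (points[:, 0] <= ix + m) & \
           (points[:, 1] >= iy) & (points[:, 1] <= iy + n)
    return points[mask]                 # pic_1, the dense head-region point cloud

cloud = np.random.rand(10000, 3) * 624  # stand-in for one frame's dense point cloud
head_cloud = segment_head_cloud(cloud, (200, 150, 180, 220))
```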
The head pose three-dimensional feature extraction model comprises a facial feature point 3D point cloud topology graph construction module and an adaptive graph convolution module. The facial feature point 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T. The adaptive graph convolution module extracts the weight relationship between every pair of topology graph vertices and updates the vertex values, yielding a head pose 3D point cloud topology graph of new dimension.
According to the scheme, the facial feature point 3D point cloud topology graph construction module works as follows:
Step 3.1.1: according to the two-dimensional coordinates P_2^T of the facial feature points, the corresponding 3D point cloud coordinates are selected from the point cloud set pic_1; the value of a 3D point cloud vertex V_D is the 3D coordinate of a facial key point, pic_key = (x_key_i, y_key_i, z_key_i), i = 1, ..., 25;
Step 3.1.2: for each graph vertex the 5 vertices closest in Euclidean space are found with a KD-Tree search and connected, building the topological connection matrix T ∈ R^(N×N), where N is the number of feature points; T(i, j) = 1 means the two graph vertices are connected, otherwise T(i, j) = 0;
Step 3.1.3: the 3D point cloud topology graph is constructed as G_1 = (V_D, T).
According to the scheme, the adaptive graph convolution module is trained as follows:
Step 4.1.1: the network structure of the adaptive graph convolution module is adaptive graph convolution layer - batch normalization layer - RL activation layer - 1-dimensional convolution layer - batch normalization layer - RL activation layer; it updates the graph vertex values v_i of the input feature topology graphs G to 192-dimensional feature values;
Step 4.1.2: the adaptive graph convolution layer selects, for each vertex V_n of the feature graph G, the K vertices nearest to it in its neighbourhood to form vertex pairs. For each vertex pair M channels are constructed and each channel computes a feature value independently; the feature values of the K vertex pairs are concatenated, and channel-wise max pooling gives the updated graph vertex, where K is set to 6 and M to 192.
The blood oxygen saturation and electrocardiogram (ECG) feature extraction module works as follows:
Step 5.1.1: for the blood oxygen saturation SpO2, compute the mean square deviation of the sampled values within one period, σ_sp = sqrt((1/N) Σ_{i=1}^{N} (sp_i − s̄p)²), where N is the number of samples, sp_i is the i-th sampled value and s̄p is the mean of the samples in the period;
Step 5.1.2: for the ECG signal, compute the standard deviation σ_RR of the intervals between adjacent R waves of consecutive heartbeat signals, where an interval is the spacing between two adjacent peaks; from the spectrogram over the period compute the spectral densities of the high-frequency band Θ_HF and the ultra-low-frequency band Θ_SLF and take their ratio γ = Θ_HF / Θ_SLF;
Step 5.1.3: analyse the changes of σ_sp, σ_RR and γ together. If Δσ_sp < 0.005, Δσ_RR < 10 ms and Δγ < 0.2, the fatigue level is 1: consciousness is awake and thinking is active. If Δσ_sp ∈ [0.005, 0.01), Δσ_RR ∈ [10 ms, 35 ms) and Δγ ∈ [0.2, 0.8), the fatigue level is 2: consciousness is slightly blurred and thinking is relaxed. If Δσ_sp ≥ 0.01, Δσ_RR ≥ 35 ms and Δγ ≥ 0.8, the fatigue level is 3: consciousness is blurred and thinking cannot be concentrated.
The self-attention weighting module works as follows:
Step 6.1.1: the self-attention weighting module comprises a self-attention layer, a fully connected layer and a softmax regression layer. The updated three-dimensional point cloud topology graph and the two-dimensional fusion feature topology graph obtained from the previous modules are input into the self-attention layer and updated;
Step 6.1.2: the fully connected layer maps the updated graphs to vectors of dimension 1×N′, and the softmax layer finally computes the weighting parameters α_1 and α_2 of the two graphs, giving the final weighted fusion feature topology graph.
The Cauchy label distribution regression module works as follows:
Step 7.1.1: the weighted fusion feature topology graph is mapped to a multidimensional feature vector through a fully connected layer, the precise head pose angles are regressed, and the mean absolute error (MAE) with respect to the true angles is computed as the loss function Loss_M;
Step 7.1.2: for each training image I_i the actual angle labels are converted into Cauchy label distributions; at the same time the module trains the network to generate three groups of parameters δ, η and ζ, giving the predicted Cauchy label probability distributions (P_A(I_i; δ), P_B(I_i; η), P_C(I_i; ζ));
Step 7.1.3: the spatial distance Loss_θ and the KL divergence between the predicted and actual Cauchy label probability distributions are computed as the loss function Loss_G, which is weighted with the loss function Loss_M to give the final loss function Loss_total = Loss_G + 0.06·Loss_M.
According to the scheme, the optimal network parameters are obtained in advance by training on a training set with this loss function. Feeding the learner's short-wave infrared image and 3D point cloud data into the pre-trained multidimensional feature fusion self-attention network yields the learner's real-time head pose angles Yaw, Pitch and Roll, from which it is judged whether the head points into the inattention zone. Combined with the positions of the facial feature points and the learner's fatigue level, the learner's concentration is evaluated comprehensively, and the learner is reminded if attention is not focused.
Overall, compared with the prior art, the invention has the following beneficial effects:
(1) The invention acquires short-wave infrared video images and 3D point cloud data separately, obtaining head pose information in multiple dimensions without being affected by illumination changes. Two-dimensional and three-dimensional head pose information is considered jointly, so a more accurate head pose angle can be regressed.
(2) The multidimensional feature fusion self-attention network uses the spatial information of facial key points to build topology graph structures, constructing one head pose topology graph from the two-dimensional features and another from the three-dimensional features. The two-dimensional head pose feature extraction combines convolution, which captures local information, with a pixel-space Transformer, which captures global information, yielding more comprehensive local-global two-dimensional fusion features. The Cauchy label distribution regression module fully exploits the similarity between adjacent head poses, which alleviates the lack of large-pose samples in the training set.
(3) To supplement the head pose angle in characterizing the learner's concentration, blood oxygen saturation and ECG signals are collected, and the changes of the corresponding parameters σ_sp, σ_RR and γ are analysed jointly to qualitatively judge the learner's fatigue level.
Drawings
FIG. 1 is a flow chart of a learning attention state assessment method based on a multidimensional feature fusion network according to an embodiment of the invention;
FIG. 2 is a schematic diagram of data acquisition in a home environment;
FIG. 3 is a schematic diagram of a multidimensional feature fusion network according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make the objects, technical solutions and advantages of the invention clearer. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the invention described below may be combined with one another as long as they do not conflict.
As shown in FIG. 1, the embodiment of the invention is a learning attention state assessment method based on a multidimensional feature fusion network, which comprises the following steps:
step 1: and acquiring a learner video resource acquired by a binocular imaging device (a short-wave infrared camera and a laser radar scanner) on the office table, and dividing the learner video resource into multiple frames of images according to time sequence. Simultaneously, the blood oxygen saturation and heart rate signals of the learner are acquired through the hand wearable equipment.
Step 2: and (3) locating the face region and the facial feature points of the SWIR image of the learner, and simultaneously dividing the head region 3D point cloud set.
Step 3: inputting the SWIR image of the face area and the head 3D point cloud set into a corresponding head gesture two-dimensional, three-dimensional feature extraction network to obtain a feature topological graph, and inputting the feature topological graph into a Cauchy tag distribution regression module to obtain the head gesture angle of a learner after the feature topological graph is fused by a self-attention weighting module. And simultaneously extracting the blood oxygen saturation and heart rate variation characteristics, and judging the fatigue level of the learner.
According to the scheme, the blood oxygen saturation and ECG feature extraction module works as follows: taking 5 minutes as one period, compute the mean square deviation of the SpO2 samples within the period, σ_sp = sqrt((1/N) Σ_{i=1}^{N} (sp_i − s̄p)²), where N is the number of samples, sp_i the i-th sampled value and s̄p the mean of the samples in the period. At the same time, detect all peaks of the time-domain ECG waveform within the period by wavelet transform, compute the interval between every two adjacent peaks, and obtain the standard deviation σ_RR of the intervals between adjacent R waves of consecutive heartbeat signals. Convert the time-domain signal of the period to the frequency domain with a fast Fourier transform, analyse the spectrogram, compute the spectral densities of the high-frequency band Θ_HF and the ultra-low-frequency band Θ_SLF, and obtain the second parameter γ = Θ_HF / Θ_SLF.
Analyse the changes of σ_sp, σ_RR and γ together: if Δσ_sp < 0.005, Δσ_RR < 10 ms and Δγ < 0.2, the fatigue level is 1, consciousness is awake and thinking is active; if Δσ_sp ∈ [0.005, 0.01), Δσ_RR ∈ [10 ms, 35 ms) and Δγ ∈ [0.2, 0.8), the fatigue level is 2, consciousness is slightly blurred and thinking is relaxed; if Δσ_sp ≥ 0.01, Δσ_RR ≥ 35 ms and Δγ ≥ 0.8, the fatigue level is 3, consciousness is blurred and thinking cannot be concentrated.
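The period-level signal features described above could be computed roughly as follows (a sketch under clear assumptions: SciPy's generic peak finder stands in for the wavelet-based R-peak detector mentioned in the text, the spectrum is taken on the mean-removed ECG exactly as the text literally describes, and the sampling rate and band edges are illustrative values, not taken from the patent):

```python
import numpy as np
from scipy.signal import find_peaks, periodogram

def spo2_sigma(spo2_samples):
    """Mean square deviation of SpO2 samples within one 5-minute period."""
    sp = np.asarray(spo2_samples, dtype=float)
    return float(np.sqrt(np.mean((sp - sp.mean()) ** 2)))

def ecg_features(ecg, fs=250.0, hf_band=(0.15, 0.4), ulf_band=(0.0, 0.003)):
    """sigma_RR (std of R-R intervals, in ms) and gamma = HF / ULF spectral energy ratio."""
    # simple peak detection stands in for the wavelet-based R-wave detector of the patent
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), height=np.percentile(ecg, 90))
    rr_ms = np.diff(peaks) / fs * 1000.0
    sigma_rr = float(np.std(rr_ms))

    freqs, psd = periodogram(ecg - np.mean(ecg), fs=fs)    # FFT-based spectrogram of the period
    hf = psd[(freqs >= hf_band[0]) & (freqs < hf_band[1])].sum()
    ulf = psd[(freqs > ulf_band[0]) & (freqs < ulf_band[1])].sum()
    gamma = float(hf / max(ulf, 1e-12))
    return sigma_rr, gamma
```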
As shown in FIG. 2, the learner studies at home while a short-wave infrared camera and a lidar scanner capture a video sequence of the learner's face. The multi-frame SWIR images and 3D point cloud images of the learner acquired in this scene provide an important data source for the head pose estimation module.
As shown in FIG. 3, in this embodiment the multidimensional feature fusion self-attention network comprises a face region and facial feature point positioning module, a head point cloud segmentation module, a head pose two-dimensional feature extraction module, a head pose three-dimensional feature extraction module, a self-attention weighting module and a Cauchy label distribution regression module.
According to the scheme, the face region and facial feature point positioning module works as follows:
Step 3.1.1: each frame of the SWIR image of the subject is resized to 624×624 pixels and input into a lightweight Mask R-CNN network pre-trained on a face data set to obtain the face region (I_x, I_y, m, n);
Step 3.1.2: each frame of the SWIR image is cropped according to the face region (I_x, I_y, m, n) and input into the sparse facial feature point extraction network, which consists of the global coarse feature point extraction network RG-Net and the cascaded local refinement network FL-Net;
Step 3.1.3: the cropped face-region SWIR image is input into RG-Net, whose structure can be expressed as {conv1-res1-res2-res3-glDSC-fc}, where conv1 denotes a convolution layer, res a residual connection layer, glDSC a global channel-separable convolution and fc a fully connected layer; the network regresses the global coarse feature point coordinate vector P_0;
Step 3.1.4: the output feature map of the res1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the first-level refinement feature map F_R1. F_R1 is input into the local refinement network FL-Net, which first reduces the multichannel feature map to a two-dimensional vector by convolution, then applies normalization and a ReLU nonlinear transformation, and finally regresses the first-level feature vector through a fully connected layer, giving the first-level refined facial feature point coordinate vector P_1;
Step 3.1.5: the output feature map of the conv1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the second-level refinement feature map F_R2, which is input into the local refinement network FL-Net to obtain the second-level feature vector and the second-level refined facial feature point coordinate vector P_2^T, the final sparse facial feature point coordinate vector. The extraction procedure can be expressed as:

P_l = P_{l−1} + FL_l(ψ(RG(I)_l, P_{l−1}))    (3)

where P_0 is the output of the global coarse feature point extraction network RG-Net, l denotes the level, FL_l denotes the local refinement network FL-Net cascaded l times, RG(I)_l denotes the output feature map of the l-th layer of RG-Net, and ψ(·) denotes feature reuse, i.e. constructing the p×q feature map centred on the coarse feature point (x_j, y_j).
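The cascaded coarse-to-fine update of equation (3) can be sketched as follows (illustrative only: rg_net and the fl_nets are stand-in callables with assumed interfaces rather than the patent's actual networks, the patch size p = q = 16 is an assumption, and how the layer index maps to res1/conv1 is simplified):

```python
import torch

def crop_patches(feature_map, points, p=16, q=16):
    """psi(.): cut a p x q patch of the feature map centred on each coarse feature point."""
    _, _, h, w = feature_map.shape
    patches = []
    for (x, y) in points.tolist():
        x0 = int(max(0, min(round(x) - p // 2, w - p)))
        y0 = int(max(0, min(round(y) - q // 2, h - q)))
        patches.append(feature_map[:, :, y0:y0 + q, x0:x0 + p])
    return torch.cat(patches, dim=0)               # (num_points, C, q, p)

def cascade_refine(image, rg_net, fl_nets):
    """Equation (3): P_l = P_{l-1} + FL_l(psi(RG(I)_l, P_{l-1}))."""
    points, layer_maps = rg_net(image)             # P_0 (num_points, 2) and per-layer feature maps
    for level, fl_net in enumerate(fl_nets, start=1):
        patches = crop_patches(layer_maps[level], points)
        points = points + fl_net(patches)          # residual refinement of the coordinates
    return points                                   # sparse facial feature point coordinates P_2^T
```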
The head pose two-dimensional feature extraction model comprises a channel-separable convolution module, a pixel-space Transformer module, a fusion feature topology graph construction module and an adaptive graph convolution module. The channel-separable convolution module converts the SWIR image into a multichannel local feature map, extracting the local features of the pixel space of the preprocessed SWIR face-region image. The pixel-space Transformer extracts global pixel-space feature relationships from the multichannel local feature map and generates a pixel-space fusion feature map. The fusion feature topology graph construction module constructs the fusion feature graph vertices V_M and the topological connection matrix T. The adaptive graph convolution module extracts the weight relationship between every pair of topology graph vertices and updates the vertex values, yielding a head pose fusion feature topology graph of new dimension.
According to the scheme, the channel-separable convolution module works as follows:
Step 3.2.1: the located face region window is adjusted and resized to 328×328, giving a batch of cropped SWIR face-region images I_swir ∈ R^(N×H×W×C), which is input into the dual-branch channel-separable convolution network to extract local image features;
Step 3.2.2: I_swir is input into branch I, whose structure is {SC_MAX(16), SC_1(32), SC_MAX(32)}, where the SC_1 module has the structure [SC, BN, RL]. SC denotes channel-separable convolution, which extracts local features by point-wise convolution per channel; BN denotes batch normalization of the batch of input images, computed as in step 2.1.2 above; the RL activation function replaces negative elements with zero, making the feature map values easier to converge. SC_MAX applies local max pooling Patch_Max on top of SC_1, yielding the head pose local feature map I_s1_1 ∈ R^(N×H′×W′×C′);
Step 3.2.3: I_swir is input into branch II, whose structure is {SC_AVE(16), SC_2(32), SC_AVE(32)}, where the SC_2 module has the structure [SC, BN, TH]. The TH activation function normalizes element values to (−1, 1), making the network easier to converge. SC_AVE applies local average pooling Patch_Ave on top of SC_2, yielding the head pose local feature map I_s2_1 ∈ R^(N×H′×W′×C′).
According to the scheme, the pixel-space Transformer module is trained as follows:
Step 3.3.1: the local feature maps I_s1_1 and I_s2_1 are input into the dual-branch, two-stage pixel-space Transformer network to extract global pixel-space features and generate fusion feature maps;
Step 3.3.2: I_s1_1 is input into branch I, whose first stage has the structure {SC_MAX(32)-Transformer-Patch_Max} and whose second stage has the structure {SC_MAX(32)-Transformer}. The pixel-space Transformer layer is a cascade of three pixel-space Transformer encoders and extracts the global feature relationships of the pixel space;
Step 3.3.3: the pixel-space Transformer encoder stretches the output feature map I_sc ∈ R^(N×H″×W″×C′) of the channel-separable convolution layer SC into a three-dimensional embedding vector I_emb ∈ R^(N×A′×C′), where A′ = H″ × W″;
Step 3.3.4: for the embedding vector of each image (i ∈ [1, N]) a position code I_P is added to every element, i.e. every pixel point, with indices m ∈ [0, A′−1] and n ∈ [0, (C′−1)/2]; the embedding vector I_emb is updated to I_emb + I_P, which is input into the multi-head self-attention mapping module;
Step 3.3.5: the multi-head self-attention mapping module contains 8 self-attention heads. The self-mapping weight matrix of each head is obtained from the input; the vector R_v and the key-value vector P_v are obtained by dot-multiplying the input I_emb + I_P with this weight matrix, and a nonlinear transformation gives the self-attention map of each head. The outputs of the 8 heads are combined to form the final output I_A of the multi-head self-attention mapping module;
Step 3.3.6: the output I_A of the multi-head self-attention mapping module passes through the residual normalization layer and the fully connected layer to give the output of the pixel-space Transformer encoder. The computation can be summarized as

I_TF = Norm(f(max(0, Norm(I_emb + I_A))) + Norm(I_emb + I_A))    (5)

where Norm normalizes the A′×C′ pixels of each layer to a standard normal distribution and f(·) denotes a linear transformation. Three pixel-space Transformer encoders are cascaded to finally obtain the head pose fusion feature map MAP_1 ∈ R^(N×H″′×W″′×C′);
Step 3.3.7: I_s2_1 is input into branch II, whose structure is similar to branch I but extracts different feature maps based on local averaging; its first stage is {SC_AVE(32)-Transformer-Patch_Ave} and its second stage is {SC_AVE(32)-Transformer}, finally yielding the head pose fusion feature map MAP_2 ∈ R^(N×H″′×W″′×C′).
According to the above scheme, the fusion feature topology graph construction module comprises construction of the fusion feature graph vertices V_M and of the topological connection matrix T. MAP_1 and MAP_2 are multiplied element-wise to give the overall fusion feature map MAP, which is mapped to a low-dimensional fusion feature vector M; N denotes the number of images in a batch, and a fusion feature topology graph is built for each frame separately. The value of a fusion feature graph vertex V_M is the fusion feature vector M of the single image, and the fusion feature topology graph shares the topological connection matrix T with the 3D point cloud topology graph. The fusion feature topology graph is constructed as G_2 = (V_M, T).
The head point cloud segmentation module works as follows: according to the two-dimensional coordinate information (I_x, I_y, m, n) provided by the face region detection box, each frame of the point cloud image is compared with the corresponding dense point cloud set pic, and the points falling in the range [I_x, I_y] to [I_x+m, I_y+n] are selected, giving the dense point cloud set of the head region pic_1 = {(x_1, y_1, z_1), ..., (x_n, y_n, z_n)}.
The head pose three-dimensional feature extraction model comprises a facial feature point 3D point cloud topology graph construction module and an adaptive graph convolution module. The facial feature point 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T. The adaptive graph convolution module extracts the weight relationship between every pair of topology graph vertices and updates the vertex values, yielding a head pose 3D point cloud topology graph of new dimension.
According to the scheme, the facial feature point 3D point cloud topology graph construction module works as follows:
Step 3.4.1: the 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T. According to the two-dimensional coordinates P_2^T of the facial feature points, the corresponding 3D point cloud coordinates are selected from the point cloud set pic_1; the value of a 3D point cloud vertex V_D is the 3D coordinate of a facial key point, pic_key = (x_key_i, y_key_i, z_key_i), i = 1, ..., 25;
Step 3.4.2: for each graph vertex the 5 vertices closest in Euclidean space are found with a KD-Tree search and connected, building the topological connection matrix T ∈ R^(N×N), where N is the number of feature points; T(i, j) = 1 means the two graph vertices are connected, otherwise T(i, j) = 0;
Step 3.4.3: the 3D point cloud topology graph is constructed as G_1 = (V_D, T).
According to the scheme, the adaptive graph convolution module is trained as follows:
Step 3.5.1: the network structure of the adaptive graph convolution module is adaptive graph convolution layer - batch normalization layer - RL activation layer - 1-dimensional convolution layer - batch normalization layer - RL activation layer. The adaptive graph convolution layer extracts the weight relationship between every pair of topology graph vertices and updates the vertex values accordingly; the 1-dimensional convolution layer further extracts the relationship along the sequence, and the batch normalization and RL activation make the network easier to converge. Finally, the graph vertex values v_i of the input feature topology graphs G are updated to 192-dimensional feature values;
Step 3.5.2: the adaptive graph convolution layer selects, for each vertex V_n of the feature graph G, the K vertices nearest to it in its neighbourhood to form vertex pairs. For each vertex pair M channels are constructed and each channel computes its feature value independently, using vector concatenation (where [A, B] denotes the concatenation of vectors A and B), the dot product ⊙, an MLP layer, and the nonlinear RL activation RL(·), which converts negative elements to 0;
Step 3.5.3: the feature value of each vertex is updated to an M-dimensional feature vector; the K vertex-pair feature values are concatenated and channel-wise max pooling gives the updated graph vertex, where K is set to 6 and M to 192.
According to the scheme, the self-attention weighting module works as follows:
Step 3.6.1: the self-attention weighting module comprises a self-attention layer, a fully connected layer and a softmax regression layer. The updated three-dimensional point cloud topology graph and the two-dimensional fusion feature topology graph obtained from the previous modules are input into the self-attention layer and updated;
Step 3.6.2: the fully connected layer, a linear transformation f(·), maps each updated graph to a vector of dimension 1×N′; the softmax function then computes the weighting parameters α_1 and α_2 of the two graphs, and the final weighted fusion feature topology graph is obtained as their weighted combination.
According to the scheme, the Cauchy label distribution regression module works as follows:
Step 3.6.3: the weighted fusion feature topology graph is mapped to a multidimensional feature vector through a fully connected layer, the precise head pose angles are regressed, and the mean absolute error (MAE) with respect to the true angles is computed as the loss function Loss_M;
Step 3.6.4: since the similarity between head poses differs along the Yaw, Pitch and Roll directions for the same change in head pose angle, the deflection range {−90°, ..., 0, ..., 90°} of the three directions is divided into 46, 100 and 62 segments respectively, i.e. the angles are encoded into the corresponding label sets A = {A_1, ..., A_45}, B = {B_1, ..., B_99} and C = {C_1, ..., C_61};
Step 3.6.5: for each training image I_i the actual angle labels are converted into Cauchy label distributions. For the yaw direction the distribution is centred on t_y, the code value corresponding to the true yaw angle, with the label width parameter δ_1 set to 4; for the pitch direction it is centred on t_p, the code value of the true pitch angle, with δ_2 set to 10; and for the roll direction it is centred on t_r, the code value of the true roll angle, with δ_3 set to 6. At the same time, the module trains the network to generate three groups of parameters δ, η and ζ, corresponding to the three label sets A, B and C respectively, giving the predicted Cauchy label probability distributions (P_A(I_i; δ), P_B(I_i; η), P_C(I_i; ζ));
Step 3.6.6: the spatial distance Loss_θ and the KL divergence between the predicted Cauchy label probability distributions and the actual Cauchy label distributions are computed as the loss function Loss_G, which is weighted with the loss function Loss_M to give the final loss function Loss_total = Loss_G + 0.06·Loss_M.
According to the scheme, the optimal network parameters are obtained in advance by training on a training set with this loss function; inputting the learner's short-wave infrared image and 3D point cloud data into the pre-trained multidimensional feature fusion self-attention network then yields the final head pose angles (Yaw, Pitch, Roll).
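As an illustration, the Cauchy label distribution for one angle direction and the combined loss could look like this (a sketch: the Cauchy density over the label bins, normalized to sum to 1, is an assumed concrete form of the label distribution described above, with δ_1 = 4 for yaw; the KL term stands in for Loss_G, and 0.06 is the weight given in the text):

```python
import torch
import torch.nn.functional as F

def cauchy_label_distribution(true_bin, num_bins, scale):
    """Soft label over angle bins: a Cauchy curve centred on the true bin, normalized to sum to 1."""
    i = torch.arange(num_bins, dtype=torch.float32)
    density = scale / (torch.pi * ((i - true_bin) ** 2 + scale ** 2))
    return density / density.sum()

def total_loss(pred_logits, true_bin, pred_angle, true_angle, scale=4.0, mae_weight=0.06):
    """Loss_total = Loss_G + 0.06 * Loss_M, with Loss_G taken here as the KL divergence between
    predicted and target Cauchy label distributions and Loss_M the MAE of the regressed angle."""
    target = cauchy_label_distribution(true_bin, pred_logits.numel(), scale)
    log_pred = F.log_softmax(pred_logits, dim=0)
    loss_g = F.kl_div(log_pred, target, reduction="sum")
    loss_m = (pred_angle - true_angle).abs()
    return loss_g + mae_weight * loss_m

logits = torch.randn(46)                       # predicted yaw label scores over 46 bins
loss = total_loss(logits, true_bin=23, pred_angle=torch.tensor(2.5), true_angle=torch.tensor(4.0))
```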
Step 4: judge the fatigue level by combining the blood oxygen saturation and ECG signal changes, and comprehensively evaluate the learner's attention state according to the head pose angles and facial feature point positions of the learner at different moments. Judge whether the head pose falls within the non-concentration zone: if so, the learner is not concentrating at that moment; otherwise the learner is concentrating.
Table 1: Learner attention state comprehensive evaluation rules
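Because Table 1 is not reproduced in this text, the following sketch only illustrates the shape such a rule could take; every threshold below (the head-angle limits, the eye-closure and yawn criteria, the fatigue cut-off) is a hypothetical placeholder, not a value taken from the patent:

```python
def attention_state(yaw, pitch, roll, eyes_closed, yawning, fatigue_level,
                    yaw_limit=40.0, pitch_limit=30.0):
    """Coarse rule-based evaluation combining head pose, facial cues and fatigue.
    All limits here are placeholders; the patent's actual rules are given in Table 1."""
    head_off_target = abs(yaw) > yaw_limit or abs(pitch) > pitch_limit
    drowsy_face = eyes_closed or yawning
    if head_off_target or drowsy_face or fatigue_level >= 3:
        return "not focused"            # would trigger a reminder to the learner
    return "focused"

state = attention_state(yaw=55.0, pitch=5.0, roll=2.0,
                        eyes_closed=False, yawning=False, fatigue_level=1)
```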
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A learning attention state assessment method based on a multidimensional feature fusion network, characterized by comprising the following steps:
acquiring learner video captured by a binocular imaging device (a short-wave infrared camera and a lidar scanner) on the desk and dividing it into multiple frames of images; acquiring the learner's blood oxygen saturation and heart rate signals with a hand-worn wearable device;
locating the face region and facial feature points in the learner's SWIR image, and segmenting the 3D point cloud set of the head region;
inputting the face-region SWIR image and the head 3D point cloud set into the corresponding feature extraction networks to obtain feature topology graphs, fusing them with a self-attention weighting module, and inputting the result into a Cauchy label distribution regression module to obtain the learner's head pose angles; extracting blood oxygen saturation and heart rate variability features and determining the learner's fatigue level;
comprehensively evaluating the attention state according to the learner's head pose angles, the changes in facial feature point positions and the fatigue level, and reminding the learner if attention is not focused;
compiling statistics on the attention state during the learning process and feeding back a statistical analysis report.
2. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the face region and facial feature point positioning module works as follows:
Step 1.1.1: each frame of the SWIR image of the subject is resized to 624×624 pixels and input into a pre-trained lightweight Mask R-CNN network to obtain the face region (I_x, I_y, m, n);
Step 1.2.1: the cropped face-region SWIR image is input into the global coarse feature point extraction network RG-Net with the structure {conv1-res1-res2-res3-glDSC-fc}, where res denotes a residual connection layer, glDSC a global channel-separable convolution and fc a fully connected layer, and the final global coarse feature point coordinate vector P_0 is regressed;
Step 1.2.2: the res1-layer output feature map of RG-Net is taken, a p×q patch centred on each coarse feature point (x_j, y_j) is cropped and input into the local refinement network FL-Net to extract a feature vector, and the first-level refined facial feature point coordinate vector P_1 is regressed;
Step 1.2.3: the conv1-layer output feature map of RG-Net is taken, a p×q patch centred on each coarse feature point (x_j, y_j) is cropped and input into the local refinement network FL-Net to extract a feature vector, and the second-level refined facial feature point coordinate vector P_2^T, i.e. the sparse facial feature point coordinate vector, is regressed.
3. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the head pose two-dimensional feature extraction model comprises a channel-separable convolution module, a pixel-space Transformer module, a fusion feature topology graph construction module and an adaptive graph convolution module.
4. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 3, wherein the channel-separable convolution module is trained as follows:
Step 2.1.1: the SWIR face-region images I_swir ∈ R^(N×H×W×C) are input into the dual-branch channel-separable convolution network to extract local features of the two-dimensional images;
Step 2.1.2: branch I has the structure {SC_MAX(16)-SC_1(32)-SC_MAX(32)}, where the SC_1 module has the structure [SC, BN, RL]; SC denotes channel-separable convolution, which extracts local features by point-wise convolution per channel, BN denotes batch normalization of the batch of input images, and the RL activation function replaces negative elements with zero; SC_MAX applies local max pooling on top of SC_1 to obtain the head pose local feature map I_s1_1 ∈ R^(N×H′×W′×C′);
Step 2.1.3: branch II has the structure {SC_AVE(16)-SC_2(32)-SC_AVE(32)}, where the SC_2 module has the structure [SC, BN, TH]; the TH activation function normalizes elements to (−1, 1), and SC_AVE applies local average pooling on top of SC_2 to finally obtain the head pose local feature map I_s2_1 ∈ R^(N×H′×W′×C′).
5. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 3, wherein the pixel-space Transformer module is trained as follows:
Step 2.2.1: I_s1_1 is input into branch I, whose first stage is {SC_MAX(32)-Transformer-Patch_Max} and whose second stage is {SC_MAX(32)-Transformer}; the pixel-space Transformer layer is a cascade of three pixel-space Transformer encoders and extracts the global feature relationships of the pixel space;
Step 2.2.2: the pixel-space Transformer encoder stretches the output feature map I_sc ∈ R^(N×H″×W″×C′) of the SC layer into a three-dimensional vector I_emb and adds the position code I_P; I_emb + I_P is input into the multi-head self-attention mapping module, whose final output I_A is obtained by dot-multiplying the input with the self-mapping weight matrices and applying a nonlinear transformation; I_A passes through the residual normalization layer and the fully connected layer to give the output of the Transformer encoder; three pixel-space Transformer encoders are cascaded to finally obtain the head pose fusion feature map MAP_1 ∈ R^(N×H″′×W″′×C′);
Step 2.2.3: I_s2_1 is input into branch II, which is similar in structure to branch I but extracts different feature maps based on local averaging Patch_Ave, finally obtaining the head pose fusion feature map MAP_2 ∈ R^(N×H″′×W″′×C′).
6. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 4, wherein the fusion feature topology graph construction module comprises construction of the fusion feature graph vertices V_M and of the topological connection matrix T; MAP_1 and MAP_2 are multiplied element-wise and mapped through a fully connected layer to a low-dimensional fusion feature vector M; the value of a head pose fusion feature graph vertex V_M is the fusion feature vector M of a single image, and the head pose fusion feature topology graph shares the topological connection matrix T with the 3D point cloud topology graph.
7. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the head pose three-dimensional feature extraction model comprises a facial feature point 3D point cloud topology graph construction module and an adaptive graph convolution module.
8. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 7, wherein the facial feature point 3D point cloud topology graph construction module works as follows:
Step 3.1.1: the 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T; according to the two-dimensional coordinates P_2^T of the facial feature points, the corresponding 3D point cloud coordinates are selected from the point cloud set pic_1, and the value of a 3D point cloud vertex V_D is the 3D coordinate of a facial key point;
Step 3.1.2: for each graph vertex the 5 vertices closest in Euclidean space are found with a KD-Tree search and connected, building the topological connection matrix T ∈ R^(N×N), where N is the number of feature points; T(i, j) = 1 means the two graph vertices are connected, otherwise T(i, j) = 0.
9. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the adaptive graph convolution module works as follows:
Step 4.1.1: for each vertex V_n of the feature graph G, the K vertices nearest to it in its neighbourhood are selected to form vertex pairs, and the feature values are updated;
Step 4.1.2: for each vertex pair M channels are constructed and each channel computes its feature value independently, using vector concatenation (where [A, B] denotes the concatenation of vectors A and B), the dot product ⊙, an MLP layer, and the nonlinear RL activation RL(·), which converts negative elements to 0; the feature values of all vertex pairs are concatenated and channel-wise max pooling gives the updated M-dimensional graph vertex.
10. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the blood oxygen saturation and ECG feature extraction module computes the mean square error σ_sp of the SpO2 sampling points within one period, the standard deviation σ_RR of the intervals between the R waves of adjacent heartbeat signals, and the ratio γ of the high-frequency to ultra-low-frequency energy spectral density within adjacent R-wave periods.
Application CN202211662783.0A, filed 2022-12-23: Learning attention state assessment method based on multidimensional feature fusion network (status: Pending)

Priority Applications (1)

CN202211662783.0A, priority and filing date 2022-12-23: Learning attention state assessment method based on multidimensional feature fusion network


Publications (1)

CN117173758A, published 2023-12-05

Family

ID=88935740



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination