CN116127350A - Learning concentration monitoring method based on Transformer network - Google Patents

Learning concentration monitoring method based on Transformer network

Info

Publication number
CN116127350A
Authority
CN
China
Prior art keywords
emotion
concentration
learning
time domain
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211596338.9A
Other languages
Chinese (zh)
Inventor
刘海
张�诚
刘婷婷
张昭理
朱晓倩
宋林森
林丹月
王镜淇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202211596338.9A priority Critical patent/CN116127350A/en
Publication of CN116127350A publication Critical patent/CN116127350A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Health & Medical Sciences (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a learning concentration monitoring method based on a Transformer network, which comprises the following steps: acquiring a head posture RGB image, environment sound data and a skin electric signal of a monitored object within a monitoring time period, and dividing the monitoring time period into a plurality of time-domain segments in chronological order; inputting the head posture RGB image, environment sound data and skin electric signal of each time-domain segment into a multi-modal information classification model to obtain a head posture angle, an environment sound category and an emotion state category, and performing a weighted summation with preset weights to obtain an emotion concentration score for each time-domain segment; and fusing the emotion concentration scores of the time-domain segments according to a preset rule to obtain a concentration estimation result for the monitored object over the monitoring time period. By introducing complementary multi-modal data and temporal information, the invention enables comprehensive, whole-process evaluation of students' learning emotion concentration and helps analyze the factors and individual differences that affect learners' concentration.

Description

Learning concentration monitoring method based on Transformer network
Technical Field
The present disclosure relates to the field of pattern recognition and information processing technologies, and in particular, to a method, a system, a device, and a storage medium for monitoring learning concentration based on a Transformer network.
Background
Learning emotion concentration is an important indicator for evaluating students' learning effectiveness and learning quality, and is a key problem in studying learner states. Image- and audio-based behavior recognition is widely used to evaluate students' learning emotion concentration, analyze their learning states, and help them improve concentration. As the level of informatization keeps rising, computer devices, environmental interference and similar distractions make it easy for students to lose focus while learning, and their ability to concentrate declines continuously; concentration is especially difficult to evaluate in online teaching scenarios that lack teacher supervision. A method for quantitatively evaluating students' learning emotion concentration is therefore needed.
At present, the learning concentration state of students is mostly evaluated through head pose estimation. However, head pose estimation is difficult because the head is approximately a rigid body and offers little auxiliary detail beyond the facial expression, the information varies greatly across near and far scales, and the result is easily affected by illumination changes in the learning environment; all of these factors can bias the concentration evaluation.
Disclosure of Invention
Aiming at at least one defect or improvement requirement of the prior art, the invention provides a learning concentration monitoring method, system, device and storage medium based on a Transformer network, which perform learning concentration evaluation by introducing complementary multi-modal data and temporal information, thereby enabling all-round, whole-process evaluation of students' learning emotion concentration and improving the accuracy and credibility of the results.
To achieve the above object, according to a first aspect of the present invention, there is provided a method for monitoring learning concentration based on a Transformer network, the method comprising the steps of:
acquiring a head posture RGB image, environment sound data and a skin electric signal of a monitored object in a monitoring time period, and dividing the monitoring time period into a plurality of time-domain segments in chronological order;
inputting the corresponding head posture RGB image, the environment sound data and the skin electric signal in each time domain segment into a trained multi-mode information classification model to obtain a head posture angle, an environment sound category and an emotion state category;
coding the head posture angle, the environment sound category and the emotion state category, and carrying out weighted summation according to preset weights to obtain emotion concentration degree scores of the monitored object in each time domain segment;
and fusing emotion concentration scores of the time domain segments according to a preset rule to obtain concentration estimation results of the monitored object in the monitoring time period.
Further, in the learning concentration monitoring method, the multi-modal information classification model includes a head pose recognition model for predicting a head pose angle from the head posture RGB image; the head pose recognition model includes:
the feature extraction network is used for extracting multi-scale feature vectors from the input head posture RGB image;
the first coding layer is used for mapping the multi-scale feature vectors to the same dimension through linear transformation, generating one-dimensional feature vectors and embedding randomly initialized learnable parameters;
a multi-layer Transformer network, which calculates the one-dimensional feature vector output by the first coding layer based on the attention mechanism to obtain a final head posture feature vector;
and the first classification network performs classification calculation according to the head posture feature vector to obtain the head posture angle of the monitored object.
Further, in the learning concentration monitoring method, the multi-mode information classification model includes an environmental sound classification model for predicting environmental sound classes according to environmental sound data; the environmental sound classification model includes:
the second coding layer is used for coding the environmental sound data to generate a one-dimensional sound feature vector;
the multi-layer Transformer network is used for calculating the one-dimensional sound feature vector output by the second coding layer based on the attention mechanism to obtain a final environment sound feature vector;
and the second classification network performs classification calculation according to the environmental sound feature vector to obtain the environmental sound type of the environment where the monitored object is currently located.
Further, in the learning concentration monitoring method, the multi-modal information classification model includes a skin-electric emotion classification model for predicting and obtaining an emotion state category according to skin-electric signals; the skin-electric emotion classification model comprises:
the third coding layer is used for coding the skin electric signal to generate a one-dimensional physiological characteristic vector;
the multi-layer Transformer network is used for calculating the one-dimensional physiological characteristic vector output by the third coding layer based on the attention mechanism to obtain a final physiological emotion characteristic vector;
and the third classification network performs classification calculation according to the physiological emotion feature vector to obtain the emotion state category of the monitored object.
Further, in the learning concentration monitoring method, the head gesture recognition model, the environmental sound classification model and the skin electric emotion classification model share the same multi-layer Transformer network.
Further, in the learning concentration monitoring method, the first classification network includes three continuous full connection layers;
the output of the first full-connection layer is respectively connected with the second full-connection layer and the third full-connection layer; the output of the second fully-connected layer is connected with the third fully-connected layer.
Further, in the learning concentration monitoring method, the fusing the emotion concentration scores of the time domain segments according to a preset rule includes:
and giving weights to the learning content corresponding to each time domain segment according to the importance degree of the learning content, and carrying out weighted summation on emotion concentration degree scores corresponding to each time domain segment according to the weights.
According to a second aspect of the present invention, there is also provided a learning concentration monitoring system based on a Transformer network, the system comprising:
the data acquisition module is used for acquiring a head posture RGB image, environment sound data and a skin electric signal of a monitored object in a monitoring time period, and dividing the monitoring time period into a plurality of time-domain segments in chronological order;
the classification module is used for inputting the corresponding head posture RGB image, the environment sound data and the skin electric signal in each time domain segment into the trained multi-mode information classification model to obtain a head posture angle, an environment sound category and an emotion state category;
the concentration degree calculation module is used for encoding the head gesture angle, the environment sound category and the emotion state category, and carrying out weighted summation according to preset weights to obtain emotion concentration degree scores of the monitored object in each time domain segment;
and the result output module is used for fusing the emotion concentration scores of the time-domain segments according to a preset rule to obtain the concentration estimation result of the monitored object in the monitoring time period.
According to a third aspect of the present invention there is also provided a computer device comprising at least one processing unit, and at least one storage unit, wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the steps of the learning concentration monitoring method of any of the above.
According to a fourth aspect of the present invention there is also provided a storage medium storing a software program executable by a computer device, the software program when run on the computer device causing the computer device to perform the steps of the learning concentration monitoring method of any one of the preceding claims.
In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:
(1) The invention introduces complementary multi-modal data and temporal information to evaluate learning concentration, enabling all-round, whole-process evaluation of students' learning emotion concentration; the Transformer architecture yields more accurate multi-modal classification results for the learner and therefore a more accurate evaluation of the student's emotion concentration, which in turn supports the student's development.
(2) The multi-mode information fusion index system adopted by the invention can reasonably evaluate the emotion concentration degree of the learner.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of the learning concentration monitoring method based on a Transformer network according to the present embodiment;
fig. 2 is a schematic view of a scenario of data acquisition provided in the present embodiment;
fig. 3 is a schematic diagram of a network structure of a multi-modal information classification model according to the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The terms first, second, third and the like in the description and in the claims of the application and in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Furthermore, well-known or widely-used techniques, elements, structures, and processes may not be described or shown in detail in order to avoid obscuring the understanding of the present invention by the skilled artisan. Although the drawings represent exemplary embodiments of the present invention, the drawings are not necessarily to scale and certain features may be exaggerated or omitted in order to better illustrate and explain the present invention.
Fig. 1 is a flow chart of the learning concentration monitoring method based on a Transformer network according to the present embodiment; referring to fig. 1, the method mainly includes the following steps:
s1, acquiring a head posture RGB image, environment sound data and a skin electric signal of a monitored object in a monitoring time period, and dividing the monitoring time period into a plurality of time domain fragments according to time sequence;
in this embodiment, the monitored object is a learner who is learning online.
Fig. 2 is a schematic view of the data acquisition scenario provided in this embodiment. As shown in fig. 2, an RGB camera installed on the learning video playing device may be used to capture RGB images of the learner's head pose, an environmental sound recording device may be used to record the environmental sound during learning, and a contact-type skin electric acquisition device may be used to acquire the skin electric signal during learning. The three kinds of information obtained are then preprocessed and divided into time-series segments.
Specifically, the RGB camera installed on the learning video playing device captures RGB images of the learner's head pose; an appropriate visual capture interval can be preset, and the RGB camera is triggered to collect the learner's head pose information at this preset interval. The acquired head posture RGB images are then subjected to resolution compression, image rescaling and similar preprocessing.
The environmental sound is recorded by a dedicated environmental sound recording device placed beside the learner. The environmental audio of the learner's study process is then extracted through signal noise reduction and format conversion.
The learner's skin electric signal is captured by a skin electric contact piece fixed on the learner; the signal is then extracted through signal noise reduction and format conversion processing.
The information recorded over the whole process is divided into a plurality of time-domain segments according to a time-series segmentation rule, with a fixed segmentation interval ε ∈ (5 ± 2) s.
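As an illustrative sketch (not part of the original disclosure), the segmentation step could be implemented as follows; the 5 s segment length is taken from the disclosed (5 ± 2) s interval, while the function and field names are assumptions made for the example.

```python
# Hypothetical sketch: cut three synchronized recordings into time-domain segments.
# All names are illustrative; the 5 s default comes from the (5 +/- 2) s interval above.
import numpy as np

def split_into_segments(frames, audio, gsr, duration_s, segment_s=5.0):
    """frames: list of (timestamp_s, rgb_image); audio, gsr: 1-D numpy arrays."""
    audio_sr = len(audio) / duration_s          # effective audio sampling rate
    gsr_sr = len(gsr) / duration_s              # effective skin-electric sampling rate
    segments, t = [], 0.0
    while t < duration_s:
        t_end = min(t + segment_s, duration_s)
        segments.append({
            "images": [img for ts, img in frames if t <= ts < t_end],
            "audio": audio[int(t * audio_sr): int(t_end * audio_sr)],
            "gsr": gsr[int(t * gsr_sr): int(t_end * gsr_sr)],
        })
        t = t_end
    return segments
```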
S2, inputting the corresponding head posture RGB image, the environment sound data and the skin electric signal in each time domain segment into a trained multi-mode information classification model to obtain a head posture angle, an environment sound category and an emotion state category;
fig. 3 is a schematic diagram of a network structure of a multi-modal information classification model according to the present embodiment; referring to fig. 3, in the present embodiment, the multimodal information classification model includes a head gesture recognition model, an environmental sound classification model, and a skin-electric emotion classification model; the head gesture recognition model is used for predicting and obtaining a head gesture angle according to the head gesture RGB image; the environmental sound classification model is used for predicting environmental sound types according to the environmental sound data; the skin-electric emotion classification model is used for predicting and obtaining emotion state types according to skin-electric signals.
In an alternative embodiment, the head pose recognition model includes a feature extraction network, a first coding layer, a multi-layer Transformer network and a first classification network, wherein:
the feature extraction network is used for extracting multi-scale feature vectors from the input head posture RGB image. In this embodiment, the feature extraction network adopts a pretrained ResNet-50 to extract multi-scale features of the head pose image. The model has a 50-layer structure; the input is a 3×224×224 tensor, and after the 50 convolutional layers the output size is 2048×7×7. The feature maps of the last three stages are selected and spliced with the concatenation function (torch.cat) of the PyTorch framework, and the multi-scale feature vectors are output.
The specific processing procedure is as follows:
For a head image x ∈ R^(H×W×C), the image is first fed into the ResNet to obtain the multi-scale feature vectors. The ResNet operation is formalized as:
C = F(x, {W_i}) + x    (1)
where x is the input image, C is the output feature, and F is the multi-layer mapping, whose specific expression is:
F = W_2 σ(W_1 x)    (2)
where σ is the ReLU activation function, defined as:
σ(x) = max(0, x)    (3)
After the ResNet processing, the feature maps C_1, C_2, C_3 of the last three stages are obtained:
{C_1, C_2, C_3} = ResNet(x)    (4)
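As a sketch of this feature-extraction step (not from the patent), the last three stage outputs of a pretrained torchvision ResNet-50 could be collected as follows; the choice of layer2, layer3 and layer4 as the "last three stages" and the module names are assumptions.

```python
# Hypothetical sketch: collect the last three stage feature maps of a pretrained
# torchvision ResNet-50 (here assumed to be layer2, layer3 and layer4).
import torch
import torchvision

class MultiScaleBackbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = torch.nn.Sequential(resnet.conv1, resnet.bn1,
                                        resnet.relu, resnet.maxpool)
        self.layer1 = resnet.layer1        # 256 x 56 x 56 for a 224 x 224 input
        self.layer2 = resnet.layer2        # 512 x 28 x 28
        self.layer3 = resnet.layer3        # 1024 x 14 x 14
        self.layer4 = resnet.layer4        # 2048 x 7 x 7

    def forward(self, x):                  # x: B x 3 x 224 x 224
        x = self.stem(x)
        c1 = self.layer1(x)
        c2 = self.layer2(c1)
        c3 = self.layer3(c2)
        c4 = self.layer4(c3)
        return c2, c3, c4                  # feature maps of the last three stages

features = MultiScaleBackbone()(torch.randn(1, 3, 224, 224))
```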
the first coding layer is used for mapping the multi-scale feature vectors to the same dimension through linear transformation, generating one-dimensional feature vectors and embedding randomly initialized learnable parameters;
the multi-scale feature map output by the feature extraction network is input into a first coding layer, and the first coding layer flattens the multi-scale feature map into a one-dimensional vector, and the formula is expressed as follows:
Figure BDA0003992972710000073
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0003992972710000074
is a flattening operation. The vector of different dimensions is transformed into the same dimension d=256 by linear reforming, expressed as:
E:p′ i →v i ∈R d (6)
E i is a linear shift reforming operation; v i One-dimensional multi-scale feature vectors are obtained for the reformation.
The multi-scale feature vectors are then concatenated and a 2D positional encoding pe_i is added to form the visual vector [visual], where v_i, i = 1, 2, …, L are the one-dimensional multi-scale feature vectors:
[visual] = {v_1 + pe_1, v_2 + pe_2, …, v_L + pe_L}    (7)
The visual vector and the head pose token are then spliced together to form T, which serves as the input of the multi-layer Transformer network:
T = {[visual], [head pose token]}    (8)
where [head pose token] denotes a randomly initialized learnable parameter.
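A sketch of this first coding layer is given below (not part of the original disclosure): each stage feature map is flattened, linearly projected to d = 256, a positional embedding is added, and a learnable head pose token is appended. Using a learned positional embedding in place of a fixed 2D positional encoding, and the ordering of the token, are assumptions.

```python
# Hypothetical sketch of the first coding layer: flatten, project to d = 256,
# add a positional embedding, append a learnable [head pose token].
import torch

class FirstCodingLayer(torch.nn.Module):
    def __init__(self, stage_channels=(512, 1024, 2048), d=256, max_tokens=2048):
        super().__init__()
        self.proj = torch.nn.ModuleList(torch.nn.Linear(c, d) for c in stage_channels)
        self.pos_embed = torch.nn.Parameter(torch.zeros(1, max_tokens, d))
        self.head_pose_token = torch.nn.Parameter(torch.zeros(1, 1, d))

    def forward(self, feature_maps):                 # each: B x C_i x H_i x W_i
        tokens = []
        for fmap, proj in zip(feature_maps, self.proj):
            flat = fmap.flatten(2).transpose(1, 2)   # B x (H_i*W_i) x C_i
            tokens.append(proj(flat))                # B x (H_i*W_i) x d
        visual = torch.cat(tokens, dim=1)            # splice the scales together
        visual = visual + self.pos_embed[:, :visual.size(1)]
        token = self.head_pose_token.expand(visual.size(0), -1, -1)
        return torch.cat([visual, token], dim=1)     # T = {[visual], [head pose token]}
```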
The multi-layer Transformer network processes the one-dimensional feature vectors output by the first coding layer with an attention mechanism to obtain the final head pose feature vector.
The multi-layer Transformer network consists of several Transformer blocks, and each block is computed in the following steps:
step (1): layer normalization operation
Figure BDA0003992972710000081
Figure BDA0003992972710000082
/>
Figure BDA0003992972710000083
Wherein a is l As the original output of the first layer,
Figure BDA0003992972710000084
the results obtained after layer normalization.
Step (2): compute single-head self-attention (SA):
SA(T) = softmax( (T W^Q)(T W^K)^T / sqrt(d/h) ) (T W^V)    (12)
where W^Q, W^K and W^V are learnable network parameters and h is the number of self-attention heads; preferably h is set to 8, i.e. 8-head self-attention is used.
Step (3): compute multi-head self-attention (MSA):
MSA(T) = [SA_1(T); SA_2(T); …; SA_h(T)] W^P    (13)
A residual connection links the output of the multi-head self-attention to the original input:
z'_l = MSA(LN(z_{l−1})) + z_{l−1}    (14)
Step (4): perform layer normalization again, followed by a fully connected feed-forward network, whose output is also residually connected to its input:
z_l = MLP(LN(z'_l)) + z'_l    (15)
This completes the operation of a single Transformer block.
Step (5): the data are passed through M stacked Transformer blocks to obtain the final head pose feature vector; preferably, M is set to 12.
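A compact sketch of one such pre-norm Transformer block, matching formulas (9) to (15), is given below; it is illustrative only, and the use of nn.MultiheadAttention, the GELU activation and the MLP width are assumptions rather than details given in the text.

```python
# Hypothetical sketch of one pre-norm Transformer block (formulas (9)-(15)).
import torch

class TransformerBlock(torch.nn.Module):
    def __init__(self, d=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = torch.nn.LayerNorm(d)
        self.attn = torch.nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = torch.nn.LayerNorm(d)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(d, d * mlp_ratio),
            torch.nn.GELU(),                      # activation choice is an assumption
            torch.nn.Linear(d * mlp_ratio, d))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                    # z  = MLP(LN(z')) + z'
        return z

# M = 12 stacked blocks form the multi-layer Transformer network
transformer = torch.nn.Sequential(*[TransformerBlock() for _ in range(12)])
```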
The first classification network is used for performing classification calculation according to the head posture feature vector to obtain the head posture angle of the monitored object.
The head pose feature vector output by the multi-layer Transformer is taken as the input of the first classification network, and the different head pose angles are computed through fully connected layers and a softmax activation. In this embodiment, the first classification network consists of three consecutive fully connected layers that are also densely connected to one another: the output of the first fully connected layer is fed to both the second and the third layer, so its features are superimposed on them; the output of the second layer is likewise fed to the third layer. This realizes a three-layer dense connection, and the network finally outputs the 3 Euler angles. The specific calculation is as follows:
h_j = σ( Σ_i w_{ji} x_i + b_j )    (16)
where h_j is the j-th neuron of the fully connected layer, and w and b are the connection parameters between this neuron and the previous hidden layer; σ is the activation unit:
σ(x) = max(0, x)    (17)
The output of the third fully connected layer is passed through a softmax operation to obtain the three Euler angles of the head pose, i.e. the head pose classification result A_cls. The softmax operation is:
softmax(x_i) = exp(x_i) / Σ_k exp(x_k)    (18)
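The densely connected classification head described above might look as follows; this is an illustrative sketch, and the hidden width and the use of three output units are assumptions.

```python
# Hypothetical sketch of the densely connected three-layer classification head:
# fc1 feeds both fc2 and fc3, fc2 also feeds fc3, and softmax yields A_cls.
import torch

class DenseClassificationHead(torch.nn.Module):
    def __init__(self, d=256, hidden=128, num_outputs=3):
        super().__init__()
        self.fc1 = torch.nn.Linear(d, hidden)
        self.fc2 = torch.nn.Linear(hidden, hidden)
        self.fc3 = torch.nn.Linear(hidden + hidden, num_outputs)  # dense inputs
        self.act = torch.nn.ReLU()

    def forward(self, feat):                        # feat: B x d (head pose token)
        h1 = self.act(self.fc1(feat))
        h2 = self.act(self.fc2(h1))                 # fc1 -> fc2
        h3 = self.fc3(torch.cat([h1, h2], dim=-1))  # fc1 and fc2 -> fc3
        return torch.softmax(h3, dim=-1)            # classification result A_cls
```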
in this embodiment, the training process of the head gesture recognition model is as follows:
step 1.1: constructing a training sample set and a test sample set;
step 1.2: data enhancement is carried out on the test set samples, and data preprocessing plays a vital role in training of the neural network. The data enhancement method comprises the following steps: sample enhancement is carried out on the RGB image in the modes of rotation, translation, scale transformation and the like, so that the robustness of the model is enhanced; at the same time, these operations also provide a large number of counterfeit samples for model training.
Step 1.3: train the head pose recognition model with the training sample set. The loss between the Gaussian distribution of the predicted head pose and the Gaussian distribution of the true head pose of the training samples is computed with a cross-entropy loss function, and gradient optimization is performed with the AdamW optimizer. The initial learning rate is set to 0.01 and, after a number of training epochs, is reduced with a cosine annealing schedule so that the network learns more stably, until the loss value no longer decreases.
step 1.4: and performing fine adjustment learning on the head gesture recognition model by using a test sample set, wherein the learning rate is set to be 5e-6.
In an alternative embodiment, the environmental sound classification model includes a second coding layer, a multi-layer Transformer network and a second classification network.
The second coding layer is used for coding the environmental sound data to generate a one-dimensional sound feature vector.
The multi-layer Transformer network processes the one-dimensional sound feature vector output by the second coding layer with an attention mechanism to obtain the final environmental sound feature vector.
The second classification network performs classification on the environmental sound feature vector to obtain the environmental sound category of the environment where the monitored object is currently located. Specifically, the second classification network computes the different environmental sound categories from the environmental sound feature vector through fully connected layers and a softmax activation.
In this embodiment, the training process of the environmental sound classification model is as follows:
step 2.1: constructing a training sample set and a test sample set;
step 2.2: training the environmental sound classification model by using a training sample set, calculating loss of the predicted environmental sound class and the real environmental sound class of the training sample through a cross entropy loss function, performing gradient optimization by using an AdamW optimizer, setting the initial learning rate to be 0.01, reducing the learning rate through a cosine annealing algorithm after a plurality of epoch training rounds, and enabling the network learning to be more stable until the loss value is not reduced;
step 2.3: and performing fine adjustment learning on the environmental sound classification model by using a test sample set, wherein the learning rate is set to be 5e-6.
The skin-electric emotion classification model includes a third coding layer, a multi-layer Transformer network and a third classification network, wherein:
the third coding layer is used for coding the skin electric signal to generate a one-dimensional physiological feature vector;
the multi-layer Transformer network processes the one-dimensional physiological feature vector output by the third coding layer with an attention mechanism to obtain the final physiological emotion feature vector;
the third classification network performs classification on the physiological emotion feature vector to obtain the emotion state category of the monitored object. The different emotion expressions are computed from the physiological emotion feature vector through fully connected layers and a softmax activation; in this embodiment, the emotion state categories are divided into three classes: positive, neutral and negative.
In this embodiment, the training process of the skin-electric emotion classification model is as follows:
step 3.1: constructing a training sample set and a test sample set;
step 3.2: training the skin-electricity emotion classification model by using a training sample set, calculating loss of a predicted emotion type and a real emotion type of the training sample through a cross entropy loss function, performing gradient optimization by using an AdamW optimizer, setting an initial learning rate to be 0.01, reducing the learning rate through a cosine annealing algorithm after a plurality of epoch training rounds, and enabling network learning to be more stable until a loss value is not reduced;
step 3.3: and performing fine adjustment learning on the skin-electric emotion classification model by using a test sample set, wherein the learning rate is set to be 5e-6.
In a preferred embodiment, the head pose recognition model, the environmental sound classification model and the skin-electric emotion classification model share the same multi-layer Transformer network. The environmental sound and skin electric information are processed in a manner similar to the head posture RGB image data, yielding, by analogy, the environmental sound category B_cls and the emotion state category C_cls.
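The shared-backbone arrangement could be organized as in the sketch below; the encoder and head modules, and the assumption that the classification token sits at the last position of each sequence, are illustrative and not specified in the text.

```python
# Hypothetical sketch: three modality-specific encoders feed one shared
# multi-layer Transformer, followed by separate classification heads producing
# A_cls (head pose), B_cls (environmental sound) and C_cls (emotion).
import torch

class MultiModalClassifier(torch.nn.Module):
    def __init__(self, pose_enc, sound_enc, gsr_enc, shared_transformer,
                 pose_head, sound_head, emotion_head):
        super().__init__()
        self.enc = torch.nn.ModuleDict(
            {"pose": pose_enc, "sound": sound_enc, "gsr": gsr_enc})
        self.backbone = shared_transformer          # the same weights for all three
        self.head = torch.nn.ModuleDict(
            {"pose": pose_head, "sound": sound_head, "emotion": emotion_head})

    def forward(self, rgb, audio, gsr):
        a = self.head["pose"](self.backbone(self.enc["pose"](rgb))[:, -1])
        b = self.head["sound"](self.backbone(self.enc["sound"](audio))[:, -1])
        c = self.head["emotion"](self.backbone(self.enc["gsr"](gsr))[:, -1])
        return a, b, c                              # A_cls, B_cls, C_cls
```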
S3, coding the head posture angle, the environment sound category and the emotion state category, and carrying out weighted summation according to preset weights to obtain emotion concentration degree scores of the monitored object in each time domain segment;
firstly, coding the Euler angle of the head gesture output by the head gesture recognition model, the environment sound category output by the environment sound classification model and the skin electric emotion category output by the skin electric emotion classification model.
A_code = ψ(A_cls),  B_code = ψ(B_cls),  C_code = ψ(C_cls)
where ψ(·) is the coding function; preferably, the head pose angle, the environment sound category and the emotion state category are encoded with one-hot encoding as the coding function.
Then, the encoded head pose angle, environment sound category and emotion state category are weighted and summed to obtain the emotion concentration score of the i-th time-domain segment:
S_i = α A_code + β B_code + γ C_code    (19)
where A_code, B_code and C_code are the head pose angle code, the environment sound code and the skin electric signal code respectively, and α, β and γ are the corresponding weights. Based on calculation, the given weight ranges are α ∈ [0.5 ± 0.15], β ∈ [0.25 ± 0.15] and γ ∈ [0.25 ± 0.15].
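A small numerical sketch of formula (19) follows. It is not part of the disclosure: the coding function ψ is sketched here as a scalar lookup table (whereas the text prefers one-hot encoding), and the category names and code values are assumptions; only the weights α = 0.5, β = 0.25, γ = 0.25 fall within the disclosed ranges.

```python
# Hypothetical sketch of formula (19): S_i = alpha*A_code + beta*B_code + gamma*C_code.
# psi(.) is approximated by scalar lookup tables; all category names/values are assumptions.
POSE_CODE = {"facing_screen": 1.0, "slightly_turned": 0.5, "turned_away": 0.0}
SOUND_CODE = {"quiet": 1.0, "speech": 0.5, "noise": 0.0}
EMOTION_CODE = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}

def segment_score(pose, sound, emotion, alpha=0.5, beta=0.25, gamma=0.25):
    """Emotion concentration score of one time-domain segment."""
    return (alpha * POSE_CODE[pose]
            + beta * SOUND_CODE[sound]
            + gamma * EMOTION_CODE[emotion])

# example: attentive pose, quiet environment, neutral emotion -> 0.875
s_i = segment_score("facing_screen", "quiet", "neutral")
```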
S4, fusing the emotion concentration scores of the time-domain segments according to a preset rule to obtain the concentration estimation result of the monitored object in the monitoring time period.
In an alternative embodiment, weights are given to the learning content corresponding to each time domain segment according to the importance degree of the learning content, and the emotion concentration scores corresponding to each time domain segment are weighted and summed according to the weights.
In this embodiment, segment-level learning emotion concentration scores are obtained at the time interval ε ∈ (5 ± 2) s: a learning emotion concentration score S_i is calculated every ε ∈ (5 ± 2) s throughout the whole learning stage. All S_i are then summed with certain weights to obtain the whole-stage emotion concentration score:
S = Σ_i λ_i · S_i    (20)
The values of the weights λ_i may be set empirically within a preset range; in a specific example they are assigned according to the importance of the learning content of each segment.
The resulting S is the whole-process learning emotion concentration score.
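A sketch of this fusion step is shown below; normalizing the content-importance weights so that they sum to one is an assumption, as are the example values.

```python
# Hypothetical sketch of formula (20): fuse per-segment scores S_i with weights
# lambda_i derived from the importance of each segment's learning content.
def overall_concentration(segment_scores, content_weights):
    total = sum(content_weights)
    lambdas = [w / total for w in content_weights]     # normalize (assumption)
    return sum(l * s for l, s in zip(lambdas, segment_scores))

# example: three segments, the middle one covering the most important content
S = overall_concentration([0.875, 0.625, 0.75], [1.0, 2.0, 1.0])
```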
It should be noted that while in the above-described embodiments the operations of the methods of the embodiments of the present specification are described in a particular order, this does not require or imply that the operations must be performed in that particular order or that all of the illustrated operations be performed in order to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
This embodiment also provides a learning concentration monitoring system based on a Transformer network, which may be implemented in software and/or hardware and integrated on a computer device. Specifically, the system comprises a data acquisition module, a classification module, a concentration calculation module and a result output module, wherein:
the data acquisition module is used for acquiring a head posture RGB image, environment sound data and a skin electric signal of a monitored object in a monitoring time period, and dividing the monitoring time period into a plurality of time-domain segments in chronological order;
the classification module is used for inputting the corresponding head posture RGB image, the environment sound data and the skin electric signal in each time domain segment into the trained multi-mode information classification model to obtain a head posture angle, an environment sound category and an emotion state category;
the concentration degree calculation module is used for coding the head gesture angle, the environment sound category and the emotion state category, and carrying out weighted summation according to preset weights to obtain emotion concentration degree scores of the monitored object in each time domain segment;
the result output module is used for fusing the emotion concentration scores of the time-domain segments according to a preset rule to obtain the concentration estimation result of the monitored object in the monitoring time period.
For specific limitations on the learning concentration monitoring system, reference may be made to the limitations on the learning concentration monitoring method hereinabove, and no further description is given here. The above-described respective modules in the learning concentration monitoring system may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
The embodiment also provides a computer device, which includes at least one processor and at least one memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the learning concentration monitoring method, and the specific steps are referred to above and are not repeated herein; in the present embodiment, the types of the processor and the memory are not particularly limited, for example: the processor may be a microprocessor, digital information processor, on-chip programmable logic system, or the like; the memory may be volatile memory, non-volatile memory, a combination thereof, or the like.
The computer device may also communicate with one or more external devices (e.g., keyboard, pointing terminal, display, etc.), with one or more terminals that enable a user to interact with the computer device, and/or with any terminals (e.g., network card, modem, etc.) that enable the computer device to communicate with one or more other computing terminals. Such communication may be through an input/output (I/O) interface. Moreover, the computer device may also communicate with one or more networks such as a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN) and/or a public network such as the internet via a network adapter.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method. The computer readable storage medium may include, among other things, any type of disk including floppy disks, optical disks, DVDs, CD-ROMs, micro-drives, and magneto-optical disks, ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional manners of dividing the actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some service interface, device or unit indirect coupling or communication connection, electrical or otherwise.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be performed by hardware associated with a program that is stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the present disclosure. That is, equivalent changes and modifications are contemplated by the teachings of this disclosure, which fall within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a scope and spirit of the disclosure being indicated by the claims.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. The method for monitoring learning concentration based on the Transformer network is characterized by comprising the following steps of:
acquiring a head posture RGB image, environment sound data and a skin electric signal of a monitored object in a monitoring time period, and dividing the monitoring time period into a plurality of time-domain segments in chronological order;
inputting the corresponding head posture RGB image, the environment sound data and the skin electric signal in each time domain segment into a trained multi-mode information classification model to obtain a head posture angle, an environment sound category and an emotion state category;
coding the head posture angle, the environment sound category and the emotion state category, and carrying out weighted summation according to preset weights to obtain emotion concentration degree scores of the monitored object in each time domain segment;
and fusing emotion concentration scores of the time domain segments according to a preset rule to obtain concentration estimation results of the monitored object in the monitoring time period.
2. The learning concentration monitoring method of claim 1, wherein the multi-modal information classification model includes a head pose recognition model for predicting a head pose angle from a head pose RGB image; the head pose recognition model includes:
the feature extraction network is used for extracting multi-scale feature vectors from the input head posture RGB image;
the first coding layer is used for mapping the multi-scale feature vectors to the same dimension through linear transformation, generating one-dimensional feature vectors and embedding randomly initialized learnable parameters;
a multi-layer Transformer network, which calculates the one-dimensional feature vector output by the first coding layer based on the attention mechanism to obtain a final head posture feature vector;
and the first classification network performs classification calculation according to the head posture feature vector to obtain the head posture angle of the monitored object.
3. The learning concentration monitoring method of claim 2, wherein the multi-modal information classification model includes an ambient sound classification model for predicting an ambient sound class based on ambient sound data; the environmental sound classification model includes:
the second coding layer is used for coding the environmental sound data to generate a one-dimensional sound feature vector;
the multi-layer Transformer network is used for calculating the one-dimensional sound feature vector output by the second coding layer based on the attention mechanism to obtain a final environment sound feature vector;
and the second classification network performs classification calculation according to the environmental sound feature vector to obtain the environmental sound type of the environment where the monitored object is currently located.
4. The learning concentration monitoring method of claim 3, wherein the multi-modal information classification model includes a skin-to-electricity emotion classification model for deriving an emotion state category from skin-to-electricity signal predictions; the skin-electric emotion classification model comprises:
the third coding layer is used for coding the skin electric signal to generate a one-dimensional physiological characteristic vector;
the multi-layer Transformer network is used for calculating the one-dimensional physiological characteristic vector output by the third coding layer based on the attention mechanism to obtain a final physiological emotion characteristic vector;
and the third classification network performs classification calculation according to the physiological emotion feature vector to obtain the emotion state category of the monitored object.
5. The learning concentration monitoring method of claim 4, wherein the head pose recognition model, the ambient sound classification model, and the skin-electric emotion classification model share the same multi-layer Transformer network.
6. The learning concentration monitoring method of claim 2, wherein the first classification network includes three consecutive fully connected layers;
the output of the first full-connection layer is respectively connected with the second full-connection layer and the third full-connection layer; the output of the second fully-connected layer is connected with the third fully-connected layer.
7. The learning concentration monitoring method according to any one of claims 1 to 6, wherein the fusing the emotion concentration scores of the time domain segments according to a preset rule includes:
and giving weights to the learning content corresponding to each time domain segment according to the importance degree of the learning content, and carrying out weighted summation on emotion concentration degree scores corresponding to each time domain segment according to the weights.
8. A Transformer network-based learning concentration monitoring system, comprising:
the data acquisition module is used for acquiring a head posture RGB image, environment sound data and a skin electric signal of a monitored object in a monitoring time period, and dividing the monitoring time period into a plurality of time-domain segments in chronological order;
the classification module is used for inputting the corresponding head posture RGB image, the environment sound data and the skin electric signal in each time domain segment into the trained multi-mode information classification model to obtain a head posture angle, an environment sound category and an emotion state category;
the concentration degree calculation module is used for encoding the head gesture angle, the environment sound category and the emotion state category, and carrying out weighted summation according to preset weights to obtain emotion concentration degree scores of the monitored object in each time domain segment;
and the result output module is used for fusing the emotion concentration scores of the time-domain segments according to a preset rule to obtain the concentration estimation result of the monitored object in the monitoring time period.
9. A computer device comprising at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the steps of the learning concentration monitoring method of any one of claims 1 to 7.
10. A storage medium storing a software program executable by a computer device, the software program when run on the computer device causing the computer device to perform the steps of the learning concentration monitoring method of any one of claims 1 to 7.
CN202211596338.9A 2022-12-12 2022-12-12 Learning concentration monitoring method based on Transformer network Pending CN116127350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211596338.9A CN116127350A (en) 2022-12-12 2022-12-12 Learning concentration monitoring method based on Transformer network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211596338.9A CN116127350A (en) 2022-12-12 2022-12-12 Learning concentration monitoring method based on Transformer network

Publications (1)

Publication Number Publication Date
CN116127350A true CN116127350A (en) 2023-05-16

Family

ID=86307098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211596338.9A Pending CN116127350A (en) 2022-12-12 2022-12-12 Learning concentration monitoring method based on Transformer network

Country Status (1)

Country Link
CN (1) CN116127350A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117574098A (en) * 2024-01-15 2024-02-20 珠海谷田科技有限公司 Learning concentration analysis method and related device
CN117574098B (en) * 2024-01-15 2024-04-02 珠海谷田科技有限公司 Learning concentration analysis method and related device

Similar Documents

Publication Publication Date Title
Pabba et al. An intelligent system for monitoring students' engagement in large classroom teaching through facial expression recognition
Yang et al. MTD-Net: Learning to detect deepfakes images by multi-scale texture difference
Dobrišek et al. Towards efficient multi-modal emotion recognition
CN112699774B (en) Emotion recognition method and device for characters in video, computer equipment and medium
CN112766172A (en) Face continuous expression recognition method based on time sequence attention mechanism
Ververas et al. Slidergan: Synthesizing expressive face images by sliding 3d blendshape parameters
CN115050064A (en) Face living body detection method, device, equipment and medium
Fernando et al. Detection of fake and fraudulent faces via neural memory networks
Kaddoura et al. Towards effective and efficient online exam systems using deep learning-based cheating detection approach
CN116127350A (en) Learning concentration monitoring method based on Transformer network
CN116150620A (en) Training method, device, computer equipment and medium for multi-modal training model
CN115424310A (en) Weak label learning method for expression separation task in human face rehearsal
Liao et al. An open-source benchmark of deep learning models for audio-visual apparent and self-reported personality recognition
Marras et al. Deep multi-biometric fusion for audio-visual user re-identification and verification
Mohammadreza et al. Lecture quality assessment based on the audience reactions using machine learning and neural networks
CN116522212B (en) Lie detection method, device, equipment and medium based on image text fusion
CN117257302A (en) Personnel mental health state assessment method and system
Santra et al. Facial expression recognition using convolutional neural network
Kumar et al. Facial expression recognition using wavelet and k-nearest neighbour
CN112800951A (en) Micro-expression identification method, system, device and medium based on local base characteristics
Han et al. NSNP-DFER: a nonlinear spiking neural P network for dynamic facial expression recognition
CN117150320B (en) Dialog digital human emotion style similarity evaluation method and system
Agnihotri DeepFake Detection using Deep Neural Networks
Das et al. Transforming consulting atmosphere with Indian sign language translation
Punj et al. Detection of emotions with deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination