CN115439915A - Classroom participation identification method and device based on region coding and sample balance optimization - Google Patents


Info

Publication number
CN115439915A
CN115439915A (application CN202211246980.4A)
Authority
CN
China
Prior art keywords: participation, model, sample, representing, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211246980.4A
Other languages
Chinese (zh)
Inventor
徐敏
张曦淼
王嘉豪
孙众
邱德慧
董瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University filed Critical Capital Normal University
Priority to CN202211246980.4A priority Critical patent/CN115439915A/en
Publication of CN115439915A publication Critical patent/CN115439915A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Educational Technology (AREA)
  • Educational Administration (AREA)
  • Medical Informatics (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a classroom participation identification method based on region coding and sample balance optimization, which comprises the following steps: acquiring video data of students' online learning and generating original sample data from the video data, wherein the original sample data comprises high-participation sample data and low-participation sample data; inputting the low-participation sample data into a StarGAN model to generate target low-participation samples with different styles; inputting the original sample data and the target low-participation samples into an RCN model for training to obtain a trained RCN model; acquiring video data to be identified and generating image data to be identified from it; and inputting the image data to be identified into the trained RCN model to obtain a participation identification result. The method and the device effectively address the problems of extremely unbalanced sample distribution and faces occluded by hands in the participation identification task, and significantly improve the discrimination and robustness of the network model.

Description

Classroom participation identification method and device based on region coding and sample balance optimization
Technical Field
The application relates to the cross-disciplinary technical field of intelligent education and computer vision, and in particular to a classroom participation identification method and device based on region coding and sample balance optimization.
Background
Online education provides a brand-new mode of knowledge dissemination and learning. Teachers can carry out educational activities such as live teaching, recorded playback, online question answering and homework correction through online education platforms such as MOOCs, and students can complete learning tasks at their own pace. Online teaching has the characteristics of abundant learning resources, timely knowledge acquisition and diverse learning modes, and has gradually become an organic component of routine education and teaching activities. Interaction between teachers and students is a key link in the teaching process. In a traditional classroom, a teacher can directly observe students' facial expressions and behaviors to judge their degree of engagement. In an online class, however, owing to factors such as the teaching scene, students lack the real-time, face-to-face interaction with teachers, their attention is easily dispersed, and teachers cannot obtain real-time feedback on students' engagement states; learning effects can only be judged through in-class questioning and after-class assignment feedback. Therefore, how to automatically evaluate students' learning participation in an online learning environment through computer vision technology is a problem that urgently needs to be solved.
Research on automatic participation recognition can be divided into two categories: methods based on traditional machine learning and methods based on deep learning. Recognition methods based on traditional computer vision typically estimate participation from facial features or manually extracted features of other modalities by means of machine learning. For the participation identification task, whether in an online or offline class, most students listen attentively and only a few are not concentrating, so participation data collected in a natural environment suffer from severely unbalanced sample distribution: the number of low-participation samples is very small, while high-participation samples account for a large proportion. Most existing participation recognition algorithms can obtain high accuracy on the overall classification task, but they tend to improve classification of the majority classes while neglecting the discrimination of minority-class samples. In addition, since student behavior during learning is not artificially constrained in the natural environment, part of the facial area is often inadvertently covered by the hands, so that facial expression changes cannot be captured; such cases are easily recognized by the model as distraction and receive a low participation prediction.
In summary, the learning participation identification methods in the prior art do not fully consider characteristics of the participation identification task such as unbalanced sample distribution and hand occlusion in participation samples, and therefore suffer from low identification accuracy.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a classroom participation identification method based on region coding and sample balance optimization, which solves the technical problems of extremely unbalanced sample distribution and faces occluded by hands in the participation identification task of existing participation identification methods. A StarGAN (Star Generative Adversarial Network) model is provided to generate low-participation samples and enhance the participation database, and at the same time an RCN (Region Coding Network) model for face region coding is provided, which can adaptively learn the attention weights of different face regions and combine model feature learning with occlusion region coding, thereby significantly improving the discriminative power and robustness of the network model.
A second objective of the present application is to provide a classroom participation identification device based on region coding and sample balance optimization.
A third object of the present application is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a classroom participation identification method based on region coding and sample balance optimization, including: acquiring video data of on-line learning of a student, and generating original sample data according to the video data, wherein the original sample data comprises high participation sample data and low participation sample data; inputting low participation sample data into a StarGAN model to generate target low participation samples with different styles; inputting original sample data and target low participation samples into an RCN model for training to obtain a trained RCN model; acquiring video data to be identified, and generating image data to be identified according to the video data to be identified; and inputting the image data to be recognized into the trained RCN model to obtain a participation degree recognition result.
Optionally, in an embodiment of the present application, generating original sample data according to video data includes:
defining a participation degree label of the video data by utilizing manual work and prior information;
extracting an image frame from video data, cutting a face area of the extracted image frame to obtain a face image as original sample data, wherein the original sample data is divided into high-participation sample data and low-participation sample data according to a participation degree label.
Optionally, in an embodiment of the present application, the StarGAN model includes a mapping network, a style encoder, a generator and a discriminator, and before inputting the low participation sample data into the StarGAN model to generate target low participation samples with different styles, the method further comprises:
acquiring low participation training data, wherein the low participation training data are face images;
inputting low participation training data into the StarGAN model for training, and performing iterative optimization on the StarGAN model through a loss function.
Optionally, in an embodiment of the present application, the loss functions of the StarGAN model include an adversarial loss, a style reconstruction loss, a diversity sensitivity loss and a cycle consistency loss, wherein,

the adversarial loss is expressed as:

$$L_{adv}=\mathbb{E}_{x,y}\big[\log D_{y}(x)\big]+\mathbb{E}_{x,\tilde{y},z}\big[\log\big(1-D_{\tilde{y}}(G(x,\tilde{s}))\big)\big]$$

wherein $L_{adv}$ represents the adversarial loss, $\mathbb{E}[\cdot]$ represents the mathematical expectation, $x$ represents the input image, $y$ represents the original domain of the input image, $D_{y}(x)$ is the output of the discriminator branch for the original domain $y$, $\tilde{y}$ represents the target domain, $z$ represents random Gaussian noise, $\tilde{s}=F_{\tilde{y}}(z)$ represents the target-domain style feature generated by the mapping network from the random Gaussian noise, $D_{\tilde{y}}(G(x,\tilde{s}))$ represents the output of the discriminator on the image produced by the generator, and $G(x,\tilde{s})$ represents the fake image of domain $\tilde{y}$ generated by the generator from the input image and the target style feature;

the style reconstruction loss is expressed as:

$$L_{sty}=\mathbb{E}_{x,\tilde{y},z}\big[\big\|\tilde{s}-E_{\tilde{y}}(G(x,\tilde{s}))\big\|_{1}\big]$$

wherein $L_{sty}$ represents the style reconstruction loss, $E_{\tilde{y}}(\cdot)$ denotes the style encoder branch for the target domain, and the remaining symbols are as defined above;

the diversity sensitivity loss is expressed as:

$$L_{ds}=\mathbb{E}_{x,\tilde{y},z_{1},z_{2}}\big[\big\|G(x,\tilde{s}_{1})-G(x,\tilde{s}_{2})\big\|_{1}\big]$$

wherein $L_{ds}$ represents the diversity sensitivity loss, $z_{1}$ and $z_{2}$ represent random Gaussian noise vectors, $\tilde{s}_{1}$ and $\tilde{s}_{2}$ represent the style feature vectors output by the mapping network from $z_{1}$ and $z_{2}$ respectively, and $G(x,\tilde{s}_{1})$ and $G(x,\tilde{s}_{2})$ represent the images generated by the generator from the input image and the style features $\tilde{s}_{1}$ and $\tilde{s}_{2}$;

the cycle consistency loss is expressed as:

$$L_{cyc}=\mathbb{E}_{x,y,\tilde{y},z}\big[\big\|x-G(G(x,\tilde{s}),\hat{s})\big\|_{1}\big]$$

wherein $L_{cyc}$ represents the cycle consistency loss, $\hat{s}=E_{y}(x)$ is the estimated style code of the input image $x$, and $G(G(x,\tilde{s}),\hat{s})$ represents the image of style $\hat{s}$ reconstructed by the generator from the fake image $G(x,\tilde{s})$;

the StarGAN model is optimized using an objective function, where the objective function is expressed as:

$$\min_{G,F,E}\max_{D}\; L_{adv}+\lambda_{sty}L_{sty}-\lambda_{ds}L_{ds}+\lambda_{cyc}L_{cyc}$$

wherein $\min_{G,F,E}$ represents minimizing the objective function by training the generator, the mapping network and the style encoder, $\max_{D}$ represents maximizing the objective function by training the discriminator, $L_{adv}$ denotes the adversarial loss, $L_{sty}$ denotes the style reconstruction loss, $L_{ds}$ denotes the diversity sensitivity loss, $L_{cyc}$ denotes the cycle consistency loss, and $\lambda_{sty}$, $\lambda_{ds}$ and $\lambda_{cyc}$ are hyper-parameters used to balance the losses.
Optionally, in an embodiment of the present application, inputting low participation sample data into the StarGAN model, and generating target low participation samples with different styles, includes:
the method comprises the steps of inputting a face image in low-participation sample data into a StarGAN model, generating different style characteristics through a mapping network or a style encoder, and generating target low-participation samples with different styles through a generator according to the input face image and the different style characteristics.
Optionally, in an embodiment of the present application, the RCN model includes a feature extraction unit, a region attention unit and a global attention unit, and inputting the original sample data and the target low participation samples into the RCN model for training to obtain the trained RCN model includes:
inputting original sample data and a target low participation sample into an RCN model, and performing feature extraction on the original sample data and the target low participation sample through a feature extraction unit to obtain local area features of the sample;
in the feature space, the local region features of the sample are subjected to region coding through a region attention unit learning attention weights of different face regions, and global features of the sample are obtained;
respectively connecting the local area characteristics of the sample with the global characteristics of the sample in series to obtain sample characteristics, obtaining the attention weight of the sample characteristics through a global attention unit, and performing weighted fusion on the sample characteristics to obtain final sample characteristics;
and according to the characteristics of the final sample, performing iterative updating and optimization on the network parameters of the RCN model by using an SGD algorithm through combining the regional deviation loss and the cross entropy loss to obtain the trained RCN model.
Optionally, in an embodiment of the present application, inputting image data to be recognized into a trained RCN model to obtain an engagement recognition result, including:
inputting image data to be identified into a feature extraction unit for feature extraction to obtain a feature map, and randomly cutting the feature map into a preset number of regional features;
inputting the regional characteristics into a regional attention unit, calculating attention weight of the regional characteristics, and weighting the regional characteristics to obtain global characteristics;
and respectively connecting the regional characteristics with the global characteristics in series to obtain target characteristics, obtaining the attention weight of the target characteristics through a global attention unit, weighting the target characteristics to obtain final characteristics, and identifying and classifying the final characteristics to obtain the participation degree identification result of the image data to be identified.
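For illustration, a minimal sketch of the random cropping step above is given below; the 28×28×512 feature-map size and 6×6 crop size follow the detailed description later in this application, while the number of crops and the implementation details are assumptions.

```python
# Sketch of randomly cropping the feature map into n region features
# (the whole map is kept as region 0). Sizes follow the description
# elsewhere in this application; details are illustrative.
import torch


def random_region_crops(feature_map, n_regions=5, crop=6):
    """feature_map: (batch, 512, 28, 28) -> list of region feature tensors."""
    _, _, h, w = feature_map.shape
    regions = [feature_map]                       # region 0: the full feature map
    for _ in range(n_regions):
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
        regions.append(feature_map[:, :, top:top + crop, left:left + crop])
    return regions
```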
To achieve the above object, an embodiment of a second aspect of the present application provides a classroom participation identification device based on region coding and sample balance optimization, including:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring video data of on-line learning of a student and generating original sample data according to the video data, and the original sample data comprises high participation sample data and low participation sample data;
the generating module is used for inputting the low participation sample data into the StarGAN model and generating target low participation samples with different styles;
the training module is used for inputting original sample data and target low participation samples into the RCN model for training to obtain a trained RCN model;
the second acquisition module is used for acquiring the video data to be identified and generating image data to be identified according to the video data to be identified;
and the recognition module is used for inputting the image data to be recognized into the trained RCN model to obtain a participation degree recognition result.
Optionally, in an embodiment of the present application, generating original sample data according to video data includes:
defining a participation degree label of the video data by utilizing manual work and prior information;
extracting an image frame from the video data, cutting a face area of the extracted image frame to obtain a face image as original sample data, wherein the original sample data is divided into high-participation sample data and low-participation sample data according to the participation degree label.
In order to achieve the above object, a non-transitory computer-readable storage medium is provided in a third aspect of the present application, and when executed by a processor, the instructions in the storage medium can perform a classroom participation identification method based on region coding and sample balance optimization.
According to the classroom participation identification method and device based on region coding and sample balance optimization and the non-transitory computer-readable storage medium of the present application, the technical problems of extremely unbalanced sample distribution and faces occluded by hands in the participation identification task of existing participation identification methods are solved: low-participation samples are generated with a StarGAN model to enhance the participation database, and at the same time a region coding network for face region coding is provided, which can adaptively learn the attention weights of different face regions and combine model feature learning with occlusion region coding, thereby significantly improving the discrimination and robustness of the network model.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a classroom participation identification method for area coding and sample balance optimization according to an embodiment of the present application;
fig. 2 is another flowchart of a classroom participation identification method for area coding and sample balance optimization according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an online learning low-participation image generated based on a StarGAN model by a classroom participation identification method based on region coding and sample balance optimization according to an embodiment of the present application;
fig. 4 is a schematic diagram of a low-participation sample generated based on the StarGAN model by the area coding and sample balance optimized classroom participation identification method according to the embodiment of the present application;
FIG. 5 is a schematic structural diagram of a feature extraction convolutional neural network of a classroom participation identification method for regional coding and sample balance optimization according to an embodiment of the present application;
fig. 6 is a schematic diagram of an RCN model-based engagement recognition framework of a classroom engagement recognition method for area coding and sample balance optimization according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a classroom participation identification device with area coding and sample balance optimization according to a second embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The method and apparatus for classroom participation identification with region coding and sample balance optimization according to the embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a classroom participation identification method based on region coding and sample balance optimization according to an embodiment of the present application.
As shown in fig. 1, the classroom participation identification method based on region coding and sample balance optimization includes the following steps:
step 101, acquiring video data of online learning of a student, and generating original sample data according to the video data, wherein the original sample data comprises high participation sample data and low participation sample data;
step 102, inputting low-participation sample data into a StarGAN model to generate target low-participation samples with different styles;
103, inputting original sample data and a target low participation sample into an RCN model for training to obtain a trained RCN model;
104, acquiring video data to be identified, and generating image data to be identified according to the video data to be identified;
and 105, inputting the image data to be recognized into the trained RCN model to obtain a participation degree recognition result.
According to the classroom participation identification method based on region coding and sample balance optimization of the present application, video data of students' online learning are acquired and original sample data are generated from the video data, wherein the original sample data comprise high-participation sample data and low-participation sample data; the low-participation sample data are input into the StarGAN model to generate target low-participation samples with different styles; the original sample data and the target low-participation samples are input into the RCN model for training to obtain the trained RCN model; video data to be identified are acquired and image data to be identified are generated from them; and the image data to be identified are input into the trained RCN model to obtain the participation identification result. Therefore, the technical problems of extremely unbalanced sample distribution and faces occluded by hands in the participation identification task of existing participation identification methods can be solved: low-participation samples are generated with the StarGAN model to enhance the participation database, and at the same time a region coding network for face region coding is provided, so that the attention weights of different face regions can be adaptively learned and model feature learning is combined with occlusion region coding, thereby significantly improving the discrimination and robustness of the network model.
Further, in this embodiment of the present application, generating original sample data according to video data includes:
defining a participation degree label of the video data by utilizing manual work and prior information;
extracting an image frame from video data, cutting a face area of the extracted image frame to obtain a face image as original sample data, wherein the original sample data is divided into high-participation sample data and low-participation sample data according to a participation degree label.
Illustratively, videos of students' online learning can be acquired through a camera and saved as one video every 10 seconds, and a participation label in {0,1,2,3} is defined for each video manually with the aid of prior information.
Image frames are extracted with OpenCV, the face area of each extracted image frame is cropped with the open-source face recognition tool face_recognition, and the resulting face images are stored in a database as the original sample data. The original sample data can be divided into high-participation sample data and low-participation sample data according to the participation label of the video data; for example, original sample data generated from videos with participation labels 0 and 1 are assigned to the low-participation sample data, and original sample data generated from videos with participation labels 2 and 3 are assigned to the high-participation sample data.
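The preprocessing described above can be sketched as follows; this is a minimal illustration using OpenCV and the face_recognition package, in which the frame-sampling stride, the function name and the grouping threshold are assumptions for illustration rather than values fixed by the present application.

```python
# Minimal sketch of the preprocessing step: extract frames from a labelled
# 10-second clip, crop the face regions, and assign the clip to the high- or
# low-participation set according to its manual label in {0,1,2,3}.
import cv2
import face_recognition


def extract_face_samples(video_path, label, stride=30):
    """Return cropped face images and the sample group for one labelled clip."""
    faces = []
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % stride == 0:                      # sample one frame per `stride`
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            for top, right, bottom, left in face_recognition.face_locations(rgb):
                faces.append(rgb[top:bottom, left:right])
        index += 1
    capture.release()
    group = "low_participation" if label in (0, 1) else "high_participation"
    return faces, group
```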
Further, in an embodiment of the present application, the StarGAN model includes a mapping network, a style encoder, a generator and a discriminator, and before inputting the low participation sample data into the StarGAN model to generate target low participation samples with different styles, the method further includes the following steps:
acquiring low participation training data, wherein the low participation training data are face images;
inputting the training data with low participation into a StarGAN model for training, and performing iterative optimization on the StarGAN model through a loss function.
The present application introduces the idea of the adversarial game underlying generative adversarial networks and, based on the star generative adversarial network StarGAN, generates low-participation samples to expand the number of minority samples in the database and enhance the participation database, thereby mitigating the influence of the unbalanced data set.
Initializing StarGAN model parameters, inputting low-participation sample data with participation degree labels of 0 and 1 into a StarGAN model, generating low-participation samples with different styles, and enhancing a database.
The StarGAN model of the present application includes a mapping network, a style encoder, a generator, and a discriminator. The mapping network is composed of a multi-layer perceptron with a plurality of output branches, and can map given random Gaussian noise into diversified style characteristic representations. The style encoder can extract different style feature representations using a depth network given different reference images. The mapping network and the style encoder each have a plurality of output branches, each branch corresponding to a style characteristic of a particular domain. The generator generates a false image with multiple styles but unchanged content according to the given input image and style characteristics. The discriminator has a plurality of output branches corresponding to a plurality of target domains, each output branch being a classifier for discriminating whether the input image is authentic at a specific target domain thereof.
In the StarGAN model training process, the generator combines the input style features to generate images that are as realistic as possible with the given style, while the discriminator tries to identify the fake images produced by the generator; the two play against each other continuously, the generator's ability to produce realistic images keeps improving, and finally the fake images produced by the generator become as close as possible to real images.
According to the data distribution of students' online learning participation, the domains of the participation data are defined based on the students' participation degree, i.e., the concept of domain in this application refers to the participation label, and the style features of an image include the person's hair style, skin color, beard, whether glasses are worn, the angle and posture of the eyes gazing at the screen, and so on.
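As a concrete illustration of the multi-branch design described above, a minimal PyTorch sketch of a mapping network head is given below; the layer sizes, latent dimension and number of domains are illustrative assumptions, not the reference implementation of this application.

```python
# Sketch of the per-domain branch design: the mapping network turns random
# Gaussian noise into a style vector, with one output branch per domain
# (here, per participation label). Dimensions are illustrative.
import torch
import torch.nn as nn


class MappingNetwork(nn.Module):
    def __init__(self, latent_dim=16, style_dim=64, num_domains=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # one output branch per domain; each branch emits a style vector
        self.branches = nn.ModuleList(
            [nn.Linear(512, style_dim) for _ in range(num_domains)]
        )

    def forward(self, z, target_domain):
        h = self.shared(z)                            # (batch, 512)
        styles = torch.stack([b(h) for b in self.branches], dim=1)
        # pick the style vector of the requested target domain per sample
        idx = target_domain.view(-1, 1, 1).expand(-1, 1, styles.size(2))
        return styles.gather(1, idx).squeeze(1)       # (batch, style_dim)


z = torch.randn(8, 16)
y_trg = torch.randint(0, 4, (8,))
style = MappingNetwork()(z, y_trg)                    # style features for domain y_trg
```

A style encoder head can follow the same branching pattern, with a convolutional trunk in place of the multilayer perceptron.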
Further, in the embodiment of the present application, the loss functions of the StarGAN model include an adversarial loss, a style reconstruction loss, a diversity sensitivity loss and a cycle consistency loss, wherein,

the adversarial loss is expressed as:

$$L_{adv}=\mathbb{E}_{x,y}\big[\log D_{y}(x)\big]+\mathbb{E}_{x,\tilde{y},z}\big[\log\big(1-D_{\tilde{y}}(G(x,\tilde{s}))\big)\big]$$

wherein $L_{adv}$ represents the adversarial loss, $\mathbb{E}[\cdot]$ represents the mathematical expectation, $x$ represents the input image, $y$ represents the original domain of the input image, $D_{y}(x)$ is the output of the discriminator branch for the original domain $y$, $\tilde{y}$ represents the target domain, $z$ represents random Gaussian noise, $\tilde{s}=F_{\tilde{y}}(z)$ represents the target-domain style feature generated by the mapping network from the random Gaussian noise, $D_{\tilde{y}}(G(x,\tilde{s}))$ represents the output of the discriminator on the image produced by the generator, and $G(x,\tilde{s})$ represents the fake image of domain $\tilde{y}$ generated by the generator from the input image and the target style feature; the fake image and the target domain are input into the discriminator so that the discriminator learns to distinguish whether the input image is real;

the style reconstruction loss is expressed as:

$$L_{sty}=\mathbb{E}_{x,\tilde{y},z}\big[\big\|\tilde{s}-E_{\tilde{y}}(G(x,\tilde{s}))\big\|_{1}\big]$$

wherein $L_{sty}$ represents the style reconstruction loss, $E_{\tilde{y}}(\cdot)$ denotes the style encoder branch for the target domain, and the remaining symbols are as defined above;

the diversity sensitivity loss is expressed as:

$$L_{ds}=\mathbb{E}_{x,\tilde{y},z_{1},z_{2}}\big[\big\|G(x,\tilde{s}_{1})-G(x,\tilde{s}_{2})\big\|_{1}\big]$$

wherein $L_{ds}$ represents the diversity sensitivity loss, $z_{1}$ and $z_{2}$ represent random Gaussian noise vectors, $\tilde{s}_{1}$ and $\tilde{s}_{2}$ represent the style feature vectors output by the mapping network from $z_{1}$ and $z_{2}$ respectively, and $G(x,\tilde{s}_{1})$ and $G(x,\tilde{s}_{2})$ represent the images generated by the generator from the input image and the style features $\tilde{s}_{1}$ and $\tilde{s}_{2}$; this loss maximizes the difference between images generated with different styles, thereby encouraging the generator to produce images of more diverse styles during training;

the cycle consistency loss is expressed as:

$$L_{cyc}=\mathbb{E}_{x,y,\tilde{y},z}\big[\big\|x-G(G(x,\tilde{s}),\hat{s})\big\|_{1}\big]$$

wherein $L_{cyc}$ represents the cycle consistency loss, $\hat{s}=E_{y}(x)$ is the estimated style code of the input image $x$, and $G(G(x,\tilde{s}),\hat{s})$ represents the image of style $\hat{s}$ reconstructed by the generator from the fake image $G(x,\tilde{s})$; by constraining the L1 loss between $G(G(x,\tilde{s}),\hat{s})$ and the input image $x$, the generator is made to retain some of the original characteristics of $x$ while changing its style;

the StarGAN model is optimized using an objective function, where the objective function is expressed as:

$$\min_{G,F,E}\max_{D}\; L_{adv}+\lambda_{sty}L_{sty}-\lambda_{ds}L_{ds}+\lambda_{cyc}L_{cyc}$$

wherein $\min_{G,F,E}$ represents minimizing the objective function by training the generator, the mapping network and the style encoder, $\max_{D}$ represents maximizing the objective function by training the discriminator, $L_{adv}$ denotes the adversarial loss, $L_{sty}$ denotes the style reconstruction loss, $L_{ds}$ denotes the diversity sensitivity loss, $L_{cyc}$ denotes the cycle consistency loss, and $\lambda_{sty}$, $\lambda_{ds}$ and $\lambda_{cyc}$ are hyper-parameters used to balance the losses.
The loss functions of the StarGAN model of the present application include the adversarial loss, the style reconstruction loss, the diversity sensitivity loss and the cycle consistency loss. The adversarial loss makes the generator and the discriminator optimize against each other during training, continuously improving model performance. The style reconstruction loss makes the generator use the specified style representation when generating an image; a larger loss value results if another style representation is used. The diversity sensitivity loss makes the images produced by the generator diverse by maximizing the L1 loss between two generated images with different styles, where the L1 loss, used to minimize error, is expressed as the absolute value of the difference between the true and predicted values. The cycle consistency loss is used to ensure that certain unaltered characteristics of the input image are correctly retained in the generated image.
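To make the interaction of these losses concrete, a minimal PyTorch sketch of the generator-side objective is given below; the call signatures of the generator G, discriminator D, mapping network F_map and style encoder E_sty are assumptions consistent with the sketches in this description, not the reference implementation of the application.

```python
# Sketch of the generator-side StarGAN objective
# L_adv + lambda_sty * L_sty - lambda_ds * L_ds + lambda_cyc * L_cyc.
# G(x, s), D(x, domain), F_map(z, domain) and E_sty(x, domain) are assumed
# module interfaces; hyper-parameter values are illustrative.
import torch
import torch.nn.functional as nnf


def generator_objective(G, D, F_map, E_sty, x, y_org, y_trg, z1, z2,
                        lambda_sty=1.0, lambda_ds=1.0, lambda_cyc=1.0):
    s1 = F_map(z1, y_trg)                      # target style from Gaussian noise
    s2 = F_map(z2, y_trg)
    x_fake = G(x, s1)

    # adversarial term: the generator wants the target-domain branch of the
    # discriminator to judge the fake image as real
    logit_fake = D(x_fake, y_trg)
    l_adv = nnf.binary_cross_entropy_with_logits(logit_fake,
                                                 torch.ones_like(logit_fake))

    # style reconstruction: the style encoder should recover s1 from the fake
    l_sty = torch.mean(torch.abs(E_sty(x_fake, y_trg) - s1))

    # diversity sensitivity: push two differently styled outputs apart (L1)
    l_ds = torch.mean(torch.abs(x_fake - G(x, s2)))

    # cycle consistency: translating back with the source style recovers x
    s_org = E_sty(x, y_org)
    l_cyc = torch.mean(torch.abs(x - G(x_fake, s_org)))

    return l_adv + lambda_sty * l_sty - lambda_ds * l_ds + lambda_cyc * l_cyc
```

The discriminator side is updated with the corresponding adversarial term on real and generated images, alternating with the generator update.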
Further, in the embodiment of the present application, inputting low participation sample data into the StarGAN model, and generating target low participation samples with different styles, includes:
inputting the face image in the low participation sample data into a StarGAN model, generating different style characteristics through a mapping network or a style encoder, and generating target low participation samples with different styles through a generator according to the input face image and the different style characteristics.
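A minimal sketch of this generation step is shown below: given a trained generator and mapping network (or style encoder), several differently styled low-participation images are produced from one low-participation face image. The module interfaces follow the sketches above and are assumptions for illustration.

```python
# Sketch: generate diverse low-participation samples from one real
# low-participation face image. `generator`, `mapping_net` and
# `style_encoder` are assumed trained modules with the interfaces used above.
import torch


@torch.no_grad()
def generate_low_participation_samples(x, y_trg, generator, mapping_net,
                                       style_encoder=None, x_ref=None, n_samples=5):
    """x: (1, 3, H, W) source face; y_trg: target (low-participation) domain."""
    fakes = []
    for _ in range(n_samples):
        if style_encoder is not None and x_ref is not None:
            s = style_encoder(x_ref, y_trg)          # style from a reference image
        else:
            z = torch.randn(x.size(0), 16)           # latent dim is illustrative
            s = mapping_net(z, y_trg)                # style from Gaussian noise
        fakes.append(generator(x, s))                # same content, new style
    return fakes
```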
Further, in this embodiment of the present application, the RCN model includes a feature extraction unit, a region attention unit and a global attention unit, and inputting the original sample data and the target low-participation samples into the RCN model for training to obtain the trained RCN model includes:
inputting original sample data and a target low participation sample into an RCN model, and performing feature extraction on the original sample data and the target low participation sample through a feature extraction unit to obtain local area features of the sample;
in a feature space, a region attention unit learns attention weights of different face regions to perform region coding on local region features of a sample to obtain global features of the sample;
respectively connecting the local area characteristics of the sample with the global characteristics of the sample in series to obtain sample characteristics, obtaining the attention weight of the sample characteristics through a global attention unit, and performing weighted fusion on the sample characteristics to obtain final sample characteristics;
and according to the characteristics of the final sample, performing iterative updating and optimization on the network parameters of the RCN model by using an SGD algorithm through combining the regional deviation loss and the cross entropy loss to obtain the trained RCN model.
According to the method and the device, the region coding is carried out by learning the attention weights of different face regions, so that the model focuses more on the region with larger weight, and the model identification performance is further improved.
The method comprises the steps of inputting original sample data and a target low participation sample into an RCN together, firstly, carrying out feature extraction on the input sample, and then carrying out region coding in a feature space by learning attention weights of different face regions; weighting and fusing all local region features to obtain a global feature, connecting the local feature and the global feature in series, obtaining more accurate weight by adopting an attention mechanism, and obtaining final feature representation after weighting and fusing; and finally, carrying out iterative updating and optimization on network parameters by using an SGD algorithm through combining the regional deviation loss and the cross entropy loss to obtain a more optimal participation degree identification model.
The region bias loss is used to constrain the attention weights $\alpha_i$, i.e., a hyper-parameter $\delta$ is used to require that the attention weight $\alpha_i$ of some local region $F_i$ be larger, by a margin, than the weight $\alpha_0$ of the original full face image $F_0$.

The region bias loss is expressed as:

$$L_{RB}=\max\{0,\;\delta-(\alpha_{max}-\alpha_{0})\}$$

wherein $L_{RB}$ denotes the region bias loss, $\delta$ denotes the hyper-parameter, $\alpha_{0}$ is the attention weight of the original face image, and $\alpha_{max}$ denotes the maximum weight over all local regions.

The cross entropy loss is expressed as:

$$L_{CE}(p,y)=-\sum_{i=1}^{N}y_{i}\log p_{i}$$

wherein $L_{CE}(p,y)$ represents the cross entropy loss, $N$ represents the number of samples, $y_{i}$ represents the label of the $i$-th sample, and $p_{i}$ represents the $i$-th output computed by the model.
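A minimal sketch of the combined training objective is given below; it adds the region bias loss to the cross entropy loss and performs one SGD update. The margin value, learning rate and the way logits and attention weights are returned by the model are illustrative assumptions.

```python
# Sketch of one RCN training step with the combined loss L = L_CE + L_RB,
# optimized with SGD. Hyper-parameters are illustrative.
import torch
import torch.nn.functional as F


def region_bias_loss(alpha, delta=0.02):
    """alpha: (batch, n+1) attention weights, column 0 = whole-face weight."""
    alpha0 = alpha[:, 0]
    alpha_max = alpha[:, 1:].max(dim=1).values
    return torch.clamp(delta - (alpha_max - alpha0), min=0).mean()


def train_step(model, optimizer, images, labels, delta=0.02):
    # the model is assumed to return class logits and the region weights alpha
    logits, alpha = model(images)
    loss = F.cross_entropy(logits, labels) + region_bias_loss(alpha, delta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```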
Further, in this embodiment of the present application, inputting image data to be recognized into a trained RCN model to obtain a result of participation degree recognition, including:
inputting image data to be identified into a feature extraction unit for feature extraction to obtain a feature map, and randomly cutting the feature map into a preset number of regional features;
inputting the regional characteristics into a regional attention unit, calculating attention weight of the regional characteristics, and weighting the regional characteristics to obtain global characteristics;
and respectively connecting the regional characteristics with the global characteristics in series to obtain target characteristics, obtaining the attention weight of the target characteristics through a global attention unit, weighting the target characteristics to obtain final characteristics, and identifying and classifying the final characteristics to obtain the participation degree identification result of the image data to be identified.
OpenCV is used to extract image frames from the video to be recognized, and the open-source face recognition tool face_recognition is used to crop the face area of each image frame, yielding face images as the images to be recognized. An image to be recognized is input into the trained RCN model: the facial features of the input image are first extracted and randomly cropped, the weights of the different face regions are then learned adaptively and fused by weighting to obtain the global feature, the local features and the global feature are concatenated, participation recognition is performed, and the recognition result is output.
The RCN model in the application comprises a feature extraction unit, a region attention unit and a global attention unit.
The method for recognizing the image to be recognized based on the RCN model is described in detail below.
The feature extraction unit takes the facial expression image to be recognized, of size 224×224×3, as input and performs feature extraction with a convolutional neural network, obtaining a feature map $f_0$ of size 28×28×512. The convolutional neural network includes 10 convolutional layers and 3 pooling layers: two convolutions with 64 kernels followed by one pooling, two convolutions with 128 kernels followed by another pooling, three convolutions with 256 kernels followed by a further pooling, and finally three convolutions with 512 kernels, which yields the feature map $f_0$. Then $f_0$ is randomly cropped into $n$ region features $f_i$ $(i=1,2,\dots,n)$ of size 6×6×512, and each region is processed separately by the region attention unit. The region attention unit is implemented by an attention network comprising a pooling layer, two convolutional layers with 512 and 128 kernels respectively, a fully connected layer and a sigmoid layer. By computing the attention weights $\alpha_i$ $(i=0,1,\dots,n)$ of the input region features $f_i$ $(i=0,1,\dots,n)$ and weighting the region features $f_i$, a global attention representation $f_m$ is obtained, which assists the region coding mechanism to optimize from a global perspective and adaptively adjusts the weight parameters.

The attention weight $\alpha_i$ of a region feature is expressed as:

$$\alpha_{i}=\mathrm{sigmoid}\big(f_{i}^{T}\cdot q\big)$$

wherein $\mathrm{sigmoid}(\cdot)$ is the nonlinear activation function, $f_{i}^{T}$ is the transposed region feature, and $q$ denotes the parameters of the fully connected layer.

The global attention representation $f_m$ is expressed as:

$$f_{m}=\frac{\sum_{i=0}^{n}\alpha_{i}f_{i}}{\sum_{i=0}^{n}\alpha_{i}}$$

wherein $n$ represents the number of regions, $\alpha_i$ represents the attention weight of a region feature, and $f_i$ represents the region feature.
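A minimal PyTorch sketch of the region attention unit described above follows; the layer sequence (pooling, 512- and 128-kernel convolutions, fully connected layer, sigmoid) mirrors the description, while the pooling size and the normalized weighted sum are assumptions for illustration.

```python
# Sketch of the region attention unit: each region feature is scored with an
# attention weight alpha_i, and the weighted regions form the global
# attention representation f_m. Layer sizes follow the description above.
import torch
import torch.nn as nn


class RegionAttentionUnit(nn.Module):
    def __init__(self, in_channels=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(3),                      # pooling layer
            nn.Conv2d(in_channels, 512, kernel_size=1), nn.ReLU(),
            nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 1),                    # fully connected layer q
            nn.Sigmoid(),                                 # alpha_i in (0, 1)
        )

    def forward(self, regions):
        """regions: list of (batch, 512, h, w) crops, index 0 = whole feature map."""
        pooled = [r.mean(dim=(2, 3)) for r in regions]    # (batch, 512) per region
        alphas = torch.cat([self.score(r) for r in regions], dim=1)  # (batch, n+1)
        feats = torch.stack(pooled, dim=1)                # (batch, n+1, 512)
        f_m = (alphas.unsqueeze(-1) * feats).sum(1) / alphas.sum(1, keepdim=True)
        return f_m, feats, alphas
```

This unit composes directly with the random cropping sketch given earlier: the list returned by random_region_crops is passed to RegionAttentionUnit.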
A region bias loss is used in the region attention unit to constrain the attention weights $\alpha_i$, i.e., to require that the attention weight $\alpha_i$ of some region $f_i$ $(i=1,2,\dots,n)$ be larger than the weight $\alpha_0$ of the original face image $f_0$; by "encouraging" the RCN model in this way, the attention paid to important regions is increased, so that the model obtains better region and global representation weights.

The region bias loss function is expressed as:

$$L_{RB}=\max\{0,\;\delta-(\alpha_{max}-\alpha_{0})\}$$

wherein $L_{RB}$ denotes the region bias loss, $\delta$ is a hyper-parameter, $\alpha_{0}$ is the attention weight of the original face image, and $\alpha_{max}$ denotes the maximum weight over all local regions.
The global attention unit is implemented by an attention network comprising a fully connected layer and a sigmoid layer. The region features $f_i$ $(i=0,1,\dots,n)$ are each concatenated with the global representation feature $f_m$ to obtain the target features $[f_i:f_m]$; the attention weights $\beta_i$ $(i=0,1,\dots,n)$ are then obtained through the global attention unit, the target features $[f_i:f_m]$ are weighted to obtain the final feature representation $P$, and finally $P$ is used for recognition and classification.

The attention weight $\beta_i$ of a target feature is expressed as:

$$\beta_{i}=\mathrm{sigmoid}\big([f_{i}:f_{m}]^{T}\cdot q_{1}\big)$$

wherein $\mathrm{sigmoid}(\cdot)$ is the nonlinear activation function, $[f_{i}:f_{m}]^{T}$ denotes the transposed feature obtained by concatenating the region feature with the global feature, and $q_{1}$ denotes the parameters of the fully connected layer.

The final feature representation $P$ is expressed as:

$$P=\frac{\sum_{i=0}^{n}\alpha_{i}\beta_{i}[f_{i}:f_{m}]}{\sum_{i=0}^{n}\alpha_{i}\beta_{i}}$$

wherein $n$ represents the number of regions, $\alpha_i$ represents the attention weight of a region feature, $\beta_i$ represents the attention weight of a target feature, and $[f_i:f_m]$ represents the target feature.
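A companion sketch of the global attention unit is given below; it consumes the per-region features and weights produced by the region attention sketch above, and the normalized fusion of the target features follows the formula just stated. The classifier head and its output size are illustrative assumptions.

```python
# Sketch of the global attention unit: concatenate each region feature with
# the global representation f_m, score it with beta_i, fuse into the final
# representation P, and classify P into participation levels {0,1,2,3}.
import torch
import torch.nn as nn


class GlobalAttentionUnit(nn.Module):
    def __init__(self, feat_dim=512, num_classes=4):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim * 2, 1), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, feats, alphas, f_m):
        """feats: (batch, n+1, 512); alphas: (batch, n+1); f_m: (batch, 512)."""
        f_m_rep = f_m.unsqueeze(1).expand_as(feats)
        target = torch.cat([feats, f_m_rep], dim=-1)          # [f_i : f_m]
        betas = self.score(target).squeeze(-1)                 # (batch, n+1)
        weight = (alphas * betas).unsqueeze(-1)
        p = (weight * target).sum(1) / weight.sum(1).clamp_min(1e-8)
        return self.classifier(p), betas
```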
Fig. 2 is another flowchart of a classroom participation identification method based on region coding and sample balance optimization according to an embodiment of the present application.
As shown in FIG. 2, the classroom participation identification method based on region coding and sample balance optimization comprises: capturing real-time online learning pictures of the learners with a camera and performing data preprocessing synchronously; inputting the low-participation samples into the StarGAN model and generating low-participation samples with different styles through the mapping network or the style encoder, so as to expand the number of minority samples in the database and mitigate the influence of the unbalanced data set; inputting the original data and the generated low-participation samples into the RCN together and performing region coding by learning the attention weights of different face regions, so that the model pays more attention to regions with larger weights; and passing the real-time learning video collected online through the trained participation recognition framework to obtain the participation recognition result.
Fig. 3 is a schematic structural diagram of an online learning low-participation image generated based on a StarGAN model by a classroom participation identification method based on region coding and sample balance optimization according to an embodiment of the present application.
As shown in FIG. 3, random Gaussian noise $z$ and a reference image are input into the mapping network and the style encoder respectively to generate the target style feature $\tilde{s}$; the target style feature $\tilde{s}$ and the given image $x$ are input into the generator $G$ to generate a fake image; finally, the discriminator $D$ judges the generated image, and a low-participation sample is obtained.
Fig. 4 is a schematic diagram of a low-participation sample generated based on the StarGAN model by the area coding and sample balance optimized classroom participation identification method according to the embodiment of the present application.
As shown in fig. 4, given an input image and a reference image, different low-participation generated image samples are obtained after different training iterations.
Fig. 5 is a schematic structural diagram of a feature extraction convolutional neural network of a classroom participation identification method for regional coding and sample balance optimization according to an embodiment of the present application.
As shown in fig. 5, the feature extraction unit performs feature extraction by using a convolutional neural network, performs feature extraction by using a facial expression image with a size of 224 × 224 × 3 as an input, reduces the dimensions of the width and the height of the features after passing through the VGG16 model, and increases the number of channels, thereby finally obtaining a feature map with dimensions of 28 × 28 × 512.
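The VGG16-style feature extractor of FIG. 5 can be sketched as follows; this minimal version reuses the torchvision VGG16 convolutional layers up to the stage that outputs a 28×28×512 map for a 224×224×3 input, which is an assumed but shape-consistent reading of the description.

```python
# Sketch of the feature extraction unit: a VGG16-style stack of 10
# convolutional layers and 3 poolings mapping a 224x224x3 face image to a
# 28x28x512 feature map. Using torchvision's VGG16 features up to index 22
# (conv4_3 + ReLU) is an assumption that matches the stated output size.
import torch
from torchvision.models import vgg16

backbone = vgg16(weights=None).features[:23]   # conv1_1 ... conv4_3, 3 max-pools

x = torch.randn(1, 3, 224, 224)                # a face image to be recognized
f0 = backbone(x)
print(f0.shape)                                # torch.Size([1, 512, 28, 28])
```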
Fig. 6 is a schematic diagram of an RCN model-based engagement recognition framework of a classroom engagement recognition method for region coding and sample balance optimization according to an embodiment of the present application.
As shown in FIG. 6, the original images and the generated images are input into the feature extraction unit for feature extraction to obtain the feature map $f_0$; $f_0$ is randomly cropped into $n$ region features $f_i$ $(i=1,\dots,n)$; the region attention unit computes the attention weights $\alpha_i$ $(i=0,1,\dots,n)$ of the input region features $f_i$ and weights the region features to obtain the global attention representation $f_m$; the region features $f_i$ $(i=0,1,\dots,n)$ are each concatenated with the global representation feature $f_m$ to obtain the target features $[f_i:f_m]$; the attention weights $\beta_i$ $(i=0,1,\dots,n)$ are then obtained through the global attention unit, and the target features $[f_i:f_m]$ are weighted to obtain the final feature representation $P$.
Fig. 7 is a schematic structural diagram of a classroom participation identification device with area coding and sample balance optimization according to a second embodiment of the present application.
As shown in fig. 7, the classroom participation identification device for area coding and sample balance optimization comprises:
the first obtaining module 10 is configured to obtain video data of online learning of a student, and generate original sample data according to the video data, where the original sample data includes high participation sample data and low participation sample data;
the generating module 20 is configured to input the low participation sample data into the StarGAN model, and generate target low participation samples with different styles;
the training module 30 is configured to input the original sample data and the target low participation sample into the RCN model for training, so as to obtain a trained RCN model;
the second obtaining module 40 is configured to obtain video data to be identified, and generate image data to be identified according to the video data to be identified;
and the recognition module 50 is configured to input the image data to be recognized into the trained RCN model to obtain a participation degree recognition result.
The classroom participation identification device based on region coding and sample balance optimization of the present application comprises: a first acquisition module for acquiring video data of students' online learning and generating original sample data from the video data, wherein the original sample data comprise high-participation sample data and low-participation sample data; a generating module for inputting the low-participation sample data into the StarGAN model and generating target low-participation samples with different styles; a training module for inputting the original sample data and the target low-participation samples into the RCN model for training to obtain the trained RCN model; a second acquisition module for acquiring the video data to be identified and generating image data to be identified from them; and a recognition module for inputting the image data to be identified into the trained RCN model to obtain the participation identification result. Therefore, the technical problems of extremely unbalanced sample distribution and faces occluded by hands in the participation identification task of existing participation identification methods can be solved: low-participation samples are generated with the StarGAN model to enhance the participation database, and at the same time a region coding network for face region coding is provided, so that the attention weights of different face regions can be adaptively learned and model feature learning is combined with occlusion region coding, thereby significantly improving the discrimination and robustness of the network model.
Further, in this embodiment of the present application, generating original sample data according to video data includes:
defining a participation degree label for the video data by means of manual annotation and prior information;
extracting image frames from the video data and cropping the face region of each extracted frame to obtain face images as the original sample data, wherein the original sample data is divided into high participation sample data and low participation sample data according to the participation degree label.
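A hedged sketch of this sample-generation step is given below; the OpenCV Haar-cascade face detector and the frame-sampling interval are illustrative assumptions, since the embodiment only requires that face regions be cropped from extracted frames and paired with the engagement label.

```python
# Extract frames from a learning video, crop detected faces, and attach the
# manually defined participation label to each face crop.
import cv2

def extract_face_samples(video_path, label, every_n_frames=30):
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    samples = []          # (face_image, participation_label) pairs
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, 1.1, 5)
            for (x, y, w, h) in faces:
                samples.append((frame[y:y + h, x:x + w], label))
        idx += 1
    cap.release()
    return samples
```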
In order to achieve the above embodiments, the present application further proposes a non-transitory computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements the classroom participation identification method for area coding and sample balance optimization of the above embodiments.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without being mutually inconsistent.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise.
Any process or method description in the flow charts or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing steps of a custom logic function or process. Alternate implementations are included within the scope of the preferred embodiments of the present application, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present application pertain.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and should not be construed as limiting the present application and that changes, modifications, substitutions and alterations in the above embodiments may be made by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A classroom participation identification method based on regional coding and sample balance optimization is characterized by comprising the following steps:
the method comprises the steps of obtaining video data of on-line learning of a student, and generating original sample data according to the video data, wherein the original sample data comprises high-participation sample data and low-participation sample data;
inputting the low participation sample data into a StarGAN model to generate target low participation samples with different styles;
inputting the original sample data and the target low participation sample into an RCN model for training to obtain a trained RCN model;
acquiring video data to be identified, and generating image data to be identified according to the video data to be identified;
and inputting the image data to be recognized into the trained RCN model to obtain a participation degree recognition result.
2. The method of claim 1, wherein said generating original sample data from said video data comprises:
defining a participation degree label for the video data by means of manual annotation and prior information;
and extracting image frames from the video data and cropping the face region of each extracted image frame to obtain face images as the original sample data, wherein the original sample data is divided into high participation sample data and low participation sample data according to the participation degree label.
3. The method of claim 1, wherein the StarGAN model comprises a mapping network, a style encoder, a generator and a discriminator, and wherein before inputting the low participation sample data into the StarGAN model to generate target low participation samples with different styles, the method further comprises:
acquiring low participation training data, wherein the low participation training data are face images;
inputting the low participation training data into a StarGAN model for training, and performing iterative optimization on the StarGAN model through a loss function.
4. The method of claim 3, wherein the loss function of the StarGAN model comprises: an adversarial loss, a style reconstruction loss, a diversity sensitivity loss, and a cycle consistency loss, wherein,
the adversarial loss is expressed as:

L_{adv} = \mathbb{E}_{x,y}\left[\log D_y(x)\right] + \mathbb{E}_{x,\tilde{y},z}\left[\log\left(1 - D_{\tilde{y}}(G(x,\tilde{s}))\right)\right]

wherein L_{adv} denotes the adversarial loss, \mathbb{E}(\cdot) denotes the mathematical expectation, x denotes the input image, y denotes the original domain of the input image, D_y(x) is the output of the discriminator in the original domain y, \tilde{y} denotes the target domain, z denotes random Gaussian noise, \tilde{s} = F_{\tilde{y}}(z) denotes the target-domain style feature generated by the mapping network from the random Gaussian noise, D_{\tilde{y}}(G(x,\tilde{s})) denotes the output of the discriminator on the image generated by the generator, and G(x,\tilde{s}) denotes the fake image of the target domain generated by the generator from the input image and the target style feature;
the style reconstruction loss is expressed as:

L_{sty} = \mathbb{E}_{x,\tilde{y},z}\left[\left\| \tilde{s} - E_{\tilde{y}}(G(x,\tilde{s})) \right\|_1\right]

wherein L_{sty} denotes the style reconstruction loss, \mathbb{E}(\cdot) denotes the mathematical expectation, x denotes the input image, y denotes the original domain of the input image, \tilde{y} denotes the target domain, z denotes random Gaussian noise, \tilde{s} = F_{\tilde{y}}(z) denotes the target-domain style feature generated by the mapping network from the random Gaussian noise, G(x,\tilde{s}) denotes the fake image generated by the generator from the input image and the target style feature, and E_{\tilde{y}}(\cdot) denotes the style code extracted by the style encoder in the target domain;
the diversity sensitivity loss is expressed as:

L_{ds} = \mathbb{E}_{x,\tilde{y},z_1,z_2}\left[\left\| G(x,\tilde{s}_1) - G(x,\tilde{s}_2) \right\|_1\right]

wherein L_{ds} denotes the diversity sensitivity loss, \mathbb{E}(\cdot) denotes the mathematical expectation, z_1 and z_2 denote random Gaussian noise vectors, \tilde{s}_1 = F_{\tilde{y}}(z_1) and \tilde{s}_2 = F_{\tilde{y}}(z_2) denote the style feature vectors output by the mapping network for z_1 and z_2 respectively, and G(x,\tilde{s}_1) and G(x,\tilde{s}_2) denote the images generated by the generator from the input image and the style features \tilde{s}_1 and \tilde{s}_2 respectively;
the cycle consistency loss is expressed as:

L_{cyc} = \mathbb{E}_{x,y,\tilde{y},z}\left[\left\| x - G(G(x,\tilde{s}), \hat{s}) \right\|_1\right]

wherein L_{cyc} denotes the cycle consistency loss, \mathbb{E}(\cdot) denotes the mathematical expectation, x denotes the input image, y denotes the original domain of the input image, \tilde{y} denotes the target domain, z denotes random Gaussian noise, \hat{s} = E_y(x) denotes the estimated style code of the input image x, G(x,\tilde{s}) denotes the fake image generated by the generator from the input image and the target style feature, and G(G(x,\tilde{s}), \hat{s}) denotes the image of style \hat{s} reconstructed by feeding the fake image G(x,\tilde{s}) and \hat{s} back into the generator;
optimizing the StarGAN model using an objective function, wherein the objective function is represented as:
\min_{G,F,E} \max_{D} \; L_{adv} + \lambda_{sty} L_{sty} - \lambda_{ds} L_{ds} + \lambda_{cyc} L_{cyc}

wherein \min_{G,F,E} denotes minimizing the objective function by training the generator, the mapping network and the style encoder, \max_{D} denotes maximizing the objective function by training the discriminator, L_{adv} denotes the adversarial loss, L_{sty} denotes the style reconstruction loss, L_{ds} denotes the diversity sensitivity loss, L_{cyc} denotes the cycle consistency loss, and \lambda_{sty}, \lambda_{ds} and \lambda_{cyc} are hyperparameters used to balance the losses.
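For illustration, the following is a minimal sketch of how the four losses of claim 4 combine into this objective; the individual loss tensors are assumed to be computed elsewhere according to the formulas above, and the λ values shown are placeholders rather than values fixed by the application.

```python
# Illustrative lambda weights; not prescribed by the application.
lambda_sty, lambda_ds, lambda_cyc = 1.0, 1.0, 1.0

def generator_objective(l_adv, l_sty, l_ds, l_cyc):
    # minimized over G, the mapping network F and the style encoder E;
    # the diversity term is subtracted so the generator is pushed toward diverse outputs
    return l_adv + lambda_sty * l_sty - lambda_ds * l_ds + lambda_cyc * l_cyc

def discriminator_objective(l_adv):
    # the discriminator maximizes the adversarial term, i.e. minimizes its negative
    return -l_adv
```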
5. The method of claim 4, wherein said inputting the low participation sample data into a StarGAN model to generate target low participation samples with different styles comprises:
inputting the face image in the low-participation sample data into a StarGAN model, generating different style characteristics through the mapping network or the style encoder, and generating target low-participation samples with different styles through a generator according to the input face image and the different style characteristics.
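To illustrate claim 5, the sketch below draws a target style code either from the mapping network (given random noise) or from the style encoder (given a reference face) before passing it to the generator. The call signatures of G, F and E, and the latent dimension, are assumptions for illustration only.

```python
# Generate a styled low-participation sample with trained StarGAN components
# G (generator), F (mapping network) and E (style encoder), assumed callables.
import torch

def generate_styled_low_participation(G, F, E, face, target_domain, ref_face=None):
    z = torch.randn(face.size(0), 16)            # random Gaussian noise (latent dim assumed)
    style = E(ref_face, target_domain) if ref_face is not None else F(z, target_domain)
    return G(face, style)                        # target low-participation sample
```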
6. The method of claim 1, wherein the RCN model comprises a feature extraction unit, a region attention unit, and a global attention unit, and the inputting the original sample data and the target low participation sample into the RCN model for training to obtain a trained RCN model comprises:
inputting the original sample data and the target low participation sample into an RCN model, and performing feature extraction on the original sample data and the target low participation sample through the feature extraction unit to obtain local area features of the sample;
in a feature space, the region attention unit learns the attention weights of different face regions to perform region coding on the local region features of the sample to obtain global features of the sample;
respectively connecting the local area characteristics of the sample with the global characteristics of the sample in series to obtain sample characteristics, obtaining attention weights of the sample characteristics through the global attention unit, and performing weighted fusion on the sample characteristics to obtain final sample characteristics;
and according to the final sample characteristics, iteratively updating and optimizing the network parameters of the RCN model with the SGD algorithm by combining the regional deviation loss and the cross entropy loss, so as to obtain the trained RCN model.
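As a concrete illustration of the training objective in claim 6, the sketch below combines a cross-entropy term with a regional deviation (region-bias) term and optimizes with SGD. The exact form of the regional deviation loss used here is an assumption borrowed from common region-attention practice, since the claim names the loss but does not fix its formula.

```python
# Combined RCN training loss: cross-entropy on the classifier logits plus a
# margin term that encourages the most attended region weight to stand out.
import torch
import torch.nn.functional as F

def rcn_loss(logits, labels, alpha, margin=0.02):
    ce = F.cross_entropy(logits, labels)
    # alpha: (batch, n_regions, 1) region attention weights from the region attention unit
    region_bias = torch.clamp(
        margin + alpha.mean(dim=1) - alpha.max(dim=1).values, min=0).mean()
    return ce + region_bias

# Typical use (assumed training loop):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# loss = rcn_loss(logits, labels, alpha); loss.backward(); optimizer.step()
```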
7. The method of claim 6, wherein the inputting the image data to be recognized into the trained RCN model to obtain an engagement recognition result comprises:
inputting the image data to be identified into the feature extraction unit for feature extraction to obtain a feature map, and randomly cutting the feature map into a preset number of regional features;
inputting the regional features into the regional attention unit, calculating attention weights of the regional features, and weighting the regional features to obtain global features;
and respectively connecting the region features with the global features in series to obtain target features, obtaining attention weights of the target features through the global attention unit, weighting the target features to obtain final features, and identifying and classifying the final features to obtain a participation identification result of the image data to be identified.
8. An area coding and sample balance optimized classroom participation identification device, comprising:
the first acquisition module is configured to acquire video data of on-line learning of a student and generate original sample data from the video data, wherein the original sample data comprises high participation sample data and low participation sample data;
the generating module is used for inputting the low participation sample data into a StarGAN model and generating target low participation samples with different styles;
the training module is used for inputting the original sample data and the target low participation sample into an RCN model for training to obtain a trained RCN model;
the second acquisition module is used for acquiring video data to be identified and generating image data to be identified according to the video data to be identified;
and the recognition module is used for inputting the image data to be recognized into the trained RCN model to obtain a participation degree recognition result.
9. The apparatus of claim 8, wherein said generating original sample data from said video data comprises:
defining a participation degree label for the video data by means of manual annotation and prior information;
and extracting image frames from the video data and cropping the face region of each extracted image frame to obtain face images as the original sample data, wherein the original sample data is divided into high participation sample data and low participation sample data according to the participation degree label.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method of any one of claims 1-7.
CN202211246980.4A 2022-10-12 2022-10-12 Classroom participation identification method and device based on region coding and sample balance optimization Pending CN115439915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211246980.4A CN115439915A (en) 2022-10-12 2022-10-12 Classroom participation identification method and device based on region coding and sample balance optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211246980.4A CN115439915A (en) 2022-10-12 2022-10-12 Classroom participation identification method and device based on region coding and sample balance optimization

Publications (1)

Publication Number Publication Date
CN115439915A true CN115439915A (en) 2022-12-06

Family

ID=84251064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211246980.4A Pending CN115439915A (en) 2022-10-12 2022-10-12 Classroom participation identification method and device based on region coding and sample balance optimization

Country Status (1)

Country Link
CN (1) CN115439915A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291819A (en) * 2020-02-19 2020-06-16 腾讯科技(深圳)有限公司 Image recognition method and device, electronic equipment and storage medium
CN111597978A (en) * 2020-05-14 2020-08-28 公安部第三研究所 Method for automatically generating pedestrian re-identification picture based on StarGAN network model
CN113159002A (en) * 2021-05-26 2021-07-23 重庆大学 Facial expression recognition method based on self-attention weight auxiliary module
CN113158872A (en) * 2021-04-16 2021-07-23 中国海洋大学 Online learner emotion recognition method
CN113344479A (en) * 2021-08-06 2021-09-03 首都师范大学 Online classroom-oriented learning participation intelligent assessment method and device
CN113421187A (en) * 2021-06-10 2021-09-21 山东师范大学 Super-resolution reconstruction method, system, storage medium and equipment
CN113537254A (en) * 2021-08-27 2021-10-22 重庆紫光华山智安科技有限公司 Image feature extraction method and device, electronic equipment and readable storage medium
CN113936317A (en) * 2021-10-15 2022-01-14 南京大学 Priori knowledge-based facial expression recognition method
CN114065874A (en) * 2021-11-30 2022-02-18 河北省科学院应用数学研究所 Medicine glass bottle appearance defect detection model training method and device and terminal equipment
CN114973126A (en) * 2022-05-17 2022-08-30 中南大学 Real-time visual analysis method for student participation degree of online course

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAI WANG等: "Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition", 《IEEE TRANSACTIONS ON IMAGE PROCESSING》, vol. 29, pages 1 *
YUNJEY CHOI等: "StarGAN v2: Diverse Image Synthesis for Multiple Domains", 《ARXIV:1912.01865V2》, pages 1 - 3 *

Similar Documents

Publication Publication Date Title
Matern et al. Exploiting visual artifacts to expose deepfakes and face manipulations
CN110889672B (en) Student card punching and class taking state detection system based on deep learning
CN109558832A (en) A kind of human body attitude detection method, device, equipment and storage medium
CN110263681A (en) The recognition methods of facial expression and device, storage medium, electronic device
CN114359526B (en) Cross-domain image style migration method based on semantic GAN
Li et al. Image manipulation localization using attentional cross-domain CNN features
CN113283334B (en) Classroom concentration analysis method, device and storage medium
Ververas et al. Slidergan: Synthesizing expressive face images by sliding 3d blendshape parameters
CN111275638A (en) Face restoration method for generating confrontation network based on multi-channel attention selection
Gafni et al. Wish you were here: Context-aware human generation
CN113112416A (en) Semantic-guided face image restoration method
Liu et al. Modern architecture style transfer for ruin or old buildings
CN116403262A (en) Online learning concentration monitoring method, system and medium based on machine vision
CN115731596A (en) Spontaneous expression recognition method based on progressive label distribution and depth network
CN114549341A (en) Sample guidance-based face image diversified restoration method
CN112070181A (en) Image stream-based cooperative detection method and device and storage medium
CN114841887B (en) Image recovery quality evaluation method based on multi-level difference learning
CN108665455B (en) Method and device for evaluating image significance prediction result
CN115439915A (en) Classroom participation identification method and device based on region coding and sample balance optimization
CN110210574A (en) Diameter radar image decomposition method, Target Identification Unit and equipment
CN112115779B (en) Interpretable classroom student emotion analysis method, system, device and medium
JP7362924B2 (en) Data augmentation-based spatial analysis model learning device and method
Li et al. Face mask removal based on generative adversarial network and texture network
CN114049303A (en) Progressive bone age assessment method based on multi-granularity feature fusion
Narayana Improving gesture recognition through spatial focus of attention

Legal Events

Code: Title / Description
PB01: Publication
SE01: Entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 20221206)