CN113239915B - Classroom behavior identification method, device, equipment and storage medium - Google Patents

Classroom behavior identification method, device, equipment and storage medium

Info

Publication number
CN113239915B
CN113239915B (application CN202110787829.0A)
Authority
CN
China
Prior art keywords
target
classroom
network
time
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110787829.0A
Other languages
Chinese (zh)
Other versions
CN113239915A (en)
Inventor
梁美玉
黄勇康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110787829.0A priority Critical patent/CN113239915B/en
Publication of CN113239915A publication Critical patent/CN113239915A/en
Application granted granted Critical
Publication of CN113239915B publication Critical patent/CN113239915B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Abstract

The disclosure provides a classroom behavior identification method, apparatus, device and storage medium, wherein the method comprises: acquiring a classroom video; performing target identification and tracking based on the image frames in the classroom video to obtain a target image stream; and inputting the target image stream into a trained recognition model for recognition to obtain a classroom behavior result, wherein the trained recognition model comprises: a spatio-temporal feature network for extracting features of the target image stream to obtain spatio-temporal features of the target image stream; a deep feature network for learning the spatio-temporal features to obtain deep features; and a classification network for classifying the deep features to obtain the classroom behavior result. The method and the device address the problems of numerous and mutually occluded targets that arise in practice when recognizing classroom behavior in teaching scenes. They strengthen the learning of the temporal characteristics of classroom behavior so as to discover how classroom behavior changes along the time dimension, and thereby further improve the accuracy of student classroom behavior recognition.

Description

Classroom behavior identification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recognizing classroom behavior.
Background
With the continuous development of artificial intelligence, intelligent technology is gradually entering every aspect of production and life. Concepts such as smart cities, smart offices and smart medical care keep emerging and developing rapidly, and smart education is moving from theory into campuses, becoming an inevitable trend in the development of science and education; how to recognize and analyze students' classroom behavior more efficiently has therefore become a research hotspot of smart education. In traditional education, one teacher must teach dozens or even hundreds of students, so individualized teaching, in which instruction is adapted to each student's aptitude, cannot be realized at such a scale. A teacher's energy is limited and cannot cover so many students at the same time, so feedback on the teaching method can only be gathered from occasional observation of a small number of students: their expressions, behaviors and other indicators of their listening state. Moreover, students' receptivity to the taught content gradually decreases, because teaching usually proceeds from easy to difficult, and students may become unable to follow the teacher's pace at some point; the teacher therefore needs to observe the students' listening state frequently in order to adjust the teaching pace and method effectively and obtain a better teaching effect.
However, classroom teaching videos often contain numerous student targets with severe occlusion, which poses great research challenges to student behavior recognition in classroom scenes. Current multi-person behavior recognition algorithms fall mainly into two major categories: algorithms based on target detection and image classification, and algorithms based on joint-point recognition and analysis of joint-point movement patterns. The former can often run in real time but ignores the temporal characteristics of behavior and merely classifies the target's state; the latter can learn the temporal characteristics of joint-point position changes during a person's motion, but joint-point recognition is an extremely computation-intensive process whose accuracy cannot be guaranteed.
Disclosure of Invention
In view of this, an object of the present disclosure is to provide a method, an apparatus, a device, and a storage medium for recognizing classroom behavior.
In view of the above, according to a first aspect of the present disclosure, there is provided a method for identifying classroom behavior, including:
acquiring a classroom video;
performing target identification and tracking based on the image frames in the classroom video to obtain a target image stream;
inputting the target image stream into a trained recognition model for recognition to obtain a classroom behavior result, wherein the trained recognition model comprises:
the space-time characteristic network is used for extracting the characteristics of the target image stream to obtain the space-time characteristics of the target image stream;
the deep feature network is connected with the spatio-temporal feature network and is used for learning the spatio-temporal features to obtain deep features;
and the classification network is connected with the deep characteristic network and is used for classifying the deep characteristics to obtain the classroom behavior result.
Optionally, the spatio-temporal feature network includes a two-dimensional spatial convolution layer and a one-dimensional temporal convolution layer, wherein the calculation formula of the two-dimensional spatial convolution layer includes:

$$M_i = \left\lfloor \frac{t\, d^2\, N_{i-1} N_i}{d^2\, N_{i-1} + t\, N_i} \right\rfloor$$

wherein M_i represents the output spatio-temporal features, t represents the time step, d represents the dimension of the two-dimensional spatial convolution, and N_{i-1} and N_i represent the parameters of the (i-1)-th and i-th two-dimensional convolutions.
Optionally, learning the spatiotemporal features to obtain deep features includes:
dividing the spatio-temporal features into a first spatio-temporal feature and a second spatio-temporal feature;
performing an approximate attention calculation on the first spatio-temporal feature based on locality-sensitive hashing to obtain an attention feature;
combining the second spatio-temporal feature with the attention feature to obtain a first output feature;
performing a feedforward calculation on the first output feature to obtain a feedforward feature, and combining the feedforward feature with the first spatio-temporal feature to obtain a second output feature;
combining the first output feature and the second output feature to obtain the deep feature.
Optionally, the performing target recognition and tracking based on the image frames in the classroom video to obtain a target image stream includes:
carrying out target recognition on the image frame to obtain a target recognition result comprising a target image;
and tracking the target based on the appearance characteristics of the target recognition result to obtain the target image stream.
Optionally, performing target recognition on the image frame to obtain a target recognition result including a target image, including:
carrying out feature extraction on the image frame to obtain target features;
and carrying out target detection based on the target features to obtain the target recognition result, wherein the target recognition result further comprises a bounding box indicating the region of the target image and a confidence of the bounding box.
Optionally, performing target tracking based on the appearance feature of the target recognition result to obtain the target image stream, including:
extracting appearance characteristics of a target image of a current image frame to obtain target current appearance characteristics of the target image;
predicting a predicted image corresponding to the target image in the next image frame based on the target current appearance characteristic;
and associating the predicted image with a target image detected in the next image frame to obtain the target image stream.
Optionally, the method further comprises: and displaying the classroom behavior result.
According to a second aspect of the present disclosure, there is provided an apparatus for recognizing classroom behavior, comprising
The acquisition module is used for acquiring classroom videos;
the identification tracking module is used for carrying out target identification and tracking based on the image frames in the classroom video to obtain a target image stream;
the recognition module is used for inputting the target image stream into a trained recognition model for recognition to obtain a classroom behavior result, wherein the trained recognition model comprises:
the space-time characteristic network is used for extracting the characteristics of the target image stream to obtain the space-time characteristics of the target image stream;
the deep feature network is connected with the spatio-temporal feature network and is used for learning the spatio-temporal features to obtain deep features;
and the classification network is connected with the deep characteristic network and is used for classifying the deep characteristics to obtain the classroom behavior result.
According to a third aspect of the present disclosure, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect when executing the program.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of the first aspects.
As can be seen from the foregoing, in the classroom behavior identification method, apparatus, device and storage medium provided by the present disclosure, the target image stream is obtained through target detection and tracking, the spatio-temporal features of classroom behavior are extracted by making full use of the target image stream, and the deep features of classroom behavior are further learned, which solves the problems of numerous and occluded targets encountered in practice when recognizing classroom behavior in teaching scenes. The learning of the temporal characteristics of classroom behavior is strengthened so as to discover how classroom behavior changes along the time dimension, further improving the accuracy of student classroom behavior recognition.
Drawings
In order to more clearly illustrate the technical solutions in the present disclosure or related technologies, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method of identification of classroom behavior in accordance with an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a trained recognition model according to an embodiment of the present disclosure;
fig. 3 is a schematic block diagram of an identification apparatus of classroom behavior in accordance with an embodiment of the present disclosure;
fig. 4 is a more specific hardware structure diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that technical terms or scientific terms used in the embodiments of the present disclosure should have a general meaning as understood by those having ordinary skill in the art to which the present disclosure belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the disclosure is not intended to indicate any order, quantity, or importance, but rather to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
Existing behavior recognition methods sometimes cannot be applied to classroom education scenes, and some are designed for single-person scenes and cannot be applied directly to multi-person scenes. For example, one class of methods learns the spatial features of students' classroom behavior with a VGG16 deep neural network and obtains the behavior category with a classification algorithm; such methods extract student targets with a target detection algorithm but learn only the spatial features of classroom behavior, neglecting the highly important time dimension of behavior, and therefore cannot deeply mine the spatio-temporal features of students' classroom behavior. Another class of methods performs multi-person behavior recognition based on joint-point recognition and deep neural networks: for example, OpenPose is adopted to extract human joint points, the joint points are used to represent the human body, and a deep neural network learns the pattern of joint-point position changes to obtain the behavior category. Such methods simply use joint points to represent the human body and extract only the spatial features of behavior, so they cannot completely represent its spatio-temporal features; moreover, in scenes where targets are numerous and occlusion is severe, the accuracy of joint-point recognition cannot be guaranteed.
Moreover, among existing online education resources there is not much video data of classroom teaching, and classroom teaching scenes suffer from numerous targets and severe occlusion. These problems all increase the difficulty of multi-person behavior recognition in classroom scenes, so existing multi-person behavior recognition methods do not perform well in teaching scenes. The present disclosure observes that a student's listening state comprises expressions and behaviors: positive expressions and behaviors indicate that the student is actively participating in classroom teaching, can follow the teacher's pace and is thinking about the content, but the main listening state still has to be obtained by analyzing the student's behavior; if a student exhibits negative behaviors such as sleeping or turning around, the student is generally confused by, or even averse to, the teaching content. Furthermore, students' classroom behavior changes over time, and changes in their listening behavior need to be observed in real time so that useful feedback on the teaching effect can be obtained promptly. After class, the students' classroom behaviors in the recorded teaching video can be analyzed accordingly, and the teaching method can be adjusted to achieve a better teaching effect.
Based on the above consideration, the embodiment of the present disclosure provides a classroom behavior identification method. Referring to fig. 1, fig. 1 shows a schematic flow chart of a method of identification of classroom behavior according to an embodiment of the present disclosure. As shown in fig. 1, the identification method of classroom behavior includes:
step S110, obtaining a classroom video;
step S120, performing target identification and tracking based on the image frames in the classroom video to obtain a target image stream;
step S130, inputting the target image stream into a trained recognition model for recognition to obtain a classroom behavior result, wherein the trained recognition model comprises:
the space-time characteristic network is used for extracting the characteristics of the target image stream to obtain the space-time characteristics of the target image stream;
the deep feature network is connected with the spatio-temporal feature network and is used for learning the spatio-temporal features to obtain deep features;
and the classification network is connected with the deep characteristic network and is used for classifying the deep characteristics to obtain the classroom behavior result.
In the embodiments of the present disclosure, a target image stream is obtained after performing target recognition and tracking on the classroom video, and features reflecting classroom behavior are then extracted: specifically, the spatio-temporal features of classroom behavior are extracted with the spatio-temporal feature network, the deep features of classroom behavior are further learned with the deep feature network, and finally the deep features are classified to obtain the recognized classroom behavior result. Compared with traditional classroom behavior recognition methods, this strengthens the learning of the temporal characteristics of classroom behavior, discovers how classroom behavior changes along the time dimension, and further improves the accuracy of student classroom behavior recognition.
Most traditional multi-person classroom behavior recognition methods consider only the pattern of joint-point position changes, yet bare joint-point positions cannot fully express behavioral features, and accurately recognizing joint points in classroom scenes with many severely occluded targets is far from simple. Other methods consider only spatial features and ignore the temporal characteristics of behavior, although a behavior is a complete action and ceases to be a behavior once the time dimension is lost. The embodiments of the present disclosure use a target detection and tracking algorithm to obtain a continuous image stream of each student in the video and fully mine the spatio-temporal features of behavior; based on a more accurate target detection and tracking algorithm, a multi-person classroom behavior recognition result with higher accuracy can be obtained, with stronger robustness.
According to the embodiment of the present disclosure, in step S110, a classroom video is acquired.
The classroom video can be real-time data directly acquired by an image acquisition device, or classroom video data acquired from a local data source or a remote data source. For example, the classroom video can be real-time classroom data collected by a camera disposed in a classroom or non-real-time classroom data obtained from other data sources.
Alternatively, one or more image frames may be taken from the real-time or non-real-time video data. In some embodiments, the classroom video data can be framed to obtain the image frames of the classroom video.
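As an illustrative, non-limiting sketch of this framing step (OpenCV is assumed; the video path and the sampling stride are placeholders, not values fixed by the disclosure):

```python
# Minimal sketch of extracting image frames from a classroom video.
import cv2

def extract_frames(video_path: str, stride: int = 1):
    """Yield image frames from the video, keeping every `stride`-th frame."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()  # frame is a BGR ndarray, ok is False at end of stream
        if not ok:
            break
        if index % stride == 0:
            yield frame
        index += 1
    cap.release()

# Usage: frames = list(extract_frames("classroom.mp4", stride=2))
```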
According to the embodiment of the disclosure, in step S120, target recognition and tracking are performed based on the image frames in the classroom video, so as to obtain a target image stream.
Here, a network model (such as a YOLOv5 network) can be constructed to extract the distribution features of the target objects (such as students) from the image frames in the classroom video to obtain target images; the appearance features of the corresponding targets are then extracted across different image frames according to each target's images, and for each target object the relevance between its target images in different image frames is calculated to obtain the target image stream of that target object.
Optionally, the performing target recognition and tracking based on the image frames in the classroom video to obtain a target image stream includes:
carrying out target recognition on the image frame to obtain a target recognition result comprising a target image;
and tracking the target based on the appearance characteristics of the target recognition result to obtain the target image stream.
In some embodiments, performing target recognition on the image frame to obtain a target recognition result including a target image includes:
carrying out feature extraction on the image frame to obtain target features;
and carrying out target detection based on the target features to obtain the target recognition result, wherein the target recognition result further comprises a bounding box indicating the region of the target image and a confidence of the bounding box.
In some embodiments, the image frames may be subjected to target recognition through a target detection network, and a target recognition result including a target image is obtained.
Specifically, the object detection network may include a deep convolutional neural network that extracts object features of image frames in classroom video, a feature fusion network that fuses multiple layers of features, and a convolutional network for object detection. The convolution process of the deep convolutional neural network may include:
$$x^{(k)} = \sigma\!\left(W^{(k)} * x^{(k-1)} + b^{(k)}\right), \quad k = 1, \dots, N$$

wherein x^{(k-1)} is the input of each layer, W^{(k)} represents the weight of each layer (the weights w of its convolution kernels), b^{(k)} represents the offset of each convolutional layer, σ(·) represents the convolution operation, N represents the total number of convolutional layers, k represents the index of the current convolutional layer, m represents the dimension of the picture, and n represents the number of neurons in the convolutional layer.
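A minimal PyTorch sketch of such a layered convolutional forward pass follows; the channel counts, kernel size, stride and activation are illustrative assumptions, not the exact backbone of the disclosure:

```python
# Each layer applies convolution-kernel weights plus a bias, followed by a
# non-linearity, i.e. x_k = sigma(W_k * x_{k-1} + b_k).
import torch
import torch.nn as nn

class ConvBackbone(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.SiLU()]  # non-linearity between successive conv layers
        self.body = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, 3, H, W) image frames
        return self.body(x)
```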
After the target features of the image frames in the classroom video are obtained, the feature fusion network is used to learn the high-level features produced by the deep convolutional neural network, the fused features are input into the convolutional network for target detection, and the confidence of the corresponding category is obtained by calculation as:

$$C_i^j = P_r(\mathrm{Object}) \times \mathrm{IOU}_{\mathrm{pred}}^{\mathrm{truth}}$$

wherein C_i^j represents the confidence of the j-th bounding box of the i-th cell, P_r(Object) represents the probability that the current bounding box contains an object, and IOU_pred^truth represents the ratio of the intersection to the union of the predicted bounding box and the true bounding box.
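The confidence computation can be illustrated with the following hedged sketch, assuming boxes given as (x1, y1, x2, y2) pixel coordinates (a convention chosen here for illustration):

```python
# Intersection-over-union between two boxes, then C = Pr(Object) * IOU.
def iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def confidence(p_object: float, pred_box, truth_box) -> float:
    # C_i^j = Pr(Object) * IOU_pred^truth
    return p_object * iou(pred_box, truth_box)
```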
Meanwhile, the coordinate position of the target box is obtained by calculation, where the loss function includes:

$$\mathcal{L} = 1 - \mathrm{IoU} + \frac{\rho^2\!\left(b,\, b^{gt}\right)}{c^2}$$

wherein b and b^{gt} respectively represent the center points of the predicted bounding box and the real box, ρ(·) represents the Euclidean distance, and c represents the diagonal length of the smallest rectangle containing the two boxes.
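A corresponding sketch of this distance-IoU style loss, under the same assumed (x1, y1, x2, y2) box convention and reusing iou() from the previous sketch:

```python
def diou_loss(pred_box, truth_box) -> float:
    # Center points b and b^gt of the two boxes
    bx = ((pred_box[0] + pred_box[2]) / 2, (pred_box[1] + pred_box[3]) / 2)
    gx = ((truth_box[0] + truth_box[2]) / 2, (truth_box[1] + truth_box[3]) / 2)
    rho2 = (bx[0] - gx[0]) ** 2 + (bx[1] - gx[1]) ** 2  # squared Euclidean distance
    # Diagonal length c of the smallest rectangle enclosing both boxes
    cw = max(pred_box[2], truth_box[2]) - min(pred_box[0], truth_box[0])
    ch = max(pred_box[3], truth_box[3]) - min(pred_box[1], truth_box[1])
    c2 = cw ** 2 + ch ** 2 + 1e-9
    return 1.0 - iou(pred_box, truth_box) + rho2 / c2
```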
In some embodiments, performing target tracking based on appearance characteristics of the target recognition result to obtain the target image stream includes:
extracting appearance characteristics of a target image of a current image frame to obtain target current appearance characteristics of the target image;
predicting a predicted image corresponding to the target image in the next image frame based on the target current appearance characteristic;
and associating the predicted image with a target image detected in the next image frame to obtain the target image stream.
Specifically, the target images of the target objects recognized in the current image frame may be input into a trained appearance-embedding model to obtain the appearance features of each target object (i.e., each student). The state of each student in the next image frame is predicted with a Kalman filtering algorithm; after the size and position of the target objects in the next image frame are detected, the Hungarian algorithm is used for matching, so that the target images of the same target object in the current and the next image frame are associated, and so on, thereby realizing target tracking and obtaining the target image stream.
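The frame-to-frame association step can be sketched as follows, with the Kalman prediction and the appearance-embedding model abstracted away; the cosine-distance cost and the matching threshold are illustrative assumptions:

```python
# Hungarian matching between tracked targets and detections in the next frame.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_feats: np.ndarray, det_feats: np.ndarray, max_dist: float = 0.4):
    """track_feats: (T, D), det_feats: (N, D) L2-normalized appearance embeddings."""
    cost = 1.0 - track_feats @ det_feats.T      # cosine-distance matrix (T, N)
    rows, cols = linear_sum_assignment(cost)    # Hungarian algorithm
    # Keep only matches whose distance is acceptable; the rest start new tracks
    return [(t, d) for t, d in zip(rows, cols) if cost[t, d] <= max_dist]
```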
According to the embodiment of the present disclosure, in step S130, the target image stream is input into a trained recognition model for recognition, so as to obtain a classroom behavior result, where the trained recognition model includes:
the space-time characteristic network is used for extracting the characteristics of the target image stream to obtain the space-time characteristics of the target image stream;
the deep feature network is connected with the spatio-temporal feature network and is used for learning the spatio-temporal features to obtain deep features;
and the classification network is connected with the deep characteristic network and is used for classifying the deep characteristics to obtain the classroom behavior result.
Referring to fig. 2, fig. 2 shows a schematic diagram of the trained recognition model according to an embodiment of the present disclosure. As shown in fig. 2, the spatio-temporal feature network extracts the spatio-temporal features of the target image stream that represent classroom behavior, the deep feature network then further mines the deep features of classroom behavior, and finally the classification algorithm of the classification network yields the classroom behavior result, thereby realizing multi-person classroom behavior recognition. In practical application, the data under each behavior category in the data set can be divided into a training set and a test set, the constructed R(2+1)D-Reformer model is trained on the training set with different parameters, and the model with the highest accuracy is selected as the trained recognition model.
Optionally, the spatio-temporal feature network may comprise an R(2+1)D network. Further, in some embodiments, the R(2+1)D network adopts a ResNet3D50 network architecture. The R(2+1)D network decomposes 3-dimensional convolutions into 2-dimensional spatial convolutions and 1-dimensional temporal convolutions, so that spatial and temporal features are decoupled; an additional non-linear operation is introduced between them, making the network easier to optimize and yielding a smaller training error.
In some embodiments, the spatio-temporal feature network comprises two-dimensional spatial convolution layers and one-dimensional temporal convolution layers, wherein the calculation formula of the two-dimensional spatial convolution layer comprises:

$$M_i = \left\lfloor \frac{t\, d^2\, N_{i-1} N_i}{d^2\, N_{i-1} + t\, N_i} \right\rfloor$$

wherein M_i represents the output spatio-temporal features, t represents the time step, d represents the dimension of the two-dimensional spatial convolution, and N_{i-1} and N_i represent the parameters of the (i-1)-th and i-th two-dimensional convolutions.
Further, the time step t may refer to a video frame number.
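A minimal sketch of one such factorized (2+1)D unit follows, assuming a PyTorch implementation with illustrative kernel sizes t = d = 3; the intermediate channel count follows the formula above:

```python
# One R(2+1)D unit: a 2D spatial convolution (1, d, d) followed by a 1D
# temporal convolution (t, 1, 1), with a non-linearity in between.
import torch.nn as nn

def middle_channels(n_in: int, n_out: int, t: int = 3, d: int = 3) -> int:
    # M_i = floor(t * d^2 * N_{i-1} * N_i / (d^2 * N_{i-1} + t * N_i))
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

class R2Plus1D(nn.Module):
    def __init__(self, n_in: int, n_out: int, t: int = 3, d: int = 3):
        super().__init__()
        m = middle_channels(n_in, n_out, t, d)
        self.spatial = nn.Conv3d(n_in, m, (1, d, d), padding=(0, d // 2, d // 2))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(m, n_out, (t, 1, 1), padding=(t // 2, 0, 0))

    def forward(self, x):  # x: (batch, C, T, H, W) target image stream
        return self.relu(self.temporal(self.relu(self.spatial(x))))
```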
Optionally, the deep feature network may include a Reformer network.
In some embodiments, learning the spatio-temporal features to obtain the deep features includes:
dividing the spatio-temporal features into a first spatio-temporal feature and a second spatio-temporal feature;
performing an approximate attention calculation on the first spatio-temporal feature based on locality-sensitive hashing to obtain an attention feature;
combining the second spatio-temporal feature with the attention feature to obtain a first output feature;
performing a feedforward calculation on the first output feature to obtain a feedforward feature, and combining the feedforward feature with the first spatio-temporal feature to obtain a second output feature;
combining the first output feature and the second output feature to obtain the deep feature.
Specifically, in order to better learn the deep features of classroom behavior, the disclosed embodiments add a Reformer network on top of the R(2+1)D network to further learn the deep features of classroom behavior. The Reformer network replaces the ordinary attention mechanism with locality-sensitive hashing (LSH), which achieves a better effect. The standard attention of the conventional approach is the scaled dot-product:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

wherein Q and K represent the query and key matrices of dimension d_k, and V represents the value matrix. Locality-sensitive hashing instead assigns vectors to hash buckets, and the scaled attention values of Q and K need to be computed only within the same bucket, which gives better sensitivity. Meanwhile, in order to reduce memory occupation, the Reformer network adopts the idea of the reversible layer (RevNet). A conventional residual network is single-input, single-output:

$$y = x + F(x)$$

whereas RevNet operates on paired inputs/outputs:

$$(x_1, x_2) \mapsto (y_1, y_2)$$

and follows the following equations:

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1)$$

wherein F represents the attention layer (LSH) and G represents the feed-forward layer (FFN). The activations of each layer can thus be recovered from those of the following layer, so they need to be stored only once instead of once per layer, and the memory occupation can be greatly reduced.
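The reversible coupling can be sketched as follows; plain linear layers stand in for the LSH attention layer F and the feed-forward layer G purely for illustration:

```python
# RevNet-style reversible block: forward computes (y1, y2) from (x1, x2),
# and inverse recovers the inputs exactly, so activations need not be stored.
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.F = nn.Linear(dim, dim)  # placeholder for the LSH attention layer
        self.G = nn.Linear(dim, dim)  # placeholder for the feed-forward layer

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2
```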
Therefore, the embodiments of the present disclosure not only obtain the students' target image streams through target detection and tracking, but also make full use of two-dimensional convolution and one-dimensional temporal convolution to learn the spatio-temporal features of students' classroom behavior, and use the locality-sensitive-hashing attention mechanism to learn its salient features, so that the deep features of students' classroom behavior are fully learned and the problems of numerous, occluded targets in real teaching classroom scenes are solved.
According to an embodiment of the present disclosure, the method further comprises: and displaying the classroom behavior result.
In some embodiments, the classroom behavior results can include categories such as cursory glancing, sleeping, attending class, chatting, playing with a mobile phone, and the like.
Further, in some embodiments, presenting the classroom behavior results can include: indicating different classroom behaviors on the display screen with bounding boxes of different colors.
Specifically, the classroom behavior results can be displayed for an offline classroom teaching video or an online classroom monitoring feed, with the recognition results drawn onto the video frames using bounding boxes of different colors.
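A hedged sketch of this display step, assuming OpenCV and an illustrative class-to-color map (the labels and colors are placeholders):

```python
# Draw each recognized behavior onto the frame with a per-class color.
import cv2

COLORS = {"attending": (0, 255, 0), "sleeping": (0, 0, 255), "chatting": (255, 0, 0)}

def draw_results(frame, results):
    """results: iterable of (x1, y1, x2, y2, label) with integer pixel coords."""
    for x1, y1, x2, y2, label in results:
        color = COLORS.get(label, (255, 255, 255))
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
        cv2.putText(frame, label, (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    return frame
```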
For actual teaching scenes, classroom teaching video data is characterized by numerous targets and severe occlusion. To accurately recognize the classroom behaviors of multiple students, the embodiments of the present disclosure extract the student-target distribution features of the teaching-classroom video frames based on a YOLOv5 network to obtain student target images; extract the appearance features of student targets across different frames based on the classroom student target detection algorithm and compute their relevance to obtain student target image streams; and extract the spatio-temporal features of the image streams representing the students' classroom behaviors, obtaining the classroom behavior categories with a classification algorithm. Compared with existing methods, the embodiments of the present invention deeply mine the deep features of multiple students' classroom behaviors from the following two angles: (1) in the student-target acquisition stage, student targets in the video are detected based on the classroom student target detection and tracking algorithm, students' appearance features are extracted, and the student targets of adjacent frames are associated to obtain continuous student target image streams; (2) in the student classroom-behavior recognition stage, the spatio-temporal features of students' classroom behavior are learned based on the R(2+1)D network, and the deep features of students' classroom behavior are further mined in combination with the Reformer network, so that a better feature representation of students' classroom behavior is obtained and more accurate recognition of multiple students' classroom behaviors is realized.
It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may only perform one or more steps of the method of the embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above describes some embodiments of the disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the disclosure also provides a device for recognizing classroom behaviors.
Referring to fig. 3, the classroom behavior recognition apparatus includes:
the acquisition module is used for acquiring classroom videos;
the identification tracking module is used for carrying out target identification and tracking based on the image frames in the classroom video to obtain a target image stream;
the recognition module is used for inputting the target image stream into a trained recognition model for recognition to obtain a classroom behavior result, wherein the trained recognition model comprises:
the space-time characteristic network is used for extracting the characteristics of the target image stream to obtain the space-time characteristics of the target image stream;
the deep feature network is connected with the spatio-temporal feature network and is used for learning the spatio-temporal features to obtain deep features;
and the classification network is connected with the deep characteristic network and is used for classifying the deep characteristics to obtain the classroom behavior result.
In some embodiments, the classroom behavior recognition apparatus further includes: and the display module (not shown in the figure) is used for displaying the classroom behavior result.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations of the present disclosure.
The device of the above embodiment is used for implementing the corresponding classroom behavior identification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above embodiments, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the method for recognizing the classroom behavior according to any of the above embodiments is implemented.
Fig. 4 shows a more specific hardware structure diagram of an electronic device according to an embodiment of the present disclosure, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used for implementing the corresponding classroom behavior identification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the classroom behavior identification method as described in any of the above embodiments.
The computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the classroom behavior identification method according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, and are not described herein again.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the present disclosure, technical features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the present disclosure, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present disclosure are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that the embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The disclosed embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made within the spirit and principles of the embodiments of the disclosure are intended to be included within the scope of the disclosure.

Claims (8)

1. A classroom behavior identification method comprises the following steps:
acquiring a classroom video;
performing target identification and tracking based on the image frames in the classroom video to obtain a target image stream;
inputting the target image stream into a trained recognition model for recognition to obtain a classroom behavior result, wherein the trained recognition model comprises:
the space-time characteristic network is used for extracting the characteristics of the target image stream to obtain the space-time characteristics of the target image stream;
the deep feature network is connected with the spatio-temporal feature network and is used for learning the spatio-temporal features to obtain deep features, the learning comprising:
dividing the spatio-temporal features into a first spatio-temporal feature and a second spatio-temporal feature;
performing an approximate attention calculation on the first spatio-temporal feature based on locality-sensitive hashing to obtain an attention feature;
combining the second spatio-temporal feature with the attention feature to obtain a first output feature;
performing a feedforward calculation on the first output feature to obtain a feedforward feature, and combining the feedforward feature with the first spatio-temporal feature to obtain a second output feature;
combining the first output feature and the second output feature to obtain the deep feature;
the deep feature network comprises a Reformer network, the Reformer network adopting the idea of a reversible layer with paired inputs/outputs:

$$(x_1, x_2) \mapsto (y_1, y_2)$$

and following the equations:

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1)$$

wherein F represents the attention layer and G represents the feedforward layer;
the classification network is connected with the deep feature network and is used for classifying the deep features to obtain the classroom behavior result;
the spatio-temporal feature network comprises a two-dimensional spatial convolution layer and a one-dimensional temporal convolution layer, wherein the calculation formula of the two-dimensional spatial convolution layer comprises:

$$M_i = \left\lfloor \frac{t\, d^2\, N_{i-1} N_i}{d^2\, N_{i-1} + t\, N_i} \right\rfloor$$

wherein M_i represents the output spatio-temporal features, t represents the time step, d represents the dimension of the two-dimensional spatial convolution, and N_{i-1} and N_i represent the parameters of the (i-1)-th and i-th two-dimensional convolutions.
2. The method of claim 1, wherein the performing target recognition and tracking based on image frames in the classroom video to obtain a target image stream comprises:
carrying out target recognition on the image frame to obtain a target recognition result comprising a target image;
and tracking the target based on the appearance characteristics of the target recognition result to obtain the target image stream.
3. The method of claim 2, wherein performing target recognition on the image frame to obtain a target recognition result comprising a target image comprises:
carrying out feature extraction on the image frame to obtain target features;
and carrying out target detection based on the target features to obtain the target recognition result, wherein the target recognition result further comprises a bounding box indicating the region of the target image and a confidence of the bounding box.
4. The method of claim 2, wherein performing target tracking based on appearance features of the target recognition result to obtain the target image stream comprises:
extracting appearance characteristics of a target image of a current image frame to obtain target current appearance characteristics of the target image;
predicting a predicted image corresponding to the target image in the next image frame based on the target current appearance characteristic;
and associating the predicted image with a target image detected in the next image frame to obtain the target image stream.
5. The method of claim 1, further comprising: and displaying the classroom behavior result.
6. A classroom behavior recognition device, comprising:
The acquisition module is used for acquiring classroom videos;
the identification tracking module is used for carrying out target identification and tracking based on the image frames in the classroom video to obtain a target image stream;
the recognition module is used for inputting the target image stream into a trained recognition model for recognition to obtain a classroom behavior result, wherein the trained recognition model comprises:
the space-time characteristic network is used for extracting the characteristics of the target image stream to obtain the space-time characteristics of the target image stream;
the deep feature network is connected with the spatio-temporal feature network and is used for learning the spatio-temporal features to obtain deep features, the learning comprising:
dividing the spatio-temporal features into a first spatio-temporal feature and a second spatio-temporal feature;
performing an approximate attention calculation on the first spatio-temporal feature based on locality-sensitive hashing to obtain an attention feature;
combining the second spatio-temporal feature with the attention feature to obtain a first output feature;
performing a feedforward calculation on the first output feature to obtain a feedforward feature, and combining the feedforward feature with the first spatio-temporal feature to obtain a second output feature;
combining the first output feature and the second output feature to obtain the deep feature;
the deep feature network comprises a Reformer network, the Reformer network adopting the idea of a reversible layer with paired inputs/outputs:

$$(x_1, x_2) \mapsto (y_1, y_2)$$

and following the equations:

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1)$$

wherein F represents the attention layer and G represents the feedforward layer;
the classification network is connected with the deep feature network and is used for classifying the deep features to obtain the classroom behavior result;
the spatio-temporal feature network comprises a two-dimensional spatial convolution layer and a one-dimensional temporal convolution layer, wherein the calculation formula of the two-dimensional spatial convolution layer comprises:

$$M_i = \left\lfloor \frac{t\, d^2\, N_{i-1} N_i}{d^2\, N_{i-1} + t\, N_i} \right\rfloor$$

wherein M_i represents the output spatio-temporal features, t represents the time step, d represents the dimension of the two-dimensional spatial convolution, and N_{i-1} and N_i represent the parameters of the (i-1)-th and i-th two-dimensional convolutions.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 5 when executing the program.
8. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
CN202110787829.0A 2021-07-13 2021-07-13 Classroom behavior identification method, device, equipment and storage medium Active CN113239915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110787829.0A CN113239915B (en) 2021-07-13 2021-07-13 Classroom behavior identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110787829.0A CN113239915B (en) 2021-07-13 2021-07-13 Classroom behavior identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113239915A CN113239915A (en) 2021-08-10
CN113239915B true CN113239915B (en) 2021-11-30

Family

ID=77135402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110787829.0A Active CN113239915B (en) 2021-07-13 2021-07-13 Classroom behavior identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113239915B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797964A (en) * 2021-09-08 2023-03-14 广州视源电子科技股份有限公司 Behavior recognition method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334610A * 2019-06-14 2019-10-15 华中师范大学 Multi-dimensional classroom quantization system and method based on computer vision
CN112287844A (en) * 2020-10-30 2021-01-29 北京市商汤科技开发有限公司 Student situation analysis method and device, electronic device and storage medium
CN112308746A (en) * 2020-09-28 2021-02-02 北京邮电大学 Teaching state evaluation method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019028592A1 (en) * 2017-08-07 2019-02-14 中国科学院深圳先进技术研究院 Teaching assistance method and teaching assistance system using said method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334610A * 2019-06-14 2019-10-15 华中师范大学 Multi-dimensional classroom quantization system and method based on computer vision
CN112308746A (en) * 2020-09-28 2021-02-02 北京邮电大学 Teaching state evaluation method and device and electronic equipment
CN112287844A (en) * 2020-10-30 2021-01-29 北京市商汤科技开发有限公司 Student situation analysis method and device, electronic device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Intelligent Teaching Evaluation System Integrating Facial Expression and Behavior Recognition in Teaching Video; Zheng Chen et al.; 2021 IEEE International Conference on Big Data and Smart Computing (BigComp); 2021-03-10; pp. 52-59 *

Also Published As

Publication number Publication date
CN113239915A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
Jalal et al. Students’ behavior mining in e-learning environment using cognitive processes with information technologies
CN104573706B Subject image recognition method and system
CN110889672B (en) Student card punching and class taking state detection system based on deep learning
US20180114071A1 (en) Method for analysing media content
CN113239914B (en) Classroom student expression recognition and classroom state evaluation method and device
CN113239916A (en) Expression recognition and classroom state evaluation method, device and medium
CN115131604A (en) Multi-label image classification method and device, electronic equipment and storage medium
Liang et al. JTCR: Joint Trajectory Character Recognition for human action recognition
Zhao et al. Bitnet: A lightweight object detection network for real-time classroom behavior recognition with transformer and bi-directional pyramid network
CN113239915B (en) Classroom behavior identification method, device, equipment and storage medium
Gao et al. Fall detection based on OpenPose and MobileNetV2 network
Lek et al. Academic Emotion Classification Using FER: A Systematic Review
Turan et al. Different application areas of object detection with deep learning
CN114399718B (en) Image content identification method and device in video playing process
Xia et al. Media quality assessment by perceptual gaze-shift patterns discovery
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
CN113139540A (en) Backboard detection method and equipment
Lv et al. A challenge of deep‐learning‐based object detection for hair follicle dataset
Rawat et al. Indian Sign Language Recognition System for Interrogative Words Using Deep Learning
CN116740721B (en) Finger sentence searching method, device, electronic equipment and computer storage medium
Flynn Machine learning applied to object recognition in robot search and rescue systems
Cai Feature learning for RGB-D data
CN113313039B (en) Video behavior recognition method and system based on action knowledge base and ensemble learning
US20230148017A1 Compositional reasoning of group activity in videos with keypoint-only modality
Radwan et al. In-class exams auto proctoring by using deep learning on students’ behaviors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant