CN113158803A - Classroom vacant seat query system, real-time video analysis system and method - Google Patents


Info

Publication number: CN113158803A
Application number: CN202110306777.0A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: network, feature map, real-time, video analysis
Other languages: Chinese (zh)
Inventors: 胥志伟, 王胜科, 李兴隆, 丁来辉, 姚凤芹
Current and original assignee: Shandong Weiran Intelligent Technology Co., Ltd. (listed assignees may be inaccurate)
Application filed by Shandong Weiran Intelligent Technology Co., Ltd.
Priority to CN202110306777.0A

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06Q 50/205: Education administration or guidance
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention belongs to the technical field of video analysis and discloses a classroom vacant seat query system, a real-time video analysis system and a method. The real-time video analysis system uses a YOLOv4-Tiny network to analyze multi-channel video stream data in real time, wherein the YOLOv4-Tiny network comprises a CSPDarknet53-Tiny network, an FPN network and a YOLOv3 network. Features of the input picture are first extracted by the CSPDarknet53-Tiny network to obtain two feature maps of different sizes, feature fusion and processing are then performed by the FPN network, and finally the YOLOv3 network is connected to output the result. The system of the invention requires no complex hardware on the seats: it only needs to acquire the camera feed of each classroom and pass it to the real-time video analysis system for analysis to obtain classroom vacancy information. The invention also develops a user-friendly applet platform and background management system through which the classroom situation can be queried in real time.

Description

Classroom vacant seat query system, real-time video analysis system and method
Technical Field
The invention relates to the technical field of video analysis, in particular to a classroom vacant seat query system, a real-time video analysis system and a method.
Background
At present, with the increase in the number of college students, classroom resources at colleges and universities are increasingly strained, and students often cannot find a suitable classroom for self-study. The problem is especially acute during end-of-term review, when vacant seats in study rooms are hard to find: a student may search several classrooms without finding a seat, wasting time and dampening the mood for study, which causes great inconvenience.
Most existing study-room vacant seat systems combine software and hardware: a sensor must be installed on each seat to sense, via physical pressure, whether the seat is occupied, and the result is then transmitted to a display through a single-chip microcomputer. Alternatively, a campus card reader is installed at the seat to count occupancy and report it to a data display center. These methods consume resources, are difficult to popularize on a large scale, and are inconvenient for students to use.
As deep learning matures, many deep learning algorithms have become usable in industrial production. How to provide a classroom vacant seat query method and system based on real-time video analysis, without installing various hardware systems, is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiments of the invention provide a classroom vacant seat query system, aiming to solve the prior-art problem that detecting vacant seats in a study room requires combining software and hardware. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview; it is intended neither to identify key/critical elements nor to delineate the scope of the embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description presented later.
According to a first aspect of embodiments of the present invention, a real-time video analysis system is provided.
In some optional embodiments, the real-time video analysis system analyzes multi-channel video stream data in real time using a YOLOv4-Tiny network, wherein the YOLOv4-Tiny network comprises a CSPDarknet53-Tiny network, an FPN network and a YOLOv3 network. Features of the input picture are first extracted by the CSPDarknet53-Tiny network to obtain two feature maps of different sizes, feature fusion and processing are then performed by the FPN network, and finally the YOLOv3 network is connected to output the result.
Optionally, the image input into the YOLOv4-Tiny network undergoes convolution, regularization and activation operations to obtain a first feature map;
the first feature map undergoes convolution, regularization and activation operations to obtain a second feature map; the second feature map is processed by a residual block to obtain a third feature map, which after another residual block becomes a fourth feature map. The fourth feature map is split into two paths: the first path enters the FPN module directly, where candidate boxes are selected through the convolutional layer; the second path passes through a further residual block to obtain a fifth feature map, then through convolution, regularization and activation operations to obtain a sixth feature map, which enters the convolutional layer of the FPN module for candidate box selection and is again split into two paths. One path enters the YOLOv3 network directly for prediction; the other undergoes convolution and upsampling and is tensor-concatenated with the fourth feature map that entered the FPN module directly in the first path, expanding the tensor dimensionality. The seventh feature map obtained after concatenation undergoes a convolution operation and finally enters the YOLOv3 network for prediction.
Optionally, the YOLOv3 network divides the feature map that has not undergone tensor concatenation into 13 × 13 grid cells; if the center coordinate of a target falls into a grid cell, that grid cell predicts the target. Each grid cell predicts a fixed number of 3 bounding boxes. The 13 × 13 feature map has the largest receptive field and therefore uses the largest anchor boxes (116 × 90, 156 × 198, 373 × 326), suitable for detecting larger targets; smaller anchor boxes (30 × 61, 62 × 45, 59 × 119) are applied to the feature map that has undergone tensor concatenation, suitable for detecting small targets.
Optionally, for a ground truth in a training picture, if its center point falls within a cell, the 3 anchor boxes in that cell are responsible for predicting it: the anchor box with the largest IOU with the ground truth predicts it, and the remaining 2 anchor boxes are not matched to it.
Optionally, the YOLOv3 network assumes that each cell contains at most one ground truth; the anchor box matched to the ground truth computes coordinate, confidence and classification errors, while the other anchor boxes compute only confidence errors.
Optionally, the YOLOv3 network predicts the objectness score of each prior bounding box using logistic regression, scoring it by the overlap between the prior bounding box and the ground truth. Prior bounding boxes are screened through a threshold: those whose overlap is below the threshold are discarded directly, and if a prior bounding box overlaps the ground truth more than any other does, its objectness score is 1.
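A minimal sketch of this screening step, assuming an IoU overlap criterion and the 0.5 threshold given later in the detailed description (the exact overlap convention is not specified in this claim):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def screen_priors(priors, gt, threshold=0.5):
    """Keep only prior boxes whose overlap with the ground truth meets the
    threshold; the best-overlapping survivor gets objectness score 1."""
    kept = [(p, iou(p, gt)) for p in priors if iou(p, gt) >= threshold]
    scores = [0.0] * len(kept)
    if kept:
        best = max(range(len(kept)), key=lambda i: kept[i][1])
        scores[best] = 1.0
    return [p for p, _ in kept], scores
```

A prior box that merely brushes the ground truth is dropped outright, which matches the claim's "directly removed" wording; only the single best survivor is assigned score 1.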
According to a second aspect of the embodiments of the present invention, a real-time video analysis method is provided.
In some optional embodiments, the real-time video analysis method is developed based on the DeepStream SDK and runs on the Jetson Nano platform, on which CUDA, OpenCV and TensorRT modules are built in. The method includes the following steps:
step (1), capturing video stream data;
step (2), decoding video stream data;
step (3), after decoding, preprocessing the video stream data;
step (4), combining multiple paths of video stream data together for batch processing;
step (5), analyzing the multi-channel video stream data in real time with the YOLOv4-Tiny network, performing target detection, and passing the inference result to the next plug-in along the pipeline;
step (6), adding information such as text corresponding to the inference result to the metadata and displaying it on screen;
and (7) acquiring a final inference result.
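A sketch of how these seven steps map onto an ordered DeepStream plugin chain; the plugin names follow NVIDIA's DeepStream SDK, but the exact chain used by the patent is not disclosed, so the ordering here is only illustrative:

```python
def pipeline_stages(num_streams):
    """Ordered DeepStream-style plugin chain mirroring steps (1)-(7).
    Element names follow NVIDIA's DeepStream SDK; configuration details
    (resolutions, model config files) are omitted placeholders."""
    return (
        ["uridecodebin"] * num_streams  # steps (1)-(2): capture and decode each stream
        + ["nvstreammux",               # steps (3)-(4): preprocess and batch the streams
           "nvinfer",                   # step (5): YOLOv4-Tiny inference; results flow downstream
           "nvdsosd",                   # step (6): overlay text/metadata on the frames
           "nveglglessink"]             # step (7): final display / result consumption
    )
```

The batching element must sit before inference, since `nvinfer` operates on batched frames; that ordering is what steps (4) and (5) describe.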
Optionally, the image input into the YOLOv4-Tiny network undergoes convolution, regularization and activation operations to obtain a first feature map;
the first feature map undergoes convolution, regularization and activation operations to obtain a second feature map;
the second feature map is processed by a residual block to obtain a third feature map, which after another residual block becomes a fourth feature map. The fourth feature map is split into two paths: the first path enters the FPN module directly, where candidate boxes are selected through the convolutional layer; the second path passes through a further residual block to obtain a fifth feature map, then through convolution, regularization and activation operations to obtain a sixth feature map, which enters the convolutional layer of the FPN module for candidate box selection and is again split into two paths. One path enters the YOLOv3 network directly for prediction; the other undergoes convolution and upsampling and is tensor-concatenated with the fourth feature map that entered the FPN module directly in the first path, expanding the tensor dimensionality. The seventh feature map obtained after concatenation undergoes a convolution operation and finally enters the YOLOv3 network for prediction.
According to a third aspect of embodiments of the present invention, a classroom vacant seat query system is provided.
In some optional embodiments, the classroom vacant seat query system comprises the real-time video analysis system, a background management system and a WeChat applet (mini program);
the real-time video analysis system obtains detection results through the YOLOv4-Tiny network, stores them in a database, and displays them visually in the background management system;
the background management system comprises a user management module, a school zone management module, a classroom management module, a camera management module and a user feedback management module;
the WeChat applet displays the classroom name and corresponding profile, and updates the data in real time at regular intervals.
Optionally, the background management system interfaces are written in Java; the project is built with Maven, and the overall system framework is developed on the Spring Boot framework. All interfaces complete data interaction tasks with the front end, with all data in JSON format. The interface system is deployed to a cloud server using nginx reverse proxying to interact with the front end, and code is managed with git.
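As an illustration of the JSON data interaction between back end and front end, a hypothetical payload such an interface might return; the field names are invented for this sketch and are not taken from the patent:

```python
import json

def classroom_status(name, total_seats, occupied):
    """Serialize one classroom's occupancy record as a JSON string,
    as the Spring Boot interfaces would hand it to the front end.
    Field names here are illustrative placeholders."""
    return json.dumps({
        "classroom": name,
        "totalSeats": total_seats,
        "occupied": occupied,
        "vacant": total_seats - occupied,   # derived on the server side
    }, ensure_ascii=False)
```

Keeping the derived `vacant` count server-side means the applet only renders values, matching the thin-client design the patent describes.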
The technical scheme provided by the embodiment of the invention has the following beneficial effects:
(1) no complex hardware needs to be installed on the seats; only the camera feed of each classroom needs to be acquired and passed into the real-time video analysis system for analysis, whereby classroom vacancy information is obtained;
(2) an applet platform and a background management system convenient for users are developed, through which classroom conditions can be queried in real time.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram illustrating a YOLOV4-Tiny network architecture in accordance with an exemplary embodiment;
fig. 2 is a flow diagram illustrating a video analysis method according to an example embodiment.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments herein to enable those skilled in the art to practice them. Portions and features of some embodiments may be included in or substituted for those of others. The scope of the embodiments herein includes the full ambit of the claims, as well as all available equivalents of the claims. The terms "first," "second," and the like are used herein solely to distinguish one element from another, without requiring or implying any actual such relationship or order between the elements; in practice, a first element could equally be referred to as a second element, and vice versa. Likewise, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a structure, apparatus, or device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such structure, apparatus, or device. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a structure, device or apparatus that comprises the element. The embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts among the embodiments may be referred to each other.
The terms "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like herein, as used herein, are defined as orientations or positional relationships based on the orientation or positional relationship shown in the drawings, and are used for convenience in describing and simplifying the description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention. In the description herein, unless otherwise specified and limited, the terms "mounted," "connected," and "connected" are to be construed broadly, and may include, for example, mechanical or electrical connections, communications between two elements, direct connections, and indirect connections via intermediary media, where the specific meaning of the terms is understood by those skilled in the art as appropriate.
Herein, the term "plurality" means two or more, unless otherwise specified.
Herein, the character "/" indicates that the preceding and following objects are in an "or" relationship. For example, A/B represents: a or B.
Herein, the term "and/or" is an associative relationship describing objects, meaning that three relationships may exist. For example, a and/or B, represents: a or B, or A and B.
The invention provides a real-time video analysis system that addresses the complexity and inconvenience of existing classroom vacant seat query methods. No complex hardware needs to be installed on the seats; only the camera feed of each classroom needs to be acquired and passed into the real-time video analysis system for analysis to obtain classroom vacancy information.
In some optional embodiments, the real-time video analysis system of the invention uses the YOLOv4-Tiny network to analyze multi-channel video stream data in real time. The YOLOv4-Tiny network offers high precision and improved speed, making it a deep learning algorithm suitable for industrial deployment. Its structure mainly comprises three parts: a CSPDarknet53-Tiny network, an FPN network and a YOLOv3 network. Features of the input picture are first extracted by the CSPDarknet53-Tiny network to obtain two feature maps of different sizes, feature fusion and processing are then performed by the FPN network, and finally the YOLOv3 network is connected to output the result.
FIG. 1 shows the YOLOv4-Tiny network architecture of an optional embodiment.
As shown in fig. 1, an image (416 × 416 × 3) input to the YOLOv4-Tiny network undergoes convolution, regularization and activation operations to obtain a first feature map; the first feature map undergoes convolution, regularization and activation operations to obtain a second feature map; the second feature map is processed by a residual block to obtain a third feature map, which after another residual block becomes a fourth feature map. The fourth feature map is split into two paths: the first path enters the FPN module directly, where candidate boxes are selected by the convolutional layer. The second path passes through a further residual block to obtain a fifth feature map, then through convolution, regularization and activation operations to obtain a sixth feature map, which enters the convolutional layer of the FPN module for candidate box selection and is then split into two paths: one enters the YOLOv3 network directly for prediction, while the other undergoes convolution and upsampling and is tensor-concatenated with the fourth feature map that entered the FPN directly in the first path. The two feature maps have the same length and width but different depths, so they can be concatenated. This concatenation differs from the traditional add operation: concatenation expands the tensor dimensionality, whereas add merely sums values element-wise and does not change it. The seventh feature map obtained after concatenation undergoes a convolution operation and finally enters the YOLOv3 network for prediction.
In the embodiment shown in fig. 1, the image (416 × 416 × 3) input to the network undergoes convolution, regularization and activation operations to produce a 208 × 208 × 32 feature map; this undergoes convolution, regularization and activation operations to produce a 104 × 104 × 64 feature map; a residual block then produces a 52 × 52 × 128 feature map, and another residual block a 26 × 26 × 256 feature map, which splits into two paths. The first path enters the FPN module directly and selects candidate boxes through the convolutional layer. The second path passes through a further residual block to obtain a 13 × 13 × 512 feature map, then through convolution, regularization and activation to obtain another 13 × 13 × 512 feature map, which enters the convolutional layer of the FPN module for candidate box selection and splits into two paths: one enters the YOLOv3 network directly for prediction, and the other undergoes convolution and upsampling, is tensor-concatenated with the feature map that entered the FPN directly in the first path, undergoes a convolution operation, and finally enters the YOLOv3 network for prediction. The YOLOv3 network predicts the class and location of objects.
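The stage-by-stage sizes above can be traced with a small sketch: each backbone stage halves the spatial size while the channel count grows, and tensor concatenation (unlike add) widens only the channel dimension:

```python
def backbone_shapes(input_size=416):
    """Trace the spatial sizes produced by the CSPDarknet53-Tiny stages
    described above: each downsampling stage halves height and width
    while the channel count follows the sequence given in the text."""
    channels = [32, 64, 128, 256, 512]
    shapes, size = [], input_size
    for c in channels:
        size //= 2          # stride-2 convolution or residual-block downsampling
        shapes.append((size, size, c))
    return shapes

def concat_channels(a, b):
    """Tensor concatenation joins two maps of equal height and width along
    the channel axis, expanding that dimension; an element-wise add would
    instead require equal channel counts and leave the shape unchanged."""
    assert a[0] == b[0] and a[1] == b[1], "concat needs matching spatial size"
    return (a[0], a[1], a[2] + b[2])
```

The upsampled 13 × 13 branch arrives at the 26 × 26 scale before concatenation; the exact channel count of that upsampled map (128 here) is an assumption, since the patent does not state it.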
The YOLOv3 network predicts people and empty seats in the classroom, as well as head-up and head-down states of people. It performs prediction (and training) on the obtained feature maps; during training the coordinate loss and the classification loss are computed separately, using a mean square error loss function.
The YOLOv3 network predicts the position and category of objects (vacant seat, person, head-up, head-down) in the picture. Each object generates a detection box (bounding box) marking its position, described by 4 coordinates tx, ty, tw, th: (tx, ty) corresponds to the upper-left corner of the detection box, and (tw, th) to its width and height. In addition, the YOLOv3 network assigns each detection box a confidence representing the probability that the box contains an object: the confidence is 1 if it does, and 0 otherwise.
The YOLOv3 network adopts the YOLOv3 head structure: the feature map input to the network passes through a 1 × 1 convolutional layer, followed by regularization and activation, and is then connected to a fully connected layer (detection layer) for classification prediction.
In the embodiment shown in fig. 1, for the feature maps output by the FPN, the YOLOv3 network divides the 13 × 13 feature map (corresponding to a 416 × 416 picture) that has not undergone tensor concatenation into 13 × 13 grid cells; if the center coordinate of a target falls into a grid cell, that grid cell predicts the target. Each grid cell predicts a fixed number of 3 bounding boxes. The 13 × 13 feature map has the largest receptive field and therefore uses the largest anchor boxes (116 × 90, 156 × 198, 373 × 326), suitable for detecting larger targets. The 26 × 26 feature map produced by tensor concatenation has a smaller receptive field and uses smaller anchor boxes (30 × 61, 62 × 45, 59 × 119), suitable for detecting small targets.
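The grid-cell assignment and per-scale anchor sizes above can be sketched as follows; the 32-pixel stride is simply 416/13:

```python
# Anchor sizes per output scale, as listed in the text: the coarse 13x13
# grid gets the large anchors, the finer 26x26 grid the small ones.
ANCHORS = {
    13: [(116, 90), (156, 198), (373, 326)],
    26: [(30, 61), (62, 45), (59, 119)],
}

def responsible_cell(cx, cy, grid=13, image=416):
    """Index of the grid cell whose area contains the target centre
    (cx, cy), in image-pixel coordinates."""
    stride = image / grid          # 32 pixels per cell at the 13x13 scale
    return int(cx // stride), int(cy // stride)
```

A target centred at (208, 100) in a 416 × 416 frame lands in cell (6, 3) of the 13 × 13 grid, and that one cell's 3 anchors are the candidates for predicting it.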
The YOLOv3 network predicts the objectness score of each prior bounding box (anchor box) using logistic regression, i.e. how likely that box position is to contain the target, scoring it by the overlap between the prior bounding box and the ground truth. The embodiment of the invention sets a threshold in advance to screen the prior bounding boxes; the threshold is generally set to 0.5, and prior bounding boxes whose overlap falls below 0.5 are discarded directly. If a prior bounding box overlaps the ground truth more than any other does, its objectness score should be 1; if it is not the best match, it is not used for prediction even when its overlap exceeds the threshold. In this way the closest prior bounding box can be screened out, facilitating fine tuning in subsequent steps. For a ground truth in a training picture, if its center point falls within a cell, the 3 anchor boxes in that cell are responsible for predicting it. Which anchor box does so is learned gradually: during training, the grid cell learns how to select the anchor box of the appropriate size and fine-tune it. The embodiment of the invention defines a rule for this selection: the anchor box with the largest IOU with the ground truth predicts it, and the remaining 2 anchor boxes are not matched to it. YOLOv3 assumes that each cell contains at most one ground truth, and in practice there is essentially never more than one. The anchor box matched to the ground truth computes the coordinate error, confidence error (with confidence 1) and classification error, while the other anchor boxes compute only the confidence error (with confidence 0).
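A sketch of the largest-IOU matching rule above, assuming the usual shape-only convention in which the ground truth and each anchor are compared centre-aligned, so only widths and heights matter:

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes compared centre-aligned: the intersection is the
    overlap of the smaller width and height, so only shapes matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def match_anchor(gt_wh, anchors):
    """Index of the anchor with the largest IoU against the ground truth;
    only this anchor is trained on coordinates and class, the other two
    in the cell are trained on confidence (0) alone."""
    return max(range(len(anchors)), key=lambda i: shape_iou(gt_wh, anchors[i]))
```

For a 150 × 200 ground truth at the 13 × 13 scale, the 156 × 198 anchor wins by a wide margin, so it alone receives the coordinate and classification gradients.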
The network predicts the coordinate offsets tx, ty, tw and th, where (tx, ty) is a translation and (tw, th) a scaling; the anchor box is translated and scaled so that, after fine tuning, it coincides with the ground truth. When the difference between the anchor box and the ground truth is small, i.e. the IOU is large, the transformation can be treated as linear, and linear regression can be used to model the fine tuning of the window (linear regression learns, for an input feature vector X, a set of parameters W such that the regressed value WX is very close to the true value Y, the ground truth, i.e. Y ≈ WX); otherwise the trained regression model breaks down.
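A sketch of this translate-and-scale decoding, assuming the standard YOLOv3 parameterisation (sigmoid offset within the responsible cell, exponential scaling of the anchor); the patent does not spell out the exact formulas, so this is an illustration rather than the claimed method:

```python
import math

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride=32):
    """YOLO-style decoding: (tx, ty) translate within the grid cell,
    (tw, th) scale the anchor, as in bx = (sigmoid(tx) + cx) * stride
    and bw = pw * exp(tw). Returns centre (bx, by) and size (bw, bh)
    in image pixels."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sig(tx) + cell_x) * stride
    by = (sig(ty) + cell_y) * stride
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh
```

With all offsets at zero the prediction sits at the centre of its cell with exactly the anchor's shape, which is the "small difference, near-linear" regime the text describes.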
Finally, the YOLOv3 network outputs the detection boxes predicting each object, together with the object category: empty seat, person, head-up or head-down. Class prediction uses independent logistic classifiers rather than softmax, and a binary cross-entropy loss is used during training.
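A minimal illustration of independent logistic classifiers trained with binary cross-entropy, as opposed to a single softmax over the four classes:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one independent logistic classifier,
    where p is the predicted probability and y the 0/1 label."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def class_loss(probs, labels):
    """Sum of per-class BCE terms: each class (e.g. vacant seat, person,
    head-up, head-down) is scored by its own logistic output, so the
    probabilities need not sum to 1 across classes."""
    return sum(bce(p, y) for p, y in zip(probs, labels))
```

Because the classifiers are independent, a detection can in principle score high on more than one class, which softmax would forbid; that is the design choice the text attributes to YOLOv3.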
Optionally, the YOLOv4-Tiny target detection algorithm detects on multi-scale feature maps, i.e. feature maps of different sizes. Feature map one is larger; feature map two, produced later in the network, is smaller because convolution and pooling gradually reduce the feature map size, and it is combined with the earlier feature map through upsampling. The large feature map (feature map one) detects relatively small targets, while the small feature map (feature map two) detects relatively large targets. Since people in a monitored scene appear at different distances, the YOLOv4-Tiny detection algorithm of this embodiment suits the scene well: targets of different sizes can all be detected accurately.
The YOLOV4-Tiny target detection algorithm in this embodiment is well suited to detecting objects in a classroom. The large-scale feature map is a low-level feature map: its semantics are not rich, but its spatial precision is high. The small-scale feature map is a high-level feature map: its semantics are rich, but its spatial precision is low. During feature extraction, the small-scale feature map is upsampled and then concatenated with the large-scale feature map, so the output feature map has both high spatial precision and rich semantics, improving detection accuracy when objects in the scene differ in size.
The resolution information directly reflects the number of pixels constituting an object: the more pixels an object occupies, the more detailed and specific its representation, i.e. the more resolution information there is. Thus, the large-scale feature map provides resolution information. Semantic information refers to the information that distinguishes an object from the background in object detection, i.e. it tells the detector which pixels belong to the object and which belong to the background. Distinguishing between categories does not require much detail information; as resolution information grows, semantic information decreases, so the small-scale feature map provides better semantic information while still supplying the necessary resolution information. For small targets, however, the small-scale feature map cannot provide the necessary resolution information, so it must also be combined with the large-scale feature map. In this embodiment of the invention, YOLOV4-Tiny further uses feature maps at 2 different scales to detect objects, so features of finer granularity can be detected. The final output of the feature-extraction network has 2 scales, a large feature map and a small feature map; the small-scale feature map has a large receptive field and is therefore suitable for detecting objects of relatively large size in the image. It is then upsampled and concatenated with the earlier feature map tensor. This concatenation operation differs from the addition in a residual layer: concatenation expands the tensor dimension, whereas addition does not change the tensor dimension.
The large-scale feature map has a small receptive field and is suitable for detecting small-scale objects, such as people and desks far from the camera. Such small targets appear larger on the large-scale feature map; on the small-scale feature map they shrink so much that the detection box cannot frame them completely, so people and desks are better detected on the large-scale feature map. Similarly, a large target appears smaller on the small-scale feature map, so its detection box does not take in excessive background, and the large target can be better surrounded there.
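The upsample-and-concatenate step that merges the two scales can be illustrated with plain nested lists standing in for feature-map tensors; the channel counts below are examples, not the network's actual widths.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a feature map given as
    [channel][row][col] nested lists."""
    out = []
    for ch in fmap:
        rows = []
        for row in ch:
            doubled = [v for v in row for _ in (0, 1)]  # repeat columns
            rows.append(doubled)
            rows.append(list(doubled))                  # repeat rows
        out.append(rows)
    return out

def concat_channels(a, b):
    """Tensor concatenation along the channel axis: the channel count
    grows, unlike residual addition which keeps dimensions unchanged."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b

small = [[[1] * 13 for _ in range(13)] for _ in range(256)]  # 13x13x256
large = [[[2] * 26 for _ in range(26)] for _ in range(128)]  # 26x26x128
merged = concat_channels(upsample2x(small), large)
print(len(merged), len(merged[0]), len(merged[0][0]))        # 384 26 26
```

Note that concatenation requires the spatial sizes to match (hence the upsampling first) and yields 256 + 128 = 384 channels, whereas a residual addition would have kept 128.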
In other optional embodiments, the invention further provides a video analysis method in which the YOLOV4-tiny network and the Jetson Nano platform are bound together; the method is convenient to deploy, flexible to operate, and applicable to classroom monitoring scenes. Fig. 2 shows an alternative embodiment of the video analysis method of the present invention combined with the Jetson Nano platform.
In this optional embodiment, the video analysis method is developed based on the DeepStream SDK and runs on a Jetson Nano platform, with CUDA, OpenCV and TensorRT modules built into the Jetson Nano platform.
In this alternative embodiment, the video analysis method includes the steps of:
and (1) capturing video stream data. Optionally, the video stream data is from an RTSP stream of the camera or a USB or CSI camera;
and (2) decoding the video stream data. Optionally, the decoder plug-in employs a NIVIDIA's hardware accelerated decoding engine.
And (3) after decoding, preprocessing the video stream data, for example scaling, cropping and format conversion of the images.
And (4) combining multiple paths of video stream data together for batch processing.
And (5) analyzing the multi-channel video stream data in real time by using the YOLOV4-Tiny network shown in Fig. 1, executing target detection, and transmitting the inference result to the next plug-in along the pipeline. The YOLOV4-Tiny network analyzes the multi-channel video stream data in real time, meeting the requirements of real-time video analysis while ensuring detection precision.
And (6) attaching information such as the text corresponding to the inference result to the metadata, and displaying it on the screen.
And (7) acquiring a final inference result.
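The seven steps above can be mirrored as a chain of plain-Python functions. The stage names, frame dictionaries and detection labels below are hypothetical stand-ins for the DeepStream plug-ins, intended only to show how data flows through the pipeline.

```python
# Illustrative only: these functions model the pipeline stages, not the
# actual DeepStream plug-in APIs.
def capture(source):                 # step (1): RTSP / USB / CSI source
    return {"source": source, "raw": True}

def decode(frame):                   # step (2): hardware-accelerated decode
    frame["raw"] = False
    return frame

def preprocess(frame):               # step (3): scale / crop / convert
    frame["size"] = (416, 416)
    return frame

def batch(frames):                   # step (4): mux multiple streams
    return {"batch": frames}

def infer(batched):                  # step (5): YOLOv4-Tiny stand-in
    batched["detections"] = [["person", "empty_seat"] for _ in batched["batch"]]
    return batched

def overlay(batched):                # step (6): attach labels to metadata
    batched["labels"] = batched["detections"]
    return batched

streams = [preprocess(decode(capture(f"rtsp://camera{i}"))) for i in range(4)]
result = overlay(infer(batch(streams)))      # step (7): final result
print(len(result["labels"]))                 # one result per input stream
```

The point of the batching stage is that a single inference call serves all camera streams at once, which is what lets one Jetson Nano keep up with multiple classrooms.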
The 5 main roles of TensorRT:
1. and optimizing the weight parameter types. The parameter types include FP32, FP16 and INT8, and the use of lower data precision can reduce memory occupation and delay, so that the model size is smaller and the reasoning speed is greatly improved.
2. Layer fusion. When a deployed model performs inference, each layer's operations are completed by the GPU, which launches different CUDA (Compute Unified Device Architecture) kernels for the computation. Because the kernels themselves compute very quickly, a large amount of time is instead wasted on kernel launches and on the read/write operations for each layer's inputs and outputs, causing a memory-bandwidth bottleneck and wasting GPU resources. TensorRT performs lateral or longitudinal fusion between layers, greatly reducing the number of layers. Lateral merging can merge the convolution, bias and activation layers into one CBR structure that occupies only one CUDA kernel. Longitudinal merging can merge layers with the same structure but different weights into a single wider layer, also occupying only one CUDA kernel. The fused computation graph has fewer levels and uses fewer CUDA kernels, so the whole model structure is smaller, faster and more efficient.
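The benefit of CBR fusion can be seen in a toy example: the fused form computes exactly the same values as the three separate stages but makes a single pass over the data (one "kernel launch" and one read/write instead of three). The 1x1 per-element "convolution" here is a deliberate simplification.

```python
def conv1x1(x, w):          # per-element 1x1 convolution stand-in
    return [w * v for v in x]

def bias(x, b):
    return [v + b for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def fused_cbr(x, w, b):
    """Conv + bias + ReLU collapsed into one pass over the data."""
    return [max(0.0, w * v + b) for v in x]

x = [1.0, -2.0, 3.0]
assert fused_cbr(x, 2.0, 0.5) == relu(bias(conv1x1(x, 2.0), 0.5))
print(fused_cbr(x, 2.0, 0.5))   # same result, a third of the memory traffic
```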
3. Multi-stream execution. The GPU excels at parallel computing, providing different threads, blocks and streams, and multi-stream execution can hide data transfer time. The GPU divides a large block of data into smaller chunks for computation. While the first chunk is being transferred, all subsequent work waits; once the first chunk has been transferred, the second chunk begins transferring while the first chunk is being computed, so the transfer time can be hidden within the computation time.
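The hiding of transfer time can be checked with a small worked calculation; the chunk counts and millisecond figures below are illustrative.

```python
def serial_time(n_chunks, transfer, compute):
    """All chunks transferred first, then all computed."""
    return n_chunks * transfer + n_chunks * compute

def pipelined_time(n_chunks, transfer, compute):
    """Chunk i+1 transfers while chunk i computes, so every transfer
    except the first hides inside computation (assuming transfer <= compute)."""
    assert transfer <= compute
    return transfer + n_chunks * compute

# 4 chunks, 2 ms to transfer each, 3 ms to compute each
print(serial_time(4, 2, 3))      # 20 ms
print(pipelined_time(4, 2, 3))   # 14 ms: only the first transfer is exposed
```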
4. Dynamic Tensor Memory. TensorRT assigns video memory to each tensor only for the period in which it is used, avoiding repeated memory allocation, reducing memory occupation and improving reuse efficiency.
5. Kernel auto-tuning. TensorRT can adjust the CUDA kernels according to different algorithms, different network models and different GPU platforms, to ensure that the current model is computed with optimal performance on the specific platform.
The invention also provides a classroom vacant seat query system, which comprises the real-time video analysis system, a background management system and a WeChat applet. The real-time video analysis system stores the detection results obtained through YOLOV4-tiny in a database and displays them visually in the background management system. The background management system is oriented to the administrator and comprises a user management module, a school zone management module, a classroom management module, a camera management module and a user feedback management module. When first using the WeChat applet, the user needs to register information and bind WeChat; after entering the applet, the user can switch between different teaching buildings to check the ten classrooms with the most empty seats in a building and view the classroom names and corresponding distribution maps, with the data updated in real time at fixed intervals, for example every 2 minutes.
The user management module distinguishes administrators and students, with administrators having higher authority than students, and records information such as the user's student/staff number, name, user authority, school zone and creation time. The module can add, view, edit and delete users, and can also import user information in batches. The school zone management module classifies the conditions of multiple school zones: each school is divided into several school zones, each school zone corresponds to several teaching buildings, the correspondences can be listed, and school zone and teaching building information can be modified, deleted and added. The classroom management module includes information such as the school zone, teaching building, classroom name, rows, columns, empty columns, cameras and creation time, and classroom information can also be viewed, edited and deleted; each classroom corresponds to a teaching building and school zone, and classrooms can be queried by school zone name, teaching building name and classroom name. The camera management module includes information such as the school zone, teaching building, classroom, total number of study room seats, rtsp address, deactivation state and detection time period; a camera corresponds to a classroom and can be viewed, edited, deleted and deactivated. The user feedback management module reflects the opinions submitted by users, including information such as user name, creation time and state, and feedback can also be queried by user name and creation time.
The background management system interface is written in Java (version 1.8). The project is built with Maven, and the overall framework is developed on the Spring Boot framework using technologies such as Spring Boot JPA, Redis, MySQL and Spring MVC; all interfaces complete data interaction tasks with the front end, and all data uses a JSON structure. After development, the interface system is deployed to a cloud server using nginx reverse proxying to interact with the front end, and code management is performed with git throughout the development process.
When the administrator logs in, the request information is verified against the data in the MySQL database. If verification succeeds, a token is generated using the JWT protocol, and the user information and verification token are then stored in Redis; subsequent operations are guaranteed to be performed with this token (the token is placed in the request header).
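The token flow can be sketched using only the standard library. This is an illustrative stand-in for the JWT-plus-Redis scheme above: the secret, user name and claim fields are made up, and a real deployment would use a proper JWT library and store the token in Redis.

```python
import base64, hashlib, hmac, json

SECRET = b"demo-secret"          # hypothetical signing key

def issue_token(user, role):
    """Sign a payload of user claims; the token goes in the request header."""
    payload = base64.urlsafe_b64encode(
        json.dumps({"user": user, "role": role}).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def verify_token(token):
    """Recompute the signature; reject the request if it does not match."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))

tok = issue_token("admin01", "administrator")   # sent back to the client
print(verify_token(tok)["role"])                # → administrator
```

The server never needs to look the password up again: any later request carrying a token whose signature verifies is accepted, which is the property the patent relies on for "the next operation can be operated by using the token".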
When managing data in the background management system, a school zone and teaching building are added first, classrooms are added under the teaching building, and cameras are configured per classroom; data management follows this logical relationship, with everything related through the primary key of the table it depends on.
The data is stored in a MySQL database. Each management module corresponds to one table, and each table has a primary key and the other necessary attributes but no foreign keys; the dependency relationships are implemented in program code. When data in the background management system is added or modified, the entity integrity, referential integrity and user-defined integrity of the database data are verified in the interface function; if these are not satisfied, the data is not added or modified. On deletion, whether the requirements are met is judged first, and cascading deletion is carried out according to actual requirements. Data creation, deletion, modification and query are all based on Spring Boot JPA.
The user feedback management module manages the user feedback information in the applet and corresponds to a table in the MySQL database. Feedback is submitted at the applet end and its content is stored in the database, and the background management system displays it in pages and performs related operations. The feedback state is either read or unread, judged by a flag bit; the feedback publisher is associated through the user primary key.
The applet comprises a vacancy query module and a classroom patrol module. The classroom patrol module sends the front end the url of a picture captured by a camera in the classroom (the url after reverse proxying through the nginx server), and the front end displays the picture through the url. The vacancy query module comprises an analysis module, a back-end module and a front-end module. The back-end module stores the analysis results of the analysis module in the database and calculates the corresponding vacancy positions and vacancy rate using the relative positions. When the front end calls the relevant interface, the back end only needs to read the data using Spring Boot JPA, sort it by vacancy rate, and transmit it to the front end.
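A minimal sketch of the vacancy computation performed by the back-end module; the seat-grid layout and detection labels are hypothetical examples, assuming one detection label per seat position.

```python
def vacancy_stats(detections, rows, cols):
    """detections maps (row, col) -> label from the detector; any seat
    position with no 'person' detection counts as vacant."""
    total = rows * cols
    occupied = {pos for pos, label in detections.items() if label == "person"}
    vacant = [(r, c) for r in range(rows) for c in range(cols)
              if (r, c) not in occupied]
    return vacant, len(vacant) / total

detections = {(0, 0): "person", (0, 1): "empty_seat", (1, 1): "person"}
vacant, rate = vacancy_stats(detections, rows=2, cols=2)
print(len(vacant), rate)   # 2 vacant seats out of 4 → rate 0.5
```

Sorting classrooms by this rate is all the back end needs to do before handing the list to the applet front end.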
The WeChat applet displays the analysis data: canvas 2d is used to draw a virtual classroom, echarts is used to draw line graphs and the like to visualize the data, and vant is used as the UI framework. Login is verified with a student number and a mobile phone SMS code, which are bound to WeChat after the first login; verification-free login is then possible as long as the binding is not removed.
The invention discloses a classroom vacant seat query system, which realizes real-time video analysis, video image capture and video export based on the deep learning YOLOV4-tiny algorithm. From the user's perspective, the invention develops an applet platform and a background management system that are convenient to use and allow classroom conditions to be queried in real time. Combined with the Jetson Nano platform, the system is low-cost, convenient to deploy and flexible to operate, and is easy to popularize, so it has high industrial use value.
The present invention is not limited to the structures that have been described above and shown in the drawings, and various modifications and changes can be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A real-time video analysis system is characterized in that a YOLOv4-Tiny network is adopted to carry out real-time analysis on multi-channel video stream data, wherein the YOLOv4-Tiny network comprises a CSPDarknet53-Tiny network, an FPN network and a YOLOv3 network; firstly, extracting features of an input picture through a CSPDarknet53-Tiny network to obtain two feature maps with different sizes, then performing feature fusion and processing through an FPN network, and finally connecting a yolov3 network to output a result.
2. The real-time video analysis system of claim 1,
the image input into the YOLOV4-Tiny network is subjected to convolutional layer, regularization and activation layer operation to obtain a first characteristic diagram;
performing convolution layer, regularization and activation layer operation on the first characteristic diagram to obtain a second characteristic diagram; obtaining a third characteristic diagram after the second characteristic diagram is operated by a residual block, converting the third characteristic diagram into a fourth characteristic diagram after the second characteristic diagram is operated by the residual block, dividing the fourth characteristic diagram into two paths, directly entering the FPN module by the first path, and selecting a candidate frame through the convolutional layer; and the second path is continuously subjected to the operation of a residual block to obtain a fifth feature map, and is subjected to the operation of a convolutional layer, regularization and an activation layer to obtain a sixth feature map, then the candidate frames are selected by the convolutional layer entering the FPN module, the second path is divided into two paths, one path directly enters a Yolov3 network prediction result, the other path is subjected to tensor splicing with the fourth feature map directly entering the FPN module in the first path through convolutional and upsampling operations to expand tensor dimensionality, the seventh feature map obtained after splicing is subjected to convolution operation, and finally the seventh feature map enters a Yolov3 network prediction result.
3. The real-time video analysis system of claim 2,
the Yolov3 network divides the feature map without the tensor splicing process into 13 × 13 grid cells; if the center coordinate of a target in the ground truth falls into a certain grid cell, that grid cell predicts the target; each grid cell predicts a fixed number of 3 bounding boxes; the 13 × 13 feature map has the largest receptive field and hence uses the largest anchor boxes, 116 × 90, 156 × 198 and 373 × 326 respectively, suitable for detecting larger targets; smaller anchor boxes (30 × 61, 62 × 45 and 59 × 119) are applied to the feature map subjected to tensor splicing, suitable for detecting small targets.
4. A real-time video analysis system according to claim 3,
for a ground truth in the training picture, if its center point falls within a certain cell, the 3 anchor boxes in that cell are responsible for predicting it; the anchor box with the largest IOU with the ground truth is used to predict it, and the remaining 2 anchor boxes are not matched with the ground truth.
5. The real-time video analysis system of claim 4,
the Yolov3 network assumes that each cell contains at most one ground truth; the anchor box matched with the ground truth computes the coordinate error, confidence error and classification error, while the other anchor boxes compute only the confidence error.
6. The real-time video analysis system of claim 2,
the Yolov3 network predicts an objectness score for each prior bounding box using logistic regression, computed from the overlap between the prior bounding box and the ground truth; the prior bounding boxes are screened by a threshold, those with overlap smaller than the threshold are removed directly, and if a prior bounding box overlaps a ground truth target more than any other does, its objectness score is 1.
7. A video analysis method, characterized in that the video analysis method is developed based on the DeepStream SDK and runs on a Jetson Nano platform, with CUDA, OpenCV and TensorRT modules built into the Jetson Nano platform, and the video analysis method comprises the following steps:
step (1), capturing video stream data;
step (2), decoding video stream data;
step (3), after decoding, preprocessing the video stream data;
step (4), combining multiple paths of video stream data together for batch processing;
step (5), carrying out real-time analysis on multi-channel video stream data by adopting a YOLOV4-Tiny network, executing target detection, and transmitting an inference result to the next plug-in along a pipeline;
step (6), attaching information such as the text corresponding to the inference result to the metadata, and displaying it on the screen;
and (7) acquiring a final inference result.
8. A video analysis method as claimed in claim 7,
the image input into the YOLOV4-Tiny network is subjected to convolutional layer, regularization and activation layer operation to obtain a first characteristic diagram;
performing convolution layer, regularization and activation layer operation on the first characteristic diagram to obtain a second characteristic diagram;
obtaining a third characteristic diagram after the second characteristic diagram is operated by a residual block, converting the third characteristic diagram into a fourth characteristic diagram after the second characteristic diagram is operated by the residual block, dividing the fourth characteristic diagram into two paths, directly entering the FPN module by the first path, and selecting a candidate frame through the convolutional layer; and the second path is continuously subjected to the operation of a residual block to obtain a fifth feature map, and is subjected to the operation of a convolutional layer, regularization and an activation layer to obtain a sixth feature map, then the candidate frames are selected by the convolutional layer entering the FPN module, the second path is divided into two paths, one path directly enters a Yolov3 network prediction result, the other path is subjected to tensor splicing with the fourth feature map directly entering the FPN module in the first path through convolutional and upsampling operations to expand tensor dimensionality, the seventh feature map obtained after splicing is subjected to convolution operation, and finally the seventh feature map enters a Yolov3 network prediction result.
9. A classroom vacant seat query system comprising the real-time video analysis system of any one of claims 1 to 6, further comprising a background management system and a WeChat applet;
the real-time video analysis system obtains a detection result through a YOLOV4-tiny network, stores the detection result in a database and performs visual display in a background management system;
the background management system comprises a user management module, a school zone management module, a classroom management module, a camera management module and a user feedback management module;
the WeChat applet displays the classroom name and corresponding profile, and updates the data in real time at regular intervals.
10. The system for inquiring the vacant seat in the classroom as claimed in claim 9, wherein the interface of the background management system is written by java language, project system construction is performed by using Maven technology, the overall framework of the system is developed based on a Spring boot framework, all interfaces complete data interaction tasks with the front end, and the data adopts a JSON structure; deploying the interface system to a cloud server by using a reverse proxy nginx technology to interact with a front end; and code management is performed by utilizing git technology.
CN202110306777.0A 2021-03-23 2021-03-23 Classroom vacant seat query system, real-time video analysis system and method Pending CN113158803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110306777.0A CN113158803A (en) 2021-03-23 2021-03-23 Classroom vacant seat query system, real-time video analysis system and method


Publications (1)

Publication Number Publication Date
CN113158803A true CN113158803A (en) 2021-07-23

Family

ID=76888028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110306777.0A Pending CN113158803A (en) 2021-03-23 2021-03-23 Classroom vacant seat query system, real-time video analysis system and method

Country Status (1)

Country Link
CN (1) CN113158803A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674188A (en) * 2021-08-04 2021-11-19 深圳中兴网信科技有限公司 Video analysis method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6157044B1 (en) * 2017-02-08 2017-07-05 株式会社バカン Seat display device, seat display program, and seat display system
CN110941984A (en) * 2019-09-25 2020-03-31 西南科技大学 Study room seat state detection method and seat management system based on deep learning
WO2020145085A1 (en) * 2019-01-08 2020-07-16 株式会社日立国際電気 Image recognition device, image recognition program, and image recognition method
CN112528961A (en) * 2020-12-28 2021-03-19 山东巍然智能科技有限公司 Video analysis method based on Jetson Nano


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIANG ZICONG, ET AL: "Real-time object detection method based on improved YOLOv4-tiny", arXiv *
LAI RUNPING: "Design and Implementation of an Object-Tracking Car Based on Jetson Nano", Modern Information Technology *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210723)