CN117542085B

CN117542085B - Park scene pedestrian detection method, device and equipment based on knowledge distillation

Info

Publication number: CN117542085B
Application number: CN202410036468.XA
Authority: CN
Inventors: 佘亮; 曾阳艳; 曹文治; 梁伟
Original assignee: Hunan University of Technology
Current assignee: Hunan University of Technology
Priority date: 2024-01-10
Filing date: 2024-01-10
Publication date: 2024-05-03
Anticipated expiration: 2044-01-10
Also published as: CN117542085A

Abstract

The invention discloses a park scene pedestrian detection method, device and equipment based on knowledge distillation, comprising the following steps: acquiring a training data set, wherein the training data set comprises pedestrian detection data of pedestrian labels; training a first model and a second model by adopting a training data set to obtain a target teacher model and a target student model, wherein the first model is a pedestrian detection model without an anchor frame with a backbone network of ResNet and the second model is a pedestrian detection model without an anchor frame with a backbone network of ResNet 18; aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation to obtain a pedestrian recognition model; pedestrian detection of the park scene is performed by adopting a pedestrian recognition model. The compression of the pedestrian detection model through knowledge distillation is achieved, so that the calculation pressure of pedestrian detection tasks on a park scene is reduced, and the efficiency of park pedestrian recognition is improved.

Description

Park scene pedestrian detection method, device and equipment based on knowledge distillation

Technical Field

The invention relates to the field of data processing, in particular to a method, a device and equipment for detecting pedestrians in a park scene based on knowledge distillation.

Background

In recent years, with the wide application of related technologies such as artificial intelligence, big data, cloud computing and the like, the intellectualization of a park is greatly broken through and developed. In order to ensure the normal operation of the park and maintain the safety of the park, real-time pedestrian monitoring needs to be carried out on the park. With the introduction of the deep neural network, the pedestrian detection technology is greatly developed, and the detection accuracy is greatly improved. Meanwhile, the parameter number and the calculation amount of the pedestrian detection model based on the depth network are also greatly increased. More and more parks employ this approach for real-time pedestrian monitoring.

The inventor realizes that at least the following technical problems exist in the prior art in the process of realizing the invention: in a campus scene, the computing resources available for pedestrian detection are very limited, and because the campus staff is frequent in coming and going and is not fixed in time, the efficiency and real-time requirements of the campus staff monitoring task on the model are very high.

Disclosure of Invention

The embodiment of the invention provides a park scene pedestrian detection method and device based on knowledge distillation, computer equipment and a storage medium, so as to improve pedestrian recognition efficiency.

In order to solve the technical problems, an embodiment of the present application provides a method for detecting a pedestrian in a park scene based on knowledge distillation, where the method for detecting a pedestrian in a park scene based on knowledge distillation includes:

Acquiring a training data set, wherein the training data set comprises pedestrian detection data of pedestrian labels;

Training a first model and a second model by adopting the training data set to obtain a target teacher model and a target student model, wherein the first model is a pedestrian detection model without an anchor frame with a backbone network of ResNet and the second model is a pedestrian detection model without an anchor frame with a backbone network of ResNet 18;

Aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation to obtain a pedestrian recognition model;

and adopting the pedestrian recognition model to detect pedestrians in the park scene.

Optionally, the training dataset is a pedestrian detection dataset CityPersons including street scenes in different places, different time periods, and different weather conditions.

Optionally, the training the first model and the second model with the training data set respectively, to obtain a target teacher model and a target student model includes:

Training a first model by adopting the training data set and the model super-parameters so that the detection performance of the first model reaches a preset condition to obtain the target teacher model, and pre-training a second model by adopting the training data set and the model super-parameters so that the second model reaches a convergence state to obtain the target student model, wherein the model super-parameters are a loss function, an optimizer and a learning rate.

Optionally, the training data set is trained and identified by the target teacher model and the target student model, and neck characteristics, a detection diagram and a splice diagram are sequentially obtained after the training data set passes through a backbone network;

the step of aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation, and the step of obtaining the pedestrian recognition model comprises the following steps:

minimizing the distance between neck features of the target teacher model and the target student model by adopting a feature distillation mode;

And calculating to obtain an output-based knowledge distillation loss value according to the spliced graph of the target teacher model and the target student model.

Optionally, the aligning and matching the target teacher model and the target student model by using feature-based knowledge distillation and output-based knowledge distillation, to obtain a pedestrian recognition model further includes:

the gaussian mask is calculated using the following formula:

；

Wherein, Numerical value representing point with position (i, j) in Gaussian mask, K is number of target lines in picture corresponding to Gaussian mask,/>Coordinates representing the center point of the kth pedestrian,/>And/>The width and the height of the detection frame corresponding to the pedestrian are the true value;

calculating a teacher true value mask by adopting the following formula:

；

Wherein, Target center point heat map representing teacher model,/>Representing a truth value diagram of a target center point;

the student truth mask is calculated using the following formula:

；

Wherein, A target center point heat map generated by the student model is represented;

fusing the Gaussian mask, the teacher truth mask and the student truth mask to obtain a fused mask;

and calculating local loss based on the fusion mask, and training based on the local loss to obtain the pedestrian recognition model.

Optionally, the detection map is a target center point heat map, a scale heat map and an offset map, the aligning and matching the target teacher model and the target student model by using feature-based knowledge distillation and output-based knowledge distillation, and obtaining the pedestrian recognition model further includes:

channel compression is respectively carried out on the neck feature and the detection graph to obtain a corresponding attention graph, and subtraction and splicing are carried out on the attention graph to obtain a bridging matrix;

Splicing the bridging matrixes of the three detection graphs to obtain a combined bridging matrix;

and calculating information flow knowledge distillation loss based on the combined bridging moment, and training according to the information flow knowledge distillation loss to obtain the pedestrian recognition model.

In order to solve the above technical problems, an embodiment of the present application further provides a device for detecting pedestrians in a park scene based on knowledge distillation, including:

the acquisition module is used for acquiring a training data set, wherein the training data set comprises pedestrian detection data of pedestrian labels;

the training module is used for training a first model and a second model by adopting the training data set to obtain a target teacher model and a target student model, wherein the first model is a pedestrian detection model without an anchor point frame with a backbone network of ResNet and the second model is a pedestrian detection model without an anchor point frame with a backbone network of ResNet;

the alignment module is used for aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation to obtain a pedestrian recognition model;

and the detection module is used for detecting pedestrians in the park scene by adopting the pedestrian recognition model.

The alignment module includes:

a first alignment unit for minimizing a distance between neck features of the target teacher model and the target student model by means of feature distillation;

and the second alignment unit is used for calculating and obtaining an output-based knowledge distillation loss value according to the spliced graph of the target teacher model and the target student model.

a first calculation unit that calculates a gaussian mask using the following formula:

；

The second calculating unit is used for calculating a teacher true value mask by adopting the following formula:

；

a third calculation unit for calculating a student true value mask using the following formula:

；

The fusion unit is used for fusing the Gaussian mask, the teacher true value mask and the student true value mask to obtain a fusion mask;

And the first training unit is used for calculating the local loss based on the fusion mask and training based on the local loss to obtain the pedestrian recognition model.

Optionally, the alignment module further includes:

The generating unit is used for respectively carrying out channel compression on the neck feature and the detection graph to obtain corresponding attention graph, and subtracting and splicing the attention graph to obtain a bridging matrix;

The splicing unit is used for splicing the bridging matrixes of the three detection graphs to obtain a combined bridging matrix;

and the second training unit is used for calculating information flow knowledge distillation loss based on the combined bridge moment and training according to the information flow knowledge distillation loss to obtain the pedestrian recognition model.

In order to solve the technical problem, the embodiment of the application also provides a computer device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the method for detecting the pedestrians in the park scene based on knowledge distillation when executing the computer program.

To solve the above technical problem, the embodiments of the present application further provide a computer readable storage medium storing a computer program, where the computer program when executed by a processor implements the steps of the above method for detecting a pedestrian in a park scene based on knowledge distillation.

According to the method, the device, the computer equipment and the storage medium for detecting pedestrians in the park scene based on knowledge distillation, which are provided by the embodiment of the invention, a training data set is obtained, wherein the training data set comprises pedestrian detection data of pedestrian labels; training a first model and a second model by adopting a training data set to obtain a target teacher model and a target student model, wherein the first model is a pedestrian detection model without an anchor frame with a backbone network of ResNet and the second model is a pedestrian detection model without an anchor frame with a backbone network of ResNet 18; aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation to obtain a pedestrian recognition model; pedestrian detection of the park scene is performed by adopting a pedestrian recognition model. The compression of the pedestrian detection model is achieved through knowledge distillation, so that the calculation pressure of pedestrian detection tasks on a park scene is reduced, and the efficiency of park pedestrian recognition is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of one embodiment of a knowledge distillation based campus scene pedestrian detection method of the present application;

FIG. 3 is an exemplary diagram of the basic distillation architecture of a pedestrian detection model without anchor boxes of the present application;

FIG. 4 is a diagram showing an exemplary structure of mask fusion according to the present application;

FIG. 5 is an exemplary diagram of a bridging matrix generation process in accordance with one embodiment of the present application;

FIG. 6 is a schematic structural diagram of one embodiment of a knowledge distillation based campus scene pedestrian detection device in accordance with the present application;

FIG. 7 is a schematic diagram of an embodiment of a computer device in accordance with the application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, as shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 players (Moving Picture Experts Group Audio Layer III, dynamic video expert compression standard audio plane 3), MP4 (Moving Picture Experts Group Audio Layer IV, dynamic video expert compression standard audio plane 4) players, laptop and desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that, the method for detecting the pedestrian in the park scene based on the knowledge distillation provided by the embodiment of the application is executed by the server, and correspondingly, the device for detecting the pedestrian in the park scene based on the knowledge distillation is arranged in the server.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation requirements, and the terminal devices 101, 102, 103 in the embodiment of the present application may specifically correspond to application systems in actual production.

Referring to fig. 2, fig. 2 shows a method for detecting pedestrians in a park scene based on knowledge distillation according to an embodiment of the present invention, and the method is applied to a server in fig. 1 for illustration, and is described in detail as follows:

s201: and acquiring a training data set, wherein the training data set comprises pedestrian detection data of pedestrian labels.

Optionally, the training dataset is a pedestrian detection dataset CityPersons, including street scenes in different places, different time periods, different weather conditions.

S202: and training the first model and the second model by adopting a training data set to obtain a target teacher model and a target student model, wherein the first model is a pedestrian detection model without an anchor frame with a backbone network of ResNet and the second model is a pedestrian detection model without an anchor frame with a backbone network of ResNet 18.

Specifically, in this embodiment, pedestrian detection algorithms are classified into two types, namely, a detection algorithm based on an anchor frame and a detection algorithm without an anchor frame, according to whether an anchor frame is set. Compared with the algorithm based on the anchor block, the algorithm without the anchor block has fewer parameters and higher detection speed, so that the algorithm without the anchor block is more suitable for actual application scenes of parks. The knowledge distillation method is a model compression method based on a teacher-student framework. Wherein, the teacher model is more complex, generally a model with wider, deeper and more parameters; the student model is simpler, and is generally a model with shallower parameters. Knowledge distillation enables student models to learn information useful in teacher models, enabling efficient knowledge transfer between large and small models.

Preferably, the present embodiment selects an anchor-free pedestrian detection model (CENTER AND SCALE Prediction, CSP) based on center point and size predictions.

Optionally, training the first model and the second model with the training data set respectively, to obtain the target teacher model and the target student model includes:

Training the first model by using a training data set and model super parameters to enable the detection performance of the first model to reach preset conditions to obtain a target teacher model, and pre-training the second model by using the training data set and the model super parameters to enable the second model to reach a convergence state to obtain a target student model, wherein the model super parameters are a loss function, an optimizer and a learning rate.

In this embodiment, a teacher model is trained, and a pedestrian detection model without an anchor frame with a backbone network of ResNet is selected as the teacher model, and super parameters such as a loss function, an optimizer, a learning rate and the like are used to train the teacher model, so that higher detection performance is achieved. Training a student model, selecting a pedestrian detection model without an anchor point frame, wherein the backbone network of the pedestrian detection model is ResNet, as the student model, and pre-training the student model by using super parameters such as loss function, optimizer, learning rate and the like to enable the student model to reach a convergence state.

S203: and aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation to obtain the pedestrian recognition model.

In this embodiment, output-based distillation and feature-based distillation commonly used in the target detection model knowledge distillation algorithm are transplanted to the pedestrian detection model CSP without anchor frame, and a basic distillation (General Distillation, GD) architecture suitable for the pedestrian detection model without anchor frame is obtained, as shown in fig. 3.

Specifically, the embodiment transfers the distillation based on output and the distillation based on characteristics to the pedestrian detection model without an anchor frame, and proposes a basic distillation architecture. And aligning and matching the teacher model and the student model by using feature-based knowledge distillation and output-based knowledge distillation, so that the student model can simulate the neck features and the detection diagrams of the teacher model, and the detection accuracy of the student model is improved.

In a specific optional implementation manner of this embodiment, training data sets are trained and identified through a target teacher model and a target student model, and neck features, detection diagrams and splice diagrams are sequentially obtained after the training data sets pass through a backbone network;

The method for aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation comprises the following steps:

and calculating to obtain an output-based knowledge distillation loss value according to the splice diagram of the target teacher model and the target student model.

Further, the embodiment realizes local area knowledge distillation on the basis of a basic distillation framework, designs various masks and fuses the masks to obtain a distillation weight map, so that the distillation process can pay more attention to an area effective for improving the detection capability of the student model, such as a target center point, a small target or a shielding target, and the like, thereby further improving the detection accuracy of the student model.

Specifically, in a specific implementation of this embodiment, the non-anchor frame pedestrian detection model CSP uses the residual network ResNet as a backbone network, and processes the upper four-layer output of the backbone network ResNet through the neck module. The neck module uses inverse convolution to process features of different layers to obtain features of the same size and channel number. The neck module then splices the processed upper four layers of features in the channel dimension, resulting in a spliced neck feature of size 728×160×320. Therefore, the size of the neck feature obtained by the neck module of the pedestrian detection model without the anchor frame is not influenced by the specification of the backbone network, and the neck feature has fixed size 728×160×320, namely the neck features of the teacher model and the student model are identical. The neck features of the pedestrian detection model without the anchor frame simultaneously comprise bottom features and high-level features in backbone network output features, and rich information is contained. Therefore, the basic knowledge distillation architecture proposed in this embodiment directly uses the neck features for distillation, i.e. minimizes the distance between the neck features of the teacher model and the student model. Using SmoothL a distance as an evaluation function, a loss value based on neck features in trainingCan be calculated from formula (1):

；

Wherein, And/>Representing neck features of the teacher model and the student, respectively. /(I)The number of channels representing the neck feature is 728. /(I)And/>Representing the height and width of the feature, 160 and 320, respectively.

Furthermore, the structure of the detection head module in the pedestrian detection model CSP without the anchor frame is simpler, the detection head module is only composed of a plurality of convolution layers, and three detection diagrams, namely a target center point heat diagram, a scale heat diagram and an offset diagram, can be obtained and are respectively used for predicting the center coordinates, the scale size and the offset of the center point of the target. From these three feature maps, the final target pedestrian prediction frame can be obtained.

Even if the sizes of the backbone networks are different, the three detection graphs obtained by the pedestrian detection models without anchor blocks have the same size, namely the sizes of the detection graphs of the teacher model and the student model are equal. In order to simplify the calculation process of the loss value, distillation training is more convenient, and the three detection graphs are spliced. Order the、/>And/>Representing a target central point heat map, a scale heat map and an offset map respectively, and splicing the detection map/>Can be obtained from formula (2):

；

By using Spliced detection graph representing teacher model,/>Representing a spliced detection diagram corresponding to the student model. In distillation training, the SmoothL function was still used as the loss function. The output-based knowledge distillation loss value can be calculated according to equation (3):

；

Wherein, 、/>And/>The number of channels, the height and the width of the splice detection map are respectively shown.

In a specific optional implementation manner of this embodiment, the aligning and matching the target teacher model and the target student model by using feature-based knowledge distillation and output-based knowledge distillation, and obtaining the pedestrian recognition model further includes:

the gaussian mask is calculated using the following formula:

；

calculating a teacher true value mask by adopting the following formula:

；

the student truth mask is calculated using the following formula:

；

And calculating the local loss based on the fusion mask, and training based on the local loss to obtain the pedestrian recognition model.

Specifically, the present embodiment proposes a local area knowledge distillation algorithm that can focus distillation attention on an area effective for improving the detection ability of a student model, and is therefore also referred to as a local area knowledge distillation algorithm. The local area knowledge distillation algorithm is realized based on multiple types of masks, and the multiple masks are designed and fused, so that the overall weight during distillation is obtained, and the knowledge distillation effect of the pedestrian detection model CSP without the anchor point frame is further improved on the basic distillation framework.

Further, if the center point position diagram (the size of the center point position of the pedestrian is 1 and the rest is 0) in the truth value of the pedestrian detection model CSP training process without the anchor point frame is directly used as a mask, the distillation will only pay attention to the information of the center point of the target pedestrian, and the target edge information will be lost. In this embodiment, a gaussian mask with a larger attention area is used, and the magnitude of the upper value gradually decreases from the center to the edge of the target.

Further, the detection result of the teacher model is not necessarily completely correct, and in many cases, the prediction result of the teacher model is deviated or erroneous from the true value. It is therefore difficult to avoid introducing some of the noise in the teacher model that affects pedestrian detection into the student model during distillation using the basic distillation architecture. The invention designs a teacher true value mask to guide distillation, and guides distillation to transfer information favorable for pedestrian detection in a teacher model to a student model, so that the transfer of error information in the teacher model to the student model is reduced.

Furthermore, the invention provides a student truth value mask calculated by a target center point heat map and a target center point truth value map output by students. In the student truth mask, the value of the point where the student model predicts the error is higher, and the value of the point where the prediction is correct is lower. Thus, adding a student truth mask to the distillation may guide the distillation to focus more on areas where student models are mispredicted.

After three masks are obtained, as shown in fig. 4, the three masks are spliced, and the spliced masks are input into a convolution layer with a convolution kernel size of 1×1, so as to obtain a single-channel combined mask. And then, removing the channel dimension in the combined mask to obtain the finally used fusion mask. Further, a loss calculation is performed.

In the present embodiment, letRepresenting the fused mask. The loss function corresponding to whole local area knowledge distillation can be based on the/>, of the basic distillation architectureAnd/>Expressed as the following formula:

Further, the embodiment realizes knowledge distillation of information flow based on a basic distillation architecture, performs knowledge distillation on the generation process from the neck feature to the detection graph, and expands the types of information transferred between the teacher model and the student model.

In a specific optional implementation manner of this embodiment, the detection map is a target center point heat map, a scale heat map and an offset map, and the aligning and matching are performed on the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation, so that obtaining the pedestrian recognition model further includes:

Specifically, the neck feature of the pedestrian detection model CSP without the anchor frame is different from the channel numbers of the three detection graphs, the channel number of the neck feature is 728, and the channel numbers of the target center point heat graph, the scale heat graph and the offset graph of the three detection graphs are unit numbers. Therefore, the two types of information graphs need to be converted into the same size, so that subsequent calculation operation on the neck feature to detection graph generation matrix is facilitated.

For channel alignment, this embodiment draws attention to this concept. The degree of activation of each pixel on the feature map can be obtained by computing an attention map. Thus, the present embodiment calculates corresponding attention patterns for both the neck feature map and the detection map, and uses the difference between the attention patterns to represent the change from the neck feature to a certain detection map. Since this difference matrix serves to connect the neck feature and the test pattern, this strive-to-note difference matrix is made the bridging matrix. The generation of the bridging matrix is shown in fig. 5. Firstly, respectively carrying out channel compression on the neck feature and the detection graph to obtain a corresponding attention graph. And then subtracting and splicing the attention graphs to obtain the actually used bridging matrix.

The calculation method of the attention map of the neck characteristics of the pedestrian detection model CSP without the anchor frame is as follows:

；

In the above-mentioned method, the step of, For/>Channel number,/>Representing neck features/>The corresponding attention is sought. By adopting the calculation mode, the attention force diagrams of the detection diagrams can be obtained, and the attention force diagrams of the three detection diagrams of the target center point heat diagram, the scale heat diagram and the offset diagram are respectively recorded as/>、/>And/>。

Taking the target center point heat map as an example, the calculation formula of the bridge matrix from the neck feature to the target center point heat map can be expressed as the following formula:

；

In the same way, a bridging matrix corresponding to the scale heat map and offset map can be calculated And/>. Splicing the three bridging matrices to obtain a combined bridging matrix/>。

By usingCombined bridge matrix representing teacher model,/>Representing the bridge matrix of the student model. /(I)The number of channels, height and width of the bridge matrix are shown, respectively. Since all bridging matrices are of the same size, therefore/>Is common. Therefore, the calculation mode of the loss value corresponding to the information flow knowledge distillation algorithm is shown as the following formula:

。

S204: pedestrian detection of the park scene is performed by adopting a pedestrian recognition model.

Specifically, experiments and verification are carried out on CityPersons datasets on the park scene pedestrian detection method based on knowledge distillation, so that the knowledge distillation algorithm based on the anchor-free frame pedestrian detection model CSP provided by the embodiment can effectively improve the detection effect of the anchor-free frame pedestrian detection model, the experimental result shows that the detection effect of the anchor-free frame pedestrian detection model obtained after distillation can be equivalent to that of the current mainstream model, the parameter quantity of the model after distillation is far smaller than that of the common pedestrian detection model, and the method is well suitable for park scenes with limited calculation resources. After the pedestrian recognition model with a relatively standing item is obtained through training, the pedestrian recognition model is used for pedestrian detection of a park scene, and has relatively good detection efficiency.

In the embodiment, a training data set is obtained, wherein the training data set comprises pedestrian detection data of pedestrian labels; training a first model and a second model by adopting a training data set to obtain a target teacher model and a target student model, wherein the first model is a pedestrian detection model without an anchor frame with a backbone network of ResNet and the second model is a pedestrian detection model without an anchor frame with a backbone network of ResNet 18; aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation to obtain a pedestrian recognition model; pedestrian detection of the park scene is performed by adopting a pedestrian recognition model. The compression of the pedestrian detection model is achieved through knowledge distillation, so that the calculation pressure of pedestrian detection tasks on a park scene is reduced, and the efficiency of park pedestrian recognition is improved.

It should be understood that the sequence number of each step in the foregoing embodiment does not mean that the execution sequence of each process should be determined by the function and the internal logic, and should not limit the implementation process of the embodiment of the present invention.

Fig. 6 shows a schematic block diagram of a knowledge distillation-based campus scene pedestrian detection device in one-to-one correspondence with the knowledge distillation-based campus scene pedestrian detection method of the above embodiment. As shown in fig. 6, the knowledge distillation-based campus scene pedestrian detection device includes an acquisition module 31, a training module 32, an alignment module 33, and a detection module 34. The functional modules are described in detail as follows:

An obtaining module 31, configured to obtain a training data set, where the training data set includes pedestrian detection data of a pedestrian label;

The training module 32 is configured to train the first model and the second model with a training data set to obtain a target teacher model and a target student model, where the first model is a pedestrian detection model without an anchor frame with a backbone network of ResNet and the second model is a pedestrian detection model without an anchor frame with a backbone network of ResNet;

An alignment module 33, configured to align and match the target teacher model and the target student model by using feature-based knowledge distillation and output-based knowledge distillation, so as to obtain a pedestrian recognition model;

the detection module 34 is configured to perform pedestrian detection of the campus scene by using the pedestrian recognition model.

Optionally, the alignment module 33 includes:

Optionally, the aligning and matching the target teacher model and the target student model by using feature-based knowledge distillation and output-based knowledge distillation, and obtaining the pedestrian recognition model further includes:

a first calculation unit for calculating a gaussian mask using the following formula:

；/>

；

the fusion unit is used for fusing the Gaussian mask, the teacher true mask and the student true mask to obtain a fusion mask;

Optionally, the alignment module 33 further includes:

and the second training unit is used for calculating information flow knowledge distillation loss based on the combined bridging moment and training according to the information flow knowledge distillation loss to obtain a pedestrian recognition model.

For specific limitations on the knowledge distillation-based campus scene pedestrian detection device, reference may be made to the above limitation on the knowledge distillation-based campus scene pedestrian detection method, and no further description is given here. The various modules in the knowledge distillation based campus scene pedestrian detection device described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 7, fig. 7 is a basic structural block diagram of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42, a network interface 43 communicatively connected to each other via a system bus. It is noted that only a computer device 4 having a component connection memory 41, a processor 42, a network interface 43 is shown in the figures, but it is understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an Application SPECIFIC INTEGRATED Circuit (ASIC), a Programmable gate array (Field-Programmable GATE ARRAY, FPGA), a digital Processor (DIGITAL SIGNAL Processor, DSP), an embedded device, and the like.

The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.

The memory 41 includes at least one type of readable storage medium including flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or D interface display memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD) or the like, which are provided on the computer device 4. Of course, the memory 41 may also comprise both an internal memory unit of the computer device 4 and an external memory device. In this embodiment, the memory 41 is generally used to store an operating system and various application software installed on the computer device 4, such as program codes of a method for detecting pedestrians in a park scene based on knowledge distillation. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.

The processor 42 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing a method for detecting pedestrians in a campus scene based on knowledge distillation.

The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.

The present application also provides another embodiment, namely, a computer readable storage medium storing an interface display program executable by at least one processor to cause the at least one processor to perform the steps of the method for detecting a pedestrian in a campus scene based on knowledge distillation as described above.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.

It is apparent that the above-described embodiments are only some embodiments of the present application, but not all embodiments, and the preferred embodiments of the present application are shown in the drawings, which do not limit the scope of the patent claims. This application may be embodied in many different forms, but rather, embodiments are provided in order to provide a thorough and complete understanding of the present disclosure. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing description, or equivalents may be substituted for elements thereof. All equivalent structures made by the content of the specification and the drawings of the application are directly or indirectly applied to other related technical fields, and are also within the scope of the application.

Claims

1. The park scene pedestrian detection method based on knowledge distillation is characterized by comprising the following steps of:

Training a first model and a second model by adopting the training data set to obtain a target teacher model and a target student model, wherein the first model is an anchor-free frame pedestrian detection model with a backbone network of ResNet & lt 50 & gt, the second model is an anchor-free frame pedestrian detection model with a backbone network of ResNet & lt 18 & gt, the training data set is trained and identified by the target teacher model and the target student model, and neck characteristics, a detection diagram and a splice diagram are sequentially obtained after the training data set passes through the backbone network;

Adopting the pedestrian recognition model to detect pedestrians in a park scene;

calculating to obtain an output-based knowledge distillation loss value according to the spliced graph of the target teacher model and the target student model;

The method for obtaining the pedestrian recognition model comprises the steps of obtaining a target teacher model, obtaining a target student model, obtaining a target center point heat map, a scale heat map and an offset map, aligning and matching the target teacher model and the target student model by adopting feature-based knowledge distillation and output-based knowledge distillation, and obtaining the pedestrian recognition model, wherein the detection map is the target center point heat map, the scale heat map and the offset map, and the method comprises the following steps:

and calculating information flow knowledge distillation loss based on the combined bridge matrix, and training according to the information flow knowledge distillation loss to obtain the pedestrian recognition model.

2. The knowledge distillation based campus scene pedestrian detection method of claim 1 wherein the training dataset is a pedestrian detection dataset CityPersons comprising street scenes at different locations, different time periods, different weather conditions.

3. The knowledge distillation based campus scene pedestrian detection method of claim 1 wherein training the first model and the second model with the training data set to obtain a target teacher model and a target student model, respectively, comprises:

4. The knowledge distillation based campus scene pedestrian detection method of claim 1 wherein the aligning and matching the target teacher model and the target student model using feature based knowledge distillation and output based knowledge distillation to obtain a pedestrian recognition model further comprises:

the gaussian mask is calculated using the following formula:

Wherein Mask _gaussi,j represents the numerical value of the point with the position (i, j) in the Gaussian Mask, K is the number of target rows in the picture corresponding to the Gaussian Mask, (x _k,y_k) represents the coordinate of the center point of the kth pedestrian, and w _k and h _k are the width and the height of the detection frame true value corresponding to the pedestrian;

calculating a teacher true value mask by adopting the following formula:

Wherein, A target center point heat map representing a teacher model, and GT ^center represents a target center point truth map;

the student truth mask is calculated using the following formula:

5. Park scene pedestrian detection device based on knowledge distillation, its characterized in that, park scene pedestrian detection device based on knowledge distillation includes:

The detection module is used for detecting pedestrians in a park scene by adopting the pedestrian recognition model;

the training data set is trained and identified through the target teacher model and the target student model, and neck characteristics, a detection diagram and a splicing diagram are sequentially obtained after the training data set passes through a backbone network; the alignment module includes:

The second alignment unit is used for calculating and obtaining an output-based knowledge distillation loss value according to the spliced graph of the target teacher model and the target student model;

The detection map is a target center point heat map, a scale heat map and an offset map, and the alignment module further comprises:

6. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the knowledge distillation based campus scene pedestrian detection method of any one of claims 1 to 4.

7. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the knowledge distillation based campus scene pedestrian detection method of any one of claims 1 to 4.