CN114220076A - Multi-target detection method, device and application thereof

Multi-target detection method, device and application thereof

Info

Publication number
CN114220076A
Authority
CN
China
Prior art keywords
module
feature
processed
image
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111559680.7A
Other languages
Chinese (zh)
Inventor
郁强
张香伟
毛云青
金仁杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202111559680.7A
Publication of CN114220076A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Abstract

The application provides a multi-target detection method, a multi-target detection apparatus, and applications thereof. The method comprises: acquiring an image to be detected; and inputting the image to be detected into a multi-target detection model to obtain a detection result. The multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence. Each backbone layer produces a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map. The method can extract multi-scale feature information, flexibly adjust the convolution-kernel parameters for input images of different sizes, extract finer-grained multi-scale spatial information more effectively, establish longer-range channel dependencies, and adaptively recalibrate the multi-scale channel attention vectors.

Description

Multi-target detection method, device and application thereof
Technical Field
The present application relates to the field of target detection, and in particular, to a multi-target detection method, apparatus and application thereof.
Background
In recent years, target detection algorithms have made great breakthroughs. Mainstream target detection algorithms currently fall into two categories according to the number of algorithm stages. The first category is two-stage target detection algorithms, such as the region-proposal-based R-CNN family (R-CNN, Fast R-CNN, and Faster R-CNN), which first generate target candidate boxes, i.e., candidate target positions in the image, and then classify and regress those candidate boxes. The second category is single-stage target detection algorithms, such as YOLO and SSD, which directly predict the classes and locations of different targets using a single convolutional neural network (CNN). The first category is more accurate but slower, while the second is faster but less accurate.
Many researchers have made progress in the field of target detection and recognition. However, given the differences among the specific application scenarios to which target detection algorithms are applied, existing algorithms on the market still have significant limitations in some special scenarios. In particular, current target detection algorithms do not handle targets of different scales well: the conventional convolution process treats the different semantics of feature representations at different scales equally, so differences between targets of different scales are easily ignored, which is one reason the accuracy of single-stage target detection algorithms is slightly lower.
Take Yolov4 in the YOLO family as an example. Yolov4 includes five basic components: CBM, CBL, Res unit, CSPX, and SPP. CBM is the smallest component in the Yolov4 network structure and consists of a convolutional layer (Conv), a batch normalization layer (Bn) and a Mish activation function. CBL consists of Conv + Bn + Leaky ReLU. A Res unit is a residual unit which, borrowing the residual structure of ResNet, allows the network to be built deeper. CSPX borrows the CSPNet structure and consists of several convolutional layers stacked with X Res units. SPP is spatial pyramid pooling, which performs multi-scale fusion with max pooling at 1 × 1, 5 × 5, 9 × 9, and 13 × 13. The convolution kernel in front of each CSP module is 3 × 3 with stride 2, which is equivalent to downsampling. In addition, Yolov4 includes the Concat operation, a tensor concatenation that expands the channel dimension, and the Add operation, a tensor addition that does not. A minimal sketch of these components is given below.
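To make these components concrete, here is a minimal PyTorch sketch of the CBM, CBL and SPP blocks described above. It illustrates the published Yolov4 building blocks rather than code from the patent; the channel handling and the use of torch.nn.Mish are our assumptions.

```python
import torch
import torch.nn as nn

class CBM(nn.Module):
    """Smallest Yolov4 component: Conv + BatchNorm + Mish activation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CBL(nn.Module):
    """Conv + BatchNorm + LeakyReLU, used mainly in the neck."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPP(nn.Module):
    """Spatial pyramid pooling: 1/5/9/13 max pooling in parallel, then Concat.
    The 1x1 pooling branch is simply the identity."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels
        )

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)  # 4x channels
```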
Specifically, Yolov4 comprises a backbone network (Backbone), a neck module (Neck) and a prediction head, where the backbone adopts the CSPDarknet53 structure, SPP is an additional module of the neck, and PANet is the neck's feature fusion module. Since the backbone contains 5 CSP modules, if the input image to be detected is 608 × 608, the corresponding feature map sizes change as 608, 304, 152, 76, 38, 19; that is, after the CBM extracts features from the input image, five passes through CSP modules yield a 19 × 19 feature map. The prediction head predicts the final multi-class classification and bounding-box locations. It comprises a classification sub-network for predicting classes and a box-regression sub-network for predicting boxes; a shallow layer of the network outputs the branch that predicts small targets, a middle layer outputs the branch that predicts medium targets, a deep layer outputs the branch that predicts large targets, and the prediction box with the minimum label loss is finally selected through non-maximum suppression.
Conventional convolution has two main properties: spatial invariance and channel specificity. Spatial invariance brings advantages such as parameter sharing and translation equivariance, but its drawbacks are also obvious: the extracted features are homogeneous, and the convolution-kernel parameters cannot be flexibly adjusted for inputs containing targets of different sizes; meanwhile, channel-specific convolution kernels are redundant in the channel dimension. In other words, on the one hand conventional convolution deprives the kernel of the ability to adapt to different visual patterns at different spatial positions and limits its receptive field, making small or blurred targets very difficult to detect; on the other hand, inter-channel redundancy inside the convolution kernel limits its flexibility across channels.
Disclosure of Invention
In a first aspect, an embodiment of the present application provides a multi-target detection method, including:
acquiring an image to be detected;
inputting the image to be detected into a multi-target detection model to obtain a detection result;
the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be detected, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map.
In some embodiments, the backbone network includes a first backbone layer, a second backbone layer, and a third backbone layer, wherein the first backbone layer outputs a large-size second feature map, the second backbone layer outputs a medium-size second feature map, and the third backbone layer outputs a small-size second feature map.
In some embodiments, the prediction head comprises a first detection layer, a second detection layer and a third detection layer, wherein the first detection layer outputs prediction boxes for large targets, the second detection layer outputs prediction boxes for medium targets, the third detection layer outputs prediction boxes for small targets, and the prediction box with the minimum label loss is selected through non-maximum suppression.
In some embodiments, the neck module comprises a CBL module for extracting features and an SPP module for multi-scale fusion of the extracted features. The neck module adopts an FPN-plus-PAN structure to aggregate the features from the different backbone layers into the different detection layers: the FPN layer conveys strong semantic features top-down, while the PAN layer conveys strong localization features bottom-up.
In some embodiments, each detection layer of the prediction head includes a classification module and a regression module, and the regression module performs target detection on the image to be detected according to the strong semantic and strong localization features using CIoU Loss regression to obtain the prediction boxes corresponding to that detection layer.
Specifically, in some embodiments, the split-attention module comprises an SPC module, an SE weight module and a feature recalibration module connected in sequence;
the SPC module evenly splits the first feature map, performs feature extraction at different scales, and fuses the extracted features to obtain a first multi-scale feature map carrying multi-scale feature information;
the SE weight module extracts channel attention vectors at the different scales from the first multi-scale feature map;
the feature recalibration module recalibrates the attention vectors of the different-scale channels to obtain new attention weights after multi-scale channel interaction;
and the first multi-scale feature map is weighted by these attention weights to obtain the second feature map carrying multi-scale feature information and attention weights.
In some embodiments, the SPC module includes a feature splitting module and a feature fusion module, where the feature splitting module divides the first feature map equally into a plurality of first feature sub-maps by channel count and applies involution at different scales to obtain the corresponding first involution feature sub-maps, and the plurality of first involution feature sub-maps are fused into a first multi-scale feature map carrying multi-scale feature information.
In some embodiments, the feature recalibration module recalibrates the attention vectors of the different-scale channels using a Softmax function to obtain the new attention weights after multi-scale channel interaction.
In some embodiments, the involution hyper-parameters are chosen as a 7 × 7 kernel, 16 grouped channels, and a channel compression ratio of 4 in the kernel-generating branch of the involution.
In a second aspect, an embodiment of the present application provides a city management detection method, including the following steps:
acquiring a first image to be processed;
inputting the first image to be processed into a multi-target detection model to obtain a first detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be processed, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map;
if the first detection result comprises one or more events to be processed, filing a case for each event to be processed according to the source of the first image to be processed, so that each event to be processed corresponds to a case containing case information;
selecting a corresponding processing method according to the case information and dispatching a task to be executed;
and acquiring a second image to be processed at the same place as the first image to be processed again according to the feedback result of the task to be executed, and rechecking the processing result of the case according to the second image to be processed.
Specifically, in some embodiments, "acquiring the first image to be processed" includes acquiring the first image to be processed from a surveillance video or from reported event information.
In some embodiments, the case information includes the case-filing time, the event location and the event category corresponding to the event to be processed. Accordingly, "selecting a corresponding processing method according to the case information and dispatching a task to be executed" includes: querying the handling department corresponding to the event category, dispatching the task to be executed to an executor of that department near the event site, and setting a deadline for the task according to the case-filing time.
In addition, in some embodiments, "acquiring a second image to be processed at the same location as the first image to be processed according to the feedback result of the task to be executed, and reviewing the processing result of the event to be processed" includes: after the executor finishes handling the event to be processed, uploading a processed image as the feedback result of the task; after the feedback result is received, acquiring a second image to be processed at the same location as the first; inputting the second image to be processed into the multi-target detection model to obtain a second detection result; and reviewing the event to be processed according to the second detection result.
Specifically, reviewing the event to be processed according to the second detection result includes: if the second detection result no longer contains the event to be processed, the case is closed; if the second detection result still contains the event to be processed, the task to be executed is regenerated.
In a third aspect, an embodiment of the present application provides a multi-target detection apparatus, configured to implement the multi-target detection method in the first aspect, where the apparatus includes the following modules:
the first acquisition module is used for acquiring an image to be detected;
the first detection module is used for inputting the image to be detected into the multi-target detection model to obtain a detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be detected, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map.
In a fourth aspect, an embodiment of the present application provides a city management apparatus, configured to implement the city management detection method in the second aspect, where the apparatus includes the following modules:
a second acquisition module, configured to acquire a first image to be processed;
a second detection module, configured to input the first image to be processed into the multi-target detection model to obtain a first detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be processed, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map;
a case filing module, configured to, if the first detection result comprises one or more events to be processed, file a case for each event to be processed according to the source of the first image to be processed, forming a case containing case information for each event;
a task dispatching module, configured to select a corresponding processing method according to the case information and dispatch a task to be executed;
a case review module, configured to acquire a second image to be processed at the same location as the first image to be processed according to the feedback result of the task to be executed, and review the processing result of the case according to the second image to be processed.
In a fifth aspect, an embodiment of the present application provides an electronic apparatus comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the multi-target detection method or the city management detection method of any of the above embodiments.
In a sixth aspect, an embodiment of the present application provides a computer program product, where the computer program product includes: a program or instructions which, when run on a computer, causes the computer to perform a multi-target detection method or a city management detection method as described in any of the application embodiments above.
In a seventh aspect, the present application provides a readable storage medium storing a computer program, the computer program comprising program code for controlling a process to execute the multi-target detection method or the city management detection method of any of the above embodiments.
The main contributions and innovations of the embodiments of the present application are as follows. The multi-target detection model in the proposed method takes the Yolov4 model as a basis and upgrades the convolutional layers of the residual units of the CSP modules in the backbone network to split-attention modules. The split-attention module extracts multi-scale feature information by applying grouped convolution branches to the split first feature map and recombining them, and replaces conventional convolution with involution, which avoids depriving the convolution kernel of the ability to adapt to different visual patterns at different spatial positions, and also avoids the difficulty of recognizing small targets or blurred images caused by the limited receptive field of conventional convolution. In other words, the multi-target detection model in the embodiments of the application can not only extract multi-scale feature information but also flexibly adjust the convolution-kernel parameters to the sizes of the input images to be detected, resolving the channel-dimension redundancy of channel-specific convolution kernels; it extracts finer-grained multi-scale spatial information more effectively, establishes longer-range channel dependencies, learns richer multi-scale feature representations, and adaptively recalibrates the multi-scale channel attention vectors.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow diagram of a multi-target detection method according to an embodiment of the present application;
FIG. 2 is a model architecture diagram of a multi-target detection model according to an embodiment of the present application;
FIG. 3 is a CSP module schematic diagram of a multi-target detection method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the residual unit structure before improvement in the multi-target detection method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the residual unit structure after improvement in the multi-target detection method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the split-attention module of the multi-target detection method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an SPC module of the multi-target detection method according to an embodiment of the present application;
FIG. 8 is a flow chart of a city management detection method according to an embodiment of the application;
FIG. 9 is a block diagram of a multi-target detection apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of a city management apparatus according to an embodiment of the present application;
fig. 11 is a hardware configuration diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
Example one
The embodiment provides a multi-target detection method that aims to extract finer-grained multi-scale spatial information effectively, establish longer-range channel dependencies, learn richer multi-scale feature representations, and adaptively recalibrate the multi-scale channel attention vectors.
Referring specifically to FIG. 1, as shown in FIG. 1, the method includes steps S1-S2:
step S1: acquiring an image to be detected;
step S2: inputting the image to be detected into a multi-target detection model to obtain a detection result; the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be detected, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map.
In this embodiment, the multi-target detection model is an improvement of the Yolov4 model. Specifically, the model architecture is shown in fig. 2: the model comprises a backbone network, a neck module and a prediction head connected in sequence.
The backbone network includes three backbone layers: a first, a second and a third backbone layer. The first backbone layer comprises three adjacent CSP modules, namely the adjacent CSP1, CSP2 and CSP8 modules in fig. 2; the second backbone layer comprises a CSP8 module; and the third backbone layer comprises a CSP4 module. If the acquired image to be detected is 608 × 608, the first backbone layer outputs a large-size second feature map of 76 × 76, the second backbone layer turns that into a medium-size second feature map of 38 × 38, and the third backbone layer turns that into a small-size second feature map of 19 × 19. The structure of each CSP module is shown in fig. 3: CSPX denotes a stack of several convolutional layers and X residual units, i.e., a CSP8 module contains several convolutional layers and 8 residual units, a CSP4 module several convolutional layers and 4 residual units, and so on. A wiring sketch of these stages follows.
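The following sketch shows how the three backbone layers could be wired, reusing the CBM block from the earlier sketch; `ResUnit` is the improved residual unit described in the next paragraph. The channel widths and the internal split-path layout of `CSP` are our assumptions based on Fig. 3, not values stated in the text.

```python
def down(c_in, c_out):
    """Stride-2 3x3 CBM placed in front of each CSP module (see text)."""
    return CBM(c_in, c_out, k=3, s=2)

class CSP(nn.Module):
    """CSPX (Fig. 3): split into two half-channel paths, pass one through
    X residual units, then concatenate and merge."""
    def __init__(self, c, n_res):
        super().__init__()
        self.path_a = CBM(c, c // 2, k=1)
        self.path_b = nn.Sequential(
            CBM(c, c // 2, k=1), *[ResUnit(c // 2) for _ in range(n_res)]
        )
        self.merge = CBM(c, c, k=1)

    def forward(self, x):
        return self.merge(torch.cat([self.path_a(x), self.path_b(x)], dim=1))

class Backbone(nn.Module):
    """608x608 input -> second feature maps of 76x76, 38x38 and 19x19."""
    def __init__(self):
        super().__init__()
        self.stem = CBM(3, 32)
        self.layer1 = nn.Sequential(                  # first backbone layer
            down(32, 64), CSP(64, 1),                 # 608 -> 304 (CSP1)
            down(64, 128), CSP(128, 2),               # 304 -> 152 (CSP2)
            down(128, 256), CSP(256, 8),              # 152 -> 76  (CSP8)
        )
        self.layer2 = nn.Sequential(down(256, 512), CSP(512, 8))    # 76 -> 38
        self.layer3 = nn.Sequential(down(512, 1024), CSP(1024, 4))  # 38 -> 19

    def forward(self, x):
        large = self.layer1(self.stem(x))    # large-size second feature map
        medium = self.layer2(large)          # medium-size second feature map
        small = self.layer3(medium)          # small-size second feature map
        return large, medium, small
```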
As shown in fig. 4, the residual unit in the original Yolov4 model contains two CBM modules, where a CBM consists of a convolutional layer (Conv), a batch normalization layer (Bn) and a Mish activation function. The multi-target detection model in this embodiment replaces the convolutional layer in the residual unit with the split-attention module; the resulting residual unit structure is shown in fig. 5 and sketched in code below.
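A sketch of the improved residual unit of Fig. 5, with `SplitAttention` standing for the split-attention module detailed below. Exactly which of the two original CBM modules is replaced is our reading of Figs. 4 and 5, not something the text spells out.

```python
class ResUnit(nn.Module):
    """Improved residual unit (Fig. 5): a CBM followed by a split-attention
    module, added back onto the input (the Add operation)."""
    def __init__(self, c):
        super().__init__()
        self.cbm = CBM(c, c, k=1)
        self.attn = SplitAttention(c)  # sketched later in this embodiment

    def forward(self, x):
        return x + self.attn(self.cbm(x))
```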
Specifically, the structure of the split-attention module in some embodiments is shown in fig. 6: it comprises an SPC module, an SE weight module and a feature recalibration module connected in sequence.
The SPC module evenly splits the first feature map, performs feature extraction at different scales, and fuses the extracted features to obtain a first multi-scale feature map carrying multi-scale feature information.
The SE weight module extracts channel attention vectors at the different scales from the first multi-scale feature map.
The feature recalibration module recalibrates the attention vectors of the different-scale channels to obtain new attention weights after multi-scale channel interaction.
The first multi-scale feature map is then weighted by these attention weights to obtain the second feature map carrying multi-scale feature information and attention weights.
In particular, the SPC module itself is also improved in some embodiments. The improved SPC module comprises a feature splitting module and a feature fusion module: the feature splitting module divides the first feature map equally into a plurality of first feature sub-maps by channel count and applies involution at different scales to obtain the corresponding first involution feature sub-maps, and the feature fusion module fuses these sub-maps into the first multi-scale feature map carrying multi-scale feature information.
Specifically, to balance parameter count against precision, in some embodiments the involution hyper-parameters are chosen as a 7 × 7 kernel, 16 grouped channels, and a channel compression ratio of 4 in the kernel-generating branch. The improved SPC structure is shown in fig. 7: the first feature map X is split equally into four first feature sub-maps X0, X1, X2 and X3, each holding a quarter of the channels of X. Involution at a different scale is then applied to each first feature sub-map, yielding the corresponding first involution feature sub-maps, namely F0, F1, F2 and F3 in the figure. Finally, F0, F1, F2 and F3 are fused into the first multi-scale feature map carrying multi-scale feature information. A sketch of this stage follows.
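A minimal sketch of the improved SPC stage, with `Involution` sketched after the next paragraph. The text fixes a 7 × 7 involution kernel, so using distinct kernel sizes per branch (3/5/7/9 here) to realize "different scales" is our assumption.

```python
class SPC(nn.Module):
    """Split the first feature map X into X0..X3 along channels, apply
    involution at a different scale per branch (F0..F3 in Fig. 7), and fuse
    the results by concatenation into the first multi-scale feature map."""
    def __init__(self, c, kernels=(3, 5, 7, 9)):
        super().__init__()
        assert c % 4 == 0, "channels must split evenly into four sub-maps"
        self.branches = nn.ModuleList(Involution(c // 4, k=k) for k in kernels)

    def forward(self, x):
        splits = torch.chunk(x, 4, dim=1)   # X0..X3, each with C/4 channels
        feats = [b(s) for b, s in zip(self.branches, splits)]
        return torch.cat(feats, dim=1)      # first multi-scale feature map
```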
In this embodiment, applying involution after evenly splitting the first feature map mainly reduces the parameter count in the manner of grouped convolution. Involution also has two advantages over conventional convolution: first, it aggregates context over a wider spatial extent, overcoming the difficulty of modeling long-range interactions; second, it adaptively allocates weights over different positions, prioritizing the most informative visual elements in the spatial domain. It further reduces the parameters and computation of the whole convolution process, improving efficiency. Replacing conventional convolution with involution in effect re-allocates compute at a microscopic granularity; the essence of network design is the allocation of compute, and the limited compute is thereby steered to where it yields the best performance.
Specifically, the design principle of the involution layer is to invert the two design principles of the conventional convolutional layer, i.e., to go from spatial independence to spatial specificity and from channel specificity to channel independence. Hence, in the manner of grouped convolution, the feature maps within a group share the parameters of one kernel, while different kernels are used at different spatial positions within the same group, and the results of the groups are concatenated back together after processing. A sketch of such a layer, with the hyper-parameters above, follows.
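The sketch below uses the stated hyper-parameters (7 × 7 kernel, 16 channels per group, compression ratio 4). The unfold-based application of position-specific, group-shared kernels follows the published involution operator; treat it as an interpretation of the text, not the patent's exact implementation.

```python
class Involution(nn.Module):
    """Involution: kernels are generated per spatial position from the input
    (two 1x1 convs with reduction ratio 4) and shared by all channels of a
    group (16 channels per group), inverting conventional convolution."""
    def __init__(self, c, k=7, group_ch=16, reduction=4):
        super().__init__()
        self.k = k
        self.groups = max(c // group_ch, 1)
        self.reduce = nn.Conv2d(c, c // reduction, 1)  # kernel-generating branch
        self.span = nn.Conv2d(c // reduction, self.groups * k * k, 1)
        self.unfold = nn.Unfold(k, padding=k // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # one k*k kernel per spatial position and per channel group
        kernel = self.span(self.reduce(x)).view(b, self.groups, self.k ** 2, h, w)
        # gather k*k neighbourhoods and apply the generated kernels
        patches = self.unfold(x).view(
            b, self.groups, c // self.groups, self.k ** 2, h, w)
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)
        return out.view(b, c, h, w)
```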
In addition, in some embodiments, the feature recalibration module recalibrates the attention vectors of the different-scale channels using a Softmax function to obtain the new attention weights after multi-scale channel interaction. The Softmax function, also called the normalized exponential function, generalizes the binary sigmoid function to multiple classes and expresses the multi-class result as a probability distribution; its details are not repeated here. The remaining two stages of the split-attention module are sketched below.
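The SE weight and Softmax recalibration stages, combined with SPC into the complete split-attention module of Fig. 6. Applying one shared SE branch per scale group and taking the Softmax across the four groups is our reading of the text; the SE reduction ratio is illustrative.

```python
class SEWeight(nn.Module):
    """Squeeze-and-excitation branch: one channel attention vector per scale."""
    def __init__(self, c, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, max(c // r, 1), 1), nn.ReLU(inplace=True),
            nn.Conv2d(max(c // r, 1), c, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(x)  # shape (B, C, 1, 1)

class SplitAttention(nn.Module):
    """Fig. 6: SPC -> per-scale SE attention vectors -> Softmax recalibration
    across scales -> reweighting, yielding the second feature map."""
    def __init__(self, c):
        super().__init__()
        self.spc = SPC(c)
        self.se = SEWeight(c // 4)

    def forward(self, x):
        b, c, h, w = x.shape
        feats = self.spc(x).view(b, 4, c // 4, h, w)      # four scale groups
        attn = torch.stack([self.se(feats[:, i]) for i in range(4)], dim=1)
        attn = torch.softmax(attn, dim=1)                 # multi-scale interaction
        return (feats * attn).view(b, c, h, w)            # weighted second feature map
```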
The neck module comprises a CBL module for extracting features and an SPP module for multi-scale fusion of the extracted features. The neck adopts an FPN-plus-PAN structure to aggregate the features from the different backbone layers into the different detection layers: the FPN layer conveys strong semantic features top-down, while the PAN layer conveys strong localization features bottom-up, as sketched below.
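A heavily compressed sketch of this aggregation pattern, reusing CBL and SPP from the background sketch. The real Yolov4 neck interleaves more CBL stacks than shown, and the channel widths follow the backbone sketch's assumptions.

```python
class Neck(nn.Module):
    """FPN top-down path (strong semantics) followed by a PAN bottom-up path
    (strong localization); SPP sits on the deepest branch."""
    def __init__(self, c=(256, 512, 1024)):
        super().__init__()
        self.spp = nn.Sequential(CBL(c[2], c[2] // 2), SPP(), CBL(c[2] * 2, c[2] // 2))
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.lat_m, self.lat_l = CBL(c[1], c[2] // 2), CBL(c[0], c[1] // 2)
        self.fuse_m, self.fuse_l = CBL(c[2], c[1] // 2), CBL(c[1], c[0] // 2)
        self.down_l = CBL(c[0] // 2, c[1] // 2, k=3, s=2)
        self.down_m = CBL(c[1], c[2] // 2, k=3, s=2)

    def forward(self, large, medium, small):
        p5 = self.spp(small)                                               # 19x19
        p4 = self.fuse_m(torch.cat([self.lat_m(medium), self.up(p5)], 1))  # 38x38
        p3 = self.fuse_l(torch.cat([self.lat_l(large), self.up(p4)], 1))   # 76x76
        n4 = torch.cat([self.down_l(p3), p4], 1)   # bottom-up: 76 -> 38
        n5 = torch.cat([self.down_m(n4), p5], 1)   # bottom-up: 38 -> 19
        return p3, n4, n5  # aggregated feature maps for the three detection layers
```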
The prediction head comprises a first detection layer, a second detection layer and a third detection layer, where the first detection layer outputs prediction boxes for large targets, the second for medium targets, and the third for small targets, and the prediction box with the minimum label loss is selected through non-maximum suppression. Specifically, each detection layer of the prediction head contains a classification module and a regression module; the regression module detects targets in the image to be detected according to the strong semantic and strong localization features using CIoU Loss regression, yielding the prediction boxes for that detection layer. A minimal form of this loss is sketched below.
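For reference, a minimal implementation of the CIoU loss used by the regression modules; the (cx, cy, w, h) box format with positive widths and heights is an assumption.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss: 1 - IoU, penalised by normalised centre distance and
    aspect-ratio inconsistency. Boxes are (..., 4) tensors in (cx, cy, w, h)."""
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    tx1, ty1 = target[..., 0] - target[..., 2] / 2, target[..., 1] - target[..., 3] / 2
    tx2, ty2 = target[..., 0] + target[..., 2] / 2, target[..., 1] + target[..., 3] / 2

    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    union = pred[..., 2] * pred[..., 3] + target[..., 2] * target[..., 3] - inter
    iou = inter / (union + eps)

    # squared centre distance over squared diagonal of the enclosing box
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    rho2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (torch.atan(target[..., 2] / (target[..., 3] + eps))
                              - torch.atan(pred[..., 2] / (pred[..., 3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```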
Example two
This embodiment applies the method of the first embodiment to smart city management, for example to problems such as road garbage, out-of-store operation, mobile vending and illegal parking, and provides a city management detection method: real-time information is captured from surveillance video, and the multi-target detection method efficiently identifies and manages the target events that need attention.
In this embodiment, violation events that frequently occur in city management are taken as management targets, e.g., road garbage, out-of-store operation, mobile vendors, illegal parking, water accumulated on low-lying road surfaces, scattered piles of material, non-motor-vehicle parking, street-side airing, damaged greenery, illegally placed billboards, damaged road surfaces, and illegally posted small advertisements. That is, if the first detection result obtained by feeding a first image to be processed into the multi-target detection model contains any violation event belonging to the management targets, that event is taken as an event to be processed. A detected event to be processed triggers an early-warning display on the smart-city monitoring platform; for mobile vendors, illegal parking and non-motor-vehicle parking, a duration check is added, i.e., these three event types are displayed on the monitoring platform as violations only if they are detected to persist beyond a certain time.
Referring to fig. 8, as shown in fig. 8, the city management detection method includes steps S10 to S14:
step S10: and acquiring a first image to be processed.
Specifically, the "acquiring the first image to be processed" includes: and acquiring the first image to be processed from a monitoring video or acquiring the first image to be processed from reported event information. That is to say, the manner of acquiring the first to-be-processed image in this embodiment is not limited, and may be acquired from the monitoring video, or the reporting event information acquired from the terminal associated with the smart management city, or acquired in other manners of the internet of things.
Step S11: inputting the first image to be processed into a multi-target detection model to obtain a first detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be processed, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map.
The first image to be processed is input into the multi-target detection model for violation-event detection. It should be noted that the multi-target detection model in this embodiment has already been trained, and the image data used to train it cover the multiple categories of violation events.
The multi-target detection model in this embodiment has the same structure as the multi-target detection model described in the first embodiment, and is not described herein again.
Step S12: if the first detection result comprises one or more events to be processed, filing a case for each event to be processed according to the source of the first image to be processed, so that each event to be processed corresponds to a case containing case information;
when the first detection result comprises one or more events to be processed, the events to be processed are planned according to the source of the events to be processed. That is, if the source of the event to be processed is video monitoring, an event location is generated according to the location where the corresponding video monitoring is located; and if the source of the processing event is the event information reported by the terminal, the reported event information is sensed to obtain the event location. And each event to be processed generates a corresponding case, and records case setting time and event location in case information of the case. In addition, the event in the video monitoring or the reporting time of the event reported by the terminal can be recorded in the case information according to the requirement.
In particular, the event category of each event to be processed needs to be recorded in the case information, so that tasks can be dispatched to the relevant handling departments in subsequent steps.
Step S13: selecting a corresponding processing method according to the case information and dispatching a task to be executed.
Specifically, a corresponding processing method is selected according to the case information, or the department specifically responsible for handling this type of event is queried, and a task is dispatched to an executor in that department.
In some embodiments, "selecting a corresponding processing method and dispatching a task to be executed according to the case information" includes: and inquiring a corresponding handling department according to the event category, dispatching a task to be executed to executive personnel close to the event site in the handling department, and setting a cut-off event of the task to be executed according to the plan time.
That is, according to the event category, the task generated from the case is sent to the specialized handling department, which dispatches the executor nearest to the event location; a completion time limit is set based on the case filing, and beyond that limit the task can be re-assigned to another executor, so that a task does not remain unfinished for a long time because the assigned executor is delayed by other matters.
Step S14: acquiring a second image to be processed at the same location as the first image to be processed according to the feedback result of the task to be executed, and reviewing the processing result of the case according to the second image to be processed.
Specifically, after the executor's feedback on the task to be executed is received, whether the event to be processed still exists at the event location is checked shortly afterwards. If it no longer exists, the case can be closed; if it still exists, a task for handling the event must be dispatched again.
In some embodiments, "obtaining a second to-be-processed image at the same location as the first to-be-processed image again according to the feedback result of the to-be-executed task, and reviewing the processing result of the to-be-processed event" includes: after the executive staff finishes processing the event to be processed, uploading a processed image as a feedback result of the task to be executed; after the feedback result is obtained, obtaining a second image to be processed at the same place as the first image to be processed again; inputting the second image to be processed into the multi-target detection model to obtain a second detection result; and rechecking the event to be processed according to the second detection result. Specifically, the "rechecking the event to be processed according to the second detection result" includes: if the second detection result does not comprise the event to be processed, the event to be processed is finalized; and if the second detection result still contains the event to be processed, regenerating the task to be executed.
Example three
Based on the same concept of the first embodiment, the present embodiment further provides a multi-target detection apparatus for implementing the multi-target detection method described in the first embodiment, specifically referring to fig. 9, where fig. 9 is a structural block diagram of the multi-target detection apparatus according to the embodiment of the present application, and as shown in fig. 9, the apparatus includes:
the first acquisition module is used for acquiring an image to be detected;
the first detection module is used for inputting the image to be detected into the multi-target detection model to obtain a detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be detected, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map.
Example four
Based on the same concept of the second embodiment, the present embodiment further provides a city management device, which is used to implement the city management detection method described in the second embodiment, specifically referring to fig. 10, where fig. 10 is a structural block diagram of the city management device according to the embodiment of the present application, and as shown in fig. 10, the device includes:
a second acquisition module, configured to acquire a first image to be processed;
a second detection module, configured to input the first image to be processed into the multi-target detection model to obtain a first detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each containing a CSP module in which the convolutional layers of the residual units are replaced by split-attention modules; the CBM module obtains a first feature map from the image to be processed, and the CSP module splits the first feature map and then performs multi-scale feature extraction and channel-attention weighting to obtain a second feature map carrying multi-scale feature information and channel-attention weights; the neck module performs feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head performs multi-target detection according to the aggregated feature map;
a case filing module, configured to, if the first detection result comprises one or more events to be processed, file a case for each event to be processed according to the source of the first image to be processed, forming a case containing case information for each event;
a task dispatching module, configured to select a corresponding processing method according to the case information and dispatch a task to be executed;
a case review module, configured to acquire a second image to be processed at the same location as the first image to be processed according to the feedback result of the task to be executed, and review the processing result of the case according to the second image to be processed.
Example five
The present embodiment further provides an electronic apparatus, referring to fig. 11, comprising a memory 404 and a processor 402, wherein the memory 404 stores a computer program, and the processor 402 is configured to execute the computer program to perform the steps of any one of the multi-target detection method or the city management detection method in the foregoing embodiments.
Specifically, the processor 402 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 404 may include mass storage for data or instructions. By way of example and not limitation, memory 404 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 404 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus. In a particular embodiment, memory 404 is non-volatile memory. In particular embodiments, memory 404 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPMDRAM), Extended Data Out DRAM (EDODRAM), Synchronous DRAM (SDRAM), and the like.
Memory 404 may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by processor 402.
The processor 402 may implement any of the multi-target detection or city management detection methods described in the above embodiments by reading and executing the computer program instructions stored in the memory 404.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402, and the input/output device 408 is connected to the processor 402.
The transmitting device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include wired or wireless networks provided by communication providers of the electronic devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmitting device 406 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The input and output devices 408 are used to input or output information. In this embodiment, the input information may be a current data table such as an epidemic situation trend document, feature data, a template table, and the like, and the output information may be a feature fingerprint, a fingerprint template, text classification recommendation information, a file template configuration mapping table, a file template configuration information table, and the like.
Alternatively, in this embodiment, the processor 402 may be configured to execute the steps of any one of the multi-target detection methods or the city management detection method in the above embodiments through a computer program.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the multi-target detection method or the city management detection method in the above embodiments, an embodiment of the present application may be implemented as a computer program product. The computer program product comprises a program or instructions which, when run on a computer, cause the computer to perform any of the multi-target detection methods or city management detection methods in the above embodiments.
In addition, in combination with the multi-target detection method or the city management detection method in the above embodiments, an embodiment of the present application may provide a readable storage medium having a computer program stored thereon; when executed by a processor, the computer program implements any of the multi-target detection methods or city management detection methods in the above embodiments.
In general, the various embodiments may be implemented in hardware or special-purpose circuits, software, logic, or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software executable by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or some other pictorial representation, it should be well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware, a controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products), including software routines, applets and/or macros, can be stored in any device-readable data storage medium, and they comprise program instructions for performing particular tasks. A computer program product may comprise one or more computer-executable components that are configured to perform embodiments when the program is run; the one or more computer-executable components may be at least one software code or a portion thereof. Further, any block of the logic flows in the figures may represent a program step, interconnected logic circuits, blocks and functions, or a combination of a program step and logic circuits, blocks and functions. The software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard disks or floppy disks, and optical media such as DVDs and their data variants, or CDs. The physical medium is a non-transitory medium.
Those skilled in the art should understand that the features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of those features are described; as long as a combination contains no contradiction, it should be considered within the scope of this disclosure.
The above examples merely illustrate several embodiments of the present application; their description is specific and detailed, but it should not therefore be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within its scope of protection. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (20)

1. A multi-target detection method, characterized by comprising the following steps:
acquiring an image to be detected;
inputting the image to be detected into a multi-target detection model to obtain a detection result;
the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each backbone layer comprises a CSP module in which the convolutional layers of the residual units are replaced by split attention modules; the CBM module is used for obtaining a first feature map from the image to be detected, and the CSP module is used for splitting the first feature map and then performing multi-scale feature extraction and channel attention weighting to obtain a second feature map carrying multi-scale feature information and channel attention weights; the neck module is used for performing feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head is used for performing multi-target detection according to the aggregated feature map.
2. The multi-target detection method of claim 1, wherein the split attention module comprises an SPC module, an SE weighting module and a feature re-calibration module connected in sequence;
the SPC module is used for evenly splitting the first feature map, performing feature extraction at different scales on the splits, and fusing the extracted features to obtain a first multi-scale feature map carrying multi-scale feature information;
the SE weighting module is used for extracting attention vectors of the different-scale channels from the first multi-scale feature map;
the feature re-calibration module is used for performing feature calibration again on the attention vectors of the different-scale channels to obtain new attention weights after multi-scale channel interaction;
and the first multi-scale feature map is weighted by the attention weights to obtain the second feature map with multi-scale feature information and attention weights (an illustrative sketch of this pipeline follows the claims).
3. The multi-target detection method of claim 2, wherein the SPC module comprises a feature splitting module and a feature fusion module; the feature splitting module is used for dividing the first feature map into a plurality of first feature sub-maps by channel count and performing involution at different scales on the first feature sub-maps to obtain corresponding first involution feature sub-maps, and the feature fusion module is used for fusing the plurality of first involution feature sub-maps into the first multi-scale feature map carrying multi-scale feature information.
4. The multi-target detection method of claim 1, wherein the backbone network includes a first backbone layer, a second backbone layer, and a third backbone layer, wherein the first backbone layer outputs a large-size second feature map, the second backbone layer outputs a medium-size second feature map, and the third backbone layer outputs a small-size second feature map.
5. The multi-target detection method of claim 1, wherein the prediction head includes a first detection layer, a second detection layer and a third detection layer, wherein the first detection layer outputs prediction boxes for large targets, the second detection layer outputs prediction boxes for medium targets, and the third detection layer outputs prediction boxes for small targets, and the prediction box with the smallest label loss is selected by non-maximum suppression (sketched after the claims).
6. The multi-target detection method of claim 5, wherein the neck module comprises a CBL module and an SPP module, the CBL module being used for extracting features and the SPP module for performing multi-scale fusion on the extracted features; the neck module adopts an FPN layer combined with a PAN layer to aggregate features from the different backbone layers for the different detection layers: the FPN layer conveys strong semantic features top-down, and the PAN layer conveys strong localization features bottom-up (sketched after the claims).
7. The multi-target detection method of claim 6, wherein each detection layer of the prediction head comprises a classification module and a regression module, and the regression module adopts CIOU_Loss regression (sketched after the claims) to perform target detection on the image to be detected according to the strong semantic features and the strong localization features, obtaining the prediction boxes of that detection layer.
8. The multi-target detection method of claim 3, wherein the hyperparameters of the involution are a 7 × 7 kernel, 16 channels per group, and a channel compression ratio of 4 in the convolution that generates the involution kernel.
9. The multi-target detection method of claim 2, wherein the feature re-calibration module performs feature calibration again on the attention vectors of the different-scale channels using a Softmax function, obtaining the new attention weights after multi-scale channel interaction.
10. A city management detection method, characterized by comprising the following steps:
acquiring a first image to be processed;
inputting the first image to be processed into a multi-target detection model to obtain a first detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each backbone layer comprises a CSP module in which the convolutional layers of the residual units are replaced by split attention modules; the CBM module is used for obtaining a first feature map from the image to be detected, and the CSP module is used for splitting the first feature map and then performing multi-scale feature extraction and channel attention weighting to obtain a second feature map carrying multi-scale feature information and channel attention weights; the neck module is used for performing feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head is used for performing multi-target detection according to the aggregated feature map;
if the first detection result comprises one or more events to be processed, filing a case for each event to be processed according to the source of the first image to be processed, each case comprising case information corresponding to its event to be processed;
selecting a corresponding processing method according to the case information and dispatching a task to be executed;
and re-acquiring a second image to be processed at the same location as the first image to be processed according to the feedback result of the task to be executed, and rechecking the processing result of the case according to the second image to be processed.
11. The city management detection method of claim 10, wherein acquiring the first image to be processed comprises: acquiring the first image to be processed from a surveillance video or from reported event information.
12. The city management detection method according to claim 10, wherein the case information includes a case time, an event location, and an event category corresponding to the event to be processed.
13. The city management detection method of claim 12, wherein selecting a corresponding processing method and dispatching a task to be executed according to the case information comprises: querying the corresponding handling department according to the event category, dispatching the task to be executed to executing personnel of that department who are close to the event site, and setting a deadline for the task to be executed according to the planned time.
14. The city management detection method of claim 10, wherein re-acquiring a second image to be processed at the same location as the first image to be processed according to the feedback result of the task to be executed and rechecking the processing result of the event to be processed comprises: after the executing personnel finish processing the event to be processed, uploading an image of the processed scene as the feedback result of the task to be executed; after the feedback result is obtained, re-acquiring a second image to be processed at the same location as the first image to be processed; inputting the second image to be processed into the multi-target detection model to obtain a second detection result; and rechecking the event to be processed according to the second detection result.
15. The city management detection method of claim 14, wherein rechecking the event to be processed according to the second detection result comprises: if the second detection result does not include the event to be processed, closing the case; and if the second detection result still includes the event to be processed, regenerating the task to be executed (this recheck loop is sketched after the claims).
16. A multi-target detection apparatus, characterized by comprising the following modules:
the first acquisition module is used for acquiring an image to be detected;
the first detection module is used for inputting the image to be detected into the multi-target detection model to obtain a detection result; the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each backbone layer comprises a CSP module in which the convolutional layers of the residual units are replaced by split attention modules; the CBM module is used for obtaining a first feature map from the image to be detected, and the CSP module is used for splitting the first feature map and then performing multi-scale feature extraction and channel attention weighting to obtain a second feature map carrying multi-scale feature information and channel attention weights; the neck module is used for performing feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head is used for performing multi-target detection according to the aggregated feature map.
17. A city management device, characterized by comprising the following modules:
a second acquisition module, used for acquiring a first image to be processed;
a second detection module, used for inputting the first image to be processed into a multi-target detection model to obtain a first detection result, wherein the multi-target detection model comprises a backbone network, a neck module and a prediction head connected in sequence; the backbone network comprises a CBM module and a plurality of backbone layers, each backbone layer comprises a CSP module in which the convolutional layers of the residual units are replaced by split attention modules; the CBM module is used for obtaining a first feature map from the image to be detected, and the CSP module is used for splitting the first feature map and then performing multi-scale feature extraction and channel attention weighting to obtain a second feature map carrying multi-scale feature information and channel attention weights; the neck module is used for performing feature aggregation on the second feature maps output by the different backbone layers to obtain an aggregated feature map; and the prediction head is used for performing multi-target detection according to the aggregated feature map;
a case filing module, used for, if the first detection result comprises one or more events to be processed, filing a case for each event to be processed according to the source of the first image to be processed, each case comprising case information;
a task dispatching module, used for selecting a corresponding processing method according to the case information and dispatching a task to be executed;
and a case review module, used for re-acquiring a second image to be processed at the same location as the first image to be processed according to the feedback result of the task to be executed, and rechecking the processing result of the case according to the second image to be processed.
18. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the multi-target detection method of any one of claims 1 to 9 or the city management detection method of any one of claims 10 to 15.
19. A computer program product, characterized in that it comprises software code portions for performing the multi-target detection method according to any one of claims 1 to 9 or the city management detection method according to any one of claims 10 to 15 when the computer program product is run on a computer.
20. A readable storage medium, characterized in that a computer program is stored therein, the computer program comprising program code which, when executed, controls a process to perform the multi-target detection method according to any one of claims 1 to 9 or the city management detection method according to any one of claims 10 to 15.
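
Illustrative code sketches (5)

The sketches below are editorial illustrations only and form no part of the claims. They assume PyTorch as the implementation framework; all module and function names are hypothetical. This first sketch shows the split attention pipeline of claims 2 and 9: the input feature map is split evenly by channel, each split is convolved at its own scale, per-split SE attention vectors are re-calibrated with a Softmax across scales, and the weighted splits are concatenated. For brevity, standard grouped convolutions stand in for the involution operator of claims 3 and 8, and the channel count is assumed divisible by the number of splits.

import torch
import torch.nn as nn

class SEWeight(nn.Module):
    # Per-split channel attention vector (the SE weighting module of claim 2).
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(self.pool(x))  # shape (B, C, 1, 1)

class SplitAttention(nn.Module):
    # SPC split, multi-scale extraction, Softmax re-calibration, weighting.
    def __init__(self, channels, kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        assert channels % len(kernels) == 0
        self.splits = len(kernels)
        c = channels // self.splits
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=g)
            for k, g in zip(kernels, groups)
        )
        self.se = nn.ModuleList(SEWeight(c) for _ in kernels)

    def forward(self, x):
        feats = torch.chunk(x, self.splits, dim=1)               # even channel split
        feats = [conv(f) for conv, f in zip(self.convs, feats)]  # multi-scale extraction
        attn = torch.stack([se(f) for se, f in zip(self.se, feats)], dim=1)
        attn = torch.softmax(attn, dim=1)                        # cross-scale re-calibration (claim 9)
        out = [f * a for f, a in zip(feats, attn.unbind(dim=1))] # attention weighting
        return torch.cat(out, dim=1)                             # second feature map

For example, SplitAttention(256) maps a tensor of shape (1, 256, 40, 40) to the same shape, so it can stand in for the convolutional layer of a residual unit as claim 1 requires.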
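
Next, a sketch of the greedy non-maximum suppression referenced in claim 5. Scores are assumed here to be confidence values; ranking by the claimed label loss would only change the sort key.

import torch

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns kept indices.
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        x1 = torch.max(boxes[i, 0], boxes[rest, 0])
        y1 = torch.max(boxes[i, 1], boxes[rest, 1])
        x2 = torch.min(boxes[i, 2], boxes[rest, 2])
        y2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # suppress boxes overlapping the kept one
    return keep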
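
A sketch of the neck aggregation of claim 6: a top-down FPN pass conveys strong semantic features, then a bottom-up PAN pass conveys strong localization features. The channel counts and the simple add-based fusion are assumptions of this example; the CBL and SPP sub-modules are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNPAN(nn.Module):
    def __init__(self, channels=(256, 512, 1024), out=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out, 1) for c in channels)
        self.down = nn.ModuleList(
            nn.Conv2d(out, out, 3, stride=2, padding=1) for _ in channels[:-1]
        )

    def forward(self, feats):
        # feats: second feature maps ordered large (shallow) -> small (deep)
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(lat) - 2, -1, -1):   # FPN: top-down semantics
            lat[i] = lat[i] + F.interpolate(lat[i + 1], scale_factor=2, mode="nearest")
        for i in range(1, len(lat)):            # PAN: bottom-up localization
            lat[i] = lat[i] + self.down[i - 1](lat[i - 1])
        return lat                              # aggregated maps for the three detection layers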
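
A sketch of the CIOU_Loss regression of claim 7: 1 - IoU, plus a normalized center-distance term, plus an aspect-ratio consistency term. The (x1, y1, x2, y2) box format is an assumption of this example.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # Overlap (IoU term)
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared center distance over squared diagonal of the enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # Aspect-ratio consistency
    v = (4 / math.pi ** 2) * (
        torch.atan((target[..., 2] - target[..., 0]) / (target[..., 3] - target[..., 1] + eps))
        - torch.atan((pred[..., 2] - pred[..., 0]) / (pred[..., 3] - pred[..., 1] + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v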
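
Finally, a sketch of the case filing and recheck loop of claims 10, 14 and 15. Case, detect, acquire_image_at and dispatch are hypothetical placeholders for components the claims leave unspecified.

from dataclasses import dataclass

@dataclass
class Case:
    category: str   # event category (claim 12)
    location: str   # event location (claim 12)
    source: str     # source of the first image to be processed
    closed: bool = False

def manage(first_image, detect, acquire_image_at, dispatch):
    # File a case per detected event, dispatch it, then recheck (claims 10, 14, 15).
    for event in detect(first_image):
        case = Case(category=event.category, location=event.location,
                    source=first_image.source)
        dispatch(case)                                       # task to be executed
        while not case.closed:
            second_image = acquire_image_at(case.location)   # same location, re-acquired
            still_present = any(e.category == case.category
                                for e in detect(second_image))
            if still_present:
                dispatch(case)                               # claim 15: regenerate the task
            else:
                case.closed = True                           # claim 15: close the case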
CN202111559680.7A 2021-12-20 2021-12-20 Multi-target detection method, device and application thereof Pending CN114220076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111559680.7A CN114220076A (en) 2021-12-20 2021-12-20 Multi-target detection method, device and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111559680.7A CN114220076A (en) 2021-12-20 2021-12-20 Multi-target detection method, device and application thereof

Publications (1)

Publication Number Publication Date
CN114220076A (en) 2022-03-22

Family

ID=80704149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111559680.7A Pending CN114220076A (en) 2021-12-20 2021-12-20 Multi-target detection method, device and application thereof

Country Status (1)

Country Link
CN (1) CN114220076A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863368A (en) * 2022-07-05 2022-08-05 城云科技(中国)有限公司 Multi-scale target detection model and method for road damage detection
CN115546472A (en) * 2022-11-29 2022-12-30 城云科技(中国)有限公司 Method and device for recognizing weight of road vehicle and application
CN115546472B (en) * 2022-11-29 2023-02-17 城云科技(中国)有限公司 Method and device for recognizing weight of road vehicle and application
CN117576569A (en) * 2024-01-12 2024-02-20 城云科技(中国)有限公司 Multi-target detection model and method for urban capacity event management
CN117576569B (en) * 2024-01-12 2024-04-02 城云科技(中国)有限公司 Multi-target detection model and method for urban capacity event management

Similar Documents

Publication Publication Date Title
CN113420729B (en) Multi-scale target detection method, model, electronic equipment and application thereof
Chen et al. Attention-based context aggregation network for monocular depth estimation
AU2016332947B2 (en) Semi-automatic labelling of datasets
US11144889B2 (en) Automatic assessment of damage and repair costs in vehicles
WO2023138300A1 (en) Target detection method, and moving-target tracking method using same
US10318848B2 (en) Methods for object localization and image classification
CN114220076A (en) Multi-target detection method, device and application thereof
US10679330B2 (en) Systems and methods for automated inferencing of changes in spatio-temporal images
Suprem et al. Odin: Automated drift detection and recovery in video analytics
US20170213080A1 (en) Methods and systems for automatically and accurately detecting human bodies in videos and/or images
CN112990211B (en) Training method, image processing method and device for neural network
JP2015167017A (en) Self-learning object detectors for unlabeled videos using multi-task learning
CN114529873A (en) Target detection method and city violation event monitoring method applying same
CN113298080B (en) Target detection enhancement model, target detection method, target detection device and electronic device
WO2009152509A1 (en) Method and system for crowd segmentation
Sharma et al. Vehicle identification using modified region based convolution network for intelligent transportation system
CN114169381A (en) Image annotation method and device, terminal equipment and storage medium
US20170053172A1 (en) Image processing apparatus, and image processing method
CN111310531B (en) Image classification method, device, computer equipment and storage medium
Sinha et al. An improved deep learning approach for product recognition on racks in retail stores
CN113793326A (en) Disease identification method and device based on image
CN115546901B (en) Target detection model and method for pet normative behavior detection
CN116486071A (en) Image blocking feature extraction method, device and storage medium
CN115984671A (en) Model online updating method and device, electronic equipment and readable storage medium
CN111091022A (en) Machine vision efficiency evaluation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination