CN113989753A - Multi-target detection processing method and device

Info

Publication number
CN113989753A
Application number
CN202010659215.XA
Authority
CN (China)
Prior art keywords
target, confidence, frame, frames, output
Legal status
Pending
Other languages
Chinese (zh)
Inventor
Li Song (李松)
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202010659215.XA
Publication of CN113989753A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-target detection processing method and device. The method includes: extracting feature information of a target image; performing spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and performing multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes; and determining a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes. This solves the problem in the related art that a Darknet53 network yields insufficiently representative features when extracting features of large vehicles, and improves the detection network's recall rate and precision on large-vehicle targets.

Description

Multi-target detection processing method and device
Technical Field
The invention relates to the field of image processing, in particular to a multi-target detection processing method and device.
Background
With the rapid growth of urban roads, highways, tunnels, and vehicles, the volume of images generated by cameras on traffic roads has grown beyond what manual processing can handle. To address this problem, various target detection algorithms have emerged and rapidly come into wide use. Current target detection algorithms fall mainly into two categories: those based on traditional hand-crafted features and those based on deep learning.
Traditional detection algorithms based on hand-crafted features, such as DPM (Deformable Parts Model), follow the same idea as the HOG algorithm and can be regarded as an extension of HOG. However, hand-crafted features cannot sufficiently capture the target features in an image, so the resulting feature representation vectors have inherent expressive deficiencies.
Deep-learning-based target detection started later and has since produced two-step detection algorithms such as Faster-RCNN and Mask-RCNN, and single-step detection algorithms such as SSD and YOLOv3. However, the two-step detection strategy makes the network complex and inference slow, so it cannot meet the real-time requirements of practical applications; single-step algorithms improve detection speed but lose accuracy. For example, the SSD algorithm achieves good precision on large targets, but its effect on small targets is far from ideal. YOLOv3, currently a widely applied detection algorithm, introduces residual connections and increases network depth; although this improves detection performance to a certain extent, it also increases network complexity. When YOLOv3 is applied to multi-target detection equipment in real traffic scenes, it cannot achieve the desired speed and precision, leading to missed and false detections of targets such as large vehicles, cars, motorcycles, and pedestrians.
A vehicle detection method in the related art adopts a YOLOv3-based detection algorithm in which the features extracted by the backbone network Darknet-53 undergo feature splicing, residual mapping, and feature fusion before being passed to the YOLO layer as the final picture features. This adopts the deeper Darknet53 network structure and adds feature splicing, residual mapping, and feature fusion modules to the network, which slows network inference and cannot meet the low-latency requirements of practical applications; meanwhile, the Darknet53 network still yields insufficiently representative features when extracting features of large vehicles (trucks, buses, etc.).
No solution has been proposed to the problem in the related art that features extracted for large vehicles (trucks, buses, etc.) by the Darknet53 network are insufficiently representative.
Disclosure of Invention
The embodiments of the invention provide a multi-target detection processing method and device to at least solve the problem in the related art that features extracted for large vehicles (trucks, buses, etc.) by the Darknet53 network are insufficiently representative.
According to an embodiment of the present invention, there is provided a multi-target detection processing method including:
extracting feature information of a target image;
performing spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and performing multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes;
and determining a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes.
Optionally, determining the target detection frame of the target object in the target image according to the output target feature information of the multiple sizes includes:
determining prediction frames of the target object in the target image and the confidence of each prediction frame according to the output target feature information of the multiple sizes;
and determining the target detection frame of the target object in the target image according to the intersection-over-union (IoU) ratios and the confidences of the prediction frames.
According to another embodiment of the present invention, there is also provided a multi-target detection processing apparatus including:
an extraction module, configured to extract feature information of a target image;
a multi-scale target feature extraction module, configured to perform spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and to perform multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes;
and a first determining module, configured to determine a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes.
According to a further embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
With the method, feature information of a target image is extracted; spatial feature pooling is performed on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and multi-scale processing is performed on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes; and a target detection frame of the target object in the target image is determined according to the output target feature information of the multiple sizes. This solves the problem in the related art that features extracted for large vehicles (trucks, buses, etc.) by the Darknet53 network are insufficiently representative, and improves the detection network's recall rate and precision on large-vehicle targets. Moreover, extracting the feature information of the target image through a lightweight DenseNet network model increases the network's processing speed and meets the timeliness requirements of detection on mobile terminals.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of a mobile terminal of a multi-target detection processing method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a multi-target detection processing method according to an embodiment of the invention;
FIG. 3 is a flow chart of a method of multi-object detection in a traffic scene in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a multi-target detection processing apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided in the first embodiment of the present application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, Fig. 1 is a hardware block diagram of a mobile terminal running the multi-target detection processing method according to an embodiment of the invention. As shown in Fig. 1, the mobile terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in Fig. 1 is only an illustration and does not limit the structure of the mobile terminal; for example, the mobile terminal may include more or fewer components than shown in Fig. 1, or have a different configuration.
The memory 104 may be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the multi-target detection processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a multi-target detection processing method operating in the mobile terminal or the network architecture is provided, and fig. 2 is a flowchart of the multi-target detection processing method according to the embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
Step S202, extracting feature information of a target image;
Step S204, performing spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and performing multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes;
in this embodiment of the present invention, in the step S204, performing spatial feature pooling processing on the feature information according to the spatial pyramid pooling SPP, and obtaining the effective target feature specifically may include: inputting the feature information into a spatial pyramid pooling model, and performing spatial feature pooling to obtain the effective target feature, wherein the spatial pyramid pooling model is formed by parallel windows with a size of 5 × 5,9 × 9, and 13 × 13.
Step S206, determining a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes.
Further, step S206 may specifically include: determining prediction frames of the target object in the target image and the confidence of each prediction frame according to the output target feature information of the multiple sizes; and determining the target detection frame of the target object in the target image according to the IoU ratios and the confidences of the prediction frames.
Through steps S202 to S206, the problem in the related art that features extracted for large vehicles (trucks, buses, etc.) by the Darknet53 network are insufficiently representative can be solved, improving the detection network's recall rate and precision on large-vehicle targets; extracting the feature information of the target image through a lightweight DenseNet network model increases the network's processing speed and meets the timeliness requirements of detection on mobile terminals.
In an embodiment of the present invention, the step S202 may specifically include:
inputting the target image into a pre-trained target detection lightweight DenseNet network model to obtain the feature information output by the target detection lightweight DenseNet network model, wherein the target detection lightweight DenseNet network model is a network formed by cascading a convolutional layer, a maximum pooling layer, 4 cascaded Dense Blocks followed by a first Transition Layer, 8 Dense Blocks followed by a second Transition Layer, 10 Dense Blocks followed by a third Transition Layer, and a final 8 Dense Blocks; the input features of each Dense Block are concatenated with the features it computes to form its output features, each Dense Block consists of two convolutional layers, and each Transition Layer consists of a convolutional layer and a maximum pooling layer.
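A minimal PyTorch sketch of this backbone structure follows, as a hedged illustration rather than the exact patented network: the growth rate k = 32 comes from the embodiment below, while the stem width, the 4k-channel 1x1 bottleneck inside each Dense Block, the channel halving in each Transition Layer, and the BatchNorm/ReLU placement are assumptions the text does not fix. The taps returned by forward() mirror the feature layers that step S304 below says the FPN consumes (the convolution outputs of the 1st and 3rd Transition Layers, plus the deepest features):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense Block: a 1x1 and a 3x3 convolution producing k new channels,
    concatenated with the block's input (dense connectivity)."""
    def __init__(self, in_ch, k=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 4 * k, kernel_size=1, bias=False),  # 4k bottleneck: assumption
            nn.BatchNorm2d(4 * k),
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(k),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)  # input features reused in the output

class TransitionLayer(nn.Module):
    """Transition Layer: 1x1 convolution followed by 2x2 max pooling, stride 2."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.conv(x))

class LightweightDenseNet(nn.Module):
    """Stem (7x7 conv stride 2 + 3x3 max pool stride 2), then stages of
    4/8/10/8 Dense Blocks, with a Transition Layer after the first three."""
    def __init__(self, k=32, stem_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, stem_ch, kernel_size=7, stride=2, padding=3, bias=False),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.stage1, ch = self._stage(stem_ch, 4, k)
        self.trans1 = TransitionLayer(ch, ch // 2)    # channel halving: assumption
        self.stage2, ch = self._stage(ch // 2, 8, k)
        self.trans2 = TransitionLayer(ch, ch // 2)
        self.stage3, ch = self._stage(ch // 2, 10, k)
        self.trans3 = TransitionLayer(ch, ch // 2)
        self.stage4, ch = self._stage(ch // 2, 8, k)  # no Transition Layer after the last stage
        self.out_channels = ch

    @staticmethod
    def _stage(in_ch, n_blocks, k):
        blocks, ch = [], in_ch
        for _ in range(n_blocks):
            blocks.append(DenseBlock(ch, k))
            ch += k                                   # each block adds k channels
        return nn.Sequential(*blocks), ch

    def forward(self, x):
        x = self.stem(x)                              # 4x downsampled
        x = self.stage1(x)
        t1 = self.trans1.conv(x)                      # shallow tap: Transition Layer 1 conv output
        x = self.trans2(self.stage2(self.trans1.pool(t1)))
        t3 = self.trans3.conv(self.stage3(x))         # mid tap: Transition Layer 3 conv output
        deep = self.stage4(self.trans3.pool(t3))
        return t1, t3, deep                           # 4x, 16x, 32x downsampled feature maps
```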
Correspondingly, performing multi-scale processing on the effective target features through the feature pyramid network (FPN) in step S204 to obtain output target feature information of multiple sizes may specifically include:
taking the target features output by the target lightweight DenseNet model as the first output target feature information, for large-size targets in the target image;
upsampling the first output target feature information by a factor of 2 and concatenating it channel-wise with the second feature output by the target lightweight DenseNet model, as the second output target feature information, for medium-size targets in the target image;
upsampling the second output target feature information by a factor of 2, passing it through a convolutional layer, upsampling the convolutional layer's output by a factor of 2 again, and concatenating the result channel-wise with the first feature, as the third output target feature information, for small-size targets in the target image;
where the output target feature information of the multiple sizes includes the first output target feature information, the second output target feature information, and the third output target feature information.
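A sketch of how the three scales above can be wired together, assuming the backbone taps from the previous sketch (shallow, mid, deep, with deep taken after the SPP module); the FPNNeck name, the inner channel width, and nearest-neighbor upsampling are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNNeck(nn.Module):
    """Builds the three output scales described above: the deep features
    serve large targets; one 2x-upsampled fusion serves medium targets;
    a conv plus two more 2x upsamplings fused with the shallow tap
    serves small targets."""
    def __init__(self, deep_ch, mid_ch, inner_ch=128):
        super().__init__()
        # Convolution layer applied between the two 2x upsampling steps.
        self.mid_conv = nn.Conv2d(deep_ch + mid_ch, inner_ch, kernel_size=1)

    def forward(self, shallow, mid, deep):
        p_large = deep                                            # 1st output (large targets)
        up = F.interpolate(deep, scale_factor=2, mode='nearest')
        p_medium = torch.cat([up, mid], dim=1)                    # 2nd output (medium targets)
        x = F.interpolate(p_medium, scale_factor=2, mode='nearest')
        x = self.mid_conv(x)
        x = F.interpolate(x, scale_factor=2, mode='nearest')
        p_small = torch.cat([x, shallow], dim=1)                  # 3rd output (small targets)
        return p_large, p_medium, p_small
```

Each of the three returned maps would then feed its own recognition and frame regression convolution head, as described below.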
In an embodiment of the present invention, determining the prediction frames and their confidences in step S206 may specifically include:
inputting the first output target feature information into a pre-trained target recognition and frame regression convolution model to obtain the horizontal and vertical offsets of the center points of large-size targets in the target image relative to the center points of the 3 large-size cluster frames among the predetermined anchor frames, together with the width and height scaling ratios of the large-size targets relative to the 3 large-size cluster frames; determining the centers of the prediction frames of the large-size targets in the target image from the center-point coordinates, the horizontal offsets, and the vertical offsets; determining the 3 prediction frames corresponding to each large-size target from the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 large-size cluster frames; and obtaining the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, together with the confidences with which the 3 prediction frames correspond to those target types;
inputting the second output target feature information into the pre-trained target recognition and frame regression convolution model to obtain the horizontal and vertical offsets of the center points of medium-size targets in the target image relative to the center points of the 3 medium-size cluster frames among the predetermined anchor frames, together with the width and height scaling ratios of the medium-size targets relative to the 3 medium-size cluster frames; determining the centers of the prediction frames of the medium-size targets in the target image from the center-point coordinates, the horizontal offsets, and the vertical offsets; determining the 3 prediction frames corresponding to each medium-size target from the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 medium-size cluster frames; and obtaining the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, together with the confidences with which the 3 prediction frames correspond to those target types;
inputting the third output target feature information into the pre-trained target recognition and frame regression convolution model to obtain the horizontal and vertical offsets of the center points of small-size targets in the target image relative to the center points of the 3 small-size cluster frames among the predetermined anchor frames, together with the width and height scaling ratios of the small-size targets relative to the 3 small-size cluster frames; determining the centers of the prediction frames of the small-size targets in the target image from the center-point coordinates, the horizontal offsets, and the vertical offsets; determining the 3 prediction frames corresponding to each small-size target from the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 small-size cluster frames; and obtaining the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, together with the confidences with which the 3 prediction frames correspond to those target types;
wherein the target objects include the large-size targets, the medium-size targets, and the small-size targets.
In an embodiment of the present invention, determining the target detection frame according to the IoU ratios and the confidences of the prediction frames may specifically include:
sorting, from high confidence to low, the prediction frames of the same target type among the 3 prediction frames corresponding to the large-size targets, the 3 prediction frames corresponding to the medium-size targets, and the 3 prediction frames corresponding to the small-size targets, to obtain data sets of prediction-frame confidences for the multiple target types;
retaining, among two or more prediction frames whose overlap area exceeds a preset area in the data sets of prediction-frame confidences for the multiple target types, the prediction frame with the maximum confidence, to obtain a target data set. Further, the following steps are performed on each of those data sets to obtain the target data set, where the data set being processed is called the current data set and contains the confidences of N prediction frames: computing the IoU ratio between the 1st prediction frame in the current data set and each first target prediction frame, where the first target prediction frames are all prediction frames in the current data set other than the 1st; if an IoU ratio is greater than or equal to a first preset threshold, attenuating the confidence of that first target prediction frame, and if it is smaller than the first preset threshold, keeping that confidence unchanged; re-sorting the confidences of the prediction frames in the current data set from largest to smallest; then executing the following steps in a loop, starting from i = 2, until the IoU ratio between the (N-1)-th and N-th prediction frames in the current data set is determined to be less than the first preset threshold: computing the IoU ratio between the i-th prediction frame in the re-sorted current data set and each i-th target prediction frame, where the i-th target prediction frames are all prediction frames in the current data set other than the 1st through (i-1)-th; if an IoU ratio is greater than or equal to the first preset threshold, attenuating the confidence of that i-th target prediction frame, and if it is smaller, keeping that confidence unchanged; removing from the current data set the prediction frames whose confidence is smaller than a second preset threshold; and setting i = i + 1 while i < N;
determining the target detection frames of the target object in the target image according to the target data set. Further, if the number of target detection frames of the target object in the target data set is greater than a third preset threshold, the target detection frames corresponding to the top third-preset-threshold confidences in the target data set are output; and if the number is less than or equal to the third preset threshold, the target detection frames corresponding to all confidences in the target data set are output.
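A minimal NumPy sketch of this confidence-decay procedure (applied per target type) follows. The patent defers the exact decay factor d to its formula (1); the sketch assumes the common linear decay 1 - IoU, and the (x1, y1, x2, y2) box layout and threshold defaults are likewise assumptions:

```python
import numpy as np

def iou(frame, frames):
    """IoU of one (x1, y1, x2, y2) frame with each row of an (N, 4) array."""
    x1 = np.maximum(frame[0], frames[:, 0])
    y1 = np.maximum(frame[1], frames[:, 1])
    x2 = np.minimum(frame[2], frames[:, 2])
    y2 = np.minimum(frame[3], frames[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (frame[2] - frame[0]) * (frame[3] - frame[1])
    areas = (frames[:, 2] - frames[:, 0]) * (frames[:, 3] - frames[:, 1])
    return inter / (area + areas - inter + 1e-9)

def soft_nms(frames, scores, iou_thresh=0.5, score_thresh=0.001):
    """Soft-NMS over one class: repeatedly keep the highest-confidence frame,
    attenuate (rather than discard) frames overlapping it beyond iou_thresh,
    drop frames whose confidence falls below score_thresh, and re-sort."""
    frames = frames.astype(float)
    scores = scores.astype(float)
    kept_frames, kept_scores = [], []
    while len(scores) > 0:
        top = scores.argmax()                   # highest-confidence frame
        kept_frames.append(frames[top])
        kept_scores.append(scores[top])
        frames = np.delete(frames, top, axis=0)
        scores = np.delete(scores, top)
        if len(scores) == 0:
            break
        overlaps = iou(kept_frames[-1], frames)
        # Linear decay for overlapping frames: an assumed choice of d.
        decay = np.where(overlaps >= iou_thresh, 1.0 - overlaps, 1.0)
        scores = scores * decay
        keep = scores >= score_thresh           # second preset threshold
        frames, scores = frames[keep], scores[keep]
    return np.array(kept_frames), np.array(kept_scores)
```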
In the embodiment of the invention, before the prediction frames and their confidences are determined from the output target feature information of the multiple sizes, the frames of target objects in a preset number of images are collected to form a data set; the frames in the data set are clustered with the K-means++ clustering algorithm to obtain 9 cluster frames; the 9 cluster frames are sorted by frame size; and the first 3 are taken as the 3 large-size cluster frames, the middle 3 as the 3 medium-size cluster frames, and the last 3 as the 3 small-size cluster frames.
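A NumPy sketch of this anchor clustering follows, assuming frames are reduced to (width, height) pairs anchored at a common origin; the k-means++-style seeding and the fixed iteration count are illustrative simplifications:

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between (N, 2) width-height frames and (K, 2) cluster centers,
    treating all frames as if anchored at the same origin."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def cluster_anchors(wh, k=9, iters=100, seed=0):
    """K-means with distance d = 1 - IOU and k-means++-style (d^2-weighted) seeding."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=1)]
    while len(centers) < k:                           # seed the remaining centers
        d = (1.0 - wh_iou(wh, centers)).min(axis=1)
        probs = d ** 2 / (d ** 2).sum()
        centers = np.vstack([centers, wh[rng.choice(len(wh), p=probs)]])
    for _ in range(iters):                            # standard Lloyd iterations
        assign = (1.0 - wh_iou(wh, centers)).argmin(axis=1)
        centers = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    # Sorted by area: first 3 = large-size, middle 3 = medium-size,
    # last 3 = small-size cluster frames.
    return centers[np.argsort(-(centers[:, 0] * centers[:, 1]))]
```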
Fig. 3 is a flowchart of a method for detecting multiple targets in a traffic scene according to an embodiment of the present invention, as shown in fig. 3, including:
and S301, marking the acquired data according to an actual data marking strategy, specifically, classifying the targets into 7 types, namely passenger cars, trucks, cars, motorcycles, bicycles, tricycles and pedestrians, when marking the data set according to analysis of complex traffic scenes. When the vehicle or the pedestrian is shielded by plants beside the road by more than 2/3 or is difficult to be identified by naked eyes, the mark is ignored, and noise is prevented from being introduced. In order to avoid the situations that data labeling is fuzzy and data labeling is carried out on a rider and a pedestrian on a road, and the rider and a vehicle are labeled as corresponding vehicle types together.
S302, determining the structure of the lightweight DenseNet network. In view of DenseNet's outstanding performance in feature representation, its design ideas are adopted and refined to meet the real-time requirement; the network's input image size is 608 x 320. The lightweight DenseNet designed in this proposal is structured as follows. First, the network retains DenseNet's initial two-layer structure, a convolutional layer with stride 2 and kernel size 7 followed by a maximum pooling layer with kernel size 3 and stride 2, each performing 2x downsampling. Then 4 Dense Blocks are cascaded, each consisting of convolutional layers with kernel sizes 1 and 3 respectively, and a Transition Layer is retained after them, consisting of a convolutional layer with kernel size 1 and a maximum pooling layer with kernel size 2 and stride 2. The network then cascades 3 more groups of Dense Blocks plus Transition Layer, with 8, 10, and 8 Dense Blocks per group respectively, and the Transition Layer after the last Dense Block is removed. The growth rate k of the lightweight DenseNet in this embodiment of the invention is set to 32.
When training the network, data augmentation with color transformations and shape transformations is adopted: the color transformations add Gaussian noise to the input picture and vary its saturation and hue, and the shape transformations add motion blur and blur to the input picture; multi-scale training is used to handle input pictures of different sizes. The maximum number of iterations during training is set to 500000, 64 pictures are input each time, and the learning rate is set to 5e-4. Training is divided into a forward stage and a reverse stage: the forward stage outputs the predicted frame offset values and target category confidences for each input picture, and the reverse stage uses the forward outputs to calculate the error between the predictions and the target frames and categories of the input picture data. The error comprises three parts: frame position, category confidence, and whether the position contains a target. The category confidence and objectness errors use a sigmoid cross-entropy loss function, and the frame position error uses mean square error (MSE) loss; the three values are summed, and parameters such as the convolution kernels are updated by the chain rule of differentiation.
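The three-part error described for the reverse stage can be sketched as follows; tensor shapes, masking of non-object cells, and the equal weighting of the three terms are assumptions the text leaves open (BCEWithLogitsLoss is PyTorch's sigmoid cross-entropy):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # sigmoid cross-entropy (class and objectness terms)
mse = nn.MSELoss()             # mean square error (frame position term)

def detection_loss(pred_box, true_box, pred_cls, true_cls, pred_obj, true_obj):
    """Sum of the three error parts named in the text: frame position,
    category confidence, and whether the position contains a target."""
    loss_box = mse(pred_box, true_box)
    loss_cls = bce(pred_cls, true_cls)
    loss_obj = bce(pred_obj, true_obj)
    # The summed scalar is backpropagated; parameters update by the chain rule.
    return loss_box + loss_cls + loss_obj
```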
S303, spatial pyramid pooling. The obtained features are spatially pooled by a spatial pyramid pooling (SPP) module with window sizes of 5 x 5, 9 x 9, and 13 x 13 respectively, to extract effective target feature information.
S304, FPN + recognition and frame regression convolution module + YOLO layer. The features extracted in steps S302 and S303 pass through a feature pyramid network (FPN) module so that the feature information of small targets is used effectively; each feature generated by the feature pyramid module then passes through a recognition and frame regression convolution module and a candidate-target post-processing module to generate the final target frames, with the recognition and frame regression model trained together with the lightweight network. The proposed FPN module uses as its feature layers the output features of the SPP module and of the convolutional layers in the 1st and 3rd Transition Layers respectively; the features before the 2nd downsampling are used for small-target information because, given DenseNet's strong feature extraction capability, small-target feature information would otherwise be lost as network depth increases. In the specific implementation, the second feature layer is upsampled, passed through a convolutional layer, and channel-wise concatenated with the features before the second downsampling to enhance the features.
Since the anchor frames listed in YOLOv3 cannot be applied effectively to the targets in the traffic data set, the K-means++ clustering algorithm is used to cluster the target frames of the data set, with 9 cluster centers; the corresponding evaluation function is:
d(gt_box, c_box) = 1 - IOU(gt_box, c_box)
where gt_box and c_box denote the labeled frame and the cluster-center frame respectively, and the function IOU denotes the intersection-over-union ratio. When performing frame regression, each point on the feature map is regarded as a detection target center: for each center point, horizontal and vertical offsets are predicted, together with the corresponding width and height scaling ratios for the anchor size, i.e. t ∈ R^4, as shown in formula (2):
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)
where bx, by, bw, and bh denote the horizontal coordinate and vertical coordinate of the prediction frame's center point and the frame width and frame height on the corresponding feature map; pw and ph denote the width and height of the corresponding anchor frame; cx and cy denote the horizontal and vertical coordinates of the current point on the feature map; tx and ty denote the horizontal and vertical center-point offsets output by the model, and tw and th the width and height scaling values output by the model; and σ denotes the sigmoid function.
Each feature map has a corresponding downsampling factor, from which the complete frame in the original image is obtained. When predicting the category of each center point, since each input picture contains multiple targets, the sigmoid function is used for prediction, and the cross-entropy loss function is used for backpropagation during training.
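A small sketch of this decoding, formula (2) plus the downsampling-factor mapping back to original-image coordinates; function and argument names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_frame(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Formula (2): sigmoid offsets (tx, ty) shift the cell corner (cx, cy);
    (tw, th) exponentially scale the anchor prior (pw, ph). `stride` is the
    feature map's downsampling factor relative to the original image."""
    bx = (sigmoid(tx) + cx) * stride   # frame center x in original-image pixels
    by = (sigmoid(ty) + cy) * stride   # frame center y
    bw = pw * np.exp(tw)               # frame width (pw given in image pixels)
    bh = ph * np.exp(th)               # frame height
    return bx, by, bw, bh
```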
S305, post-processing with Soft-NMS. After the categories and frames of the targets are obtained, simply removing the overlapping parts between targets of the same class with the plain NMS algorithm would cause missed detections. Therefore the Soft-NMS algorithm is used to post-process the frames, with the following flow: 1) for each category, sort the input prediction frames in descending order of score; 2) compute the IoU ratios between target frames of the same category in turn, and if an IoU ratio is greater than a given threshold, attenuate the score (confidence) of the lower-scoring target by a factor d, calculated as shown in formula (1); 3) continue these steps until all frames have been processed.
The embodiment of the invention improves the performance of the detection algorithm on targets in complex traffic scenes, particularly its timeliness and its detection of large vehicles (trucks, buses, etc.). The method exploits the advantages of the DenseNet design in target feature extraction, uses the designed lightweight DenseNet network, tunes the number of Dense Blocks for the different sizes of detected targets, and applies a multi-step upsampling strategy to exploit shallower semantic information; meanwhile, the feature pyramid idea is combined to enhance feature information, so that the semantic information the network extracts for both small-size and larger-size vehicles is effectively used and fused. Soft-NMS is introduced to handle missed detections when vehicles of the same class overlap heavily, reducing the missed-detection rate. The speed of target detection is greatly improved, meeting the real-time requirements of practical application scenarios. The approach can be decomposed into two parts: the structure of the conventional deeper DenseNet network is simplified into a finely designed lightweight DenseNet; and since the channel counts of the convolutional layers in the DenseNet design are based on multiples of k, adopting the smaller k value of this proposal greatly reduces the number of parameters in the network while keeping comparable performance.
Example 2
According to another embodiment of the present invention, there is also provided a multi-target detection processing apparatus, and fig. 4 is a block diagram of the multi-target detection processing apparatus according to the embodiment of the present invention, as shown in fig. 4, including:
an extraction module 42, configured to extract feature information of a target image;
a multi-scale target feature extraction module 44, configured to perform spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and to perform multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes;
and a first determining module 46, configured to determine a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes.
Optionally, the first determining module 46 includes:
a first determining submodule, configured to determine prediction frames of the target object in the target image and the confidence of each prediction frame according to the output target feature information of the multiple sizes;
and a second determining submodule, configured to determine the target detection frame of the target object in the target image according to the IoU ratios and the confidences of the prediction frames.
Optionally, the multi-scale target feature extraction module 44 is further configured to input the feature information into a spatial pyramid pooling model and perform spatial feature pooling to obtain the effective target features, where the spatial pyramid pooling model consists of parallel pooling windows of sizes 5 x 5, 9 x 9, and 13 x 13.
Optionally, the extraction module 42 is further configured to input the target image into a pre-trained target detection lightweight DenseNet network model to obtain the feature information output by the target detection lightweight DenseNet network model, where the target detection lightweight DenseNet network model is a network formed by cascading a convolutional layer, a maximum pooling layer, 4 cascaded Dense Blocks followed by a first Transition Layer, 8 Dense Blocks followed by a second Transition Layer, 10 Dense Blocks followed by a third Transition Layer, and a final 8 Dense Blocks; the input features of each Dense Block are concatenated with the features it computes to form its output features, each Dense Block consists of two convolutional layers, and each Transition Layer consists of a convolutional layer and a maximum pooling layer.
Optionally, the multi-scale target feature extraction module 44 is further configured to:
take the target features output by the target lightweight DenseNet model as the first output target feature information, for large-size targets in the target image;
upsample the first output target feature information by a factor of 2 and concatenate it channel-wise with the second feature output by the target lightweight DenseNet model, as the second output target feature information, for medium-size targets in the target image;
and upsample the second output target feature information by a factor of 2, pass it through a convolutional layer, upsample the convolutional layer's output by a factor of 2 again, and concatenate the result channel-wise with the first feature, as the third output target feature information, for small-size targets in the target image;
where the output target feature information of the multiple sizes includes the first output target feature information, the second output target feature information, and the third output target feature information.
Optionally, the first determining sub-module includes:
a first input unit, configured to input the first output target feature information into a pre-trained target recognition and frame regression convolution model to obtain the horizontal and vertical offsets of the center points of large-size targets in the target image relative to the center points of the 3 large-size cluster frames among the predetermined anchor frames, together with the width and height scaling ratios of the large-size targets relative to the 3 large-size cluster frames; determine the centers of the prediction frames of the large-size targets in the target image from the center-point coordinates, the horizontal offsets, and the vertical offsets; determine the 3 prediction frames corresponding to each large-size target from the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 large-size cluster frames; and obtain the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, together with the confidences with which the 3 prediction frames correspond to those target types;
a second input unit, configured to input the second output target feature information into the pre-trained target recognition and frame regression convolution model to obtain the horizontal and vertical offsets of the center points of medium-size targets in the target image relative to the center points of the 3 medium-size cluster frames among the predetermined anchor frames, together with the width and height scaling ratios of the medium-size targets relative to the 3 medium-size cluster frames; determine the centers of the prediction frames of the medium-size targets in the target image from the center-point coordinates, the horizontal offsets, and the vertical offsets; determine the 3 prediction frames corresponding to each medium-size target from the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 medium-size cluster frames; and obtain the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, together with the confidences with which the 3 prediction frames correspond to those target types;
and a third input unit, configured to input the third output target feature information into the pre-trained target recognition and frame regression convolution model to obtain the horizontal and vertical offsets of the center points of small-size targets in the target image relative to the center points of the 3 small-size cluster frames among the predetermined anchor frames, together with the width and height scaling ratios of the small-size targets relative to the 3 small-size cluster frames; determine the centers of the prediction frames of the small-size targets in the target image from the center-point coordinates, the horizontal offsets, and the vertical offsets; determine the 3 prediction frames corresponding to each small-size target from the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 small-size cluster frames; and obtain the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, together with the confidences with which the 3 prediction frames correspond to those target types;
wherein the target objects include the large-size targets, the medium-size targets, and the small-size targets.
Optionally, the second determining sub-module includes:
a sorting unit, configured to sort, from high confidence to low, the prediction frames of the same target type among the 3 prediction frames corresponding to the large-size targets, the 3 prediction frames corresponding to the medium-size targets, and the 3 prediction frames corresponding to the small-size targets, to obtain data sets of prediction-frame confidences for the multiple target types;
a retention unit, configured to retain, among two or more prediction frames whose overlap area exceeds a preset area in the data sets of prediction-frame confidences for the multiple target types, the prediction frame with the maximum confidence, to obtain a target data set;
and a determining unit, configured to determine the target detection frames of the target object in the target image according to the target data set.
Optionally, the retention unit is further configured to perform the following steps on each of the data sets of prediction-frame confidences for the multiple target types to obtain the target data set, where the data set being processed is called the current data set and contains the confidences of N prediction frames:
computing the IoU ratio between the 1st prediction frame in the current data set and each first target prediction frame, where the first target prediction frames are all prediction frames in the current data set other than the 1st;
if an IoU ratio is greater than or equal to a first preset threshold, attenuating the confidence of that first target prediction frame;
if an IoU ratio is smaller than the first preset threshold, keeping the confidence of that first target prediction frame unchanged;
re-sorting the confidences of the prediction frames in the current data set from largest to smallest;
executing the following steps in a loop until the IoU ratio between the (N-1)-th and N-th prediction frames in the current data set is determined to be less than the first preset threshold:
starting from i = 2, computing the IoU ratio between the i-th prediction frame in the re-sorted current data set and each i-th target prediction frame, where the i-th target prediction frames are all prediction frames in the current data set other than the 1st through (i-1)-th;
if an IoU ratio is greater than or equal to the first preset threshold, attenuating the confidence of that i-th target prediction frame;
if an IoU ratio is smaller than the first preset threshold, keeping that confidence unchanged;
removing from the current data set the prediction frames whose confidence is smaller than a second preset threshold;
and setting i = i + 1 while i < N.
Optionally, the determining unit is further configured to:
output, if the number of target detection frames of the target object in the target data set is greater than a third preset threshold, the target detection frames corresponding to the top third-preset-threshold confidences in the target data set;
and output, if the number of target detection frames of the target object in the target data set is less than or equal to the third preset threshold, the target detection frames corresponding to all confidences in the target data set.
Optionally, the apparatus further comprises:
an acquisition module, configured to acquire the frames of target objects in a preset number of images to form a data set;
a clustering module, configured to cluster the frames in the data set with the K-means++ clustering algorithm to obtain 9 cluster frames;
a sorting module, configured to sort the 9 cluster frames by frame size;
and a second determining module, configured to take the first 3 cluster frames as the 3 large-size cluster frames, the middle 3 as the 3 medium-size cluster frames, and the last 3 as the 3 small-size cluster frames.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, extracting feature information of a target image;
S2, performing spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and performing multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes;
S3, determining a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Example 4
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, extracting feature information of a target image;
S2, performing spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and performing multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes;
S3, determining a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in a different order than described here, or they may be fabricated separately into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A multi-target detection processing method is characterized by comprising the following steps:
extracting feature information of a target image;
performing spatial feature pooling on the feature information through spatial pyramid pooling (SPP) to obtain effective target features, and performing multi-scale processing on the effective target features through a feature pyramid network (FPN) to obtain output target feature information of multiple sizes;
and determining a target detection frame of the target object in the target image according to the output target feature information of the multiple sizes.
2. The method according to claim 1, wherein determining the target detection frame of the target object in the target image according to the output target feature information of the multiple sizes comprises:
determining prediction frames of the target object in the target image and the confidence of each prediction frame according to the output target feature information of the multiple sizes;
and determining the target detection frame of the target object in the target image according to the intersection-over-union (IoU) ratios and the confidences of the prediction frames.
3. The method according to claim 2, wherein extracting the feature information of the target image comprises:
inputting the target image into a pre-trained target detection lightweight DenseNet network model to obtain the feature information output by the target detection lightweight DenseNet network model, wherein the target detection lightweight DenseNet network model is a network formed by cascading a convolutional layer, a maximum pooling layer, 4 cascaded Dense Blocks followed by a first Transition Layer, 8 Dense Blocks followed by a second Transition Layer, 10 Dense Blocks followed by a third Transition Layer, and a final 8 Dense Blocks; the input features of each Dense Block are concatenated with the features it computes to form its output features, each Dense Block consists of two convolutional layers, and each Transition Layer consists of a convolutional layer and a maximum pooling layer.
4. The method of claim 3, wherein performing multi-scale processing on the effective target features through the feature pyramid FPN to obtain the output target feature information of the plurality of sizes comprises:
taking the target feature output by the target detection lightweight DenseNet model as first output target feature information for large-size targets in the target image;
upsampling the first output target feature information by a factor of 2 and connecting it at the channel level with the second feature output by the target detection lightweight DenseNet model, to serve as second output target feature information for medium-size targets in the target image;
upsampling the second output target feature information by a factor of 2, passing it through a convolutional layer, upsampling the output feature of the convolutional layer by a factor of 2, and connecting the result at the channel level with the first feature, to serve as third output target feature information for small-size targets in the target image;
wherein the output target feature information of the plurality of sizes includes the first output target feature information, the second output target feature information, and the third output target feature information.
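For illustration, the fusion in claim 4 can be sketched as the following function. The nearest-neighbour interpolation mode and the externally supplied convolutional layer are assumptions; note that, read literally, the two 2x upsamplings between the medium and small levels imply the first feature sits at 4x the spatial resolution of the second output:

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(first_feat, second_feat, target_feat, conv_layer):
    """Sketch of claim 4. target_feat is the deepest backbone output;
    conv_layer is any torch conv module with matching channel counts."""
    out_large = target_feat                                    # large-size targets
    up = F.interpolate(out_large, scale_factor=2, mode='nearest')
    out_medium = torch.cat([up, second_feat], dim=1)           # medium-size targets
    up = F.interpolate(out_medium, scale_factor=2, mode='nearest')
    up = conv_layer(up)
    up = F.interpolate(up, scale_factor=2, mode='nearest')
    out_small = torch.cat([up, first_feat], dim=1)             # small-size targets
    return out_large, out_medium, out_small
```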
5. The method of claim 4, wherein determining a prediction frame of a target object in the target image and a confidence of the prediction frame according to the output target feature information of the plurality of sizes comprises:
inputting the first output target feature information into a pre-trained target recognition and frame regression convolution model to obtain a horizontal offset and a vertical offset, output by the target recognition and frame regression convolution model, of the center point of a large-size target in the target image relative to the center points of 3 large-size clustering frames in a predetermined anchor frame, and a width scaling ratio and a height scaling ratio of the large-size target in the target image corresponding to the 3 large-size clustering frames; determining the center of a prediction frame of the large-size target in the target image according to the coordinates of the center point of the large-size target in the target image, the horizontal offset, and the vertical offset; determining 3 prediction frames corresponding to the large-size target in the target image according to the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 large-size clustering frames; and acquiring the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, and the confidences of the 3 prediction frames for the corresponding target types;
inputting the second output target feature information into the pre-trained target recognition and frame regression convolution model to obtain a horizontal offset and a vertical offset, output by the target recognition and frame regression convolution model, of the center point of a medium-size target in the target image relative to the center points of 3 medium-size clustering frames in the predetermined anchor frame, and a width scaling ratio and a height scaling ratio of the medium-size target in the target image corresponding to the 3 medium-size clustering frames; determining the center of a prediction frame of the medium-size target in the target image according to the coordinates of the center point of the medium-size target in the target image, the horizontal offset, and the vertical offset; determining 3 prediction frames corresponding to the medium-size target in the target image according to the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 medium-size clustering frames; and acquiring the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, and the confidences of the 3 prediction frames for the corresponding target types;
inputting the third output target feature information into the pre-trained target recognition and frame regression convolution model to obtain a horizontal offset and a vertical offset, output by the target recognition and frame regression convolution model, of the center point of a small-size target in the target image relative to the center points of 3 small-size clustering frames in the predetermined anchor frame, and a width scaling ratio and a height scaling ratio of the small-size target in the target image corresponding to the 3 small-size clustering frames; determining the center of a prediction frame of the small-size target in the target image according to the coordinates of the center point of the small-size target in the target image, the horizontal offset, and the vertical offset; determining 3 prediction frames corresponding to the small-size target in the target image according to the width scaling ratio, the height scaling ratio, and the widths and heights of the 3 small-size clustering frames; and acquiring the target types corresponding to the 3 prediction frames output by the target recognition and frame regression convolution model, and the confidences of the 3 prediction frames for the corresponding target types;
wherein the target object includes the large-size target, the medium-size target, and the small-size target.
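As a minimal sketch of the frame decoding in claim 5 (one size level shown): additive center offsets and multiplicative width/height scaling follow the claim wording, while the corner-coordinate output format is an assumption:

```python
def decode_prediction_frames(center_xy, offsets, scales, cluster_whs):
    """Decode the 3 prediction frames for one target.
    center_xy: (x, y) center-point coordinates of the target;
    offsets: 3 pairs (dx, dy) of horizontal/vertical offsets;
    scales: 3 pairs (sw, sh) of width/height scaling ratios;
    cluster_whs: 3 pairs (w, h), the clustering-frame widths and heights."""
    frames = []
    for (dx, dy), (sw, sh), (w, h) in zip(offsets, scales, cluster_whs):
        cx, cy = center_xy[0] + dx, center_xy[1] + dy   # shifted frame center
        fw, fh = w * sw, h * sh                          # scaled frame size
        frames.append((cx - fw / 2, cy - fh / 2, cx + fw / 2, cy + fh / 2))
    return frames
```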
6. The method of claim 5, wherein determining the target detection frame of the target object in the target image according to the intersection-over-union ratios and the confidences of the prediction frames comprises:
sorting, for each target type, the prediction frames of that type among the 3 prediction frames corresponding to the large-size target, the 3 prediction frames corresponding to the medium-size target, and the 3 prediction frames corresponding to the small-size target in descending order of confidence, to obtain data sets of the prediction-frame confidences of the plurality of target types;
retaining, in the data sets of the prediction-frame confidences of the plurality of target types, the prediction frame with the highest confidence among two or more prediction frames whose overlapping area is larger than a preset area, to obtain a target data set;
and determining the target detection frame of the target object in the target image according to the target data set.
7. The method of claim 6, wherein retaining, in the data sets of the prediction-frame confidences of the plurality of target types, the prediction frame with the highest confidence among two or more prediction frames whose overlapping area is larger than the preset area to obtain the target data set comprises:
performing the following steps on each of the data sets of the prediction-frame confidences of the plurality of target types to obtain the target data set, wherein the data set currently being processed is referred to as the current data set, and the current data set comprises the confidences of N prediction frames:
calculating the intersection-over-union ratio between the 1st prediction frame in the current data set and each first target prediction frame, wherein the first target prediction frames are the prediction frames in the current data set other than the 1st prediction frame;
if the intersection-over-union ratio is greater than or equal to a first preset threshold, attenuating the confidence of that first target prediction frame;
if the intersection-over-union ratio is smaller than the first preset threshold, keeping the confidence of that first target prediction frame unchanged;
re-sorting the confidences of the prediction frames in the current data set in descending order;
performing the following steps in a loop, starting from i = 2, until the intersection-over-union ratio between the (N-1)th prediction frame and the Nth prediction frame in the current data set is smaller than the first preset threshold:
calculating the intersection-over-union ratio between the ith prediction frame in the re-sorted current data set and each ith target prediction frame, wherein the ith target prediction frames are the prediction frames in the current data set other than the 1st through ith prediction frames;
if the intersection-over-union ratio is greater than or equal to the first preset threshold, attenuating the confidence of that ith target prediction frame;
if the intersection-over-union ratio is smaller than the first preset threshold, keeping the confidence of that ith target prediction frame unchanged;
removing from the current data set the prediction frames whose confidence is smaller than a second preset threshold;
incrementing i by 1, with i < N.
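For illustration, the confidence-attenuation loop of claim 7 (a soft-NMS-style procedure) might be sketched as follows. The dictionary data layout, the multiplicative decay factor, the corner box format, and the run-to-exhaustion loop are assumptions, since the claim only specifies "attenuating" the confidence:

```python
def iou(a, b):
    """Intersection-over-union of two frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def attenuate_overlaps(frames, iou_thresh, conf_thresh, decay=0.5):
    """frames: list of {'box': (x1, y1, x2, y2), 'conf': float} for one
    target type. Returns the frames retained for the target data set."""
    frames = sorted(frames, key=lambda f: f['conf'], reverse=True)
    i = 0
    while i < len(frames) - 1:
        for other in frames[i + 1:]:
            if iou(frames[i]['box'], other['box']) >= iou_thresh:
                other['conf'] *= decay                # attenuate overlapping frame
        frames = sorted(frames, key=lambda f: f['conf'], reverse=True)
        frames = [f for f in frames if f['conf'] >= conf_thresh]
        i += 1
    return frames
```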
8. The method of claim 6, wherein determining the target detection frame of the target object in the target image according to the target data set comprises:
if the number of target detection frames of the target object in the target data set is greater than a third preset threshold, outputting the target detection frames corresponding to the top-ranked confidences in the target data set, the number of output frames being equal to the third preset threshold;
and if the number of target detection frames of the target object in the target data set is less than or equal to the third preset threshold, outputting the target detection frames corresponding to all the confidences in the target data set.
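A minimal sketch of the claim 8 cut-off, assuming the target data set is a list of frame dictionaries as above and the third preset threshold is an integer count k:

```python
def select_detection_frames(target_data_set, k):
    """Keep at most k highest-confidence frames, per claim 8.
    Slicing covers both branches: if fewer than k frames exist, all are kept."""
    ranked = sorted(target_data_set, key=lambda f: f['conf'], reverse=True)
    return ranked[:k]
```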
9. The method of any one of claims 5 to 8, wherein before determining a prediction frame of a target object in the target image and a confidence of the prediction frame according to the output target feature information of the plurality of sizes, the method further comprises:
acquiring the frames of target objects in a preset number of images to form a data set;
clustering the frames in the data set through a K-means++ clustering algorithm to obtain 9 clustering frames;
sorting the 9 clustering frames by frame size;
and determining the first 3 clustering frames as the 3 large-size clustering frames, the middle 3 clustering frames as the 3 medium-size clustering frames, and the last 3 clustering frames as the 3 small-size clustering frames.
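For illustration, the anchor-frame clustering of claim 9 could be realised with scikit-learn's k-means++ initialisation as below. Clustering on raw (width, height) pairs and sorting by area are assumptions; the claim does not fix the distance metric or the size ordering:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_frames(widths_heights):
    """widths_heights: (M, 2) array of ground-truth frame widths/heights.
    Returns the 3 large, 3 medium, and 3 small clustering frames."""
    km = KMeans(n_clusters=9, init='k-means++', n_init=10).fit(widths_heights)
    anchors = km.cluster_centers_
    order = np.argsort(-anchors[:, 0] * anchors[:, 1])  # descending by area
    anchors = anchors[order]
    return anchors[:3], anchors[3:6], anchors[6:]
```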
10. The method according to any one of claims 1 to 8, wherein performing spatial feature pooling on the feature information through spatial pyramid pooling SPP to obtain the effective target features comprises:
inputting the feature information into a spatial pyramid pooling model and pooling the spatial features to obtain the effective target features, wherein the spatial pyramid pooling model is formed by parallel pooling windows of sizes 5×5, 9×9, and 13×13.
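By way of illustration, the parallel-window pooling model of claim 10 can be sketched as follows. Stride-1 pooling with "same" padding and channel concatenation of the input with the three pooled maps (as in YOLO-style SPP modules) are assumptions:

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Parallel 5x5, 9x9 and 13x13 max-pooling windows per claim 10."""
    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)])

    def forward(self, x):
        # Concatenate the input with each pooled map along the channel axis.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```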
11. A multi-target detection processing apparatus, comprising:
the extraction module is used for extracting feature information of a target image;
the multi-scale target feature extraction module is used for performing spatial feature pooling on the feature information through spatial pyramid pooling SPP to obtain effective target features, and performing multi-scale processing on the effective target features through a feature pyramid FPN to obtain output target feature information of a plurality of sizes;
and the first determining module is used for determining a target detection frame of a target object in the target image according to the output target feature information of the plurality of sizes.
12. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 10 when executed.
13. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 10.
CN202010659215.XA 2020-07-09 2020-07-09 Multi-target detection processing method and device Pending CN113989753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010659215.XA CN113989753A (en) 2020-07-09 2020-07-09 Multi-target detection processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010659215.XA CN113989753A (en) 2020-07-09 2020-07-09 Multi-target detection processing method and device

Publications (1)

Publication Number Publication Date
CN113989753A true CN113989753A (en) 2022-01-28

Family

ID=79731265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010659215.XA Pending CN113989753A (en) 2020-07-09 2020-07-09 Multi-target detection processing method and device

Country Status (1)

Country Link
CN (1) CN113989753A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114789440A (en) * 2022-04-22 2022-07-26 深圳市正浩创新科技股份有限公司 Target docking method, device, equipment and medium based on image recognition
CN114789440B (en) * 2022-04-22 2024-02-20 深圳市正浩创新科技股份有限公司 Target docking method, device, equipment and medium based on image recognition

Similar Documents

Publication Publication Date Title
CN106599773B (en) Deep learning image identification method and system for intelligent driving and terminal equipment
CN112287912B (en) Deep learning-based lane line detection method and device
CN111104903B (en) Depth perception traffic scene multi-target detection method and system
JP7012880B2 (en) Target detection method and equipment, equipment and storage media
CN107944450B (en) License plate recognition method and device
CN110491132B (en) Vehicle illegal parking detection method and device based on video frame picture analysis
CN106971155B (en) Unmanned vehicle lane scene segmentation method based on height information
CN114118124B (en) Image detection method and device
CN106971185B (en) License plate positioning method and device based on full convolution network
CN106599832A (en) Method for detecting and recognizing various types of obstacles based on convolution neural network
WO2020258077A1 (en) Pedestrian detection method and device
CN112347933A (en) Traffic scene understanding method and device based on video stream
CN111861925A (en) Image rain removing method based on attention mechanism and gate control circulation unit
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN106934374A (en) The recognition methods of traffic signboard and system in a kind of haze scene
CN114495060B (en) Road traffic marking recognition method and device
Saravanarajan et al. Improving semantic segmentation under hazy weather for autonomous vehicles using explainable artificial intelligence and adaptive dehazing approach
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
CN113989753A (en) Multi-target detection processing method and device
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN116229406B (en) Lane line detection method, system, electronic equipment and storage medium
CN111178181B (en) Traffic scene segmentation method and related device
CN113011408A (en) Method and system for recognizing characters and vehicle identification codes of multi-frame picture sequence
CN116363072A (en) Light aerial image detection method and system
CN111144361A (en) Road lane detection method based on binaryzation CGAN network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination