CN116310688A - Target detection model based on cascade fusion, and construction method, device and application thereof - Google Patents

Target detection model based on cascade fusion, and construction method, device and application thereof Download PDF

Info

Publication number
CN116310688A
Authority
CN
China
Prior art keywords
detected
fusion
result
cascade
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310291522.0A
Other languages
Chinese (zh)
Inventor
李圣权
乐耀东
袁帆
张香伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202310291522.0A priority Critical patent/CN116310688A/en
Publication of CN116310688A publication Critical patent/CN116310688A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target detection model based on cascade fusion, and a construction method, a device and an application thereof. The method comprises the following steps: constructing a cascade fusion-based target detection model comprising a backbone network, a neck network and a head network, wherein the backbone network comprises a convolution layer and cascade feature fusion layers; inputting the image to be detected into the backbone network for feature extraction, the last cascade feature fusion layer outputting the features of the image to be detected; performing feature extraction with the neck network to obtain the compressed features of the image to be detected; and obtaining, with the head network, the category information of the target to be detected from the compressed features of the image to be detected. By learning multi-scale features in a layered cascade manner, the cascade fusion-based target detection model reduces spatial complexity, greatly reduces the amount of parameter computation, and at the same time improves the accuracy of target detection.

Description

Target detection model based on cascade fusion, and construction method, device and application thereof
Technical Field
The application relates to the field of deep learning target detection, in particular to a target detection model based on cascade fusion, and a construction method, a device and application thereof.
Background
Object detection is a basic and challenging task in the field of computer vision: it aims to detect the smallest bounding box covering each object of interest in an input image and to assign the corresponding semantic label. In general, recent approaches based on convolutional neural networks (CNNs) can be roughly divided into two-stage and one-stage detectors. Two-stage detectors typically use a region proposal network to generate candidate boxes first and then refine them in the next stage, so their efficiency is limited by this multi-stage nature, whereas one-stage detectors always have faster inference than two-stage detectors. However, most CNN-based detectors involve hundreds or even thousands of convolutional layers and feature channels, and such model sizes and inference costs are unacceptable for real-world applications that require online estimation and real-time prediction, such as autonomous driving, robotics and virtual reality.
The main task of object detection is to locate the objects of interest in an input image and then accurately judge the category of each object of interest. Current object detection technology is widely applied in daily-life safety, robot navigation, intelligent video surveillance, traffic scene detection and aerospace, and object detection is also the basis of other high-level vision problems such as behavior understanding, scene classification and video content retrieval. However, because of the great diversity among different instances of the same object class, the possible similarity among objects of different classes, and the strong influence of different imaging conditions and environmental factors on object appearance, object detection remains highly challenging.
Traditional target detection algorithms use a sliding window or an exhaustive image segmentation technique to generate a large number of candidate regions, then extract image features (such as HOG, SIFT and Haar) from each candidate region, and pass these features to a classifier (such as SVM, AdaBoost or Random Forest) to judge the category of the candidate region.
These methods have limitations in practical applications: single-scale target detection performs poorly under unconstrained conditions, so single-scale deep features make it difficult to improve detection accuracy in complex scenes. Current approaches improve the detection accuracy in complex scenes by using global attention or by enlarging the convolution kernel to expand the receptive field, but oversized input images usually introduce a large amount of computation cost and memory overhead, and the high spatial complexity of the vision Transformer makes it inefficient.
In view of the foregoing, there is a need for a method that can rapidly detect a target in a complex scene under non-limiting conditions, and has high detection accuracy.
Disclosure of Invention
The embodiment of the application provides a target detection model based on cascade fusion, a construction method, a device and application thereof, wherein multi-scale feature representation is learned through a plurality of cascade feature fusion layers, and the space complexity is remarkably reduced, so that the calculation amount of the model is reduced under the condition that the detection accuracy is not influenced.
In a first aspect, an embodiment of the present application provides a method for constructing a target detection model based on cascade fusion, where the method includes:
constructing a target detection model based on cascade fusion, wherein the target detection model based on cascade fusion comprises a backbone network, a neck network and a head network;
acquiring at least one image to be detected containing a target to be detected as a training sample, annotating the target to be detected in the training sample, inputting the training sample into the backbone network, and performing feature extraction to obtain the features of the image to be detected, wherein the backbone network is formed by connecting a backbone convolution layer and a plurality of cascade feature fusion layers in series, the output of the last cascade feature fusion layer is the features of the image to be detected, each cascade feature fusion layer is formed by sequentially connecting a cascade fusion multi-head self-attention module and a cascade fusion module in series, and a dual-branch downsampling module exists between any two adjacent cascade feature fusion layers;

the neck network performs feature extraction on the features of the image to be detected to obtain the compressed features of the image to be detected;

the head network is divided into a target category prediction branch and a detection frame position prediction branch; the detection frame position prediction branch performs regression prediction on the compressed features of the image to be detected to obtain candidate frames containing the detection frame position information of the target to be detected, the target category prediction branch performs category prediction according to the compressed features of the image to be detected to obtain classification detection frames, and each candidate frame is multiplied by the corresponding classification detection frame after objectness prediction to output the classification result of the target to be detected.
In a second aspect, an embodiment of the present application provides a target detection method, including:
acquiring an image to be detected containing a target to be detected, and inputting the image to be detected into a trained target detection model based on cascade fusion;
the backbone network in the target detection model based on cascade fusion carries out feature extraction on the image to be detected to obtain the feature of the image to be detected;
the neck network in the target detection model based on cascade fusion carries out further feature extraction on the image features to be detected to obtain compression features of the image to be detected;
And inputting the compression characteristics of the image to be detected into a head network of a target detection model based on cascade fusion to perform target category prediction and detection frame position prediction to obtain the category and the corresponding detection frame of each target to be detected in the image to be detected.
In a third aspect, an embodiment of the present application provides a device for constructing a target detection model based on cascade fusion, including:
the construction module: constructing a target detection model based on cascade fusion, wherein the cascade fusion-based target detection model comprises a backbone network, a neck network and a head network;

a first feature extraction module: acquiring at least one image to be detected containing a target to be detected as a training sample, annotating the target to be detected in the training sample, inputting the training sample into the backbone network, and performing feature extraction to obtain the features of the image to be detected, wherein the backbone network is formed by connecting a backbone convolution layer and a plurality of cascade feature fusion layers in series, the output of the last cascade feature fusion layer is the features of the image to be detected, each cascade feature fusion layer is formed by sequentially connecting a cascade fusion multi-head self-attention module and a cascade fusion module in series, and a dual-branch downsampling module exists between any two adjacent cascade feature fusion layers;

and a second feature extraction module: the neck network, which is a network encoder, performs further feature extraction on the features of the image to be detected to obtain the compressed features of the image to be detected;

and a detection module: the head network is divided into a target category prediction branch and a detection frame position prediction branch; the detection frame position prediction branch performs regression prediction on the compressed features of the image to be detected to obtain candidate frames containing the detection frame position information of the target to be detected, the target category prediction branch performs category prediction according to the compressed features of the image to be detected to obtain classification detection frames, and each candidate frame is multiplied by the corresponding classification detection frame after objectness prediction to output the classification result of the target to be detected.
In a fourth aspect, embodiments of the present application provide an electronic device, comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the method for constructing a cascade fusion-based object detection model or the object detection method.
In a fifth aspect, embodiments of the present application provide a readable storage medium in which a computer program is stored, the computer program comprising program code for controlling a process to execute the method for constructing a cascade fusion-based object detection model or the object detection method.
The main contributions and innovation points of the invention are as follows:
According to the embodiments of the application, a multi-scale representation is learned from the extracted high-resolution features through a plurality of cascade feature fusion layers in the backbone network, so that most of the backbone can be effectively utilized to fuse multi-scale features. The cascade fusion multi-head self-attention module in the backbone network gradually computes self-attention over region grids of increasing size and naturally models global feature relations in a layered manner; this layered cascading significantly reduces spatial complexity, solves the inefficiency of the vision Transformer caused by the high spatial complexity of multi-head self-attention, and makes the self-attention computation in the Transformer more flexible and efficient. The cascade fusion module in the embodiments of the application can combine local and global features at the same time while introducing only a very small amount of additional cost, and the interaction of local and global features gives the model stronger multi-scale fusion ability. The embodiments of the application also improve the localization performance of the model by assigning smaller gradient gains to low-quality anchor boxes to reduce harmful gradients while focusing on anchor boxes of ordinary quality.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow diagram of a method of constructing a cascade fusion-based object detection model according to an embodiment of the present application;
fig. 2 is a schematic diagram of a backbone network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a cascaded fusion multi-headed self-attention module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a cascading fusion module according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the architecture of a coarse-fine granularity fusion module according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a dual-branch downsampling module according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the overall structure of a cascade fusion-based object detection model according to an embodiment of the present application;
FIG. 8 is a block diagram of a construction device of a cascade fusion-based object detection model according to an embodiment of the present application;
fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
Example 1
The embodiment of the application provides a method for constructing a target detection model based on cascade fusion, and specifically referring to fig. 1, the method comprises the following steps:
constructing a target detection model based on cascade fusion, wherein the target detection model based on cascade fusion comprises a backbone network, a neck network and a head network;
acquiring at least one image to be detected containing a target to be detected as a training sample, annotating the target to be detected in the training sample, inputting the training sample into the backbone network, and performing feature extraction to obtain the features of the image to be detected, wherein the backbone network is formed by connecting a backbone convolution layer and a plurality of cascade feature fusion layers in series, the output of the last cascade feature fusion layer is the features of the image to be detected, each cascade feature fusion layer is formed by sequentially connecting a cascade fusion multi-head self-attention module and a cascade fusion module in series, and a dual-branch downsampling module exists between any two adjacent cascade feature fusion layers;

the neck network performs feature extraction on the features of the image to be detected to obtain the compressed features of the image to be detected;

the head network is divided into a target category prediction branch and a detection frame position prediction branch; the detection frame position prediction branch performs regression prediction on the compressed features of the image to be detected to obtain candidate frames containing the detection frame position information of the target to be detected, the target category prediction branch performs category prediction according to the compressed features of the image to be detected to obtain classification detection frames, and each candidate frame is multiplied by the corresponding classification detection frame after objectness prediction to output the classification result of the target to be detected.
In some specific embodiments, the method can be used for realizing intelligent detection of the tree lodging target in the city through the fixed camera and the mobile phone camera, and the target to be detected is not limited to the tree lodging target.
When acquiring data, if mobile phone image data are collected by image collectors, the collected images are divided into long-range target images and close-range target images and placed in two separate tables. If images are collected from surveillance video, videos containing tree lodging at different places, at different times and in different weather are selected from the surveillance footage; Vi denotes the i-th video, Vi contains Ni video frames, and Mi video frames are selected from the Ni frames as training and test images, so that in total L = ΣMi video images are obtained as training and test images.
In some embodiments, data enhancement is performed on the acquired training samples. The enhancement methods are: 1. color transformation: performing data enhancement in the color channel space, such as turning off a color channel or changing the brightness value; 2. rotation transformation: selecting an angle, rotating the image left or right, and changing the orientation of the image content; 3. adding noise: adding a random value matrix sampled from a Gaussian distribution to the image; 4. sharpening and blurring: processing the image with a Gaussian operator, a Laplace operator or the like; 5. scaling transformation: scaling the image up or down without changing its content; 6. translation transformation: moving the image up, down, left or right; 7. flip transformation: flipping the image about the horizontal or vertical axis; 8. cropping transformation: mainly center cropping and random cropping; 9. affine transformation: applying a linear transformation to the image followed by a translation. Automatic data augmentation is adopted in the neural network training: based on a NAS-style search, a search algorithm finds an image enhancement scheme suitable for the specific data set within a search space of image-enhancement sub-strategies. Different numbers of sub-strategies may be included for different types of data sets; each sub-strategy comprises two transformations, one sub-strategy is randomly selected for each image, and each transformation in the sub-strategy is then executed with a certain probability. Data enhancement has been widely used for network optimization and has proved beneficial for visual tasks; it improves the performance of deep learning algorithms, prevents overfitting, and is easy to implement.
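For illustration only, the transformation types listed above can be composed into a single augmentation pipeline; the following sketch uses torchvision, and the operator choices and parameter values are assumptions rather than the patent's configuration:

```python
import torch
import torchvision.transforms as T

# Hypothetical augmentation pipeline covering several of the listed transformations
# (color transform, rotation, blur, scaling + cropping, flip, translation/affine).
augment = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # color transformation
    T.RandomRotation(degrees=15),                                 # rotation transformation
    T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),              # blurring
    T.RandomResizedCrop(size=448, scale=(0.8, 1.0)),              # scaling + cropping
    T.RandomHorizontalFlip(p=0.5),                                # flip transformation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),              # translation / affine
    T.ToTensor(),
])

# Gaussian noise is usually added afterwards on the tensor, e.g.:
#   img = augment(pil_image)
#   img = img + 0.01 * torch.randn_like(img)
```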
In some embodiments, the backbone network structure is shown in fig. 2, where the backbone network shown in fig. 2 is formed by connecting a backbone convolution layer and four cascade feature fusion layers in series, a dual-branch downsampling module exists between any two adjacent cascade feature fusion layers, xLi in the figure represents the number of cascade feature fusion layers, and H and W represent the height and width of an input image.
In some embodiments, the backbone convolution layer consists of two 3*3 convolution layers with a stride of 2, each followed by a LayerNorm layer and an activation function unit.
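A minimal PyTorch-style sketch of such a stem is given below; the channel widths and the GELU activation are assumptions, as the patent does not specify them:

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Sketch of the backbone convolution layer: two 3x3, stride-2 convolutions,
    each followed by LayerNorm and an activation function."""
    def __init__(self, in_ch=3, dims=(32, 64)):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, dims[0], kernel_size=3, stride=2, padding=1)
        self.norm1 = nn.LayerNorm(dims[0])
        self.conv2 = nn.Conv2d(dims[0], dims[1], kernel_size=3, stride=2, padding=1)
        self.norm2 = nn.LayerNorm(dims[1])
        self.act = nn.GELU()

    def _norm(self, x, norm):
        # LayerNorm over the channel dimension of an NCHW tensor
        x = x.permute(0, 2, 3, 1)
        x = norm(x)
        return x.permute(0, 3, 1, 2)

    def forward(self, x):                                        # x: (N, 3, H, W)
        x = self.act(self._norm(self.conv1(x), self.norm1))      # (N, C1, H/2, W/2)
        x = self.act(self._norm(self.conv2(x), self.norm2))      # (N, C2, H/4, W/4)
        return x
```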
Specifically, the backbone network of the scheme calculates the self-attention of the image features to be detected in a layered manner, so that each step only processes a limited number of tokens, and the cascade feature fusion layer of each layer in the backbone network performs downsampling through average pooling.
Specifically, the scheme can enable the network to generate richer multi-scale features by introducing a plurality of cascade feature fusion layers.
Specifically, the structure of the cascade fusion multi-head self-attention module is shown in fig. 3. In some embodiments, the cascade fusion multi-head self-attention module is formed by sequentially connecting a feature reshaping layer, a first multi-head self-attention layer, a second multi-head self-attention layer and a third multi-head self-attention layer in series. The input of the cascade fusion multi-head self-attention module is defined as the first input feature; the first input feature is reshaped by the feature reshaping layer to obtain a reshaped feature; the reshaped feature is input to the first multi-head self-attention layer to obtain a first attention result; the first attention result and the reshaped feature are stitched to obtain a first stitching result; the first stitching result is input to the second multi-head self-attention layer to obtain a second attention result; the second attention result and the first stitching result are stitched to obtain a second stitching result; the second stitching result is input to the third multi-head self-attention layer to obtain a third attention result; and the third attention result and the second stitching result are stitched to obtain the output of the cascade fusion multi-head self-attention module.
The cascade fusion multi-head self-attention module gradually computes self-attention over region grids of increasing size for the target to be detected and naturally models the global feature relations of the whole image to be detected in a layered manner; the layered cascading significantly reduces spatial complexity, solves the inefficiency of the vision Transformer caused by the high spatial complexity of multi-head self-attention, and at the same time makes the self-attention computation in the Transformer more flexible and efficient.
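The following sketch illustrates one possible realization of this module; "feature stitching" is assumed here to be channel concatenation, and a linear projection is added after each concatenation to keep the token dimension constant, since the patent does not spell out these details:

```python
import torch
import torch.nn as nn

class CascadedMHSA(nn.Module):
    """Sketch of the cascade fusion multi-head self-attention module:
    reshape -> MHSA1 -> stitch -> MHSA2 -> stitch -> MHSA3 -> stitch."""
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn3 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj1 = nn.Linear(2 * dim, dim)   # assumed projections back to dim
        self.proj2 = nn.Linear(2 * dim, dim)
        self.proj3 = nn.Linear(2 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W), the first input feature
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # feature reshaping: (N, H*W, C)
        a1, _ = self.attn1(tokens, tokens, tokens)
        s1 = self.proj1(torch.cat([a1, tokens], dim=-1))   # first stitching result
        a2, _ = self.attn2(s1, s1, s1)
        s2 = self.proj2(torch.cat([a2, s1], dim=-1))       # second stitching result
        a3, _ = self.attn3(s2, s2, s2)
        out = self.proj3(torch.cat([a3, s2], dim=-1))      # module output
        return out.transpose(1, 2).reshape(n, c, h, w)
```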
Specifically, the structure of the cascade fusion module is shown in fig. 4. In some embodiments, the cascade fusion module is formed by sequentially connecting a first transposed residual bottleneck module, a first downsampling layer, a second transposed residual bottleneck module, a second downsampling layer, a coarse-fine granularity fusion module and a conversion module in series. The input of the cascade fusion module is defined as the second input feature; the second input feature is input to the first transposed residual bottleneck module to obtain a first transposition result; the first transposition result is downsampled in the first downsampling layer to obtain a first downsampling result; the first downsampling result is input to the second transposed residual bottleneck module to obtain a second transposition result; the second transposition result is downsampled in the second downsampling layer to obtain a second downsampling result; the second downsampling result is input to the coarse-fine granularity fusion module to obtain a coarse-fine granularity fusion result; and the coarse-fine granularity fusion result and the second downsampling result are input to the conversion module to be fused and converted to obtain the output of the cascade fusion module.
The cascade fusion module enables features to be fused deeply using most of the parameters of the whole backbone, so that more parameters can be devoted to feature fusion, which greatly increases the richness of the fused features.
Specifically, the conversion module gradually upsamples the features of different scales to perform cascading feature combinations without additional convolution to convert the added features.
Furthermore, the first transposed residual bottleneck module and the second transposed residual bottleneck module have the same structure and each consist of three serially connected convolution layers; the input feature is processed by the three serially connected convolution layers, and the processed result is then fused with the input feature to obtain the transposition result.
In some embodiments, when four cascade feature fusion layers, respectively L1, L2, L3 and L4, exist in the backbone network, the processing in L1 is as follows: feature extraction is performed by the first transposed residual bottleneck module in the cascade fusion module to obtain a first transposition result of size H/8*W/8; the first transposition result is sent to the first downsampling layer to obtain a first downsampling result; the first downsampling result undergoes feature extraction by the second transposed residual bottleneck module to obtain a second transposition result; the second transposition result is downsampled in the second downsampling layer to obtain a second downsampling result; the second downsampling result is input to the coarse-fine granularity fusion module to obtain a coarse-fine granularity fusion result; identity mappings of the first transposition result and of the second transposition result are obtained, and the identity mapping of the first transposition result, the identity mapping of the second transposition result and the coarse-fine granularity fusion result are fused in the conversion module to obtain the output of the cascade fusion module, which is the output of the cascade feature fusion layer L1. In the cascade fusion module of L2, no downsampling is required after the first transposed residual bottleneck module, and the other steps are the same as in L1; in L3 a first upsampling is performed, with the other steps the same as in L2; and in L4 a second upsampling is performed, with the other steps the same as in L3.
In some embodiments, the kernel sizes of the three convolution layers in the first transposed residual bottleneck module and the second transposed residual bottleneck module are, in order, 1×1, k×k and 1×1. It is worth mentioning that k=5 with a channel expansion ratio of 4 in L1, k=3 with a channel expansion ratio of 4 in L2, k=5 with a channel expansion ratio of 6 in L3, and k=3 with a channel expansion ratio of 5 in the L4 stage; the number of feature channels of L2 is twice that of L1, the number of feature channels of L3 is twice that of L2, and the number of feature channels of L4 is twice that of L3.
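A possible sketch of such a transposed (inverted) residual bottleneck is shown below; whether the k×k convolution is depthwise and the choice of BatchNorm/GELU are assumptions, not the patent's exact settings:

```python
import torch.nn as nn

class TransposedResidualBottleneck(nn.Module):
    """Sketch of the transposed residual bottleneck: 1x1 expansion, k x k
    convolution, 1x1 projection, with the input fused back via a shortcut."""
    def __init__(self, dim, k=3, expand_ratio=4):
        super().__init__()
        hidden = dim * expand_ratio
        self.block = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=k, padding=k // 2, groups=hidden),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        # fuse the processed result with the input feature (identity shortcut)
        return x + self.block(x)

# Stage settings described above: k=5, ratio 4 in L1; k=3, ratio 4 in L2;
# k=5, ratio 6 in L3; k=3, ratio 5 in L4, with channel width doubling per stage.
```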
Specifically, the structure of the coarse-fine granularity fusion module is shown in fig. 5. Further, the coarse-fine granularity fusion module is formed by sequentially connecting a coarse-fine granularity window attention layer, a coarse-fine granularity depthwise separable convolution layer, a first coarse-fine granularity convolution layer and a second coarse-fine granularity convolution layer in series. The second downsampling result is input into the coarse-fine granularity window attention layer to obtain a window attention result; the window attention result and the second downsampling result are fused and then input into the coarse-fine granularity depthwise separable convolution layer to obtain a coarse-fine granularity depthwise separable convolution result; this result is fused with the window attention result and the second downsampling result to obtain a preliminary coarse-fine granularity fusion result; and the preliminary coarse-fine granularity fusion result passes sequentially through the first coarse-fine granularity convolution layer and the second coarse-fine granularity convolution layer to obtain the final coarse-fine granularity fusion result.
In the figure, d7×7 denotes a depthwise separable convolution with a kernel size of 7*7; the parameter count of a depthwise separable convolution is greatly reduced compared with a conventional convolution, saving computational resources. s7×7 denotes a window attention layer with a window size of 7*7, containing 49 feature pixels. Global self-attention computation leads to quadratic complexity, so this scheme adopts a window-based self-attention mechanism that evenly divides the original image into several non-overlapping windows; the complexity of the window-based self-attention mechanism is greatly reduced compared with the global self-attention mechanism, as shown in the following two formulas:

Ω(MSA) = 4hwc² + 2(hw)²c    (10)

Ω(WMSA) = 4hwc² + 2W²hwc    (11)

where Ω(MSA) denotes the complexity of global attention and Ω(WMSA) the complexity of window attention, h, w and c denote respectively the number of rows, columns and channels of the feature map, W denotes the window size, which is much smaller than h and w, and r denotes the dilation rate of the convolution, used to expand the convolution kernel.
It is worth mentioning that this scheme uses the coarse-fine granularity fusion module to expand the receptive field of the neurons so that the features have a larger receptive field; as the spatial size of the features decreases, the number of channels increases and more parameters are introduced, so the receptive field of the neurons is expanded through the coarse-fine granularity fusion module within the cascade fusion module.
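A compact sketch of the coarse-fine granularity fusion module follows; the exact fusion order is ambiguous in the description, so element-wise additions are assumed, and the feature size is assumed divisible by the 7×7 window:

```python
import torch
import torch.nn as nn

class CoarseFineFusion(nn.Module):
    """Sketch: 7x7 window self-attention branch + 7x7 depthwise convolution branch,
    fused and refined by two 1x1 convolutions."""
    def __init__(self, dim, window=7, num_heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=1)
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=1)

    def _window_attn(self, x):
        n, c, h, w = x.shape
        ws = self.window
        # partition into non-overlapping ws x ws windows (49 feature pixels each)
        x = x.view(n, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        x, _ = self.attn(x, x, x)
        x = x.reshape(n, h // ws, w // ws, ws, ws, c)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(n, c, h, w)

    def forward(self, x):                       # x: the second downsampling result
        a = self._window_attn(x)                # window attention result
        d = self.dwconv(a + x)                  # depthwise separable branch
        fused = d + a + x                       # preliminary coarse-fine fusion (assumed)
        return self.conv2(self.conv1(fused))    # final coarse-fine granularity fusion result
```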
In some embodiments, the dual-branch downsampling module has a first branch and a second branch connected in parallel, the first branch is composed of a first branch pooling layer, a first branch convolution layer and a first branch normalization layer, the second branch is composed of a second branch convolution layer, the input feature of the dual-branch downsampling module is defined as a third input feature, the third input feature obtains a first branch downsampling result through the first branch, the third input feature obtains a second branch downsampling result through the second branch, and feature fusion is performed on the first branch downsampling result and the second branch downsampling result to obtain a dual-branch downsampling result.
Specifically, the structure of the dual-branch downsampling module is shown in fig. 6, where the first branch convolution layer is 1*1 convolution, and the second branch convolution layer is 3*3 convolution.
In particular, the use of the dual-branch downsampling module may reduce the number of parameters, enhance feature expression, and increase the depth of the model.
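An illustrative sketch of the dual-branch downsampling module is given below; the use of average pooling, addition as the fusion operation, and channel doubling are assumptions consistent with the description above:

```python
import torch.nn as nn

class DualBranchDownsample(nn.Module):
    """Sketch: branch 1 = pooling -> 1x1 convolution -> normalization;
    branch 2 = 3x3 stride-2 convolution; the two results are fused."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.AvgPool2d(kernel_size=2, stride=2),      # first branch pooling layer
            nn.Conv2d(in_dim, out_dim, kernel_size=1),  # first branch convolution layer
            nn.BatchNorm2d(out_dim),                    # first branch normalization layer
        )
        self.branch2 = nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        # fuse the first and second branch downsampling results
        return self.branch1(x) + self.branch2(x)
```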
In some embodiments, the neck network is a network encoder.
In some embodiments, the target category prediction branch is formed by sequentially connecting a target category convolution layer, a target category depth separation convolution module and a classifier in series, and the detection frame position prediction branch is formed by sequentially connecting a detection frame position convolution layer, a detection frame position depth separation convolution module and a frame regression layer in series; the target category depth separation convolution module is formed by sequentially connecting a depth separation convolution layer, a first normalization layer, a first activation function layer, a convolution layer, a second normalization layer and a second activation function layer in series, and the detection frame position depth separation convolution module has the same structure as the target category depth separation convolution module.
In some embodiments, the head network follows an automatic label-assignment scheme and uses an objectness prediction for each candidate box on the regression head to indicate whether the candidate box contains a target; the final prediction of the classification score is obtained by multiplying the classification output by the objectness prediction. The overall structure of the cascade fusion-based target detection model is shown in fig. 7, where A represents the number of candidate boxes, K represents the number of classes of the model, N represents the number of samples, C represents the number of feature channels, and H1 and W1 represent the height and width of the feature, respectively.
Specifically, the structure of the detection frame position depth separation convolution module and the target category depth separation convolution module is shown in fig. 8, and the depth separation convolution module can greatly reduce the calculation amount of the model and save the calculation resources of a computer.
Specifically, the target class convolution layer and the detection frame position convolution layer are 1*1 convolutions, the depth separation convolution layer in the target class depth separation convolution module is 3*3 convolutions, and the convolution layer in the target class depth separation convolution module is 1*1 convolutions.
Specifically, the first activation function layer and the second activation function layer use the Mish activation function. By retaining a small amount of negative information, Mish avoids the Dying ReLU phenomenon (where the activation output is stuck at 0); it avoids saturation of near-zero gradients, and it is unbounded above so that the output does not saturate at a maximum value. Mish is also continuously differentiable, which gives better gradient flow.
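A minimal sketch of such a head is shown below; one prediction per feature location is assumed (the anchor count A is omitted for simplicity), and the channel layout is illustrative:

```python
import torch
import torch.nn as nn

class DepthSeparationBlock(nn.Module):
    """3x3 depthwise conv -> norm -> Mish -> 1x1 conv -> norm -> Mish,
    matching the depth separation convolution module described above."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.Mish(),
            nn.Conv2d(dim, dim, 1),
            nn.BatchNorm2d(dim),
            nn.Mish(),
        )

    def forward(self, x):
        return self.block(x)

class DetectionHead(nn.Module):
    """Sketch of the head network: a category branch and a box-position branch,
    each a 1x1 convolution followed by a depth separation block; the final
    classification score is the class output multiplied by the objectness."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls_branch = nn.Sequential(nn.Conv2d(dim, dim, 1), DepthSeparationBlock(dim))
        self.box_branch = nn.Sequential(nn.Conv2d(dim, dim, 1), DepthSeparationBlock(dim))
        self.cls_out = nn.Conv2d(dim, num_classes, 1)   # classifier
        self.box_out = nn.Conv2d(dim, 4, 1)             # frame regression
        self.obj_out = nn.Conv2d(dim, 1, 1)             # objectness prediction

    def forward(self, feat):                            # feat: compressed features (N, C, H1, W1)
        cls = torch.sigmoid(self.cls_out(self.cls_branch(feat)))
        box_feat = self.box_branch(feat)
        boxes = self.box_out(box_feat)
        obj = torch.sigmoid(self.obj_out(box_feat))
        return boxes, cls * obj                         # classification score = class x objectness
```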
In some specific embodiments, after the step in which the detection frame position prediction branch uses objectness prediction to obtain the detection frame position information of the target to be detected, Wise-IoU is used as the loss function of frame regression: the degree of overlap between the candidate frame and the target frame is represented by IoU, the influence of geometric factors on the candidate frame is reduced, a momentum m is set, and gradient gains are assigned to the candidate frames through the momentum m. The specific formulas are as follows:

L_IoU = 1 − IoU    (2)

L_WIoUv1 = R_WIoU · L_IoU    (3)

R_WIoU = exp( ((x − x_gt)² + (y − y_gt)²) / (W_g² + H_g²)* )    (4)

L_WIoU = r · L_WIoUv1    (5)

β = L_IoU* / L̄_IoU ∈ [0, +∞)    (6)

r = β / (δ · α^(β−δ))    (7)

L̄_IoU ← (1 − m) · L̄_IoU + m · L_IoU*    (8)

wherein L_IoU represents the loss of the degree of overlap between a predicted frame and the target frame, R_WIoU increases the loss of ordinary-quality candidate frames, (x, y) and (x_gt, y_gt) are the centre points of the predicted frame and the target frame, W_g and H_g are the width and height of the smallest enclosing box and the superscript * denotes detaching from the computation graph, L̄_IoU is the running mean of L_IoU maintained with momentum m, L_WIoU assigns gradient gains to the candidate frames, and β and r are used to adjust the gradient gains of the candidate frames.
The loss function of bounding-box regression is critical for target detection, and a good definition of it brings a significant performance improvement to the model. Most existing work assumes that the samples in the training data are of high quality and focuses on strengthening the fitting ability of the bounding-box regression loss; blindly strengthening the regression of low-quality samples, however, harms localization performance. Focal-EIoU v1 was proposed to solve this problem, but because of its static focusing mechanism it does not fully exploit the potential of a non-monotonic focusing mechanism, whereas the Wise-IoU loss has a dynamic non-monotonic focusing mechanism. In the target detection task, IoU is used to measure the degree of overlap between the anchor box and the target box as the ratio of the intersection and the union of the predicted box and the labelled box, which shields the loss from the interference of bounding-box size in a proportional manner. Since the training data inevitably contains low-quality examples, geometric factors (such as distance and aspect ratio) aggravate the penalty on low-quality examples and thereby reduce the generalization performance of the model; when the anchor box coincides well with the target box, a good loss function should weaken the penalty of the geometric factors, and fewer training interventions give the model better generalization ability. R_WIoU ∈ [1, e), which significantly amplifies the L_IoU of ordinary-quality anchor boxes; L_IoU ∈ [0, 1], which significantly reduces the R_WIoU of high-quality anchor boxes and focuses on the distance between centre points when the anchor box and the target box overlap well. W_g and H_g are the width and height of the smallest enclosing box; to prevent R_WIoU from producing gradients that hinder convergence, W_g and H_g are detached from the computation graph (the superscript * indicates this operation), which effectively removes the factors hindering convergence without introducing new metrics. Because L̄_IoU is dynamic, the quality demarcation criterion of the anchor boxes is also dynamic, which allows WIoU to formulate the gradient-gain allocation strategy best suited to the current situation at every moment. L̄_IoU is the running mean with momentum m; to prevent low-quality anchor boxes from being left behind in the early stage of training, L̄_IoU is initialized to 1, so that anchor boxes with L_IoU = 1 enjoy the highest gradient gain. To maintain this strategy in the early stage of training, a small momentum m must be set to delay the time at which L̄_IoU approaches its true value; for training with n data batches, the momentum m is set as shown in formula (9). In the middle and late stages of training, L_WIoU assigns smaller gradient gains to low-quality anchor boxes to reduce harmful gradients, while focusing on anchor boxes of ordinary quality to improve the localization performance of the model. When β = 6, r = 1; when the outlier degree of an anchor box satisfies β = C (C being a constant), the anchor box obtains the highest gradient gain. Since β is dynamic, the quality demarcation criterion of the anchor boxes is also dynamic, which enables L_WIoU to formulate the gradient-gain allocation strategy best suited to the current situation at every moment.
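A sketch of this box-regression loss is given below; the hyper-parameters alpha and delta follow the public Wise-IoU paper and are assumptions, and the running mean L̄_IoU (iou_mean) is maintained by the caller:

```python
import torch

def wise_iou_v3_loss(pred, target, iou_mean, alpha=1.9, delta=3.0, eps=1e-7):
    """Sketch of the Wise-IoU loss described above. pred/target: (N, 4) boxes in
    (x1, y1, x2, y2) form; iou_mean is the running mean of L_IoU kept outside this
    function with momentum m (formula (8)). alpha/delta values are assumptions."""
    # IoU and L_IoU (formula (2))
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    l_iou = 1.0 - iou

    # R_WIoU (formula (4)): centre distance over the smallest enclosing box, detached
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    wg = (enc_rb[:, 0] - enc_lt[:, 0]).detach()
    hg = (enc_rb[:, 1] - enc_lt[:, 1]).detach()
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2) / (wg ** 2 + hg ** 2 + eps))

    l_wiou_v1 = r_wiou * l_iou                       # formula (3)

    # dynamic non-monotonic focusing (formulas (6) and (7))
    beta = l_iou.detach() / (iou_mean + eps)         # outlier degree of each box
    r = beta / (delta * alpha ** (beta - delta))     # gradient gain
    return (r * l_wiou_v1).mean()                    # formula (5), averaged over boxes

# The caller keeps the running mean, e.g.:
#   iou_mean = (1 - m) * iou_mean + m * current_mean_l_iou
```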
in some specific embodiments, the scheme trains 300epoch on own data and performs the norm-up of 5epoch during training, the initial norm-up of training sets the learning rate to be very small, the learning rate gradually rises along with the progress of training, and finally the learning rate of normal training is achieved, the stage is the core stage of norm-up, the neural network hopes to gradually reduce the learning rate (learning rate decay) along with the progress of training, and the learning rate is reduced to 0 after the completion of training; the optimizer is SGD, the learning rate is 0.01, the initial learning rate is 0.01, the weight decade is set to 0.0005, the momentum is set to 0.9, the batch is determined by hardware equipment, and the input size is uniformly transited from 448 to 832 in step length 32; the connection weight w and the bias b of each layer are randomly initialized, the learning rate eta and the minimum Batch are given, an activation function SMU is selected, and a frame Loss function is selected as SIoU_Loss and the maximum iteration number (or algebra) under the current data.
In model training, multiple graphics cards are used when the hardware allows, with PyTorch as the deep learning framework. PyTorch's multi-GPU parallel mechanism first loads the model onto a master GPU and copies it to each of the specified slave GPUs, and then splits the input data along the batch dimension; specifically, the batch size allocated to each GPU is the total input batch size divided by the number of specified GPUs. Each GPU independently performs forward propagation on its own share of the input data; the losses of all GPUs are then summed, back-propagation updates the model parameters on a single GPU, and the updated parameters are copied to the remaining specified GPUs, completing one iteration. After the neural network parameters are determined, the processed data are input and iteration is repeated until the error at the output layer of the neural network reaches the preset accuracy requirement or the number of training iterations reaches the maximum; training then stops, the network structure and parameters are saved, and the trained cascade fusion-based target detection model is obtained.
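A minimal sketch of this multi-GPU mechanism using PyTorch's nn.DataParallel is shown below; the model is assumed to return its loss when called with images and targets, and the GPU list is illustrative:

```python
import torch.nn as nn

def train_data_parallel(model, train_loader, optimizer, device_ids=(0, 1, 2, 3)):
    """nn.DataParallel keeps the model on the master GPU, replicates it to the
    other GPUs, splits each batch along the batch dimension, runs the forward
    passes in parallel, gathers the per-GPU losses and updates the parameters
    on the master GPU before the next replication."""
    model = nn.DataParallel(model.cuda(device_ids[0]), device_ids=list(device_ids))
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = model(images.cuda(device_ids[0]), targets)  # batch split across GPUs
        loss.sum().backward()                              # sum the per-GPU losses
        optimizer.step()
```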
For a model, we require not only a good fit on the training data set (low training error) but also a good fit on unknown data (the prediction set), i.e. good generalization ability; the resulting test error is called the generalization error. The most intuitive manifestations are overfitting and underfitting, which describe two states of the model during training and are usually depicted as a curve. When training starts, the model is still learning, its performance on both the training set and the test set is poor, and it has not yet learned enough knowledge; it is in the underfitting state and the curve falls in the underfitting zone. As training proceeds, both the training error and the test error decrease. With further training the performance on the training set keeps improving, but after a certain point the training error keeps decreasing while the test error increases, and the model enters the overfitting zone. Regularization is used to improve the generalization ability of the model; its aim is to minimize the sum of the empirical risk and the complexity of the model, i.e. to minimize the structural risk. There are two kinds of explicit regularization: the first modifies the network structure or adjusts how the network is used, and the second directly modifies the loss function. Another kind, implicit regularization, does not perform regularization deliberately yet can achieve even better results; it consists of data-related operations, including normalization methods, data enhancement and label perturbation. In this patent, the implicit regularization method applied is data enhancement, which improves the generalization ability of the model.
The label-assignment method for screening positive samples involves two key techniques: preliminary screening and SimOTA. 1. There are two main modes of preliminary screening: judging by the centre point and judging by the target frame. a. Judging by the centre point. Rule: find all anchors whose anchor-box centre point falls within the rectangular range of the ground-truth boxes. For example, the upper-left corner (gt_l, gt_t) and lower-right corner (gt_r, gt_b) of each ground truth of each picture are calculated from the [x_center, y_center, w, h] of the ground truth; this determines the rectangular range of the ground truth, and suitable anchor frames are selected according to this range. Taking the centre point (x_center, y_center) of each anchor frame, the correspondence between the anchor frame and the ground truth is found by calculating the distances between the anchor centre point and the two corner points (gt_l, gt_t) and (gt_r, gt_b) of the labelled frame and judging whether they are all greater than 0: all anchors falling within the rectangular range of the ground truth can thus be extracted, because only when the centre point of the anchor box falls within the rectangular range are b_l, b_r, b_t and b_b all greater than 0. b. Judging by the target frame. In addition to judging by the distances between the anchor centre point and the sides of the ground truth, a method of judging by the target frame is provided. Rule: a square with a side length of 5 is set around the ground-truth centre point, and all anchor frames within this square are selected; the calculation is the same as above. Through these two selection modes the preliminary screening is finished, a set of candidate anchors is selected, and the next step of fine screening is carried out. 2. Fine screening uses the SimOTA method. Decomposing the process before and after SimOTA, the whole screening procedure is mainly divided into four stages: extraction of the preliminarily screened positive-sample information, loss function calculation, cost calculation, and SimOTA solving.
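The preliminary screening rules can be sketched as below; box and anchor formats, and the interpretation of "side length 5" as grid-cell units, are assumptions:

```python
import torch

def preliminary_screening(anchor_centers, gt_boxes, fixed_radius=2.5):
    """Sketch of the preliminary positive-sample screening described above.
    anchor_centers: (A, 2) anchor centre points (x, y); gt_boxes: (G, 4) ground
    truths as (x_center, y_center, w, h). fixed_radius = 2.5 corresponds to a
    square of side length 5. Returns an (A, G) candidate mask for SimOTA."""
    gt_cx, gt_cy = gt_boxes[:, 0], gt_boxes[:, 1]
    gt_l = gt_cx - gt_boxes[:, 2] / 2       # left   (gt_l)
    gt_t = gt_cy - gt_boxes[:, 3] / 2       # top    (gt_t)
    gt_r = gt_cx + gt_boxes[:, 2] / 2       # right  (gt_r)
    gt_b = gt_cy + gt_boxes[:, 3] / 2       # bottom (gt_b)

    ax = anchor_centers[:, 0].unsqueeze(1)  # (A, 1), broadcast against (G,)
    ay = anchor_centers[:, 1].unsqueeze(1)

    # rule a: anchor centre falls inside the ground-truth rectangle
    # (b_l, b_r, b_t, b_b all greater than 0)
    in_box = (ax - gt_l > 0) & (gt_r - ax > 0) & (ay - gt_t > 0) & (gt_b - ay > 0)

    # rule b: anchor centre falls inside a fixed square around the gt centre
    in_center = ((ax - gt_cx).abs() < fixed_radius) & ((ay - gt_cy).abs() < fixed_radius)

    return in_box | in_center
```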
When the cascade fusion-based target detection model is used, a video stream is first obtained and the neural network model is loaded; the output is the position of the bounding box of the tree-lodging-class target and the confidence of the target. A new batch of data is then collected and the model is used to detect this batch. The detection results are divided into two main classes, images with boxes and images without boxes; the images with boxes are divided into true-target images and false-alarm images, and the images without boxes can be divided into images in which a target is present but was not detected and images that genuinely contain no target. The false-alarm images are taken as negative samples, and the images that contain a target but in which no target was detected are taken as training samples; these images are then annotated and data-enhanced, and a new model is trained on the basis of the original model. Whether the accuracy of the new model meets the standard is then tested: if it does not, new data are added and the network is retrained with adjusted parameters; if the model accuracy meets the requirement and is optimal on the current training data, training is stopped. This procedure is repeated in a loop so that the model adapts to the complexity of samples in the real environment.
In some specific embodiments, target detection and early warning of tree lodging can be carried out through surveillance cameras. The specific method is as follows: first, the interface receives parameters such as the camera address, the algorithm type and the callback address; second, the interface starts a new process, begins capturing image frames from the camera video stream, stores the image frames in Redis, and notifies a monitoring program; third, the monitoring program receives the notification, obtains the image frames from Redis, creates an algorithm instance, and calls the algorithm instance to start analysing the image frames; fourth, the algorithm instance analyses the image frames according to the business logic, stores the analysis results in Redis, and notifies the monitoring program; fifth, the monitoring program receives the notification, takes out the results, and submits the analysis results to the service interface (callback). The algorithm called by this flow is the self-attention-based target detection framework; it is mainly used to detect tree lodging under fixed-point monitoring, to record and document it for verification by the management department, and to notify related personnel to arrive at the site in time for handling. It should be noted that the method provided in this application can be further extended to other suitable application environments and is not limited to detecting tree lodging.
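A minimal sketch of the frame-exchange part of this flow is shown below, under the assumption that frames and results are exchanged through Redis; the key names and the detector callable are hypothetical, not names defined by the patent:

```python
import json
import cv2
import redis

r = redis.Redis(host="localhost", port=6379)

def capture_worker(camera_url, frame_key="frames:cam1"):
    """Step 2: grab frames from the camera video stream and push them into Redis."""
    cap = cv2.VideoCapture(camera_url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        r.lpush(frame_key, cv2.imencode(".jpg", frame)[1].tobytes())

def analysis_worker(detector, frame_key="frames:cam1", result_key="results:cam1"):
    """Steps 3-5: pop frames from Redis, run the detector, store the results
    (a separate monitoring program would forward them to the callback address)."""
    while True:
        item = r.brpop(frame_key, timeout=5)
        if item is None:
            break
        _, frame_bytes = item
        detections = detector(frame_bytes)            # algorithm instance analysing a frame
        r.lpush(result_key, json.dumps(detections))
```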
Example two
A target detection method comprising:
acquiring an image to be detected containing a target to be detected, and inputting the image to be detected into a trained target detection model based on cascade fusion;
the backbone network in the target detection model based on cascade fusion carries out feature extraction on the image to be detected to obtain the feature of the image to be detected;
the neck network in the target detection model based on cascade fusion carries out further feature extraction on the image features to be detected to obtain compression features of the image to be detected;
and inputting the compression characteristics of the image to be detected into a head network of a target detection model based on cascade fusion to perform target category prediction and detection frame position prediction to obtain the category and the corresponding detection frame of each target to be detected in the image to be detected.
Example III
Based on the same conception, referring to fig. 8, the application further provides a device for constructing a target detection model based on cascade fusion, which comprises:
the construction module comprises: constructing a target detection model based on cascade fusion, wherein the target detection model based on cascade fusion comprises a backbone network, a neck network and a head network;
a first feature extraction module: acquiring at least one image to be detected containing an object to be detected as a training sample, marking the object to be detected by the training sample, inputting the object to be detected into a backbone network, and extracting features to obtain the image feature to be detected, wherein the backbone network is formed by connecting a backbone convolution layer and a plurality of cascade feature fusion layers in series, the output of the last cascade feature fusion layer is the image feature to be detected, the cascade feature fusion layer is formed by connecting a cascade fusion multi-head self-attention module and a cascade fusion module in series in sequence, and a double-branch downsampling module exists between any two adjacent cascade feature fusion layers;
And a second feature extraction module: the neck network performs feature extraction on the image features to be detected to obtain image compression features to be detected;
and a detection module: the head network is divided into a target category prediction branch and a detection frame position prediction branch, the detection frame position prediction branch carries out regression prediction on the compression characteristics of the image to be detected to obtain candidate frames containing detection frame position information of the object to be detected, the target category prediction branch carries out category prediction according to the compression characteristics of the image to be detected to obtain classification detection frames, and each candidate frame is multiplied by the corresponding classification detection frame after target prediction to output a classification result of the object to be detected.
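As a minimal sketch only, one possible PyTorch reading of the detection module's decoupled head is given below; the layer widths, activation choices, and the interpretation of "multiplying each candidate frame by the corresponding classification detection frame" as an element-wise product of an objectness score and the class scores are assumptions, not the definitive implementation.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Sketch of a decoupled head: a box-regression branch and a class-prediction branch
    whose outputs are combined multiplicatively (one reading of the description above)."""
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.box_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, 4 + 1, 1),             # 4 box offsets + 1 objectness per location
        )
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_classes, 1),       # per-class scores per location
        )

    def forward(self, x: torch.Tensor):
        box_out = self.box_branch(x)
        boxes, obj = box_out[:, :4], box_out[:, 4:5].sigmoid()
        cls = self.cls_branch(x).sigmoid()
        scores = cls * obj                          # candidate-box confidence times class score
        return boxes, scores
```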
Example IV
This embodiment also provides an electronic device, referring to fig. 9, comprising a memory 404 and a processor 402, the memory 404 having stored therein a computer program, the processor 402 being arranged to run the computer program to perform the steps of any of the method embodiments described above.
In particular, the processor 402 may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 404 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, the memory 404 may comprise a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 404 may include removable or non-removable (or fixed) media, where appropriate. The memory 404 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 404 is a Non-Volatile memory. In particular embodiments, the memory 404 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), an electrically rewritable ROM (EAROM) or FLASH memory, or a combination of two or more of these. Where appropriate, the RAM may be Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), and the DRAM may be Fast Page Mode DRAM (FPMDRAM), Extended Data Output DRAM (EDODRAM), Synchronous DRAM (SDRAM), or the like.
Memory 404 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions for execution by processor 402.
The processor 402 reads and executes the computer program instructions stored in the memory 404 to implement any of the above-described methods for constructing the object detection model based on cascade fusion.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402 and the input/output device 408 is connected to the processor 402.
The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission device 406 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
The input-output device 408 is used to input or output information. In this embodiment, the input information may be an image to be detected, and the output information may be category information of the target to be detected, confidence level of the target detection frame, and the like.
Alternatively, in the present embodiment, the above-mentioned processor 402 may be configured to execute the following steps by a computer program:
s101, constructing a target detection model based on cascade fusion, wherein the target detection model based on cascade fusion comprises a backbone network, a neck network and a head network;
s102, acquiring at least one image to be detected containing a target to be detected as a training sample, marking the target to be detected by the training sample, inputting the target to a backbone network, and carrying out feature extraction to obtain the image feature to be detected, wherein the backbone network is formed by connecting a backbone convolution layer and a plurality of cascade feature fusion layers in series, the output of the last cascade feature fusion layer is the image feature to be detected, the cascade feature fusion layer is formed by sequentially connecting a cascade fusion multi-head self-attention module and a cascade fusion module in series, and a double-branch downsampling module exists between any two adjacent cascade feature fusion layers;
S103, the neck network performs feature extraction on the image features to be detected to obtain image compression features to be detected;
s104, the head network is divided into a target category prediction branch and a detection frame position prediction branch, the detection frame position prediction branch carries out regression prediction on the compression characteristics of the image to be detected to obtain candidate frames containing detection frame position information of the target to be detected, the target category prediction branch carries out category prediction according to the compression characteristics of the image to be detected to obtain classification detection frames, and each candidate frame is multiplied by the corresponding classification detection frame after target prediction to output a classification result of the target to be detected.
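Under the assumption of a PyTorch implementation, the overall skeleton implied by steps S101 to S102 could be sketched as below; the kernel sizes, channel widths and the CascadeFusionLayer placeholder (standing in for the cascade fusion multi-head self-attention module followed by the cascade fusion module) are illustrative assumptions, while the dual-branch downsampling follows the two-branch structure (a pooling, convolution and normalization branch fused with a convolution branch) described for that module elsewhere in this application.

```python
import torch
import torch.nn as nn

class DualBranchDownsample(nn.Module):
    """Dual-branch downsampling: a pool + 1x1 conv + norm branch fused with a strided-conv branch.
    Element-wise addition is assumed for the 'feature fusion'; even spatial sizes are assumed."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.MaxPool2d(2, 2),
            nn.Conv2d(in_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch),
        )
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        return self.branch1(x) + self.branch2(x)    # fuse the two downsampling results

class Backbone(nn.Module):
    """Backbone = stem convolution + cascade feature fusion layers, with dual-branch
    downsampling between any two adjacent layers; CascadeFusionLayer is a placeholder here."""
    def __init__(self, channels=(64, 128, 256, 512), cascade_layer=nn.Identity):
        super().__init__()
        self.stem = nn.Conv2d(3, channels[0], 3, stride=2, padding=1)
        stages = []
        for i, ch in enumerate(channels):
            stages.append(cascade_layer())          # cascade fusion MHSA module + cascade fusion module
            if i + 1 < len(channels):
                stages.append(DualBranchDownsample(ch, channels[i + 1]))
        self.stages = nn.Sequential(*stages)

    def forward(self, x):
        return self.stages(self.stem(x))            # output of the last cascade feature fusion layer
```

A concrete CascadeFusionLayer implementing the attention and fusion submodules would replace nn.Identity in a full model; the sketch only fixes the wiring of the stages and the downsampling between them.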
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets, and/or macros can be stored in any apparatus-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. In this regard, it should also be noted that any block of the logic flow as in fig. 9 may represent a procedure step, or interconnected logic circuits, blocks and functions, or a combination of procedure steps and logic circuits, blocks and functions. The software may be stored on a physical medium such as a memory chip or memory block implemented within a processor, a magnetic medium such as a hard disk or floppy disk, and an optical medium such as, for example, a DVD and its data variants, a CD, etc. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The foregoing examples merely represent several embodiments of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the present application in any way. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the present application, and such modifications and improvements fall within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. The method for constructing the target detection model based on cascade fusion is characterized by comprising the following steps of:
constructing a target detection model based on cascade fusion, wherein the target detection model based on cascade fusion comprises a backbone network, a neck network and a head network;
acquiring at least one image to be detected containing an object to be detected as a training sample, marking the object to be detected by the training sample, inputting the object to be detected into a backbone network, and extracting features to obtain the image feature to be detected, wherein the backbone network is formed by connecting a backbone convolution layer and a plurality of cascade feature fusion layers in series, the output of the last cascade feature fusion layer is the image feature to be detected, the cascade feature fusion layer is formed by connecting a cascade fusion multi-head self-attention module and a cascade fusion module in series in sequence, and a double-branch downsampling module exists between any two adjacent cascade feature fusion layers;
The neck network performs feature extraction on the image features to be detected to obtain image compression features to be detected;
the head network is divided into a target category prediction branch and a detection frame position prediction branch, the detection frame position prediction branch carries out regression prediction on the compression characteristics of the image to be detected to obtain candidate frames containing detection frame position information of the object to be detected, the target category prediction branch carries out category prediction according to the compression characteristics of the image to be detected to obtain classification detection frames, and each candidate frame is multiplied by the corresponding classification detection frame after target prediction to output a classification result of the object to be detected.
2. The method for constructing the cascade fusion-based target detection model according to claim 1, wherein the cascade fusion multi-head self-attention module is formed by sequentially connecting a characteristic remodelling layer, a first multi-head self-attention layer, a second multi-head self-attention layer and a third multi-head self-attention layer in series, the input of the cascade fusion multi-head self-attention module is defined as a first input characteristic, the first input characteristic is subjected to characteristic remodelling in the characteristic remodelling layer to obtain remodelling characteristics, the remodelling characteristics are input to the first multi-head self-attention layer to obtain a first attention result, the first attention result and the remodelling characteristics are subjected to characteristic splicing to obtain a first splicing result, the first splicing result is input to the second multi-head self-attention layer to obtain a second attention result, the second attention result and the first splicing result are subjected to characteristic splicing to obtain a second splicing result, the second splicing result is input to the third multi-head self-attention layer to obtain a third attention result, and the third attention result and the second splicing result are subjected to characteristic splicing to obtain the output of the cascade fusion multi-head self-attention module.
3. The method for constructing the target detection model based on cascade fusion according to claim 1, wherein the cascade fusion module is formed by sequentially connecting a first transfer residual bottleneck module, a first downsampling layer, a second transfer residual bottleneck module, a second downsampling layer, a coarse-fine granularity fusion module and a conversion module in series, wherein the input of the cascade fusion module is defined as a second input characteristic, the second input characteristic is input into the first transfer residual bottleneck module to obtain a first transfer result, the first transfer result is downsampled in the first downsampling layer to obtain a first downsampling result, the first downsampling result is input into the second transfer residual bottleneck module to obtain a second transfer result, the second transfer result is downsampled in the second downsampling layer to obtain a second downsampling result, the second downsampling result is input into the coarse-fine granularity fusion module to obtain a coarse-fine granularity fusion result, and the coarse-fine granularity fusion result and the second downsampling result are input into the conversion module to obtain the output of the cascade fusion module.
4. The method for constructing the target detection model based on cascade fusion according to claim 3, wherein the coarse-fine granularity fusion module is formed by sequentially connecting a coarse-fine granularity window attention layer, a coarse-fine granularity depth separable convolution layer, a first coarse-fine granularity convolution layer and a second coarse-fine granularity convolution layer in series, the second downsampling result is input into the coarse-fine granularity window attention layer to obtain a window attention result, the window attention result and the second downsampling result are subjected to feature fusion and then input into the coarse-fine granularity depth separable convolution layer to obtain a coarse-fine granularity depth separable convolution result, the coarse-fine granularity depth separable convolution result and the window attention result are subjected to feature fusion to obtain a preliminary coarse-fine granularity fusion result, and the preliminary coarse-fine granularity fusion result sequentially passes through the first coarse-fine granularity convolution layer and the second coarse-fine granularity convolution layer to obtain the final coarse-fine granularity fusion result, which is the output of the coarse-fine granularity fusion module.
5. The method for constructing the target detection model based on cascade fusion according to claim 1, wherein the double-branch downsampling module comprises a first branch and a second branch which are connected in parallel, the first branch comprises a first branch pooling layer, a first branch convolution layer and a first branch normalization layer, the second branch comprises a second branch convolution layer, the input feature of the double-branch downsampling module is defined as a third input feature, the third input feature obtains a first branch downsampling result through the first branch, the third input feature obtains a second branch downsampling result through the second branch, and the first branch downsampling result and the second branch downsampling result are subjected to feature fusion to obtain a double-branch downsampling result.
6. The method for constructing the target detection model based on cascade fusion according to claim 1, wherein the target class prediction branch is formed by sequentially connecting a target class convolution module, a target class depth separation convolution layer and a classifier in series, the detection frame position prediction branch is formed by sequentially connecting a detection frame position convolution layer, a detection frame position depth separation convolution module and a frame regression layer in series, the target class depth separation convolution module is formed by sequentially connecting a depth separation convolution layer, a first normalization layer, a first activation function layer, a convolution layer, a second normalization layer and a second activation function layer in series, and the detection frame position depth separation convolution module and the target class depth separation convolution module are identical in structure.
7. A method of detecting an object, comprising:
acquiring an image to be detected containing a target to be detected, and inputting the image to be detected into a trained target detection model based on cascade fusion;
the backbone network in the target detection model based on cascade fusion carries out feature extraction on the image to be detected to obtain the feature of the image to be detected;
the neck network in the target detection model based on cascade fusion carries out further feature extraction on the image features to be detected to obtain compression features of the image to be detected;
and inputting the compression characteristics of the image to be detected into a head network of a target detection model based on cascade fusion to perform target category prediction and detection frame position prediction to obtain the category and the corresponding detection frame of each target to be detected in the image to be detected.
8. An apparatus for constructing a target detection model based on cascade fusion, comprising:
a construction module: used for constructing a target detection model based on cascade fusion, wherein the target detection model based on cascade fusion comprises a backbone network, a neck network and a head network;
a first feature extraction module: acquiring at least one image to be detected containing an object to be detected as a training sample, marking the object to be detected by the training sample, inputting the object to be detected into a backbone network, and extracting features to obtain the image feature to be detected, wherein the backbone network is formed by connecting a backbone convolution layer and a plurality of cascade feature fusion layers in series, the output of the last cascade feature fusion layer is the image feature to be detected, the cascade feature fusion layer is formed by connecting a cascade fusion multi-head self-attention module and a cascade fusion module in series in sequence, and a double-branch downsampling module exists between any two adjacent cascade feature fusion layers;
And a second feature extraction module: the neck network performs feature extraction on the image features to be detected to obtain image compression features to be detected;
and a detection module: the head network is divided into a target category prediction branch and a detection frame position prediction branch, the detection frame position prediction branch carries out regression prediction on the compression characteristics of the image to be detected to obtain candidate frames containing detection frame position information of the object to be detected, the target category prediction branch carries out category prediction according to the compression characteristics of the image to be detected to obtain classification detection frames, and each candidate frame is multiplied by the corresponding classification detection frame after target prediction to output a classification result of the object to be detected.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform a method of constructing a cascade fusion based object detection model according to any of claims 1-6 or an object detection method according to claim 7.
10. A readable storage medium, characterized in that the readable storage medium has stored therein a computer program comprising program code for controlling a process to execute a process comprising a method of constructing a cascade fusion based object detection model according to any one of claims 1-6 or an object detection method according to claim 7.
CN202310291522.0A 2023-03-16 2023-03-16 Target detection model based on cascade fusion, and construction method, device and application thereof Pending CN116310688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310291522.0A CN116310688A (en) 2023-03-16 2023-03-16 Target detection model based on cascade fusion, and construction method, device and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310291522.0A CN116310688A (en) 2023-03-16 2023-03-16 Target detection model based on cascade fusion, and construction method, device and application thereof

Publications (1)

Publication Number Publication Date
CN116310688A true CN116310688A (en) 2023-06-23

Family

ID=86825503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310291522.0A Pending CN116310688A (en) 2023-03-16 2023-03-16 Target detection model based on cascade fusion, and construction method, device and application thereof

Country Status (1)

Country Link
CN (1) CN116310688A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058507A (en) * 2023-08-17 2023-11-14 浙江航天润博测控技术有限公司 Fourier convolution-based visible light and infrared image multi-scale feature fusion method
CN117058507B (en) * 2023-08-17 2024-03-19 浙江航天润博测控技术有限公司 Fourier convolution-based visible light and infrared image multi-scale feature fusion method
CN117315282A (en) * 2023-10-16 2023-12-29 深圳市锐明像素科技有限公司 Image processing method, processing device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
Movshovitz-Attias et al. How useful is photo-realistic rendering for visual learning?
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN110910391B (en) Video object segmentation method for dual-module neural network structure
US12026938B2 (en) Neural architecture search method and image processing method and apparatus
CN106951830B (en) Image scene multi-object marking method based on prior condition constraint
CN116310688A (en) Target detection model based on cascade fusion, and construction method, device and application thereof
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN114359851A (en) Unmanned target detection method, device, equipment and medium
US20220261659A1 (en) Method and Apparatus for Determining Neural Network
KR102140805B1 (en) Neural network learning method and apparatus for object detection of satellite images
CN111274981B (en) Target detection network construction method and device and target detection method
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN113223614A (en) Chromosome karyotype analysis method, system, terminal device and storage medium
CN112434618A (en) Video target detection method based on sparse foreground prior, storage medium and equipment
CN110837786A (en) Density map generation method and device based on spatial channel, electronic terminal and medium
Yadav et al. An improved deep learning-based optimal object detection system from images
CN115995042A (en) Video SAR moving target detection method and device
CN113554656B (en) Optical remote sensing image example segmentation method and device based on graph neural network
CN111738069A (en) Face detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination