CN116740516A - Target detection method and system based on multi-scale fusion feature extraction - Google Patents

Target detection method and system based on multi-scale fusion feature extraction

Info

Publication number
CN116740516A
Authority
CN
China
Prior art keywords
feature
network
fusion
layers
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310580498.2A
Other languages
Chinese (zh)
Inventor
陈振学
张馨悦
刘成云
杨悦
梁田
王文成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202310580498.2A
Publication of CN116740516A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and system based on multi-scale fusion feature extraction, wherein the method comprises: inputting a preprocessed original image into a target detection network, extracting multi-scale fusion feature maps, and outputting a target detection result. The preprocessed original image is input into a backbone network to obtain three feature maps of different scales, which are then passed through an SWFC network to obtain three first-fused feature maps of different scales. The smallest-scale feature map is up-sampled three times, the feature maps of all scales are input into an LBiFN network and divided into two groups for feature fusion, the two fused feature maps are restored to the sizes of the corresponding original input feature maps, the two smallest-scale feature maps are discarded, and a self-attention module outputs the weighted four multi-scale fusion feature maps of different scales. Through multi-scale fusion and an attention mechanism, the invention fuses features of multiple different scales and improves the accuracy of target detection.

Description

Target detection method and system based on multi-scale fusion feature extraction
Technical Field
The invention relates to the technical field of computer vision images, in particular to a target detection method and system based on multi-scale fusion feature extraction.
Background
In recent years, object detection has played an increasingly important role in the field of computer vision and has attracted growing research attention. In an object detection task, feature extraction refers to extracting important features such as position and semantic information from an original image, providing strong support for subsequently locating different objects in the image, regressing their bounding boxes, and classifying them. With the continuous development of computer vision technology, deep-learning algorithms based on feature extraction have been widely applied in many fields, such as autonomous driving, security monitoring, and medical image analysis.
In this field, several deep-learning-based feature extraction algorithms have already achieved considerable success, such as the Fast R-CNN, YOLO, and SSD algorithms. These algorithms offer different strengths in terms of accuracy, speed, and efficiency, and can be selected and tuned according to the specific application scenario.
In the prior art, the FPN (Feature Pyramid Network) was proposed for target detection. In a conventional network, the highest feature layer abstracts the original image excessively, so that high-level features carry rich semantic information but lose position information. To address this problem, many multi-scale feature fusion methods have been proposed, which fuse the position information of low-level features with the semantic information of high-level features to form new feature layers combining the advantages of all feature layers; such methods have been widely adopted. On this basis, a series of improved FPN models have been developed, such as PANet (Path Aggregation Network), ASFF (Adaptively Spatial Feature Fusion), NAS-FPN (Neural Architecture Search FPN, which searches for an optimal architecture by reinforcement learning), and BiFPN (weighted Bidirectional Feature Pyramid Network). In the field of target detection, model accuracy can be significantly improved by multi-scale feature fusion.
In the backbone of a network model, deep high-level features carry more semantic information, while shallow low-level features carry more content description information; networks such as FPN and PANet integrate features through lateral connections, which has promoted the development of object detection. In other words, low-level and high-level information are complementary for object detection, and how the pyramid representation is built from high-level and low-level features determines detection performance. To ensure final detection performance, the integrated features should contain balanced information from every resolution. However, the sequential integration used in the above methods makes the integrated features focus more on adjacent resolutions and less on the others, and the semantic information contained in non-adjacent layers is diluted at each fusion step as information flows through the network, degrading the final detection performance. In practice, networks such as FPN and PANet over-fuse the features of different feature layers, confusing high-level position features with low-level semantic features, which affects subsequent processing such as target regression and ultimately leads to poor detection performance.
Furthermore, attention mechanisms are increasingly used in feature extraction. An attention mechanism lets a model focus more on important information by dynamically weighting different data points as it processes data, enabling the model to recognize and process important information more efficiently and accurately and thereby improving performance. In deep learning, attention mechanisms have been widely applied to tasks such as natural language processing, image classification, and object detection. In natural language processing, attention helps a model understand the key words in a sentence and relate them to a question or to other sentences; in image classification, it helps the model focus on important regions of an image and thus improves classification accuracy; in object detection, it helps the model attend to features related to the target, improving detection accuracy and speed. Common attention modules include SE (Squeeze-and-Excitation), SK (Selective Kernel), SA (Spatial Attention), CA (Channel Attention), Non-local, GC (Global Context), and MHA (Multi-Head Attention).
Disclosure of Invention
Aiming at the shortcomings of traditional feature pyramid networks in feature fusion, the invention provides a target detection method and system based on multi-scale fusion feature extraction, which reworks the feature extraction layers on the basis of an existing feature extraction network and combines them with a global attention mechanism to realize multi-scale fusion feature extraction, thereby further improving the accuracy of target detection.
In a first aspect, the present disclosure provides a target detection method based on multi-scale fusion feature extraction.
A target detection method based on multi-scale fusion feature extraction comprises the following steps:
acquiring an original image to be detected, and preprocessing the original image to be detected;
inputting the preprocessed original image into a target detection network, extracting a multi-scale fusion feature map, and outputting a target detection result;
The target detection network comprises a backbone network, an SWFC network and an LBiFN network. The preprocessed original image is input into the backbone network to obtain three feature maps of different scales; the three feature maps of different scales are input into the SWFC network to obtain three first-fused feature maps of different scales; the smallest-scale feature map is up-sampled three times, and the feature maps of all scales are input into the LBiFN network, where the input feature maps are divided into two groups by scale from large to small and fused within each group; the two fused feature maps are restored to the sizes of the corresponding original input feature maps, the two smallest-scale feature maps are discarded, and weighted multi-scale fusion feature maps are output through a parameter-shared self-attention module, yielding four multi-scale fusion feature maps of different scales.
According to a further technical scheme, the preprocessing comprises the following steps:
cropping the input original image to unify the image size;
randomly processing the cropped original image, including flipping, occlusion, contrast adjustment, and image format conversion.
According to a further technical scheme, the backbone network is built on the Resnet-50 network architecture and comprises a CNN module and a residual network module connected in sequence; the residual network module consists of 4 residual blocks whose ratio from top to bottom is 1:1:4:1, and the last 3 residual blocks output three feature maps of different scales respectively.
In a further technical scheme, in each residual block, the input feature map is processed sequentially by a 1*1 convolution, a 3*3 convolution and a 1*1 convolution, the result is fused with the input feature map, and the fused feature map is output.
In a further technical solution, in the SWFC network,
inputting the three feature maps of different scales into the SWFC network, and adjusting the number of channels of each layer through a 1*1 convolution to obtain feature map l3, feature map l4 and feature map l5;
up-sampling and down-sampling each feature map to generate three groups of feature maps matching the input scales;
and fusing the feature maps of the same scale to obtain a first multi-scale-fused feature map M3, feature map M4 and feature map M5.
In a further technical solution, in the LBiFN network,
performing three successive 3*3 convolution operations on the smallest-scale feature map among the three first-fused feature maps of different scales, and outputting 3 up-sampled feature maps;
inputting the 3 feature maps, together with the three first-fused feature maps of different scales, into the LBiFN network; the LBiFN network adopts cross-layer connections, divides the 6 feature maps into two groups by scale from large to small, and performs multi-scale feature fusion within each group at the largest feature map dimension of the group to obtain two fused feature maps;
restoring the two fused feature maps to the sizes of the corresponding input feature maps, discarding the two smallest-scale feature maps, and outputting four feature maps of different scales;
and inputting the four feature maps respectively into an attention module for attention extraction to generate weighted multi-scale fusion feature maps, i.e., finally outputting four multi-scale fusion feature maps in which semantic information and position information are mildly fused.
In a further technical solution, in the attention module,
the input feature map is first processed by a 1*1 convolution and softmax regression and then multiplied by the original input feature map, and the multiplication result is output;
the output multiplication result passes through a 1*1 convolution, LayerNorm and ReLU regularization operations, and another 1*1 convolution to output a feature map;
the output feature map is fused with the original input feature map, and an attention map of the same dimension is output.
In a second aspect, the present disclosure provides a target detection system based on multiscale fusion feature extraction.
A target detection system based on multiscale fusion feature extraction, comprising:
the image acquisition module is used for acquiring an original image to be detected;
the image preprocessing module is used for preprocessing an original image to be detected;
the target detection module is used for inputting the preprocessed original image into a target detection network, extracting a multi-scale fusion feature map and outputting a target detection result;
The target detection network comprises a backbone network, an SWFC network and an LBiFN network. The preprocessed original image is input into the backbone network to obtain three feature maps of different scales; the three feature maps of different scales are input into the SWFC network to obtain three first-fused feature maps of different scales; the smallest-scale feature map is up-sampled three times, and the feature maps of all scales are input into the LBiFN network, where the input feature maps are divided into two groups by scale from large to small and fused within each group; the two fused feature maps are restored to the sizes of the corresponding original input feature maps, the two smallest-scale feature maps are discarded, and weighted multi-scale fusion feature maps are output through a parameter-shared self-attention module, yielding four multi-scale fusion feature maps of different scales.
In a third aspect, the present disclosure also provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of the first aspect.
The one or more of the above technical solutions have the following beneficial effects:
1. The invention provides a target detection method and system based on multi-scale fusion feature extraction, which strengthens the original features with comprehensively balanced semantic features, so that each resolution in the pyramid obtains equal information from the other resolutions. This balances the information flow, makes the features more discriminative, solves the problem of information being diluted during fusion in existing feature pyramid networks, and effectively improves detection performance.
2. Based on the proposed feature extraction network, the invention fuses low-level features and high-level features separately through grouped feature fusion, solving the problem that existing networks such as FPN and PANet over-fuse the features of different layers, confuse high-level position features with low-level semantic features, affect subsequent processing such as target regression, and ultimately degrade target detection performance.
3. Through multi-scale fusion and attention mechanisms, the invention fuses features of multiple different scales, so that the position information, semantic information and other information of the extracted feature layers are fully fused, further improving the accuracy of target detection and, to a certain extent, improving the accuracy and precision of target detection in different vision fields.
4. The invention extracts multi-scale feature maps through a Resnet network built on residual modules, alleviating the vanishing-gradient problem in deep neural networks; the SWFC network fuses multi-scale features, addressing the problems caused by target scale variation in the feature extraction task; and the LBiFN network with its attention module fuses, transfers and optimizes feature information efficiently through cross-layer connections and feature adjustment, so that feature information at different levels can be transferred and shared, and information extracted from feature maps of multiple convolution levels and resolutions is fused together, producing more accurate and more representative feature maps and laying a foundation for improving target detection precision.
5. The target detection method can be used for various different image scenes, such as road traffic scenes, indoor and outdoor scenes, city street scenes and the like.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of a target detection method based on multi-scale fusion feature extraction according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the Resnet backbone network in the method according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of a residual block in the method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the structure of a SWFC network in the method according to the embodiment of the invention;
FIG. 5 is a schematic structural diagram of the LBiFN network in the method according to the embodiment of the present invention;
FIG. 6 is a schematic structural diagram of the attention module in the method according to the embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
In the technical field of computer vision, target detection requires extracting features from a visual image: the image is input into a feature extraction neural network, and as the network deepens, the information expressed by the output feature layers becomes more and more abstract, so that a large amount of position information is lost and the detection accuracy obtained by traditional feature map processing is very low. Therefore, this embodiment integrates semantic information and position information by fusing multi-scale feature layers, which improves model accuracy and realizes effective feature extraction. In particular, existing FPN and PANet networks over-fuse the features of different feature layers, confusing high-level position features with low-level semantic features and affecting subsequent processing such as target regression; the feature extraction network provided by this embodiment solves this problem by fusing low-level features and high-level features separately through grouped feature fusion.
The embodiment provides a target detection method based on multi-scale fusion feature extraction, as shown in fig. 1, comprising the following steps:
step S1, acquiring an original image to be detected, and preprocessing the original image to be detected;
s2, inputting the preprocessed original image into a target detection network, extracting a multi-scale fusion feature map, and outputting a target detection result;
The target detection network comprises a backbone network, an SWFC network and an LBiFN network. The preprocessed original image is input into the backbone network to obtain three feature maps of different scales; the three feature maps of different scales are input into the SWFC network to obtain three first-fused feature maps of different scales; the smallest-scale feature map is up-sampled three times, and the feature maps of all scales are input into the LBiFN network, where the input feature maps are divided into two groups by scale from large to small and fused within each group; the two fused feature maps are restored to the sizes of the corresponding original input feature maps, the two smallest-scale feature maps are discarded, and weighted multi-scale fusion feature maps are output through a parameter-shared self-attention module, yielding four multi-scale fusion feature maps of different scales.
Specifically, in the target detection method based on multi-scale fusion feature extraction disclosed in this embodiment, in step S1, an original image to be detected is first obtained, and preprocessing is performed on the original image to be detected, where the preprocessing includes:
(1) Cropping the input original image to unify the image size;
(2) Randomly processing the image, such as flipping, occlusion, contrast changes, and image format conversion.
In this embodiment, the input original image is preprocessed as follows: the image is first scaled to a fixed size while keeping the aspect ratio unchanged (in this embodiment, the fixed size is 1333 in height or 800 in width); the image is then randomly flipped horizontally, and the image format is converted from BGR to RGB. Processing the input original images in this way forms a randomized data set, which improves the generalization performance of the network and helps avoid overfitting.
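A minimal sketch of this preprocessing step, written with OpenCV and NumPy, is given below. The exact resize rule (scaling so the image fits within the 1333/800 bounds while keeping the aspect ratio) and the flip probability of 0.5 are illustrative assumptions rather than details fixed by this embodiment.

```python
import random

import cv2
import numpy as np

def preprocess(image_bgr: np.ndarray,
               long_side: int = 1333,
               short_side: int = 800,
               flip_prob: float = 0.5) -> np.ndarray:
    """Resize while keeping the aspect ratio, randomly flip horizontally,
    and convert BGR to RGB (sketch; parameters are assumptions)."""
    h, w = image_bgr.shape[:2]
    scale = min(short_side / min(h, w), long_side / max(h, w))
    resized = cv2.resize(image_bgr, (int(round(w * scale)), int(round(h * scale))))
    if random.random() < flip_prob:
        resized = cv2.flip(resized, 1)               # random horizontal flip
    return cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)  # BGR -> RGB
```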
In step S2, the preprocessed original image is input into a target detection network, a multi-scale fusion feature map is extracted, and a target detection result is output. The focus of the step is on the constructed target detection network, the multiscale fusion feature diagram is extracted through the target detection network, the target detection is carried out based on the extracted feature diagram, and the target detection result is output.
The target detection network constructed in this embodiment includes a backbone network, an SWFC network, and an LBiFN network. Specifically, the preprocessed original image is input into the target detection network; first, three feature maps of different scales are obtained through the Resnet backbone network; then, the three feature maps of different scales are input into the SWFC network to obtain three first-fused feature maps of different scales; finally, the smallest-scale feature map is up-sampled three times, the feature maps of all scales are input into the LBiFN network and fused in two groups, weighted multi-scale fusion feature maps are output through the parameter-shared self-attention module, and the smallest-scale feature maps are discarded, obtaining four multi-scale fusion feature maps of different scales.
The Resnet backbone network, shown in FIG. 2, is built on the existing Resnet-50 network architecture and comprises a CNN module and a residual network module connected in sequence; the residual network module consists of 4 residual blocks, namely Block2, Block3, Block4 and Block5. After the preprocessed original image is input into the backbone network, features are extracted by the CNN module and the 4 residual blocks, and the last 3 residual blocks output three feature maps of different scales, namely the Feat3, Feat4 and Feat5 layers; these 3 multi-scale feature maps serve as the input of the subsequent networks and modules.
After the feature map of the original image is extracted by the CNN module, it is input into the residual blocks of the residual network module. In a residual block, the feature map first passes through a 1*1 convolution, which reorganizes the information from the previous module; after a linear rectification function (ReLU, Rectified Linear Unit), the feature map is processed by a 3*3 convolution; finally, a 1*1 convolution transforms the number of channels, as shown in FIG. 3.
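The residual unit described above can be sketched in PyTorch as follows; the channel counts, the omission of batch normalization, and the exact placement of the activations are simplifying assumptions rather than the precise configuration of the embodiment.

```python
import torch
import torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    """Sketch of the residual unit: 1*1 -> 3*3 -> 1*1 convolutions whose output
    is fused (added) with the input through a skip connection."""
    def __init__(self, channels: int, mid_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, mid_channels, kernel_size=1, bias=False)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid_channels, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))    # 1*1: reorganise information from the previous module
        out = self.relu(self.conv2(out))  # 3*3: spatial feature extraction
        out = self.conv3(out)             # 1*1: transform the channel number back
        return self.relu(out + x)         # residual connection with the input feature map
```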
The embodiment adopts a residual connection mode, and can effectively solve the problems of gradient disappearance and the like in the deep neural network. Residual connection can avoid the occurrence of the problems of gradient disappearance and the like in the module by connecting the residual errors of the input and output of the module, and can improve the network training efficiency. Unlike conventional neural networks, the residual module takes the difference between the input data and the output data as the output of the module, so that the network can learn the residual transformation instead of directly learning the transformation of the original input data. The method can help the network to better adapt to complex data distribution, and improves the expression capacity and generalization capacity of the network.
Further, the proportion of the stage modules is adjusted in the overall Resnet-50 architecture: the ratio of the 4 residual blocks from bottom to top is adjusted to 1:1:4:1. In this embodiment, the blocks contain 4, 4, 16 and 4 residual units respectively (matching the 1:1:4:1 ratio), and a feature map of the corresponding scale is output after each residual block.
The backbone network in this embodiment adopts a variant of the existing Resnet-50 framework, which increases the depth of the neural network and increases the degree of distinction between feature layers of different scales during feature extraction. In addition, convolution layers are used to modify the feature channel size of each layer, reducing the computational cost by using a smaller number of channels.
The Resnet backbone network provided in this embodiment is composed of multiple residual blocks, so it has better stability than other deep networks: it can effectively alleviate problems such as vanishing gradients in deep neural networks, improves the adaptability of the network by learning residual transformations, and avoids problems such as overfitting during training. In addition, the structure of the ResNet backbone network can be selected and adjusted according to task requirements and computing resources; for example, additional convolution and pooling layers can be added to increase the network depth and receptive field so as to better adapt to different application scenarios. The Resnet backbone network mainly extracts and reduces the dimensionality of the features of the input image, converting the image into feature vectors that represent characteristics of different regions and scales of the image, such as texture, color, and shape, and that provide an important basis for the subsequent regression and classification operations. The backbone network is obtained by improving the Resnet-50 network, and its structure can be selected and adjusted according to task requirements and computing resources.
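As an illustration only, extracting the three multi-scale feature maps can be sketched with torchvision's stock ResNet-50 standing in for the adjusted 1:1:4:1 variant described above; the variant itself is not reproduced here.

```python
import torch
import torch.nn as nn
import torchvision

class ResNetBackbone(nn.Module):
    """Sketch: take the outputs of the last three residual stages as the
    multi-scale feature maps Feat3, Feat4 and Feat5."""
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet50(weights=None)   # stand-in for the adjusted variant
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # CNN module
        self.block2, self.block3, self.block4, self.block5 = (
            net.layer1, net.layer2, net.layer3, net.layer4)

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        x = self.block2(x)
        feat3 = self.block3(x)      # 1/8 of the input resolution
        feat4 = self.block4(feat3)  # 1/16 of the input resolution
        feat5 = self.block5(feat4)  # 1/32 of the input resolution
        return feat3, feat4, feat5
```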
The backbone network outputs three feature maps of different scales, namely feature map Feat3, feature map Feat4 and feature map Feat5, and these 3 multi-scale feature maps are used as the inputs of the SWFC network. The SWFC (Scale-Wise Feature Concatenation) network, i.e., the feature concatenation network, extracts features of different scales from feature maps of different levels and then fuses them, as shown in FIG. 4, to obtain a set of features with stronger characterization capability. Specifically, the three feature maps of different scales (C3, C4 and C5 in FIG. 5) are input into the SWFC network, and the number of channels of each layer is adjusted to 256 through fast lateral connections, i.e., a 1*1 convolution per layer, obtaining feature maps l3, l4 and l5; then, each feature map is up- and down-sampled to generate three groups of feature maps matching the input scales, i.e., a feature map P3, a feature map P4 and a feature map P5 are obtained for each scale; finally, the feature maps of the same scale are fused to obtain the first multi-scale-fused feature map M3, feature map M4 and feature map M5.
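A sketch of this SWFC step is given below. The lateral 1*1 convolutions and the resampling of every level to every scale follow the description above; summation as the fusion operation, nearest-neighbour resampling, and the input channel counts are assumptions, since the embodiment only states that the same-scale feature maps are fused.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SWFC(nn.Module):
    """Sketch of scale-wise feature concatenation: unify channels with 1*1
    convolutions, resample every level to every scale, and fuse same-scale maps."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels: int = 256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats):                            # feats = (C3, C4, C5)
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]  # l3, l4, l5
        fused = []
        for ref in laterals:                             # one fused map per input scale
            size = ref.shape[-2:]
            resampled = [F.interpolate(l, size=size, mode='nearest') for l in laterals]
            fused.append(torch.stack(resampled, dim=0).sum(dim=0))    # M3, M4, M5
        return fused
```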
Through the process of up-and-down sampling and then merging in the SWFC network, the feature expression of the original feature layers with different scales is merged, so that the extracted features have better characterization capability.
The embodiment solves the problem caused by the target scale change in the feature extraction task through the SWFC network. In computer vision, targets of the same category may have different scales due to differences in size, shape, etc. of different targets, and thus features of the targets at different scales need to be extracted and processed.
The SWFC network is an improvement on the FPN (Feature Pyramid Network). It comprises several branches, each containing a feature extraction network and a feature pyramid network: the feature extraction network extracts features from the input image, and the feature pyramid network performs multi-scale fusion on the feature maps so as to strengthen the original features with integrated, balanced semantic features. In this way, each resolution in the pyramid can obtain equal information from the other resolutions, balancing the information flow, making the features more discriminative, and overcoming the shortcomings of the existing FPN network.
The SWFC network uses a parallel structure and connects the feature maps of all branches, so that multi-scale information can be utilized more fully and segmentation accuracy can be effectively improved. A pyramid structure is used to extract features of different scales from feature maps of different levels, and the features are then fused to obtain a set of more representative features. These features not only improve the detection rate of small targets, but also improve target localization accuracy and the accuracy of semantic segmentation.
After the first-fused feature map M3, feature map M4 and feature map M5 are obtained, they are input into the LBiFN (Light-Balanced Feature Net), i.e., the light balanced feature network, whose structure is shown in FIG. 5. Three up-sampling operations are performed on the smallest input feature layer to increase the number of feature layers and highlight the differences between the groups of feature layers, and the resulting 6 feature maps are divided into two groups by scale from large to small for feature fusion, thereby generating more accurate and more representative feature maps and further improving the feature fusion effect at different scales. In addition, this grouped fusion avoids over-fusing the position information with the semantic information, and the network finally outputs features in which semantic information and position information are mildly fused.
Specifically, among the obtained feature map M3, feature map M4 and feature map M5, the feature map with the smallest scale is processed by three successive 3*3 convolutions, and three up-sampled feature maps are output; these 3 feature maps, together with the three feature maps of different scales obtained in the previous step, are input into the LBiFN network. The network adopts cross-layer connections, divides the 6 feature maps into two groups by scale from large to small, and performs multi-scale feature fusion within each group at the largest feature map dimension of the group, obtaining two fused feature maps. The two fused feature maps are then restored to the sizes of the corresponding original input feature maps, yielding 6 mildly fused feature maps corresponding one-to-one to the original input scales; the two smallest-scale feature maps are discarded, leaving 4 output feature layers. Finally, each output feature map is input into an attention module for attention extraction, and the weighted multi-scale fusion feature maps are output, i.e., the final output is 4 multi-scale fusion feature maps in which semantic information and position information are mildly fused.
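A rough sketch of this LBiFN processing is given below. The way the three extra maps are generated (a 3*3 convolution followed by 2x up-sampling), summation as the in-group fusion operation, nearest-neighbour resampling, and the placeholder attention module are all assumptions made for illustration; the embodiment does not pin these details down at this level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LBiFN(nn.Module):
    """Sketch of the light balanced feature network: expand the smallest-scale map
    into three additional maps, split the six maps into two groups by scale,
    fuse within each group, restore sizes, drop the two smallest maps, and apply
    a shared attention module to the four remaining maps."""
    def __init__(self, channels: int = 256, attention: nn.Module = None):
        super().__init__()
        self.expand = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in range(3))
        # shared-parameter attention module; an identity placeholder is used here
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, feats):                      # feats = [M3, M4, M5], large -> small
        extra, x = [], feats[-1]                   # start from the smallest-scale map
        for conv in self.expand:                   # three 3*3 convolutions + up-sampling
            x = F.interpolate(conv(x), scale_factor=2, mode='nearest')
            extra.append(x)
        all_maps = sorted(list(feats) + extra, key=lambda t: -t.shape[-1])  # large -> small
        groups = [all_maps[:3], all_maps[3:]]      # two groups of three, by scale
        fused = []
        for group in groups:                       # fuse at the largest size in the group
            size = group[0].shape[-2:]
            fused.append(sum(F.interpolate(t, size=size, mode='nearest') for t in group))
        sources = [fused[0]] * 3 + [fused[1]] * 3  # restore to the six original sizes
        restored = [F.interpolate(src, size=m.shape[-2:], mode='nearest')
                    for src, m in zip(sources, all_maps)]
        return [self.attention(r) for r in restored[:4]]  # discard the two smallest maps
```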
In the LBiFN network, the feature expressions of feature layers of different scales are further fused through cross-layer connections, enhancing the feature representation capability.
Based on the LBiFN network, the scheme of this embodiment introduces cross-layer connections and dynamic adjustment of the feature maps, realizing efficient fusion and optimization of feature information: information is extracted from convolutional feature maps of different levels through lateral connections and fused together. The feature maps of the two stages interact through repeated iteration to generate more accurate and more representative feature maps, further improving the feature fusion effect at different scales. Efficient feature transfer is realized through cross-layer connections and feature adjustment, so that feature information at different levels can be transferred and shared, and information extracted from feature maps of multiple convolution levels and resolutions is fused together to generate more accurate and more representative feature maps. Meanwhile, because a bidirectional feature transfer mechanism is adopted, loss and distortion of information during transfer can be avoided, targets of different scales in the detection task can be adapted to quickly, and the efficiency and accuracy of the target detection task are improved.
The structure of the attention module (i.e., the Global Context module) is shown in FIG. 6. It comprises a context modeling unit (Context Modeling unit) and a bottleneck transform unit (Bottleneck Transform unit), and extracts global attention effectively through lightweight computation. In the Context Modeling unit, the input feature map is processed by a 1*1 convolution and softmax regression and then multiplied by the original input feature map, realizing the context modeling function. In the Bottleneck Transform unit, the output of the Context Modeling unit first passes through a 1*1 convolution that reduces the channel dimension, then LayerNorm and ReLU regularization operations, and then another 1*1 convolution that restores the original number of channels; this unit captures dependencies between channels while reducing the number of network parameters. Finally, the output of the Bottleneck Transform unit is fused with the original input feature map, and an attention map of the same dimension is output. In the above process, the global feature vector is passed through a fully connected layer, and the weight of each position is computed from the weight parameters, generating an attention map of the same size as the input feature map.
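A PyTorch sketch of this global context attention module follows; the channel reduction ratio of the bottleneck transform is an illustrative assumption.

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Sketch of the GC attention module: 1*1 convolution + softmax context modelling,
    a bottleneck transform with LayerNorm and ReLU, and fusion with the input."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.context_conv = nn.Conv2d(channels, 1, kernel_size=1)   # context modelling
        self.softmax = nn.Softmax(dim=-1)
        mid = channels // reduction
        self.transform = nn.Sequential(                              # bottleneck transform
            nn.Conv2d(channels, mid, kernel_size=1),                 # reduce channel dimension
            nn.LayerNorm([mid, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1))                 # restore channel dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        weights = self.softmax(self.context_conv(x).view(b, 1, h * w))     # per-position weights
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2))  # weighted sum, (b, c, 1)
        out = self.transform(context.view(b, c, 1, 1))
        return x + out                                               # fuse with the original input
```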
Further, the parameters of the convolution layers of all GC modules (i.e., the attention modules) are shared, which improves the generalization of the feature layers, reduces the number of parameters, lowers the computational cost by using fewer channels, and reduces feature redundancy.
In this embodiment, a global context vector is introduced into the feature maps; the similarity between each feature map position and the global context vector is computed and used as a weight for a weighted sum over the feature map, yielding the feature map processed by the global context attention module. This allows the model to make better use of information from the whole image rather than only local regions, improving its expressiveness and generalization; it helps the model better understand the whole image and extract more meaningful features from it, thereby improving performance and accuracy.
Based on the obtained randomized data set, the preprocessed images in the data set are annotated, and the annotated data set is used to train the target detection network. The original image to be detected is then preprocessed and input into the trained target detection network, and the target detection result is output based on the multi-scale fusion feature maps extracted by the network, realizing target detection with improved accuracy.
The target detection method based on multi-scale fusion feature extraction, which is provided by the embodiment, can be applied to the following actual application scenes:
(1) Autopilot
The method can be applied to the field of autonomous driving. During autonomous driving, a vehicle needs to detect various obstacles in its surroundings, such as vehicles, pedestrians, road signs and roadblocks, through its perception system and make timely decisions to ensure driving safety and stability. The method can detect and identify targets in road traffic scenes in real time, providing important data to support the real-time detection of surrounding vehicles, pedestrians, road signs, obstacles and the like; through real-time monitoring of the road traffic scene, autonomous driving can be realized more effectively and driving safety can be guaranteed.
(2) Security monitoring
The method can be applied to security monitoring: it can monitor and identify persons, vehicles and the like in surveillance footage in real time, and through real-time monitoring of the environment it can better identify dangerous situations and protect personal and property safety, helping security monitoring systems realize fast and accurate target identification and alarms.
(3) Medical image analysis
The method can be applied to various tasks in medical image analysis, including but not limited to the fields of lesion recognition, organ detection, disease classification and the like. Medical images contain a lot of information, but at the same time there is much noise, interference and complexity, which makes medical image analysis very challenging. By adopting the method, more meaningful features can be extracted from the medical image, thereby helping doctors to more accurately find and diagnose diseases.
In addition, the method can be applied to automatic analysis and case screening of medical images, and helps doctors to perform disease analysis and diagnosis more quickly by automatically processing and analyzing a large number of medical images, so that the time for seeing a doctor is shortened, and the medical efficiency is improved. Meanwhile, the method is also beneficial to information sharing and transmission of medical images, and provides more accurate and comprehensive case information for doctors, so that medical decision making and treatment scheme making are better guided.
(4) Industrial manufacture
The method can be applied to the fields of object detection, defect detection and the like in industrial manufacturing, and can help manufacturers to improve the production efficiency and quality and improve the controllability and the safety of the production process by rapidly screening objects on a production line.
(5) Unmanned aerial vehicle application
The method can be applied to the field of unmanned aerial vehicle (UAV) target tracking and detection, and can effectively address problems encountered by UAVs in practical applications, such as complex, changing environments and target scale variation. It helps UAVs realize autonomous flight and intelligent control and improves their working efficiency and precision, and can be used in fields such as agriculture, geological exploration, environmental monitoring, and rescue.
Example two
The embodiment provides a target detection system based on multiscale fusion feature extraction, which comprises:
the image acquisition module is used for acquiring an original image to be detected;
the image preprocessing module is used for preprocessing an original image to be detected;
the target detection module is used for inputting the preprocessed original image into a target detection network, extracting a multi-scale fusion feature map and outputting a target detection result;
The target detection network comprises a backbone network, an SWFC network and an LBiFN network. The preprocessed original image is input into the backbone network to obtain three feature maps of different scales; the three feature maps of different scales are input into the SWFC network to obtain three first-fused feature maps of different scales; the smallest-scale feature map is up-sampled three times, and the feature maps of all scales are input into the LBiFN network, where the input feature maps are divided into two groups by scale from large to small and fused within each group; the two fused feature maps are restored to the sizes of the corresponding original input feature maps, the two smallest-scale feature maps are discarded, and weighted multi-scale fusion feature maps are output through a parameter-shared self-attention module, yielding four multi-scale fusion feature maps of different scales.
Example III
The present embodiment provides an electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps in the object detection method based on multi-scale fusion feature extraction as described above.
Example IV
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps in the target detection method based on multiscale fusion feature extraction as described above.
The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description of the second embodiment refers to the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media including one or more sets of instructions; it should also be understood to include any medium capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one of the methods of the present invention.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computer means, alternatively they may be implemented by program code executable by computing means, whereby they may be stored in storage means for execution by computing means, or they may be made into individual integrated circuit modules separately, or a plurality of modules or steps in them may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (10)

1. A target detection method based on multi-scale fusion feature extraction is characterized by comprising the following steps:
acquiring an original image to be detected, and preprocessing the original image to be detected;
inputting the preprocessed original image into a target detection network, extracting a multi-scale fusion feature map, and outputting a target detection result;
the target detection network comprises a backbone network, an SWFC network and an LBiFN network; the preprocessed original image is input into the backbone network to obtain three feature maps of different scales; the three feature maps of different scales are input into the SWFC network to obtain three first-fused feature maps of different scales; the smallest-scale feature map is up-sampled three times, and the feature maps of all scales are input into the LBiFN network, where the input feature maps are divided into two groups by scale from large to small and fused within each group; the two fused feature maps are restored to the sizes of the corresponding original input feature maps, the two smallest-scale feature maps are discarded, and weighted multi-scale fusion feature maps are output through a parameter-shared self-attention module, yielding four multi-scale fusion feature maps of different scales.
2. The method for detecting targets based on multi-scale fusion feature extraction of claim 1, wherein the preprocessing comprises:
cropping the input original image to unify the image size;
randomly processing the cropped original image, including flipping, occlusion, contrast adjustment, and image format conversion.
3. The target detection method based on multi-scale fusion feature extraction according to claim 1, wherein the backbone network is built on the Resnet-50 network architecture and comprises a CNN module and a residual network module connected in sequence; the residual network module comprises 4 residual blocks whose ratio from top to bottom is 1:1:4:1, and the last 3 residual blocks output three feature maps of different scales respectively.
4. The target detection method based on multi-scale fusion feature extraction according to claim 3, wherein, in each residual block, the input feature map is processed sequentially by a 1*1 convolution, a 3*3 convolution and a 1*1 convolution, the result is fused with the input feature map, and the fused feature map is output.
5. The method for object detection based on multiscale fusion feature extraction according to claim 1, wherein in said SWFC network,
inputting the three feature maps of different scales into the SWFC network, and adjusting the number of channels of each layer through a 1*1 convolution to obtain feature map l3, feature map l4 and feature map l5;
up-sampling and down-sampling each feature map to generate three groups of feature maps matching the input scales;
and fusing the feature maps of the same scale to obtain a first multi-scale-fused feature map M3, feature map M4 and feature map M5.
6. The method for detecting targets based on multi-scale fusion feature extraction according to claim 1, wherein in the LBiFN network,
performing three successive 3*3 convolution operations on the smallest-scale feature map among the three first-fused feature maps of different scales, and outputting 3 up-sampled feature maps;
inputting the 3 feature maps, together with the three first-fused feature maps of different scales, into the LBiFN network; the LBiFN network adopts cross-layer connections, divides the 6 feature maps into two groups by scale from large to small, and performs multi-scale feature fusion within each group at the largest feature map dimension of the group to obtain two fused feature maps;
restoring the two fused feature maps to the sizes of the corresponding input feature maps, discarding the two smallest-scale feature maps, and outputting four feature maps of different scales;
and inputting the four feature maps respectively into an attention module for attention extraction to generate weighted multi-scale fusion feature maps, i.e., finally outputting four multi-scale fusion feature maps in which semantic information and position information are mildly fused.
7. The method for object detection based on multi-scale fusion feature extraction of claim 6, wherein, in said attention module,
the input feature map is first processed by a 1*1 convolution and softmax regression and then multiplied by the original input feature map, and the multiplication result is output;
the output multiplication result passes through a 1*1 convolution, LayerNorm and ReLU regularization operations, and another 1*1 convolution to output a feature map;
the output feature map is fused with the original input feature map, and an attention map of the same dimension is output.
8. A target detection system based on multiscale fusion feature extraction is characterized by comprising:
the image acquisition module is used for acquiring an original image to be detected;
the image preprocessing module is used for preprocessing an original image to be detected;
the target detection module is used for inputting the preprocessed original image into a target detection network, extracting a multi-scale fusion feature map and outputting a target detection result;
the target detection network comprises a backbone network, an SWFC network and an LBiFN network; the preprocessed original image is input into the backbone network to obtain three feature maps of different scales; the three feature maps of different scales are input into the SWFC network to obtain three first-fused feature maps of different scales; the smallest-scale feature map is up-sampled three times, and the feature maps of all scales are input into the LBiFN network, where the input feature maps are divided into two groups by scale from large to small and fused within each group; the two fused feature maps are restored to the sizes of the corresponding original input feature maps, the two smallest-scale feature maps are discarded, and weighted multi-scale fusion feature maps are output through a parameter-shared self-attention module, yielding four multi-scale fusion feature maps of different scales.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps of a method of object detection based on multiscale fusion feature extraction as claimed in any one of claims 1 to 7.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of a method of object detection based on multiscale fusion feature extraction as claimed in any one of claims 1 to 7.
CN202310580498.2A 2023-05-19 2023-05-19 Target detection method and system based on multi-scale fusion feature extraction Pending CN116740516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310580498.2A CN116740516A (en) 2023-05-19 2023-05-19 Target detection method and system based on multi-scale fusion feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310580498.2A CN116740516A (en) 2023-05-19 2023-05-19 Target detection method and system based on multi-scale fusion feature extraction

Publications (1)

Publication Number Publication Date
CN116740516A true CN116740516A (en) 2023-09-12

Family

ID=87905374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310580498.2A Pending CN116740516A (en) 2023-05-19 2023-05-19 Target detection method and system based on multi-scale fusion feature extraction

Country Status (1)

Country Link
CN (1) CN116740516A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173552A (en) * 2023-11-03 2023-12-05 青岛理工大学 Underwater target detection method, system, electronic equipment and storage medium
CN117173552B (en) * 2023-11-03 2024-02-20 青岛理工大学 Underwater target detection method, system, electronic equipment and storage medium
CN117523550A (en) * 2023-11-22 2024-02-06 中化现代农业有限公司 Apple pest detection method, apple pest detection device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN111291809B (en) Processing device, method and storage medium
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
CN110309856A (en) Image classification method, the training method of neural network and device
CN104517103A (en) Traffic sign classification method based on deep neural network
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
Li et al. Implementation of deep-learning algorithm for obstacle detection and collision avoidance for robotic harvester
CN110222718B (en) Image processing method and device
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN112434723B (en) Day/night image classification and object detection method based on attention network
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
Weng et al. Sgformer: A local and global features coupling network for semantic segmentation of land cover
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN117157679A (en) Perception network, training method of perception network, object recognition method and device
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN118015490A (en) Unmanned aerial vehicle aerial image small target detection method, system and electronic equipment
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116883650A (en) Image-level weak supervision semantic segmentation method based on attention and local stitching
Ayachi et al. An edge implementation of a traffic sign detection system for Advanced driver Assistance Systems
Cao et al. QuasiVSD: efficient dual-frame smoke detection
CN117372853A (en) Underwater target detection algorithm based on image enhancement and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination