CN116229086A - Multi-target multi-size image detection method and system under complex background, electronic equipment and storage medium - Google Patents

Multi-target multi-size image detection method and system under complex background, electronic equipment and storage medium

Info

Publication number
CN116229086A
Authority
CN
China
Prior art keywords
frame
channel
feature
representing
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310233827.6A
Other languages
Chinese (zh)
Inventor
王纪晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Publication of CN116229086A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a multi-target multi-size image detection method, system, electronic device and storage medium for complex backgrounds. Images are selected from a data set library, preprocessed and labeled to obtain a data training set; based on the YOLOv3 model, the images are processed with a Darknet-53 feature extraction network structure and features are extracted; a channel-adaptive FPN recursion-layer feature enhancement extraction network is then adopted to strengthen feature extraction, enhancing the strength and precision of feature detection; the bounding-box regression loss function is switched adaptively by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized and the accuracy of the prediction frame is improved; training is performed on the data training set, and after training a self-collected picture to be tested is used to check the effect. The multi-size and multi-scale detection precision of the existing target detection algorithm is improved, the target detection effect is enhanced, the overall performance of the target detection model is improved, and the practicability of the target detection model is improved.

Description

Multi-target multi-size image detection method and system under complex background, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of image detection, and in particular to multi-target multi-size image detection methods, systems, electronic devices and storage media for complex backgrounds.
Background
Object detection is a foundation of image understanding and computer vision, and underlies more complex and higher-level visual tasks such as segmentation, scene understanding, object tracking, image description, event detection and activity recognition. The background against which targets are detected is constantly changing, which complicates the detection task.
For multi-target multi-size image detection under a complex background, existing multi-target multi-size detection models, such as the Dense-YOLOv3 model, integrate the characteristics of the dense convolutional neural network DenseNet and the YOLOv3 network, strengthening feature propagation and feature reuse between convolutional layers and improving the network's resistance to overfitting; they also detect targets at different scales and construct a cross loss function to achieve multi-target detection. However, the multi-size and multi-scale detection precision of existing target detection algorithms is low, so the target detection effect is poor, the overall performance of the target detection model is reduced, the algorithm is difficult to apply to multi-size multi-target detection scenes under complex backgrounds, and the practicability of the target detection model is reduced.
Disclosure of Invention
This Summary is provided to introduce concepts in a simplified form that are further described below in the Detailed Description. It is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The embodiment of the disclosure provides a multi-target multi-size image detection method, system, electronic device and storage medium for complex backgrounds, which improve the multi-size and multi-scale detection precision of the existing target detection algorithm, enhance the target detection effect, improve the overall performance of the target detection model, promote its application to multi-size multi-target detection scenes under complex backgrounds, and improve its practicability.
In a first aspect, an embodiment of the present disclosure provides a method for detecting multi-target multi-size images under a complex background, including: selecting images with complex backgrounds and a large number of targets from a data set library, preprocessing the images, and labeling them to obtain a multi-target multi-size data training set under complex backgrounds; based on the YOLOv3 model, processing the images with a Darknet-53 feature extraction network structure and extracting features; adopting a channel-adaptive FPN recursion-layer feature enhancement extraction network to strengthen feature extraction and enhance the strength and precision of feature detection; adaptively switching the bounding-box regression loss function by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved; and training on the data training set, and performing an effect check with a self-collected picture to be tested after training is finished.
With reference to the embodiment of the first aspect, in some embodiments, the preprocessing the image includes:
randomly initializing a point in the blank image, dividing the blank image into four areas by using initialized abscissas and ordinates, and randomly reading four images;
and performing mirror image overturning and scale scaling on the image to form a new image, and performing corresponding rotation, scaling and translation on the label corresponding to the read image.
With reference to the embodiment of the first aspect, in some embodiments, adopting the channel-adaptive FPN recursion-layer feature enhancement extraction network to strengthen feature extraction and enhance the strength and precision of feature detection includes: based on the principle that an FPN provides a top-down path to fuse the features in the multi-scale feature maps, an FPN pyramid network structure is established, where the network output of the FPN is
f_i = F_i(f_{i+1}, X_i)
X_i = B_i(X_{i-1})
where B_i denotes the i-th stage of the bottom-up path of the backbone network, X_i denotes the output of the i-th stage, F_i denotes the corresponding operation of the additional top-down feedback path, and f_i denotes the corresponding output; i ranges from 1 to S, where S is the number of stages in the backbone network, and f_{S+1} = 0;
Connecting the additional feedback of each FPN layer from top to bottom to the Darknet-53 backbone characteristic extraction layer from bottom to top;
Based on the principle that the ECA-Block module fuses the spatial and channel information of each layer's local receptive field, so that the network can construct informative features and adaptively adjust the channel weight relation, channel adaptation within the recursion layer is completed through the ECA-Block, and the weight of each channel is computed by the adaptive weight module to assign the channel weights;
after the ECA-Block channel attention module calculates the weights, each channel has a corresponding value, which is the adaptive weight of that channel;
the ECA-Block channel attention module calculates the weights as follows:
the image features extracted by the convolution blocks are output as U ∈ R^{W×H×C}, where W, H and C are the width, height and number of channels respectively;
all spatial information is compressed into a statistic Z ∈ R^{1×1×C} by global average pooling; Z results from shrinking the convolution output over the spatial dimensions H×W, so the c-th element of Z is generated by average pooling as
z_c = (1/(W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j)
after the aggregated feature Z ∈ R^{1×1×C} is obtained, the channel weights are generated by a one-dimensional convolution with kernel size k, where k is adaptively determined by the channel dimension C according to
k = ψ(C) = | log_2(C)/γ + b/γ |_odd
where k is the size of the convolution kernel, |t|_odd denotes the odd number nearest to t, C is the channel dimension, and γ and b are fixed to 2 and 1 respectively;
Feature extraction is performed on the target image twice; this recursive extraction enhances the strength and precision of feature detection, and the output features depend on the output of the previous step. Let R_i denote the feature transformation applied before the adaptive FPN recursion layer is connected to the backbone network; the output feature f_i is
f_i = F_i(f_{i+1}, x_i),  x_i = u_i(x_{i-1}, R_i(f_i))
where u_i denotes the i-th layer convolution operation in the backbone extraction network, x_i is the feature after that convolution operation, and F_i denotes the i-th FPN operation applied to x_i;
the recursive output features can be unrolled into a sequential network; for the two unrolled steps t = 1, 2, with f_i^0 = 0:
x_i^1 = u_i^1(x_{i-1}^1, R_i^1(f_i^0)),  f_i^1 = F_i^1(f_{i+1}^1, x_i^1)
x_i^2 = u_i^2(x_{i-1}^2, R_i^2(f_i^1)),  f_i^2 = F_i^2(f_{i+1}^2, x_i^2)
With reference to the embodiment of the first aspect, in some embodiments, adaptively switching the bounding-box regression loss function by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved, includes:
let the prediction frame be a and the real frame be c; let the intersection area of the two frames be b, the union area of the two frames be D, and the area of the smallest region that can contain both frames be E;
when no inclusion relation exists between the prediction frame and the real frame, the loss function uses GIOU, and the loss is calculated as
L_GIOU = 1 - GIOU
GIOU = IOU - (E - D)/E
where IOU denotes the ratio of the overlapping region of the two frames to their union,
IOU = b/D;
GIOU can distinguish states in which the prediction frame and the real frame have the same intersection region but intersect in different ways, and the model parameters can continue to be optimized even when no intersection region exists between the prediction frame and the real frame during training;
when an inclusion relation exists between the prediction frame and the real frame, the loss function uses CIOU, and the positional relation is calculated and optimized through the distance between the center points of the two frames and the aspect ratios of the two frames:
L_CIOU = 1 - IOU + ρ²(b, b^gt)/C² + αv
where ρ²(b, b^gt) denotes the square of the distance between the center points of the two frames, and C² denotes the square of the diagonal length of the smallest rectangular region containing the prediction frame and the real frame;
α = v / ((1 - IOU) + v)
where the parameter v measures the consistency of the aspect ratio between the prediction frame and the real frame,
v = (4/π²) (arctan(w^gt/h^gt) - arctan(w_p/h_p))²
where w^gt and h^gt denote the width and height of the real frame, and w_p and h_p denote the width and height of the prediction frame;
the complete definition of CIOU is
CIOU = IOU - (ρ²(b, b^gt)/C² + αv);
by judging the positional relation between the prediction frame and the real frame and adaptively switching the bounding-box regression loss function accordingly, the model parameters can be continuously optimized and the matching degree between the prediction frame and the real frame can be accurately estimated, thereby effectively improving detection precision.
In a second aspect, embodiments of the present disclosure provide a multi-target multi-size image detection system under a complex background, comprising: a data processing unit, which selects images with complex backgrounds and a large number of targets from the data set library, preprocesses the images, and labels them to obtain a multi-target multi-size data training set under complex backgrounds; a feature extraction unit, which processes the images with a Darknet-53 feature extraction network structure based on the YOLOv3 model and then extracts features; a feature enhancement unit, which adopts a channel-adaptive FPN recursion-layer feature enhancement extraction network to strengthen feature extraction and enhance the strength and precision of feature detection; a judging unit, which adaptively switches the bounding-box regression loss function by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved; and a checking unit, which trains on the data training set and performs an effect check with a self-collected picture to be tested after training is finished.
With reference to the embodiments of the second aspect, in some embodiments, the preprocessing the image includes:
randomly initializing a point in the blank image, dividing the blank image into four areas by using initialized abscissas and ordinates, and randomly reading four images;
and performing mirror image overturning and scale scaling on the image to form a new image, and performing corresponding rotation, scaling and translation on the label corresponding to the read image.
With reference to the embodiments of the second aspect, in some embodiments, the feature enhancement unit uses a channel adaptive FPN recursion layer feature enhancement extraction network to enhance feature extraction and enhance strength and accuracy of feature detection, including:
based on FPN, a top-down path can be provided to fuse the feature principles in the multi-scale feature map, and a FPN pyramid network structure is established, wherein the network output formula of the FPN is as follows
f i =F i (f i+1 ,X i )
X i =B i (X i-1 )
Wherein B is i Stage i, X representing the bottom-to-top path of the backbone network i Representing the output of the ith stage, F i Representing the corresponding operation of additional feedback from the top to the bottom path, f i Representing the output, wherein the value range of i is 1-S, S represents the number of stages in the backbone network, f (s+1) =0;
Connecting the additional feedback of each FPN layer from top to bottom to the Darknet-53 backbone characteristic extraction layer from bottom to top;
Based on the principle that the ECA-Block module fuses the spatial and channel information of each layer's local receptive field, so that the network can construct informative features and adaptively adjust the channel weight relation, channel adaptation within the recursion layer is completed through the ECA-Block, and the weight of each channel is computed by the adaptive weight module to assign the channel weights;
after the ECA-Block channel attention module calculates the weights, each channel has a corresponding value, which is the adaptive weight of that channel;
the ECA-Block channel attention module calculates the weights as follows:
the image features extracted by the convolution blocks are output as U ∈ R^{W×H×C}, where W, H and C are the width, height and number of channels respectively;
all spatial information is compressed into a statistic Z ∈ R^{1×1×C} by global average pooling; Z results from shrinking the convolution output over the spatial dimensions H×W, so the c-th element of Z is generated by average pooling as
z_c = (1/(W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j)
after the aggregated feature Z ∈ R^{1×1×C} is obtained, the channel weights are generated by a one-dimensional convolution with kernel size k, where k is adaptively determined by the channel dimension C according to
k = ψ(C) = | log_2(C)/γ + b/γ |_odd
where k is the size of the convolution kernel, |t|_odd denotes the odd number nearest to t, C is the channel dimension, and γ and b are fixed to 2 and 1 respectively;
Feature extraction is performed on the target image twice; this recursive extraction enhances the strength and precision of feature detection, and the output features depend on the output of the previous step. Let R_i denote the feature transformation applied before the adaptive FPN recursion layer is connected to the backbone network; the output feature f_i is
f_i = F_i(f_{i+1}, x_i),  x_i = u_i(x_{i-1}, R_i(f_i))
where u_i denotes the i-th layer convolution operation in the backbone extraction network, x_i is the feature after that convolution operation, and F_i denotes the i-th FPN operation applied to x_i;
the recursive output features can be unrolled into a sequential network; for the two unrolled steps t = 1, 2, with f_i^0 = 0:
x_i^1 = u_i^1(x_{i-1}^1, R_i^1(f_i^0)),  f_i^1 = F_i^1(f_{i+1}^1, x_i^1)
x_i^2 = u_i^2(x_{i-1}^2, R_i^2(f_i^1)),  f_i^2 = F_i^2(f_{i+1}^2, x_i^2)
With reference to the embodiments of the second aspect, in some embodiments, the judging unit adaptively switching the bounding-box regression loss function by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved, includes:
let the prediction frame be a and the real frame be c; let the intersection area of the two frames be b, the union area of the two frames be D, and the area of the smallest region that can contain both frames be E;
when no inclusion relation exists between the prediction frame and the real frame, the loss function uses GIOU, and the loss is calculated as
L_GIOU = 1 - GIOU
GIOU = IOU - (E - D)/E
where IOU denotes the ratio of the overlapping region of the two frames to their union,
IOU = b/D;
GIOU can distinguish states in which the prediction frame and the real frame have the same intersection region but intersect in different ways, and the model parameters can continue to be optimized even when no intersection region exists between the prediction frame and the real frame during training;
when an inclusion relation exists between the prediction frame and the real frame, the loss function uses CIOU, and the positional relation is calculated and optimized through the distance between the center points of the two frames and the aspect ratios of the two frames:
L_CIOU = 1 - IOU + ρ²(b, b^gt)/C² + αv
where ρ²(b, b^gt) denotes the square of the distance between the center points of the two frames, and C² denotes the square of the diagonal length of the smallest rectangular region containing the prediction frame and the real frame;
α = v / ((1 - IOU) + v)
where the parameter v measures the consistency of the aspect ratio between the prediction frame and the real frame,
v = (4/π²) (arctan(w^gt/h^gt) - arctan(w_p/h_p))²
where w^gt and h^gt denote the width and height of the real frame, and w_p and h_p denote the width and height of the prediction frame;
the complete definition of CIOU is
CIOU = IOU - (ρ²(b, b^gt)/C² + αv);
by judging the positional relation between the prediction frame and the real frame and adaptively switching the bounding-box regression loss function accordingly, the model parameters can be continuously optimized and the matching degree between the prediction frame and the real frame can be accurately estimated, thereby effectively improving detection precision.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the multi-object multi-size image detection method in the complex context as described in the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the steps of a multi-object multi-size image detection method in a complex context as described in the first aspect.
The invention has the following beneficial effects: images with complex backgrounds and a large number of targets are selected from a data set library, preprocessed and labeled to obtain a multi-target multi-size data training set under complex backgrounds; based on the YOLOv3 model, the images are processed with a Darknet-53 feature extraction network structure and features are extracted; a channel-adaptive FPN recursion-layer feature enhancement extraction network is adopted to strengthen feature extraction, enhancing the strength and precision of feature detection; the bounding-box regression loss function is switched adaptively by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved; training is performed on the data training set, and after training a self-collected picture to be tested is used to check the effect. The multi-size and multi-scale detection precision of the existing target detection algorithm is improved, the target detection effect is enhanced, the overall performance of the target detection model is improved, application to multi-size multi-target detection scenes under complex backgrounds is promoted, and the practicability of the target detection model is improved.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of one embodiment of a multi-target multi-size image detection method in a complex context according to the present disclosure;
FIG. 2 is a schematic diagram of the structure of the FPN network of the present disclosure;
FIG. 3 is a schematic diagram of an ECA-Block network architecture of the present disclosure;
FIG. 4 is a schematic diagram of the structure of a multi-target multi-size image detection system in the complex context of the present disclosure;
fig. 5 is a schematic diagram of a basic structure of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a", "an" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Referring to fig. 1, a flow of one embodiment of a multi-target multi-size image detection method in a complex context according to the present disclosure is shown. As shown in fig. 1, the method comprises the steps of:
and 101, selecting images with complex backgrounds and large target numbers from a data set base, preprocessing the images, and labeling the images to obtain a multi-target multi-size data training set under the complex backgrounds.
Here, step 101 includes:
selecting a large number of images with complex background and large target number from an MS-COCO database;
randomly initializing a point in the blank image, dividing the blank image into four areas by using initialized abscissas and ordinates, and randomly reading four selected images;
performing mirror image overturning and scale scaling on the image to form a new image, and performing corresponding rotation, scaling and translation on the label corresponding to the read image;
labeling the preprocessed image;
and obtaining the multi-target multi-size data training set under the complex background.
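The four-image preprocessing of step 101 can be illustrated with a short sketch. This is a minimal example, assuming images are given as NumPy arrays and labels as (class_id, x1, y1, x2, y2) boxes in pixel coordinates; the function name, output size and box format are illustrative assumptions rather than details taken from the patent, and the rotation step mentioned above is omitted for brevity.

```python
import random
import numpy as np
import cv2


def mosaic_preprocess(images, labels, out_size=608):
    """Combine four randomly chosen images into one mosaic training image.

    images : list of HxWx3 uint8 arrays (at least four entries)
    labels : list of (N_i, 5) float arrays, each row (class_id, x1, y1, x2, y2) in pixels
    Returns the composed image and the correspondingly transformed labels.
    """
    # Randomly initialize a point; its abscissa/ordinate split the blank canvas into four areas.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    quadrants = [(0, 0, cx, cy), (cx, 0, out_size, cy),
                 (0, cy, cx, out_size), (cx, cy, out_size, out_size)]

    new_labels = []
    for (x0, y0, x1, y1), idx in zip(quadrants, random.sample(range(len(images)), 4)):
        img = images[idx]
        lab = labels[idx].astype(np.float32).copy()
        h, w = img.shape[:2]

        # Mirror flip (together with the boxes), then scale the image into its quadrant.
        if random.random() < 0.5:
            img = np.ascontiguousarray(img[:, ::-1])
            lab[:, [1, 3]] = w - lab[:, [3, 1]]
        canvas[y0:y1, x0:x1] = cv2.resize(img, (x1 - x0, y1 - y0))

        # Apply the same scaling and translation to the boxes.
        sx, sy = (x1 - x0) / w, (y1 - y0) / h
        lab[:, [1, 3]] = lab[:, [1, 3]] * sx + x0
        lab[:, [2, 4]] = lab[:, [2, 4]] * sy + y0
        new_labels.append(lab)

    return canvas, np.concatenate(new_labels, axis=0)
```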
Step 102, based on the YOLOv3 model, the image is processed with a Darknet-53 feature extraction network structure and features are extracted.
Here, step 102 includes:
inputting a training-set image and obtaining output features from the residual module C2 of the backbone network Darknet-53;
inputting the output features of the residual module C2 of the backbone network Darknet-53 into the auxiliary network N1 and into the residual module C3 of the backbone network Darknet-53;
outputting the features of the auxiliary network N1 to the auxiliary network N2, and at the same time up-sampling the output features of the auxiliary network N1;
fusing the up-sampled features of the auxiliary network N1 with the output features of the residual module C3 of the backbone network Darknet-53 by accumulation, and inputting the result into the residual module C4 of the backbone network Darknet-53;
up-sampling the output features of the auxiliary network N2, fusing them with the output features of the residual module C4 of the backbone network Darknet-53 by accumulation, and inputting the result into the residual module C5 of the backbone network Darknet-53;
inputting the output features of the residual module C5 of the backbone network Darknet-53 into the channel-adaptive FPN recursion-layer feature enhancement extraction network, so that the data features are obtained in preparation for the subsequent feature enhancement.
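A minimal PyTorch-style sketch of this fusion path is given below. The stage and auxiliary-branch modules are simple placeholder convolutions with illustrative channel counts (the actual Darknet-53 residual blocks and the internal structure of N1 and N2 are not detailed here), so the sketch only shows how the C2–C5 and N1/N2 outputs are up-sampled and accumulated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_leaky(c_in, c_out, k=3, s=1):
    """Basic conv block used as a stand-in for a Darknet-53 residual stage."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )


class AuxiliaryFusionBackbone(nn.Module):
    """Sketch of the C2..C5 / N1, N2 fusion flow described in step 102."""

    def __init__(self):
        super().__init__()
        # Placeholder stages; a real model would use Darknet-53 residual blocks.
        self.c2 = conv_bn_leaky(3, 128, s=2)
        self.c3 = conv_bn_leaky(128, 256, s=2)
        self.c4 = conv_bn_leaky(256, 512, s=2)
        self.c5 = conv_bn_leaky(512, 1024, s=2)
        # Auxiliary branches N1 and N2 (structure assumed).
        self.n1 = conv_bn_leaky(128, 256, s=2)
        self.n2 = conv_bn_leaky(256, 512, s=2)

    def forward(self, x):
        f2 = self.c2(x)                       # C2 output feeds both N1 and C3
        n1 = self.n1(f2)
        f3 = self.c3(f2)
        # Up-sample the N1 output and fuse it with the C3 output by accumulation.
        f3 = f3 + F.interpolate(n1, size=f3.shape[-2:], mode="nearest")
        n2 = self.n2(n1)
        f4 = self.c4(f3)
        # Up-sample the N2 output and fuse it with the C4 output before C5.
        f4 = f4 + F.interpolate(n2, size=f4.shape[-2:], mode="nearest")
        f5 = self.c5(f4)
        return f5                             # fed to the channel-adaptive FPN recursion layer
```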
Step 103, a channel-adaptive FPN recursion-layer feature enhancement extraction network is adopted to strengthen feature extraction and enhance the strength and precision of feature detection.
Here, step 103 includes:
the FPN network structure is shown in FIG. 2; based on the principle that an FPN provides a top-down path to fuse the features in the multi-scale feature maps, an FPN pyramid network structure is built, where the network output of the FPN is
f_i = F_i(f_{i+1}, X_i)
X_i = B_i(X_{i-1})
where B_i denotes the i-th stage of the bottom-up path of the backbone network, X_i denotes the output of the i-th stage, F_i denotes the corresponding operation of the additional top-down feedback path, and f_i denotes the corresponding output; i ranges from 1 to S, where S is the number of stages in the backbone network, and f_{S+1} = 0;
Connecting the additional feedback of each FPN layer from top to bottom to the Darknet-53 backbone characteristic extraction layer from bottom to top;
the ECA-Block network structure is shown in FIG. 3; based on the principle that the ECA-Block module fuses the spatial and channel information of each layer's local receptive field, so that the network can construct informative features and adaptively adjust the channel weight relation, channel adaptation within the recursion layer is completed through the ECA-Block, and the weight of each channel is computed by the adaptive weight module to assign the channel weights;
after the ECA-Block channel attention module calculates the weights, each channel has a corresponding value, which is the adaptive weight of that channel;
the ECA-Block channel attention module calculates the weights as follows:
the image features extracted by the convolution blocks are output as U ∈ R^{W×H×C}, where W, H and C are the width, height and number of channels respectively;
all spatial information is compressed into a statistic Z ∈ R^{1×1×C} by global average pooling; Z results from shrinking the convolution output over the spatial dimensions H×W, so the c-th element of Z is generated by average pooling as
z_c = (1/(W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j)
after the aggregated feature Z ∈ R^{1×1×C} is obtained, the channel weights are generated by a one-dimensional convolution with kernel size k, where k is adaptively determined by the channel dimension C according to
k = ψ(C) = | log_2(C)/γ + b/γ |_odd
where k is the size of the convolution kernel, |t|_odd denotes the odd number nearest to t, C is the channel dimension, and γ and b are fixed to 2 and 1 respectively;
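The channel-attention computation just described (global average pooling, a one-dimensional convolution whose kernel size is derived from the channel dimension, then a sigmoid producing per-channel weights) can be sketched as follows. This is a generic ECA-style module written under the stated γ = 2, b = 1 assumption; it is an illustrative sketch, not the patent's exact implementation.

```python
import math
import torch
import torch.nn as nn


class ECAAttention(nn.Module):
    """ECA-style channel attention producing one adaptive weight per channel."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # k = |log2(C)/gamma + b/gamma| rounded to the nearest odd number.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, u):                     # u: (N, C, H, W)
        z = self.avg_pool(u)                  # (N, C, 1, 1): spatial info -> per-channel statistic
        z = z.squeeze(-1).transpose(1, 2)     # (N, 1, C) for the 1-D convolution over channels
        w = self.sigmoid(self.conv(z))        # per-channel adaptive weights
        w = w.transpose(1, 2).unsqueeze(-1)   # back to (N, C, 1, 1)
        return u * w                          # re-weight each channel of the feature map
```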
feature extraction is performed on the target image twice; this recursive extraction enhances the strength and precision of feature detection, and the output features depend on the output of the previous step, where R_i denotes the feature transformation applied before the adaptive FPN recursion layer is connected to the backbone network, and the output feature f_i is
f_i = F_i(f_{i+1}, x_i),  x_i = u_i(x_{i-1}, R_i(f_i))
where u_i denotes the i-th layer convolution operation in the backbone extraction network, x_i is the feature after that convolution operation, and F_i denotes the i-th FPN operation applied to x_i;
the recursive output features can be unrolled into a sequential network; for the two unrolled steps t = 1, 2, with f_i^0 = 0:
x_i^1 = u_i^1(x_{i-1}^1, R_i^1(f_i^0)),  f_i^1 = F_i^1(f_{i+1}^1, x_i^1)
x_i^2 = u_i^2(x_{i-1}^2, R_i^2(f_i^1)),  f_i^2 = F_i^2(f_{i+1}^2, x_i^2)
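The two-pass recursion can be illustrated structurally with the sketch below. The callables u_i, F_i and R_i and their signatures are assumptions that follow the formulas above; in a real network they would be the backbone stages, the FPN fusion operations and the feedback transformations respectively, and a callable receiving None should treat that input as zero (no feedback yet).

```python
def recursive_fpn_forward(backbone_stages, fpn_ops, feedback_ops, x0, steps=2):
    """Unrolled two-pass recursion sketch for the channel-adaptive FPN layer.

    backbone_stages : list of callables u_i(x_prev, r_i)   (bottom-up stages)
    fpn_ops         : list of callables F_i(f_next, x_i)   (top-down fusion)
    feedback_ops    : list of callables R_i(f_i) mapping FPN output back to the backbone
    x0              : input feature fed to the first backbone stage
    """
    n = len(backbone_stages)
    f = [None] * (n + 1)            # f[i] is the FPN output of stage i; f[n] plays the role of f_{S+1}
    for _ in range(steps):          # the description above extracts features twice (recursion depth 2)
        # Bottom-up pass: each stage also receives the feedback R_i(f_i) from the previous pass.
        x = [None] * n
        prev = x0
        for i in range(n):
            r = feedback_ops[i](f[i]) if f[i] is not None else None
            prev = backbone_stages[i](prev, r)
            x[i] = prev
        # Top-down pass: F_i fuses the higher-level FPN output with x_i.
        for i in reversed(range(n)):
            f[i] = fpn_ops[i](f[i + 1], x[i])
    return f[:n]
```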
Step 104, the bounding-box regression loss function is switched adaptively by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved.
Here, step 104 includes:
let the prediction frame be a and the real frame be c; let the intersection area of the two frames be b, the union area of the two frames be D, and the area of the smallest region that can contain both frames be E;
when no inclusion relation exists between the prediction frame and the real frame, the loss function uses GIOU, and the loss is calculated as
L_GIOU = 1 - GIOU
GIOU = IOU - (E - D)/E
where IOU denotes the ratio of the overlapping region of the two frames to their union,
IOU = b/D;
GIOU can distinguish states in which the prediction frame and the real frame have the same intersection region but intersect in different ways, and the model parameters can continue to be optimized even when no intersection region exists between the prediction frame and the real frame during training;
when an inclusion relation exists between the prediction frame and the real frame, the loss function uses CIOU, and the positional relation is calculated and optimized through the distance between the center points of the two frames and the aspect ratios of the two frames:
L_CIOU = 1 - IOU + ρ²(b, b^gt)/C² + αv
where ρ²(b, b^gt) denotes the square of the distance between the center points of the two frames, and C² denotes the square of the diagonal length of the smallest rectangular region containing the prediction frame and the real frame;
α = v / ((1 - IOU) + v)
where the parameter v measures the consistency of the aspect ratio between the prediction frame and the real frame,
v = (4/π²) (arctan(w^gt/h^gt) - arctan(w_p/h_p))²
where w^gt and h^gt denote the width and height of the real frame, and w_p and h_p denote the width and height of the prediction frame;
the complete definition of CIOU is
CIOU = IOU - (ρ²(b, b^gt)/C² + αv);
by judging the positional relation between the prediction frame and the real frame and adaptively switching the bounding-box regression loss function accordingly, the model parameters can be continuously optimized and the matching degree between the prediction frame and the real frame can be accurately estimated, thereby effectively improving detection precision.
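The GIOU/CIOU switching rule of step 104 can be sketched as follows. Boxes are assumed to be given as (x1, y1, x2, y2) tensors, and the "inclusion relation" is interpreted here as one box lying entirely inside the other; both are assumptions made for the sake of the example, not details fixed by the patent text.

```python
import math
import torch


def adaptive_box_regression_loss(pred, target, eps=1e-7):
    """Select the GIOU or CIOU loss depending on whether one box contains the other.

    pred, target : tensors of shape (..., 4) with boxes as (x1, y1, x2, y2).
    """
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)

    # Intersection b, union D and smallest enclosing area E (symbols as in the description).
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    ex1, ey1 = torch.min(px1, tx1), torch.min(py1, ty1)
    ex2, ey2 = torch.max(px2, tx2), torch.max(py2, ty2)
    enclose = (ex2 - ex1) * (ey2 - ey1)

    # GIOU branch: no inclusion relation between the two frames.
    giou = iou - (enclose - union) / (enclose + eps)
    loss_giou = 1.0 - giou

    # CIOU branch: penalize center-point distance and aspect-ratio inconsistency.
    rho2 = ((px1 + px2) - (tx1 + tx2)) ** 2 / 4 + ((py1 + py2) - (ty1 + ty2)) ** 2 / 4
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    v = (4 / math.pi ** 2) * (torch.atan((tx2 - tx1) / (ty2 - ty1 + eps))
                              - torch.atan((px2 - px1) / (py2 - py1 + eps))) ** 2
    alpha = v / ((1.0 - iou) + v + eps)
    loss_ciou = 1.0 - iou + rho2 / (c2 + eps) + alpha * v

    # Inclusion test: the prediction frame lies inside the real frame or vice versa.
    p_in_t = (px1 >= tx1) & (py1 >= ty1) & (px2 <= tx2) & (py2 <= ty2)
    t_in_p = (tx1 >= px1) & (ty1 >= py1) & (tx2 <= px2) & (ty2 <= py2)
    return torch.where(p_in_t | t_in_p, loss_ciou, loss_giou)
```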
Step 105, training is performed on the data training set, and an effect check is performed with the self-collected picture to be tested after training is finished.
Images with complex backgrounds and a large number of targets are selected from a data set library, preprocessed and labeled to obtain a multi-target multi-size data training set under complex backgrounds; based on the YOLOv3 model, the images are processed with a Darknet-53 feature extraction network structure and features are extracted; a channel-adaptive FPN recursion-layer feature enhancement extraction network is adopted to strengthen feature extraction, enhancing the strength and precision of feature detection; the bounding-box regression loss function is switched adaptively by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved; training is performed on the data training set, and after training a self-collected picture to be tested is used to check the effect. The multi-size and multi-scale detection precision of the existing target detection algorithm is improved, the target detection effect is enhanced, the overall performance of the target detection model is improved, application to multi-size multi-target detection scenes under complex backgrounds is promoted, and the practicability of the target detection model is improved.
Here, step 105 includes:
training on the data set with the SGD (stochastic gradient descent) optimization algorithm;
comparing P-R curves, which represent the relationship between precision and recall, where P denotes precision and R denotes recall; the samples are divided into true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), and
P = TP / (TP + FP)
R = TP / (TP + FN)
Taking the precision rate P as a vertical axis and the recall rate R as a horizontal axis, the performance P-R curve of the model can be obtained;
if the P-R curve of one model is completely enclosed by that of another model, the enclosing model can be considered to perform better;
when the P-R curves of two models intersect, performance can be compared according to the break-even point (BEP), which is the value at which P = R; the larger the BEP, the better the performance of the model can be considered;
performance is also compared through the mAP value, where AP (average precision) is the area under the P-R curve and mAP is the mean of the AP values over all classes in the data set,
AP = ∫_0^1 P(R) dR,  mAP = (1/N) Σ_{i=1}^{N} AP_i
the higher the mAP, the better the performance of the model.
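A minimal sketch of this evaluation is shown below. It assumes detections have already been matched to ground truth (the is_tp flags), and the simple trapezoidal AP is illustrative only, not the interpolated COCO evaluation protocol; the function names and argument layout are assumptions for the example.

```python
import numpy as np


def precision_recall_ap(scores, is_tp, num_gt):
    """Compute the P-R curve and AP (area under the curve) for one class.

    scores : confidence of each detection
    is_tp  : 1 if the detection matches a ground-truth box, else 0
    num_gt : number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=np.float64)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    precision = tp_cum / (tp_cum + fp_cum)            # P = TP / (TP + FP)
    recall = tp_cum / max(num_gt, 1)                  # R = TP / (TP + FN)
    ap = np.trapz(precision, recall)                  # area under the P-R curve
    return precision, recall, ap


def mean_average_precision(per_class_results):
    """mAP = mean of the per-class AP values over all classes in the data set."""
    aps = [precision_recall_ap(s, t, n)[2] for s, t, n in per_class_results]
    return float(np.mean(aps)) if aps else 0.0
```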
With further reference to fig. 4, as an implementation of the method shown in fig. 1, the present disclosure discloses a multi-object multi-size image detection system in a complex context, where an embodiment of the system corresponds to the embodiment of the method shown in fig. 1. The system is particularly applicable to a variety of electronic devices.
As shown in fig. 4, the system of the present embodiment includes:
the data processing unit 401 selects images with complex backgrounds and large target numbers from the data set base, pre-processes the images, marks the images and obtains a multi-target multi-size data training set under the complex backgrounds;
the feature extraction unit 402 extracts images based on the YOLOv3 model by adopting a dark-53 feature extraction network structure and then performs feature extraction;
a feature strengthening unit 403, which strengthens feature extraction by adopting a channel self-adaptive FPN recursion layer feature strengthening extraction network, and strengthens the intensity and precision of feature detection;
the judging unit 404 is used for adaptively replacing the boundary frame regression loss function by judging the position relationship between the prediction frame and the real frame, so that the model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved;
the test unit 405 performs training on the data training set, and performs effect test by using the self-collected picture to be tested after the training is finished.
In some alternative embodiments, preprocessing the image includes:
randomly initializing a point in the blank image, dividing the blank image into four areas by using initialized abscissas and ordinates, and randomly reading four images;
And performing mirror image overturning and scale scaling on the image to form a new image, and performing corresponding rotation, scaling and translation on the label corresponding to the read image.
In some alternative embodiments, feature enhancement unit 403, which uses a channel adaptive FPN recursive layer feature enhancement extraction network to enhance feature extraction and enhance the intensity and accuracy of feature detection, includes:
based on the principle that an FPN provides a top-down path to fuse the features in the multi-scale feature maps, an FPN pyramid network structure is established, where the network output of the FPN is
f_i = F_i(f_{i+1}, X_i)
X_i = B_i(X_{i-1})
where B_i denotes the i-th stage of the bottom-up path of the backbone network, X_i denotes the output of the i-th stage, F_i denotes the corresponding operation of the additional top-down feedback path, and f_i denotes the corresponding output; i ranges from 1 to S, where S is the number of stages in the backbone network, and f_{S+1} = 0;
Connecting the additional feedback of each FPN layer from top to bottom to the Darknet-53 backbone characteristic extraction layer from bottom to top;
based on the principle that the ECA-Block module fuses the spatial and channel information of each layer's local receptive field, so that the network can construct informative features and adaptively adjust the channel weight relation, channel adaptation within the recursion layer is completed through the ECA-Block, and the weight of each channel is computed by the adaptive weight module to assign the channel weights;
after the ECA-Block channel attention module calculates the weights, each channel has a corresponding value, which is the adaptive weight of that channel;
the ECA-Block channel attention module calculates the weights as follows:
the image features extracted by the convolution blocks are output as U ∈ R^{W×H×C}, where W, H and C are the width, height and number of channels respectively;
all spatial information is compressed into a statistic Z ∈ R^{1×1×C} by global average pooling; Z results from shrinking the convolution output over the spatial dimensions H×W, so the c-th element of Z is generated by average pooling as
z_c = (1/(W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j)
after the aggregated feature Z ∈ R^{1×1×C} is obtained, the channel weights are generated by a one-dimensional convolution with kernel size k, where k is adaptively determined by the channel dimension C according to
k = ψ(C) = | log_2(C)/γ + b/γ |_odd
where k is the size of the convolution kernel, |t|_odd denotes the odd number nearest to t, C is the channel dimension, and γ and b are fixed to 2 and 1 respectively;
feature extraction is performed on the target image twice; this recursive extraction enhances the strength and precision of feature detection, and the output features depend on the output of the previous step, where R_i denotes the feature transformation applied before the adaptive FPN recursion layer is connected to the backbone network, and the output feature f_i is
f_i = F_i(f_{i+1}, x_i),  x_i = u_i(x_{i-1}, R_i(f_i))
where u_i denotes the i-th layer convolution operation in the backbone extraction network, x_i is the feature after that convolution operation, and F_i denotes the i-th FPN operation applied to x_i;
the recursive output features can be unrolled into a sequential network; for the two unrolled steps t = 1, 2, with f_i^0 = 0:
x_i^1 = u_i^1(x_{i-1}^1, R_i^1(f_i^0)),  f_i^1 = F_i^1(f_{i+1}^1, x_i^1)
x_i^2 = u_i^2(x_{i-1}^2, R_i^2(f_i^1)),  f_i^2 = F_i^2(f_{i+1}^2, x_i^2)
In some alternative embodiments, the judging unit 404 adaptively switches the bounding-box regression loss function by judging the positional relation between the prediction frame and the real frame, so that model parameters can be continuously optimized, the convergence speed of the model is increased, and the accuracy of the prediction frame is improved, including:
let the prediction frame be a and the real frame be c; let the intersection area of the two frames be b, the union area of the two frames be D, and the area of the smallest region that can contain both frames be E;
when no inclusion relation exists between the prediction frame and the real frame, the loss function uses GIOU, and the loss is calculated as
L_GIOU = 1 - GIOU
GIOU = IOU - (E - D)/E
where IOU denotes the ratio of the overlapping region of the two frames to their union,
IOU = b/D;
GIOU can distinguish states in which the prediction frame and the real frame have the same intersection region but intersect in different ways, and the model parameters can continue to be optimized even when no intersection region exists between the prediction frame and the real frame during training;
when an inclusion relation exists between the prediction frame and the real frame, the loss function uses CIOU, and the positional relation is calculated and optimized through the distance between the center points of the two frames and the aspect ratios of the two frames:
L_CIOU = 1 - IOU + ρ²(b, b^gt)/C² + αv
where ρ²(b, b^gt) denotes the square of the distance between the center points of the two frames, and C² denotes the square of the diagonal length of the smallest rectangular region containing the prediction frame and the real frame;
α = v / ((1 - IOU) + v)
where the parameter v measures the consistency of the aspect ratio between the prediction frame and the real frame,
v = (4/π²) (arctan(w^gt/h^gt) - arctan(w_p/h_p))²
where w^gt and h^gt denote the width and height of the real frame, and w_p and h_p denote the width and height of the prediction frame;
the complete definition of CIOU is
CIOU = IOU - (ρ²(b, b^gt)/C² + αv);
by judging the positional relation between the prediction frame and the real frame and adaptively switching the bounding-box regression loss function accordingly, the model parameters can be continuously optimized and the matching degree between the prediction frame and the real frame can be accurately estimated, thereby effectively improving detection precision.
Referring now to fig. 5, a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure is shown. The electronic device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), etc., a fixed terminal such as a digital TV, a desktop computer, etc. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, the electronic device may include a processing means (e.g., a central processor, a graphics processor, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM903, various programs and data required for the operation of the electronic device are also stored. The processing device 901, the ROM902, and the RAM903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
In general, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. Communication means 909 may allow the electronic device to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 909, or installed from the storage device 908, or installed from the ROM 902. When executed by the processing device 901, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the client, server, etc. may communicate using any currently known or future developed network protocol, such as HTTP (hypertext transfer protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: selecting images with complex backgrounds and large target numbers from a data set base, preprocessing the images, and labeling the images to obtain a multi-target multi-size data training set under the complex backgrounds; based on a YOLOv3 model, extracting the image by adopting a Darknet-53 feature extraction network structure, and then extracting the features; the channel self-adaptive FPN recursion layer feature enhancement extraction network is adopted to enhance feature extraction, so that the intensity and the precision of feature detection are enhanced; the regression loss function of the replacement boundary frame is self-adaptive by judging the position relationship between the prediction frame and the real frame, the model parameters can be continuously optimized, the convergence speed of the model is accelerated, and the accuracy of the prediction frame is improved; and training the data training set, and performing effect inspection by using the self-collected picture to be tested after the training is finished.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, C++ and Python, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of a unit does not in some cases limit the unit itself; for example, the data processing unit may also be described as "a unit that selects images with complex backgrounds and large numbers of targets from a dataset library and preprocesses them".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example embodiments formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and variations may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. The multi-target multi-size image detection method under the complex background is characterized by comprising the following steps:
selecting images with complex backgrounds and large target numbers from a data set base, preprocessing the images, and labeling the images to obtain a multi-target multi-size data training set under the complex backgrounds;
based on a YOLOv3 model, extracting features from the image by adopting a Darknet-53 feature extraction network structure;
adopting a channel-adaptive FPN recursion-layer feature enhancement extraction network to strengthen feature extraction, so that the strength and precision of feature detection are enhanced;
adaptively replacing the bounding-box regression loss function by judging the positional relationship between the prediction frame and the real frame, so that the model parameters can be continuously optimized, the convergence speed of the model is accelerated, and the accuracy of the prediction frame is improved;
and training the data training set, and performing effect inspection by using the self-collected picture to be tested after the training is finished.
2. The method of claim 1, wherein the preprocessing the image comprises:
randomly initializing a point in the blank image, dividing the blank image into four areas by using initialized abscissas and ordinates, and randomly reading four images;
and performing mirror flipping and scale scaling on the images to form a new image, and performing corresponding rotation, scaling and translation on the labels corresponding to the read images (a minimal sketch of this stitching augmentation is given after this claim).
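By way of illustration only, the following sketch implements the four-region stitching described in claim 2 under stated assumptions: four source images as HxWx3 numpy arrays, labels as normalised (x1, y1, x2, y2) boxes, nearest-neighbour scaling, and a random mirror flip per region. The function name mosaic and the 640-pixel canvas are assumptions, not part of the claim.

import random
import numpy as np

def mosaic(images, boxes_per_image, out_size=640):
    """images: list of 4 HxWx3 uint8 arrays; boxes_per_image: list of Nx4
    arrays in normalised (x1, y1, x2, y2) format for each source image."""
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)  # blank image
    # Randomly initialise the split point; its x/y divide the canvas into four regions.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    new_boxes = []
    for (x1, y1, x2, y2), img, boxes in zip(regions, images, boxes_per_image):
        if random.random() < 0.5:                      # random mirror flip
            img = img[:, ::-1, :]
            boxes = boxes.copy()
            boxes[:, [0, 2]] = 1.0 - boxes[:, [2, 0]]
        h, w = y2 - y1, x2 - x1                        # scale into the region
        yy = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xx = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[yy][:, xx]          # nearest-neighbour resize
        # Translate and rescale the labels into canvas coordinates.
        scaled = boxes.copy()
        scaled[:, [0, 2]] = (boxes[:, [0, 2]] * w + x1) / out_size
        scaled[:, [1, 3]] = (boxes[:, [1, 3]] * h + y1) / out_size
        new_boxes.append(scaled)
    return canvas, np.concatenate(new_boxes, axis=0)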
3. The method of claim 1, wherein the enhancing feature extraction with the channel adaptive FPN recursive layer feature enhancement extraction network enhances the strength and accuracy of feature detection, comprising:
based on the principle that the FPN provides a top-down path to fuse the features in the multi-scale feature maps, an FPN pyramid network structure is established, wherein the network output formula of the FPN is
f_i = F_i(f_{i+1}, X_i)
X_i = B_i(X_{i-1})
wherein B_i represents the ith stage of the bottom-up path of the backbone network, X_i represents the output of the ith stage, F_i represents the operation corresponding to the additional top-down feedback, and f_i represents the output; the value range of i is 1 to S, S represents the number of stages in the backbone network, and f_{S+1} = 0;
Connecting the additional feedback of each FPN layer from top to bottom to the Darknet-53 backbone characteristic extraction layer from bottom to top;
based on the principle that the ECA-Block module fuses the spatial and channel information of each layer's local receptive field, so that the network can construct informative features and adaptively change the channel weight relationship, channel adaptation in the recursion layer is completed through the ECA-Block, and the weight of each channel is calculated through the adaptive weight module to assign the channel weights;
after the ECA-Block channel attention module calculates the weights, each channel has a corresponding number, and this number is the adaptive weight of the channel;
the ECA-Block channel attention module calculates the weights as follows:
the image features extracted by the convolution blocks are output as U ∈ R^{W×H×C}, wherein W, H and C are the width, height and number of channels, respectively;
all spatial information is compressed into statistical information Z ∈ R^{1×1×C} through global average pooling; Z results from shrinking the convolution output over the spatial dimensions H×W, so the cth element z_c of Z is generated by average pooling as
z_c = (1/(W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j);
after the aggregated feature Z ∈ R^{1×1×C} is obtained, the channel weights are generated through a one-dimensional convolution with kernel size k, wherein the one-dimensional convolution kernel k is adaptively determined by the channel dimension C:
k = ψ(C) = | log_2(C)/γ + b/γ |_odd
where k is the size of the convolution kernel, |t|_odd denotes the odd number nearest to t, C is the channel dimension, and γ and b are fixed values of 2 and 1, respectively;
performing feature extraction on the target image twice, wherein the recursive extraction enhances the strength and precision of feature detection and the output features depend on the output of the previous step; R_i represents the feature transformation applied before the adaptive FPN recursion layer is connected back to the backbone network, and the output feature f_i satisfies
f_i = F_i(f_{i+1}, x_i),  x_i = u_i(x_{i-1}, R_i(f_i))
wherein u_i represents the ith layer convolution operation in the backbone extraction network, x_i is the feature after the convolution operation, and F_i represents the ith operation of the FPN on x_i;
the recursive output features can be expanded into a sequential network, i.e. when the recursion is unrolled over the two passes (with f_i^0 = 0):
x_i^1 = u_i(x_{i-1}^1, R_i(f_i^0)),  f_i^1 = F_i(f_{i+1}^1, x_i^1)
x_i^2 = u_i(x_{i-1}^2, R_i(f_i^1)),  f_i^2 = F_i(f_{i+1}^2, x_i^2)
(minimal sketches of the channel-adaptive weighting and of this two-pass unrolling are given after this claim).
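A minimal sketch of the ECA-style channel weighting in claim 3, assuming a PyTorch implementation: global average pooling produces Z ∈ R^{1×1×C}, a one-dimensional convolution whose kernel size k is derived from the channel dimension C (γ = 2, b = 1) produces one adaptive weight per channel, and the weights rescale the feature map. The module name ECAChannelAttention and the sigmoid gating are assumptions.

import math
import torch
import torch.nn as nn

class ECAChannelAttention(nn.Module):
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1        # |t|_odd: nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
    def forward(self, u):                      # u: (N, C, H, W)
        z = u.mean(dim=(2, 3))                 # global average pooling -> (N, C)
        w = self.conv(z.unsqueeze(1))          # 1-D conv across the channel axis
        w = torch.sigmoid(w).squeeze(1)        # one adaptive weight per channel
        return u * w.unsqueeze(-1).unsqueeze(-1)

# usage: reweight a backbone feature map channel by channel
att = ECAChannelAttention(channels=256)
features = torch.rand(2, 256, 52, 52)
reweighted = att(features)                     # same shape, channel-reweighted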
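A second sketch, under the same assumptions, unrolls the recursive extraction of claim 3 for two passes: the feedback R_i of the previous FPN output f_i is injected into the backbone stage u_i before the top-down fusion F_i runs again. The tiny three-stage backbone, the shared channel width and the additive feedback are illustrative choices, not the patented network.

import torch
import torch.nn as nn
import torch.nn.functional as F

C = 64                                        # shared channel width (assumed)

class Stage(nn.Module):
    """Backbone stage u_i: downsamples and can absorb FPN feedback."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, C, 3, stride=2, padding=1)
    def forward(self, x_prev, feedback=None):
        x = self.conv(x_prev)
        return x if feedback is None else x + feedback

stages = nn.ModuleList([Stage(3), Stage(C), Stage(C)])             # u_1..u_3
laterals = nn.ModuleList([nn.Conv2d(C, C, 1) for _ in range(3)])   # part of F_i
feedbacks = nn.ModuleList([nn.Conv2d(C, C, 1) for _ in range(3)])  # R_i

def fpn_pass(xs):
    """Top-down fusion F_i: f_i = lateral(x_i) + upsample(f_{i+1})."""
    fs = [None] * 3
    fs[2] = laterals[2](xs[2])
    for i in (1, 0):
        up = F.interpolate(fs[i + 1], size=xs[i].shape[-2:], mode="nearest")
        fs[i] = laterals[i](xs[i]) + up
    return fs

image = torch.rand(1, 3, 128, 128)
fs = None
for t in range(2):                            # two recursive extraction passes
    xs, x = [], image
    for i, stage in enumerate(stages):
        fb = None if fs is None else feedbacks[i](fs[i])   # R_i(f_i^{t-1})
        x = stage(x, fb)                      # x_i^t = u_i(x_{i-1}^t, R_i(f_i^{t-1}))
        xs.append(x)
    fs = fpn_pass(xs)                         # f_i^t = F_i(f_{i+1}^t, x_i^t)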
4. The method according to claim 1, wherein the adaptively replacing the bounding-box regression loss function by judging the positional relationship between the prediction frame and the real frame, so that the model parameters can be continuously optimized, the convergence speed of the model is accelerated and the accuracy of the prediction frame is improved, comprises:
the prediction frame is A and the real frame is B; the intersection area of the two boxes is C, the union area of the two boxes is D, and the smallest region that can contain both boxes is E;
when no inclusion relationship exists between the prediction frame and the real frame, the loss function uses GIOU, and the loss function calculation formula is:
L_GIOU = 1 - GIOU
GIOU = IOU - (E - D)/E
where IOU represents the ratio of the overlapping area of the two regions to the area of their union,
IOU = C/D
GIOU can distinguish prediction frames and real frames that have the same intersection area but intersect in different ways, and even when no intersection exists between the prediction frame and the real frame during training, the model parameters can still be continuously optimized;
when an inclusion relationship exists between the prediction frame and the real frame, the loss function uses CIOU, and the positional relationship is calculated and optimized through the distance between the center points of the two frames and the aspect ratios of the two frames;
the center-distance penalty term is ρ²(b, b^{gt})/c²
wherein ρ²(b, b^{gt}) represents the square of the distance between the center points of the two frames, and c² represents the square of the diagonal length of the smallest rectangular region containing the prediction frame and the real frame;
the weighting parameter α is given by α = v / ((1 - IOU) + v)
wherein the parameter v represents the consistency of the aspect ratio between the prediction frame and the real frame;
v = (4/π²) (arctan(w^{gt}/h^{gt}) - arctan(w^p/h^p))²
wherein w^{gt} and h^{gt} represent the width and height of the real frame, respectively, and w^p and h^p represent the width and height of the prediction frame;
the complete definition of CIOU is
CIOU = IOU - ρ²(b, b^{gt})/c² - αv;
by adaptively replacing the bounding-box regression loss function according to the positional relationship between the prediction frame and the real frame, the model parameters can be continuously optimized and the degree of match between the prediction frame and the real frame can be estimated accurately, so that the detection precision is effectively improved (a minimal sketch of this adaptive loss is given after this claim).
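A minimal sketch of the adaptive box-regression loss in claim 4, assuming (x1, y1, x2, y2) box tensors: GIoU is used when neither box contains the other, and CIoU (center-distance plus aspect-ratio penalty) when one box encloses the other. The helper names and the containment test are assumptions about how "inclusion" is decided.

import math
import torch

def iou_terms(a, b):
    inter_w = (torch.min(a[2], b[2]) - torch.max(a[0], b[0])).clamp(min=0)
    inter_h = (torch.min(a[3], b[3]) - torch.max(a[1], b[1])).clamp(min=0)
    inter = inter_w * inter_h                                   # intersection area C
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter                             # union area D
    enc_w = torch.max(a[2], b[2]) - torch.min(a[0], b[0])
    enc_h = torch.max(a[3], b[3]) - torch.min(a[1], b[1])
    return inter / union, union, enc_w * enc_h                  # IoU, D, enclosing area E

def contains(outer, inner):
    return bool((outer[0] <= inner[0]) & (outer[1] <= inner[1])
                & (outer[2] >= inner[2]) & (outer[3] >= inner[3]))

def adaptive_box_loss(pred, gt):
    iou, union, enclose = iou_terms(pred, gt)
    if not (contains(pred, gt) or contains(gt, pred)):
        giou = iou - (enclose - union) / enclose                # GIoU branch
        return 1.0 - giou
    # CIoU branch: center-distance and aspect-ratio penalties.
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2
    diag2 = (torch.max(pred[2], gt[2]) - torch.min(pred[0], gt[0])) ** 2 \
          + (torch.max(pred[3], gt[3]) - torch.min(pred[1], gt[1])) ** 2
    v = (4 / math.pi ** 2) * (torch.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
                              - torch.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))) ** 2
    alpha = v / ((1 - iou) + v)
    ciou = iou - rho2 / diag2 - alpha * v
    return 1.0 - ciou

# usage: non-overlapping or partially overlapping boxes take the GIoU branch
pred = torch.tensor([50., 50., 150., 150.])
gt = torch.tensor([60., 60., 140., 160.])
print(adaptive_box_loss(pred, gt))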
5. A multi-target multi-size image detection system in a complex context, the system comprising:
the data processing unit is used for selecting images with complex backgrounds and large target quantity from the data set library, preprocessing the images, and marking the images to obtain a multi-target multi-size data training set under the complex backgrounds;
the feature extraction unit is used for extracting features from the image by adopting a Darknet-53 feature extraction network structure based on the YOLOv3 model;
the feature strengthening unit is used for strengthening feature extraction by adopting a channel-adaptive FPN recursion-layer feature enhancement extraction network, so that the strength and precision of feature detection are enhanced;
the judging unit is used for adaptively replacing the bounding-box regression loss function by judging the positional relationship between the prediction frame and the real frame, so that the model parameters can be continuously optimized, the convergence speed of the model is accelerated, and the accuracy of the prediction frame is improved;
and the checking unit is used for training the data training set, and performing effect checking by using the self-collected picture to be tested after the training is finished.
6. The system according to claim 5, wherein the preprocessing the images comprises:
Randomly initializing a point in the blank image, dividing the blank image into four areas by using initialized abscissas and ordinates, and randomly reading four images;
and performing mirror flipping and scale scaling on the images to form a new image, and performing corresponding rotation, scaling and translation on the labels corresponding to the read images.
7. The system of claim 5, wherein the feature enhancement unit enhances feature extraction using a channel adaptive FPN recursive layer feature enhancement extraction network to enhance the strength and accuracy of feature detection, comprising:
based on the principle that the FPN provides a top-down path to fuse the features in the multi-scale feature maps, an FPN pyramid network structure is established, wherein the network output formula of the FPN is as follows
f_i = F_i(f_{i+1}, X_i)
X_i = B_i(X_{i-1})
wherein B_i represents the ith stage of the bottom-up path of the backbone network, X_i represents the output of the ith stage, F_i represents the operation corresponding to the additional top-down feedback, and f_i represents the output; the value range of i is 1 to S, S represents the number of stages in the backbone network, and f_{S+1} = 0;
Connecting the additional feedback of each FPN layer from top to bottom to the Darknet-53 backbone characteristic extraction layer from bottom to top;
based on the principle that the ECA-Block module fuses the spatial and channel information of each layer's local receptive field, so that the network can construct informative features and adaptively change the channel weight relationship, channel adaptation in the recursion layer is completed through the ECA-Block, and the weight of each channel is calculated through the adaptive weight module to assign the channel weights;
after the ECA-Block channel attention module calculates the weights, each channel has a corresponding number, and this number is the adaptive weight of the channel;
the ECA-Block channel attention module calculates the weights as follows:
the image features extracted by the convolution blocks are output as U ∈ R^{W×H×C}, wherein W, H and C are the width, height and number of channels, respectively;
all spatial information is compressed into statistical information Z ∈ R^{1×1×C} through global average pooling; Z results from shrinking the convolution output over the spatial dimensions H×W, so the cth element z_c of Z is generated by average pooling as
z_c = (1/(W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j);
after the aggregated feature Z ∈ R^{1×1×C} is obtained, the channel weights are generated through a one-dimensional convolution with kernel size k, wherein the one-dimensional convolution kernel k is adaptively determined by the channel dimension C:
k = ψ(C) = | log_2(C)/γ + b/γ |_odd
where k is the size of the convolution kernel, |t|_odd denotes the odd number nearest to t, C is the channel dimension, and γ and b are fixed values of 2 and 1, respectively;
performing feature extraction on the target image twice, wherein the recursive extraction enhances the strength and precision of feature detection and the output features depend on the output of the previous step; R_i represents the feature transformation applied before the adaptive FPN recursion layer is connected back to the backbone network, and the output feature f_i satisfies
f_i = F_i(f_{i+1}, x_i),  x_i = u_i(x_{i-1}, R_i(f_i))
wherein u_i represents the ith layer convolution operation in the backbone extraction network, x_i is the feature after the convolution operation, and F_i represents the ith operation of the FPN on x_i;
the recursive output features can be expanded into a sequential network, i.e. when the recursion is unrolled over the two passes (with f_i^0 = 0):
x_i^1 = u_i(x_{i-1}^1, R_i(f_i^0)),  f_i^1 = F_i(f_{i+1}^1, x_i^1)
x_i^2 = u_i(x_{i-1}^2, R_i(f_i^1)),  f_i^2 = F_i(f_{i+1}^2, x_i^2).
8. The system according to claim 5, wherein the judging unit is configured to adaptively replace the bounding-box regression loss function by judging the positional relationship between the prediction frame and the real frame, so that the model parameters can be continuously optimized, the convergence speed of the model is accelerated and the accuracy of the prediction frame is improved, and the judging unit comprises:
the prediction frame is A and the real frame is B; the intersection area of the two boxes is C, the union area of the two boxes is D, and the smallest region that can contain both boxes is E;
when no inclusion relationship exists between the prediction frame and the real frame, the loss function uses GIOU, and the loss function calculation formula is:
L_GIOU = 1 - GIOU
GIOU = IOU - (E - D)/E
where IOU represents the ratio of the overlapping area of the two regions to the area of their union,
IOU = C/D
the states of the prediction frame and the real frame which have the same intersection region but different intersection modes can be distinguished, and model parameters can be continuously optimized even if the intersection region does not exist between the prediction frame and the real frame in the training process;
when an inclusion relationship exists between the prediction frame and the real frame, the loss function uses CIOU, and the positional relationship is calculated and optimized through the distance between the center points of the two frames and the aspect ratios of the two frames;
the center-distance penalty term is ρ²(b, b^{gt})/c²
wherein ρ²(b, b^{gt}) represents the square of the distance between the center points of the two frames, and c² represents the square of the diagonal length of the smallest rectangular region containing the prediction frame and the real frame;
the weighting parameter α is given by α = v / ((1 - IOU) + v)
wherein the parameter v represents the consistency of the aspect ratio between the prediction frame and the real frame;
v = (4/π²) (arctan(w^{gt}/h^{gt}) - arctan(w^p/h^p))²
wherein w^{gt} and h^{gt} represent the width and height of the real frame, respectively, and w^p and h^p represent the width and height of the prediction frame;
the complete definition of CIOU is
CIOU = IOU - ρ²(b, b^{gt})/c² - αv;
by adaptively replacing the bounding-box regression loss function according to the positional relationship between the prediction frame and the real frame, the model parameters can be continuously optimized and the degree of match between the prediction frame and the real frame can be estimated accurately, so that the detection precision is effectively improved.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-4.
10. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-4.
CN202310233827.6A 2023-03-03 2023-03-12 Multi-target multi-size image detection method and system under complex background, electronic equipment and storage medium Pending CN116229086A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2023102012746 2023-03-03
CN202310201274 2023-03-03

Publications (1)

Publication Number Publication Date
CN116229086A true CN116229086A (en) 2023-06-06

Family

ID=86589004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310233827.6A Pending CN116229086A (en) 2023-03-03 2023-03-12 Multi-target multi-size image detection method and system under complex background, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116229086A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination