CN114677504B - Target detection method, device, equipment terminal and readable storage medium - Google Patents

Target detection method, device, equipment terminal and readable storage medium Download PDF

Info

Publication number
CN114677504B
Authority
CN
China
Prior art keywords
attention
feature map
feature
extraction
intermediate feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210600445.8A
Other languages
Chinese (zh)
Other versions
CN114677504A (en)
Inventor
陈磊
周有喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Core Computing Integrated Shenzhen Technology Co ltd
Original Assignee
Shenzhen Aishen Yingtong Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aishen Yingtong Information Technology Co Ltd filed Critical Shenzhen Aishen Yingtong Information Technology Co Ltd
Priority to CN202210600445.8A priority Critical patent/CN114677504B/en
Publication of CN114677504A publication Critical patent/CN114677504A/en
Application granted granted Critical
Publication of CN114677504B publication Critical patent/CN114677504B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target detection method, a device, an equipment terminal and a readable storage medium. The target detection method preprocesses each training picture in a training set through an input end to obtain a preprocessed training set; extracts features from each training picture in the preprocessed training set based on a feature extraction unit to obtain intermediate feature maps of different scales; obtains, according to the size of each intermediate feature map, at least two attention subunits to respectively perform feature extraction on each intermediate feature map so as to obtain the corresponding attention extraction feature maps; respectively performs feature merging on each intermediate feature map and its corresponding attention extraction feature map to obtain each target feature map; respectively detects each target feature map through a prediction output unit to generate corresponding predicted values; and performs loss function calculation according to the corresponding predicted values to generate a corresponding target detection model. The target detection method improves the accuracy of target detection on the whole.

Description

Target detection method, device, equipment terminal and readable storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a target detection method, apparatus, device terminal, and readable storage medium.
Background
With the widespread application of deep convolutional neural networks in the field of computer vision, real-time target detection models represented by the YOLO algorithm have achieved good detection results in industrial fields and practical application scenarios.
The YOLOv5-Lite model is improved on the basis of the previous-generation YOLOv4; it has a higher training speed and a smaller model size, which is favorable for rapid deployment of the model.
In practical applications, shooting scenes with near and far objects and complex environments produce a large number of targets of various sizes; however, feature extraction and collection cannot be performed on targets of each size in a targeted manner, so the overall target detection accuracy is not high.
Disclosure of Invention
In view of this, the present application provides a target detection method, an apparatus, a device terminal, and a readable storage medium, which overcome the disadvantage that the YOLOv5-Lite model cannot perform targeted feature extraction and collection for targets of various sizes, and improve the overall detection accuracy of the YOLOv5-Lite model.
A target detection method is applied to a YOLOv5-Lite network, the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are sequentially connected, the attention unit comprises a plurality of different attention subunits, and the target detection method comprises the following steps:
acquiring picture input data as a training set;
preprocessing each training picture in the training set through an input end to obtain a preprocessed training set;
extracting the features of each training picture in the preprocessed training set based on a feature extraction unit to obtain intermediate feature maps with different scales;
according to the size of each intermediate feature map, at least two attention subunits are obtained to respectively perform feature extraction on each intermediate feature map so as to obtain an attention extraction feature map corresponding to each intermediate feature map;
respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
respectively detecting each target characteristic diagram through a prediction output unit to generate corresponding prediction values;
and calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model.
In one embodiment, the target detection method further comprises:
acquiring picture input data as a test set;
and testing the test set according to the target detection model, and outputting a corresponding target detection result.
In one embodiment, the feature extraction unit includes a backbone unit and a Neck unit which are connected in sequence, the backbone unit is connected with the input end, and the output end of the Neck unit is connected with the attention unit; the step of performing feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales includes:
carrying out slicing operation and convolution operation on each training picture in the preprocessed training set based on a back bone unit to obtain an initial feature map;
and performing secondary feature extraction on the initial feature map based on a Neck unit to obtain intermediate feature maps with different scales.
In one embodiment, the attention unit includes a first attention subunit and a second attention subunit, the intermediate feature maps have three dimensions, and the step of obtaining at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map to obtain an attention extraction feature map corresponding to each intermediate feature map respectively includes:
performing feature extraction on the intermediate feature map of the first scale through a first attention subunit to obtain a corresponding first attention extraction feature map;
and respectively extracting the features of the intermediate feature maps in the second scale and the third scale through a second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In one embodiment, the first attention subunit is a compression and excitation module and the second attention subunit is a convolution block attention module.
In one embodiment, the attention unit includes a first attention subunit, a second attention subunit, and a third attention subunit, the intermediate feature maps have three dimensions, and the step of obtaining at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map to obtain an attention extraction feature map corresponding to each intermediate feature map includes:
performing feature extraction on the intermediate feature map of the first scale through a first attention subunit to obtain a corresponding first attention extraction feature map;
performing feature extraction on the intermediate feature map of the second scale through a second attention subunit to obtain a second attention extraction feature map;
and performing feature extraction on the intermediate feature map of the third scale through a third attention subunit to obtain a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In an embodiment, a batch normalization layer is further connected between the feature extraction unit and the attention unit, and the step of obtaining at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map so as to obtain an attention extraction feature map corresponding to each intermediate feature map further includes:
respectively standardizing the intermediate characteristic diagrams with different scales based on the batch standardization layer, and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor to obtain the standardized intermediate characteristic diagrams with different scales.
In one embodiment, the formula employed in the normalization process is:
\( y_i = \gamma_i \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \)
wherein y_i represents the normalized intermediate feature map corresponding to the i-th channel, m represents the number of channels of each input intermediate feature map, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, x_i represents the input intermediate feature map corresponding to the i-th channel, μ_B represents the mean of the input m-channel intermediate feature maps, σ_B² represents the overall variance of the input m-channel intermediate feature maps, and ε and β both represent constants.
In one embodiment, the loss function is:
\( L_{\mathrm{total}} = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{j} g(\gamma_j), \qquad g(\gamma_j) = \left| \gamma_j \right| \)
wherein L_total represents the overall loss function value of the YOLOv5-Lite network, λ represents the penalty coefficient, x represents the input target feature map, f(x) represents the predicted value, y represents the corresponding true value, l(f(x), y) represents the loss function value for x and y, W represents the weight corresponding to each channel, g(·) represents the L1-norm operation that performs an absolute-value summation over the weights, i and j each represent positive integer variables, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, and γ_j represents the j-th preset dynamic adjustment factor.
In addition, an object detection device is provided, which is applied to a YOLOv5-Lite network, the YOLOv5-Lite network includes an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, the attention unit includes a plurality of different attention subunits, and the object detection device includes:
the training set generation module is used for acquiring picture input data as a training set;
the preprocessing module is used for preprocessing each training picture in the training set through an input end to obtain a preprocessed training set;
the first feature map generation module is used for extracting features of each training picture in the preprocessed training set based on the feature extraction unit so as to obtain intermediate feature maps with different scales;
the second feature map generation module is used for acquiring at least two attention subunits to respectively perform feature extraction on each intermediate feature map according to the size of each intermediate feature map so as to obtain an attention extraction feature map corresponding to each intermediate feature map;
the target feature map generation module is used for respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
the prediction value generation module is used for respectively detecting each target characteristic diagram through the prediction output unit so as to generate a corresponding prediction value;
and the detection model generation module is used for calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model.
In addition, an apparatus terminal is provided, which includes a processor and a memory, the memory is used for storing a computer program, and the processor runs the computer program to make the apparatus terminal execute the above object detection method.
Furthermore, a readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the above object detection method.
The target detection method is applied to a YOLOv5-Lite network, wherein the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, and the attention unit comprises a plurality of different attention subunits. The target detection method acquires picture input data as a training set, preprocesses each training picture in the training set through the input end to obtain a preprocessed training set, performs feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales, obtains at least two attention subunits according to the size of each intermediate feature map to respectively perform feature extraction on each intermediate feature map so as to obtain the attention extraction feature map corresponding to each intermediate feature map, and respectively performs feature merging on each intermediate feature map and its corresponding attention extraction feature map. By obtaining at least two attention subunits to respectively extract features from each intermediate feature map, the target detection model can, when detecting targets of each size in an image, extract the corresponding feature information through the attention subunit corresponding to the size of each intermediate feature map; that is, feature extraction and collection can be performed on targets of each size in a targeted manner. Meanwhile, each intermediate feature map and its corresponding attention extraction feature map are further merged to obtain each target feature map: on the one hand, the attention extraction feature map extracts more information from the original intermediate feature map; on the other hand, the original intermediate feature map is retained. Merging the information of the two feature maps therefore yields more useful feature information, and the detection accuracy for targets of all sizes is further improved as a whole.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic application environment diagram of a target detection method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a target detection method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of another target detection method provided in the embodiments of the present application;
FIG. 4 is a schematic flowchart of a method for obtaining intermediate feature maps of different scales according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an attention unit according to an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating a method for obtaining attention extraction feature maps corresponding to respective intermediate feature maps according to an embodiment of the present application;
FIG. 7 is a block diagram of another attention unit configuration provided in an embodiment of the present application;
fig. 8 is a schematic flowchart of another method for obtaining attention extraction feature maps corresponding to respective intermediate feature maps according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an application environment of another target detection method provided in an embodiment of the present application;
FIG. 10 is a schematic flowchart illustrating a further method for detecting an object according to an embodiment of the present application;
fig. 11 is a block diagram of a target detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The following embodiments and their technical features may be combined with each other without conflict.
As shown in fig. 1, an application environment schematic diagram of an object detection method is provided, fig. 1 is a schematic structural block diagram of a YOLOv5-Lite network, the YOLOv5-Lite network includes an input end 11, a feature extraction unit 12, an attention unit 13 and a prediction output unit 14, which are connected in sequence, and the attention unit 13 includes a plurality of different attention subunits.
As shown in fig. 2, there is provided an object detection method including:
step S110, acquiring the picture input data as a training set.
When the target is detected, a training set needs to be established to obtain a target detection model, and a large amount of picture input data needs to be acquired as the training set.
And step S120, preprocessing each training picture in the training set through the input end to obtain a preprocessed training set.
Each training picture in the training set needs to be further preprocessed, because many captured pictures in the picture input data have not yet been labeled. In addition, the preprocessing can also include at least one of data enhancement, adaptive anchor frame calculation and adaptive picture scaling, so as to obtain the preprocessed training set.
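By way of illustration, a minimal sketch of the adaptive picture scaling (letterbox) step is given below. The function name, target size and padding value are illustrative assumptions rather than part of the disclosure, and Python with OpenCV/PyTorch is assumed as the framework for all code sketches in this description.

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_shape=(640, 640), color=(114, 114, 114)):
    """Resize an image to new_shape while keeping its aspect ratio, padding the rest."""
    h, w = img.shape[:2]
    r = min(new_shape[0] / h, new_shape[1] / w)          # scale ratio
    new_unpad = (int(round(w * r)), int(round(h * r)))   # (width, height) after resize
    dw = new_shape[1] - new_unpad[0]                      # total horizontal padding
    dh = new_shape[0] - new_unpad[1]                      # total vertical padding
    img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = dh // 2, dh - dh // 2
    left, right = dw // 2, dw - dw // 2
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=color)
```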
And step S130, extracting the features of each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps with different scales.
The YOLOv5-Lite network generally comprises a plurality of feature extraction units, and the YOLOv5-Lite network performs feature extraction on each training picture in the preprocessed training set through the plurality of feature extraction units to obtain intermediate feature maps with different scales.
Step S140, at least two attention subunits are obtained according to the size of each intermediate feature map to perform feature extraction on each intermediate feature map respectively, so as to obtain an attention extraction feature map corresponding to each intermediate feature map.
And respectively extracting the features of each intermediate feature map by adopting a corresponding proper attention subunit according to the size of each intermediate feature map, so as to obtain the attention extraction feature map corresponding to each intermediate feature map.
In an embodiment, three intermediate feature maps with different scales are obtained, and at this time, according to the size of each intermediate feature map, at least two attention subunits may be obtained to perform feature extraction on each intermediate feature map respectively, so as to obtain an attention extraction feature map corresponding to each intermediate feature map, where one attention subunit is used to perform feature extraction on the intermediate feature map of one scale, and the other attention subunit is used to perform feature extraction on the intermediate feature maps of the remaining two scales.
In this embodiment, corresponding attention subunits are respectively adopted for feature extraction on each intermediate feature graph with different scales, so that when the target detection model detects targets with various sizes in a picture, corresponding feature information can be respectively extracted through the corresponding attention subunits according to the size of each intermediate feature graph, that is, feature extraction and collection can be respectively performed on the targets with various sizes in a targeted manner.
And step S150, respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map.
On one hand, the attention extraction feature map extracts more information from the original intermediate feature map, on the other hand, the original intermediate feature map is retained, and then the information of the two feature maps is merged, so that more useful feature information is obtained, and the detection accuracy of the target of each size is further improved on the whole.
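A minimal sketch of this feature merging step follows; channel-wise concatenation followed by a 1x1 convolution is assumed here as the merge operation, which is one plausible reading of "feature combination" rather than the specific operation disclosed.

```python
import torch
import torch.nn as nn

class FeatureMerge(nn.Module):
    """Merge an intermediate feature map with its attention-extracted counterpart."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution fuses the concatenated maps back to the original width
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, intermediate: torch.Tensor, attention: torch.Tensor) -> torch.Tensor:
        merged = torch.cat([intermediate, attention], dim=1)  # keep both information sources
        return self.fuse(merged)
```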
In step S160, the target feature maps are detected by the prediction output unit to generate corresponding prediction values.
The prediction output unit generally corresponds to the Head part of the YOLOv5-Lite network.
And S170, calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model.
The above target detection method is applied to a YOLOv5-Lite network, wherein the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, and the attention unit comprises a plurality of different attention subunits. The target detection method obtains at least two attention subunits to respectively perform feature extraction on each intermediate feature map so as to obtain the attention extraction feature map corresponding to each intermediate feature map, so that when detecting targets of each size in a picture, the target detection model can extract the corresponding feature information through the attention subunit corresponding to the size of each intermediate feature map; that is, feature extraction and collection can be performed on targets of each size in a targeted manner. Meanwhile, each intermediate feature map and its corresponding attention extraction feature map are respectively merged to obtain each target feature map: on the one hand, the attention extraction feature map extracts more information from the original intermediate feature map; on the other hand, the original intermediate feature map is retained. Merging the information of the two feature maps therefore yields more useful feature information, and the detection accuracy for targets of all sizes is further improved as a whole.
In one embodiment, as shown in fig. 3, the object detection method further includes:
and step S180, acquiring picture input data as a test set.
And step S190, testing the test set according to the target detection model, and outputting a corresponding target detection result.
In one embodiment, as shown in fig. 1, the feature extraction unit 12 includes a backbone unit and a Neck unit connected in sequence, the backbone unit is connected to the input end 11, and the output end of the Neck unit is connected to the attention unit 13. As shown in fig. 4, step S130 includes:
step S132, based on the backbone unit, slicing operation and convolution operation are carried out on each training picture in the preprocessed training set, so as to obtain an initial feature map.
And S134, performing secondary feature extraction on the initial feature map based on the Neck unit to obtain intermediate feature maps with different scales.
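A minimal sketch of the slicing operation performed by the backbone unit is given below, assuming the YOLOv5-style Focus slice; the patent only states that a slicing operation and a convolution operation are performed, so this concrete form is an assumption.

```python
import torch
import torch.nn as nn

class FocusSlice(nn.Module):
    """Slice every 2x2 pixel neighbourhood into channels, then convolve (Focus-style slicing)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Four spatial sub-samplings stacked along the channel axis halve H and W
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)
```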
In one embodiment, as shown in fig. 5, the attention unit 13 includes a first attention subunit 13a and a second attention subunit 13b, as shown in fig. 6, step S140 includes:
in step S141, feature extraction is performed on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map.
And step S142, respectively performing feature extraction on the intermediate feature maps of the second scale and the third scale through a second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In this embodiment, the first attention subunit performs feature extraction on the intermediate feature map with the largest scale (i.e., the intermediate feature map with the first scale), and then, for the intermediate feature maps with the smaller scales, the second attention subunit performs feature extraction, so that more feature information can be extracted from the intermediate feature map with the smaller scale, that is, corresponding feature information can be respectively extracted through the respective corresponding attention subunits according to the sizes of the intermediate feature maps, thereby implementing feature extraction and collection on targets with various sizes respectively and specifically, and further improving the overall detection accuracy on the targets with various sizes.
In one embodiment, the first attention subunit is a compression and excitation module and the second attention subunit is a convolution block attention module.
The compression and excitation module is the Squeeze-and-Excitation (SE) module, and the convolution block attention module is the Convolutional Block Attention Module (CBAM).
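Minimal sketches of the two attention subunits follow, assuming their commonly published formulations (squeeze-and-excitation channel reweighting for SE, channel-then-spatial attention for CBAM); the reduction ratio and spatial kernel size are illustrative defaults, not values taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels from globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention from average- and max-pooled descriptors
        avg = self.channel_mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.channel_mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True)[0]], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```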
In one embodiment, as shown in fig. 7, the attention unit includes a first attention subunit 13a, a second attention subunit 13b and a third attention subunit 13c, as shown in fig. 8, and step S140 includes:
and step S143, performing feature extraction on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map.
And step S144, performing feature extraction on the intermediate feature map of the second scale through a second attention subunit to obtain a second attention extraction feature map.
And S145, performing feature extraction on the intermediate feature map of the third scale through a third attention subunit to obtain a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In this embodiment, the first attention subunit performs feature extraction on the intermediate feature map with the largest scale (that is, the intermediate feature map with the first scale), then, for the intermediate feature map with the second scale with the smaller scale, the second attention subunit performs feature extraction, and for the intermediate feature map with the third scale with the smaller scale, the third attention subunit performs feature extraction, so that more feature information can be further extracted from the intermediate feature map with the smaller scale, that is, the corresponding feature information can be further extracted through the respective attention subunits according to the sizes of the intermediate feature maps, thereby implementing feature extraction and collection respectively and specifically for targets with various sizes, and further improving the detection accuracy of the targets with various sizes as a whole.
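A minimal sketch of routing one attention subunit to each scale is given below; the channel widths in the commented example and the choice of a second CBAM as the third subunit are illustrative assumptions, since this embodiment does not name the third module.

```python
import torch.nn as nn

class ScaleRoutedAttention(nn.Module):
    """Apply a dedicated attention subunit to each intermediate feature map scale."""
    def __init__(self, subunits: nn.ModuleList):
        super().__init__()
        # subunits[k] handles the k-th scale, ordered from the first (largest) scale
        self.subunits = subunits

    def forward(self, feature_maps):
        # feature_maps: list of intermediate maps ordered from the first to the third scale
        return [attn(fmap) for attn, fmap in zip(self.subunits, feature_maps)]

# Illustrative wiring using the SEBlock/CBAM sketches above (channel widths are assumed):
# attention = ScaleRoutedAttention(nn.ModuleList([SEBlock(128), CBAM(256), CBAM(512)]))
```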
In one embodiment, as shown in fig. 9, a batch normalization layer 15 is further connected between the feature extraction unit 12 and the attention unit 13, and as shown in fig. 10, step S140 further includes:
and S200, respectively carrying out standardization processing on the intermediate characteristic diagrams with different scales based on the batch standardization layer, and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor to obtain the standardized intermediate characteristic diagrams with different scales.
In this embodiment, the intermediate feature maps are normalized by the batch normalization layer, and a preset dynamic adjustment factor is added. The preset dynamic adjustment factor can reflect the degree of information change in each intermediate feature map, that is, the variance in the batch normalization layer. In other words, the variance reflects the degree of information change: the larger the variance, the larger the degree of information change, the richer the information, and the higher the importance; conversely, the smaller the variance, the smaller the degree of information change and the lower the importance. Therefore, by setting the batch normalization layer, the subsequent attention unit can better extract feature map information.
In the process of performing subsequent steps S140 to S150, the normalized intermediate feature maps with different scales need to be processed, and steps S160 to S170 are unchanged as shown in fig. 10, that is:
step S140, at least two attention subunits are obtained to respectively perform feature extraction on each normalized intermediate feature map according to the size of each normalized intermediate feature map, so as to obtain an attention extraction feature map corresponding to each normalized intermediate feature map.
And step S150, respectively carrying out feature merging on each normalized intermediate feature map and the attention extraction feature maps corresponding to each normalized intermediate feature map to obtain each target feature map.
In one embodiment, the formula employed in the normalization process is:
\( y_i = \gamma_i \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \)
wherein y_i represents the normalized intermediate feature map corresponding to the i-th channel, m represents the number of channels of each input intermediate feature map, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, x_i represents the input intermediate feature map corresponding to the i-th channel, μ_B represents the mean of the input m-channel intermediate feature maps, σ_B² represents the overall variance of the input m-channel intermediate feature maps, and ε and β both represent constants.
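A minimal sketch of a batch normalization layer whose per-channel scale plays the role of the preset dynamic adjustment factor is given below; treating the factor as the learnable γ of standard batch normalization, with statistics computed over the batch and spatial dimensions, is an assumption drawn from the formula above.

```python
import torch
import torch.nn as nn

class DynamicFactorBatchNorm(nn.Module):
    """Batch normalization whose per-channel scale is the preset dynamic adjustment factor."""
    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # gamma: preset dynamic adjustment factor per channel; beta: constant shift
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); statistics over batch and spatial dimensions
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.gamma.view(1, -1, 1, 1) + self.beta.view(1, -1, 1, 1)
```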
In one embodiment, the loss function is:
\( L_{\mathrm{total}} = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{j} g(\gamma_j), \qquad g(\gamma_j) = \left| \gamma_j \right| \)
wherein L_total represents the overall loss function value of the YOLOv5-Lite network, λ represents the penalty coefficient, x represents the input target feature map, f(x) represents the predicted value, y represents the corresponding true value, l(f(x), y) represents the loss function value for x and y, W represents the weight corresponding to each channel, g(·) represents the L1-norm operation that performs an absolute-value summation over the weights, i and j each represent positive integer variables, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, and γ_j represents the j-th preset dynamic adjustment factor.
On the basis of the embodiment shown in fig. 8, setting the batch normalization layer makes the overall loss function of the YOLOv5-Lite network include the term \( \lambda \sum_{j} |\gamma_j| \), so that the loss function can be adjusted through the preset dynamic adjustment factors and the accuracy of target detection is improved on the whole.
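A minimal sketch of a training loss with this added term follows; the detection loss is left as a placeholder, and applying the L1 penalty to the dynamic adjustment factors (the gamma parameters of the layer sketched earlier) is an assumption consistent with the formula given above.

```python
import torch

def total_loss(detection_loss: torch.Tensor,
               dynamic_factors,        # iterable of per-channel factor tensors (gamma)
               penalty: float) -> torch.Tensor:
    """Overall loss: detection loss plus an L1 penalty on the dynamic adjustment factors."""
    l1_term = sum(g.abs().sum() for g in dynamic_factors)
    return detection_loss + penalty * l1_term

# Usage sketch (names are illustrative):
# factors = [m.gamma for m in model.modules() if isinstance(m, DynamicFactorBatchNorm)]
# loss = total_loss(yolo_loss(pred, target), factors, penalty=1e-4)
```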
Further, as shown in fig. 11, there is provided an object detection apparatus 300 applied to the YOLOv5-Lite network shown in fig. 1, the object detection apparatus 300 including:
a training set generating module 310, configured to obtain picture input data as a training set;
the preprocessing module 320 is used for preprocessing each training picture in the training set through an input end to obtain a preprocessed training set;
a first feature map generation module 330, configured to perform feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales;
a second feature map generation module 340, configured to obtain at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map, so as to obtain an attention extraction feature map corresponding to each intermediate feature map;
the target feature map generation module 350 is configured to perform feature merging on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps, so as to obtain each target feature map;
and the predicted value generation module 360 detects each target feature map through the prediction output unit to generate a corresponding predicted value.
And the detection model generation module 370 performs loss function calculation according to the corresponding predicted value to obtain an optimized gradient, and performs weight and bias updating until the loss function converges to generate a corresponding target detection model.
In addition, an apparatus terminal is provided, which includes a processor and a memory, the memory is used for storing a computer program, and the processor runs the computer program to make the apparatus terminal execute the above object detection method.
Furthermore, a readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the above object detection method.
The division of the units in the device is only used for illustration, and in other embodiments, the device may be divided into different units as needed to complete all or part of the functions of the device. For the specific limitations of the above device, reference may be made to the limitations of the above method, which are not described herein again.
That is, the above description is only an embodiment of the present application, and not intended to limit the scope of the present application, and all equivalent structures or equivalent flow transformations made by using the contents of the specification and the drawings, such as mutual combination of technical features between various embodiments, or direct or indirect application to other related technical fields, are included in the scope of the present application.
In addition, the present application may use the same or different reference numerals for structural elements having the same or similar characteristics. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In this application, the word "for example" is used to mean "serving as an example, instance, or illustration". Any embodiment described herein as "for example" is not necessarily to be construed as preferred or advantageous over other embodiments. The previous description is provided to enable any person skilled in the art to make or use the present application. In the foregoing description, various details have been set forth for the purpose of explanation.
It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes are not shown in detail to avoid obscuring the description of the present application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Claims (8)

1. An object detection method applied to a YOLOv5-Lite network, wherein the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, the attention unit comprises a plurality of different attention subunits, and the object detection method comprises the following steps:
acquiring picture input data as a training set;
preprocessing each training picture in the training set through the input end to obtain a preprocessed training set;
extracting the features of each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps with different scales;
according to the size of each intermediate feature map, at least two attention subunits are obtained to respectively extract features of each intermediate feature map so as to obtain attention extraction feature maps corresponding to each intermediate feature map;
respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
detecting each target characteristic diagram through the prediction output unit to generate corresponding prediction values;
calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model;
the method comprises the following steps that a batch standardization layer is further connected between the feature extraction unit and the attention unit, at least two attention subunits are obtained according to the size of each intermediate feature graph to respectively extract features of each intermediate feature graph, and the steps of obtaining the attention extraction feature graphs corresponding to the intermediate feature graphs respectively further comprise the following steps:
respectively standardizing the intermediate characteristic diagrams with different scales based on the batch standardization layer, and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor to obtain the standardized intermediate characteristic diagrams with different scales;
the attention unit comprises a first attention subunit and a second attention subunit, wherein the first attention subunit is a compression and excitation module, and the second attention subunit is a convolution block attention module; the method comprises the following steps of obtaining at least two attention subunits according to the size of each intermediate feature map, and respectively extracting features of each intermediate feature map to obtain an attention extraction feature map corresponding to each intermediate feature map, wherein the three scales of the intermediate feature maps are three, and the step of obtaining the attention extraction feature map corresponding to each intermediate feature map comprises the following steps:
performing feature extraction on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map;
and respectively extracting features of the intermediate feature maps of the second scale and the third scale through the second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
2. The object detection method according to claim 1, characterized in that the object detection method further comprises:
acquiring picture input data as a test set;
and testing the test set according to the target detection model, and outputting a corresponding target detection result.
3. The target detection method according to claim 1, wherein the feature extraction unit includes a backbone unit and a Neck unit which are connected in sequence, the backbone unit is connected to the input end, the output end of the Neck unit is connected to the attention unit, and the step of performing feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales includes: performing a slicing operation and a convolution operation on each training picture in the preprocessed training set based on the backbone unit to obtain an initial feature map;
and performing secondary feature extraction on the initial feature map based on the Neck unit to obtain intermediate feature maps with different scales.
4. The object detection method according to claim 1, wherein the formula employed in the normalization process is:
\( y_i = \gamma_i \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \)
wherein y_i represents the normalized intermediate feature map corresponding to the i-th channel, m represents the number of channels of each input intermediate feature map, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, x_i represents the input intermediate feature map corresponding to the i-th channel, μ_B represents the mean of the input m-channel intermediate feature maps, σ_B² represents the overall variance of the input m-channel intermediate feature maps, and ε and β both represent constants.
5. The object detection method of claim 4, wherein the loss function is:
\( L_{\mathrm{total}} = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{j} g(\gamma_j), \qquad g(\gamma_j) = \left| \gamma_j \right| \)
wherein L_total represents the overall loss function value of the YOLOv5-Lite network, λ represents the penalty coefficient, x represents the input target feature map, f(x) represents the predicted value, y represents the corresponding true value, l(f(x), y) represents the loss function value for x and y, W represents the weight corresponding to each channel, g(·) represents the L1-norm operation that performs an absolute-value summation over the weights, i and j each represent positive integer variables, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, and γ_j represents the j-th preset dynamic adjustment factor.
6. An object detection device applied to a YOLOv5-Lite network, the YOLOv5-Lite network comprising an input terminal, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, the attention unit comprising a plurality of different attention sub-units, the object detection device comprising:
the training set generation module is used for acquiring picture input data as a training set;
the preprocessing module is used for preprocessing each training picture in the training set through the input end to obtain a preprocessed training set;
the first feature map generation module is used for extracting features of each training picture in the preprocessed training set based on a feature extraction unit so as to obtain intermediate feature maps with different scales;
the second feature map generation module is used for acquiring at least two attention subunits according to the size of each intermediate feature map and respectively extracting features of each intermediate feature map so as to obtain attention extraction feature maps corresponding to each intermediate feature map;
the target feature map generation module is used for respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
the prediction value generation module is used for respectively detecting each target characteristic diagram through the prediction output unit so as to generate a corresponding prediction value;
the detection model generation module is used for calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model;
the feature extraction unit with still be connected with the batch standardization layer between the attention unit, target detection device still includes:
the dynamic standard processing module is used for respectively standardizing the intermediate characteristic diagrams with different scales based on the batch standardization layer and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor so as to obtain the standardized intermediate characteristic diagrams with different scales;
the attention unit comprises a first attention subunit and a second attention subunit, wherein the first attention subunit is a compression and excitation module, and the second attention subunit is a convolution block attention module; the scale of the intermediate feature map is three, the second feature map generation module is further configured to perform feature extraction on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map, perform feature extraction on the intermediate feature map of the second scale and the intermediate feature map of the third scale through the second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, and the first scale, the second scale, and the third scale are sequentially reduced.
7. A device terminal, characterized in that the device terminal comprises a processor and a memory for storing a computer program, the processor running the computer program to cause the device terminal to perform the object detection method of any of claims 1 to 5.
8. A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed by a processor, implements the object detection method of any one of claims 1 to 5.
CN202210600445.8A 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium Active CN114677504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210600445.8A CN114677504B (en) 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210600445.8A CN114677504B (en) 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium

Publications (2)

Publication Number Publication Date
CN114677504A CN114677504A (en) 2022-06-28
CN114677504B true CN114677504B (en) 2022-11-15

Family

ID=82081145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210600445.8A Active CN114677504B (en) 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium

Country Status (1)

Country Link
CN (1) CN114677504B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049851B (en) * 2022-08-15 2023-01-17 深圳市爱深盈通信息技术有限公司 Target detection method, device and equipment terminal based on YOLOv5 network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688723A (en) * 2021-08-21 2021-11-23 河南大学 Infrared image pedestrian target detection method based on improved YOLOv5
CN113920107A (en) * 2021-10-29 2022-01-11 西安工程大学 Insulator damage detection method based on improved yolov5 algorithm
CN114005105A (en) * 2021-12-30 2022-02-01 青岛以萨数据技术有限公司 Driving behavior detection method and device and electronic equipment
CN114220015A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Improved YOLOv 5-based satellite image small target detection method
CN114359851A (en) * 2021-12-02 2022-04-15 广州杰赛科技股份有限公司 Unmanned target detection method, device, equipment and medium
CN114494415A (en) * 2021-12-31 2022-05-13 北京建筑大学 Method for detecting, identifying and measuring gravel pile by automatic driving loader

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688723A (en) * 2021-08-21 2021-11-23 河南大学 Infrared image pedestrian target detection method based on improved YOLOv5
CN113920107A (en) * 2021-10-29 2022-01-11 西安工程大学 Insulator damage detection method based on improved yolov5 algorithm
CN114359851A (en) * 2021-12-02 2022-04-15 广州杰赛科技股份有限公司 Unmanned target detection method, device, equipment and medium
CN114220015A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Improved YOLOv 5-based satellite image small target detection method
CN114005105A (en) * 2021-12-30 2022-02-01 青岛以萨数据技术有限公司 Driving behavior detection method and device and electronic equipment
CN114494415A (en) * 2021-12-31 2022-05-13 北京建筑大学 Method for detecting, identifying and measuring gravel pile by automatic driving loader

Also Published As

Publication number Publication date
CN114677504A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN110210560B (en) Incremental training method, classification method and device, equipment and medium of classification network
WO2020098250A1 (en) Character recognition method, server, and computer readable storage medium
CN109753928B (en) Method and device for identifying illegal buildings
WO2018121567A1 (en) Method and device for use in detecting object key point, and electronic device
CN108229531B (en) Object feature extraction method and device, storage medium and electronic equipment
CN110188829B (en) Neural network training method, target recognition method and related products
CN108846404B (en) Image significance detection method and device based on related constraint graph sorting
CN110929836B (en) Neural network training and image processing method and device, electronic equipment and medium
CN114241505B (en) Method and device for extracting chemical structure image, storage medium and electronic equipment
CN114677504B (en) Target detection method, device, equipment terminal and readable storage medium
CN111814821A (en) Deep learning model establishing method, sample processing method and device
CN115937571A (en) Device and method for detecting sphericity of glass for vehicle
CN115240280A (en) Construction method of human face living body detection classification model, detection classification method and device
CN113112518A (en) Feature extractor generation method and device based on spliced image and computer equipment
CN111814846A (en) Training method and recognition method of attribute recognition model and related equipment
CN111179245B (en) Image quality detection method, device, electronic equipment and storage medium
CN115049851B (en) Target detection method, device and equipment terminal based on YOLOv5 network
CN111967383A (en) Age estimation method, and training method and device of age estimation model
CN108446737B (en) Method and device for identifying objects
Mohammadi et al. Predictive Sampling for Efficient Pairwise Subjective Image Quality Assessment
JP3468108B2 (en) Face image matching method and face image matching device
CN115375980A (en) Block chain-based digital image evidence storing system and method
CN112949571A (en) Method for identifying age, and training method and device of age identification model
CN112183283A (en) Age estimation method, device, equipment and storage medium based on image
CN114267089B (en) Method, device and equipment for identifying forged image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230704

Address after: 13C-18, Caihong Building, Caihong Xindu, No. 3002, Caitian South Road, Gangsha Community, Futian Street, Futian District, Shenzhen, Guangdong 518033

Patentee after: Core Computing Integrated (Shenzhen) Technology Co.,Ltd.

Address before: 518000 1001, building G3, TCL International e city, Shuguang community, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Aishen Yingtong Information Technology Co.,Ltd.