CN114677504B - Target detection method, device, equipment terminal and readable storage medium - Google Patents

Target detection method, device, equipment terminal and readable storage medium Download PDF

Info

Publication number
CN114677504B
Authority
CN
China
Prior art keywords
attention
feature map
feature
extraction
intermediate feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210600445.8A
Other languages
Chinese (zh)
Other versions
CN114677504A (en)
Inventor
陈磊
周有喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Core Computing Integrated Shenzhen Technology Co ltd
Original Assignee
Shenzhen Aishen Yingtong Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aishen Yingtong Information Technology Co Ltd filed Critical Shenzhen Aishen Yingtong Information Technology Co Ltd
Priority to CN202210600445.8A priority Critical patent/CN114677504B/en
Publication of CN114677504A publication Critical patent/CN114677504A/en
Application granted granted Critical
Publication of CN114677504B publication Critical patent/CN114677504B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target detection method, a device, an equipment terminal and a readable storage medium. The target detection method preprocesses each training picture in a training set through an input end to obtain a preprocessed training set; extracts features from each training picture in the preprocessed training set based on a feature extraction unit to obtain intermediate feature maps of different scales; obtains, according to the size of each intermediate feature map, at least two attention subunits to respectively perform feature extraction on each intermediate feature map so as to obtain the corresponding attention extraction feature maps; respectively performs feature merging on each intermediate feature map and its corresponding attention extraction feature map to obtain each target feature map; respectively detects each target feature map through a prediction output unit to generate corresponding predicted values; and performs loss function calculation according to the corresponding predicted values to generate a corresponding target detection model. The target detection method improves the accuracy of target detection on the whole.

Description

Target detection method, device, equipment terminal and readable storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a target detection method, apparatus, device terminal, and readable storage medium.
Background
With the widespread application of deep convolutional neural networks in the field of computer vision, real-time target detection models represented by the YOLO algorithm have achieved good detection results in industrial fields and practical application scenarios.
The YOLOv5-Lite model is improved on the basis of the previous-generation YOLOv4; it has a higher training speed and a smaller model size, which is favorable for rapid deployment of the model.
In practical applications, shooting scenes with near and far objects and complex environments produce a large number of targets of various sizes; however, feature extraction and collection cannot be performed on targets of each size in a targeted manner, so the overall target detection accuracy is not high.
Disclosure of Invention
In view of this, the present application provides a target detection method, an apparatus, a device terminal, and a readable storage medium, which overcome the disadvantage that the YOLOv5-Lite model cannot perform targeted feature extraction and collection for targets of various sizes, and improve the overall detection accuracy of the YOLOv5-Lite model.
A target detection method is applied to a YOLOv5-Lite network, the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are sequentially connected, the attention unit comprises a plurality of different attention subunits, and the target detection method comprises the following steps:
acquiring picture input data as a training set;
preprocessing each training picture in the training set through an input end to obtain a preprocessed training set;
extracting the features of each training picture in the preprocessed training set based on a feature extraction unit to obtain intermediate feature maps with different scales;
according to the size of each intermediate feature map, at least two attention subunits are obtained to respectively perform feature extraction on each intermediate feature map so as to obtain an attention extraction feature map corresponding to each intermediate feature map;
respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
respectively detecting each target characteristic diagram through a prediction output unit to generate corresponding prediction values;
and calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model.
In one embodiment, the target detection method further comprises:
acquiring picture input data as a test set;
and testing the test set according to the target detection model, and outputting a corresponding target detection result.
In one embodiment, the feature extraction unit includes a backbone unit and a Neck unit which are connected in sequence, the backbone unit is connected with the input end, and the output end of the Neck unit is connected with the attention unit; the step of performing feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales includes:
carrying out slicing operation and convolution operation on each training picture in the preprocessed training set based on a back bone unit to obtain an initial feature map;
and performing secondary feature extraction on the initial feature map based on a Neck unit to obtain intermediate feature maps with different scales.
In one embodiment, the attention unit includes a first attention subunit and a second attention subunit, the intermediate feature maps have three dimensions, and the step of obtaining at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map to obtain an attention extraction feature map corresponding to each intermediate feature map respectively includes:
performing feature extraction on the intermediate feature map of the first scale through a first attention subunit to obtain a corresponding first attention extraction feature map;
and respectively extracting the features of the intermediate feature maps in the second scale and the third scale through a second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In one embodiment, the first attention subunit is a compression and excitation module and the second attention subunit is a convolution block attention module.
In one embodiment, the attention unit includes a first attention subunit, a second attention subunit, and a third attention subunit, the intermediate feature maps have three dimensions, and the step of obtaining at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map to obtain an attention extraction feature map corresponding to each intermediate feature map includes:
performing feature extraction on the intermediate feature map of the first scale through a first attention subunit to obtain a corresponding first attention extraction feature map;
performing feature extraction on the intermediate feature map of the second scale through a second attention subunit to obtain a second attention extraction feature map;
and performing feature extraction on the intermediate feature map of the third scale through a third attention subunit to obtain a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In an embodiment, a batch normalization layer is further connected between the feature extraction unit and the attention unit, and the step of obtaining at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map so as to obtain an attention extraction feature map corresponding to each intermediate feature map further includes:
respectively standardizing the intermediate characteristic diagrams with different scales based on the batch standardization layer, and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor to obtain the standardized intermediate characteristic diagrams with different scales.
In one embodiment, the formula employed in the normalization process is:
\( y_i = \gamma_i \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \)
wherein y_i represents the normalized intermediate feature map corresponding to the i-th channel, m represents the number of channels of each input intermediate feature map, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, x_i represents the input intermediate feature map corresponding to the i-th channel, μ_B represents the mean of the input m-channel intermediate feature maps, σ_B² represents the overall variance of the input m-channel intermediate feature maps, and ε and β both represent constants.
In one embodiment, the loss function is:
\( L_{\mathrm{total}} = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{j} g(\gamma_j), \qquad g(\gamma_j) = \left| \gamma_j \right| \)
wherein L_total represents the overall loss function value of the YOLOv5-Lite network, λ represents the penalty coefficient, x represents the input target feature map, f(x) represents the predicted value, y represents the corresponding true value, l(f(x), y) represents the loss function value for x and y, W represents the weight corresponding to each channel, g(·) represents the L1-norm operation that performs an absolute-value summation over the weights, i and j each represent positive integer variables, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, and γ_j represents the j-th preset dynamic adjustment factor.
In addition, an object detection device is provided, which is applied to a YOLOv5-Lite network, the YOLOv5-Lite network includes an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, the attention unit includes a plurality of different attention subunits, and the object detection device includes:
the training set generation module is used for acquiring picture input data as a training set;
the preprocessing module is used for preprocessing each training picture in the training set through an input end to obtain a preprocessed training set;
the first feature map generation module is used for extracting features of each training picture in the preprocessed training set based on the feature extraction unit so as to obtain intermediate feature maps with different scales;
the second feature map generation module is used for acquiring at least two attention subunits to respectively perform feature extraction on each intermediate feature map according to the size of each intermediate feature map so as to obtain an attention extraction feature map corresponding to each intermediate feature map;
the target feature map generation module is used for respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
the prediction value generation module is used for respectively detecting each target characteristic diagram through the prediction output unit so as to generate a corresponding prediction value;
and the detection model generation module is used for calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model.
In addition, an apparatus terminal is provided, which includes a processor and a memory, the memory is used for storing a computer program, and the processor runs the computer program to make the apparatus terminal execute the above object detection method.
Furthermore, a readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the above object detection method.
The target detection method is applied to a YOLOv5-Lite network, wherein the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, and the attention unit comprises a plurality of different attention subunits. The target detection method acquires picture input data as a training set, preprocesses each training picture in the training set through the input end to obtain a preprocessed training set, performs feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales, obtains at least two attention subunits according to the size of each intermediate feature map to respectively perform feature extraction on each intermediate feature map so as to obtain the attention extraction feature map corresponding to each intermediate feature map, and respectively performs feature merging on each intermediate feature map and its corresponding attention extraction feature map. By obtaining at least two attention subunits to respectively extract features from each intermediate feature map, the target detection model can, when detecting targets of each size in an image, extract the corresponding feature information through the attention subunit corresponding to the size of each intermediate feature map; that is, feature extraction and collection can be performed on targets of each size in a targeted manner. Meanwhile, each intermediate feature map and its corresponding attention extraction feature map are further merged to obtain each target feature map: on the one hand, the attention extraction feature map extracts more information from the original intermediate feature map; on the other hand, the original intermediate feature map is retained. Merging the information of the two feature maps therefore yields more useful feature information, and the detection accuracy for targets of all sizes is further improved as a whole.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic application environment diagram of a target detection method provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a target detection method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of another target detection method provided in the embodiments of the present application;
FIG. 4 is a schematic flowchart of a method for obtaining intermediate feature maps of different scales according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an attention unit according to an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating a method for obtaining attention extraction feature maps corresponding to respective intermediate feature maps according to an embodiment of the present application;
FIG. 7 is a block diagram of another attention unit configuration provided in an embodiment of the present application;
fig. 8 is a schematic flowchart of another method for obtaining attention extraction feature maps corresponding to respective intermediate feature maps according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an application environment of another target detection method provided in an embodiment of the present application;
FIG. 10 is a schematic flowchart illustrating a further method for detecting an object according to an embodiment of the present application;
fig. 11 is a block diagram of a target detection apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The following embodiments and their technical features may be combined with each other without conflict.
As shown in fig. 1, an application environment schematic diagram of an object detection method is provided, fig. 1 is a schematic structural block diagram of a YOLOv5-Lite network, the YOLOv5-Lite network includes an input end 11, a feature extraction unit 12, an attention unit 13 and a prediction output unit 14, which are connected in sequence, and the attention unit 13 includes a plurality of different attention subunits.
As shown in fig. 2, there is provided an object detection method including:
step S110, acquiring the picture input data as a training set.
When the target is detected, a training set needs to be established to obtain a target detection model, and a large amount of picture input data needs to be acquired as the training set.
And step S120, preprocessing each training picture in the training set through the input end to obtain a preprocessed training set.
Each training picture in the training set needs to be further preprocessed, because many captured pictures in the picture input data have not yet been labeled. In addition, the preprocessing can also include at least one of data enhancement, adaptive anchor frame calculation and adaptive picture scaling, so as to obtain the preprocessed training set.
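By way of illustration, a minimal sketch of the adaptive picture scaling (letterbox) step is given below. The function name, target size and padding value are illustrative assumptions rather than part of the disclosure, and Python with OpenCV/PyTorch is assumed as the framework for all code sketches in this description.

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_shape=(640, 640), color=(114, 114, 114)):
    """Resize an image to new_shape while keeping its aspect ratio, padding the rest."""
    h, w = img.shape[:2]
    r = min(new_shape[0] / h, new_shape[1] / w)          # scale ratio
    new_unpad = (int(round(w * r)), int(round(h * r)))   # (width, height) after resize
    dw = new_shape[1] - new_unpad[0]                      # total horizontal padding
    dh = new_shape[0] - new_unpad[1]                      # total vertical padding
    img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = dh // 2, dh - dh // 2
    left, right = dw // 2, dw - dw // 2
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=color)
```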
And step S130, extracting the features of each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps with different scales.
The YOLOv5-Lite network generally comprises a plurality of feature extraction units, and the YOLOv5-Lite network performs feature extraction on each training picture in the preprocessed training set through the plurality of feature extraction units to obtain intermediate feature maps with different scales.
Step S140, at least two attention subunits are obtained according to the size of each intermediate feature map to perform feature extraction on each intermediate feature map respectively, so as to obtain an attention extraction feature map corresponding to each intermediate feature map.
And respectively extracting the features of each intermediate feature map by adopting a corresponding proper attention subunit according to the size of each intermediate feature map, so as to obtain the attention extraction feature map corresponding to each intermediate feature map.
In an embodiment, three intermediate feature maps with different scales are obtained, and at this time, according to the size of each intermediate feature map, at least two attention subunits may be obtained to perform feature extraction on each intermediate feature map respectively, so as to obtain an attention extraction feature map corresponding to each intermediate feature map, where one attention subunit is used to perform feature extraction on the intermediate feature map of one scale, and the other attention subunit is used to perform feature extraction on the intermediate feature maps of the remaining two scales.
In this embodiment, corresponding attention subunits are respectively adopted for feature extraction on each intermediate feature graph with different scales, so that when the target detection model detects targets with various sizes in a picture, corresponding feature information can be respectively extracted through the corresponding attention subunits according to the size of each intermediate feature graph, that is, feature extraction and collection can be respectively performed on the targets with various sizes in a targeted manner.
And step S150, respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map.
On one hand, the attention extraction feature map extracts more information from the original intermediate feature map, on the other hand, the original intermediate feature map is retained, and then the information of the two feature maps is merged, so that more useful feature information is obtained, and the detection accuracy of the target of each size is further improved on the whole.
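A minimal sketch of this feature merging step follows; channel-wise concatenation followed by a 1x1 convolution is assumed here as the merge operation, which is one plausible reading of "feature combination" rather than the specific operation disclosed.

```python
import torch
import torch.nn as nn

class FeatureMerge(nn.Module):
    """Merge an intermediate feature map with its attention-extracted counterpart."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution fuses the concatenated maps back to the original width
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, intermediate: torch.Tensor, attention: torch.Tensor) -> torch.Tensor:
        merged = torch.cat([intermediate, attention], dim=1)  # keep both information sources
        return self.fuse(merged)
```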
In step S160, the target feature maps are detected by the prediction output unit to generate corresponding prediction values.
The prediction output unit generally corresponds to the Head part of the YOLOv5-Lite network.
And S170, calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model.
The above target detection method is applied to a YOLOv5-Lite network, wherein the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, and the attention unit comprises a plurality of different attention subunits. The target detection method obtains at least two attention subunits to respectively perform feature extraction on each intermediate feature map so as to obtain the attention extraction feature map corresponding to each intermediate feature map, so that when detecting targets of each size in a picture, the target detection model can extract the corresponding feature information through the attention subunit corresponding to the size of each intermediate feature map; that is, feature extraction and collection can be performed on targets of each size in a targeted manner. Meanwhile, each intermediate feature map and its corresponding attention extraction feature map are respectively merged to obtain each target feature map: on the one hand, the attention extraction feature map extracts more information from the original intermediate feature map; on the other hand, the original intermediate feature map is retained. Merging the information of the two feature maps therefore yields more useful feature information, and the detection accuracy for targets of all sizes is further improved as a whole.
In one embodiment, as shown in fig. 3, the object detection method further includes:
and step S180, acquiring picture input data as a test set.
And step S190, testing the test set according to the target detection model, and outputting a corresponding target detection result.
In one embodiment, as shown in fig. 1, the feature extraction unit 12 includes a backbone unit and a Neck unit connected in sequence, the backbone unit is connected to the input end 11, and the output end of the Neck unit is connected to the attention unit 13. As shown in fig. 4, step S130 includes:
step S132, based on the backbone unit, slicing operation and convolution operation are carried out on each training picture in the preprocessed training set, so as to obtain an initial feature map.
And S134, performing secondary feature extraction on the initial feature map based on the Neck unit to obtain intermediate feature maps with different scales.
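A minimal sketch of the slicing operation performed by the backbone unit is given below, assuming the YOLOv5-style Focus slice; the patent only states that a slicing operation and a convolution operation are performed, so this concrete form is an assumption.

```python
import torch
import torch.nn as nn

class FocusSlice(nn.Module):
    """Slice every 2x2 pixel neighbourhood into channels, then convolve (Focus-style slicing)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Four spatial sub-samplings stacked along the channel axis halve H and W
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)
```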
In one embodiment, as shown in fig. 5, the attention unit 13 includes a first attention subunit 13a and a second attention subunit 13b, as shown in fig. 6, step S140 includes:
in step S141, feature extraction is performed on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map.
And step S142, respectively performing feature extraction on the intermediate feature maps of the second scale and the third scale through a second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In this embodiment, the first attention subunit performs feature extraction on the intermediate feature map with the largest scale (i.e., the intermediate feature map with the first scale), and then, for the intermediate feature maps with the smaller scales, the second attention subunit performs feature extraction, so that more feature information can be extracted from the intermediate feature map with the smaller scale, that is, corresponding feature information can be respectively extracted through the respective corresponding attention subunits according to the sizes of the intermediate feature maps, thereby implementing feature extraction and collection on targets with various sizes respectively and specifically, and further improving the overall detection accuracy on the targets with various sizes.
In one embodiment, the first attention subunit is a compression and excitation module and the second attention subunit is a convolution block attention module.
The compression and excitation module is the Squeeze-and-Excitation (SE) module, and the convolution block attention module is the Convolutional Block Attention Module (CBAM).
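Minimal sketches of the two attention subunits follow, assuming their commonly published formulations (squeeze-and-excitation channel reweighting for SE, channel-then-spatial attention for CBAM); the reduction ratio and spatial kernel size are illustrative defaults, not values taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels from globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention from average- and max-pooled descriptors
        avg = self.channel_mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.channel_mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True)[0]], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```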
In one embodiment, as shown in fig. 7, the attention unit includes a first attention subunit 13a, a second attention subunit 13b and a third attention subunit 13c, as shown in fig. 8, and step S140 includes:
and step S143, performing feature extraction on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map.
And step S144, performing feature extraction on the intermediate feature map of the second scale through a second attention subunit to obtain a second attention extraction feature map.
And S145, performing feature extraction on the intermediate feature map of the third scale through a third attention subunit to obtain a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
In this embodiment, the first attention subunit performs feature extraction on the intermediate feature map with the largest scale (that is, the intermediate feature map with the first scale), then, for the intermediate feature map with the second scale with the smaller scale, the second attention subunit performs feature extraction, and for the intermediate feature map with the third scale with the smaller scale, the third attention subunit performs feature extraction, so that more feature information can be further extracted from the intermediate feature map with the smaller scale, that is, the corresponding feature information can be further extracted through the respective attention subunits according to the sizes of the intermediate feature maps, thereby implementing feature extraction and collection respectively and specifically for targets with various sizes, and further improving the detection accuracy of the targets with various sizes as a whole.
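A minimal sketch of routing one attention subunit to each scale is given below; the channel widths in the commented example and the choice of a second CBAM as the third subunit are illustrative assumptions, since this embodiment does not name the third module.

```python
import torch.nn as nn

class ScaleRoutedAttention(nn.Module):
    """Apply a dedicated attention subunit to each intermediate feature map scale."""
    def __init__(self, subunits: nn.ModuleList):
        super().__init__()
        # subunits[k] handles the k-th scale, ordered from the first (largest) scale
        self.subunits = subunits

    def forward(self, feature_maps):
        # feature_maps: list of intermediate maps ordered from the first to the third scale
        return [attn(fmap) for attn, fmap in zip(self.subunits, feature_maps)]

# Illustrative wiring using the SEBlock/CBAM sketches above (channel widths are assumed):
# attention = ScaleRoutedAttention(nn.ModuleList([SEBlock(128), CBAM(256), CBAM(512)]))
```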
In one embodiment, as shown in fig. 9, a batch normalization layer 15 is further connected between the feature extraction unit 12 and the attention unit 13, and as shown in fig. 10, step S140 further includes:
and S200, respectively carrying out standardization processing on the intermediate characteristic diagrams with different scales based on the batch standardization layer, and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor to obtain the standardized intermediate characteristic diagrams with different scales.
In this embodiment, the intermediate feature maps are normalized by the batch normalization layer, and a preset dynamic adjustment factor is added. The preset dynamic adjustment factor can reflect the degree of information change in each intermediate feature map, that is, the variance in the batch normalization layer. In other words, the variance reflects the degree of information change: the larger the variance, the larger the degree of information change, the richer the information, and the higher the importance; conversely, the smaller the variance, the smaller the degree of information change and the lower the importance. Therefore, by setting the batch normalization layer, the subsequent attention unit can better extract feature map information.
In the process of performing subsequent steps S140 to S150, the normalized intermediate feature maps with different scales need to be processed, and steps S160 to S170 are unchanged as shown in fig. 10, that is:
step S140, at least two attention subunits are obtained to respectively perform feature extraction on each normalized intermediate feature map according to the size of each normalized intermediate feature map, so as to obtain an attention extraction feature map corresponding to each normalized intermediate feature map.
And step S150, respectively carrying out feature merging on each normalized intermediate feature map and the attention extraction feature maps corresponding to each normalized intermediate feature map to obtain each target feature map.
In one embodiment, the formula employed in the normalization process is:
\( y_i = \gamma_i \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \)
wherein y_i represents the normalized intermediate feature map corresponding to the i-th channel, m represents the number of channels of each input intermediate feature map, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, x_i represents the input intermediate feature map corresponding to the i-th channel, μ_B represents the mean of the input m-channel intermediate feature maps, σ_B² represents the overall variance of the input m-channel intermediate feature maps, and ε and β both represent constants.
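A minimal sketch of a batch normalization layer whose per-channel scale plays the role of the preset dynamic adjustment factor is given below; treating the factor as the learnable γ of standard batch normalization, with statistics computed over the batch and spatial dimensions, is an assumption drawn from the formula above.

```python
import torch
import torch.nn as nn

class DynamicFactorBatchNorm(nn.Module):
    """Batch normalization whose per-channel scale is the preset dynamic adjustment factor."""
    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # gamma: preset dynamic adjustment factor per channel; beta: constant shift
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width); statistics over batch and spatial dimensions
        mean = x.mean(dim=(0, 2, 3), keepdim=True)
        var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.gamma.view(1, -1, 1, 1) + self.beta.view(1, -1, 1, 1)
```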
In one embodiment, the loss function is:
\( L_{\mathrm{total}} = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{j} g(\gamma_j), \qquad g(\gamma_j) = \left| \gamma_j \right| \)
wherein L_total represents the overall loss function value of the YOLOv5-Lite network, λ represents the penalty coefficient, x represents the input target feature map, f(x) represents the predicted value, y represents the corresponding true value, l(f(x), y) represents the loss function value for x and y, W represents the weight corresponding to each channel, g(·) represents the L1-norm operation that performs an absolute-value summation over the weights, i and j each represent positive integer variables, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, and γ_j represents the j-th preset dynamic adjustment factor.
On the basis of the embodiment shown in fig. 8, setting the batch normalization layer makes the overall loss function of the YOLOv5-Lite network include the term \( \lambda \sum_{j} |\gamma_j| \), so that the loss function can be adjusted through the preset dynamic adjustment factors and the accuracy of target detection is improved on the whole.
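A minimal sketch of a training loss with this added term follows; the detection loss is left as a placeholder, and applying the L1 penalty to the dynamic adjustment factors (the gamma parameters of the layer sketched earlier) is an assumption consistent with the formula given above.

```python
import torch

def total_loss(detection_loss: torch.Tensor,
               dynamic_factors,        # iterable of per-channel factor tensors (gamma)
               penalty: float) -> torch.Tensor:
    """Overall loss: detection loss plus an L1 penalty on the dynamic adjustment factors."""
    l1_term = sum(g.abs().sum() for g in dynamic_factors)
    return detection_loss + penalty * l1_term

# Usage sketch (names are illustrative):
# factors = [m.gamma for m in model.modules() if isinstance(m, DynamicFactorBatchNorm)]
# loss = total_loss(yolo_loss(pred, target), factors, penalty=1e-4)
```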
Further, as shown in fig. 11, there is provided an object detection apparatus 300 applied to the YOLOv5-Lite network shown in fig. 1, the object detection apparatus 300 including:
a training set generating module 310, configured to obtain picture input data as a training set;
the preprocessing module 320 is used for preprocessing each training picture in the training set through an input end to obtain a preprocessed training set;
a first feature map generation module 330, configured to perform feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales;
a second feature map generation module 340, configured to obtain at least two attention subunits to perform feature extraction on each intermediate feature map respectively according to the size of each intermediate feature map, so as to obtain an attention extraction feature map corresponding to each intermediate feature map;
the target feature map generation module 350 is configured to perform feature merging on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps, so as to obtain each target feature map;
and the predicted value generation module 360 detects each target feature map through the prediction output unit to generate a corresponding predicted value.
And the detection model generation module 370 performs loss function calculation according to the corresponding predicted value to obtain an optimized gradient, and performs weight and bias updating until the loss function converges to generate a corresponding target detection model.
In addition, an apparatus terminal is provided, which includes a processor and a memory, the memory is used for storing a computer program, and the processor runs the computer program to make the apparatus terminal execute the above object detection method.
Furthermore, a readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the above object detection method.
The division of the units in the device is only used for illustration, and in other embodiments, the device may be divided into different units as needed to complete all or part of the functions of the device. For the specific limitations of the above device, reference may be made to the limitations of the above method, which are not described herein again.
That is, the above description is only an embodiment of the present application, and not intended to limit the scope of the present application, and all equivalent structures or equivalent flow transformations made by using the contents of the specification and the drawings, such as mutual combination of technical features between various embodiments, or direct or indirect application to other related technical fields, are included in the scope of the present application.
In addition, the present application may use the same or different reference numerals for structural elements having the same or similar characteristics. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In this application, the word "for example" is used to mean "serving as an example, instance, or illustration". Any embodiment described herein as "for example" is not necessarily to be construed as preferred or advantageous over other embodiments. The previous description is provided to enable any person skilled in the art to make or use the present application. In the foregoing description, various details have been set forth for the purpose of explanation.
It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes are not shown in detail to avoid obscuring the description of the present application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Claims (8)

1. An object detection method applied to a YOLOv5-Lite network, wherein the YOLOv5-Lite network comprises an input end, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, the attention unit comprises a plurality of different attention subunits, and the object detection method comprises the following steps:
acquiring picture input data as a training set;
preprocessing each training picture in the training set through the input end to obtain a preprocessed training set;
extracting the features of each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps with different scales;
according to the size of each intermediate feature map, at least two attention subunits are obtained to respectively extract features of each intermediate feature map so as to obtain attention extraction feature maps corresponding to each intermediate feature map;
respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
detecting each target characteristic diagram through the prediction output unit to generate corresponding prediction values;
calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model;
the method comprises the following steps that a batch standardization layer is further connected between the feature extraction unit and the attention unit, at least two attention subunits are obtained according to the size of each intermediate feature graph to respectively extract features of each intermediate feature graph, and the steps of obtaining the attention extraction feature graphs corresponding to the intermediate feature graphs respectively further comprise the following steps:
respectively standardizing the intermediate characteristic diagrams with different scales based on the batch standardization layer, and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor to obtain the standardized intermediate characteristic diagrams with different scales;
the attention unit comprises a first attention subunit and a second attention subunit, wherein the first attention subunit is a compression and excitation module, and the second attention subunit is a convolution block attention module; the method comprises the following steps of obtaining at least two attention subunits according to the size of each intermediate feature map, and respectively extracting features of each intermediate feature map to obtain an attention extraction feature map corresponding to each intermediate feature map, wherein the three scales of the intermediate feature maps are three, and the step of obtaining the attention extraction feature map corresponding to each intermediate feature map comprises the following steps:
performing feature extraction on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map;
and respectively extracting features of the intermediate feature maps of the second scale and the third scale through the second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, wherein the first scale, the second scale and the third scale are sequentially reduced.
2. The object detection method according to claim 1, characterized in that the object detection method further comprises:
acquiring picture input data as a test set;
and testing the test set according to the target detection model, and outputting a corresponding target detection result.
3. The target detection method according to claim 1, wherein the feature extraction unit includes a backbone unit and a Neck unit which are connected in sequence, the backbone unit is connected to the input end, the output end of the Neck unit is connected to the attention unit, and the step of performing feature extraction on each training picture in the preprocessed training set based on the feature extraction unit to obtain intermediate feature maps of different scales includes: performing a slicing operation and a convolution operation on each training picture in the preprocessed training set based on the backbone unit to obtain an initial feature map;
and performing secondary feature extraction on the initial feature map based on the Neck unit to obtain intermediate feature maps with different scales.
4. The object detection method according to claim 1, wherein the formula employed in the normalization process is:
\( y_i = \gamma_i \cdot \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \)
wherein y_i represents the normalized intermediate feature map corresponding to the i-th channel, m represents the number of channels of each input intermediate feature map, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, x_i represents the input intermediate feature map corresponding to the i-th channel, μ_B represents the mean of the input m-channel intermediate feature maps, σ_B² represents the overall variance of the input m-channel intermediate feature maps, and ε and β both represent constants.
5. The object detection method of claim 4, wherein the loss function is:
\( L_{\mathrm{total}} = \sum_{(x,y)} l(f(x, W), y) + \lambda \sum_{j} g(\gamma_j), \qquad g(\gamma_j) = \left| \gamma_j \right| \)
wherein L_total represents the overall loss function value of the YOLOv5-Lite network, λ represents the penalty coefficient, x represents the input target feature map, f(x) represents the predicted value, y represents the corresponding true value, l(f(x), y) represents the loss function value for x and y, W represents the weight corresponding to each channel, g(·) represents the L1-norm operation that performs an absolute-value summation over the weights, i and j each represent positive integer variables, γ_i represents the preset dynamic adjustment factor corresponding to the i-th channel, and γ_j represents the j-th preset dynamic adjustment factor.
6. An object detection device applied to a YOLOv5-Lite network, the YOLOv5-Lite network comprising an input terminal, a feature extraction unit, an attention unit and a prediction output unit which are connected in sequence, the attention unit comprising a plurality of different attention sub-units, the object detection device comprising:
the training set generation module is used for acquiring picture input data as a training set;
the preprocessing module is used for preprocessing each training picture in the training set through the input end to obtain a preprocessed training set;
the first feature map generation module is used for extracting features of each training picture in the preprocessed training set based on a feature extraction unit so as to obtain intermediate feature maps with different scales;
the second feature map generation module is used for acquiring at least two attention subunits according to the size of each intermediate feature map and respectively extracting features of each intermediate feature map so as to obtain attention extraction feature maps corresponding to each intermediate feature map;
the target feature map generation module is used for respectively carrying out feature combination on each intermediate feature map and the attention extraction feature maps corresponding to the intermediate feature maps to obtain each target feature map;
the prediction value generation module is used for respectively detecting each target characteristic diagram through the prediction output unit so as to generate a corresponding prediction value;
the detection model generation module is used for calculating a loss function according to the corresponding predicted value to obtain an optimized gradient, and updating the weight and the bias until the loss function is converged to generate a corresponding target detection model;
the feature extraction unit with still be connected with the batch standardization layer between the attention unit, target detection device still includes:
the dynamic standard processing module is used for respectively standardizing the intermediate characteristic diagrams with different scales based on the batch standardization layer and adjusting the weight of each channel in the intermediate characteristic diagram with each size by adopting a preset dynamic adjustment factor so as to obtain the standardized intermediate characteristic diagrams with different scales;
the attention unit comprises a first attention subunit and a second attention subunit, wherein the first attention subunit is a compression and excitation module, and the second attention subunit is a convolution block attention module; the scale of the intermediate feature map is three, the second feature map generation module is further configured to perform feature extraction on the intermediate feature map of the first scale through the first attention subunit to obtain a corresponding first attention extraction feature map, perform feature extraction on the intermediate feature map of the second scale and the intermediate feature map of the third scale through the second attention subunit to obtain a second attention extraction feature map and a third attention extraction feature map, and the first scale, the second scale, and the third scale are sequentially reduced.
7. A device terminal, characterized in that the device terminal comprises a processor and a memory for storing a computer program, the processor running the computer program to cause the device terminal to perform the object detection method of any of claims 1 to 5.
8. A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed by a processor, implements the object detection method of any one of claims 1 to 5.
CN202210600445.8A 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium Active CN114677504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210600445.8A CN114677504B (en) 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210600445.8A CN114677504B (en) 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium

Publications (2)

Publication Number Publication Date
CN114677504A CN114677504A (en) 2022-06-28
CN114677504B true CN114677504B (en) 2022-11-15

Family

ID=82081145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210600445.8A Active CN114677504B (en) 2022-05-30 2022-05-30 Target detection method, device, equipment terminal and readable storage medium

Country Status (1)

Country Link
CN (1) CN114677504B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049851B (en) * 2022-08-15 2023-01-17 深圳市爱深盈通信息技术有限公司 Target detection method, device and equipment terminal based on YOLOv5 network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688723A (en) * 2021-08-21 2021-11-23 河南大学 Infrared image pedestrian target detection method based on improved YOLOv5
CN113920107A (en) * 2021-10-29 2022-01-11 西安工程大学 Insulator damage detection method based on improved yolov5 algorithm
CN114005105A (en) * 2021-12-30 2022-02-01 青岛以萨数据技术有限公司 Driving behavior detection method and device and electronic equipment
CN114220015A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Improved YOLOv 5-based satellite image small target detection method
CN114359851A (en) * 2021-12-02 2022-04-15 广州杰赛科技股份有限公司 Unmanned target detection method, device, equipment and medium
CN114494415A (en) * 2021-12-31 2022-05-13 北京建筑大学 Method for detecting, identifying and measuring gravel pile by automatic driving loader

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688723A (en) * 2021-08-21 2021-11-23 河南大学 Infrared image pedestrian target detection method based on improved YOLOv5
CN113920107A (en) * 2021-10-29 2022-01-11 西安工程大学 Insulator damage detection method based on improved yolov5 algorithm
CN114359851A (en) * 2021-12-02 2022-04-15 广州杰赛科技股份有限公司 Unmanned target detection method, device, equipment and medium
CN114220015A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Improved YOLOv 5-based satellite image small target detection method
CN114005105A (en) * 2021-12-30 2022-02-01 青岛以萨数据技术有限公司 Driving behavior detection method and device and electronic equipment
CN114494415A (en) * 2021-12-31 2022-05-13 北京建筑大学 Method for detecting, identifying and measuring gravel pile by automatic driving loader

Also Published As

Publication number Publication date
CN114677504A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN110210560B (en) Incremental training method, classification method and device, equipment and medium of classification network
WO2020098250A1 (en) Character recognition method, server, and computer readable storage medium
CN109753928B (en) Method and device for identifying illegal buildings
WO2018121567A1 (en) Method and device for use in detecting object key point, and electronic device
CN108229531B (en) Object feature extraction method and device, storage medium and electronic equipment
CN110188829B (en) Neural network training method, target recognition method and related products
CN108846404B (en) Image significance detection method and device based on related constraint graph sorting
CN110929836B (en) Neural network training and image processing method and device, electronic equipment and medium
CN114241505B (en) Method and device for extracting chemical structure image, storage medium and electronic equipment
CN114677504B (en) Target detection method, device, equipment terminal and readable storage medium
CN111814821A (en) Deep learning model establishing method, sample processing method and device
CN115937571A (en) Device and method for detecting sphericity of glass for vehicle
CN115240280A (en) Construction method of human face living body detection classification model, detection classification method and device
CN113112518A (en) Feature extractor generation method and device based on spliced image and computer equipment
CN111814846A (en) Training method and recognition method of attribute recognition model and related equipment
CN111179245B (en) Image quality detection method, device, electronic equipment and storage medium
CN115049851B (en) Target detection method, device and equipment terminal based on YOLOv5 network
CN111967383A (en) Age estimation method, and training method and device of age estimation model
CN108446737B (en) Method and device for identifying objects
Mohammadi et al. Predictive Sampling for Efficient Pairwise Subjective Image Quality Assessment
JP3468108B2 (en) Face image matching method and face image matching device
CN115375980A (en) Block chain-based digital image evidence storing system and method
CN112949571A (en) Method for identifying age, and training method and device of age identification model
CN112183283A (en) Age estimation method, device, equipment and storage medium based on image
CN114267089B (en) Method, device and equipment for identifying forged image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230704

Address after: 13C-18, Caihong Building, Caihong Xindu, No. 3002, Caitian South Road, Gangsha Community, Futian Street, Futian District, Shenzhen, Guangdong 518033

Patentee after: Core Computing Integrated (Shenzhen) Technology Co.,Ltd.

Address before: 518000 1001, building G3, TCL International e city, Shuguang community, Xili street, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Aishen Yingtong Information Technology Co.,Ltd.