CN116310899A - YOLOv5-based improved target detection method and device and training method - Google Patents

YOLOv5-based improved target detection method and device and training method

Info

Publication number
CN116310899A
CN116310899A
Authority
CN
China
Prior art keywords
image
target
features
detected
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310180011.1A
Other languages
Chinese (zh)
Inventor
李肯立
郭佳靖
谭光华
刘楚波
段明星
肖国庆
唐卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202310180011.1A
Publication of CN116310899A

Classifications

    • G06V 20/17: Terrestrial scenes taken from planes or by drones
    • G06N 3/04: Neural networks; Architecture, e.g. interconnection topology
    • G06V 10/464: Salient features, e.g. scale invariant feature transform [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an improved YOLOv5-based target detection method and device and a training method. The target detection method comprises the following steps: inputting the weighted trunk features output by each key feature extraction layer into the designated key feature fusion nodes of the feature fusion module, and re-fusing them with the original FPN fusion features to obtain target fusion features; adding a detection sub-network, in which the high-resolution feature obtained by the shallowest key feature extraction layer is weighted and re-fused to obtain a high-detail feature that is input into a newly added small-target detection head and into the feature fusion module; and outputting, through each detection head, the prediction frames of each image to be detected to obtain the target detection result of the image to be detected. In the network training process, the SIOU loss function, which takes into account angle loss among other loss terms, is introduced. By adopting the method, the accuracy of small target detection can be improved.

Description

YOLOv5-based improved target detection method and device and training method
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to an improved YOLOv5-based target detection method, target detection device, and training method.
Background
With the popularization of unmanned aerial vehicle technology, unmanned aerial vehicles have been widely applied in fields such as environmental survey, road traffic flow supervision, and safety inspection. When an unmanned aerial vehicle is used for environmental survey, road traffic flow supervision, or safety inspection, target detection needs to be performed on the images obtained by aerial photography of the unmanned aerial vehicle.
In the related art, a conventional target detection algorithm is applied directly to the images obtained by aerial photography of the unmanned aerial vehicle. However, the targets in such images are small, and directly applying a conventional target detection algorithm easily leads to missed detections and misidentifications of small targets; that is, the accuracy of conventional target detection algorithms for target detection in unmanned aerial vehicle aerial photography scenes is low. Therefore, how to improve the accuracy of small target detection is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
Accordingly, it is necessary to provide an improved YOLOv5-based target detection method, target detection device, and training method that can improve the accuracy of target detection.
In a first aspect, the present application provides a method of target detection. The method comprises the following steps:
Acquiring an image to be detected;
inputting the image to be detected into a target detection network based on the YOLOv5 improvement, wherein the target detection network based on the YOLOv5 improvement comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module;
extracting features of an image to be detected through a plurality of key feature extraction layers of a feature extraction module, and sequentially obtaining target trunk features of a plurality of scales, wherein the target trunk features are key image features sequentially obtained in the feature extraction module according to a connection sequence, and as the depth of a network increases, the detail information about the image to be detected, carried by the corresponding target trunk features, is decreased, and the carried semantic information about the image to be detected is increased;
respectively carrying out weighting treatment on the target trunk features of the multiple scales to obtain weighted trunk features of the multiple scales;
respectively inputting the weighted trunk features of the multiple scales to key fusion nodes of corresponding levels in the feature fusion module, and re-fusing the weighted trunk features with original fusion features in the feature fusion module to obtain target fusion features;
And inputting the target fusion characteristics into a detection head, acquiring output of the detection head, obtaining a prediction frame corresponding to the image to be detected, and determining a target detection result of the image to be detected according to the prediction frame.
In a second aspect, the present application further provides a training method of the target detection network. The method comprises the following steps:
acquiring a training image set and a real frame corresponding to a target to be detected in each training image in the training image set;
inputting the training image into a YOLOv5-based improved original detection network, wherein the YOLOv5-based improved original detection network comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module;
extracting features of the training image through a plurality of key feature extraction layers of the feature extraction module, and sequentially obtaining target trunk features of a plurality of scales, wherein the target trunk features are key image features sequentially obtained in the feature extraction module according to a connection sequence, and as the depth of the network increases, the detail information about the training image carried by the corresponding target trunk features decreases and the carried semantic information about the training image increases;
Respectively carrying out weighting treatment on the target trunk features of the multiple scales to obtain weighted trunk features of the multiple scales;
respectively inputting the weighted trunk features of the multiple scales to key fusion nodes of corresponding levels in the feature fusion module, and re-fusing the weighted trunk features with original fusion features in the feature fusion module to obtain target fusion features;
inputting the target fusion characteristics into a detection head, and obtaining the output of the detection head to obtain a prediction frame corresponding to the training image;
and adjusting the original detection network according to the difference between the prediction frame corresponding to the training image and the real frame until the original detection network converges to obtain a target detection network.
In a third aspect, the present application also provides an object detection apparatus. The device comprises:
the image acquisition module is used for acquiring an image to be detected;
the image input module is used for inputting the image to be detected into a target detection network based on the YOLOv5 improvement, wherein the target detection network based on the YOLOv5 improvement comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module;
The extraction module is used for extracting the characteristics of the image to be detected through a plurality of key characteristic extraction layers of the characteristic extraction module, and sequentially obtaining a plurality of scale target trunk characteristics, wherein the target trunk characteristics are key image characteristics sequentially obtained in the characteristic extraction module according to the connection sequence, and as the depth of a network increases, the detail information about the image to be detected, carried by the corresponding target trunk characteristics, is decreased, and the carried semantic information about the image to be detected is increased;
the weighting module is used for respectively carrying out weighting treatment on the target trunk features of the multiple scales to obtain weighted trunk features of the multiple scales;
the fusion module is used for respectively inputting the weighted trunk features of the multiple scales to key fusion nodes of corresponding levels in the feature fusion module, and re-fusing the key fusion nodes with the original fusion features in the feature fusion module to obtain target fusion features;
the prediction module is used for inputting the target fusion characteristic into a detection head, obtaining output of the detection head, obtaining a prediction frame corresponding to the image to be detected, and determining a target detection result of the image to be detected according to the prediction frame.
In a fourth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program, and a processor implementing the steps of the object detection method of the first aspect embodiment or the training method of the object detection network of the second aspect embodiment when the processor executes the computer program.
In a fifth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the object detection method of the first aspect embodiment or implements the steps of the training method of the object detection network of the second aspect embodiment.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the object detection method of the first aspect embodiment or implements the steps of the training method of the object detection network of the second aspect embodiment.
According to the improved target detection method, device and training method based on the YOLOv5, the target trunk features of the original YOLOv5 feature extraction module with multiple scales are weighted and then input into the key nodes of the original YOLOv5 feature fusion module with appointed depth, and the key nodes are fused with the original fusion features again to obtain the target fusion features, so that the target trunk features with more detail information and the fusion features with more abundant semantic information are fused again, the feature fusion module focuses on the shallow target trunk features with more detail information, focuses on the detail information of the image to be detected, so that small targets in the image to be detected can be identified conveniently, missing of the small targets in the image to be detected by the target detection network is avoided, and the accuracy of target detection is improved.
Drawings
FIG. 1 is a diagram of an application environment for a target detection method and a training method for a target detection network in one embodiment;
FIG. 2 is a first flow chart of a target detection method according to one embodiment;
FIG. 3 is a first schematic diagram of an object detection network in one embodiment;
FIG. 4 is a second schematic diagram of an object detection network in one embodiment;
FIG. 5 is a second flow chart of a target detection method according to one embodiment;
FIG. 6 is a third flow chart of a method of detecting targets in one embodiment;
FIG. 7 is a fourth flow chart of a method of detecting targets in one embodiment;
FIG. 8 is a flow chart of a training method of the object detection network in one embodiment;
FIG. 9 is a flow chart illustrating the steps for adjusting the target detection network in one embodiment;
FIG. 10 is a schematic diagram of the angular difference between a real frame and a predicted frame in one embodiment;
FIG. 11 is a schematic diagram of the distance difference between a real frame and a predicted frame in one embodiment;
FIG. 12 is a schematic diagram of scale information between a predicted box and a real box in one embodiment;
FIG. 13 is a schematic diagram of an overlap region between a predicted frame and a real frame in one embodiment;
FIG. 14 is a fifth flow chart of a method of detecting targets in one embodiment;
FIG. 15 is a block diagram of an object detection device in one embodiment;
FIG. 16 is a block diagram of a training apparatus of an object detection network in one embodiment;
fig. 17 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The target detection method and the training method of the target detection network provided by the embodiment of the application can be applied to the field of small target detection, such as the field of unmanned aerial vehicle aerial photography, and can be applied to an application environment shown in fig. 1. The method for detecting the target may be performed by the terminal 102 or the server 104, the method for training the target detection network may be performed by the server 104, and after the server 104 trains to obtain the target detection network, the target detection network may be stored on a data storage system or stored in a local memory of the terminal. The terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process, such as images to be detected, raw images, training image sets, and target detection networks. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The terminal 102 acquires an image to be detected, and inputs the image to be detected into a target detection network based on Yolov5 improvement, wherein the target detection network comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module; extracting features of the image to be detected through a feature extraction module to sequentially obtain target trunk features of multiple scales, wherein the target trunk features are key image features sequentially obtained in the feature extraction module according to the connection sequence, and as the depth of a network increases, the detail information about the image to be detected carried by the corresponding target trunk features is decreased, and the carried semantic information about the image to be detected is increased; weighting the target trunk features of the multiple scales respectively to obtain weighted trunk features of the multiple scales; the method comprises the steps of respectively inputting weighted trunk features of multiple scales to key fusion nodes of corresponding levels in a feature fusion module, fusing the weighted trunk features with original fusion features in the feature fusion module again to obtain target fusion features, inputting the target fusion features into a detection head, obtaining output of the detection head, obtaining a prediction frame corresponding to an image to be detected, and determining a target detection result of the image to be detected according to the prediction frame.
The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, etc. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in fig. 2, a target detection method is provided, and the method is applied to the terminal 102 in fig. 1 for illustration, and includes the following steps:
step 202, an image to be detected is acquired.
The image to be detected may refer to an image to be subjected to target detection. The image to be detected can be a high-visual-angle high-resolution image obtained by shooting of an unmanned aerial vehicle in the unmanned aerial vehicle aerial shooting field, such as a traffic flow image, an agricultural monitoring image and a safety monitoring image, and can also be a vehicle image obtained by shooting of a vehicle probe installed on a road. The image to be detected may be captured in real time, or may be stored in the terminal 102 or the server 104.
The image to be detected stored in the server 104 may be acquired through a network, for example.
Step 204, inputting the image to be detected into a target detection network based on the YOLOv5 improvement, wherein the target detection network based on the YOLOv5 improvement comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module.
The object detection network may refer to a network for detecting an object present in an image to be detected. The object detection network can be improved by the original YOLOv5 network, and the structure of the object detection network is shown in fig. 3. The target detection network comprises a feature extraction module and a feature fusion module, wherein the output of the feature extraction module is used as the input of the feature fusion module.
The feature extraction module includes a plurality of ConvBNSILU modules, C3 modules, and an SPPF module. The up-sampling module is used for up-sampling the input features; the ConvBNSILU module consists of a Conv convolution operation, a BN (Batch Normalization) operation, and a SiLU activation function, and is used for performing convolution, batch normalization, and activation processing on the input features; the structure and function of the C3 module are the same as those of the CSP structure in the original YOLO series, which improves the learning capability of the CNN, shortens the computation time, and reduces memory usage.
The feature fusion module comprises a plurality of ConvBNSILU modules, Concat modules, and C3 modules. The Concat module performs a concatenation operation that superposes two input feature maps of the same spatial size along the channel dimension, so that their channel numbers are added. The specific structure and connection manner of each level of the feature extraction module and the feature fusion module can refer to fig. 3.
In step 206, feature extraction is performed on the image to be detected through a plurality of key feature extraction layers of the feature extraction module, and target trunk features with a plurality of scales are sequentially obtained, wherein the target trunk features are key image features sequentially obtained in the feature extraction module according to a connection sequence, as the depth of the network increases, the detail information about the image to be detected carried by the corresponding target trunk features decreases, and the carried semantic information about the image to be detected increases.
The key feature extraction layer may refer to a network layer where the C3 module in fig. 3 is located. For example, a C3 module corresponding to a layer 4 network, a C3 module corresponding to a layer 6 network, and a C3 module corresponding to a layer 8 network. The key feature extraction layer is connected in a manner shown in fig. 3.
The target trunk feature may refer to a key image feature obtained by performing feature extraction on the image to be detected by the feature extraction module. Different network depths correspond to different scales of the output target trunk features.
Network depth may refer to the depth of the layers of network in the object detection network. The depth of the layers may be as shown in fig. 3.
Illustratively, feature extraction is performed on the image to be detected through the feature extraction module, and the output of the feature extraction module is obtained to obtain the corresponding target trunk features. As the scale of the target trunk feature becomes smaller (i.e., the network depth increases), the detail information about the image to be detected carried by the target trunk feature decreases, and the semantic information about the image to be detected carried by it increases.
And step 208, respectively carrying out weighting treatment on the target trunk features of the multiple scales to obtain weighted trunk features of the multiple scales.
The weighted trunk feature may refer to a feature obtained by weighting the target trunk feature according to a weight coefficient.
The weight coefficients are preset and can be modified according to the actual situation, for example according to the size of the target trunk feature and the size of the target to be detected. For example, when detecting targets with a smaller scale, the weight coefficient corresponding to the larger-scale target trunk feature may be set larger and the weight coefficient corresponding to the smaller-scale target trunk feature may be set smaller, so that the small-target detection head of the target detection network focuses more on the detail information. For example, the weight coefficient corresponding to the target trunk feature output by the C3 module of the layer 4 network may be named weight3, the weight coefficient for the C3 module of the layer 6 network weight4, and the weight coefficient for the C3 module of the layer 8 network weight5; when the image to be detected contains a large number of small objects, weight3 may be set to 2, weight4 to 1, and weight5 to 0.5. When the image to be detected contains few tiny objects or mostly large objects, all the weight coefficients may be set to 0. Other settings are also possible and are not enumerated here.
Illustratively, the target trunk feature output by each designated shallow layer is weighted according to a preset weight coefficient, so as to obtain a weighted trunk feature.
For example, the weight coefficient corresponding to the C3 module corresponding to the layer 4 network may be set to 2, and then the target trunk feature output by the C3 module corresponding to the layer 4 network is weighted according to the weight coefficient, so as to obtain a corresponding weighted trunk feature. The weight coefficient corresponding to the C3 module corresponding to the layer 6 network may be set to 1, and then the target trunk feature output by the C3 module corresponding to the layer 6 network is weighted according to the weight coefficient, so as to obtain a corresponding weighted trunk feature. The weight coefficient corresponding to the C3 module corresponding to the layer 8 network may be set to 0.5, and then the target trunk feature output by the C3 module corresponding to the layer 8 network is weighted according to the weight coefficient, so as to obtain a corresponding weighted trunk feature.
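As an illustrative sketch of this weighting step, the following minimal PyTorch example assumes the C3 outputs are available as tensors keyed by layer index; the channel and spatial sizes below are placeholders, not values taken from the application:

```python
import torch

# Weight coefficients from the example above: layer index -> coefficient
# (hypothetical mapping; any layer not listed keeps its feature unchanged).
WEIGHTS = {4: 2.0, 6: 1.0, 8: 0.5}

def weight_trunk_features(trunk_feats: dict) -> dict:
    """Scale each key feature extraction layer's output by its weight coefficient."""
    return {idx: feat * WEIGHTS.get(idx, 1.0) for idx, feat in trunk_feats.items()}

# Dummy target trunk features of three scales (batch=1; sizes are illustrative)
trunk = {
    4: torch.randn(1, 128, 80, 80),   # shallow: more detail information
    6: torch.randn(1, 256, 40, 40),
    8: torch.randn(1, 512, 20, 20),   # deep: more semantic information
}
weighted_trunk = weight_trunk_features(trunk)
```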
Step 210, the weighted trunk features of multiple scales are respectively input to key fusion nodes of corresponding levels in the feature fusion module, and are fused with the original fusion features in the feature fusion module again to obtain target fusion features.
The key fusion nodes of the corresponding hierarchy in the feature fusion module may refer to a Concat module of a preset hierarchy, such as layer 20 (belonging to the detection sub-network), layer 23, layer 26, and layer 29.
For example, the weighted trunk feature output by the C3 module corresponding to the layer 4 network may be input to the Concat module of the layer 23 for fusion, the weighted trunk feature output by the C3 module corresponding to the layer 6 network may be input to the Concat module of the layer 26 for fusion, and the weighted trunk feature output by the C3 module corresponding to the layer 8 network may be input to the Concat module of the layer 29 for fusion. The features output by the Concat modules after being fused again have more detail information.
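The re-fusion at a key fusion node can be sketched as a channel-wise concatenation; this is a hedged illustration only, with the actual node wiring following fig. 3 and the channel counts here being assumptions:

```python
import torch

# Original FPN fusion feature arriving at the layer-23 Concat node and the
# weighted trunk feature from the layer-4 C3 module (shapes are illustrative).
fpn_feature = torch.randn(1, 128, 80, 80)
weighted_trunk_l4 = 2.0 * torch.randn(1, 128, 80, 80)   # weight coefficient 2

# Re-fusion: concatenate along the channel dimension, so the channel numbers add up.
target_fusion_feature = torch.cat([fpn_feature, weighted_trunk_l4], dim=1)  # -> (1, 256, 80, 80)
```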
Step 212, inputting the target fusion characteristic into the detection head, obtaining the output of the detection head, obtaining a prediction frame corresponding to the image to be detected, and determining a target detection result of the image to be detected according to the prediction frame.
The prediction box may be used to characterize an object in the image to be detected. When only one target exists in the image to be detected, the corresponding prediction frame is one, and when a plurality of targets exist in the image to be detected, the corresponding prediction frame is a plurality of.
The target fusion characteristic is input into the detection head for processing after the target fusion characteristic is obtained, and the output of the detection head is obtained to obtain a prediction frame corresponding to the image to be detected. When there is only one prediction frame, the prediction frame is directly used as a target detection result, and when there are a plurality of prediction frames, the plurality of prediction frames are used as target detection results.
According to the technical scheme, the weighted trunk features of the multiple scales are respectively input to the key fusion nodes of the corresponding levels in the feature fusion module and are fused with the original fusion features in the feature fusion module, so that the trunk features with more detail information and the original fusion features with more semantic information are fused again to obtain the target fusion features, the fused features have more detail information, the weighting mechanism of the target trunk features enables the small detection head of the target detection network to pay more attention to the detail information of the image to be detected, the small targets in the image to be detected can be conveniently identified, the small targets in the image to be detected can be prevented from being missed by the target detection network, and the accuracy of target detection is improved.
Referring to fig. 4 and 5, in some embodiments, the object detection network further includes a detection sub-network, the output of the feature extraction module is used as an input of the detection sub-network, and the output of the detection sub-network is used as an input of the feature fusion module. The structure of the target detection network is shown in fig. 4.
The target detection method further includes, but is not limited to, the steps of:
step 502, obtaining the target trunk feature of the maximum scale output by the feature extraction module, and obtaining the high-resolution feature.
The high-resolution feature may refer to a feature output by a first C3 module in the feature extraction module. The high-resolution features are features with the largest scale and the most detailed information of the image to be detected carried in all the output target trunk features.
Illustratively, the target trunk feature output by the C3 module corresponding to the layer 2 network shown in fig. 3 is obtained, so as to obtain the high-resolution feature.
And step 504, weighting the high-resolution features according to the weight coefficients to obtain weighted high-resolution features.
The weight coefficient is a preset coefficient, and the weight coefficient can be set according to actual conditions.
For example, the weight coefficient corresponding to the layer 2 network may be set to 3, and then the high resolution feature may be weighted according to the weight coefficient, to obtain a weighted high resolution feature.
Step 506, inputting the weighted high-resolution feature and the fusion feature output by the corresponding FPN structure of YOLOv5 into the detection sub-network to obtain a high-detail detection feature.
The fused feature of the output of the YOLOv5 corresponding FPN structure may refer to the feature of the output of the layer 19 network.
Illustratively, the detection sub-network performs fusion processing on the weighted high-resolution feature and the fusion feature output by the corresponding FPN structure of YOLOv5, and then performs processing through a C3 module to obtain the high-detail detection feature. The high-detail detection feature carries rich detail information.
For example, the weighted high-resolution feature may be input to the Concat module corresponding to the layer 20 network in the feature fusion module, so that this fusion node re-fuses the weighted high-resolution feature with the original FPN fusion feature output, after upsampling, by the upsampling module corresponding to the layer 19 network.
Step 508, inputting the high-detail features into the corresponding high-resolution detection head and feature fusion module.
Wherein a high resolution detection head may refer to a detection head for target detection based on high detail detection features.
Illustratively, the high-detail features are input into the high-resolution detection head P2 in fig. 3, and the high-detail features are input into the layer 22 ConvBNSILU module for input into the feature fusion module so as to facilitate detection of fine objects.
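A minimal sketch of how the detection sub-network input could be formed is given below, assuming nearest-neighbour upsampling of the layer-19 FPN feature and a weight coefficient of 3 as in the example; all shapes and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

upsample = nn.Upsample(scale_factor=2, mode="nearest")

p3_fusion = torch.randn(1, 128, 80, 80)      # FPN fusion feature from the layer-19 network
c3_shallow = torch.randn(1, 64, 160, 160)    # high-resolution feature from the layer-2 C3 module
weighted_hr = 3.0 * c3_shallow               # weighting with coefficient 3

# Concatenation at the layer-20 node: upsampled FPN feature + weighted high-resolution feature
sub_network_input = torch.cat([upsample(p3_fusion), weighted_hr], dim=1)
# sub_network_input then passes through a C3-style block to yield the high-detail feature
# that feeds the P2 high-resolution detection head and the feature fusion module.
```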
According to the technical scheme, the weighted high-resolution features of the feature extraction module are input to the detection sub-network, so that the detection sub-network pays more attention to detail information, the high-resolution features with particularly rich detail information are obtained, the detection head of the sub-network pays more attention to the detail information, and the detection accuracy of the target with smaller scale is further improved.
Referring to fig. 6, in some embodiments, before the step of "acquiring an image to be detected", the target detection method further includes, but is not limited to, the following steps:
Step 602, an original image and an image resolution of the original image are obtained.
The original image may be an image obtained by shooting by unmanned aerial vehicle, road probe, satellite and other equipment with shooting function. The original image may be stored in a data storage system of the server 104 or in a local memory of the terminal 102.
Image resolution may refer to the amount of information stored in the original image, i.e., how many pixels are within an image per inch.
Illustratively, an original image stored in the data storage system of the server 104 and an image resolution corresponding to the original image may be acquired through a network.
Step 604, if the resolution of the image is greater than or equal to the resolution threshold, dividing the original image to obtain a plurality of images to be detected; wherein, the image to be detected and the adjacent image to be detected have an overlapping area.
The resolution threshold is a preset threshold. Modifications may be made depending on the particular situation.
When the image resolution is greater than or equal to the resolution threshold, the resolution of the original image is too high; to facilitate detection, improve detection accuracy and efficiency, and reduce the amount of computation, the original image is segmented to obtain a plurality of images to be detected. Each image to be detected has an overlapping area with its adjacent images to be detected.
For example, when the resolution of the image is greater than or equal to the resolution threshold, the original image is divided into 6 images to be detected, and 25% overlapping areas exist between the images to be detected and the adjacent images to be detected.
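A hedged sketch of such a split is shown below; the 2x3 layout and the exact arithmetic are assumptions used only for illustration, as the example above fixes only the number of tiles and the 25% overlap:

```python
import numpy as np

def split_with_overlap(img: np.ndarray, rows: int = 2, cols: int = 3,
                       overlap: float = 0.25):
    """Split an H x W (x C) image into rows*cols tiles whose neighbouring
    tiles share roughly `overlap` of their width/height."""
    h, w = img.shape[:2]
    # Tile size chosen so the tiles exactly cover the image at the requested overlap.
    tile_h = int(np.ceil(h / (rows - (rows - 1) * overlap)))
    tile_w = int(np.ceil(w / (cols - (cols - 1) * overlap)))
    step_h = int(tile_h * (1 - overlap))
    step_w = int(tile_w * (1 - overlap))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0 = min(r * step_h, h - tile_h)   # clamp the last row/column to the image border
            x0 = min(c * step_w, w - tile_w)
            tiles.append(((x0, y0), img[y0:y0 + tile_h, x0:x0 + tile_w]))
    return tiles  # list of ((x_offset, y_offset), tile)
```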
According to the above technical solution, the image resolution of the original image is compared with the resolution threshold, so that when the image resolution is greater than or equal to the resolution threshold, the original image is segmented into a plurality of images to be detected, which improves the accuracy and efficiency of the subsequent detection of the original image by the target detection network and reduces the amount of computation. Moreover, by setting an overlapping area between each image to be detected and its adjacent images to be detected, the situation where a target to be detected lies in a cutting region and is split apart can be avoided, so that the detection is more complete and the accuracy of target detection is improved.
In some embodiments, the target detection method further comprises: when the image resolution is smaller than the resolution threshold, the original image is taken as the image to be detected.
Specifically, when the image resolution is smaller than the resolution threshold, it is indicated that the image resolution of the original image is not large, and in order to save detection resources and improve detection efficiency, the original image may be directly used as the image to be detected.
Referring to fig. 7, in some embodiments, the target detection method further includes, but is not limited to, the following steps:
step 702, obtaining a target detection result corresponding to the to-be-detected image belonging to the same original image.
The images to be detected belonging to the same original image may refer to the plurality of images to be detected obtained by segmenting that original image. For example, when the original image is segmented, the resulting images to be detected may be named and marked: when original image A is segmented into 6 images to be detected, the 6 images to be detected may be named A-1, A-2, A-3, A-4, A-5, and A-6, respectively. When the target detection results corresponding to the images to be detected of the same original image are acquired, the target detection results corresponding to all the images to be detected carrying the mark A are acquired. The target detection results of the images to be detected may be stored in the data storage system of the server 104 or in the local memory of the terminal 102.
The target detection results corresponding to the to-be-detected images of the same identifier can be obtained through a network, so that the target detection results corresponding to the to-be-detected images belonging to the same original image can be obtained.
Step 704, performing coordinate transformation on each target detection result to obtain the position of each target detection result in the original image.
The coordinate conversion may refer to a manner of converting a pixel point corresponding to each coordinate in the image to be detected into a corresponding coordinate in the original image.
The method comprises the steps of obtaining a segmentation mode when an original image is segmented, obtaining a mode of establishing a coordinate system of each image to be detected, and carrying out coordinate conversion on coordinates of each pixel point in each image to be detected according to the segmentation mode and the mode of establishing the coordinate system of each image to be detected, so as to obtain positions of each pixel point in each image to be detected in the original image, and further determining positions of each target detection result in the original image.
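A minimal sketch of this coordinate conversion is shown below, assuming each tile carries the (x, y) offset of its top-left corner in the original image, a hypothetical data layout matching the tiling sketch above:

```python
def detections_to_original(dets_per_tile):
    """Map tile-level detections back to original-image coordinates.

    dets_per_tile: list of ((x_offset, y_offset), [(x1, y1, x2, y2, score, cls), ...]) pairs.
    """
    merged = []
    for (x_off, y_off), dets in dets_per_tile:
        for x1, y1, x2, y2, score, cls in dets:
            # Shift each box by its tile's offset in the original image.
            merged.append((x1 + x_off, y1 + y_off, x2 + x_off, y2 + y_off, score, cls))
    return merged
```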
Step 706, determining the detection result of the original image according to the position of each target detection result in the original image.
For example, all the target detection results may be combined, and the repeated target detection results may be removed, so that the detection result of the original image may be obtained.
According to the technical scheme, the positions of the target detection results in the original image are determined by carrying out coordinate conversion on the target detection results corresponding to the images to be detected, so that the detection results of the original image can be determined conveniently, and the accuracy of the detection results of the original image is improved.
In some embodiments, the step of determining the detection result of the original image based on the position of each target detection result in the original image includes: combining the target detection results according to the positions of the target detection results in the original image to obtain combined detection results; and screening the combined detection result according to a non-maximum suppression algorithm to obtain a detection result of the original image.
The non-maximum suppression algorithm (Non-Maximum Suppression, NMS) refers to an algorithm that searches for local maxima and suppresses non-maximum values.
In this embodiment, the prediction frames of the images to be detected may overlap, i.e., a large number of prediction frames may be generated at the same target position, and these prediction frames may overlap with each other; therefore, non-maximum suppression is needed to find the best prediction frame and eliminate the redundant prediction frames.
Specifically, all target detection results can be directly combined to obtain a combined detection result, the combined detection result comprises a large number of prediction frames, then redundant prediction frames are screened out by adopting an NMS algorithm, and the optimal prediction frames of all targets are left, wherein the optimal prediction frames form the detection result of the original image.
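A minimal sketch of this screening step using class-agnostic NMS is given below; torchvision's nms is used for illustration, and the IoU threshold is an assumed value not specified in the application:

```python
import torch
from torchvision.ops import nms

def filter_merged(merged, iou_thr=0.45):
    """Run class-agnostic NMS over the merged detections and keep the best boxes."""
    if not merged:
        return []
    boxes = torch.tensor([d[:4] for d in merged], dtype=torch.float32)
    scores = torch.tensor([d[4] for d in merged], dtype=torch.float32)
    keep = nms(boxes, scores, iou_thr)          # indices of boxes to keep
    return [merged[i] for i in keep.tolist()]
```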
According to the above technical solution, redundant prediction frames in the combined detection result are removed by using the NMS algorithm and the optimal prediction frames are selected, which improves the accuracy of the detection result of the original image.
Referring to fig. 8, some embodiments of the present application further provide a training method of the target detection network, which is illustrated by using the training method of the target detection network applied to the server 104 in fig. 1 as an example, including, but not limited to, the following steps:
step 802, obtaining a training image set and a real frame corresponding to each target to be detected in each training image in the training image set.
The training image set may refer to the image set used for training the original detection network to obtain the target detection network; the training image set includes a plurality of training images, and the targets to be detected in each training image of the training image set are marked with real frames. The training image set may be stored in the server 104. The training image set may be the VisDrone2019 dataset, or another image set; the application is not particularly limited in this regard.
For example, a training image set of the data storage system stored in the server 104 and a real frame corresponding to each target to be detected in each training image may be acquired.
Step 804, inputting the training image into an original detection network based on YOLOv5 improvement, wherein the original detection network comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module.
The original detection network may refer to an untrained target detection network, and the connection relationship between the structure and each network layer is shown in fig. 3, and the original detection network includes a feature extraction module and a feature fusion module, where the output of the feature extraction module is used as the input of the feature fusion module. The original detection network can be improved by a YOLOv5 network.
For example, the training image set may be divided into a training set (80%), a verification set (10%), and a test set (10%), and then training images belonging to the training set are input into the object detection network shown in fig. 3 to perform a training process on the object detection network.
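For illustration, a tiny helper for such a split is sketched below; the 80/10/10 proportions follow the example above, while the shuffling and the helper itself are assumptions rather than part of the application:

```python
import random

def split_dataset(image_paths, seed=0):
    """Randomly split the training image set into train (80%), val (10%), and test (10%)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```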
Step 806, extracting features of the training image through the plurality of key feature extraction layers of the feature extraction module, and sequentially obtaining target trunk features of a plurality of scales, wherein as the depth of the network increases, the detail information about the training image carried by the target trunk features decreases, and the semantic information about the training image carried by the target trunk features increases.
The key feature extraction layer may refer to a network layer where the C3 module in fig. 3 is located. For example, a C3 module corresponding to a layer 4 network, a C3 module corresponding to a layer 6 network, and a C3 module corresponding to a layer 8 network. The key feature extraction layer is connected in a manner shown in fig. 3.
The feature extraction module performs feature extraction on the training image, and the output of each key feature extraction layer is obtained to obtain the corresponding target trunk features. As the scale of the target trunk feature becomes smaller (i.e., the network depth increases), the detail information about the training image carried by the target trunk feature decreases, and the semantic information about the training image carried by it increases.
Step 808, weighting the target trunk feature of multiple scales to obtain a weighted trunk feature.
The weighted trunk feature may refer to a feature obtained by weighting the target trunk feature according to a weight coefficient.
The weight coefficients are preset and can be modified according to the actual situation, for example according to the size of the target trunk feature and the size of the target to be detected. For example, when detecting targets with a smaller scale, the weight coefficient corresponding to the larger-scale target trunk feature may be set larger and the weight coefficient corresponding to the smaller-scale target trunk feature may be set smaller, so that the small-target detection head of the target detection network focuses more on the detail information. For example, the weight coefficient corresponding to the target trunk feature output by the C3 module of the layer 4 network may be named weight3, the weight coefficient for the C3 module of the layer 6 network weight4, and the weight coefficient for the C3 module of the layer 8 network weight5; when the image contains a large number of small objects, weight3 may be set to 2, weight4 to 1, and weight5 to 0.5. When the image contains few tiny objects or mostly large objects, all the weight coefficients may be set to 0. Other settings are also possible and are not enumerated here.
Illustratively, the target trunk feature output by each designated shallow layer is weighted according to a preset weight coefficient, so as to obtain a weighted trunk feature.
For example, the weight coefficient corresponding to the C3 module corresponding to the layer 4 network may be set to 2, and then the target trunk feature output by the C3 module corresponding to the layer 4 network is weighted according to the weight coefficient, so as to obtain a corresponding weighted trunk feature. The weight coefficient corresponding to the C3 module corresponding to the layer 6 network may be set to 1, and then the target trunk feature output by the C3 module corresponding to the layer 6 network is weighted according to the weight coefficient, so as to obtain a corresponding weighted trunk feature. The weight coefficient corresponding to the C3 module corresponding to the layer 8 network may be set to 0.5, and then the target trunk feature output by the C3 module corresponding to the layer 8 network is weighted according to the weight coefficient, so as to obtain a corresponding weighted trunk feature.
And step 810, respectively inputting the weighted trunk features of the multiple scales to key fusion nodes of corresponding levels in the feature fusion module, and re-fusing the weighted trunk features with the original fusion features in the feature fusion module to obtain target fusion features.
The key fusion nodes of the corresponding hierarchy in the feature fusion module may refer to a Concat module of a preset hierarchy, such as layer 20 (belonging to the detection sub-network), layer 23, layer 26, and layer 29.
Illustratively, the weighted trunk features output by the C3 module (key feature extraction layer) corresponding to the layer 4 network may be input to the Concat module of the layer 23 for fusion, the weighted trunk features output by the C3 module (key feature extraction layer) corresponding to the layer 6 network may be input to the Concat module of the layer 26 for fusion, and the weighted trunk features output by the C3 module (key feature extraction layer) corresponding to the layer 8 network may be input to the Concat module of the layer 29 for fusion. The features output by the Concat modules after being fused again have more detail information.
Step 812, inputting the target fusion characteristic into the detection head, and obtaining the output of the detection head to obtain a prediction frame corresponding to the training image.
The prediction box may be used to characterize a target in the training image. When only one target exists in the training image, the corresponding prediction frame is one, and when a plurality of targets exist in the training image, the corresponding prediction frame is a plurality of.
Step 814, adjusting the original detection network according to the difference between the predicted frame and the real frame corresponding to the training image until the original detection network converges, thereby obtaining the target detection network.
Specifically, a loss function can be established according to the difference between the actual real frame of each target to be detected and the predicted frame obtained by prediction, and then the original detection network is adjusted according to the loss function, so that a trained target detection network is obtained.
According to the technical scheme, the weighted trunk features of the multiple scales are respectively input to the key fusion nodes of the corresponding levels in the feature fusion module and are fused with the original fusion features in the feature fusion module, so that the trunk features with more detail information and the original fusion features with more semantic information are fused again, the fused features have more detail information, the weighting mechanism of the target trunk features enables the small detection head of the target detection network to pay more attention to the detail information of the image to be detected, the small target in the image to be detected is conveniently identified, the small target in the image to be detected is prevented from being missed by the target detection network, and the accuracy of target detection is improved.
It should be noted that, when the original detection network further includes a detection sub-network, the corresponding processing procedure is similar to that of the foregoing embodiment: the weighted high-resolution feature obtained by weighting the high-resolution feature, together with the fusion feature output by the corresponding FPN structure of YOLOv5, forms the input of the detection sub-network, and the output of the detection sub-network is also one of the new inputs of the feature fusion module. The training method of the target detection network then further comprises the following steps: acquiring the high-resolution feature output by the shallowest key feature extraction layer in the feature extraction module; weighting the high-resolution feature according to the weight coefficient to obtain the weighted high-resolution feature; inputting the weighted high-resolution feature and the fusion feature output by the corresponding FPN structure of YOLOv5 into the detection sub-network to obtain the high-detail detection feature; and inputting the high-detail feature into the corresponding high-resolution detection head and the feature fusion module.
The specific process is similar to the embodiments shown in fig. 4 and fig. 5, and will not be described herein again, and the processing process of the predicted frame obtained by the corresponding prediction is also to adjust the target detection network by the difference between the real frame and the predicted frame until the target detection network converges, so as to obtain the target detection network.
Referring to fig. 9-13, in some embodiments, the step of "adjusting the target detection network according to the differences between each real frame and the predicted frame" includes, but is not limited to, the steps of:
in step 902, an angle loss is calculated according to the angle difference between the real frame and the predicted frame.
The angle difference between the real frame and the prediction frame is shown in fig. 10: the prediction frame B and the real frame B^GT form angles α and β as shown in the figure. When α is less than π/4, the convergence process first minimizes α; otherwise it first minimizes β.
The angle loss is defined by formulas (1) to (4):

Λ = 1 - 2·sin²(arcsin(x) - π/4)  (1)

x = c_h / σ = sin(α)  (2)

σ = √((b_cx^gt - b_cx)² + (b_cy^gt - b_cy)²)  (3)

c_h = max(b_cy^gt, b_cy) - min(b_cy^gt, b_cy)  (4)

In formulas (1) to (4), c_h is the height difference between the center points of the real frame and the prediction frame, Λ is the angle loss, σ is the distance between the center points of the real frame and the prediction frame, (b_cx^gt, b_cy^gt) are the coordinates of the center point of the real frame, and (b_cx, b_cy) are the coordinates of the center point of the prediction frame. It can be noted that when α is π/2 or 0, the angle loss is 0.
By the above-described formulas (1) to (4), the angle loss can be calculated.
Step 904, calculating the distance loss according to the distance difference between the real frame and the predicted frame.
Among them, the difference in distance between the real frame and the predicted frame is as shown in fig. 11, and it can be seen from fig. 11 that the contribution of the distance loss is greatly reduced when α→0. Conversely, the closer α is to pi/4, the greater the contribution of distance loss.
The distance loss is defined by formulas (5) and (6):

Δ = Σ_{t=x,y} (1 - e^(-γ·ρ_t))  (5)

ρ_x = ((b_cx^gt - b_cx) / c_w)²,  ρ_y = ((b_cy^gt - b_cy) / c_h)²,  γ = 2 - Λ  (6)

In formulas (5) and (6), Δ denotes the distance loss, and (c_w, c_h) are the width and height of the minimum rectangle enclosing the prediction frame and the real frame.
Illustratively, the angular loss Λ may be calculated by equations (1) through (4), and then the distance loss Δ may be calculated according to the angular loss Λ, equation (5), and equation (6).
Step 906, calculating to obtain the shape loss according to the scale information between the real frame and the prediction frame.
The scale information between the prediction frame and the real frame is shown in fig. 12, and the shape loss is defined by formulas (7) and (8):

Ω = Σ_{t=w,h} (1 - e^(-ω_t))^θ  (7)

ω_w = |w - w_gt| / max(w, w_gt),  ω_h = |h - h_gt| / max(h, h_gt)  (8)

In formulas (7) and (8), Ω denotes the shape loss, (w, h) denote the width and height of the prediction frame, (w_gt, h_gt) denote the width and height of the real frame, and θ controls the degree of attention paid to the shape loss. To avoid paying too much attention to the shape loss and thereby restricting the movement of the prediction frame, θ may take a value in the range [2, 6]; here θ is set to 4.
Illustratively, the shape loss can be calculated by equation (7) and equation (8).
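A corresponding sketch of the shape term of formulas (7) and (8) is given below, with θ exposed as a parameter defaulting to 4 as suggested above; the (N, 2) layout of (width, height) pairs is an assumption.

```python
import torch

def siou_shape_loss(pred_wh: torch.Tensor, gt_wh: torch.Tensor,
                    theta: float = 4.0, eps: float = 1e-7) -> torch.Tensor:
    """Shape loss from formulas (7)-(8); theta in [2, 6] controls attention to shape."""
    omega_w = (pred_wh[:, 0] - gt_wh[:, 0]).abs() / torch.maximum(pred_wh[:, 0], gt_wh[:, 0]).clamp(min=eps)
    omega_h = (pred_wh[:, 1] - gt_wh[:, 1]).abs() / torch.maximum(pred_wh[:, 1], gt_wh[:, 1]).clamp(min=eps)
    return (1.0 - torch.exp(-omega_w)) ** theta + (1.0 - torch.exp(-omega_h)) ** theta
```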
Step 908, calculating an overlap region loss from the overlap region between the real frame and the predicted frame.
Wherein the overlap region may be as shown in fig. 13, and the overlap region loss may be defined by formula (9):
IoU = |B ∩ B^GT| / |B ∪ B^GT|    (9)

In formula (9), IoU denotes the overlap-region term (intersection over union), B^GT denotes the area of the real frame, and B denotes the area of the prediction frame.
The overlap area penalty IOU can be calculated by the above equation (9), for example.
Step 910, calculating a total loss according to the angle loss, the distance loss, the shape loss and the overlapping area loss, and adjusting the original detection network according to the total loss.
Specifically, the total loss can be expressed by the following equation (10):
Loss = 1 − IoU + (Δ + Ω) / 2    (10)
the total loss can be calculated through the formula (10), and then the original detection network is adjusted through a back propagation algorithm according to the total loss until the original detection network converges, so that the target detection network is obtained.
According to the technical scheme, the SIOU Loss is taken as the total Loss function, so that the difference between the prediction frame and the real frame can be conveniently obtained, the original detection network can be conveniently adjusted, the target detection network is obtained, and the accuracy of the subsequent target detection network on target detection is improved.
In the technical scheme, the original detection network is trained on a platform using a Tesla P100 GPU, with 300 training epochs and imgsz set to 640; a weight file corresponding to the trained target detection network is finally obtained.
The results obtained by comparing the original YOLOv5 network, YOLOv5-Snet network (introducing the detection sub-network), YOLOv5-SIOU network (taking SIOU Loss as the total Loss function), YOLOv5-AW network (not using SIOU Loss as the total Loss function, introducing the weighted trunk feature and the detection sub-network) with YOLOv5-AW-SIOU network (taking SIOU Loss as the total Loss function, introducing the weighted trunk feature and the detection sub-network) on the test set are shown in table 1 below.
TABLE 1
Network            mAP_0.5     Epoch for optimal value
YOLOv5             0.41473     -
YOLOv5-SIOU        0.42830     -
YOLOv5-Snet        0.44166     -
YOLOv5-AW          0.46207     232
YOLOv5-AW-SIOU     0.46668     202
Here, mAP_0.5 is the mAP of the model calculated at an IoU threshold of 0.5. mAP (mean Average Precision) is obtained by computing the AP for each class separately and then averaging over all classes; it is a standard metric in object detection for evaluating model performance. A smaller epoch value indicates that the corresponding network trains faster and reaches its best result sooner.
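For clarity, the averaging described above amounts to the following minimal sketch, where the per-class AP values are assumed to have been computed beforehand:

```python
def mean_average_precision(ap_per_class: dict) -> float:
    """mAP: average the per-class AP values; evaluated at IoU = 0.5 this is mAP_0.5."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```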
As can be seen from Table 1, the YOLOv5-AW-SIOU network, which uses SIOU Loss as the total loss function and introduces the weighted trunk features and the detection sub-network, significantly improves the detection accuracy of the original YOLOv5 network on small-sized targets, with mAP_0.5 improved by more than 5 percentage points.
Referring to fig. 14, some embodiments of the present application further provide a target detection method, including but not limited to the following steps:
step 1402, obtaining a training image set and a real frame corresponding to each target to be detected in each training image in the training image set.
In step 1404, the training image is input into an original detection network, where the original detection network includes a feature extraction module, a feature fusion module, and a detection sub-network, an output of the feature extraction module is used as an input of the feature fusion module and the detection sub-network, and an output of the detection sub-network is also used as an input of the feature fusion module.
In step 1406, feature extraction is performed on the training image through a plurality of key feature extraction layers of the feature extraction module, so as to sequentially obtain the target trunk features output by each key feature extraction layer; as the depth of the network increases, the detail information about the training image carried by the corresponding target trunk feature decreases, while the semantic information carried about the training image increases.
In step 1408, weighting is performed on the target trunk features of the multiple scales, so as to obtain weighted trunk features of the multiple scales.
Step 1410, the weighted trunk features of multiple scales are respectively input to key fusion nodes of corresponding levels in the feature fusion module, and are fused with the original fusion features in the feature fusion module again, so as to obtain target fusion features.
In step 1412, the fusion feature of the weighted high-resolution feature and the output of the YOLOv5 corresponding FPN structure is input to the detection sub-network, so as to obtain a high-detail detection feature.
Step 1414, inputting the target fusion features into the detection head, inputting the high-detail features into the corresponding high-resolution detection head and feature fusion module, and obtaining the output of the detection head to obtain the prediction frames of the targets to be detected in the training images.
In step 1416, an angle loss is calculated based on the angle difference between the real frame and the predicted frame.
In step 1418, a distance loss is calculated based on the distance difference between the real frame and the predicted frame.
In step 1420, a shape loss is calculated based on the scale information between the real frame and the predicted frame.
Step 1422, calculating the overlap loss according to the overlap between the real frame and the predicted frame.
Step 1424, calculating to obtain total loss according to the angle loss, the distance loss, the shape loss and the overlapping area loss, and adjusting the original detection network according to the total loss until the original detection network converges to obtain a target detection network, wherein the target detection network is used for target detection.
Step 1426, the original image and the image resolution of the original image are acquired.
In step 1428, it is determined whether the image resolution is greater than or equal to a resolution threshold.
If yes, go to step 1430; if no, go to step 1440.
Step 1430, dividing the original image to obtain a plurality of images to be detected; wherein, the image to be detected and the adjacent image to be detected have an overlapping area.
In step 1432, each image to be detected is input into the target detection network, and a target detection result corresponding to the image to be detected belonging to the same original image is obtained.
In step 1434, coordinate transformation is performed on each target detection result, so as to obtain the position of each target detection result in the original image.
And step 1436, merging the target detection results according to the positions of the target detection results in the original image to obtain a merged detection result.
And step 1438, screening the combined detection result according to a non-maximum suppression algorithm to obtain a detection result of the original image.
Step 1440, inputting the original image into the target detection network, and obtaining the output of the target detection network, so as to obtain the detection result corresponding to the original image.
It should be noted that the specific implementations of the above steps are described in the embodiments of fig. 2 to fig. 13 and are not repeated here in detail.
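To make the sliced-inference flow of steps 1426 to 1440 concrete, a minimal sketch follows. It assumes the detection network returns per-image boxes in (x1, y1, x2, y2) format together with scores; the tile size, overlap, resolution threshold, and class-agnostic NMS are placeholder choices for illustration, not values fixed by the patent.

```python
import torch
from torchvision.ops import nms

def detect_large_image(model, image: torch.Tensor, tile: int = 640, overlap: int = 128,
                       res_threshold: int = 1280, iou_thr: float = 0.5):
    """Tile a high-resolution image with overlap, detect on each tile, map boxes back
    to original-image coordinates, and merge the results with NMS."""
    _, h, w = image.shape
    if max(h, w) < res_threshold:                      # low resolution: detect directly
        return model(image)

    all_boxes, all_scores = [], []
    step = tile - overlap
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            crop = image[:, y0:y0 + tile, x0:x0 + tile]
            boxes, scores = model(crop)
            # coordinate transformation: shift tile-local boxes back to the original image
            boxes = boxes + torch.tensor([x0, y0, x0, y0], dtype=boxes.dtype, device=boxes.device)
            all_boxes.append(boxes)
            all_scores.append(scores)

    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)                 # suppress duplicates from overlap regions
    return boxes[keep], scores[keep]
```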
According to the technical scheme, the weighted trunk features of the multiple scales are input to the key fusion nodes of the corresponding levels in the feature fusion module and fused again with the original fusion features there. In this way, the target trunk features, which carry more detail information, are re-fused with the original fusion features, which carry more semantic information, so the fused features retain more detail. The weighting mechanism applied to the trunk features also lets the small-object detection head of the target detection network pay more attention to the detail information of the image to be detected, which makes small targets easier to identify, prevents the target detection network from missing them, and improves the accuracy of target detection.
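To illustrate the weighting mechanism, the following sketch shows one plausible form of a weighted re-fusion node: a learnable coefficient scales a backbone (trunk) feature before it is combined with the neck's fusion feature at the same level and projected back. The module name, the concatenation-based fusion, and the 1x1 convolution are assumptions for illustration, not the patent's exact structure; both inputs are assumed to share the same spatial resolution.

```python
import torch
import torch.nn as nn

class WeightedBackboneFusion(nn.Module):
    """Scale a trunk feature by a learnable weight, then re-fuse it with the
    corresponding fusion-node feature of the neck."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1))                 # learnable weight coefficient
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, trunk_feat: torch.Tensor, fusion_feat: torch.Tensor) -> torch.Tensor:
        weighted = self.weight * trunk_feat                        # weighted trunk feature
        return self.fuse(torch.cat([weighted, fusion_feat], dim=1))  # re-fusion at the key node
```

Because the weight is learned end to end, the network can decide how much detail-rich shallow information to inject at each fusion node, which is the intuition behind the small-target improvement described above.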
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least part of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiment of the application also provides an object detection device for realizing the above-mentioned object detection method. The implementation of the solution provided by the device is similar to the implementation described in the method above.
In one embodiment, as shown in fig. 15, there is provided an object detection apparatus including: an image acquisition module 1502, an image input module 1504, an extraction module 1506, a weighting module 1508, a fusion module 1510, and a prediction module 1512, wherein:
an image acquisition module 1502, configured to acquire an image to be detected;
an image input module 1504, configured to input an image to be detected into a target detection network that is improved based on YOLOv5, where the target detection network that is improved based on YOLOv5 includes a feature extraction module and a feature fusion module, and an output of the feature extraction module is used as an input of the feature fusion module;
the extracting module 1506 is configured to perform feature extraction on an image to be detected through a plurality of key feature extraction layers of the feature extracting module and sequentially obtain target trunk features of a plurality of scales, where the target trunk features are key image features sequentially obtained in the feature extracting module according to a connection sequence; as the depth of the network increases, the detail information about the image to be detected carried by the corresponding target trunk features decreases, while the semantic information carried about the image to be detected increases;
The weighting module 1508 is configured to perform weighting processing on the target trunk features of the multiple scales, so as to obtain weighted trunk features of the multiple scales;
the fusion module 1510 is configured to input weighted trunk features of multiple scales to key fusion nodes of corresponding levels in the feature fusion module respectively, and fuse the weighted trunk features with original fusion features in the feature fusion module again to obtain target fusion features;
the prediction module 1512 is configured to input the target fusion feature into the detection head, obtain an output of the detection head, obtain a prediction frame corresponding to the image to be detected, and determine a target detection result of the image to be detected according to the prediction frame.
In some embodiments, the object detection apparatus further comprises:
the high-resolution feature acquisition module is used for acquiring the target trunk feature of the maximum scale output by the feature extraction module to obtain the high-resolution feature.
And the high-resolution characteristic weighting module is used for weighting the high-resolution characteristic according to the weight coefficient to obtain a weighted high-resolution characteristic.
And the detection sub-network input module is used for inputting the fusion feature of the weighted high-resolution feature and the output of the corresponding YOLOv5 FPN structure into the detection sub-network to obtain the high-detail detection feature.
And the high-resolution feature fusion module is used for inputting the high-detail features into the corresponding high-resolution detection head and the feature fusion module.
In some embodiments, the object detection apparatus further comprises:
the original image acquisition module is used for acquiring the original image and the image resolution of the original image.
The segmentation module is used for segmenting the original image to obtain a plurality of images to be detected if the resolution of the image is greater than or equal to the resolution threshold value; wherein, the image to be detected and the adjacent image to be detected have an overlapping area.
In some embodiments, the object detection apparatus further comprises:
the target detection result acquisition module is used for acquiring target detection results corresponding to the images to be detected belonging to the same original image.
The coordinate conversion module is used for carrying out coordinate conversion on each target detection result to obtain the position of each target detection result in the original image;
and the detection result determining module is used for determining the detection result of the original image according to the position of each target detection result in the original image.
In some embodiments, the detection result determining module is further configured to combine the target detection results according to the positions of the target detection results in the original image, so as to obtain a combined detection result; and screening the combined detection result according to a non-maximum suppression algorithm to obtain a detection result of the original image.
As shown in fig. 16, some embodiments of the present application further provide a training apparatus of an object detection network, including:
the training data acquisition module 1602 is configured to acquire a training image set and a real frame corresponding to each target to be detected in each training image in the training image set;
the training image input module 1604 is configured to input a training image into a primitive detection network that is improved based on YOLOv5, where the primitive detection network that is improved based on YOLOv5 includes a feature extraction module and a feature fusion module, and an output of the feature extraction module is used as an input of the feature fusion module;
the training image extraction module 1606 is configured to perform feature extraction on the training image through a plurality of key feature extraction layers of the feature extraction module and sequentially obtain target trunk features of a plurality of scales, where the target trunk features are key image features sequentially obtained in the feature extraction module according to a connection sequence; as the depth of the network increases, the detail information about the training image carried by the corresponding target trunk feature decreases, while the semantic information about the training image carried by the corresponding target trunk feature increases;
the training image feature weighting module 1608 is configured to respectively perform weighting processing on the target trunk features of multiple scales to obtain weighted trunk features of multiple scales;
The training image feature fusion module 1610 is configured to input weighted trunk features of multiple scales to key fusion nodes of corresponding levels in the feature fusion module respectively, and fuse the weighted trunk features with original fusion features in the feature fusion module again to obtain target fusion features;
the training image prediction module 1612 is configured to input the target fusion feature into the detection head, and obtain an output of the detection head, to obtain a prediction frame corresponding to the training image.
The training adjustment module 1614 is configured to adjust the original detection network according to the difference between the predicted frame and the real frame corresponding to the training image until the original detection network converges, so as to obtain the target detection network.
In some embodiments, the training adjustment module is further configured to calculate an angle loss according to an angle difference between the real frame and the predicted frame; according to the distance difference between the real frame and the prediction frame, calculating to obtain the distance loss; calculating to obtain shape loss according to the scale information between the real frame and the prediction frame; calculating the loss of the overlapping area according to the overlapping area between the real frame and the predicted frame; and calculating total loss according to the angle loss, the distance loss, the shape loss and the overlapping area loss, and adjusting the target detection network according to the total loss.
The above-described respective modules in the object detection apparatus or the training apparatus of the object detection network may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 17. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by the processor to implement a method of object detection or a training method of an object detection network. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 17 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application applies, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor, when executing the computer program, performing the steps of the above-described object detection method or training method of an object detection network.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, implements the steps of a target detection method or a training method of a target detection network.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, implements the steps of a target detection method or a training method of a target detection network.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A YOLOv 5-based improved target detection method, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into a target detection network based on the Yolov5 improvement, wherein the target detection network based on the Yolov5 improvement comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module;
Extracting the features of the image to be detected through a plurality of key feature extraction layers of the feature extraction module, and sequentially obtaining target trunk features of a plurality of scales, wherein the target trunk features are key image features sequentially obtained in the feature extraction module according to the connection sequence, and as the depth of a network increases, the detail information about the image to be detected, carried by the corresponding target trunk features, is decreased, and the carried semantic information about the image to be detected is increased;
respectively carrying out weighting treatment on the target trunk features of the multiple scales to obtain weighted trunk features of the multiple scales;
respectively inputting the weighted trunk features of the multiple scales to key fusion nodes of corresponding levels in the feature fusion module, and re-fusing the weighted trunk features with original fusion features in the feature fusion module to obtain target fusion features;
and inputting the target fusion characteristics into a detection head, acquiring output of the detection head, obtaining a prediction frame corresponding to the image to be detected, and determining a target detection result of the image to be detected according to the prediction frame.
2. The method of claim 1, wherein the target detection network further comprises a detection subnetwork, the output of the feature extraction module being an input to the detection subnetwork, the output of the detection subnetwork being an input to the feature fusion module;
The method further comprises the steps of:
obtaining the target trunk feature of the maximum scale output by the feature extraction module to obtain a high-resolution feature;
weighting the high-resolution features according to the weight coefficient to obtain weighted high-resolution features;
inputting the fusion characteristic of the weighted high-resolution characteristic and the corresponding FPN structure output of the YOLOv5 into a detection sub-network to obtain a high-detail detection characteristic;
and inputting the high-detail features into the corresponding high-resolution detection head and the feature fusion module.
3. The method according to claim 1 or 2, characterized in that before acquiring the image to be detected, the method further comprises:
acquiring an original image and the image resolution of the original image;
if the resolution of the image is greater than or equal to a resolution threshold, dividing the original image to obtain a plurality of images to be detected; wherein, the image to be detected and the adjacent image to be detected have an overlapping area.
4. A method according to claim 3, characterized in that the method further comprises:
obtaining a target detection result corresponding to an image to be detected belonging to the same original image;
performing coordinate transformation on each target detection result to obtain the position of each target detection result in the original image;
And determining the detection result of the original image according to the position of each target detection result in the original image.
5. The method of claim 4, wherein determining the detection result of the original image based on the position of each of the target detection results in the original image comprises:
combining the target detection results according to the positions of the target detection results in the original image to obtain combined detection results;
and screening the combined detection result according to a non-maximum suppression algorithm to obtain a detection result of the original image.
6. A method of training a target detection network, the method comprising:
acquiring a training image set and a real frame corresponding to each target to be detected in each training image in the training image set;
inputting the training image into a Yolov 5-based improved original detection network, wherein the Yolov 5-based improved original detection network comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module;
extracting features of an image to be detected through a plurality of key feature extraction layers of a feature extraction module, and sequentially obtaining target trunk features of a plurality of scales, wherein the target trunk features are key image features sequentially obtained in the feature extraction module according to a connection sequence, and as the depth of a network increases, the detail information about the training image carried by the corresponding target trunk features is decreased, and the carried semantic information about the training image is increased;
Respectively carrying out weighting treatment on the target trunk features of the multiple scales to obtain weighted trunk features of the multiple scales;
respectively inputting the weighted trunk features of the multiple scales to key fusion nodes of corresponding levels in the feature fusion module, and re-fusing the weighted trunk features with original fusion features in the feature fusion module to obtain target fusion features;
inputting the target fusion characteristics into a detection head, and obtaining the output of the detection head to obtain a prediction frame corresponding to the training image;
and adjusting the original detection network according to the difference between the prediction frame corresponding to the training image and the real frame until the original detection network converges to obtain a target detection network.
7. The method of claim 6, wherein adjusting the original detection network based on a difference between the predicted frame and the real frame corresponding to the training image comprises:
calculating to obtain angle loss according to the angle difference between the real frame and the prediction frame;
calculating to obtain distance loss according to the distance difference between the real frame and the prediction frame;
calculating to obtain shape loss according to the scale information between the real frame and the prediction frame;
Calculating an overlap region loss according to the overlap region between the real frame and the prediction frame;
and calculating to obtain total loss according to the angle loss, the distance loss, the shape loss and the overlapping area loss, and adjusting the original detection network according to the total loss.
8. An object detection device, the device comprising:
the image acquisition module is used for acquiring an image to be detected;
the image input module is used for inputting the image to be detected into a target detection network based on the YOLOv5 improvement, wherein the target detection network based on the YOLOv5 improvement comprises a feature extraction module and a feature fusion module, and the output of the feature extraction module is used as the input of the feature fusion module;
the extraction module is used for extracting the characteristics of the image to be detected through a plurality of key characteristic extraction layers of the characteristic extraction module, and sequentially obtaining a plurality of scale target trunk characteristics, wherein the target trunk characteristics are key image characteristics sequentially obtained in the characteristic extraction module according to the connection sequence, and as the depth of a network increases, the detail information about the image to be detected, carried by the corresponding target trunk characteristics, is decreased, and the carried semantic information about the image to be detected is increased;
The weighting module is used for respectively carrying out weighting treatment on the target trunk features of the multiple scales to obtain weighted trunk features of the multiple scales;
the fusion module is used for respectively inputting the weighted trunk features of the multiple scales to key fusion nodes of corresponding levels in the feature fusion module, and re-fusing the key fusion nodes with the original fusion features in the feature fusion module to obtain target fusion features;
the prediction module is used for inputting the target fusion characteristic into a detection head, obtaining output of the detection head, obtaining a prediction frame corresponding to the image to be detected, and determining a target detection result of the image to be detected according to the prediction frame.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 5 or the steps of the method of claims 6 to 7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 5 or the steps of the method of claims 6 to 7.
CN202310180011.1A 2023-02-14 2023-02-14 YOLOv 5-based improved target detection method and device and training method Pending CN116310899A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310180011.1A CN116310899A (en) 2023-02-14 2023-02-14 YOLOv 5-based improved target detection method and device and training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310180011.1A CN116310899A (en) 2023-02-14 2023-02-14 YOLOv 5-based improved target detection method and device and training method

Publications (1)

Publication Number Publication Date
CN116310899A true CN116310899A (en) 2023-06-23

Family

ID=86829942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310180011.1A Pending CN116310899A (en) 2023-02-14 2023-02-14 YOLOv 5-based improved target detection method and device and training method

Country Status (1)

Country Link
CN (1) CN116310899A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645502A (en) * 2023-07-27 2023-08-25 云南大学 Power transmission line image detection method and device and electronic equipment
CN116645502B (en) * 2023-07-27 2023-10-13 云南大学 Power transmission line image detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN109816012B (en) Multi-scale target detection method fusing context information
CN111754394B (en) Method and device for detecting object in fisheye image and storage medium
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
CN112668573B (en) Target detection position reliability determination method and device, electronic equipment and storage medium
CN111079739A (en) Multi-scale attention feature detection method
CN111666922A (en) Video matching method and device, computer equipment and storage medium
CN116052026B (en) Unmanned aerial vehicle aerial image target detection method, system and storage medium
CN116645592B (en) Crack detection method based on image processing and storage medium
CN115272250B (en) Method, apparatus, computer device and storage medium for determining focus position
CN117152484B (en) Small target cloth flaw detection method based on improved YOLOv5s
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN116310899A (en) YOLOv 5-based improved target detection method and device and training method
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN115331146A (en) Micro target self-adaptive detection method based on data enhancement and feature fusion
CN114663598A (en) Three-dimensional modeling method, device and storage medium
CN111027551B (en) Image processing method, apparatus and medium
CN114882490B (en) Unlimited scene license plate detection and classification method based on point-guided positioning
CN116704511A (en) Method and device for recognizing characters of equipment list
CN116310832A (en) Remote sensing image processing method, device, equipment, medium and product
CN115731442A (en) Image processing method, image processing device, computer equipment and storage medium
CN115713769A (en) Training method and device of text detection model, computer equipment and storage medium
CN115147720A (en) SAR ship detection method based on coordinate attention and long-short distance context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination