CN116030364A - Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal - Google Patents

Info

Publication number
CN116030364A
CN116030364A
Authority
CN
China
Prior art keywords
network
yolo
frame
box
aerial vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211630194.4A
Other languages
Chinese (zh)
Inventor
丛犁
黄成斌
窦增
姜华
李佳
葛晓楠
李施昊
王彦钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jilin Electric Power Corp
Information and Telecommunication Branch of State Grid Jilin Electric Power Co Ltd
Original Assignee
State Grid Jilin Electric Power Corp
Information and Telecommunication Branch of State Grid Jilin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jilin Electric Power Corp, Information and Telecommunication Branch of State Grid Jilin Electric Power Co Ltd filed Critical State Grid Jilin Electric Power Corp
Priority to CN202211630194.4A priority Critical patent/CN116030364A/en
Publication of CN116030364A publication Critical patent/CN116030364A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and discloses a lightweight target detection method, system, medium, device and terminal for unmanned aerial vehicles. The backbone structure of the YOLO series of networks is analyzed, and the feature pyramid is combined with the Darknet-53 network model to obtain the YOLOv3 backbone network; the feature pyramid part of YOLO-Fastest is pruned to obtain a pruned YOLO-Fastest network; a YOLO decoder is constructed, non-maximum suppression is implemented, and the recognition result is presented on an LCD screen attached to the MCU. The detection results clearly show that the pruned YOLO-Fastest network can locate object targets in the target picture; the pruned network is greatly improved in inference speed, and the quantized model greatly reduces the number of floating-point operations, further accelerating inference.

Description

Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a lightweight target detection method, system, medium, device and terminal for unmanned aerial vehicles.
Background
At present, the whole formed by the power transformation stations and the transmission and distribution lines of various voltages in a power system is called a power grid; it comprises the three units of power transformation, power transmission and power distribution, and is used to transmit and distribute electric energy and to change voltage. In recent years, with the continually accelerating pace of electric power development in China, the operating voltage level of the power system has kept rising and the network scale has kept growing: six large cross-provincial regional power grids have formed nationally, comprising the Northeast, North China, Central China, East China, Northwest and South power grids, and a complete long-distance power transmission network frame has basically taken shape. Target detection is an important component of the unmanned aerial vehicle inspection system of the smart grid, and many researchers in the field of computer vision have carried out a great deal of research on target detection.
Some current lightweight improved target detection algorithms can be deployed on embedded edge devices, for example Tiny YOLOv3, YOLO Nano and Pelee, and in particular the recently proposed YOLO-Fastest algorithm, which achieves real-time detection with single-frame inference time below 100 ms using the NCNN inference framework on Cortex-A devices such as the Raspberry Pi 3B. By sacrificing a little accuracy relative to high-power, high-cost GPUs, these algorithms realize real-time target detection on low-power, low-cost embedded devices; for MCUs (Micro Controller Units), however, whose power, cost and energy-consumption budgets are lower still, they remain too heavy. Existing intelligent transportation systems contain a large number of widely distributed MCU-based IOT devices, yet the weights of pedestrian detection algorithms are basically larger than 1 MB and their computational cost is above 0.2 BFLOPs, so real-time target detection on MCU devices is difficult to realize. Further lightweight pruning of the model without greatly reducing detection accuracy also places high demands on the design of the algorithm model.
In addition, multi-dimensional data-intensive computation such as convolution or pooling requires a large amount of compute resources, while the main frequency of an MCU is very low (ranging from tens to hundreds of MHz) and the CPU must compute continuously, so the real-time effect is very difficult to guarantee. The microprocessor itself has some resources for accelerating computation, such as the DSP resources contained in the Cortex-M4 and Cortex-M7, which can accelerate the neural network to a certain extent; but the acceleration effect is very limited, not all operators are supported, and the specific model and operators must be adapted accordingly. Moreover, the MCU usually runs bare-metal or an embedded microkernel real-time operating system rather than a macro-kernel operating system such as Linux or Windows, on which hundreds of megabytes of application software (such as OpenCV or TensorFlow) can be installed very conveniently. Even with AI-oriented offerings such as microInder and xidinianOS, the output of the model still requires further processing so that the output data can be matched to the boxes in the real picture and acquire practical significance; therefore a YOLO decoder and non-maximum suppression (NMS) running on the MCU need to be designed.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) Existing smart grid systems contain a large number of widely distributed MCU-based IOT devices, yet the weights of target detection algorithms are basically larger than 1 MB and their computational cost is above 0.2 BFLOPs, so real-time target detection on MCU devices is difficult to realize.
(2) Further lightweight pruning of the model without greatly reducing detection accuracy places high demands on the design of the algorithm model; the microprocessor itself has resources available for computational acceleration, but the acceleration is very limited and not all operators are supported.
(3) Multi-dimensional data-intensive computation such as convolution or pooling requires a large amount of compute resources, while the main frequency of the MCU is very low (tens to hundreds of MHz); relying on the CPU for continuous computation takes a long time, and the real-time effect is very difficult to guarantee.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a method, a system, a medium, equipment and a terminal for detecting a lightweight target of an unmanned aerial vehicle.
The technical scheme adopted by the invention is as follows: the unmanned aerial vehicle lightweight target detection method comprises the following steps:
step one: analyzing the backbone network structure of the YOLO series of networks, and combining the feature pyramid with the Darknet-53 network model to obtain the YOLOv3 backbone network;
step two: pruning the feature pyramid part of YOLO-Fastest to obtain a pruned YOLO-Fastest network;
step three: constructing a YOLO decoder, implementing non-maximum suppression, and finally presenting the recognition result on an LCD screen attached to the MCU.
The first step is specifically as follows: the Darknet-53 network model in this step consists of Convolutional and Residual structures; the Convolutional structure comprises a common convolution layer, and the activation function used is Leaky ReLU.
The second step is specifically as follows: in the Residual structure of the YOLO-Fastest backbone network, the Residual module uses a shortcut mechanism to relieve the gradient-vanishing problem caused by simply increasing depth in the neural network; a channel directly connecting input and output is established by identity mapping, so that the network learns the residual value between its input and output; the YOLO-Fastest backbone network uses 5 Convolutional structures in total, with a shortcut between each pair of Convolutional structures.
From the perspective of network pruning, the up-sampling operation in the original network structure is removed, and the branch originally used for detecting and outputting small targets is pruned, so that only the detection capability for large targets is retained; the original five-layer Convolutional structure of the backbone network is reduced to three layers, and the remaining convolution layers are retained.
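As a toy illustration of the shortcut mechanism just described, the following sketch (our names; a 1-D stand-in, not the patent's actual network) shows how the identity mapping carries the input past the branch:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Leaky ReLU, the activation used in the Darknet Convolutional blocks
    return np.where(x > 0, x, alpha * x)

def residual_block(x, weight):
    # Toy 1-D stand-in for a convolutional Residual module: the branch
    # is a weighted layer plus activation, and the shortcut adds the
    # identity-mapped input back onto the branch output.
    branch = leaky_relu(x @ weight)
    return branch + x  # shortcut: output = F(x) + x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = np.zeros((8, 8))          # a branch that has learned nothing
y = residual_block(x, w)      # ...still passes the input through intact
```

Even when the branch contributes nothing, the shortcut preserves the input, which is what keeps gradients flowing as depth increases.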
The construction of the YOLO decoder in the third step comprises the following steps:
step 3.1: encoding process
When the original image size (height × width) of the input image is 500 × 600, the coordinates of the prediction box are (x_min, y_min, x_max, y_max) = (50, 100, 250, 300):
Calculate the center pixel coordinates of the prediction box:
x = (x_min + x_max)/2 = 150, y = (y_min + y_max)/2 = 200;
calculate the width and height of the prediction box:
w = x_max − x_min = 200, h = y_max − y_min = 200;
normalize with respect to the original image size to obtain the box coordinates (b_x, b_y, b_w, b_h):
b_x = x/W = 150/600 = 0.25, b_y = y/H = 200/500 = 0.4, b_w = w/W = 200/600 ≈ 0.333, b_h = h/H = 200/500 = 0.4.
When the picture is divided into 13 × 13 cells, (G_x, G_y, G_w, G_h) denote the four coordinates of the real prediction box mapped onto the feature map, i.e. the mapped center X and Y values and the mapped width and height of the prediction box:
(G_x, G_y) = (b_x, b_y) × 13 = (3.25, 5.2);
rounding down gives the corresponding cell index (C_x, C_y) = (3, 5), and the offsets t_x, t_y of the prediction box coordinates relative to the grid coordinates satisfy:
σ(t_x) = G_x − C_x = 0.25, σ(t_y) = G_y − C_y = 0.2;
taking the inverse sigmoid function gives:
t_x = σ⁻¹(0.25) = ln(0.25/0.75) ≈ −1.10, t_y = σ⁻¹(0.2) = ln(0.2/0.8) ≈ −1.39.
For the width and height relative to the original image, when the anchor = (68, 118):
t_w = ln(w/P_w) = ln(200/68) ≈ 1.07, t_h = ln(h/P_h) = ln(200/118) ≈ 0.528.
The encoded result is finally obtained: (σ(t_x), σ(t_y), t_w, t_h) = (0.25, 0.2, 1.07, 0.528).
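The worked encoding example above can be sketched end-to-end as follows (variable names are ours; the description's rounded figures are noted in the comments):

```python
import math

# 500 x 600 (H x W) image, prediction box (50, 100, 250, 300),
# 13 x 13 grid, anchor (68, 118), as in the worked example.
W, H, grid = 600, 500, 13
x_min, y_min, x_max, y_max = 50, 100, 250, 300
anchor_w, anchor_h = 68, 118

x_c, y_c = (x_min + x_max) / 2, (y_min + y_max) / 2  # center: (150, 200)
w, h = x_max - x_min, y_max - y_min                  # size: (200, 200)

# normalize by the original image size
b_x, b_y, b_w, b_h = x_c / W, y_c / H, w / W, h / H  # (0.25, 0.4, ~0.333, 0.4)

# map onto the grid and take the cell index by rounding down
G_x, G_y = b_x * grid, b_y * grid                    # (3.25, 5.2)
C_x, C_y = int(G_x), int(G_y)                        # (3, 5)

# cell-relative offsets sigma(t_x), sigma(t_y), then the inverse sigmoid
sig_tx, sig_ty = G_x - C_x, G_y - C_y                # (0.25, 0.2)
t_x = math.log(sig_tx / (1 - sig_tx))                # ~ -1.10
t_y = math.log(sig_ty / (1 - sig_ty))                # ~ -1.39

# log-scale sizes relative to the anchor
t_w = math.log(w / anchor_w)                         # ~ 1.079 (rounded in the text)
t_h = math.log(h / anchor_h)                         # ~ 0.528
```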
Step 3.2: decoding process
The decoding algorithm is the reverse of the encoding process: after each neural-network inference is completed, the output is decoded to obtain the real detection box (b_x, b_y, b_w, b_h).
Define the sigmoid function as:
σ(x) = 1/(1 + e^(−x));
decoding yields (b_x, b_y, b_w, b_h):
b_x = (σ(t_x) + C_x)/13, b_y = (σ(t_y) + C_y)/13, b_w = P_w·e^(t_w)/W, b_h = P_h·e^(t_h)/H.
The neural network continually learns the offsets and scalings t_x, t_y, t_w, t_h, and during prediction these 4 offsets are used to obtain b_x, b_y, b_w and b_h; the anchor points of the detection box are predefined, and P_w and P_h are computed in advance.
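A minimal sketch of this decoding step (our function and argument names; anchors and image size in pixels, returning the normalized box of the encoding example):

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h, img_w, img_h, grid=13):
    # Offsets pass through the sigmoid and are added to the cell index;
    # sizes scale the predefined anchors (p_w, p_h) exponentially, then
    # everything is normalized back to the original image size.
    b_x = (sigmoid(t_x) + c_x) / grid
    b_y = (sigmoid(t_y) + c_y) / grid
    b_w = p_w * math.exp(t_w) / img_w
    b_h = p_h * math.exp(t_h) / img_h
    return b_x, b_y, b_w, b_h

# Round-trip of the worked encoding example: recovers (0.25, 0.4, 1/3, 0.4).
t_x = math.log(0.25 / 0.75)          # inverse sigmoid of the 0.25 offset
t_y = math.log(0.2 / 0.8)
box = decode_box(t_x, t_y, math.log(200 / 68), math.log(200 / 118),
                 3, 5, 68, 118, 600, 500)
```

Decoding is the exact inverse of encoding, so feeding the encoded offsets back through recovers the original normalized box up to floating point.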
Step 3.3: implementation of non-maximum suppression
The criterion for selecting a bounding box during training is to take, as the optimal box, the predicted box with the largest IOU against the ground-truth box in the dataset; in single-shot inference, however, no label value from the dataset is available as a reference, and the optimal bounding box must instead be chosen by confidence.
Confidence is one of the important parameters output with each bounding box. It combines the probability P_r(Object) that the current box contains a target, indicating whether the box contains only background or actually contains a predicted object, with, when a target is present, the IOU_truth^pred between the predicted box and the real box of the object, representing how confident the model is that it has framed all features of the target. Confidence is defined as:
C_i^j = P_r(Object) × IOU_truth^pred,
where C_i^j denotes the confidence of the j-th bounding box of the i-th grid cell.
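The confidence definition can be made concrete with a small sketch (function names are ours; boxes are given as (x_min, y_min, x_max, y_max)):

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def confidence(p_object, pred_box, truth_box):
    # C = Pr(Object) * IOU_truth^pred, per the definition above.
    return p_object * iou(pred_box, truth_box)
```

For a box that perfectly frames its target, confidence reduces to P_r(Object); a partially overlapping prediction is penalized by its IOU.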
The procedure for implementing NMS on the MCU for the improved YOLO-Fastest is as follows:
(1) Mark the detection rectangular box F with the highest confidence as a box determined to be retained;
(2) Starting from the maximum-probability rectangular box F, traverse the other rectangular boxes, judging in turn whether their overlap (IOU) with box F is larger than a set threshold; boxes exceeding the threshold are discarded directly;
(3) From the remaining rectangular boxes A, C and E, select the one with the highest probability and mark it as a rectangular box to be retained; judge the overlap of the remaining rectangular boxes in turn, discarding any whose overlap exceeds the set threshold;
(4) And so on, until no unprocessed rectangular box remains, marking all retained boxes.
A threshold is set in the NMS, and detection boxes whose IOU with a retained box is higher than the threshold are filtered out; a threshold of 0.2 is used on the MCU, and the detection boxes retained after predicting a single picture are placed into a box set. All detection boxes in the set are traversed, boxes with confidence greater than 0.2 are drawn according to their coordinates, and the recognition result is finally presented on the LCD screen attached to the MCU.
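Steps (1)–(4) can be sketched as greedy NMS, using the 0.2 IOU threshold mentioned for the MCU (function names are ours):

```python
def iou(a, b):
    # Intersection-over-union of boxes given as (x_min, y_min, x_max, y_max).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_threshold=0.2):
    # Greedy non-maximum suppression over (confidence, box) pairs:
    # keep the most confident box, discard everything overlapping it
    # beyond the threshold, and repeat on what is left.
    remaining = sorted(boxes, key=lambda b: b[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining
                     if iou(best[1], b[1]) <= iou_threshold]
    return kept

# Two heavily overlapping detections of the same object collapse to
# the more confident one; the distant box survives.
boxes = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 10, 10)), (0.7, (20, 20, 30, 30))]
kept = nms(boxes)
```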
Another object of the present invention is to provide an unmanned aerial vehicle lightweight target detection system applying the unmanned aerial vehicle lightweight target detection method, the unmanned aerial vehicle lightweight target detection system comprising:
the backbone network analysis module, used for analyzing the backbone network structure of the YOLO series of networks, and combining the feature pyramid with the Darknet-53 network model to obtain the YOLOv3 backbone network;
the feature pyramid pruning module, used for pruning the feature pyramid part of YOLO-Fastest to obtain a pruned YOLO-Fastest network;
and the target detection module, used for constructing a YOLO decoder, implementing non-maximum suppression and finally presenting the recognition result on an LCD screen attached to the MCU.
It is a further object of the present invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the unmanned aerial vehicle lightweight object detection method.
It is a further object of the present invention to provide a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the unmanned aerial vehicle lightweight target detection method.
The invention further aims at providing an information data processing terminal which is used for realizing the unmanned aerial vehicle lightweight target detection system.
In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:
first, with respect to the technical problems in the prior art and the difficulty of solving them, and in close combination with the claimed technical scheme and the results and data obtained during research and development, the technical problems solved by the technical scheme of the invention are analyzed in detail and in depth, and the creative technical effects brought about after the problems are solved are described. The specific description is as follows:
simulation experiment results show that the original YOLO-Fastest network takes 268.08 s on the host computer to infer 500 pictures, an average of 536.17 ms per picture; the pruned YOLO-Fastest takes 13.89 s for 500 pictures on the host computer, an average of 27.78 ms per picture; and the quantized YOLO-Fastest takes 4.25 s on average for 500 pictures, 8.5 ms per picture. It can be seen that the network pruned by the invention is greatly improved in inference speed, and in particular the quantized model greatly reduces the number of floating-point operations, further accelerating inference.
Secondly, the technical scheme is regarded as a whole or from the perspective of products, and the technical scheme to be protected has the following technical effects and advantages:
the unmanned aerial vehicle lightweight target detection method provided by the invention analyzes the backbone structure of the existing YOLO series of networks, combines the feature pyramid with the Darknet-53 network model to obtain the YOLO backbone network, and prunes the feature pyramid part of YOLO-Fastest to obtain a pruned YOLO-Fastest network. The pruned YOLO-Fastest network is lighter, smaller and faster in detection, and can actually be deployed on an MCU for target detection. During detection on the MCU, NMS is used to solve the problem of repeated detections of the same object: a maximum search is performed locally, suppressing all values in a region other than the maximum, where "locally" refers to the region occupied by one detection box in the image. The detection results clearly show that the pruned YOLO-Fastest network can locate object targets in the target picture.
Thirdly, as inventive supplementary evidence of the claims of the present invention, the following important aspects are also presented:
(1) The expected benefits and commercial values after the technical scheme of the invention is converted are as follows:
using the technical scheme of the invention, a lighter and faster target detection model can be obtained, and the resulting model can be deployed on lower-performance MCUs; compared with the edge computing platforms commonly used on the market at present, the cost can be greatly reduced, bringing considerable economic benefits to enterprises and society.
(2) The technical scheme of the invention overcomes the technical bias:
the technical scheme creatively prunes the YOLO target detection model and deploys it on the MCU platform, overcoming the prevailing technical prejudice that target detection models can only be deployed on high-performance servers or edge computing platforms. The lower performance requirements may enable wider application of target detection technology in various areas of society.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for detecting a lightweight target of an unmanned aerial vehicle according to an embodiment of the present invention;
FIG. 2 is a schematic view of a feature pyramid provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the original YOLO-Fastest feature pyramid structure provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the pruned YOLO-Fastest feature pyramid structure provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of Darknet-53 according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the Convolutional structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the Residual structure provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of the YOLO-Fastest backbone network according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the pruned YOLO-Fastest backbone network according to an embodiment of the present invention;
FIG. 10 is a diagram of the pruned YOLO-Fastest network structure according to an embodiment of the present invention;
fig. 11 is a schematic diagram of NMS algorithm provided in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Aiming at the problems in the prior art, the invention provides a method, a system, a medium, equipment and a terminal for detecting a lightweight target of an unmanned aerial vehicle, and the invention is described in detail below with reference to the accompanying drawings.
1. Explanation of the embodiments. So that those skilled in the art can fully understand how the invention is implemented, this section describes an illustrative embodiment of the claimed technical scheme.
As shown in fig. 1, the method for detecting the lightweight target of the unmanned aerial vehicle provided by the embodiment of the invention comprises the following steps:
s101, analyzing a backbone network structure of a Yolov series network, and combining a feature pyramid with a Darknet-53 network model to obtain a Yolov3 backbone network;
s102, cutting a feature pyramid part of the YOLO-fast to obtain a cut YOLO-fast network;
s103, constructing a YOLO decoder, realizing non-maximum suppression, and finally, presenting a recognition effect on an LCD screen matched with the MCU.
The Darknet-53 network model provided by the embodiment of the invention consists of a Convolutional structure and a Residual structure; the Convolutional structure comprises a common convolution layer, and the activation function used is Leaky ReLU.
In the Residual structure of the YOLO-Fastest backbone network provided by the embodiment of the invention, the Residual module uses a shortcut mechanism to relieve the gradient-vanishing problem caused by simply increasing depth in the neural network; a channel directly connecting input and output is established by identity mapping, so that the network learns the residual value between its input and output; the YOLO-Fastest backbone network uses 5 Convolutional structures in total, with a shortcut between each pair of Convolutional structures.
From the perspective of network pruning, the up-sampling operation in the original network structure is removed, and the branch originally used for detecting and outputting small targets is pruned, so that only the detection capability for large targets is retained; the original five-layer Convolutional structure of the backbone network is reduced to three layers, and the remaining convolution layers are retained.
The construction of the YOLO decoder provided by the embodiment of the invention comprises the following steps:
(1) Encoding process
When the original image size (height × width) of the input image is 500 × 600, the coordinates of the prediction box are (x_min, y_min, x_max, y_max) = (50, 100, 250, 300):
Calculate the center pixel coordinates of the prediction box:
x = (x_min + x_max)/2 = 150, y = (y_min + y_max)/2 = 200;
calculate the width and height of the prediction box:
w = x_max − x_min = 200, h = y_max − y_min = 200;
normalize with respect to the original image size to obtain the box coordinates (b_x, b_y, b_w, b_h):
b_x = x/W = 150/600 = 0.25, b_y = y/H = 200/500 = 0.4, b_w = w/W = 200/600 ≈ 0.333, b_h = h/H = 200/500 = 0.4.
When the picture is divided into 13 × 13 cells, (G_x, G_y, G_w, G_h) denote the four coordinates of the real prediction box mapped onto the feature map, i.e. the mapped center X and Y values and the mapped width and height of the prediction box:
(G_x, G_y) = (b_x, b_y) × 13 = (3.25, 5.2);
rounding down gives the corresponding cell index (C_x, C_y) = (3, 5), and the offsets t_x, t_y of the prediction box coordinates relative to the grid coordinates satisfy:
σ(t_x) = G_x − C_x = 0.25, σ(t_y) = G_y − C_y = 0.2;
taking the inverse sigmoid function gives:
t_x = σ⁻¹(0.25) = ln(0.25/0.75) ≈ −1.10, t_y = σ⁻¹(0.2) = ln(0.2/0.8) ≈ −1.39.
For the width and height relative to the original image, when the anchor = (68, 118):
t_w = ln(w/P_w) = ln(200/68) ≈ 1.07, t_h = ln(h/P_h) = ln(200/118) ≈ 0.528.
The encoded result is finally obtained: (σ(t_x), σ(t_y), t_w, t_h) = (0.25, 0.2, 1.07, 0.528).
(2) Decoding process
The decoding algorithm is the reverse of the encoding process: after each neural-network inference is completed, the output is decoded to obtain the real detection box (b_x, b_y, b_w, b_h).
Define the sigmoid function as:
σ(x) = 1/(1 + e^(−x));
decoding yields (b_x, b_y, b_w, b_h):
b_x = (σ(t_x) + C_x)/13, b_y = (σ(t_y) + C_y)/13, b_w = P_w·e^(t_w)/W, b_h = P_h·e^(t_h)/H.
The neural network continually learns the offsets and scalings t_x, t_y, t_w, t_h, and during prediction these 4 offsets are used to obtain b_x, b_y, b_w and b_h; the anchor points of the detection box are predefined, and P_w and P_h are computed in advance.
The implementation of non-maximum suppression provided by the embodiment of the invention comprises the following steps:
the criterion for choosing the bounding box in training is to choose the largest bounding box of the IOU of the predicted box and the true labeling predicted box in the dataset as the optimal box, but in single reasoning prediction, no tag value in the dataset is used as a reference, and choose the optimal bounding box to refer to other confidence levels.
Confidence is one of the important parameters of each bounding box output, representing the probability P of whether the current box has a target r (Object) to indicate whether there is only a background in which the Object is located or a predicted Object is specifically present in the current box; when the current box is targeted, the predicted box and the possible IOU of the real box of the object truth_pred Value representing self-confidence that the model considers itself to frame all features of the targetConfidence, confidence definition:
Figure BDA0004005443190000113
wherein ,
Figure BDA0004005443190000121
and the j-th binding box confidence of the i-th grid cell is indicated.
The procedure for implementing NMS on the MCU for the improved YOLO-Fastest is as follows:
(1) Mark the detection rectangular box F with the highest confidence as a box determined to be retained;
(2) Starting from the maximum-probability rectangular box F, traverse the other rectangular boxes, judging in turn whether their overlap (IOU) with box F is larger than a set threshold; boxes exceeding the threshold are discarded directly;
(3) From the remaining rectangular boxes A, C and E, select the one with the highest probability and mark it as a rectangular box to be retained; judge the overlap of the remaining rectangular boxes in turn, discarding any whose overlap exceeds the set threshold;
(4) And so on, until no unprocessed rectangular box remains, marking all retained boxes.
A threshold is set in the NMS, and detection boxes whose IOU with a retained box is higher than the threshold are filtered out; a threshold of 0.2 is used on the MCU, and the detection boxes retained after predicting a single picture are placed into a box set. All detection boxes in the set are traversed, boxes with confidence greater than 0.2 are drawn according to their coordinates, and the recognition result is finally presented on the LCD screen attached to the MCU.
The unmanned aerial vehicle lightweight target detection system provided by the embodiment of the invention comprises the following components:
the backbone network analysis module, used for analyzing the backbone network structure of the YOLO series of networks, and combining the feature pyramid with the Darknet-53 network model to obtain the YOLOv3 backbone network;
the feature pyramid pruning module, used for pruning the feature pyramid part of YOLO-Fastest to obtain a pruned YOLO-Fastest network;
and the target detection module, used for constructing a YOLO decoder, implementing non-maximum suppression and finally presenting the recognition result on an LCD screen attached to the MCU.
2. Application example. To demonstrate the inventiveness and technical value of the claimed technical scheme, this section gives an application example on a specific product or related technology.
Using the scheme provided by the invention, a pedestrian detection system running on an MCU was constructed: a pruned YOLO-Fastest network is deployed in the system, a real-time YOLO decoder is constructed, non-maximum suppression is implemented, and pedestrian images detected by the camera can be accurately recognized.
3. Evidence of the effect of the embodiments. The embodiments of the invention show clear advantages during research, development and use, as described below in combination with data and charts from the testing process.
Examples: pedestrian detection network construction based on clipping YOLO-fast
The embodiment of the invention provides a method for carrying out light clipping and quantization on a YOLO-fast model.
1. Multi-scale characteristic network structure clipping
In target detection, one image may contain multiple objects of different sizes, so a target detection model must be able to detect objects at different scales. In a practical convolutional neural network, the features detected by convolutional layers at different depths differ: the feature maps output by shallow layers have undergone fewer convolution operations and retain more small-scale detail such as object color, position, and edges, and this information is low-level and concrete; as the network deepens, the output feature maps pass through more convolution layers, contain image information over a wider field of view, and the extracted information becomes more abstract.
FPN (Feature Pyramid Network): the network structure of the feature pyramid is shown in fig. 2. The feature pyramid exploits both the high-resolution detail of low-level features and the rich semantics of high-level features, achieving its prediction effect by fusing features from different levels. Unlike conventional feature-fusion approaches, prediction is performed separately on each fused feature level. FPN upsamples the deep information and adds it element-wise to the shallow information, constructing a pyramid of feature maps at different scales; this structure adapts to targets of different sizes, performs excellently, and has become a standard component of target detection algorithms nowadays.
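The upsample-then-add fusion that FPN performs can be sketched with a minimal pure-Python example. The 2× nearest-neighbour upsampling and the tiny feature maps below are illustrative stand-ins for the real tensor operations, not the patent's implementation:

```python
def upsample2x(fm):
    # Nearest-neighbour 2x upsampling of a 2-D feature map (list of lists):
    # each value is repeated horizontally and each row vertically.
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fpn_fuse(deep, shallow):
    # FPN lateral fusion: upsample the deep (low-resolution) map and add it
    # element-wise to the shallow (high-resolution) map of matching size.
    up = upsample2x(deep)
    return [[u + s for u, s in zip(ur, sr)] for ur, sr in zip(up, shallow)]

deep = [[1, 2],
        [3, 4]]                        # 2x2 deep feature map (toy values)
shallow = [[1] * 4 for _ in range(4)]  # 4x4 shallow feature map (toy values)
fused = fpn_fuse(deep, shallow)
```

In a real FPN the deep map would first pass through a 1×1 convolution to match channel counts; that step is omitted here to keep the fusion itself visible.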
In the invention, the feature pyramid part of YOLO-Fastest is cropped. Because the pedestrians appearing in the fixed monitored area of the IoT device are essentially uniform in size, the feature pyramid is cropped as shown in fig. 3: only the C1 level, which detects large-resolution targets well, is retained, while the small-target feature extraction capability of the deeper C2, C3, C4 and C5 levels is removed. The cropped YOLO-Fastest feature pyramid structure, retaining only the C1 level, is shown in fig. 4.
Because the backbone network of YOLO-Fastest is relatively simple, the invention takes YOLOv3 as an example to analyze the backbone structure of the YOLO-series networks. Combining the feature pyramid with the Darknet-53 network model yields the backbone network of YOLOv3. Throughout the YOLO series there are only convolutional layers and no pooling layers, and the size of the output feature map is controlled by adjusting the convolution stride, so there is no particular restriction on the input image size. Further cropping of the backbone network is then required.
Taking a 416×416 input image to YOLOv3 as an example, the feature extraction process in the YOLOv3 Darknet-53 network is shown in fig. 5.
The main framework of Darknet-53, shown in fig. 6, is composed primarily of Convolutional and Residual structures. The Convolutional structure consists of an ordinary convolution layer with batch normalization, and the activation function mainly used is Leaky ReLU; the Convolutional structure used in the YOLO-Fastest backbone network is shown in fig. 6.
The residual structure used in the YOLO-Fastest backbone network is shown in fig. 7. Its most notable feature is the shortcut mechanism (analogous to a short circuit in an electrical circuit), which alleviates the vanishing-gradient problem caused by simply increasing the depth of a neural network and makes the network easier to optimize. A direct connection between input and output is established through identity mapping, so that the network can learn the residual between its input and output.
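The shortcut described above reduces, in the simplest case, to output = x + F(x). In the sketch below the convolutional branch F is replaced by a hypothetical toy transform (its weight and bias are invented for illustration), so only the identity-mapping shortcut itself is demonstrated:

```python
def conv_branch(x):
    # Hypothetical stand-in for the convolutional transform F(x);
    # the weight 2 and bias 1 are illustrative only.
    return [2 * v + 1 for v in x]

def residual_block(x):
    # Identity-mapping shortcut: the output is the input plus the branch
    # output, so the branch only has to learn the residual between them.
    return [xi + fi for xi, fi in zip(x, conv_branch(x))]
```

Because gradients flow through the identity path unchanged, stacking such blocks does not suffer from the vanishing-gradient problem that plain deep stacks do.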
In the backbone network of YOLO-Fastest, a total of 5 Convolutional structures are used, with a shortcut between each pair of Convolutional structures; the overall connection relationship is shown in fig. 8.
From the perspective of network cropping, if the feature pyramid retains only large-object detection capability when extracting features, the backbone network can be cropped accordingly. Better accuracy could certainly be obtained without cropping, but lightweighting demands trade-offs. The upsampling operations in the original network structure are therefore removed, and the branches originally used for small-target detection output are cropped, retaining only large-target detection capability. The backbone's original five-layer Convolutional structure is reduced to three layers: convolution operations consume a large share of an embedded microprocessor's computational budget, so the number of consecutive convolution layers must be compressed as much as possible. The remaining convolution layers are shown in fig. 9.
Finally, the structure of the cropped YOLO-Fastest network is shown in figure 10.
2. Pedestrian detection model deployment based on the cropped YOLO-Fastest
2.1 YOLO decoder implementation
The YOLO decoding operation (YOLO decoding) maps the predicted values of the neural network onto prediction boxes in the actual image, i.e., it determines how a detection box is drawn around a target object from the network's output values. After the YOLOv3 network structure extracts out0, out1 and out2, each grid point at each scale carries prior (anchor) boxes; the training process adjusts the parameters of the prior boxes to obtain predicted boxes, and the predicted boxes at each scale are restored to the original input image together with the prediction results (box position, class probability, and confidence score). This process is called decoding.
2.1.1 Encoding process
Although YOLO encoding is not required in the model deployment phase, understanding the principle of YOLO encoding is necessary to design the corresponding decoding algorithm; moreover, for a custom dataset, the corresponding label values must be computed by the encoding algorithm.
Assume that the size (height × width) of the input original image is (500 × 600) and the coordinate values of the prediction box are (x_min, y_min, x_max, y_max) = (50, 100, 250, 300):
Calculate the center pixel coordinates of the prediction box:

x_c = (x_min + x_max)/2 = (50 + 250)/2 = 150, y_c = (y_min + y_max)/2 = (100 + 300)/2 = 200 (1)
Calculate the width and height of the prediction box:

w = x_max - x_min = 250 - 50 = 200, h = y_max - y_min = 300 - 100 = 200 (2)
Normalize with respect to the original image size to obtain the box coordinates (b_x, b_y, b_w, b_h):

b_x = x_c/W = 150/600 = 0.25, b_y = y_c/H = 200/500 = 0.4, b_w = w/W = 200/600 ≈ 0.333, b_h = h/H = 200/500 = 0.4 (3)
Assume the picture is partitioned into 13×13 grid cells, where (G_x, G_y, G_w, G_h) are the four coordinates of the ground-truth prediction box mapped onto the feature map, representing the mapped center X and Y values and the mapped width and height of the prediction box. That is:

G_x = b_x × 13 = 3.25, G_y = b_y × 13 = 5.2 (4)
The corresponding cell index C_x, C_y = (3, 5) is then obtained by rounding down. The offsets of the predicted box center relative to the grid cell are:

σ(t_x) = G_x - C_x = 3.25 - 3 = 0.25, σ(t_y) = G_y - C_y = 5.2 - 5 = 0.2 (5)
Applying the inverse sigmoid function gives:

t_x = σ^(-1)(0.25) = ln(0.25/0.75), t_y = σ^(-1)(0.2) = ln(0.2/0.8) (6)
For the width and height of the box relative to the anchor, assume anchor = (P_w, P_h) = (68, 118):

t_w = ln(w/P_w) = ln(200/68) ≈ 1.07, t_h = ln(h/P_h) = ln(200/118) ≈ 0.528 (7)
The encoded result, with the center offsets reported in sigmoid space, is finally obtained: (σ(t_x), σ(t_y), t_w, t_h) = (0.25, 0.2, 1.07, 0.528).
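The worked example above can be reproduced with a short sketch (Python is chosen here purely for illustration; truncating G_x, G_y to obtain the cell index follows the text):

```python
import math

def yolo_encode(box, img_w, img_h, grid, anchor):
    # box = (x_min, y_min, x_max, y_max) in pixels on the original image.
    x_min, y_min, x_max, y_max = box
    xc, yc = (x_min + x_max) / 2, (y_min + y_max) / 2   # eq. (1): box center
    w, h = x_max - x_min, y_max - y_min                 # eq. (2): box size
    bx, by = xc / img_w, yc / img_h                     # eq. (3): normalize
    gx, gy = bx * grid, by * grid                       # eq. (4): grid units
    cx, cy = int(gx), int(gy)                           # cell index (round down)
    sx, sy = gx - cx, gy - cy                           # eq. (5): sigmoid(t_x), sigmoid(t_y)
    tw = math.log(w / anchor[0])                        # eq. (7): log-scale w.r.t. anchor
    th = math.log(h / anchor[1])
    return (cx, cy), (sx, sy, tw, th)

cell, enc = yolo_encode((50, 100, 250, 300), 600, 500, 13, (68, 118))
# cell == (3, 5); enc matches the text's (0.25, 0.2, 1.07, 0.528) up to rounding
```

On a custom dataset, a routine like this would produce the label values that eq. (6) then maps through the inverse sigmoid for training.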
2.1.2 Decoding process
The design of the decoding algorithm is the reverse of the encoding process; a decoding operation is performed on the output after each neural network inference to obtain the real detection box (b_x, b_y, b_w, b_h).
Define the sigmoid function as:

σ(x) = 1/(1 + e^(-x)) (8)
Decoding yields (b_x, b_y, b_w, b_h):

b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y, b_w = P_w·e^(t_w), b_h = P_h·e^(t_h) (9)
The neural network continuously learns the offsets and scalings t_x, t_y, t_w, t_h, and these four values are used at prediction time to compute b_x, b_y, b_w, b_h. Why, then, is the sigmoid still needed for t_x and t_y? In YOLO, t_x is not obtained by dividing G_x − C_x by P_w but directly as G_x − C_x, which creates a problem: t_x can be relatively large and may exceed 1. Because there is no division by P_w to normalize the scale, an offset greater than 1 would place the predicted center in another grid cell rather than the cell beside it, causing a contradiction, so normalization with the sigmoid is necessary. As for why anchors are used: predicting the width and height of the bounding box directly from such parameters would lead to unstable gradients during training, so the anchor points of the detection box are predefined, i.e., P_w and P_h are computed in advance.
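A decoder inverting the encoding above, following eqs. (8)–(9), can be sketched as follows (Python for illustration; the conversion from grid units back to pixel corner coordinates is spelled out in the comments and is an assumption about the output convention):

```python
import math

def sigmoid(x):                        # eq. (8)
    return 1.0 / (1.0 + math.exp(-x))

def yolo_decode(t, cell, anchor, img_w, img_h, grid=13):
    tx, ty, tw, th = t
    cx, cy = cell
    bx = (sigmoid(tx) + cx) / grid     # eq. (9): normalized center x
    by = (sigmoid(ty) + cy) / grid
    bw = anchor[0] * math.exp(tw)      # box width in pixels
    bh = anchor[1] * math.exp(th)      # box height in pixels
    # Convert the normalized center back to pixel corner coordinates.
    xc, yc = bx * img_w, by * img_h
    return (xc - bw / 2, yc - bh / 2, xc + bw / 2, yc + bh / 2)

# Round-trip check against the encoding example in the text:
t = (math.log(0.25 / 0.75), math.log(0.2 / 0.8),
     math.log(200 / 68), math.log(200 / 118))
box = yolo_decode(t, (3, 5), (68, 118), 600, 500)
# box is approximately (50, 100, 250, 300)
```

Decoding the encoded example recovers the original prediction box, which is the property the deployment-side decoder relies on.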
2.2 Non-maximum suppression (NMS) implementation
NMS solves the problem of the same object being detected multiple times in an image. Values other than the maximum within a region — here, the region occupied by a detection box — are suppressed by a local maximum search. When selecting the highest-scoring detection box, for example in pedestrian detection, a sliding window is used for feature extraction and a classifier assigns each detection box a score. The sliding window, however, produces many boxes that largely overlap other boxes, so NMS is needed to keep the highest-scoring window in each region and remove the boxes that heavily overlap it.
During training, the criterion for selecting a bounding box is to choose the predicted box with the largest IOU against the ground-truth box annotated in the dataset as the optimal box; in single-inference prediction, however, there is no dataset label value to use as a reference, and another parameter — the confidence — must be consulted when selecting the optimal bounding box.
Confidence is one of the important parameters output with each bounding box, and its definition carries two meanings. The first is the probability P_r(Object) that the current box contains a target, i.e., whether the box contains only background or actually contains a predicted object. The second is that, when the current box does contain a target, the value IOU_truth^pred between the predicted box and the object's ground-truth box expresses how certain the model is that it has framed all the features of the target. The confidence is therefore defined as:

C_i^j = P_r(Object) × IOU_truth^pred (10)

where C_i^j denotes the confidence of the j-th bounding box of the i-th grid cell.
As shown in fig. 11, the steps of the NMS algorithm implemented on the MCU for the modified YOLO-Fastest are:
1) Mark the detection rectangle F with the highest confidence as a rectangle confirmed to be kept;
2) Starting from the highest-probability rectangle F, traverse the other rectangles and judge in turn whether their overlap IOU with F (the overlap ratio of two rectangles) exceeds a set threshold; if the IOU exceeds the threshold, discard the rectangle directly.
3) From the remaining rectangles A, C, E, select the one with the highest probability and mark it as kept; then judge the overlap of the remaining rectangles in turn, discarding any whose overlap exceeds the set threshold.
4) Repeat in this way until no rectangles remain, marking all kept rectangles.
5) A threshold must therefore be set in the NMS algorithm to filter out detection boxes whose IOU exceeds it. The invention uses a threshold of 0.2 on the MCU; the detection boxes retained after a single-image prediction are placed into a box set, all boxes in the set are traversed, those with confidence greater than 0.2 are drawn on the image according to their coordinates, and the recognition result is finally presented on the LCD screen attached to the MCU.
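Steps 1)–5) above can be sketched as a greedy loop (Python for illustration; the 0.2 IOU threshold follows the text, while the boxes and scores below are invented test data, not values from the patent):

```python
def iou(a, b):
    # Boxes as (x_min, y_min, x_max, y_max); returns intersection-over-union.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.2):
    # Greedy NMS: repeatedly keep the highest-scoring box and discard every
    # remaining box whose IOU with it exceeds the threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # the second box overlaps the first and is dropped
```

The kept indices would then be the "box set" whose entries are drawn on the LCD according to their coordinates.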
3. Experimental verification and result analysis
The invention mainly verifies the effect of the improvement experimentally and analyzes the performance of the improved algorithm by comparison with other algorithms.
3.1 Experimental procedure and result analysis
The self-made traffic-scene dataset mainly combines pictures taken on site with traffic pictures from the Internet. Because the pictures taken on roads of the Xidian University campus have no annotation files, labels were generated by annotating these pictures with a labeling tool.
On the host computer, the original YOLO-Fastest network takes 268.08 s to infer 500 pictures, an average of 536.17 ms per picture; the cropped YOLO-Fastest takes 13.89 s for 500 pictures, an average of 27.78 ms per picture; and the quantized YOLO-Fastest takes 4.25 s for 500 pictures, an average of 8.5 ms per picture. The cropped network thus achieves a large gain in inference speed, and the quantized model in particular greatly reduces the number of floating-point operations, further accelerating inference.
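The per-picture averages and speedup factors follow directly from the quoted totals; a quick check using only the numbers reported above:

```python
totals_s = {"original": 268.08, "cropped": 13.89, "quantized": 4.25}
n_images = 500

# Average inference time per picture, in milliseconds.
per_image_ms = {k: v / n_images * 1000 for k, v in totals_s.items()}

# Speedup of each variant relative to the original network.
speedup = {k: totals_s["original"] / v for k, v in totals_s.items()}
# per_image_ms: ~536 ms, ~27.8 ms, ~8.5 ms; cropping gives roughly a 19x
# speedup and quantization roughly 63x over the original.
```

These derived factors are arithmetic on the reported totals, not additional measurements.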
After the comparison of host-computer run results has reflected the changes in model accuracy and inference speed in the data, the multi-target and single-target detection images on the test set are displayed next.
From the detection results it is evident that the cropped YOLO-Fastest network can detect the positions of object targets in the test images. Inference with the cropped YOLO-Fastest was then carried out on different embedded microprocessor platforms.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims (10)

1. An unmanned aerial vehicle lightweight target detection method, characterized by comprising the following steps: analyzing the backbone network structure of the YOLO-series networks, and combining the feature pyramid with the Darknet-53 network model to obtain the backbone network of YOLOv3; cropping the feature pyramid part of YOLO-Fastest to obtain a cropped YOLO-Fastest network; and constructing a YOLO decoder, implementing non-maximum suppression, and finally presenting the recognition result on an LCD screen attached to the MCU.
2. The unmanned aerial vehicle lightweight target detection method of claim 1, wherein the Darknet-53 network model consists of Convolutional and Residual structures; the Convolutional structure comprises an ordinary convolution layer, and the activation function used is Leaky ReLU.
3. The unmanned aerial vehicle lightweight target detection method of claim 1, wherein in the Residual structure of the YOLO-Fastest backbone network, the residual module uses a shortcut mechanism to alleviate the vanishing-gradient problem caused by simply increasing the depth of a neural network; a channel directly connecting the input and the output is established by identity mapping, so that the network learns the residual between its input and output; the backbone network of YOLO-Fastest uses 5 Convolutional structures in total, with a shortcut between each pair of Convolutional structures;
from the perspective of network cropping, the upsampling operations in the original network structure are removed and the branches originally used for small-target detection output are cropped, retaining only large-target detection capability; the backbone's original five-layer Convolutional structure is modified to three layers, and the remaining convolution layers are kept.
4. The unmanned aerial vehicle lightweight target detection method of claim 1, wherein the constructing of the YOLO decoder comprises:
(1) Encoding process
When the original image size (height × width) of the input image is 500 × 600, the coordinate values of the prediction box are (x_min, y_min, x_max, y_max) = (50, 100, 250, 300):
calculating the center pixel coordinates of the prediction box: x_c = (x_min + x_max)/2 = 150, y_c = (y_min + y_max)/2 = 200;
calculating the width and height of the prediction box: w = x_max - x_min = 200, h = y_max - y_min = 200;
normalizing with respect to the original image size to obtain the box coordinates (b_x, b_y, b_w, b_h): b_x = 150/600 = 0.25, b_y = 200/500 = 0.4, b_w = 200/600 ≈ 0.333, b_h = 200/500 = 0.4;
when the picture is divided into 13×13 cells, where (G_x, G_y, G_w, G_h) are the four coordinates of the ground-truth prediction box mapped onto the feature map, representing the mapped center X and Y values and the mapped width and height of the prediction box:
G_x = b_x × 13 = 3.25, G_y = b_y × 13 = 5.2;
obtaining the corresponding cell index C_x, C_y = (3, 5) by rounding down, and obtaining the offsets of the predicted box center relative to the grid cell: σ(t_x) = G_x - C_x = 0.25, σ(t_y) = G_y - C_y = 0.2;
applying the inverse sigmoid function to obtain: t_x = σ^(-1)(0.25), t_y = σ^(-1)(0.2);
for the width and height of the box relative to the anchor, when anchor = (P_w, P_h) = (68, 118): t_w = ln(200/68) ≈ 1.07, t_h = ln(200/118) ≈ 0.528;
finally obtaining the encoded result, with the center offsets reported in sigmoid space: (σ(t_x), σ(t_y), t_w, t_h) = (0.25, 0.2, 1.07, 0.528);
(2) Decoding process
the design of the decoding algorithm is the reverse of the encoding process; a decoding operation is performed on the output after each neural network inference to obtain the real detection box (b_x, b_y, b_w, b_h);
Define the sigmoid function as: σ(x) = 1/(1 + e^(-x));
decoding to obtain (b_x, b_y, b_w, b_h): b_x = σ(t_x) + C_x, b_y = σ(t_y) + C_y, b_w = P_w·e^(t_w), b_h = P_h·e^(t_h);
the neural network continuously learns the offsets and scalings t_x, t_y, t_w, t_h, and these four offsets are used during prediction to obtain b_x, b_y, b_w, b_h; the anchor points of the detection box are predefined, i.e., P_w and P_h are computed in advance.
5. The unmanned aerial vehicle lightweight target detection method of claim 1, wherein the implementation of non-maximum suppression comprises:
the criterion for selecting a bounding box in training is to choose the predicted box with the largest IOU against the ground-truth box in the dataset as the optimal box, but in single-inference prediction there is no dataset label value as a reference, and the confidence must be consulted when selecting the optimal bounding box;
confidence is one of the important parameters output with each bounding box, representing the probability P_r(Object) of whether the current box contains a target, i.e., whether the box contains only background or actually contains a predicted object; when the current box contains a target, the value IOU_truth^pred between the predicted box and the object's ground-truth box expresses how certain the model is that it has framed all the features of the target, the confidence being defined as:
C_i^j = P_r(Object) × IOU_truth^pred
where C_i^j denotes the confidence of the j-th bounding box of the i-th grid cell;
the procedure for implementing NMS on the MCU for the improved YOLO-Fastest is as follows:
(1) marking the detection rectangle F with the highest confidence as a rectangle confirmed to be kept;
(2) starting from the highest-probability rectangle F, traversing the other rectangles and judging in turn whether their overlap IOU with F exceeds a set threshold, and discarding a rectangle directly if its IOU exceeds the threshold;
(3) selecting the rectangle with the highest probability from the remaining rectangles A, C, E and marking it as a rectangle to be kept; then judging the overlap of the remaining rectangles in turn and discarding any whose overlap exceeds the set threshold;
(4) repeating in this way until no rectangles remain, and marking the kept rectangles;
setting a threshold in the NMS algorithm to filter out detection boxes whose IOU exceeds it, a threshold of 0.2 being used on the MCU, and placing the detection boxes retained after a single-image prediction into a box set; traversing all detection boxes in the set, drawing those with confidence greater than 0.2 according to their coordinates, and finally presenting the recognition result on the LCD screen attached to the MCU.
6. The unmanned aerial vehicle lightweight target detection method of claim 1, wherein the unmanned aerial vehicle lightweight target detection method comprises the steps of:
step one, cropping the multi-scale feature network structure;
and step two, deploying the pedestrian detection model based on the cropped YOLO-Fastest.
7. An unmanned aerial vehicle lightweight target detection system applying the unmanned aerial vehicle lightweight target detection method according to any one of claims 1 to 6, wherein the unmanned aerial vehicle lightweight target detection system comprises:
the backbone network analysis module, used for analyzing the backbone network structure of the YOLO-series networks and combining the feature pyramid with the Darknet-53 network model to obtain the backbone network of YOLOv3;
the feature pyramid cropping module, used for cropping the feature pyramid part of YOLO-Fastest to obtain the cropped YOLO-Fastest network;
and the target detection module, used for constructing a YOLO decoder, implementing non-maximum suppression, and finally presenting the recognition result on an LCD screen attached to the MCU.
8. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the unmanned aerial vehicle lightweight target detection method of any of claims 1 to 6.
9. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the unmanned aerial vehicle lightweight target detection method of any of claims 1-6.
10. An information data processing terminal, wherein the information data processing terminal is configured to implement the unmanned aerial vehicle lightweight target detection system according to claim 7.
CN202211630194.4A 2022-12-19 2022-12-19 Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal Pending CN116030364A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211630194.4A CN116030364A (en) 2022-12-19 2022-12-19 Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211630194.4A CN116030364A (en) 2022-12-19 2022-12-19 Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal

Publications (1)

Publication Number Publication Date
CN116030364A true CN116030364A (en) 2023-04-28

Family

ID=86078612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211630194.4A Pending CN116030364A (en) 2022-12-19 2022-12-19 Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal

Country Status (1)

Country Link
CN (1) CN116030364A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117670882A (en) * 2024-01-31 2024-03-08 国网江西省电力有限公司电力科学研究院 Unmanned aerial vehicle infrared automatic focusing method and system for porcelain insulator string
CN117670882B (en) * 2024-01-31 2024-06-04 国网江西省电力有限公司电力科学研究院 Unmanned aerial vehicle infrared automatic focusing method and system for porcelain insulator string
CN118171049A (en) * 2024-05-13 2024-06-11 西南交通大学 Big data-based battery management method and system for edge calculation

Similar Documents

Publication Publication Date Title
Wang et al. Data-driven based tiny-YOLOv3 method for front vehicle detection inducing SPP-net
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
Lin et al. A license plate recognition system for severe tilt angles using mask R-CNN
CN109583345B (en) Road recognition method, device, computer device and computer readable storage medium
CN116030364A (en) Unmanned aerial vehicle lightweight target detection method, system, medium, equipment and terminal
CN113723377B (en) Traffic sign detection method based on LD-SSD network
Xie et al. A binocular vision application in IoT: Realtime trustworthy road condition detection system in passable area
CN109886159B (en) Face detection method under non-limited condition
Xiang et al. Lightweight fully convolutional network for license plate detection
CN111767854B (en) SLAM loop detection method combined with scene text semantic information
CN116503709A (en) Vehicle detection method based on improved YOLOv5 in haze weather
CN113011338A (en) Lane line detection method and system
CN114943888B (en) Sea surface small target detection method based on multi-scale information fusion
CN114332921A (en) Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network
Li et al. Vehicle detection in uav traffic video based on convolution neural network
CN112634289B (en) Rapid feasible domain segmentation method based on asymmetric void convolution
CN113177956B (en) Semantic segmentation method for unmanned aerial vehicle remote sensing image
CN114596548A (en) Target detection method, target detection device, computer equipment and computer-readable storage medium
CN117975003A (en) Scene segmentation method and system based on lightweight network
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN112347967A (en) Pedestrian detection method fusing motion information in complex scene
CN116342877A (en) Semantic segmentation method based on improved ASPP and fusion module in complex scene
CN112232162B (en) Pedestrian detection method and device based on multi-feature fusion cascade classifier
CN114494302A (en) Image processing method, device, equipment and storage medium
Liu et al. L2-LiteSeg: A Real-Time Semantic Segmentation Method for End-to-End Autonomous Driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination