CN111553406A - Target detection system, method and terminal based on improved YOLO-V3 - Google Patents


Info

Publication number
CN111553406A
CN111553406A (application CN202010333517.8A)
Authority
CN
China
Prior art keywords
image
module
darknet
target detection
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010333517.8A
Other languages
Chinese (zh)
Other versions
CN111553406B (en)
Inventor
田鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Kaike Intelligent Technology Co ltd
Original Assignee
Shanghai Kaike Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Kaike Intelligent Technology Co ltd filed Critical Shanghai Kaike Intelligent Technology Co ltd
Priority to CN202010333517.8A priority Critical patent/CN111553406B/en
Publication of CN111553406A publication Critical patent/CN111553406A/en
Application granted granted Critical
Publication of CN111553406B publication Critical patent/CN111553406B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection system based on improved YOLO-V3, comprising an image acquisition module, an image preprocessing module, a darknet-39 backbone network module, a multi-scale convolutional layer feature combination module, a weighted feature fusion module and a prediction module. The darknet-39 backbone network module extracts image features with a darknet-39 backbone network to obtain feature maps from 5 convolutional layers at different scales; the multi-scale convolutional layer feature combination module optimally combines the 5 feature maps into a combined feature map; the weighted feature fusion module performs weighted feature fusion on the combined feature map; and the prediction module performs regression prediction on the fused feature map with the YOLO-V3 algorithm to obtain the target detection result. The system has a small network model, which speeds up target detection, strengthens the network feature fusion effect and achieves better detection results.

Description

Target detection system, method and terminal based on improved YOLO-V3
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection system, method and terminal based on improved YOLO-V3.
Background
YOLO (You Only Look Once)-V3 is currently a popular object detection algorithm, both fast and stable, but its backbone adopts the Darknet-53 network structure, whose computation cost is 65.86 BFLOPs (Billion Floating-Point Operations). The large model slows the algorithm considerably on embedded devices, so real-time detection cannot be achieved. With an input size of 416 × 416, the smallest feature map from which YOLO-V3 extracts features is 13 × 13, which is still too large, so YOLO-V3 detects medium- and large-sized objects poorly. YOLO-V3 predicts targets of different sizes from multi-scale feature maps taken from different layers and fuses high- and low-level feature information, which improves detection precision to a certain extent, but it ignores the fact that feature maps from different layers contribute differently, so the feature fusion effect is poor.
Disclosure of Invention
Aiming at the above defects in the prior art, the target detection system, method, terminal and medium based on improved YOLO-V3 provided by the embodiments of the invention detect targets quickly, improve the detection of medium- and large-sized objects, improve the effect of fusing feature maps from different layers in YOLO-V3, and raise the mAP of the detected objects.
In a first aspect, an embodiment of the present invention provides a target detection system based on YOLO-V3, including: an image acquisition module, an image preprocessing module, a darknet-39 backbone network module, a multi-scale convolutional layer feature combination module, a weighted feature fusion module and a prediction module,
the image acquisition module is used for acquiring an image to be identified;
the image preprocessing module is used for preprocessing an image to be identified to obtain a preprocessed image;
the darknet-39 backbone network module obtains a darknet-39 backbone network model by improving the darknet-53 backbone network, and uses the darknet-39 backbone network model to extract image features, obtaining feature maps of convolutional layers at 5 different scales;
the multi-scale convolutional layer feature combination module is used for optimally combining feature maps of 5 convolutional layers with different scales to obtain a combined feature map;
the weighted feature fusion module is used for carrying out weighted feature fusion on the combined feature map;
and the prediction module is used for performing regression prediction on the fused feature map by adopting a YOLO-V3 algorithm to obtain a target detection result.
In a second aspect, an embodiment of the present invention provides a target detection method based on improved YOLO-V3, including:
acquiring an image to be identified;
preprocessing an image to be recognized to obtain a preprocessed image;
extracting image features with the trained darknet-39 backbone network model to obtain feature maps of convolutional layers at 5 different scales;
optimally combining the feature maps of the convolutional layers with different scales to obtain a combined feature map;
performing weighted feature fusion on the combined feature map;
and performing regression prediction on the fused feature map by using a YOLO-V3 algorithm to obtain a target detection result.
In a third aspect, an intelligent terminal provided in an embodiment of the present invention includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method steps described in the foregoing embodiment.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the method steps described in the above embodiments.
The invention has the beneficial effects that:
the embodiment of the invention provides a target detection system, a method, a terminal and a medium based on improved YOLO-V3, which adopt a darknet-39 backbone network to extract features, reduce the size of a model and accelerate the target detection speed, adopt 5 convolutional layers with different scales to extract feature maps, fully fuse shallow layer features and deep layer feature information, improve the detection effect of objects with medium or large sizes, carry out combined weighted feature fusion on the feature maps of different convolutional layers according to different contribution degrees of the feature maps of the different convolutional layers, enhance the network feature fusion effect and realize better detection results.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a block diagram of a target detection system based on improved YOLO-V3 according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a target detection method based on improved YOLO-V3 according to a second embodiment of the present invention;
fig. 3 shows a block diagram of an intelligent terminal according to a third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
As shown in fig. 1, a block diagram of a target detection system based on improved YOLO-V3 according to a first embodiment of the present invention, the system includes: an image acquisition module 101, an image preprocessing module 102, a darknet-39 backbone network module 103, a multi-scale convolutional layer feature combination module 104, a weighted feature fusion module 105 and a prediction module 106. The image acquisition module 101 acquires an image to be identified; the image preprocessing module 102 preprocesses the image to be recognized to obtain a preprocessed image; the darknet-39 backbone network module 103 obtains a darknet-39 backbone network model by improving the darknet-53 backbone network and uses it to extract image features, producing feature maps of convolutional layers at 5 different scales; the multi-scale convolutional layer feature combination module 104 optimally combines the 5 feature maps into a combined feature map, where the optimal combination differs by layer: the front and rear layers are combined in pairs, and the middle layers are combined in threes; the weighted feature fusion module 105 performs weighted feature fusion on the combined feature map; the prediction module 106 performs regression prediction on the fused feature map with the YOLO-V3 algorithm to obtain the target detection result, as sketched below.
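For orientation, the module chain just described can be read as a simple composition of stages. The following is a minimal PyTorch-style sketch; the class name ImprovedYoloV3 and all interfaces are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch only: module names and interfaces are assumptions.
# It shows how the modules 103-106 could be chained after preprocessing.
import torch
import torch.nn as nn

class ImprovedYoloV3(nn.Module):
    def __init__(self, backbone, combiner, fusion, head):
        super().__init__()
        self.backbone = backbone  # darknet-39: image -> 5 multi-scale feature maps
        self.combiner = combiner  # optimal combination of the 5 feature maps
        self.fusion = fusion      # weighted feature fusion
        self.head = head          # YOLO-V3 regression prediction

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)     # list of 5 feature maps
        combined = self.combiner(feats)  # combined feature maps
        fused = self.fusion(combined)    # weighted fusion
        return self.head(fused)          # target detection result
```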
The image preprocessing module 102 comprises an image rotation unit and a scaling unit. The image rotation unit randomly flips, rotates and crops the image to be identified; the scaling unit applies a scale transformation to the image to be identified.
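As a concrete illustration of these two units, the sketch below uses torchvision transforms. The rotation angle, crop scale and the helper name build_preprocess are assumptions; for detection training, bounding-box coordinates would also have to be transformed alongside the image.

```python
# A minimal preprocessing sketch, assuming torchvision; parameter values are
# illustrative, not specified by the patent.
from torchvision import transforms

def build_preprocess(input_size: int = 448):
    return transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flip
        transforms.RandomVerticalFlip(p=0.5),     # random vertical flip
        transforms.RandomRotation(degrees=10),    # random rotation (angle assumed)
        transforms.RandomResizedCrop(input_size, scale=(0.8, 1.0)),  # crop + scale
        transforms.ToTensor(),
    ])
```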
The darknet-39 backbone network module 103 prunes the channels of the darknet-53 network, reducing the number of model parameters while still fully extracting picture features and improving operating efficiency: compared with the original YOLO-V3 algorithm, the improved version reduces computation by 80% and runs 4 times faster. The structure of the darknet-39 backbone network in the darknet-39 backbone network module is shown in Table 1.
[Table 1: structure of the darknet-39 backbone network — the original table image is not reproduced here.]
The darknet-39 backbone network module comprises a darknet-39 backbone network training unit, which adds 2 convolutional layers to the backbone network of the traditional YOLO-V3 algorithm and performs target detection with feature maps from 5 convolutional layers at different scales. It acquires a data set, divides it into a training set, a test set and a verification set, re-clusters the bounding-box coordinates on the training set with the k-means clustering algorithm, and calculates the coordinates of 15 bounding boxes for the feature maps of the convolutional layers at 5 different scales.
The darknet-39 backbone network module reasonably prunes the darknet-53 backbone network, optimizes the network structure and removes some redundant convolution operations to obtain the darknet-39 backbone network. Specifically, the number of channels of the Level 5 layer is halved and the Level 5 layer is taken as a feature output layer with stride 4, which helps improve the detection rate of small target objects. The Level 4, Level 3 and Level 2 layers also have their channel counts halved, which likewise halves their operation counts; their strides are 8, 16 and 32 respectively. Finally, a 3 × 3 convolutional layer with stride 64 is added, strengthening feature extraction while adding almost no parameters. The resulting darknet-39 network cannot directly load the weight parameters of the original darknet-53 and must be retrained. This example performs classification training on the ImageNet LSVRC 2012 data set for 90 epochs, with an initial learning rate of 1e-03 reduced tenfold at steps 170000 and 350000, a batch_size of 128 and a weight decay coefficient of 5e-04.
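Since Table 1 is not reproduced above, the exact layer configuration cannot be restated here, but the basic building block described — darknet-style convolutions with halved channel counts — might look like the following PyTorch sketch. All names, the LeakyReLU slope and the block layout are assumptions for illustration only.

```python
# Sketch of a darknet-style residual block with halved channels (assumption).
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, k, s=1):
    """Conv + BatchNorm + LeakyReLU, the standard darknet unit."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Residual(nn.Module):
    """1x1 reduce + 3x3 expand, as in darknet-53, but on halved channel counts."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_leaky(ch, ch // 2, 1),
            conv_bn_leaky(ch // 2, ch, 3),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection
```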
Taking the COCO data set as an example, the COCO 2017 data set has 118287 training images, 5000 verification images and 40670 test images across 80 categories. In addition, because the picture sizes on the training set differ, a normalization process is applied. In the field of target detection, the similarity between two bounding boxes is measured by the IoU (Intersection over Union): DetectionResult denotes the area of a predicted rectangular box and GroundTruth denotes the area of a real rectangular box,
IOU(DetectionResult, GroundTruth) = area(DetectionResult ∩ GroundTruth) / area(DetectionResult ∪ GroundTruth)
then for target detection, the distance metric formula can be calculated as follows:
d(box,centroid)=1-IOU(box,centroid)
Here centroid refers to the cluster-center bounding box; the larger the IOU value between two bounding boxes, the smaller the distance between them. Before the image to be recognized is input into the darknet-39 backbone network module, the image preprocessing module preprocesses it to transform the picture to a fixed size. This embodiment adopts a multi-scale training method: one size is randomly selected from the set {256,320,384,448,512,576,640,704,768} as the picture input size. Taking an input image size of 448 × 448 as an example, the coordinates of the 15 bounding boxes for the feature maps of the convolutional layers at the 5 scales are calculated as follows:
(4,6),(7,16),(14,9),(22,17),(13,30),(28,37),(46,23),(25,70),(49,58),(86,39),(56,124),(99,83),(114,205),(199,124),(294,275)。
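The anchor re-clustering step lends itself to a short sketch. Below is a hedged numpy implementation of k-means with the d(box, centroid) = 1 - IOU distance defined above; the random initialization, the mean update and the final sort by area are assumptions about details the text does not spell out.

```python
# k-means anchor clustering with the 1 - IoU distance (a sketch, assumptions noted above).
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) box shapes and centroid shapes, aligned at the origin."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=15, iters=100, seed=0):
    """k-means using d(box, centroid) = 1 - IOU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]  # sort by area
```

With boxes holding the (width, height) pairs of the training-set ground-truth boxes scaled to the input size, kmeans_anchors(boxes) would yield 15 anchor shapes, 3 per scale across the 5 feature maps.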
inputting the size of an image to be recognized to 448 × 448, establishing an image pyramid of the image to be recognized and different levels of image golden pointsThe pyramid network feature layer feature size is 7 ×,14 ×,28 ×,56, 112 ×, the feature map size is 1, 2, 3, 4, 5 from small to large, the feature pyramid performs up-sampling operation on the feature pyramid by 2 times of step length, and is fused with the next layer depth residual network to form a rapid detection model for depth fusion, the expression capability of the feature pyramid is enhanced, compared with the conventional YOLOv3 network, the method has a wider range, so that the detection effect of small objects and large objects can be remarkably improved, the detection effect of small objects and large objects can not be increased, the feature maps of different depths are respectively subjected to target detection, the feature maps of the future layers are subjected to up-sampling by the feature map of the current feature source map, the feature maps of the future layers are utilized, the feature maps of the future layers are organically fused with the semantic information of high order of improving the detection accuracy, the feature pyramid network feature map is 7 ×,14, the feature map can be subjected to depth fusion by the method of greatly reducing the convolution of the feature pyramid detection of the same pyramid detection method of the same as a convolution, the method of calculating a characteristic map of the method of greatly reducing a convolution, the method of calculating a depth fusion, the method of calculating a method of the same as a method of the method of reducing1And L2For better feature fusion, the embodiment adopts a weighted feature fusion mode, and the feature after fusion is F1,L1Corresponding weighting coefficient w1,L2Corresponding weighting coefficient is w2And then:
F1 = w1 · L1 + w2 · L2
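A minimal sketch of this fusion step, assuming PyTorch and learnable scalar weights; whether the patent constrains or normalizes w1 and w2 is not stated in this text, so a plain weighted sum is used.

```python
# Weighted feature fusion sketch: F1 = w1*L1 + w2*L2 with learnable w1, w2.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))  # w1, w2, learned during training

    def forward(self, l1: torch.Tensor, l2: torch.Tensor) -> torch.Tensor:
        # l1 and l2 must share the same spatial size and channel count
        return self.w[0] * l1 + self.w[1] * l2
```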
the prediction module performs regression prediction on the weighted and fused feature map by using YOLO-V3, the YOLO-V3 divides the feature map into N × N grids (feature maps with different scales and N with different sizes, in this embodiment, there are 5 scales, N is 7,14, 28,56, and 112, each grid predicts 3 different bounding boxes, the target detection result can be represented as N × [3 × (C + Con + B) ], C represents the number of categories, Con represents the confidence, and B represents the coordinates of the bounding boxes.
So that the detection network converges quickly, the pruned darknet-39 network structure is pre-trained on the ImageNet data set and the resulting weight file is loaded directly into the detection network as initialization weights. The hyper-parameters for pre-training the darknet-39 network are set as follows: 120 training epochs; an initial learning rate of 1e-04 decreased in cosine_decay fashion to a final learning rate of 1e-06; momentum 0.9; batch_size 32; and L2 regularization with a weight decay coefficient of 5e-04.
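Those pre-training hyper-parameters map onto standard tooling. The sketch below uses PyTorch's built-in cosine annealing schedule; the choice of SGD and the stand-in model are assumptions, while the numeric values follow the text.

```python
# A hedged sketch of the darknet-39 pre-training schedule, assuming PyTorch.
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the darknet-39 network (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-04,
                            momentum=0.9, weight_decay=5e-04)  # weight decay as L2
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=120, eta_min=1e-06)  # cosine decay from 1e-04 to 1e-06

for epoch in range(120):
    # ... one ImageNet training epoch with batch_size = 32 would go here ...
    optimizer.step()   # placeholder step so the scheduler loop runs end-to-end
    scheduler.step()
```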
The target detection system based on improved YOLO-V3 of this embodiment uses the darknet-39 backbone network for feature extraction, which reduces model size and speeds up target detection; extracts feature maps from 5 convolutional layers at different scales, fully fusing shallow and deep feature information and improving the detection of medium- and large-sized objects; and performs combined weighted feature fusion on the feature maps of different convolutional layers according to their different contribution degrees, strengthening the network feature fusion effect and achieving better detection results.
The first embodiment provides a target detection system based on improved YOLO-V3; correspondingly, the application also provides a target detection method based on improved YOLO-V3. Please refer to fig. 2, a flowchart of a target detection method based on improved YOLO-V3 according to a second embodiment of the present invention. Since the method embodiment is basically similar to the system embodiment, it is described only briefly; for relevant points, refer to the partial description of the system embodiment. The method embodiments described below are merely illustrative.
As shown in fig. 2, a flowchart of a target detection method based on improved YOLO-V3 according to a second embodiment of the present invention is shown, and the method includes:
s201, acquiring an image to be identified.
In the present embodiment, the input image to be recognized has a size of 448 × 448.
S202, preprocessing the image to be recognized to obtain a preprocessed image.
Specifically, preprocessing the image to be recognized comprises the following steps:
randomly flipping the image to be recognized horizontally/vertically and cropping it;
and carrying out scale transformation on the image to be identified.
S203, extracting image features with the trained darknet-39 backbone network model to obtain feature maps of convolutional layers at 5 different scales.
Specifically, training the darknet-39 backbone network model comprises the following steps:
2 convolutional layers are added to the backbone network of the traditional YOLO-V3 algorithm, and feature maps from 5 convolutional layers at different scales are used for target detection.
Specifically, the darknet-53 network is reasonably pruned, the network structure is optimized and some redundant convolution operations are removed to obtain the darknet-39 network. The number of channels of the Level 5 layer is halved and the Level 5 layer is also used as a feature output layer with stride 4, which helps improve the detection rate of small target objects. The Level 4, Level 3 and Level 2 layers have their channel counts halved, which likewise halves their operation counts; their strides are 8, 16 and 32 respectively. Finally, a 3 × 3 convolutional layer with stride 64 is added, strengthening feature extraction while adding almost no parameters. The resulting darknet-39 network cannot directly load the weight parameters of the original darknet-53 and must be retrained. This embodiment performs classification training on the ImageNet LSVRC 2012 data set for 120 epochs, with an initial learning rate of 1e-03 reduced tenfold at steps 170000 and 350000, a batch_size of 128 and a weight decay coefficient of 5e-04.
Acquiring a data set, and dividing the data set into a training set, a test set and a verification set;
re-clustering the bounding-box coordinates on the training set with the k-means clustering algorithm, and calculating the coordinates of 15 bounding boxes for the feature maps of the convolutional layers at 5 different scales.
And S204, optimally combining the feature maps of the convolutional layers with different scales to obtain a combined feature map.
And S205, performing weighted feature fusion on the combined feature map.
And S206, performing regression prediction on the fused feature map by adopting a YOLO-V3 algorithm to obtain a target detection result.
The target detection method based on improved YOLO-V3 of this embodiment extracts features with a darknet-39 backbone network, which reduces model size and speeds up target detection; extracts feature maps from 5 convolutional layers at different scales, fully fusing shallow and deep feature information and improving the detection of medium- and large-sized objects; and performs combined weighted feature fusion on the feature maps of different convolutional layers according to their different contribution degrees, strengthening the network feature fusion effect and achieving better detection results.
As shown in fig. 3, a schematic structural diagram of an intelligent terminal according to a third embodiment of the present invention is shown, where the intelligent terminal includes a processor 301, an input device 302, an output device 303, and a memory 304, where the processor 301, the input device 302, the output device 303, and the memory 304 are connected to each other, the memory 304 is used for storing a computer program, the computer program includes program instructions, and the processor 301 is configured to call the program instructions to execute the method described in the second embodiment.
It should be understood that, in the embodiment of the present invention, the processor 301 may be a central processing unit (CPU); the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device 302 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, etc., and the output device 303 may include a display (LCD, etc.), a speaker, etc.
The memory 304 may include a read-only memory and a random access memory, and provides instructions and data to the processor 301. A portion of the memory 304 may also include non-volatile random access memory. For example, the memory 304 may also store device type information.
In a specific implementation, the processor 301, the input device 302, and the output device 303 described in this embodiment of the present invention may execute the implementation described in the method embodiment provided in this embodiment of the present invention, and may also execute the implementation described in the system embodiment described in this embodiment of the present invention, which is not described herein again.
The invention also provides an embodiment of a computer-readable storage medium, in which a computer program is stored, which computer program comprises program instructions that, when executed by a processor, cause the processor to carry out the method described in the above embodiment.
The computer readable storage medium may be an internal storage unit of the terminal described in the foregoing embodiment, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; to illustrate the interchangeability of hardware and software clearly, the components and steps of the examples have been described above in general functional terms. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality differently for each particular application, but such implementation decisions should not be interpreted as departing from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the terminal and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal and method can be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention and shall be construed as falling within the scope of the claims.

Claims (8)

1. An improved YOLO-V3-based target detection system, comprising: an image acquisition module, an image preprocessing module, a darknet-39 backbone network module, a multi-scale convolutional layer feature combination module, a weighted feature fusion module and a prediction module,
the image acquisition module is used for acquiring an image to be identified;
the image preprocessing module is used for preprocessing an image to be identified to obtain a preprocessed image;
the darknet-39 backbone network module obtains a darknet-39 backbone network model by improving the darknet-53 backbone network, and uses the darknet-39 backbone network model to extract image features, obtaining feature maps of convolutional layers at 5 different scales;
the multi-scale convolutional layer feature combination module is used for optimally combining feature maps of 5 convolutional layers with different scales to obtain a combined feature map;
the weighted feature fusion module is used for carrying out weighted feature fusion on the combined feature map;
and the prediction module is used for performing regression prediction on the fused feature map by adopting a YOLO-V3 algorithm to obtain a target detection result.
2. The improved YOLO-V3-based target detection system according to claim 1, wherein the darknet-39 backbone network module comprises a darknet-39 backbone network training unit, the darknet-39 backbone network training unit adds 2 convolutional layers in the backbone network of the conventional YOLO-V3 algorithm, and performs target detection using 5 convolutional layer feature maps with different scales;
acquiring a data set, dividing the data set into a training set, a testing set and a verification set,
re-clustering the bounding-box coordinates on the training set with a k-means clustering algorithm, and calculating the coordinates of 15 bounding boxes for the feature maps of the convolutional layers at 5 different scales.
3. The improved YOLO-V3-based object detection system according to claim 1, wherein the image preprocessing module comprises an image rotation unit and a scaling unit; the image rotation unit is used for randomly flipping the image to be recognized horizontally/vertically and cropping it; the scaling unit is used for applying a scale transformation to the image to be identified.
4. A target detection method based on improved YOLO-V3 is characterized by comprising the following steps:
acquiring an image to be identified;
preprocessing an image to be recognized to obtain a preprocessed image;
extracting image features with the trained darknet-39 backbone network model to obtain feature maps of convolutional layers at 5 different scales;
optimally combining the feature maps of the convolutional layers with different scales to obtain a combined feature map;
performing weighted feature fusion on the combined feature map;
and performing regression prediction on the fused feature map by using a YOLO-V3 algorithm to obtain a target detection result.
5. The improved YOLO-V3-based target detection method according to claim 4, further comprising a step of training the darknet-39 backbone network model, wherein the method for training the darknet-39 backbone network model comprises:
2 convolutional layers are added to the backbone network of the traditional YOLO-V3 algorithm, and feature maps from 5 convolutional layers at different scales are used for target detection;
acquiring a data set, dividing the data set into a training set, a testing set and a verification set,
re-clustering the bounding-box coordinates on the training set with a k-means clustering algorithm, and calculating the coordinates of 15 bounding boxes for the feature maps of the convolutional layers at 5 different scales.
6. The improved YOLO-V3-based target detection method according to claim 4, wherein the specific method for preprocessing the image to be recognized comprises the following steps:
randomly flipping the image to be recognized horizontally/vertically and cropping it;
and carrying out scale transformation on the image to be identified.
7. An intelligent terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, the memory being adapted to store a computer program, the computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method according to any of claims 4-6.
8. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 4-6.
CN202010333517.8A 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3 Active CN111553406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010333517.8A CN111553406B (en) 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010333517.8A CN111553406B (en) 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3

Publications (2)

Publication Number Publication Date
CN111553406A (en) 2020-08-18
CN111553406B CN111553406B (en) 2023-04-28

Family

ID=72007656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010333517.8A Active CN111553406B (en) 2020-04-24 2020-04-24 Target detection system, method and terminal based on improved YOLO-V3

Country Status (1)

Country Link
CN (1) CN111553406B (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132032A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Traffic sign detection method and device, electronic equipment and storage medium
CN112183255A (en) * 2020-09-15 2021-01-05 西北工业大学 Underwater target visual identification and attitude estimation method based on deep learning
CN112200201A (en) * 2020-10-13 2021-01-08 上海商汤智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN112232258A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN112307976A (en) * 2020-10-30 2021-02-02 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN112507896A (en) * 2020-12-14 2021-03-16 大连大学 Method for detecting cherry fruits by adopting improved YOLO-V4 model
CN112633066A (en) * 2020-11-20 2021-04-09 苏州浪潮智能科技有限公司 Aerial small target detection method, device, equipment and storage medium
CN112668560A (en) * 2021-03-16 2021-04-16 中国矿业大学(北京) Pedestrian detection method and system for pedestrian flow dense area
CN112801169A (en) * 2021-01-25 2021-05-14 中国人民解放军陆军工程大学 Camouflage target detection method based on improved YOLO algorithm
CN112949692A (en) * 2021-02-03 2021-06-11 歌尔股份有限公司 Target detection method and device
CN112966565A (en) * 2021-02-05 2021-06-15 深圳市优必选科技股份有限公司 Object detection method and device, terminal equipment and storage medium
CN113435367A (en) * 2021-06-30 2021-09-24 北大方正集团有限公司 Social distance evaluation method and device and storage medium
CN113838021A (en) * 2021-09-18 2021-12-24 长春理工大学 Pulmonary nodule detection system based on improved YOLOv5 network
CN114170421A (en) * 2022-02-10 2022-03-11 青岛海尔工业智能研究院有限公司 Image detection method, device, equipment and storage medium
WO2022083784A1 (en) * 2020-10-23 2022-04-28 西安科锐盛创新科技有限公司 Road detection method based on internet of vehicles
CN117960839A (en) * 2024-03-29 2024-05-03 山西建投临汾建筑产业有限公司 Steel structural member welding deformation correcting device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170147905A1 (en) * 2015-11-25 2017-05-25 Baidu Usa Llc Systems and methods for end-to-end object detection
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 YOLOv 2-based vehicle target detection method, system and equipment
WO2019232830A1 (en) * 2018-06-06 2019-12-12 平安科技(深圳)有限公司 Method and device for detecting foreign object debris at airport, computer apparatus, and storage medium
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN110991311A (en) * 2019-11-28 2020-04-10 江南大学 Target detection method based on dense connection deep network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170147905A1 (en) * 2015-11-25 2017-05-25 Baidu Usa Llc Systems and methods for end-to-end object detection
WO2019232830A1 (en) * 2018-06-06 2019-12-12 平安科技(深圳)有限公司 Method and device for detecting foreign object debris at airport, computer apparatus, and storage medium
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN110443208A (en) * 2019-08-08 2019-11-12 南京工业大学 YOLOv 2-based vehicle target detection method, system and equipment
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN110991311A (en) * 2019-11-28 2020-04-10 江南大学 Target detection method based on dense connection deep network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
戴伟聪; 金龙旭; 李国宁; 郑志强: "Improved YOLOv3 real-time detection algorithm for aircraft in remote sensing images" *
朱鹏; 陈虎; 李科; 程宾洋: "A lightweight multi-scale feature face detection method" *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183255A (en) * 2020-09-15 2021-01-05 西北工业大学 Underwater target visual identification and attitude estimation method based on deep learning
CN112132032A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Traffic sign detection method and device, electronic equipment and storage medium
CN112200201A (en) * 2020-10-13 2021-01-08 上海商汤智能科技有限公司 Target detection method and device, electronic equipment and storage medium
WO2022083784A1 (en) * 2020-10-23 2022-04-28 西安科锐盛创新科技有限公司 Road detection method based on internet of vehicles
US20230154202A1 (en) * 2020-10-23 2023-05-18 Xi'an Creation Keji Co., Ltd. Method of road detection based on internet of vehicles
CN112232258A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN112307976A (en) * 2020-10-30 2021-02-02 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN112307976B (en) * 2020-10-30 2024-05-10 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN112633066A (en) * 2020-11-20 2021-04-09 苏州浪潮智能科技有限公司 Aerial small target detection method, device, equipment and storage medium
CN112507896A (en) * 2020-12-14 2021-03-16 大连大学 Method for detecting cherry fruits by adopting improved YOLO-V4 model
CN112507896B (en) * 2020-12-14 2023-11-07 大连大学 Method for detecting cherry fruits by adopting improved YOLO-V4 model
CN112801169A (en) * 2021-01-25 2021-05-14 中国人民解放军陆军工程大学 Camouflage target detection method based on improved YOLO algorithm
CN112801169B (en) * 2021-01-25 2024-02-06 中国人民解放军陆军工程大学 Camouflage target detection method, system, device and storage medium based on improved YOLO algorithm
CN112949692A (en) * 2021-02-03 2021-06-11 歌尔股份有限公司 Target detection method and device
CN112966565A (en) * 2021-02-05 2021-06-15 深圳市优必选科技股份有限公司 Object detection method and device, terminal equipment and storage medium
CN112668560A (en) * 2021-03-16 2021-04-16 中国矿业大学(北京) Pedestrian detection method and system for pedestrian flow dense area
CN113435367A (en) * 2021-06-30 2021-09-24 北大方正集团有限公司 Social distance evaluation method and device and storage medium
CN113838021A (en) * 2021-09-18 2021-12-24 长春理工大学 Pulmonary nodule detection system based on improved YOLOv5 network
CN114170421A (en) * 2022-02-10 2022-03-11 青岛海尔工业智能研究院有限公司 Image detection method, device, equipment and storage medium
CN117960839A (en) * 2024-03-29 2024-05-03 山西建投临汾建筑产业有限公司 Steel structural member welding deformation correcting device
CN117960839B (en) * 2024-03-29 2024-06-04 山西建投临汾建筑产业有限公司 Steel structural member welding deformation correcting device

Also Published As

Publication number Publication date
CN111553406B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN111553406A (en) Target detection system, method and terminal based on improved YOLO-V3
CN110647817B (en) Real-time face detection method based on MobileNet V3
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN112733749B (en) Real-time pedestrian detection method integrating attention mechanism
CN111814794B (en) Text detection method and device, electronic equipment and storage medium
CN109829448B (en) Face recognition method, face recognition device and storage medium
CN110020592A (en) Object detection model training method, device, computer equipment and storage medium
CN113822209B (en) Hyperspectral image recognition method and device, electronic equipment and readable storage medium
CN111652217A (en) Text detection method and device, electronic equipment and computer storage medium
CN106682233A (en) Method for Hash image retrieval based on deep learning and local feature fusion
CN111723786A (en) Method and device for detecting wearing of safety helmet based on single model prediction
EP4047509A1 (en) Facial parsing method and related devices
CN105144239A (en) Image processing device, program, and image processing method
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN111353491B (en) Text direction determining method, device, equipment and storage medium
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN111339935A (en) Optical remote sensing picture classification method based on interpretable CNN image classification model
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114972947B (en) Depth scene text detection method and device based on fuzzy semantic modeling
CN111373393B (en) Image retrieval method and device and image library generation method and device
CN113408651B (en) Unsupervised three-dimensional object classification method based on local discriminant enhancement
Xu et al. Multi‐pyramid image spatial structure based on coarse‐to‐fine pyramid and scale space

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant