CN112699859A - Target detection method, device, storage medium and terminal - Google Patents
- Publication number
- CN112699859A (application number CN202110310610.1A)
- Authority
- CN
- China
- Prior art keywords
- network
- module
- target detection
- yolov5s
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06V20/584—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Abstract
The invention discloses a target detection method, which comprises the following steps: acquiring an image captured by a camera in a driving scene; and inputting the image into a trained target detection network, which evaluates the image and predicts target classification and position information. The target detection network adopts the lightweight YOLOv5s network structure as its basic framework, embeds a bottleneck attention mechanism module in the cross-stage partial network of the YOLOv5s backbone network, and applies depthwise separable convolution in designated convolution layers of the backbone network. The invention effectively improves the accuracy and speed of target detection on driving-scene images and meets the requirement of lightweight front-end deployment in driving scenarios.
Description
Technical Field
The present invention relates to the field of information technologies, and in particular, to a target detection method, an apparatus, a storage medium, and a terminal.
Background
With the rapid development of artificial intelligence technology, a large number of deep-learning-based target detection algorithms have emerged and are widely applied to target detection tasks in fields such as assisted driving, video surveillance, robot vision and industrial inspection. Visual perception is an important component of road environment perception in assisted driving: it automatically analyzes images captured by a camera and proactively predicts potentially dangerous conditions around the vehicle, such as pedestrians crossing the road against traffic rules or a vehicle ahead braking suddenly.
In the prior art, when target detection is performed on images captured by a camera in a driving scene, the YOLOv3 algorithm is used as the basic framework, and the receptive field of the feature maps is enhanced by embedding an SENet structure so that the network learns more comprehensive feature information. However, this method has the following disadvantages:
(1) SENet only screens and reweights features along the channel dimension and cannot capture positional relationship information well, so detection accuracy is poor.
(2) The YOLOv3 algorithm suffers from insufficient recall and inaccurate localization. Compared with earlier versions such as YOLOv1 and YOLOv2, YOLOv3 improves accuracy but reduces detection speed.
(3) Detection accuracy for partially occluded targets is low, making it difficult to meet the application requirements of traffic road scenes.
(4) Because positive and negative samples are imbalanced in driving-scene target detection, the model pays excessive attention to easy samples, resulting in low model performance.
Disclosure of Invention
The embodiment of the invention provides a target detection method, a target detection device, a storage medium and a terminal, aiming to solve the prior-art problems of low detection accuracy and low detection speed when performing target detection on images captured by a camera in a driving scene.
A method of target detection, comprising:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts the lightweight YOLOv5s network structure as its basic framework, a bottleneck attention mechanism module is embedded in the cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution is adopted in designated convolution layers of the YOLOv5s backbone network.
Optionally, the target detection network embeds a bottleneck attention mechanism module in the BottleneckCSP1_x layer of the YOLOv5s backbone network to obtain a cross-stage partial network based on the bottleneck attention mechanism module.
Optionally, the target detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module, the CBH module consisting of a convolution operation, a normalization operation and an activation function, and the MBH module consisting of an inverted residual module based on depthwise separable convolution, a normalization operation and an activation function.
Optionally, the inverted residual module based on depthwise separable convolution comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer apply a BatchNorm operation and the non-linear ReLU6 activation function, and the second pointwise convolution layer applies the BatchNorm operation without the ReLU6 activation function;
the first pointwise convolution layer is used for expanding a first low-dimensional feature representation to a first high-dimensional feature representation; the depthwise convolution layer is used for performing feature extraction on the first high-dimensional feature representation based on depthwise separable convolution to obtain a second high-dimensional feature representation; the second pointwise convolution layer is used for compressing the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer is used for fusing, via a skip connection, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer and the second pointwise convolution layer, generating a new feature map.
Optionally, the target detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a classification loss function, a bounding box regression loss function and a confidence loss function.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a Gradient Harmonizing Mechanism (GHM) loss function, a repulsion loss function and a confidence loss function.
An object detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring images shot by the camera in a driving scene;
the detection module is used for inputting the image to a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts the lightweight YOLOv5s network structure as its basic framework, a bottleneck attention mechanism module is embedded in the cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution is adopted in designated convolution layers of the YOLOv5s backbone network.
A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the object detection method as described above.
A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the object detection method as described above when executing the computer program.
According to the embodiment of the invention, a target detection network is constructed in advance. The network adopts the lightweight YOLOv5s network structure as its basic framework and embeds a bottleneck attention mechanism module in the cross-stage partial network of the YOLOv5s backbone network, so that channel and spatial feature information can be screened simultaneously, the expressive capacity of channel and spatial features is improved, and the network's perception range of target feature regions is expanded. Depthwise separable convolution is adopted in designated convolution layers of the YOLOv5s backbone network, which effectively reduces the parameter count and improves detection speed. During detection, an image captured by a camera in a driving scene is acquired and input into the trained target detection network, which evaluates the image and predicts target classification and position information. This effectively improves the accuracy and speed of target detection on driving-scene images and meets the requirement of lightweight front-end deployment in driving scenarios.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of a method of target detection in an embodiment of the invention;
FIG. 2 is a schematic diagram of a target detection network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a bottleneck attention mechanism module according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a CBH module according to an embodiment of the present invention;
FIG. 5 is a block diagram of a BAM-CSP1_ x network module according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the structure of an MBH module in an embodiment of the present invention;
FIG. 7 is a block diagram illustrating an inverted residual module based on a depth separable convolution operation according to an embodiment of the present invention;
FIG. 8 is a functional block diagram of an object detection device in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a target detection method. The method is applied to an assisted driving system, enabling the system to detect targets such as people and vehicles early and accurately and, combined with other technologies, to promptly remind the driver to brake or steer in an emergency, thereby avoiding collisions and ensuring driving safety and traffic order. The target detection method provided in this embodiment is described in detail below; as shown in fig. 1, the target detection method includes:
in step S101, an image captured by a camera in a driving scene is acquired.
The embodiment of the invention performs target detection on the images shot by the camera in the driving scene, including motor vehicle detection, non-motor vehicle detection and pedestrian detection, and obtains the position of the target.
In step S102, the image is input to a trained target detection network, and the image is judged and predicted by the target detection network, so as to obtain target classification and position information.
To improve the accuracy of target detection in images captured by a camera in a driving scene, the embodiment of the invention designs a deep neural network model, namely the target detection network, which keeps the model lightweight while improving detection accuracy.
As shown in fig. 2, the target detection network comprises four parts: an Input layer, a Backbone network, a Neck structure and an Output layer. The Input layer preprocesses the input image; the preprocessing includes but is not limited to augmentation, adaptive scaling and adaptive anchor box computation. The Backbone network aggregates fine-grained features of the image at different scales to form feature maps, which it outputs to the Neck. The Neck performs feature fusion across detection layers using features from different backbone stages, strengthening the network's feature fusion capability. The Output layer generates the bounding boxes and classes of the predicted targets.
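The adaptive scaling step above can be illustrated with a short sketch. The function name, the 640-pixel target size and the pad-to-stride-multiple rule follow common YOLOv5 letterbox practice and are assumptions here, not details fixed by the patent text:

```python
def letterbox_params(src_w, src_h, dst=640, stride=32):
    """Adaptive-scaling (letterbox) geometry: scale the image uniformly so
    the longer side fits `dst`, then pad the short side only up to the next
    multiple of `stride` (instead of to a full square), reducing wasted
    computation on border pixels. Returns (new_w, new_h, pad_w, pad_h)."""
    r = min(dst / src_w, dst / src_h)   # uniform scale factor, no distortion
    new_w, new_h = round(src_w * r), round(src_h * r)
    pad_w = (-new_w) % stride           # padding needed to reach a stride multiple
    pad_h = (-new_h) % stride
    return new_w, new_h, pad_w, pad_h
```

For a 1280 x 720 dashcam frame this yields a 640 x 360 resize with only 24 rows of padding, rather than 280 rows for a naive square letterbox.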
In one embodiment of the present invention, the target detection network adopts the lightweight YOLOv5s network structure as its basic framework, and a bottleneck attention mechanism module is embedded in the cross-stage partial network of the YOLOv5s backbone network. The embodiment of the invention embeds the bottleneck attention module into the YOLOv5 algorithm, adding channel and spatial attention to the feature extraction network; the attention module screens channel and spatial feature information simultaneously, improving the network's channel and spatial feature expression capability and expanding its perception range of target feature regions.
Here, the Bottleneck Attention Module (BAM) is a hybrid attention model that can be embedded into a feed-forward convolutional neural network and comprises two branch networks: a channel attention module and a spatial attention module. Fig. 3 is a schematic structural diagram of the bottleneck attention mechanism module according to an embodiment of the present invention. Given an input feature map F, the two independent branch networks of the BAM module, Channel Attention and Spatial Attention, produce a channel attention map Mc(F) and a spatial attention map Ms(F), respectively. The two are fused into the attention map M(F) = sigmoid(Mc(F) + Ms(F)). Element-wise multiplication F ⊗ M(F) suppresses unimportant features and highlights important ones, and the result is then added to the input feature map F to obtain the refined feature map F′ = F + F ⊗ M(F).
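The two-branch structure just described can be sketched in PyTorch as follows. This is a simplified sketch: the reduction ratio and dilation values follow the original BAM paper, not the patent text, and the per-branch BatchNorm of the original design is omitted for brevity:

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """Bottleneck Attention Module sketch: a channel branch and a spatial
    branch whose outputs are summed (with broadcasting), squashed by a
    sigmoid, and applied to the input with a residual connection."""
    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        # Channel attention: global average pool -> bottleneck MLP
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 1x1 reduce -> dilated 3x3 conv -> 1-channel map
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        mc = self.channel(x).view(b, c, 1, 1)   # per-channel weights
        ms = self.spatial(x)                    # per-position map
        attn = torch.sigmoid(mc + ms)           # M(F) = sigmoid(Mc + Ms)
        return x + x * attn                     # F' = F + F (*) M(F)
```

The module is shape-preserving, so it can be dropped between any two backbone stages without changing the rest of the network.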
In the existing YOLOv5s network structure, the module composed of the convolution operation Conv2d, the normalization operation BatchNorm and the HardSwish activation function is referred to as a CBH module; fig. 4 is a schematic structural diagram of the CBH module provided in the embodiment of the present invention. The BottleneckCSP1_X module consists of CBH modules and X residual structures (Res unit modules); BottleneckCSP2_x has a structure similar to BottleneckCSP1_x, except that the residual units are replaced by CBH modules. Spatial Pyramid Pooling (SPP) performs multi-scale fusion via max pooling at 1 × 1, 5 × 5, 9 × 9 and 13 × 13. To improve the feature expression capability of the network, the embodiment of the present invention embeds a bottleneck attention mechanism module into the BottleneckCSP1_x layer of the YOLOv5s backbone network to form a repeating unit composed of a CBH module and a BAM module, obtaining a cross-stage partial network based on the bottleneck attention mechanism, herein referred to as the BAM-CSP1_x network module. Fig. 5 is a schematic structural diagram of the BAM-CSP1_x network module according to an embodiment of the present invention. The feature map is passed into the BAM-CSP1_x network module, and at the lower layers of YOLOv5s the channel and spatial attention mechanisms eliminate the influence of irrelevant factors as much as possible, so that the network focuses on effective feature information, suppresses unimportant feature information, and concentrates on extracting target features in the driving scene, which benefits detection accuracy.
In another embodiment of the present invention, depthwise separable convolution is also employed in designated convolution layers of the YOLOv5s backbone network. Specifically, a designated CBH module in the YOLOv5s backbone network is replaced with an MBH module, which is obtained by replacing the convolution operation Conv2d in the CBH module with an inverted residual module based on depthwise separable convolution (herein denoted the Mod module). As a preferred example of the present invention, as shown in fig. 6, the MBH module is composed of the inverted residual module based on depthwise separable convolution, BatchNorm2d normalization and the HardSwish activation function: the feature map passes through the inverted residual module and then through the BatchNorm2d and HardSwish operations in turn.
As shown in fig. 7, the inverted residual module based on depthwise separable convolution comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer and a fusion layer. The first pointwise convolution layer and the depthwise convolution layer apply a BatchNorm operation and the non-linear ReLU6 activation function, while the second pointwise convolution layer applies the BatchNorm operation without the ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation to a first high-dimensional feature representation; the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on depthwise separable convolution to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, via a skip connection, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the three convolution layers, generating a new feature map.
Here, depthwise separable convolution splits a standard convolution into a 3 × 3 depthwise convolution acting on the spatial dimensions and a 1 × 1 pointwise convolution acting on the channel dimension, which effectively reduces the model's computation. However, because depthwise separable convolution compresses not only the computation of the convolution kernels but also their exploration space, the feature expression capability is weakened once the convolution's feature-space exploration capacity is compressed. In view of this, the embodiment of the present invention uses an inverted residual model based on depthwise separable convolution: a 1 × 1 convolution is added before the depthwise convolution to expand the low-dimensional feature representation to a high-dimensional one, feature extraction is performed with the depthwise separable convolution, and the result is then compressed back into a low-dimensional space. In this inverted residual model, the first pointwise convolution layer and the depthwise convolution layer are each followed by a BatchNorm operation and the non-linear ReLU6 function. When the number of channels is large, the features may partly occupy a low-dimensional subspace; there the ReLU6 function still maintains good feature extraction capability, but after the features are transformed from high dimension back to low dimension, ReLU6 would instead reduce the network's feature extraction capability, so it is not used after the final second pointwise convolution layer. Finally, a shortcut connection fuses the original feature map with the feature map produced by the depthwise separable convolution to generate a new feature map.
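The computation savings mentioned above can be checked with simple parameter counts (bias terms ignored). For a 3 × 3 convolution with 128 input and 128 output channels, the separable form needs roughly 8× fewer weights:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution: every output channel
    mixes all input channels across the full k x k window."""
    return k * k * c_in * c_out

def dw_separable_params(c_in, c_out, k):
    """Depthwise separable form: one k x k filter per input channel
    (spatial step) plus a 1 x 1 pointwise conv (channel-mixing step)."""
    return k * k * c_in + c_in * c_out
```

With c_in = c_out = 128 and k = 3 this gives 147,456 versus 17,536 parameters, a roughly 8.4× reduction, which is where the speed-up claimed for the MBH replacement comes from.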
It will be appreciated that the first low-dimensional feature representation and the first high-dimensional feature representation are relative, and the second high-dimensional feature representation and the second low-dimensional feature representation are also relative.
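A minimal PyTorch sketch of the inverted residual (Mod) block described above might look as follows; the expansion factor of 4 is an assumption for illustration, as the patent does not specify one:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of the depthwise-separable inverted-residual block:
    1x1 pointwise expansion (BN + ReLU6), 3x3 depthwise conv (BN + ReLU6),
    1x1 pointwise projection (BN only, deliberately no ReLU6), and a skip
    connection fusing input and output when shapes allow."""
    def __init__(self, c_in, c_out, stride=1, expand=4):
        super().__init__()
        hidden = c_in * expand
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),       # expand low -> high dim
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),          # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),       # project high -> low dim
            nn.BatchNorm2d(c_out),                         # no ReLU6 after projection
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out           # skip-connection fusion
```

Note `groups=hidden` is what makes the 3 × 3 convolution depthwise (one filter per channel), and the missing activation after the projection matches the "no ReLU6 in the second pointwise layer" rule stated above.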
As a preferred example of the present invention, in the target detection network, specifically the fourth CBH module in the YOLOv5s backbone network may be replaced by an MBH module. The embodiment of the invention uses the YOLOv5 algorithm, which has a small parameter scale and very high inference speed, as the basic framework, and replaces the computation-heavy convolution layer in the backbone network with depthwise separable convolution, effectively reducing the parameter count and improving target detection speed while retaining a good detection effect.
For the constructed target detection network, the embodiment of the invention trains end to end using the Adam optimization method and a preset loss function. Optionally, the training image size is 640 × 640, the batch size is set to 16, and the number of epochs is set to 300.
In one embodiment of the invention, the loss function LOSS of the YOLOv5s network in the target detection network is composed of a classification loss function Lcls, a bounding box regression loss function Lbox and a confidence loss function Lobj, as shown in formula (1):

LOSS = Lcls + Lbox + Lobj (1)
The classification loss function Lcls ordinarily uses the BCE (Binary Cross Entropy) loss. To address the imbalance between positive and negative samples in driving-scene target detection, the embodiment of the invention replaces the classification loss function with the Gradient Harmonizing Mechanism loss (GHM Loss). In the GHM loss, for a candidate box, let p be the probability predicted by the model and p* the ground-truth label of a class; the binary cross-entropy loss is computed as shown in formula (2):

LCE(p, p*) = −p* log(p) − (1 − p*) log(1 − p) (2)
To deal with the problem of gradient-norm imbalance, a gradient density function GD(g) is used, as shown in formula (3):

GD(g) = R(g) / lε(g) (3)

where R(g) is the number of samples, among samples 1 to N, whose gradient norm falls within the interval of length ε centered at g, and lε(g) denotes the length of that interval.
Targeting the imbalance of positive and negative samples in driving-scene target detection, the embodiment of the invention replaces the classification loss function with the GHM loss, which lowers the weights of simple negative samples and extremely hard outlier samples among the candidates while raising the weights of normal hard samples, so that the model concentrates on effective normal hard samples, effectively improving model performance.
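The GHM weighting idea described above can be sketched numerically with histogram-binned gradient norms; the bin count here is an arbitrary choice for illustration, not a value from the patent:

```python
import numpy as np

def ghm_weights(p, y, bins=10):
    """Gradient Harmonizing Mechanism sketch: weight each sample by the
    inverse of the gradient density in its gradient-norm bin, so that easy
    negatives (g near 0) and extreme outliers (g near 1), which crowd their
    bins, are down-weighted relative to normal hard samples."""
    g = np.abs(p - y)                                  # gradient norm |p - p*|
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(g, edges) - 1, 0, bins - 1)
    counts = np.bincount(idx, minlength=bins)          # samples per bin
    density = counts[idx] * bins                       # GD(g) ~ count / bin width
    return len(p) / np.maximum(density, 1)             # weight beta_i = N / GD(g_i)
```

Each sample's BCE loss would then be multiplied by its weight, so the few samples in sparsely populated gradient bins dominate less crowded, less informative ones.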
For the case of partially occluded targets, the embodiment of the invention can further replace the bounding box regression loss function with the Repulsion Loss. In that case, the loss function of the target detection network is composed of the GHM loss function, the repulsion loss function and the confidence loss function. The repulsion loss reduces the distance between a prediction box and its target box while increasing the distance between the prediction box and surrounding target boxes or other prediction boxes. The repulsion loss consists of three parts: the first is the loss between a prediction box and its target box; the second is the loss between a prediction box and surrounding target boxes; and the third is the loss between a prediction box and surrounding prediction boxes that are not predicting the same target. Two relation coefficients α and β adjust the second and third loss terms; the larger the distance to surrounding targets, the smaller the loss. The expression of the repulsion loss is shown in formula (5):

L = LAttr + α · LRepGT + β · LRepBox (5)
the first partial expression on the right side of equation (5) is a loss function of a regression model. Wherein the content of the first and second substances,a set of positive samples is represented, and,is a prediction box of the propofol regression,then is with propofolThe real target box with the largest Intersection over Union (IoU for short),function for measuringAndthe distance of (d); in the second sub-formula, the first sub-formula,is in addition to andthe corresponding real box, the real box with the maximum value of IoU,to representAndthe overlapping region isThe percentage of the area over which the light is emitted,the function is used for measuring the distance between the prediction frame and the real frame of the surrounding target; in the third sub-formula, theThe division into different subsets is performed such that,,anda prediction box representing a different target is shown,function for metric prediction block and method thereofThe distance of the prediction boxes of the surrounding targets is such that the overlap area of the pro posal P of the different subsets is as small as possible. From the denominator part of the third fraction, it can be seen that the loss value is only counted if the prediction box has an overlapping area, and if not adjacent at all, it is not counted. The loss of the third fraction can reduce the probability that the bounding boxes of different regression targets are combined into one, so that the embodiment is more robust under the condition that the traffic road target is partially shielded, and the detection effect is effectively improved.
For the target detection network, the embodiment of the present invention adopts Precision, Recall, mean Average Precision (mAP) and detection speed (Frames Per Second, FPS) as evaluation indexes, calculated and explained as follows:
1. Precision represents the proportion of true positives among the samples classified as positive, denoted by the letter P, as shown in formula (6):

P = TP / (TP + FP) (6)

where TP + FP is the number of images predicted as the positive class and TP is the number of positive-class images correctly predicted as positive.
2. Recall indicates how many of the actual positive samples are classified as positive, denoted by the letter R. It measures the coverage of the detection results, as shown in formula (7): R = TP / (TP + FN), where FN is the number of positive samples incorrectly predicted as negative.
3. mAP is the mean of the average precisions (AP) of all classes in the dataset, where AP is the average precision of a single class. For the i-th class, an IoU threshold is selected, and the average precision is calculated as follows: AP_i = ∫₀¹ P(R) dR.
Its geometric meaning is the area enclosed by the precision-recall curve and the horizontal axis. With N classes in total, the mean average precision is calculated as follows: mAP = (1/N) Σᵢ AP_i.
4. FPS is the number of image frames detected per second; this index depends not only on the computational cost of the algorithm model but also on the hardware used in the experiments. Generally, if the detection speed is not less than 25 fps, the algorithm model can be considered to meet the real-time requirement.
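The metrics above can be sketched as follows (a minimal illustration, not part of the patent; the function names and the rectangle-rule approximation of the area under the P-R curve are assumptions):

```python
def precision_recall(tp, fp, fn):
    # formula (6): P = TP / (TP + FP); formula (7): R = TP / (TP + FN)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    # AP as the area under the precision-recall curve, approximated by
    # rectangles over recall steps; recalls must be in increasing order
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    # mAP: mean of the per-class average precisions over the N classes
    return sum(ap_per_class) / len(ap_per_class)
```

For example, a detector with 8 true positives, 2 false positives, and 4 false negatives scores P = 0.8 and R ≈ 0.67; its AP is then the area swept out as the confidence threshold varies.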
The embodiment of the invention provides a lightweight method, based on a deep neural network, for detecting targets in images captured in a driving scene, with a series of improvements and optimizations built on YOLOv5s. Compared with the existing SE + YOLOv3 network structure, the accuracy on an image dataset captured by a camera in a driving scene is greatly improved in the GTX1080 test environment, and the prediction boxes are closer to the real target boxes. Compared with the original SE + YOLOv3 network structure, the model size of the embodiment of the invention is greatly reduced while the detection accuracy is improved, meeting the lightweight front-end application requirements of driving scenes.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, the present invention further provides an object detection apparatus, which corresponds one-to-one with the object detection method in the foregoing embodiment. As shown in fig. 8, the object detection apparatus includes an acquisition module 81 and a detection module 82. The functional modules are explained in detail as follows:
an acquisition module 81 for acquiring an image captured by a camera in a driving scene;
the detection module 82 is configured to input the image to a trained target detection network, and judge and predict the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as its basic framework, a bottleneck attention mechanism module is embedded in a cross-stage local network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in designated convolutional layers of the YOLOv5s backbone network.
Optionally, the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_ x layer of a YOLOv5s backbone network to obtain a cross-stage local network based on the bottleneck attention mechanism module.
Optionally, the target detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module; the CBH module consists of a convolution operation, a normalization process, and an activation function, while the MBH module consists of an inverted residual module based on depthwise separable convolution, a normalization process, and an activation function.
Optionally, the inverted residual module based on depthwise separable convolution comprises a first single-point convolutional layer, a depth convolutional layer, a second single-point convolutional layer, and a fusion layer, wherein the first single-point convolutional layer and the depth convolutional layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second single-point convolutional layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first single-point convolutional layer is used for expanding a first low-dimensional feature representation to a first high-dimensional feature representation; the depth convolutional layer is used for performing feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second single-point convolutional layer is used for compressing the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer is used for fusing, by a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation processed by the first single-point convolutional layer, the depth convolutional layer, and the second single-point convolutional layer, to generate a new feature map.
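The parameter savings that motivate replacing standard convolutions with depthwise separable ones can be illustrated by a simple weight count (a sketch, not part of the patent; biases and BatchNorm parameters are ignored, and the function names and the default expansion factor t=6 are assumptions):

```python
def standard_conv_params(c_in, c_out, k):
    # a k x k standard convolution mixes channels and space at once
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    # depthwise k x k (one filter per input channel) followed by a
    # 1 x 1 pointwise convolution that mixes channels
    return k * k * c_in + c_in * c_out

def inverted_residual_params(c_in, c_out, k=3, t=6):
    # 1 x 1 expansion (c_in -> t * c_in), depthwise k x k on the
    # expanded features, then 1 x 1 projection back down to c_out
    hidden = t * c_in
    return c_in * hidden + k * k * hidden + hidden * c_out
```

For 32 input and 64 output channels with a 3 x 3 kernel, the depthwise separable form needs 2,336 weights versus 18,432 for the standard convolution, a roughly 8x reduction; the inverted residual spends some of that saving on the expansion stage in exchange for richer intermediate features.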
Optionally, the target detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a classification loss function, a frame regression loss function and a confidence coefficient loss function.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function consists of a gradient harmonizing mechanism (GHM) loss function, a repulsion loss function, and a confidence loss function.
For specific limitations of the target detection device, reference may be made to the above limitations of the target detection method, which are not repeated here. The modules in the target detection device can be wholly or partially realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory in software form, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of object detection.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as its basic framework, a bottleneck attention mechanism module is embedded in a cross-stage local network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in designated convolutional layers of the YOLOv5s backbone network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (10)
1. A method of object detection, comprising:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as its basic framework, a bottleneck attention mechanism module is embedded in a cross-stage local network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in designated convolutional layers of the YOLOv5s backbone network.
2. The target detection method of claim 1, wherein the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_ x layer of a YOLOv5s backbone network to obtain a cross-stage local network based on the bottleneck attention mechanism module.
3. The object detection method of claim 2, wherein the object detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module, the CBH module consisting of a convolution operation, a normalization process, and an activation function, and the MBH module consisting of an inverted residual module based on a depthwise separable convolution operation, a normalization process, and an activation function.
4. The object detection method of claim 3, wherein the inverted residual module based on the depthwise separable convolution operation comprises a first single-point convolutional layer, a depth convolutional layer, a second single-point convolutional layer, and a fusion layer, wherein the first single-point convolutional layer and the depth convolutional layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second single-point convolutional layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first single-point convolutional layer is used for expanding a first low-dimensional feature representation to a first high-dimensional feature representation; the depth convolutional layer is used for performing feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second single-point convolutional layer is used for compressing the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer is used for fusing, by a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation processed by the first single-point convolutional layer, the depth convolutional layer, and the second single-point convolutional layer, to generate a new feature map.
5. The object detection method of claim 3 or 4, wherein the object detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
6. The object detection method according to any one of claims 1 to 4, wherein the object detection network is trained by a preset loss function;
the loss function is composed of a classification loss function, a frame regression loss function and a confidence coefficient loss function.
7. The object detection method according to any one of claims 1 to 4, wherein the object detection network is trained by a preset loss function;
the loss function consists of a gradient harmonizing mechanism (GHM) loss function, a repulsion loss function, and a confidence loss function.
8. An object detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring images shot by the camera in a driving scene;
the detection module is used for inputting the image to a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as its basic framework, a bottleneck attention mechanism module is embedded in a cross-stage local network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in designated convolutional layers of the YOLOv5s backbone network.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the object detection method according to any one of claims 1 to 7.
10. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the object detection method according to any one of claims 1 to 7 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110310610.1A CN112699859B (en) | 2021-03-24 | 2021-03-24 | Target detection method, device, storage medium and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110310610.1A CN112699859B (en) | 2021-03-24 | 2021-03-24 | Target detection method, device, storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112699859A true CN112699859A (en) | 2021-04-23 |
CN112699859B CN112699859B (en) | 2021-07-16 |
Family
ID=75515587
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110310610.1A Active CN112699859B (en) | 2021-03-24 | 2021-03-24 | Target detection method, device, storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112699859B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110633610A (en) * | 2019-05-17 | 2019-12-31 | 西南交通大学 | Student state detection algorithm based on YOLO |
CN110852222A (en) * | 2019-10-31 | 2020-02-28 | 上海交通大学 | Campus corridor scene intelligent monitoring method based on target detection |
CN112233090A (en) * | 2020-10-15 | 2021-01-15 | 浙江工商大学 | Film flaw detection method based on improved attention mechanism |
CN112307921A (en) * | 2020-10-22 | 2021-02-02 | 桂林电子科技大学 | Vehicle-mounted end multi-target identification tracking prediction method |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110633610A (en) * | 2019-05-17 | 2019-12-31 | 西南交通大学 | Student state detection algorithm based on YOLO |
CN110852222A (en) * | 2019-10-31 | 2020-02-28 | 上海交通大学 | Campus corridor scene intelligent monitoring method based on target detection |
CN112233090A (en) * | 2020-10-15 | 2021-01-15 | 浙江工商大学 | Film flaw detection method based on improved attention mechanism |
CN112307921A (en) * | 2020-10-22 | 2021-02-02 | 桂林电子科技大学 | Vehicle-mounted end multi-target identification tracking prediction method |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113160062A (en) * | 2021-05-25 | 2021-07-23 | 烟台艾睿光电科技有限公司 | Infrared image target detection method, device, equipment and storage medium |
CN113469087A (en) * | 2021-07-09 | 2021-10-01 | 上海智臻智能网络科技股份有限公司 | Method, device, equipment and medium for detecting picture frame in building drawing |
CN113705604A (en) * | 2021-07-15 | 2021-11-26 | 中国科学院信息工程研究所 | Botnet flow classification detection method and device, electronic equipment and storage medium |
CN113449691A (en) * | 2021-07-21 | 2021-09-28 | 天津理工大学 | Human shape recognition system and method based on non-local attention mechanism |
CN113569702A (en) * | 2021-07-23 | 2021-10-29 | 闽江学院 | Deep learning-based truck single-tire and double-tire identification method |
CN113569702B (en) * | 2021-07-23 | 2023-10-27 | 闽江学院 | Truck single-double tire identification method based on deep learning |
CN113887706A (en) * | 2021-09-30 | 2022-01-04 | 苏州浪潮智能科技有限公司 | Method and device for low bit quantization aiming at one-stage target detection network |
CN113887706B (en) * | 2021-09-30 | 2024-02-06 | 苏州浪潮智能科技有限公司 | Method and device for low-bit quantization of one-stage target detection network |
CN113963167B (en) * | 2021-10-29 | 2022-05-27 | 北京百度网讯科技有限公司 | Method, device and computer program product applied to target detection |
CN113963167A (en) * | 2021-10-29 | 2022-01-21 | 北京百度网讯科技有限公司 | Method, device and computer program product applied to target detection |
CN114549970A (en) * | 2022-01-13 | 2022-05-27 | 山东师范大学 | Night small target fruit detection method and system fusing global fine-grained information |
CN114549970B (en) * | 2022-01-13 | 2024-06-07 | 山东师范大学 | Night small target fruit detection method and system integrating global fine granularity information |
CN114529825A (en) * | 2022-04-24 | 2022-05-24 | 城云科技(中国)有限公司 | Target detection model, method and application for fire fighting channel occupation target detection |
CN115223130A (en) * | 2022-09-20 | 2022-10-21 | 南京理工大学 | Multi-task panoramic driving perception method and system based on improved YOLOv5 |
CN115223130B (en) * | 2022-09-20 | 2023-02-03 | 南京理工大学 | Multi-task panoramic driving perception method and system based on improved YOLOv5 |
CN115578624A (en) * | 2022-10-28 | 2023-01-06 | 北京市农林科学院 | Agricultural disease and pest model construction method, detection method and device |
CN116468730A (en) * | 2023-06-20 | 2023-07-21 | 齐鲁工业大学(山东省科学院) | Aerial insulator image defect detection method based on YOLOv5 algorithm |
CN116468730B (en) * | 2023-06-20 | 2023-09-05 | 齐鲁工业大学(山东省科学院) | Aerial Insulator Image Defect Detection Method Based on YOLOv5 Algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN112699859B (en) | 2021-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112699859B (en) | Target detection method, device, storage medium and terminal | |
CN113688723B (en) | Infrared image pedestrian target detection method based on improved YOLOv5 | |
CN110929692A (en) | Three-dimensional target detection method and device based on multi-sensor information fusion | |
CN111461083A (en) | Rapid vehicle detection method based on deep learning | |
CN111797983A (en) | Neural network construction method and device | |
CN111738037B (en) | Automatic driving method, system and vehicle thereof | |
CN111222478A (en) | Construction site safety protection detection method and system | |
CN111611947A (en) | License plate detection method, device, equipment and medium | |
CN110533046B (en) | Image instance segmentation method and device, computer readable storage medium and electronic equipment | |
CN111160481B (en) | Adas target detection method and system based on deep learning | |
CN111242015A (en) | Method for predicting driving danger scene based on motion contour semantic graph | |
CN114187311A (en) | Image semantic segmentation method, device, equipment and storage medium | |
CN115049821A (en) | Three-dimensional environment target detection method based on multi-sensor fusion | |
CN116597411A (en) | Method and system for identifying traffic sign by unmanned vehicle in extreme weather | |
CN112462759B (en) | Evaluation method, system and computer storage medium of rule control algorithm | |
CN111435457B (en) | Method for classifying acquisitions acquired by sensors | |
CN110852272B (en) | Pedestrian detection method | |
JP2018124963A (en) | Image processing device, image recognition device, image processing program, and image recognition program | |
CN111652350A (en) | Neural network visual interpretation method and weak supervision object positioning method | |
CN112465037B (en) | Target detection method, device, computer equipment and storage medium | |
CN115880654A (en) | Vehicle lane change risk assessment method and device, computer equipment and storage medium | |
WO2018143278A1 (en) | Image processing device, image recognition device, image processing program, and image recognition program | |
CN112699809B (en) | Vaccinia category identification method, device, computer equipment and storage medium | |
CN115631457A (en) | Man-machine cooperation abnormity detection method and system in building construction monitoring video | |
JPWO2018143277A1 (en) | Image feature output device, image recognition device, image feature output program, and image recognition program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |