CN112699859A - Target detection method, device, storage medium and terminal - Google Patents

Target detection method, device, storage medium and terminal

Info

Publication number
CN112699859A
Authority
CN
China
Prior art keywords
network
module
target detection
yolov5s
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110310610.1A
Other languages
Chinese (zh)
Other versions
CN112699859B (en)
Inventor
黄仝宇
胡斌杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110310610.1A priority Critical patent/CN112699859B/en
Publication of CN112699859A publication Critical patent/CN112699859A/en
Application granted granted Critical
Publication of CN112699859B publication Critical patent/CN112699859B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method comprising the following steps: acquiring an image shot by a camera in a driving scene; and inputting the image into a trained target detection network, which judges and predicts the image to obtain target classification and position information. The target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network. The invention effectively improves the precision and speed of target detection on driving scene images and meets the front-end lightweight application requirement of driving scenes.

Description

Target detection method, device, storage medium and terminal
Technical Field
The present invention relates to the field of information technologies, and in particular, to a target detection method, an apparatus, a storage medium, and a terminal.
Background
With the rapid development of artificial intelligence technology, a large number of deep-learning-based target detection algorithms have emerged and are widely applied to target detection tasks in fields such as assisted driving, video surveillance, robot vision, and industrial inspection. Visual perception is an important component of road environment perception in assisted driving: it automatically analyzes the images shot by a camera and actively predicts potentially dangerous conditions around the vehicle, such as whether a pedestrian is crossing the road against traffic rules or whether the vehicle ahead is braking suddenly.
In the prior art, when target detection is performed on images shot by a camera in a driving scene, the YOLOv3 algorithm is used as the basic framework and the receptive field of the feature map is enhanced by embedding an SENet structure, so that the feature information learned by the network is more comprehensive. However, this method has the following disadvantages:
(1) SENet only screens and weights features along the channel dimension and cannot adequately capture positional relationship information, so the detection precision is poor.
(2) The YOLOv3 algorithm suffers from insufficient recall and inaccurate localization; compared with earlier versions such as YOLOv1 and YOLOv2, the accuracy of YOLOv3 is improved but its detection speed is reduced.
(3) The detection precision for partially occluded targets is low, making it difficult to meet the application requirements of traffic road scenes.
(4) Because positive and negative samples are unbalanced in target detection under the driving scene, the model pays more attention to easy samples, resulting in low model performance.
Disclosure of Invention
The embodiments of the invention provide a target detection method, a target detection device, a storage medium, and a terminal, aiming to solve the prior-art problems of low detection precision and low detection speed when target detection is performed on images shot by a camera in a driving scene.
A method of target detection, comprising:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
Optionally, the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_x layer of the YOLOv5s backbone network to obtain a cross-stage partial network based on the bottleneck attention mechanism module.
Optionally, the target detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module; the CBH module consists of a convolution operation, a normalization operation, and an activation function, while the MBH module consists of an inverted residual module based on a depthwise separable convolution operation, a normalization operation, and an activation function.
Optionally, the inverted residual module based on the depthwise separable convolution operation comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
Optionally, the target detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a classification loss function, a bounding-box regression loss function, and a confidence loss function.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function.
An object detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring images shot by the camera in a driving scene;
the detection module is used for inputting the image to a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the object detection method as described above.
A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the object detection method as described above when executing the computer program.
According to the embodiments of the invention, a target detection network is constructed in advance. The target detection network adopts a lightweight YOLOv5s network structure as its basic framework, and a bottleneck attention mechanism module is embedded in the cross-stage partial network of the YOLOv5s backbone network, so that channel and spatial feature information can be screened simultaneously, the expression capability of the network's channel and spatial features is improved, and the network's perception range of the target feature region is expanded. In addition, depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network, which effectively reduces the parameter count and increases the detection speed. When detecting targets, the image shot by the camera in the driving scene is acquired and input into the trained target detection network, which judges and predicts the image to obtain target classification and position information; the precision and speed of target detection on driving scene images are thereby effectively improved, and the front-end lightweight application requirement of the driving scene is met.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of a method of target detection in an embodiment of the invention;
FIG. 2 is a schematic diagram of a target detection network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a bottleneck attention mechanism module according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a CBH module according to an embodiment of the present invention;
FIG. 5 is a block diagram of a BAM-CSP1_x network module according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the structure of an MBH module in an embodiment of the present invention;
FIG. 7 is a block diagram illustrating an inverted residual module based on a depth separable convolution operation according to an embodiment of the present invention;
FIG. 8 is a functional block diagram of an object detection device in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a target detection method. The target detection method is applied to an assisted driving system, so that the system can detect targets such as pedestrians and vehicles as early and as accurately as possible and, in combination with other technologies, can remind the driver in time to take actions such as braking or steering in an emergency, thereby avoiding collisions and ensuring driving safety and traffic order. The target detection method provided in this embodiment is described in detail below. As shown in fig. 1, the target detection method includes:
in step S101, an image captured by a camera in a driving scene is acquired.
The embodiment of the invention performs target detection on the images shot by the camera in the driving scene, including motor vehicle detection, non-motor vehicle detection and pedestrian detection, and obtains the position of the target.
In step S102, the image is input to a trained target detection network, and the image is judged and predicted by the target detection network, so as to obtain target classification and position information.
The embodiment of the invention aims to improve the accuracy of target detection in images shot by a camera in a driving scene, and designs a deep neural network model, namely the target detection network, which achieves a lightweight model while improving the accuracy of target detection.
As shown in fig. 2, the target detection network includes four parts, namely an Input layer, a Backbone network, a Neck structure, and an Output layer. The Input layer preprocesses the input image, where the preprocessing includes but is not limited to data augmentation, adaptive scaling, and adaptive anchor box computation; the Backbone network aggregates fine-grained features of different images, forms feature maps, and outputs them to the Neck structure; the Neck structure performs feature fusion across the different detection layers from different backbone layers, enhancing the network's feature-fusion capability; and the Output layer generates the bounding boxes and classes of the predicted targets.
In one embodiment of the present invention, the target detection network adopts a lightweight YOLOv5s network structure as its basic framework, and a bottleneck attention mechanism module is embedded in the cross-stage partial network of the YOLOv5s backbone network. In this embodiment, the bottleneck attention mechanism module is embedded into the YOLOv5 algorithm: channel attention and spatial attention modules are added to the feature extraction network and used to screen channel and spatial feature information simultaneously, which improves the network's channel and spatial feature expression capability and expands the network's perception range of target feature regions.
Here, the Bottleneck Attention Module (BAM) is a hybrid attention model that can be embedded into a feed-forward convolutional neural network and contains two branch networks, a channel attention module and a spatial attention module. Fig. 3 is a schematic structural diagram of the bottleneck attention mechanism module according to an embodiment of the present invention. Given an input feature map F, the two independent branch structures of the BAM module, the channel attention branch (Channel Attention) and the spatial attention branch (Spatial Attention), produce the feature maps M_C(F) and M_S(F) respectively, which are fused into a single attention map M(F). Point-by-point multiplication of F with M(F) suppresses unimportant features and highlights important ones, and the result is then added to the input feature map F to obtain the refined feature map F' = F + F ⊗ M(F).
In the existing network structure of YOLOv5s, a module composed of the convolution operation Conv2d, the normalization operation BatchNorm, and the HardSwish activation function is referred to as a CBH module; fig. 4 is a schematic structural diagram of the CBH module provided in the embodiment of the present invention. The BottleneckCSP1_X layer consists of a CBH module and X Res unit residual structures; the BottleneckCSP2_x layer has a structure similar to that of BottleneckCSP1_x, except that the N Bottleneck modules are replaced by N CBH modules; and Spatial Pyramid Pooling (SPP) performs multi-scale fusion by max pooling with kernels of 1 × 1, 5 × 5, 9 × 9, and 13 × 13. In order to improve the feature expression capability of the network, in the embodiment of the present invention a bottleneck attention mechanism module is embedded into the BottleneckCSP1_x layer of the YOLOv5s backbone network to form a repeating unit composed of a CBH module and a BAM module, yielding a cross-stage partial network based on the bottleneck attention mechanism module, referred to herein as the BAM-CSP1_x network module. Fig. 5 is a schematic structural diagram of the BAM-CSP1_x network module according to an embodiment of the present invention. The feature map is passed into the BAM-CSP1_x network module, and the influence of other factors is suppressed as much as possible at the lower layers of YOLOv5s through the channel attention mechanism and the spatial attention mechanism, so that the network focuses on effective feature information, suppresses unimportant feature information, and concentrates on extracting target features in the driving scene, which benefits the detection accuracy.
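The following is a minimal PyTorch-style sketch of how a bottleneck attention module of this kind could be realized and wrapped around a CSP stage. The branch layouts, the reduction ratio and dilation, and the class names (BAM, BAMCSP1) are illustrative assumptions for clarity, not the exact structure disclosed by the patent.

import torch
import torch.nn as nn

class BAM(nn.Module):
    """Bottleneck attention sketch: a channel branch and a spatial branch are
    fused into one attention map M(F), applied residually as F' = F + F * M(F)."""
    def __init__(self, channels: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        mid = channels // reduction
        # Channel attention branch: global pooling followed by a bottleneck MLP (1x1 convs)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )
        # Spatial attention branch: channel reduction and dilated 3x3 convs down to a 1-channel map
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1),
        )

    def forward(self, x):
        att = torch.sigmoid(self.channel_att(x) + self.spatial_att(x))  # fused attention map M(F)
        return x + x * att                                              # refined feature map F'

class BAMCSP1(nn.Module):
    """Illustrative BAM-CSP1_x stand-in: an existing BottleneckCSP-style stage followed by a BAM block."""
    def __init__(self, csp_stage: nn.Module, channels: int):
        super().__init__()
        self.csp = csp_stage
        self.bam = BAM(channels)

    def forward(self, x):
        return self.bam(self.csp(x))

A BottleneckCSP1_x stage of the backbone would then be wrapped as BAMCSP1(stage, channels), so that its output is refined by the attention map before being passed to the next stage.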
In another embodiment of the present invention, depthwise separable convolution operations are also employed in designated convolution layers of the YOLOv5s backbone network. Specifically, a designated CBH module in the YOLOv5s backbone network is replaced with an MBH module, where the MBH module is obtained by replacing the convolution operation Conv2d in the CBH module with an inverted residual module (denoted herein as the Mod module) based on a depthwise separable convolution operation. As a preferred example of the present invention, as shown in fig. 6, the MBH module is composed of an inverted residual module based on a depthwise separable convolution operation, BatchNorm2d normalization, and a HardSwish activation function: the feature map is passed through the inverted residual module and then through the BatchNorm2d normalization and HardSwish activation operations in turn.
As shown in fig. 7, the inverted residual module based on the depthwise separable convolution operation includes a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
Here, the depthwise separable convolution splits a feature along its spatial and channel dimensions using a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, which effectively reduces the computation of the model. However, because the depthwise separable convolution compresses not only the computation of the convolution kernel but also its exploration space, the feature expression capability is weakened once the convolution's capacity to explore the feature space is compressed. In view of this, the embodiment of the present invention uses an inverted residual model based on the depthwise separable convolution operation: a 1 × 1 convolution added before the depthwise convolution expands the low-dimensional feature representation into a high-dimensional one, feature extraction is performed using the depthwise separable convolution operation, and the features are then compressed back into a low-dimensional space. In this inverted residual model, the first pointwise convolution layer and the depthwise convolution layer are each followed by a BatchNorm operation and the non-linear ReLU6 function. When the number of channels is large, the features may fall partly into a low-dimensional space; although the ReLU6 function preserves good feature extraction capability there, it can instead reduce the feature extraction capability of the network after the features are transformed from high dimension back to low dimension, so the ReLU6 function is not used after the final second pointwise convolution layer. Finally, the original feature map is fused with the feature map produced by the depthwise separable convolution through a shortcut connection to generate a new feature map. It will be appreciated that "first low-dimensional" and "first high-dimensional", as well as "second high-dimensional" and "second low-dimensional", are relative terms.
As a preferred example of the present invention, specifically, the fourth CBH module in the YOLOv5s backbone network may be replaced by an MBH module. The embodiment of the invention thus uses the YOLOv5 algorithm, which has a small parameter scale and a very high inference speed, as the basic framework, and replaces the computationally heavy convolution layer in the backbone network with a depthwise separable convolution, effectively reducing the parameter count, increasing the target detection speed, and still obtaining a good detection effect.
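As a sketch under the same assumptions, the inverted residual (Mod) block and the MBH wrapper described above could look as follows in PyTorch; the expansion ratio, stride handling, and class names are illustrative, not taken from the patent.

import torch.nn as nn

class InvertedResidual(nn.Module):
    """Mod-block sketch: 1x1 pointwise expand -> 3x3 depthwise -> 1x1 pointwise project,
    with ReLU6 after the first two convolutions only, and a skip connection when shapes match."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 4):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # first pointwise convolution: expand the low-dimensional features to a high dimension
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # depthwise convolution: per-channel spatial feature extraction
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # second pointwise convolution: project back to a low dimension, no ReLU6 here
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

class MBH(nn.Module):
    """MBH sketch: inverted residual module followed by BatchNorm2d and HardSwish."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.mod = InvertedResidual(in_ch, out_ch, stride)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.mod(x)))

Replacing the fourth CBH module of the backbone would then amount to swapping that CBH instance for an MBH instance with matching channel counts and stride.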
For the constructed target detection network, the embodiment of the invention uses the Adam optimization method and a preset loss function to train in an end-to-end manner. Optionally, the training image size is 640 × 640, the batch size is set to 16, and the number of epochs is set to 300.
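A hedged sketch of this end-to-end training setup is given below; the dataset object, learning rate, and loss implementation are placeholders (the patent only fixes Adam, 640 × 640 inputs, batch size 16, and 300 epochs).

import torch
from torch.utils.data import DataLoader

def train(model, dataset, compute_loss, device="cuda"):
    loader = DataLoader(dataset, batch_size=16, shuffle=True)   # batch size 16
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam; the learning rate is an assumed placeholder
    model.to(device).train()
    for epoch in range(300):                                    # 300 epochs
        for images, targets in loader:                          # images resized to 640 x 640 upstream
            preds = model(images.to(device))
            loss = compute_loss(preds, targets)                 # preset loss function, see formula (1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()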
In one embodiment of the invention, the loss function LOSS of the YOLOv5s network in the target detection network consists of a classification loss function L_cls, a bounding-box regression loss function L_box, and a confidence loss function L_obj, as shown in formula (1):

LOSS = L_cls + L_box + L_obj    (1)
The classification loss function L_cls conventionally uses a BCE (Binary Cross Entropy) loss. To address the imbalance between positive and negative samples in target detection under the driving scene, the embodiment of the invention replaces the classification loss function L_cls with a gradient harmonizing mechanism loss function (GHM Loss for short). In the gradient harmonizing mechanism loss, for a candidate box, let p be the probability predicted by the model and p* be the ground-truth label of a certain class; the binary cross entropy loss is calculated as shown in formula (2):

L_CE(p, p*) = −p* · log(p) − (1 − p*) · log(1 − p)    (2)

To handle the problem of gradient-norm imbalance, a gradient density function GD(g) is used, as shown in formula (3):

GD(g) = (1 / l_ε(g)) · Σ_{k=1}^{N} δ_ε(g_k, g)    (3)

In formula (3), Σ_{k=1}^{N} δ_ε(g_k, g) counts, among samples 1 to N, the number of samples whose gradient norm g_k falls within the interval of length ε centered at g, and l_ε(g) represents the length of that interval.

The gradient harmonizing mechanism loss used for classification, L_GHM-C, is shown in formula (4):

L_GHM-C = Σ_{i=1}^{N} L_CE(p_i, p_i*) / GD(g_i)    (4)
By replacing the classification loss function with the gradient harmonizing mechanism loss function GHM Loss for the problem of imbalanced positive and negative samples in target detection under the driving scene, the weights of easy negative samples and of extremely hard outlier samples among the candidate samples are reduced and the weights of normal hard samples are increased, so that the model can concentrate on effective normal hard samples, effectively improving the performance of the model.
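A simplified sketch of a gradient harmonizing classification loss following formulas (2) to (4) is shown below: the gradient norm g = |p − p*| of each sample is binned, the gradient density is estimated per bin, and each sample's BCE loss is divided by the density of its bin. The number of bins and the omission of the moving-average variant of GHM are simplifying assumptions.

import torch
import torch.nn.functional as F

def ghm_c_loss(logits, targets, bins: int = 10):
    """GHM-C sketch for logits and 0/1 float targets of the same shape:
    down-weight samples that fall into densely populated gradient-norm bins."""
    probs = torch.sigmoid(logits)
    g = (probs.detach() - targets).abs()          # gradient norm g = |p - p*| of the BCE loss
    n = logits.numel()
    weights = torch.zeros_like(logits)
    edges = torch.linspace(0, 1, bins + 1, device=logits.device)
    edges[-1] += 1e-6                             # include samples with g == 1 in the last bin
    for i in range(bins):
        in_bin = (g >= edges[i]) & (g < edges[i + 1])
        count = in_bin.sum().item()
        if count > 0:
            gd = count / (edges[i + 1] - edges[i]).item()   # gradient density GD(g) for this bin
            weights[in_bin] = n / gd                        # beta_i = N / GD(g_i)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")   # formula (2)
    return (weights * bce).sum() / n                                              # formula (4)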
For the case of partially occluded targets, the embodiment of the invention may further replace the bounding-box regression loss function with a repulsive force loss function (Repulsion Loss). In this case, the loss function of the target detection network consists of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function. The repulsive force loss function reduces the distance between a prediction box and its target box in target detection while increasing the distance between the prediction box and surrounding target boxes or prediction boxes. The repulsive force loss function consists of three parts: the first part is the loss generated between the prediction box and its target box; the second part is the loss generated between the prediction box and surrounding target boxes; and the third part is the loss generated between the prediction box and surrounding prediction boxes that do not predict the same target. Two relationship coefficients, α and β, are used to adjust the second and third loss terms; the larger the distance to the surrounding targets, the smaller the loss value. The expression of the repulsive force loss function is shown in formula (5):

L_Rep = L_Attr + α · L_RepGT + β · L_RepBox, where

L_Attr = Σ_{P∈P_+} Smooth_L1(B^P, G_Attr^P) / |P_+|

L_RepGT = Σ_{P∈P_+} Smooth_ln(IoG(B^P, G_Rep^P)) / |P_+|

L_RepBox = Σ_{i≠j} Smooth_ln(IoU(B^{P_i}, B^{P_j})) / (Σ_{i≠j} 1[IoU(B^{P_i}, B^{P_j}) > 0] + ε)    (5)

The first term on the right side of formula (5) is the loss function of the regression model, where P_+ represents the set of positive samples, B^P is the prediction box regressed from proposal P, G_Attr^P is the real target box having the largest Intersection over Union (IoU) with proposal P, and the Smooth_L1 function measures the distance between B^P and G_Attr^P. In the second term, G_Rep^P is the real box, other than the one corresponding to P, that has the largest IoU with P; IoG represents the percentage of the area of G_Rep^P covered by its overlap region with B^P, and the Smooth_ln function measures the distance between the prediction box and the real boxes of surrounding targets. In the third term, the proposals P are divided into different subsets; B^{P_i} and B^{P_j} represent prediction boxes of different targets, and the Smooth_ln function measures the distance between a prediction box and the prediction boxes of surrounding targets, so that the overlap area of proposals from different subsets is as small as possible. From the denominator of the third term, it can be seen that a loss value is only counted when the prediction boxes have an overlapping area; if they are not adjacent at all, no loss is counted. The loss of the third term reduces the probability that the bounding boxes of different regression targets are merged into one, so the embodiment is more robust when traffic road targets are partially occluded, effectively improving the detection effect.
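A hedged sketch of the three terms of formula (5) follows: smooth-L1 attraction towards the matched ground truth, a Smooth_ln penalty on the IoG with the most-overlapping non-matched ground truth, and a Smooth_ln penalty on the IoU between prediction boxes assigned to different targets. The σ parameter, the (x1, y1, x2, y2) box format, and the helper names are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

def _area(boxes):
    return (boxes[:, 2] - boxes[:, 0]).clamp(min=0) * (boxes[:, 3] - boxes[:, 1]).clamp(min=0)

def _pairwise_intersection(a, b):
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    return wh[..., 0] * wh[..., 1]

def smooth_ln(x, sigma: float = 0.5):
    # Smooth_ln penalty: ln-shaped below sigma, linear above, so large overlaps are punished strongly
    return torch.where(x <= sigma,
                       -torch.log(1 - x),
                       (x - sigma) / (1 - sigma) - math.log(1 - sigma))

def repulsion_loss(pred, attr_gt, rep_gt, gt_index, alpha: float = 0.5, beta: float = 0.5):
    """Formula (5) sketch: L = L_Attr + alpha * L_RepGT + beta * L_RepBox.
    pred:     (P, 4) prediction boxes B^P regressed from the positive proposals P_+
    attr_gt:  (P, 4) matched ground-truth boxes G_Attr^P (largest IoU with each proposal)
    rep_gt:   (P, 4) repulsion ground-truth boxes G_Rep^P (largest IoU among the non-matched boxes)
    gt_index: (P,)   index of the target each proposal is assigned to"""
    # Attraction term: pull B^P towards G_Attr^P
    l_attr = F.smooth_l1_loss(pred, attr_gt, reduction="mean")

    # RepGT term: push B^P away from G_Rep^P via the IoG overlap ratio
    iog = _pairwise_intersection(pred, rep_gt).diagonal() / (_area(rep_gt) + 1e-6)
    l_repgt = smooth_ln(iog.clamp(max=1 - 1e-3)).mean()

    # RepBox term: push apart prediction boxes that regress to different targets
    inter = _pairwise_intersection(pred, pred)
    union = _area(pred)[:, None] + _area(pred)[None, :] - inter
    iou = inter / (union + 1e-6)
    diff_target = gt_index[:, None] != gt_index[None, :]
    overlapping = (iou > 0) & diff_target
    l_repbox = smooth_ln(iou[overlapping].clamp(max=1 - 1e-3)).sum() / (overlapping.sum() + 1e-6)

    return l_attr + alpha * l_repgt + beta * l_repbox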
For the target detection network, the embodiment of the present invention adopts Precision, Recall, mean Average Precision (mAP), and detection speed (Frames Per Second, FPS) as evaluation indexes, calculated and explained as follows:

1. Precision represents the proportion of samples classified as positive that are actually positive; it is denoted by the letter P and expressed as shown in formula (6):

P = TP / (TP + FP)    (6)

where TP + FP is the number of samples predicted to be of the positive class, and TP is the number of positive-class samples predicted as positive.

2. Recall indicates how many of the actual positive samples are classified as positive; it is denoted by the letter R, measures the coverage of the detection result, and is shown in formula (7):

R = TP / (TP + FN)    (7)

3. mAP is the mean of the average precisions of all classes in the dataset, and AP is the average precision of a certain class. For the i-th class, with different IoU thresholds selectable per category, the average precision is calculated as follows:

AP_i = ∫_0^1 P(R) dR    (8)

Its geometric meaning is the area enclosed by the precision-recall curve and the horizontal axis. With N classes, the mean average precision is calculated as follows:

mAP = (1/N) · Σ_{i=1}^{N} AP_i    (9)

4. FPS is the number of image frames detected per second; this index depends not only on the computational cost of the algorithm model but also on the hardware performance during the experiments. Generally, if the detection speed is not less than 25 fps, the algorithm model can be considered to meet the real-time requirement.
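A small sketch of these evaluation indexes, formulas (6) to (9), is given below; the all-point interpolation used to compute the area under the precision-recall curve is an implementation assumption, since the patent does not fix a particular integration scheme.

import numpy as np

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0     # formula (6)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0     # formula (7)

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP of one class: area under its precision-recall curve, formula (8)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # monotone non-increasing precision envelope
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class) -> float:
    """mAP: mean of the per-class AP values, formula (9)."""
    return float(np.mean(ap_per_class)) if len(ap_per_class) else 0.0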
The embodiment of the invention provides a lightweight, deep-neural-network-based method for detecting targets in images shot in a driving scene, with a series of improvements and optimizations built on YOLOv5s. Compared with the existing SE + YOLOv3 network structure, the accuracy on an image dataset shot by a camera in a driving scene, tested in a GTX1080 environment, is greatly improved, and the prediction boxes are closer to the real target boxes. Compared with the original SE + YOLOv3 network structure, the model size of this embodiment is also greatly reduced while the detection accuracy improves, meeting the front-end lightweight application requirement of driving scenes.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, the present invention further provides an object detection apparatus, which corresponds one-to-one to the object detection method in the foregoing embodiment. As shown in fig. 8, the object detection device includes an acquisition module 81 and a detection module 82. The functional modules are explained in detail as follows:
an acquisition module 81 for acquiring an image captured by a camera in a driving scene;
the detection module 82 is configured to input the image to a trained target detection network, and judge and predict the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
Optionally, the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_x layer of the YOLOv5s backbone network to obtain a cross-stage partial network based on the bottleneck attention mechanism module.
Optionally, the target detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module; the CBH module consists of a convolution operation, a normalization operation, and an activation function, while the MBH module consists of an inverted residual module based on a depthwise separable convolution operation, a normalization operation, and an activation function.
Optionally, the inverted residual module based on the depthwise separable convolution operation comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
Optionally, the target detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a classification loss function, a bounding-box regression loss function, and a confidence loss function.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function consists of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function.
For specific limitations of the target detection device, reference may be made to the above limitations of the target detection method, which are not described herein again. The modules in the target detection device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in a computer device, and can also be stored in a memory in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of object detection.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of object detection, comprising:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
2. The target detection method of claim 1, wherein the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_x layer of the YOLOv5s backbone network to obtain a cross-stage partial network based on the bottleneck attention mechanism module.
3. The object detection method of claim 2, wherein the object detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module, the CBH module consisting of a convolution operation, a normalization operation, and an activation function, and the MBH module consisting of an inverted residual module based on a depthwise separable convolution operation, a normalization operation, and an activation function.
4. The object detection method of claim 3, wherein the inverted residual module based on the depthwise separable convolution operation comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
5. The object detection method of claim 3 or 4, wherein the object detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
6. The object detection method according to any one of claims 1 to 4, wherein the object detection network is trained by a preset loss function;
the loss function is composed of a classification loss function, a bounding-box regression loss function, and a confidence loss function.
7. The object detection method according to any one of claims 1 to 4, wherein the object detection network is trained by a preset loss function;
the loss function consists of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function.
8. An object detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring images shot by the camera in a driving scene;
the detection module is used for inputting the image to a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the object detection method according to any one of claims 1 to 7.
10. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the object detection method according to any one of claims 1 to 7 when executing the computer program.
CN202110310610.1A 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal Active CN112699859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110310610.1A CN112699859B (en) 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110310610.1A CN112699859B (en) 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN112699859A true CN112699859A (en) 2021-04-23
CN112699859B CN112699859B (en) 2021-07-16

Family

ID=75515587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110310610.1A Active CN112699859B (en) 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN112699859B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160062A (en) * 2021-05-25 2021-07-23 烟台艾睿光电科技有限公司 Infrared image target detection method, device, equipment and storage medium
CN113449691A (en) * 2021-07-21 2021-09-28 天津理工大学 Human shape recognition system and method based on non-local attention mechanism
CN113469087A (en) * 2021-07-09 2021-10-01 上海智臻智能网络科技股份有限公司 Method, device, equipment and medium for detecting picture frame in building drawing
CN113569702A (en) * 2021-07-23 2021-10-29 闽江学院 Deep learning-based truck single-tire and double-tire identification method
CN113705604A (en) * 2021-07-15 2021-11-26 中国科学院信息工程研究所 Botnet flow classification detection method and device, electronic equipment and storage medium
CN113887706A (en) * 2021-09-30 2022-01-04 苏州浪潮智能科技有限公司 Method and device for low bit quantization aiming at one-stage target detection network
CN113963167A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN114529825A (en) * 2022-04-24 2022-05-24 城云科技(中国)有限公司 Target detection model, method and application for fire fighting channel occupation target detection
CN114549970A (en) * 2022-01-13 2022-05-27 山东师范大学 Night small target fruit detection method and system fusing global fine-grained information
CN115223130A (en) * 2022-09-20 2022-10-21 南京理工大学 Multi-task panoramic driving perception method and system based on improved YOLOv5
CN115578624A (en) * 2022-10-28 2023-01-06 北京市农林科学院 Agricultural disease and pest model construction method, detection method and device
CN116468730A (en) * 2023-06-20 2023-07-21 齐鲁工业大学(山东省科学院) Aerial insulator image defect detection method based on YOLOv5 algorithm
CN114549970B (en) * 2022-01-13 2024-06-07 山东师范大学 Night small target fruit detection method and system integrating global fine granularity information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633610A (en) * 2019-05-17 2019-12-31 西南交通大学 Student state detection algorithm based on YOLO
CN110852222A (en) * 2019-10-31 2020-02-28 上海交通大学 Campus corridor scene intelligent monitoring method based on target detection
CN112233090A (en) * 2020-10-15 2021-01-15 浙江工商大学 Film flaw detection method based on improved attention mechanism
CN112307921A (en) * 2020-10-22 2021-02-02 桂林电子科技大学 Vehicle-mounted end multi-target identification tracking prediction method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633610A (en) * 2019-05-17 2019-12-31 西南交通大学 Student state detection algorithm based on YOLO
CN110852222A (en) * 2019-10-31 2020-02-28 上海交通大学 Campus corridor scene intelligent monitoring method based on target detection
CN112233090A (en) * 2020-10-15 2021-01-15 浙江工商大学 Film flaw detection method based on improved attention mechanism
CN112307921A (en) * 2020-10-22 2021-02-02 桂林电子科技大学 Vehicle-mounted end multi-target identification tracking prediction method

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160062A (en) * 2021-05-25 2021-07-23 烟台艾睿光电科技有限公司 Infrared image target detection method, device, equipment and storage medium
CN113469087A (en) * 2021-07-09 2021-10-01 上海智臻智能网络科技股份有限公司 Method, device, equipment and medium for detecting picture frame in building drawing
CN113705604A (en) * 2021-07-15 2021-11-26 中国科学院信息工程研究所 Botnet flow classification detection method and device, electronic equipment and storage medium
CN113449691A (en) * 2021-07-21 2021-09-28 天津理工大学 Human shape recognition system and method based on non-local attention mechanism
CN113569702A (en) * 2021-07-23 2021-10-29 闽江学院 Deep learning-based truck single-tire and double-tire identification method
CN113569702B (en) * 2021-07-23 2023-10-27 闽江学院 Truck single-double tire identification method based on deep learning
CN113887706A (en) * 2021-09-30 2022-01-04 苏州浪潮智能科技有限公司 Method and device for low bit quantization aiming at one-stage target detection network
CN113887706B (en) * 2021-09-30 2024-02-06 苏州浪潮智能科技有限公司 Method and device for low-bit quantization of one-stage target detection network
CN113963167B (en) * 2021-10-29 2022-05-27 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN113963167A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN114549970A (en) * 2022-01-13 2022-05-27 山东师范大学 Night small target fruit detection method and system fusing global fine-grained information
CN114549970B (en) * 2022-01-13 2024-06-07 山东师范大学 Night small target fruit detection method and system integrating global fine granularity information
CN114529825A (en) * 2022-04-24 2022-05-24 城云科技(中国)有限公司 Target detection model, method and application for fire fighting channel occupation target detection
CN115223130A (en) * 2022-09-20 2022-10-21 南京理工大学 Multi-task panoramic driving perception method and system based on improved YOLOv5
CN115223130B (en) * 2022-09-20 2023-02-03 南京理工大学 Multi-task panoramic driving perception method and system based on improved YOLOv5
CN115578624A (en) * 2022-10-28 2023-01-06 北京市农林科学院 Agricultural disease and pest model construction method, detection method and device
CN116468730A (en) * 2023-06-20 2023-07-21 齐鲁工业大学(山东省科学院) Aerial insulator image defect detection method based on YOLOv5 algorithm
CN116468730B (en) * 2023-06-20 2023-09-05 齐鲁工业大学(山东省科学院) Aerial Insulator Image Defect Detection Method Based on YOLOv5 Algorithm

Also Published As

Publication number Publication date
CN112699859B (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN112699859B (en) Target detection method, device, storage medium and terminal
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN110929692A (en) Three-dimensional target detection method and device based on multi-sensor information fusion
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN111797983A (en) Neural network construction method and device
CN111738037B (en) Automatic driving method, system and vehicle thereof
CN111222478A (en) Construction site safety protection detection method and system
CN111611947A (en) License plate detection method, device, equipment and medium
CN110533046B (en) Image instance segmentation method and device, computer readable storage medium and electronic equipment
CN111160481B (en) Adas target detection method and system based on deep learning
CN111242015A (en) Method for predicting driving danger scene based on motion contour semantic graph
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN115049821A (en) Three-dimensional environment target detection method based on multi-sensor fusion
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN112462759B (en) Evaluation method, system and computer storage medium of rule control algorithm
CN111435457B (en) Method for classifying acquisitions acquired by sensors
CN110852272B (en) Pedestrian detection method
JP2018124963A (en) Image processing device, image recognition device, image processing program, and image recognition program
CN111652350A (en) Neural network visual interpretation method and weak supervision object positioning method
CN112465037B (en) Target detection method, device, computer equipment and storage medium
CN115880654A (en) Vehicle lane change risk assessment method and device, computer equipment and storage medium
WO2018143278A1 (en) Image processing device, image recognition device, image processing program, and image recognition program
CN112699809B (en) Vaccinia category identification method, device, computer equipment and storage medium
CN115631457A (en) Man-machine cooperation abnormity detection method and system in building construction monitoring video
JPWO2018143277A1 (en) Image feature output device, image recognition device, image feature output program, and image recognition program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant