CN112699859A - Target detection method, device, storage medium and terminal - Google Patents

Target detection method, device, storage medium and terminal

Info

Publication number
CN112699859A
Authority
CN
China
Prior art keywords
network
module
target detection
yolov5s
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110310610.1A
Other languages
Chinese (zh)
Other versions
CN112699859B (en)
Inventor
黄仝宇
胡斌杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110310610.1A priority Critical patent/CN112699859B/en
Publication of CN112699859A publication Critical patent/CN112699859A/en
Application granted granted Critical
Publication of CN112699859B publication Critical patent/CN112699859B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method comprising the following steps: acquiring an image shot by a camera in a driving scene; and inputting the image into a trained target detection network, which judges and predicts the image to obtain target classification and position information. The target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network. The invention effectively improves the precision and speed of target detection on driving scene images and meets the front-end lightweight application requirement of driving scenes.

Description

Target detection method, device, storage medium and terminal
Technical Field
The present invention relates to the field of information technologies, and in particular, to a target detection method, an apparatus, a storage medium, and a terminal.
Background
With the rapid development of artificial intelligence technology, a large number of deep-learning-based target detection algorithms have emerged and are widely applied to target detection tasks in fields such as assisted driving, video surveillance, robot vision, and industrial inspection. Visual perception is an important component of road environment perception in assisted driving: it automatically analyzes the images shot by a camera and actively predicts potentially dangerous conditions around the vehicle, such as whether a pedestrian is crossing the road against traffic rules or whether the vehicle ahead is braking suddenly.
In the prior art, when target detection is performed on images shot by a camera in a driving scene, the YOLOv3 algorithm is used as the basic framework and the receptive field of the feature map is enhanced by embedding an SENet structure, so that the feature information learned by the network is more comprehensive. However, this method has the following disadvantages:
(1) SENet only screens and weights features along the channel dimension and cannot adequately capture positional relationship information, so the detection precision is poor.
(2) The YOLOv3 algorithm suffers from insufficient recall and inaccurate localization; compared with earlier versions such as YOLOv1 and YOLOv2, the accuracy of YOLOv3 is improved but its detection speed is reduced.
(3) The detection precision for partially occluded targets is low, making it difficult to meet the application requirements of traffic road scenes.
(4) Because positive and negative samples are unbalanced in target detection under the driving scene, the model pays more attention to easy samples, resulting in low model performance.
Disclosure of Invention
The embodiments of the invention provide a target detection method, a target detection device, a storage medium, and a terminal, aiming to solve the prior-art problems of low detection precision and low detection speed when target detection is performed on images shot by a camera in a driving scene.
A method of target detection, comprising:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
Optionally, the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_x layer of the YOLOv5s backbone network to obtain a cross-stage partial network based on the bottleneck attention mechanism module.
Optionally, the target detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module; the CBH module consists of a convolution operation, a normalization operation, and an activation function, while the MBH module consists of an inverted residual module based on a depthwise separable convolution operation, a normalization operation, and an activation function.
Optionally, the inverted residual module based on the depthwise separable convolution operation comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
Optionally, the target detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a classification loss function, a bounding-box regression loss function, and a confidence loss function.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function.
An object detection apparatus, the apparatus comprising:
the acquisition module is used for acquiring images shot by the camera in a driving scene;
the detection module is used for inputting the image to a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the object detection method as described above.
A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the object detection method as described above when executing the computer program.
According to the embodiments of the invention, a target detection network is constructed in advance. The target detection network adopts a lightweight YOLOv5s network structure as its basic framework, and a bottleneck attention mechanism module is embedded in the cross-stage partial network of the YOLOv5s backbone network, so that channel and spatial feature information can be screened simultaneously, the expression capability of the network's channel and spatial features is improved, and the network's perception range of the target feature region is expanded. In addition, depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network, which effectively reduces the parameter count and increases the detection speed. When detecting targets, the image shot by the camera in the driving scene is acquired and input into the trained target detection network, which judges and predicts the image to obtain target classification and position information; the precision and speed of target detection on driving scene images are thereby effectively improved, and the front-end lightweight application requirement of the driving scene is met.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of a method of target detection in an embodiment of the invention;
FIG. 2 is a schematic diagram of a target detection network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a bottleneck attention mechanism module according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of a CBH module according to an embodiment of the present invention;
FIG. 5 is a block diagram of a BAM-CSP1_x network module according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the structure of an MBH module in an embodiment of the present invention;
FIG. 7 is a block diagram illustrating an inverted residual module based on a depth separable convolution operation according to an embodiment of the present invention;
FIG. 8 is a functional block diagram of an object detection device in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a target detection method. The target detection method is applied to an assisted driving system, so that the system can detect targets such as pedestrians and vehicles as early and as accurately as possible and, in combination with other technologies, can remind the driver in time to take actions such as braking or steering in an emergency, thereby avoiding collisions and ensuring driving safety and traffic order. The target detection method provided in this embodiment is described in detail below. As shown in fig. 1, the target detection method includes:
in step S101, an image captured by a camera in a driving scene is acquired.
The embodiment of the invention performs target detection on the images shot by the camera in the driving scene, including motor vehicle detection, non-motor vehicle detection and pedestrian detection, and obtains the position of the target.
In step S102, the image is input to a trained target detection network, and the image is judged and predicted by the target detection network, so as to obtain target classification and position information.
The embodiment of the invention aims to improve the accuracy of target detection in images shot by a camera in a driving scene, and designs a deep neural network model, namely the target detection network, which achieves a lightweight model while improving the accuracy of target detection.
As shown in fig. 2, the target detection network includes four parts, namely an Input layer, a Backbone network, a Neck structure, and an Output layer. The Input layer preprocesses the input image, where the preprocessing includes but is not limited to data augmentation, adaptive scaling, and adaptive anchor box computation; the Backbone network aggregates fine-grained features of different images, forms feature maps, and outputs them to the Neck structure; the Neck structure performs feature fusion across the different detection layers from different backbone layers, enhancing the network's feature-fusion capability; and the Output layer generates the bounding boxes and classes of the predicted targets.
In one embodiment of the present invention, the target detection network adopts a lightweight YOLOv5s network structure as its basic framework, and a bottleneck attention mechanism module is embedded in the cross-stage partial network of the YOLOv5s backbone network. In this embodiment, the bottleneck attention mechanism module is embedded into the YOLOv5 algorithm: channel attention and spatial attention modules are added to the feature extraction network and used to screen channel and spatial feature information simultaneously, which improves the network's channel and spatial feature expression capability and expands the network's perception range of target feature regions.
Here, the Bottleneck Attention Module (BAM) is a hybrid attention model that can be embedded into a feed-forward convolutional neural network and contains two branch networks, a channel attention module and a spatial attention module. Fig. 3 is a schematic structural diagram of the bottleneck attention mechanism module according to an embodiment of the present invention. Given an input feature map F, the two independent branch structures of the BAM module, the channel attention branch (Channel Attention) and the spatial attention branch (Spatial Attention), produce the feature maps M_C(F) and M_S(F) respectively, which are fused into a single attention map M(F). Point-by-point multiplication of F with M(F) suppresses unimportant features and highlights important ones, and the result is then added to the input feature map F to obtain the refined feature map F' = F + F ⊗ M(F).
In the existing network structure of YOLOv5s, a module composed of the convolution operation Conv2d, the normalization operation BatchNorm, and the HardSwish activation function is referred to as a CBH module; fig. 4 is a schematic structural diagram of the CBH module provided in the embodiment of the present invention. The BottleneckCSP1_X layer consists of a CBH module and X Res unit residual structures; the BottleneckCSP2_x layer has a structure similar to that of BottleneckCSP1_x, except that the N Bottleneck modules are replaced by N CBH modules; and Spatial Pyramid Pooling (SPP) performs multi-scale fusion by max pooling with kernels of 1 × 1, 5 × 5, 9 × 9, and 13 × 13. In order to improve the feature expression capability of the network, in the embodiment of the present invention a bottleneck attention mechanism module is embedded into the BottleneckCSP1_x layer of the YOLOv5s backbone network to form a repeating unit composed of a CBH module and a BAM module, yielding a cross-stage partial network based on the bottleneck attention mechanism module, referred to herein as the BAM-CSP1_x network module. Fig. 5 is a schematic structural diagram of the BAM-CSP1_x network module according to an embodiment of the present invention. The feature map is passed into the BAM-CSP1_x network module, and the influence of other factors is suppressed as much as possible at the lower layers of YOLOv5s through the channel attention mechanism and the spatial attention mechanism, so that the network focuses on effective feature information, suppresses unimportant feature information, and concentrates on extracting target features in the driving scene, which benefits the detection accuracy.
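The following is a minimal PyTorch-style sketch of how a bottleneck attention module of this kind could be realized and wrapped around a CSP stage. The branch layouts, the reduction ratio and dilation, and the class names (BAM, BAMCSP1) are illustrative assumptions for clarity, not the exact structure disclosed by the patent.

import torch
import torch.nn as nn

class BAM(nn.Module):
    """Bottleneck attention sketch: a channel branch and a spatial branch are
    fused into one attention map M(F), applied residually as F' = F + F * M(F)."""
    def __init__(self, channels: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        mid = channels // reduction
        # Channel attention branch: global pooling followed by a bottleneck MLP (1x1 convs)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )
        # Spatial attention branch: channel reduction and dilated 3x3 convs down to a 1-channel map
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1),
        )

    def forward(self, x):
        att = torch.sigmoid(self.channel_att(x) + self.spatial_att(x))  # fused attention map M(F)
        return x + x * att                                              # refined feature map F'

class BAMCSP1(nn.Module):
    """Illustrative BAM-CSP1_x stand-in: an existing BottleneckCSP-style stage followed by a BAM block."""
    def __init__(self, csp_stage: nn.Module, channels: int):
        super().__init__()
        self.csp = csp_stage
        self.bam = BAM(channels)

    def forward(self, x):
        return self.bam(self.csp(x))

A BottleneckCSP1_x stage of the backbone would then be wrapped as BAMCSP1(stage, channels), so that its output is refined by the attention map before being passed to the next stage.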
In another embodiment of the present invention, depthwise separable convolution operations are also employed in designated convolution layers of the YOLOv5s backbone network. Specifically, a designated CBH module in the YOLOv5s backbone network is replaced with an MBH module, where the MBH module is obtained by replacing the convolution operation Conv2d in the CBH module with an inverted residual module (denoted herein as the Mod module) based on a depthwise separable convolution operation. As a preferred example of the present invention, as shown in fig. 6, the MBH module is composed of an inverted residual module based on a depthwise separable convolution operation, BatchNorm2d normalization, and a HardSwish activation function: the feature map is passed through the inverted residual module and then through the BatchNorm2d normalization and HardSwish activation operations in turn.
As shown in fig. 7, the inverted residual module based on the depthwise separable convolution operation includes a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
Here, the depthwise separable convolution splits a feature along its spatial and channel dimensions using a 3 × 3 depthwise convolution and a 1 × 1 pointwise convolution, which effectively reduces the computation of the model. However, because the depthwise separable convolution compresses not only the computation of the convolution kernel but also its exploration space, the feature expression capability is weakened once the convolution's capacity to explore the feature space is compressed. In view of this, the embodiment of the present invention uses an inverted residual model based on the depthwise separable convolution operation: a 1 × 1 convolution added before the depthwise convolution expands the low-dimensional feature representation into a high-dimensional one, feature extraction is performed using the depthwise separable convolution operation, and the features are then compressed back into a low-dimensional space. In this inverted residual model, the first pointwise convolution layer and the depthwise convolution layer are each followed by a BatchNorm operation and the non-linear ReLU6 function. When the number of channels is large, the features may fall partly into a low-dimensional space; although the ReLU6 function preserves good feature extraction capability there, it can instead reduce the feature extraction capability of the network after the features are transformed from high dimension back to low dimension, so the ReLU6 function is not used after the final second pointwise convolution layer. Finally, the original feature map is fused with the feature map produced by the depthwise separable convolution through a shortcut connection to generate a new feature map. It will be appreciated that "first low-dimensional" and "first high-dimensional", as well as "second high-dimensional" and "second low-dimensional", are relative terms.
As a preferred example of the present invention, specifically, the fourth CBH module in the YOLOv5s backbone network may be replaced by an MBH module. The embodiment of the invention thus uses the YOLOv5 algorithm, which has a small parameter scale and a very high inference speed, as the basic framework, and replaces the computationally heavy convolution layer in the backbone network with a depthwise separable convolution, effectively reducing the parameter count, increasing the target detection speed, and still obtaining a good detection effect.
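As a sketch under the same assumptions, the inverted residual (Mod) block and the MBH wrapper described above could look as follows in PyTorch; the expansion ratio, stride handling, and class names are illustrative, not taken from the patent.

import torch.nn as nn

class InvertedResidual(nn.Module):
    """Mod-block sketch: 1x1 pointwise expand -> 3x3 depthwise -> 1x1 pointwise project,
    with ReLU6 after the first two convolutions only, and a skip connection when shapes match."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand: int = 4):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # first pointwise convolution: expand the low-dimensional features to a high dimension
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # depthwise convolution: per-channel spatial feature extraction
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # second pointwise convolution: project back to a low dimension, no ReLU6 here
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

class MBH(nn.Module):
    """MBH sketch: inverted residual module followed by BatchNorm2d and HardSwish."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.mod = InvertedResidual(in_ch, out_ch, stride)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.mod(x)))

Replacing the fourth CBH module of the backbone would then amount to swapping that CBH instance for an MBH instance with matching channel counts and stride.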
For the constructed target detection network, the embodiment of the invention uses the Adam optimization method and a preset loss function to train in an end-to-end manner. Optionally, the training image size is 640 × 640, the batch size is set to 16, and the number of epochs is set to 300.
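A hedged sketch of this end-to-end training setup is given below; the dataset object, learning rate, and loss implementation are placeholders (the patent only fixes Adam, 640 × 640 inputs, batch size 16, and 300 epochs).

import torch
from torch.utils.data import DataLoader

def train(model, dataset, compute_loss, device="cuda"):
    loader = DataLoader(dataset, batch_size=16, shuffle=True)   # batch size 16
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam; the learning rate is an assumed placeholder
    model.to(device).train()
    for epoch in range(300):                                    # 300 epochs
        for images, targets in loader:                          # images resized to 640 x 640 upstream
            preds = model(images.to(device))
            loss = compute_loss(preds, targets)                 # preset loss function, see formula (1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()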
In one embodiment of the invention, the loss function LOSS of the YOLOv5s network in the target detection network consists of a classification loss function L_cls, a bounding-box regression loss function L_box, and a confidence loss function L_obj, as shown in formula (1):

LOSS = L_cls + L_box + L_obj    (1)
The classification loss function L_cls conventionally uses a BCE (Binary Cross Entropy) loss. To address the imbalance between positive and negative samples in target detection under the driving scene, the embodiment of the invention replaces the classification loss function L_cls with a gradient harmonizing mechanism loss function (GHM Loss for short). In the gradient harmonizing mechanism loss, for a candidate box, let p be the probability predicted by the model and p* be the ground-truth label of a certain class; the binary cross entropy loss is calculated as shown in formula (2):

L_CE(p, p*) = −p* · log(p) − (1 − p*) · log(1 − p)    (2)

To handle the problem of gradient-norm imbalance, a gradient density function GD(g) is used, as shown in formula (3):

GD(g) = (1 / l_ε(g)) · Σ_{k=1}^{N} δ_ε(g_k, g)    (3)

In formula (3), Σ_{k=1}^{N} δ_ε(g_k, g) counts, among samples 1 to N, the number of samples whose gradient norm g_k falls within the interval of length ε centered at g, and l_ε(g) represents the length of that interval.

The gradient harmonizing mechanism loss used for classification, L_GHM-C, is shown in formula (4):

L_GHM-C = Σ_{i=1}^{N} L_CE(p_i, p_i*) / GD(g_i)    (4)
By replacing the classification loss function with the gradient harmonizing mechanism loss function GHM Loss for the problem of imbalanced positive and negative samples in target detection under the driving scene, the weights of easy negative samples and of extremely hard outlier samples among the candidate samples are reduced and the weights of normal hard samples are increased, so that the model can concentrate on effective normal hard samples, effectively improving the performance of the model.
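A simplified sketch of a gradient harmonizing classification loss following formulas (2) to (4) is shown below: the gradient norm g = |p − p*| of each sample is binned, the gradient density is estimated per bin, and each sample's BCE loss is divided by the density of its bin. The number of bins and the omission of the moving-average variant of GHM are simplifying assumptions.

import torch
import torch.nn.functional as F

def ghm_c_loss(logits, targets, bins: int = 10):
    """GHM-C sketch for logits and 0/1 float targets of the same shape:
    down-weight samples that fall into densely populated gradient-norm bins."""
    probs = torch.sigmoid(logits)
    g = (probs.detach() - targets).abs()          # gradient norm g = |p - p*| of the BCE loss
    n = logits.numel()
    weights = torch.zeros_like(logits)
    edges = torch.linspace(0, 1, bins + 1, device=logits.device)
    edges[-1] += 1e-6                             # include samples with g == 1 in the last bin
    for i in range(bins):
        in_bin = (g >= edges[i]) & (g < edges[i + 1])
        count = in_bin.sum().item()
        if count > 0:
            gd = count / (edges[i + 1] - edges[i]).item()   # gradient density GD(g) for this bin
            weights[in_bin] = n / gd                        # beta_i = N / GD(g_i)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")   # formula (2)
    return (weights * bce).sum() / n                                              # formula (4)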
For the case of partially occluded targets, the embodiment of the invention may further replace the bounding-box regression loss function with a repulsive force loss function (Repulsion Loss). In this case, the loss function of the target detection network consists of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function. The repulsive force loss function reduces the distance between a prediction box and its target box in target detection while increasing the distance between the prediction box and surrounding target boxes or prediction boxes. The repulsive force loss function consists of three parts: the first part is the loss generated between the prediction box and its target box; the second part is the loss generated between the prediction box and surrounding target boxes; and the third part is the loss generated between the prediction box and surrounding prediction boxes that do not predict the same target. Two relationship coefficients, α and β, are used to adjust the second and third loss terms; the larger the distance to the surrounding targets, the smaller the loss value. The expression of the repulsive force loss function is shown in formula (5):

L_Rep = L_Attr + α · L_RepGT + β · L_RepBox, where

L_Attr = Σ_{P∈P_+} Smooth_L1(B^P, G_Attr^P) / |P_+|

L_RepGT = Σ_{P∈P_+} Smooth_ln(IoG(B^P, G_Rep^P)) / |P_+|

L_RepBox = Σ_{i≠j} Smooth_ln(IoU(B^{P_i}, B^{P_j})) / (Σ_{i≠j} 1[IoU(B^{P_i}, B^{P_j}) > 0] + ε)    (5)

The first term on the right side of formula (5) is the loss function of the regression model, where P_+ represents the set of positive samples, B^P is the prediction box regressed from proposal P, G_Attr^P is the real target box having the largest Intersection over Union (IoU) with proposal P, and the Smooth_L1 function measures the distance between B^P and G_Attr^P. In the second term, G_Rep^P is the real box, other than the one corresponding to P, that has the largest IoU with P; IoG represents the percentage of the area of G_Rep^P covered by its overlap region with B^P, and the Smooth_ln function measures the distance between the prediction box and the real boxes of surrounding targets. In the third term, the proposals P are divided into different subsets; B^{P_i} and B^{P_j} represent prediction boxes of different targets, and the Smooth_ln function measures the distance between a prediction box and the prediction boxes of surrounding targets, so that the overlap area of proposals from different subsets is as small as possible. From the denominator of the third term, it can be seen that a loss value is only counted when the prediction boxes have an overlapping area; if they are not adjacent at all, no loss is counted. The loss of the third term reduces the probability that the bounding boxes of different regression targets are merged into one, so the embodiment is more robust when traffic road targets are partially occluded, effectively improving the detection effect.
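A hedged sketch of the three terms of formula (5) follows: smooth-L1 attraction towards the matched ground truth, a Smooth_ln penalty on the IoG with the most-overlapping non-matched ground truth, and a Smooth_ln penalty on the IoU between prediction boxes assigned to different targets. The σ parameter, the (x1, y1, x2, y2) box format, and the helper names are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

def _area(boxes):
    return (boxes[:, 2] - boxes[:, 0]).clamp(min=0) * (boxes[:, 3] - boxes[:, 1]).clamp(min=0)

def _pairwise_intersection(a, b):
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    return wh[..., 0] * wh[..., 1]

def smooth_ln(x, sigma: float = 0.5):
    # Smooth_ln penalty: ln-shaped below sigma, linear above, so large overlaps are punished strongly
    return torch.where(x <= sigma,
                       -torch.log(1 - x),
                       (x - sigma) / (1 - sigma) - math.log(1 - sigma))

def repulsion_loss(pred, attr_gt, rep_gt, gt_index, alpha: float = 0.5, beta: float = 0.5):
    """Formula (5) sketch: L = L_Attr + alpha * L_RepGT + beta * L_RepBox.
    pred:     (P, 4) prediction boxes B^P regressed from the positive proposals P_+
    attr_gt:  (P, 4) matched ground-truth boxes G_Attr^P (largest IoU with each proposal)
    rep_gt:   (P, 4) repulsion ground-truth boxes G_Rep^P (largest IoU among the non-matched boxes)
    gt_index: (P,)   index of the target each proposal is assigned to"""
    # Attraction term: pull B^P towards G_Attr^P
    l_attr = F.smooth_l1_loss(pred, attr_gt, reduction="mean")

    # RepGT term: push B^P away from G_Rep^P via the IoG overlap ratio
    iog = _pairwise_intersection(pred, rep_gt).diagonal() / (_area(rep_gt) + 1e-6)
    l_repgt = smooth_ln(iog.clamp(max=1 - 1e-3)).mean()

    # RepBox term: push apart prediction boxes that regress to different targets
    inter = _pairwise_intersection(pred, pred)
    union = _area(pred)[:, None] + _area(pred)[None, :] - inter
    iou = inter / (union + 1e-6)
    diff_target = gt_index[:, None] != gt_index[None, :]
    overlapping = (iou > 0) & diff_target
    l_repbox = smooth_ln(iou[overlapping].clamp(max=1 - 1e-3)).sum() / (overlapping.sum() + 1e-6)

    return l_attr + alpha * l_repgt + beta * l_repbox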
For the target detection network, the embodiment of the present invention adopts Precision, Recall, mean Average Precision (mAP), and detection speed (Frames Per Second, FPS) as evaluation indexes, calculated and explained as follows:

1. Precision represents the proportion of samples classified as positive that are actually positive; it is denoted by the letter P and expressed as shown in formula (6):

P = TP / (TP + FP)    (6)

where TP + FP is the number of samples predicted to be of the positive class, and TP is the number of positive-class samples predicted as positive.

2. Recall indicates how many of the actual positive samples are classified as positive; it is denoted by the letter R, measures the coverage of the detection result, and is shown in formula (7):

R = TP / (TP + FN)    (7)

3. mAP is the mean of the average precisions of all classes in the dataset, and AP is the average precision of a certain class. For the i-th class, with different IoU thresholds selectable per category, the average precision is calculated as follows:

AP_i = ∫_0^1 P(R) dR    (8)

Its geometric meaning is the area enclosed by the precision-recall curve and the horizontal axis. With N classes, the mean average precision is calculated as follows:

mAP = (1/N) · Σ_{i=1}^{N} AP_i    (9)

4. FPS is the number of image frames detected per second; this index depends not only on the computational cost of the algorithm model but also on the hardware performance during the experiments. Generally, if the detection speed is not less than 25 fps, the algorithm model can be considered to meet the real-time requirement.
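A small sketch of these evaluation indexes, formulas (6) to (9), is given below; the all-point interpolation used to compute the area under the precision-recall curve is an implementation assumption, since the patent does not fix a particular integration scheme.

import numpy as np

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0     # formula (6)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0     # formula (7)

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP of one class: area under its precision-recall curve, formula (8)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # monotone non-increasing precision envelope
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class) -> float:
    """mAP: mean of the per-class AP values, formula (9)."""
    return float(np.mean(ap_per_class)) if len(ap_per_class) else 0.0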
The embodiment of the invention provides a lightweight, deep-neural-network-based method for detecting targets in images shot in a driving scene, with a series of improvements and optimizations built on YOLOv5s. Compared with the existing SE + YOLOv3 network structure, the accuracy on an image dataset shot by a camera in a driving scene, tested in a GTX1080 environment, is greatly improved, and the prediction boxes are closer to the real target boxes. Compared with the original SE + YOLOv3 network structure, the model size of this embodiment is also greatly reduced while the detection accuracy improves, meeting the front-end lightweight application requirement of driving scenes.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, the present invention further provides an object detection apparatus, which corresponds one-to-one to the object detection method in the foregoing embodiment. As shown in fig. 8, the object detection device includes an acquisition module 81 and a detection module 82. The functional modules are explained in detail as follows:
an acquisition module 81 for acquiring an image captured by a camera in a driving scene;
the detection module 82 is configured to input the image to a trained target detection network, and judge and predict the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
Optionally, the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_x layer of the YOLOv5s backbone network to obtain a cross-stage partial network based on the bottleneck attention mechanism module.
Optionally, the target detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module; the CBH module consists of a convolution operation, a normalization operation, and an activation function, while the MBH module consists of an inverted residual module based on a depthwise separable convolution operation, a normalization operation, and an activation function.
Optionally, the inverted residual module based on the depthwise separable convolution operation comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
Optionally, the target detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function is composed of a classification loss function, a bounding-box regression loss function, and a confidence loss function.
Optionally, the target detection network is obtained by training through a preset loss function;
the loss function consists of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function.
For specific limitations of the target detection device, reference may be made to the above limitations of the target detection method, which are not described herein again. The modules in the target detection device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in a computer device, and can also be stored in a memory in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of object detection.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of object detection, comprising:
acquiring an image shot by a camera in a driving scene;
inputting the image into a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
2. The target detection method of claim 1, wherein the target detection network embeds a bottleneck attention mechanism module in a BottleneckCSP1_x layer of the YOLOv5s backbone network to obtain a cross-stage partial network based on the bottleneck attention mechanism module.
3. The object detection method of claim 2, wherein the object detection network replaces a designated CBH module in the YOLOv5s backbone network with an MBH module, the CBH module consisting of a convolution operation, a normalization operation, and an activation function, and the MBH module consisting of an inverted residual module based on a depthwise separable convolution operation, a normalization operation, and an activation function.
4. The object detection method of claim 3, wherein the inverted residual module based on the depthwise separable convolution operation comprises a first pointwise convolution layer, a depthwise convolution layer, a second pointwise convolution layer, and a fusion layer, wherein the first pointwise convolution layer and the depthwise convolution layer employ a BatchNorm operation and a non-linear ReLU6 activation function, and the second pointwise convolution layer employs the BatchNorm operation without a non-linear ReLU6 activation function;
the first pointwise convolution layer expands a first low-dimensional feature representation into a first high-dimensional feature representation, and the depthwise convolution layer performs feature extraction on the first high-dimensional feature representation based on the depthwise separable convolution operation to obtain a second high-dimensional feature representation; the second pointwise convolution layer compresses the second high-dimensional feature representation to obtain a second low-dimensional feature representation; and the fusion layer fuses, through a skip-connection operation, the input first low-dimensional feature representation with the second low-dimensional feature representation produced by the first pointwise convolution layer, the depthwise convolution layer, and the second pointwise convolution layer, to generate a new feature map.
5. The object detection method of claim 3 or 4, wherein the object detection network replaces the fourth CBH module in the YOLOv5s backbone network with an MBH module.
6. The object detection method according to any one of claims 1 to 4, wherein the object detection network is trained by a preset loss function;
the loss function is composed of a classification loss function, a bounding-box regression loss function, and a confidence loss function.
7. The object detection method according to any one of claims 1 to 4, wherein the object detection network is trained by a preset loss function;
the loss function consists of a gradient harmonizing mechanism loss function, a repulsive force loss function, and a confidence loss function.
8. An object detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring images shot by the camera in a driving scene;
the detection module is used for inputting the image to a trained target detection network, and judging and predicting the image through the target detection network to obtain target classification and position information;
the target detection network adopts a lightweight YOLOv5s network structure as a basic framework, a bottleneck attention mechanism module is embedded in a cross-stage partial network of the YOLOv5s backbone network, and depthwise separable convolution operations are adopted in specified convolution layers of the YOLOv5s backbone network.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the object detection method according to any one of claims 1 to 7.
10. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the object detection method according to any one of claims 1 to 7 when executing the computer program.
CN202110310610.1A 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal Active CN112699859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110310610.1A CN112699859B (en) 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110310610.1A CN112699859B (en) 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN112699859A true CN112699859A (en) 2021-04-23
CN112699859B CN112699859B (en) 2021-07-16

Family

ID=75515587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110310610.1A Active CN112699859B (en) 2021-03-24 2021-03-24 Target detection method, device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN112699859B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160062A (en) * 2021-05-25 2021-07-23 烟台艾睿光电科技有限公司 Infrared image target detection method, device, equipment and storage medium
CN113449691A (en) * 2021-07-21 2021-09-28 天津理工大学 Human shape recognition system and method based on non-local attention mechanism
CN113469087A (en) * 2021-07-09 2021-10-01 上海智臻智能网络科技股份有限公司 Method, device, equipment and medium for detecting picture frame in building drawing
CN113569702A (en) * 2021-07-23 2021-10-29 闽江学院 Deep learning-based truck single-tire and double-tire identification method
CN113705604A (en) * 2021-07-15 2021-11-26 中国科学院信息工程研究所 Botnet flow classification detection method and device, electronic equipment and storage medium
CN113887706A (en) * 2021-09-30 2022-01-04 苏州浪潮智能科技有限公司 Method and device for low bit quantization aiming at one-stage target detection network
CN113963167A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN114529825A (en) * 2022-04-24 2022-05-24 城云科技(中国)有限公司 Target detection model, method and application for fire fighting channel occupation target detection
CN114549970A (en) * 2022-01-13 2022-05-27 山东师范大学 Night small target fruit detection method and system fusing global fine-grained information
CN115223130A (en) * 2022-09-20 2022-10-21 南京理工大学 Multi-task panoramic driving perception method and system based on improved YOLOv5
CN115578624A (en) * 2022-10-28 2023-01-06 北京市农林科学院 Agricultural disease and pest model construction method, detection method and device
CN116468730A (en) * 2023-06-20 2023-07-21 齐鲁工业大学(山东省科学院) Aerial insulator image defect detection method based on YOLOv5 algorithm
CN114549970B (en) * 2022-01-13 2024-06-07 山东师范大学 Night small target fruit detection method and system integrating global fine granularity information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633610A (en) * 2019-05-17 2019-12-31 西南交通大学 Student state detection algorithm based on YOLO
CN110852222A (en) * 2019-10-31 2020-02-28 上海交通大学 Campus corridor scene intelligent monitoring method based on target detection
CN112233090A (en) * 2020-10-15 2021-01-15 浙江工商大学 Film flaw detection method based on improved attention mechanism
CN112307921A (en) * 2020-10-22 2021-02-02 桂林电子科技大学 Vehicle-mounted end multi-target identification tracking prediction method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633610A (en) * 2019-05-17 2019-12-31 西南交通大学 Student state detection algorithm based on YOLO
CN110852222A (en) * 2019-10-31 2020-02-28 上海交通大学 Campus corridor scene intelligent monitoring method based on target detection
CN112233090A (en) * 2020-10-15 2021-01-15 浙江工商大学 Film flaw detection method based on improved attention mechanism
CN112307921A (en) * 2020-10-22 2021-02-02 桂林电子科技大学 Vehicle-mounted end multi-target identification tracking prediction method

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160062A (en) * 2021-05-25 2021-07-23 烟台艾睿光电科技有限公司 Infrared image target detection method, device, equipment and storage medium
CN113469087A (en) * 2021-07-09 2021-10-01 上海智臻智能网络科技股份有限公司 Method, device, equipment and medium for detecting picture frame in building drawing
CN113705604A (en) * 2021-07-15 2021-11-26 中国科学院信息工程研究所 Botnet flow classification detection method and device, electronic equipment and storage medium
CN113449691A (en) * 2021-07-21 2021-09-28 天津理工大学 Human shape recognition system and method based on non-local attention mechanism
CN113569702A (en) * 2021-07-23 2021-10-29 闽江学院 Deep learning-based truck single-tire and double-tire identification method
CN113569702B (en) * 2021-07-23 2023-10-27 闽江学院 Truck single-double tire identification method based on deep learning
CN113887706A (en) * 2021-09-30 2022-01-04 苏州浪潮智能科技有限公司 Method and device for low bit quantization aiming at one-stage target detection network
CN113887706B (en) * 2021-09-30 2024-02-06 苏州浪潮智能科技有限公司 Method and device for low-bit quantization of one-stage target detection network
CN113963167B (en) * 2021-10-29 2022-05-27 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN113963167A (en) * 2021-10-29 2022-01-21 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN114549970A (en) * 2022-01-13 2022-05-27 山东师范大学 Night small target fruit detection method and system fusing global fine-grained information
CN114549970B (en) * 2022-01-13 2024-06-07 山东师范大学 Night small target fruit detection method and system integrating global fine granularity information
CN114529825A (en) * 2022-04-24 2022-05-24 城云科技(中国)有限公司 Target detection model, method and application for fire fighting channel occupation target detection
CN115223130A (en) * 2022-09-20 2022-10-21 南京理工大学 Multi-task panoramic driving perception method and system based on improved YOLOv5
CN115223130B (en) * 2022-09-20 2023-02-03 南京理工大学 Multi-task panoramic driving perception method and system based on improved YOLOv5
CN115578624A (en) * 2022-10-28 2023-01-06 北京市农林科学院 Agricultural disease and pest model construction method, detection method and device
CN116468730A (en) * 2023-06-20 2023-07-21 齐鲁工业大学(山东省科学院) Aerial insulator image defect detection method based on YOLOv5 algorithm
CN116468730B (en) * 2023-06-20 2023-09-05 齐鲁工业大学(山东省科学院) Aerial Insulator Image Defect Detection Method Based on YOLOv5 Algorithm

Also Published As

Publication number Publication date
CN112699859B (en) 2021-07-16

Similar Documents

Publication Publication Date Title
CN112699859B (en) Target detection method, device, storage medium and terminal
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN110929692A (en) Three-dimensional target detection method and device based on multi-sensor information fusion
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN111797983A (en) Neural network construction method and device
CN111738037B (en) Automatic driving method, system and vehicle thereof
CN111222478A (en) Construction site safety protection detection method and system
CN111611947A (en) License plate detection method, device, equipment and medium
CN110533046B (en) Image instance segmentation method and device, computer readable storage medium and electronic equipment
CN111160481B (en) Adas target detection method and system based on deep learning
CN111242015A (en) Method for predicting driving danger scene based on motion contour semantic graph
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN115049821A (en) Three-dimensional environment target detection method based on multi-sensor fusion
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN112462759B (en) Evaluation method, system and computer storage medium of rule control algorithm
CN111435457B (en) Method for classifying acquisitions acquired by sensors
CN110852272B (en) Pedestrian detection method
JP2018124963A (en) Image processing device, image recognition device, image processing program, and image recognition program
CN111652350A (en) Neural network visual interpretation method and weak supervision object positioning method
CN112465037B (en) Target detection method, device, computer equipment and storage medium
CN115880654A (en) Vehicle lane change risk assessment method and device, computer equipment and storage medium
WO2018143278A1 (en) Image processing device, image recognition device, image processing program, and image recognition program
CN112699809B (en) Vaccinia category identification method, device, computer equipment and storage medium
CN115631457A (en) Man-machine cooperation abnormity detection method and system in building construction monitoring video
JPWO2018143277A1 (en) Image feature output device, image recognition device, image feature output program, and image recognition program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant