CN115953625A - Vehicle detection method based on feature map double-axis Transformer module - Google Patents

Vehicle detection method based on feature map double-axis Transformer module

Info

Publication number
CN115953625A
CN115953625A (application CN202211621191.4A)
Authority
CN
China
Prior art keywords
layer
convolution
feature map
input
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211621191.4A
Other languages
Chinese (zh)
Inventor
刘尧
张玉杰
杜逸
李炎
金忠富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intelligent Transportation Research Branch Of Zhejiang Transportation Investment Group Co ltd
Original Assignee
Intelligent Transportation Research Branch Of Zhejiang Transportation Investment Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intelligent Transportation Research Branch Of Zhejiang Transportation Investment Group Co ltd
Priority to CN202211621191.4A
Publication of CN115953625A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The present invention relates to the field of object detection in computer vision and is characterized by high detection precision. The technical scheme is a vehicle detection method based on a feature map double-axis Transformer module, comprising the following steps: 1) collecting road monitoring pictures and marking the vehicle outlines in the pictures to form a data set; 2) inputting the data set into the double-axis Transformer module for training; 3) detecting road monitoring pictures with the trained double-axis Transformer module.

Description

Vehicle detection method based on feature map double-axis Transformer module
Technical Field
The invention relates to the field of target detection in computer vision, in particular to a vehicle detection method based on a feature map double-axis Transformer module.
Background
Target detection techniques fall into two categories according to the number of stages: two-stage detection and one-stage detection.
The basic flow of the two-stage detection technique is to propose target candidate boxes: the first stage computes the rough position, size and foreground probability of each candidate box, and the second stage computes its precise position, size and category. Representative methods include R-CNN (Regions with CNN features), SPP-Net (Spatial Pyramid Pooling), Fast R-CNN, Faster R-CNN, and the like. The one-stage detection technique directly computes the size, position and category of targets with a deep neural network; representative methods are YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector). The two-stage framework achieves higher detection precision but limits detection speed, while the one-stage technique is fast but less precise.
In recent years, transformers have achieved remarkable results in the fields of natural language processing and search recommendation. The Transformer is proposed by an attention isallyoud in the field of natural language processing, different information is endowed with different weights through an attention mechanism, and fusion coding is performed by considering the interaction relationship among the information. The Transformer structure is superior to the convolution structure in information extraction and processing capability. The applications of the Transformer in machine vision are, for example, VIT (vision transform), but the Transformer structure has high computational complexity (the Transformer belongs to a one-stage detection technology), and the Transformer improves the model effect and greatly increases the computational complexity. In order to ensure the efficiency of the model, swinTransformer, C-SwinTransformer were proposed in succession, the main way of which is to calculate the in-window Attention by the sliding window method. When the Transformer module is adopted, the method is mostly used in an input image or a backhaul stage, and at the moment, the image size is large, and the Transformer calculation complexity is high.
Disclosure of Invention
The invention aims to overcome the defects in the background art and provide a vehicle detection method based on a feature map double-axis Transformer module, which offers high detection precision.
The technical scheme of the invention is as follows:
the vehicle detection method based on the feature map double-axis Transformer module comprises the following steps:
1) Collecting a road monitoring picture and marking a vehicle outline in the picture to form a data set;
2) Inputting the data set into a double-axis Transformer module for training;
3) Detecting a road monitoring picture by using the double-axis Transformer module;
the double-shaft Transformer module comprises a backhaul network, a Neck network and a Prediction network;
the backhaul network comprises a Focus layer, a first convolution layer, a first residual structure convolution layer, a second residual structure convolution layer, a third residual structure convolution layer, a fourth convolution layer, a spatial pyramid pooling layer and a fourth residual structure convolution layer which are sequentially connected;
the Neck network comprises a fifth convolution layer, a first up-sampling layer, a splicing layer, a sixth convolution layer, a second up-sampling layer, a first CTCR layer, a first CTPR layer, a seventh convolution layer, a second CTCR layer, a second CTPR layer, an eighth convolution layer, a third CTCR layer and a third CTPR layer which are sequentially connected;
the Prediction network classifies and predicts the boundary of the target of the feature map based on a candidate box with a preset size, and comprises a first detection layer, a second detection layer and a third detection layer;
the output of the second residual structure convolution layer is further connected with the input of the first CTCR layer, the output of the third residual structure convolution layer is further connected with the input of the splicing layer, the output of the fifth convolution layer is further connected with the input of the third CTPR layer, the output of the splicing layer is further connected with the input of the first CTPR layer, and the output of the sixth convolution layer is further connected with the input of the second CTPR layer; the output of the first CTPR layer is also connected with the input of the first detection layer, the output of the second CTPR layer is also connected with the input of the second detection layer, and the output of the third CTPR layer is connected with the input of the third detection layer.
The first CTCR layer, the second CTCR layer and the third CTCR layer each comprise:
S1-1, performing two convolution operations on the input feature map;
S1-2, processing the input feature map through a Transformer_C module;
S1-3, splicing the output of S1-1, the output of S1-2 and the input feature map, and then performing a convolution operation;
the first CTPR layer, the second CTPR layer and the third CTPR layer each comprise:
S2-1, performing two convolution operations on the input feature map;
S2-2, processing the input feature map through a Transformer_P module;
S2-3, splicing the output of S2-1, the output of S2-2 and the input feature map, and then performing a convolution operation.
The Transformer_C module sequentially performs the following steps: first unfolding the input feature map at channel level, then adding a position parameter to each vector, splicing all vectors after they are processed by an Attention module, then obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
The Transformer_P module sequentially performs the following steps: first partitioning the feature map into region blocks, then unfolding it by region block, then adding a position parameter to each region-block vector, splicing all vectors after they are processed by an Attention module, obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
The Attention module sequentially maps the input vector into a Query vector, a Key vector and a Value vector, computes the relevance between the Query vector and the Key vector as the weight for the weighted Value, and obtains the output result through a fully connected layer.
The Focus layer comprises: performing a slicing operation on each picture, dividing it into four complementary sub-pictures, splicing the four sub-pictures, and performing a convolution operation to obtain a 2x downsampled feature map.
The first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, the seventh convolution layer, the eighth convolution layer and the convolution operation have the same steps.
The first residual structure convolution layer, the second residual structure convolution layer, the third residual structure convolution layer and the fourth residual structure convolution layer each comprise:
S3-1, performing two convolution operations on the input feature map;
S3-2, adding the output of S3-1 to the input feature map;
S3-3, performing one convolution operation on the input feature map;
S3-4, splicing the outputs of S3-2 and S3-3, and then performing a convolution operation again.
The spatial pyramid pooling layer comprises:
S4-1, performing one convolution operation on the input feature map;
S4-2, pooling the input feature map with pooling windows of different sizes;
S4-3, splicing the input feature map with the output of S4-2, and then performing a convolution operation.
And the splicing layer splices the shallow feature map and the deep feature map to fuse the information of the multi-level feature map.
The first up-sampling layer and the second up-sampling layer both adopt an interpolation method; the first convolution layer includes a vector convolution, a BatchNorm layer, and an activation.
The beneficial effects of the invention are:
according to the method, the model detection precision is improved by fusing the YOLOV5 and the Transformer structures, the Transformer structures are introduced into the calculation stage of the nerck characteristic diagram of the target detection model, and the biaxial Transformer model structures (CTCR and CTPR) are designed to fuse information between different characteristic diagrams and between characteristic diagram channels, so that the detection precision is improved, and the method is applied to a vehicle detection scene for verification; the prior algorithm adopts concat structure to splice in the neck part, and the invention adopts double-shaft (Twin-axial) -transform module fusion; a feature map patch level-based Transformer structure (CTPR) is used in a feature map fusion stage, so that the phenomenon that the computational complexity is increased and the model computational efficiency is reduced due to the use of the Transformer structure in a backbone stage is avoided; and a feature map channel level Transformer structure (CTCR) is used in a feature map fusion stage, so that the feature map fusion effect is improved, and the detection precision is improved.
Drawings
FIG. 1 is an overall architecture diagram of the double-axis Transformer module of the present invention.
Fig. 2 is a schematic view of the Focus layer of the present invention.
FIG. 3 is a schematic view of the first convolution layer of the present invention.
FIG. 4 is a diagram of a first residual structure convolutional layer of the present invention.
Fig. 5 is a schematic diagram of the spatial pyramid pooling layer of the present invention.
Fig. 6 is a schematic diagram of a first CTCR layer of the invention.
FIG. 7 is a diagram of the Transformer_C module of the present invention.
FIG. 8 is a schematic diagram of the Attention module of the present invention.
Fig. 9 is a schematic view of a first CTPR layer of the present invention.
FIG. 10 is a diagram of the Transformer_P module of the present invention.
FIG. 11 is a label distribution plot for a data set.
FIG. 12 is a plot of the label box size scale for the labels of the data set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention provides a vehicle detection method based on a feature map double-axis Transformer module. Vehicle detection from road monitoring video is an important basis for road monitoring and control. The method comprises the following steps:
1) Collecting a road monitoring picture and marking a vehicle outline in the picture to form a data set; dividing a data set into a training set and a test set;
2) Inputting the data set into a double-axis Transformer module for training to obtain a trained double-axis Transformer module;
3) Detecting road monitoring pictures with the trained double-axis Transformer module.
The double-axis Transformer module (feature map double-axis Transformer based detector, TDD for short) includes three parts (shown in FIG. 1): a Backbone (trunk) network, a Neck network, and a Prediction (detection) network.
The Backbone network comprises four types of neural network layers, namely a Focus layer, convolution layers (CBS), residual structure convolution layers (C3) and a spatial pyramid pooling layer (SPP). The Backbone network comprises a Focus layer, a first convolution layer, a first residual structure convolution layer, a second residual structure convolution layer, a third residual structure convolution layer, a fourth convolution layer, a spatial pyramid pooling layer and a fourth residual structure convolution layer which are sequentially connected.
The Neck network comprises five types of neural network layers, namely convolution layers (CBS), up-sampling layers, a splicing layer (Concat), CTCR layers (channel-level residual structure convolution-Transformer layers, abbreviated CTCR in the figures) and CTPR layers (patch-level residual structure convolution-Transformer layers, abbreviated CTPR in the figures). The Neck network comprises a fifth convolution layer, a first up-sampling layer, a splicing layer, a sixth convolution layer, a second up-sampling layer, a first CTCR layer, a first CTPR layer, a seventh convolution layer, a second CTCR layer, a second CTPR layer, an eighth convolution layer, a third CTCR layer and a third CTPR layer which are sequentially connected.
The output of the second residual structure convolution layer is further connected with the input of the first CTCR layer, the output of the third residual structure convolution layer is further connected with the input of the splicing layer, the output of the fifth convolution layer is further connected with the input of the third CTPR layer, the output of the splicing layer is further connected with the input of the first CTPR layer, and the output of the sixth convolution layer is further connected with the input of the second CTPR layer.
The Prediction network classifies and predicts the boundary of the target on the feature map based on the candidate box with preset size. The Prediction network comprises a first detection layer, a second detection layer and a third detection layer. The output of the first CTPR layer is also connected with the input of the first detection layer, the output of the second CTPR layer is also connected with the input of the second detection layer, and the output of the third CTPR layer is connected with the input of the third detection layer.
The three detection layers are connected to feature maps of different sizes: the feature map corresponding to the first detection layer is the largest and is used for small-target detection, the feature map corresponding to the second detection layer is of medium size and is used for medium-target detection, and the feature map corresponding to the third detection layer is the smallest and is used for large-target detection.
The Focus layer first slices the picture, then performs channel-level splicing (expanding the channel count), and finally performs a convolution. Specifically, as shown in FIG. 2, a value is taken at every other pixel point of the original input picture (similar to nearest-neighbour downsampling), dividing the picture into four parts (Slice) which are then spliced (Concat). The four sub-pictures are complementary, so no information is lost; the W and H information is concentrated into the channel space and the input channel count is expanded 4-fold, so the spliced picture has 12 channels relative to the original RGB three-channel picture. A convolution operation (CBS) is then performed on the new picture, finally yielding a 2x downsampled feature map without information loss.
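A minimal PyTorch sketch of this slicing-and-convolution flow is given below; the class name, kernel size and channel counts are illustrative assumptions rather than values fixed by the description.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four complementary sub-images, concatenate them on the
    channel axis (3 -> 12 channels), then apply one CBS convolution."""
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        # Take every other pixel in each direction: four complementary slices,
        # so width/height information moves into the channel dimension losslessly.
        sliced = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)

out = Focus()(torch.randn(1, 3, 960, 960))
print(out.shape)  # torch.Size([1, 32, 480, 480]); spatial size halved, no information lost
```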
As shown in FIG. 4, the first residual structure convolution layer, the second residual structure convolution layer, the third residual structure convolution layer and the fourth residual structure convolution layer follow the same steps (C3 denotes each residual structure convolution layer; a minimal sketch follows these steps):
S3-1, performing two convolution operations (CBS) on the input feature map;
S3-2, adding (Add) the output of S3-1 to the input feature map; the addition is element-wise over corresponding channels of the feature maps, so the channel count is unchanged;
S3-3, performing one convolution operation (CBS) on the input feature map;
S3-4, splicing (Concat) the outputs of S3-2 and S3-3, and then performing a convolution operation (CBS) again; splicing expands the channel depth, so the channel count increases.
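Assuming each convolution above is a CBS unit and that the layer preserves its channel count, the residual structure might be sketched as follows; the channel number and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def cbs(in_ch, out_ch, k=3, s=1):
    # CBS unit: convolution -> BatchNorm -> SiLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )

class C3(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.double_conv = nn.Sequential(cbs(ch, ch), cbs(ch, ch))  # S3-1: two convolutions
        self.single_conv = cbs(ch, ch)                              # S3-3: one convolution
        self.fuse = cbs(2 * ch, ch)                                 # S3-4: conv after concat

    def forward(self, x):
        a = self.double_conv(x) + x                 # S3-2: element-wise add, channels unchanged
        b = self.single_conv(x)
        return self.fuse(torch.cat([a, b], dim=1))  # S3-4: concat grows channels, conv restores

print(C3(64)(torch.randn(1, 64, 120, 120)).shape)  # torch.Size([1, 64, 120, 120])
```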
As shown in FIG. 5, the spatial pyramid pooling layer (SPP) extracts features at different scales (a sketch follows these steps):
S4-1, performing one convolution operation (CBS) on the input feature map;
S4-2, pooling the input feature map with pooling windows (MaxPool) of different sizes; three pooling windows are shown, of size 5 × 5, 9 × 9 and 13 × 13 respectively;
S4-3, splicing (Concat) the input feature map with the output of S4-2, and then performing a convolution operation (CBS).
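Under the assumption that S4-1 and S4-3 are 1 × 1 CBS convolutions, the pooling pyramid above might be sketched as follows; the 5/9/13 window sizes come from the description, everything else is illustrative.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, ch, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Conv2d(ch, ch, 1, bias=False), nn.BatchNorm2d(ch), nn.SiLU())    # S4-1
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in pool_sizes)      # S4-2
        self.post = nn.Sequential(
            nn.Conv2d(ch * (len(pool_sizes) + 1), ch, 1, bias=False),
            nn.BatchNorm2d(ch), nn.SiLU())                                      # S4-3

    def forward(self, x):
        x = self.pre(x)
        feats = [x] + [p(x) for p in self.pools]   # pool with 5x5, 9x9 and 13x13 windows
        return self.post(torch.cat(feats, dim=1))  # splice, then convolve

print(SPP(512)(torch.randn(1, 512, 15, 15)).shape)  # torch.Size([1, 512, 15, 15])
```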
The first up-sampling layer and the second up-sampling layer both use nearest-neighbour interpolation, the simplest interpolation method, which requires no computation: new elements are inserted between the pixel points of the original image, and each pixel to be filled in is assigned the grey level of the nearest of its neighbouring pixels.
The splicing layer performs channel-level splicing on multiple feature maps, whose sizes are unified by the first up-sampling layer before splicing. The purpose of the splicing layer is to fuse the information of multi-level feature maps; in the invention, the splicing layer splices a shallow feature map with a deep feature map, the shallow feature map benefiting the boundary computation of target detection and the deep feature map benefiting image semantic computation. The splicing layer splices the output of the third residual structure convolution layer (shallow) with the output of the first up-sampling layer (deep); a small illustration follows.
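For illustration, assuming tensor shapes loosely based on Table 1 below, nearest-neighbour upsampling followed by the channel-level splice might look like this:

```python
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode="nearest")  # interpolation, no learned parameters

shallow = torch.randn(1, 256, 30, 30)  # e.g. a shallow feature map from the third C3 layer
deep = torch.randn(1, 512, 15, 15)     # e.g. a deep feature map entering the upsampling layer

fused = torch.cat([shallow, up(deep)], dim=1)  # sizes unified first, then channel-level splice
print(fused.shape)  # torch.Size([1, 768, 30, 30])
```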
The steps of the first to eighth convolution layers are the same as those of the convolution operation (CBS), and all comprise (as shown in FIG. 3): subjecting the feature map to a vector convolution (Conv), a BatchNorm layer (BN) and an activation (SiLU). The activation uses the SiLU function (Sigmoid Weighted Linear Unit).
The vector convolution multiplies the convolution kernel element-wise with the values on the feature map and sums the products; the kernel slides up, down, left and right over the feature map to cover all positions. The BatchNorm layer normalizes all data in the batch: it first computes the mean and standard deviation of the batch, then subtracts the mean from all convolution outputs and divides by the standard deviation, and finally introduces scaling and translation variables, i.e. multiplies by a learnable coefficient and adds an offset. The BatchNorm layer effectively alleviates the vanishing-gradient problem and accelerates convergence. The SiLU function (Sigmoid Weighted Linear Unit) is computed as:
SiLU(x)=x*Sigmoid(x)
The SiLU function is an unsaturated activation function and is differentiable over its entire domain.
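As a concrete illustration of one CBS unit and the SiLU activation, a minimal PyTorch sketch might read as follows; the channel counts and the 3 × 3 kernel are assumptions made for the example.

```python
import torch
import torch.nn as nn

def silu(x):
    return x * torch.sigmoid(x)  # SiLU(x) = x * Sigmoid(x)

cbs = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1, bias=False),  # vector convolution
    nn.BatchNorm2d(32),  # normalize, then scale and shift with learnable parameters
    nn.SiLU(),           # same function as silu() above
)

x = torch.randn(2, 3, 64, 64)
assert torch.allclose(silu(x), nn.SiLU()(x))
print(cbs(x).shape)  # torch.Size([2, 32, 64, 64])
```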
The convolution operation is an important part of image processing: it essentially uses the parameters of convolution kernels to extract data features. Specifically, the kernel is multiplied element-wise with the values of the corresponding region of the image and the products are summed, and the operation is completed over the whole image by sliding the kernel; the relevant parameters include the kernel size, stride and number of kernels (output channel count).
As shown in FIG. 6, the first CTCR layer, the second CTCR layer and the third CTCR layer follow the same steps, each comprising:
S1-1, performing two convolution operations on the input feature map;
S1-2, processing the input feature map through a Transformer_C module;
S1-3, splicing the output of S1-1, the output of S1-2 and the input feature map, and then performing a convolution operation.
This layer is a key module for channel-level fusion of feature maps: its input is the result of splicing feature maps from different levels, and its output is passed to the next layer for further processing.
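A hedged sketch of this three-branch residual structure is shown below; the channel counts are assumptions, and nn.Identity stands in for the Transformer_C branch (itself sketched after the next paragraph) purely to keep the example runnable.

```python
import torch
import torch.nn as nn

class CTCR(nn.Module):
    """Three-branch residual structure: double convolution, channel-level Transformer,
    and the untouched input, spliced and fused by a final convolution (S1-1..S1-3)."""
    def __init__(self, ch, transformer_c):
        super().__init__()
        self.double_conv = nn.Sequential(                 # S1-1: two convolution operations
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
        )
        self.transformer_c = transformer_c                # S1-2: Transformer_C branch
        self.fuse = nn.Conv2d(3 * ch, ch, 1)              # S1-3: conv after 3-way splice

    def forward(self, x):
        spliced = torch.cat([self.double_conv(x), self.transformer_c(x), x], dim=1)
        return self.fuse(spliced)

# A CTPR layer follows the same pattern with Transformer_P in the middle branch.
print(CTCR(128, nn.Identity())(torch.randn(1, 128, 60, 60)).shape)  # torch.Size([1, 128, 60, 60])
```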
As shown in FIG. 7, the Transformer_C module operates as follows: first the input feature map is unfolded at channel level, so that an input of dimensions batch × channel × H × W becomes a set of channel vectors, each of length M = H × W; a position parameter (Add Pos) is added to each vector, all vectors are processed by an Attention module and then spliced (Concat), an output vector is obtained through a fully connected layer (FC), and finally the vectors are restored to a feature map through a Reshape operation.
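The following sketch approximates this channel-level flow with standard multi-head attention in place of the patent's Attention module; the class name, dimensions and head count are assumptions.

```python
import torch
import torch.nn as nn

class TransformerC(nn.Module):
    """Channel-level attention: each channel of the feature map becomes one token."""
    def __init__(self, channels, h, w, heads=4):
        super().__init__()
        d = h * w                                              # token length = H*W
        self.pos = nn.Parameter(torch.zeros(1, channels, d))   # Add Pos
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fc = nn.Linear(d, d)                              # fully connected output layer

    def forward(self, x):                                      # x: (batch, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2) + self.pos                       # unfold by channel -> (batch, C, H*W)
        out, _ = self.attn(tokens, tokens, tokens)             # attention across channel tokens
        out = self.fc(out)
        return out.reshape(b, c, h, w)                         # Reshape back to a feature map

print(TransformerC(128, 60, 60)(torch.randn(1, 128, 60, 60)).shape)  # torch.Size([1, 128, 60, 60])
```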
As shown in FIG. 8, the Attention module sequentially maps the input vector into a Query vector, a Key vector and a Value vector, computes the relevance between the Query vector and the Key vector as the weight for the weighted Value, and then obtains the output result through a fully connected layer (FC). Passing the vectors through multiple Attention modules in parallel yields a multi-head structure.
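A minimal single-head version of this Query/Key/Value computation might read as follows; the embedding size and the scaling by the square root of the dimension are common conventions assumed here rather than details taken from the description.

```python
import math
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # map input to Query
        self.k = nn.Linear(dim, dim)   # map input to Key
        self.v = nn.Linear(dim, dim)   # map input to Value
        self.fc = nn.Linear(dim, dim)  # final fully connected layer

    def forward(self, x):              # x: (batch, tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        relevance = (q @ k.transpose(-2, -1)) / math.sqrt(x.size(-1))  # Query-Key relevance
        weights = relevance.softmax(dim=-1)                            # weights for the Values
        return self.fc(weights @ v)                                    # weighted Value -> FC

print(Attention(64)(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```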
As shown in FIG. 9, the first CTPR layer, the second CTPR layer and the third CTPR layer follow the same steps, each comprising:
S2-1, performing two convolution operations on the input feature map;
S2-2, processing the input feature map through a Transformer_P module;
S2-3, splicing the output of S2-1, the output of S2-2 and the input feature map, and then performing a convolution operation.
This layer is the key module proposed in the present invention for feature map region-level fusion processing. The module updates the expression vectors of the feature map by performing attention processing on different regions of the feature map, and the correlation among region blocks is considered in the expression vectors of the different regions.
As shown in FIG. 10, the Transformer_P module operates as follows: first the feature map is partitioned (Partition) into region blocks, which are then unfolded block by block; a position parameter (Add Pos) is added to each region-block vector, all vectors are processed by an Attention module (which updates the expression vector of each region block based on the relevance between region blocks) and then spliced (Concat), an output vector is obtained through a fully connected layer (FC), and finally the vectors are restored to a feature map through a Reshape operation. The Attention module of the Transformer_P module has the same structure as that of the Transformer_C module.
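A hedged sketch of the patch-level flow, again using standard multi-head attention and an assumed patch size of 4, might look like this:

```python
import torch
import torch.nn as nn

class TransformerP(nn.Module):
    """Patch-level attention: each region block of the feature map becomes one token."""
    def __init__(self, channels, h, w, patch=4, heads=4):
        super().__init__()
        self.patch = patch
        d = channels * patch * patch
        n = (h // patch) * (w // patch)
        self.pos = nn.Parameter(torch.zeros(1, n, d))   # Add Pos per region block
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fc = nn.Linear(d, d)

    def forward(self, x):                               # x: (batch, C, H, W)
        b, c, h, w = x.shape
        p = self.patch
        # Partition into (H/p)*(W/p) region blocks and flatten each block into one vector.
        tokens = (x.reshape(b, c, h // p, p, w // p, p)
                   .permute(0, 2, 4, 1, 3, 5)
                   .reshape(b, -1, c * p * p)) + self.pos
        out, _ = self.attn(tokens, tokens, tokens)      # relevance between region blocks
        out = self.fc(out)
        # Reshape the tokens back into the original feature-map layout.
        return (out.reshape(b, h // p, w // p, c, p, p)
                   .permute(0, 3, 1, 4, 2, 5)
                   .reshape(b, c, h, w))

print(TransformerP(128, 60, 60)(torch.randn(1, 128, 60, 60)).shape)  # torch.Size([1, 128, 60, 60])
```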
The CTCR and CTPR perform fused encoding along the spatial axis and the channel axis, exchanging information at both the channel level and the spatial level. Both CTCR and CTPR use a residual structure. The Transformer_C module of the CTCR fuses channel-level information with a Transformer structure and weights the different feature maps along the channel axis with an attention mechanism, so that the detection network emphasizes the appropriate feature-map level when detecting targets of different scales; existing networks, by contrast, generally splice directly with Concat, which limits the detection effect. A third path keeps the input unchanged, and the data of the three paths are finally fused by a convolution module.
The Transformer_P module of the CTPR encodes the feature map with a Transformer structure in the spatial dimension, so the encoding of any position of the feature map takes the image information of the surrounding positions into account; it is a global encoding mechanism. The invention uses 4 residual structure convolution layers, which extract deeper features and benefit both accuracy and parameter-optimization efficiency. The activation function is SiLU, which, compared with Leaky ReLU, has a lower bound, no upper bound and is differentiable everywhere, enhancing the nonlinear expressive capability of the model. In addition, the Focus layer and the spatial pyramid pooling layer are used together for further optimization.
The overall flow of the double-axis Transformer module (shown in Table 1) is as follows:
Layer 1 is the Focus layer: the original picture is divided into 4 parts by sampling and spliced at channel level, then passed through a CBS layer with 32 convolution kernels, giving an output feature map of size 240 × 240 × 32. Layers 2 to 11 consist of CBS, C3 and SPP layers, each taking its input from the previous layer. Layers 12 and 15 are up-sampling layers using the nearest-neighbour method with a scale factor of 2. Layer 13 is a Concat layer. The remaining layers consist of CBS, CTPR and CTCR layers. All convolution kernels in TDD are 3 × 3. Data flows through the layers in order; the last layers are the Detect layers, whose required feature maps come from layers 17, 20 and 23.
TABLE 1 TDD layer categories, inputs and hyperparameters
Layer | Category | Input | Output size | Parameters
0 | Original picture | Original picture | 960*960*3 | -
1 | Focus | Original picture | 240*240*32 | 32 convolution kernels, stride 1
2 | CBS | Previous layer | 120*120*64 | 64 convolution kernels, stride 2
3 | C3 | Previous layer | 120*120*64 | 64 convolution kernels, stride 1
4 | CBS | Previous layer | 60*60*128 | 128 convolution kernels, stride 2
5 | C3 | Previous layer | 60*60*128 | 128 convolution kernels, stride 1
6 | CBS | Previous layer | 30*30*256 | 256 convolution kernels, stride 2
7 | C3 | Previous layer | 30*30*256 | 256 convolution kernels, stride 1
8 | CBS | Previous layer | 15*15*512 | 512 convolution kernels, stride 2
9 | SPP | Previous layer | 15*15*512 | 512 convolution kernels, stride 1
10 | C3 | Previous layer | 15*15*512 | 512 convolution kernels, stride 1
11 | CBS | Previous layer | 15*15*256 | 256 convolution kernels, stride 1
12 | Upsampling | Previous layer | 30*30*512 | Nearest neighbour, scale factor 2
13 | Concat | Layers 7 and 12 | 30*30*768 | -
14 | CBS | Previous layer | 30*30*128 | 128 convolution kernels, stride 1
15 | Upsampling | Previous layer | 60*60*128 | Nearest neighbour, scale factor 2
16 | CTCR | Layers 5, 13 and 15 | - | -
17 | CTPR | Previous layer | 60*60*128 | 128 convolution kernels, stride 1
18 | CBS | Previous layer | 30*30*128 | 128 convolution kernels, stride 2
19 | CTCR | Layers 14 and 18 | - | -
20 | CTPR | Previous layer | 30*30*256 | 256 convolution kernels, stride 1
21 | CBS | Previous layer | 15*15*256 | 256 convolution kernels, stride 2
22 | CTCR | Layers 11 and 21 | 15*15*512 | -
23 | CTPR | Previous layer | 15*15*512 | 512 convolution kernels, stride 1
24 | First detection layer | Layer 17 | - | -
25 | Second detection layer | Layer 20 | - | -
26 | Third detection layer | Layer 23 | - | -
Experimental verification
The YOLOv5 algorithm was chosen as the baseline for comparison, for two reasons:
1. the YOLOv5 algorithm is a leading, representative algorithm in the field of target detection;
2. its model size and parameter count are comparable to those of the proposed method, so the effects can be compared on an equal footing.
Experiment one: vehicle detection
The data set consists of video footage of real roads, recorded at 25 frames per second (fps) with a resolution of 960 x 540 pixels. It comprises a training set of about 80,000 pictures and a test set of about 14,000 pictures. As shown in FIG. 11, the labels cover four categories: car, van, bus and other. The distribution of label box size ratios is shown in FIG. 12.
Table 2 experiment one detection accuracy
(The detection accuracy table is provided as an image in the original publication.)
And (4) experimental conclusion:
TDD achieved its best result at the 5th training epoch, with Precision 0.729, Recall 0.761 and mAP 0.798. YOLOv5 achieved its best result at the 10th training epoch, with Precision 0.683, Recall 0.746 and mAP 0.783. Taken together, these metrics show that the proposed method converges faster and is more accurate than the YOLOv5 algorithm.
Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Claims (10)

1. The vehicle detection method based on the feature map double-axis Transformer module comprises the following steps:
1) Collecting a road monitoring picture and marking a vehicle outline in the picture to form a data set;
2) Inputting the data set into a double-axis Transformer module for training;
3) Detecting a road monitoring picture by using the double-axis Transformer module;
the double-axis Transformer module comprises a Backbone network, a Neck network and a Prediction network;
the Backbone network comprises a Focus layer, a first convolution layer, a first residual structure convolution layer, a second residual structure convolution layer, a third residual structure convolution layer, a fourth convolution layer, a spatial pyramid pooling layer and a fourth residual structure convolution layer which are sequentially connected;
the Neck network comprises a fifth convolution layer, a first up-sampling layer, a splicing layer, a sixth convolution layer, a second up-sampling layer, a first CTCR layer, a first CTPR layer, a seventh convolution layer, a second CTCR layer, a second CTPR layer, an eighth convolution layer, a third CTCR layer and a third CTPR layer which are sequentially connected;
the Prediction network classifies targets and predicts their boundaries on the feature maps based on candidate boxes of preset sizes, and comprises a first detection layer, a second detection layer and a third detection layer;
the output of the second residual structure convolution layer is further connected with the input of the first CTCR layer, the output of the third residual structure convolution layer is further connected with the input of the splicing layer, the output of the fifth convolution layer is further connected with the input of the third CTPR layer, the output of the splicing layer is further connected with the input of the first CTPR layer, and the output of the sixth convolution layer is further connected with the input of the second CTPR layer; the output of the first CTPR layer is also connected with the input of the first detection layer, the output of the second CTPR layer is also connected with the input of the second detection layer, and the output of the third CTPR layer is connected with the input of the third detection layer.
The first CTCR layer, the second CTCR layer and the third CTCR layer all comprise:
S1-1, performing two convolution operations on the input feature map;
S1-2, processing the input feature map through a Transformer_C module;
S1-3, splicing the output of S1-1, the output of S1-2 and the input feature map, and then performing a convolution operation;
the first CTPR layer, the second CTPR layer and the third CTPR layer each comprise:
S2-1, performing two convolution operations on the input feature map;
S2-2, processing the input feature map through a Transformer_P module;
S2-3, splicing the output of S2-1, the output of S2-2 and the input feature map, and then performing a convolution operation.
2. The feature map two-axis Transformer module-based vehicle detection method of claim 1, wherein the Transformer_C module sequentially performs the following steps: first unfolding the input feature map at channel level, then adding a position parameter to each vector, splicing all vectors after they are processed by an Attention module, then obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
3. The feature map two-axis Transformer module-based vehicle detection method according to claim 2, characterized in that the Transformer_P module sequentially performs the following steps: first partitioning the feature map into region blocks, then unfolding it by region block, then adding a position parameter to each region-block vector, splicing all vectors after they are processed by an Attention module, obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
4. The feature map two-axis Transformer module-based vehicle detection method according to claim 3, characterized in that the Attention module sequentially maps the input vector into a Query vector, a Key vector and a Value vector, computes the relevance between the Query vector and the Key vector as the weight for the weighted Value, and obtains the output result through a fully connected layer.
5. The feature map two-axis Transformer module-based vehicle detection method according to claim 4, characterized in that the Focus layer comprises: performing a slicing operation on each picture, dividing it into four complementary sub-pictures, splicing the four sub-pictures, and performing a convolution operation to obtain a 2x downsampled feature map.
6. The feature map two-axis Transformer module-based vehicle detection method of claim 5, wherein the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, the seventh convolution layer, the eighth convolution layer and the convolution operation all follow the same steps.
7. The feature map two-axis Transformer module-based vehicle detection method of claim 6, wherein: the first residual structure convolution layer, the second residual structure convolution layer, the third residual structure convolution layer and the fourth residual structure convolution layer all include:
S3-1, performing two convolution operations on the input feature map;
S3-2, adding the output of S3-1 to the input feature map;
S3-3, performing one convolution operation on the input feature map;
S3-4, splicing the outputs of S3-2 and S3-3, and then performing a convolution operation again.
8. The feature map two-axis Transformer module-based vehicle detection method according to claim 7, characterized in that: the spatial pyramid pooling layer includes:
S4-1, performing one convolution operation on the input feature map;
S4-2, pooling the input feature map with pooling windows of different sizes;
S4-3, splicing the input feature map with the output of S4-2, and then performing a convolution operation.
9. The feature map two-axis Transformer module-based vehicle detection method of claim 8, wherein: and the splicing layer splices the shallow feature map and the deep feature map to fuse the information of the multi-level feature map.
10. The feature map two-axis Transformer module-based vehicle detection method of claim 9, wherein: the first up-sampling layer and the second up-sampling layer both adopt an interpolation method; the first convolution layer includes a vector convolution, a BatchNorm layer, and an activation.
CN202211621191.4A 2022-12-16 2022-12-16 Vehicle detection method based on feature map double-axis Transformer module Pending CN115953625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211621191.4A CN115953625A (en) 2022-12-16 2022-12-16 Vehicle detection method based on feature map double-axis Transformer module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211621191.4A CN115953625A (en) 2022-12-16 2022-12-16 Vehicle detection method based on feature map double-axis Transformer module

Publications (1)

Publication Number Publication Date
CN115953625A true CN115953625A (en) 2023-04-11

Family

ID=87286938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211621191.4A Pending CN115953625A (en) 2022-12-16 2022-12-16 Vehicle detection method based on characteristic diagram double-axis Transformer module

Country Status (1)

Country Link
CN (1) CN115953625A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704317A (en) * 2023-08-09 2023-09-05 深圳华付技术股份有限公司 Target detection method, storage medium and computer device
CN116704317B (en) * 2023-08-09 2024-04-19 深圳华付技术股份有限公司 Target detection method, storage medium and computer device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination