CN115953625A - Vehicle detection method based on feature map double-axis Transformer module - Google Patents

Vehicle detection method based on feature map double-axis Transformer module

Info

Publication number
CN115953625A
CN115953625A (application CN202211621191.4A)
Authority
CN
China
Prior art keywords
layer
convolution
feature map
input
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211621191.4A
Other languages
Chinese (zh)
Inventor
刘尧
张玉杰
杜逸
李炎
金忠富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intelligent Transportation Research Branch Of Zhejiang Transportation Investment Group Co ltd
Original Assignee
Intelligent Transportation Research Branch Of Zhejiang Transportation Investment Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intelligent Transportation Research Branch Of Zhejiang Transportation Investment Group Co ltd
Priority to CN202211621191.4A
Publication of CN115953625A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The present invention relates to the field of object detection in computer vision and is characterized by high detection precision. The technical scheme is a vehicle detection method based on a feature map double-axis Transformer module, comprising the following steps: 1) collecting road monitoring pictures and marking the vehicle outlines in the pictures to form a data set; 2) inputting the data set into the double-axis Transformer module for training; 3) detecting road monitoring pictures with the trained double-axis Transformer module.

Description

Vehicle detection method based on feature map double-axis Transformer module
Technical Field
The invention relates to the field of target detection in computer vision, in particular to a vehicle detection method based on a feature map double-axis Transformer module.
Background
Target detection techniques fall into two categories according to the number of stages: two-stage detection and one-stage detection.
The basic flow of the two-stage detection technique is to propose target candidate boxes: the first stage computes the rough position, size and foreground probability of each candidate box, and the second stage computes its precise position, size and category. Representative methods include R-CNN (Regions with CNN features), SPP-Net (Spatial Pyramid Pooling), Fast R-CNN, Faster R-CNN, and the like. The one-stage detection technique directly computes the size, position and category of targets with a deep neural network; representative methods are YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector). The two-stage framework achieves higher detection precision but limits detection speed, while the one-stage technique is fast but less precise.
In recent years, transformers have achieved remarkable results in the fields of natural language processing and search recommendation. The Transformer is proposed by an attention isallyoud in the field of natural language processing, different information is endowed with different weights through an attention mechanism, and fusion coding is performed by considering the interaction relationship among the information. The Transformer structure is superior to the convolution structure in information extraction and processing capability. The applications of the Transformer in machine vision are, for example, VIT (vision transform), but the Transformer structure has high computational complexity (the Transformer belongs to a one-stage detection technology), and the Transformer improves the model effect and greatly increases the computational complexity. In order to ensure the efficiency of the model, swinTransformer, C-SwinTransformer were proposed in succession, the main way of which is to calculate the in-window Attention by the sliding window method. When the Transformer module is adopted, the method is mostly used in an input image or a backhaul stage, and at the moment, the image size is large, and the Transformer calculation complexity is high.
Disclosure of Invention
The invention aims to overcome the defects in the background art and provide a vehicle detection method based on a feature map double-axis Transformer module, which offers high detection precision.
The technical scheme of the invention is as follows:
the vehicle detection method based on the feature map double-axis Transformer module comprises the following steps:
1) Collecting a road monitoring picture and marking a vehicle outline in the picture to form a data set;
2) Inputting the data set into a double-axis Transformer module for training;
3) Detecting a road monitoring picture by using the double-axis Transformer module;
the double-shaft Transformer module comprises a backhaul network, a Neck network and a Prediction network;
the backhaul network comprises a Focus layer, a first convolution layer, a first residual structure convolution layer, a second residual structure convolution layer, a third residual structure convolution layer, a fourth convolution layer, a spatial pyramid pooling layer and a fourth residual structure convolution layer which are sequentially connected;
the Neck network comprises a fifth convolution layer, a first up-sampling layer, a splicing layer, a sixth convolution layer, a second up-sampling layer, a first CTCR layer, a first CTPR layer, a seventh convolution layer, a second CTCR layer, a second CTPR layer, an eighth convolution layer, a third CTCR layer and a third CTPR layer which are sequentially connected;
the Prediction network classifies and predicts the boundary of the target of the feature map based on a candidate box with a preset size, and comprises a first detection layer, a second detection layer and a third detection layer;
the output of the second residual structure convolution layer is further connected with the input of the first CTCR layer, the output of the third residual structure convolution layer is further connected with the input of the splicing layer, the output of the fifth convolution layer is further connected with the input of the third CTPR layer, the output of the splicing layer is further connected with the input of the first CTPR layer, and the output of the sixth convolution layer is further connected with the input of the second CTPR layer; the output of the first CTPR layer is also connected with the input of the first detection layer, the output of the second CTPR layer is also connected with the input of the second detection layer, and the output of the third CTPR layer is connected with the input of the third detection layer.
The first CTCR layer, the second CTCR layer and the third CTCR layer each comprise:
S1-1, performing two convolution operations on the input feature map;
S1-2, processing the input feature map through a Transformer_C module;
S1-3, splicing the output of S1-1, the output of S1-2 and the input feature map, and then performing a convolution operation;
the first CTPR layer, the second CTPR layer and the third CTPR layer each comprise:
S2-1, performing two convolution operations on the input feature map;
S2-2, processing the input feature map through a Transformer_P module;
S2-3, splicing the output of S2-1, the output of S2-2 and the input feature map, and then performing a convolution operation.
The Transformer_C module sequentially performs the following steps: first unfolding the input feature map at channel level, then adding a position parameter to each vector, splicing all vectors after they are processed by an Attention module, then obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
The Transformer_P module sequentially performs the following steps: first partitioning the feature map into region blocks, then unfolding it by region block, then adding a position parameter to each region-block vector, splicing all vectors after they are processed by an Attention module, obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
The Attention module sequentially maps the input vector into a Query vector, a Key vector and a Value vector, computes the relevance between the Query vector and the Key vector as the weight for the weighted Value, and obtains the output result through a fully connected layer.
The Focus layer comprises: performing a slicing operation on each picture, dividing it into four complementary sub-pictures, splicing the four sub-pictures, and performing a convolution operation to obtain a 2x downsampled feature map.
The first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, the seventh convolution layer, the eighth convolution layer and the convolution operation have the same steps.
The first residual structure convolution layer, the second residual structure convolution layer, the third residual structure convolution layer and the fourth residual structure convolution layer each comprise:
S3-1, performing two convolution operations on the input feature map;
S3-2, adding the output of S3-1 to the input feature map;
S3-3, performing one convolution operation on the input feature map;
S3-4, splicing the outputs of S3-2 and S3-3, and then performing a convolution operation again.
The spatial pyramid pooling layer comprises:
S4-1, performing one convolution operation on the input feature map;
S4-2, pooling the input feature map with pooling windows of different sizes;
S4-3, splicing the input feature map with the output of S4-2, and then performing a convolution operation.
And the splicing layer splices the shallow feature map and the deep feature map to fuse the information of the multi-level feature map.
The first up-sampling layer and the second up-sampling layer both adopt an interpolation method; the first convolution layer includes a vector convolution, a BatchNorm layer, and an activation.
The beneficial effects of the invention are:
according to the method, the model detection precision is improved by fusing the YOLOV5 and the Transformer structures, the Transformer structures are introduced into the calculation stage of the nerck characteristic diagram of the target detection model, and the biaxial Transformer model structures (CTCR and CTPR) are designed to fuse information between different characteristic diagrams and between characteristic diagram channels, so that the detection precision is improved, and the method is applied to a vehicle detection scene for verification; the prior algorithm adopts concat structure to splice in the neck part, and the invention adopts double-shaft (Twin-axial) -transform module fusion; a feature map patch level-based Transformer structure (CTPR) is used in a feature map fusion stage, so that the phenomenon that the computational complexity is increased and the model computational efficiency is reduced due to the use of the Transformer structure in a backbone stage is avoided; and a feature map channel level Transformer structure (CTCR) is used in a feature map fusion stage, so that the feature map fusion effect is improved, and the detection precision is improved.
Drawings
FIG. 1 is an overall architecture diagram of the double-axis Transformer module of the present invention.
Fig. 2 is a schematic view of the Focus layer of the present invention.
FIG. 3 is a schematic view of the first convolution layer of the present invention.
FIG. 4 is a diagram of a first residual structure convolutional layer of the present invention.
Fig. 5 is a schematic diagram of the spatial pyramid pooling layer of the present invention.
Fig. 6 is a schematic diagram of a first CTCR layer of the invention.
FIG. 7 is a diagram of the Transformer_C module of the present invention.
FIG. 8 is a schematic diagram of the Attention module of the present invention.
Fig. 9 is a schematic view of a first CTPR layer of the present invention.
FIG. 10 is a diagram of the Transformer_P module of the present invention.
FIG. 11 is a label distribution plot for a data set.
FIG. 12 is a plot of the label box size scale for the labels of the data set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention provides a vehicle detection method based on a feature map double-axis Transformer module. Vehicle detection from road monitoring video is an important basis for road monitoring and control. The method comprises the following steps:
1) Collecting a road monitoring picture and marking a vehicle outline in the picture to form a data set; dividing a data set into a training set and a test set;
2) Inputting the data set into a double-axis Transformer module for training to obtain a trained double-axis Transformer module;
3) Detecting road monitoring pictures with the trained double-axis Transformer module.
The double-axis Transformer module (feature map double-axis Transformer based detector, TDD for short) includes three parts (shown in FIG. 1): a Backbone (trunk) network, a Neck network, and a Prediction (detection) network.
The Backbone network comprises four types of neural network layers, namely a Focus layer, convolution layers (CBS), residual structure convolution layers (C3) and a spatial pyramid pooling layer (SPP). The Backbone network comprises a Focus layer, a first convolution layer, a first residual structure convolution layer, a second residual structure convolution layer, a third residual structure convolution layer, a fourth convolution layer, a spatial pyramid pooling layer and a fourth residual structure convolution layer which are sequentially connected.
The Neck network comprises five types of neural network layers, namely convolution layers (CBS), up-sampling layers, a splicing layer (Concat), CTCR layers (channel-level residual structure convolution-Transformer layers, abbreviated CTCR in the figures) and CTPR layers (patch-level residual structure convolution-Transformer layers, abbreviated CTPR in the figures). The Neck network comprises a fifth convolution layer, a first up-sampling layer, a splicing layer, a sixth convolution layer, a second up-sampling layer, a first CTCR layer, a first CTPR layer, a seventh convolution layer, a second CTCR layer, a second CTPR layer, an eighth convolution layer, a third CTCR layer and a third CTPR layer which are sequentially connected.
The output of the second residual structure convolution layer is further connected with the input of the first CTCR layer, the output of the third residual structure convolution layer is further connected with the input of the splicing layer, the output of the fifth convolution layer is further connected with the input of the third CTPR layer, the output of the splicing layer is further connected with the input of the first CTPR layer, and the output of the sixth convolution layer is further connected with the input of the second CTPR layer.
The Prediction network classifies and predicts the boundary of the target on the feature map based on the candidate box with preset size. The Prediction network comprises a first detection layer, a second detection layer and a third detection layer. The output of the first CTPR layer is also connected with the input of the first detection layer, the output of the second CTPR layer is also connected with the input of the second detection layer, and the output of the third CTPR layer is connected with the input of the third detection layer.
The three detection layers are connected to feature maps of different sizes: the feature map corresponding to the first detection layer is the largest and is used for small-target detection, the feature map corresponding to the second detection layer is of medium size and is used for medium-target detection, and the feature map corresponding to the third detection layer is the smallest and is used for large-target detection.
The Focus layer first slices the picture, then performs channel-level splicing (expanding the channel count), and finally performs a convolution. Specifically, as shown in FIG. 2, a value is taken at every other pixel point of the original input picture (similar to nearest-neighbour downsampling), dividing the picture into four parts (Slice) which are then spliced (Concat). The four sub-pictures are complementary, so no information is lost; the W and H information is concentrated into the channel space and the input channel count is expanded 4-fold, so the spliced picture has 12 channels relative to the original RGB three-channel picture. A convolution operation (CBS) is then performed on the new picture, finally yielding a 2x downsampled feature map without information loss.
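A minimal PyTorch sketch of this slicing-and-convolution flow is given below; the class name, kernel size and channel counts are illustrative assumptions rather than values fixed by the description.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four complementary sub-images, concatenate them on the
    channel axis (3 -> 12 channels), then apply one CBS convolution."""
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        # Take every other pixel in each direction: four complementary slices,
        # so width/height information moves into the channel dimension losslessly.
        sliced = torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(sliced)

out = Focus()(torch.randn(1, 3, 960, 960))
print(out.shape)  # torch.Size([1, 32, 480, 480]); spatial size halved, no information lost
```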
As shown in FIG. 4, the first residual structure convolution layer, the second residual structure convolution layer, the third residual structure convolution layer and the fourth residual structure convolution layer follow the same steps (C3 denotes each residual structure convolution layer; a minimal sketch follows these steps):
S3-1, performing two convolution operations (CBS) on the input feature map;
S3-2, adding (Add) the output of S3-1 to the input feature map; the addition is element-wise over corresponding channels of the feature maps, so the channel count is unchanged;
S3-3, performing one convolution operation (CBS) on the input feature map;
S3-4, splicing (Concat) the outputs of S3-2 and S3-3, and then performing a convolution operation (CBS) again; splicing expands the channel depth, so the channel count increases.
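Assuming each convolution above is a CBS unit and that the layer preserves its channel count, the residual structure might be sketched as follows; the channel number and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def cbs(in_ch, out_ch, k=3, s=1):
    # CBS unit: convolution -> BatchNorm -> SiLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )

class C3(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.double_conv = nn.Sequential(cbs(ch, ch), cbs(ch, ch))  # S3-1: two convolutions
        self.single_conv = cbs(ch, ch)                              # S3-3: one convolution
        self.fuse = cbs(2 * ch, ch)                                 # S3-4: conv after concat

    def forward(self, x):
        a = self.double_conv(x) + x                 # S3-2: element-wise add, channels unchanged
        b = self.single_conv(x)
        return self.fuse(torch.cat([a, b], dim=1))  # S3-4: concat grows channels, conv restores

print(C3(64)(torch.randn(1, 64, 120, 120)).shape)  # torch.Size([1, 64, 120, 120])
```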
As shown in FIG. 5, the spatial pyramid pooling layer (SPP) extracts features at different scales (a sketch follows these steps):
S4-1, performing one convolution operation (CBS) on the input feature map;
S4-2, pooling the input feature map with pooling windows (MaxPool) of different sizes; three pooling windows are shown, of size 5 × 5, 9 × 9 and 13 × 13 respectively;
S4-3, splicing (Concat) the input feature map with the output of S4-2, and then performing a convolution operation (CBS).
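Under the assumption that S4-1 and S4-3 are 1 × 1 CBS convolutions, the pooling pyramid above might be sketched as follows; the 5/9/13 window sizes come from the description, everything else is illustrative.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, ch, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Conv2d(ch, ch, 1, bias=False), nn.BatchNorm2d(ch), nn.SiLU())    # S4-1
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in pool_sizes)      # S4-2
        self.post = nn.Sequential(
            nn.Conv2d(ch * (len(pool_sizes) + 1), ch, 1, bias=False),
            nn.BatchNorm2d(ch), nn.SiLU())                                      # S4-3

    def forward(self, x):
        x = self.pre(x)
        feats = [x] + [p(x) for p in self.pools]   # pool with 5x5, 9x9 and 13x13 windows
        return self.post(torch.cat(feats, dim=1))  # splice, then convolve

print(SPP(512)(torch.randn(1, 512, 15, 15)).shape)  # torch.Size([1, 512, 15, 15])
```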
The first up-sampling layer and the second up-sampling layer both use nearest-neighbour interpolation, the simplest interpolation method, which requires no computation: new elements are inserted between the pixel points of the original image, and each pixel to be filled in is assigned the grey level of the nearest of its neighbouring pixels.
The splicing layer performs channel-level splicing on multiple feature maps, whose sizes are unified by the first up-sampling layer before splicing. The purpose of the splicing layer is to fuse the information of multi-level feature maps; in the invention, the splicing layer splices a shallow feature map with a deep feature map, the shallow feature map benefiting the boundary computation of target detection and the deep feature map benefiting image semantic computation. The splicing layer splices the output of the third residual structure convolution layer (shallow) with the output of the first up-sampling layer (deep); a small illustration follows.
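For illustration, assuming tensor shapes loosely based on Table 1 below, nearest-neighbour upsampling followed by the channel-level splice might look like this:

```python
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode="nearest")  # interpolation, no learned parameters

shallow = torch.randn(1, 256, 30, 30)  # e.g. a shallow feature map from the third C3 layer
deep = torch.randn(1, 512, 15, 15)     # e.g. a deep feature map entering the upsampling layer

fused = torch.cat([shallow, up(deep)], dim=1)  # sizes unified first, then channel-level splice
print(fused.shape)  # torch.Size([1, 768, 30, 30])
```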
The steps of the first to eighth convolution layers are the same as those of the convolution operation (CBS), and all comprise (as shown in FIG. 3): subjecting the feature map to a vector convolution (Conv), a BatchNorm layer (BN) and an activation (SiLU). The activation uses the SiLU function (Sigmoid Weighted Linear Unit).
The vector convolution multiplies the convolution kernel element-wise with the values on the feature map and sums the products; the kernel slides up, down, left and right over the feature map to cover all positions. The BatchNorm layer normalizes all data in the batch: it first computes the mean and standard deviation of the batch, then subtracts the mean from all convolution outputs and divides by the standard deviation, and finally introduces scaling and translation variables, i.e. multiplies by a learnable coefficient and adds an offset. The BatchNorm layer effectively alleviates the vanishing-gradient problem and accelerates convergence. The SiLU function (Sigmoid Weighted Linear Unit) is computed as:
SiLU(x)=x*Sigmoid(x)
The SiLU function is an unsaturated activation function and is differentiable over its entire domain.
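As a concrete illustration of one CBS unit and the SiLU activation, a minimal PyTorch sketch might read as follows; the channel counts and the 3 × 3 kernel are assumptions made for the example.

```python
import torch
import torch.nn as nn

def silu(x):
    return x * torch.sigmoid(x)  # SiLU(x) = x * Sigmoid(x)

cbs = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1, bias=False),  # vector convolution
    nn.BatchNorm2d(32),  # normalize, then scale and shift with learnable parameters
    nn.SiLU(),           # same function as silu() above
)

x = torch.randn(2, 3, 64, 64)
assert torch.allclose(silu(x), nn.SiLU()(x))
print(cbs(x).shape)  # torch.Size([2, 32, 64, 64])
```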
The convolution operation is an important part of image processing: it essentially uses the parameters of convolution kernels to extract data features. Specifically, the kernel is multiplied element-wise with the values of the corresponding region of the image and the products are summed, and the operation is completed over the whole image by sliding the kernel; the relevant parameters include the kernel size, stride and number of kernels (output channel count).
As shown in FIG. 6, the first CTCR layer, the second CTCR layer and the third CTCR layer follow the same steps, each comprising:
S1-1, performing two convolution operations on the input feature map;
S1-2, processing the input feature map through a Transformer_C module;
S1-3, splicing the output of S1-1, the output of S1-2 and the input feature map, and then performing a convolution operation.
This layer is a key module for channel-level fusion of feature maps: its input is the result of splicing feature maps from different levels, and its output is passed to the next layer for further processing.
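A hedged sketch of this three-branch residual structure is shown below; the channel counts are assumptions, and nn.Identity stands in for the Transformer_C branch (itself sketched after the next paragraph) purely to keep the example runnable.

```python
import torch
import torch.nn as nn

class CTCR(nn.Module):
    """Three-branch residual structure: double convolution, channel-level Transformer,
    and the untouched input, spliced and fused by a final convolution (S1-1..S1-3)."""
    def __init__(self, ch, transformer_c):
        super().__init__()
        self.double_conv = nn.Sequential(                 # S1-1: two convolution operations
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
        )
        self.transformer_c = transformer_c                # S1-2: Transformer_C branch
        self.fuse = nn.Conv2d(3 * ch, ch, 1)              # S1-3: conv after 3-way splice

    def forward(self, x):
        spliced = torch.cat([self.double_conv(x), self.transformer_c(x), x], dim=1)
        return self.fuse(spliced)

# A CTPR layer follows the same pattern with Transformer_P in the middle branch.
print(CTCR(128, nn.Identity())(torch.randn(1, 128, 60, 60)).shape)  # torch.Size([1, 128, 60, 60])
```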
As shown in FIG. 7, the Transformer_C module operates as follows: first the input feature map is unfolded at channel level, so that an input of dimensions batch × channel × H × W becomes a set of channel vectors, each of length M = H × W; a position parameter (Add Pos) is added to each vector, all vectors are processed by an Attention module and then spliced (Concat), an output vector is obtained through a fully connected layer (FC), and finally the vectors are restored to a feature map through a Reshape operation.
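The following sketch approximates this channel-level flow with standard multi-head attention in place of the patent's Attention module; the class name, dimensions and head count are assumptions.

```python
import torch
import torch.nn as nn

class TransformerC(nn.Module):
    """Channel-level attention: each channel of the feature map becomes one token."""
    def __init__(self, channels, h, w, heads=4):
        super().__init__()
        d = h * w                                              # token length = H*W
        self.pos = nn.Parameter(torch.zeros(1, channels, d))   # Add Pos
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fc = nn.Linear(d, d)                              # fully connected output layer

    def forward(self, x):                                      # x: (batch, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2) + self.pos                       # unfold by channel -> (batch, C, H*W)
        out, _ = self.attn(tokens, tokens, tokens)             # attention across channel tokens
        out = self.fc(out)
        return out.reshape(b, c, h, w)                         # Reshape back to a feature map

print(TransformerC(128, 60, 60)(torch.randn(1, 128, 60, 60)).shape)  # torch.Size([1, 128, 60, 60])
```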
As shown in FIG. 8, the Attention module sequentially maps the input vector into a Query vector, a Key vector and a Value vector, computes the relevance between the Query vector and the Key vector as the weight for the weighted Value, and then obtains the output result through a fully connected layer (FC). Passing the vectors through multiple Attention modules in parallel yields a multi-head structure.
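A minimal single-head version of this Query/Key/Value computation might read as follows; the embedding size and the scaling by the square root of the dimension are common conventions assumed here rather than details taken from the description.

```python
import math
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # map input to Query
        self.k = nn.Linear(dim, dim)   # map input to Key
        self.v = nn.Linear(dim, dim)   # map input to Value
        self.fc = nn.Linear(dim, dim)  # final fully connected layer

    def forward(self, x):              # x: (batch, tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        relevance = (q @ k.transpose(-2, -1)) / math.sqrt(x.size(-1))  # Query-Key relevance
        weights = relevance.softmax(dim=-1)                            # weights for the Values
        return self.fc(weights @ v)                                    # weighted Value -> FC

print(Attention(64)(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```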
As shown in FIG. 9, the first CTPR layer, the second CTPR layer and the third CTPR layer follow the same steps, each comprising:
S2-1, performing two convolution operations on the input feature map;
S2-2, processing the input feature map through a Transformer_P module;
S2-3, splicing the output of S2-1, the output of S2-2 and the input feature map, and then performing a convolution operation.
This layer is the key module proposed in the present invention for feature map region-level fusion processing. The module updates the expression vectors of the feature map by performing attention processing on different regions of the feature map, and the correlation among region blocks is considered in the expression vectors of the different regions.
As shown in FIG. 10, the Transformer_P module operates as follows: first the feature map is partitioned (Partition) into region blocks, which are then unfolded block by block; a position parameter (Add Pos) is added to each region-block vector, all vectors are processed by an Attention module (which updates the expression vector of each region block based on the relevance between region blocks) and then spliced (Concat), an output vector is obtained through a fully connected layer (FC), and finally the vectors are restored to a feature map through a Reshape operation. The Attention module of the Transformer_P module has the same structure as that of the Transformer_C module.
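A hedged sketch of the patch-level flow, again using standard multi-head attention and an assumed patch size of 4, might look like this:

```python
import torch
import torch.nn as nn

class TransformerP(nn.Module):
    """Patch-level attention: each region block of the feature map becomes one token."""
    def __init__(self, channels, h, w, patch=4, heads=4):
        super().__init__()
        self.patch = patch
        d = channels * patch * patch
        n = (h // patch) * (w // patch)
        self.pos = nn.Parameter(torch.zeros(1, n, d))   # Add Pos per region block
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.fc = nn.Linear(d, d)

    def forward(self, x):                               # x: (batch, C, H, W)
        b, c, h, w = x.shape
        p = self.patch
        # Partition into (H/p)*(W/p) region blocks and flatten each block into one vector.
        tokens = (x.reshape(b, c, h // p, p, w // p, p)
                   .permute(0, 2, 4, 1, 3, 5)
                   .reshape(b, -1, c * p * p)) + self.pos
        out, _ = self.attn(tokens, tokens, tokens)      # relevance between region blocks
        out = self.fc(out)
        # Reshape the tokens back into the original feature-map layout.
        return (out.reshape(b, h // p, w // p, c, p, p)
                   .permute(0, 3, 1, 4, 2, 5)
                   .reshape(b, c, h, w))

print(TransformerP(128, 60, 60)(torch.randn(1, 128, 60, 60)).shape)  # torch.Size([1, 128, 60, 60])
```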
The CTCR and CTPR perform fused encoding along the spatial axis and the channel axis, exchanging information at both the channel level and the spatial level. Both CTCR and CTPR use a residual structure. The Transformer_C module of the CTCR fuses channel-level information with a Transformer structure and weights the different feature maps along the channel axis with an attention mechanism, so that the detection network emphasizes the appropriate feature-map level when detecting targets of different scales; existing networks, by contrast, generally splice directly with Concat, which limits the detection effect. A third path keeps the input unchanged, and the data of the three paths are finally fused by a convolution module.
The Transformer_P module of the CTPR encodes the feature map with a Transformer structure in the spatial dimension, so the encoding of any position of the feature map takes the image information of the surrounding positions into account; it is a global encoding mechanism. The invention uses 4 residual structure convolution layers, which extract deeper features and benefit both accuracy and parameter-optimization efficiency. The activation function is SiLU, which, compared with Leaky ReLU, has a lower bound, no upper bound and is differentiable everywhere, enhancing the nonlinear expressive capability of the model. In addition, the Focus layer and the spatial pyramid pooling layer are used together for further optimization.
The overall flow of the double-axis Transformer module (shown in Table 1) is as follows:
Layer 1 is the Focus layer: the original picture is divided into 4 parts by sampling and spliced at channel level, then passed through a CBS layer with 32 convolution kernels, giving an output feature map of size 240 × 240 × 32. Layers 2 to 11 consist of CBS, C3 and SPP layers, each taking its input from the previous layer. Layers 12 and 15 are up-sampling layers using the nearest-neighbour method with a scale factor of 2. Layer 13 is a Concat layer. The remaining layers consist of CBS, CTPR and CTCR layers. All convolution kernels in TDD are 3 × 3. Data flows through the layers in order; the last layers are the Detect layers, whose required feature maps come from layers 17, 20 and 23.
TABLE 1 TDD layer categories, inputs and hyperparameters
Layer | Category | Input | Output size | Parameters
0 | Original picture | Original picture | 960*960*3 | -
1 | Focus | Original picture | 240*240*32 | 32 convolution kernels, stride 1
2 | CBS | Previous layer | 120*120*64 | 64 convolution kernels, stride 2
3 | C3 | Previous layer | 120*120*64 | 64 convolution kernels, stride 1
4 | CBS | Previous layer | 60*60*128 | 128 convolution kernels, stride 2
5 | C3 | Previous layer | 60*60*128 | 128 convolution kernels, stride 1
6 | CBS | Previous layer | 30*30*256 | 256 convolution kernels, stride 2
7 | C3 | Previous layer | 30*30*256 | 256 convolution kernels, stride 1
8 | CBS | Previous layer | 15*15*512 | 512 convolution kernels, stride 2
9 | SPP | Previous layer | 15*15*512 | 512 convolution kernels, stride 1
10 | C3 | Previous layer | 15*15*512 | 512 convolution kernels, stride 1
11 | CBS | Previous layer | 15*15*256 | 256 convolution kernels, stride 1
12 | Upsampling | Previous layer | 30*30*512 | Nearest neighbour, scale factor 2
13 | Concat | Layers 7 and 12 | 30*30*768 | -
14 | CBS | Previous layer | 30*30*128 | 128 convolution kernels, stride 1
15 | Upsampling | Previous layer | 60*60*128 | Nearest neighbour, scale factor 2
16 | CTCR | Layers 5, 13 and 15 | - | -
17 | CTPR | Previous layer | 60*60*128 | 128 convolution kernels, stride 1
18 | CBS | Previous layer | 30*30*128 | 128 convolution kernels, stride 2
19 | CTCR | Layers 14 and 18 | - | -
20 | CTPR | Previous layer | 30*30*256 | 256 convolution kernels, stride 1
21 | CBS | Previous layer | 15*15*256 | 256 convolution kernels, stride 2
22 | CTCR | Layers 11 and 21 | 15*15*512 | -
23 | CTPR | Previous layer | 15*15*512 | 512 convolution kernels, stride 1
24 | First detection layer | Layer 17 | - | -
25 | Second detection layer | Layer 20 | - | -
26 | Third detection layer | Layer 23 | - | -
Experimental verification
The YOLOv5 algorithm was chosen as the baseline for comparison, for two reasons:
1. the YOLOv5 algorithm is a leading, representative algorithm in the field of target detection;
2. its model size and parameter count are comparable to those of the proposed method, so the effects can be compared on an equal footing.
Experiment one: vehicle detection
The data set consists of video footage of real roads, recorded at 25 frames per second (fps) with a resolution of 960 x 540 pixels. It comprises a training set of about 80,000 pictures and a test set of about 14,000 pictures. As shown in FIG. 11, the labels cover four categories: car, van, bus and other. The distribution of label box size ratios is shown in FIG. 12.
Table 2 experiment one detection accuracy
(The detection accuracy table is provided as an image in the original publication.)
And (4) experimental conclusion:
TDD achieved its best result at the 5th training epoch, with Precision 0.729, Recall 0.761 and mAP 0.798. YOLOv5 achieved its best result at the 10th training epoch, with Precision 0.683, Recall 0.746 and mAP 0.783. Taken together, these metrics show that the proposed method converges faster and is more accurate than the YOLOv5 algorithm.
Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Claims (10)

1. The vehicle detection method based on the feature map double-axis Transformer module comprises the following steps:
1) Collecting a road monitoring picture and marking a vehicle outline in the picture to form a data set;
2) Inputting the data set into a double-axis Transformer module for training;
3) Detecting a road monitoring picture by using the double-axis Transformer module;
the double-axis Transformer module comprises a Backbone network, a Neck network and a Prediction network;
the Backbone network comprises a Focus layer, a first convolution layer, a first residual structure convolution layer, a second residual structure convolution layer, a third residual structure convolution layer, a fourth convolution layer, a spatial pyramid pooling layer and a fourth residual structure convolution layer which are sequentially connected;
the Neck network comprises a fifth convolution layer, a first up-sampling layer, a splicing layer, a sixth convolution layer, a second up-sampling layer, a first CTCR layer, a first CTPR layer, a seventh convolution layer, a second CTCR layer, a second CTPR layer, an eighth convolution layer, a third CTCR layer and a third CTPR layer which are sequentially connected;
the Prediction network classifies targets and predicts their boundaries on the feature maps based on candidate boxes of preset sizes, and comprises a first detection layer, a second detection layer and a third detection layer;
the output of the second residual structure convolution layer is further connected with the input of the first CTCR layer, the output of the third residual structure convolution layer is further connected with the input of the splicing layer, the output of the fifth convolution layer is further connected with the input of the third CTPR layer, the output of the splicing layer is further connected with the input of the first CTPR layer, and the output of the sixth convolution layer is further connected with the input of the second CTPR layer; the output of the first CTPR layer is also connected with the input of the first detection layer, the output of the second CTPR layer is also connected with the input of the second detection layer, and the output of the third CTPR layer is connected with the input of the third detection layer.
The first CTCR layer, the second CTCR layer and the third CTCR layer all comprise:
S1-1, performing two convolution operations on the input feature map;
S1-2, processing the input feature map through a Transformer_C module;
S1-3, splicing the output of S1-1, the output of S1-2 and the input feature map, and then performing a convolution operation;
the first CTPR layer, the second CTPR layer and the third CTPR layer each comprise:
S2-1, performing two convolution operations on the input feature map;
S2-2, processing the input feature map through a Transformer_P module;
S2-3, splicing the output of S2-1, the output of S2-2 and the input feature map, and then performing a convolution operation.
2. The feature map two-axis Transformer module-based vehicle detection method of claim 1, wherein the Transformer_C module sequentially performs the following steps: first unfolding the input feature map at channel level, then adding a position parameter to each vector, splicing all vectors after they are processed by an Attention module, then obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
3. The feature map two-axis Transformer module-based vehicle detection method according to claim 2, characterized in that the Transformer_P module sequentially performs the following steps: first partitioning the feature map into region blocks, then unfolding it by region block, then adding a position parameter to each region-block vector, splicing all vectors after they are processed by an Attention module, obtaining output vectors through a fully connected layer, and finally restoring the vectors through a Reshape operation.
4. The feature map two-axis Transformer module-based vehicle detection method according to claim 3, characterized in that the Attention module sequentially maps the input vector into a Query vector, a Key vector and a Value vector, computes the relevance between the Query vector and the Key vector as the weight for the weighted Value, and obtains the output result through a fully connected layer.
5. The feature map two-axis Transformer module-based vehicle detection method according to claim 4, characterized in that the Focus layer comprises: performing a slicing operation on each picture, dividing it into four complementary sub-pictures, splicing the four sub-pictures, and performing a convolution operation to obtain a 2x downsampled feature map.
6. The feature map two-axis Transformer module-based vehicle detection method of claim 5, wherein the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, the seventh convolution layer, the eighth convolution layer and the convolution operation all follow the same steps.
7. The feature map two-axis Transformer module-based vehicle detection method of claim 6, wherein: the first residual structure convolution layer, the second residual structure convolution layer, the third residual structure convolution layer and the fourth residual structure convolution layer all include:
S3-1, performing two convolution operations on the input feature map;
S3-2, adding the output of S3-1 to the input feature map;
S3-3, performing one convolution operation on the input feature map;
S3-4, splicing the outputs of S3-2 and S3-3, and then performing a convolution operation again.
8. The feature map two-axis Transformer module-based vehicle detection method according to claim 7, characterized in that: the spatial pyramid pooling layer includes:
S4-1, performing one convolution operation on the input feature map;
S4-2, pooling the input feature map with pooling windows of different sizes;
S4-3, splicing the input feature map with the output of S4-2, and then performing a convolution operation.
9. The feature map two-axis Transformer module-based vehicle detection method of claim 8, wherein: and the splicing layer splices the shallow feature map and the deep feature map to fuse the information of the multi-level feature map.
10. The feature map two-axis Transformer module-based vehicle detection method of claim 9, wherein: the first up-sampling layer and the second up-sampling layer both adopt an interpolation method; the first convolution layer includes a vector convolution, a BatchNorm layer, and an activation.
CN202211621191.4A 2022-12-16 2022-12-16 Vehicle detection method based on feature map double-axis Transformer module Pending CN115953625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211621191.4A CN115953625A (en) 2022-12-16 2022-12-16 Vehicle detection method based on feature map double-axis Transformer module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211621191.4A CN115953625A (en) 2022-12-16 2022-12-16 Vehicle detection method based on feature map double-axis Transformer module

Publications (1)

Publication Number Publication Date
CN115953625A true CN115953625A (en) 2023-04-11

Family

ID=87286938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211621191.4A Pending CN115953625A (en) 2022-12-16 2022-12-16 Vehicle detection method based on characteristic diagram double-axis Transformer module

Country Status (1)

Country Link
CN (1) CN115953625A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704317A (en) * 2023-08-09 2023-09-05 深圳华付技术股份有限公司 Target detection method, storage medium and computer device
CN116704317B (en) * 2023-08-09 2024-04-19 深圳华付技术股份有限公司 Target detection method, storage medium and computer device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination