CN114842316A - Real-time target detection method combining convolutional neural network and Transformer network - Google Patents

Real-time target detection method combining convolutional neural network and Transformer network

Info

Publication number
CN114842316A
Authority
CN
China
Prior art keywords
network
detection
features
convolutional neural
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210508625.3A
Other languages
Chinese (zh)
Inventor
李国权
何斌
夏瑞阳
林金朝
庞宇
朱宏钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210508625.3A priority Critical patent/CN114842316A/en
Publication of CN114842316A publication Critical patent/CN114842316A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention claims protection for a real-time target detection method combining a convolutional neural network and a Transformer network, and belongs to the field of image processing. The method comprises the following steps: S1: inputting image data; S2: passing the image through a convolutional neural backbone network so that the extracted features carry inductive bias; S3: designing a detection neck network that transitions between the detection backbone network and the head network and provides high-resolution, high-semantic features to the detection head network; S4: designing a detection head network, introducing a Transformer into the head network to construct multiple long-range dependencies among the generated local features and to represent the target categories and coordinates in the image; S5: designing a nonlinear combination method to reduce false negative samples and improve the target-capturing capability of the detection model; S6: performing detection on natural data sets. With this approach, better performance was achieved on the challenging PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO 2017 data sets, surpassing many advanced real-time detection methods.

Description

Real-time target detection method combining convolutional neural network and Transformer network
Technical Field
The invention belongs to the field of image processing, and relates to a real-time target detection method combining a convolutional neural network and a Transformer network.
Background
Object detection is an attractive and challenging topic in computer vision. Its appeal comes from a wide range of applications, such as autonomous driving and robot navigation, while its challenges come from ever-changing scales, complex shapes, and diverse categories. With the rapid development of Convolutional Neural Networks (CNNs), the number of object detection models has grown rapidly. Although these models are diverse, they can be classified into anchor-based and anchor-free methods; through the deep stacking of convolution operations, they are sensitive to local regions of interest and require fewer parameters than a Multi-Layer Perceptron (MLP). However, a notable limitation of these methods is that the image features extracted by the detection network are confined to local regions. They therefore lack long-range semantic correlation, and long-range dependencies are important for the network to focus on regions of interest and ignore noise across the whole feature map. Furthermore, other work has shown mathematically that the effective Receptive Field (RF) of the extracted features is much smaller than the theoretical one, which means that the deep stacking of convolution operations is inefficient at establishing long-range dependencies between local image features.
Therefore, to overcome the inherent locality of convolution operations, several self-attention mechanisms based on local features have been proposed. On the other hand, the Transformer is an innovative network originally designed for natural language processing, where it mines multiple long-range correlations within sequential information in parallel; it has recently been introduced into computer vision and has achieved state-of-the-art results on many vision tasks. The success of various visual Transformer networks confirms the necessity of building long-range dependencies.
However, in contrast to CNNs, visual Transformer networks do not generalize well because they lack inductive biases such as translation invariance and locality, which means that they require either a sufficient amount of training data or a carefully chosen combination of training tricks. On the other hand, processing high-resolution images through stacked Self-Attention Networks (SANs) and MLPs in a visual Transformer network leads to a rapid increase in computational complexity. In addition, many object detection networks produce many correct boxes with low scores and many false boxes with high scores during detection.
A search of the prior art retrieves application publication No. CN110765886A, a target detection method and device based on a convolutional neural network. The method comprises: importing a real-time image into a target detection network and outputting the target objects contained in the real-time image; the target detection network comprises convolution layers, deconvolution layers, a feature enhancement block, a feature fusion block, a first regressor and a second regressor. Because the features extracted by a convolutional neural network are local, they often reflect only the local characteristics of the image, which introduces bias into subsequent feature extraction and the final prediction. The Transformer network builds long-range dependencies among features, taking key-value self-attention as the core of feature extraction, and performs excellently at processing global information. However, since the Transformer network has no inductive bias, it requires more data and stronger data augmentation to converge during training and to generalize well.
In order to solve the problem of training and prediction bias introduced by the locality of convolutional neural network features and the difficulty of making a Transformer network converge on a limited data set, the invention proposes a method combining a convolutional neural network and a Transformer network, thereby realizing real-time and accurate object detection. The target detection network model comprises a convolutional backbone network, a feature fusion network and a simplified Transformer detection head network. The convolutional neural network is located in the backbone, so that the extracted features have locality and translation invariance, and long-range dependencies between features are then constructed by the subsequent Transformer network. The feature fusion network provides rich high-resolution semantic features to the detection head network. The simplified Transformer detection head network exploits inputs that carry both locality and translation invariance, making the network easy to converge on a limited data set. In addition, the invention reduces the parameter count and computational complexity of the network, and further improves its convergence, by introducing a simplified spatial reduction module (LSR) and a simplified multi-layer perceptron (Lite MLP) into the simplified Transformer detection head network. Experiments show that the proposed method reduces the parameters of the network model while improving detection accuracy. Finally, the target-capturing capability of the detection model is further improved by the proposed nonlinear combination (i.e., combining the object confidence and the classification score through a logarithmic operation and introducing an additional hyper-parameter).
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing a real-time target detection method combining a convolutional neural network and a Transformer network. The technical scheme of the invention is as follows:
a real-time target detection method combining a convolutional neural network and a Transformer network comprises the following steps:
S1: inputting training image data into the network;
S2: designing a convolutional neural backbone network to extract features from the training images so that the extracted features carry inductive bias, namely the locality and translation invariance of convolutional neural network features;
S3: designing a detection neck network that transitions between the detection backbone network and the head network, provides high-resolution and high-semantic features to the detection head network, and compresses the channel dimensions of some layers;
S4: designing a detection head network, introducing a Transformer network, i.e., a fully self-attention-based network, into the head network to construct multiple long-range dependencies among the generated local features and to represent the target categories and coordinates present in the image;
S5: designing a nonlinear combination method: since the target detection model produces some false negative (FN) samples in the output of S4, the nonlinear combination method is designed to reduce the false negative samples and improve the target-capturing capability of the detection model;
S6: detecting on natural data sets and screening the prediction results with a non-maximum suppression algorithm; calculating the intersection over union (IoU) between the screened predictions and the ground-truth boxes, and accumulating the predictions to obtain the average precision AP50 at an IoU threshold of 0.5 and the average precision AP75 at an IoU threshold of 0.75 as the evaluation result of the model;
Further, the step S1 inputs an image data set to be trained, and specifically includes the following steps:
The training image data sets are PASCAL VOC and MS COCO, with the training batch size of each iteration set to 24. On PASCAL VOC, 50 rounds of multi-scale training are performed (input sizes 320, 352, 384, 416, 448, 480, 512, 544, 576 and 608), and the image size at test time is 448; on MS COCO, 300 rounds of three-scale training are performed (320, 352 and 384), and the length and width of the input image at test time are both 320. The output results are screened with a post-processing algorithm to obtain the final prediction results.
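For illustration only (not part of the original disclosure), a minimal PyTorch-style sketch of the multi-scale input pipeline described above is given below; the scale list and batch size follow the text, while the dataset object and all function names are hypothetical:

```python
import random
import torch
import torch.nn.functional as F

# Multi-scale sizes used on PASCAL VOC according to the text above.
VOC_SCALES = [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
BATCH_SIZE = 24

def multiscale_collate(batch, scales=VOC_SCALES):
    """Resize every image in the batch to one randomly chosen square scale.

    `batch` is assumed to be a list of (image_tensor[C,H,W], target) pairs;
    rescaling of the box targets is omitted here for brevity.
    """
    size = random.choice(scales)
    images, targets = zip(*batch)
    resized = [F.interpolate(img.unsqueeze(0), size=(size, size),
                             mode='bilinear', align_corners=False).squeeze(0)
               for img in images]
    return torch.stack(resized), list(targets)

# Usage sketch (voc_dataset is assumed to exist):
# loader = torch.utils.data.DataLoader(voc_dataset, batch_size=BATCH_SIZE,
#                                      shuffle=True, collate_fn=multiscale_collate)
```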
Further, the step S1 uses a post-processing algorithm to screen the output result, so as to obtain a final predicted result, which specifically includes:
The prediction results are screened with a non-maximum suppression algorithm, and the intersection over union (IoU) between each screened prediction and the ground-truth box is calculated. First, sample attributes are determined according to preset thresholds, namely two criteria: the PASCAL criterion and the standard MS COCO criterion; the PASCAL criterion uses IoU > 0.5 and IoU > 0.7, while the MS COCO criterion uses IoU thresholds from 0.5 to 0.95 with a step of 0.05. All samples are then sorted from high to low by classification score, the sorted samples are traversed, and the precision and recall of the traversed samples are calculated according to formula (1) and formula (2):
Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)
wherein TP, FP and FN represent true positives, false positives and false negatives;
According to the precision and recall obtained at each traversal step, a curve is constructed with recall on the X axis and precision on the Y axis; finally, the area under the precision-recall curve is computed to obtain the average precision (AP), and the mean average precision (mAP) over all categories is calculated. Three indices are used as model evaluation criteria: the average precision AP, AP50 (the AP at an IoU threshold of 0.5) and AP75 (the AP at an IoU threshold of 0.75). In addition, small, medium and large objects are evaluated with AP_small, AP_middle and AP_large, respectively.
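As an illustrative sketch (not from the disclosure), the AP computation described above can be implemented as follows; it assumes that IoU matching against ground truth has already produced a TP/FP flag per detection, and uses all-point interpolation of the precision-recall curve:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Compute AP from per-detection scores and TP/FP flags.

    scores : confidence of each detection
    is_tp  : 1 if the detection matches a ground-truth box at the chosen
             IoU threshold (e.g. 0.5 for AP50), else 0
    num_gt : total number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(scores))            # sort high -> low
    flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(1.0 - flags)
    precision = tp / np.maximum(tp + fp, 1e-12)        # formula (1)
    recall = tp / max(num_gt, 1)                       # formula (2)
    # Area under the precision-recall curve (all-point interpolation).
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```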
Further, the step S3 specifically includes:
The detection neck network comprises a feature data compression part and a feature fusion part. The feature data compression part operates on the third-order and fourth-order network layers of the convolutional backbone, which provide rich semantic information, and compresses the features extracted from the backbone with depthwise separable convolutions. The compressed features are then up-sampled by bilinear interpolation so that the spatial size of the third-order and fourth-order features matches that of the second-order features. Finally, the interpolated features are concatenated with the second-order features along the channel dimension; the second-order layer provides high-resolution low-level information. The feature fusion part then fuses the concatenated data with a depthwise separable convolution, i.e., the features of the second-, third- and fourth-order network layers are fused.
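A minimal PyTorch sketch of such a neck is given below, assuming second-, third- and fourth-order feature maps f2, f3, f4 with channel counts c2, c3, c4; the class and helper names, normalization and activation choices are illustrative assumptions, not taken from the disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_separable(in_ch, out_ch):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class DetectionNeck(nn.Module):
    """Compress C3/C4 channels to half, upsample to the C2 size, concatenate, fuse."""
    def __init__(self, c2, c3, c4):
        super().__init__()
        self.compress3 = dw_separable(c3, c3 // 2)
        self.compress4 = dw_separable(c4, c4 // 2)
        fused_in = c2 + c3 // 2 + c4 // 2
        self.fuse = dw_separable(fused_in, fused_in)

    def forward(self, f2, f3, f4):
        size = f2.shape[-2:]
        f3 = F.interpolate(self.compress3(f3), size=size, mode='bilinear', align_corners=False)
        f4 = F.interpolate(self.compress4(f4), size=size, mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([f2, f3, f4], dim=1))
```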
Further, the design of the multi-branch detection head network in step S4 comprises the following steps:
A dilated depthwise convolution is placed at the input of each branch to expand the receptive field (RF) of the different head branches. A split operation then divides the features from the feature fusion network (FFN) into two parts along the channel dimension, so that each part has half of the original number of channels; one part passes through the LT, the other part is concatenated with the output of the LT along the channel dimension, and a fusion operation then fuses the concatenated features with a 1 x 1 convolution kernel and outputs the fused features after a Leaky ReLU activation function. The simplified Transformer network (LT) in each detection head branch has a corresponding image patch size; the patch size and dilation factor on the head branches for large-, medium- and small-scale objects are set to 4, 2 and 1, respectively.
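For illustration, a sketch of one such head branch might look as follows; the dilation parameter follows the text, while the class name, prediction layer and the LT placeholder (an Identity module standing in for the simplified Transformer described in the next paragraphs, which is assumed to return a feature map of the same shape) are assumptions:

```python
import torch
import torch.nn as nn

class DetectionHeadBranch(nn.Module):
    """One head branch: dilated depthwise conv -> channel split -> LT on one half
    -> concatenate -> 1x1 fusion + Leaky ReLU -> prediction map."""
    def __init__(self, channels, num_outputs, dilation=4, lite_transformer=None):
        super().__init__()
        self.dilated_dw = nn.Conv2d(channels, channels, 3, padding=dilation,
                                    dilation=dilation, groups=channels, bias=False)
        # LT module operating on channels // 2; Identity used as a placeholder here.
        self.lt = lite_transformer if lite_transformer is not None else nn.Identity()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.LeakyReLU(0.1, inplace=True),
        )
        self.pred = nn.Conv2d(channels, num_outputs, 1)   # class + box regression map

    def forward(self, x):
        x = self.dilated_dw(x)
        a, b = torch.chunk(x, 2, dim=1)     # split into two halves along channels
        b = self.lt(b)                      # long-range dependencies on one half
        x = self.fuse(torch.cat([a, b], dim=1))
        return self.pred(x)
```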
Further, in step S4, the simplified Transformer network (LT) is divided into three different parts, as shown in Fig. 4. The first part divides the input features into non-overlapping image patches and adds a learnable position vector element-wise to the vector mapping of each patch to ensure the positional uniqueness of each patch vector mapping; in addition, a lightweight convolution operation is introduced into the LT network to reduce the computational complexity of obtaining the patch vector mappings; each patch vector mapping is also passed through the GELU activation function before being output;
which can be expressed as equation (3):
GELU(x) = 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³)))    (3)
where x represents the output of the simplified patch vector mapping and GELU(x) represents the output after the activation function.
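A possible sketch of this first part (lightweight convolutional patch embedding, learnable position vectors, GELU) is shown below; the module name, the exact depthwise/pointwise configuration and the requirement that num_patches match the input spatial size are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LitePatchEmbedding(nn.Module):
    """Split a feature map into non-overlapping patches with a lightweight
    (depthwise separable) strided convolution, add learnable positions, apply GELU."""
    def __init__(self, channels, patch_size, num_patches):
        super().__init__()
        # Non-overlapping patches: kernel == stride == patch_size.
        self.proj = nn.Sequential(
            nn.Conv2d(channels, channels, patch_size, stride=patch_size,
                      groups=channels, bias=False),       # depthwise, lightweight
            nn.Conv2d(channels, channels, 1, bias=False),  # pointwise projection
        )
        self.pos = nn.Parameter(torch.zeros(1, num_patches, channels))
        self.act = nn.GELU()

    def forward(self, x):                    # x: [B, C, H, W]
        x = self.proj(x)                     # [B, C, H/p, W/p]
        x = x.flatten(2).transpose(1, 2)     # [B, N, C] patch vector mappings
        return self.act(x + self.pos)        # add positions element-wise, then GELU
```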
In the second part, after the position vectors have been combined with the patch vector mappings, the query, key and value of each patch vector mapping are calculated, and multiple pieces of global association information are constructed in parallel in the multi-head spatial attention network (MSAN); the simplified spatial reduction module (LSR) is used in the branches that compute the key and value, where a non-overlapping convolution traverses the keys and values and compresses their sequence length; the ratio of the computational complexity of the global association information in the multi-head self-attention layers of LT and ViT is:
Ω_LT / Ω_ViT = (2·N·(N/p)·C) / (2·N²·C) = 1/p    (4)
wherein p, N and C represent the spatial reduction factor, the total number of patches and the number of channel dimensions, respectively;
multiple pieces of global association information are constructed in parallel in the multi-head spatial attention network (MSAN), which can be expressed as equation (5):
Attention(q, k, v) = softmax(q·kᵀ / √d_head)·v    (5)
where q, k and v represent the query, key and value, respectively, and d_head represents the channel dimension of the key; the simplified spatial reduction module is used in the key and value computation branches, and finally a multi-layer perceptron (MLP) with a shortcut connection is used to further extract global features.
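As an illustrative sketch consistent with the description above (the module name, head count and the use of a strided 2D convolution as the sequence-reduction mechanism are assumptions, not the literal disclosure), the LSR-based multi-head attention could look like this:

```python
import torch
import torch.nn as nn

class LSRAttention(nn.Module):
    """Multi-head attention whose key/value sequence is spatially reduced by a
    non-overlapping strided convolution, cutting the attention cost accordingly."""
    def __init__(self, dim, num_heads=4, reduction=2):
        super().__init__()
        self.h = num_heads
        self.d_head = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.lsr = nn.Conv2d(dim, dim, reduction, stride=reduction)  # sequence compression
        self.out = nn.Linear(dim, dim)

    def forward(self, x, hw):                      # x: [B, N, C], hw = (H, W) with H*W == N
        B, N, C = x.shape
        q = self.q(x).view(B, N, self.h, self.d_head).transpose(1, 2)
        # Spatially reduce the token map before computing keys and values.
        xr = x.transpose(1, 2).reshape(B, C, *hw)
        xr = self.lsr(xr).flatten(2).transpose(1, 2)                 # shorter sequence
        k, v = self.kv(xr).chunk(2, dim=-1)
        k = k.view(B, -1, self.h, self.d_head).transpose(1, 2)
        v = v.view(B, -1, self.h, self.d_head).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5        # equation (5)
        x = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.out(x)
```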
Further, the step S5 specifically includes:
an additional hyper-parameter is introduced, which can be expressed as equation (6):
R = log₃(1 + α·C)·S    (6)
where R represents the combined result, α is the hyper-parameter controlling the prediction box result, C is the object confidence, and S is the most likely category score.
The invention has the following advantages and beneficial effects:
the innovation of the invention is mainly the combination of claims 2-9. The invention provides a real-time target detector Transform Only Look One (TOLO) combined with a convolutional neural network and a Transformer network, which realizes real-time accurate detection of a target. The detector mainly comprises a detection backbone network, a detection neck network and a detection head network. The detection main network is used for extracting image features, the detection neck network is used for fusing different-order network layer features, and the detection head network is used for representing the category and the coordinate of the target. The method comprises the steps that for detecting a neck network, a feature fusion network is provided, feature fusion of different-order networks is achieved through the same path in a lightweight mode, and the problems of inconsistency of feature interaction of the different-order networks and high calculation complexity of the neck network are solved; the detection head network is composed of three different simplified transform head branches, large, medium and small-scale objects are detected respectively, the simplified transform inside each branch is used for extracting a plurality of remote dependency relationships among features, target detection is achieved with less memory overhead, and the problems that in the prior art, a real-time detection model is low in efficiency of building the remote dependency relationships and a vision transform network has high computational complexity are solved. In addition, in order to find a large number of potential correct prediction samples in the prediction process, the invention provides a nonlinear combination method between the confidence coefficient of the target and the classification score, and the capture capability of the detection model to the target is improved. Therefore, the invention improves the model respectively from three aspects of neck detection network, head detection network and nonlinear combination mode, solves the problems of low efficiency, overlarge calculation complexity and excessive false negative samples in the process of constructing the remote dependency relationship by the real-time detection model, reduces the parameter of the detection model and improves the real-time target detection performance.
Drawings
FIG. 1 is a block diagram of three core modules and an overall flow diagram of the preferred embodiment of the present invention.
Fig. 2 is a diagram of a detection neck network structure.
Fig. 3 is a detection header network structure.
Fig. 4 is an LT network structure.
Fig. 5 is a comparison of linear and non-linear combination effects.
Fig. 6 is a combined linear and non-linear visualization.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
a real-time target detection method combining a convolutional neural network and a Transformer network comprises the following steps:
S1: inputting training image data into the network;
S2: designing a convolutional neural backbone network to extract features from the training images so that the extracted features carry inductive bias, namely the locality and translation invariance of convolutional neural network features;
S3: designing a detection neck network that transitions between the detection backbone network and the head network, provides high-resolution and high-semantic features to the detection head network, and compresses the channel dimensions of some layers;
S4: designing a detection head network, introducing a Transformer network, i.e., a fully self-attention-based network, into the head network to construct multiple long-range dependencies among the generated local features and to represent the target categories and coordinates present in the image;
S5: designing a nonlinear combination method: since the target detection model produces some false negative (FN) samples in the output of S4, the nonlinear combination method is designed to reduce the false negative samples and improve the target-capturing capability of the detection model;
S6: detecting on natural data sets and screening the prediction results with a non-maximum suppression algorithm; calculating the intersection over union (IoU) between the screened predictions and the ground-truth boxes, and accumulating the predictions to obtain the average precision AP, the AP50 at an IoU threshold of 0.5 and the AP75 at an IoU threshold of 0.75 as the evaluation result of the model;
optionally, the S1 specifically includes the following steps:
The training image data adopt the PASCAL VOC and MS COCO data sets, and the training batch size of each iteration is set to 24. The invention performs 50 rounds of multi-scale training on PASCAL VOC (input sizes 320, 352, 384, 416, 448, 480, 512, 544, 576 and 608), with an image size of 448 at test time; on MS COCO, 300 rounds of three-scale training are performed (320, 352 and 384), and the input images at test time are all of size 320.
In S2: a ViT (Vision Transformer) network has a large computational cost and lacks inductive bias. The invention compensates for this drawback by combining a Convolutional Neural Network (CNN) with a Transformer: through the convolutional backbone network, the extracted features carry inductive bias, i.e., the features extracted by the convolutional neural network have locality and translation invariance.
Preferably, in S3, as shown in Fig. 2, attention is focused mainly on the network layers of the last three stages of the backbone. For the features of the last two layers, the invention compresses the channel dimension to half of the original with a depthwise separable convolution, because such a convolution requires fewer kernel parameters and less computation than a conventional convolution. Note that the number of channels after compression is proportional to the original number of channels, rather than being a fixed output dimension. This not only avoids losing a large amount of semantic information (especially in deep layers) but also keeps the fusion efficient thanks to the low computational complexity. After the channels are compressed, the features are up-sampled by bilinear interpolation so that their spatial size matches that of the first of the three layers. The features of the different layers are then concatenated along the channel dimension, and feature fusion is realized with a depthwise separable convolution to avoid a large increase in computational complexity. In this way, the features of each stage can interact directly with those of the other stages through the same path, without discrete layer-by-layer interactions.
Preferably, in S4, as shown in Fig. 3, the detection head network of the invention is composed of three different head networks (LTHs) that detect large-, medium- and small-scale objects, respectively. Many real-time target detectors are inefficient at establishing long-range dependencies among local features; the invention overcomes this drawback by combining CNN and Transformer. In all LTHs, a simplified Transformer (Lite Transformer, LT) efficiently mines global correlation information among features. The LTHs for objects of different scales differ somewhat in structure. A dilated depthwise convolution is placed at the input of each branch to further expand the Receptive Field (RF) of the different head branches; the LT in each branch has a corresponding image patch size, and the dilation factors and patch sizes on the head branches for large-, medium- and small-scale objects are set to 4, 2 and 1, respectively. The head branch for large-scale detection mainly considers the rich channel information from the Feature Fusion Network (FFN), so the patch size in its LT is set large; the head branches for medium- and small-scale objects pay more attention to the spatial local features from the FFN, because local variations in the image become more important, so the patch size is set relatively small. In the split operation, the features from the FFN are divided into two parts along the channel dimension, so that each part has half of the original channels; one part passes through the LT, the other is concatenated with the LT output along the channel dimension, and the fusion operation then fuses the concatenated features with a 1 x 1 convolution kernel and outputs them after a Leaky ReLU activation. In the head network for small-scale objects, a compression operation first reduces the channel dimension of the FFN features, while in the head branch for medium-scale objects an average pooling operation compresses the spatial size of the local features. Finally, a mapping operation adjusts the channel dimension to fit the final prediction. As shown in Fig. 4, the invention proposes a simplified visual Transformer network for real-time target detection. Unlike previous visual Transformer networks, the LT network is lightweight, which makes the network model fast in both the training and inference phases. The LT can be divided into three different parts. The first part divides the input features into non-overlapping patches and adds a learnable position vector element-wise to the vector mapping of each patch to ensure the positional uniqueness of each patch vector mapping. In addition, a lightweight convolution operation is introduced into the LT network to reduce the computational complexity of obtaining the patch vector mappings. Each patch vector mapping is also passed through the GELU activation function before being output, which can be expressed as equation (1):
GELU(x) = 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³)))    (1)
where x represents the output of the simplified patch vector mapping and GELU(x) represents the output after the activation function.
After the position vectors have been combined with the patch vector mappings, in the second part the query, key and value of each patch vector mapping are calculated, and multiple pieces of global association information are constructed in parallel in the Multi-head Spatial Attention Network (MSAN). Note that the simplified spatial reduction module (LSR) is used in both the key and value computation branches, again keeping the computation lightweight: a non-overlapping convolution traverses the keys and values and compresses their sequence length, thereby reducing the computational complexity. The ratio of the complexity of computing the global association information in the multi-head self-attention layers of LT and ViT is:
Ω_LT / Ω_ViT = (2·N·(N/p)·C) / (2·N²·C) = 1/p    (2)
where p, N and C represent the spatial reduction factor, the total number of patches and the number of channel dimensions, respectively. The complexity of computing the queries, keys, values and final outputs is omitted here, since it is the same for both visual Transformer models. It can be seen from equation (2) that the LSR module reduces the complexity by a factor of p.
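As a purely illustrative numerical check (these figures are not from the disclosure): for N = 400 patches, C = 256 channels and p = 4, the attention term 2·N²·C ≈ 8.2 x 10⁷ multiply-accumulates in ViT drops to 2·N²·C/p ≈ 2.0 x 10⁷ in LT, i.e., a quarter of the cost.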
A multi-layer perceptron (MLP) with a shortcut connection is used for further global feature extraction, to prevent the expressive power of the self-attention network from decreasing as the network depth increases. However, the large number of parameters in a standard MLP also reduces the efficiency of feature extraction. Therefore, the number of hidden-layer neurons of the proposed simplified MLP (Lite MLP) in the LT is set equal to the input channel dimension, and the ratio of its computational complexity to that of the MLP in ViT can be calculated as:
Ω_Lite MLP / Ω_ViT MLP = (2·N·C²) / (8·N·C²) = 1/4    (3)
where N and C represent the total number of patches and the channel dimension, respectively. As can be seen from equation (3), the Lite MLP in the LT uses four times fewer parameters than the MLP in ViT. The number of hidden-layer neurons could be compressed further to reduce the parameter count and computational complexity even more, but we find that this also degrades detection performance.
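A minimal sketch of such a Lite MLP block with its shortcut connection (assuming a hidden width equal to the channel dimension, as stated above; the class name is illustrative):

```python
import torch.nn as nn

class LiteMLP(nn.Module):
    """Two-layer MLP whose hidden width equals the input channel dimension C
    (a standard ViT MLP uses 4*C), applied with a residual shortcut."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),   # hidden width = C instead of 4*C
            nn.GELU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):          # x: [B, N, C]
        return x + self.net(x)     # shortcut connection
```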
Preferably, in S5, instead of the commonly used direct multiplication of the target confidence and the classification score, the Transformer Only Look Once (TOLO) network introduces an additional hyper-parameter and changes the calculation, which can be expressed as equation (4):
R = log₃(1 + α·C)·S    (4)
where R represents the combined result, α is the hyper-parameter controlling the prediction box result, C is the object confidence, and S is the most likely category score. The effects of different combinations are shown in Fig. 5, where the most probable classification score is assumed to be a random constant S independent of the confidence C, and the dashed line is the standard linear combination. Clearly, α = 1 or 1.5 has a stronger suppressing effect, and the higher the confidence, the stronger the suppression. When α = 2, a very low confidence no longer has a large impact on the final result, indicating that the influence of true negative prediction boxes can be avoided; however, when the confidence lies between 0.4 and 0.6, the combined result is greatly boosted, so the number of false negative prediction boxes can be reduced by increasing α. The results of the nonlinear and linear combination methods are visualized in Fig. 6.
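A minimal sketch of this nonlinear combination, equation (4), with the plain linear product included only for comparison:

```python
import math

def combine_score(confidence, class_score, alpha=2.0):
    """Nonlinear combination R = log3(1 + alpha * C) * S from equation (4)."""
    return math.log(1.0 + alpha * confidence, 3) * class_score

def linear_score(confidence, class_score):
    """Conventional linear combination C * S, shown for comparison."""
    return confidence * class_score

# Example: a box with moderate confidence 0.5 and class score 0.8.
# linear_score(0.5, 0.8)        -> 0.40
# combine_score(0.5, 0.8, 2.0)  -> log3(2) * 0.8 ≈ 0.505, lifting a likely
# false-negative box above typical score thresholds.
```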
Preferably, in S6, the performance of the target detector and the nonlinear combination method is evaluated on the PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO 2017 data sets. The PASCAL VOC 2007 data set has 20 categories and 9,962 pictures; the PASCAL VOC 2012 data set has 20 categories and 22,531 images, split into a training set, a validation set and a test set. Following the standard PASCAL VOC protocol, mean Average Precision (mAP) is used as the evaluation metric. The invention uses three indices, the Average Precision (AP), AP50 (AP at an IoU threshold of 0.5) and AP75 (AP at an IoU threshold of 0.75), under the standard PASCAL criteria (i.e., Intersection over Union (IoU) > 0.5 and IoU > 0.7) and the standard MS COCO criterion (i.e., the mean of the mAP computed at IoU ∈ [0.5:0.05:0.95]). In addition, small, medium and large objects are evaluated with AP_small, AP_middle and AP_large, respectively. For the training strategy, Stochastic Gradient Descent (SGD) is used to optimize the model, the initial learning rate is set to 0.001, and training is performed on two GPUs (GTX 3090). A cosine learning-rate schedule between 0.001 and 0.00001 is used. The weight decay is 0.0005 and the momentum is 0.9. In addition, training tricks such as MixUp and label smoothing are adopted to avoid overfitting and improve the generalization of the model.
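A sketch of this training configuration in PyTorch is given below for illustration; the model object and epoch count are placeholders, and only the optimizer and schedule hyper-parameters are taken from the text:

```python
import torch

def build_optimizer_and_scheduler(model, epochs=300):
    """SGD with momentum 0.9 and weight decay 5e-4, plus a cosine learning-rate
    schedule decaying from 1e-3 to 1e-5, as described in the training strategy."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-5)
    return optimizer, scheduler
```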
Results of the experiment
In this embodiment, the effectiveness of the proposed target detector is evaluated on the PASCAL VOC and MS COCO data sets. TOLO is compared with other state-of-the-art detectors, including one-stage and two-stage detectors, real-time detectors and several Transformer-based detectors.
TABLE 1 Detection results of different detectors on PASCAL VOC
[Table 1 is reproduced as an image in the original publication; the numerical results are not recoverable from this text.]
TABLE 2 Detection results of real-time detectors on MS COCO
[Table 2 is reproduced as an image in the original publication.]
TABLE 3 Detection results of Transformer-based detectors on MS COCO
[Table 3 is reproduced as an image in the original publication.]
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (7)

1. A real-time target detection method combining a convolutional neural network and a Transformer network is characterized by comprising the following steps:
S1: inputting training image data into the network;
S2: designing a convolutional neural backbone network to extract features from the training images so that the extracted features carry inductive bias, namely the locality and translation invariance of convolutional neural network features;
S3: designing a detection neck network that transitions between the detection backbone network and the head network, provides high-resolution and high-semantic features to the detection head network, and compresses the channel dimensions of some layers;
S4: designing a detection head network, introducing a Transformer network, i.e., a fully self-attention-based network, into the head network to construct multiple long-range dependencies among the generated local features and to represent the target categories and coordinates present in the image;
S5: designing a nonlinear combination method: since the target detection model produces some false negative (FN) samples in the output of S4, the nonlinear combination method is designed to reduce the false negative samples and improve the target-capturing capability of the detection model;
S6: detecting on natural data sets and screening the prediction results with a non-maximum suppression algorithm; calculating the intersection over union (IoU) between the screened predictions and the ground-truth boxes, and accumulating the predictions to obtain the average precision AP50 at an IoU threshold of 0.5 and the average precision AP75 at an IoU threshold of 0.75 as the evaluation result of the model.
2. The method for real-time object detection by combining a convolutional neural network and a Transformer network as claimed in claim 1, wherein the step S1 inputs the image data set to be trained, specifically comprising the following steps:
The training image data sets are PASCAL VOC and MS COCO, with the training batch size of each iteration set to 24. On PASCAL VOC, 50 rounds of multi-scale training are performed (input sizes 320, 352, 384, 416, 448, 480, 512, 544, 576 and 608), and the image size at test time is 448; on MS COCO, 300 rounds of three-scale training are performed (320, 352 and 384), and the length and width of the input image at test time are both 320. The output results are screened with a post-processing algorithm to obtain the final prediction results.
3. The method for real-time object detection by combining a convolutional neural network and a Transformer network as claimed in claim 2, wherein the step S1 is implemented by using a post-processing algorithm to filter the output result, so as to obtain a final predicted result, which specifically comprises:
The prediction results are screened with a non-maximum suppression algorithm, and the intersection over union (IoU) between each screened prediction and the ground-truth box is calculated. First, sample attributes are determined according to preset thresholds, namely two criteria: the PASCAL criterion and the standard MS COCO criterion; the PASCAL criterion uses IoU > 0.5 and IoU > 0.7, while the MS COCO criterion uses IoU thresholds from 0.5 to 0.95 with a step of 0.05. All samples are then sorted from high to low by classification score, the sorted samples are traversed, and the precision and recall of the traversed samples are calculated according to formula (1) and formula (2):
Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)
wherein TP, FP and FN represent true positives, false positives and false negatives;
According to the precision and recall obtained at each traversal step, a curve is constructed with recall on the X axis and precision on the Y axis; finally, the area under the precision-recall curve is computed to obtain the average precision (AP), and the mean average precision (mAP) over all categories is calculated. Three indices are used as model evaluation criteria: the average precision AP, AP50 (the AP at an IoU threshold of 0.5) and AP75 (the AP at an IoU threshold of 0.75). In addition, small, medium and large objects are evaluated with AP_small, AP_middle and AP_large, respectively.
4. The method for real-time object detection based on a convolutional neural network and a Transformer network as claimed in claim 3, wherein the step S3 specifically comprises:
The detection neck network comprises a feature data compression part and a feature fusion part. The feature data compression part operates on the third-order and fourth-order network layers of the convolutional backbone, which provide rich semantic information, and compresses the features extracted from the backbone with depthwise separable convolutions. The compressed features are then up-sampled by bilinear interpolation so that the spatial size of the third-order and fourth-order features matches that of the second-order features. Finally, the interpolated features are concatenated with the second-order features along the channel dimension; the second-order layer provides high-resolution low-level information. The feature fusion part then fuses the concatenated data with a depthwise separable convolution, i.e., the features of the second-, third- and fourth-order network layers are fused.
5. The method for real-time target detection by combining a convolutional neural network and a Transformer network as claimed in claim 4, wherein the step S4 of designing a multi-branch detection head network specifically comprises the following steps:
A dilated depthwise convolution is placed at the input of each branch to expand the receptive field (RF) of the different head branches. A split operation then divides the features from the feature fusion network (FFN) into two parts along the channel dimension, so that each part has half of the original number of channels; one part passes through the LT, the other part is concatenated with the output of the LT along the channel dimension, and a fusion operation then fuses the concatenated features with a 1 x 1 convolution kernel and outputs the fused features after a Leaky ReLU activation function. The simplified Transformer network (LT) in each detection head branch has a corresponding image patch size; the patch size and dilation factor on the head branches for large-, medium- and small-scale objects are set to 4, 2 and 1, respectively.
6. The method for real-time object detection combining a convolutional neural network and a Transformer network as claimed in claim 5, wherein in step S4 the simplified Transformer network (LT) is divided into three different parts: the first part divides the input features into non-overlapping image patches and adds a learnable position vector element-wise to the vector mapping of each patch to ensure the positional uniqueness of each patch vector mapping; in addition, a lightweight convolution operation is introduced into the LT network to reduce the computational complexity of obtaining the patch vector mappings; each patch vector mapping is also passed through the GELU activation function before being output;
which can be expressed as equation (3):
GELU(x) = 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³)))    (3)
where x represents the output of the simplified patch vector mapping and GELU(x) represents the output after the activation function;
in the second part, after the position vectors have been combined with the patch vector mappings, the query, key and value of each patch vector mapping are calculated, and multiple pieces of global association information are constructed in parallel in the multi-head spatial attention network (MSAN); the simplified spatial reduction module (LSR) is used in the branches that compute the key and value, where a non-overlapping convolution traverses the keys and values and compresses their sequence length; the ratio of the computational complexity of the global association information in the multi-head self-attention layers of LT and ViT is:
Ω_LT / Ω_ViT = (2·N·(N/p)·C) / (2·N²·C) = 1/p    (4)
wherein p, N and C represent the spatial reduction factor, the total number of patches and the number of channel dimensions, respectively;
multiple pieces of global association information are constructed in parallel in the multi-head spatial attention network (MSAN), which can be expressed as equation (5):
Attention(q, k, v) = softmax(q·kᵀ / √d_head)·v    (5)
where q, k and v represent the query, key and value, respectively, and d_head represents the channel dimension of the key; the simplified spatial reduction module is used in the key and value computation branches, and finally a multi-layer perceptron (MLP) with a shortcut connection is used to further extract global features.
7. The method of claim 6, wherein the step S5 specifically includes:
an additional hyper-parameter is introduced, which can be expressed as equation (6):
R = log₃(1 + α·C)·S    (6)
where R represents the combined result, α is the hyper-parameter controlling the prediction box result, C is the object confidence, and S is the most likely category score.
CN202210508625.3A 2022-05-10 2022-05-10 Real-time target detection method combining convolutional neural network and Transformer network Pending CN114842316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508625.3A CN114842316A (en) 2022-05-10 2022-05-10 Real-time target detection method combining convolutional neural network and Transformer network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210508625.3A CN114842316A (en) 2022-05-10 2022-05-10 Real-time target detection method combining convolutional neural network and Transformer network

Publications (1)

Publication Number Publication Date
CN114842316A true CN114842316A (en) 2022-08-02

Family

ID=82569301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508625.3A Pending CN114842316A (en) 2022-05-10 2022-05-10 Real-time target detection method combining convolutional neural network and Transformer network

Country Status (1)

Country Link
CN (1) CN114842316A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117743946A (en) * 2024-02-19 2024-03-22 山东大学 Signal type identification method and system based on fusion characteristics and group convolution ViT network
CN117743946B (en) * 2024-02-19 2024-04-30 山东大学 Signal type identification method and system based on fusion characteristic and group convolution ViT network
CN117994257A (en) * 2024-04-07 2024-05-07 中国机械总院集团江苏分院有限公司 Fabric flaw analysis and detection system and analysis and detection method based on deep learning

Similar Documents

Publication Publication Date Title
You et al. Transformer for image quality assessment
CN112733749B (en) Real-time pedestrian detection method integrating attention mechanism
CN114842316A (en) Real-time target detection method combining convolutional neural network and Transformer network
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN112308200A (en) Neural network searching method and device
CN110222717A (en) Image processing method and device
CN106339753A (en) Method for effectively enhancing robustness of convolutional neural network
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN111368850B (en) Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal
CN115526935A (en) Pixel-level capture pose detection method and system based on global and local information
CN111696136B (en) Target tracking method based on coding and decoding structure
CN115641533A (en) Target object emotion recognition method and device and computer equipment
CN114821390A (en) Twin network target tracking method and system based on attention and relationship detection
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN115908772A (en) Target detection method and system based on Transformer and fusion attention mechanism
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN115222998A (en) Image classification method
CN116703857A (en) Video action quality evaluation method based on time-space domain sensing
CN114492755A (en) Target detection model compression method based on knowledge distillation
Wei et al. Lightweight multimodal feature graph convolutional network for dangerous driving behavior detection
CN117689932A (en) InSAR atmospheric phase and earth surface deformation detection method and device based on improved YOLOv7 and computer storage medium
CN111611852A (en) Method, device and equipment for training expression recognition model
Das et al. A comparative analysis and study of a fast parallel cnn based deepfake video detection model with feature selection (fpc-dfm)
CN114782983A (en) Road scene pedestrian detection method based on improved feature pyramid and boundary loss

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination