CN115690549A - Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model - Google Patents
- Publication number: CN115690549A (application CN202211420718.7A)
- Authority: CN (China)
- Prior art keywords: model, sampling, window, feature, vector
- Prior art date: 2022-11-12
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Image Analysis (AREA)
Abstract
The invention discloses a target detection method that realizes multi-dimensional feature fusion based on a parallel interactive architecture model. The method addresses the slow convergence and long training times of traditional target detection approaches, and improves both the detection precision and the detection speed of the target detection task.
Description
Technical Field
The invention belongs to the field of target detection in computer vision, and provides a target detection method that realizes feature fusion based on a parallel interaction architecture model.
Background
Target detection is a long-standing and fundamental task in computer vision whose main purpose is to predict the location and class of instances in an image. As the basis of many visual tasks, including instance segmentation and target tracking, it has considerable research significance in the field of image vision. In recent years, with the growth of practical applications such as autonomous driving and industrial defect detection, target detection has attracted increasing attention from industry. Its core challenges are how to make the detection network fully learn the spatial and semantic information of the image from the input features, and how to accurately locate and classify instances from that information. A target detector needs strong feature fusion capability and sufficient spatial sensitivity, and most traditional deep learning detection models are based on convolutional neural networks (CNNs). A CNN fully fuses local features in the image through convolution operations, and this sensitive local spatial perception makes it one of the networks best suited to target detection; however, CNNs have a notable limitation: their feature fusion capability across the global space is deficient. Traditional CNN-based detection models are generally divided into anchor-based and anchor-free according to how they locate objects; the former use anchors to predict potential objects, while the latter typically detect objects from a center point. Anchor-based models can be further divided into one-stage and two-stage according to their detection steps; classical one-stage models include the YOLO series, SSD, and RetinaNet, while two-stage models are represented by the R-CNN series.
In the two-stage method, potential target regions are first searched, and classification scores are then computed for those regions; that is, regions are first located and then classified. The one-stage method instead generates detection boxes in a single step to predict the category and position of objects. CNN-based models face two key issues: how to assign anchors to ground-truth labels, and how to make the model effectively learn key semantic information from features. Models designed to solve these two problems also have significant drawbacks, such as requiring hand-crafted design knowledge under certain a priori conditions; designing such priors, e.g., suitable anchor points and thresholds for different detection methods, is a difficult task. On the other hand, the global feature interaction capability of a CNN is weak due to the limited size of its convolution kernels.
In recent years, the advent of the Vision Transformer (ViT), the DEtection TRansformer (DETR), and their variants has triggered a wave of applying Transformers to target detection. These new detection paradigms discard the traditional CNN and replace it with a carefully designed multi-layer encoder-decoder architecture; the encoder fuses features, and the decoder uses object queries to decouple the rich semantics in those features. Compared with a CNN, ViT emphasizes semantic association across the global space and integrates global spatial features through a global self-attention mechanism. DETR treats object detection as a set prediction task: a fixed number of object queries are matched to the ground truth during training, which eliminates the label assignment of traditional models, and at inference time the network predicts objects directly from the object queries. In addition, for object localization, DETR uses positional embeddings to enhance the position sensitivity of the model; however, DETR-style detectors suffer from slow network convergence and heavy computational demands.
Disclosure of Invention
To address these problems, and drawing on recent ideas from other areas of deep learning, the invention provides a method that realizes feature fusion based on a parallel interactive architecture model, with the aim of giving the model advanced feature fusion capability. First, for feature extraction, a 3D feature-space window sampling scheme different from the traditional CNN is introduced to fully extract local and global spatial features. The invention then provides a multi-dimensional feature fusion network, CFFN, which lets the model deeply fuse image features along the spatial and channel dimensions, so that the model learns semantic information better, achieves a better detection effect, and reaches higher detection precision.
In order to achieve the purpose, the invention provides the following technical scheme: a target detection method for realizing multi-dimensional feature fusion based on a parallel interactive architecture model comprises the following steps:
step 1: preparing a COCO2017 data set required by model training; configuring a COCO2017 data set in a server, and putting the COCO2017 data set into a training folder according to a required format;
step 2: building a model under an mmdetection framework, and configuring a PyTorch deep learning environment required by training;
step 3: setting training hyper-parameters, and inputting the data set into the end-to-end target detection model of the parallel interaction architecture for training;
step 4: the model sends the input image into a ResNet50 for feature extraction, outputs a multi-scale feature map, and constructs a 3D feature sampling space from the multi-scale feature map;
step 5: a set of prediction vector object queries is generated, each containing a content vector and a position vector. For each object query, a sampling offset is generated through a feedforward neural network, and the model's initial sampling points are generated by taking the object query's position vector as the initial coordinate and adding the sampling offset;
step 6: forming a local sampling window by using the initial sampling point and eight adjacent points of the initial sampling point in a sampling space, interpolating points in the window to obtain window characteristics, and then paving the window;
step 7: the obtained feature matrix is sent to the feature fusion network CFFN, which is composed of a unidirectional-interaction parallel structure (PSUI) and an inter-group self-attention layer; this network fully fuses features along the spatial and channel dimensions;
step 8: the fully fused features are sent into an Adaptive Mixing decoding layer for feature decoupling;
step 9: the final output of the decoding layer sequentially updates the content vector and position vector of each object query through two feed-forward networks (FFNs), and the content and position vectors then predict the category and position of the target through two further FFNs;
step 10: after model training is finished, the precision of the model can be verified, and the trained model file can be used to generate detection boxes for an input test picture, detecting the category and position of each object in it.
In step 4, the model performs feature pre-extraction on the input image using the classical CNN backbone ResNet50, obtaining feature maps at four different scales. If the dimension of the input image is H_0 × W_0 × 3, the output multi-scale feature maps have channel numbers D_i; when the 3D feature space is constructed, the channel number of each scale's feature map is normalized to a uniform value D_feat. In step 5, the prediction vector object query is composed of a content vector and a position vector, where the content vector is the coordinates (x, y, z, r) of the object query.
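The construction in steps 4-5 — a 3D sampling space built from the multi-scale feature maps, and initial sampling points generated from each object query's position vector plus FFN-predicted offsets — can be sketched as follows. This is a minimal numpy illustration; all sizes, the random "FFN" projection, and variable names are assumptions, not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: feature maps at 4 scales, channels normalized to
# D_feat, stacked into a 3D sampling space indexed by (x, y, z), where the
# z axis selects the scale level.
D_feat, H, W, levels = 256, 32, 32, 4
feature_space = rng.standard_normal((levels, H, W, D_feat))

num_queries, num_points = 100, 32
# Each object query carries a position vector; here only (x, y, z) is used
# as the initial sampling coordinate.
pos = rng.uniform(0, 1, size=(num_queries, 3))

# A stand-in for the patent's feed-forward network: a linear projection of
# the content vector to per-query sampling offsets (weights are random).
content = rng.standard_normal((num_queries, D_feat))
W_off = rng.standard_normal((D_feat, num_points * 3)) * 0.01
offsets = (content @ W_off).reshape(num_queries, num_points, 3)

# Initial sampling points = query position + predicted offsets.
sample_points = pos[:, None, :] + offsets
print(sample_points.shape)  # (100, 32, 3)
```

In the real model the offsets would come from a learned FFN and the points would then be used to sample `feature_space`; here only the shapes and the position-plus-offset construction are demonstrated.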
In step 6, as shown in Fig. 2, the initial sampling points from step 5 and their eight adjacent points form a local sampling window in the feature sampling space; the sampling points in the window are then interpolated to obtain a sampling feature matrix x ∈ R^{G×W×P×C} (taking one object query as an example), where G is the number of sampling groups, W and P are respectively the number of sampling windows and the number of points per window, and C is the number of feature channels. The window features can be written as
x_i = Interpolation(Coordinate(i)), i = 1, …, S × S,
where S is the size of the local window, i indexes a sampling point in the local window, Coordinate gives the coordinates of the sampling point, and Interpolation denotes the interpolation operation.
In step 7, the CFFN comprises a unidirectional-interaction parallel structure (PSUI) and inter-group self-attention, where the PSUI is composed of left and right branches and a unidirectional interaction network connecting them; the details of the PSUI are shown in Fig. 3:
(1) The left branch performs inter-window self-attention to realize local feature fusion. In this operation, V_w is the channel-semantic-weighted result obtained from the element-wise product of the feature matrix and factor, while Q_w and K_w are obtained from the feature matrix through feedforward neural networks; local self-attention over the different points in a window fully fuses the local features between adjacent points in the same window. The specific operations are:
Q_w, K_w = FFN_1(x), FFN_2(x),
V_w = x ⊙ factor,
Attention(Q_w, K_w, V_w) = softmax(Q_w K_w^T / √d_k) V_w,
where Q_w, K_w, and V_w are the three matrices of the self-attention operation, factor is the interaction factor containing channel weights generated by the unidirectional interaction network, and d_k is a scaling factor.
(2) The right branch first applies a dimension conversion to the local-window feature matrix, giving a converted matrix x ∈ R^{G×C×W×P}. The last two dimensions then form a feature map: the horizontal direction represents the interior of one window and the vertical direction represents different windows. The right branch first applies a depthwise convolution with a 9×5 kernel, fusing features across windows and thereby realizing global feature interaction; it then realizes semantic fusion along the channel dimension through a pointwise convolution. The final output of the right branch is converted back to the original dimensions so that it can be concatenated with the left branch.
(3) The unidirectional interaction connection runs from right to left: the depthwise convolution output of the right branch passes through the interaction network to obtain factor, which contains channel semantic weights, and factor is fed to the left branch to participate in its self-attention operation.
(4) The final results of the left and right branches retain only the features of the initial sampling point along the P dimension; the matrix is then converted to x ∈ R^{G×W×C}, and after concatenation a layer of FFN keeps the dimension as x ∈ R^{G×W×C}.
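The PSUI data flow in (1)-(4) can be sketched end to end in numpy: the right branch runs a depthwise-then-pointwise convolution over the (W, P) plane, a sigmoid gate over the depthwise output stands in for the right-to-left interaction network producing factor, and the left branch runs windowed self-attention with V_w = x ⊙ factor. The kernel size (5×3 instead of the patent's 9×5), the gate, the random projections replacing FFN_1/FFN_2, and the choice of index 4 as the initial (center) point are all illustrative assumptions.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
G, W_n, P, C = 2, 6, 9, 16                 # groups, windows, points/window, channels
x = rng.standard_normal((G, W_n, P, C))    # sampled window features

# --- Right branch: depthwise conv per channel over the (W, P) plane. ---
xr = x.transpose(0, 3, 1, 2)               # dimension conversion -> (G, C, W, P)
kh, kw = 5, 3                              # toy kernel; the patent uses 9x5
k_dw = rng.standard_normal((C, kh, kw)) * 0.1
pad = np.pad(xr, ((0, 0), (0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
dw = np.zeros_like(xr)
for g in range(G):
    for c in range(C):
        for i in range(W_n):
            for j in range(P):
                dw[g, c, i, j] = np.sum(pad[g, c, i:i + kh, j:j + kw] * k_dw[c])

# Pointwise (1x1) convolution mixes channel semantics.
k_pw = rng.standard_normal((C, C)) * 0.1
right = np.einsum('gchw,dc->gdhw', dw, k_pw).transpose(0, 2, 3, 1)  # (G, W, P, C)

# --- Unidirectional interaction (right -> left): channel weights 'factor'. ---
factor = 1.0 / (1.0 + np.exp(-dw.mean(axis=(2, 3))))   # sigmoid gate, (G, C)
factor = factor[:, None, None, :]

# --- Left branch: self-attention over the P points of each window. ---
Wq, Wk = rng.standard_normal((C, C)), rng.standard_normal((C, C))
Q_w, K_w = x @ Wq, x @ Wk                  # stand-ins for FFN_1 / FFN_2
V_w = x * factor                           # V_w = x (.) factor
attn = softmax(Q_w @ K_w.transpose(0, 1, 3, 2) / np.sqrt(C))
left = attn @ V_w                          # (G, W, P, C)

# --- Keep only the initial sampling point along P, then concatenate. ---
fused = np.concatenate([left[:, :, 4, :], right[:, :, 4, :]], axis=-1)
print(fused.shape)  # (2, 6, 32)
```

In the patent a final FFN layer would map the concatenated result back to x ∈ R^{G×W×C}; only the branch shapes and the right-to-left gating are demonstrated here.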
In step 7, the CFFN comprises the unidirectional-interaction parallel structure (PSUI) and inter-group self-attention; the details of the inter-group self-attention are as follows. To reduce the network's computational cost and accelerate detection and training, the model divides the D_feat channels into four groups and performs interpolation sampling within each group to reduce the matrix size along one dimension. To compensate for the missing interactions between channels of different groups caused by this operation, the model uses a self-attention component to fuse the features of channels in different groups. The formula is:
Q_g, K_g = FFN_3(x), FFN_4(x),
Attention(Q_g, K_g) = softmax(Q_g K_g^T / √d_k) x,
where Q_g and K_g are the Query and Key matrices of the self-attention operation, and d_k is a scaling factor.
In step 9, the output of the decoder is converted through an FFN to the same dimension as the object query's content vector, which is then updated; the detection head uses the content vector and position vector to predict the category and position of the candidate box through separate FFNs.
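The inter-group self-attention described above can be sketched by treating the four channel groups as the attention axis, so that attention weights mix features across groups. The group/window sizes and the random projections standing in for FFN_3/FFN_4 are assumptions for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
G, W_n, C = 4, 6, 64                  # D_feat split into G = 4 channel groups
x = rng.standard_normal((W_n, G, C))  # groups are the attention axis

Wq, Wk = rng.standard_normal((C, C)), rng.standard_normal((C, C))
Q_g, K_g = x @ Wq, x @ Wk             # stand-ins for FFN_3 / FFN_4

d_k = C
attn = softmax(Q_g @ K_g.transpose(0, 2, 1) / np.sqrt(d_k))  # (W_n, G, G)
out = attn @ x                        # fuses features across the four groups
print(out.shape)  # (6, 4, 64)
```

Each output group is a weighted mixture of all four input groups, restoring the cross-group channel interactions that grouping removed.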
During training, the model predicts candidate boxes from the input picture through a fixed-size set of N object queries, where N is usually much larger than the actual number of objects of interest in the image; an additional special class label ∅ is therefore used to indicate that no object is detected.
Throughout training, the parallel interactive architecture detection model adopts one-to-one label assignment: each prediction box must be matched with one ground-truth bounding box, and the model uses the Hungarian algorithm to find the optimal bipartite matching between real and predicted objects, i.e., the matching σ̂ that minimizes the total matching cost:
σ̂ = argmin_{σ ∈ 𝔖_N} Σ_{i=1}^{N} L_match(y_i, ŷ_{σ(i)}).
the above formula sigma is a matching rule between the group route and the prediction box, thetaN represents a possible matching mode, y is a group route set,is a set of N prediction boxes, if the number of boxes in y is less than N, the method usesAnd (4) filling. L is a radical of an alcohol match The matching cost between the true value and a prediction with index σ (i), which includes the classification penalty L cls And predicting frame loss L box ,L box And includes IoU loss L iou And l 1 Loss L 1 。
Each element y_i of the ground-truth set y consists of c_i and b_i, where c_i is the category of the object in the box and b_i is a position vector defining the center coordinates and size of the real box. For the prediction with index σ(i), the invention denotes its probability of belonging to class c_i by p̂_{σ(i)}(c_i) and its prediction box by b̂_{σ(i)}. L_match can then be expressed as:
L_match(y_i, ŷ_{σ(i)}) = −1_{{c_i ≠ ∅}} p̂_{σ(i)}(c_i) + 1_{{c_i ≠ ∅}} L_box(b_i, b̂_{σ(i)}).
the loss function in the training process of the invention is the Hungarian loss of all pairs in the matching:
After the 12 epochs of step 10 are completed, the trained model is saved as a .pt file, which can then be used to verify the model's precision and to run detection on pictures.
The specific method for detecting the picture in the step 10 is as follows:
The trained model file can be loaded by the network to detect objects in an image: run the detection code, set the detection model to the trained .pt file, and set the input picture directory to the folder containing the pictures to be detected. Detection then starts: the data to be detected are fed into the trained model for recognition and localization, and the model outputs a set of prediction boxes containing the positions and categories of the potential objects in the picture.
Compared with the prior art, the present invention has the following advantages. On the method side: first, instead of using a traditional CNN for further feature extraction, the method obtains spatial features by constructing a 3D sampling space and performing window sampling within it, which enriches feature extraction along the spatial dimension and strengthens the model's localization ability; second, to improve feature fusion quality, the CFFN network structure is designed to fuse the extracted features along the spatial and channel dimensions; it combines convolution and self-attention operations, applying different methods to different dimensions, a design that greatly enriches semantic information and contributes to model precision. On the application side, the invention attains 43.0 AP in the example under a 12-epoch training schedule, better than many detection methods; at the same time, it discards the prior knowledge required by traditional CNN detection networks, speeds up training, alleviates the slow convergence and long training times of target detection, and improves both the precision and speed of the target detection task.
Drawings
FIG. 1 is a diagram of the network architecture of the present invention;
FIG. 2 is a schematic view of a sampling local window of the present invention;
fig. 3 is a schematic diagram of the PSUI structure of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1: addressing the limitations of current traditional two-stage target detection methods, the invention provides a 2D image target detection method that realizes multi-dimensional feature fusion based on a parallel interactive architecture model.
In this embodiment, the COCO2017 data set is used as experimental data, and with the help of data augmentation, object localization and classification are realized by an end-to-end parallel interactive target detection model comprising the backbone ResNet50, the 3D feature-space window sampling scheme, the feature fusion network CFFN, and a detection head with two feed-forward networks.
Step 1: preparing a COCO2017 data set required by model training; configuring a COCO2017 data set in a server, and putting the COCO2017 data set into a training folder according to a required format;
step 1.1, a public data set COCO2017 is obtained in a COCO official website, and pictures and labels of a training set and a verification set which are divided by an official part are downloaded.
Step 1.2: the COCO2017 data set consists of training-set images, validation-set images, and annotation files; the training-set images, validation-set images, and annotations are placed in the train2017, val2017, and annotations folders respectively.
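The folder layout from step 1.2 can be sanity-checked with a small helper before training. The function name and the use of a temporary directory are illustrative, not part of the patent or the mmdetection framework.

```python
import tempfile
from pathlib import Path

def check_coco_layout(root: Path) -> bool:
    """Check the COCO2017 layout from step 1.2: train2017/ and val2017/
    for images, annotations/ for the JSON label files."""
    required = ("train2017", "val2017", "annotations")
    return all((root / d).is_dir() for d in required)

# Build a throwaway layout to demonstrate the check.
root = Path(tempfile.mkdtemp())
for d in ("train2017", "val2017", "annotations"):
    (root / d).mkdir()
print(check_coco_layout(root))  # True
```

Running such a check before launching training catches misplaced image or annotation folders early, before the data loader fails mid-run.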
Step 2: building the model, and configuring the PyTorch deep learning environment required for training;
Step 2.1: create a virtual environment for the project in Anaconda, and install the key packages required for training, such as PyTorch 1.11.0, in it. The training server uses an NVIDIA RTX 3090 GPU, the operating system is Ubuntu 20.04, the CUDA version is 11.3, and the language is Python 3.8.
And 2.2, installing and configuring an mmdetection framework and a required mmcv compiling package, and installing other dependent packages required by the training script.
Step 3: setting training hyper-parameters, and inputting the data set into the end-to-end target detection model of the parallel interaction architecture for training;
Step 3.1: the training hyper-parameters are as follows: the feature extraction backbone is ResNet50, the initial learning rate is 0.000125, the batch size is 4, and the number of epochs is 12.
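The hyper-parameters of step 3.1 can be collected in a dict in the style of mmdetection configs. The keys here are illustrative and not the framework's exact schema; the linear learning-rate scaling at the end is a common convention, not something stated in the patent.

```python
# Minimal sketch of the training hyper-parameters listed above.
train_cfg = dict(
    backbone="ResNet50",
    lr=0.000125,
    batch_size=4,
    max_epochs=12,
    dataset="COCO2017",
)

# Linear scaling rule (hypothetical convenience): scale the learning rate
# proportionally if the batch size changes, e.g. batch size 8 instead of 4.
scaled_lr = train_cfg["lr"] * (8 / train_cfg["batch_size"])
print(scaled_lr)
```

Keeping the hyper-parameters in one config object makes the 12-epoch schedule and learning rate reproducible across runs.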
Step 4: the model sends the input image into a ResNet50 for feature extraction, outputs a multi-scale feature map, and then constructs a 3D feature sampling space from the multi-scale feature map;
Step 5: a set of prediction vector object queries is generated, each containing a content vector and a position vector. For each object query, a sampling offset is generated through a feedforward neural network, and the model's initial sampling points are generated by taking the object query's position vector as the initial coordinate and adding the sampling offset;
step 6: forming a local sampling window by using the initial sampling point and eight adjacent points of the initial sampling point in a sampling space, interpolating points in the window to obtain window characteristics, and then paving the window;
Step 7: the obtained feature matrix is sent to the feature fusion network CFFN, which is composed of a unidirectional-interaction parallel structure (PSUI) and an inter-group self-attention layer; this network fully fuses features along the spatial and channel dimensions;
Step 8: the fully fused features are sent into an Adaptive Mixing decoding layer for feature decoupling;
Step 9: the final output of the decoding layer sequentially updates the content vector and position vector of each object query through two feed-forward networks (FFNs), and the content and position vectors then predict the category and position of the object through two further FFNs;
step 10: after the model training is finished, the precision of the model can be verified, and a detection frame can be generated by using a trained model file according to an input test picture to detect the type and the position of an object to be detected in the test picture.
Step 10.1: after the 12 epochs are completed, the trained model is saved as a .pt file; the model's precision can be verified from the training weights using val.py by feeding it the val2017 data set mentioned in step 1. Six common precision metrics are used: AP, AP_50, AP_75, AP_S, AP_M, and AP_L. In the final verification result of this embodiment, the AP is 43.0, better than most methods under the same experimental conditions.
Step 10.2, the trained model file can be loaded by using a detection network to detect the object in the image; and running a detect code, setting the detection model as a pt file after training, and setting the input picture directory as a folder where the picture to be detected is located. After the operation is set, the detection can be started, the data to be detected is input into the trained model for image recognition and positioning, and the model outputs a plurality of prediction frames containing the positions and the classes of the potential objects in the picture.
Claims (6)
1. A target detection method for realizing multi-dimensional feature fusion based on a parallel interactive architecture model is characterized by comprising the following steps:
step 1: preparing a COCO2017 data set required by model training; configuring a COCO2017 data set in a server, and putting the COCO2017 data set into a training folder according to a required format;
and 2, step: building a model under an mmdetection framework, and configuring a PyTorch deep learning environment required by training;
step 3: setting training hyper-parameters, and inputting the data set into the end-to-end target detection model of the parallel interaction architecture for training;
step 4: the model sends the input image into a ResNet50 for feature extraction, outputs a multi-scale feature map, and then constructs a 3D feature sampling space from the multi-scale feature map;
step 5: generating a set of prediction vector object queries containing content vectors and position vectors, generating a sampling offset for each object query through a feedforward neural network, taking the position vector as the initial coordinate, and generating the model's initial sampling points by adding the sampling offset;
step 6: forming a local sampling window by using the initial sampling points and eight adjacent points of the initial sampling points in a sampling space, interpolating points in the window to obtain window characteristics, and then paving the window;
step 7: the obtained feature matrix is sent to the feature fusion network CFFN, which is composed of a unidirectional-interaction parallel structure (PSUI) and an inter-group self-attention layer; this network fully fuses features along the spatial and channel dimensions;
step 8: the fully fused features are sent into an adaptive mixing decoding layer for feature decoupling;
step 9: the final output of the decoding layer sequentially updates the content vector and position vector of each object query through two feed-forward networks (FFNs), and the content and position vectors then predict the category and position of the object through two further FFNs;
step 10: after the model training is finished, the precision of the model can be verified, and a detection frame can be generated by using a trained model file according to an input test picture to detect the type and the position of an object to be detected in the test picture.
2. The method for detecting the target based on the parallel interactive architecture model to realize multi-dimensional feature fusion of claim 1, wherein in step 4: the target detection model uses the classical CNN backbone ResNet50 to perform feature pre-extraction on the input image, obtaining feature maps at four different scales; if the dimension of the input image is H_0 × W_0 × 3, the output multi-scale feature maps have channel numbers D_i. In addition, when the 3D feature space is constructed, the channel number of every scale's feature map is normalized to a uniform value D_feat, where D_i is the number of feature channels of the i-th feature map and H_0, W_0 are the input image height and width.
3. The method for detecting the target based on the parallel interactive architecture model to realize multi-dimensional feature fusion of claim 1, wherein in step 5: the prediction vector object query is composed of a content vector and a position vector, where the content vector represents the initial coordinates (x, y, z, r) of the object query, with r the aspect ratio.
4. The method for detecting the target based on the parallel interactive architecture model to realize multi-dimensional feature fusion of claim 1, wherein in step 6: the initial sampling points from step 5 and their eight adjacent points form a local sampling window in the feature sampling space, and the sampling points in the window are then interpolated to obtain a sampling feature matrix x ∈ R^{G×W×P×C}, where G is the number of sampling groups, W and P are respectively the number of sampling windows and the number of points per window, and C is the number of feature channels.
5. The method for detecting the target based on the parallel interactive architecture model to realize multi-dimensional feature fusion of claim 1, wherein in step 7: the CFFN comprises a unidirectional-interaction parallel structure (PSUI) and inter-group self-attention, with details as follows:
(1) the PSUI is composed of left and right branches and a right-to-left unidirectional interaction network connecting them;
(2) the left branch realizes local feature fusion through inter-window self-attention, in which V_w is the channel-semantic-weighted result of the element-wise product of the window feature matrix and factor, while Q_w and K_w are obtained from the feature matrix through different feedforward neural networks; local self-attention over the different points in a window fully fuses the local features between adjacent points in the same window;
(3) the right branch first applies a dimension conversion to the local-window feature matrix, giving a converted matrix x ∈ R^{G×C×W×P}, whose last two dimensions form a feature map: the horizontal direction represents the interior of one window and the vertical direction represents different windows; the right branch first applies a depthwise convolution with a 9×5 kernel, fusing features across windows and realizing global feature interaction, and then realizes semantic fusion along the channel dimension through a pointwise convolution; the final output of the right branch is converted back to the original dimensions for concatenation with the left branch;
(4) the unidirectional interaction connection runs from right to left: the depthwise convolution output of the right branch passes through the interaction network to obtain factor, which contains channel semantic weights, and factor is fed to the left branch to participate in its self-attention operation;
(5) the final results of the left and right branches retain only the features of the initial sampling point along the P dimension; the matrix is then converted to x ∈ R^{G×W×C}, and after concatenation a layer of FFN keeps the dimension as x ∈ R^{G×W×C};
(6) to reduce the network's computational cost and accelerate detection and training, the model divides the D_feat channels into four groups and performs interpolation sampling within each group to reduce the matrix size along one dimension; to compensate for the missing interactions between channels of different groups caused by this operation, the model uses a self-attention component to fuse the features of channels in different groups.
6. The method for detecting the target based on the parallel interactive architecture model to realize multi-dimensional feature fusion of claim 1, wherein in step 9:
(1) the output of the decoder is converted through one FFN to the same dimension as the object query's content vector, completing the update of the content vector; the position vector is then updated through another FFN that converts the output to the same dimension as the position vector;
(2) the detection head uses the content vector and position vector to predict the category and position of the candidate box through separate FFNs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211420718.7A CN115690549A (en) | 2022-11-12 | 2022-11-12 | Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211420718.7A CN115690549A (en) | 2022-11-12 | 2022-11-12 | Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115690549A true CN115690549A (en) | 2023-02-03 |
Family
ID=85052450
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211420718.7A Pending CN115690549A (en) | 2022-11-12 | 2022-11-12 | Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115690549A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN116071773A (en) * | 2023-03-15 | 2023-05-05 | 广东电网有限责任公司东莞供电局 | Method, device, medium and equipment for detecting form in power grid construction type archive |
CN117058646A (en) * | 2023-10-11 | 2023-11-14 | 南京工业大学 | Complex road target detection method based on multi-mode fusion aerial view |
CN117058646B (en) * | 2023-10-11 | 2024-02-27 | 南京工业大学 | Complex road target detection method based on multi-mode fusion aerial view |
Similar Documents
Publication | Title
---|---
CN109948425B | Pedestrian searching method and device for structure-aware self-attention and online instance aggregation matching
CN108564109B | Remote sensing image target detection method based on deep learning
CN110443818B | Graffiti-based weak supervision semantic segmentation method and system
CN109344736B | Static image crowd counting method based on joint learning
CN106845430A | Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN111368815A | Pedestrian re-identification method based on multi-component self-attention mechanism
CN114092832B | High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN108765383B | Video description method based on deep migration learning
CN115690549A | Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
CN109443382A | Vision SLAM closed loop detection method based on feature extraction and dimensionality-reduction neural network
CN110399850A | A continuous sign language recognition method based on deep neural network
CN109522961A | A semi-supervised image classification method based on dictionary deep learning
CN110751027B | Pedestrian re-identification method based on deep multi-instance learning
CN104616005A | Domain-adaptive facial expression analysis method
CN112070010B | Pedestrian re-recognition method for enhancing local feature learning by combining multiple-loss dynamic training strategies
CN111652273A | Deep learning-based RGB-D image classification method
CN113239753A | Improved traffic sign detection and identification method based on YOLOv4
CN112364791A | Pedestrian re-identification method and system based on generative adversarial network
CN110096991A | A sign language recognition method based on convolutional neural networks
CN113822368A | Anchor-free incremental target detection method
CN114241191A | Cross-modal self-attention-based non-candidate-box expression understanding method
CN114692732A | Method, system, device and storage medium for online label updating
CN111144462A | Unknown individual identification method and device for radar signals
Cheng et al. | An image-based deep learning approach with improved DETR for power line insulator defect detection
CN114579794A | Multi-scale fusion landmark image retrieval method and system based on feature consistency suggestion
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||