CN112131925B - Construction method of multichannel feature space pyramid - Google Patents

Construction method of multichannel feature space pyramid

Info

Publication number
CN112131925B
CN112131925B (application CN202010709350.0A)
Authority
CN
China
Prior art keywords
feature
pyramid
feature map
multiplied
space pyramid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010709350.0A
Other languages
Chinese (zh)
Other versions
CN112131925A (en)
Inventor
余至立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chang Guangyu
Suirui Technology Group Co Ltd
Original Assignee
Suirui Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suirui Technology Group Co Ltd filed Critical Suirui Technology Group Co Ltd
Priority to CN202010709350.0A priority Critical patent/CN112131925B/en
Publication of CN112131925A publication Critical patent/CN112131925A/en
Application granted granted Critical
Publication of CN112131925B publication Critical patent/CN112131925B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4007 - Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20016 - Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a construction method of a multichannel feature space pyramid, belonging to the technical fields of computer vision and target detection. The method comprises the following steps: inputting a picture to be detected; extracting picture features with any deep-learning-based convolutional neural network; stacking the feature maps of each layer of the convolutional neural network, in extraction order, into a feature space pyramid C; on the basis of pyramid C, building feature space pyramids P0, P1, P2 and P3 with convolution kernels of 4 different channel widths; and finally integrating the feature maps of the 4 pyramids into one pyramid P. Compared with the feature space pyramid C initially extracted by the convolutional neural network, the method greatly enhances the semantics of the feature maps and thereby improves the accuracy of the target detector on multi-scale targets.

Description

Construction method of multichannel feature space pyramid
Technical Field
The invention relates to a feature space pyramid construction method, in particular to a multi-channel image feature space pyramid construction method based on a cascading idea, which is used to extract feature representations of targets in current deep-learning-based target detection methods, and belongs to the technical fields of computer vision and target detection.
Background
Object detection aims at detecting object instances in an image, i.e. identifying the objects contained in the image (e.g. people, vehicles, planes, birds) and locating their positions in the image. Current target detection architectures based on deep convolutional neural networks mainly consist of three different steps: feature extraction, region proposal generation, and classification and localization. Target detection performance depends largely on whether the features extracted by the feature extraction part, the deep convolutional neural network, are sufficient. When a deep convolutional neural network is used for target detection, the shallow feature maps extracted by the earlier convolutional layers carry little semantic information (used later to judge what a target is) but accurate target position information (used later to locate the target), whereas the deep feature maps extracted by the later convolutional layers carry rich semantic information but only coarse target positions. At the same time, because target scales in an image are inconsistent, the semantic feature information of a target appears in different convolutional layers depending on its size. If the target is large (i.e. it occupies a large area of the image), its semantic feature information appears in the later convolutional layers; if the target is small (i.e. it occupies a small area of the image), its position information must be taken from the earlier convolutional layers, since it is likely to disappear in subsequent layers. This makes feature extraction particularly important for detecting images containing multi-scale (large, medium and small) targets.
Disclosure of Invention
The invention aims to solve the above technical problems in the prior art and overcome its defects, and provides a construction method of a multi-channel feature space pyramid for fully extracting the feature information of images containing multi-scale targets.
The invention adopts the following technical scheme:
The construction method of the multichannel feature space pyramid comprises the following steps:
Step (1), inputting a picture to be detected;
Step (2), for the multi-scale target detection task, repeating steps (3) to (10) and continuing to train the neural network model until the loss function of the network converges;
Step (3), selecting a deep-learning-based convolutional neural network to extract the picture features (ResNet-101 is selected here; other convolutional neural networks such as ResNet-50, VGGNet or GoogLeNet may also be selected);
Step (4), marking the last-layer feature maps of each group of convolutional blocks of the convolutional neural network ResNet-101 as C1, C2, C3, C4 and C5; because C1 occupies a large amount of GPU (graphics processing unit) memory, only C2, C3, C4 and C5 are used to construct the four-layer feature space pyramid C in this method;
Step (5), on the basis of the feature space pyramid C, applying a convolution kernel of size 1×1×256 to C2, C3, C4 and C5 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P0;
Step (6), on the basis of the feature space pyramid C, applying a convolution kernel of size 1×1×512 to C3, C4 and C5 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P1;
Step (7), on the basis of the feature space pyramid C, applying a convolution kernel of size 1×1×1024 to C4 and C5 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P2;
Step (8), on the basis of the feature space pyramid C, convolving C5 with a convolution kernel of size 1×1×2048 to obtain a feature space pyramid P3;
Step (9), integrating the layer with the strongest semantic information from each of the four fusion-enhanced feature space pyramids P0, P1, P2 and P3 into a final feature space pyramid P;
Step (10), sending the feature space pyramid P into a target detection network to detect the category and position of the target (an end-to-end code sketch of steps (3)-(10) follows this list).
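Before the step-by-step elaborations below, the following minimal sketch in Python/PyTorch (an assumption of this illustration, not part of the claimed method) strings steps (3)-(10) together; extract_pyramid_c, build_p0 and build_channel_pyramid are hypothetical helper names sketched later in this description, and detector stands in for an unspecified target detection network.
# Illustrative end-to-end sketch of steps (3)-(10); the helpers named here are
# hypothetical and are sketched in the following sections.
def detect_multiscale(image, detector):
    c2, c3, c4, c5 = extract_pyramid_c(image)                         # steps (3)-(4)
    p0 = build_p0(c2, c3, c4, c5)                                     # step (5), 256 channels
    p1 = build_channel_pyramid([c3, c4, c5], [512, 1024, 2048], 512)  # step (6)
    p2 = build_channel_pyramid([c4, c5], [1024, 2048], 1024)          # step (7)
    p3 = build_channel_pyramid([c5], [2048], 2048)                    # step (8)
    final_p = [p0[0], p1[-1], p2[-1], p3[-1]]                         # step (9): P0_2, P1_3, P2_4, P3_5
    return detector(final_p)                                          # step (10): categories and positions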
Preferably, in step (3) a deep convolutional neural network ResNet-101 is used, which contains 101 convolutional layers organized into 4 groups of convolutional blocks.
Step (4) constructs the feature pyramid C; the specific steps are as follows (a code sketch follows step (v) below):
(i) After an input picture enters the deep convolutional neural network ResNet-101, features are extracted in the first layer of the network with a convolution kernel of size 7×7×64, followed by a 3×3 max pooling layer, giving a 64-dimensional feature map C1 at 1/2 the size of the original image;
(ii) Further, in layers 2-10 of ResNet-101, the feature map C1 is convolved with a set of three convolution kernels, 1×1×64, 3×3×64 and 1×1×256, repeated 3 times, finally giving a 256-dimensional feature map C2 at 1/4 the size of the original image;
(iii) Further, in layers 11-22 of ResNet-101, the feature map C2 is convolved with a set of three convolution kernels, 1×1×128, 3×3×128 and 1×1×512, repeated 4 times, giving a 512-dimensional feature map C3 at 1/8 the size of the original image;
(iv) Further, in layers 23-91 of ResNet-101, the feature map C3 is convolved with a set of three convolution kernels, 1×1×256, 3×3×256 and 1×1×1024, repeated 23 times, giving a 1024-dimensional feature map C4 at 1/16 the size of the original image;
(v) Further, in layers 92-100 of ResNet-101, the feature map C4 is convolved with a set of three convolution kernels, 1×1×512, 3×3×512 and 1×1×2048, repeated 3 times, finally giving a 2048-dimensional feature map C5 at 1/32 the size of the original image.
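As an illustration of steps (i)-(v), the sketch below extracts C1-C5 from the torchvision implementation of ResNet-101 and keeps only C2-C5 for the pyramid C; the use of torchvision and the function name extract_pyramid_c are assumptions of this example, not part of the patent.
import torch
import torchvision

backbone = torchvision.models.resnet101()   # 101-layer backbone, 4 groups of convolutional blocks

def extract_pyramid_c(image: torch.Tensor):
    """image: a batch of pictures, shape (N, 3, H, W)."""
    x = backbone.conv1(image)      # 7x7x64 convolution, stride 2 -> 1/2 of the original size
    x = backbone.bn1(x)
    c1 = backbone.relu(x)          # 64-dim feature map C1 (discarded later: heavy on GPU memory)
    x = backbone.maxpool(c1)       # 3x3 max pooling, stride 2
    c2 = backbone.layer1(x)        # layers 2-10:   256-dim,  1/4 of the original size
    c3 = backbone.layer2(c2)       # layers 11-22:  512-dim,  1/8
    c4 = backbone.layer3(c3)       # layers 23-91:  1024-dim, 1/16
    c5 = backbone.layer4(c4)       # layers 92-100: 2048-dim, 1/32
    return c2, c3, c4, c5          # the four-layer feature space pyramid C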
Preferably, step (5) constructs the feature space pyramid P0 from the feature space pyramid C (a code sketch follows step (iv) below):
(i) Reducing the dimension of the feature map C5 with a 1×1×256 convolution kernel to obtain P0_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×256 convolution kernel to obtain C4', then up-sampling P0_5 by linear interpolation to obtain P0_5'; at this point P0_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×256 convolution kernel to obtain C3', then up-sampling P0_4 by linear interpolation to obtain P0_4'; at this point C3' and P0_4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_3;
(iv) Reducing the dimension of the feature map C2 with a 1×1×256 convolution kernel to obtain C2', then up-sampling P0_3 by linear interpolation to obtain P0_3'; at this point C2' and P0_3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_2.
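A minimal sketch of steps (i)-(iv) follows, again assuming PyTorch; because the weight computation of the feature fusion unit (Fig. 3) is not reproduced here, simple element-wise addition stands in for it.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Transverse 1x1x256 convolutions that reduce C2..C5 to a common 256-dimensional space.
lateral_c2 = nn.Conv2d(256, 256, kernel_size=1)
lateral_c3 = nn.Conv2d(512, 256, kernel_size=1)
lateral_c4 = nn.Conv2d(1024, 256, kernel_size=1)
lateral_c5 = nn.Conv2d(2048, 256, kernel_size=1)

def fuse(deeper: torch.Tensor, shallower: torch.Tensor) -> torch.Tensor:
    """Stand-in for the feature fusion unit: 2x up-sampling of the deeper map by
    linear interpolation, then element-wise addition (an assumption replacing
    the weighted fusion of Fig. 3)."""
    deeper_up = F.interpolate(deeper, size=shallower.shape[-2:],
                              mode="bilinear", align_corners=False)
    return deeper_up + shallower

def build_p0(c2, c3, c4, c5):
    p0_5 = lateral_c5(c5)               # step (i)
    p0_4 = fuse(p0_5, lateral_c4(c4))   # step (ii)
    p0_3 = fuse(p0_4, lateral_c3(c3))   # step (iii)
    p0_2 = fuse(p0_3, lateral_c2(c2))   # step (iv)
    return p0_2, p0_3, p0_4, p0_5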
Preferably, step (6) establishes the feature pyramid P1 as follows (a generic sketch covering P1, P2 and P3 follows step (8) below):
(i) Reducing the dimension of the feature map C5 with a 1×1×512 convolution kernel to obtain P1_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×512 convolution kernel to obtain C4', then up-sampling P1_5 by linear interpolation to obtain P1_5'; at this point P1_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×512 convolution kernel to obtain C3', then up-sampling P1_4 by linear interpolation to obtain P1_4'; at this point P1_4' and C3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_3.
Preferably, step (7) establishes the feature pyramid P2 as follows:
(i) Reducing the dimension of the feature map C5 with a 1×1×1024 convolution kernel to obtain P2_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×1024 convolution kernel to obtain C4', then up-sampling P2_5 by linear interpolation to obtain P2_5'; at this point P2_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P2_4.
Preferably, step (8) establishes the feature pyramid P3 as follows: the dimension of the feature map C5 is reduced with a 1×1×2048 convolution kernel to obtain P3_5.
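Steps (6)-(8) repeat the same top-down scheme as P0, but start one level deeper each time and widen the channel count (512, 1024, 2048). A generic sketch follows, reusing the fuse helper and extract_pyramid_c from the examples above; the function name build_channel_pyramid and the 512x512 input size are illustrative assumptions.
import torch
import torch.nn as nn

def build_channel_pyramid(c_maps, in_dims, out_dim):
    """c_maps: the C-level feature maps used by this pyramid, shallowest first,
    e.g. [c3, c4, c5] with in_dims [512, 1024, 2048] and out_dim 512 for P1."""
    laterals = [nn.Conv2d(d, out_dim, kernel_size=1)(m)      # transverse 1x1 convolutions
                for d, m in zip(in_dims, c_maps)]
    top = laterals[-1]                        # deepest level (e.g. P1_5), no fusion needed
    levels = [top]
    for lateral in reversed(laterals[:-1]):   # fuse downward toward the shallower maps
        top = fuse(top, lateral)
        levels.append(top)
    return levels                             # shallowest, most strongly fused map last

c2, c3, c4, c5 = extract_pyramid_c(torch.randn(1, 3, 512, 512))    # from the backbone sketch above
p1 = build_channel_pyramid([c3, c4, c5], [512, 1024, 2048], 512)   # step (6)
p2 = build_channel_pyramid([c4, c5], [1024, 2048], 1024)           # step (7)
p3 = build_channel_pyramid([c5], [2048], 2048)                     # step (8): only P3_5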
Preferably, step (9) establishes the final feature pyramid as follows: within each pyramid, the bottom-layer feature map has fully fused and enhanced the feature semantics of the upper-layer feature maps together with its own position information; therefore, the final feature space pyramid P is reconstructed from the lowest-level feature map of each pyramid, as sketched below.
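Continuing the sketches above (build_p0 and the channel pyramids p1, p2, p3 come from the earlier examples), step (9) keeps only the bottom, most strongly fused map of each pyramid, i.e. P0_2, P1_3, P2_4 and P3_5:
p0 = build_p0(c2, c3, c4, c5)               # (P0_2, P0_3, P0_4, P0_5) from the P0 sketch
final_p = [p0[0], p1[-1], p2[-1], p3[-1]]   # step (9): [P0_2, P1_3, P2_4, P3_5]
for name, level in zip(["P0_2", "P1_3", "P2_4", "P3_5"], final_p):
    print(name, "channels:", level.shape[1], "spatial:", tuple(level.shape[-2:]))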
Preferably, in step (10) the final feature space pyramid P is fed into a convolutional neural network for subsequent detection and localization. The specific implementation is as follows: a classifier is applied to the fused feature maps to detect the targets contained in the picture (e.g. people, cars, birds, planes), and a locator is then used to obtain the coordinate positions of the targets in the picture.
Preferably, the convolutional neural network is one or more of ResNet-101, ResNet-50, VGGNet or GoogLeNet.
By adopting the above technical scheme, the invention has the following advantages: the method constructs several feature space pyramids with different channel widths by means of a cascading idea, and then integrates the feature map with the strongest semantics from each feature space pyramid into a single feature space pyramid, so that its semantic features and position information are enhanced and the accuracy of target classification and localization is ultimately improved. The proposed multi-channel feature space pyramid construction therefore offers good robustness and detection capability for multi-scale target detection.
Drawings
FIG. 1 is a flow chart of an overall implementation of the present invention.
Fig. 2 is a block diagram of a convolutional neural network used in the present invention.
Fig. 3 is a diagram of a weight calculation method used in the present invention.
Fig. 4 is a picture before detection with the present method.
Fig. 5 is the same picture after detection with the present invention.
Detailed Description
The embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Examples:
As shown in fig. 1-5, the present embodiment is a method for constructing a multi-channel feature space pyramid, which is implemented according to the following steps:
Step (1), inputting a picture to be detected;
Step (2), for the multi-scale target detection task, repeating steps (3) to (10) and continuing to train the neural network model until the loss function of the network converges;
Step (3), selecting a deep-learning-based convolutional neural network to extract the picture features (in this embodiment the deep convolutional neural network ResNet-101 is selected; other convolutional neural networks such as ResNet-50, VGGNet or GoogLeNet may also be selected);
Step (4), marking the last-layer feature maps of each group of convolutional blocks of the convolutional neural network ResNet-101 as C1, C2, C3, C4 and C5; because C1 occupies a large amount of GPU (graphics processing unit) memory, only C2, C3, C4 and C5 are used to construct the four-layer feature space pyramid C in this method;
Step (5), on the basis of the feature space pyramid C, applying a convolution kernel of size 1x1x256 to C2, C3, C4 and C5 and enhancing the feature representation with a feature fusion unit to obtain an enhanced feature space pyramid P0;
Step (6), on the basis of the feature space pyramid C, applying a convolution kernel of size 1x1x512 to C3, C4 and C5 and enhancing the feature representation with a feature fusion unit to obtain an enhanced feature space pyramid P1;
Step (7), on the basis of the feature space pyramid C, applying a convolution kernel of size 1x1x1024 to C4 and C5 and enhancing the feature representation with a feature fusion unit to obtain an enhanced feature space pyramid P2;
Step (8), on the basis of the feature space pyramid C, convolving C5 with a convolution kernel of size 1x1x2048 to obtain a feature space pyramid P3;
Step (9), integrating the layer with the strongest semantic information from each of the four fusion-enhanced feature space pyramids P0, P1, P2 and P3 into a final feature space pyramid P;
Step (10), sending the feature space pyramid P into a target detection network to detect the category and position of the target.
The deep-learning convolutional neural network model adopted in steps (3) and (4) of this construction method is ResNet-101, concretely realized as follows:
First, a deep convolutional neural network ResNet-101 is used, which contains 101 convolutional layers organized into 4 groups of convolutional blocks. The structure is shown in the attached figure 1:
(i) After the input picture enters the deep convolutional neural network ResNet-101, features are extracted in the first layer of the network with a convolution kernel of size 7x7x64, followed by a 3x3 max pooling layer, giving a 64-dimensional feature map C1 at 1/2 the size of the original image;
(ii) Further, in layers 2-10 of ResNet-101, the feature map C1 is convolved with a set of three convolution kernels, 1x1x64, 3x3x64 and 1x1x256, repeated 3 times, giving a 256-dimensional feature map C2 at 1/4 the size of the original image;
(iii) Further, in layers 11-22 of ResNet-101, the feature map C2 is convolved with a set of three convolution kernels, 1x1x128, 3x3x128 and 1x1x512, repeated 4 times, giving a 512-dimensional feature map C3 at 1/8 the size of the original image;
(iv) Further, in layers 23-91 of ResNet-101, the feature map C3 is convolved with a set of three convolution kernels, 1x1x256, 3x3x256 and 1x1x1024, repeated 23 times, giving a 1024-dimensional feature map C4 at 1/16 the size of the original image;
(v) Further, in layers 92-100 of ResNet-101, the feature map C4 is convolved with a set of three convolution kernels, 1x1x512, 3x3x512 and 1x1x2048, repeated 3 times, giving a 2048-dimensional feature map C5 at 1/32 the size of the original image;
Step (5) of the construction method in this embodiment builds the feature space pyramid P0 from the feature space pyramid C:
In the feature space pyramid C obtained by the deep convolutional neural network ResNet-101, the semantic information of each layer is different: C2 has the smallest dimension and the shallowest semantic features, but contains the most target position information. To balance semantic features and position information, a feature fusion unit is used to augment the feature space pyramid C. Because the feature fusion unit requires its two input feature maps to have the same dimension and size, the scheme comprises a transverse convolution dimension-reduction process and a longitudinal 2x up-sampling process: the transverse convolution aligns the dimensions, and the longitudinal up-sampling makes the feature maps the same size (see the sketch of the fusion unit after steps (i)-(iv) below). The specific steps are as follows:
(i) Performing dimension reduction on the feature map C5 by using a convolution kernel of 1x1x256 to obtain P0_5;
(ii) Reducing the dimension of the feature map C4 with a 1x1x256 convolution kernel to obtain C4', then up-sampling P0_5 by linear interpolation to obtain P0_5'; at this point P0_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_4;
(iii) Reducing the dimension of the feature map C3 with a 1x1x256 convolution kernel to obtain C3', then up-sampling P0_4 by linear interpolation to obtain P0_4'; at this point C3' and P0_4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_3;
(iv) Reducing the dimension of the feature map C2 with a 1x1x256 convolution kernel to obtain C2', then up-sampling P0_3 by linear interpolation to obtain P0_3'; at this point C2' and P0_3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_2;
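The feature fusion unit itself is described above only at the level of Fig. 3; the sketch below is one possible reading in PyTorch, where two learned scalar weights stand in for the weight calculation of Fig. 3 (this weighting, the dummy tensor shapes and the 512x512 input size are assumptions of the example, not the patented design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionUnit(nn.Module):
    """Combines two feature maps of identical size and dimension into one
    enhanced map; the learned softmax weights are an assumption replacing
    the weight calculation of Fig. 3."""
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))

    def forward(self, shallow, deep_upsampled):
        w = torch.softmax(self.weights, dim=0)     # positive weights summing to one
        return w[0] * shallow + w[1] * deep_upsampled

# Dummy maps shaped like C4 (1/16) and P0_5 (1/32) for a 512x512 input picture.
c4 = torch.randn(1, 1024, 32, 32)
p0_5 = torch.randn(1, 256, 16, 16)
c4_prime = nn.Conv2d(1024, 256, kernel_size=1)(c4)            # transverse 1x1x256 convolution
p0_5_up = F.interpolate(p0_5, scale_factor=2.0,               # longitudinal 2x up-sampling
                        mode="bilinear", align_corners=False)
p0_4 = FeatureFusionUnit()(c4_prime, p0_5_up)
print(p0_4.shape)   # torch.Size([1, 256, 32, 32])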
Step (6) of the construction method in this embodiment establishes the feature pyramid P1:
(i) Performing dimension reduction on the feature map C5 by using a convolution kernel of 1x1x512 to obtain P1_5;
(ii) Reducing the dimension of the feature map C4 with a 1x1x512 convolution kernel to obtain C4', then up-sampling P1_5 by linear interpolation to obtain P1_5'; at this point P1_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_4;
(iii) Reducing the dimension of the feature map C3 with a 1x1x512 convolution kernel to obtain C3', then up-sampling P1_4 by linear interpolation to obtain P1_4'; at this point P1_4' and C3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_3;
Step (7) of the construction method in this embodiment establishes the feature pyramid P2:
(i) Performing dimension reduction on the feature map C5 by using a convolution kernel of 1x1x1024 to obtain P2_5;
(ii) Reducing the dimension of the feature map C4 with a 1x1x1024 convolution kernel to obtain C4', then up-sampling P2_5 by linear interpolation to obtain P2_5'; at this point P2_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P2_4;
Step (8) of the construction method in this embodiment establishes the feature pyramid P3: the dimension of the feature map C5 is reduced with a 1x1x2048 convolution kernel to obtain P3_5;
Step (9) of the construction method in this embodiment establishes the final feature pyramid: within each pyramid, the bottom-layer feature map has fully fused and enhanced the feature semantics of the upper-layer feature maps together with its own position information. Thus, the final feature space pyramid P is reconstructed from the lowest-level feature map of each pyramid.
In step (10) of this embodiment, the feature maps in the final feature space pyramid P are fed into a convolutional neural network for detection and localization. Specifically, a classifier is applied to the fused feature maps to detect the targets contained in the picture (such as people, cars, birds and planes), and a locator is then used to obtain the coordinate positions of the targets in the picture.
As can easily be seen by comparing Fig. 4 and Fig. 5, the detection of small-scale targets in the picture is markedly improved.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A construction method of a multichannel feature space pyramid, characterized by comprising the following steps:
Step (1), inputting a picture to be detected;
Step (2), for the multi-scale target detection task, repeating steps (3) to (10) and continuing to train the neural network model until the loss function of the network converges;
Step (3), extracting picture features with a deep-learning-based convolutional neural network;
Step (4), marking the last-layer feature maps of each group of convolutional blocks of the convolutional neural network as C1, C2, C3, C4 and C5, and constructing a four-layer feature space pyramid C using only C2, C3, C4 and C5;
Step (5), on the basis of the feature space pyramid C, applying a convolution kernel of size 1×1×256 to C2, C3, C4 and C5 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P0;
Step (6), on the basis of the feature space pyramid C, applying a convolution kernel of size 1×1×512 to C3, C4 and C5 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P1;
Step (7), on the basis of the feature space pyramid C, applying a convolution kernel of size 1×1×1024 to C4 and C5 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P2;
Step (8), on the basis of the feature space pyramid C, convolving C5 with a convolution kernel of size 1×1×2048 to obtain a feature space pyramid P3;
Step (9), integrating the bottommost feature maps of the four fusion-enhanced feature space pyramids P0, P1, P2 and P3 into a final feature space pyramid P;
Step (10), sending the feature space pyramid P into a target detection network to detect the category and position of the target;
wherein step (5) constructs the feature space pyramid P0 from the feature space pyramid C and comprises the following steps:
(i) Reducing the dimension of the feature map C5 with a 1×1×256 convolution kernel to obtain P0_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×256 convolution kernel to obtain C4', then up-sampling P0_5 by linear interpolation to obtain P0_5'; at this point P0_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×256 convolution kernel to obtain C3', then up-sampling P0_4 by linear interpolation to obtain P0_4'; at this point C3' and P0_4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_3;
(iv) Reducing the dimension of the feature map C2 with a 1×1×256 convolution kernel to obtain C2', then up-sampling P0_3 by linear interpolation to obtain P0_3'; at this point C2' and P0_3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_2;
and wherein step (9) builds the final feature pyramid as follows: within each pyramid, the bottom-layer feature map has fully fused and enhanced the feature semantics of the upper-layer feature maps together with its own position information; thus, the final feature space pyramid P is reconstructed from the lowest-level feature map of each pyramid.
2. The method for constructing a multi-channel feature space pyramid according to claim 1, wherein the deep-learning convolutional neural network model adopted in steps (3) and (4) is ResNet-101, and the steps are as follows:
(i) After an input picture enters the deep convolutional neural network ResNet-101, features are extracted in the first layer of the network with a convolution kernel of size 7×7×64, followed by a 3×3 max pooling layer, giving a 64-dimensional feature map C1 at 1/2 the size of the original image;
(ii) In layers 2-10 of ResNet-101, the feature map C1 is convolved with a set of three convolution kernels, 1×1×64, 3×3×64 and 1×1×256, repeated 3 times, finally giving a 256-dimensional feature map C2 at 1/4 the size of the original image;
(iii) In layers 11-22 of ResNet-101, the feature map C2 is convolved with a set of three convolution kernels, 1×1×128, 3×3×128 and 1×1×512, repeated 4 times, giving a 512-dimensional feature map C3 at 1/8 the size of the original image;
(iv) In layers 23-91 of ResNet-101, the feature map C3 is convolved with a set of three convolution kernels, 1×1×256, 3×3×256 and 1×1×1024, repeated 23 times, giving a 1024-dimensional feature map C4 at 1/16 the size of the original image;
(v) In layers 92-100 of ResNet-101, the feature map C4 is convolved with a set of three convolution kernels, 1×1×512, 3×3×512 and 1×1×2048, repeated 3 times, finally giving a 2048-dimensional feature map C5 at 1/32 the size of the original image.
3. The method for constructing a multi-channel feature space pyramid as claimed in claim 1, wherein said step (6) of creating a feature pyramid P1 comprises the steps of:
(i) Reducing the dimension of the feature map C5 with a 1×1×512 convolution kernel to obtain P1_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×512 convolution kernel to obtain C4', then up-sampling P1_5 by linear interpolation to obtain P1_5'; at this point P1_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×512 convolution kernel to obtain C3', then up-sampling P1_4 by linear interpolation to obtain P1_4'; at this point P1_4' and C3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_3.
4. The method for constructing a multi-channel feature space pyramid as claimed in claim 1, wherein said step (7) of creating a feature pyramid P2 comprises the steps of:
(i) Reducing the dimension of the feature map C5 with a 1×1×1024 convolution kernel to obtain P2_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×1024 convolution kernel to obtain C4', then up-sampling P2_5 by linear interpolation to obtain P2_5'; at this point P2_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P2_4.
5. The method for constructing a multi-channel feature space pyramid as claimed in claim 1, wherein said step (8) of creating the feature pyramid P3 comprises: reducing the dimension of the feature map C5 with a 1×1×2048 convolution kernel to obtain P3_5.
6. The method for constructing a multi-channel feature space pyramid according to claim 1, wherein the step (10) of feeding the feature maps in the final feature space pyramid P into a convolutional neural network for detection and localization comprises: detecting the targets contained in the picture by applying a classifier to the fused feature maps, and then obtaining the coordinate positions of the targets in the picture with a locator.
7. The method of claim 1, wherein the convolutional neural network is one or more of ResNet-101, ResNet-50, VGGNet, or GoogLeNet.
CN202010709350.0A 2020-07-22 2020-07-22 Construction method of multichannel feature space pyramid Active CN112131925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010709350.0A CN112131925B (en) 2020-07-22 2020-07-22 Construction method of multichannel feature space pyramid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010709350.0A CN112131925B (en) 2020-07-22 2020-07-22 Construction method of multichannel feature space pyramid

Publications (2)

Publication Number Publication Date
CN112131925A CN112131925A (en) 2020-12-25
CN112131925B true CN112131925B (en) 2024-06-07

Family

ID=73850570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010709350.0A Active CN112131925B (en) 2020-07-22 2020-07-22 Construction method of multichannel feature space pyramid

Country Status (1)

Country Link
CN (1) CN112131925B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220125719A (en) * 2021-04-28 2022-09-14 베이징 바이두 넷컴 사이언스 테크놀로지 컴퍼니 리미티드 Method and equipment for training target detection model, method and equipment for detection of target object, electronic equipment, storage medium and computer program

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160379A (en) * 2018-11-07 2020-05-15 北京嘀嘀无限科技发展有限公司 Training method and device of image detection model and target detection method and device
WO2020098225A1 (en) * 2018-11-16 2020-05-22 北京市商汤科技开发有限公司 Key point detection method and apparatus, electronic device and storage medium
CN110782420A (en) * 2019-09-19 2020-02-11 杭州电子科技大学 Small target feature representation enhancement method based on deep learning
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MFPN: A Novel Mixture Feature Pyramid Network of Multiple Architectures for Object Detection; Tingting Liang et al.; arXiv:1912.09748v1; pp. 1-7 *
Multi-layer feature map stacking network and its object detection method; Yang Aiping; Lu Liyu; Ji Zhong; Journal of Tianjin University (Science and Technology) (06); full text *

Also Published As

Publication number Publication date
CN112131925A (en) 2020-12-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211123

Address after: 100192 Beijing suirui center, building 19, Aobei Science Park, 1 Baosheng South Road, Haidian District, Beijing

Applicant after: Chang Guangyu

Address before: Room 1501, 15th floor, block B, building 1, 459 Jianghong Road, Binjiang District, Hangzhou City, Zhejiang Province 310000

Applicant before: ZHEJIANG UHOPE COMMUNICATIONS TECHNOLOGY Co.,Ltd.

Effective date of registration: 20211123

Address after: Room 101, floor 1, building 19, yard 1, Baosheng South Road, Haidian District, Beijing 100192

Applicant after: Suirui Technology Group Co.,Ltd.

Address before: 100192 Beijing suirui center, building 19, Aobei Science Park, 1 Baosheng South Road, Haidian District, Beijing

Applicant before: Chang Guangyu

GR01 Patent grant