CN111666988A - Target detection algorithm based on multi-layer information fusion - Google Patents

Target detection algorithm based on multi-layer information fusion Download PDF

Info

Publication number
CN111666988A
Authority
CN
China
Prior art keywords
network
features
information
region
detection algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010444366.3A
Other languages
Chinese (zh)
Inventor
陈宝远
申宇琨
历博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202010444366.3A priority Critical patent/CN111666988A/en
Publication of CN111666988A publication Critical patent/CN111666988A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection algorithm based on multi-layer information fusion, which comprises the following steps: S1, preprocessing the data-set images and resizing the image data to the size required by the network; S2, extracting information of the image at different levels with DenseNet and obtaining features at four stages; S3, normalizing the channel number of the four-stage features; and S4, vertically fusing the extracted multi-level information and enhancing the propagation of information across levels, so that the feature maps carry both rich deep semantic information and shallow position information. The invention uses DenseNet as the feature extraction network; compared with the conventional ResNet network, the number of parameters required is less than half that of ResNet. For industry, the smaller model noticeably saves bandwidth, reduces storage overhead and improves the computational efficiency of the network model, and information at different levels is extracted according to the characteristics of the network.

Description

Target detection algorithm based on multi-layer information fusion
Technical Field
The invention belongs to the field of computer vision detection, and particularly relates to a target detection algorithm based on multi-layer information fusion.
Background
Object detection and recognition is one of the basic tasks in the field of computer vision. In industry, object detection receives wide attention and has many practical applications in various fields, for example: target tracking, driver assistance, biometric recognition, smart home, smart agriculture, medical image analysis and identification of flying objects. Reducing human labor through computer vision is of great practical significance. In the automotive industry, car makers and tier-1 suppliers compete fiercely in the driver-assistance field, and camera-based driver assistance has become an established route. For driver assistance on complex urban roads, the road conditions are complicated, obstacles such as motor vehicles, non-motor vehicles and pedestrians are numerous, and small targets such as children, pets and scooters may appear. Accurate detection of vehicles and pedestrians through the camera is required by the driving system and is a very important basic link in driver-assistance technology, so improving the accuracy and efficiency of camera-based detection algorithms is of great significance to vehicle safety.
With the rapid development of deep learning in recent years, target detection algorithms have shifted from traditional algorithms based on hand-crafted features to detection techniques based on deep neural networks. In deep-learning-based target detection algorithms, detection accuracy and detection speed are in tension: improving accuracy usually requires sacrificing speed. Moreover, detection network structures are becoming more complicated, the number of parameters is excessive, training takes a long time, and training efficiency is low, so the overall algorithms still leave considerable room for improvement. Among existing target detection algorithms, Faster R-CNN is one of the more advanced: its authors proposed a candidate-region extraction network with shared features, and the application of this network further improved the performance of the algorithm. However, its backbone network VGGNet is an image classification network pre-trained on ImageNet and is therefore position-insensitive, and the successive down-sampling in VGGNet filters out the information of some smaller targets, so the feature information fed into the region candidate network is incomplete.
The single-stage detection algorithm obtains the position and category information of the target through a single pass of the network. Compared with proposal-based target detection algorithms, the detection speed is greatly improved, making such methods better suited to mobile devices. However, compared with region-proposal-based methods they suffer from less accurate localization and lower recall, detect closely spaced objects and very small objects poorly, and have relatively weak generalization capability.
Disclosure of Invention
In view of the above, the present invention provides a target detection algorithm based on multi-layer information fusion, which aims to overcome the defects in the prior art.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
a target detection algorithm based on multi-layer information fusion comprises the following steps:
s1, preprocessing the data set image; adjusting the image data to the size set by the network;
s2, extracting different levels of information of the image by using Densenet, and extracting characteristic features of four stages;
s3, normalizing the channel number of the extracted features of the four stages;
s4, vertically fusing the extracted multi-level information, and enhancing the transmission of different levels of information, so that the feature map has rich deep semantic information and shallow position information;
s5, extracting the region of interest from the fused multi-level information by using a region suggestion algorithm;
s6, predicting the accurate category of the region of interest and regressing the position coordinates;
s7, calculating the multi-task loss function of the classification network and the regression network, training and optimizing the network so that the classification and regression loss functions converge, and saving the weight parameters of the network;
And S8, deploying the optimized parameters and detecting the target.
Further, the specific steps of step S1 are as follows:
s101, performing color enhancement, translation, and horizontal and vertical flipping on the image;
s102, scaling all image data to 448 x 448 size using linear interpolation.
Further, the specific method for extracting the features in step S2 is as follows: performing convolution and pooling on the image using the built 98-layer DenseNet network, and taking the output of each transition layer to obtain feature maps of four stages with resolutions of 56×56, 28×28, 17×17 and 17×17.
Further, the specific method for normalizing the number of channels by the four-stage features in step S3 is as follows: the convolution operation is performed on the four stage features by using convolution of 1 × 1 with the channel number of 256, and the dimension of all the stage features is specified to be 256.
Further, the multi-stage feature fusion in step S4 specifically includes:
s401, performing corresponding element addition operation on two adjacent stage features with the same size, and performing up-sampling operation on a smaller-size feature if the two stage features are different in size to ensure that the two fused features are the same in size;
and S402, convolving the fused result by using a convolution kernel of 3x3 to eliminate the aliasing effect after fusion.
Further, the specific method of step S5 is as follows: extracting the region of interest from the multiple stage features fused in step S4 using a region candidate network, and performing foreground/background binary prediction and rough fitting of the border position on the region of interest using an anchor point mechanism.
Further, the specific method of step S6 is as follows:
s601, performing pooling operation on the region of interest extracted in the step S5;
s602, inputting the pooled region of interest into a fully-connected network, and classifying by using a Softmax classifier;
and S603, outputting predicted target position coordinates x, y, w and h, wherein x, y, w and h respectively represent the center coordinate, the width and the height of the box.
Further, the specific method of step S7 is as follows:
s701, firstly, calculating a loss function of a classification part:
L_cls(p_i, p_i^*) = -log[ p_i^* p_i + (1 - p_i^*)(1 - p_i) ]
wherein: p_i is the probability that anchor i is predicted as a target, and p_i^* is the corresponding ground-truth label of the data set;
S702, calculating a loss function of the position regression part, using the Smooth L1 (σ = 3) smoothing loss function:
L_reg(t_i, t_i^*) = smooth_L1(t_i - t_i^*)
smooth_L1(x) = 0.5(σx)^2 if |x| < 1/σ^2, and |x| - 0.5/σ^2 otherwise
wherein: t_i is the 4-parameter coordinate vector of the prediction bounding box, and t_i^* is the parameterized vector of the real box matched to the positive anchor;
t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a)
t_x^* = (x^* - x_a)/w_a,  t_y^* = (y^* - y_a)/h_a,  t_w^* = log(w^*/w_a),  t_h^* = log(h^*/h_a)
wherein: x, x_a and x^* correspond to the prediction box, the anchor point and the real box respectively (and likewise for y, w and h);
S703, finally calculating the sum of the two loss functions:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
wherein: N_cls is the number of images per input when training the network, N_reg is the number of anchor points, and λ is a balance parameter between the two parts of the loss;
and S704, training the fully-connected network to make the loss function converge.
Compared with the prior art, the invention has the following advantages:
the invention uses Densenet as a feature extraction network, compared with the traditional ResNet network, the parameter quantity required by the network is less than half of the ResNet; for the industry, the small model can obviously save bandwidth, reduce storage overhead, improve the calculation efficiency of the network model, and extract information of different levels according to network characteristics.
The invention addresses the insensitivity of two-stage target detection algorithms to small target objects: it strengthens the information-extraction capability of the base network and builds a multi-layer information fusion network to fuse information of different layers, ensuring that neither the position information of the high-level features nor the semantic information of the low-level features is lost; the detection capability for targets of different sizes is improved without reducing the detection speed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limitation. In the drawings:
FIG. 1 is a flowchart of a target detection algorithm based on multi-layer information fusion according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a feature extraction network Densenet in the embodiment of the present invention;
FIG. 3 is a flow chart of a multi-information fusion network in an embodiment of the present invention;
FIG. 4 is a flowchart of the regression and classification of candidate regions according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in the orientation or positional relationship indicated in the drawings, which are merely for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be construed as limiting the invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the invention, the meaning of "a plurality" is two or more unless otherwise specified.
In the description of the invention, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted", "connected" and "coupled" are to be construed broadly, e.g. as fixed, detachable or integral connections; as mechanical or electrical connections; as direct connections or indirect connections through intervening media, or as internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.
The invention will be described in detail with reference to the following embodiments with reference to the attached drawings.
An object detection algorithm based on multi-layer information fusion, as shown in fig. 1 to 4, includes:
s1, preprocessing the data set image; adjusting the image data to the size set by the network; s2, extracting different levels of information of the image by using Densenet, and extracting characteristic features of four stages; s3, normalizing the channel number of the extracted features of the four stages; s4, vertically fusing the extracted multi-level information, and enhancing the transmission of different levels of information, so that the feature map has rich deep semantic information and shallow position information; s5, extracting the region of interest from the fused multi-level information by using a region suggestion algorithm; s6, predicting the accurate category of the region of interest and regressing the position coordinates; s7, calculating a multi-task loss function of the classification network and the regression network, training and optimizing the network to make the classification and regression loss function converge and save the weight parameters of the network; s8, deploying the optimized parameters, and detecting the target;
specifically, S8, deploying the network weight parameters stored in step 7 into a network, inputting image data including the target, performing feature extraction and fusion on the image through a trained parameter network, performing rough prediction on the region of interest, performing accurate prediction and regression on the type and position of the target, and finally outputting the target type and position information. .
The specific steps of step S1 are as follows: S101, performing color enhancement, translation, and horizontal and vertical flipping on the image; S102, scaling all image data to 448 x 448 using linear interpolation.
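As an illustration of step S1, the following is a minimal preprocessing sketch assuming a PyTorch/torchvision pipeline (the patent does not name a framework); the jitter and translation magnitudes are illustrative assumptions, and for detection the same geometric transforms would also have to be applied to the ground-truth boxes, which is omitted here.

    import torchvision.transforms as T
    from torchvision.transforms import InterpolationMode

    # Sketch of step S1: color enhancement, translation, horizontal/vertical flips,
    # then resizing every image to 448 x 448 with linear (bilinear) interpolation.
    preprocess = T.Compose([
        T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # color enhancement
        T.RandomAffine(degrees=0, translate=(0.1, 0.1)),               # translation change
        T.RandomHorizontalFlip(p=0.5),                                 # horizontal flip
        T.RandomVerticalFlip(p=0.5),                                   # vertical flip
        T.Resize((448, 448), interpolation=InterpolationMode.BILINEAR),
        T.ToTensor(),
    ])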
The specific method for extracting the features in step S2 is as follows: convolution and pooling are performed on the image using the built 98-layer DenseNet network, and the output of each transition layer is taken to obtain feature maps of four stages with resolutions of 56×56, 28×28, 17×17 and 17×17.
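As an illustration of step S2, the sketch below collects four stage features from a DenseNet backbone, using torchvision's densenet121 only as a stand-in for the 98-layer DenseNet described here (whose exact configuration is not given); the stage outputs are taken at the transition layers and the final dense block.

    import torch.nn as nn
    import torchvision

    class DenseNetBackbone(nn.Module):
        """Collect four stage features C1..C4 from a DenseNet feature extractor (sketch)."""
        def __init__(self):
            super().__init__()
            f = torchvision.models.densenet121(weights=None).features  # stand-in backbone
            self.stem = nn.Sequential(f.conv0, f.norm0, f.relu0, f.pool0)
            self.block1, self.trans1 = f.denseblock1, f.transition1
            self.block2, self.trans2 = f.denseblock2, f.transition2
            self.block3, self.trans3 = f.denseblock3, f.transition3
            self.block4, self.norm5 = f.denseblock4, f.norm5

        def forward(self, x):
            x = self.stem(x)
            c1 = self.trans1(self.block1(x))    # stage-1 features
            c2 = self.trans2(self.block2(c1))   # stage-2 features
            c3 = self.trans3(self.block3(c2))   # stage-3 features
            c4 = self.norm5(self.block4(c3))    # stage-4 features
            return c1, c2, c3, c4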
The specific method for normalizing the number of channels by the four-stage features in the step S3 is as follows: the convolution operation is performed on the four stage features by using convolution of 1 × 1 with the channel number of 256, and the dimension of all the stage features is specified to be 256.
The specific method of the multi-stage feature fusion in step S4 is as follows: s401, performing corresponding element addition operation on two adjacent stage features with the same size, and performing up-sampling operation on a smaller-size feature if the two stage features are different in size to ensure that the two fused features are the same in size; and S402, convolving the fused result by using a convolution kernel of 3x3 to eliminate the aliasing effect after fusion.
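A sketch combining steps S3 and S4 (again assuming PyTorch): 1 × 1 convolutions normalize every stage to 256 channels, adjacent stages are fused by weighted element-wise addition (upsampling the deeper map by bilinear interpolation when the sizes differ), and a 3 × 3 convolution smooths each fused result to suppress the aliasing effect.

    import torch.nn as nn
    import torch.nn.functional as F

    class MultiLevelFusion(nn.Module):
        """Channel normalization (S3) plus top-down multi-level fusion (S4) - sketch."""
        def __init__(self, in_channels, out_channels=256, beta1=0.5, beta2=0.5):
            super().__init__()
            # Step S3: 1x1 convolutions bring every stage feature to 256 channels.
            self.laterals = nn.ModuleList(
                nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
            # Step S402: 3x3 convolutions remove the aliasing effect after fusion.
            self.smooth = nn.ModuleList(
                nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
                for _ in in_channels)
            self.beta1, self.beta2 = beta1, beta2

        def _fuse(self, shallow, deep):
            # Upsample the smaller map if the sizes differ, then add corresponding
            # elements with weights beta1 = beta2 = 0.5 (step S401).
            if deep.shape[-2:] != shallow.shape[-2:]:
                deep = F.interpolate(deep, size=shallow.shape[-2:],
                                     mode='bilinear', align_corners=False)
            return self.beta1 * shallow + self.beta2 * deep

        def forward(self, c1, c2, c3, c4):
            c1p, c2p, c3p, c4p = [l(c) for l, c in zip(self.laterals, (c1, c2, c3, c4))]
            p4 = c4p
            p3 = self._fuse(c3p, p4)   # fusion 1
            p2 = self._fuse(c2p, p3)   # fusion 2
            p1 = self._fuse(c1p, p2)   # fusion 3
            return [s(p) for s, p in zip(self.smooth, (p1, p2, p3, p4))]

With the densenet121 stand-in above, the stage channel counts are (128, 256, 512, 1024), so the module would be constructed as MultiLevelFusion((128, 256, 512, 1024)).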
The specific method of step S5 is as follows: the region of interest is extracted from the multiple stage features fused in step S4 using a region candidate network, and foreground/background binary prediction and rough fitting of the border position are performed on the region of interest using an anchor point mechanism.
The specific method of step S6 is as follows: s601, performing pooling operation on the region of interest extracted in the step S5; s602, inputting the pooled region of interest into a fully-connected network, and classifying by using a Softmax classifier; and S603, outputting predicted target position coordinates x, y, w and h, wherein x, y, w and h respectively represent the center coordinate, the width and the height of the box.
The specific method of step S7 is as follows: s701, firstly, calculating a loss function of a classification part:
L_cls(p_i, p_i^*) = -log[ p_i^* p_i + (1 - p_i^*)(1 - p_i) ]
wherein: p_i is the probability that anchor i is predicted as a target, and p_i^* is the corresponding ground-truth label of the data set;
S702, calculating a loss function of the position regression part, using the Smooth L1 (σ = 3) smoothing loss function:
L_reg(t_i, t_i^*) = smooth_L1(t_i - t_i^*)
smooth_L1(x) = 0.5(σx)^2 if |x| < 1/σ^2, and |x| - 0.5/σ^2 otherwise
wherein: t_i is the 4-parameter coordinate vector of the prediction bounding box, and t_i^* is the parameterized vector of the real box matched to the positive anchor;
t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a)
t_x^* = (x^* - x_a)/w_a,  t_y^* = (y^* - y_a)/h_a,  t_w^* = log(w^*/w_a),  t_h^* = log(h^*/h_a)
wherein: x, x_a and x^* correspond to the prediction box, the anchor point and the real box respectively (and likewise for y, w and h);
S703, finally calculating the sum of the two loss functions:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
wherein: N_cls is the number of images per input when training the network, N_reg is the number of anchor points, and λ is a balance parameter between the two parts of the loss;
and S704, training the fully-connected network to make the loss function converge.
Specifically, the structure of the feature fusion network created by the present invention is shown in FIG. 3. The invention uses the dense feature extraction network to extract feature maps (C1, C2, C3 and C4) of different scales. To achieve feature sharing and more accurate detection, the feature maps of the different stages undergo pyramid fusion and are then each input into the region candidate network for prediction. The multi-level feature information is fused by means of this structure: the low-resolution, high-level features are connected from top to bottom, through lateral connections, with the high-resolution, low-level features, so that the features at every scale contain feature information of objects of different sizes, which increases the detector's perception of information to some extent. Compared with the Faster R-CNN algorithm, which only uses the last layer of features from the feature extraction network, the present algorithm performs proposal-region extraction on the fused multi-stage features rather than only on the last (P1) stage features; because the subsequent region proposal network is a sliding-window detector with a fixed window size, sliding over the different layers of the fused network increases its robustness to changes in target scale. Moreover, if only the last stage were used, more anchor points would be needed, and simply increasing the number of mapped anchor points cannot effectively improve accuracy.
The structure on the left side of FIG. 3 is the dimension normalization process for the features of different levels: the invention takes the output of each transition layer (C1, C2, C3, C4) as the input of the feature fusion network. The feature maps extracted by the backbone network DenseNet have different dimensions and resolutions, so the dimensions of the features at the different levels are normalized before fusion. All the extracted features undergo a 1x1 convolution, which linearly combines the information of the different channels and raises or lowers the dimensionality without harming the expressive capability of the model; it also adds a non-linearity while keeping the size of the feature map unchanged. The dimension-unified features are denoted C1', C2', C3' and C4', with resolutions of 28x28, 28x28, 56x56 and 112x112 respectively. FIG. 3 shows the fusion process of the multi-level features; the specific flow of fusion1, fusion2 and fusion3 is shown in the dashed box on the right. In the fusion of C4'_k and C3'_k, the two feature maps have the same size, so C4'_k is added to C3'_k directly without an up-sampling step. The fusion of P3'_k and C2'_k (fusion 2) proceeds in the same way. In the fusion of P2'_k and C1'_k, the two groups of feature maps differ in size, so bilinear interpolation is used to restore P2'_k to the size of C1'_k. The subscript k denotes the k-th dimension (channel) of a feature.
The Add operation calculation for the two features of the k-th dimension is shown in the following equation.
Z_k(x, y) = f_add(A_k, B_k) = β_1 A_k(x, y) + β_2 B_k(x, y)
The above formula adds the elements of features A_k and B_k at position (x, y); performing this addition at all positions gives the feature after the add operation. β_1 and β_2 weight-balance the features A_k and B_k, with β_1 = β_2 = 0.5.
To eliminate the aliasing effect after fusion, each fusion result is convolved again with a 3 × 3 convolution kernel. The fusion network outputs P1 (28x28, d = 256), P2 (28x28, d = 256), P3 (28x28, d = 256) and P4 (56x56, d = 256), and these fused features are then input to the subsequent region candidate network.
Region-of-interest extraction and the classification and regression network: the algorithm uses a region proposal network to extract regions of interest from the features. The RPN is essentially a class-agnostic, sliding-window target detector; its input is the feature maps of different sizes returned by the base network, and its output is the regions of interest. The structure of the region candidate network is shown in FIG. 4. To generate candidate regions, a 3 × 3 window is slid over feature maps of multiple sizes, and the anchor point mapping mechanism plays a central role in the network. Anchors of fixed shape are placed on the pictures at different sizes and scales and then serve as reference boxes when predicting the target location. The candidate region network provides two fully connected outputs for each anchor point: the first is the probability that the anchor point is a target, and the second is a box regression used to adjust the anchor so that it better fits the predicted target.
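A minimal sketch of such a proposal head, assuming PyTorch: a 3 × 3 convolution slides over each fused feature map and two sibling 1 × 1 convolutions produce, per anchor, the object probability logits and the four box-regression offsets (the choice of nine anchors per location is an assumption, not stated in the patent).

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        """Sliding-window proposal head: objectness + box deltas per anchor (sketch)."""
        def __init__(self, in_channels=256, num_anchors=9):  # nine anchors is an assumption
            super().__init__()
            self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
            self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1)       # target vs. background
            self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # t_x, t_y, t_w, t_h

        def forward(self, feature_maps):
            # The same head slides over every fused level P1..P4.
            scores, deltas = [], []
            for p in feature_maps:
                t = torch.relu(self.conv(p))
                scores.append(self.cls_logits(t))
                deltas.append(self.bbox_deltas(t))
            return scores, deltas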
The RoI Pooling layer uses the proposed regions generated by the region candidate network together with the features extracted by the feature network to obtain a fixed-size feature map for each proposed region. These fixed-size feature maps then pass through fully connected layers: a softmax (normalization) function classifies the specific categories, while a Smooth L1 loss function drives the regression that yields the precise position of the object.
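A sketch of this second-stage head, using torchvision's roi_pool operator as a stand-in for the RoI Pooling layer; the pooled size, hidden width and class count (20 classes plus background) are assumptions for illustration.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_pool

    class DetectionHead(nn.Module):
        """RoI pooling + fully connected classification/regression head (sketch)."""
        def __init__(self, in_channels=256, pool_size=7, num_classes=21):
            super().__init__()
            self.pool_size = pool_size
            self.fc = nn.Sequential(
                nn.Linear(in_channels * pool_size * pool_size, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU())
            self.cls_score = nn.Linear(1024, num_classes)       # softmax over specific classes
            self.bbox_pred = nn.Linear(1024, num_classes * 4)   # per-class box refinement

        def forward(self, feature_map, rois, spatial_scale):
            # rois: list with one (L, 4) tensor of (x1, y1, x2, y2) boxes per image,
            # given in input-image coordinates.
            pooled = roi_pool(feature_map, rois, output_size=self.pool_size,
                              spatial_scale=spatial_scale)
            x = self.fc(pooled.flatten(start_dim=1))
            return torch.softmax(self.cls_score(x), dim=1), self.bbox_pred(x)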
Multi-task loss function: in the region candidate network, we set two types of labels for each anchor point, positive and negative. A positive sample is the anchor with the highest intersection-over-union with a real (ground-truth) box; if the intersection-over-union of an anchor with the real boxes is lower than 0.3, the anchor is set as a negative sample. The loss function of the RPN network is defined as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)    (5)
Equation 5 is divided into two parts: the first part is the classification loss and the second part is the regression loss of the target box, where p_i is the probability that anchor i is predicted as a target, p_i^* is the ground-truth label from the data set, t_i is the 4-parameter coordinate vector of the prediction bounding box, and t_i^* is the vector of the real box matched to the positive anchor.
t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a)
t_x^* = (x^* - x_a)/w_a,  t_y^* = (y^* - y_a)/h_a,  t_w^* = log(w^*/w_a),  t_h^* = log(h^*/h_a)    (6)
In equation (6), x, y, w and h represent the box center coordinates, width and height respectively; x, x_a and x^* correspond to the prediction box, the anchor point and the real box (and likewise for y, w and h).
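A small sketch of the parameterization in equation (6): it encodes a box (x, y, w, h) relative to its anchor and inverts that encoding for prediction.

    import math

    def encode_box(x, y, w, h, xa, ya, wa, ha):
        """Parameterize a box (x, y, w, h) against an anchor (xa, ya, wa, ha), equation (6)."""
        return (x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha)

    def decode_box(tx, ty, tw, th, xa, ya, wa, ha):
        """Invert the parameterization to recover the box centre and size."""
        return tx * wa + xa, ty * ha + ya, math.exp(tw) * wa, math.exp(th) * ha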
In the classification loss of the first part, L_cls is the log loss over the two classes (target, non-target):
L_cls(p_i, p_i^*) = -log[ p_i^* p_i + (1 - p_i^*)(1 - p_i) ]
in the second part of the target block regression prediction, a least squares loss function is generally used. But the penalty of the L2 loss for a relatively large error is high. Smooth is used hereinl1(═ 3) smoothing loss function
Figure BDA0002505183930000122
Figure BDA0002505183930000123
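A sketch of the Smooth L1 term with σ = 3, assuming PyTorch tensors.

    import torch

    def smooth_l1(diff, sigma=3.0):
        """Smooth L1: 0.5*(sigma*x)^2 when |x| < 1/sigma^2, else |x| - 0.5/sigma^2."""
        beta = 1.0 / (sigma ** 2)
        abs_diff = diff.abs()
        return torch.where(abs_diff < beta,
                           0.5 * (sigma * diff) ** 2,
                           abs_diff - 0.5 * beta)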
The two loss terms are normalized by N_cls (the mini-batch size, 32) and N_reg (the number of anchor locations, here 5488) and weighted by a balance parameter λ, which balances the classification and regression parts.
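Putting the two parts together, a sketch of the normalized, λ-weighted RPN loss; it reuses the smooth_l1 helper above, only positive anchors (label 1) contribute to the regression term, and the value of λ shown is an assumption.

    import torch
    import torch.nn.functional as F

    def rpn_loss(obj_logits, box_deltas, labels, target_deltas,
                 n_cls=32, n_reg=5488, lam=10.0):
        """Multi-task RPN loss (sketch): binary log loss + lambda-weighted Smooth L1.

        obj_logits:    (N,) objectness logits for the sampled anchors
        box_deltas:    (N, 4) predicted t_x, t_y, t_w, t_h
        labels:        (N,) 1 for positive anchors, 0 for negative anchors
        target_deltas: (N, 4) ground-truth parameterized coordinates t*
        """
        cls_loss = F.binary_cross_entropy_with_logits(
            obj_logits, labels.float(), reduction='sum') / n_cls
        pos = labels == 1
        reg_loss = smooth_l1(box_deltas[pos] - target_deltas[pos]).sum() / n_reg
        return cls_loss + lam * reg_loss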
The regions of interest extracted by the RPN network still require prediction of the specific class and fine adjustment of the prediction box. This is again a classification and regression process: the regression loss for fine-tuning the candidate box follows the same principle as the regression loss of the region of interest, while the class prediction is extended from the positive/negative two classes to 20 specific classes. The classification and regression loss for the 20 specific classes is shown in equation 8.
L({p}, {t_n'}) = L_cls(p, u) + μ L_reg(t_n', t^*)    (8)
Here p = (p_0, ..., p_k) is the output of the classification network, i.e. the predicted discrete probability distribution over classes for each region of interest, u is the real class of each region of interest, μ = 0.5 is the balance coefficient between the two loss functions, and t_n' denotes the 4 parameterized coordinate vectors of the final fine-tuned bounding box. In order to accurately extract the regions of interest and perform the subsequent specific classification and boundary fine-tuning, the total loss function of the algorithm is set as the sum of the above two losses (formula 12), so that the weights of the two stages can be updated simultaneously when the network is trained.
I_total = L({p_i}, {t_i}) + L({p}, {t_n'})    (12)
The specific steps of the proposed algorithm are as follows: in order to acquire features containing different levels of information, a DenseNet network is used to perform multi-stage feature extraction on the image, and the outputs of the transition layers in the DenseNet are taken as the extracted features of the different stages C_n (n = 1, 2, 3, 4).
Because the feature dimensions extracted at the different stages are inconsistent, dimension normalization is performed on the features of the different stages: the feature dimensions are normalized to 256 using 1x1 convolutions, and then the fusion operation shown in FIG. 3 is performed. The specific fusion calculation is given in formulas 13-16.
C_n' = conv_{1x1, d}(C_n)   (n = 1, 2, 3, 4; d = 256)    (13)
P_4 = C_4'
P_3 = f_add(C_3', P_4)    (14)
P_2 = f_add(C_2', P_3)    (15)
P_1 = f_add(C_1', up(P_2))    (16)
where up(·) denotes the bilinear interpolation used to match the sizes of the two feature maps before addition.
In formula 13, C_n' is the result of dimension normalization of the features of the different stages, and d = 256 is the dimension of the normalized features. In formulas 14 to 16, f_add(·) is the add operation on two features described above, and β_1, β_2 weight-balance the features, with β_1 = β_2 = 0.5.
The region of interest is then extracted from the fused features using the region proposal network. This makes a rough prediction of the target: it judges whether a region contains an object, without predicting the specific category, and coarsely regresses the target position. After the region-of-interest extraction is finished, the specific-category prediction and the fine regression of the target position are performed.
In order to acquire the category and the position of the target accurately, the sum of the loss function of the proposal region and the final classification-regression loss function is computed using the multi-task loss function designed in 2.4. The total loss function is then differentiated using the back-propagation algorithm, the weights and bias parameters are updated, and multiple iterations are performed to minimize the loss function.
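A sketch of one optimization step under these assumptions: a hypothetical model object that returns the proposal-stage and classification-stage losses, summed as in formula (12) and minimized by back-propagation (SGD is an assumed optimizer; the patent does not name one).

    import torch

    def train_step(model, optimizer, images, targets):
        """One training iteration (sketch): forward pass, total loss, back-propagation, update."""
        optimizer.zero_grad()
        rpn_losses, head_losses = model(images, targets)  # hypothetical two-stage loss interface
        total_loss = rpn_losses + head_losses             # I_total = L({p_i},{t_i}) + L({p},{t_n'})
        total_loss.backward()                             # differentiate the total loss
        optimizer.step()                                  # update weights and biases
        return total_loss.item()

    # Example usage with assumed interfaces:
    # optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # for images, targets in data_loader:
    #     loss = train_step(model, optimizer, images, targets)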
The invention uses DenseNet as the feature extraction network; compared with the conventional ResNet network, the number of parameters required is less than half that of ResNet. For industry, the smaller model noticeably saves bandwidth, reduces storage overhead and improves the computational efficiency of the network model, and information at different levels is extracted according to the characteristics of the network.
The invention addresses the insensitivity of two-stage target detection algorithms to small target objects: it strengthens the information-extraction capability of the base network and builds a multi-layer information fusion network to fuse information of different layers, ensuring that neither the position information of the high-level features nor the semantic information of the low-level features is lost; the detection capability for targets of different sizes is improved without reducing the detection speed.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the invention, so that any modifications, equivalents, improvements and the like, which are within the spirit and principle of the present invention, should be included in the scope of the present invention.

Claims (8)

1. A target detection algorithm based on multi-layer information fusion is characterized by comprising the following steps:
s1, preprocessing the data set image; adjusting the image data to the size set by the network;
s2, extracting different levels of information of the image by using Densenet, and extracting characteristic features of four stages;
s3, normalizing the channel number of the extracted features of the four stages;
s4, vertically fusing the extracted multi-level information, and enhancing the transmission of different levels of information, so that the feature map has rich deep semantic information and shallow position information;
s5, extracting the region of interest from the fused multi-level information by using a region suggestion algorithm;
s6, predicting the accurate category of the region of interest and regressing the position coordinates;
s7, calculating the multi-task loss function of the classification network and the regression network, training and optimizing the network so that the classification and regression loss functions converge, and saving the weight parameters of the network;
And S8, deploying the optimized parameters and detecting the target.
2. The multi-layer information fusion-based target detection algorithm as claimed in claim 1, wherein the specific steps of step S1 are as follows:
s101, performing color enhancement, translation, and horizontal and vertical flipping on the image;
s102, scaling all image data to 448 x 448 size using linear interpolation.
3. The multi-layer information fusion-based target detection algorithm as claimed in claim 1, wherein the specific method for feature extraction in step S2 is as follows: performing convolution and pooling on the image using the built 98-layer DenseNet network, and taking the output of each transition layer to obtain feature maps of four stages with resolutions of 56×56, 28×28, 17×17 and 17×17.
4. The multi-layer information fusion-based target detection algorithm as claimed in claim 1, wherein the specific method for performing channel number normalization on the four-stage features in step S3 is as follows: the convolution operation is performed on the four stage features by using convolution of 1 × 1 with the channel number of 256, and the dimension of all the stage features is specified to be 256.
5. The multi-layer information fusion-based target detection algorithm according to claim 1, wherein the multi-stage feature fusion in step S4 is performed by:
s401, performing corresponding element addition operation on two adjacent stage features with the same size, and performing up-sampling operation on a smaller-size feature if the two stage features are different in size to ensure that the two fused features are the same in size;
and S402, convolving the fused result by using a convolution kernel of 3x3 to eliminate the aliasing effect after fusion.
6. The multi-layer information fusion-based target detection algorithm as claimed in claim 1, wherein the specific method of step S5 is as follows: extracting the region of interest from the multiple stage features fused in step S4 using a region candidate network, and performing foreground/background binary prediction and rough fitting of the border position on the region of interest using an anchor point mechanism.
7. The multi-layer information fusion-based target detection algorithm as claimed in claim 1, wherein the specific method of step S6 is as follows:
s601, performing pooling operation on the region of interest extracted in the step S5;
s602, inputting the pooled region of interest into a fully-connected network, and classifying by using a Softmax classifier;
and S603, outputting predicted target position coordinates x, y, w and h, wherein x, y, w and h respectively represent the center coordinate, the width and the height of the box.
8. The multi-layer information fusion-based target detection algorithm as claimed in claim 1, wherein the specific method of step S7 is as follows:
s701, firstly, calculating a loss function of a classification part:
L_cls(p_i, p_i^*) = -log[ p_i^* p_i + (1 - p_i^*)(1 - p_i) ]
wherein: p_i is the probability that anchor i is predicted as a target, and p_i^* is the corresponding ground-truth label of the data set;
S702, calculating a loss function of the position regression part, using the Smooth L1 (σ = 3) smoothing loss function:
L_reg(t_i, t_i^*) = smooth_L1(t_i - t_i^*)
smooth_L1(x) = 0.5(σx)^2 if |x| < 1/σ^2, and |x| - 0.5/σ^2 otherwise
wherein: t_i is the 4-parameter coordinate vector of the prediction bounding box, and t_i^* is the parameterized vector of the real box matched to the positive anchor;
t_x = (x - x_a)/w_a,  t_y = (y - y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a)
t_x^* = (x^* - x_a)/w_a,  t_y^* = (y^* - y_a)/h_a,  t_w^* = log(w^*/w_a),  t_h^* = log(h^*/h_a)
wherein: x, x_a and x^* correspond to the prediction box, the anchor point and the real box respectively (and likewise for y, w and h);
S703, finally calculating the sum of the two loss functions:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
wherein: N_cls is the number of images per input when training the network, N_reg is the number of anchor points, and λ is a balance parameter between the two parts of the loss;
and S704, training the fully-connected network to make the loss function converge.
CN202010444366.3A 2020-05-22 2020-05-22 Target detection algorithm based on multi-layer information fusion Pending CN111666988A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010444366.3A CN111666988A (en) 2020-05-22 2020-05-22 Target detection algorithm based on multi-layer information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010444366.3A CN111666988A (en) 2020-05-22 2020-05-22 Target detection algorithm based on multi-layer information fusion

Publications (1)

Publication Number Publication Date
CN111666988A true CN111666988A (en) 2020-09-15

Family

ID=72384416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010444366.3A Pending CN111666988A (en) 2020-05-22 2020-05-22 Target detection algorithm based on multi-layer information fusion

Country Status (1)

Country Link
CN (1) CN111666988A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190139257A1 (en) * 2017-08-31 2019-05-09 Nec Laboratories America, Inc. Online flow guided memory networks for object detection in video
CN109117876A (en) * 2018-07-26 2019-01-01 成都快眼科技有限公司 A kind of dense small target deteection model building method, model and detection method
CN109614985A (en) * 2018-11-06 2019-04-12 华南理工大学 A kind of object detection method based on intensive connection features pyramid network
CN110046572A (en) * 2019-04-15 2019-07-23 重庆邮电大学 A kind of identification of landmark object and detection method based on deep learning
CN110378880A (en) * 2019-07-01 2019-10-25 南京国科软件有限公司 The Cremation Machine burning time calculation method of view-based access control model

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329610A (en) * 2020-11-03 2021-02-05 中科九度(北京)空间信息技术有限责任公司 High-voltage line detection method based on edge attention mechanism fusion network
CN112990327A (en) * 2021-03-25 2021-06-18 北京百度网讯科技有限公司 Feature fusion method, device, apparatus, storage medium, and program product
CN113610822A (en) * 2021-08-13 2021-11-05 湖南大学 Surface defect detection method based on multi-scale information fusion
CN115017540A (en) * 2022-05-24 2022-09-06 贵州大学 Lightweight privacy protection target detection method and system
CN115017540B (en) * 2022-05-24 2024-07-02 贵州大学 Lightweight privacy protection target detection method and system


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200915

WD01 Invention patent application deemed withdrawn after publication