CN108734210A - Object detection method based on cross-modal multi-scale feature fusion - Google Patents

Object detection method based on cross-modal multi-scale feature fusion

Info

Publication number
CN108734210A
CN108734210A (application CN201810474925.8A)
Authority
CN
China
Prior art keywords
network model
rgb
trained
depth map
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810474925.8A
Other languages
Chinese (zh)
Other versions
CN108734210B (en)
Inventor
刘盛
尹科杰
刘儒瑜
陈彬
陈一彬
沈康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810474925.8A priority Critical patent/CN108734210B/en
Publication of CN108734210A publication Critical patent/CN108734210A/en
Application granted granted Critical
Publication of CN108734210B publication Critical patent/CN108734210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/56: Extraction of image or video features relating to colour

Abstract

The invention discloses an object detection method based on cross-modal multi-scale feature fusion. A depth-map detection network model is initialized with the network parameters of an RGB detection network model; the trained RGB detection network model and depth-map detection network model are then used to initialize the feature-extraction weights of a fusion network model, and final training yields a multi-scale fusion network model that fuses cross-modal features. The invention does not depend on a large annotated depth-image dataset, can fuse depth-image and RGB-image features across modalities, and completes object recognition, localization and detection in real time, efficiently and accurately. The fusion network model designed by the invention needs only one consumer-grade graphics card and a CPU as hardware to reach real-time detection speed.

Description

Object detection method based on cross-modal multi-scale feature fusion
Technical field
The present invention relates to the technical field of image recognition, and in particular to an object detection method based on cross-modal multi-scale feature fusion, which simultaneously completes the tasks of detecting, locating and accurately recognizing objects in color-depth images (RGB-D images, containing color information and depth information).
Background technology
In industry, faster, more accurate and more widely applicable object detection methods are in constant demand. RGB images can be severely affected in certain special environments; for example, motion or glare can degrade the image data, so that detection based on RGB image features alone often cannot reach the expected accuracy. It is therefore necessary to use information from other sensors, such as depth information, to improve object detection performance.
Since convolutional neural networks began to be used for object recognition and detection tasks, most high-accuracy object detection methods have been based on convolutional neural networks. These networks can learn generic feature representations of objects from large-scale labeled RGB image datasets. To use depth-map data to improve detection accuracy, a generic depth feature representation of objects must be extracted. However, there is no large-scale, fully annotated depth-image dataset covering enough categories, so a generic feature representation of depth information cannot be obtained directly.
On the other hand, existing fused-feature detection methods are limited in speed: they generally require a high-performance GPU and long computation times to produce results, and cannot meet the strict real-time requirements of industrial systems.
Summary of the invention
The object of the present invention is to provide an RGB-D image detection method based on cross-modal multi-scale feature fusion, designing a fusion model that balances real-time performance and high accuracy, using the multi-modal features of objects for accurate detection, and simultaneously completing the tasks of detecting, locating and accurately recognizing objects in images.
To achieve the above object, the technical solution of the present invention is as follows:
An object detection method based on cross-modal multi-scale feature fusion, comprising:
training a pre-training model with RGB images in a first dataset annotated with object categories, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with RGB images in a second dataset annotated with object categories and positions;
initializing a single-modality depth-map detection network model from the trained RGB detection network model;
training the depth-map detection network model with depth images in the second dataset that correspond to the RGB images and are annotated with object categories and positions;
initializing a fusion network model from the trained RGB detection network model and depth-map detection network model, and performing multi-scale feature fusion;
training the fusion network model with paired RGB images and depth images annotated with object categories and positions;
detecting objects in color-depth images with the trained fusion network model.
Further, initializing the single-modality depth-map detection network model from the trained RGB detection network model comprises:
copying the network parameters of the RGB detection network model as the network parameters of the depth-map detection network model.
Further, initializing the fusion network model from the trained RGB detection network model and depth-map detection network model and performing multi-scale feature fusion comprises:
copying the network parameters of the RGB detection network model and the depth-map detection network model as the weights of the two feature-extraction parts in the fusion network model;
combining, with multiple fusion layers, the multi-scale features extracted by the two feature-extraction parts.
Further, the fusion network model uses MultiBox Loss as its loss function during training.
Further, when training the RGB detection network model, training the depth-map detection network model and training the fusion network model, the method further comprises:
applying data augmentation to the input data.
Further, when training the fusion network model, the method further comprises:
freezing the weights of the feature-extraction parts.
The object detection method based on cross-modal multi-scale feature fusion proposed by the present invention improves detection performance by combining RGB and depth features: the depth-map detection network model is initialized with the network parameters of the RGB detection network model; the obtained RGB detection network model and depth-map detection network model are then used to initialize the feature-extraction weights of the fusion network model, and final training yields a multi-scale fusion network model that fuses cross-modal features. The invention does not depend on a large annotated depth-image dataset, can fuse depth-image and RGB-image features across modalities, and completes object recognition, localization and detection in real time, efficiently and accurately. The fusion network model designed by the present invention needs only one consumer-grade graphics card and a CPU as hardware, for example a GTX 1080 graphics card and an Intel 7700K CPU, to reach real-time detection speed.
Description of the drawings
Fig. 1 is a flow chart of the object detection method based on cross-modal multi-scale feature fusion of the present invention;
Fig. 2 is a structural schematic diagram of the fusion network model.
Detailed description of the embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments; the following embodiments do not constitute a limitation of the invention.
The general idea of the present invention is to fuse depth-image and RGB-image features across modalities without depending on a large annotated depth-image dataset, and to complete object recognition, localization and detection in real time, efficiently and accurately. Training yields a fusion model that accepts cross-modal RGB and depth image inputs and obtains the positions and category information of multiple objects in real time. The solution requires cross-modal feature transfer: the depth-map network is initialized from the RGB model parameters and trained to obtain the depth-map model; the obtained RGB model and depth-map model are then used to initialize, respectively, the feature-extraction parts of the fusion network proposed by the present invention, and final training yields a multi-scale network model that fuses cross-modal features. The multi-scale cross-modal fusion network with high real-time performance and detection accuracy designed by the present invention is the key element of the solution.
As shown in Fig. 1, an object detection method based on cross-modal multi-scale feature fusion comprises:
training a pre-training model with RGB images in a first dataset annotated with object categories, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with RGB images in a second dataset annotated with object categories and positions;
initializing a single-modality depth-map detection network model from the trained RGB detection network model;
training the depth-map detection network model with depth images in the second dataset that correspond to the RGB images and are annotated with object categories and positions;
initializing a fusion network model from the trained RGB detection network model and depth-map detection network model, and performing multi-scale feature fusion;
training the fusion network model with paired RGB images and depth images annotated with object categories and positions;
detecting objects in color-depth images with the trained fusion network model.
The steps of the above method are described in detail below. In this embodiment, model training comprises three stages: the first stage trains the RGB detection network model; the second stage trains the depth-map detection network model with the training method of cross-modal supervision transfer; the third stage trains the fusion network model based on the trained RGB detection network model and depth-map detection network model.
Since convolutional neural networks began to be used for object recognition and detection tasks, most high-accuracy object detection methods have been based on convolutional neural networks. These networks can learn generic feature representations of objects from large-scale labeled RGB image datasets. The present technical solution improves object detection accuracy by using depth-map data, which requires extracting a generic depth feature representation of objects. However, there is no large-scale, fully annotated depth-image dataset covering enough categories, so a generic feature representation of depth information cannot be obtained directly. This embodiment therefore first trains a single-modality RGB detection network model and then trains the depth-map detection network model with the training method of cross-modal supervision transfer, so that the depth-map detection network model can be obtained with only a small-scale dataset.
First stage: first, a pre-training model is trained with RGB images in the first dataset annotated with object categories, and the single-modality RGB detection network model is initialized from the pre-training model; then the RGB detection network model is trained with RGB images in the second dataset annotated with object categories and positions.
Among current pre-training models, those trained on large-scale annotated RGB image datasets are relatively mature, for example the VGG16 model pre-trained on the ImageNet dataset, and can be used directly by this technical solution. The pre-training model is trained with a large-scale annotated RGB image dataset (also called the first dataset), in which the object categories in the RGB images have typically been annotated.
After the pre-training model is chosen, the RGB detection network model is initialized from it, that is, the parameters of the pre-training model's neural network are copied into the RGB detection network model. The RGB detection network model is then fine-tuned with the RGB images of a small-scale dataset (also called the second dataset). The categories and positions of the objects to be detected (i.e. the objects in the RGB images) in the small-scale dataset are annotated in advance, and the small-scale dataset also contains depth images corresponding to the RGB images, in which the categories and positions of the objects to be detected are likewise annotated.
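For illustration, this parameter-copying step can be sketched as follows, assuming a recent torchvision and a hypothetical detector whose feature-extraction part is exposed as a backbone attribute mirroring the convolutional layers of VGG16; the sketch is one possible realization, not the only one.

```python
# Minimal sketch: copy ImageNet-pretrained VGG16 convolution/pooling weights into the
# detector's feature-extraction part ("backbone" is an assumed attribute name); the
# detection-specific layers keep their own initialization and are learned during
# fine-tuning on the second dataset.
import torch
import torchvision


def init_rgb_detector_from_pretrained(detector: torch.nn.Module) -> torch.nn.Module:
    vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
    src = vgg.features.state_dict()               # pretrained convolution/pooling layers
    dst = detector.backbone.state_dict()          # detector's feature-extraction part
    # Copy every parameter whose name and shape match.
    dst.update({k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape})
    detector.backbone.load_state_dict(dst)
    return detector
```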
Second stage: based on the trained RGB detection network model, this embodiment initializes the single-modality depth-map detection network model, and then trains the depth-map detection network model with the depth images in the second dataset that correspond to the RGB images and are annotated with object categories and positions.
The RGB detection network model and the depth-map detection network model in this embodiment are both single-modality models and are both expressed as layered neural network models, where the RGB image modality is expressed as

Φ = {φ_1, φ_2, …, φ_#l},

where φ_i is the i-th layer feature representation learned from the large-scale labeled dataset, #l is the number of layers of the neural network, and the parameters of the neural network are denoted W_Φ^[1,#l].

The depth image modality is expressed as

Ψ = {ψ_1, ψ_2, …, ψ_#u},

where ψ_i is the i-th layer feature representation, #u is the number of layers of the neural network, and W_Ψ^[1,#u] likewise denotes the parameters of this layered network.
Based on the trained RGB detection network model, this embodiment initializes the single-modality depth-map detection network model by copying the network parameters W_Φ^[1,#l] of the RGB detection network model as the network parameters of the depth-map detection network model, which is then fine-tuned with the depth-image portion of the small-scale dataset; the network parameters of the trained depth-map detection network model are W_Ψ^[1,#u]. At this point the depth-map detection network model can recognize object categories and positions in depth maps.
The method of cross-modal supervision transfer (Supervision Transfer) is employed here: the neural network of the depth-information modality is initialized with the neural network representation of the RGB modality, and this cross-modal transfer is verified at the convolution and pooling layers. Suppose there exists a large dataset of paired but unlabeled images from both modalities, denoted P_{l,u}. By matching the image feature representations φ_#l and ψ_#u on the paired RGB image I_l and depth map I_u of the two modalities (φ denotes the representation of the RGB network part; I_l and I_u are respectively the RGB and depth images in the dataset), a rich representation of the depth map can be learned. A transform function t is used to make the dimensions of the two representations identical, and a loss function f defined on the above networks is introduced (f can take an arbitrary functional form); the parameters of the depth-map network can then be obtained by training:

W_Ψ^[1,#u] = argmin Σ_{(I_l, I_u) ∈ P_{l,u}} f( φ_#l(I_l), t(ψ_#u(I_u)) ).
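A minimal sketch of this supervision-transfer training, assuming PyTorch, hypothetical rgb_net and paired_loader objects, and concrete choices of f as an L2 loss and t as a 1x1 convolution (both are allowed to be arbitrary in the text):

```python
# Minimal sketch of cross-modal supervision transfer: the depth network is initialized
# from the RGB network's parameters and its features are trained to match the RGB
# features on paired, unlabeled RGB-D data P_{l,u}.
import copy
import torch
import torch.nn as nn


def supervision_transfer(rgb_net, paired_loader, depth_channels, rgb_channels, epochs=1):
    depth_net = copy.deepcopy(rgb_net)           # initialize depth net from RGB parameters
    t = nn.Conv2d(depth_channels, rgb_channels, kernel_size=1)   # dimension-matching t
    f = nn.MSELoss()                                             # loss function f
    rgb_net.eval()                                               # frozen RGB "teacher"
    opt = torch.optim.SGD(list(depth_net.parameters()) + list(t.parameters()),
                          lr=1e-3, momentum=0.9)
    for _ in range(epochs):
        for rgb_img, depth_img in paired_loader:   # paired RGB-D images, no labels needed
            with torch.no_grad():
                phi = rgb_net(rgb_img)             # RGB feature phi_#l(I_l)
            psi = depth_net(depth_img)             # depth feature psi_#u(I_u)
            loss = f(t(psi), phi)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return depth_net
```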
In addition to the conventional convolution, pooling and fully connected layers of the single-modality detection networks of the present invention, the network here also contains element_wise-sum layers, permute layers, flatten layers and priorbox layers. First, the element_wise-sum layer sums feature maps, which can be regarded as element-wise addition of multi-dimensional matrices. Second, the permute layer changes the order of the data dimensions, which is equivalent to multiplying by a spatially permuted identity matrix (a permutation matrix). Then, the flatten layer collapses a multi-dimensional matrix into one dimension. Finally, the priorbox layer is used to handle bounding boxes and does not affect the image features. All of these layers can be unified into a single transform function s, and the loss function of the network can then be described as:

W_Ψ^[1,#u] = argmin Σ_{(I_l, I_u) ∈ P_{l,u}} f( s(φ_#l(I_l)), s(t(ψ_#u(I_u))) ).
Based on this loss function, the cross-modal model transfer between the single-modality networks can be realized.
It should be noted that the small-scale dataset contains RGB images annotated with object categories and positions, together with corresponding depth maps annotated with object categories and positions. When the RGB detection network model is fine-tuned with the small-scale dataset, the RGB images in the small-scale dataset are used; when the depth-map detection network model is fine-tuned with the small-scale dataset, the depth maps in the small-scale dataset are used, and the depth maps are represented in HHA format (HHA encoding has three channels: horizontal disparity, height above ground and angle with the direction of gravity), so that their dimensions are consistent with the RGB images.
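A minimal sketch of assembling such an HHA image, assuming the three channels have already been computed from the raw depth map by the usual geometric procedure; here they are only normalized and stacked so that the result has the same 3-channel layout as an RGB image:

```python
# Minimal sketch: stack precomputed HHA channels into an RGB-like 3-channel uint8 image.
import numpy as np


def to_hha_image(horizontal_disparity, height_above_ground, angle_with_gravity):
    channels = []
    for c in (horizontal_disparity, height_above_ground, angle_with_gravity):
        c = c.astype(np.float32)
        c = (c - c.min()) / (c.max() - c.min() + 1e-6)   # normalize each channel to [0, 1]
        channels.append((255.0 * c).astype(np.uint8))
    return np.stack(channels, axis=-1)                    # H x W x 3, like an RGB image
```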
When the pre-training model of this embodiment is the VGG16 model, the RGB detection network model and the depth-map detection network model have identical network structures and can take the form of SSD-VGG16, where SSD-VGG16 is an SSD network using VGG16 as its feature-extraction part. These networks are not fixed, however, and other network forms can be adopted; for example, ResNet can also be used as the pre-training model, in which case the corresponding RGB detection network model and depth-map detection network model can take the form of SSD-ResNet.
Third stage: the fusion network model is initialized from the trained RGB detection network model and depth-map detection network model, and multi-scale feature fusion is performed; the fusion network model is then trained with paired RGB images and depth images annotated with object categories and positions.
In this embodiment, the network parameters of the trained RGB detection network model and depth-map detection network model are used to initialize the weights of the feature-extraction parts in the fusion network model; that is, the network parameters of the RGB detection network model and the depth-map detection network model are copied as the weights of the feature-extraction parts in the fusion network model. The input data are then paired RGB pictures and depth pictures, and fine-tuning training is performed; the second dataset may be used for this training. As shown in Fig. 2, taking SSD-VGG16 as the feature-extraction part as an example, the fusion network copies all the parameters of the SSD-VGG16 modules obtained from the RGB detection network and the depth-map detection network as the weights of its two feature-extraction parts (the RGB feature-extraction part and the depth-map feature-extraction part). In this way, after an RGB image and a depth map are input into the fusion network and pass through the two feature-extraction parts, multi-level RGB generic features and depth-map generic features (multi-scale features) are obtained, respectively.
After the two feature-extraction parts (the RGB feature-extraction part and the depth-map feature-extraction part) have extracted the multi-scale features of the two modalities, the fusion network model merges the features from different scales with a multi-layer structure (multiple fusion layers combine the multi-scale features extracted by the two feature-extraction parts); these features come from the convolution and pooling layers with more significant semantic content, corresponding to the higher layers of the network structure. There are two classes of feasible merging points in the fusion network architecture: one class is the lower network layers at the front of the feature-extraction part; the other class is at the higher-layer positions after the feature-extraction layers. Lower layers of the network contain more spatial features, while higher layers contain more semantic features. If two objects to be detected are the same object, their high-level generic feature representations are close to each other, but their low-level representations may be very different. The present invention therefore chooses high-level fusion rather than low-level fusion, and experiments also confirm that high-level fusion gives better results. In its fusion network architecture, the present invention does not fuse the features of only a single layer, but selects the features of several specific network layers for fusion, including the multi-scale features of several semantically significant convolution and pooling layers. As shown in Fig. 2, continuing with SSD-VGG16 as the feature-extraction part, fusion layers can be used to fuse the features of the conv4-3, fc7, conv6_2, conv7_2, conv8_2 and conv9_2 layers to obtain the high-level fusion features of the network. It should be noted that Fig. 2 only shows conv4-3 and fc7; the schematics for the fusion of the other layers are omitted and not repeated here.
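A minimal sketch of this dual-branch multi-scale fusion, assuming PyTorch and hypothetical rgb_branch and depth_branch modules that each return the list of feature maps taken from the fused layers (for example conv4_3, fc7, conv6_2, conv7_2, conv8_2 and conv9_2) in the same order and with matching shapes:

```python
# Minimal sketch of the dual-branch element-wise-sum fusion; both branches are assumed
# to be initialized from the two trained single-modality detectors.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, rgb_branch: nn.Module, depth_branch: nn.Module):
        super().__init__()
        self.rgb_branch = rgb_branch        # weights copied from the trained RGB detector
        self.depth_branch = depth_branch    # weights copied from the trained depth detector

    def forward(self, rgb_img: torch.Tensor, depth_img: torch.Tensor):
        rgb_feats = self.rgb_branch(rgb_img)        # multi-scale RGB generic features
        depth_feats = self.depth_branch(depth_img)  # multi-scale depth generic features
        # element_wise-sum fusion at each scale, one fusion layer per fused feature map
        return [r + d for r, d in zip(rgb_feats, depth_feats)]
```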
Experimental results show that replacing the element_wise-sum layers with concatenate layers for feature fusion gives worse results; the present invention therefore fuses the RGB and depth-map cross-modal features with a corresponding number of element_wise-sum merging layers. The category and position of objects can be predicted from these features: after the fused features are obtained, regression prediction is performed by convolutional layers containing two 3*3 convolution kernels to obtain multiple results, where the first convolution kernel completes the position prediction (1*4 dimensions) and the other convolution kernel completes the object-category prediction (1*(number of categories to be predicted) dimensions). Finally, the multiple predictions obtained are processed by the NMS (Non-Maximum Suppression) method to obtain the final result. The fusion network uses MultiBox Loss as its loss function during training.
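The prediction step on each fused scale can be sketched as follows, assuming PyTorch and, as in SSD, a fixed number of default boxes (priors) per feature-map cell; the boxes decoded over all scales would then be filtered with non-maximum suppression, for example torchvision.ops.nms(boxes, scores, iou_threshold).

```python
# Minimal sketch of the per-scale prediction head; one such head is attached to each
# fused feature map.
import torch
import torch.nn as nn


class FusionPredictionHead(nn.Module):
    def __init__(self, in_channels: int, num_priors: int, num_classes: int):
        super().__init__()
        # One 3x3 convolution predicts the box offsets (4 values per prior),
        # the other predicts the class scores (num_classes values per prior).
        self.loc = nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)
        self.conf = nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)

    def forward(self, fused_feat: torch.Tensor):
        return self.loc(fused_feat), self.conf(fused_feat)
```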
In order to make better use of the input data, this embodiment applies data augmentation to the input data, for example rotation, mirroring and cropping, so that the spatial diversity of the pictures is represented and the trained model accordingly has better robustness.
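A minimal sketch of such paired augmentation, assuming NumPy H x W x C arrays and showing only mirroring and cropping; the same random parameters are applied to the RGB image and its depth map, consistent with the requirement below that both modalities be augmented identically, and the box annotations would have to be transformed with the same parameters.

```python
# Minimal sketch: identical random mirror and crop for an RGB image and its depth map.
import random
import numpy as np


def augment_pair(rgb: np.ndarray, depth: np.ndarray, crop_ratio: float = 0.9):
    if random.random() < 0.5:                           # horizontal mirror
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    h, w = rgb.shape[:2]
    ch, cw = int(h * crop_ratio), int(w * crop_ratio)
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    rgb = rgb[top:top + ch, left:left + cw]             # identical random crop
    depth = depth[top:top + ch, left:left + cw]         # applied to both modalities
    return rgb.copy(), depth.copy()
```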
When the fusion network model is trained, the weights of the feature-extraction parts are frozen and only the fusion part is trained; that is, the learning rate of the feature-extraction parts is set below a chosen threshold (for example 0, or a very low value such as 10e-8), so that the training process can concentrate on the fusion part and does not excessively change the weights of the RGB and depth feature-extraction parts. By freezing the module weights copied from the RGB and depth models and training only the fusion part, the fine-tuning of the fusion network is completed. The number of training iterations is generally 40,000 or more, and the base learning rate is set to about 0.001. The fusion part here means the parts of the fusion network model other than the feature-extraction parts.
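This freezing step can be sketched as follows, assuming PyTorch and the hypothetical rgb_branch and depth_branch attribute names used in the fusion sketch above; only the parameters of the fusion part remain trainable.

```python
# Minimal sketch: freeze the copied feature-extraction weights and optimize only the
# remaining (fusion) parameters.
import torch


def make_fusion_optimizer(model, base_lr: float = 1e-3):
    for p in list(model.rgb_branch.parameters()) + list(model.depth_branch.parameters()):
        p.requires_grad = False                 # freeze the copied extractor weights
    fusion_params = [p for p in model.parameters() if p.requires_grad]
    # Training then runs for on the order of 40,000 iterations at this base learning rate.
    return torch.optim.SGD(fusion_params, lr=base_lr, momentum=0.9)
```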
The technical solution of the present invention improves detection performance by combining RGB and depth features. Before feature merging, the RGB and depth images are each converted into generic feature representations by the two feature-extraction parts in the fusion network. These two feature-extraction parts are composed of multiple convolution and pooling layers and are the RGB feature-extraction part and the depth-map feature-extraction part, respectively; their weights are initialized and trained from the two single-modality models described above, the RGB detection network model and the depth-map detection model. These two single-modality networks are individually fine-tuned before the fusion network is trained and use the same architecture. This embodiment can thus obtain generic feature representations of depth images without a large depth-annotated dataset. In addition, during the training of the fusion network, the input data of the two different modalities must keep the same dimensions, and the data-augmentation operations for the two kinds of input images (RGB and depth images) must also be identical.
The above embodiments are only used to illustrate the technical solution of the present invention and not to limit it; without departing from the spirit and essence of the invention, those skilled in the art can make various corresponding changes and modifications in accordance with the present invention, and these corresponding changes and modifications shall all fall within the protection scope of the appended claims of the invention.

Claims (6)

1. An object detection method based on cross-modal multi-scale feature fusion, characterized in that the object detection method based on cross-modal multi-scale feature fusion comprises:
training a pre-training model with RGB images in a first dataset annotated with object categories, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with RGB images in a second dataset annotated with object categories and positions;
initializing a single-modality depth-map detection network model from the trained RGB detection network model;
training the depth-map detection network model with depth images in the second dataset that correspond to the RGB images and are annotated with object categories and positions;
initializing a fusion network model from the trained RGB detection network model and depth-map detection network model, and performing multi-scale feature fusion;
training the fusion network model with paired RGB images and depth images annotated with object categories and positions;
detecting objects in color-depth images with the trained fusion network model.
2. The object detection method based on cross-modal multi-scale feature fusion according to claim 1, characterized in that initializing the single-modality depth-map detection network model from the trained RGB detection network model comprises:
copying the network parameters of the RGB detection network model as the network parameters of the depth-map detection network model.
3. The object detection method based on cross-modal multi-scale feature fusion according to claim 1, characterized in that initializing the fusion network model from the trained RGB detection network model and depth-map detection network model and performing multi-scale feature fusion comprises:
copying the network parameters of the RGB detection network model and the depth-map detection network model as the weights of the two feature-extraction parts in the fusion network model;
combining, with multiple fusion layers, the multi-scale features extracted by the two feature-extraction parts.
4. The object detection method based on cross-modal multi-scale feature fusion according to claim 3, characterized in that the fusion network model uses MultiBox Loss as its loss function during training.
5. The object detection method based on cross-modal multi-scale feature fusion according to claim 1, characterized in that, when training the RGB detection network model, training the depth-map detection network model and training the fusion network model, the method further comprises:
applying data augmentation to the input data.
6. The object detection method based on cross-modal multi-scale feature fusion according to claim 3, characterized in that, when training the fusion network model, the method further comprises:
freezing the weights of the feature-extraction parts.
CN201810474925.8A 2018-05-17 2018-05-17 Object detection method based on cross-modal multi-scale feature fusion Active CN108734210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810474925.8A CN108734210B (en) 2018-05-17 2018-05-17 Object detection method based on cross-modal multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810474925.8A CN108734210B (en) 2018-05-17 2018-05-17 Object detection method based on cross-modal multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN108734210A true CN108734210A (en) 2018-11-02
CN108734210B CN108734210B (en) 2021-10-15

Family

ID=63938564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810474925.8A Active CN108734210B (en) 2018-05-17 2018-05-17 Object detection method based on cross-modal multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN108734210B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800079A (en) * 2012-08-03 2012-11-28 西安电子科技大学 Multimode image fusion method based on SCDPT transformation and amplitude-phase combination thereof
US20170014203A1 (en) * 2014-02-24 2017-01-19 Universite De Strasbourg (Etablissement Public National A Caractere Scientifiqu, Culturel Et Prof Automatic multimodal real-time tracking of a moving marker for image plane alignment inside a mri scanner
CN106981059A (en) * 2017-03-30 2017-07-25 中国矿业大学 Two-dimensional empirical mode decomposition image fusion method combining PCNN and compressed sensing
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 Image-text cross-modal sentiment classification method based on compact bilinear fusion
CN107463952A (en) * 2017-07-21 2017-12-12 清华大学 Object material classification method based on multi-modal fusion deep learning
CN107403201A (en) * 2017-08-11 2017-11-28 强深智能医疗科技(昆山)有限公司 Intelligent and automated delineation method for tumor radiotherapy target volumes and organs at risk

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070072A (en) * 2019-05-05 2019-07-30 厦门美图之家科技有限公司 A method of generating object detection model
CN110334708A (en) * 2019-07-03 2019-10-15 中国科学院自动化研究所 Automatic difference calibration method, system and device for cross-modal object detection
CN110334769A (en) * 2019-07-09 2019-10-15 北京华捷艾米科技有限公司 Target identification method and device
CN110852350A (en) * 2019-10-21 2020-02-28 北京航空航天大学 Pulmonary nodule benign and malignant classification method and system based on multi-scale migration learning
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN110956094A (en) * 2019-11-09 2020-04-03 北京工业大学 RGB-D multi-mode fusion personnel detection method based on asymmetric dual-stream network
CN110956094B (en) * 2019-11-09 2023-12-01 北京工业大学 RGB-D multi-mode fusion personnel detection method based on asymmetric double-flow network
CN113033258A (en) * 2019-12-24 2021-06-25 百度国际科技(深圳)有限公司 Image feature extraction method, device, equipment and storage medium
CN111242238A (en) * 2020-01-21 2020-06-05 北京交通大学 Method for acquiring RGB-D image saliency target
CN111242238B (en) * 2020-01-21 2023-12-26 北京交通大学 RGB-D image saliency target acquisition method
CN111540343A (en) * 2020-03-17 2020-08-14 北京捷通华声科技股份有限公司 Corpus identification method and apparatus
CN111540343B (en) * 2020-03-17 2021-02-05 北京捷通华声科技股份有限公司 Corpus identification method and apparatus
CN111723649A (en) * 2020-05-08 2020-09-29 天津大学 Short video event detection method based on semantic decomposition
CN112183619A (en) * 2020-09-27 2021-01-05 南京三眼精灵信息技术有限公司 Digital model fusion method and device
CN113077491A (en) * 2021-04-02 2021-07-06 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN114581838A (en) * 2022-04-26 2022-06-03 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and device and cloud equipment

Also Published As

Publication number Publication date
CN108734210B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN108734210A (en) A kind of method for checking object based on cross-module state multi-scale feature fusion
CN110276269B (en) Remote sensing image target detection method based on attention mechanism
Zalpour et al. A new approach for oil tank detection using deep learning features with control false alarm rate in high-resolution satellite imagery
CN106991382A (en) A kind of remote sensing scene classification method
CN107330357A (en) Vision SLAM closed loop detection methods based on deep neural network
Zhang et al. Fast and accurate land-cover classification on medium-resolution remote-sensing images using segmentation models
Gu et al. Hard pixel mining for depth privileged semantic segmentation
Wang et al. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection
CN108257154A (en) Polarimetric SAR Image change detecting method based on area information and CNN
Wang et al. Relation-attention networks for remote sensing scene classification
Liu et al. Subtler mixed attention network on fine-grained image classification
CN104298974A (en) Human body behavior recognition method based on depth video sequence
CN107767416A (en) The recognition methods of pedestrian's direction in a kind of low-resolution image
CN109522961A (en) A kind of semi-supervision image classification method based on dictionary deep learning
Brekke et al. Multimodal 3d object detection from simulated pretraining
CN105046269A (en) Multi-instance multi-label scene classification method based on multinuclear fusion
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN107967480A (en) A kind of notable object extraction method based on label semanteme
Fan Research and realization of video target detection system based on deep learning
CN109034213A (en) Hyperspectral image classification method and system based on joint entropy principle
Xu et al. Concrete crack segmentation based on convolution–deconvolution feature fusion with holistically nested networks
Xiang et al. Semi-supervised learning framework for crack segmentation based on contrastive learning and cross pseudo supervision
Du et al. Improved detection method for traffic signs in real scenes applied in intelligent and connected vehicles
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
Sen et al. A hierarchical approach to remote sensing scene classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant