CN108171217A - Three-dimensional object detection method based on a fusion network - Google Patents
Three-dimensional object detection method based on a fusion network
- Publication number: CN108171217A
- Application number: CN201810081797.0A
- Authority: CN (China)
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The present invention proposes a three-dimensional object detection method based on a fusion network. Its main components are a point-cloud network, a fusion network, and a dense-fusion prediction score function. The point-cloud network ingests a raw point cloud and learns a spatial encoding of each point together with an aggregated global point-cloud feature; these features are used for classification and semantic segmentation. The fusion network takes as input image features extracted by a convolutional neural network and the corresponding point-cloud features produced by the point fusion sub-network, combines them, and outputs a three-dimensional bounding box for the target object. The network can be trained directly with a supervised score function that predicts whether a point lies inside the object bounding box, while an unsupervised score function helps the network select the best prediction point. The method learns to combine image and depth information optimally and directly, avoids lossy input pre-processing such as quantization or projection, has general applicability, and greatly improves accuracy.
Description
Technical field
The present invention relates to the field of object detection, and in particular to a three-dimensional object detection method based on a fusion network.
Background
The detection and recognition of three-dimensional objects is an important research direction in computer vision, and a general point-fusion-based 3D object detection method can exploit image and 3D point-cloud information simultaneously. Point-cloud data is increasingly used in industry, especially in reverse engineering, and the main application scenario of 3D object detection and recognition on point clouds is identifying known three-dimensional object models in captured point-cloud data. Three-dimensional object detection can serve urban planning, construction, and management projects of all kinds, for detecting pedestrians, vehicles, shops, and the like in city scenes; it can assist drivers by automatically detecting pedestrians, vehicles, traffic signs, and road conditions; and it can even be applied to future fields such as autonomous driving and mobile robotics. Beyond this, there is broad demand in fields such as electronic communications surveillance, industrial inspection automation, military reconnaissance, and medical instruments. Existing 3D object detection techniques, however, are often applicable only to objects such as cars and not to other important targets such as pedestrians or cyclists; they therefore lack general applicability and are difficult to apply in practice.
The present invention proposes a three-dimensional object detection method based on a fusion network. The point-cloud network ingests a raw point cloud and learns a spatial encoding of each point together with an aggregated global point-cloud feature; these features are used for classification and semantic segmentation. The fusion network takes as input image features extracted by a convolutional neural network and the corresponding point-cloud features produced by the point fusion sub-network, combines them, and outputs a 3D bounding box for the target object. The network is trained directly with a supervised score function that predicts whether a point lies inside the object bounding box, while an unsupervised score function helps the network select the best prediction point. The invention learns to combine image and depth information optimally and directly, avoids lossy input pre-processing such as quantization or projection, has general applicability, and greatly improves accuracy.
Summary of the invention
To address the lack of generality in existing methods, the purpose of the present invention is to provide a three-dimensional object detection method based on a point fusion network. The point-cloud network ingests a raw point cloud and learns the spatial encoding of each point and an aggregated global point-cloud feature, used for classification and semantic segmentation. The fusion network takes as input image features extracted by a convolutional neural network and the corresponding point-cloud features produced by the fusion sub-network, combines them, and outputs a 3D bounding box for the target object. The network is trained directly with a supervised score function predicting whether a point lies inside the object bounding box, and an unsupervised score function helps the network select the best prediction point.
To solve the above problems, the present invention provides a three-dimensional object detection method based on a fusion network, whose main components are:
(1) a point-cloud network;
(2) a fusion network;
(3) a dense-fusion prediction score function.
The point fusion architecture has three main components: a point fusion network variant that extracts point-cloud features, a convolutional neural network (CNN) that extracts image appearance features, and a fusion network that combines the two features and outputs a 3D bounding box.
The point-cloud network first uses a symmetric function (max pooling) to process the unordered 3D point set invariantly. The model ingests a raw point cloud and learns a spatial encoding of each point and an aggregated global point-cloud feature; these features are used for classification and semantic segmentation.
The point fusion network can process raw points directly, without lossy voxelization or projection operations, and scales linearly with the number of input points. The original formulation, however, cannot be used for 3D regression, so two changes are required: removing batch normalization and normalizing the input.
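The permutation invariance provided by the max-pooling symmetric function can be sketched in a few lines of NumPy; the layer sizes and weights here are illustrative only, not the patent's actual architecture:

```python
import numpy as np

def pointnet_global_feature(points, weight):
    """Toy point-set encoder: a shared per-point linear map plus ReLU,
    followed by max pooling over the point axis. Max pooling is a symmetric
    function, so the output does not depend on the order of the points."""
    per_point = np.maximum(points @ weight, 0.0)   # shared per-point MLP layer
    return per_point.max(axis=0)                   # symmetric aggregation

rng = np.random.default_rng(0)
cloud = rng.normal(size=(128, 3))                  # unordered 3D point set
w = rng.normal(size=(3, 16))

f1 = pointnet_global_feature(cloud, w)
f2 = pointnet_global_feature(cloud[::-1], w)       # same points, reversed order
assert np.allclose(f1, f2)                         # permutation invariance
```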
Further, regarding the removal of batch normalization: in the original point fusion network, every fully connected layer is followed by a batch normalization layer, but batch normalization hampers 3D bounding-box estimation. Batch normalization is designed to eliminate the scale and bias of its input data, whereas for the 3D regression task the absolute values of point positions are helpful. The point fusion network variant therefore deletes all batch normalization layers.
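Why removing batch normalization matters for box regression can be seen in a toy example: normalizing over the batch erases exactly the absolute positional information the regressor needs. A minimal sketch (not the patent's implementation):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Plain batch normalization: remove the per-feature mean and scale.
    For point coordinates this discards absolute position, which is the
    signal a 3D box regressor relies on; hence the variant deletes it."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

a = np.random.default_rng(1).normal(size=(32, 3))  # a small point cloud
b = a + 5.0                                        # same cloud translated by 5 m
# After batch normalization the two clouds are indistinguishable:
assert np.allclose(batch_norm(a), batch_norm(b), atol=1e-4)
```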
Further, input normalization: the 3D point cloud corresponding to an image bounding box is obtained by finding all scene points that project into the box. The spatial location of these 3D points, however, is correlated with the 2D box location and height, which introduces bias. The point fusion network applies a spatial transformer network (STN) to canonicalize the input space, but an STN cannot fully correct these biases, so known camera geometry is used instead to compute a canonical rotation matrix R_c, which rotates the ray passing through the 2D box centre onto the z-axis of the camera frame.
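One possible computation of the canonical rotation R_c from known camera geometry is sketched below. The function name and the intrinsics matrix K are assumptions; the patent only states that the ray through the 2D box centre is rotated onto the camera z-axis.

```python
import numpy as np

def canonical_rotation(box_center_uv, K):
    """Hypothetical sketch of R_c: rotate the camera ray through the 2D box
    centre onto the camera z-axis, using camera intrinsics K."""
    u, v = box_center_uv
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray /= np.linalg.norm(ray)
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(ray, z)                 # rotation axis, |axis| = sin(theta)
    s, c = np.linalg.norm(axis), ray @ z    # sin and cos of the angle
    if s < 1e-9:                            # ray already on the z-axis
        return np.eye(3)
    k = axis / s
    Kx = np.array([[0, -k[2], k[1]],        # cross-product matrix of the axis
                   [k[2], 0, -k[0]],
                   [-k[1], k[0], 0]])
    return np.eye(3) + s * Kx + (1 - c) * (Kx @ Kx)   # Rodrigues formula

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
Rc = canonical_rotation((400, 300), K)
ray = np.linalg.inv(K) @ np.array([400, 300, 1.0]); ray /= np.linalg.norm(ray)
assert np.allclose(Rc @ ray, [0, 0, 1], atol=1e-6)    # ray mapped onto z-axis
assert np.allclose(Rc @ Rc.T, np.eye(3), atol=1e-6)   # proper rotation
```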
The fusion network takes as input the image features extracted by a standard CNN and the corresponding point-cloud features produced by the point fusion sub-network; it combines these features and outputs a 3D bounding box for the target object. The fusion networks comprise a global fusion network and a novel dense fusion network.
Further, the global fusion network processes the image and point-cloud features and directly regresses the 3D positions of the eight corners of the object bounding box. Its loss function is:

L = Σ_i smoothL1(x_i*, x_i) + L_stn

where smoothL1 denotes the smooth-L1 regression loss, x_i* are the corner locations of the annotated ground-truth box, x_i are the predicted corner locations, and L_stn is the spatial-transformer regularization loss introduced to enforce orthogonality of the learned spatial transformation matrix. A major drawback of the global fusion network, however, is that the variance of the regression target x_i* depends directly on the particular scene.
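The global fusion loss can be sketched as follows, assuming a smooth-L1 corner regression term and an orthogonality penalty as the spatial-transformer regularizer; the weight `lam` and the exact penalty form are assumptions:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Element-wise smooth-L1 (Huber-style) regression loss."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta)

def stn_orthogonality_penalty(A):
    """Assumed L_stn form: penalize deviation of A from orthogonality."""
    return np.sum((np.eye(A.shape[0]) - A @ A.T) ** 2)

def global_fusion_loss(corners_pred, corners_gt, A, lam=1.0):
    """Sketch of the global fusion loss: smooth-L1 over the 8 box corners
    plus lam * L_stn. `lam` is an assumed weighting."""
    return smooth_l1(corners_pred, corners_gt).sum() + lam * stn_orthogonality_penalty(A)

gt = np.zeros((8, 3))
pred = gt + 0.1                          # each of the 24 coordinates off by 0.1
loss = global_fusion_loss(pred, gt, np.eye(3))
assert np.isclose(loss, 24 * 0.5 * 0.1**2)   # orthogonal transform adds nothing
```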
Further, the dense fusion network. The main idea of the dense fusion model is to use the input 3D points as dense spatial anchors, rather than directly regressing the absolute corner locations of the 3D bounding box: for each input point, the network predicts the spatial offset from that point to the corner locations of a nearby bounding box. A point fusion network variant is used to produce point-wise features; for each point, these are concatenated with the global point-cloud feature and the image feature, producing an input tensor of size n × 3136. The dense fusion network processes this input with several layers and outputs a 3D bounding-box prediction together with a score for each point. At test time, the prediction with the highest score is selected as the final prediction. The loss function of the dense fusion network is:

L = (1/N) Σ_i smoothL1(x_offset,i*, x_offset,i) + L_score

where N is the number of input points, x_offset,i* is the offset between the corners of the annotated ground-truth box and the i-th input point, x_offset,i is the predicted offset, and L_score is the score-function loss.
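Dense-fusion inference (every input point acts as a spatial anchor predicting corner offsets relative to itself, with the top-scoring point kept at test time) can be sketched as:

```python
import numpy as np

def dense_fusion_predict(points, offsets, scores):
    """Sketch of dense-fusion inference: select the highest-scoring anchor
    point and add its predicted per-corner offsets to its own position."""
    best = int(np.argmax(scores))                   # top-score anchor point
    return points[best][None, :] + offsets[best]    # 8 corners = anchor + offsets

rng = np.random.default_rng(2)
pts = rng.normal(size=(5, 3))            # 5 input points (anchors)
offs = rng.normal(size=(5, 8, 3))        # per-point offsets to the 8 corners
scr = np.array([0.1, 0.9, 0.3, 0.2, 0.4])
box = dense_fusion_predict(pts, offs, scr)
assert box.shape == (8, 3)
assert np.allclose(box, pts[1] + offs[1])   # point 1 has the top score
```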
The dense-fusion prediction score function: the goal of the L_score term is to have the network learn spatial offsets on points close to the target box. Specifically, the score functions comprise:
(1) a supervised score function, which trains the network directly to predict whether a point lies inside the object bounding box;
(2) an unsupervised score function, which lets the network select the points leading to the best predictions.
Further, the supervised and unsupervised score functions. The supervised scoring loss trains the network to predict whether a point lies inside the target box. Denote the offset regression loss of point i by L_offset^i and the binary classification loss of point i by L_score^i; then:

L = (1/N) Σ_i (L_offset^i · m_i + L_score^i)

where m_i ∈ {0, 1} indicates whether the i-th point is inside the object bounding box, and L_score is a cross-entropy loss that penalizes incorrect predictions of whether a given point is in the box. As defined, the supervised score function has the network learn to predict spatial offsets for points inside the object bounding box. It may not yield the optimum, however, because a point inside the box is not necessarily the point with the best prediction.
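A minimal sketch of the supervised scoring loss, assuming the offset term is masked by m_i (only points inside the ground-truth box contribute to regression) and the score term is a per-point binary cross-entropy; the exact masking is an assumption consistent with the description:

```python
import numpy as np

def supervised_score_loss(offset_losses, probs, m, eps=1e-9):
    """Sketch: offset regression masked by m_i, plus binary cross-entropy
    training the network to predict m_i itself."""
    n = len(m)
    l_offset = np.sum(offset_losses * m) / n
    l_score = -np.mean(m * np.log(probs + eps)
                       + (1 - m) * np.log(1 - probs + eps))
    return l_offset + l_score

m = np.array([1.0, 0.0, 1.0])          # in-box indicators
off = np.array([0.3, 9.9, 0.1])        # huge loss on the outside point is masked
probs = np.array([1.0, 0.0, 1.0])      # perfect in/out classification
loss = supervised_score_loss(off, probs, m)
assert np.isclose(loss, (0.3 + 0.1) / 3, atol=1e-6)
```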
The goal of unsupervised scoring is to let the network learn directly which points are likely to give the best hypotheses: the network must be trained to assign high confidence to points likely to produce good predictions. The formulation contains two competing loss terms: high confidence c_i is encouraged for all points, while each point's corner prediction error is weighted by its confidence. Let L_offset^i denote the corner-offset regression loss of point i; the loss then becomes:

L = (1/N) Σ_i (L_offset^i · c_i − w · log c_i)

where w is the weighting factor between the two terms. The best w is found empirically, and w = 0.1 is used in all experiments.
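The two competing terms of the unsupervised scoring loss can be sketched directly from the formula, with w = 0.1 as in the text:

```python
import numpy as np

def unsupervised_score_loss(offset_losses, conf, w=0.1):
    """Sketch of the unsupervised scoring loss: each point's confidence c_i
    multiplies its offset loss (low confidence suppresses bad anchors),
    while -w * log(c_i) pushes confidences up; w balances the two terms."""
    return np.mean(offset_losses * conf - w * np.log(conf))

off = np.array([0.05, 2.0])     # point 0 predicts well, point 1 badly
good = np.array([0.99, 0.01])   # confident only in the good point
bad = np.array([0.01, 0.99])    # confident only in the bad point
# Trusting the well-predicting point yields the lower loss:
assert unsupervised_score_loss(off, good) < unsupervised_score_loss(off, bad)
```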
Description of the drawings
Fig. 1 is the system framework diagram of the three-dimensional object detection method based on a fusion network of the present invention.
Fig. 2 is the point fusion network architecture diagram of the method.
Fig. 3 illustrates the input normalization of the method.
Detailed description
It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with each other. The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is the system framework diagram of the three-dimensional object detection method based on a fusion network of the present invention. It mainly comprises the point-cloud network, the fusion network, and the dense-fusion prediction score function.
The point fusion architecture has three main components: a point fusion network variant that extracts point-cloud features, a convolutional neural network (CNN) that extracts image appearance features, and a fusion network that combines the two features and outputs a 3D bounding box.
The point fusion network first uses a symmetric function (max pooling) to process the unordered 3D point set invariantly. The model ingests a raw point cloud and learns a spatial encoding of each point and an aggregated global point-cloud feature, used for classification and semantic segmentation.
The point fusion network can process raw points directly, without lossy voxelization or projection operations, and scales linearly with the number of input points; the original formulation, however, cannot be used for 3D regression, so batch normalization is removed and the input is normalized.
Regarding the removal of batch normalization: in the original point fusion network, every fully connected layer is followed by a batch normalization layer, but batch normalization hampers 3D bounding-box estimation. Batch normalization is designed to eliminate the scale and bias of its input data, whereas for the 3D regression task the absolute values of point positions are helpful; the point fusion network variant therefore deletes all batch normalization layers.
The fusion network takes as input the image features extracted by a standard CNN and the corresponding point-cloud features produced by the point fusion sub-network; it combines these features and outputs a 3D bounding box for the target object. The fusion networks comprise a global fusion network and a novel dense fusion network.
The dense-fusion prediction score function: the goal of the L_score term is to have the network learn spatial offsets on points close to the target box. Specifically, the score functions comprise:
(1) a supervised score function, which trains the network directly to predict whether a point lies inside the object bounding box;
(2) an unsupervised score function, which lets the network select the points leading to the best predictions.
The supervised scoring loss trains the network to predict whether a point lies inside the target box. Denote the offset regression loss of point i by L_offset^i and the binary classification loss of point i by L_score^i; then:

L = (1/N) Σ_i (L_offset^i · m_i + L_score^i)

where m_i ∈ {0, 1} indicates whether the i-th point is inside the object bounding box, and L_score is a cross-entropy loss penalizing incorrect predictions of whether a given point is in the box. As defined, the supervised score function has the network learn to predict spatial offsets for points inside the object bounding box; it may not yield the optimum, however, because a point inside the box is not necessarily the point with the best prediction.
The goal of unsupervised scoring is to let the network learn directly which points are likely to give the best hypotheses: the network must be trained to assign high confidence to points likely to produce good predictions. The formulation contains two competing loss terms: high confidence c_i is encouraged for all points, while each point's corner prediction error is weighted by its confidence. Let L_offset^i denote the corner-offset regression loss of point i; the loss then becomes:

L = (1/N) Σ_i (L_offset^i · c_i − w · log c_i)

where w is the weighting factor between the two terms; the best w is found empirically, and w = 0.1 is used in all experiments.
Fig. 2 is the point fusion network architecture diagram of the method. Fig. 2(A) shows the point fusion network variant that extracts point-cloud features from the raw point cloud; Fig. 2(B) shows the convolutional neural network that extracts image appearance features; Fig. 2(C) shows the dense fusion network; Fig. 2(D) shows the global fusion network; and Fig. 2(E) shows the final prediction result.
The main idea of the dense fusion model is to use the input 3D points as dense spatial anchors rather than directly regressing the absolute corner locations of the 3D bounding box: for each input point, the network predicts the spatial offset from that point to the corner locations of a nearby bounding box. A point fusion network variant produces point-wise features; for each point, these are concatenated with the global point-cloud feature and the image feature, producing an input tensor of size n × 3136. The dense fusion network processes this input with several layers and outputs a 3D bounding-box prediction together with a score for each point; at test time, the prediction with the highest score is selected as the final prediction. The loss function of the dense fusion network is:

L = (1/N) Σ_i smoothL1(x_offset,i*, x_offset,i) + L_score

where N is the number of input points, x_offset,i* is the offset between the corners of the annotated ground-truth box and the i-th input point, x_offset,i is the predicted offset, and L_score is the score-function loss.
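The n × 3136 input tensor can be reproduced by concatenation, assuming per-point features of size 64, a global point-cloud feature of size 1024, and an image feature of size 2048; these individual sizes are assumptions consistent with 64 + 1024 + 2048 = 3136:

```python
import numpy as np

def dense_fusion_input(pointwise, global_feat, image_feat):
    """Sketch of the dense-fusion input: concatenate each point's own
    feature with the global point-cloud feature and the image feature,
    both tiled across the n points."""
    n = pointwise.shape[0]
    return np.concatenate(
        [pointwise,                        # (n, 64)  per-point features
         np.tile(global_feat, (n, 1)),     # (n, 1024) tiled global feature
         np.tile(image_feat, (n, 1))],     # (n, 2048) tiled image feature
        axis=1)

n = 400
x = dense_fusion_input(np.zeros((n, 64)), np.zeros(1024), np.zeros(2048))
assert x.shape == (400, 3136)              # the n x 3136 tensor from the text
```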
The global fusion network processes the image and point-cloud features and directly regresses the 3D positions of the eight corners of the object bounding box. Its loss function is:

L = Σ_i smoothL1(x_i*, x_i) + L_stn

where x_i* are the corner locations of the annotated ground-truth box, x_i are the predicted corner locations, and L_stn is the spatial-transformer regularization loss enforcing orthogonality of the learned spatial transformation matrix. A major drawback of the global fusion network, however, is that the variance of the regression target x_i* depends directly on the particular scene.
Fig. 3 illustrates the input normalization of the method. The 3D point cloud corresponding to an image bounding box is obtained by finding all scene points that project into the box. The spatial location of these 3D points, however, is correlated with the 2D box location and height, which introduces bias. The point fusion network applies a spatial transformer network (STN) to canonicalize the input space, but an STN cannot fully correct these biases, so known camera geometry is used instead to compute a canonical rotation matrix R_c, which rotates the ray passing through the 2D box centre onto the z-axis of the camera frame.
For those skilled in the art, the present invention is not limited to the details of the above exemplary embodiments, and it may be realized in other specific forms without departing from its spirit or scope. In addition, those skilled in the art may make various modifications and variations to the present invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. A three-dimensional object detection method based on a fusion network, characterized by mainly comprising a point-cloud network (1); a fusion network (2); and a dense-fusion prediction score function (3).
2. The point fusion according to claim 1, characterized in that point fusion has three main components: a point fusion network variant that extracts point-cloud features, a convolutional neural network (CNN) that extracts image appearance features, and a fusion network that combines the two features and outputs a 3D bounding box.
3. The point-cloud network (1) according to claim 1, characterized in that the point fusion network first uses a symmetric function (max pooling) to process the unordered 3D point set invariantly; the model ingests a raw point cloud and learns a spatial encoding of each point and an aggregated global point-cloud feature, used for classification and semantic segmentation; the point fusion network can process raw points directly, without lossy voxelization or projection operations, and scales linearly with the number of input points, but the original formulation cannot be used for 3D regression, so batch normalization is removed and the input is normalized.
4. The removal of batch normalization according to claim 3, characterized in that in the original point fusion network every fully connected layer is followed by a batch normalization layer, but batch normalization hampers 3D bounding-box estimation; batch normalization is designed to eliminate the scale and bias of its input data, whereas for the 3D regression task the absolute values of point positions are helpful; the point fusion network variant therefore deletes all batch normalization layers.
5. The input normalization according to claim 3, characterized in that the 3D point cloud corresponding to an image bounding box is obtained by finding all scene points that project into the box; the spatial location of these 3D points, however, is correlated with the 2D box location and height, which introduces bias; the point fusion network applies a spatial transformer network (STN) to canonicalize the input space, but an STN cannot fully correct these biases, so known camera geometry is used instead to compute a canonical rotation matrix R_c, which rotates the ray passing through the 2D box centre onto the z-axis of the camera frame.
6. The fusion network (2) according to claim 1, characterized in that the fusion network takes as input the image features extracted by a standard CNN and the corresponding point-cloud features produced by the point fusion sub-network; it combines these features and outputs a 3D bounding box for the target object; the fusion networks comprise a global fusion network and a novel dense fusion network.
7. The global fusion network according to claim 6, characterized in that the global fusion network processes the image and point-cloud features and directly regresses the 3D positions of the eight corners of the object bounding box; its loss function is:

L = Σ_i smoothL1(x_i*, x_i) + L_stn

where x_i* are the corner locations of the annotated ground-truth box, x_i are the predicted corner locations, and L_stn is the spatial-transformer regularization loss enforcing orthogonality of the learned spatial transformation matrix; a major drawback of the global fusion network is that the variance of the regression target x_i* depends directly on the particular scene.
8. The dense fusion network according to claim 6, characterized in that the main idea of the dense fusion model is to use the input 3D points as dense spatial anchors rather than directly regressing the absolute corner locations of the 3D bounding box; for each input point, the network predicts the spatial offset from that point to the corner locations of a nearby bounding box; a point fusion network variant produces point-wise features; for each point, these are concatenated with the global point-cloud feature and the image feature, producing an input tensor of size n × 3136; the dense fusion network processes this input with several layers and outputs a 3D bounding-box prediction together with a score for each point; at test time, the prediction with the highest score is selected as the final prediction; the loss function of the dense fusion network is:

L = (1/N) Σ_i smoothL1(x_offset,i*, x_offset,i) + L_score

where N is the number of input points, x_offset,i* is the offset between the corners of the annotated ground-truth box and the i-th input point, x_offset,i is the predicted offset, and L_score is the score-function loss.
9. The dense-fusion prediction score function (3) according to claim 1, characterized in that the goal of the L_score term is to have the network learn spatial offsets on points close to the target box; specifically, the score functions comprise: (1) a supervised score function, which trains the network directly to predict whether a point lies inside the object bounding box; and (2) an unsupervised score function, which lets the network select the points leading to the best predictions.
10. The supervised and unsupervised score functions according to claim 9, characterized in that the supervised scoring loss trains the network to predict whether a point lies inside the target box; denoting the offset regression loss of point i by L_offset^i and the binary classification loss of point i by L_score^i, then:

L = (1/N) Σ_i (L_offset^i · m_i + L_score^i)

where m_i ∈ {0, 1} indicates whether the i-th point is inside the object bounding box, and L_score is a cross-entropy loss penalizing incorrect predictions of whether a given point is in the box; as defined, the supervised score function has the network learn to predict spatial offsets for points inside the object bounding box, but it may not yield the optimum, because a point inside the box is not necessarily the point with the best prediction;
the goal of unsupervised scoring is to let the network learn directly which points are likely to give the best hypotheses; the network must be trained to assign high confidence to points likely to produce good predictions; the formulation contains two competing loss terms: high confidence c_i is encouraged for all points, while each point's corner prediction error is weighted by its confidence; letting L_offset^i denote the corner-offset regression loss of point i, the loss becomes:

L = (1/N) Σ_i (L_offset^i · c_i − w · log c_i)

where w is the weighting factor between the two terms; the best w is found empirically, and w = 0.1 is used in all experiments.
Priority application: CN201810081797.0A, filed 2018-01-29.
Publication: CN108171217A (A), published 2018-06-15.
Family ID: 62515690. Legal status: Withdrawn.
CN113269891A (en) * | 2020-02-14 | 2021-08-17 | 初速度(苏州)科技有限公司 | Method and device for determining three-dimensional bounding box of point cloud data |
CN113780257A (en) * | 2021-11-12 | 2021-12-10 | 紫东信息科技(苏州)有限公司 | Multi-mode fusion weak supervision vehicle target detection method and system |
CN114401666A (en) * | 2019-07-15 | 2022-04-26 | 普罗马顿控股有限责任公司 | Object detection and instance segmentation of 3D point clouds based on deep learning |
US11500063B2 (en) | 2018-11-08 | 2022-11-15 | Motional Ad Llc | Deep learning for object detection using pillars |
US11967873B2 (en) | 2019-09-23 | 2024-04-23 | Canoo Technologies Inc. | Fractional slot electric motors with coil elements having rectangular cross-sections |
- 2018-01-29: CN CN201810081797.0A patent/CN108171217A/en, not active (withdrawn)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107092859A (en) * | 2017-03-14 | 2017-08-25 | 佛山科学技术学院 | Depth feature extraction method for three-dimensional models |
CN107392122A (en) * | 2017-07-07 | 2017-11-24 | 西安电子科技大学 | Polarimetric SAR image target detection method based on multi-polarization features and an FCN-CRF fused network |
Non-Patent Citations (1)
Title |
---|
Xu, Danfei et al.: "PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation", https://arxiv.org/pdf/1711.10871v1.pdf *
Cited By (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109086683A (en) * | 2018-07-11 | 2018-12-25 | 清华大学 | Human hand posture regression method and system based on point cloud semantic enhancement |
CN109086683B (en) * | 2018-07-11 | 2020-09-15 | 清华大学 | Human hand posture regression method and system based on point cloud semantic enhancement |
CN109034077A (en) * | 2018-08-01 | 2018-12-18 | 湖南拓视觉信息技术有限公司 | Three-dimensional point cloud labeling method and device based on multi-scale feature learning |
CN109034077B (en) * | 2018-08-01 | 2021-06-25 | 湖南拓视觉信息技术有限公司 | Three-dimensional point cloud labeling method and device based on multi-scale feature learning |
CN109345510A (en) * | 2018-09-07 | 2019-02-15 | 百度在线网络技术(北京)有限公司 | Object detection method, device, equipment, storage medium and vehicle |
US11379699B2 (en) | 2018-09-07 | 2022-07-05 | Apollo Intelligent Driving Technology (Beijing) Co., Ltd. | Object detection method and apparatus for object detection |
CN112771850A (en) * | 2018-10-02 | 2021-05-07 | 华为技术有限公司 | Motion estimation using 3D assistance data |
US11688104B2 (en) | 2018-10-02 | 2023-06-27 | Huawei Technologies Co., Ltd. | Motion estimation using 3D auxiliary data |
CN112771850B (en) * | 2018-10-02 | 2022-05-24 | 华为技术有限公司 | Motion compensation method, system and storage medium using 3D auxiliary data |
US11500063B2 (en) | 2018-11-08 | 2022-11-15 | Motional Ad Llc | Deep learning for object detection using pillars |
CN109635843A (en) * | 2018-11-14 | 2019-04-16 | 浙江工业大学 | Three-dimensional object model classification method based on multi-view images |
CN109635843B (en) * | 2018-11-14 | 2021-06-18 | 浙江工业大学 | Three-dimensional object model classification method based on multi-view images |
CN109657559A (en) * | 2018-11-23 | 2019-04-19 | 盎锐(上海)信息科技有限公司 | Point cloud depth perception coding engine |
CN109657559B (en) * | 2018-11-23 | 2023-02-07 | 盎锐(上海)信息科技有限公司 | Point cloud depth perception coding engine device |
CN109635685B (en) * | 2018-11-29 | 2021-02-12 | 北京市商汤科技开发有限公司 | Target object 3D detection method, device, medium and equipment |
WO2020108311A1 (en) * | 2018-11-29 | 2020-06-04 | 北京市商汤科技开发有限公司 | 3d detection method and apparatus for target object, and medium and device |
CN109635685A (en) * | 2018-11-29 | 2019-04-16 | 北京市商汤科技开发有限公司 | Target object 3D detection method, device, medium and equipment |
CN109737974A (en) * | 2018-12-14 | 2019-05-10 | 中国科学院深圳先进技术研究院 | 3D navigation semantic map updating method, device and equipment |
CN109829476A (en) * | 2018-12-27 | 2019-05-31 | 青岛中科慧畅信息科技有限公司 | End-to-end three-dimensional object detection method based on YOLO |
CN109816714A (en) * | 2019-01-15 | 2019-05-28 | 西北大学 | Point cloud object category recognition method based on a three-dimensional convolutional neural network |
WO2020151109A1 (en) * | 2019-01-22 | 2020-07-30 | 中国科学院自动化研究所 | Three-dimensional target detection method and system based on point cloud weighted channel feature |
US11488308B2 (en) | 2019-01-22 | 2022-11-01 | Institute Of Automation, Chinese Academy Of Sciences | Three-dimensional object detection method and system based on weighted channel features of a point cloud |
CN109872288A (en) * | 2019-01-31 | 2019-06-11 | 深圳大学 | Network training method, device, terminal and storage medium for image denoising |
WO2020207166A1 (en) * | 2019-04-11 | 2020-10-15 | 腾讯科技(深圳)有限公司 | Object detection method and apparatus, electronic device, and storage medium |
US11915501B2 (en) | 2019-04-11 | 2024-02-27 | Tencent Technology (Shenzhen) Company Limited | Object detection method and apparatus, electronic device, and storage medium |
WO2020253121A1 (en) * | 2019-06-17 | 2020-12-24 | 商汤集团有限公司 | Target detection method and apparatus, intelligent driving method and device, and storage medium |
CN110276317A (en) * | 2019-06-26 | 2019-09-24 | Oppo广东移动通信有限公司 | Object size detection method, object size detection device and mobile terminal |
CN110276317B (en) * | 2019-06-26 | 2022-02-22 | Oppo广东移动通信有限公司 | Object size detection method, object size detection device and mobile terminal |
CN114401666A (en) * | 2019-07-15 | 2022-04-26 | 普罗马顿控股有限责任公司 | Object detection and instance segmentation of 3D point clouds based on deep learning |
CN110689008A (en) * | 2019-09-17 | 2020-01-14 | 大连理工大学 | Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction |
US11967873B2 (en) | 2019-09-23 | 2024-04-23 | Canoo Technologies Inc. | Fractional slot electric motors with coil elements having rectangular cross-sections |
CN111222395A (en) * | 2019-10-21 | 2020-06-02 | 杭州飞步科技有限公司 | Target detection method and device and electronic equipment |
CN111222395B (en) * | 2019-10-21 | 2023-05-23 | 杭州飞步科技有限公司 | Target detection method and device and electronic equipment |
US11634155B2 (en) | 2019-11-14 | 2023-04-25 | Motional Ad Llc | Sequential fusion for 3D object detection |
GB2591171A (en) * | 2019-11-14 | 2021-07-21 | Motional Ad Llc | Sequential fusion for 3D object detection |
GB2591171B (en) * | 2019-11-14 | 2023-09-13 | Motional Ad Llc | Sequential fusion for 3D object detection |
US11214281B2 (en) | 2019-11-14 | 2022-01-04 | Motional Ad Llc | Sequential fusion for 3D object detection |
CN112904370A (en) * | 2019-11-15 | 2021-06-04 | 辉达公司 | Multi-view deep neural network for lidar sensing |
CN111062423A (en) * | 2019-11-29 | 2020-04-24 | 中国矿业大学 | Point cloud classification method based on a point cloud graph neural network with adaptive feature fusion |
CN111062423B (en) * | 2019-11-29 | 2022-04-26 | 中国矿业大学 | Point cloud classification method based on a point cloud graph neural network with adaptive feature fusion |
CN111144304A (en) * | 2019-12-26 | 2020-05-12 | 上海眼控科技股份有限公司 | Vehicle target detection model generation method, vehicle target detection method and device |
CN113269891A (en) * | 2020-02-14 | 2021-08-17 | 初速度(苏州)科技有限公司 | Method and device for determining three-dimensional bounding box of point cloud data |
CN113269891B (en) * | 2020-02-14 | 2022-06-24 | 魔门塔(苏州)科技有限公司 | Method and device for determining three-dimensional bounding box of point cloud data |
CN111401264A (en) * | 2020-03-19 | 2020-07-10 | 上海眼控科技股份有限公司 | Vehicle target detection method and device, computer equipment and storage medium |
CN111583268A (en) * | 2020-05-19 | 2020-08-25 | 北京数字绿土科技有限公司 | Point cloud virtual selection and cutting method, device and equipment |
CN111612059B (en) * | 2020-05-19 | 2022-10-21 | 上海大学 | Construction method of a multi-plane encoded point cloud feature deep learning model based on PointPillars |
CN111612059A (en) * | 2020-05-19 | 2020-09-01 | 上海大学 | Construction method of a multi-plane encoded point cloud feature deep learning model based on PointPillars |
CN111583268B (en) * | 2020-05-19 | 2021-04-23 | 北京数字绿土科技有限公司 | Point cloud virtual selection and cutting method, device and equipment |
CN112053374A (en) * | 2020-08-12 | 2020-12-08 | 哈尔滨工程大学 | 3D target bounding box estimation system based on GIoU |
CN112200303B (en) * | 2020-09-28 | 2022-10-21 | 杭州飞步科技有限公司 | Laser radar point cloud 3D target detection method based on context-dependent encoder |
CN112200303A (en) * | 2020-09-28 | 2021-01-08 | 杭州飞步科技有限公司 | Laser radar point cloud 3D target detection method based on context-dependent encoder |
CN112598635B (en) * | 2020-12-18 | 2024-03-12 | 武汉大学 | Point cloud 3D target detection method based on symmetric point generation |
CN112598635A (en) * | 2020-12-18 | 2021-04-02 | 武汉大学 | Point cloud 3D target detection method based on symmetric point generation |
CN113052109A (en) * | 2021-04-01 | 2021-06-29 | 西安建筑科技大学 | 3D target detection system and 3D target detection method thereof |
CN113239726B (en) * | 2021-04-06 | 2022-11-08 | 北京航空航天大学杭州创新研究院 | Target detection method and device based on coloring point cloud and electronic equipment |
CN113239726A (en) * | 2021-04-06 | 2021-08-10 | 北京航空航天大学杭州创新研究院 | Target detection method and device based on coloring point cloud and electronic equipment |
CN113128527A (en) * | 2021-06-21 | 2021-07-16 | 中国人民解放军国防科技大学 | Image scene classification method based on converter model and convolutional neural network |
CN113780257A (en) * | 2021-11-12 | 2021-12-10 | 紫东信息科技(苏州)有限公司 | Multi-modal fusion weakly-supervised vehicle target detection method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108171217A (en) | Three-dimensional object detection method based on a converged network | |
Pandey et al. | Ford campus vision and lidar data set | |
Benenson et al. | Stixels estimation without depth map computation | |
Paz et al. | Probabilistic semantic mapping for urban autonomous driving applications | |
US9158992B2 (en) | Acceleration of linear classifiers | |
US20230072731A1 (en) | System and method for panoptic segmentation of point clouds | |
WO2022141858A1 (en) | Pedestrian detection method and apparatus, electronic device, and storage medium | |
Gao et al. | Fine-grained off-road semantic segmentation and mapping via contrastive learning | |
Taran et al. | Impact of ground truth annotation quality on performance of semantic image segmentation of traffic conditions | |
CN113807399A (en) | Neural network training method, neural network detection method and neural network detection device | |
Schubert et al. | Visual place recognition: A tutorial | |
CN114088099A (en) | Semantic relocation method and device based on known map, electronic equipment and medium | |
Yu et al. | Accurate and robust visual localization system in large-scale appearance-changing environments | |
CN112070071A (en) | Method and device for labeling objects in video, computer equipment and storage medium | |
Vashisht et al. | Effective implementation of machine learning algorithms using 3D colour texture feature for traffic sign detection for smart cities | |
CN114550091A (en) | Unsupervised pedestrian re-identification method and unsupervised pedestrian re-identification device based on local features | |
Singh et al. | Efficient deep learning-based semantic mapping approach using monocular vision for resource-limited mobile robots | |
Börcs et al. | A model-based approach for fast vehicle detection in continuously streamed urban LIDAR point clouds | |
Nath et al. | Deep learning models for content-based retrieval of construction visual data | |
Yan et al. | Lane information perception network for HD maps | |
Yudin et al. | Hpointloc: Point-based indoor place recognition using synthetic rgb-d images | |
Guo et al. | Optimal path planning in field based on traversability prediction for mobile robot | |
CN107193965B (en) | BoVW algorithm-based rapid indoor positioning method | |
Li et al. | Instance-aware semantic segmentation of road furniture in mobile laser scanning data | |
Changalasetty et al. | Classification of moving vehicles using k-means clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20180615 |