CN107944443A - Object affordance detection method based on end-to-end deep learning - Google Patents

Object affordance detection method based on end-to-end deep learning Download PDF

Info

Publication number
CN107944443A
CN107944443A CN201711139653.8A
Authority
CN
China
Prior art keywords
roi
affordance
detection
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201711139653.8A
Other languages
Chinese (zh)
Inventor
夏春秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vision Technology Co Ltd
Original Assignee
Shenzhen Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vision Technology Co Ltd filed Critical Shenzhen Vision Technology Co Ltd
Priority to CN201711139653.8A priority Critical patent/CN107944443A/en
Publication of CN107944443A publication Critical patent/CN107944443A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering

Abstract

The present invention proposes an object affordance detection method based on end-to-end deep learning, which aims to simultaneously find the position, class, and affordances of objects in an image. A region-of-interest alignment layer (RoIAlign) correctly computes the features of each region of interest from the image feature map; a sequence of deconvolutional layers upsamples the RoI feature map to a high resolution to obtain the affordance map; and a robust resizing strategy for the training masks supervises the affordances. Object detection localizes the objects, affordance detection assigns each pixel inside an object its affordance label, and a multi-task loss trains bounding-box classification, localization, and affordance mapping; training and inference finally yield the affordance labels. The invention uses end-to-end deep learning and a multi-task loss function to jointly optimize object detection and affordance detection. It requires no extra information, reduces the complexity of training and testing, effectively improves detection accuracy, and is suitable for real-time robotic applications.

Description

Object affordance detection method based on end-to-end deep learning
Technical field
The present invention relates to the field of computer vision, and more particularly to an object affordance detection method based on end-to-end deep learning.
Background art
In computer vision, jointly detecting and segmenting objects is increasingly popular. An object can be described by various visual attributes such as color and shape, or physical attributes such as weight, volume, and material; these attributes are useful for identifying objects or sorting them into categories. In many robotic applications, recognizing object affordances is essential, yet a robot may still need more information to complete a task: it must not only detect the affordances of an object but also locate and recognize the object itself. As an emerging problem, object affordance detection has practical value in many fields, such as scene understanding, video search, object detection, behavior analysis, 3D scene reconstruction, and human-computer interaction; in particular, it has broad application prospects in autonomous driving, human-computer interaction in smart homes, and medical diagnosis. Understanding object affordances differs from merely describing the visual or physical properties of an object; it also requires capturing affordance information and the way humans interact with the object. Understanding object affordances is therefore the key to enabling autonomous robots to interact with objects and assist people in everyday tasks.
However, detecting the affordances of objects is harder than the traditional semantic segmentation problem: two objects with different appearances may have the same affordance label, because affordance labels are abstractions based on how humans act on objects. In addition, detecting and generalizing to unseen objects in real time is also crucial for affordances. Existing common methods use two sequential deep neural networks, which is very time-consuming and unsuitable for real-time applications.
The present invention proposes an object affordance detection method based on end-to-end deep learning, which aims to simultaneously find the position, class, and affordances of objects in an image. A region-of-interest alignment layer (RoIAlign) correctly computes the features of each region of interest (RoI) from the image feature map; a sequence of deconvolutional layers upsamples the RoI feature map to a high-resolution affordance map; and a robust resizing strategy supervises the affordance masks during training. Object detection localizes the objects, affordance detection assigns each pixel inside an object its affordance label, and a multi-task loss jointly trains bounding-box classification, localization, and affordance mapping; training and inference finally yield the affordance labels. The invention uses end-to-end deep learning and a multi-task loss function to jointly optimize object detection and affordance detection. It requires no extra information, reduces the complexity of training and testing, effectively improves detection accuracy, and is suitable for real-time robotic applications.
Summary of the invention
To address the problem that existing methods are time-consuming and unsuitable for real-time use, the present invention adopts end-to-end deep learning and a multi-task loss function to jointly optimize object detection and affordance detection. It requires no extra information, reduces the complexity of training and testing, effectively improves detection accuracy, and is suitable for real-time robotic applications.
To solve the above problems, the present invention provides an object affordance detection method based on end-to-end deep learning, which mainly includes:
(1) problem formulation;
(2) the affordance network architecture;
(3) the multi-task loss;
(4) training and inference.
In the problem formulation, the framework aims to simultaneously find the position of an object, its class, and its affordances in the image. Following standard computer-vision practice, the position of an object is defined by a rectangle relative to the top-left corner of the image, the object class is defined for the rectangle, and each pixel inside the rectangle is encoded with its affordance; pixel regions of an object that share the same function are considered to have the same affordance. Ideally, all relevant objects in the image are detected, and each pixel in these objects is mapped to its most probable affordance label.
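As an illustrative sketch of this representation (the class and field names below are our own and not part of the patent), the per-object target can be written as a small Python data structure:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class AffordanceTarget:
        box: tuple        # (x1, y1, x2, y2), rectangle relative to the image's top-left corner
        obj_class: int    # object category assigned to the rectangle
        mask: np.ndarray  # H x W integer map: each pixel holds an affordance label
                          # (0 = background); pixels sharing a function share a label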
The affordance network architecture has three main components: 1) a region-of-interest alignment layer (RoIAlign) that correctly computes the features of each region of interest (RoI) from the image feature map; 2) a sequence of deconvolutional layers that upsamples the RoI feature map to a high resolution, obtaining a smooth, fine-grained affordance map; 3) a robust resizing strategy for the training masks that supervises the affordances.
Further, regarding the region-of-interest alignment layer (RoIAlign): a region proposal network (RPN) performs region-based object detection; it shares weights with the main convolutional backbone and outputs bounding boxes of different sizes. With a RoIPool layer, each RoI is pooled from the image feature map into a small feature map of fixed size (e.g., 7 × 7) using rounding, which misaligns the extracted features with the RoI. The RoIAlign layer instead properly aligns the extracted features with the RoI without any rounding operation: it uses bilinear interpolation to compute the values at regularly sampled positions inside each RoI bin and aggregates the results with a max operation, avoiding the misalignment between the RoI and the extracted features.
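A minimal sketch of this step using torchvision's roi_align operator, which performs the same rounding-free bilinear sampling (torchvision averages the sampled points per bin rather than taking their maximum, a small difference from the text above); the feature-map shape, the 1/16 scale, and the example box are illustrative assumptions:

    import torch
    from torchvision.ops import roi_align

    feat = torch.randn(1, 512, 38, 50)                 # assumed VGG16 conv feature map (1/16 of image size)
    rois = torch.tensor([[0., 64., 64., 256., 256.]])  # (batch index, x1, y1, x2, y2) in image coordinates
    pooled = roi_align(feat, rois, output_size=(7, 7),
                       spatial_scale=1.0 / 16,         # image-to-feature-map scale
                       sampling_ratio=2)               # regular bilinear sampling points per bin
    print(pooled.shape)                                # torch.Size([1, 512, 7, 7])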
Further, regarding the high-resolution map: segmentation approaches usually represent the object mask with a mask of small fixed size (e.g., 14 × 14 or 28 × 28), and the pixel values in each predicted RoI mask are binary, i.e., foreground or background. Because each object contains multiple affordance classes, such a binary mask does not work well for the affordance detection problem, so deconvolutional layers are used to obtain a high-resolution affordance mask. Formally, given an input feature map of size Si, a deconvolutional layer performs the operation opposite to a convolutional layer in order to build a larger output map of size So; the relation between Si and So is:
So = s·(Si − 1) + Sf − 2d  (1)
where Sf is the filter size, and s and d are the stride and padding parameters, respectively. In practice, the RoIAlign layer outputs a 7 × 7 feature map, which is upsampled to a higher resolution with three deconvolutional layers. The first deconvolutional layer has padding d = 1, stride s = 4, and kernel size Sf = 8, creating a 30 × 30 map; similarly, the second layer (d = 1, s = 4, Sf = 8) creates a 122 × 122 map, and the third layer (d = 1, s = 2, Sf = 4) creates the final high-resolution map of size 244 × 244. Before each deconvolutional layer, a convolutional layer learns the features to be deconvolved; it can be seen as an adaptation between two consecutive deconvolutional layers.
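Equation (1) can be checked numerically. The short Python script below (ours, for verification only) reproduces the 7 → 30 → 122 → 244 size progression of the three deconvolutional layers:

    def deconv_out(size_in, s, s_f, d):
        """Output size of a deconvolutional layer, equation (1)."""
        return s * (size_in - 1) + s_f - 2 * d

    size = 7                                     # RoIAlign output resolution
    for s, s_f, d in [(4, 8, 1), (4, 8, 1), (2, 4, 1)]:
        size = deconv_out(size, s, s_f, d)
        print(size)                              # prints 30, then 122, then 244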
Further, regarding the training masks: the affordance detection branch requires supervision at a fixed size (e.g., 244 × 244), and resizing a ground-truth mask with a single threshold does not work in the affordance detection problem, so a multi-threshold resizing strategy is proposed. Given an original ground-truth mask, assume without loss of generality that the mask contains n distinct labels P = (c0, c1, …, cn−1), and let P̄ = (c̄0, c̄1, …, c̄n−1) be the values linearly mapped from P; the mapping from P to P̄ converts the original mask into a new mask. The converted mask is resized to the predefined mask size, and a threshold is applied to the resized mask as follows:
ρ(x, y) = c̄j, if |ρ(x, y) − c̄j| < α; 0, otherwise  (2)
where ρ(x, y) is a pixel value of the resized mask, c̄j is one of the values in P̄, and α is a hyperparameter set to 0.005. The values of the thresholded mask are remapped to the original label values (using the mapping from P̄ back to P) to obtain the training mask.
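The multi-threshold resizing can be sketched as follows, under our reading of equation (2); mapping P onto evenly spaced values in [0, 1] and using OpenCV bilinear interpolation are assumptions rather than details fixed by the text:

    import numpy as np
    import cv2

    def resize_gt_mask(mask, out_size=244, alpha=0.005):
        labels = np.unique(mask)                       # P = (c0, ..., cn-1)
        targets = np.linspace(0.0, 1.0, len(labels))   # P_bar, assumed evenly spaced
        remapped = np.zeros(mask.shape, np.float32)
        for c, c_bar in zip(labels, targets):          # map P -> P_bar
            remapped[mask == c] = c_bar
        resized = cv2.resize(remapped, (out_size, out_size),
                             interpolation=cv2.INTER_LINEAR)
        out = np.zeros((out_size, out_size), mask.dtype)
        for c, c_bar in zip(labels, targets):          # threshold (2), then map P_bar -> P
            out[np.abs(resized - c_bar) < alpha] = c
        return out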
Further, regarding the end-to-end deep learning network: the network consists of two branches, one for object detection and one for affordance detection. Given an input image, deep features are extracted from the image with a VGG16 backbone; an RPN that shares weights with the convolutional backbone then generates candidate bounding boxes (RoIs). For each RoI, the RoIAlign layer extracts and pools its corresponding features into a 7 × 7 feature map. In the object detection branch, two fully connected layers with 4096 neurons each are followed by a classification layer that classifies the object and a regression layer that regresses the object position. In the affordance detection branch, the 7 × 7 feature map is upsampled to 244 × 244 to obtain a high-resolution map, and a softmax layer assigns each pixel in the 244 × 244 map to its most probable affordance class. The whole network is trained end to end with a multi-task loss function.
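A condensed sketch of the two branches follows (backbone and RPN omitted); the 4096-neuron layers and the 7 × 7 → 30 → 122 → 244 progression follow the text, while the channel widths and adaptation convolutions are our own assumptions:

    import torch.nn as nn

    class AffordanceHead(nn.Module):
        def __init__(self, num_classes, num_affordances, in_ch=512):
            super().__init__()
            # Object detection branch: two fully connected layers of 4096 neurons,
            # then a classification layer and a box-regression layer.
            self.fc = nn.Sequential(
                nn.Flatten(), nn.Linear(in_ch * 7 * 7, 4096), nn.ReLU(),
                nn.Linear(4096, 4096), nn.ReLU())
            self.cls = nn.Linear(4096, num_classes)       # K+1 class scores
            self.loc = nn.Linear(4096, 4 * num_classes)   # per-class box offsets
            # Affordance branch: three deconvolutions take 7x7 RoI features to
            # 30x30, 122x122, and finally 244x244, each preceded by an
            # adaptation convolution (channel widths are assumptions).
            self.aff = nn.Sequential(
                nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(256, 256, 8, stride=4, padding=1), nn.ReLU(),
                nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, 128, 8, stride=4, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, num_affordances, 4, stride=2, padding=1))

        def forward(self, roi_feat):                      # roi_feat: (N, 512, 7, 7)
            h = self.fc(roi_feat)
            return self.cls(h), self.loc(h), self.aff(roi_feat)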
For the multi-task loss: in the end-to-end framework, the classification layer outputs a probability distribution p = (p0, …, pK) over K + 1 object classes, where p is the output of a softmax layer, and the regression layer outputs K + 1 bounding-box regression offsets (each offset containing the box center and box size): t^k = (t^k_x, t^k_y, t^k_w, t^k_h). Each offset t^k corresponds to a class k and is parameterized so that t^k specifies a scale-invariant translation and log-space height/width shift relative to the RPN bounding box. The affordance detection branch outputs, for each pixel i in the RoI, a set of probability distributions m = {m^i}, i ∈ RoI, where m^i = (m^i_0, …, m^i_C) is the output of a softmax layer defined over the C + 1 affordance labels, including background. A multi-task loss L jointly trains bounding-box classification, bounding-box position, and the affordance map, as follows:
L = Lcls + Lloc + Laff  (3)
where Lcls is defined on the output of the classification layer, Lloc on the output of the regression layer, and Laff on the output of the affordance detection branch.
Further, for each RoI the prediction targets are the ground-truth object class u, the ground-truth bounding-box offset v, and the target affordance mask s. The training dataset provides the values of u and v; the target affordance mask s is the intersection between the RoI and its associated ground-truth mask, and pixels inside the RoI that do not belong to the intersection are labeled as background. The object mask is resized to a fixed size (i.e., 244 × 244), and formula (3) is written as:
L(p, u, t^u, v, m, s) = Lcls(p, u) + I[u ≥ 1] Lloc(t^u, v) + I[u ≥ 1] Laff(m, s)  (4)
The first loss Lcls(p, u) is the multinomial cross-entropy loss for classification, computed as follows:
Lcls(p, u) = −log(pu)  (5)
where pu is the softmax output for the ground-truth object class u. The second loss Lloc(t^u, v) is the smooth L1 loss between the regressed box offset t^u (corresponding to the ground-truth object class u) and the ground-truth bounding-box offset v, computed as follows:
Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} SmoothL1(t^u_i − v_i)  (6)

where:

SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
Laff(m, s) is the multinomial cross-entropy loss of the affordance detection branch, computed as follows:

Laff(m, s) = −(1/N) Σ_{i ∈ RoI} log(m^i_{s_i})  (7)
where m^i_{s_i} is the softmax output at pixel i for its true label s_i, and N is the number of pixels in the RoI.
In equation (4), I[u ≥ 1] is an indicator function that outputs 1 when u ≥ 1 and 0 otherwise. The box-position loss Lloc and the affordance detection loss Laff are defined only when the RoI is positive, while the object classification loss Lcls is defined whether the RoI is positive or negative. The affordance detection branch loss differs from the instance segmentation loss, which performs a binary segmentation of each RoI into foreground and background: in the affordance detection problem, affordance labels differ from object labels, and the number of affordance labels in each RoI is not binary, i.e., it is always greater than 2 (including background). The affordance labels therefore rely on a per-pixel softmax and a multinomial cross-entropy loss.
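The loss in equations (4)-(7) can be sketched in PyTorch as follows; the averaging over RoIs and the way the indicator I[u ≥ 1] masks the per-RoI terms are our assumptions about details the text leaves open:

    import torch
    import torch.nn.functional as F

    def multitask_loss(p, u, t_u, v, m, s):
        # p: (N, K+1) class scores; u: (N,) ground-truth classes (long)
        # t_u: (N, 4) offsets for class u; v: (N, 4) ground-truth offsets
        # m: (N, C+1, 244, 244) affordance scores; s: (N, 244, 244) labels (long)
        l_cls = F.cross_entropy(p, u)                          # eq. (5)
        pos = (u >= 1).float()                                 # indicator I[u >= 1]
        l_loc = (F.smooth_l1_loss(t_u, v, reduction='none')
                 .sum(dim=1) * pos).mean()                     # eq. (6)
        l_aff = (F.cross_entropy(m, s, reduction='none')
                 .mean(dim=(1, 2)) * pos).mean()               # eq. (7), per-pixel CE
        return l_cls + l_loc + l_aff                           # eq. (4)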
For training and inference: the network is trained end to end with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. The network is trained for 200,000 iterations; the learning rate is set to 0.001 for the first 150,000 iterations and reduced for the last 50,000. Input images are resized so that the short edge is 600 pixels while the long edge does not exceed 1000 pixels; if the longer edge would exceed 1000 pixels, it is set to 1000 pixels and the image is resized based on that edge. The RPN uses 15 anchors, and its top 2000 RoIs are used to compute the multi-task loss. At the inference stage, the top 1000 RoIs generated by the RPN are selected and the object detection branch is run on them; from the outputs of the detection branch, the boxes whose classification score is higher than 0.9 are selected as the final detected objects. If no box satisfies this condition, the box with the highest classification score is selected as the single detected object. The detected objects are fed as input to the affordance detection branch, and for each pixel in a detected object, the affordance class prediction yields the output affordance label of that pixel. Finally, the resizing strategy is used to resize the predicted 244 × 244 affordance mask of each object to the object (box) size; if detected objects overlap, the final affordance label is determined based on priority.
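The optimizer settings and the image-resizing rule above translate directly into code; the following is a sketch under the stated settings (model is any nn.Module; the exact decay factor for the final 50,000 iterations is not specified in the text, so no scheduler is shown):

    import torch

    def make_optimizer(model):
        # SGD with momentum 0.9, weight decay 0.0005, initial learning rate 0.001.
        return torch.optim.SGD(model.parameters(), lr=0.001,
                               momentum=0.9, weight_decay=0.0005)

    def input_scale(h, w, short=600, long_max=1000):
        # Scale so the short edge becomes 600 px, unless that pushes the long
        # edge past 1000 px, in which case the long edge is set to 1000 px.
        scale = short / min(h, w)
        if scale * max(h, w) > long_max:
            scale = long_max / max(h, w)
        return scale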
Brief description of the drawings
Fig. 1 is a system flowchart of the object affordance detection method based on end-to-end deep learning according to the present invention.
Fig. 2 is a diagram of the affordance network architecture of the object affordance detection method based on end-to-end deep learning according to the present invention.
Fig. 3 is a diagram of the deconvolutional upsampling of the object affordance detection method based on end-to-end deep learning according to the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a system flowchart of the object affordance detection method based on end-to-end deep learning according to the present invention. The method mainly includes: (1) problem formulation; (2) the affordance network architecture; (3) the multi-task loss; (4) training and inference.
The problem formulation framework aims to simultaneously find the position of an object, its class, and its affordances in the image. Following standard computer-vision practice, the position of an object is defined by a rectangle relative to the top-left corner of the image, the object class is defined for the rectangle, and each pixel inside the rectangle is encoded with its affordance; pixel regions of an object that share the same function are considered to have the same affordance. Ideally, all relevant objects in the image are detected, and each pixel in these objects is mapped to its most probable affordance label.
In the end-to-end framework, the classification layer outputs a probability distribution p = (p0, …, pK) over K + 1 object classes, where p is the output of a softmax layer, and the regression layer outputs K + 1 bounding-box regression offsets (each offset containing the box center and box size): t^k = (t^k_x, t^k_y, t^k_w, t^k_h). Each offset t^k corresponds to a class k and is parameterized so that t^k specifies a scale-invariant translation and log-space height/width shift relative to the RPN bounding box. The affordance detection branch outputs, for each pixel i in the RoI, a set of probability distributions m = {m^i}, i ∈ RoI, where m^i = (m^i_0, …, m^i_C) is the output of a softmax layer defined over the C + 1 affordance labels, including background. A multi-task loss L jointly trains bounding-box classification, bounding-box position, and the affordance map, as follows:
L = Lcls + Lloc + Laff  (1)
where Lcls is defined on the output of the classification layer, Lloc on the output of the regression layer, and Laff on the output of the affordance detection branch.
For each RoI the prediction targets are the ground-truth object class u, the ground-truth bounding-box offset v, and the target affordance mask s. The training dataset provides the values of u and v; the target affordance mask s is the intersection between the RoI and its associated ground-truth mask, and pixels inside the RoI that do not belong to the intersection are labeled as background. The object mask is resized to a fixed size (i.e., 244 × 244), and the formula is written as:
L(p, u, t^u, v, m, s) = Lcls(p, u) + I[u ≥ 1] Lloc(t^u, v) + I[u ≥ 1] Laff(m, s)  (2)
The first loss Lcls(p, u) is the multinomial cross-entropy loss for classification, computed as follows:
Lcls(p, u) = −log(pu)  (3)
where pu is the softmax output for the ground-truth object class u. The second loss Lloc(t^u, v) is the smooth L1 loss between the regressed box offset t^u (corresponding to the ground-truth object class u) and the ground-truth bounding-box offset v, computed as follows:
Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} SmoothL1(t^u_i − v_i)  (4)

where:

SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
Laff(m, s) is the multinomial cross-entropy loss of the affordance detection branch, computed as follows:

Laff(m, s) = −(1/N) Σ_{i ∈ RoI} log(m^i_{s_i})  (5)
where m^i_{s_i} is the softmax output at pixel i for its true label s_i, and N is the number of pixels in the RoI.
In equation (2), I[u ≥ 1] is an indicator function that outputs 1 when u ≥ 1 and 0 otherwise. The box-position loss Lloc and the affordance detection loss Laff are defined only when the RoI is positive, while the object classification loss Lcls is defined whether the RoI is positive or negative. The affordance detection branch loss differs from the instance segmentation loss, which performs a binary segmentation of each RoI into foreground and background: in the affordance detection problem, affordance labels differ from object labels, and the number of affordance labels in each RoI is not binary, i.e., it is always greater than 2 (including background). The affordance labels therefore rely on a per-pixel softmax and a multinomial cross-entropy loss.
The network is trained end to end with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. The network is trained for 200,000 iterations; the learning rate is set to 0.001 for the first 150,000 iterations and reduced for the last 50,000. Input images are resized so that the short edge is 600 pixels while the long edge does not exceed 1000 pixels; if the longer edge would exceed 1000 pixels, it is set to 1000 pixels and the image is resized based on that edge. The RPN uses 15 anchors, and its top 2000 RoIs are used to compute the multi-task loss. At the inference stage, the top 1000 RoIs generated by the RPN are selected and the object detection branch is run on them; from the outputs of the detection branch, the boxes whose classification score is higher than 0.9 are selected as the final detected objects. If no box satisfies this condition, the box with the highest classification score is selected as the single detected object. The detected objects are fed as input to the affordance detection branch, and for each pixel in a detected object, the affordance class prediction yields the output affordance label of that pixel. Finally, the resizing strategy is used to resize the predicted 244 × 244 affordance mask of each object to the object (box) size; if detected objects overlap, the final affordance label is determined based on priority.
Fig. 2 is a diagram of the affordance network architecture of the object affordance detection method based on end-to-end deep learning according to the present invention. The affordance network architecture has three main components: 1) a region-of-interest alignment layer (RoIAlign) that correctly computes the features of each region of interest (RoI) from the image feature map; 2) a sequence of deconvolutional layers that upsamples the RoI feature map to a high resolution, obtaining a smooth, fine-grained affordance map; 3) a robust resizing strategy for the training masks that supervises the affordances.
Regarding the region-of-interest alignment layer (RoIAlign): a region proposal network (RPN) performs region-based object detection; it shares weights with the main convolutional backbone and outputs bounding boxes of different sizes. With a RoIPool layer, each RoI is pooled from the image feature map into a small feature map of fixed size (e.g., 7 × 7) using rounding, which misaligns the extracted features with the RoI. The RoIAlign layer instead properly aligns the extracted features with the RoI without any rounding operation: it uses bilinear interpolation to compute the values at regularly sampled positions inside each RoI bin and aggregates the results with a max operation, avoiding the misalignment between the RoI and the extracted features.
The affordance detection branch requires supervision at a fixed size (e.g., 244 × 244), and resizing a ground-truth mask with a single threshold does not work in the affordance detection problem, so a multi-threshold resizing strategy is proposed. Given an original ground-truth mask, assume without loss of generality that the mask contains n distinct labels P = (c0, c1, …, cn−1), and let P̄ = (c̄0, c̄1, …, c̄n−1) be the values linearly mapped from P; the mapping from P to P̄ converts the original mask into a new mask. The converted mask is resized to the predefined mask size, and a threshold is applied to the resized mask as follows:
ρ(x, y) = c̄j, if |ρ(x, y) − c̄j| < α; 0, otherwise  (6)
where ρ(x, y) is a pixel value of the resized mask, c̄j is one of the values in P̄, and α is a hyperparameter set to 0.005. The values of the thresholded mask are remapped to the original label values (using the mapping from P̄ back to P) to obtain the training mask.
The end-to-end deep learning network consists of two branches, one for object detection and one for affordance detection. Given an input image, deep features are extracted from the image with a VGG16 backbone; an RPN that shares weights with the convolutional backbone then generates candidate bounding boxes (RoIs). For each RoI, the RoIAlign layer extracts and pools its corresponding features into a 7 × 7 feature map. In the object detection branch, two fully connected layers with 4096 neurons each are followed by a classification layer that classifies the object and a regression layer that regresses the object position. In the affordance detection branch, the 7 × 7 feature map is upsampled to 244 × 244 to obtain a high-resolution map, and a softmax layer assigns each pixel in the 244 × 244 map to its most probable affordance class. The whole network is trained end to end with a multi-task loss function.
Fig. 3 is a diagram of the deconvolutional upsampling of the object affordance detection method based on end-to-end deep learning according to the present invention. Segmentation approaches usually represent the object mask with a mask of small fixed size (e.g., 14 × 14 or 28 × 28), and the pixel values in each predicted RoI mask are binary, i.e., foreground or background. Because each object contains multiple affordance classes, such a binary mask does not work well for the affordance detection problem, so deconvolutional layers are used to obtain a high-resolution affordance mask. Formally, given an input feature map of size Si, a deconvolutional layer performs the operation opposite to a convolutional layer in order to build a larger output map of size So; the relation between Si and So is:
So = s·(Si − 1) + Sf − 2d  (7)
where Sf is the filter size, and s and d are the stride and padding parameters, respectively. In practice, the RoIAlign layer outputs a 7 × 7 feature map, which is upsampled to a higher resolution with three deconvolutional layers. The first deconvolutional layer has padding d = 1, stride s = 4, and kernel size Sf = 8, creating a 30 × 30 map; similarly, the second layer (d = 1, s = 4, Sf = 8) creates a 122 × 122 map, and the third layer (d = 1, s = 2, Sf = 4) creates the final high-resolution map of size 244 × 244. Before each deconvolutional layer, a convolutional layer learns the features to be deconvolved; it can be seen as an adaptation between two consecutive deconvolutional layers.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and the present invention may be implemented in other specific forms without departing from the spirit and scope of the invention. In addition, those skilled in the art may make various modifications and variations to the present invention without departing from the spirit and scope of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Claims (10)

1. An object affordance detection method based on end-to-end deep learning, characterized in that the method mainly includes: problem formulation (1); an affordance network architecture (2); a multi-task loss (3); and training and inference (4).
2. The problem formulation (1) according to claim 1, characterized in that the framework aims to simultaneously find the position of an object, its class, and its affordances in the image; following standard computer-vision practice, the position of an object is defined by a rectangle relative to the top-left corner of the image, the object class is defined for the rectangle, and each pixel inside the rectangle is encoded with its affordance; pixel regions of an object that share the same function are considered to have the same affordance; ideally, all relevant objects in the image are detected, and each pixel in these objects is mapped to its most probable affordance label.
3. The affordance network architecture (2) according to claim 1, characterized in that the affordance network architecture has three main components: 1) a region-of-interest alignment layer (RoIAlign) for correctly computing the features of each region of interest (RoI) from the image feature map; 2) a sequence of deconvolutional layers that upsamples the RoI feature map to a high resolution, obtaining a smooth, fine-grained affordance map; 3) a robust resizing strategy for the training masks that supervises the affordances.
4. The region-of-interest alignment layer (RoIAlign) according to claim 3, characterized in that a region proposal network (RPN) performs region-based object detection, shares weights with the main convolutional backbone, and outputs bounding boxes of different sizes; with a RoIPool layer, each RoI is pooled from the image feature map into a small feature map of fixed size (e.g., 7 × 7) using rounding, which misaligns the extracted features with the RoI; the RoIAlign layer properly aligns the extracted features with the RoI without any rounding operation, using bilinear interpolation to compute the values at regularly sampled positions inside each RoI bin and aggregating the results with a max operation, thereby avoiding the misalignment between the RoI and the extracted features.
5. The high-resolution map according to claim 3, characterized in that the object mask is usually represented with a mask of small fixed size (e.g., 14 × 14 or 28 × 28), and the pixel values in each predicted RoI mask are binary, i.e., foreground or background; because each object contains multiple affordance classes, such a binary mask does not work well for the affordance detection problem, so deconvolutional layers are used to obtain a high-resolution affordance mask; formally, given an input feature map of size Si, a deconvolutional layer performs the operation opposite to a convolutional layer in order to build a larger output map of size So, and the relation between Si and So is:
So = s·(Si − 1) + Sf − 2d  (1)
where Sf is the filter size, and s and d are the stride and padding parameters, respectively; in practice, the RoIAlign layer outputs a 7 × 7 feature map, which is upsampled to a higher resolution with three deconvolutional layers; the first deconvolutional layer has padding d = 1, stride s = 4, and kernel size Sf = 8, creating a 30 × 30 map; similarly, the second layer (d = 1, s = 4, Sf = 8) creates a 122 × 122 map, and the third layer (d = 1, s = 2, Sf = 4) creates the final high-resolution map of size 244 × 244; before each deconvolutional layer, a convolutional layer learns the features to be deconvolved, which can be seen as an adaptation between two consecutive deconvolutional layers.
6. The training masks according to claim 3, characterized in that the affordance detection branch requires supervision at a fixed size (e.g., 244 × 244), and resizing a ground-truth mask with a single threshold does not work in the affordance detection problem, so a multi-threshold resizing strategy is proposed; given an original ground-truth mask, assume without loss of generality that the mask contains n distinct labels P = (c0, c1, …, cn−1), and let P̄ = (c̄0, c̄1, …, c̄n−1) be the values linearly mapped from P; the mapping from P to P̄ converts the original mask into a new mask; the converted mask is resized to the predefined mask size, and a threshold is applied to the resized mask as follows:
ρ(x, y) = c̄j, if |ρ(x, y) − c̄j| < α; 0, otherwise  (2)
where ρ(x, y) is a pixel value of the resized mask, c̄j is one of the values in P̄, and α is a hyperparameter set to 0.005; the values of the thresholded mask are remapped to the original label values (using the mapping from P̄ back to P) to obtain the training mask.
7. The end-to-end deep learning according to claim 1, characterized in that the network consists of two branches, one for object detection and one for affordance detection; given an input image, deep features are extracted from the image with a VGG16 backbone, and an RPN that shares weights with the convolutional backbone then generates candidate bounding boxes (RoIs); for each RoI, the RoIAlign layer extracts and pools its corresponding features into a 7 × 7 feature map; in the object detection branch, two fully connected layers with 4096 neurons each are followed by a classification layer that classifies the object and a regression layer that regresses the object position; in the affordance detection branch, the 7 × 7 feature map is upsampled to 244 × 244 to obtain a high-resolution map, and a softmax layer assigns each pixel in the 244 × 244 map to its most probable affordance class; the whole network is trained end to end with a multi-task loss function.
8. The multi-task loss (3) according to claim 1, characterized in that, in the end-to-end framework, the classification layer outputs a probability distribution p = (p0, …, pK) over K + 1 object classes, where p is the output of a softmax layer, and the regression layer outputs K + 1 bounding-box regression offsets (each offset containing the box center and box size): t^k = (t^k_x, t^k_y, t^k_w, t^k_h); each offset t^k corresponds to a class k and is parameterized so that t^k specifies a scale-invariant translation and log-space height/width shift relative to the RPN bounding box; the affordance detection branch outputs, for each pixel i in the RoI, a set of probability distributions m = {m^i}, i ∈ RoI, where m^i = (m^i_0, …, m^i_C) is the output of a softmax layer defined over the C + 1 affordance labels, including background; a multi-task loss L jointly trains bounding-box classification, bounding-box position, and the affordance map, as follows:
L = Lcls + Lloc + Laff  (3)
where Lcls is defined on the output of the classification layer, Lloc on the output of the regression layer, and Laff on the output of the affordance detection branch.
9. The loss according to claim 8, characterized in that for each RoI the prediction targets are the ground-truth object class u, the ground-truth bounding-box offset v, and the target affordance mask s; the training dataset provides the values of u and v, and the target affordance mask s is the intersection between the RoI and its associated ground-truth mask; pixels inside the RoI that do not belong to the intersection are labeled as background; the object mask is resized to a fixed size (i.e., 244 × 244), and formula (3) is written as:
L(p, u, t^u, v, m, s) = Lcls(p, u) + I[u ≥ 1] Lloc(t^u, v) + I[u ≥ 1] Laff(m, s)  (4)
The first loss Lcls(p, u) is the multinomial cross-entropy loss for classification, computed as follows:
Lcls(p, u) = −log(pu)  (5)
where pu is the softmax output for the ground-truth object class u; the second loss Lloc(t^u, v) is the smooth L1 loss between the regressed box offset t^u (corresponding to the ground-truth object class u) and the ground-truth bounding-box offset v, computed as follows:
Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} SmoothL1(t^u_i − v_i)  (6)
where:

SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
Laff(m, s) is the multinomial cross-entropy loss of the affordance detection branch, computed as follows:
Laff(m, s) = −(1/N) Σ_{i ∈ RoI} log(m^i_{s_i})  (7)
where m^i_{s_i} is the softmax output at pixel i for its true label s_i, and N is the number of pixels in the RoI; in equation (4), I[u ≥ 1] is an indicator function that outputs 1 when u ≥ 1 and 0 otherwise; the box-position loss Lloc and the affordance detection loss Laff are defined only when the RoI is positive, while the object classification loss Lcls is defined whether the RoI is positive or negative; the affordance detection branch loss differs from the instance segmentation loss, which performs a binary segmentation of each RoI into foreground and background; in the affordance detection problem, affordance labels differ from object labels, and the number of affordance labels in each RoI is not binary, i.e., it is always greater than 2 (including background); therefore, the affordance labels rely on a per-pixel softmax and a multinomial cross-entropy loss.
10. The training and inference (4) according to claim 1, characterized in that the network is trained end to end with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005; the network is trained for 200,000 iterations, the learning rate is set to 0.001 for the first 150,000 iterations and reduced for the last 50,000; input images are resized so that the short edge is 600 pixels while the long edge does not exceed 1000 pixels; if the longer edge would exceed 1000 pixels, it is set to 1000 pixels and the image is resized based on that edge; the RPN uses 15 anchors, and its top 2000 RoIs are used to compute the multi-task loss; at the inference stage, the top 1000 RoIs generated by the RPN are selected and the object detection branch is run on them; from the outputs of the detection branch, the boxes whose classification score is higher than 0.9 are selected as the final detected objects; if no box satisfies this condition, the box with the highest classification score is selected as the single detected object; the detected objects are fed as input to the affordance detection branch, and for each pixel in a detected object, the affordance class prediction yields the output affordance label of that pixel; finally, the resizing strategy is used to resize the predicted 244 × 244 affordance mask of each object to the object (box) size, and if detected objects overlap, the final affordance label is determined based on priority.
CN201711139653.8A 2017-11-16 2017-11-16 Object affordance detection method based on end-to-end deep learning Withdrawn CN107944443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711139653.8A CN107944443A (en) 2017-11-16 2017-11-16 Object affordance detection method based on end-to-end deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711139653.8A CN107944443A (en) 2017-11-16 2017-11-16 Object affordance detection method based on end-to-end deep learning

Publications (1)

Publication Number Publication Date
CN107944443A true CN107944443A (en) 2018-04-20

Family

ID=61932635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711139653.8A Withdrawn CN107944443A (en) 2017-11-16 2017-11-16 One kind carries out object consistency detection method based on end-to-end deep learning

Country Status (1)

Country Link
CN (1) CN107944443A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145898A (en) * 2018-07-26 2019-01-04 清华大学深圳研究生院 A kind of object detecting method based on convolutional neural networks and iterator mechanism
CN109190537A (en) * 2018-08-23 2019-01-11 浙江工商大学 A kind of more personage's Attitude estimation methods based on mask perceived depth intensified learning
CN109299434A (en) * 2018-09-04 2019-02-01 重庆公共运输职业学院 Cargo customs clearance big data is intelligently graded and sampling observation rate computing system
CN109801297A (en) * 2019-01-14 2019-05-24 浙江大学 A kind of image panorama segmentation prediction optimization method realized based on convolution
CN109871798A (en) * 2019-02-01 2019-06-11 浙江大学 A kind of remote sensing image building extracting method based on convolutional neural networks
CN110008808A (en) * 2018-12-29 2019-07-12 北京迈格威科技有限公司 Panorama dividing method, device and system and storage medium
CN110298364A (en) * 2019-06-27 2019-10-01 安徽师范大学 Based on the feature selection approach of multitask under multi-threshold towards functional brain network
CN110349167A (en) * 2019-07-10 2019-10-18 北京悉见科技有限公司 A kind of image instance dividing method and device
CN110633595A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device by utilizing bilinear interpolation
CN110909748A (en) * 2018-09-17 2020-03-24 斯特拉德视觉公司 Image encoding method and apparatus using multi-feed
CN110956131A (en) * 2019-11-27 2020-04-03 北京迈格威科技有限公司 Single-target tracking method, device and system
WO2020155518A1 (en) * 2019-02-03 2020-08-06 平安科技(深圳)有限公司 Object detection method and device, computer device and storage medium
WO2020156303A1 (en) * 2019-01-30 2020-08-06 广州市百果园信息技术有限公司 Method and apparatus for training semantic segmentation network, image processing method and apparatus based on semantic segmentation network, and device and storage medium
CN112684704A (en) * 2020-12-18 2021-04-20 华南理工大学 End-to-end motion control method, system, device and medium based on deep learning
CN112692875A (en) * 2021-01-06 2021-04-23 华南理工大学 Digital twin system for operation and maintenance of welding robot
CN112799401A (en) * 2020-12-28 2021-05-14 华南理工大学 End-to-end robot vision-motion navigation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204555A (en) * 2016-06-30 2016-12-07 天津工业大学 A kind of combination Gbvs model and the optic disc localization method of phase equalization
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204555A (en) * 2016-06-30 2016-12-07 天津工业大学 A kind of combination Gbvs model and the optic disc localization method of phase equalization
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
THANH-TOAN DO ET AL.: "AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection", 《ARXIV》 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633595B (en) * 2018-06-21 2022-12-02 北京京东尚科信息技术有限公司 Target detection method and device by utilizing bilinear interpolation
CN110633595A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device by utilizing bilinear interpolation
CN109145898A (en) * 2018-07-26 2019-01-04 清华大学深圳研究生院 A kind of object detecting method based on convolutional neural networks and iterator mechanism
CN109190537A (en) * 2018-08-23 2019-01-11 浙江工商大学 A kind of more personage's Attitude estimation methods based on mask perceived depth intensified learning
CN109190537B (en) * 2018-08-23 2020-09-29 浙江工商大学 Mask perception depth reinforcement learning-based multi-person attitude estimation method
CN109299434A (en) * 2018-09-04 2019-02-01 重庆公共运输职业学院 Cargo customs clearance big data is intelligently graded and sampling observation rate computing system
CN110909748A (en) * 2018-09-17 2020-03-24 斯特拉德视觉公司 Image encoding method and apparatus using multi-feed
CN110909748B (en) * 2018-09-17 2023-09-19 斯特拉德视觉公司 Image encoding method and apparatus using multi-feed
CN110008808A (en) * 2018-12-29 2019-07-12 北京迈格威科技有限公司 Panorama dividing method, device and system and storage medium
CN109801297A (en) * 2019-01-14 2019-05-24 浙江大学 A kind of image panorama segmentation prediction optimization method realized based on convolution
WO2020156303A1 (en) * 2019-01-30 2020-08-06 广州市百果园信息技术有限公司 Method and apparatus for training semantic segmentation network, image processing method and apparatus based on semantic segmentation network, and device and storage medium
CN109871798A (en) * 2019-02-01 2019-06-11 浙江大学 A kind of remote sensing image building extracting method based on convolutional neural networks
WO2020155518A1 (en) * 2019-02-03 2020-08-06 平安科技(深圳)有限公司 Object detection method and device, computer device and storage medium
CN110298364A (en) * 2019-06-27 2019-10-01 安徽师范大学 Based on the feature selection approach of multitask under multi-threshold towards functional brain network
CN110349167A (en) * 2019-07-10 2019-10-18 北京悉见科技有限公司 A kind of image instance dividing method and device
CN110956131A (en) * 2019-11-27 2020-04-03 北京迈格威科技有限公司 Single-target tracking method, device and system
CN110956131B (en) * 2019-11-27 2024-01-05 北京迈格威科技有限公司 Single-target tracking method, device and system
CN112684704A (en) * 2020-12-18 2021-04-20 华南理工大学 End-to-end motion control method, system, device and medium based on deep learning
CN112799401A (en) * 2020-12-28 2021-05-14 华南理工大学 End-to-end robot vision-motion navigation method
CN112692875B (en) * 2021-01-06 2021-08-10 华南理工大学 Digital twin system for operation and maintenance of welding robot
CN112692875A (en) * 2021-01-06 2021-04-23 华南理工大学 Digital twin system for operation and maintenance of welding robot

Similar Documents

Publication Publication Date Title
CN107944443A (en) Object affordance detection method based on end-to-end deep learning
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN104809187B (en) A kind of indoor scene semanteme marking method based on RGB D data
Volpi et al. Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images
Yang et al. Layered object models for image segmentation
CN105869178B (en) A kind of complex target dynamic scene non-formaldehyde finishing method based on the convex optimization of Multiscale combination feature
CN106920243A (en) The ceramic material part method for sequence image segmentation of improved full convolutional neural networks
US20160055237A1 (en) Method for Semantically Labeling an Image of a Scene using Recursive Context Propagation
CN107909015A (en) Hyperspectral image classification method based on convolutional neural networks and empty spectrum information fusion
CN104281853A (en) Behavior identification method based on 3D convolution neural network
CN110929665B (en) Natural scene curve text detection method
CN106599805A (en) Supervised data driving-based monocular video depth estimating method
CN104298974A (en) Human body behavior recognition method based on depth video sequence
JP7329041B2 (en) Method and related equipment for synthesizing images based on conditional adversarial generation networks
Liu et al. Robust salient object detection for RGB images
CN112734789A (en) Image segmentation method and system based on semi-supervised learning and point rendering
Hernández et al. CUDA-based parallelization of a bio-inspired model for fast object classification
Zhang et al. Class relatedness oriented-discriminative dictionary learning for multiclass image classification
CN107657276B (en) Weak supervision semantic segmentation method based on searching semantic class clusters
CN109726725A (en) The oil painting writer identification method of heterogeneite Multiple Kernel Learning between a kind of class based on large-spacing
Vinoth Kumar et al. A decennary survey on artificial intelligence methods for image segmentation
Liu et al. Dunhuang murals contour generation network based on convolution and self-attention fusion
CN103440651A (en) Multi-label image annotation result fusion method based on rank minimization
CN104778683A (en) Multi-modal image segmenting method based on functional mapping
Wang et al. Self-attention deep saliency network for fabric defect detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20180420