CN107944443A - Object affordance detection method based on end-to-end deep learning - Google Patents

Object affordance detection method based on end-to-end deep learning Download PDF

Info

Publication number
CN107944443A
CN107944443A CN201711139653.8A
Authority
CN
China
Prior art keywords
roi
affordance
detection
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201711139653.8A
Other languages
Chinese (zh)
Inventor
夏春秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Vision Technology Co Ltd
Original Assignee
Shenzhen Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Vision Technology Co Ltd filed Critical Shenzhen Vision Technology Co Ltd
Priority to CN201711139653.8A priority Critical patent/CN107944443A/en
Publication of CN107944443A publication Critical patent/CN107944443A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering

Abstract

The present invention proposes an object affordance detection method based on end-to-end deep learning, which aims to simultaneously find the position, class, and affordances of objects in an image. A region-of-interest alignment layer (RoIAlign) correctly computes the features of each region of interest from the image feature map; a sequence of deconvolutional layers upsamples the RoI feature map to a high resolution to obtain the affordance map; and a robust resizing strategy for the training masks supervises the affordances. Object detection localizes the objects, affordance detection assigns each pixel inside an object its affordance label, and a multi-task loss trains bounding-box classification, localization, and affordance mapping; training and inference finally yield the affordance labels. The invention uses end-to-end deep learning and a multi-task loss function to jointly optimize object detection and affordance detection. It requires no extra information, reduces the complexity of training and testing, effectively improves detection accuracy, and is suitable for real-time robotic applications.

Description

Object affordance detection method based on end-to-end deep learning
Technical field
The present invention relates to the field of computer vision, and more particularly to an object affordance detection method based on end-to-end deep learning.
Background art
In computer vision, jointly detecting and segmenting objects is increasingly popular. An object can be described by various visual attributes such as color and shape, or physical attributes such as weight, volume, and material; these attributes are useful for identifying objects or sorting them into categories. In many robotic applications, recognizing object affordances is essential, yet a robot may still need more information to complete a task: it must not only detect the affordances of an object but also locate and recognize the object itself. As an emerging problem, object affordance detection has practical value in many fields, such as scene understanding, video search, object detection, behavior analysis, 3D scene reconstruction, and human-computer interaction; in particular, it has broad application prospects in autonomous driving, human-computer interaction in smart homes, and medical diagnosis. Understanding object affordances differs from merely describing the visual or physical properties of an object; it also requires capturing affordance information and the way humans interact with the object. Understanding object affordances is therefore the key to enabling autonomous robots to interact with objects and assist people in everyday tasks.
However, detecting the affordances of objects is harder than the traditional semantic segmentation problem: two objects with different appearances may have the same affordance label, because affordance labels are abstractions based on how humans act on objects. In addition, detecting and generalizing to unseen objects in real time is also crucial for affordances. Existing common methods use two sequential deep neural networks, which is very time-consuming and unsuitable for real-time applications.
The present invention proposes an object affordance detection method based on end-to-end deep learning, which aims to simultaneously find the position, class, and affordances of objects in an image. A region-of-interest alignment layer (RoIAlign) correctly computes the features of each region of interest (RoI) from the image feature map; a sequence of deconvolutional layers upsamples the RoI feature map to a high-resolution affordance map; and a robust resizing strategy supervises the affordance masks during training. Object detection localizes the objects, affordance detection assigns each pixel inside an object its affordance label, and a multi-task loss jointly trains bounding-box classification, localization, and affordance mapping; training and inference finally yield the affordance labels. The invention uses end-to-end deep learning and a multi-task loss function to jointly optimize object detection and affordance detection. It requires no extra information, reduces the complexity of training and testing, effectively improves detection accuracy, and is suitable for real-time robotic applications.
Summary of the invention
To address the problem that existing methods are time-consuming and unsuitable for real-time use, the present invention adopts end-to-end deep learning and a multi-task loss function to jointly optimize object detection and affordance detection. It requires no extra information, reduces the complexity of training and testing, effectively improves detection accuracy, and is suitable for real-time robotic applications.
To solve the above problems, the present invention provides an object affordance detection method based on end-to-end deep learning, which mainly includes:
(1) problem formulation;
(2) the affordance network architecture;
(3) the multi-task loss;
(4) training and inference.
In the problem formulation, the framework aims to simultaneously find the position of an object, its class, and its affordances in the image. Following standard computer-vision practice, the position of an object is defined by a rectangle relative to the top-left corner of the image, the object class is defined for the rectangle, and each pixel inside the rectangle is encoded with its affordance; pixel regions of an object that share the same function are considered to have the same affordance. Ideally, all relevant objects in the image are detected, and each pixel in these objects is mapped to its most probable affordance label.
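As an illustrative sketch of this representation (the class and field names below are our own and not part of the patent), the per-object target can be written as a small Python data structure:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class AffordanceTarget:
        box: tuple        # (x1, y1, x2, y2), rectangle relative to the image's top-left corner
        obj_class: int    # object category assigned to the rectangle
        mask: np.ndarray  # H x W integer map: each pixel holds an affordance label
                          # (0 = background); pixels sharing a function share a label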
The affordance network architecture has three main components: 1) a region-of-interest alignment layer (RoIAlign) that correctly computes the features of each region of interest (RoI) from the image feature map; 2) a sequence of deconvolutional layers that upsamples the RoI feature map to a high resolution, obtaining a smooth, fine-grained affordance map; 3) a robust resizing strategy for the training masks that supervises the affordances.
Further, regarding the region-of-interest alignment layer (RoIAlign): a region proposal network (RPN) performs region-based object detection; it shares weights with the main convolutional backbone and outputs bounding boxes of different sizes. With a RoIPool layer, each RoI is pooled from the image feature map into a small feature map of fixed size (e.g., 7 × 7) using rounding, which misaligns the extracted features with the RoI. The RoIAlign layer instead properly aligns the extracted features with the RoI without any rounding operation: it uses bilinear interpolation to compute the values at regularly sampled positions inside each RoI bin and aggregates the results with a max operation, avoiding the misalignment between the RoI and the extracted features.
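A minimal sketch of this step using torchvision's roi_align operator, which performs the same rounding-free bilinear sampling (torchvision averages the sampled points per bin rather than taking their maximum, a small difference from the text above); the feature-map shape, the 1/16 scale, and the example box are illustrative assumptions:

    import torch
    from torchvision.ops import roi_align

    feat = torch.randn(1, 512, 38, 50)                 # assumed VGG16 conv feature map (1/16 of image size)
    rois = torch.tensor([[0., 64., 64., 256., 256.]])  # (batch index, x1, y1, x2, y2) in image coordinates
    pooled = roi_align(feat, rois, output_size=(7, 7),
                       spatial_scale=1.0 / 16,         # image-to-feature-map scale
                       sampling_ratio=2)               # regular bilinear sampling points per bin
    print(pooled.shape)                                # torch.Size([1, 512, 7, 7])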
Further, regarding the high-resolution map: segmentation approaches usually represent the object mask with a mask of small fixed size (e.g., 14 × 14 or 28 × 28), and the pixel values in each predicted RoI mask are binary, i.e., foreground or background. Because each object contains multiple affordance classes, such a binary mask does not work well for the affordance detection problem, so deconvolutional layers are used to obtain a high-resolution affordance mask. Formally, given an input feature map of size Si, a deconvolutional layer performs the operation opposite to a convolutional layer in order to build a larger output map of size So; the relation between Si and So is:
So = s·(Si − 1) + Sf − 2d  (1)
where Sf is the filter size, and s and d are the stride and padding parameters, respectively. In practice, the RoIAlign layer outputs a 7 × 7 feature map, which is upsampled to a higher resolution with three deconvolutional layers. The first deconvolutional layer has padding d = 1, stride s = 4, and kernel size Sf = 8, creating a 30 × 30 map; similarly, the second layer (d = 1, s = 4, Sf = 8) creates a 122 × 122 map, and the third layer (d = 1, s = 2, Sf = 4) creates the final high-resolution map of size 244 × 244. Before each deconvolutional layer, a convolutional layer learns the features to be deconvolved; it can be seen as an adaptation between two consecutive deconvolutional layers.
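Equation (1) can be checked numerically. The short Python script below (ours, for verification only) reproduces the 7 → 30 → 122 → 244 size progression of the three deconvolutional layers:

    def deconv_out(size_in, s, s_f, d):
        """Output size of a deconvolutional layer, equation (1)."""
        return s * (size_in - 1) + s_f - 2 * d

    size = 7                                     # RoIAlign output resolution
    for s, s_f, d in [(4, 8, 1), (4, 8, 1), (2, 4, 1)]:
        size = deconv_out(size, s, s_f, d)
        print(size)                              # prints 30, then 122, then 244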
Further, regarding the training masks: the affordance detection branch requires supervision at a fixed size (e.g., 244 × 244), and resizing a ground-truth mask with a single threshold does not work in the affordance detection problem, so a multi-threshold resizing strategy is proposed. Given an original ground-truth mask, assume without loss of generality that the mask contains n distinct labels P = (c0, c1, …, cn−1), and let P̄ = (c̄0, c̄1, …, c̄n−1) be the values linearly mapped from P; the mapping from P to P̄ converts the original mask into a new mask. The converted mask is resized to the predefined mask size, and a threshold is applied to the resized mask as follows:
ρ(x, y) = c̄j, if |ρ(x, y) − c̄j| < α; 0, otherwise  (2)
where ρ(x, y) is a pixel value of the resized mask, c̄j is one of the values in P̄, and α is a hyperparameter set to 0.005. The values of the thresholded mask are remapped to the original label values (using the mapping from P̄ back to P) to obtain the training mask.
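The multi-threshold resizing can be sketched as follows, under our reading of equation (2); mapping P onto evenly spaced values in [0, 1] and using OpenCV bilinear interpolation are assumptions rather than details fixed by the text:

    import numpy as np
    import cv2

    def resize_gt_mask(mask, out_size=244, alpha=0.005):
        labels = np.unique(mask)                       # P = (c0, ..., cn-1)
        targets = np.linspace(0.0, 1.0, len(labels))   # P_bar, assumed evenly spaced
        remapped = np.zeros(mask.shape, np.float32)
        for c, c_bar in zip(labels, targets):          # map P -> P_bar
            remapped[mask == c] = c_bar
        resized = cv2.resize(remapped, (out_size, out_size),
                             interpolation=cv2.INTER_LINEAR)
        out = np.zeros((out_size, out_size), mask.dtype)
        for c, c_bar in zip(labels, targets):          # threshold (2), then map P_bar -> P
            out[np.abs(resized - c_bar) < alpha] = c
        return out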
Further, regarding the end-to-end deep learning network: the network consists of two branches, one for object detection and one for affordance detection. Given an input image, deep features are extracted from the image with a VGG16 backbone; an RPN that shares weights with the convolutional backbone then generates candidate bounding boxes (RoIs). For each RoI, the RoIAlign layer extracts and pools its corresponding features into a 7 × 7 feature map. In the object detection branch, two fully connected layers with 4096 neurons each are followed by a classification layer that classifies the object and a regression layer that regresses the object position. In the affordance detection branch, the 7 × 7 feature map is upsampled to 244 × 244 to obtain a high-resolution map, and a softmax layer assigns each pixel in the 244 × 244 map to its most probable affordance class. The whole network is trained end to end with a multi-task loss function.
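A condensed sketch of the two branches follows (backbone and RPN omitted); the 4096-neuron layers and the 7 × 7 → 30 → 122 → 244 progression follow the text, while the channel widths and adaptation convolutions are our own assumptions:

    import torch.nn as nn

    class AffordanceHead(nn.Module):
        def __init__(self, num_classes, num_affordances, in_ch=512):
            super().__init__()
            # Object detection branch: two fully connected layers of 4096 neurons,
            # then a classification layer and a box-regression layer.
            self.fc = nn.Sequential(
                nn.Flatten(), nn.Linear(in_ch * 7 * 7, 4096), nn.ReLU(),
                nn.Linear(4096, 4096), nn.ReLU())
            self.cls = nn.Linear(4096, num_classes)       # K+1 class scores
            self.loc = nn.Linear(4096, 4 * num_classes)   # per-class box offsets
            # Affordance branch: three deconvolutions take 7x7 RoI features to
            # 30x30, 122x122, and finally 244x244, each preceded by an
            # adaptation convolution (channel widths are assumptions).
            self.aff = nn.Sequential(
                nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(256, 256, 8, stride=4, padding=1), nn.ReLU(),
                nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, 128, 8, stride=4, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, num_affordances, 4, stride=2, padding=1))

        def forward(self, roi_feat):                      # roi_feat: (N, 512, 7, 7)
            h = self.fc(roi_feat)
            return self.cls(h), self.loc(h), self.aff(roi_feat)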
For the multi-task loss: in the end-to-end framework, the classification layer outputs a probability distribution p = (p0, …, pK) over K + 1 object classes, where p is the output of a softmax layer, and the regression layer outputs K + 1 bounding-box regression offsets (each offset containing the box center and box size): t^k = (t^k_x, t^k_y, t^k_w, t^k_h). Each offset t^k corresponds to a class k and is parameterized so that t^k specifies a scale-invariant translation and log-space height/width shift relative to the RPN bounding box. The affordance detection branch outputs, for each pixel i in the RoI, a set of probability distributions m = {m^i}, i ∈ RoI, where m^i = (m^i_0, …, m^i_C) is the output of a softmax layer defined over the C + 1 affordance labels, including background. A multi-task loss L jointly trains bounding-box classification, bounding-box position, and the affordance map, as follows:
L = Lcls + Lloc + Laff  (3)
where Lcls is defined on the output of the classification layer, Lloc on the output of the regression layer, and Laff on the output of the affordance detection branch.
Further, for each RoI the prediction targets are the ground-truth object class u, the ground-truth bounding-box offset v, and the target affordance mask s. The training dataset provides the values of u and v; the target affordance mask s is the intersection between the RoI and its associated ground-truth mask, and pixels inside the RoI that do not belong to the intersection are labeled as background. The object mask is resized to a fixed size (i.e., 244 × 244), and formula (3) is written as:
L(p, u, t^u, v, m, s) = Lcls(p, u) + I[u ≥ 1] Lloc(t^u, v) + I[u ≥ 1] Laff(m, s)  (4)
The first loss Lcls(p, u) is the multinomial cross-entropy loss for classification, computed as follows:
Lcls(p, u) = −log(pu)  (5)
where pu is the softmax output for the ground-truth object class u. The second loss Lloc(t^u, v) is the smooth L1 loss between the regressed box offset t^u (corresponding to the ground-truth object class u) and the ground-truth bounding-box offset v, computed as follows:
Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} SmoothL1(t^u_i − v_i)  (6)

where:

SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
Laff(m, s) is the multinomial cross-entropy loss of the affordance detection branch, computed as follows:

Laff(m, s) = −(1/N) Σ_{i ∈ RoI} log(m^i_{s_i})  (7)
where m^i_{s_i} is the softmax output at pixel i for its true label s_i, and N is the number of pixels in the RoI.
In equation (4), I[u ≥ 1] is an indicator function that outputs 1 when u ≥ 1 and 0 otherwise. The box-position loss Lloc and the affordance detection loss Laff are defined only when the RoI is positive, while the object classification loss Lcls is defined whether the RoI is positive or negative. The affordance detection branch loss differs from the instance segmentation loss, which performs a binary segmentation of each RoI into foreground and background: in the affordance detection problem, affordance labels differ from object labels, and the number of affordance labels in each RoI is not binary, i.e., it is always greater than 2 (including background). The affordance labels therefore rely on a per-pixel softmax and a multinomial cross-entropy loss.
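The loss in equations (4)-(7) can be sketched in PyTorch as follows; the averaging over RoIs and the way the indicator I[u ≥ 1] masks the per-RoI terms are our assumptions about details the text leaves open:

    import torch
    import torch.nn.functional as F

    def multitask_loss(p, u, t_u, v, m, s):
        # p: (N, K+1) class scores; u: (N,) ground-truth classes (long)
        # t_u: (N, 4) offsets for class u; v: (N, 4) ground-truth offsets
        # m: (N, C+1, 244, 244) affordance scores; s: (N, 244, 244) labels (long)
        l_cls = F.cross_entropy(p, u)                          # eq. (5)
        pos = (u >= 1).float()                                 # indicator I[u >= 1]
        l_loc = (F.smooth_l1_loss(t_u, v, reduction='none')
                 .sum(dim=1) * pos).mean()                     # eq. (6)
        l_aff = (F.cross_entropy(m, s, reduction='none')
                 .mean(dim=(1, 2)) * pos).mean()               # eq. (7), per-pixel CE
        return l_cls + l_loc + l_aff                           # eq. (4)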
For training and inference: the network is trained end to end with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. The network is trained for 200,000 iterations; the learning rate is set to 0.001 for the first 150,000 iterations and reduced for the last 50,000. Input images are resized so that the short edge is 600 pixels while the long edge does not exceed 1000 pixels; if the longer edge would exceed 1000 pixels, it is set to 1000 pixels and the image is resized based on that edge. The RPN uses 15 anchors, and its top 2000 RoIs are used to compute the multi-task loss. At the inference stage, the top 1000 RoIs generated by the RPN are selected and the object detection branch is run on them; from the outputs of the detection branch, the boxes whose classification score is higher than 0.9 are selected as the final detected objects. If no box satisfies this condition, the box with the highest classification score is selected as the single detected object. The detected objects are fed as input to the affordance detection branch, and for each pixel in a detected object, the affordance class prediction yields the output affordance label of that pixel. Finally, the resizing strategy is used to resize the predicted 244 × 244 affordance mask of each object to the object (box) size; if detected objects overlap, the final affordance label is determined based on priority.
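The optimizer settings and the image-resizing rule above translate directly into code; the following is a sketch under the stated settings (model is any nn.Module; the exact decay factor for the final 50,000 iterations is not specified in the text, so no scheduler is shown):

    import torch

    def make_optimizer(model):
        # SGD with momentum 0.9, weight decay 0.0005, initial learning rate 0.001.
        return torch.optim.SGD(model.parameters(), lr=0.001,
                               momentum=0.9, weight_decay=0.0005)

    def input_scale(h, w, short=600, long_max=1000):
        # Scale so the short edge becomes 600 px, unless that pushes the long
        # edge past 1000 px, in which case the long edge is set to 1000 px.
        scale = short / min(h, w)
        if scale * max(h, w) > long_max:
            scale = long_max / max(h, w)
        return scale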
Brief description of the drawings
Fig. 1 is a system flowchart of the object affordance detection method based on end-to-end deep learning according to the present invention.
Fig. 2 is a diagram of the affordance network architecture of the object affordance detection method based on end-to-end deep learning according to the present invention.
Fig. 3 is a diagram of the deconvolutional upsampling of the object affordance detection method based on end-to-end deep learning according to the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a system flowchart of the object affordance detection method based on end-to-end deep learning according to the present invention. The method mainly includes: (1) problem formulation; (2) the affordance network architecture; (3) the multi-task loss; (4) training and inference.
The problem formulation framework aims to simultaneously find the position of an object, its class, and its affordances in the image. Following standard computer-vision practice, the position of an object is defined by a rectangle relative to the top-left corner of the image, the object class is defined for the rectangle, and each pixel inside the rectangle is encoded with its affordance; pixel regions of an object that share the same function are considered to have the same affordance. Ideally, all relevant objects in the image are detected, and each pixel in these objects is mapped to its most probable affordance label.
In the end-to-end framework, the classification layer outputs a probability distribution p = (p0, …, pK) over K + 1 object classes, where p is the output of a softmax layer, and the regression layer outputs K + 1 bounding-box regression offsets (each offset containing the box center and box size): t^k = (t^k_x, t^k_y, t^k_w, t^k_h). Each offset t^k corresponds to a class k and is parameterized so that t^k specifies a scale-invariant translation and log-space height/width shift relative to the RPN bounding box. The affordance detection branch outputs, for each pixel i in the RoI, a set of probability distributions m = {m^i}, i ∈ RoI, where m^i = (m^i_0, …, m^i_C) is the output of a softmax layer defined over the C + 1 affordance labels, including background. A multi-task loss L jointly trains bounding-box classification, bounding-box position, and the affordance map, as follows:
L = Lcls + Lloc + Laff  (1)
where Lcls is defined on the output of the classification layer, Lloc on the output of the regression layer, and Laff on the output of the affordance detection branch.
For each RoI the prediction targets are the ground-truth object class u, the ground-truth bounding-box offset v, and the target affordance mask s. The training dataset provides the values of u and v; the target affordance mask s is the intersection between the RoI and its associated ground-truth mask, and pixels inside the RoI that do not belong to the intersection are labeled as background. The object mask is resized to a fixed size (i.e., 244 × 244), and the formula is written as:
L(p, u, t^u, v, m, s) = Lcls(p, u) + I[u ≥ 1] Lloc(t^u, v) + I[u ≥ 1] Laff(m, s)  (2)
The first loss Lcls(p, u) is the multinomial cross-entropy loss for classification, computed as follows:
Lcls(p, u) = −log(pu)  (3)
where pu is the softmax output for the ground-truth object class u. The second loss Lloc(t^u, v) is the smooth L1 loss between the regressed box offset t^u (corresponding to the ground-truth object class u) and the ground-truth bounding-box offset v, computed as follows:
Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} SmoothL1(t^u_i − v_i)  (4)

where:

SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
Laff(m, s) is the multinomial cross-entropy loss of the affordance detection branch, computed as follows:

Laff(m, s) = −(1/N) Σ_{i ∈ RoI} log(m^i_{s_i})  (5)
where m^i_{s_i} is the softmax output at pixel i for its true label s_i, and N is the number of pixels in the RoI.
In equation (2), I[u ≥ 1] is an indicator function that outputs 1 when u ≥ 1 and 0 otherwise. The box-position loss Lloc and the affordance detection loss Laff are defined only when the RoI is positive, while the object classification loss Lcls is defined whether the RoI is positive or negative. The affordance detection branch loss differs from the instance segmentation loss, which performs a binary segmentation of each RoI into foreground and background: in the affordance detection problem, affordance labels differ from object labels, and the number of affordance labels in each RoI is not binary, i.e., it is always greater than 2 (including background). The affordance labels therefore rely on a per-pixel softmax and a multinomial cross-entropy loss.
The network is trained end to end with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. The network is trained for 200,000 iterations; the learning rate is set to 0.001 for the first 150,000 iterations and reduced for the last 50,000. Input images are resized so that the short edge is 600 pixels while the long edge does not exceed 1000 pixels; if the longer edge would exceed 1000 pixels, it is set to 1000 pixels and the image is resized based on that edge. The RPN uses 15 anchors, and its top 2000 RoIs are used to compute the multi-task loss. At the inference stage, the top 1000 RoIs generated by the RPN are selected and the object detection branch is run on them; from the outputs of the detection branch, the boxes whose classification score is higher than 0.9 are selected as the final detected objects. If no box satisfies this condition, the box with the highest classification score is selected as the single detected object. The detected objects are fed as input to the affordance detection branch, and for each pixel in a detected object, the affordance class prediction yields the output affordance label of that pixel. Finally, the resizing strategy is used to resize the predicted 244 × 244 affordance mask of each object to the object (box) size; if detected objects overlap, the final affordance label is determined based on priority.
Fig. 2 is a diagram of the affordance network architecture of the object affordance detection method based on end-to-end deep learning according to the present invention. The affordance network architecture has three main components: 1) a region-of-interest alignment layer (RoIAlign) that correctly computes the features of each region of interest (RoI) from the image feature map; 2) a sequence of deconvolutional layers that upsamples the RoI feature map to a high resolution, obtaining a smooth, fine-grained affordance map; 3) a robust resizing strategy for the training masks that supervises the affordances.
Regarding the region-of-interest alignment layer (RoIAlign): a region proposal network (RPN) performs region-based object detection; it shares weights with the main convolutional backbone and outputs bounding boxes of different sizes. With a RoIPool layer, each RoI is pooled from the image feature map into a small feature map of fixed size (e.g., 7 × 7) using rounding, which misaligns the extracted features with the RoI. The RoIAlign layer instead properly aligns the extracted features with the RoI without any rounding operation: it uses bilinear interpolation to compute the values at regularly sampled positions inside each RoI bin and aggregates the results with a max operation, avoiding the misalignment between the RoI and the extracted features.
The affordance detection branch requires supervision at a fixed size (e.g., 244 × 244), and resizing a ground-truth mask with a single threshold does not work in the affordance detection problem, so a multi-threshold resizing strategy is proposed. Given an original ground-truth mask, assume without loss of generality that the mask contains n distinct labels P = (c0, c1, …, cn−1), and let P̄ = (c̄0, c̄1, …, c̄n−1) be the values linearly mapped from P; the mapping from P to P̄ converts the original mask into a new mask. The converted mask is resized to the predefined mask size, and a threshold is applied to the resized mask as follows:
ρ(x, y) = c̄j, if |ρ(x, y) − c̄j| < α; 0, otherwise  (6)
where ρ(x, y) is a pixel value of the resized mask, c̄j is one of the values in P̄, and α is a hyperparameter set to 0.005. The values of the thresholded mask are remapped to the original label values (using the mapping from P̄ back to P) to obtain the training mask.
The end-to-end deep learning network consists of two branches, one for object detection and one for affordance detection. Given an input image, deep features are extracted from the image with a VGG16 backbone; an RPN that shares weights with the convolutional backbone then generates candidate bounding boxes (RoIs). For each RoI, the RoIAlign layer extracts and pools its corresponding features into a 7 × 7 feature map. In the object detection branch, two fully connected layers with 4096 neurons each are followed by a classification layer that classifies the object and a regression layer that regresses the object position. In the affordance detection branch, the 7 × 7 feature map is upsampled to 244 × 244 to obtain a high-resolution map, and a softmax layer assigns each pixel in the 244 × 244 map to its most probable affordance class. The whole network is trained end to end with a multi-task loss function.
Fig. 3 is a diagram of the deconvolutional upsampling of the object affordance detection method based on end-to-end deep learning according to the present invention. Segmentation approaches usually represent the object mask with a mask of small fixed size (e.g., 14 × 14 or 28 × 28), and the pixel values in each predicted RoI mask are binary, i.e., foreground or background. Because each object contains multiple affordance classes, such a binary mask does not work well for the affordance detection problem, so deconvolutional layers are used to obtain a high-resolution affordance mask. Formally, given an input feature map of size Si, a deconvolutional layer performs the operation opposite to a convolutional layer in order to build a larger output map of size So; the relation between Si and So is:
So = s·(Si − 1) + Sf − 2d  (7)
where Sf is the filter size, and s and d are the stride and padding parameters, respectively. In practice, the RoIAlign layer outputs a 7 × 7 feature map, which is upsampled to a higher resolution with three deconvolutional layers. The first deconvolutional layer has padding d = 1, stride s = 4, and kernel size Sf = 8, creating a 30 × 30 map; similarly, the second layer (d = 1, s = 4, Sf = 8) creates a 122 × 122 map, and the third layer (d = 1, s = 2, Sf = 4) creates the final high-resolution map of size 244 × 244. Before each deconvolutional layer, a convolutional layer learns the features to be deconvolved; it can be seen as an adaptation between two consecutive deconvolutional layers.
For those skilled in the art, the present invention is not limited to the details of the above embodiments, and the present invention may be implemented in other specific forms without departing from the spirit and scope of the invention. In addition, those skilled in the art may make various modifications and variations to the present invention without departing from the spirit and scope of the invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.

Claims (10)

1. An object affordance detection method based on end-to-end deep learning, characterized in that the method mainly includes: problem formulation (1); an affordance network architecture (2); a multi-task loss (3); and training and inference (4).
2. The problem formulation (1) according to claim 1, characterized in that the framework aims to simultaneously find the position of an object, its class, and its affordances in the image; following standard computer-vision practice, the position of an object is defined by a rectangle relative to the top-left corner of the image, the object class is defined for the rectangle, and each pixel inside the rectangle is encoded with its affordance; pixel regions of an object that share the same function are considered to have the same affordance; ideally, all relevant objects in the image are detected, and each pixel in these objects is mapped to its most probable affordance label.
3. The affordance network architecture (2) according to claim 1, characterized in that the affordance network architecture has three main components: 1) a region-of-interest alignment layer (RoIAlign) for correctly computing the features of each region of interest (RoI) from the image feature map; 2) a sequence of deconvolutional layers that upsamples the RoI feature map to a high resolution, obtaining a smooth, fine-grained affordance map; 3) a robust resizing strategy for the training masks that supervises the affordances.
4. The region-of-interest alignment layer (RoIAlign) according to claim 3, characterized in that a region proposal network (RPN) performs region-based object detection, shares weights with the main convolutional backbone, and outputs bounding boxes of different sizes; with a RoIPool layer, each RoI is pooled from the image feature map into a small feature map of fixed size (e.g., 7 × 7) using rounding, which misaligns the extracted features with the RoI; the RoIAlign layer properly aligns the extracted features with the RoI without any rounding operation, using bilinear interpolation to compute the values at regularly sampled positions inside each RoI bin and aggregating the results with a max operation, thereby avoiding the misalignment between the RoI and the extracted features.
5. The high-resolution map according to claim 3, characterized in that the object mask is usually represented with a mask of small fixed size (e.g., 14 × 14 or 28 × 28), and the pixel values in each predicted RoI mask are binary, i.e., foreground or background; because each object contains multiple affordance classes, such a binary mask does not work well for the affordance detection problem, so deconvolutional layers are used to obtain a high-resolution affordance mask; formally, given an input feature map of size Si, a deconvolutional layer performs the operation opposite to a convolutional layer in order to build a larger output map of size So, and the relation between Si and So is:
So = s·(Si − 1) + Sf − 2d  (1)
where Sf is the filter size, and s and d are the stride and padding parameters, respectively; in practice, the RoIAlign layer outputs a 7 × 7 feature map, which is upsampled to a higher resolution with three deconvolutional layers; the first deconvolutional layer has padding d = 1, stride s = 4, and kernel size Sf = 8, creating a 30 × 30 map; similarly, the second layer (d = 1, s = 4, Sf = 8) creates a 122 × 122 map, and the third layer (d = 1, s = 2, Sf = 4) creates the final high-resolution map of size 244 × 244; before each deconvolutional layer, a convolutional layer learns the features to be deconvolved, which can be seen as an adaptation between two consecutive deconvolutional layers.
6. The training masks according to claim 3, characterized in that the affordance detection branch requires supervision at a fixed size (e.g., 244 × 244), and resizing a ground-truth mask with a single threshold does not work in the affordance detection problem, so a multi-threshold resizing strategy is proposed; given an original ground-truth mask, assume without loss of generality that the mask contains n distinct labels P = (c0, c1, …, cn−1), and let P̄ = (c̄0, c̄1, …, c̄n−1) be the values linearly mapped from P; the mapping from P to P̄ converts the original mask into a new mask; the converted mask is resized to the predefined mask size, and a threshold is applied to the resized mask as follows:
ρ(x, y) = c̄j, if |ρ(x, y) − c̄j| < α; 0, otherwise  (2)
where ρ(x, y) is a pixel value of the resized mask, c̄j is one of the values in P̄, and α is a hyperparameter set to 0.005; the values of the thresholded mask are remapped to the original label values (using the mapping from P̄ back to P) to obtain the training mask.
7. The end-to-end deep learning according to claim 1, characterized in that the network consists of two branches, one for object detection and one for affordance detection; given an input image, deep features are extracted from the image with a VGG16 backbone, and an RPN that shares weights with the convolutional backbone then generates candidate bounding boxes (RoIs); for each RoI, the RoIAlign layer extracts and pools its corresponding features into a 7 × 7 feature map; in the object detection branch, two fully connected layers with 4096 neurons each are followed by a classification layer that classifies the object and a regression layer that regresses the object position; in the affordance detection branch, the 7 × 7 feature map is upsampled to 244 × 244 to obtain a high-resolution map, and a softmax layer assigns each pixel in the 244 × 244 map to its most probable affordance class; the whole network is trained end to end with a multi-task loss function.
8. The multi-task loss (3) according to claim 1, characterized in that, in the end-to-end framework, the classification layer outputs a probability distribution p = (p0, …, pK) over K + 1 object classes, where p is the output of a softmax layer, and the regression layer outputs K + 1 bounding-box regression offsets (each offset containing the box center and box size): t^k = (t^k_x, t^k_y, t^k_w, t^k_h); each offset t^k corresponds to a class k and is parameterized so that t^k specifies a scale-invariant translation and log-space height/width shift relative to the RPN bounding box; the affordance detection branch outputs, for each pixel i in the RoI, a set of probability distributions m = {m^i}, i ∈ RoI, where m^i = (m^i_0, …, m^i_C) is the output of a softmax layer defined over the C + 1 affordance labels, including background; a multi-task loss L jointly trains bounding-box classification, bounding-box position, and the affordance map, as follows:
L = Lcls + Lloc + Laff  (3)
where Lcls is defined on the output of the classification layer, Lloc on the output of the regression layer, and Laff on the output of the affordance detection branch.
9. The loss according to claim 8, characterized in that for each RoI the prediction targets are the ground-truth object class u, the ground-truth bounding-box offset v, and the target affordance mask s; the training dataset provides the values of u and v, and the target affordance mask s is the intersection between the RoI and its associated ground-truth mask; pixels inside the RoI that do not belong to the intersection are labeled as background; the object mask is resized to a fixed size (i.e., 244 × 244), and formula (3) is written as:
L(p, u, t^u, v, m, s) = Lcls(p, u) + I[u ≥ 1] Lloc(t^u, v) + I[u ≥ 1] Laff(m, s)  (4)
The first loss Lcls(p, u) is the multinomial cross-entropy loss for classification, computed as follows:
Lcls(p, u) = −log(pu)  (5)
where pu is the softmax output for the ground-truth object class u; the second loss Lloc(t^u, v) is the smooth L1 loss between the regressed box offset t^u (corresponding to the ground-truth object class u) and the ground-truth bounding-box offset v, computed as follows:
Lloc(t^u, v) = Σ_{i ∈ {x, y, w, h}} SmoothL1(t^u_i − v_i)  (6)
where:

SmoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise
Laff(m, s) is the multinomial cross-entropy loss of the affordance detection branch, computed as follows:
Laff(m, s) = −(1/N) Σ_{i ∈ RoI} log(m^i_{s_i})  (7)
where m^i_{s_i} is the softmax output at pixel i for its true label s_i, and N is the number of pixels in the RoI; in equation (4), I[u ≥ 1] is an indicator function that outputs 1 when u ≥ 1 and 0 otherwise; the box-position loss Lloc and the affordance detection loss Laff are defined only when the RoI is positive, while the object classification loss Lcls is defined whether the RoI is positive or negative; the affordance detection branch loss differs from the instance segmentation loss, which performs a binary segmentation of each RoI into foreground and background; in the affordance detection problem, affordance labels differ from object labels, and the number of affordance labels in each RoI is not binary, i.e., it is always greater than 2 (including background); therefore, the affordance labels rely on a per-pixel softmax and a multinomial cross-entropy loss.
10. The training and inference (4) according to claim 1, characterized in that the network is trained end to end with stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005; the network is trained for 200,000 iterations, the learning rate is set to 0.001 for the first 150,000 iterations and reduced for the last 50,000; input images are resized so that the short edge is 600 pixels while the long edge does not exceed 1000 pixels; if the longer edge would exceed 1000 pixels, it is set to 1000 pixels and the image is resized based on that edge; the RPN uses 15 anchors, and its top 2000 RoIs are used to compute the multi-task loss; at the inference stage, the top 1000 RoIs generated by the RPN are selected and the object detection branch is run on them; from the outputs of the detection branch, the boxes whose classification score is higher than 0.9 are selected as the final detected objects; if no box satisfies this condition, the box with the highest classification score is selected as the single detected object; the detected objects are fed as input to the affordance detection branch, and for each pixel in a detected object, the affordance class prediction yields the output affordance label of that pixel; finally, the resizing strategy is used to resize the predicted 244 × 244 affordance mask of each object to the object (box) size, and if detected objects overlap, the final affordance label is determined based on priority.
CN201711139653.8A 2017-11-16 2017-11-16 Object affordance detection method based on end-to-end deep learning Withdrawn CN107944443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711139653.8A CN107944443A (en) 2017-11-16 2017-11-16 Object affordance detection method based on end-to-end deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711139653.8A CN107944443A (en) 2017-11-16 2017-11-16 Object affordance detection method based on end-to-end deep learning

Publications (1)

Publication Number Publication Date
CN107944443A true CN107944443A (en) 2018-04-20

Family

ID=61932635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711139653.8A Withdrawn CN107944443A (en) 2017-11-16 2017-11-16 One kind carries out object consistency detection method based on end-to-end deep learning

Country Status (1)

Country Link
CN (1) CN107944443A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145898A (en) * 2018-07-26 2019-01-04 清华大学深圳研究生院 A kind of object detecting method based on convolutional neural networks and iterator mechanism
CN109190537A (en) * 2018-08-23 2019-01-11 浙江工商大学 A kind of more personage's Attitude estimation methods based on mask perceived depth intensified learning
CN109299434A (en) * 2018-09-04 2019-02-01 重庆公共运输职业学院 Cargo customs clearance big data is intelligently graded and sampling observation rate computing system
CN109801297A (en) * 2019-01-14 2019-05-24 浙江大学 A kind of image panorama segmentation prediction optimization method realized based on convolution
CN109871798A (en) * 2019-02-01 2019-06-11 浙江大学 A kind of remote sensing image building extracting method based on convolutional neural networks
CN110008808A (en) * 2018-12-29 2019-07-12 北京迈格威科技有限公司 Panorama dividing method, device and system and storage medium
CN110298364A (en) * 2019-06-27 2019-10-01 安徽师范大学 Based on the feature selection approach of multitask under multi-threshold towards functional brain network
CN110349167A (en) * 2019-07-10 2019-10-18 北京悉见科技有限公司 A kind of image instance dividing method and device
CN110633595A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device by utilizing bilinear interpolation
CN110909748A (en) * 2018-09-17 2020-03-24 斯特拉德视觉公司 Image encoding method and apparatus using multi-feed
CN110956131A (en) * 2019-11-27 2020-04-03 北京迈格威科技有限公司 Single-target tracking method, device and system
WO2020155518A1 (en) * 2019-02-03 2020-08-06 平安科技(深圳)有限公司 Object detection method and device, computer device and storage medium
WO2020156303A1 (en) * 2019-01-30 2020-08-06 广州市百果园信息技术有限公司 Method and apparatus for training semantic segmentation network, image processing method and apparatus based on semantic segmentation network, and device and storage medium
CN112684704A (en) * 2020-12-18 2021-04-20 华南理工大学 End-to-end motion control method, system, device and medium based on deep learning
CN112692875A (en) * 2021-01-06 2021-04-23 华南理工大学 Digital twin system for operation and maintenance of welding robot
CN112799401A (en) * 2020-12-28 2021-05-14 华南理工大学 End-to-end robot vision-motion navigation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204555A (en) * 2016-06-30 2016-12-07 天津工业大学 A kind of combination Gbvs model and the optic disc localization method of phase equalization
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204555A (en) * 2016-06-30 2016-12-07 天津工业大学 A kind of combination Gbvs model and the optic disc localization method of phase equalization
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
THANH-TOAN DO ET AL.: "AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection", 《ARXIV》 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633595B (en) * 2018-06-21 2022-12-02 北京京东尚科信息技术有限公司 Target detection method and device by utilizing bilinear interpolation
CN110633595A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device by utilizing bilinear interpolation
CN109145898A (en) * 2018-07-26 2019-01-04 清华大学深圳研究生院 A kind of object detecting method based on convolutional neural networks and iterator mechanism
CN109190537A (en) * 2018-08-23 2019-01-11 浙江工商大学 A kind of more personage's Attitude estimation methods based on mask perceived depth intensified learning
CN109190537B (en) * 2018-08-23 2020-09-29 浙江工商大学 Mask perception depth reinforcement learning-based multi-person attitude estimation method
CN109299434A (en) * 2018-09-04 2019-02-01 重庆公共运输职业学院 Cargo customs clearance big data is intelligently graded and sampling observation rate computing system
CN110909748A (en) * 2018-09-17 2020-03-24 斯特拉德视觉公司 Image encoding method and apparatus using multi-feed
CN110909748B (en) * 2018-09-17 2023-09-19 斯特拉德视觉公司 Image encoding method and apparatus using multi-feed
CN110008808A (en) * 2018-12-29 2019-07-12 北京迈格威科技有限公司 Panorama dividing method, device and system and storage medium
CN109801297A (en) * 2019-01-14 2019-05-24 浙江大学 A kind of image panorama segmentation prediction optimization method realized based on convolution
WO2020156303A1 (en) * 2019-01-30 2020-08-06 广州市百果园信息技术有限公司 Method and apparatus for training semantic segmentation network, image processing method and apparatus based on semantic segmentation network, and device and storage medium
CN109871798A (en) * 2019-02-01 2019-06-11 浙江大学 A kind of remote sensing image building extracting method based on convolutional neural networks
WO2020155518A1 (en) * 2019-02-03 2020-08-06 平安科技(深圳)有限公司 Object detection method and device, computer device and storage medium
CN110298364A (en) * 2019-06-27 2019-10-01 安徽师范大学 Based on the feature selection approach of multitask under multi-threshold towards functional brain network
CN110349167A (en) * 2019-07-10 2019-10-18 北京悉见科技有限公司 A kind of image instance dividing method and device
CN110956131A (en) * 2019-11-27 2020-04-03 北京迈格威科技有限公司 Single-target tracking method, device and system
CN110956131B (en) * 2019-11-27 2024-01-05 北京迈格威科技有限公司 Single-target tracking method, device and system
CN112684704A (en) * 2020-12-18 2021-04-20 华南理工大学 End-to-end motion control method, system, device and medium based on deep learning
CN112799401A (en) * 2020-12-28 2021-05-14 华南理工大学 End-to-end robot vision-motion navigation method
CN112692875B (en) * 2021-01-06 2021-08-10 华南理工大学 Digital twin system for operation and maintenance of welding robot
CN112692875A (en) * 2021-01-06 2021-04-23 华南理工大学 Digital twin system for operation and maintenance of welding robot

Similar Documents

Publication Publication Date Title
CN107944443A (en) Object affordance detection method based on end-to-end deep learning
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN104809187B (en) A kind of indoor scene semanteme marking method based on RGB D data
Volpi et al. Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images
Yang et al. Layered object models for image segmentation
CN105869178B (en) A kind of complex target dynamic scene non-formaldehyde finishing method based on the convex optimization of Multiscale combination feature
CN106920243A (en) The ceramic material part method for sequence image segmentation of improved full convolutional neural networks
US20160055237A1 (en) Method for Semantically Labeling an Image of a Scene using Recursive Context Propagation
CN107909015A (en) Hyperspectral image classification method based on convolutional neural networks and empty spectrum information fusion
CN104281853A (en) Behavior identification method based on 3D convolution neural network
CN110929665B (en) Natural scene curve text detection method
CN106599805A (en) Supervised data driving-based monocular video depth estimating method
CN104298974A (en) Human body behavior recognition method based on depth video sequence
JP7329041B2 (en) Method and related equipment for synthesizing images based on conditional adversarial generation networks
Liu et al. Robust salient object detection for RGB images
CN112734789A (en) Image segmentation method and system based on semi-supervised learning and point rendering
Hernández et al. CUDA-based parallelization of a bio-inspired model for fast object classification
Zhang et al. Class relatedness oriented-discriminative dictionary learning for multiclass image classification
CN107657276B (en) Weak supervision semantic segmentation method based on searching semantic class clusters
CN109726725A (en) The oil painting writer identification method of heterogeneite Multiple Kernel Learning between a kind of class based on large-spacing
Vinoth Kumar et al. A decennary survey on artificial intelligence methods for image segmentation
Liu et al. Dunhuang murals contour generation network based on convolution and self-attention fusion
CN103440651A (en) Multi-label image annotation result fusion method based on rank minimization
CN104778683A (en) Multi-modal image segmenting method based on functional mapping
Wang et al. Self-attention deep saliency network for fabric defect detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20180420