CN104217225A

CN104217225A - A visual target detection and labeling method

Info

Publication number: CN104217225A
Application number: CN201410442817.4A
Authority: CN
Inventors: 黄凯奇; 任伟强; 王冲; 张俊格
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2014-09-02
Filing date: 2014-09-02
Publication date: 2014-12-17
Anticipated expiration: 2034-09-02
Also published as: CN104217225B

Abstract

The present invention discloses a visual target detection and labeling method. The method includes: an image input step, in which an image to be detected is input; a candidate region extraction step, in which candidate windows are extracted from the image to be detected as candidate regions using a selective search algorithm; a feature description extraction step, in which a pre-trained large-scale convolutional neural network computes a feature description of each candidate region and outputs it; a visual target prediction step, in which a pre-trained object detection model predicts on the candidate regions, based on their feature descriptions, to estimate the regions containing the visual target; and a position labeling step, in which the position of the visual target is labeled according to the estimated result. Experiments show that, compared with mainstream weakly supervised visual target detection and labeling methods, the present invention has a stronger ability to mine positive samples and broader application prospects, and is suitable for visual target detection and automatic labeling tasks on large-scale data sets.

Description

A visual target detection and labeling method

Technical field

The present invention relates to the technical field of object detection in computer vision, and in particular to a visual target detection and labeling method based on weakly supervised learning.

Background art

Detecting objects in images and automatically labeling their positions is a basic problem in computer vision and one of the core problems studied in this field. Object detection means, given a test image, answering the question of what object is where. Object detection is widely used in many other vision research problems, such as object recognition, pedestrian detection, face detection, foreground detection in surveillance scenes, motion tracking, and behavior recognition and analysis.

General object detection requires a database in which object bounding rectangles have been labeled, so that purely supervised object detection models such as histograms of oriented gradients (HOG) and deformable part models (DPM) can be trained. The rapid development of digital media technology has caused explosive growth in image and video data, and the spread of the Internet has made it easier for people to obtain massive amounts of image and video data. Faced with such volumes of image data, a serious problem for current object detection algorithms is that most of the data has no object position annotation available. Annotating positions in massive image data is a highly labor-intensive and costly task.

By comparison, labeling a whole image with a category is much easier, and methods such as pre-filtering and unsupervised clustering make it possible to build a fairly large classification database in a short time. Using an image database that has only category labels to learn object categories and locate objects automatically, that is, achieving visual target detection and labeling through weakly supervised learning, therefore has significant theoretical value and practical significance.

In traditional weakly supervised learning algorithms, candidate regions are generally selected with a candidate window algorithm based on dense sampling; the number of windows is very large, and the recall rate and overlap are not ideal. Candidate windows are also usually described with a bag-of-words model. A bag-of-words model typically has few levels of feature transformation, so the resulting features can be regarded as a mid-level representation and lack the higher-level information that would allow a model to automatically mine an object appearance model from images.

Mainstream methods for weakly supervised object detection and labeling currently include multiple-instance learning, topic models, and conditional random fields. Many traditional multiple-instance learning algorithms depend heavily on kernel learning or distance-metric-based learning frameworks and use highly complex optimization algorithms such as heuristic algorithms, quadratic programming, and integer programming, so they are difficult to apply efficiently to large-scale data sets.

Therefore, how to improve and optimize weakly supervised learning algorithms so that object detection and automatic position labeling can be performed efficiently on massive numbers of images is an important problem that urgently needs to be solved in the prior art.

Summary of the invention

In view of this, the main purpose of the present invention is to provide a visual target detection and labeling method for weakly supervised scenarios that, given only image-level category labels, can automatically locate targets of interest in an image collection and can also automatically label object positions in images.

To achieve the above purpose, the present invention provides the following technical solution:

A visual target detection and labeling method, characterized by comprising:

an image input step, in which an image to be detected is input;

a candidate region extraction step, in which candidate windows are extracted from the image to be detected as candidate regions using a selective search algorithm;

a feature description extraction step, in which a pre-trained large-scale convolutional neural network computes a feature description of each candidate region and outputs it;

a visual target prediction step, in which a pre-trained object detection model predicts on the candidate regions, based on their feature descriptions, to estimate the regions containing the visual target;

a position labeling step, in which the position of the visual target is labeled according to the estimated result.

Preferably, the selective search algorithm in the candidate region extraction step further comprises:

converting the color space of the image to be detected into predetermined color spaces; segmenting the image with a graph-based over-segmentation algorithm; repeatedly merging the two most similar regions to obtain a hierarchical segmentation of the image; and merging and de-duplicating the segmented region sets from the multiple color spaces and multiple levels to obtain the candidate region set of the image.

Preferably, the predetermined color spaces comprise: HSV, RGI, I, Lab.

Preferably, the pre-trained convolutional neural network is a convolutional neural network trained on the object classification database ImageNet 2013.

Preferably, the method further comprises an object detection model training step, which specifically comprises:

inputting training set images with image category labels;

extracting candidate windows from the training set images as candidate regions using a selective search algorithm;

computing a feature description of each candidate region with the pre-trained large-scale convolutional neural network and outputting it;

training an object appearance model with a multiple-instance linear support vector machine based on the feature descriptions of the candidate regions.

Preferably, training the object detection model using the multiple-instance linear support vector machine comprises:

training the object detection model with MILinear, an unconstrained large-margin multiple-instance learning algorithm, whose objective function is:

\min_{w} \frac{1}{2} \|w\|^{2} + \frac{C}{|B|} \sum_{i=1}^{|B|} \left( \max \left( 0, 1 - y^{i} w^{T} B^{i}_{I_{i}} \right) \right)^{2},

where an image I^{i} is described by a bag B^{i} containing n^{i} d-dimensional instances, the j-th instance being denoted B^{i}_{j}; if a bag contains at least one positive instance, the label y^{i} of the bag is +1, and if all instances are negative, the label y^{i} of the bag is -1; the training set is B = {(B^{i}, y^{i}) | i = 1, 2, ..., N}, where |B| = N is the number of training samples; w is the classifier coefficient vector; C is the regularization parameter controlling the penalty on misclassification; and I_{i} is the index of the instance with the highest prediction score in bag B^{i}.

Preferably, the MILinear problem is solved with a trust region Newton method, comprising:

noting that the optimization objective of MILinear is an unconstrained, differentiable function whose first derivative is:

g(w) = w + 2 \frac{C}{|B|} \sum_{i \in I_{B}} \left( w^{T} B^{i}_{I_{i}} {B^{i}_{I_{i}}}^{T} - y^{i} {B^{i}_{I_{i}}}^{T} \right),

where

I_{B} = \{ i \mid 1 - y^{i} w^{T} B^{i}_{I_{i}} > 0, \; i = 1, 2, \ldots, |B| \}

is the set of bags whose margin is less than 1;

computing the generalized Hessian with the following formula,

where I is the identity matrix;

optimizing the objective function iteratively by computing

s^{k} = \arg\min_{s} q_{k}(s), \quad q_{k}(s) = \nabla f(w^{k})^{T} s + \frac{1}{2} s^{T} \nabla^{2} f(w^{k}) s = g(w^{k})^{T} s + \frac{1}{2} s^{T} H(w^{k}) s, \quad \text{s.t. } \|s\| \leq \Delta_{k},

where s^{k} is the update step, \Delta_{k} is the trust region, and g(w^{k}) and H(w^{k}) are the first and second derivatives of the MILinear objective function, respectively.

After the update step s^{k} has been solved for, w^{k} is updated if the actual decrease of the objective function is large enough; otherwise w^{k} is kept unchanged, according to the following formula:

w^{k+1} = \begin{cases} w^{k} + s^{k} & \text{if } \frac{f(w^{k} + s^{k}) - f(w^{k})}{q_{k}(s^{k})} > \eta_{0}, \\ w^{k} & \text{otherwise,} \end{cases}

where \eta_{0} is a predefined positive number that sets the minimum acceptable actual decrease of the function.

Preferably, the method further comprises running a bag decomposition algorithm with the trained object detection model, which iteratively reduces the ambiguity of the positive bags, comprising:

applying the object detection model trained by MILinear to the training set images to obtain a prediction probability for every candidate window; decomposing each positive bag into a positive bag and a negative bag according to these prediction probabilities; and training a new MILinear object detection model on the data set obtained after decomposition, where the decomposition process may be iterated several times.

The visual target detection and labeling method provided by the present invention has several clear advantages:

1) Selective search, based on the results of a large number of over-segmentations, yields the candidate windows in which targets are most likely to appear. Windows obtained this way preserve object boundaries well and overlap strongly with real objects, while maintaining a high recall rate with only several hundred to several thousand candidate windows.

2) A convolutional neural network pre-trained on a very large image classification data set extracts feature representations from the candidate windows. This yields rich features with strong high-level semantic information, which allows the model to automatically mine an object appearance model from images.

3) A new multiple-instance linear support vector machine model is adopted, together with an optimization algorithm based on a trust region Newton method, so that weakly supervised detection models can be learned efficiently on large-scale data sets.

4) A new bag decomposition algorithm decomposes each positive bag into a positive bag and a negative bag, which greatly reduces the ambiguity in positive bags and effectively improves the performance of the weakly supervised detection model.

Brief description of the drawings

Fig. 1 is a flow chart of model training and testing for the visual target detection and labeling method based on weakly supervised learning according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of MILinear and of MILinear with bag decomposition according to an embodiment of the present invention;

Fig. 3 is a schematic diagram comparing the results of optimization with the trust region Newton method and with other optimization methods according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the relationship between the prediction score of the trained object detection model and the sample overlap according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the performance improvement for several object categories over the iterations of the bag decomposition algorithm according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of detection results on the Pascal VOC2007 database for the object detection model trained according to an embodiment of the present invention.

Detailed description

To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in more detail below with reference to specific embodiments and the accompanying drawings.

The key ideas of the present invention are: 1) selective search, based on the results of a large number of over-segmentations, achieves higher target recall and overlap with fewer candidate windows; 2) a convolutional neural network pre-trained on a very large image classification data set extracts feature representations from the candidate windows, yielding rich features with strong high-level semantic information; 3) a new multiple-instance linear support vector machine model is adopted and optimized with an algorithm based on a trust region Newton method, so that weakly supervised detection models can be learned efficiently on large-scale data sets; 4) a new bag decomposition algorithm decomposes each positive bag into a positive bag and a negative bag, which greatly reduces the ambiguity in positive bags and effectively improves the performance of the weakly supervised detection model.

The upper half of Fig. 1 is the model training flow chart of the visual target detection and labeling method based on weakly supervised learning according to an embodiment of the present invention. First, an image is input. Second, selective search is applied to the input image to extract candidate windows, giving the candidate regions. Then the candidate window samples are fed in sequence into the convolutional neural network to obtain a feature description of each candidate region, i.e., a region representation. Finally, based on the feature descriptions, the weakly supervised learning algorithm proposed by the present invention automatically learns the object appearance model, i.e., mines positive samples. The lower half of Fig. 1 shows the test process of the method. For a test image, candidate windows are extracted in the same way as in training, a deep convolutional neural network then describes the window regions, and finally the previously trained object appearance model classifies the window regions to accomplish the target detection or labeling task. The method comprises the following steps:

S1, candidate region extraction: candidate windows are extracted from the training set images as candidate regions using a selective search algorithm.

Given only image category labels, one knows only that an image contains objects of certain categories, such as "car" or "person", but not where the "car" or "person" is; the bounding rectangle of the object must be determined by an algorithm. If all possible rectangles were extracted from an image, their number would be enormous and processing them would be impractical. A candidate region extraction algorithm extracts a limited number of likely object rectangles that cover as many of the objects to be located as possible. Three indicators are most important here: first, the number of candidate windows, where fewer windows means higher algorithmic efficiency; second, the recall rate, i.e., the ratio of the number of real objects covered by the candidate windows to the total number of objects; third, the overlap between the candidate windows and the bounding rectangles of the real objects. Candidate window algorithms based on dense sampling produce a very large number of windows, and their recall and overlap are not ideal.
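The recall and overlap indicators described above can be sketched as follows. This is a minimal illustration, not the patent's evaluation code: it assumes that overlap is measured as intersection over union (IoU) on `(x1, y1, x2, y2)` boxes and that an object counts as recalled when some candidate reaches a threshold of 0.5. The function names `iou` and `recall` and the threshold are assumptions introduced here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(candidates, ground_truth, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one candidate.

    The 0.5 threshold is an assumed convention, not a value from the text.
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for gt in ground_truth
        if any(iou(c, gt) >= thresh for c in candidates)
    )
    return hits / len(ground_truth)
```

For example, `recall([(0, 0, 2, 2)], [(0, 0, 2, 2), (5, 5, 6, 6)])` covers one of the two objects.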

The selective search algorithm adopted by the present invention is a candidate window extraction algorithm based on over-segmentation. It over-segments the image with different parameters to obtain different image blocks, then merges the blocks hierarchically to find the bounding rectangles most likely to contain objects. The specific steps are as follows. First, the original image is converted from the RGB color space to other color spaces, including HSV, RGI, I, and Lab. Then a graph-based over-segmentation algorithm segments each converted image, and the two most similar regions are merged repeatedly, in a hierarchical fashion, to obtain a hierarchical segmentation of the image. Finally, the segmented region sets from the multiple color spaces and multiple levels are combined and de-duplicated to give the candidate region set of the image.

Selective search runs efficiently and achieves high recall and overlap with several hundred to several thousand candidate windows.
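The hierarchical merging and de-duplication steps can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the selective search implementation used by the invention: each initial region is a hypothetical `(box, size, feature)` tuple that a real graph-based over-segmentation would supply, similarity is reduced to the difference of a single scalar feature, and every pair of regions is treated as mergeable (the real algorithm only merges neighboring regions).

```python
from itertools import combinations


def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) enclosing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def color_similarity(a, b):
    """Toy similarity: regions with closer scalar features are more similar."""
    return -abs(a[2] - b[2])


def hierarchical_grouping(regions, similarity=color_similarity):
    """Repeatedly merge the two most similar regions; collect every box seen."""
    regions = list(regions)
    proposals = [r[0] for r in regions]
    while len(regions) > 1:
        i, j = max(
            combinations(range(len(regions)), 2),
            key=lambda p: similarity(regions[p[0]], regions[p[1]]),
        )
        (box_a, size_a, feat_a), (box_b, size_b, feat_b) = regions[i], regions[j]
        merged = (
            merge_boxes(box_a, box_b),
            size_a + size_b,
            (feat_a * size_a + feat_b * size_b) / (size_a + size_b),
        )
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged[0])
    return proposals


def candidate_set(per_colorspace_regions):
    """Combine the proposals from every color space and drop duplicate boxes."""
    seen, out = set(), []
    for regions in per_colorspace_regions:
        for box in hierarchical_grouping(regions):
            if box not in seen:
                seen.add(box)
                out.append(box)
    return out
```

Calling `candidate_set` with one region list per color space returns the de-duplicated union of all boxes produced at every level of each hierarchy.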

S2: a pre-trained large-scale convolutional neural network computes a feature description of each candidate region and outputs it.

After the candidate regions that may contain objects of interest have been obtained, deciding with computer vision and pattern recognition algorithms whether a candidate window is a certain object first requires a feature description of the candidate region, so that a classifier can then make the decision. In image classification and recognition, common image description methods include low-level feature descriptions such as SIFT, LBP, and HOG, mid-level feature descriptions such as the bag-of-words model, and hierarchical feature representations that have become popular in recent years, such as convolutional neural networks and deep belief networks. Weakly supervised object detection and labeling addresses recognition at the object level: it answers the semantic-level question of what object is where by eliminating the ambiguity of weak supervision. Low-level and mid-level feature descriptions cannot handle this high-level semantic problem well; a very abstract high-level feature representation is needed. Convolutional neural networks have achieved a series of important breakthroughs in object recognition. Their hierarchical feature representation abstracts features layer by layer from low level to high level: the early feature layers are typically edge and corner detectors, and as the number of layers increases, later features gradually come to describe object parts and whole objects. Extracting features from the later feature layers of a convolutional neural network gives a higher-level (for example, object-level) description and representation of the image. Another important property of convolutional neural networks is their very large model capacity: the more layers and neurons, the more complex the model and the more information it can encode and store.

On this basis, the present invention trains a large convolutional neural network on the very large image data set ImageNet 2013, so that a large amount of general object information is stored in the network. Preferably, the convolutional neural network is trained on the large-scale general object classification database ImageNet 2013, whose training data comprises about 1,200,000 images in 1000 classes. The convolutional neural network used comprises 5 convolutional layers and 2 fully connected layers, with max-pooling layers after the 1st, 2nd, and 5th convolutional layers; the whole network contains about 650,000 neurons. Just as the large store of knowledge in the human brain helps humans distinguish objects, this convolutional neural network, which contains a large amount of general visual prior information, can be used effectively to give a general description of objects.

S3: given the image category labels, an object detection model is trained on the feature representations of the candidate regions using the multiple-instance linear support vector machine MI-SVM.

The present invention obtains a set of candidate windows from each image with the selective search algorithm and describes these candidate windows with a pre-trained large-scale convolutional neural network. The next task is to learn an object detection model automatically from these candidate window feature descriptions. With the trained object detection model, the candidate regions can then be predicted to find the regions most likely to contain an object.

Weakly supervised object detection and labeling can usually be modeled as a multiple-instance learning problem. An image I^{i} is described by a bag B^{i} containing n^{i} d-dimensional instances, the j-th instance being denoted B^{i}_{j}. If a bag contains at least one positive instance, the label y^{i} of the bag is +1; if all instances are negative, the label y^{i} of the bag is -1. To avoid handling the bias term explicitly below, the present invention appends an extra 1 to the end of each instance feature. The problem is written as

\min_{w} \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{|B|} \xi_{i} \qquad (1)

\text{s.t. } \max_{j} (w^{T} B^{i}_{j}) \geq +1 - \xi_{i}, \quad y^{i} = +1

\max_{j} (w^{T} B^{i}_{j}) \leq -1 + \xi_{i}, \quad y^{i} = -1

\xi_{i} \geq 0

The training set is B = {(B^{i}, y^{i}) | i = 1, 2, ..., N}, where |B| = N is the number of training samples, w is the classifier coefficient vector, C is the regularization parameter controlling the penalty on misclassification, and \xi_{i} is a slack variable.

In the multiple-instance learning framework, image-level label information brings ambiguity into the positive bags: it is known that a positive bag contains at least one positive instance, but not which instance is positive. The MI-SVM algorithm resolves this by considering only the instance with the highest prediction score w^{T} B^{i}_{j} in each bag and predicting the bag from that instance, as shown in Fig. 2(a). The hyperplane of MI-SVM is determined by the highest-scoring instance of each bag. Its optimization problem is a mixed integer program that can only be solved with heuristic algorithms, which is very slow.

S3.1 MILinear algorithm

Unlike the small data sets handled by traditional multiple-instance learning, the present invention mainly considers large-data problems comprising more than 5000 bags, each containing hundreds to thousands of high-dimensional instances. To solve the weakly supervised problem efficiently at this scale, the present invention proposes a new unconstrained large-margin multiple-instance linear support vector machine algorithm, called MILinear. Its formulation is:

\min_{w} \frac{1}{2} \|w\|^{2} + \frac{C}{|B|} \sum_{i=1}^{|B|} \left( \max \left( 0, 1 - y^{i} w^{T} B^{i}_{I_{i}} \right) \right)^{2} \qquad (2)

where B^{i}_{j} is the feature vector of the j-th instance in the i-th bag and y^{i} is the class label of the i-th bag. The second term uses a squared hinge loss, and max(a, b) takes the larger of a and b.

I_{i} = \arg\max_{j} w^{T} B^{i}_{j} \qquad (3)

is the index of the instance with the highest prediction score in bag B^{i}.

Gradient-based optimization methods are widely used for large-scale optimization problems, and the present invention uses a differentiable hinge loss function. As shown in Fig. 2(a), MI-SVM and MILinear both handle this large-scale multiple-instance learning problem by selecting the instance with the highest score.
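The MILinear objective in equation (2), with the instance selection of equation (3), can be evaluated as in the following minimal NumPy sketch. The function names (`add_bias`, `select_instances`, `milinear_objective`) and the representation of each bag as an array of shape `(n_i, d)` are assumptions made for illustration; the source does not specify an implementation.

```python
import numpy as np


def add_bias(bag):
    """Append a constant 1 to every instance so the offset is absorbed into w."""
    return np.hstack([bag, np.ones((bag.shape[0], 1))])


def select_instances(w, bags):
    """Equation (3): index of the highest-scoring instance in each bag."""
    return [int(np.argmax(bag @ w)) for bag in bags]


def milinear_objective(w, bags, labels, C=1.0):
    """Equation (2): L2 regularizer plus mean squared hinge loss over bags."""
    idx = select_instances(w, bags)
    scores = np.array([bag[k] @ w for bag, k in zip(bags, idx)])
    hinge = np.maximum(0.0, 1.0 - np.asarray(labels) * scores)
    return 0.5 * (w @ w) + C / len(bags) * np.sum(hinge ** 2)
```

Each bag contributes only through its highest-scoring instance, which is what keeps the objective unconstrained and cheap to evaluate on many large bags.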

S3.2 Bag decomposition algorithm

In experiments with MILinear, the inventors found that within a positive bag the positive samples are usually concentrated in the 30% of instances with the highest scores. Based on this observation, the present invention proposes a new bag decomposition algorithm that effectively reduces the ambiguity of positive bags by decomposing each positive bag into a positive bag and a negative bag. Preferably, the model trained by MILinear is applied to the training images to obtain a prediction probability for every candidate window, and each positive bag is decomposed into a positive bag and a negative bag according to these probabilities: specifically, the 30% of instances with the highest probability form the new positive bag, and the remaining instances form a new negative bag. A new MILinear model is then trained on the data set obtained after decomposition, as shown in Fig. 2(b). Bag decomposition reduces the ambiguity of the samples in positive bags and thereby improves the classification performance of the model. The decomposition may be iterated several times, until model performance no longer improves.
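One round of bag decomposition can be sketched as follows. This is an illustrative sketch under assumptions: instances are ranked by the raw linear score `bag @ w` (which gives the same ordering as any monotone prediction probability), the number of kept instances is rounded and forced to be at least one, and the function names are hypothetical.

```python
import numpy as np


def decompose_positive_bag(bag, scores, keep_ratio=0.3):
    """Split a positive bag into (top-scoring part, remaining part)."""
    order = np.argsort(-np.asarray(scores))
    # Rounding and the minimum of one instance are assumptions of this sketch.
    n_keep = max(1, int(round(keep_ratio * len(order))))
    return bag[order[:n_keep]], bag[order[n_keep:]]


def decompose_dataset(bags, labels, w, keep_ratio=0.3):
    """Apply the split to every positive bag; negative bags are kept as they are."""
    new_bags, new_labels = [], []
    for bag, y in zip(bags, labels):
        if y != 1:
            new_bags.append(bag)
            new_labels.append(y)
            continue
        pos, neg = decompose_positive_bag(bag, bag @ w, keep_ratio)
        new_bags.append(pos)
        new_labels.append(1)
        if len(neg) > 0:
            new_bags.append(neg)
            new_labels.append(-1)
    return new_bags, new_labels
```

A new MILinear model would then be trained on the returned bags and labels, and the two steps repeated until performance stops improving.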

S3.3 Gradient-based optimization method

The MILinear algorithm has been defined above; how to learn the model efficiently on large-scale data sets is discussed below. The optimization objective of MILinear is an unconstrained, differentiable function whose first derivative is

g(w) = w + 2 \frac{C}{|B|} \sum_{i \in I_{B}} \left( w^{T} B^{i}_{I_{i}} {B^{i}_{I_{i}}}^{T} - y^{i} {B^{i}_{I_{i}}}^{T} \right) \qquad (4)

where

I_{B} = \{ i \mid 1 - y^{i} w^{T} B^{i}_{I_{i}} > 0, \; i = 1, 2, \ldots, |B| \} \qquad (5)

is the set of bags whose margin is less than 1.
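The gradient in equation (4) can be computed as in the following sketch, which loops over the bags, picks the highest-scoring instance of each (equation 3), and accumulates a term only for bags in the active set of equation (5). The function name and the data layout (a list of `(n_i, d)` arrays with the bias column already appended) are assumptions of this illustration; the result is returned as a column-style 1-D array.

```python
import numpy as np


def milinear_gradient(w, bags, labels, C=1.0):
    """Equation (4): gradient of the MILinear objective at w."""
    n = len(bags)
    grad = np.array(w, dtype=float)          # derivative of 0.5 * ||w||^2
    for bag, y in zip(bags, labels):
        x = bag[int(np.argmax(bag @ w))]     # selected instance, equation (3)
        score = w @ x
        if 1.0 - y * score > 0:              # bag is in the active set, equation (5)
            grad += 2.0 * C / n * (score * x - y * x)
    return grad
```

When no bag violates the margin, the active set is empty and the gradient reduces to `w` itself.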

Once the analytic gradient of the objective function is available, many methods can be used to optimize it, including stochastic gradient descent (SGD), L-BFGS, and nonlinear conjugate gradient (CG). Stochastic gradient descent processes the data set sample by sample and updates the model iteratively. L-BFGS is a quasi-Newton optimization method that avoids storing the full Hessian matrix by using an approximate low-rank representation of it. In general, each step of stochastic gradient descent is cheap but more iterations are needed, whereas each step of a second-order method such as L-BFGS takes longer but overall convergence is faster.

To learn the object appearance model more efficiently, the present invention proposes a multiple-instance linear support vector machine optimization algorithm based on a trust region Newton method, which is more efficient than L-BFGS. The trust region Newton method is a very efficient method for solving large-scale unconstrained problems and has been applied to training general large-scale logistic regression and support vector machine models. To apply the trust region Newton method to the MILinear problem, the generalized Hessian is computed with the following formula,

where I is the identity matrix.
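The generalized Hessian formula referred to above does not appear in this text. As a hedged reconstruction, and not a formula taken from the source, differentiating the gradient in equation (4) with respect to w while holding the selected instances I_{i} and the active set I_{B} fixed gives an expression of the following form, which is consistent with the remark that I is the identity matrix:

```latex
H(w) = I + 2 \frac{C}{|B|} \sum_{i \in I_{B}} B^{i}_{I_{i}} {B^{i}_{I_{i}}}^{T}
```

Here the identity matrix comes from the regularizer \frac{1}{2} \|w\|^{2}, and each bag in I_{B} contributes the outer product of its selected instance with itself.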

The trust region Newton method optimizes the objective function iteratively. Each iteration attempts to solve the following trust region subproblem:

s^{k} = \arg\min_{s} q_{k}(s), \quad q_{k}(s) = \nabla f(w^{k})^{T} s + \frac{1}{2} s^{T} \nabla^{2} f(w^{k}) s = g(w^{k})^{T} s + \frac{1}{2} s^{T} H(w^{k}) s, \quad \text{s.t. } \|s\| \leq \Delta_{k} \qquad (7)

where s^{k} is the update step, \Delta_{k} is the trust region, and g(w^{k}) and H(w^{k}) are the first and second derivatives of the MILinear objective function (equation 2), respectively.

This subproblem can be solved efficiently with a conjugate gradient method that takes the trust region into account.

After the update step s^{k} has been solved for, w^{k} is updated if the actual decrease of the objective function is large enough; otherwise w^{k} is kept unchanged.

w^{k+1} = \begin{cases} w^{k} + s^{k} & \text{if } \frac{f(w^{k} + s^{k}) - f(w^{k})}{q_{k}(s^{k})} > \eta_{0}, \\ w^{k} & \text{otherwise.} \end{cases} \qquad (8)

where \eta_{0} is a predefined positive number that sets the minimum acceptable actual decrease of the function: the update direction is accepted when the actual decrease exceeds this value. In an embodiment of the present invention, it is preferably set to 1e-4.
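The acceptance rule in equation (8) can be sketched as follows. The sketch assumes that the step `s` has already been obtained (for example by a trust-region conjugate gradient solver, not shown), and that `f`, `g`, and `H` are supplied by the caller; the function names and the guard that rejects a step whose quadratic model predicts no decrease are assumptions added for illustration. Updating the trust region size \Delta_{k} between iterations is not shown.

```python
import numpy as np


def quadratic_model(g, H, s):
    """q_k(s) = g^T s + 0.5 * s^T H s from the trust region subproblem (7)."""
    return g @ s + 0.5 * (s @ H @ s)


def trust_region_update(f, w, s, g, H, eta0=1e-4):
    """Equation (8): accept the step only if the actual decrease is large enough.

    Returns the new iterate and a flag telling whether the step was accepted.
    """
    predicted = quadratic_model(g, H, s)
    if predicted >= 0:
        # Assumed safeguard: the model predicts no decrease, so keep w.
        return w, False
    ratio = (f(w + s) - f(w)) / predicted
    if ratio > eta0:
        return w + s, True
    return w, False
```

On an exactly quadratic function the model and the function agree, so the ratio is 1 and a full Newton step is accepted.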

Strictly speaking, the MILinear objective function is non-convex because it introduces a max function. The objective function is also not twice differentiable. Although a globally optimal solution cannot be guaranteed, in practice the algorithm can effectively learn an object appearance model from large-scale data sets.

S4: candidate regions are extracted from the test image and described with features in the same way, and the previously trained object detection model is used to locate the objects of interest. In the test stage, selective search first produces a number of candidate regions, and the same convolutional neural network as in the training stage then computes their feature descriptions. The previously trained object appearance model then classifies the window features to decide whether each candidate window is an object of interest, which answers the question of what object is where. This completes the automatic detection and labeling of objects of interest using only image label information.
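The test-stage classification of candidate windows can be sketched as below. The sketch assumes a linear model `w` learned as above (with the last coefficient acting on the appended constant 1), precomputed window features, and a decision threshold of 0; the function name `detect` and the threshold are assumptions, and the source does not describe any post-processing of overlapping detections.

```python
import numpy as np


def detect(boxes, features, w, threshold=0.0):
    """Score every candidate window and return those above the threshold.

    boxes:    list of (x1, y1, x2, y2) candidate windows
    features: array of shape (n_windows, d) with one feature row per window
    w:        weight vector of length d + 1 (last entry multiplies the bias 1)
    """
    feats = np.hstack([features, np.ones((features.shape[0], 1))])
    scores = feats @ w
    keep = np.flatnonzero(scores > threshold)
    keep = keep[np.argsort(-scores[keep])]   # highest score first
    return [(boxes[k], float(scores[k])) for k in keep]
```

The returned list pairs each accepted window with its score, so the top entry is the window the model considers most likely to contain the object.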

Fig. 3 is a schematic diagram comparing the results of optimization with the trust region Newton method and with other optimization methods according to an embodiment of the present invention; Fig. 4 is a schematic diagram of the relationship between the prediction score of the trained object detection model and the sample overlap; Fig. 5 is a schematic diagram of the performance improvement for several object categories over the iterations of the bag decomposition algorithm; and Fig. 6 is a schematic diagram of detection results on the Pascal VOC2007 database for the trained object detection model.

In summary, the present invention proposes a new visual target detection and labeling method based on weakly supervised learning. It extracts candidate windows with a selective search algorithm, uses a deep convolutional neural network pre-trained on a large amount of data as the candidate window feature representation model and as general prior knowledge, and mines positive samples with an algorithm based on a multiple-instance linear support vector machine. By optimizing the model with a trust region Newton method and using a novel bag decomposition algorithm to progressively reduce the ambiguity of positive bags, the method achieves visual target detection and automatic labeling in weakly supervised scenarios. Experiments show that, compared with mainstream weakly supervised visual target detection and labeling methods, the invention has a stronger ability to mine positive samples and broader application prospects, and is suitable for visual target detection and automatic labeling tasks on large-scale data sets.

The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A visual target detection and labeling method, characterized by comprising:

an image input step of inputting an image to be detected;

2. The method according to claim 1, characterized in that the selective search algorithm in the candidate region extraction step further comprises:

converting the color space of the image to be detected into a predetermined color space; segmenting the image with a graph-based over-segmentation algorithm; repeatedly merging the two most similar regions to obtain a hierarchical segmentation of the image; and merging and de-duplicating the sets of segmented regions from the multiple color spaces and multiple levels to obtain the set of candidate regions of the image.
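The merging and de-duplication described in claim 2 can be illustrated with a toy sketch. It is deliberately simplified and makes several assumptions: the initial regions are given by hand instead of coming from a graph-based over-segmentation, similarity is the difference in mean intensity only, spatial adjacency is ignored, and a single color channel is used. The names `merge_boxes` and `hierarchical_grouping` are hypothetical.

```python
from itertools import combinations

def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def hierarchical_grouping(regions):
    """regions: list of (box, mean_intensity, size) from an initial over-segmentation."""
    active = list(regions)
    proposals = [r[0] for r in active]
    while len(active) > 1:
        # Pick the pair with the most similar mean intensity (toy similarity).
        i, j = min(combinations(range(len(active)), 2),
                   key=lambda p: abs(active[p[0]][1] - active[p[1]][1]))
        (ba, ma, sa), (bb, mb, sb) = active[i], active[j]
        merged = (merge_boxes(ba, bb), (ma * sa + mb * sb) / (sa + sb), sa + sb)
        active = [r for k, r in enumerate(active) if k not in (i, j)] + [merged]
        proposals.append(merged[0])        # every level of the hierarchy is a proposal
    return list(dict.fromkeys(proposals))  # de-duplicate while keeping order

regions = [((0, 0, 4, 4), 10.0, 16),
           ((4, 0, 8, 4), 12.0, 16),
           ((0, 4, 8, 8), 200.0, 32)]
candidates = hierarchical_grouping(regions)
```

Running the same grouping once per color space and concatenating the results before the de-duplication step would mirror the multi-color-space merging named in the claim.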

3. The method according to claim 2, characterized in that the predetermined color space comprises: HSV, RGI, I, Lab.

4. The method according to claim 1, characterized in that the pre-trained convolutional neural network is a convolutional neural network trained on the object classification database ImageNet 2013.

5. The method according to claim 1, characterized by further comprising an object detection model training step, which specifically comprises:

inputting training set images with image category labels;

6. The method according to claim 5, characterized in that training the object detection model with the multiple-instance linear support vector machine comprises:

training the object detection model with MILinear, an unconstrained large-margin multiple-instance learning algorithm, whose objective function is

\min_{w} \frac{1}{2} \| w \|^{2} + \frac{C}{|B|} \sum_{i=1}^{|B|} \left( \max\left(0, 1 - y^{i} w^{T} B_{I_{i}}^{i}\right) \right)^{2},
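The objective in claim 6 can be evaluated numerically as in the sketch below. One point is an assumption, because the claim text here does not define it: the index I_i is taken to select the highest-scoring instance of bag i under the current w. The function name `milinear_objective` and the toy bags are likewise hypothetical.

```python
import numpy as np

def milinear_objective(w, bags, labels, C=1.0):
    """Squared-hinge multiple-instance objective over one representative per bag."""
    loss = 0.0
    for bag, y in zip(bags, labels):
        rep = bag[np.argmax(bag @ w)]      # assumed choice of I_i: top-scoring instance
        loss += max(0.0, 1.0 - y * (w @ rep)) ** 2
    return 0.5 * (w @ w) + C / len(bags) * loss

# Toy data (assumed): one positive bag with two instances, one negative bag with one.
bags = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[-1.0, 0.0]])]
labels = [1, -1]
w = np.array([0.5, 0.0])

value = milinear_objective(w, bags, labels, C=1.0)
```

Each bag contributes a single squared-hinge term, so the cost of one evaluation grows with the number of instances only through the search for the representative.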

7. The method according to claim 6, characterized in that solving the MILinear algorithm with the trust region Newton method comprises:

determining that the optimization objective of MILinear is a differentiable unconstrained objective function whose first derivative is

g(w) = w + 2 \frac{C}{|B|} \sum_{i \in I_{B}} \left( w^{T} B_{I_{i}}^{i} B_{I_{i}}^{iT} - y^{i} B_{I_{i}}^{iT} \right),

wherein

I_{B} = \{ i \mid 1 - y^{i} w^{T} B_{I_{i}}^{i} > 0, \; i = 1, 2, \ldots, |B| \}

is the set of instances whose margin is less than 1;

computing the generalized Hessian by the following formula

H(w) = I + 2 \frac{C}{|B|} \sum_{i \in I_{B}} B_{I_{i}}^{i} B_{I_{i}}^{iT},

wherein I is the identity matrix;

optimizing the objective function iteratively by computing

\begin{matrix} s^{k} = \arg\min_{s} q_{k}(s) = \arg\min_{s} \nabla f(w^{k})^{T} s + \frac{1}{2} s^{T} \nabla^{2} f(w^{k}) s \\ = \arg\min_{s} g(w^{k})^{T} s + \frac{1}{2} s^{T} H(w^{k}) s, \quad \text{s.t. } \| s \| \leq \Delta_{k} \end{matrix},

wherein k is the iteration number, s^{k} is the update step, w^{k} is the weight vector at the k-th iteration, \Delta_{k} is the trust region radius, and \nabla f(w^{k}) = g(w^{k}) and \nabla^{2} f(w^{k}) = H(w^{k}) are respectively the first and second derivatives of the MILinear objective function;

w^{k+1} = \begin{cases} w^{k} + s^{k} & \text{if } \frac{f(w^{k} + s^{k}) - f(w^{k})}{q_{k}(s^{k})} > \eta_{0}, \\ w^{k} & \text{otherwise,} \end{cases}

wherein \eta_{0} is a predefined positive number that controls the minimum acceptable actual decrease of the function.
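The iteration in claim 7 can be sketched as below, under stated assumptions. The bag representatives are held fixed as the rows of `X`; the Hessian is taken as the derivative of g(w) on the active set, H = I + (2C/|B|) times the sum of B B^T over I_B; the subproblem is solved only approximately, by clipping the Newton step to the trust region radius; and the radius update thresholds (0.25, 0.75) are conventional choices not taken from the claim. All function names are hypothetical.

```python
import numpy as np

def f_g_H(w, X, y, C):
    """Objective, gradient and generalized Hessian for fixed representatives X."""
    n, d = X.shape
    margin = 1.0 - y * (X @ w)
    act = margin > 0                       # active set I_B: margin less than 1
    Xa, ya = X[act], y[act]
    f = 0.5 * (w @ w) + C / n * np.sum(margin[act] ** 2)
    g = w + 2.0 * C / n * (Xa.T @ (Xa @ w - ya))
    H = np.eye(d) + 2.0 * C / n * (Xa.T @ Xa)
    return f, g, H

def trust_region_newton(X, y, C=1.0, delta=1.0, eta0=1e-4, tol=1e-8, max_iter=100):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        f, g, H = f_g_H(w, X, y, C)
        if np.linalg.norm(g) < tol:
            break
        s = -np.linalg.solve(H, g)         # unconstrained Newton step
        norm_s = np.linalg.norm(s)
        if norm_s > delta:                 # simplification: clip to the trust region
            s *= delta / norm_s
        q = g @ s + 0.5 * (s @ H @ s)      # predicted change q_k(s), negative here
        rho = (f_g_H(w + s, X, y, C)[0] - f) / q
        if rho > eta0:                     # accept only with sufficient actual decrease
            w = w + s
        if rho < 0.25:                     # assumed radius update rule
            delta *= 0.25
        elif rho > 0.75:
            delta *= 2.0
    return w

# Toy representatives (assumed): two positive and two negative bags.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_opt = trust_region_newton(X, y, C=1.0)
```

A production solver would typically replace the clipped Newton step with a conjugate gradient method that respects the trust region boundary, which avoids forming H explicitly for high-dimensional features.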

8. The method according to claim 7, characterized by further comprising running a bag decomposition algorithm with the trained object detection model and iteratively reducing the ambiguity of the positive bags, which specifically comprises:

using the object detection model trained by MILinear to obtain prediction probabilities for all candidate windows on the training set images; decomposing each positive bag into a positive bag and a negative bag according to these prediction probabilities; and training a new MILinear object detection model on the data set obtained after decomposition, wherein the decomposition process is iterated several times.
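One decomposition step from claim 8 can be sketched as follows. The claim does not state how the prediction probability is computed or where the bag is split, so two assumptions are made here: scores are mapped to probabilities with a sigmoid, and a fixed fraction `keep_ratio` of the most probable windows stays in the positive bag. The function name `decompose_positive_bag` is hypothetical.

```python
import numpy as np

def decompose_positive_bag(bag, w, keep_ratio=0.5):
    """Split one positive bag into a smaller positive bag and a new negative bag."""
    prob = 1.0 / (1.0 + np.exp(-(bag @ w)))   # assumed sigmoid mapping of scores
    n_keep = max(1, int(np.ceil(keep_ratio * len(bag))))
    order = np.argsort(-prob)                 # most probable windows first
    return bag[order[:n_keep]], bag[order[n_keep:]]

# Toy positive bag (assumed): 4 candidate windows with 2-D features.
bag = np.array([[3.0, 0.0], [-2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
w = np.array([1.0, 0.0])                      # hypothetical current model

pos_bag, neg_bag = decompose_positive_bag(bag, w)
```

Repeating this step and retraining on the decomposed data set shrinks each positive bag, which is the sense in which the ambiguity of the positive bags is reduced over the iterations.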