CN107833213A

CN107833213A - A weakly supervised object detection method based on pseudo ground-truth adaptation

Info

Publication number: CN107833213A
Application number: CN201711066445.XA
Authority: CN
Inventors: 张永强; 丁明理; 李贤�; 杨光磊; 董娜
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2017-11-02
Filing date: 2017-11-02
Publication date: 2018-03-23
Anticipated expiration: 2037-11-02
Also published as: CN107833213B

Abstract

The present invention relates to a weakly supervised object detection method based on pseudo ground-truth adaptation. It is proposed to address two shortcomings of existing fully supervised object detectors: they depend on databases with large amounts of annotation, and object localization is inaccurate when a picture contains multiple objects that occlude one another. The method includes: inputting a picture into a weakly supervised object detector, applying non-maximum suppression to the detector output, and selecting from the result the highest-scoring bounding box for each kind of object; training a region proposal network on the position information of the selected bounding boxes, retaining the candidate regions whose overlap with the highest-scoring bounding box exceeds a certain value, averaging the pixel coordinates of the candidate regions corresponding to the same object, and determining a unique bounding box for each object from the result; and feeding the bounding box information to a fully supervised object detector as pseudo ground truth. The invention is applicable to object detection, and in particular to general object detection in real scenes.

Description

A weakly supervised object detection method based on pseudo ground-truth adaptation

Technical field

The present invention relates to the field of machine vision, and in particular to a weakly supervised object detection method based on pseudo ground-truth adaptation.

Background technology

Object detection is an important research topic in machine vision and a basic technology for higher-level tasks such as image segmentation, object tracking, and behavior analysis and recognition. In addition, with the development of mobile Internet technology, the number of images and videos is growing explosively, and a technology that can quickly and accurately identify and locate objects in images and videos is urgently needed for the subsequent intelligent classification of images and videos and the extraction of key information. Object detection is now widely used in modern society, for example face detection and pedestrian detection in security; traffic sign recognition, vehicle detection and tracking in intelligent transportation; autonomous navigation and driving; and robot path planning.

Because object detection has important theoretical value and urgent practical demand, the related techniques are continually developing and being renewed. Existing methods fall roughly into two classes: traditional methods based on sliding windows and modern methods based on deep learning.

Given a picture to be examined, the traditional method traverses the whole image with a sliding window. Because the target may appear at any position in the image, and its size and aspect ratio are both unknown, windows of different scales and aspect ratios must be slid over the image many times. This exhaustive method always finds the positions where objects appear (called candidate regions), but it has obvious shortcomings: if the window scales and aspect ratios are few and the stride is large, not all objects can be detected; if the scales and aspect ratios are many and the stride is small, there are too many redundant windows and the computation takes too long to meet real-time requirements in practice. After the candidate regions are selected by the sliding window, the traditional method extracts hand-crafted features (called shallow features) from them. Common methods include the scale-invariant feature transform (SIFT), Haar-like features, the histogram of oriented gradients (HOG), and local binary patterns (LBP). To improve recognition and localization accuracy, the features produced by several of these methods are usually fused as the feature of the candidate region. Finally, a classifier is designed to identify the class of the object in each candidate region; common classifiers include the support vector machine (SVM) and adaptive boosting (AdaBoost). The flow chart of object detection based on the traditional method is shown in Fig. 1. In the traditional framework of "sliding window + hand-crafted features + shallow classifier", the excessive redundant windows and the weak representational power of shallow features mean that neither the computation speed nor the detection accuracy meets practical demand.

After 2012, deep learning achieved a breakthrough in image classification (what class the object in an image belongs to), mainly owing to the appearance of a large database (ImageNet) and the stronger representational power of features extracted by convolutional neural networks (CNNs). For example, the 4096-dimensional output of a fully connected layer in the VGG-16 model is used to represent an image, and such deep-learning features (deep features) carry stronger semantic information. Deep feature extraction was subsequently applied to object detection as well (what the object is, and where it is). Detection accuracy improved, but detection speed remained slow, and was even slower than the traditional method because the feature dimension was larger and the network deeper. The reason is that this step only addressed the weak expressive power of hand-crafted shallow features and replaced the shallow classifier with a deep convolutional neural network; it still relied on sliding windows to handle the multi-scale problem of object detection, so a large number of redundant windows remained. Candidate region (region proposal) methods offer a good solution to the problems caused by sliding windows. They use information such as image edges, texture, and color to find in advance the positions where objects are likely to appear in an image (video frame); the number of candidate regions is usually a few hundred to a few thousand, set according to the actual situation. Such methods maintain a high recall with relatively few candidate regions, which greatly reduces computation time and improves detection speed. Commonly used candidate region generation methods include Selective Search, Edge Boxes, and the Region Proposal Network (RPN). The flow chart of object detection based on candidate regions and deep learning is shown in Fig. 2. The deep learning framework of "candidate regions (region proposals) + convolutional neural network (CNN)" balances the conflict between detection time and detection accuracy, and achieves higher detection accuracy at a shorter detection time.

However, whether they use traditional sliding-window techniques or modern deep learning techniques, current studies are carried out on fixed databases (PASCAL VOC, Microsoft COCO, etc.; see Table 1), and require annotating, for every picture in the data set, which objects it contains and where each object appears. Deep learning methods in turn depend on large amounts of training data (tens of thousands to hundreds of thousands of pictures), and building such a large annotated database is a huge, time-consuming, and laborious project. In addition, these annotated databases have the following shortcomings. First, the object classes in a database are limited; the object classes in real scenes in practical applications may not match those of the database, or may far exceed them. Second, manually marking the position of an object in a picture is somewhat subjective, especially when the picture contains multiple objects that occlude one another, which introduces deviations into the annotation. During training, these deviations are likely to make the model converge to a local optimum, and the final result is inaccurate object localization.

Summary of the invention

The purpose of the invention is to overcome the following shortcomings: existing fully supervised object detectors depend on databases with large amounts of annotation; annotation errors make object localization inaccurate when a picture contains multiple objects that occlude one another; and the objects to be detected in practical applications may not match the object classes of the database, or may far exceed them. To this end, a weakly supervised object detection method based on pseudo ground-truth adaptation is proposed, including:

Step 1), constructing training samples;

Step 2), inputting the pictures in the training samples into a weakly supervised object detector based on multiple-instance learning (MIL);

Step 3), applying non-maximum suppression to the output of the weakly supervised object detector, and selecting from the result the highest-scoring bounding box for each kind of object;

Step 4), training a region proposal network (RPN) on the position information of the bounding boxes selected in step 3), generating multiple candidate regions with the region proposal network, and retaining all candidate regions whose overlap ratio with the ground truth exceeds a certain threshold, so that the object of each class corresponds to multiple candidate regions;

Step 5), averaging the pixel coordinates of the candidate regions corresponding to the same object, and determining a unique bounding box for each object from the result;

Step 6), inputting the bounding box information obtained in step 5) into a fully supervised object detector as pseudo ground truth, and obtaining the detection result.
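
Step 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the greedy per-class suppression, the IoU threshold of 0.5, and the names `iou`, `nms`, and `top_box_per_class` are introduced here for illustration, since the text does not specify them.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Detection = Tuple[str, float, Box]       # (class label, score, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def nms(detections: List[Detection], iou_thresh: float = 0.5) -> List[Detection]:
    """Greedy NMS: visit boxes by descending score and drop any box that
    overlaps an already kept box of the same class by more than iou_thresh.
    The threshold value is an assumption, not taken from the text."""
    kept: List[Detection] = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(det[0] != k[0] or iou(det[2], k[2]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


def top_box_per_class(detections: List[Detection]) -> Dict[str, Detection]:
    """Keep the highest-scoring detection for each class label."""
    best: Dict[str, Detection] = {}
    for det in detections:
        if det[0] not in best or det[1] > best[det[0]][1]:
            best[det[0]] = det
    return best


detections = [
    ("person", 0.9, (0, 0, 10, 10)),
    ("person", 0.8, (1, 1, 11, 11)),    # heavy overlap with the 0.9 box
    ("person", 0.3, (50, 50, 60, 60)),
    ("dog", 0.6, (20, 20, 40, 40)),
]
kept = nms(detections)
seeds = top_box_per_class(kept)
```

In this toy input the 0.8 "person" box is suppressed by the 0.9 box, and `seeds` holds one box per class, which would then serve as the position information for training the region proposal network in step 4).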

The beneficial effects of the invention are: 1. object detection based on deep learning is no longer limited by scarce training data or by deviations in manual annotation, which promotes the application of deep-learning-based object detection in real scenes; 2. accurate detection results are obtained even when a picture contains multiple objects that occlude one another; 3. the mAP in the experimental results of the invention is 52.4%, clearly higher than the 41.6% and 45.8% of the prior art, and the CorLoc in the experimental results of the invention is 70.3%, clearly higher than the 61.4% and 65.0% of the prior art.

Brief description of the drawings

Fig. 1 is the flow chart of object detection based on the traditional method;

Fig. 2 is the flow chart of object detection based on candidate regions and deep learning;

Fig. 3 shows example detection results of the weakly supervised object detector; Fig. 3(a) to Fig. 3(e) show the detection results for different images;

Fig. 4 shows example detection scores of the weakly supervised object detector;

Fig. 5 is a schematic diagram of the conventional method and the pseudo ground-truth adaptation method, where Fig. 5(a) shows the conventional method of the prior art and Fig. 5(b) shows the method of the invention;

Fig. 6 is the flow chart of the weakly supervised object detection method based on pseudo ground-truth adaptation;

Fig. 7 is a schematic diagram of the weakly supervised detector based on multiple-instance learning;

Fig. 8 shows the experimental results; Fig. 8(a) to Fig. 8(o) are the experimental results for different images.

Embodiment

Embodiment one: The weakly supervised object detection method based on pseudo ground-truth adaptation of this embodiment includes:

Step 1), constructing training samples;

Step 2), inputting the pictures in the training samples into a weakly supervised object detector based on multiple-instance learning (MIL);

Step 3), applying non-maximum suppression to the output of the weakly supervised object detector, and selecting from the result the highest-scoring bounding box for each kind of object;

Step 4), training a region proposal network on the position information of the bounding boxes selected in step 3), generating multiple candidate regions with the region proposal network, and retaining all candidate regions whose overlap ratio with the ground truth exceeds a certain threshold, so that the object of each class corresponds to multiple candidate regions;

Step 5), averaging the pixel coordinates of the candidate regions corresponding to the same object, and determining a unique bounding box for each object from the result. The coordinates can be averaged as follows: compute the averages of the top-left, bottom-left, top-right, and bottom-right corner coordinates over all candidate regions of an object, and determine a unique bounding box from these four averages.

Step 6), inputting the bounding box information obtained in step 5) into a fully supervised object detector as pseudo ground truth, and obtaining the detection result. The pseudo ground truth is not real, manually annotated ground truth; it is an approximation of the ground truth, found by the method of this embodiment, that serves as the ground truth.
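
The corner averaging of step 5) can be illustrated with a short sketch. The box representation `(x_min, y_min, x_max, y_max)` and the function name `average_box` are assumptions made for this example; for axis-aligned boxes, averaging these four values is equivalent to averaging the four corners.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def average_box(proposals: List[Box]) -> Box:
    """Average the coordinates of all candidate regions kept for one object."""
    if not proposals:
        raise ValueError("at least one candidate region is required")
    n = len(proposals)
    # Each corner is built from x_min, y_min, x_max, y_max, so averaging these
    # four values gives the same box as averaging the four corner points.
    x_min = sum(p[0] for p in proposals) / n
    y_min = sum(p[1] for p in proposals) / n
    x_max = sum(p[2] for p in proposals) / n
    y_max = sum(p[3] for p in proposals) / n
    return (x_min, y_min, x_max, y_max)


proposals = [
    (10.0, 20.0, 110.0, 220.0),
    (14.0, 16.0, 114.0, 216.0),
    (12.0, 24.0, 106.0, 224.0),
]
pseudo_gt = average_box(proposals)  # (12.0, 20.0, 110.0, 220.0)
```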

Steps 3) to 5) can be described with reference to Fig. 5(b). The first picture from the left in Fig. 5(b) is the original input picture. The second picture shows the highest-scoring bounding box of each kind of object obtained by step 3). The third picture shows the multiple candidate regions obtained after the highest-scoring bounding boxes are input to the region proposal network. The pixel coordinates of the candidate regions corresponding to each object are averaged, which gives the unique bounding box of each object in the fourth picture.

Specifically, the invention takes images (video frames) in real scenes as its research object, and the specific object classes to be detected can be determined by the user according to the practical problem at hand. With the development of Internet technology and the spread of picture and video capture devices, pictures and videos on YouTube are, according to statistics, growing at 58 pictures per second and 3.6 videos per second. A user only needs to crawl pictures from a search engine using keywords for the classes to be detected in order to build a database that matches the practical problem, which solves the problems that existing fixed databases contain few object classes and that their classes do not match those that actually need to be detected. At the same time, because position annotations are not needed, no large investment of manpower and material resources is required to annotate the database, and the deviations introduced by the subjectivity of manual annotation are avoided.

Once the training database is established, an existing weakly supervised object detection technique can be used to train a weakly supervised object detector. Weak supervision means that each training sample has corresponding supervision information, but that this information is simple or incomplete. In the invention, for example, each picture has object class information (which objects the picture contains) but no object position information (where the objects are). Existing weakly supervised object detection techniques all treat object detection under weak supervision as a multiple-instance learning (MIL) problem. This approach has two shortcomings: first, the model is sensitive to initialization; second, it is a non-convex problem, so the model may converge to a local optimum. Intuitively, the object detector detects only the most discriminative part of an object rather than the whole object. For example, when detecting pedestrians it detects only the face rather than the whole body, and when detecting animals it localizes only the head rather than the whole body, as shown in Fig. 3.

The invention studies the weakly supervised object detector and finds that in most cases the detector does detect the whole object; however, the detection box (bounding box) containing the whole object receives a low score, while the box covering the most discriminative part of the object receives a higher score, as shown in Fig. 4. At the same time, because no position annotation is available during training, the object detector has no regression ability. As a result, some detections contain only the most discriminative part of the object, or contain the whole object together with too much background. These results are the root cause of detection failure (a lower recognition rate). To solve the low recognition rate of the weakly supervised detector, the invention proposes a framework that moves from weakly supervised to fully supervised learning: the output of the weakly supervised detector is taken as the ground truth of the object position, and this pseudo ground truth is used to train a fully supervised object detector, because fully supervised learning has a strong regression ability. As for how to choose the ground truth, the simplest feasible method is to take the highest-scoring bounding box in the output of the weakly supervised detector. But this method has two problems: first, only one bounding box can be found for each class of object in a picture, even if the picture contains multiple objects; second, the selected pseudo ground truth contains the most discriminative part of the object rather than the whole object, as shown in Fig. 5(a). In view of these problems, the invention proposes a "weakly supervised object detection method based on pseudo ground-truth adaptation". Specifically, the output of the weakly supervised detector after non-maximum suppression (NMS) is first taken as the ground truth of the object position, and a region proposal network (RPN) is trained with this pseudo ground truth. The trained network then generates candidate regions (proposals), and those whose overlap ratio (IoU) with the pseudo ground truth exceeds a certain threshold are retained. Finally, the pixel coordinates of these candidate regions are averaged to further refine the pseudo ground truth. The flow chart is shown in Fig. 6. After this processing, the pseudo ground truth (bounding box) of every object is found and is more accurate, as shown in Fig. 5(b). These more accurate bounding boxes are used as the ground truth to train a fully supervised object detector. The strong regression ability of the fully supervised detector (it can adjust the bounding box of an object according to the ground truth) solves the low recognition rate of the weakly supervised object detector.

The "weakly supervised object detection method based on pseudo ground-truth adaptation" of the invention uses fully supervised learning to solve the weakly supervised object detection problem, and obtains a higher object detection rate without requiring position annotations. It solves the problem that, in practical applications of object detection, the object classes in annotated databases do not match the object classes actually encountered, and it overcomes the time and labor cost of annotating databases. It helps move deep-learning-based object detection from the laboratory to practical application, and promotes the development of weakly supervised object detection.

Embodiment two: This embodiment differs from Embodiment one in that step 1) specifically includes:

Step 1.1), receiving a keyword input by the user, where the keyword represents the class of an object;

Step 1.2), searching a search engine with the keyword, selecting a preset number of search results, and using the keyword as the annotation of those results.

That is, the invention only needs simple object class information for each picture, and can train the model without complicated object position information. This simple class information can be obtained in many ways. For example, pictures can be searched for in a search engine using keywords ("pedestrian", "vehicle", etc.), and the top several thousand results can be downloaded and used as training samples without manual annotation.

It should be understood that, when the method of the invention is used, the training set can be built by the user and existing picture databases need not be used. The training set is built as follows: the user enters a keyword representing an object into an image search engine and then crawls a certain number of pictures from the search results. These pictures usually contain the object represented by the keyword, which means the annotation is effectively produced automatically during searching and crawling and no manual annotation is needed. This solves well the problem that existing databases struggle to keep up with ever-changing new objects and new pictures. Other existing object detection methods depend on large databases with label information, and cannot be trained and tested on a user-built database that contains only simple picture information.
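
The keyword-based construction of the training set can be sketched as below. The search step is deliberately left as an injected callable, because the text names no particular search engine or API; `build_weak_dataset`, `search_images`, and `fake_search` are hypothetical names used only for this illustration.

```python
from typing import Callable, List, Tuple


def build_weak_dataset(
    keywords: List[str],
    search_images: Callable[[str, int], List[str]],
    per_keyword: int,
) -> List[Tuple[str, str]]:
    """Return (image_reference, label) pairs.

    The keyword used for the search becomes the image-level class label,
    so no position annotation and no manual labelling is involved.
    """
    samples: List[Tuple[str, str]] = []
    for keyword in keywords:
        for image_ref in search_images(keyword, per_keyword):
            samples.append((image_ref, keyword))
    return samples


def fake_search(keyword: str, limit: int) -> List[str]:
    # Hypothetical stand-in for a real image search client; it only
    # fabricates file names so that the sketch runs without network access.
    return [f"{keyword}_{i}.jpg" for i in range(limit)]


dataset = build_weak_dataset(["pedestrian", "vehicle"], fake_search, per_keyword=3)
```

In practice `fake_search` would be replaced by whatever crawler the user has available; the labelling logic stays the same.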

The other steps and parameters are the same as in Embodiment one.

Embodiment three: This embodiment differs from Embodiment one or two in that, in step 1), the training sample set can be any one of the PASCAL VOC 2007/2012, MS COCO, WIDER FACE, and FDDB databases, or a database built according to the method of Embodiment two. The English names above are the names of databases.

The other steps and parameters are the same as in Embodiment one or two.

Embodiment four: This embodiment differs from any one of Embodiments one to three in that, in step 1), the size of the pictures in the training samples satisfies the following:

The shortest side of a picture is randomly one of the five scales {480, 576, 688, 864, 1200}; the longest side of a picture is less than or equal to 2000.
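
A minimal sketch of this size rule follows. It assumes that the aspect ratio is preserved and that, when the two constraints conflict, the cap on the longest side takes priority; that priority, the rounding, and the name `target_size` are assumptions made for the example rather than details given in the text.

```python
import random
from typing import Tuple

SHORT_SIDES = (480, 576, 688, 864, 1200)  # the five scales named in the text
MAX_LONG_SIDE = 2000                      # upper bound on the longest side


def target_size(width: int, height: int, short_side: int) -> Tuple[int, int]:
    """Resize so the shortest side equals short_side, keeping the aspect ratio."""
    scale = short_side / min(width, height)
    # Assumption: if the longest side would exceed the cap, shrink the scale
    # so that the longest side equals the cap instead.
    if scale * max(width, height) > MAX_LONG_SIDE:
        scale = MAX_LONG_SIDE / max(width, height)
    return round(width * scale), round(height * scale)


chosen = random.choice(SHORT_SIDES)       # one scale is drawn at random per sample
size_a = target_size(500, 375, 480)       # ordinary case: (640, 480)
size_b = target_size(1000, 300, 1200)     # elongated picture, cap applies: (2000, 600)
```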

The other steps and parameters are the same as in any one of Embodiments one to three.

Embodiment five: This embodiment differs from any one of Embodiments one to four in that step 2) specifically includes:

Step 2.1), extracting a preset number of candidate regions from each training picture with the selective search algorithm;

Step 2.2), inputting the candidate regions into a VGG16 network model trained on the ImageNet data set to obtain shallow features representing detail information and deep features representing semantic information, then obtaining the feature of each candidate region by RoI pooling, and converting the candidate region feature from a two-dimensional matrix representation into a one-dimensional vector representation to obtain the fully connected feature of each candidate region;

Step 2.3), inputting the fully connected features into the weakly supervised object detector based on multiple-instance learning, where the detector has a classification branch that scores the object class in each candidate region and a detection branch that scores the position information of each candidate region, and then multiplying the scores of the classification branch and the detection branch to obtain the score of the candidate region;

Step 2.4), inputting the score of each candidate region as supervision information into 3 cascaded optimization networks, and performing back-propagation on the optimization networks to obtain the optimized result.
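
The score combination of step 2.3) can be illustrated as follows. The text only states that the classification score and the detection score are multiplied; the softmax normalizations used here (over classes in the classification branch, over candidate regions in the detection branch) are an assumption in the style of common two-branch weakly supervised detectors, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax of a list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def proposal_scores(cls_logits: List[List[float]],
                    det_logits: List[List[float]]) -> List[List[float]]:
    """Combine two branches of shape [num_regions][num_classes].

    Classification branch: softmax over classes for each region (assumed).
    Detection branch: softmax over regions for each class (assumed).
    The final score of a region for a class is the product of the two.
    """
    n, c = len(cls_logits), len(cls_logits[0])
    cls = [softmax(row) for row in cls_logits]
    det = [softmax([det_logits[i][j] for i in range(n)]) for j in range(c)]
    return [[cls[i][j] * det[j][i] for j in range(c)] for i in range(n)]


scores = proposal_scores(
    cls_logits=[[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    det_logits=[[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
)
# Summing a class column over all regions gives an image-level score in [0, 1].
image_level = [sum(row[j] for row in scores) for j in range(2)]
```

Under these assumptions the image-level score can be compared with the image-level class label, which is the only supervision available in the weakly supervised setting.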

The other steps and parameters are the same as in any one of Embodiments one to four.

Embodiment six: This embodiment differs from any one of Embodiments one to five in that, in step 6), the fully supervised object detector is any one of Fast-RCNN, Faster-RCNN, YOLO, and SSD. The English names above are the names of object detectors.

The other steps and parameters are the same as in any one of Embodiments one to five.

Embodiment seven：

This embodiment provides a specific implementation process:

As shown in Fig. 6, training samples are first prepared according to the user's actual needs, and a weakly supervised object detector is then trained by multiple-instance learning (MIL). Next, the pseudo ground-truth adaptation method processes the output of the weakly supervised object detector to obtain the position information (pseudo ground truth) of each object in the training samples. Finally, this position information is used as the ground truth to train a fully supervised object detector, which gives a more accurate detection result. Each part is described in detail below:

Preparing training samples. Training samples can be obtained from a search engine by keyword according to actual needs. For general object detection, existing object detection databases such as PASCAL VOC and MS COCO can also be used; for the detection of specific objects, such as face detection, databases such as WIDER FACE and FDDB can be selected. Without loss of generality, the invention uses the trainval part of the PASCAL VOC 2007 database as training samples and the test part as test data. Note that the invention uses only the class information of the training samples, not the position information of objects. In the training stage, to further enlarge the training set, strengthen the generality of the trained model, and increase its robustness, all samples are flipped horizontally and the flipped images are added to the training data. In addition, to adapt to the multi-scale variation of objects in real scenes, the invention keeps the aspect ratio of each picture and randomly selects one of the five scales {480, 576, 688, 864, 1200} as the shortest side of the training sample; considering GPU memory, the longest side of a training sample is set to be no more than 2000.
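
The horizontal-flip augmentation described above can be sketched in a few lines. Images are represented here as nested lists of pixel values purely to keep the example dependency-free; that representation and the names `hflip` and `augment_with_flips` are assumptions for illustration. Because only image-level class labels are used, the labels of a flipped image are unchanged.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image array


def hflip(image: Image) -> Image:
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def augment_with_flips(
    samples: Sequence[Tuple[Image, Tuple[str, ...]]]
) -> List[Tuple[Image, Tuple[str, ...]]]:
    """Return the original samples plus one flipped copy of each.

    Class labels are image-level, so they carry over to the flipped copy
    without modification.
    """
    augmented = []
    for image, labels in samples:
        augmented.append((image, labels))
        augmented.append((hflip(image), labels))
    return augmented


samples = [([[1, 2, 3], [4, 5, 6]], ("person",))]
augmented = augment_with_flips(samples)
```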

Training the weakly supervised detector (WSD). The invention implements the weakly supervised object detector by multiple-instance learning. Because no position information is available as supervision, the weakly supervised object detector converges to a local optimum, which leads to a low recognition rate. To improve the recognition rate, the invention embeds several optimization networks in the training model in parallel with the multiple-instance detection network, as shown in Fig. 7. For an input sample, about 2000 candidate regions (proposals) are first extracted with selective search, features are then extracted with an existing VGG16 network model trained on ImageNet, and finally the feature of each candidate region is obtained by RoI pooling, from which the fully connected feature of each candidate region is obtained. In the multiple-instance learning network, the input is the fully connected feature of each candidate region. The two parallel branches, classification and detection, respectively judge the class of each candidate region and score the position information of each candidate region, and the scores of the classification branch and the detection branch are finally multiplied to obtain the score of the candidate region. In the optimization networks, the score of each candidate region from the multiple-instance learning network is used as supervision information and back-propagation is performed on the network, which further improves the recognition rate. Considering the relationship between training time and recognition rate (the recognition rate grows nonlinearly with the number of optimization networks, whereas the training time grows linearly), the invention sets the number of optimization networks to 3.

Pseudo ground-truth adaptation (PGA). No position information is used when training the weakly supervised detector, so its recognition rate is limited. Concretely, it detects only a part of the object rather than the whole object (for example, a person's face rather than the person's body), or its detections contain too much background; these results are the root cause of the low recognition rate. To further improve the recognition rate, the invention introduces fully supervised methods into weakly supervised object detection. Fully supervised learning, however, needs the position information of objects as supervision to train the network. The simplest method is to take, for each class of object, the highest-scoring candidate region in the output of the weakly supervised detector as the ground truth of the position, train a fully supervised object detector with this pseudo ground truth, and use the regression ability of fully supervised learning to further improve the object detection rate. But this method has two shortcomings: first, for each training sample only one bounding box can be found for a class of object, even if the sample contains multiple objects of the same class; second, the bounding box found is not accurate enough and usually covers only the most discriminative part of the object, as shown in Fig. 5(a). To address these problems, the invention proposes a pseudo ground-truth adaptation method, which consists of three parts. First, non-maximum suppression (NMS) is applied to the output of the weakly supervised detector, and the highest-scoring bounding box among the bounding boxes of each sample is selected as the position information (pseudo ground truth) of the object; at this point the position information usually covers only the most discriminative part of the object (such as a person's head) rather than the whole object, as shown in the second picture of Fig. 5(b). Second, the result after NMS is used as position information to train a region proposal network (RPN), and the trained network then generates a number of candidate regions; the invention retains all candidate regions whose overlap ratio (IoU) with the ground truth exceeds a certain threshold (set to 0.3 in the invention), as shown in the third picture of Fig. 5(b). Third, after the second step each object in the training sample has a number of corresponding candidate regions, and these candidate regions are closer to the outline of the whole object; the invention averages the pixel coordinates of all candidate regions of each object and takes the average as the final pseudo ground truth, as shown in the fourth picture of Fig. 5(b). After these three steps, each object in the training sample has exactly one corresponding bounding box, and this bounding box is more accurate than the one chosen by the highest-score method.
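
The second and third parts of the adaptation (IoU filtering at 0.3, then coordinate averaging) can be sketched as below. The region proposal network itself is not modelled; its output is represented by a plain list of boxes. The names `iou` and `refine_pseudo_gt`, and the fallback to the seed box when no candidate region passes the threshold, are assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def refine_pseudo_gt(seed: Box, proposals: List[Box], iou_thresh: float = 0.3) -> Box:
    """Keep proposals whose IoU with the seed exceeds iou_thresh and average them.

    seed is the highest-scoring box from the weakly supervised detector;
    0.3 is the threshold stated in the text.
    """
    kept = [p for p in proposals if iou(seed, p) > iou_thresh]
    if not kept:
        return seed  # assumption: fall back to the seed if nothing overlaps enough
    n = len(kept)
    return tuple(sum(p[i] for p in kept) / n for i in range(4))


seed = (40, 10, 60, 40)                      # e.g. a box around the head only
proposals = [
    (35, 10, 65, 50),                        # IoU 0.5 with the seed, kept
    (37, 8, 63, 46),                         # IoU about 0.61, kept
    (100, 100, 150, 150),                    # no overlap, discarded
]
refined = refine_pseudo_gt(seed, proposals)  # (36.0, 9.0, 64.0, 48.0)
```

In this toy case the refined box is larger than the seed, which mirrors the intended effect of moving from the most discriminative part towards the outline of the whole object.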

Training the fully supervised detector (fully-supervised detector, FSD). After the pseudo ground-truth search, each object in the training samples has accurate position information. Using this position information as ground truth, a fully supervised object detector can be trained. The fully supervised object detector is not the focus of the present invention; it can be any existing object detector, such as Fast-RCNN, Faster-RCNN, YOLO or SSD. The present invention uses Fast-RCNN as the fully supervised object detector, trained for a total of 70000 iterations, with a learning rate of 0.01 for the first 40000 iterations and 0.001 for the remaining 30000 iterations.
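The learning-rate schedule stated above can be written as a small step function. This is a sketch only: the function name and the zero-based iteration index are assumptions, while the iteration counts and the two rates are taken from the text.

```python
def learning_rate(iteration, total=70000, step=40000,
                  first_lr=0.01, second_lr=0.001):
    """Return the learning rate for a zero-based training iteration.

    Iterations 0..39999 use 0.01 and iterations 40000..69999 use 0.001.
    """
    if not 0 <= iteration < total:
        raise ValueError("iteration outside the training schedule")
    return first_lr if iteration < step else second_lr
```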

The object detection network trained through the above steps can detect objects without any position annotation. It can be applied to object detection in real scenes according to actual needs, is not limited by the object categories of existing object detection databases, and requires no manpower or material resources to annotate each training sample. Experiments show that the "weakly supervised object detection method based on pseudo ground-truth adaptation" of the present invention localizes accurately and detects efficiently. Table 2 gives the comparative experimental results, where mAP is the mean average precision (mean Average Precision), an index evaluated on the test samples, and CorLoc is the correct localization rate (Correct Location), an index that evaluates the localization of training samples during training. The comparison shows that the proposed framework of "weakly supervised object detector + pseudo ground-truth adaptation + fully supervised detector" improves greatly on the detection results of the weakly supervised detector alone, and that the "pseudo ground-truth adaptation" of the present invention also improves considerably on the "highest-score method". Fig. 8 shows experimental results: the green detection boxes are the results of the "weakly supervised object detection method based on pseudo ground-truth adaptation" of the present invention, and the red detection boxes are the results of "weakly supervised object detector + highest-score method + fully supervised detector"; the figure shows that the method of the present invention is clearly better than the other method.
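To make the CorLoc index concrete, the following sketch computes it in the way it is commonly defined: the fraction of training pictures in which the highest-scoring predicted box overlaps some annotated box of the target class sufficiently. The text does not state the overlap criterion; the IoU threshold of 0.5 used here is an assumption, as are the function names and the (x1, y1, x2, y2) box format.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, annotated_boxes, iou_thresh=0.5):
    """Fraction of pictures whose top-scoring box matches an annotated box.

    top_boxes: one predicted box per picture.
    annotated_boxes: a list of annotated boxes per picture.
    iou_thresh: assumed matching criterion, not taken from the text.
    """
    hits = sum(
        1
        for pred, truths in zip(top_boxes, annotated_boxes)
        if any(iou(pred, t) >= iou_thresh for t in truths)
    )
    return hits / len(top_boxes)
```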

Table 1. Summary of commonly used object detection databases

Table 2. Comparison of experimental results

The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims

1. A weakly supervised object detection method based on pseudo ground-truth adaptation, characterized by comprising:

Step 1), constructing training samples;

Step 2), inputting the pictures in the training samples into a weakly supervised object detector based on multiple-instance learning;

Step 3), applying non-maximum suppression to the output of the weakly supervised object detector, and choosing, in the resulting picture, the highest-scoring bounding box of each kind of object;

Step 4), training a region proposal network according to the position information of the bounding boxes chosen in step 3), generating multiple candidate regions with the region proposal network, and retaining all candidate regions whose overlap ratio with the highest-scoring bounding box of step 3) exceeds a certain threshold, the object of each category corresponding to multiple candidate regions;

Step 5), averaging the pixel coordinates of the candidate regions of the same object, and determining a unique bounding box for each object according to the result;

Step 6), inputting the information of the bounding boxes obtained in step 5) into a fully supervised object detector to obtain detection results.
2. The weakly supervised object detection method based on pseudo ground-truth adaptation according to claim 1, characterized in that step 1) specifically comprises:

Step 1.1), receiving a keyword input by a user, the keyword representing the category of an object;

Step 1.2), searching in a search engine with the keyword, selecting a predetermined number of search results, and using the keyword as the annotation information of the search results.
3. The weakly supervised object detection method based on pseudo ground-truth adaptation according to claim 1, characterized in that in step 1), the training sample set is any one of the PASCAL VOC 2007/2012, MS COCO, WIDER FACE and FDDB databases.
4. The weakly supervised object detection method based on pseudo ground-truth adaptation according to claim 1, characterized in that in step 1), the size of the pictures in the training samples satisfies:

The shortest side of the picture is randomly one of the five scales {480, 576, 688, 864, 1200}; the longest side of the picture is less than or equal to 2000.
5. The weakly supervised object detection method based on pseudo ground-truth adaptation according to claim 1, characterized in that step 2) specifically comprises:

Step 2.1) extracting a predetermined number of candidate regions from the pictures of the training samples using the selective search algorithm;

Step 2.2) inputting the candidate regions into a VGG16 network model trained on the ImageNet data set to obtain shallow features representing detail information and deep features representing semantic information, then obtaining the feature of each candidate region by RoI pooling, and converting the candidate region features from a two-dimensional matrix representation into a one-dimensional vector representation to obtain the fully connected features of each candidate region;

Step 2.3) inputting the fully connected features into the weakly supervised object detector based on multiple-instance learning, the weakly supervised object detector having a classification branch for scoring the object category in a candidate region and a detection branch for scoring the position information of the candidate region; then multiplying the scores of the classification branch and the detection branch to obtain the score of the candidate region;

Step 2.4) inputting the score of each candidate region as supervision information into 3 mutually cascaded optimization networks, and performing backward propagation on the optimization networks to obtain the optimized result.
6. The weakly supervised object detection method based on pseudo ground-truth adaptation according to claim 1, characterized in that in step 6), the fully supervised object detector is any one of Fast-RCNN, Faster-RCNN, YOLO and SSD.
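The picture size constraint of claim 4 can be illustrated with the following sketch. The five scales and the limit of 2000 are taken from the claim; the function names, and the choice to shrink the picture when the longest side would exceed the limit, are assumptions, since the claim does not say how the two conditions are reconciled.

```python
import random

SCALES = (480, 576, 688, 864, 1200)  # candidate lengths of the shortest side
MAX_LONG_SIDE = 2000                 # upper bound on the longest side


def target_size(width, height, short_side):
    """Resize so the shortest side equals short_side, keeping aspect ratio.

    Assumed rule: if the longest side would exceed MAX_LONG_SIDE, the scale
    is reduced so that the longest side equals MAX_LONG_SIDE instead.
    """
    scale = short_side / min(width, height)
    if max(width, height) * scale > MAX_LONG_SIDE:
        scale = MAX_LONG_SIDE / max(width, height)
    return round(width * scale), round(height * scale)


def random_target_size(width, height, rng=random):
    """Pick one of the five scales at random and compute the resized size."""
    return target_size(width, height, rng.choice(SCALES))
```

For a 500 x 375 picture and a shortest side of 480, this gives 640 x 480; for a very elongated picture the cap on the longest side takes precedence in this sketch.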