CN116863250B

CN116863250B - Open scene target detection method related to multi-mode unknown class identification

Info

Publication number: CN116863250B
Application number: CN202311119364.7A
Authority: CN
Inventors: 黄阳阳; 罗荣华
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2023-09-01
Filing date: 2023-09-01
Publication date: 2024-05-03
Anticipated expiration: 2043-09-01
Also published as: CN116863250A

Abstract

The invention discloses an open scene target detection method related to multi-mode unknown class identification. The method comprises the following steps: training to obtain a target detection model UC-OSOD unknown to be recognized in an open scene; generating a background frame by using the RPN, and marking the background frame with the front score as a potential unknown class; separating a known category from an unknown category by using a contrast clustering method; optimizing a target detection model UC-OSOD for unknown identification of an open scene by using a CLIP model based on multiple modes, identifying unknown categories and classifying; and learning a new class by using an incremental learning method according to the provided unknown class label, and further circularly realizing the identification of the unknown class of the open scene. The method and the device realize the detection of untrained objects in the open scene, realize the Zero-Shot prediction of unknown categories, reduce the cost of manual annotation and improve the target detection precision in the open scene.

Description

Open scene target detection method related to multi-mode unknown class identification

Technical Field

The invention belongs to the technical field of image data identification, and particularly relates to an open scene target detection method related to multi-mode unknown class identification.

Background

With the continuous development of deep learning methods, the progress of target detection research is accelerated, the task of target detection is to identify and locate targets in images, and traditional target detection methods are all aimed at working under a closed set, that is, all classes in a training stage are known, so that they can only detect known classes, and if the set is in an open scene, two challenging problems occur: 1) The image contains unknown classes in the test process, which are required to be detected as unknown classes, 2) when the unknown classes are given corresponding labels, the model needs to learn new classes in an increment, and the problem is defined as open scene target detection.

The open scene target detection method not only needs to identify known classes, but also needs to identify all unknown instances as unknown, then the human annotators can tag the classes of interest, and the model learns the classes incrementally in the next task; however, in addition to identifying unknown classes, it is also necessary to determine whether multiple unknown instances belong to the same instance, which is of great value in practical applications. For example, in practical applications of robots and autopilots, it is necessary to explore the unknown environment and to take different strategies for different unknown classes, which requires detection algorithms to locate and identify different unknown instances and assign them to different unknown classes.

In the current open scene target detection method, although unknown categories are identified and classified in one category in the implementation process, the unknown categories are various and not actually the same category, which can cause side effects （Towards Open World Object Detection,K J Joseph,Submitted on 3 Mar 2021 (v1), last revised 9 May 2021 (this version, v2)）.

Disclosure of Invention

The invention aims to solve the problems that unknown categories need to be detected in an open scene, new unknown categories can be gradually learned when unknown category labels are available, the unknown categories can be classified, the detection of the unknown categories is optimized, the detection of the unknown categories in the open scene is realized, the cost of manual labeling is reduced, and the target detection precision in the open world is improved.

The object of the invention is achieved by at least one of the following technical solutions.

An open scene target detection method related to multi-mode unknown class identification comprises the following steps:

s1, in a training stage, using a Faster R-CNN as a reference network training model, and using a known class image as a training set to train the Faster R-CNN model to obtain a target detection model UC-OSOD of unknown identification of an open scene;

s2, generating a background frame by utilizing RPN (Region Proposal Network), and marking the background frame with the front score as a potential unknown class;

S3, separating a known category from an unknown category by using a contrast clustering method;

S4, in an reasoning stage, optimizing a target detection model UC-OSOD for unknown identification of the open scene by using a multi-mode-based CLIP model, identifying the unknown category and classifying;

S5, learning a new class by using an incremental learning method according to the provided unknown class label, and further circularly realizing the identification of the unknown class of the open scene.

Further, in step S1, a Faster R-CNN two-stage target detection algorithm is used as a reference network training model, and a Pascal VOC 2007 and MS-COCO standard data set is used as a training set; the Faster R-CNN is known as Faster Region-based Convolutional Neural Network, and is a two-stage target detection algorithm.

Further, in step S2, extracting the network RPN by using the candidate frames in the fast R-CNN target detection algorithm to generate a foreground frame and a background frame, wherein the background frame is actually an unlabeled candidate region, and therefore, the background frames with higher scores are likely to be potential unknown class objects; extracting background frames generated by network RPN from candidate frames, sorting according to scores, and taking the previous frameLabeling the background frames as potential unknown categories; wherein the RPN is collectively referred to as "Region Proposal Network", which is a module in the Faster R-CNN, for generating a candidate region in target detection.

Further, in step S3, in the potential space, by comparing the clustering method, the instances of the same class will be forced to be kept nearby, and the instances of different classes will be pushed far, so as to realize separation of the known class and the unknown class in the potential space;

specifically, for potential spatial features of the ith category, counting feature average values of iterative samples in a set time period, taking the feature average values as clustering centers, and constraining the sample features of the ith category to be expected to be close to the clustering centers of the ith category, and keeping the sample features of the other categories away from the clustering centers of the ith category; the cluster center of each class is updated continuously during the training process to obtain separate known and unknown class objects.

Further, for any ith class, there is a prototype vector；/>For the known class object/>Feature vectors generated at the detector interlayer;

If the ith class is a known class, then the penalty is the feature vector extracted from the image Prototype vector/>, with i-th classA distance therebetween; this distance uses a distance metric function/>Measuring;

if the ith category is not a known category, then the penalty is 0, margin value And/>And/>The maximum value of the difference between the distances; will compare the loss function/>The definition is as follows:

Wherein, Representing the known category, the number of known categories being/>，/>Is an arbitrary distance metric function,/>Defining the distance between similar and dissimilar terms,/>Representing the sum of losses of known classes,/>Representing a known class object/>Feature vector/>And prototype vector/>, of the i-th classLoss of/>Representing a known class object/>Feature vector/>And prototype vector/>, of the i-th classA distance therebetween; method for separating known category from unknown category by contrast clustering using contrast loss function/>Minimization will ensure that separation of known and unknown classes is achieved in potential space.

Further, in step S4, a target detection model UC-OSOD for unknown identification of the open scene is optimized by using a multi-mode-based CLIP model, the unknown class is identified and classified, and the CLIP is a novel visual language Pre-Training model developed by OpenAI, which is fully called Contrastive Language-Image Pre-Training; it is a multimodal pre-training model that can handle text and image inputs simultaneously and understand the semantic links between the two.

Further, in step S4, candidate frames of unknown classes obtained by the comparative clustering are input into an image encoder of the CLIP model, so as to obtain vector representations of the unknown classes;

Constructing an object class label data set, combining object class labels into a plurality of sentences in a prompting template mode, inputting the sentences into a text encoder of a CLIP model to obtain text feature vectors, mapping the text feature vectors and the vector representations of unknown classes to the same multi-modal feature space, calculating cosine similarity of the text feature vectors and the vector representations of the unknown classes, and obtaining a classification result of the unknown classes, wherein the label with the highest similarity is the label of the unknown class.

Further, in step S5, according to the provided unknown class label, a new unknown class label is input, and the object detection model UC-OSOD based on open scene unknown class identification is obtained through retraining, so that open scene unknown class identification is realized in a circulating manner;

The retraining is carried out to obtain a target detection model UC-OSOD based on open scene unknown class identification, a new class is learned by using an incremental learning method based on sample playback, namely, a part of representative old data is stored, and fine adjustment is carried out on the target detection model UC-OSOD based on open scene unknown class identification after each incremental step;

And freezing parameters of other layers except the output layer of the target detection model UC-OSOD based on the unknown class identification of the open scene, and only adjusting the parameters of the final output layer.

Further, incremental learning based on sample playback is a machine learning method mainly used to deal with the addition of new data in online learning, the basic idea of which is to train a model using historical data and then use the new data together with the historical data to update the model. The method has the main advantages that the whole model can be prevented from being retrained, so that the training efficiency can be greatly improved, one common strategy is to randomly select a part of historical data to be used together with new data, and the method can prevent the model from being too dependent on certain historical data, so that the generalization capability of the model is improved, and specifically comprises the following steps:

S5.1, initializing a model: before incremental learning begins, the UC-OSOD model needs to be initialized and used for training a part of data;

s5.2, training a model: training the UC-OSOD model by using a part of new data;

S5.3, sample playback: storing samples of a set proportion in a data set trained before in a buffer, called a playback buffer, and then randomly extracting samples of the set proportion from the playback buffer, wherein the samples are used for training of the UC-OSOD model together with the current training data;

S5.4, updating a model: combining the UC-OSOD model trained by using the samples in the playback buffer area with the UC-OSOD model trained in the step S5.2 to obtain a new UC-OSOD model;

S5.5, testing model: evaluating the UC-OSOD model obtained by combining in the step S5.3 by using a test data set;

S5.6, if new data are needed to be trained, returning to the step S5.2, otherwise, ending the incremental learning.

Further, the fine tuning is performed by training the model using a portion of representative historical data and new data in order to avoid model retraining when tags of unknown class are received; in model tuning, the method of adjusting only the parameters of the last output layer is generally called "head tuning" (head fine-tuning) or "global tuning";

The main idea of the method is that only the last layers of the UC-OSOD model are finely tuned by utilizing the general characteristics learned by the pre-training model on large-scale data, so that the UC-OSOD model can be better adapted to a new task, and the specific implementation process is as follows:

A1, loading a pre-training UC-OSOD model: using the UC-OSOD model which is pre-trained on large-scale data as an initial model;

a2, freezing model parameters: for layers that do not require fine tuning, their parameters are frozen so that they do not change during the training process;

A3, replacing an output layer: replacing the last output layer of the UC-OSOD model with a new output layer adapting to the task, wherein the output layer contains the category number required by the new task;

a4, training only a new output layer: only training the new output layer, so that the UC-OSOD model can be better adapted to new tasks;

A5, thawing parameters: if the parameters of other layers need to be fine-tuned, the parameters of the layers are thawed so that they can be changed in the fine-tuning;

a6, fine tuning a model: the entire UC-OSOD model is trimmed until the UC-OSOD model converges on the new task.

Compared with the prior art, the invention has the advantages that:

In the current open scene target detection method, although unknown classes are identified in the implementation process, the unknown classes are classified into uniform unknown classes, but the unknown classes are various and are not actually the same class, which can cause side effects, and the classification of the unknown classes has great commercial value, for example, the unknown environments need to be explored in the practical application of robots and automatic driving automobiles, and different strategies are adopted for different unknown classes; according to the method, the accuracy of open scene target detection is improved through subdivision of the unknown categories.

Drawings

FIG. 1 is a flow chart of an open scene object detection method involving multi-modal unknown class identification in an embodiment of the invention;

FIG. 2 is a schematic diagram of PRN labeling unknown classes in an embodiment of the present invention;

FIG. 3 is a schematic diagram of comparative clustering in an embodiment of the present invention;

FIG. 4 is a schematic diagram of identifying unknown classes of CLIP in an embodiment of the present invention;

fig. 5 is an effect diagram of an embodiment of the present invention.

Detailed Description

For the purposes of promoting an understanding of the principles and advantages of the invention, reference will now be made in detail to the embodiments of the invention, examples of which are illustrated in the accompanying drawings and will be apparent to those skilled in the art, all other embodiments of which are intended to be within the scope of the invention.

Examples:

an open scene target detection method related to multi-mode unknown class identification, as shown in fig. 1, comprises the following steps:

Using a Faster R-CNN two-stage target detection algorithm as a reference network training model, wherein the training set uses a Pascal VOC 2007 and MS-COCO standard data set; the Faster R-CNN is known as Faster Region-based Convolutional Neural Network, and is a two-stage target detection algorithm.

In the training phase of the model, the confidence SCORE of target detection is set to 0.35, and the non-maximum suppression NMS is set to 0.35.

S2, generating a background frame by utilizing RPN, and marking the background frame with the front score as a potential unknown class;

as shown in fig. 2, extracting the network RPN using the candidate boxes in the fast R-CNN target detection algorithm produces foreground boxes and background boxes, wherein the background boxes are virtually unlabeled candidate regions, and therefore the higher scores in these background boxes are likely to be potential unknown class objects; extracting background frames generated by network RPN from candidate frames, sorting according to scores, and taking the previous frame Labeling the background frames as potential unknown categories; wherein the RPN is collectively referred to as "Region Proposal Network", which is a module in the Faster R-CNN, for generating a candidate region in target detection. The value of k is determined according to the confidence SCORE of background display, and k backgrounds with the SCORE larger than 0.5 are set as unknown categories to be displayed in the invention, so that the known and unknown categories are obtained. Wherein the RPN is collectively referred to as "Region Proposal Network", which is a module in the Faster R-CNN, for generating a candidate region in target detection.

S3, separating known and unknown categories by using a contrast clustering mode;

as shown in fig. 3, in the potential space, by comparing the clustering method, the instances of the same class are forced to be kept nearby, and the instances of different classes are pushed far away, so that the separation of the known class and the unknown class in the potential space is realized;

For any ith class, there is a prototype vector；/>For the known class object/>Feature vectors generated at the detector interlayer;

Wherein, Representing the known category, the number of known categories being/>，/>Is an arbitrary distance metric function,/>Distance between similar and dissimilar terms is defined, in the context of edge values, i.e. distance between similar and dissimilar terms/>The values of (a) are obtained by comparing and evaluating the correlation experiments, in particular the distance/>, between similar and dissimilar termsCan be obtained by the following method: experimental evaluation: by designing and executing a series of experiments, different margin values/>, are comparedThe following performance index or result. Different margin values may be tried and the experimental results recorded at each value, then the best margin value is selected depending on the performance or results,/>Representing the sum of losses of known classes,/>Representing a known class object/>Feature vector/>And prototype vector/>, of the i-th classLoss of/>Representing a known class object/>Feature vector/>And prototype vector/>, of the i-th classA distance therebetween; let contrast loss function/>Minimization will ensure that separation of known and unknown classes is achieved in potential space.

as shown in fig. 4, CLIP is a novel visual language Pre-Training model developed by OpenAI, which is collectively referred to as Contrastive Language-Image Pre-Training; it is a multimodal pre-training model that can handle text and image inputs simultaneously and understand the semantic links between the two.

Inputting the candidate frames of the unknown class obtained by the comparative clustering into an image encoder of the CLIP model to obtain vector representation of the unknown class; constructing an object class label data set, combining 1000 class names of the ImageNet data set into an object class label data set in a prompting template mode, inputting the object class label data set into a text encoder of a CLIP model to obtain text feature vectors, mapping the text feature vectors and the vector representations of unknown classes to the same multi-mode feature space, calculating cosine similarity of the text feature vectors and the vector representations of the unknown classes, and obtaining a classification result of the unknown classes, wherein the label with the highest similarity is the label of the unknown class.

S5, inputting a new unknown class label according to the provided unknown class label, retraining to obtain a target detection model UC-OSOD based on open scene unknown class identification, and further circularly realizing the open scene unknown class identification, wherein in one embodiment, the obtained effect diagram is shown as a diagram and b diagram in FIG. 5.

When the target detection model UC-OSOD based on the open scene unknown class identification is obtained through retraining, a new class is learned by using an incremental learning method based on sample playback, namely, a part of representative old data is stored, fine tuning is carried out on the target detection model UC-OSOD based on the open world unknown class identification after each incremental step, parameters of other layers except the output layer of the target detection model UC-OSOD based on the open world unknown class identification are frozen, and only parameters of the last output layer are adjusted.

Incremental learning based on sample playback is a machine learning method that is mainly used to handle the addition of new data in online learning, the basic idea of which is to train a model using historical data and then use the new data with the historical data to update the model. The method has the main advantages that the whole model can be prevented from being retrained, so that the training efficiency can be greatly improved, one common strategy is to randomly select a part of historical data to be used together with new data, and the method can prevent the model from being too dependent on certain historical data, so that the generalization capability of the model is improved, and specifically comprises the following steps:

s5.1, initializing a model: before incremental learning begins, a UC-OSOD model needs to be initialized and used for training a part of data;

s5.2, training a model: training the UC-OSOD model by using a part of new data;

S5.4, updating a model: combining the model trained by using the samples in the playback buffer area with the model trained in the step S5.2 to obtain a new UC-OSOD model;

s5.5, testing model: the UC-OSOD model in step S5.3 was evaluated using the test dataset.

The fine tuning is performed by training the model using a portion of representative historical data and new data in order to avoid model retraining when tags of unknown class are received; in model tuning, the method of adjusting only the parameters of the last output layer is generally called "head tuning" (head fine-tuning) or "global tuning";

The main idea of the method is that only the last layers of the model are finely tuned by utilizing the general features learned by the pre-training model on large-scale data, so that the model can be better adapted to a new task, and the specific implementation process is as follows:

a1, loading a pre-training model: using the UC-OSOD model which is pre-trained on large-scale data as an initial model;

A4, training only a new output layer: only training the new output layer, so that the model can be better adapted to new tasks;

a6, fine tuning a model: the entire UC-OSOD model is trimmed until the model converges on the new task.

In one embodiment, to demonstrate the effectiveness of the proposed method, a validation experiment is performed as follows:

a comprehensive evaluation standard is proposed to investigate the performance of UC-OSOD (an open scene object detection method involving multi-modal unknown class recognition), including recognition of unknown class objects, detection of known classes, and gradual learning of new classes while providing labels to the unknown classes.

Data segmentation: at the task set={/>，/>The UC-OSOD model was evaluated over. All classes of a particular task will be at the point in time/>Is introduced into the system. For task/>，{/>：/></>Is known, {/>：/>>/>The information will be regarded as unknown. As shown in Table 1,4 tasks were constructed, each task having 20 classes, and using the Pascal VOC and MS-COCO data sets. Task/>Consists of all VOC classes and data, which do not contain any information about the unknown class, which allows the model to be tested without any unknown information during training. The remaining 60 classes of MS-COCO are divided into three parts, namely/>，/>，. Although/>And/>The training images in (1) do not have labels of unknown instances, but they contain unknown instances, which can test the effect of the model in this case, in each Task (Task) the evaluation data consists of a Pascal VOC test segmentation and an MS-COCO verification segmentation. Table 1 shows the Task (Task) composition in the open world target detection evaluation criteria for unknown class identification:

TABLE 1

Evaluation index: since the unknown target is easily confused with the known target, using the WILDERNESS IMPACT (WI) index to explicitly describe this behavior, WI should ideally be smaller because the accuracy cannot be degraded when the unknown target is added to the test set. In addition to WI, absolute open set error (A-OSE) is also used to reflect the number of unknown targets misclassified as a known class. Both WI and a-OSE implicitly measure the effectiveness of the model in handling unknown targets.

Tables 2,3,4, and 5 show the comparison of UC-OSOD with the baseline model Faster R-CNN on open world target detection for task 1, task 2, task 3, and task 4, respectively, and WI and A-OSE show the degree of confusion for the known and unknown categories after each task training. The WI and a-OSE scores of the UC-OSOD model are significantly lower due to the explicit modeling of the unknown object. When the unknown class is progressively marked in task 2, the performance of the baseline detector is found to drop significantly from 56.16% over the set of known classes (quantified by mAP), which measures how well it detects the known class, to 4.011%, with the larger value the better the performance. UC-OSOD can achieve two goals simultaneously: detecting the effect of the known class and reducing the effect of the unknown class. Similar trends also occur in task 3 and task 4 classes.

TABLE 2

TABLE 3 Table 3

TABLE 4 Table 4

TABLE 5

The UC-OSOD model is used for clearly modeling the unknown object, so that the UC-OSOD model has good performance in the incremental target detection task. UC-OSOD reduces confusion in which unknown objects are classified as known objects, which allows the detector to learn incrementally about the actual foreground objects. UC-OSOD was evaluated using the criteria used in ILOD (abbreviation for incremental target detector), and the Pascal VOC 2007 dataset was used to divide the dataset into three groups: 10 (known class) +10 (unknown class), 15 (known class) +5 (unknown class), 19 (known class) +1 (unknown class) to allow incremental learning of the detector. UC-OSOD was compared with ILOD at three different settings. As shown in table 6 below, UC-ORE performs very well in all settings.

TABLE 6

It is to be understood that various changes and modifications in the form and detail herein disclosed may be made by those skilled in the art without departing from the spirit and scope of the present disclosure. Accordingly, equivalent modifications and variations of the present invention should also be made within the scope of the present invention. In addition, although specific terms are used in the present specification, these terms are for convenience of description only and do not limit the present invention in any way.

Claims

1. An open scene target detection method related to multi-mode unknown class identification is characterized by comprising the following steps:

S2, generating a background frame by utilizing RPN, and marking the background frame with the front score as a potential unknown class; extracting network RPN by using candidate frames in a Faster R-CNN target detection algorithm to generate a foreground frame and a background frame, wherein the background frame is an unmarked candidate region in fact; extracting background frames generated by network RPN from candidate frames, sorting according to scores, and taking the previous frame Labeling the background frames as potential unknown categories; the RPN is one module in the fast R-CNN and is used for generating a candidate region in target detection;

S3, separating a known category from an unknown category by using a contrast clustering method; in the potential space, the instances of the same class are forced to be kept nearby by a contrast clustering method, and the instances of different classes are pushed far away, so that the separation of the known class and the unknown class in the potential space is realized;

specifically, for potential spatial features of the ith category, counting feature average values of iterative samples in a set time period, taking the feature average values as clustering centers, and constraining the sample features of the ith category to be expected to be close to the clustering centers of the ith category, and keeping the sample features of the other categories away from the clustering centers of the ith category; the clustering center of each category is continuously updated in the training process to obtain separated known and unknown category objects; for any ith class, there is a prototype vector ；/>For the known class object/>Feature vectors generated at the detector interlayer;

if the ith category is a known category, then the loss is extracted from the image Prototype vector/>, with i-th classA distance therebetween; this distance uses a distance metric function/>Measuring;

Wherein, Representing the known category, the number of known categories being/>，/>Is an arbitrary distance metric function,/>Defining the distance between similar and dissimilar terms,/>Representing the sum of losses of known classes,/>Representing a known class object/>Feature vector/>And prototype vector/>, of the i-th classLoss of/>Representing a known class object/>Feature vector/>And prototype vector/>, of the i-th classA distance therebetween; method for separating known category from unknown category by contrast clustering using contrast loss function/>Minimizing will ensure that separation of known and unknown classes is achieved in potential space;

S4, in an reasoning stage, optimizing a target detection model UC-OSOD for unknown identification of the open scene by using a multi-mode-based CLIP model, identifying the unknown category and classifying; optimizing a target detection model UC-OSOD for unknown identification of an open scene by using a multi-mode-based CLIP model, identifying unknown categories and classifying, wherein the CLIP is a novel visual language Pre-Training model developed by OpenAI and is totally called Contrastive Language-Image Pre-Training; the multi-mode pre-training model can process text and image input simultaneously and understand semantic relation between the text and the image input; inputting the candidate frames of the unknown class obtained by the comparative clustering into an image encoder of the CLIP model to obtain vector representation of the unknown class;

Constructing an object class label data set, combining object class labels into a plurality of sentences in a prompting template mode, inputting the sentences into a text encoder of a CLIP model to obtain text feature vectors, mapping the text feature vectors and unknown class vector representations into the same multi-modal feature space, calculating cosine similarity of the text feature vectors and unknown class vector representations, and obtaining a classification result of the unknown class by using the label with the highest similarity as the unknown class label;

S5, learning a new class by using an incremental learning method according to the provided unknown class label, and further circularly realizing the identification of the unknown class of the open scene; inputting a new unknown class label according to the provided unknown class label, retraining to obtain a target detection model UC-OSOD based on open scene unknown class identification, and further circularly realizing the open scene unknown class identification;

freezing parameters of other layers except the output layer of the target detection model UC-OSOD based on open scene unknown class identification, and only adjusting the parameters of the last output layer;

the incremental learning based on sample playback is a machine learning method, and specifically comprises the following steps:

s5.2, training a model: training the UC-OSOD model by using a part of new data;

S5.3, sample playback: storing the samples of the set proportion in a buffer, called a playback buffer, in the data set trained before, and then randomly extracting the samples of the set proportion from the playback buffer, and using the samples together with the current training data for training of the UC-OSOD model;

S5.6, if new data are needed to be trained, returning to the step S5.2, otherwise, ending the incremental learning;

The fine tuning is performed by training the model using a portion of representative historical data and new data in order to avoid model retraining when tags of unknown class are received; by utilizing the general features of the pre-training model learned on large-scale data, only the last layers of the UC-OSOD model are finely tuned, so that the UC-OSOD model can be better adapted to a new task, and the specific implementation process is as follows:

2. The method for detecting an open scene target related to multi-modal unknown class recognition according to claim 1, wherein in step S1, a two-stage target detection algorithm of fast R-CNN is used as a reference network training model, and the training set uses a standard dataset of Pascal VOC 2007 and MS-COCO.