CN115439688A - Weak supervision object detection method based on surrounding area perception and association - Google Patents
- Publication number
- CN115439688A (application CN202211066364.0A)
- Authority
- CN
- China
- Prior art keywords
- area
- clustering
- region
- surrounding
- discriminative
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06V10/23—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on positionally close patterns or neighbourhood relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
Abstract
A weakly supervised object detection method based on surrounding region perception and association, relating to the technical field of object detection. It addresses a problem in the prior art: weakly supervised object detection tends to converge to a local optimum, so that only the most discriminative region of an object is detected rather than the whole object region, which causes object localization to fail and lowers detection accuracy. The invention is basic technical research on object detection in practical application scenarios; it promotes, to a certain extent, the practical deployment of deep-learning-based object detection and narrows the gap between weakly supervised and fully supervised object detection.
Description
Technical Field
The invention relates to the technical field of object detection, and in particular to a weakly supervised object detection method based on surrounding region perception and association.
Background
Weakly supervised object detection is a technique that performs object detection using only image-level labels, where an image-level label indicates whether an object of a given category is present in an image. In real-world applications, the instance-level labels required to train a fully supervised object detector are often unavailable. Weakly supervised object detection replaces these instance-level labels with image-level labels, which greatly reduces the need for instance-level annotated training data and makes object detection possible when labeled data are scarce. However, compared with fully supervised object detection, weakly supervised object detection rarely has modules designed for accurate object region localization (fully supervised detection has region proposal networks, feature pyramid networks, etc.). Moreover, weakly supervised object detection is generally treated as a candidate region classification task, which causes the detector to converge to a local optimum and to output only the most discriminative region of the object. For these reasons, weakly supervised object detection is a challenging but promising technique.
At present, to narrow the gap between weakly supervised and fully supervised object detection and to alleviate the local focusing phenomenon, existing weakly supervised detection methods can be grouped into four representative categories. (1) Methods based on initializing high-quality candidate regions preserve the recall of the detection task and combine class activation maps with the selective search algorithm to generate high-quality candidate regions. These regions are used as the input of the weakly supervised detector, which raises the intersection over union (IoU) between candidate regions and ground-truth bounding boxes while keeping recall high, yielding more accurate detections. (2) Methods based on an iterative refinement strategy guide the detector toward the complete object region: highly overlapping regions are given the same category label, which serves as prior knowledge during training and provides supervision for the next branch. (3) Methods based on converting weak supervision to full supervision combine the advantages of weak supervision (labels are easy to obtain) and full supervision (strong regression capability): a fully supervised detector is trained on the output of the weakly supervised detector, and its output is taken as the final detection result. (4) Methods based on complete object search use the class activation map as a position prior for the object region and search for the region that maximizes the score of the detected region and minimizes the score of its surrounding region, thereby locating the complete object region.
However, none of these four categories (high-quality region generation, iterative refinement, weak-to-full supervision conversion, and complete object search) fundamentally solves the local focusing phenomenon, and none is widely applicable: each suits only one weakly supervised detection method or one class of such methods. The limitations of existing weakly supervised object detection can therefore be summarized in two aspects. (1) Weakly supervised object detection tends to converge to a local optimum. Visually, only the most discriminative region of the object is detected rather than the whole object region, so object localization fails. (2) In fully supervised object detection, well-designed modules that improve localization accuracy, such as region proposal networks and feature pyramid networks, can be integrated into any fully supervised detector. In weakly supervised object detection, by contrast, general-purpose modules designed to improve localization accuracy rarely exist, and existing methods apply only to one weakly supervised detection method or one class of such methods.
Disclosure of Invention
The purpose of the invention is as follows: in the prior art, weakly supervised object detection tends to converge to a local optimum, so that only the most discriminative region of an object is detected rather than the whole object region, which causes object localization to fail and lowers detection accuracy. To address these problems, a weakly supervised object detection method based on surrounding region perception and association is provided.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A weakly supervised object detection method based on surrounding region perception and association comprises the following steps:
Step one: acquire an image to be recognized, run a weakly supervised detector on the image, and take the predicted object location as the most discriminative region;
Step two: expand the most discriminative region, cut the expanded region into image patches, and take the image patches as surrounding regions;
Step three: extract features of the most discriminative region and the surrounding regions, cluster the features, assign a cluster label to each region, and divide the regions into clusters according to their cluster labels;
Step four: using the cluster labels, find the surrounding regions that have the same label as the most discriminative region, and fuse these surrounding regions with the most discriminative region into a new object region;
Step five: apply data augmentation to the most discriminative region to obtain two augmented views of it, q' and q'';
Step six: extract features of q', q'' and the surrounding regions, and cluster the extracted features. If q' and q'' are assigned to the same cluster, regard the clustering as correct and go to step seven. Otherwise, execute steps three to six once more; if q' and q'' are then assigned to the same cluster, regard the clustering as correct and go to step seven; if they still cannot be assigned to the same cluster, ignore the clustering result and take the most discriminative region as the final object region;
Step seven: for a clustering that assigns q' and q'' to the same cluster, compute the distances $d_1$ and $d_2$ from q' and q'' to the center of that cluster. If the distance difference $|d_1 - d_2|$ exceeds a set threshold $T_{dis}$, ignore the clustering result; otherwise, regard the clustering as correct;
Step eight: based on the correct clustering from step seven, take the surrounding regions in the cluster that contains both q' and q'', and compute the cosine similarity between each surrounding region and q' or q''. If the cosine similarity is greater than the threshold $T_{score} = 0.95$, fuse the most discriminative region in the cluster with that surrounding region into the final object region;
Step nine: train a neural network using the most discriminative region and the surrounding regions as input and the final object region as output, and detect objects with the trained neural network.
Further, the expansion ratio α used in step two is greater than 1.
Further, the expansion ratio α used in step two is 1.2.
Further, the features are high-dimensional nonlinear features.
Further, the high-dimensional nonlinear features are extracted with a ViT.
Further, the size of each image patch is 32 × 32.
Further, the surrounding regions used in step three are 60% of the surrounding regions obtained in step two.
Further, the data augmentation includes random color jitter, random grayscale conversion, random Gaussian blur, and random solarization.
Further, the neural network is a MoCo v3 network.
Further, the specific steps of training the neural network are as follows:
Training uses unsupervised contrastive learning and runs for 100 epochs in total;
(1) In epochs 0-29, the network input is the most discriminative region;
(2) In epochs 30, 35, 40, ..., 100, the network performs the fusion process, and the fused final object region is used as the network input for training;
(3) In epochs 31-34, 36-39, ..., 96-99, the network inputs are the fused final object regions and the most discriminative regions that were not fused.
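The three-phase schedule above can be written as a small lookup. The following is a minimal sketch, not the patent's implementation: the function name `training_phase` and the phase strings are hypothetical, and the rule "every fifth epoch from 30 onward" is inferred from the listed epochs 30, 35, 40, and so on.

```python
def training_phase(epoch: int) -> str:
    """Return which inputs the network uses in a given epoch.

    Phase names are illustrative labels, not terms from the source.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < 30:
        # Warm-up: only the most discriminative (query) regions are used.
        return "query_only"
    if epoch % 5 == 0:
        # Fusion epochs (30, 35, 40, ...): run fusion, train on fused regions.
        return "fuse_and_train_on_fused"
    # Intermediate epochs (31-34, 36-39, ...): reuse fused regions plus
    # the query regions that were not fused.
    return "fused_plus_unfused_queries"


schedule = {e: training_phase(e) for e in (0, 29, 30, 31, 35, 99)}
```

Keeping the fusion step on a fixed five-epoch cadence means clustering is only rerun occasionally, which is consistent with the source's concern that early clustering is unstable.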
The beneficial effects of the invention are as follows:
The method addresses the low detection accuracy of weakly supervised object detection and its convergence to a local optimum, overcomes the lack of a module for improving localization accuracy in weakly supervised detection, and reduces the dependence of object detection on expensive manual annotation. The invention is basic technical research on object detection in practical application scenarios; it promotes, to a certain extent, the practical deployment of deep-learning-based object detection and narrows the gap between weakly supervised and fully supervised object detection.
Drawings
FIG. 1 is a diagram of an example of a surrounding area sensing and association module;
FIG. 2 is a diagram of a regional feature extractor architecture;
FIG. 3 is a diagram of a regional connectivity network architecture;
FIG. 4 is a schematic diagram of region fusion constraints 1;
FIG. 5 is a schematic diagram of region fusion constraints 2;
FIG. 6 is a comparison graph a of the detection effect of a weakly supervised object;
FIG. 7 is a comparison graph b of the detection effect of the weakly supervised object;
FIG. 8 is a comparison graph c of the detection effect of the weakly supervised object.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
Embodiment one: this embodiment is described with reference to FIG. 1. The weakly supervised object detection method based on surrounding region perception and association in this embodiment includes the following steps:
Step one: acquire an image to be recognized, run a weakly supervised detector on the image, and take the predicted object location as the most discriminative region;
Step two: expand the most discriminative region, cut the expanded region into image patches, and take the image patches as surrounding regions;
Step three: extract features of the most discriminative region and the surrounding regions, cluster the features, assign a cluster label to each region, and divide the regions into clusters according to their cluster labels;
Step four: using the cluster labels, find the surrounding regions that have the same label as the most discriminative region, and fuse these surrounding regions with the most discriminative region into a new object region;
Step five: apply data augmentation to the most discriminative region to obtain two augmented views of it, q' and q'';
Step six: extract features of q', q'' and the surrounding regions, and cluster the extracted features. If q' and q'' are assigned to the same cluster, regard the clustering as correct and go to step seven. Otherwise, execute steps three to six once more; if q' and q'' are then assigned to the same cluster, regard the clustering as correct and go to step seven; if they still cannot be assigned to the same cluster, ignore the clustering result and take the most discriminative region as the final object region;
Step seven: for a clustering that assigns q' and q'' to the same cluster, compute the distances $d_1$ and $d_2$ from q' and q'' to the center of that cluster. If the distance difference $|d_1 - d_2|$ exceeds a set threshold $T_{dis}$, ignore the clustering result; otherwise, regard the clustering as correct;
Step eight: based on the correct clustering from step seven, take the surrounding regions in the cluster that contains both q' and q'', and compute the cosine similarity between each surrounding region and q' or q''. If the cosine similarity is greater than the threshold $T_{score} = 0.95$, fuse the most discriminative region in the cluster with that surrounding region into the final object region;
Step nine: train a neural network using the most discriminative region and the surrounding regions as input and the final object region as output, and detect objects with the trained neural network.
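The similarity test in step eight can be illustrated with a short sketch. It assumes each region is already represented by a feature vector (for example, from the ViT extractor); the helper names `cosine_similarity` and `select_regions_to_fuse` and the toy vectors are hypothetical, and only the 0.95 threshold comes from the text.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_regions_to_fuse(query_feat: Sequence[float],
                           key_feats: Sequence[Sequence[float]],
                           t_score: float = 0.95) -> List[int]:
    """Indices of surrounding (key) regions whose similarity exceeds t_score."""
    return [i for i, k in enumerate(key_feats)
            if cosine_similarity(query_feat, k) > t_score]


# Toy 2-D features; real features would be high-dimensional embeddings.
query = (1.0, 0.0)
keys = [(1.0, 0.1), (0.0, 1.0), (2.0, 0.0)]
selected = select_regions_to_fuse(query, keys)
```

Cosine similarity ignores vector magnitude, so the third key region is selected even though its norm differs from the query's.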
To narrow the gap in localization accuracy between weakly supervised and fully supervised object detectors, the present application provides a module that can be embedded into any weakly supervised detector and forms an end-to-end learning framework with it. This addresses the problem that training converges to a local optimum, which manifests as the weakly supervised detector usually identifying only the most discriminative region of the object. To overcome this defect, the region association network proposed in the application uses clustering to dynamically query the similarity between the surrounding regions and the object region predicted by the existing detector, and fuses the highly similar regions accordingly, so that the object region output by the weakly supervised detector covers the complete object. However, introducing a clustering process into the region association network also introduces its weaknesses: in the early training stage, clustering is unstable, misassigns regions, and has low confidence. The region fusion constraint proposed in the application tightens the conditions for region fusion, blocks erroneous fusion, further refines the rough object region output by the region association network, and outputs an accurate and complete detection result.
Specifically, an example of the surrounding region perception and association module proposed in the application is shown in FIG. 1. It includes three components: a region extractor, a region association network, and a region fusion constraint. The first component, the region extractor, works with any weakly supervised detector: based on the detection result in an image, it separates the most discriminative region from the surrounding regions and uses different cropping ranges to define the surrounding regions, which improves detection accuracy and reduces execution time. The second component, the region association network, repeatedly performs contrastive learning and clustering on the two types of regions output by the region extractor. It extracts good visual representations of both types of regions, feeds these representations into the clustering process, assigns a cluster label to each region, uses the cluster labels to find the surrounding regions that share the label of the most discriminative region, and fuses them into a new object region. The third component is the region fusion constraint. The new object region output by the region association network may include unstable, low-confidence surrounding regions in the early stage of clustering; if such regions were fused with the most discriminative region as the final result, the completeness and accuracy of the detected object would suffer, so the constraint is used to filter them out.
The region extractor first takes the output of the existing detector as the most discriminative region. It then expands the most discriminative region by a ratio α and uses the expanded region as the cropping range. Within the cropping range, 32 × 32 patches are cropped in sequence as key regions. In the region extractor, each region is therefore either the most discriminative region (a query region) or a surrounding region (a key region). Both types of regions are input into the region association network, where contrastive learning and clustering are performed to find the complete object region.
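A minimal sketch of the region extractor follows. Several details are assumptions rather than statements from the source: α is taken to scale the box width and height about the box centre, the expanded box is clipped to the image, and the 32 × 32 patches are tiled without overlap starting at the top-left corner. The names `expand_box` and `tile_patches` and the example coordinates are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def expand_box(box: Box, alpha: float, img_w: int, img_h: int) -> Box:
    """Scale a box about its centre by alpha and clip it to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = alpha * (x1 - x0) / 2.0
    half_h = alpha * (y1 - y0) / 2.0
    return (
        max(0, int(round(cx - half_w))),
        max(0, int(round(cy - half_h))),
        min(img_w, int(round(cx + half_w))),
        min(img_h, int(round(cy + half_h))),
    )


def tile_patches(crop: Box, patch: int = 32) -> List[Box]:
    """Cut the cropping range into non-overlapping patch x patch key regions."""
    x0, y0, x1, y1 = crop
    return [
        (x, y, x + patch, y + patch)
        for y in range(y0, y1 - patch + 1, patch)
        for x in range(x0, x1 - patch + 1, patch)
    ]


# Query region as predicted by some weakly supervised detector (made-up values).
query = (100, 100, 200, 200)
crop = expand_box(query, alpha=1.2, img_w=500, img_h=375)
keys = tile_patches(crop)
```

With α = 1.2 the 100-pixel box grows to 120 pixels per side, which fits a 3 × 3 grid of 32-pixel patches; any remainder at the right and bottom edges is dropped in this sketch.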
The region association network combines contrastive learning with clustering. For contrastive learning, the most discriminative regions output by the region extractor serve as query regions and the surrounding regions serve as key regions. The query regions undergo a series of image augmentations: random color jitter, random grayscale conversion, random Gaussian blur, and random solarization. The augmented regions (q', q'') are input into the MoCo v3 framework, and an unsupervised training strategy is used to learn good visual representations of the query regions. With the ViT feature extractor trained inside the MoCo v3 framework, the most discriminative region and the surrounding regions are taken as the input of the region association network, and their features are extracted and mapped into a high-dimensional nonlinear space. For clustering, the region association network clusters the high-dimensional nonlinear features of the two types of regions and assigns a cluster label to each region according to Euclidean distance in that space. The surrounding regions that fall in the same cluster as the most discriminative region are extracted and fused with it into a new object region, which can cover all or most of the real object. The unsupervised training of the region association network requires no label information.
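The label assignment and fusion described above can be sketched as follows. This is an illustration under stated assumptions: cluster centroids are taken as given (the source does not name the clustering algorithm), "fusion" is interpreted as the tightest box enclosing the query region and its same-cluster key regions, and the function names and toy 2-D features are hypothetical stand-ins for ViT embeddings.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def assign_labels(features: Sequence[Sequence[float]],
                  centroids: Sequence[Sequence[float]]) -> List[int]:
    """Label each feature with the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(f, centroids[k]))
        for f in features
    ]


def fuse_with_query(query_box: Box, query_label: int,
                    key_boxes: Sequence[Box],
                    key_labels: Sequence[int]) -> Box:
    """Enclose the query box and every key box that shares its cluster label."""
    members = [query_box] + [
        box for box, label in zip(key_boxes, key_labels) if label == query_label
    ]
    return (
        min(b[0] for b in members), min(b[1] for b in members),
        max(b[2] for b in members), max(b[3] for b in members),
    )


# Index 0 is the query region; the rest are surrounding (key) regions.
features = [(0.9, 0.1), (1.0, 0.0), (0.1, 0.9), (0.8, 0.2)]
centroids = [(1.0, 0.0), (0.0, 1.0)]
labels = assign_labels(features, centroids)
boxes = [(100, 100, 200, 200), (68, 100, 100, 132),
         (200, 200, 232, 232), (100, 68, 132, 100)]
new_box = fuse_with_query(boxes[0], labels[0], boxes[1:], labels[1:])
```

Here two key regions share the query's cluster, so the fused box grows to the left and upward, while the key region in the other cluster is left out.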
The region fusion constraint proposed in the application comprises a category sub-constraint and a distance sub-constraint. The category sub-constraint checks whether the views of a query region produced by image augmentation are assigned to the same cluster. When the category sub-constraint is satisfied, the distance sub-constraint computes the difference between the distances from those views to the centroid of their cluster. If the difference is smaller than a preset threshold, the clustering is considered successful; otherwise, the clustering is considered to have failed, the current clustering result is ignored, and no new object region is fused. Specifically, image augmentation of a query region in the region association network yields q' and q'', two augmented views of the same region, which are input into the clustering process together with the surrounding regions. If q' and q'' fall in the same cluster, the clustering is regarded as successful. Otherwise, the category sub-constraint ignores the clustering result, and clustering is executed again to search for surrounding regions that are highly similar to the most discriminative region. When q' and q'' are in the same cluster, the distance sub-constraint computes their distances to the current cluster center. If the distance difference is greater than the set threshold, the current clustering result is ignored, and the original object region is regarded as complete and is not fused further.
Conversely, if the distance difference is smaller than the set threshold, the category sub-constraint and the distance sub-constraint are both satisfied, and the surrounding regions that satisfy both sub-constraints are regarded as candidate regions to be fused.
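The two sub-constraints can be combined into a single check, sketched below. The function names, the toy centroids, and the threshold values are hypothetical (the source does not give a value for the distance threshold), and treating a difference exactly equal to the threshold as a pass is an assumption.

```python
import math
from typing import Sequence


def nearest_cluster(feature: Sequence[float],
                    centroids: Sequence[Sequence[float]]) -> int:
    """Index of the centroid closest to the feature (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(feature, centroids[k]))


def passes_fusion_constraint(f_q1: Sequence[float], f_q2: Sequence[float],
                             centroids: Sequence[Sequence[float]],
                             t_dis: float) -> bool:
    """Check the category and distance sub-constraints for two query views."""
    c1 = nearest_cluster(f_q1, centroids)
    c2 = nearest_cluster(f_q2, centroids)
    if c1 != c2:
        # Category sub-constraint fails: the views landed in different clusters.
        return False
    d1 = math.dist(f_q1, centroids[c1])
    d2 = math.dist(f_q2, centroids[c1])
    # Distance sub-constraint: the two distances must be close to each other.
    return abs(d1 - d2) <= t_dis


centroids = [(0.0, 0.0), (10.0, 10.0)]
ok = passes_fusion_constraint((1.0, 0.0), (0.0, 2.0), centroids, t_dis=1.5)
too_far = passes_fusion_constraint((1.0, 0.0), (0.0, 2.0), centroids, t_dis=0.5)
split = passes_fusion_constraint((1.0, 0.0), (9.0, 9.0), centroids, t_dis=1.5)
```

When the check fails, the method described above either reruns clustering (category failure) or keeps the original most discriminative region (distance failure); that control flow is left to the caller in this sketch.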
The application takes the VOC2007/2012 datasets as the research object; users can construct a corresponding database according to their own application requirements. To better evaluate the weakly supervised object detection technique, the application adopts the VOC datasets, which are widely used in the field of object detection. They cover 20 categories from real scenes and contain 9963 and 22531 images, respectively. The VOC images are split into VOC2007 train/val, VOC07/12 train/val, and VOC2007 test. VOC2007 train/val and VOC07/12 train/val are each used to train the weakly supervised detector framework of the application, and VOC2007 test is used to verify its performance. Detection performance is evaluated with the widely used object detection metric mAP, where a detection is counted as correct if its intersection over union (IoU) with a ground-truth instance is greater than 0.5. After the training database is established, the region perception and association network proposed in the application is trained in an unsupervised, end-to-end manner on the output of a trained weakly supervised detector. The final detection result alleviates local focusing and covers the complete object region.
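The IoU criterion used for evaluation can be illustrated as follows. This sketch treats boxes as continuous (x_min, y_min, x_max, y_max) rectangles, which is an assumption; evaluation toolkits may use a slightly different pixel convention. The function names and example boxes are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct when its IoU with the truth exceeds 0.5."""
    return iou(pred, truth) > threshold


truth = (0.0, 0.0, 10.0, 10.0)
partial = (5.0, 0.0, 15.0, 10.0)   # covers half of the object
tight = (1.0, 1.0, 10.0, 10.0)     # covers most of the object
```

The `partial` box mimics local focusing: it overlaps half of the object yet fails the 0.5 criterion, which is why covering the complete object region matters for mAP.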
In summary, the application provides a novel weakly supervised object detection framework based on surrounding region perception and association. It directly considers the relationship between the region predicted by the existing detector and its surrounding regions, and uses the similarity between the two to update the original locally focused region, thereby detecting the complete object region. In the region association network, when the most discriminative region is fused with surrounding regions into a new object region, the new region may be rough: although it contains all or most of the object, it may fit the real object bounding box poorly, because clustering is unstable at the start of training and the ViT feature extractor is not yet sufficiently trained. The region fusion constraint takes into account the influence of the initial clustering stage on the fusion process. It refines the rough object region, removes unstable surrounding regions, and yields an accurate object region, which is output as the result of the weakly supervised detector.
The method addresses the local focusing problem of existing weakly supervised object detectors, substantially narrows the gap between weakly supervised and fully supervised detectors, promotes the use of deep learning object detectors in real-world applications, addresses the scarcity or unavailability of training labels in practice, and provides technical support for the practical deployment of artificial intelligence object detection.
The application provides a novel surrounding region perception and association module that can be integrated into any existing weakly supervised detector to form a detection framework trained end to end. It comprises three components: a region extractor, a region association network, and a region fusion constraint. To address limitation (1), the application directly examines the similarity between the surrounding regions and the most discriminative region; based on the queried similarity, highly similar surrounding regions are treated as regions to be fused and are fused with the most discriminative region into a new object region. The region fusion constraint then removes the low-confidence, noisy, and unstable regions to be fused that arise early in training, refines the object region output by the region association network, and outputs an accurate detection result containing the complete object. To address limitation (2), the application computes the loss on the most discriminative regions predicted by the weakly supervised object detector during training, and the proposed method does not use any instance-level or image-level labels during training. The surrounding region perception and association module can therefore be integrated with any weakly supervised object detector into a unified end-to-end framework and is widely applicable.
Embodiment:
The present application uses the VOC2007/2012 datasets, which are widely used to evaluate and validate the performance of weakly supervised detectors. Specifically, the VOC image data are divided into the VOC2007 train/val, VOC07/12 train/val and VOC2007 test splits. VOC2007 train/val and VOC07/12 train/val are each used to train the weakly supervised detector framework of the present application, and VOC2007 test is used to verify its performance. After the training database is established, the region perception and association module provided by the application is trained. First, the region extractor generates the most discriminative regions and the surrounding regions, and the proposed region association network is trained with these two types of regions. According to the unsupervised training process and the clustering result inside the model, the surrounding regions that fall into the same cluster as the most discriminative region are queried and fused with the most discriminative region to form a new object region. The region fusion constraint is then introduced; it comprises a category sub-constraint and a distance sub-constraint. The two sub-constraints remove the noisy surrounding regions produced by the unstable early clustering process as well as the low-confidence surrounding regions, refine the object region output by the region association network, and output the refined, accurate object region as the result of the weakly supervised detector.
The method mainly addresses the problem that existing weakly supervised object detectors tend to localize only the most characteristic part of an object, converge to a local optimum, and thus exhibit local focusing. It narrows the gap between weakly supervised and fully supervised object detectors and provides a module that improves the localization accuracy of a weakly supervised object detector and is easy to integrate.
Designing the region extractor. As shown in Fig. 2, the main idea of the region extractor is to take the object positions predicted by the existing weakly supervised detector and use them as query regions. The region extractor then expands each object region by a ratio α and cuts the expanded object region into patches of 32 × 32 at a fixed ratio; these patches serve as key regions. If the coordinates of the initial object position are (x, y, w, h), the expanded object position is (x, y, αw, αh), where α > 1. The query regions in an image are taken as a set, as shown in the formula:
where $b_r$ denotes an object region predicted by the existing weakly supervised detector in one image, and R denotes the total number of predicted regions in that image. Likewise, the key regions are taken as a set, as shown in the formula:
where $b_{rn}$ denotes the n-th key region corresponding to the r-th query region, and N denotes the total number of key regions; N varies with r, depending on the original size of the predicted object region. Note that when the key regions are cropped, only 60% of them are used as input to the region association network. The first reason is efficiency: the region association network contains a clustering process, and reducing the number of regions makes clustering easier and shortens both the training time and the inference time of the weakly supervised detection framework. The second reason is coverage: the retained 60% still includes parts of the upper, left, right and lower areas of the original predicted region, and having regions in all four directions allows the fusion step after clustering to proceed without losing the semantic information near the most discriminative region. Using 60% of the key regions therefore preserves the detection gain obtained from the surrounding regions while reducing the clustering time.
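The region extractor described above can be sketched as follows. This is only a minimal illustration under several assumptions that the text does not state: (x, y) is treated as the box centre, the expanded box is tiled with a regular grid of 32 × 32 patches, a patch counts as a key region when it is not entirely inside the original box, and the 60% subset is drawn by a seeded random sample. The function names `expand_box`, `to_corners` and `key_regions` are hypothetical.

```python
import random

def expand_box(box, alpha=1.2):
    """Scale a (cx, cy, w, h) box about its centre by alpha (> 1)."""
    cx, cy, w, h = box
    return (cx, cy, alpha * w, alpha * h)

def to_corners(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def key_regions(box, alpha=1.2, patch=32, keep=0.6, seed=0):
    """Tile the expanded box with patch x patch squares and keep a fraction.

    Assumption: patches lying entirely inside the original box are skipped,
    so every returned patch touches the newly added surrounding area.
    """
    ix0, iy0, ix1, iy1 = to_corners(box)
    ex0, ey0, ex1, ey1 = to_corners(expand_box(box, alpha))
    patches = []
    y = ey0
    while y + patch <= ey1:
        x = ex0
        while x + patch <= ex1:
            inside = (ix0 <= x and iy0 <= y
                      and x + patch <= ix1 and y + patch <= iy1)
            if not inside:
                patches.append((x, y, x + patch, y + patch))
            x += patch
        y += patch
    # Keep 60% of the candidate key regions (the selection rule is assumed).
    k = round(keep * len(patches))
    return random.Random(seed).sample(patches, k)
```

For example, `key_regions((100, 100, 100, 100))` expands the box to 120 × 120 and returns a subset of the 32 × 32 patches that overlap the added border. A random sample does not by itself guarantee that all four directions are represented, which the description requires, so a real implementation would need a direction-aware selection.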
Designing the region association network. The region association network provided by the application combines the feature extraction capability of unsupervised learning with a clustering process. The network structure of the whole algorithm is shown in Fig. 3 and mainly comprises a MoCo v3 network and a clustering network. The MoCo v3 network uses contrastive learning on an instance discrimination task to extract the features of the query regions. The training process is divided into three stages: the first stage trains on the query regions with contrastive learning; the second stage trains on the newly fused object regions, which contain the semantic information of the surrounding regions; the third stage trains on the newly fused object regions together with the most discriminative regions that were not fused. This three-stage strategy strengthens the ability of the ViT to extract features from both query regions and key regions, so that key regions highly similar to the query regions can be queried and complete object regions can be found. Specifically, for each input image, the region extractor generates the query regions and key regions, and two different views q' and q'' of the query regions (key regions) are obtained with a rich set of data augmentations: RandomColorJitter, RandomGrayScale, RandomGaussianBlur and RandomSolarize.
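The three training stages can be mapped to epochs using the schedule stated in claim 10 of this document (rounds 0-29, rounds 30, 35, 40, ..., 100, and the rounds in between). The sketch below only encodes that schedule; the function name `training_stage` and the stage labels are hypothetical.

```python
def training_stage(epoch):
    """Return which inputs the network trains on at a given epoch.

    Schedule taken from claim 10:
      0-29                 -> most discriminative (query) regions only
      30, 35, 40, ..., 100 -> run the fusion process, train on fused regions
      31-34, 36-39, ...    -> fused regions plus unfused query regions
    """
    if epoch < 0 or epoch > 100:
        raise ValueError("epoch outside the schedule described in claim 10")
    if epoch < 30:
        return "query_only"
    if epoch % 5 == 0:
        return "fuse_and_train"
    return "fused_and_unfused"
```

Under this reading, fusion is refreshed every fifth epoch after the warm-up, and the intermediate epochs reuse the most recent fused regions.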
Specifically, the region association network can be regarded as consisting of region association algorithm (a) and region association algorithm (b). A mapping network is introduced after the base encoder branch. The mapping network consists of 3 fully connected layers (Linear layers), 3 batch normalization layers (BN) and 2 ReLU activation functions; in the figure, the blue parts are fully connected layers, the green parts are batch normalization layers, and the orange parts are ReLU activation functions. The batch normalization layers make the network converge more easily and the model more stable, and the ReLU activation functions give the network a nonlinear relationship between input and output. A prediction network is added after the mapping head, and its composition is similar to that of the mapping network: it consists of 2 fully connected layers, 2 batch normalization layers and 1 ReLU activation function. Unlike the base encoder, the momentum encoder has no prediction network. The prediction vector generated by the prediction network of the base encoder and the mapping vector generated by the momentum encoder are used to optimize the whole network with a cross-entropy loss function, as shown in the formulas:
$$L_{predict} = ctr(z'_d, z''_d) + ctr(z''_d, z'_d)$$
$$z'_d = pr(g(f_b(B'_d)))$$
$$z''_d = g(f_m(B''_d))$$
where $ctr(\cdot,\cdot)$ denotes the MoCo v3 contrastive prediction loss computed from one view to the other (A to B or B to A), and $z'_d$, $z''_d$ denote the high-dimensional nonlinear features extracted by the mapping and prediction networks from the query regions or key regions. Here $f_b$ and $f_m$ are the feature extraction networks of the base encoder and the momentum encoder, $g$ is the mapping network, and $pr$ is the prediction network. The base encoder branch consists of a feature extraction network, a mapping network and a prediction network, whereas the momentum encoder branch contains only the feature extraction network and the mapping network. To keep the features of the two encoders consistent, the base encoder updates the momentum encoder by a moving average. In the initial stage of training, the ViT focuses only on the most discriminative region of an object; although this region does not contain the complete object, it covers most of the object and carries rich semantic information about it. After the ViT has been trained on the query regions, the region association network executes a clustering process: the unsupervised ViT in region association algorithm (a) is used as a feature extractor to extract the embedded features of the query regions and key regions, and the features of these two kinds of regions serve as the input for clustering in the high-dimensional nonlinear space.
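The symmetric loss and the moving-average update can be illustrated with a small pure-Python sketch. It assumes that ctr is an InfoNCE-style cross-entropy in which the i-th key is the positive for the i-th query, which is how MoCo v3 is usually described; the temperature `tau = 0.2`, the momentum coefficient `m = 0.99` and all function names are illustrative assumptions, not values given in the text.

```python
import math

def normalize(v):
    """L2-normalise a feature vector."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ctr(queries, keys, tau=0.2):
    """Cross-entropy over similarities; key i is the positive for query i."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / tau for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(q)

def predict_loss(z1, z2, tau=0.2):
    """Symmetric form: ctr(z', z'') + ctr(z'', z')."""
    return ctr(z1, z2, tau) + ctr(z2, z1, tau)

def ema_update(momentum_params, base_params, m=0.99):
    """Moving-average update of the momentum encoder from the base encoder."""
    return [m * pm + (1 - m) * pb
            for pm, pb in zip(momentum_params, base_params)]
```

When the two views of each region map to nearby features, `predict_loss` is small; when the views are mismatched across regions, it grows, which is the behaviour the instance discrimination task relies on.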
Region association algorithm (b) performs K-means clustering on the features $f_b(B_d)$ and $f_b(B_s)$ of the query regions and the key regions. The number of clusters is set to K, and each cluster center is denoted $c_k$. As the dynamic clustering proceeds, each region is assigned a cluster label according to its Euclidean distance in the embedding space, and the key regions are divided into positive keys and negative keys according to the clustering result. A positive key is a key region clustered into the same class as the most discriminative region; such a key region can become part of the object region, and fusing it with the most discriminative region expands the original object region so that a more complete object region is output. A negative key, in contrast, is treated as background or as part of another instance in the same image. However, because the region association network assigns a cluster label to every region, experimental results show that unstable, noisy, and low-confidence key regions are often assigned the same cluster label as the query regions. Background key regions or key regions of other instances then become positive keys, which inevitably affects the fusion result. In the early training process, the region association network focuses only on the query regions and neglects the ability of the ViT to extract features from the key regions. The region fusion constraint is therefore introduced on top of the region association network.
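The split into positive and negative keys can be sketched with a small K-means written in plain Python. This is an illustration only: the deterministic initialisation (the first k points), the fixed iteration count, and the names `kmeans` and `split_keys` are assumptions, and the features are toy 2-D vectors rather than ViT embeddings.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Plain K-means; centres start at the first k points (assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centre.
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def split_keys(query_feat, key_feats, k):
    """Indices of key regions sharing the query's cluster, and the rest."""
    labels, _ = kmeans([query_feat] + key_feats, k)
    q_label = labels[0]
    positive = [i for i, l in enumerate(labels[1:]) if l == q_label]
    negative = [i for i, l in enumerate(labels[1:]) if l != q_label]
    return positive, negative
```

With a query at the origin and keys `[[0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [5.1, 4.9]]`, the two keys near the origin end up as positive keys and the two distant ones as negative keys.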
Designing the region fusion constraint. The main role of the region fusion constraint is to remove low-confidence key regions, refine the object regions output by the region association network, and obtain accurate and complete object regions. As shown in Figs. 4 and 5, the region fusion constraint comprises a category sub-constraint and a distance sub-constraint. On top of the region association network, it considers the cluster labels of q' and q'', the augmented views of the query regions, and their distances to the cluster centers. Specifically, the category sub-constraint and the distance sub-constraint are designed to measure the accuracy of the clustering result. The views q' and q'' are obtained by applying data augmentation to the most discriminative regions (query regions). In the region association network, q', q'' and the key regions pass through the ViT to extract high-dimensional nonlinear features, and clustering is then performed. If the clustering process assigns q' and q'' to the same cluster, the clustering is regarded as correct; otherwise, the region association algorithm is executed again to extract the features of q', q'' and the key regions, and the clustering process is repeated to query accurate positive keys as part of the object regions. When the category sub-constraint is satisfied, the application calculates the distances $d_1$ and $d_2$ from q' and q'' to the current cluster center. If the distance difference $|d_1 - d_2|$ exceeds the set threshold $T_{dis} = 0.1$, the distance sub-constraint is not satisfied and the region fusion constraint ignores this clustering result.
When the category sub-constraint and the distance sub-constraint are both satisfied, the cosine similarity between the remaining key regions and q' or q'' is calculated. If the cosine similarity is greater than the threshold $T_{score}$, the remaining key regions are fused with the query regions to form the final object region, which is output as the result of the weakly supervised detector. The region fusion constraint thus refines the object regions from the region association network, removes low-confidence key regions by re-executing the clustering process, finds accurate and complete object regions, and strengthens the proximity relation of the clustering result by computing the distance difference. The clustering process accounts for similarity in Euclidean distance, while the cosine similarity accounts for similarity in direction. Schematic diagrams of the clustering process and the region fusion constraint are shown in Figs. 6, 7 and 8, covering the cases in which the category sub-constraint is not satisfied, the distance sub-constraint is not satisfied, and the region fusion constraint is satisfied.
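The two sub-constraints and the cosine test can be put together in a short sketch. It uses the thresholds given in this document ($T_{dis} = 0.1$, $T_{score} = 0.95$) but makes several assumptions: boxes are (x0, y0, x1, y1) tuples, fusion is taken to be the smallest enclosing box, the cosine similarity is computed against q' only, and the re-clustering retry is omitted so that a failed constraint simply returns the unfused query box. All function names are hypothetical.

```python
import math

T_DIS = 0.1     # distance threshold from the description
T_SCORE = 0.95  # cosine similarity threshold from the description

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def fuse_boxes(boxes):
    """Smallest (x0, y0, x1, y1) box enclosing all boxes (assumed rule)."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

def region_fusion(q1, q2, label_q1, label_q2, center, query_box, keys):
    """keys is a list of (feature, box, cluster_label) for key regions."""
    # Category sub-constraint: both views must share a cluster.
    if label_q1 != label_q2:
        return query_box
    # Distance sub-constraint: |d1 - d2| must not exceed T_DIS.
    if abs(dist(q1, center) - dist(q2, center)) > T_DIS:
        return query_box
    # Keep key regions in the same cluster that also point the same way.
    selected = [box for feat, box, label in keys
                if label == label_q1 and cosine(feat, q1) > T_SCORE]
    return fuse_boxes([query_box] + selected)
```

In this sketch a key region must pass both the cluster-label check and the cosine check before it enlarges the object box, which mirrors the order of the constraints in the text.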
Training the weakly supervised object detection framework based on region perception and association proposed by the application. A region extractor is applied to the output of the existing weakly supervised detector to separate two kinds of regions, the most discriminative region and the surrounding regions. A region association network dynamically queries the similarity between the surrounding regions and the most discriminative region to form a new object region. A region fusion constraint is then introduced to refine the object region from the region association algorithm and obtain an accurate and complete object region.
Specifically, in the region extractor, the expansion ratio that defines the cropping range of the surrounding regions is set to α = 1.2, the size of each cropped patch is 32 × 32, and every region is resized to 224 × 224. For the clustering process in the region association algorithm, the number of clusters is set to K = 20 in view of the number of instances in one image; the model is trained for 100 epochs with the LARS optimizer, a batch size of 64, and an initial learning rate of $1.5 \times 10^{-4}$ that is adjusted according to the batch size. In the region fusion constraint, the distance threshold $T_{dis} = 0.1$ is used to measure the correctness of the clustering result, and the cosine similarity threshold $T_{score} = 0.95$ is used to measure the directional similarity between query regions and key regions. Momentum and weight decay are set to 0.9 and $1 \times 10^{-6}$, respectively.
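The settings listed above can be collected in one configuration object, as in the sketch below. The dictionary keys are hypothetical names. The text says only that the initial learning rate changes with the batch size; the linear rule `base_lr * batch_size / 256` used here is the convention commonly associated with MoCo v3 and is an assumption, not something this document states.

```python
# Hyperparameters as stated in the description; key names are illustrative.
CONFIG = {
    "alpha": 1.2,          # expansion ratio of the predicted box
    "patch_size": 32,      # side of each cropped key-region patch
    "resize": 224,         # regions are resized to 224 x 224
    "num_clusters": 20,    # K for K-means
    "epochs": 100,
    "optimizer": "LARS",
    "batch_size": 64,
    "base_lr": 1.5e-4,
    "momentum": 0.9,
    "weight_decay": 1e-6,
    "t_dis": 0.1,          # distance sub-constraint threshold
    "t_score": 0.95,       # cosine similarity threshold
}

def scaled_lr(base_lr, batch_size, reference_batch=256):
    """Linear learning-rate scaling with batch size (assumed rule)."""
    return base_lr * batch_size / reference_batch
```

Under that assumed rule, `scaled_lr(CONFIG["base_lr"], CONFIG["batch_size"])` gives 3.75e-5 for a batch size of 64.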
The weakly supervised object detection framework based on region perception and association trained in the above steps improves on existing weakly supervised detectors, which output only the most discriminative region of an object. It narrows the gap between weakly supervised and fully supervised detectors, overcomes the lack of a module for improving localization accuracy in weak supervision, and reduces the dependence of object detection technology on expensive manual labeling. Experiments show that the "weak supervision object detection technology based on regional perception and association" can detect complete and accurate object regions. Table 1 shows the experimental comparison data; the proposed method is evaluated with mAP, the standard evaluation metric in the field of object detection. The comparison shows that the proposed technology improves mAP by 0.27% over the current state-of-the-art method, Instance-aware. In addition, the weakly supervised detection framework provided by the application is a single-stage model, and its detection result of 55.17% is currently the highest among the latest single-stage weakly supervised object detection methods. Its detection result is also the highest when compared with multi-stage weakly supervised frameworks that introduce a Faster R-CNN detector, which demonstrates the effectiveness of the proposed framework.
In addition to using VOC07 train/val as the training set, the application introduces the additional dataset VOC07/12 train/val for training. The test result is 58.22%, which still exceeds other methods that use the additional dataset and further demonstrates the robustness and generalization of the weakly supervised framework. Fig. 7 is a comparison of experimental results, in which a green bounding box represents the ground-truth region of an object, a red bounding box represents the result output by an existing weakly supervised detector, and a blue bounding box represents the detection result output by the framework proposed in the present application. As the figure shows, compared with other methods, the detection results of the proposed method contain complete object information. Particularly for non-rigid objects, the problem of detecting only the most discriminative region is alleviated, and accurate and complete object detection results are obtained.
Table 1. Quantitative test results (mAP) with VOC2007 train/val as the training set
Table 2. Quantitative test results (mAP) with VOC07/12 train/val as the training set
It should be noted that the detailed description only explains and illustrates the technical solution of the present invention and does not thereby limit the scope of protection of the claims. All modifications and variations made according to the claims and the description of the invention shall fall within its scope of protection.
Claims (10)
1. A weakly supervised object detection method based on surrounding area perception and association, characterized by comprising the following steps:
step one: acquiring an image to be recognized, predicting on the image with a weakly supervised detector, and taking the predicted object position as the most discriminative region;
step two: expanding the most discriminative region, cutting the expanded region into image blocks, and taking the image blocks as surrounding regions;
step three: extracting the features of the most discriminative region and the surrounding regions, clustering the obtained features, assigning a cluster label to each region, and dividing the regions into different clusters according to the cluster labels;
step four: obtaining, from the cluster label of each region, the surrounding regions whose label is the same as that of the most discriminative region, and fusing these surrounding regions with the most discriminative region into a new object region;
step five: performing data augmentation on the most discriminative region to obtain two augmented most discriminative regions, q' and q'';
step six: extracting the features of q', q'' and the surrounding regions and clustering the extracted features; if q' and q'' are assigned to the same cluster, regarding the clustering as correct and executing step seven; if q' and q'' are not assigned to the same cluster, executing steps three to six once more; if q' and q'' are then assigned to the same cluster, regarding the clustering as correct and executing step seven; if q' and q'' still cannot be assigned to the same cluster, ignoring the clustering result and taking the most discriminative region as the final object region;
step seven: for a clustering process that assigns q' and q'' to the same cluster, calculating the distances $d_1$ and $d_2$ from q' and q'' to the current cluster center; if the distance difference $|d_1 - d_2|$ exceeds the set threshold $T_{dis} = 0.1$, ignoring the clustering result; otherwise, regarding the clustering as correct;
step eight: based on the correct clustering of step seven, obtaining the surrounding regions in the cluster that contains both q' and q'', and calculating the cosine similarity between each surrounding region and q' or q''; if the cosine similarity is greater than the threshold $T_{score} = 0.95$, fusing the most discriminative region in the cluster with the surrounding region into the final object region;
step nine: training a neural network with the most discriminative region and the surrounding regions as input and the final object region as output, and detecting objects with the trained neural network.
2. The weakly supervised object detection method based on surrounding area perception and association as recited in claim 1, wherein the expansion ratio α in step two is greater than 1.
3. The weakly supervised object detection method based on surrounding area perception and association as recited in claim 2, wherein the expansion ratio α in step two is 1.2.
4. The method of claim 1, wherein the feature is a high-dimensional nonlinear feature.
5. The weak supervision object detection method based on surrounding area perception and association as claimed in claim 4 wherein the high dimensional non-linear features are extracted by ViT.
6. The method according to claim 1, wherein the size of the image block is 32 x 32.
7. The method according to claim 1, wherein the surrounding area in step three is 60% of the surrounding area in step two.
8. The method of claim 1, wherein the data augmentation includes random color jitter, random grayscale conversion, random Gaussian blur, and random solarization.
9. The weakly supervised object detection method based on surrounding area perception and association as claimed in claim 1, wherein the neural network is a MoCo v3 network.
10. The weakly supervised object detection method based on surrounding area perception and association as claimed in claim 9, wherein the step of training the neural network comprises:
training with unsupervised contrastive learning for 100 epochs in total;
(1) in rounds 0-29, the input of the network is the most discriminative region;
(2) in rounds 30, 35, 40, ..., 100, the network performs the fusion process, and the final object region after fusion is used as the input of the network for training;
(3) in rounds 31-34, 36-39, ..., 96-99, the inputs of the network are the final object region after fusion and the most discriminative region that has not been fused.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211066364.0A CN115439688B (en) | 2022-09-01 | 2022-09-01 | Weak supervision object detection method based on surrounding area sensing and association |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115439688A true CN115439688A (en) | 2022-12-06 |
CN115439688B CN115439688B (en) | 2023-06-16 |
Family
ID=84246404
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211066364.0A Active CN115439688B (en) | 2022-09-01 | 2022-09-01 | Weak supervision object detection method based on surrounding area sensing and association |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115439688B (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160217344A1 (en) * | 2015-01-22 | 2016-07-28 | Microsoft Technology Licensing, Llc. | Optimizing multi-class image classification using patch features |
CN107730553A (en) * | 2017-11-02 | 2018-02-23 | 哈尔滨工业大学 | A kind of Weakly supervised object detecting method based on pseudo- true value search method |
CN108062574A (en) * | 2017-12-31 | 2018-05-22 | 厦门大学 | A kind of Weakly supervised object detection method based on particular category space constraint |
CN108399406A (en) * | 2018-01-15 | 2018-08-14 | 中山大学 | The method and system of Weakly supervised conspicuousness object detection based on deep learning |
CN109034258A (en) * | 2018-08-03 | 2018-12-18 | 厦门大学 | Weakly supervised object detection method based on certain objects pixel gradient figure |
CN109657684A (en) * | 2018-12-20 | 2019-04-19 | 郑州轻工业学院 | A kind of image, semantic analytic method based on Weakly supervised study |
CN109671054A (en) * | 2018-11-26 | 2019-04-23 | 西北工业大学 | The non-formaldehyde finishing method of multi-modal brain tumor MRI |
US20200250398A1 (en) * | 2019-02-01 | 2020-08-06 | Owkin Inc. | Systems and methods for image classification |
CN111612051A (en) * | 2020-04-30 | 2020-09-01 | 杭州电子科技大学 | Weak supervision target detection method based on graph convolution neural network |
CN111931703A (en) * | 2020-09-14 | 2020-11-13 | 中国科学院自动化研究所 | Object detection method based on human-object interaction weak supervision label |
US11023730B1 (en) * | 2020-01-02 | 2021-06-01 | International Business Machines Corporation | Fine-grained visual recognition in mobile augmented reality |
US20210201076A1 (en) * | 2019-12-30 | 2021-07-01 | NEC Laboratories Europe GmbH | Ontology matching based on weak supervision |
CN113409335A (en) * | 2021-06-22 | 2021-09-17 | 西安邮电大学 | Image segmentation method based on strong and weak joint semi-supervised intuitive fuzzy clustering |
CN114359323A (en) * | 2022-01-10 | 2022-04-15 | 浙江大学 | Image target area detection method based on visual attention mechanism |
CN114648665A (en) * | 2022-03-25 | 2022-06-21 | 西安电子科技大学 | Weak supervision target detection method and system |
CN114677515A (en) * | 2022-04-25 | 2022-06-28 | 电子科技大学 | Weak supervision semantic segmentation method based on inter-class similarity |
Non-Patent Citations (3)
Title |
---|
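Xiaolin Zhang et al.: "Adversarial Complementary Learning for Weakly Supervised Object Localization", arXiv:1804.06962v1, pages 1 - 10 * |
Zhong Guangyu (仲光宇): "Research on Several Weakly Supervised Visual Analysis and Understanding Problems", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Song Chunfeng (宋纯锋): "Weakly Supervised Classification Algorithms Based on Deep Learning and Their Applications", China Master's Theses Full-text Database, Information Science and Technology * |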
Also Published As
Publication number | Publication date |
---|---|
CN115439688B (en) | 2023-06-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||