CN115439688B - Weak supervision object detection method based on surrounding area sensing and association

Weak supervision object detection method based on surrounding area sensing and association

Info

Publication number
CN115439688B
CN115439688B
Authority
CN
China
Prior art keywords
area
clustering
region
discriminant
surrounding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211066364.0A
Other languages
Chinese (zh)
Other versions
CN115439688A (en)
Inventor
张永强
丁明理
田瑞
张印
张子安
张漫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202211066364.0A
Publication of CN115439688A
Application granted
Publication of CN115439688B
Legal status: Active

Classifications

    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T 7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/23: Image preprocessing by selection of a specific region containing or referencing a pattern, based on positionally close patterns or neighbourhood relationships
    • G06V 10/7715: Processing image or video features in feature spaces; feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06T 2207/20081: Indexing scheme for image analysis or image enhancement; training; learning
    • G06T 2207/20084: Indexing scheme for image analysis or image enhancement; artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A weak supervision object detection method based on surrounding area sensing and association, relating to the technical field of object detection. It addresses the problems in the prior art that weakly supervised object detection readily converges to a locally optimal solution, visually manifested as detecting only the most discriminative region of an object rather than the whole object region, so that object localization fails and detection accuracy is low. The invention constitutes basic research on object detection in practical application scenarios, promotes the practical deployment of deep-learning object detection to a certain extent, and narrows the gap between weakly supervised and fully supervised object detection.

Description

Weak supervision object detection method based on surrounding area sensing and association
Technical Field
The invention relates to the technical field of object detection, in particular to a weak supervision object detection method based on surrounding area sensing and association.
Background
Weakly supervised object detection is a technique that achieves object detection using only image-level labels, where an image-level label indicates whether an object of a given category is present in an image. In real application scenarios, instance-level labels are often unavailable for training fully supervised object detection. Weakly supervised object detection replaces the instance-level labels of fully supervised detection with image-level labels, which greatly reduces the demand for instance-level training data and makes object detection feasible when labeled data are scarce. However, in contrast to fully supervised object detection, there are few modules designed for accurately locating object regions (fully supervised detection has region proposal networks, feature pyramid networks, and the like). Meanwhile, the weakly supervised detection task is usually treated as a classification task over candidate regions, which causes the weakly supervised detector to converge to a locally optimal solution whose output is the most discriminative region of the object. For these reasons, weakly supervised object detection is a challenging but promising technique.
Currently, to bridge the gap between weakly and fully supervised object detection and to alleviate the local-focusing phenomenon, weakly supervised detection methods can be summarized into four representative families. Methods based on initializing high-quality candidate regions guarantee the recall of the detection task: a class activation map is combined with a selective search algorithm to generate high-quality candidate regions, which are used as the input of the weakly supervised detector; this ensures high recall, improves the intersection over union between candidate regions and the ground-truth bounding boxes, and yields more accurate detection. Methods based on an iterative refinement strategy guide the detector toward the complete object region, using the prior that highly overlapping regions should share the same category label to provide supervision for the next branch. Methods based on weak-to-full supervision conversion combine the advantages of weak supervision (labels are easy to acquire) and full supervision (strong regression capability): the output of a weakly supervised detector is used to train a fully supervised detector, whose output serves as the final detection result. Methods based on complete object search use the class activation map as a position prior of the object region and search for the maximum score of the detected region and the minimum score of the surrounding region to locate the complete object. However, none of these methods (high-quality region generation, iterative refinement, weak-to-full conversion, and complete object search) fundamentally resolves the local-focusing phenomenon, and they lack wide applicability, each being tied to one particular weakly supervised detection method or family. The limitations of existing weakly supervised object detection methods can therefore be summarized in two aspects: (1) weakly supervised detection readily converges to a locally optimal solution, visually manifested as detecting only the most discriminative region of the object rather than the whole object region, so that object localization fails; (2) fully supervised detection can improve localization accuracy by integrating carefully designed modules (for example, region proposal networks and feature pyramid networks) into any fully supervised detector, whereas for weakly supervised detection there are few general-purpose modules designed to improve localization accuracy, and existing methods apply only to one current method or one class of weakly supervised detection methods.
Disclosure of Invention
The purpose of the invention is to provide a weakly supervised object detection method based on surrounding area sensing and association, addressing the problems in the prior art that weakly supervised object detection readily converges to a locally optimal solution, visually manifested as detecting only the most discriminative region of the object rather than the whole object region, so that object localization fails and detection accuracy is low.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method of weakly supervised object detection based on surrounding area awareness and correlation, comprising the steps of:
step one: acquiring an image to be identified, predicting it with a weakly supervised detector, and taking the predicted object position as the most discriminative region;
step two: expanding the most discriminative region, cropping the expanded region into image blocks, and taking the image blocks as the surrounding regions;
step three: extracting features of the most discriminative region and the surrounding regions, clustering the obtained features, assigning a cluster label to each region, and dividing the regions into different clusters by their cluster labels;
step four: obtaining, through each region's cluster label, the surrounding regions whose label is the same as that of the most discriminative region, and fusing them with the most discriminative region into a new object region;
step five: performing data augmentation on the most discriminative region to obtain two augmented most discriminative regions, denoted q' and q'';
step six: extracting features for q', q'', and the surrounding regions, and clustering the extracted features; if q' and q'' are assigned to the same cluster, treating the clustering as correct and executing step seven; if q' and q'' are not assigned to the same cluster, executing steps three to six again; if q' and q'' are then assigned to the same cluster, treating the clustering as correct and executing step seven, and if q' and q'' still cannot be assigned to the same cluster, ignoring this clustering result and taking the most discriminative region as the final object region;
step seven: for a clustering that assigns q' and q'' to the same cluster, calculating the distances d1 and d2 from q' and q'' to the center of the current cluster; if |d1 - d2| exceeds the set threshold T_dis = 0.1, ignoring this clustering result, otherwise regarding the clustering as correct;
step eight: based on the correct clustering in step seven, obtaining the surrounding regions in the cluster containing both q' and q'', and calculating the cosine similarity between each such surrounding region and q' or q''; if the cosine similarity exceeds the threshold T_score = 0.95, fusing the most discriminative region in the cluster with those surrounding regions into the final object region;
step nine: training a neural network with the most discriminative region and the surrounding regions as inputs and the final object region as output, and performing object detection with the trained neural network.
Further, the expansion ratio α in step two is greater than 1.
Further, the expansion ratio α in step two is 1.2.
Further, the features are high-dimensional nonlinear features.
Further, the high-dimensional nonlinear features are extracted by a ViT.
Further, the size of the image block is 32×32.
Further, the surrounding area in the third step is 60% of the surrounding area in the second step.
Further, the data augmentation includes random color jittering, random grayscale conversion, random Gaussian blurring, and random solarization.
Further, the neural network is a MoCov3 network.
Further, training the neural network specifically comprises the following steps:
the training process adopts unsupervised contrastive learning, training for 100 epochs in total;
(1) during epochs 0-29, the network input is the most discriminative regions;
(2) at epochs 30, 35, 40, ..., 100, the network performs the fusion process, and the fused final object regions are used as network inputs for training;
(3) at the epochs in between (31-34, 36-39, and so on), the network inputs are the fused final object regions together with the unfused most discriminative regions.
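For illustration only, the following minimal sketch shows one way this three-phase input schedule could be implemented in Python; the function and argument names are hypothetical and not taken from the original implementation:

    def select_inputs(epoch, discriminative_regions, fused_regions):
        """Choose the network inputs for one epoch under the assumed three-phase schedule."""
        if epoch < 30:
            # Phase (1): epochs 0-29 train only on the most discriminative regions.
            return discriminative_regions
        if epoch % 5 == 0:
            # Phase (2): fusion epochs 30, 35, 40, ..., 100 train on the fused regions.
            return fused_regions
        # Phase (3): the epochs in between train on fused plus unfused regions.
        return fused_regions + discriminative_regions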
The beneficial effects of the invention are as follows:
the method solves the problems of low detection precision and convergence to a local optimal solution in the weak supervision object detection method, breaks through the limitation that a module for improving the positioning precision does not exist in weak supervision, and reduces the requirement of an object detection technology on expensive manual labeling. The invention belongs to basic technical research work of object detection in actual application scenes, promotes the landing of an object detection technology of artificial intelligent deep learning to a certain extent, and makes up the gap between weak supervision and full supervision object detection.
Drawings
FIG. 1 is an exemplary diagram of a surrounding area awareness and association module;
FIG. 2 is a block diagram of a region feature extractor;
FIG. 3 is a block diagram of a regional associative network;
FIG. 4 is schematic diagram 1 of the region fusion constraint;
FIG. 5 is schematic diagram 2 of the region fusion constraint;
FIG. 6 is comparison graph a of weakly supervised object detection results;
FIG. 7 is comparison graph b of weakly supervised object detection results;
FIG. 8 is comparison graph c of weakly supervised object detection results.
Detailed Description
It should be noted in particular that, without conflict, the various embodiments disclosed herein may be combined with each other.
The first embodiment is as follows: referring to fig. 1, a method for detecting a weakly supervised object based on surrounding area sensing and association according to the present embodiment is specifically described, and includes the following steps:
step one: acquiring an image to be identified, predicting it with a weakly supervised detector, and taking the predicted object position as the most discriminative region;
step two: expanding the most discriminative region, cropping the expanded region into image blocks, and taking the image blocks as the surrounding regions;
step three: extracting features of the most discriminative region and the surrounding regions, clustering the obtained features, assigning a cluster label to each region, and dividing the regions into different clusters by their cluster labels;
step four: obtaining, through each region's cluster label, the surrounding regions whose label is the same as that of the most discriminative region, and fusing them with the most discriminative region into a new object region;
step five: performing data augmentation on the most discriminative region to obtain two augmented most discriminative regions, denoted q' and q'';
step six: extracting features for q', q'', and the surrounding regions, and clustering the extracted features; if q' and q'' are assigned to the same cluster, treating the clustering as correct and executing step seven; if q' and q'' are not assigned to the same cluster, executing steps three to six again; if q' and q'' are then assigned to the same cluster, treating the clustering as correct and executing step seven, and if q' and q'' still cannot be assigned to the same cluster, ignoring this clustering result and taking the most discriminative region as the final object region;
step seven: for a clustering that assigns q' and q'' to the same cluster, calculating the distances d1 and d2 from q' and q'' to the center of the current cluster; if |d1 - d2| exceeds the set threshold T_dis = 0.1, ignoring this clustering result, otherwise regarding the clustering as correct;
step eight: based on the correct clustering in step seven, obtaining the surrounding regions in the cluster containing both q' and q'', and calculating the cosine similarity between each such surrounding region and q' or q''; if the cosine similarity exceeds the threshold T_score = 0.95, fusing the most discriminative region in the cluster with those surrounding regions into the final object region;
step nine: training a neural network with the most discriminative region and the surrounding regions as inputs and the final object region as output, and performing object detection with the trained neural network.
Addressing the localization-accuracy gap between weakly and fully supervised object detectors, the present application provides a module that can be embedded into any weakly supervised detector to form an end-to-end learning framework with it, solving the problem that the training process converges to a locally optimal solution, which manifests as weakly supervised detection usually identifying only the most discriminative region of the object. To overcome this defect of current weakly supervised detectors, the region association network proposed in this application dynamically queries, based on a clustering method, the similarity between the surrounding regions and the object region predicted by the existing detector, and fuses high-similarity regions according to the similarity result, so that the object region output by the weakly supervised detector covers the complete object. Meanwhile, introducing a clustering process into the region association network inevitably introduces its drawbacks: in the early stage of training, clustering is unstable and may wrongly assign low-confidence regions. The region fusion constraint proposed in this application strengthens the conditions for region fusion, prevents erroneous fusion, further refines the rough object region output by the region association network, and outputs an accurate and complete detection result.
Specifically, the surrounding area sensing and association module proposed in this application is shown in FIG. 1 and comprises three components: a region extractor, a region association network, and a region fusion constraint. The first component, the region extractor, divides an image into the most discriminative region and the surrounding regions according to the detection result of any weakly supervised detector, and defines the surrounding regions with different clipping ranges, which improves detection accuracy while reducing algorithm execution time. The second component, the region association network, continuously performs contrastive learning and clustering on the two types of regions output by the region extractor: it extracts good visual features of both region types, feeds those features into the clustering process, assigns a cluster label to each region, queries the surrounding regions that share the most discriminative region's label, and fuses them into a new object region. The third component, the region fusion constraint, handles the new object region output by the region association network; in the early stage of clustering this region may include unstable, low-confidence surrounding regions, and if such regions were fused with the most discriminative region as the final result, the completeness and accuracy of the object would suffer. The region fusion constraint removes these regions, refines the rough object region output by the region association network, and obtains an accurate object region.
The region extractor first regards the output of the existing detector as the most discriminative region. Then the most discriminative region is expanded by a ratio α, and the expanded region serves as the clipping range. Within the clipping range, 32×32 patches are cropped sequentially as key regions. In the region extractor, each region is treated either as a most discriminative region (query region) or as a surrounding region (key region); the two types of regions are input into the region association network proposed in this application, and the contrastive learning and clustering processes are performed to find the complete object region.
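As a minimal sketch of this extraction scheme, assuming a NumPy image of shape (H, W, C) and a box in (x, y, w, h) form; the non-overlapping 32-pixel stride is an assumption, since the text states only that 32×32 patches are cropped sequentially:

    import numpy as np

    def extract_regions(image, box, alpha=1.2, patch=32):
        """Split one detection into a query region and surrounding key regions.

        The predicted box (x, y, w, h) is the query region; the box expanded to
        (x, y, alpha*w, alpha*h) defines the clipping range, inside which 32x32
        patches are cropped sequentially as key regions."""
        img_h, img_w = image.shape[:2]
        x, y, w, h = box
        query = image[int(y):int(y + h), int(x):int(x + w)]
        x1 = int(min(x + alpha * w, img_w))  # expanded range, clamped to the image
        y1 = int(min(y + alpha * h, img_h))
        keys = []
        for py in range(int(y), y1 - patch + 1, patch):
            for px in range(int(x), x1 - patch + 1, patch):
                keys.append(image[py:py + patch, px:px + patch])
        return query, keys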
The region association network proposed in this application combines contrastive learning with a clustering process. For contrastive learning, the region extractor outputs the most discriminative region as the query region and the surrounding regions as key regions; the query regions are augmented with RandomColorJitter, RandomGrayScale, RandomGaussianBlur, and RandomSolarize, a series of image enhancement strategies that help the network learn better feature representations of objects. The enhanced regions (q', q'') are input into a MoCov3 framework, and good visual representations of the query regions are extracted with an unsupervised training strategy. By training the ViT feature extractor inside the MoCov3 framework, the most discriminative region and the surrounding regions are used as inputs of the region association network, and the features of the two region types are extracted and mapped into a high-dimensional nonlinear space. For the clustering process, the region association network clusters the high-dimensional nonlinear features of the two region types and assigns a cluster label to each region according to Euclidean distance in the high-dimensional space; the surrounding regions that fall into the same cluster as the most discriminative region are extracted and fused with it into a new object region, which can cover all or most of the real object. In the region association network of this application, the unsupervised training process requires no label information.
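A minimal sketch of the association step follows, assuming the region features have already been extracted (for example by the ViT) and using scikit-learn's KMeans; the box-union fusion at the end is an assumption about how the regions are merged geometrically, which the text does not spell out:

    import numpy as np
    from sklearn.cluster import KMeans

    def associate_and_fuse(query_feat, query_box, key_feats, key_boxes, k=20):
        """Cluster the query and key features, then fuse the keys sharing the
        query's cluster label with the query box into a new object box."""
        feats = np.vstack([query_feat[None, :], np.asarray(key_feats)])
        n_clusters = min(k, len(feats))              # K-means needs K <= number of samples
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        positives = [b for b, lab in zip(key_boxes, labels[1:]) if lab == labels[0]]
        boxes = [query_box] + positives              # query plus same-cluster keys
        x0 = min(b[0] for b in boxes)
        y0 = min(b[1] for b in boxes)
        x1 = max(b[0] + b[2] for b in boxes)
        y1 = max(b[1] + b[3] for b in boxes)
        return (x0, y0, x1 - x0, y1 - y0)            # fused object region (x, y, w, h)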
The region fusion constraint proposed in this application comprises a category sub-constraint and a distance sub-constraint. The category sub-constraint checks whether the query region, seen from the different viewing angles produced by image enhancement, is assigned to the same cluster. On the premise that the category sub-constraint is satisfied, the distance sub-constraint computes the difference between the distances of the differently viewed query regions to the centroid of the corresponding cluster; a difference smaller than a preset threshold counts as a successful check, otherwise the clustering is considered failed, the current clustering result is ignored, and no fusion into a new object region is performed. Specifically, after image enhancement of the query region in the region association network, q' and q'' are output; they are two augmented views of the same region under different viewing angles and are input into the clustering process together with the surrounding regions. If q' and q'' fall into the same cluster, the clustering is taken as successful; otherwise the category sub-constraint ignores the clustering result, and the clustering process is re-executed to search for surrounding regions highly similar to the most discriminative region. Then, when q' and q'' are in the same cluster, the distance sub-constraint calculates the distances from q' and q'' to the center of the current cluster; if the distance difference is larger than the set threshold, the clustering result is directly ignored, indicating that the original object region is already complete and needs no further fusion. Conversely, if the difference is smaller than the threshold, the category and distance sub-constraints are satisfied simultaneously, and the surrounding regions that satisfy both sub-constraints are regarded as candidate regions to be fused.
This application takes the VOC2007/2012 datasets as the research object; users can build a corresponding database according to their actual application requirements. To better evaluate the weakly supervised detection technique, this application adopts the VOC dataset widely used in the object detection field, which contains 20 real-scene categories with 9,963 and 22,531 images respectively; the VOC images are divided into VOC2007 train/val, VOC07/12 train/val, and VOC2007 test. VOC2007 train/val and VOC07/12 train/val are used to train the weakly supervised detector framework of this application, and VOC2007 test is used to verify its performance. Detection performance is evaluated with the widely used object detection metric mAP; a detection is counted as correct when the intersection over union (IoU) between the detected instance and the ground-truth instance exceeds 0.5. After the training database is established, the region sensing and association network proposed here is trained in an unsupervised, end-to-end manner on the output of a trained weakly supervised detector; the final detection result alleviates local focusing and obtains the complete object region.
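For reference, the IoU criterion mentioned above can be computed as in this small sketch (boxes assumed in (x, y, w, h) form):

    def iou(box_a, box_b):
        """Intersection over union of two (x, y, w, h) boxes; under the protocol
        above, a detection counts as correct when IoU > 0.5."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        ix0, iy0 = max(ax, bx), max(ay, by)
        ix1, iy1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
        inter = max(ix1 - ix0, 0.0) * max(iy1 - iy0, 0.0)
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0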
In summary, this application provides a novel weakly supervised object detection framework based on surrounding area sensing and association. In realizing weakly supervised detection, the relation between the region predicted by the existing detector and its surrounding regions is considered directly, and the original locally focused region is updated using the similarity between the two region types, realizing complete object detection. In the region association network, when the most discriminative region and the surrounding regions are fused into a new object region, the result may be rough due to the unstable clustering in the early stage of training and the still insufficient feature extraction capability of ViT: the fused region may cover the whole object or only a large part of it, and its fit to the actual object bounding box may be poor. The region fusion constraint of this application accounts for the influence of the early clustering stage on the fusion process: it refines the rough object region, removes unstable surrounding regions, obtains an accurate object region, and outputs that region as the result of the weakly supervised detector.
The method solves the local-focusing problem of existing weakly supervised object detectors, substantially narrows the gap between weakly and fully supervised detectors, promotes the development of deep-learning object detectors in real-scene applications, addresses the scarcity or unavailability of training labels in practice, and thereby provides technical support for deploying artificial-intelligence object detection.
This application provides a novel surrounding area sensing and association module that can be integrated into any existing weakly supervised detector as an end-to-end trainable detection framework; it comprises three components: a region extractor, a region association network, and a region fusion constraint. To address limitation (1), this application directly focuses on the similarity between the surrounding regions and the most discriminative region; according to the queried similarity, high-similarity surrounding regions are regarded as regions to be fused and are merged with the most discriminative region into a new object region. The region fusion constraint removes low-confidence, noisy regions that are unstable early in training, refines the object region output by the region association network, and outputs an accurate detection result containing the complete object. To address limitation (2), the loss during training is computed on the most discriminative region predicted by the weakly supervised detector, and the proposed method uses no instance-level or image-level labels during this training. Therefore, the surrounding area sensing and association module can simply be integrated with any weakly supervised object detector into an end-to-end unified framework and has wide applicability.
Examples:
This application employs the VOC2007/2012 datasets, which are widely used to evaluate and verify the performance of weakly supervised detectors. Specifically, the images in the VOC dataset are divided into VOC2007 train/val, VOC07/12 train/val, and VOC2007 test, where VOC2007 train/val and VOC07/12 train/val are used to train the weakly supervised detector framework of this application and VOC2007 test is used to verify its performance. After the training database is established, the region sensing and association module proposed here is trained. First, the region extractor generates the most discriminative region and the surrounding regions; the region association network is trained with these two region types, and according to the unsupervised training process and the clustering result in the model, the surrounding regions in the same cluster as the most discriminative region are queried and fused with it as the new object region. Then the region fusion constraint of this application is introduced; it comprises a category sub-constraint and a distance sub-constraint, which remove the noisy, low-confidence surrounding regions introduced by the unstable early clustering, refine the object region output from the region association network, and output the refined, accurate object region as the result of the weakly supervised detector. The method mainly solves the problem that existing weakly supervised detectors often localize only the most discriminative region of the object, reaching a locally optimal solution and exhibiting local focusing; it narrows the gap between weakly and fully supervised object detectors and provides the weakly supervised detector with a module that improves localization accuracy and is easy to integrate.
The region extractor is designed. As shown in FIG. 2, the main idea of the region extractor is to take the object position predicted by the existing weakly supervised detector as a query region. The region extractor then expands the object region by a ratio α and crops 32×32 patches of the expanded region at a fixed stride as key regions. With the initial object position at coordinates (x, y, w, h), the expanded object position is (x, y, αw, αh), where α > 1. The query regions in an image form a set, as shown in the formula:
B_d = { b_r | r = 1, 2, ..., R }
where b_r represents an object region predicted by the existing weakly supervised detector in the image and R represents the total number of predicted regions in the image. Likewise, the key regions form a set, as shown in the formula:
B_s = { b_rn | n = 1, 2, ..., N }
where b_rn represents the n-th key region corresponding to the r-th query region, and N represents the total number of key regions; N varies across the r-to-n correspondences with the initial size of the originally predicted object region. Notably, when cropping key regions, this application uses 60% of the key regions as the input of the region association network. The first reason is that 60% of the key regions suffice for an efficient region association network: the network contains a clustering process, and reducing the number of regions lowers the cost of clustering, which in turn reduces the training time of the weakly supervised detection framework and the inference time. The second reason is that these 60% of regions include part of the upper, left, right, and lower areas of the original predicted region; regions covering all four directions enable the post-clustering fusion process without losing the semantic information near the most discriminative region. Based on the above, adopting 60% of the key regions preserves the detection gain contributed by the surrounding regions while reducing the clustering time.
The region association network is designed. The region association network combines the feature extraction capability of unsupervised learning with a clustering process; the network structure of the whole algorithm is shown in FIG. 3 and mainly comprises a MoCov3 network and a clustering network. The MoCov3 network extracts query-region features through contrastive learning based on the instance discrimination task. Notably, the training process is divided into 3 stages: the first stage trains on the query regions with contrastive learning; the second stage trains on the fused new object regions containing the semantic information of the surrounding regions; the third stage trains on both the fused new object regions and the unfused most discriminative regions. This three-stage strategy strengthens the feature extraction capability of ViT on both query regions and key regions, so as to further find the key regions highly similar to the query regions and locate the complete object region. Specifically, for each input image, the feature extractor generates query regions and key regions, and different views q' and q'' of a region are obtained with the rich data enhancement methods RandomColorJitter, RandomGrayScale, RandomGaussianBlur, and RandomSolarize. A Vision Transformer (ViT) based encoder and a momentum encoder are used; the base encoder extracts the embedded features of the enhanced regions, and the outputs of the momentum encoder and the base encoder predict each other through a loss function that optimizes the whole network. The region association network can be regarded as consisting of region association algorithm (a) and region association algorithm (b). A mapping network is introduced after the base encoder branch; it consists of 3 fully connected (Linear) layers, 3 batch normalization (BN) layers, and 2 ReLU activation functions (in FIG. 3, the blue parts are the fully connected layers, the green parts the batch normalization layers, and the orange parts the ReLU activations). The batch normalization layers make the network converge more easily and the model more stable, and the ReLU activations give the network a nonlinear input-output relation. A prediction network is added after the mapping head; its composition is similar to that of the mapping network, but it consists of 2 fully connected layers, 2 batch normalization layers, and 1 ReLU activation. Unlike the momentum encoder, which has no prediction network, the prediction network of the base encoder generates a prediction vector, and together with the mapping vector generated by the momentum encoder the whole network is optimized with a cross-entropy loss function, as shown in the formula:
L_predict = ctr(z'_d, z''_d) + ctr(z''_d, z'_d)
z'_d = pr(g(f_b(B'_d)))
z''_d = g(f_m(B''_d))
where ctr denotes the contrastive prediction loss of MoCov3 from one view to the other (A to B or B to A), z'_d and z''_d denote the high-dimensional nonlinear features extracted by the mapping and prediction networks for the query regions or key regions, and f_b, f_m, g, and pr denote the base encoder, the momentum encoder, the mapping network, and the prediction network respectively. Note that the base encoder branch consists of the feature extraction network, the mapping network, and the prediction network, while the momentum encoder branch contains only the feature extraction network and the mapping network; to build an encoder with more consistent features, the base encoder updates the momentum encoder by moving average. In the initial stage of training, ViT focuses only on the most discriminative region of the object; this region does not contain the complete object but covers most of it and carries rich semantic information. Then, after ViT has been trained on the query regions, the region association network executes the clustering process of region association algorithm (a): the unsupervisedly trained ViT serves as the feature extractor for the two region types, query regions and key regions, and their features are extracted as the input for clustering in the high-dimensional nonlinear space. Region association algorithm (b) performs K-means clustering on the query-region and key-region features f_b(B_d) and f_b(B_s) respectively, with the number of clusters set to K and each cluster center denoted c_k. As clustering proceeds dynamically, each region is assigned a cluster label according to the Euclidean distance relation in the embedding space, and the key regions become positive keys or negative keys according to the clustering result. A positive key is a key region clustered together with the most discriminative region; it may become part of the object region, and fusing it with the most discriminative query region enlarges the original object region and yields a more complete one. Conversely, a negative key serves as a background region, or as a partial region of another instance in the same image. Nevertheless, since the region association network assigns a cluster label to every region, experiments show that unstable, noisy, low-confidence key regions can receive the same cluster label as the query regions, which turns background or other-instance key regions into positive keys and inevitably affects the fused result. Moreover, in the early training process the region association network focuses only on the query regions and neglects ViT's feature extraction capability on key regions; therefore the region fusion constraint is introduced on top of the region association network.
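The head structure and the symmetric loss described above can be sketched in PyTorch as follows; the feature dimensions are assumptions (the patent does not state them), and ctr is written in a common MoCov3-style InfoNCE form rather than as the exact loss of the original implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def mlp(dims):
        """Linear + BatchNorm1d blocks with a ReLU between blocks (none after the last)."""
        layers = []
        for i in range(len(dims) - 1):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.BatchNorm1d(dims[i + 1])]
            if i < len(dims) - 2:
                layers.append(nn.ReLU(inplace=True))
        return nn.Sequential(*layers)

    g = mlp([768, 4096, 4096, 256])   # mapping network: 3 Linear, 3 BN, 2 ReLU
    pr = mlp([256, 4096, 256])        # prediction network: 2 Linear, 2 BN, 1 ReLU

    def ctr(p, z, tau=0.2):
        """InfoNCE-style contrastive prediction loss between prediction p and target z."""
        p = F.normalize(p, dim=1)
        z = F.normalize(z.detach(), dim=1)                  # momentum branch gets no gradient
        logits = p @ z.t() / tau                            # pairwise similarities of the views
        labels = torch.arange(p.size(0), device=p.device)   # positives on the diagonal
        return F.cross_entropy(logits, labels)

    # Symmetric total loss, mirroring L_predict = ctr(z'_d, z''_d) + ctr(z''_d, z'_d):
    # loss = ctr(pr(g(f_b(q1))), g(f_m(q2))) + ctr(pr(g(f_b(q2))), g(f_m(q1)))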
The region fusion constraint is designed. The main role of the region fusion constraint is to remove low-confidence key regions, refine the object region output by the region association network, and obtain an accurate, complete object region. As shown in FIG. 4 and FIG. 5, the region fusion constraint comprises a category sub-constraint and a distance sub-constraint; on top of the region association network, this application considers the cluster labels of the outputs q' and q'' of the enhanced query regions and their distance relation to the cluster center. The category and distance sub-constraints are designed to measure the accuracy of the clustering result. The most discriminative query regions output q' and q'' after data enhancement. In the region association network, q', q'', and the key regions have their high-dimensional nonlinear features extracted by ViT and are then clustered. If the clustering assigns q' and q'' to one cluster, the clustering is regarded as correct; otherwise this application re-executes the region association algorithm to extract the features of q', q'', and the key regions and re-runs the clustering to query the correct positive keys as parts of the object region. When the category constraint is satisfied, this application calculates the distances d1 and d2 from q' and q'' to the center of the current cluster. If the difference |d1 - d2| exceeds the set threshold T_dis = 0.1, the distance sub-constraint is not satisfied and the region fusion constraint ignores this clustering result. When the category and distance sub-constraints are satisfied simultaneously, the cosine similarity between the remaining key regions and q' or q'' is calculated; if the cosine similarity exceeds the threshold T_score, the remaining key regions are fused with the query region into the final object region, which is output as the result of the weakly supervised detector. The region fusion constraint refines the object region from the region association network, removes low-confidence key regions by re-executing the clustering process, discovers an accurate and complete object region, and strengthens the closeness of the clustering result by computing the distance difference. The clustering process considers similarity in terms of Euclidean distance, while the cosine similarity considers similarity in direction. Schematic diagrams of the clustering process and the region fusion constraint, covering the cases of failing the category sub-constraint, failing the distance sub-constraint, and satisfying the region fusion constraint, are shown in FIG. 6, FIG. 7, and FIG. 8.
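The two sub-constraints and the final similarity gate can be summarized in the following sketch; the cluster centers and labels are assumed to come from the K-means step, and the argument names are hypothetical:

    import numpy as np

    def cos_sim(a, b):
        """Cosine similarity between two feature vectors."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def fusion_constraint(fq1, fq2, c1, c2, centers, key_feats, key_labels,
                          t_dis=0.1, t_score=0.95):
        """Return the indices of key regions to fuse, or None to ignore this
        clustering. fq1/fq2 are the features of the augmented views q'/q'' with
        cluster labels c1/c2; key_feats/key_labels describe the key regions."""
        if c1 != c2:                                # category sub-constraint fails
            return None
        d1 = np.linalg.norm(fq1 - centers[c1])      # distances to the shared centroid
        d2 = np.linalg.norm(fq2 - centers[c1])
        if abs(d1 - d2) > t_dis:                    # distance sub-constraint fails
            return None
        return [i for i, f in enumerate(key_feats)  # cosine-similarity gating
                if key_labels[i] == c1
                and max(cos_sim(f, fq1), cos_sim(f, fq2)) > t_score]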
The weakly supervised object detection framework based on region sensing and association is trained. This comprises: designing the region extractor to divide the output of the existing weakly supervised detector into the most discriminative region and the surrounding regions; designing the region association network to dynamically query the similarity between the surrounding regions and the most discriminative region and form a new object region; and introducing the region fusion constraint to refine the object region from the region association algorithm and obtain an accurate, complete object region.
Specifically, in the region extractor, the clipping range of the surrounding regions is defined with α = 1.2, the size of a cropped patch is 32×32, and every region is resized to 224×224. For the clustering process in the region association algorithm, the number of instances in one image is assumed to give K = 20; the whole model is iterated for 100 epochs with the LARS optimizer. The batch size is set to 64 and the initial learning rate to 1.5×10^-4, scaled according to the batch size. In the region fusion constraint, the distance threshold is set to T_dis = 0.1 to measure the correctness of the clustering result, and, to represent the directional similarity between query regions and key regions, the cosine similarity threshold is set to T_score = 0.95. Momentum and weight decay are set to 0.9 and 1×10^-6.
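Collected as a configuration sketch (the key names are hypothetical; the values are those stated above):

    CONFIG = {
        "alpha": 1.2,          # expansion ratio of the clipping range
        "patch_size": 32,      # surrounding-region crop size (32 x 32)
        "region_size": 224,    # all regions resized to 224 x 224
        "num_clusters": 20,    # K for K-means, the assumed instances per image
        "epochs": 100,
        "optimizer": "LARS",
        "batch_size": 64,
        "base_lr": 1.5e-4,     # scaled according to the batch size
        "t_dis": 0.1,          # distance sub-constraint threshold
        "t_score": 0.95,       # cosine-similarity threshold
        "momentum": 0.9,
        "weight_decay": 1e-6,
    }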
With the weakly supervised object detection framework based on region sensing and association trained through the above steps, the method improves on existing weakly supervised detectors that output only the most discriminative region of the object, narrows the gap between weakly and fully supervised detectors, breaks through the limitation that weak supervision lacks a module for improving localization accuracy, and reduces the dependence of object detection on expensive manual annotation. Experiments prove that the proposed technique detects complete and accurate object regions. Table 1 presents comparative experimental data; the proposed method is evaluated with mAP, the standard evaluation metric in object detection. The comparison shows a 0.27% mAP improvement over Instance-aware, the current state-of-the-art method. In addition, the proposed framework is a single-stage model and, compared with other recent single-stage weakly supervised detection methods, achieves the current best detection result of 55.17%; it also achieves the best result compared with multi-stage weakly supervised frameworks that introduce a Faster R-CNN detector, proving the effectiveness of the framework. Beyond using VOC07 train/val as the training set, this application also trains with the additional VOC07/12 train/val data; the test result of 58.22% remains superior to other methods using the additional data, further demonstrating the robustness and generalization of the framework. FIG. 7 compares experimental results: the green bounding boxes denote the real object regions, the red bounding boxes the results output by an existing weakly supervised detector, and the blue bounding boxes the detection results output by the proposed framework. Compared with other methods, the detection results of the proposed method contain complete object information; in particular for non-rigid objects, the problem of detecting only the most discriminative region is alleviated, and accurate, complete detection results are achieved.
Table 1. Quantitative test results (mAP) with VOC2007 train/val as the training set. [The table content is provided as an image in the original publication and is not reproduced here.]
Table 2. Quantitative test results (mAP) with VOC07/12 train/val as the training set. [The table content is provided as an image in the original publication and is not reproduced here.]
It should be noted that the detailed description is merely for explaining and describing the technical solution of the present invention, and the scope of protection of the claims should not be limited thereto. All changes which come within the meaning and range of equivalency of the claims and the specification are to be embraced within their scope.

Claims (10)

1. A method for detecting a weakly supervised object based on surrounding area awareness and association, comprising the steps of:
step one: acquiring an image to be identified, predicting it with a weakly supervised detector, and taking the predicted object position as the most discriminative region;
step two: expanding the most discriminative region, cropping the expanded region into image blocks, and taking the image blocks as the surrounding regions;
step three: extracting features of the most discriminative region and the surrounding regions, clustering the obtained features, assigning a cluster label to each region, and dividing the regions into different clusters by their cluster labels;
step four: obtaining, through each region's cluster label, the surrounding regions whose label is the same as that of the most discriminative region, and fusing them with the most discriminative region into a new object region;
step five: performing data augmentation on the most discriminative region to obtain two augmented most discriminative regions, denoted q' and q'', wherein q' and q'' are two data-enhanced views of the same region under different viewing angles;
step six: extracting features for q', q'', and the surrounding regions, and clustering the extracted features; if q' and q'' are assigned to the same cluster, treating the clustering as correct and executing step seven; if q' and q'' are not assigned to the same cluster, executing steps three to six again; if q' and q'' are then assigned to the same cluster, treating the clustering as correct and executing step seven, and if q' and q'' still cannot be assigned to the same cluster, ignoring this clustering result and taking the most discriminative region as the final object region;
step seven: for a clustering that assigns q' and q'' to the same cluster, calculating the distances d1 and d2 from q' and q'' to the center of the current cluster; if |d1 - d2| exceeds the set threshold T_dis = 0.1, ignoring this clustering result, otherwise regarding the clustering as correct;
step eight: based on the correct clustering in step seven, obtaining the surrounding regions in the cluster containing both q' and q'', and calculating the cosine similarity between each such surrounding region and q' or q''; if the cosine similarity exceeds the threshold T_score = 0.95, fusing the most discriminative region in the cluster with those surrounding regions into the final object region;
step nine: training a neural network with the most discriminative region and the surrounding regions as inputs and the final object region as output, and performing object detection with the trained neural network.
2. A method of weakly supervised object detection based on surrounding area sensing and correlation as set forth in claim 1, wherein the expansion ratio α in step two is greater than 1.
3. A method of weakly supervised object detection based on surrounding area perception and correlation as set forth in claim 2, wherein the expansion ratio α in step two is 1.2.
4. A method of weakly supervised object detection based on ambient zone perception and correlation as set forth in claim 1, wherein the features are high dimensional non-linear features.
5. A method of weakly supervised object detection based on ambient zone perception and correlation as set forth in claim 4, wherein the high dimensional non-linear features are extracted by ViT.
6. A method of weakly supervised object detection based on surrounding area perception and correlation as claimed in claim 1, wherein the image block size is 32 x 32.
7. The method for detecting the weakly supervised object based on surrounding area sensing and correlation as set forth in claim 1, wherein the surrounding area in the third step is 60% of the surrounding area in the second step.
8. A method of weakly supervised object detection based on surrounding area perception and correlation as set forth in claim 1, wherein the data augmentation comprises random color jittering, random grayscale conversion, random Gaussian blurring, and random solarization.
9. A method of weakly supervised object detection based on ambient zone perception and correlation as set forth in claim 1, wherein the neural network is a MoCov3 network.
10. The method for detecting a weakly supervised object based on surrounding area perception and association as set forth in claim 9, wherein training the neural network specifically comprises the following steps:
the training process adopts unsupervised contrastive learning, training for 100 epochs in total;
(1) during epochs 0-29, the network input is the most discriminative regions;
(2) at epochs 30, 35, 40, ..., 100, the network performs the fusion process, and the fused final object regions are used as network inputs for training;
(3) at the epochs in between (31-34, 36-39, and so on), the network inputs are the fused final object regions together with the unfused most discriminative regions.
CN202211066364.0A 2022-09-01 2022-09-01 Weak supervision object detection method based on surrounding area sensing and association Active CN115439688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211066364.0A CN115439688B (en) 2022-09-01 2022-09-01 Weak supervision object detection method based on surrounding area sensing and association

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211066364.0A CN115439688B (en) 2022-09-01 2022-09-01 Weak supervision object detection method based on surrounding area sensing and association

Publications (2)

Publication Number Publication Date
CN115439688A (en) 2022-12-06
CN115439688B (en) 2023-06-16

Family

ID=84246404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211066364.0A Active CN115439688B (en) 2022-09-01 2022-09-01 Weak supervision object detection method based on surrounding area sensing and association

Country Status (1)

Country Link
CN (1) CN115439688B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109671054A (en) * 2018-11-26 2019-04-23 西北工业大学 The non-formaldehyde finishing method of multi-modal brain tumor MRI
US11023730B1 (en) * 2020-01-02 2021-06-01 International Business Machines Corporation Fine-grained visual recognition in mobile augmented reality
CN114359323A (en) * 2022-01-10 2022-04-15 浙江大学 Image target area detection method based on visual attention mechanism

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10013637B2 (en) * 2015-01-22 2018-07-03 Microsoft Technology Licensing, Llc Optimizing multi-class image classification using patch features
CN107730553B (en) * 2017-11-02 2020-09-15 哈尔滨工业大学 Weak supervision object detection method based on false-true value search method
CN108062574B (en) * 2017-12-31 2020-06-16 厦门大学 Weak supervision target detection method based on specific category space constraint
CN108399406B (en) * 2018-01-15 2022-02-01 中山大学 Method and system for detecting weakly supervised salient object based on deep learning
CN109034258A (en) * 2018-08-03 2018-12-18 厦门大学 Weakly supervised object detection method based on certain objects pixel gradient figure
CN109657684A (en) * 2018-12-20 2019-04-19 郑州轻工业学院 A kind of image, semantic analytic method based on Weakly supervised study
US12118715B2 (en) * 2019-02-01 2024-10-15 Owkin, Inc. Systems and methods for image classification
US11580326B2 (en) * 2019-12-30 2023-02-14 Nec Corporation Ontology matching based on weak supervision
CN111612051B (en) * 2020-04-30 2023-06-20 杭州电子科技大学 Weak supervision target detection method based on graph convolution neural network
CN111931703B (en) * 2020-09-14 2021-01-05 中国科学院自动化研究所 Object detection method based on human-object interaction weak supervision label
CN113409335B (en) * 2021-06-22 2023-04-07 西安邮电大学 Image segmentation method based on strong and weak joint semi-supervised intuitive fuzzy clustering
CN114648665A (en) * 2022-03-25 2022-06-21 西安电子科技大学 Weak supervision target detection method and system
CN114677515B (en) * 2022-04-25 2023-05-26 电子科技大学 Weak supervision semantic segmentation method based on similarity between classes


Also Published As

Publication number Publication date
CN115439688A (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN108108657B (en) Method for correcting locality sensitive Hash vehicle retrieval based on multitask deep learning
CN108288088B (en) Scene text detection method based on end-to-end full convolution neural network
Li et al. Traffic light recognition for complex scene with fusion detections
CN110796186A (en) Dry and wet garbage identification and classification method based on improved YOLOv3 network
CN103150546B (en) video face identification method and device
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN106557579A (en) A kind of vehicle model searching system and method based on convolutional neural networks
CN110032952B (en) Road boundary point detection method based on deep learning
CN112434599B (en) Pedestrian re-identification method based on random occlusion recovery of noise channel
CN112084895B (en) Pedestrian re-identification method based on deep learning
CN113065557A (en) Image matching method based on character extraction
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
CN113076891A (en) Human body posture prediction method and system based on improved high-resolution network
CN108491828B (en) Parking space detection system and method based on level pairwise similarity PVAnet
CN113869412B (en) Image target detection method combining lightweight attention mechanism and YOLOv network
CN118230354A (en) Sign language recognition method based on improvement YOLOv under complex scene
CN109272036A (en) A kind of random fern method for tracking target based on depth residual error network
CN115439688B (en) Weak supervision object detection method based on surrounding area sensing and association
Liu et al. DCMS-YOLOv5: A Dual-Channel and Multi-Scale Vertical Expansion Helmet Detection Model Based on YOLOv5.
CN110889418A (en) Gas contour identification method
CN110717068A (en) Video retrieval method based on deep learning
CN114973202A (en) Traffic scene obstacle detection method based on semantic segmentation
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device
CN112560651B (en) Target tracking method and device based on combination of depth network and target segmentation
CN111273779B (en) Dynamic gesture recognition method based on self-adaptive space supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant