CN115439688A - Weakly supervised object detection method based on surrounding area perception and association

Info

Publication number: CN115439688A
Application number: CN202211066364.0A
Authority: CN
Inventors: 张永强; 丁明理; 田瑞; 张印; 张子安; 张漫
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2022-09-01
Filing date: 2022-09-01
Publication date: 2022-12-06
Anticipated expiration: 2042-09-01
Also published as: CN115439688B

Abstract

A weakly supervised object detection method based on surrounding area perception and association, relating to the technical field of object detection. The method addresses the problem in the prior art that weakly supervised object detection tends to converge to a local optimum: only the most discriminative region of an object is detected rather than the whole object region, which causes object localization to fail and hence results in low detection accuracy. The invention is basic technical research on object detection in practical application scenarios; it promotes, to a certain extent, the deployment of deep-learning object detection and narrows the gap between weakly supervised and fully supervised object detection.

Description

A weakly supervised object detection method based on surrounding area perception and association

Technical field

The invention relates to the technical field of object detection, and in particular to a weakly supervised object detection method based on surrounding area perception and association.

Background

Weakly supervised object detection is a technique that performs object detection using only image-level labels, where an image-level label indicates whether a given object category is present in an image. In real-world applications, instance-level annotations are often unavailable for training a fully supervised detector. Weakly supervised object detection replaces the instance-level labels of fully supervised detection with image-level labels, which greatly reduces the need for instance-level annotated training data and enables object detection when annotated data is scarce. However, compared with fully supervised object detection, weakly supervised object detection has few modules designed to localize object regions precisely (fully supervised detection has region proposal networks, feature pyramid networks, and so on). In addition, the weakly supervised detection task is usually treated as a classification task over candidate regions, which causes the detector to converge to a local optimum and to output only the most discriminative region of an object. Weakly supervised object detection is therefore both a challenging and a promising technique.

At present, the methods that aim to narrow the gap between weakly and fully supervised object detection and to mitigate local focus can be grouped into four representative families. The first family initializes high-quality candidate regions: to preserve the recall of the detection task, it combines class activation maps with the selective search algorithm to generate high-quality candidate regions. These regions are fed to the weakly supervised detector, which maintains high recall while raising the intersection over union (IoU) between candidate regions and ground-truth bounding boxes, yielding accurate detections. The second family uses an iterative refinement strategy to guide the detector toward the complete object region: the assumption that highly overlapping regions should share the same category label serves as prior knowledge during training and provides supervision for the next branch. The third family converts weak supervision into full supervision, combining the advantages of weak supervision (annotations are easy to obtain) and full supervision (strong regression ability): the output of a weakly supervised detector is used to train a fully supervised detector, whose output is taken as the final detection result. The fourth family performs complete object search: it uses class activation maps as a location prior for the object region and searches for the maximum score of the detected region and the minimum score of the surrounding region in order to localize the complete object region. However, none of these four families fundamentally solves local focus, and none is widely applicable: each suits only one existing weakly supervised detection method or one class of such methods.

The limitations of existing weakly supervised object detection methods can therefore be summarized in two points: (1) weakly supervised object detection tends to converge to a local optimum, which manifests as detecting only the most discriminative region of an object rather than the whole object region, so that object localization fails; (2) in fully supervised object detection, carefully designed modules that improve localization accuracy, such as region proposal networks and feature pyramid networks, can be integrated into any fully supervised detector, whereas weakly supervised object detection has few general-purpose modules designed to improve localization accuracy, and existing methods apply only to one weakly supervised detection method or one class of such methods.

Summary of the invention

The purpose of the present invention is to propose a weakly supervised object detection method based on surrounding area perception and association. It addresses the problem in the prior art that weakly supervised object detection tends to converge to a local optimum, which manifests as detecting only the most discriminative region of an object rather than the whole object region, causing object localization to fail and hence lowering detection accuracy.

The technical solution adopted by the present invention to solve the above technical problem is:

A weakly supervised object detection method based on surrounding area perception and association, comprising the following steps:

Step 1: obtain an image to be recognized, run a weakly supervised detector on it, and take the predicted object position as the most discriminative region;

Step 2: expand the most discriminative region, crop the expanded region into image patches, and take the image patches as the surrounding regions;

Step 3: extract features from the most discriminative region and the surrounding regions, cluster the features, assign a cluster label to each region, and partition the regions into clusters according to their cluster labels;

Step 4: use the cluster labels to find the surrounding regions whose label matches that of the most discriminative region, and fuse the most discriminative region with those surrounding regions into a new object region;

Step 5: apply data augmentation to the most discriminative region to obtain two augmented views of it, q' and q";

Step 6: extract features from q', q" and the surrounding regions, and cluster the extracted features. If q' and q" are assigned to the same cluster, the clustering is regarded as correct and step 7 is executed. If they are not, steps 3 to 6 are executed once more; if q' and q" are then assigned to the same cluster, the clustering is regarded as correct and step 7 is executed; if they still are not, the clustering result is ignored and the most discriminative region is taken as the final object region;

Step 7: for a clustering in which q' and q" are assigned to the same cluster, compute the distances d₁ and d₂ from q' and q" to the centre of that cluster. If the difference |d₁ - d₂| exceeds the threshold T_dis = 0.1, the clustering result is ignored; otherwise the clustering is regarded as correct;

Step 8: based on the correct clustering from step 7, take the surrounding regions in the cluster that contains both q' and q", and compute the cosine similarity between each surrounding region and q' or q". If the cosine similarity is greater than the threshold T_score = 0.95, the most discriminative region in the cluster is fused with that surrounding region into the final object region;

Step 9: train a neural network with the most discriminative region and the surrounding regions as input and the final object region as output, and use the trained neural network for object detection.
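
The similarity test of step 8 can be illustrated with a minimal Python sketch. Only the threshold T_score = 0.95 comes from the text; the function names `cosine_similarity` and `select_regions_to_fuse`, the list-of-floats feature format and the handling of zero vectors are illustrative assumptions.

```python
import math

T_SCORE = 0.95  # cosine-similarity threshold stated in step 8


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)


def select_regions_to_fuse(query_feature, key_features, threshold=T_SCORE):
    """Indices of surrounding regions whose similarity exceeds the threshold."""
    return [
        i
        for i, key_feature in enumerate(key_features)
        if cosine_similarity(query_feature, key_feature) > threshold
    ]
```

In this sketch the surrounding regions passed in are assumed to be those already in the same cluster as q' and q"; regions returned by `select_regions_to_fuse` would then be fused with the most discriminative region.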

Further, the expansion ratio α in step 2 is greater than 1.

Further, the expansion ratio α in step 2 is 1.2.

Further, the features are high-dimensional nonlinear features.

Further, the high-dimensional nonlinear features are extracted by a ViT.

Further, the size of each image patch is 32*32.

Further, the surrounding regions used in step 3 are 60% of the surrounding regions obtained in step 2.

Further, the data augmentation comprises random color jitter, random grayscale conversion, random Gaussian blur and random solarization.

Further, the neural network is a MoCo v3 network.

Further, the neural network is trained as follows:

Training uses unsupervised contrastive learning for a total of 100 epochs;

(1) in epochs 0-29, the network input is the most discriminative region;

(2) in epochs 30, 35, 40, ..., 100, the network performs the fusion process, and the fused final object region is used as the network input for training;

(3) in epochs 31-34, 36-39, ..., 96-99, the network input is the fused final object regions together with the most discriminative regions that were not fused.
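
The epoch schedule above can be written as a small lookup, sketched here in Python. The function name `training_phase` and the phase labels are hypothetical; the warm-up length of 30 epochs and the fusion interval of 5 epochs are read off the listed epoch numbers. Whether epoch index 100 is actually reached depends on how the 100 epochs are counted, which the text does not state.

```python
WARMUP_EPOCHS = 30    # epochs 0-29 use only the most discriminative regions
FUSION_INTERVAL = 5   # fusion is performed at epochs 30, 35, 40, ...


def training_phase(epoch):
    """Return which kind of input the network receives at a given epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < WARMUP_EPOCHS:
        return "discriminative_only"
    if (epoch - WARMUP_EPOCHS) % FUSION_INTERVAL == 0:
        return "fuse_and_train"
    return "train_on_fused_and_unfused"
```

For example, `training_phase(35)` falls in the fusion phase, while `training_phase(36)` trains on the regions fused at epoch 35 together with the regions that were left unfused.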

The beneficial effects of the present invention are:

The present application addresses the low detection accuracy of weakly supervised object detection methods and their convergence to a local optimum, overcomes the lack of modules for improving localization accuracy under weak supervision, and reduces the dependence of object detection on expensive manual annotation. The invention is basic technical research on object detection in practical application scenarios; it promotes, to a certain extent, the deployment of deep-learning object detection and narrows the gap between weakly supervised and fully supervised object detection.

Description of drawings

Figure 1 is an example diagram of the surrounding area perception and association module;

Figure 2 is a structural diagram of the region feature extractor;

Figure 3 is a structural diagram of the region association network;

Figure 4 is schematic diagram 1 of the region fusion constraint;

Figure 5 is schematic diagram 2 of the region fusion constraint;

Figure 6 is comparison chart a of weakly supervised object detection results;

Figure 7 is comparison chart b of weakly supervised object detection results;

Figure 8 is comparison chart c of weakly supervised object detection results.

Detailed description

It should be noted that, where no conflict arises, the embodiments disclosed in this application may be combined with one another.

Embodiment 1: this embodiment is described with reference to Figure 1. The weakly supervised object detection method based on surrounding area perception and association described in this embodiment comprises the following:

To address the gap in localization accuracy between weakly supervised and fully supervised object detectors, the present application proposes a module that can be embedded into any weakly supervised detector to form an end-to-end learning framework. The module addresses the convergence of training to a local optimum, which manifests as the detector typically identifying only the most distinctive region of an object. To overcome this shortcoming of current weakly supervised detectors, the region association network proposed in this application uses clustering to dynamically query the similarity between the surrounding regions and the object region predicted by the existing detector, and fuses the highly similar regions so that the object region output by the weakly supervised detector covers the complete object. Introducing a clustering process into the region association network also brings its drawbacks: in the early training stage the clustering is unstable and low-confidence regions are misassigned. The region fusion constraint proposed in this application tightens the conditions for fusion, blocks incorrect fusions, and refines the coarse object region output by the region association network, producing accurate and complete detection results.

Specifically, an example of the surrounding area perception and association module proposed in this application is shown in Figure 1. It contains three components: a region extractor, a region association network and a region fusion constraint. The first component, the region extractor, works with any weakly supervised detector: based on the detection results in an image, it separates the most discriminative regions from the surrounding regions, and it uses different cropping ranges to define the surrounding regions, which improves detection accuracy while reducing the running time of the algorithm. The second component, the region association network, continuously performs contrastive learning and clustering on the two kinds of regions output by the region extractor. It extracts good visual representations of both kinds of regions, feeds these representations into the clustering process, assigns a cluster label to each region, uses the labels to find the surrounding regions with the same label as the most discriminative region, and fuses them into a new object region. The third component is the region fusion constraint. The new object region output by the region association network may include unstable, low-confidence surrounding regions introduced in the early stage of clustering; fusing such regions with the most discriminative region as the final result would harm the completeness and accuracy of the object. The region fusion constraint removes such regions, refining the coarse object region output by the region association network to obtain an accurate object region.

The region extractor proposed in this application first treats the output of the existing detector as the most discriminative region. It then expands the most discriminative region by a ratio α and uses the expanded region as the cropping range. Within the cropping range, 32*32 patches are cropped in sequence as key regions. In the region extractor, every region is either a most discriminative region (query region) or a surrounding region (key region). Both kinds of regions are fed into the region association network proposed in this application, which performs contrastive learning and clustering to discover the complete object region.
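
A minimal Python sketch of the expansion and patch cropping follows. It assumes that a box is given as (cx, cy, w, h) with (cx, cy) the box centre, so that scaling w and h by α grows the box in all four directions; the text does not say how the box is anchored. The function names, the clipping to the image bounds and the non-overlapping grid are likewise illustrative assumptions. Patches are returned as (left, top, right, bottom).

```python
PATCH = 32  # patch side length given in the text


def expand_box(box, alpha=1.2):
    """Scale a (cx, cy, w, h) box about its centre by alpha (> 1)."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    cx, cy, w, h = box
    return (cx, cy, alpha * w, alpha * h)


def tile_key_regions(expanded, image_w, image_h, patch=PATCH):
    """Tile the expanded box with non-overlapping patch x patch squares."""
    cx, cy, w, h = expanded
    # Clip the cropping range to the image (assumption).
    left = max(0, int(cx - w / 2))
    top = max(0, int(cy - h / 2))
    right = min(image_w, int(cx + w / 2))
    bottom = min(image_h, int(cy + h / 2))
    patches = []
    for y0 in range(top, bottom - patch + 1, patch):
        for x0 in range(left, right - patch + 1, patch):
            patches.append((x0, y0, x0 + patch, y0 + patch))
    return patches
```

With this grid, a 64*64 detection expanded by α = 1.5 gives a 96*96 cropping range and therefore a 3*3 arrangement of key regions.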

The region association network proposed in this application combines contrastive learning with a clustering process. For contrastive learning, the region extractor outputs the most discriminative regions as query regions and the surrounding regions as key regions. A series of image augmentations is applied to the query regions: RandomColorJitter (random color jitter), RandomGrayScale (random grayscale conversion), RandomGaussianBlur (random Gaussian blur) and RandomSolarize (random solarization). Such complex augmentation helps the network learn better feature representations of objects. The augmented regions (q', q") are fed into the MoCo v3 framework, which uses an unsupervised training strategy to extract good visual representations of the query regions. With the ViT feature extractor of the MoCo v3 framework trained in this way, the most discriminative regions and the surrounding regions are taken as input to the region association network, and the features of both kinds of regions are extracted and mapped into a high-dimensional nonlinear space. For clustering, the region association network clusters the high-dimensional nonlinear features of the two kinds of regions, assigns a cluster label to each region according to Euclidean distance in the high-dimensional space, and extracts the surrounding regions that fall in the same cluster as the most discriminative region. These surrounding regions are fused with the most discriminative region into a new object region that covers all or most of the real object. In the region association network of this application, the unsupervised training process requires no label information.
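
The label assignment and fusion described above can be sketched in Python as follows. Two points are assumptions, since the text does not specify them: cluster centroids are taken as given (the clustering algorithm that produces them is not named), and fusion is modelled as the union bounding box of the regions that share a cluster. The function names are hypothetical; boxes are (left, top, right, bottom).

```python
import math


def assign_clusters(features, centroids):
    """Label each feature with the index of its nearest centroid (Euclidean)."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(f, centroids[c]))
        for f in features
    ]


def fuse_regions(query_box, key_boxes, query_label, key_labels):
    """Union box of the query box and every key box sharing its cluster label."""
    members = [query_box] + [
        box for box, label in zip(key_boxes, key_labels) if label == query_label
    ]
    return (
        min(b[0] for b in members),
        min(b[1] for b in members),
        max(b[2] for b in members),
        max(b[3] for b in members),
    )
```

If no key region shares the query label, `fuse_regions` returns the query box unchanged, which matches the case where the detector's original prediction is kept.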

The region fusion constraint proposed in this application comprises a category sub-constraint and a distance sub-constraint. The category sub-constraint checks whether the augmented views of each query region are assigned to the same cluster. The distance sub-constraint applies once the category sub-constraint is satisfied: it computes the difference between the distances of the views to the centroid of their cluster. If the difference is below a preset threshold, the clustering is considered successful; otherwise the clustering is considered to have failed, the current clustering result is ignored, and no new object region is fused. Specifically, in the region association network each query region is augmented to produce q' and q", two augmented views of the same region, which are fed into the clustering process together with the surrounding regions. If q' and q" fall in the same cluster, the clustering is regarded as successful; otherwise the category sub-constraint discards the clustering result, and the clustering process is run again to look for surrounding regions that are highly similar to the most discriminative region. Next, when q' and q" are in the same cluster, the distance sub-constraint computes the distances from q' and q" to the centre of that cluster. If the difference between these distances is greater than the set threshold, the clustering result is ignored, the original object region is regarded as already complete, and no further fusion is performed. If the difference is less than the set threshold, the category sub-constraint and the distance sub-constraint are both satisfied, and the surrounding regions that satisfy both are treated as candidates for fusion.
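
The two sub-constraints can be sketched as a pair of checks in Python. The threshold T_dis = 0.1 is the value given in step 7 of the summary; the function names and the plain-list feature format are illustrative assumptions, and Euclidean distance to the centroid is assumed because the clustering itself is described as Euclidean.

```python
import math

T_DIS = 0.1  # threshold on |d1 - d2| given in step 7


def category_constraint(label_q1, label_q2):
    """Both augmented views must be assigned to the same cluster."""
    return label_q1 == label_q2


def distance_constraint(feat_q1, feat_q2, centroid, threshold=T_DIS):
    """The views' distances to the shared centroid must differ by at most the threshold."""
    d1 = math.dist(feat_q1, centroid)
    d2 = math.dist(feat_q2, centroid)
    return abs(d1 - d2) <= threshold


def clustering_is_valid(label_q1, label_q2, feat_q1, feat_q2, centroid):
    """A clustering is accepted only if both sub-constraints hold."""
    return category_constraint(label_q1, label_q2) and distance_constraint(
        feat_q1, feat_q2, centroid
    )
```

When `clustering_is_valid` returns False, the fusion is skipped and the detector's original region is kept, as described above.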

This application uses the VOC2007/2012 datasets as the object of study; users may build their own databases according to actual application needs. To evaluate the weakly supervised object detection technique, the application adopts the VOC datasets, which are widely used in object detection, cover 20 categories from real scenes, and contain 9963 and 22531 images respectively. The VOC images are divided into VOC2007 train/val, VOC07/12 train/val and VOC2007 test. VOC2007 train/val and VOC07/12 train/val are each used to train the weakly supervised detector framework of this application, and VOC2007 test is used to verify its performance. Detection performance is evaluated with the widely used mAP metric, under which a detection is counted as correct when its intersection over union (IoU) with the ground-truth instance is greater than 0.5. After the training database is built, the region perception and association network proposed in this application is trained in an unsupervised, end-to-end manner on the outputs of a trained weakly supervised detector; the final detections mitigate local focus and cover the complete object region.
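
The IoU > 0.5 criterion used in the evaluation can be made concrete with a short Python sketch. Boxes are assumed to be (left, top, right, bottom) in continuous coordinates, and the function names are hypothetical. The sketch checks only the overlap; matching the predicted category and accumulating mAP over a dataset are omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(predicted_box, ground_truth_box, threshold=0.5):
    """A detection counts as correct when its IoU exceeds the threshold."""
    return iou(predicted_box, ground_truth_box) > threshold
```

A detection that covers only the most discriminative part of an object, for example the upper half of its ground-truth box, has an IoU of exactly 0.5 and so fails this strict test, which is why local focus lowers the measured accuracy.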

In summary, this application proposes a novel weakly supervised object detection framework based on surrounding area perception and association. It directly considers the relationship between the region predicted by the existing detector and its surrounding regions, and uses the similarity between the two kinds of regions to update the original, locally focused region and detect the complete object region. In the region association network, when the most discriminative region is fused with surrounding regions into a new object region, the new region may be coarse, because clustering is unstable and the ViT feature extractor is insufficiently trained at the start of training: the region contains the whole object or most of it, but even when it contains the whole object it may fit the ground-truth bounding box poorly. The region fusion constraint takes the effect of early clustering on fusion into account: it refines the coarse object region, removes unstable surrounding regions, obtains an accurate object region, and outputs it as the result of the weakly supervised detector.

This application addresses the local focus of existing weakly supervised object detectors and substantially narrows the gap between weakly supervised and fully supervised detectors. In real-world applications it advances deep-learning object detectors, handles cases where training labels are scarce or unavailable, and provides technical support for the deployment of object detection.

This application proposes a novel surrounding area perception and association module that can be integrated into any existing weakly supervised detector to form an end-to-end trainable detection framework. It comprises three components: a region extractor, a region association network and a region fusion constraint. To address limitation (1), the application focuses directly on the similarity between the surrounding regions and the most discriminative region: surrounding regions found to be highly similar are treated as regions to be fused and are merged with the most discriminative region into a new object region. The region fusion constraint removes the low-confidence, noisy and unstable candidates produced early in training, refines the object region output by the region association network, and outputs accurate detections that contain the complete object. To address limitation (2), the loss during training is computed on the most discriminative regions predicted by the weakly supervised object detector, and the proposed method uses neither instance-level nor image-level labels during training. The surrounding area perception and association module can therefore be combined with any weakly supervised object detector into a unified end-to-end framework and is widely applicable.

Example:

This application uses the VOC2007/2012 datasets, which are widely used to evaluate and verify the performance of weakly supervised detectors. The images in the VOC datasets are divided into VOC2007 train/val, VOC07/12 train/val and VOC2007 test. VOC2007 train/val and VOC07/12 train/val are each used to train the weakly supervised detector framework of this application, and VOC2007 test is used to verify its performance. After the training database is built, the region perception and association module proposed in this application is trained. First, the region extractor generates the most discriminative regions and the surrounding regions, and both kinds of regions are used to train the region association network. Based on the unsupervised training and the clustering results, the surrounding regions that fall in the same cluster as the most discriminative region are found and fused with it into a new object region. The region fusion constraint, comprising the category sub-constraint and the distance sub-constraint, is then applied: the two sub-constraints remove the noisy and low-confidence surrounding regions produced by the unstable early clustering, refine the object region output by the region association network, and output the refined, accurate object region as the result of the weakly supervised detector. The application mainly addresses the tendency of existing weakly supervised object detectors to localize only the most distinctive region of an object, that is, to reach a local optimum that manifests as local focus. It narrows the gap between weakly supervised and fully supervised object detectors and provides weakly supervised detectors with an easily integrated module that improves localization accuracy.

Design the region extractor. As shown in Figure 2, the main idea of the region extractor is to take the object positions predicted by an existing weakly supervised detector and use them as query regions. The region extractor then expands each object region by a factor α and crops the expanded region at a fixed scale into 32*32 patches, which serve as key regions. The initial object position has coordinates (x, y, w, h), and the expanded object position is (x, y, αw, αh), where α > 1. The query regions in an image form a set, as shown in the formula:

$$B_d = \{ b_r \}_{r=1}^{R}$$

where b_r denotes an object region predicted by the existing weakly supervised detector in an image, and R denotes the total number of predicted regions in that image. Similarly, the key regions form a set, as shown in the formula:

$$B_s = \{ b_{rn} \}, \quad r = 1, \dots, R, \; n = 1, \dots, N$$

where b_rn denotes the n-th key region corresponding to the r-th query region, and N denotes the total number of key regions. N varies across query regions because it depends on the initial size of the originally predicted object region. Note that when cropping key regions, this application uses only 60% of them as input to the region association network, for two reasons. First, using 60% of the key regions makes the region association network more efficient: the network includes a clustering process, and reducing the number of regions reduces the clustering cost, which in turn shortens both the training time and the inference time of the weakly supervised detection framework. Second, the 60% of regions still include parts of the areas above, to the left of, to the right of, and below the original predicted region. Covering all four directions is enough for the post-clustering fusion process, so the semantic information near the most discriminative region is not lost. Using 60% of the key regions therefore lets the algorithm detect the surrounding regions and improve detection performance while reducing clustering time.
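The region extractor described above can be illustrated with a minimal sketch. The text does not say whether (x, y) is the box centre or a corner, nor how the 60% of key regions are chosen, so the sketch assumes a centre-based box and uniform random sampling. The function names and the example box are hypothetical.

```python
import math
import random

PATCH = 32        # patch side length stated in the text
KEEP_RATIO = 0.6  # fraction of key regions passed on, as stated in the text


def expand_box(box, alpha=1.2):
    """Scale width and height by alpha: (x, y, w, h) -> (x, y, alpha*w, alpha*h)."""
    x, y, w, h = box
    return (x, y, alpha * w, alpha * h)


def tile_patches(box, patch=PATCH):
    """Cover the box with a grid of patch x patch squares.

    Assumption: (x, y) is the box centre. Patches are returned as
    (left, top, right, bottom); the last row or column may overhang the box.
    """
    x, y, w, h = box
    left, top = x - w / 2, y - h / 2
    cols = max(1, math.ceil(w / patch))
    rows = max(1, math.ceil(h / patch))
    return [
        (left + c * patch, top + r * patch,
         left + (c + 1) * patch, top + (r + 1) * patch)
        for r in range(rows)
        for c in range(cols)
    ]


def sample_key_regions(patches, keep_ratio=KEEP_RATIO, seed=0):
    """Keep a fraction of the patches. Uniform random sampling is an assumption."""
    rng = random.Random(seed)
    count = max(1, round(keep_ratio * len(patches)))
    return rng.sample(patches, count)


query = (100.0, 100.0, 100.0, 50.0)   # hypothetical detector output (x, y, w, h)
expanded = expand_box(query)           # width and height scaled by alpha = 1.2
patches = tile_patches(expanded)       # grid of 32*32 candidate key regions
key_regions = sample_key_regions(patches)
```

In a real pipeline each retained patch would be cropped from the image and resized before being passed to the region association network.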

Design the region association network. The region association network proposed in this application combines the feature extraction capability of unsupervised learning with a clustering process. The network structure of the whole algorithm is shown in Figure 3 and consists mainly of a MoCov3 network and a clustering network. The MoCov3 network uses contrastive learning on an instance discrimination task to extract features of the query regions. Training is divided into three stages. The first stage trains on the query regions with contrastive learning. The second stage trains on the fused new object regions, which contain the semantic information of the surrounding regions. The third stage trains on both the fused new object regions and the most discriminative regions that were not fused. This three-stage strategy strengthens the ability of the ViT to extract features from query regions and key regions, so that key regions highly similar to the query regions can be found and the complete object region discovered.

Specifically, for each input image the region extractor produces query regions and key regions, and the data augmentation methods RandomColorJitter, RandomGrayScale, RandomGaussianBlur, and RandomSolarize are applied to obtain two different views q' and q" of the query regions (key regions). This application uses a Vision Transformer (ViT) as both the base encoder and the momentum encoder. The base encoder extracts the embedded features of the augmented regions, and the outputs of the momentum encoder and the base encoder predict each other to form the loss function that optimizes the whole network. The region association network can be regarded as consisting of region association algorithm (a) and region association algorithm (b).

A mapping network follows the base encoder branch. It consists of 3 fully connected (Linear) layers, 3 batch normalization (BN) layers, and 2 ReLU activation functions; in the figure, the blue parts are fully connected layers, the green parts are batch normalization layers, and the orange parts are ReLU activation functions. The batch normalization layers make the network easier to converge and the model more stable, and the ReLU activations give the network a nonlinear input-output relationship. A prediction network is added after the mapping head. Its composition is similar to that of the mapping network, but it consists of 2 fully connected layers, 2 batch normalization layers, and 1 ReLU activation function. Unlike the base encoder, the momentum encoder has no prediction network. The prediction vector produced by the prediction network of the base encoder and the mapping vector produced by the momentum encoder are used with a cross-entropy loss function to optimize the whole network, as shown in the formula:

$$L_{predict} = ctr(z'_d, z''_d) + ctr(z''_d, z'_d)$$

$$z'_d = pr(g(f_b(B'_d)))$$

$$z''_d = g(f_m(B''_d))$$

where ctr(*) denotes the MoCov3 contrastive-learning prediction loss from A to B or from B to A, and z'_d and z''_d denote the high-dimensional nonlinear features extracted from the query regions or key regions by the mapping network and the prediction network. The base encoder branch consists of the feature extraction network, the mapping network, and the prediction network, whereas the momentum encoder branch includes only the feature extraction network and the mapping network. To keep the features of the two encoders consistent, the momentum encoder is updated from the base encoder by a moving average. In the early stage of training, the ViT attends only to the most discriminative region of the object. Although this region does not cover the complete object, it covers most of it and carries rich semantic information about the object.

After the ViT has been trained on the query regions, the region association network performs clustering. The ViT trained without supervision in region association algorithm (a) is used as a feature extractor to extract the embedded features of the query regions and key regions, and these features are the input to clustering in the high-dimensional nonlinear space. Region association algorithm (b) performs k-means clustering on the features f_b(B_d) and f_b(B_s) of the query regions and key regions. The number of clusters is set to K, and each cluster center is denoted c_k. As the clustering proceeds dynamically, each region is assigned a cluster label according to its Euclidean distance in the embedding space, and the key regions become positive keys or negative keys according to the clustering result. A positive key is a key region clustered together with the most discriminative region. Such a key region may be part of the object, so fusing it with the most discriminative query region enlarges the original object region and yields a more complete one. A negative key, by contrast, is a background region or part of another instance in the same image.

Although the region association network assigns a cluster label to every region, experiments show that unstable, noisy, low-confidence key regions are sometimes assigned the same cluster label as the query regions. Key regions that belong to the background or to other instances then become positive keys, which inevitably degrades the fusion result. In early training the region association network attends only to the query regions and neglects the ViT's feature extraction ability on key regions. For this reason, this application introduces a region fusion constraint on top of the region association network.
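The clustering step, in which key regions are split into positive and negative keys and the positive keys are fused with the query region, can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the features are toy 2-D vectors standing in for ViT embeddings, k-means is a plain Lloyd iteration seeded with the first k points, and fusion is taken to be the union bounding box, which the text does not specify. All function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centres (illustrative choice)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Move every centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def split_keys(query_feat, key_feats, k):
    """Return indices of positive keys (same cluster as the query) and negative keys."""
    labels, _ = kmeans([query_feat] + key_feats, k)
    q_label = labels[0]
    positive = [i for i, lab in enumerate(labels[1:]) if lab == q_label]
    negative = [i for i, lab in enumerate(labels[1:]) if lab != q_label]
    return positive, negative


def fuse(query_box, key_boxes):
    """Assumed fusion rule: union bounding box over (left, top, right, bottom) boxes."""
    boxes = [query_box] + key_boxes
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


query_feat = [0.0, 0.0]
key_feats = [[0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [5.2, 5.0]]
positive, negative = split_keys(query_feat, key_feats, k=2)

fused = fuse((10, 10, 50, 50), [(40, 0, 72, 32), (0, 20, 32, 52)])
```

Here the two key regions whose features lie near the query become positive keys, and the other two become negative keys. The text sets K = 20 for real images; k = 2 is used only to keep the toy example readable.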

Design the region fusion constraint. The main role of the region fusion constraint is to remove low-confidence key regions, refine the object region output by the region association network, and obtain an accurate and complete object region. As shown in Figures 4 and 5, the region fusion constraint comprises a category sub-constraint and a distance sub-constraint. Building on the region association network, this application considers the cluster labels of q' and q", the two augmented views of the query regions, and their distances to the cluster center. The two sub-constraints are designed to measure the accuracy of the clustering result. For the most discriminative query regions, data augmentation outputs q' and q". In the region association network, high-dimensional nonlinear features of q', q", and the key regions are extracted by the ViT and then clustered. If the clustering assigns q' and q" to the same cluster, the clustering is regarded as correct. Otherwise, this application re-executes the region association algorithm to extract the features of q', q", and the key regions and re-runs the clustering to find accurate positive keys as part of the object region. When the category sub-constraint is satisfied, this application computes the distances d₁ and d₂ from q' and q" to the current cluster center. Similarly, if the distance difference |d₁-d₂| exceeds the set threshold T_dis = 0.1, the distance sub-constraint is not satisfied and the region fusion constraint ignores this clustering result. When both sub-constraints are satisfied, the cosine similarity between each remaining key region and q' or q" is computed. If the cosine similarity is greater than the threshold T_score, the remaining key region is fused with the query region into the final object region, which is output as the result of the weakly supervised detector. The region fusion constraint thus refines the object region from the region association network: re-running the clustering removes low-confidence key regions and finds the accurate, complete object region, and the distance-difference check strengthens the proximity relationship within the clustering result. The clustering considers similarity in Euclidean distance, while the cosine similarity considers similarity in direction. Schematic diagrams of the clustering process and the region fusion constraint are shown in Figures 6, 7, and 8, covering the cases where the category sub-constraint is not satisfied, the distance sub-constraint is not satisfied, and the region fusion constraint is satisfied.
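The decision logic of the region fusion constraint can be sketched as a single function. The thresholds come from the text; the function name, its signature, the choice of q' (rather than q") for the cosine check, and the toy 2-D features are assumptions made for illustration.

```python
import math

T_DIS = 0.1     # distance threshold from the text
T_SCORE = 0.95  # cosine similarity threshold from the text


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def region_fusion_constraint(q1, q2, label_q1, label_q2, center, key_feats,
                             t_dis=T_DIS, t_score=T_SCORE):
    """Return indices of key regions to fuse, or None if the clustering is rejected.

    q1, q2     : features of the two augmented views q' and q"
    label_*    : their cluster labels
    center     : centre of the cluster that contains them
    key_feats  : features of the key regions in that cluster
    """
    # Category sub-constraint: both views must land in the same cluster.
    if label_q1 != label_q2:
        return None
    # Distance sub-constraint: both views must be about equally far from the centre.
    d1, d2 = euclidean(q1, center), euclidean(q2, center)
    if abs(d1 - d2) > t_dis:
        return None
    # Cosine check (against q' here, an illustrative choice) filters the remaining keys.
    return [i for i, k in enumerate(key_feats) if cosine(k, q1) > t_score]


q1, q2 = [1.0, 0.0], [0.98, 0.05]
keys = [[2.0, 0.1], [0.0, 1.0]]
accepted = region_fusion_constraint(q1, q2, 3, 3, [1.0, 0.02], keys)
rejected_by_category = region_fusion_constraint(q1, q2, 3, 4, [1.0, 0.02], keys)
rejected_by_distance = region_fusion_constraint(q1, [2.0, 0.0], 3, 3, [1.0, 0.0], keys)
```

A return value of None corresponds to ignoring the clustering result, in which case the caller would re-run clustering or fall back to the most discriminative region.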

Train the weakly supervised object detection framework based on region perception and association proposed in this application. A region extractor is applied to the output of an existing weakly supervised detector to separate the most discriminative region from the surrounding regions. A region association network dynamically queries the similarity between the surrounding regions and the most discriminative region and forms a new object region. A region fusion constraint then refines the object region produced by the region association algorithm to obtain an accurate and complete object region.

Specifically, in the region extractor, α = 1.2 is set to define the cropping range of the surrounding regions, the cropped patch size is 32*32, and all regions are resized to 224*224. In the region association algorithm, the number of clusters is set to K = 20 in view of the number of instances in an image, the whole model is trained for 100 epochs, and the LARS optimizer is used. The batch size is 64, and the initial learning rate is 1.5*10^-4, adjusted according to the batch size. In the region fusion constraint, the distance threshold T_dis = 0.1 is used to measure the correctness of the clustering result, and the cosine similarity threshold T_score = 0.95 is used to express the directional similarity between query regions and key regions. Momentum and weight decay are set to 0.9 and 1*10^-6, respectively.
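For reference, the hyperparameters listed above can be collected in one configuration object. The field names are hypothetical. The text says only that the learning rate is adjusted according to the batch size, so the linear scaling rule with a reference batch of 256 used below is an assumption, not a value taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    alpha: float = 1.2          # expansion factor of the region extractor
    patch_size: int = 32        # side length of cropped key regions
    input_size: int = 224       # all regions are resized to 224*224
    num_clusters: int = 20      # K for k-means
    epochs: int = 100
    optimizer: str = "LARS"
    batch_size: int = 64
    base_lr: float = 1.5e-4     # initial learning rate before batch-size adjustment
    momentum: float = 0.9
    weight_decay: float = 1e-6
    t_dis: float = 0.1          # distance sub-constraint threshold
    t_score: float = 0.95       # cosine similarity threshold


def scaled_lr(cfg, reference_batch=256):
    """Assumed linear scaling rule: lr = base_lr * batch_size / reference_batch."""
    return cfg.base_lr * cfg.batch_size / reference_batch


cfg = TrainConfig()
lr = scaled_lr(cfg)
```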

The weakly supervised object detection framework based on region perception and association, trained through the above steps, overcomes the tendency of existing weakly supervised detectors to output only the most discriminative region of an object, narrows the gap between weakly supervised and fully supervised detectors, removes the limitation that weakly supervised detection lacks a module for improving localization accuracy, and reduces the dependence of object detection on expensive manual annotation. Experiments show that the proposed method can detect complete and accurate object regions. Table 1 gives the comparison of experimental results, evaluated with mAP, the standard metric in object detection. The comparison shows that the proposed method improves mAP by 0.27% over Instance-aware, the current state-of-the-art weakly supervised object detection method. In addition, the proposed framework is a single-stage model; compared with other recent single-stage weakly supervised object detection methods, it achieves the highest detection result to date, 55.17%. It also achieves the highest result when compared with multi-stage weakly supervised frameworks that introduce a FasterRCNN detector, which demonstrates the effectiveness of the proposed framework. Beyond training on VOC07train/val, this application also trains on the additional dataset VOC07/12train/val; the test result of 58.22% still exceeds other methods that use additional data, further demonstrating the robustness and generalization of the framework. Figure 7 compares the experimental results: green bounding boxes mark the ground-truth object regions, red bounding boxes mark the output of the existing weakly supervised detector, and blue bounding boxes mark the detections output by the proposed framework. As the figure shows, compared with other methods the detections of the proposed method contain the complete object, especially for non-rigid objects, which mitigates the problem of detecting only the most discriminative region and yields accurate and complete detections.

Table 1 Quantitative results (mAP) with VOC2007train/val as the training set

Table 2 Quantitative results (mAP) with VOC07/12train/val as the training set

It should be noted that the specific embodiments merely explain and illustrate the technical solution of the present invention and do not limit the scope of protection. Any merely partial changes made in accordance with the claims and description of the present invention still fall within the scope of protection of the present invention.

Claims

1. A weakly supervised object detection method based on surrounding area perception and association, characterized in that it comprises the following steps:

Step 1: Obtain the image to be recognized, run a weakly supervised detector on it, and use the predicted object position as the most discriminative region;

Step 2: Expand the most discriminative region, crop the expanded region into image blocks, and use the image blocks as the surrounding regions;

Step 3: Extract features from the most discriminative region and the surrounding regions, cluster the extracted features, and assign each region a cluster label that places it in one of the clusters;

Step 4: Using the cluster labels, obtain the surrounding regions that share a label with the most discriminative region, and fuse them with the most discriminative region into a new object region;

Step 5: Apply data augmentation to the most discriminative region to obtain two augmented most discriminative regions, q' and q";

Step 6: Extract features from q', q", and the surrounding regions, and cluster the extracted features. If q' and q" are assigned to the same cluster, regard the clustering as correct and perform step 7. If q' and q" are not assigned to the same cluster, re-execute steps 3 to 6. If q' and q" are then assigned to the same cluster, regard this clustering as correct and perform step 7; if q' and q" still cannot be assigned to the same cluster, ignore the clustering result and use the most discriminative region as the final object region;

Step 7: For a clustering that assigns q' and q" to the same cluster, calculate the distances d₁ and d₂ from q' and q" to the center of the current cluster. If the distance difference |d₁-d₂| exceeds the set threshold T_dis = 0.1, ignore the clustering result; otherwise, regard the clustering as correct;

Step 8: Based on the correct clustering from step 7, obtain the surrounding regions in the cluster that contains both q' and q", and calculate the cosine similarity between each such surrounding region and q' or q". If the cosine similarity is greater than the threshold T_score = 0.95, fuse the most discriminative region in the cluster with that surrounding region into the final object region;

Step 9: Train the neural network with the most discriminative region and the surrounding regions as input and the final object region as output, and use the trained neural network for object detection.

2. A weakly supervised object detection method based on surrounding area perception and association according to claim 1, characterized in that the expansion ratio α in step 2 is greater than 1.

3. A weakly supervised object detection method based on surrounding area perception and association according to claim 2, characterized in that the expansion ratio α in step 2 is 1.2.

4. A weakly supervised object detection method based on surrounding area perception and association according to claim 1, wherein the feature is a high-dimensional nonlinear feature.

5. A weakly supervised object detection method based on surrounding area perception and association according to claim 4, characterized in that the high-dimensional nonlinear features are extracted by ViT.

6. A weakly supervised object detection method based on surrounding area perception and association according to claim 1, characterized in that the size of the image block is 32*32.

7. A weakly supervised object detection method based on surrounding area perception and association according to claim 1, characterized in that the surrounding regions used in step 3 are 60% of the surrounding regions obtained in step 2.

8. A weakly supervised object detection method based on surrounding area perception and association according to claim 1, characterized in that the data augmentation includes random color jitter, random grayscale, random Gaussian blur, and random solarization.

9. A weakly supervised object detection method based on surrounding area perception and association according to claim 1, characterized in that the neural network is a MoCov3 network.

10. A weakly supervised object detection method based on surrounding area perception and association according to claim 9, characterized in that the specific steps of training the neural network are:

The training process is carried out by unsupervised contrastive learning, and a total of 100 epochs are trained;

(1) In epochs 0-29, the network input is the most discriminative region;

(2) In epochs 30, 35, 40, ..., 100, the network performs the fusion process, and the fused final object region is used as the network input for training;

(3) In epochs 31-34, 36-39, ..., 96-99, the network input is the fused final object region together with the unfused most discriminative region.
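The three-phase schedule in this claim can be expressed as a small lookup on the epoch index. The sketch below only encodes the epoch ranges stated in the claim; the function name, parameter names, and phase labels are hypothetical.

```python
def stage_for_epoch(epoch, warmup=30, period=5):
    """Map an epoch index to the training phase described in the claim."""
    if epoch < warmup:
        # Epochs 0-29: train on the most discriminative regions only.
        return "query_only"
    if (epoch - warmup) % period == 0:
        # Epochs 30, 35, 40, ..., 100: run fusion and train on the fused regions.
        return "fuse_and_train_fused"
    # Epochs 31-34, 36-39, ..., 96-99: train on fused plus unfused regions.
    return "fused_plus_unfused"


schedule = {epoch: stage_for_epoch(epoch) for epoch in range(0, 101)}
```

A training loop would consult this mapping at the start of each epoch to decide whether to run the fusion step and which regions to feed to the network.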