CN117725243B

CN117725243B - Class irrelevant instance retrieval method based on hierarchical semantic region decomposition

Info

Publication number: CN117725243B
Application number: CN202410173702.3A
Authority: CN
Inventors: 赵万磊; 孙琦颖
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2024-02-07
Filing date: 2024-02-07
Publication date: 2024-06-04
Anticipated expiration: 2044-02-07
Also published as: CN117725243A

Abstract

The invention discloses a class irrelevant instance retrieval method based on hierarchical semantic region decomposition. A detector extracts instances from each feature image and stores them in a potential instance library; a descriptor then extracts features from the instances in the potential instance library and stores the resulting instance features in a feature library. When a query picture is input, the retriever extracts features from the query picture to obtain query features, matches them against the instance features in the feature library, and returns the top K instance features most similar to the query features as the retrieval result. In the detector, the image is hierarchically decomposed into semantic regions and the regions are screened, so that instance-level regions at different scales are discovered rapidly and the instances that may be retrieved are found comprehensively. The hierarchical decomposition addresses the object occlusion and nesting that commonly occur in real instance retrieval scenes.

Description

Class irrelevant instance retrieval method based on hierarchical semantic region decomposition

Technical Field

The invention relates to the technical field of computer vision, in particular to a class irrelevant instance retrieval method based on hierarchical semantic region decomposition.

Background

Instance retrieval is a direction of image retrieval that is currently in wide demand. It does not retrieve the picture as a whole; instead, it retrieves a specific instance that the user designates within the picture. Given a query picture in which the query instance is marked by a rectangular box, the task is to retrieve the pictures containing that instance from a database of a large number of images and to mark the specific position of the instance on each retrieved image. Such fine-grained retrieval is widely used in real-world scenarios. For example, on e-commerce and online shopping platforms a user can upload a picture of a commodity, and the system performs instance retrieval on the image features and returns similar or related commodities to help the user find the commodity of interest. In addition, for large-scale image libraries or image databases, instance retrieval can be used to quickly search for and locate specific images, which is widely used in image management systems, photo album applications, and image archives.

Conventional retrieval methods generally use hand-crafted features to describe potential instances and query instances, and then obtain more refined and accurate instance features through techniques such as feature compression. Such methods generally have high computational requirements and cannot robustly cope with non-rigid transformations of images. In recent years, with the development of computer vision technology, some methods have attempted to use deep learning for instance discovery and feature extraction. Some methods represent an image by a deep global feature aggregated from a convolutional layer, assigning high weights to potential instance areas during pooling. Such approaches cannot localize instances and are subject to background interference. Other methods focus on designing instance-level features for the retrieval task and divide instance retrieval into two steps: instance localization, and feature extraction from the detected instance areas. Among these, methods whose backbone network is trained in a supervised or weakly supervised manner are limited in their ability to detect objects of unknown classes. In addition, the instances at different scales found by existing methods are not comprehensive enough, which degrades retrieval performance.

The main difficulty of instance retrieval lies in discovering potential instances during instance localization, because the scale, size, and shape of a subsequent query instance are unknown, and the instance may even be incomplete. How to find potential instances and extract their features is therefore the key to solving the problem. Existing methods extract instances neither exhaustively nor accurately, and cannot handle instance occlusion or multi-instance retrieval. There is thus a need for a simple but efficient hierarchical instance discovery method to address these challenges.

Disclosure of Invention

Aiming at the problems existing in the prior art, the invention aims to provide a class irrelevant instance retrieval method based on hierarchical semantic region decomposition, which realizes quick discovery of instance level regions with different scales by carrying out hierarchical semantic region decomposition and screening on images so as to realize comprehensive discovery of instances which can be retrieved.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A class irrelevant instance retrieval method based on hierarchical semantic region decomposition, in which a detector extracts instances from feature images and stores them in a potential instance library; a descriptor then extracts features from the instances in the potential instance library to obtain instance features, which are stored in a feature library; when a query picture is input, a retriever extracts features from the query picture to obtain query features, matches the query features against the instance features in the feature library, and obtains the top K instance features most similar to the query features as the retrieval result. The detector processes a feature image as follows:

S1, encoding a feature image by using a feature encoder trained by an unsupervised method to obtain an encoded image block set;

The feature image comes from the feature library; after it is encoded by the unsupervised feature encoder, a feature set is obtained, namely the encoded image block set;

S2, hierarchically decomposing the image by fast bisecting clustering, so as to detect instance areas at different scales; specifically:

Adding the encoded image block set to a queue of nodes to be processed, and performing hierarchical bisection on each node in the queue until the queue is empty;

S2.1, initializing the clustering of the image block set in each node: if the similarity between two image blocks is greater than a threshold, the two image blocks are considered similar and the degree of each of them is increased by 1; after the similarity of all pairs of image blocks has been computed, the image block features with the highest degree and the lowest degree are selected as the clustering seeds, and the remaining image block features are assigned to the corresponding clusters according to their distances to the seeds;

S2.2, performing binary clustering on each cluster obtained in the step S2.1;

An objective function is defined that aims to minimize the distance between features within each cluster; it is expressed in terms of the number of clusters k, each cluster, and the image block features belonging to that cluster;

Specifically, the distance between each feature and its cluster is given by a formula (not reproduced in this text) in terms of the cluster, the number of features in the cluster, the sum of the features in the cluster, and the transpose of the image block feature;

When a feature is moved into another cluster, the distance between the feature and the cluster it is moved into is calculated by a corresponding formula (not reproduced in this text) in terms of the target cluster, the number of features in that cluster, and the sum of the features in that cluster;

If the condition on these distances is satisfied, the move optimizes the objective function and is carried out; this process is iterated until the objective function converges, yielding two new image block sets;

S2.3, detecting regional connectivity;

Mapping the two image block sets obtained in step S2.2 back to the original image to obtain two sub-images, and further dividing each sub-image into smaller sub-regions according to the spatial connectivity of the image blocks; through this regional connectivity detection each sub-image yields at least one connected sub-region, which makes subsequent processing more accurate and reliable;

S2.4, judging whether to continue bisecting;

Mapping the plurality of sub-regions obtained in step S2.3 to the encoder to obtain a plurality of encoded image block sets, and using the average internal connectivity of each image block set to judge whether to continue bisecting it;

For the feature set S corresponding to each image block set, connectivity is measured by counting the internal edges of the feature set; specifically, the average internal connectivity is computed from the total number of internal edges of the feature set S; when the average internal connectivity exceeds a threshold, the feature set S is not segmented further, that is, the image block set corresponding to the feature set S is dropped from further bisection; otherwise, the image block set corresponding to the feature set S is added to the queue and bisection continues;

S3, screening instances: instances whose semantics are not salient are filtered out according to a saliency condition;

Whether a feature set node S is a dummy node is determined by the strength of its feature energy, specifically as follows:

For the feature set S corresponding to each image block set, given the set of features with high energy, the overlap ratio between the feature set S and the high-energy set is computed (the formula is not reproduced in this text);

When the overlap ratio is smaller than a threshold, the feature set S is a dummy node and is discarded; otherwise, the feature set is retained.

In step S2.4, the internal edges of the set S are established as follows: if the similarity between two image blocks in the set S is greater than a threshold, the two are considered similar and an edge is established between them; this edge is one of the internal edges of the set S.

In step S2.4, the average internal connectivity is defined as the ratio of the total number of internal edges to the size of the feature set; its value lies within a bounded range (the bounds are not reproduced in this text).

The instance processing in the potential instance library by the descriptor is specifically as follows:

For a feature set in an image obtained by the detector, the features are projected, in order, onto the feature map downsampled by the encoder, whose size is given by the length and the width of the feature map; this forms a mask of the same size, with 1 where a feature of the set is present and 0 elsewhere. The mask is upsampled to the original image size to give the instance representation mask at the original size. The image is input into a feature extractor based on a convolutional neural network to obtain a feature map encoding the image. For each localization result represented by a mask, the mask is downsampled to the size of this feature map; the feature map is multiplied by the downsampled mask, and generalized mean pooling is applied on each channel to obtain the feature representation of each instance.

With the above scheme, a hierarchy-based instance detection structure is introduced to discover instance information at different scales, together with a reasonable mechanism for stopping the hierarchy and for screening instances. On the one hand, the hierarchical structure allows instances at different levels to be found completely, and semantically salient regions are detected by minimizing the feature distance within clusters; this improves retrieval precision and also the robustness of the detection model in special retrieval tasks such as multi-instance and occlusion cases. On the other hand, a specific optimization algorithm is introduced to address the computational complexity and local optima of traditional clustering methods, which greatly reduces the computation required for instance discovery so that potential instances can be found quickly. In addition, the saliency-based instance screening mechanism also greatly improves retrieval efficiency.

Drawings

FIG. 1 is an overall flow chart of an example search of the present invention;

FIG. 2 is a flow chart of the detector portion of the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the following examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. On the contrary, the invention is intended to cover any alternatives, modifications, equivalents, and variations as may be included within the spirit and scope of the invention as defined by the appended claims. Further, in the following detailed description of the present invention, certain specific details are set forth in order to provide a thorough understanding of the present invention. The present invention will be fully understood by those skilled in the art without the details described herein.

As shown in FIG. 1, the invention discloses a class irrelevant instance retrieval method based on hierarchical semantic region decomposition. A detector extracts instances from each feature image and stores them in a potential instance library; a descriptor then extracts features from the instances in the potential instance library to obtain instance features and stores them in a feature library. When a query picture is input, the retriever extracts features from the query picture to obtain query features, matches them against the instance features in the feature library, and returns the top K instance features most similar to the query features as the retrieval result.

As shown in FIG. 2, the detector processes the feature image as follows:

S1, a feature encoder trained by an unsupervised method is used to encode the feature image, and an encoded image block set is obtained.

The feature image (original image) comes from the feature library; after it is encoded by the unsupervised feature encoder, a feature set is obtained, i.e. the set of encoded image blocks.

S2, the image is hierarchically decomposed by fast bisecting clustering, and instance areas at different scales are detected. Specifically:

The encoded image block set is added to a queue of nodes to be processed, and hierarchical bisection is performed on each node in the queue (each node corresponds to one image block set) until the queue is empty.
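The queue-driven control flow of step S2 can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the function name `decompose` and the three callbacks (`bisect`, `split_connected`, `keep_splitting`) are hypothetical stand-ins for steps S2.1 to S2.4 and do not appear in the source, and recording every region as a candidate is an assumption about how the multi-scale regions are collected.

```python
from collections import deque

def decompose(root, bisect, split_connected, keep_splitting):
    """Queue-driven hierarchical bisection (illustrative skeleton).

    root:            initial set of image-block indices
    bisect:          node -> (part_a, part_b), stands in for S2.1/S2.2
    split_connected: part -> list of connected sub-regions, stands in for S2.3
    keep_splitting:  region -> True if it should be bisected again (S2.4)
    """
    queue = deque([root])
    regions = []                      # candidate regions found at any scale
    while queue:                      # stop when no node is left in the queue
        node = queue.popleft()
        for part in bisect(node):
            for region in split_connected(part):
                regions.append(region)
                if keep_splitting(region):
                    queue.append(region)
    return regions
```

With toy callbacks that halve a list and stop at single elements, a four-block root yields two regions of size two followed by four regions of size one, which illustrates how regions at several scales are collected in one pass.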

S2.1, the clustering of the image block set in each node is initialized: if the similarity between two image blocks is greater than a threshold, the two image blocks are considered similar and the degree of each of them is increased by 1. After the similarity of all pairs of image blocks has been computed, the image block features with the highest degree and the lowest degree are selected as the clustering seeds, and the remaining image block features are assigned to the corresponding clusters according to their distances to the seeds.
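A minimal sketch of this degree-based seed selection is given below. It assumes, beyond what the source states, that the block features are L2-normalized so that the dot product serves as the similarity and the seed with the larger dot product is the nearer one; the function names and the tie-breaking rule are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def init_two_clusters(features, threshold):
    """Pick the highest- and lowest-degree blocks as seeds, then assign the rest."""
    n = len(features)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if dot(features[i], features[j]) > threshold:  # similar pair
                degree[i] += 1
                degree[j] += 1
    hi = max(range(n), key=degree.__getitem__)  # first block with the highest degree
    lo = min(range(n), key=degree.__getitem__)  # first block with the lowest degree
    labels = []
    for f in features:
        # Assign to the more similar seed (ties go to the high-degree seed).
        labels.append(0 if dot(f, features[hi]) >= dot(f, features[lo]) else 1)
    return labels, degree

labels, degree = init_two_clusters([(1, 0), (1, 0), (0.8, 0.6), (0, 1)], 0.7)
```

In this toy input the first three blocks are mutually similar and the last one is isolated, so the isolated block becomes the low-degree seed and forms its own cluster.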

S2.2, performing binary clustering on each cluster obtained in the step S2.1.

An objective function is defined that aims to minimize the distance between features within each cluster. It is expressed in terms of the number of clusters k, each cluster, and the image block features belonging to that cluster. The formulas of this step additionally use the number of features in a cluster, the sum of the features in a cluster, and the transpose of an image block feature (which is a vector).

Specifically, the distance between each feature and its cluster is given by a formula (not reproduced in this text) in terms of the cluster, the number of features in the cluster, the sum of the features in the cluster, and the transpose of the image block feature.

The similarity between features is measured by the sum of squared Euclidean distances between them. An optimization procedure gradually reduces the value of the objective function by moving features into different clusters. When a feature is moved into another cluster, the distance between the feature and the cluster it is moved into is calculated by a corresponding formula (not reproduced in this text) in terms of the target cluster, the number of features in that cluster, and the sum of the features in that cluster.

If the condition on these distances is satisfied, the objective function is improved by the move, and the move is carried out. This process is iterated until the objective function converges, yielding two new image block sets.
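The move-based refinement can be illustrated with the sketch below. Because the source formulas are not available, this is a swapped-in simplification and not the patented update rule: it keeps a running feature sum and count per cluster, and moves a feature when its squared Euclidean distance to the other cluster's centroid is smaller than the distance to its own centroid. Such a move always lowers the within-cluster sum of squared distances, so the loop terminates. All names are hypothetical.

```python
def sq_dist_to_cluster(x, total, count):
    """Squared Euclidean distance from x to the centroid total / count."""
    return sum((xi - ti / count) ** 2 for xi, ti in zip(x, total))

def refine_bisection(features, labels, max_iters=100):
    """Refine a two-cluster labelling by moving one feature at a time."""
    labels = list(labels)
    dim = len(features[0])
    count = [labels.count(0), labels.count(1)]
    total = [[0.0] * dim, [0.0] * dim]        # per-cluster sums of features
    for f, c in zip(features, labels):
        for d in range(dim):
            total[c][d] += f[d]
    for _ in range(max_iters):
        moved = False
        for i, f in enumerate(features):
            src, dst = labels[i], 1 - labels[i]
            if count[src] <= 1 or count[dst] == 0:
                continue                      # never empty a cluster
            d_src = sq_dist_to_cluster(f, total[src], count[src])
            d_dst = sq_dist_to_cluster(f, total[dst], count[dst])
            if d_dst < d_src:                 # the move lowers the objective
                for d in range(dim):
                    total[src][d] -= f[d]
                    total[dst][d] += f[d]
                count[src] -= 1
                count[dst] += 1
                labels[i] = dst
                moved = True
        if not moved:                         # converged: no feature moved
            break
    return labels
```

Keeping only the sums and counts means a move costs time proportional to the feature dimension, which is the kind of saving that makes repeated bisection cheap.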

S2.3, detecting the connectivity of the area.

The two image block sets obtained in step S2.2 are mapped back to the original image to obtain two sub-images. These two sub-images are not necessarily two suitable instance representations, because their connectivity on the feature image must also be considered. Further partitioning is therefore required to obtain connected sub-regions.

According to the spatial connectivity of the image blocks, each sub-image is further divided into smaller sub-regions, each of which is guaranteed to be connected on the original image. Through this region connectivity detection each sub-image yields at least one connected sub-region, which makes subsequent processing more accurate and reliable.
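One way to carry out such a connectivity check is a breadth-first search over the grid of image blocks, sketched below. The row-major block indexing, the use of 4-connectivity, and the function name are assumptions made for illustration; the source does not specify the neighbourhood.

```python
from collections import deque

def connected_regions(blocks, grid_w):
    """Split block indices (row-major, grid width grid_w) into 4-connected groups."""
    remaining = set(blocks)
    regions = []
    while remaining:
        start = remaining.pop()
        region, queue = [start], deque([start])
        while queue:
            idx = queue.popleft()
            col = idx % grid_w
            neighbours = [idx - grid_w, idx + grid_w]   # up and down
            if col > 0:
                neighbours.append(idx - 1)              # left, same row only
            if col < grid_w - 1:
                neighbours.append(idx + 1)              # right, same row only
            for nb in neighbours:
                if nb in remaining:
                    remaining.remove(nb)
                    region.append(nb)
                    queue.append(nb)
        regions.append(sorted(region))
    return sorted(regions)

# On a 3x3 grid, blocks 0 and 1 touch, while block 8 is isolated.
parts = connected_regions([0, 1, 8], grid_w=3)
```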

S2.4, judging whether to continue bisection or not.

The plurality of sub-regions obtained in step S2.3 are mapped to the encoder to obtain a plurality of encoded image block sets, and the average internal connectivity of each image block set is used to judge whether to continue bisecting it.

For the feature set S corresponding to each image block set, connectivity is measured by counting the internal edges of the feature set. The internal edges of the set S are established as follows: if the similarity between two image blocks in the set S is greater than a threshold, the two are considered similar and an edge is established between them; this edge is one of the internal edges of the set S.

Specifically, the average internal connectivity is defined as the ratio of the total number of internal edges of the feature set S to the size of the feature set; its value lies within a bounded range (the bounds are not reproduced in this text). As deeper and deeper nodes are decomposed, the feature set S becomes more compact and its average internal connectivity increases. The invention therefore controls the granularity of the segmentation by adjusting the threshold on the average internal connectivity: when the average internal connectivity exceeds the threshold, the feature set S is not segmented further, that is, the image block set corresponding to the feature set S is dropped from further bisection; otherwise, the image block set corresponding to the feature set S is added to the queue and bisection continues.

A larger threshold leads to a finer segmentation granularity. The termination condition ensures that the segmentation is not excessively subdivided, avoids producing too many subsets, and ensures that each resulting subset is tightly connected internally.
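The stopping test can be sketched as follows, reading the text as "stop bisecting once the set is tightly connected". The sketch assumes L2-normalized features with the dot product as similarity and counts each undirected edge once; both choices, and the function names, are assumptions rather than details confirmed by the source.

```python
def average_internal_connectivity(features, sim_threshold):
    """Number of internal edges divided by the number of features in the set."""
    n = len(features)
    if n == 0:
        return 0.0
    edges = 0
    for i in range(n):
        for j in range(i + 1, n):
            sim = sum(a * b for a, b in zip(features[i], features[j]))
            if sim > sim_threshold:      # similar pair: one internal edge
                edges += 1
    return edges / n

def should_split(features, sim_threshold, conn_threshold):
    """Keep bisecting only while the set is not yet tightly connected."""
    return average_internal_connectivity(features, sim_threshold) <= conn_threshold

loose = [(1, 0), (1, 0), (0, 1)]   # one edge among three blocks
tight = [(1, 0), (1, 0), (1, 0)]   # three edges among three blocks
```

With a connectivity threshold of 0.5, the loose set is bisected again and the tight set is not.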

S3, instance screening: instances whose semantics are not salient are filtered out according to a saliency condition.

In hierarchical clustering, the feature set corresponding to an image block set does not necessarily correspond to a semantically compact region. It may be a mixture of several instances of different categories, a mixture of an object and the background, or merely blank background; such a feature set is a dummy node. Building instance-level features from dummy nodes is of little value.

The invention judges whether a feature set node S is a dummy node by the strength of its feature energy.

Specifically, given a feature set S and the set of features with high energy, the overlap ratio between the feature set S and the high-energy set is computed (the formula is not reproduced in this text).

A low overlap ratio essentially indicates that the region is dominated by semantically unimportant features, so these dummy nodes in the hierarchy are ignored by setting a threshold. In particular, when the overlap ratio is smaller than the threshold, the feature set S is a dummy node and is discarded; otherwise, the feature set is retained.
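As a hedged illustration of the screening test, the sketch below takes the overlap ratio to be the fraction of a region's blocks that also belong to the high-energy set. That definition is an assumption, since the source formula is not available, and how the high-energy set itself is chosen is left outside the sketch. The function names are hypothetical.

```python
def overlap_ratio(region, high_energy):
    """Fraction of the region's blocks that are also high-energy blocks (assumed definition)."""
    region = set(region)
    if not region:
        return 0.0
    return len(region & set(high_energy)) / len(region)

def is_dummy(region, high_energy, threshold):
    """A region is treated as a dummy node when its overlap ratio is below the threshold."""
    return overlap_ratio(region, high_energy) < threshold

high_energy = {4, 5, 6, 7}
print(is_dummy([0, 1, 2, 4], high_energy, 0.5))  # mostly low-energy blocks
print(is_dummy([4, 5, 6, 9], high_energy, 0.5))  # mostly high-energy blocks
```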

The determination that a node is a dummy node does not prevent the feature set node S from being segmented further. When clustering reaches a finer level of granularity, a feature set node may become a salient area.

For a feature set in an image obtained by the detector, the features are projected, in order, onto the feature map downsampled by the encoder, whose size is given by the length and the width of the feature map; this forms a mask of the same size, with 1 where a feature of the set is present and 0 elsewhere. The mask is then upsampled to the original image size, giving the instance representation mask at the original size.
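The projection and upsampling can be sketched as below. Row-major block indexing and nearest-neighbour upsampling are assumptions made for illustration; the source does not name the interpolation method, and the function names are hypothetical.

```python
def blocks_to_mask(blocks, grid_h, grid_w):
    """Binary mask on the encoder grid: 1 where a block of the set is present."""
    chosen = set(blocks)
    return [[1 if r * grid_w + c in chosen else 0 for c in range(grid_w)]
            for r in range(grid_h)]

def upsample_nearest(mask, out_h, out_w):
    """Nearest-neighbour upsampling of a 2-D mask to out_h x out_w."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

small = blocks_to_mask([0, 3], grid_h=2, grid_w=2)
full = upsample_nearest(small, 4, 4)
```

Here a 2 x 2 grid mask with blocks 0 and 3 selected becomes a 4 x 4 mask in which each selected block covers a 2 x 2 pixel patch.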

The original image is input into a feature extractor based on a convolutional neural network to obtain a feature map encoding the original image.

Instance-level features are extracted using a masked region-of-interest feature extraction method. Specifically, for each localization result represented by a mask, the mask is downsampled to the same size as the feature map. The feature map is multiplied by the downsampled mask, and generalized mean pooling is applied on each channel to obtain the feature representation of each instance, which is stored in the feature library.
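Masked generalized mean (GeM) pooling can be sketched as follows. The exponent p = 3, the small clamp value, the nested-list layout, and the choice to average over all positions of the map are illustrative assumptions; the source only states that the feature map is multiplied by the mask and pooled per channel.

```python
def masked_gem(feature_map, mask, p=3.0, eps=1e-6):
    """Generalized mean pooling of a masked feature map.

    feature_map: [C][H][W] nested lists of non-negative activations
    mask:        [H][W] nested lists of 0/1 at the same resolution
    Returns one pooled value per channel.
    """
    h, w = len(mask), len(mask[0])
    pooled = []
    for channel in feature_map:
        acc = 0.0
        for r in range(h):
            for c in range(w):
                # Multiply by the mask, clamp to eps, raise to the power p.
                acc += max(channel[r][c] * mask[r][c], eps) ** p
        pooled.append((acc / (h * w)) ** (1.0 / p))
    return pooled

fmap = [[[2.0, 2.0], [5.0, 5.0]]]      # one channel, 2 x 2 map
mask = [[1, 1], [0, 0]]                # keep only the top row
descriptor = masked_gem(fmap, mask)
```

Averaging over the whole map rather than over the masked positions only rescales every channel by the same factor, so the two variants coincide once the descriptor is L2-normalized.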

The retriever processes the query picture as follows:

The query picture is preprocessed and its features are extracted to obtain the query features. The cosine similarity between the query features and every instance feature in the feature library is computed, the similarities are ranked, and the top-ranked instances are returned as the query result.
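The ranking step amounts to a cosine-similarity search over the feature library, as in the sketch below. The list-of-pairs library layout and the function names are hypothetical; a real system would typically use vectorized or indexed search rather than this exhaustive loop.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def top_k(query, library, k):
    """library: list of (instance_id, feature). Returns the k best (id, score) pairs."""
    scored = [(iid, cosine(query, feat)) for iid, feat in library]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

library = [("a", (1, 0)), ("b", (0, 1)), ("c", (1, 1))]
best = top_k((1, 0.1), library, k=2)
```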

For the target bounding box that the query must return, the bounding box of the coordinates whose value is 1 in the mask upsampled to the original size in the descriptor step is used as the bounding box of the instance.
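Deriving the box from the mask is a matter of taking the extreme rows and columns that contain a 1, as sketched below. The (x_min, y_min, x_max, y_max) ordering with inclusive pixel indices is an assumed convention, and the function name is hypothetical.

```python
def mask_to_bbox(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the 1-valued pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None                   # empty mask: no box
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))

box = mask_to_bbox([[0, 0, 0],
                    [0, 1, 1],
                    [0, 0, 0]])
```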

The invention encodes the image with a feature encoder trained by an unsupervised method to obtain an encoded image block feature set. After initial points are selected according to prior knowledge, the image is hierarchically decomposed by fast bisecting clustering until the internal connectivity condition is met, so that instance areas at different scales are detected rapidly. Instances whose semantics are not salient are then screened out according to the saliency condition to reduce the number of instances stored in the feature library. Finally, for each instance mask obtained, instance-level features are extracted with the masked region-of-interest feature extraction method and stored in the feature library. For a given query instance feature, the cosine similarity between it and all instance features is computed, from which the result pictures and the instance localization boxes are obtained. In summary, the invention provides a simple and efficient instance retrieval method that discovers instance-level regions at different scales by hierarchical semantic region decomposition of the image, so that the instances that may be retrieved are found comprehensively. The hierarchical decomposition addresses the object occlusion and nesting that commonly occur in real instance retrieval scenes and improves retrieval precision.

In the instance retrieval task, tests were performed on three well-known instance retrieval datasets, Instance-160, Instance-240, Instance-335 and INSTRE, to demonstrate the effectiveness of the invention. The invention is compared with the R-MAC, CAM-weight, BLCF and DASR methods under three retrieval metrics, mAP-50, mAP-100 and mAP-all, and the results in Table 1 show the advantage of the method of the invention. The method of the invention is referred to as CLAID. Before evaluation, the existing normalization and whitening strategies and the strategy for selecting the feature extraction layer were applied to all methods, so as to reflect the requirements of practical scenarios.

TABLE 1

In Table 1, mAP-50 denotes the mean average precision computed over the first 50 results; mAP-100 denotes the mean average precision computed over the first 100 results; mAP-all denotes the mean average precision computed over all retrieval results.

The leftmost column of Table 1 gives the method name, and the top row (Instance-160, Instance-335, INSTRE) gives the three retrieval datasets. R-MAC is the regional maximum activation of convolutions method; CAM-weight is a class activation mapping weighting method; BLCF is the bag of local convolutional features method; BLCF-SalGAN is the bag of local convolutional features method with saliency weighting; DASR is an instance retrieval method based on deeply activated salient regions; DUODIS is region-based, dataset-driven unsupervised object discovery; CLAID is the class independent instance-level descriptor method (the present invention).

Table 1 shows the results of retrieval experiments on three well-known instance retrieval datasets, reporting the mean average precision over the top 50, the top 100 and all retrieval results to show how accurately each method retrieves instances. As Table 1 shows, the invention exceeds the other instance retrieval methods in retrieval accuracy.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains.

Claims

1. A class irrelevant instance retrieval method based on hierarchical semantic region decomposition, in which a detector extracts instances from feature images and stores them in a potential instance library; a descriptor then extracts features from the instances in the potential instance library to obtain instance features, which are stored in a feature library; when a query picture is input, a retriever extracts features from the query picture to obtain query features, matches the query features against the instance features in the feature library, and obtains the top K instance features most similar to the query features as the retrieval result; the method is characterized in that the detector processes a feature image as follows:

S2.2, performing binary clustering on each cluster obtained in the step S2.1;

Specifically, the distance between each feature and its cluster is given by a formula (not reproduced in this text) in terms of the cluster, the number of features in the cluster, the sum of the features in the cluster, and the transpose of the image block feature;

When a feature is moved to another cluster, the distance between the feature and the cluster it is moved to is calculated by a corresponding formula (not reproduced in this text);

S2.3, detecting regional connectivity;

s2.4, judging whether to continue to divide into two parts or not;

For the feature set corresponding to each image block set, whether the feature set is a dummy node is determined by the strength of its feature energy:

2. The class independent instance retrieval method based on hierarchical semantic region decomposition according to claim 1, wherein: in step S2.4, the internal edges of the set S are established as follows: if the similarity between two image blocks in the set S is greater than a threshold, the two are considered similar and an edge is established between them; this edge is one of the internal edges of the set S.

3. The class independent instance retrieval method based on hierarchical semantic region decomposition according to claim 1, wherein: in step S2.4, the average internal connectivity is defined as the ratio of the total number of internal edges to the size of the feature set; its value lies within a bounded range (the bounds are not reproduced in this text).

4. The class independent instance retrieval method based on hierarchical semantic region decomposition according to claim 1, wherein: the instance processing in the potential instance library by the descriptor is specifically as follows:

For a feature set in an image obtained by the detector, the features are projected, in order, onto the feature map downsampled by the encoder, whose size is given by the length and the width of the feature map; this forms a mask of the same size, with 1 where a feature of the set is present and 0 elsewhere; the mask is upsampled to the original image size to give the instance representation mask at the original size; the image is input into a feature extractor based on a convolutional neural network to obtain a feature map encoding the image; for each localization result represented by a mask, the mask is downsampled to the size of this feature map; the feature map is multiplied by the downsampled mask, and generalized mean pooling is applied on each channel to obtain the feature representation of each instance.