CN112966137B - Image retrieval method and system based on global and local feature rearrangement - Google Patents

Image retrieval method and system based on global and local feature rearrangement

Info

Publication number
CN112966137B
Authority
CN
China
Prior art keywords
image
feature
global
local
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110111317.2A
Other languages
Chinese (zh)
Other versions
CN112966137A (en)
Inventor
张招亮
刘后标
廖欢
汪洋旭
唐文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electronics Import And Export Co ltd
Original Assignee
China Electronics Import And Export Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Electronics Import And Export Co ltd
Priority to CN202110111317.2A
Publication of CN112966137A
Application granted
Publication of CN112966137B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 - Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/53 - Querying
    • G06F 16/535 - Filtering based on additional data, e.g. user or group profiles
    • G06F 16/538 - Presentation of query results
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to an image retrieval method and system based on global and local feature rearrangement. The method comprises the following steps: extracting global features and local features of an image to be queried and of the images in an image database; calculating the similarity between the image to be queried and each database image according to the global features, ranking the database images by this similarity, and taking the topN images of the ranking as the query result of the global features; and calculating the degree of matching between the image to be queried and the topN images using the local features, and rearranging the topN images according to the number of matching points to obtain a refined image retrieval ranking. The invention first performs image retrieval and ranking using the similarity of global features, and then rearranges the topN result based on feature matching of local features, further improving retrieval precision; it is particularly suitable for the focused investigation of hit-and-run vehicles or suspect vehicles in traffic accidents.

Description

Image retrieval method and system based on global and local feature rearrangement
Technical Field
The invention belongs to the technical field of information technology and image retrieval, and particularly relates to an image retrieval method and system based on global and local feature rearrangement, which are especially suitable for retrieving images such as vehicle images.
Background
Existing image retrieval methods, for example vehicle image retrieval methods, fall roughly into two types. One type is based on global features: a vehicle image is mapped to a feature vector with a residual network, and image retrieval ranking is completed using the cosine similarity of the feature vectors. The other type is based on traditional local features: vehicle images are mapped to keypoint descriptors, and the number of matching points computed by feature matching determines the retrieval ranking. On this basis, researchers wish to exploit the powerful expressive capability of the residual network to fuse global features and local features together.
Disclosure of Invention
In order to improve the accuracy of image retrieval, the invention provides an image retrieval method and system based on global and local feature rearrangement.
The technical scheme adopted by the invention is as follows:
an image retrieval method based on global and local feature rearrangement comprises the following steps:
extracting global features and local features of an image to be queried and images in an image database;
calculating the similarity between the image to be queried and the images in the image database according to the global features, ranking the database images by the similarity, and taking the topN images of the ranking result as the query result of the global features;
and calculating the degree of matching between the image to be queried and the topN images using the local features, and rearranging the topN images according to the number of matching points to obtain a refined image retrieval ranking result.
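To make the two-stage flow above concrete, the following is a minimal sketch of global-similarity ranking followed by local re-ranking. It assumes the global features are L2-normalized vectors and that a match_fn (for example, the mutual-nearest-neighbour counter sketched later in the detailed description) returns the number of matching points between two sets of keypoint descriptors; all names are illustrative, not part of the patent.

```python
import numpy as np

def two_stage_retrieval(q_global, db_global, q_descs, db_descs, match_fn, top_n=100):
    """Stage 1: rank the database by cosine similarity of global features.
    Stage 2: re-rank the topN candidates by their number of local matching points."""
    sims = db_global @ q_global                     # cosine similarity (features L2-normalized)
    candidates = np.argsort(-sims)[:top_n]          # topN query result of the global features
    counts = [match_fn(q_descs, db_descs[i]) for i in candidates]
    rerank = [int(candidates[j]) for j in np.argsort(-np.asarray(counts))]
    return rerank                                   # refined retrieval ranking
```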
Optionally, the image to be queried is a vehicle image to be queried, and the image database is a vehicle image database.
Further, the extracting of the global features and local features of the image to be queried and of the images in the image database includes:
simultaneously obtaining the global features, local features and local feature scores from the global feature branch and the local feature branch;
reserving the spatial positions whose local feature scores are larger than a threshold, and discarding the positions whose scores are below the threshold as insufficiently salient;
and mapping the reserved spatial positions back to the input image according to the network receptive field, applying non-maximum suppression, and obtaining the keypoint descriptors.
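The keypoint selection just described can be sketched as follows; the stride, receptive-field offset and NMS radius are illustrative assumptions, since the patent does not fix them.

```python
import numpy as np

def select_keypoints(scores, feats, threshold=0.5, stride=16, rf_offset=7.5, nms_radius=8.0):
    """scores: (H, W) local feature scores; feats: (H, W, C) local features.
    Keeps positions whose score exceeds the threshold, maps them to input-image
    coordinates via an assumed receptive-field centre, and applies a greedy NMS."""
    ys, xs = np.where(scores > threshold)
    pts = np.stack([xs * stride + rf_offset, ys * stride + rf_offset], axis=1)
    order = np.argsort(-scores[ys, xs])             # strongest responses first
    keep = []
    for i in order:
        if all(np.linalg.norm(pts[i] - pts[j]) > nms_radius for j in keep):
            keep.append(i)
    return pts[keep], feats[ys[keep], xs[keep]]     # keypoint positions and descriptors
```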
Further, ResNet50-ibn is used as the backbone network, which is divided into five stages, Stage1 to Stage5, from large to small according to the resolution of the output features; the global feature branch is connected to the output of Stage5, and the local feature branch is connected to the outputs of Stage4 and Stage5 together.
Further, the global feature branch comprises a pooling layer, a dimensionality reduction layer and a fully connected layer; the output of the dimensionality reduction layer is the global feature, the global feature is connected to a loss that optimizes mAP (mean average retrieval precision), the global feature is also passed through the fully connected layer into a cross-entropy loss, and the final loss is the sum of the two losses;
the local feature branch comprises a stage fusion module, a position self-attention module, a feature reconstruction branch module and a self-attention pooling module; the stage fusion module comprises a channel self-attention module used to derive per-channel attention; the output of the stage fusion module is connected to two branches: the upper branch is connected to the position self-attention module and outputs the local feature score, while the lower branch is connected to the feature reconstruction branch module and outputs the reconstructed feature; the outputs of the two branches are fed together into the self-attention pooling module, which is followed by a fully connected layer and a cross-entropy loss; a mean-squared-error loss is computed between the reconstructed feature and the output of the stage fusion module, and the final loss is the sum of the two losses; the feature reconstruction branch module consists of an encoder and a decoder, and the output of the encoder is the local feature.
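As an illustration only, the global feature branch described above might be written as the following PyTorch head. The channel sizes, the 512-dimensional output and the use of a 1x1 convolution for dimensionality reduction are assumptions; the mAP-oriented loss (e.g. Smooth-AP) would be applied to the returned global feature outside this module.

```python
import torch.nn as nn

class GlobalHead(nn.Module):
    """Sketch of the global feature branch: pooling, convolutional dimensionality
    reduction, and a fully connected classifier for the cross-entropy loss."""
    def __init__(self, in_channels=2048, feat_dim=512, num_ids=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                   # pooling layer
        self.reduce = nn.Conv2d(in_channels, feat_dim, 1)     # dimensionality reduction layer
        self.fc = nn.Linear(feat_dim, num_ids)                # fully connected layer

    def forward(self, stage5_out):                            # Stage5 output, shape (B, C, H, W)
        g = self.reduce(self.pool(stage5_out)).flatten(1)     # global feature
        return g, self.fc(g)                                  # feature for the mAP loss, logits for CE
```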
Further, local image block retrieval for an arbitrary region of interest is carried out by means of feature matching of the local features, and comprises the following steps:
a user draws an arbitrary region on the image to be queried and wishes to find, in the image database, the set of local image blocks most similar to that region; feature analysis is first performed on the image to be queried and on the image database to obtain the keypoint descriptors of each image, the keypoint descriptors of the query image are then filtered by the region, the degree of matching between the region's image block and each database image is calculated, and all database images are sorted by the number of matching points to obtain the final ranking result.
An image retrieval system based on global and local feature rearrangement that adopts the above method comprises:
a feature extraction module, used for extracting the global features and local features of the image to be queried and of the images in the image database;
a global feature query module, used for calculating the similarity between the image to be queried and the images in the image database according to the global features, ranking the database images by the similarity, and taking the topN images of the ranking result as the query result of the global features;
and a local feature rearrangement module, used for calculating the degree of matching between the image to be queried and the topN images using the local features, and rearranging the topN images according to the number of matching points to obtain a refined image retrieval ranking result.
The invention has the following beneficial effects:
the invention mainly contributes to providing an image retrieval method and a system based on global and local feature rearrangement, the system has the advantages that the global features and the local features of the image are obtained simultaneously by utilizing a residual error network, wherein an improved local feature extraction module has powerful feature expression capability due to the introduction of multi-scale fusion, channel attention and position attention mechanisms; the global features and the local features are combined to carry out secondary rearrangement, and the rearrangement effect is further improved; meanwhile, due to the characteristic that the designed key point feature descriptors are irrelevant to the size of the region, any region of interest in the image can be selected, the key point descriptors in the region of interest and the key point descriptors of the corresponding image of the sequencing result obtained by utilizing the global features are subjected to matching degree calculation, and the reordering information of the key region is obtained in a key point mode; the method is particularly suitable for key investigation of the troubled vehicles in public security escape or traffic accidents.
Drawings
FIG. 1 is a flow chart of the image global feature and keypoint descriptor extraction process of the present invention.
FIG. 2 is a flow chart of global feature training in the present invention.
FIG. 3 is a schematic diagram of a stage fusion module according to the present invention.
FIG. 4 is a flow chart of local feature training in the present invention.
FIG. 5 is a flow chart of the operation of the vehicle retrieval system based on global and local feature rearrangement in the present invention.
Fig. 6 is a flowchart of local-region image block retrieval for an arbitrary region of interest in the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific examples and the accompanying drawings.
The invention discloses an image retrieval method based on global and local feature rearrangement, which is used for retrieving vehicle images and comprises the following steps:
(1) Preparing data: based on a license plate recognition system, the vehicle pictures captured by different cameras are grouped under each license plate number; the vehicle image data set is then divided into a training set, a validation set and a retrieval test set at a ratio of 8:1:1, a portion of the pictures under each license plate number in the retrieval test set is randomly extracted to serve as query images, and the remaining pictures serve as the image database.
(2) Building the environment: a distributed deep-learning vehicle image retrieval environment is constructed based on PyTorch.
(3) Data preprocessing: the vehicle image data set is normalized into the input format required by the network model.
(4) Constructing the model: as shown in Fig. 1, ResNet50-ibn is used as the backbone network, which can be divided into five stages, Stage1 to Stage5, from large to small according to the resolution of the output features. The invention connects the global feature branch to the output of Stage5 and connects the local feature branch to the outputs of Stage4 and Stage5 together; the local branch contains a stage fusion module with two inputs, as shown in Fig. 3. The stage fusion module upsamples the output of Stage5, concatenates it with the output of Stage4 along the channel dimension, and then derives per-channel attention through a channel self-attention module, so that the module combines multi-scale feature expression with the ability to learn correlations between channels.
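A minimal sketch of the stage fusion module follows. The upsample-and-concatenate part follows the description above; the channel self-attention is written here in a squeeze-and-excitation style, which is an assumption since the patent does not specify its exact form. Note that only the gate carries trainable parameters, consistent with the later remark that the module needs little training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageFusion(nn.Module):
    """Upsamples Stage5, concatenates it with Stage4 along channels, and applies
    per-channel attention weights produced by a channel self-attention gate."""
    def __init__(self, c4=1024, c5=2048, reduction=16):
        super().__init__()
        c = c4 + c5
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid())

    def forward(self, stage4_out, stage5_out):
        up = F.interpolate(stage5_out, size=stage4_out.shape[-2:],
                           mode="bilinear", align_corners=False)   # upsample Stage5
        fused = torch.cat([stage4_out, up], dim=1)                 # channel concatenation
        return fused * self.gate(fused)                            # channel attention weighting
```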
(5) Model training: in order not to weaken the strong expression ability of the backbone network, a two-step training scheme is adopted: the first step trains only the global feature branch (the training process is shown in Fig. 2), and the second step fixes the backbone weights and trains only the local feature branch (Fig. 4). After training is completed, the weights of the two branches are frozen into the global and local feature extraction network of Fig. 1.
As shown in fig. 2, the training process of the global feature branch includes the following steps:
a) ResNet50-ibn is used as the backbone network; the output of Stage5 is connected to a pooling layer and a dimensionality reduction layer, and the output of the dimensionality reduction layer is the global feature. The global feature is connected to a loss that optimizes mAP (mean average retrieval precision), such as the Smooth-AP loss, and is also passed through a fully connected layer into a cross-entropy loss; the final loss is the sum of the two losses with a weight ratio of 1:1.
b) A uniform sampling mechanism is adopted during training: assuming the training set contains N identities (IDs), M IDs are sampled at random and P images are sampled for each ID to form a batch; if an ID has fewer than P images, its images are sampled repeatedly (a sketch of this sampler follows after this list).
c) ImageNet pre-trained weights are loaded, the batch size is set to 256, and the base learning rate is set to 6e-4, decaying to 6e-5 after 20 epochs and to 6e-6 after 40 epochs; training starts with a linear warmup strategy whose warmup finishes after 3000 batches.
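The uniform ID sampling of step b) can be sketched as below; the defaults M=32 and P=8 are assumptions that happen to give the batch size of 256 mentioned in step c).

```python
import random
from collections import defaultdict

def sample_batch(labels, num_ids=32, imgs_per_id=8):
    """Picks M identities at random and P images per identity; identities with
    fewer than P images are sampled with replacement, as described above."""
    by_id = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_id[lab].append(idx)
    batch = []
    for lab in random.sample(list(by_id), k=min(num_ids, len(by_id))):
        pool = by_id[lab]
        picks = random.sample(pool, imgs_per_id) if len(pool) >= imgs_per_id \
            else random.choices(pool, k=imgs_per_id)        # repeat sampling when short of P
        batch.extend(picks)
    return batch
```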
As shown in fig. 4, the training process of the local feature branch includes the following steps:
a) Constructing the local feature branch: as shown in Fig. 3, the output of the stage fusion module is connected to two branches. The upper branch is connected to the position self-attention module and outputs the local feature score S(b,0,y,x); the lower branch is connected to the feature reconstruction branch module and outputs the reconstructed feature R(b,c,y,x). The outputs of the two branches are fed together into the self-attention pooling module, which is followed by a fully connected layer and a cross-entropy loss; a mean-squared-error loss is computed between the reconstructed feature R(b,c,y,x) and the output of the stage fusion module, and the final loss is the sum of the two losses with a weight ratio of 1:10.
b) The feature reconstruction branch module of the lower branch consists of an encoder and a decoder, and the output of the encoder is the local feature (an illustrative sketch follows after this list).
c) The backbone network weights are fixed during training, and only the local feature branch is trained.
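A minimal sketch of the feature reconstruction branch of step b) is given below. The 1x1-convolution form and the 128-dimensional bottleneck are assumptions, as the patent only states that the encoder output is the local feature and that a mean-squared-error loss is taken against the stage fusion output; the assignment of the 1:10 weights is likewise an interpretation.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureReconstruction(nn.Module):
    """Encoder-decoder of the feature reconstruction branch: the encoder output is
    the local feature, the decoder output is the reconstructed feature R."""
    def __init__(self, channels=3072, local_dim=128):
        super().__init__()
        self.encoder = nn.Conv2d(channels, local_dim, 1)
        self.decoder = nn.Conv2d(local_dim, channels, 1)

    def forward(self, fused):                     # fused: output of the stage fusion module
        local_feat = self.encoder(fused)          # local feature (source of keypoint descriptors)
        recon = self.decoder(local_feat)          # reconstructed feature R
        return local_feat, recon

# illustrative combined local-branch loss with the 1:10 weighting stated above:
# loss = ce_loss + 10.0 * F.mse_loss(recon, fused)
```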
(6) Extracting image global features and keypoint descriptors: as shown in Fig. 1, the global features, local features and local feature scores are obtained simultaneously from the global feature branch and the local feature branch; the spatial positions whose local feature scores exceed a threshold are reserved, positions below the threshold are discarded as insufficiently salient, and the reserved positions are mapped back to the input image according to the network receptive field and subjected to non-maximum suppression, yielding sparse keypoint descriptors.
(7) Vehicle retrieval: as shown in Fig. 5, the vehicle image retrieval system provided by the invention performs feature analysis on the image to be queried and on the image database with the extraction network of Fig. 1, obtaining the global feature and keypoint descriptors of the query image and of each vehicle image in the database. The cosine similarity between the query image and each database image is computed from the global features, all database images are ranked by cosine distance, and the topN of the ranking is taken as the query result of the global features. For this topN result, the degree of matching between the query image and each topN image is then computed from the keypoint descriptors, and the topN result is re-ranked more finely by the number of matching points, giving the final, more accurate vehicle retrieval ranking.
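The "degree of matching" used for the re-ranking can be computed in several ways; the sketch below counts mutual nearest neighbours on cosine similarity between the two descriptor sets, which is one common choice and is an assumption here, as the patent only requires a matching-point count. Such a function can serve as the match_fn of the two-stage retrieval sketch given earlier.

```python
import numpy as np

def count_matches(desc_a, desc_b, sim_threshold=0.8):
    """Counts matching points between two sets of keypoint descriptors using
    mutual nearest neighbours on cosine similarity (threshold is illustrative)."""
    if len(desc_a) == 0 or len(desc_b) == 0:
        return 0
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T
    nn_ab = sim.argmax(axis=1)                    # best database keypoint for each query keypoint
    nn_ba = sim.argmax(axis=0)                    # best query keypoint for each database keypoint
    return sum(1 for i, j in enumerate(nn_ab)
               if nn_ba[j] == i and sim[i, j] > sim_threshold)
```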
The innovation points of the invention are as follows:
1. In the global feature branch of step (4), on the one hand a convolution-based dimensionality reduction module is used to obtain a lower-dimensional global feature; on the other hand, because a vehicle retrieval system, unlike a general classification task, must emphasize the distinctiveness of individual vehicles, the invention uses a ResNet50 variant in which ibn layers replace the bn layers of the original network, and trains with a loss function that optimizes mAP (mean average retrieval precision), such as the Smooth-AP loss. Compared with the triplet loss function, this loss directly targets the ranking quality of the search results and can effectively improve it. The bn layer is a batch normalization layer; the ibn layer combines a batch normalization layer and an instance normalization layer with an equal split of channels, which improves generalization across different data domains without adding extra computation.
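For reference, the equal channel split between instance normalization and batch normalization described above (the IBN-a arrangement) looks like this; the affine setting is an implementation detail, not fixed by the patent.

```python
import torch
import torch.nn as nn

class IBN(nn.Module):
    """ibn layer: half the channels go through instance normalization and the
    other half through batch normalization, then the halves are concatenated."""
    def __init__(self, channels):
        super().__init__()
        self.half = channels // 2
        self.instance_norm = nn.InstanceNorm2d(self.half, affine=True)
        self.batch_norm = nn.BatchNorm2d(channels - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.instance_norm(a), self.batch_norm(b)], dim=1)
```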
2. The self-attention pooling module of the local feature branch in step (4) performs the following computation:
F(b, c) = Σ_{x, y} S(b, 0, y, x) · R(b, c, y, x)
where S denotes the local feature score, R the reconstructed feature, x and y the spatial position indices, b the index within a batch of pictures, and c the channel index. The module computes a weighted sum F(b, c) over the local features at different positions, with the local feature scores as the weights; each position is mapped back to the input image as the centre coordinate of the network receptive field. The module is followed by a cross-entropy classification loss, which drives the position self-attention module to select local features with strong discriminative power during training.
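In PyTorch terms, the pooling above is a single weighted sum over the spatial positions, for example:

```python
import torch

def self_attention_pooling(scores, feats):
    """F(b, c) = sum over (x, y) of S(b, 0, y, x) * R(b, c, y, x).
    scores: (B, 1, H, W) local feature scores; feats: (B, C, H, W) reconstructed features."""
    return (scores * feats).sum(dim=(2, 3))       # pooled feature of shape (B, C)
```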
3. The stage fusion module contained in the local feature branch of step (4), shown in Fig. 3, combines multi-scale feature expression with the ability to learn correlations between channels, from which the per-channel attention is derived. Only the channel self-attention module, i.e. the per-channel weighting, needs to be trained, so the module has few trainable parameters and a small computation cost.
4. In step (7), the keypoint descriptors used for vehicle retrieval can provide the second-stage ranking either from matches computed over the whole query vehicle, or from matches computed between an arbitrary local region of the query vehicle body and the image data set. The clear advantage is that the position of the local region is unrestricted and precise local feature information can be obtained; this is particularly suitable for focusing on a characteristic region specific to a given vehicle, and it markedly strengthens retrieval based on local key details.
According to the vehicle retrieval method based on global and local feature rearrangement, on the one hand multi-scale feature fusion and the channel-attention and position-attention modules are introduced into the model to obtain stronger local feature expression; on the other hand, relying on feature matching of the local features, the system can be applied to vehicle retrieval within a region of interest. Local image block retrieval for an arbitrary region of interest is shown in Fig. 6: a user draws an arbitrary region on the image to be queried and wishes to find, in the image database, the set of local image blocks most similar to that region. First, feature analysis is performed on the query image and on the image database with the network model of Fig. 1 to obtain the keypoint descriptors of each vehicle image; the keypoint descriptors of the query image are then filtered by the drawn region, the degree of matching between the region's image block and each database image is computed, and all database images are sorted by the number of matching points to give the final ranking (a sketch of this filtering step follows).
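The sketch below illustrates the region filtering step, assuming the keypoint positions returned by the extraction network are in input-image coordinates, that match_fn is the matching-point counter sketched above, and that the box format (x1, y1, x2, y2) is an illustrative choice.

```python
import numpy as np

def roi_match_count(query_pts, query_descs, box, db_descs, match_fn):
    """Keeps only the query keypoints that fall inside the user-drawn region and
    counts their matches against one database image; sorting all database images
    by this count gives the region-of-interest ranking."""
    x1, y1, x2, y2 = box
    inside = (query_pts[:, 0] >= x1) & (query_pts[:, 0] <= x2) & \
             (query_pts[:, 1] >= y1) & (query_pts[:, 1] <= y2)
    return match_fn(query_descs[inside], db_descs)
```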
The experiments use the open-source VERI-Wild public data set and take the mean average retrieval precision (mAP) as the evaluation index, reporting results separately on the small, medium and large test sets; the results are shown in Table 1. In the table: 1) when ibn layers are not used, bn layers are used by default; 2) when the Smooth-AP loss is not used, the triplet loss is used by default.
TABLE 1 results of the experiments
The experimental results confirm the effectiveness of the method: replacing bn with ibn, replacing the triplet loss with the Smooth-AP loss, and adding local feature re-ranking each improve vehicle retrieval accuracy to different degrees. More generally, the local features alone can be used to query a vehicle image by a piece of local key feature information.
The method disclosed by the invention can be applied to vehicle retrieval schemes, and can also be extended to fields such as pedestrian retrieval and map retrieval.
Based on the same inventive concept, another embodiment of the present invention is an image retrieval system based on global and local feature rearrangement using the above method, which includes:
a feature extraction module, used for extracting the global features and local features of the image to be queried and of the images in the image database;
a global feature query module, used for calculating the similarity between the image to be queried and the images in the image database according to the global features, ranking the database images by the similarity, and taking the topN images of the ranking result as the query result of the global features;
and a local feature rearrangement module, used for calculating the degree of matching between the image to be queried and the topN images using the local features, and rearranging the topN images according to the number of matching points to obtain a refined image retrieval ranking result.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, implements the steps of the inventive method.
The foregoing disclosure of specific embodiments and accompanying drawings is intended to help in understanding and practising the invention, and those skilled in the art will understand that various substitutions, modifications and variations are possible without departing from the spirit and scope of the invention. The invention should not be limited to the embodiments and drawings disclosed in the specification; its scope of protection is defined by the claims.

Claims (8)

1. An image retrieval method based on global and local feature rearrangement is characterized by comprising the following steps:
extracting global features and local features of an image to be queried and an image in an image database;
calculating the similarity between the image to be queried and the images in the image database according to the global features, ranking the database images by the similarity, and taking the topN images of the ranking result as the query result of the global features;
calculating the degree of matching between the image to be queried and the topN images using the local features, and rearranging the topN images according to the number of matching points to obtain a refined image retrieval ranking result;
the method for extracting the global features and the local features of the image to be inquired and the image in the image database comprises the following steps:
simultaneously acquiring global features, local features and local feature scores based on the global feature branches and the local feature branches;
reserving the spatial positions whose local feature scores are larger than a threshold, and discarding the positions whose scores are below the threshold as insufficiently salient;
mapping the reserved spatial positions back to the input image according to the network receptive field, applying non-maximum suppression, and obtaining the keypoint descriptors;
the global feature branch comprises a pooling layer, a dimensionality reduction layer and a fully connected layer; the output of the dimensionality reduction layer is the global feature, the global feature is connected to a loss that optimizes mAP, the global feature is also passed through the fully connected layer into a cross-entropy loss, and the final loss is the sum of the two losses;
the local feature branch comprises a stage fusion module, a position self-attention module, a feature reconstruction branch module and a self-attention pooling module; the stage fusion module comprises a channel self-attention module used to derive per-channel attention; the output of the stage fusion module is connected to two branches: the upper branch is connected to the position self-attention module and outputs the local feature score, while the lower branch is connected to the feature reconstruction branch module and outputs the reconstructed feature; the outputs of the two branches are fed together into the self-attention pooling module, which is followed by a fully connected layer and a cross-entropy loss; a mean-squared-error loss is computed between the reconstructed feature and the output of the stage fusion module, and the final loss is the sum of the two losses; the feature reconstruction branch module consists of an encoder and a decoder, and the output of the encoder is the local feature.
2. The method according to claim 1, wherein the image to be queried is a vehicle image to be queried, and the image database is a vehicle image database.
3. The method of claim 1, wherein ResNet50-ibn is used as the backbone network, and the backbone network is divided into five stages, Stage1 to Stage5; the global feature branch is connected to the output of Stage5, and the local feature branch is connected to the outputs of Stage4 and Stage5 together.
4. The method of claim 1, wherein the self-attention pooling module weights the local features at different locations with different weights using the following formula:
F(b, c) = Σ_{x, y} S(b, 0, y, x) · R(b, c, y, x)
wherein S represents a local feature score, R represents a reconstructed feature, x and y represent spatial position indexes, b represents an index of a batch of pictures, and c represents a channel number index.
5. The method according to claim 1, wherein the local region image block retrieval of any region of interest using feature matching of local features comprises the steps of:
a user draws an arbitrary region on the image to be queried and wishes to find, in the image database, the set of local image blocks most similar to that region; feature analysis is first performed on the image to be queried and on the image database to obtain the keypoint descriptors of each image, the keypoint descriptors of the query image are then filtered by the region, the degree of matching between the region's image block and each database image is calculated, and all database images are sorted by the number of matching points to obtain the final ranking result.
6. An image retrieval system based on global and local feature rearrangement using the method of any one of claims 1 to 5, comprising:
a feature extraction module, used for extracting the global features and local features of the image to be queried and of the images in the image database;
a global feature query module, used for calculating the similarity between the image to be queried and the images in the image database according to the global features, ranking the database images by the similarity, and taking the topN images of the ranking result as the query result of the global features;
and a local feature rearrangement module, used for calculating the degree of matching between the image to be queried and the topN images using the local features, and rearranging the topN images according to the number of matching points to obtain a refined image retrieval ranking result.
7. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 5.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 5.
CN202110111317.2A (priority and filing date 2021-01-27): Image retrieval method and system based on global and local feature rearrangement; granted as CN112966137B (en), status Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110111317.2A CN112966137B (en) 2021-01-27 2021-01-27 Image retrieval method and system based on global and local feature rearrangement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110111317.2A CN112966137B (en) 2021-01-27 2021-01-27 Image retrieval method and system based on global and local feature rearrangement

Publications (2)

Publication Number Publication Date
CN112966137A CN112966137A (en) 2021-06-15
CN112966137B true CN112966137B (en) 2022-05-31

Family

ID=76273183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110111317.2A Active CN112966137B (en) 2021-01-27 2021-01-27 Image retrieval method and system based on global and local feature rearrangement

Country Status (1)

Country Link
CN (1) CN112966137B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023272659A1 (en) * 2021-06-30 2023-01-05 东莞市小精灵教育软件有限公司 Method and apparatus for recognizing cover image, storage medium, and recognition device
CN113742504A (en) * 2021-09-13 2021-12-03 城云科技(中国)有限公司 Method, device, computer program product and computer program for searching images by images
CN113807516B (en) * 2021-09-13 2024-05-14 新长城科技有限公司 Training method and image retrieval method of neural network model
CN113868449B (en) * 2021-09-22 2024-05-28 西安理工大学 Image retrieval method based on fusion of multi-scale features and spatial attention mechanism
CN114139000B (en) * 2021-11-29 2024-07-19 北京比特易湃信息技术有限公司 Image retrieval system based on image global and local feature reordering
CN114880505A (en) * 2022-04-27 2022-08-09 北京百度网讯科技有限公司 Image retrieval method, device and computer program product
CN114676279B (en) * 2022-05-25 2022-09-02 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium
CN115512154A (en) * 2022-09-21 2022-12-23 东南大学 Highway vehicle image retrieval method based on deep learning neural network
CN116701695B (en) * 2023-06-01 2024-01-30 中国石油大学(华东) Image retrieval method and system for cascading corner features and twin network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831844A (en) * 2019-04-17 2020-10-27 京东方科技集团股份有限公司 Image retrieval method, image retrieval device, image retrieval apparatus, and medium
CN111460914B (en) * 2020-03-13 2023-06-20 华南理工大学 Pedestrian re-identification method based on global and local fine granularity characteristics
CN111666434B (en) * 2020-05-26 2021-11-02 武汉大学 Streetscape picture retrieval method based on depth global features

Also Published As

Publication number Publication date
CN112966137A (en) 2021-06-15

Similar Documents

Publication Publication Date Title
CN112966137B (en) Image retrieval method and system based on global and local feature rearrangement
CN108197326B (en) Vehicle retrieval method and device, electronic equipment and storage medium
CN102567483B (en) Multi-feature fusion human face image searching method and system
CN109086405B (en) Remote sensing image retrieval method and system based on significance and convolutional neural network
CN111680176A (en) Remote sensing image retrieval method and system based on attention and bidirectional feature fusion
CN111652273B (en) Deep learning-based RGB-D image classification method
CN105243154A (en) Remote sensing image retrieval method and system based on significant point characteristics and spare self-encodings
CN110929080A (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
KR102604217B1 (en) A method and apparatus for image segmentation using residual convolution based deep learning network
CN105654122A (en) Spatial pyramid object identification method based on kernel function matching
CN109919084A (en) A kind of pedestrian's recognition methods again more indexing Hash based on depth
CN110826415A (en) Method and device for re-identifying vehicles in scene image
CN111582178A (en) Vehicle weight recognition method and system based on multi-azimuth information and multi-branch neural network
CN116612335B (en) Few-sample fine-granularity image classification method based on contrast learning
Liu et al. Multi-feature fusion for crime scene investigation image retrieval
Ramesh Babu et al. A novel framework design for semantic based image retrieval as a cyber forensic tool
CN116958687A (en) Unmanned aerial vehicle-oriented small target detection method and device based on improved DETR
CN113011444B (en) Image identification method based on neural network frequency domain attention mechanism
CN112580569A (en) Vehicle weight identification method and device based on multi-dimensional features
CN116758610A (en) Attention mechanism and feature fusion-based light-weight human ear recognition method and system
Trevino-Sanchez et al. Hybrid pooling with wavelets for convolutional neural networks
CN117079226A (en) Vehicle re-identification method based on multi-scale attention mechanism
CN115795394A (en) Biological feature fusion identity recognition method for hierarchical multi-modal and advanced incremental learning
CN113191367B (en) Semantic segmentation method based on dense scale dynamic network
CN114663974A (en) Pedestrian re-identification method integrating position perception attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant