CN112597329B - Real-time image retrieval method based on improved semantic segmentation network - Google Patents
Real-time image retrieval method based on improved semantic segmentation network
- Publication number
- CN112597329B (application CN202011523748.1A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- image
- network
- feature
- pictures
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Library & Information Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a real-time image retrieval method based on an improved semantic segmentation network, which comprises the following steps: S1, collecting retrieval images to form a data set to be trained, expanding the data set, and dividing the expanded data set into a training set, a test set and a verification set; S2, constructing an improved semantic segmentation network that computes in real time; S3, inputting the pictures of the training set into the improved real-time semantic segmentation network to obtain semantic vectors of the pictures, and storing the semantic vectors according to the category information of the pictures to obtain a semantic vector database for image retrieval; S4, sending the pictures to be detected in the test set into the improved real-time semantic segmentation network to obtain semantic feature vectors, and comparing whether the semantic category of each picture to be retrieved is consistent with the category information of the vectors in the image retrieval semantic vector database to obtain a candidate picture group sharing semantic categories with the retrieved picture; and S5, matching the semantic feature vector of the picture to be detected against the semantic feature vectors of the candidate picture group to obtain the pictures whose similarity is closest to the unit value, namely the retrieval result.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a real-time image retrieval method based on an improved semantic segmentation network.
Background
The technique of finding the images a user wants in a database based on basic visual features is called content-based image retrieval (CBIR). Typically, the user's need is given in the form of a query image: the content of the query image is analyzed, suitable retrieval features are extracted, and the pictures in the image database that match those features are retrieved as the result. CBIR is widely used in daily life: it underlies image search in engines such as Baidu and Google, and is embedded in e-commerce platforms such as JD.com and Taobao. Conventional image retrieval methods use manually designed feature extraction operators to obtain low-level visual features of an image, including color, edges, texture, location, and geometry. Common extraction methods include Gabor filters, the SIFT operator, and the SURF operator. However, manually designed feature operators lack generalization ability: an operator with good extraction capability for one category is often weak for others. Image retrieval must process hundreds of categories of images, so low-level visual feature extraction operators have severe limitations (Convolutional-neural-network-based image retrieval: several key technical studies [D], 2020).
In recent years, with the development of deep learning, image processing techniques based on deep learning have achieved better results than traditional image feature extraction methods. Semantic segmentation is a computer vision task in which deep learning techniques have achieved good results. However, semantic segmentation methods currently applied to the image retrieval task still suffer from low retrieval efficiency, low accuracy, and a lack of semantic understanding. There therefore remains considerable room for improvement in scenarios with high real-time requirements, such as loop-closure detection for unmanned vehicles and large-scale multi-class image retrieval (Jiang Siyao. Semantic segmentation algorithm based on a fully convolutional neural network model [D]. Liaoning Technical University, 2020).
Disclosure of Invention
The invention aims to solve the problems of precision and speed in the image retrieval task, and provides a real-time image retrieval method based on an improved semantic segmentation network. The method performs well when applied to retrieval scenarios with high real-time requirements.
The invention is realized by at least one of the following technical schemes.
A real-time image retrieval method based on an improved semantic segmentation network comprises the following steps:
s1, collecting retrieval images to form a data set to be trained, expanding the data set, and dividing the expanded data set into a training set, a testing set and a verification set;
s2, constructing an improved real-time calculation semantic segmentation network;
s3, inputting the pictures of the training set into an improved semantic segmentation network calculated in real time to obtain semantic vectors of the pictures, and storing the semantic vectors according to the category information of the pictures to obtain a semantic vector database for image retrieval;
s4, sending the pictures to be detected in the test set into an improved semantic segmentation network calculated in real time to obtain semantic feature vectors, and comparing whether the semantic categories of the pictures to be retrieved are consistent with the category information of the images of the vectors in the image retrieval semantic vector database to obtain a candidate picture group containing the semantic categories with the retrieved pictures;
and S5, matching the semantic feature vector of the picture to be detected with the semantic feature vector of the candidate picture group to obtain a plurality of pictures closest to the unit value, namely the retrieval result.
Preferably, step S1 selects the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) VOC (Visual Object Classes) 2012 data set and the Berkeley data set and constructs an extended data set.
Preferably, the PASCAL VOC 2012 data set has twenty-one categories, wherein the numbers of images in the training set, verification set and test set are M, H and R, respectively; the original data set is expanded using the Berkeley data set to finally obtain the training, verification and test images.
Preferably, the improved real-time computing semantic segmentation network adopts an image semantic segmentation and real-time retrieval deep learning network framework; the network framework adopts an encoder-decoder structure, the calculation result of the encoder is used for extracting the characteristic information vector of the object, and the decoder is used for extracting the category information of the object.
Preferably, the coding network of the encoder adopts the lightweight network MobileNet (MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications) as the backbone, and atrous (hole) convolutions are added to the convolution layers of the original network to maintain the resolution of the feature map. An Atrous Spatial Pyramid Pooling (ASPP) module is selected at the same time; a global average pooling module and a 3×3 pooling layer are added to the ASPP structure to extract the most salient feature information, and a 3×3 atrous convolution is added to extract the associated information of pixels.
Preferably, the decoding network of the decoder adopts a two-path fusion mode. The decoding network uses a Semantic Feature Fusion Module (SFFM), which combines the high-level image features upsampled from the ASPP structure with the low-level information features of the block3 stage of the backbone network; the two input paths complement each other and further reduce the number of parameters.
Preferably, the object feature information vector is extracted in the following two ways:
the method I comprises the following steps: a fully connected layer is added after the convolution layer of the original encoder network to obtain a fixed-length feature vector, where x_1 to x_n are the values at the corresponding positions of the feature map, w_11 … w_nn are the fully connected layer parameters, and a_1 to a_n are the feature vector derived from the feature map:

a_k = Σ_{j=1}^{n} w_kj · x_j, k = 1, …, n (1)
the second method comprises the following steps: a feature vector is constructed by adding a global average pooling module. A global pooling function is introduced to fuse the high-dimensional features at every pixel of the feature map using an empirically weighted average; with all weights equal, this reduces to the mean. The extracted feature X belongs to the following feature space, where h and w are the height and width of the feature map and c is the number of feature channels:

X ∈ R^(h×w×c) (2)

The feature vectors at all spatial positions are summed along the feature dimension and normalized by the total number of spatial positions:

X̄ = (1 / (h·w)) Σ_i ω_i X_i (3)

where X_i is the feature vector at position i and ω_i is the empirical weight corresponding to X_i; setting all ω_i = 1 yields the normalized image feature vector containing the object.
Preferably, all pictures in the training set are processed in the steps to construct an image retrieval database, and the pictures to be detected in the test set are sent to a semantic segmentation network to obtain semantic feature vectors of the pictures through semantic segmentation.
Preferably, the matching in step S5 uses cosine similarity:

cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖) (4)

wherein X represents the semantic feature vector of the picture to be detected, and Y represents a semantic feature vector of the candidate picture group.
The system of the real-time image retrieval method based on the improved semantic segmentation network comprises an image acquisition system, a deep neural network system and an image processing system;
the image acquisition system is used for collecting and searching images to form a data set needing training and expanding the data set;
the deep neural network system comprises an improved semantic segmentation network calculated in real time, and is used for extracting characteristic information vectors of objects and extracting class information of the objects;
the image processing system is used for combining the object characteristic information vector and the object category information to obtain a semantic retrieval vector for image retrieval and perform similarity calculation and matching of images.
Compared with the prior art, the invention has the beneficial effects that: aiming at the defect that the existing general semantic segmentation network is low in computing speed, the improved semantic segmentation network capable of computing in real time is provided to improve the efficiency of the image retrieval process. Meanwhile, semantic category information is added in the process of retrieving the images, so that the information is richer in retrieving the images, and the accuracy of the retrieving process under complex conditions is ensured.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a diagram of a real-time semantic segmentation network architecture of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 1 and fig. 2, a real-time image retrieval method based on an improved semantic segmentation network specifically includes the following steps:
(1) A data set is acquired. The PASCAL VOC 2012 data set and the Berkeley data set are downloaded; because the data categories of the two data sets are the same, they are combined to construct an extended data set for the image retrieval task. PASCAL VOC 2012 has twenty-one classes; the numbers of images in its training, verification and test sets are 1464, 1449 and 1456 respectively. Expansion using the Berkeley data set finally yields 10582 training images, 1229 verification images and 1456 test images. The expanded data set is grouped into training, test and verification sets in the proportion 6:2:2;
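The shuffle-and-split step above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset` and the fixed seed are assumptions for reproducibility, not part of the patent.

```python
import random

def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle and split items into training/test/verification sets by ratio."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    total = sum(ratios)
    n = len(items)
    n_train = n * ratios[0] // total
    n_test = n * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]          # remainder goes to verification
    return train, test, val

train, test, val = split_dataset(range(100))
```

With 100 items and the 6:2:2 proportion of the patent, this yields 60/20/20 groups.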
(2) An improved real-time semantic segmentation network is constructed. The conventional semantic segmentation network DeepLab v3 is modified to make it suitable for real-time computation. The improved encoder network uses MobileNet v2 (MobileNetV2: Inverted Residuals and Linear Bottlenecks) as its backbone, a feature extraction module is added to the encoding layer to extract features, a semantic feature fusion module is added to the decoding layer to perform category calculation, and the images in the data set are trained through the improved network;
the improved real-time computing semantic segmentation network adopts an image semantic segmentation and real-time retrieval deep learning network framework. The network framework adopts an encoder-decoder structure, the calculation result of the encoder is used for extracting the characteristic information vector of the object, and the decoder is used for extracting the class information of the object.
The coding network of the encoder adopts the lightweight network MobileNet v2 as the backbone. Atrous (dilated) convolutions are added to the convolution layers of the original network to maintain the resolution of the feature map. An Atrous Spatial Pyramid Pooling (ASPP) module is selected at the same time: the ASPP structure obtains the context information of the image by aggregating receptive-field information at different scales and extracting high-level semantic information. A global average pooling module and a 3×3 pooling layer are added to the ASPP structure to extract the most salient feature information, and a 3×3 atrous convolution is added to extract the associated information of pixels. The encoding result is used to extract the object feature information.
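The parallel-branch structure of ASPP — several atrous convolutions at different rates plus a global-average-pooling branch — can be illustrated with a toy single-channel NumPy version. This is a sketch of the mechanism only: the averaging kernel, the rate set, and broadcasting the pooled value back to full resolution are illustrative assumptions, not the patent's trained weights.

```python
import numpy as np

def dilated_conv3x3(x, kernel, rate):
    """3x3 atrous convolution with dilation `rate`, zero padding (single channel)."""
    h, w = x.shape
    xp = np.pad(x, rate)                       # pad by `rate` so output size matches input
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            for ki in range(3):
                for kj in range(3):
                    out[i, j] += kernel[ki, kj] * xp[i + ki * rate, j + kj * rate]
    return out

def aspp(x, rates=(1, 2, 3)):
    """Toy ASPP head: parallel atrous branches plus a global-average-pooling
    branch, stacked along a new channel axis."""
    kernel = np.full((3, 3), 1.0 / 9.0)        # placeholder averaging weights
    branches = [dilated_conv3x3(x, kernel, r) for r in rates]
    gap = np.full_like(x, x.mean())            # global average pooling, broadcast back
    branches.append(gap)
    return np.stack(branches, axis=0)          # shape (len(rates)+1, h, w)
```

Larger rates sample the same 3×3 taps from a wider neighborhood, which is how ASPP aggregates receptive-field information at different scales without reducing resolution.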
The decoding network of the decoder adopts a two-path fusion mode. It uses a Semantic Feature Fusion Module (SFFM), which combines the high-level image features upsampled from the ASPP structure with the low-level information features of the block3 stage of the backbone network; the two input paths complement each other and further reduce the number of parameters. The decoding result is used for image category extraction.
After a feature map of a convolutional layer with optimal expression features is extracted through an improved encoder of a semantic segmentation network calculated in real time, feature vectors of the image are generated through methods such as global average pooling. The encoder processing result can obtain the category information contained in the image through a decoding network. And adding the class information of the object in the image to the tail part of the normalized feature vector to construct an image semantic feature vector.
The feature vector of the image is obtained by encoding the convolution layer of the network in the following two ways:
the method I comprises the following steps: after the convolution layer of the original encoder network, a full connection layer is added to obtain a feature vector with a fixed length. The formula is shown below, where x 1 To x n Taking the value of the corresponding position of the feature map, w 11 …w nn For full link layer parameters, a 1 To a n For the feature vectors derived from the feature map:
the second method comprises the following steps: a feature vector is constructed for the add global average pooling module construction. And introducing a global pooling function to fuse the high-dimensional features at each pixel point in the feature map, using an empirical weighted average method, and obtaining an average value by keeping the weights the same. The extracted feature X belongs to a feature space where h and w are the height and width of the feature map, respectively, and c is the number of feature channels.
X∈R h×w×c (2)
The feature vectors of each space position are directly added according to feature dimensions and then normalized by the total number of the space positions to obtain the following expression
Wherein X i Representing the feature vector at position i. Omega i The representation corresponds to X i The empirical weights of (1). And setting the experience weights to be 1 to obtain the normalized image feature vector containing the object.
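The two feature-vector constructions above can be sketched in a few lines of NumPy. The feature-map size, output length `n_out`, and random weights are illustrative assumptions standing in for a trained encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))          # h x w x c feature map (stand-in)

# Method I: flatten the map and apply a fully connected layer (Eq. 1),
# a_k = sum_j w_kj * x_j, giving a fixed-length feature vector.
x = feat.reshape(-1)                            # n = h*w*c values
n_out = 32                                      # illustrative output length
W = rng.standard_normal((n_out, x.size)) * 0.01
a = W @ x                                       # feature vector from the map

# Method II: global average pooling (Eqs. 2-3) -- with all empirical
# weights equal to 1, the weighted mean over the h*w positions is just
# the per-channel average.
v = feat.reshape(-1, feat.shape[-1]).mean(axis=0)   # length-c vector
```

Method I yields a vector whose length is chosen by the layer; Method II yields one value per channel, independent of the spatial size.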
And finally obtaining a segmentation result containing the object semantic category through a decoding network.
And combining the feature vector obtained by the coding layer of the improved real-time semantic segmentation network with the category information vector contained in the image obtained by the decoding network. And adding the category information of the object in the image to the tail part of the normalized feature vector to construct a picture semantic feature vector.
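The combination step — normalize the feature vector, then append the image's category information at its tail — can be sketched as follows. The multi-hot layout of the appended category block is a hypothetical encoding; the patent only states that class information is added after the normalized feature vector.

```python
import numpy as np

def semantic_vector(feature_vec, class_ids, num_classes=21):
    """Build a picture semantic feature vector: L2-normalize the feature
    vector, then append a multi-hot vector of the classes present
    (hypothetical layout; 21 classes as in PASCAL VOC 2012)."""
    f = np.asarray(feature_vec, dtype=float)
    f = f / (np.linalg.norm(f) + 1e-12)         # normalization
    onehot = np.zeros(num_classes)
    onehot[list(class_ids)] = 1.0               # classes found by the decoder
    return np.concatenate([f, onehot])

vec = semantic_vector([3.0, 4.0], {1, 5})
```

The resulting vector carries both appearance (the normalized head) and semantics (the category tail), which is what allows the coarse category filtering in step S4 before fine matching in step S5.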
S3, inputting the pictures of the training set into an improved semantic segmentation network calculated in real time to obtain semantic vectors of the pictures, and storing the semantic vectors according to the category information of the pictures to obtain an image retrieval semantic vector database;
s4, sending the pictures to be detected in the test set into an improved semantic segmentation network calculated in real time to obtain semantic feature vectors, and comparing whether the semantic categories of the pictures to be searched are consistent with the category information of the images of the vectors in the image search semantic vector database to obtain a candidate picture group containing the semantic categories of the pictures to be searched;
and S5, calculating the cosine of the semantic feature vector of the picture to be detected and the semantic feature vector of the candidate picture group to obtain a plurality of pictures which are closest to the unit value, namely the retrieval result.
The matching method uses cosine similarity to calculate matching values:

cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖) (4)

wherein X represents the semantic feature vector of the picture to be detected, and Y represents a semantic feature vector of the candidate picture group. The K vectors whose similarity is closest to the unit value are extracted as the result of the image retrieval; the value of K can be set by the searcher as desired.
The embodiment provides a real-time image retrieval method based on an improved semantic segmentation network, which comprises the steps of improving the real-time performance through the improved semantic segmentation network, expanding a PASCAL VOC data set by using a Berkeley data set to perform semantic segmentation training, extracting category vectors after semantic segmentation results are obtained, performing full-connection or global average pooling according to a feature map obtained by an encoding layer to obtain feature vectors, and finally combining the feature vectors to obtain the semantic vectors of a picture to be retrieved. During retrieval, firstly, rough matching is carried out to obtain groups of pictures with consistent categories, and then K pictures meeting requirements are obtained through cosine calculation and serve as results. According to the technical scheme, the speed of semantic segmentation is effectively improved on the premise of not losing precision, and a better picture retrieval effect is obtained. The method has good adaptability under the conditions of large-scale image retrieval and more picture categories.
The invention also provides an image retrieval system based on deep learning and semantic segmentation, which comprises the following steps: the system comprises an image acquisition system, a deep neural network system and an image processing system; the image acquisition system is used for collecting and searching images to form a data set needing training and expanding the data set; the deep neural network system comprises an improved semantic segmentation network calculated in real time, and is used for extracting characteristic information vectors of objects and extracting class information of the objects; the image processing system is used for combining the object characteristic information vector and the object category information to obtain a semantic retrieval vector for image retrieval and perform similarity calculation and matching of the images.
The image retrieval system based on deep learning and semantic segmentation stores a computer program. After program initialization, the network parameters are trained with the semantic segmentation data set; the model parameters are greatly reduced owing to the improved semantic segmentation network. When a retrieval image is input, the system starts two threads: the first thread takes the image data and performs the semantic segmentation calculation, then passes the classification result to the second thread. The second thread retrieves from the database the entries with the same categories as the input image, yielding a category grouping. Vector similarity between the original image and the images in the category grouping is then calculated, and the K images ranked by similarity are returned as the retrieval result.
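The two-thread hand-off described above can be sketched with the standard-library `threading` and `queue` modules. The segmentation step is replaced by a stand-in category assignment (`img % 3`), since the real network is out of scope here; the queue sentinel pattern and function names are assumptions.

```python
import queue
import threading

def segmentation_worker(images, out_q):
    """Thread 1: run (stand-in) semantic segmentation and pass each
    classification result to the retrieval thread."""
    for img in images:
        cats = {img % 3}          # stand-in for the predicted categories
        out_q.put((img, cats))
    out_q.put(None)               # sentinel: no more work

def retrieval_worker(in_q, db, results):
    """Thread 2: retrieve same-category entries from the database."""
    while True:
        item = in_q.get()
        if item is None:
            break
        img, cats = item
        results[img] = [name for name, c in db if c & cats]

db = [("cat.jpg", {0}), ("dog.jpg", {1}), ("car.jpg", {2})]
q = queue.Queue()
results = {}
t1 = threading.Thread(target=segmentation_worker, args=(range(4), q))
t2 = threading.Thread(target=retrieval_worker, args=(q, db, results))
t1.start(); t2.start()
t1.join(); t2.join()
```

The queue decouples the two stages, so segmentation of the next image can overlap with database lookup for the previous one — the source of the real-time gain the system claims.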
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and amendments can be made without departing from the principle of the present invention, and these modifications and amendments should also be considered as the protection scope of the present invention.
Claims (6)
1. A real-time image retrieval method based on an improved semantic segmentation network is characterized by comprising the following steps:
s1, collecting retrieval images to form a data set to be trained, expanding the data set, and dividing the expanded data set into a training set, a testing set and a verification set;
s2, constructing an improved real-time computing semantic segmentation network; extracting a feature map of a convolutional layer with optimal expression features through an improved encoder of a semantic segmentation network calculated in real time, and generating a feature vector of the image through a global average pooling method; performing full connection or global average pooling on the processing result of the encoder to obtain category information contained in the image; merging the feature vector with a category information vector contained in the image obtained by the encoder, and adding the category information of the object in the image to the tail part of the normalized feature vector to construct an image semantic feature vector;
the improved real-time computing semantic segmentation network adopts an image semantic segmentation and real-time retrieval deep learning network framework; the network framework adopts an encoder-decoder structure, the calculation result of the encoder is used for extracting the characteristic information vector of the object, and the decoder is used for extracting the class information of the object;
the coding network of the encoder adopts the lightweight network MobileNet as the backbone network, and atrous (hole) convolutions are added to the convolution layers of the original network to maintain the resolution of the feature map; an Atrous Spatial Pyramid Pooling module (ASPP) is selected at the same time, a global average pooling module and a 3×3 pooling layer are added to the ASPP network structure to extract the most salient feature information, and a 3×3 atrous convolution is added to extract the associated information of pixels;
the decoding network of the decoder adopts a two-way fusion mode, the decoding network uses a semantic feature fusion module SFFM, the SFFM combines the high-level image features sampled by an ASPP network structure and the bottom-level information features of block3 grids in a backbone network, and the two paths of input information are used for supplementing each other to further reduce the parameter number; the object feature information vector is extracted in the following two ways:
the first method is as follows: a fully connected layer is added after the convolution layer of the original encoder network to obtain a fixed-length feature vector, where x_1 to x_n are the values at the corresponding positions of the feature map, w_11 … w_nn are the fully connected layer parameters, and a_1 to a_n are the feature vector derived from the feature map:

a_k = Σ_{j=1}^{n} w_kj · x_j, k = 1, …, n (1)
the second method comprises the following steps: a feature vector is constructed by adding a global average pooling module. A global pooling function is introduced to fuse the high-dimensional features at every pixel of the feature map using an empirically weighted average; with all weights equal, an average value is obtained. The extracted feature X belongs to the following feature space, where h and w are the height and width of the feature map, respectively, and c is the number of feature channels:

X ∈ R^(h×w×c) (2)

the feature vectors at all spatial positions are summed along the feature dimension and then normalized by the total number of spatial positions:

X̄ = (1 / (h·w)) Σ_i ω_i X_i (3)

where X_i is the feature vector at position i and ω_i is the empirical weight corresponding to X_i; setting all ω_i = 1 yields the normalized image feature vector containing the object;
s3, inputting the pictures of the training set into an improved semantic segmentation network calculated in real time to obtain semantic vectors of the pictures, and storing the semantic vectors according to the category information of the pictures to obtain a semantic vector database for image retrieval;
s4, sending the pictures to be detected in the test set into an improved semantic segmentation network calculated in real time to obtain semantic feature vectors, and comparing whether the semantic categories of the pictures to be searched are consistent with the category information of the images of the vectors in the image search semantic vector database to obtain a candidate picture group containing the semantic categories of the pictures to be searched;
and S5, matching the semantic feature vector of the picture to be detected with the semantic feature vectors of the candidate picture group to obtain the pictures whose similarity is closest to the unit value, namely the retrieval result.
2. The method for real-time image retrieval based on improved semantic segmentation network as claimed in claim 1, wherein step S1 selects the PASCAL VOC 2012 data set and the Berkeley data set and constructs the augmented data set.
3. The method as claimed in claim 2, wherein the PASCAL VOC 2012 data set has twenty-one categories, wherein the numbers of images in the training set, the verification set, and the test set are M, H, and R, respectively, and the Berkeley data set is used to expand the original data set to obtain the training, verification, and test images.
4. The real-time image retrieval method based on the improved semantic segmentation network as claimed in claim 3, wherein the processing of step S2 is performed on all the pictures in the training set to construct an image retrieval database, and the pictures to be detected in the test set are sent to the semantic segmentation network to obtain semantic feature vectors of the pictures through semantic segmentation.
5. The method for retrieving an image in real time based on an improved semantic segmentation network according to claim 4, wherein the matching in step S5 uses cosine similarity:

cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖) (4)

wherein X represents the semantic feature vector of the picture to be detected, and Y represents a semantic feature vector of the candidate picture group.
6. The system of the real-time image retrieval method based on the improved semantic segmentation network is characterized by comprising an image acquisition system, a deep neural network system and an image processing system;
the image acquisition system is used for collecting retrieval images as the image information input for image retrieval;
the deep neural network system comprises an improved semantic segmentation network that runs in real time and is used for extracting object feature information vectors and object class information;
the image processing system is used for combining the object feature information vector with the object class information to obtain a semantic retrieval vector for image retrieval, and for performing image similarity calculation and matching.
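The combination and matching steps in claims 5 and 6 can be sketched as follows. The patent's exact matching formula is given only as a figure and is not reproduced here, so this minimal sketch assumes simple concatenation for building the semantic retrieval vector and cosine similarity for matching; the function names (`semantic_retrieval_vector`, `match_candidates`) and all parameters are illustrative, not taken from the patent.

```python
import numpy as np

def semantic_retrieval_vector(feature_vec, class_info):
    """Combine an object feature information vector with object class
    information into one semantic retrieval vector (assumed: concatenation)."""
    return np.concatenate([feature_vec, class_info])

def match_candidates(query_vec, candidate_vecs, top_k=3):
    """Rank the candidate picture group by similarity to the query vector
    (assumed: cosine similarity) and return the indices of the top_k
    closest pictures, i.e. the retrieval result."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity of each candidate
    return np.argsort(-sims)[:top_k]  # indices sorted by decreasing similarity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    database = rng.normal(size=(10, 8))               # 10 candidate vectors
    query = database[4] + 0.01 * rng.normal(size=8)   # near-duplicate of entry 4
    print(match_candidates(query, database, top_k=3))  # entry 4 should rank first
```

Cosine similarity is a common default for comparing feature vectors of different magnitudes; if the patented formula uses Euclidean distance or a weighted measure instead, only `match_candidates` would change.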
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011523748.1A CN112597329B (en) | 2020-12-21 | 2020-12-21 | Real-time image retrieval method based on improved semantic segmentation network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112597329A CN112597329A (en) | 2021-04-02 |
CN112597329B true CN112597329B (en) | 2022-12-16 |
Family
ID=75199836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011523748.1A Active CN112597329B (en) | 2020-12-21 | 2020-12-21 | Real-time image retrieval method based on improved semantic segmentation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112597329B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114494696B (en) * | 2022-01-26 | 2024-07-16 | 安徽理工大学 | Method, system and device for rapidly detecting multi-scale coal gangue images |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909642A (en) * | 2019-11-13 | 2020-03-24 | 南京理工大学 | Remote sensing image target detection method based on multi-scale semantic feature fusion |
CN111209808A (en) * | 2019-12-25 | 2020-05-29 | 北京航空航天大学杭州创新研究院 | Unmanned aerial vehicle image semantic segmentation and identification method based on hierarchical processing |
CN111489357A (en) * | 2019-01-29 | 2020-08-04 | 广州市百果园信息技术有限公司 | Image segmentation method, device, equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10635927B2 (en) * | 2017-03-06 | 2020-04-28 | Honda Motor Co., Ltd. | Systems for performing semantic segmentation and methods thereof |
Non-Patent Citations (3)
Title |
---|
A Semantic Segmentation Algorithm Based on Improved Attention Mechanism; Chen CY et al.; International Symposium on Autonomous Systems (ISAS); 2020-12-06; 244-248 *
Dual-path semantic segmentation combined with an attention mechanism; Zhai Pengbo et al.; Journal of Image and Graphics; 2020-08-31; Vol. 25, No. 8; 1627-1636 *
Semantic segmentation and its application in image retrieval; Su Wen; China Excellent Master's and Doctoral Dissertations Full-text Database (Doctoral), Information Science and Technology Series (Monthly); 2019-06-15; No. 6; I138-46 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110866140B (en) | Image feature extraction model training method, image searching method and computer equipment | |
CN101587478B (en) | Methods and devices for training, automatically labeling and searching images | |
CN109658445A (en) | Network training method, increment build drawing method, localization method, device and equipment | |
CN115171165A (en) | Pedestrian re-identification method and device with global features and step-type local features fused | |
CN108875076B (en) | Rapid trademark image retrieval method based on Attention mechanism and convolutional neural network | |
CN104794219A (en) | Scene retrieval method based on geographical position information | |
CN110929080B (en) | Optical remote sensing image retrieval method based on attention and generation countermeasure network | |
CN112347284B (en) | Combined trademark image retrieval method | |
CN111339343A (en) | Image retrieval method, device, storage medium and equipment | |
Qian et al. | Image location inference by multisaliency enhancement | |
Pang et al. | Deep feature aggregation and image re-ranking with heat diffusion for image retrieval | |
CN107291825A (en) | With the search method and system of money commodity in a kind of video | |
CN108984642A (en) | A kind of PRINTED FABRIC image search method based on Hash coding | |
CN114694185B (en) | Cross-modal target re-identification method, device, equipment and medium | |
CN113066089B (en) | Real-time image semantic segmentation method based on attention guide mechanism | |
CN109919084B (en) | Pedestrian re-identification method based on depth multi-index hash | |
CN106355210B (en) | Insulator Infrared Image feature representation method based on depth neuron response modes | |
CN113032613A (en) | Three-dimensional model retrieval method based on interactive attention convolution neural network | |
CN115545166A (en) | Improved ConvNeXt convolutional neural network and remote sensing image classification method thereof | |
CN113988147A (en) | Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device | |
CN116091946A (en) | Yolov 5-based unmanned aerial vehicle aerial image target detection method | |
CN112597329B (en) | Real-time image retrieval method based on improved semantic segmentation network | |
CN110110120B (en) | Image retrieval method and device based on deep learning | |
CN117994623A (en) | Image feature vector acquisition method | |
CN114168773A (en) | Semi-supervised sketch image retrieval method based on pseudo label and reordering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||