CN114329006A - Image retrieval method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN114329006A
CN114329006A (application CN202111124020.6A)
Authority
CN
China
Prior art keywords
quantization
vector
model
image
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111124020.6A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111124020.6A priority Critical patent/CN114329006A/en
Publication of CN114329006A publication Critical patent/CN114329006A/en
Pending legal-status Critical Current

Landscapes

  • Editing Of Facsimile Originals (AREA)

Abstract

The embodiments of the present application provide an image retrieval method, apparatus, device, and computer-readable storage medium, relating to the field of artificial intelligence. The method includes: acquiring an image to be queried; determining a first feature embedding vector and a first quantization vector of the image to be queried, where the first quantization vector represents the quantization feature corresponding to the first feature embedding vector; determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library; and determining, according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector, at least one of those images as a similar image corresponding to the image to be queried. The method improves the accuracy and recall rate of image retrieval.

Description

Image retrieval method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image retrieval method, an image retrieval device, an image retrieval apparatus, and a computer-readable storage medium.
Background
In the prior art, product quantization (PQ) retrieval is adopted for large-scale image retrieval. However, PQ performs subspace partitioning and per-subspace quantization directly on the raw features, so similar samples whose features are not sufficiently close are easily assigned different quantization codes, which lowers the accuracy of image retrieval.
Disclosure of Invention
The present application provides an image retrieval method, apparatus, device, computer-readable storage medium, and computer program product, which address the problem of how to improve the accuracy of image retrieval.
In a first aspect, the present application provides an image retrieval method, including:
acquiring an image to be queried;
determining a first feature embedding vector and a first quantization vector of an image to be queried, wherein the first quantization vector is used for representing quantization features corresponding to the first feature embedding vector;
determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library, wherein the index vector is used for representing quantization characteristics corresponding to characteristic embedding vectors of image samples in the preset image library;
and determining, according to the first feature embedding vector and the feature embedding vectors of a plurality of images in the preset image library corresponding to the at least one index vector, at least one image from the plurality of images as a similar image corresponding to the image to be queried.
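The four steps of the first aspect can be sketched end to end. The sketch below is hypothetical: every function name, threshold value, and toy vector is an illustrative assumption, and Hamming distance between quantization vectors and Euclidean distance between embeddings are assumed metrics, since the claim does not fix them.

```python
# Hypothetical sketch of the claimed retrieval flow: quantization-vector
# matching narrows candidates, then embedding distance ranks them.
import math

def hamming(a, b):
    # number of differing bits between two binary quantization vectors
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_embed, query_quant, index_list, image_db,
             quant_threshold=1, embed_threshold=1.0, top_k=2):
    # Step 1: keep index vectors whose distance to the query quantization
    # vector is below the (preset first) distance threshold
    hit_indexes = [iv for iv in index_list
                   if hamming(query_quant, iv) < quant_threshold]
    # Step 2: gather embeddings of images filed under the hit indexes
    candidates = []
    for iv in hit_indexes:
        candidates.extend(image_db.get(tuple(iv), []))
    # Step 3: rank candidates by embedding distance below the (preset
    # second) threshold and keep the closest top_k
    scored = [(euclidean(query_embed, emb), name)
              for name, emb in candidates
              if euclidean(query_embed, emb) < embed_threshold]
    scored.sort()
    return [name for _, name in scored[:top_k]]

# Toy database: each index vector maps to (image name, embedding) pairs
db = {(0, 1, 1): [("dog_1", [0.9, 0.1]), ("dog_2", [0.7, 0.3])],
      (1, 0, 0): [("cat_1", [-0.9, 0.4])]}
result = retrieve([0.85, 0.15], [0, 1, 1], [[0, 1, 1], [1, 0, 0]], db)
```

The two thresholds play the roles of the preset first and second distance thresholds introduced in the later embodiments.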
In one embodiment, determining a first feature embedding vector and a first quantization vector for an image to be queried comprises:
inputting a triplet sample corresponding to the image to be queried into the basic feature and feature embedding model of a first neural network model, and obtaining a first feature embedding vector of the image to be queried through feature embedding processing, wherein the basic feature and feature embedding model comprises a feature map extraction model and a feature embedding model;
and inputting the first characteristic embedding vector into a quantization mapping model of the first neural network model, and obtaining a first quantization vector through mapping quantization processing.
In one embodiment, determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library includes:
and if the distance between the first quantization vector and at least one index vector in the index vector list of the preset image library is smaller than a preset first distance threshold, determining at least one index vector in the index vector list as at least one index vector corresponding to the first quantization vector.
In one embodiment, determining, according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector, at least one image from the plurality of images as a similar image corresponding to the image to be queried comprises:
if the first distances between the first feature embedding vector and the feature embedding vectors of the plurality of images are smaller than a preset second distance threshold, sorting the first distances in ascending order, and taking the images corresponding to the smallest first distances as similar images corresponding to the image to be queried.
In one embodiment, before acquiring the image to be queried, the method further comprises:
obtaining a plurality of triplet samples corresponding to an image sample set, wherein each triplet sample comprises an image sample, a positive sample corresponding to the image sample, and a negative sample corresponding to the image sample;
inputting the plurality of triplet samples into the basic feature and feature embedding model of a second neural network model to obtain second feature embedding vectors corresponding to the plurality of triplet samples;
inputting the second feature embedding vectors corresponding to the plurality of triplet samples into the quantization mapping model of the second neural network model to obtain second quantization vectors corresponding to the plurality of triplet samples;
determining a first loss function value of the quantization mapping model of the second neural network model according to the second quantization vectors corresponding to the plurality of triplet samples;
updating parameters of the quantization mapping model of the second neural network model based on the first loss function value;
and if the first loss function value is smaller than or equal to a preset first loss function value threshold, ending the training of the second neural network model and taking the trained second neural network model as the first neural network model.
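The training procedure above can be sketched with toy stand-ins for the neural network models. Everything below is an assumption for illustration: the linear embedding, the clamp-based quantization mapping, the squared distance-to-binary loss, and the finite-difference update are not the patent's models or losses; only the loop structure (compute loss, update parameters, stop at a threshold) comes from the text.

```python
# Hypothetical sketch of the training loop: triplet samples are embedded,
# mapped to quantization vectors, and the quantization mapping parameter
# is updated until the loss falls below a preset threshold.

def embed(x, w_embed=1.0):
    # stand-in for the basic feature and feature embedding model
    return [w_embed * v for v in x]

def quantize(e, w_map):
    # stand-in for the quantization mapping model: scale, clamp to [0, 1]
    return [max(0.0, min(1.0, w_map * v)) for v in e]

def quant_loss(q):
    # toy loss: squared distance of each value from the nearest of {0, 1}
    return sum(min(v, 1 - v) ** 2 for v in q)

def train(triplets, w_map=0.5, lr=0.5, threshold=1e-3, max_steps=100):
    for step in range(max_steps):
        loss = sum(quant_loss(quantize(embed(s), w_map))
                   for t in triplets for s in t)
        if loss <= threshold:          # stopping rule from the embodiment
            return w_map, loss, step
        # crude finite-difference gradient step on the mapping parameter
        eps = 1e-4
        loss_p = sum(quant_loss(quantize(embed(s), w_map + eps))
                     for t in triplets for s in t)
        w_map -= lr * (loss_p - loss) / eps
    return w_map, loss, max_steps

# one triplet: (anchor, positive, negative)
triplets = [([0.9, 0.8], [0.85, 0.75], [0.1, 0.2])]
w, final_loss, steps = train(triplets)
```

Note that a quantization loss alone can be minimized by collapsing all codes to a single value; the triplet loss and the auxiliary quantization model in the surrounding embodiments exist to counteract exactly that degenerate solution.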
In one embodiment, the basic features and the feature embedding model of the second neural network model comprise a feature map extraction model and a feature embedding model, and the second neural network model further comprises an auxiliary quantization model;
inputting the plurality of triplet samples into the basic feature and feature embedding model of the second neural network model to obtain the second feature embedding vectors corresponding to the plurality of triplet samples comprises:
inputting the plurality of triplet samples into the feature map extraction model of the second neural network model to obtain depth feature maps corresponding to the plurality of triplet samples;
inputting the depth feature maps into the feature embedding model of the second neural network model to obtain the second feature embedding vectors corresponding to the plurality of triplet samples;
determining a first loss function value of the quantization mapping model of the second neural network model according to the second quantization vectors corresponding to the plurality of triplet samples; if the first loss function value is greater than a preset first loss function threshold, performing backward gradient computation according to the first loss function value and updating the parameters of the quantization mapping model of the second neural network model; and if the first loss function value is less than or equal to the preset first loss function threshold, ending the training of the second neural network model and taking the trained second neural network model as the first neural network model, comprises:
inputting the second feature embedding vectors corresponding to the plurality of triplet samples into the quantization mapping model of the second neural network model to obtain the second quantization vectors corresponding to the plurality of triplet samples;
inputting the depth feature maps corresponding to the plurality of triplet samples into the auxiliary quantization model of the second neural network model to obtain a third quantization vector;
determining a first quantization mapping loss function value of a quantization mapping model of a second neural network model according to the second quantization vector and the third quantization vector;
determining a first bypass quantization loss function value of an auxiliary quantization model of the second neural network model according to the third quantization vector;
determining a first quantization loss function value according to the first quantization mapping loss function value, a preset first parameter and a first bypass quantization loss function value;
updating parameters of a quantization mapping model and parameters of an auxiliary quantization model of the second neural network model based on the first quantization loss function values;
and if the first quantization loss function value is less than or equal to a preset first quantization loss function value threshold, ending the training of the second neural network model and taking the trained second neural network model as the first neural network model.
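The text states only that the first quantization loss function value is determined from the quantization mapping loss, a preset first parameter, and the bypass quantization loss; it does not specify how they combine. The sketch below assumes a weighted sum, with a hypothetical mean-squared-error mapping loss and a hypothetical distance-to-binary bypass loss.

```python
# Hypothetical combination of the two losses from the steps above.
# A weighted sum is an assumption; the individual loss forms are toy
# stand-ins, not from the patent.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def quantization_mapping_loss(second_q, third_q):
    # mapping loss: the quantization mapping output (second quantization
    # vector) should agree with the auxiliary/bypass output (third vector)
    return mse(second_q, third_q)

def bypass_quantization_loss(third_q):
    # bypass loss: push auxiliary outputs toward binary {0, 1} values
    return sum(min(v, 1 - v) ** 2 for v in third_q) / len(third_q)

def first_quantization_loss(second_q, third_q, alpha=0.5):
    # alpha plays the role of the "preset first parameter"
    return quantization_mapping_loss(second_q, third_q) \
        + alpha * bypass_quantization_loss(third_q)

loss = first_quantization_loss([0.9, 0.1, 0.8], [1.0, 0.0, 1.0])
```

Here the third quantization vector is already binary, so the bypass term is zero and the combined loss reduces to the mapping term.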
In one embodiment, the method for training the basic feature and feature embedding model of the second neural network model comprises the following steps:
determining a triplet loss function value of the basic feature and feature embedding model of the second neural network model according to the second feature embedding vectors corresponding to the plurality of triplet samples;
updating the parameters of the basic feature and feature embedding model of the second neural network model according to the triplet loss function value;
and if the triplet loss function value is less than or equal to a preset triplet loss function threshold, ending the training of the basic feature and feature embedding model of the second neural network model, and taking the trained basic feature and feature embedding model of the second neural network model as the basic feature and feature embedding model of the first neural network model.
In a second aspect, the present application provides an image retrieval apparatus comprising:
the first processing module is used for acquiring an image to be inquired;
the second processing module is used for determining a first feature embedding vector and a first quantization vector of the image to be queried, and the first quantization vector is used for representing quantization features corresponding to the first feature embedding vector;
the third processing module is used for determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library, wherein the index vector is used for representing quantization characteristics corresponding to characteristic embedding vectors of image samples in the preset image library;
and the fourth processing module is used for determining, according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector, at least one image from the plurality of images as a similar image corresponding to the image to be queried.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory, and a bus;
a bus for connecting the processor and the memory;
a memory for storing operating instructions;
and the processor is used for executing the image retrieval method of the first aspect of the application by calling the operation instruction.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program for executing the image retrieval method of the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the image retrieval method of the first aspect of the present application.
The technical solution provided by the embodiments of the present application has at least the following beneficial effects:
determining a first feature embedding vector and a first quantization vector of the image to be queried, wherein the first quantization vector is used for representing the quantization feature corresponding to the first feature embedding vector; determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library, wherein the index vector is used for representing the quantization feature corresponding to the feature embedding vector of an image sample in the preset image library; and determining, according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector, at least one image from the plurality of images as a similar image corresponding to the image to be queried. In this way, the accuracy and recall rate of image retrieval are improved, and the collapse of the quantization result that occurs when deep learning is directly applied to quantize the output of the feature embedding model is avoided.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below.
FIG. 1 is a block diagram of an image retrieval system according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of an image retrieval method according to an embodiment of the present application;
FIG. 3-a is a diagram illustrating an image retrieval according to an embodiment of the present disclosure;
FIG. 3-b is a schematic diagram of an image retrieval method provided by an embodiment of the present application;
fig. 4 is a schematic diagram of image retrieval according to an embodiment of the present application;
fig. 5 is a schematic flowchart of an image retrieval method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification in connection with embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" indicates at least one of the items it joins; for example, "A and/or B" means "A", "B", or "A and B".
Traditional product quantization (PQ) retrieval enables an index to achieve accurate recall without occupying too much memory. However, the inventor's research finds that at least the following problems exist:
(1) Quantization methods that are not trained end-to-end suffer an obvious drop in performance after quantization.
(2) Similar samples fracture in the feature space: PQ performs subspace partitioning and per-subspace quantization directly on the features, so similar samples whose features are not sufficiently close are easily assigned different quantization codes. For example, if the feature vectors of two similar samples are [-1, 1, 0.5, -0.03] and [-1, 1, 0.5, 0.01], directly sign-quantizing them yields two different codes, [0, 1, 1, 0] and [0, 1, 1, 1], instead of the same code.
(3) Supervised learning with labels is not supported, i.e., incorrectly retrieved samples cannot be targeted during quantization, so the recall of PQ in practical applications is always lower than the recall of the original features.
(4) Only the quantization is learned, and the embedding is not learned, so other post-retrieval steps, such as ranking, cannot be supported.
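The fracture in problem (2) can be reproduced directly with the feature vectors from the text: two nearly identical embeddings fall on opposite sides of zero in one dimension, so direct sign quantization assigns them different codes.

```python
# Reproducing problem (2): sign quantization fractures similar samples.

def sign_quantize(vec):
    # 1 where the feature is positive, 0 otherwise
    return [1 if v > 0 else 0 for v in vec]

a = [-1, 1, 0.5, -0.03]
b = [-1, 1, 0.5, 0.01]
code_a = sign_quantize(a)   # [0, 1, 1, 0]
code_b = sign_quantize(b)   # [0, 1, 1, 1]
# the two similar samples no longer share a quantization code
fractured = code_a != code_b
```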
Based on the above, in order to solve at least one of the problems in the existing image retrieval and better meet the requirements of the image retrieval, the application provides an image retrieval method, and based on the method, the accuracy and the recall rate of the image retrieval can be improved.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiments of the present application relate to an image retrieval method provided by an image retrieval system; the method relates to the field of artificial intelligence, such as machine learning and deep learning.
Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, and intelligent transportation.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Deep Learning (DL) is a new research direction in the field of machine learning. Deep learning learns the intrinsic regularities and representation hierarchies of sample data, and the information obtained in the learning process is very helpful for the interpretation of data such as text, images, and sounds. Its ultimate goal is to enable machines to analyze and learn like humans, and to recognize data such as text, images, and sounds.
For better understanding and description of the embodiments of the present application, some technical terms used in the embodiments of the present application will be briefly described below.
Image recognition: class-level recognition, which considers only the class of an object (e.g., person, dog, cat, bird) regardless of the particular instance, and gives the class to which the object belongs. A typical example is the large-scale generic object recognition task on the ImageNet source dataset, identifying which of the 1000 categories a certain object belongs to.
Binary quantization: for a D-dimensional feature embedding vector, whose values after vector normalization generally lie in the floating-point range [-1, 1], compressing the features into a 0/1 binary code of a specified number of bits (for example 48 bits, called 48-bit compression) is vector binary quantization or binary coding.
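A minimal sketch of this definition, using a 4-bit code for readability; thresholding at zero and packing the bits into an integer are illustrative implementation choices, not from the text.

```python
# Binary quantization sketch: a D-dimensional embedding in [-1, 1] is
# compressed to an n-bit 0/1 code; the packed integer form is an
# implementation convenience.

def binary_quantize(embedding):
    return [1 if v >= 0 else 0 for v in embedding]

def pack_bits(bits):
    # store the code compactly as a single integer
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

emb = [0.7, -0.2, 0.0, -0.9]         # D = 4 toy embedding in [-1, 1]
bits = binary_quantize(emb)          # [1, 0, 1, 0]
code = pack_bits(bits)               # 0b1010 == 10
```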
Binary quantization index: a finite-bit binary vector obtained from the D-dimensional feature vector through a certain computation process (model); during retrieval, images are recalled using this binary vector as an index.
ImageNet: a source dataset for large-scale generic object recognition.
ImageNet pre-training model: a deep learning network model is trained on ImageNet; the resulting parameter weights of the model constitute the ImageNet pre-training model.
Hamming distance: the Hamming distance measures the distance between binary features by counting the number of feature bits that differ; for example, the distance between (1000) and (0011) is 3.
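This definition can be computed either on bit lists or, equivalently, by XOR and popcount on packed integer codes (the packed form is an implementation convenience, not from the text):

```python
# Hamming distance two equivalent ways: counting differing bit positions,
# and XOR + popcount on packed integer codes.

def hamming_bits(a, b):
    return sum(x != y for x, y in zip(a, b))

def hamming_packed(x, y):
    return bin(x ^ y).count("1")

# the example from the text: distance between 1000 and 0011 is 3
d1 = hamming_bits([1, 0, 0, 0], [0, 0, 1, 1])
d2 = hamming_packed(0b1000, 0b0011)
```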
Learning rate: the Learning rate (Learning rate) is an important super-parameter in supervised Learning and deep Learning, and determines whether and when the objective function can converge to a local minimum. An appropriate learning rate enables the objective function to converge to a local minimum in an appropriate time.
K-L divergence: K-L Divergence (Kullback-Leibler Divergence) is a way to quantify the difference between two probability distributions, P and Q, also called relative entropy.
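For discrete distributions this definition reads D_KL(P || Q) = Σ_i P(i) log(P(i)/Q(i)). A small sketch, which also shows that the divergence is not symmetric in P and Q:

```python
# K-L divergence between two discrete probability distributions.
import math

def kl_divergence(p, q):
    # terms with P(i) == 0 contribute nothing by convention
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```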
PQ: product quantization retrieval. PQ first divides a D-dimensional vector into M subspaces, each with feature dimension D/M, and runs k-means (clustering into K classes) in each subspace to obtain M sets of cluster centers. During retrieval, the query feature is divided into the same M parts; in each subspace the nearest of the K centers is found, and all samples under those centers are recalled. The distance between the query and each recalled sample is obtained by computing the per-subspace distances (M distances in total) and summing them; the distances are then ranked, and the smallest top-30 samples are recalled. After the M subspaces are divided, each subspace can also be quantized directly; the simplest quantization method is sign quantization of the D/M-dimensional features in each subspace, i.e., a feature dimension greater than 0 is quantized to 1 and one less than 0 to 0, so quantizing the feature vector [-1, 1, 0.5, -0.2] yields the code [0, 1, 1, 0]. This method avoids a large amount of distance computation against cluster centers and is faster.
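The "simplest quantization method" described above (per-subspace sign quantization, with M = 2 subspaces here) can be sketched as follows; the toy database and names are hypothetical, and real PQ would use per-subspace k-means centers instead of signs.

```python
# Sketch of the simplest PQ variant: split each D-dimensional vector into
# M subspaces, sign-quantize each subspace, and recall database samples
# whose full code matches the query's code.

def split(vec, m):
    d = len(vec) // m                    # each subspace has D/M dimensions
    return [vec[i * d:(i + 1) * d] for i in range(m)]

def pq_code(vec, m):
    return [1 if v > 0 else 0 for sub in split(vec, m) for v in sub]

# the example from the text: [-1, 1, 0.5, -0.2] quantizes to [0, 1, 1, 0]
example = pq_code([-1, 1, 0.5, -0.2], m=2)

database = {
    "img_a": [-0.8, 0.9, 0.4, -0.1],     # same signs as the query
    "img_b": [0.3, -0.7, 0.2, 0.6],
}
query = [-1, 1, 0.5, -0.2]
recalled = [name for name, vec in database.items()
            if pq_code(vec, m=2) == pq_code(query, m=2)]
```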
Knowledge distillation: knowledge distillation uses the knowledge learned by a large, complex model (the teacher) to guide a small, compact model (the student). The goal is to make the output of the student network as consistent as possible with the output of the teacher network, thereby obtaining large-model performance at small cost.
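A common way to realize this goal, sketched here as an assumption (the text does not specify the loss), is to minimize the K-L divergence between temperature-softened teacher and student output distributions; the temperature value below is illustrative.

```python
# Hypothetical distillation loss: K-L divergence between temperature-
# softened softmax outputs of the teacher and student networks.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    p = softmax(teacher_logits, temperature)   # teacher: soft targets
    q = softmax(student_logits, temperature)   # student: predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

identical = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
mismatched = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The loss is zero only when the student reproduces the teacher's distribution, which matches the stated goal of making the two outputs consistent.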
A training epoch: one complete pass of the entire dataset forward through the neural network and back once is called an epoch.
The technical solutions of the present application, and how they solve the above technical problems, are described in detail below through specific embodiments. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application are described below with reference to the accompanying drawings.
The scheme provided by the embodiment of the application can be suitable for any application scene needing image retrieval in the field of artificial intelligence.
In order to better understand the scheme provided by the embodiment of the present application, the scheme is described below with reference to a specific application scenario.
In an embodiment, fig. 1 shows an architecture schematic diagram of an image retrieval system to which the embodiment of the present application is applied, and it can be understood that the image retrieval method provided by the embodiment of the present application can be applied to, but is not limited to, the application scenario shown in fig. 1.
In this example, as shown in fig. 1, the architecture of the image retrieval system in this example may include, but is not limited to, an image recognition platform 10 and a database system 20, where the image recognition platform 10 may be a server or a terminal, and the database system 20 may be a server; the image recognition platform 10 and the database system 20 may interact with each other via a network. The image recognition platform 10 runs a first neural network model 101, and the first neural network model 101 includes a basic feature and feature embedding model 110 and a quantitative mapping model 120, wherein the basic feature and feature embedding model 110 includes a feature map extraction model 111 and a feature embedding model 112.
The image recognition platform 10 acquires an image to be queried and inputs it into the feature map extraction model 111 to obtain a depth feature map; the depth feature map is input into the feature embedding model 112 to obtain a first feature embedding vector of the image to be queried; the first feature embedding vector is input into the quantization mapping model 120, and a first quantization vector is obtained through mapping quantization processing. The image recognition platform 10 determines at least one index vector corresponding to the first quantization vector according to the first quantization vector, the index vector list of the image library in the database system 20, and a preset first distance threshold; it then determines, according to the first feature embedding vector, the feature embedding vectors of the plurality of images in the image library corresponding to the at least one index vector, and a preset second distance threshold, at least one of those images as a similar image corresponding to the image to be queried.
It is understood that the above is only an example, and the present embodiment is not limited thereto.
The terminal may be a smartphone (such as an Android phone or an iOS phone), a phone simulator, a tablet computer, a notebook computer, a digital broadcast receiver, an MID (Mobile Internet Device), a PDA (personal digital assistant), a vehicle-mounted terminal (such as a vehicle navigation terminal), a smart speaker, a smart watch, and the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server or server cluster providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, Wi-Fi, and other networks enabling wireless communication. These may also be determined based on the requirements of the actual application scenario, and are not limited herein.
Referring to fig. 2, fig. 2 shows a flowchart of an image retrieval method provided in an embodiment of the present application. The method may be executed by any electronic device, such as a server or a terminal; for convenience of description, the server or the terminal is taken as the execution subject in the following description of some alternative embodiments. As shown in fig. 2, the image retrieval method provided in the embodiment of the present application includes the following steps:
S201, acquiring an image to be queried.
Specifically, the image to be queried may be a picture of a person, a dog, a cat, a bird, or the like. Image retrieval is performed on the image to be queried to obtain a similar image corresponding to it in the image library, where the image to be queried and the similar image belong to the same type of image; for example, if the image to be queried shows a small spotted dog, the similar image is one showing a spotted dog similar to it.
S202, a first feature embedding vector and a first quantization vector of the image to be queried are determined, and the first quantization vector is used for representing quantization features corresponding to the first feature embedding vector.
Specifically, the first feature embedding vector may be an embedding vector, i.e., a feature embedding vector; the embedding vector is input into a quantization mapping model (Map model), and the first quantization vector is obtained through mapping quantization processing.
For example, the first feature embedding vector may be (-1,1,0.5, -0.03), (0.1,0.1,0.5,0.03), etc. The first quantized vector may be (0,1,0), (1,0,0), etc.
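As an illustrative sketch (not the embodiment's exact Map model), the mapping quantization step can be pictured as a learned projection with a tanh activation followed by sign quantization; the weight matrix W and bias b below are hypothetical stand-ins for the learned Map-layer parameters.

```python
import math
import random

def quantize_embedding(e, W, b):
    """Map a feature embedding vector to a binary quantization vector.

    The Map layer is sketched as a single linear projection with a tanh
    activation; W and b are hypothetical learned parameters.
    """
    q = []
    for j in range(len(b)):
        # mapping output lies in (-1, 1)
        u = math.tanh(sum(e[i] * W[i][j] for i in range(len(e))) + b[j])
        # sign quantization to a 0/1 bit
        q.append(1 if u > 0 else 0)
    return q

random.seed(0)
e = [-1.0, 1.0, 0.5, -0.03]                              # a first feature embedding vector
W = [[random.gauss(0.0, 0.1) for _ in range(3)] for _ in range(4)]  # small Gaussian init
b = [0.0, 0.0, 0.0]
q = quantize_embedding(e, W, b)                           # a 0/1 vector such as (0, 1, 0)
```

In the embodiment the mapping is a learned model (table 3); this sketch only shows why the output is a binary vector comparable by Hamming distance.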
S203, determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and the index vector list of the preset image library, wherein the index vector is used for representing the quantization feature corresponding to the feature embedding vector of the image sample in the preset image library.
Specifically, the index vector list of the preset image library includes a plurality of index vectors, where each index vector may correspond to a feature embedding vector of a plurality of image samples, and each image sample corresponds to one feature embedding vector.
In one embodiment, a plurality of index vectors corresponding to the first quantization vector are determined according to the first quantization vector, an index vector list of a preset image library and a preset first distance threshold.
S204, determining at least one image from the plurality of images as a similar image corresponding to the image to be queried according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector.
Specifically, at least two images of the plurality of images can be determined as similar images corresponding to the image to be queried according to the first feature embedding vector, the feature embedding vectors corresponding to the plurality of index vectors, and a preset second distance threshold.
In the embodiment of the application, a first feature embedding vector and a first quantization vector of the image to be queried are determined, wherein the first quantization vector is used for representing the quantization features corresponding to the first feature embedding vector; at least one index vector corresponding to the first quantization vector is determined according to the first quantization vector and the index vector list of the preset image library, wherein an index vector represents the quantization features corresponding to the feature embedding vector of an image sample in the preset image library; and at least one image is determined from the plurality of images as a similar image corresponding to the image to be queried according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector. This improves the accuracy and recall rate of image retrieval, and avoids the collapse of the quantization result that occurs when deep learning is directly applied to quantize the output of the feature embedding model.
In one embodiment, determining the first feature embedding vector and the first quantization vector of the image to be queried comprises steps A1-A2:
step A1, inputting the triple sample corresponding to the image to be queried to a preset basic feature and feature embedding model of the first neural network model, and obtaining a first feature embedding vector of the image to be queried through feature embedding processing, wherein the basic feature and feature embedding model comprises a feature map extraction model and a feature embedding model.
For example, as shown in fig. 1, a triple sample corresponding to the image to be queried is input to the preset basic feature and feature embedding model 110 of the first neural network model 101, and the first feature embedding vector of the image to be queried is obtained through feature embedding processing, where the basic feature and feature embedding model 110 includes a feature map extraction model 111 and a feature embedding model 112. The feature map extraction model 111 may be a CNN (Convolutional Neural Network), for example ResNet-101, whose feature module structure table is shown in table 1.
TABLE 1 ResNet-101 feature Module Structure Table
[Table 1 appears as an image in the original publication; its contents are not reproducible here.]
The feature embedding model 112 may be an embedding learning layer, and learns the triplet distance information by using metric learning, where the feature embedding model 112 is shown in table 2, and 64 in table 2 is an embedding dimension.
TABLE 2 feature embedding model
[Table 2 appears as an image in the original publication; its contents are not reproducible here.]
Step A2, inputting the first feature embedding vector into a quantization mapping model of the first neural network model, and obtaining a first quantization vector through mapping quantization processing.
For example, as shown in fig. 1, the first feature embedding vector is input to a quantization mapping model 120 of the first neural network model 101, and a first quantization vector is obtained through a mapping quantization process. The quantization mapping model (Map)120 is shown in table 3.
TABLE 3 quantized mapping model
[Table 3 appears as an image in the original publication; its contents are not reproducible here.]
In one embodiment, determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library includes:
and if the distance between the first quantization vector and at least one index vector in the index vector list of the preset image library is smaller than a preset first distance threshold, determining at least one index vector in the index vector list as at least one index vector corresponding to the first quantization vector.
In one embodiment, the method for creating the index of the image library (image database) and generating the inverted list Linvert and the forward list T comprises the following steps B1-B3:
step B1, respectively inputting N images i in the image library into a first neural network model to obtain a feature embedded vector e and a quantized vector q _ map; wherein e is the result output by the feature embedding model of the first neural network model; and q _ map is quantized 0 and 1 vectors obtained by taking sign functions after quantization output, namely q _ map is the output result of the quantization mapping model of the first neural network model. And recording a T mapping table of the image and the characteristic embedded vector as T [ i: e ], wherein i represents the serial number of the image, and e represents the characteristic embedded vector of the image.
Step B2, an index system based on q_map is established, i.e., for each quantization vector qj, the sequence numbers of the images having qj are recorded in the inverted list Linvert, for example [q1: [img1, img2, img5], q2: [img3], q3: [img4]], and the index vector list Lindex is saved: [q1, q2, q3], where q1, q2 and q3 are index vectors, and img1, img2, img3, img4 and img5 are the images corresponding to the index vectors.
Step B3, for a newly added sample x, qx and ex of the sample x can be calculated; when qx already exists in the Lindex list, the image sequence number x is added to the list corresponding to qx in the inverted list, and the image sequence number x and ex are added to the T mapping table, i.e., a sequence-number-to-feature record [x: ex] is newly added.
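Steps B1-B3 can be sketched with plain Python dictionaries; the names linvert, lindex and t, and the use of tuples as q_map keys, are illustrative assumptions rather than the embodiment's data structures.

```python
from collections import defaultdict

def build_index(images):
    """Steps B1-B2 sketch. images: mapping image id -> (q_map, e) as
    produced by the first neural network model."""
    linvert = defaultdict(list)  # inverted list: q_map -> image ids
    t = {}                       # forward table T: image id -> embedding e
    for img_id, (q_map, e) in images.items():
        linvert[q_map].append(img_id)
        t[img_id] = e
    lindex = list(linvert.keys())  # index vector list Lindex
    return linvert, lindex, t

def add_sample(linvert, lindex, t, x, qx, ex):
    """Step B3 sketch: register a newly added sample x with quantization
    vector qx and feature embedding vector ex."""
    if qx not in linvert:        # new quantization code: extend Lindex
        lindex.append(qx)
    linvert[qx].append(x)        # record x under its code in Linvert
    t[x] = ex                    # new sequence-number-to-feature record [x: ex]
```

With the example of step B2, building from {img1, img2, img5} under q1, {img3} under q2 and {img4} under q3 reproduces the lists Linvert and Lindex described above.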
In one embodiment, if the distance between the first quantization vector q_map and at least one index vector in the index vector list Lindex of the preset image library is smaller than the preset first distance threshold, the at least one index vector in the index vector list is determined as the at least one index vector corresponding to the first quantization vector. For example, according to the q_map of the image to be queried, the index vectors whose Hamming distance from the q_map of the image to be queried is smaller than the preset first distance threshold Dq_thr are determined from Lindex, for example the index vectors q2 and q3 in the Lindex list.
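The Hamming-distance filtering against Lindex can be sketched as follows (a minimal illustration; the function names are assumptions, and the threshold semantics follow the "smaller than Dq_thr" rule above).

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary vectors."""
    return sum(x != y for x, y in zip(a, b))

def candidate_index_vectors(q_map, lindex, dq_thr):
    """Return the index vectors in Lindex whose Hamming distance to the
    query's q_map is smaller than the first distance threshold Dq_thr."""
    return [q for q in lindex if hamming(q_map, q) < dq_thr]

lindex = [(0, 1, 0), (1, 0, 0), (1, 1, 0)]
cands = candidate_index_vectors((1, 0, 0), lindex, dq_thr=2)
```

Here (0, 1, 0) differs from the query in two bits and is filtered out, while the other two codes pass the threshold.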
In one embodiment, determining at least one image from a plurality of images as a similar image corresponding to an image to be queried according to the feature embedding vectors of the plurality of images in the preset image library corresponding to the first feature embedding vector and the at least one index vector comprises:
And if the first distance between the first feature embedding vector and the feature embedding vector of any image in the plurality of images is smaller than a preset second distance threshold, the first distances are sorted from small to large, and the images corresponding to the smallest first distances are taken as the similar images corresponding to the image to be queried.
For example, consider the index vectors q2 and q3 in the Lindex list, where q2: [img3] and q3: [img4]; q2 corresponds to the feature embedding vector e3 of image img3 in the preset image library, and q3 corresponds to the feature embedding vector e4 of image img4. If the first distances (e.g., Euclidean distances) between the first feature embedding vector e and e3 and e4 are smaller than the preset second distance threshold, the first distances dist3 (of e3) and dist4 (of e4) are sorted from small to large, and the topK samples are selected from the sorting and returned; for example, with topK = 2, image img3 and image img4 are returned to the user as the similar images of the image to be queried, where topK can be preset.
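The final ranking step can be sketched as follows; the function and variable names are assumptions, and t stands for the forward table T mapping image ids to embeddings.

```python
import math

def topk_similar(e_query, candidate_ids, t, second_thr, k=2):
    """Rank candidate images by the Euclidean distance between their
    feature embedding vectors and the query embedding, keep those below
    the second distance threshold, and return the topK image ids."""
    scored = sorted(
        (math.dist(e_query, t[i]), i)
        for i in candidate_ids
        if math.dist(e_query, t[i]) < second_thr
    )
    return [img for _, img in scored[:k]]

# hypothetical forward table and candidates gathered via Linvert
t = {"img3": [0.0, 0.0], "img4": [3.0, 4.0], "img5": [10.0, 10.0]}
similar = topk_similar([0.0, 1.0], ["img3", "img4", "img5"], t, second_thr=6.0, k=2)
```

Here img5 is rejected by the second distance threshold, and img3 and img4 are returned in order of increasing distance.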
In one embodiment, before acquiring the image to be queried, the method further comprises:
obtaining a plurality of triple samples corresponding to the image sample set, wherein the triple samples comprise image samples, positive samples corresponding to the image samples and negative samples corresponding to the image samples;
inputting the multiple triple samples into the basic features and the feature embedding model of the second neural network model to obtain second feature embedding vectors corresponding to the multiple triple samples;
inputting second characteristic embedding vectors corresponding to the multiple triple samples into a quantization mapping model of a second neural network model to obtain second quantization vectors corresponding to the multiple triple samples;
determining a first loss function value of a quantization mapping model of a second neural network model according to a second quantization vector corresponding to the plurality of triple samples;
updating parameters of a quantitative mapping model of the second neural network model based on the first loss function values;
and if the first loss function value is smaller than or equal to the preset first loss function value threshold, finishing the training of the second neural network model, and taking the trained second neural network model as the first neural network model.
For example, the plurality of triple samples are input to the basic feature and feature embedding model of the second neural network model, second feature embedding vectors corresponding to the plurality of triple samples are obtained through feature embedding processing, and the second feature embedding vectors corresponding to the plurality of triple samples are input to the quantization mapping model of the second neural network model to obtain second quantization vectors corresponding to the plurality of triple samples. The basic feature and feature embedding model includes a feature map extraction model and a feature embedding model; the feature map extraction model may be a CNN, for example ResNet-101.
In one embodiment, training sample preparation is required before training the second neural network model. Conventional similarity embedding training needs similar sample pairs to be prepared, and since the image retrieval method provided by the embodiment of the application is likewise based on similarity information, similar sample pairs also need to be prepared. The similar sample pairs used for quantization are consistent with the sample pairs used for training the similarity embedding model (the feature embedding model included in the second neural network model).
In one embodiment, the mining of the plurality of triple samples corresponding to the image sample set may be in-batch sample mining, and the mining may employ triplet learning over the similar sample pairs. For example, a triple sample is obtained by mining each sample pair of a batch as follows: for a certain sample x, the embeddings (outputs of the feature embedding model included in the second neural network model) of the remaining bs - 1 sample pairs are calculated (each pair randomly selects one image), the distance between each embedding and x is calculated, the samples are sorted by distance from small to large, and the first 10 samples are taken as negative samples and respectively form triples with the positive sample of x; each sample thus generates 10 triples, and the whole batch yields 10 × bs triples, where bs (batch size) represents the number of samples selected for one training step.
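The in-batch mining described above can be sketched as follows; which image each pair contributes as a negative candidate, and the function names, are illustrative assumptions.

```python
import math

def mine_triplets(batch_pairs, embed, n_neg=10):
    """For each (anchor, positive) pair in the batch, take one image from
    every other pair as a candidate negative, sort candidates by embedding
    distance to the anchor, and form a triplet with each of the n_neg
    nearest candidates (hard negatives)."""
    triplets = []
    for i, (a, p) in enumerate(batch_pairs):
        candidates = [pair[0] for j, pair in enumerate(batch_pairs) if j != i]
        candidates.sort(key=lambda x: math.dist(embed[a], embed[x]))
        for n in candidates[:n_neg]:
            triplets.append((a, p, n))
    return triplets

# toy 1-D embeddings for three sample pairs
embed = {"a1": [0.0], "p1": [0.1], "a2": [1.0], "p2": [1.1], "a3": [5.0], "p3": [5.1]}
pairs = [("a1", "p1"), ("a2", "p2"), ("a3", "p3")]
triplets = mine_triplets(pairs, embed, n_neg=1)
```

With n_neg = 10 and bs pairs per batch this yields the 10 × bs triples described above.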
It should be noted that the similarity embedding model (the feature embedding model included in the second neural network model) is expected to predict the same images (for example, identical or extremely similar images) as the same (with a metric distance as small as possible), and to keep different images (for example, images that are only slightly similar, not sufficiently similar, or dissimilar) as far apart as possible, while also satisfying an order-preserving effect, i.e., the more dissimilar two images are, the larger the distance. Except for similar images produced by attack means such as adjacent frames under the same shot of a video or tone conversion applied to an image, the probability that any two other images are similar is very low; therefore, the sample pairs of each batch (sampled from the full set of samples, where any two different sample pairs in the batch differ from each other) provide effective negative samples, and the 10 samples with the smallest distance (those closest to the positive sample pair) are selected as the negative samples.
In one embodiment, for an image whose embedding has already been extracted, the embedding alone can be input to the quantization mapping model (Map) of table 3 to extract the quantized q_map of the image, and the embedding and q_map of the image can then be indexed and retrieved through the subsequent steps. This usage is suitable for business applications with a large existing inventory: an embedding model already exists, and the index of inventory methods such as PQ or k-means can be upgraded by adopting the present method.
In one embodiment, the method for training the basic features and the feature embedding model of the second neural network model comprises the following steps:
determining basic features of a second neural network model and triple loss function values of the feature embedding model according to second feature embedding vectors corresponding to the multiple triple samples;
updating the basic characteristics of the second neural network model and the parameters of the characteristic embedded model according to the triplet loss function values;
and if the triple loss function value is less than or equal to the preset triple loss function threshold, finishing the training of the basic characteristic and characteristic embedded model of the second neural network model, and taking the basic characteristic and characteristic embedded model of the second neural network model obtained by training as the basic characteristic and characteristic embedded model of the first neural network model. It should be noted that training the basic features and the feature embedding model of the second neural network model may occur in the pre-training stage of the second neural network model; training the underlying features and the feature embedding model of the second neural network model may also occur at or after a quantization stage of the second neural network model, such as a first quantization stage and a second quantization stage.
In one embodiment, as shown in FIG. 3-a, the pre-training phase of the second neural network model 210 may be the training of the base features and feature embedding model 220 of the second neural network model, wherein the base features and feature embedding model includes a feature map extraction model 221 and a feature embedding model 222. The pre-training phase comprises:
(1) and initializing parameters.
For example, Conv1-Conv5 in the pre-training phase use the parameters of ResNet-101 (the feature map extraction model) pre-trained on the ImageNet dataset, and newly added layers such as the embedding model are initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0. The parameters in table 1 and table 2 can be learned.
(2) And setting learning parameters.
For example, for the embedding learning in the pre-training phase, the underlying basic features (feature map extraction model) need to be updated, and the learning parameters to be set are shown in table 1 and table 2.
(3) The learning rate is set.
For example, the feature map extraction model and the feature embedding model of the second neural network model both adopt a learning rate of lr1 = 0.005; the learning rate lr becomes 0.1 times its previous value after every 10 rounds of iteration.
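The schedule can be sketched as a simple step decay (a hypothetical helper, not the embodiment's training code):

```python
def learning_rate(epoch, lr0=0.005, drop=0.1, every=10):
    """Step-decay schedule: lr1 = 0.005, multiplied by 0.1 after every
    10 rounds of iteration (0-indexed epochs)."""
    return lr0 * (drop ** (epoch // every))
```

For example, epochs 0-9 use 0.005, epochs 10-19 use 0.0005, and so on.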
(4) And (6) learning.
For example, the learning process includes: carrying out epoch round iteration on the full data; the full number of samples is processed for each iteration until the average epoch loss at a certain epoch no longer decreases.
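The stopping rule above can be sketched as follows; run_epoch is a hypothetical callback that processes the full set of samples once and returns the mean loss of that epoch.

```python
def train_until_converged(run_epoch, max_epochs=100):
    """Iterate epochs over the full data until the average epoch loss no
    longer decreases, and return the best (lowest) epoch loss seen."""
    best = float("inf")
    for _ in range(max_epochs):
        loss = run_epoch()
        if loss >= best:   # average epoch loss stopped decreasing
            break
        best = loss
    return best

# toy epoch losses: improves twice, then stops improving
losses = iter([3.0, 2.0, 2.5])
best = train_until_converged(lambda: next(losses))
```

The loop stops at the first epoch whose average loss fails to improve on the best so far.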
(5) The specific operations in each iteration are as follows: taking each batch size sample pair as a batch, dividing the full-size sample pair into Nxb batches, acquiring a plurality of triple samples for each batch, and executing the following steps C1-C3:
and C1, performing forward calculation on the basic features and the feature embedding models of the second neural network model.
Specifically, all parameters of the model are set to be in a state needing learning, and the neural network performs forward calculation on an input picture during training to obtain a prediction result, namely a second feature embedding vector em output by the feature embedding model of the second neural network model.
And C2, calculating the loss function of the basic characteristics and the characteristic embedding model of the second neural network model.
For example, the loss function of the basic features and the feature embedding model of the second neural network model may be ltri, i.e., the similarity feature loss (triplet loss); the loss function value of the pre-training stage may be the value of ltri. The triple sample (a, p, n) includes an image sample a, a positive sample p corresponding to the image sample, and a negative sample n corresponding to the image sample. After the triple sample (a, p, n) is found in the batch samples, the second feature embedding vector em corresponding to the triple sample is subjected to L2 normalization, and then the triplet loss is calculated. The triplet loss is calculated as shown in Equation (1), where α is the margin and can be set to 4, and ||Xa - Xp|| represents the L2 distance between two embeddings. The objective of the triplet loss is to make the distance between the image sample a (anchor) and the negative sample n (negative) greater than 4. The purpose of the normalization is to keep the feature space in the range of 0-1 and avoid a space that is too large and unfavorable for optimization learning.
ltri = max(||Xa - Xp|| - ||Xa - Xn|| + α, 0)   Equation (1)
In addition, when ltri is used in the first quantization stage and the second quantization stage, the basic features and the feature embedding model of the second neural network model obtained in the pre-training stage can be fixed and not adjusted.
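Equation (1) with L2 normalization can be written directly; this is a minimal numeric sketch of the loss, not the embodiment's training code.

```python
import math

def l2_normalize(x):
    """L2 normalization, so embeddings lie on the unit sphere."""
    n = math.sqrt(sum(v * v for v in x))
    return [v / n for v in x]

def triplet_loss(xa, xp, xn, alpha=4.0):
    """Equation (1): ltri = max(||Xa - Xp|| - ||Xa - Xn|| + alpha, 0),
    computed on L2-normalized embeddings with margin alpha = 4."""
    xa, xp, xn = l2_normalize(xa), l2_normalize(xp), l2_normalize(xn)
    return max(math.dist(xa, xp) - math.dist(xa, xn) + alpha, 0.0)
```

For a perfect positive (zero distance) and an orthogonal negative, the loss is 4 - sqrt(2); with a small margin the same triple yields zero loss.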
And step C3, updating the parameters of the basic features and the feature embedding models of the second neural network model.
Specifically, gradient backward calculation is performed on the basic features and the feature embedding model of the second neural network model by using the Stochastic Gradient Descent (SGD) method, so that the parameters of the basic features and the feature embedding model of the second neural network model are updated.
In one embodiment, as shown in fig. 3-b, the second neural network model 210 includes a base feature and feature embedding model 220, a quantization mapping model 230, and an auxiliary quantization model 240, wherein the base feature and feature embedding model 220 includes a feature map extraction model 221 and a feature embedding model 222.
For example, the feature map extraction model 221 may be a CNN, for example ResNet-101, whose feature module structure table is shown in table 1. The feature embedding model 222 may be an embedding learning layer, shown in table 2. The quantization mapping model (Map) 230 is shown in table 3. The auxiliary quantization model 240 is shown in table 4.
TABLE 4 auxiliary quantization model
[Table 4 appears as an image in the original publication; its contents are not reproducible here.]
It should be noted that the auxiliary quantization model shown in table 4 is used for auxiliary quantization: it directly quantizes the depth features output after the pooling in table 1, so a more direct image-to-quantization effect can be obtained without being constrained by the embedding. Serving as an auxiliary branch, it pulls the target quantization toward this quantization effect, avoiding insufficient similarity measurement of the target quantization under the influence of the embedding.
In one embodiment, the basic features and the feature embedding model of the second neural network model comprise a feature map extraction model and a feature embedding model, and the second neural network model further comprises an auxiliary quantization model;
inputting the multiple triple samples into the basic features and the feature embedding model of the second neural network model to obtain second feature embedding vectors corresponding to the multiple triple samples, wherein the second feature embedding vectors comprise:
inputting the multiple triple samples into a feature map extraction model of a second neural network model to obtain depth feature maps corresponding to the multiple triple samples;
inputting the depth feature map into a feature embedding model of a second neural network model to obtain second feature embedding vectors corresponding to a plurality of triple samples;
determining a first loss function value of a quantization mapping model of a second neural network model according to a second quantization vector corresponding to the plurality of triple samples; if the first loss function value is larger than a preset first loss function threshold value, performing gradient backward calculation according to the first loss function value, and updating parameters of a quantitative mapping model of the second neural network model; if the first loss function value is less than or equal to a preset first loss function value threshold, ending the training of the second neural network model, and taking the trained second neural network model as the first neural network model, including:
inputting second characteristic embedding vectors corresponding to the multiple triple samples into a quantization mapping model of a second neural network model to obtain second quantization vectors corresponding to the multiple triple samples;
inputting the depth feature maps corresponding to the triple samples into an auxiliary quantization model of a second neural network model to obtain a third quantization vector;
determining a first quantization mapping loss function value of a quantization mapping model of a second neural network model according to the second quantization vector and the third quantization vector;
determining a first bypass quantization loss function value of an auxiliary quantization model of the second neural network model according to the third quantization vector;
determining a first quantization loss function value according to the first quantization mapping loss function value, a preset first parameter and a first bypass quantization loss function value;
updating parameters of a quantization mapping model and parameters of an auxiliary quantization model of the second neural network model based on the first quantization loss function values;
and if the first quantization loss function value is smaller than or equal to a preset first quantization loss function value threshold, finishing the training of the second neural network model, and taking the trained second neural network model as the first neural network model.
It should be noted that, in this embodiment, the training of the second neural network model may be used as the whole quantization stage of the second neural network model; the training of the second neural network model in this embodiment may also be used as the first quantization stage of the second neural network model.
In one embodiment, after finishing the training of the basic features and the feature embedding model of the second neural network model, the method further comprises the following steps:
inputting the second characteristic embedding vector into a quantization mapping model of a second neural network model for training to obtain a second quantization vector;
inputting the depth feature map output by the feature map extraction model of the second neural network model into an auxiliary quantization model of the second neural network model for training to obtain a third quantization vector;
determining a first quantization mapping loss function value of a quantization mapping model of a second neural network model through a K-L divergence algorithm according to the second quantization vector and the third quantization vector;
determining a first bypass quantization loss function value of an auxiliary quantization model of the second neural network model according to the third quantization vector;
determining a first quantization loss function value according to the first quantization mapping loss function value, a preset first parameter and a first bypass quantization loss function value;
if the first quantization loss function value is larger than a preset first quantization loss function threshold, performing gradient backward calculation on the first quantization loss function value according to a random gradient descent algorithm, and updating parameters of a quantization mapping model and parameters of an auxiliary quantization model of a second neural network model;
and if the first quantization loss function value is smaller than or equal to a preset first quantization loss function value threshold, ending the first-stage training of the quantization mapping model and the auxiliary quantization model of the second neural network model.
In one embodiment, the first quantization stage of the second neural network model is the first-stage training of the quantization mapping model and the auxiliary quantization model of the second neural network model. By learning the bypass branch (auxiliary quantization model) preferentially, the first quantization stage enables the bypass branch to achieve a better learning effect. The first quantization stage comprises:
(1) and initializing parameters.
For example, the parameters of the second neural network model are initialized by using the parameters of table 1 and table 2 learned in the pre-training stage, and the newly added Map layer (quantization mapping model) and quantization layer (auxiliary quantization model) are initialized by using a gaussian distribution with a variance of 0.01 and a mean of 0.
(2) And setting learning parameters.
For example, for quantization learning in the first quantization stage, it is necessary to learn the parameters in table 1, table 2, table 3, and table 4, where the parameters in table 1 and table 2 are parameters that need to be fine-tuned, and the parameters in table 3 and table 4 are newly added parameters.
(3) The learning rate is set.
For example, the quantization mapping model and the auxiliary quantization model of the second neural network model both adopt a learning rate of lr1 = 0.005; the learning rate lr becomes 0.1 times its previous value after every 10 rounds of iteration.
(4) And (6) learning.
For example, the learning process includes: carrying out epoch round iteration on the full data; the full number of samples is processed for each iteration until the average epoch loss at a certain epoch no longer decreases.
(5) The specific operations in each iteration are as follows: taking each batch size sample pair as a batch, dividing the full-size sample pair into Nxb batches, acquiring a plurality of triple samples for each batch, and executing the following steps D1-D3:
and D1, performing forward calculation of the basic features and the feature embedding models of the second neural network model.
Specifically, all parameters of the model are set to a state requiring learning, and during training the neural network performs forward calculation on an input picture to obtain the prediction results, namely the second feature embedding vector em output by the feature embedding model of the second neural network model, the feature vector q_map output by the quantization mapping model (Map layer) of the second neural network model, and the feature vector q output by the auxiliary quantization model of the second neural network model.
Step D2, calculation of a first quantized loss function value of the second neural network model.
Specifically, a first quantization loss function value is determined according to a first quantization mapping loss function value, a preset first parameter and a first bypass quantization loss function value. The calculation of the first quantization loss function value is shown in equation (2).
Loss of the first quantization stage = Lq + M × Lmap   Equation (2)
The Loss of the first quantization stage is the first quantization loss function, Lq is the first bypass quantization loss function, Lmap is the first quantization mapping loss function, and M is the first parameter; for example, the value of M is 0.01.
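The combination in Equation (2) is a simple weighted sum; as a sketch (a trivial helper, included only to make the weighting concrete):

```python
def first_stage_loss(lq, lmap, m=0.01):
    """Equation (2): Loss of the first quantization stage = Lq + M x Lmap,
    with the first parameter M = 0.01."""
    return lq + m * lmap
```

With M = 0.01 the bypass quantization loss Lq dominates, which matches the first stage's emphasis on learning the bypass branch preferentially.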
For example, the bypass quantization loss Lq (first bypass quantization loss function) may consist of the symbol quantization loss Lcoding and the metric loss (triplet loss Ltriplet) of the quantization result q (the feature vector q output by the auxiliary quantization model of the second neural network model).
The purpose of the metric loss on the quantization result q is to give the quantization feature q output by the auxiliary quantization branch (the auxiliary quantization model of the second neural network model) the ability to measure triples; since the first step of retrieval is to find, among the inventory quantization features, the topK quantization features most similar to the image to be queried, the quantization feature needs to have this metric capability.
The purpose of the symbol quantization loss is to make the quantization output sufficiently close to -1 or 1, the two possible outputs of symbol quantization, so that the quantization result (the quantization vector consisting of -1 and 1) can satisfy the above metric loss on q, yielding a satisfactory binary quantization vector. In the calculation, Tanh activation can be applied to the output of the quantization model, and the bypass quantization loss Lq is then calculated according to Equation (3).
Lq = w21 × Ltriplet + w22 × Lcoding    Equation (3)
Metric loss Ltriplet: this also uses the triplet samples. Since the quantization vector is 128-dimensional and each bit must learn a value of -1 or 1, the distance between the a and n samples of a triplet must be large enough for them to remain distinguishable in the quantization space, so the margin is set to 100.
Sign quantization loss Lcoding: the quantization effect loss is calculated on the vector output by this quantization branch, whose target outputs are -1 and 1. This is sign quantization, i.e., values less than 0 are quantized to -1 and values greater than 0 to 1. The goal of the sign quantization loss is to bring the output coding of the quantization branch closer to -1 or 1 (if an output lies near the critical value 0, similar features tend to be quantized into different codes). The target code of the quantization learning task can therefore be generated with a sign function: for each bit ui of the coding vector u, the target code bi is calculated according to Equation (4), and the target code of u is b. A regression loss is then used to reduce the distance between the coding output vector u and the target code b, and Lcoding is calculated according to Equation (5).
bi = sign(ui), i.e., bi = 1 if ui ≥ 0 and bi = -1 if ui < 0    Equation (4)

Lcoding = ||u − b||²    Equation (5)
Weights: w21 = 1 and w22 = 0.5. Since the regression loss converges faster than the triplet loss, w22 may be set to 0.5 (or another value less than 1, adjusted as appropriate) to ensure that the triplet loss dominates the overall loss, so that the embedding always retains the ability to measure similarity.
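The bypass quantization loss above can be sketched as follows. This is a minimal numpy illustration, not the patent's reference code: the squared-Euclidean triplet form, the mean-squared regression form of the coding loss, and all vectors are assumptions consistent with Equations (3) to (5).

```python
import numpy as np

def triplet_loss(a, p, n, margin=100.0):
    """Hinge-style triplet loss on quantization outputs: pull anchor a toward
    positive p, push it away from negative n, with margin 100 as stated."""
    d_ap = np.sum((a - p) ** 2)
    d_an = np.sum((a - n) ** 2)
    return float(max(0.0, d_ap - d_an + margin))

def coding_loss(u):
    """Sign quantization (regression) loss, Equations (4)-(5): each bit of u
    regresses toward its target code b = sign(u), i.e. toward -1 or 1."""
    b = np.where(u >= 0, 1.0, -1.0)  # target code via the sign function
    return float(np.mean((u - b) ** 2))

def bypass_quantization_loss(a, p, n, w21=1.0, w22=0.5):
    """Equation (3): Lq = w21 * Ltriplet + w22 * Lcoding."""
    return w21 * triplet_loss(a, p, n) + w22 * coding_loss(a)
```

With the stated weights w21 = 1 and w22 = 0.5, the triplet term dominates whenever the codes are already near ±1.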
The Tanh activation function maps the output result between-1 and 1, similar to the function of the sign function, except that the sign function is not derivable (+0 and-0) at the 0 position, i.e. the gradient calculation cannot be performed, and thus cannot be used in deep learning based on sgd gradient return; whereas with Tanh activation may be derived and mapped between-1 and 1. Sigmoid activation (to between 0 and 1) can also be adopted, and then 0 and 1 are taken as the targets of quantization (instead of-1 and 1). Wherein, Tanh can obtain-1 and 1 more quickly, so the learning effect is better. Wherein, the curve of Tanh is steeper and is faster to approach-1 and 1, and sigmoid is slower and is not easy to approach the extreme values (0 and 1) at the two ends. Therefore, the learning effect of the Tanh model is closer to the real effect of binary quantization. The auxiliary quantization branch (the auxiliary quantization model of the second neural network model) is used for assisting the learning of the quantization mapping model of the second neural network model.
For example, for the embedding quantization mapping loss Lmap (the first quantization mapping loss function), the mapping layer outputs a 1x128-dimensional result. Since quantizing directly from the embedding mapping easily yields a poor quantization metric effect (as with PQ-style quantization), the bypass quantization output is used as the learning target: once the second neural network model has learned a good bypass quantization branch (the auxiliary quantization model), the target quantization mapping only needs to align its result with the bypass quantization output. Lmap is calculated according to Equation (6).
Lmap = Σx p(x) × log(p(x) / q(x))    Equation (6)
Lmap is a K-L divergence loss that keeps the distributions of the two predictions consistent, making the post-mapping quantization distribution match the bypass quantization distribution. Here p(x) is the bypass quantization output (the 1x128 output of Table 4) and q(x) is the mapped quantization result (the 1x128 output of the Table 3 mapping layer).
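A minimal sketch of the K-L divergence term of Equation (6). The patent does not state how the 1x128 outputs are turned into the distributions p(x) and q(x); normalizing the logits with a softmax, as below, is an assumption.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl_divergence(p_logits, q_logits, eps=1e-12):
    """Lmap = sum_x p(x) * log(p(x) / q(x)), Equation (6): p is the bypass
    quantization output, q the mapped quantization result."""
    p = softmax(np.asarray(p_logits, dtype=float))
    q = softmax(np.asarray(q_logits, dtype=float))
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

The loss is zero when the two branches predict the same distribution and positive otherwise, which is what drives the mapping branch toward the bypass branch.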
Step D3, updating the parameters of the quantization mapping model and the auxiliary quantization model of the second neural network model.
Specifically, gradient backward calculation is performed on the first quantization loss function value using SGD (Stochastic Gradient Descent), thereby updating the parameters of the quantization mapping model and of the auxiliary quantization model of the second neural network model.
In one embodiment, after finishing the first-stage training of the quantization mapping model and the auxiliary quantization model of the second neural network model, the method further comprises:
inputting the second characteristic embedding vector into a quantization mapping model of a second neural network model for training to obtain a second quantization vector;
inputting the depth feature map output by the feature map extraction model of the second neural network model into an auxiliary quantization model of the second neural network model for training to obtain a third quantization vector;
determining a second quantization mapping loss function value of the quantization mapping model of the second neural network model through a K-L divergence algorithm according to the second quantization vector and the third quantization vector;
determining a second bypass quantization loss function value of an auxiliary quantization model of the second neural network model according to the third quantization vector;
determining a second quantization loss function value according to the second quantization mapping loss function value, a preset second parameter and a second bypass quantization loss function value, wherein the second parameter is larger than the first parameter;
if the second quantization loss function value is larger than a preset second quantization loss function threshold, performing gradient backward calculation on the second quantization loss function value according to a random gradient descent algorithm, and updating parameters of a quantization mapping model and parameters of an auxiliary quantization model of a second neural network model;
and if the second quantization loss function value is smaller than or equal to a preset second quantization loss function value threshold, finishing the second-stage training of the quantization mapping model and the auxiliary quantization model of the second neural network model, and taking the quantization mapping model of the second neural network model obtained by the second-stage training as the quantization mapping model of the first neural network model.
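The threshold-controlled loop described above (keep updating while the quantization loss exceeds the preset threshold, stop once it falls to or below it) can be sketched as a toy control flow. The geometric loss decay below is only a stand-in for real SGD updates; the initial loss, threshold, and decay factor are illustrative assumptions.

```python
def train_stage(initial_loss, threshold, decay=0.8, max_iters=1000):
    """Run updates until the stage's quantization loss value drops to the
    preset threshold (or an iteration cap is hit), mirroring the stopping
    rule of the first and second quantization stages."""
    loss, iters = initial_loss, 0
    while loss > threshold and iters < max_iters:
        loss *= decay  # stand-in for one SGD gradient-descent step
        iters += 1
    return loss, iters
```

In the patented scheme the same rule is applied twice, first with the stage-one loss (M = 0.01) and then with the stage-two loss (N = 1.0).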
In one embodiment, the second quantization stage of the second neural network model is the second-stage training of its quantization mapping model and auxiliary quantization model. In the second quantization stage, quantization mapping learning is performed and the bypass branch effect is fine-tuned; specifically, the quantization mapping model is trained jointly with the bypass branch (auxiliary quantization model), which avoids the quantization result collapse that often occurs when deep learning is applied directly to quantize the output of the embedding model, and thus achieves a better quantization effect. The second quantization stage comprises:
(1) Parameter initialization.
For example, the parameters of the second neural network model are initialized using the Table 1 and Table 2 parameters learned in the pre-training stage, while the newly added Map layer (quantization mapping model) and quantization layer (auxiliary quantization model) are initialized with a Gaussian distribution with variance 0.01 and mean 0.
(2) Setting of the learning parameters.
For example, for quantization learning in the second quantization stage, it is necessary to learn the parameters in table 1, table 2, table 3, and table 4, where the parameters in table 1 and table 2 are parameters that need to be fine-tuned, and the parameters in table 3 and table 4 are newly added parameters.
(3) Setting of the learning rate.
For example, the quantization mapping model and the auxiliary quantization model of the second neural network model both adopt a learning rate of lr1 = 0.005; after every 10 rounds of iteration, the learning rate lr becomes 0.1 times its previous value.
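The stated step schedule (base rate 0.005, multiplied by 0.1 after every 10 rounds) can be written as a small helper; the function name is illustrative.

```python
def learning_rate(epoch, base_lr=0.005, drop=0.1, every=10):
    """Step-decay schedule: lr1 = 0.005, scaled by 0.1 after each block of
    10 iteration rounds, as described for the quantization stages."""
    return base_lr * (drop ** (epoch // every))
```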
(4) Learning.
For example, the learning process includes: performing epoch rounds of iteration over the full data, processing the full set of samples in each iteration, until the average epoch loss at some epoch no longer decreases.
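The stopping rule ("until the average epoch loss no longer decreases") amounts to halting at the first epoch whose average loss fails to improve on the previous one. A sketch, with the exact tie-breaking behavior an assumption:

```python
def stop_epoch(avg_epoch_losses):
    """Return the index of the first epoch whose average loss did not
    decrease relative to the previous epoch; if every epoch improves,
    return the last epoch index."""
    for i in range(1, len(avg_epoch_losses)):
        if avg_epoch_losses[i] >= avg_epoch_losses[i - 1]:
            return i
    return len(avg_epoch_losses) - 1
```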
(5) The specific operations in each iteration are as follows: each group of batch-size sample pairs is taken as one batch, the full set of sample pairs is divided into N×b batches, a plurality of triplet samples is acquired for each batch, and the following steps E1-E3 are executed:
Step E1, performing the forward calculation of the second neural network model.
Specifically, all parameters of the model are set to a learnable state. During training, the neural network performs a forward pass on the input picture to obtain the prediction results: the second feature embedding vector em output by the feature embedding model of the second neural network model, the feature vector q_Map output by its quantization mapping model (Map layer), and the feature vector q output by its auxiliary quantization model.
Step E2, calculation of a second quantized loss function value for the second neural network model.
Specifically, the second quantization loss function value is determined from the second quantization mapping loss function value, a preset second parameter, and the second bypass quantization loss function value, as shown in Equation (7):

Loss of the second quantization stage = Lq + N × Lmap    Equation (7)

where "Loss of the second quantization stage" is the second quantization loss function, Lq is the second bypass quantization loss function, Lmap is the second quantization mapping loss function, and N is the second parameter; for example, N = 1.0.
For example, the bypass quantization loss Lq (the second bypass quantization loss function) may be composed of the sign quantization loss Lcoding and the metric loss (triplet loss, Ltriplet) of the quantization result q (the feature vector q output by the auxiliary quantization model of the second neural network model).
For example, for the embedding quantization mapping loss Lmap (the second quantization mapping loss function), the mapping layer outputs a 1x128-dimensional result, and Lmap is calculated according to Equation (6): it is a K-L divergence loss that keeps the distributions of the two predictions consistent, making the post-mapping quantization distribution match the bypass quantization distribution. Here p(x) is the bypass quantization output (the 1x128 output of Table 4) and q(x) is the mapped quantization result (the 1x128 output of the Table 3 mapping layer).
Step E3, updating the parameters of the quantization mapping model and the auxiliary quantization model of the second neural network model.
Specifically, gradient backward calculation is performed on the second quantization loss function value using SGD (Stochastic Gradient Descent), thereby updating the parameters of the quantization mapping model and of the auxiliary quantization model of the second neural network model.
It should be noted that in the training of the second neural network model, the first 5 epochs belong to the first quantization stage and the subsequent epochs to the second quantization stage.
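The epoch-to-stage mapping just described (first 5 epochs use the stage-one Lmap weight M = 0.01, later epochs the stage-two weight N = 1.0) can be sketched as follows; the helper names are assumptions.

```python
def lmap_weight(epoch, m=0.01, n=1.0, stage1_epochs=5):
    """First quantization stage for the first 5 epochs (weight M),
    second quantization stage afterwards (weight N)."""
    return m if epoch < stage1_epochs else n

def quantization_loss(lq, lmap, epoch):
    """Equations (2) and (7): Loss = Lq + weight * Lmap, with the weight
    selected by the current epoch's stage."""
    return lq + lmap_weight(epoch) * lmap
```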
In one embodiment, when the business does not require that the fine-tuned embedding features (second feature embedding vectors) remain unchanged (for example, when the embedding has not yet been applied at scale in the business, or the cost of updating the embedding is low), the embedding loss Lq-triplet can be learned in both the pre-training stage and the quantization stages (the first and second quantization stages); learning the embedding loss in the quantization stages avoids the situation where the embedding representation makes a good quantization effect hard to obtain. That is, when the embedding in the first and second quantization stages need not be identical to that of the pre-training stage, the first quantization loss function value is calculated as shown in Equation (8) and the second quantization loss function value as shown in Equation (9).
Loss of the first quantization stage = Lq + M × Lmap + Lq-triplet    Equation (8)

Loss of the second quantization stage = Lq + N × Lmap + Lq-triplet    Equation (9)
The embodiments of the present application provide at least the following beneficial effects:
(1) An effective end-to-end learning method for direct quantization from embedding: a staged scheme of pre-training followed by quantization fine-tuning is designed to ensure that multiple tasks with different convergence behaviors are learned effectively. With the multitask weighted learning of conventional multitask learning, smooth convergence of both tasks is hard to guarantee and the embedding recall easily degrades.
(2) The quantization result has metric capability, avoiding fracture of the quantization space for similar samples: metric learning keeps the quantization distances of similar samples small, and a satisfactory recall rate can be obtained by adjusting the Hamming distance threshold.
(3) Collapse of conventional cascaded quantization is avoided: without the bypass branch, learning the map quantization directly from the embedding easily collapses, i.e., all of the quantization's metric effect concentrates in some M positions of the quantization vector (e.g., 10 bits out of 128), and quantization features at only a 10/128 scale are far from sufficient for retrieval recall. Because the quantization effect is concentrated in those M bits, the bits after M+1 no longer have sufficient distinctiveness, where distinctiveness means recalling only similar samples rather than dissimilar ones; the quantization effect therefore collapses after M bits, and in the most severe collapse the full stock is recalled.
(4) Quantization optimization of stock retrieval features is supported: the method adapts flexibly to different business applications. For example, when the stock features come from an already-trained embedding feature model (feature embedding model) but another quantization method such as PQ is used, quantization can be supported simply by running the quantization fine-tuning stage; in the present application, the embedding can be input directly to obtain the quantization vector, so extraction is faster and the effect is better.
(5) The accuracy and recall rate of index retrieval are improved: quantization metric learning based on bypass-branch optimization can raise recall without a large-scale reduction in accuracy, and avoids the quantization result collapse that often occurs when deep learning is applied directly to quantize the output of the embedding model.
(6) Quantization coding extraction efficiency is improved for stock features: by learning only the quantization module, stock feature quantization both preserves similarity and avoids space fracture, and also avoids the time cost of running the CNN model again for quantization extraction; the quantization code can be extracted by simply inputting the embedding into the quantization model (quantization mapping model).
In order to better understand the method provided by the embodiment of the present application, the following further describes the scheme of the embodiment of the present application with reference to an example of a specific application scenario.
The image retrieval method provided by the embodiment of the application is applied to the image retrieval scene in the field of image identification.
The image retrieval method provided by the embodiment of the application is applied to image sub-bucket/quantization index retrieval (one bucket corresponds to one quantization code), and the image retrieval comprises the following steps of F1-F7:
in step F1, an image embedding model (second neural network model) learned based on the similarity metric is trained.
Step F2, extracting quantization vectors and embedding vectors (feature embedding vectors) of the inventory, where the quantization vectors and the embedding vectors are in one-to-many relationship, and the embedding vectors and the images are in one-to-one relationship.
In step F3, the quantized vector is used as an index for search (for bucket search) and associated with the embedding vector.
Step F4, during retrieval, find the K nearest quantization codes in the stock according to the quantization code of the query image (image to be queried).

Step F5, acquire the embedding vectors of the images associated with these quantization indexes, obtaining the candidate image retrieval recall.

Step F6, calculate the Euclidean distance between the embedding vector of each recalled image and the embedding vector of the query image, and sort the Euclidean distances from small to large.

Step F7, take the topK samples in the ranking as the final recall result.
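Steps F4 to F7 can be sketched as a small numpy routine: bucket lookup by quantization code, then Euclidean re-ranking of the associated embeddings. The index layout (a dict from quantization code to a list of stock embeddings) and all names are illustrative assumptions; for simplicity the sketch looks up only the exact bucket rather than the K nearest codes.

```python
import numpy as np

def retrieve(query_code, query_emb, index, top_k=2):
    """index: dict mapping a quantization code (tuple) to the list of stock
    embedding vectors associated with that bucket."""
    # F4/F5: recall the candidate embeddings from the matching bucket
    candidates = index.get(tuple(query_code), [])
    # F6: Euclidean distance to the query embedding, sorted ascending
    dists = [float(np.linalg.norm(query_emb - np.asarray(e))) for e in candidates]
    order = np.argsort(dists)
    # F7: keep the topK nearest samples as the final recall result
    return [candidates[i] for i in order[:top_k]]
```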
For example, as shown in fig. 4, the first neural network model includes a basic feature and feature embedding model and a quantization mapping model (Map), where the basic feature and feature embedding model includes a feature map extraction model (CNN) and a feature embedding model (embedding). A query image (i.e., an image to be queried) is input into the first neural network model to obtain its quantization vector (1,0,0) and feature embedding vector (0.2,0.8,0.3,0.3). During retrieval, the nearest quantization vector (1,0,0) in the image library is found according to the quantization vector (1,0,0) of the image to be queried; the feature embedding vectors corresponding to the quantization vector (1,0,0) in the image library are (0.2,0.7,0.3,0.3), (0.1,0.5,0.2,0.2), and (0.2,0.4,0.2,0.3). The Euclidean distances between these feature embedding vectors and the feature embedding vector of the query image are calculated and sorted from small to large, and the top samples in the ranking are selected as the final recall result.
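The worked example above can be reproduced numerically; the snippet below computes the Euclidean distances from the query embedding to the three stock embeddings in the (1,0,0) bucket and ranks them.

```python
import numpy as np

# Query embedding and the stock embeddings sharing quantization code (1,0,0),
# exactly as in the fig. 4 example.
query = np.array([0.2, 0.8, 0.3, 0.3])
stock = np.array([[0.2, 0.7, 0.3, 0.3],
                  [0.1, 0.5, 0.2, 0.2],
                  [0.2, 0.4, 0.2, 0.3]])

dists = np.linalg.norm(stock - query, axis=1)  # Euclidean distances
order = np.argsort(dists)                      # ascending: nearest first
```

The first stock embedding (0.2,0.7,0.3,0.3) is nearest, at distance 0.1, so it heads the recall ranking.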
In one embodiment, the neural network model provided by the embodiment of the present application can output the feature embedding and the quantization simultaneously during model inference, making embedding generation and quantization more efficient.
In one embodiment, for an image whose embedding has already been extracted, the quantization result can be obtained quickly by inputting the embedding directly into the quantization model (quantization mapping model), avoiding the resource consumption of performing CNN extraction again.
Referring to fig. 5, fig. 5 shows a flowchart of an image retrieval method provided in an embodiment of the present application. The method may be executed by any electronic device, such as a server or a terminal; for convenience of description, in the following description of some alternative embodiments, the server or the terminal is taken as the execution subject of the method. As shown in fig. 5, the image retrieval method provided in the embodiment of the present application includes the following steps:
s501, a plurality of triple samples corresponding to the image sample set are obtained.
S502, training the basic features and the feature embedded models of the second neural network model in a pre-training stage based on a plurality of triple samples, and determining triple loss function values in the pre-training stage; and if the triple loss function value is less than or equal to the preset triple loss function threshold, ending the training in the pre-training stage.
S503, performing first quantization stage training on the quantization mapping model and the auxiliary quantization model of the second neural network model, and determining a first quantization loss function value of the first quantization stage; and if the first quantization loss function value is smaller than or equal to a preset first quantization loss function value threshold, ending the training of the first quantization stage.
S504, carrying out second quantization stage training on the quantization mapping model and the auxiliary quantization model of the second neural network model, and determining a second quantization loss function value of the second quantization stage; and if the second quantization loss function value is smaller than or equal to a preset second quantization loss function value threshold, ending the second-stage training.
And S505, acquiring an image to be inquired.
S506, inputting the image to be inquired into the basic feature and feature embedding model of the first neural network model, and obtaining a first feature embedding vector of the image to be inquired through feature embedding processing.
Specifically, the basic feature and feature embedding model of the first neural network model may be the basic feature and feature embedding model of the trained second neural network model.
And S507, inputting the first feature embedding vector to a quantization mapping model of the first neural network model, and obtaining a first quantization vector through mapping quantization processing.
In particular, the quantization mapping model of the first neural network model may be the quantization mapping model of the trained second neural network model.
And S508, determining at least one index vector corresponding to the first quantization vector according to the first quantization vector, the index vector list of the preset image library and the preset first distance threshold.
S509, determining at least one image from the plurality of images as a similar image corresponding to the image to be inquired according to the first feature embedding vector, the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector and a preset second distance threshold.
The embodiments of the present application provide at least the following beneficial effects:
an auxiliary quantization model of the second neural network model is designed alongside its quantization mapping model to learn, from the bottom-layer features, quantization features with similarity metric capability; knowledge-distillation learning from the auxiliary quantization model by the quantization mapping model of the second neural network model avoids collapse of the embedding quantization learning; and different task weights are adjusted over multiple stages of the training strategy, improving the learning effect of each branch and the final quantization effect, thereby achieving an end-to-end quantization model with similarity representation.
The image retrieval apparatus 60 includes a first processing module 601, a second processing module 602, a third processing module 603, and a fourth processing module 604, as shown in fig. 6.
The first processing module 601 is configured to obtain an image to be queried;
a second processing module 602, configured to determine a first feature embedding vector and a first quantization vector of an image to be queried, where the first quantization vector is used to represent a quantization feature corresponding to the first feature embedding vector;
a third processing module 603, configured to determine, according to the first quantized vector and an index vector list of a preset image library, at least one index vector corresponding to the first quantized vector, where the index vector is used to represent a quantized feature corresponding to a feature embedding vector of an image sample in the preset image library;
the fourth processing module 604 is configured to determine, according to the feature embedding vectors of the multiple images in the preset image library corresponding to the first feature embedding vector and the at least one index vector, at least one image from the multiple images as a similar image corresponding to the image to be queried.
In an embodiment, the second processing module 602 is specifically configured to:
inputting a triple sample corresponding to an image to be queried into a basic feature and feature embedding model of a first neural network model, and obtaining a first feature embedding vector of the image to be queried through feature embedding processing, wherein the basic feature and feature embedding model comprises a feature map extraction model and a feature embedding model;
and inputting the first characteristic embedding vector into a quantization mapping model of the first neural network model, and obtaining a first quantization vector through mapping quantization processing.
In an embodiment, the third processing module 603 is specifically configured to:
and if the distance between the first quantization vector and at least one index vector in the index vector list of the preset image library is smaller than a preset first distance threshold, determining at least one index vector in the index vector list as at least one index vector corresponding to the first quantization vector.
In an embodiment, the fourth processing module 604 is specifically configured to:
and if the first distances between the first feature embedding vector and the feature embedding vectors of the plurality of images are smaller than a preset second distance threshold, sort the first distances from small to large and take the images corresponding to the top-ranked first distances as the similar images corresponding to the image to be queried.
In one embodiment, the first processing module 601 is further configured to:
obtaining a plurality of triple samples corresponding to the image sample set, wherein the triple samples comprise image samples, positive samples corresponding to the image samples and negative samples corresponding to the image samples;
inputting the multiple triple samples into the basic features and the feature embedding model of the second neural network model to obtain second feature embedding vectors corresponding to the multiple triple samples;
inputting second feature embedding vectors corresponding to the multiple triple samples into the quantization mapping model of the second neural network model to obtain second quantization vectors corresponding to the multiple triple samples;
determining a first loss function value of a quantization mapping model of a second neural network model according to a second quantization vector corresponding to the plurality of triple samples;
updating parameters of a quantitative mapping model of the second neural network model based on the first loss function values;
and if the first loss function value is smaller than or equal to the preset first loss function value threshold, finishing the training of the second neural network model, and taking the trained second neural network model as the first neural network model.
In an embodiment, the first processing module 601 is specifically configured to:
the basic features and the feature embedding model of the second neural network model comprise a feature map extraction model and a feature embedding model, and the second neural network model further comprises an auxiliary quantization model;
inputting the multiple triple samples into the basic features and the feature embedding model of the second neural network model to obtain second feature embedding vectors corresponding to the multiple triple samples, wherein the second feature embedding vectors comprise:
inputting the multiple triple samples into a feature map extraction model of a second neural network model to obtain depth feature maps corresponding to the multiple triple samples;
inputting the depth feature map into a feature embedding model of a second neural network model to obtain second feature embedding vectors corresponding to a plurality of triple samples;
determining a first loss function value of a quantization mapping model of a second neural network model according to a second quantization vector corresponding to the plurality of triple samples; if the first loss function value is larger than a preset first loss function threshold value, performing gradient backward calculation according to the first loss function value, and updating parameters of a quantitative mapping model of the second neural network model; if the first loss function value is less than or equal to a preset first loss function value threshold, ending the training of the second neural network model, and taking the trained second neural network model as the first neural network model, including:
inputting second characteristic embedding vectors corresponding to the multiple triple samples into a quantization mapping model of a second neural network model to obtain second quantization vectors corresponding to the multiple triple samples;
inputting the depth feature maps corresponding to the triple samples into an auxiliary quantization model of a second neural network model to obtain a third quantization vector;
determining a first quantization mapping loss function value of a quantization mapping model of a second neural network model according to the second quantization vector and the third quantization vector;
determining a first bypass quantization loss function value of an auxiliary quantization model of the second neural network model according to the third quantization vector;
determining a first quantization loss function value according to the first quantization mapping loss function value, a preset first parameter and a first bypass quantization loss function value;
updating parameters of a quantization mapping model and parameters of an auxiliary quantization model of the second neural network model based on the first quantization loss function values;
and if the first quantization loss function value is smaller than or equal to a preset first quantization loss function value threshold, finishing the training of the second neural network model, and taking the trained second neural network model as the first neural network model.
In one embodiment, the method for training the basic features and the feature embedding model of the second neural network model comprises the following steps:
determining basic features of a second neural network model and triple loss function values of the feature embedding model according to second feature embedding vectors corresponding to the multiple triple samples;
updating the basic characteristics of the second neural network model and the parameters of the characteristic embedded model according to the triplet loss function values;
and if the triple loss function value is less than or equal to the preset triple loss function threshold, finishing the training of the basic characteristic and characteristic embedded model of the second neural network model, and taking the basic characteristic and characteristic embedded model of the second neural network model obtained by training as the basic characteristic and characteristic embedded model of the first neural network model.
Applying the embodiments of the present application has at least the following beneficial effects:
determining a first feature embedding vector and a first quantization vector of the image to be queried, wherein the first quantization vector represents the quantization feature corresponding to the first feature embedding vector; determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library, wherein each index vector represents the quantization feature corresponding to the feature embedding vector of an image sample in the preset image library; and determining at least one image from the plurality of images as a similar image corresponding to the image to be queried according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector. In this way, the accuracy and recall rate of image retrieval are improved, and the collapse of quantization results that occurs when deep learning is applied directly to quantize the output of the feature embedding model is avoided.
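The two-stage retrieval procedure summarized above, coarse filtering by quantization vectors against the index vector list, then fine ranking by feature embedding distance, can be sketched as follows. The Euclidean metric, the two distance thresholds, and the index-to-image mapping structure are illustrative assumptions:

```python
import numpy as np

def retrieve(query_emb, query_quant, index_vectors, gallery_embs, gallery_by_index,
             quant_thresh=1.0, emb_thresh=1.0, top_k=5):
    """Two-stage retrieval sketch.

    query_emb, query_quant : first feature embedding vector and first quantization vector
    index_vectors          : index vector list of the preset image library
    gallery_embs           : {image id: feature embedding vector}
    gallery_by_index       : {index vector position: [image ids]} (assumed structure)
    quant_thresh, emb_thresh : the preset first and second distance thresholds (assumed values)
    """
    # stage 1: keep index vectors within the first distance threshold
    hits = [i for i, iv in enumerate(index_vectors)
            if np.linalg.norm(query_quant - iv) < quant_thresh]
    # collect candidate images stored under the matched index vectors
    candidates = sorted({img for i in hits for img in gallery_by_index[i]})
    # stage 2: rank candidates by embedding distance, keep those under the second threshold
    scored = [(np.linalg.norm(query_emb - gallery_embs[c]), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d < emb_thresh]
    return [c for _, c in sorted(scored)[:top_k]]
```

The coarse stage only compares short quantization vectors, so most of the library is skipped before any full embedding distance is computed, which is the source of the efficiency claimed for the index vector list.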
An embodiment of the present application further provides an electronic device, a schematic structural diagram of which is shown in fig. 7. The electronic device 4000 shown in fig. 7 includes a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmitting and/or receiving data. In practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that performs computational functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or another type of static storage device capable of storing static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store a computer program and that can be read by a computer, without limitation.
The memory 4003 is used to store the computer program for executing the embodiments of the present application, and its execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Electronic devices include, but are not limited to, servers, terminals, and the like.
Applying the embodiments of the present application has at least the following beneficial effects:
determining a first feature embedding vector and a first quantization vector of the image to be queried, wherein the first quantization vector represents the quantization feature corresponding to the first feature embedding vector; determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library, wherein each index vector represents the quantization feature corresponding to the feature embedding vector of an image sample in the preset image library; and determining at least one image from the plurality of images as a similar image corresponding to the image to be queried according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector. In this way, the accuracy and recall rate of image retrieval are improved, and the collapse of quantization results that occurs when deep learning is applied directly to quantize the output of the feature embedding model is avoided.
Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps and corresponding content of the foregoing method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program; when the computer program is executed by a processor, the steps and corresponding content of the foregoing method embodiments can be implemented.
Based on the same principle as the method provided by the embodiments of the present application, the embodiments of the present application also provide a computer program product or a computer program that includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in any of the optional embodiments of the present application.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing describes only optional implementations of some implementation scenarios of this application. It should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of this application, without departing from that technical idea, also fall within the protection scope of the embodiments of this application.

Claims (10)

1. An image retrieval method, comprising:
acquiring an image to be inquired;
determining a first feature embedding vector and a first quantization vector of the image to be queried, wherein the first quantization vector is used for representing quantization features corresponding to the first feature embedding vector;
determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library, wherein the index vector is used for representing quantization characteristics corresponding to characteristic embedding vectors of image samples in the preset image library;
and determining at least one image from the plurality of images as a similar image corresponding to the image to be queried according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector.
2. The method of claim 1, wherein the determining the first feature embedding vector and the first quantization vector of the image to be queried comprises:
inputting the triple sample corresponding to the image to be queried to a basic feature and feature embedding model of a first neural network model, and obtaining a first feature embedding vector of the image to be queried through feature embedding processing, wherein the basic feature and feature embedding model comprises a feature map extraction model and a feature embedding model;
and inputting the first feature embedding vector to a quantization mapping model of the first neural network model, and obtaining a first quantization vector through mapping quantization processing.
3. The method according to claim 1, wherein determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and a list of index vectors of a predetermined image library comprises:
and if the distance between the first quantization vector and at least one index vector in an index vector list of a preset image library is smaller than a preset first distance threshold, determining the at least one index vector in the index vector list as the at least one index vector corresponding to the first quantization vector.
4. The method according to claim 1, wherein the determining, according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector, at least one image from the plurality of images as a similar image corresponding to the image to be queried comprises:
and if first distances between the first feature embedding vector and the feature embedding vectors of images in the plurality of images are smaller than a preset second distance threshold, sorting the first distances in ascending order, and taking the images corresponding to the top-ranked first distances as the similar images corresponding to the image to be queried.
5. The method according to claim 1, further comprising, before the acquiring of the image to be queried:
obtaining a plurality of triple samples corresponding to an image sample set, wherein the triple samples comprise image samples, positive samples corresponding to the image samples and negative samples corresponding to the image samples;
inputting the plurality of triple samples into a basic feature and feature embedding model of a second neural network model to obtain second feature embedding vectors corresponding to the plurality of triple samples;
inputting second feature embedding vectors corresponding to the multiple triple samples into a quantization mapping model of the second neural network model to obtain second quantization vectors corresponding to the multiple triple samples;
determining a first loss function value of a quantization mapping model of the second neural network model according to a second quantization vector corresponding to the plurality of triplet samples;
updating parameters of a quantized mapping model of the second neural network model based on the first loss function values;
and if the first loss function value is smaller than or equal to a preset first loss function value threshold, ending the training of the second neural network model, and taking the trained second neural network model as the first neural network model.
6. The method of claim 5, wherein the base features and feature embedding models of the second neural network model comprise a feature map extraction model and a feature embedding model, and the second neural network model further comprises an auxiliary quantization model;
inputting the plurality of triple samples into the basic features and the feature embedding model of the second neural network model to obtain second feature embedding vectors corresponding to the plurality of triple samples, wherein the method comprises the following steps:
inputting the multiple triple samples into a feature map extraction model of the second neural network model to obtain depth feature maps corresponding to the multiple triple samples;
inputting the depth feature map into a feature embedding model of the second neural network model to obtain second feature embedding vectors corresponding to the multiple triple samples;
determining a first loss function value of a quantization mapping model of the second neural network model according to a second quantization vector corresponding to the plurality of triplet samples; if the first loss function value is larger than a preset first loss function threshold value, performing gradient backward calculation according to the first loss function value, and updating parameters of a quantitative mapping model of the second neural network model; if the first loss function value is less than or equal to a preset first loss function value threshold, ending the training of the second neural network model, and using the trained second neural network model as the first neural network model, including:
inputting second feature embedding vectors corresponding to the multiple triple samples into a quantization mapping model of the second neural network model to obtain second quantization vectors corresponding to the multiple triple samples;
inputting the depth feature maps corresponding to the triple samples into an auxiliary quantization model of the second neural network model to obtain a third quantization vector;
determining a first quantization mapping loss function value of a quantization mapping model of the second neural network model from the second quantization vector and the third quantization vector;
determining a first bypass quantization loss function value of an auxiliary quantization model of the second neural network model according to the third quantization vector;
determining a first quantization loss function value according to the first quantization mapping loss function value, a preset first parameter and the first bypass quantization loss function value;
updating parameters of a quantization mapping model and parameters of an auxiliary quantization model of the second neural network model based on the first quantization loss function values;
and if the first quantization loss function value is smaller than or equal to a preset first quantization loss function value threshold, ending the training of the second neural network model, and taking the trained second neural network model as the first neural network model.
7. The method of claim 5 or 6, wherein the training of the basic feature and feature embedding model of the second neural network model comprises:
determining a triplet loss function value of the basic feature and feature embedding model of the second neural network model according to the second feature embedding vectors corresponding to the plurality of triple samples;
updating parameters of the basic feature and feature embedding model of the second neural network model according to the triplet loss function value;
and if the triplet loss function value is less than or equal to a preset triplet loss function threshold, ending the training of the basic feature and feature embedding model of the second neural network model, and taking the trained basic feature and feature embedding model of the second neural network model as the basic feature and feature embedding model of the first neural network model.
8. An image retrieval apparatus, comprising:
the first processing module is used for acquiring an image to be inquired;
the second processing module is used for determining a first feature embedding vector and a first quantization vector of the image to be queried, wherein the first quantization vector is used for representing quantization features corresponding to the first feature embedding vector;
the third processing module is used for determining at least one index vector corresponding to the first quantization vector according to the first quantization vector and an index vector list of a preset image library, wherein the index vector is used for representing quantization characteristics corresponding to characteristic embedding vectors of image samples in the preset image library;
and the fourth processing module is used for determining at least one image from the plurality of images as a similar image corresponding to the image to be queried according to the first feature embedding vector and the feature embedding vectors of the plurality of images in the preset image library corresponding to the at least one index vector.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111124020.6A 2021-09-24 2021-09-24 Image retrieval method, device, equipment and computer readable storage medium Pending CN114329006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111124020.6A CN114329006A (en) 2021-09-24 2021-09-24 Image retrieval method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111124020.6A CN114329006A (en) 2021-09-24 2021-09-24 Image retrieval method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114329006A true CN114329006A (en) 2022-04-12

Family

ID=81045695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111124020.6A Pending CN114329006A (en) 2021-09-24 2021-09-24 Image retrieval method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114329006A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089534A1 (en) * 2016-09-27 2018-03-29 Canon Kabushiki Kaisha Cross-modiality image matching method
CN109906451A (en) * 2016-09-07 2019-06-18 脸谱公司 Use the similarity searching of polyphone
CN110069650A (en) * 2017-10-10 2019-07-30 阿里巴巴集团控股有限公司 A kind of searching method and processing equipment
US20190236167A1 (en) * 2018-01-31 2019-08-01 Microsoft Technology Licensing, Llc. Multi-Modal Visual Search Pipeline for Web Scale Images
CN110413813A (en) * 2019-06-25 2019-11-05 宁波图达信息技术有限公司 A kind of same or similar image search method
CN112182131A (en) * 2020-09-28 2021-01-05 中国电子科技集团公司第五十四研究所 Remote sensing image recommendation method based on multi-attribute fusion
CN113127672A (en) * 2021-04-21 2021-07-16 鹏城实验室 Generation method, retrieval method, medium and terminal of quantized image retrieval model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI ZHONGLI; WANG JIAZHEN: "Generalizable steganography algorithm based on the DCT domain", Computer Engineering and Design, no. 13, 16 July 2006 (2006-07-16) *

Similar Documents

Publication Publication Date Title
EP3867819A1 (en) Semi-supervised person re-identification using multi-view clustering
CN111523621A (en) Image recognition method and device, computer equipment and storage medium
CN111612134B (en) Neural network structure searching method and device, electronic equipment and storage medium
CN111382868A (en) Neural network structure search method and neural network structure search device
US9852177B1 (en) System and method for generating automated response to an input query received from a user in a human-machine interaction environment
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN113821670B (en) Image retrieval method, device, equipment and computer readable storage medium
CN111259812B (en) Inland ship re-identification method and equipment based on transfer learning and storage medium
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN113704522B (en) Artificial intelligence-based target image rapid retrieval method and system
CN113705313A (en) Text recognition method, device, equipment and medium
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113592041B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
CN114358109A (en) Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN114091594A (en) Model training method and device, equipment and storage medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN116127060A (en) Text classification method and system based on prompt words
CN114329006A (en) Image retrieval method, device, equipment and computer readable storage medium
CN113704528A (en) Clustering center determination method, device and equipment and computer storage medium
CN112417260A (en) Localized recommendation method and device and storage medium
CN114372205B (en) Training method, device and equipment of characteristic quantization model
CN111091198A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination