CN116051917A - Method for training image quantization model, method and device for searching image - Google Patents

Method for training image quantization model, method and device for searching image

Info

Publication number
CN116051917A
CN116051917A
Authority
CN
China
Prior art keywords
quantization
image
sample
loss
sample image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111265044.3A
Other languages
Chinese (zh)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111265044.3A
Publication of CN116051917A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method for training an image quantization model, and a method and an apparatus for retrieving images, applicable to the field of artificial intelligence and intended to solve the problem of low image quantization accuracy. The method comprises the following steps: performing quantization feature extraction processing on each sample image to obtain the quantization feature of each sample image; performing quantization evaluation on each obtained quantization feature based on a binary quantization evaluation strategy to determine a binary quantization loss of the image quantization model; performing semantic evaluation on each quantization feature based on a semantic evaluation strategy to determine a semantic quantization loss of the image quantization model; and outputting the trained target image quantization model when, based on the binary quantization loss and the semantic quantization loss, the image quantization model to be trained meets the training target. The resulting quantization features characterize images more accurately, which in turn improves the recall performance of image retrieval.

Description

Method for training image quantization model, method and device for searching image
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method for training an image quantization model, a method for retrieving an image, and an apparatus thereof.
Background
With the continuous development of technology, devices can provide retrieval services not only for text but also for images. A device may use the quantized features of an image as index tags to characterize the image, so that when a retrieval is performed for an image, the corresponding image can be found through its index tag.
In the related art, the quantized features of an image are obtained based on product quantization (Product Quantization, PQ). This approach loses a substantial amount of feature information while generating the quantized features, so the quantized features cannot accurately characterize the image; the accuracy of image quantization is therefore low, which in turn degrades the recall performance of image retrieval.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, a device, and a storage medium for quantizing an image, so as to solve the problem of low image quantization accuracy.
In a first aspect, a method for training an image quantization model is provided, based on each sample image, performing multiple rounds of iterative training on the image quantization model to be trained, where each round of training process includes:
respectively carrying out quantization characteristic extraction processing on each sample image to obtain respective quantization characteristics of each sample image;
performing quantization evaluation on each obtained quantization feature based on a binary quantization evaluation strategy, and determining a binary quantization loss of the image quantization model, wherein the binary quantization evaluation strategy is used for evaluating the binarization degree of a quantization feature and the degree to which it characterizes the features of the corresponding sample image;
performing semantic evaluation on each quantized feature based on a semantic evaluation strategy, and determining semantic quantization loss of an image quantization model, wherein the semantic evaluation strategy is used for evaluating semantic characterization degree of the quantized feature for a corresponding sample image;
and based on the binary quantization loss and the semantic quantization loss, when the image quantization model to be trained meets a training target, outputting a trained target image quantization model.
In a second aspect, there is provided a method of retrieving an image, comprising:
obtaining a reference image;
performing quantization feature extraction processing on the reference image based on a target image quantization model obtained by training by any one of the methods according to the first aspect, so as to obtain quantization features of the reference image;
and screening candidate images with quantized features matched with the quantized features of the reference image from the candidate images based on the quantized features of the candidate images, and taking the candidate images as target retrieval images.
In a third aspect, an apparatus for training an image quantization model is provided, comprising:
and the feature extraction module is used for: the method comprises the steps of respectively carrying out quantization characteristic extraction processing on each sample image to obtain respective quantization characteristics of each sample image;
training module: the method comprises the steps of carrying out quantization evaluation on each obtained quantization characteristic based on a binary quantization evaluation strategy, and determining a binary quantization loss of an image quantization model, wherein the binary quantization evaluation strategy is used for evaluating the binarization degree of the quantization characteristic and the characteristic characterization degree of a corresponding sample image;
the training module is also configured to: performing semantic evaluation on each quantized feature based on a semantic evaluation strategy, and determining semantic quantization loss of an image quantization model, wherein the semantic evaluation strategy is used for evaluating semantic characterization degree of the quantized feature for a corresponding sample image;
the training module is also configured to: and based on the binary quantization loss and the semantic quantization loss, when the image quantization model to be trained meets a training target, outputting a trained target image quantization model.
Optionally, the sample images include a plurality of sample images that form a plurality of triplets, wherein each triplet includes a reference sample image, a positive sample image similar to the reference sample image, and a negative sample image dissimilar to the reference sample image;
The training module is specifically configured to:
determining a triplet quantization loss of an image quantization model based on quantization features of each of a reference sample image, a positive sample image, and a negative sample image contained by each of the plurality of triplet samples;
respectively carrying out binarization processing on the respective quantization characteristics of each sample image to determine the binarization loss of an image quantization model;
and determining the binary quantization loss of the image quantization model based on a weighted sum of the triplet quantization loss and the binarization loss.
Optionally, the training module is specifically configured to:
classifying the quantization characteristics of each sample image respectively to determine the classification loss of an image quantization model;
performing product quantization processing on the quantization characteristics of each sample image respectively to determine product quantization loss of an image quantization model;
and determining the semantic quantization loss of the image quantization model based on the weighted sum of the classification loss and the product quantization loss.
Optionally, the training module is specifically configured to:
obtaining the codebooks, wherein the codebooks are generated based on the quantization features of the sample images, and each codebook contains a plurality of index vectors used as indexes of quantization features;
performing index marking on the quantization feature of each sample image based on the plurality of index vectors contained in each codebook, to obtain the index feature corresponding to each quantization feature;
and determining product quantization loss of the image quantization model based on the index features corresponding to the sample images.
Optionally, the training module is specifically configured to:
for each quantized feature of each sample image, the following operations are respectively executed:
dividing the quantized feature into a sub-feature sequence containing a plurality of sub-features, wherein the number of the sub-features contained in the sub-feature sequence is the same as the number of the codebooks, and each codebook corresponds to the position of the sub-feature in the sub-feature sequence one by one;
for each codebook, determining the similarity between a plurality of index vectors contained in the codebook and the sub-features at corresponding positions in the sub-feature sequence, and screening the index vector with the maximum similarity for each position in the sub-feature sequence;
and obtaining index features corresponding to the quantization features based on the screened index vectors.
Optionally, the training module is specifically configured to:
determining an index quantization loss of an image quantization model based on errors between the index features corresponding to the sample images and the corresponding quantization features;
determining a codebook triplet quantization loss of an image quantization model based on index features corresponding to a reference sample image, a positive sample image, and a negative sample image contained in each of a plurality of triplet samples, wherein the plurality of triplet samples are composed based on sample images in each sample image;
classifying the index vectors contained in each codebook respectively to determine codebook classification loss of an image quantization model;
and determining a product quantization loss of an image quantization model based on a weighted sum of the index quantization loss, the codebook triplet quantization loss, and the codebook classification loss.
Optionally, each of the sample images has a corresponding sample classification;
the training module is further configured to:
after the quantization features of the sample images are obtained by performing quantization feature extraction processing on each sample image, when the codebooks are determined to exist, performing the following operations on each group of sample images corresponding to the same sample classification, wherein each codebook contains a plurality of index vectors used as indexes of quantization features:
clustering the quantization features of the plurality of sample images to obtain cluster centers;
for each cluster center, determining the similarity between the quantization features of the plurality of sample images and that cluster center, and selecting the sample image whose quantization feature has the largest similarity as a target sample image;
and updating each codebook based on the quantization characteristic of each screened target sample image.
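For illustration only (this sketch is not part of the original application), the codebook-update operation described above might be expressed roughly as follows; the array shapes and the use of scikit-learn's KMeans are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def update_codebook_for_class(features, n_centers):
    """For one sample classification: cluster the quantization features,
    then keep, for each cluster center, the quantization feature with the
    largest similarity to it (here: smallest L2 distance)."""
    km = KMeans(n_clusters=n_centers, n_init=10).fit(features)
    kept = []
    for center in km.cluster_centers_:
        dists = np.linalg.norm(features - center, axis=1)  # smaller distance = larger similarity
        kept.append(features[np.argmin(dists)])            # target sample's quantization feature
    return np.stack(kept)                                  # rows used to update the codebook
```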
Optionally, the triplet sample is obtained by the training module by adopting the following method:
determining a plurality of similar sample image pairs based on the respective sample images, wherein the similar sample image pairs contain two sample images that are similar images to each other;
for the plurality of similar sample image pairs, respectively performing the following operations:
respectively taking two sample images contained in a similar sample image pair as a reference sample image and a positive sample image;
selecting, from the similar sample image pairs other than the current similar sample image pair, sample images as negative sample images;
a triplet sample is created based on the reference sample image, the positive sample image, and the negative sample image.
In a fourth aspect, there is provided an apparatus for retrieving an image, comprising:
the acquisition module is used for: for obtaining a reference image;
the processing module is used for: for performing a quantization feature extraction process on the reference image based on a target image quantization model trained using the apparatus according to any one of the third aspects, to obtain a quantization feature of the reference image;
the processing module is further configured to: and screening candidate images with quantized features matched with the quantized features of the reference image from the candidate images based on the quantized features of the candidate images, and taking the candidate images as target retrieval images.
Optionally, the processing module is specifically configured to:
obtaining respective codebooks, wherein each codebook contains a plurality of index vectors, and the index vectors are used as marks of quantization characteristics;
screening each target index vector matched with the quantization characteristic of the reference image from each index vector based on a plurality of index vectors contained in each codebook;
the target search image is determined based on the candidate images corresponding to the quantized features using the respective target index vectors as markers.
Optionally, the processing module is specifically configured to:
dividing the quantization feature of the reference image into a sub-feature sequence containing a plurality of sub-features, wherein the number of sub-features contained in the sub-feature sequence is the same as the number of codebooks, and each codebook corresponds one-to-one to a position of a sub-feature in the sub-feature sequence;
and screening the target index vector with the maximum similarity with the sub-features at the corresponding positions from the index vectors according to the positions in the sub-feature sequence.
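A sketch of this retrieval-side matching (illustrative only; the tensor shapes and the distance-based similarity are assumptions, not the patent's prescription):

```python
import torch

def match_query(u_query, codebooks):
    """u_query: quantization feature of the reference image, length 128;
    codebooks: (K, 64, 128 // K). Return, for each codebook position, the
    id of the index vector most similar to the query's sub-feature there."""
    K = codebooks.shape[0]
    subs = u_query.reshape(K, -1)  # sub-feature sequence, one entry per codebook
    return [int(torch.argmin(torch.norm(codebooks[k] - subs[k], dim=1)))
            for k in range(K)]
```

Candidate images whose quantization features are marked by these target index vectors would then be recalled.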
In a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as in the first aspect or as in the second aspect.
In a sixth aspect, there is provided a computer device comprising:
a memory for storing program instructions;
and a processor for invoking program instructions stored in said memory, performing the method according to the first aspect or according to the second aspect in accordance with the obtained program instructions.
In a seventh aspect, there is provided a computer readable storage medium storing computer executable instructions for causing a computer to perform the method of the first aspect or the second aspect.
In the embodiments of the application, during the training of the image quantization model, on one hand, training based on the binary quantization loss drives each quantization feature output by the model to be as close as possible to one of the two binary values, improving the binarization degree and achieving a better quantization effect from the binarization angle; at the same time, the same binary quantization loss drives the quantization features to characterize the pre-quantization features of the sample images as faithfully as possible, improving the degree of feature characterization and achieving a better quantization effect from the feature characterization angle.
On the other hand, training based on the semantic quantization loss gives the quantization features output by the image quantization model a degree of semantic characterization capability. Compared with obtaining quantization features from the visual pixel values of images alone, obtaining quantization features that reflect the abstract semantics of images improves the accuracy with which the quantization features characterize the corresponding images, achieving a better quantization effect from the semantic angle. The embodiments of the application thus improve image quantization accuracy from multiple angles, which in turn improves recall performance during image retrieval.
Drawings
Fig. 1 is an application scenario of a method for training an image quantization model according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for training an image quantization model according to an embodiment of the present application;
FIG. 3a is a schematic diagram of a method for training an image quantization model according to an embodiment of the present application;
FIG. 3b is a schematic diagram II of a method for training an image quantization model according to an embodiment of the present application;
FIG. 4 is a schematic diagram III of a method for training an image quantization model according to an embodiment of the present application;
FIG. 5a is a schematic diagram of a method for searching an image according to an embodiment of the present disclosure;
FIG. 5b is a second flowchart of a method for retrieving images according to an embodiment of the present disclosure;
FIG. 5c is a schematic diagram of a method for retrieving images according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an apparatus for training an image quantization model according to an embodiment of the present application;
fig. 7 is a schematic diagram ii of an apparatus for retrieving images according to an embodiment of the present application;
fig. 8 is a schematic diagram III of a device for training an image quantization model or a device for retrieving an image according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Some of the terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) ImageNet pre-training model:
imagenet is a large generic object recognition open source dataset. The image pre-training model is based on image training a deep learning network model, and the obtained parameter weight of the model is the image pre-training model.
(2) Binarization:
binarization includes a hash process that learns quantized features obtained by binarization for replacing conventional floating-point features, which may be quantized into hash features.
(3) PQ quantization and vector quantization:
the PQ quantization is carried out by dividing the D-dimensional feature vector into k segments, wherein each segment has dimension length D/k, k features are obtained, and vector quantization is carried out on the ith feature as an index. During retrieval, k segments of the query sample can be split first, similarity calculation of k segment centers to all index centers is performed, a sample image under the latest topk index combinations (the combined index closest to the query quantization distance after k index vectors are combined) is recalled, and the similarity of the sample image is calculated.
The image feature is quantized into N vectors by dividing the image feature vector into N non-overlapping regions, each region being represented by a vector (cluster center), the quantization vector of each feature being the region in which it is located. During retrieval, the corresponding quantized vector can be recalled through the image features, and then the similarity between the image features and the sample images under the quantized vector can be compared.
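As a purely illustrative sketch of the PQ indexing just described (not part of the patent text; the array shapes, the 64-center codebooks, and the use of NumPy are assumptions):

```python
import numpy as np

def pq_assign(feature, codebooks):
    """Split a D-dimensional feature into k segments of length D/k and
    replace each segment by the index of its nearest center in the
    corresponding codebook."""
    k = codebooks.shape[0]
    segments = feature.reshape(k, -1)
    indexes = []
    for i in range(k):
        dists = np.linalg.norm(codebooks[i] - segments[i], axis=1)
        indexes.append(int(np.argmin(dists)))  # nearest center id = PQ index
    return indexes

# Example: D = 128 split into k = 8 segments, 64 centers per segment.
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(8, 64, 16))
query = rng.normal(size=128)
print(pq_assign(query, codebooks))
```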
Embodiments of the present application relate to the field of artificial intelligence (Artificial Intelligence, AI) and are designed based on machine learning (Machine Learning, ML) and computer vision (Computer Vision, CV) techniques.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science, which studies the design principles and implementation methods of various machines in an attempt to understand the essence of intelligence, and to produce a new intelligent machine that can react in a similar way to human intelligence, so that the machine has the functions of sensing, reasoning and decision.
Artificial intelligence is a comprehensive discipline that relates to a wide range of fields and involves both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and electromechanical integration. Artificial intelligence software technologies mainly comprise computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent transportation, and other major directions. With the development and progress of artificial intelligence, it is being applied in ever more fields, such as smart home, smart customer service, virtual assistants, smart speakers, smart marketing, smart wearable devices, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical treatment, internet of vehicles, and smart transportation, and it is believed that with further technological development artificial intelligence will be applied in still more fields, delivering increasingly important value. The scheme provided by the embodiments of the application relates to artificial intelligence technologies such as deep learning and augmented reality, and is specifically described through the following embodiments.
Machine learning is a multi-field interdisciplinary, and relates to multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like, and a specially researched computer acquires new knowledge or skills by simulating learning behaviors of human beings, reorganizes the existing knowledge structure and enables the computer to continuously improve the performance of the computer.
Machine learning is the core of artificial intelligence, which is the fundamental way for computers to have intelligence, applied throughout various areas of artificial intelligence; the core of machine learning is deep learning, which is a technology for realizing machine learning. Machine learning typically includes deep learning, reinforcement learning, transfer learning, induction learning, artificial neural networks, teaching learning, etc., and deep learning includes convolutional neural networks (Convolutional Neural Networks, CNN), deep confidence networks, recurrent neural networks, automatic encoders, generation countermeasure networks, etc.
Computer vision is a comprehensive discipline integrating multiple disciplines such as computer science, signal processing, physics, application mathematics, statistics, neurophysiology and the like, and is also a challenging important research direction in the scientific field. Computer vision is a subject for researching how to make a machine "look at", and more specifically, the subject refers to that various imaging systems such as a camera and a computer are used for replacing human visual organs, machine vision processing such as recognition, tracking and measurement is performed on targets, and collected images are processed into images which are more suitable for human eyes to observe or are transmitted to an instrument to detect through further map processing.
Computer vision is taken as a scientific subject, and by researching related theory and technology, the computer is tried to be provided with the capability of observing and understanding the world through visual organs like human beings, and an artificial intelligence system capable of acquiring information from images or multidimensional data is established. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (Optical Character Recognition, OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and mapping, autopilot, intelligent transportation, and the like, in addition to common biometric technologies such as face recognition, fingerprint recognition, and the like.
It should be noted that, in the embodiments of the present application, "first", "second", or "third" is used only to distinguish nouns and does not indicate any order among them. For example, in the first similar sample image pair and the second similar sample image pair described below, "first" and "second" only distinguish the two pairs and do not imply that one precedes the other.
The application fields of the method for training an image quantization model and the method for retrieving images provided by the embodiments of the application are briefly described below.
With the continuous development of technology, devices can provide retrieval services not only for text but also for images. A device may use the quantized features of an image as index tags to characterize the image, so that when a retrieval is performed for an image, the corresponding image can be found through its index tag.
In the related art, the quantized features of an image are obtained based on product quantization (Product Quantization, PQ), and a substantial amount of feature information is lost while the quantized features are generated, so the quantized features cannot accurately characterize the image. For example, when quantized features are obtained for two mutually similar images based on PQ, the heavy feature loss in the quantization process may make the two quantized features different or dissimilar, so that the quantized features fail to reflect the similarity between the two images.
Therefore, in the related art, the quantized features cannot accurately characterize images; the accuracy of image quantization is low, which in turn degrades the recall performance of image retrieval.
In order to solve the problem of low image quantization accuracy, the application provides a method for training an image quantization model. In the method, multiple rounds of iterative training are performed on the image quantization model to be trained based on the sample images. In each round, quantization feature extraction processing is performed on each sample image to obtain its quantization feature. Quantization evaluation is performed on each obtained quantization feature based on a binary quantization evaluation strategy to determine the binary quantization loss of the image quantization model, where the strategy evaluates the binarization degree of a quantization feature and the degree to which it characterizes the features of the corresponding sample image. Semantic evaluation is performed on each quantization feature based on a semantic evaluation strategy to determine the semantic quantization loss of the image quantization model, where the strategy evaluates the degree to which a quantization feature semantically characterizes the corresponding sample image. Based on the obtained binary quantization loss and semantic quantization loss, when the image quantization model to be trained meets the training target, the trained target image quantization model is output.
In the embodiments of the application, during the training of the image quantization model, on one hand, training based on the binary quantization loss drives each quantization feature output by the model to be as close as possible to one of the two binary values, improving the binarization degree and achieving a better quantization effect from the binarization angle; at the same time, the same binary quantization loss drives the quantization features to characterize the pre-quantization features of the sample images as faithfully as possible, improving the degree of feature characterization and achieving a better quantization effect from the feature characterization angle.
On the other hand, training based on the semantic quantization loss gives the quantization features output by the image quantization model a degree of semantic characterization capability. Compared with obtaining quantization features from the visual pixel values of images alone, obtaining quantization features that reflect the abstract semantics of images improves the accuracy with which the quantization features characterize the corresponding images, achieving a better quantization effect from the semantic angle. The embodiments of the application thus improve image quantization accuracy from multiple angles, which in turn improves recall performance during image retrieval.
The application scenario of the method for training an image quantization model provided in the present application is described below.
Referring to fig. 1, a schematic view of an application scenario of the method for training an image quantization model or the method for retrieving images provided in the present application is shown. The application scenario includes a client 101 and a server 102, which can communicate with each other. The communication may use wired technology, for example through a network cable or a serial cable; it may also use wireless technology, for example Bluetooth or wireless fidelity (Wi-Fi), which is not specifically limited.
The client 101 generally refers to a device that can train an image quantization model, or a device that can provide a reference image for retrieval, for example, a terminal device, a third party application that the terminal device can access, or a web page that the terminal device can access, or the like. Terminal devices include, but are not limited to, cell phones, computers, intelligent transportation devices, intelligent appliances, and the like. The server 102 generally refers to a device that can train an image quantization model, or a device that can retrieve a target retrieval image based on a reference image, for example, a terminal device or a server, or the like. Servers include, but are not limited to, cloud servers, local servers, or associated third party servers, and the like. Both the client 101 and the server 102 can adopt cloud computing to reduce occupation of local computing resources; cloud storage may also be employed to reduce the occupation of local storage resources.
As an embodiment, the client 101 and the server 102 may be the same device, which is not limited in particular. In this embodiment, the description is given by taking the example that the client 101 and the server 102 are different devices respectively.
The following specifically describes a method for training an image quantization model provided in the embodiment of the present application based on fig. 1, with a server as a main body. Referring to fig. 2, a flowchart of a method for training an image quantization model according to an embodiment of the present application is shown.
Before S201, i.e., before quantization feature extraction processing is performed on each sample image to obtain its quantization feature, the server first obtains the sample images, and performs multiple rounds of iterative training on the image quantization model to be trained based on the sample images until the model meets the training target, thereby obtaining the trained target image quantization model.
The sample images may be obtained by the server from a network resource, or read from other devices, etc., which is not specifically limited. Each sample image may have a corresponding sample classification representing the image category of that sample image; sample classifications may include bar, park, office, coffee shop, etc., and may further include crowd, selfie, full-body photo, etc., without limitation.
The sample images may form a plurality of similar sample image pairs; the server may determine these pairs based on the similarity between every two sample images, which is not specifically limited. Each similar sample image pair contains two sample images that are similar to each other. The sample images may accordingly form a plurality of triplet samples, which may be determined based on the similar sample image pairs. Each triplet sample contains a reference sample image, a positive sample image similar to the reference sample image, and a negative sample image dissimilar to the reference sample image.
The process of creating triplet samples is described below by taking one similar sample image pair as an example; the process is similar for the other pairs and is not repeated here.
The server may take the two sample images of the similar sample image pair as the reference sample image and the positive sample image, respectively, and select one sample image from each of the other similar sample image pairs as a negative sample image. Each time a negative sample image is selected, the server can establish a triplet sample based on the reference sample image, the positive sample image, and that negative sample image.
As an embodiment, when selecting negative sample images, the server may select from only a part of the other similar sample image pairs, to avoid generating too many training samples and inflating the resources occupied by the training process. If the image quantization model is trained with a negative sample image that is extremely dissimilar to the reference sample image or the positive sample image, the model reaches the training target quickly, but it learns little about the differences between positive and negative samples, so the extracted quantization features are inaccurate; negative sample images that are extremely dissimilar to the reference or positive sample image therefore have little training value.
Instead, the server can select negative sample images that are relatively similar to the reference sample image or the positive sample image and form triplet samples with them, obtaining triplet samples of greater training value. Specifically, the server selects one sample image from each of the other similar sample image pairs, calculates its similarity to the reference sample image or the positive sample image, sorts the selected sample images in descending order of similarity, and keeps the sample images ranked before a specified position as negative sample images. Finally, a plurality of triplet samples are established from the reference sample image, the positive sample image, and each of the selected negative sample images.
For example, suppose there are three similar sample image pairs in total: a first pair containing sample images A and B, a second pair containing sample images C and D, and a third pair containing sample images E and F. Taking the first pair, sample images A and B serve as the reference sample image and the positive sample image. One sample image is then selected from each of the other two pairs, e.g., sample image D and sample image E. The server may establish one triplet sample directly from sample images A, B, and D, and another from sample images A, B, and E.
The server may also calculate the distance between sample image A and sample image D as their similarity, and likewise the distance between sample image A and sample image E. According to the calculated distances, it sorts sample images D and E in ascending order of distance, i.e., descending order of similarity, selects the sample images ranked before the specified position, and establishes triplet samples with sample images A and B. When there are more similar sample image pairs, e.g., 100 pairs, the specified position may be 20: the first 20 sample images are selected, and 20 triplet samples are established with sample image A and sample image B.
The server may determine a greater number of pairs of similar sample images, e.g., 256 pairs of similar sample images, so that more negative sample images may be selected and more triplet samples may be obtained to train the image quantization model.
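A minimal sketch of this hard-negative triplet construction (illustrative only; the data layout and helper names are assumptions, as the patent does not prescribe an implementation):

```python
import numpy as np

def build_triplets(pairs, features, top_n=20):
    """pairs: list of (reference_id, positive_id) similar-image pairs.
    features: dict image_id -> feature vector.
    For each pair, candidate negatives are drawn from the other pairs,
    ranked by distance to the reference (ascending distance = descending
    similarity), and the top_n hardest negatives are kept."""
    triplets = []
    for a, p in pairs:
        candidates = [i for q in pairs if q != (a, p) for i in q]
        candidates.sort(key=lambda i: np.linalg.norm(features[a] - features[i]))
        for n in candidates[:top_n]:
            triplets.append((a, p, n))
    return triplets
```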
After the sample images are obtained, the server can perform multiple rounds of iterative training on the image quantization model to be trained based on them. When the image quantization model meets the training target, the trained target image quantization model is output; when it does not, the model parameters are adjusted and training continues with the adjusted parameters. Each round of iterative training is similar, so one round is described below as an example.
S201, respectively carrying out quantization characteristic extraction processing on each sample image to obtain respective quantization characteristics of each sample image.
The server can use the image quantization model to perform quantization feature extraction processing on each sample image to obtain its quantization feature. When performing quantization feature extraction on a sample image, the model first extracts image features and then quantizes them to obtain the quantization feature of the sample image. Referring to fig. 3a, the image quantization model may include a feature extraction module and a feature quantization module, where the feature extraction module extracts image features and the feature quantization module quantizes them. The server can input the sample image into the feature extraction module to obtain the image features output by the feature extraction module, and then input the obtained image features into the feature quantization module to obtain the quantization feature of the sample image output by the feature quantization module.
As an embodiment, the feature extraction module in the image quantization model may directly adopt the feature extraction module of a trained image recognition model; for example, it may be the feature extraction module of the trained ResNet-101 model, in which case the network structure of the feature extraction module in the image quantization model is that of the feature extraction module in ResNet-101, and its model parameters are those of the feature extraction module in ResNet-101.
Please refer to table 1, which is a network structure of the feature extraction module.
TABLE 1
(Table 1, which lists the layer-by-layer network structure of the feature extraction module, is reproduced as an image in the original publication.)
After the sample image is input, the first convolution layer applies a 64-channel 7x7 convolution kernel with stride 2, producing first shallow image features of size 300x500. The second convolution layer applies 3x3 max pooling with stride 2 to the first shallow image features, and then three residual modules, each consisting of a 64-channel 1x1 convolution kernel, a 64-channel 3x3 convolution kernel, and a 256-channel 1x1 convolution kernel, convolve the pooled data to obtain second shallow image features. Proceeding in the same way, the image features of the sample image are output by the fifth convolution layer. Referring to fig. 3b, which shows the structure of a residual module, convolution layer A has a 1x1 kernel with 64 channels, convolution layer B a 3x3 kernel with 64 channels, and convolution layer C a 1x1 kernel with 256 channels. The second through fifth convolution layers build their network structure by stacking residual modules.
As an embodiment, the feature quantization module in the image quantization model may be implemented with a fully connected layer; Table 2 shows its structure.
TABLE 2
Network layer name | Output data size | Network structure
Pooling layer | 1x2048 | Max pooling layer
Quantization layer | 1x128 | Fully connected layer
The image feature output by the feature extraction module has size 19x32. To quantize it into a vector, the pooling layer max-pools the image feature into a 1x2048 vector, and the fully connected layer then quantizes this vector into a 1x128 quantization feature.
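Putting Tables 1 and 2 together, a rough PyTorch sketch of the feature extraction module plus the feature quantization module might look as follows (illustrative only; the torchvision backbone and the pretrained-weight identifier are assumptions standing in for the ResNet-101 described above):

```python
import torch.nn as nn
import torchvision

class ImageQuantizationModel(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")
        # Keep conv1..conv5 (drop torchvision's average pool and fc head).
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveMaxPool2d(1)   # -> 1x2048 after flatten
        self.quantize = nn.Linear(2048, 128)  # quantization layer, 1x128

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)
        return self.quantize(f)               # quantization feature u
```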
S202, carrying out quantization evaluation on each obtained quantization characteristic based on a binary quantization evaluation strategy, and determining the binary quantization loss of the image quantization model.
After obtaining the quantization features of the sample images, the server obtains the quantization features of the reference sample image, the positive sample image, and the negative sample image contained in each of the plurality of triplet samples. The server may then perform quantization evaluation on each obtained quantization feature based on a binary quantization evaluation strategy to determine the binary quantization loss of the image quantization model, where the strategy evaluates the binarization degree of a quantization feature and its degree of feature characterization for the corresponding sample image.
The process of performing quantization evaluation on each obtained quantized feature is described below from the feature characterization perspective and the binarization perspective, respectively.
Feature characterization angle:
The server may determine the triplet quantization loss of the image quantization model based on the quantization features of the reference sample image, the positive sample image, and the negative sample image contained in each of the plurality of triplet samples. For a triplet sample, the server may calculate a first distance between the quantization features of the reference sample image and the positive sample image, and a second distance between the quantization features of the reference sample image and the negative sample image; based on the error between the first distance and the second distance, the server determines the triplet quantization loss L_triple of the image quantization model, as in formula (1):

L_triple = max(||x_a - x_p|| - ||x_a - x_n|| + α, 0)    (1)

where x_a denotes the quantization feature of the reference sample image in the triplet sample, x_p that of the positive sample image, x_n that of the negative sample image, ||·|| denotes the L2 distance, and α denotes a boundary value, e.g., 20.
The server adjusts the model parameters of the image quantization model based on the triplet quantization loss so that the second distance exceeds the first distance by at least the boundary value (e.g., 20), enabling the image quantization model to accurately distinguish similar images from dissimilar images according to the quantization features of sample images.
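A direct PyTorch rendering of formula (1) (illustrative; averaging over the batch is an added assumption):

```python
import torch

def triplet_quantization_loss(x_a, x_p, x_n, alpha=20.0):
    d_pos = torch.norm(x_a - x_p, dim=1)  # first distance (reference-positive)
    d_neg = torch.norm(x_a - x_n, dim=1)  # second distance (reference-negative)
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()  # formula (1)
```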
Binarization angle:
The server can also perform binarization processing on the quantization feature of each sample image to determine the binarization loss of the image quantization model. The goal of binarization is to obtain a binary feature whose elements are all very close to 1 or very close to -1. Binarization can be implemented with a sign function that maps values greater than or equal to zero to 1 and values less than zero to -1. The server may apply the sign function to each element contained in the quantization feature to obtain the binarized elements that form the binary feature, as in formula (2):
b_i = 1, if u_i ≥ 0;    b_i = -1, if u_i < 0    (2)

where u_i denotes the i-th element of the quantization feature u and b_i denotes the i-th element of the binary feature b.
Based on a regression loss, the server can determine the L2 distance between the binary feature and the quantization feature, obtaining the binarization loss L_quantization of the image quantization model, as in formula (3). The smaller the distance, the higher the binarization degree of the quantization feature; the larger the distance, the lower the binarization degree.

L_quantization = Σ_{i=1}^{128} (b_i - u_i)^2    (3)
Since the output size when the quantization feature is obtained is 1x128, the quantization feature contains 128 elements, i.e., i takes values in [1, 128]. The server adjusts the model parameters of the image quantization model through the L2 distance between the binary feature and the quantization feature, i.e., the binarization loss, so that this distance stabilizes at a small value, that is, the quantization feature stabilizes at a high binarization degree.
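Formulas (2) and (3) can be sketched together as follows (illustrative; the summed-square form of (3) follows the reconstruction above, and the batch mean is an assumption):

```python
import torch

def binarization_loss(u):
    # u: (batch, 128) quantization features.
    b = torch.where(u >= 0, torch.ones_like(u), -torch.ones_like(u))  # formula (2)
    return ((b - u) ** 2).sum(dim=1).mean()                           # formula (3)
```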
After the server performs quantization evaluation on the obtained quantization features from both the feature characterization angle and the binarization angle, it obtains the triplet quantization loss and the binarization loss, based on which it can determine the binary quantization loss of the image quantization model. For example, the server may determine the binary quantization loss L_hash of the image quantization model as a weighted sum of the obtained triplet quantization loss and binarization loss, as in formula (4):

L_hash = w_1 · L_triple + w_2 · L_quantization    (4)
where w_1 is the weight of the triplet quantization loss and w_2 is the weight of the binarization loss. Because the image quantization model must first ensure the accuracy of feature characterization and only on that basis ensure the binarization degree of the quantization features, the weight of the triplet quantization loss can be set to a larger value and the weight of the binarization loss to a smaller value, so that the binarization training process does not unduly affect the feature characterization training process; for example, the weight of the triplet quantization loss is set to 1 and the weight of the binarization loss to 0.1.
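With the example weights from the text (w_1 = 1, w_2 = 0.1), formula (4) reduces to one line (sketch only):

```python
def binary_quantization_loss(l_triple, l_quantization, w1=1.0, w2=0.1):
    return w1 * l_triple + w2 * l_quantization  # formula (4)
```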
S203, carrying out semantic evaluation on each quantized feature based on a semantic evaluation strategy, and determining semantic quantization loss of the image quantization model.
The semantic evaluation strategy is used to evaluate the degree to which a quantization feature semantically characterizes the corresponding sample image. Because the quantization process loses considerable feature information, the quantization features of two similar images can easily turn out different or dissimilar, lowering the accuracy with which quantization features characterize images. Training the image quantization model from the semantic angle can therefore further improve the accuracy of the quantization features it outputs: since the semantics expressed by two similar images are necessarily similar, semantic-angle training prevents the model from outputting different or dissimilar quantization features for two similar images, thereby improving the accuracy of the quantization features the image quantization model outputs.
Since the semantics of an image can be measured from multiple perspectives, embodiments of the present application are presented separately from two semantic perspectives.
Semantic angle one:
the server can respectively classify the quantization characteristics of each sample image to determine the classification loss of the image quantization model.
Since the semantics characterized by an image generally correspond to a category (for example, an image depicting the interior of a bar corresponds to the bar classification, and an image depicting a person's head corresponds to the selfie classification), a classification module can classify the quantization feature of a sample image to obtain its prediction classification, with the sample classification of the image serving as supervision information for determining the classification loss of the image quantization model. Whether a quantization feature accurately characterizes the semantics of its sample image can thus be judged by whether the prediction classification derived from it matches the sample classification, and training the image quantization model with this classification loss gives the model a degree of semantic measurement capability.
Table 3 shows the network structure of the classification portion of the image quantization model.
TABLE 3
Network layer name | Output data size | Network structure
Classification layer | 1x200 | Fully connected layer
The classification layer may be implemented with a fully connected layer; its output data size 1x200 corresponds to the 200 sample classifications, and the classification layer predicts, based on the quantization feature, a probability for each of the 200 sample classifications.
When determining the classification loss of the image quantization model, the server may use a cross-entropy loss function to determine the error between the prediction classification of each sample image and its corresponding sample classification, and determine the classification loss of the image quantization model based on the obtained errors; see formula (5).
L_class = (1/N) · Σ_{i=1..N} L_i, where L_i = −y_i · log(p_i) (5)
where N represents the number of sample images, i indexes the i-th of the N sample images, p_i represents the prediction classification corresponding to the i-th sample image, y_i represents the sample classification corresponding to the i-th sample image, and L_i represents the classification loss of the i-th sample image.
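As an illustration, a minimal sketch of the classification layer and cross-entropy loss (assuming PyTorch; the 128-dimensional quantized feature and the batch size are hypothetical, while the 200 classes and the fully-connected layer follow Table 3):

```python
import torch
import torch.nn as nn

# Classification layer per Table 3: one fully-connected layer mapping the
# quantized feature to 200 sample classes (the 128-dim input is assumed).
classifier = nn.Linear(128, 200)

quantized = torch.randn(32, 128)        # hypothetical batch of quantized features
labels = torch.randint(0, 200, (32,))   # sample classifications used as supervision

logits = classifier(quantized)
# Cross-entropy per formula (5), averaged over the N images in the batch.
class_loss = nn.CrossEntropyLoss()(logits, labels)
```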
Semantic angle two:
the server can perform product quantization processing on the quantized features of each sample image to determine the product quantization loss of the image quantization model.
The product quantization process may include several steps. The server may first obtain the codebooks, which are generated based on the quantized features of the sample images; each codebook contains a plurality of index vectors used as indexes of quantized features. The server then index-marks the quantized feature of each sample image based on the index vectors contained in the codebooks, obtaining the index feature corresponding to each quantized feature. Finally, the server determines the product quantization loss of the image quantization model based on the index features corresponding to the sample images.
Referring to Table 4, which shows the network structure of the codebook part of the image quantization model.
TABLE 4
(Table 4 is reproduced as an image in the original publication and its contents are not recoverable here; it describes the network structure used to obtain each codebook.)
The sub-features referred to in Table 4 are described below.
The index-marking process is described below taking the quantized feature of one sample image as an example; the index marking of the other quantized features is similar and is not repeated here.
The server may divide the quantized feature into a sub-feature sequence containing a plurality of sub-features, where the number of sub-features in the sequence equals the number of codebooks and each codebook corresponds one-to-one to a position in the sub-feature sequence. For each codebook, the server may determine the similarity between each of the index vectors it contains and the sub-feature at the corresponding position in the sub-feature sequence, and select for each position the index vector with the greatest similarity. The similarity between an index vector and a sub-feature may be determined by computing the distance between them: the smaller the distance, the greater the similarity, and vice versa. The server then obtains the index feature corresponding to the quantized feature based on the selected index vectors.
For example, suppose there are K codebooks and each codebook contains 64 index vectors. To obtain the index feature for the quantized feature of the i-th sample image, the server may divide that quantized feature into a sub-feature sequence containing K sub-features, where each position in the sequence corresponds to one codebook.
For the sub-feature at the first position in the sub-feature sequence, the server computes the Hamming distance between that sub-feature and each of the 64 index vectors contained in the first codebook, and selects the index vector with the smallest Hamming distance as the first index vector corresponding to the quantized feature of the i-th sample image. Proceeding in the same way for the sub-feature at each position, against the codebook corresponding to that position, the server obtains the K index vectors corresponding to the quantized feature of the i-th sample image.
After the index vector corresponding to the sub-feature at each position has been obtained, i.e., the K index vectors, their weights are determined as follows: at the first position, the selected index vector has weight 1 and the remaining vectors weight 0; at the second position, the selected index vector has weight 1 and the remaining vectors weight 0; and so on, yielding the weights of all K index vectors.
The index feature R_i corresponding to the quantized feature of the i-th sample image is then determined based on the sum of the products of the K index vectors and their respective weights; see formula (6).
R_i = Σ_{k=1..K} C_k · Z_ik (6)
where C_k (k from 1 to K) represents the weight corresponding to the index vector selected for the sub-feature at the k-th position, and Z_ik represents the index vector selected for the sub-feature at the k-th position in the sub-feature sequence corresponding to the quantized feature of the i-th sample image. For example, if the quantized feature of the i-th sample image is a 1x128-dimensional vector, the corresponding index feature is likewise a 1x128-dimensional vector.
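The index-marking procedure may be sketched as follows (assuming NumPy; L2 distance stands in for the similarity here, although the example above mentions Hamming distance, and all sizes other than the 64 codewords per codebook are hypothetical):

```python
import numpy as np

def index_feature(quant_feat, codebooks):
    """Index-mark one quantized feature and return its index feature R_i.

    quant_feat: (D,) quantized feature, e.g. D = 128 (size assumed).
    codebooks:  list of K arrays, each (64, D // K); one codebook per position.
    """
    K = len(codebooks)
    sub_feats = np.split(quant_feat, K)           # sub-feature sequence of length K
    selected = []
    for sub, book in zip(sub_feats, codebooks):
        # Smaller distance means greater similarity (L2 used for illustration).
        dists = np.linalg.norm(book - sub, axis=1)
        selected.append(book[np.argmin(dists)])   # one-hot weights pick one codeword
    return np.concatenate(selected)               # same size as quant_feat

# Hypothetical sizes: K = 4 codebooks with 64 index vectors of dimension 32.
rng = np.random.default_rng(0)
books = [rng.normal(size=(64, 32)) for _ in range(4)]
R_i = index_feature(rng.normal(size=128), books)
```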
When the server determines the product quantization loss of the image quantization model based on the index features corresponding to the sample images, the product quantization loss may be obtained from several losses. Three of them are described below as examples; the product quantization loss may include only one of them, may be obtained from any two of them, or may be obtained from other losses, which are not detailed here. In the embodiment of the present application, the product quantization loss obtained from the following three losses is taken as the example.
Index quantization loss:
The index quantization loss of the image quantization model is determined based on the errors between the index features corresponding to the sample images and the corresponding quantized features.
After obtaining the index features of the sample images, the server may determine the index quantization loss of the image quantization model based on the error between each index feature and the corresponding quantized feature. Referring to formula (7), the server can determine the index quantization loss L_code-error of the image quantization model based on the mean square error between each index feature R_i and the corresponding quantized feature H_i.
L_code-error = (1/N) · Σ_{i=1..N} ||R_i − H_i||² (7)
where H_i represents the quantized feature of the i-th sample image.
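A minimal sketch of formula (7) (assuming NumPy; whether the average runs only over samples or also over dimensions is an assumption):

```python
import numpy as np

def index_quant_loss(R, H):
    # R, H: (N, D) index features and quantized features of the N sample images.
    return np.mean(np.sum((R - H) ** 2, axis=1))
```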
Codebook triplet quantization loss:
and determining the codebook triplet quantization loss of the image quantization model based on index features corresponding to the reference sample image, the positive sample image and the negative sample image respectively contained in the plurality of triplet samples.
After obtaining the index features of the sample images, the server may recompute the triplet quantization loss over the plurality of triplet samples composed from the sample images, obtaining the codebook triplet quantization loss L_code-triplet. Because the index features are drawn from the limited set of index vectors contained in the codebooks, whereas the quantized features of the sample images are unconstrained, an index feature cannot be guaranteed to coincide exactly with its quantized feature. Therefore, when the triplet quantization loss is recomputed on index features, the value of α may be reduced accordingly; for example, if α is 20 for the triplet quantization loss, α for the codebook triplet quantization loss may be 16.
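A hedged sketch of the codebook triplet quantization loss (assuming PyTorch and a standard margin-based triplet form, which this passage does not spell out; the margin values 20 and 16 follow the example above):

```python
import torch
import torch.nn.functional as F

def codebook_triplet_loss(anchor, positive, negative, alpha=16.0):
    # anchor/positive/negative: (N, D) index features of the triplet samples.
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Margin alpha reduced from 20 to 16 because index features cannot
    # coincide exactly with the quantized features.
    return torch.clamp(d_pos - d_neg + alpha, min=0).mean()
```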
Codebook classification loss:
and respectively classifying a plurality of index vectors contained in each codebook to determine codebook classification loss of the image quantization model.
Since the quantized features of an image represent its semantics to some extent, the index features corresponding to the quantized features can also represent the semantics of the image. Therefore, the classification module can classify the index vectors contained in each codebook to obtain the prediction classification of each index vector. The server may determine the sample classification corresponding to each index vector from the sample classifications of the sample images, and determine the codebook classification loss of the image quantization model using those sample classifications as supervision information. In this way the index vectors contained in each codebook can accurately represent image semantics, so that the index features determined from the codebooks characterize the semantics of the sample images to a certain extent and provide semantic measurement capability.
When determining the codebook classification loss L_code-class of the image quantization model, the server may determine it in the same manner as the classification loss described above, which is not repeated here.
After obtaining the index quantization loss, the codebook triplet quantization loss, and the codebook classification loss, the server may determine the product quantization loss L_pq of the image quantization model based on their weighted sum; see formula (8).
L_pq = w_3 · L_code-error + w_4 · L_code-triplet + w_5 · L_code-class (8)
The weights corresponding to the index quantization loss, the codebook triplet quantization loss, and the codebook classification loss may be set according to the actual situation. Since the index features cannot coincide exactly with the quantized features, the weight of the index quantization loss may be set to a smaller value than the weights of the codebook triplet quantization loss and the codebook classification loss; for example, 0.01, 0.5, and 0.5 respectively.
After obtaining the classification loss and the product quantization loss, the server may determine the semantic quantization loss L_laug of the image quantization model based on their weighted sum; see formula (9).
L_laug = w_6 · L_class + w_7 · L_pq (9)
The weights of the classification loss and the product quantization loss may be set according to actual situations, for example, the weights of the classification loss and the product quantization loss are set to 0.2 and 0.1, respectively.
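A minimal sketch combining formulas (8) and (9) (plain Python; the default weight values follow the examples above and are otherwise illustrative):

```python
def semantic_quant_loss(L_class, L_code_error, L_code_triplet, L_code_class,
                        w3=0.01, w4=0.5, w5=0.5, w6=0.2, w7=0.1):
    # Formula (8): product quantization loss as a weighted sum of three terms.
    L_pq = w3 * L_code_error + w4 * L_code_triplet + w5 * L_code_class
    # Formula (9): semantic quantization loss folds in the classification loss.
    return w6 * L_class + w7 * L_pq
```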
As an embodiment, the quantized features of the sample images have no characterization capability during the first round of training; therefore, after obtaining the quantized features of the sample images, the steps of S203 other than determining the classification loss may be skipped, which is equivalent to setting the other losses determined in S203 to 0.
The codebooks may be updated after each round of training. Accordingly, after performing the quantized feature extraction processing on the sample images to obtain their quantized features, it may first be determined whether the codebooks exist. If they do, the current round is not the first, i.e., at least one round of training has already been performed, so the codebooks may first be updated and then the steps of S203 other than determining the classification loss executed.
The codebook update process is described below. The codebooks are updated per sample classification, over the plurality of sample images of the same sample classification; the update process is similar for each sample classification.
The server may cluster the quantized features of the plurality of sample images to obtain the cluster centers. The server may then determine the similarity between each quantized feature and each cluster center, and for each cluster center select the sample image whose quantized feature has the greatest similarity as a target sample image. The server may update the codebooks based on the quantized features of the selected target sample images.
As an example, if the number of sample images belonging to the same sample classification (the i-th sample classification) is large, a specified number of them may be selected. The server may cluster the sample images of that classification into a specified number of cluster centers, say 10. For each cluster center, the server may determine the sample image closest to it among the plurality of sample images, thereby obtaining 10 target sample images.
After obtaining the 10 target sample images, the server divides the quantized feature of each into a sub-feature sequence; for example, each sequence contains K positions corresponding one-to-one to the K codebooks, with one sub-feature per position. For each position, the sub-features at that position are written into the corresponding codebook as its (i*10)-th through ((i+1)*10)-th index vectors, where i denotes the i-th sample classification and 10 is the specified number of cluster centers. In this way, the sub-features corresponding to the quantized features of the 10 target sample images update the index vectors of the corresponding sample classification in each codebook.
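A sketch of this per-class codebook update (assuming NumPy and scikit-learn's KMeans; the codebook layout, with each sample classification owning a contiguous block of rows, is an interpretation of the description above):

```python
import numpy as np
from sklearn.cluster import KMeans

def update_codebooks(codebooks, feats, class_id, n_centers=10):
    """Update every codebook from the quantized features of one sample class.

    codebooks: list of K arrays, each (num_classes * n_centers, D // K).
    feats:     (M, D) quantized features of this class's sample images.
    class_id:  index i of the sample class; its codewords occupy rows
               i*n_centers .. (i+1)*n_centers - 1 of each codebook.
    """
    K = len(codebooks)
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(feats).cluster_centers_
    for j, c in enumerate(centers):
        # The sample image nearest the cluster center is the target sample image.
        target = feats[np.argmin(np.linalg.norm(feats - c, axis=1))]
        subs = np.split(target, K)  # sub-feature sequence, one entry per codebook
        for k in range(K):
            codebooks[k][class_id * n_centers + j] = subs[k]
```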
S204, outputting the trained target image quantization model when, based on the obtained binary quantization loss and semantic quantization loss, the image quantization model to be trained meets the training target.
After obtaining the binary quantization loss and the semantic quantization loss, the server may determine the training loss L_total of the image quantization model to be trained based on them; the training loss may be a weighted sum of the binary quantization loss and the semantic quantization loss, see formula (10).
L_total = a · L_hash + b · L_laug (10)
After obtaining the training loss of the image quantization model, the server may determine whether the training loss meets a training goal, e.g., determine whether the training loss converges. If it is determined that the training loss meets the training target, the current image quantization model is output as a trained target image quantization model.
As an embodiment, if it is determined that the training loss does not meet the training target, stochastic gradient descent (SGD) may be used to back-propagate the training loss and obtain updated values for the model parameters of the image quantization model. The model parameters are updated and training continues on the adjusted model with the sample images, until the training loss meets the training target during some round of training.
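A minimal sketch of one update step per formula (10) (assuming PyTorch; the stand-in model and the loss weights a and b are illustrative, the learning rate 0.0005 is the one given below, and both losses are assumed to have been computed from the model's outputs so gradients can flow):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 64)  # stand-in for the image quantization model
optimizer = torch.optim.SGD(model.parameters(), lr=0.0005)

def train_step(L_hash, L_laug, a=1.0, b=1.0):
    # Training loss per formula (10) as a weighted sum of the two losses.
    L_total = a * L_hash + b * L_laug
    optimizer.zero_grad()
    L_total.backward()   # gradient backward calculation (SGD)
    optimizer.step()     # update the model parameters
    return L_total.item()
```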
As an embodiment, in multi-task learning the mutual interference of multiple losses easily makes convergence difficult. Therefore, in the embodiment of the present application, different weights are set for different losses: a larger weight is set for the triplet quantization loss, so that the triplet quantization training process is guaranteed first and classification, product quantization, and the like do not hinder the convergence of the image quantization model.
In this embodiment of the present application, besides setting different weights so that the loss with the larger weight is guaranteed first, the losses may also be trained hierarchically: only after the loss of the first level converges are the training processes for the losses of the other levels carried out. The two approaches may be combined. For example, the image quantization model is first trained on the binary quantization loss and the classification loss, with the classification loss weighted lower than the binary quantization loss, say 1 for the binary quantization loss and 0.2 for the classification loss. Once the image quantization model is stable, training on the other losses, such as the product quantization loss, proceeds until the model is stable again, thereby guaranteeing the binary quantization training process.
The following describes an example of a method for training an image quantization model according to an embodiment of the present application.
Referring to fig. 4, the image quantization model includes a feature extraction module, a quantization module, a classification module, a codebook module, and a training module. The feature extraction module may be initialized with the model parameters of a ResNet101 pre-trained on the ImageNet dataset. Network structures not present in ResNet101, such as that of the quantization module, can be initialized from a Gaussian distribution with a variance of 0.01 and a mean of 0.
The learning parameters of the image quantization model can be set with reference to the parameters given in the tables above, with a learning rate of 0.0005. The server performs multiple rounds of iterative training on the image quantization model using the sample images; each round trains on all sample images, and from the second round onward each round may update the codebooks.
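A sketch of the initialization just described (assuming PyTorch/torchvision with the older `pretrained=True` API; the quantization head's shape is hypothetical, and "variance 0.01" is taken literally, i.e., standard deviation 0.1):

```python
import torch.nn as nn
from torchvision.models import resnet101

# Feature extraction module: initialized from ResNet101 pre-trained on ImageNet.
backbone = resnet101(pretrained=True)

# Quantization module (hypothetical fully-connected head): Gaussian init with
# mean 0 and variance 0.01, i.e. standard deviation 0.1.
quant_head = nn.Linear(2048, 128)
nn.init.normal_(quant_head.weight, mean=0.0, std=0.1)
nn.init.zeros_(quant_head.bias)
```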
The server may set the model parameters of the image quantization model to a learning state and forward-compute the input data, such as sample images or triplet samples. The feature extraction module extracts the image feature of each sample image, and the quantization module quantizes each image feature to obtain the quantized feature of each sample image.
The classification module performs classification processing on each quantized feature to obtain respective prediction classification of each sample image, and determines classification loss of the image quantization model. And the codebook module respectively carries out product quantization processing on the quantization characteristics of each sample image and determines product quantization loss of the image quantization model.
The training module carries out quantization evaluation on each obtained quantization characteristic based on a binary quantization evaluation strategy, and determines the binary quantization loss of the image quantization model. The training module determines a semantic quantization penalty for the image quantization model based on a weighted sum of the obtained classification penalty and the product quantization penalty.
The training module is used for outputting a trained target image quantization model when determining that the image quantization model to be trained meets a training target based on the obtained binary quantization loss and semantic quantization loss.
In the embodiment of the application, an end-to-end training mode is adopted, so that when the trained target image quantization model performs quantized feature extraction processing, the degradation in extraction performance that arises when the inference procedure differs from the training procedure is avoided. Joint learning of the semantically capable codebooks with the quantized features makes the codebooks better match the metric-learning requirement, avoiding the problem of similar samples being split apart. The product quantization loss constraint obtained through the product quantization processing ensures that the quantized features have a certain semantic measurement capability, so that the quantized features of similar sample images are as identical as possible. Priority learning of the triplet quantization loss is maintained throughout the joint learning, with the other losses as auxiliaries, which coordinates the differing learning difficulties of the different tasks, makes the image quantization model easier to converge, and improves both the semantic characterization effect and the quantization characterization effect.
Based on the same inventive concept, an embodiment of the present application provides a method for retrieving an image; please refer to fig. 5a. Using the method of training an image quantization model described above, multiple rounds of iterative training are performed on the image quantization model to be trained based on the sample images: when the image quantization model does not meet the training target, its model parameters are adjusted and training continues; when it meets the training target, it is output as the trained target image quantization model.
In the method for retrieving the image provided by the embodiment of the application, after the reference image is obtained, quantization characteristic extraction processing is performed on the reference image based on the obtained target image quantization model, so as to obtain the quantization characteristic of the reference image. And screening candidate images with quantized features matched with those of the reference image from the candidate images based on the quantized features of the candidate images, and taking the candidate images as target retrieval images. Because the target image quantization model can extract more accurate quantization characteristics, the target retrieval image can be retrieved more accurately aiming at the reference image based on the accurate quantization characteristics.
Please refer to fig. 5b, which is a flowchart of retrieving an image according to an embodiment of the present application.
S501, a reference image is obtained.
The server may receive the reference image from a client, where the client obtains the reference image in response to a retrieval operation triggered by a user, among other possibilities; this is not particularly limited.
S502, performing quantization characteristic extraction processing on the reference image based on the target image quantization model obtained by training by adopting the method for training the image quantization model as described above, and obtaining the quantization characteristic of the reference image.
Based on the target image quantization model trained with the method of training an image quantization model described above, the process of performing quantized feature extraction on the reference image to obtain its quantized feature can be understood with reference to the foregoing: the image quantization model adopts an end-to-end training mode, so the use of the image quantization model is similar to its training process.
S503, screening candidate images with quantized features matched with those of the reference image from the candidate images based on the quantized features of the candidate images, and taking the candidate images as target retrieval images.
When screening, based on the quantized features of the candidate images, those candidate images whose quantized features match the quantized feature of the reference image as target retrieval images, the server may first obtain the codebooks. Based on the index vectors contained in the codebooks, the server screens out the target index vectors matching the quantized feature of the reference image, and then determines the target retrieval images based on the candidate images whose quantized features are marked by those target index vectors.
When screening the target index vectors matching the quantized feature of the reference image from the index vectors contained in the codebooks, the server may divide the quantized feature of the reference image into a sub-feature sequence containing a plurality of sub-features, and for each position in the sequence screen out, from the index vectors of the corresponding codebook, the target index vector with the greatest similarity to the sub-feature at that position.
The number of target retrieval images may be one or more. After obtaining the target index vector with the greatest similarity to the sub-feature at each position, the distances between the candidate images marked by those target index vectors and the reference image are determined. The candidate images may then be sorted by distance in ascending order and those ranked before a specified position selected as target retrieval images, or those within a specified distance selected; this is not particularly limited.
The method for retrieving an image is described below by way of example in an image infringement recognition scenario, with reference to fig. 5c.
Network resources contain many candidate images; matching the similarity between the reference image and every candidate image on raw image features would consume substantial computing resources and matching time. The server may therefore acquire the quantized features of candidate images when the images are uploaded to the network resource, or may periodically acquire the quantized features of the candidate images uploaded during the corresponding period, among other options; this is not particularly limited. Take as an example candidate images A, B, C, and D, whose quantized features are quantized features A, B, C, and D respectively.
After the server obtains the quantized features of the candidate images, the quantized features may be divided into sub-feature sequences including a specified number of sub-features, that is, the sub-feature sequences corresponding to the respective candidate images are respectively a sub-feature sequence a, a sub-feature sequence B, a sub-feature sequence C, and a sub-feature sequence D. Wherein the sub-feature sequence A comprises a sub-feature A-1, a sub-feature A-2, a sub-feature A-3 and a sub-feature A-4; the sub-feature sequence B comprises a sub-feature B-1, a sub-feature B-2, a sub-feature B-3 and a sub-feature B-4; the sub-feature sequence C comprises a sub-feature C-1, a sub-feature C-2, a sub-feature C-3 and a sub-feature C-4; the sub-feature sequence D contains sub-feature D-1, sub-feature D-2, sub-feature D-3 and sub-feature D-4.
For each position in the sub-feature sequence, the server clusters the corresponding sub-features of the candidate images; for example, for sub-features A-1, B-1, C-1, and D-1 it obtains a first cluster center and a second cluster center, with sub-features A-1 and B-1 belonging to the first cluster center and sub-features C-1 and D-1 to the second. Proceeding likewise, a set of cluster centers is obtained for each position. Each position thus yields one codebook, whose index vectors are the corresponding cluster centers; the server thereby obtains four codebooks: a first, second, third, and fourth codebook.
After the server obtains the reference image, the sub-feature sequence E corresponding to the reference image is obtained in the same way, containing sub-features E-1, E-2, E-3, and E-4. For sub-feature E-1, the index vector in the first codebook with the smallest L2 distance to sub-feature E-1 is screened out, i.e., the cluster center most similar to sub-feature E-1 is obtained, and the candidate images whose sub-features belong to that cluster center are screened out. Similarly, candidate images are screened out for sub-features E-2, E-3, and E-4. For example: for sub-feature E-1, candidate images A and B are screened out; for sub-feature E-2, candidate image A; for sub-feature E-3, candidate images A, B, and C; for sub-feature E-4, candidate images A and D.
The server may tally the screened candidate images and select the two with the highest counts as the target retrieval images, namely candidate image A and candidate image B. The existence of target retrieval images indicates that the network resource contains images similar to the reference image, so candidate images A and B may be infringing with respect to the reference image.
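A sketch of this vote-based recall (assuming NumPy; the inverted-list layout mapping cluster centers to candidate image ids is a hypothetical data structure, not specified in the embodiment):

```python
import numpy as np

def retrieve(query_subs, codebooks, inverted_lists, top_k=2):
    """Vote-based recall mirroring the example above.

    query_subs:     list of K query sub-features (E-1 .. E-K).
    codebooks:      list of K arrays of index vectors (the cluster centers).
    inverted_lists: inverted_lists[k][c] -> ids of candidate images whose k-th
                    sub-feature belongs to cluster center c (assumed layout).
    """
    votes = {}
    for k, sub in enumerate(query_subs):
        # Index vector with the smallest L2 distance to the query sub-feature.
        c = int(np.argmin(np.linalg.norm(codebooks[k] - sub, axis=1)))
        for img_id in inverted_lists[k].get(c, []):
            votes[img_id] = votes.get(img_id, 0) + 1
    # Candidates with the most votes become the target retrieval images.
    return sorted(votes, key=votes.get, reverse=True)[:top_k]
```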
In the embodiment of the application, the quantization characteristic extraction processing process of the trained target image quantization model is similar to the quantization characteristic extraction processing process when the image quantization model is trained, so that the problem of performance degradation of quantization characteristic extraction processing by adopting the target image quantization model due to different processing processes is avoided. Meanwhile, as each codebook is obtained in the training process and updated in real time, each codebook obtained in the training process can be directly adopted when the target image quantization model is used, and the codebook obtaining process is not needed.
Furthermore, a target image quantization model with semantic measurement capability is adopted to conduct quantization feature extraction processing on the reference image, so that the extracted quantization features can accurately represent the semantics of the reference image, and the situation that candidate images with the same or similar semantics as the reference image representation are omitted during retrieval is avoided. Therefore, based on the quantization characteristics, the target retrieval image matched with the reference image can be accurately recalled, the image recall performance is improved, and when the method is applied to the fields such as the image duplication removal field, the infringement image recognition field and the like, the image duplication removal result can be obtained more accurately, and the infringement image and the like can be recognized more accurately.
Based on the same inventive concept, the embodiment of the application provides a device for training an image quantization model, which can realize the functions corresponding to the method for training the image quantization model. Referring to fig. 6, the apparatus includes a feature extraction module 601 and a training module 602, wherein:
feature extraction module 601: the method comprises the steps of respectively carrying out quantization characteristic extraction processing on each sample image to obtain respective quantization characteristics of each sample image;
training module 602: the method comprises the steps of carrying out quantization evaluation on each obtained quantization characteristic based on a binary quantization evaluation strategy, and determining a binary quantization loss of an image quantization model, wherein the binary quantization evaluation strategy is used for evaluating the binarization degree of the quantization characteristic and the characteristic characterization degree of a corresponding sample image;
training module 602 is also to: carrying out semantic evaluation on each quantized feature based on a semantic evaluation strategy, and determining semantic quantization loss of an image quantization model, wherein the semantic evaluation strategy is used for evaluating semantic characterization degree of the quantized feature for a corresponding sample image;
training module 602 is also to: and outputting a trained target image quantization model when the image quantization model to be trained meets the training target based on the binary quantization loss and the semantic quantization loss.
In one possible embodiment, each sample image includes a plurality of sample images that form a plurality of triplets, wherein each triplet includes a reference sample image, a positive sample image that is similar to the reference sample image, and a negative sample image that is dissimilar to the reference sample image;
training module 602 is specifically configured to:
determining a triplet quantization loss of the image quantization model based on quantization features of each of the reference sample image, the positive sample image, and the negative sample image contained by each of the plurality of triplet samples;
respectively carrying out binarization processing on the respective quantization characteristics of each sample image to determine the binarization loss of an image quantization model;
the binary quantization loss of the image quantization model is determined based on a weighted sum of the triplet quantization loss and the binary quantization loss.
In one possible embodiment, training module 602 is specifically configured to:
classifying the respective quantized features of each sample image to determine the classifying loss of the image quantized model;
performing product quantization processing on the quantization characteristics of each sample image respectively, and determining product quantization loss of an image quantization model;
A semantic quantization penalty of the image quantization model is determined based on a weighted sum of the classification penalty and the product quantization penalty.
In one possible embodiment, training module 602 is specifically configured to:
obtaining codebooks, wherein each codebook is generated based on respective quantization features of each sample image, each codebook comprising a plurality of index vectors, the index vectors being used as indexes of the quantization features;
based on a plurality of index vectors contained in each codebook, respectively carrying out index marking on each quantization characteristic of each sample image to obtain each corresponding index characteristic of each quantization characteristic;
and determining product quantization loss of the image quantization model based on the index features corresponding to the sample images.
In one possible embodiment, training module 602 is specifically configured to:
for each quantized feature of each sample image, the following operations are performed:
dividing the quantized feature into a sub-feature sequence containing a plurality of sub-features, wherein the number of the sub-features contained in the sub-feature sequence is the same as the number of each codebook, and each codebook corresponds to the position of the sub-feature in the sub-feature sequence one by one;
for each codebook, respectively determining the similarity between a plurality of index vectors contained in the codebook and the sub-features at corresponding positions in the sub-feature sequence, and screening the index vector with the maximum similarity for each position in the sub-feature sequence;
And obtaining index features corresponding to the quantization features based on the screened index vectors.
In one possible embodiment, training module 602 is specifically configured to:
determining index quantization loss of an image quantization model based on errors between the index features corresponding to each sample image and the corresponding quantization features;
determining a codebook triplet quantization loss of an image quantization model based on index features corresponding to a reference sample image, a positive sample image, and a negative sample image contained in each of a plurality of triplet samples, wherein the plurality of triplet samples are composed based on sample images in each sample image;
classifying a plurality of index vectors contained in each codebook respectively to determine codebook classification loss of an image quantization model;
product quantization loss of the image quantization model is determined based on a weighted sum of the index quantization loss, the codebook triplet quantization loss, and the codebook classification loss.
In one possible embodiment, each sample image has a corresponding sample classification;
training module 602 is also to:
after each sample image is subjected to quantization feature extraction processing to obtain each quantization feature of each sample image, when each codebook is determined to exist, the following operations are respectively executed for a plurality of sample images corresponding to the same sample classification, wherein each codebook respectively comprises a plurality of index vectors, and the index vectors are used as indexes of the quantization features;
Clustering processing is carried out on the quantization characteristics of each of the plurality of sample images, so as to obtain each clustering center;
respectively determining the quantized features of each of the plurality of sample images, and the similarity between the quantized features and each cluster center, and respectively screening out the sample image corresponding to the quantized feature with the largest similarity as a target sample image aiming at each cluster center;
and updating each codebook based on the quantization characteristic of each screened target sample image.
In one possible embodiment, the triplet sample is obtained by training module 602 using the following method:
determining a plurality of similar sample image pairs based on the respective sample images, wherein the similar sample image pairs contain two sample images that are similar images to each other;
for a plurality of similar sample image pairs, the following operations are performed:
respectively taking two sample images contained in a similar sample image pair as a reference sample image and a positive sample image;
selecting sample images contained in a plurality of similar sample image pairs except for the similar sample image pairs as negative sample images;
a triplet sample is created based on the reference sample image, the positive sample image, and the negative sample image.
In the embodiment of the application, in the training process of the image quantization model, on one hand, training is performed based on the binary quantization loss of the image quantization model, so that the quantization characteristic output by the image quantization model can be as close to one value of the binary values as possible, the binarization degree is improved, and a better quantization effect is achieved from the binarization angle; meanwhile, training is performed based on the binary quantization loss of the image quantization model, so that the quantization characteristic output by the image quantization model can represent the characteristic of the sample image before quantization as much as possible, the characteristic representation degree is improved, and a better quantization effect is achieved from the characteristic representation angle.
On the other hand, the image quantization model based semantic quantization loss is trained, so that the quantization features output by the image quantization model have certain semantic representation capability, and compared with the method for obtaining the quantization features based on visual pixel values of images, the method for obtaining the quantization features based on the semantics of image abstraction in the embodiment of the application improves the accuracy of representing corresponding images by the quantization features, and achieves better quantization effects from the aspect of semantics. The embodiment of the application improves the accuracy of the quantized image from multiple angles, and further improves recall performance during image retrieval.
Furthermore, the image quantization model adopts an end-to-end training mode, so that when the trained target image quantization model performs quantized feature extraction processing, the degradation in extraction performance that arises when the inference procedure differs from the training procedure is avoided. Joint learning of the semantically capable codebooks with the quantized features makes the codebooks better match the metric-learning requirement, avoiding the problem of similar samples being split apart. The product quantization loss constraint obtained through the product quantization processing ensures that the quantized features have a certain semantic measurement capability, so that the quantized features of similar sample images are as identical as possible. Priority learning of the triplet quantization loss is maintained throughout the joint learning, with the other losses as auxiliaries, which coordinates the differing learning difficulties of the different tasks, makes the image quantization model easier to converge, and improves both the semantic characterization effect and the quantization characterization effect.
Based on the same inventive concept, the embodiment of the application provides an image searching device, which can realize functions corresponding to the image searching method. Referring to fig. 7, the apparatus includes an acquisition module 701 and a processing module 702, where:
The acquisition module 701: for obtaining a reference image;
the processing module 702: the method is used for carrying out quantization characteristic extraction processing on the reference image based on a target image quantization model obtained by training by adopting any one of the training image quantization models as described above to obtain quantization characteristics of the reference image;
the processing module 702 is further configured to: and screening candidate images with quantized features matched with those of the reference image from the candidate images based on the quantized features of the candidate images, and taking the candidate images as target retrieval images.
In one possible embodiment, the processing module 702 is specifically configured to:
obtaining each codebook, wherein each codebook comprises a plurality of index vectors, and the index vectors are used as marks of quantization characteristics;
screening each target index vector matched with the quantization characteristic of the reference image from each index vector based on a plurality of index vectors contained in each codebook;
the target search image is determined based on the candidate images corresponding to the quantization features using the respective target index vectors as markers.
In one possible embodiment, the processing module 702 is specifically configured to:
dividing the quantization characteristic of the reference image into a sub-characteristic sequence containing a plurality of sub-characteristics, wherein the number of the sub-characteristics contained in the sub-characteristic sequence is the same as the number of each codebook, and each codebook corresponds to the position of the sub-characteristic in the sub-characteristic sequence one by one;
And screening the target index vector with the maximum similarity with the sub-features at the corresponding positions from the index vectors according to the positions in the sub-feature sequences.
In the method for retrieving the image provided by the embodiment of the application, after the reference image is obtained, quantization characteristic extraction processing is performed on the reference image based on the obtained target image quantization model, so as to obtain the quantization characteristic of the reference image. And screening candidate images with quantized features matched with those of the reference image from the candidate images based on the quantized features of the candidate images, and taking the candidate images as target retrieval images. And a target image quantization model with semantic measurement capability is adopted to extract quantization characteristics of the reference image, so that the extracted quantization characteristics can accurately represent the semantics of the reference image, and the condition that candidate images with the same or similar semantics as the characteristics of the reference image are omitted during retrieval is avoided. Therefore, based on the quantization characteristics, the target retrieval image matched with the reference image can be accurately recalled, the image recall performance is improved, and when the method is applied to the fields such as the image duplication removal field, the infringement image recognition field and the like, the image duplication removal result can be obtained more accurately, and the infringement image and the like can be recognized more accurately.
Meanwhile, the quantization characteristic extraction processing process of the trained target image quantization model is similar to the quantization characteristic extraction processing process when the image quantization model is trained, so that the problem of performance degradation of the quantization characteristic extraction processing by adopting the target image quantization model due to different processing processes is avoided. Meanwhile, as each codebook is obtained in the training process and updated in real time, each codebook obtained in the training process can be directly adopted when the target image quantization model is used, and the codebook obtaining process is not needed.
Based on the same inventive concept, the embodiments of the present application provide a computer device, which may be a terminal device or a server, and is not particularly limited. The following describes an example of a configuration of a computer device.
Referring to fig. 8, the apparatus for training an image quantization model or the apparatus for retrieving an image may be run on a computer device 800, and a current version and a historical version of a data storage program and application software corresponding to the data storage program may be installed on the computer device 800, where the computer device 800 includes a processor 880 and a memory 820. In some embodiments, the computer device 800 may include a display unit 840, the display unit 840 including a display panel 841 for displaying an interface or the like for interactive operation by a user.
In one possible embodiment, the display panel 841 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD) or an Organic Light-Emitting Diode (OLED), or the like.
The processor 880 is configured to read the computer program and then execute a method defined by the computer program, for example, the processor 880 reads a data storage program or a file, etc., thereby running the data storage program on the computer device 800 and displaying a corresponding interface on the display unit 840. The processor 880 may include one or more general-purpose processors and may further include one or more DSPs (Digital Signal Processor, digital signal processors) for performing related operations to implement the technical solutions provided by the embodiments of the present application.
Memory 820 typically includes internal memory and external memory; the internal memory may be random access memory (RAM), read-only memory (ROM), cache, etc., and the external memory may be a hard disk, an optical disk, a USB disk, a floppy disk, a tape drive, etc. The memory 820 is used to store computer programs, including the application programs corresponding to the respective clients, and other data, which may include data generated after the operating system or applications are run, including system data (e.g., configuration parameters of the operating system) and user data. The program instructions in the embodiments of the present application are stored in the memory 820, and the processor 880 executes them to implement any of the methods of training an image quantization model or retrieving an image discussed in the preceding figures.
The above-described display unit 840 is used to receive input digital information, character information, or touch operation/noncontact gestures, and to generate signal inputs related to user settings and function controls of the computer device 800, and the like. Specifically, in the embodiment of the present application, the display unit 840 may include a display panel 841. The display panel 841, for example, a touch screen, may collect touch operations thereon or thereabout by a user (such as operations of the user on the display panel 841 or on the display panel 841 using any suitable object or accessory such as a finger, a stylus, etc.), and drive the corresponding connection device according to a predetermined program.
In one possible embodiment, the display panel 841 may include two parts, a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch-point coordinates, and sends them to the processor 880; it can also receive and execute commands from the processor 880.
The display panel 841 may be implemented by various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the display unit 840, in some embodiments, the computer device 800 may also include an input unit 830, where the input unit 830 may include a graphical input device 831 and other input devices 832, where other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
In addition to the above, the computer device 800 may also include a power supply 890 for powering other modules, audio circuitry 860, near field communication module 870, and RF circuitry 810. The computer device 800 may also include one or more sensors 850, such as acceleration sensors, light sensors, pressure sensors, and the like. The audio circuit 860 specifically includes a speaker 861 and a microphone 862, etc., and for example, the computer device 800 can collect the sound of the user through the microphone 862, perform corresponding operations, etc.
The number of processors 880 may be one or more, and the processors 880 and memory 820 may be coupled or may be relatively independent.
As an example, processor 880 of fig. 8 may be used to implement the functionality of feature extraction module 601 and training module 602 of fig. 6, as well as the functionality of acquisition module 701 and processing module 702 of fig. 7.
As an example, the processor 880 of fig. 8 may be used to implement the functions corresponding to the server or terminal device discussed above.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions. The aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the aforementioned storage medium includes media capable of storing program code, such as a removable storage device, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or in a part contributing to the prior art in the form of a software product, for example, by a computer program product stored in a storage medium, comprising several instructions for causing a computer device to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (16)

1. A method of training an image quantization model, wherein performing multiple rounds of iterative training of the image quantization model to be trained based on respective sample images, comprises:
Respectively carrying out quantization characteristic extraction processing on each sample image to obtain respective quantization characteristics of each sample image;
performing quantization evaluation on each obtained quantization feature based on a binary quantization evaluation strategy, and determining a binary quantization loss of an image quantization model, wherein the binary quantization evaluation strategy is used for evaluating the binarization degree of the quantization feature and the feature characterization degree of the corresponding sample image;
performing semantic evaluation on each quantized feature based on a semantic evaluation strategy, and determining semantic quantization loss of an image quantization model, wherein the semantic evaluation strategy is used for evaluating semantic characterization degree of the quantized feature for a corresponding sample image;
and based on the binary quantization loss and the semantic quantization loss, when the image quantization model to be trained meets a training target, outputting a trained target image quantization model.
2. The method of claim 1, wherein the sample images form a plurality of triplet samples, each triplet sample comprising a reference sample image, a positive sample image similar to the reference sample image, and a negative sample image dissimilar to the reference sample image; and
wherein performing quantization evaluation on each obtained quantization feature based on the binary quantization evaluation strategy to determine the binary quantization loss of the image quantization model comprises:
determining a triplet quantization loss of the image quantization model based on the quantization features of the reference sample image, the positive sample image, and the negative sample image contained in each of the plurality of triplet samples;
performing binarization processing on the quantization feature of each sample image to determine a binarization loss of the image quantization model; and
determining the binary quantization loss of the image quantization model based on a weighted sum of the triplet quantization loss and the binarization loss.
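[Editorial illustration — not part of the claims.] A sketch of the binary quantization loss of claim 2 under assumed weights and margin: a triplet margin loss on the quantization features plus a binarization loss that penalizes each feature's distance from its sign.

import torch
import torch.nn.functional as F

def binary_quantization_loss(anchor, positive, negative,
                             margin=0.5, w_triplet=1.0, w_binarize=0.1):
    # Triplet quantization loss: the reference (anchor) feature should lie
    # closer to the positive feature than to the negative feature.
    l_triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    # Binarization loss: push every feature dimension toward {-1, +1}.
    feats = torch.cat([anchor, positive, negative], dim=0)
    l_binarize = (feats - feats.sign()).pow(2).mean()
    # Weighted sum, as recited in the claim; the weights are assumptions.
    return w_triplet * l_triplet + w_binarize * l_binarize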
3. The method of claim 1, wherein performing semantic evaluation on each quantization feature based on the semantic evaluation strategy to determine the semantic quantization loss of the image quantization model comprises:
classifying the quantization feature of each sample image to determine a classification loss of the image quantization model;
performing product quantization processing on the quantization feature of each sample image to determine a product quantization loss of the image quantization model; and
determining the semantic quantization loss of the image quantization model based on a weighted sum of the classification loss and the product quantization loss.
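[Editorial illustration — not part of the claims.] A sketch of claim 3's semantic quantization loss with an assumed linear classification head: cross-entropy on the quantization features plus a product quantization loss (sketched after claim 6), combined by a weighted sum.

import torch.nn as nn
import torch.nn.functional as F

class SemanticQuantizationLoss(nn.Module):
    def __init__(self, feature_dim, num_classes, w_cls=1.0, w_pq=1.0):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)  # assumed head
        self.w_cls, self.w_pq = w_cls, w_pq

    def forward(self, features, labels, pq_loss):
        # Classification loss on the quantization features.
        l_cls = F.cross_entropy(self.classifier(features), labels)
        # Weighted sum with the separately computed product quantization loss.
        return self.w_cls * l_cls + self.w_pq * pq_loss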
4. The method of claim 3, wherein performing product quantization processing on the quantization feature of each sample image to determine the product quantization loss of the image quantization model comprises:
obtaining respective codebooks, wherein the codebooks are generated based on the quantization features of the sample images, and each codebook contains a plurality of index vectors used as indexes of quantization features;
indexing the quantization feature of each sample image based on the plurality of index vectors contained in each codebook to obtain an index feature corresponding to each quantization feature; and
determining the product quantization loss of the image quantization model based on the index features corresponding to the sample images.
5. The method of claim 4, wherein indexing the quantization feature of each sample image based on the plurality of index vectors contained in each codebook to obtain the index feature corresponding to each quantization feature comprises:
performing the following operations for the quantization feature of each sample image:
splitting the quantization feature into a sub-feature sequence containing a plurality of sub-features, wherein the number of sub-features contained in the sub-feature sequence equals the number of codebooks, and each codebook corresponds one-to-one with a position in the sub-feature sequence;
for each codebook, determining the similarity between the plurality of index vectors contained in the codebook and the sub-feature at the corresponding position in the sub-feature sequence, and selecting, for each position in the sub-feature sequence, the index vector with the greatest similarity; and
obtaining the index feature corresponding to the quantization feature based on the selected index vectors.
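[Editorial illustration — not part of the claims.] A NumPy sketch of claim 5's indexing step under assumed shapes: the feature is split into as many sub-features as there are codebooks, each codebook contributes its most similar index vector, and the concatenation forms the index feature.

import numpy as np

def index_feature(feature, codebooks):
    # feature: (dim,); codebooks: (M, K, dim // M), i.e. M codebooks of
    # K index vectors, one codebook per sub-feature position.
    m, k, sub_dim = codebooks.shape
    sub_features = feature.reshape(m, sub_dim)
    chosen = np.empty_like(sub_features)
    for i in range(m):
        # Similarity is taken as a dot product here; the claim does not
        # fix a particular similarity measure.
        sims = codebooks[i] @ sub_features[i]        # (K,)
        chosen[i] = codebooks[i][np.argmax(sims)]    # most similar index vector
    return chosen.reshape(-1)                        # index feature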
6. The method of claim 4, wherein determining the product quantization loss of the image quantization model based on the index features corresponding to the sample images comprises:
determining an index quantization loss of the image quantization model based on the errors between the index features corresponding to the sample images and the corresponding quantization features;
determining a codebook triplet quantization loss of the image quantization model based on the index features corresponding to the reference sample image, the positive sample image, and the negative sample image contained in each of a plurality of triplet samples, wherein the plurality of triplet samples are composed of images among the sample images;
classifying the index vectors contained in each codebook to determine a codebook classification loss of the image quantization model; and
determining the product quantization loss of the image quantization model based on a weighted sum of the index quantization loss, the codebook triplet quantization loss, and the codebook classification loss.
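[Editorial illustration — not part of the claims.] A sketch of claim 6's product quantization loss with assumed weights and helper names: a reconstruction error between index and quantization features, a triplet loss over the index features, and a classification loss over the codebook index vectors, combined by a weighted sum.

import torch.nn.functional as F

def product_quantization_loss(idx_a, idx_p, idx_n,     # index features of a triplet
                              feat_a, feat_p, feat_n,  # matching quantization features
                              codebook_logits, codebook_labels,
                              w_idx=1.0, w_tri=1.0, w_cls=0.5):
    # Index quantization loss: error between index and quantization features.
    l_idx = (F.mse_loss(idx_a, feat_a) + F.mse_loss(idx_p, feat_p)
             + F.mse_loss(idx_n, feat_n)) / 3
    # Codebook triplet quantization loss, computed on the index features.
    l_tri = F.triplet_margin_loss(idx_a, idx_p, idx_n, margin=0.5)
    # Codebook classification loss on the index vectors.
    l_cls = F.cross_entropy(codebook_logits, codebook_labels)
    return w_idx * l_idx + w_tri * l_tri + w_cls * l_cls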
7. The method of claim 1, wherein each sample image has a corresponding sample classification; and
after performing quantization feature extraction processing on each sample image to obtain the respective quantization feature of each sample image, the method further comprises:
when respective codebooks exist, each codebook containing a plurality of index vectors used as indexes of quantization features, performing the following operations for the plurality of sample images corresponding to the same sample classification:
clustering the quantization features of the plurality of sample images to obtain cluster centers;
determining, for each cluster center, the similarities between the quantization features of the plurality of sample images and the cluster center, and selecting the sample image whose quantization feature has the greatest similarity as a target sample image; and
updating each codebook based on the quantization features of the selected target sample images.
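[Editorial illustration — not part of the claims.] A sketch of claim 7's per-class selection step, assuming scikit-learn's KMeans: cluster one class's quantization features, keep the feature most similar to each cluster center, and use those features to update the codebooks.

import numpy as np
from sklearn.cluster import KMeans

def target_features_for_class(features, n_clusters=8):
    # features: (N, dim) quantization features sharing one sample classification.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    targets = []
    for center in km.cluster_centers_:
        sims = features @ center                   # similarity to this cluster center
        targets.append(features[np.argmax(sims)])  # target sample's feature
    return np.stack(targets)                       # input to the codebook update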
8. The method of claim 2 or claim 6, wherein the triplet samples are obtained by:
determining a plurality of similar sample image pairs based on the sample images, wherein each similar sample image pair contains two sample images that are similar to each other; and
performing the following operations for each of the plurality of similar sample image pairs:
taking the two sample images contained in the similar sample image pair as the reference sample image and the positive sample image, respectively;
selecting, as the negative sample image, a sample image contained in one of the similar sample image pairs other than the current pair; and
composing a triplet sample from the reference sample image, the positive sample image, and the negative sample image.
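[Editorial illustration — not part of the claims.] A sketch of claim 8's triplet construction with an assumed data layout: each similar pair yields the reference and positive images, and the negative image is drawn from a different pair.

import random

def build_triplets(similar_pairs):
    # similar_pairs: list of (image_a, image_b) tuples of mutually similar images.
    triplets = []
    for i, (reference, positive) in enumerate(similar_pairs):
        other_pair = random.choice(
            [p for j, p in enumerate(similar_pairs) if j != i])
        negative = random.choice(other_pair)   # either image of the other pair
        triplets.append((reference, positive, negative))
    return triplets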
9. A method of retrieving an image, comprising:
obtaining a reference image;
performing quantization feature extraction processing on the reference image based on a target image quantization model trained by the method according to any one of claims 1 to 8, to obtain a quantization feature of the reference image; and
selecting, from candidate images and based on the quantization features of the candidate images, the candidate images whose quantization features match the quantization feature of the reference image as target retrieval images.
10. The method of claim 9, wherein selecting, from the candidate images and based on the quantization features of the candidate images, the candidate images whose quantization features match the quantization feature of the reference image as the target retrieval images comprises:
obtaining respective codebooks, wherein each codebook contains a plurality of index vectors used as markers of quantization features;
selecting, from the index vectors contained in the respective codebooks, target index vectors that match the quantization feature of the reference image; and
determining the target retrieval images based on the candidate images whose quantization features are marked by the target index vectors.
11. The method of claim 10, wherein selecting, from the index vectors contained in the respective codebooks, the target index vectors that match the quantization feature of the reference image comprises:
splitting the quantization feature of the reference image into a sub-feature sequence containing a plurality of sub-features, wherein the number of sub-features contained in the sub-feature sequence equals the number of codebooks, and each codebook corresponds one-to-one with a position in the sub-feature sequence; and
selecting, for each position in the sub-feature sequence and from the index vectors of the corresponding codebook, the target index vector with the greatest similarity to the sub-feature at that position.
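[Editorial illustration — not part of the claims.] A sketch of the retrieval path of claims 10 and 11, assuming an inverted index from codebook entries to candidate image identifiers: the reference feature is split per codebook, the most similar index vector is selected at each position, and candidates marked by those index vectors are returned.

import numpy as np

def retrieve(ref_feature, codebooks, inverted_index):
    # codebooks: (M, K, dim // M); inverted_index maps a (codebook position,
    # index-vector id) pair to the ids of candidate images marked by it.
    m, k, sub_dim = codebooks.shape
    sub_features = ref_feature.reshape(m, sub_dim)
    hits = []
    for i in range(m):
        entry = int(np.argmax(codebooks[i] @ sub_features[i]))  # target index vector
        hits.extend(inverted_index.get((i, entry), []))
    # Candidates marked by more target index vectors rank first.
    return sorted(set(hits), key=hits.count, reverse=True)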
12. An apparatus for training an image quantization model, comprising:
a feature extraction module, configured to perform quantization feature extraction processing on each sample image to obtain a respective quantization feature of each sample image; and
a training module, configured to perform quantization evaluation on each quantization feature based on a binary quantization evaluation strategy to determine a binary quantization loss of the image quantization model, wherein the binary quantization evaluation strategy is used to evaluate the degree of binarization of a quantization feature and the degree to which the quantization feature characterizes the corresponding sample image;
wherein the training module is further configured to perform semantic evaluation on each quantization feature based on a semantic evaluation strategy to determine a semantic quantization loss of the image quantization model, wherein the semantic evaluation strategy is used to evaluate the degree to which a quantization feature semantically characterizes the corresponding sample image; and
the training module is further configured to output a trained target image quantization model when, based on the binary quantization loss and the semantic quantization loss, the image quantization model to be trained meets a training target.
13. An apparatus for retrieving an image, comprising:
an acquisition module, configured to obtain a reference image; and
a processing module, configured to perform quantization feature extraction processing on the reference image based on a target image quantization model trained by the method according to any one of claims 1 to 8, to obtain a quantization feature of the reference image;
wherein the processing module is further configured to select, from candidate images and based on the quantization features of the candidate images, the candidate images whose quantization features match the quantization feature of the reference image as target retrieval images.
14. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.
15. A computer device, comprising:
a memory, configured to store program instructions; and
a processor, configured to invoke the program instructions stored in the memory and to perform, in accordance with the obtained program instructions, the method according to any one of claims 1 to 11.
16. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 11.
CN202111265044.3A 2021-10-28 2021-10-28 Method for training image quantization model, method and device for searching image Pending CN116051917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111265044.3A CN116051917A (en) 2021-10-28 2021-10-28 Method for training image quantization model, method and device for searching image


Publications (1)

Publication Number Publication Date
CN116051917A 2023-05-02

Family

ID=86130192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111265044.3A Pending CN116051917A (en) 2021-10-28 2021-10-28 Method for training image quantization model, method and device for searching image

Country Status (1)

Country Link
CN (1) CN116051917A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013012076A (en) * 2011-06-29 2013-01-17 Kddi Corp Retrieval device and program for retrieving high dimensional feature vector with high accuracy
US20180068023A1 (en) * 2016-09-07 2018-03-08 Facebook, Inc. Similarity Search Using Polysemous Codes
CN113392866A (en) * 2020-11-19 2021-09-14 腾讯科技(深圳)有限公司 Image processing method and device based on artificial intelligence and storage medium
CN113298892A (en) * 2021-04-09 2021-08-24 北京沃东天骏信息技术有限公司 Image coding method and device, and storage medium
CN113190699A (en) * 2021-05-14 2021-07-30 华中科技大学 Remote sensing image retrieval method and device based on category-level semantic hash

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Lai, H. J., et al., "Simultaneous feature learning and hash coding with deep neural networks," IEEE Conference on Computer Vision and Pattern Recognition, 2015. *
Luke Zou, "[Retrieval Algorithm Series - 1] Product Quantization" (original title: "[检索算法系列-1] 乘积量化"), retrieved from the Internet: <URL:https://zhuanlan.zhihu.com/p/215870859?utm_id=0> *
Young Kyun Jang, et al., "Generalized Product Quantization Network for Semi-supervised Image Retrieval," arXiv, 12 June 2020. *

Similar Documents

Publication Publication Date Title
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
US11294953B2 (en) Similar face retrieval method, device and storage medium
CN111709409B (en) Face living body detection method, device, equipment and medium
CN110532571B (en) Text processing method and related device
Cheraghian et al. Zero-shot learning of 3d point cloud objects
CN104899579A (en) Face recognition method and face recognition device
CN111414946B (en) Artificial intelligence-based medical image noise data identification method and related device
CN110765882B (en) Video tag determination method, device, server and storage medium
CN111339343A (en) Image retrieval method, device, storage medium and equipment
CN108492301A (en) A kind of Scene Segmentation, terminal and storage medium
CN113705596A (en) Image recognition method and device, computer equipment and storage medium
CN113569607A (en) Motion recognition method, motion recognition device, motion recognition equipment and storage medium
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN112712068B (en) Key point detection method and device, electronic equipment and storage medium
CN114283899A (en) Method for training molecule binding model, and molecule screening method and device
Meng et al. Merged region based image retrieval
CN113780066B (en) Pedestrian re-recognition method and device, electronic equipment and readable storage medium
CN113824989B (en) Video processing method, device and computer readable storage medium
CN116051917A (en) Method for training image quantization model, method and device for searching image
CN115131291A (en) Object counting model training method, device, equipment and storage medium
CN113822291A (en) Image processing method, device, equipment and storage medium
CN113704528A (en) Clustering center determination method, device and equipment and computer storage medium
CN117854156B (en) Training method and related device for feature extraction model
CN117011567A (en) Method, device, equipment and storage medium for training image classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40087229
Country of ref document: HK