Background
Existing visual retrieval systems can be divided, from a technical perspective, into two major categories:
(1) Approaches based on traditional computer vision. The extracted features can be global features or local features;
global features include color histograms, texture features, etc.; local features include SIFT, SURF, ORB, etc.
(2) Approaches based on deep learning. This broad category in turn includes two approaches:
a. directly extracting the output of a specific layer (such as a convolutional layer or a fully connected layer) as a feature vector for searching;
b. performing end-to-end training in combination with a metric function.
The goal of visual retrieval is that entities of the same category (or the same entity, or the same semantic meaning) as the visual entity to be queried appear as near the front of the ranked list as possible, and in particular that the top K results are accurate. The accuracy of a retrieval system is generally measured by the Average Precision (AP), as shown in formula 1-1:
Take an image as an example of a visual entity. Here q represents the image to be queried and APq represents the average precision of the image q to be queried. The data set to be searched in the database, S = {Ii, i = 1, 2, ..., n}, is divided into S+ and S- according to whether each image belongs to the same category as the image to be queried: S+ represents the set of images of the same category as the image to be queried, S- represents the set of images of a different category, and S = S+ ∪ S-. Rank(i, S) represents the rank of image i within the image set S.
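One form of formula 1-1 that is consistent with the symbols defined above and with the usual definition of Average Precision is sketched below. The exact expression is an assumption, as is the notation |S+| for the number of images in S+.

```latex
AP_q = \frac{1}{|S^{+}|} \sum_{i \in S^{+}} \frac{\mathrm{Rank}(i, S^{+})}{\mathrm{Rank}(i, S)}
```

For instance, if the ranked list is (same category, different category, same category), the two images in S+ have Rank(i, S+) of 1 and 2 and Rank(i, S) of 1 and 3, so that AP_q = (1/2)(1/1 + 2/3) = 5/6.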
The prior art has the following defects: none of these approaches explicitly optimizes the ranking loss described above. Features extracted by conventional computer vision do not take the ranking into account at all, and therefore cannot guarantee the property described above. In the deep learning-based mode in which the output of a specific layer (such as a convolutional layer or a fully connected layer) is extracted as a feature vector for searching, the ranked list is likewise not explicitly optimized. The deep learning-based mode with end-to-end training in combination with a metric function can implicitly control the order of the ranked list through the loss function, but it essentially optimizes distances rather than explicitly optimizing a ranking-based loss. As a result, for the same distance loss, an error near the front of the ranked list and an error near the back produce the same loss, whereas the goal of image retrieval is to ensure as far as possible that the first K items are all correct, with later positions having lower priority.
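The position sensitivity that a ranking-based measure provides can be illustrated with a small sketch. The function name average_precision and the example lists are hypothetical; the computation follows the usual definition of AP over a ranked list of relevance flags.

```python
def average_precision(relevance):
    """AP of a ranked list; relevance[k] is True if the item at rank k+1 is in S+."""
    hits = 0
    total = 0.0
    for rank, is_positive in enumerate(relevance, start=1):
        if is_positive:
            hits += 1
            total += hits / rank  # Rank(i, S+) / Rank(i, S) for this positive item
    return total / hits if hits else 0.0


# Three positives and three negatives; one negative is misplaced in each faulty list.
perfect = [True, True, True, False, False, False]
error_at_back = [True, True, False, True, False, False]
error_at_front = [False, True, True, True, False, False]

ap_perfect = average_precision(perfect)
ap_back = average_precision(error_at_back)
ap_front = average_precision(error_at_front)
```

In this example the negative placed at rank 1 lowers AP more than the negative placed at rank 3, which is the behavior a purely distance-based loss does not capture.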
Disclosure of Invention
In order to solve the above problems, the invention provides a ranking optimization method and apparatus for visual retrieval, an electronic device, and a storage medium, which adopt an optimization method that directly optimizes the average precision as the loss function. This effectively overcomes the defect that a distance-based loss function focuses only on the similarity between features and does not increase the penalty for errors near the front of the ranked list, and it significantly improves the accuracy of visual retrieval.
To achieve this purpose, the invention provides the following specific technical solutions:
in a first aspect, the present application provides a ranking optimization method for visual search, including:
establishing a visual entity database;
acquiring a visual entity to be queried;
extracting the characteristics of the visual entity to be inquired and the visual entities in the set to be searched, wherein the set to be searched is a set of all the visual entities in the visual entity database;
calculating the distance between any visual entity in the set to be searched and the visual entity to be inquired according to a distance measurement function, and identifying the visual entity with the distance smaller than a preset threshold value as a similar target retrieval entity;
performing loss calculation on the target retrieval entities according to a loss function, to obtain a list of target retrieval entities arranged in descending order of feature similarity to the visual entity to be queried, and outputting the list;
wherein, in the loss function, q represents the visual entity to be queried, APq represents the average precision of the visual entity q to be queried, S represents the data set of the set to be searched in the visual entity database, si and sj respectively represent the similarity between visual entity i and visual entity j in the set to be searched and the visual entity q to be queried, n represents the number of visual entities in the set to be searched, S+ represents the set of visual entities of the same category as the visual entity to be queried, S- represents the set of visual entities of a different category from the visual entity to be queried, and τ represents the temperature parameter in the Sigmoid function.
With reference to the first aspect, in some possible implementations, the visual entity includes a key frame or an image frame in image data or video data.
With reference to the first aspect, in some possible implementations, the distance metric function comprises a Euclidean distance, a cosine similarity, a Manhattan distance, a Chebyshev distance, a Minkowski distance, a Mahalanobis distance, or a Hamming distance.
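For reference, most of the listed metric functions can be sketched in a few lines using only the standard library. The function names are illustrative; the Mahalanobis distance is omitted here because it additionally requires an inverse covariance matrix estimated from data.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 gives Manhattan, p=2 gives Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine_similarity(a, b):
    """A similarity rather than a distance: larger values mean more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))
```

Note that cosine similarity grows with similarity while the other functions grow with dissimilarity, so a threshold comparison has to be oriented accordingly.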
With reference to the first aspect, in some possible implementations, the feature extraction on the visual entity to be queried and the visual entities in the set to be searched is optionally performed using either a conventional computer vision feature extraction approach or a deep learning approach.
With reference to the first aspect, in some possible implementations, performing the feature extraction on the visual entity to be queried and the visual entities in the set to be searched in a deep learning-based manner includes:
obtaining image data of a training data set and label data of the images;
and constructing a deep learning feature extraction network.
In a second aspect, the present application further provides a ranking optimization apparatus for visual search, including:
the storage module is used for establishing a visual entity database;
the acquisition module is used for acquiring the visual entity to be inquired;
the characteristic extraction module is used for extracting characteristics of the visual entity to be inquired and the visual entity in the set to be searched, and the set to be searched is a set of all the visual entities in the visual entity database;
the identification module is used for calculating the distance between any visual entity in the set to be searched and the visual entity to be inquired according to a distance measurement function, and identifying the visual entity with the distance smaller than a preset threshold value as a similar target retrieval entity;
the processing module is used for performing loss calculation on the target retrieval entities according to a loss function, to obtain a list of target retrieval entities arranged in descending order of feature similarity to the visual entity to be queried, and outputting the list;
wherein, in the loss function, q represents the visual entity to be queried, APq represents the average precision of the visual entity q to be queried, S represents the data set of the set to be searched in the visual entity database, si and sj respectively represent the similarity between visual entity i and visual entity j in the set to be searched and the visual entity q to be queried, n represents the number of visual entities in the set to be searched, S+ represents the set of visual entities of the same category as the visual entity to be queried, S- represents the set of visual entities of a different category from the visual entity to be queried, and τ represents the temperature parameter in the Sigmoid function.
In a third aspect, the present application provides an electronic device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and to implement the method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the method of the first aspect.
Therefore, the embodiments of the invention provide a ranking optimization method and apparatus for visual retrieval, an electronic device, and a storage medium, in which the search ranking is optimized directly, using the average precision as the loss function, instead of optimizing a distance-based loss function. This effectively overcomes the defect that a distance-based loss function focuses only on the similarity between features and does not increase the penalty for errors near the front of the ranked list, and it significantly improves the accuracy of visual retrieval.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Detailed Description
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
An embodiment of the invention provides a ranking optimization method for visual retrieval, and fig. 1 is a schematic flow chart of the method. As shown in fig. 1, the ranking optimization method for visual retrieval according to the embodiment of the present invention includes: step S110, establishing a visual entity database; step S120, acquiring a visual entity to be queried; step S130, extracting features of the visual entity to be queried and of the visual entities in the set to be searched; step S140, calculating the distance between any visual entity in the set to be searched and the visual entity to be queried according to a distance metric function, and identifying the visual entities whose distance is smaller than a predetermined threshold as similar target retrieval entities; and step S150, performing loss calculation on the target retrieval entities according to the loss function, to obtain a list of target retrieval entities arranged in descending order of feature similarity to the visual entity to be queried, and outputting the list.
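The retrieval part of this flow (distance calculation, thresholding, and ordering) can be sketched as follows. This is a minimal illustration under stated assumptions: features are already extracted and held in a dictionary, the Euclidean distance is chosen as the metric, and the names retrieve, database, and threshold are hypothetical. The loss calculation of step S150 is not modeled here.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query_feature, database, threshold):
    """database maps an entity id to its feature vector.

    Returns the ids whose distance to the query is below `threshold`,
    ordered from most similar to least similar.
    """
    candidates = []
    for entity_id, feature in database.items():
        distance = euclidean(query_feature, feature)
        if distance < threshold:  # keep only similar target retrieval entities
            candidates.append((distance, entity_id))
    candidates.sort()  # ascending distance corresponds to descending similarity
    return [entity_id for _, entity_id in candidates]


database = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0], "d": [0.2, 0.1]}
result = retrieve([0.0, 0.0], database, threshold=2.0)
```

Entity "c" lies beyond the threshold and is dropped, while the remaining entities are returned nearest first.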
In an embodiment of the invention, the visual entity comprises a key frame or an image frame in the image data or the video data.
In step S130, the set to be searched is the set of all visual entities in the visual entity database, and feature extraction is optionally performed using either a conventional computer vision feature extraction approach or a deep learning approach. The features extracted by traditional computer vision can be global features or local features; global features include color histograms, shape features, texture features, etc.; local features include SIFT, SURF, ORB, etc. The deep learning-based approach includes directly extracting the output of a specific layer (such as a convolutional layer or a fully connected layer) as a feature vector for searching, or performing end-to-end training in combination with a metric function.
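As an illustration of a global feature, a quantized color histogram can be sketched as below. The representation of an image as a list of (r, g, b) tuples, the function name color_histogram, and the choice of 4 bins per channel are all assumptions made for this example.

```python
def color_histogram(pixels, bins=4):
    """pixels: iterable of (r, g, b) tuples with channel values in 0..255.

    Returns a normalized histogram with bins**3 entries that can serve
    as a global feature vector.
    """
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    count = 0
    for r, g, b in pixels:
        # Quantize each channel; min() guards the case where bins does not divide 256.
        qr = min(r // step, bins - 1)
        qg = min(g // step, bins - 1)
        qb = min(b // step, bins - 1)
        hist[qr * bins * bins + qg * bins + qb] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 0, 0)]
feature = color_histogram(pixels)
```

Two images can then be compared by applying any of the distance metric functions to their histograms.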
In a preferred embodiment of the present invention, an image is used as the representation of a visual entity, and step S140 uses a deep learning model together with a distance metric function such as the Euclidean distance, cosine similarity, Manhattan distance, Chebyshev distance, Minkowski distance, Mahalanobis distance, or Hamming distance to perform visual retrieval and ranking. Step S150 specifically includes: defining the image data set to be searched in the visual entity database as S = {Ii, i = 1, 2, ..., n}, where n is the number of images in the image data set to be searched; defining the image to be queried as q; and dividing S into S+ and S- according to whether each image belongs to the same category as the image to be queried, where S+ represents the set of images of the same category as the image to be queried, S- represents the set of images of a different category, S = S+ ∪ S-, and Rank(i, S) represents the rank of image i within the image set S.
The deep learning feature extraction network is constructed by training, in a metric learning-based manner, on the image data of the data set and the label data of the images; network architectures such as a twin (Siamese) network can be adopted. The method adopts a loss function defined in terms of APq, where APq represents the average precision of the image q to be queried.
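One form of the loss function consistent with this description is sketched below. The exact expression is an assumption: AP_q is averaged over n query images and subtracted from 1, so that a higher average precision gives a lower loss. The symbol L_AP is introduced here for illustration only.

```latex
L_{AP} = \frac{1}{n} \sum_{q=1}^{n} \left( 1 - AP_q \right)
```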
In the training process, any one image is selected from the training data set as the image q to be queried, and the training set is divided into S+ and S- according to the label data of the image q to be queried. Gradient backpropagation is then performed using the loss function defined above to optimize the neural network. After training, a feature extractor is obtained; that is, feature extraction on an input image I yields the image feature fI.
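The division of the training set into S+ and S- by label can be sketched as follows. The function name split_by_label is hypothetical, and excluding the query image itself from both sets is an assumption of this sketch rather than something the text specifies.

```python
def split_by_label(query_index, labels):
    """Return (s_plus, s_minus) as lists of indices into `labels`.

    s_plus holds items sharing the query's label (query itself excluded),
    s_minus holds items with any other label.
    """
    query_label = labels[query_index]
    s_plus = [i for i, label in enumerate(labels)
              if i != query_index and label == query_label]
    s_minus = [i for i, label in enumerate(labels) if label != query_label]
    return s_plus, s_minus


labels = ["cat", "dog", "cat", "bird", "cat"]
s_plus, s_minus = split_by_label(0, labels)
```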
When defining the loss function, the ranking function is expressed by means of the Indicator function shown in fig. 2, where si and sj respectively represent the similarity between image i and image j in the image data set to be searched and the image q to be queried.
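A ranking function of this kind can be written with the Indicator function as sketched below. The exact expression is an assumption chosen to agree with the symbols si and sj defined in the text: the rank of item i is one plus the number of other items whose similarity to the query exceeds s_i.

```latex
\mathrm{Rank}(i, S) = 1 + \sum_{j \in S,\ j \neq i} \mathbb{1}\{ s_j - s_i > 0 \}
```

For example, with similarities {0.9, 0.7, 0.4}, the item with similarity 0.7 has exactly one item above it and therefore rank 2.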
As can be seen from fig. 2, the Indicator function is discontinuous at x = 0, which makes the AP-based loss function discontinuous, so that end-to-end training cannot be performed by optimization methods based on gradient descent and the like. Therefore, in view of the form of the Indicator function, a Sigmoid function is adopted to approximate it; that is, the Indicator function is expressed approximately as a Sigmoid function of x with parameter τ, where x is the independent variable and τ is a temperature parameter that controls the shape of the function.
The effect of the parameter on the function values is shown in fig. 4.
As can be seen from fig. 3, the Sigmoid function is continuous everywhere over the real domain and fits the Indicator function well. The Indicator function is therefore replaced approximately by the Sigmoid function. Because the Sigmoid function is continuous and differentiable, end-to-end optimization can be performed by optimization methods such as gradient descent, and the ranking objective function is optimized directly instead of a distance-based loss function. The optimization target is thus consistent with the desired outcome, which addresses the requirement in visual retrieval that similar images appear at the front of the ranked list.
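How closely a temperature-scaled Sigmoid follows the Indicator function can be checked numerically. The sketch below assumes the common form Sigmoid(x; τ) = 1 / (1 + exp(-x / τ)), which may differ in detail from the expression intended by the text; the sample points and temperature values are arbitrary.

```python
import math


def indicator(x):
    return 1.0 if x > 0 else 0.0


def sigmoid(x, tau):
    """Sigmoid with temperature tau > 0, written so that exp never overflows."""
    z = x / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


xs = [-1.0, -0.1, 0.1, 1.0]
# Largest deviation from the Indicator function over the sample points, per temperature.
max_gap = {
    tau: max(abs(sigmoid(x, tau) - indicator(x)) for x in xs)
    for tau in (1.0, 0.1, 0.01)
}
```

The deviation shrinks as τ decreases, while for any fixed τ > 0 the function remains smooth, which is what makes gradient-based optimization possible.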
After replacing the Indicator function with the Sigmoid function, an approximation of the ranking function is obtained, and finally an approximation of APq.
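The resulting approximation can be sketched as follows. The exact expression is an assumption: it is obtained by writing AP_q as a ratio of ranks, writing each rank as one plus a sum of Indicator terms, and replacing every Indicator term by a Sigmoid with temperature τ.

```latex
AP_q \approx \frac{1}{|S^{+}|} \sum_{i \in S^{+}}
\frac{1 + \sum_{j \in S^{+},\ j \neq i} \mathrm{Sigmoid}(s_j - s_i;\ \tau)}
     {1 + \sum_{j \in S,\ j \neq i} \mathrm{Sigmoid}(s_j - s_i;\ \tau)}
```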
As can be seen from fig. 4, when τ approaches zero, the right side of the equation approaches the left side. With each of the n images in the set S taken in turn as the query image, APq from the above equation is substituted into the loss function. Finally, with this as the loss function, the images in the image data set to be searched are arranged in descending order of feature similarity to the image to be queried according to the loss calculation result, and output.
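The smoothed average precision and the resulting loss can be sketched in plain Python as below. This is an illustration under stated assumptions, not the claimed implementation: the Sigmoid form, the loss as the mean of 1 - APq, and the names smooth_ap and ap_loss are all assumed, and a real training setup would compute the same quantity in an automatic differentiation framework so that gradients can be backpropagated.

```python
import math


def sigmoid(x, tau):
    """Sigmoid with temperature tau > 0, written so that exp never overflows."""
    z = x / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_ap(sims, positives, tau=0.01):
    """sims[i] is the similarity of item i to the query; positives is the index set S+."""
    total = 0.0
    for i in positives:
        # Smoothed Rank(i, S+): one plus the soft count of positives ranked above i.
        rank_pos = 1.0 + sum(sigmoid(sims[j] - sims[i], tau)
                             for j in positives if j != i)
        # Smoothed Rank(i, S): one plus the soft count of all items ranked above i.
        rank_all = 1.0 + sum(sigmoid(sims[j] - sims[i], tau)
                             for j in range(len(sims)) if j != i)
        total += rank_pos / rank_all
    return total / len(positives)


def ap_loss(queries, tau=0.01):
    """queries: list of (sims, positives) pairs; returns the mean of 1 - smoothed AP."""
    return sum(1.0 - smooth_ap(s, p, tau) for s, p in queries) / len(queries)


sims = [0.9, 0.8, 0.3, 0.1]
good = smooth_ap(sims, {0, 1})   # positives hold the two highest similarities
bad = smooth_ap(sims, {2, 3})    # positives hold the two lowest similarities
loss = ap_loss([(sims, {0, 1}), (sims, {2, 3})])
```

With a small temperature the smoothed values stay close to the exact AP of each ranking, so a query whose positives sit at the back of the list contributes a much larger loss than one whose positives sit at the front.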
In another aspect, an embodiment of the present invention provides a ranking optimization apparatus for visual retrieval, and fig. 5 is an overall framework diagram of the apparatus. As shown in fig. 5, the ranking optimization apparatus for visual retrieval according to an embodiment of the present invention includes: a storage module 501, configured to establish a visual entity database; an obtaining module 502, configured to obtain a visual entity to be queried; a feature extraction module 503, configured to perform feature extraction on the visual entity to be queried and the visual entities in a set to be searched, where the set to be searched is the set of all visual entities in the visual entity database; an identifying module 504, configured to calculate the distance between any visual entity in the set to be searched and the visual entity to be queried according to a distance metric function, and to identify the visual entities whose distance is smaller than a predetermined threshold as similar target retrieval entities; and a processing module 505, configured to perform loss calculation on the target retrieval entities according to a loss function, to obtain a list of target retrieval entities arranged in descending order of feature similarity to the visual entity to be queried, and to output the list. In the loss function, q represents the visual entity to be queried, APq represents the average precision of the visual entity q to be queried, S represents the data set of the set to be searched in the visual entity database, si and sj respectively represent the similarity between visual entity i and visual entity j in the set to be searched and the visual entity q to be queried, n represents the number of visual entities in the set to be searched, S+ represents the set of visual entities of the same category as the visual entity to be queried, S- represents the set of visual entities of a different category from the visual entity to be queried, and τ represents the temperature parameter in the Sigmoid function.
In one possible design, the ranking optimization apparatus for visual retrieval includes a processor and a memory, where the memory is used for storing a program that supports the apparatus in executing the ranking optimization method for visual retrieval, and the processor is configured to execute the program stored in the memory.
In yet another aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor; the memory is used for storing a computer program; and the processor is used for executing the computer program and, when executing the computer program, implementing any one of the above ranking optimization methods for visual retrieval.
In still another aspect, an embodiment of the present invention provides a computer-readable storage medium which stores a computer program that, when executed by a processor, causes the processor to implement any one of the above ranking optimization methods for visual retrieval.
Compared with the prior art, the beneficial effects of this application are as follows:
A ranking optimization method and apparatus for visual retrieval, an electronic device, and a storage medium are provided for end-to-end visual search. The application points out that metric-based loss functions lead to inaccurate search results when visual search is optimized, and adopts an optimization method that directly optimizes the average precision as the loss function. By means of the indicator function, the relation between the position in the ranked list and the similarity between the queried features and the query feature is established explicitly. Analysis shows that, because the indicator function is non-differentiable, optimization methods based on gradient descent and the like cannot be applied, so that optimization based on the ranking loss cannot be carried out directly. By observing the form of the indicator function, the Sigmoid(x; τ) family of functions is used to approximate the indicator function; the Sigmoid(x; τ) family of functions is continuous and differentiable and can be optimized by optimization methods based on gradient descent and the like.
By directly optimizing the search ranking instead of optimizing a distance-based loss function, the method effectively overcomes the defect that a distance-based loss function focuses only on the similarity between features and does not increase the penalty for errors near the front of the ranked list, and it significantly improves the accuracy of visual retrieval. Furthermore, the technical solution of this application is easy to implement, clear in structure, and easy to maintain and upgrade; the neural network trained by the method serves as a feature extractor and can be applied to downstream tasks such as visual clustering and visual recognition; and the modular structure can be combined with different network structures, with batch sampling functions being plug-and-play, so that the practicability is high.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments. The embodiment of the device corresponds to the embodiment of the method, so that the description of the embodiment of the device is relatively simple, and the related description can refer to the description of the embodiment of the method.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.