CN117648456A - Image searching method, device, electronic equipment and storage medium - Google Patents

Image searching method, device, electronic equipment and storage medium

Info

Publication number
CN117648456A
Authority
CN
China
Prior art keywords
feature vector
image
local feature
global
global feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311555954.4A
Other languages
Chinese (zh)
Inventor
吴佳涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202311555954.4A
Publication of CN117648456A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an image searching method, an image searching device, an electronic device and a storage medium. The method comprises the following steps: acquiring a first image and a second image corresponding to the first image; extracting a first global feature vector and a first local feature vector based on the first image; extracting a second global feature vector and a second local feature vector based on the second image; determining a similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector and the second local feature vector; and when the similarity result indicates that the images are similar, determining that the second image is a target image corresponding to the first image. Because the similarity of two images is determined from both the global feature vector and the local feature vector, the accuracy of the image similarity result is improved, and thus the accuracy of image searching is improved.

Description

Image searching method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image searching technologies, and in particular, to an image searching method, an image searching device, an electronic device, and a storage medium.
Background
Similar-image search refers to measuring the similarity between two or more images and judging, from the measurement result, whether the images are similar. In the long-video field, frames extracted from a video often require a series of subsequent processing steps. A similar-image search algorithm is then needed as a front-end module: large numbers of similar frames are aggregated into clusters, and only one image per cluster needs to be selected as a representative. This greatly reduces the processing load of subsequent algorithms and improves their overall efficiency. Most existing similar-image search algorithms determine whether two images are similar based on a single image feature.
However, such schemes based on a single image feature produce image similarity results of low accuracy, which leads to low search precision.
Disclosure of Invention
An object of the embodiments of the present application is to provide an image searching method, an image searching device, an electronic device and a storage medium, so as to solve the problem that schemes which search for similar images based on a single image feature have low search precision. The specific technical solutions are as follows:
in a first aspect, the present application provides an image searching method, including:
acquiring a first image and a second image corresponding to the first image;
extracting a first global feature vector and a first local feature vector based on the first image;
extracting a second global feature vector and a second local feature vector based on the second image;
determining a similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector;
and when the similarity result indicates that the images are similar, determining that the second image is a target image corresponding to the first image.
In one possible implementation manner, the extracting a first global feature vector and a first local feature vector based on the first image includes:
the first image is input to a feature extraction model to output the first global feature vector and the first local feature vector from the feature extraction model.
In one possible implementation, the feature extraction model comprises a first base network, a second base network, a local feature vector generation network and a global feature vector generation network, the local feature vector generation network and the global feature vector generation network each being trained with at least two loss functions, and
the inputting of the first image to a feature extraction model to output the first global feature vector and the first local feature vector from the feature extraction model comprises:
inputting the first image to the first base network to output a local feature map by the first base network;
inputting the local feature map to the local feature vector generation network to output the first local feature vector from the local feature vector generation network;
inputting the local feature map to the second base network to output a global feature map by the second base network;
the global feature map is input to the global feature vector generation network to output the first global feature vector from the global feature vector generation network.
In one possible implementation, the local feature vector generation network processes the local feature map by:
performing average pooling on the local feature map to obtain a first local feature map;
performing maximum pooling on the first local feature map to obtain a second local feature map;
and inputting the sum of the first local feature map and the second local feature map to a fully connected layer to obtain the first local feature vector.
In one possible implementation, the global feature vector generation network processes the global feature map by:
performing average pooling on the global feature map to obtain a first global feature map;
performing maximum pooling on the first global feature map to obtain a second global feature map;
and inputting the sum of the first global feature map and the second global feature map to a fully connected layer to obtain the first global feature vector.
In one possible implementation manner, the extracting a second global feature vector and a second local feature vector based on the second image includes:
the second image is input to the feature extraction model to output the second global feature vector and the second local feature vector from the feature extraction model.
In one possible implementation manner, the determining the similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector includes:
determining a first distance between the first global feature vector and the second global feature vector;
determining a second distance between the first local feature vector and the second local feature vector if the first distance is less than a first threshold;
determining that the similarity result of the first image and the second image is similar when the second distance is smaller than a second threshold;
and determining that the similarity result of the first image and the second image is dissimilar when the first distance is greater than or equal to the first threshold or the second distance is greater than or equal to the second threshold.
In a second aspect, the present application provides an image search apparatus, including:
the acquisition module is used for acquiring a first image and a second image corresponding to the first image;
a first extraction module for extracting a first global feature vector and a first local feature vector based on the first image;
a second extraction module for extracting a second global feature vector and a second local feature vector based on the second image;
a first determining module, configured to determine the similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector;
and a second determining module, configured to determine that the second image is the target image corresponding to the first image when the similarity result indicates that the images are similar.
In a possible embodiment, the first extraction module is specifically configured to:
the first image is input to a feature extraction model to output the first global feature vector and the first local feature vector from the feature extraction model.
In one possible implementation, the feature extraction model comprises a first base network, a second base network, a local feature vector generation network and a global feature vector generation network, the local feature vector generation network and the global feature vector generation network each being trained with at least two loss functions, and
the first extraction module is further configured to:
inputting the first image to the first base network to output a local feature map by the first base network;
inputting the local feature map to the local feature vector generation network to output the first local feature vector from the local feature vector generation network;
inputting the local feature map to the second base network to output a global feature map by the second base network;
the global feature map is input to the global feature vector generation network to output the first global feature vector from the global feature vector generation network.
In one possible embodiment, the first extraction module is further configured to:
performing average pooling on the local feature map to obtain a first local feature map;
performing maximum pooling on the first local feature map to obtain a second local feature map;
and inputting the sum of the first local feature map and the second local feature map to a fully connected layer to obtain the first local feature vector.
In one possible embodiment, the first extraction module is further configured to:
performing average pooling on the global feature map to obtain a first global feature map;
performing maximum pooling on the first global feature map to obtain a second global feature map;
and inputting the sum of the first global feature map and the second global feature map to a fully connected layer to obtain the first global feature vector.
In a possible embodiment, the second extraction module is specifically configured to:
the second image is input to the feature extraction model to output the second global feature vector and the second local feature vector from the feature extraction model.
In a possible implementation manner, the first determining module is specifically configured to:
determining a first distance between the first global feature vector and the second global feature vector;
determining a second distance between the first local feature vector and the second local feature vector if the first distance is less than a first threshold;
determining that the similarity result of the first image and the second image is similar when the second distance is smaller than a second threshold;
and determining that the similarity result of the first image and the second image is dissimilar when the first distance is greater than or equal to the first threshold or the second distance is greater than or equal to the second threshold.
In a third aspect, an electronic device is provided, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method steps of any one of the first aspects when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of the first aspects.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the image search methods described above.
The beneficial effects of the embodiments of the present application are as follows:
In the embodiments of the present application, a first image and a second image corresponding to the first image are first acquired; a first global feature vector and a first local feature vector are then extracted based on the first image, and a second global feature vector and a second local feature vector are extracted based on the second image; finally, the similarity result of the first image and the second image is determined based on the first global feature vector, the first local feature vector, the second global feature vector and the second local feature vector, and when the similarity result indicates that the images are similar, the second image is determined to be a target image corresponding to the first image. Because the similarity of two images is determined from both the global feature vector and the local feature vector, the accuracy of the image similarity result is improved, and thus the accuracy of image searching is improved.
Of course, not all of the above-described advantages need be achieved simultaneously in practicing any one of the products or methods of the present application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
One or more embodiments are illustrated by way of example, and not limitation, in the accompanying drawings, in which like reference numerals indicate similar elements; the figures are not to be taken as limiting unless otherwise indicated.
Fig. 1 is a flowchart of an image searching method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a feature vector comparison process according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a feature vector extraction process according to an embodiment of the present application;
fig. 4 is a schematic diagram of another feature vector extraction process according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an image searching device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
The following disclosure provides many different embodiments, or examples, for implementing different structures of the invention. In order to simplify the present disclosure, components and arrangements of specific examples are described below. They are, of course, merely examples and are not intended to limit the invention. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
Fig. 1 is a flowchart of an image searching method according to an embodiment of the present application. The method can be applied to one or more electronic devices such as smartphones, notebook computers, desktop computers, portable computers and servers. The execution body of the method may be hardware or software. When the execution body is hardware, it may be one or more of the above electronic devices: a single electronic device may perform the method, or a plurality of electronic devices may cooperate with one another to perform it. When the execution body is software, the method may be implemented as one or more pieces of software or software modules. No particular limitation is imposed here.
As shown in fig. 1, the method specifically includes:
s101, acquiring a first image and a second image corresponding to the first image.
The image searching method of the present application is used to search for a target image similar to a reference image (namely, the first image). In application, the first image may be input or specified by a user.
The second image is an image whose similarity with the first image is to be judged.
In one embodiment, obtaining the second image corresponding to the first image may include the following step: receiving an image input by a user and determining that image to be the second image. In this way, the second image to be searched can be flexibly specified according to the user's needs.
In another embodiment, obtaining the second image corresponding to the first image may include the following steps: determining the image type of the first image, searching a preset image set for candidate images corresponding to that image type, and determining each candidate image to be a second image.
The image type refers to the type of content contained in an image, for example, the image type corresponding to an image containing a person is a person image, the image type corresponding to an image containing a landscape is a landscape image, the image type corresponding to an image containing an animal is an animal image, and so on.
In this embodiment, candidate images of the same image type as the first image may be determined in the image set first, and then each candidate image is determined to be a second image, thereby reducing the workload of subsequent image searching.
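As an illustrative sketch of this candidate-selection step (the patent does not specify an implementation), the preset image set could be stored as (image, type) pairs and filtered against the type of the first image; the helper `predict_image_type` is a hypothetical classifier, not something defined in the patent:

```python
from typing import Callable, List, Tuple

def select_candidates(first_image: object,
                      image_set: List[Tuple[object, str]],
                      predict_image_type: Callable[[object], str]) -> List[object]:
    """Keep only images of the preset set whose image type (person image,
    landscape image, animal image, ...) matches that of the first image;
    each surviving candidate is then treated as a second image."""
    target_type = predict_image_type(first_image)  # hypothetical classifier
    return [image for image, image_type in image_set
            if image_type == target_type]
```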
S102, extracting a first global feature vector and a first local feature vector based on the first image.
And S103, extracting a second global feature vector and a second local feature vector based on the second image.
S102 and S103 are described together below.
In this embodiment, S102 specifically includes: inputting the first image to a feature extraction model so that the feature extraction model outputs the first global feature vector and the first local feature vector. S103 specifically includes: inputting the second image to the feature extraction model so that the feature extraction model outputs the second global feature vector and the second local feature vector.
The feature extraction model may be, for example, a CNN (Convolutional Neural Network) model.
In the embodiments of the present application, for each image, the local feature vector and the global feature vector corresponding to the image are extracted simultaneously by the same model (namely, the feature extraction model) for use in the subsequent image search. This reduces the time consumed by feature extraction and improves efficiency.
The specific processing procedure of the feature extraction model is explained in detail in the examples below and is not repeated here.
S104, determining the similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector and the second local feature vector.
In the embodiment of the present application, the specific implementation of S104 may include the following steps:
step A1, determining a first distance between the first global feature vector and the second global feature vector;
step A2 of determining a second distance between the first local feature vector and the second local feature vector if the first distance is less than a first threshold;
step A3, determining that the similarity result of the first image and the second image is similar when the second distance is smaller than a second threshold;
and step A4, determining that the similarity result of the first image and the second image is dissimilar when the first distance is greater than or equal to the first threshold or the second distance is greater than or equal to the second threshold.
In this embodiment, as shown in fig. 2, a distance1 between the first global feature vector G1 and the second global feature vector G2 is calculated first. If distance1 is smaller than a first threshold th1, the first image and the second image are similar in their global features; the pair is then treated as a candidate similar pair, and the similarity judgment on the local features begins: a distance2 between the first local feature vector L1 and the second local feature vector L2 is calculated, and if distance2 is smaller than a second threshold th2, the local features are also similar, so the first image and the second image are determined to be similar. If distance1 is greater than or equal to th1, or distance2 is greater than or equal to th2, the first image and the second image are determined to be dissimilar.
In application, the distance between feature vectors may be the Euclidean distance, which for two D-dimensional feature vectors $\mathbf{x}$ and $\mathbf{y}$ is calculated as $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{D} (x_i - y_i)^2}$.
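As a minimal sketch of steps A1 to A4 (an illustration, not the patent's own code), the two-stage comparison can be written in Python with NumPy; the threshold values th1 and th2 are assumed here, since the patent leaves them open:

```python
import numpy as np

def euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2) for two D-dimensional vectors."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def similarity_result(g1: np.ndarray, l1: np.ndarray,
                      g2: np.ndarray, l2: np.ndarray,
                      th1: float = 0.8, th2: float = 0.6) -> bool:
    """Two-stage check: global feature vectors first, then local ones.

    g1/l1 and g2/l2 are the global/local feature vectors of the first
    and second image; th1 and th2 are illustrative thresholds. Returns
    True when the similarity result is "similar"."""
    distance1 = euclidean_distance(g1, g2)   # step A1
    if distance1 >= th1:                     # step A4: global features differ
        return False
    distance2 = euclidean_distance(l1, l2)   # step A2: candidate similar pair
    return distance2 < th2                   # step A3 / step A4
```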
and S105, determining the second image as a target image corresponding to the first image when the similar result is similar.
In this embodiment of the present application, if the result of similarity is similar, it may be determined that the second image is a target image similar to the first image, and if the result of similarity is dissimilar, it may be determined that the second image is not a target image similar to the first image. Thereby, a search for similar images to the first image is achieved.
In this embodiment, a first image and a second image corresponding to the first image are first acquired; a first global feature vector and a first local feature vector are then extracted based on the first image, and a second global feature vector and a second local feature vector are extracted based on the second image; finally, the similarity result of the first image and the second image is determined based on the first global feature vector, the first local feature vector, the second global feature vector and the second local feature vector, and when the similarity result indicates that the images are similar, the second image is determined to be a target image corresponding to the first image. Because the similarity of two images is determined from both the global feature vector and the local feature vector, the accuracy of the image similarity result is improved, and thus the accuracy of image searching is improved.
In yet another embodiment of the present application, the feature extraction model includes a first base network, a second base network, a local feature vector generation network, and a global feature vector generation network, where the local feature vector generation network and the global feature vector generation network are respectively trained by at least two penalty functions.
In the training of the feature extraction model, a loss function is needed to measure the difference between the model's prediction and the ground truth, so that the model parameters can be updated by back-propagating this difference. In the embodiments of the present application, the loss function measures the similarity of feature vectors between similar images; common loss functions include cosine similarity, Euclidean distance, ArcFace, Multi-Similarity, Proxy and the like. Each loss function can be regarded as defining a feature space, and the smaller the loss, the more discriminative that feature space.
In the embodiments of the present application, at least two loss functions are used simultaneously to constrain the same feature vector. This guarantees that the feature vector is discriminative in at least two feature spaces and is therefore, in theory, more discriminative overall. In theory, the more loss functions used simultaneously, the more discriminative the feature vector; in practice, however, using more loss functions does not always improve the result, because many loss functions are similar in design or define similar feature spaces, so stacking them brings no further gain.
Therefore, in practical applications, two loss functions are generally used, which guarantees discriminativeness while keeping the training process from becoming overly complicated; for example, the Multi-Similarity loss and the Proxy loss are used simultaneously to constrain the same feature vector.
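As a rough sketch of this two-loss constraint, the pytorch-metric-learning library provides both losses; the library choice, the 1:1 weighting and all hyperparameter values below are assumptions for illustration, not part of the patent:

```python
import torch
from pytorch_metric_learning import losses

EMBEDDING_DIM, NUM_CLASSES = 512, 1000   # assumed training setup

ms_loss = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)
proxy_loss = losses.ProxyAnchorLoss(num_classes=NUM_CLASSES,
                                    embedding_size=EMBEDDING_DIM,
                                    margin=0.1, alpha=32)

def combined_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Constrain the same feature vector with both losses at once;
    the equal weighting of the two terms is an illustrative assumption."""
    return ms_loss(embeddings, labels) + proxy_loss(embeddings, labels)
```

Note that the Proxy loss carries learnable proxy vectors of its own, so its parameters would also need to be registered with the optimizer during training.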
Based on this, the specific process of the feature extraction model may include the steps of:
step B1, inputting the first image into the first basic network so as to output a local feature map by the first basic network;
step B2, inputting the local feature map to the local feature vector generation network so as to output the first local feature vector by the local feature vector generation network;
step B3, inputting the local feature map to the second base network so as to output a global feature map by the second base network;
and step B4, inputting the global feature map to the global feature vector generation network so as to output the first global feature vector by the global feature vector generation network.
In this embodiment, as shown in fig. 3, the first image first passes through base network 1 (i.e., the first base network), which extracts a local feature map of dimensions (B, H1, W1, C1). The local feature map is input to feature vector generation module E (i.e., the local feature vector generation network), which outputs the corresponding local feature vector. In addition, the local feature map is input to base network 2 (i.e., the second base network), which extracts a global feature map of dimensions (B, H2, W2, C2); the global feature map is then input to feature vector generation module E (i.e., the global feature vector generation network), which outputs the corresponding global feature vector. Here, B is the number of images (e.g., B = 1 for a single image), H and W are the height and width of the feature map, and C is its channel dimension.
Specifically, the local feature vector generation network processes the local feature map as follows: average pooling is performed on the local feature map to obtain a first local feature map, maximum pooling is performed on the first local feature map to obtain a second local feature map, and the sum of the first local feature map and the second local feature map is input to a fully connected layer to obtain the first local feature vector.
The global feature vector generation network processes the global feature map as follows: average pooling is performed on the global feature map to obtain a first global feature map, maximum pooling is performed on the first global feature map to obtain a second global feature map, and the sum of the first global feature map and the second global feature map is input to a fully connected layer to obtain the first global feature vector.
The local feature vector generation network and the global feature vector generation network have the same network structure, so the global feature vector is generated in the same way as the local feature vector described above.
As shown in fig. 4, after a feature map of dimensions (B, H, W, C) is input to the corresponding feature vector generation network, a first feature map AVG (B, 1, C) and a second feature map MAX (B, 1, C) are first obtained through average pooling and maximum pooling, respectively; the two feature maps are then added, and the sum is input to the fully connected layer to obtain the output feature vector (i.e., the local feature vector or the global feature vector). The output feature vector (B, 1, D) has a specified dimension D, typically 128, 256, 512 or 1024. A 128-dimensional vector is slightly less discriminative, but the subsequent similarity calculation is faster and more efficient; a 1024-dimensional vector discriminates more accurately, but its similarity calculation is more time-consuming. The output dimension can be set according to actual needs.
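The following PyTorch sketch mirrors the structure described for fig. 3 and fig. 4. It is an illustration under stated assumptions: the backbone split (a ResNet-50 cut after layer2 as base network 1, layer3 and layer4 as base network 2) and the output dimension D = 512 are choices of this sketch, not requirements of the patent, and the pooling head applies average pooling and maximum pooling in parallel, following the "respectively" of the fig. 4 description, before adding the two maps. Note also that PyTorch lays feature maps out as (B, C, H, W), whereas the patent writes (B, H, W, C).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VectorHead(nn.Module):
    """Feature vector generation module E (fig. 4): average-pool and
    max-pool the feature map, add the two pooled maps, then project the
    sum to a D-dimensional feature vector with a fully connected layer."""
    def __init__(self, in_channels: int, out_dim: int = 512):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Linear(in_channels, out_dim)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        pooled = self.avg_pool(feature_map) + self.max_pool(feature_map)  # (B, C, 1, 1)
        return self.fc(pooled.flatten(1))                                 # (B, D)

class FeatureExtractor(nn.Module):
    """Two-branch extractor (fig. 3): base network 1 produces the local
    feature map, base network 2 deepens it into the global feature map,
    and each map passes through its own VectorHead."""
    def __init__(self, out_dim: int = 512):
        super().__init__()
        r = resnet50()
        # Assumed split of the backbone into the two base networks.
        self.base1 = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                   r.layer1, r.layer2)   # -> (B, 512, H1, W1)
        self.base2 = nn.Sequential(r.layer3, r.layer4)   # -> (B, 2048, H2, W2)
        self.local_head = VectorHead(512, out_dim)
        self.global_head = VectorHead(2048, out_dim)

    def forward(self, image: torch.Tensor):
        local_map = self.base1(image)          # local feature map
        global_map = self.base2(local_map)     # global feature map
        return self.global_head(global_map), self.local_head(local_map)
```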
In addition, the feature extraction process of the feature extraction model for the second image is substantially the same as that for the first image, and will not be described here.
With the above scheme, the local feature vector and the global feature vector corresponding to an image are extracted simultaneously by the same model (namely, the feature extraction model) for use in the subsequent image search, which reduces the time consumed by feature extraction and improves efficiency.
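Putting the sketches together, a hypothetical end-to-end search over the candidate second images could look as follows; `preprocess` is a hypothetical helper that turns an image into a normalized (1, 3, H, W) tensor, and all other names come from the sketches above:

```python
import torch

model = FeatureExtractor(out_dim=512).eval()  # from the sketch above

def search_similar(first_image, candidate_images, preprocess):
    """Return every candidate whose similarity result with the first
    image is "similar" (S101 to S105), using the earlier sketches."""
    with torch.no_grad():
        g1, l1 = model(preprocess(first_image))
        targets = []
        for second_image in candidate_images:
            g2, l2 = model(preprocess(second_image))
            if similarity_result(g1.numpy().ravel(), l1.numpy().ravel(),
                                 g2.numpy().ravel(), l2.numpy().ravel()):
                targets.append(second_image)   # S105: target image found
    return targets
```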
Based on the same technical concept, the embodiment of the present application further provides an image searching apparatus, as shown in fig. 5, including:
an acquiring module 501, configured to acquire a first image and a second image corresponding to the first image;
a first extraction module 502, configured to extract a first global feature vector and a first local feature vector based on the first image;
a second extracting module 503, configured to extract a second global feature vector and a second local feature vector based on the second image;
a first determining module 504, configured to determine the similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector;
and a second determining module 505, configured to determine that the second image is a target image corresponding to the first image when the similarity result indicates that the images are similar.
In a possible embodiment, the first extraction module is specifically configured to:
the first image is input to a feature extraction model to output the first global feature vector and the first local feature vector from the feature extraction model.
In one possible implementation, the feature extraction model comprises a first base network, a second base network, a local feature vector generation network and a global feature vector generation network, the local feature vector generation network and the global feature vector generation network each being trained with at least two loss functions, and
the first extraction module is further configured to:
inputting the first image to the first base network to output a local feature map by the first base network;
inputting the local feature map to the local feature vector generation network to output the first local feature vector from the local feature vector generation network;
inputting the local feature map to the second base network to output a global feature map by the second base network;
the global feature map is input to the global feature vector generation network to output the first global feature vector from the global feature vector generation network.
In one possible embodiment, the first extraction module is further configured to:
performing average pooling on the local feature map to obtain a first local feature map;
performing maximum pooling on the first local feature map to obtain a second local feature map;
and inputting the sum of the first local feature map and the second local feature map to a fully connected layer to obtain the first local feature vector.
In one possible embodiment, the first extraction module is further configured to:
performing average pooling on the global feature map to obtain a first global feature map;
performing maximum pooling on the first global feature map to obtain a second global feature map;
and inputting the sum of the first global feature map and the second global feature map to a fully connected layer to obtain the first global feature vector.
In a possible embodiment, the second extraction module is specifically configured to:
the second image is input to the feature extraction model to output the second global feature vector and the second local feature vector from the feature extraction model.
In a possible implementation manner, the first determining module is specifically configured to:
determining a first distance between the first global feature vector and the second global feature vector;
determining a second distance between the first local feature vector and the second local feature vector if the first distance is less than a first threshold;
determining that the similarity result of the first image and the second image is similar when the second distance is smaller than a second threshold;
and determining that the similarity result of the first image and the second image is dissimilar when the first distance is greater than or equal to the first threshold or the second distance is greater than or equal to the second threshold.
In this embodiment, a first image and a second image corresponding to the first image are first acquired; a first global feature vector and a first local feature vector are then extracted based on the first image, and a second global feature vector and a second local feature vector are extracted based on the second image; finally, the similarity result of the first image and the second image is determined based on the first global feature vector, the first local feature vector, the second global feature vector and the second local feature vector, and when the similarity result indicates that the images are similar, the second image is determined to be a target image corresponding to the first image. Because the similarity of two images is determined from both the global feature vector and the local feature vector, the accuracy of the image similarity result is improved, and thus the accuracy of image searching is improved.
Based on the same technical concept, an embodiment of the present application further provides an electronic device. As shown in fig. 6, the electronic device includes a processor 111, a communication interface 112, a memory 113 and a communication bus 114, wherein the processor 111, the communication interface 112 and the memory 113 communicate with one another through the communication bus 114;
a memory 113 for storing a computer program;
the processor 111 is configured to execute a program stored in the memory 113, and implement the following steps:
acquiring a first image and a second image corresponding to the first image;
extracting a first global feature vector and a first local feature vector based on the first image;
extracting a second global feature vector and a second local feature vector based on the second image;
determining a similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector;
and when the similarity result indicates that the images are similar, determining that the second image is a target image corresponding to the first image.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a random access memory (RAM) or a non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided herein, there is also provided a computer readable storage medium having stored therein a computer program which when executed by a processor implements the steps of any of the image search methods described above.
In yet another embodiment provided herein, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the image search methods of the above embodiments.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general-purpose hardware platform, or by hardware alone. Based on this understanding, the part of the foregoing technical solution that in essence contributes to the related art may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the respective embodiments or in some parts of the embodiments.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An image search method, the method comprising:
acquiring a first image and a second image corresponding to the first image;
extracting a first global feature vector and a first local feature vector based on the first image;
extracting a second global feature vector and a second local feature vector based on the second image;
determining a similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector;
and when the similarity result indicates that the images are similar, determining that the second image is a target image corresponding to the first image.
2. The method of claim 1, wherein the extracting a first global feature vector and a first local feature vector based on the first image comprises:
the first image is input to a feature extraction model to output the first global feature vector and the first local feature vector from the feature extraction model.
3. The method of claim 2, wherein the feature extraction model comprises a first base network, a second base network, a local feature vector generation network and a global feature vector generation network, the local feature vector generation network and the global feature vector generation network each being trained with at least two loss functions, and
wherein the inputting the first image to a feature extraction model to output the first global feature vector and the first local feature vector from the feature extraction model comprises:
inputting the first image to the first base network to output a local feature map by the first base network;
inputting the local feature map to the local feature vector generation network to output the first local feature vector from the local feature vector generation network;
inputting the local feature map to the second base network to output a global feature map by the second base network;
the global feature map is input to the global feature vector generation network to output the first global feature vector from the global feature vector generation network.
4. A method according to claim 3, wherein the local feature vector generation network processes the local feature map by:
performing average pooling on the local feature map to obtain a first local feature map;
performing maximum pooling on the first local feature map to obtain a second local feature map;
and inputting the sum of the first local feature map and the second local feature map to a fully connected layer to obtain the first local feature vector.
5. A method according to claim 3, wherein the global feature vector generation network processes the global feature map by:
performing average pooling on the global feature map to obtain a first global feature map;
performing maximum pooling on the first global feature map to obtain a second global feature map;
and inputting the sum of the first global feature map and the second global feature map to a fully connected layer to obtain the first global feature vector.
6. The method of claim 2, wherein the extracting a second global feature vector and a second local feature vector based on the second image comprises:
the second image is input to the feature extraction model to output the second global feature vector and the second local feature vector from the feature extraction model.
7. The method of claim 1, wherein the determining the similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector comprises:
determining a first distance between the first global feature vector and the second global feature vector;
determining a second distance between the first local feature vector and the second local feature vector if the first distance is less than a first threshold;
determining that the similarity result of the first image and the second image is similar when the second distance is smaller than a second threshold;
and determining that the similarity result of the first image and the second image is dissimilar when the first distance is greater than or equal to the first threshold or the second distance is greater than or equal to the second threshold.
8. An image search apparatus, the apparatus comprising:
the acquisition module is used for acquiring a first image and a second image corresponding to the first image;
a first extraction module for extracting a first global feature vector and a first local feature vector based on the first image;
a second extraction module for extracting a second global feature vector and a second local feature vector based on the second image;
a first determining module, configured to determine the similarity result of the first image and the second image based on the first global feature vector, the first local feature vector, the second global feature vector, and the second local feature vector;
and the second determining module is used for determining that the second image is the target image corresponding to the first image under the condition that the similar result is similar.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method steps of any one of claims 1-7 when executing the program stored in the memory.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-7.
CN202311555954.4A 2023-11-21 2023-11-21 Image searching method, device, electronic equipment and storage medium Pending CN117648456A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311555954.4A CN117648456A (en) 2023-11-21 2023-11-21 Image searching method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311555954.4A CN117648456A (en) 2023-11-21 2023-11-21 Image searching method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117648456A true CN117648456A (en) 2024-03-05

Family

ID=90044321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311555954.4A Pending CN117648456A (en) 2023-11-21 2023-11-21 Image searching method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117648456A (en)

Similar Documents

Publication Publication Date Title
CN109359725B (en) Training method, device and equipment of convolutional neural network model and computer readable storage medium
CN110910422A (en) Target tracking method and device, electronic equipment and readable storage medium
CN112580668B (en) Background fraud detection method and device and electronic equipment
US11714921B2 (en) Image processing method with ash code on local feature vectors, image processing device and storage medium
CN103295022A (en) Image similarity calculation system and method
CN112733767B (en) Human body key point detection method and device, storage medium and terminal equipment
CN111553215A (en) Personnel association method and device, and graph convolution network training method and device
CN112256899B (en) Image reordering method, related device and computer readable storage medium
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN111008589B (en) Face key point detection method, medium, device and computing equipment
WO2019100348A1 (en) Image retrieval method and device, and image library generation method and device
CN113239218B (en) Method for concurrently executing face search on NPU-equipped device
CN110083731B (en) Image retrieval method, device, computer equipment and storage medium
CN111222558B (en) Image processing method and storage medium
CN113971224A (en) Image retrieval system, method and related equipment
CN113849679A (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN111860054A (en) Convolutional network training method and device
CN117648456A (en) Image searching method, device, electronic equipment and storage medium
CN112257726B (en) Target detection training method, system, electronic equipment and computer readable storage medium
CN117648457A (en) Image searching method, device, electronic equipment and storage medium
CN117648455A (en) Image searching method, device, electronic equipment and storage medium
CN110287943B (en) Image object recognition method and device, electronic equipment and storage medium
Matsumura et al. An FPGA-accelerated partial duplicate image retrieval engine for a document search system
CN113095211B (en) Image processing method, system and electronic equipment
CN113486918B (en) Image recognition method and device based on dynamic adjustment feature vector distribution trend

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination