CN114241222A - Image retrieval method and device - Google Patents

Image retrieval method and device

Info

Publication number
CN114241222A
Authority
CN
China
Prior art keywords
image
retrieved
determining
depth
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111519845.8A
Other languages
Chinese (zh)
Inventor
裴勇
杨启城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202111519845.8A priority Critical patent/CN114241222A/en
Publication of CN114241222A publication Critical patent/CN114241222A/en
Priority to PCT/CN2022/100676 priority patent/WO2023109069A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides an image retrieval method and device. The method comprises the following steps: determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved; determining, from a candidate image set, target candidate images corresponding to second feature vectors that satisfy a first similarity requirement with the first feature vector, the candidate image set comprising candidate images of different objects shot from different angles; and determining, from the target candidate images, a target candidate image that satisfies a second similarity requirement with a third feature vector of the image to be retrieved as the target image of the image to be retrieved, the third feature vector being determined based on the color RGB map of the image to be retrieved. Retrieving against candidate images shot from different angles reduces the influence of the shooting angle on image retrieval, and since the third feature vector is extracted based on the color RGB map, the two screening rounds consider different image information, which improves image retrieval accuracy.

Description

Image retrieval method and device
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to an image retrieval method, an image retrieval device, computing equipment and a computer-readable storage medium.
Background
With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). However, the financial industry's demands for security and real-time performance place correspondingly high requirements on these technologies.
With the rapid development of computers, the internet and multimedia in recent years, the volume of image data in the world keeps growing. Introducing an image retrieval function greatly widens the retrieval scope and improves retrieval accuracy. Image retrieval is widely applied in fields such as search engines, e-commerce platforms, security, and information authentication. For example, in a lost-and-found system, a user inputs an image of a lost item as the image to be retrieved, and the system retrieves the matching item image from a library and presents it, or an identification of it, to the user.
The core of image retrieval is comparing the similarity between the image to be retrieved and each candidate image in the candidate image set. Conventional methods for determining the similarity of two images do not consider the influence of the shooting angle, so even two images of the same article may receive a low similarity score.
In summary, the embodiments of the present invention provide an image retrieval method, so as to improve the accuracy of image retrieval.
Disclosure of Invention
The embodiment of the invention provides an image retrieval method, which is used for improving the accuracy of image retrieval.
In a first aspect, an embodiment of the present invention provides an image retrieval method, including:
determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved;
determining, from a candidate image set, target candidate images corresponding to second feature vectors that satisfy a first similarity requirement with the first feature vector; the candidate image set comprises candidate images of different objects shot from different angles;
determining a target candidate image which meets a second similarity requirement with a third feature vector of the image to be retrieved from all target candidate images as a target image of the image to be retrieved; the third feature vector is determined based on a color RGB map of the image to be retrieved.
The depth map of an image reflects the distance of each point in the image from the camera in the physical world, and thus the stereo characteristics of the objects in the image. In the method, the first feature vector is determined through the depth map of the image to be retrieved, so the obtained first feature vector reflects the shape information of the object in the image to be retrieved and accounts for the influence of the shooting angle. The similarity between the first feature vector and the feature vectors of the candidate images in the candidate image set is compared, and the target candidate images satisfying the first similarity requirement are determined; the target candidate images obtained in this way are highly accurate. Moreover, since the candidate image set comprises candidate images of different objects shot from different angles, retrieval against candidates of different angles further reduces the influence of the shooting angle on image retrieval. Because the depth map does not reflect color information, after this first screening a third feature vector is extracted based on the color RGB map of the image to be retrieved, and its similarity with each target candidate image is compared to determine the target image. A second screening is thus performed on the basis of the first, the two screenings consider different image information, and the accuracy of image retrieval is improved.
Optionally, determining a first feature vector of the image to be retrieved through a depth map of the image to be retrieved includes:
determining a normal vector of each sampling point in a depth map through the depth map of an image to be retrieved; the normal vector is used for representing orientation information of the sampling point;
determining a normal vector histogram of the depth map based on normal vectors of the sampling points; the normal vector histogram is used for representing the distribution state of the orientation information of each sampling point;
and obtaining the first feature vector according to the normal vector histogram.
The normal vector of each sampling point in the depth map reflects the orientation information of that sampling point, and the normal vector histogram obtained from the orientation information of all sampling points reflects the distribution of that orientation information, that is, the three-dimensional characteristics and shape information of the object in the image to be retrieved. In this way, differences in shooting angle are prevented from interfering with image retrieval accuracy.
Optionally, determining, by a depth map of an image to be retrieved, a normal vector of each sampling point in the depth map, including:
sampling the depth map of the image to be retrieved to obtain first sampling points;
determining a second sampling point and a third sampling point from the depth map according to a preset rule aiming at each first sampling point; and determining the normal vector of the first sampling point according to the depth vector of the first sampling point, the depth vector of the second sampling point and the depth vector of the third sampling point.
By the method, the normal vector of each sampling point can be determined, and the orientation information of each sampling point can be accurately reflected.
Optionally, determining a normal vector histogram of the depth map based on the normal vectors of the sampling points includes:
dividing the depth map into a plurality of regions, and determining the distribution state of each sampling point with different orientation information in any region according to the normal vector of each sampling point in the region;
and determining a normal vector histogram of the depth map according to the distribution state of each sampling point with different orientation information of a plurality of areas.
The depth map is divided into a plurality of regions, and for each region the distribution of sampling points with different orientation information is obtained; the normal vector histogram of the depth map is then determined from the distributions of all regions. This blurs the depth information of the image to some extent, lowering the sensitivity of the retrieval so that images representing the same object are not mistakenly judged dissimilar, which improves retrieval accuracy. Meanwhile, the distribution determined for each region reflects the shape information of the photographed object in that region, and combining the regions reflects the shape information of the object across the whole image to be retrieved. This approach therefore overcomes the inaccuracy caused by interference from angle changes and improves retrieval accuracy.
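The region-wise construction described above can be sketched as concatenating one histogram per region, where each sampling point has already been reduced to a discrete characteristic value (the function name and the flat list-of-lists input format are illustrative, not from the patent):

```python
from collections import Counter

def normal_vector_histogram(region_values, num_values):
    # region_values: list of lists; region_values[r] holds the binned
    # characteristic values of the sampling points falling in region r.
    # num_values: number of distinct characteristic values a point can take.
    # Returns one histogram per region, concatenated into the first feature vector.
    feature = []
    for values in region_values:
        counts = Counter(values)
        feature.extend(counts.get(v, 0) for v in range(num_values))
    return feature
```

Concatenating per-region histograms keeps coarse spatial layout (which shapes appear where) while discarding exact per-pixel depth, which is what makes the descriptor tolerant to angle changes.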
Optionally, determining a distribution state of each sampling point having different orientation information in the region according to a normal vector of each sampling point in the region includes:
aiming at a normal vector of any sampling point in the region, determining a first included angle formed by the normal vector and a first coordinate axis and a second included angle formed by the normal vector and a second coordinate axis; and performing binning processing on the first included angle and the second included angle to obtain a characteristic value of the normal vector;
and determining the distribution state of different characteristic values in the region according to the characteristic values of the normal vectors of all the sampling points in the region.
Binning the first included angle (formed by the normal vector and the first coordinate axis) and the second included angle (formed by the normal vector and the second coordinate axis) yields a characteristic value of the normal vector. This simplifies the numerical range of the two included angles, so the orientation information of a sampling point is represented by a small set of discrete values rather than a value from a wide continuous range. In subsequent image retrieval, this reduces the sensitivity, so that two images that actually represent the same object are not judged to have low similarity.
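A minimal sketch of this binning, assuming for illustration that the first and second coordinate axes are the x- and y-axes and that each angle is split into eight buckets (the patent does not fix these choices):

```python
import math

def binned_feature_value(normal, num_bins=8):
    # normal: (nx, ny, nz) normal vector of a sampling point.
    nx, ny, nz = normal
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    angle_x = math.acos(nx / norm)  # first included angle, in [0, pi]
    angle_y = math.acos(ny / norm)  # second included angle, in [0, pi]
    # Binning collapses each continuous angle into one of num_bins coarse buckets.
    bin_x = min(int(angle_x / math.pi * num_bins), num_bins - 1)
    bin_y = min(int(angle_y / math.pi * num_bins), num_bins - 1)
    # Combine the two bucket indices into a single characteristic value.
    return bin_x * num_bins + bin_y
```

With eight buckets per angle, every normal vector maps to one of 64 characteristic values, so small angular perturbations usually leave the value unchanged.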
Optionally, determining a target candidate image satisfying a second similarity requirement with a third feature vector of the image to be retrieved includes:
generating a depth weight map based on the depth information in the depth map;
inputting the RGB map and the depth weight map into a first branch and a second branch of a neural network model of a double-tower structure respectively, wherein each convolution layer in the first branch performs feature extraction on the RGB map based on the attention weight generated by the second branch in each layer, so as to generate a third feature vector of the image to be retrieved;
and calculating the similarity between the third feature vector and the fourth feature vector of any target candidate image, thereby determining the target candidate image with the similarity meeting the second similarity requirement.
In image retrieval, the background also has a great influence on the success rate of retrieval: a difference in background may result in low similarity between two images representing the same object. The depth map reflects the distance between the object in the image to be retrieved and the camera, so an attention weight based on the depth information is extracted from the depth map and used in the convolution computations on the image to be retrieved. More weight is thereby given to the foreground and the background is suppressed, so the obtained third feature vector focuses on the foreground and the influence of the background is ignored as far as possible. The fourth feature vector is produced in the same way, so computing the similarity between the third and fourth feature vectors improves retrieval accuracy.
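Stripped to its simplest form, the attention step is an elementwise product between an RGB-branch feature map and the depth-derived weights. The single-channel, pure-Python sketch below is an illustration only; in the actual two-tower model this weighting is applied at every convolution layer, with the weight map matched to each layer's resolution:

```python
def apply_depth_attention(feature_map, weight_map):
    # feature_map: 2-D list (one channel) produced by an RGB-branch conv layer.
    # weight_map: depth-derived attention weights of the same spatial shape.
    # Elementwise product gives foreground positions more influence on the
    # resulting third feature vector.
    return [[f * w for f, w in zip(frow, wrow)]
            for frow, wrow in zip(feature_map, weight_map)]
```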
Optionally, generating a depth weight map based on the depth information in the depth map includes:
normalizing each depth information in the depth map;
and performing weight conversion on the normalized depth information according to a preset mode to generate the depth weight map, wherein the preset mode is that the smaller the depth information is, the larger the weight value in the depth weight map is.
Since the depth information in the depth map indicates the distance from the camera, the farther a point is from the camera, the larger its depth value; to give the foreground more weight, the points with smaller depth values must therefore receive larger weight values. In this way the weight of the foreground is increased, attention is focused on the foreground, and retrieval accuracy is improved.
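The normalize-then-invert conversion described above can be sketched as follows (min-max normalization is an assumption for illustration; the patent only requires that smaller depths map to larger weights):

```python
def depth_weight_map(depth_map):
    # depth_map: 2-D list of depth values; smaller depth = closer = foreground.
    flat = [d for row in depth_map for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero on a constant-depth map
    # Normalize depths to [0, 1], then invert so nearer points get larger weights.
    return [[1.0 - (d - lo) / span for d in row] for row in depth_map]
```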
In a second aspect, an embodiment of the present invention further provides an image retrieval apparatus, including:
a determination unit configured to:
determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved;
determining, from a candidate image set, target candidate images corresponding to second feature vectors that satisfy a first similarity requirement with the first feature vector; the candidate image set comprises candidate images of different objects shot from different angles;
determining a target candidate image which meets a second similarity requirement with a third feature vector of the image to be retrieved from all target candidate images as a target image of the image to be retrieved; the third feature vector is determined based on a color RGB map of the image to be retrieved.
Optionally, the determining unit is specifically configured to:
determining a normal vector of each sampling point in a depth map through the depth map of an image to be retrieved; the normal vector is used for representing orientation information of the sampling point;
determining a normal vector histogram of the depth map based on normal vectors of the sampling points; the normal vector histogram is used for representing the distribution state of the orientation information of each sampling point;
and obtaining the first feature vector according to the normal vector histogram.
Optionally, the determining unit is specifically configured to:
sampling the depth map of the image to be retrieved to obtain first sampling points;
determining a second sampling point and a third sampling point from the depth map according to a preset rule aiming at each first sampling point; and determining the normal vector of the first sampling point according to the depth vector of the first sampling point, the depth vector of the second sampling point and the depth vector of the third sampling point.
Optionally, the determining unit is specifically configured to:
dividing the depth map into a plurality of regions, and determining the distribution state of each sampling point with different orientation information in any region according to the normal vector of each sampling point in the region;
and determining a normal vector histogram of the depth map according to the distribution state of each sampling point with different orientation information of a plurality of areas.
Optionally, the determining unit is specifically configured to:
aiming at a normal vector of any sampling point in the region, determining a first included angle formed by the normal vector and a first coordinate axis and a second included angle formed by the normal vector and a second coordinate axis; and performing binning processing on the first included angle and the second included angle to obtain a characteristic value of the normal vector;
and determining the distribution state of different characteristic values in the region according to the characteristic values of the normal vectors of all the sampling points in the region.
Optionally, the determining unit is specifically configured to:
generating a depth weight map based on the depth information in the depth map;
inputting the RGB map and the depth weight map into a first branch and a second branch of a neural network model of a double-tower structure respectively, wherein each convolution layer in the first branch performs feature extraction on the RGB map based on the attention weight generated by the second branch in each layer, so as to generate a third feature vector of the image to be retrieved;
and calculating the similarity between the third feature vector and the fourth feature vector of any target candidate image, thereby determining the target candidate image with the similarity meeting the second similarity requirement.
Optionally, the determining unit is specifically configured to:
normalizing each depth information in the depth map;
and performing weight conversion on the normalized depth information according to a preset mode to generate the depth weight map, wherein the preset mode is that the smaller the depth information is, the larger the weight value in the depth weight map is.
In a third aspect, an embodiment of the present invention further provides a computing device, including:
a memory for storing a computer program;
and the processor is used for calling the computer program stored in the memory and executing the image retrieval method listed in any mode according to the obtained program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, where a computer-executable program is stored, where the computer-executable program is configured to enable a computer to execute an image retrieval method listed in any one of the above manners.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of an image retrieval method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a first sampling point, a second sampling point, and a third sampling point according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a split depth map according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a normal vector histogram according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a neural network model of a double tower structure according to an embodiment of the present invention;
fig. 7 is a schematic diagram of feature extraction performed by a neural network model based on a double-tower structure according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a twin neural network according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
To make the objects, embodiments and advantages of the present application clearer, the following description of exemplary embodiments of the present application will clearly and completely describe the exemplary embodiments of the present application with reference to the accompanying drawings in the exemplary embodiments of the present application, and it is to be understood that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
All other embodiments obtained by a person skilled in the art from the exemplary embodiments described herein without inventive effort fall within the scope of the appended claims. In addition, while the disclosure herein is presented in terms of one or more exemplary examples, it should be appreciated that each aspect of the disclosure may also be implemented separately as a complete embodiment.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar objects or entities and are not necessarily intended to limit a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments described herein can, for example, be carried out in sequences other than those illustrated or described herein.
Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.
Fig. 1 illustrates an exemplary system architecture, which may be a server 100, including a processor 110, a communication interface 120, and a memory 130, to which embodiments of the present invention are applicable.
The communication interface 120 is used for communicating with a terminal device, and transceiving information transmitted by the terminal device to implement communication.
The processor 110 is a control center of the server 100, connects various parts of the entire server 100 using various interfaces and routes, performs various functions of the server 100 and processes data by operating or executing software programs and/or modules stored in the memory 130 and calling data stored in the memory 130. Alternatively, processor 110 may include one or more processing units.
The memory 130 may be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by operating the software programs and modules stored in the memory 130. The memory 130 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to a business process, and the like. Further, the memory 130 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
It should be noted that the structure shown in fig. 1 is only an example, and the embodiment of the present invention is not limited thereto.
The server shown in fig. 1 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
The embodiment of the invention provides a possible image retrieval method, which comprises the steps of extracting a feature vector of an image to be retrieved by adopting a trained feature extractor, extracting the feature vector of each candidate image in a candidate image set, then respectively calculating the similarity between the feature vector of the image to be retrieved and the feature vector of each candidate image, and taking the candidate image corresponding to the feature vector with the similarity meeting the requirement as a target image of the image to be retrieved.
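As a concrete sketch of this baseline pipeline (the function names, the use of cosine similarity, and the threshold value are all illustrative assumptions, not specified by the patent), retrieval reduces to comparing the query's feature vector with each candidate's and keeping the candidates whose similarity meets the requirement:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, candidate_vecs, threshold=0.8):
    # Return the indices of candidates whose similarity to the query
    # meets the similarity requirement.
    return [i for i, c in enumerate(candidate_vecs)
            if cosine_similarity(query_vec, c) >= threshold]
```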
In that method, the feature vector of the image to be retrieved is extracted based on its color RGB map. A feature vector extracted this way does not take into account the shape information of the object in the image, so when the shooting angle changes, the feature vector also changes greatly. Two images of the same object may then receive a low similarity due to the difference in shooting angle and be determined not to show the same object. For example, suppose the image to be retrieved and a candidate image show the same cup but were shot from different angles. Because the feature extractor pays no attention to the object's shape information during feature extraction, the obtained feature vectors differ greatly, and the images are finally judged to show different objects; that is, the image retrieval fails.
It can be found that the method is easily influenced by the shooting angle of the article, and the accuracy is low.
The embodiment of the present invention further provides another possible image retrieval method, as shown in fig. 2, including the following steps:
step 201, determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved;
step 202, determining, from a candidate image set, target candidate images corresponding to second feature vectors that satisfy a first similarity requirement with the first feature vector; the candidate image set comprises candidate images of different objects shot from different angles;
step 203, determining a target candidate image which meets a second similarity requirement with a third feature vector of the image to be retrieved from all target candidate images as a target image of the image to be retrieved; the third feature vector is determined based on a color RGB map of the image to be retrieved.
The depth map of an image reflects the distance of each point in the image from the camera in the physical world, and thus the stereo characteristics of the objects in the image. In the method, the first feature vector is determined through the depth map of the image to be retrieved, so the obtained first feature vector reflects the shape information of the object in the image to be retrieved and accounts for the influence of the shooting angle. The similarity between the first feature vector and the feature vectors of the candidate images in the candidate image set is compared, and the target candidate images satisfying the first similarity requirement are determined; the target candidate images obtained in this way are highly accurate. Moreover, since the candidate image set comprises candidate images of different objects shot from different angles, retrieval against candidates of different angles further reduces the influence of the shooting angle on image retrieval. Because the depth map does not reflect color information, after this first screening a third feature vector is extracted based on the color RGB map of the image to be retrieved, and its similarity with each target candidate image is compared to determine the target image. A second screening is thus performed on the basis of the first, the two screenings consider different image information, and the accuracy of image retrieval is improved.
In step 201, a depth map of an image to be retrieved is extracted, and a first feature vector of the image to be retrieved is determined based on the depth map.
The Depth map can be extracted from the image to be retrieved acquired by a Depth Camera (DC). Any pixel point in the depth map can be represented as follows: (xi, yi, di), wherein xi represents the abscissa of the pixel point in the depth map, yi represents the ordinate of the pixel point in the depth map, and di represents the depth value of the pixel point, i.e., the distance from the depth map acquisition device in the physical world.
The extracted depth map has a width w and a height h, and the depth map is uniformly sampled to obtain a plurality of first sampling points, for example, m × n sampling points, where m < h and n < w.
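The uniform sampling described above can be sketched as follows; the grid-center sampling rule and the function name `sample_grid` are illustrative assumptions, not details from the original.

```python
def sample_grid(depth, m, n):
    """Uniformly pick m x n first sampling points from a depth map given as
    a list of rows; each point is returned as (x, y, d)."""
    h, w = len(depth), len(depth[0])
    ys = [int((i + 0.5) * h / m) for i in range(m)]   # m rows, m < h
    xs = [int((j + 0.5) * w / n) for j in range(n)]   # n columns, n < w
    return [(x, y, depth[y][x]) for y in ys for x in xs]
```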
Next, a normal vector of each first sampling point is determined, and the normal vector can represent orientation information of each sampling point. The method for determining the normal vector specifically comprises the following steps:
determining a second sampling point and a third sampling point from the depth map according to a preset rule aiming at each first sampling point; and determining the normal vector of the first sampling point according to the depth vector of the first sampling point, the depth vector of the second sampling point and the depth vector of the third sampling point.
For example, the coordinates (i.e., the depth vector) of the first sampling point are P0 = (x0, y0, d0). According to the normal vector calculation formula, three points are required to determine a plane and calculate the normal vector. Therefore, we sub-sample with step s at (x0 + s, y0) and (x0, y0 + s) of the depth map, respectively, to obtain a second sampling point P1 and a third sampling point P2.
The coordinates of P1 and P2 are (x1, y1, d1) and (x2, y2, d2), where d1 and d2 are the depth values found at sampling points P1 and P2, and x1 = x0 + s, y1 = y0, x2 = x0, y2 = y0 + s. Define vector vP0P1 as the vector connecting P0 and P1, and vP0P2 as the vector connecting P0 and P2. Then vP0P1 = (x1 - x0, y1 - y0, d1 - d0) and vP0P2 = (x2 - x0, y2 - y0, d2 - d0).
For simplicity, write vP0P1 = (a1, b1, c1) and vP0P2 = (a2, b2, c2), where a1 = x1 - x0, b1 = y1 - y0, c1 = d1 - d0, a2 = x2 - x0, b2 = y2 - y0, and c2 = d2 - d0.
Fig. 3 shows a schematic diagram of the first, second and third sampling points. As shown, according to the normal vector (cross product) calculation formula, the normal vector at sampling point P0 is Vp0 = vP0P1 × vP0P2 = (b1·c2 - b2·c1, c1·a2 - c2·a1, a1·b2 - a2·b1).
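As a minimal sketch of the normal vector computation just described (assuming step s = 1 by default and the standard cross product convention):

```python
def normal_vector(depth, x0, y0, s=1):
    """Normal at P0 = (x0, y0) from P1 = (x0 + s, y0) and P2 = (x0, y0 + s)."""
    d0 = depth[y0][x0]
    d1 = depth[y0][x0 + s]          # depth at P1
    d2 = depth[y0 + s][x0]          # depth at P2
    a1, b1, c1 = s, 0, d1 - d0      # vP0P1
    a2, b2, c2 = 0, s, d2 - d0      # vP0P2
    # cross product vP0P1 x vP0P2
    return (b1 * c2 - b2 * c1,
            c1 * a2 - c2 * a1,
            a1 * b2 - a2 * b1)
```

For a perfectly flat depth map the normal points straight out of the image plane, while a depth gradient tilts it, which matches the observation that the normal vectors are not all perpendicular to the depth map plane.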
The above process is performed for each first sampling point to obtain its normal vector. Note that, since the depth values di of the first sampling points in the depth map differ, the normal vectors of the sampling points are not all perpendicular to the plane of the depth map, and may point in various directions.
Next, a normal vector histogram of the depth map is determined based on the normal vectors of the respective sample points. The normal vector histogram can represent the distribution state of the orientation information of each sampling point. The distribution state of the orientation information of each sampling point in the depth map may reflect shape information of an object in the image.
Specifically, the method for determining the normal vector histogram includes: dividing the depth map into a plurality of regions, determining, for any region, the distribution state of the sampling points with different orientation information in that region, and determining the normal vector histogram of the depth map according to the distribution states of the plurality of regions. The distribution state of the sampling points with different orientation information in a region can be determined as follows: for the normal vector of any sampling point, determine a first included angle formed by the normal vector and a first coordinate axis and a second included angle formed by the normal vector and a second coordinate axis; perform bucketing (binning) processing on the first included angle and the second included angle to obtain a characterization value of the normal vector; and determine the distribution state of the different characterization values in the region according to the characterization values of the normal vectors of all the sampling points in the region.
For example, for a certain sampling point, the normal vector of the sampling point is determined as v (x, y, z), the normal vector is projected on a plane formed by the x axis and the z axis to obtain a two-dimensional vector (x, z), and the normal vector is projected on a plane formed by the y axis and the z axis to obtain a two-dimensional vector (y, z). Then a first angle formed by the normal vector and the first coordinate axis and a second angle formed by the normal vector and the second coordinate axis can be obtained. The first coordinate axis refers to any one of an x coordinate axis or a y coordinate axis, the second coordinate axis refers to any one of the x coordinate axis or the y coordinate axis, and the first coordinate axis and the second coordinate axis are different.
Since the normal vector is perpendicular to the tangent plane at the sampling point, the first included angle is α1 = arctan(x, z) and the second included angle is α2 = arctan(y, z); the range of α1 is (0, π) and the range of α2 is (0, π). α1 and α2 are separately subjected to a bucketing operation (binning), the size of the bucket being defined by one skilled in the art as required. Assuming a bucket size of 4, then:
bins=[[0,π/4),[π/4,π/2),[π/2,(3*π)/4),[(3*π)/4,π]]
If α1 belongs to [0, π/4), the characterization value of α1 is 0; if α1 belongs to [π/4, π/2), the characterization value of α1 is 1; if α1 belongs to [π/2, 3π/4), the characterization value of α1 is 2; and if α1 belongs to [3π/4, π], the characterization value of α1 is 3. The characterization value ind_α1 of α1 is thus obtained, and α2 is converted into its corresponding characterization value ind_α2 in the same way.
The characterization values of α1 and α2 are integrated to obtain the characterization value of the normal vector, according to the following formula:
ind_p = ind_α1 × 4 + ind_α2. From this, ind_p ∈ [0, 15].
In the above formula, the coefficients before ind_α1 and ind_α2 are related to the bucket size in the bucketing operation. Since the bucket size is 4, the coefficient before ind_α1 is set to 4 and the coefficient before ind_α2 is set to 1, so that no matter what values ind_α1 and ind_α2 take, the value range of the integrated ind_p is a series of continuous values; in this example the range is [0, 15], with no numbers missing in the middle.
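Putting the two included angles, the bucketing and the integration formula ind_p = ind_α1 × 4 + ind_α2 together, a minimal sketch might look as follows. Using `atan2` for the projection angles and clamping boundary values are assumptions about details the text leaves open.

```python
import math

def characterize(nx, ny, nz, buckets=4):
    """Map a normal vector (nx, ny, nz) to its characterization value ind_p."""
    # angles of the projections onto the x-z and y-z planes, folded into [0, pi)
    a1 = math.atan2(nz, nx) % math.pi
    a2 = math.atan2(nz, ny) % math.pi
    width = math.pi / buckets
    ind_a1 = min(int(a1 / width), buckets - 1)   # bucket index of alpha 1
    ind_a2 = min(int(a2 / width), buckets - 1)   # bucket index of alpha 2
    return ind_a1 * buckets + ind_a2             # ind_p in [0, buckets**2 - 1]
```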
Thus, each sampling point in the depth map is assigned a characterization value of its normal vector. It will be appreciated that the characterization value is simply another representation of the normal vector: by integrating the bucketed first and second included angles, the normal vector is expressed in another form that still reflects the orientation information of the sampling point. However, through this series of processing, the characterization granularity of the normal vector becomes coarser. For example, the original normal vector is represented by coordinates, e.g., (100, 200, 300), and the coordinates of different normal vectors are different; after conversion to the first and second included angles, the normal vector can be represented as, e.g., (20°, 30°), and the representations of different normal vectors are still different; after the bucketing operation and characterization value integration, the value range of the characterization value is [0, 15], that is, the normal vector is represented as an integer from 0 to 15, so clearly different normal vectors may end up represented by the same integer.
This coarsening of the characterization granularity reduces sensitivity in the subsequent image retrieval and appropriately increases fault tolerance, so that two images that clearly represent the same object are not mistakenly judged to have low similarity, which would cause the retrieval to fail. The bucket size in the bucketing operation therefore has a certain influence on the accuracy of subsequent retrieval: if the number of buckets is too large and the granularity too fine, the sensitivity of image retrieval is too high and images representing the same object are not easily matched; if the number of buckets is too small and the granularity too coarse, the sensitivity is too low and images representing different objects are not easily distinguished. The embodiment of the invention does not limit the bucket size; the skilled person can select it as required.
Next, the normal vector histogram is determined. The normal vector histogram represents the distribution state of the orientation information of the sampling points. The depth map is divided into a plurality of regions, for example, 16 regions, as shown in fig. 4, with 24 sampling points distributed in each region. The normal vector distribution histogram of each region is counted. For example, for the first region, the characterization values of the normal vectors of the 24 sampling points in the region are known and are all integers in [0, 15]. Suppose counting shows 4 normal vectors with characterization value 0, 2 with characterization value 1, and 1 with characterization value 2; counting in the order of characterization values from 0 to 15 gives the normal vector histogram of the region shown in fig. 5. The normal vector distribution of the region can then be represented by a 16-dimensional vector, such as: (4, 2, 1, 3, 0, 5, 6, 1, 1, 1, 0, 0, 0, 0, 0, 0).
And synthesizing the normal vector histograms of the regions to obtain the normal vector histogram of the depth map. The 16-dimensional vectors of the regions are arranged into a 256-dimensional vector, namely a first feature vector.
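The per-region counting and concatenation into the 256-dimensional first feature vector (16 regions × 16 bins) can be sketched as:

```python
def region_histograms(regions, bins=16):
    """regions: one list of characterization values per region. Returns the
    concatenated normal vector histograms as the first feature vector."""
    feature = []
    for values in regions:
        hist = [0] * bins
        for v in values:
            hist[v] += 1        # count occurrences of each characterization value
        feature.extend(hist)
    return feature
```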
In the above method, the characterization value of each sampling point is not directly used as the first feature vector, which is also a consideration for the accuracy of image retrieval. If the characterization values were used directly, the dimension of the first feature vector would be 24 × 16 = 384 (24 sampling points in the horizontal direction and 16 in the vertical direction); the division would then be too fine, too much image information would be included, the sensitivity of image retrieval would be too high, and images representing the same object would not easily be matched. Therefore, the method of dividing the depth map into a plurality of regions and counting the characterization values in each region is adopted; compared with using the characterization values directly, counting their numbers blurs some of the depth information to a certain extent, so that the sensitivity of image retrieval is not too high.
In summary, it can be understood that the number of the regions into which the depth map is divided is also important for the control of the accuracy of image retrieval. If the number of the divided areas is too large, the granularity is too fine, the sensitivity of image retrieval is too high, and images representing the same object are not easy to identify; if the number of divided regions is too small and the granularity is too coarse, the sensitivity of image retrieval is too low, and images representing different objects are not easily distinguished. The number of the divided areas is not limited in the embodiment of the present invention, and a person skilled in the art can select the divided areas according to requirements.
At this point, a first feature vector of the image to be retrieved is extracted through the depth map of the image to be retrieved.
In step 202, a set of candidate images is set, the set of candidate images including candidate images taken from different angles for different subjects. For example, 8 candidate images of each subject are obtained by photographing each subject from 8 different angles. The number of angles in the embodiment of the present invention is not limited, and can be set by a person skilled in the art. The embodiment of the invention also does not limit the specific angle, and can be any shooting angle.
For each candidate image in the candidate image set, the method provided in step 201 is adopted to extract the depth map, and a feature vector is extracted from the depth map, which is named as a second feature vector. Thus, a second feature vector of each candidate image is obtained. The step of extracting the second feature vector may be performed before step 201, that is, the second feature vector is extracted from each candidate image in the candidate image set in advance and stored, so that after the first feature vector of the image to be retrieved is obtained each time, the similarity of the first feature vector and the similarity of the second feature vector are directly compared, which saves the computing resources and increases the retrieval speed. Alternatively, the step of extracting the second feature vector may be performed after the step 201, and in this way, after the first feature vector of the image to be retrieved is obtained each time, the second feature vector needs to be extracted for each candidate image, which wastes computational resources and is more time-consuming.
Similarity is calculated between the first feature vector and each second feature vector, and the method for calculating the similarity may be calculating an euclidean distance, a cosine distance, and the like, which is not limited in the embodiment of the present invention.
Each target candidate image is then determined as a candidate image whose second feature vector meets the first similarity requirement; the first similarity requirement may be, for example, being among the first N candidate images with the highest similarity, or having a similarity that meets a preset threshold. For example, the 10 candidate images whose second feature vectors have the highest similarity are determined as the target candidate images.
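A minimal sketch of this first screening, here using cosine similarity with a top-N requirement (one of the options the text mentions; the function names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def top_n(query, candidates, n=10):
    """Indices of the n candidate feature vectors most similar to the query."""
    order = sorted(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]),
                   reverse=True)
    return order[:n]
```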
The target candidate image determined by the method can be directly output to the user as the target image, and the subsequent secondary screening can also be carried out.
Since the methods of step 201 and step 202 only focus on the depth map, and the depth map has only depth information and no color information, the resulting image retrieval result may be less accurate.
Meanwhile, existing image retrieval technology pays no attention to the influence of the image background, so images of the same object shot against different backgrounds are likely to be regarded as different objects, resulting in low retrieval accuracy. The depth map of an image represents the distance between each point and the camera in the physical world, so the foreground (close to the camera) can be distinguished from the background (far from the camera). If a method can be found that both focuses on the color information of the image and eliminates, as far as possible, the interference of the background with retrieval, the accuracy of image retrieval can be greatly improved. Based on this inventive concept, we propose the following scheme.
On the basis of the first screening, a second screening is then performed on the target candidate images.
And respectively extracting a color RGB image and a depth image of the image to be retrieved, and performing feature extraction on the RGB image based on the depth image to obtain a third feature vector.
The method for obtaining the third feature vector is described in detail below.
First, a neural network model of a double tower structure used in the embodiment of the present invention is described, as shown in fig. 6. Different information is respectively input into two branches of the neural network model, and the second branch is used for generating attention weight at each layer according to the input information and providing weight for each layer of model training of the first branch. The attention mechanism lets the model learn to assign its attention, i.e., weight the input information.
In the embodiment of the invention, the RGB map is input into the first branch, and the depth map is input into the second branch after generating the depth weight map. The first branch is a convolutional neural network which is provided with a plurality of convolutional layers, each convolutional layer further extracts the characteristics input by the previous layer, and the extracted characteristic vectors are input to the next layer. When each convolution layer is subjected to feature extraction, the attention weight extracted from each layer of the second branch is also received, and feature extraction is performed based on the attention weight, thereby generating a third feature vector.
The reason why the depth map is converted into a depth weight map before being input into the second branch is the following: each point in the depth map represents the distance from that point to the camera in the physical world; the foreground is close to the camera, so its depth value is small, while the background is far away, so its depth value is large. In this scheme, the model is expected to give more weight to the foreground and less weight to (or even ignore) the background during feature extraction. Therefore, the foreground needs a larger weight and the background a smaller one. Based on this idea, the depth information in the depth map is first normalized to obtain values di,j; each di,j is then negated and 1 is added, generating the depth weight map W in which each point Wi,j = 1 - di,j.
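The depth weight map generation (normalize, negate, add 1, i.e. W = 1 - d) can be sketched as follows; min-max normalization is an assumption, since the text does not specify which normalization method is used.

```python
def depth_weight_map(depth):
    """Min-max normalize the depth values to [0, 1], then invert them so that
    near points (small depth) get large weights and far points small weights."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    rng = (hi - lo) or 1.0                     # avoid division by zero
    return [[1.0 - (d - lo) / rng for d in row] for row in depth]
```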
The above-described feature extraction process is described below with a specific example.
Fig. 7 shows a schematic diagram of feature extraction based on a neural network model of a double tower structure. As shown, the RGB map and the depth map are 3 × 3 images, the RGB map a1 is input to the first branch of the model, and the depth map is converted into the depth weight map B1 and then input to the second branch.
In the first convolutional layer of the first branch, a convolution operation is performed on A1 based on the depth weight map B1 and the model parameters C1; the specific convolution algorithm is prior art and is not described again here. After the first convolutional layer completes, the operation result A2 is input into the second convolutional layer. Since the first convolutional layer has several convolution kernels, the number of feature maps increases, as shown in the figure. If the obtained size of A2 is 2 × 2, the depth weight map is scaled to 2 × 2 in the second branch, giving B2, and input to the second convolutional layer of the first branch. This is because, during forward propagation, the size of the depth weight map must correspond to the size of the output produced by each layer of the convolutional neural network of the first branch.
The second convolutional layer performs convolution operation on a2 based on the depth weight map B2 and the model parameters C2. And inputting the obtained result into a third convolution layer, and performing feature extraction on each convolution layer according to the convolution operation of the first convolution layer until a final third feature vector is obtained. Wherein, the model parameters C1 and C2 are obtained when the neural network model of the double tower structure is trained according to a large number of sample images.
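The attention-weighted convolution of one layer can be sketched as below. Multiplying the input element-wise by the (already resized) depth weight map before a plain valid-mode convolution is one plausible reading of "convolution operation based on the depth weight map", not necessarily the exact mechanism of the patent.

```python
def weighted_conv2d(img, weight_map, kernel):
    """One attention-weighted convolution step: the input feature map is
    multiplied element-wise by the depth weight map, then a valid-mode
    2-D convolution with a single square kernel is applied."""
    h, w = len(img), len(img[0])
    k = len(kernel)
    # apply the attention weights from the second branch
    att = [[img[y][x] * weight_map[y][x] for x in range(w)] for y in range(h)]
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            row.append(sum(att[y + i][x + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out
```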
The above steps are performed on each target candidate image determined in step 202 to obtain each fourth feature vector of each target candidate image, the similarity between the third feature vector and each fourth feature vector is calculated, and the target candidate image meeting the requirement of the second similarity is taken as the target image of the image to be retrieved.
It can be seen that the operation of extracting the third feature vector based on the neural network model of the double-tower structure is more complicated, consumes more computing resources, and is more time-consuming than the operation of extracting the first feature vector based on the depth map. The first screening is performed through the first feature vector extracted by the depth map, and then the second screening is performed through the third feature vector extracted by the neural network model of the double-tower structure. A small number of target candidate images are obtained through first screening, then secondary screening is carried out on the target candidate images through the neural network model with the double-tower structure, feature vectors do not need to be extracted from all the candidate images through the neural network model with the double-tower structure, a large amount of computing resources and time are saved, and the accuracy of image retrieval is guaranteed.
In the above scheme, the neural network model of the double tower structure is used as the feature extractor. Optionally, a twin (Siamese) neural network may also be trained based on the neural network model of the double tower structure described above. As shown in fig. 8, the twin neural network includes two neural network models of the double tower structure and a plurality of fully connected layers. The double-tower models on the left and right sides take as input the image to be retrieved and any target candidate image, respectively. After the fully connected layers, the similarity score of the image to be retrieved and that target candidate image is output directly. The twin neural network also requires a large number of sample images for training.
Based on the same technical concept, fig. 9 exemplarily shows a structure of an image retrieval apparatus provided by an embodiment of the present invention, which can perform a flow of image retrieval.
As shown in fig. 9, the apparatus specifically includes:
a determining unit 901 configured to:
determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved;
determining target candidate images corresponding to second feature vectors of which the first feature vectors meet the first similarity requirement from the candidate image set; the candidate image set comprises candidate images shot from different angles aiming at different objects;
determining a target candidate image which meets a second similarity requirement with a third feature vector of the image to be retrieved from all target candidate images as a target image of the image to be retrieved; the third feature vector is determined based on a color RGB map of the image to be retrieved.
Optionally, the determining unit 901 is specifically configured to:
determining a normal vector of each sampling point in a depth map through the depth map of an image to be retrieved; the normal vector is used for representing orientation information of the sampling point;
determining a normal vector histogram of the depth map based on normal vectors of the sampling points; the normal vector histogram is used for representing the distribution state of the orientation information of each sampling point;
and obtaining the first feature vector according to the normal vector histogram.
Optionally, the determining unit 901 is specifically configured to:
sampling the depth map of the image to be retrieved to obtain first sampling points;
determining a second sampling point and a third sampling point from the depth map according to a preset rule aiming at each first sampling point; and determining the normal vector of the first sampling point according to the depth vector of the first sampling point, the depth vector of the second sampling point and the depth vector of the third sampling point.
Optionally, the determining unit 901 is specifically configured to:
dividing the depth map into a plurality of regions, and determining the distribution state of each sampling point with different orientation information in any region according to the normal vector of each sampling point in the region;
and determining a normal vector histogram of the depth map according to the distribution state of each sampling point with different orientation information of a plurality of areas.
Optionally, the determining unit 901 is specifically configured to:
aiming at the normal vector of any sampling point in the region, determining a first included angle formed by the normal vector and a first coordinate axis and a second included angle formed by the normal vector and a second coordinate axis; performing bucketing (binning) processing on the first included angle and the second included angle to obtain a characterization value of the normal vector;
and determining the distribution state of the different characterization values in the region according to the characterization values of the normal vectors of all the sampling points in the region.
Optionally, the determining unit 901 is specifically configured to:
generating a depth weight map based on the depth information in the depth map;
inputting the RGB map and the depth weight map into a first branch and a second branch of a neural network model of a double-tower structure respectively, wherein each convolution layer in the first branch performs feature extraction on the RGB map based on the attention weight generated by the second branch in each layer, so as to generate a third feature vector of the image to be retrieved;
and calculating the similarity between the third feature vector and the fourth feature vector of any target candidate image, thereby determining the target candidate image with the similarity meeting the second similarity requirement.
Optionally, the determining unit 901 is specifically configured to:
normalizing each depth information in the depth map;
and performing weight conversion on the normalized depth information according to a preset mode to generate the depth weight map, wherein the preset mode is that the smaller the depth information is, the larger the weight value in the depth weight map is.
Based on the same technical concept, the embodiment of the present application provides a computer device, as shown in fig. 10, including at least one processor 1001 and a memory 1002 connected to the at least one processor, where a specific connection medium between the processor 1001 and the memory 1002 is not limited in the embodiment of the present application, and the processor 1001 and the memory 1002 in fig. 10 are connected through a bus as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In the embodiment of the present application, the memory 1002 stores instructions executable by the at least one processor 1001, and the at least one processor 1001 may execute the steps of the image retrieval method by executing the instructions stored in the memory 1002.
The processor 1001 is a control center of the computer device, and can connect various parts of the computer device by using various interfaces and lines, and perform image retrieval by executing or executing instructions stored in the memory 1002 and calling data stored in the memory 1002. Alternatively, the processor 1001 may include one or more processing units, and the processor 1001 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, and the like, and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 1001. In some embodiments, the processor 1001 and the memory 1002 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The processor 1001 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
Memory 1002, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory 1002 may include at least one type of storage medium, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and so on. The memory 1002 may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1002 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same technical concept, the embodiment of the present invention further provides a computer-readable storage medium, where a computer-executable program is stored, and the computer-executable program is used for causing a computer to execute the method for image retrieval listed in any mode above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made to the present application without departing from its spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of the present application and their equivalents, the present application is intended to include them as well.

Claims (10)

1. An image retrieval method, comprising:
determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved;
determining, from a candidate image set, target candidate images corresponding to second feature vectors that meet a first similarity requirement with the first feature vector; the candidate image set comprises candidate images of different objects captured from different angles;
determining, from all the target candidate images, a target candidate image that meets a second similarity requirement with a third feature vector of the image to be retrieved, as the target image of the image to be retrieved; the third feature vector is determined based on a color (RGB) map of the image to be retrieved.
2. The method of claim 1, wherein determining a first feature vector of an image to be retrieved from a depth map of the image to be retrieved comprises:
determining a normal vector of each sampling point in a depth map through the depth map of an image to be retrieved; the normal vector is used for representing orientation information of the sampling point;
determining a normal vector histogram of the depth map based on normal vectors of the sampling points; the normal vector histogram is used for representing the distribution state of the orientation information of each sampling point;
and obtaining the first feature vector according to the normal vector histogram.
3. The method of claim 2, wherein determining a normal vector for each sample point in a depth map of an image to be retrieved from the depth map comprises:
sampling the depth map of the image to be retrieved to obtain first sampling points;
for each first sampling point, determining a second sampling point and a third sampling point from the depth map according to a preset rule; and determining the normal vector of the first sampling point according to the depth vectors of the first sampling point, the second sampling point, and the third sampling point.
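As an illustrative sketch (not the patent's reference implementation) of the cross-product construction implied by claim 3: with the first sampling point and its two neighbors back-projected into 3-D, the normal is the normalized cross product of the two difference vectors. The "preset rule" for choosing the second and third sampling points is not fixed by the claim, so neighbor selection is left to the caller.

```python
import numpy as np

def normal_from_three_points(p1, p2, p3):
    """Estimate the unit normal at p1 from two neighboring 3-D points.

    p1, p2, p3: length-3 sequences (x, y, depth) back-projected from the
    depth map. The neighbor-selection "preset rule" is an assumption of
    the caller; this function only implements the cross-product step.
    """
    v1 = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    v2 = np.asarray(p3, dtype=float) - np.asarray(p1, dtype=float)
    n = np.cross(v1, v2)
    norm = np.linalg.norm(n)
    if norm == 0.0:
        raise ValueError("degenerate sample: the three points are collinear")
    return n / norm
```

For example, three points on the plane z = 0 yield the unit normal (0, 0, 1).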
4. The method of claim 2, wherein determining a normal vector histogram of the depth map based on normal vectors of the sample points comprises:
dividing the depth map into a plurality of regions, and, for any region, determining the distribution state of sampling points with different orientation information according to the normal vectors of the sampling points in the region;
and determining a normal vector histogram of the depth map according to the distribution state of each sampling point with different orientation information of a plurality of areas.
5. The method of claim 4, wherein determining the distribution state of the sampling points with different orientation information in the area according to the normal vector of the sampling points in the area comprises:
for the normal vector of any sampling point in the region, determining a first included angle between the normal vector and a first coordinate axis and a second included angle between the normal vector and a second coordinate axis; performing bucketing (binning) on the first included angle and the second included angle to obtain a characteristic value of the normal vector;
and determining the distribution state of different characteristic values in the region according to the characteristic values of the normal vectors of all the sampling points in the region.
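A minimal sketch of the two-angle bucketing in claim 5, under stated assumptions: the two included angles are taken against the x- and y-axes, each angle in [0, π] is split into a fixed number of equal buckets, and the two bucket ids are fused into a single characteristic value. The bin count and the fusion rule (`bx * bins + by`) are illustrative choices, not fixed by the claim.

```python
import numpy as np

def normal_feature_value(normal, bins=8):
    """Quantize a unit normal into one bucket id (claim 5 sketch).

    The first/second included angles are measured against the first two
    coordinate axes; each is binned into `bins` equal buckets over
    [0, pi], and the pair is fused into a single characteristic value.
    """
    n = np.asarray(normal, dtype=float)
    theta_x = np.arccos(np.clip(n[0], -1.0, 1.0))  # angle with first axis
    theta_y = np.arccos(np.clip(n[1], -1.0, 1.0))  # angle with second axis
    bx = min(int(theta_x / np.pi * bins), bins - 1)
    by = min(int(theta_y / np.pi * bins), bins - 1)
    return bx * bins + by  # fused bucket id in [0, bins * bins)

def region_histogram(normals, bins=8):
    """Count characteristic values over one region's sampling points,
    giving that region's slice of the normal vector histogram."""
    hist = np.zeros(bins * bins, dtype=int)
    for n in normals:
        hist[normal_feature_value(n, bins)] += 1
    return hist
```

Concatenating the per-region histograms over all regions would then give the first feature vector of claim 2.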
6. The method according to any one of claims 1 to 5, wherein determining a target candidate image satisfying a second similarity requirement with a third feature vector of the image to be retrieved comprises:
generating a depth weight map based on the depth information in the depth map;
inputting the RGB map and the depth weight map into a first branch and a second branch of a neural network model of a double-tower structure respectively, wherein each convolution layer in the first branch performs feature extraction on the RGB map based on the attention weight generated by the second branch in each layer, so as to generate a third feature vector of the image to be retrieved;
and calculating the similarity between the third feature vector and the fourth feature vector of any target candidate image, thereby determining the target candidate image with the similarity meeting the second similarity requirement.
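The final filtering step of claim 6 can be sketched as follows, under the assumption that the similarity measure is cosine similarity and the second similarity requirement is a fixed threshold; the claim itself fixes neither, so both are illustrative choices.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed metric)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_candidates(third_vec, fourth_vecs, threshold=0.8):
    """Return indices of target candidate images whose fourth feature
    vector meets the (assumed) second similarity requirement with the
    query's third feature vector. The threshold value is illustrative."""
    return [i for i, v in enumerate(fourth_vecs)
            if cosine_similarity(third_vec, v) >= threshold]
```

A candidate aligned with the query vector passes the threshold while an orthogonal one does not.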
7. The method of claim 6, wherein generating a depth weight map based on depth information in the depth map comprises:
normalizing each depth information in the depth map;
and performing weight conversion on the normalized depth information according to a preset mode to generate the depth weight map, wherein the preset mode is that the smaller the depth information is, the larger the weight value in the depth weight map is.
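Claim 7's depth weight map can be sketched in two steps: min-max normalize the depth values, then invert them so that smaller depth (nearer content) gets a larger weight. The specific inversion `w = 1 - d_norm` is an illustrative assumption; the claim only requires the preset mode to be monotonically decreasing in depth.

```python
import numpy as np

def depth_weight_map(depth):
    """Build a depth weight map (claim 7 sketch): min-max normalize the
    depth map, then invert so nearer pixels receive larger weights.
    The inversion w = 1 - d_norm is an illustrative choice."""
    d = np.asarray(depth, dtype=float)
    d_min, d_max = d.min(), d.max()
    if d_max == d_min:
        return np.ones_like(d)  # flat depth map: uniform weights
    d_norm = (d - d_min) / (d_max - d_min)
    return 1.0 - d_norm
```

The resulting map could then serve as the second-branch input of the double-tower model in claim 6.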
8. An image retrieval apparatus, comprising:
a determination unit configured to:
determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved;
determining, from a candidate image set, target candidate images corresponding to second feature vectors that meet a first similarity requirement with the first feature vector; the candidate image set comprises candidate images of different objects captured from different angles;
determining, from all the target candidate images, a target candidate image that meets a second similarity requirement with a third feature vector of the image to be retrieved, as the target image of the image to be retrieved; the third feature vector is determined based on a color (RGB) map of the image to be retrieved.
9. A computing device, comprising:
a memory for storing a computer program;
a processor for calling a computer program stored in said memory, for executing the method of any one of claims 1 to 7 in accordance with the obtained program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer-executable program for causing a computer to execute the method of any one of claims 1 to 7.
CN202111519845.8A 2021-12-13 2021-12-13 Image retrieval method and device Pending CN114241222A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111519845.8A CN114241222A (en) 2021-12-13 2021-12-13 Image retrieval method and device
PCT/CN2022/100676 WO2023109069A1 (en) 2021-12-13 2022-06-23 Image retrieval method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111519845.8A CN114241222A (en) 2021-12-13 2021-12-13 Image retrieval method and device

Publications (1)

Publication Number Publication Date
CN114241222A true CN114241222A (en) 2022-03-25

Family

ID=80755359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111519845.8A Pending CN114241222A (en) 2021-12-13 2021-12-13 Image retrieval method and device

Country Status (2)

Country Link
CN (1) CN114241222A (en)
WO (1) WO2023109069A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023109069A1 (en) * 2021-12-13 2023-06-22 深圳前海微众银行股份有限公司 Image retrieval method and apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104850850B (en) * 2015-04-05 2017-12-01 中国传媒大学 A kind of binocular stereo vision image characteristic extracting method of combination shape and color
CN107636727A (en) * 2016-12-30 2018-01-26 深圳前海达闼云端智能科技有限公司 Target detection method and device
CN108829692B (en) * 2018-04-09 2019-12-20 华中科技大学 Flower image retrieval method based on convolutional neural network
CN111339343A (en) * 2020-02-12 2020-06-26 腾讯科技(深圳)有限公司 Image retrieval method, device, storage medium and equipment
CN114241222A (en) * 2021-12-13 2022-03-25 深圳前海微众银行股份有限公司 Image retrieval method and device


Also Published As

Publication number Publication date
WO2023109069A1 (en) 2023-06-22

Similar Documents

Publication Publication Date Title
CN111814794A (en) Text detection method and device, electronic equipment and storage medium
Yu et al. A cascade rotated anchor-aided detector for ship detection in remote sensing images
CN110033514B (en) Reconstruction method based on point-line characteristic rapid fusion
Fotouhi et al. SC-RANSAC: spatial consistency on RANSAC
Herrera et al. Automatic depth extraction from 2D images using a cluster-based learning framework
US20230153965A1 (en) Image processing method and related device
CN111915657A (en) Point cloud registration method and device, electronic equipment and storage medium
Cai et al. A novel saliency detection algorithm based on adversarial learning model
CN110135428B (en) Image segmentation processing method and device
CN115100660A (en) Method and device for correcting inclination of document image
CN114241222A (en) Image retrieval method and device
CN114998610A (en) Target detection method, device, equipment and storage medium
CN113592706B (en) Method and device for adjusting homography matrix parameters
CN109815763A (en) Detection method, device and the storage medium of two dimensional code
CN113723352A (en) Text detection method, system, storage medium and electronic equipment
CN113361567A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113704276A (en) Map updating method and device, electronic equipment and computer readable storage medium
CN116129504A (en) Living body detection model training method and living body detection method
CN115578673A (en) Certificate authenticity verification method, system, computer device and storage medium
CN115661218A (en) Laser point cloud registration method and system based on virtual super point
CN114820987A (en) Three-dimensional reconstruction method and system based on multi-view image sequence
CN115294361A (en) Feature extraction method and device
CN114495132A (en) Character recognition method, device, equipment and storage medium
CN114219831A (en) Target tracking method and device, terminal equipment and computer readable storage medium
CN113763313A (en) Text image quality detection method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination