WO2023109069A1 - Image Retrieval Method and Apparatus - Google Patents

Image Retrieval Method and Apparatus

Info

Publication number
WO2023109069A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
sampling point
retrieved
depth
feature vector
Prior art date
Application number
PCT/CN2022/100676
Other languages
English (en)
French (fr)
Inventor
裴勇 (Pei Yong)
杨启城 (Yang Qicheng)
Original Assignee
深圳前海微众银行股份有限公司 (Shenzhen Qianhai WeBank Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司 (Shenzhen Qianhai WeBank Co., Ltd.)
Publication of WO2023109069A1 publication Critical patent/WO2023109069A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • Embodiments of the present invention relate to the technical field of computer vision, and in particular, to an image retrieval method, device, computing device, and computer-readable storage medium.
  • Image retrieval is widely used in various fields, such as search engines, e-commerce platforms, security, information authentication, etc.
  • For example, the user inputs an image of a lost item as the image to be retrieved, and the system retrieves images that may show the item from the library and displays them to the user.
  • the core of image retrieval is to compare the similarity between the image to be retrieved and the candidate images in the candidate image set.
  • However, the existing methods for determining the similarity between two images do not take the influence of the shooting angle into account; as a result, even two images of the same item may have low similarity.
  • the embodiment of the present invention provides an image retrieval method to improve the accuracy of image retrieval.
  • an embodiment of the present invention provides an image retrieval method, including:
  • determining a first feature vector of an image to be retrieved through a depth map of the image to be retrieved; determining, from a candidate image set, target candidate images corresponding to second feature vectors that meet a first similarity requirement with the first feature vector; the candidate image set includes candidate images shot from different angles for different objects;
  • from the target candidate images, determining the target candidate image that meets a second similarity requirement with a third feature vector of the image to be retrieved as the target image of the image to be retrieved; the third feature vector is determined based on the color RGB map of the image to be retrieved.
  • the depth map of the image can reflect the distance of each point in the image from the camera in the physical world, and can reflect the three-dimensional characteristics of the object in the image.
  • the first feature vector is determined through the depth map of the image to be retrieved, so the obtained first feature vector can reflect the shape information of the object in the image to be retrieved, taking into account the influence of the shooting angle.
  • the first feature vector is first compared with the feature vectors of each candidate image in the candidate image set for similarity, and each target candidate image that meets the first similarity requirement is determined, and the accuracy rate of each target candidate image obtained in this way is relatively high.
  • Since the candidate image set includes candidate images taken from different angles for different objects, image retrieval is performed based on these candidate images from different angles, further reducing the impact of shooting angles on image retrieval.
  • the third feature vector is extracted based on the color RGB map of the image to be retrieved, and the third feature vector is compared for similarity with each target candidate image, so as to determine the target image.
  • the second screening is performed on the basis of the first screening, and the two screenings take into account different image information, thereby improving the accuracy of image retrieval.
  • determining the first feature vector of the image to be retrieved through the depth map of the image to be retrieved includes:
  • determining the normal vector of each sampling point in the depth map; determining the normal vector histogram of the depth map based on the normal vector of each sampling point; the normal vector histogram is used to characterize the distribution state of the orientation information of each sampling point;
  • the normal vector of each sampling point in the depth map can reflect the orientation information of the sampling point, and the normal vector histogram obtained from the orientation information of all sampling points can reflect the distribution state of that orientation information, that is, the shape information of the object in the image to be retrieved can be obtained.
  • determining the normal vector of each sampling point in the depth map includes:
  • for any first sampling point, a second sampling point and a third sampling point are determined from the depth map according to preset rules; the normal vector of the first sampling point is determined according to the depth vector of the first sampling point, the depth vector of the second sampling point, and the depth vector of the third sampling point.
  • the normal vector of each sampling point can be determined through the above method, and then the orientation information of each sampling point can be accurately reflected.
  • determining the normal vector histogram of the depth map includes:
  • the depth map is divided into multiple regions; for any region, the distribution state of sampling points with different orientation information in the region is determined; a normal vector histogram of the depth map is then determined according to the distribution states of sampling points with different orientation information in the multiple regions.
  • In this way, the depth information of the image is blurred to a certain extent, so that the sensitivity of image retrieval is not excessively high and images representing the same object are not wrongly judged as dissimilar. Therefore, the accuracy of image retrieval is improved.
  • the distribution state of sampling points with different orientation information determined in each region can reflect the shape information of the captured object in that region, and after integrating all regions, it can reflect the shape information of objects in different regions of the entire image to be retrieved. Therefore, adopting the above method can overcome the problem of inaccurate image retrieval caused by interference from angle changes, and improve retrieval accuracy.
  • determining the distribution state of each sampling point in the area with different orientation information according to the normal vector of each sampling point in the area includes:
  • for the normal vector of any sampling point in the region, determining a first angle formed by the normal vector and a first coordinate axis, and a second angle formed by the normal vector and a second coordinate axis; after bucket binning the first angle and the second angle, obtaining the characterization value of the normal vector;
  • according to the characterization value of each normal vector, the distribution state of the different characterization values in the region is determined.
  • In the above method, the first angle formed by the normal vector and the first coordinate axis and the second angle formed by the normal vector and the second coordinate axis are bucket-binned to obtain the characterization value of the normal vector.
  • In this way, the numerical ranges of the first angle and the second angle are simplified, so the numerical range of the orientation information of the sampling points is simplified accordingly, rather than being a value spanning a wide range. Then, in subsequent image retrieval, the sensitivity can be reduced, so that two images that actually represent the same object are not judged to be images with low similarity.
  • determining a target candidate image that meets the second similarity requirement with the third feature vector of the image to be retrieved includes:
  • the RGB map and the depth weight map are respectively input into the first branch and the second branch of a neural network model with a double-tower structure, where each convolutional layer in the first branch performs feature extraction on the RGB map based on the attention weights produced by the corresponding layer in the second branch, thereby generating the third feature vector of the image to be retrieved;
  • the background also has a great influence on the success rate of retrieval.
  • Different backgrounds may lead to low similarity between two images representing the same object.
  • the depth map can reflect the distance of the object in the image to be retrieved from the camera, so the depth map is used to extract attention weights based on the depth information; these attention weights are then used in the convolution calculation on the image to be retrieved, giving more weight to the foreground and less to the background, so that the third feature vector pays more attention to the foreground and ignores the influence of the background as much as possible.
  • the fourth feature vector is obtained through the same processing, so the retrieval accuracy can be improved when calculating the similarity between the third feature vector and the fourth feature vector.
  • generating a depth weight map based on the depth information in the depth map includes:
  • the preset method is that the smaller the depth information is, the larger the weight value in the depth weight map is.
  • Since the depth information in the depth map indicates the distance from the camera, the closer a point is to the camera, the smaller its depth value is.
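  • The preset method above can be sketched as follows (a minimal illustration: the function name, the min-max normalization, and the linear inversion are assumptions; any monotonically decreasing mapping from depth to weight would fit the description that smaller depth yields larger weight):

```python
import numpy as np

def depth_weight_map(depth, eps=1e-6):
    """Map a depth image to attention weights: the smaller the depth
    (i.e., the closer to the camera), the larger the weight."""
    d = depth.astype(np.float64)
    # Normalize depth to [0, 1]; eps guards against a constant depth map.
    d_norm = (d - d.min()) / (d.max() - d.min() + eps)
    # Invert so that near (small-depth) points receive weights close to 1.
    return 1.0 - d_norm
```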
  • an embodiment of the present invention also provides an image retrieval device, including:
  • a determining unit, configured to determine a first feature vector of an image to be retrieved through a depth map of the image to be retrieved, and to determine, from a candidate image set, target candidate images corresponding to second feature vectors that meet a first similarity requirement with the first feature vector; the candidate image set includes candidate images shot from different angles for different objects;
  • the determining unit is further configured to determine, from the target candidate images, the target candidate image that meets a second similarity requirement with a third feature vector of the image to be retrieved as the target image of the image to be retrieved; the third feature vector is determined based on the color RGB map of the image to be retrieved.
  • the determining unit is specifically configured to:
  • determine the normal vector of each sampling point in the depth map; determine the normal vector histogram of the depth map based on the normal vector of each sampling point; the normal vector histogram is used to characterize the distribution state of the orientation information of each sampling point;
  • the determining unit is specifically configured to:
  • for any first sampling point, determine a second sampling point and a third sampling point from the depth map according to preset rules; determine the normal vector of the first sampling point according to the depth vector of the first sampling point, the depth vector of the second sampling point, and the depth vector of the third sampling point.
  • the determining unit is specifically configured to:
  • a normal vector histogram of the depth map is determined according to distribution states of sampling points with different orientation information in multiple regions.
  • the determining unit is specifically configured to:
  • for the normal vector of any sampling point in the region, determine a first angle formed by the normal vector and a first coordinate axis, and a second angle formed by the normal vector and a second coordinate axis; after bucket binning the first angle and the second angle, obtain the characterization value of the normal vector;
  • according to the characterization value of each normal vector, determine the distribution state of the different characterization values in the region.
  • the determining unit is specifically configured to:
  • input the RGB map and the depth weight map respectively into the first branch and the second branch of a neural network model with a double-tower structure, where each convolutional layer in the first branch performs feature extraction on the RGB map based on the attention weights produced by the corresponding layer in the second branch, thereby generating the third feature vector of the image to be retrieved;
  • the determining unit is specifically configured to:
  • the preset method is that the smaller the depth information is, the larger the weight value in the depth weight map is.
  • an embodiment of the present invention also provides a computing device, including:
  • the processor is configured to call the computer program stored in the memory, and execute the image retrieval method listed in any of the above methods according to the obtained program.
  • An embodiment of the present invention also provides a computer-readable storage medium, which stores a computer-executable program; the computer-executable program is used to make a computer execute any of the image retrieval methods listed above.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of an image retrieval method provided by an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a first sampling point, a second sampling point, and a third sampling point provided by an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a divided depth map provided by an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a normal vector histogram provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a neural network model with a double-tower structure provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of feature extraction based on a neural network model with a double-tower structure provided by an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a twin neural network provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an image retrieval device provided by an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
  • FIG. 1 exemplarily shows a system architecture applicable to this embodiment of the present invention.
  • the system architecture may be a server 100, including a processor 110, a communication interface 120, and a memory 130.
  • the communication interface 120 is used to communicate with the terminal equipment, send and receive information transmitted by the terminal equipment, and realize communication.
  • the processor 110 is the control center of the server 100; it connects the various parts of the entire server 100 through various interfaces and lines, and executes the various functions of the server 100 and processes data by running or executing the software programs and/or modules stored in the memory 130 and calling the data stored in the memory 130.
  • the processor 110 may include one or more processing units.
  • the memory 130 can be used to store software programs and modules, and the processor 110 executes various functional applications and data processing by running the software programs and modules stored in the memory 130 .
  • the memory 130 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application required by a function, etc.; the data storage area may store data created according to business processing, etc.
  • the memory 130 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • FIG. 1 is only an example, which is not limited in this embodiment of the present invention.
  • the server shown in Figure 1 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms.
  • The embodiment of the present invention provides a possible image retrieval method, which uses a trained feature extractor to extract the feature vector of the image to be retrieved and the feature vectors of each candidate image in the candidate image set, then calculates the similarity between the feature vector of the image to be retrieved and the feature vector of each candidate image, and takes the candidate image corresponding to the feature vector whose similarity meets the requirement as the target image of the image to be retrieved.
  • the feature vector of the image to be retrieved is extracted based on the color RGB image of the image to be retrieved.
  • The feature vector extracted in this way does not take into account the shape information of the object in the image, so when the shooting angle changes, the feature vector also changes significantly.
  • As a result, two images of the same object may have low similarity due to different shooting angles and be judged not to show the same object.
  • For example, the image to be retrieved and a candidate image show the same cup, but the shooting angles are different. Since the feature extractor does not attend to the shape information of the object during feature extraction, the two feature vectors differ greatly, and the images are finally judged to show different objects, that is, the image retrieval fails.
  • the embodiment of the present invention also provides another possible image retrieval method, as shown in Figure 2, including the following steps:
  • Step 201: determine the first feature vector of the image to be retrieved through the depth map of the image to be retrieved;
  • Step 202: determine, from the candidate image set, the target candidate images corresponding to the second feature vectors that meet the first similarity requirement with the first feature vector; the candidate image set includes candidate images shot from different angles for different objects;
  • Step 203: from the target candidate images, determine the target candidate image that meets the second similarity requirement with the third feature vector of the image to be retrieved as the target image of the image to be retrieved; the third feature vector is determined based on the color RGB map of the image to be retrieved.
  • the depth map of the image can reflect the distance of each point in the image from the camera in the physical world, and can reflect the three-dimensional characteristics of the object in the image.
  • the first feature vector is determined through the depth map of the image to be retrieved, so the obtained first feature vector can reflect the shape information of the object in the image to be retrieved, taking into account the influence of the shooting angle.
  • the first feature vector is first compared with the feature vectors of each candidate image in the candidate image set for similarity, and each target candidate image that meets the first similarity requirement is determined, and the accuracy rate of each target candidate image obtained in this way is relatively high.
  • the set of candidate images includes candidate images taken from different angles for different objects, image retrieval is performed based on these candidate images from different angles, further reducing the impact of shooting angles on image retrieval.
  • the third feature vector is extracted based on the color RGB map of the image to be retrieved, and the third feature vector is compared for similarity with each target candidate image, so as to determine the target image.
  • the second screening is performed on the basis of the first screening, and the two screenings take into account different image information, thereby improving the accuracy of image retrieval.
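  • The two-stage screening above can be sketched as follows (a minimal illustration: the function and variable names are assumptions, cosine similarity stands in for whichever similarity measure is chosen, and `second_vecs`/`fourth_vecs` denote the candidates' depth-based and RGB-based feature vectors respectively):

```python
import numpy as np

def retrieve(first_vec, third_vec, second_vecs, fourth_vecs, n_first=10, eps=1e-9):
    """Two-stage screening: first filter candidates by depth-based feature
    similarity, then rank the survivors by RGB-based feature similarity."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
    # Stage 1: shortlist the n_first candidates most similar in depth features.
    s1 = [cos(first_vec, v) for v in second_vecs]
    shortlist = np.argsort(s1)[::-1][:n_first]
    # Stage 2: among the shortlist, pick the best match by RGB features.
    s2 = {i: cos(third_vec, fourth_vecs[i]) for i in shortlist}
    return max(s2, key=s2.get)  # index of the target image
```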
  • step 201 a depth map of an image to be retrieved is extracted, and a first feature vector of the image to be retrieved is determined based on the depth map.
  • It should be noted that the depth map can be extracted only when the image to be retrieved is collected by a depth acquisition device (depth camera, DC).
  • Any pixel in the depth map can be represented as (xi, yi, di), where xi is the abscissa of the pixel in the depth map, yi is the ordinate of the pixel in the depth map, and di is the depth value of the pixel, that is, its distance from the depth acquisition device in the physical world.
  • the width of the extracted depth map is w, and the height is h.
  • the depth map is uniformly sampled to obtain several first sampling points, for example, m*n sampling points, where m ≤ h and n ≤ w.
  • the normal vector of each first sampling point is determined, and the normal vector can represent the orientation information of each sampling point.
  • the method of determining the normal vector is as follows:
  • For each first sampling point, determine a second sampling point and a third sampling point from the depth map according to preset rules; then determine the normal vector of the first sampling point according to the depth vector of the first sampling point, the depth vector of the second sampling point, and the depth vector of the third sampling point.
  • the coordinates of the first sampling point are P0(x0, y0, d0).
  • According to the normal vector calculation formula, three points are required to determine a plane and compute its normal vector. Therefore, with a step size s, we resample the depth map at (x0+s, y0) and (x0, y0+s) to obtain the second sampling point P1 and the third sampling point P2.
  • d1 and d2 are the depth values obtained at sampling point P1 and sampling point P2.
  • That is, x1 = x0 + s, y1 = y0, x2 = x0, y2 = y0 + s, so that P1 = (x0 + s, y0, d1) and P2 = (x0, y0 + s, d2).
  • Denote vP0P1 = (a1, b1, c1) as the vector obtained by connecting P0 and P1, i.e., vP0P1 = (s, 0, d1 - d0), and vP0P2 = (a2, b2, c2) as the vector obtained by connecting P0 and P2, i.e., vP0P2 = (0, s, d2 - d0). The normal vector of the first sampling point P0 is then the cross product vP0P1 × vP0P2.
  • The above process is performed for each first sampling point to obtain its normal vector. Note that since the depth values di of the first sampling points in the depth map differ, the normal vectors of the sampling points are not all perpendicular to the plane of the depth map; the normal vector of a first sampling point can point in any direction.
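  • The three-point normal construction described above can be sketched as follows (a minimal illustration; the `depth[row, col]` indexing convention, the function name, and the default step size are assumptions):

```python
import numpy as np

def normal_at(depth, x, y, s=1):
    """Normal vector at sampling point P0 = (x, y, d0), using the two
    resampled points P1 = (x+s, y, d1) and P2 = (x, y+s, d2)."""
    d0 = depth[y, x]
    d1 = depth[y, x + s]
    d2 = depth[y + s, x]
    v01 = np.array([s, 0, d1 - d0], dtype=float)  # vP0P1
    v02 = np.array([0, s, d2 - d0], dtype=float)  # vP0P2
    n = np.cross(v01, v02)  # normal of the plane through P0, P1, P2
    return n / np.linalg.norm(n)
```

On a perfectly flat depth map the result is (0, 0, 1), i.e., perpendicular to the image plane; any depth variation tilts the normal, which is exactly the orientation information the histogram later summarizes.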
  • the normal vector histogram can represent the distribution state of the orientation information of each sampling point.
  • the distribution state of the orientation information of each sampling point in the depth map may reflect the shape information of the object in the image.
  • the method for determining the normal vector histogram is: divide the depth map into multiple regions, and for any region, determine the distribution status of each sampling point with different orientation information in the region, and according to the different orientation information of multiple regions The distribution state of each sampling point of the information is used to determine the normal vector histogram of the depth map.
  • Determining the distribution state of sampling points with different orientation information in a region can be accomplished as follows: for the normal vector of any sampling point, determine the first angle formed by the normal vector and the first coordinate axis, and the second angle formed by the normal vector and the second coordinate axis; after bucket binning the first angle and the second angle, obtain the characterization value of the normal vector; then, according to the characterization value of each normal vector, determine the distribution state of the different characterization values in the region.
  • Denote the normal vector of the sampling point as v(x, y, z). Project it onto the plane formed by the x-axis and the z-axis to obtain a two-dimensional vector (x, z), and project it onto the plane formed by the y-axis and the z-axis to obtain a two-dimensional vector (y, z). Then the first angle formed by the normal vector and the first coordinate axis, and the second angle formed by the normal vector and the second coordinate axis, can be obtained.
  • The first coordinate axis here refers to either the x-axis or the y-axis, the second coordinate axis likewise refers to either the x-axis or the y-axis, and the first and second coordinate axes are different.
  • θ1 and θ2 are each subjected to a bucketing (binning) operation; the bucket size can be defined by those skilled in the art according to requirements. Assuming the bucket size is 4, then:
  • bins = [[0, π/4), [π/4, π/2), [π/2, 3π/4), [3π/4, π]]
  • If θ1 belongs to [0, π/4), the characterization value of θ1 is 0; if θ1 belongs to [π/4, π/2), the characterization value of θ1 is 1; if θ1 belongs to [π/2, 3π/4), the characterization value of θ1 is 2; if θ1 belongs to [3π/4, π], the characterization value of θ1 is 3.
  • the characterization value ind ⁇ 1 of ⁇ 1 is obtained, and ⁇ 2 can also be converted into the corresponding characterization value ind ⁇ 2 in the same way.
  • The characterization values of θ1 and θ2 are integrated to obtain the characterization value of the normal vector, for example according to the following formula:
  • indp = indθ1 * 4 + indθ2. It can be seen that indp ∈ [0, 15].
  • The coefficient before indθ1 and the coefficient before indθ2 are related to the bucket size in the bucketing operation. Since the bucket size is 4, the coefficient before indθ1 is set to 4 and the coefficient before indθ2 to 1.
  • In this way, the value range of the integrated indp is a continuous series of values; in this example, the value range is [0, 15], and no numbers in the middle are skipped.
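  • The binning and integration steps above can be sketched as follows (a minimal illustration; the function names are assumptions, and the clamp on θ = π keeps the closed upper bound of the last bin, matching the bin list above):

```python
import math

def bin_angle(theta, n_bins=4):
    """Bucket an angle in [0, pi] into n_bins equal bins, yielding 0..n_bins-1."""
    idx = int(theta / (math.pi / n_bins))
    return min(idx, n_bins - 1)  # theta == pi falls into the last bin

def characterization_value(theta1, theta2, n_bins=4):
    """indp = ind_theta1 * n_bins + ind_theta2, giving values in [0, n_bins**2 - 1]."""
    return bin_angle(theta1, n_bins) * n_bins + bin_angle(theta2, n_bins)
```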
  • each sampling point in the depth map determines the representative value of the normal vector.
  • the representation value is another representation of the normal vector.
  • the normal vector is expressed in another form, which still reflects the orientation information of the sampling point. The difference is that after this series of processing, the representation granularity of the normal vector becomes coarser.
  • Originally, a normal vector is represented by coordinates, such as (100, 200, 300), and different normal vectors have different coordinates. After conversion into the first and second angles, the normal vector can be expressed as, e.g., (20°, 30°), and different normal vectors still have different representations. After the bucketing operation and the integration into a characterization value, the value range of the characterization value is [0, 15], that is, each normal vector is expressed as an integer from 0 to 15, so clearly different normal vectors may end up expressed as the same integer.
  • In this way, the sensitivity can be reduced in subsequent image retrieval and the error tolerance appropriately increased, so that two images that clearly represent the same object are not mistakenly regarded as images with low similarity, causing retrieval to fail. Therefore, the bucket size in the bucketing operation has a certain impact on the accuracy of subsequent image retrieval. If the number of buckets is too large and the granularity too fine, the sensitivity of image retrieval is too high and it is hard to recognize images representing the same object; if the number of buckets is too small and the granularity too coarse, the sensitivity is too low and it is hard to distinguish images representing different objects.
  • the embodiment of the present invention does not limit the size of the bucket, which can be selected by those skilled in the art according to requirements.
  • the normal vector histogram can represent the distribution state of the orientation information of each sampling point.
  • the depth map is divided into multiple regions, for example, 16 regions, as shown in FIG. 4 , and 24 sampling points are distributed in each region in FIG. 4 .
  • For each region, statistics are collected over the characterization values in order from 0 to 15; the resulting normal vector histogram of one region is shown in FIG. 5.
  • The normal vector distribution of the region can thus be represented by a 16-dimensional vector, for example: (4, 2, 1, 3, 0, 5, 6, 1, 1, 1, 0, 0, 0, 0, 0, 0).
  • the normal vector histogram of the depth map can be obtained.
  • the 16-dimensional vectors of each region are integrated and arranged into a 256-dimensional vector, which is the first feature vector.
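  • The region-wise histogramming and concatenation above can be sketched as follows (a minimal illustration; the function name and the even-split region division are assumptions, and the input is the m×n array of per-sampling-point characterization values):

```python
import numpy as np

def normal_vector_histogram(char_values, grid=(4, 4), n_values=16):
    """Divide the characterization-value array into grid regions, histogram
    the values 0..n_values-1 per region, and concatenate the histograms."""
    m, n = char_values.shape
    gh, gw = grid
    feats = []
    for i in range(gh):
        for j in range(gw):
            region = char_values[i * m // gh:(i + 1) * m // gh,
                                 j * n // gw:(j + 1) * n // gw]
            # Count occurrences of each characterization value in this region.
            hist = np.bincount(region.ravel(), minlength=n_values)[:n_values]
            feats.append(hist)
    return np.concatenate(feats)  # e.g., 16 regions x 16 values = 256 dims
```

With the 4×4 grid and 16 characterization values of the example, the output is exactly the 256-dimensional first feature vector described above.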
  • the reason why the characteristic value of each sampling point is not directly used as the first feature vector is also due to the consideration of the accuracy of image retrieval.
  • The number of regions into which the depth map is divided is also crucial for regulating the accuracy of image retrieval. If the number of regions is too large, the granularity is too fine and the sensitivity of image retrieval is too high, making it hard to recognize images representing the same object; if the number of regions is too small, the granularity is too coarse and the sensitivity is too low, making it hard to distinguish images representing different objects.
  • the embodiment of the present invention does not limit the number of divided areas, and those skilled in the art can select according to requirements.
  • a set of candidate images is set, and the set of candidate images includes candidate images shot from different angles for different objects. For example, each object is photographed from 8 different angles, and 8 candidate images of each object are obtained.
  • the embodiment of the present invention does not limit the number of angles, which can be set by those skilled in the art.
  • the embodiment of the present invention does not limit the specific angle, and it may be any shooting angle.
  • for each candidate image in the candidate image set, the method provided in step 201 is used to extract a depth map, and a feature vector, named the second feature vector, is extracted from the depth map; this gives the second feature vector of each candidate image.
  • the step of extracting the second feature vectors can be performed before step 201, that is, the second feature vectors are extracted from the candidate images in the candidate image set in advance and stored.
  • then, each time the first feature vector of an image to be retrieved is obtained, only a similarity comparison with the stored second feature vectors is needed, which saves computing resources and speeds up retrieval.
  • alternatively, the step of extracting the second feature vectors may be performed after step 201; in that case, the second feature vectors must be re-extracted from every candidate image each time a first feature vector is obtained, which wastes computing resources and takes more time.
  • the first feature vector and each second feature vector are used to calculate the similarity.
  • the similarity calculation method may be calculating Euclidean distance, cosine distance, etc., which is not limited in this embodiment of the present invention.
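As a hedged illustration of this similarity comparison, the sketch below uses cosine similarity and a top-N rule; these are just two of the options the text explicitly leaves open, and the function names are assumptions:

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity between two feature vectors
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_n_candidates(first_vec, second_vecs, n=10):
    """Indices of the n candidate feature vectors most similar to first_vec,
    i.e. one possible reading of the 'first similarity requirement'."""
    sims = [cosine_similarity(first_vec, v) for v in second_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:n]
```

A Euclidean-distance variant would simply rank by `np.linalg.norm(a - b)` in ascending order instead.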
  • determine each target candidate image corresponding to each second feature vector that meets the first similarity requirement; the first similarity requirement may be, for example, the top N candidate images with the highest similarity, or a similarity meeting a preset threshold, which is not limited in this embodiment of the present invention.
  • the top 10 candidate images with the highest similarity are determined as target candidate images.
  • the target candidate image determined by the above method may be directly output to the user as the target image, or may be subjected to subsequent second screening.
  • since the methods in steps 201 and 202 focus only on the depth map, which carries depth information but no color information, the resulting image retrieval results may be less accurate.
  • the depth map of an image represents the distance of each point from the camera in the physical world, so the foreground can be distinguished from the background.
  • the foreground is close to the camera, and the background is far from it. If a method can be adopted that both attends to the color information of the image and eliminates, as far as possible, the interference of the background with image retrieval, then the accuracy of image retrieval can be greatly improved.
  • the second screening is carried out in these target candidate images.
  • the color RGB image and the depth image of the image to be retrieved are respectively extracted, and based on the depth image, feature extraction is performed on the RGB image to obtain a third feature vector.
  • the method for obtaining the third eigenvector will be specifically introduced below.
  • the neural network model of the double-tower structure adopted in the embodiment of the present invention is introduced, as shown in FIG. 6 .
  • different information is input to the two branches of the neural network model; the second branch generates attention weights at each layer from its input, providing weighting for the training of each layer of the first branch.
  • the attention mechanism allows the model to learn how to allocate its own attention, that is, to weight the input information.
  • the RGB image is input to the first branch, and the depth weight image is generated from the depth image and then input to the second branch.
  • the first branch is a convolutional neural network, which is equipped with multiple convolutional layers. Each convolutional layer further extracts features from the input features of the previous layer, and the extracted feature vectors are then input to the next layer.
  • when performing feature extraction, each convolutional layer also receives the attention weights extracted by the corresponding layer of the second branch and extracts features based on those weights, thereby generating the third feature vector.
  • each point in the depth map represents the distance from the camera in the physical world: the foreground is close to the camera, so its depth value is small; the background is farther from the camera, so its depth value is large.
  • our scheme hopes that the model will give more weight to the foreground and less weight to the background or even ignore the background in the process of feature extraction. Therefore, it is necessary to give a larger weight value to the foreground and a smaller weight value to the background.
  • each item of depth information in the depth map can first be normalized, the normalized values being di,j. Negating di,j and adding 1 generates the depth weight map W.
  • each point in the depth weight map W is Wi,j = 1 − di,j.
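The conversion Wi,j = 1 − di,j can be sketched as follows; the min-max normalization is an assumed concrete choice, since the text only states that the depths are normalized before inversion:

```python
import numpy as np

def depth_weight_map(depth_map):
    """Normalize depths to [0, 1] and invert, so near points (small depth,
    i.e. foreground) receive large weights: W[i, j] = 1 - d[i, j]."""
    d = np.asarray(depth_map, dtype=np.float64)
    d_norm = (d - d.min()) / max(d.max() - d.min(), 1e-12)
    return 1.0 - d_norm
```

For example, depths (0, 10, 20, 40) normalize to (0, 0.25, 0.5, 1) and yield weights (1, 0.75, 0.5, 0), giving the nearest point the largest weight.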
  • FIG. 7 shows a schematic diagram of feature extraction based on a neural network model with a double-tower structure.
  • both the RGB image and the depth image are 256 ⁇ 256 images, and the RGB image A1 is input to the first branch of the model, and the depth image is converted into a depth weight image B1 and then input to the second branch.
  • the convolution operation is performed on A1 based on the depth weight map B1 and the model parameter C1.
  • the specific convolution algorithm is an existing technology and will not be repeated here.
  • the operation result A2 is input to the second convolutional layer. Since the first convolutional layer has multiple convolution kernels, the number of feature maps increases, as shown in the figure. Assuming A2 is 128 × 128, in the second branch the depth weight map is correspondingly scaled to 128 × 128 to obtain B2, which is input to the second convolutional layer of the first branch, because the size of the depth weight map must correspond to the size of the output produced by each layer of the first branch of the convolutional neural network in the forward pass.
  • the second convolutional layer performs a convolution operation on A2 based on the depth weight map B2 and the model parameters C2.
  • the obtained result is input to the third convolutional layer, so that each convolutional layer performs feature extraction according to the convolution operation of the first convolutional layer, until the final third feature vector is obtained.
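A minimal single-channel sketch of one such attention-weighted layer is below; the elementwise multiplication by the resized weight map, the nearest-neighbor resizing, and the stride-2 valid convolution are all illustrative assumptions, since the patent leaves the exact convolution algorithm to existing technology:

```python
import numpy as np

def resize_nearest(weight_map, size):
    # scale the depth weight map to the feature-map size of the current layer
    h, w = weight_map.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return weight_map[np.ix_(ys, xs)]

def weighted_conv_layer(feature, weight_map, kernel, stride=2):
    """One attention-weighted layer: weight the feature map elementwise by
    the resized depth weight map, then apply a strided valid convolution.
    A real layer would have many kernels and channels."""
    attended = feature * resize_nearest(weight_map, feature.shape[0])
    k = kernel.shape[0]
    out_size = (feature.shape[0] - k) // stride + 1
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = attended[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)
    return out
```

Chaining such layers, with the weight map resized to each layer's input size (B1, B2, ...), mirrors the forward pass described for FIG. 7.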
  • the model parameters C1 and C2 are obtained when the neural network model of the double-tower structure is trained according to a large number of sample images.
  • the above steps are performed for each target candidate image determined in step 202 to obtain each fourth feature vector; the similarity between the third feature vector and each fourth feature vector is calculated, and the target candidate image satisfying the second similarity requirement is used as the target image of the image to be retrieved.
  • extracting the third feature vector with the neural network model of the double-tower structure involves more complex computation than extracting the first feature vector from the depth map; it consumes more computing resources and is more time-consuming.
  • This is also one of the reasons for the first screening through the first feature vector extracted from the depth map, and then the second screening through the third feature vector extracted by the neural network model of the double-tower structure.
  • a small number of target candidate images are obtained through the first screening, and the second screening is then performed among them with the double-tower neural network model, so there is no need to extract feature vectors from all candidate images with that model; this saves a great deal of computing resources and time while preserving the accuracy of image retrieval.
  • the neural network model of the double-tower structure is used as a feature extractor.
  • a twin neural network (Siamese Network, SN) can also be trained based on the neural network model of the above-mentioned double-tower structure.
  • the twin neural network includes two neural network models with twin tower structures and multiple fully connected layers.
  • the double-tower neural network models on the left and right sides receive the image to be retrieved and any target candidate image, respectively; their outputs pass through the fully connected layers, which directly output the similarity score between the image to be retrieved and the target candidate image.
  • the siamese neural network also requires a large number of sample images for training.
  • FIG. 9 exemplarily shows the structure of an image retrieval device provided by an embodiment of the present invention, and the structure can execute the process of image retrieval.
  • the device specifically includes:
  • a determining unit 901 configured to:
  • the candidate image set includes candidate images shot from different angles for different objects ;
  • from each target candidate image, determine the target candidate image that meets the second similarity requirement with the third feature vector of the image to be retrieved as the target image of the image to be retrieved; the third feature vector is determined based on the color RGB map of the image to be retrieved.
  • the determining unit 901 is specifically configured to:
  • based on the normal vector of each sampling point, determine the normal vector histogram of the depth map; the normal vector histogram is used to characterize the distribution state of the orientation information of each sampling point;
  • the determining unit 901 is specifically configured to:
  • a second sampling point and a third sampling point are determined from the depth map according to preset rules; according to the depth vector of the first sampling point, the depth vector of the second sampling point and the depth vector of the third sampling point to determine the normal vector of the first sampling point.
  • the determining unit 901 is specifically configured to:
  • a normal vector histogram of the depth map is determined according to distribution states of sampling points with different orientation information in multiple regions.
  • the determining unit 901 is specifically configured to:
  • for the normal vector of any sampling point in the region, determine the first angle formed by the normal vector and the first coordinate axis, and the second angle formed by the normal vector and the second coordinate axis; after bucketized binning of the first angle and the second angle, the characterization value of the normal vector is obtained;
  • based on the characterization values of the normal vectors of the sampling points in the region, the distribution state of different characterization values in the region is determined.
  • the determining unit 901 is specifically configured to:
  • the RGB map and the depth weight map are respectively input to the first branch and the second branch of the neural network model of the double-tower structure, wherein each convolutional layer in the first branch performs feature extraction on the RGB map based on the attention weights produced at each layer of the second branch, thereby generating the third feature vector of the image to be retrieved;
  • the determining unit 901 is specifically configured to:
  • the preset method is that the smaller the depth information is, the larger the weight value in the depth weight map is.
  • an embodiment of the present application provides a computer device, as shown in FIG. 10 , including at least one processor 1001 and a memory 1002 connected to at least one processor.
  • the specific connection medium between the processor 1001 and the memory 1002 is not limited in the embodiment of the present application; in FIG. 10, a bus connection between them is taken as an example.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the memory 1002 stores instructions executable by at least one processor 1001, and at least one processor 1001 can execute the steps of the image retrieval method above by executing the instructions stored in the memory 1002.
  • the processor 1001 is the control center of the computer device; it can connect the various parts of the computer device through various interfaces and lines, and performs image retrieval by running or executing the instructions stored in the memory 1002 and calling the data stored in the memory 1002.
  • the processor 1001 may include one or more processing units, and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 1001.
  • the processor 1001 and the memory 1002 can be implemented on the same chip, and in some embodiments, they can also be implemented on independent chips.
  • the processor 1001 can be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • a general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
  • the memory 1002 as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules.
  • the memory 1002 may include at least one type of storage medium, such as flash memory, hard disk, multimedia card, card-type memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), programmable read-only memory (Programmable Read Only Memory, PROM), read-only memory (Read Only Memory, ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 1002 is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 1002 in the embodiment of the present application may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.
  • an embodiment of the present invention also provides a computer-readable storage medium that stores a computer-executable program, the computer-executable program being used to cause a computer to perform the image retrieval method listed in any of the above manners.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction means that implements the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.


Abstract

An embodiment of the present invention provides an image retrieval method and apparatus. The method includes: determining a first feature vector of an image to be retrieved from a depth map of the image to be retrieved; determining, from a candidate image set, each target candidate image corresponding to each second feature vector that meets a first similarity requirement with the first feature vector, the candidate image set including candidate images of different objects shot from different angles; and determining, from the target candidate images, a target candidate image whose similarity with a third feature vector of the image to be retrieved meets a second similarity requirement as the target image of the image to be retrieved, the third feature vector being determined based on a color RGB map of the image to be retrieved. Performing retrieval over candidate images shot from different angles reduces the influence of shooting angle on retrieval, the third feature vector is extracted from the color RGB map of the image to be retrieved, and the two screening passes take different image information into account, improving the accuracy of image retrieval.

Description

An image retrieval method and apparatus

Cross-reference to related applications

This application claims priority to the Chinese patent application filed with the China Patent Office on December 13, 2021, with application number 202111519845.8 and entitled "An image retrieval method and apparatus", the entire contents of which are incorporated herein by reference.

Technical field

Embodiments of the present invention relate to the technical field of computer vision, and in particular to an image retrieval method, apparatus, computing device, and computer-readable storage medium.

Background

With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech); however, because of the security and real-time requirements of the financial industry, higher demands are also placed on technology.

With the rapid development of computers, the Internet, and multimedia in recent years, the world holds an ever-growing amount of massive image data. The introduction of image retrieval has greatly broadened the scope of retrieval and improved its accuracy. Image retrieval is widely used in various fields, such as search engines, e-commerce platforms, security, and information authentication. For example, in a lost-and-found system, a user inputs an image of a lost item as the image to be retrieved, and the system retrieves images in its repository that may show the item and presents them, or their identifiers, to the user.

The core of image retrieval is comparing the similarity between the image to be retrieved and each candidate image in a candidate image set. Existing methods for determining the similarity of two images do not consider the influence of shooting angle, so even two images of the same item may have low similarity.

In summary, embodiments of the present invention provide an image retrieval method to improve the accuracy of image retrieval.
Summary

Embodiments of the present invention provide an image retrieval method to improve the accuracy of image retrieval.

In a first aspect, an embodiment of the present invention provides an image retrieval method, including:

determining a first feature vector of an image to be retrieved from a depth map of the image to be retrieved;

determining, from a candidate image set, each target candidate image corresponding to each second feature vector that meets a first similarity requirement with the first feature vector, the candidate image set including candidate images of different objects shot from different angles;

determining, from the target candidate images, a target candidate image whose similarity with a third feature vector of the image to be retrieved meets a second similarity requirement as the target image of the image to be retrieved, the third feature vector being determined based on a color RGB map of the image to be retrieved.

The depth map of an image reflects the distance of each point in the image from the camera in the physical world, and thus the three-dimensional characteristics of the objects in the image. In the above method, the first feature vector is determined from the depth map of the image to be retrieved, so it reflects the shape information of the objects in the image, taking the influence of shooting angle into account. The first feature vector is first compared for similarity with the feature vectors of the candidate images in the candidate image set to determine the target candidate images that meet the first similarity requirement, and the target candidate images obtained in this way have high accuracy. Moreover, since the candidate image set includes candidate images of different objects shot from different angles, performing retrieval over these candidate images further reduces the influence of shooting angle on retrieval. After the first screening based on the depth map, and because the depth map carries no color information, a third feature vector is extracted from the color RGB map of the image to be retrieved to improve accuracy, and the target image is determined by comparing the third feature vector with each target candidate image. In this way a second screening is performed on the basis of the first; the two screenings take different image information into account and improve the accuracy of image retrieval.
Optionally, determining the first feature vector of the image to be retrieved from the depth map of the image to be retrieved includes:

determining, from the depth map of the image to be retrieved, the normal vector of each sampling point in the depth map, the normal vector being used to characterize the orientation information of the sampling point;

determining a normal vector histogram of the depth map based on the normal vectors of the sampling points, the normal vector histogram being used to characterize the distribution state of the orientation information of the sampling points;

obtaining the first feature vector from the normal vector histogram.

The normal vector of each sampling point in the depth map reflects the orientation information of that sampling point, and the normal vector histogram obtained from the orientation information of the sampling points reflects its distribution state; in other words, the three-dimensional characteristics and shape information of the objects in the image to be retrieved can be captured, avoiding the interference of different shooting angles with retrieval accuracy.

Optionally, determining the normal vector of each sampling point in the depth map from the depth map of the image to be retrieved includes:

sampling the depth map of the image to be retrieved to obtain first sampling points;

for each first sampling point, determining a second sampling point and a third sampling point from the depth map according to a preset rule, and determining the normal vector of the first sampling point from the depth vector of the first sampling point, the depth vector of the second sampling point, and the depth vector of the third sampling point.

With the above method the normal vector of each sampling point can be determined, so the orientation information of each sampling point can be accurately reflected.

Optionally, determining the normal vector histogram of the depth map based on the normal vectors of the sampling points includes:

dividing the depth map into multiple regions and, for any region, determining the distribution state of sampling points with different orientation information in the region from the normal vectors of the sampling points in the region;

determining the normal vector histogram of the depth map from the distribution states of sampling points with different orientation information in the multiple regions.

The depth map is divided into multiple regions, the distribution state of sampling points with different orientation information is obtained for each region, and the normal vector histogram of the depth map is determined from these distribution states. This blurs some of the depth information of the image to a certain extent, so that retrieval does not become so sensitive that images of the same object cannot be recognized, thereby improving retrieval accuracy. At the same time, the distribution state determined for each region reflects the shape information of the object photographed in that region, and integrating the regions reflects the shape information of objects in different areas of the whole image to be retrieved. The above method therefore overcomes the inaccuracy caused by changes of shooting angle and improves retrieval accuracy.

Optionally, determining the distribution state of sampling points with different orientation information in the region from the normal vectors of the sampling points in the region includes:

for the normal vector of any sampling point in the region, determining a first angle formed by the normal vector and a first coordinate axis, and a second angle formed by the normal vector and a second coordinate axis; after bucketized binning of the first angle and the second angle, obtaining the characterization value of the normal vector;

determining the distribution state of different characterization values in the region from the characterization values of the normal vectors of the sampling points in the region.

Binning the first angle and the second angle of the normal vector simplifies their numerical range, so the resulting orientation values of the sampling points also take values in a correspondingly simplified range rather than in a very wide one. In subsequent image retrieval this reduces sensitivity, so that two images that actually depict the same object are not judged to be images with low similarity.

Optionally, determining the target candidate image that meets the second similarity requirement with the third feature vector of the image to be retrieved includes:

generating a depth weight map based on the depth information in the depth map;

inputting the RGB map and the depth weight map respectively into the first branch and the second branch of a neural network model of a double-tower structure, where each convolutional layer in the first branch performs feature extraction on the RGB map based on the attention weights produced at each layer of the second branch, thereby generating the third feature vector of the image to be retrieved;

calculating the similarity between the third feature vector and the fourth feature vector of any target candidate image, thereby determining the target candidate image whose similarity meets the second similarity requirement.

In image retrieval, the background also strongly affects the success rate: different backgrounds may cause two images of the same object to have low similarity. The depth map reflects how far objects in the image to be retrieved are from the camera, so it is used to extract depth-based attention weights, which are then applied in the convolution over the image to be retrieved, giving the foreground more weight and ignoring the background; the resulting third feature vector thus focuses on the foreground and disregards the background as much as possible. The fourth feature vector is processed in the same way, so computing the similarity between the third and fourth feature vectors improves retrieval accuracy.

Optionally, generating the depth weight map based on the depth information in the depth map includes:

normalizing each item of depth information in the depth map;

converting the normalized depth information into weights in a preset manner to generate the depth weight map, the preset manner being that the smaller the depth information, the larger the weight value in the depth weight map.

Since the depth information in the depth map represents distance from the camera, with smaller values closer to the camera, giving the foreground more weight requires turning small depth values into large weight values. This raises the weight of the foreground, focuses attention on it, and improves retrieval accuracy.
In a second aspect, an embodiment of the present invention further provides an image retrieval apparatus, including:

a determining unit configured to:

determine a first feature vector of an image to be retrieved from a depth map of the image to be retrieved;

determine, from a candidate image set, each target candidate image corresponding to each second feature vector that meets a first similarity requirement with the first feature vector, the candidate image set including candidate images of different objects shot from different angles;

determine, from the target candidate images, a target candidate image whose similarity with a third feature vector of the image to be retrieved meets a second similarity requirement as the target image of the image to be retrieved, the third feature vector being determined based on a color RGB map of the image to be retrieved.

Optionally, the determining unit is specifically configured to:

determine, from the depth map of the image to be retrieved, the normal vector of each sampling point in the depth map, the normal vector being used to characterize the orientation information of the sampling point;

determine a normal vector histogram of the depth map based on the normal vectors of the sampling points, the normal vector histogram being used to characterize the distribution state of the orientation information of the sampling points;

obtain the first feature vector from the normal vector histogram.

Optionally, the determining unit is specifically configured to:

sample the depth map of the image to be retrieved to obtain first sampling points;

for each first sampling point, determine a second sampling point and a third sampling point from the depth map according to a preset rule, and determine the normal vector of the first sampling point from the depth vector of the first sampling point, the depth vector of the second sampling point, and the depth vector of the third sampling point.

Optionally, the determining unit is specifically configured to:

divide the depth map into multiple regions and, for any region, determine the distribution state of sampling points with different orientation information in the region from the normal vectors of the sampling points in the region;

determine the normal vector histogram of the depth map from the distribution states of sampling points with different orientation information in the multiple regions.

Optionally, the determining unit is specifically configured to:

for the normal vector of any sampling point in the region, determine a first angle formed by the normal vector and a first coordinate axis, and a second angle formed by the normal vector and a second coordinate axis; after bucketized binning of the first angle and the second angle, obtain the characterization value of the normal vector;

determine the distribution state of different characterization values in the region from the characterization values of the normal vectors of the sampling points in the region.

Optionally, the determining unit is specifically configured to:

generate a depth weight map based on the depth information in the depth map;

input the RGB map and the depth weight map respectively into the first branch and the second branch of the neural network model of the double-tower structure, where each convolutional layer in the first branch performs feature extraction on the RGB map based on the attention weights produced at each layer of the second branch, thereby generating the third feature vector of the image to be retrieved;

calculate the similarity between the third feature vector and the fourth feature vector of any target candidate image, thereby determining the target candidate image whose similarity meets the second similarity requirement.

Optionally, the determining unit is specifically configured to:

normalize each item of depth information in the depth map;

convert the normalized depth information into weights in a preset manner to generate the depth weight map, the preset manner being that the smaller the depth information, the larger the weight value in the depth weight map.

In a third aspect, an embodiment of the present invention further provides a computing device, including:

a memory for storing a computer program;

a processor for calling the computer program stored in the memory and executing, according to the obtained program, the image retrieval method listed in any of the above manners.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer-executable program, the computer-executable program being used to cause a computer to execute the image retrieval method listed in any of the above manners.
Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of an image retrieval method provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a first sampling point, a second sampling point, and a third sampling point provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of dividing a depth map provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a normal vector histogram provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of a neural network model of a double-tower structure provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of feature extraction based on a neural network model of a double-tower structure provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of a Siamese neural network provided by an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an image retrieval apparatus provided by an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.
Detailed description

To make the purposes, implementations, and advantages of the present application clearer, exemplary implementations of the present application are described clearly and completely below with reference to the accompanying drawings in the exemplary embodiments of the present application. Obviously, the described exemplary embodiments are only some embodiments of the present application, not all of them.

Based on the exemplary embodiments described in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the claims appended to the present application. In addition, although the disclosure in the present application is introduced through one or several exemplary examples, it should be understood that each aspect of the disclosure may also separately constitute a complete implementation.

It should be noted that the brief explanations of terms in the present application are only for the convenience of understanding the implementations described below and are not intended to limit the implementations of the present application. Unless otherwise stated, these terms should be understood according to their ordinary and usual meanings.

The terms "first", "second", "third", and the like in the specification, claims, and above drawings of the present application are used to distinguish similar or analogous objects or entities and do not necessarily imply a particular order or sequence, unless otherwise indicated. It should be understood that terms so used are interchangeable where appropriate; for example, the embodiments can be implemented in orders other than those illustrated or described herein.

In addition, the terms "include" and "have" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a product or device comprising a series of components is not necessarily limited to those components clearly listed, but may include other components not clearly listed or inherent to such products or devices.
FIG. 1 exemplarily shows a system architecture to which an embodiment of the present invention is applicable. The system architecture may be a server 100, including a processor 110, a communication interface 120, and a memory 130.

The communication interface 120 is used to communicate with a terminal device, transmitting and receiving the information transferred by the terminal device.

The processor 110 is the control center of the server 100. It connects the various parts of the entire server 100 through various interfaces and lines, and performs the various functions of the server 100 and processes data by running or executing software programs and/or modules stored in the memory 130 and calling data stored in the memory 130. Optionally, the processor 110 may include one or more processing units.

The memory 130 may be used to store software programs and modules; the processor 110 executes various functional applications and data processing by running the software programs and modules stored in the memory 130. The memory 130 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, applications required by at least one function, and so on, and the data storage area may store data created by business processing. In addition, the memory 130 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

It should be noted that the structure shown in FIG. 1 above is only an example, and the embodiment of the present invention does not limit it.

The server shown in FIG. 1 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms.

An embodiment of the present invention provides one possible image retrieval method: a trained feature extractor is used to extract the feature vector of the image to be retrieved and the feature vectors of the candidate images in the candidate image set; the similarity between the feature vector of the image to be retrieved and each candidate feature vector is then calculated, and the candidate image corresponding to a feature vector whose similarity meets the requirement is used as the target image of the image to be retrieved.

In the above method, the feature vector of the image to be retrieved is extracted from its color RGB map. A feature vector extracted in this way does not take into account the shape information of the objects in the image, so when the shooting angle changes, the feature values also change considerably. As a result, two images of the same object may have low similarity because of different shooting angles and be determined not to show the same object. For example, the image to be retrieved and a candidate image show the same cup, but shot from different angles. Since the feature extractor paid no attention to the shape information of the object during feature extraction, the resulting feature vectors differ widely, and the images are finally determined not to show the same object; that is, retrieval fails.

It can be seen that the above method is easily affected by the shooting angle of the item and has low accuracy.
An embodiment of the present invention further provides another possible image retrieval method, as shown in FIG. 2, including the following steps:

Step 201: determining a first feature vector of an image to be retrieved from a depth map of the image to be retrieved;

Step 202: determining, from a candidate image set, each target candidate image corresponding to each second feature vector that meets a first similarity requirement with the first feature vector, the candidate image set including candidate images of different objects shot from different angles;

Step 203: determining, from the target candidate images, a target candidate image whose similarity with a third feature vector of the image to be retrieved meets a second similarity requirement as the target image of the image to be retrieved, the third feature vector being determined based on a color RGB map of the image to be retrieved.

The depth map of an image reflects the distance of each point in the image from the camera in the physical world, and thus the three-dimensional characteristics of the objects in the image. In the above method, the first feature vector is determined from the depth map of the image to be retrieved, so it reflects the shape information of the objects in the image, taking the influence of shooting angle into account. The first feature vector is first compared for similarity with the feature vectors of the candidate images in the candidate image set to determine the target candidate images that meet the first similarity requirement, and the target candidate images obtained in this way have high accuracy. Moreover, since the candidate image set includes candidate images of different objects shot from different angles, performing retrieval over these candidate images further reduces the influence of shooting angle on retrieval. After the first screening based on the depth map, and because the depth map carries no color information, a third feature vector is extracted from the color RGB map of the image to be retrieved to improve accuracy, and the target image is determined by comparing the third feature vector with each target candidate image. In this way a second screening is performed on the basis of the first; the two screenings take different image information into account and improve the accuracy of image retrieval.
In step 201, the depth map of the image to be retrieved is extracted, and the first feature vector of the image to be retrieved is determined from the depth map.

A depth map can be extracted only from an image to be retrieved that was captured with a depth camera (Depth Camera, DC). Any pixel in the depth map can be expressed as (xi, yi, di), where xi is the abscissa of the pixel in the depth map, yi is its ordinate, and di is its depth value, that is, its distance from the depth camera in the physical world.

The extracted depth map has width w and height h. The depth map is sampled uniformly to obtain a number of first sampling points, for example m*n sampling points, where m<h and n<w.

Next, the normal vector of each first sampling point is determined; the normal vector characterizes the orientation information of the sampling point. The method for determining the normal vector is specifically:

for each first sampling point, a second sampling point and a third sampling point are determined from the depth map according to a preset rule, and the normal vector of the first sampling point is determined from the depth vector of the first sampling point, the depth vector of the second sampling point, and the depth vector of the third sampling point.

For example, let the coordinates (i.e., the depth vector) of a first sampling point be P0(x0, y0, d0). By the normal vector formula, three points are needed to determine a plane and compute the normal. We therefore resample with step size s at (x0+s, y0) and (x0, y0+s) in the depth map to obtain the second sampling point P1 and the third sampling point P2.

The coordinates of P1 and P2 are written as (x1, y1, d1) and (x2, y2, d2), where d1 and d2 are the depth values obtained at P1 and P2, with x1=x0+s, y1=y0, x2=x0, y2=y0+s. Define the vector vP0P1 as the vector from P0 to P1 and vP0P2 as the vector from P0 to P2; then vP0P1=(x1-x0, y1-y0, d1-d0) and vP0P2=(x2-x0, y2-y0, d2-d0).

For brevity, write vP0P1=(a1, b1, c1) and vP0P2=(a2, b2, c2), where a1=x1-x0, b1=y1-y0, c1=d1-d0, a2=x2-x0, b2=y2-y0, c2=d2-d0.

FIG. 3 shows a schematic diagram of the first, second, and third sampling points. As shown in the figure, by the cross-product formula, the normal vector at sampling point P0 is Vp0 = vP0P1 × vP0P2 = (b1*c2 − b2*c1, c1*a2 − c2*a1, a1*b2 − a2*b1).
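The normal-vector computation above can be sketched with the standard cross product; the `depth[y][x]` indexing and the function name are assumptions made for illustration:

```python
import numpy as np

def normal_vector(depth, x0, y0, s=1):
    """Normal at the first sampling point P0 = (x0, y0): resample at
    (x0+s, y0) and (x0, y0+s) to get P1 and P2, form the in-plane
    vectors vP0P1 and vP0P2, and take their cross product."""
    d0 = depth[y0][x0]
    d1 = depth[y0][x0 + s]      # second sampling point P1
    d2 = depth[y0 + s][x0]      # third sampling point P2
    v01 = np.array([s, 0, d1 - d0], dtype=float)   # vP0P1
    v02 = np.array([0, s, d2 - d0], dtype=float)   # vP0P2
    return np.cross(v01, v02)
```

On a flat depth patch the normal points straight out of the image plane; where the depth varies, the normal tilts accordingly, which is exactly the orientation information the histogram later summarizes.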
Performing the above procedure for each first sampling point yields the normal vector of each first sampling point. Note that, since the depth values di of the first sampling points in the depth map differ, the normal vectors are not all perpendicular to the plane of the depth map; they may point in all directions.

Next, the normal vector histogram of the depth map is determined based on the normal vectors of the sampling points. The normal vector histogram characterizes the distribution state of the orientation information of the sampling points, and this distribution reflects the shape information of the objects in the image.

Specifically, the method for determining the normal vector histogram is: divide the depth map into multiple regions; for any region, determine the distribution state of sampling points with different orientation information in the region; and determine the normal vector histogram of the depth map from the distribution states of the multiple regions. Determining the distribution state within a region can be done as follows: for the normal vector of any sampling point, determine the first angle formed by the normal vector and the first coordinate axis and the second angle formed by the normal vector and the second coordinate axis; after bucketized binning of the first and second angles, obtain the characterization value of the normal vector; then, from the characterization values of the normal vectors of the sampling points in the region, determine the distribution state of different characterization values in the region.

For example, for a sampling point with normal vector v(x, y, z), projecting it onto the plane formed by the x axis and the z axis gives the two-dimensional vector (x, z), and projecting it onto the plane formed by the y axis and the z axis gives (y, z). From these, the first angle between the normal vector and the first coordinate axis and the second angle between it and the second coordinate axis are obtained. Here the first coordinate axis is either of the x axis and the y axis, the second coordinate axis is likewise either of the x axis and the y axis, and the first and second coordinate axes are different.

The first angle is α1 = arctan(x, z) and the second angle is α2 = arctan(y, z). Since the normal vector is perpendicular to the tangent plane at its sampling point and points outward, α1 lies in (0, π) and α2 lies in (0, π). α1 and α2 are each bucketized (binning), with the bucket size defined by those skilled in the art according to requirements. Assuming the bucket count is 4, then:

bins = [[0, π/4), [π/4, π/2), [π/2, (3*π)/4), [(3*π)/4, π]]

If α1 lies in [0, π/4), its characterization value is 0; if α1 lies in [π/4, π/2), its characterization value is 1; if α1 lies in [π/2, 3π/4), its characterization value is 2; if α1 lies in [3π/4, π], its characterization value is 3. This finally gives the characterization value indα1 of α1; α2 is likewise converted into the corresponding characterization value indα2.

The characterization values of α1 and α2 are combined to give the characterization value of the normal vector, for example by the following formula:

indp = indα1*4 + indα2, from which it follows that indp ∈ [0, 15].
这样,深度图中每个采样点都确定了法向量的表征值。可以理解的是,该表征值是法向量的又一种表示形式。通过对法向量的第一夹角和第二夹角做桶化操作、表征值的整合,将法向量以另一种形式表现了出来,依然反映的是该采样点的朝向信息。不同的是,经过这一系列的处理,对法向量的表征粒度变粗。举个例子,原来的法向量通过坐标表示,例如(100,200,300),不同法向量的坐标不同;转化为第一夹角和第二夹角表示后,法向量的可以表示成(20°,30°),不同法向量的表示依然不同;经过桶化操作和表征值整合后,得到的法向量的表征值的取值范围为[0,15],也就是说,法向量可以表示成0-15中的任一个整数,这样就会出现明明是不同的法向量,但是均被表示成同一个整数的情况。
通过上述方法将表征粒度变粗,在后续的图像检索中可以减小敏感度,适当增加容错度,不至于将明明是表示同一对象的两张图像错误认为是相似度较低的图像,从而导致检索失败。因此,桶化操作中桶的大小的设置对于后续图像检索的正确率具有一定的影响,若划分桶的个数过多,粒度过细,则图像检索的敏感度太高,不容易将表示相同对象的图像识别出来;若划分桶的个数过少,粒度过粗,则图像检索的敏感度太低,不容易将表示不同对象的图像区分开来。本发明实施例对桶的大小不作限制,本领域技术人员可以根据需求进行选择。
接下来确定法向量直方图。法向量直方图可以表征各采样点的朝向信息的分布状态。将深度图划分为多个区域,例如划分为16个区域,如图4所示,图4中每个区域中分布着24个采样点。对各区域的法向量分布直方图进行统计。例如对于第一个区域,该区域中,24个采样点的法向量的表征值已知,均为[0,15]中的任一个整数。假如统计得到,表征值为0的法向量有4个,表征值为1的法向量有2个,表征值为2的法向量有1个……如此按照表征值从0-15的顺序进行统计,得到的该区域的法向量直方图如图5所示。那么该区域的法向量分布可以用一个16维的向量表示,例如:(4,2,1,3,0,5,6,1,1,1,0,0,0,0,0,0)。
综合各区域的法向量直方图,可得该深度图的法向量直方图。将各区域 的16维向量进行整合排列成一个256维的向量,即第一特征向量。
在上述方法中,之所以没有直接用各采样点的表征值作为第一特征向量,也是出于对图像检索的准确度的考量。若直接采用各采样点的表征值作为第一特征向量,则该第一特征向量的维度为24×16=384维(横向上24个采样点,纵向上16个采样点),划分的维度过细,包含的图像信息也更多,图像检索的敏感度太高,不容易将表示相同对象的图像识别出来。因此采用将深度图划分为多个区域,各区域分别统计表征值个数的方法,统计表征值个数的方法相比较于直接采用表征值的方法而言,一定程度上模糊了一些图像的深度信息,不至于使图像检索的敏感度太高。
综上,可以理解的是,将深度图划分为多个区域的数量对于图像检索的准确率的调控也至关重要。若划分的区域的数量太多,则粒度过细,图像检索的敏感度太高,不容易将表示相同对象的图像识别出来;若划分的区域的数量太少,粒度过粗,则图像检索的敏感度太低,不容易将表示不同对象的图像区分开来。本发明实施例对划分的区域的数量不作限制,本领域技术人员可以根据需求进行选择。
至此,我们通过待检索图像的深度图提取出了待检索图像的第一特征向量。
In step 202, a candidate image set is established; the candidate image set includes candidate images of different objects shot from different angles. For example, each object is photographed from 8 different angles, giving 8 candidate images per object. The embodiment of the present invention does not limit the number of angles, which can be set by those skilled in the art, nor the specific angles, which can be any shooting angles.

For each candidate image in the candidate image set, the method provided in step 201 is used to extract a depth map and then extract a feature vector from it, named the second feature vector; this gives the second feature vector of each candidate image. The step of extracting the second feature vectors may be performed before step 201, that is, the second feature vectors are extracted from the candidate images in the candidate image set in advance and stored; then, each time the first feature vector of an image to be retrieved is obtained, only a similarity comparison with the stored second feature vectors is needed, saving computing resources and speeding up retrieval. Alternatively, the step may be performed after step 201, in which case the second feature vectors must be re-extracted from all candidate images every time a first feature vector is obtained, wasting computing resources and taking more time.

The similarity between the first feature vector and each second feature vector is computed; the similarity may be computed as a Euclidean distance, a cosine distance, etc., which is not limited in the embodiment of the present invention.

Each target candidate image corresponding to each second feature vector that meets the first similarity requirement is determined; the first similarity requirement may be the top N candidate images with the highest similarity, a similarity meeting a preset threshold, etc., which is not limited in the embodiment of the present invention. For example, the 10 candidate images with the highest similarity are determined to be the target candidate images.

The target candidate images determined by the above method may be output directly to the user as target images, or may undergo the second screening described next.
Since the methods of steps 201 and 202 attend only to the depth map, which carries depth information but no color information, the resulting retrieval results may be less accurate.

Moreover, existing image retrieval technology has paid no attention to the influence of the image background on retrieval, so images of the same object shot against different backgrounds may well be judged to show different objects, and retrieval accuracy is low. The depth map of an image represents the distance of each point from the camera in the physical world, so it can distinguish foreground from background: the foreground is close to the camera and the background far from it. If a method can be adopted that both attends to the color information of the image and eliminates, as far as possible, the interference of the background with retrieval, then retrieval accuracy can be greatly improved. Based on this inventive concept, the following scheme is proposed.

On the basis of the first screening, a second screening is next performed among these target candidate images.

The color RGB map and the depth map of the image to be retrieved are extracted separately, and feature extraction is performed on the RGB map based on the depth map to obtain a third feature vector.

The method for obtaining the third feature vector is described in detail below.

First, the neural network model with a double-tower structure used in the embodiment of the present invention is introduced, as shown in FIG. 6. Different information is input to the two branches of the model; the second branch generates attention weights at each layer from its input, providing weighting for the training of each layer of the first branch. The attention mechanism lets the model learn how to allocate its attention, that is, to weight the input information.

In the embodiment of the present invention, the RGB map is input to the first branch, and the depth map is converted into a depth weight map and then input to the second branch. The first branch is a convolutional neural network with multiple convolutional layers; each layer further extracts features from the features output by the previous layer, and the extracted feature vectors are input to the next layer. During feature extraction, each convolutional layer also receives the attention weights extracted by the corresponding layer of the second branch and extracts features based on them, thereby generating the third feature vector.

The depth map is converted into a depth weight map before being input to the second branch because each point in the depth map represents that point's distance from the camera in the physical world: the foreground is close to the camera, so its depth value is small, while the background is far away, so its depth value is large. Our scheme wants the model, during feature extraction, to give the foreground more weight and the background less weight or even to ignore the background; the foreground therefore needs larger weight values and the background smaller ones. Following this idea, the depth information in the depth map can first be normalized, the normalized values being di,j; negating di,j and adding 1 then generates the depth weight map W, where each point Wi,j = 1 − di,j.

The feature extraction process is illustrated below with a concrete example.

FIG. 7 shows a schematic diagram of feature extraction based on the double-tower neural network model. As shown in the figure, suppose the RGB map and the depth map are both 256×256 images; the RGB map A1 is input to the first branch of the model, and the depth map is converted into the depth weight map B1 and input to the second branch.

In the first convolutional layer of the first branch, a convolution operation is performed on A1 based on the depth weight map B1 and the model parameters C1; the specific convolution algorithm is existing technology and is not repeated here. After the first convolutional layer finishes, the result A2 is input to the second convolutional layer. Since the first layer has multiple convolution kernels, the number of feature maps increases, as shown in the figure. Assuming A2 is 128×128, the depth weight map in the second branch is correspondingly scaled to 128×128 to obtain B2, which is input to the second convolutional layer of the first branch, because the size of the depth weight map must correspond to the size of the output produced by each layer of the first branch in the forward pass.

The second convolutional layer performs a convolution operation on A2 based on the depth weight map B2 and the model parameters C2. The result is input to the third convolutional layer, and each convolutional layer extracts features following the convolution operation of the first layer until the final third feature vector is obtained. The model parameters C1 and C2 were obtained earlier by training the double-tower neural network model on a large number of sample images.

The above steps are performed for each target candidate image determined in step 202 to obtain each fourth feature vector of each target candidate image; the similarity between the third feature vector and each fourth feature vector is computed, and the target candidate image meeting the second similarity requirement is used as the target image of the image to be retrieved.

It can be seen that extracting the third feature vector with the double-tower neural network model involves more complex computation than extracting the first feature vector from the depth map, consuming more computing resources and time. This is one of the reasons the first screening uses the first feature vector extracted from the depth map and the second screening then uses the third feature vector extracted by the double-tower model: the first screening yields a small number of target candidate images, among which the second screening is performed with the double-tower model, so there is no need to extract feature vectors from all candidate images with that model; this saves a great deal of computing resources and time while preserving retrieval accuracy.

In the above scheme, the double-tower neural network model is used as a feature extractor. Optionally, a Siamese neural network (Siamese Network, SN) can also be trained based on the double-tower model. As shown in FIG. 8, the Siamese network includes two double-tower neural network models and multiple fully connected layers. The double-tower models on the left and right sides receive the image to be retrieved and any target candidate image, respectively; their outputs pass through the fully connected layers, which directly output the similarity score between the image to be retrieved and the target candidate image. The Siamese network likewise requires a large number of sample images for training.
基于相同的技术构思,图9示例性的示出了本发明实施例提供的一种图像检索装置的结构,该结构可以执行图像检索的流程。
如图9所示,该装置具体包括:
确定单元901,用于:
通过待检索图像的深度图,确定所述待检索图像的第一特征向量;
从候选图像集中确定出与所述第一特征向量满足第一相似度要求的各第二特征向量对应的各目标候选图像;所述候选图像集中包括针对不同对象从不同角度进行拍摄的各候选图像;
从各目标候选图像中,确定出与所述待检索图像的第三特征向量满足第二相似度要求的目标候选图像作为所述待检索图像的目标图像;所述第三特征向量是基于所述待检索图像的颜色RGB图确定的。
可选地,所述确定单元901具体用于:
通过待检索图像的深度图,确定所述深度图中每个采样点的法向量;所述法向量用于表征采样点的朝向信息;
基于各采样点的法向量,确定所述深度图的法向量直方图;所述法向量直方图用于表征各采样点的朝向信息的分布状态;
根据所述法向量直方图,得到所述第一特征向量。
Optionally, the determining unit 901 is specifically configured to:

sample the depth map of the image to be retrieved to obtain first sampling points;

for each first sampling point, determine a second sampling point and a third sampling point from the depth map according to a preset rule, and determine the normal vector of the first sampling point from the depth vectors of the first sampling point, the second sampling point, and the third sampling point.
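A minimal sketch of this normal-vector computation, assuming (as one possible preset rule, not specified by the embodiment) that the second and third sampling points are the neighbors immediately to the right of and below the first sampling point:

```python
import numpy as np

def normal_at(depth, i, j, step=1):
    """Estimate the normal vector at first sampling point (i, j).

    Assumed preset rule: the second and third sampling points are the
    neighbors `step` pixels to the right and below. Each point's depth
    vector is taken as (x, y, depth); the normal is the normalized cross
    product of the two in-surface difference vectors.
    """
    p1 = np.array([j,        i,        depth[i, j]],        float)
    p2 = np.array([j + step, i,        depth[i, j + step]], float)
    p3 = np.array([j,        i + step, depth[i + step, j]], float)
    n = np.cross(p2 - p1, p3 - p1)
    return n / np.linalg.norm(n)

# A plane of constant depth: every normal points straight at the camera
depth = np.full((4, 4), 5.0)
n = normal_at(depth, 1, 1)
# n == [0, 0, 1] (up to the sign convention of the cross product)
```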
Optionally, the determining unit 901 is specifically configured to:

divide the depth map into a plurality of regions and, for any region, determine the distribution of sampling points with different orientation information in the region from the normal vectors of the sampling points in the region;

determine the normal vector histogram of the depth map from the distributions, across the plurality of regions, of sampling points with different orientation information.
Optionally, the determining unit 901 is specifically configured to:

for the normal vector of any sampling point in the region, determine a first angle between the normal vector and a first coordinate axis and a second angle between the normal vector and a second coordinate axis, and apply binning to the first angle and the second angle to obtain a characteristic value of the normal vector;

determine the distribution of different characteristic values in the region from the characteristic values of the normal vectors of the sampling points in the region.
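The binning step can be sketched as follows; the number of buckets and the way the two bin indices are fused into one characteristic value are illustrative assumptions:

```python
import numpy as np

def characterize(normal, n_bins=4):
    """Bin the angles between a unit normal and the first (x) and second
    (y) coordinate axes, then fuse the two bin indices into a single
    characteristic value for the normal vector."""
    ax = np.degrees(np.arccos(np.clip(normal[0], -1.0, 1.0)))  # first angle
    ay = np.degrees(np.arccos(np.clip(normal[1], -1.0, 1.0)))  # second angle
    bx = min(int(ax // (180 / n_bins)), n_bins - 1)
    by = min(int(ay // (180 / n_bins)), n_bins - 1)
    return bx * n_bins + by            # characteristic value

def region_histogram(normals, n_bins=4):
    """Distribution of characteristic values over one region's normals."""
    hist = np.zeros(n_bins * n_bins)
    for n in normals:
        hist[characterize(n, n_bins)] += 1
    return hist / max(len(normals), 1)  # normalized distribution

# Three identically oriented normals all land in the same bucket
normals = [np.array([0.5, 0.5, np.sqrt(0.5)])] * 3
h = region_histogram(normals)
# Both angles are 60 degrees -> bin (1, 1) -> characteristic value 5
```

Concatenating such per-region histograms over all regions yields the normal vector histogram of the whole depth map, from which the first feature vector is obtained.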
Optionally, the determining unit 901 is specifically configured to:

generate a depth weight map based on depth information in the depth map;

input the RGB image and the depth weight map to a first branch and a second branch, respectively, of a two-branch neural network model, wherein each convolutional layer in the first branch performs feature extraction on the RGB image based on the attention weights generated at each layer by the second branch, thereby generating the third feature vector of the image to be retrieved;

compute the similarity between the third feature vector and a fourth feature vector of any one of the target candidate images, thereby determining a target candidate image whose similarity satisfies the second similarity requirement.
Optionally, the determining unit 901 is specifically configured to:

normalize each item of depth information in the depth map;

convert the normalized depth information into weights in a preset manner to generate the depth weight map, the preset manner being that the smaller the depth information, the larger the corresponding weight value in the depth weight map.
Based on the same technical concept, an embodiment of the present application provides a computer device, as shown in Fig. 10, comprising at least one processor 1001 and a memory 1002 connected to the at least one processor. This embodiment does not limit the specific connection medium between the processor 1001 and the memory 1002; in Fig. 10 they are connected by a bus, by way of example. A bus may be divided into an address bus, a data bus, a control bus, and so on.

In this embodiment of the present application, the memory 1002 stores instructions executable by the at least one processor 1001; by executing the instructions stored in the memory 1002, the at least one processor 1001 can perform the steps of the image retrieval method described above.

The processor 1001 is the control center of the computer device. It can use various interfaces and lines to connect the parts of the computer device and performs image retrieval by running or executing the instructions stored in the memory 1002 and invoking the data stored in the memory 1002. Optionally, the processor 1001 may include one or more processing units, and may integrate an application processor, which mainly handles the operating system, user interface, applications, and the like, with a modem processor, which mainly handles wireless communication. It will be understood that the modem processor need not be integrated into the processor 1001. In some embodiments, the processor 1001 and the memory 1002 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 1001 may be a general-purpose processor such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.

The memory 1002, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1002 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and the like. The memory 1002 is, without being limited thereto, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1002 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same technical concept, an embodiment of the present invention further provides a computer-readable storage medium storing a computer-executable program for causing a computer to execute the image retrieval method set out in any of the above forms.
Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to encompass them as well.

Claims (10)

  1. An image retrieval method, characterized in that it comprises:
    determining a first feature vector of an image to be retrieved from a depth map of the image to be retrieved;
    determining, from a candidate image set, target candidate images corresponding to second feature vectors that satisfy a first similarity requirement with respect to the first feature vector, the candidate image set comprising candidate images of different objects captured from different angles;
    determining, from the target candidate images, a target candidate image that satisfies a second similarity requirement with respect to a third feature vector of the image to be retrieved as a target image of the image to be retrieved, the third feature vector being determined based on a color (RGB) image of the image to be retrieved.
  2. The method according to claim 1, characterized in that determining the first feature vector of the image to be retrieved from the depth map of the image to be retrieved comprises:
    determining, from the depth map of the image to be retrieved, a normal vector for each sampling point in the depth map, the normal vector representing orientation information of the sampling point;
    determining a normal vector histogram of the depth map based on the normal vectors of the sampling points, the normal vector histogram representing the distribution of the orientation information of the sampling points;
    obtaining the first feature vector from the normal vector histogram.
  3. The method according to claim 2, characterized in that determining, from the depth map of the image to be retrieved, the normal vector of each sampling point in the depth map comprises:
    sampling the depth map of the image to be retrieved to obtain first sampling points;
    for each first sampling point, determining a second sampling point and a third sampling point from the depth map according to a preset rule, and determining the normal vector of the first sampling point from the depth vectors of the first sampling point, the second sampling point, and the third sampling point.
  4. The method according to claim 2, characterized in that determining the normal vector histogram of the depth map based on the normal vectors of the sampling points comprises:
    dividing the depth map into a plurality of regions and, for any region, determining the distribution of sampling points with different orientation information in the region from the normal vectors of the sampling points in the region;
    determining the normal vector histogram of the depth map from the distributions, across the plurality of regions, of sampling points with different orientation information.
  5. The method according to claim 4, characterized in that determining, from the normal vectors of the sampling points in the region, the distribution of sampling points with different orientation information in the region comprises:
    for the normal vector of any sampling point in the region, determining a first angle between the normal vector and a first coordinate axis and a second angle between the normal vector and a second coordinate axis, and applying binning to the first angle and the second angle to obtain a characteristic value of the normal vector;
    determining the distribution of different characteristic values in the region from the characteristic values of the normal vectors of the sampling points in the region.
  6. The method according to any one of claims 1 to 5, characterized in that determining the target candidate image that satisfies the second similarity requirement with respect to the third feature vector of the image to be retrieved comprises:
    generating a depth weight map based on depth information in the depth map;
    inputting the RGB image and the depth weight map to a first branch and a second branch, respectively, of a two-branch neural network model, wherein each convolutional layer in the first branch performs feature extraction on the RGB image based on attention weights generated at each layer by the second branch, thereby generating the third feature vector of the image to be retrieved;
    computing the similarity between the third feature vector and a fourth feature vector of any one of the target candidate images, thereby determining a target candidate image whose similarity satisfies the second similarity requirement.
  7. The method according to claim 6, characterized in that generating the depth weight map based on the depth information in the depth map comprises:
    normalizing each item of depth information in the depth map;
    converting the normalized depth information into weights in a preset manner to generate the depth weight map, the preset manner being that the smaller the depth information, the larger the corresponding weight value in the depth weight map.
  8. An image retrieval apparatus, characterized in that it comprises:
    a determining unit configured to:
    determine a first feature vector of an image to be retrieved from a depth map of the image to be retrieved;
    determine, from a candidate image set, target candidate images corresponding to second feature vectors that satisfy a first similarity requirement with respect to the first feature vector, the candidate image set comprising candidate images of different objects captured from different angles;
    determine, from the target candidate images, a target candidate image that satisfies a second similarity requirement with respect to a third feature vector of the image to be retrieved as a target image of the image to be retrieved, the third feature vector being determined based on a color (RGB) image of the image to be retrieved.
  9. A computing device, characterized in that it comprises:
    a memory configured to store a computer program;
    a processor configured to invoke the computer program stored in the memory and to execute the method according to any one of claims 1 to 7 in accordance with the obtained program.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer-executable program for causing a computer to execute the method according to any one of claims 1 to 7.
PCT/CN2022/100676 2021-12-13 2022-06-23 An image retrieval method and apparatus WO2023109069A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111519845.8 2021-12-13
CN202111519845.8A CN114241222A (zh) 2021-12-13 2021-12-13 An image retrieval method and apparatus

Publications (1)

Publication Number Publication Date
WO2023109069A1 true WO2023109069A1 (zh) 2023-06-22

Family

ID=80755359

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100676 WO2023109069A1 (zh) 2021-12-13 2022-06-23 一种图像检索方法及装置

Country Status (2)

Country Link
CN (1) CN114241222A (zh)
WO (1) WO2023109069A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241222A (zh) * 2021-12-13 2022-03-25 深圳前海微众银行股份有限公司 An image retrieval method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104850850A (zh) * 2015-04-05 2015-08-19 中国传媒大学 A binocular stereo vision image feature extraction method combining shape and color
CN107636727A (zh) * 2016-12-30 2018-01-26 深圳前海达闼云端智能科技有限公司 A target detection method and apparatus
CN108829692A (zh) * 2018-04-09 2018-11-16 华中科技大学 A flower image retrieval method based on a convolutional neural network
CN111339343A (zh) * 2020-02-12 2020-06-26 腾讯科技(深圳)有限公司 Image retrieval method, apparatus, storage medium, and device
CN114241222A (zh) * 2021-12-13 2022-03-25 深圳前海微众银行股份有限公司 An image retrieval method and apparatus


Also Published As

Publication number Publication date
CN114241222A (zh) 2022-03-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22905835

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE