CN111373393B - Image retrieval method and device and image library generation method and device

Info

Publication number: CN111373393B
Application number: CN201780097137.5A
Authority: CN
Inventors: 付宇新; 温丰; 薛常亮
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2017-11-24
Filing date: 2017-11-24
Publication date: 2022-05-31
Anticipated expiration: 2037-11-24
Also published as: CN111373393A; WO2019100348A1

Abstract

An image retrieval method and device and an image library generation method and device are provided, wherein the image retrieval method comprises the following steps: acquiring a plurality of visual words of an image to be retrieved and image category information of the image to be retrieved; determining visual stop words corresponding to the image category of the image to be retrieved according to the image category information of the image to be retrieved and a stop word library, wherein the visual stop words corresponding to the image category of the image to be retrieved comprise visual words irrelevant to the image category of the image to be retrieved; removing the visual stop words corresponding to the image category of the image to be retrieved from the plurality of visual words of the image to be retrieved to obtain target visual words of the image to be retrieved (S130); and determining a retrieval result according to the target visual words of the image to be retrieved and a retrieval image library, wherein the retrieval image library comprises a plurality of retrieval images (S140). The efficiency and accuracy of image retrieval can be improved.

Description

Image retrieval method and device and image library generation method and device

Technical Field

The present application relates to the field of image retrieval technologies, and in particular, to an image retrieval method and apparatus and an image library generation method and apparatus in the field of image retrieval technologies.

Background

A bag of visual words (BoVW) model is widely used in the field of image retrieval, and includes a plurality of visual words obtained by clustering a plurality of visual feature descriptors extracted from a plurality of images, each of the plurality of visual words being a cluster center.

In the existing image retrieval process, a plurality of visual feature descriptors of an image to be retrieved are first obtained. The visual feature descriptors are matched against and mapped to the visual words in the bag of visual words model to obtain a plurality of visual words of the image to be retrieved, and these visual words are used to represent the image to be retrieved. The similarity between the image to be retrieved and each retrieval image in a retrieval image library is then calculated according to the visual words of the image to be retrieved, and at least one retrieval image in the retrieval image library having the highest similarity to the image to be retrieved is output as the image retrieval result.

However, when the content of the image to be retrieved is cluttered or the amount of information contained in the image to be retrieved is large, the number of the plurality of visual words of the image to be retrieved is large, so that the efficiency and the accuracy are low when the image retrieval is performed.

Disclosure of Invention

The application provides an image retrieval method and device and an image processing method and device, which are beneficial to improving the efficiency and the accuracy of image retrieval.

In a first aspect, the present application provides an image retrieval method, including:

acquiring a plurality of visual words of an image to be retrieved and image category information of the image to be retrieved, wherein the visual words of the image to be retrieved are obtained by matching and mapping a plurality of visual feature descriptors of the image to be retrieved and visual words in a visual word bag model, and the image category information of the image to be retrieved is used for indicating the image category of the image to be retrieved;

determining visual stop words corresponding to the image category of the image to be retrieved according to the image category information of the image to be retrieved and a stop word library, wherein the visual stop words corresponding to the image category of the image to be retrieved comprise visual words irrelevant to the image category of the image to be retrieved, and the stop word library comprises a mapping relation between the image category of the image to be retrieved and the visual stop words corresponding to the image category of the image to be retrieved;

removing visual stop words corresponding to the image category of the image to be retrieved from the plurality of visual words of the image to be retrieved to obtain target visual words of the image to be retrieved;

and determining a retrieval result according to the target visual words of the image to be retrieved and a retrieval image library, wherein the retrieval image library comprises a plurality of retrieval images.

In the image retrieval method provided by the embodiment of the application, the target visual words of the image to be retrieved are obtained by removing the visual stop words corresponding to the image category of the image to be retrieved from the plurality of visual words of the image to be retrieved. That is, visual words that have no significant effect on identifying the image to be retrieved, or that interfere with image identification, are removed, so that the remaining target visual words have a relatively significant effect on identifying the image to be retrieved. Therefore, performing retrieval with the target visual words of the image to be retrieved and the retrieval image library helps improve the efficiency and the accuracy of image retrieval.
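The stop word removal step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: visual words are represented as integer IDs, the stop word library is a plain dictionary from image category to a set of stop word IDs, and the names `STOP_WORD_LIBRARY` and `target_visual_words` and all sample values are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical stop word library: image category -> visual stop word IDs.
STOP_WORD_LIBRARY: Dict[str, Set[int]] = {
    "forest": {3, 7},
    "rainy": {11},
}


def target_visual_words(visual_words: List[int], category: str,
                        stop_word_library: Dict[str, Set[int]]) -> List[int]:
    """Remove the visual stop words mapped to the image's category."""
    stop_words = stop_word_library.get(category, set())
    return [w for w in visual_words if w not in stop_words]


# Visual words of a hypothetical image to be retrieved, categorized as "forest".
query_words = [1, 3, 3, 5, 7, 9]
query_targets = target_visual_words(query_words, "forest", STOP_WORD_LIBRARY)
print(query_targets)
```

In this sketch a category that has no entry in the library simply leaves the visual words unchanged; how an unknown category should be handled is not specified in the text.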

In a possible implementation manner, the retrieval image library further includes a target visual word corresponding to each retrieval image in the plurality of retrieval images, where the target visual word corresponding to each retrieval image is obtained by removing the visual stop word corresponding to the image category of each retrieval image from the plurality of visual words corresponding to each retrieval image.

In the image retrieval method provided by the embodiment of the application, the target visual words of the retrieval images stored in the retrieval image library are obtained by removing the visual stop words corresponding to the image categories of the retrieval images from the plurality of visual words of the retrieval images, which helps reduce the memory occupied by the retrieval image library.

In addition, the retrieval result is determined according to the similarity between the target visual words of the images to be retrieved and the target visual words of the retrieved images in the retrieval image library, and the efficiency and the accuracy of image retrieval are improved.

It should be understood that the image retrieval apparatus may be a first device having computing and storing functions, the first device may be a computer, for example, or the image retrieval apparatus may be a functional module in the first device, which is not limited in this embodiment of the present application.

It should also be understood that the visual feature points of an image in the embodiment of the present application refer to pixel points of the image that remain consistent under transformations such as scaling, rotation, translation, and change of view angle, that is, the pixel points that are most easily identified in the image, for example, corner points or edge points with rich texture. The quality of the visual feature points of the image directly influences the efficiency and the precision of image retrieval.

Optionally, the type of the visual feature point of the image may include Scale-Invariant Feature Transform (SIFT), Oriented FAST and Rotated BRIEF (ORB), Speeded-Up Robust Features (SURF), Features from Accelerated Segment Test (FAST), and so on, which is not limited in this embodiment of the present application.

Optionally, the number of the visual feature points of the image may be one or more, which is not limited in this application.

It should also be understood that the visual feature descriptor of an image in the embodiment of the present application refers to a mathematical description of a visual feature point of the image.

Taking ORB as an example, the main steps of acquiring a visual feature descriptor of an image include: randomly selecting a plurality of pixel pairs near a visual feature point of the image, and obtaining a code of 0 or 1 for each pixel pair by comparing the intensities of its two pixels; and using the orientation information of the visual feature point to apply a rotation, so as to obtain a robust binary-vector visual feature descriptor.
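The two ORB steps described above (binary intensity tests on random pixel pairs, and rotation by the feature point's orientation) can be illustrated with a simplified sketch. It is not the actual ORB algorithm: the sampling pattern, the clamping at image borders, the absence of smoothing, and the helper names `sample_pairs`, `rotate`, and `binary_descriptor` are all assumptions made for illustration, and the orientation angle is taken as a given input.

```python
import math
import random
from typing import List, Tuple

Offset = Tuple[int, int]
Image = List[List[int]]  # grayscale image stored as rows of intensities


def sample_pairs(n_pairs: int, radius: int, seed: int = 0) -> List[Tuple[Offset, Offset]]:
    """Randomly choose pixel-pair offsets around a feature point."""
    rng = random.Random(seed)

    def offset() -> Offset:
        return (rng.randint(-radius, radius), rng.randint(-radius, radius))

    return [(offset(), offset()) for _ in range(n_pairs)]


def rotate(offset: Offset, angle: float) -> Offset:
    """Rotate an offset by the feature point's orientation angle (radians)."""
    dx, dy = offset
    c, s = math.cos(angle), math.sin(angle)
    return (round(c * dx - s * dy), round(s * dx + c * dy))


def binary_descriptor(image: Image, keypoint: Tuple[int, int], angle: float,
                      pairs: List[Tuple[Offset, Offset]]) -> List[int]:
    """One bit per pixel pair: 1 if the first pixel is darker than the second."""
    kx, ky = keypoint
    height, width = len(image), len(image[0])

    def intensity(off: Offset) -> int:
        dx, dy = rotate(off, angle)
        x = min(max(kx + dx, 0), width - 1)   # clamp to the image border
        y = min(max(ky + dy, 0), height - 1)
        return image[y][x]

    return [1 if intensity(a) < intensity(b) else 0 for a, b in pairs]


# Example: a 21x21 horizontal gradient with the feature point at the center.
gradient = [[x for x in range(21)] for _ in range(21)]
pairs = sample_pairs(n_pairs=32, radius=4)
descriptor = binary_descriptor(gradient, (10, 10), 0.0, pairs)
```

Rotating the sampling offsets by the orientation angle, rather than rotating the image, is one common way to make the bit string less sensitive to in-plane rotation.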

Optionally, the number of the visual feature descriptors of the image may be one or more, which is not limited in the embodiments of the present application.

It should be further understood that the visual bag of words model in the embodiment of the present application includes a plurality of visual words, and each of the plurality of visual words is a clustering center obtained by clustering visual feature descriptors extracted from a plurality of images.
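As an illustration of how cluster centers can serve as visual words, the following sketch clusters descriptors with a plain k-means loop. The text does not name a clustering algorithm, so k-means, the Euclidean distance, real-valued descriptors, and the function name `build_vocabulary` are assumptions; binary descriptors such as ORB would normally call for a Hamming-based variant.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(centers: List[List[float]], v: Vector) -> int:
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda k: squared_distance(centers[k], v))


def build_vocabulary(descriptors: List[Vector], n_words: int,
                     n_iter: int = 20, seed: int = 0) -> List[List[float]]:
    """Cluster descriptors; each returned cluster center is one visual word."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(list(descriptors), n_words)]
    for _ in range(n_iter):
        clusters: List[List[Vector]] = [[] for _ in range(n_words)]
        for d in descriptors:
            clusters[nearest(centers, d)].append(d)
        for k, members in enumerate(clusters):
            if members:  # keep the old center when a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


# Hypothetical 2-D descriptors forming two well-separated groups.
descriptors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocabulary = build_vocabulary(descriptors, n_words=2)
```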

It should also be understood that the visual words of an image in the embodiment of the present application are obtained by matching the visual feature descriptors of the image against the visual words in the visual bag-of-words model, and refer to the visual words in the visual bag-of-words model that are closest to those visual feature descriptors.
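The matching and mapping described above amounts to a nearest-neighbor lookup, which can be sketched as follows. The squared Euclidean distance, the function name `quantize`, and the sample vocabulary are assumptions for illustration; the text only requires that the closest visual word be selected.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def quantize(descriptors: List[Vector], vocabulary: List[Vector]) -> List[int]:
    """Map each descriptor to the index of its closest visual word."""
    return [
        min(range(len(vocabulary)), key=lambda k: squared_distance(vocabulary[k], d))
        for d in descriptors
    ]


# Hypothetical vocabulary of three visual words and four descriptors of one image.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
image_descriptors = [(1, 1), (9, 8), (1, 9), (0, 2)]
image_words = quantize(image_descriptors, vocabulary)
print(image_words)
```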

Optionally, the number of the visual words of the image may be one or more, which is not limited in the embodiments of the present application.

It should also be understood that, in the embodiment of the present application, the plurality of images are classified according to different classification methods, and the image category of each image can be obtained.

As an alternative embodiment, if the images are classified by scene, the image categories of the images may include forest scenes, suburban scenes, indoor scenes, and the like.

As another alternative embodiment, if the images are classified by weather, the image categories of the images may include sunny days, rainy days, snowy days, and the like.

It should also be understood that, since different visual words appearing in the same image may have different effects on identifying the image, and the same visual word appearing in different images may have the same effect on identifying the two images, the visual stop word corresponding to an image category in the embodiment of the present application refers to a visual word that has no significant effect on identifying an image of a certain image category or affects image identification, that is, a visual word unrelated to an image of the image category.

It should be understood that the visual stop words described in the embodiments of the present application, which are not related to a certain image category, refer to visual words whose relevance to the image of the certain image category is lower than a preset threshold.

Optionally, the visual stop word corresponding to the image category may include one or more visual words, which is not limited in this application.

For example, in forest scenes and suburban scenes, almost every image contains a large number of trees, and the feature points extracted from the trees contribute little to recognizing whether an image is a forest scene or a suburban scene. Therefore, the visual words derived from trees may be visual stop words of the forest category or the suburban category.

For example, in rainy weather, traces of raindrops appear in an image, and the plurality of visual words of the image are contaminated by the feature points extracted from the rain in the image.

It should also be understood that the target visual word of an image in the embodiments of the present application includes a visual word excluding the visual stop word corresponding to the image category of the image from the plurality of visual words of the image.

Optionally, the target visual word of the image may include one or more visual words, which is not limited in this application.

It should also be understood that a positive sample image set in the embodiment of the present application includes manually labeled images that can be regarded as highly similar or identical.

For example, two images of the same object are taken in different scenes, for example, a school in rainy days and a school in snowy days.

For example, two images of the same scene are taken at different times, such as the current pose and the historical pose of the same scene in loop detection.

In a possible implementation manner, before determining the visual stop word corresponding to the image category of the image to be retrieved according to the image category information of the image to be retrieved and the stop word library, the method further includes: acquiring a plurality of visual words of each training image in a training image library, image category information of each training image and positive sample image set information, wherein the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with the visual words in the visual word bag model, the image category information of each training image is used for indicating the image category of each training image, the positive sample image set information is used for indicating at least one positive sample image set, and the positive sample image set comprises a plurality of manually labeled similar training images in the training image library; and generating the stop word library according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information.

In one possible implementation, the generating the stop word lexicon according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information includes: determining the correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information, wherein the plurality of image categories of the training image library comprise the image category of each training image, and the plurality of visual words of the training image library comprise the plurality of visual words of each training image; and generating the stop word library according to the correlation between the multiple image categories of the training image library and the multiple visual words of the training image library.

As an alternative embodiment, for each image category corresponding to the training image library, the image retrieval apparatus may use, as the visual stop word corresponding to that image category, at least one visual word that has the smallest correlation with that image category among the plurality of visual words corresponding to the training image library.

As another alternative embodiment, for each image category corresponding to the training image library, the image retrieval apparatus may use, as the visual stop word corresponding to that image category, at least one visual word whose correlation with that image category is smaller than a first preset threshold among the plurality of visual words corresponding to the training image library.

Optionally, the visual stop word corresponding to each image category may include one or more visual words, which is not limited in this application.

Optionally, the stop word library may include a mapping relationship between each image category and the visual stop word corresponding to each image category.
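The two selection strategies above (smallest correlation, or correlation below a first preset threshold) can be sketched as follows, assuming the correlations have already been computed. The nested-dictionary layout, the function names, and all numeric values are hypothetical.

```python
from typing import Dict, Set

# category -> visual word ID -> correlation between the word and the category
Correlation = Dict[str, Dict[int, float]]


def stop_words_k_smallest(correlation: Correlation, k: int) -> Dict[str, Set[int]]:
    """For each category, take the k visual words with the smallest correlation."""
    return {
        category: set(sorted(scores, key=scores.get)[:k])
        for category, scores in correlation.items()
    }


def stop_words_below(correlation: Correlation, threshold: float) -> Dict[str, Set[int]]:
    """For each category, take every visual word whose correlation is below the threshold."""
    return {
        category: {word for word, r in scores.items() if r < threshold}
        for category, scores in correlation.items()
    }


# Hypothetical correlation values for two image categories.
correlation = {
    "forest": {1: 0.62, 3: 0.05, 5: 0.48, 7: 0.02},
    "rainy": {1: 0.30, 3: 0.55, 5: 0.41, 11: 0.01},
}
library_k = stop_words_k_smallest(correlation, k=1)
library_t = stop_words_below(correlation, threshold=0.1)
```

Either result is a mapping from image category to visual stop words, which is the form of stop word library described above.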

It should be understood that different visual words appearing in the same image may have different effects on recognizing the image, and the same visual word appearing in different images may have the same effect on recognizing those images.

In one possible implementation, the determining, according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information, the correlation between the plurality of image categories of the training image library and the plurality of visual words of the training image library includes: determining the correlation between a first image category and a first visual word according to the probability of occurrence of a first event, the probability of occurrence of a second event and the probability of occurrence of the first event and the second event simultaneously, wherein multiple image categories of the training image library comprise the first image category, multiple visual words of the training image library comprise the first visual word, the first event represents that the multiple visual words of a first training image in the training image library and the multiple visual words of a second training image in the training image library both comprise the first visual word, the image category of the first training image is the first image category, and the second event represents that the first training image and the second training image belong to the same positive sample image set.
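The first and second events can be counted over pairs of training images, as in the following sketch. It assumes that pairs are ordered (a first image of the given category paired with every other training image), that each image belongs to at most one positive sample image set, and that images are identified by string IDs; the function name `count_events` and the sample data are hypothetical.

```python
from typing import Dict, Set, Tuple


def count_events(word: int, category: str,
                 words_of: Dict[str, Set[int]],
                 category_of: Dict[str, str],
                 positive_set_of: Dict[str, int]) -> Tuple[int, int, int, int]:
    """Return (count_x, count_y, count_xy, n_pairs) for one word and one category."""
    count_x = count_y = count_xy = n_pairs = 0
    for first, first_category in category_of.items():
        if first_category != category:
            continue  # the first image must belong to the given category
        for second in words_of:
            if second == first:
                continue
            n_pairs += 1
            # First event: both images contain the visual word.
            x = word in words_of[first] and word in words_of[second]
            # Second event: both images are in the same positive sample image set.
            y = first in positive_set_of and positive_set_of[first] == positive_set_of.get(second)
            count_x += x
            count_y += y
            count_xy += x and y
    return count_x, count_y, count_xy, n_pairs


# Hypothetical training library with P = 4 images, N = 3 of them in "forest".
words_of = {"a": {1, 2}, "b": {1, 3}, "c": {2, 3}, "d": {1, 2}}
category_of = {"a": "forest", "b": "forest", "c": "suburb", "d": "forest"}
positive_set_of = {"a": 0, "b": 0, "c": 1, "d": 1}
counts = count_events(1, "forest", words_of, category_of, positive_set_of)
print(counts)
```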

For example, assuming that there are P training images, M image classes, and L visual words in the training image set, and there are N training images in the first class in the P training images, the correlation between the first visual word and the training image in the first image class can be determined by equations (1) to (6):

wherein x represents the first event, namely that the plurality of visual words of a first training image among the N training images and the plurality of visual words of a second training image among the P training images other than the first training image both include a first visual word among the L visual words; x_i represents the first event occurring for the i-th image among the training images of the first class; y represents the second event, namely that the first training image and the second training image belong to the same positive sample image set; y_i represents the second event occurring for the i-th image among the training images of the first class; count(x_i) represents the number of times the first event occurs for the i-th image among the training images of the first class; count(y) represents the number of times the second event occurs; count(y_i) represents the number of times the second event occurs for the i-th image among the training images of the first class; count(x_i, y_i) represents the number of times the first event and the second event occur simultaneously for the i-th image among the training images of the first class; p(x) represents the probability of occurrence of the first event; p(y) represents the probability of occurrence of the second event; p(x, y) represents the probability of the first event and the second event occurring simultaneously; PMI(x, y) represents the pointwise mutual information of the first event and the second event; H(y) represents the information entropy of the second event; RATE_PMI(x, y) represents the pointwise mutual information rate of the first event and the second event, that is, the correlation between the first visual word and the first image category; and P, L, M, and N are positive integers greater than 1.
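A possible computation of the correlation from the event counts is sketched below. Only the standard definition PMI(x, y) = log(p(x, y) / (p(x) p(y))) is relied on as established; the rest is assumed for illustration because the exact forms of equations (1) to (6) are not reproduced in this text. In particular, the probabilities are estimated as counts divided by the number of image pairs considered, the rate is taken as PMI(x, y) divided by H(y), and H(y) is stood in for by -log p(y). The function name `pmi_rate` is hypothetical.

```python
import math


def pmi_rate(count_x: int, count_y: int, count_xy: int, n_pairs: int) -> float:
    """Assumed form of RATE_PMI(x, y) = PMI(x, y) / H(y)."""
    if not (0 < count_x and 0 < count_y < n_pairs):
        raise ValueError("degenerate counts: the correlation is undefined")
    if count_xy == 0:
        return float("-inf")  # the two events never occur together
    p_x = count_x / n_pairs    # assumed estimate of p(x)
    p_y = count_y / n_pairs    # assumed estimate of p(y)
    p_xy = count_xy / n_pairs  # assumed estimate of p(x, y)
    pmi = math.log(p_xy / (p_x * p_y))
    h_y = -math.log(p_y)       # assumption standing in for H(y)
    return pmi / h_y


# Hypothetical counts: x occurs in 4 of 8 pairs, y in 2, both together in 2.
rate = pmi_rate(count_x=4, count_y=2, count_xy=2, n_pairs=8)
```

Under these assumptions the rate is 0 when the two events are statistically independent and grows as sharing the visual word becomes more predictive of belonging to the same positive sample image set, so visual words with small values are candidates for visual stop words.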

Alternatively, the image retrieval device may generate the retrieval image library by itself, or may obtain the retrieval image library from the image library generation device, which is not limited in this embodiment of the application.

Optionally, the retrieval image library may be obtained by training on the plurality of training images, on historical images to be retrieved that the image retrieval apparatus retrieved before the current retrieval, or on other images, which is not limited in this embodiment of the present application.

As an alternative embodiment, the image retrieval apparatus may generate the retrieval image library based on the plurality of visual words of each of the plurality of training images, the image category information of each of the plurality of training images, and the stop word library. That is, the retrieval image library is obtained by training on the plurality of training images.

Specifically, the image retrieval apparatus may determine the visual stop word corresponding to the image category of each training image according to the image category information of each training image and the stop word library, remove the visual stop word corresponding to the image category of each training image from the plurality of visual words of each training image, obtain the target visual word of each training image, and add the target visual word of each training image to the retrieval image library.
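Generating the retrieval image library in this way can be sketched as follows. The dictionary layout (image ID mapped to a category and a list of visual word IDs), the function name `build_retrieval_library`, and the sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def build_retrieval_library(
    training_images: Dict[str, Tuple[str, List[int]]],
    stop_word_library: Dict[str, Set[int]],
) -> Dict[str, List[int]]:
    """Store only the target visual words of each training image."""
    library: Dict[str, List[int]] = {}
    for image_id, (category, words) in training_images.items():
        stop_words = stop_word_library.get(category, set())
        library[image_id] = [w for w in words if w not in stop_words]
    return library


# Hypothetical training images: image ID -> (image category, visual word IDs).
training_images = {
    "img_a": ("forest", [1, 3, 5, 7, 7]),
    "img_b": ("rainy", [2, 11, 11, 4]),
}
stop_word_library = {"forest": {3, 7}, "rainy": {11}}
retrieval_library = build_retrieval_library(training_images, stop_word_library)

# Number of visual words stored with and without stop word removal.
stored = sum(len(words) for words in retrieval_library.values())
original = sum(len(words) for _, words in training_images.values())
```

In this toy example 4 visual words are stored instead of 9, which illustrates why removing stop words can reduce the memory occupied by the retrieval image library.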

Optionally, the image retrieval apparatus may further obtain the target visual word of each training image by using the stop word lexicon in the embodiment of the present application, or may obtain the target visual word of each training image by other manners, which is not limited in the embodiment of the present application.

For example, in e-commerce, the retrieval image library includes the target visual words of all commodity images provided by the user, and the image to be retrieved is an image of a commodity that the user wants to purchase.

As another alternative embodiment, the image retrieval apparatus may add the target visual words of the historical images to be retrieved before S140 to the retrieval image library to generate the retrieval image library.

For example, in loop detection, the retrieval image library includes all historical pose images, and the image to be retrieved is a current pose image.

According to the image retrieval method provided by the embodiment of the application, in a loop detection scenario, scenes that appeared historically are stored, the current image is used to retrieve and identify a loop, a constraint between the current pose and the historical pose is constructed, and the overall error is reduced through optimization to obtain a globally consistent map. In an e-commerce scenario, when the name of a commodity is not known, the user submits an image of the commodity, the system performs retrieval according to that image, and images with high similarity are returned as the retrieval result.

Optionally, the image retrieval device may select at least one retrieval image that is most similar to the image to be retrieved from the retrieval image library according to a plurality of ways to be output as a retrieval result, which is not limited in this embodiment of the application.

As an alternative embodiment, the image retrieval apparatus may calculate similarities between the plurality of visual words of the image to be retrieved and the plurality of visual words of the retrieval images in the retrieval image library, and determine at least one retrieval image having the highest similarity with the image to be retrieved as the retrieval result.

As another alternative, the image retrieval apparatus may calculate similarities between the plurality of visual words of the image to be retrieved and the plurality of visual words of the retrieval images in the retrieval image library, and determine at least one retrieval image with a similarity greater than a second preset threshold as a retrieval result.
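The two selection rules above (highest similarity, or similarity greater than a second preset threshold) can be sketched as follows. The text does not fix a similarity measure, so cosine similarity over visual word histograms is an assumption, as are the function names and the sample library.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Cosine similarity between two visual word histograms."""
    ha, hb = Counter(a), Counter(b)
    dot = sum(ha[w] * hb[w] for w in ha)
    norm = (math.sqrt(sum(v * v for v in ha.values()))
            * math.sqrt(sum(v * v for v in hb.values())))
    return dot / norm if norm else 0.0


def rank(query: Sequence[int], library: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    """Retrieval images sorted by decreasing similarity to the query."""
    scores = [(image_id, cosine_similarity(query, words))
              for image_id, words in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def top_k(query: Sequence[int], library: Dict[str, List[int]], k: int) -> List[str]:
    return [image_id for image_id, _ in rank(query, library)[:k]]


def above_threshold(query: Sequence[int], library: Dict[str, List[int]],
                    threshold: float) -> List[str]:
    return [image_id for image_id, score in rank(query, library) if score > threshold]


# Hypothetical visual words of the query and of three retrieval images.
query = [1, 5, 9]
library = {"img1": [1, 5, 9, 9], "img2": [2, 4], "img3": [1, 2]}
best = top_k(query, library, k=1)
similar = above_threshold(query, library, threshold=0.4)
```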

In addition, according to the similarity between the target visual words of the images to be retrieved and the target visual words of the retrieved images in the retrieved image library, at least one retrieved image similar to the images to be retrieved is determined to obtain a retrieval result, and the efficiency and the accuracy of image retrieval are improved.

In a second aspect, the present application provides an image processing method, comprising:

acquiring a plurality of visual words of each training image in a training image library, image category information of each training image and positive sample image set information, wherein the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with the visual words in a visual word bag model, the image category information of each training image is used for indicating the image category of each training image, the positive sample image set information is used for indicating at least one positive sample image set, and the positive sample image set comprises a plurality of manually labeled similar training images in the training image library;

and generating a stop word library according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information, wherein the stop word library comprises a mapping relation between the image category of each training image and the visual stop words corresponding to the image category of each training image, and the visual stop words corresponding to the image category of each training image comprise visual words unrelated to the image category of each training image.

According to the image processing method provided by the embodiment of the application, the plurality of visual words of each training image in the training image library, the image category information of each training image and the positive sample image set information are acquired, and the stop word library is generated according to them, which helps improve the efficiency and the accuracy of image retrieval.

It should be understood that the generating device of the image library may be a second device with computing and storing functions, and the second device may be a computer, for example, or the generating device of the image library may be a functional module in the second device, which is not limited in this embodiment of the present application.

Optionally, the second device may be the same device as the first device in the first aspect or a different device, which is not limited in this embodiment of the application.

Optionally, when the second device is the same as the first device, the generating device of the image library and the image retrieving device in the first aspect are different functional modules in the same device, or the generating device of the image library is a functional module in the image retrieving device.

In one possible implementation manner, the generating the stop word lexicon according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information includes: determining the correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library according to the plurality of visual words corresponding to each training image, the image category information of each training image and the positive sample image set information, wherein the plurality of image categories of the training image library comprise the image category of each training image, and the plurality of visual words of the training image library comprise the plurality of visual words of each training image; and generating the stop word library according to the correlation between the multiple image categories of the training image library and the multiple visual words of the training image library.

In a possible implementation manner, the determining, according to the plurality of visual words corresponding to each training image, the image category information of each training image, and the positive sample image set information, a correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library includes: determining the correlation between a first image category and a first visual word according to the probability of occurrence of a first event, the probability of occurrence of a second event and the probability of occurrence of the first event and the second event simultaneously, wherein multiple image categories of the training image library comprise the first image category, multiple visual words of the training image library comprise the first visual word, the first event represents that the multiple visual words of a first training image in the training image library and the multiple visual words of a second training image in the training image library both comprise the first visual word, the image category of the first training image is the first image category, and the second event represents that the first training image and the second training image belong to the same positive sample image set.

In one possible implementation, after the generating a stop word lexicon according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information, the method further includes: determining the visual stop words corresponding to the image category of each training image according to the image category information of each training image and the stop word library, removing the visual stop words corresponding to the image category of each training image from the plurality of visual words of each training image to obtain the target visual words of each training image, and adding the target visual words of each training image to the retrieval image library.

In one possible implementation, the obtaining a plurality of visual words of each training image in the training image library includes: obtaining each training image; extracting a plurality of visual feature descriptors of each training image, wherein the visual feature descriptors are used for describing a plurality of visual feature points of each training image, and the visual feature descriptors are in one-to-one correspondence with the visual feature points; obtaining a visual word bag model; and determining, for each visual feature descriptor, the visual word in the visual word bag model that is closest to that visual feature descriptor, the visual words so determined being the plurality of visual words of each training image.

In a third aspect, the present application provides an image retrieval apparatus configured to perform the method of the first aspect or any possible implementation manner of the first aspect.

In a fourth aspect, the present application provides an image processing apparatus configured to perform the method of the second aspect or any possible implementation manner of the second aspect.

In a fifth aspect, the present application provides an image retrieval apparatus, comprising: memory, a processor, a communication interface and a computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to perform the method of the first aspect or any possible implementation manner of the first aspect.

In a sixth aspect, the present application provides an image processing apparatus comprising: memory, processor, communication interface and computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to perform the method of the second aspect or any possible implementation of the second aspect.

In a seventh aspect, the present application provides a computer-readable medium for storing a computer program comprising instructions for performing the method of the first aspect or any possible implementation manner of the first aspect.

In an eighth aspect, the present application provides a computer readable medium for storing a computer program comprising instructions for performing the method of the second aspect or any possible implementation of the second aspect.

In a ninth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any possible implementation manner of the first aspect.

In a tenth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the second aspect described above or any possible implementation of the second aspect.

In an eleventh aspect, the present application provides a chip comprising: an input interface, an output interface, at least one processor, and a memory, wherein the input interface, the output interface, the processor and the memory are in communication with each other through an internal connection path, the processor is configured to execute code in the memory, and when the code is executed, the processor is configured to perform the method of the first aspect or any possible implementation manner of the first aspect.

In a twelfth aspect, the present application provides a chip comprising: an input interface, an output interface, at least one processor, a memory, the input interface, the output interface, the processor and the memory are in communication with each other through an internal connection path, the processor is configured to execute code in the memory, and when the code is executed, the processor is configured to perform the method in the second aspect or any possible implementation manner of the second aspect.

Drawings

Fig. 1 is a schematic flowchart of an image retrieval method of an embodiment of the present application;

Fig. 2 is a schematic flowchart of a method of generating an image library according to an embodiment of the present application;

Fig. 3 is a schematic block diagram of an image retrieval apparatus of an embodiment of the present application;

Fig. 4 is a schematic block diagram of an image library generation apparatus according to an embodiment of the present application;

Fig. 5 is a schematic block diagram of another image retrieval apparatus of an embodiment of the present application;

Fig. 6 is a schematic block diagram of another image library generation apparatus according to an embodiment of the present application.

Detailed Description

The technical solution in the present application will be described below with reference to the accompanying drawings.

For the sake of clarity, the terms used in this application are explained first.

1. Visual feature points of an image

The visual feature points of an image are pixels that remain consistent under transformations such as scaling, rotation, translation, and change of viewing angle, that is, the pixels that are most easily identified in the image, such as corner points or edge points with rich texture. The quality of the visual feature points of the image directly affects the efficiency and the precision of image retrieval.

Taking ORB as an example, the main steps of extracting FAST corner points from an image include: calculating the difference between the brightness of each pixel in the image and the brightness of the pixels in its neighborhood, where a larger difference means that the pixel is more likely to be a corner point; then retaining, through non-maximum suppression, only the corner point with the maximum response within a given area, which avoids the problem of clustered corner points; and adding descriptions of scale and rotation to address the weakness that FAST corners have neither orientation nor scale. Scale invariance is achieved by constructing an image pyramid, that is, down-sampling the image at different levels to obtain images with different resolutions. Rotation invariance is achieved by the gray-scale centroid method, that is, the direction vector from the geometric center of an image block to the centroid of its gray values is calculated and used as the description of the direction of the feature point.
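The gray-scale centroid step can be illustrated with a minimal sketch. It assumes a small square patch given as rows of gray values and uses the first-order image moments; the function name `patch_orientation` and the sample patches are hypothetical and are not part of the embodiment.

```python
import math

def patch_orientation(patch):
    """Return the gray-scale centroid orientation (radians) of a square patch.

    `patch` is a list of rows of gray values. Coordinates are measured from
    the geometric center of the patch, with y increasing downwards as is
    usual for image rows.
    """
    size = len(patch)
    center = (size - 1) / 2.0
    m10 = 0.0  # first-order moment along x
    m01 = 0.0  # first-order moment along y
    for row_idx, row in enumerate(patch):
        for col_idx, value in enumerate(row):
            m10 += (col_idx - center) * value
            m01 += (row_idx - center) * value
    # The vector from the geometric center to the centroid is
    # (m10 / m00, m01 / m00); its direction does not depend on m00.
    return math.atan2(m01, m10)

# A patch that is brighter on its right-hand side points along +x.
bright_right = [
    [0, 0, 9],
    [0, 0, 9],
    [0, 0, 9],
]
angle = patch_orientation(bright_right)
```

Rotating the patch content rotates the returned angle by the same amount, which is what allows a descriptor to be expressed relative to this direction.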

2. Visual feature descriptor for images

The visual feature descriptor of an image is a mathematical description of a visual feature point of the image.

3. Visual bag of words model

The visual bag of words model includes a plurality of visual words, each of which is a clustering center obtained by clustering visual feature descriptors extracted from a plurality of images.
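As a rough illustration of how cluster centers can serve as visual words, the following sketch runs a few iterations of k-means on toy two-dimensional descriptors. The choice of k-means, the squared Euclidean distance, the fixed initial centers, and the name `kmeans` are all assumptions; the embodiment does not name a clustering algorithm.

```python
def kmeans(descriptors, initial_centers, iterations=10):
    """Cluster descriptors and return the cluster centers (the visual words)."""
    centers = [list(center) for center in initial_centers]
    for _ in range(iterations):
        # Assignment step: attach each descriptor to its nearest center.
        clusters = [[] for _ in centers]
        for descriptor in descriptors:
            distances = [
                sum((a - b) ** 2 for a, b in zip(descriptor, center))
                for center in centers
            ]
            clusters[distances.index(min(distances))].append(descriptor)
        # Update step: move each center to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers

# Toy descriptors forming two obvious groups.
descriptors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
vocabulary = kmeans(descriptors, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

In practice the descriptors would be extracted from many images and the vocabulary would contain far more than two words.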

4. Visual words of images

The visual words of an image are obtained by matching and mapping the visual feature descriptors of the image to the visual words in the visual bag-of-words model, that is, they are the visual words in the model that are closest to the visual feature descriptors of the image.

5. Image category of an image

A plurality of images may be classified according to different classification methods, whereby the image category of each image is obtained.

6. Visual stop words corresponding to image categories

Different visual words appearing in the same image may have different effects on identifying that image, and the same visual words appearing in different images may have the same effect on identifying the two images. The visual stop words corresponding to an image category are therefore the visual words that have no significant effect on identifying images of that image category, or that interfere with identifying them, that is, visual words unrelated to the images of that image category.

It should be understood that, in the embodiments of the present application, the visual stop words that are unrelated to a certain image category refer to the visual words whose correlation with the images of that image category is lower than a preset threshold.

Optionally, the visual stop word corresponding to the image category may include one or more visual words, which is not limited in this embodiment of the application.

For example, in rainy weather, raindrops leave traces in an image, and the visual word representation of the image is contaminated by the feature points extracted from the rain in the image.

7. Target visual words of an image

The target visual words of an image are the visual words that remain after the visual stop words corresponding to the image category of the image are removed from the plurality of visual words of the image.

8. Positive sample image set

The manually labeled images in a positive sample image set can be regarded as images that are identical or highly similar.

The scenarios to which the embodiments of the present application apply include loop closure detection in simultaneous localization and mapping (SLAM), commodity image retrieval in electronic commerce, and the like.

Loop closure detection stores scenes that have appeared in the past, uses the current image to retrieve and identify loops, constructs a constraint between the current pose and a historical pose, and reduces the overall error through optimization to obtain a globally consistent map.

In electronic commerce, when the name of a commodity is not known, a user submits an image of the commodity, the system searches according to the image of the commodity, and images with high similarity are returned as the search result.

Fig. 1 shows a schematic flowchart of an image retrieval method 100 provided in an embodiment of the present application. The method may be performed by an image retrieval apparatus.

S110, acquiring a plurality of visual words of an image to be retrieved and image category information of the image to be retrieved, wherein the plurality of visual words of the image to be retrieved are obtained by matching and mapping a plurality of visual feature descriptors of the image to be retrieved and visual words in a visual word bag model, and the image category information of the image to be retrieved is used for indicating the image category of the image to be retrieved.

S120, determining the visual stop words corresponding to the image category of the image to be retrieved according to the image category information of the image to be retrieved and a stop word library, wherein the visual stop words corresponding to the image category of the image to be retrieved comprise visual words irrelevant to the image category of the image to be retrieved, and the stop word library comprises a mapping relation between the image category of the image to be retrieved and the visual stop words corresponding to the image category of the image to be retrieved.

S130, removing the visual stop words corresponding to the image category of the image to be retrieved from the plurality of visual words of the image to be retrieved to obtain the target visual words of the image to be retrieved.

S140, determining a retrieval result according to the target visual words of the image to be retrieved and a retrieval image library, wherein the retrieval image library comprises a plurality of retrieval images.
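Steps S120 to S140 can be sketched as follows, assuming that visual words are integer identifiers, that the stop word library is a mapping from a category name to a set of visual words, and that the retrieval image library already stores target visual words per retrieval image. The Jaccard overlap used for ranking is an assumed similarity measure, since the steps above do not fix one, and all names and sample values are hypothetical.

```python
def retrieve(query_words, query_category, stop_word_library,
             retrieval_library, top_k=1):
    """Return the ids of the top_k retrieval images for one query image."""
    # S120: look up the visual stop words for the query's image category.
    stop_words = stop_word_library.get(query_category, set())
    # S130: remove them to obtain the target visual words.
    target_words = set(query_words) - stop_words
    # S140: rank retrieval images by overlap with the target visual words
    # (Jaccard overlap is an assumption made for this sketch).
    scored = []
    for image_id, library_words in retrieval_library.items():
        union = target_words | library_words
        score = len(target_words & library_words) / len(union) if union else 0.0
        scored.append((score, image_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [image_id for _, image_id in scored[:top_k]]

# Hypothetical data: words 7 and 8 are stop words for the "rainy" category.
stop_word_library = {"rainy": {7, 8}}
retrieval_library = {"img_a": {1, 2, 3}, "img_b": {7, 8, 9}}
result = retrieve([1, 2, 7, 8], "rainy", stop_word_library, retrieval_library)
```

Without the removal in S130, the query would overlap `img_a` and `img_b` equally in this toy data; after removal only `img_a` matches.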

Optionally, in S110, the image retrieving device may obtain a plurality of visual words of the image to be retrieved in a plurality of ways, which is not limited in this embodiment of the present application.

As an optional embodiment, the image retrieval apparatus may obtain an image to be retrieved, extract a plurality of visual feature descriptors of the image to be retrieved, where the plurality of visual feature descriptors are used to describe a plurality of visual feature points of the image to be retrieved, the plurality of visual feature descriptors are in one-to-one correspondence with the plurality of visual feature points, obtain a visual bag-of-words model, and determine a plurality of visual words in the visual bag-of-words model that are closest to each visual feature descriptor in the plurality of visual feature descriptors as the plurality of visual words of the image to be retrieved.
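The mapping from visual feature descriptors to their closest visual words might look like the sketch below. It assumes binary descriptors (as ORB produces) packed into Python integers and uses the Hamming distance as the notion of "closest"; the 8-bit toy values and the names `hamming` and `quantize` are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def quantize(descriptors, vocabulary):
    """Map each descriptor to the index of its nearest visual word."""
    words = []
    for descriptor in descriptors:
        distances = [hamming(descriptor, word) for word in vocabulary]
        words.append(distances.index(min(distances)))
    return words

# Toy 8-bit vocabulary and descriptors (real ORB descriptors are 256 bits).
vocabulary = [0b00000000, 0b11110000, 0b00001111]
descriptors = [0b00010000, 0b11100000, 0b00000111]
words = quantize(descriptors, vocabulary)
```

A linear scan over the vocabulary is used here for clarity; large vocabularies are usually organized so that the nearest word can be found without comparing against every word.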

Optionally, the visual bag-of-words model may be an existing trained visual bag-of-words model, or may be obtained by clustering visual feature descriptors of training images in a training image set by the image retrieval apparatus, which is not limited in this application.

Optionally, the image retrieval device may obtain the image to be retrieved in various ways, for example, by shooting with a camera, reading with a local disk, downloading over a network, or other ways, which is not limited in this embodiment of the present application.

Optionally, the image to be retrieved acquired by the image retrieval apparatus may be an image that has undergone distortion correction, denoising, or other preprocessing operations, which is not limited in this embodiment of the application.

Optionally, in S110, the image retrieving device may obtain the image category information of the image to be retrieved in a plurality of ways, which is not limited in this embodiment of the application.

As an alternative embodiment, the image retrieving apparatus may determine the image category information of the image to be retrieved according to the image to be retrieved and an image classification model, where the image classification model includes a mapping relationship between the image to be retrieved and the image category of the image to be retrieved.

As another alternative, the image retrieving apparatus may determine the image category information of the image to be retrieved according to the image to be retrieved and a preset classification algorithm.

As still another alternative embodiment, the image retrieval apparatus may acquire the image category information of the image to be retrieved, which is manually labeled.

Optionally, the image category information of the image to be retrieved may be one or more bits, that is, the image category of the image to be retrieved is indicated by the 1 or more bits, which is not limited in this embodiment of the application.

As an alternative embodiment, the image category information of the image to be retrieved may be 2 bits, for example, when the 2 bits are "00", the image to be retrieved is indicated as the first type of image, when the 2 bits are "01", the image to be retrieved is indicated as the second type of image, when the 2 bits are "10", the image to be retrieved is indicated as the third type of image, and when the 2 bits are "11", the image to be retrieved is indicated as the fourth type of image.
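The 2-bit indication above amounts to a small lookup table. A minimal sketch, with the table name `CATEGORY_BY_BITS` and the function name `decode_category` chosen only for illustration:

```python
# Mapping from the 2-bit image category information to the image category,
# following the "00" / "01" / "10" / "11" example.
CATEGORY_BY_BITS = {
    "00": "first",
    "01": "second",
    "10": "third",
    "11": "fourth",
}

def decode_category(bits):
    """Return the image category indicated by a 2-bit string."""
    if bits not in CATEGORY_BY_BITS:
        raise ValueError("expected a 2-bit string such as '01', got %r" % (bits,))
    return CATEGORY_BY_BITS[bits]

category = decode_category("10")
```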

Optionally, before S120, the image retrieval apparatus may acquire the stop word library.

Optionally, the stop word library may include a mapping relationship between an identifier of each of a plurality of image categories and the visual stop words corresponding to the identifier of each image category.

Optionally, the image retrieval apparatus may generate the stop word library by itself, or may obtain the stop word library from a generating device of the image library, which is not limited in the embodiment of the present application.

Optionally, the image retrieval apparatus may obtain a plurality of visual words of each training image in a training image library, image category information of each training image, and positive sample image set information, where the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with the visual words in the visual bag-of-words model, the image category information of each training image is used to indicate an image category of each training image, the positive sample image set information is used to indicate at least one positive sample image set, and the positive sample image set includes a plurality of manually labeled training images in the training image library that are similar to each other; and generate a stop word library according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information, wherein the stop word library comprises a mapping relationship between the image category of each training image and the visual stop words corresponding to the image category of each training image, and the visual stop words corresponding to the image category of each training image comprise visual words unrelated to the image category of each training image.

Specifically, when generating the stop word library according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information, the image retrieval apparatus may determine, according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information, the correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library, where the plurality of image categories of the training image library include the image category of each training image, and the plurality of visual words of the training image library include the plurality of visual words of each training image; and generate the stop word library according to the correlation between the plurality of image categories of the training image library and the plurality of visual words of the training image library.

As an alternative embodiment, the image retrieval apparatus may determine a correlation between a first image category and a first visual word according to a probability of occurrence of a first event, a probability of occurrence of a second event, and a probability of occurrence of the first event and the second event simultaneously, where multiple image categories of the training image library include the first image category, multiple visual words of the training image library include the first visual word, the first event indicates that multiple visual words of a first training image in the training image library and multiple visual words of a second training image in the training image library each include the first visual word, the image category of the first training image is the first image category, and the second event indicates that the first training image and the second training image belong to the same positive sample image set.

wherein x represents a first event, the first event being that the plurality of visual words of a first training image among the N training images and the plurality of visual words of a second training image among the P training images other than the first training image both include a first visual word among the L visual words; x_i represents the first event occurring for the i-th image among the training images of the first class; y represents a second event, the second event being that the first training image and the second training image belong to the same positive sample image set; y_i represents the second event occurring for the i-th image among the training images of the first class; count(x_i) represents the number of times the first event occurs for the i-th image among the training images of the first class; count(y_i) represents the number of times the second event occurs for the i-th image among the training images of the first class; count(y) represents the number of times the second event occurs; count(x_i, y_i) represents the number of times the first event and the second event occur simultaneously for the i-th image among the training images of the first class; p(x) represents the probability that the first event occurs; p(y) represents the probability that the second event occurs; p(x, y) represents the probability that the first event and the second event occur simultaneously; PMI(x, y) represents the pointwise mutual information of the first event and the second event; H(y) represents the information entropy of the second event; and RATE_PMI(x, y) represents the pointwise mutual information rate of the first event and the second event, that is, the correlation between the first visual word and the first image category, where P, L, M, and N are positive integers greater than 1.
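A correlation of this kind might be computed along the following lines. The sketch uses the standard definition PMI(x, y) = log(p(x, y) / (p(x) p(y))); everything else is an assumption made for illustration and may differ from equations (1) to (6): probabilities are estimated as event counts divided by a total number of image pairs, H(y) is taken as the entropy of the binary event y, and the rate is taken as PMI(x, y) / H(y). The function name `rate_pmi` and its parameters are hypothetical.

```python
import math

def rate_pmi(count_x, count_y, count_xy, total_pairs):
    """Assumed PMI rate of events x and y estimated from pair counts.

    count_x     : pairs in which both images contain the visual word (event x)
    count_y     : pairs belonging to the same positive sample set (event y)
    count_xy    : pairs in which both events occur
    total_pairs : number of image pairs considered (assumed normalizer)
    """
    p_x = count_x / total_pairs
    p_y = count_y / total_pairs
    p_xy = count_xy / total_pairs
    if not 0.0 < p_y < 1.0:
        raise ValueError("p(y) must lie strictly between 0 and 1")
    if p_xy == 0.0:
        return float("-inf")  # the events never co-occur
    # Standard pointwise mutual information, in bits.
    pmi = math.log2(p_xy / (p_x * p_y))
    # Assumed normalizer: entropy of the binary event y.
    h_y = -(p_y * math.log2(p_y) + (1.0 - p_y) * math.log2(1.0 - p_y))
    return pmi / h_y

# A word whose co-occurrence is independent of set membership scores about 0,
# which would make it a candidate visual stop word.
independent = rate_pmi(count_x=20, count_y=10, count_xy=2, total_pairs=100)
informative = rate_pmi(count_x=20, count_y=10, count_xy=10, total_pairs=100)
```

Under these assumptions, a visual word that two images share regardless of whether they belong to the same positive sample image set receives a rate near zero, while a word shared mainly by images of the same set receives a larger rate.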

Alternatively, the image retrieval device may acquire the positive sample image set information in various ways, which is not limited in this embodiment of the application.

As an alternative embodiment, the image retrieving apparatus may obtain one or more bits carried in each training image, and obtain the positive sample image set information according to the one or more bits of each training image. For example, if a first training image and a second training image in the plurality of training images carry the same bits, it is determined that the first training image and the second training image belong to the same positive sample image set.

As another alternative embodiment, the image retrieval apparatus may obtain first information that includes a mapping relationship between an identifier of each of the plurality of positive sample image sets and an identifier of a training image included in each of the positive sample image sets, and the image retrieval apparatus may obtain the positive sample image set information based on the first information.
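Both ways of obtaining the positive sample image set information can be sketched with plain dictionaries. In this sketch the bits carried by each training image are represented as a string tag, and the first information is represented as a mapping from a set identifier to the identifiers of its member images; the names `sets_from_tags` and `same_positive_set` and the sample identifiers are hypothetical.

```python
from collections import defaultdict

def sets_from_tags(image_tags):
    """Group training image ids by the bits (tag) each image carries."""
    groups = defaultdict(set)
    for image_id, tag in image_tags.items():
        groups[tag].add(image_id)
    return dict(groups)

def same_positive_set(positive_sets, first_id, second_id):
    """True if both images are members of one positive sample image set."""
    return any(
        first_id in members and second_id in members
        for members in positive_sets.values()
    )

# Images carrying the same bits fall into the same positive sample set.
image_tags = {"img_1": "01", "img_2": "01", "img_3": "10"}
positive_sets = sets_from_tags(image_tags)

# A mapping of the second kind has the same shape and can be used directly.
first_information = {"set_a": {"img_4", "img_5"}}
```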

Optionally, the image retrieval device may determine the visual stop word corresponding to each image category from a plurality of visual words in the training image library in a plurality of ways, which is not limited in the embodiment of the present application.

As an alternative embodiment, the image retrieval apparatus may use at least one of the plurality of visual words of the training image library, which has the smallest correlation with each image category, as the visual stop word corresponding to each image category.

As another alternative embodiment, the image retrieval apparatus may use at least one of the plurality of visual words in the training image library, which has a correlation with each image category smaller than a first preset threshold, as the visual stop word corresponding to each image category.

Optionally, the visual stop word corresponding to each image category may be one or more visual words, which is not limited in this application.
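The two selection strategies can be sketched as follows, assuming that the correlations for one image category are held in a dictionary from visual word to correlation value. The function names, the threshold value, and the sample correlations are hypothetical.

```python
def stop_words_lowest(correlations, count=1):
    """Pick the `count` visual words with the smallest correlation."""
    ranked = sorted(correlations, key=lambda word: (correlations[word], word))
    return set(ranked[:count])

def stop_words_by_threshold(correlations, threshold):
    """Pick every visual word whose correlation is below the threshold."""
    return {word for word, value in correlations.items() if value < threshold}

# Hypothetical correlations between visual words and one image category.
correlations = {1: 0.9, 2: 0.05, 3: 0.4, 4: 0.01}
lowest = stop_words_lowest(correlations)
below = stop_words_by_threshold(correlations, threshold=0.1)
```

The first strategy fixes how many stop words are selected per category, whereas the second lets the number vary with the data.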

Optionally, before S140, the image retrieval apparatus may acquire the retrieval image library.

Optionally, the retrieval image library includes a plurality of retrieval images and the target visual words corresponding to each of the plurality of retrieval images, where the target visual words corresponding to each retrieval image are obtained by removing the visual stop words corresponding to the image category of that retrieval image from the plurality of visual words corresponding to that retrieval image.

Optionally, the retrieval image library may be obtained by training on the plurality of training images, on historical images to be retrieved that the image retrieval apparatus has previously retrieved, or on other images, which is not limited in this embodiment of the application.

For example, in electronic commerce, the retrieval image library includes target visual words of all product images provided by the user, and the retrieval image is a product image that the user wants to purchase.

In addition, the target visual words of the retrieval images stored in the retrieval image library are obtained by removing the visual stop words corresponding to the image categories of the retrieval images from the plurality of visual words of the retrieval images, which helps to reduce the memory occupied by the retrieval image library.

Fig. 2 shows a schematic flowchart of a method 200 for generating an image library provided in an embodiment of the present application. The method 200 may be performed by a generating device of the image library, which is not limited in this embodiment of the present application.

It should be understood that the image library generating device may be a second device with computing and storing functions, and the second device may be a computer, for example, or the image library generating device may be a functional module in the second device, which is not limited in this embodiment of the present application.

Optionally, the second device may be the same device as the first device described in fig. 1 or a different device, which is not limited in this application.

Optionally, when the second device is the same as the first device, the generating device of the image library and the image retrieval apparatus described in Fig. 1 are different functional modules in the same device, or the generating device of the image library is a functional module in the image retrieval apparatus.

S210, obtaining a plurality of visual words of each training image in a training image library, image category information of each training image and positive sample image set information, wherein the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with the visual words in the visual bag-of-words model, the image category information of each training image is used for indicating the image category of each training image, the positive sample image set information is used for indicating at least one positive sample image set, and the positive sample image set comprises a plurality of manually labeled training images in the training image library that are similar to each other.

S220, generating a stop word library according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information, wherein the stop word library comprises a mapping relation between the image category of each training image and the visual stop word corresponding to the image category of each image, and the visual stop word corresponding to the image category of each training image comprises visual words irrelevant to the image category of each training image.

According to the method for generating an image library provided in the embodiment of the present application, the plurality of visual words of each training image in the training image library, the image category information of each training image, and the positive sample image set information are acquired, and a stop word library is generated from them. The stop word library is used for obtaining the target visual words of an image to be retrieved, which helps to improve the efficiency and the accuracy of image retrieval.

Optionally, in S210, the generating device of the image library may obtain the plurality of visual words of the training image in a plurality of ways, which is not limited in this embodiment of the application.

As an optional embodiment, the generating device of the image library may obtain a training image, extract a plurality of visual feature descriptors of the training image, where the plurality of visual feature descriptors are used to describe a plurality of visual feature points of the training image and correspond to the plurality of visual feature points one to one, obtain a visual bag-of-words model, and determine the visual words in the visual bag-of-words model that are closest to each visual feature descriptor in the plurality of visual feature descriptors as the plurality of visual words of the training image.

Optionally, the visual bag-of-words model may also be an existing trained visual bag-of-words model, or may be obtained by clustering a plurality of visual feature descriptors corresponding to the plurality of images by the generation apparatus of the image library itself, which is not limited in this embodiment of the present application.

Optionally, the generating device of the image library may acquire the training image in various ways, for example, by shooting with a camera, reading with a local disk, downloading over a network, or other ways, which is not limited in this embodiment of the present application.

Optionally, the plurality of images obtained by the generating device of the image library may be images that have undergone distortion correction, denoising, or other preprocessing operations, which is not limited in this embodiment of the application.

Optionally, in S210, the generating device of the image library may acquire the image category information of the training image in various ways, which is not limited in this embodiment of the application.

As an alternative embodiment, the generating device of the image library may determine the image category information of the training image according to the training image and an image classification model, where the image classification model includes a mapping relationship between the training image and the image category of the training image.

As another alternative, the generating device of the image library may determine the image category information of the training image according to the training image and a preset classification algorithm.

As yet another alternative, the generating means of the image library may obtain image class information of the training image that is manually labeled.

Optionally, the image class information of the training image may be one or more bits, that is, the image class of the training image is indicated by the 1 or more bits, which is not limited in this embodiment of the application.

As an alternative embodiment, the image category information of the training image may be 2 bits. For example, when the 2 bits are "00", the training image is indicated as the first type of image; when the 2 bits are "01", the training image is indicated as the second type of image; when the 2 bits are "10", the training image is indicated as the third type of image; and when the 2 bits are "11", the training image is indicated as the fourth type of image.

Optionally, in S220, the generating device of the image library generates the stop word library according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information, and may determine correlations between a plurality of image categories of the training image library and a plurality of visual words of the training image library according to the plurality of visual words corresponding to each training image, the image category information of each training image, and the positive sample image set information, where the plurality of image categories of the training image library include the image category of each training image, and the plurality of visual words of the training image library include the plurality of visual words of each training image; and generating the stop word library according to the correlation between the multiple image categories of the training image library and the multiple visual words of the training image library.

As an alternative embodiment, the generating device of the image library may determine the correlation between a first image category and a first visual word according to the probability of occurrence of a first event, the probability of occurrence of a second event, and the probability of occurrence of the first event and the second event simultaneously, where multiple image categories of the training image library include the first image category, multiple visual words of the training image library include the first visual word, the first event indicates that multiple visual words of a first training image in the training image library and multiple visual words of a second training image in the training image library each include the first visual word, the image category of the first training image is the first image category, and the second event indicates that the first training image and the second training image belong to the same positive sample image set.

For example, assuming that the training image set has P training images, M image classes, and L visual words, and the training images of the first class in the P training images have N training images, the correlation between the first visual word and the training images of the first image class can be determined through the above equations (1) to (6).

Optionally, the generating device of the image library may obtain the positive sample image set information in a variety of ways, which is not limited in this embodiment of the application.

As an alternative embodiment, the generating device of the image library may obtain one or more bits carried in each training image, and obtain the positive sample image set information according to the one or more bits of each training image. For example, if a first training image and a second training image in the plurality of training images carry the same bits, it is determined that the first training image and the second training image belong to the same positive sample image set.

As another alternative embodiment, the generating device of the image library may obtain first information, where the first information includes a mapping relationship between an identifier of each positive sample image set in a plurality of positive sample image sets and an identifier of a training image included in each positive sample image set, and the generating device of the image library may obtain the positive sample image set information according to the first information.

Optionally, the generating device of the image library may determine the visual stop word corresponding to each image category from the plurality of visual words in the training image library in a variety of ways, which is not limited in this embodiment of the application.

As an alternative embodiment, the generating device of the image library may use at least one of the plurality of visual words of the training image library, which has the smallest correlation with each image category, as the visual stop word corresponding to each image category.

As another alternative embodiment, the generating device of the image library may use at least one of the plurality of visual words of the training image library, which has a correlation with each image category smaller than a first preset threshold, as the visual stop word corresponding to each image category.
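The two selection rules above can be sketched as follows, assuming the correlations for one image category are available as a dictionary from visual word to correlation value. The function names, the parameter `k`, and the dictionary layout are hypothetical.

```python
import heapq


def stop_words_smallest(correlations, k=1):
    """Select the k visual words with the smallest correlation."""
    return set(heapq.nsmallest(k, correlations, key=correlations.get))


def stop_words_below(correlations, first_preset_threshold):
    """Select every visual word whose correlation is below the threshold."""
    return {word for word, value in correlations.items()
            if value < first_preset_threshold}


def build_stop_word_library(correlations_by_category, first_preset_threshold):
    """Map each image category to its visual stop words (threshold rule)."""
    return {category: stop_words_below(correlations, first_preset_threshold)
            for category, correlations in correlations_by_category.items()}
```

The first rule always yields a fixed number of stop words per category, while the threshold rule lets that number vary with how many words are weakly correlated with the category.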

Optionally, after S220, the generating device of the image library may determine, according to the image category information of each training image and the stop word library, a visual stop word corresponding to the image category of each training image, remove the visual stop word corresponding to the image category of each training image from the plurality of visual words of each training image, obtain a target visual word of each training image, and add the target visual word of each training image to the search image library.
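A minimal sketch of this optional step is given below. It assumes the stop word library is a dictionary from image category to a set of visual stop words, and that each training image is given as a pair of its visual word list and its category; both layouts and the function names are hypothetical.

```python
def target_visual_words(words, category, stop_word_library):
    """Remove the visual stop words of `category` from `words`."""
    stop_words = stop_word_library.get(category, set())
    return [word for word in words if word not in stop_words]


def build_search_image_library(training_images, stop_word_library):
    """Map each image id to its target visual words.

    training_images: dict image id -> (list of visual words, category).
    """
    return {
        image_id: target_visual_words(words, category, stop_word_library)
        for image_id, (words, category) in training_images.items()
    }
```

Because the filter depends on the image category, the same visual word may be kept for one image and removed for another.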

The image retrieval method and the image library generation method provided by the embodiments of the present application are described in detail above with reference to fig. 1 and 2, and the image retrieval apparatus and the image library generation apparatus provided by the embodiments of the present application are described below with reference to fig. 3 to 6.

Fig. 3 is a schematic block diagram of an image retrieval apparatus 300 provided in an embodiment of the present application. The apparatus 300 comprises:

an obtaining unit 310, configured to obtain a plurality of visual words of an image to be retrieved and image category information of the image to be retrieved, where the plurality of visual words of the image to be retrieved are obtained by matching and mapping a plurality of visual feature descriptors of the image to be retrieved with the visual words in the visual bag-of-words model, and the image category information of the image to be retrieved is used to indicate an image category of the image to be retrieved;

a processing unit 320, configured to determine, according to a stop word library and the image category information of the image to be retrieved obtained by the obtaining unit 310, a visual stop word corresponding to the image category of the image to be retrieved, where the visual stop word corresponding to the image category of the image to be retrieved includes a visual word unrelated to the image category of the image to be retrieved, and the stop word library includes a mapping relationship between the image category of the image to be retrieved and the visual stop word corresponding to the image category of the image to be retrieved; and to remove the visual stop word corresponding to the image category of the image to be retrieved from the plurality of visual words of the image to be retrieved obtained by the obtaining unit 310, to obtain a target visual word of the image to be retrieved;

a retrieving unit 330, configured to determine a retrieval result according to a retrieval image library and the target visual word of the image to be retrieved obtained by the processing unit 320, where the retrieval image library includes a plurality of retrieval images.

Optionally, the search image library includes a mapping relationship between the plurality of search images and a target visual word corresponding to each of the plurality of search images, where the target visual word corresponding to each of the search images is obtained by removing a visual stop word corresponding to an image category of each of the search images from the plurality of visual words corresponding to each of the search images.

Optionally, the apparatus further includes a generating unit. The obtaining unit is further configured to obtain, before the visual stop word corresponding to the image category of the image to be retrieved is determined according to the image category information of the image to be retrieved and the stop word library, a plurality of visual words of each training image in a training image library, image category information of each training image, and positive sample image set information, where the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with the visual words in the visual bag-of-words model, the image category information of each training image is used to indicate the image category of each training image, the positive sample image set information is used to indicate at least one positive sample image set, and the positive sample image set includes a plurality of manually labeled similar training images in the training image library. The generating unit is configured to generate the stop word library according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information.

Optionally, the generating unit is specifically configured to: determining the correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information, wherein the plurality of image categories of the training image library comprise the image category of each training image, and the plurality of visual words of the training image library comprise the plurality of visual words of each training image; and generating the stop word library according to the correlation between the multiple image categories of the training image library and the multiple visual words of the training image library.

Optionally, the generating unit is specifically configured to: determining the correlation between a first image category and a first visual word according to the probability of occurrence of a first event, the probability of occurrence of a second event and the probability of occurrence of the first event and the second event simultaneously, wherein multiple image categories of the training image library comprise the first image category, multiple visual words of the training image library comprise the first visual word, the first event represents that the multiple visual words of a first training image in the training image library and the multiple visual words of a second training image in the training image library both comprise the first visual word, the image category of the first training image is the first image category, and the second event represents that the first training image and the second training image belong to the same positive sample image set.

Optionally, the retrieving unit is specifically configured to: determine the similarity between the target visual words of the image to be retrieved and the target visual words of the retrieval images in the retrieval image library; and determine, as the retrieval result, at least one retrieval image whose target visual words have a similarity with the target visual words of the image to be retrieved that is greater than a first preset value.
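The similarity measure is not fixed by this passage. The sketch below uses Jaccard similarity over sets of target visual words purely as an assumed example, and keeps every retrieval image whose similarity exceeds the first preset value. The function names and the library layout (image id mapped to a list of target visual words) are hypothetical.

```python
def jaccard(words_a, words_b):
    """Assumed similarity: size of intersection over size of union."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve(query_target_words, retrieval_library, first_preset_value):
    """Return ids of retrieval images whose similarity to the query
    exceeds the first preset value, most similar first."""
    hits = []
    for image_id, target_words in retrieval_library.items():
        score = jaccard(query_target_words, target_words)
        if score > first_preset_value:
            hits.append((score, image_id))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [image_id for _, image_id in hits]
```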

It should be understood that the image retrieval apparatus 300 herein is embodied in the form of a functional unit. The term "unit" herein may refer to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality. In an alternative example, as can be understood by those skilled in the art, the image retrieval apparatus 300 may be embodied as an image retrieval apparatus in the above method 100 and method 200 embodiments, and the image retrieval apparatus 300 may be configured to perform each flow and/or step corresponding to the image retrieval apparatus in the above method 100 and method 200 embodiments, and is not described herein again to avoid repetition.

Fig. 4 shows a schematic block diagram of an apparatus 400 for generating an image library provided in an embodiment of the present application, where the apparatus 400 includes:

an obtaining unit 410, configured to obtain a plurality of visual words of each training image in a training image library, image category information of each training image, and positive sample image set information, where the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with visual words in the visual bag-of-words model, the image category information of each training image is used to indicate an image category of each training image, the positive sample image set information is used to indicate at least one positive sample image set, and the positive sample image set includes a plurality of artificially labeled similar training images in the training image library;

a generating unit 420, configured to generate a stop word library according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information obtained by the obtaining unit 410, where the stop word library includes a mapping relationship between an image category of each training image and a visual stop word corresponding to the image category of each training image, and the visual stop word corresponding to the image category of each training image includes a visual word unrelated to the image category of each training image.

Optionally, the generating unit is specifically configured to: determining the correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library according to the plurality of visual words corresponding to each training image, the image category information of each training image and the positive sample image set information, wherein the plurality of image categories of the training image library comprise the image category of each training image, and the plurality of visual words of the training image library comprise the plurality of visual words of each training image; and generating the stop word library according to the correlation between the multiple image categories of the training image library and the multiple visual words of the training image library.

Optionally, the generating unit is specifically configured to: determining the correlation between a first image category and a first visual word according to the probability of occurrence of a first event, the probability of occurrence of a second event and the probability of occurrence of the first event and the second event at the same time, wherein multiple image categories of the training image library comprise the first image category, multiple visual words of the training image library comprise the first visual word, the first event represents that multiple visual words of a first training image in the training image library and multiple visual words of a second training image in the training image library both comprise the first visual word, the image category of the first training image is the first image category, and the second event represents that the first training image and the second training image belong to the same positive sample image set.

Optionally, the generating unit is further configured to, after generating a stop word library according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information, determine a visual stop word corresponding to the image category of each training image according to the image category information of each training image and the stop word library, remove the visual stop word corresponding to the image category of each training image from the plurality of visual words of each training image, obtain a target visual word of each training image, and add the target visual word of each training image to the search image library.

Optionally, the obtaining unit is specifically configured to obtain each training image, extract a plurality of visual feature descriptors of each training image, where the plurality of visual feature descriptors are used to describe a plurality of visual feature points of each training image, and the plurality of visual feature descriptors are in one-to-one correspondence with the plurality of visual feature points, obtain a visual word bag model, and determine a plurality of visual words in the visual word bag model, which are closest to each visual feature descriptor in the plurality of visual feature descriptors, as the plurality of visual words of each training image.
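The mapping from visual feature descriptors to visual words can be sketched as a nearest-neighbor lookup. The sketch below assumes that the visual word bag model is a list of descriptor-space centers and that "closest" means smallest Euclidean distance; both are assumptions, and the function names are hypothetical. Descriptor extraction itself is not shown.

```python
import math


def nearest_visual_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to `descriptor`.

    Assumption: closeness is Euclidean distance.
    """
    return min(range(len(vocabulary)),
               key=lambda index: math.dist(descriptor, vocabulary[index]))


def quantize(descriptors, vocabulary):
    """Map each visual feature descriptor to its nearest visual word."""
    return [nearest_visual_word(d, vocabulary) for d in descriptors]
```

A linear scan is used here for clarity; with a large vocabulary an approximate nearest-neighbor index would normally replace it.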

It should be understood that the image library generating apparatus 400 herein is embodied in the form of a functional unit. The term "unit" herein may refer to an ASIC, an electronic circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality. In an optional example, as can be understood by those skilled in the art, the image library generating apparatus 400 may be specifically the image library generating apparatus in the foregoing method 100 and method 200 embodiments, and the image library generating apparatus 400 may be configured to execute each flow and/or step corresponding to the image library generating apparatus in the foregoing method 100 and method 200 embodiments, and is not described herein again to avoid repetition.

Fig. 5 shows a schematic block diagram of an image retrieval apparatus 500 provided in an embodiment of the present application, where the image retrieval apparatus 500 may be the image retrieval apparatus described in fig. 1 and fig. 2, and the image retrieval apparatus may adopt a hardware architecture as shown in fig. 5. The image retrieval device may include a processor 510, a communication interface 520, and a memory 530, the processor 510, the communication interface 520, and the memory 530 communicating with each other through an internal connection path. The related functions implemented by the processing unit 320 and the retrieving unit 330 in fig. 3 may be implemented by the processor 510, and the related functions implemented by the obtaining unit 310 may be implemented by the processor 510 controlling the communication interface 520.

The processor 510 may include one or more processors, such as one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU.

The communication interface 520 is used to transmit and/or receive data. The communication interface may include a transmit interface for transmitting data and a receive interface for receiving data.

The memory 530 includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an Erasable Programmable Read Only Memory (EPROM), and a compact disc read-only memory (CD-ROM), and the memory 530 is used for storing relevant instructions and data.

The memory 530 is used for storing program codes and data of the image retrieval apparatus, and may be a separate device or integrated in the processor 510.

In particular, the processor 510 is configured to control the communication interface to perform data transmission with other devices, such as a generation device of an image library. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.

It will be appreciated that fig. 5 only shows a simplified design of the image retrieval apparatus. In practical applications, the image retrieval devices may further include other necessary components, including but not limited to any number of communication interfaces, processors, controllers, memories, etc., and all image retrieval devices that can implement the present application are within the scope of the present application.

In one possible design, the image retrieval apparatus 500 may be replaced with a chip apparatus, for example, a chip that can be used in the image retrieval apparatus to implement the relevant functions of the processor 510 in the image retrieval apparatus. The chip apparatus may be a field programmable gate array, an application-specific integrated circuit, a system on chip, a central processing unit, a network processor, a digital signal processing circuit, or a microcontroller that implements the relevant functions, or may be a programmable controller or another integrated chip. The chip may optionally include one or more memories for storing program code that, when executed, causes the processor to implement corresponding functions.

Fig. 6 shows a schematic block diagram of an image library generating apparatus 600 provided in an embodiment of the present application, where the image library generating apparatus 600 may be the image library generating apparatus described in fig. 1 and fig. 2, and the image library generating apparatus may adopt a hardware architecture as shown in fig. 6. The generating means of the image library may comprise a processor 610, a communication interface 620 and a memory 630, the processor 610, the communication interface 620 and the memory 630 communicating with each other through an internal connection path. The related functions implemented by the generating unit 420 in fig. 4 may be implemented by the processor 610, and the related functions implemented by the obtaining unit 410 may be implemented by the processor 610 controlling the communication interface 620.

The processor 610 may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case that the processor is a CPU, the CPU may be a single-core CPU or a multi-core CPU.

The communication interface 620 is used to transmit and/or receive data. The communication interface may include a transmit interface for transmitting data and a receive interface for receiving data.

The memory 630 includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an Erasable Programmable Read Only Memory (EPROM), and a compact disc read-only memory (CD-ROM), and the memory 630 is used for storing related instructions and data.

The memory 630 is used to store program codes and data of the image library generating apparatus, and may be a separate device or integrated in the processor 610.

In particular, the processor 610 is configured to control the communication interface to perform data transmission with other devices, such as an image retrieval device. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.

It will be appreciated that fig. 6 only shows a simplified design of the generating means of the image library. In practical applications, the image library generating device may further include other necessary elements, including but not limited to any number of communication interfaces, processors, controllers, memories, etc., and all image library generating devices that can implement the present application are within the scope of the present application.

In one possible design, the image library generating apparatus 600 may be replaced by a chip apparatus, for example, a chip that can be used in the image library generating apparatus to implement the relevant functions of the processor 610 in the image library generating apparatus. The chip apparatus may be a field programmable gate array, an application-specific integrated circuit, a system on chip, a central processing unit, a network processor, a digital signal processing circuit, or a microcontroller that implements the relevant functions, or may be a programmable controller or another integrated chip. The chip may optionally include one or more memories for storing program code that, when executed, causes the processor to implement corresponding functions.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. An image retrieval method, comprising:

acquiring a plurality of visual words of an image to be retrieved and image category information of the image to be retrieved, wherein the plurality of visual words of the image to be retrieved are obtained by matching and mapping a plurality of visual feature descriptors of the image to be retrieved and the visual words in a visual word bag model, and the image category information of the image to be retrieved is used for indicating the image category of the image to be retrieved;

2. The method according to claim 1, wherein the search image library includes the mapping relationship between the plurality of search images and a target visual word corresponding to each search image in the plurality of search images, and the target visual word corresponding to each search image is obtained by removing a visual stop word corresponding to an image category of each search image from the plurality of visual words corresponding to each search image.

3. The method according to claim 1 or 2, wherein before determining the visual stop word corresponding to the image category of the image to be retrieved according to the image category information of the image to be retrieved and the stop word lexicon, the method further comprises:

acquiring a plurality of visual words of each training image in a training image library, image category information of each training image and positive sample image set information, wherein the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with the visual words in the visual word bag model, the image category information of each training image is used for indicating the image category of each training image, the positive sample image set information is used for indicating at least one positive sample image set, and the positive sample image set comprises a plurality of artificially labeled similar training images in the training image library;

and generating the stop word bank according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information.

4. The method of claim 3, wherein generating the stop word lexicon from the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information comprises:

determining, according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information, a correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library, where the plurality of image categories of the training image library include the image category of each training image, and the plurality of visual words of the training image library include the plurality of visual words of each training image;

and generating the stop word library according to the correlation between the multiple image categories of the training image library and the multiple visual words of the training image library.

5. The method of claim 4, wherein determining the correlation between the plurality of image classes of the training image library and the plurality of visual words of the training image library according to the plurality of visual words of each training image, the image class information of each training image, and the positive sample image set information comprises:

determining correlation between a first image category and a first visual word according to the probability of occurrence of a first event, the probability of occurrence of a second event and the probability of occurrence of the first event and the second event at the same time, wherein multiple image categories of the training image library comprise the first image category, multiple visual words of the training image library comprise the first visual word, the first event represents that multiple visual words of a first training image in the training image library and multiple visual words of a second training image in the training image library both comprise the first visual word, the image category of the first training image is the first image category, and the second event represents that the first training image and the second training image belong to the same positive sample image set.

6. The method according to claim 2, wherein determining a search result according to the target visual word of the image to be searched and a search image library comprises:

determining the similarity between the target visual words of the image to be retrieved and the target visual words of the retrieval images in the retrieval image library;

and determining, as the retrieval result, at least one retrieval image whose target visual words have a similarity with the target visual words of the image to be retrieved that is greater than a first preset value.

7. A method for generating an image library, comprising:

acquiring a plurality of visual words of each training image in a training image library, image category information of each training image and positive sample image set information, wherein the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image with visual words in a visual word bag model, the image category information of each training image is used for indicating the image category of each training image, the positive sample image set information is used for indicating at least one positive sample image set, and the positive sample image set comprises a plurality of artificially labeled similar training images in the training image library;

generating a stop word bank according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information, wherein the stop word bank comprises a mapping relation between the image category of each training image and the visual stop word corresponding to the image category of each training image, and the visual stop word corresponding to the image category of each training image comprises visual words unrelated to the image category of each training image.

8. The method of claim 7, wherein generating a stop word lexicon from the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information comprises:

determining the correlation between a plurality of image categories of the training image library and a plurality of visual words of the training image library according to the plurality of visual words corresponding to each training image, the image category information of each training image and the positive sample image set information, wherein the plurality of image categories of the training image library comprise the image category of each training image, and the plurality of visual words of the training image library comprise the plurality of visual words of each training image;

9. An image retrieval apparatus, comprising:

an acquiring unit, configured to acquire a plurality of visual words of an image to be retrieved and image category information of the image to be retrieved, wherein the plurality of visual words of the image to be retrieved are obtained by matching and mapping a plurality of visual feature descriptors of the image to be retrieved with visual words in a visual word bag model, and the image category information of the image to be retrieved is used for indicating the image category of the image to be retrieved;

a processing unit, configured to determine, according to a stop word library and the image category information of the image to be retrieved acquired by the acquisition unit, visual stop words corresponding to the image category of the image to be retrieved, wherein the visual stop words corresponding to the image category of the image to be retrieved comprise visual words irrelevant to the image category of the image to be retrieved, and the stop word library comprises a mapping relationship between the image category of the image to be retrieved and the visual stop words corresponding to the image category of the image to be retrieved;

the processing unit being further configured to remove the visual stop words corresponding to the image category of the image to be retrieved from the plurality of visual words of the image to be retrieved acquired by the acquisition unit, to obtain target visual words of the image to be retrieved;

and a retrieval unit, configured to determine a retrieval result according to a retrieval image library and the target visual words of the image to be retrieved obtained by the processing unit, wherein the retrieval image library comprises a plurality of retrieval images.
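The stop-word removal recited for the processing unit can be illustrated with a minimal sketch. It assumes the stop word library is held as a dictionary from image category to a set of visual word IDs; the type alias, function name, category labels and word IDs are hypothetical and do not come from the claims.

```python
from typing import Dict, List, Set

# Assumed representation (not from the claims): image category -> set of
# visual word IDs treated as irrelevant to that category.
StopWordLibrary = Dict[str, Set[int]]


def remove_visual_stop_words(
    visual_words: List[int],
    image_category: str,
    stop_word_library: StopWordLibrary,
) -> List[int]:
    """Return the target visual words: the input minus the category's stop words."""
    # An unknown category has no stop words, so nothing is removed.
    stop_words = stop_word_library.get(image_category, set())
    return [word for word in visual_words if word not in stop_words]


# Illustrative data only.
stop_word_library = {"indoor": {3, 7}, "street": {1, 7, 9}}
query_words = [1, 3, 3, 5, 7, 8]

target_words = remove_visual_stop_words(query_words, "indoor", stop_word_library)
print(target_words)  # [1, 5, 8]
```

Because the lookup is keyed by category, the same visual word can be kept for one category and dropped for another, which is the behaviour the mapping relationship in the claim implies.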

10. The apparatus according to claim 9, wherein the retrieval image library includes mapping relationships between the plurality of retrieval images and a target visual word corresponding to each retrieval image in the plurality of retrieval images, and the target visual word corresponding to each retrieval image is obtained by removing a visual stop word corresponding to an image category of each retrieval image from the plurality of visual words corresponding to each retrieval image.

11. The apparatus according to claim 9 or 10, characterized in that the apparatus further comprises a generating unit,

the acquisition unit is further configured to acquire, before the visual stop words corresponding to the image category of the image to be retrieved are determined according to the image category information of the image to be retrieved and the stop word library, a plurality of visual words of each training image in a training image library, image category information of each training image, and positive sample image set information, wherein the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image to the visual words in the visual bag-of-words model, the image category information of each training image is used to indicate the image category of each training image, the positive sample image set information is used to indicate at least one positive sample image set, and the positive sample image set comprises a plurality of manually labeled similar training images in the training image library;

the generating unit is configured to generate the stop word library according to the plurality of visual words of each training image, the image category information of each training image and the positive sample image set information.

12. The apparatus according to claim 11, wherein the generating unit is specifically configured to:

13. The apparatus according to claim 12, wherein the generating unit is specifically configured to:

14. The apparatus according to claim 10, wherein the retrieving unit is specifically configured to:

and determining, as the retrieval result, at least one retrieval image for which the similarity between its corresponding target visual words and the target visual words of the image to be retrieved is greater than a first preset value.
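The thresholded retrieval step can be sketched as follows. The claim does not name a similarity measure here, so the sketch assumes cosine similarity over visual-word histograms; the function names, image IDs and the threshold value are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple


def cosine_similarity(words_a: List[int], words_b: List[int]) -> float:
    """Cosine similarity between two visual-word histograms (assumed measure)."""
    hist_a, hist_b = Counter(words_a), Counter(words_b)
    dot = sum(count * hist_b[word] for word, count in hist_a.items())
    norm_a = sqrt(sum(c * c for c in hist_a.values()))
    norm_b = sqrt(sum(c * c for c in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_target_words: List[int],
    retrieval_library: Dict[str, List[int]],
    first_preset_value: float,
) -> List[Tuple[str, float]]:
    """Return (image ID, similarity) pairs above the threshold, best match first."""
    scored = [
        (image_id, cosine_similarity(query_target_words, target_words))
        for image_id, target_words in retrieval_library.items()
    ]
    hits = [(image_id, score) for image_id, score in scored if score > first_preset_value]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Illustrative retrieval image library: image ID -> target visual words
# (that is, visual words with the category's stop words already removed).
retrieval_library = {"img_a": [1, 5, 8], "img_b": [2, 4], "img_c": [1, 5, 5]}
results = retrieve([1, 5, 8], retrieval_library, first_preset_value=0.5)
print([image_id for image_id, _ in results])  # ['img_a', 'img_c']
```

Since both the query and the library entries have had their stop words removed beforehand, the histograms being compared are shorter, which is one plausible source of the efficiency gain the abstract refers to.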

15. An apparatus for generating an image library, comprising:

an acquisition unit, configured to acquire a plurality of visual words of each training image in a training image library, image category information of each training image, and positive sample image set information, wherein the plurality of visual words of each training image are obtained by matching and mapping a plurality of visual feature descriptors of each training image to visual words in a visual bag-of-words model, the image category information of each training image is used to indicate the image category of each training image, the positive sample image set information is used to indicate at least one positive sample image set, and the positive sample image set comprises a plurality of manually labeled similar training images in the training image library;

and a generating unit, configured to generate a stop word library according to the plurality of visual words of each training image, the image category information of each training image, and the positive sample image set information acquired by the acquisition unit, wherein the stop word library comprises a mapping relationship between the image category of each training image and the visual stop words corresponding to the image category of each training image, and the visual stop words corresponding to the image category of each training image comprise visual words unrelated to the image category of each training image.
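One way the generating unit might build such a mapping is sketched below. The text above does not fix how relatedness between a visual word and an image category is measured, so the sketch makes a loud assumption: the correlation of a word with a category is the fraction of that category's positive sample image sets in which every image contains the word, and words below a threshold become stop words. All names, data and the threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_stop_word_library(
    image_words: Dict[str, List[int]],
    image_category: Dict[str, str],
    positive_sets: List[Set[str]],
    min_correlation: float,
) -> Dict[str, Set[int]]:
    """Map each image category to the visual words weakly correlated with it."""
    vocabulary = {word for words in image_words.values() for word in words}
    sets_per_category: Dict[str, int] = defaultdict(int)
    support: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))

    for positive_set in positive_sets:
        members = sorted(positive_set)
        if not members:
            continue
        # Words shared by all similar images in this positive sample set.
        shared = set(image_words[members[0]])
        for image_id in members[1:]:
            shared &= set(image_words[image_id])
        for category in {image_category[image_id] for image_id in members}:
            sets_per_category[category] += 1
            for word in shared:
                support[category][word] += 1

    library: Dict[str, Set[int]] = {}
    for category in set(image_category.values()):
        total = sets_per_category[category]
        if total == 0:
            # No labeled evidence for this category: remove nothing.
            library[category] = set()
            continue
        # Assumed criterion: correlation below the threshold means "unrelated".
        library[category] = {
            word for word in vocabulary
            if support[category][word] / total < min_correlation
        }
    return library


# Illustrative training data only.
image_words = {
    "a1": [1, 2, 7], "a2": [1, 2, 9], "a3": [1, 3, 7], "a4": [1, 3, 9],
    "b1": [4, 7], "b2": [4, 9],
}
image_category = {
    "a1": "indoor", "a2": "indoor", "a3": "indoor", "a4": "indoor",
    "b1": "street", "b2": "street",
}
positive_sets = [{"a1", "a2"}, {"a3", "a4"}, {"b1", "b2"}]

stop_word_library = build_stop_word_library(
    image_words, image_category, positive_sets, min_correlation=0.5
)
print(sorted(stop_word_library["indoor"]))  # [4, 7, 9]
```

In this toy data, words 7 and 9 appear in indoor images but are never shared across a whole positive sample set, so they are treated as unrelated to the category under the assumed criterion.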

16. The apparatus according to claim 15, wherein the generating unit is specifically configured to:

17. An image retrieval device, the device comprising a memory, a processor, a communication interface and a computer program stored on the memory and executable on the processor, wherein the memory, the processor and the communication interface are in communication with each other via an internal connection path, wherein the processor executes the computer program to perform the method of any one of the preceding claims 1 to 6.

18. An apparatus for generating an image library, the apparatus comprising a memory, a processor, a communication interface and a computer program stored in the memory and executable on the processor, wherein the memory, the processor and the communication interface are in communication with each other via an internal connection path, and wherein the processor executes the computer program to perform the method of claim 7 or claim 8.

19. A computer-readable medium for storing a computer program, characterized in that the computer program comprises instructions for performing the method of any of the preceding claims 1 to 6.

20. A computer-readable medium for storing a computer program, characterized in that the computer program comprises instructions for performing the method of claim 7 or claim 8.