CN113590898A - Data retrieval method and device, electronic equipment, storage medium and computer product - Google Patents


Info

Publication number
CN113590898A
Authority
CN
China
Prior art keywords
data
retrieved
retrieved data
similarity
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111130352.5A
Other languages
Chinese (zh)
Inventor
林庆泓
田上萱
赵文哲
陈小军
王红法
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111130352.5A
Publication of CN113590898A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/903 Querying
    • G06F 16/90335 Query processing
    • G06F 16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data
    • G06F 16/906 Clustering; Classification
    • G06F 16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/908 Retrieval characterised by using metadata automatically derived from the content
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation with fixed number of clusters, e.g. K-means clustering
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a data retrieval method and device, electronic equipment, a storage medium and a computer product, and relates to the technical fields of multimedia, big data and cloud. The method comprises the following steps: acquiring retrieval data, and determining a first similarity between the retrieval data and each of at least two first anchor points, wherein the at least two first anchor points are at least two clustering centers obtained by clustering a plurality of sample data in a training data set; acquiring a second similarity between each retrieved data in the retrieved database and each first anchor point; determining candidate retrieved data matched with the retrieval data from the retrieved database according to the first similarity of the retrieval data corresponding to each first anchor point and the second similarity of each retrieved data corresponding to each first anchor point; and determining at least one target retrieved data matched with the retrieval data from the candidate retrieved data. Based on the method provided by the embodiment of the application, the efficiency of data retrieval can be greatly improved.

Description

Data retrieval method and device, electronic equipment, storage medium and computer product
Technical Field
The application relates to the technical field of big data and cloud, in particular to a data retrieval method, a data retrieval device, electronic equipment, a storage medium and a computer product.
Background
With the rapid development of the internet, multimedia data such as images, texts and videos are growing rapidly, and large-scale data retrieval/query has become a research hotspot. In the face of massive data, Approximate Nearest Neighbor (ANN) search offers broader practical advantages than exact retrieval, and has therefore become a key technology in information retrieval.
In order to meet the requirements of large-scale data retrieval, retrieval techniques are continuously being optimized and improved. Hashing has received more and more attention due to its low storage cost and high query efficiency: high-dimensional data can be encoded into low-dimensional, compact binary hash codes, which accelerates the computation of similarity between data. Although hashing-based retrieval can achieve relatively good performance, the sheer volume of data is still a bottleneck for efficient retrieval. How to reduce the complexity of data retrieval and improve retrieval efficiency remains one of the important problems to be addressed.
Disclosure of Invention
The application aims to provide a data retrieval method, a data retrieval device, an electronic device and a storage medium, which can effectively improve the data retrieval efficiency. In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:
in one aspect, an embodiment of the present application provides a data retrieval method, where the method includes:
acquiring retrieval data, and determining first similarity between the retrieval data and each of at least two first anchor points, wherein the at least two first anchor points are at least two clustering centers obtained by clustering a plurality of sample data in a training data set;
acquiring a second similarity between each retrieved data in the retrieved database and each first anchor point;
determining candidate retrieved data matched with the retrieval data from the retrieved database according to the first similarity of the retrieval data corresponding to each first anchor point and the second similarity of each retrieved data corresponding to each first anchor point;
determining, from the candidate retrieved data, at least one target retrieved data matched with the retrieval data.
In another aspect, an embodiment of the present application provides a data retrieval apparatus, where the apparatus includes:
the similarity determining module is used for acquiring retrieval data and determining first similarity between the retrieval data and each of at least two first anchor points, wherein the at least two first anchor points are at least two clustering centers obtained by clustering a plurality of sample data in the training data set;
the candidate data determining module is used for acquiring a second similarity between each piece of retrieved data in the retrieved database and each first anchor point, and determining candidate retrieved data matched with the retrieval data from the retrieved database according to the first similarity of the retrieval data corresponding to each first anchor point and the second similarity of each retrieved data corresponding to each first anchor point;
and the retrieval result determining module is used for determining, from the candidate retrieved data, at least one target retrieved data matched with the retrieval data.
Optionally, the candidate data determining module may be specifically configured to:
for each piece of retrieved data, determining, from the first anchor points, at least one anchor point matched with the retrieved data according to the second similarity of the retrieved data corresponding to each first anchor point; for each first anchor point, determining the retrieved data matched with the first anchor point as a data subset corresponding to the first anchor point; determining, from the first anchor points, at least one target anchor point matched with the retrieval data according to the first similarity of the retrieval data corresponding to each first anchor point; and determining the retrieved data in the data subsets corresponding to the target anchor points as the candidate retrieved data matched with the retrieval data.
Optionally, the candidate data determining module may be specifically configured to:
for each piece of retrieved data, determining, from the first anchor points, a second anchor point which most closely matches the retrieved data according to the second similarity of the retrieved data corresponding to each first anchor point;
and determining each candidate retrieved data from the retrieved data corresponding to at least one first anchor point according to the first similarity corresponding to each first anchor point, in descending order of the first similarity, wherein if a first anchor point is the second anchor point corresponding to a piece of retrieved data, that retrieved data is the retrieved data corresponding to the first anchor point.
Optionally, when the candidate data determining module determines, according to the first similarity corresponding to each first anchor point and in descending order of the first similarity, each candidate retrieved data from the retrieved data corresponding to the first anchor points, either of the following may be executed:
sorting, in descending order of the first similarity, the retrieved data corresponding to the first anchor points, and determining a first set number of top-ranked retrieved data in the sorted retrieved data as the candidate retrieved data matched with the retrieval data;
or determining, in descending order of the first similarity, the retrieved data corresponding to a second set number of top-ranked first anchor points as the candidate retrieved data matched with the retrieval data.
Optionally, the retrieval result determining module may be specifically configured to:
determine a third similarity between the retrieval data and each candidate retrieved data, and determine, from the candidate retrieved data, at least one target retrieved data matched with the retrieval data based on the third similarity corresponding to each candidate retrieved data; alternatively, determine each candidate retrieved data as the at least one target retrieved data.
Optionally, the retrieval result determining module, when determining the third similarity between the retrieval data and each candidate retrieved data, may be configured to:
acquiring a first feature vector of retrieval data and a second feature vector of each candidate retrieved data; and for each candidate retrieved data, obtaining a third similarity between the retrieval data and the candidate retrieved data according to the first feature vector and the second feature vector of the candidate retrieved data.
Optionally, the retrieval result determining module, when determining the third similarity between the retrieval data and each candidate retrieved data, may be configured to:
acquiring a first hash code corresponding to the retrieval data and a second hash code corresponding to each candidate retrieved data;
and for each candidate retrieved data, determining a Hamming distance between the first Hash code and a second Hash code corresponding to the candidate retrieved data, and determining a third similarity between the retrieved data and the candidate retrieved data according to the Hamming distance.
Optionally, when the retrieval result determining module obtains the first hash code corresponding to the retrieval data and the second hash code corresponding to each candidate retrieved data, the retrieval result determining module may be configured to:
acquiring a hash function, wherein the hash function is obtained through an anchor point graph hash algorithm based on a plurality of sample data in a training data set and at least two first anchor points;
based on first similarity between the retrieval data and each first anchor point, obtaining a first hash code corresponding to the retrieval data through a hash function;
and for each candidate retrieved data, obtaining a second hash code corresponding to the candidate retrieved data through a hash function based on the second similarity between the candidate retrieved data and each first anchor point.
Optionally, the retrieval result determining module may be configured to:
sort the candidate retrieved data in descending order of their similarity to the retrieval data, and use the sorted candidate retrieved data as the at least one target retrieved data.

In another aspect, the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the data retrieval method provided in any optional embodiment of the present application when running the computer program.
In another aspect, the present application provides a computer-readable storage medium having a computer program stored thereon, where the computer program is run on a processor, and the processor is configured to execute the data retrieval method provided in any of the alternative embodiments of the present application.
In another aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, performs the method provided in any of the alternative embodiments of the present application.
In another aspect, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data retrieval method provided in any of the alternative embodiments of the present application.
The beneficial effect that technical scheme that this application provided brought as follows:
according to the scheme provided by the embodiment of the application, when the target retrieved data matched with the retrieval data is determined, a portion of the retrieved data that is more likely to match the retrieval data is first selected from the retrieved database based on the similarity between the retrieval data and each first anchor point and the similarity between each retrieved data and each first anchor point; this portion is taken as the candidate retrieved data, and at least one target retrieved data matched with the retrieval data is then determined from the candidate retrieved data. Because the candidate retrieved data is only part of the data in the retrieved database, the similarity between the retrieval data and every retrieved data does not need to be calculated; the calculation amount can therefore be effectively reduced and the data retrieval efficiency improved, and the effect is especially prominent when the retrieved data in the retrieved database is large in scale.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a data retrieval method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a data retrieval system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a data retrieval method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram illustrating a principle of calculating similarity between data and an anchor point according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a method for determining target retrieved data according to an alternative embodiment of the present application;
FIG. 6 is an example of determining candidate retrieved data based on the scheme shown in FIG. 5;
FIG. 7 is a schematic diagram illustrating a method for determining target retrieved data according to another alternative embodiment of the present application;
FIGS. 8a, 8b and 8c are schematic diagrams of the determination of target retrieved data provided by an alternative embodiment based on the scheme shown in FIG. 7;
fig. 9 is a schematic structural diagram of a data retrieval device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
The data query method provided by the embodiment of the application can be applied to processing of Big data (Big data), and optionally can be implemented based on Cloud technology (Cloud technology). Optionally, the data computation related in the embodiment of the present application may adopt a Cloud computing (Cloud computing) manner. For example, cloud computing may be used for computing the steps of determining similarity between data, clustering data, and the like. The data in the embodiment of the application may be stored in a cloud storage manner, for example, the retrieved database may be stored in a cloud.
Big data refers to data sets that cannot be captured, managed and processed by conventional software tools within a reasonable time frame; it is a massive, fast-growing and diversified information asset whose stronger decision-making, insight-discovery and process-optimization capabilities can only be unlocked by new processing modes. With the advent of the cloud era, big data has attracted more and more attention, and processing large amounts of data effectively within a tolerable elapsed time requires special technologies. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
The cloud technology is a general term of network technology, information technology, integration technology, management platform technology, application technology and the like based on cloud computing business model application, can form a resource pool, is used as required, and is flexible and convenient. Cloud computing technology will become an important support. Cloud computing is a computing model that distributes computing tasks over a resource pool of large numbers of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand. With the development of diversification of internet, real-time data stream and connecting equipment and the promotion of demands of search service, social network, mobile commerce, open collaboration and the like, cloud computing is rapidly developed. Different from the prior parallel distributed computing, the generation of cloud computing can promote the revolutionary change of the whole internet mode and the enterprise management mode in concept. The distributed cloud storage system (hereinafter referred to as a storage system) refers to a storage system which integrates a large number of storage devices (storage devices are also referred to as storage nodes) of different types in a network through application software or application interfaces to cooperatively work through functions of cluster application, grid technology, distributed storage file system and the like, and provides data storage and service access functions to the outside.
The scheme provided by the embodiment of the application can be implemented based on Artificial Intelligence (AI) technology; for example, the feature vector of data can be obtained by performing feature extraction with a trained neural network. Optionally, the large amount of sample data in the training data set may also be classified by a neural network, and the cluster center corresponding to each category may be obtained based on the data contained in that category. Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a data retrieval method for improving the efficiency of data query/retrieval. The method is applicable to any application scenario requiring data retrieval or query, such as similar-data retrieval, and is particularly suitable for the retrieval of large-scale, massive data. For example, the method provided by the embodiment of the present application may be applied to similar-retrieval scenarios for large-scale multimedia data and can effectively improve the efficiency of retrieving similar multimedia data. For example, the multimedia data may be advertisements, and any advertisement (or a specified advertisement) among a large number of advertisements may be used as the retrieval advertisement (i.e., the query sample, the reference data for retrieval).
The method of the embodiment of the application can also be applied to application scenarios such as data deduplication and data grouping: data with high similarity within a large data set can be rapidly identified through the method, and the data can then be deduplicated or grouped, with the data in the same group being similar data.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a data retrieval method provided in an embodiment of the present application. The method may be executed by any electronic device, such as a terminal device or a server. For example, the method may be applied to a retrieval application: the server of the application may determine, according to a received retrieval request, a retrieval result matching the retrieval request from a database by executing the retrieval method provided in the embodiment of the present application, where in this example the retrieval data (i.e., the query sample, such as retrieval keywords, words, or images) is the retrieval information carried in the retrieval request. For another example, the method may be executed by any terminal device: a user may input retrieval data or locally select data (such as an image or text) from the terminal device as the retrieval data, and the terminal device may query a large amount of data stored on the device (or on other electronic devices in communication with the terminal device, such as a cloud) for matching data (i.e., the target retrieved data) according to the retrieval data, and display the data to the user. The terminal device includes a user terminal, and the user terminal includes but is not limited to a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal and the like.
As shown in fig. 1, the data retrieval method provided by the embodiment of the present application may include the following steps:
step S110: acquiring retrieval data, and determining first similarity between the retrieval data and each of at least two first anchor points, wherein the at least two first anchor points are at least two clustering centers obtained by clustering a plurality of sample data in a training data set;
step S120: acquiring a second similarity between each retrieved data in the retrieved database and each first anchor point; and determining candidate retrieved data matched with the retrieval data from the retrieved database according to the first similarity of the retrieval data corresponding to each first anchor point and the second similarity of each retrieved data corresponding to each first anchor point;
step S130: determining, from the candidate retrieved data, at least one target retrieved data matching the retrieval data.
The retrieval data is the reference data, i.e., the retrieval basis, according to which data retrieval is performed; it may also be referred to as the query sample, and based on the query sample, data matching the query sample is found from the retrieved database. The retrieved database is a database containing a large amount of retrieved data; the retrieved data is so named relative to the retrieval data, and the retrieved database may also be referred to as a retrieval/query database, with the retrieved data also referred to as database samples. For example, the retrieval data may be an image and the retrieved database may be an image library whose large number of stored images are the retrieved data. Based on the solution provided in the embodiment of the present application, images similar to the retrieval data can be found from the image library, for example images whose similarity with the image serving as the retrieval data is greater than or equal to a set similarity threshold.
The type of the data (e.g., the retrieval data, the retrieved data, and the sample data) is not limited in this embodiment. The data may include multimedia data, and the multimedia data may include one or more types such as text, image, and video. For example, the retrieval data may be image data or text data, and the retrieved database may include multiple types of image data, text data, or video data. For another example, the retrieval data may be the keywords of an advertisement or the video data of an advertisement, the retrieved database may include a large amount of advertisement data, and matching advertisement data may be determined from the retrieved database based on the keywords/video data. That is, the retrieval data and the retrieved data in the retrieved database may be the same type of data or different types of data, and the type of data may be pre-configured according to the actual application scenario and application requirements.
Optionally, in order to obtain a better query result, the sample data in the training data set and the retrieved data in the retrieved database may be data from the same application scenario/the same field.
The first anchor points, i.e., the cluster centers, are the class centers of a plurality of classes obtained by clustering a large amount of sample data, i.e., the feature vector of each class. The embodiment of the present application does not limit the method for clustering the plurality of sample data in the training data set; for example, any conventional clustering algorithm, such as the K-means clustering algorithm, may be used to divide the training samples into a plurality of classes (which may also be referred to as clusters), or the sample data may be classified through a neural network. For each class, the feature vector of the class, i.e., the feature vector of the cluster center, may be obtained based on the sample data contained in the class. For the retrieval data, the distance (such as the Euclidean distance or Hamming distance) between the feature vector of the retrieval data and the feature vector of each anchor point may be calculated by obtaining the feature vector of the retrieval data, so as to obtain the first similarity between the retrieval data and each first anchor point, where a smaller distance between feature vectors corresponds to a greater similarity. The specific manner of obtaining the feature vector of the retrieval data and the feature vector of each first anchor point is not limited in the embodiments of the present application.
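As an illustration only (not the patent's reference implementation), the following Python sketch shows how the first anchor points could be obtained as K-means cluster centers and how a Gaussian kernel over the Euclidean distance could serve as the first similarity; the function names, the use of scikit-learn, and the kernel choice are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_anchors(train_features: np.ndarray, num_anchors: int) -> np.ndarray:
    """Cluster training-sample features; the cluster centers act as the first anchor points."""
    kmeans = KMeans(n_clusters=num_anchors, n_init=10, random_state=0)
    kmeans.fit(train_features)
    return kmeans.cluster_centers_                      # shape: (num_anchors, dim)

def anchor_similarity(x: np.ndarray, anchors: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Smaller Euclidean distance -> larger similarity (Gaussian kernel)."""
    dists = np.linalg.norm(anchors - x, axis=1)
    return np.exp(-(dists ** 2) / (2.0 * sigma ** 2))   # shape: (num_anchors,)

# Example: 2,000 training vectors of dimension 64, 32 anchors.
train_features = np.random.rand(2000, 64).astype(np.float32)
anchors = fit_anchors(train_features, num_anchors=32)
query_feature = np.random.rand(64).astype(np.float32)
first_similarity = anchor_similarity(query_feature, anchors)
```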
Similarly, for the retrieved data, a second similarity between each of the retrieved data and the first anchor points may be determined. The retrieved data is data in the retrieved database, and the second similarity between each retrieved data and each first anchor point can be calculated in advance and stored, so that when the retrieved data needs to be retrieved in the database, the second similarity of each first anchor point corresponding to each retrieved data can be directly acquired, and recalculation in each time is not needed.
Since the first anchor points reflect the distribution of the sample data in the training data set, a first anchor point can represent the local structure among the similar sample data assigned to its cluster/class, and the original features of different sample data assigned to the same cluster are, with high probability, similar. For the same first anchor point, if both the retrieval data and a piece of retrieved data have a high similarity to that anchor point, the probability that the retrieved data is similar to the retrieval data is also high. Based on this, according to the scheme provided by the embodiment of the present application, when the target retrieved data matching the retrieval data is determined, a portion of the retrieved data that is more likely to match the retrieval data may be selected from the retrieved database based on the similarity between the retrieval data and each first anchor point and the similarity between each retrieved data and each first anchor point; these retrieved data are used as the candidate retrieved data, and at least one target retrieved data matched with the retrieval data is then determined from these candidate retrieved data. Because the candidate retrieved data is only part of the data in the retrieved database, this approach avoids calculating the similarity between the retrieval data and every retrieved data, which can effectively reduce the calculation amount and improve data retrieval efficiency; the effect is especially remarkable when the scale of the retrieved data in the retrieved database is large.
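To make the flow concrete, here is a condensed, hypothetical sketch of the filter-then-rank idea of steps S110 to S130, assuming the second similarities of all database items to the anchors have been precomputed; the function and parameter names, and the use of a single best-matching anchor per item, are illustrative assumptions, and the later passages describe several concrete variants.

```python
import numpy as np

def retrieve(query_sim, db_sims, db_features, query_feature, anchors_to_use=2, top_k=10):
    """query_sim: (num_anchors,) first similarities of the query;
    db_sims: (num_items, num_anchors) precomputed second similarities;
    db_features: (num_items, dim) feature vectors of the retrieved database."""
    # Filter: pick the query's target anchors, then keep only the database items
    # whose best-matching anchor is one of those target anchors (the candidates).
    target_anchors = np.argsort(query_sim)[::-1][:anchors_to_use]
    best_anchor_per_item = np.argmax(db_sims, axis=1)
    candidate_idx = np.where(np.isin(best_anchor_per_item, target_anchors))[0]

    # Rank: score only the candidates against the query (negative Euclidean
    # distance stands in for similarity here) and keep the top_k.
    diffs = db_features[candidate_idx] - query_feature
    scores = -np.linalg.norm(diffs, axis=1)
    order = np.argsort(scores)[::-1][:top_k]
    return candidate_idx[order]                         # indices of the target retrieved data
```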
After each candidate retrieved data is determined, the specific manner of determining the target retrieved data matching the retrieval data from the candidate retrieved data is not limited in this embodiment of the application and may be configured according to actual requirements. Optionally, if the matching requirement is not very strict, each candidate retrieved data may be directly used as target retrieved data; alternatively, the target retrieved data may be determined from the candidate retrieved data by calculating the similarity between the retrieval data and each candidate retrieved data and selecting according to the calculated similarities, for example, taking the candidate retrieved data whose similarity is greater than a certain threshold as the target retrieved data, or taking a certain number of candidate retrieved data ranked in descending order of similarity as the target retrieved data.
Optionally, the determining at least one target retrieved data matching the retrieval data from the candidate retrieved data may include:
sorting the candidate retrieved data in descending order of their similarity to the retrieval data;
and using the sorted candidate retrieved data as the at least one target retrieved data.
Alternatively, each candidate retrieved data may be directly used as target retrieved data without sorting; or, after sorting, one or more top-ranked candidate retrieved data may be taken as the target retrieved data in descending order of similarity.
That is, after the candidate retrieved data are determined, they may be displayed directly as the retrieval result, or the target retrieved data may be sorted in descending order of their similarity to the retrieval data and the sorted result taken as the final retrieval result. In this way, the retrieval result can be displayed in descending order of how well it matches the retrieval data, which better meets the display requirements of the retrieval result and the needs of the user, and improves user perception.
In an alternative embodiment of the present application, in step S120, the determining, from the retrieved database, each candidate retrieved data that matches the retrieval data according to the first similarity of the retrieval data corresponding to each first anchor point and the second similarity of each retrieved data corresponding to each first anchor point may include:
for each piece of retrieved data, determining, from the first anchor points, at least one anchor point matched with the retrieved data according to the second similarity of the retrieved data corresponding to each first anchor point;
for each first anchor point, determining the retrieved data matched with the first anchor point as a data subset corresponding to the first anchor point;
determining, from the first anchor points, at least one target anchor point matched with the retrieval data according to the first similarity of the retrieval data corresponding to each first anchor point;
and determining the retrieved data in the data subsets corresponding to the target anchor points as the candidate retrieved data matched with the retrieval data.
In this alternative embodiment, a corresponding data subset may be determined for each first anchor point based on the similarity between each retrieved data and each first anchor point, where the data subset contains the retrieved data matched with that first anchor point. After the first similarity between the retrieval data and each first anchor point is obtained, at least one target anchor point matched with the retrieval data can be determined from the first anchor points according to the first similarities. A target anchor point is an anchor point matched with the retrieval data, i.e., an anchor point whose similarity with the retrieval data satisfies a certain condition, for example its corresponding first similarity is greater than or equal to a set threshold, or it is among the anchor points corresponding to at least one of the largest first similarities when all first similarities are sorted in descending order. Since the retrieved data in the data subset corresponding to a target anchor point is the retrieved data matched with that target anchor point, the probability that the retrieved data corresponding to each target anchor point matches the retrieval data is high, and the retrieved data in the data subsets corresponding to the target anchor points may be taken as the candidate retrieved data.
In practical applications, the number of anchor points matching the retrieved data/retrieval data may be preconfigured as needed; for example, the number of matched anchor points may be preconfigured as 2. For each piece of retrieved data, the first anchor points corresponding to its 2 largest second similarities (the largest and the second largest) may be determined as the anchor points matching that retrieved data (which may also be referred to as its top-2 nearest anchor points), and the retrieved data is in turn matched with each such first anchor point, that is, it belongs to the data subset corresponding to that first anchor point. Similarly, for the retrieval data, the first anchor points corresponding to its 2 largest first similarities may be determined as the target anchor points matching the retrieval data, and the retrieved data in the data subsets corresponding to these two target anchor points may be used as the candidate retrieved data.
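A hypothetical sketch of this data-subset variant is given below; the helper names, the value 2, and the dictionary representation of the subsets are assumptions made for illustration.

```python
import numpy as np

def build_anchor_subsets(db_sims: np.ndarray, anchors_per_item: int = 2):
    """db_sims: (num_items, num_anchors) precomputed second similarities.
    Each item is assigned to its top `anchors_per_item` anchors (done offline)."""
    num_anchors = db_sims.shape[1]
    subsets = {a: [] for a in range(num_anchors)}
    top_anchors = np.argsort(db_sims, axis=1)[:, ::-1][:, :anchors_per_item]
    for item_idx, anchor_ids in enumerate(top_anchors):
        for a in anchor_ids:
            subsets[int(a)].append(item_idx)
    return subsets

def candidates_from_subsets(query_sim: np.ndarray, subsets, anchors_per_query: int = 2):
    """Union of the subsets of the query's top `anchors_per_query` target anchors."""
    target_anchors = np.argsort(query_sim)[::-1][:anchors_per_query]
    candidate_ids = set()
    for a in target_anchors:
        candidate_ids.update(subsets[int(a)])
    return sorted(candidate_ids)
```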
The number of anchor points matching a piece of retrieved data and the number of anchor points matching the retrieval data (i.e., the target anchor points) may be the same or different.
It is understood that, in practical applications, the step of determining the second similarity between the retrieved data and the first anchor points, the step of determining at least one anchor point matching the retrieved data from the first anchor points according to the second similarity of the retrieved data corresponding to the first anchor points, and the step of determining the data subset corresponding to each first anchor point may be preprocessed, that is, the data subset corresponding to each first anchor point may be determined in advance, and the steps are not performed each time the retrieved data is acquired. Optionally, after determining each first anchor point based on the training data set, the above steps may be performed to obtain a data subset corresponding to each first anchor point. Of course, if the data in the retrieved database is updated, the data subset corresponding to each first anchor point may be re-determined according to the updated data, for example, if the data is newly added to the retrieved database, the similarity between the newly added data and each first anchor point may be calculated, and the data may be assigned to the data subset corresponding to each first anchor point matched with the data.
In an alternative embodiment of the present application, in step S120, the determining, from the retrieved database, each candidate retrieved data that matches the retrieval data according to the first similarity of the retrieval data corresponding to each first anchor point and the second similarity of each retrieved data corresponding to each first anchor point may include:
for each piece of retrieved data, determining, from the first anchor points, a second anchor point which most closely matches the retrieved data according to the second similarity of the retrieved data corresponding to each first anchor point;
and determining each candidate retrieved data from the retrieved data corresponding to at least one first anchor point according to the first similarity corresponding to each first anchor point, in descending order of the first similarity, wherein if a first anchor point is the second anchor point corresponding to a piece of retrieved data, that retrieved data is the retrieved data corresponding to the first anchor point.
That is, the retrieved data corresponding to a first anchor point is the retrieved data whose second anchor point is that first anchor point.
This alternative embodiment of the present application provides another approach to determining the candidate retrieved data. In this scheme, for each piece of retrieved data, the first anchor point corresponding to its maximum second similarity may be determined as the second anchor point (which may also be referred to as the nearest anchor point or the best-matched anchor point) that most closely matches that piece of data. For each first anchor point, the first similarity between the first anchor point and the retrieval data represents how likely that anchor point is to match the retrieval data; therefore, the greater the first similarity between a first anchor point and the retrieval data, the higher the probability that the retrieved data corresponding to that first anchor point (i.e., the retrieved data whose nearest anchor point is that anchor point) matches the retrieval data, and the candidate retrieved data can be determined from the retrieved data corresponding to at least one first anchor point in descending order of the first similarity corresponding to each first anchor point.
In an optional embodiment of the application, the determining, according to the first similarity corresponding to each first anchor point and in descending order of the first similarity, each candidate retrieved data from the retrieved data corresponding to the first anchor points includes any one of:
sorting, in descending order of the first similarity, the retrieved data corresponding to the first anchor points, and determining a first set number of top-ranked retrieved data in the sorted retrieved data as the candidate retrieved data matched with the retrieval data;
or determining, in descending order of the first similarity, the retrieved data corresponding to a second set number of top-ranked first anchor points as the candidate retrieved data matched with the retrieval data.
As one alternative, the first anchor points may be sorted in descending order of their first similarity to the retrieval data, and the retrieved data may then be sorted according to that anchor ordering, where the position of a piece of retrieved data is the position of the anchor point that best matches it. Since the order of the retrieved data is determined by the order of their best-matching first anchor points, and the order of the first anchor points is determined by their similarity to the retrieval data, retrieved data ranked earlier is more likely to match the retrieval data, and the first set number of top-ranked retrieved data can be determined as the candidate retrieved data matched with the retrieval data.
It can be understood that retrieved data whose best-matching anchor points are the same anchor point have no defined order among themselves. For example, if the best-matching anchor point of retrieved data A is anchor point 1 and the best-matching anchor point of retrieved data B is also anchor point 1, then data A and data B are both placed according to the position of anchor point 1 in the anchor ordering, but there is no defined order between data A and data B.
As another alternative, since there may be a plurality of retrieved data corresponding to each first anchor point, that is, a plurality of retrieved data whose most-matched anchor point is the same anchor point, and no ordering (i.e., how likely each is to match the retrieval data) can be guaranteed among them, the retrieved data corresponding to a second set number of top-ranked first anchor points may be determined as the candidate retrieved data. This avoids the situation in which, when a plurality of retrieved data share the same most-matched anchor point, only some of them are selected as candidate retrieved data while retrieved data that may match the retrieval data better are left out.
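A hypothetical sketch of this second variant follows; the two selection options correspond to the first and second set numbers described above, and the function and parameter names are assumptions made for illustration.

```python
import numpy as np

def candidates_by_anchor_order(query_sim, best_anchor_per_item,
                               first_set_number=None, second_set_number=None):
    """query_sim: (num_anchors,) first similarities; best_anchor_per_item: (num_items,)."""
    anchor_order = np.argsort(query_sim)[::-1]          # anchor ids, most similar first
    if second_set_number is not None:
        # Second option: keep every item attached to the top-ranked anchors.
        top = set(anchor_order[:second_set_number].tolist())
        return [i for i, a in enumerate(best_anchor_per_item) if int(a) in top]
    # First option: walk anchors in order, collecting their items until the quota is met
    # (items sharing the same best anchor have no defined order among themselves).
    ordered_items = []
    for a in anchor_order:
        ordered_items.extend(np.where(best_anchor_per_item == a)[0].tolist())
        if first_set_number is not None and len(ordered_items) >= first_set_number:
            break
    return ordered_items[:first_set_number]
```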
After each candidate retrieved data is determined, the target retrieved data matching the retrieval data can be further determined from the candidate retrieved data, yielding the retrieval result corresponding to the retrieval data.
In an optional embodiment of the present application, the determining, from each of the candidate retrieved data, at least one target retrieved data that matches the retrieval data may include:
determining a third similarity between the retrieval data and each candidate retrieved data, and determining at least one target retrieved data matched with the retrieval data from each candidate retrieved data based on the third similarity corresponding to each candidate retrieved data;
alternatively, each candidate retrieved data is set as at least one target retrieved data.
Optionally, candidate retrieved data whose third similarity is greater than or equal to a preset similarity threshold may be determined as the target retrieved data; or, in descending order of the third similarity, the candidate retrieved data corresponding to one or more (for example, a third set number of) top-ranked third similarities may be taken as the target retrieved data; or, each candidate retrieved data may be taken as the target retrieved data.
It can be understood that after each candidate retrieved data is determined, the candidate retrieved data may be regarded as a new database in which every retrieved data has a high probability of matching the retrieval data; the final retrieval result may then be obtained by any of the above-mentioned methods for determining the target retrieved data listed in the present application, or by other existing methods for retrieving, from a database, data that is similar to or matches the retrieval data.
Optionally, the determining a third similarity between the retrieved data and each candidate retrieved data includes:
acquiring a first feature vector of retrieval data and a second feature vector of each candidate retrieved data;
and for each candidate retrieved data, obtaining a third similarity between the retrieval data and the candidate retrieved data according to the first feature vector and the second feature vector of the candidate retrieved data.
That is, the third similarity between the retrieval data and the retrieved data may be determined by calculating the similarity between the feature vectors of the data. The embodiment of the present application does not limit the manner of obtaining the feature vectors of the data (the retrieval data and the retrieved data) or of calculating the similarity between feature vectors. For example, a neural network may be used to extract the feature vectors and determine the similarity between the data according to those feature vectors; or, after the feature vectors are obtained, the similarity between vectors may be determined by calculating the distance between them, for example the Hamming distance between the feature vectors, where a smaller distance corresponds to a greater similarity.
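As a minimal, hypothetical sketch of this step, the example below uses cosine similarity between feature vectors as the third similarity and keeps the top-k candidates; the metric and the function names are assumptions, since the text leaves the exact similarity measure open.

```python
import numpy as np

def third_similarity(query_feature: np.ndarray, candidate_features: np.ndarray) -> np.ndarray:
    """Cosine similarity between the query and each candidate feature vector."""
    q = query_feature / np.linalg.norm(query_feature)
    c = candidate_features / np.linalg.norm(candidate_features, axis=1, keepdims=True)
    return c @ q                                         # shape: (num_candidates,)

def top_k_targets(similarities: np.ndarray, candidate_ids, k: int = 10):
    """Return the ids of the k candidates most similar to the query."""
    order = np.argsort(similarities)[::-1][:k]
    return [candidate_ids[i] for i in order]
```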
Similarly, in practical applications, since the retrieved data is data existing in the retrieved database, the feature vectors of the retrieved data can be obtained in advance and stored, and when the third similarity between the retrieved data and the candidate retrieved data needs to be calculated, the feature vectors of the candidate retrieved data can be directly obtained from the corresponding storage positions, which can further improve the data processing efficiency.
In an optional embodiment of the application, the determining the third similarity between the retrieval data and each candidate retrieved data may include:
acquiring a first hash code corresponding to the retrieval data and a second hash code corresponding to each candidate retrieved data;
for each candidate retrieved data, determining a Hamming distance between the first hash code and the second hash code corresponding to the candidate retrieved data;
and for each candidate retrieved data, determining the third similarity between the retrieval data and the candidate retrieved data according to the Hamming distance corresponding to the candidate retrieved data.
In this alternative, the retrieval data and the retrieved data may be mapped in a hash manner to obtain their corresponding hash codes, and the similarity between the retrieval data and a candidate retrieved data may be obtained by calculating the similarity between the hash code of the retrieval data and the hash code of that candidate retrieved data. Hash conversion encodes high-dimensional data into low-dimensional hash codes; this dimensionality reduction can effectively reduce the amount of data calculation and improve calculation efficiency.
The specific way of hash conversion is not limited in the embodiments of the present application and may be configured according to actual requirements. In order to further improve calculation efficiency, the retrieval data and the retrieved data can be converted into binary hash codes through a hash algorithm; binary hash codes can greatly reduce data storage overhead, and computing similarity based on binary hash codes can greatly reduce the calculation amount.
Optionally, a neural-network-based deep hashing approach may be used to obtain the hash codes of the retrieval data and the retrieved data, or a well-performing traditional hash algorithm may be used to implement the hash conversion of the data through a hash function. After the hash codes are obtained, the similarity between the retrieval data and a candidate retrieved data can be obtained by calculating the Hamming distance between their hash codes.
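For illustration, a small sketch of Hamming-distance ranking over binary hash codes is shown below, with codes represented as 0/1 numpy arrays; the packing format and the names are assumptions.

```python
import numpy as np

def hamming_distance(query_code: np.ndarray, candidate_codes: np.ndarray) -> np.ndarray:
    """query_code: (bits,) of 0/1; candidate_codes: (num_candidates, bits) of 0/1."""
    return np.count_nonzero(candidate_codes != query_code, axis=1)

query_code = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
candidate_codes = np.random.randint(0, 2, size=(5, 8)).astype(np.uint8)
distances = hamming_distance(query_code, candidate_codes)
ranking = np.argsort(distances)                          # most similar candidate first
```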
Optionally, the obtaining of the first hash code corresponding to the retrieved data and the second hash code corresponding to each candidate retrieved data may include:
acquiring a hash function, wherein the hash function is obtained through an anchor point graph hash algorithm based on a plurality of sample data in a training data set and at least two first anchor points;
based on first similarity between the retrieval data and each first anchor point, obtaining a first hash code corresponding to the retrieval data through a hash function;
and for each candidate retrieved data, obtaining a second hash code corresponding to the candidate retrieved data through the hash function based on the second similarity between the candidate retrieved data and each first anchor point.
Specifically, the feature vector of the data to be converted (retrieval data or retrieved data) may first be obtained, and the feature vector is converted into a hash code of a certain dimension by the hash function.
The anchor point graph hash algorithm is the AGH (Anchor Graph Hashing) algorithm: the sample data in the training data set are first clustered to obtain a plurality of clustering centers, the clustering centers are taken as the first anchor points, an approximate adjacency matrix is constructed by calculating the similarity between the first anchor points and the sample data, and the corresponding hash function is obtained based on the constructed approximate adjacency matrix. For the retrieval data or the retrieved data, the binary hash code corresponding to the data can be obtained through the hash function according to the similarity between the data and each first anchor point. By using anchor points in the training stage (i.e., when the hash function is obtained based on the training data set), the anchor point graph hash algorithm can significantly reduce the computational complexity of constructing the adjacency matrix and reduce the calculation amount of obtaining the hash function.
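The following is a simplified, hypothetical sketch in the spirit of the published Anchor Graph Hashing training step (truncated anchor-similarity matrix, spectral decomposition of the small induced matrix, sign binarization); the kernel bandwidth, the number s of nearest anchors, and the assumption that the number of bits is much smaller than the number of anchors are illustrative choices, not details stated in this document.

```python
import numpy as np

def train_agh(train_features, anchors, s=3, num_bits=32, sigma=1.0):
    """Learn a projection W that maps anchor similarities to binary codes (AGH-style)."""
    n, m = train_features.shape[0], anchors.shape[0]
    # Squared Euclidean distances between all training samples and all anchors.
    d2 = ((train_features ** 2).sum(1)[:, None]
          + (anchors ** 2).sum(1)[None, :]
          - 2.0 * train_features @ anchors.T)
    # Truncated anchor-similarity matrix Z: keep each sample's s nearest anchors.
    nearest = np.argsort(d2, axis=1)[:, :s]
    rows = np.repeat(np.arange(n), s)
    cols = nearest.ravel()
    Z = np.zeros((n, m))
    Z[rows, cols] = np.exp(-d2[rows, cols] / (2.0 * sigma ** 2))
    Z /= Z.sum(axis=1, keepdims=True)                    # row-normalize

    lam = Z.sum(axis=0) + 1e-12                          # anchor "degrees"
    M = (Z / np.sqrt(lam)).T @ (Z / np.sqrt(lam))        # small m x m matrix
    eigvals, eigvecs = np.linalg.eigh(M)
    keep = np.argsort(eigvals)[::-1][1:num_bits + 1]     # drop the trivial top eigenvector
    V, S = eigvecs[:, keep], eigvals[keep]               # assumes num_bits << m, so S > 0
    W = np.sqrt(n) * (1.0 / np.sqrt(lam))[:, None] * V / np.sqrt(S)
    return W                                             # shape: (m, num_bits)

def hash_code(truncated_anchor_sims, W):
    """truncated_anchor_sims: one sample's row of Z (truncated, row-normalized)."""
    return (truncated_anchor_sims @ W > 0).astype(np.uint8)
```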
In the scheme based on the AGH algorithm, the first anchor points used to determine the candidate retrieved data are obtained at the same time as the hash function is obtained based on the training data set, which can further reduce the complexity of data processing. Of course, in practical applications, other schemes may be adopted for determining the first anchor points; that is, determining the first anchor points may be combined with obtaining the hash function, or the two may be executed as independent steps.
In the optional embodiment of the present application, at least two anchor points are obtained by clustering a plurality of sample data in the training data set (the number of anchor points may be preconfigured). In the training stage, a hash function may be obtained by using the anchor point graph hash algorithm based on the similarities between the anchor points and the training samples. In the application stage of the embodiment of the present application, that is, when data retrieval is performed, the anchor points and the hash function are used not only to obtain the hash codes of the retrieval data and the retrieved data; the anchor points are also applied in the data retrieval stage itself. Specifically, by calculating the first similarity between the retrieval data and each anchor point and the second similarity between each retrieved data and each anchor point, a part of the retrieved data can be screened out from a large amount of retrieved data as candidate retrieved data, which greatly reduces the calculation amount for subsequently determining the target retrieved data and effectively improves the data retrieval efficiency; the effect is particularly prominent when the retrieved database contains large-scale retrieved data.
In practical application, hash conversion processing may be performed on the retrieved data in the retrieved database in advance, and each retrieved data and its corresponding hash code are stored in an associated manner. When the target retrieved data corresponding to some retrieval data needs to be determined, the hash code corresponding to the retrieval data can be obtained in the same hash conversion manner, the hash codes corresponding to the candidate retrieved data can be quickly obtained from the pre-stored hash codes, the similarity between the hash code of the retrieval data and the hash code of each candidate retrieved data can be calculated, and the target retrieved data can be determined according to the similarity corresponding to each candidate retrieved data.
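The offline/online split described above can be outlined as follows; this is an illustrative sketch only, and the storage layout (a plain dictionary keyed by sample id) and the `encode` callable are assumptions of the example, not the patent's prescribed implementation.

```python
import numpy as np

# Offline: hash-convert every retrieved (database) item once and store the codes.
def build_code_store(database_items: dict, encode) -> dict:
    """Map each item id to its precomputed binary hash code."""
    return {item_id: encode(features) for item_id, features in database_items.items()}

# Online: encode the query in the same way, then compare only against candidate ids.
def rank_candidates(query_features, candidate_ids, code_store: dict, encode):
    query_code = encode(query_features)
    distances = {cid: int(np.count_nonzero(query_code != code_store[cid]))
                 for cid in candidate_ids}
    # Smaller Hamming distance means higher similarity to the query.
    return sorted(distances, key=distances.get)
```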
For a better understanding of the solutions provided by the embodiments of the present application and their effects, an alternative embodiment of the present application is described below with reference to a specific example. In this embodiment, the retrieval data and the retrieved data are both illustrated by taking multimedia data (such as an advertisement, which may include at least one of an image, a text, or a video) as an example; for convenience of description, the retrieval data is referred to as a query sample, the retrieved data is referred to as database samples, and the sample data in the training data set is referred to as training samples.
Fig. 2 shows a schematic structural diagram of a data retrieval system applied in the embodiment of the present application. As shown in fig. 2, the data retrieval system may include a terminal device 10, a server 20 and a training device 30; optionally, the server 20 may be a cloud server. The terminal device 10 and the training device 30 are each in network communication with the server 20. A database 21 is configured on the server 20 side and stores massive multimedia data; the database 21 is the retrieved database in this embodiment, and the massive multimedia data is the massive retrieved data. The user can input a retrieval request on the terminal device 10 according to the retrieval requirement, the retrieval request including multimedia data as the query sample, namely the retrieval data. The terminal device 10 can send the retrieval request containing the query sample to the server 20; the server 20 can retrieve a plurality of target multimedia data matching the query sample from the database 21 and send the related information of each target multimedia data (such as the name, key information, cover information and the like of the multimedia data), namely the target retrieved data, to the terminal device 10 in descending order of the matching degree with the query sample, where it is shown to the user on the terminal device 10. The user can then process the plurality of retrieved target multimedia data as required, for example opening one or more multimedia data for viewing, selecting one or more multimedia data for publishing, or downloading one or more of the multimedia data.
The training device 30 may be any computer device, such as a terminal device or a server. The training device 30 may learn a set number of first anchor points and a hash function based on a large number of training samples in the training data set, and may send the first anchor points and the hash function to the server 20, so that the server 20 can retrieve the target retrieved data matching the query sample from the database 21 based on the data retrieval method provided in the embodiment of the present application.
Fig. 3 shows a schematic flow chart of a data retrieval method in this embodiment. An alternative embodiment of the present application is described below with reference to fig. 2 and fig. 3. As shown in fig. 3, the implementation flow of the method may include steps S310 to S340, which are specifically as follows:
Step S310: determine the first anchor points and a hash function based on the training data set.
This step, which may be referred to as the training phase, determines, based on a training data set (denoted X), a number of first anchor points and a hash function used to convert the query sample and the database samples into binary hash codes. In this embodiment, the AGH algorithm is used to determine the first anchor points and the hash function, and the training process may include the following steps 1 to 6.
Step 1: input data is acquired.
The input data in the training phase includes the training data set, the preset number of anchor points (denoted m in this embodiment, where m is greater than or equal to 2), the preset number s of neighboring anchor points per sample, the hash code length l, and the Gaussian kernel bandwidth t.
Here, m is the number of clusters obtained by clustering the training samples in the training data set, that is, the number of clustering centers. s is the number of anchor points, determined from the first anchor points according to the similarity between a training sample and each first anchor point, that are most adjacent to that training sample; that is, in descending order of similarity, the s first anchor points corresponding to the s largest similarities are taken as the s anchor points most adjacent to the training sample (which may be called the first s neighboring anchor points or the nearest s anchor points). The hash code length l is the dimension/length of the binary hash code obtained through hash function conversion, and the Gaussian kernel bandwidth t is a hyper-parameter of the AGH algorithm.
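For concreteness, these training-stage inputs can be grouped as shown below; the values are the ones used later in the experiment section of this application (the hash code length, which is varied there, aside), and the container itself is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass
class AGHTrainConfig:
    m: int = 300     # number of anchor points / cluster centers
    s: int = 2       # neighboring anchor points kept per sample
    l: int = 16      # binary hash code length (varied in the experiments)
    t: float = 1e3   # Gaussian kernel bandwidth
```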
Step 2: and clustering the training data set to obtain m clustering centers.
Optionally, K-means clustering may be performed on the training data set X to obtain m clustering centers, which are taken as the first anchor points and denoted

$$
U = \{u_1, u_2, \dots, u_m\}, \qquad u_j \in \mathbb{R}^{d}
$$

It is to be understood that, in actual calculation, a first anchor point refers to the feature vector of that anchor point; in this embodiment, the dimension of the feature vector of each first anchor point is d, and U denotes the matrix formed by the feature vectors of the m first anchor points.
Step 3: calculate the similarity between each training sample and each anchor point.
Taking the i-th sample x_i in the training data set X as an example (the calculation specifically uses the feature vector of the sample), the similarity between sample x_i and the first anchor points U is denoted Z_i and is defined as:

$$
Z_{ij} = \begin{cases} \dfrac{\exp\!\left(-\lVert x_i - u_j \rVert^2 / t\right)}{\sum_{j' \in \langle i \rangle} \exp\!\left(-\lVert x_i - u_{j'} \rVert^2 / t\right)}, & j \in \langle i \rangle \\[2mm] 0, & \text{otherwise} \end{cases} \tag{1}
$$

where the set ⟨i⟩ contains the indices of the s anchor points nearest to x_i (s ≪ m), j denotes the j-th of the m first anchor points, u_j denotes the feature vector of the j-th first anchor point, Z_ij denotes the similarity between sample x_i and the j-th first anchor point, and t is the Gaussian kernel bandwidth. In this embodiment, the similarity between a sample and a first anchor point is determined according to the L2 distance between the sample and the first anchor point.
Based on the formula (1), the similarity between each training sample and each first anchor point can be calculated, and based on the similarity corresponding to each first anchor point, s anchor points most adjacent to each training sample can be determined.
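A possible NumPy sketch of equation (1) is shown below; the truncated Gaussian-kernel form follows the AGH construction from the "Hashing with Graphs" paper cited by this application, and the function and variable names are illustrative assumptions rather than the patent's code.

```python
import numpy as np

def anchor_similarities(x: np.ndarray, anchors: np.ndarray, s: int, t: float) -> np.ndarray:
    """Row vector Z_i: truncated, normalized Gaussian similarities of sample x
    to its s nearest anchors (all other entries remain zero)."""
    m = anchors.shape[0]
    sq_dists = np.sum((anchors - x) ** 2, axis=1)   # squared L2 distance to each anchor
    nearest = np.argsort(sq_dists)[:s]              # indices of the s nearest anchors
    z = np.zeros(m)
    weights = np.exp(-sq_dists[nearest] / t)        # Gaussian kernel with bandwidth t
    z[nearest] = weights / weights.sum()            # normalize over the s neighbors
    return z
```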
As an example, as shown in fig. 4, assume s = 2; fig. 4 shows 4 first anchor points, i.e. u1 to u4, and the small unfilled circles in fig. 4 represent the individual training samples. Taking x1 as an example, after the feature vector of each first anchor point and the feature vector of the sample are obtained, the similarity calculation formula may be used to determine the 2 anchor points nearest to sample x1, i.e. the 2 anchor points with the greatest similarity to sample x1, such as anchor points u1 and u2 shown in fig. 4. Here Z11 denotes the similarity between sample x1 and anchor point u1, and Z12 denotes the similarity between sample x1 and anchor point u2.
Step 4: construct an approximate adjacency matrix based on the similarities between the training samples and the anchor points.
Based on the similarities of each training sample to each first anchor point calculated in step 3, a matrix Z is obtained, in which the number of rows equals the number of training samples and the number of columns equals the number of anchor points; each row of Z represents the similarities of one sample to the m first anchor points, of which s are non-zero. Based on Z, an approximate adjacency matrix M can be constructed, which may be expressed as:

$$
M = \Lambda^{-1/2}\, Z^{\top} Z\, \Lambda^{-1/2}, \qquad \Lambda = \operatorname{diag}\!\left(Z^{\top}\mathbf{1}\right)
$$
and 5: based on the approximate adjacency matrix, a projection matrix is constructed.
Specifically, the matrix M is subjected to eigendecomposition to obtain its first l eigenvectors V = [v_1, …, v_l] and the corresponding eigenvalues Σ = diag(σ_1, …, σ_l). Based on the matrices V and Σ, a projection matrix W is constructed, which may be expressed as:

$$
W = \sqrt{n}\; \Lambda^{-1/2}\, V\, \Sigma^{-1/2}
$$
step 6: and obtaining a hash function based on the projection matrix.
After the projection matrix W is obtained, the hash function of AGH is

$$
h(x_i) = \operatorname{sgn}\!\left(Z_i\, W\right)
$$

Thus, for a given new sample x_i (e.g., a query sample or a database sample), the similarity Z_i between the new sample and the first anchor points can be calculated as in step 3; Z_i is then multiplied by W and quantized by the sign function sgn(·) to obtain the corresponding binary hash code, where the sign function is defined as:

$$
\operatorname{sgn}(v) = \begin{cases} 1, & v > 0 \\ -1, & v \le 0 \end{cases}
$$
based on the above steps, a hash function and m first anchor points (i.e., feature vectors of m anchor points), i.e., the U, can be obtained.
It can be understood that step S310 need not be executed each time the query result corresponding to a query sample is determined: this step may be executed in advance and the obtained result stored. Of course, this step may also be executed periodically according to actual application requirements to update the first anchor points and the hash function. For example, new training samples may be added so that the first anchor points and the hash function are updated based on more training samples. For another example, for different application scenarios, the training data set of the corresponding application scenario may be used to determine the corresponding first anchor points and hash function by performing this step.
Step S320: determine the hash codes corresponding to the query sample and each database sample based on the anchor points and the hash function.
The input data for this step includes: the database samples X_db (n samples, each with a d-dimensional feature vector), the query sample x_q, the first anchor points U = {u_1, …, u_m}, the projection matrix W, and the number s of neighboring anchor points per sample.
In this embodiment, the number of database samples is n, and d denotes the dimension of the feature vector of each database sample. The query sample is the retrieval data, that is, the reference data on which the retrieval is based. The number of neighbors s in this step refers, for the query sample, to the number of anchor points matching the query sample determined from the first anchor points according to the first similarity between the query sample and each first anchor point, and, for a database sample, to the number of anchor points matching the database sample determined from the first anchor points according to the second similarity between the database sample and each first anchor point.
For each database sample X_db, the similarity Z_db between the database sample and the anchor points U can be calculated using the above equation (1), and the similarity Z_q between the query sample x_q and the anchor points U can be calculated in the same way. Based on the calculated similarity with each first anchor point, the hash code corresponding to each sample can be obtained through hash function conversion. Specifically, for each database sample X_db, Z_db is multiplied by W and passed through the sign function to obtain the corresponding hash code b_db = sgn(Z_db W) (i.e., the second hash code in the foregoing); for the query sample x_q, Z_q is multiplied by W and passed through the sign function to obtain the hash code b_q = sgn(Z_q W) corresponding to the query sample (i.e., the first hash code in the foregoing). Through the above steps, the hash code corresponding to each database sample in the database and the hash code b_q corresponding to the query sample can be obtained; the hash codes corresponding to the n database samples are recorded as the database hash codes B_db, and the length of the binary hash code of each sample (query sample / database sample) is l.
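Using the illustrative helpers sketched above (anchor_similarities and hash_code), step S320 amounts to encoding every database sample and the query sample with the same anchor points and projection matrix; the function below and its names are an assumption of this sketch, not the patent's code.

```python
import numpy as np

def encode_step_s320(X_db: np.ndarray, x_q: np.ndarray, U, W, s: int, t: float):
    """Return the database hash codes B_db (n x l) and the query hash code b_q (l,)."""
    B_db = np.stack([hash_code(x, U, W, s, t) for x in X_db])  # second hash codes
    b_q = hash_code(x_q, U, W, s, t)                           # first hash code
    return B_db, b_q
```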
Step S330: determine the candidate database samples corresponding to the query sample based on the anchor points.
Step S340: determine target database samples based on the candidate database samples to obtain the query result.
For step S330 and step S340, the present embodiment provides two different alternatives, as shown in fig. 5 and fig. 7, respectively. The two schemes are described below with reference to fig. 5 and 7, respectively.
The approach shown in fig. 5 may be referred to as anchor-based hash lookup, and may include the following steps:
step S510: a subset of neighboring samples (i.e., a subset of data in the foregoing) corresponding to each first anchor point is determined.
To speed up data retrieval, this alternative screens out a meaningful subset of candidate database samples (i.e., the candidate retrieved data) by mining the potential connections between samples through their association with the first anchor points. The number of samples in the candidate database sample subset will be significantly smaller than the number n of database samples, which effectively reduces the complexity of subsequent calculation.
Specifically, in this step, the similarity between each sample (the query sample and each database sample) and each first anchor point may be calculated, and the first s neighboring anchor points corresponding to each sample (i.e., at least one anchor point matching the sample, s of them in this embodiment) are determined. Optionally, the similarity between a sample and a first anchor point may be determined by calculating the distance (e.g., the L2 distance) between the feature vector of the sample and the feature vector of the first anchor point, and the first anchor points corresponding to the top s largest similarities are determined as the first s neighboring anchor points of the sample. Optionally, the similarity between the query sample, as well as each database sample, and each first anchor point may be calculated by formula (1) in the foregoing, and the first s neighboring anchor points corresponding to each sample are determined according to the calculation result; that is, at least one anchor point matching the sample is determined from the plurality of first anchor points according to the similarity between the sample and each first anchor point.
The scheme shown in fig. 5 is based on the following assumption: if two samples fall into one or more identical neighboring anchor points (i.e. the two samples share one or more neighboring anchor points), their original features are similar with high probability, and therefore they should also maintain a similar relationship after hash coding. Thus, in the retrieval phase, each sample in the candidate database sample subset S_q (i.e., each candidate retrieved data) should share one or more identical neighboring anchor points with the query sample x_q. Based on this, after the first s neighboring anchor points of each database sample are determined, the neighboring sample subset of each first anchor point (i.e., the data subset corresponding to that first anchor point) can be determined; for a given first anchor point, the samples (database samples, i.e. the retrieved data) in its neighboring sample subset are the samples whose first s neighboring anchor points contain that first anchor point. As an example, assume s is 2, the first 2 neighboring anchor points of one database sample are the first anchor points u1 and u2, and the first 2 neighboring anchor points of another database sample are the first anchor points u2 and u4; for the first anchor point u2, since anchor point u2 appears among the first 2 neighboring anchor points of both database samples, both database samples belong to its neighboring sample subset.
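One way to materialize the per-anchor data subsets is an inverted index from anchor index to the ids of database samples whose top-s anchors include it; the sketch below is an assumption about data layout for illustration, not the patent's specification.

```python
import numpy as np
from collections import defaultdict

def build_anchor_subsets(Z_db: np.ndarray, s: int) -> dict:
    """Map each anchor index j to the set of database sample indices whose
    top-s neighboring anchors (largest similarities in Z_db) include j."""
    subsets = defaultdict(set)
    for i, z in enumerate(Z_db):
        top_s = np.argsort(z)[::-1][:s]      # the s most similar anchors for sample i
        for j in top_s:
            subsets[int(j)].add(i)
    return subsets
```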
Step S520: determine the target anchor points corresponding to the query sample.
Step S530: determine each candidate database sample corresponding to the query sample.
The target anchor points corresponding to the query sample are the first s neighboring anchor points of the query sample (i.e., at least one target anchor point matching the retrieval data). In practical applications, the value of s used for the first s neighboring anchor points of the query sample may be the same as or different from the value of s used for the first s neighboring anchor points of the database samples.
After the first s neighboring anchor points of the query sample are determined, the database samples included in the neighboring sample subset corresponding to the s anchor points may be determined as candidate database samples of the query sample. That is, the candidate database sample subset corresponding to the query sample contains the neighboring sample subset corresponding to the s anchors.
Step S540: determine target database samples from the candidate database samples to obtain the retrieval result of the query sample, namely the query result.
Optionally, the hamming distance between the query sample and each candidate database sample may be calculated, and the candidate database sample whose hamming distance is smaller than the set distance threshold is used as the search result of the query sample.
An alternative implementation flow of steps S510 to S540 may be as follows:
1) Calculate the similarity Z_db between each database sample X_db and the anchor points U;

2) Based on Z_db, record for each first anchor point u_j the indices of its neighboring database samples, N_j = { i | u_j is among the first s neighboring anchor points of x_i }, i.e. its neighboring sample subset; here u_j denotes the j-th first anchor point, x_i denotes the i-th database sample in the retrieved database, and i is the index of that sample. Through this step, the index of each sample in the neighboring sample subset corresponding to each first anchor point, that is, the identifier of each database sample in the data subset corresponding to that first anchor point, can be recorded as {N_1, …, N_m};

3) Calculate the similarity Z_q between the query sample x_q and the anchor points, and retain the indices A_q of the first anchor points neighboring x_q, i.e. the indices of the first s neighboring anchor points of x_q;

4) Determine the sample subset S_q = ∪_{j ∈ A_q} N_j corresponding to the query sample x_q, that is, the set of indices of the candidate database samples corresponding to the query sample; the database samples corresponding to this set are the database samples in the neighboring sample subsets of the first s neighboring anchor points of x_q;

5) Calculate the Hamming distance between the query sample x_q and each database sample in its sample subset S_q, and take the database samples whose corresponding Hamming distance is smaller than the distance threshold r (i.e. the Hamming radius), namely the target retrieved data, as the search result R_q.
As an example, the principle of the scheme shown in fig. 5 is illustrated in fig. 6. As shown in fig. 6, assume s = 2 and a given query sample x_q whose first 2 neighboring anchor points are u1 and u3; u1 is a neighboring anchor point (one of the first 2 neighboring anchor points) of database samples x1 and x2, u2 is a neighboring anchor point of database samples x3 and x4, u3 is a neighboring anchor point of database samples x5 and x6, and u4 is a neighboring anchor point of database sample x6. In this example, the target anchor points corresponding to the query sample are anchor points u1 and u3; the neighboring sample subset, i.e. the data subset, corresponding to anchor point u1 comprises database samples x1 and x2, and the neighboring sample subset corresponding to anchor point u3 comprises database samples x5 and x6. Thus the set of database samples {x1, x2, x5, x6} is the sample subset S_q of the query sample x_q, and each database sample in this subset is a candidate retrieved data.
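Putting the pieces together, the anchor-based hash lookup of fig. 5 (steps 1 to 5 above) might look like the following sketch; the names Z_db, B_db and the helper functions come from the earlier illustrative snippets and are assumptions of this example, and in practice the database codes and subsets would be precomputed offline.

```python
import numpy as np

def anchor_hash_lookup(x_q, X_db, U, W, s: int, t: float, r: int):
    """Return indices of database samples within Hamming radius r of the query,
    searching only the candidate subset S_q selected via shared neighboring anchors."""
    Z_db = np.stack([anchor_similarities(x, U, s, t) for x in X_db])   # 1) similarities
    subsets = build_anchor_subsets(Z_db, s)                            # 2) per-anchor subsets N_j
    z_q = anchor_similarities(x_q, U, s, t)                            # 3) query-anchor similarities
    A_q = np.argsort(z_q)[::-1][:s]                                    #    top-s anchors of the query
    S_q = set().union(*(subsets.get(int(j), set()) for j in A_q))      # 4) candidate subset
    b_q = hash_code(x_q, U, W, s, t)
    B_db = np.stack([hash_code(x, U, W, s, t) for x in X_db])
    # 5) keep candidates whose Hamming distance to the query is below the radius r
    return [i for i in S_q if np.count_nonzero(B_db[i] != b_q) < r]
```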
Based on the above alternative provided by the embodiments of the present application, the number of samples in the candidate sample subset S_q corresponding to the query sample is obviously smaller than the total number n of database samples, so that the computational complexity of the hash lookup is determined by the subset size rather than by n; compared with the existing hash lookup scheme, the complexity is greatly reduced. In addition to the advantage in computational efficiency, the similarity quality of the search results returned by this scheme with respect to the query sample is higher, because the query sample and the subset share the same anchor points in the original feature space, and this constraint can compensate for the information loss caused by the dimension reduction of hash coding. Therefore, based on this scheme, the search efficiency can be improved while a better search result is ensured.
Fig. 7 illustrates another approach, provided by the present embodiment, to determining the query result for a query sample. Optionally, the similarity between data items in this scheme may be determined based on the Hamming distance between them, and the scheme may be referred to as anchor-based Hamming ranking.
In order to speed up the retrieval efficiency, this scheme also selects a subset, in which the database samples are ordered and whose size may be exactly equal to the preset number K of samples to be included in the query result. The principle of the scheme is as follows: the nearest neighbor anchor point of a sample (the nearest neighbor anchor point u_j, i.e. the best-matched second anchor point, i.e. the anchor point with the greatest similarity) can approximate the characteristics of the sample itself to some extent, i.e. x_i ≈ u_j; based on this, sample x_i can be assigned to its nearest neighbor anchor point u_j, and the global ordering of {x_i} (i.e., of all retrieved data / database samples, by similarity to the query sample) can be abstracted as the global ordering of the anchor points {u_j}. Based on this principle, the samples are approximately ordered globally (i.e. between anchor points), but the ordering inside each anchor point incurs a certain performance loss. Considering this problem, the scheme selects the first K samples on the basis of the global ordering (these K samples are both the candidate database samples and the target database samples) and orders the first K samples again, so as to return a retrieval result. As shown in fig. 7, the scheme may include the following steps:
step S710: determining the nearest neighbor anchor point corresponding to each database sample, namely the anchor point corresponding to the maximum similarity in the similarity between the database sample and each first anchor point, namely a second anchor point;
Optionally, based on the foregoing equation (1), the similarity Z_db between each database sample X_db and the anchor points U can be calculated, and for each sample x_i ∈ X_db the index j of its nearest neighbor anchor point is retained; that is, among the first anchor points, the anchor point with the maximum similarity to the database sample x_i (equivalently, the anchor point with the minimum distance, e.g. Hamming distance, to the database sample x_i) is the j-th anchor point, which is the second anchor point, among the plurality of first anchor points, that best matches the database sample x_i.
Step S720: determining the similarity between the query sample and each first anchor point;
Similarly, the similarity Z_q between the query sample x_q and the anchor points U may be calculated based on equation (1) above, and the distances D_q between the query sample and the respective first anchor points are retained.
Step S730: sequencing each database sample based on the similarity between the query sample and each first anchor point;
In this step, the first anchor points are ordered based on D_q, and the order of the first anchor points is then applied to the database samples so as to order the database samples as a whole. Specifically, the first anchor points may be sorted based on D_q in descending order of similarity, that is, in ascending order of distance; the indices of the sorted anchor point sequence are denoted {1, …, m}, and d_1, …, d_m respectively denote the distances between the sorted 1st to m-th first anchor points and the query sample. Then, the ordering of the anchor points {1, …, m} is applied to the database samples X_db to obtain the ordered sample sequence {1, …, n}; that is, the database samples are ordered according to the ordering of the first anchor points, and the ordering of each database sample follows the ordering of its nearest neighbor anchor point.
Step S740: take out the first K samples from the sorted database samples, sort these K samples again based on their similarity to the query sample (i.e. perform a second round of sorting), and return the sorted database samples as the retrieval result.
Optionally, the first K samples are taken from the sample sequence {1, …, n} that has been sorted once according to the ordering of the first anchor points; based on the hash codes B_K corresponding to these K samples and the hash code corresponding to the query sample, the Hamming distance between the query sample and each of the K database samples is calculated, the K database samples are sorted in ascending order of the corresponding Hamming distance, and the sorted sample sequence is taken as the retrieval result R_q corresponding to the query sample.
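The anchor-based Hamming ranking of fig. 7 can be sketched as below; the helper names (anchor_similarities from the earlier snippet), the precomputed inputs B_db and b_q, and the tie-handling are assumptions of this illustration rather than the patent's code.

```python
import numpy as np

def anchor_hamming_ranking(x_q, X_db, B_db, b_q, U, s: int, t: float, K: int):
    """Coarse-order database samples by the query's ranking of their nearest anchors,
    then re-rank the first K of them by exact Hamming distance to the query code."""
    Z_db = np.stack([anchor_similarities(x, U, s, t) for x in X_db])
    nearest_anchor = Z_db.argmax(axis=1)                 # S710: best-matched anchor per sample
    d_q = np.sum((U - x_q) ** 2, axis=1)                 # S720: query-to-anchor distances
    anchor_rank = {int(j): r for r, j in enumerate(np.argsort(d_q))}  # closer anchor -> smaller rank
    coarse = sorted(range(len(X_db)),
                    key=lambda i: anchor_rank[int(nearest_anchor[i])])  # S730: global ordering
    top_k = coarse[:K]                                   # S740: take the first K samples
    return sorted(top_k, key=lambda i: int(np.count_nonzero(B_db[i] != b_q)))
```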
As an example, the principle of the scheme shown in fig. 7 is illustrated in fig. 8a, 8b and 8c. As shown in fig. 8a to 8c, x1 to x6 represent 6 database samples and u1 to u4 represent 4 first anchor points; u1 is the nearest neighbor anchor point of database samples x1 and x2, u2 is the nearest neighbor anchor point of database samples x3 and x4, u3 is the nearest neighbor anchor point of database sample x5, and u4 is the nearest neighbor anchor point of database sample x6. Given a query sample x_q, the first anchor points are first sorted based on their similarity to the query sample x_q; the four anchor points shown by the dashed boxes in fig. 8a are sorted in descending order of similarity, so that the anchor points become ordered. The ordered anchor points are shown by the dashed boxes in fig. 8b, and the resulting order of the 4 anchor points is u3, u1, u4 and u2; correspondingly, each database sample is permuted according to the ordering of its nearest neighbor anchor point, so that, as shown in fig. 8b, database sample x5, whose nearest neighbor anchor point is u3, is arranged at the first position. The first K database samples are then truncated; K = 3 in this example, and the first 3 database samples are x5, x1 and x2 (the first K shown in fig. 8b). Then, based on the hash code of the query sample x_q and the hash codes of x5, x1 and x2, the Hamming distances between x_q and these 3 database samples are respectively calculated, and the samples are sorted again in ascending order of Hamming distance; the sorted Top-K database samples are taken as the final search result (i.e. the query result) of the query sample. The search result in this example, shown by the dashed box in fig. 8c, is the ordered database samples x2, x5 and x1, i.e. the target retrieved data.
In the above alternative embodiment provided by the present application, the computational complexity of the two ordering operations is determined by the number m of anchor points and by K respectively; the number m of anchor points is far smaller than the number n of samples, i.e. the number of retrieved data in the database, and in practical applications, when only the first K samples are of interest, both m and K are far smaller than n. Compared with the prior art, the computational complexity of this scheme can therefore be greatly reduced, and the retrieval efficiency is greatly improved. Even in the special case in which K is set as large as the number n of samples, the computational complexity of the query scheme provided by the embodiment of the present application does not exceed that of the prior art.
To demonstrate the effectiveness of the scheme provided by the present application, relevant experiments were performed on the handwritten digit data set MNIST. The data set comprises 60,000 (6w, where 1w denotes ten thousand) samples as a training set and the remaining 10,000 samples as a test set, covering 10 classes; each picture contains 28x28 pixel points, i.e. 784-dimensional features. The 60,000 training samples are used both for training and as the database X_db (i.e. the retrieved data in the retrieved database), and the remaining 10,000 samples are used as the query set X_q. The two corresponding retrieval schemes provided by the embodiment of the present application in fig. 5 and fig. 7 are verified respectively, taking two existing schemes as the baseline standards (Baseline): the hash lookup manner (based on a hash table data structure: each hash code is regarded as a bucket, samples falling into a bucket share the same hash code, and the database samples falling into the same bucket as the query sample x_q are taken as the retrieval result) and the Hamming sorting manner (the Hamming distances between the query sample and all samples in the database are calculated first, the database samples are then sorted in ascending order of Hamming distance, and the first K samples (K ≪ n) are returned as the retrieval result). The training parameters of the AGH algorithm are set as follows: the number of anchor points m is 300, s is 2, and the Gaussian kernel bandwidth t is 1e3.
For the anchor-based hash lookup scheme provided in the embodiment of the present application, Precision@r (i.e., the retrieval accuracy when the Hamming radius is r) is used as the evaluation index in the experiment. The existing hash lookup (Baseline) and the anchor-based hash lookup provided in the present application (proposed) are evaluated respectively, the Hamming radius r is set to 1, 2, 4 and 8, and the average retrieval time of each retrieval strategy is calculated; the experimental results are shown in Table 1 below:
TABLE 1
As can be seen from Table 1, the anchor-based hash lookup provided by the present application can significantly reduce the search overhead while ensuring the search accuracy, with an average reduction of 83%; moreover, at lower hash code lengths (8 bits and 16 bits), its retrieval accuracy can surpass that of the original hash lookup method, and as the hash code length increases, its performance becomes equal to that of the original method. The above experiment verifies the validity of the anchor-based hash lookup retrieval strategy.
For the anchor-based Hamming ranking scheme provided by the embodiment of the present application, MAP@K (mean average retrieval precision, where K denotes the specified number of samples to be returned) is adopted as the evaluation index in the experiment; K is set to 500, 1000 and 2000 respectively, and the retrieval time of the two schemes under each condition is calculated. The experimental results are shown in Table 2 below:
TABLE 2
As can be seen from table 2, the hamming ordering strategy based on anchor points proposed in the present application can significantly accelerate the retrieval process under the premise of sacrificing a certain performance. Particularly, when the hash code length is low, the search efficiency is greatly improved, and the search result is relatively stable (reduction < 2%). The retrieval time on MAP @500, MAP @1000, MAP @2000 is improved by 10.2, 7.3 times on average.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application further provides a data retrieval apparatus, as shown in fig. 9, the data retrieval apparatus 100 may include a similarity determination module 110, a candidate data determination module 120, and a query result determination module 130. Wherein:
a similarity determining module 110, configured to obtain search data, and determine a first similarity between the search data and each of at least two first anchors, where the at least two first anchors are at least two clustering centers obtained by clustering a plurality of sample data in a training data set;
a candidate data determining module 120, configured to obtain a second similarity between each retrieved data in the retrieved database and each first anchor point; determining candidate retrieved data matched with the retrieved data from the retrieved database according to the first similarity of the retrieved data corresponding to each first anchor point and the second similarity of each retrieved data corresponding to each first anchor point;
a retrieval result determining module 130, configured to determine at least one target retrieved data that matches the retrieved data from each of the candidate retrieved data.
Optionally, the candidate data determining module 120 may be specifically configured to: for each piece of data to be retrieved, determining at least one anchor point matched with the data to be retrieved from each first anchor point according to the second similarity of the data to be retrieved corresponding to each first anchor point; for each first anchor point, determining each data to be retrieved matched with the first anchor point as a data subset corresponding to the first anchor point; determining at least one target anchor point matched with the retrieval data from the first anchor points according to the first similarity of the retrieval data corresponding to the first anchor points; and determining the data to be searched in the data subset corresponding to each target anchor point as the data to be searched candidate matched with the searched data.
Optionally, the candidate data determining module 120 may be specifically configured to: for each piece of data to be retrieved, determining a second anchor point which is most matched with the data to be retrieved from each first anchor point according to the second similarity of the data to be retrieved corresponding to each first anchor point; and determining each candidate data to be retrieved from the data to be retrieved corresponding to at least one first anchor point according to the first similarity corresponding to each first anchor point and the sequence of the similarity from big to small, wherein if one first anchor point is a second anchor point corresponding to the data to be retrieved, the data to be retrieved is the data to be retrieved corresponding to the first anchor point.
Optionally, when the candidate data determining module 120 determines, according to the first similarity corresponding to each first anchor point and according to the sequence of similarity from large to small, each candidate retrieved data from the retrieved data corresponding to each first anchor point, any one of the following may be executed:
according to the sequence of similarity from big to small, sorting the retrieved data corresponding to each first anchor point, and determining the first set number of retrieved data in the sorted retrieved data as the retrieved data candidates matched with the retrieval data;
and according to the sequence of similarity from large to small, determining the retrieved data corresponding to a second set number of first anchor points ranked in the front as the candidate retrieved data matched with the retrieval data.
Optionally, the query result determining module 130 may be specifically configured to: determining a third similarity between the retrieval data and each candidate retrieved data, and determining at least one target retrieved data matched with the retrieval data from each candidate retrieved data based on the third similarity corresponding to each candidate retrieved data; alternatively, each candidate retrieved data is set as at least one target retrieved data.
Optionally, the query result determining module 130, when determining the third similarity between the retrieved data and each candidate retrieved data, may be configured to:
acquiring a first feature vector of retrieval data and a second feature vector of each candidate retrieved data; and for each candidate retrieved data, obtaining a third similarity between the retrieval data and the candidate retrieved data according to the first feature vector and the second feature vector of the candidate retrieved data.
Optionally, the query result determining module 130, when determining the third similarity between the retrieved data and each candidate retrieved data, may be configured to:
acquiring a first hash code corresponding to the retrieval data and a second hash code corresponding to each candidate retrieved data; and for each candidate retrieved data, determining a Hamming distance between the first Hash code and a second Hash code corresponding to the candidate retrieved data, and determining a third similarity between the retrieved data and the candidate retrieved data according to the Hamming distance.
Optionally, when the query result determining module 130 obtains the first hash code corresponding to the retrieved data and the second hash code corresponding to each candidate retrieved data, it may be configured to:
acquiring a hash function, wherein the hash function is obtained through an anchor point graph hash algorithm based on a plurality of sample data in a training data set and at least two first anchor points; based on first similarity between the retrieval data and each first anchor point, obtaining a first hash code corresponding to the retrieval data through a hash function; and for each candidate retrieved data, obtaining a second hash code corresponding to the candidate retrieved data through a hash function based on the second similarity between the candidate retrieved data and each first anchor point.
Optionally, the query result determination module 130 may be configured to: sorting the candidate retrieved data according to the sequence that the similarity of the candidate retrieved data and the retrieved data is from big to small; and sequencing the sequenced candidate retrieved data to be used as at least one target retrieved data.
Based on the same principle as the data retrieval method and the data retrieval device provided by the embodiments of the present application, the embodiments of the present application also provide an electronic device, which may include a memory and a processor, where the memory stores a computer program, and the processor, when running the computer program, is configured to execute the data retrieval method provided by any of the alternative embodiments of the present application, or is configured to execute actions performed by the device provided by any of the alternative embodiments of the present application.
As an alternative embodiment, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device may execute the data query method provided in any alternative embodiment of the present application. As shown in fig. 10, the electronic device 4000 may include a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (field programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application program codes (computer programs) for executing the present scheme, and is controlled by the processor 4001 to execute. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
Embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program, which, when run on a computer, enables the computer to perform the method provided in any of the alternative embodiments of the present application.
Embodiments of the present application also provide a computer product comprising a computer program that, when executed by a processor, performs the method provided in any of the alternative embodiments of the present application.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application also provides a computer program product or a computer program, which includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data retrieval method provided in any of the alternative embodiments of the present application.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (13)

1. A method for data retrieval, the method comprising:
acquiring retrieval data, and determining first similarity between the retrieval data and each of at least two first anchor points, wherein the at least two first anchor points are at least two clustering centers obtained by clustering a plurality of sample data in a training data set;
acquiring a second similarity between each retrieved data in the retrieved database and each first anchor point;
determining candidate retrieved data matched with the retrieval data from the retrieved database according to the first similarity of the retrieval data corresponding to the first anchor points and the second similarity of each retrieved data corresponding to the first anchor points;
at least one target retrieved data matching the retrieved data is determined from each of the candidate retrieved data.
2. The method of claim 1, wherein determining from the retrieved database respective candidate retrieved data that match the retrieved data based on a first similarity of the retrieved data to the respective first anchor points and a second similarity of each of the retrieved data to the respective first anchor points comprises:
for each piece of retrieved data, determining at least one anchor point matched with the retrieved data from each first anchor point according to the second similarity of the retrieved data corresponding to each first anchor point;
for each first anchor point, determining each retrieved data matched with the first anchor point as a data subset corresponding to the first anchor point;
determining at least one target anchor point matched with the retrieval data from each first anchor point according to the first similarity of the retrieval data corresponding to each first anchor point;
and determining the searched data in the data subset corresponding to each target anchor point as the searched data candidate matched with the searched data.
3. The method of claim 1, wherein determining from the retrieved database respective candidate retrieved data that match the retrieved data based on a first similarity of the retrieved data to the respective first anchor points and a second similarity of each of the retrieved data to the respective first anchor points comprises:
for each piece of retrieved data, determining a second anchor point which is most matched with the retrieved data from each anchor point according to the second similarity of the retrieved data corresponding to each first anchor point;
and determining each candidate data to be retrieved from the data to be retrieved corresponding to at least one first anchor point according to the first similarity corresponding to each first anchor point and the sequence of the first similarities from large to small, wherein if one first anchor point is a second anchor point corresponding to the data to be retrieved, the data to be retrieved is the data to be retrieved corresponding to the first anchor point.
4. The method according to claim 3, wherein the determining, according to the first similarity corresponding to each first anchor point and in order of decreasing first similarity, each data candidate to be retrieved from the retrieved data corresponding to each first anchor point comprises any one of:
according to the sequence of the first similarity from big to small, sorting the retrieved data corresponding to each first anchor point, and determining the first set number of retrieved data at the front of the sorted retrieved data as the retrieved data candidates matched with the retrieved data;
and according to the sequence of the first similarity from big to small, determining the searched data corresponding to a second set number of first anchor points which are ranked in the front as the searched data candidates matched with the searched data.
5. The method according to claim 1 or 2, wherein the determining at least one target retrieved data from each of the candidate retrieved data that matches the retrieved data comprises:
determining a third similarity between the retrieved data and each of the candidate retrieved data;
determining at least one target retrieved data matched with the retrieval data from each candidate retrieved data based on the third similarity corresponding to each candidate retrieved data;
alternatively, each of the candidate retrieved data is taken as the at least one target retrieved data.
6. The method of claim 5, wherein determining a third similarity between the retrieved data and each of the candidate retrieved data comprises:
acquiring a first feature vector of the retrieval data and a second feature vector of each candidate retrieved data;
and for each candidate retrieved data, obtaining a third similarity between the retrieved data and the candidate retrieved data according to the first feature vector and a second feature vector of the candidate retrieved data.
7. The method of claim 5, wherein determining a third similarity between the retrieved data and each of the candidate retrieved data comprises:
acquiring a first hash code corresponding to the retrieval data and a second hash code corresponding to each candidate retrieved data;
for each candidate retrieved data, determining a hamming distance between the first hash code and a second hash code corresponding to the candidate retrieved data;
and for each candidate searched data, determining a third similarity between the searched data and the candidate searched data according to the Hamming distance corresponding to the candidate searched data.
8. The method of claim 7, wherein the obtaining a first hash code corresponding to the retrieved data and a second hash code corresponding to each of the candidate retrieved data comprises:
acquiring a hash function, wherein the hash function is obtained through an anchor point graph hash algorithm based on a plurality of sample data in the training data set and the at least two first anchor points;
obtaining a first hash code corresponding to the retrieval data through the hash function based on first similarity between the retrieval data and each first anchor point;
and for each candidate retrieved data, obtaining a second hash code corresponding to the candidate retrieved data through the hash function based on a second similarity between the candidate retrieved data and each first anchor point.
9. The method according to any one of claims 1 to 4, wherein the determining at least one target retrieved data that matches the retrieved data from each of the candidate retrieved data comprises:
sorting the candidate retrieved data according to the sequence that the similarity of the candidate retrieved data and the retrieved data is from large to small;
and sequencing each piece of the sorted candidate retrieved data to be used as the at least one piece of target retrieved data.
10. A data retrieval device, the device comprising:
the similarity determination module is used for acquiring retrieval data and determining first similarities between the retrieval data and each of at least two first anchor points, wherein the at least two first anchor points are at least two clustering centers obtained by clustering a plurality of sample data in a training data set;
the candidate data determining module is used for acquiring a second similarity between each piece of retrieved data in the retrieved database and each first anchor point; determining candidate retrieved data matched with the retrieval data from the retrieved database according to the first similarity of the retrieval data corresponding to the first anchor points and the second similarity of each retrieved data corresponding to the first anchor points;
and the retrieval result determining module is used for determining at least one target retrieved data matched with the retrieval data from the candidate retrieved data.
11. An electronic device, comprising a memory in which a computer program is stored and a processor, which, when running the computer program, is configured to perform the method of any of claims 1 to 9.
12. A computer-readable storage medium, in which a computer program is stored which, when run on a processor, is adapted to carry out the method of any one of claims 1 to 9.
13. A computer product comprising a computer program, characterized in that the computer program, when executed by a processor, performs the method of any of claims 1 to 9.
CN202111130352.5A 2021-09-26 2021-09-26 Data retrieval method and device, electronic equipment, storage medium and computer product Pending CN113590898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111130352.5A CN113590898A (en) 2021-09-26 2021-09-26 Data retrieval method and device, electronic equipment, storage medium and computer product


Publications (1)

Publication Number Publication Date
CN113590898A true CN113590898A (en) 2021-11-02

Family

ID=78242371

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111130352.5A Pending CN113590898A (en) 2021-09-26 2021-09-26 Data retrieval method and device, electronic equipment, storage medium and computer product

Country Status (1)

Country Link
CN (1) CN113590898A (en)


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942046A (en) * 2019-12-05 2020-03-31 腾讯云计算(北京)有限责任公司 Image retrieval method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI LIU et al.: "Hashing with Graphs", Proceedings of the 28th International Conference on Machine Learning *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115129921A (en) * 2022-06-30 2022-09-30 重庆紫光华山智安科技有限公司 Picture retrieval method and device, electronic equipment and computer-readable storage medium
CN115129921B (en) * 2022-06-30 2023-05-26 重庆紫光华山智安科技有限公司 Picture retrieval method, apparatus, electronic device, and computer-readable storage medium
CN115269180A (en) * 2022-07-18 2022-11-01 苏州大学 LCD task distribution method and system based on vehicle geographic position perception

Similar Documents

Publication Publication Date Title
US11645585B2 (en) Method for approximate k-nearest-neighbor search on parallel hardware accelerators
CN111382868A (en) Neural network structure search method and neural network structure search device
CN112348081B (en) Migration learning method for image classification, related device and storage medium
CN111898703B (en) Multi-label video classification method, model training method, device and medium
CN113590898A (en) Data retrieval method and device, electronic equipment, storage medium and computer product
CN111783712A (en) Video processing method, device, equipment and medium
CN114398557B (en) Information recommendation method and device based on double images, electronic equipment and storage medium
CN113657087B (en) Information matching method and device
CN112883265A (en) Information recommendation method and device, server and computer readable storage medium
CN112364184A (en) Method, device, server and storage medium for ordering multimedia data
Zhang et al. Dataset-driven unsupervised object discovery for region-based instance image retrieval
CN111581443B (en) Distributed graph calculation method, terminal, system and storage medium
CN113327132A (en) Multimedia recommendation method, device, equipment and storage medium
CN110209895B (en) Vector retrieval method, device and equipment
CN112115281A (en) Data retrieval method, device and storage medium
CN112069412A (en) Information recommendation method and device, computer equipment and storage medium
CN115129902B (en) Media data processing method, device, equipment and storage medium
CN114329016B (en) Picture label generating method and text mapping method
CN116030375A (en) Video feature extraction and model training method, device, equipment and storage medium
CN115908882A (en) Picture clustering method and device
CN113901278A (en) Data search method and device based on global multi-detection and adaptive termination
CN114329065A (en) Processing method of video label prediction model, video label prediction method and device
CN112114968A (en) Recommendation method and device, electronic equipment and storage medium
CN111581420A (en) Medical image real-time retrieval method based on Flink
Xie et al. Online Discrete Anchor Graph Hashing for Mobile Person Re‐Identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40055734

Country of ref document: HK