WO2020237511A1 - Similarity search method, device, server and storage medium - Google Patents

Similarity search method, device, server and storage medium

Info

Publication number
WO2020237511A1
Authority
WO
WIPO (PCT)
Prior art keywords
similarity
objects
queue
threshold
cluster
Prior art date
Application number
PCT/CN2019/088879
Inventor
熊思路
何欢
高剑
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2019/088879 (WO2020237511A1)
Priority to CN201980096330.6A (CN113811865A)
Publication of WO2020237511A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • This application relates to the field of search technology, in particular to a similarity search method, device, server and storage medium.
  • Similarity search technology is widely used in a variety of application scenarios. In image search, pictures that are the same as or similar to a picture input by the user can be found. In web search, web pages can be found based on the keywords entered by the user. In audio search, audio that is the same as or similar to the audio entered by the user can be found. In document search, related documents can be found based on the keywords entered by the user.
  • A similarity search method usually works as follows: the terminal sends a search request to the server, and the search request includes the object to be searched, such as a picture, audio, or keyword.
  • After the server receives the search request from the terminal, it obtains the object to be searched from the search request and searches according to that object. Specifically, the server traverses each object in the database, obtains the distance between each object in the database and the object to be searched, selects the K objects with the smallest distance from the database, and returns them to the terminal, where K is a positive integer.
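  • As an illustration only, the traversal described above can be sketched in Python as follows. The function names, the dictionary layout of the database, and the choice of Euclidean distance are assumptions made for this sketch and are not taken from the application.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(database, query, k):
    """Traverse every object and return the ids of the k nearest ones."""
    # database maps a hypothetical object id to its feature vector.
    scored = ((euclidean(vec, query), obj_id) for obj_id, vec in database.items())
    # nsmallest keeps the k pairs with the smallest distance, nearest first.
    return [obj_id for _, obj_id in heapq.nsmallest(k, scored)]


database = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0], "d": [0.2, 0.1]}
result = brute_force_search(database, query=[0.0, 0.0], k=2)
print(result)
```

  • Every object in the database is scored on every query, so the cost grows linearly with the database size.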
  • The embodiments of the present application provide a similarity search method, device, server and storage medium that can solve the problem of low recall rate in related technologies.
  • In a first aspect, an embodiment of the present application provides a similarity search method, and the method includes:
  • Receive a search instruction from the terminal, where the search instruction is used to instruct a search for objects similar to the target object; search the database to obtain a first set, where the first set includes multiple objects; divide the first set into a second set and a third set, where the similarity between each object in the second set and the target object meets a first threshold, and the similarity between each object in the third set and the target object does not meet the first threshold; sort the objects in the first set so that the objects in the second set come first and the objects in the third set come last; select objects from the first set according to the sorted order and send them to the terminal.
  • The method provided in this embodiment designs a rearrangement framework for a single input. After some objects are retrieved from the database, the objects whose similarity with the target object meets the first threshold are placed at the front, and the objects whose similarity with the target object does not meet the first threshold are placed at the back; objects are then selected in order from front to back and sent to the terminal. Objects whose similarity meets the first threshold have higher confidence than objects whose similarity does not, and have a high probability of being correct results. Because these objects are placed at the front, they are selected first when objects are selected from front to back, which raises the probability of selecting correct results and the proportion of correct results among the selected objects, thereby effectively increasing the recall rate.
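  • The divide, sort, and select steps can be sketched roughly as below. This is a minimal sketch: the function name and the sample scores are hypothetical, and "meets the first threshold" is assumed to mean "greater than or equal to", which the text does not specify.

```python
def rerank_by_threshold(first_set, first_threshold, k):
    """Split candidates by a similarity threshold, then pick k front to back.

    first_set is a list of (object_id, similarity_to_target) pairs in the
    order returned by the initial database search.
    """
    # Assumption: "meets the threshold" means similarity >= first_threshold.
    second_set = [item for item in first_set if item[1] >= first_threshold]
    third_set = [item for item in first_set if item[1] < first_threshold]
    ordered = second_set + third_set  # second set first, third set last
    return [obj_id for obj_id, _ in ordered[:k]]


candidates = [("p1", 0.42), ("p2", 0.91), ("p3", 0.30), ("p4", 0.88)]
selected = rerank_by_threshold(candidates, first_threshold=0.8, k=3)
print(selected)
```

  • In this sketch the relative order inside each set is left as returned by the initial search; only the grouping changes.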
  • The rearrangement is performed directly on the first set retrieved from the database, as opposed to constructing ghost points from the objects in the first set and the target object and then searching the database again based on those ghost points.
  • Because the step of constructing ghost points and the step of searching the database a second time are omitted, the huge amount of calculation in the rearrangement framework is avoided and the calculation speed improves. The drop in recall rate caused by inaccurate ghost points is also avoided, which improves robustness.
  • Dividing the first set into a second set and a third set specifically includes: combining multiple similarity algorithms to obtain the comprehensive similarity between each object in the first set and the target object; adding objects whose comprehensive similarity meets the first threshold to the second set; and adding objects whose comprehensive similarity does not meet the first threshold to the third set.
  • Searching the database to obtain the first set specifically includes: using a first similarity algorithm, adding objects whose similarity with the target object meets a third threshold to the first set. Further, dividing the first set into a second set and a third set specifically includes: combining multiple similarity algorithms to obtain the comprehensive similarity between each object in the first set and the target object; adding objects whose comprehensive similarity meets the first threshold to the second set; and adding objects whose comprehensive similarity does not meet the first threshold to the third set.
  • The multiple similarity algorithms include the first similarity algorithm.
  • The first set is obtained from the similarity produced by the first similarity algorithm, and the comprehensive similarity is then obtained by combining the first similarity algorithm with other similarity algorithms.
  • Combining the first similarity algorithm with other similarity algorithms compensates for the shortcomings of the first algorithm's measurement method and improves on it. Compared with the similarity obtained by the first similarity algorithm alone, the comprehensive similarity is therefore more accurate, and an object with high comprehensive similarity to the target object has a higher probability of being a correct result.
  • The first similarity algorithm is a Euclidean distance algorithm, and the multiple similarity algorithms include the Euclidean distance algorithm and similarity algorithms other than the Euclidean distance algorithm.
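  • A rough sketch of this two-stage arrangement is given below. Several choices are assumptions made only for illustration: cosine similarity stands in for the "other" similarity algorithm, Euclidean distance is mapped to a similarity with 1 / (1 + distance), the two measures are combined by a plain average, and all names and threshold values are hypothetical.

```python
import math


def euclidean_similarity(a, b):
    # Assumed mapping from a distance to a similarity in (0, 1].
    return 1.0 / (1.0 + math.dist(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_split(database, target, third_threshold, first_threshold):
    # Stage 1: build the first set with the first similarity algorithm only.
    first_set = {
        oid: vec for oid, vec in database.items()
        if euclidean_similarity(vec, target) >= third_threshold
    }
    # Stage 2: split the first set by a comprehensive similarity that
    # still includes the first similarity algorithm.
    second_set, third_set = [], []
    for oid, vec in first_set.items():
        combined = (euclidean_similarity(vec, target)
                    + cosine_similarity(vec, target)) / 2
        (second_set if combined >= first_threshold else third_set).append(oid)
    return second_set, third_set


database = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [9.0, 9.0]}
second_set, third_set = two_stage_split(
    database, target=[1.0, 0.0], third_threshold=0.3, first_threshold=0.8
)
print(second_set, third_set)
```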
  • The third set includes a fourth set and a fifth set, where the similarity between each object in the fourth set and the second set meets a second threshold, and the similarity between each object in the fifth set and the second set does not meet the second threshold; the order of the objects in the third set is specifically: the objects in the fourth set come first, and the objects in the fifth set come last.
  • The second set is used as a reference for rearrangement. Because the confidence of the objects in the second set is high, an object in the third set that is similar to the second set has a higher probability of being a correct result. Placing objects whose similarity with the second set meets the second threshold first, and objects whose similarity with the second set does not meet the second threshold last, makes the order of the objects in the third set more accurate. When objects are selected from front to back, objects whose similarity with the second set meets the second threshold are selected first, which raises the probability of selecting correct results and further improves the recall rate.
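  • The split of the third set can be sketched as follows. How the similarity between one object and a whole set is computed is not stated at this point in the text, so the sketch assumes the mean of the pairwise similarities; the names and the distance-to-similarity mapping are likewise hypothetical.

```python
import math


def similarity(a, b):
    # Assumed mapping from Euclidean distance to a similarity in (0, 1].
    return 1.0 / (1.0 + math.dist(a, b))


def similarity_to_set(vec, reference_vectors):
    # Assumption: object-to-set similarity is the mean pairwise similarity.
    return sum(similarity(vec, r) for r in reference_vectors) / len(reference_vectors)


def split_third_set(third_set, second_set, second_threshold):
    """Return (fourth_set, fifth_set) as lists of object ids."""
    refs = list(second_set.values())
    fourth, fifth = [], []
    for oid, vec in third_set.items():
        bucket = fourth if similarity_to_set(vec, refs) >= second_threshold else fifth
        bucket.append(oid)
    return fourth, fifth


second_set = {"s1": [0.0, 0.0], "s2": [0.0, 1.0]}
third_set = {"t1": [0.0, 0.5], "t2": [4.0, 4.0], "t3": [1.0, 0.5]}
fourth_set, fifth_set = split_third_set(third_set, second_set, second_threshold=0.4)
print(fourth_set, fifth_set)
```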
  • The method further includes: combining multiple similarity algorithms to obtain the comprehensive similarity between each object in the third set and the second set; adding objects whose comprehensive similarity meets the second threshold to the fourth set; and adding objects whose comprehensive similarity does not meet the second threshold to the fifth set.
  • The third set includes the fourth set and the fifth set, and the order of the objects in the third set is specifically: the objects in the fourth set come first and the objects in the fifth set come last.
  • Multiple similarity algorithms can be combined to obtain a comprehensive similarity, which takes the various similarity algorithms into account and draws on the strengths of each.
  • The comprehensive similarity reflects the similarity between the objects in the third set and the second set more completely, so the inaccuracy of a single measurement method is avoided and the accuracy of the similarity improves.
  • The multiple similarity algorithms used can include rank order algorithms or other similarity algorithms that take group relationships into account.
  • In addition to the objects in the third set and the second set themselves, such algorithms consider the groups to which the objects in the third set belong and the groups to which the objects in the second set belong, which improves the accuracy of the similarity and, in turn, the accuracy of selecting objects based on similarity.
  • Searching the database to obtain the first set specifically includes: using a first similarity algorithm, adding objects whose similarity with the target object meets a third threshold to the first set. Further, after the first set is divided into a second set and a third set, the method further includes: combining multiple similarity algorithms to obtain the comprehensive similarity between each object in the third set and the second set; adding objects whose comprehensive similarity meets the second threshold to the fourth set; and adding objects whose comprehensive similarity does not meet the second threshold to the fifth set.
  • The multiple similarity algorithms include the first similarity algorithm. The third set includes the fourth set and the fifth set, and the order of the objects in the third set is specifically: the objects in the fourth set come first and the objects in the fifth set come last.
  • The first set is obtained from the similarity produced by the first similarity algorithm, and the comprehensive similarity is then obtained by combining the first similarity algorithm with other similarity algorithms.
  • The comprehensive similarity is more accurate, so an object with high comprehensive similarity has a higher probability of being a correct result.
  • Because the comprehensive similarity of the fifth set is lower overall, the confidence of the fourth set as a whole is higher than the confidence of the fifth set. Ranking the objects in the fourth set before the objects in the fifth set therefore improves the accuracy of the order of the objects in the first set.
  • The method further includes: obtaining clusters from the third set, where the degree of relevance between any object in a cluster and the other objects in that cluster meets a preset condition; obtaining the similarity between the cluster and the second set as the similarity between each object in the cluster and the second set; adding objects whose similarity meets the second threshold to the fourth set; and adding objects whose similarity does not meet the second threshold to the fifth set.
  • The third set includes the fourth set and the fifth set, and the order of the objects in the third set is specifically: the objects in the fourth set come first and the objects in the fifth set come last.
  • The similarity between the cluster and the second set is used as the similarity of each object in the cluster.
  • Once noise data has been assigned to its cluster, the similarity between the noise data itself and the second set is replaced by the similarity between the cluster and the second set. Even if the noise data itself is very similar to the second set, its similarity is reduced to that of its cluster, which effectively prevents the noise data from affecting the result and filters the noise data out. This solves the problem of misjudgment caused by noise data, reduces the number of false results in the search results, and greatly improves the recall rate.
  • Obtaining the similarity between the cluster and the second set includes: selecting a representative point from the cluster, and obtaining the similarity between the representative point and the second set as the similarity between the cluster and the second set, where the representative point is used to represent each object in the cluster.
  • A representative point can be measured against the second set in place of all the objects in the cluster. Compared with measuring every object in the cluster against the second set one by one, this reduces the amount of calculation and improves the calculation speed.
  • The representative point may be the center point of the cluster. The similarity between the cluster and the second set can be calculated more accurately through the center point, which prevents noise points at the edge of the cluster from affecting the accuracy of the similarity.
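  • A minimal sketch of the representative-point approach follows. It assumes that the center point is the coordinate-wise mean of the cluster, that the second set is also summarised by its mean, and that similarity is 1 / (1 + Euclidean distance); none of these choices, nor the names used, come from the application.

```python
import math


def centroid(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def similarity(a, b):
    # Assumed mapping from Euclidean distance to a similarity in (0, 1].
    return 1.0 / (1.0 + math.dist(a, b))


def cluster_level_similarity(clusters, second_set_vectors):
    """Give every object the similarity of its cluster's center point."""
    reference = centroid(second_set_vectors)  # assumption: compare centers
    scores = {}
    for members in clusters:  # each cluster maps object id -> vector
        cluster_score = similarity(centroid(list(members.values())), reference)
        for oid in members:
            scores[oid] = cluster_score  # one measurement per cluster
    return scores


second_set_vectors = [[0.0, 0.0], [0.0, 2.0]]
clusters = [
    {"a1": [0.0, 0.5], "a2": [0.0, 1.5]},
    {"b1": [3.0, 1.0], "b2": [5.0, 1.0]},
]
scores = cluster_level_similarity(clusters, second_set_vectors)
print(scores)
```

  • Only one similarity is computed per cluster, regardless of how many objects the cluster contains.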
  • Alternatively, obtaining the similarity between the cluster and the second set includes: obtaining the similarity between each object in the cluster and the second set, and obtaining the similarity between the cluster and the second set from these per-object similarities.
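  • The sketch below illustrates this per-object variant and how it demotes a noisy object. The text does not say how the per-object similarities are aggregated, so the mean is assumed here; the object ids, scores, and threshold are made up for the example.

```python
from statistics import mean


def assign_cluster_similarity(clusters, object_similarity):
    """Replace each object's own similarity with its cluster's similarity."""
    scores = {}
    for members in clusters:
        # Assumption: the cluster similarity is the mean over its members.
        cluster_score = mean(object_similarity[oid] for oid in members)
        for oid in members:
            scores[oid] = cluster_score
    return scores


# "n" plays the role of noise: high on its own, but its cluster is not similar.
object_similarity = {"n": 0.95, "x": 0.20, "y": 0.25, "z": 0.20, "g": 0.90, "h": 0.80}
clusters = [["n", "x", "y", "z"], ["g", "h"]]
scores = assign_cluster_similarity(clusters, object_similarity)

second_threshold = 0.6
fourth_set = [oid for oid in object_similarity if scores[oid] >= second_threshold]
fifth_set = [oid for oid in object_similarity if scores[oid] < second_threshold]
print(fourth_set, fifth_set)
```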
  • Sorting the objects in the first set so that the objects in the second set come first and the objects in the third set come last includes: storing the second set in the first queue; storing the fourth set in the second queue; storing the fifth set in the third queue; and sorting the objects in the first set in the order of the first queue first, the second queue second, and the third queue last.
  • A queue-based rearrangement framework is designed.
  • By adding the second set to the first queue, the fourth set to the second queue, and the fifth set to the third queue, the objects retrieved from the database are divided into multiple queues.
  • The objects in the first queue are ranked first, the objects in the second queue are ranked in the middle, and the objects in the third queue are ranked last.
  • The objects in the first queue are selected with the highest priority, the objects in the second queue are selected next, and the objects in the third queue are selected with the lowest priority.
  • The confidence of the objects in the first queue is the highest, the confidence of the objects in the second queue is the second highest, and the confidence of the objects in the third queue is the lowest.
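  • The three-queue arrangement can be sketched as below, using `collections.deque` as the queue. The function name and the sample ids are hypothetical.

```python
from collections import deque
from itertools import chain, islice


def select_from_queues(second_set, fourth_set, fifth_set, k):
    """Drain three queues in priority order and return the first k objects."""
    first_queue = deque(second_set)   # highest confidence
    second_queue = deque(fourth_set)  # medium confidence
    third_queue = deque(fifth_set)    # lowest confidence
    ordered = chain(first_queue, second_queue, third_queue)
    return list(islice(ordered, k))


top = select_from_queues(["a", "b"], ["c", "d"], ["e"], k=3)
print(top)
```

  • If k is larger than the first queue, the remaining slots are filled from the second queue and then from the third queue.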
  • Alternatively, sorting the objects in the first set so that the objects in the second set come first and the objects in the third set come last includes: storing the second set in the first queue; storing the third set in the second queue; and sorting the objects in the first set in the order of the first queue first and the second queue last.
  • A queue-based rearrangement framework is designed.
  • The objects retrieved from the database are divided into multiple queues.
  • The objects in the first queue are ranked first and the objects in the second queue are ranked second.
  • For any object in the second set, the greater the similarity between the object and the target object, the nearer the front the object is placed in the second set.
  • Objects with high similarity to the target object are ranked toward the front, and objects with lower similarity to the target object are ranked toward the back, which further improves the accuracy of the order of the candidate objects.
  • An object with high similarity to the target object is more likely to be a correct result. Placing such objects at the front raises the proportion of correct results among the objects selected from front to back, and tends to put correct results near the front of the search results, so that when the terminal displays the search results, the correct results appear higher up.
  • An object with high similarity to the target object is ranked near the front of the second set, and therefore also near the front of the first set.
  • After the server sends the selected objects to the terminal, that object is ranked near the front of the search results presented by the terminal. Objects with low similarity to the target object are ranked toward the back of the search results or kept out of them altogether, so the proportion of correct results in the search results is higher, which effectively improves the recall rate of the search.
  • For any object in the third set, the greater the similarity between the object and the second set, the nearer the front the object is placed in the third set.
  • Objects with greater similarity to the second set are ranked toward the front, and objects with less similarity to the second set are ranked toward the back, which further improves the accuracy of the order of the candidate objects. Objects with greater similarity to the second set are more likely to be correct results.
  • When objects are selected from front to back, the proportion of correct results among the selected objects therefore increases, and as many objects similar to the second set as possible are selected, which effectively improves the recall rate.
  • Objects with low similarity to the second set are ranked toward the back of the search results or kept out of them altogether, which reduces the number of false results in the search results and effectively improves the recall rate of the search.
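  • The ordering inside each set can be sketched as a pair of descending sorts. The dictionary inputs and names are assumptions made for this sketch: the second set maps each object to its similarity with the target object, and the third set maps each object to its similarity with the second set.

```python
def order_first_set(second_set, third_set):
    """Order the first set: second set first, each part sorted by similarity.

    second_set maps object id -> similarity to the target object.
    third_set maps object id -> similarity to the second set.
    """
    front = sorted(second_set, key=second_set.get, reverse=True)
    back = sorted(third_set, key=third_set.get, reverse=True)
    return front + back


ordered = order_first_set({"a": 0.82, "b": 0.95}, {"c": 0.10, "d": 0.60})
print(ordered)
```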
  • Dividing the first set into a second set and a third set specifically includes: obtaining clusters from the first set, where the correlation between any object in a cluster and the other objects in that cluster meets a preset condition; obtaining the similarity between the cluster and the target object as the similarity between each object in the cluster and the target object; adding objects whose similarity meets the first threshold to the second set; and adding objects whose similarity does not meet the first threshold to the third set.
  • The similarity between the cluster and the target object is used as the similarity of each object in the cluster. Once noise data has been assigned to its cluster, the similarity between the noise data itself and the target object is replaced by the similarity between the cluster and the target object. Even if the noise data itself is highly similar to the target object, that similarity is not used; the similarity between the cluster to which the noise data belongs and the target object is used instead. The similarity of the noise data is thus reduced to that of its cluster, which effectively prevents the noise data from affecting the result and eliminates it early. This solves the problem of misjudgment due to noise data, reduces the number of erroneous results in the search results, and greatly improves the recall rate.
  • An embodiment of the present application provides a similarity search device, which is used to execute the above-mentioned similarity search method.
  • The similarity search device includes functional modules for executing the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • An embodiment of the present application provides a server.
  • The server includes one or more processors and one or more memories, and at least one instruction is stored in the one or more memories.
  • The instruction is loaded and executed by the one or more processors to implement the method provided in the foregoing first aspect or any possible implementation manner of the first aspect.
  • An embodiment of the present application provides a server cluster. The server cluster includes at least one server, each server includes one or more processors and one or more memories, and the memory of the at least one server stores at least one instruction, which is loaded and executed by the processor of the at least one server to implement the method provided in the foregoing first aspect or any possible implementation manner of the first aspect.
  • An embodiment of the present application provides a computer-readable storage medium in which at least one instruction is stored; the instruction is loaded and executed by a processor to implement the method provided in the first aspect or any possible implementation manner of the first aspect.
  • The storage medium stores the program.
  • The types of storage medium include, but are not limited to, volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • An embodiment of the present application provides a chip that includes a processor configured to call, from a memory, and execute instructions stored in that memory, so that a device on which the chip is installed executes the above-mentioned similarity search method.
  • The embodiments of the present application provide another chip, including an input interface, an output interface, a processor, and a memory, where the input interface, the output interface, the processor, and the memory are connected by an internal connection path.
  • The processor is configured to execute instructions in the memory, and when the instructions are executed, the processor executes the aforementioned similarity search method.
  • An embodiment of the present application provides a computer program, where the computer program includes instructions for executing the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • The computer program may be a software installation package.
  • The computer program may be downloaded and executed on the server.
  • An embodiment of the present application provides a similarity search system.
  • The similarity search system includes a terminal and a server.
  • The terminal is configured to send a search instruction to the server, and the server is configured to execute the method in the first aspect or any possible implementation manner of the first aspect described above.
  • The terminal is also used to receive the objects sent by the server.
  • Fig. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • Fig. 2 is a structural block diagram of a similarity search system provided by an embodiment of the present application.
  • Fig. 3 is a flowchart of a similarity search method provided by an embodiment of the present application.
  • Fig. 4 is a schematic diagram of dividing the second set and the third set according to an embodiment of the present application.
  • Fig. 5 is a schematic diagram of dividing the second set and the third set according to an embodiment of the present application.
  • Fig. 6 is a flowchart of a similarity search method provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of dividing a first queue and a second queue according to an embodiment of the present application.
  • Fig. 8 is a flowchart of a similarity search method provided by an embodiment of the present application.
  • Fig. 9 is a schematic diagram of dividing the fourth set and the fifth set according to an embodiment of the present application.
  • Fig. 10 is a schematic diagram of dividing the fourth set and the fifth set according to an embodiment of the present application.
  • Fig. 11 is a flowchart of a similarity search method provided by an embodiment of the present application.
  • Fig. 12 is a schematic diagram of dividing a first queue, a second queue, and a third queue according to an embodiment of the present application.
  • Fig. 13 is a structural block diagram of a similarity search device provided by an embodiment of the present application.
  • Fig. 14 is a structural block diagram of a server provided by an embodiment of the present application.
  • Fig. 15 is a schematic structural diagram of a server cluster provided by an embodiment of the present application.
  • Fig. 16 is a schematic structural diagram of another server cluster provided by an embodiment of the present application.
  • Object: can be, but is not limited to, any one or a combination of images, web pages, documents, audio, information, and videos.
  • Images can be, but are not limited to, any one or a combination of face images, human body images, footprint images, gait images, emoticon images, location images, vehicle images, product images, landscape images, architectural images, film and television images, food images, game images, plant images, and animal images.
  • Target object: the object to be searched, for example the content requested in a query command; it can also be called a search item or query.
  • Similarity algorithm: also called a measurement algorithm, used to measure the distance between two data points in the data space.
  • A similarity algorithm can be used to measure the degree of similarity between two objects.
  • Similarity algorithms can be implemented as functions (a kind of computer program).
  • Similarity algorithms can be, but are not limited to, any one or a combination of the following: the Euclidean distance algorithm, the rank order algorithm, a machine learning model, the standardized Euclidean distance algorithm, the Mahalanobis distance algorithm, the Manhattan distance algorithm, the Chebyshev distance algorithm, the Minkowski distance algorithm, the Hamming distance algorithm, the cosine similarity algorithm, the Pearson correlation coefficient algorithm, the Jaccard similarity coefficient algorithm, the log-likelihood similarity algorithm, the mutual information gain algorithm, the information gain algorithm, the relative entropy algorithm, KL divergence (Kullback-Leibler divergence), and the pointwise mutual information (PMI) algorithm.
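  • For reference, a few of the measures listed above can be written as short Python functions using their standard textbook definitions. The function names are chosen for this sketch; the application does not prescribe any particular implementation.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p):
    # p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    # Size of the intersection over size of the union of two sets.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


print(euclidean([0, 0], [3, 4]), manhattan([0, 0], [3, 4]), chebyshev([0, 0], [3, 4]))
```

  • Distances grow as objects become less alike, while cosine and Jaccard scores grow as objects become more alike, so a distance usually has to be converted or negated before it is compared against a similarity threshold.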
  • Search for images by image: the function of searching the massive set of images in the database for one or more images similar to an input image and returning them.
  • Candidate solution: if the number of objects required to be returned is K, and more than K objects are first retrieved from the database, for example (2*K) objects, these objects can be called candidate solutions. K objects are later selected from them as the search results.
  • Rerank: the process of reordering the candidate solutions according to a set ordering method after the candidate solutions are obtained, thereby changing their original internal order and forming a new internal order.
  • The K objects selected are commonly called the TOP K results (the first K results); they can be used as the search results and returned.
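  • The relationship between candidate solutions, reranking, and the TOP K results can be sketched as follows. The sketch assumes that candidates are fetched by Euclidean distance and reranked by cosine similarity; both choices, and all names and data, are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_with_rerank(database, target, k, rerank_score):
    # Step 1: candidate solutions, here the 2*K nearest by Euclidean distance.
    candidates = heapq.nsmallest(
        2 * k, database, key=lambda oid: math.dist(database[oid], target)
    )
    # Step 2: rerank the candidates with a different scoring function.
    reranked = sorted(
        candidates, key=lambda oid: rerank_score(database[oid], target), reverse=True
    )
    # Step 3: the first K objects of the new order are the TOP K results.
    return reranked[:k]


database = {
    "a": [1.0, 0.0], "b": [0.5, 0.5], "c": [2.0, 0.1],
    "d": [-1.0, 0.0], "e": [9.0, 9.0],
}
top_k = search_with_rerank(database, target=[1.0, 0.2], k=2, rerank_score=cosine_similarity)
print(top_k)
```

  • In this example the two nearest objects by distance alone would be "a" and "b"; after reranking, "c" moves to the front because its direction matches the target most closely.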
  • Clustering: given multiple data items and a clustering algorithm, the process of grouping the data items that meet a condition into one category.
  • Clustering algorithm: can be, but is not limited to, any one or a combination of the k-means clustering algorithm, density clustering algorithm, graph clustering algorithm, hierarchical clustering algorithm, network-based clustering algorithm, fuzzy-based clustering algorithm, constraint-based clustering algorithm, granularity-based clustering algorithm, kernel clustering algorithm, and quantum clustering algorithm.
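  • As one concrete example of a clustering algorithm, a bare-bones k-means (Lloyd's algorithm) is sketched below. Taking the initial centers from the caller and running a fixed number of iterations are simplifications made for this sketch; the application does not state how clusters are formed in practice.

```python
import math


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster,
        # keeping the old center if the cluster is empty.
        centers = [
            [sum(col) / len(c) for col in zip(*c)] if c else old
            for c, old in zip(clusters, centers)
        ]
    return clusters, centers


points = [[0, 0], [0, 1], [1, 0], [9, 9], [10, 9], [9, 10]]
clusters, centers = kmeans(points, centers=[[0, 0], [10, 10]])
print(clusters)
```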
  • Queue: a data structure that stores data.
  • Each similarity algorithm produces one similarity, and combining the similarities produced by multiple similarity algorithms yields a comprehensive similarity.
  • Each similarity algorithm can be assigned a corresponding weight, and the similarities produced by the multiple similarity algorithms can be combined according to these weights.
  • The weight corresponding to each similarity algorithm can be set according to instructions, experiments, experience, or needs.
  • The weight corresponding to a similarity algorithm can be positively correlated with the accuracy of that algorithm. For example, if the accuracy of the Euclidean distance algorithm is lower than that of the rank order algorithm, the weight of the Euclidean distance algorithm can be less than the weight of the rank order algorithm.
  • The methods of combining multiple similarity algorithms include, but are not limited to, any one or a combination of average, weighted average, sum, and weighted sum. For example, suppose the multiple similarity algorithms are similarity algorithm 1 and similarity algorithm 2, where similarity algorithm 1 obtains similarity 1 and similarity algorithm 2 obtains similarity 2. Then similarity 1 and similarity 2 can be weighted-averaged according to the weight of similarity algorithm 1 and the weight of similarity algorithm 2, and the weighted average used as the comprehensive similarity; or similarity 1 and similarity 2 can be weighted-summed according to those weights, and the weighted sum used as the comprehensive similarity; or similarity 1 and similarity 2 can be summed, and the sum used as the comprehensive similarity; or similarity 1 and similarity 2 can be averaged, and the average used as the comprehensive similarity.
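  • The four combination methods named above can be sketched in a single helper. The function name, the `mode` strings, and the sample weights are assumptions introduced for illustration.

```python
def comprehensive_similarity(similarities, weights=None, mode="weighted_average"):
    """Combine per-algorithm similarities into one comprehensive similarity."""
    if weights is None:
        weights = [1.0] * len(similarities)  # equal weights by default
    if mode == "sum":
        return sum(similarities)
    if mode == "average":
        return sum(similarities) / len(similarities)
    weighted = sum(w * s for w, s in zip(weights, similarities))
    if mode == "weighted_sum":
        return weighted
    if mode == "weighted_average":
        return weighted / sum(weights)
    raise ValueError(f"unknown mode: {mode}")


# Similarity 1 from algorithm 1 (weight 1), similarity 2 from algorithm 2 (weight 3).
sims, weights = [0.6, 0.9], [1.0, 3.0]
for mode in ("average", "weighted_average", "sum", "weighted_sum"):
    print(mode, comprehensive_similarity(sims, weights, mode))
```

  • A threshold only makes sense relative to the combination method used: sums scale with the number of algorithms, while averages stay in the range of the individual similarities.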
  • Similarity search belongs to semantic search, which can search for objects similar to known objects.
  • One similarity search tool in related technologies is Faiss from Facebook.
  • Similarity search is a non-matching search: for example, given a picture, search for similar pictures; given a word, search for similar words; or given a paragraph, search for similar paragraphs.
  • By contrast, current text editing software (for example, Microsoft Office Word 2013, a text editing software launched by Microsoft) uses exact-match search, which can only retrieve content that is completely consistent with the target word or sentence.
  • Based on an image provided by the terminal, the server can search the database for other images similar to it and return these images to the terminal. For example, if a user wants to know who a person in an image is, the user can input the person's image on the terminal; the server searches the database for other images similar to this image and returns them. The database can store the identity information corresponding to each image, and the server can return the identity information together with the image to help identify the person.
  • For another example, if a user sees a product and wants to know its purchase address, price, and so on, the user can input an image of the product on the terminal; the server searches the database for other product images similar to this product image and returns them. The database can store the purchase address, price, and so on corresponding to each product image, and the server can return these together with the product images to help the user purchase the product quickly.
  • For another example, if a user wants to know the breed of a dog, the user can input an image of the dog on the terminal, and the server searches the database for dog images similar to this image to help identify the breed.
  • A large number of face photos can be captured in advance by cameras deployed in various locations, facial features can be extracted from each face photo, and the facial features can be stored in a face feature library.
  • When a photo is submitted for search, the server can use this photo as the face image to be searched and extract facial features from it.
  • The face feature library is searched to obtain, for example, 10 face features. Through the re-ranking process in the following embodiment, the 10 face features are sorted, the top 3 face features are selected, and the 3 corresponding photos are returned to the terminal.
  • The scenario of using images for similarity search is only for illustration. This application can also be applied to similarity search using documents, that is, searching a document database for other documents similar to a given document.
  • For example, a user can provide a paper; by implementing the method provided in this application, the papers with the highest similarity to this paper can be found in a paper database, and the similarity between each of these papers and the given paper can also be returned. In this way, it can be determined whether other papers overlap substantially with the given paper, thereby implementing a duplicate-checking function for papers. For another example, this application can be applied to similarity search using audio, that is, searching an audio database for other audio similar to given audio. For example, if a user hums a segment of a song, the segment can be recorded, and songs similar to the hummed segment can be searched for in a song library.
  • Fig. 2 is a structural block diagram of a similarity search system provided by an embodiment of the present application.
  • the similarity search system includes: a terminal 210 and a search platform 220.
  • the terminal 210 is connected to the search platform 220 through a wireless network or a wired network.
  • The terminal 210 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 player, an MP4 player, and a laptop computer.
  • the terminal 210 can install and run an application program that supports search, and the search platform 220 is used to provide background services for the application program.
  • the terminal 210 may be a terminal used by a user, and an account registered by the user on the search platform 220 is logged in an application program running in the terminal 210.
  • the application can be the client of the search engine, or can be the web version of the search engine.
  • the application can be any one of shopping applications, audio programs, video programs, social applications, instant messaging applications, translation applications, and browser programs.
  • The application has a built-in search function; for example, it is equipped with a component for searching images by image.
  • For example, the application may be a shopping application with an image-recognition shopping function.
  • The search platform 220 may run in, without limitation, any one of a cloud environment, an edge environment, or a terminal environment; for example, it may run on a public cloud, a private cloud, or a hybrid cloud.
  • the search platform 220 may be provided to users as a cloud search service.
  • the search platform 220 includes a server 2201 and a database 2202.
  • the server 2201 is configured to execute the method of the embodiment in FIG. 3 described below.
  • the server 2201 is connected to the database 2202 through a wireless network or a wired network.
  • the server 2201 may include at least one of one server, multiple servers, a cloud computing platform, or a virtualization center.
  • There may be one or more servers 2201.
  • The server 2201 may be an elastic cloud server (ECS), a virtual machine, a container, or an application, service, or microservice running in a cloud environment.
  • the database 2202 is used to store multiple objects.
  • the database 2202 may be located on one storage device or distributed on multiple storage devices.
  • the database 2202 may be implemented by a cloud storage service.
  • The database 2202 may be an object storage service (OBS), a cloud hard disk, a cloud database, etc.
  • FIG. 2 shows, only as an example, the server 2201 and the database 2202 deployed separately on different devices.
  • The server 2201 and the database 2202 may also be integrated; that is, the server 2201 and the database 2202 can be located on the same device.
  • There may be more or fewer terminals than described above.
  • For example, there may be only one terminal, or there may be tens or hundreds of terminals, or more.
  • In that case, the similarity search system also includes other terminals.
  • the embodiments of this application do not limit the number of terminals and device types.
  • FIG. 3 is a flowchart of a similarity search method provided by an embodiment of the present application. As shown in FIG. 3, the method includes the following steps 301 to 306:
  • Step 301 The terminal sends a search instruction to the server.
  • The search instruction instructs the server to search for objects similar to the target object.
  • The search instruction may include the target object.
  • The search instruction may also include the target number, which is the number of objects to be searched out.
  • the search instruction can be triggered according to the user's operation on the terminal. For example, the user can input the target object on the terminal, and the terminal can generate the search instruction according to the target object.
  • For example, if the user needs to search for photos similar to a certain photo, the user can input this photo on the terminal and click the confirmation option; the terminal then generates a search instruction, and this photo is the target object.
  • Alternatively, the target object may not be provided by the terminal directly.
  • For example, the terminal provides the web address of the target object to the server, and the server obtains the target object through the network.
  • Step 302 The server receives a search instruction from the terminal.
  • After receiving the search instruction, the server can obtain the target object and the target number, so as to execute the subsequent steps according to the target object and the target number.
  • the target number is the number of objects to be searched out, and the target number can be a positive integer.
  • the server may parse the search instruction to obtain the target object carried by the search instruction.
  • The server may also parse the search instruction to obtain the web address of the target object carried by the search instruction, and obtain the target object through the network according to the web address.
  • The server may parse the search instruction to obtain the target number carried by the search instruction, or the server may determine the target number itself. For example, the server may preset a default number and use it as the target number: if the server returns 10 objects to the terminal by default, 10 is the target number.
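The parsing behaviour described for step 302 might look like the following sketch. The `SearchInstruction` fields, the helper names, and the injected `fetch` callable are all hypothetical; the application does not define a message format, and the default of 10 is simply the example value given above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

DEFAULT_TARGET_NUMBER = 10  # example default from the text; an assumption here


@dataclass
class SearchInstruction:
    """Hypothetical message layout for a search instruction."""
    target_object: Optional[bytes] = None  # the object itself, if carried
    target_url: Optional[str] = None       # web address of the object, if not
    target_number: Optional[int] = None    # number of objects to search out


def resolve_target_number(instruction: SearchInstruction,
                          default: int = DEFAULT_TARGET_NUMBER) -> int:
    """Use the carried target number, or fall back to the server default."""
    if instruction.target_number is None:
        return default
    if instruction.target_number <= 0:
        raise ValueError("target number must be a positive integer")
    return instruction.target_number


def resolve_target_object(instruction: SearchInstruction,
                          fetch: Callable[[str], bytes]) -> bytes:
    """Return the carried object, or download it from the carried address."""
    if instruction.target_object is not None:
        return instruction.target_object
    if instruction.target_url is not None:
        return fetch(instruction.target_url)  # fetch is supplied by the caller
    raise ValueError("instruction carries neither an object nor an address")
```

Passing `fetch` in as a parameter keeps the sketch free of any particular network library; a real server would plug in its own download routine.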
  • Step 303 The server searches the database to obtain the first set.
  • The server can obtain the similarity between an object in the database and the target object and determine whether the similarity meets the third threshold. When the similarity meets the third threshold, the server adds the object to the first set.
  • The server may use the first similarity algorithm to obtain the similarity between each object in the database and the target object and, according to this similarity, add the objects whose similarity with the target object meets the third threshold to the first set.
  • The first similarity algorithm may be the Euclidean distance algorithm; it may also be set to another similarity algorithm as required. This embodiment does not limit which similarity algorithm is used as the first similarity algorithm.
  • the first set may be referred to as a candidate solution set.
  • the first set includes multiple objects, and each object may be referred to as a candidate solution.
  • the number of objects in the first set is greater than the target number.
  • the ratio between the number of objects in the first set and the target number may be a preset ratio.
  • For example, the first set may include (2*K) objects; K objects are subsequently selected from these (2*K) objects and sent to the terminal, where K is a positive integer.
  • the similarity between each object in the first set and the target object may satisfy the third threshold.
  • The similarity meeting the third threshold may mean that the similarity is greater than the third threshold, and the similarity not meeting the third threshold may mean that the similarity is less than or equal to the third threshold.
  • Alternatively, the similarity meeting the third threshold may mean that the similarity is greater than or equal to the third threshold, and the similarity not meeting the third threshold may mean that the similarity is less than the third threshold.
  • the third threshold can be set according to experiment, experience or demand, and the third threshold can be stored in the server in advance.
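As a rough illustration of step 303, the sketch below scores every database object against the target, keeps those that meet the third threshold, and caps the first set at a preset multiple of the target number K. Several details are assumptions made for the example: the 1 / (1 + d) mapping from Euclidean distance to a similarity, the in-memory dictionary standing in for the database, and the names `euclidean_similarity` and `search_first_set`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map Euclidean distance d to a similarity in (0, 1] via 1 / (1 + d)."""
    return 1.0 / (1.0 + math.dist(a, b))


def search_first_set(
    database: Dict[str, Sequence[float]],
    target: Sequence[float],
    k: int,
    third_threshold: float,
    ratio: int = 2,
    inclusive: bool = False,
) -> List[Tuple[str, float]]:
    """Return up to ratio * k (object id, similarity) pairs, best first."""
    candidates = []
    for object_id, features in database.items():
        s = euclidean_similarity(features, target)
        # "Meets the threshold" may be strict or non-strict, per the text.
        meets = s >= third_threshold if inclusive else s > third_threshold
        if meets:
            candidates.append((object_id, s))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[: ratio * k]
```

A linear scan is used only to keep the example short; a production system would normally rely on an index rather than comparing the target with every stored object.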
  • After the server obtains the first set from the database, it can store the first set in its own memory, for example, by caching the first set in the server's memory, or it can store the first set in a non-volatile readable storage medium contained in the server, for example, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • The server can also store the first set on a device other than the server.
  • For example, the first set can be sent to network storage and stored there.
  • The network storage can be a cloud disk, a cloud database, or an object storage service. This embodiment does not limit the storage location of the first set.
  • The objects in the first set may be sorted in descending order of similarity with the target object: the greater the similarity between an object and the target object, the earlier the object is positioned in the first set.
  • the first object in the first set may be the object with the greatest similarity to the target object in the first set, for example, the object with the smallest Euclidean distance to the target object.
  • the server may first add the objects in the database to the first set, and then sort each object in the first set in descending order of similarity.
  • The server can also add objects to the first set in order of similarity during the search. For example, each time an object i is found, the similarity between object i and the target object can be compared with the similarities between the objects already in the first set and the target object. If the similarity between object k and the target object is higher than that between object i and the target object, and the similarity between object m and the target object is lower than that between object i and the target object, object i is inserted between object k and object m, where i, k, and m are object identifiers.
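The ordered insertion described above can be sketched with the standard-library `bisect` module. The list layout of `(similarity, object_id)` pairs and the function name `insert_by_similarity` are assumptions for illustration.

```python
import bisect
from typing import List, Tuple


def insert_by_similarity(first_set: List[Tuple[float, str]],
                         object_id: str,
                         similarity: float) -> None:
    """Insert an object so first_set stays in descending similarity order."""
    # bisect expects ascending keys, so search on the negated similarities.
    keys = [-s for s, _ in first_set]
    position = bisect.bisect_right(keys, -similarity)
    first_set.insert(position, (similarity, object_id))
```

If object k (similarity 0.9) and object m (0.5) are already present, inserting object i with similarity 0.7 places it between them. Rebuilding `keys` on every call keeps the sketch simple at the cost of a linear pass per insertion.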
  • Step 304 The server divides the first set into a second set and a third set.
  • the similarity between each object in the second set and the target object meets the first threshold, and the similarity between each object in the third set and the target object does not meet the first threshold.
  • the first threshold may be set according to experiment, experience, or demand, the first threshold may be pre-stored in the server, and the first threshold may be the highest threshold among the thresholds involved in this embodiment. The first threshold may be higher than the second threshold and the third threshold described below.
  • The similarity meeting the first threshold may mean that the similarity is greater than the first threshold, and the similarity not meeting the first threshold may mean that the similarity is less than or equal to the first threshold. In other possible embodiments, the similarity meeting the first threshold may mean that the similarity is greater than or equal to the first threshold, and the similarity not meeting the first threshold may mean that the similarity is less than the first threshold.
  • Compared with the objects in the third set, the objects in the second set have a similarity with the target object that meets the first threshold, so they have a higher probability of being correct results. The confidence of the objects in the second set is therefore higher, and they can be regarded as trusted objects.
  • The second set can be referred to as the trusted solution set. When the objects are re-ranked, the objects in the second set have the highest ordering priority.
  • The server may obtain the similarity between each object in the first set and the target object and, according to whether this similarity meets the first threshold, divide each object in the first set into the second set or the third set.
  • For example, the server can create a second set and a third set. For each object in the first set, the server obtains the similarity between the object and the target object and determines whether it meets the first threshold: if it does, the object is added to the second set; if it does not, the object is added to the third set.
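The division in step 304 amounts to a single pass over the first set. In this sketch the first set is assumed to be a mapping from object identifier to similarity, and `split_first_set` is a hypothetical name; the `inclusive` flag covers the two readings of "meets the first threshold" given above.

```python
from typing import Dict, List, Tuple


def split_first_set(
    similarities: Dict[str, float],
    first_threshold: float,
    inclusive: bool = False,
) -> Tuple[List[str], List[str]]:
    """Divide the first set into the second (trusted) set and the third set."""
    second_set: List[str] = []
    third_set: List[str] = []
    for object_id, s in similarities.items():
        meets = s >= first_threshold if inclusive else s > first_threshold
        (second_set if meets else third_set).append(object_id)
    return second_set, third_set
```

Because the pass preserves the input order, objects that were already sorted by similarity in the first set remain sorted within each of the two resulting sets.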
  • The similarity used to divide the second set and the third set in step 304 and the similarity used to add objects to the first set in step 303 may be different similarities or the same similarity.
  • If they are the same, the similarity from step 303 can be reused directly to divide the second set and the third set, and step 304 does not need to obtain the similarity between the objects in the first set and the target object again.
  • Alternatively, in step 304 the similarity between the objects in the first set and the target object can be obtained again, and the way of obtaining it may differ from that in step 303.
  • For example, if similarity algorithm A is used in step 303, similarity algorithm B can be used in step 304 to obtain the similarity between the objects in the first set and the target object, or similarity algorithm A combined with similarity algorithm B can be used.
  • This embodiment does not limit whether step 304 performs the step of obtaining the similarity, nor does it limit the way in which the similarity is obtained in step 304.
  • The process of obtaining the similarity between an object and the target object can be understood as scoring the object; the similarity between the object and the target object can be understood as the object's score, which reflects the confidence of the object.
  • The server may use a combination of multiple similarity algorithms to obtain the comprehensive similarity between each object in the first set and the target object, add the objects whose comprehensive similarity meets the first threshold to the second set, and add the objects whose comprehensive similarity does not meet the first threshold to the third set.
  • Step 304 may specifically include: for any object in the first set, the server uses each of the multiple similarity algorithms to obtain a similarity between the object and the target object, thereby obtaining multiple similarities, and combines the multiple similarities to obtain the comprehensive similarity between the object and the target object.
  • Each similarity algorithm can be assigned a corresponding weight. After each similarity algorithm obtains a similarity between the object and the target object, a weighted average of the multiple similarities can be computed according to the weight of each similarity algorithm, and the weighted average is used as the comprehensive similarity between the object and the target object.
  • For example, the Euclidean distance algorithm can be used to obtain one similarity between the object and the target object, and the rank order algorithm can be used to obtain another.
  • The multiple similarity algorithms may include the rank order algorithm or another similarity algorithm that considers group relationships. In that case, measuring the similarity between two objects takes into account not only the two objects themselves but also the other objects that belong to the same groups as the two objects.
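To make the idea of a group-aware measure concrete, the sketch below scores two objects by how much their nearest-neighbour sets overlap. This is deliberately not the rank order algorithm itself; it is a simpler stand-in (Jaccard overlap of k-nearest-neighbour sets) chosen only to show how surrounding objects can influence the similarity of a pair. The function names and the choice of `k` are assumptions.

```python
import math
from typing import Dict, Sequence, Set


def nearest_neighbors(object_id: str,
                      features: Dict[str, Sequence[float]],
                      k: int) -> Set[str]:
    """Identifiers of the k objects closest to object_id (Euclidean distance)."""
    others = [o for o in features if o != object_id]
    others.sort(key=lambda o: math.dist(features[object_id], features[o]))
    return set(others[:k])


def shared_neighbor_similarity(a: str, b: str,
                               features: Dict[str, Sequence[float]],
                               k: int = 2) -> float:
    """Jaccard overlap of the two neighbourhoods, each including its own object."""
    group_a = nearest_neighbors(a, features, k) | {a}
    group_b = nearest_neighbors(b, features, k) | {b}
    return len(group_a & group_b) / len(group_a | group_b)
```

Two objects that sit in the same tight group share most of their neighbours and score close to 1, while objects from different groups share none and score 0, even if a pairwise distance alone would be ambiguous.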
  • the multiple similarity algorithms used when dividing the second set and the third set may include the first similarity algorithm, that is, the similarity algorithm used when searching the database.
  • The multiple similarity algorithms may also include similarity algorithms other than the first similarity algorithm.
  • For example, the first similarity algorithm may be the Euclidean distance algorithm, and the multiple similarity algorithms may include the Euclidean distance algorithm and the rank order algorithm.
  • The first set is first obtained based on the similarity given by the first similarity algorithm; the second set and the third set are then divided based on the comprehensive similarity obtained by combining the first similarity algorithm with other similarity algorithms. Because the first similarity algorithm is combined with other similarity algorithms, the shortcomings of its measurement method can be compensated for, which improves on the first similarity algorithm, so the comprehensive similarity can be more accurate than the similarity obtained by the first similarity algorithm alone. Therefore, an object with a high comprehensive similarity to the target object has a higher probability of being a correct result than an object with a low comprehensive similarity; that is, the overall confidence of the second set is higher than that of the third set. Arranging the objects in the second set ahead of the objects in the third set therefore improves the accuracy of the ordering of the objects in the first set.
  • The server may also use only one similarity algorithm to obtain the similarity between each object in the first set and the target object, for example, only the Euclidean distance algorithm or only the rank order algorithm. Then, according to the similarity obtained by that one algorithm, the objects whose similarity meets the first threshold are added to the second set, and the objects whose similarity does not meet the first threshold are added to the third set.
  • The first set can be divided into clusters, and the similarity can be obtained according to the cluster to which an object belongs.
  • This approach may include the following steps 1 to 2.
  • Step 1 The server obtains clusters from the first set.
  • a cluster can also be called a class.
  • A cluster includes multiple objects. The similarity between any object in a cluster and the other objects in the cluster meets a preset condition.
  • One or more clusters may be obtained from the first set. Specifically, the objects in a cluster are similar to each other, and all objects in the cluster share certain commonalities. For example, if each object in the first set is an image, a cluster can consist of images of the same person: each image in cluster 1 can be an image of user A, and each image in cluster 2 can be an image of user B.
  • Some objects should be recognized as similar objects even though their computed similarity values differ. Therefore, the concepts of cluster and correlation can be introduced to describe the similarity of these objects. For example, multiple pictures of the same person should be recognized as pictures of the same person, but because of different shooting angles or different clothes, the similarity values computed for these pictures may differ considerably.
  • These pictures are marked as the same cluster according to their correlation, and the pictures in the cluster share the same similarity instead of each using its own computed similarity. For example, if a cluster contains N pictures, where N is a positive integer, these N pictures share the same similarity, rather than picture 1 using its own similarity and picture 2 using its own similarity.
  • the cluster acquisition manner may include at least one of the following manners 1 to 2.
  • Method 1 The server can use a clustering algorithm to cluster the first set to obtain clusters.
  • Method 2 For each object in the first set, the server obtains the correlation between the object and the other objects in the first set and, according to these correlations, divides the objects whose correlation meets the preset condition into clusters.
  • The server can compare the objects in the first set pairwise to obtain the correlation between any two objects in the first set and determine whether the correlation between each object and the other objects in the first set meets the preset condition. If it does, the object is divided into a cluster; if it does not, the object is treated as a scatter point.
  • The correlation meeting the preset condition may be, without limitation, the correlation meeting the fourth threshold. For example, if the correlation between two objects in the first set meets the fourth threshold, the two objects are divided into the same cluster.
  • The server can generate a correlation matrix according to the correlation between each object in the first set and the other objects in the first set.
  • The server can traverse the correlation matrix, divide the objects whose correlation meets the preset condition into clusters, and treat the remaining objects outside the clusters as scatter points.
  • The correlation matrix can be as shown in Table 1. Each row of the correlation matrix represents an object, each column represents an object, and each element is the correlation between the object corresponding to its row and the object corresponding to its column.
  • the number of rows of the correlation matrix may be equal to the number of objects in the first set, and the number of columns of the correlation matrix may be equal to the number of objects in the first set.
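One possible reading of Method 2 is sketched below: any two objects whose correlation exceeds the fourth threshold are merged into the same cluster, and objects left alone become scatter points. Merging with a union-find structure (so that clusters grow transitively) is an assumption of this sketch, as is the strict "greater than" comparison; the application only requires that the correlation meet a preset condition.

```python
from typing import Dict, List, Sequence, Tuple


def clusters_from_correlation(
    matrix: Sequence[Sequence[float]],
    fourth_threshold: float,
) -> Tuple[List[List[int]], List[int]]:
    """Return (clusters, scatter points) as lists of row indices."""
    n = len(matrix)
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only the upper triangle is read, since the matrix is symmetric.
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] > fourth_threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) > 1]
    scatter_points = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, scatter_points
```

With transitive merging, objects 0 and 2 can end up in one cluster through a shared neighbour 1 even when their direct correlation is low; a stricter variant could require every pair in a cluster to meet the threshold.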
  • Step 2 The server obtains the similarity between the cluster and the target object and uses it as the similarity between each object in the cluster and the target object.
  • If the similarity between the cluster and the target object meets the first threshold, each object in the cluster is added to the second set.
  • If the similarity between the cluster and the target object does not meet the first threshold, each object in the cluster is added to the third set. That is, in this way, all objects in the same cluster belong to the same set.
  • For a scatter point, the similarity between the scatter point and the target object can be obtained directly, and it is determined whether this similarity meets the first threshold: if it does, the scatter point is divided into the second set; if it does not, the scatter point is divided into the third set.
  • The manner of obtaining the similarity between the cluster and the target object includes, but is not limited to, any one of the following manners (1) to (2) or a combination thereof:
  • Method (1) The server selects a representative point from the cluster, and the server obtains the similarity between the representative point and the target object as the similarity between the cluster and the target object.
  • the representative point is used to represent each object in the cluster.
  • the representative point can be used to represent all the objects in the entire cluster to measure with the target object.
  • any object from the cluster can be selected as the representative point; the cluster center can also be selected as the representative point; the object adjacent to the cluster center can also be selected as the representative point.
  • The representative point can be one object or a set of multiple objects. If the representative point includes multiple objects, the similarity between each of them and the target object can be obtained, and the average of these similarities can be used as the similarity between the cluster and the target object. Alternatively, the sum of these similarities can be used as the similarity between the cluster and the target object.
  • If the similarity between the representative point and the target object meets the first threshold, each object in the cluster is added to the second set.
  • If the similarity between the representative point and the target object does not meet the first threshold, each object in the cluster is added to the third set.
  • Method (2) The server obtains the similarity between each object in the cluster and the target object, and obtains the similarity between the cluster and the target object according to these similarities.
  • For example, the server may average the similarities between the objects and the target object and use the average as the similarity between the cluster and the target object.
  • Alternatively, the sum of the similarities between the objects and the target object can be used as the similarity between the cluster and the target object.
  • The server can determine whether the number of objects in the cluster meets the number threshold. If it does, indicating that the cluster is relatively large, method (1) is adopted; if it does not, indicating that the cluster is relatively small, method (2) is adopted.
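The choice between methods (1) and (2) can be sketched as follows. The similarity is supplied as a callable so that method (1) evaluates it only once for a large cluster. The name `cluster_similarity`, the default rule of taking the first member as the representative point, and reading "meets the number threshold" as "greater than or equal to" are all assumptions for illustration.

```python
from typing import Callable, Sequence


def cluster_similarity(
    cluster: Sequence[str],
    similarity_to_target: Callable[[str], float],
    number_threshold: int,
    pick_representative: Callable[[Sequence[str]], str] = lambda c: c[0],
) -> float:
    """Similarity between a cluster and the target object."""
    if len(cluster) >= number_threshold:
        # Method (1): a large cluster is scored through one representative.
        return similarity_to_target(pick_representative(cluster))
    # Method (2): a small cluster is scored by averaging all its members.
    return sum(similarity_to_target(o) for o in cluster) / len(cluster)
```

Method (1) trades some accuracy for a single similarity computation per cluster, which matters when clusters are large; method (2) is affordable for small clusters and uses every member.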
  • A clustering algorithm, for example the k-means clustering algorithm, a density clustering algorithm, a graph clustering algorithm, or another clustering algorithm, may be used to divide the objects in the first set into clusters or scatter points. Alternatively, the objects in the first set can be compared pairwise to obtain the correlation matrix, and a data merging algorithm can be used to divide the objects in the correlation matrix into clusters or scatter points. It is then determined whether the similarity of each cluster or scatter point meets the first threshold: if it does, the cluster or scatter point is added to the second set; if it does not, the cluster or scatter point is added to the third set.
  • the effect achieved can at least include: in the search process, the target object is relatively fuzzy or other factors, the object searched from the database may have noise data, which means that the noise data is not the correct result But an object with a high degree of similarity to the target object.
  • noise data will be incorrectly included in the search results, resulting in a low recall rate.
  • the noise data can be divided into corresponding clusters.
  • the similarity between the cluster and the target object is used as the similarity of each object in the cluster.
  • the noise data itself
  • the similarity with the target object will be replaced by the similarity between the cluster and the target object, for example, it will be replaced by the similarity between the representative point and the target object.
  • the similarity between the noise data itself and the target object is high, since the similarity between the noise data itself and the target object is not used, but the similarity between the cluster to which the noise data belongs and the target object is used, It can effectively prevent the influence of noise data and eliminate the noise data in advance, thereby solving the problem of misjudgment caused by noise data, reducing the number of false results in the search results, and greatly improving the recall rate.
  • For example, suppose a cluster includes object 1 to object 10, where object 1 is noise data and object 3 is the representative point. The similarity between object 3 and the target object is then taken as the similarity between each of objects 1 to 10 and the target object. The similarity between object 1 and the target object is thus reduced to the similarity between object 3 and the target object, which filters out the interference of object 1.
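  • The replacement of a noise object's own similarity by its cluster's similarity can be sketched as follows. This is a minimal illustration, not the method of the application: the function name, the object names, the scores, and the threshold value of 0.7 are all hypothetical, and "meets the threshold" is assumed to mean greater than or equal to it.

```python
# Illustrative sketch; all names, scores, and the threshold are hypothetical.
def cluster_level_similarity(own_similarity, cluster, representative):
    """Give every object in the cluster the representative point's similarity."""
    representative_similarity = own_similarity[representative]
    return {obj: representative_similarity for obj in cluster}

# Objects 2 to 10 have low similarity to the target object.
own_similarity = {f"object_{i}": 0.30 for i in range(2, 11)}
own_similarity["object_1"] = 0.95  # noise data: high similarity, wrong result
own_similarity["object_3"] = 0.35  # representative point of the cluster

cluster = [f"object_{i}" for i in range(1, 11)]
effective = cluster_level_similarity(own_similarity, cluster, "object_3")

first_threshold = 0.7
# Objects whose cluster-level similarity meets the first threshold.
passes = [obj for obj in cluster if effective[obj] >= first_threshold]
print(effective["object_1"], passes)
```

  • In this sketch, object 1 would have met the threshold on its own score, but it no longer does once the representative point's score is applied to the whole cluster.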
  • For example, suppose the target object is a photo of user A and the database stores 10 photos of user B.
  • One photo of user B (denoted photo X below) happens to have a high degree of similarity with the photo of user A, so photo X is noise data.
  • Although photo X is a photo of user B rather than of user A, without cluster division its high similarity with the photo of user A would cause it to be misjudged as a photo of user A and wrongly included in the search results.
  • With cluster division, the 10 photos of user B are divided into the same cluster, and the similarity of the cluster is used as the similarity between each photo of user B and the photo of user A. Photo X is thus influenced by the other 9 photos of user B: the similarity between photo X and the photo of user A is replaced by the similarity between the other photos of user B and the photo of user A. This removes the interference of photo X and avoids misjudging photo X as a photo of user A.
  • The two technical means, combining multiple similarity algorithms and dividing clusters, can be combined to form step 304.
  • Specifically, the first set can be divided into clusters or scattered points. A combination of multiple similarity algorithms is used to obtain the comprehensive similarity between each cluster and the target object, which serves as the comprehensive similarity between each object in the cluster and the target object.
  • A combination of multiple similarity algorithms is likewise used to obtain the comprehensive similarity between each scattered point and the target object. Clusters and scattered points whose comprehensive similarity meets the first threshold are added to the second set.
  • Clusters and scattered points whose comprehensive similarity does not meet the first threshold are added to the third set.
  • Step 305 The server sorts the objects in the first set in the order that the objects in the second set are first and the objects in the third set are last.
  • the server may rank the objects in the second set before the objects in the third set.
  • For example, for object i and object j in the first set, if object i is an object in the second set and object j is an object in the third set, object i will be ranked before object j.
  • If the first set includes (2*K) objects, the second set includes Q1 objects, and the third set includes Q2 objects, then after sorting, the 1st to Q1-th objects are all objects in the second set and the (Q1+1)-th to last objects are all objects in the third set, where Q1 and Q2 are both positive integers.
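  • The ordering described in step 305 can be sketched as follows. This is only an illustration under stated assumptions: the function name `rearrange`, the sample objects, and the threshold 0.7 are hypothetical, and "meets the first threshold" is assumed to mean greater than or equal to it. The order inside each set is left as it was in the first set.

```python
# Illustrative sketch; names, scores, and the threshold are hypothetical.
def rearrange(first_set, similarity, first_threshold):
    """Place objects that meet the threshold before those that do not."""
    # Second set: similarity meets the first threshold.
    second_set = [o for o in first_set if similarity[o] >= first_threshold]
    # Third set: similarity does not meet the first threshold.
    third_set = [o for o in first_set if similarity[o] < first_threshold]
    # Positions 1..Q1 hold the second set, positions Q1+1..end the third set.
    return second_set + third_set, len(second_set), len(third_set)

similarity = {"a": 0.91, "b": 0.42, "c": 0.77, "d": 0.30}
ordered, q1, q2 = rearrange(["a", "b", "c", "d"], similarity, 0.7)
print(ordered, q1, q2)
```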
  • Step 306 The server selects the target number of objects from the first set and sends them to the terminal according to the order from the front to the back.
  • The server can start from the first object in the first set and select objects from the first set in order from front to back until the number of selected objects reaches the target number, and then send the selected objects to the terminal.
  • After the terminal receives the target number of objects, it can present them to the user as the search results.
  • the terminal can display a search result page, and each content item in the search result page is an object.
  • The priority of the objects in the second set is higher than the priority of the objects in the third set. Specifically, if the number of objects in the second set is greater than or equal to the target number, the server selects only objects in the second set and none in the third set; if the number of objects in the second set is less than the target number, the server continues to select objects from the third set after selecting every object in the second set.
  • For example, suppose the target number is K, the first set includes (2*K) objects, the second set includes Q1 objects, and the third set includes Q2 objects. If K is less than Q1, the server selects the first K objects in the second set and sends them to the terminal; the remaining (Q1-K) objects in the second set and every object in the third set are not selected. If K is equal to Q1, the server sends exactly the objects in the second set to the terminal, and no object in the third set is selected. If K is greater than Q1, the server sends every object in the second set and the first (K-Q1) objects in the third set to the terminal; since Q1+Q2 equals 2*K, the remaining K objects in the third set are not selected.
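  • The three cases above can be checked with a short calculation. The sketch below is illustrative only; the function name `selection_counts` and the sample values of K, Q1, and Q2 are hypothetical, chosen so that Q1+Q2 equals 2*K as in the example.

```python
# Illustrative sketch; the function name and sample values are hypothetical.
def selection_counts(k, q1, q2):
    """Return (taken from second set, taken from third set,
    left in second set, left in third set) when K objects are selected."""
    from_second = min(k, q1)              # second set is consumed first
    from_third = min(max(k - q1, 0), q2)  # third set only fills the remainder
    return from_second, from_third, q1 - from_second, q2 - from_third

k = 5
case_less = selection_counts(k, q1=7, q2=3)     # K < Q1
case_equal = selection_counts(k, q1=5, q2=5)    # K == Q1
case_greater = selection_counts(k, q1=3, q2=7)  # K > Q1
print(case_less, case_equal, case_greater)
```

  • In the last case, (K-Q1) objects are taken from the third set and K objects of the third set remain unselected, which matches the description above.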
  • Compared with the objects in the third set, the objects in the second set are more similar to the target object and have a higher probability of being a correct result; that is, the confidence of the objects in the second set is higher. Therefore, selecting objects in the second set with high priority and objects in the third set with low priority ensures that a larger proportion of the selected objects are correct results, thereby improving the recall rate of the search.
  • the greater the similarity between the object and the target object the higher the arrangement position of the object in the second set. For example, for the object i in the second set and the object j in the second set, if the similarity between the object i and the target object is greater than the similarity between the object j and the target object, the object i is ranked before the object j.
  • By ranking objects with high similarity to the target object in the front and objects with low similarity in the back, the accuracy of the order of candidate objects can be further improved.
  • Because an object with a high similarity to the target object is more likely to be a correct result, placing such objects in the front raises the proportion of correct results among the objects selected from front to back, and tends to place the correct results at the front of the search results, so that when the terminal displays the search results, the display position of the correct results will be higher.
  • For example, the object in the second set with the greatest similarity to the target object will be ranked first in the second set, and therefore also first in the first set.
  • After the server sends the selected objects to the terminal, that object will be ranked first in the search results presented by the terminal. Moreover, objects with low similarity to the target object will be ranked toward the end of the search results or excluded from them altogether, so that the proportion of correct results in the search results is higher, thereby effectively improving the recall rate of the search.
  • Similarly, for each object in the third set, the greater the similarity between the object and the target object, the higher the arrangement position of the object in the third set may be.
  • That is, within the third set, objects with high similarity to the target object are ranked in the front and objects with low similarity to the target object are ranked in the back.
  • If, after step 303 is performed, the objects in the first set have already been sorted in order of similarity with the target object, the server can maintain the arrangement order of each object when dividing the first set into the second set and the third set; that is, the arrangement order of each object remains its arrangement order in the first set. For example, for object i and object j to be divided into the second set, if object i is ranked before object j in the first set, object i can be kept before object j when the two objects are divided into the second set.
  • Similarly, for object n and object m to be divided into the third set, if object n is ranked before object m in the first set, object n can be kept before object m when the two objects are divided into the third set. Keeping each object in its order from the first set achieves both the effect that "the greater the similarity between the object and the target object, the higher the arrangement position of the object in the second set" and the effect that "the greater the similarity between the object and the target object, the higher the arrangement position of the object in the third set".
  • the server may re-sort each object in the second set in descending order of similarity.
  • the similarity based on sorting may be the similarity used when dividing the second set and the third set, for example, it may be the comprehensive similarity of objects obtained by multiple similarity algorithms.
  • each object in the third set can be sorted according to the order of similarity from largest to smallest.
  • step 302 may be performed by one server
  • step 303 may be performed by another server
  • step 304 to step 305 may be performed by another server.
  • the method provided in this embodiment designs a set of rearrangement frameworks under a single input. After searching for some objects from the database, the objects whose similarity with the target object meets the first threshold are placed in the front row. The objects whose similarity with the target object does not meet the threshold are placed in the back row, and then the objects are selected in the order from front to back and sent to the terminal. Since the similarity meets the first threshold, the confidence of the objects is high and the result is correct. The probability is very high. By arranging these objects in the front, when selecting objects from front to back, the probability of selecting the correct result can be increased, and the proportion of the correct result selected can be increased, thereby effectively increasing the recall rate.
  • The rearrangement can be performed directly on the first set searched from the database. Compared with constructing ghost points based on the objects in the first set and the target object and then searching the database again based on the ghost points, this omits the step of constructing ghost points and the step of searching the database a second time. It thereby solves the problem of the huge amount of calculation in the rearrangement framework, improves the calculation speed, and avoids the decrease in recall rate caused by inaccurate ghost points, which improves robustness.
  • the sorting function involved in the above-mentioned embodiment in FIG. 3 may be implemented based on a queue mechanism, which is described in the embodiment in FIG. 6 below.
  • FIG. 6 is a flowchart of a similarity search method provided by an embodiment of the present application. As shown in FIG. 6, the method includes the following steps 601 to 607:
  • Step 601 The terminal sends a search instruction to the server.
  • Step 602 The server receives a search instruction from the terminal.
  • Step 603 The server searches the database to obtain the first set.
  • Step 604 The server divides the first set into a second set and a third set.
  • Step 605 The server stores the second set in the first queue, and stores the third set in the second queue.
  • the first queue and the second queue respectively represent different priorities, and the priority of the first queue is higher than the priority of the second queue.
  • The server may first create the first queue and the second queue, then store each object in the second set in the first queue and each object in the third set in the second queue. Since the confidence level of the objects stored in the first queue is higher than that of the objects stored in the second queue, the first queue can be referred to as a trust queue.
  • For example, suppose the first set is divided into cluster 1 with representative point 1, cluster 2 with representative point 2, and scattered points. If the comprehensive similarity between representative point 1 and the target object meets the first threshold, cluster 1 belongs to the second set and is stored in the first queue; if it does not meet the first threshold, cluster 1 belongs to the third set and is stored in the second queue. Likewise, if the comprehensive similarity between representative point 2 and the target object meets the first threshold, cluster 2 belongs to the second set and is stored in the first queue; if it does not, cluster 2 belongs to the third set and is stored in the second queue. If the comprehensive similarity between a scattered point and the target object meets the first threshold, the scattered point belongs to the second set and is stored in the first queue; if it does not, the scattered point belongs to the third set and is stored in the second queue.
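  • The assignment of clusters and scattered points to the two queues can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `enqueue_groups`, the member names, the scores, and the threshold 0.7 are hypothetical, and a scattered point is modelled as a group with a single member whose score is its own comprehensive similarity.

```python
from collections import deque

# Illustrative sketch; names, scores, and the threshold are hypothetical.
def enqueue_groups(groups, first_threshold):
    """groups: list of (members, comprehensive_similarity) pairs.
    For a cluster, the score is that of its representative point."""
    first_queue, second_queue = deque(), deque()
    for members, score in groups:
        # The whole group goes to one queue, decided by a single score.
        queue = first_queue if score >= first_threshold else second_queue
        queue.extend(members)
    return first_queue, second_queue

groups = [
    (["c1_a", "c1_b", "c1_c"], 0.85),  # cluster 1
    (["c2_a", "c2_b"], 0.40),          # cluster 2
    (["s1"], 0.90),                    # scattered point
    (["s2"], 0.10),                    # scattered point
]
first_queue, second_queue = enqueue_groups(groups, 0.7)
print(list(first_queue), list(second_queue))
```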
  • Step 606 The server sorts the objects in the first set in the order of the first queue first and the second queue last.
  • the server may sort the objects in the first queue before the objects in the second queue.
  • For example, for object i and object j in the first set, if object i is an object in the first queue and object j is an object in the second queue, object i will be ranked before object j. If the first set includes (2*K) objects, the first queue includes Q1 objects, and the second queue includes Q2 objects, then after sorting, the 1st to Q1-th objects are all objects in the first queue and the (Q1+1)-th to last objects are all objects in the second queue.
  • Step 607 The server selects the target number of objects from the first set in order from front to back and sends them to the terminal.
  • The server may start from the head of the first queue and select objects from the first queue in order from the head to the tail. While selecting, the server can determine whether the number of selected objects has reached the target number; if it has, the server stops selecting objects from the first queue and does not select any object in the second queue. If every object in the first queue has been selected and the target number has not been reached, the server continues selecting from the head of the second queue.
  • The priority of the first queue is higher than the priority of the second queue.
  • For example, with a target number of K and Q1 objects in the first queue: if K is less than Q1, the server selects the objects from the head of the first queue to the K-th object in the first queue; if K is equal to Q1, the server selects each object in the first queue; if K is greater than Q1, the server selects each object in the first queue, plus the objects from the head of the second queue to the (K-Q1)-th object in the second queue.
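  • The selection from the two queues can be sketched as follows. The sketch is illustrative only; the function name `select_from_queues` and the sample queue contents are hypothetical.

```python
from collections import deque

# Illustrative sketch; the function name and sample data are hypothetical.
def select_from_queues(first_queue, second_queue, target_number):
    """Take objects from the head of the first queue, then from the head of
    the second queue, until target_number objects have been selected."""
    selected = []
    for queue in (first_queue, second_queue):  # higher priority queue first
        while queue and len(selected) < target_number:
            selected.append(queue.popleft())
    return selected

# Q1 = 3 objects in the first queue, Q2 = 3 in the second queue.
k_less = select_from_queues(deque([1, 2, 3]), deque([4, 5, 6]), 2)
k_equal = select_from_queues(deque([1, 2, 3]), deque([4, 5, 6]), 3)
k_greater = select_from_queues(deque([1, 2, 3]), deque([4, 5, 6]), 5)
print(k_less, k_equal, k_greater)
```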
  • the internal arrangement order of the first queue may be determined according to the similarity with the target object. Specifically, for each object in the first queue, the greater the similarity between the object and the target object, the higher the arrangement position of the object in the first queue. For example, the head of the first queue may be the object with the greatest similarity to the target object in the first queue, and the tail of the first queue may be the object with the least similarity to the target object in the first queue.
  • When storing the second set in the first queue, the server can keep the order of each object in the first queue the same as its order in the first set.
  • Alternatively, after storing the second set in the first queue, the server may reorder the objects in the first queue in descending order of similarity with the target object. For example, the comprehensive similarity between each object and the target object, obtained by combining multiple similarity algorithms, is used to reorder the objects in the first queue.
  • the internal arrangement order of the third set may be represented by the internal arrangement order of the second queue. Specifically, for each object in the second queue, the greater the degree of similarity between the object and the target object, the higher the arrangement position of the object in the second queue may be.
  • When storing the third set in the second queue, the server can keep the order of each object in the second queue the same as its order in the first set.
  • the objects in the second queue may be re-ordered in descending order of similarity with the target object.
  • By ranking objects with high similarity to the target object in the front and objects with low similarity in the back, the accuracy of the order of objects can be further improved, so that the display position of the correct results will be further forward. For example, the object in the first queue with the greatest similarity to the target object will be ranked first in the first queue, and therefore also first in the first set.
  • After the server sends the selected objects to the terminal, that object will be ranked first in the search results presented by the terminal.
  • Moreover, objects with low similarity to the target object are ranked toward the end of their queue, so they will be ranked toward the end of the search results or excluded from them. This reduces the number of errors in the search results and raises the proportion of correct results, thereby effectively improving the search recall rate.
  • For example, if K is greater than Q1, the server sends each object in the first queue, together with the objects from the one with the highest similarity in the second queue to the one in the (K-Q1)-th position of the second queue, to the terminal; the last K objects in the second queue are not selected and are not sent to the terminal. Since the last K objects in the second queue are more likely to be wrong results than the other objects in the first set, excluding these K objects from the search results can improve the recall rate of the search results.
  • The method provided in this embodiment designs a queue-based rearrangement framework on the basis of achieving the effects of the embodiment in FIG. 3.
  • By storing the second set in the first queue and the third set in the second queue, the objects searched out from the database are divided into multiple queues.
  • The objects in the first queue are ranked first and the objects in the second queue are ranked second, and objects are then selected in order from front to back, so that the objects in the first queue are selected with high priority and the objects in the second queue are selected with low priority.
  • Since the confidence of the objects in the first queue is higher than that of the objects in the second queue, when the total number of selected objects is fixed, this increases the probability of selecting objects with high confidence and reduces the probability of selecting objects with low confidence, thereby increasing the proportion of high-confidence objects in the search results and thus the recall rate of the search results. It also places high-confidence objects at the front of the search results, which improves the accuracy of the search results.
  • the third set may be further divided into different sets, which will be described below with reference to the embodiment in FIG. 8.
  • FIG. 8 is a flowchart of a similarity search method provided by an embodiment of the present application. As shown in FIG. 8, the method includes steps 801 to 806 performed by the server:
  • Step 801 The terminal sends a search instruction to the server.
  • Step 802 The server receives a search instruction from the terminal.
  • Step 803 The server searches the database to obtain the first set.
  • Step 804 The server divides the first set into a second set and a third set.
  • Step 805 The server divides the third set into a fourth set and a fifth set.
  • The second set is used as the basis for rearrangement: the third set is divided according to whether the similarity with the second set meets the second threshold, and the objects are reordered according to the result of the division.
  • Since the second set usually includes multiple objects, rearranging based on the second set, as opposed to rearranging based on a single candidate object, expands the granularity of rearrangement from a single candidate object to an entire set. Coarsening the granularity in this way avoids the decrease in recall rate caused by inaccurate selection of a single candidate object, and so improves the robustness of the rearrangement method.
  • In addition, the similarity between each object in the second set and the target object meets the first threshold, so the confidence of the objects in the second set is high.
  • the similarity between each object in the fourth set and the second set meets the second threshold, and the similarity between each object in the fifth set and the second set does not meet the second threshold.
  • the second threshold may be lower than the first threshold, the second threshold may be higher than the third threshold, the second threshold may be set according to experiment, experience or demand, and the second threshold may be stored in the server in advance.
  • the similarity meeting the second threshold may mean that the similarity is greater than the second threshold, and the similarity not meeting the second threshold may mean that the similarity is less than or equal to the second threshold.
  • the similarity meeting the second threshold may mean that the similarity is greater than or equal to the second threshold, and the similarity not meeting the second threshold may mean that the similarity is less than the second threshold.
  • Compared with the objects in the fifth set, the objects in the fourth set have a higher probability of being the correct result, so the confidence of the objects in the fourth set is higher; when the objects are rearranged, the objects in the fourth set are therefore given a higher order priority than the objects in the fifth set.
  • The server may create the fourth set and the fifth set. For each object in the third set, the server may obtain the similarity between the object and the second set and determine whether that similarity meets the second threshold: if it does, the object is added to the fourth set; if it does not, the object is added to the fifth set.
  • For example, if the second set includes n objects, where n is a positive integer, then for object i in the third set, the similarity between object i and each of the n objects can be obtained, giving n similarities. The n similarities can be combined to obtain the similarity between object i and the second set.
  • The combination method includes but is not limited to any one of weighted summation, summation, weighted average, and average, and combinations thereof.
  • The process of obtaining the similarity between the object and the second set can be understood as scoring the object; the similarity between the object and the second set can be understood as the score of the object, which reflects the confidence of the object.
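  • The scoring of an object against the second set, and the resulting division of the third set, can be sketched as follows. This is a minimal illustration under stated assumptions: objects are modelled as coordinate tuples, the pairwise similarity is assumed to be 1/(1+Euclidean distance), the n similarities are combined by a plain average, and "meets the second threshold" is assumed to mean greater than or equal to it. The function names and sample values are hypothetical.

```python
import math

# Illustrative sketch; the similarity formula, names, and data are assumptions.
def pair_similarity(x, y):
    """Map Euclidean distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + math.dist(x, y))

def set_similarity(obj, second_set):
    """Average of the n similarities between obj and the n objects."""
    sims = [pair_similarity(obj, member) for member in second_set]
    return sum(sims) / len(sims)

def split_third_set(third_set, second_set, second_threshold):
    """Divide the third set into a fourth set and a fifth set."""
    fourth_set, fifth_set = [], []
    for obj in third_set:
        score = set_similarity(obj, second_set)
        (fourth_set if score >= second_threshold else fifth_set).append(obj)
    return fourth_set, fifth_set

second_set = [(0.0, 0.0), (0.0, 2.0)]
third_set = [(0.0, 1.0), (10.0, 10.0)]
fourth_set, fifth_set = split_third_set(third_set, second_set, 0.4)
print(fourth_set, fifth_set)
```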
  • The server may use a combination of multiple similarity algorithms to obtain the comprehensive similarity between each object in the third set and the second set. If the comprehensive similarity meets the second threshold, the object is added to the fourth set; if the comprehensive similarity does not meet the second threshold, the object is added to the fifth set.
  • For each of the multiple similarity algorithms, the server can use that algorithm to obtain the similarity between the object in the third set and the second set, and then combine the similarities obtained by the multiple similarity algorithms to obtain the comprehensive similarity between the object and the second set.
  • the method of combining multiple similarities includes, but is not limited to, any one of weighted summation, summation, weighted average, and average, and combinations thereof.
  • For example, the similarities between the object in the third set and the second set obtained by the individual similarity algorithms can be weighted and averaged, and the weighted average is used as the comprehensive similarity between the object in the third set and the second set.
  • For example, the Euclidean distance algorithm, the rank order algorithm, and a machine learning model can each be used to obtain a similarity between the object and the second set, giving three similarities. These three similarities are then combined into one similarity, and it is determined whether this similarity meets the second threshold. If the second threshold is met, the object is added to the fourth set; if the second threshold is not met, the object is added to the fifth set.
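  • The combination of three similarities into one comprehensive similarity can be sketched with a weighted average. The sketch does not implement the Euclidean distance algorithm, the rank order algorithm, or a machine learning model; it assumes their three scores are already available. The function name, the scores, the weights, and the threshold 0.7 are hypothetical.

```python
# Illustrative sketch; the scores, weights, and threshold are hypothetical.
def comprehensive_similarity(scores, weights):
    """Weighted average of the per-algorithm similarities."""
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight

# Similarity between one object and the second set under each algorithm.
scores = {"euclidean": 0.80, "rank_order": 0.60, "model": 0.70}
weights = {"euclidean": 0.5, "rank_order": 0.25, "model": 0.25}

combined = comprehensive_similarity(scores, weights)
second_threshold = 0.7
# The object joins the fourth set only if the combined score meets the threshold.
target_set = "fourth" if combined >= second_threshold else "fifth"
print(combined, target_set)
```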
  • the multiple similarity algorithms used may include a rank order algorithm or other similarity algorithms that consider group relationships, so that the accuracy of the similarity between the object and the second set can be improved.
  • the multiple similarity algorithms used when dividing the fourth set and the fifth set may include the first similarity algorithm, that is, the similarity algorithm used when searching the database.
  • The multiple similarity algorithms may also include similarity algorithms other than the first similarity algorithm.
  • the first similarity algorithm may be the Euclidean distance algorithm
  • the multiple similarity algorithms may include the Euclidean distance algorithm and the rank order algorithm.
  • The first set is first obtained based on the similarity given by the first similarity algorithm; the fourth set and the fifth set are then divided according to the comprehensive similarity obtained by combining the first similarity algorithm with other similarity algorithms. Because combining the first similarity algorithm with other similarity algorithms compensates for the shortcomings of its measurement method and thereby improves on it, the comprehensive similarity is more accurate than the similarity obtained by the first similarity algorithm alone. An object with a high comprehensive similarity therefore has a higher probability of being the correct result, so the confidence of the fourth set as a whole is higher than that of the fifth set. Ranking the objects in the fourth set before the objects in the fifth set therefore improves the accuracy of the order of objects in the first set.
  • the similarity algorithm used when dividing the fourth set and the fifth set may be the same or different from the similarity algorithm used when dividing the second set and the third set.
  • the similarity algorithm used when dividing the fourth set and the fifth set can be more than the similarity algorithm used when dividing the second set and the third set.
  • The multiple similarity algorithms used when dividing the fourth set and the fifth set may include the first similarity algorithm and the second similarity algorithm, and may also include other similarity algorithms. For example, the Euclidean distance algorithm is used when searching the database; the Euclidean distance algorithm and the rank order algorithm can be used when dividing the second set and the third set; and when dividing the fourth set and the fifth set, the Euclidean distance algorithm and the rank order algorithm can still be used, optionally together with other similarity algorithms, so that the accuracy of the similarity is higher when the fourth set and the fifth set are divided.
  • The server may also use a single similarity algorithm to obtain the similarity between the objects in the third set and the second set.
  • The third set can be divided into clusters, and the similarity can be obtained according to the cluster to which each object belongs.
  • this method may include the following steps 1 to 2.
  • Step 1 The server obtains clusters from the third set.
  • the number of clusters obtained from the third set can be one or more.
  • the cluster acquisition manner may include any one of the following manners 1 to 2, and a combination thereof.
  • Method 1 The server uses a clustering algorithm to cluster the third set to obtain clusters.
  • Method 2 For each object in the third set, the server obtains the correlation between the object and the other objects in the third set and, according to these correlations, divides the objects whose correlation meets the preset condition into clusters.
  • The server can compare the objects in the third set in pairs to obtain the correlation between any two objects in the third set, and can determine whether the correlation between each object and the other objects in the third set meets the preset condition. If it does, the object is divided into a cluster; if it does not, the object is taken as a scattered point.
  • The server may generate a correlation matrix based on the correlation between each object in the third set and the other objects in the third set, traverse the correlation matrix to find the objects whose correlation satisfies the preset condition, divide these objects into clusters, and use the remaining objects in the correlation matrix, outside the clusters, as scattered points.
  • the correlation matrix may be as shown in Table 1 in step 306 in the embodiment of FIG. 3 above.
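  • One possible way to derive clusters and scattered points from a correlation matrix is sketched below. The application does not specify the preset condition or the merging procedure, so this sketch makes two loud assumptions: the preset condition is taken to be "correlation greater than or equal to a minimum value", and objects are merged transitively with a union-find structure. The function name and the matrix values are hypothetical.

```python
# Illustrative sketch; the preset condition and merging rule are assumptions.
def clusters_from_correlation(matrix, min_correlation):
    """Group object indices whose pairwise correlation meets the condition."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison over the upper triangle of the matrix.
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= min_correlation:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) > 1]
    scattered = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, scattered

matrix = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.8, 0.2, 1.0, 0.3],
    [0.1, 0.1, 0.3, 1.0],
]
clusters, scattered = clusters_from_correlation(matrix, 0.7)
print(clusters, scattered)
```

  • With transitive merging, objects 1 and 2 end up in the same cluster through object 0 even though their direct correlation is low; a stricter preset condition would give a different division.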
  • Step 2 The server obtains the similarity between the cluster and the second set as the similarity between each object in the cluster and the second set.
  • If the similarity between the cluster and the second set meets the second threshold, each object in the cluster is added to the fourth set. If the similarity between the cluster and the second set does not meet the second threshold, each object in the cluster is added to the fifth set. That is, in this manner, objects belonging to the same cluster are added to the same set.
  • For a scattered point, the similarity between the scattered point and the second set is obtained, and it is determined whether the similarity meets the second threshold. If the similarity meets the second threshold, the scattered point is divided into the fourth set. If the similarity does not meet the second threshold, the scattered point is divided into the fifth set.
  • the manner of obtaining the similarity between the cluster and the second set includes but is not limited to any one of the following manners (1) to (2) and combinations thereof:
  • Manner (1): The server selects a representative point from the cluster and obtains the similarity between the representative point and the second set as the similarity between the cluster and the second set.
  • If multiple representative points are selected, the similarity between each representative point and the second set can be obtained; these similarities can be averaged, and the average value is used as the similarity between the cluster and the second set.
  • Alternatively, the sum of the similarities between the representative points and the second set can be obtained and used as the similarity between the cluster and the second set.
  • If the similarity between the representative point and the second set meets the second threshold, each object in the cluster is added to the fourth set; if the similarity between the representative point and the second set does not meet the second threshold, each object in the cluster is added to the fifth set.
  • Manner (2): The server obtains the similarity between each object in the cluster and the second set, and obtains the similarity between the cluster and the second set according to the similarity between each object and the second set.
  • the server may average the similarity between each object and the second set, and use the average as the similarity between the cluster and the second set.
  • the sum value of the similarity between each object and the second set can be obtained, and the sum value is used as the similarity between the cluster and the second set.
  • the server can determine whether the number of objects in the cluster meets the number threshold. If the number of objects in the cluster meets the number threshold, method (1) is adopted. If the number of objects in the cluster does not meet the number threshold, method (2) is adopted.
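  • the choice between method (1) and method (2) can be sketched as follows. This is only an illustration under stated assumptions: objects are modelled as plain numbers, sim_to_set is a toy similarity invented for the example, "meets the number threshold" is taken to mean "at least size_threshold objects", and the representative point is taken to be the member closest to the cluster mean. None of these choices is specified by this application.

```python
from statistics import mean
from typing import Sequence


def sim_to_set(obj: float, ref_set: Sequence[float]) -> float:
    # Toy similarity (assumption): 1 minus the distance to the closest member.
    return max(1.0 - abs(obj - r) for r in ref_set)


def cluster_similarity(
    cluster: Sequence[float], ref_set: Sequence[float], size_threshold: int = 4
) -> float:
    """Similarity between a cluster and the second set (here ref_set)."""
    if len(cluster) >= size_threshold:
        # Method (1): a single representative point stands in for the cluster.
        centre = mean(cluster)
        representative = min(cluster, key=lambda o: abs(o - centre))
        return sim_to_set(representative, ref_set)
    # Method (2): average the similarity of every object in the cluster.
    return mean(sim_to_set(o, ref_set) for o in cluster)


second_set = [0.0, 1.0]
small_cluster = [0.5, 0.75]                  # below the size threshold
large_cluster = [1.0, 1.25, 1.5, 1.75, 2.0]  # meets the size threshold

print(cluster_similarity(small_cluster, second_set))
print(cluster_similarity(large_cluster, second_set))
```

  • using one representative for large clusters trades some accuracy for fewer similarity computations, while small clusters can afford the per-object average. Replacing mean with sum in method (2) gives the sum-value variant described above.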
  • a clustering algorithm can be used to divide the objects in the third set into clusters or scattered points; or, the objects in the third set can be compared in pairs to obtain a correlation matrix, and a data merging algorithm can be used to divide the objects in the correlation matrix into clusters or scattered points. It is then judged whether the similarity of each cluster or scattered point meets the second threshold: if the similarity meets the second threshold, the cluster or scattered point is added to the fourth set, and if the similarity does not meet the second threshold, the cluster or scattered point is added to the fifth set.
  • the effect achieved can at least include the following. In the search process, the objects found in the database may include noise data because the target object is unclear or for other reasons. In related technologies, noise data causes more false results to be included in the search results, resulting in a lower recall rate.
  • the noise data can be divided into corresponding clusters. Then, the similarity between the cluster and the second set is used as the similarity of each object in the cluster.
  • the similarity between the noise data itself and the second set is thus replaced by the similarity between the cluster and the second set. Even if the similarity between the noise data itself and the second set is high, that similarity is not used; the similarity between the cluster to which the noise data belongs and the second set is used instead.
  • this effectively prevents the influence of the noise data, solves the problem of misjudgment caused by the noise data, and reduces the number of false results in the search results, thereby greatly improving the recall rate.
  • the two technical means of combining multiple similarity algorithms and dividing clusters can be combined to form step 805.
  • the third set can first be divided into clusters or scatter points. A combination of multiple similarity algorithms is used to obtain the similarity between each cluster and the second set, which serves as the similarity between each object in the cluster and the second set.
  • multiple similarity algorithms are likewise combined to obtain the similarity between each scattered point and the second set.
  • the clusters and scattered points whose similarity meets the second threshold are added to the fourth set, and the clusters and scattered points whose similarity does not meet the second threshold are added to the fifth set.
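  • the combination of the two technical means can be sketched as follows. The sketch assumes that the scores of several similarity algorithms are combined by a weighted sum and that "meets the second threshold" means "greater than or equal to"; the function names, weights, scores, and threshold are hypothetical values chosen for illustration, since this application does not fix how the algorithms are combined.

```python
from typing import List, Sequence, Tuple


def combined_similarity(scores: Sequence[float], weights: Sequence[float]) -> float:
    # Assumption: the comprehensive similarity is a weighted sum of the
    # scores returned by the individual similarity algorithms.
    return sum(s * w for s, w in zip(scores, weights))


def split_third_set(
    clusters: List[Tuple[List[str], Sequence[float]]],
    scatter: List[Tuple[str, Sequence[float]]],
    weights: Sequence[float],
    second_threshold: float,
) -> Tuple[List[str], List[str]]:
    """Return (fourth_set, fifth_set) built from clusters and scatter points."""
    fourth: List[str] = []
    fifth: List[str] = []
    # Every member of a cluster inherits the cluster-level similarity,
    # so a noisy member cannot be judged on its own score.
    for members, scores in clusters:
        meets = combined_similarity(scores, weights) >= second_threshold
        (fourth if meets else fifth).extend(members)
    # Scatter points are judged individually.
    for obj, scores in scatter:
        meets = combined_similarity(scores, weights) >= second_threshold
        (fourth if meets else fifth).append(obj)
    return fourth, fifth


# Each entry pairs the objects with the scores of two hypothetical algorithms.
clusters = [(["a", "b", "noise"], (0.9, 0.7)), (["c", "d"], (0.2, 0.4))]
scatter = [("e", (0.8, 0.6)), ("f", (0.1, 0.3))]
fourth_set, fifth_set = split_third_set(
    clusters, scatter, weights=(0.5, 0.5), second_threshold=0.6
)
print(fourth_set, fifth_set)
```

  • note that the object labelled "noise" ends up in the same set as the rest of its cluster, which is the behaviour the cluster-level similarity is intended to produce.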
  • Step 806 The server sorts the objects in the first set in the order that the objects in the second set are first and the objects in the third set are last.
  • Step 807 The server sorts the objects in the third set in the order that the objects in the fourth set are first and the objects in the fifth set are last.
  • the server may rank the objects in the fourth set before the objects in the fifth set.
  • the order of the objects in the third set is specifically as follows: the objects in the fourth set are first and the objects in the fifth set are second.
  • for any object i and object j in the third set, if object i is an object in the fourth set and object j is an object in the fifth set, then object i is ranked before object j.
  • for example, suppose the target number is K, the first set includes (2*K) objects, the second set includes Q1 objects, the third set includes Q2 objects, the fourth set includes Q3 objects, and the fifth set includes Q4 objects. The order of the objects in the first set is then: the Q1 objects in the second set are ranked first, the Q3 objects in the fourth set are ranked in the middle, and the Q4 objects in the fifth set are ranked last.
  • step 806 and step 807 can be executed sequentially.
  • step 806 may be executed first, and then step 807 may be executed; or step 807 may be executed first, and then step 806 may be executed.
  • step 806 and step 807 can also be executed in parallel, that is, step 806 and step 807 can be executed simultaneously.
  • Step 808 The server selects the target number of objects from the first set in order from front to back and sends them to the terminal.
  • the objects in the second set are ranked first
  • the objects in the fourth set are ranked after the objects in the second set
  • the objects in the fifth set are ranked after the objects in the fourth set.
  • the priority of the objects in the second set is higher than the priority of the objects in the fourth set
  • the priority of the objects in the fourth set is higher than the priority of the objects in the fifth set.
  • if the number of objects in the second set is not less than the target number, the server will select only objects in the second set and will not select objects in the third set. If the number of objects in the second set is less than the target number, the server will select the objects in the second set and then continue to select objects from the fourth set. In the latter case, if the sum of the number of objects in the second set and the number of objects in the fourth set is greater than the target number, the server selects objects from the second set and the fourth set, but does not select objects from the fifth set. If the sum of the number of objects in the second set and the number of objects in the fourth set is less than the target number, the server will select the objects in the second set and the objects in the fourth set and continue to select objects from the fifth set.
  • for example, suppose the first set includes (2*K) objects, the second set includes Q1 objects, the third set includes Q2 objects, the fourth set includes Q3 objects, and the fifth set includes Q4 objects. The server can sort the (2*K) objects in the first set in the order of the Q1 objects in the second set first, the Q3 objects in the fourth set second, and the Q4 objects in the fifth set last, select K objects from the sorted first set, and send them to the terminal.
  • if K is less than Q1, the server will select K objects from the second set to send to the terminal; (Q1-K) objects in the second set remain unselected, and none of the Q3 objects in the fourth set or the Q4 objects in the fifth set is selected. If K is equal to Q1, the server will send exactly the Q1 objects in the second set to the terminal, and none of the Q3 objects in the fourth set or the Q4 objects in the fifth set is selected. If K is greater than Q1 and less than (Q1+Q3), the server will send the Q1 objects in the second set and (K-Q1) objects in the fourth set to the terminal; the remaining (Q3-K+Q1) objects in the fourth set and all Q4 objects in the fifth set will not be selected.
  • if K is equal to (Q1+Q3), the server will send all Q1 objects in the second set and all Q3 objects in the fourth set to the terminal. If K is greater than (Q1+Q3), the server will send the Q1 objects in the second set, the Q3 objects in the fourth set, and (K-Q1-Q3) objects in the fifth set to the terminal.
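  • the case analysis above reduces to taking as many objects as possible from the second set, then from the fourth set, then from the fifth set. A minimal sketch of that arithmetic follows; the function name and the concrete values of Q1, Q3, and Q4 are chosen only for illustration.

```python
from typing import Tuple


def selection_counts(q1: int, q3: int, q4: int, k: int) -> Tuple[int, int, int]:
    """How many objects are taken from the second, fourth and fifth sets.

    q1, q3, q4 are the sizes of the second, fourth and fifth sets,
    and k is the target number.
    """
    from_second = min(k, q1)
    from_fourth = min(k - from_second, q3)
    from_fifth = min(k - from_second - from_fourth, q4)
    return from_second, from_fourth, from_fifth


# Illustrative sizes: Q1 = 3, Q3 = 4, Q4 = 5.
for k in (2, 3, 5, 7, 9):
    print(k, selection_counts(3, 4, 5, k))
```

  • with these sizes, K = 5 yields 3 objects from the second set and K-Q1 = 2 objects from the fourth set, and K = 9 yields K-Q1-Q3 = 2 objects from the fifth set, matching the expressions given above.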
  • compared with the objects in the fifth set, the objects in the fourth set are more similar to the second set and have a higher probability of being a correct result; that is, the confidence of the objects in the fourth set is higher. Making the priority of the objects in the fourth set higher than the priority of the objects in the fifth set therefore ensures that correct results make up a larger proportion of the selected objects, thereby improving the recall rate of the search.
  • the greater the similarity between the object and the second set the higher the arrangement position of the object in the third set.
  • the similarity on which the arrangement position of the objects in the third set is based can be a similarity obtained through a single similarity algorithm, or a comprehensive similarity obtained by combining multiple similarity algorithms. This embodiment places no restriction on this.
  • ranking the objects with greater similarity to the second set toward the front and the objects with less similarity to the second set toward the back can further improve the accuracy of the order of the candidate objects, because objects with greater similarity to the second set are more likely to be correct results.
  • in this way, the proportion of correct results among the selected objects can be increased, and as many objects similar to the second set as possible can be selected, so that correct results make up a larger proportion of the search results, which effectively improves the recall rate.
  • objects with low similarity to the second set are ranked toward the back of the search results, or are kept out of the search results altogether, which reduces the number of false results in the search results and thereby effectively improves the recall rate of the search.
  • the server can re-sort the objects in the third set in descending order of their similarity to the second set. This sorting achieves the above-mentioned effect of "the greater the similarity between the object and the second set, the higher the arrangement position of the object in the third set".
  • the arrangement positions of all objects may be related to the similarity of the second set, or only the arrangement positions of some objects may be related to the similarity of the second set.
  • the server may re-sort the objects in the fourth set in descending order of their similarity to the second set.
  • the order of the objects in the fourth set can thus be updated from the order of similarity with the target object to the order of similarity with the second set. Therefore, the fourth set can be recorded as the rearrangement result set.
  • for the fifth set, the server may keep the objects in the order in which they were found when searched from the database. That is, the front-to-back order of the objects in the fifth set can be the descending order of similarity between each object and the target object, so the object with the greatest similarity to the target object ranks first in the fifth set. Then, when step 808 is performed, if K objects are selected from the fifth set, the K objects keep the order they had in step 803, so that the order of these K objects in the search result matches the initial order.
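  • the two ordering rules, re-sorting the fourth set by similarity to the second set while keeping the fifth set in its initial search order, can be sketched as follows. The function name, object labels, similarity values, and threshold are hypothetical, and "meets the second threshold" is again assumed to mean "greater than or equal to".

```python
from typing import Dict, List


def order_third_set(
    third: List[str], sim_to_second: Dict[str, float], second_threshold: float
) -> List[str]:
    """Return the third set with the fourth set first and the fifth set last.

    `third` is assumed to be in the initial search order, that is, in
    descending order of similarity to the target object.
    """
    fourth = [o for o in third if sim_to_second[o] >= second_threshold]
    fifth = [o for o in third if sim_to_second[o] < second_threshold]
    # Fourth set: rearranged by similarity to the second set (largest first).
    fourth.sort(key=lambda o: sim_to_second[o], reverse=True)
    # Fifth set: left untouched, so it keeps the initial search order.
    return fourth + fifth


third_set = ["p", "q", "r", "s"]  # initial search order
similarity = {"p": 0.3, "q": 0.9, "r": 0.7, "s": 0.1}
ordered = order_third_set(third_set, similarity, second_threshold=0.5)
print(ordered)
```

  • here "q" and "r" meet the threshold and are placed first in descending order of similarity, while "p" and "s" follow in the order in which they were originally found.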
  • the second set may be empty, and the fourth set may also be empty, in which case every object in the first set is added to the fifth set. Since the objects in the fifth set are arranged in descending order of similarity with the target object, selecting objects in order from front to back selects the target number of objects with the highest similarity to the target object. Thus, even when the first set contains no high-confidence objects, the search results are guaranteed to be no worse than in the related technology, which provides a fallback guarantee.
  • the objects in the third set whose similarity with the second set meets the threshold are added to the fourth set, the objects in the third set whose similarity with the second set does not meet the threshold are added to the fifth set, and the fourth set is then arranged before the fifth set, so as to rearrange the objects in the third set.
  • the sorting function involved in the above-mentioned embodiment in FIG. 8 may be implemented based on a queue mechanism, which is described in the embodiment in FIG. 11 below.
  • FIG. 11 is a flowchart of a similarity search method provided by an embodiment of the present application. As shown in FIG. 11, the method includes the following steps 1101 to 1109:
  • Step 1101 The terminal sends a search instruction to the server.
  • Step 1102 The server receives the search instruction of the terminal.
  • Step 1103 The server searches the database to obtain the first set.
  • Step 1104 The server divides the first set into a second set and a third set.
  • Step 1105 The server stores the second set into the first queue.
  • Step 1106 The server divides the third set into a fourth set and a fifth set.
  • the process of measuring the similarity between an object and the second set in the embodiment of FIG. 8 can be replaced with measuring the similarity between the object and the first queue. That is, the server can obtain the similarity between each object in the third set and the first queue; if the similarity between an object in the third set and the first queue meets the second threshold, the object is added to the fourth set; if the similarity between the object and the first queue does not meet the second threshold, the object is added to the fifth set.
  • Step 1107 The server stores the fourth set in the second queue, and stores the fifth set in the third queue.
  • the first queue, the second queue, and the third queue respectively represent different priorities.
  • the priority of the first queue is higher than the priority of the second queue, and the priority of the second queue is higher than the priority of the third queue.
  • the server may first create the first queue, the second queue, and the third queue; the server may then store each object in the second set in the first queue, each object in the fourth set in the second queue, and each object in the fifth set in the third queue.
  • the third set can be divided into cluster 1, cluster 2 and multiple scattered points; representative point 1 is selected from cluster 1, and representative point 2 is selected from cluster 2.
  • multiple similarity algorithms are used to obtain the comprehensive similarity between representative point 1 and the second set, and multiple similarity algorithms are used to obtain the comprehensive similarity between representative point 2 and the second set.
  • Step 1108 The server sorts the objects in the first set in the order of the first queue first, the second queue second, and the third queue last.
  • the server may sort the objects in the first queue before the objects in the second queue, and sort the objects in the second queue before the objects in the third queue.
  • the order of the objects in the first set is specifically as follows: the objects in the second set are first, the objects in the fourth set are in the middle, and the objects in the fifth set are the last.
  • for any object i, object j, and object k in the first set, if object i is an object in the first queue, object j is an object in the second queue, and object k is an object in the third queue, then object i is sorted before object j, and object j is sorted before object k.
  • for example, if the first queue includes Q1 objects, the second queue includes Q2 objects, and the third queue includes Q3 objects, then in the sorted first set the 1st to Q1-th objects are all objects in the first queue, the (Q1+1)-th to (Q1+Q2)-th objects are all objects in the second queue, and the (Q1+Q2+1)-th to last objects are all objects in the third queue.
  • Step 1109 The server selects the target number of objects from the first set in order from front to back and sends them to the terminal.
  • the server can start from the first object in the first queue and select objects from the first queue in order from the head of the queue to the tail of the queue. If the number of selected objects reaches the target number, the server stops selecting from the first queue, and objects in the second queue will not be selected. If the tail of the first queue has been reached and the number of selected objects has not reached the target number, the server keeps the objects selected from the first queue and continues, starting from the head of the second queue, to select objects from the second queue in order from the head to the tail of the queue.
  • if the tail of the second queue has been reached and the number of selected objects still has not reached the target number, the server keeps the objects selected from the first queue and the second queue and continues, starting from the head of the third queue, to select objects from the third queue in order from the head to the tail of the queue.
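  • the selection procedure across the three queues can be sketched with double-ended queues. This is an illustrative sketch only: the function name take_from_queues and the object labels are hypothetical, and Python's collections.deque is used as one possible queue implementation.

```python
from collections import deque
from typing import Deque, Iterable, List


def take_from_queues(queues: Iterable[Deque[str]], target: int) -> List[str]:
    """Select `target` objects, draining queues in priority order.

    `queues` must be given from highest to lowest priority; each queue is
    consumed from head to tail before the next one is touched.
    """
    selected: List[str] = []
    for queue in queues:
        while queue and len(selected) < target:
            selected.append(queue.popleft())
        if len(selected) == target:
            break  # lower-priority queues are not consulted at all
    return selected


first_queue = deque(["a1", "a2"])
second_queue = deque(["b1", "b2", "b3"])
third_queue = deque(["c1", "c2"])
picked = take_from_queues([first_queue, second_queue, third_queue], target=4)
print(picked, list(second_queue), list(third_queue))
```

  • in this run the first queue is exhausted, two objects are taken from the head of the second queue, and the third queue is never touched because the target number has already been reached.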
  • the priority of the first queue is higher than the priority of the second queue
  • the priority of the second queue is higher than the priority of the third queue.
  • the first set includes (2*K) objects
  • the first queue includes Q1 objects
  • the second queue includes Q2 objects
  • when storing the fourth set in the second queue, the server may reorder the objects in the second queue in descending order of their similarity with the second set.
  • the objects in the second queue can be reordered according to the similarity between the objects in the first set and the second set obtained by multiple similarity algorithms.
  • the order of the different objects in the second queue can also be kept as the order of each object in the first set in step 1103. Specifically, for each object in the second queue, the greater the similarity between the object and the target object, the higher the arrangement position of the object in the second queue.
  • when the server stores the fifth set in the third queue, the server may maintain the order of the objects in the third queue as their order in the first set.
  • for example, suppose that among all the objects in the database the object with the greatest similarity to the target object is object 1, followed by object 2, then object 3, and so on. If 10 objects are searched out, the first set is (object 1, object 2, object 3...object 10).
  • the fourth set is (Object 4, Object 5, Object 6), and the fifth set is (Object 7, Object 8, Object 9, Object 10).
  • suppose the similarity between object 6 and the second set is the largest, the similarity between object 5 and the second set is the second largest, and the similarity between object 4 and the second set is the smallest.
  • the internal arrangement order of the second queue is then: object 6 at the head of the queue, object 5 in the middle of the queue, and object 4 at the tail of the queue.
  • the internal arrangement order of the third queue can be (object 7, object 8, object 9, object 10), that is, the order remains unchanged. Therefore, the third queue can be recorded as an order-preserving queue.
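  • the worked example above can be reproduced as follows. The second set is taken to be objects 1 to 3, which is inferred here from the fourth and fifth sets covering objects 4 to 10, and the numeric similarity values are made up; only their relative order (object 6 > object 5 > object 4) comes from the example.

```python
from collections import deque

# Initial search order: descending similarity to the target object.
first_set = [f"object {n}" for n in range(1, 11)]

second_set = first_set[0:3]   # inferred: objects 1-3 met the first threshold
fourth_set = first_set[3:6]   # objects 4-6
fifth_set = first_set[6:10]   # objects 7-10

# Hypothetical similarity values to the second set; only the ranking matters.
sim_to_second = {"object 4": 0.6, "object 5": 0.7, "object 6": 0.8}

first_queue = deque(second_set)
# Second queue: rearranged in descending order of similarity to the second set.
second_queue = deque(sorted(fourth_set, key=sim_to_second.get, reverse=True))
# Third queue: order-preserving, identical to the initial search order.
third_queue = deque(fifth_set)

final_order = list(first_queue) + list(second_queue) + list(third_queue)
print(final_order)
```

  • the resulting order is objects 1, 2, 3, then 6, 5, 4, then 7, 8, 9, 10, so selecting a target number of 5 objects from the front would return objects 1, 2, 3, 6 and 5.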
  • the method provided in this embodiment designs a queue-based rearrangement framework on the basis of achieving the effect achieved by the embodiment in FIG. 8.
  • the objects searched from the database can be divided into multiple queues.
  • the objects in the first queue are ranked first, and the objects in the second queue are ranked in the middle.
  • the objects in the third queue are ranked last.
  • the objects in the first queue are selected with high priority
  • the objects in the second queue are selected second
  • the objects in the third queue are selected with low priority.
  • the confidence of the objects in the first queue is the highest
  • the confidence of the objects in the second queue is the second
  • the confidence of the objects in the third queue is the lowest.
  • the queuing mechanism shown in the embodiment of FIG. 8 and the embodiment of FIG. 11 is only an exemplary implementation, and the protection scope of the present application is not limited to this.
  • the queuing mechanism can be equivalently replaced with other sequential storage structures, for example by replacing the queues in the embodiment of FIG. 8 and the embodiment of FIG. 11 with arrays, and these modifications or replacements should be covered by the protection scope of this application.
  • the similarity search method of the embodiment of the present application is described above, and the similarity search device of the embodiment of the present application is introduced below. It should be understood that the similarity search device has any function of the server in the above method.
  • the application also provides a similarity search device.
  • the similarity search device 1300 includes a receiving module 1301, a searching module 1302, a dividing module 1303, a sorting module 1304, and a sending module 1305.
  • the above modules can be software modules.
  • the receiving module 1301 is used to perform step 302; the search module 1302 is used to perform step 303; the dividing module 1303 is used to perform step 304; the sorting module 1304 is used to perform step 305; and the sending module 1305 is used to perform step 306.
  • the division module is also used to execute step 805 or step 1106.
  • the sorting module is specifically configured to execute step 1105 to step 1108.
  • the sorting module is specifically configured to perform step 605 and step 606.
  • the similarity search device 1300 may be provided to users as a cloud search service.
  • the similarity search device 1300 (or a part thereof) runs on a cloud environment, for example, one or more servers in the cloud environment.
  • the similarity search device 1300 is started to search for the target object, and the target number of output objects are provided to the user.
  • the operation of the device in a cloud environment is merely indicative, and the device may also be operated in an edge environment, for example, running on one or more servers in the edge environment.
  • the device can also run in a terminal environment, specifically on one or more terminal devices in the terminal environment.
  • the terminal device can be a mobile phone, a notebook, a server, a desktop computer, etc.
  • when the similarity search device provided in the above embodiment performs similarity search, the division of the above-mentioned functional modules is used only as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional modules as needed; that is, the internal structure of the server is divided into different functional modules to complete all or part of the functions described above.
  • the similarity search device provided in the foregoing embodiment and the similarity search method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • the similarity search device in the embodiment of this application can be implemented by a general bus architecture.
  • the similarity search device can be implemented as a server. See FIG. 14.
  • FIG. 14 is a schematic structural diagram of a server provided in an embodiment of the present application.
  • the server 1400 may vary considerably depending on configuration or performance, and may include one or more processors 1401 and one or more memories 1402; it may also include a bus 1403 and a transceiver 1404.
  • the processor 1401, the memory 1402 and the transceiver 1404 may communicate with each other through the bus 1403.
  • At least one instruction is stored in the memory 1402, and the at least one instruction is loaded and executed by the processor 1401 to implement the similarity search method provided by the foregoing method embodiments.
  • the processor 1401 may control the transceiver 1404 to perform step 302 and step 306, or step 602 and step 607, or step 802 and step 808, or step 1102 and step 1109.
  • the processor 1401 may be a central processing unit (CPU).
  • the memory 1402 may include a volatile memory, such as a random access memory (RAM).
  • the memory 1402 may further include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 1402 may also include an operating system and other software modules required for running processes.
  • the operating system can be LINUX™, UNIX™, WINDOWS™, etc.
  • the server may also have components such as a wired or wireless network interface and an input/output interface for input and output, and the server may also include other components for implementing device functions, which will not be repeated here.
  • the server 1400 may be a server in a cloud environment, or a server in an edge environment, or a server in a terminal environment.
  • the server cluster includes a plurality of servers 1400.
  • for the structure of each server 1400, refer to the embodiment in FIG. 14 described above.
  • Communication channels are established between different servers 1400 through a communication network.
  • different steps of the similarity search method can be dispersed and executed in different servers. For example, server 1 is used to execute step 302, server 2 is used to execute step 303 to step 305, and server 3 is used to execute step 306.
  • different modules of the similarity search device 1300 can be distributed on different servers 1400, for example, the receiving module 1301 is located on the server 1, the search module 1302 and the dividing module 1303 are located on the server 2, and the sending module 1305 is located on the server 3.
  • Any server 1400 may be a server in a cloud environment, or a server in an edge environment, or a server in a terminal environment.
  • this application also proposes a server cluster, which includes multiple servers 1400 and a cloud storage service.
  • the database or the first collection is stored in a cloud storage service (for example, an object storage service), and the user applies for a certain capacity of storage space in the cloud storage service, and stores the database or the first collection in the storage space.
  • the server 1400 When the server 1400 is running, it obtains the required objects from the remote cloud storage service through the communication network.
  • the similarity search device in the embodiment of the present application can be implemented by a chip.
  • the chip includes a processor, which is used to call and execute instructions stored in the memory from the memory, so that the device installed with the chip executes the similarity search methods provided by the foregoing method embodiments.
  • the chip includes an input interface, an output interface, a processor, and a memory.
  • the input interface, the output interface, the processor, and the memory are connected by an internal connection path, and the processor is used to execute the instructions stored in the memory.
  • the processor is used to execute step 303 to step 305, step 603 to step 607, step 803 to step 807, and step 1103 to step 1108; the processor is used to control the input interface to perform the foregoing step 302, step 602, step 802, and step 1102; and the processor is used to control the output interface to perform step 306, step 607, step 808, and step 1109 in the foregoing method embodiment.
  • the similarity search device in the embodiment of this application can also be implemented using the following: one or more field-programmable gate arrays (FPGA), programmable logic devices (PLD), complex programmable logic devices (CPLD), controllers, application-specific integrated circuits (ASIC), state machines, gate logic, discrete hardware components, transistor logic devices, network processors (NP), any other suitable circuits, or any combination of circuits capable of performing the various functions described throughout this application.
  • the similarity search device in the embodiments of the present application may be implemented by a computer program, and the computer program includes instructions for executing the foregoing method embodiments.
  • the computer program may be a software installation package.
  • the computer program may be downloaded and executed on the server.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or may be integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or modules, and may also be electrical, mechanical or other forms of connection.
  • the module described as a separate component may or may not be physically separated, and the component displayed as a module may or may not be a physical module, that is, it may be located in one place, or may be distributed to multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • if the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of this application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method in each embodiment of the present application.
  • the aforementioned storage medium includes volatile memory and non-volatile memory.
  • the storage medium may be: U disk, mobile hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, flash memory, hard disk drive (HDD), solid state drive (SSD), optical disk, or other media that can store program codes.
  • the computer program product includes one or more computer program instructions.
  • the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
  • the computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer program instructions can be passed from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state drive).
  • the program can be stored in a computer-readable storage medium.
  • the storage medium can be read-only memory, magnetic disk or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

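This application discloses a similarity search method, apparatus, server, and storage medium, and relates to the field of search technology. After some objects are found in a database, the objects whose similarity to the target object satisfies a first threshold are ranked toward the front, the objects whose similarity to the target object does not satisfy the first threshold are ranked toward the back, and objects are then selected in front-to-back order and sent to the terminal. Because objects whose similarity satisfies the first threshold have higher confidence than objects whose similarity does not, they are very likely to be correct results. Ranking them at the front means they are selected first, which raises the probability of selecting correct results and the proportion of correct results among those selected, and thereby effectively improves the recall rate.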

Description

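Similarity search method, apparatus, server, and storage medium
Technical Field
This application relates to the field of search technology, and in particular to a similarity search method, apparatus, server, and storage medium.
Background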
相似性搜索技术在各种应用场景中已得到了广泛的使用,例如在图像搜索的场景中,可以根据用户输入的图片,来搜索相同或相似的图片,又如在网页搜索的场景中,可以根据用户输入的关键字搜索网页,又如在音频搜索的场景中,可以根据用户输入的音频搜索相同或相似的音频,又如在文档搜索的场景中,可以根据用户输入的关键字搜索相关的文档。
目前,相似性搜索方法通常包括:终端向服务器发送搜索请求,搜索请求包括待搜索的对象,例如图片、音频、关键字等。服务器接收终端的搜索请求后,会从搜索请求中,获取待搜索的对象。之后,服务器会根据待搜索的对象进行搜索。具体来说,服务器会遍历数据库中的每个对象,获取数据库中每个对象与待搜索的对象之间的距离,从数据库中选取距离最近的K个对象,返回给终端,其中K为正整数。
采用上述方法进行相似性搜索时,仅考虑了待搜索的对象与数据库的对象之间的距离,如果待搜索的对象本身包含噪声,搜索结果就会很容易受到噪声的影响,包含很多的错误结果,导致召回率低下。
发明内容
本申请实施例提供了一种相似性搜索方法、装置、服务器及存储介质,能够解决相关技术中召回率低下的问题。
第一方面,本申请实施例提供一种相似性搜索方法,所述方法包括:
接收终端的搜索指令,所述搜索指令用于指示搜索与目标对象相似的对象;对数据库进行搜索,得到第一集合,所述第一集合包括多个对象;将所述第一集合划分为第二集合以及第三集合,所述第二集合中每个对象与所述目标对象之间的相似度满足第一阈值,所述第三集合中每个对象与所述目标对象的相似度不满足所述第一阈值;按照所述第二集合中的对象在前、所述第三集合中的对象在后的顺序,对所述第一集合中的对象进行排序;按照从前往后的顺序,从所述第一集合中选择对象发往所述终端。
本实施例提供的方法,设计了一套单一输入下的重排框架,通过从数据库搜索到一些对象后,将这些对象中与目标对象的相似度满足第一阈值的对象往前排,将这些对象中与目标对象的相似度不满足第一阈值的对象往后排,再按照从前往后的顺序选择对象并发往终端,由于相似度满足第一阈值的对象的置信度高于相似度不满足第一阈值的对象,相似度满足第一阈值的对象是正确结果的概率很大,通过将这些对象排在前面,在从前往后选择对象时,会优先选择这些对象,从而可以提升选择出正确结果的概率,让选出的正确结果的比重更大,从而有效地提升了召回率。
并且,相关技术在重排时,需要基于搜索出的对象以及目标对象,构造幽灵点,再根据幽灵点在数据库中再次搜索一遍,导致运算量大、运算速度慢,并且如果幽灵点不准确,会导致召回率急剧下降。而本实施例中,基于从数据库中搜索出的第一集合即可进行重排,相对于先根据第一集合中的对象以及目标对象构造幽灵点,再根据幽灵点在数据库中再次搜索一遍的方式来说,省去了构造幽灵点的步骤以及在数据库中进行二次搜索的步骤,从而解决了重排框架运算量巨大的问题,提升了运算速度,并且避免幽灵点不准而导致召回率下降的问题,提升了鲁棒性。
在一种可能的实现中,所述将所述第一集合划分为第二集合以及第三集合,具体包括:采用多种相似度算法结合,获取所述第一集合中的对象与所述目标对象之间的综合相似度,把综合相似度满足所述第一阈值的对象加入所述第二集合,把综合相似度不满足所述第一阈值的对象加入所述第三集合。
通过这种方式,在达到第一方面所述的效果的基础上,可以结合多种相似度算法来获取综合相似度,从而综合考虑了各种相似度算法,利用不同相似度算法的优势,因此综合相似度能够更全面、更科学地反映两个对象之间的相似度,因此可以解决单一度量方式不准确的问题,提高相似度的准确性。另外,采用的多种相似度算法可以包括rank order算法或者其他考虑了群体关系的相似度算法,那么通过在度量两个对象之间的相似度时,不仅考虑了两个对象本身,也考虑与两个对象属于同一群体的其他对象,比如说在度量对象A与对象B之间的相似度时,会不仅考虑对象A以及对象B,还考虑对象A所属的群体以及对象B所属的群体,从而可以提高相似度的准确性,进而提高根据相似度选择对象时的准确性。
在一种可能的实现中,所述对数据库进行搜索,得到第一集合具体包括:采用第一相似度算法,把与所述目标对象之间的相似度满足第三阈值的对象加入所述第一集合;进一步的,所述将所述第一集合划分为第二集合以及第三集合,具体包括:采用多种相似度算法结合,获取所述第一集合中的对象与所述目标对象之间的综合相似度,把综合相似度满足所述第一阈值的对象加入所述第二集合,把综合相似度不满足所述第一阈值的对象加入所述第三集合,所述多种相似度算法包括所述第一相似度算法。
通过这种方式,通过先依据第一相似度算法得出的相似度,来得出第一集合,再在第一相似度算法的基础上,依据该第一相似度算法结合其他相似度算法得出的综合相似度,来划分第二集合和第三集合,由于第一相似度算法与其他相似度算法结合后,能够弥补第一相似度算法的度量方式的不足,达到改进第一相似度算法的目的,因此综合相似度相对于第一相似度算法得出的相似度来说,能够确保提升准确性,因此,与目标对象之间综合相似度高的对象是正确结果的概率会显著高于与目标对象之间综合相似度低的对象,也即是,第二集合整体的置信度会高于第三集合整体的置信度,因此通过将第二集合中的对象排在第三集合中的对象前面,可以确保提升第一集合中对象顺序的准确性。
在一种可能的实现中,所述第一相似度算法为欧式距离算法,所述多种相似度算法包括所述欧式距离算法以及所述欧式距离算法之外的其他相似度算法。
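In a possible implementation, the third set includes a fourth set and a fifth set. The similarity between each object in the fourth set and the second set satisfies a second threshold, and the similarity between each object in the fifth set and the second set does not satisfy the second threshold. The order of objects within the third set is: objects in the fourth set first, objects in the fifth set after.
In this way, on top of the effect described for the first aspect, the second set serves as the reference for reranking. Because objects in the second set have high confidence, an object in the third set that is similar to the second set is more likely to be a correct result. Placing objects whose similarity to the second set satisfies the second threshold before objects whose similarity to the second set does not satisfy the second threshold makes the order of objects in the third set more accurate. When objects are selected in front-to-back order, objects whose similarity to the second set satisfies the second threshold are selected first, which raises the probability of selecting correct results and further improves the recall rate.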
在一种可能的实现中,所述将所述第一集合划分为第二集合以及第三集合之后,所述方法还包括:采用多种相似度算法结合,获取所述第三集合中的对象与所述第二集合的综合相似度,把综合相似度满足第二阈值的对象加入第四集合,把综合相似度不满足所述第二阈值的对象加入第五集合,所述第三集合包括所述第四集合以及所述第五集合,在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
通过这种方式,在达到第一方面所述的效果的基础上,可以结合多种相似度算法来获取综合相似度,从而综合考虑了各种相似度算法,利用不同相似度算法的优势,因此综合相似度能够更全面、更科学地反映第三集合中的对象与所述第二集合之间的相似度,因此可以解决单一度量方式不准确的问题,提高相似度的准确性。另外,采用的多种相似度算法可以包括rank order算法或者其他考虑了群体关系的相似度算法,那么在度量第三集合中的对象与所述第二集合之间的相似度时,不仅考虑了第三集合中的对象以及第二集合本身,也考虑了第三集合中的对象所属的群体以及第二集合中的对象所属的群体,从而可以提高相似度的准确性,进而提高根据相似度选择对象时的准确性。
在一种可能的实现中,所述对数据库进行搜索,得到第一集合具体包括:采用第一相似度算法,把与所述目标对象之间的相似度满足第三阈值的对象加入所述第一集合;进一步的,所述将所述第一集合划分为第二集合以及第三集合之后,所述方法还包括:采用多种相似度算法结合,获取所述第三集合中的对象与所述第二集合的综合相似度,把综合相似度满足第二阈值的对象加入第四集合,把综合相似度不满足所述第二阈值的对象加入第五集合,所述多种相似度算法包括所述第一相似度算法,所述第三集合包括所述第四集合以及所述第五集合,在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
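In this way, the first set is first obtained from the similarity given by the first similarity algorithm, and the fourth set and the fifth set are then divided using the comprehensive similarity obtained by combining the first similarity algorithm with other similarity algorithms. Combining the first similarity algorithm with other algorithms compensates for the shortcomings of its measurement and thereby improves on it, so the comprehensive similarity is more accurate than the similarity given by the first similarity algorithm alone. An object with high comprehensive similarity to the second set is therefore significantly more likely to be a correct result than an object with low comprehensive similarity to the second set; that is, the confidence of the fourth set as a whole is higher than that of the fifth set as a whole. Placing objects in the fourth set before objects in the fifth set therefore improves the accuracy of the order of objects in the first set.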
在一种可能的实现中,所述将所述第一集合划分为第二集合以及第三集合之后,所述方法还包括:从所述第三集合中获取簇,所述簇中的任一对象与所述簇中的其他对象之间的相 关度符合预设条件;获取所述簇与所述第二集合之间的相似度,作为所述簇中的每个对象与所述第二集合之间的相似度;把相似度满足第二阈值的对象加入第四集合,把相似度不满足所述第二阈值的对象加入第五集合,所述第三集合包括所述第四集合以及所述第五集合,在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
通过将第三集合中相关的对象聚为簇,在达到第一方面所述的效果的基础上,将簇与第二集合的相似度来作为簇中每个对象的相似度,噪声数据会由于被划分到对应的簇中,噪声数据本身与第二集合之间的相似度会被替换为簇与第二集合之间的相似度。那么即使噪声数据本身与第二集合之间的相似度很高,也会将其拉低至簇与第二集合之间的相似度,从而有效地防止噪声数据的影响,滤除了噪声数据,解决了噪声数据而造成误判的问题,减少了搜索结果中错误结果的数量,进而极大地提升了召回率。
在一种可能的实现中,所述获取所述簇与所述第二集合之间的相似度,包括:从所述簇中选取代表点,获取所述代表点与所述第二集合之间的相似度,作为所述簇与所述第二集合之间的相似度,所述代表点用于代表所述簇中的每个对象。
通过这种实现方式,可以使用代表点,来代替整个簇中的所有对象,去和第二集合进行度量,相对于使用簇中的所有对象逐一和第二集合进行度量的方式来说,可以减少计算量,从而提高计算速度。并且,代表点可以是簇的中心点,通过中心点能够更准确地计算簇与所述第二集合之间的相似度,避免簇边缘的噪声点影响相似度的准确性。
在一种可能的实现中,所述获取所述簇与所述第二集合之间的相似度,包括:获取所述簇中的每个对象与所述第二集合之间的相似度,根据所述每个对象与所述第二集合之间的相似度,获取所述簇与所述第二集合之间的相似度。
通过这种实现方式,提供一种可以适用于包含对象较少的簇的度量方式,提高了灵活性。
在一种可能的实现中,所述按照所述第二集合中的对象在前、所述第三集合中的对象在后的顺序,对所述第一集合中的对象进行排序,具体包括:将所述第二集合存入第一队列;将所述第四集合存入第二队列;将所述第五集合存入第三队列;按照所述第一队列最前、所述第二队列其次、所述第三队列最后的顺序,对所述第一集合中的对象进行排序。
通过这种方式,在达到第一方面所述的效果的基础上,设计了一套基于队列的重排框架,通过将第二集合加入第一队列,将第四集合加入第二队列,将第五集合加入第三队列,可以将从数据库中搜索出的对象划分出多种队列,在选取对象时,通过第一队列中的对象排在前、第二队列中的对象排在中间,第三队列中的对象排在最后,在按照从前到后的顺序选取对象时,会高优先选取第一队列中的对象,其次优先选取第二队列中的对象,低优先选取第三队列中的对象,那么由于三种队列中,第一队列中对象的置信度最高,第二队列中对象的置信度其次,第三队列中对象的置信度最低,在选取的对象的总数目一定的情况下,能够提高选取置信度高的对象的概率,降低选取置信度低的对象的概率,从而提高搜索结果中置信度高的对象的占比,因此可以提高搜索结果的召回率,并且,能够让置信度高的对象置于搜索结果的前列,提高搜索结果顺序的准确性。
在一种可能的实现中,所述按照所述第二集合中的对象在前、所述第三集合中的对象在后的顺序,对所述第一集合中的对象进行排序,具体包括:将所述第二集合存入第一队列; 将所述第三集合存入第二队列;按照所述第一队列在前、所述第二队列在后的顺序,对所述第一集合中的对象进行排序。
通过这种方式,在达到第一方面所述的效果的基础上,设计了一套基于队列的重排框架,通过将第二集合加入第一队列,将第三集合加入第二队列,可以将从数据库中搜索出的对象划分出多种队列,在选取对象时,通过将第一队列中的对象排在前、第二队列中的对象排在后,在按照从前到后的顺序选取对象时,会高优先选取第一队列中的对象,低优先选取第二队列中的对象,那么由于第一队列中对象的置信度高于第二队列中对象,在选取的对象的总数目一定的情况下,能够提高选取置信度高的对象的概率,降低选取置信度低的对象的概率,从而提高搜索结果中置信度高的对象的占比,因此可以提高搜索结果的召回率,并且,能够让置信度高的对象置于搜索结果的前列,提高搜索结果顺序的准确性。
在一种可能的实现中,对于所述第二集合中的每个对象,所述对象与所述目标对象的相似度越大,所述对象在所述第二集合中的排列位置越靠前。
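Tying the internal order of the second set to similarity with the target object, so that objects more similar to the target object come first and less similar objects come later, further improves the accuracy of the candidate order. Objects more similar to the target object are more likely to be correct results, and because they are placed at the front, selecting objects in front-to-back order raises the proportion of correct results among the selected objects and places correct results as early as possible in the search results, so that they are displayed nearer the top when the terminal presents the results. For example, the object in the second set with the greatest similarity to the target object is ranked first in the second set and therefore first in the first set; after the server sends the selected objects to the terminal, that object appears first in the search results presented by the terminal. In addition, objects with low similarity to the target object are ranked toward the end of the search results or are kept out of the search results altogether, which reduces the number of incorrect results, raises the proportion of correct results, and effectively improves the recall rate of the search.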
在一种可能的实现中,对于所述第三集合中的每个对象,所述对象与所述第二集合的相似度越大,所述对象在所述第三集合中的排列位置越靠前。
通过将第三集合内部的排列顺序与第二集合的相似度结合起来,让与第二集合相似度大的对象排在前面,与第二集合相似度小的对象排在后面,可以进一步提升候选对象排列顺序的准确性,由于与第二集合相似度大的对象排在与第二集合相似度小的对象之前,而与第二集合相似度大的对象是正确结果的概率高于与第二集合相似度小的对象,在按照前往后的顺序依次选择对象时,可以提升选择的对象中正确结果的比例,能够尽量多地选择与第二集合相似的对象,从而让搜索结果中正确结果的比重更大,从而有效地提升了召回率。并且,与第二集合之间的相似度小的对象会被排在搜索结果的后面,或者可以避免被放入搜索结果中,从而减少了搜索结果的错误结果的数量,从而有效地提高了搜索的召回率。
在一种可能的实现中,所述将所述第一集合划分为第二集合以及第三集合,具体包括:从所述第一集合中获取簇,所述簇中的任一对象与所述簇中的其他对象之间的相关度符合预设条件;获取所述簇与所述目标对象之间的相似度,作为所述簇中的每个对象与所述目标对象之间的相似度;把相似度满足所述第一阈值的对象加入所述第二集合,把相似度不满足所述第一阈值的对象加入所述第三集合。
通过将第一集合中相关的对象聚为簇,将簇与目标对象的相似度来作为簇中每个对象的 相似度,由于噪声数据被划分到对应的簇中,噪声数据与目标对象之间的相似度会被替换为簇与目标对象之间的相似度,因此,即使噪声数据本身与目标对象之间的相似度很高,由于没有使用噪声数据与目标对象之间的相似度,而是使用了该噪声数据所属的簇与目标对象之间的相似度,可以将噪声数据对应的相似度拉低到簇对应的相似度,可以有效地防止噪声数据的影响,将噪声数据提前排除掉,从而解决了由于噪声数据而造成误判的问题,减少了搜索结果中错误结果的数量,进而极大地提升了召回率。
第二方面,本申请实施例提供一种相似性搜索装置,所述装置用于执行上述相似性搜索方法。具体地,该相似性搜索装置包括用于执行上述第一方面或第一方面任意可能的实现方式的功能模块。
第三方面,本申请实施例提供一种服务器,所述服务器包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条指令,所述指令由所述一个或多个处理器加载并执行以实现上述第一方面或第一方面任意可能的实现方式提供的方法。
第四方面,本申请实施例提供一种服务器集群,所述服务器集群包括至少一个服务器,每个服务器包括一个或多个处理器和一个或多个存储器,所述至少一个服务器的存储器中存储有至少一条指令,所述指令由所述至少一个服务器的处理器加载并执行以实现上述第一方面或第一方面任意可能的实现方式提供的方法。
第五方面,本申请实施例提供一种计算机可读存储介质,所述存储介质中存储有至少一条指令,所述指令由处理器加载并执行以实现上述第一方面或第一方面任意可能的实现方式提供的方法。该存储介质中存储了程序。该存储介质的类型包括但不限于易失性存储器,例如随机访问存储器,非易失性存储器,例如快闪存储器、硬盘(hard disk drive,HDD)、固态硬盘(solid state drive,SSD)。
第六方面,本申请实施例提供一种芯片,该芯片包括处理器,用于从存储器中调用并运行存储器中存储的指令,使得安装有所述芯片的设备执行上述相似性搜索方法。
第七方面,本申请实施例提供另一种芯片,包括输入接口、输出接口、处理器和存储器,所述输入接口、输出接口、所述处理器以及所述存储器之间通过内部连接通路相连,所述处理器用于执行所述存储器中的指令,当所述指令被执行时,所述处理器用于执行上述相似性搜索方法。
第八方面,本申请实施例提供一种计算机程序,所述计算机程序包括用于执行上述第一方面或第一方面任意可能的实现方式的指令。该计算机程序可以为一个软件安装包,在需要使用上述相似性搜索方法的情况下,可以下载该计算机程序并在服务器上执行该计算机程序。
第九方面,本申请实施例提供一种相似性搜索系统,所述相似性搜索系统包括终端以及服务器,所述终端用于向服务器发送搜索指令,所述服务器用于执行上述第一方面或第一方面任意可能的实现方式,所述终端还用于接收服务器发送的对象。
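Brief Description of the Drawings
Figure 1 is a schematic diagram of an application scenario provided by an embodiment of this application.
Figure 2 is a structural block diagram of a similarity search system provided by an embodiment of this application.
Figure 3 is a flowchart of a similarity search method provided by an embodiment of this application.
Figure 4 is a schematic diagram of dividing the second set and the third set provided by an embodiment of this application.
Figure 5 is a schematic diagram of dividing the second set and the third set provided by an embodiment of this application.
Figure 6 is a flowchart of a similarity search method provided by an embodiment of this application.
Figure 7 is a schematic diagram of dividing the first queue and the second queue provided by an embodiment of this application.
Figure 8 is a flowchart of a similarity search method provided by an embodiment of this application.
Figure 9 is a schematic diagram of dividing the fourth set and the fifth set provided by an embodiment of this application.
Figure 10 is a schematic diagram of dividing the fourth set and the fifth set provided by an embodiment of this application.
Figure 11 is a flowchart of a similarity search method provided by an embodiment of this application.
Figure 12 is a schematic diagram of dividing the first queue, the second queue, and the third queue provided by an embodiment of this application.
Figure 13 is a structural block diagram of a similarity search apparatus provided by an embodiment of this application.
Figure 14 is a structural block diagram of a server provided by an embodiment of this application.
Figure 15 is a schematic structural diagram of a server cluster provided by an embodiment of this application.
Figure 16 is a schematic structural diagram of another server cluster provided by an embodiment of this application.
Detailed Description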
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。
以下,介绍本申请使用的概念。
对象:可以而不限于是图像、网页、文档、音频、资讯、视频中的任一项或多项的组合。其中,图像可以而不限于人脸图像、人体图像、足迹图像、步态图像、表情包图像、地点图像、车辆图像、商品图像、风景图像、建筑图像、影视图像、食品图像、游戏图像、植物图像、动物图像中的任一项及其组合。
目标对象:是指待搜索的对象,例如query(查询)命令中所请求搜索的内容,也可以称为搜索项或查询。
相似度算法:也称度量算法,用于度量数据空间中两个数据点之间的距离。本实施例中,可以用相似度算法来度量两个对象之间的相似程度。在计算机技术中,相似度算法可以通过函数(一种计算机程序)来实现。相似度算法可以而不限于是欧式距离(英文:euclidean distance)算法、等级排序(英文:rank order)算法、机器学习模型、欧式距离的标准化(英文:standardized euclidean distance)算法、马氏距离(英文:mahalanobis distance)算法、曼哈顿距离(英文:manhattan distance)算法、切比雪夫距离(英文:Chebyshev distance)算法、明可夫斯基距离(英文:Minkowski distance)算法、海明距离(英文:Hamming distance)算法、余弦相似度(英文:Cosine similarity)算法、皮尔森相关系数(英文:Pearson correlation coefficient)算法、Jaccard相似系数算法、对数似然相似度算法、互信息增益算法、信息增益算法、相对熵算法、KL散度(Kullback-Leibler divergence)、点互信息(英文全称:pointwise mutual information,英文简称:PMI)算法中的任一项及其组合。
以图搜图:是指输入一张图像后,从数据库的海量图像中,搜索并返回与这张图像相似的一个或多个图像的功能。
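Recall: in search technology, if K results are returned in total and n of them are correct, the ratio of n to K is the recall rate, where K is a positive integer and n is a positive integer or 0. For example, in a search-by-image scenario, if a photo of user A is input and 10 photos of user A are requested, but of the 10 photos actually returned 6 are of user A and 4 are of user B, the recall rate is 6/10 = 0.6, where / denotes division.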
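The definition above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `recall_at_k` and the photo labels are hypothetical, and the computation follows this document's own definition (n correct results out of K returned).

```python
def recall_at_k(returned, correct):
    """Ratio n / K: n correct results among the K results returned."""
    if not returned:
        raise ValueError("at least one result must be returned")
    n = sum(1 for obj in returned if obj in correct)
    return n / len(returned)


# Ten photos are returned: six of user A and four of user B.
returned = [f"A{i}" for i in range(6)] + [f"B{i}" for i in range(4)]
correct = {f"A{i}" for i in range(6)}
print(recall_at_k(returned, correct))
```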
候选解(candidate):如果要求返回的对象的数目是K,而先从数据库中搜索出数量大于K个的对象,比如搜索出(2*K)个对象,则这些对象可以称为候选解,后续会从这些对象中选择K个对象作为搜索结果。
重排(rerank):是指得到候选解后,对候选解按照设定的排序方式进行重新排序,从而改变候选解原始的内部顺序,而形成新的内部顺序的过程。在进行重排后,可以按照候选解的新的顺序,按照从前到后的顺序从候选解中,选择排在第一位至排在第K位的对象,这K个对象通俗来讲叫做TOP K结果(前K个结果),可以将这K个对象作为搜索结果,返回这K个对象。
聚类(clustering):是指针对给定的多个数据以及聚类算法,将多个数据中满足条件的数据归为一类的过程。
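Clustering algorithm: may be, without limitation, any one or a combination of the k-means clustering algorithm, density-based clustering algorithms, graph clustering algorithms, hierarchical clustering algorithms, network-based clustering algorithms, fuzzy clustering algorithms, constraint-based clustering algorithms, granularity-based clustering algorithms, kernel clustering algorithms, and quantum clustering algorithms.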
队列(queue):一种存放数据的数据结构。
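Comprehensive similarity: the similarity obtained by combining multiple similarity algorithms. Specifically, each similarity algorithm yields one similarity, and combining the similarities yielded by the multiple algorithms gives the comprehensive similarity. For example, a weight can be assigned to each similarity algorithm, and the similarities can be combined according to these weights. The weight of each similarity algorithm can be set according to instructions, experiments, experience, or requirements. For example, the weight of a similarity algorithm can be positively correlated with its accuracy: if the Euclidean distance algorithm is less accurate than the rank order algorithm, the weight of the Euclidean distance algorithm can be smaller than the weight of the rank order algorithm.
In some possible embodiments, ways of combining multiple similarity algorithms include, without limitation, any one or a combination of averaging, weighted averaging, summing, and weighted summing. For example, if the similarity algorithms are similarity algorithm 1 and similarity algorithm 2, which yield similarity 1 and similarity 2 respectively, then similarity 1 and similarity 2 can be weighted-averaged according to the weights of the two algorithms, with the weighted average used as the comprehensive similarity; or weighted-summed according to those weights, with the weighted sum used as the comprehensive similarity; or summed, with the sum used as the comprehensive similarity; or averaged, with the average used as the comprehensive similarity.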
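The four combination modes listed above can be sketched as follows. The function name, the `mode` strings, and the example weights are assumptions made for illustration; the source does not prescribe any particular weights.

```python
def combine_similarities(sims, weights=None, mode="weighted_average"):
    """Combine per-algorithm similarities into one comprehensive similarity."""
    if not sims:
        raise ValueError("at least one similarity is required")
    if weights is None:
        weights = [1.0] * len(sims)  # equal weights when none are given
    if len(weights) != len(sims):
        raise ValueError("one weight per similarity is required")

    if mode == "sum":
        return sum(sims)
    if mode == "average":
        return sum(sims) / len(sims)
    weighted_sum = sum(w * s for w, s in zip(weights, sims))
    if mode == "weighted_sum":
        return weighted_sum
    if mode == "weighted_average":
        return weighted_sum / sum(weights)
    raise ValueError(f"unknown mode: {mode}")


# Similarity 1 (say, Euclidean-based) and similarity 2 (say, rank order),
# with the second algorithm trusted three times as much as the first.
print(combine_similarities([0.8, 0.6], weights=[1, 3]))
```

Giving the less accurate algorithm the smaller weight follows the suggestion above that weights can be positively correlated with accuracy.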
以下,示例性介绍本申请的应用场景。
本申请实施例可以用于进行基于相似性的搜索。相似性搜索属于语义搜索,可以搜索与 已知对象相似的对象。相关技术中一种相似性搜索是Facebook(中文称脸书)公司的Faiss。相似性搜索是一种非匹配性的搜索,例如给出一张图片,搜出类似的图片;或者给出一个词语,搜索类似的词语,给出一段话,搜出类似的一段话。反之,在目前的文本编辑软件中(例如Microsoft office word2013,微软推出的一种文本编辑软件)使用的是匹配性搜索,只有完全与目标词语/语句完全一致的内容才能被检索到。
示意性地,可以应用在以图搜图的场景,用户可以在终端上输入某张图像,服务器可以根据终端提供的图像,从数据库中搜索与其相似的其他图像,将这些图像返回给终端。例如,用户想要知道图像中的某个人是谁,可以在终端上输入这个人的图像,服务器从数据库中搜索与这张图像相似的其他图像,返回与这张图像相似的其他图像,并且数据库中可以存有图像对应的身份信息,服务器可以在返回图像时,将图像对应的身份信息一起返回,从而帮助识别这个人的身份。又如,用户看上了某件商品,想要知道这件商品的购买地址、价格等,可以使用终端拍摄商品图像,在终端上输入商品图像,服务器从数据库中搜索与这幅商品图像相似的其他商品图像,返回与这幅商品图像相似的其他商品图像,并且数据库可以存有商品图像对应的购买地址、价格等,服务器可以在返回其他商品图像时,可以将其他商品图像对应的购买地址、价格一起返回,从而帮助用户快速购买商品;又如,用户想要知道某种狗的品种,可以在终端输入这只狗的图像,服务器从数据库中搜索与这张图像相似的狗的图像,从而帮助识别狗的品种。
示意性地,参见图1,在监控安防领域,可以预先通过在各处布设的摄像头,抓拍大量的人脸照片,从每张人脸照片分别提取人脸特征,将大量的人脸特征存入人脸特征库。当用户想要搜索与某张照片相似的其他3张照片时,可以将这张照片发送至服务器,服务器可以这张照片作为待搜索的人脸图像,从这张照片中提取人脸特征,对人脸特征库进行搜索,得到10个人脸特征,再通过下述实施例中的重排过程,对10个人脸特征进行排序,选择排在前3位的人脸特征,将这3个人脸特征对应的3张照片返回终端。
当然,使用图来进行相似性搜索的场景仅是示意,本申请也可以应用在使用文档来进行相似性搜索的场景,即根据给定的文档,从文档数据库中搜索与其相似的其他文档。比如说,用户可以给定一篇论文,通过实施本申请提供的方法,可以从论文数据库中找到与这篇论文相似度最高的一些论文,另外还可以返回每篇论文与这篇论文的相似度,从而可以判定是否已经存在与该论文大面积重复的其他论文,从而实现论文查重的功能;又如,可以应用在使用音频来进行相似性搜索的场景,即根据给定的音频,从音频数据库中搜索与其相似的其他音频,比如用户哼唱了歌曲片段,可以录制该歌曲片段,从歌曲库中搜索与用户哼唱的歌曲片段相似的歌曲。
以下,示例性介绍本申请的系统架构。
图2是本申请实施例提供的一种相似性搜索系统的结构框图。该相似性搜索系统包括:终端210和搜索平台220。
终端210通过无线网络或有线网络与搜索平台220相连。终端210可以是智能手机、游戏主机、台式计算机、平板电脑、电子书阅读器、MP3播放器、MP4播放器和膝上型便携计算机中的至少一种。终端210可以安装和运行有支持搜索的应用程序,搜索平台220用于为 该应用程序提供后台服务。示意性地,终端210可以是用户使用的终端,终端210中运行的应用程序内登录有用户在搜索平台220上注册的账号。其中,该应用程序可以是搜索引擎的客户端,或者可以是搜索引擎的网页版。或者,该应用程序可以是购物类应用程序、音频程序、视频程序、社交应用程序、即时通讯应用程序、翻译类应用程序、浏览器程序中的任意一种,该应用程序中内置有搜索的功能,比如说配置有以图搜图的组件,例如该应用程序可以是具有识图购物功能的购物类应用程序。
搜索平台220可以而不限于运行在云环境、边缘环境或者终端环境中的任意一种,例如可以运行在公有云、私有云或混合云上。搜索平台220可以作为云搜索服务向用户提供。搜索平台220包括服务器2201以及数据库2202。
服务器2201用于执行下述图3实施例的方法。服务器2201通过无线网络或有线网络与数据库2202相连。服务器2201可以包括一台服务器、多台服务器、云计算平台或者虚拟化中心中的至少一种。服务器2201可以是一台或多台。当服务器2201是多台时,存在至少两台服务器2201用于提供不同的服务,和/或,存在至少两台服务器2201用于提供相同的服务,比如以负载均衡方式提供同一种服务,本申请实施例对此不加以限定。在一些可能的实施例中,服务器2201可以是弹性云服务器(英文全称:elastic cloud server,英文简称:ECS)、虚拟机、容器、在云环境中运行的应用、服务或微服务。
数据库2202用于存储多个对象。数据库2202可以位于一台存储设备上,也可以分布在多个存储设备上。数据库2202可以通过云存储服务实现,例如,数据库2202可以为对象存储服务(英文全称:object storage service,英文简称:OBS)、云硬盘、云数据库等。
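Those skilled in the art will appreciate that Figure 2 only illustrates the case in which the server 2201 and the database 2202 are deployed on different devices. In other possible implementations, the server 2201 and the database 2202 can also be integrated, that is, located on the same device.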
本领域技术人员可以知晓,上述终端的数量可以更多或更少。比如上述终端可以仅为一个,或者上述终端为几十个或几百个,或者更多数量,此时上述相似性搜索系统还包括其他终端。本申请实施例对终端的数量和设备类型不加以限定。
以下,示例性介绍本申请的方法流程。
图3是本申请实施例提供的一种相似性搜索方法的流程图,如图3所示,该方法包括下述步骤301至306:
步骤301、终端向服务器发送搜索指令。
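The search instruction instructs a search for objects similar to the target object. Optionally, the search instruction may include the target object, and may also include a target number, which is the number of objects to be found. The search instruction can be triggered by a user operation on the terminal. For example, the user can input the target object on the terminal, and the terminal can generate the search instruction from it. In an exemplary scenario, a user who wants to find photos similar to a certain photo inputs that photo on the terminal and taps the confirm option; the terminal then generates the search instruction, and that photo is the target object.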
当然,目标对象也可以不由终端提供,例如终端提供该目标对象的网址给服务器,服务器可以通过网络获得该目标对象。
步骤302、服务器接收终端的搜索指令。
服务器收到搜索指令后,可以获取目标对象以及目标数目,以便根据目标对象以及目标数目执行后续步骤。其中,目标数目为需要搜索出的对象的数目,目标数目可以是正整数。
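As for obtaining the target object, the server can optionally parse the search instruction to obtain the target object it carries. The server can also parse the search instruction to obtain the URL of the target object carried in it and fetch the target object over the network from that URL. As for obtaining the target number, the server can optionally parse the search instruction to obtain the target number it carries, or the server can determine the target number itself. For example, the server can preset a default number and use it as the target number: if the server returns 10 objects to the terminal by default, the target number is 10.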
步骤303、服务器对数据库进行搜索,得到第一集合。
数据库中预先存有大量的对象,比如说可以包括百万量级的对象。对于数据库中的每个对象,服务器可以获取数据库中的该对象与目标对象之间的相似度,判断相似度是否满足第三阈值,当相似度满足第三阈值时,服务器将数据库中的该对象加入第一集合。在一些可能的实施例中,服务器可以采用第一相似度算法,获取数据库中的对象与目标对象之间的相似度,根据该相似度,把与该目标对象之间的相似度满足第三阈值的对象加入该第一集合。其中,第一相似度算法可以是欧式距离算法,当然也可以根据需求设置为其他相似度算法,本实施例对第一相似度算法具体是哪种相似度算法不做限定。
第一集合可以称为候选解集合,第一集合包括多个对象,每个对象可以称为一个候选解。第一集合中对象的数目大于该目标数目,例如,第一集合中对象的数目与目标数目之间的比值可以为预设比值,例如,如果目标数目为K,第一集合可以包括(2*K)个对象,后续会从这(2*K)个对象中选择K个对象,将这K个对象发往终端,其中K为正整数。
第一集合中的每个对象与目标对象之间的相似度可以满足第三阈值。在一些可能的实施例中,相似度满足第三阈值可以指相似度大于第三阈值,相似度不满足第三阈值可以指相似度小于或等于第三阈值。在另一些可能的实施例中,相似度满足第三阈值可以指相似度大于或等于第三阈值,相似度不满足第三阈值可以指相似度小于第三阈值。其中,第三阈值可以根据实验、经验或需求设置,第三阈值可以预先存储在服务器中。
服务器搜索得到第一集合后,可以将第一集合从数据库存储在服务器自身的存储器中,例如将第一集合缓存在服务器的内存中,或者可以将第一集合存储在服务器包含的非易失性可读存储介质中,比如将第一集合存储在快闪存储器、硬盘(英文全称:hard disk drive,英文简称:HDD)、固态硬盘(英文全称:solid state drive,英文简称:SSD)中。当然,服务器也可以将第一集合存储在该服务器之外的其他设备中,比如可以将第一集合发送至网络存储器,通过网络存储器来存储第一集合,其中网络存储器可以是云盘、云数据库或对象存储服务。本实施例对第一集合的存储位置不做限定。
在一些可能的实施例中,第一集合中的各个对象可以按照与目标对象的相似度的大小顺序依次排序。对于第一集合中的每个对象,如果该对象与目标对象的相似度越大,则该对象在第一集合中的排列位置越靠前。例如,第一集合中的第一个对象,可以是第一集合中与目标对象的相似度最大的对象,比如说是与目标对象之间的欧式距离最小的对象。
在一些可能的实施例中,服务器可以先将数据库中的对象加入到第一集合中,再按照相似度从大到小的顺序,对第一集合中的每个对象进行排序。在另一些可能的实施例中,服务 器也可以在搜索的过程中,就按照相似度从大到小的顺序来将对象依次加入到第一集合中,比如说,可以每当搜索到一个对象i,可以对对象i与目标对象之间的相似度,与第一集合中已有的对象与目标对象之间的相似度进行比较,如果对象k与目标对象之间的相似度高于对象i与目标对象之间的相似度,对象m与目标对象之间的相似度低于对象i与目标对象之间的相似度,则将对象i加入到对象k与对象m之间,其中i、k或m表示对象的标识。
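Inserting each newly found object between its higher-scoring and lower-scoring neighbours, as described above, can be sketched with the standard `bisect` module. The class name `CandidateSet` and the stored negated scores are illustrative choices, not part of the source.

```python
import bisect


class CandidateSet:
    """First set kept in descending order of similarity while searching."""

    def __init__(self):
        self._neg_sims = []  # negated similarities, ascending order
        self._objects = []   # objects, aligned with _neg_sims

    def add(self, obj, similarity):
        # Ascending order of the negated value is descending order of
        # similarity, so object i lands between objects k and m.
        pos = bisect.bisect_right(self._neg_sims, -similarity)
        self._neg_sims.insert(pos, -similarity)
        self._objects.insert(pos, obj)

    def objects(self):
        return list(self._objects)


first_set = CandidateSet()
first_set.add("k", 0.9)
first_set.add("m", 0.5)
first_set.add("i", 0.7)  # inserted between k and m
print(first_set.objects())
```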
步骤304、服务器将第一集合划分为第二集合以及第三集合。
第二集合中的每个对象与目标对象之间的相似度满足第一阈值,第三集合中的每个对象与目标对象的相似度不满足第一阈值。其中,第一阈值可以根据实验、经验或需求设置,第一阈值可以预先存储在服务器中,第一阈值可以是本实施例涉及的各个阈值中最高的阈值。第一阈值可以高于后文中的第二阈值以及第三阈值。
在一些可能的实施例中,相似度满足第一阈值可以指相似度大于第一阈值,相似度不满足第一阈值可以指相似度小于或等于第一阈值。在另一些可能的实施例中,相似度满足第一阈值可以指相似度大于或等于第一阈值,相似度不满足第一阈值可以指相似度小于第一阈值。相对于第三集合中的对象来说,由于第二集合中的对象与目标对象之间的相似度高于第一阈值,第二集合中的对象是正确结果的概率更高,因此第二集合中的对象的置信度更高,可以将第二集合中的对象视为可信的对象。第二集合可以记为可信解集合,第二集合中的对象在重排时,顺序优先级是最高的。
关于划分出第二集合以及第三集合的方式,在一些可能的实施例中,服务器可以获取第一集合中每个对象与目标对象之间的相似度,根据第一集合中每个对象与目标对象之间的相似度,把第一集合中的每个对象按照相似度是否满足第一阈值,划分为第二集合以及第三集合。例如,服务器可以创建第二集合以及第三集合;对于第一集合中的每个对象,服务器可以获取该对象与目标对象之间的相似度,再判断该对象与目标对象之间的相似度是否满足第一阈值;如果该对象与目标对象之间的相似度满足第一阈值,则把该对象加入第二集合;如果该对象与目标对象之间的相似度不满足第一阈值,则把该对象加入第三集合。
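The partition into the second and third sets can be sketched as a single pass over scored candidates. The function name, the `inclusive` flag (which switches between the "greater than" and "greater than or equal to" readings of "satisfies the threshold"), and the sample scores are assumptions for illustration.

```python
def split_by_threshold(scored, first_threshold, inclusive=False):
    """Split (object, similarity) pairs into the second and third sets.

    The input order is preserved, so a first set that is already sorted
    by similarity yields sorted second and third sets.
    """
    second, third = [], []
    for obj, sim in scored:
        satisfied = sim >= first_threshold if inclusive else sim > first_threshold
        (second if satisfied else third).append(obj)
    return second, third


first_set = [("a", 0.95), ("b", 0.90), ("c", 0.60), ("d", 0.40)]
second_set, third_set = split_by_threshold(first_set, first_threshold=0.8)
print(second_set, third_set)
```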
需要说明的一点是,步骤304中划分第二集合以及第三集合所依据的相似度,与步骤303中将对象加入到第一集合所依据的相似度,可以是不同的相似度,也可以是相同的相似度。具体来说,可以直接复用步骤303所依据的相似度,来划分第二集合以及第三集合,则在执行步骤304时,可以无需重新获取第一集合中的对象与目标对象之间的相似度。当然,在步骤304中,也可以重新获取第一集合中的对象与目标对象之间的相似度,获取相似度的方式可以和步骤303中获取相似度的方式不同。比如说,如果步骤303中采用了相似度算法A来获取数据库中对象与目标对象之间的相似度,步骤304中可以重新采用相似度算法B来获取第一集合中的对象与目标对象之间的相似度,或者重新采用相似度算法A+相似度算法B来获取第一集合中的对象与目标对象之间的相似度,本实施例对步骤304中是否执行获取相似度的步骤不做限定,也不对步骤304中获取相似度的方式进行限定。
对于第一集合中的每个对象来说,获取该对象与目标对象之间的相似度的过程,可以理解为对该对象进行打分操作,而该对象与目标对象之间的相似度,可以理解为该对象的分数,能够反映该对象的置信度。
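In some possible embodiments, the server can combine multiple similarity algorithms to obtain the comprehensive similarity between each object in the first set and the target object, add objects whose comprehensive similarity satisfies the first threshold to the second set, and add objects whose comprehensive similarity does not satisfy the first threshold to the third set.
In some possible embodiments, step 304 can specifically include: for any object in the first set, the server uses each of the multiple similarity algorithms to obtain the similarity between that object and the target object, which yields multiple similarities, one per algorithm, and combines them into the comprehensive similarity between the object and the target object. Taking weighted averaging as the combination method, a weight can be assigned to each similarity algorithm; after each algorithm has produced its similarity between the object and the target object, the similarities are weighted-averaged according to these weights, and the weighted average is used as the comprehensive similarity between the object and the target object.
For illustration, referring to Figure 4, for each object in the first set, the Euclidean distance algorithm, the rank order algorithm, and a machine learning model can each be used to obtain a similarity between the object and the target object, giving 3 similarities. These are combined into 1 similarity, the comprehensive similarity. If the comprehensive similarity satisfies the first threshold, the object is added to the second set; if it does not, the object is added to the third set.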
下面描述结合多种相似度算法来获取综合相似度的效果:
一方面,相关技术中,通常仅通过一种相似度算法来获取两个对象之间的相似度,而单一的相似度算法通常具有局限性,导致相似度的准确性差。而本实施例中,通过结合多种相似度算法来获取综合相似度,可以利用不同相似度算法的优势,综合考虑了各种相似度算法,综合相似度能够更全面、更科学地反映两个对象之间的相似度,因此可以解决单一度量方式不准确的问题,提高相似度的准确性。
另一方面,相关技术中,通常仅通过欧式距离算法来获取相似度,那么由于欧式距离算法在度量两个对象之间的相似度时,仅会考虑了两个对象本身,而忽视了与两个对象邻近的其他对象,导致相似度的准确性差。而本实施例中,采用的多种相似度算法可以包括rank order算法或者其他考虑了群体关系的相似度算法,那么通过在度量两个对象之间的相似度时,不仅考虑了两个对象本身,也考虑与两个对象属于同一群体的其他对象,比如说在度量对象A与对象B之间的相似度时,会不仅考虑对象A以及对象B,还考虑对象A所属的群体以及对象B所属的群体,从而可以提高相似度的准确性,进而提高根据相似度选择对象时决策的准确性。
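In some possible embodiments, the multiple similarity algorithms used to divide the second set and the third set can include the first similarity algorithm, that is, the similarity algorithm used when searching the database. They can also include similarity algorithms other than the first similarity algorithm. For example, the first similarity algorithm can be the Euclidean distance algorithm, and the multiple similarity algorithms can include the Euclidean distance algorithm and the rank order algorithm.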
通过先依据第一相似度算法得出的相似度,来得出第一集合,再在第一相似度算法的基础上,依据该第一相似度算法结合其他相似度算法得出的综合相似度,来划分第二集合和第三集合,由于第一相似度算法与其他相似度算法结合后,能够弥补第一相似度算法的度量方式的不足,达到改进第一相似度算法的目的,因此综合相似度相对于第一相似度算法得出的相似度来说,能够确保提升准确性,因此,与目标对象之间综合相似度高的对象是正确结果的概率会显著高于与目标对象之间综合相似度低的对象,也即是,第二集合整体的置信度会 高于第三集合整体的置信度,因此通过将第二集合中的对象排在第三集合中的对象前面,可以确保提升第一集合中对象顺序的准确性。
需要说明的一点是,采用多种相似度算法结合,来获取第一集合中对象与目标对象之间的综合相似度仅是一种示意性实施方式,在另一些可能的实施例中,服务器也可以仅采用一种相似度算法,来获取第一集合中对象与目标对象之间的相似度,例如仅是采用欧式距离算法,获取第一集合中对象与目标对象之间的相似度,或者仅是采用rank order算法,获取第一集合中对象与目标对象之间的相似度。之后,根据一种相似度算法获取的相似度,把相似度满足第一阈值的对象加入第二集合,把相似度不满足第一阈值的对象加入第三集合。
在一些可能的实施例中,可以在第一集合内部划分出簇,根据对象所属的簇来获取相似度。具体来说,这种方式可以包括下述步骤一至步骤二。
步骤一、服务器从第一集合中获取簇。
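A cluster can also be called a class. A cluster includes multiple objects, and the correlation between any object in the cluster and the other objects in the cluster meets a preset condition. One or more clusters can be obtained from the first set. Specifically, the objects within a cluster are similar to one another and share some common property. For example, if each object in the first set is an image, a cluster can consist of images of the same person: the images in cluster 1 can all be images of user A, and the images in cluster 2 can all be images of user B.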
在一些情况下,对于相似度的值差别比较大的不同对象来说,如果这些对象在相似度之外的其他属性上有相似之处,则这些对象也应该被识别成相似的对象,因此,可以引入簇和相关度的概念来描述这些对象的相似性。例如:同一个人的多张图片原本应该被识别同一个人的图片,但是由于拍摄角度不同或者穿着不同,可能导致在计算这些图片的相似度时,不同图片的相似度的值的差别比较大。本实施例中,会根据这些图片的相关度,把这些图片标记为同一个簇,簇内的图片会共用相同的相似度,而不以各个图片各自计算出来的相似度为准。例如,如果簇内有N个图片,这N个图片会共用同一个相似度,而不是图片1以图片1自己的相似度为准,图片2以图片2自己的相似度为准,N为正整数。
在一些可能的实施例中,簇的获取方式可以包括下述方式一至方式二中的至少一项。
方式一、服务器可以采用聚类算法,对第一集合进行聚类,得到簇。
方式二、对于第一集合中的每个对象,服务器获取该对象与第一集合中该对象之外的其他对象之间的相关度,根据每个对象与其他对象之间的相关度,获取相关度满足预设条件的多个对象,将相关度满足预设条件的多个对象划分为簇。
服务器可以对第一集合中的多个对象进行两两比对,得到第一集合中任两个对象之间的相关度,服务器可以判断第一集合中每个对象与第一集合中其他对象之间的相关度是否满足预设条件,如果第一集合中对象与第一集合中其他对象之间的相关度满足预设条件,则将第一集合中的该对象划分至簇,如果第一集合中对象与第一集合中其他对象之间的相关度不满足预设条件,则将第一集合中的该对象作为散点。其中,相关度满足预设条件可以而不限于是相关度满足第四阈值,例如如果第一集合中两个对象之间的相关度满足第四阈值,则将这两个对象划分到同一个簇中。
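In some possible embodiments, the server can generate a correlation matrix from the correlation between each object in the first set and the other objects in the first set, traverse the matrix to find objects whose correlation meets the preset condition, group those objects into clusters, and treat the remaining objects in the matrix as scatter points. The correlation matrix can be as shown in Table 1: each row represents an object, each column represents an object, and each element equals the correlation between the object of its row and the object of its column. The number of rows and the number of columns can each equal the number of objects in the first set.
Table 1
[Table 1 is an image in the original publication (PCTCN2019088879-appb-000001): the correlation matrix described above.]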
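One way to turn a correlation matrix into clusters and scatter points is sketched below. It is an assumption-laden illustration, not the source's algorithm: the preset condition is taken to be "correlation at or above a threshold" (the fourth threshold), and objects are merged transitively with a union-find structure, which is a looser grouping rule than requiring every pair in a cluster to meet the condition.

```python
def clusters_from_matrix(corr, threshold):
    """Group object indices whose pairwise correlation meets the threshold.

    Returns (clusters, scatter_points): clusters are lists of indices with
    at least two members; scatter points are indices left on their own.
    """
    n = len(corr)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if corr[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) > 1]
    scatter_points = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, scatter_points


# Symmetric 4 x 4 matrix: objects 0, 1 and 2 are related, object 3 is not.
corr = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.8, 0.1],
    [0.1, 0.8, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
print(clusters_from_matrix(corr, threshold=0.7))
```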
步骤二、服务器获取簇与目标对象之间的相似度,作为簇中的每个对象与目标对象之间的相似度。
通过将簇与目标对象之间的相似度作为簇中每个对象与目标对象之间的相似度,如果簇与目标对象之间的相似度满足第一阈值,则会将簇中每个对象均加入到第二集合,如果簇与目标对象之间的相似度不满足第一阈值,则会将簇中每个对象均加入到第三集合。也即是,通过这种方式,同一个簇中的各个对象所属的集合可以相同。
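In addition, after the objects in the first set have been grouped into clusters, one or more objects may belong to no cluster. These objects can be called scatter points. For a scatter point, the similarity between the scatter point and the target object can be obtained directly: if the similarity satisfies the first threshold, the scatter point is assigned to the second set; if not, it is assigned to the third set.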
在一些可能的实施例中,获取簇与目标对象之间的相似度的方式包括而不限于下述方式(1)至方式(2)中的任意一项及其组合:
方式(1)服务器从簇中选取代表点,服务器获取代表点与目标对象之间的相似度,作为簇与目标对象之间的相似度。
代表点用于代表簇中的每个对象,可以使用代表点,来代表整个簇中的所有对象,去和目标对象进行度量。其中,可以从簇中任取对象作为代表点;也可以选择簇中心,作为代表点;也可以选择簇中心邻近的对象,作为代表点,本实施例对选择代表点的方式不做限定。其中,代表点可以是一个对象,也可以是多个对象组成的集合。如果代表点包括多个对象,可以获取每个代表点与目标对象之间的相似度;可以对每个代表点与目标对象之间的相似度求平均,将平均值作为簇与目标对象之间的相似度。或者,可以获取每个代表点与目标对象之间的相似度的和值,将和值作为簇与目标对象之间的相似度。
通过将代表点与目标对象之间的相似度作为簇与目标对象之间的相似度,如果代表点与目标对象之间的相似度满足第一阈值,则会将簇中的每个对象加入第二集合,如果代表点与 目标对象之间的相似度不满足第一阈值,则会将簇中的每个对象加入第三集合。
方式(2)服务器获取簇中的每个对象与目标对象之间的相似度,服务器根据每个对象与目标对象之间的相似度,获取簇与目标对象之间的相似度。在一些可能的实施例中,服务器可以对每个对象与目标对象之间的相似度求平均,将平均值作为簇与目标对象之间的相似度。或者,可以获取每个对象与目标对象之间的相似度的和值,将和值作为簇与目标对象之间的相似度。
在一些可能的实施例中,服务器可以判断簇中对象的数目是否满足数目阈值,如果簇中对象的数目满足数目阈值,表明簇比较大,则采用方式(1),如果簇中对象的数目不满足数目阈值,表明簇比较小,则采用方式(2)。
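The choice between approach (1) and approach (2) by cluster size can be sketched as below. The function names, the toy one-dimensional similarity `toy_sim`, and the size threshold of 5 are all hypothetical; the representative point is supplied by the caller because the source leaves its selection open.

```python
def cluster_similarity(members, target, sim, representative, size_threshold):
    """Similarity between a cluster and the target object.

    Large clusters use approach (1): one representative point.
    Small clusters use approach (2): the mean over all members.
    """
    if not members:
        raise ValueError("a cluster must contain at least one object")
    if len(members) >= size_threshold:
        return sim(representative, target)
    return sum(sim(m, target) for m in members) / len(members)


def toy_sim(a, b):
    """Toy similarity on numbers: 1 at distance 0, falling toward 0."""
    return 1.0 / (1.0 + abs(a - b))


small = [1.0, 3.0]
large = [2.0, 2.0, 2.0, 2.0, 2.0]
print(cluster_similarity(small, 1.0, toy_sim, representative=1.0, size_threshold=5))
print(cluster_similarity(large, 1.0, toy_sim, representative=2.0, size_threshold=5))
```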
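For illustration, referring to Figure 5, a clustering algorithm such as the k-means clustering algorithm, a density-based clustering algorithm, a graph clustering algorithm, or another clustering algorithm can be used to divide the objects in the first set into clusters or scatter points. Alternatively, the objects in the first set can be compared pairwise to obtain a correlation matrix, and a data merging algorithm can be used to divide the objects in the matrix into clusters or scatter points. It is then determined whether the similarity of each cluster or scatter point satisfies the first threshold: if so, the cluster or scatter point is added to the second set; if not, the cluster or scatter point is added to the third set.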
通过这种方式,达到的效果至少可以包括:在搜索的过程中,受到目标对象较为模糊或者其他因素的影响,从数据库中搜索出的对象可能存在噪声数据,该噪声数据是指并不是正确结果但却与目标对象之间的相似度较高的对象。相关技术中,噪声数据会被错误地包含在搜索结果中,导致召回率较低。而通过将第一集合中彼此相似的对象聚为簇,噪声数据能够被划分到对应的簇中,那么通过将簇与目标对象的相似度来作为簇中每个对象的相似度,噪声数据本身与目标对象之间的相似度会被替换为簇与目标对象之间的相似度,比如会被替换为代表点与目标对象之间的相似度。那么即使噪声数据本身与目标对象之间的相似度较高,由于没有使用噪声数据本身与目标对象之间的相似度,而是使用了该噪声数据所属的簇与目标对象之间的相似度,可以有效地防止噪声数据的影响,将噪声数据提前排除掉,从而解决了由于噪声数据而造成误判的问题,减少了搜索结果中错误结果的数量,进而极大地提升了召回率。例如,如果簇中包括对象1、对象2至对象10,其中对象1是噪声数据,对象3是代表点,则对象3与目标对象之间的相似度,会作为对象1至对象10中每个对象与目标对象之间的相似度,因此对象1与目标对象之间的相似度会被拉低至对象3与目标对象之间的相似度,从而滤除了对象1的干扰。
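For illustration, suppose the target object is a photo of user A and the first set contains 10 photos of user B, 9 of which have very low similarity to the photo of user A while 1 happens to have very high similarity to it (call this photo X). Photo X is noise data. In the related art, although photo X is a photo of user B rather than of user A, its high similarity to the photo of user A causes it to be misjudged as a photo of user A and wrongly sent to the terminal. With the approach above, the 10 photos of user B are grouped into the same cluster, and the similarity of the cluster is used uniformly as the similarity of each of the 10 photos to the photo of user A. Photo X is thus influenced by the other 9 photos of user B: its similarity to the photo of user A is replaced by the similarity of user B's other photos to the photo of user A, which removes the interference of photo X and avoids misjudging it as a photo of user A.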
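The noise-suppression effect in the photo example can be reproduced with made-up numbers. In this sketch the similarity values, the choice of `B3` as representative point, and the first threshold of 0.8 are all invented for illustration.

```python
def share_cluster_similarity(individual_sims, cluster, representative):
    """Give every cluster member the representative point's similarity."""
    if representative not in cluster:
        raise ValueError("the representative must belong to the cluster")
    shared = individual_sims[representative]
    scores = dict(individual_sims)
    for member in cluster:
        scores[member] = shared
    return scores


# Nine photos of user B score low against the photo of user A; photo X,
# also of user B, happens to score high on its own.
sims = {f"B{i}": 0.2 for i in range(1, 10)}
sims["X"] = 0.95
cluster = list(sims)  # all ten photos fall into one cluster

scores = share_cluster_similarity(sims, cluster, representative="B3")
first_threshold = 0.8
second_set = [obj for obj, s in scores.items() if s > first_threshold]
print(scores["X"], second_set)
```

Scored on its own, photo X would pass the first threshold; with the shared cluster score it does not, so it stays out of the second set.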
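It should be noted that the two techniques, combining multiple similarity algorithms and dividing into clusters, can be combined to form step 304. Specifically, the first set can first be divided into clusters or scatter points. Multiple similarity algorithms are then combined to obtain the comprehensive similarity between each cluster and the target object, which serves as the similarity between each object in the cluster and the target object, and likewise to obtain the comprehensive similarity between each scatter point and the target object. Clusters and scatter points whose comprehensive similarity satisfies the first threshold are added to the second set, and those whose comprehensive similarity does not satisfy the first threshold are added to the third set.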
步骤305、服务器按照第二集合中的对象在前、第三集合中的对象在后的顺序,对第一集合中的对象进行排序。
服务器可以将第二集合中的对象排在第三集合中的对象之前。示例性地,对于第一集合中的对象i和对象j来说,如果对象i是第二集合中的对象,对象j是第三集合中的对象,则会将对象i排在对象j之前。那么,如果第一集合包括(2*K)个对象,第二集合包括Q1个对象,第三集合包括Q2个对象,那么进行排序后,第1个对象至第Q1个对象均是第二集合中的对象,第(Q1+1)至最后1个对象均是第三集合中的对象,Q1和Q2均为正整数。
步骤306、服务器按照从前往后的顺序,从第一集合中选择目标数目个对象发往终端。
服务器可以从第一集合的第一个对象开始,按照从前往后的顺序,从第一集合中依次选取对象,直到选择的对象的数目达到目标数目为止,将选取的对象发往终端,终端从服务器接收到目标数目个对象后,可以将该目标数目个对象作为搜索结果,呈现给用户,例如,终端可以显示搜索结果页面,搜索结果页面中的每个内容项为一个对象。
其中,由于第二集合中的对象在前,第三集合中的对象在后,在选择对象时,第二集合中的对象的优先级会高于第三集合中对象的优先级。具体来说,如果第二集合中对象的数目大于或等于目标数目,则服务器会选择第二集合中的对象,而不选择第三集合中的对象;如果第二集合中对象的数目小于目标数目,服务器才会在选择第二集合中的对象的基础上,继续从第三集合中选择对象。
示例性地,如果目标数目为K,第一集合包括(2*K)个对象,第二集合包括Q1个对象,第三集合包括Q2个对象,如果K小于Q1,则服务器会从第二集合中选择K个对象发往终端,而第二集合中会剩余(Q1-K)个对象未被选择,另外第三集合中的每个对象也不会被选择;如果K等于Q1,则服务器会恰好将第二集合中的每个对象发往终端,另外第三集合中的每个对象也不会被选择;如果K大于Q1,服务器会将第二集合中的每个对象,以及第三集合中(K-Q1)个对象发往终端,而第三集合中K个对象不会被选择。
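The K greater than Q1 case above can be traced with a small sketch. The function name and object labels are hypothetical; the arithmetic follows the text, with Q1 + Q2 = 2*K.

```python
def select_results(second_set, third_set, k):
    """Return the first k objects, taking the second set before the third."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ordered = list(second_set) + list(third_set)
    return ordered[:k]


second_set = ["s1", "s2"]             # Q1 = 2
third_set = ["t1", "t2", "t3", "t4"]  # Q2 = 4, so the first set has 2*K = 6 objects
k = 3                                 # K > Q1

chosen = select_results(second_set, third_set, k)
left_over = [obj for obj in third_set if obj not in chosen]
print(chosen, left_over)
```

All Q1 objects of the second set are chosen, followed by K - Q1 = 1 object of the third set, which leaves Q2 - (K - Q1) = K objects of the third set unselected.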
由于第二集合中的对象与目标对象之间的相似度满足第一阈值,而第三集合中的对象与目标对象之间的相似度不满足第一阈值,因此第二集合与第三集合相比较来说,第二集合中的对象与目标对象更加相似,第二集合中的对象是正确结果的概率更高,即第二集合中的对象的置信度更高。那么,通过高优先选择第二集合中的对象,低优先选择第三集合中的对象,可以保证选择的对象中正确结果的比例更大,从而提升了搜索的召回率。
在一些可能的实施例中,对于第二集合中的每个对象,该对象与目标对象的相似度越大,该对象在第二集合中的排列位置可以越靠前。例如,对于第二集合中的对象i以及第二集合中的对象j来说,如果对象i与目标对象的相似度大于对象j与目标对象的相似度,则对象i排在对象j的前面。
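Tying the internal order of the second set to similarity with the target object, so that objects more similar to the target object come first and less similar objects come later, further improves the accuracy of the candidate order. Objects more similar to the target object are more likely to be correct results, and because they are placed at the front, selecting objects in front-to-back order raises the proportion of correct results among the selected objects and places correct results as early as possible in the search results, so that they are displayed nearer the top when the terminal presents the results. For example, the object in the second set with the greatest similarity to the target object is ranked first in the second set and therefore first in the first set; after the server sends the selected objects to the terminal, that object appears first in the search results presented by the terminal. In addition, objects with low similarity to the target object are ranked toward the end of the search results or are kept out of the search results altogether, which reduces the number of incorrect results, raises the proportion of correct results, and effectively improves the recall rate of the search.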
同理地,在一些可能的实施例中,对于该第三集合中的每个对象,该对象与目标对象的相似度越大,该对象在第三集合中的排列位置可以越靠前。
通过将第三集合内部的排列顺序与目标对象的相似度关联起来,让与目标对象相似度大的对象排在前面,与目标对象相似度小的对象排在后面,在从按照前往后的顺序依次选择对象时,可以提升选择的对象中正确结果的比例。比如说,如果目标数目为K,第一集合包括(2*K)个对象,第二集合包括Q1个对象,第三集合包括Q2个对象,且K大于Q1,服务器会将第二集合中的每个对象,以及第三集合中相似度最大的对象至第三集合中相似度排在第(K-Q1)位的对象发往终端,而第三集合中相似度排在后K位的对象不会被选择,也不会被发往终端。那么,由于第三集合中相似度排在后K位的对象是错误结果的概率高于第一集合中的其他对象,那么通过将这K个对象从搜索结果排除掉,可以提升搜索结果的召回率。
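As for how to realize the internal order of the second set, in some possible embodiments, if after step 303 the objects in the first set are already sorted by similarity to the target object, the server can preserve each object's order when dividing the first set into the second set and the third set, so that each object keeps the relative position it had in the first set. For example, for objects i and j to be assigned to the second set, if object i precedes object j in the first set, object i still precedes object j in the second set. Likewise, for objects n and m to be assigned to the third set, if object n precedes object m in the first set, object n still precedes object m in the third set. Preserving the first-set order in this way achieves both effects described above: "the greater an object's similarity to the target object, the earlier its position in the second set" and "the greater an object's similarity to the target object, the earlier its position in the third set".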
在另一些可能的实施例中,服务器也可以在第一集合划分为第二集合以及第三集合之后,对第二集合中的每个对象按照相似度从大到小的顺序,重新进行排序,从而实现上述“对象与该目标对象的相似度越大,该对象在该第二集合中的排列位置越靠前”的效果。其中,排序时依据的相似度,可以是划分第二集合以及第三集合时使用的相似度,例如,可以是多种相似度算法得出的对象的综合相似度。同理地,可以对第三集合中的每个对象按照相似度从大到小的顺序进行排序,通过进行排序,来实现上述“对象与目标对象的相似度越大,对象在第三集合中的排列位置越靠前”的效果。
需要说明的一点是,上述仅是以一台服务器执行图3实施例中的各个步骤为例进行说明, 在一些可能的实施例中,图3实施例也可以由服务器集群来执行,服务器集群中的不同服务器可以用于执行不同步骤。作为示例,可以由一个服务器执行步骤302,由另一个服务器执行步骤303,由再一个服务器执行步骤304至步骤305。通过将图3实施例中的不同步骤分散在不同的服务器执行,能够让不同的服务器分担整体的计算量,从而避免单个服务器负载过重,提高相似性搜索方法的整体的计算效率。
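The method provided by this embodiment designs a reranking framework under a single input. After some objects are found in the database, objects whose similarity to the target object satisfies the first threshold are ranked toward the front, objects whose similarity to the target object does not satisfy the first threshold are ranked toward the back, and objects are then selected in front-to-back order and sent to the terminal. Because objects whose similarity satisfies the first threshold have high confidence and are very likely to be correct results, ranking them at the front raises the probability of selecting correct results and the proportion of correct results among those selected, which effectively improves the recall rate.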
并且,相关技术在重排时,需要基于搜索出的对象以及目标对象,构造幽灵点,再根据幽灵点在数据库中再次搜索一遍,导致运算量大、运算速度慢,并且如果幽灵点不准确,会导致召回率急剧下降。而本实施例中,基于从数据库中搜索出的第一集合即可进行重排,相对于先根据第一集合中的对象以及目标对象构造幽灵点,再根据幽灵点在数据库中再次搜索一遍的方式来说,省去了构造幽灵点的步骤以及在数据库中进行二次搜索的步骤,从而解决了重排框架运算量巨大的问题,提升了运算速度,并且避免幽灵点不准而导致召回率下降的问题,提升了鲁棒性。
在一些可能的实施例中,上述图3实施例中涉及的排序功能可以基于队列机制实现,以下通过图6实施例进行阐述。
图6是本申请实施例提供的一种相似性搜索方法的流程图,如图6所示,该方法包括下述步骤601至607:
步骤601、终端向服务器发送搜索指令。
步骤602、服务器接收终端的搜索指令。
步骤603、服务器对数据库进行搜索,得到第一集合。
步骤604、服务器将第一集合划分为第二集合以及第三集合。
步骤605、服务器将第二集合存入第一队列,将第三集合存入第二队列。
第一队列和第二队列分别代表着不同的优先级,第一队列的优先级高于第二队列的优先级。在一些可能的实施例中,服务器可以先创建第一队列以及第二队列;服务器可以将该第二集合中的每个对象存入第一队列,将该第三集合中的每个对象存入第二队列。其中,由于第一队列中存储的对象的置信度高于第二队列中存储的对象的置信度,因此第一队列可以记为信任队列。
示意性地,参见图7,如果代表点1与目标对象之间的综合相似度满足第一阈值,则簇1属于第二集合,会将簇1存入第一队列;如果代表点1与目标对象之间的综合相似度不满足第一阈值,则簇1属于第三集合,会将簇1存入第二队列;如果代表点2与目标对象之间的综合相似度满足第一阈值,则簇2属于第二集合,会将簇2存入第一队列;如果代表点2与目标对象之间的综合相似度不满足第一阈值,则簇2属于第三集合,会将簇2存入第二队列; 如果散点与目标对象之间的综合相似度满足第一阈值,则散点属于第二集合,会将散点存入第一队列;如果散点与目标对象之间的综合相似度不满足第一阈值,则散点属于第三集合,会将散点存入第二队列。
步骤606、服务器按照第一队列在前、第二队列在后的顺序,对第一集合中的对象进行排序。
服务器可以将第一队列中的对象排在第二队列中的对象之前。示例性地,对于第一集合中的对象i和对象j来说,如果对象i是第一队列中的对象,对象j是第二队列中的对象,则会将对象i排在对象j之前。那么,如果第一集合包括(2*K)个对象,第一队列包括Q1个对象,第二队列包括Q2个对象,那么进行排序后,第1个对象至第Q1个对象均是第一队列中的对象,第(Q1+1)至最后1个对象均是第二队列中的对象。
步骤607、服务器按照从前往后的顺序,从第一集合中选择目标数目个对象发往终端。
通过将第一集合中的各个对象存储在队列这种顺序存储结构中,按照从前往后的顺序选择对象,即为按照从队首到队尾的顺序选择对象。具体来说,服务器可以从第一队列的队首开始,按照从队首到队尾的顺序,从第一队列中依次选取对象。在选择对象的过程中,服务器可以判断已经选择的对象的数目是否达到目标数目;如果选择的对象的数目达到目标数目,则停止从第一队列继续选择对象,也不会选择第二队列中的对象;如果已经选择到第一队列的队尾,而选择的对象的数目尚未达到目标数目,则继续从第二队列的队首开始,按照从队首到队尾的顺序依次选择对象,直至选择的对象的数目达到目标数目为止。通过这种选择方式,使得第一队列的优先级高于第二队列的优先级。
比如说,如果目标数目为K,第一集合包括(2*K)个对象,第一队列包括Q1个对象,第二队列包括Q2个对象,如果K小于Q1,则服务器会选择第一队列的队首至第一队列中排在第K位的对象;如果K等于Q1,则服务器会选择第一队列中的每个对象;如果K大于Q1,服务器会选择第一队列中的每个对象,以及第二队列的队首至第二队列中排在第(K-Q1)位的对象。
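The head-to-tail selection across prioritized queues can be sketched with `collections.deque`. The function name and queue contents are illustrative; because the queues are passed as a list in priority order, the same sketch would also serve a three-queue variant.

```python
from collections import deque


def select_from_queues(queues, k):
    """Take objects from each queue head to tail, highest priority first,
    until k objects have been selected or every queue is empty."""
    selected = []
    for queue in queues:
        while queue and len(selected) < k:
            selected.append(queue.popleft())
        if len(selected) == k:
            break  # lower-priority queues are not touched
    return selected


first_queue = deque(["s1", "s2"])             # trusted queue, Q1 = 2
second_queue = deque(["t1", "t2", "t3", "t4"])
result = select_from_queues([first_queue, second_queue], k=3)
print(result, list(second_queue))
```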
在一些可能的实施例中,与步骤306中第二集合的内部排列顺序对应,第一队列的内部排列顺序可以依据与目标对象之间的相似度确定。具体来说,对于第一队列中的每个对象,该对象与目标对象的相似度越大,该对象在第一队列中的排列位置可以越靠前。例如,第一队列的队首可以是第一队列中与目标对象的相似度最大的对象,第一队列的队尾可以是第一队列中与目标对象的相似度最小的对象。关于如何实现第一队列的内部排列顺序,在一些可能的实施例中,如果在执行步骤303之后,第一集合中的各个对象已经按照与目标对象的相似度的大小顺序依次排序,则服务器在将第二集合存入第一队列时,服务器可以保持第一队列中每个对象中的排列顺序还是在第一集合中的排列顺序。在另一些可能的实施例中,服务器也可以将该第二集合存入第一队列之后,可以按照与目标对象的相似度从大到小的顺序,对第一队列中的各个对象进行重新排序,例如采用多种相似度算法结合得出的、第一集合中的对象与目标对象的综合相似度,对第一队列中的对象进行重新排序。
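Likewise, in some possible embodiments, the internal order of the third set can be represented by the internal order of the second queue. Specifically, for each object in the second queue, the greater its similarity to the target object, the earlier its position in the second queue can be. As for how to realize the internal order of the second queue, in some possible embodiments, if after step 303 the objects in the first set are already sorted by similarity to the target object, the server can, when storing the third set in the second queue, keep each object in the second queue in the relative order it had in the first set. In other possible embodiments, after storing the third set in the second queue, the server can re-sort the objects in the second queue in descending order of similarity to the target object.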
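Tying the internal order of the first queue or the second queue to similarity with the target object, so that objects more similar to the target object come first and less similar objects come later, further improves the accuracy of the candidate order. When objects are selected in front-to-back order, this raises the proportion of correct results among the selected objects and places correct results as early as possible in the search results, so that they are displayed nearer the top when the terminal presents the results. For example, the object in the first queue with the greatest similarity to the target object is ranked first in the first queue and therefore first in the first set; after the server sends the selected objects to the terminal, that object appears first in the search results presented by the terminal. Objects with low similarity to the target object sit toward the back of their queue, so they are ranked toward the end of the search results or kept out of them altogether, which reduces the number of incorrect results, raises the proportion of correct results, and effectively improves the recall rate of the search. For example, if the target number is K, the first set includes (2*K) objects, the first queue includes Q1 objects, the second queue includes Q2 objects, and K is greater than Q1, the server sends to the terminal every object in the first queue together with the objects in the second queue from the one with the greatest similarity through the one ranked (K-Q1)-th by similarity, while the K objects ranked last by similarity in the second queue are neither selected nor sent. Because those K objects are more likely to be incorrect results than the other objects in the first set, excluding them from the search results improves the recall rate.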
本实施例提供的方法,在实现图3实施例达到的效果的基础上,设计了一套基于队列的重排框架,通过将第二集合加入第一队列,将第三集合加入第二队列,可以将从数据库中搜索出的对象划分出多种队列,在选取对象时,通过将第一队列中的对象排在前、第二队列中的对象排在后,在按照从前到后的顺序选取对象时,会高优先选取第一队列中的对象,低优先选取第二队列中的对象,那么由于第一队列中对象的置信度高于第二队列中对象,在选取的对象的总数目一定的情况下,能够提高选取置信度高的对象的概率,降低选取置信度低的对象的概率,从而提高搜索结果中置信度高的对象的占比,因此可以提高搜索结果的召回率,并且,能够让置信度高的对象置于搜索结果的前列,提高搜索结果顺序的准确性。
在一些可能的实施例中,在上述图3实施例的基础上,还可以将第三集合进一步划分不同的集合,以下通过图8实施例进行阐述。
图8是本申请实施例提供的一种相似性搜索方法的流程图,如图8所示,该方法包括由服务器执行的步骤801至806:
步骤801、终端向服务器发送搜索指令。
步骤802、服务器接收终端的搜索指令。
步骤803、服务器对数据库进行搜索,得到第一集合。
步骤804、服务器将第一集合划分为第二集合以及第三集合。
步骤805、服务器将第三集合划分为第四集合以及第五集合。
本实施例中,会将第二集合作为重排的基础,根据与第二集合的相似度是否满足第二阈值,来对第三集合进行划分,以便根据划分的结果进行重新排序。
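In the related art, a single candidate object is usually taken as the granularity of reranking and as its basis. If that single candidate is chosen inaccurately, for example if it is an incorrect result, the order of candidates after reranking becomes even less accurate and the recall rate drops sharply. In this embodiment, on the one hand, the second set usually includes multiple objects; compared with reranking based on a single candidate, extending the granularity from a single candidate to a whole set coarsens the granularity of reranking, solves the problem of recall dropping because a single candidate was chosen inaccurately, and therefore improves the robustness of reranking. On the other hand, because the similarity between the objects in the second set and the target object satisfies the first threshold, the objects in the second set have high confidence, and reranking the third set on the basis of the second set improves the accuracy of reranking.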
第四集合中每个对象与第二集合的相似度满足第二阈值,第五集合中每个对象与第二集合的相似度不满足第二阈值。其中,该第二阈值可以低于第一阈值,该第二阈值可以高于第三阈值,第二阈值可以根据实验、经验或需求设置,第二阈值可以预先存储在服务器中。在一些可能的实施例中,相似度满足第二阈值可以指相似度大于第二阈值,相似度不满足第二阈值可以指相似度小于或等于第二阈值。在另一些可能的实施例中,相似度满足第二阈值可以指相似度大于或等于第二阈值,相似度不满足第二阈值可以指相似度小于第二阈值。
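Compared with objects in the fifth set, objects in the fourth set have a similarity to the second set that satisfies the second threshold, so they are more likely to be correct results and have higher confidence. During reranking, objects in the fourth set therefore take priority in order over objects in the fifth set.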
关于划分出第四集合以及第五集合的方式,在一些可能的实施例中,服务器可以创建第四集合以及第五集合;对于第三集合中的每个对象,服务器可以获取该对象与第二集合之间的相似度;服务器可以判断该对象与第二集合之间的相似度是否满足第二阈值;如果该对象与第二集合之间的相似度满足第二阈值,则将该对象加入第四集合;如果该对象与第二集合之间的相似度不满足第二阈值,则将该对象加入第五集合。其中,如果第二集合包括n个对象,则对于第三集合中的对象i,可以获取对象i与这n个对象中每个对象之间的相似度,得到n个相似度,可以对n个相似度进行结合,将得到的结果作为对象i与第二集合之间的相似度。其中,n为正整数,结合的方式包括而不限于加权求和、求和、加权平均、取平均中的任意一项及其组合。
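Scoring each object in the third set against the whole second set, and then splitting on the second threshold, can be sketched as follows. Averaging is used as the combination of the n similarities, which is only one of the options listed above; the function names, the toy similarity, and the threshold value are hypothetical.

```python
def similarity_to_set(obj, reference_set, sim):
    """Mean of the n similarities between obj and the n reference objects."""
    if not reference_set:
        raise ValueError("the reference set must not be empty")
    return sum(sim(obj, ref) for ref in reference_set) / len(reference_set)


def split_third_set(third_set, second_set, sim, second_threshold):
    """Divide the third set into the fourth and fifth sets."""
    fourth, fifth = [], []
    for obj in third_set:
        score = similarity_to_set(obj, second_set, sim)
        (fourth if score > second_threshold else fifth).append(obj)
    return fourth, fifth


def toy_sim(a, b):
    """Toy similarity on numbers: 1 at distance 0, falling toward 0."""
    return 1.0 / (1.0 + abs(a - b))


second_set = [1.0, 2.0]
third_set = [1.5, 10.0]
print(split_third_set(third_set, second_set, toy_sim, second_threshold=0.5))
```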
对于第三集合中的每个对象来说,获取该对象与第二集合之间的相似度的过程,可以理解为对该对象进行打分操作;该对象与第二集合之间的相似度,可以理解为该对象的分数,能够反映该对象的置信度。
在一些可能的实施例中,服务器可以采用多种相似度算法结合,获取第三集合中的对象与第二集合的综合相似度,如果综合相似度满足第二阈值,则将对象加入到第四集合,如果综合相似度不满足第二阈值,则将对象加入第五集合。
在一些可能的实施例中,对于多种相似度算法中的每种相似度算法,服务器可以采用该相似度算法,获取第三集合中对象与第二集合之间的相似度,再对多个相似度进行结合,得到对象与第二集合之间的综合相似度。其中,对多个相似度进行结合的方式包括而不限于加权求和、求和、加权平均、取平均中的任意一项及其组合。以结合方式为加权平均为例,可 以根据每个相似度算法对应的权重,对每个相似度算法得出的第三集合中对象与第二集合之间的相似度进行加权平均,将加权平均值作为第三集合中对象与第二集合之间的综合相似度。
示意性地,参见图9,对于第三集合中的每个对象,可以采用欧式距离算法获取该对象与第二集合之间的相似度,采用rank order算法获取该对象与第二集合之间的相似度,采用机器学习模型获取该对象与第二集合之间的相似度,得到3个相似度,再将这3个相似度结合为1个相似度,判断这个相似度是否满足第二阈值,如果满足第二阈值,将这个对象加入第四集合,如果不满足第二阈值,将这个对象加入第五集合。
通过采用多种相似度算法结合来获取相似度,可以结合多种度量方式,从而解决了单一的度量方式不准确的技术问题。另外,采用的多种相似度算法可以包括rank order算法或者其他考虑了群体关系的相似度算法,从而可以提高对象与第二集合的相似度的准确性。
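In some possible embodiments, the multiple similarity algorithms used to divide the fourth set and the fifth set can include the first similarity algorithm, that is, the similarity algorithm used when searching the database. They can also include similarity algorithms other than the first similarity algorithm. For example, the first similarity algorithm can be the Euclidean distance algorithm, and the multiple similarity algorithms can include the Euclidean distance algorithm and the rank order algorithm.
The first set is first obtained from the similarity given by the first similarity algorithm, and the fourth set and the fifth set are then divided using the comprehensive similarity obtained by combining the first similarity algorithm with other similarity algorithms. Combining the first similarity algorithm with other algorithms compensates for the shortcomings of its measurement and thereby improves on it, so the comprehensive similarity is more accurate than the similarity given by the first similarity algorithm alone. An object with high comprehensive similarity to the second set is therefore significantly more likely to be a correct result than an object with low comprehensive similarity to the second set; that is, the confidence of the fourth set as a whole is higher than that of the fifth set as a whole. Placing objects in the fourth set before objects in the fifth set therefore improves the accuracy of the order of objects in the first set.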
需要说明的一点是,划分第四集合和第五集合时采用的相似度算法,与划分第二集合和第三集合时采用的相似度算法可以相同也可以不同。其中,划分第四集合和第五集合时采用的相似度算法,可以比划分第二集合和第三集合时采用的相似度算法更多,比如说,如果划分第二集合和第三集合时采用了第一相似度算法以及第二相似度算法,划分第四集合和第五集合时采用的多个相似度算法可以包括第一相似度算法以及第二相似度算法,另外还可以包括其他相似度算法。例如,在数据库进行搜索时,采用欧式距离算法,在划分第二集合和第三集合时,可以采用欧式距离算法以及rank order算法,在划分第四集合和第五集合时,可以仍采用欧式距离算法以及rank order算法。或者,在划分第四集合和第五集合时,可以采用欧式距离算法以及rank order算法以及机器学习模型,从而让划分第四集合和第五集合时,相似度的精确度更高。
需要说明的一点是,采用多种相似度算法结合,来获取第三集合中对象与第二集合之间的综合相似度仅是一种示意性实施方式,在另一些可能的实施例中,服务器也可以采用一种相似度算法,来获取第三集合中对象与第二集合之间的相似度。
在一些可能的实施例中,可以在第三集合内部划分出簇,根据对象所属的簇来获取相似度。具体来说,这种方式可以包括下述步骤一至步骤二。
步骤一、服务器从第三集合中获取簇。
从第三集合获取到的簇的数量可以为一个或多个。在一些可能的实施例中,簇的获取方式可以包括下述方式一至方式二中的任意一项及其结合。
方式一、服务器采用聚类算法,对第三集合进行聚类,得到簇。
方式二、对于第三集合中的每个对象,服务器获取该对象与第三集合中该对象之外的其他对象之间的相关度,根据每个对象与其他对象之间的相关度,获取相关度满足预设条件的多个对象,将该相关度满足预设条件的多个对象划分为簇。
服务器可以对第三集合中的多个对象进行两两比对,得到第三集合中任两个对象之间的相关度,服务器可以判断第三集合中每个对象与第三集合中其他对象之间的相关度是否满足预设条件,如果第三集合中对象与第三集合中其他对象之间的相关度满足预设条件,则将该第三集合中对象划分至簇,如果第三集合中对象与第三集合中其他对象之间的相关度不满足预设条件,则将该第三集合中对象作为散点。
在一些可能的实施例中,服务器可以根据第三集合每个对象与第三集合中其他对象之间的相关度,生成相关度矩阵,服务器可以遍历相关度矩阵,寻找相关度满足相关度条件的对象,将这些对象划分为簇,将相关度矩阵中簇之外的剩余对象作为散点。其中,相关度矩阵可以如上述图3实施例中步骤306中的表1所示。
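One way to turn a relatedness matrix into clusters and scattered points is sketched below. It is only an assumption about how the "preset condition" could be applied: two objects are linked when their pairwise relatedness reaches a threshold, and linked objects are merged with a union-find structure, so cluster membership is transitive. The source does not name a specific merging algorithm, and the matrix values here are invented.

```python
def clusters_from_matrix(matrix, min_relatedness):
    """Group object indices whose pairwise relatedness reaches min_relatedness.

    Returns (clusters, scattered): clusters is a list of index lists with at least
    two members; scattered is a list of indices that joined no cluster.
    """
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= min_relatedness:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    clusters = [members for members in groups.values() if len(members) > 1]
    scattered = [members[0] for members in groups.values() if len(members) == 1]
    return clusters, scattered


relatedness = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.1, 0.8],
    [0.1, 0.1, 1.0, 0.1],
    [0.2, 0.8, 0.1, 1.0],
]
clusters, scattered = clusters_from_matrix(relatedness, 0.7)
```

In this example objects 0 and 3 end up in the same cluster through object 1 even though their direct relatedness is low; a stricter reading of the preset condition would require every pair inside a cluster to meet the threshold.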
步骤二、服务器获取簇与第二集合之间的相似度,作为簇中的每个对象与第二集合之间的相似度。
通过将簇与第二集合之间的相似度作为簇中每个对象与第二集合之间的相似度,如果簇与第二集合之间的相似度满足第二阈值,则会将簇中每个对象均加入到第四集合,如果簇与第二集合之间的相似度不满足第二阈值,则会将簇中每个对象均加入到第五集合。也即是,通过这种方式,属于同一个簇中的各个对象被加入的集合可以相同。
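In addition, after the objects in the third set are divided into clusters, one or more objects may not belong to any cluster. Such objects may be called scattered points. For a scattered point, the similarity between the scattered point and the second set may be obtained directly and compared with the second threshold: if the similarity satisfies the second threshold, the scattered point is placed in the fourth set; otherwise, it is placed in the fifth set.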
在一些可能的实施例中,获取簇与第二集合之间的相似度的方式包括而不限于下述方式(1)至方式(2)中的任意一项及其组合:
方式(1)服务器从簇中选取代表点,服务器获取代表点与第二集合之间的相似度,作为簇与第二集合之间的相似度。
可以使用代表点,来代替整个簇中的所有对象,去和第二集合进行度量。其中,如果代表点包括多个对象,可以获取每个代表点与第二集合之间的相似度;可以对每个代表点与第二集合之间的相似度求平均,将平均值作为簇与第二集合之间的相似度。或者,可以获取每个代表点与第二集合之间的相似度的和值,将和值作为簇与第二集合之间的相似度。通过将代表点与第二集合之间的相似度作为簇与第二集合之间的相似度,如果代表点与第二集合之间的相似度满足第二阈值,则会将簇中的每个对象加入第四集合,如果代表点与第二集合之间的相似度不满足第二阈值,则会将簇中的每个对象加入第五集合。
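Manner (2): the server obtains the similarity between each object in the cluster and the second set, and derives the similarity between the cluster and the second set from these per-object similarities. For example, the server may average the similarities between the objects and the second set and use the average as the similarity between the cluster and the second set. Alternatively, the server may sum the similarities between the objects and the second set and use the sum as the similarity between the cluster and the second set.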
在一些可能的实施例中,服务器可以判断簇中对象的数目是否满足数目阈值,如果簇中对象的数目满足数目阈值,则采用方式(1),如果簇中对象的数目不满足数目阈值,则采用方式(2)。
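The choice between manner (1) and manner (2) can be sketched as below. The names are hypothetical, "the number of objects satisfies the count threshold" is assumed to mean "strictly greater than", and the representative-point picker is left to the caller because the source does not say how a representative is chosen.

```python
from statistics import mean


def cluster_similarity(cluster, set_similarity, pick_representative, count_threshold):
    """Similarity between a cluster and the second set.

    set_similarity      -- callable: object -> similarity between it and the second set
    pick_representative -- callable: cluster -> one representative object (assumed given)
    count_threshold     -- clusters larger than this use manner (1), others manner (2)
    """
    if len(cluster) > count_threshold:
        # Manner (1): one evaluation on the representative stands for the whole cluster.
        return set_similarity(pick_representative(cluster))
    # Manner (2): average the per-object similarities.
    return mean(set_similarity(obj) for obj in cluster)


# Toy setup: objects are already their own similarity to the second set,
# and the representative is the middle element after sorting.
middle = lambda cluster: sorted(cluster)[len(cluster) // 2]
identity = lambda obj: obj
large = cluster_similarity([0.2, 0.4, 0.9], identity, middle, count_threshold=2)
small = cluster_similarity([0.2, 0.4, 0.9], identity, middle, count_threshold=5)
```

Manner (1) costs one similarity evaluation per cluster, which is presumably why it would be preferred for large clusters, while manner (2) uses every member and is affordable for small ones.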
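Illustratively, referring to FIG. 10, a clustering algorithm may be used to divide the objects in the third set into clusters and scattered points. Alternatively, the objects in the third set may be compared pairwise to obtain a relatedness matrix, and a data merging algorithm may be used to divide the objects in the matrix into clusters and scattered points. It is then determined whether the similarity of each cluster or scattered point satisfies the second threshold: if it does, the cluster or scattered point is added to the fourth set; if it does not, the cluster or scattered point is added to the fifth set.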
通过这种方式,达到的效果至少可以包括:在搜索的过程中,受到目标对象较为模糊或者其他因素的影响,从数据库中搜索出的对象可能存在噪声数据。相关技术中,噪声数据会导致搜索结果中包括较多的错误结果,导致召回率较低。而通过将第三集合中互相相似的对象聚为簇,噪声数据能够被划分到对应的簇中,那么通过将簇与第二集合的相似度来作为簇中每个对象的相似度,噪声数据本身与第二集合之间的相似度会被替换为簇与第二集合之间的相似度,那么即使噪声数据本身与第二集合之间的相似度较高,由于没有使用噪声数据本身与第二集合之间的相似度,而是使用了该噪声数据所属的簇与第二集合之间的相似度,可以有效地防止噪声数据的影响,从而解决了由于噪声数据而造成误判的问题,减少了搜索结果中错误结果的数量,进而极大地提升了召回率。
需要说明的一点是,多种相似度算法结合、簇类划分这两个技术手段可以结合,以形成步骤805。具体来说,可以先将第三集合划分为簇或散点;采用多种相似度算法结合,获取簇与该第二集合之间的相似度,作为该簇中的每个对象与该第二集合之间的相似度,采用多种相似度算法结合,获取散点与该第二集合之间的相似度;把相似度满足第二阈值的簇和散点加入第四集合,把相似度不满足第二阈值的簇和散点加入第五集合。
步骤806、服务器按照第二集合中的对象在前、第三集合中的对象在后的顺序,对第一集合中的对象进行排序。
步骤807、服务器按照第四集合中的对象在前、第五集合中的对象在后的顺序,对第三集合中的对象进行排序。
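The server may place the objects in the fourth set before the objects in the fifth set. After this sorting, the order of the objects in the third set is: objects in the fourth set first, objects in the fifth set last. For example, for an object i and an object j in the third set, if object i belongs to the fourth set and object j belongs to the fifth set, object i is placed before object j. As a further example, suppose the target number is K, the first set includes (2*K) objects, the second set includes Q1 objects, and the third set includes Q2 objects, of which Q3 objects belong to the fourth set and Q4 objects belong to the fifth set, where Q2=Q3+Q4 and Q3 and Q4 are both positive integers. With the fourth set placed before the fifth set, the 1st through Q3-th objects in the third set are the objects of the fourth set, and the (Q3+1)-th through last objects are the objects of the fifth set. The order of the objects in the first set is therefore: the Q1 objects of the second set first, the Q3 objects of the fourth set in the middle, and the Q4 objects of the fifth set last.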
需要说明的一点是,本实施例对步骤806以及步骤807的执行顺序不做限定。例如,步骤806与步骤807可以顺序执行。作为示例,可以先执行步骤806,再执行步骤807;也可以先执行步骤807,再执行步骤806。当然,步骤806与步骤807也可以并行执行,即,可以同时执行步骤806以及步骤807。
步骤808、服务器按照从前往后的顺序,从第一集合中选择目标数目个对象发往终端。
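Because the objects in the second set come first, followed by the objects in the fourth set and then the objects in the fifth set, the second set has a higher priority than the fourth set, and the fourth set has a higher priority than the fifth set when objects are selected. Specifically, if the number of objects in the second set is greater than or equal to the target number, the server selects only objects from the second set and none from the third set. If the number of objects in the second set is less than the target number, the server selects all objects in the second set and then continues selecting from the fourth set. In that case, if the sum of the number of objects in the second set and the number of objects in the fourth set is greater than or equal to the target number, the server selects objects from the second set and the fourth set but not from the fifth set. Only if that sum is less than the target number does the server, after selecting all objects in the second set and the fourth set, continue selecting from the fifth set.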
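For example, suppose the target number is K, the first set includes (2*K) objects, the second set includes Q1 objects, and the third set includes Q2 objects, of which the fourth set includes Q3 objects and the fifth set includes Q4 objects, where K is a positive integer, Q1 is a positive integer or 0, Q2=2*K-Q1, and Q2=Q3+Q4. The server may sort the (2*K) objects in the first set in the order of the Q1 objects of the second set first, the Q3 objects of the fourth set next, and the Q4 objects of the fifth set last, and select K objects from the sorted first set to send to the terminal. If K is less than Q1, the server sends K objects from the second set to the terminal; the remaining (Q1-K) objects in the second set, all Q3 objects in the fourth set and all Q4 objects in the fifth set are not selected. If K equals Q1, the server sends exactly the Q1 objects of the second set to the terminal, and none of the Q3 objects in the fourth set or the Q4 objects in the fifth set is selected. If K is greater than Q1 and less than (Q1+Q3), the server sends the Q1 objects of the second set and (K-Q1) objects of the fourth set to the terminal; the remaining (Q3-K+Q1) objects in the fourth set and all Q4 objects in the fifth set are not selected. If K equals (Q1+Q3), the server sends all Q1 objects of the second set and all Q3 objects of the fourth set to the terminal. If K is greater than (Q1+Q3), the server sends the Q1 objects of the second set, the Q3 objects of the fourth set and (K-Q1-Q3) objects of the fifth set to the terminal.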
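The case analysis above reduces to concatenating the three sets in priority order and taking the first K objects. The sketch below shows this, together with a helper that reports how many objects come from each set; both function names and the sample object labels are assumptions made for illustration.

```python
def select_top_k(second, fourth, fifth, k):
    """Concatenate the sets in priority order and keep the first k objects."""
    return (list(second) + list(fourth) + list(fifth))[:k]


def counts_taken(q1, q3, q4, k):
    """Number of objects taken from the second, fourth and fifth sets for a target of k."""
    from_second = min(k, q1)
    from_fourth = min(k - from_second, q3)
    from_fifth = min(k - from_second - from_fourth, q4)
    return from_second, from_fourth, from_fifth


# K = 5 with 2K = 10 candidates split as Q1 = 1, Q3 = 2, Q4 = 7:
# the whole second set, the whole fourth set, and K - Q1 - Q3 = 2 objects of the fifth set.
taken = counts_taken(1, 2, 7, 5)
result = select_top_k(["s1", "s2"], ["f1", "f2", "f3"], ["g1"], 4)
```

Each branch of the case analysis corresponds to where the slice boundary K falls relative to Q1 and Q1+Q3.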
由于第四集合中的对象与第二集合之间的相似度满足第二阈值,而第五集合中的对象与第二集合之间的相似度不满足第二阈值,因此第四集合与第五集合相比较来说,第四集合中的对象与第二集合更加相似,第四集合中的对象是正确结果的概率更高,即第四集合中的对象的置信度更高。那么,通过让第四集合中的对象的优先级高于第五集合中的对象的优先级,可以保证选取的对象中正确结果的比例更大,从而提升了搜索的召回率。
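In some possible embodiments, for each object in the third set, the greater the similarity between the object and the second set, the earlier the object may be positioned in the third set. For example, for an object i and an object j in the third set, if the similarity between object i and the second set is greater than the similarity between object j and the second set, object i is placed before object j. The similarity on which the position within the third set is based may be obtained with a single similarity algorithm, or may be a comprehensive similarity obtained by combining multiple similarity algorithms; this embodiment does not limit this.
By tying the position within the third set to the similarity with the second set, objects that are more similar to the second set are placed earlier and objects that are less similar are placed later, which further improves the accuracy of the candidate order. Because objects that are more similar to the second set are more likely to be correct results, selecting objects from front to back raises the proportion of correct results among the selected objects and selects as many objects similar to the second set as possible, which effectively improves the recall rate. In addition, objects with low similarity to the second set are placed toward the end of the search results or are left out of them altogether, which reduces the number of incorrect results in the search results.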
对于第三集合中的每个对象,服务器可以按照该对象与第二集合的相似度,对第三集合中的每个对象按照相似度从大到小的顺序,重新进行排序,通过进行排序,来实现上述“对象与该第二集合的相似度越大,该对象在该第三集合中的排列位置越靠前”的效果。
需要说明的一点是,第三集合中,可以所有对象的排列位置均与第二集合的相似度相关,也可以仅是部分对象的排列位置与第二集合的相似度相关。
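For example, for each object in the fourth set, the greater the similarity between the object and the second set, the earlier the object may be positioned in the fourth set. For each object in the fifth set, the greater the similarity between the object and the target object, the earlier the object may be positioned in the fifth set. In this way, the order of the objects in the fifth set can remain the order in which they were retrieved from the database. Illustratively, after the fourth set and the fifth set are obtained, the server may re-sort the objects in the fourth set in descending order of their similarity to the second set, so that the order of the fourth set changes from descending similarity to the target object to descending similarity to the second set. The fourth set may therefore be referred to as the reranked result set. For the fifth set, the server may keep the objects in the order in which they were retrieved from the database. That is, the objects in the fifth set may be arranged from front to back in descending order of their similarity to the target object, and the object in the fifth set that is most similar to the target object ranks first in the fifth set. Then, when step 808 is performed, if K objects are selected from the fifth set, these K objects keep the order they had in step 803, so their order in the search results coincides with the initial order.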
通过保持第五集合中的对象的排列顺序为候选对象的排列顺序,如果第一集合中每个对象与目标对象之间的相似度均不满足第一阈值,则第二集合为空,第四集合可以也为空,第一集合中的每个对象均会被加入第五集合中。由于第五集合中对象的排列顺序为与目标对象的相似度大小的顺序,因此在按照从前往后的顺序依次选择对象时,会选择相似度排在前目标数目位的对象,因此即使第一集合中没有高置信度的对象,也可以保证搜索结果不会比相关技术中差,从而实现兜底的功能。
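The internal ordering and the fallback behaviour can be sketched as follows: the fourth set is re-sorted by similarity to the second set, while the fifth set keeps its retrieval order. The function name, the object labels and the scores are assumptions for illustration only.

```python
def rerank_third_set(fourth, fifth, score_to_second):
    """Order the third set: re-sorted fourth set first, fifth set in its original order.

    score_to_second -- callable: object -> similarity between it and the second set
    """
    # sorted() is stable, so ties in the fourth set keep their retrieval order.
    reordered_fourth = sorted(fourth, key=score_to_second, reverse=True)
    return reordered_fourth + list(fifth)


scores = {"obj4": 0.6, "obj5": 0.7, "obj6": 0.9}
order = rerank_third_set(["obj4", "obj5", "obj6"], ["obj7", "obj8"], scores.get)

# Fallback: when nothing reaches the thresholds, every candidate is in the fifth set
# and the original retrieval order is returned unchanged.
fallback = rerank_third_set([], ["obj1", "obj2", "obj3"], scores.get)
```

Because the fifth set is never re-sorted, an empty second and fourth set yields exactly the order produced by the initial database search, which is the fallback guarantee described above.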
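On top of the effects achieved by the embodiment of FIG. 3, the method provided in this embodiment uses the second set as the reference for reranking: objects in the third set whose similarity to the second set satisfies the threshold are added to the fourth set, objects whose similarity to the second set does not satisfy the threshold are added to the fifth set, and the fourth set is placed before the fifth set, thereby reranking the objects in the third set. Because the objects in the second set have high confidence and the objects in the fourth set are similar to them, an object in the fourth set is more likely to be a correct result than an object in the fifth set. Moving the fourth set forward and the fifth set backward therefore makes the order of the third set more accurate, raises the proportion of correct results among the objects selected from front to back, and further improves the recall rate.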
在一些可能的实施例中,上述图8实施例中涉及的排序功能可以基于队列机制实现,以下通过图11实施例进行阐述。
图11是本申请实施例提供的一种相似性搜索方法的流程图,如图11所示,该方法包括下述步骤1101至1109:
步骤1101、终端向服务器发送搜索指令。
步骤1102、服务器接收终端的搜索指令。
步骤1103、服务器对数据库进行搜索,得到第一集合。
步骤1104、服务器将该第一集合划分为第二集合以及第三集合。
步骤1105、服务器将第二集合存入第一队列。
步骤1106、服务器将第三集合划分为第四集合以及第五集合。
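Because the first queue stores the second set, the process of measuring the similarity between an object and the second set in the embodiment of FIG. 8 may be replaced with measuring the similarity between the object and the first queue. That is, the server may obtain the similarity between each object in the third set and the first queue; if the similarity satisfies the second threshold, the object is added to the fourth set; otherwise, the object is added to the fifth set.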
步骤1107、服务器将第四集合存入第二队列,将第五集合存入第三队列。
第一队列、第二队列以及第三队列分别代表着不同的优先级,第一队列的优先级高于第二队列的优先级,第二队列的优先级高于第三队列的优先级。在一些可能的实施例中,服务器可以先创建第一队列、第二队列以及第三队列;服务器可以将该第二集合中的每个对象存入第一队列,将该第四集合中的每个对象存入第二队列,将第五集合中的每个对象存入第三队列。
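Illustratively, referring to FIG. 12, the third set may be divided into cluster 1, cluster 2 and multiple scattered points. Representative point 1 is selected from cluster 1, and representative point 2 is selected from cluster 2. Multiple similarity algorithms are combined to obtain the comprehensive similarity between representative point 1 and the second set, between representative point 2 and the second set, and between each scattered point and the second set. If the comprehensive similarity between representative point 1 and the second set satisfies the second threshold, cluster 1 is stored in the second queue; otherwise, cluster 1 is stored in the third queue. Likewise, if the comprehensive similarity between representative point 2 and the second set satisfies the second threshold, cluster 2 is stored in the second queue; otherwise, cluster 2 is stored in the third queue. For each scattered point, if the comprehensive similarity between the scattered point and the second set satisfies the second threshold, the scattered point is stored in the second queue; otherwise, it is stored in the third queue.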
步骤1108、服务器按照第一队列最前、第二队列其次、第三队列最后的顺序,对第一集合中的对象进行排序。
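The server may place the objects in the first queue before the objects in the second queue, and the objects in the second queue before the objects in the third queue. After this sorting, the order of the objects in the first set is: objects of the second set first, objects of the fourth set in the middle, and objects of the fifth set last. For example, for objects i, j and k in the first set, if object i is in the first queue, object j is in the second queue and object k is in the third queue, object i is placed before object j, and object j is placed before object k. Thus, if the first set includes (2*K) objects, the first queue includes Q1 objects, the second queue includes Q2 objects and the third queue includes Q3 objects, then after sorting the 1st through Q1-th objects are the objects of the first queue, the (Q1+1)-th through (Q1+Q2)-th objects are the objects of the second queue, and the (Q1+Q2+1)-th through last objects are the objects of the third queue.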
步骤1109、服务器按照从前往后的顺序,从第一集合中选择目标数目个对象发往终端。
服务器可以从第一队列的第一个对象开始,按照从队首至队尾的顺序,从第一队列中依次选取对象,如果选择的对象的数目达到目标数目,则停止从第一队列继续选择对象,也不会选择第二队列中的对象;如果已经选择到第一队列的队尾,而选择的对象尚未达到目标数目,服务器才会在选择第一队列中的对象的基础上,继续从第二队列的队首开始,按照从队首到队尾的顺序,从第二队列中依次选择对象。其中,如果选择的对象的数目达到目标数目,则停止从第二队列继续选择对象,也不会选择第三队列中的对象;如果已经选择到第二队列的队尾,而选择的对象尚未达到目标数目,服务器才会在选择第一队列中的对象以及第二队列中的对象的基础上,继续从第三队列的队首开始,按照从队首到队尾的顺序,从第三队列中依次选择对象。通过这种选择方式,使得第一队列的优先级高于第二队列的优先级,第二队列的优先级高于第三队列的优先级。
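For example, suppose the target number is K, the first set includes (2*K) objects, the first queue includes Q1 objects, the second queue includes Q2 objects and the third queue includes Q3 objects. If K is less than Q1, the server selects the objects from the head of the first queue through the K-th object of the first queue. If K equals Q1, the server selects every object in the first queue. If K is greater than Q1 and less than (Q1+Q2), the server selects every object in the first queue, plus the objects from the head of the second queue through the (K-Q1)-th object of the second queue. If K=(Q1+Q2), the server selects every object in the first queue and every object in the second queue. If K>(Q1+Q2), the server selects every object in the first queue, every object in the second queue, and the objects from the head of the third queue through the (K-Q1-Q2)-th object of the third queue.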
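A minimal sketch of the queue-based selection, assuming Python's `collections.deque` as the queue type; the function name and object labels are hypothetical. The queues are read in priority order without being modified, and iteration stops as soon as the target number is reached.

```python
from collections import deque
from itertools import chain, islice


def select_from_queues(first_queue, second_queue, third_queue, k):
    """Take up to k objects, head to tail, from the first, second, then third queue."""
    return list(islice(chain(first_queue, second_queue, third_queue), k))


first_queue = deque(["a1", "a2"])          # second set: highest priority
second_queue = deque(["b1", "b2", "b3"])   # fourth set
third_queue = deque(["c1", "c2", "c3"])    # fifth set: lowest priority

picked = select_from_queues(first_queue, second_queue, third_queue, 4)
```

As the source notes for queues in general, any ordered container such as a list or array would work equally well here, since the sketch only iterates from head to tail.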
在一些可能的实施例中,与步骤808中第四集合的内部排列顺序对应,对于第二队列中的每个对象,该对象与第二集合的相似度越大,该对象在第二队列中的排列位置可以越靠前。例如,第二队列的队首可以是第二队列中与第二集合的相似度最大的对象,第二队列的队尾可以是第二队列中与第二集合的相似度最小的对象。关于如何实现第二队列的内部排列顺序,在一些可能的实施例中,服务器可以将第四集合存入第二队列之后,按照与第二集合的相似度从大到小的顺序,对第二队列中的各个对象进行重新排序,例如可以根据多种相似度算法得出的、第一集合中的对象与第二集合的相似度,对第二队列中的各个对象进行重新排序。
当然,第二队列内部的不同对象的排列顺序也可以保持为步骤1103中该每个对象在第一集合中的排列顺序,具体来说,对于第二队列中的每个对象,该对象与目标对象的相似度越大,该对象在第二队列中的排列位置可以越靠前。
在一些可能的实施例中,与步骤808中第五集合的内部排列顺序对应,对于第三队列中的每个对象,该对象与目标对象的相似度越大,该对象在第三队列中的排列位置可以越靠前。例如,第三队列的队首可以是第三队列中与目标对象的相似度最大的对象,第三队列的队尾可以是第三队列中与目标对象的相似度最小的对象。其中,如果在执行步骤1103时,第一集合中各个对象从前往后的排列顺序,为该各个对象与目标对象的相似度从大到小的顺序,则服务器在将第五集合存入第三队列时,服务器可以保持第三队列中的每个对象的排列顺序为该每个对象在第一集合中的排列顺序。
示意性地,如果目标数目为5,如果服务器在执行步骤1103时,数据库中的所有对象中与目标对象的相似度最大的对象是对象1,其次是对象2,再次是对象3,依次类推,搜索出了10个对象,则第一集合是(对象1、对象2、对象3……对象10)。第四集合是(对象4、对象5、对象6),第五集合是(对象7、对象8、对象9、对象10),其中第四集合的三个对象中,对象6与第二集合的相似度最大、对象5与第二集合的相似度其次,对象4与第二集合的相似度最小,则第二队列的内部排列顺序为:对象6为队首,对象5位于队列中间,对象4是队尾。而第三队列的内部排列顺序可以为(对象7、对象8、对象9、对象10),即维持了顺序的不变。因此,第三队列可以记为保序队列。
本实施例提供的方法,在实现图8实施例达到的效果的基础上,设计了一套基于队列的重排框架,通过将第二集合加入第一队列,将第四集合加入第二队列,将第五集合加入第三队列,可以将从数据库中搜索出的对象划分出多种队列,在选取对象时,通过第一队列中的对象排在前、第二队列中的对象排在中间,第三队列中的对象排在最后,在按照从前到后的顺序选取对象时,会高优先选取第一队列中的对象,其次优先选取第二队列中的对象,低优先选取第三队列中的对象,那么由于三种队列中,第一队列中对象的置信度最高,第二队列中对象的置信度其次,第三队列中对象的置信度最低,在选取的对象的总数目一定的情况下,能够提高选取置信度高的对象的概率,降低选取置信度低的对象的概率,从而提高搜索结果中置信度高的对象的占比,因此可以提高搜索结果的召回率,并且,能够让置信度高的对象置于搜索结果的前列,提高搜索结果顺序的准确性。
需要说明的一点是,图8实施例以及图11实施例所示的队列机制仅是一种示例性实施方式,而本申请的保护范围并不局限于此,例如,可以将队列机制等同替换为其他顺序存储结构,比如说将图8实施例以及图11实施例中的队列替换为数组,而这些修改或替换都应涵盖在本申请的保护范围之内。
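The similarity search method of the embodiments of this application has been described above. The similarity search apparatus of the embodiments of this application is described below. It should be understood that the similarity search apparatus has any function of the server in the foregoing method.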
本申请还提供了一种相似性搜索装置。如图13所示,相似性搜索装置1300包括接收模块1301,搜索模块1302,划分模块1303、排序模块1304和发送模块1305。以上各个模块可以为软件模块。
接收模块1301,用于执行步骤302;搜索模块1302,用于执行步骤303;划分模块1303,用于执行步骤304;排序模块1304,用于执行步骤305;发送模块1305,用于执行步骤306。
在一种可能的实现中,该划分模块,还用于执行步骤805或者步骤1106。
在一种可能的实现中,该排序模块,具体用于执行步骤1105至步骤1108。
In a possible implementation, the sorting module is specifically configured to perform step 605 and step 606.
相似性搜索装置1300可以作为云搜索服务向用户提供。例如,相似性搜索装置1300(或其部分)运行在云环境上,例如运行在云环境上的一个或多个服务器上,用户选择目标对象发送至接收模块1301后,启动相似性搜索装置1300对目标对象进行搜索,输出的目标数目个对象被提供给用户。当然,该装置运行在云环境仅是示意,该装置也可以运行在边缘环境中,例如运行在边缘环境中的一个或多个服务器上。该装置还可以运行在终端环境中,具体为终端环境中的一个或多个终端设备上。终端设备可以为手机、笔记本、服务器、台式电脑等。
应理解,上述实施例提供的相似性搜索装置在进行相似性搜索时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将服务器的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的相似性搜索装置与相似性搜索方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
以上介绍了本申请实施例的相似性搜索装置,以下介绍该相似性搜索装置可能的产品形态。应理解,但凡具备上述图13中的相似性搜索装置的特征的任何形态的产品都落入本申请的保护范围。还应理解,以下介绍仅为举例,不限制本申请实施例的相似性搜索装置的产品形态仅限于此。
作为一种可能的产品形态,本申请实施例中的相似性搜索装置,可以由一般性的总线体系结构来实现。例如,该相似性搜索装置可以实施为服务器,参见图14,图14是本申请实施例提供的一种服务器的结构示意图,该服务器1400可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器1401和一个或一个以上的存储器1402,另外还可以包括总线1403、收发器1404,处理器1401、存储器1402以及收发器1404之间可以通过总线1403通信。
存储器1402中存储有至少一条指令,至少一条指令由处理器1401加载并执行以实现上述各个方法实施例提供的相似性搜索方法,处理器1401可以控制收发器1404执行步骤302以及步骤306,或者步骤602以及步骤607,或者步骤802以及步骤808,或者步骤1102以及步骤1109。
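The processor 1401 may be a central processing unit (CPU). The memory 1402 may include a volatile memory, such as a random access memory (RAM). The memory 1402 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, an HDD or an SSD. The memory 1402 may further include an operating system and other software modules required to run processes. The operating system may be LINUX™, UNIX™, WINDOWS™ or the like. The server may also have components such as a wired or wireless network interface and an input/output interface for input and output, and may include other components for implementing device functions, which are not described here. In addition, the server 1400 may be a server in a cloud environment, a server in an edge environment, or a server in a terminal environment.
As shown in FIG. 15, different modules of the similarity search apparatus 1300 may run on different servers. This application therefore also proposes a server cluster. As shown in FIG. 15, the server cluster includes multiple servers 1400; for the structure of each server 1400, see the embodiment of FIG. 14. Communication paths are established between the servers 1400 through a communication network. In the foregoing method embodiments, different steps of the similarity search method may be performed on different servers; for example, server 1 performs step 302, server 2 performs steps 303 to 305, and server 3 performs step 306. Correspondingly, different modules of the similarity search apparatus 1300 may be distributed across the servers 1400; for example, the receiving module 1301 is located on server 1, the search module 1302 and the division module 1303 are located on server 2, and the sending module 1305 is located on server 3.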
任一服务器1400可以为云环境中的服务器,或边缘环境中的服务器,或终端环境中的服务器。
考虑到数据库或者第一集合占用的存储空间很大,服务器1400本身可能无法存储全部的数据库或者第一集合,如图16所示,本申请还提出了一种服务器集群,该服务器集群包括多个服务器1400以及云存储服务。数据库或者第一集合存储在云存储服务中(例如对象存储服务),用户在云存储服务中申请一定容量的存储空间,并将数据库或者第一集合存入存储空间中。服务器1400运行时,通过通信网络从远端的云存储服务中获取所需的对象。
作为一种可能的产品形态,本申请实施例中的相似性搜索装置,可以由芯片来实现。
在一些可能的实施例中,该芯片包括处理器,用于从存储器中调用并运行存储器中存储的指令,使得安装有该芯片的设备执行上述各个方法实施例提供的相似性搜索方法。
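In some possible embodiments, the chip includes an input interface, an output interface, a processor and a memory, which are connected through an internal connection path. The processor is configured to execute the instructions in the memory. When the instructions are executed, the processor performs steps 303 to 305, steps 603 to 606, steps 803 to 807, or steps 1103 to 1108; the processor controls the input interface to perform step 302, step 602, step 802 or step 1102; and the processor controls the output interface to perform step 306, step 607, step 808 or step 1109 of the foregoing method embodiments.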
作为一种可能的产品形态,本申请实施例中的相似性搜索装置,还可以使用下述来实现:一个或多个现场可编程门阵列(Field-Programmable Gate Array,FPGA)、可编程逻辑器件(英文:Programmable Logic Device,简称:PLD)、复杂可编程逻辑器件(英文:Complex Programmable Logic Device,简称:CPLD)、控制器、专用集成电路(Application Specific Integrated Circuit,ASIC)、状态机、门逻辑、分立硬件部件、晶体管逻辑器件、网络处理器(Network Processor,NP)、任何其它适合的电路、或者能够执行本申请通篇所描述的各种功能的电路的任意组合。
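As a possible product form, the similarity search apparatus in the embodiments of this application may be implemented as a computer program that includes instructions for performing the foregoing method embodiments. The computer program may be a software installation package; when the similarity search method needs to be used, the computer program can be downloaded and executed on a server.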
应理解,上述各种产品形态的相似性搜索装置,分别具有上述方法实施例中服务器的任意功能,此处不再赘述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例中描述的各方法步骤和模块,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各实施例的步骤及组成。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。本领域普通技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的系统、装置和模块的具体工作过程,可以参见前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个模块或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或模块的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。
该作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理模块,即可以位于一个地方,或者也可以分布到多个网络模块上。可以根据实际的需要选择其中的部分或者全部模块来实现本申请实施例方案的目的。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以是两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
该集成的模块如果以软件功能模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例该方法的全部或部分步骤。而前述的存储介质包括易失性存储器以及非易失性存储器,例如存储介质可以是:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟、快闪存储器、硬盘(hard disk drive,HDD)、固态硬盘(solid state drive,SSD)或者光盘等各种可以存储程序代码的介质。
上述各个实施例的流程的描述各有侧重,某个流程中没有详述的部分,可以参见其他流程的相关描述。
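The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid-state drive).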
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
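The foregoing are merely optional embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.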

Claims (22)

  1. 一种相似性搜索方法,其特征在于,所述方法包括:
    接收终端的搜索指令,所述搜索指令用于指示搜索与目标对象相似的对象;
    对数据库进行搜索,得到第一集合,所述第一集合包括多个对象;
    将所述第一集合划分为第二集合以及第三集合,所述第二集合中每个对象与所述目标对象之间的相似度满足第一阈值,所述第三集合中每个对象与所述目标对象的相似度不满足所述第一阈值;
    按照所述第二集合中的对象在前、所述第三集合中的对象在后的顺序,对所述第一集合中的对象进行排序;
    按照从前往后的顺序,从所述第一集合中选择对象发往所述终端。
  2. 根据权利要求1所述的方法,其特征在于,所述对数据库进行搜索,得到第一集合具体包括:采用第一相似度算法,把与所述目标对象之间的相似度满足第三阈值的对象加入所述第一集合;
    进一步的,所述将所述第一集合划分为第二集合以及第三集合,具体包括:
    采用多种相似度算法结合,获取所述第一集合中的对象与所述目标对象之间的综合相似度,把综合相似度满足所述第一阈值的对象加入所述第二集合,把综合相似度不满足所述第一阈值的对象加入所述第三集合,所述多种相似度算法包括所述第一相似度算法。
  3. 根据权利要求1所述的方法,其特征在于,
    所述第三集合包括第四集合以及第五集合,所述第四集合中每个对象与所述第二集合的相似度满足第二阈值,所述第五集合中每个对象与所述第二集合的相似度不满足所述第二阈值;
    在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
  4. 根据权利要求1所述的方法,其特征在于,所述对数据库进行搜索,得到第一集合具体包括:采用第一相似度算法,把与所述目标对象之间的相似度满足第三阈值的对象加入所述第一集合;
    进一步的,所述将所述第一集合划分为第二集合以及第三集合之后,所述方法还包括:
    采用多种相似度算法结合,获取所述第三集合中的对象与所述第二集合的综合相似度,把综合相似度满足第二阈值的对象加入第四集合,把综合相似度不满足所述第二阈值的对象加入第五集合,所述多种相似度算法包括所述第一相似度算法,所述第三集合包括所述第四集合以及所述第五集合,在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
  5. 根据权利要求1所述的方法,其特征在于,所述将所述第一集合划分为第二集合以及第三集合之后,所述方法还包括:
    从所述第三集合中获取簇,所述簇中的任一对象与所述簇中的其他对象之间的相关度符合预设条件;
    获取所述簇与所述第二集合之间的相似度,作为所述簇中的每个对象与所述第二集合之间的相似度;
    把相似度满足第二阈值的对象加入第四集合,把相似度不满足所述第二阈值的对象加入第五集合,所述第三集合包括所述第四集合以及所述第五集合,在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
  6. 根据权利要求5所述的方法,其特征在于,所述获取所述簇与所述第二集合之间的相似度,具体包括下述任意一项:
    从所述簇中选取代表点,获取所述代表点与所述第二集合之间的相似度,作为所述簇与所述第二集合之间的相似度,所述代表点用于代表所述簇中的每个对象;
    获取所述簇中的每个对象与所述第二集合之间的相似度,根据所述每个对象与所述第二集合之间的相似度,获取所述簇与所述第二集合之间的相似度。
  7. 根据权利要求3至6中任一项所述的方法,其特征在于,所述按照所述第二集合中的对象在前、所述第三集合中的对象在后的顺序,对所述第一集合中的对象进行排序,具体包括:
    将所述第二集合存入第一队列;
    将所述第四集合存入第二队列;
    将所述第五集合存入第三队列;
    按照所述第一队列最前、所述第二队列其次、所述第三队列最后的顺序,对所述第一集合中的对象进行排序。
  8. 根据权利要求1至6中任一项所述的方法,其特征在于,所述按照所述第二集合中的对象在前、所述第三集合中的对象在后的顺序,对所述第一集合中的对象进行排序,具体包括:
    将所述第二集合存入第一队列;
    将所述第三集合存入第二队列;
    按照所述第一队列在前、所述第二队列在后的顺序,对所述第一集合中的对象进行排序。
  9. 根据权利要求1至8中任一项所述的方法,其特征在于,
    对于所述第二集合中的每个对象,所述对象与所述目标对象的相似度越大,所述对象在所述第二集合中的排列位置越靠前;和/或,
    对于所述第三集合中的每个对象,所述对象与所述第二集合的相似度越大,所述对象在所述第三集合中的排列位置越靠前。
  10. 根据权利要求1所述的方法,其特征在于,所述将所述第一集合划分为第二集合以及第三集合,具体包括:
    从所述第一集合中获取簇,所述簇中的任一对象与所述簇中的其他对象之间的相关度符合预设条件;
    获取所述簇与所述目标对象之间的相似度,作为所述簇中的每个对象与所述目标对象之间的相似度;
    把相似度满足所述第一阈值的对象加入所述第二集合,把相似度不满足所述第一阈值的对象加入所述第三集合。
  11. 一种相似性搜索装置,其特征在于,所述装置包括:
    接收模块,用于接收终端的搜索指令,所述搜索指令用于指示搜索与目标对象相似的对象;
    搜索模块,用于对数据库进行搜索,得到第一集合,所述第一集合包括多个对象;
    划分模块,用于将所述第一集合划分为第二集合以及第三集合,所述第二集合中每个对象与所述目标对象之间的相似度满足第一阈值,所述第三集合中每个对象与所述目标对象的相似度不满足所述第一阈值;
    排序模块,用于按照所述第二集合中的对象在前、所述第三集合中的对象在后的顺序,对所述第一集合中的对象进行排序;
    发送模块,用于按照从前往后的顺序,从所述第一集合中选择对象发往所述终端。
  12. 根据权利要求11所述的装置,其特征在于,所述搜索模块,具体用于:采用第一相似度算法,把与所述目标对象之间的相似度满足第三阈值的对象加入所述第一集合;
    进一步的,所述划分模块,具体用于:采用多种相似度算法结合,获取所述第一集合中的对象与所述目标对象之间的综合相似度,把综合相似度满足所述第一阈值的对象加入所述第二集合,把综合相似度不满足所述第一阈值的对象加入所述第三集合,所述多种相似度算法包括所述第一相似度算法。
  13. 根据权利要求11所述的装置,其特征在于,所述第三集合包括第四集合以及第五集合,所述第四集合中每个对象与所述第二集合的相似度满足第二阈值,所述第五集合中每个对象与所述第二集合的相似度不满足所述第二阈值;
    在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
  14. 根据权利要求11所述的装置,其特征在于,所述搜索模块,具体用于:采用第一相似度算法,把与所述目标对象之间的相似度满足第三阈值的对象加入所述第一集合;
    进一步的,所述划分模块,还用于:采用多种相似度算法结合,获取所述第三集合中的对象与所述第二集合的综合相似度,把综合相似度满足第二阈值的对象加入第四集合,把综合相似度不满足所述第二阈值的对象加入第五集合,所述多种相似度算法包括所述第一相似度算法,所述第三集合包括所述第四集合以及所述第五集合,在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
  15. 根据权利要求11所述的装置,其特征在于,所述装置还包括:
    获取模块,用于从所述第三集合中获取簇,所述簇中的任一对象与所述簇中的其他对象之间的相关度符合预设条件;
    所述获取模块,还用于获取所述簇与所述第二集合之间的相似度,作为所述簇中的每个对象与所述第二集合之间的相似度;
    所述划分模块,还用于把相似度满足第二阈值的对象加入第四集合,把相似度不满足所述第二阈值的对象加入第五集合,所述第三集合包括所述第四集合以及所述第五集合,在所述第三集合中对象的顺序具体为:所述第四集合中的对象在前、所述第五集合中的对象在后。
  16. 根据权利要求15所述的装置,其特征在于,所述获取模块,具体用于执行下述任意一项:
    从所述簇中选取代表点,获取所述代表点与所述第二集合之间的相似度,作为所述簇与所述第二集合之间的相似度,所述代表点用于代表所述簇中的每个对象;
    获取所述簇中的每个对象与所述第二集合之间的相似度,根据所述每个对象与所述第二集合之间的相似度,获取所述簇与所述第二集合之间的相似度。
  17. 根据权利要求13至16中任一项所述的装置,其特征在于,所述排序模块,具体用于:
    将所述第二集合存入第一队列;
    将所述第四集合存入第二队列;
    将所述第五集合存入第三队列;
    按照所述第一队列最前、所述第二队列其次、所述第三队列最后的顺序,对所述第一集合中的对象进行排序。
  18. 根据权利要求11至16中任一项所述的装置,其特征在于,所述排序模块,具体用于:
    将所述第二集合存入第一队列;
    将所述第三集合存入第二队列;
    按照所述第一队列在前、所述第二队列在后的顺序,对所述第一集合中的对象进行排序。
  19. 根据权利要求11至18中任一项所述的装置,其特征在于,
    对于所述第二集合中的每个对象,所述对象与所述目标对象的相似度越大,所述对象在所述第二集合中的排列位置越靠前;和/或,
    对于所述第三集合中的每个对象,所述对象与所述第二集合的相似度越大,所述对象在所述第三集合中的排列位置越靠前。
  20. 根据权利要求11所述的装置,其特征在于,所述划分模块,具体包括:
    获取单元,用于从所述第一集合中获取簇,所述簇中的任一对象与所述簇中的其他对象之间的相关度符合预设条件;
    所述获取单元,还用于获取所述簇与所述目标对象之间的相似度,作为所述簇中的每个对象与所述目标对象之间的相似度;
    加入单元,用于把相似度满足所述第一阈值的对象加入所述第二集合,把相似度不满足所述第一阈值的对象加入所述第三集合。
  21. 一种服务器,其特征在于,所述服务器包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条指令,所述指令由所述一个或多个处理器加载并执行以实现如权利要求1至权利要求10任一项所述的相似性搜索方法。
  22. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令,所述指令由处理器加载并执行以实现如权利要求1至权利要求10任一项所述的相似性搜索方法。
PCT/CN2019/088879 2019-05-28 2019-05-28 相似性搜索方法、装置、服务器及存储介质 WO2020237511A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/088879 WO2020237511A1 (zh) 2019-05-28 2019-05-28 相似性搜索方法、装置、服务器及存储介质
CN201980096330.6A CN113811865A (zh) 2019-05-28 2019-05-28 相似性搜索方法、装置、服务器及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/088879 WO2020237511A1 (zh) 2019-05-28 2019-05-28 相似性搜索方法、装置、服务器及存储介质

Publications (1)

Publication Number Publication Date
WO2020237511A1 true WO2020237511A1 (zh) 2020-12-03

Family

ID=73552099

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/088879 WO2020237511A1 (zh) 2019-05-28 2019-05-28 相似性搜索方法、装置、服务器及存储介质

Country Status (2)

Country Link
CN (1) CN113811865A (zh)
WO (1) WO2020237511A1 (zh)

Also Published As

Publication number Publication date
CN113811865A (zh) 2021-12-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19930772

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19930772

Country of ref document: EP

Kind code of ref document: A1