WO2023108995A1 - Vector similarity calculation method, apparatus, device and storage medium - Google Patents

Vector similarity calculation method, apparatus, device and storage medium

Info

Publication number
WO2023108995A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
vector
cluster
distance
preset
Prior art date
Application number
PCT/CN2022/090758
Other languages
English (en)
French (fr)
Inventor
舒畅
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2023108995A1 publication Critical patent/WO2023108995A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2228 - Indexing structures
    • G06F16/2237 - Vectors, bitmaps or matrices
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2455 - Query execution
    • G06F16/24553 - Query execution of query operations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Definitions

  • The present application relates to the technical field of machine learning, and in particular to a vector similarity calculation method, apparatus, device and storage medium.
  • Vector similarity search needs to be performed quickly and accurately in a database.
  • The purpose of a similarity search is to quickly identify vectors in a data set that are similar to a specific query vector.
  • When similarity between vectors is computed in a database, the computational complexity increases linearly with the amount of target data. For example, if the database contains 10 million vectors and the vector most similar to the query vector must be found, the time complexity is O(N): the query vector has to be compared with the 10 million database vectors one by one, i.e., 10 million calculations. If the database then grows by another 10 million entries, 20 million calculations are required; the computational cost is very high, which is unfavorable for industrial-grade applications.
  • In the related art, vector similarity calculation systems mainly use neural network algorithms, which have high calculation accuracy but poor real-time performance; non-neural-network algorithms, on the other hand, suffer from low calculation accuracy and high computational complexity, and are difficult to use in complex business scenarios with large data volumes and high dimensionality.
  • In a first aspect, an embodiment of the present application proposes a vector similarity calculation method, including:
  • acquiring a query vector; calculating a first distance between the query vector and a preset first cluster center; selecting a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold; performing vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance; and calculating a vector similarity value according to the second distance.
  • In a second aspect, an embodiment of the present application proposes a vector similarity calculation apparatus, including:
  • a query vector acquisition module, configured to acquire a query vector;
  • a first distance calculation module, configured to calculate a first distance between the query vector and a preset first cluster center;
  • a first query cluster acquisition module, configured to select a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold;
  • a second distance calculation module, configured to perform vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance;
  • a vector similarity value calculation module, configured to calculate a vector similarity value according to the second distance.
  • the embodiment of the present application proposes a computer device, including a processor and a memory:
  • the memory is used to store programs
  • the processor is configured to execute a vector similarity calculation method according to the program.
  • the vector similarity calculation method includes: obtaining a query vector; calculating a first distance between the query vector and a preset first cluster center; selecting a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold; performing vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance; and calculating a vector similarity value according to the second distance.
  • An embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute a vector similarity calculation method;
  • the vector similarity calculation method includes: obtaining a query vector; calculating a first distance between the query vector and a preset first cluster center; selecting a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold; performing vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance; and calculating a vector similarity value according to the second distance.
  • The vector similarity calculation method, apparatus, device and storage medium proposed in the embodiments of the present application can filter out a first query cluster, which avoids querying the entire database and reduces the amount of query data to a certain extent; vector segmentation calculation is then performed in the first query cluster, and the segmented lookup simplifies the calculation, which reduces computational complexity, lowers calculation cost, and improves the efficiency of vector similarity calculation.
  • FIG. 1 is a schematic diagram of an exemplary system architecture provided by an embodiment of the present application
  • Fig. 2 is the flow chart of the vector similarity calculation method provided by one embodiment of the present application.
  • Fig. 3 is another flowchart of the vector similarity calculation method provided by one embodiment of the present application.
  • FIG. 4 is a schematic diagram of segmentation processing provided by an embodiment of the present application.
  • Fig. 5 is another flow chart of the vector similarity calculation method provided by one embodiment of the present application.
  • FIG. 6 is a schematic diagram of a second clustering provided by an embodiment of the present application.
  • Fig. 7 is another flow chart of the vector similarity calculation method provided by one embodiment of the present application.
  • Fig. 8 is another flow chart of the vector similarity calculation method provided by one embodiment of the present application.
  • Fig. 9 is another flow chart of the vector similarity calculation method provided by one embodiment of the present application.
  • Fig. 10 is another flow chart of the vector similarity calculation method provided by one embodiment of the present application.
  • Fig. 11 is a schematic diagram of vector similarity calculation provided by an embodiment of the present application.
  • Fig. 12 is a structural block diagram of an apparatus for calculating vector similarity provided by an embodiment of the present application.
  • K-means clustering algorithm: a clustering algorithm based on Euclidean distance, which assumes that the smaller the Euclidean distance between two vectors, the greater their similarity.
  • In the K-means clustering algorithm, the clustering parameter k is first determined, and then all the database vectors in the query database are divided into k clusters so that the resulting clusters satisfy the following conditions: database vectors within the same cluster have high similarity, while database vectors in different clusters have low similarity.
  • Cluster: a set of samples generated by clustering; samples in the same cluster are similar to each other and differ from samples in other clusters.
  • Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology.
  • Artificial intelligence is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
  • The purpose of a similarity search is to quickly identify vectors in the data set that are similar to a specific query item.
  • The common vector similarity calculation belongs to vector similarity search, which has become an important part of current network technology applications, such as recommending content of potential interest while a user browses news, or recommending products the user is inclined to buy while browsing goods.
  • However, when similarity between vectors is computed in a database, the computational complexity increases linearly with the amount of target data. For example, if the database contains 10 million vectors and the vector most similar to the query vector must be found, the time complexity is O(N): the query vector has to be compared with the 10 million database vectors one by one, i.e., 10 million calculations. If the database then grows by another 10 million entries, 20 million calculations are required; the computational cost is very high, which is unfavorable for industrial-grade applications.
  • In the related art, vector similarity calculation systems mainly use neural network algorithms.
  • Such algorithms have high calculation accuracy but poor real-time performance; non-neural-network algorithms, on the other hand, suffer from low calculation accuracy and high computational complexity, and are difficult to use in complex business scenarios with large data volumes and high dimensionality.
  • Therefore, in the embodiments of the present application, a query vector is obtained, the first distance between the query vector and the preset first cluster centers is calculated, the first query cluster is selected from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold, vector segmentation calculation is performed in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance, and a vector similarity value is calculated according to the second distance.
  • The first query cluster is screened out first, which avoids querying the entire database and reduces the amount of query data to a certain extent; vector segmentation calculation is then performed in the first query cluster, and the segmented lookup simplifies the calculation.
  • This reduces computational complexity, lowers calculation cost, and improves the efficiency of vector similarity calculation.
  • The vector similarity calculation method provided in the embodiments of the present application can be implemented by various electronic devices with computing and processing capabilities, for example various types of user terminals such as notebook computers, tablet computers, desktop computers, set-top boxes and mobile devices (for example, mobile phones, personal digital assistants, dedicated messaging devices and portable game devices), and can also be implemented by a server.
  • The above-mentioned server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, and big data and artificial intelligence platforms, which is not limited in the embodiments of the present application.
  • the following uses the application of the vector similarity calculation method provided in the embodiment of the present application to a server as an example to introduce the applicable application scenarios of the vector similarity calculation method provided in the embodiment of the present application.
  • FIG. 1 is a schematic diagram of an application scenario of a vector similarity calculation method provided in an embodiment of the present application.
  • the vector similarity calculation method provided by the embodiment of the present application is applied to a system framework 100 , where the system framework 100 may include a database 101 , a network 102 and a server 103 .
  • the network 102 is used as a medium for providing a communication link between the database 101 and the server 103 .
  • Network 102 may include various connection types, such as wired communication links, wireless communication links, and so on.
  • The server 103 obtains the query vector from the database 101, calculates the first distance between the query vector and the preset first cluster centers, selects the first query cluster from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold, performs vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance, and calculates a vector similarity value according to the second distance.
  • the vector similarity calculation method provided in the embodiment of the present application is generally executed by the server 103 , and correspondingly, the vector similarity calculation device is generally set in the server 103 .
  • the terminal device may also have a similar function to the server, so as to execute the vector similarity calculation solution provided in the embodiment of the present application.
  • The system architecture and application scenarios described in the embodiments of the present application are intended to more clearly illustrate the technical solutions of the embodiments of the present application and do not constitute a limitation on them.
  • Those skilled in the art will appreciate that, with the evolution of system architectures and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
  • The system architecture shown in FIG. 1 does not constitute a limitation on the embodiments of the present application, and may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
  • FIG. 2 is a flowchart of a vector similarity calculation method provided by an embodiment of the present application, including but not limited to steps S110 to S150 .
  • Step S110 acquiring query vectors.
  • the query vector may be a search keyword, specifically, the search keyword may be represented as a query vector, for example, as an N-dimensional query vector.
  • Step S120 calculating a first distance between the query vector and a preset first cluster center.
  • In one embodiment, in order to reduce the amount of query data to a certain extent and avoid querying the entire database, before step S110 or step S120 is performed, the vector similarity calculation method further includes the following step:
  • obtaining a preset query database, and performing first clustering on the database vectors of the query database to obtain first clusters, where each of the first clusters includes one first cluster center.
  • In one embodiment, the first clustering refers to clustering the database vectors in the query database using the K-means clustering algorithm.
  • The Euclidean distance can be interpreted as the length of the line segment connecting two points in Euclidean space. The Euclidean distance between an N-dimensional vector a and an N-dimensional vector b is expressed as follows:
  • the N-dimensional vector a is expressed as $a = (x_{11}, x_{12}, \ldots, x_{1N})$;
  • the N-dimensional vector b is expressed as $b = (x_{21}, x_{22}, \ldots, x_{2N})$;
  • the Euclidean distance is expressed as $d(a, b) = \sqrt{\sum_{i=1}^{N} (x_{1i} - x_{2i})^{2}}$.
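  • As a concrete check of the formula above (the numbers are illustrative, not from the original disclosure), the distance can be computed directly:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# d(a, b) = sqrt(sum_i (x_{1i} - x_{2i})^2)
d = np.sqrt(np.sum((a - b) ** 2))  # equivalent to np.linalg.norm(a - b)
print(d)  # 5.0
```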
  • In one embodiment, the clustering steps of the first clustering are described as follows: 1) select the clustering parameter k, giving k initial cluster centers; 2) for each database vector in the query database, calculate its Euclidean distance to the k initial cluster centers and assign it to the class of the nearest cluster center; 3) for each cluster center, recompute the center as the centroid of all database vectors assigned to its class; 4) repeat steps 2) and 3) until a preset termination condition is reached, at which point the iteration ends.
  • The termination condition may be a number of iterations, a minimum error change, or the like, and may be set according to specific requirements, which is not limited here.
  • In one embodiment, suppose the query database contains 10 million database vectors and the clustering parameter k is set to 1024. The K-means clustering algorithm is used to perform the first clustering to obtain 1024 first clusters, each containing one cluster center, i.e., 1024 first cluster centers in total. The 10 million entries are then divided among these 1024 clusters; the division criterion is not an even split, but rather each entry is assigned to the nearest of the 1024 clusters by distance, following the idea of clustering.
  • step S120 is: respectively calculating the distances between the N-dimensional query vector and the 1024 first cluster centers to obtain the corresponding 1024 first distances.
  • Step S130: select the first query cluster from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold.
  • In one embodiment, the first clusters closest to the query vector are selected as the first query cluster according to the preset first threshold, and the vector similarity calculation for the query vector is performed within the first query cluster, which avoids querying the entire database and reduces the amount of query data to a certain extent.
  • In one embodiment, first clusters whose first distance is smaller than the preset first threshold are selected from the first clusters as the first query cluster; the first query cluster includes one or more first clusters.
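  • A minimal sketch of this coarse filtering (steps S120-S130), using scikit-learn's KMeans for the first clustering; the cluster count, threshold rule and synthetic data are illustrative assumptions rather than values fixed by the method:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
database = rng.random((10_000, 128)).astype(np.float32)  # stand-in for the query database
query = rng.random(128).astype(np.float32)               # stand-in for the query vector

# First clustering: partition the database vectors into k first clusters
# (the description uses k = 1024 for 10 million vectors; 64 keeps the sketch small).
k = 64
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(database)
first_cluster_centers = km.cluster_centers_   # one first cluster center per first cluster
first_cluster_labels = km.labels_             # which first cluster each database vector fell into

# Step S120: first distances between the query vector and every first cluster center.
first_distances = np.linalg.norm(first_cluster_centers - query, axis=1)

# Step S130: keep the first clusters whose first distance is below a preset first threshold
# (here the threshold is chosen so that roughly the 4 nearest clusters survive).
first_threshold = np.partition(first_distances, 3)[3]
first_query_clusters = np.where(first_distances <= first_threshold)[0]
candidate_ids = np.where(np.isin(first_cluster_labels, first_query_clusters))[0]
print(len(candidate_ids), "candidate vectors instead of", len(database))
```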
  • Step S140 perform vector segmentation calculation in the first query cluster according to the plurality of query sub-segment vectors of the query vector, to obtain the second distance.
  • In one embodiment, vector segmentation calculation is performed in the first query cluster; the segmented lookup simplifies the calculation, reduces computational complexity and lowers calculation cost, so that similarity between vectors can be computed efficiently, which is suitable for complex business scenarios with large data volumes or high dimensionality.
  • step S140 includes but not limited to step S141 to step S144:
  • Step S141 obtaining the number of pre-segmentation segments, for example, the number of pre-segmentation segments is M.
  • step S142 the query vector is segmented according to the number of pre-segmentation segments to obtain query sub-segment vectors, and the number of query sub-segment vectors is equal to the number of pre-segmentation segments.
  • In one embodiment, the N-dimensional query vector is segmented according to the pre-segmentation number M to obtain M query sub-segment vectors; M identifies how many query sub-segment vectors the query vector is divided into, so M must divide the dimensionality of the query vector evenly.
  • Referring to FIG. 4, which is a schematic diagram of the segmentation processing with M = 4: when the query vector is 1*128-dimensional, it is divided into 4 segments, yielding 4 query sub-segment vectors, each of which is 32-dimensional.
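  • The segmentation in steps S141-S142 can be sketched as follows, assuming a 128-dimensional query vector and M = 4 as in the example above:

```python
import numpy as np

M = 4                                     # pre-segmentation number
query = np.arange(128, dtype=np.float32)  # stand-in for a 1*128-dimensional query vector

assert query.shape[0] % M == 0, "M must divide the vector dimensionality evenly"
query_sub_segments = query.reshape(M, query.shape[0] // M)  # 4 query sub-segment vectors of 32 dims each
print(query_sub_segments.shape)  # (4, 32)
```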
  • Step S143 respectively calculating the second distance between each query sub-segment vector and the preset second cluster center.
  • the second cluster center is the cluster center of the second cluster, and before performing step S143, the vector similarity calculation method also includes:
  • Acquiring the preset second cluster center may specifically include steps S161 to S163.
  • Step S161 Obtain a preset query database.
  • Step S162 Pre-segment the database vectors of the query database according to the number of pre-segmented segments to obtain multiple sub-segment vectors corresponding to each database vector.
  • FIG. 6 is a schematic diagram of performing the second clustering on the database vectors in the query database. Suppose there are N database vectors, each 1*128-dimensional, and the pre-segmentation number is 4; pre-segmentation then yields N*4 32-dimensional sub-segment vectors, with 4 sub-segment vectors corresponding to each database vector.
  • Step S163 Perform a second clustering process on each sub-segment vector of the database vector to obtain a plurality of second clusters corresponding to each sub-segment vector, wherein each second cluster includes a second cluster center, and the second cluster center is The cluster center of the second cluster.
  • The specific process is: cluster the f-th sub-segment of the database vectors separately to obtain a plurality of second clusters corresponding to each sub-segment vector, each second cluster including one second cluster center, where 1 ≤ f ≤ M, M is the pre-segmentation number, and f is an integer.
  • For example, to cluster the first sub-segment, the first sub-segment vectors of all database vectors are extracted and clustered, yielding multiple second clusters corresponding to the first sub-segment; for example, the K-means clustering algorithm is used to cluster the database vectors in the query database to obtain multiple second clusters, and each sub-segment corresponds to a group of second clusters.
  • Referring to FIG. 6, the 4 sub-segments are clustered separately, giving 4 groups of 256 second clusters (256 being an example), and each second cluster includes one second cluster center.
  • step S164 the second cluster center of each second cluster is used as a preset second cluster center.
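  • One way to realize steps S161-S164, clustering each sub-segment position of the database vectors separately, is sketched below; the 256 second clusters per segment follow the example in the description, while the database itself is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
N, D, M, CC = 5_000, 128, 4, 256   # database vectors, dimensions, segments, second clusters per segment
database = rng.random((N, D)).astype(np.float32)

# Step S162: pre-segment every database vector into M sub-segment vectors.
sub_dim = D // M
sub_segments = database.reshape(N, M, sub_dim)

# Steps S163-S164: run a second clustering independently for each segment position f and keep
# the CC second cluster centers of each group as the preset second cluster centers.
second_cluster_centers = []
for f in range(M):
    km = KMeans(n_clusters=CC, n_init=2, random_state=f).fit(sub_segments[:, f, :])
    second_cluster_centers.append(km.cluster_centers_)     # shape (CC, sub_dim)
second_cluster_centers = np.stack(second_cluster_centers)  # shape (M, CC, sub_dim)
print(second_cluster_centers.shape)  # (4, 256, 32)
```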
  • In one embodiment, after step S164 the method further includes:
  • Step S165 Mapping the database vectors in the query database to obtain cluster identification vectors.
  • In one embodiment, the database vectors in the query database are mapped in advance according to the second cluster centers, and each database vector is represented as a cluster identification vector, where the cluster identification vector represents the original database vector as a sequence of numbers whose length equals the pre-segmentation number.
  • step S165 includes but is not limited to the following steps:
  • Step S1651 calculating a third distance between each sub-segment vector of the database vector and its corresponding plurality of second cluster centers.
  • the third distance between each sub-segment vector and multiple second cluster centers of the corresponding sub-segment is calculated respectively.
  • Step S1652 according to the third distance and the preset second threshold, the second cluster center identifier corresponding to each sub-segment vector is obtained.
  • In one embodiment, the smallest of the plurality of third distances of each sub-segment vector is selected, i.e., the second cluster center closest to each sub-segment vector is selected according to the preset second threshold.
  • The second cluster center identifier of that closest second cluster center (i.e., the ID of the second cluster center, which may be expressed in decimal and is used to distinguish second cluster centers) is associated with the sub-segment vector.
  • Step S1653 according to the second cluster center identifier corresponding to the database vector, map the database vector into a cluster identifier vector.
  • In one embodiment, the second cluster center identifier of the closest second cluster center corresponding to each sub-segment vector of the database vector is obtained by the above steps, and the database vector is mapped to a cluster identification vector composed of these second cluster center identifiers.
  • The database vectors are mapped and expressed as cluster identification vectors: the N 128-dimensional database vectors are each mapped, according to the pre-segmentation number of 4, to a 4-dimensional cluster identification vector; the total dimensionality of all cluster identification vectors is N*4, and each value of a cluster identification vector represents the corresponding second cluster center identifier.
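  • Continuing the sketch above, steps S1651-S1653 can be rendered as follows; the arg-min over the third distances plays the role of the preset second threshold (the nearest second cluster center is kept):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Assumes sub_segments with shape (N, M, sub_dim) and second_cluster_centers with
# shape (M, CC, sub_dim) from the previous sketch are available.
N, M = sub_segments.shape[0], sub_segments.shape[1]

cluster_id_vectors = np.empty((N, M), dtype=np.int32)
for f in range(M):
    # Step S1651: third distances between every sub-segment vector and the CC second
    # cluster centers of segment f.
    third_distances = cdist(sub_segments[:, f, :], second_cluster_centers[f])  # (N, CC)
    # Steps S1652-S1653: record the ID of the nearest second cluster center per segment.
    cluster_id_vectors[:, f] = np.argmin(third_distances, axis=1)

print(cluster_id_vectors.shape)  # (N, 4): each 128-dimensional database vector becomes 4 integer IDs
```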
  • Step S150 calculating the vector similarity value according to the second distance.
  • step S150 includes but not limited to the following steps:
  • Step S151 calculating a second distance between each query subsegment vector and the second cluster center corresponding to the cluster identification vector.
  • In one embodiment, a cluster identification vector is selected for query, and the second distance between each query sub-segment vector and the second cluster center corresponding to the second cluster center identifier of the matching segment of the cluster identification vector is obtained.
  • In one embodiment, the second distances can be precomputed; for example, the second distances between each query sub-segment vector of the query vector and the groups of second cluster centers are stored in advance, so that when a cluster identification vector is selected for query, the relevant values can be read directly without calculation, which greatly reduces computation time and improves query efficiency.
  • Step S152 calculating the query distance according to the second distance.
  • the second distance corresponding to the M query sub-segment vectors is calculated, and the query distance is obtained after adding the M second distances.
  • Step S153 according to the query distance and the preset third threshold, calculate the vector similarity value.
  • In one embodiment, according to the preset third threshold, the database vector corresponding to the smallest query distance is selected as the vector most similar to the query vector, and the vector similarity value between the two vectors is calculated.
  • In one embodiment, the vector similarity value is a cosine similarity value, specifically the cosine of the angle between the two vectors in vector space; that is, the vector similarity value is calculated using the cosine similarity calculation method and measures the magnitude of the difference between the two vectors.
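  • The lookup described in steps S151-S153 can be sketched as follows; it assumes the query sub-segment vectors, second cluster centers and cluster identification vectors built in the earlier sketches, and reports the cosine similarity of the best candidate as stated above:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Assumes query_sub_segments (M, sub_dim), second_cluster_centers (M, CC, sub_dim),
# cluster_id_vectors (N, M) and database (N, D) from the earlier sketches.
M = cluster_id_vectors.shape[1]

# Step S151 (precomputable): second-distance matrix with one row per segment -> shape (M, CC).
second_distance_matrix = np.stack(
    [cdist(query_sub_segments[f][None, :], second_cluster_centers[f])[0] for f in range(M)]
)

# Step S152: the query distance of each database vector is the sum of M table lookups.
query_distances = second_distance_matrix[np.arange(M)[None, :], cluster_id_vectors].sum(axis=1)

# Step S153: take the database vector with the smallest query distance and report its
# cosine similarity to the query vector.
best = int(np.argmin(query_distances))
q = query_sub_segments.reshape(-1)
v = database[best]
cosine_similarity = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
print(best, cosine_similarity)
```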
  • In one embodiment, if the number of first query clusters is greater than one, the above steps are used to perform vector segmentation calculation in each first query cluster according to the plurality of query sub-segment vectors of the query vector, obtaining the second distance corresponding to each first query cluster.
  • The vector similarity value corresponding to each first query cluster is then calculated according to the second distance corresponding to that first query cluster, and a first sorting is performed on the vector similarity values obtained in the first query clusters.
  • The first sorting yields a first sorting result, and a vector similarity value is selected according to the first sorting result; that is, according to the vector similarity values, the database vector most similar to the query vector is sought among multiple first query clusters, which improves the accuracy of the vector similarity calculation.
  • In one embodiment, in order to improve query efficiency, the query is performed using parallel computing.
  • the process of parallel computing in this embodiment includes but is not limited to the following steps:
  • Step S710 performing aggregation processing on the first cluster corresponding to the preset first cluster center according to the preset aggregation number to obtain the aggregated cluster, which is stored in different storage spaces.
  • the value of the preset aggregation number may be [2, 1023], and the 1024 first clusters are respectively aggregated according to the preset aggregation number.
  • For example, when the preset aggregation number is 1, the 1024 first clusters correspond to 1024 aggregation clusters.
  • For example, when the preset aggregation number is 2, the first clusters can be aggregated in groups of two to obtain aggregation clusters, e.g., aggregation cluster 1: first cluster 1 - first cluster 2; aggregation cluster 2: first cluster 1 - first cluster 3; aggregation cluster 3: first cluster 1 - first cluster 4; and so on.
  • For example, when the preset aggregation number is 3, the first clusters can be aggregated in groups of three to obtain aggregation clusters, e.g., aggregation cluster 1: first cluster 1 - first cluster 2 - first cluster 3; aggregation cluster 2: first cluster 1 - first cluster 3 - first cluster 4; aggregation cluster 3: first cluster 1 - first cluster 2 - first cluster 5; and so on.
  • different aggregation clusters are stored in different storage spaces, for example, aggregation cluster 1 is stored in storage space 1, aggregation cluster 2 is stored in storage space 2, and aggregation cluster 3 is stored in storage space 3, and so on, divide different storage spaces for each different aggregation cluster. It can be understood that different storage spaces may be physically separated storage media, or may be different partitions on the same storage medium, as long as parallel computing can be realized, there is no limitation.
  • the process of calculating the vector similarity value of the query vector in parallel in each aggregation cluster is as follows.
  • Step S720 calculating the first distance between the query vector and the preset first cluster center in each aggregation cluster.
  • Step S730 in each aggregation cluster, select the first query cluster from the first cluster corresponding to the first cluster center according to the first distance and the preset first threshold.
  • Step S740 perform vector segmentation calculation in the first query cluster according to the multiple query sub-segment vectors of the query vector, to obtain the second distance corresponding to each aggregation cluster.
  • Step S750 calculating the vector similarity value corresponding to each aggregation cluster according to the second distance.
  • In one embodiment, the above steps are executed in parallel in the different storage spaces to calculate vector similarity values, and one or more vector similarity values of the query vector can be obtained in each aggregation cluster.
  • Step S760 performing a second sorting on the vector similarity values in each aggregation cluster to obtain a second sorting result, and selecting a vector similarity value according to the second sorting result.
  • In one embodiment, a second sorting is performed on the vector similarity values obtained in each aggregation cluster, and the closest vector similarity value is selected according to the second sorting result; that is, according to the vector similarity values, the database vector most similar to the query vector is sought among the multiple aggregation clusters, further improving the accuracy of the vector similarity calculation.
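  • A schematic of the parallel search over aggregation clusters (steps S710-S760), using Python's standard concurrent.futures; the aggregation scheme and the search_in_aggregation_cluster helper are placeholders for whichever per-cluster routine implements steps S720-S750:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, islice

def search_in_aggregation_cluster(aggregation_cluster):
    """Placeholder for steps S720-S750: run the coarse selection, segmented distance lookup
    and similarity computation inside one aggregation cluster, returning a list of
    (vector_similarity_value, database_vector_id) pairs."""
    return []

# Step S710: aggregate the first clusters in groups of `aggregation_number`; each aggregation
# cluster would sit in its own storage space (separate partitions, disks or machines).
first_cluster_ids = range(1024)
aggregation_number = 2
aggregation_clusters = list(islice(combinations(first_cluster_ids, aggregation_number), 8))  # a few, for illustration

# Steps S720-S750 run in parallel, one aggregation cluster per worker (separate processes or
# machines could be used instead of threads for truly independent storage spaces).
with ThreadPoolExecutor() as pool:
    per_cluster_results = list(pool.map(search_in_aggregation_cluster, aggregation_clusters))

# Step S760: second sorting of the similarity values gathered from every aggregation cluster.
all_hits = [hit for hits in per_cluster_results for hit in hits]
best_hits = sorted(all_hits, key=lambda hit: hit[0], reverse=True)[:10]
print(best_hits)
```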
  • Step S810 selecting a first query cluster.
  • The first clustering is performed in advance to obtain 1024 first clusters, and the first distances between the query vector and the 1024 first cluster centers are then calculated. According to the first distances and the preset first threshold, the first query cluster is selected from the first clusters, and the vector segmentation calculation is subsequently performed in the first query cluster.
  • step S820 the database vector is mapped into a cluster identification vector in advance.
  • the process of mapping the database vector into a cluster identification vector is described as:
  • Suppose there are N database vectors, each 1*128-dimensional, and the pre-segmentation number is 4; pre-segmentation then yields N*4 32-dimensional sub-segment vectors, with 4 sub-segment vectors corresponding to each database vector.
  • The database vectors are then mapped and expressed as cluster identification vectors: the N 128-dimensional database vectors are each mapped, according to the pre-segmentation number of 4, to a 4-dimensional cluster identification vector, and each value of a cluster identification vector represents the corresponding second cluster center identifier.
  • Step S830 calculating vector similarity, refer to FIG. 11 , which is a schematic diagram of vector similarity calculation.
  • Suppose the query vector is a 1*128-dimensional vector; it is divided into 4 query sub-segment vectors according to the pre-segmentation number of 4, and the second distance between each query sub-segment vector and the second cluster centers of the corresponding segment is calculated.
  • A 4*256 second-distance matrix is obtained, which can be precomputed.
  • Since the second distances between each query sub-segment vector of the query vector and the groups of second cluster centers are stored in advance, no calculation is needed when a cluster identification vector is selected for query: the relevant second-distance values can be read directly with only M lookups of the second-distance matrix, and the query distance of the query vector for each cluster identification vector is obtained by accumulation, which greatly reduces computation time and improves query efficiency.
  • In the related art, if there are N database vectors in the query database, the time complexity of calculating the optimal vector similarity value of the query vector over the query database is O(N).
  • the calculation time complexity of the embodiment of the present application is expressed as: O(C+M*CC+M), wherein C is the number of the first cluster, M is the number of pre-segmentation segments, and CC is the number of the second cluster.
  • The first step is to calculate the first distances between the query vector and the multiple first clusters, so the time complexity of this step is O(C).
  • In the second step, the query vector is divided into M query sub-segment vectors, and in the first query cluster the second distances between each query sub-segment vector of the query vector and the CC second cluster centers of the corresponding segment are calculated, so the time complexity of this step is O(M*CC).
  • In the third step, the query distance is calculated from the second distances by querying the second-distance matrix M times, so the time complexity of this step is O(M).
  • the calculation time complexity of the embodiment of the present application is: O(C+M*CC+M).
  • For example, suppose the query database contains 10 million database vectors.
  • In the related art, 10 million calculations are required, whereas in the embodiment of the present application only 1024 + 4*256 + 4 = 2052 calculations are needed (assuming 1024 first clusters, a pre-segmentation number of 4, and 256 second clusters per segment).
  • Suppose 10 million new entries are added and the vector dimensionality stays the same: the related art then requires 20 million calculations, while this embodiment requires 2048 + 4*512 + 4 = 4100 calculations (assuming 2048 first clusters, a pre-segmentation number of 4, and 512 second clusters per segment).
  • The number of first clusters and the number of second clusters will generally increase moderately as the data grows, but the growth is not large; hence, as the number of vectors N increases, the calculation time complexity of the related art gradually increases and becomes more and more time-consuming compared with the embodiment of the present application.
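  • The comparison can be checked with simple arithmetic; the figures reproduce the example above (C first clusters, M segments, CC second clusters per segment):

```python
def segmented_cost(C, M, CC):
    # O(C + M*CC + M): first distances, per-segment second distances, and M table lookups
    return C + M * CC + M

print(segmented_cost(1024, 4, 256))  # 2052 comparisons for 10 million database vectors
print(segmented_cost(2048, 4, 512))  # 4100 comparisons for 20 million database vectors
print(10_000_000, 20_000_000)        # brute-force comparisons required in the related art
```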
  • The vector similarity calculation method provided by the embodiments of the present application obtains a query vector, calculates the first distance between the query vector and the preset first cluster centers, selects the first query cluster from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold, performs vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance, and calculates a vector similarity value according to the second distance.
  • The first query cluster is screened out first, which avoids querying the entire database and reduces the amount of query data to a certain extent; vector segmentation calculation is then performed in the first query cluster, and the segmented lookup simplifies the calculation, reduces computational complexity and lowers calculation cost.
  • An embodiment of the present application also provides a vector similarity calculation apparatus; referring to FIG. 12, the apparatus includes:
  • a query vector acquisition module 121 configured to acquire a query vector
  • the first distance calculation module 122 is used to calculate the first distance between the query vector and the preset first cluster center;
  • the first query cluster acquisition module 123 is used to select the first query cluster from the first cluster corresponding to the first cluster center according to the first distance and the preset first threshold;
  • the second distance calculation module 124 is used to perform vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain the second distance;
  • the vector similarity value calculation module 125 is configured to calculate the vector similarity value according to the second distance.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • An embodiment of the present application further provides a computer device, and the computer device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • the processor and memory can be connected by a bus or other means.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory optionally includes memory located remotely from the processor, which remote memory may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and instructions required to implement the vector similarity calculation method of the above embodiments are stored in the memory, and, when executed by the processor, the vector similarity calculation method of the above embodiments is performed, the method including: obtaining a query vector; calculating a first distance between the query vector and a preset first cluster center; selecting the first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold; performing vector segmentation calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance; and calculating a vector similarity value according to the second distance; for example, performing method steps S110 to S150 in FIG. 2, method steps S141 to S143 in FIG. 3, method steps S710 to S760 in FIG. 9, and so on, described above.
  • An embodiment of the present application further provides a computer-readable storage medium, and the computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions, when executed by a processor or a controller, for example by a processor in the above-mentioned computer device embodiment, can cause the processor to perform the vector similarity calculation method of the above embodiments.
  • The vector similarity calculation method includes: obtaining a query vector; calculating the first distance between the query vector and the preset first cluster center; selecting the first query cluster from the first cluster corresponding to the first cluster center according to the first distance and the preset first threshold; performing vector segmentation calculation in the first query cluster according to multiple query sub-segment vectors of the query vector to obtain the second distance; and calculating the vector similarity value according to the second distance; for example, executing method steps S110 to S150 in FIG. 2, method steps S141 to S143 in FIG. 3, method steps S710 to S760 in FIG. 9, and so on, described above.
  • the disclosed devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the units is only a logical functional division; in actual implementation there may be other division methods.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include media that can store programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A vector similarity calculation method, apparatus, device and storage medium, belonging to the technical field of machine learning. The method comprises: acquiring a query vector (S110); calculating a first distance between the query vector and preset first cluster centers (S120); selecting a first query cluster from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold (S130); performing segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance (S140); and calculating a vector similarity value according to the second distance (S150). The method first filters out the first query cluster, which avoids searching the entire database and reduces the amount of query data to a certain extent; segmented vector calculation is then performed in the first query cluster, and the segmented lookup simplifies the calculation, lowering computational complexity and cost, so that similarity between vectors can be computed efficiently, making the method suitable for complex business scenarios with large data volumes or high dimensionality.

Description

Vector similarity calculation method, apparatus, device and storage medium
This application claims priority to the Chinese patent application with application number 202111536035.3, entitled "Vector similarity calculation method, apparatus, device and storage medium", filed with the Chinese Patent Office on December 15, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of machine learning, and in particular to a vector similarity calculation method, apparatus, device and storage medium.
Background
With the development of the Internet era, Internet information data is growing at an extremely fast pace. Owing to the development of big data and the application of artificial intelligence, there is an urgent need to discover data patterns in large data sets; similar to data mining techniques, data sets are analyzed in different ways, for example by performing fast and accurate vector similarity search in a database, where the purpose of the similarity search is to quickly identify vectors in the data set that are similar to a specific query vector.
Technical Problem
The following are technical problems in the prior art recognized by the inventors:
When similarity between vectors is computed in a database, the computational complexity increases linearly with the amount of target data. For example, if the database contains 10 million vectors and the vector most similar to a query vector must be found, the time complexity is O(N): the query vector has to be compared with the 10 million database vectors one by one, i.e., 10 million calculations. If the database then grows by another 10 million entries, 20 million calculations are required, and the computational cost is very high, which is unfavorable for industrial-grade applications. In the related art, vector similarity calculation systems mainly use neural network algorithms, which have high calculation accuracy but poor real-time performance; non-neural-network algorithms, on the other hand, suffer from low calculation accuracy and high computational complexity, and are difficult to use in complex business scenarios with large data volumes and high dimensionality.
Technical Solution
In a first aspect, an embodiment of the present application proposes a vector similarity calculation method, comprising:
acquiring a query vector;
calculating a first distance between the query vector and a preset first cluster center;
selecting a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold;
performing segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance;
calculating a vector similarity value according to the second distance.
In a second aspect, an embodiment of the present application proposes a vector similarity calculation apparatus, comprising:
a query vector acquisition module, configured to acquire a query vector;
a first distance calculation module, configured to calculate a first distance between the query vector and a preset first cluster center;
a first query cluster acquisition module, configured to select a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold;
a second distance calculation module, configured to perform segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance;
a vector similarity value calculation module, configured to calculate a vector similarity value according to the second distance.
In a third aspect, an embodiment of the present application proposes a computer device, comprising a processor and a memory:
the memory is configured to store a program;
the processor is configured to execute, according to the program, a vector similarity calculation method comprising: acquiring a query vector; calculating a first distance between the query vector and a preset first cluster center; selecting a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold; performing segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance; and calculating a vector similarity value according to the second distance.
In a fourth aspect, an embodiment of the present application proposes a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute a vector similarity calculation method comprising: acquiring a query vector; calculating a first distance between the query vector and a preset first cluster center; selecting a first query cluster from the first cluster corresponding to the first cluster center according to the first distance and a preset first threshold; performing segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance; and calculating a vector similarity value according to the second distance.
Beneficial Effects
The vector similarity calculation method, apparatus, device and storage medium proposed in the embodiments of the present application can filter out a first query cluster, which avoids searching the entire database and reduces the amount of query data to a certain extent; segmented vector calculation is then performed in the first query cluster, and the segmented lookup simplifies the calculation, which lowers computational complexity, reduces calculation cost, and improves the efficiency of vector similarity calculation.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification; together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not constitute a limitation thereof.
FIG. 1 is a schematic diagram of an exemplary system architecture provided by an embodiment of the present application;
FIG. 2 is a flowchart of a vector similarity calculation method provided by an embodiment of the present application;
FIG. 3 is another flowchart of the vector similarity calculation method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of segmentation processing provided by an embodiment of the present application;
FIG. 5 is another flowchart of the vector similarity calculation method provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of the second clustering provided by an embodiment of the present application;
FIG. 7 is another flowchart of the vector similarity calculation method provided by an embodiment of the present application;
FIG. 8 is another flowchart of the vector similarity calculation method provided by an embodiment of the present application;
FIG. 9 is another flowchart of the vector similarity calculation method provided by an embodiment of the present application;
FIG. 10 is another flowchart of the vector similarity calculation method provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of vector similarity calculation provided by an embodiment of the present application;
FIG. 12 is a structural block diagram of a vector similarity calculation apparatus provided by an embodiment of the present application.
Embodiments of the Invention
To make the objectives, technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
It should be noted that, although functional modules are divided in the apparatus diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the apparatus, or in an order different from that in the flowcharts. The terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of the present application and are not intended to limit the present application.
First, several terms involved in the present application are explained:
K-means clustering algorithm: a clustering algorithm based on Euclidean distance, which assumes that the smaller the Euclidean distance between two vectors, the greater their similarity. In the K-means clustering algorithm, a clustering parameter k is first determined, and then all database vectors in the query database are divided into k clusters so that the resulting clusters satisfy the following conditions: database vectors within the same cluster have high similarity, while database vectors in different clusters have low similarity.
Cluster: a set of samples generated by clustering; samples within the same cluster are similar to each other and differ from samples in other clusters.
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly covers computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
With the development of big data and the application of artificial intelligence, the volume of Internet data is growing at an extremely fast pace, and data patterns in large data sets need to be discovered for targeted applications. Similar to data mining techniques, data sets are analyzed in different ways, for example by performing fast and accurate vector similarity search in an n-dimensional space; the purpose of the similarity search is to quickly identify vectors in the data set that are similar to a specific query item. The common vector similarity calculation belongs to vector similarity search and has become an important part of current network technology applications, such as recommending content of potential interest while a user browses news, or recommending products the user is inclined to buy while browsing goods.
However, when similarity between vectors is computed in a database, the computational complexity increases linearly with the amount of target data. For example, if there are 10 million vectors in the database and the vector most similar to the query vector must be found, the time complexity is O(N): the query vector has to be compared with the 10 million database vectors one by one, i.e., 10 million calculations. If the database then grows by another 10 million entries, 20 million calculations are required, and the computational cost is very high, which is unfavorable for industrial-grade applications. In the related art, vector similarity calculation systems mainly use neural network algorithms, which have high accuracy but poor real-time performance; non-neural-network algorithms suffer from low accuracy and high computational complexity and are difficult to use in complex business scenarios with large data volumes and high dimensionality.
Therefore, the vector similarity calculation method provided by the embodiments of the present application acquires a query vector, calculates a first distance between the query vector and preset first cluster centers, selects a first query cluster from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold, performs segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance, and calculates a vector similarity value according to the second distance. This embodiment first filters out the first query cluster, which avoids searching the entire database and reduces the amount of query data to a certain extent; segmented vector calculation is then performed in the first query cluster, and the segmented lookup simplifies the calculation, lowering computational complexity, reducing calculation cost, and improving the efficiency of vector similarity calculation.
The embodiments of the present application are further described below with reference to the accompanying drawings.
It can be understood that the vector similarity calculation method provided by the embodiments of the present application may be implemented by various electronic devices with computing and processing capabilities, for example various types of user terminals such as notebook computers, tablet computers, desktop computers, set-top boxes and mobile devices (for example, mobile phones, personal digital assistants, dedicated messaging devices and portable game devices), and may also be implemented by a server.
It should be noted that the above server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, and big data and artificial intelligence platforms; the embodiments of the present application are not limited in this respect.
To facilitate understanding of the technical solution provided by the embodiments of the present application, the application scenarios to which the vector similarity calculation method is applicable are introduced below, taking the application of the method to a server as an example.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an application scenario of the vector similarity calculation method provided by an embodiment of the present application.
As shown in FIG. 1, the vector similarity calculation method provided by the embodiment of the present application is applied to a system framework 100, where the system framework 100 may include a database 101, a network 102 and a server 103. The network 102 serves as the medium providing a communication link between the database 101 and the server 103. The network 102 may include various connection types, such as wired communication links, wireless communication links, and so on.
In one embodiment of the present application, the server 103 obtains a query vector from the database 101, calculates a first distance between the query vector and preset first cluster centers, selects a first query cluster from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold, performs segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance, and calculates a vector similarity value according to the second distance. The first query cluster is filtered out first, which avoids searching the entire database and reduces the amount of query data to a certain extent; segmented vector calculation is then performed in the first query cluster, and the segmented lookup simplifies the calculation, lowering computational complexity and cost, so that similarity between vectors can be computed efficiently, which is suitable for complex business scenarios with large data volumes or high dimensionality.
It should be noted that the vector similarity calculation method provided by the embodiments of the present application is generally executed by the server 103, and correspondingly, the vector similarity calculation apparatus is generally disposed in the server 103. However, in other embodiments of the present application, a terminal device may also have functions similar to those of the server, so as to execute the vector similarity calculation solution provided by the embodiments of the present application.
The system architecture and application scenarios described in the embodiments of the present application are intended to more clearly illustrate the technical solutions of the embodiments of the present application and do not constitute a limitation thereof. Those skilled in the art will appreciate that, with the evolution of system architectures and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems. Those skilled in the art will understand that the system architecture shown in FIG. 1 does not constitute a limitation on the embodiments of the present application and may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
Based on the above system architecture, the embodiments of the vector similarity calculation method of the present application are described below.
As shown in FIG. 2, FIG. 2 is a flowchart of a vector similarity calculation method provided by an embodiment of the present application, which includes, but is not limited to, steps S110 to S150.
Step S110: acquire a query vector.
In an information recommendation or information search scenario, the query vector may be a search keyword; specifically, the search keyword may be represented in the form of a query vector, for example as an N-dimensional query vector.
Step S120: calculate a first distance between the query vector and preset first cluster centers.
In one embodiment, in order to reduce the amount of query data to a certain extent and avoid searching the entire database, before step S110 or step S120 is performed, the vector similarity calculation method further includes the following step:
obtaining a preset query database, and performing first clustering on the database vectors of the query database to obtain first clusters, where each first cluster includes one first cluster center.
In one embodiment, the first clustering is clustering of the database vectors in the query database using the K-means clustering algorithm.
The Euclidean distance can be interpreted as the length of the line segment connecting two points in Euclidean space. The Euclidean distance between an N-dimensional vector a and an N-dimensional vector b is expressed as follows:
the N-dimensional vector a is expressed as $a = (x_{11}, x_{12}, \ldots, x_{1N})$;
the N-dimensional vector b is expressed as $b = (x_{21}, x_{22}, \ldots, x_{2N})$;
the Euclidean distance is expressed as $d(a, b) = \sqrt{\sum_{i=1}^{N} (x_{1i} - x_{2i})^{2}}$.
In one embodiment, the clustering steps of the first clustering are described as follows:
1) select the clustering parameter k, giving k initial cluster centers, denoted $a = a_1, a_2, \ldots, a_k$;
2) for each database vector $x_i$ in the query database, with $1 \le i \le ss$, where ss is the number of database vectors in the query database, x denotes a database vector, i denotes its index, and $x_i$ denotes the i-th database vector, calculate its Euclidean distance to the k initial cluster centers and assign it to the class corresponding to the nearest cluster center;
3) for each initial cluster center $a_j$, with $1 \le j \le k$, where j denotes the index of the initial cluster center and $a_j$ denotes the j-th initial cluster center, recompute its cluster center as $a_j = \frac{1}{|c_j|} \sum_{x \in c_j} x$ (i.e., the cluster center is the centroid of all database vectors x of that class, where $c_j$ denotes the set of database vectors currently assigned to center $a_j$);
4) repeat steps 2) and 3) above until a preset termination condition is reached, at which point the iteration ends; otherwise, continue iterating. The termination condition may be a number of iterations, a minimum error change, or the like, and may be set according to specific requirements, which is not limited here.
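A compact NumPy rendering of steps 1)-4) above; the data, the value of k and the iteration cap are illustrative assumptions:

```python
import numpy as np

def kmeans(database_vectors, k, n_iters=20, seed=0):
    """Minimal K-means following steps 1)-4): pick k initial centers, assign each vector to its
    nearest center by Euclidean distance, recompute each center as the centroid of its class,
    and repeat until the termination condition (here simply a fixed number of iterations)."""
    rng = np.random.default_rng(seed)
    centers = database_vectors[rng.choice(len(database_vectors), size=k, replace=False)]
    labels = np.zeros(len(database_vectors), dtype=np.int64)
    for _ in range(n_iters):
        # Step 2): Euclidean distance from every vector to every center, then nearest-center assignment.
        distances = np.linalg.norm(database_vectors[:, None, :] - centers[None, :, :], axis=-1)
        labels = np.argmin(distances, axis=1)
        # Step 3): each center becomes the centroid of the vectors currently assigned to it.
        for j in range(k):
            members = database_vectors[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

data = np.random.default_rng(1).random((1000, 16))
centers, labels = kmeans(data, k=8)
print(centers.shape, labels.shape)  # (8, 16) (1000,)
```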
In one embodiment, suppose the query database contains 10 million database vectors and the clustering parameter k is set to 1024. The K-means clustering algorithm is used for the first clustering to obtain 1024 first clusters, each containing one cluster center, i.e., 1024 first cluster centers in total. The 10 million entries are then divided among these 1024 clusters; the division criterion is not an even split, but rather the distances from the 10 million entries to the 1024 clusters are calculated and each entry is assigned to the nearest cluster, following the idea of clustering.
In this embodiment, the above step S120 is: respectively calculate the distances between the N-dimensional query vector and the 1024 first cluster centers to obtain the corresponding 1024 first distances.
Step S130: select a first query cluster from the first clusters corresponding to the first cluster centers according to the first distance and a preset first threshold.
In one embodiment, the first cluster closest to the query vector is selected as the first query cluster according to the preset first threshold, and the vector similarity calculation for the query vector is then performed within the first query cluster, which avoids searching the entire database and reduces the amount of query data to a certain extent.
In one embodiment, first clusters whose first distance is smaller than the preset first threshold are selected from the first clusters as the first query cluster; the first query cluster includes one or more first clusters.
Step S140: perform segmented vector calculation in the first query cluster according to a plurality of query sub-segment vectors of the query vector to obtain a second distance.
In one embodiment, segmented vector calculation is performed in the first query cluster; the segmented lookup simplifies the calculation, lowers computational complexity, and reduces calculation cost, so that similarity between vectors can be computed efficiently, which is suitable for complex business scenarios with large data volumes or high dimensionality.
In one embodiment, referring to FIG. 3, step S140 includes, but is not limited to, steps S141 to S144:
Step S141: obtain the pre-segmentation number, for example the pre-segmentation number is set to M.
Step S142: segment the query vector according to the pre-segmentation number to obtain query sub-segment vectors; the number of query sub-segment vectors is equal to the pre-segmentation number.
In one embodiment, the N-dimensional query vector is segmented according to the pre-segmentation number M to obtain M query sub-segment vectors. M identifies how many query sub-segment vectors the query vector is divided into, so M must divide the dimensionality of the query vector evenly. Referring to FIG. 4, which is a schematic diagram of the segmentation processing with M = 4, when the query vector is 1*128-dimensional it is divided into 4 segments, yielding 4 query sub-segment vectors, each of which is 32-dimensional.
Step S143: respectively calculate the second distance between each query sub-segment vector and the preset second cluster centers.
In one embodiment, the second cluster center is the cluster center of a second cluster. Before step S143 is performed, the vector similarity calculation method further includes:
obtaining the preset second cluster centers, which, referring to FIG. 5, may specifically include steps S161 to S163.
Step S161: obtain a preset query database.
Step S162: pre-segment the database vectors of the query database according to the pre-segmentation number to obtain a plurality of sub-segment vectors corresponding to each database vector.
In one embodiment, referring to FIG. 6, which is a schematic diagram of performing the second clustering on the database vectors in the query database, suppose there are N database vectors, each of which is 1*128-dimensional, and the pre-segmentation number is 4; pre-segmentation then yields N*4 32-dimensional sub-segment vectors, with 4 sub-segment vectors corresponding to each database vector.
Step S163: perform second clustering on each sub-segment vector of the database vectors to obtain a plurality of second clusters corresponding to each sub-segment vector, where each second cluster includes one second cluster center and the second cluster center is the cluster center of the second cluster.
The specific procedure is: cluster the f-th sub-segment of the database vectors separately to obtain a plurality of second clusters corresponding to each sub-segment vector, each second cluster including one second cluster center, where 1 ≤ f ≤ M, M is the pre-segmentation number, and f is an integer.
Referring to FIG. 6, for example, to cluster the first sub-segment, the first sub-segment vectors of all database vectors are taken out and clustered to obtain a plurality of second clusters corresponding to the first sub-segment; for example, the K-means clustering algorithm is used to cluster the database vectors in the query database to obtain a plurality of second clusters, and each sub-segment corresponds to a group of second clusters. In one embodiment, the number of second clusters may be related to the number of first clusters, e.g., the number of second clusters = the number of first clusters / the pre-segmentation number, but this is not a limitation. Referring to FIG. 6, the 4 sub-segments are clustered separately, giving 4 groups of 256 second clusters (256 being an example), and each second cluster includes one second cluster center.
It can be understood that the number of second clusters here is merely illustrative and is not limited.
Step S164: use the second cluster center of each second cluster as a preset second cluster center.
In one embodiment, after step S164 the method further includes:
Step S165: map the database vectors in the query database to obtain cluster identification vectors.
In one embodiment, the database vectors in the query database are mapped in advance according to the second cluster centers, and each database vector is represented as a cluster identification vector, where the cluster identification vector represents the original database vector as a sequence of numbers whose length equals the pre-segmentation number.
Referring to FIG. 7, step S165 includes, but is not limited to, the following steps:
Step S1651: calculate the third distance between each sub-segment vector of a database vector and its corresponding plurality of second cluster centers.
In one embodiment, for each database vector, the third distances between each of its sub-segment vectors and the plurality of second cluster centers of the corresponding segment are calculated respectively.
Step S1652: obtain the second cluster center identifier corresponding to each sub-segment vector according to the third distance and a preset second threshold.
In one embodiment, the smallest of the plurality of third distances of each sub-segment vector is selected according to the preset second threshold, i.e., the second cluster center closest to each sub-segment vector is selected according to the preset second threshold, and the second cluster center identifier of that closest second cluster center (i.e., the ID of the second cluster center, which may be expressed in decimal and is used to distinguish second cluster centers) is associated with the sub-segment vector.
Step S1653: map the database vector to a cluster identification vector according to the second cluster center identifiers corresponding to the database vector.
In one embodiment, the second cluster center identifier of the closest second cluster center corresponding to each sub-segment vector of the database vector is obtained by the above steps, and the database vector is mapped to a cluster identification vector composed of the second cluster center identifiers. Referring to FIG. 6, the database vectors are mapped and expressed as cluster identification vectors; that is, the N 128-dimensional database vectors are each mapped, according to the pre-segmentation number of 4, to a 4-dimensional cluster identification vector, the total dimensionality of all cluster identification vectors is N*4, and each value of a cluster identification vector represents the corresponding second cluster center identifier.
步骤S150,根据第二距离计算向量相似度值。
在一实施例中,参考图8,步骤S150,包括但不限于以下步骤:
步骤S151,计算每个查询子段向量与簇标识向量对应的第二簇心之间的第二距离。
在一实施例中,选取簇标识向量进行查询,获取每个查询子段向量与簇标识向量对应子段的第二簇心标识对应的第二簇心之间的第二距离。
在一实施例中,第二距离可以预先计算,例如预先存储查询向量的每个查询子段向量与多组第二簇心的第二距离,在选取簇标识向量进行查询时,无需计算可以直接读取相关的数值,大大降低计算时间,提高查询效率。
步骤S152,根据第二距离计算得到查询距离。
在一实施例中,例如预分段段数为M,则计算得到M个查询子段向量对应的第二距离,将M个第二距离相加之后得到查询距离。
步骤S153,根据查询距离与预设第三阈值,计算得到向量相似度值。
在一实施例中,根据预设第三阈值,选取最小的查询距离对应的数据库向量作为与查询向量最相似的向量,计算得到两个向量之间向量相似度值。
在一实施例中,向量相似度值为余弦相似度值,具体是向量空间中两个向量夹角的余弦值,即利用余弦相似度计算方法计算得到向量相似度值,衡量两个向量之间的差异大小。
在一实施例中,若第一查询簇的数量大于1,利用上述步骤,根据查询向量的多个查询子段向量在每个第一查询簇中进行向量分段计算,得到对应每个第一查询簇的第二距离,然后根据每个第一查询簇对应的第二距离计算对应每个第一查询簇的向量相似度值,对每个第一查询簇中得到的向量相似度值进行第一排序,得到第一排序结果,根据第一排序结果选取向量相似度值,即根据向量相似度值,在多个第一查询簇中寻找与查询向量最相似的数据库向量,这样能够提高向量相似度计算准确性。
在一实施例中,为了提高查询效率,利用并行计算的方式进行查询。参考图9,本实施例并行计算的过程包括但不限于以下步骤:
步骤S710,根据预设聚合数对预设的第一簇心对应的第一簇进行聚合处理,得到聚合簇,聚合簇存储在不同的存储空间中。
在一实施例中，例如经过第一聚类之后得到1024个第一簇，预设聚合数的取值可以是[1,1024]中的整数，根据预设聚合数对这1024个第一簇分别进行聚合。
对于每个预设聚合数,进行聚合之后能得到较多的组合方式。
例如预设聚合数为1时,则1024个第一簇对应1024个聚合簇。
例如预设聚合数为2时,则可以将第一簇按照2个为一组的聚合方式进行聚合得到聚合簇,例如聚合簇1:第一簇1-第一簇2;聚合簇2:第一簇1-第一簇3;聚合簇3:第一簇1-第一簇4等。
例如预设聚合数为3时,则可以将第一簇按照3个为一组的聚合方式进行聚合得到聚合簇,例如聚合簇1:第一簇1-第一簇2-第一簇3;聚合簇2:第一簇1-第一簇3-第一簇4;聚合簇3:第一簇1-第一簇2-第一簇5等。
例如预设聚合数为1024时,则得到一个包含1024个第一簇的总体组合。
在一实施例中，将不同的聚合簇存储在不同的存储空间上，例如将聚合簇1存储在存储空间1上，将聚合簇2存储在存储空间2上，将聚合簇3存储在存储空间3上，以此类推，为每个不同的聚合簇划分不同的存储空间。可以理解的是，不同的存储空间可以是物理上分离的存储介质，也可以是同一存储介质上的不同分区，只要能实现并行计算即可，在此不做限定。
在一实施例中,在每个聚合簇中并行计算查询向量的向量相似度值的过程如下所述。
步骤S720,在每个聚合簇中计算查询向量与预设的第一簇心之间的第一距离。
步骤S730,在每个聚合簇中,根据第一距离和预设第一阈值,从对应第一簇心的第一簇中选取第一查询簇。
步骤S740,根据查询向量的多个查询子段向量在第一查询簇中进行向量分段计算,得到每个聚合簇对应的第二距离。
步骤S750,根据第二距离计算每个聚合簇对应的向量相似度值。
在一实施例中,由于不同的聚合簇位于不同的存储空间,因此并行的在不同的存储空间运行上述步骤进行向量相似度值计算,每个聚合簇中均能得到一个或多个查询向量的向量相似度值。
步骤S760,对每个聚合簇中的向量相似度值进行第二排序,得到第二排序结果,根据第二排序结果选取向量相似度值。
在一实施例中,对每个聚合簇中得到的向量相似度值进行第二排序,根据第二排序结果选取最接近的向量相似度值,即根据向量相似度值,在多个聚合簇中寻找与查询向量最相似的数据库向量,进一步提高向量相似度计算准确性。
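并行计算部分可以用多进程把不同聚合簇（分别存放在不同存储空间中）的查询任务同时执行，下面是一个基于Python标准库concurrent.futures的极简框架示意，其中search_in_aggregated_cluster为假设的函数名，函数体仅作占位：

```python
from concurrent.futures import ProcessPoolExecutor

def search_in_aggregated_cluster(aggregated_cluster):
    """在一个聚合簇中执行步骤S720至步骤S750，返回该聚合簇中得到的向量相似度值列表。
    此处仅为占位实现，实际逻辑即前文描述的第一距离计算、第一查询簇选取与向量分段计算。"""
    return []

# 假设aggregated_clusters是按预设聚合数得到的聚合簇列表，分别存放在不同的存储空间中
aggregated_clusters = []                             # 示意占位
with ProcessPoolExecutor() as pool:
    results = list(pool.map(search_in_aggregated_cluster, aggregated_clusters))

# 步骤S760：对每个聚合簇得到的向量相似度值进行第二排序（由大到小），便于选取最终的向量相似度值
all_similarities = sorted((s for r in results for s in r), reverse=True)
```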
另外,在一实施例中,参照图10,计算向量相似度的具体流程如下面步骤S810至步骤S830所述。
步骤S810,选取第一查询簇。
假设有1000万条数据库向量,预先进行第一聚类得到1024个第一簇,然后计算查询向量和1024个第一簇心的第一距离,根据第一距离和预设第一阈值,在第一簇中选取得到第一查询簇,后续在第一查询簇中进行向量分段计算。
步骤S820,预先将数据库向量映射成簇标识向量,结合图6,将数据库向量映射成簇标识向量的过程描述为:
假设有N个数据库向量，每个数据库向量的维数都是1*128，预分段段数是4，则将其进行预分段得到N*4个32维的子段向量，即每个数据库向量对应4个子段向量。
使用K-means聚类算法对查询数据库中数据库向量的4个子段分别进行第二聚类，得到4组第二簇，每组256个，每个第二簇包括一个第二簇心，共有4*256个第二簇心。
然后将数据库向量进行映射表示为簇标识向量,即将N个128维数据库向量,根据其预分段段数4,分别将其映射表示为4维的簇标识向量,簇标识向量的每个值均表示对应的第二簇心标识。
步骤S830,计算向量相似度,参照图11,为向量相似度计算示意图。
假设查询向量是1*128维向量,按照预分段段数4将其分为4个查询子段向量,分别计算每个查询子段向量与对应分段数的第二簇心之间的第二距离,得到一个4*256的第二距离矩阵,其中,第二距离矩阵可以预先计算得到。
根据第二距离计算得到查询距离。例如某一数据库向量被映射成[124,56,132,222]，当需要计算查询向量针对该簇标识向量的查询距离时，通过读取上述第二距离矩阵，得到查询向量的第一段查询子段向量与第二簇心标识为124的第二簇心之间的距离，依次读取得到四个对应的第二距离，分别记为d1、d2、d3和d4，对其进行累加求和，得到查询距离表示为：d=d1+d2+d3+d4。由于预先存储查询向量的每个查询子段向量与多组第二簇心的第二距离，在选取簇标识向量进行查询时无需计算，只需要查询M次第二距离矩阵，就可以直接读取相关的第二距离的数值，再以累加求和的方式得到查询向量针对每个簇标识向量的查询距离，大大降低计算时间，提高查询效率。
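上述“先构造第二距离矩阵、再按簇标识向量查表累加”的过程，可以在前文示意代码的基础上表示如下（簇标识[124,56,132,222]沿用上文示例）：

```python
# 第二距离矩阵：每个查询子段向量到对应子段全部第二簇心的第二距离，形状为(M, cc)，可预先计算
second_distance_table = np.stack([
    np.linalg.norm(second_centroids[f] - query_segments[f], axis=1)
    for f in range(M)
])

# 对单个簇标识向量只需查表M次并累加，即得到查询距离 d = d1 + d2 + d3 + d4
code = np.array([124, 56, 132, 222])
query_distance = second_distance_table[np.arange(M), code].sum()

# 只在第一查询簇内的数据库向量上批量查表，选取查询距离最小者作为最相似的数据库向量
in_query_cluster = np.isin(first_assignments, query_cluster_ids)
candidate_codes = cluster_id_vectors[in_query_cluster]
candidate_distances = second_distance_table[np.arange(M)[None, :], candidate_codes].sum(axis=1)
best_index = int(np.where(in_query_cluster)[0][np.argmin(candidate_distances)])
```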
相关技术中,若查询数据库中存在N条数据库向量,在查询数据库中计算查询向量的最优向量相似度值,其时间复杂度为O(N)。
本申请实施例的计算时间复杂度表示为:O(C+M*CC+M),其中C是第一簇的个数,M是预分段段数,CC是第二簇的个数。
第一步:计算查询向量与多个第一簇的第一距离,因此该步骤的时间复杂度是O(C);
第二步：查询向量被分为M个查询子段向量，在第一查询簇中，计算查询向量的每个查询子段向量与对应分段的CC个第二簇心的第二距离，因此该步骤的时间复杂度是O(M*CC)；
第三步：根据第二距离计算得到查询距离，需要查询M次第二距离矩阵，因此该步骤的时间复杂度是O(M)。
综上，本申请实施例的计算时间复杂度为：O(C+M*CC+M)。
在一实施例中，例如，查询数据库存在1000万条数据库向量，相关技术中，需要计算1000万次，本实施例中，需要计算1024+4*256+4=2052次(假设第一簇的个数为1024个，预分段段数为4，第二簇的个数为256个)。假设又新增了1000万数据，向量维度不变，此时相关技术中，需要计算2000万次；本实施例需要计算2048+4*512+4=4100次(假设第一簇的个数为2048个，预分段段数为4，第二簇的个数为512个)。一般情况下，第一簇的个数和第二簇的个数会随着数据增加适当增加，但增长量不大。很显然，随着向量个数N的增加，相关技术的计算时间复杂度会线性增加，与本申请实施例相比会越来越耗时。
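上述计算次数可以用一个简单的函数进行验证（参数取值与上文示例一致，仅作示意）：

```python
def query_cost(C: int, M: int, CC: int) -> int:
    """本申请实施例的近似计算次数：C + M*CC + M。"""
    return C + M * CC + M

print(query_cost(1024, 4, 256))   # 2052，对应1000万条数据的示例
print(query_cost(2048, 4, 512))   # 4100，对应2000万条数据的示例
```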
本申请实施例提供的一种向量相似度计算方法,获取查询向量,计算查询向量与预设的第一簇心之间的第一距离,根据第一距离和预设第一阈值,从对应第一簇心的第一簇中选取第一查询簇,根据查询向量的多个查询子段向量在第一查询簇中进行向量分段计算,得到第二距离,根据第二距离,计算得到向量相似度值。本实施例首先筛选得到第一查询簇,避免在整个数据库中进行查询,在一定程度上减少查询数据量,然后在第一查询簇中进行向量分段计算,通过分段查询简化计算的过程,降低计算复杂程度,减少计算成本,能够高效的计算向量之间的相似度,适用于数据量大或者维度高的复杂业务场景。
另外,本申请实施例的一个实施例还提供了一种向量相似度计算装置,参照图12,装置包括:
查询向量获取模块121,用于获取查询向量;
第一距离计算模块122,用于计算查询向量与预设的第一簇心之间的第一距离;
第一查询簇获取模块123,用于根据第一距离和预设第一阈值,从对应第一簇心的第一簇中选取第一查询簇;
第二距离计算模块124,用于根据查询向量的多个查询子段向量在第一查询簇中进行向量分段计算,得到第二距离;
向量相似度值计算模块125,用于根据第二距离计算向量相似度值。
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
需要说明的是,本实施例的向量相似度计算装置的具体实施方式与上述向量相似度计算方法的具体实施方式基本一致,在此不再赘述。
另外,本申请实施例的一个实施例还提供了计算机设备,计算机设备包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序。
处理器和存储器可以通过总线或者其他方式连接。
存储器作为一种非暂态计算机可读存储介质，可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外，存储器可以包括高速随机存取存储器，还可以包括非暂态存储器，例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中，存储器可选包括相对于处理器远程设置的存储器，这些远程存储器可以通过网络连接至该处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
实现上述实施例的向量相似度计算方法所需的非暂态软件程序以及指令存储在存储器中,当被处理器执行时,执行上述实施例中的向量相似度计算方法,该向量相似度计算方法包括:获取查询向量;计算查询向量与预设的第一簇心之间的第一距离;根据第一距离和预设第一阈值,从对应第一簇心的第一簇中选取第一查询簇;根据查询向量的多个查询子段向量在第一查询簇中进行向量分段计算,得到第二距离;根据第二距离计算向量相似度值;例如,执行以上描述的图2中的方法步骤S110至步骤S150、图3中的方法步骤S141至步骤S143、图9中的方法步骤S710至步骤S760等。
此外,本申请实施例的一个实施例还提供了一种计算机可读存储介质,计算机可读存储介质可以是非易失性,也可以是易失性。该计算机可读存储介质存储有计算机可执行指令,该计算机可执行指令被一个处理器或控制器执行,例如,被上述计算机设备实施例中的一个处理器执行,可使得上述处理器执行上述实施例中的向量相似度计算方法,该向量相似度计算方法包括:获取查询向量;计算查询向量与预设的第一簇心之间的第一距离;根据第一距离和预设第一阈值,从对应第一簇心的第一簇中选取第一查询簇;根据查询向量的多个查询子段向量在第一查询簇中进行向量分段计算,得到第二距离;根据第二距离计算向量相似度值;例如,执行以上描述的图2中的方法步骤S110至步骤S150、图3中的方法步骤S141至步骤S143、图9中的方法步骤S710至步骤S760等。
又如,被上述计算机设备实施例中的一个处理器执行,可使得上述处理器执行上述实施例中的向量相似度计算方法,例如,执行以上描述的图2中的方法步骤S110至步骤S150、图3中的方法步骤S141至步骤S143、图9中的方法步骤S710至步骤S760等。
存储器作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外,存储器可以包括高速随机存取存储器,还可以包括非暂态存储器,例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中,存储器可选包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至该处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。
本公开实施例描述的实施例是为了更加清楚的说明本公开实施例的技术方案,并不构成对于本公开实施例提供的技术方案的限定,本领域技术人员可知,随着技术的演变和新应用场景的出现,本公开实施例提供的技术方案对于类似的技术问题,同样适用。
本领域技术人员可以理解的是,图中示出的技术方案并不构成对本公开实施例的限定,可以包括比图示更多或更少的步骤,或者组合某些步骤,或者不同的步骤。
以上所描述的装置实施例仅仅是示意性的,其中作为分离部件说明的单元可以是或者也可以不是物理上分开的,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
本领域普通技术人员可以理解,上文中所公开方法中的全部或某些步骤、系统、设备中的功能模块/单元可以被实施为软件、固件、硬件及其适当的组合。
本申请的说明书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换，以便这里描述的本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施。此外，术语“包括”和“具有”以及他们的任何变形，意图在于覆盖不排他的包含，例如，包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元，而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(Read-Only Memory，简称ROM)、随机存取存储器(Random Access Memory，简称RAM)、磁碟或者光盘等各种可以存储程序的介质。
以上参照附图说明了本公开实施例的优选实施例,并非因此局限本公开实施例的权利范围。本领域技术人员不脱离本公开实施例的范围和实质内所作的任何修改、等同替换和改进,均应在本公开实施例的权利范围之内。

Claims (20)

  1. 一种向量相似度计算方法,其中,包括:
    获取查询向量;
    计算所述查询向量与预设的第一簇心之间的第一距离;
    根据所述第一距离和预设第一阈值,从对应所述第一簇心的第一簇中选取第一查询簇;
    根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离;
    根据所述第二距离计算向量相似度值。
  2. 根据权利要求1所述的向量相似度计算方法,其中,所述获取查询向量之前,所述方法还包括:
    获取预设的查询数据库;
    对所述查询数据库的数据库向量进行第一聚类处理,得到所述第一簇;其中每个所述第一簇包括一个第一簇心。
  3. 根据权利要求1所述的向量相似度计算方法,其中,所述根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离,包括:
    获取预分段段数;
    根据所述预分段段数对所述查询向量进行分段处理,得到查询子段向量,所述查询子段向量的数量与所述预分段段数相等;
    计算每个所述查询子段向量与预设的第二簇心之间的第二距离。
  4. 根据权利要求3所述的向量相似度计算方法,其中,所述计算每个所述查询子段向量与预设的第二簇心之间的第二距离之前,所述方法还包括获取预设的第二簇心,具体包括:
    获取预设的查询数据库;
    对所述查询数据库的数据库向量按照所述预分段段数进行预分段处理,得到每个数据库向量对应的多个子段向量;
    对所述数据库向量的每一个所述子段向量进行第二聚类处理,得到每个子段向量对应的多个第二簇,其中每个所述第二簇包括一个第二簇心,所述第二簇心为第二簇的聚类中心;
    将每个所述第二簇的第二簇心作为所述预设的第二簇心。
  5. 根据权利要求4所述的向量相似度计算方法,其中,所述计算每个所述查询子段向量与预设的第二簇心之间的第二距离,包括:
    对所述查询数据库中数据库向量进行映射得到簇标识向量;
    计算每个所述查询子段向量与所述簇标识向量对应的第二簇心之间的第二距离。
  6. 根据权利要求5所述的向量相似度计算方法,其中,所述对所述查询数据库中数据库向量进行映射得到簇标识向量,包括:
    计算所述数据库向量的每个子段向量与其对应的多个第二簇心之间的第三距离;
    根据所述第三距离与预设第二阈值得到每个所述子段向量对应的第二簇心标识;
    根据所述数据库向量对应的第二簇心标识,将所述数据库向量映射为簇标识向量。
  7. 根据权利要求1所述的向量相似度计算方法,其中,所述根据所述第二距离计算向量相似度值,包括:
    根据所述第二距离计算查询距离;
    根据所述查询距离与预设第三阈值,计算所述向量相似度值。
  8. 根据权利要求1至7任一项所述的向量相似度计算方法,其中,若所述第一查询簇的数量大于1;
    所述根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离,包括:
    根据所述查询向量的多个查询子段向量在每个所述第一查询簇中进行向量分段计算,得到对应每个第一查询簇的第二距离;
    所述根据所述第二距离计算向量相似度值,包括:
    根据每个第一查询簇对应的所述第二距离计算对应每个第一查询簇的向量相似度值;
    对每个第一查询簇中得到的向量相似度值进行第一排序,得到第一排序结果;
    根据所述第一排序结果选取向量相似度值。
  9. 根据权利要求1至7任一项所述的向量相似度计算方法,其中,所述计算所述查询向量与预设的第一簇心之间的第一距离,包括:
    根据预设聚合数对预设的第一簇心对应的第一簇进行聚合处理,得到聚合簇,所述聚合簇存储在不同的存储空间中;
    在每个聚合簇中计算所述查询向量与预设的第一簇心之间的第一距离。
  10. 一种向量相似度计算装置,其中,包括:
    查询向量获取模块,用于获取查询向量;
    第一距离计算模块,用于计算所述查询向量与预设的第一簇心之间的第一距离;
    第一查询簇获取模块,用于根据所述第一距离和预设第一阈值,从对应所述第一簇心的第一簇中选取第一查询簇;
    第二距离计算模块,用于根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离;
    向量相似度值计算模块,用于根据所述第二距离计算向量相似度值。
  11. 一种计算机设备,其中,包括处理器以及存储器;
    所述存储器用于存储程序;
    所述处理器用于根据所述程序执行一种向量相似度计算方法,其中所述向量相似度计算方法包括:
    获取查询向量;
    计算所述查询向量与预设的第一簇心之间的第一距离;
    根据所述第一距离和预设第一阈值,从对应所述第一簇心的第一簇中选取第一查询簇;
    根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离;
    根据所述第二距离计算向量相似度值。
  12. 根据权利要求11所述的计算机设备,其中:所述获取查询向量之前,所述向量相似度计算方法还包括:
    获取预设的查询数据库;
    对所述查询数据库的数据库向量进行第一聚类处理,得到所述第一簇;其中每个所述第一簇包括一个第一簇心。
  13. 根据权利要求11所述的计算机设备,其中,所述根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离,包括:
    获取预分段段数;
    根据所述预分段段数对所述查询向量进行分段处理,得到查询子段向量,所述查询子段向量的数量与所述预分段段数相等;
    计算每个所述查询子段向量与预设的第二簇心之间的第二距离。
  14. 根据权利要求13所述的计算机设备,其中,所述计算每个所述查询子段向量与预设的第二簇心之间的第二距离之前,所述方法还包括获取预设的第二簇心,具体包括:
    获取预设的查询数据库;
    对所述查询数据库的数据库向量按照所述预分段段数进行预分段处理,得到每个数据库向量对应的多个子段向量;
    对所述数据库向量的每一个所述子段向量进行第二聚类处理,得到每个子段向量对应的多个第二簇,其中每个所述第二簇包括一个第二簇心,所述第二簇心为第二簇的聚类中心;
    将每个所述第二簇的第二簇心作为所述预设的第二簇心。
  15. 根据权利要求14所述的计算机设备,其中,所述计算每个所述查询子段向量与预设的第二簇心之间的第二距离,包括:
    对所述查询数据库中数据库向量进行映射得到簇标识向量;
    计算每个所述查询子段向量与所述簇标识向量对应的第二簇心之间的第二距离。
  16. 根据权利要求15所述的计算机设备,其中,所述对所述查询数据库中数据库向量进行映射得到簇标识向量,包括:
    计算所述数据库向量的每个子段向量与其对应的多个第二簇心之间的第三距离;
    根据所述第三距离与预设第二阈值得到每个所述子段向量对应的第二簇心标识;
    根据所述数据库向量对应的第二簇心标识,将所述数据库向量映射为簇标识向量。
  17. 一种计算机可读存储介质,其中,存储有计算机可执行指令,所述计算机可执行指令用于执行一种向量相似度计算方法,所述向量相似度计算方法包括:
    获取查询向量;
    计算所述查询向量与预设的第一簇心之间的第一距离;
    根据所述第一距离和预设第一阈值,从对应所述第一簇心的第一簇中选取第一查询簇;
    根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离;
    根据所述第二距离计算向量相似度值。
  18. 根据权利要求17所述的计算机可读存储介质,其中,所述根据所述第二距离计算向量相似度值,包括:
    根据所述第二距离计算查询距离;
    根据所述查询距离与预设第三阈值,计算所述向量相似度值。
  19. 根据权利要求17所述的计算机可读存储介质,其中,若所述第一查询簇的数量大于1;
    所述根据所述查询向量的多个查询子段向量在所述第一查询簇中进行向量分段计算,得到第二距离,包括:
    根据所述查询向量的多个查询子段向量在每个所述第一查询簇中进行向量分段计算,得到对应每个第一查询簇的第二距离;
    所述根据所述第二距离计算向量相似度值,包括:
    根据每个第一查询簇对应的所述第二距离计算对应每个第一查询簇的向量相似度值;
    对每个第一查询簇中得到的向量相似度值进行第一排序,得到第一排序结果;
    根据所述第一排序结果选取向量相似度值。
  20. 根据权利要求17所述的计算机可读存储介质，其中，所述计算所述查询向量与预设的第一簇心之间的第一距离，包括：
    根据预设聚合数对预设的第一簇心对应的第一簇进行聚合处理,得到聚合簇,所述聚合簇存储在不同的存储空间中;
    在每个聚合簇中计算所述查询向量与预设的第一簇心之间的第一距离。
PCT/CN2022/090758 2021-12-15 2022-04-29 向量相似度计算方法、装置、设备及存储介质 WO2023108995A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111536035.3A CN114238329A (zh) 2021-12-15 2021-12-15 向量相似度计算方法、装置、设备及存储介质
CN202111536035.3 2021-12-15

Publications (1)

Publication Number Publication Date
WO2023108995A1 true WO2023108995A1 (zh) 2023-06-22

Family

ID=80756630

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090758 WO2023108995A1 (zh) 2021-12-15 2022-04-29 向量相似度计算方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN114238329A (zh)
WO (1) WO2023108995A1 (zh)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114238329A (zh) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 向量相似度计算方法、装置、设备及存储介质
CN117807175A (zh) * 2023-12-26 2024-04-02 北京海泰方圆科技股份有限公司 一种数据存储方法、装置、设备及介质


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110874385A (zh) * 2018-08-10 2020-03-10 阿里巴巴集团控股有限公司 数据处理方法、装置和系统
WO2021081913A1 (zh) * 2019-10-31 2021-05-06 北京欧珀通信有限公司 向量查询方法、装置、电子设备及存储介质
CN111859004A (zh) * 2020-07-29 2020-10-30 书行科技(北京)有限公司 检索图像的获取方法、装置、设备及可读存储介质
CN113723115A (zh) * 2021-09-30 2021-11-30 平安科技(深圳)有限公司 基于预训练模型的开放域问答预测方法及相关设备
CN114238329A (zh) * 2021-12-15 2022-03-25 平安科技(深圳)有限公司 向量相似度计算方法、装置、设备及存储介质

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194737A (zh) * 2023-09-14 2023-12-08 上海交通大学 基于距离阈值的近似近邻搜索方法、系统、介质及设备
CN117194737B (zh) * 2023-09-14 2024-06-07 上海交通大学 基于距离阈值的近似近邻搜索方法、系统、介质及设备
CN117156485A (zh) * 2023-10-31 2023-12-01 西安明赋云计算有限公司 链路质量探测方法
CN117156485B (zh) * 2023-10-31 2024-01-09 西安明赋云计算有限公司 链路质量探测方法
CN117235137A (zh) * 2023-11-10 2023-12-15 深圳市一览网络股份有限公司 一种基于向量数据库的职业信息查询方法及装置
CN117235137B (zh) * 2023-11-10 2024-04-02 深圳市一览网络股份有限公司 一种基于向量数据库的职业信息查询方法及装置

Also Published As

Publication number Publication date
CN114238329A (zh) 2022-03-25

Similar Documents

Publication Publication Date Title
WO2023108995A1 (zh) 向量相似度计算方法、装置、设备及存储介质
Li et al. Recent developments of content-based image retrieval (CBIR)
Liu et al. Deep sketch hashing: Fast free-hand sketch-based image retrieval
Hong et al. Coherent semantic-visual indexing for large-scale image retrieval in the cloud
Zhu et al. Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval
Zhu et al. Graph PCA hashing for similarity search
Ashraf et al. Content based image retrieval by using color descriptor and discrete wavelet transform
Tolias et al. Image search with selective match kernels: aggregation across single and multiple images
CN106777038B (zh) 一种基于序列保留哈希的超低复杂度图像检索方法
Liong et al. Cross-modal discrete hashing
Uricchio et al. Fisher encoded convolutional bag-of-windows for efficient image retrieval and social image tagging
Kumar et al. Indian classical dance classification with adaboost multiclass classifier on multifeature fusion
CN110751027B (zh) 一种基于深度多示例学习的行人重识别方法
Wang et al. Duplicate discovery on 2 billion internet images
TW202217597A (zh) 圖像的增量聚類方法、電子設備、電腦儲存介質
WO2023020214A1 (zh) 检索模型的训练和检索方法、装置、设备及介质
Etezadifar et al. Scalable video summarization via sparse dictionary learning and selection simultaneously
Li et al. Sub-selective quantization for learning binary codes in large-scale image search
Mathan Kumar et al. Multiple kernel scale invariant feature transform and cross indexing for image search and retrieval
Liu et al. Adding spatial distribution clue to aggregated vector in image retrieval
Khalaf et al. Robust partitioning and indexing for iris biometric database based on local features
JP6373292B2 (ja) 特徴量生成装置、方法、及びプログラム
Zhu et al. A novel two-stream saliency image fusion CNN architecture for person re-identification
Duan et al. Minimizing reconstruction bias hashing via joint projection learning and quantization
CN110209895B (zh) 向量检索方法、装置和设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22905763

Country of ref document: EP

Kind code of ref document: A1