CN117407727A - Vector similarity determining method and vector searching method


Info

Publication number
CN117407727A
Authority
CN
China
Prior art keywords
vector, dense, similarity, sparse vector, sparse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311607446.6A
Other languages
Chinese (zh)
Other versions
CN117407727B (en)
Inventor
刘熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Transwarp Technology Shanghai Co Ltd
Original Assignee
Transwarp Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Transwarp Technology Shanghai Co Ltd filed Critical Transwarp Technology Shanghai Co Ltd
Priority to CN202311607446.6A priority Critical patent/CN117407727B/en
Publication of CN117407727A publication Critical patent/CN117407727A/en
Application granted granted Critical
Publication of CN117407727B publication Critical patent/CN117407727B/en
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a vector similarity determining method and a vector searching method. The similarity determining method comprises: acquiring a first dense sparse vector and a second dense sparse vector, each obtained by concatenating a dense vector with a sparse vector; calculating a first similarity and a second similarity based on the first and second dense sparse vectors; and determining the similarity of the first and second dense sparse vectors according to the first similarity and the second similarity. This solves the problem that prior similarity calculations over dense and sparse vectors have low accuracy and consequently low search recall. Because the dense and sparse vectors are considered together when calculating similarity, the result is more accurate; the scheme is simple to implement and effectively guarantees data consistency; and the dense sparse vector representation better reflects the real distribution of the data in vector space, which ensures the accuracy of the similarity result and in turn improves the recall rate of vector search.

Description

Vector similarity determining method and vector searching method
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method for determining vector similarity and a method for searching vectors.
Background
To facilitate processing of data such as text and images, the prior art generally vectorizes the data, i.e. represents it by vectors. A vector is typically a one-dimensional array storing values of the same type, such as a feature vector in machine learning, where each value represents a numerical feature of the corresponding dimension. Vectors fall into two kinds, dense and sparse. A dense vector is one in which all or most of the values in the one-dimensional array are non-zero; the feature vectors produced by most machine learning models, such as OpenAI's text-embedding-ada-002 model or the widely used open-source word2vec models, are dense, and their dimension is generally below 2000. A sparse vector is one whose dimension is very high but in which most values are 0; the dimension of a sparse vector is typically in the hundreds of thousands or even millions.
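The contrast drawn above can be made concrete with a small sketch: a dense vector stores every component, while a sparse vector is better kept as (index, value) pairs for its non-zero entries. The specific numbers and dimensions below are illustrative, not taken from the patent.

```python
# Illustrative only: a dense vector stores every component; a sparse
# vector stores just the index -> value pairs of its non-zero entries.
dense = [0.12, -0.53, 0.88, 0.07]          # e.g. a 4-dim embedding, all slots filled

# A hypothetical 1,000,000-dim sparse vector with three non-zero entries,
# kept as index -> value pairs (indices sorted ascending).
sparse_dim = 1_000_000
sparse = {10: 0.1, 4096: 0.7, 731022: 0.25}

# Fraction of non-zero components in each representation:
dense_nonzero = sum(1 for v in dense if v != 0.0)
print(dense_nonzero / len(dense))          # most values are non-zero
print(len(sparse) / sparse_dim)            # almost all values are zero
```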
For joint search over sparse and dense vectors, there are two classical approaches in the industry: 1. Use the sparse vectors for coarse ranking. Taking a top-10 search as an example, compute similarity with the sparse vectors and retrieve 100 pieces of data as a candidate set; then, as fine ranking, compute similarity over the candidate set with the dense vectors and take the 10 most similar pieces of data as the final result. 2. Compute similarity separately with the sparse vectors and the dense vectors, retrieving a top-10 result set for each; compute a reciprocal rank fusion (RRF) score for the data from the two top-10 result sets, and take the 10 pieces of data with the highest RRF score as the final result set.
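The second approach above can be sketched as follows. Each document's RRF score is the sum of 1/(k + rank) over the rankings it appears in; the constant k = 60 and the id lists are assumptions for illustration, not values from the patent.

```python
# A minimal sketch of reciprocal rank fusion (RRF) as in approach 2 above:
# a document's score is the sum of 1/(k + rank) over the two top-10 lists.
# k = 60 is a commonly used constant (an assumption here).
def rrf_fuse(sparse_top, dense_top, k=60, n=10):
    scores = {}
    for ranking in (sparse_top, dense_top):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; keep the top n documents.
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical top-10 id lists from the sparse and dense searches:
sparse_top10 = [3, 7, 1, 9, 4, 8, 2, 6, 5, 0]
dense_top10  = [7, 3, 2, 1, 11, 12, 4, 13, 14, 15]
print(rrf_fuse(sparse_top10, dense_top10))
```

Documents that rank highly in both lists (here 3 and 7) rise to the top of the fused result even though neither list alone determines the final order.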
In the prior art, when similarity is calculated, the sparse vector and the dense vector are treated as two independent pieces of data and scored separately, so the distribution of the two in vector space cannot be considered jointly. The similarity result is therefore inaccurate, which affects the search result and lowers the search recall rate. How to improve the accuracy of vector similarity calculation has become a problem to be solved.
Disclosure of Invention
The invention provides a vector similarity determining method and a vector searching method, which are used for solving the problem of low accuracy in the similarity calculation results of dense and sparse vectors.
According to an aspect of the present invention, there is provided a vector similarity determination method, including:
acquiring a first dense sparse vector and a second dense sparse vector, wherein the first dense sparse vector is obtained by concatenating a first dense vector and a first sparse vector, and the second dense sparse vector is obtained by concatenating a second dense vector and a second sparse vector;
calculating a first similarity and a second similarity based on the first dense sparse vector and the second dense sparse vector;
and determining the similarity of the first dense sparse vector and the second dense sparse vector according to the first similarity and the second similarity.
According to another aspect of the present invention, there is provided a vector search method including:
acquiring a dense sparse vector to be searched;
searching a search graph of a hierarchical navigable small world based on a graph algorithm, reading a dense sparse vector to be matched in the search graph of the hierarchical navigable small world, and calculating the similarity between the dense sparse vector to be searched and the dense sparse vector to be matched, wherein the similarity is calculated according to the vector similarity determining method in any embodiment of the invention;
and determining a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
According to another aspect of the present invention, there is provided a vector similarity determination apparatus including:
the vector acquisition module is used for acquiring a first dense sparse vector and a second dense sparse vector, wherein the first dense sparse vector is obtained by concatenating a first dense vector and a first sparse vector, and the second dense sparse vector is obtained by concatenating a second dense vector and a second sparse vector;
a similarity calculation module for calculating a first similarity and a second similarity based on the first dense sparse vector and the second dense sparse vector;
and the similarity determining module is used for determining the similarity of the first dense sparse vector and the second dense sparse vector according to the first similarity and the second similarity.
According to another aspect of the present invention, there is provided a vector search apparatus including:
the vector acquisition module is used for acquiring dense sparse vectors to be searched;
the vector search module is used for searching the search graph of the layered navigable small world based on a graph algorithm, reading dense sparse vectors to be matched in the search graph of the layered navigable small world, and calculating the similarity between the dense sparse vectors to be searched and the dense sparse vectors to be matched, wherein the similarity is calculated according to the vector similarity determination method in any embodiment of the invention;
and the target vector determining module is used for determining a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the vector similarity determination method of any one of the embodiments of the present invention or the vector search method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the vector similarity determination method according to any of the embodiments of the present invention or the vector search method according to any of the embodiments of the present invention when executed.
According to the technical scheme, a first dense sparse vector and a second dense sparse vector are acquired, the first obtained by concatenating a first dense vector and a first sparse vector, the second by concatenating a second dense vector and a second sparse vector; a first similarity and a second similarity are calculated based on the two dense sparse vectors; and the similarity of the first dense sparse vector and the second dense sparse vector is determined according to the first similarity and the second similarity. This solves the problem of the low accuracy of dense-vector and sparse-vector similarity results: a new kind of vector, the dense sparse vector, is obtained by concatenating a dense vector and a sparse vector, and the final similarity of the first and second dense sparse vectors is determined from their first and second similarities. Compared with the prior art, which calculates similarity from the two separate angles of the dense vector and the sparse vector, considering the dense and sparse vectors at the same time makes the result more accurate; the scheme is simple to implement and effectively guarantees data consistency; and representing data by dense sparse vectors better reflects the real spatial distribution of the data, which ensures the accuracy of the similarity result and improves the recall rate of vector search.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention; for a person skilled in the art, other drawings may be obtained from them without inventive effort.
Fig. 1 is a flowchart of a vector similarity determining method according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating an example of a dense sparse vector provided in accordance with one embodiment of the present invention;
fig. 3 is a flowchart of a vector similarity determining method according to a second embodiment of the present invention;
fig. 4 is a flowchart of a vector search method according to a third embodiment of the present invention;
FIG. 5 is a flow chart of a vector search method according to a fourth embodiment of the present invention;
FIG. 6 is an exemplary diagram of prior art dense vector storage;
FIG. 7 is a diagram illustrating storing of a vector in a pre-allocated memory space according to a fourth embodiment of the present invention;
FIG. 8 is a diagram illustrating storing of vectors in another pre-allocated memory space provided according to a fourth embodiment of the present invention;
FIG. 9 is an exemplary diagram of storing dense sparse vectors to be stored according to a fourth embodiment of the present invention;
FIG. 10 is an exemplary diagram for storing dense sparse vectors to be stored according to a fourth embodiment of the present invention;
fig. 11 is a schematic structural diagram of a vector similarity determining device according to a fifth embodiment of the present invention;
fig. 12 is a schematic structural diagram of a vector search device according to a sixth embodiment of the present invention;
fig. 13 is a schematic structural diagram of an electronic device according to a seventh embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a vector similarity determining method according to an embodiment of the present invention, where the method may be performed by a vector similarity determining device, and the vector similarity determining device may be implemented in hardware and/or software, and the vector similarity determining device may be configured in an electronic device. As shown in fig. 1, the method includes:
S101, acquiring a first dense sparse vector and a second dense sparse vector, wherein the first dense sparse vector is obtained by concatenating a first dense vector and a first sparse vector, and the second dense sparse vector is obtained by concatenating a second dense vector and a second sparse vector.
In this embodiment, a dense sparse vector is understood as a vector obtained by concatenating a dense vector and a sparse vector; the first dense vector is a dense vector, and the first sparse vector is a sparse vector. The first dense sparse vector is obtained by concatenating the first dense vector and the first sparse vector; when concatenating, the first sparse vector may be appended after the first dense vector, or the first dense vector may be appended after the first sparse vector.
In scenarios such as machine learning, the data is vectorized to obtain a corresponding dense vector and sparse vector, and the two are concatenated to obtain the dense sparse vector corresponding to the data. This newly defines a vector form in which the data is represented by its dense vector and sparse vector together.
When the similarity of vectors needs to be calculated, the first dense sparse vector and the second dense sparse vector are acquired. They may have been stored in storage space in advance and be read from the corresponding storage space at calculation time, or be received from user input, an external device, or the like.
By way of example, the embodiments of the present application provide a representation of a dense sparse vector in memory, taking 32-bit floating point vectors as an example. The dense vector is stored as contiguous memory of length d × 32 bits, where d is its dimension. The sparse vector is stored in compressed form: a fixed-length 32-bit header stores the number of non-zero values of the sparse vector, and each non-zero value is then stored in 64 bits, the first 32 bits storing the subscript (dimension index) and the last 32 bits storing the specific value, with the non-zero values arranged in ascending order of subscript. The sparse part is appended directly after the dense part to obtain the memory representation of the dense sparse vector. Fig. 2 provides an example of this representation, showing a dense vector and a sparse vector concatenated into a dense sparse vector.
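The layout just described can be sketched with a small packing routine. The little-endian byte order and the sample values are assumptions for illustration; the patent only specifies the field widths and their order.

```python
import struct

# A sketch of the in-memory layout described above (little-endian assumed):
# the dense part is d contiguous float32 values; the sparse part is a
# 32-bit non-zero count, then one 64-bit entry per non-zero value
# (32-bit index followed by 32-bit float32 value), sorted by ascending index.
def pack_dense_sparse(dense, sparse):
    buf = struct.pack(f"<{len(dense)}f", *dense)     # dense part
    entries = sorted(sparse.items())                 # ascending subscripts
    buf += struct.pack("<I", len(entries))           # 32-bit header: non-zero count
    for idx, val in entries:
        buf += struct.pack("<If", idx, val)          # 32-bit index + 32-bit value
    return buf

blob = pack_dense_sparse([0.5, -1.0, 0.25], {10: 0.1, 4096: 0.7})
# 3 floats (12 bytes) + header (4 bytes) + 2 entries (8 bytes each) = 32 bytes
print(len(blob))
```

Keeping the entries sorted by subscript is what later allows the sparse parts of two vectors to be intersected in a single linear pass.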
S102, calculating the first similarity and the second similarity based on the first dense sparse vector and the second dense sparse vector.
In this embodiment, the first similarity is understood as a similarity calculated from the first dense sparse vector and the second dense sparse vector; the second similarity is a further similarity calculated from the same pair of vectors, where the first and second similarities differ in calculation method or in the data they use.
One option is to determine the first dense vector and first sparse vector included in the first dense sparse vector, determine the second dense vector and second sparse vector included in the second dense sparse vector, process and align the two dense sparse vectors, and then calculate the similarity of the aligned vectors in two different ways to obtain the first similarity and the second similarity. Alternatively, the similarity of the first and second dense sparse vectors may be calculated directly in two different ways to obtain the first and second similarities, and so on.
S103, determining the similarity of the first dense sparse vector and the second dense sparse vector according to the first similarity and the second similarity.
The first similarity and the second similarity may characterize the similarity of the first and second dense sparse vectors from different dimensions. After both are obtained, they are combined to determine the final similarity, for example by taking the maximum, minimum, or average of the two, and the resulting value is used as the similarity of the first dense sparse vector and the second dense sparse vector.
It should be noted that the method for determining the vector similarity provided by the embodiment of the present invention may be applied to different vector search algorithms, and in the process of vector search, the similarity is calculated by the method for determining the vector similarity provided by the embodiment of the present invention, so as to implement vector search.
The embodiment of the invention provides a vector similarity determining method that solves the problem of the low accuracy of dense-vector and sparse-vector similarity results: a new kind of vector, the dense sparse vector, is obtained by concatenating a dense vector and a sparse vector, and the final similarity of the first and second dense sparse vectors is determined by calculating their first and second similarities. Compared with the prior art, which calculates similarity from the two separate angles of the dense vector and the sparse vector, considering both at the same time makes the result more accurate; the method is simple to implement and effectively guarantees data consistency; and representing data by dense sparse vectors better reflects the real spatial distribution of the data, which ensures the accuracy of the similarity result and improves the recall rate of vector search.
Example two
Fig. 3 is a flowchart of a vector similarity determining method according to a second embodiment of the present invention, where the method is refined based on the foregoing embodiment. As shown in fig. 3, the method includes:
S201, acquiring a first dense sparse vector and a second dense sparse vector, wherein the first dense sparse vector is obtained by concatenating a first dense vector and a first sparse vector, and the second dense sparse vector is obtained by concatenating a second dense vector and a second sparse vector.
S202, calculating the similarity of the first dense vector and the second dense vector to obtain the first similarity.
The similarity of the first dense vector and the second dense vector can be calculated directly according to a predetermined similarity calculation method. Since the first and second dense vectors have the same dimension, the vector values of corresponding dimensions can be matched directly and the first similarity computed; the calculation method may be cosine similarity, Euclidean distance, or the like.
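The two calculation methods named above can be sketched as follows; these are generic textbook implementations, not code from the patent, and the sample vectors are illustrative.

```python
import math

# Two possible ways to score the dense parts, per the text: cosine
# similarity or Euclidean distance (illustrative implementations).
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

first_dense  = [1.0, 0.0, 2.0]
second_dense = [2.0, 0.0, 4.0]   # same direction, twice the length
print(cosine_similarity(first_dense, second_dense))   # direction identical, so ~1.0
print(euclidean_distance(first_dense, second_dense))
```

Note the two measures answer different questions: cosine similarity ignores magnitude (here it is maximal despite the length difference), while Euclidean distance does not.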
S203, calculating the similarity of the first sparse vector and the second sparse vector to obtain a second similarity.
The similarity of the first sparse vector and the second sparse vector can likewise be calculated directly according to a predetermined similarity calculation method. Since a large number of dimensions in both sparse vectors are 0, the second similarity can be computed from the subscripts (i.e. dimensions) recorded in the sparse vectors; the calculation method may be cosine similarity, Euclidean distance, or the like. The first and second similarities may be calculated in the same way or in different ways.
As an optional refinement of this embodiment, the calculation of the second similarity between the first sparse vector and the second sparse vector is detailed as follows:
A1, determining a first dimension identification in the first sparse vector and a first vector value corresponding to the first dimension identification.
In this embodiment, a first dimension identification is information identifying a dimension in the first sparse vector; a first vector value is a non-zero vector value in the first sparse vector.
The first sparse vector contains a large number of values that are 0, and only the non-zero values are stored. Referring to fig. 2, the subscript in fig. 2 is a first dimension identification marking the dimension in which a non-zero value is located, e.g. 10; the value stored after the subscript, e.g. 0.1, is the corresponding first vector value. All first dimension identifications in the first sparse vector are determined, together with the first vector value corresponding to each.
A2, determining a second dimension identification in the second sparse vector and a second vector value corresponding to the second dimension identification.
In this embodiment, a second dimension identification is information identifying a dimension in the second sparse vector; a second vector value is a non-zero vector value in the second sparse vector.
The second sparse vector likewise contains a large number of values that are 0, and only the non-zero values are stored. Referring to fig. 2, the subscript is the second dimension identification and the value after it is the second vector value. All second dimension identifications in the second sparse vector are determined, together with the second vector value corresponding to each.
A3, comparing each first dimension identification with each second dimension identification, and determining identifications that are the same in both as target dimension identifications.
In this embodiment, a target dimension identification is the identification of a dimension in which both the first sparse vector and the second sparse vector have a non-zero value.
All first dimension identifications are compared with the second dimension identifications; when a first dimension identification is the same as a second dimension identification, that identification is determined to be a target dimension identification. There may be a plurality of target dimension identifications.
A4, calculating the similarity based on the first vector value and the second vector value corresponding to the target dimension identification, and determining the second similarity.
The first vector value corresponding to each target dimension identification is determined in the first sparse vector, and the second vector value corresponding to each target dimension identification is determined in the second sparse vector. The similarity is then calculated over the first and second vector values corresponding to all target dimension identifications; the similarity calculation method can be preset, and the second similarity is obtained by evaluating the corresponding calculation formula.
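Since both sparse vectors store their non-zero entries sorted by ascending subscript, steps A1 to A4 can be sketched as a single linear merge over the two index lists; only subscripts present in both vectors (the target dimension identifications) contribute. The sample vectors are illustrative.

```python
# A sketch of steps A1-A4: walk the two ascending index lists in parallel;
# only indices present in both vectors (the "target dimension
# identifications") contribute to the inner-product-style second similarity.
def sparse_inner_product(first, second):
    # first/second: lists of (index, value) pairs sorted by ascending index.
    i = j = 0
    score = 0.0
    while i < len(first) and j < len(second):
        idx_a, val_a = first[i]
        idx_b, val_b = second[j]
        if idx_a == idx_b:            # target dimension: non-zero in both
            score += val_a * val_b
            i += 1
            j += 1
        elif idx_a < idx_b:
            i += 1
        else:
            j += 1
    return score

first_sparse  = [(10, 0.1), (4096, 0.7), (731022, 0.25)]
second_sparse = [(10, 0.5), (2048, 0.9), (731022, 0.4)]
print(sparse_inner_product(first_sparse, second_sparse))  # 0.1*0.5 + 0.25*0.4
```

The merge visits each stored entry at most once, so the cost depends only on the number of non-zero values, not on the nominal (possibly million-scale) dimension.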
Optionally, the first similarity and the second similarity are inner product similarities.
The first similarity and the second similarity are preferably calculated as inner product similarities. The first similarity is calculated by multiplying the values of the same dimension of the first dense vector and the second dense vector and adding the products over all dimensions; the second similarity is calculated by multiplying the first vector value and the second vector value corresponding to each target dimension identification and adding the products over all target dimension identifications.
Exemplary, the embodiment of the application provides the calculation formula of inner product similarity:
p(A, B) = Σᵢ₌₁ⁿ aᵢ · bᵢ
wherein p(A, B) is the inner product similarity between vector A and vector B, aᵢ is the value of the i-th dimension in vector A, bᵢ is the value of the i-th dimension in vector B, and n is the dimension of vectors A and B.
As can be seen from the above formula, when the inner product similarity is calculated, the products of the values in the same dimension are computed and summed. Because any value multiplied by 0 is 0, only the target dimension identifications, where both values are non-zero, contribute when calculating the second similarity: the second similarity is obtained by multiplying the first vector value and the second vector value corresponding to each target dimension identification and summing the products.
S204, carrying out weighted summation on the first similarity and the second similarity to obtain the similarity of the first dense sparse vector and the second dense sparse vector.
The first similarity and the second similarity are weighted and summed. The weights can be set according to the scenario, the data type corresponding to the vectors, and so on, or be set independently by the user. For example, the user designates the weights of the dense part and the sparse part in the final score, e.g. a weight of 0.7 for the dense vector and 0.3 for the sparse vector. According to the definition of the inner product similarity of dense sparse vectors, the dense part and the sparse part are multiplied by their corresponding weights, i.e. the first similarity and the second similarity are multiplied by the corresponding weights, and the weighted sum is the similarity of the first dense sparse vector and the second dense sparse vector.
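This final combination step can be sketched in one line. The 0.7/0.3 weights mirror the example in the text; the part scores fed in are hypothetical.

```python
# Final score per S204: a weighted sum of the dense-part (first) and
# sparse-part (second) similarities. Weights mirror the 0.7/0.3 example
# in the text; in practice they are scene- or user-configurable.
def dense_sparse_similarity(first_sim, second_sim,
                            dense_weight=0.7, sparse_weight=0.3):
    return dense_weight * first_sim + sparse_weight * second_sim

# Hypothetical part scores: dense similarity 0.9, sparse similarity 0.5.
print(dense_sparse_similarity(0.9, 0.5))
```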
The embodiment of the invention provides a vector similarity determining method that solves the problem of the low accuracy of dense-vector and sparse-vector similarity results: a new kind of vector, the dense sparse vector, is obtained by concatenating a dense vector and a sparse vector. The first similarity of the first and second dense vectors and the second similarity of the first and second sparse vectors are calculated and then weighted and summed to obtain the final similarity of the first and second dense sparse vectors, improving the accuracy of the similarity calculation. Compared with the prior art, which calculates similarity from the two separate angles of the dense vector and the sparse vector, considering both at the same time makes the result more accurate and thereby improves the recall rate of vector search; the method is simple to implement and effectively guarantees data consistency; and representing data by dense sparse vectors better reflects the real spatial distribution of the data, avoiding the loss of performance and recall caused in traditional schemes by computing the dense and sparse similarities separately.
Example III
Fig. 4 is a flowchart of a vector search method according to a third embodiment of the present invention, where the method may be performed by a vector search device, and the vector search device may be implemented in hardware and/or software, and the vector search device may be configured in an electronic device. As shown in fig. 4, the method includes:
S301, acquiring a dense sparse vector to be searched.
In the present embodiment, the dense sparse vector to be searched is a dense sparse vector for which a search is required. It can be input by a user, obtained by vectorizing data, or determined by another device or service module and transmitted to the executing device before searching, and so on. After the dense sparse vector to be searched is obtained, the similarity is calculated based on the vector similarity determining method provided by any embodiment of the application, and the dense sparse vectors matching it are determined.
S302, searching a search graph of the hierarchical navigable small world based on a graph algorithm, reading the dense sparse vectors to be matched in the search graph, and calculating the similarity between the dense sparse vector to be searched and each dense sparse vector to be matched, where the similarity is calculated according to the vector similarity determining method of any embodiment of the invention.
In this embodiment, the hierarchical navigable small world (HNSW) stores different vectors and the relationships between them through the search graph. A dense sparse vector to be matched is a dense sparse vector whose similarity to the dense sparse vector to be searched needs to be calculated.
The search graph of the hierarchical navigable small world is constructed in advance: when constructing the graph, the similarity between different dense sparse vectors is calculated and each dense sparse vector is added to a graph layer to form the search graph. The above vector similarity determining method is also used to calculate the similarity between different dense sparse vectors during graph construction.
When the search graph of the hierarchical navigable small world is searched according to the HNSW graph algorithm, the identifications of all dense sparse vectors in the uppermost layer of the search graph are determined first and taken as the identifications of the dense sparse vectors to be matched, and the vectors are read, according to these identifications, from the lowermost layer where the specific vector values are stored. The similarity between the dense sparse vector to be searched and each read dense sparse vector to be matched is calculated; one or more dense sparse vectors with higher similarity are selected and mapped to the next layer. For each selected vector, its neighbors in the next layer are determined, the vectors corresponding to these neighbors are taken as the new dense sparse vectors to be matched, the similarity between each of them and the dense sparse vector to be searched is calculated, and the vectors with higher similarity continue to be mapped downward until the bottom layer is reached, at which point the similarities stored during the search are obtained. During the search, the number of dense sparse vectors participating in the similarity calculation in each layer may be indeterminate, since it is related to the number of neighbors of each vector; after the similarity calculation is completed, a preset number of dense sparse vectors to be matched may be saved, i.e., the preset number with the highest similarity is selected and saved.
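The layer-by-layer descent above can be sketched as follows. This is a deliberately simplified, single-candidate greedy walk over assumed adjacency dicts (one per layer, top layer first); hnswlib's actual search keeps a bounded candidate set per layer, so treat this only as an illustration of the descent:

```python
def hnsw_descend(query, layers, similarity, entry_id):
    """Greedy descent over a hierarchical navigable small world.

    `layers` is a list of adjacency dicts {vector_id: [neighbor ids]},
    uppermost layer first; `similarity(query, vid)` scores the stored
    vector `vid` against the query.
    """
    best = entry_id
    for graph in layers:
        improved = True
        while improved:  # greedily move to a more similar neighbor
            improved = False
            for nb in graph.get(best, []):
                if similarity(query, nb) > similarity(query, best):
                    best, improved = nb, True
    return best  # id of the most similar vector reached in the bottom layer
```

With stored scalars {0: 0.0, 1: 5.0, 2: 9.0}, similarity as negative distance, and a query of 10, the walk enters at id 0 in the top layer, moves to id 1, descends, and ends at id 2.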
In the prior art, when sparse vectors and dense vectors are searched jointly, sparse vectors are used for coarse ranking and dense vectors are then used for fine ranking; if the accuracy of the sparse-vector similarity results is low, the quality of the candidate set is poor and the recall rate is seriously affected, and two sets of retrieval systems must be maintained to calculate similarity, so the engineering complexity is high and data consistency is hard to guarantee. When similarity is calculated separately for the sparse vector and the dense vector and an RRF score is then computed, the two are independent data, so the distribution of both in the vector space cannot be considered simultaneously when the index is created, which affects the recall rate; moreover, the fusion algorithm that selects the final result from the two independent result sets depends on the scene and cannot guarantee recall in general scenes. In the vector searching process, this embodiment redefines a vector, namely the dense sparse vector, and defines a similarity calculating method for it; calculating similarity according to this method and searching accordingly can effectively improve the recall rate, independently of any specific scene.
S303, determining a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
In this embodiment, the target dense sparse vector is the dense sparse vector that matches the dense sparse vector to be searched. All the similarities stored during the calculation are compared, the higher values are determined, and the dense sparse vectors corresponding to those values are determined as the target dense sparse vectors matching the dense sparse vector to be searched; the number of target dense sparse vectors may be one or several. The target dense sparse vectors can be fed back to the user as the search result, or passed to other data processing layers for further processing.
It should be noted that, in the vector search method provided in the embodiment of the present application, the similarity between the two vectors involved is calculated by the vector similarity determining method provided in the first or second embodiment of the present invention.
The embodiment of the invention provides a vector searching method, which solves the problem of low recall caused by the low accuracy of similarity results when dense vectors and sparse vectors are searched jointly. A new vector, namely the dense sparse vector, is obtained by splicing a dense vector and a sparse vector. In the vector searching process, the final similarity is determined by calculating the first similarity and the second similarity of the first dense sparse vector and the second dense sparse vector, which improves the accuracy of similarity calculation and thus the accuracy of the search. Compared with the prior art, which calculates similarity from the dense-vector and sparse-vector angles separately, the method considers both at the same time, so the result is more accurate and the recall rate of vector search is improved. The implementation is simple, only one retrieval system needs to be maintained, and data consistency is effectively guaranteed. Representing the data by dense sparse vectors better reflects the real spatial distribution of the data, avoiding the loss of performance and recall that traditional schemes incur from inaccurate separate similarity results; the recall rate is guaranteed in general scenes, independently of any specific scene.
Example IV
Fig. 5 is a flowchart of a vector search method according to a fourth embodiment of the present invention, where the method is refined based on the foregoing embodiments. As shown in fig. 5, the method includes:
S401, acquiring a dense sparse vector to be stored.
In this embodiment, the dense sparse vector to be stored may be specifically understood as a dense sparse vector having a storage requirement, and the dense sparse vector to be stored may form a hierarchical navigable small world search graph.
The dense sparse vector to be stored can be input by a user, pre-stored in a file, or transmitted to the executing device by another device or terminal. There may be one or more dense sparse vectors to be stored; that is, one vector may be obtained at a time, or several may be obtained in batches, so that a hierarchical navigable small world search graph can be constructed from them.
Taking hnswlib as an example of the graph algorithm implementation: hnswlib is an HNSW implementation that currently supports only dense vectors, and the hnswlib graph object specifies a maximum number of points max_elements at construction time, which is the maximum number of vectors forming the search graph of the hierarchical navigable small world. The main data structures used to represent the graph are linkLists_ and data_level0_memory_: linkLists_ stores, for each point, its sets of up to M neighbor ids in layers 1 to n; data_level0_memory_ stores, for each point, its set of up to 2M neighbor ids in layer 0, together with the vector data itself.
The memory of one point in layer 0 is represented as:
listsize, an integer storing the number of valid neighbors, 2M at most;
2M integers, storing the neighbor id set;
d floating-point values, storing the vector data;
a long integer, storing the point's external tag (label).
Thus the memory of a point in layer 0 has a fixed length, and to guarantee performance the space is pre-allocated when the hnswlib object is initialized: the memory size allocated for data_level0_memory_ is the maximum number of points max_elements multiplied by the fixed memory size of each point. Fig. 6 provides an exemplary diagram of the prior art for dense vector storage, where each point is a vector.
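The fixed per-point size that makes this pre-allocation possible follows directly from the four fields above; a sketch assuming the usual 4-byte int/float and an 8-byte label on a 64-bit platform (these sizes, like the function names, are assumptions rather than values read from hnswlib):

```python
def layer0_point_size(M, d, int_size=4, float_size=4, label_size=8):
    """Fixed byte size of one layer-0 point in data_level0_memory_."""
    return (int_size             # listsize: number of valid neighbors
            + 2 * M * int_size   # up to 2M neighbor ids in layer 0
            + d * float_size     # the dense vector itself
            + label_size)        # external tag (label)

def layer0_total_size(max_elements, M, d):
    """data_level0_memory_ is pre-allocated as max_elements fixed slots."""
    return max_elements * layer0_point_size(M, d)
```

For M=16 and d=64, each point occupies 4 + 128 + 256 + 8 = 396 bytes, so 1000 points pre-allocate 396000 bytes.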
To extend hnswlib to support dense sparse vectors, the embodiment of the application mainly needs to work in two aspects: 1. replacing the original dense-vector distance calculation with the dense sparse vector distance calculation; 2. supporting the storage of variable-length vector data, since dense sparse vectors, unlike dense vectors, are variable-length data, while still guaranteeing the performance of the two addressing operations above. Therefore, when constructing the hierarchical navigable small world search graph, the embodiment of the application provides a new dense sparse vector storage mode and a new similarity calculation mode.
S402, storing the dense sparse vector to be stored through the pre-allocated memory space.
A memory space is pre-allocated for storing all the dense sparse vectors to be stored; it can be divided into smaller spaces, each storing one corresponding vector. After a dense sparse vector to be stored is obtained, it is stored into the pre-allocated memory space. The memory space can be allocated in advance according to the number of dense sparse vectors to be stored and their maximum length, or a preset length determined from the general range of lengths of the different dense sparse vectors to be stored.
Optionally, the pre-allocated memory space includes a first memory space and a second memory space;
As an optional embodiment, storing the dense sparse vector to be stored through the pre-allocated memory space is optimized to: determining the length of the dense sparse vector to be stored and storing it in the corresponding first memory space; judging whether the length of the dense sparse vector to be stored is greater than the length of the second memory space; if so, storing the dense sparse vector to be stored in the corresponding virtual address space; otherwise, storing it in the corresponding second memory space.
The length of the second memory space is determined according to the dimension of the dense vector in the dense sparse vector to be stored and the maximum number of non-zero values to which the sparse vector is constrained; the size of the virtual address space is determined by rounding the maximum length of the dense sparse vectors to be stored up to a whole data page.
In this embodiment, the first memory space and the second memory space store two different kinds of data: the first memory space stores the length of the dense sparse vector, and the second memory space stores its specific values. The pre-allocated memory space is divided into first and second memory spaces, where one first memory space corresponds to one second memory space and together they store one dense sparse vector; the memory space comprises several groups of first and second memory spaces, the number of groups being the maximum number of points max_elements. Since the second memory space stores the specific values of the dense sparse vector, which comprise the dense vector and the sparse vector, its size is determined according to the dense-vector dimension and the maximum number of non-zero values to which the sparse vector is constrained. The dense-vector dimension is usually predetermined, and the maximum number of non-zero values of the sparse vector can also be set in advance according to the service scenario.
For example, taking the maximum number of non-zero values of the constrained sparse vector to be N, the maximum length of the floating-point sparse part is sizeof(uint32_t) + N * (sizeof(uint32_t) + sizeof(float)), and the maximum length of the whole floating-point dense sparse vector is max_vector_size = d * sizeof(float) + sizeof(uint32_t) + N * (sizeof(uint32_t) + sizeof(float)), i.e., max_vector_size = 4 + 8N + 4d. Here N may be determined according to the maximum number of non-zero values of most sparse vectors among the dense sparse vectors to be stored, or according to the maximum number of non-zero values of all of them; the specific value of N can be obtained by analyzing the data set, or estimated from the service that generates the sparse vectors.
In most application scenarios, the lengths of the generated sparse vectors do not differ too much, and pre-allocating space according to the maximum size while constructing the graph index is feasible. But in some scenarios the lengths may differ greatly: for example, most sparse vectors have 20 non-zero values, but a few have ten thousand. If N were taken as the maximum number of non-zero values over all sparse vectors, it would need to be ten thousand, yet since most sparse vectors have far fewer values to store, pre-allocating space at that maximum size would cause excessive memory pressure. Therefore N can be taken as the maximum number of non-zero values of most sparse vectors, e.g., set to 20.
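The max_vector_size formula can be checked numerically; a sketch assuming 4-byte floats and 4-byte uint32 indices, with one uint32 count plus at most N (index, value) pairs for the sparse part, matching the 4 + 8N + 4d expression in the text:

```python
SIZEOF_FLOAT = 4   # sizeof(float), assumed
SIZEOF_UINT32 = 4  # sizeof(uint32_t), assumed

def max_vector_size(d, N):
    """Maximum byte length of a float dense sparse vector:
    d dense floats, plus a uint32 non-zero count, plus up to N
    (uint32 index, float value) pairs of 8 bytes each."""
    dense_part = d * SIZEOF_FLOAT
    sparse_part = SIZEOF_UINT32 + N * (SIZEOF_UINT32 + SIZEOF_FLOAT)
    return dense_part + sparse_part  # = 4 + 8N + 4d
```

With d=64 and N=32 this gives 516 bytes, which is exactly the normal_sparse_size computed in the worked example further below; with d=128 and N=20 it gives 676 bytes.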
For example, fig. 7 provides an exemplary diagram of storing vectors in the pre-allocated memory space, where the memory space variable_l0_memory is divided into a first memory space head and a second memory space body. When the variable_l0_memory object is initialized, head is allocated the maximum number of points max_elements multiplied by the size of a long integer, and body is allocated max_elements multiplied by max_vector_size; the actual length of each dense sparse vector is stored in head, so the addressing of the vector data can be completed simply by computing the relative offset into body from the identification of the dense sparse vector. When writing data, the data is copied to the corresponding offset in body and the actual length is recorded in head; the whole read-write process needs no lock protection. Fig. 7 shows an example with 3 dense sparse vectors stored. Fig. 8 provides another exemplary diagram of storing vectors in the pre-allocated memory space: compared with fig. 7, a new dense sparse vector has been stored, for a total of 4.
After a dense sparse vector to be stored is obtained, its length can be determined and stored into the corresponding first memory space. Whether its length is greater than the length of the second memory space is then judged. If so, it is determined that not all of its vector values can be stored there, a virtual address space is mapped with mmap, and the vector is stored into the corresponding virtual address space. The size of the virtual address space is obtained by rounding the maximum length of the dense sparse vectors to be stored up to a whole data page; that maximum length may be preset, or obtained by comparing the lengths of the dense sparse vectors when they are obtained in batches. The size of the virtual address space may also be determined by the maximum number of non-zero values of all dense sparse vectors to be stored, i.e., the maximum length can be determined from that maximum number together with the dense-vector dimension. If the length of the dense sparse vector to be stored is not greater than the length of the second memory space, it is stored directly into the corresponding second memory space.
Illustratively, let the maximum number of non-zero values of the constrained sparse vectors be N, representing the maximum number of non-zero values of most sparse vectors, e.g., N=32, from which the length normal_sparse_size of the second memory space can be calculated; the maximum length of the dense sparse vectors is max_vector_size. Memory of max_elements times normal_sparse_size is allocated for the second memory space body; max_vector_size is rounded up to 4KB, and a block of virtual address space huge_body is mapped using mmap, whose length is max_elements multiplied by the value of max_vector_size rounded up to 4KB. When a dense sparse vector is written, if its length is less than or equal to normal_sparse_size, it is stored in body; if its length is greater than normal_sparse_size, it is stored in huge_body.
Assume the maximum length max_vector_size of all dense sparse vectors to be stored is 16380, the maximum number of non-zero values N of the constrained sparse vectors is 32, the dense-vector dimension is 64, and max_elements is 1024. Then the length of the second memory space is normal_sparse_size = 64*4 + 4 + 8*32 = 516, and max_vector_size becomes 16384 after rounding up to a data page. Assume the length of dense sparse vector 4 to be stored is 13386, while the sparse parts of the remaining vectors all have at most 32 non-zero values. Fig. 9 provides an exemplary diagram of storing a dense sparse vector to be stored: when its length is not greater than the length of the second memory space, it is stored in the corresponding second memory space body; when its length is greater, it is stored in the corresponding virtual address space huge_body. With the huge_body optimization, the memory overhead is: head, 8192 bytes; body, 528384 bytes; huge_body, 16384 bytes actually resident; 552960 bytes in total. Without the optimization, the memory overhead is: head, 8192 bytes; body, 16777216 bytes; 16785408 bytes in total.
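The overhead figures in this example can be reproduced arithmetically; a sketch that, as an assumption consistent with the numbers above, charges huge_body only for the 4 KB pages its stored vectors actually touch (untouched mmap'd pages consume no physical memory):

```python
PAGE = 4096  # one data page (4 KB), as used in the example

def round_up(n, align=PAGE):
    """Round n up to a whole number of data pages."""
    return -(-n // align) * align

def memory_overhead(max_elements, slot_size, oversized_lengths):
    """Total byte overhead of the head/body/huge_body layout: head holds
    one long (8 bytes) per point, body holds max_elements fixed slots of
    slot_size bytes, and huge_body is charged per resident page for each
    oversized vector actually written."""
    head = max_elements * 8
    body = max_elements * slot_size
    huge_body = sum(round_up(n) for n in oversized_lengths)
    return head + body + huge_body
```

With max_elements=1024, slots of 516 bytes, and one 13386-byte oversized vector, this yields 8192 + 528384 + 16384 = 552960 bytes; sizing every slot at 16384 bytes instead (no huge_body) yields 16785408 bytes, matching the example.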
S403, constructing a hierarchical navigable small world search graph according to the dense sparse vector to be stored.
The similarity used when constructing the hierarchical navigable small world search graph from the dense sparse vectors to be stored is calculated by the vector similarity determining method provided by any embodiment of the invention.
Each dense sparse vector to be stored is placed in its corresponding layers to form the search graph of the hierarchical navigable small world. When a dense sparse vector to be stored is placed, it can first be placed in layer 0; then the similarity is calculated, its neighbors are determined based on the similarity, and suitable dense sparse vectors are selected to be mapped to the upper layers, or a layer is added. In the process of constructing the search graph, the similarity needs to be calculated to determine the neighbors and position of each dense sparse vector, and it is calculated by the vector similarity determining method of any embodiment of the invention. When there are several dense sparse vectors to be stored, they are placed into the search graph in turn to complete the construction of the search graph of the hierarchical navigable small world.
It should be noted that, in the process of constructing the search graph, when the similarity between a newly added dense sparse vector to be stored and a dense sparse vector already in the graph is calculated, the latter has already been stored in the memory space and therefore needs to be read from it. Taking it as the dense sparse vector to be read: its length is read from the corresponding first memory space according to its identification; whether that length is greater than the length of the second memory space is judged; if so, the vector is determined to be stored in the virtual address space and is read from the corresponding virtual address space; otherwise, it is determined to be stored in the second memory space and is read from the corresponding second memory space.
S404, when the memory merging condition is met, merging the pre-allocated memory space to obtain a merged memory space.
In this embodiment, the memory merging condition is the judgment condition under which the memory space is merged and the redundant space released. It may be that all dense sparse vectors are detected to have been stored, or that a memory merging operation triggered by the user is received, or that the number of dense sparse vectors currently stored reaches a set threshold, and so on.
A memory merging condition is preset; when it is detected to be satisfied, it is determined that the search graph of the hierarchical navigable small world has been constructed and no new data will be added. The pre-allocated memory space is merged to release the redundant space, yielding the merged memory space. After memory merging is executed, the search graph refuses to add new data, and vector searching can be performed on it.
As an optional embodiment of the present embodiment, the optional embodiment further combines the pre-allocated memory space to obtain a combined memory space, and optimizes to:
B1, determining the total length of the storage space and the offset of each dense sparse vector to be stored according to the lengths stored in the first memory spaces of the memory space.
The lengths stored in all the first memory spaces are read, and their sum is calculated to obtain the total length of the storage space required for the vector data; at the same time, for each dense sparse vector to be stored, its offset is determined from the lengths of the dense sparse vectors stored before it.
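Step B1 amounts to a prefix sum over the recorded lengths; a minimal sketch (function and variable names are illustrative):

```python
def merge_offsets(lengths):
    """From the per-vector lengths in head, derive each vector's offset
    in the merged body and the total body length (step B1)."""
    offsets, total = [], 0
    for n in lengths:
        offsets.append(total)  # vector i starts where the previous ones end
        total += n
    return offsets, total
```

For the three lengths 516, 516, 13386 this yields offsets 0, 516, 1032 and a total storage length of 14418 bytes.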
And B2, distributing a new memory space according to the total length of the memory space, wherein the new memory space comprises a third memory space and a fourth memory space.
And allocating a new memory space according to the total length of the memory space so as to store the vector through the new memory space, wherein the new memory space comprises a third memory space and a fourth memory space.
And B3, storing the offset of each dense sparse vector to be stored into a corresponding third memory space, and storing each dense sparse vector to be stored into a corresponding fourth memory space according to the offset to obtain a combined memory space.
The third memory space is used for storing offsets, and the offset of each dense sparse vector to be stored is stored into the corresponding third memory space; the fourth memory space is used for storing the dense sparse vectors themselves, each stored into the corresponding fourth memory space according to its offset. Since space in the fourth memory space is allocated according to the actual size of each vector, memory waste is avoided; this completes the merging of the memory spaces and yields the merged memory space.
For example, fig. 10 provides an exemplary diagram for storing a dense sparse vector to be stored, where the obtained storage space is the storage space after the merging process, the offset is stored in the head, and the dense sparse vector to be stored is stored in the body.
In the process of constructing the search graph, the embodiment of the invention first allocates fixed-length memory space to store the dense sparse vectors, i.e., the second memory space and the virtual address space are pre-allocated to store the dense sparse vectors to be stored. After the memory is merged, head stores the offset of each dense sparse vector in body, and a vector is addressed simply by adding the offset stored in head to the base address of body. After the memory has been merged, no new points are allowed to be inserted, and the whole addressing process is still lock-free.
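The post-merge addressing just described can be sketched as an offset lookup plus a slice; bounding each vector by the next offset (or the end of body) is an assumption of this sketch, consistent with offsets derived from the recorded lengths:

```python
def read_merged_vector(head, body, vec_id):
    """Lock-free post-merge read: head[i] holds vector i's byte offset
    into the merged body; the next offset, or the end of body, bounds
    the variable-length vector."""
    start = head[vec_id]
    end = head[vec_id + 1] if vec_id + 1 < len(head) else len(body)
    return body[start:end]
```

For example, with head = [0, 3, 5] and body = b"abcdefg", vector 1 is the two bytes at offsets 3-4.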
After the search map of the hierarchical navigable small world is built, the search map is not changed generally, and only the search map needs to be queried in the vector search process.
S405, acquiring dense sparse vectors to be searched.
S406, searching the search graph of the hierarchical navigable small world based on a graph algorithm, reading the dense sparse vectors to be matched in the search graph, and calculating the similarity between the dense sparse vector to be searched and each dense sparse vector to be matched, where the similarity is calculated according to the vector similarity determining method of any embodiment of the invention.
As an optional embodiment of the present embodiment, the present optional embodiment further reads dense sparse vectors to be matched in a search graph of a hierarchical navigable small world, including:
C1, reading the length from the corresponding first memory space according to the identification of the dense sparse vector to be matched;
C2, judging whether the length of the dense sparse vector to be matched is greater than the length of the second memory space; if so, reading the dense sparse vector to be matched from the corresponding virtual address space; otherwise, reading it from the corresponding second memory space.
The search graph of the hierarchical navigable small world can serve actual searches after its construction is completed, and also during construction. If memory merging has not been executed, the identification of the dense sparse vector to be matched is determined during vector searching, and its length is read from the corresponding first memory space according to that identification; whether the length is greater than the length of the second memory space is judged; if so, the vector is determined to be stored in the virtual address space and is read from the corresponding virtual address space; otherwise, it is determined to be stored in the second memory space and is read from the corresponding second memory space.
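The pre-merge read path (steps C1-C2: length check, then body or huge_body) can be sketched as fixed-slot indexing; the slot strides and byte-string stand-ins for the memory spaces are assumptions of this sketch:

```python
def read_before_merge(head, body, huge_body, vec_id,
                      normal_sparse_size, huge_slot_size):
    """Pre-merge read: head records each vector's actual length; a
    vector longer than the fixed body slot lives at the same slot index
    in the mmap'd huge_body instead."""
    length = head[vec_id]                 # actual length written earlier
    if length > normal_sparse_size:
        start = vec_id * huge_slot_size   # fixed-stride slot in huge_body
        return huge_body[start:start + length]
    start = vec_id * normal_sparse_size   # fixed-stride slot in body
    return body[start:start + length]
```

For instance, with 3-byte body slots and 8-byte huge_body slots, a 2-byte vector 0 is read from body and a 5-byte vector 1 from huge_body.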
It should be noted that, when memory merging has not been performed, similarity calculation may be involved both when searching vectors and when storing a new dense sparse vector into the memory space and updating the search graph. In either case, the dense sparse vectors already included in the search graph need to be read from the memory space in which they are stored, and the reading manner is the same: whether a vector is stored in the second memory space or the virtual address space is determined by its length, and the data is read from the corresponding space.
As an optional embodiment of the present embodiment, the present optional embodiment further optimizes the dense sparse vector to be matched in the search graph of the reading hierarchical navigable small world to:
and D1, determining the offset of the dense sparse vector to be matched according to the identification of the dense sparse vector to be matched.
In the searching process, if memory merging has been performed, all the dense sparse vectors are already stored in the merged storage space, so their specific values need to be read from it. The offset of the dense sparse vector to be matched is read from head according to its identification.
And D2, reading the dense sparse vector to be matched from a storage space corresponding to the search map of the hierarchical navigable small world according to the offset of the dense sparse vector to be matched.
According to the offset, the dense sparse vector to be matched is read directly from the storage space corresponding to the search graph of the hierarchical navigable small world, namely the body.
The embodiment of the present application provides two methods for reading dense sparse vectors to be matched in the search graph of the hierarchical navigable small world, covering different conditions. If a vector search is performed before memory merging, the dense sparse vector to be matched is read using steps C1-C2; this applies to the scenario of searching while the search graph is still being constructed. If a vector search is performed after memory merging, the vector is read using steps D1-D2; this applies to the scenario where the construction of the search graph is completed and no new vectors are being added. In practice, the appropriate mode for reading the dense sparse vector to be matched is selected by judging whether memory merging has been executed.
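The two read paths described above can be sketched as follows. This is an illustrative sketch only: `SLOT_LEN`, the dictionary-based memory spaces, and all function names are assumptions introduced for clarity, not the patent's literal data layout.

```python
SLOT_LEN = 4  # assumed fixed length of one second-memory-space slot


def read_pre_merge(vec_id, lengths, slots, virtual_space):
    """Steps C1-C2: before merging, the stored length selects the space."""
    length = lengths[vec_id]            # length from the first memory space
    if length > SLOT_LEN:               # too long for a fixed-length slot
        return virtual_space[vec_id]    # stored in the virtual address space
    return slots[vec_id][:length]       # stored in the second memory space


def read_post_merge(vec_id, offsets, body):
    """Steps D1-D2: after merging, the header offset addresses the body."""
    start = offsets[vec_id]
    # The next vector's offset (or the end of the body) bounds this vector.
    end = offsets.get(vec_id + 1, len(body))
    return body[start:end]
```

A caller would first check whether merging has been executed and then dispatch to the matching reader, mirroring the mode selection described in the text.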
S407, determining a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
When the target dense sparse vector is fed back, its external label (Label) can be fed back at the same time, facilitating subsequent use of the target dense sparse vector.
The embodiment of the invention provides a vector search method that solves the problem of low recall caused by inaccurate similarity results when dense and sparse vectors are searched jointly: a new vector, namely a dense sparse vector, is obtained by splicing a dense vector and a sparse vector. During vector search based on the hierarchical navigable small world, the final similarity is determined by calculating the first similarity and the second similarity of the first dense sparse vector and the second dense sparse vector, which improves the accuracy of similarity calculation and thus the accuracy of searching. Compared with the prior art, which calculates similarity separately from the dense-vector and sparse-vector angles, the present method considers the dense vector and the sparse vector at the same time when calculating vector similarity, yielding more accurate results and a higher recall rate during vector search. The implementation is simple: only one retrieval system needs to be maintained, and data consistency is effectively guaranteed. In addition, the data storage of the hierarchical navigable small world search graph is extended: dense sparse vectors are stored in tiers through a pre-allocated fixed-length memory space and a virtual address space, and no lock protection is needed throughout the process; storing larger dense sparse vectors in the virtual address space effectively saves memory. After all dense sparse vectors have been stored, a merging operation is performed on the memory space, redundant space is released, and the dense sparse vectors are stored with variable length; fixed-length addressing is used for the dense sparse vectors before merging, and variable-length addressing after merging.
By extending the graph algorithm HNSW, dense sparse vectors can be directly retrieved in a hybrid manner using the graph algorithm. Representing the data as dense sparse vectors better reflects the real spatial distribution of the data, avoiding the performance and recall losses caused by retrieving the dense vector and the sparse vector separately in the traditional scheme; the recall rate is guaranteed in the generic scenario, independent of any specific scenario.
Example five
Fig. 11 is a schematic structural diagram of a vector similarity determining apparatus according to a fifth embodiment of the present invention. As shown in fig. 11, the apparatus includes: a vector acquisition module 51, a similarity calculation module 52, and a similarity determination module 53.
The vector acquisition module 51 is configured to acquire a first dense sparse vector and a second dense sparse vector, where the first dense sparse vector is obtained by stitching the first dense vector and the first sparse vector, and the second dense sparse vector is obtained by stitching the second dense vector and the second sparse vector;
a similarity calculation module 52 for calculating a first similarity and a second similarity based on the first dense sparse vector and the second dense sparse vector;
a similarity determining module 53, configured to determine a similarity of the first dense sparse vector and the second dense sparse vector according to the first similarity and the second similarity.
The embodiment of the invention provides a vector similarity determining device that solves the problem of low accuracy when calculating the similarity of a dense vector and a sparse vector: a new vector, namely a dense sparse vector, is obtained by splicing the dense vector and the sparse vector. The final similarity of the first dense sparse vector and the second dense sparse vector is determined by calculating their first similarity and second similarity. Compared with the prior art, which calculates similarity separately from the dense-vector and sparse-vector angles, the vector similarity is calculated by considering the dense vector and the sparse vector at the same time, yielding a more accurate result. The implementation is simple, and data consistency is effectively guaranteed. Representing the data as dense sparse vectors better reflects the real spatial distribution of the data and improves the recall rate of vector search.
Optionally, the similarity calculation module 52 includes:
a first similarity calculating unit, configured to calculate a similarity between the first dense vector and the second dense vector, to obtain a first similarity;
and the second similarity calculation unit is used for calculating the similarity of the first sparse vector and the second sparse vector to obtain second similarity.
Optionally, the second similarity calculation unit is specifically configured to determine a first dimension identifier in the first sparse vector and a first vector value corresponding to the first dimension identifier; determining a second dimension identifier in the second sparse vector and a second vector value corresponding to the second dimension identifier; comparing each first dimension identifier with each second dimension identifier, and determining the same first dimension identifier and second dimension identifier as target dimension identifiers; and calculating the similarity based on the first vector value and the second vector value corresponding to the target dimension identification, and determining the second similarity.
Optionally, the first similarity and the second similarity are inner product similarities.
Optionally, the similarity determining module 53 is specifically configured to: and carrying out weighted summation on the first similarity and the second similarity to obtain the similarity of the first dense sparse vector and the second dense sparse vector.
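The computation performed by these modules can be sketched as follows. The function names and the default weighting coefficients are illustrative assumptions; what the description specifies is inner-product similarity for each part, a sparse similarity restricted to the target dimension identifiers present in both sparse vectors, and a weighted sum of the two.

```python
def dense_similarity(d1, d2):
    """First similarity: inner product of the two dense parts."""
    return sum(a * b for a, b in zip(d1, d2))


def sparse_similarity(s1, s2):
    """Second similarity: inner product over the target dimension
    identifiers, i.e. dimensions present in both sparse parts.
    Sparse parts are maps of dimension identifier -> vector value."""
    return sum(v * s2[dim] for dim, v in s1.items() if dim in s2)


def dense_sparse_similarity(d1, s1, d2, s2, w_dense=0.5, w_sparse=0.5):
    """Final similarity: weighted sum of the first and second similarities.
    The weights here are placeholder values, not taken from the patent."""
    return w_dense * dense_similarity(d1, d2) + w_sparse * sparse_similarity(s1, s2)
```

For example, two dense sparse vectors with dense parts `[1, 2]` and `[3, 4]` and sparse parts `{0: 1.0, 5: 2.0}` and `{5: 3.0, 7: 1.0}` share only dimension 5, so the second similarity uses just that dimension's values.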
The vector similarity determining device provided by the embodiment of the invention can execute the vector similarity determining method provided by the first embodiment or the second embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Example six
Fig. 12 is a schematic structural diagram of a vector search device according to a sixth embodiment of the present invention. As shown in fig. 12, the apparatus includes: a vector to be searched acquisition module 61, a vector search module 62, and a target vector determination module 63.
A to-be-searched vector acquisition module 61, configured to acquire a dense sparse vector to be searched;
the vector search module 62 is configured to search a search graph of a hierarchical navigable small world based on a graph algorithm, read dense sparse vectors to be matched in the search graph of the hierarchical navigable small world, and calculate a similarity between the dense sparse vectors to be searched and the dense sparse vectors to be matched, where the similarity is calculated according to the vector similarity determination method according to any embodiment of the present invention;
and a target vector determining module 63, configured to determine a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
The embodiment of the invention provides a vector search device that solves the problem of low recall caused by inaccurate similarity results when dense and sparse vectors are searched jointly: a new vector, namely a dense sparse vector, is obtained by splicing a dense vector and a sparse vector. The first similarity of the first dense vector and the second dense vector and the second similarity of the first sparse vector and the second sparse vector are calculated, and their weighted sum gives the final similarity of the first dense sparse vector and the second dense sparse vector, improving the accuracy of similarity calculation. Compared with the prior art, which calculates similarity separately from the dense-vector and sparse-vector angles, the device considers the dense vector and the sparse vector at the same time, yielding more accurate results and a higher recall rate. The implementation is simple, and data consistency is effectively guaranteed. Representing the data as dense sparse vectors better reflects the real spatial distribution of the data, avoiding the performance and recall losses caused in the traditional scheme by calculating the similarity of the dense vector and the sparse vector separately; the recall rate is guaranteed in the generic scenario, independent of any specific scenario.
Optionally, the apparatus further comprises:
the vector to be stored acquisition module is used for acquiring dense sparse vectors to be stored;
the vector storage module is used for storing the dense sparse vector to be stored through a pre-allocated memory space;
and the search graph construction module is used for constructing a search graph of the hierarchical navigable small world according to the dense sparse vector to be stored.
Optionally, the pre-allocated memory space includes a first memory space and a second memory space;
the vector storage module is specifically configured to: determining the length of the dense sparse vector to be stored and storing the length into a corresponding first memory space, judging whether the length of the dense sparse vector to be stored is larger than the length of a second memory space, and if so, storing the dense sparse vector to be stored into a corresponding virtual address space; otherwise, storing the dense sparse vector to be stored into a corresponding second memory space;
the length of the second memory space is determined according to the dimension of the dense vector in the dense sparse vector to be stored and the maximum non-zero value number of the constraint sparse vector; the size of the virtual address space is determined by rounding the data page for the maximum value of the length of the dense sparse vector to be stored.
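The tiered storage decision can be sketched as follows. The byte widths, `PAGE_SIZE`, and all names are illustrative assumptions; the fixed slot length is derived from the dense dimension and the constrained maximum number of non-zero sparse entries, and the virtual address space is sized by rounding the maximum vector length up to whole data pages, as the text describes.

```python
PAGE_SIZE = 4096  # assumed data-page size in bytes


def second_space_length(dense_dim, max_sparse_nonzeros, value_bytes=4, id_bytes=4):
    """Fixed slot length: dense values plus (identifier, value) pairs
    for the maximum allowed number of non-zero sparse entries."""
    return dense_dim * value_bytes + max_sparse_nonzeros * (id_bytes + value_bytes)


def virtual_space_size(max_vector_length):
    """Virtual address space size: the maximum dense sparse vector
    length rounded up to a whole number of data pages."""
    pages = -(-max_vector_length // PAGE_SIZE)  # ceiling division
    return pages * PAGE_SIZE


def store(vec_id, data, lengths, slots, virtual_space, slot_len):
    """Store one dense sparse vector into the tiered spaces."""
    lengths[vec_id] = len(data)        # length goes to the first memory space
    if len(data) > slot_len:
        virtual_space[vec_id] = data   # oversized: virtual address space
    else:
        slots[vec_id] = data           # fits: second memory space
```

Because each vector writes only into its own slot or virtual region, concurrent inserts need no lock protection, matching the lock-free property claimed for this layout.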
Optionally, the apparatus further comprises:
and the memory merging module is used for merging the pre-allocated memory spaces when the memory merging condition is met, so as to obtain the merged memory spaces.
Optionally, the memory merging module includes:
the total length determining unit is used for determining the total length of the storage space and the offset of each dense sparse vector to be stored according to the length stored in the first memory space of the memory space;
the space allocation unit is used for allocating a new memory space according to the total length of the memory space, wherein the new memory space comprises a third memory space and a fourth memory space;
and the merging unit is used for storing the offset of each to-be-stored dense sparse vector into a corresponding third memory space, and storing each to-be-stored dense sparse vector into a corresponding fourth memory space according to the offset to obtain a merged memory space.
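The merging units described above can be sketched as follows, under the assumption that the stored lengths are available per vector identifier and that each vector can be read back from its pre-merge location; all names are illustrative.

```python
def merge(lengths, read_vector):
    """Merge pre-allocated spaces into a compact layout.

    lengths: dict of vector id -> stored length (from the first memory space).
    read_vector: callable returning a vector's data given its id.
    Returns (offsets, body): the third memory space (per-vector offsets)
    and the fourth memory space (variable-length packed data).
    """
    offsets = {}
    total = 0
    for vec_id in sorted(lengths):       # total length and per-vector offsets
        offsets[vec_id] = total
        total += lengths[vec_id]
    body = [0] * total                   # fourth space sized to the total length
    for vec_id, off in offsets.items():
        data = read_vector(vec_id)
        body[off:off + len(data)] = data  # variable-length packing, no padding
    return offsets, body
```

After this step the redundant fixed-length slots and virtual regions can be released, and reads switch to the offset-based (variable-length) addressing path.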
Optionally, the vector search module 62 includes:
the length reading unit is used for reading the length from the corresponding first memory space according to the identification of the dense sparse vector to be matched;
the first vector reading unit is used for judging whether the length of the dense sparse vector to be matched is larger than that of the second memory space, and if so, reading the dense sparse vector to be matched from the corresponding virtual memory space; otherwise, reading the dense sparse vector to be matched from the corresponding second memory space.
Optionally, the vector search module 62 includes:
the offset determining unit is used for determining the offset of the dense sparse vector to be matched according to the identification of the dense sparse vector to be matched;
and the second vector reading unit is used for reading the dense sparse vector to be matched from a storage space corresponding to the search map of the hierarchical navigable small world according to the offset of the dense sparse vector to be matched.
The vector searching device provided by the embodiment of the invention can execute the vector searching method provided by the third embodiment or the fourth embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed method.
Example seven
Fig. 13 shows a schematic diagram of an electronic device 70 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 13, the electronic device 70 includes at least one processor 71, and a memory, such as a Read Only Memory (ROM) 72, a Random Access Memory (RAM) 73, etc., communicatively connected to the at least one processor 71, wherein the memory stores a computer program executable by the at least one processor, and the processor 71 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 72 or the computer program loaded from the storage unit 78 into the Random Access Memory (RAM) 73. In the RAM 73, various programs and data required for the operation of the electronic device 70 may also be stored. The processor 71, the ROM 72 and the RAM 73 are connected to each other via a bus 74. An input/output (I/O) interface 75 is also connected to bus 74.
Various components in the electronic device 70 are connected to the I/O interface 75, including: an input unit 76 such as a keyboard, a mouse, etc.; an output unit 77 such as various types of displays, speakers, and the like; a storage unit 78 such as a magnetic disk, an optical disk, or the like; and a communication unit 79 such as a network card, modem, wireless communication transceiver, etc. The communication unit 79 allows the electronic device 70 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Processor 71 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 71 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 71 performs the respective methods and processes described above, such as the vector similarity determination method or the vector search method.
In some embodiments, the vector similarity determination method or the vector search method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 78. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 70 via the ROM 72 and/or the communication unit 79. When the computer program is loaded into RAM 73 and executed by processor 71, one or more steps of the vector similarity determination method or vector search method described above may be performed. Alternatively, in other embodiments, the processor 71 may be configured to perform the vector similarity determination method or the vector search method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
The storage medium including computer instructions provided in an embodiment of the present application, when executed by a computer processor, is configured to perform a vector similarity determination method, the method including: acquiring a first dense sparse vector and a second dense sparse vector, wherein the first dense sparse vector is obtained by splicing the first dense vector and the first sparse vector, and the second dense sparse vector is obtained by splicing the second dense vector and the second sparse vector; calculating a first similarity and a second similarity based on the first dense sparse vector and the second dense sparse vector; and determining the similarity of the first dense sparse vector and the second dense sparse vector according to the first similarity and the second similarity.
The embodiment of the application provides a storage medium containing computer instructions, which when executed by a computer processor, are used for executing a vector search method, the method comprising: acquiring a dense sparse vector to be searched; searching a search graph of a hierarchical navigable small world based on a graph algorithm, reading a dense sparse vector to be matched in the search graph of the hierarchical navigable small world, and calculating the similarity between the dense sparse vector to be searched and the dense sparse vector to be matched, wherein the similarity is calculated according to the vector similarity determining method in any embodiment of the invention; and determining a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system, overcoming the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS (Virtual Private Server) services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (16)

1. A method for determining vector similarity, comprising:
acquiring a first dense sparse vector and a second dense sparse vector, wherein the first dense sparse vector is obtained by splicing the first dense vector and the first sparse vector, and the second dense sparse vector is obtained by splicing the second dense vector and the second sparse vector;
calculating a first similarity and a second similarity based on the first dense sparse vector and the second dense sparse vector;
And determining the similarity of the first dense sparse vector and the second dense sparse vector according to the first similarity and the second similarity.
2. The method of claim 1, wherein the computing a first similarity and a second similarity based on the first dense sparse vector and the second dense sparse vector comprises:
calculating the similarity of the first dense vector and the second dense vector to obtain a first similarity;
and calculating the similarity of the first sparse vector and the second sparse vector to obtain a second similarity.
3. The method of claim 2, wherein the calculating the similarity of the first sparse vector and the second sparse vector to obtain a second similarity comprises:
determining a first dimension identifier in the first sparse vector and a first vector value corresponding to the first dimension identifier;
determining a second dimension identifier in the second sparse vector and a second vector value corresponding to the second dimension identifier;
comparing each first dimension identifier with each second dimension identifier, and determining the same first dimension identifier and second dimension identifier as target dimension identifiers;
And calculating the similarity based on the first vector value and the second vector value corresponding to the target dimension identification, and determining the second similarity.
4. The method of claim 2, wherein the first similarity and the second similarity are inner product similarities.
5. The method of claim 1, wherein the determining the similarity of the first dense sparse vector and the second dense sparse vector from the first similarity and the second similarity comprises:
and carrying out weighted summation on the first similarity and the second similarity to obtain the similarity of the first dense sparse vector and the second dense sparse vector.
6. A vector search method, comprising:
acquiring a dense sparse vector to be searched;
searching a search graph of a hierarchical navigable small world based on a graph algorithm, reading a dense sparse vector to be matched in the search graph of the hierarchical navigable small world, and calculating the similarity between the dense sparse vector to be searched and the dense sparse vector to be matched, wherein the similarity is calculated according to the vector similarity determining method according to any one of claims 1-5;
and determining a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
7. The method of claim 6, wherein constructing the search graph of the hierarchical navigable small world comprises:
acquiring a dense sparse vector to be stored;
storing the dense sparse vector to be stored through a pre-allocated memory space;
constructing a hierarchical navigable small world search graph according to the dense sparse vector to be stored;
wherein the similarity calculated when constructing a hierarchical navigable small world search graph from the dense sparse vectors to be stored is calculated using the vector similarity determination method of any one of claims 1-5.
8. The method of claim 7, wherein the pre-allocated memory space comprises a first memory space and a second memory space;
the storing the dense sparse vector to be stored through the pre-allocated memory space comprises:
determining the length of the dense sparse vector to be stored and storing the length into a corresponding first memory space, judging whether the length of the dense sparse vector to be stored is larger than the length of a second memory space, and if so, storing the dense sparse vector to be stored into a corresponding virtual address space; otherwise, storing the dense sparse vector to be stored into a corresponding second memory space;
The length of the second memory space is determined according to the dimension of the dense vector in the dense sparse vector to be stored and the maximum non-zero value number of the constraint sparse vector; the size of the virtual address space is determined by rounding the data page for the maximum value of the length of the dense sparse vector to be stored.
9. The method of claim 8, wherein the reading dense sparse vectors to be matched in the search graph of the hierarchical navigable small world comprises:
reading the length from the corresponding first memory space according to the identification of the dense sparse vector to be matched;
judging whether the length of the dense sparse vector to be matched is larger than that of the second memory space, if so, reading the dense sparse vector to be matched from the corresponding virtual memory space; otherwise, reading the dense sparse vector to be matched from the corresponding second memory space.
10. The method as recited in claim 7, further comprising:
and when the memory merging condition is met, merging the pre-allocated memory space to obtain a merged memory space.
11. The method of claim 10, wherein the merging the pre-allocated memory space to obtain the merged memory space comprises:
Determining the total length of a storage space and the offset of each dense sparse vector to be stored according to the length stored in a first memory space of the memory space;
allocating a new memory space according to the total length of the memory space, wherein the new memory space comprises a third memory space and a fourth memory space;
and storing the offset of each to-be-stored dense sparse vector into a corresponding third memory space, and storing each to-be-stored dense sparse vector into a corresponding fourth memory space according to the offset to obtain a combined memory space.
12. The method of claim 11, wherein the reading dense sparse vectors to be matched in the search graph of the hierarchical navigable small world comprises:
determining the offset of the dense sparse vector to be matched according to the identification of the dense sparse vector to be matched;
and reading the dense sparse vector to be matched from a storage space corresponding to the search map of the hierarchical navigable small world according to the offset of the dense sparse vector to be matched.
13. A vector similarity determination apparatus, comprising:
the vector acquisition module is used for acquiring a first dense sparse vector and a second dense sparse vector, wherein the first dense sparse vector is obtained by splicing the first dense vector and the first sparse vector, and the second dense sparse vector is obtained by splicing the second dense vector and the second sparse vector;
the similarity calculation module is used for calculating a first similarity and a second similarity based on the first dense sparse vector and the second dense sparse vector;
and the similarity determining module is used for determining the similarity of the first dense sparse vector and the second dense sparse vector according to the first similarity and the second similarity.
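As one concrete (hypothetical) reading of the device in claim 13: split each dense sparse vector back into its dense prefix and sparse suffix, score each part separately, and combine the two scores. The inner product and the plain-sum fusion below are assumptions for illustration only — this excerpt does not state which metrics are used or how the first and second similarities are combined.

```python
def similarity(vec_a, vec_b, dense_dim):
    """Combine a dense similarity and a sparse similarity (illustrative)."""
    # First similarity: inner product over the dense prefix.
    first = sum(x * y for x, y in zip(vec_a[:dense_dim], vec_b[:dense_dim]))
    # Second similarity: inner product over the sparse suffix.
    second = sum(x * y for x, y in zip(vec_a[dense_dim:], vec_b[dense_dim:]))
    return first + second  # assumed fusion rule

a = [1.0, 2.0] + [0.0, 3.0]   # dense part spliced with sparse part
b = [0.5, 1.0] + [0.0, 1.0]
print(similarity(a, b, dense_dim=2))  # 5.5
```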
14. A vector search apparatus, comprising:
the vector acquisition module is used for acquiring dense sparse vectors to be searched;
the vector search module is used for searching the search graph of the hierarchical navigable small world based on a graph algorithm, reading the dense sparse vector to be matched in the search graph of the hierarchical navigable small world, and calculating the similarity between the dense sparse vector to be searched and the dense sparse vector to be matched, wherein the similarity is calculated by the vector similarity determination method of any one of claims 1-5;
and the target vector determining module is used for determining a target dense sparse vector matched with the dense sparse vector to be searched according to the similarity.
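The search module of claim 14 traverses a hierarchical navigable small world (HNSW) graph. A heavily simplified single-layer greedy walk is sketched below — real HNSW uses multiple layers and beam search, so this shows only the core "hop to the closer neighbour" step, with squared Euclidean distance standing in for the similarity of claims 1-5:

```python
def greedy_search(graph, vectors, query, start):
    """Greedy walk: hop to the neighbour closest to the query until stuck."""
    def dist(vid):
        return sum((x - y) ** 2 for x, y in zip(vectors[vid], query))
    current = start
    while True:
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return current            # local minimum: target vector found
        current = best

# Toy single-layer graph over four 2-D vectors.
vectors = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (3, 0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search(graph, vectors, query=(2.9, 0), start=0))  # 3
```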
15. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the vector similarity determination method of any one of claims 1-5 or the vector search method of any one of claims 6-12.
16. A computer readable storage medium storing computer instructions for causing a processor to implement the vector similarity determination method of any one of claims 1-5 or the vector search method of any one of claims 6-12 when executed.
CN202311607446.6A 2023-11-28 2023-11-28 Vector similarity determining method and vector searching method Active CN117407727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311607446.6A CN117407727B (en) 2023-11-28 2023-11-28 Vector similarity determining method and vector searching method

Publications (2)

Publication Number Publication Date
CN117407727A true CN117407727A (en) 2024-01-16
CN117407727B CN117407727B (en) 2024-05-14

Family

ID=89487268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311607446.6A Active CN117407727B (en) 2023-11-28 2023-11-28 Vector similarity determining method and vector searching method

Country Status (1)

Country Link
CN (1) CN117407727B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190347256A1 (en) * 2018-05-14 2019-11-14 Google Llc Efficient inner product operations
CN115035114A (en) * 2022-08-11 2022-09-09 高密德隆汽车配件制造有限公司 Method for monitoring state of hay grinder based on image processing
CN115713078A (en) * 2022-10-28 2023-02-24 沈阳东软智能医疗科技研究院有限公司 Knowledge graph construction method and device, storage medium and electronic equipment
CN116777528A (en) * 2023-06-14 2023-09-19 苏宁易购集团股份有限公司 Commodity information recommendation method and device, computer equipment and storage medium
CN116993374A (en) * 2022-09-23 2023-11-03 中国移动通信集团广东有限公司 Model optimization method, device, equipment and medium based on deep neural network

Also Published As

Publication number Publication date
CN117407727B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
JP2017224287A (en) System and method for compressing graph via clique
CN112860993B (en) Method, device, equipment, storage medium and program product for classifying points of interest
CN113568940B (en) Method, device, equipment and storage medium for data query
CN112434188A (en) Data integration method and device for heterogeneous database and storage medium
CN114817651B (en) Data storage method, data query method, device and equipment
CN114898357B (en) Defect identification method and device, electronic equipment and computer readable storage medium
WO2022193872A1 (en) Method and apparatus for determining spatial relationship, computer device, and storage medium
CN112925912B (en) Text processing method, synonymous text recall method and apparatus
CN116524165B (en) Migration method, migration device, migration equipment and migration storage medium for three-dimensional expression model
CN117407727B (en) Vector similarity determining method and vector searching method
CN114329016B (en) Picture label generating method and text mapping method
CN114897666B (en) Graph data storage, access, processing method, training method, device and medium
CN113255610B (en) Feature base building method, feature retrieval method and related device
CN115454344A (en) Data storage method and device, electronic equipment and storage medium
CN110321435B (en) Data source dividing method, device, equipment and storage medium
CN113723405A (en) Method and device for determining area outline and electronic equipment
CN110909097A (en) Polygonal electronic fence generation method and device, computer equipment and storage medium
CN116304253B (en) Data storage method, data retrieval method and method for identifying similar video
CN113343047B (en) Data processing method, data retrieval method and device
CN113392370B (en) SLAM system
CN114491318B (en) Determination method, device, equipment and storage medium of target information
CN113033196B (en) Word segmentation method, device, equipment and storage medium
CN113779197B (en) Data set searching method and device, storage medium and terminal
CN116560817B (en) Task execution method, device, electronic equipment and storage medium
CN114840721B (en) Data searching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant