CN117112854A

CN117112854A - Data query method, device, electronic equipment and storage medium

Info

Publication number: CN117112854A
Application number: CN202311054889.7A
Authority: CN
Inventors: 曹铭斌; 蒋少东; 欧旭新
Original assignee: Midea Network Information Service Shenzhen Co ltd
Current assignee: Midea Network Information Service Shenzhen Co ltd
Priority date: 2023-08-18
Filing date: 2023-08-18
Publication date: 2023-11-24

Abstract

The invention relates to the technical field of big data and provides a data query method, a device, electronic equipment and a storage medium. The method comprises: in response to a data query request, constructing a target connected graph based on a plurality of original vector data respectively included in different written index objects, wherein each original vector data comprises a query text vector and a corresponding query result vector, and the data query request carries a target query text vector and a query result quantity requirement; searching, based on the target connected graph, for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text, and determining at least one target coordinate position from the plurality of candidate coordinate positions based on the query result quantity requirement; and determining the query result vector of the query text vector corresponding to the at least one target coordinate position as the query result of the data query request. By combining neighbor text query with a graph algorithm, the invention reduces the time complexity of data query and improves data query efficiency.

Description

Data query method, device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of big data technologies, and in particular, to a data query method, a device, an electronic device, and a storage medium.

Background

In the big data field, the Elasticsearch component is a free and open distributed search and analytics engine applicable to all types of data, including text, numerical, geospatial, structured and unstructured data. The full-text search function of the Elasticsearch component segments a query text into words and then calculates a relevance score for each word in order to find result texts similar to the query text; it is essentially a term-based search. However, when broader result texts need to be recalled for the synonyms or hyponyms of each word, the full-text search function is not applicable, and the recall of broader result texts can instead be accomplished using the vector search function provided by the Elasticsearch 7.X component. Therefore, how to use the Elasticsearch 7.X component for fast and accurate vector queries is a key issue that currently needs to be addressed.

In the related art, following the implementation principle of the vector search function provided by the Elasticsearch 7.X component, result text vectors matching a query text vector are searched for through a script_score query, a vector similarity score is then calculated for each of the matched result text vectors, and a target result text is fed back to the user based on the vector similarity score of each result text vector.

However, the script_score query uses a linear query mode: all currently stored vector data must be traversed one by one, with no way of skipping entries, until the result text vectors matching the query text vector have been found. The time complexity of the linear query is therefore high and the query efficiency is low.
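The cost of such a linear query can be illustrated with a minimal sketch. It is not the script_score implementation itself; the function names, the data layout and the choice of cosine similarity are assumptions made only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def linear_scan(query_vector, stored, k):
    """Score every stored vector against the query and keep the k best.

    `stored` is a list of (doc_id, vector) pairs (hypothetical layout).
    Every entry is visited, so the cost grows linearly with the number
    of stored vectors.
    """
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

Because no stored vector can be skipped, doubling the number of stored vectors roughly doubles the work per query.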

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the related art. To this end, the invention provides a data query method that combines neighbor text query with a graph algorithm, so that the final query result only needs to be determined from the part of the original vector data that satisfies a preset neighbor relation, without traversing all currently stored vector data one by one. This reduces the time complexity of data query and greatly improves data query efficiency.

The invention also provides a data query device.

The invention further provides electronic equipment.

The invention also proposes a non-transitory computer readable storage medium.

According to an embodiment of the first aspect of the present invention, a data query method includes:

responding to a data query request, and constructing a target connected graph based on a plurality of original vector data respectively included by different written index objects, wherein the original vector data comprise query text vectors and corresponding query result vectors, coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected, and the data query request carries target query text vectors and query result quantity requirements;

searching, based on the target connected graph, for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text, and determining at least one target coordinate position from the plurality of candidate coordinate positions based on the query result quantity requirement;

and determining a query result vector of the query text vector corresponding to the at least one target coordinate position as a query result of the data query request.

According to the data query method provided by the embodiment of the invention, when the Elasticsearch 7.X component responds to a data query request, it first constructs a target connected graph based on a plurality of original vector data respectively included in different written index objects. Given that the coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected and that the data query request carries a target query text vector and a query result quantity requirement, a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text are searched for based on the target connected graph, at least one target coordinate position is determined from the plurality of candidate coordinate positions based on the query result quantity requirement, and the query result vector of the query text vector corresponding to the at least one target coordinate position is then determined as the query result of the data query request. In this way, by combining neighbor text query with a graph algorithm, there is no need to traverse all currently stored vector data one by one; the final query result only needs to be determined from the part of the original vector data that satisfies the preset neighbor relation, so that the time complexity of data query is reduced and data query efficiency is greatly improved.

According to one embodiment of the present invention, the constructing a target connected graph based on a plurality of original vector data respectively included in different written index objects includes:

constructing a sub-graph of each index object based on the plurality of original vector data included in that index object, wherein the coordinate positions corresponding to the query text vectors of the plurality of original vector data in the sub-graph are connected to each other;

and constructing the target connected graph based on the sub-graphs, wherein all the sub-graphs in the target connected graph are connected to each other.

According to one embodiment of the present invention, for each of the written different index objects, the writing process of the plurality of original vector data included in the index object includes:

responding to a data writing instruction, respectively vectorizing a plurality of original data to be written to obtain a plurality of original vector data; each piece of original vector data comprises the query result vector and the query text vector;

determining index objects corresponding to the plurality of original vector data based on the plurality of original vector data and an index object generation method of a preset plugin library, wherein the index objects comprise the plurality of original vector data;

storing the index object into a Segment document under a Lucene file, and, upon determining that the accumulated duration of responding to the data writing instruction reaches a preset duration, flushing the Segment document to disk, wherein the Segment document stored on the disk is used for constructing the target connected graph.
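The timed flush in the writing process above can be sketched as follows. This is a minimal illustration rather than Lucene or Elasticsearch code: the class name `WriteBuffer`, the injected `clock` callable and the in-memory `flushed` list standing in for the disk are all hypothetical.

```python
class WriteBuffer:
    """Collects index objects in memory and flushes them after a preset duration."""

    def __init__(self, flush_interval_s, clock):
        self.flush_interval_s = flush_interval_s  # the preset duration, in seconds
        self.clock = clock                        # callable returning the current time
        self.pending = []                         # stands in for the in-memory Segment
        self.flushed = []                         # stands in for Segments on disk
        self.window_start = clock()

    def write(self, index_object):
        """Buffer one index object; flush if the preset duration has elapsed."""
        self.pending.append(index_object)
        if self.clock() - self.window_start >= self.flush_interval_s:
            self.flush()

    def flush(self):
        """Move everything buffered so far to 'disk' and start a new window."""
        if self.pending:
            self.flushed.append(list(self.pending))
            self.pending.clear()
        self.window_start = self.clock()
```

Passing the clock in as a parameter keeps the sketch deterministic; a real deployment would presumably rely on the engine's own refresh and flush scheduling.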

According to an embodiment of the present invention, the preset plugin library is an HNSW plugin library, and the determining, based on the plurality of original vector data and an index object generating method of the preset plugin library, an index object corresponding to the plurality of original vector data includes:

converting the plurality of original vector data into data types supported by the HNSW plugin library respectively to obtain a plurality of vector data conversion results;

adding the vector data conversion results to a preset index of a preset Elasticsearch component;

and, based on an addition completion identifier, calling the index object generation method of the HNSW plugin library to determine the index object.

According to an embodiment of the present invention, the preset plugin library is an LSH plugin library, and the determining an index object corresponding to the plurality of original vector data based on the plurality of original vector data and an index object generating method of the preset plugin library includes:

acquiring target hash functions respectively corresponding to the plurality of original data to be written;

for each original data to be written and each target hash function, hashing the original vector data with the target hash function based on the hash index object generation method of the LSH plugin library, and determining the hash value corresponding to the original vector data;

and, for each hash value and each original data to be written, mapping the original data to be written into the hash bucket in which the hash value is located, and determining the index object.
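The bucket mapping described above can be sketched with one common locality-sensitive hash family, random-hyperplane (sign) hashing. The source does not say which hash functions the LSH plugin library uses, so the hash family, the function names and the fixed seed are assumptions.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw `num_planes` random hyperplanes; each one acts as a hash function."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_hash_index(records, hyperplanes):
    """Map each (record_id, vector) pair into the bucket named by its hash value."""
    buckets = {}
    for record_id, vector in records:
        key = hash_vector(vector, hyperplanes)
        buckets.setdefault(key, []).append(record_id)
    return buckets
```

Vectors pointing in similar directions tend to share a bucket, so a query only needs to be compared against the contents of its own bucket.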

According to an embodiment of the present invention, the preset plugin library is an IVSPQ plugin library, and the determining, based on the plurality of original vector data and an index object generating method of the preset plugin library, an index object corresponding to the plurality of original vector data includes:

for each original vector data, determining a target clustering point and a target coding mode corresponding to the original vector data based on a mapping relation between vector data and a clustering point-coding mode which are stored in advance in the IVSPQ plugin library;

for the target clustering point and the target coding mode corresponding to each original vector data, coding and clustering the original data to be written using the target coding mode and the target clustering point, based on the coding index object generation method of the IVSPQ plugin library, to obtain a cluster coding result;

and, for each cluster coding result and each original data to be written, mapping the original data to be written into the coding table in which the cluster coding result is located, and determining the index object.
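As a rough illustration of the cluster-then-encode idea, the sketch below assigns each vector to its nearest clustering point and then encodes the residual piecewise against small codebooks, in the style of inverted-file product quantization. Whether the IVSPQ plugin library works this way is an assumption; the centroids, codebooks and function names are hypothetical.

```python
def nearest(point, candidates):
    """Index of the candidate closest to `point` (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, cand in enumerate(candidates):
        dist = sum((p - q) ** 2 for p, q in zip(point, cand))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def encode(vector, centroids, codebooks):
    """Return (cluster id, code tuple) for one vector.

    The residual to the chosen centroid is split into len(codebooks) equal
    parts, and each part is replaced by the index of its nearest codeword.
    """
    cluster = nearest(vector, centroids)
    residual = [v - c for v, c in zip(vector, centroids[cluster])]
    sub_len = len(vector) // len(codebooks)
    code = tuple(
        nearest(residual[i * sub_len:(i + 1) * sub_len], codebooks[i])
        for i in range(len(codebooks))
    )
    return cluster, code


def build_coding_table(records, centroids, codebooks):
    """Group record ids by their (cluster id, code) entry in the coding table."""
    table = {}
    for record_id, vector in records:
        table.setdefault(encode(vector, centroids, codebooks), []).append(record_id)
    return table
```

In this sketch the mapping from a vector to its clustering point and codewords is computed by nearest-neighbor lookup; the source instead describes a pre-stored mapping relation, which could replace the `nearest` calls.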

According to one embodiment of the invention, the method further comprises:

and storing the original vector data in a preset data type of the preset plugin library, wherein the preset data type is used for storing original vector data without a limit on the number of dimensions.

According to one embodiment of the present invention, the searching for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text based on the target connected graph includes:

loading the target connected graph into memory, and searching for the candidate coordinate positions based on the target connected graph;

or,

searching for the candidate coordinate positions based on a result of loading the target connected graph and on the target connected graph.

comparing the written different index objects with the index objects used by the previously constructed connected graph;

determining that the written different index objects differ from the index objects used by the previously constructed connected graph, and constructing the target connected graph based on the plurality of original vector data respectively included in the written different index objects.

According to an embodiment of the second aspect of the present invention, a data query device includes:

a connected graph construction module, configured to, in response to a data query request, construct a target connected graph based on a plurality of original vector data respectively included in different written index objects, wherein the original vector data comprise query text vectors and corresponding query result vectors, the coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected, and the data query request carries a target query text vector and a query result quantity requirement;

a data query module, configured to search, based on the target connected graph, for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text, and to determine at least one target coordinate position from the plurality of candidate coordinate positions based on the query result quantity requirement; and to determine the query result vector of the query text vector corresponding to the at least one target coordinate position as the query result of the data query request.

According to the data query device provided by the embodiment of the invention, when a data query request is responded to, a target connected graph is first constructed based on a plurality of original vector data respectively included in different written index objects. Given that the coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected and that the data query request carries a target query text vector and a query result quantity requirement, a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text are searched for based on the target connected graph, at least one target coordinate position is determined from the plurality of candidate coordinate positions based on the query result quantity requirement, and the query result vector of the query text vector corresponding to the at least one target coordinate position is then determined as the query result of the data query request.

The above technical solutions in the embodiments of the present invention have at least one of the following technical effects: when the Elasticsearch 7.X component responds to a data query request, it first constructs a target connected graph based on a plurality of original vector data respectively included in different written index objects. Given that the coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected and that the data query request carries a target query text vector and a query result quantity requirement, a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text are searched for based on the target connected graph, at least one target coordinate position is determined from the plurality of candidate coordinate positions based on the query result quantity requirement, and the query result vector of the query text vector corresponding to the at least one target coordinate position is then determined as the query result of the data query request. In this way, by combining neighbor text query with a graph algorithm, there is no need to traverse all currently stored vector data one by one; the final query result only needs to be determined from the part of the original vector data that satisfies the preset neighbor relation, so that the time complexity of data query is reduced and data query efficiency is greatly improved.

Furthermore, constructing the sub-graph of each index object first and then constructing the target connected graph based on the sub-graphs improves the accuracy and reliability of connected graph construction.

Furthermore, the Elasticsearch 7.X component accomplishes data writing by vectorizing the original data to be written, determining an index object, and then flushing the index object stored in a Segment document under a Lucene file to disk, so that the target connected graph can be constructed accurately and quickly when a data query request is subsequently responded to, which provides reliable data support for improving data query efficiency. Further, when the accumulated duration of responding to the data writing instruction reaches the preset duration, all index objects determined within the preset duration are stored in the Segment document under the Lucene file. In this way, the storage operation for the index objects corresponding to different Segment documents is guaranteed to be executed once every preset duration, which improves the efficiency of subsequently flushing the Segment documents to disk.

Still further, storing the original vector data in the preset data type of the preset plugin library overcomes the defect that the existing dense_vector type in the Elasticsearch 7.X component cannot store vector data of millions or even tens of millions of dimensions, broadens the applicability of vector data storage using the preset data type of the preset plugin library, and provides a reliable guarantee for the richness and comprehensiveness of subsequent query results.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

FIG. 1 is a schematic flow chart of a data query method provided by the invention;

FIG. 2 is a flow chart of data writing provided by the present invention when the preset plugin library is the HNSW plugin library;

FIG. 3 is a flow chart of data writing provided by the present invention when the preset plugin library is the LSH plugin library;

FIG. 4 is a flow chart of data writing provided by the present invention when the preset plugin library is the IVSPQ plugin library;

FIG. 5 is a schematic diagram of a data query device according to the present invention;

FIG. 6 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: A alone, both A and B, and B alone, where A and B may each be singular or plural. In the text of the present invention, the character "/" generally indicates an "or" relationship between the objects before and after it. In addition, it should be noted that ordinal terms such as "first" and "second" used for objects described in the present invention merely distinguish the described objects and do not imply any order or technical meaning.

In the big data field, the Elasticsearch component is a free and open distributed search and analytics engine applicable to all types of data, including text, numerical, geospatial, structured and unstructured data. The full-text search function of the Elasticsearch component segments the query text into words and then calculates a relevance score for each word in order to find result texts similar to the query text; it is essentially a term-based search. However, when broader result texts need to be recalled for the synonyms or hyponyms of each word, the full-text search function is not applicable, and the recall of broader result texts can instead be accomplished using the vector search function provided by the Elasticsearch 7.X component. Therefore, how to use the Elasticsearch 7.X component for fast and accurate vector queries is a key issue that currently needs to be addressed.

In the related art, following the implementation principle of the vector search function provided by the Elasticsearch 7.X component, result text vectors matching a query text vector are searched for through a script_score query, a vector similarity score is calculated for each of the matched result text vectors, and a target result text is then fed back to the user based on the vector similarity score of each result text vector.

In order to solve the above technical problems, the invention provides a data query method, a data query device, electronic equipment and a storage medium, which are described below with reference to FIG. 1 to FIG. 6. The execution subject of the data query method can be an Elasticsearch 7.X component, that is, version 7.X of the Elasticsearch component. The Elasticsearch 7.X component has software extension capability, so a preset plugin library with neighbor search capability can be implemented in it; the preset plugin library is a third-party plugin written into the Elasticsearch 7.X component in software form. The Elasticsearch 7.X component has at least neighbor search capability and target connected graph construction capability.

The following method embodiments are described taking as an example the case where the execution subject is the Elasticsearch 7.X component containing the preset plugin library.

In order to facilitate understanding of the data query method provided by the present invention, the data query method provided by the present invention will be described in detail by the following several exemplary embodiments. It is to be understood that the following several exemplary embodiments may be combined with each other and that some embodiments may not be repeated for the same or similar concepts or processes.

FIG. 1 is a flow chart of the data query method provided by the present invention. As shown in FIG. 1, the data query method includes the following steps 110 to 130.

Step 110, in response to a data query request, constructing a target connected graph based on a plurality of original vector data respectively included in different written index objects, wherein each original vector data includes a query text vector and a corresponding query result vector, and the coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected; the data query request carries a target query text vector and a query result quantity requirement.

Each index object corresponds to a physical query index. In addition to the query result vector and the corresponding query text vector, each original vector data can also include a query text number vector, a service primary key field vector and the like. Only the query text vector in each original vector data participates in the construction of the target connected graph and in the subsequent data query; the other information, such as the query result vector, the query text number vector and the service primary key field vector, is only stored as data. The target query text vector is used for locating the starting query position in the target connected graph, and the query result quantity requirement is used for determining the final number of query results fed back.

Specifically, when the Elasticsearch 7.X component receives a data query request sent by a user, it immediately responds to the data query request and can construct a target connected graph based on the plurality of original vector data respectively included in the different written index objects. For example, if 3 index objects have been written when the data query request is responded to, and the 1st index object includes 4 original vector data, the 2nd index object includes 5 original vector data, and the 3rd index object includes 3 original vector data, then in the constructed target connected graph the coordinate positions corresponding to the 4 original vector data, the coordinate positions corresponding to the 5 original vector data, and the coordinate positions corresponding to the 3 original vector data are all mutually connected.

It should be noted that the type of the query result vector may be determined according to the type of the query text vector. For example, when the query text vector is of a question type, such as "How is the weather today", the query result vector is of an answer type, such as "The weather today is clear".

Step 120, searching, based on the target connected graph, for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text, and determining at least one target coordinate position from the plurality of candidate coordinate positions based on the query result quantity requirement.

The query result quantity requirement may be a specific number threshold or a recall rate; the present invention is not particularly limited in this respect. When the query result quantity requirement is a recall rate, the product of the total number of candidate coordinate positions and the recall rate can be used as the number threshold of query results.
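The conversion from a quantity requirement to a number threshold can be sketched as follows. Distinguishing the two forms by Python type (int for a count, float for a recall rate), rounding the product up, and capping at the number of candidates are assumptions; the source only states that the product is used.

```python
import math


def result_count_threshold(requirement, total_candidates):
    """Turn a query result quantity requirement into a number threshold.

    An int is treated as an explicit count; a float in (0, 1] is treated
    as a recall rate applied to the total number of candidate positions.
    """
    if isinstance(requirement, int):
        return min(requirement, total_candidates)
    if not 0.0 < requirement <= 1.0:
        raise ValueError("recall rate must be in (0, 1]")
    # Assumption: round up so that a non-zero recall rate never yields 0 results.
    return min(total_candidates, math.ceil(total_candidates * requirement))
```

For example, with 10 candidate coordinate positions and a recall rate of 0.25, the product is 2.5 and this sketch would return a threshold of 3.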

Specifically, to search the target connected graph for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text, the coordinate position of the target query text in the target connected graph can be used as the starting point position, and each coordinate position in the target connected graph that satisfies a preset neighbor relation with the starting point position can be determined as a candidate coordinate position. The preset neighbor relation can be determined based on a preset neighbor query algorithm. For example, the preset neighbor relation may be that the distance to the starting point position satisfies a preset distance threshold, in which case all coordinate positions in the target connected graph whose distance to the starting point position satisfies the preset distance threshold can be used as candidate coordinate positions. Alternatively, the preset neighbor relation may be that a circle with a preset radius is drawn around the starting point position, and all coordinate positions inside the circle other than the starting point position are candidate coordinate positions.
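A distance-threshold neighbor relation over a graph can be sketched as a breadth-first walk from the starting point that only expands nodes lying within the preset radius. This is one possible reading, not the plugin library's algorithm; the dictionary layout, the Euclidean metric and the function names are assumptions, and the starting point is assumed to already be a node of the graph.

```python
import math
from collections import deque


def euclidean(a, b):
    """Euclidean distance between two coordinate positions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidates_within_radius(positions, edges, start, radius):
    """Collect nodes reachable from `start` whose distance to it is <= radius.

    `positions` maps node -> coordinates and `edges` maps node -> neighbors.
    Nodes outside the radius are not expanded, so far-away regions of the
    graph are never visited.
    """
    origin = positions[start]
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if euclidean(positions[neighbor], origin) <= radius:
                found.append(neighbor)
                queue.append(neighbor)
    return found
```

Because the walk stops at the radius boundary, the number of visited nodes depends on the local neighborhood rather than on the total number of stored vectors. A node inside the radius that is reachable only through nodes outside it would be missed, which is the trade-off of this pruning.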

At this time, if the total number of candidate coordinate positions is consistent with the query result quantity requirement, that is, the total number of candidate coordinate positions equals the number threshold determined based on the query result quantity requirement, all of the candidate coordinate positions may be used as target coordinate positions. Otherwise, if the total number of candidate coordinate positions is inconsistent with the query result quantity requirement, that is, the number threshold determined based on the query result quantity requirement is smaller than the total number of candidate coordinate positions, a relevance score algorithm may be used to calculate a relevance score for each candidate coordinate position, and the at least one candidate coordinate position with the highest relevance score may be determined as the target coordinate position. The relevance score algorithm can be the Best Matching 25 (BM25) algorithm or the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. The 25 in BM25 refers to the 25th iteration of the algorithm, and BM25 is the most mainstream algorithm in the current information retrieval field for calculating similarity scores between queries and documents; TF-IDF is a statistical method for evaluating the importance of a word to one document in a document set or corpus.
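The selection rule above can be sketched as follows, with the relevance scores supplied as a precomputed mapping instead of being calculated by BM25 or TF-IDF. The function name and the handling of the case where there are fewer candidates than the threshold (all candidates are returned) are assumptions.

```python
def select_targets(candidates, scores, threshold):
    """Pick target coordinate positions from the candidates.

    If the candidates do not exceed the threshold, all of them are targets.
    Otherwise the `threshold` candidates with the highest relevance score
    are kept, best first.
    """
    if len(candidates) <= threshold:
        return list(candidates)
    ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
    return ranked[:threshold]
```

Ranking is only needed on the already reduced candidate set, so its cost does not depend on the total number of stored vectors.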

Step 130, determining the query result vector of the query text vector corresponding to the at least one target coordinate position as the query result of the data query request.

Specifically, since each original vector data includes a query text vector and a corresponding query result vector, when determining each target coordinate position, the query result vector of the query text vector corresponding to each target coordinate position may be determined as the query result of the data query request.

According to the data query method provided by the embodiment of the invention, when the Elasticsearch 7.X component responds to a data query request, it first constructs a target connected graph based on a plurality of original vector data respectively included in different written index objects. Given that the coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected and that the data query request carries a target query text vector and a query result quantity requirement, a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text are searched for based on the target connected graph, at least one target coordinate position is determined from the plurality of candidate coordinate positions based on the query result quantity requirement, and the query result vector of the query text vector corresponding to the at least one target coordinate position is then determined as the query result of the data query request. In this way, by combining neighbor text query with a graph algorithm, there is no need to traverse all currently stored vector data one by one; the final query result only needs to be determined from the part of the original vector data that satisfies the preset neighbor relation, so that the time complexity of data query is reduced and data query efficiency is greatly improved.

It will be appreciated that, based on the data query method shown in FIG. 1, in an exemplary embodiment, when constructing the connected graph, a sub-graph may first be constructed based on the plurality of original vector data included in each index object, and the target connected graph may then be constructed based on the sub-graphs. Based on this, the specific process in step 110 of constructing the target connected graph based on the plurality of original vector data respectively included in the different written index objects may include:

firstly, constructing a sub-graph of each index object based on the plurality of original vector data included in that index object, wherein the coordinate positions corresponding to the query text vectors of the plurality of original vector data in the sub-graph are connected to each other; then, constructing the target connected graph based on the sub-graphs, wherein all the sub-graphs in the target connected graph are connected to each other.

Specifically, each coordinate position in each sub-graph comprises a vertex together with the coordinates of that vertex.

For example, if 3 index objects have been written when the data query request is responded to, and the 1st index object includes 4 original vector data, the 2nd index object includes 5 original vector data, and the 3rd index object includes 3 original vector data, then the constructed target connected graph includes sub-graph p1 of the 1st index object, sub-graph p2 of the 2nd index object, and sub-graph p3 of the 3rd index object. Sub-graph p1 includes 4 mutually connected coordinate positions, sub-graph p2 includes 5 mutually connected coordinate positions, sub-graph p3 includes 3 mutually connected coordinate positions, and the 4, 5 and 3 coordinate positions of the three sub-graphs are in turn connected to one another. In addition, the target connected graph is constructed in a D-dimensional space: based on the plurality of original vector data respectively included in the different written index objects, a target connected graph in which the sub-graphs of the index objects are mutually connected is constructed in the D-dimensional space, where D is an integer greater than 500. D is chosen to be greater than 500 because the existing Elasticsearch 7.X component can only perform data queries on vector data sets of no more than 500 dimensions, and with only a mediocre recall effect. By constructing the target connected graph in a D-dimensional space when responding to a data query request, the method of the invention allows the constructed target connected graph to be used for data query operations on vector data sets on the scale of millions or even tens of millions.

According to the data query method provided by the invention, the accuracy and reliability of constructing the connected graph are improved by first constructing the sub-graph of each index object and then constructing the target connected graph based on the sub-graphs.
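The 4/5/3 example above can be sketched in Python as follows. This is only an illustration under stated assumptions: each sub-graph is built as a complete graph over its own vertices, and consecutive sub-graphs are joined by a single bridging edge between their first vertices. The description only requires that the sub-graphs be mutually connected and does not say how; the helper names are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_subgraph(vertex_ids):
    """Connect every pair of vertices belonging to one index object."""
    adj = {v: set() for v in vertex_ids}
    for a, b in combinations(vertex_ids, 2):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def build_target_graph(index_objects):
    """Merge per-index-object sub-graphs and bridge them (assumed strategy).

    index_objects: list of non-empty lists of vertex ids, one list per
    written index object.
    """
    graph = {}
    anchors = []
    for vertex_ids in index_objects:
        graph.update(build_subgraph(vertex_ids))
        anchors.append(vertex_ids[0])
    # Assumed linking rule: one edge between consecutive sub-graphs.
    for a, b in zip(anchors, anchors[1:]):
        graph[a].add(b)
        graph[b].add(a)
    return graph


def is_connected(graph):
    """Breadth-first check that every vertex is reachable from any other."""
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        for nb in graph[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(graph)


# Three index objects with 4, 5 and 3 original vector data respectively.
p1 = [f"p1_{i}" for i in range(4)]
p2 = [f"p2_{i}" for i in range(5)]
p3 = [f"p3_{i}" for i in range(3)]
target_graph = build_target_graph([p1, p2, p3])
edge_count = sum(len(nbs) for nbs in target_graph.values()) // 2
```

With this linking rule the graph has 12 vertices and 6 + 10 + 3 + 2 = 21 edges, and any vertex can be reached from any other.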

It may be appreciated that, based on the data query method shown in fig. 1, in an example embodiment, the writing process of the plurality of original vector data included in one index object may specifically include:

firstly, in response to a data writing instruction, vectorizing each of a plurality of original data to be written to obtain a plurality of original vector data, wherein each original vector data includes a query result vector and a query text vector; then, determining the index object corresponding to the plurality of original vector data based on the plurality of original vector data and an index object generation method of a preset plugin library, wherein the index object includes the plurality of original vector data; finally, storing the index object in a Segment document under the Lucene file and, upon determining that the accumulated duration of responding to the data writing instruction reaches a preset duration, flushing the Segment document to disk, wherein the Segment document stored on disk is used for constructing the target connected graph.

Specifically, when the Elasticsearch 7.X component receives a data writing instruction sent by a user, it responds immediately and may call a vector conversion model to vectorize each piece of original data to be written carried by the data writing instruction, obtaining a plurality of original vector data. Each piece of original data to be written includes query result data, query text data, business primary key field data and the like, and the corresponding converted original vector data may be, for example, [0.12, 0.31, 0.33], that is, an N-dimensional vector array of length N, where N is a positive integer.

At this time, the index object corresponding to the plurality of original vector data can be determined by calling the index object generation method of the preset plugin library, the index object is then stored in a Segment document under the Lucene file, and the Segment document is flushed to disk once it satisfies the flush condition. The flush condition here is that the accumulated duration of responding to the data writing instruction reaches a preset duration, which may be set manually in advance; for example, the preset duration may be 60 seconds.

Based on the above, when it is determined that the accumulated duration of responding to the data writing instruction reaches the preset duration, all original vector data determined within that preset duration may be stored in a Segment document under the Lucene file. That is, while the data writing operation is being executed, a flush operation is executed once every preset duration, and each flush operation corresponds to a different Segment document; the flush operation may be an operation of flushing all original vector data determined within the preset duration into the Segment document under the Lucene file. For example, if 8 original vector data are determined within the first 60 seconds, all 8 may be stored in Segment document S1 under the Lucene file; if 3 original vector data are determined within the next 60 seconds, all 3 may be stored in Segment document S2 under the Lucene file.
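The timed flush described above can be sketched as a small Python buffer. The class and method names are hypothetical, the in-memory `segments` list merely stands in for Segment documents on disk, and the clock value is passed in explicitly so the behaviour can be reproduced; none of this is taken from Lucene or from the plugin library.

```python
class SegmentWriter:
    """Buffers written vectors and flushes them once per preset duration."""

    def __init__(self, flush_interval=60.0, start=0.0):
        self.flush_interval = flush_interval  # preset duration in seconds
        self.window_start = start             # start of the current window
        self.buffer = []                      # vectors written in this window
        self.segments = []                    # stand-in for flushed Segment documents

    def write(self, vector):
        """Record one original vector data in the current window."""
        self.buffer.append(vector)

    def tick(self, now):
        """Flush the buffer as one new segment if the preset duration elapsed."""
        if now - self.window_start >= self.flush_interval and self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()
            self.window_start = now
            return True
        return False


writer = SegmentWriter(flush_interval=60.0, start=0.0)
for i in range(8):                 # 8 vectors arrive within the first 60 seconds
    writer.write([0.1 * i])
flushed_s1 = writer.tick(60.0)     # first flush, corresponds to S1
for i in range(3):                 # 3 vectors arrive within the next 60 seconds
    writer.write([0.2 * i])
too_early = writer.tick(90.0)      # only 30 seconds elapsed, no flush
flushed_s2 = writer.tick(120.0)    # second flush, corresponds to S2
segment_sizes = [len(s) for s in writer.segments]
```

Each flush produces a separate segment, mirroring the S1 and S2 documents of the example.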

According to the data writing method provided by the invention, the Elasticsearch 7.X component writes data by vectorizing the original data to be written, determining the index object, and then flushing the index object stored in the Segment document under the Lucene file to disk, so that the target connected graph can be constructed accurately and quickly when a data query request is subsequently responded to, providing reliable data support for improving the data query efficiency. Further, when the accumulated duration of responding to the data writing instruction reaches the preset duration, all index objects determined within the preset duration are stored in the Segment document under the Lucene file. In this way, a storage operation that stores index objects into a different Segment document is guaranteed to be executed once every preset duration, which improves the efficiency of subsequently flushing the Segment documents to disk.

It may be appreciated that, based on the data query method shown in fig. 1, in an example embodiment, in a case where the preset plugin library is an HNSW plugin library, determining, based on a plurality of original vector data and an index object generating method of the preset plugin library, index objects corresponding to the plurality of original vector data may include:

firstly, converting each of the plurality of original vector data into a data type supported by the HNSW plugin library to obtain a plurality of vector data conversion results; then, adding the plurality of vector data conversion results to a preset index of a preset Elasticsearch component; finally, based on an addition completion identifier, calling the index object generation method of the HNSW plugin library and determining the index object corresponding to the plurality of original vector data.

The preset Elasticsearch component may specifically be an Elasticsearch 7.X component, and the preset index may be an index of the Elasticsearch 7.X component. The HNSW plugin library is a graph-based algorithm library in the KNN search field (or ANN search field); HNSW stands for Hierarchical Navigable Small World, that is, a hierarchical navigable small world network. KNN refers to K-Nearest Neighbor, that is, the K nearest neighbors. When the index objects stored in the preset plugin library form a large-scale data set, the amount of computation required by KNN is excessive, and only A approximate neighbors are considered instead, that is, KNN is replaced by ANN, where ANN refers to Approximate Nearest Neighbor, an algorithm for searching for nearest neighbors in a large-scale data set; A < K, and A and K are positive integers.

Specifically, fig. 2 is a flowchart of the data writing process provided by the present invention for the case where the preset plugin library is an HNSW plugin library. As shown in fig. 2, after each piece of original data to be written is vectorized by the vector conversion model to obtain the corresponding original vector data, each original vector data is converted into a data type supported by the HNSW plugin library and written into the index of the Elasticsearch 7.X component, so as to instruct the index to execute the process of writing vector data into the Lucene file. At this time, the vector data conversion results obtained by converting each original vector data may be added to the index of the Elasticsearch 7.X component. After the addition is completed, the index object generation method of the HNSW plugin library is called through the application program interface (Application Program Interface, API) of the HNSW plugin library; specifically, the build method of the HNSW plugin may be called and the index object it returns obtained, and this returned index object is the index object corresponding to the plurality of original vector data.

According to the data query method provided by the invention, the plurality of original vector data are each converted into a data type supported by the HNSW plugin library, the plurality of vector data conversion results are added to the preset index of the preset Elasticsearch component, and the index object corresponding to the plurality of original vector data is then determined by calling the index object generation method of the HNSW plugin library. Therefore, the index object corresponding to the plurality of original vector data can be determined accurately and quickly through the algorithm library provided in the HNSW plugin library, which improves the convenience and reliability of data writing.

It may be appreciated that, based on the data query method shown in fig. 1, in an example embodiment, in a case where the preset plugin library is an LSH plugin library, the index object corresponding to the plurality of original vector data is determined based on the plurality of original vector data and an index object generating method of the preset plugin library, and the specific implementation process includes:

firstly, obtaining the target hash functions respectively corresponding to the plurality of original data to be written; then, for each original data to be written and each target hash function, hashing the original vector data with the target hash function based on the hash index object generation method of the LSH plugin library, and determining the hash value corresponding to the original vector data; finally, for each hash value and each original data to be written, mapping the original data to be written into the hash bucket where the hash value is located, and determining the index object corresponding to the plurality of original vector data.

LSH stands for Locality Sensitive Hashing, that is, the locality-sensitive hashing algorithm, which is an approximate nearest neighbor search algorithm based on hash functions.

Specifically, fig. 3 is a flowchart of the data writing process provided by the present invention for the case where the preset plugin library is an LSH plugin library. As shown in fig. 3, after each piece of original data to be written is vectorized by the vector conversion model to obtain the corresponding original vector data, each original vector data is converted into a data type supported by the preset plugin library and written into the index of the Elasticsearch 7.X component, so as to instruct the index to execute the process of writing vector data into the Lucene file. At this time, the target hash function corresponding to each piece of original data to be written can be obtained, that is, a suitable hash function selected for each piece of original data to be written. Then, through the API of the LSH plugin library, the target hash function is called to hash each original vector data to obtain the corresponding hash value, and each original vector data is mapped into the hash bucket where its hash value is located, so that a hash table of the plurality of original vector data is constructed; the constructed hash table is the index object corresponding to the plurality of original vector data.

According to the data query method provided by the invention, based on the hash index object generation method of the LSH plugin library, the target hash function corresponding to each original data to be written is first used to hash the corresponding original vector data, and each original data to be written is then mapped into the hash bucket where the corresponding hash value is located, so that the index object of the plurality of original vector data is determined. Therefore, the index object corresponding to the plurality of original vector data can be determined accurately and quickly through the hash-function-based approximate nearest neighbor search algorithm in the LSH plugin library, which improves the convenience and reliability of data writing.
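The hashing and bucketing steps above can be illustrated with a minimal Python sketch. The description does not state which hash functions are used, so the sketch assumes the common random-hyperplane (sign of projection) family; the function names, the number of hash bits and the seed are likewise hypothetical and are not the API of any LSH plugin library.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random hyperplanes (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def build_hash_table(vectors, hyperplanes):
    """Map each vector's position in the input list into its hash bucket."""
    table = defaultdict(list)
    for i, v in enumerate(vectors):
        table[lsh_hash(v, hyperplanes)].append(i)
    return table


planes = make_hyperplanes(dim=3, num_bits=4, seed=42)
vectors = [
    [0.12, 0.31, 0.33],
    [0.24, 0.62, 0.66],    # same direction as the first vector
    [-0.50, 0.10, -0.90],
]
hash_table = build_hash_table(vectors, planes)
same_bucket = lsh_hash(vectors[0], planes) == lsh_hash(vectors[1], planes)
```

Vectors pointing in the same direction always share a bucket under this hash family, so a query only needs to inspect the bucket its own hash value falls into.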

It may be appreciated that, based on the data query method shown in fig. 1, in an example embodiment, in a case where the preset plugin library is an IVSPQ plugin library, determining, based on a plurality of original vector data and an index object generating method of the preset plugin library, index objects corresponding to the plurality of original vector data may include:

firstly, for each original vector data, determining the target cluster point and the target coding mode corresponding to the original vector data based on a mapping relation among vector data, cluster points and coding modes pre-stored in the IVSPQ plugin library; then, for the target cluster point and target coding mode corresponding to each original vector data, encoding and clustering the original data to be written using the target cluster point and the target coding mode, based on the coding index object generation method of the IVSPQ plugin library, to obtain a cluster coding result; finally, for each cluster coding result and each original data to be written, mapping the original data to be written into the coding table where the cluster coding result is located, and determining the index object corresponding to the plurality of original vector data.

The IVSPQ plugin library provides an approximate nearest neighbor search algorithm based on vector quantization, which can be used for fast recall of high-dimensional data. The full English name of IVSPQ is Inverted Vector Quantization with Sequential Projections and Query-adaptive Refinement, that is, a vector-quantization approximate nearest neighbor search algorithm that is based on an inverted index and uses sequential projection and query-adaptive refinement.

Specifically, fig. 4 is a flowchart of the data writing process provided by the present invention for the case where the preset plugin library is an IVSPQ plugin library. As shown in fig. 4, after each piece of original data to be written is vectorized by the vector conversion model to obtain the corresponding original vector data, each original vector data is converted into a data type supported by the preset plugin library and written into the index of the Elasticsearch 7.X component, so as to instruct the index to execute the process of writing vector data into the Lucene file. At this time, the mapping relation among vector data, cluster points and coding modes pre-stored in the IVSPQ plugin library is called through the API of the IVSPQ plugin library, so as to determine the target cluster point and the target coding mode corresponding to each original vector data. Then, for each original vector data, the corresponding target coding mode and target cluster point are used to encode and cluster the corresponding original data to be written, and each original data to be written is mapped into the coding table where the corresponding cluster coding result is located, so that a coding table of the plurality of original vector data is constructed; the constructed coding table is the index object corresponding to the plurality of original vector data.
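The clustering and coding steps above can be illustrated with a short Python sketch. The description does not define the coding mode, so the sketch substitutes a generic scheme: each vector is assigned to its nearest cluster point and each sub-vector is replaced by the index of its nearest codeword (a product-quantization-style code). The cluster points, the codebooks and all function names are hypothetical, fixed by hand here rather than learned, and are not the API of the IVSPQ plugin library.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, points):
    """Index of the point in `points` closest to `vector`."""
    return min(range(len(points)), key=lambda i: sq_dist(vector, points[i]))


def encode(vector, cluster_points, codebooks):
    """Return (cluster id, code): the assumed cluster coding result."""
    cluster_id = nearest(vector, cluster_points)
    sub_len = len(vector) // len(codebooks)
    code = tuple(
        nearest(vector[j * sub_len:(j + 1) * sub_len], book)
        for j, book in enumerate(codebooks)
    )
    return cluster_id, code


def build_coding_table(vectors, cluster_points, codebooks):
    """Map each vector's position into the entry of its cluster coding result."""
    table = defaultdict(list)
    for i, v in enumerate(vectors):
        table[encode(v, cluster_points, codebooks)].append(i)
    return table


# Hand-picked toy parameters (assumptions, not learned from data).
cluster_points = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codewords for the first half of a vector
    [[0.0, 0.0], [1.0, 1.0]],   # codewords for the second half
]
vectors = [
    [0.1, 0.2, 0.9, 1.0],
    [0.0, 0.1, 0.1, 0.0],
    [0.2, 0.1, 1.0, 0.9],
]
coding_table = build_coding_table(vectors, cluster_points, codebooks)
```

In this toy run the first and third vectors receive the same cluster coding result and therefore land in the same coding table entry, while the second vector is stored under a different entry.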

The data query method provided by the invention is based on an approximate nearest neighbor search algorithm of an IVSPQ plug-in library, firstly, a target clustering point and a target coding mode corresponding to a plurality of original vector data are obtained, then the target coding mode and the target clustering point are used for coding and clustering the corresponding original data to be written, and then each original data to be written is mapped into a coding table where a corresponding clustering coding result is located, so that index objects corresponding to the plurality of original vector data are determined. Therefore, the index objects corresponding to the plurality of original vector data can be accurately and rapidly determined through the approximate nearest neighbor search algorithm based on vector quantization in the IVSPQ plugin library, so that the convenience and reliability of data writing can be improved.

It can be appreciated that, based on the data query method shown in fig. 1, in an exemplary embodiment, since the existing dense_vector type in the Elasticsearch 7.X component can only store vector data of no more than 500 dimensions, it is not suitable for storing vector data sets of millions or even tens of millions. The preset plugin library provided by the present invention therefore provides not only a vector search algorithm but also a data type for storing vector data of millions or even tens of millions. Based on this, the data query method provided by the invention may further include:

storing the original vector data in a preset data type of the preset plugin library, wherein the preset data type is used for storing the original vector data without any limit on dimension.

The preset data type may specifically be the knn_vector type, in which vector data is stored; the knn_vector type places no specific limit on the storage dimension, that is, it can store vector data with more than 500 dimensions.

Specifically, each original vector data contains one DocValue data, and the DocValue data is specifically a corresponding original multidimensional vector array, so that each original multidimensional vector array can be obtained, and the original multidimensional vector array can be stored as DocValue data into a preset data type.

It should be noted that the preset data type is a new data type implemented by a third-party plug-in, namely the preset plugin library, so that there is no dimension limit when storing vector data.

According to the data query method provided by the invention, by storing the original vector data in the preset data type of the preset plugin library, the defect that the existing dense_vector type in the Elasticsearch 7.X component cannot store vector data of millions or even tens of millions of dimensions is overcome, the applicability of storing vector data with the preset data type of the preset plugin library is broadened, and a reliable guarantee is provided for the richness and comprehensiveness of subsequent query results.

It can be appreciated that, based on the data query method shown in fig. 1, in an exemplary embodiment, in order to increase the query speed, the constructed target connected graph may be loaded into memory; when there is no requirement on query speed, the target connected graph may instead be loaded directly. Based on this, the process in step 120 of searching for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text based on the target connected graph may include:

loading the target connected graph into memory, and searching for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text based on the target connected graph;

or,

loading the target connected graph directly and, based on the loading result and the target connected graph, searching for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text.

Specifically, in order to increase the query speed, the target connected graph may be loaded into memory by calling the memory framework Guava at scheduled times or periodically, after which the plurality of candidate coordinate positions adjacent to the coordinate position of the target query text are searched for. When there is no requirement on query speed, the target connected graph may be loaded directly and the plurality of candidate coordinate positions adjacent to the coordinate position of the target query text searched for. For example, when the plurality of original vector data respectively included in the different index objects are stored on disk, the target connected graph may be constructed on disk from the plurality of original vector data respectively included in the index objects, and the constructed target connected graph may be loaded directly, or it may be loaded into memory.

According to the data query method provided by the invention, the flexibility and reliability of the data query are improved by either loading the constructed target connected graph directly or loading it into memory first, and then searching for the plurality of candidate coordinate positions.

It can be appreciated that, based on the data query method shown in fig. 1, in an exemplary embodiment, each time a data query request is received, it may first be determined whether the previously constructed connected graph is still applicable, and if not, the target connected graph required for the current data query is constructed anew. Based on this, the specific implementation of step 110 may include:

firstly, comparing the different written index objects with the index objects used by the previously constructed connected graph; then, upon determining that the different written index objects differ from the index objects used by the previously constructed connected graph, constructing the target connected graph based on the plurality of original vector data respectively included in the different written index objects.

Specifically, when the data query request is received, it can be determined whether the previously constructed connected graph is suitable for the current data query, that is, whether the index objects used by the previously constructed connected graph are the same as the different index objects currently written. When they are determined not to be the same, the previously constructed connected graph is confirmed to be unsuitable for the current data query, and the target connected graph can then be constructed based on the plurality of original vector data respectively included in the different written index objects.

According to the data query method provided by the invention, when a data query request is received, it is determined whether the previously constructed connected graph is suitable for the current data query, and a target connected graph suitable for the current data query is constructed when the previously constructed connected graph is determined to be unsuitable. Therefore, the target connected graph is constructed only when reasonable and necessary, which provides a reliable guarantee for subsequent data queries.
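The applicability check described above can be sketched in Python as a small cache that rebuilds the connected graph only when the set of written index objects has changed. The class name, the use of index object identifiers as a fingerprint, and the `build_fn` callback are assumptions made for illustration; the description does not state how the comparison is carried out.

```python
class ConnectedGraphCache:
    """Reuses the previously constructed graph while index objects are unchanged."""

    def __init__(self, build_fn):
        self._build_fn = build_fn   # hypothetical graph construction callback
        self._fingerprint = None    # identifiers used by the previous build
        self._graph = None
        self.build_count = 0

    def get(self, index_object_ids):
        """Return a graph valid for the currently written index objects."""
        fingerprint = tuple(sorted(index_object_ids))
        if fingerprint != self._fingerprint:
            # Written index objects differ from those used previously: rebuild.
            self._graph = self._build_fn(index_object_ids)
            self._fingerprint = fingerprint
            self.build_count += 1
        return self._graph


# Stand-in builder that only records which index objects it was built from.
cache = ConnectedGraphCache(lambda ids: {"built_from": sorted(ids)})
first = cache.get(["S1", "S2"])
second = cache.get(["S2", "S1"])        # same index objects, no rebuild
third = cache.get(["S1", "S2", "S3"])   # a new index object was written
```

The second request reuses the graph from the first, and only the third request, issued after another index object has been written, triggers a new construction.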

Data queries performed in combination with figs. 2 to 4 above show the following: the query performance using the HNSW plugin library is the highest, with a recall rate higher than 95%; data queries using the LSH plugin library are simple to implement and have relatively high query performance; and the query performance using the IVSPQ plugin library is relatively high, with a recall rate higher than 90%. In addition, when the KNN algorithm in the HNSW plugin library is used for data queries, the time complexity is reduced from O(n) to O(log n), and the recall effect is also significantly improved; n is the total dimension of the original vector data respectively included in the different written index objects.

Fig. 5 is a schematic structural diagram of a data query device provided by the present invention. As shown in fig. 5, the data query device 500 includes a connected graph construction module 510 and a data query module 520.

The connected graph construction module 510 is configured to construct, in response to a data query request, a target connected graph based on a plurality of original vector data that are respectively included in different written index objects, where the original vector data includes a query text vector and a corresponding query result vector, and coordinate positions corresponding to different query text vectors in the target connected graph are mutually connected, and the data query request carries a target query text vector and a query result number requirement.

The data query module 520 is configured to search for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text based on the target connected graph, and to determine at least one target coordinate position from the plurality of candidate coordinate positions based on the query result quantity requirement; and to determine the query result vector of the query text vector corresponding to the at least one target coordinate position as the query result of the data query request.

Optionally, the connected graph construction module 510 may be specifically configured to construct, for each index object, a sub-graph of the index object based on the plurality of original vector data included in the index object, where the coordinate positions corresponding to the query text vectors of the plurality of original vector data in the sub-graph are connected to one another; and to construct the target connected graph based on the sub-graphs, where all the sub-graphs in the target connected graph are connected to one another.

Optionally, the data query device provided by the invention may further include a data writing module, configured to, in response to a data writing instruction, vectorize each of a plurality of original data to be written to obtain a plurality of original vector data, where each original vector data includes a query result vector and a query text vector; determine the index object corresponding to the plurality of original vector data based on the plurality of original vector data and an index object generation method of a preset plugin library, where the index object includes the plurality of original vector data; and store the index object in a Segment document under the Lucene file and, upon determining that the accumulated duration of responding to the data writing instruction reaches a preset duration, flush the Segment document to disk, where the Segment document stored on disk is used for constructing the target connected graph.

Optionally, in the case that the preset plugin library is an HNSW plugin library, the data writing module is further configured to convert each of the plurality of original vector data into a data type supported by the HNSW plugin library to obtain a plurality of vector data conversion results; add the plurality of vector data conversion results to a preset index of a preset Elasticsearch component; and, based on an addition completion identifier, call the index object generation method of the HNSW plugin library and determine the index object corresponding to the plurality of original vector data.

Optionally, in the case that the preset plugin library is an LSH plugin library, the data writing module is further configured to obtain the target hash functions respectively corresponding to the plurality of original data to be written; for each original data to be written and each target hash function, hash the original vector data with the target hash function based on the hash index object generation method of the LSH plugin library, and determine the hash value corresponding to the original vector data; and, for each hash value and each original data to be written, map the original data to be written into the hash bucket where the hash value is located, and determine the index object corresponding to the plurality of original vector data.

Optionally, in the case that the preset plugin library is an IVSPQ plugin library, the data writing module is further configured to determine, for each original vector data, the target cluster point and the target coding mode corresponding to the original vector data based on a mapping relation among vector data, cluster points and coding modes pre-stored in the IVSPQ plugin library; for the target cluster point and target coding mode corresponding to each original vector data, encode and cluster the original data to be written using the target coding mode and the target cluster point, based on the coding index object generation method of the IVSPQ plugin library, to obtain a cluster coding result; and, for each cluster coding result and each original data to be written, map the original data to be written into the coding table where the cluster coding result is located, and determine the index object corresponding to the plurality of original vector data.

Optionally, the data writing module is further configured to store the original vector data into a preset data type of a preset plugin library, where the preset data type is used for storing the original vector data without limitation of dimensions.

Optionally, the number query module provided by the invention may further include a data loading module, configured to load the target communication graph into the memory, and search a plurality of candidate coordinate positions based on the target communication graph; or searching a plurality of candidate coordinate positions based on the result of loading the target communication diagram and the target communication diagram.

Optionally, the connectivity graph construction module 510 is specifically configured to compare the written different index objects with the index object adopted by the connectivity graph constructed previously; and determining that the written different index objects are different from the index objects adopted by the communication graph constructed in the previous time, and constructing a target communication graph based on a plurality of original vector data respectively included by the written different index objects.

The data query device 500 provided by the present invention can execute the technical solution of the data query method in any of the above embodiments. Its implementation principle and beneficial effects are similar to those of the data query method and are not repeated here.

Fig. 6 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 6, the electronic device 600 may include a processor 610, a communication interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communication interface 620, and the memory 630 communicate with each other via the communication bus 640. The processor 610 may call logic instructions in the memory 630 to perform the following method:

in response to a data query request, constructing a target connected graph based on a plurality of original vector data respectively included in different written index objects, wherein the original vector data comprises query text vectors and corresponding query result vectors, the coordinate positions corresponding to different query text vectors in the target connected graph are connected to one another, and the data query request carries a target query text vector and a query result quantity requirement; searching, based on the target connected graph, for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text vector, and determining at least one target coordinate position from the plurality of candidate coordinate positions based on the query result quantity requirement; and determining the query result vector of the query text vector corresponding to the at least one target coordinate position as the query result of the data query request.
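The three steps above can be illustrated with the following minimal sketch. It is not the patented implementation or any search engine's API. It assumes Euclidean distance, a graph built by brute-force linking of each point to its `m` nearest points, and a best-first walk from a fixed entry point, so the search is approximate in general; `build_graph`, `search`, `query`, `m`, and `entry` are hypothetical names.

```python
import heapq
from math import dist


def build_graph(points, m=2):
    """Link every point to its m nearest points, in both directions."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:m]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def search(graph, points, target, k, entry=0):
    """Best-first walk from `entry`; return ids of the k closest points seen."""
    visited = {entry}
    frontier = [(dist(points[entry], target), entry)]
    best = []  # sorted (distance, node) pairs, at most k long
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) == k and d > best[-1][0]:
            break  # every remaining candidate is farther than the k-th best
        best = sorted(best + [(d, node)])[:k]
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(points[nb], target), nb))
    return [node for _, node in best]


def query(records, target, k):
    """records: (query text vector, query result vector) pairs."""
    points = [q for q, _ in records]
    graph = build_graph(points)
    return [records[i][1] for i in search(graph, points, target, k)]
```

Here `k` plays the role of the query result quantity requirement: the walk visits only neighbours reachable through the graph rather than scanning every stored vector, and the result vectors attached to the `k` closest positions are returned.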

Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the related art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In another aspect, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments, for example comprising:

In yet another aspect, embodiments of the present invention further provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the data query method provided in the above embodiments, for example including:

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that the above-mentioned embodiments merely illustrate the invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will appreciate that various combinations, modifications, or equivalent substitutions can be made to the technical solutions of the present invention without departing from their spirit and scope, and all such changes are intended to be covered by the scope of the claims of the present invention.

Claims

1. A method of querying data, comprising:

2. The data query method of claim 1, wherein the constructing a target connected graph based on a plurality of original vector data respectively included in the different written index objects includes:

3. The data query method of claim 1, wherein, for each of the different written index objects, the writing process of the plurality of original vector data included in the index object includes:

4. The data query method according to claim 3, wherein the preset plugin library is an HNSW plugin library, and the determining the index object corresponding to the plurality of original vector data based on the plurality of original vector data and the index object generating method of the preset plugin library includes:

5. The data query method according to claim 3, wherein the preset plugin library is an LSH plugin library, and the determining the index object corresponding to the plurality of original vector data based on the index object generating method of the plurality of original vector data and the preset plugin library includes:

6. The data query method according to claim 3, wherein the preset plugin library is an IVSPQ plugin library, and the determining the index object corresponding to the plurality of original vector data based on the plurality of original vector data and the index object generating method of the preset plugin library includes:

7. The data query method of any one of claims 3 to 6, further comprising:

8. The data query method of any one of claims 1 to 6, wherein said searching for a plurality of candidate coordinate positions adjacent to the coordinate position of the target query text vector based on the target connected graph comprises:

Or,

9. The data query method according to any one of claims 1 to 6, wherein the constructing a target connected graph based on a plurality of original vector data respectively included in the different written index objects includes:

10. A data query device, comprising:

11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the data query method of any of claims 1 to 9 when executing the program.

12. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements a data query method as claimed in any of claims 1 to 9.