CN117708263B - Multi-party joint vector knowledge base retrieval method and system for privacy protection - Google Patents

Multi-party joint vector knowledge base retrieval method and system for privacy protection

Info

Publication number
CN117708263B
Authority
CN
China
Prior art keywords
party
vector
random
trusted
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311703773.1A
Other languages
Chinese (zh)
Other versions
CN117708263A (en
Inventor
陈欣
李闯
肖骞宇
高金超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Financial Certification Authority Co ltd
Original Assignee
China Financial Certification Authority Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Financial Certification Authority Co ltd
Priority to CN202311703773.1A
Publication of CN117708263A
Application granted
Publication of CN117708263B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/33 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a privacy-preserving multi-party joint vector knowledge base retrieval method and system. The method comprises the following steps: each party constructs an index for its text corpus and uploads it to a trusted third party; the trusted third party distributes random protection secret parameters, and each party generates auxiliary data by combining them with its local random parameters; each party obtains the embedded vectors of its corpus, applies dimension reduction and a random homogeneous transformation, and uploads the resulting data together with the auxiliary data to a joint vector database server; a user performs corpus embedding on the query request text, followed by dimension reduction and random homogeneous transformation, to generate a vector to be retrieved; in the joint vector database, the vector to be retrieved is combined with the auxiliary data, similarity retrieval is performed, and an index result is returned; after the vector similarity retrieval is completed, the corresponding text content is queried from the trusted third party according to the index, completing the retrieval. The invention solves the problem that existing knowledge bases have difficulty completing secure multi-party retrieval without exposing the knowledge base information of each party.

Description

Multi-party joint vector knowledge base retrieval method and system for privacy protection
Technical Field
The invention relates to the technical field of database retrieval security, and in particular to a privacy-preserving multi-party joint vector knowledge base retrieval method and system.
Background
Large language models (LLMs) have been widely applied and developed across industries, and new attempts at practical deployment keep emerging. To overcome the "hallucination" problem that large models exhibit in practical construction and application, while making full use of their capacity to understand and summarize language, the retrieval-augmented generation (RAG) mode combined with large language models has attracted wide attention.
The core idea of RAG is to construct a vector knowledge base and to use the information retrieved from it that is relevant to the question as prompt input for a downstream large model. In this process, the vector knowledge base being searched is usually either constructed independently or built by directly aggregating and embedding the information of several different knowledge bases.
On the one hand, the embedded vectors corresponding to private information in the knowledge base are directly exposed to the vector knowledge base server when the knowledge base is constructed. On the other hand, the embedded vector of the query content (query) is also directly exposed during retrieval. Moreover, the original text can easily be inferred from the feature vectors of an embedded corpus through a dictionary attack. The construction and retrieval steps of the vector knowledge base in RAG therefore carry a high risk of privacy disclosure. Furthermore, the knowledge supporting large language models is becoming increasingly concentrated, yet knowledge bases with related content but different providers cannot be searched across one another directly because of potential privacy protection requirements (for example, data not being allowed to leave its domain). This limits the information available when retrieving from a knowledge base and, in turn, the practical performance of the large language model.
Constructing a vector knowledge base generally involves several steps: the original text corpus is first embedded (that is, the text is fed to an embedding model to obtain an output vector), an index mapping is then constructed (the index associates each corpus item with its vector), and the embedded vectors are uploaded to a vector database server; the vector database is complete once all corpus items have been processed. As described above, the vector database server stores and searches the embedded vectors directly after receiving them, which exposes the sensitive information to the database server and to attacks on it. In the retrieval stage, the user embeds the query text and searches the database with the embedded vector (vec-q), and the database server returns the similarity retrieval result (the content hash indexes corresponding to the similar vectors). With multiple knowledge bases, there is a further problem beyond the exposure of embedded vectors to the database server. Because the original text exists only locally at each knowledge base provider, a retrieval requester holding a retrieved hash index must communicate with other providers to obtain the original text. On the one hand, this can leak sensitive information: the requester learns which party owns the knowledge, and the queried provider learns what the requester may be searching for. On the other hand, when many knowledge base providers participate, communication among the parties greatly reduces retrieval efficiency.
Existing Secure Binary Embedding (SBE) focuses on the security of corpus embedded vectors, but similarity search after encoding (using Hamming distance) does not match the similarity of the original vectors (L2 distance or cosine distance) over part of the distance range. The degree of mismatch depends on the parameters of the SBE algorithm and on the characteristics of the data, so in practice similarity search may fail for embedded datasets of complex data.
Disclosure of Invention
The invention provides a privacy-preserving multi-party joint vector knowledge base retrieval method and system, which solve the problem that existing knowledge bases have difficulty completing secure multi-party retrieval without exposing the knowledge base information of each party, as well as the problem of low retrieval efficiency caused by excessively high dimensionality.
The invention provides a method for searching a multiparty joint vector knowledge base with privacy protection, which comprises the following steps:
each party constructs an index for its text corpus and uploads it to a trusted third party; the trusted third party distributes random protection secret parameters, and each party generates auxiliary data by combining them with its local random parameters, the auxiliary data comprising the random protection secret parameters of the trusted third party and the local random transformation parameters of that party;
each party obtains the embedded vectors of its corpus, applies dimension reduction and a random homogeneous transformation, and uploads the resulting data together with the auxiliary data to a joint vector database server;
a user performs corpus embedding on the query request text and then applies dimension reduction and random homogeneous transformation to generate a vector to be retrieved;
in the joint vector database, the vector to be retrieved is combined with the auxiliary data, similarity retrieval is performed, and an index result is returned;
after the vector similarity retrieval is completed, the corresponding text content is queried from the trusted third party according to the index, completing the retrieval.
According to the privacy-preserving multi-party joint vector knowledge base retrieval method provided by the invention, the step of constructing indexes, distributing the random protection secret parameters, and generating the auxiliary data specifically comprises:
each party constructs an index for its text corpus and uploads it to the trusted third party;
each party receives the random protection secret parameters distributed by the trusted third party and generates auxiliary data by combining them with its local random parameters.
According to the privacy-preserving multi-party joint vector knowledge base retrieval method provided by the invention, the step in which each party obtains the embedded vectors of its corpus and uploads the dimension-reduced, randomly transformed data together with the auxiliary data to the joint vector database server specifically comprises:
each party performs the embedding operation on its corpus to obtain embedded vectors, reduces the dimension of the vector data, and applies a random homogeneous transformation using its local random parameters;
the result of the random homogeneous transformation is uploaded to the joint vector database server together with the auxiliary data.
According to the privacy-preserving multi-party joint vector knowledge base retrieval method provided by the invention, the step in which the user performs corpus embedding on the query request text and then applies dimension reduction and random homogeneous transformation to generate the vector to be retrieved specifically comprises:
the user inputs a query request text and sends the query request text to the joint vector database server;
corpus embedding is performed on the query request text to generate an initial vector;
dimension reduction mapping and random homogeneous transformation are applied to the initial vector to generate the vector to be retrieved.
According to the privacy-preserving multi-party joint vector knowledge base retrieval method provided by the invention, the step in which the vector to be retrieved is combined with the auxiliary data in the joint vector database, similarity retrieval is performed, and an index result is returned specifically comprises:
the vector to be retrieved is combined with the auxiliary data, and the similarity distance between the resulting vector and the vectors in the joint vector database is evaluated;
when the similarity distance is smaller than the set distance, the similarity requirement is met, the corpus items that are semantically related are determined, and their indexes are returned.
According to the privacy-preserving multi-party joint vector knowledge base retrieval method provided by the invention, the step of querying the corresponding text content from the trusted third party according to the index after the vector similarity retrieval is completed specifically comprises:
after the vector similarity retrieval is completed, the user obtains the set of retrieved hash indexes returned by the data server;
the user queries the corresponding text content from the trusted third party according to the set of retrieved hash indexes, completing one retrieval.
The invention also provides a multiparty joint vector knowledge base retrieval system for privacy protection, which comprises:
a joint vector database construction module, used for each party to construct an index for its text corpus and upload it to a trusted third party, the trusted third party distributing random protection secret parameters and each party generating auxiliary data by combining them with its local random parameters, the auxiliary data comprising the random protection secret parameters of the trusted third party and the local random transformation parameters of that party;
a corpus processing module, used for each party to obtain the embedded vectors of its corpus and to upload the dimension-reduced, randomly transformed data together with the auxiliary data to the joint vector database server;
a dimension reduction homogeneous transformation module, used for applying dimension reduction and random homogeneous transformation after the user performs corpus embedding on the query request text, generating a vector to be retrieved;
a similarity retrieval module, used for combining the vector to be retrieved with the auxiliary data in the joint vector database, performing similarity retrieval, and returning an index result;
a third-party query module, used for querying the corresponding text content from the trusted third party according to the index after the vector similarity retrieval is completed, completing the retrieval.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the multiparty joint vector knowledge base retrieval method for privacy protection according to any one of the above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the privacy-preserving multi-party joint vector knowledge base retrieval methods described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements any one of the privacy-preserving multi-party joint vector knowledge base retrieval methods described above.
According to the privacy-preserving multi-party joint vector knowledge base retrieval method and system provided by the invention, a secure vector database is constructed so that the original data a user uses to build the vector database never leaves the local environment, and vector similarity retrieval remains secure even when served by a public database service provider. When multiple knowledge base providers participate, the vector knowledge base is established jointly by the parties; while the query vector remains protected from the server during retrieval, each participant can retrieve information associated with the query content without learning the contents of the other parties' knowledge bases.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of a method for searching a multi-party joint vector knowledge base with privacy protection according to the present invention;
FIG. 2 is a second flow chart of a method for searching a multi-party joint vector knowledge base with privacy protection according to the present invention;
FIG. 3 is a third flow chart of a method for searching a multi-party joint vector knowledge base with privacy protection according to the present invention;
FIG. 4 is a schematic flow chart of a method for searching a multi-party joint vector knowledge base with privacy protection according to the present invention;
FIG. 5 is a schematic flow chart of a method for searching a multi-party joint vector knowledge base with privacy protection according to the present invention;
FIG. 6 is a flowchart of a method for searching a multi-party joint vector knowledge base with privacy protection according to the present invention;
FIG. 7 is a schematic diagram of the module connection of a privacy preserving multi-party federated vector knowledge base retrieval system provided by the present invention;
FIG. 8 is a schematic diagram of a privacy preserving vector data construction flow provided by the present invention;
FIG. 9 is a schematic diagram of a privacy preserving vector data construction flow provided by the present invention;
FIG. 10 is a schematic diagram of a flow for constructing a multi-party federated knowledge base in accordance with the present invention;
fig. 11 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
110: a joint vector database construction module; 120: a corpus processing module; 130: the dimensionality reduction homogeneous transformation module; 140: a similarity retrieval module; 150: a third party query module;
1110: a processor; 1120: a communication interface; 1130: a memory; 1140: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
First, explanation is made on related terms:
Corpus embedding: converting a text corpus (such as sentences or passages) into a vector of length n, such that corpus items that are semantically related have higher similarity (or a smaller similarity distance) after being converted into vectors.
Embedding model: an AI model that converts corpus content into vectors.
Vector knowledge base: a database that stores the embedded vectors of a large-scale corpus and supports corpus retrieval based on vector similarity. In practice, a vector database is usually built for the knowledge corpus of a specific field, and the resulting knowledge base serves a large model directly, providing a "knowledge query" function.
Homogeneous coordinates: a coordinate system used in computational geometry and computer graphics for representing points, vectors and transformations in Euclidean space. By introducing one additional coordinate, the homogeneous coordinate system represents points, vectors and transformations as tuples with the same number of coordinates.
Homogeneous transformation: a mathematical tool commonly used in computer graphics and computer vision to describe and implement geometric transformations such as translation, rotation, scaling and shearing. Its basic idea is to represent a combination of several geometric transformations by matrices and matrix multiplication, which makes computation and processing more efficient. Although homogeneous transformations are mainly used in computer vision, the invention relies on a mathematical property: a homogeneous transformation composed of a rotation and a translation keeps the L2 distance between any two transformed vectors unchanged. Similarity retrieval over vectors that have undergone the same homogeneous transformation (the same transformation matrix) therefore retains its accuracy.
Vector dimension reduction compression: a technique for reducing the feature dimension of a data matrix (such as an embedded data matrix), implemented in practice by matrix decomposition techniques such as singular value decomposition (SVD) and principal component analysis (PCA).
Vector database server: a network service provider that offers vector data retrieval services in a public network environment.
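The distance-preserving property noted for homogeneous transformations in the terms above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `random_rigid_homogeneous` and `to_homogeneous` are hypothetical, and drawing the orthogonal block by QR decomposition of a Gaussian matrix is one possible choice that the patent does not prescribe.

```python
import numpy as np

def random_rigid_homogeneous(dim, rng):
    """Return a (dim+1) x (dim+1) matrix [[Q, t], [0, 1]] with Q orthogonal."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal block
    h = np.eye(dim + 1)
    h[:dim, :dim] = q
    h[:dim, dim] = rng.standard_normal(dim)               # random displacement
    return h

def to_homogeneous(x):
    """Append a row of ones so each column becomes a homogeneous coordinate."""
    return np.vstack([x, np.ones((1, x.shape[1]))])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2))            # two 4-dimensional vectors as columns
h = random_rigid_homogeneous(4, rng)
y = h @ to_homogeneous(x)                  # transformed vectors, now 5-dimensional

dist_before = np.linalg.norm(x[:, 0] - x[:, 1])
dist_after = np.linalg.norm(y[:, 0] - y[:, 1])
print(np.isclose(dist_before, dist_after))
```

The displacement cancels when two transformed vectors are subtracted, and the orthogonal block leaves the norm of the difference unchanged, which is why the two distances agree.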
The following describes a method for searching a multiparty joint vector knowledge base with privacy protection in combination with fig. 1-6, which comprises the following steps:
S100, each party constructs an index for its text corpus and uploads it to a trusted third party; the trusted third party distributes random protection secret parameters, and each party generates auxiliary data by combining them with its local random parameters, the auxiliary data comprising the random protection secret parameters of the trusted third party and the local random transformation parameters of that party;
S200, each party obtains the embedded vectors of its corpus, applies dimension reduction and a random homogeneous transformation, and uploads the resulting data together with the auxiliary data to a joint vector database server;
S300, a user performs corpus embedding on the query request text and then applies dimension reduction and random homogeneous transformation to generate a vector to be retrieved;
S400, in the joint vector database, the vector to be retrieved is combined with the auxiliary data, similarity retrieval is performed, and an index result is returned;
S500, after the vector similarity retrieval is completed, the corresponding text content is queried from the trusted third party according to the index, completing the retrieval.
In the invention, the vector knowledge base is established jointly by multiple parties; while the query vector remains protected from the server during retrieval, each participant can retrieve information associated with the query content without learning the contents of the other parties' knowledge bases.
The step of constructing indexes, distributing the random protection secret parameters, and generating the auxiliary data specifically comprises:
S101, each party constructs an index for its text corpus and uploads it to the trusted third party;
S102, each party receives the random protection secret parameters distributed by the trusted third party and generates auxiliary data by combining them with its local random parameters.
Referring to FIG. 8, in one specific example, after a single user completes text embedding, the embedded vector data set (assuming an embedding dimension of m and n pieces of text) is the m × n matrix V, with one embedded vector per column:
$V_{m \times n} = [v_1, v_2, \ldots, v_n]$
When a multi-party joint vector knowledge base is constructed, the parties generally need to use the same embedding model to embed their respective corpora. This is not a strong assumption: building an embedding model that supports a broad corpus is very difficult, and the industry generally converges on the models known to perform well, so using the same embedding model is the conventional practice.
When a trusted third party is introduced to construct the joint vector database, the trusted third party first distributes the random protection secret parameter R0 to the users; this secret parameter is itself a random homogeneous transformation matrix. Each party uploads its corpus and the corresponding index to the trusted third party. The trusted third party must be both impartial and secure: on the one hand, it should be an institution with a certain authority that cannot collude with any participant to leak data; on the other hand, the secret parameter R0 it generates must not be easy to obtain. Next, each party uploads its transformed vector data C1 and auxiliary data C2 to the vector database server. Wherein, consistent with the server-side computation in the retrieval stage, for a party with local random homogeneous transformation matrix R1 and dimension-reduced data T,
$C_1 = R_1 T, \quad C_2 = R_0 R_1^{-1}$
The server stores the result of calculating C2C1, namely:
$C_2 C_1 = R_0 R_1^{-1} R_1 T = R_0 T$
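The construction stage described above can be illustrated with a small numerical sketch. It assumes NumPy, takes each party's dimension-reduced data as given, and assumes the forms C1 = R_i T_i (with T_i extended by a homogeneous coordinate) and C2 = R0 R_i^{-1}, which are inferred from the requirement that the server ends up holding data under the common transformation R0; the helper names and the two-party setup are hypothetical.

```python
import numpy as np

def random_homogeneous(dim, rng):
    """Random (dim+1) x (dim+1) matrix [[Q, t], [0, 1]] with Q orthogonal."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    h = np.eye(dim + 1)
    h[:dim, :dim] = q
    h[:dim, dim] = rng.standard_normal(dim)
    return h

def to_homogeneous(x):
    """Append a row of ones to a matrix of column vectors."""
    return np.vstack([x, np.ones((1, x.shape[1]))])

rng = np.random.default_rng(2)
dim = 3
r0 = random_homogeneous(dim, rng)              # secret parameter from the trusted third party

blocks = []
for n_texts in (5, 7):                         # two hypothetical parties
    t_i = rng.standard_normal((dim, n_texts))  # party's dimension-reduced embeddings
    r_i = random_homogeneous(dim, rng)         # party's local random transformation
    c1 = r_i @ to_homogeneous(t_i)             # transformed data (assumed form of C1)
    c2 = r0 @ np.linalg.inv(r_i)               # auxiliary data (assumed form of C2)
    stored = c2 @ c1                           # server-side product C2 C1
    assert np.allclose(stored, r0 @ to_homogeneous(t_i))
    blocks.append(stored)

joint_db = np.hstack(blocks)                   # all parties now share the R0 space
print(joint_db.shape)                          # (4, 12)
```

Under these assumptions the server only ever sees R_i-transformed data and the product R0 R_i^{-1}, never R0, R_i, or the dimension-reduced data on their own.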
The step in which each party obtains the embedded vectors of its corpus and uploads the dimension-reduced, randomly transformed data together with the auxiliary data to the joint vector database server specifically comprises:
S201, each party performs the embedding operation on its corpus to obtain embedded vectors, reduces the dimension of the vector data, and applies a random homogeneous transformation using its local random parameters;
S202, the result of the random homogeneous transformation is uploaded to the joint vector database server together with the auxiliary data.
For a single user, the data is first reduced in dimension by SVD, PCA or similar means (principal component analysis, PCA, is a common data analysis method that is often used to reduce the dimension of high-dimensional data and to extract its principal feature components). The user keeps the dimension reduction mapping matrix W, and the corresponding dimension-reduced data is:
$T_{L \times n} = W_L V$
The dimension-reduced data then undergoes a random homogeneous transformation: the user generates a random homogeneous transformation matrix R and uploads the transformed data set D to the vector database server:
$D_{(L+1) \times n} = R\,T_{L \times n}$
In the invention, the method for reducing the dimension of the original embedded vectors is not limited to any one of SVD, PCA, UMAP, t-SNE and the like. The dimension reduction mapping matrix W described here is obtained by SVD; more generally, it can be regarded as the key parameter of the dimension reduction method, and it depends on the data being reduced.
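The SVD-based dimension reduction described above can be sketched as follows, assuming NumPy. The function name `svd_reduce` is hypothetical, and taking W_L to be the first L right singular vectors of V^T, stored as rows, is an assumption made so that the shapes agree with T = W_L V; no centering is applied, so this is plain SVD rather than PCA.

```python
import numpy as np

def svd_reduce(v, target_dim):
    """Reduce an m x n matrix of column vectors to target_dim x n.

    Uses the decomposition V^T = U S W^T and keeps the first target_dim
    rows of W^T as the mapping matrix W_L, so that T = W_L @ V.
    """
    _, _, wt = np.linalg.svd(v.T, full_matrices=False)  # wt: (min(n, m), m)
    w_l = wt[:target_dim, :]                            # L x m mapping matrix
    return w_l, w_l @ v

rng = np.random.default_rng(1)
v = rng.standard_normal((8, 20))   # m = 8 embedding dimensions, n = 20 texts
w_l, t = svd_reduce(v, 3)
print(w_l.shape, t.shape)          # (3, 8) (3, 20)
```

Because the rows of W_L are orthonormal, the mapping is a projection onto the leading singular directions of the data, and W_L itself depends on the data set it was computed from.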
The random homogeneous transformation matrix consists of two parts, a random rotation and a random displacement vector. As described above, a random homogeneous transformation matrix of this form preserves the L2 distance between vectors. If cosine similarity must also be preserved, the displacement vector must be all zeros; the homogeneous transformation then preserves both the L2 distance and the cosine similarity of the vectors. The invention places no specific restriction on how the random homogeneous transformation matrix is generated.
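The effect of the displacement vector on the two similarity measures can be illustrated with a short sketch, assuming NumPy. The helper names are hypothetical, the orthogonal block is drawn by QR decomposition as one possible generation method, and the similarity is evaluated on the spatial coordinates only (the appended homogeneous coordinate is dropped), which is an assumption of this sketch.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def make_homogeneous(rotation, displacement):
    """Assemble the matrix [[rotation, displacement], [0, 1]]."""
    dim = rotation.shape[0]
    h = np.eye(dim + 1)
    h[:dim, :dim] = rotation
    h[:dim, dim] = displacement
    return h

def transform(h, x):
    """Apply h to x in homogeneous coordinates and drop the appended coordinate."""
    return (h @ np.append(x, 1.0))[:-1]

rng = np.random.default_rng(4)
dim = 4
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal matrix
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])

h_rot = make_homogeneous(rotation, np.zeros(dim))        # zero displacement
h_full = make_homogeneous(rotation, np.full(dim, 10.0))  # non-zero displacement

l2_rot = np.linalg.norm(transform(h_rot, a) - transform(h_rot, b))
l2_full = np.linalg.norm(transform(h_full, a) - transform(h_full, b))
cos_rot = cosine(transform(h_rot, a), transform(h_rot, b))
cos_full = cosine(transform(h_full, a), transform(h_full, b))

print(l2_rot, l2_full)     # both equal the original distance
print(cos_rot, cos_full)   # only the zero-displacement case keeps the original cosine
```

In this example the two input vectors are orthogonal, so their cosine similarity is 0; it stays at 0 under the pure rotation but moves close to 1 once a large common displacement is added, while the L2 distance is unchanged in both cases.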
The step in which the user performs corpus embedding on the query request text and then applies dimension reduction and random homogeneous transformation to generate the vector to be retrieved specifically comprises:
S301, the user inputs a query request text and sends the query request text to the joint vector database server;
S302, corpus embedding is performed on the query request text to generate an initial vector;
S303, dimension reduction mapping and random homogeneous transformation are applied to the initial vector to generate the vector to be retrieved.
The step in which the vector to be retrieved is combined with the auxiliary data in the joint vector database, similarity retrieval is performed, and an index result is returned specifically comprises:
S401, the vector to be retrieved is combined with the auxiliary data, and the similarity distance between the resulting vector and the vectors in the joint vector database is evaluated;
S402, when the similarity distance is smaller than the set distance, the similarity requirement is met, the corpus items that are semantically related are determined, and their indexes are returned.
Referring to FIG. 9 and FIG. 10, in the retrieval stage, assume there is a query request text_q (text) from user 1. User 1 applies to the original request text the same operations used when constructing the database: embedding, dimension reduction (the dimension-reduced result is denoted Xq), and random homogeneous transformation. The random homogeneous transformation uses the same R1 as when constructing the database, giving R1Xq.
At the server, the following calculation is performed using the auxiliary data C2 corresponding to user 1:
$C_2 (R_1 X_q) = R_0 X_q$
Similarity retrieval is then performed with the vector R0Xq in the joint vector data set. At this point the vector to be retrieved and the joint vector database lie in the same vector space under the same random transformation, so retrieval accuracy is preserved.
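The retrieval stage can be sketched in the same style, assuming NumPy. The auxiliary data is again assumed to have the form C2 = R0 R1^{-1}; the toy corpus, the query, the helper names, and the value of `set_distance` are all hypothetical and chosen so that the result is easy to check by hand.

```python
import numpy as np

def random_homogeneous(dim, rng):
    """Random (dim+1) x (dim+1) matrix [[Q, t], [0, 1]] with Q orthogonal."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    h = np.eye(dim + 1)
    h[:dim, :dim] = q
    h[:dim, dim] = rng.standard_normal(dim)
    return h

def to_homogeneous(x):
    """Append a row of ones to a matrix of column vectors."""
    return np.vstack([x, np.ones((1, x.shape[1]))])

rng = np.random.default_rng(3)
dim = 3
r0 = random_homogeneous(dim, rng)          # secret parameter R0
r1 = random_homogeneous(dim, rng)          # user 1's local matrix R1
c2_user1 = r0 @ np.linalg.inv(r1)          # assumed form of user 1's auxiliary data

# Toy dimension-reduced corpus, one item per column; the server holds it under R0.
corpus = np.array([[0.0, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 2.0, 0.0],
                   [0.0, 0.0, 0.0, 2.0]])
joint_db = r0 @ to_homogeneous(corpus)

x_q = np.array([[0.1], [1.9], [0.0]])      # dimension-reduced query Xq
uploaded = r1 @ to_homogeneous(x_q)        # user side: R1 Xq
server_q = c2_user1 @ uploaded             # server side: C2 (R1 Xq) = R0 Xq

dists = np.linalg.norm(joint_db - server_q, axis=0)
set_distance = 0.5                         # hypothetical similarity threshold
hits = np.flatnonzero(dists < set_distance).tolist()
print(hits)                                # [2]
```

The distances computed on the server equal the distances between the untransformed query and corpus vectors, so the thresholded result matches what a search over the plain data would return. In the described method the positions in `hits` would correspond to content hash indexes returned to the user.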
In the present invention, the original data set is processed by a two-step transformation. First, in the dimension reduction processing, taking SVD as an example, a dimension reduction conversion matrix W is obtained by matrix decomposition according to the following formula:
$V^{T} = U \Sigma W^{T}$
W can be regarded as the projection coordinates of the data, and the covariance characteristics of the data are related to the singular values obtained from the decomposition. Even after the projection result (dimension reduction result) is obtained, it is difficult to recover the original vector data set V without knowing any of the covariance characteristics of the data.
Second, a random secondary transformation matrix R is generated at the user end, and its randomness further protects the dimension-reduced data.
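The two-step transformation may be sketched as follows, as an illustration only. The sketch assumes that the columns of V are embedding vectors, that the first k right singular vectors of V^T are used as the projection, that the data is not centered beforehand, and that R is a random orthogonal matrix; none of these choices, nor the sizes d, n, and k, is fixed by the description.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 8, 20, 3                  # embedding size, corpus size, reduced size

V = rng.standard_normal((d, n))     # original vector data set, one vector per column

# Step 1: SVD-based dimension reduction, V^T = U diag(s) W^T.
U, s, Wt = np.linalg.svd(V.T, full_matrices=False)
W_k = Wt[:k].T                      # d x k projection built from the top-k directions
X = W_k.T @ V                       # k x n dimension-reduced data

# Step 2: random secondary transformation generated at the user end.
R, _ = np.linalg.qr(rng.standard_normal((k, k)))   # ASSUMED to be orthogonal
C1 = R @ X                          # transformed data that would be uploaded

print(X.shape, C1.shape)
```

The same W_k and R would have to be applied to a query vector so that it lands in the same space as the uploaded data.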
In addition, after identity authentication is completed, the trusted third party can provide corpus lookup by hash index. This function is essentially different from vector retrieval: when the server performs vector retrieval, a large number of vector similarity calculations are needed, and a large amount of computing resources is consumed, particularly when the joint vector database is large; the vector database server can generally provide greater computing power to complete the vector retrieval. The text and hash index pairs stored by the trusted third party are unordered, that is, no information associated with the user is retained. When the retrieval requester obtains the original text according to the content index, only the corresponding text content can be obtained, and no other information.
After the vector similarity retrieval is completed, querying the corresponding text content at the trusted third party according to the index to complete the retrieval specifically comprises:
S501, after the vector similarity retrieval is completed, the user obtains the set of retrieved hash indexes returned by the data server;
S502, the user queries the corresponding text content at the trusted third party according to the set of retrieved hash indexes, completing one retrieval.
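The lookup of steps S501 and S502 may be illustrated with the following sketch of a store that keeps only hash index and text pairs. The class TrustedThirdParty, its methods, the use of SHA-256 as the hash function, and the sample texts are all hypothetical; identity authentication, which the description places before the lookup, is omitted.

```python
import hashlib

def content_index(text):
    """Hash index of a corpus text (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class TrustedThirdParty:
    """Hypothetical store holding only hash index -> text pairs."""

    def __init__(self):
        self._store = {}

    def upload(self, text):
        index = content_index(text)
        self._store[index] = text   # no information about the uploader is kept
        return index

    def lookup(self, indexes):
        # Return the texts for known indexes; unknown indexes are skipped.
        return [self._store[i] for i in indexes if i in self._store]

ttp = TrustedThirdParty()
idx_a = ttp.upload("Quarterly risk report")
idx_b = ttp.upload("Loan approval guideline")

# The user presents the hash index set returned by the vector database server.
texts = ttp.lookup([idx_b, "not-a-real-index"])
print(texts)
```

Because the store is keyed only by the hash index, a requester holding an index obtains the matching text and nothing about which party contributed it.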
According to the privacy-protecting multi-party joint vector knowledge base retrieval method provided by the invention, a secure vector database is constructed so that the original data from which a user builds the vector database does not leave the local side, and the data remains secure with respect to public database service providers while vector similarity retrieval is still supported. When a plurality of knowledge base providers participate, a vector knowledge base is established jointly by the parties; with the query vector kept secure from the server while retrieval is still supported, each participant can retrieve the information associated with the query content without learning the content of the other parties' knowledge bases.
Referring to fig. 7, the invention also discloses a system for searching the multiparty joint vector knowledge base with privacy protection, which comprises:
The joint vector database construction module 110 is configured to construct an index of the multi-party text corpus and upload the index to a trusted third party, the trusted third party distributes random protection secret parameters, each party combines the local random parameters to generate auxiliary data, and the auxiliary data includes the trusted third party random protection secret parameters and local random transformation parameters of each party;
the corpus processing module 120 is configured to obtain an embedded vector of the corpus by each party, and upload the data after the dimension reduction processing and the random homogeneous transformation together with the auxiliary data to the joint vector database server;
the dimension reduction homogeneous transformation module 130 is configured to perform dimension reduction and random homogeneous transformation after the user performs corpus embedding on the query request text, and generate a vector to be retrieved;
The similarity retrieval module 140 is configured to perform similarity retrieval after the vector to be retrieved is operated with auxiliary data in the joint vector database, and return an index result;
and the third party query module 150 is configured to query the corresponding text content at the trusted third party according to the index after completing the search for the vector similarity.
In the joint vector database construction module:
each party constructs an index for its text corpus and uploads the index to the trusted third party;
each party receives the random protection secret parameters distributed by the trusted third party and generates auxiliary data by combining them with its local random parameters.
In the corpus processing module, each party performs an embedding operation on its corpus to obtain embedded vectors, reduces the dimension of the vector data, and performs random homogeneous transformation using its local random parameters;
the random homogeneous transformation result is uploaded to the joint vector database server together with the auxiliary data.
In the dimension reduction homogeneous transformation module, the user inputs a query request text and sends the query request text to the joint vector database server;
corpus embedding is performed on the query request text to generate an initial vector;
dimension reduction mapping and random homogeneous transformation are performed on the initial vector to generate the vector to be retrieved.
In the similarity retrieval module, the vector to be retrieved is operated on with the auxiliary data, and the similarity distance between the resulting vector and the vectors in the joint vector database is judged;
when the similarity distance is smaller than the set distance, the similarity requirement is met, the corpus having a semantic relation or semantic relevance is determined, and its index is returned.
In the third party query module, after the vector similarity retrieval is completed, the user obtains the set of retrieved hash indexes returned by the data server;
the user queries the corresponding text content at the trusted third party according to the set of retrieved hash indexes, completing one retrieval.
According to the privacy-protecting multi-party joint vector knowledge base retrieval system provided by the invention, a secure vector database is constructed so that the original data from which a user builds the vector database does not leave the local side, and the data remains secure with respect to public database service providers while vector similarity retrieval is still supported. When a plurality of knowledge base providers participate, a vector knowledge base is established jointly by the parties; with the query vector kept secure from the server while retrieval is still supported, each participant can retrieve the information associated with the query content without learning the content of the other parties' knowledge bases.
Fig. 11 illustrates the physical structure of an electronic device. As shown in fig. 11, the electronic device may include a processor 1110, a communication interface (Communications Interface) 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communication interface 1120, and the memory 1130 communicate with each other through the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform a privacy-preserving multi-party joint vector knowledge base retrieval method comprising:
The multi-party text corpus is constructed into an index and is respectively uploaded to a trusted third party, the trusted third party distributes random protection secret parameters, each party combines the local random parameters to generate auxiliary data, and the auxiliary data comprises the trusted third party random protection secret parameters and local random transformation parameters of each party;
each party obtains an embedded vector of the corpus, and data after dimension reduction processing and random homogeneous transformation are uploaded to a joint vector database server together with auxiliary data;
the user performs corpus embedding on the query request text and then performs dimension reduction and random homogeneous transformation to generate a vector to be retrieved;
the vector to be searched is subjected to similarity searching after being operated with auxiliary data in a joint vector database, and an index result is returned;
after the vector similarity retrieval is completed, corresponding text contents are queried in a trusted third party according to the index, and the retrieval is completed.
Further, the logic instructions in the memory 1130 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program when executed by a processor being capable of performing a privacy-preserving multi-party joint vector knowledge base retrieval method provided by the above methods, the method comprising:
The multi-party text corpus is constructed into an index and is respectively uploaded to a trusted third party, the trusted third party distributes random protection secret parameters, each party combines the local random parameters to generate auxiliary data, and the auxiliary data comprises the trusted third party random protection secret parameters and local random transformation parameters of each party;
each party obtains an embedded vector of the corpus, and data after dimension reduction processing and random homogeneous transformation are uploaded to a joint vector database server together with auxiliary data;
the user performs corpus embedding on the query request text and then performs dimension reduction and random homogeneous transformation to generate a vector to be retrieved;
the vector to be searched is subjected to similarity searching after being operated with auxiliary data in a joint vector database, and an index result is returned;
after the vector similarity retrieval is completed, corresponding text contents are queried in a trusted third party according to the index, and the retrieval is completed.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the privacy-preserving multi-party joint vector knowledge base retrieval method provided above, the method comprising:
The multi-party text corpus is constructed into an index and is respectively uploaded to a trusted third party, the trusted third party distributes random protection secret parameters, each party combines the local random parameters to generate auxiliary data, and the auxiliary data comprises the trusted third party random protection secret parameters and local random transformation parameters of each party;
each party obtains an embedded vector of the corpus, and data after dimension reduction processing and random homogeneous transformation are uploaded to a joint vector database server together with auxiliary data;
the user performs corpus embedding on the query request text and then performs dimension reduction and random homogeneous transformation to generate a vector to be retrieved;
the vector to be searched is subjected to similarity searching after being operated with auxiliary data in a joint vector database, and an index result is returned;
after the vector similarity retrieval is completed, corresponding text contents are queried in a trusted third party according to the index, and the retrieval is completed.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for searching a multiparty joint vector knowledge base with privacy protection is characterized by comprising the following steps:
The multi-party text corpus is constructed into an index and is respectively uploaded to a trusted third party, the trusted third party distributes random protection secret parameters, each party combines the local random parameters to generate auxiliary data, and the auxiliary data comprises the trusted third party random protection secret parameters and local random transformation parameters of each party;
each party obtains an embedded vector of the corpus, and data after dimension reduction processing and random homogeneous transformation are uploaded to a joint vector database server together with auxiliary data;
the user performs corpus embedding on the query request text and then performs dimension reduction and random homogeneous transformation to generate a vector to be retrieved;
the vector to be searched is subjected to similarity searching after being operated with auxiliary data in a joint vector database, and an index result is returned;
after the vector similarity retrieval is completed, corresponding text contents are queried in a trusted third party according to the index, and the retrieval is completed;
When constructing the joint vector database, firstly, distributing random protection secret parameters R0 for a plurality of users, wherein the secret parameters are also random homogeneous transformation matrixes, each party uploads corpus and corresponding index to a trusted third party, each party uploads the converted vector data C1 and auxiliary data C2 to a vector database server, wherein,
The server stores the result of calculation C2C1, namely:
2. The method for searching the multi-party joint vector knowledge base with privacy protection according to claim 1, wherein the multi-party text corpus constructs an index and uploads the index to a trusted third party respectively, the trusted third party distributes random protection secret parameters, each party generates auxiliary data by combining local random parameters, the auxiliary data comprises the trusted third party random protection secret parameters and local random transformation parameters of each party, and the method specifically comprises the following steps:
The text corpus of each party constructs an index and uploads the index to a trusted third party respectively;
Each party receives the random protection secret parameters distributed by the trusted third party and generates auxiliary data by combining the local random parameters.
3. The method for searching the multi-party joint vector knowledge base with privacy protection according to claim 1, wherein each party obtains an embedded vector of a corpus, and the data after dimension reduction and random homogeneous transformation is uploaded to a joint vector database server together with auxiliary data, specifically comprising:
each party performs an embedding operation on its corpus to obtain embedded vectors, reduces the dimension of the vector data, and performs random homogeneous transformation by adopting local random parameters;
and uploading the result to a joint vector database server together with auxiliary data based on the random homogeneous transformation result.
4. The method for searching the multi-party joint vector knowledge base with privacy protection according to claim 1, wherein the user performs the dimension reduction and random homogeneous transformation after the query request text is subjected to corpus embedding, and generates the vector to be searched, and the method specifically comprises the following steps:
The method comprises the steps that a user inputs a query request text and sends the query request text to a joint vector database server;
Performing corpus embedding on the query request text to generate an initial vector;
and performing dimension reduction mapping and random homogeneous transformation operation on the initial vector to generate a vector to be retrieved.
5. The method for searching the multi-party joint vector knowledge base with privacy protection according to claim 1, wherein the searching the similarity between the vector to be searched and the auxiliary data in the joint vector database, and returning the index result comprises the following steps:
calculating the vector to be searched and auxiliary data, and judging the similarity distance between the calculated vector data and the vector in the joint vector database;
and under the condition that the similarity distance is smaller than the set distance, meeting the similarity requirement, determining the corpus with semantic relation or semantic relativity, and returning the index.
6. The method for searching the multi-party joint vector knowledge base with privacy protection according to claim 1, wherein after the completion of the vector similarity search, the corresponding text content is queried at a trusted third party according to the index to complete the search, and the method specifically comprises the following steps:
after the vector similarity retrieval is completed, the user acquires a retrieval hash index set returned by the data server side;
and the user queries the corresponding text content at the trusted third party according to the retrieval hash index set to complete one-time retrieval.
7. A privacy preserving multiparty joint vector knowledge base retrieval system, the system comprising:
The joint vector database construction module is used for constructing indexes of the multiparty text corpus and uploading the indexes to the trusted third party respectively, the trusted third party distributes random protection secret parameters, each party combines the local random parameters to generate auxiliary data, and the auxiliary data comprises the trusted third party random protection secret parameters and local random transformation parameters of each party;
The corpus processing module is used for each party to acquire an embedded vector of the corpus, and the data after dimension reduction processing and random homogeneous transformation are uploaded to the joint vector database server together with the auxiliary data;
the dimension reduction homogeneous transformation module is used for carrying out dimension reduction and random homogeneous transformation after the user carries out corpus embedding on the query request text to generate a vector to be searched;
The similarity retrieval module is used for carrying out similarity retrieval after the vector to be retrieved is operated with auxiliary data in the joint vector database and returning an index result;
The third party query module is used for querying corresponding text content in a trusted third party according to the index after completing the vector similarity retrieval, and completing the retrieval;
When constructing the joint vector database, firstly, distributing random protection secret parameters R0 for a plurality of users, wherein the secret parameters are also random homogeneous transformation matrixes, each party uploads corpus and corresponding index to a trusted third party, each party uploads the converted vector data C1 and auxiliary data C2 to a vector database server, wherein,
The server stores the result of calculation C2C1, namely:
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the privacy-preserving multiparty joint vector knowledge base retrieval method of any of claims 1 to 6 when the computer program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a privacy-preserving multiparty joint vector knowledge base retrieval method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements a privacy-preserving multiparty joint vector knowledge base retrieval method as claimed in any one of claims 1 to 6.
CN202311703773.1A 2023-12-12 2023-12-12 Multi-party joint vector knowledge base retrieval method and system for privacy protection Active CN117708263B (en)


Publications (2)

Publication Number Publication Date
CN117708263A (en) 2024-03-15
CN117708263B (en) 2024-07-30




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant