CN111859421A

CN111859421A - Multi-keyword ciphertext storage and retrieval method and system based on word vector

Info

Publication number: CN111859421A
Application number: CN202010651620.7A
Authority: CN
Inventors: 韩光; 田宝松; 许彩云; 杨杨; 兰静; 哈兰; 崔永进
Original assignee: China National Software & Service Co ltd
Current assignee: China National Software & Service Co ltd
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2020-10-30

Abstract

The invention discloses a multi-keyword ciphertext storage and retrieval method and system based on word vectors, comprising the following steps: the data owner represents the keywords of each plaintext document as (n+1)-dimensional word vectors, computes the ciphertext index and the encrypted document corresponding to the plaintext document, uploads them to the cloud server, and sends the index key, the decryption private key and the model parameters to a data user; the data user generates a trapdoor from the query keyword set, obtains the encrypted documents with the highest relevance from the cloud server, and decrypts them to obtain the corresponding plaintext documents. Word vectors capture the implicit semantics of words and thereby improve query accuracy; the scheme borrows the idea of the MRSE method to secure ciphertext queries and to resist background-knowledge attacks by an adversary.

Description

Multi-keyword ciphertext storage and retrieval method and system based on word vector

Technical Field

The invention belongs to the field of ciphertext retrieval, and particularly relates to a multi-keyword ciphertext storage and retrieval method and system based on word vectors.

Background

Driven by the rapid development of internet applications, users' demand for storage capacity keeps growing, so more and more enterprises and individuals (i.e. data owners) choose to store data on cloud servers to save local storage space. To ensure data security, the data owner encrypts the data before uploading it to the cloud server. Encrypted data, however, loses flexibility: a user who wants particular data from a large encrypted collection can obtain it only after downloading and decrypting all the data on the cloud server. Because retrieving relevant data this way is very inefficient, researchers proposed ciphertext retrieval technology to solve the problem.

Ciphertext ranked retrieval is a continuation of ciphertext retrieval technology that improves the accuracy of ciphertext retrieval on the basis of fuzzy retrieval. By the number of query keywords, ciphertext ranked retrieval methods can be divided into single-keyword ranked search methods and multi-keyword ranked search (Multi-Keyword Ranked Search) methods.

The MRSE method (Multi-keyword Ranked Search over Encrypted cloud data) is one of the classic methods in multi-keyword ciphertext ranked retrieval. Building on MRSE, later researchers expanded the query keywords with related terms through techniques such as query expansion and personalized recommendation, so as to enrich the semantic information of the query; however, as more related terms are added, query semantic drift occurs and retrieval accuracy declines.

Although Chinese patent application CN109271485A discloses a semantics-supporting ranked retrieval method for encrypted documents in a cloud environment, the LDA topic model it adopts only uses the probability distribution of keywords under specific topics to represent their potential contribution to topic semantics and cannot sufficiently and directly mine the semantic relationships among keywords, so that application offers only a limited improvement in ciphertext retrieval accuracy.

Disclosure of Invention

In order to solve the problems, the invention provides a multi-keyword ciphertext storage and retrieval method and system based on word vectors.

In order to achieve the purpose, the technical scheme of the invention is as follows:

a multi-keyword ciphertext storage method based on word vectors comprises the following steps:

1) according to a security parameter n, a data owner randomly generates an (n+1)-dimensional binary segmentation vector s and two (n+1)×(n+1) invertible matrices, namely a first invertible matrix M₁ and a second invertible matrix M₂, to obtain the index key SK = (s, M₁, M₂), where n ≥ 10;

2) Extracting m keywords from each plaintext document of a plaintext document set respectively, inputting the m keywords of each plaintext document into a model to obtain m n-dimensional keyword word vectors of each plaintext document, wherein the method for obtaining the model comprises the step of inputting a sample keyword set into a word2vec tool for training;

3) expanding the dimensionality of each keyword word vector from n dimensionality to n +1 dimensionality to obtain a plaintext index of each plaintext document;

4) calculating the ciphertext index of each plaintext document according to the index key and each plaintext index, and obtaining the encrypted document of each plaintext document by generating a public/private key pair for encryption and decryption;

5) And uploading each ciphertext index and each encrypted document to a cloud server, and sending an index key, a decryption private key and model parameters obtained by training to a data user.

Further, the dimension of each keyword word vector is expanded from n dimension to n +1 dimension by the following strategy:

1) the first n dimensions of each keyword word vector are kept unchanged;

2) the (n+1)-th dimension of each keyword word vector is computed as cw′_j[n+1] = −0.5‖cw_j‖², where cw_j is the j-th keyword word vector and j ∈ {1, 2, …, m}.
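The expansion rule above can be illustrated with a minimal Python sketch. The function name `extend_keyword_vector` and the toy 2-dimensional vector are assumptions made for illustration; real word vectors would come from the trained model and have n ≥ 10 dimensions.

```python
def extend_keyword_vector(cw):
    """Return cw': the first n entries unchanged, plus -0.5 * ||cw||^2."""
    squared_norm = sum(x * x for x in cw)
    return list(cw) + [-0.5 * squared_norm]


# Toy 2-dimensional word vector (hypothetical values, n = 2 for readability).
cw = [3.0, 4.0]
cw_ext = extend_keyword_vector(cw)  # squared norm is 25, so the new entry is -12.5
```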

Further, the ciphertext index of the plaintext document is calculated by:

1) dividing the plaintext index into a first plaintext index and a second plaintext index by using a binary division vector s;

2) by means of a first invertible matrix M₁And a second invertible matrix M₂And respectively encrypting the first plaintext index and the second plaintext index to obtain a ciphertext index comprising the first ciphertext index and the second ciphertext index.

Further, the plaintext index is partitioned into a first plaintext index and a second plaintext index by the following strategy:

1) if s[l] = 1, then d′_i[t][l] + d″_i[t][l] = d_i[t][l], where s[l] is the l-th dimension of the binary segmentation vector, d_i[t][l] is the plaintext index of the l-th dimension of the t-th keyword of the i-th plaintext document, d′_i[t][l] is the corresponding first plaintext index, d″_i[t][l] is the corresponding second plaintext index, t ∈ {1, 2, …, m}, l ∈ {1, 2, …, n, n+1}, i ∈ {1, 2, …, k}, and k is the number of plaintext documents in the plaintext document set;

2) if s[l] = 0, then d′_i[t][l] = d″_i[t][l] = d_i[t][l].
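A minimal sketch of this splitting rule for one row d_i[t] of the plaintext index follows. The text does not state how the two shares are chosen when s[l] = 1; drawing one share uniformly at random is an assumption of this sketch, as are the function name and the toy values.

```python
import random


def split_plaintext_row(d_row, s, rng):
    """Split one row d_i[t] into (d1, d2) following the rule above."""
    d1, d2 = [], []
    for value, bit in zip(d_row, s):
        if bit == 1:
            # Assumption: one share is random, the other completes the sum.
            share = rng.uniform(-1.0, 1.0)
            d1.append(share)
            d2.append(value - share)
        else:
            # s[l] = 0: both halves carry the original value.
            d1.append(value)
            d2.append(value)
    return d1, d2


rng = random.Random(0)          # fixed seed only to make the sketch repeatable
s = [1, 0, 1]                   # toy segmentation vector
d_row = [3.0, 4.0, -12.5]       # toy extended keyword vector
d1, d2 = split_plaintext_row(d_row, s, rng)
```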

A multi-keyword ciphertext retrieval method based on word vectors comprises the following steps:

1) inputting x query keywords into a model trained by a data owner by a data user to obtain x n-dimensional query keyword word vectors;

2) expanding the dimensionality of each query keyword word vector from n dimensionality to n +1 dimensionality to obtain a query index;

3) generating a trapdoor according to the received index key SK = (s, M₁, M₂) and the query index, and uploading the trapdoor to the cloud server;

4) the cloud server calculates the correlation degree of the query key words and each encrypted document according to the trapdoors and the ciphertext index of each encrypted document, and returns a plurality of encrypted documents with the highest correlation degree to the data user;

5) and the data user obtains a corresponding plaintext document according to the decryption private key.

Further, the dimension of each query keyword word vector is expanded from n dimension to n +1 dimension by the following strategy:

1) the first n dimensions of each query keyword word vector are kept unchanged;

2) the (n+1)-th dimension of each query keyword word vector is set to cqw′_s[n+1] = 1, where cqw_s is the s-th query keyword word vector and s ∈ {1, 2, …, x}.
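Taken together with the storage-side rule cw′_j[n+1] = −0.5‖cw_j‖², this expansion ties the inner product of two expanded vectors to the Euclidean distance between the original vectors. The following is a derivation from the two rules as stated, not a formula reproduced from the source:

```latex
\begin{aligned}
cw'_j \cdot cqw'_s
  &= cw_j \cdot cqw_s + \left(-\tfrac{1}{2}\lVert cw_j\rVert^2\right)\cdot 1 \\
  &= -\tfrac{1}{2}\left(\lVert cw_j - cqw_s\rVert^2 - \lVert cqw_s\rVert^2\right)
\end{aligned}
```

For a fixed query keyword the term ‖cqw_s‖² is constant, so a larger inner product corresponds to a smaller Euclidean distance between the document keyword and the query keyword.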

Further, the trapdoor is generated by:

1) generating an x × (n+1)-dimensional query vector r × Q_w from the x query keyword word vectors and a random number r, where Q_w is the query index;

2) dividing the query vector into a first query vector and a second query vector using the binary segmentation vector s, and encrypting the first query vector and the second query vector with the first invertible matrix M₁ and the second invertible matrix M₂ respectively, to obtain a trapdoor comprising the encrypted first query vector and the encrypted second query vector.

Further, the query vector is partitioned into a first query vector and a second query vector by:

1) if s[l] = 1, then Q′_w[b][l] = Q″_w[b][l] = r × Q_w[b][l], where s[l] is the l-th dimension of the binary segmentation vector, Q_w[b][l] is the query index of the l-th dimension of the b-th query keyword, Q′_w[b][l] is the corresponding first query index, Q″_w[b][l] is the corresponding second query index, b ∈ {1, 2, …, x}, and l ∈ {1, 2, …, n, n+1};

2) if s[l] = 0, then Q′_w[b][l] + Q″_w[b][l] = r × Q_w[b][l].
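The query-side rule is the mirror image of the storage-side rule: positions that are additively split in the plaintext index are copied in the query, and vice versa. A short sketch, with hypothetical helper names and toy values, suggests why this keeps the inner product intact: the sum of the two half-products equals r times the original inner product. How the random shares are drawn is an assumption of the sketch.

```python
import random


def split(vec, s, split_bit, rng):
    """Additively split entries where s[l] == split_bit; copy the rest."""
    first, second = [], []
    for value, bit in zip(vec, s):
        if bit == split_bit:
            share = rng.uniform(-1.0, 1.0)   # assumed way of picking a share
            first.append(share)
            second.append(value - share)
        else:
            first.append(value)
            second.append(value)
    return first, second


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(1)
s = [1, 0, 1, 0]                  # toy segmentation vector
d = [0.2, -0.4, 0.1, -0.105]      # toy row of a plaintext index
q = [0.3, 0.5, -0.2, 1.0]         # toy row of a query index
r = 2.5                           # random scaling factor

d1, d2 = split(d, s, 1, rng)                      # storage side: split where s[l] = 1
q1, q2 = split([r * x for x in q], s, 0, rng)     # query side: split where s[l] = 0

combined = dot(d1, q1) + dot(d2, q2)
expected = r * dot(d, q)          # the two values agree up to rounding
```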

Further, the relevance between the query keyword set and each encrypted document is calculated using the Kuhn-Munkres algorithm.

A multi-keyword ciphertext retrieval system based on word vectors, comprising:

a data owner, configured to: randomly generate, according to a security parameter n, an (n+1)-dimensional binary segmentation vector s and two (n+1)×(n+1) invertible matrices, namely a first invertible matrix M₁ and a second invertible matrix M₂, to obtain the index key SK = (s, M₁, M₂), where n ≥ 10; extract m keywords from each plaintext document of a plaintext document set, and input the m keywords of each plaintext document into a model trained on a sample keyword set to obtain m n-dimensional keyword word vectors of each plaintext document; expand each keyword word vector from n dimensions to n+1 dimensions to obtain the plaintext index of each plaintext document; calculate the ciphertext index of each plaintext document according to the index key and each plaintext index, and obtain the encrypted document of each plaintext document by generating a public/private key pair for encryption and decryption; and upload each ciphertext index and each encrypted document to a cloud server, and send the index key, the decryption private key and the trained model parameters to a data user;

the data user is used for inputting x query keywords into a model trained by a data owner to obtain x n-dimensional query keyword word vectors; expanding the dimensionality of each query keyword word vector from n dimensionality to n +1 dimensionality to obtain a query index; generating a trapdoor according to the received index key and the query index, and uploading the trapdoor to a cloud server; the data user obtains a corresponding plaintext document according to the decryption private key;

The cloud server is used for storing the encrypted document and the corresponding ciphertext index; and calculating the correlation degree of the query key words and each encrypted document according to the trapdoors and the ciphertext index of each encrypted document, and returning a plurality of encrypted documents with the highest correlation degrees to the data user.

Compared with the prior art, the invention has the beneficial effects that:

1) the word vector can be used for accurately acquiring the implicit semantics of the words, so that the query accuracy can be improved;

2) the problem of relevance among the subjects in the Chinese patent CN109271485A is solved;

3) by using the thought of the MRSE method for reference, the security of ciphertext query is ensured, and the background attack of an adversary is prevented.

Drawings

Fig. 1 is a system architecture diagram of ciphertext retrieval.

Fig. 2 is a diagram illustrating the construction of the ciphertext retrieval scheme.

FIG. 3 is a flow chart of index generation and encryption.

Detailed Description

In order that the objects, principles, aspects and advantages of the present invention will become more apparent, the present invention will be described in detail below with reference to specific embodiments thereof and with reference to the accompanying drawings.

The system model of the scheme is shown in FIG. 1; according to their functions, the cloud service is divided into three entities: the data owner, the cloud server and the data user. The scheme comprises the following steps, as shown in FIG. 2:

Step one: key generation, SK ← setup(1ⁿ): given the security parameter n, where n ≥ 10 and the preferred range is [50, 200], the algorithm outputs the index encryption key SK.

Step two: (I, C) ← GenIndex(F, SK): taking the plaintext document set F and the encryption key SK as input, the algorithm generates the index set I′ corresponding to F, and encrypts F and I′ to obtain the ciphertext document set C and the corresponding ciphertext index I.

Step three: trapdoor generation, TD ← GenTrapdoor(Q_w, SK): the algorithm encrypts the query Q_w with the index key SK to obtain the trapdoor TD.

Step four: generating the ciphertext query result E_k: the cloud server receives the encrypted index I, the trapdoor TD and the parameter k, uses the algorithm to calculate the relevance between the query and the ciphertext set C, and returns the top-k relevant ciphertexts E_k to the data user.

Step five: obtaining the plaintext query result, F_k ← Dec(E_k, SK): after receiving the ciphertexts E_k returned by the cloud server, the data user decrypts them with the algorithm to obtain the top-k relevant plaintexts F_k.
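The top-k selection in step four can be sketched as follows, assuming the cloud server has already computed one relevance score per encrypted document. The document identifiers and score values are invented for illustration.

```python
import heapq


def top_k(scores, k):
    """Return the k document ids with the highest relevance score."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical relevance scores, one per encrypted document.
scores = {"doc1": 0.42, "doc2": 0.91, "doc3": 0.17, "doc4": 0.66}
best = top_k(scores, 2)
```

Only the identifiers of the selected ciphertexts are returned here; the server never needs the plaintext to rank the documents.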

The specific construction method of the first step is as follows:

The data owner sets a security parameter n, namely the dimension of the word vectors, then randomly generates an (n+1)-dimensional binary vector s as the segmentation indication vector, and also generates two (n+1)×(n+1) invertible matrices M₁ and M₂; then SK = (s, M₁, M₂).

The word2vec tool in this step trains word vectors with a neural network; other deep neural network methods may be substituted to obtain the word vectors.

The word vector in this step is a distributed low-dimensional real vector, and the basic idea is to map words to N-dimensional real vectors using a training corpus. The distance between word vectors may represent an implicit semantic relationship between words.
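As a small illustration of distance reflecting semantic closeness, the sketch below compares Euclidean distances between toy vectors. The three vectors are invented for this example; real vectors would be produced by the trained model.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
vectors = {
    "cloud": [0.9, 0.1, 0.0],
    "server": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def distance(a, b):
    """Euclidean distance between the vectors of two words."""
    return math.dist(vectors[a], vectors[b])


near = distance("cloud", "server")   # related words lie close together
far = distance("cloud", "banana")    # unrelated words lie farther apart
```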

The binary vector in this step means that each dimension in the vector takes a value of 0 or 1.

The specific construction of step two, (I, C) ← GenIndex(F, SK), is as follows:

As shown in FIG. 3, the data owner uses the gensim package in Python to train on the plaintext document set F = {f₁, f₂, …, f_k} and obtain a trained model. The trained model yields the word vectors {cw₁, cw₂, …, cw_m} corresponding to the m keywords of each document, where m is the number of keywords in the index. The word vector of each keyword is then expanded from n dimensions to n+1 dimensions to form the expanded word vectors {cw′₁, cw′₂, …, cw′_m}; the expansion keeps the first n dimensions of each keyword word vector unchanged and sets cw′_j[n+1] = −0.5‖cw_j‖², j ∈ {1, 2, …, m}. The word vectors cw′_j are then used to generate, for each document f_i, an m × (n+1) document matrix d_i (i ∈ {1, 2, …, k}) as the plaintext index of the document. The plaintext index d_i is divided into d′_i and d″_i using the segmentation indication vector s: if s[l] = 1, then d′_i[t][l] + d″_i[t][l] = d_i[t][l]; otherwise d′_i[t][l] = d″_i[t][l] = d_i[t][l], where t ∈ {1, 2, …, m} and l ∈ {1, 2, …, n, n+1}. Finally, d′_i and d″_i are encrypted with the two matrices M₁ and M₂ to obtain the ciphertext index I_i.

Any secure and reliable encryption algorithm can be used to encrypt the plaintext document set F into the ciphertext document set C. The data owner then uploads the ciphertext indexes I_i and the ciphertext set C to the cloud server, and sends the index key SK = (s, M₁, M₂), the decryption key of the encryption algorithm, and the parameters of the trained model to the data user.

The specific construction of step three, TD ← GenTrapdoor(Q_w, SK), is as follows:

Using the model trained by the data owner, the data user inputs the query keyword set {qw₁, qw₂, …, qw_x} and obtains the word vectors {cqw₁, cqw₂, …, cqw_x} corresponding to the x query keywords. Each query keyword word vector is then expanded from n dimensions to n+1 dimensions to form the expanded word vectors {cqw′₁, cqw′₂, …, cqw′_x}; the expansion keeps the first n dimensions of each query keyword word vector unchanged and sets cqw′_s[n+1] = 1, s ∈ {1, 2, …, x}. The expanded word vectors form the query Q_w, from which an x × (n+1)-dimensional query vector r × Q_w is generated, where x is the number of query keywords. The query r × Q_w is divided into Q′_w and Q″_w using the segmentation indication vector s: if s[l] = 1, then Q′_w[b][l] = Q″_w[b][l] = r × Q_w[b][l]; otherwise Q′_w[b][l] + Q″_w[b][l] = r × Q_w[b][l], where b ∈ {1, 2, …, x} and l ∈ {1, 2, …, n, n+1}. Then Q′_w and Q″_w are encrypted with the two matrices M₁ and M₂ to obtain the trapdoor TD.

Finally, the data user uploads the trapdoor TD to the cloud server.
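The exact matrix encryption formulas are not reproduced in this text. The sketch below assumes the construction used by the secure inner-product scheme that MRSE builds on: the index halves are multiplied by the transposed matrices and the trapdoor halves by the inverse matrices. Under that assumption the server can recover the inner product from ciphertexts alone. All matrices, vectors and helper names are hypothetical toy values.

```python
from fractions import Fraction


def invert(matrix):
    """Gauss-Jordan inverse over exact fractions; raises if singular."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy 3x3 invertible matrices (determinants 3 and 7).
M1 = [[2, 1, 0], [0, 1, 1], [1, 0, 1]]
M2 = [[1, 2, 0], [0, 1, 3], [1, 0, 1]]

# Already-split halves of one index row and one query row (toy values).
d1, d2 = [1, 2, 3], [0, -1, 4]
q1, q2 = [2, 0, 1], [1, 1, -2]

# Assumed encryption: index uses M^T, trapdoor uses M^-1.
index = (mat_vec(transpose(M1), d1), mat_vec(transpose(M2), d2))
trapdoor = (mat_vec(invert(M1), q1), mat_vec(invert(M2), q2))

score = dot(index[0], trapdoor[0]) + dot(index[1], trapdoor[1])
expected = dot(d1, q1) + dot(d2, q2)   # what the plaintext halves would give
```

Exact fractions are used so that the equality between `score` and `expected` is not blurred by floating-point rounding.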

Step four: the cloud server calculates Rscore = KM(sim(I_i, TD)) to obtain the relevance between the query and each ciphertext, and returns the top-k ciphertexts to the data user.

Index I_iThe correlation with the trapdoor TD is calculated using the following equation:

The function KM() in the above equation denotes the Kuhn-Munkres (KM) algorithm, a maximum-weight matching algorithm for weighted bipartite graphs; it yields the maximum total weight of the matching edges in the bipartite graph. dis(d_i, Q_w) denotes the Euclidean distance between the query matrix and the document matrix.
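To illustrate what a maximum-weight bipartite matching computes, the sketch below uses an exhaustive search over all one-to-one assignments. This is a brute-force stand-in chosen for brevity, not the Kuhn-Munkres algorithm itself; it returns the same optimum on small inputs but does not scale. The weight matrix is a hypothetical similarity table between x = 2 query keywords and m = 3 document keywords.

```python
from itertools import permutations


def max_weight_matching(weights):
    """weights[b][t]: similarity of query keyword b and document keyword t.

    Tries every one-to-one assignment (requires x <= m) and returns
    (best_total, assignment), where assignment[b] is the matched column.
    """
    x, m = len(weights), len(weights[0])
    best_total, best_assignment = float("-inf"), None
    for columns in permutations(range(m), x):
        total = sum(weights[b][t] for b, t in enumerate(columns))
        if total > best_total:
            best_total, best_assignment = total, columns
    return best_total, best_assignment


# Hypothetical similarity values.
weights = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.3],
]
score, assignment = max_weight_matching(weights)
```

In this toy table a greedy choice would pair the first query keyword with its best match (0.9) and end with a total of 1.2, whereas the optimal matching pairs it with the second document keyword and reaches 1.5, which is why a matching algorithm is used instead of per-keyword maxima.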

Step five: after obtaining the ciphertexts E_k, the data user uses the decryption key to obtain the relevant plaintexts.

The retrieval accuracy of the invention is analyzed as follows:

The scheme uses the original information of the data user and calculates the inner product between the query keyword vectors and the document keyword vectors to obtain the semantic relationship between query keywords and document keywords. By combining the KM algorithm with word-level semantic relevance, the scheme improves the matching between query keywords and document keywords and thereby the accuracy of ciphertext retrieval.

The security of the invention is analyzed as follows:

Under the known-ciphertext model, the adversary can obtain the corresponding ciphertext information, including the encrypted document vectors, the query vectors and the like, but the encryption key is kept secret. The encryption key of the scheme consists of two parts: the (n+1)-dimensional segmentation indication vector s and the (n+1)×(n+1) invertible matrices M₁ and M₂. Even if the invertible matrices could be obtained through matrix operations from the ciphertext index and the trapdoor under the known-ciphertext condition, the segmentation indication vector cannot be deduced. Determining the segmentation indication vector s requires on the order of 2ⁿ computations, which places extremely high demands on the computing performance of the server, so the scheme is secure under the known-ciphertext model.
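The size of the search space can be made concrete with a few lines of arithmetic. The text cites an effort on the order of 2ⁿ; since s has n+1 binary entries, exhaustively enumerating it covers 2^(n+1) candidates. The function name is an assumption of this sketch.

```python
def candidate_count(n):
    """Number of binary vectors of dimension n + 1."""
    return 2 ** (n + 1)


# n = 10 is the minimum in the text; 50 and 200 bound the preferred range.
for n in (10, 50, 200):
    print(n, candidate_count(n))
```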

Under the known-background model, the cloud server not only knows the ciphertext index information, but can also record query results, analyze the query process, infer relationships among different trapdoors, and exploit statistical knowledge of the database. Query unlinkability: borrowing the idea of the MRSE method, the scheme computes the relevance between query keywords and document keywords without revealing the correspondence between the query matrix and the document matrix. Keyword security: keyword encryption in the scheme uses the security-enhanced inner-product computation, which satisfies keyword security, so the encryption proposed by the scheme satisfies keyword security as well.

The word-vector-based multi-keyword ciphertext ranked retrieval method improves the accuracy of ciphertext retrieval while preserving its security.

It should be understood that the above embodiments are described in some detail and with some particularity, but should not be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the claims.

Claims

1. A multi-keyword ciphertext storage method based on word vectors comprises the following steps:

1) according to a security parameter n, the data owner randomly generates an (n+1)-dimensional binary segmentation vector s and two (n+1)×(n+1) invertible matrices, namely a first invertible matrix M₁ and a second invertible matrix M₂, to obtain the index key SK = (s, M₁, M₂), where n ≥ 10;

2. The method of claim 1, wherein the dimension of each keyword word vector is expanded from n-dimension to n + 1-dimension by the following strategy:

1) the first n dimensions of each keyword word vector are kept unchanged;

3. The method of claim 2, wherein the ciphertext index of the plaintext document is computed by:

4. The method of claim 3, wherein the plaintext index is partitioned into a first plaintext index and a second plaintext index by:

1) if s[l] = 1, then d′_i[t][l] + d″_i[t][l] = d_i[t][l], where s[l] is the l-th dimension of the binary segmentation vector, d_i[t][l] is the plaintext index of the l-th dimension of the t-th keyword of the i-th plaintext document, d′_i[t][l] is the corresponding first plaintext index, d″_i[t][l] is the corresponding second plaintext index, t ∈ {1, 2, …, m}, l ∈ {1, 2, …, n, n+1}, i ∈ {1, 2, …, k}, and k is the number of plaintext documents in the plaintext document set;

2) if s[l] = 0, then d′_i[t][l] = d″_i[t][l] = d_i[t][l].

5. A multi-keyword ciphertext retrieval method based on word vectors comprises the following steps:

4) the cloud server calculates the correlation degree of the query key words and each encrypted document according to the trapdoors and the ciphertext index of each encrypted document obtained by the method of any one of claims 1 to 5, and returns a plurality of encrypted documents with the highest correlation degree to the data user;

6. The method of claim 5, wherein the dimension of each query keyword word vector is expanded from n-dimensions to n + 1-dimensions by the following strategy:

1) the first n dimensions of each query keyword word vector are kept unchanged;

2) the (n+1)-th dimension of each query keyword word vector is set to cqw′_s[n+1] = 1, where cqw_s is the s-th query keyword word vector and s ∈ {1, 2, …, x}.

7. The method of claim 6, wherein the trapdoor is created by:

8. The method of claim 7, wherein the query vector is partitioned into a first query vector and a second query vector by:

1) if s[l] = 1, then Q′_w[b][l] = Q″_w[b][l] = r × Q_w[b][l], where s[l] is the l-th dimension of the binary segmentation vector, Q_w[b][l] is the query index of the l-th dimension of the b-th query keyword, Q′_w[b][l] is the corresponding first query index, Q″_w[b][l] is the corresponding second query index, b ∈ {1, 2, …, x}, and l ∈ {1, 2, …, n, n+1};

2) if s[l] = 0, then Q′_w[b][l] + Q″_w[b][l] = r × Q_w[b][l].

9. The method of claim 5, wherein the relevance between the query keyword set and each encrypted document is calculated using the Kuhn-Munkres algorithm.

10. A multi-keyword ciphertext retrieval system based on word vectors, comprising:

a data owner, configured to: randomly generate, according to a security parameter n, an (n+1)-dimensional binary segmentation vector s and two (n+1)×(n+1) invertible matrices, namely a first invertible matrix M₁ and a second invertible matrix M₂, to obtain the index key SK = (s, M₁, M₂), where n ≥ 10; extract m keywords from each plaintext document of a plaintext document set, and input the m keywords of each plaintext document into a model trained on a sample keyword set to obtain m n-dimensional keyword word vectors of each plaintext document; expand each keyword word vector from n dimensions to n+1 dimensions to obtain the plaintext index of each plaintext document; calculate the ciphertext index of each plaintext document according to the index key and each plaintext index, and obtain the encrypted document of each plaintext document by generating a public/private key pair for encryption and decryption; and upload each ciphertext index and each encrypted document to a cloud server, and send the index key, the decryption private key and the trained model parameters to a data user;