CN109902143B - Multi-keyword extended retrieval method based on ciphertext - Google Patents


Publication number
CN109902143B
Authority
CN
China
Prior art keywords
keyword
query
semantic
user
fuzzy
Prior art date
Legal status
Active
Application number
CN201910160214.8A
Other languages
Chinese (zh)
Other versions
CN109902143A (en)
Inventor
许建
黄新宇
戴华
杨庚
陈燕俐
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority application: CN201910160214.8A
Published as CN109902143A; application granted and published as CN109902143B
Legal status: Active

Abstract

The invention discloses a ciphertext-based multi-keyword expansion retrieval method. The method extracts a keyword set from a data source, constructs and groups an inverse document vector set based on the keyword set, builds a B+ index tree for each group of vectors and encrypts it with the secure KNN algorithm, encrypts the data source with a symmetric encryption algorithm, and uploads the encrypted index tree group and data source to a cloud server. Retrieval keywords entered by the user undergo fuzzy processing, so that user input errors are corrected. Finally, a semantic analysis operation is performed on the fuzzy query keyword set to expand it; a query vector is generated from the semantically analyzed keyword set and processed with the encryption algorithm to obtain the trapdoors, which are grouped and uploaded to the cloud server. By expanding the trapdoor through fuzzy processing and semantic analysis of the user's retrieval keywords, the invention improves the user experience.

Description

Multi-keyword extended retrieval method based on ciphertext
Technical Field
The invention relates to the technical field of text retrieval, and in particular to a ciphertext-based multi-keyword expansion retrieval method.
Background
In privacy-protection research, searchable encryption occupies an important position, and work in this field has been richly developed. However, with the explosive growth of data and the increasing diversification of user demands, such schemes face various problems and challenges. Most current mainstream schemes perform exact retrieval on the query keywords entered by the user without considering other factors, so when the user's query keywords are wrong or too narrow, reasonable results cannot be returned. As the volume of uploaded data grows linearly, filtering results or raising keyword priority according to the user's preferences, thereby reducing the screening burden on the user, has also become one of the important directions of improvement.
Personalized search addresses this problem well. Its main principle is to collect user information, analyze the user's interests and preferences, and personalize the ranking of retrieval results for the user, so that the user can quickly find the desired result. But since this scheme operates on the basis of user information, it is a poor fit for ciphertext retrieval, which is a privacy-focused setting.
Disclosure of Invention
The purpose of the invention is as follows: to overcome the defects of the prior art, the invention provides a ciphertext-based multi-keyword expansion retrieval method, which addresses the low efficiency, low accuracy, single query results, and low degree of intelligence of fuzzy multi-keyword ranked retrieval.
The technical scheme is as follows: the invention relates to a ciphertext-based multi-keyword expansion retrieval method, which comprises the following steps:
(1) constructing a B + index tree group: constructing an inverse document vector set IDOC according to a keyword set KW in a data source, constructing a corresponding grouping B + index tree group IO by using the inverse document vector set IDOC, and constructing a corresponding grouping document data set IT by using a document vector set DOC;
(2) IT and IO encryption: IO and IT are encrypted with the secure KNN algorithm; the encrypted data are denoted E_IO and E_IT respectively and uploaded to the cloud server;
(3) Fuzzy processing of the query keywords: the set W_q of query keywords entered by the user is matched against the keyword set KW to obtain the processed fuzzy keyword set W_m;
(4) Semantic expansion of the fuzzy query keyword set W_m: the semantic similarity between keywords is obtained from the constructed semantic tree; W_m is traversed and semantic similarity calculation with the keyword set KW is performed, the keywords in each fuzzy set are semantically expanded to form semantic expansion sets, and the semantic expansion set of each keyword is added to W_m to form the semantic set W_y;
(5) Constructing the trapdoor: W_y is traversed, the first query vector QO and the second query vector QT are constructed according to whether each traversed keyword exists in KW, QO and QT are encrypted with the secure KNN algorithm to obtain the trapdoor, and the encrypted data is uploaded to the cloud server;
(6) Secondary sorting and matching of E_QO and E_QT: the relevance score is calculated from the encrypted TF and IDF values stored in E_IO and E_QO to obtain the result set Result; according to this first-round Result, the secondary relevance score between E_QT and the found document vectors, i.e. between E_QTi and E_ITi, is calculated and sorted, and the top k ciphertext documents with the highest scores are returned to the user.
Preferably, in step (3), matching the set W_q of query keywords entered by the user against the keyword set KW to obtain the processed fuzzy keyword set W_m comprises the following steps: if the number of query keywords entered by the user is t, the query keyword set can be represented as W_q = {w_q1, w_q2, …, w_qt} and is traversed. If w_qi ∈ KW always holds, where 1 ≤ i ≤ t, the query keywords entered by the user contain no spelling error, and at this point W_m = W_q. If some w_qi ∈ KW is false, the keyword entered by the user does not exist in the keyword set KW; the edit distance ed between the keyword w_qi and each keyword in the keyword set KW is calculated, and every keyword of KW whose edit distance satisfies the preset threshold of ed is added to the fuzzy keyword set W_m. After the traversal ends, all keywords meeting the condition have been added to the fuzzy keyword set, giving the final W_m.
Preferably, in step (3), the fuzzy keyword set obtained for an erroneous keyword w_qi is denoted CM, with the formula:

CM = { kw | kw ∈ KW, ed(w_qi, kw) ≤ 1 }
Preferably, in step (4), performing semantic expansion on the keywords in each fuzzy keyword set to form a semantic expansion set comprises the following steps:
(41) For two keywords w_i and w_j, sim(w_i, w_j) denotes the semantic similarity of the keyword w_i and the keyword w_j; the similarity operation formula is:

sim(w_i, w_j) = f1(len(w_i, w_j)) · f2(deep(lso(w_i, w_j)))

where γ and δ are influence weights controlling the shortest path length and the nearest common ancestor node in the operation, γ ≥ 0, δ ≥ 0, len(w_i, w_j) denotes the shortest path traveled in the semantic tree from the keyword w_i to the keyword w_j, and:

f1(l) = e^(-γ·l)
f2(h) = (e^(δ·h) - e^(-δ·h)) / (e^(δ·h) + e^(-δ·h))

lso(w_i, w_j) denotes the nearest common ancestor node of the keyword w_i and the keyword w_j in the semantic tree, and deep(w_i) denotes the length of the path from the keyword w_i to the root node;
(42) W_m is traversed and similarity calculation with the keyword set KW is performed; the keywords in each fuzzy keyword set are semantically expanded, and the first τ keywords with the highest similarity are taken to form a semantic expansion set.
Preferably, the step (5) specifically includes the following steps:
(51) W_y is traversed; if the traversed current keyword exists in KW, the IDF value of the keyword is stored at the corresponding position of QO, otherwise 0 is stored as a placeholder;
(52) W_y is traversed; if the traversed current keyword exists in KW, the IDF value of the keyword is stored at the corresponding bit of QT, otherwise 0 is stored as a placeholder; QT and QO form a complete query vector;
(53) QO is encrypted with the secure KNN algorithm, with QO_i[j] denoting the data at the j-th bit of the i-th group of QO; if the random bit is 0, QO_i'[j] + QO_i''[j] = QO_i[j]; if the random bit is 1, QO_i'[j] = QO_i''[j] = QO_i[j]; the encrypted form of QO is denoted E_QO, and the two new vectors generated by encrypting QO_i[j] are QO_i'[j] and QO_i''[j];
(54) QT is encrypted with the secure KNN algorithm, with QT_i[j] denoting the data at the j-th bit of the i-th group of QT; the two new vectors generated by encrypting QT_i[j] are QT_i'[j] and QT_i''[j]. If the random bit is 0, QT_i'[j] + QT_i''[j] = QT_i[j]; if the random bit is 1, QT_i'[j] = QT_i''[j] = QT_i[j]; the encrypted form of QT is denoted E_QT.
Preferably, the step (6) specifically comprises the following steps:
(61) E_IO and E_QO are matched using the TF-IDF based model; when a null mark exists in E_QO, the calculation is skipped directly. Otherwise the encrypted QO_i is searched against the encrypted IO_i, and the first h encrypted documents with the highest relevance in each group are obtained, forming the result sets {Result_1, Result_2, …, Result_f}, where each Result_i has length h and f ≤ b; after deduplication, the result set Result is obtained;
(62) The relevance score of E_IOi and E_QOi is expressed as:

Score(E_IOi, E_QOi) = IO_i'·QO_i' + IO_i''·QO_i'' = IO_i·QO_i
(63) The effective document vectors in E_IT are found through Result; the secondary relevance score between E_QT and the found document vectors is calculated and sorted, the top k document identifiers with the highest scores are returned to the query user, and the user finds the corresponding ciphertext document by the document identifier fid, downloads it for local decryption, and obtains the corresponding plaintext information;
(64) The relevance score of E_ITi and E_QTi is expressed as:

Score(E_ITi, E_QTi) = IT_i'·QT_i' + IT_i''·QT_i'' = IT_i·QT_i
has the advantages that: compared with the prior art, the invention has the following remarkable advantages: 1. according to the method, from the aspect of trapdoor expansion, fuzzy processing and semantic analysis are carried out on the retrieval keywords input by the user, and the use experience of the user is improved; 2. on the basis of efficient multi-keyword orderable ciphertext retrieval, keyword fuzzy processing is added in the step of trapdoor construction to enrich returned results, and even if a user inputs wrong keyword information, the wrong keyword information can be corrected to return a correct result; 3. the invention carries out semantic analysis processing on the user query keyword set, enlarges search conditions, enriches the returned results of query, ensures that the query is not more accurate, reduces the limitation of keywords and helps the user to more deeply mine useful information in data.
Drawings
FIG. 1 is a framework for implementing the method of the present invention;
FIG. 2 is a flow chart of a method according to the present invention;
FIG. 3 is a detailed flow chart of steps 1 and 2 of FIG. 2;
FIG. 4 is a detailed flowchart of step 3 in FIG. 2;
fig. 5 is a flowchart of the fuzzy processing method in step 3 of fig. 2.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary of the invention and are not intended to limit its scope, as various equivalent modifications of the invention will become apparent to those skilled in the art after reading the present invention and fall within the scope of the appended claims.
As shown in fig. 1, the present invention provides a ciphertext-based multi-keyword expansion retrieval method. A keyword set is extracted, via the IK tokenizer, from the data source provided by the data provider; an inverse document vector set is constructed from the keyword set and grouped; a B+ index tree is built for each group of vectors and encrypted with the secure KNN algorithm; the data source is encrypted with a symmetric encryption algorithm; and the encrypted index tree group and data source are uploaded to the cloud server, ensuring high retrieval efficiency. In the trapdoor generation process, the method first performs fuzzy processing on the retrieval keywords entered by the user, correcting input errors; then a semantic analysis operation is applied to the fuzzily processed query keyword set to expand it; finally, a query vector is generated from the semantically analyzed keyword set and encrypted with the secure KNN algorithm to obtain the trapdoors, which are grouped and uploaded to the cloud server. The cloud server matches the grouped trapdoors against the B+ index tree group according to the TF-IDF model, sorts by relevance score, and finally returns the top k ciphertext documents with the highest relevance to the authorized user.
The method specifically comprises the following steps:
As shown in figs. 2-5, in step 1 the data provider provides a data source, performs word segmentation on it with the IK tokenizer to obtain the keyword set, constructs and groups the inverse document vector set, and encrypts the data source with a symmetric encryption algorithm.
Step 1.1, generating the document vector set: the data source is segmented into keywords by the IK tokenizer, giving the keyword set KW; with n the number of keywords in the set, the keyword set can be expressed as KW = (kw_1, kw_2, …, kw_n). Through KW, the document set in the data source can be converted into a document vector set DOC; with m the number of documents in the data source, it can be expressed as DOC = (doc_1, doc_2, …, doc_m), where doc_i denotes the i-th (1 ≤ i ≤ m) document vector in DOC. Each vector has length n, and each bit corresponds to a keyword in KW: if the keyword appears in the current document, the TF value of that keyword in the document is stored; otherwise the corresponding bit stores 0 as a placeholder.
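As a concrete illustration of this step, the sketch below builds TF document vectors in Python. It is only a sketch under stated assumptions: whitespace splitting stands in for the IK tokenizer, and length-normalized term counts stand in for the patent's TF values.

```python
from collections import Counter

def build_doc_vectors(documents, keywords):
    """For each document, build a vector whose j-th bit holds the TF value
    of keywords[j] in that document, or 0 as a placeholder if absent."""
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())      # stand-in for the IK tokenizer
        total = sum(counts.values()) or 1  # avoid division by zero on empty docs
        vectors.append([counts[kw] / total for kw in keywords])
    return vectors

KW = ["cloud", "secure", "search", "index", "tree"]
docs = ["cloud secure search search", "secure index tree"]
DOC = build_doc_vectors(docs, KW)  # DOC[0] = [0.25, 0.25, 0.5, 0.0, 0.0]
```

Storing a TF value per keyword position (rather than a bag of words) is what lets the later scoring steps reduce to inner products.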
Step 1.2, generating the inverse document vector set: an inverse document vector set IDOC is constructed through the KW of step 1.1; its length equals that of KW, bit for bit. The construction principle is to find, through each keyword, the document set containing it, so the set can be expressed as IDOC = {idoc(kw_1), idoc(kw_2), …, idoc(kw_n)}, where idoc(kw_i) denotes the i-th (1 ≤ i ≤ n) inverse document vector in IDOC and stores the doc vectors containing the keyword kw_i. In this step the length of every idoc is fixed to a, i.e. each idoc stores the first a document vectors containing kw_i with the highest TF values.
Step 1.3, grouping the inverse document vector set: the IDOC of step 1.2 is divided into b groups; the grouped set can be expressed as IDOCG = (idocg_1, idocg_2, …, idocg_b), with o the number of keywords in each group, where

o = ⌈ n / b ⌉

and idocg_i is the i-th (1 ≤ i ≤ b) group of IDOCG, storing the idoc of every keyword in the group. KW is likewise divided into b groups, and the grouped keyword set can be expressed as KWG = (kwg_1, kwg_2, …, kwg_b).
Step 2, constructing a B + index tree group by using the inverse document vector set, encrypting the B + index tree group by using an improved security KNN algorithm, and uploading the B + index tree group and the encrypted data source to a cloud server;
Step 2.1, constructing the index tree group corresponding to the inverse document vector set: the grouped B+ index tree group IO is constructed from the IDOC of step 1.3, expressed as IO = {IO_1, IO_2, …, IO_b}. The key structure stored in a B+ tree node is <fid, children[m], inf>: fid is the document identifier corresponding to the ciphertext index and appears only in leaf nodes (its value is null in non-leaf nodes); children[m] stores the pointer information to the child nodes, with m the order of the B+ tree; the inf of a leaf node stores the TF values in the document of the keywords of the corresponding group; and the inf of a non-leaf node is obtained by taking, bit by bit, the maximum over the inf of all its child nodes. If child denotes a child node of a node and CHILDREN denotes all its children, the c-th bit of the node's inf can be obtained by the following formula:

inf[c] = max{ child.inf[c] | child ∈ CHILDREN }
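The bitwise-maximum rule for a non-leaf node's inf can be illustrated with the minimal sketch below (the function name and the plain-list representation of inf vectors are assumptions for illustration):

```python
def parent_inf(child_infs):
    """inf of a non-leaf node: the bitwise maximum, position by position,
    over the inf vectors of all its child nodes."""
    return [max(column) for column in zip(*child_infs)]

# two children whose inf vectors store per-keyword TF values
children = [[0.1, 0.0, 0.4], [0.2, 0.3, 0.1]]
inf = parent_inf(children)  # [0.2, 0.3, 0.4]
```

Storing the per-position maximum gives each internal node an upper bound on any descendant's score, so whole subtrees whose bound cannot beat the current top-h candidates can be pruned during search.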
Step 2.2, constructing the grouped document data set: IT = {IT_1, IT_2, …, IT_m} is constructed from the DOC of step 1.1. Let IT_i denote one item of IT, built from doc_i with the structure IT_i = <fid, inf_1, inf_2, …, inf_b>, where inf_g is a vector of length o whose j-th bit stores the TF value in doc_i of the j-th keyword of the g-th group, i.e. inf_g[j] = TF(doc_i, KWG_{g,j}) (j = 1, …, o). The complete index consists of IO, used for the first ranking calculation in step 4.1, and IT, used for the second ranking calculation in step 4.2.
Step 2.3: encryption of IO and IT: the IO in step 2.1 and the IT in step 2.2 are encrypted by using a secure KNN algorithm, that is, random 0 and 1 sequences are generated, and the IO and the IT are encrypted into two new data.
By IO i [j]IO 'when the random bit is 0, representing the j-th data of the i-th group in IO' i [j]=IO″ i [j]=IO i [j](ii) a IO 'when random bit is 1' i [j]=IO″ i [j]=IO i [j]. The encryption steps of IT are consistent with IO, and finally, the E after encryption is obtained IO And E IT And uploading the two sets of encrypted data to the cloud server.
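The splitting stage of this encryption can be sketched as below. This is a simplified illustration: the full secure KNN scheme additionally multiplies the split halves by secret invertible matrices, which is omitted here. What the sketch shows is how the complementary split/copy rules for index and query vectors make the two half inner products recombine to the plaintext score.

```python
import random

def split_index(p, bits):
    """Index rule: where the random bit is 1, split (p'[j] + p''[j] = p[j]);
    where it is 0, copy (p'[j] = p''[j] = p[j])."""
    p1, p2 = [], []
    for v, s in zip(p, bits):
        r = random.uniform(-1, 1) if s else None
        p1.append(r if s else v)
        p2.append(v - r if s else v)
    return p1, p2

def split_query(q, bits):
    """Query rule is complementary: split where the bit is 0, copy where it is 1."""
    return split_index(q, [1 - s for s in bits])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

bits = [0, 1, 1, 0]
IO, QO = [1.0, 2.0, 3.0, 4.0], [0.5, 0.0, 0.2, 0.3]
IO1, IO2 = split_index(IO, bits)
QO1, QO2 = split_query(QO, bits)
score = dot(IO1, QO1) + dot(IO2, QO2)  # recombines to dot(IO, QO)
```

At each position exactly one side is split and the other copied, so the two partial products always sum back to the original term; the server can therefore rank without seeing the TF or IDF values themselves.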
Step 3: on the basis of the query keywords entered by the authorized user, a fuzzy processing operation is first performed to obtain the fuzzy query keyword set; semantic analysis is then applied to it to obtain the semantic query keyword set; on this basis the query vectors are constructed and encrypted with secure KNN to obtain the trapdoor, which is grouped and uploaded to the cloud server.
Step 3.1: to improve retrieval robustness and prevent the case where a misspelled retrieval keyword returns no results, fuzzy processing is applied to the query keywords W_q entered by the user.
If the number of query keywords entered by the user is t, the query keyword set can be represented as W_q = {w_q1, w_q2, …, w_qt} and is traversed.
When w_qi ∈ KW (1 ≤ i ≤ t) always holds, the query keywords entered by the user contain no spelling error; with W_m denoting the processed fuzzy query keyword set, at this point W_m = W_q.
If some w_qi ∈ KW is false, the keyword entered by the user does not exist in the keyword set KW, and the input error must be handled. Let ed(S', S) denote the edit distance between strings S' and S. For the erroneous keyword w_qi, the edit distance ed to each keyword in the keyword set KW is calculated to obtain the fuzzy keyword set corresponding to that keyword, and w_qi is then removed from W_q.
The size of ed influences the similarity standard for keywords and can be set according to actual requirements; the fuzzy processing here sets ed = 1. Because fuzzy processing aims to handle misspelled query keywords, a dictionary-based approach is used to construct the fuzzy keyword set. For every erroneous keyword w_qi, the keywords of KW at edit distance ed = 1 are added to its fuzzy keyword set; after the traversal of KW, the keywords of all fuzzy sets are added to W_q to finally obtain W_m. With CM denoting the fuzzy keyword set obtained for an erroneous keyword, CM can be represented by the following formula:
CM = { kw | kw ∈ KW, ed(w_qi, kw) ≤ 1 }
the following is a query keyword W q The process of making the fuzzy processing algorithm comprises the following steps:
inputting: query keyword set W q Edit distance α, keyword set KW
And (3) outputting: fuzzy keyword set W m
Figure BDA0001984368050000072
Figure BDA0001984368050000081
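Since the pseudocode itself survives only as an image in the source, the following Python sketch reproduces the described behavior (function names are illustrative assumptions; the parameter alpha corresponds to the edit distance threshold α, which the fuzzy processing above sets to 1):

```python
def edit_distance(s, t):
    """Levenshtein edit distance ed(s, t) via dynamic programming."""
    dp = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(t) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (s[i - 1] != t[j - 1]))   # substitution
            prev = cur
    return dp[-1]

def fuzzy_process(Wq, KW, alpha=1):
    """Correctly spelled keywords pass through unchanged; a misspelled keyword
    is replaced by every keyword of KW within edit distance alpha of it."""
    Wm = set()
    for w in Wq:
        if w in KW:
            Wm.add(w)
        else:
            Wm.update(kw for kw in KW if edit_distance(w, kw) <= alpha)
    return Wm

Wm = fuzzy_process(["serch", "secure"], {"search", "secure", "cloud"})
# "serch" is corrected to "search"; "secure" passes through unchanged
```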
Step 3.2, semantic expansion of the fuzzy query keyword set: a semantic expansion operation is performed using the W_m obtained in step 3.1, helping the authorized user complete the query keyword information and mine valuable data in depth.
For two keywords w_i and w_j, deep(w_i) denotes the level of the keyword w_i in the semantic tree, i.e. the length of the path from w_i to the root node, so the level assigned to the root node determines the levels of all its descendant nodes.
The root level is set to 1, so the depth of a child node is its distance to the root plus 1. len(w_i, w_j) denotes the shortest path traveled in the semantic tree from the keyword w_i to the keyword w_j, i.e. the length of the distance between them. lso(w_i, w_j) denotes the nearest common ancestor node of the keyword w_i and the keyword w_j in the semantic tree.
sim(w_i, w_j) denotes the semantic similarity of the keyword w_i and the keyword w_j, and it is built from two component calculations:

f1(l) = e^(-γ·l)
f2(h) = (e^(δ·h) - e^(-δ·h)) / (e^(δ·h) + e^(-δ·h))

where δ is a non-negative number; the similarity operation formula is:

sim(w_i, w_j) = f1(len(w_i, w_j)) · f2(deep(lso(w_i, w_j)))
in the above formula, γ and δ are influence weights of the shortest distance path length and the near common ancestor node in the operation, and an optimal setting of γ equal to 0.2 and δ equal to 0.6 is adopted. As can be seen from the above formula, the similarity of two keywords is inversely proportional to the shortest path distance len, and is proportional to lso of the two keywords, and sim (w) i ,w j ) Is in the range of 0 to 1.
W_m is traversed and similarity calculation with the keyword set KW is performed; the keywords in each fuzzy keyword set are semantically expanded, the first τ keywords with the highest similarity are taken to form a semantic expansion set, and the semantic expansion set of each keyword is then added to W_m to form the semantic set W_y.
Step 3.3, constructing the query vectors QO and QT: the first query vector QO is constructed from the W_y of step 3.2. The construction process is as follows: W_y is traversed; if the traversed current keyword exists in KW, the IDF value of the keyword is stored at the corresponding bit of QO, otherwise 0 is stored as a placeholder. QO is likewise divided into b groups; when all data stored in a group are 0, the group contains no retrieval keyword, so it is marked null and subsequent calculation can skip it directly, improving retrieval efficiency. The second query vector QT is constructed in the same way and has the same final form as QO; QT and QO together form the complete query vector. QO is used with IO for the first ranking calculation, and QT is used with IT for the second ranking calculation.
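The grouped query-vector construction with its null flag can be sketched as follows (the group layout, the IDF values, and the use of None as the null mark are illustrative assumptions):

```python
def build_query_vector(Wy, grouped_keywords, idf):
    """Group g of the query vector stores idf[kw] at the bit of each queried
    keyword and 0 elsewhere; an all-zero group is flagged null (None here)
    so the server can skip the matching B+ tree entirely."""
    QO = []
    for group in grouped_keywords:
        vec = [idf.get(kw, 0.0) if kw in Wy else 0.0 for kw in group]
        QO.append(vec if any(vec) else None)
    return QO

KWG = [["cloud", "secure"], ["search", "index"]]   # b = 2 groups, o = 2
idf = {"cloud": 1.2, "secure": 0.8, "search": 0.5, "index": 1.5}
QO = build_query_vector({"search"}, KWG, idf)
# QO == [None, [0.5, 0.0]] -- the first group is skipped during matching
```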
Step 3.4, encryption of QO and QT: the QO and QT of step 3.3 are encrypted with the secure KNN algorithm; the encryption process is essentially the same as step 2.3. Let QO_i[j] denote the data at the j-th bit of the i-th group of QO; encrypting QO_i[j] generates two new vectors QO_i'[j] and QO_i''[j]. Differing from the index encryption, when the random bit is 0, QO_i'[j] + QO_i''[j] = QO_i[j]; when the random bit is 1, QO_i'[j] = QO_i''[j] = QO_i[j]. The encryption steps for QT are consistent with QO. The encrypted forms of QO and QT are denoted E_QO and E_QT, and the two sets of encrypted data are then uploaded to the cloud server.
Step 4: the grouped B+ index trees of step 2 are matched against the grouped trapdoors of step 3 through the TF-IDF model, the matching results are sorted by relevance score, and the top-k ciphertext documents are returned to the user.
The method comprises the following specific steps:
step 4.1: will E IO And E QO Adopting a model based on TF-IDF to carry out matching calculation, when E is QO When null marks exist in the index tree, the corresponding B + index tree is directly skipped without calculation,
otherwise, utilize E IO And E QO The encrypted TF value and the IDF value stored in the storage unit are correlatedCalculation of fractional Score, that is to say encrypted IO i QO after encryption i The search is carried out, then the first h (h is a random positive integer) encrypted documents with the highest relevance of each group are obtained, and a Result set { Result is formed 1 ,Result 2 ,…,Result f Therein Result, wherein Result i The length of the second order is h, f is less than or equal to b, and a Result set Result is obtained after deduplication.
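The per-group top-h search and deduplication into Result can be sketched as below. For simplicity the sketch scores flat lists of (fid, vector) entries on plaintext stand-ins instead of traversing encrypted B+ trees; the ranking and merging logic is the same.

```python
def first_round(io_groups, qo_groups, h):
    """io_groups[g] lists (fid, tf_vector) entries for group g; qo_groups[g]
    is that group's IDF query vector, or None if the group is null. Each
    non-null group contributes its h highest-scoring documents; the union
    (deduplicated) is the result set Result."""
    result = set()
    for entries, qo in zip(io_groups, qo_groups):
        if qo is None:                     # null flag: skip the whole group
            continue
        ranked = sorted(entries, reverse=True,
                        key=lambda e: sum(t * q for t, q in zip(e[1], qo)))
        result.update(fid for fid, _ in ranked[:h])
    return result

io_groups = [[("d1", [0.5, 0.0]), ("d2", [0.1, 0.9])],
             [("d2", [0.4, 0.0]), ("d3", [0.0, 0.7])]]
qo_groups = [[1.0, 0.2], None]             # second group contains no query keyword
Result = first_round(io_groups, qo_groups, h=1)  # {"d1"}
```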
The relevance score of E_IOi and E_QOi can be calculated by the following formula:

Score(E_IOi, E_QOi) = IO_i'·QO_i' + IO_i''·QO_i'' = IO_i·QO_i

i.e. the sum over the group's bits of the stored TF values multiplied by the corresponding IDF values; the complementary splitting rules of steps 2.3 and 3.4 guarantee that the two half inner products recombine to the plaintext inner product.
and 4.2: and 4.1, performing secondary sorting calculation by using the Result of the first retrieval obtained in the step 4.1, finding an effective document vector in the EIT through the Result, then performing secondary correlation Score calculation and sorting by using the EQT and the found document vector, returning the top k document identifiers with the highest Score to the query user, finding a corresponding ciphertext document according to the document identifier fid by the user, downloading the ciphertext document to the local, and decrypting to obtain corresponding plaintext information.
The relevance score of E_ITi and E_QTi can be calculated by the following formula:

Score(E_ITi, E_QTi) = IT_i'·QT_i' + IT_i''·QT_i'' = IT_i·QT_i

Claims (5)

1. a ciphertext-based multi-keyword expansion retrieval method is characterized by comprising the following steps:
(1) constructing a B + index tree group: constructing an inverse document vector set IDOC according to a keyword set KW in a data source, constructing a corresponding grouping B + index tree group IO by using the inverse document vector set IDOC, and constructing a corresponding grouping document data set IT by using a document vector set DOC;
(2) IT and IO encryption: IO and IT are encrypted with the secure KNN algorithm; the encrypted data are denoted E_IO and E_IT respectively and uploaded to the cloud server;
(3) Fuzzy processing of the query keywords: the set W_q of query keywords entered by the user is matched against the keyword set KW to obtain the processed fuzzy keyword set W_m;
(4) Semantic expansion of the fuzzy query keyword set W_m: the semantic similarity between keywords is obtained from the constructed semantic tree; W_m is traversed and semantic similarity calculation with the keyword set KW is performed, the keywords in each fuzzy set are semantically expanded to form semantic expansion sets, and the semantic expansion set of each keyword is added to W_m to form the semantic set W_y;
(5) Constructing the trapdoor: W_y is traversed, the first query vector QO and the second query vector QT are constructed according to whether each traversed keyword exists in KW, QO and QT are encrypted with the secure KNN algorithm to obtain the trapdoor, and the encrypted data is uploaded to the cloud server;
(6) Secondary sorting and matching of E_QO and E_QT: the relevance score is calculated from the encrypted TF and IDF values stored in E_IO and E_QO to obtain the result set Result; according to this first-round Result, the secondary relevance score between E_QT and the found document vectors, i.e. between E_QTi and E_ITi, is calculated and sorted to obtain the final relevance score, and the top k ciphertext documents with the highest scores are returned to the user;
in step (3), matching the set W_q of query keywords entered by the user against the keyword set KW to obtain the processed fuzzy keyword set W_m comprises the following steps: if the number of query keywords entered by the user is t, the query keyword set can be represented as W_q = {w_q1, w_q2, …, w_qt} and is traversed; if w_qi ∈ KW always holds, where 1 ≤ i ≤ t, the query keywords entered by the user contain no spelling error, and at this point W_m = W_q; if some w_qi ∈ KW is false, the keyword entered by the user does not exist in the keyword set KW, the edit distance ed between the keyword w_qi and each keyword in the keyword set KW is calculated, and every keyword of KW whose edit distance satisfies the preset threshold of ed is added to the fuzzy keyword set W_m; after the traversal ends, all keywords meeting the condition have been added to the fuzzy keyword set, giving the final W_m.
2. The ciphertext-based multi-keyword expansion retrieval method of claim 1, wherein in step (3) the fuzzy keyword set obtained for an erroneous keyword w_qi is denoted CM, with the formula:

CM = { kw | kw ∈ KW, ed(w_qi, kw) ≤ 1 }
3. the ciphertext-based multi-keyword expansion retrieval method according to claim 1, wherein in the step (4), performing semantic expansion on the keywords in each fuzzy keyword set to form a semantic expansion set, includes the following steps:
(41) for two keywords w_i and w_j, let sim(w_i, w_j) denote the similarity between keyword w_i and keyword w_j, computed by the formula: [formula image FDA0003765738240000022 omitted]
wherein γ and δ are influence weights controlling the shortest-path length and the nearest common ancestor node in the computation, with γ ≥ 0 and δ ≥ 0; len(w_i, w_j) denotes the shortest path traveled in the semantic tree from keyword w_i to keyword w_j, and: [formula images FDA0003765738240000023 and FDA0003765738240000024 omitted] lso(w_i, w_j) denotes the nearest common ancestor node of keywords w_i and w_j in the semantic tree, and deep(w_i) denotes the length of the path from keyword w_i to the root node;
(42) traverse W_m, computing similarity against the keyword set KW; perform semantic expansion on the keywords in each fuzzy keyword set W_m, and take the τ keywords with the highest similarity to form the semantic expansion set.
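The similarity formula itself survives only as an image in the source; as an illustrative stand-in, the sketch below combines the same ingredients named in step (41) — shortest path length len, nearest common ancestor lso, and depth deep — in a Li-style form exp(-γ·len) · tanh(δ·deep(lso)). The exact combination, the toy semantic tree, and the γ, δ values are all assumptions:

```python
import math

def path_to_root(node, parent):
    """Path from a node up to the root, in a tree given as
    child -> parent pointers."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lso_and_len(wi, wj, parent):
    """Nearest common ancestor of wi and wj, and the length of the
    shortest path between them through that ancestor."""
    pi, pj = path_to_root(wi, parent), path_to_root(wj, parent)
    dist_i = {n: d for d, n in enumerate(pi)}
    for d, n in enumerate(pj):
        if n in dist_i:
            return n, dist_i[n] + d
    raise ValueError("nodes are not in the same tree")

def deep(node, parent):
    """Path length from a node to the root."""
    return len(path_to_root(node, parent)) - 1

def sim(wi, wj, parent, gamma=0.2, delta=0.6):
    """Illustrative similarity: decays with path length, grows with the
    depth of the nearest common ancestor (a stand-in for the patented
    formula, which is only available as an image)."""
    lso, length = lso_and_len(wi, wj, parent)
    return math.exp(-gamma * length) * math.tanh(delta * deep(lso, parent))

# toy semantic tree: child -> parent
parent = {"animal": "entity", "plant": "entity",
          "dog": "animal", "cat": "animal", "oak": "plant"}
# siblings under a deep ancestor score higher than distant cousins
assert sim("dog", "cat", parent) > sim("dog", "oak", parent)
```

Any formula with this shape preserves the qualitative behavior step (42) relies on: keywords close together under a deep common ancestor rank highest and survive the top-τ cut.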
4. The ciphertext-based multi-keyword expansion retrieval method according to claim 1, wherein the step (5) specifically comprises the following steps:
(51) traverse W_y; if the current keyword exists in KW, store that keyword's IDF value in the corresponding bit of QO, otherwise store 0 as a placeholder;
(52) traverse W_y; if the current keyword exists in KW, store that keyword's IDF value in the corresponding bit of QT, otherwise store 0 as a placeholder; QT and QO together form the complete query vector;
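Steps (51)–(52) amount to filling one slot per dictionary keyword. A minimal sketch of that slot layout (the dictionary order, the IDF table, and all sample values are assumptions):

```python
def build_query_vector(wy, kw_list, idf):
    """Build a QO-style slot vector as in step (51): one slot per
    dictionary keyword, holding that keyword's IDF value when it is
    queried and 0 as a placeholder otherwise."""
    wy = set(wy)
    return [idf[w] if w in wy else 0.0 for w in kw_list]

kw_list = ["cloud", "cipher", "search"]                 # dictionary order fixes slot positions
idf = {"cloud": 1.1, "cipher": 2.3, "search": 0.7}      # assumed IDF table
print(build_query_vector({"cloud", "search"}, kw_list, idf))  # [1.1, 0.0, 0.7]
```

Keeping zero placeholders means every query vector has the same length as the dictionary, so it can be matched slot-by-slot against index vectors built over the same KW.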
(53) encrypt QO with the secure KNN algorithm, where QO_i[j] denotes the j-th bit of data of the i-th group in QO; QO_i[j] is split into two new vectors QO_i′[j] and QO_i″[j]; if the random bit is 0, QO_i′[j] + QO_i″[j] = QO_i[j]; if the random bit is 1, QO_i′[j] = QO_i″[j] = QO_i[j]; the encrypted form of QO is denoted E_QO;
(54) encrypt QT with the secure KNN algorithm, where QT_i[j] denotes the j-th data of the i-th group in QT; QT_i[j] is split into two new vectors QT_i′[j] and QT_i″[j]; if the random bit is 0, QT_i′[j] + QT_i″[j] = QT_i[j]; if the random bit is 1, QT_i′[j] = QT_i″[j] = QT_i[j]; the encrypted form of QT is denoted E_QT.
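The splitting rule of steps (53)–(54) can be illustrated as follows. The complementary rule for index vectors (duplicate where the query is split, split where the query is duplicated) is the standard secure-KNN convention and is an assumption here, included only to make the inner-product-preservation property visible:

```python
import random

def split_vector(vec, bits, is_query):
    """Split a vector into two shares following the secure-KNN rule of
    steps (53)/(54): on one random-bit value an entry is split additively
    (a + b == v), on the other it is duplicated (a == b == v). Index
    vectors use the complementary rule so inner products are preserved."""
    a, b = [], []
    for v, s in zip(vec, bits):
        split_here = (s == 0) if is_query else (s == 1)
        if split_here:
            r = random.random()
            a.append(r)
            b.append(v - r)   # additive split: a + b == v
        else:
            a.append(v)       # duplication: a == b == v
            b.append(v)
    return a, b

def inner(x, y):
    return sum(p * q for p, q in zip(x, y))

bits = [0, 1, 0, 1]           # shared random bit string
q = [0.2, 0.5, 0.0, 0.9]      # query vector (e.g. the IDF slots of QO)
p = [1.0, 2.0, 3.0, 4.0]      # index vector
q1, q2 = split_vector(q, bits, is_query=True)
p1, p2 = split_vector(p, bits, is_query=False)
# share-wise inner products sum to the plaintext inner product
assert abs(inner(q1, p1) + inner(q2, p2) - inner(q, p)) < 1e-9
```

Per slot: where the query is split and the index duplicated, q′·p + q″·p = (q′+q″)·p = q·p; where the roles reverse, q·p′ + q·p″ = q·(p′+p″) = q·p. This is what lets the server rank on E_QO and E_IO without seeing the plaintext weights.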
5. The ciphertext-based multi-keyword expansion retrieval method of claim 4, wherein the step (6) specifically comprises the following steps:
(61) match E_IO against E_QO using a TF-IDF-based model for the computation; if a null marker is present in E_QO, skip it directly without computing; otherwise search the encrypted IO_i with the encrypted QO_i to obtain, for each group, the h encrypted documents with the highest relevance, forming the result sets {Result_1, Result_2, …, Result_f}, where each Result_i has length h ≤ b; after deduplication, the result set Result is obtained;
(62) the relevance Score of [formula image FDA0003765738240000031 omitted] and [formula image FDA0003765738240000032 omitted] is given by the formula: [formula image FDA0003765738240000033 omitted]
(63) use Result to locate the document vectors of interest in E_IT, compute and rank the secondary relevance Score between them and E_QT, and return the k document identifiers with the highest scores to the querying user; the user then locates the corresponding ciphertext documents by document identifier fid, downloads them for local decryption, and obtains the corresponding plaintext;
(64) the relevance Score of [formula image FDA0003765738240000034 omitted] and [formula image FDA0003765738240000035 omitted] is given by the formula: [formula image FDA0003765738240000036 omitted]
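Both rounds of step (6) rank candidates by a relevance score and keep only the top results. A plaintext sketch of that ranking step — the score as an inner product of IDF query slots with document term-weight vectors, and all sample values, are assumptions; the real scheme evaluates this over the E_QO/E_IO shares:

```python
import heapq

def top_k_by_score(query_vec, doc_vecs, k):
    """Score each document by the inner product of the query vector with
    the document's term-weight vector, then return the k highest-scoring
    (score, doc_id) pairs in descending score order."""
    scores = [(sum(q * t for q, t in zip(query_vec, tf)), doc_id)
              for doc_id, tf in doc_vecs.items()]
    return heapq.nlargest(k, scores)

query_idf = [0.0, 1.2, 0.8]            # query slots (IDF weights)
docs = {"f1": [0.5, 0.2, 0.0],
        "f2": [0.0, 0.4, 0.6],
        "f3": [0.9, 0.0, 0.1]}
print(top_k_by_score(query_idf, docs, k=2))
```

With these values f2 scores highest (both queried terms present) and f1 second, so the top-2 identifiers returned to the user would be f2 and f1; the second round of the scheme repeats the same pattern over the smaller Result set with E_QT.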
CN201910160214.8A 2019-03-04 2019-03-04 Multi-keyword extended retrieval method based on ciphertext Active CN109902143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910160214.8A CN109902143B (en) 2019-03-04 2019-03-04 Multi-keyword extended retrieval method based on ciphertext


Publications (2)

Publication Number Publication Date
CN109902143A CN109902143A (en) 2019-06-18
CN109902143B true CN109902143B (en) 2022-09-23

Family

ID=66946218



Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111427998B * 2020-03-19 2024-03-26 Liaoning University of Technology Secure ciphertext query method for cloud data multi-keyword extension weight
CN113239054A * 2021-05-11 2021-08-10 Beijing Baidu Netcom Science and Technology Co., Ltd. Information generation method, related device and computer program product
CN114238619B * 2022-02-23 2022-04-29 Chengdu Shulian Yunsuan Technology Co., Ltd. Method, system, device and medium for screening Chinese nouns based on edit distance
CN115495483A * 2022-09-21 2022-12-20 Qichacha Technology Co., Ltd. Data batch processing method, device, equipment and computer readable storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN104408177A * 2014-12-15 2015-03-11 Xidian University Cipher searching method based on cloud document system
CN108171071A * 2017-12-01 2018-06-15 Nanjing University of Posts and Telecommunications A multi-keyword rankable ciphertext retrieval method for cloud computing

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20020188735A1 (en) * 2001-06-06 2002-12-12 Needham Bradford H. Partially replicated, locally searched peer to peer file sharing system


Non-Patent Citations (1)

Title
An Efficient Multi-keyword top-k Search Scheme over Encrypted Cloud Data; Jian Xu; 2018 15th International Symposium on Pervasive Systems, Algorithms and Networks; 2018-12-31; full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant