CN109902143B - Multi-keyword extended retrieval method based on ciphertext - Google Patents


Publication number
CN109902143B
Authority
CN
China
Prior art keywords
keyword
query
semantic
user
fuzzy
Prior art date
Legal status
Active
Application number
CN201910160214.8A
Other languages
Chinese (zh)
Other versions
CN109902143A (en)
Inventor
许建
黄新宇
戴华
杨庚
陈燕俐
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority application: CN201910160214.8A
Published as CN109902143A; application granted and published as CN109902143B
Legal status: Active

Abstract

The invention discloses a ciphertext-based multi-keyword expansion retrieval method. The method extracts a keyword set from a data source, constructs and groups an inverse document vector set based on the keyword set, builds a B+ index tree for each group of vectors and encrypts it with the secure KNN algorithm, encrypts the data source with a symmetric encryption algorithm, and uploads the encrypted index tree group and data source to a cloud server. Retrieval keywords entered by the user undergo fuzzy processing, so that user input errors are corrected. Finally, a semantic analysis operation is performed on the fuzzy query keyword set to expand it; a query vector is generated from the semantically analyzed keyword set and processed with the encryption algorithm to obtain the trapdoors, which are grouped and uploaded to the cloud server. By expanding the trapdoor through fuzzy processing and semantic analysis of the user's retrieval keywords, the invention improves the user experience.

Description

Multi-keyword extended retrieval method based on ciphertext
Technical Field
The invention relates to the technical field of text retrieval, and in particular to a ciphertext-based multi-keyword expansion retrieval method.
Background
In privacy-protection research, searchable encryption occupies an important position, and work in this field has been richly developed. However, with the explosive growth of data and the increasing diversification of user demands, such schemes face various problems and challenges. Most current mainstream schemes perform exact retrieval on the query keywords entered by the user without considering other factors, so when the user's query keywords are wrong or too narrow, reasonable results cannot be returned. As the volume of uploaded data grows linearly, filtering results or raising keyword priority according to the user's preferences, thereby reducing the screening burden on the user, has also become one of the important directions of improvement.
Personalized search addresses this problem well. Its main principle is to collect user information, analyze the user's interests and preferences, and personalize the ranking of retrieval results for the user, so that the user can quickly find the desired result. But since this scheme operates on the basis of user information, it is a poor fit for ciphertext retrieval, which is a privacy-focused setting.
Disclosure of Invention
The purpose of the invention is as follows: to overcome the defects of the prior art, the invention provides a ciphertext-based multi-keyword expansion retrieval method, which addresses the low efficiency, low accuracy, single query results, and low degree of intelligence of fuzzy multi-keyword ranked retrieval.
The technical scheme is as follows: the invention relates to a ciphertext-based multi-keyword expansion retrieval method, which comprises the following steps:
(1) constructing a B + index tree group: constructing an inverse document vector set IDOC according to a keyword set KW in a data source, constructing a corresponding grouping B + index tree group IO by using the inverse document vector set IDOC, and constructing a corresponding grouping document data set IT by using a document vector set DOC;
(2) IT and IO encryption: IO and IT are encrypted with the secure KNN algorithm; the encrypted data are denoted E_IO and E_IT respectively and uploaded to the cloud server;
(3) Fuzzy processing of the query keywords: the set W_q of query keywords entered by the user is matched against the keyword set KW to obtain the processed fuzzy keyword set W_m;
(4) Semantic expansion of the fuzzy query keyword set W_m: the semantic similarity between keywords is obtained from the constructed semantic tree; W_m is traversed and semantic similarity calculation with the keyword set KW is performed, the keywords in each fuzzy set are semantically expanded to form semantic expansion sets, and the semantic expansion set of each keyword is added to W_m to form the semantic set W_y;
(5) Constructing the trapdoor: W_y is traversed, the first query vector QO and the second query vector QT are constructed according to whether each traversed keyword exists in KW, QO and QT are encrypted with the secure KNN algorithm to obtain the trapdoor, and the encrypted data is uploaded to the cloud server;
(6) Secondary sorting and matching of E_QO and E_QT: the relevance score is calculated from the encrypted TF and IDF values stored in E_IO and E_QO to obtain the result set Result; according to this first-round Result, the secondary relevance score between E_QT and the found document vectors, i.e. between E_QTi and E_ITi, is calculated and sorted, and the top k ciphertext documents with the highest scores are returned to the user.
Preferably, in step (3), matching the set W_q of query keywords entered by the user against the keyword set KW to obtain the processed fuzzy keyword set W_m comprises the following steps: if the number of query keywords entered by the user is t, the query keyword set can be represented as W_q = {w_q1, w_q2, …, w_qt} and is traversed. If w_qi ∈ KW always holds, where 1 ≤ i ≤ t, the query keywords entered by the user contain no spelling error, and at this point W_m = W_q. If some w_qi ∈ KW is false, the keyword entered by the user does not exist in the keyword set KW; the edit distance ed between the keyword w_qi and each keyword in the keyword set KW is calculated, and every keyword of KW whose edit distance satisfies the preset threshold of ed is added to the fuzzy keyword set W_m. After the traversal ends, all keywords meeting the condition have been added to the fuzzy keyword set, giving the final W_m.
Preferably, in step (3), the fuzzy keyword set obtained for an erroneous keyword w_qi is denoted CM, with the formula:

CM = { kw | kw ∈ KW, ed(w_qi, kw) ≤ 1 }
Preferably, in step (4), performing semantic expansion on the keywords in each fuzzy keyword set to form a semantic expansion set comprises the following steps:
(41) For two keywords w_i and w_j, sim(w_i, w_j) denotes the semantic similarity of the keyword w_i and the keyword w_j; the similarity operation formula is:

sim(w_i, w_j) = f1(len(w_i, w_j)) · f2(deep(lso(w_i, w_j)))

where γ and δ are influence weights controlling the shortest path length and the nearest common ancestor node in the operation, γ ≥ 0, δ ≥ 0, len(w_i, w_j) denotes the shortest path traveled in the semantic tree from the keyword w_i to the keyword w_j, and:

f1(l) = e^(-γ·l)
f2(h) = (e^(δ·h) - e^(-δ·h)) / (e^(δ·h) + e^(-δ·h))

lso(w_i, w_j) denotes the nearest common ancestor node of the keyword w_i and the keyword w_j in the semantic tree, and deep(w_i) denotes the length of the path from the keyword w_i to the root node;
(42) W_m is traversed and similarity calculation with the keyword set KW is performed; the keywords in each fuzzy keyword set are semantically expanded, and the first τ keywords with the highest similarity are taken to form a semantic expansion set.
Preferably, the step (5) specifically includes the following steps:
(51) W_y is traversed; if the traversed current keyword exists in KW, the IDF value of the keyword is stored at the corresponding position of QO, otherwise 0 is stored as a placeholder;
(52) W_y is traversed; if the traversed current keyword exists in KW, the IDF value of the keyword is stored at the corresponding bit of QT, otherwise 0 is stored as a placeholder; QT and QO form a complete query vector;
(53) QO is encrypted with the secure KNN algorithm, with QO_i[j] denoting the data at the j-th bit of the i-th group of QO; if the random bit is 0, QO_i'[j] + QO_i''[j] = QO_i[j]; if the random bit is 1, QO_i'[j] = QO_i''[j] = QO_i[j]; the encrypted form of QO is denoted E_QO, and the two new vectors generated by encrypting QO_i[j] are QO_i'[j] and QO_i''[j];
(54) QT is encrypted with the secure KNN algorithm, with QT_i[j] denoting the data at the j-th bit of the i-th group of QT; the two new vectors generated by encrypting QT_i[j] are QT_i'[j] and QT_i''[j]. If the random bit is 0, QT_i'[j] + QT_i''[j] = QT_i[j]; if the random bit is 1, QT_i'[j] = QT_i''[j] = QT_i[j]; the encrypted form of QT is denoted E_QT.
Preferably, the step (6) specifically comprises the following steps:
(61) E_IO and E_QO are matched using the TF-IDF based model; when a null mark exists in E_QO, the calculation is skipped directly. Otherwise the encrypted QO_i is searched against the encrypted IO_i, and the first h encrypted documents with the highest relevance in each group are obtained, forming the result sets {Result_1, Result_2, …, Result_f}, where each Result_i has length h and f ≤ b; after deduplication, the result set Result is obtained;
(62) The relevance score of E_IOi and E_QOi is expressed as:

Score(E_IOi, E_QOi) = IO_i'·QO_i' + IO_i''·QO_i'' = IO_i·QO_i
(63) The effective document vectors in E_IT are found through Result; the secondary relevance score between E_QT and the found document vectors is calculated and sorted, the top k document identifiers with the highest scores are returned to the query user, and the user finds the corresponding ciphertext document by the document identifier fid, downloads it for local decryption, and obtains the corresponding plaintext information;
(64) The relevance score of E_ITi and E_QTi is expressed as:

Score(E_ITi, E_QTi) = IT_i'·QT_i' + IT_i''·QT_i'' = IT_i·QT_i
has the advantages that: compared with the prior art, the invention has the following remarkable advantages: 1. according to the method, from the aspect of trapdoor expansion, fuzzy processing and semantic analysis are carried out on the retrieval keywords input by the user, and the use experience of the user is improved; 2. on the basis of efficient multi-keyword orderable ciphertext retrieval, keyword fuzzy processing is added in the step of trapdoor construction to enrich returned results, and even if a user inputs wrong keyword information, the wrong keyword information can be corrected to return a correct result; 3. the invention carries out semantic analysis processing on the user query keyword set, enlarges search conditions, enriches the returned results of query, ensures that the query is not more accurate, reduces the limitation of keywords and helps the user to more deeply mine useful information in data.
Drawings
FIG. 1 is a framework for implementing the method of the present invention;
FIG. 2 is a flow chart of a method according to the present invention;
FIG. 3 is a detailed flow chart of steps 1 and 2 of FIG. 2;
FIG. 4 is a detailed flowchart of step 3 in FIG. 2;
fig. 5 is a flowchart of the fuzzy processing method in step 3 of fig. 2.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary of the invention and are not intended to limit its scope, as various equivalent modifications of the invention will become apparent to those skilled in the art after reading the present invention and fall within the scope of the appended claims.
As shown in fig. 1, the present invention provides a ciphertext-based multi-keyword expansion retrieval method. A keyword set is extracted, via the IK tokenizer, from the data source provided by the data provider; an inverse document vector set is constructed from the keyword set and grouped; a B+ index tree is built for each group of vectors and encrypted with the secure KNN algorithm; the data source is encrypted with a symmetric encryption algorithm; and the encrypted index tree group and data source are uploaded to the cloud server, ensuring high retrieval efficiency. In the trapdoor generation process, the method first performs fuzzy processing on the retrieval keywords entered by the user, correcting input errors; then a semantic analysis operation is applied to the fuzzily processed query keyword set to expand it; finally, a query vector is generated from the semantically analyzed keyword set and encrypted with the secure KNN algorithm to obtain the trapdoors, which are grouped and uploaded to the cloud server. The cloud server matches the grouped trapdoors against the B+ index tree group according to the TF-IDF model, sorts by relevance score, and finally returns the top k ciphertext documents with the highest relevance to the authorized user.
The method specifically comprises the following steps:
As shown in figs. 2-5, in step 1 the data provider provides a data source, performs word segmentation on it with the IK tokenizer to obtain the keyword set, constructs and groups the inverse document vector set, and encrypts the data source with a symmetric encryption algorithm.
Step 1.1, generating the document vector set: the data source is segmented into keywords by the IK tokenizer, giving the keyword set KW; with n the number of keywords in the set, the keyword set can be expressed as KW = (kw_1, kw_2, …, kw_n). Through KW, the document set in the data source can be converted into a document vector set DOC; with m the number of documents in the data source, it can be expressed as DOC = (doc_1, doc_2, …, doc_m), where doc_i denotes the i-th (1 ≤ i ≤ m) document vector in DOC. Each vector has length n, and each bit corresponds to a keyword in KW: if the keyword appears in the current document, the TF value of that keyword in the document is stored; otherwise the corresponding bit stores 0 as a placeholder.
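As a concrete illustration of this step, the sketch below builds TF document vectors in Python. It is only a sketch under stated assumptions: whitespace splitting stands in for the IK tokenizer, and length-normalized term counts stand in for the patent's TF values.

```python
from collections import Counter

def build_doc_vectors(documents, keywords):
    """For each document, build a vector whose j-th bit holds the TF value
    of keywords[j] in that document, or 0 as a placeholder if absent."""
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())      # stand-in for the IK tokenizer
        total = sum(counts.values()) or 1  # avoid division by zero on empty docs
        vectors.append([counts[kw] / total for kw in keywords])
    return vectors

KW = ["cloud", "secure", "search", "index", "tree"]
docs = ["cloud secure search search", "secure index tree"]
DOC = build_doc_vectors(docs, KW)  # DOC[0] = [0.25, 0.25, 0.5, 0.0, 0.0]
```

Storing a TF value per keyword position (rather than a bag of words) is what lets the later scoring steps reduce to inner products.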
Step 1.2, generating the inverse document vector set: an inverse document vector set IDOC is constructed through the KW of step 1.1; its length equals that of KW, bit for bit. The construction principle is to find, through each keyword, the document set containing it, so the set can be expressed as IDOC = {idoc(kw_1), idoc(kw_2), …, idoc(kw_n)}, where idoc(kw_i) denotes the i-th (1 ≤ i ≤ n) inverse document vector in IDOC and stores the doc vectors containing the keyword kw_i. In this step the length of every idoc is fixed to a, i.e. each idoc stores the first a document vectors containing kw_i with the highest TF values.
Step 1.3, grouping the inverse document vector set: the IDOC of step 1.2 is divided into b groups; the grouped set can be expressed as IDOCG = (idocg_1, idocg_2, …, idocg_b), with o the number of keywords in each group, where

o = ⌈ n / b ⌉

and idocg_i is the i-th (1 ≤ i ≤ b) group of IDOCG, storing the idoc of every keyword in the group. KW is likewise divided into b groups, and the grouped keyword set can be expressed as KWG = (kwg_1, kwg_2, …, kwg_b).
Step 2, constructing a B + index tree group by using the inverse document vector set, encrypting the B + index tree group by using an improved security KNN algorithm, and uploading the B + index tree group and the encrypted data source to a cloud server;
Step 2.1, constructing the index tree group corresponding to the inverse document vector set: the grouped B+ index tree group IO is constructed from the IDOC of step 1.3, expressed as IO = {IO_1, IO_2, …, IO_b}. The key structure stored in a B+ tree node is <fid, children[m], inf>: fid is the document identifier corresponding to the ciphertext index and appears only in leaf nodes (its value is null in non-leaf nodes); children[m] stores the pointer information to the child nodes, with m the order of the B+ tree; the inf of a leaf node stores the TF values in the document of the keywords of the corresponding group; and the inf of a non-leaf node is obtained by taking, bit by bit, the maximum over the inf of all its child nodes. If child denotes a child node of a node and CHILDREN denotes all its children, the c-th bit of the node's inf can be obtained by the following formula:

inf[c] = max{ child.inf[c] | child ∈ CHILDREN }
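The bitwise-maximum rule for a non-leaf node's inf can be illustrated with the minimal sketch below (the function name and the plain-list representation of inf vectors are assumptions for illustration):

```python
def parent_inf(child_infs):
    """inf of a non-leaf node: the bitwise maximum, position by position,
    over the inf vectors of all its child nodes."""
    return [max(column) for column in zip(*child_infs)]

# two children whose inf vectors store per-keyword TF values
children = [[0.1, 0.0, 0.4], [0.2, 0.3, 0.1]]
inf = parent_inf(children)  # [0.2, 0.3, 0.4]
```

Storing the per-position maximum gives each internal node an upper bound on any descendant's score, so whole subtrees whose bound cannot beat the current top-h candidates can be pruned during search.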
Step 2.2, constructing the grouped document data set: IT = {IT_1, IT_2, …, IT_m} is constructed from the DOC of step 1.1. Let IT_i denote one item of IT, built from doc_i with the structure IT_i = <fid, inf_1, inf_2, …, inf_b>, where inf_g is a vector of length o whose j-th bit stores the TF value in doc_i of the j-th keyword of the g-th group, i.e. inf_g[j] = TF(doc_i, KWG_{g,j}) (j = 1, …, o). The complete index consists of IO, used for the first ranking calculation in step 4.1, and IT, used for the second ranking calculation in step 4.2.
Step 2.3: encryption of IO and IT: the IO in step 2.1 and the IT in step 2.2 are encrypted by using a secure KNN algorithm, that is, random 0 and 1 sequences are generated, and the IO and the IT are encrypted into two new data.
By IO i [j]IO 'when the random bit is 0, representing the j-th data of the i-th group in IO' i [j]=IO″ i [j]=IO i [j](ii) a IO 'when random bit is 1' i [j]=IO″ i [j]=IO i [j]. The encryption steps of IT are consistent with IO, and finally, the E after encryption is obtained IO And E IT And uploading the two sets of encrypted data to the cloud server.
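The splitting stage of this encryption can be sketched as below. This is a simplified illustration: the full secure KNN scheme additionally multiplies the split halves by secret invertible matrices, which is omitted here. What the sketch shows is how the complementary split/copy rules for index and query vectors make the two half inner products recombine to the plaintext score.

```python
import random

def split_index(p, bits):
    """Index rule: where the random bit is 1, split (p'[j] + p''[j] = p[j]);
    where it is 0, copy (p'[j] = p''[j] = p[j])."""
    p1, p2 = [], []
    for v, s in zip(p, bits):
        r = random.uniform(-1, 1) if s else None
        p1.append(r if s else v)
        p2.append(v - r if s else v)
    return p1, p2

def split_query(q, bits):
    """Query rule is complementary: split where the bit is 0, copy where it is 1."""
    return split_index(q, [1 - s for s in bits])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

bits = [0, 1, 1, 0]
IO, QO = [1.0, 2.0, 3.0, 4.0], [0.5, 0.0, 0.2, 0.3]
IO1, IO2 = split_index(IO, bits)
QO1, QO2 = split_query(QO, bits)
score = dot(IO1, QO1) + dot(IO2, QO2)  # recombines to dot(IO, QO)
```

At each position exactly one side is split and the other copied, so the two partial products always sum back to the original term; the server can therefore rank without seeing the TF or IDF values themselves.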
Step 3: on the basis of the query keywords entered by the authorized user, a fuzzy processing operation is first performed to obtain the fuzzy query keyword set; semantic analysis is then applied to it to obtain the semantic query keyword set; on this basis the query vectors are constructed and encrypted with secure KNN to obtain the trapdoor, which is grouped and uploaded to the cloud server.
Step 3.1: to improve retrieval robustness and prevent the case where a misspelled retrieval keyword returns no results, fuzzy processing is applied to the query keywords W_q entered by the user.
If the number of query keywords entered by the user is t, the query keyword set can be represented as W_q = {w_q1, w_q2, …, w_qt} and is traversed.
When w_qi ∈ KW (1 ≤ i ≤ t) always holds, the query keywords entered by the user contain no spelling error; with W_m denoting the processed fuzzy query keyword set, at this point W_m = W_q.
If some w_qi ∈ KW is false, the keyword entered by the user does not exist in the keyword set KW, and the input error must be handled. Let ed(S', S) denote the edit distance between strings S' and S. For the erroneous keyword w_qi, the edit distance ed to each keyword in the keyword set KW is calculated to obtain the fuzzy keyword set corresponding to that keyword, and w_qi is then removed from W_q.
The size of ed influences the similarity standard for keywords and can be set according to actual requirements; the fuzzy processing here sets ed = 1. Because fuzzy processing aims to handle misspelled query keywords, a dictionary-based approach is used to construct the fuzzy keyword set. For every erroneous keyword w_qi, the keywords of KW at edit distance ed = 1 are added to its fuzzy keyword set; after the traversal of KW, the keywords of all fuzzy sets are added to W_q to finally obtain W_m. With CM denoting the fuzzy keyword set obtained for an erroneous keyword, CM can be represented by the following formula:
CM = { kw | kw ∈ KW, ed(w_qi, kw) ≤ 1 }
the following is a query keyword W q The process of making the fuzzy processing algorithm comprises the following steps:
inputting: query keyword set W q Edit distance α, keyword set KW
And (3) outputting: fuzzy keyword set W m
Figure BDA0001984368050000072
Figure BDA0001984368050000081
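Since the pseudocode itself survives only as an image in the source, the following Python sketch reproduces the described behavior (function names are illustrative assumptions; the parameter alpha corresponds to the edit distance threshold α, which the fuzzy processing above sets to 1):

```python
def edit_distance(s, t):
    """Levenshtein edit distance ed(s, t) via dynamic programming."""
    dp = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(t) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (s[i - 1] != t[j - 1]))   # substitution
            prev = cur
    return dp[-1]

def fuzzy_process(Wq, KW, alpha=1):
    """Correctly spelled keywords pass through unchanged; a misspelled keyword
    is replaced by every keyword of KW within edit distance alpha of it."""
    Wm = set()
    for w in Wq:
        if w in KW:
            Wm.add(w)
        else:
            Wm.update(kw for kw in KW if edit_distance(w, kw) <= alpha)
    return Wm

Wm = fuzzy_process(["serch", "secure"], {"search", "secure", "cloud"})
# "serch" is corrected to "search"; "secure" passes through unchanged
```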
Step 3.2, semantic expansion of the fuzzy query keyword set: a semantic expansion operation is performed using the W_m obtained in step 3.1, helping the authorized user complete the query keyword information and mine valuable data in depth.
For two keywords w_i and w_j, deep(w_i) denotes the level of the keyword w_i in the semantic tree, i.e. the length of the path from w_i to the root node, so the level assigned to the root node determines the levels of all its descendant nodes.
The root level is set to 1, so the depth of a child node is its distance to the root plus 1. len(w_i, w_j) denotes the shortest path traveled in the semantic tree from the keyword w_i to the keyword w_j, i.e. the length of the distance between them. lso(w_i, w_j) denotes the nearest common ancestor node of the keyword w_i and the keyword w_j in the semantic tree.
sim(w_i, w_j) denotes the semantic similarity of the keyword w_i and the keyword w_j, and it is built from two component calculations:

f1(l) = e^(-γ·l)
f2(h) = (e^(δ·h) - e^(-δ·h)) / (e^(δ·h) + e^(-δ·h))

where δ is a non-negative number; the similarity operation formula is:

sim(w_i, w_j) = f1(len(w_i, w_j)) · f2(deep(lso(w_i, w_j)))
in the above formula, γ and δ are influence weights of the shortest distance path length and the near common ancestor node in the operation, and an optimal setting of γ equal to 0.2 and δ equal to 0.6 is adopted. As can be seen from the above formula, the similarity of two keywords is inversely proportional to the shortest path distance len, and is proportional to lso of the two keywords, and sim (w) i ,w j ) Is in the range of 0 to 1.
W_m is traversed and similarity calculation with the keyword set KW is performed; the keywords in each fuzzy keyword set are semantically expanded, the first τ keywords with the highest similarity are taken to form a semantic expansion set, and the semantic expansion set of each keyword is then added to W_m to form the semantic set W_y.
Step 3.3, constructing the query vectors QO and QT: the first query vector QO is constructed from the W_y of step 3.2. The construction process is as follows: W_y is traversed; if the traversed current keyword exists in KW, the IDF value of the keyword is stored at the corresponding bit of QO, otherwise 0 is stored as a placeholder. QO is likewise divided into b groups; when all data stored in a group are 0, the group contains no retrieval keyword, so it is marked null and subsequent calculation can skip it directly, improving retrieval efficiency. The second query vector QT is constructed in the same way and has the same final form as QO; QT and QO together form the complete query vector. QO is used with IO for the first ranking calculation, and QT is used with IT for the second ranking calculation.
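The grouped query-vector construction with its null flag can be sketched as follows (the group layout, the IDF values, and the use of None as the null mark are illustrative assumptions):

```python
def build_query_vector(Wy, grouped_keywords, idf):
    """Group g of the query vector stores idf[kw] at the bit of each queried
    keyword and 0 elsewhere; an all-zero group is flagged null (None here)
    so the server can skip the matching B+ tree entirely."""
    QO = []
    for group in grouped_keywords:
        vec = [idf.get(kw, 0.0) if kw in Wy else 0.0 for kw in group]
        QO.append(vec if any(vec) else None)
    return QO

KWG = [["cloud", "secure"], ["search", "index"]]   # b = 2 groups, o = 2
idf = {"cloud": 1.2, "secure": 0.8, "search": 0.5, "index": 1.5}
QO = build_query_vector({"search"}, KWG, idf)
# QO == [None, [0.5, 0.0]] -- the first group is skipped during matching
```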
Step 3.4, encryption of QO and QT: the QO and QT of step 3.3 are encrypted with the secure KNN algorithm; the encryption process is essentially the same as step 2.3. Let QO_i[j] denote the data at the j-th bit of the i-th group of QO; encrypting QO_i[j] generates two new vectors QO_i'[j] and QO_i''[j]. Differing from the index encryption, when the random bit is 0, QO_i'[j] + QO_i''[j] = QO_i[j]; when the random bit is 1, QO_i'[j] = QO_i''[j] = QO_i[j]. The encryption steps for QT are consistent with QO. The encrypted forms of QO and QT are denoted E_QO and E_QT, and the two sets of encrypted data are then uploaded to the cloud server.
Step 4: the grouped B+ index trees of step 2 are matched against the grouped trapdoors of step 3 through the TF-IDF model, the matching results are sorted by relevance score, and the top-k ciphertext documents are returned to the user.
The method comprises the following specific steps:
step 4.1: will E IO And E QO Adopting a model based on TF-IDF to carry out matching calculation, when E is QO When null marks exist in the index tree, the corresponding B + index tree is directly skipped without calculation,
otherwise, utilize E IO And E QO The encrypted TF value and the IDF value stored in the storage unit are correlatedCalculation of fractional Score, that is to say encrypted IO i QO after encryption i The search is carried out, then the first h (h is a random positive integer) encrypted documents with the highest relevance of each group are obtained, and a Result set { Result is formed 1 ,Result 2 ,…,Result f Therein Result, wherein Result i The length of the second order is h, f is less than or equal to b, and a Result set Result is obtained after deduplication.
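The per-group top-h search and deduplication into Result can be sketched as below. For simplicity the sketch scores flat lists of (fid, vector) entries on plaintext stand-ins instead of traversing encrypted B+ trees; the ranking and merging logic is the same.

```python
def first_round(io_groups, qo_groups, h):
    """io_groups[g] lists (fid, tf_vector) entries for group g; qo_groups[g]
    is that group's IDF query vector, or None if the group is null. Each
    non-null group contributes its h highest-scoring documents; the union
    (deduplicated) is the result set Result."""
    result = set()
    for entries, qo in zip(io_groups, qo_groups):
        if qo is None:                     # null flag: skip the whole group
            continue
        ranked = sorted(entries, reverse=True,
                        key=lambda e: sum(t * q for t, q in zip(e[1], qo)))
        result.update(fid for fid, _ in ranked[:h])
    return result

io_groups = [[("d1", [0.5, 0.0]), ("d2", [0.1, 0.9])],
             [("d2", [0.4, 0.0]), ("d3", [0.0, 0.7])]]
qo_groups = [[1.0, 0.2], None]             # second group contains no query keyword
Result = first_round(io_groups, qo_groups, h=1)  # {"d1"}
```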
The relevance score of E_IOi and E_QOi can be calculated by the following formula:

Score(E_IOi, E_QOi) = IO_i'·QO_i' + IO_i''·QO_i'' = IO_i·QO_i

i.e. the sum over the group's bits of the stored TF values multiplied by the corresponding IDF values; the complementary splitting rules of steps 2.3 and 3.4 guarantee that the two half inner products recombine to the plaintext inner product.
and 4.2: and 4.1, performing secondary sorting calculation by using the Result of the first retrieval obtained in the step 4.1, finding an effective document vector in the EIT through the Result, then performing secondary correlation Score calculation and sorting by using the EQT and the found document vector, returning the top k document identifiers with the highest Score to the query user, finding a corresponding ciphertext document according to the document identifier fid by the user, downloading the ciphertext document to the local, and decrypting to obtain corresponding plaintext information.
The relevance score of E_ITi and E_QTi can be calculated by the following formula:

Score(E_ITi, E_QTi) = IT_i'·QT_i' + IT_i''·QT_i'' = IT_i·QT_i

Claims (5)

1. a ciphertext-based multi-keyword expansion retrieval method is characterized by comprising the following steps:
(1) constructing a B + index tree group: constructing an inverse document vector set IDOC according to a keyword set KW in a data source, constructing a corresponding grouping B + index tree group IO by using the inverse document vector set IDOC, and constructing a corresponding grouping document data set IT by using a document vector set DOC;
(2) IT and IO encryption: IO and IT are encrypted with the secure KNN algorithm; the encrypted data are denoted E_IO and E_IT respectively and uploaded to the cloud server;
(3) Fuzzy processing of the query keywords: the set W_q of query keywords entered by the user is matched against the keyword set KW to obtain the processed fuzzy keyword set W_m;
(4) Semantic expansion of the fuzzy query keyword set W_m: the semantic similarity between keywords is obtained from the constructed semantic tree; W_m is traversed and semantic similarity calculation with the keyword set KW is performed, the keywords in each fuzzy set are semantically expanded to form semantic expansion sets, and the semantic expansion set of each keyword is added to W_m to form the semantic set W_y;
(5) Constructing the trapdoor: W_y is traversed, the first query vector QO and the second query vector QT are constructed according to whether each traversed keyword exists in KW, QO and QT are encrypted with the secure KNN algorithm to obtain the trapdoor, and the encrypted data is uploaded to the cloud server;
(6) Secondary sorting and matching of E_QO and E_QT: the relevance score is calculated from the encrypted TF and IDF values stored in E_IO and E_QO to obtain the result set Result; according to this first-round Result, the secondary relevance score between E_QT and the found document vectors, i.e. between E_QTi and E_ITi, is calculated and sorted to obtain the final relevance score, and the top k ciphertext documents with the highest scores are returned to the user;
in step (3), matching the set W_q of query keywords entered by the user against the keyword set KW to obtain the processed fuzzy keyword set W_m comprises the following steps: if the number of query keywords entered by the user is t, the query keyword set can be represented as W_q = {w_q1, w_q2, …, w_qt} and is traversed; if w_qi ∈ KW always holds, where 1 ≤ i ≤ t, the query keywords entered by the user contain no spelling error, and at this point W_m = W_q; if some w_qi ∈ KW is false, the keyword entered by the user does not exist in the keyword set KW, the edit distance ed between the keyword w_qi and each keyword in the keyword set KW is calculated, and every keyword of KW whose edit distance satisfies the preset threshold of ed is added to the fuzzy keyword set W_m; after the traversal ends, all keywords meeting the condition have been added to the fuzzy keyword set, giving the final W_m.
2. The ciphertext-based multi-keyword expansion retrieval method of claim 1, wherein in step (3) the fuzzy keyword set obtained for an erroneous keyword w_qi is denoted CM, with the formula:

CM = { kw | kw ∈ KW, ed(w_qi, kw) ≤ 1 }
3. the ciphertext-based multi-keyword expansion retrieval method according to claim 1, wherein in the step (4), performing semantic expansion on the keywords in each fuzzy keyword set to form a semantic expansion set, includes the following steps:
(41) for two keywords w_i and w_j, let sim(w_i, w_j) denote the similarity between keyword w_i and keyword w_j, computed by the formula: [formula image FDA0003765738240000022 omitted]
wherein γ and δ are influence weights controlling the shortest-path length and the nearest common ancestor node in the computation, with γ ≥ 0 and δ ≥ 0; len(w_i, w_j) denotes the shortest path traveled in the semantic tree from keyword w_i to keyword w_j, and: [formula images FDA0003765738240000023 and FDA0003765738240000024 omitted] lso(w_i, w_j) denotes the nearest common ancestor node of keywords w_i and w_j in the semantic tree, and deep(w_i) denotes the length of the path from keyword w_i to the root node;
(42) traverse W_m, computing similarity against the keyword set KW; perform semantic expansion on the keywords in each fuzzy keyword set W_m, and take the τ keywords with the highest similarity to form the semantic expansion set.
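The similarity formula itself survives only as an image in the source; as an illustrative stand-in, the sketch below combines the same ingredients named in step (41) — shortest path length len, nearest common ancestor lso, and depth deep — in a Li-style form exp(-γ·len) · tanh(δ·deep(lso)). The exact combination, the toy semantic tree, and the γ, δ values are all assumptions:

```python
import math

def path_to_root(node, parent):
    """Path from a node up to the root, in a tree given as
    child -> parent pointers."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lso_and_len(wi, wj, parent):
    """Nearest common ancestor of wi and wj, and the length of the
    shortest path between them through that ancestor."""
    pi, pj = path_to_root(wi, parent), path_to_root(wj, parent)
    dist_i = {n: d for d, n in enumerate(pi)}
    for d, n in enumerate(pj):
        if n in dist_i:
            return n, dist_i[n] + d
    raise ValueError("nodes are not in the same tree")

def deep(node, parent):
    """Path length from a node to the root."""
    return len(path_to_root(node, parent)) - 1

def sim(wi, wj, parent, gamma=0.2, delta=0.6):
    """Illustrative similarity: decays with path length, grows with the
    depth of the nearest common ancestor (a stand-in for the patented
    formula, which is only available as an image)."""
    lso, length = lso_and_len(wi, wj, parent)
    return math.exp(-gamma * length) * math.tanh(delta * deep(lso, parent))

# toy semantic tree: child -> parent
parent = {"animal": "entity", "plant": "entity",
          "dog": "animal", "cat": "animal", "oak": "plant"}
# siblings under a deep ancestor score higher than distant cousins
assert sim("dog", "cat", parent) > sim("dog", "oak", parent)
```

Any formula with this shape preserves the qualitative behavior step (42) relies on: keywords close together under a deep common ancestor rank highest and survive the top-τ cut.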
4. The ciphertext-based multi-keyword expansion retrieval method according to claim 1, wherein the step (5) specifically comprises the following steps:
(51) traverse W_y; if the current keyword exists in KW, store that keyword's IDF value in the corresponding bit of QO, otherwise store 0 as a placeholder;
(52) traverse W_y; if the current keyword exists in KW, store that keyword's IDF value in the corresponding bit of QT, otherwise store 0 as a placeholder; QT and QO together form the complete query vector;
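Steps (51)–(52) amount to filling one slot per dictionary keyword. A minimal sketch of that slot layout (the dictionary order, the IDF table, and all sample values are assumptions):

```python
def build_query_vector(wy, kw_list, idf):
    """Build a QO-style slot vector as in step (51): one slot per
    dictionary keyword, holding that keyword's IDF value when it is
    queried and 0 as a placeholder otherwise."""
    wy = set(wy)
    return [idf[w] if w in wy else 0.0 for w in kw_list]

kw_list = ["cloud", "cipher", "search"]                 # dictionary order fixes slot positions
idf = {"cloud": 1.1, "cipher": 2.3, "search": 0.7}      # assumed IDF table
print(build_query_vector({"cloud", "search"}, kw_list, idf))  # [1.1, 0.0, 0.7]
```

Keeping zero placeholders means every query vector has the same length as the dictionary, so it can be matched slot-by-slot against index vectors built over the same KW.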
(53) encrypt QO with the secure KNN algorithm, where QO_i[j] denotes the j-th bit of data of the i-th group in QO; QO_i[j] is split into two new vectors QO_i′[j] and QO_i″[j]; if the random bit is 0, QO_i′[j] + QO_i″[j] = QO_i[j]; if the random bit is 1, QO_i′[j] = QO_i″[j] = QO_i[j]; the encrypted form of QO is denoted E_QO;
(54) encrypt QT with the secure KNN algorithm, where QT_i[j] denotes the j-th data of the i-th group in QT; QT_i[j] is split into two new vectors QT_i′[j] and QT_i″[j]; if the random bit is 0, QT_i′[j] + QT_i″[j] = QT_i[j]; if the random bit is 1, QT_i′[j] = QT_i″[j] = QT_i[j]; the encrypted form of QT is denoted E_QT.
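The splitting rule of steps (53)–(54) can be illustrated as follows. The complementary rule for index vectors (duplicate where the query is split, split where the query is duplicated) is the standard secure-KNN convention and is an assumption here, included only to make the inner-product-preservation property visible:

```python
import random

def split_vector(vec, bits, is_query):
    """Split a vector into two shares following the secure-KNN rule of
    steps (53)/(54): on one random-bit value an entry is split additively
    (a + b == v), on the other it is duplicated (a == b == v). Index
    vectors use the complementary rule so inner products are preserved."""
    a, b = [], []
    for v, s in zip(vec, bits):
        split_here = (s == 0) if is_query else (s == 1)
        if split_here:
            r = random.random()
            a.append(r)
            b.append(v - r)   # additive split: a + b == v
        else:
            a.append(v)       # duplication: a == b == v
            b.append(v)
    return a, b

def inner(x, y):
    return sum(p * q for p, q in zip(x, y))

bits = [0, 1, 0, 1]           # shared random bit string
q = [0.2, 0.5, 0.0, 0.9]      # query vector (e.g. the IDF slots of QO)
p = [1.0, 2.0, 3.0, 4.0]      # index vector
q1, q2 = split_vector(q, bits, is_query=True)
p1, p2 = split_vector(p, bits, is_query=False)
# share-wise inner products sum to the plaintext inner product
assert abs(inner(q1, p1) + inner(q2, p2) - inner(q, p)) < 1e-9
```

Per slot: where the query is split and the index duplicated, q′·p + q″·p = (q′+q″)·p = q·p; where the roles reverse, q·p′ + q·p″ = q·(p′+p″) = q·p. This is what lets the server rank on E_QO and E_IO without seeing the plaintext weights.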
5. The ciphertext-based multi-keyword expansion retrieval method of claim 4, wherein the step (6) specifically comprises the following steps:
(61) match E_IO against E_QO using a TF-IDF-based model for the computation; if a null marker is present in E_QO, skip it directly without computing; otherwise search the encrypted IO_i with the encrypted QO_i to obtain, for each group, the h encrypted documents with the highest relevance, forming the result sets {Result_1, Result_2, …, Result_f}, where each Result_i has length h ≤ b; after deduplication, the result set Result is obtained;
(62) the relevance Score of [formula image FDA0003765738240000031 omitted] and [formula image FDA0003765738240000032 omitted] is given by the formula: [formula image FDA0003765738240000033 omitted]
(63) use Result to locate the document vectors of interest in E_IT, compute and rank the secondary relevance Score between them and E_QT, and return the k document identifiers with the highest scores to the querying user; the user then locates the corresponding ciphertext documents by document identifier fid, downloads them for local decryption, and obtains the corresponding plaintext;
(64) the relevance Score of [formula image FDA0003765738240000034 omitted] and [formula image FDA0003765738240000035 omitted] is given by the formula: [formula image FDA0003765738240000036 omitted]
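Both rounds of step (6) rank candidates by a relevance score and keep only the top results. A plaintext sketch of that ranking step — the score as an inner product of IDF query slots with document term-weight vectors, and all sample values, are assumptions; the real scheme evaluates this over the E_QO/E_IO shares:

```python
import heapq

def top_k_by_score(query_vec, doc_vecs, k):
    """Score each document by the inner product of the query vector with
    the document's term-weight vector, then return the k highest-scoring
    (score, doc_id) pairs in descending score order."""
    scores = [(sum(q * t for q, t in zip(query_vec, tf)), doc_id)
              for doc_id, tf in doc_vecs.items()]
    return heapq.nlargest(k, scores)

query_idf = [0.0, 1.2, 0.8]            # query slots (IDF weights)
docs = {"f1": [0.5, 0.2, 0.0],
        "f2": [0.0, 0.4, 0.6],
        "f3": [0.9, 0.0, 0.1]}
print(top_k_by_score(query_idf, docs, k=2))
```

With these values f2 scores highest (both queried terms present) and f1 second, so the top-2 identifiers returned to the user would be f2 and f1; the second round of the scheme repeats the same pattern over the smaller Result set with E_QT.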
CN201910160214.8A 2019-03-04 2019-03-04 Multi-keyword extended retrieval method based on ciphertext Active CN109902143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910160214.8A CN109902143B (en) 2019-03-04 2019-03-04 Multi-keyword extended retrieval method based on ciphertext


Publications (2)

Publication Number Publication Date
CN109902143A CN109902143A (en) 2019-06-18
CN109902143B true CN109902143B (en) 2022-09-23

Family

ID=66946218



Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111427998B * 2020-03-19 2024-03-26 Liaoning University of Technology Secure ciphertext query method for cloud data multi-keyword extension weight
CN113239054A * 2021-05-11 2021-08-10 Beijing Baidu Netcom Science and Technology Co., Ltd. Information generation method, related device and computer program product
CN114238619B * 2022-02-23 2022-04-29 Chengdu Shulian Yunsuan Technology Co., Ltd. Method, system, device and medium for screening Chinese nouns based on edit distance
CN115495483A * 2022-09-21 2022-12-20 Qichacha Technology Co., Ltd. Data batch processing method, device, equipment and computer readable storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN104408177A * 2014-12-15 2015-03-11 Xidian University Cipher searching method based on cloud document system
CN108171071A * 2017-12-01 2018-06-15 Nanjing University of Posts and Telecommunications A multi-keyword rankable ciphertext retrieval method for cloud computing

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20020188735A1 (en) * 2001-06-06 2002-12-12 Needham Bradford H. Partially replicated, locally searched peer to peer file sharing system


Non-Patent Citations (1)

Title
An Efficient Multi-keyword top-k Search Scheme over Encrypted Cloud Data; Jian Xu; 2018 15th International Symposium on Pervasive Systems, Algorithms and Networks; 2018-12-31; full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant