CN113868441A - File processing method, electronic device and storage medium - Google Patents

Publication number
CN113868441A
Authority
CN
China
Prior art keywords
file
identification vector
storage position
vector
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111056882.XA
Other languages
Chinese (zh)
Inventor
吴良顺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Original Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority to CN202111056882.XA
Publication of CN113868441A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/41 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a file processing method, an electronic device, and a storage medium. The file processing method comprises: determining a first identification vector of a file according to the position of the file in a file set; generating, according to a hash function and the feature vector of the file, a first storage position for storing the first identification vector; determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector; and storing the second identification vector in a second storage position obtained by applying a pseudo-random permutation to the first storage position. During a file search, the second storage position is used to indicate, according to the first identification vector of the file to be searched, the first storage position corresponding to the feature vector used to find that file. In this way, the file can be found from the second storage position based on its position in the file set, without the user having to provide exact keywords, which improves file search efficiency.

Description

File processing method, electronic device and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a file processing method, an electronic device, and a storage medium.
Background
In recent years, with the development and popularization of cloud computing, cloud service products have begun to store data files and host systems for users on cloud servers, and more and more data owners choose to store their massive data, especially multimedia files such as audio and video, in the cloud. To protect personal privacy and sensitive data (such as personal data files, medical care records, and home videos), a user needs to encrypt files locally before uploading them.
In the prior art, when a user needs to search for a related data file, one method is to download all of the ciphertext, decrypt it locally, and search the resulting plaintext. This incurs large network and storage overhead, as well as the computational overhead of the encryption and decryption operations. Another method is Searchable Symmetric Encryption (SSE): the user builds an index over the plaintext and uploads the index to a remote server in encrypted form. To search for a keyword, the user generates a search trapdoor for the keyword and submits it to the server. On receiving the trapdoor, the server searches the encrypted index and returns the corresponding ciphertext results, which the user then decrypts. However, existing SSE schemes must perform exact index queries based on keywords, which makes file search inefficient.
Disclosure of Invention
In view of this, embodiments of the present invention provide a file processing method, an electronic device, and a storage medium.
The technical scheme of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a file processing method, including:
determining a first identification vector of a file according to the position of the file in a file set;
generating a first storage position for storing the first identification vector according to a hash function and the feature vector of the file;
determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector;
storing the second identification vector in a second storage position obtained by applying a pseudo-random permutation to the first storage position; during a file search, the second storage position is used to indicate, according to the first identification vector of the file to be searched, the first storage position corresponding to the feature vector used to find that file.
Further, the determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector includes:
determining an adjacent storage position of the first storage position according to a hash table corresponding to the hash function;
generating a second identification vector based on a union of vectors stored in the adjacent storage locations and the first identification vector.
Further, the method further comprises:
if no data is stored in the adjacent storage position, writing an all-zero vector of the same length as the first identification vector to the adjacent storage position.
Further, the method further comprises:
splicing the identification information corresponding to the file set with the information of a second storage position;
and calculating the spliced information through a pseudo-random function, and encrypting the second storage position based on the calculation result.
Further, the method further comprises:
if the first storage position already stores data, storing the union of the first identification vector and the stored data in the first storage position;
and if the first storage position does not store data, storing the first identification vector in the first storage position.
In a second aspect, an embodiment of the present invention provides a file processing method, including:
determining a first identification vector of a file to be searched according to the position of the file to be searched in a file set;
determining a storage position containing the first identification vector in the stored data;
determining an original storage position according to the storage position of the first identification vector and an inverse function of the pseudo-random permutation;
and searching the file to be searched according to the characteristic vector corresponding to the original storage position.
Further, the determining a storage location in the stored data that contains the first identification vector includes:
determining a second identification vector according to the first identification vector; the second identification vector is a union formed by the first identification vector and other vectors;
determining a storage location of the second identification vector.
Further, the determining a storage location of the second identification vector comprises:
determining encrypted storage position information corresponding to the second identification vector;
and decrypting the encrypted storage position information according to the identification information corresponding to the file set and the second identification vector to obtain the storage position of the second identification vector.
In a third aspect, an embodiment of the present invention provides a file processing apparatus, including:
the first determining unit is used for determining a first identification vector of the file according to the position of the file in the file set; determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector;
the generating unit is used for generating a first storage position for storing the first identification vector according to a hash function and the feature vector of the file;
the storage unit is used for storing the second identification vector in a second storage position obtained by applying a pseudo-random permutation to the first storage position; during a file search, the second storage position is used to indicate, according to the first identification vector of the file to be searched, the first storage position corresponding to the feature vector used to find that file.
Further, the storage unit is specifically configured to:
determine an adjacent storage position of the first storage position according to a hash table corresponding to the hash function;
generate a second identification vector based on a union of the vectors stored in the adjacent storage positions and the first identification vector.
Further, the apparatus further comprises:
a writing unit, used for writing an all-zero vector of the same length as the first identification vector to the adjacent storage position if the adjacent storage position does not store data.
Further, the apparatus further comprises:
the splicing unit is used for splicing the identification information corresponding to the file set with the information of the second storage position;
and the encryption unit is used for calculating the spliced information through a pseudorandom function and encrypting the second storage position based on the calculation result.
Further, the storage unit is further configured to:
if the first storage position already stores data, store the union of the first identification vector and the stored data in the first storage position;
and if the first storage position does not store data, store the first identification vector in the first storage position.
In a fourth aspect, an embodiment of the present invention provides a file processing apparatus, including:
the second determining unit is used for determining a first identification vector of the file to be searched according to the position of the file to be searched in the file set; determining a storage position containing the first identification vector in the stored data; determining an original storage position according to the storage position of the first identification vector and an inverse function of the pseudo-random permutation;
and the searching unit is used for searching the file to be searched according to the characteristic vector corresponding to the original storage position.
Further, the second determining unit is specifically configured to:
determine a second identification vector according to the first identification vector, the second identification vector being a union formed by the first identification vector and other vectors;
determine a storage position of the second identification vector.
Further, the second determining unit is specifically configured to:
determine encrypted storage position information corresponding to the second identification vector;
and decrypt the encrypted storage position information according to the identification information corresponding to the file set and the second identification vector to obtain the storage position of the second identification vector.
In a fifth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor;
the processor, when running said computer program, performs the steps of one or more of the preceding claims.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions; the computer-executable instructions, when executed by a processor, are capable of implementing the methods described in one or more of the preceding claims.
The file processing method provided by the invention comprises: determining a first identification vector of a file according to the position of the file in a file set; generating a first storage position for storing the first identification vector according to a hash function and the feature vector of the file; determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector; and storing the second identification vector in a second storage position obtained by applying a pseudo-random permutation to the first storage position. During a file search, the second storage position is used to indicate, according to the first identification vector of the file to be searched, the first storage position corresponding to the feature vector used to find that file. In this way, the corresponding first storage position can be found from the second storage position based on the position of the file in the file set, and the file to be searched can be located from the feature vector corresponding to the first storage position. The user does not need to provide exact file keywords, which greatly improves file search efficiency. Moreover, because the scheme relies on computations such as a hash function and a pseudo-random permutation, it offers better privacy and security than directly recording the correspondence between a file and its storage position.
Drawings
Fig. 1 is a schematic flowchart of a file processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a file processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a file processing method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a file processing method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a file processing apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a file processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" merely distinguish similar objects and do not denote a particular order. It is to be understood that "first", "second", and "third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the invention described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present invention provides a file processing method, including:
S110: determining a first identification vector of a file according to the position of the file in a file set;
S120: generating a first storage location for storing the first identification vector according to a hash function and the feature vector of the file;
S130: determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector;
S140: storing the second identification vector in a second storage location obtained by applying a pseudo-random permutation to the first storage location; during a file search, the second storage location is used to indicate, according to the first identification vector of the file to be searched, the first storage location corresponding to the feature vector used to find that file.
In the embodiment of the present invention, the file set may be a set of files that need to be stored for subsequent search. A file may be, for example, a multimedia file such as an audio or video file, or a file in another form such as a log data file. For example, the file set may be a complete video, and the files may be the frame pictures obtained by splitting the video frame by frame, or the video clips obtained by dividing the video. The position of a file in the file set may indicate which file in the set it is (its ordinal number), or its arrangement relative to the other files in the set. For example, when the file set is a video, the position of a file may indicate which frame of the video the file is, or at which second of the video the clip occurs.
Here, the length of the first identification vector may be equal to the number of files in the file set, and each element may take the value 0 or 1 to indicate the corresponding file. For example, if the file set contains 100 files, the length of the first identification vector is 100, and the first identification vector of the i-th file may be the vector whose i-th element is 1 and whose remaining 99 elements are 0. In this way, the first identification vector characterizes the position of the file in the file set.
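As a minimal sketch of the vector described above, the following Python function builds a 0/1 vector with a single 1 at the file's position. The function name and the 1-based position convention are assumptions made for illustration only.

```python
def first_identification_vector(position: int, num_files: int) -> list[int]:
    """Return a 0/1 vector of length num_files with a single 1 at `position`.

    `position` is 1-based (the i-th file), which is an illustrative convention.
    """
    if not 1 <= position <= num_files:
        raise ValueError("position must be in [1, num_files]")
    return [1 if k == position else 0 for k in range(1, num_files + 1)]


# The 3rd file of a 5-file set.
ifv = first_identification_vector(3, 5)
```

For a 100-file set, the same call with `num_files=100` yields a vector of length 100 with exactly one element set to 1.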
The feature vector is a vector that can characterize the data features of the file, for example, for a multimedia data file, the feature vector can be a feature vector extracted by a color histogram algorithm. Illustratively, when the file is an image file, the feature vectors may be extracted by an OpenCV tool.
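A color histogram feature can be sketched without any imaging library. The example below is a dependency-free illustration, not the OpenCV procedure mentioned above: the function name, the bin count, and the representation of an image as a list of (r, g, b) tuples are all assumptions.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Compute a per-channel color histogram as a feature vector.

    `pixels` is an iterable of (r, g, b) tuples with values in 0..255.
    The result has 3 * bins_per_channel entries; each channel's bins sum to 1.
    """
    width = 256 // bins_per_channel
    hist = [0.0] * (3 * bins_per_channel)
    count = 0
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp so that the top values fall into the last bin.
            bin_index = min(value // width, bins_per_channel - 1)
            hist[channel * bins_per_channel + bin_index] += 1.0
        count += 1
    if count == 0:
        return hist
    return [h / count for h in hist]


# Two pure-red pixels: all red mass in the top red bin,
# all green and blue mass in their lowest bins.
p = color_histogram([(255, 0, 0), (255, 0, 0)])
```

With the default of 4 bins per channel, the feature vector has dimension d = 12.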
In one embodiment, for a file set containing a total of $\#f$ files, the $i$-th file is $f_i$, where $i \in [1, \#f]$. The feature vector of $f_i$ may be a $d$-dimensional feature vector $p_i$ obtained by a feature extraction algorithm. The first identification vector may be an Inverse File identification Vector (IFV) of $f_i$, whose length equals the number of files $\#f$ in the file set.
Step S120 may include: applying a location-sensitive hash function (that is, a locality-sensitive hash, LSH) to the feature vector $p_i$ and taking the resulting hash bucket position as the first storage location. For example, the first storage location is computed as $x = H(p_i) \bmod m$, where $H$ is the location-sensitive hash function, $\bmod$ denotes the remainder operation, and $m$ is the total number of hash buckets in the hash table corresponding to $H$.
In another embodiment, step S120 may instead be: selecting $M$ location-sensitive hash functions $h_1, h_2, \ldots, h_M$ and computing, for the feature vector $p_i$, $x_j = h_j(p_i) \bmod m + m(j-1)$ for each $j \in [1, M]$. Each hash function yields one hash bucket, so the $M$ functions yield $M$ hash buckets, i.e., $M$ first storage locations $x_1, x_2, \ldots, x_M$, each of which stores the first identification vector of $f_i$.
Here, $M$ may be a randomly determined number or a preset number. The $M$ functions may be selected at random from a family of location-sensitive hash functions, or $M$ preset functions may be used directly. Computing with several location-sensitive hash functions distributes the generation of storage locations, which avoids the lower security that results from establishing storage locations with a single hash function.
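The bucket computation $x_j = h_j(p_i) \bmod m + m(j-1)$ can be sketched as follows. The text does not say which hash family is used, so this example assumes a random-projection family of the form $\lfloor (a \cdot p + b) / w \rfloor$; the function names, the bucket width, and the seed are hypothetical, and bucket indices are 0-based.

```python
import math
import random


def make_lsh_family(num_functions, dim, bucket_width=4.0, seed=0):
    """Build M random-projection hash functions (an assumed LSH family)."""
    rng = random.Random(seed)
    family = []
    for _ in range(num_functions):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random direction
        b = rng.uniform(0.0, bucket_width)             # random offset
        family.append((a, b))

    def h(j, p):
        # j is a 0-based index into the family.
        a, b = family[j]
        projection = sum(ai * pi for ai, pi in zip(a, p))
        return math.floor((projection + b) / bucket_width)

    return h


def first_storage_locations(p, h, num_functions, m):
    """x_j = h_j(p) mod m + m*(j-1) for j = 1..M.

    Table j occupies the index range [m*(j-1), m*j).
    """
    return [h(j - 1, p) % m + m * (j - 1) for j in range(1, num_functions + 1)]


h = make_lsh_family(num_functions=3, dim=4)
xs = first_storage_locations([0.1, 0.2, 0.3, 0.4], h, num_functions=3, m=8)
```

The offset $m(j-1)$ places the buckets of the $j$-th hash function in their own range, so the $M$ hash tables can share one index space without colliding.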
In one embodiment, step S130 may include: determining the second identification vector from the first identification vector and the vectors stored in other storage locations of the hash table that contains the first storage location. Illustratively, if the first identification vector of $f_i$ is stored in the hash bucket $x$ generated by the location-sensitive hash function $H$, then the vectors stored in hash buckets other than $x$ in the hash table corresponding to $H$ (for example, the buckets adjacent to $x$ on its left and right) and the first identification vector together determine the second identification vector.
In another embodiment, if $M$ first storage locations are generated by $M$ location-sensitive hash functions, S130 may be: determining $M$ second identification vectors according to the storage locations in the $M$ hash tables and the first identification vector. For example, the second identification vector corresponding to each hash table may be determined from the storage locations in that hash table and the first identification vector.
In one embodiment, a pseudo-random permutation (PRP) is applied to the first storage location $x$ to generate a random location, i.e., the second storage location $X$. The second identification vector is stored in the second storage location. Because the second identification vector is associated with the first identification vector, a user who needs to search for a file can find the second storage location of the associated second identification vector from the first identification vector alone. The first storage location can then be recovered from the second storage location using the inverse function of the PRP, and the corresponding feature vector, i.e., the feature vector of the file to be searched, is determined from the first storage location. The corresponding file can be matched directly from the feature vector.
In another embodiment, if $M$ first storage locations are generated by $M$ location-sensitive hash functions, the pseudo-random permutation $F(x)$ is applied to each of the $M$ first storage locations $x_1, x_2, \ldots, x_M$ to obtain $M$ second storage locations $X_1, X_2, \ldots, X_M$, where $X_j = F(x_j)$ for $j \in [1, M]$. Each of the $M$ second identification vectors is stored in the second storage location corresponding to its first storage location in the respective hash table.
Thus, using the second storage location and the inverse function of the pseudo-random permutation, the corresponding first storage location can be found based on the position of the file in the file set. The file to be searched can then be located from the feature vector corresponding to the first storage location, without the user having to provide exact file keywords, which greatly improves file search efficiency. Moreover, because storage locations are computed and transformed through a hash function and a pseudo-random permutation, data such as the first identification vector of a user's file is stored more securely than when the correspondence between files and their storage locations is recorded directly.
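The role of the permutation and its inverse can be illustrated with a toy keyed permutation over the bucket indices. This is only a sketch under stated assumptions: Python's `random.Random` is not cryptographically secure, and a real deployment would need a proper pseudo-random permutation; the function name `make_prp` and the integer key are hypothetical.

```python
import random


def make_prp(key: int, domain_size: int):
    """Return (F, F_inv): a keyed permutation of range(domain_size) and its inverse.

    Toy construction for illustration only; not a secure PRP.
    """
    table = list(range(domain_size))
    random.Random(key).shuffle(table)  # keyed, deterministic shuffle
    inverse = [0] * domain_size
    for x, big_x in enumerate(table):
        inverse[big_x] = x

    def F(x):
        return table[x]

    def F_inv(big_x):
        return inverse[big_x]

    return F, F_inv


F, F_inv = make_prp(key=42, domain_size=16)
x = 5            # first storage location
X = F(x)         # second storage location
recovered = F_inv(X)
```

Because the mapping is a bijection, applying `F_inv` to a second storage location always recovers the first storage location it came from.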
In some embodiments, the S130 may include:
determining an adjacent storage position of the first storage position according to a hash table corresponding to the hash function;
generating a second identification vector based on a union of vectors stored in the adjacent storage locations and the first identification vector.
In the embodiment of the present invention, the adjacent storage location of the first storage location may be a storage location closest to the left side of the first storage location in the hash table, and/or a storage location closest to the right side of the first storage location. Illustratively, the first storage location is hash bucket x in the hash table, then the adjacent storage locations may be adjacent buckets located to the left and/or right of hash bucket x.
In one embodiment, since the location-sensitive hash function $H$ may be used to generate first storage locations for multiple files, the storage locations in the hash table corresponding to $H$ may store the first identification vectors of different files. Illustratively, the first storage location $x$ stores the inverse file identification vector of $f_i$, and the adjacent storage locations to the left and right of $x$ each store their own vectors. The second identification vector $fid_i$ is the union of these three vectors: the vector stored to the left of $x$, the first identification vector stored at $x$, and the vector stored to the right of $x$, combined element by element with $\vee$, where $\vee$ denotes the OR operation.
In another embodiment, if M first storage locations are generated by M location-sensitive hash functions, vectors in adjacent storage locations of the first storage location in each hash table are determined, and a first identification vector is merged with the vectors in the adjacent storage locations in each hash table and stored in a corresponding second storage location of the hash table.
In another embodiment, the pseudo-random permutation and the union over adjacent storage locations may be applied to every storage location in the hash table: each storage location is mapped by the pseudo-random permutation to a corresponding second storage location, and the union of the vector stored in that storage location with the vectors stored in its adjacent storage locations is stored in that second storage location.
When two files are highly similar, their feature vectors are also highly similar. Owing to the properties of the location-sensitive hash function, similar feature vectors hash to storage locations that are close together, and the higher the similarity, the closer the storage locations. Therefore, the files corresponding to the vectors stored in adjacent storage locations of the hash table are the most similar to the file corresponding to the current first identification vector. Accordingly, storing the union of the vectors of the adjacent storage locations and the first identification vector in the second storage location enables similarity search: when a user searches with a first identification vector, both the file to be searched and the files most similar to it can be returned.
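The union of a bucket with its neighbors can be sketched as an element-wise OR over 0/1 lists. In this illustration the hash table is a plain Python list of equal-length vectors with every bucket populated, and a bucket at the edge of the table simply has one neighbor; both choices, like the function names, are assumptions.

```python
def bitwise_or(*vectors):
    """Element-wise OR of equal-length 0/1 vectors."""
    return [int(any(bits)) for bits in zip(*vectors)]


def second_identification_vector(table, x):
    """OR the vector in bucket x with the vectors in its left and right neighbors.

    Neighbors that fall outside the table are skipped (an assumed edge rule).
    """
    neighbors = [table[k] for k in (x - 1, x + 1) if 0 <= k < len(table)]
    return bitwise_or(table[x], *neighbors)


# Three buckets holding the vectors of files 1, 2 and 4 of a 4-file set.
table = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
fid = second_identification_vector(table, 1)
```

The resulting vector marks file 2 together with files 1 and 4, whose buckets are adjacent, which is what lets a search for one file also surface its most similar files.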
In some embodiments, the method further comprises:
if no data is stored in the adjacent storage location, writing an all-zero vector of the same length as the first identification vector to the adjacent storage location.
In the embodiment of the present invention, some storage locations in the hash table may be empty, meaning that they store no data. To avoid calculation errors caused by empty adjacent storage locations when forming the union, an all-zero vector of the same length as the first identification vector is written, before the calculation, to every storage location in the hash table that stores no data.
Illustratively, when the length of the first identification vector is $\#f$, an all-zero vector of length $\#f$ is stored in every empty hash bucket in the hash table.
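A minimal sketch of this padding step, assuming that an empty bucket is represented by `None` in a Python list (a representation chosen here for illustration):

```python
def fill_empty_buckets(table, vector_length):
    """Replace every empty bucket (None) with an all-zero vector."""
    return [
        bucket if bucket is not None else [0] * vector_length
        for bucket in table
    ]


# A 3-bucket table for a 2-file set, with only the middle bucket populated.
padded = fill_empty_buckets([None, [1, 0], None], vector_length=2)
```

Since OR with an all-zero vector leaves the other operand unchanged, the padding does not alter any union computed afterwards.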
In some embodiments, as shown in fig. 2, the method further comprises:
S150: splicing the identification information corresponding to the file set with the information of the second storage location;
S160: computing a pseudo-random function over the spliced information, and encrypting the second storage location based on the result.
In the embodiment of the present invention, since the second storage location is used for searching for a file, the encrypted second storage location may be recorded as index information of the file. The identification information corresponding to the file set identifies each file set and may, for example, be represented by an index identifier (IndexID).
Unlike a pseudo-random permutation, a pseudo-random function (PRF) has no inverse function. An object encrypted with the pseudo-random function is therefore decrypted by applying the same encryption operation a second time, which yields the original object. This further improves the security of creating and storing the storage location and greatly reduces the probability that the encryption and decryption process is intercepted and misused.
In one embodiment, the second storage location is obtained from the first storage location by a pseudo-random permutation, and the second storage location is encrypted with a pseudo-random function g to obtain the encrypted storage location information I. In the encryption formula, := denotes assignment, ⊕ denotes the exclusive-or (XOR) operation, and || denotes the concatenation of character strings. Consistent with steps S150 and S160, the input of g is the identification information of the file set concatenated with the information of the second storage location, and the output of g is combined with the data to be protected by XOR.
Therefore, using the encrypted second storage location as the index information that indicates the file to be searched greatly improves the security of creating and searching the file index information. In addition, because the identity of each file set is represented by the IndexID, the second storage locations of different file sets can be effectively distinguished when multiple file sets exist at the same time.
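The following sketch illustrates why XOR masking with a pseudo-random function needs no inverse: applying the same operation twice restores the original data. It is only an illustration under stated assumptions. HMAC-SHA256 in counter mode stands in for the pseudo-random function, the key, the `"||"` separator and the IndexID string are hypothetical, and the masked payload (here, the bytes held at the second storage location) is an assumption rather than the formula of the patent.

```python
import hashlib
import hmac


def prf(key: bytes, message: bytes, n_bytes: int) -> bytes:
    """HMAC-SHA256 in counter mode, used as a stand-in pseudo-random function."""
    out = b""
    counter = 0
    while len(out) < n_bytes:
        block = message + counter.to_bytes(4, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:n_bytes]


def mask(key: bytes, index_id: str, location: int, data: bytes) -> bytes:
    """XOR data with PRF(IndexID || location); calling it twice restores data."""
    spliced = index_id.encode() + b"||" + str(location).encode()
    pad = prf(key, spliced, len(data))
    return bytes(a ^ b for a, b in zip(data, pad))


key = b"demo-key"                    # hypothetical secret key
plain = bytes(range(16))             # hypothetical payload
cipher = mask(key, "index-A", 7, plain)
assert mask(key, "index-A", 7, cipher) == plain   # second application decrypts
```

Binding the pad to both the IndexID and the location means that the same payload stored at two locations, or in two indexes, is masked with different pads.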
In some embodiments, as shown in fig. 3, the method further comprises:
S101: if the first storage position already stores data, forming a union of the first identification vector and the stored data and storing the union in the first storage position;
S102: if the first storage position does not store data, storing the first identification vector into the first storage position.
In the embodiment of the present invention, the first storage location is generated by hashing the feature vector of the current file with the hash function. In practical applications, however, a hash collision may occur, in which the hash value of the feature vector of the current file is already occupied. As a result, the first storage location may already hold other vectors before the first identification vector of the current file is stored.
Therefore, in order to ensure that the first identification vector of the current file can be effectively stored in the first storage position without influencing the previously stored vectors, the first identification vector and the previously stored vectors are stored in the first storage position together in a union mode.
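Steps S101 and S102 can be sketched as a single insertion routine. This is a minimal sketch; the dictionary used as the hash table and the function name `insert_vector` are assumptions made for illustration.

```python
def insert_vector(table, location, vector):
    """Store vector at location; if the location is occupied, store the
    element-wise OR (union) of the old and new vectors instead."""
    existing = table.get(location)
    if existing is None:
        table[location] = list(vector)                                # S102
    else:
        table[location] = [a | b for a, b in zip(existing, vector)]   # S101
    return table[location]


table = {}
insert_vector(table, 5, [1, 0, 0])   # first file hashed to location 5
insert_vector(table, 5, [0, 0, 1])   # colliding file: vectors are merged
print(table[5])  # [1, 0, 1]
```

The union keeps the bits of every file that hashed to the location, so a collision never overwrites an earlier entry.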
As shown in fig. 4, an embodiment of the present invention provides a file processing method, including:
S210: determining a first identification vector of a file to be searched according to the position of the file to be searched in a file set;
S220: determining a storage position containing the first identification vector in the stored data;
S230: determining an original storage position according to the storage position of the first identification vector and an inverse function of the pseudo-random permutation;
S240: searching the file to be searched according to the feature vector corresponding to the original storage position.
In the embodiment of the present invention, the position of the file to be searched in the file set may be determined as follows: for example, if the file to be searched is a video clip, the position is the second of the complete video (the file set) at which the clip is located; if the file to be searched is a frame picture file, the position is the number of that frame within the complete video.
Since the data obtained by the union operation on the first identification vector is stored in the second storage location derived from the first storage location, a search based on the first identification vector matches at least one second storage location containing the first identification vector.
In one embodiment, since the second storage location is generated from the first storage location by pseudo-random permutation, the storage location obtained during a file search is the second storage location, and the first storage location, that is, the original storage location, can be calculated with the inverse function of the pseudo-random permutation.
In another embodiment, the original storage location is obtained by hashing the feature vector with the position-sensitive hash function H, so there is a one-to-one correspondence between the feature vector and the original storage location; once the original storage location is obtained, the corresponding feature vector can be determined. The corresponding file can then be found based on the feature vector.
In another embodiment, if multiple second storage locations containing the first identification vector are found based on the first identification vector, the original storage location corresponding to each of them is determined, and the corresponding feature vectors and files are determined in turn. The files obtained include the file to be searched, and all the other files are highly similar to it. A similarity search can therefore be performed according to the user's search requirements, so that the user can select a suitable file from several highly similar files.
Therefore, when a user needs to search for a file, the file to be searched can be obtained only by providing the first identification vector representing the position of the file in the file set, and an accurate file keyword does not need to be provided for searching, so that the searching efficiency is greatly improved. On the basis, similarity search content can be provided, more selection space is provided for the user, and the flexibility is higher.
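Steps S210 to S240 can be sketched end to end on a toy index. Everything here is illustrative and assumed: the keyed shuffle is a non-cryptographic stand-in for the pseudo-random permutation, `files_by_location` is a hypothetical lookup from an original storage location to a file, and encryption of the index is left out.

```python
import random


def make_prp(key, n):
    """Toy keyed permutation of range(n) with its inverse (not cryptographically secure)."""
    forward = list(range(n))
    random.Random(key).shuffle(forward)
    inverse = [0] * n
    for original, permuted in enumerate(forward):
        inverse[permuted] = original
    return forward, inverse


def search(index, inverse, files_by_location, position, n_files):
    """Follow S210-S240 for the file at `position` (0-indexed) in the file set."""
    query = [int(k == position) for k in range(n_files)]            # S210
    hits = [loc for loc, vec in index.items()
            if all(v >= q for v, q in zip(vec, query))]             # S220
    originals = sorted(inverse[loc] for loc in hits)                # S230
    return [files_by_location[o] for o in originals
            if o in files_by_location]                              # S240


# Two files; file 0 hashed to bucket 1 and file 1 to bucket 2 (of 4 buckets).
joint = [[1, 0], [1, 1], [1, 1], [0, 1]]   # union of each bucket with its neighbours
forward, inverse = make_prp(key=2021, n=4)
index = {forward[y]: vec for y, vec in enumerate(joint)}
files_by_location = {1: "f0", 2: "f1"}

print(search(index, inverse, files_by_location, position=0, n_files=2))  # ['f0', 'f1']
```

The query for file 0 also returns file 1 because the two files sit in adjacent buckets, which is the similarity-search behaviour described above.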
In some embodiments, step S220 may include:
determining a second identification vector according to the first identification vector; the second identification vector is a union formed by the first identification vector and other vectors;
determining a storage location of the second identification vector.
In the embodiment of the present invention, the second storage location stores a union of the first identification vector and data stored in an adjacent storage location, so that the storage location containing the first identification vector can be found based on the first identification vector.
In one embodiment, in the process of generating the second storage locations, pseudo-random permutation and the union operation are applied to every hash bucket (first storage location) in the hash table, so that the m second storage locations in the hash table all hold unions of the vectors stored in adjacent buckets. The multiple storage locations found based on the first identification vector therefore correspond, respectively, to the first storage location that stores the first identification vector and to the storage locations adjacent to it.
Therefore, the first storage position where the first identification vector is located and the adjacent storage positions can be effectively queried based on the first identification vector, and the similarity of the feature vectors corresponding to the adjacent storage positions is high due to the characteristics of the position sensitive hash function. Based on the method, one or more files with higher similarity to the file to be searched can be obtained, and similarity search is realized.
In some embodiments, said determining a storage location of said second identification vector comprises:
determining encrypted storage position information corresponding to the second identification vector;
and decrypting the encrypted storage position information according to the identification information corresponding to the file set and the second identification vector to obtain the storage position of the second identification vector.
In the embodiment of the present invention, after the storage location of the second identification vector, that is, the second storage location is generated, the second storage location is encrypted to generate encrypted storage location information. Therefore, when searching for a file, one or more pieces of encrypted storage location information are matched based on the first identification vector, and the corresponding second storage location needs to be decrypted.
In one embodiment, the encrypted storage location information may be decrypted based on the pseudo-random function and an exclusive-or operation, yielding the second storage location together with the identification information of the file set, the second identification vector, and so on.
In another embodiment, the obtained second identification vector is a union of the first identification vector and vectors in adjacent storage locations, and the similarity between the file corresponding to the vector in the adjacent storage location and the file corresponding to the first identification vector is higher. Therefore, one or more identification vectors contained in the second identification vector can be recorded, and respective matching search can be performed for each identification vector to determine files corresponding to adjacent storage locations and files corresponding to adjacent storage locations of the adjacent storage locations, so that similarity search in a wider range can be quickly realized, and more selectable similar files can be provided for users.
As shown in fig. 5, an embodiment of the present invention provides a file processing apparatus, where the apparatus includes:
a first determining unit 110, configured to determine a first identification vector of a file according to a position of the file in a file set; determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector;
a generating unit 120, configured to generate a first storage location for storing the first identification vector according to a hash function and a feature vector of the file;
a storage unit 130, configured to store the second identification vector in a second storage location obtained by performing pseudo-random permutation on the first storage location; the second storage position is used for indicating a first storage position corresponding to the characteristic vector used for searching the file to be searched according to the first identification vector of the file to be searched when the file is searched.
In some embodiments, the storage unit 130 is specifically configured to:
determining an adjacent storage position of the first storage position according to a hash table corresponding to the hash function;
generating a second identification vector based on a union of vectors stored in the adjacent storage locations and the first identification vector.
In some embodiments, the apparatus further comprises:
a writing unit, configured to write an all-zero vector of the same length as the first identification vector into the adjacent storage position if the adjacent storage position does not store data.
In some embodiments, the apparatus further comprises:
a splicing unit, configured to splice the identification information corresponding to the file set with the information of the second storage position;
an encryption unit, configured to calculate the spliced information through a pseudo-random function and to encrypt the second storage position based on the calculation result.
In some embodiments, the storage unit 130 is further configured to:
if the first storage position stores data, the first identification vector and the stored data form a union set and are stored in the first storage position;
and if the first storage position does not store data, storing the first identification vector into the first storage position.
As shown in fig. 6, an embodiment of the present invention provides a file processing apparatus, where the apparatus includes:
a second determining unit 210, configured to determine a first identification vector of the file to be searched according to a position of the file to be searched in the file set; determine a storage position containing the first identification vector in the stored data; and determine an original storage position according to the storage position of the first identification vector and an inverse function of the pseudo-random permutation;
a searching unit 220, configured to search the file to be searched according to the feature vector corresponding to the original storage position.
In some embodiments, the second determining unit 210 is specifically configured to:
determining a second identification vector according to the first identification vector; the second identification vector is a union formed by the first identification vector and other vectors;
determining a storage location of the second identification vector.
In some embodiments, the second determining unit 210 is specifically configured to:
determining encrypted storage position information corresponding to the second identification vector;
and decrypting the encrypted storage position information according to the identification information corresponding to the file set and the second identification vector to obtain the storage position of the second identification vector.
One specific example is provided below in connection with any of the embodiments described above:
The embodiment of the invention provides a technique for similarity search over encrypted multimedia data.
1. A pseudo-random permutation (PRP) and a pseudo-random function (PRF) are used.
Let #f denote the total number of files in the file set; the multimedia data file $f_i$ is the $i$-th file in the file set, and $p_i$ denotes the $d$-dimensional feature vector extracted from the multimedia data file $f_i$. Each feature vector $p_i$ has a corresponding inverted file identification vector (IFV): if the $i$-th bit of an IFV is 1, the vector contains the $i$-th file, and if the bit is 0, it does not contain the $i$-th file. In addition, an inverted file identification vector obtained by combining multiple inverted file identification vectors is denoted $fid$. For a given vector $v$, $v[i]$ or $v_i$ denotes its $i$-th element.
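The notation can be made concrete with a short sketch. The helper names `ifv` and `contained_files` are hypothetical, and vectors are represented as Python lists of bits with 1-indexed file numbers, matching the text.

```python
def ifv(i, n_files):
    """Inverted file identification vector of the i-th file (1-indexed): only bit i is 1."""
    return [int(k == i) for k in range(1, n_files + 1)]


def contained_files(fid):
    """1-indexed file numbers whose bit is set in a (possibly combined) vector."""
    return [k for k, bit in enumerate(fid, start=1) if bit]


a, b = ifv(2, 5), ifv(4, 5)                 # vectors of files 2 and 4, with #f = 5
fid = [x | y for x, y in zip(a, b)]         # combined vector
print(fid)                   # [0, 1, 0, 1, 0]
print(contained_files(fid))  # [2, 4]
```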
2. The steps of constructing the encrypted search index are as follows:
Randomly select M position-sensitive hash functions $h_1, h_2, \ldots, h_M$ from a position-sensitive hash function family $\mathcal{H}$. These M hash functions are used to generate M hash tables.
For each multimedia data file $f_i$, where $i \in [1, \#f]$, obtain the corresponding $d$-dimensional feature vector $p_i$ using a feature extraction algorithm (such as a color histogram algorithm), and generate its inverted file identification vector (IFV).
Compute the positions of the M hash buckets produced by the position-sensitive hash functions, $x_1 = g_1(p_i), \ldots, x_M = g_M(p_i)$, with $g_j(p_i) = (h_j(p_i) \bmod m) + m(j-1)$, where $j \in [1, M]$ and $m$ is the number of buckets in each hash table.
For each bucket position $x_j$, where $j \in [1, M]$, insert the inverted file identification vector of $f_i$ into the bucket according to the following rule. If the bucket $I[x_j]$ is empty, the vector is inserted into $I[x_j]$ directly; otherwise, the "new" vector is combined by an OR operation with the "old" vector already stored at that position, and the result is stored in $I[x_j]$.
Let $y_1, y_2, \ldots, y_{Mm}$ denote the positions of all buckets, and store an all-zero vector of length $|fid|$ in every empty bucket. For each bucket $y_i$, $i \in [1, Mm]$, take the inverted file identification vector stored in $I[y_i]$ and compute its union with $fid_i^-$ and $fid_i^+$ as the "joint" inverted file identification vector $fid_i$, where $fid_i^-$ and $fid_i^+$ denote the inverted file identification vectors stored in the left and right adjacent buckets, respectively.
For each bucket $y_i$, $i \in [1, Mm]$, generate a new corresponding random position by applying the pseudo-random permutation to $y_i$, and store $fid_i$ at that position in the index $I$.
For each permuted bucket position in the index, $i \in [1, Mm]$, encrypt $fid_i$ using the pseudo-random function and an exclusive-or operation, where $||$ denotes the concatenation of character strings and IndexID uniquely identifies the encrypted similarity index $I$.
For each $i \in [1, \#f]$, encrypt the file $f_i$ with a PCPA-secure symmetric encryption scheme SKE to obtain the ciphertext $c_i$.
Finally, upload the resulting $(I, c)$ to a remote server, where $c = (c_1, \ldots, c_{\#f})$.
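The plaintext part of the construction steps above (bucket positions, insertion with OR, zero-filling, and the union with the neighbouring buckets) can be sketched as follows. This is a sketch under stated assumptions: the two hash functions are simple hypothetical stand-ins rather than real position-sensitive hashes, buckets and files are 0-indexed in the code, the end buckets use only their single neighbour, and the permutation and encryption steps are omitted.

```python
def bucket_position(h_value, j, m):
    """g_j = (h_j mod m) + m * (j - 1): offset into table j of M concatenated tables."""
    return (h_value % m) + m * (j - 1)


def build_index(features, hash_functions, m):
    """Return (buckets, joint) for the M*m buckets of M hash tables."""
    n_files = len(features)
    size = len(hash_functions) * m
    # Empty buckets hold all-zero vectors of length #f.
    buckets = [[0] * n_files for _ in range(size)]
    for i, p in enumerate(features):
        for j, h in enumerate(hash_functions, start=1):
            x = bucket_position(h(p), j, m)
            buckets[x][i] = 1          # OR the vector of file i into the bucket
    joint = []
    for y in range(size):
        neighbours = [buckets[k] for k in (y - 1, y, y + 1) if 0 <= k < size]
        joint.append([int(any(bits)) for bits in zip(*neighbours)])
    return buckets, joint


# Hypothetical 2-dimensional features and stand-in hash functions (M = 2, m = 4).
features = [(1, 2), (1, 3), (6, 1)]
hashes = [lambda p: sum(p), lambda p: p[0]]
buckets, joint = build_index(features, hashes, m=4)
print(buckets[3])  # [1, 0, 1]
print(joint[4])    # [1, 1, 1]
```

In this toy run files 0 and 2 collide in bucket 3, and the joint vector of bucket 4 picks up all three files from its two neighbours.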
3. The encrypted similarity index above is constructed over the entire set f of multimedia data files rather than over each data file individually; the length of the inverted file identification vector is therefore set to #f.
In addition to the inverted file identification vectors, a hash table needs to be built in which a hash function H maps each inverted file identification vector to a unique real file identifier (for example, a pathname in a file system). Given an inverted file identification vector, a formula involving H and a remainder operation modulo the file set size #f determines the index at which the corresponding entry can be looked up. By locating the actual file identifier extracted from the corresponding bucket, the corresponding data file can be found. The IndexID is used because, in practice, a data owner may want to create indexes from different combinations of multimedia data files, and the IndexID distinguishes each such index.
An embodiment of the present invention further provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor, the computer program when executed by the processor performing the steps of one or more of the methods described above.
An embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and after being executed by a processor, the computer-executable instructions can implement the method according to one or more of the foregoing technical solutions.
The computer storage media provided by the present embodiments may be non-transitory storage media.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may be separately used as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
In some cases, any two of the above technical features may be combined into a new method solution without conflict.
In some cases, any two of the above technical features may be combined into a new device solution without conflict.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method of file processing, the method comprising:
determining a first identification vector of a file according to the position of the file in a file set;
generating a first storage position for storing the first identification vector according to a hash function and the feature vector of the file;
determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector;
storing the second identification vector into a second storage position obtained by performing pseudo-random permutation on the first storage position; the second storage position is used for indicating a first storage position corresponding to the feature vector used for searching the file to be searched according to the first identification vector of the file to be searched when the file is searched.
2. The method of claim 1, wherein determining a second identification vector according to the hash table corresponding to the hash function and the first identification vector comprises:
determining an adjacent storage position of the first storage position according to a hash table corresponding to the hash function;
generating a second identification vector based on a union of vectors stored in the adjacent storage locations and the first identification vector.
3. The method of claim 2, further comprising:
if no data is stored in the adjacent storage location, writing an all 0 vector of the same length as the first identification vector in the adjacent storage location.
4. The method of claim 1, further comprising:
splicing the identification information corresponding to the file set with the information of a second storage position;
and calculating the spliced information through a pseudo-random function, and encrypting the second storage position based on the calculation result.
5. The method of claim 1, further comprising:
if the first storage position stores data, the first identification vector and the stored data form a union set and are stored in the first storage position;
and if the first storage position does not store data, storing the first identification vector into the first storage position.
6. A method of file processing, the method comprising:
determining a first identification vector of a file to be searched according to the position of the file to be searched in a file set;
determining a storage position containing the first identification vector in the stored data;
determining an original storage position according to the storage position of the first identification vector and an inverse function of the pseudo-random permutation;
and searching the file to be searched according to the characteristic vector corresponding to the original storage position.
7. The method of claim 6, wherein determining a storage location in the stored data that contains the first identification vector comprises:
determining a second identification vector according to the first identification vector; the second identification vector is a union formed by the first identification vector and other vectors;
determining a storage location of the second identification vector.
8. The method of claim 7, wherein determining the storage location of the second identification vector comprises:
determining encrypted storage position information corresponding to the second identification vector;
and decrypting the encrypted storage position information according to the identification information corresponding to the file set and the second identification vector to obtain the storage position of the second identification vector.
9. An electronic device, characterized in that the electronic device comprises: a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor, when executing the computer program, performs the steps of the file processing method of any one of claims 1 to 8.
10. A computer-readable storage medium having stored thereon computer-executable instructions; the computer-executable instructions, when executed by a processor, implement the file processing method of any one of claims 1 to 8.
CN202111056882.XA 2021-09-09 2021-09-09 File processing method, electronic device and storage medium Pending CN113868441A (en)

Publications (1)

Publication Number Publication Date
CN113868441A true CN113868441A (en) 2021-12-31

Family

ID=78995211

Country Status (1)

Country Link
CN (1) CN113868441A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination