CN107291935B - Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method - Google Patents

Info

Publication number
CN107291935B
Authority
CN
China
Prior art keywords
query
nearest neighbor
cpir
data
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710536073.6A
Other languages
Chinese (zh)
Other versions
CN107291935A (en
Inventor
王波涛
王国仁
陈月梅
李昂
岳春成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201710536073.6A
Publication of CN107291935A
Application granted
Publication of CN107291935B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention discloses a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding. The method first compresses the data of the nearest neighbor matrix with Huffman coding to reduce the number of bits of the data in each grid, and then stores the compressed data, the character code lengths and the maximum element value in the spatial database HBase. The server side reads the data from the HBase database and caches it in an RDD of the Spark parallel framework; the CPIR nearest neighbor matrix in the RDD is grouped according to a parallel strategy, the Spark server side performs the CPIR computation in parallel according to the query information, the results of all groups are aggregated, and the query result and the character code lengths are sent to the client. The client parses the query result to obtain the values of the queried bits and decompresses them to obtain the query information. In big data application scenarios, the privacy protection query algorithm based on Spark parallelization and Huffman coding protects the user's query privacy and improves query efficiency while preserving the original query results.

Description

Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method
Technical Field
The invention relates to the technical field of communication networks, in particular to a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding.
Background
With the continuous development of mobile devices, the emergence of various positioning and communication technologies, and the popularization of mobile terminals, mobile applications represented by Location Based Services (LBS) have entered the era of mobile big data. Processing the ever-increasing volume of data using only the computing power of existing PC and server architectures is no longer satisfactory, yet raising computing power by upgrading hardware consumes large amounts of money and material resources without providing effective horizontal scalability or maintainability. Considerable research has therefore been devoted to cost saving, horizontal scalability and maintainability, and Google first proposed the concept of "Cloud Computing" at the Search Engine Strategies conference (SES San Jose 2006). Cloud computing is a form of parallel computing that distributes computation over a large number of distributed computers rather than local computers or remote servers, so that it operates more like the Internet. This enables an enterprise to switch resources to the applications that need them and to access computers and storage systems on demand. Cloud computing is another major change following the transition from mainframes to the client-server model in the 1980s.
The cloud platform is well suited to processing mobile big data, and migrating traditional LBS applications and LBS privacy protection techniques to the cloud is a development trend of both and one of the current research hotspots. In the big data era, latent information can be obtained by analyzing, summarizing and mining big data, and this information can help enterprises and merchants obtain large benefits, for example by adjusting market policies, reducing and avoiding risks, and responding rationally to market changes. However, as big data mining techniques continue to emerge and mature, mining latent information carries the risk of leaking personal privacy, which seriously threatens personal information security, the business secrets of enterprises and the security of nations. With the development and popularization of big data applications, protecting personal privacy is both important and a serious challenge.
At present, privacy protection research falls mainly into three categories: generalization-based privacy protection techniques, encryption-based privacy protection techniques, and perturbation-based privacy protection techniques. Encryption-based techniques are mainly represented by Computational Private Information Retrieval (CPIR). CPIR relies on the hardness of the quadratic residuosity problem: under a large composite modulus (typically 1024 bits), distinguishing quadratic residues from quadratic non-residues is computationally hard. The CPIR algorithm greatly reduces communication complexity and provides the strongest degree of privacy protection, but it increases computational complexity. LBS privacy protection involves a large number of computations and complex transformations, and the CPIR algorithm must scan the entire data space during computation, so it requires a large amount of computation and a long computation time that the computing power of a conventional platform cannot meet.
Disclosure of Invention
In view of the above problems, the invention aims to provide a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding that reduces the CPIR computation cost and thereby improves performance.
To solve the problems described in the background, the technical scheme of the invention is as follows:
A CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding comprises the following steps:
1) processing the file to obtain grids, and reading the nearest neighbor matrix data of the grids in the file;
2) compressing the elements of the nearest neighbor matrix data with Huffman coding to reduce the number of bits of each element;
3) storing the encoded nearest neighbor matrix data in the spatial database HBase;
4) after receiving the client's data query information, the server side reads the corresponding data from the HBase database according to the data query information and stores it in an RDD (Resilient Distributed Dataset) of the Spark parallel framework, groups the CPIR nearest neighbor matrix in the RDD according to the parallel strategy, performs the CPIR computation in parallel on Spark according to the query information, aggregates the results of all groups, and then sends the query result and the character code lengths to the client;
5) the client parses the query result to obtain the values of the queried bits, and decompresses them to obtain the query information.
In step 1), processing the file to obtain grids and reading the nearest neighbor matrix data of the grids in the file comprises:
constructing a Voronoi diagram from the interest points of the spatial data in the file, partitioning the spatial data with the Voronoi diagram to obtain Voronoi cells, overlaying a grid division on the Voronoi cells, counting the potential nearest neighbors of each grid, and finally obtaining the nearest neighbor matrix of the grids.
The step 2) specifically comprises the following steps:
2.1, creating a one-dimensional integer array, reading the nearest neighbor matrix character by character, counting the number of occurrences of each character and storing the counts in the array, with the count of the end character set to the total number of matrix elements;
2.2, computing the frequency of each character, and building a priority queue ordered by character frequency from low to high;
2.3, building a Huffman tree from the priority queue, encoding the characters in the Huffman tree and storing the code lengths in an array;
2.4, re-encoding each element of the nearest neighbor matrix, appending the end-character code after the element code, storing the result in a coding linked list, and, once coding is finished, counting the number of bits of each element and storing it in an array;
2.5, finding the maximum number of bits among the encoded elements;
2.6, padding each element in the coding linked list up to the maximum number of bits.
In step 2.6, the end character is first appended to the element to be padded, and the remaining bits are then filled with zeros.
The step 3) specifically comprises the following steps:
3.1, storing the compressed nearest neighbor matrix data in a two-dimensional byte array, in which the first dimension is the total number of nearest neighbor matrix elements and the second dimension is the maximum number of bytes of an element;
3.2, designing the RowKey of the HBase database: the reversed row number of each row of the nearest neighbor matrix is used as the HBase RowKey, so that the encoded nearest neighbor matrix data is evenly distributed over the HRegionServers of HBase;
3.3, storing the columns by column number, where the value of each column is the element of the grid at that column number in the row, and storing the character code lengths after compression in the database.
The client data query information includes:
computing the grid in which the query point is located from the position of the query point, generating the corresponding quadratic residue query for that grid, and finally sending the query, the grid division size and the selected parallel strategy to the server.
The step 4) comprises the following steps:
receiving the query, the grid division numbers and the parallel strategy sent by the client; reading the corresponding CPIR nearest neighbor matrix, character code lengths and maximum value from the HBase database according to the grid division data and storing them in an RDD of Spark; grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client; after grouping, performing the CPIR computation in parallel on Spark according to the query Q; and finally aggregating the results of all groups on Spark and sending the query result and the character code lengths to the client.
Grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client comprises Row-level grouping and Bit-level grouping:
Row-level grouping groups the CPIR matrix by rows; Bit-level grouping first obtains the number k of CPUs currently allocated by the cluster, and then groups the data of each row of the CPIR matrix according to the number of CPUs.
Step 5) comprises: receiving the result and the character code lengths returned by the server, performing the quadratic residue computation on the result to obtain the values of the queried bits, and decompressing those values to obtain the correct query result.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding. Compressing the data with Huffman coding reduces the data volume and therefore the amount of CPIR computation, and performing the server-side CPIR computation in parallel on the Spark framework reduces the computation time, which addresses the problem of long computation times. In big data application scenarios, the privacy protection query algorithm based on Spark parallelization and Huffman coding protects the user's query privacy and improves query efficiency while preserving the original query results.
Drawings
FIG. 1 is a flow chart of a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to the present invention;
FIG. 2 is a schematic view of the meshing of the Voronoi diagram of the present invention;
FIG. 3 shows the average server-side computation time under different grid divisions according to the present invention, wherein (a) is the average server-side computation time for the uniformly distributed data set, (b) is that for the Gaussian distributed data set, and (c) is that for the real data set.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding, which includes:
1) processing the file to obtain a grid, and reading nearest neighbor matrix data of the grid in the file;
A Voronoi diagram is constructed from the interest points of the spatial data in the file, the spatial data is partitioned with the Voronoi diagram to obtain Voronoi cells, a grid division is overlaid on the Voronoi cells, the potential nearest neighbors of each grid are counted, and the nearest neighbor matrix of the grids is finally obtained. A Voronoi diagram reflects the neighbor topology between spatial objects through a division of space; each polygon in the diagram is called a Voronoi cell, and the edges of a Voronoi cell lie on the perpendicular bisectors between adjacent spatial objects.
When the grid division is overlaid on the Voronoi cells, the grid size must be chosen with care. If the grids are too small, the nearest neighbor matrix contains too much data and the client incurs a large computation cost when parsing the query result; if the grids are too large, the nearest neighbor matrix contains too little data and compression is less effective. The grids therefore need to be sized reasonably according to the density of the interest points. Illustratively, the entire space is divided into G grids, each of which either intersects Voronoi cells or is contained in a Voronoi cell. As shown in FIG. 2, where p denotes an interest point and q denotes a query point, the entire space is divided into 5 × 5 grids; the grid numbered (1,1) intersects the Voronoi cells of interest points p1 and p2, while the grid numbered (2,1) is contained in the Voronoi cell of interest point p1. In FIG. 2, the grid in which query point q is located intersects the Voronoi cells of interest points p1 and p2, so the nearest neighbor of q may be p1 or p2, and the set {p1, p2} is called the potential nearest neighbor set of grid (2,1). The number of potential nearest neighbors therefore differs from grid to grid, depending on the distribution of the points and the grid size. The CPIR-V algorithm converts the nearest neighbor relations of the grids in FIG. 2 into a nearest neighbor storage matrix in which the potential nearest neighbors of each grid are stored; note that every element of the storage matrix has the same size. The CPIR-V algorithm first finds the maximum number p_max of potential nearest neighbors over all grids, and then pads the grids that have fewer than this maximum. In this example the maximum number of potential nearest neighbors of a grid is 3, so the other grids are padded up to 3.
The potential nearest neighbor relation of each grid is stored in this matrix, which is the basis and core of the CPIR-V nearest neighbor query algorithm.
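The construction of the padded nearest neighbor matrix can be illustrated with a minimal sketch. It is not the patented implementation: instead of intersecting grid rectangles with Voronoi polygons exactly, it samples points inside each grid and records which interest point is nearest, which only approximates the potential nearest neighbor set. The function name, the `samples` parameter and the filler value `pad` are assumptions made for illustration.

```python
from itertools import product

def potential_nn_matrix(pois, g, extent, samples=8, pad=-1):
    """Approximate the potential nearest-neighbor set of every grid.

    pois    : list of (x, y) interest points
    g       : number of grids per axis (g x g division)
    extent  : side length of the square data space [0, extent]^2
    samples : sample steps per axis inside each grid (assumption: sampling
              stands in for an exact grid/Voronoi-cell intersection test)
    pad     : hypothetical filler used to bring every grid up to p_max entries
    """
    cell = extent / g
    sets = {}
    for row, col in product(range(g), range(g)):
        found = set()
        for i, j in product(range(samples + 1), repeat=2):
            x = (col + i / samples) * cell
            y = (row + j / samples) * cell
            # Index of the interest point closest to the sample point.
            nearest = min(
                range(len(pois)),
                key=lambda k: (pois[k][0] - x) ** 2 + (pois[k][1] - y) ** 2,
            )
            found.add(nearest)
        sets[(row, col)] = sorted(found)
    # p_max: largest number of potential nearest neighbors over all grids.
    p_max = max(len(s) for s in sets.values())
    matrix = [
        [sets[(r, c)] + [pad] * (p_max - len(sets[(r, c)])) for c in range(g)]
        for r in range(g)
    ]
    return matrix, p_max
```

Every entry of the returned matrix has exactly `p_max` slots, which mirrors the requirement that all elements of the storage matrix have the same size.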
2) Compressing elements in the nearest neighbor matrix data by using Huffman coding to reduce Bit digits of the elements;
Huffman coding is a variable-length coding (VLC) method for lossless compression. It constructs a unique prefix code for each character according to the probability with which the character appears in the data to be encoded, such that the average code length is the shortest; the codes are derived from an optimal binary tree (the Huffman tree), and the resulting code is sometimes called an optimal code. Because Huffman coding is variable-length, characters with a higher probability of occurrence receive shorter codes and characters with a lower probability receive longer codes, so that the total code length over all characters is smaller than that of the original encoding.
The method specifically comprises the following steps:
2.1, creating a one-dimensional integer array, reading the nearest neighbor matrix character by character, counting the number of occurrences of each character and storing the counts in the array, with the count of the end character set to the total number of matrix elements;
2.2, computing the frequency of each character, and building a priority queue ordered by character frequency from low to high;
2.3, building a Huffman tree from the priority queue, encoding the characters in the Huffman tree and storing the code lengths in an array;
2.4, re-encoding each element of the nearest neighbor matrix, appending the end-character code after the element code, storing the result in a coding linked list, and, once coding is finished, counting the number of bits of each element and storing it in an array;
2.5, finding the maximum number of bits among the encoded elements;
2.6, padding each element in the coding linked list up to the maximum number of bits. Specifically, the end character is first appended to the element to be padded, and the remaining bits are then filled with zeros.
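Steps 2.1 to 2.6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented code: matrix elements are treated as decimal strings, the end character is a hypothetical `$` marker, bit strings are kept as Python strings of `0` and `1` for readability, and the end-character code is appended once before zero padding.

```python
import heapq
from collections import Counter
from itertools import count

END = "$"  # hypothetical end-of-element marker, not named in the source

def huffman_codes(freq):
    """Build a Huffman code table {symbol: bit string} from symbol counts."""
    tie = count()  # unique tie-breaker so heap tuples never compare symbols
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)  # step 2.2: priority queue, lowest count first
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:  # step 2.3: merge the two rarest nodes
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

def compress_matrix(matrix):
    """Encode every element, append the end code, pad to a common bit length."""
    elements = [str(e) for row in matrix for e in row]
    freq = Counter(ch for e in elements for ch in e)       # step 2.1
    freq[END] = len(elements)  # end character occurs once per element
    codes = huffman_codes(freq)
    encoded = ["".join(codes[ch] for ch in e) + codes[END]  # step 2.4
               for e in elements]
    max_bits = max(len(bits) for bits in encoded)           # step 2.5
    padded = [bits.ljust(max_bits, "0") for bits in encoded]  # step 2.6
    return padded, codes, max_bits
```

Because the code is prefix-free and every element ends with the end-character code, a decoder can stop at the end marker and ignore the zero padding that follows it.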
3) Storing the encoded nearest neighbor matrix data into a spatial database HBase;
HBase is a distributed, column-oriented Key-Value database: a highly reliable, high-performance, column-oriented and scalable distributed storage system. It uses Hadoop HDFS as its file storage system and is modeled on Google's Bigtable database, which is built on the Google File System, providing Bigtable-like functionality. With HBase, large-scale structured storage clusters can be built on inexpensive PC servers. Because HBase is a key-value database, it is also suitable for storing unstructured data.
The nearest neighbor matrix is stored in the spatial database HBase. The process of compressing the nearest neighbor matrix and storing it in the HBase database is as follows. The RowKey of the HBase database is designed so that data is evenly distributed over the HRegionServers of HBase: the reversed row number of each row of the nearest neighbor matrix is used as the HBase RowKey, so that no single HRegionServer stores too much data and every HRegionServer stores some data. The columns are stored by column number, the value of each column being the element of the grid at that column number in the row. Finally, the character code lengths after compression are stored in the database.
The method specifically comprises the following steps:
3.1, storing the compressed nearest neighbor matrix data in a two-dimensional byte array, in which the first dimension is the total number of nearest neighbor matrix elements and the second dimension is the maximum number of bytes of an element;
3.2, designing the RowKey of the HBase database: the reversed row number of each row of the nearest neighbor matrix is used as the HBase RowKey, so that the encoded nearest neighbor matrix data is evenly distributed over the HRegionServers of HBase;
3.3, storing the columns by column number, where the value of each column is the element of the grid at that column number in the row, and storing the character code lengths after compression in the database.
The specific storage format is shown in table 1.
TABLE 1 H-PCPIR-V information table structure
[Table image: BDA0001340582660000061]
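The reversed-row-number RowKey of step 3.2 can be sketched as below. The zero-padding width and the helper names are assumptions made for illustration; the source only states that the reversed row number is used as the RowKey. No HBase client calls are shown, only the key and column layout that would be written.

```python
def row_key(row_number, width=6):
    """Zero-pad the row number (hypothetical width) and reverse its digits."""
    return str(row_number).zfill(width)[::-1]

def build_row(row_number, encoded_row, width=6):
    """Return (RowKey, {column number: encoded element}) for one matrix row."""
    columns = {str(col): bits for col, bits in enumerate(encoded_row)}
    return row_key(row_number, width), columns
```

Consecutive row numbers differ in their last digit, so after reversal their keys differ in the first character. Keys that would otherwise be adjacent are thereby spread across the key space, which is the effect the RowKey design aims for.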
4) After receiving the client's data query information, the server side reads the corresponding data from the HBase database according to the data query information and stores it in an RDD (Resilient Distributed Dataset) of the Spark parallel framework, groups the CPIR nearest neighbor matrix in the RDD according to the parallel strategy, performs the CPIR computation in parallel on Spark according to the query information, aggregates the results of all groups, and then sends the query result and the character code lengths to the client;
The client data query information is produced as follows: the grid in which the query point is located is computed from the position of the query point, the corresponding quadratic residue query is generated for that grid, and the query, the grid division size and the selected parallel strategy are sent to the server. The specific steps are:
1: computing the grid G_{a,b} in which the query point is located from the position of the query point;
2: generating the query Q = (y_1, y_2, …, y_{g_x}), where y_i is a quadratic non-residue (QNR) for the subscript i = b and y_i is a quadratic residue (QR) for all remaining subscripts;
3: sending the query Q, the grid division numbers g_x and g_y and the parallel strategy stream to the server;
4: waiting for the server side to return the query result;
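The client steps above can be sketched as follows, assuming the standard quadratic-residuosity construction: the modulus is n = p·q, a quadratic residue is obtained by squaring a random unit, and the non-residue is chosen to be a non-residue modulo both p and q so that it cannot be told apart from a residue without the factors. The fixed Mersenne primes, the function names and the mapping from coordinates to (a, b) are assumptions for illustration; a real deployment would draw fresh random primes of the required size.

```python
import secrets
from math import gcd

# Demo primes only (both are Mersenne primes); an assumption for illustration.
P = 2**31 - 1
Q = 2**61 - 1

def is_qr(x, p, q):
    """Euler's criterion modulo both secret factors."""
    return pow(x, (p - 1) // 2, p) == 1 and pow(x, (q - 1) // 2, q) == 1

def random_unit(n):
    """Random element of [2, n-1] that is coprime to n."""
    while True:
        r = secrets.randbelow(n - 2) + 2
        if gcd(r, n) == 1:
            return r

def random_qr(n):
    r = random_unit(n)
    return r * r % n

def random_qnr(p, q):
    """Non-residue modulo p and modulo q (Jacobi symbol +1 modulo p*q)."""
    n = p * q
    while True:
        x = random_unit(n)
        if pow(x, (p - 1) // 2, p) == p - 1 and pow(x, (q - 1) // 2, q) == q - 1:
            return x

def locate_grid(x, y, extent_x, extent_y, g_x, g_y):
    """Hypothetical mapping of a position to grid indices (a = row, b = column)."""
    a = min(int(y / extent_y * g_y), g_y - 1)
    b = min(int(x / extent_x * g_x), g_x - 1)
    return a, b

def build_query(b, g_x, p, q):
    """Q = (y_0, ..., y_{g_x - 1}) with a QNR at index b and QRs elsewhere."""
    n = p * q
    return [random_qnr(p, q) if i == b else random_qr(n) for i in range(g_x)]
```

Only the client, which knows p and q, can run `is_qr`; the server sees g_x numbers that all have Jacobi symbol +1 and therefore cannot tell which column is being queried.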
The server side receives the query, the grid division numbers and the parallel strategy sent by the client, reads the corresponding CPIR nearest neighbor matrix, character code lengths and maximum value from the HBase database according to the grid division data, and stores them in an RDD of Spark. It then groups the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client and, after grouping, performs the CPIR computation in parallel on Spark according to the query Q. Finally, Spark aggregates the results of all groups and sends the query result and the character code lengths to the client.
Grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client comprises Row-level grouping and Bit-level grouping:
Row-level grouping groups the CPIR matrix by rows; Bit-level grouping first obtains the number k of CPUs currently allocated by the cluster, and then groups the data of each row of the CPIR matrix according to the number of CPUs.
The method specifically comprises the following steps:
1: the server obtains the CPIR matrix data, the character code lengths and the maximum value according to the grid division g_x and g_y, and caches them in the RDD;
2: if the parallel strategy is Row, the CPIR matrix data in the RDD is grouped by rows;
3: if the parallel strategy is Bit, the number k of CPUs allocated to the current cluster is obtained, and the data of each row is divided into k groups;
4: if the parallel strategy matches neither, Row is used by default and the CPIR matrix data in the RDD is grouped by rows;
5: Spark computes the CPIR of each group according to the query Q to obtain the query result Z;
6: Spark aggregates the result Z, and then sends the query result and the character code lengths to the client;
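The grouping and per-group computation can be sketched as below. This is an illustration under several assumptions, not the patented Spark job: a thread pool stands in for Spark executors, matrix elements are equal-length bit strings, Bit-level grouping is read as splitting the bit positions of each row into k groups, and the per-bit answer follows the standard quadratic-residuosity CPIR rule (multiply y_j for a 1 bit and y_j squared for a 0 bit).

```python
from concurrent.futures import ThreadPoolExecutor

def cpir_bits(row, query, n, positions):
    """CPIR answers for one matrix row, restricted to the given bit positions."""
    out = []
    for t in positions:
        z = 1
        for y, cell in zip(query, row):
            # A 1 bit contributes y, a 0 bit contributes y^2 (always a residue).
            z = z * (y if cell[t] == "1" else y * y) % n
        out.append((t, z))
    return out

def make_tasks(matrix, strategy, k):
    """Split the work by rows, or by k bit groups per row for the Bit strategy."""
    max_bits = len(matrix[0][0])
    tasks = []
    for r, row in enumerate(matrix):
        if strategy == "Bit":
            groups = [list(range(max_bits))[i::k] for i in range(k)]
        else:  # "Row", and the default when the strategy matches neither
            groups = [list(range(max_bits))]
        tasks += [(r, row, g) for g in groups if g]
    return tasks

def server_answer(matrix, query, n, strategy="Row", k=4):
    """Compute all groups in parallel and aggregate into one result per row."""
    tasks = make_tasks(matrix, strategy, k)
    result = [[0] * len(matrix[0][0]) for _ in matrix]
    with ThreadPoolExecutor(max_workers=k) as pool:
        parts = pool.map(
            lambda t: (t[0], cpir_bits(t[1], query, n, t[2])), tasks
        )
        for r, answers in parts:  # aggregation step
            for t, z in answers:
                result[r][t] = z
    return result
```

Both strategies yield the same aggregated result; they differ only in how finely the work is divided. Each returned value is a quadratic non-residue exactly when the corresponding bit in the queried column is 1.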
5) The client parses the query result to obtain the values of the queried bits, and decompresses them to obtain the query information. Specifically, the client receives the result and the character code lengths returned by the server, performs the quadratic residue computation on the result to obtain the values of the queried bits, and decompresses those values to obtain the correct query result.
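The client-side decoding can be sketched as follows, assuming the client holds the secret factors p and q and a Huffman code table. How the table is rebuilt from the transmitted code lengths is not shown; the `codes` argument, the `$` end marker and the function names are assumptions for illustration.

```python
def recover_bits(answers, p, q):
    """Map each server answer to a bit: non-residue -> 1, residue -> 0."""
    bits = []
    for z in answers:
        is_residue = (pow(z, (p - 1) // 2, p) == 1
                      and pow(z, (q - 1) // 2, q) == 1)
        bits.append("0" if is_residue else "1")
    return "".join(bits)

def huffman_decode(bits, codes, end="$"):
    """Decode a padded bit string, stopping at the end-character code."""
    by_code = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        sym = by_code.get(current)
        if sym is None:
            continue              # keep reading until a full code is matched
        if sym == end:
            return "".join(out)   # zero padding after the marker is ignored
        out.append(sym)
        current = ""
    raise ValueError("end marker not found")
```

For example, `huffman_decode(recover_bits(answers, p, q), codes)` turns the answers for the queried row into the original element string.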
Comparison of experimental results:
The cloud computing service platform used is an IBM xSeries System 3650M4 cluster with 5 nodes; the detailed configuration of each node is as follows:
CPU: 2 Xeon E5-2620 CPUs (6 cores × 2 threads each);
Memory: 32 GB;
Hard disk: 5 TB, 10000 rpm, RAID 5;
Operating system: CentOS 6.4;
Development tools: GNU toolkit (G++, GDB), Make, Vim, JDK, etc.
The development languages used for the experiments are standard C++, Java and Scala.
Three data sets are used, two synthetic and one real; the experimental results on them are analyzed to assess the query performance of the algorithm. The real data set is the Sequoia data set of California. The synthetic data sets follow a uniform distribution and a Gaussian distribution respectively, the Gaussian data set being distributed as (X, Y) ~ N(1, 1, 0, 0, 1). The data range is 1046435 × 1929615; note that the abscissa x and the ordinate y are of type int. The big-number library used with C++ is GMP, and the Java code uses the BigInteger class of the JDK. Note also the rule for choosing the threshold θ used in the quadratic residue computation: θ is the smallest value for which multiplying θ large integers takes longer than an in-memory table lookup. Experiments show that when θ is 3, multiplying θ large integers takes longer than the in-memory table lookup, so θ is set to 3. The parameters used in the experiments are shown in Table 2.
TABLE 2 Experimental data parameters
Parameter name | Range of variation | Default value
Data type | Real data (62K), Gaussian distribution (100K), Uniform distribution (100K) | Uniform distribution (100K)
Grid division | 10*10, 20*20, 50*50, 100*100, 200*200, 400*400 | 100*100
Modulus k | 128, 256, 512, 1024 | 512
Coverage of range query | 1, 5, 10, 15 | 1
Query result calculation method | Equation 2.10, Equation 2.11 | Equation 2.10
Number of CPU cores | 1, 10, 20, 40, 60 | 60
Data compression experimental results:
TABLE 3 Data compression comparison for the range query algorithm
[Table image: BDA0001340582660000081]
In the data preprocessing stage the algorithm compresses the data in the grids with Huffman coding; Table 3 compares the range query data before and after compression. Table 3 shows that Huffman coding reduces the data size of the range query by nearly half, with a compression ratio close to 55%. A smaller data volume generally means that the server-side CPIR computation is reduced correspondingly, by about half, which shortens the server-side computation time.
TABLE 4 CPIR-V Algorithm data compression contrast
[Table 4 is reproduced as images in the original publication (Figures BDA0001340582660000082 and BDA0001340582660000091).]
Table 4 shows, for the CPIR-V algorithm, the size of the maximum value in the nearest neighbor matrix before and after the matrix data is compressed with Huffman coding in the data preprocessing stage. As Table 4 shows, Huffman coding reduces the size of the maximum value in the matrix by approximately 1/3. The CPIR-V algorithm finds the maximum element in the nearest neighbor matrix and pads the remaining elements to that size. If the nearest neighbor matrix is N × N and its maximum value is m, the server's big-number computation is m × N. After Huffman compression the maximum value in the matrix is (2/3)m, so the server's big-number computation becomes (2/3)m × N, and the server's overall computation is reduced by (1/3)m × N.
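The arithmetic behind this estimate can be checked with a short sketch. It assumes, as the paragraph above does, that the server's big-number cost is proportional to the maximum element size times N. The function name and the sample values m = 96 and N = 100 are hypothetical and are not figures from the experiments.

```python
from fractions import Fraction

def server_cost(max_bits, n):
    """Cost model from the text: every element is padded to max_bits,
    so the big-number work is taken as max_bits * n."""
    return max_bits * n

# Hypothetical values, chosen only so that (2/3) * m is an integer.
m, n = 96, 100

before = server_cost(m, n)                    # m * N
after = server_cost(Fraction(2, 3) * m, n)    # (2/3) * m * N
saving = before - after                       # (1/3) * m * N
```

Under this model the saving depends only on how much the largest element shrinks, since every other element is padded up to that size.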
Fig. 3 compares the server-side computation time, under different grid partitions, of the Spark-based parallel CPIR-V algorithm (PCPIR-V) and of the Row and Bit parallel strategies of the Spark- and Huffman-coding-based parallel CPIR-V algorithm (H-PCPIR-V). In the figure, H-PCPIR-V-R denotes the Row-based parallel strategy and H-PCPIR-V-B denotes the Bit-based parallel strategy. The server-side computation time of all three algorithms generally grows as the number of grid cells increases, because a finer grid partition yields a larger CPIR-V computation matrix, which requires more large-integer computation. The figure shows the server-side computation time on the three data sets under different grids. The H-PCPIR-V-R and H-PCPIR-V-B algorithms are generally faster on the server side than PCPIR-V, most visibly on the Gaussian and real data sets, because H-PCPIR-V-R and H-PCPIR-V-B compress the matrix data in the preprocessing stage, which reduces the number of bits involved in large-integer computation and therefore the server's computation time.
The gap is more pronounced on the Gaussian and real data sets than on the uniform data set because the matrix data in the uniform data set is more even, so compression has little effect. In the other two data sets the element sizes in the matrix vary widely, with a large difference between the maximum and minimum values, so compression changes the number of bits to be computed considerably, and the computation time with it.
Comparison experiments and analysis of the results show that, relative to the PCPIR-V algorithm, the H-PCPIR-V algorithm reduces the server-side computation cost by about 30%, the client-side computation cost by about 10%, and the communication cost by about 40%.
Those skilled in the art will appreciate that the foregoing embodiments are merely preferred embodiments of the invention. Modifications, variations, and equivalents made by those skilled in the art to parts of the invention remain within the spirit of the invention and are intended to fall within its scope.

Claims (6)

1. A CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding, characterized by comprising the following steps:
1) processing the file to obtain a grid, and reading the nearest neighbor matrix data of the grid in the file;
2) compressing the elements of the nearest neighbor matrix data with Huffman coding to reduce the number of bits of the elements;
3) storing the encoded nearest neighbor matrix data in the HBase database;
4) after receiving data query information from a client, the server reads the corresponding information from the HBase database according to the data query information and stores it in an RDD (Resilient Distributed Dataset) of the Spark parallel framework, groups the CPIR nearest neighbor matrices in the RDD according to a parallel strategy, performs CPIR parallel computation on Spark according to the query information, aggregates the computation results of each group, and then sends the query result and the character code lengths to the client;
5) the client parses the query result to obtain the value of the queried bits, and decompresses that value to obtain the query information;
wherein step 1), processing the file to obtain a grid and reading the nearest neighbor matrix data of the grid in the file, comprises:
constructing a Voronoi diagram from the points of interest of the spatial data in the file, partitioning the spatial data by the Voronoi diagram to obtain Voronoi cells, then applying a grid partition to the Voronoi cells, counting the number of potential nearest neighbors of each grid cell, and finally obtaining the nearest neighbor matrix of the grid;
wherein step 2) specifically comprises:
2.1, creating a one-dimensional integer array, reading the nearest neighbor matrix character by character, counting the frequency of each character and storing it in the array, and taking the total number of matrix elements as the frequency of the end character;
2.2, computing the frequency of each character and building a priority queue ordered by character frequency from smallest to largest;
2.3, building a Huffman tree from the priority queue, encoding the characters in the Huffman tree, and storing the code lengths in an array;
2.4, re-encoding each element of the nearest neighbor matrix, appending the end-character code after the element's code, storing the result in a coded linked list, and, once encoding is complete, counting the number of bits of each element and storing it in an array;
2.5, finding the maximum number of bits among the encoded elements;
2.6, padding every element in the coded linked list that has fewer bits up to the maximum number of bits;
wherein step 3) specifically comprises:
3.1, storing the compressed nearest neighbor matrix data in a two-dimensional byte array, where the first dimension is the total number of nearest neighbor matrix elements and the second dimension is the maximum number of bytes of an element;
3.2, designing the RowKey of the HBase database, using the reversed row number of each row of the nearest neighbor matrix as the HBase RowKey so that the encoded nearest neighbor matrix data is evenly distributed across the HRegionServers of HBase;
3.3, storing the characters in columns by column number, where the value stored under a column number is the grid element at that column of the row, and storing the compressed code length of each character in the database.
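Step 3.2 can be illustrated with a minimal sketch of a reversed-row-number key. The claim only states that the row number is reversed; the decimal string form, the fixed width of 6, and the function name are assumptions made for illustration, and no HBase client call is shown.

```python
def reversed_row_key(row_number, width=6):
    """Zero-pad the row number to a fixed width (an assumption),
    then reverse its digits so the fastest-changing digit comes first."""
    return str(row_number).zfill(width)[::-1]

# Consecutive row numbers get keys with different leading characters,
# so lexicographically sorted keys no longer cluster neighbouring rows.
keys = [reversed_row_key(r) for r in range(100, 104)]
```

Because a store that sorts keys lexicographically places similar prefixes together, moving the low-order digit to the front is one common way to spread consecutive rows across servers.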
2. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding as claimed in claim 1, wherein step 2.6 comprises first appending the end character to the element to be padded and then padding with zeros.
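Steps 2.1 to 2.6 of claim 1, together with the padding order of claim 2, can be sketched as follows. This is an illustrative sketch rather than the patented implementation: it assumes that matrix elements are non-negative integers read as decimal digit characters, and the end marker `$` and all function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

END = "$"  # hypothetical end-of-element character

def huffman_codes(freq):
    """Build prefix-free codes from a {char: frequency} map (steps 2.2, 2.3)."""
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {ch: ""}) for ch, f in freq.items()]
    heapq.heapify(heap)  # priority queue, smallest frequency first
    if len(heap) == 1:
        return {ch: "0" for ch in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in left.items()}
        merged.update({ch: "1" + c for ch, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encode_matrix(matrix):
    elements = [str(v) for row in matrix for v in row]
    freq = Counter(ch for e in elements for ch in e)       # step 2.1
    freq[END] = len(elements)   # one end character per matrix element
    codes = huffman_codes(freq)
    # Step 2.4: element code followed by the end-character code.
    encoded = ["".join(codes[ch] for ch in e) + codes[END] for e in elements]
    max_bits = max(len(b) for b in encoded)                # step 2.5
    # Step 2.6 / claim 2: end character first, then zeros up to max_bits.
    padded = [b.ljust(max_bits, "0") for b in encoded]
    return codes, padded, max_bits

def decode_element(bits, codes):
    """Read codes until the end character; trailing zero padding is ignored."""
    inverse = {c: ch for ch, c in codes.items()}
    digits, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            if inverse[current] == END:
                return int("".join(digits))
            digits.append(inverse[current])
            current = ""
    raise ValueError("end character not found")
```

The end character is what makes the zero padding harmless: a decoder stops at the first end code, so every element can be stored at the same bit width without ambiguity.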
3. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding as claimed in claim 1, wherein the client data query information is produced by:
computing the grid cell in which the query point lies from the position of the query point, generating the corresponding quadratic residue query for that grid cell, and finally sending the query, the grid partition size, and the selected parallel strategy to the server.
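The first part of this claim, locating the grid cell of a query point, can be sketched as below. The sketch assumes integer coordinates with the origin at (0, 0), a data range of 1046435 × 1929615 as given in the experiments, and a square grid such as the default 100*100. The function name and the clamping of boundary points to the last cell are illustrative choices.

```python
def grid_cell(x, y, x_range, y_range, divisions):
    """Return (row, col) of the cell containing (x, y) in a
    divisions x divisions grid over [0, x_range] x [0, y_range]."""
    col = min(x * divisions // x_range, divisions - 1)  # clamp the upper edge
    row = min(y * divisions // y_range, divisions - 1)
    return row, col

X_RANGE, Y_RANGE = 1046435, 1929615   # data range from the experiments
cell = grid_cell(523218, 964808, X_RANGE, Y_RANGE, 100)
```

Integer arithmetic is used throughout because the text states that x and y are of type int.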
4. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, wherein step 4) comprises:
receiving the query, the grid partition number, and the parallel strategy sent by the client; reading the corresponding CPIR nearest neighbor matrix, character code lengths, and maximum value from the HBase database according to the grid partition data and storing them in an RDD of Spark; grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client; after grouping is complete, performing CPIR parallel computation on Spark according to the query Q to obtain the computation results; aggregating the computation results of each group with Spark; and then sending the query result and the character code lengths to the client.
5. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding as claimed in claim 4, wherein grouping the CPIR nearest neighbor matrices in the RDD according to the parallel strategy sent by the client comprises Row-level grouping and Bit-level grouping:
the Row-level grouping groups the CPIR matrix by rows; the Bit-level grouping first obtains the number k of CPUs currently allocated by the cluster and then groups the data of each row of the CPIR matrix according to the number of CPUs.
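The two grouping strategies can be sketched in plain Python, without Spark, to show how the work units differ. The function names and the keyed-tuple representation are assumptions, and splitting each row into at most k contiguous chunks is one possible reading of grouping a row "according to the number of CPUs".

```python
def group_by_row(matrix):
    """Row-level grouping: one work unit per matrix row, keyed by row index."""
    return [(i, row) for i, row in enumerate(matrix)]

def group_by_bit(matrix, k):
    """Bit-level grouping: split each row into at most k contiguous chunks,
    keyed by (row index, chunk index), where k is the number of CPUs."""
    groups = []
    for i, row in enumerate(matrix):
        size = -(-len(row) // k)  # ceiling division
        for g in range(k):
            chunk = row[g * size:(g + 1) * size]
            if chunk:  # skip empty chunks when k exceeds the row length
                groups.append(((i, g), chunk))
    return groups

m = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
row_groups = group_by_row(m)       # 2 work units
bit_groups = group_by_bit(m, 2)    # 4 work units
```

With Row-level grouping the degree of parallelism is bounded by the number of rows, whereas Bit-level grouping yields up to k units per row, at the price of an extra aggregation step to recombine the partial results of each row.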
6. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, wherein step 5) comprises: receiving the result and the character code lengths returned by the server, performing the quadratic residue computation on the result to obtain the value of the queried bits, and decompressing that value to obtain the correct query result.
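The quadratic residue query and its decoding can be illustrated with a toy round trip. The sketch assumes the standard quadratic-residuosity construction for computational PIR, which this section names but does not spell out. The tiny primes 1009 and 1013, the 3 × 3 bit matrix, and all function names are hypothetical, and a real deployment would use a modulus of the sizes listed in Table 2.

```python
import math
import random

def is_qr(a, prime):
    """Euler's criterion: a is a quadratic residue modulo an odd prime."""
    return pow(a, (prime - 1) // 2, prime) == 1

def make_query(n_cols, target, p, q, rng):
    """Client: one value per column, a non-residue only at the target column."""
    n = p * q
    query = []
    for j in range(n_cols):
        while True:
            x = rng.randrange(2, n)
            if math.gcd(x, n) != 1:
                continue
            if j != target:
                query.append(x * x % n)   # a square is always a residue
                break
            if not is_qr(x % p, p) and not is_qr(x % q, q):
                query.append(x)           # non-residue modulo both primes
                break
    return query

def server_answer(bit_matrix, query, n):
    """Server: per row, multiply y (bit 1) or y^2 (bit 0) over all columns."""
    answers = []
    for row in bit_matrix:
        z = 1
        for bit, y in zip(row, query):
            z = z * (y if bit else y * y) % n
        answers.append(z)
    return answers

def decode(answers, p, q):
    """Client: a row's answer is a residue exactly when its target bit is 0."""
    return [0 if is_qr(z % p, p) and is_qr(z % q, q) else 1 for z in answers]

p, q = 1009, 1013                 # toy primes, far too small for real use
bits = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
rng = random.Random(0)
query = make_query(3, 1, p, q, rng)
column = decode(server_answer(bits, query, p * q), p, q)
```

The server sees only numbers that it cannot classify as residues or non-residues without the factors p and q, so it learns nothing about which column was requested. Each answer costs one modular multiplication per matrix bit, which is why shortening the padded element length reduces the server's work.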
CN201710536073.6A 2017-07-04 2017-07-04 Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method Active CN107291935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710536073.6A CN107291935B (en) 2017-07-04 2017-07-04 Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method

Publications (2)

Publication Number Publication Date
CN107291935A CN107291935A (en) 2017-10-24
CN107291935B true CN107291935B (en) 2020-09-29

Family

ID=60098630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710536073.6A Active CN107291935B (en) 2017-07-04 2017-07-04 Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method

Country Status (1)

Country Link
CN (1) CN107291935B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant