CN107291935B

CN107291935B - Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method

Info

Publication number: CN107291935B
Application number: CN201710536073.6A
Authority: CN
Inventors: 王波涛; 王国仁; 陈月梅; 李昂; 岳春成
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2017-07-04
Filing date: 2017-07-04
Publication date: 2020-09-29
Anticipated expiration: 2037-07-04
Also published as: CN107291935A

Abstract

The invention discloses a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding. The method compresses the data of a nearest neighbor matrix using Huffman coding to reduce the number of bits of the data in each grid, then stores the compressed data, the character code lengths and the maximum element value in the spatial database HBase. The server side then reads the data from the HBase database and caches it in an RDD of the Spark parallel framework. The CPIR nearest neighbor matrices in the RDD are grouped according to a parallel strategy; after grouping, the Spark server side performs the CPIR calculation in parallel according to the query information, aggregates the calculation results of each group, and sends the query result and the character code lengths to the client. The client parses the query result to obtain the value of the queried bits and decompresses that value to obtain the query information. In a big data application scenario, the privacy protection query algorithm based on Spark parallelization and Huffman coding protects the user's query privacy and improves query efficiency while preserving the original query effect.

Description

Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method

Technical Field

The invention relates to the technical field of communication networks, in particular to a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding.

Background

With the continuous development of mobile devices, the emergence of various positioning and communication technologies, and the popularization of mobile terminals, mobile applications represented by Location Based Services (LBS) have entered the mobile big data era. Processing the growing volume of data using only the computing power of the existing PC and server architecture is no longer satisfactory, yet improving computing power by upgrading hardware wastes a large amount of money and material resources and does not provide effective horizontal scalability or maintainability. Considerable research has therefore been conducted on cost saving, horizontal scalability and maintainability, and Google first proposed the concept of "Cloud Computing" at the Search Engine Strategies conference (SES San Jose 2006). Cloud computing is a type of parallel computing that distributes computation over a large number of distributed computers rather than local computers or remote servers, so that it operates more like the Internet. This enables an enterprise to switch resources to the applications that need them and to access computers and storage systems on demand. Cloud computing is yet another major change following the transition from mainframe computers to client-server systems in the 1980s.

The cloud platform is well suited to processing mobile big data. Moving traditional LBS applications and LBS privacy protection technology to the cloud platform is a development trend for both and has become one of the current research hotspots. In the big data era, potential information can be obtained by analyzing, summarizing and mining big data, and this information can bring enterprises and merchants great benefits, such as adjusting market policies, reducing and avoiding risks, and making rational decisions in response to market changes. However, as big data mining technology continues to emerge and mature, mining potential information carries the risk of leaking personal privacy, which seriously threatens personal information security, the business secrets of enterprises and the security of countries. With the development and popularization of big data applications, personal privacy protection has become both important and a serious challenge.

At present, privacy protection research is mainly divided into three categories: generalization-based privacy protection techniques, encryption-based privacy protection techniques, and interference-based privacy protection techniques. Encryption-based techniques are mainly represented by Computational Private Information Retrieval (CPIR). CPIR is based on the hardness of the quadratic residuosity problem: in modular arithmetic with a large composite modulus (typically 1024 bits), distinguishing quadratic residues from quadratic non-residues is computationally hard. The CPIR algorithm greatly reduces communication complexity and provides the strongest degree of privacy protection, but at the cost of higher computational complexity. LBS privacy protection involves a large number of computation and complex transformation operations, and the CPIR algorithm must scan the entire data space during computation, so it requires a large amount of computation and a long computation time. As a result, the computing capability of conventional computing platforms cannot meet existing requirements.

Disclosure of Invention

Aiming at the problems, the invention aims to provide a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding, so that the CPIR calculation cost is reduced, and the performance is further improved.

In order to solve the problems existing in the background technology, the technical scheme of the invention is as follows:

a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding comprises the following steps:

1) processing the file to obtain a grid, and reading nearest neighbor matrix data of the grid in the file;

2) compressing elements in the nearest neighbor matrix data by using Huffman coding to reduce Bit digits of the elements;

3) storing the encoded nearest neighbor matrix data into a spatial database HBase;

4) after receiving the client data query information, the server side reads the corresponding data from the HBase database according to the data query information and stores it in an RDD (Resilient Distributed Dataset) of the Spark parallel framework, groups the CPIR nearest neighbor matrices in the RDD according to a parallel strategy, performs the CPIR calculation in parallel on Spark according to the query information, aggregates the calculation results of each group, and then sends the query result and the character code lengths to the client;

5) the client parses the query result to obtain the value of the queried bits, and decompresses that value to obtain the query information.

In step 1), processing the file to obtain a grid and reading the nearest neighbor matrix data of the grid in the file comprises:

the method comprises the steps of dividing a Voronoi diagram according to interest points of spatial data in a file, then dividing the spatial data through the Voronoi diagram to obtain Voronoi grids, then carrying out grid division on the Voronoi grids, counting the number of potential nearest neighbors of a grid, and finally obtaining a nearest neighbor matrix of the grid.

The step 2) specifically comprises the following steps:

2.1, creating a one-dimensional integer array, reading the nearest neighbor matrix character by character, counting the frequency count of each character and storing it in the array, and taking the frequency count of the end character to be the total number of matrix elements;

2.2, calculating the frequency of each character, and constructing a priority queue ordered by character frequency from small to large;

2.3, constructing a Huffman tree from the priority queue, coding the characters in the Huffman tree and storing the code lengths in an array;

2.4, re-encoding each element in the nearest neighbor matrix, appending the end character code after the element code, storing the result in a code linked list, and, once coding is finished, counting the number of bits of each element and storing it in an array;

2.5, finding the maximum number of bits among the coded elements;

2.6, padding each element in the code linked list that falls short of the maximum number of bits up to that maximum.

In step 2.6, the end character is first appended to the element to be padded, and the remaining bits are then all filled with zeros.
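Steps 2.1 to 2.6 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the end marker `$`, the string representation of matrix elements, and the function names `build_codes`, `encode_matrix` and `decode_element` are all assumptions made for illustration, and bit strings are kept as text rather than packed bytes.

```python
import heapq
from collections import Counter
from itertools import count

END = "$"  # hypothetical end-of-element marker; the patent does not name the character


def build_codes(elements):
    """Steps 2.1-2.3: count character frequencies and derive a Huffman code table."""
    freq = Counter(ch for e in elements for ch in e)
    freq[END] = len(elements)  # the end character occurs once per matrix element
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left subtree, right subtree)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a single character
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


def encode_matrix(elements, codes):
    """Steps 2.4-2.6: encode, append the end code, then zero-pad to the maximum width."""
    encoded = ["".join(codes[ch] for ch in e) + codes[END] for e in elements]
    width = max(len(bits) for bits in encoded)
    return [bits.ljust(width, "0") for bits in encoded], width


def decode_element(bits, codes):
    """Read codes until the end character; the zero padding after it is ignored."""
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            if inverse[current] == END:
                return "".join(out)
            out.append(inverse[current])
            current = ""
    raise ValueError("end character not found")


elements = ["12", "1", "123", "2"]  # toy matrix elements, flattened
codes = build_codes(elements)
padded, width = encode_matrix(elements, codes)
print(width, padded)
```

Because the end character code closes every element, the zero padding added in step 2.6 never needs to be distinguished from data: decoding simply stops at the first end code.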

The step 3) specifically comprises the following steps:

3.1, storing the compressed nearest neighbor matrix data in a two-dimensional byte array, wherein the first dimension represents the total number of nearest neighbor matrix elements and the second dimension represents the maximum number of bytes of an element;

3.2, designing the RowKey of the HBase database, taking the reverse order of the row number of each row of the nearest neighbor matrix as the HBase RowKey so that the encoded nearest neighbor matrix data is uniformly distributed over the HRegionServers of HBase;

3.3, storing the columns according to the column number, wherein the value of a column is the element in the grid corresponding to that column number in each row, and storing the character code lengths after compression in the database.
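The RowKey design in step 3.2 can be sketched as follows. This is a minimal illustration, assuming that "reverse order of the row number" means reversing the decimal digits of a zero-padded row number; the padding width and the function name `row_key` are hypothetical, and no HBase API is used.

```python
def row_key(row_number, width=6):
    """Reverse the zero-padded decimal row number, e.g. 12 -> '000012' -> '210000'."""
    return str(row_number).zfill(width)[::-1]


# Consecutive rows now start with different characters, so a key-range
# partitioned store would spread them over regions instead of one hot region.
for r in range(10, 14):
    print(r, row_key(r))
```

With plain row numbers, neighboring rows share a long common prefix and tend to land on the same region server; reversing the digits moves the fastest-changing digit to the front of the key.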

The client data query information includes:

calculating the grid where the query point is located according to the position of the query point, generating the corresponding quadratic residue query according to that grid, and finally sending the query, the grid division size and the selected parallel strategy to the server.

The step 4) comprises the following steps:

receiving the query, the grid division number and the parallel strategy sent by the client; reading the corresponding CPIR nearest neighbor matrix, character code lengths and maximum value from the HBase database according to the grid division data and storing them in the Spark RDD; grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client; after grouping is completed, performing the CPIR calculation in parallel on Spark according to the query Q; and finally aggregating the calculation results of each group in Spark and sending the query result and the character code lengths to the client.

Grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client comprises Row-level grouping and Bit-level grouping:

Row-level grouping groups the CPIR matrix by rows; Bit-level grouping first obtains the number k of CPUs currently allocated by the cluster and then groups the data of each row of the CPIR matrix according to the number of CPUs.

The step 5) comprises: receiving the result and the character code lengths returned by the server, performing the quadratic residue calculation on the result to obtain the value of the queried bits, and decompressing that value to obtain the correct query result.

Compared with the prior art, the invention has the beneficial effects that:

the invention provides a privacy protection query method for nearest neighbor CPIR-V based on Spark and Huffman coding, which reduces the calculation amount of CPIR by compressing data by using Huffman coding and reducing the data amount, reduces the calculation time by adopting Spark frame to perform parallel calculation when a server side performs CPIR calculation, and solves the problem of long calculation time. The privacy protection query algorithm based on Spark parallelization and Huffman coding ensures that the query privacy of a user is protected and the query efficiency is improved under the original query effect in a big data application scene.

Drawings

FIG. 1 is a flow chart of a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to the present invention;

FIG. 2 is a schematic view of the meshing of the Voronoi diagram of the present invention;

FIG. 3 shows the average server computation time under different grid divisions according to the present invention, wherein (a) is the average server computation time for the uniformly distributed data set, (b) for the Gaussian distributed data set, and (c) for the real data set.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the present invention provides a CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding, which includes:

the method comprises the steps of dividing a Voronoi diagram according to interest points of spatial data in a file, then dividing the spatial data through the Voronoi diagram to obtain Voronoi grids, then carrying out grid division on the Voronoi grids, counting the number of potential nearest neighbors of a grid, and finally obtaining a nearest neighbor matrix of the grid. The Voronoi diagram reflects the topological relation of the neighbors between space objects through space division, each polygon in the diagram is called a Voronoi grid, and the edges of the Voronoi grid are perpendicular bisectors of the adjacent space objects.

When the Voronoi grids are divided into a regular grid, attention must be paid to the grid size. If the grid division is too small, the nearest neighbor matrix contains too much data and the client incurs a large computation cost when parsing the query result; if the grid division is too large, the nearest neighbor matrix contains too little data and compression has little effect. The grid therefore needs to be divided reasonably according to the density of the interest points. Illustratively, the entire space is divided into G grids, each of which either intersects Voronoi grids or is contained in a Voronoi grid. As shown in FIG. 2, where p denotes an interest point and q denotes a query point, the entire space is divided into 5 × 5 grids: grid (1,1) intersects the Voronoi grids of interest points p₁ and p₂, and grid (2,1) is contained in the Voronoi grid of interest point p₁. In FIG. 2, the grid where query point q is located intersects the Voronoi grids of interest points p₁ and p₂, so the nearest neighbor of query point q may be p₁ or p₂; the set {p₁, p₂} formed by p₁ and p₂ is therefore called the potential nearest neighbor set of that grid. It can be seen that the grids do not necessarily have the same number of potential nearest neighbors; the number depends on the distribution of the points and the size of the grid. The CPIR-V algorithm converts the nearest neighbor relation of each grid in FIG. 2 into a nearest neighbor storage matrix in which the potential nearest neighbors of each grid are stored; note that every element in the storage matrix has the same size. The CPIR-V algorithm first finds the maximum number p_max of potential nearest neighbors over all grids and then pads the grids that have fewer than the maximum with default values. Here the maximum number of potential nearest neighbors of a grid is 3, so the other grids that fall short are padded. The potential nearest neighbor relation of each grid is stored in a matrix, and this matrix is the basis and core of the CPIR-V nearest neighbor query algorithm.
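The padding to p_max described above can be sketched in a few lines of Python. This is an illustrative sketch only: the nested-list layout, the default padding value 0 (which assumes interest point identifiers start at 1) and the function name `build_nn_matrix` are assumptions, not details given in the patent.

```python
def build_nn_matrix(potential_nn, pad=0):
    """potential_nn[r][c] lists the interest point ids of grid (r, c); pad each to p_max."""
    p_max = max(len(cell) for row in potential_nn for cell in row)
    matrix = [[cell + [pad] * (p_max - len(cell)) for cell in row]
              for row in potential_nn]
    return matrix, p_max


# A 2 x 2 toy grid: one cell has three potential nearest neighbors, so p_max = 3.
cells = [[[1, 2], [1]],
         [[1, 2, 3], [2]]]
matrix, p_max = build_nn_matrix(cells)
print(p_max, matrix)
```

Making every element the same size is what lets the server process the matrix uniformly without revealing which grid the client is interested in.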

Huffman coding is a Variable Length Coding (VLC) method for lossless compression. It constructs a unique code for each character according to the probability with which the character appears in the file to be coded, based on an optimal binary tree, and guarantees that the average code length is the shortest among such variable-length codes; it is therefore sometimes called an optimal code. Because Huffman coding is variable-length, characters with a higher occurrence probability receive shorter codes and characters with a lower occurrence probability receive longer codes, so that the total code length of all characters is smaller than the original coding length.

The method specifically comprises the following steps:

2.6, padding each element in the code linked list that falls short of the maximum number of bits up to that maximum. Specifically, the end character is first appended to the element to be padded, and the remaining bits are then all filled with zeros.

HBase is an open-source, distributed, column-oriented key-value database, and is a highly reliable, high-performance, column-oriented and scalable distributed storage system. It uses Hadoop HDFS as its file storage system and is modeled on Google's Bigtable database, which is built on the Google File System, providing the corresponding functions. With HBase technology, a large-scale structured storage cluster can be built on inexpensive PC servers. Since HBase is a key-value database, it is well suited to storing unstructured data.

The nearest neighbor matrix is stored in the spatial database HBase. The process of compressing the nearest neighbor matrix and storing it in the HBase database is as follows. The data is distributed uniformly over the HRegionServers of HBase through the design of the HBase RowKey. The main design principle is to use the reverse order of the row number of each row of the nearest neighbor matrix as the HBase RowKey, so that no HRegionServer stores an excessive amount of data and every HRegionServer stores some data. The columns are stored according to the column number, the value of a column being the element in the grid corresponding to that column number in each row. Finally, the character code lengths after compression are stored in the database.

The method specifically comprises the following steps:

The specific storage format is shown in table 1.

TABLE 1 H-PCPIR-V information table structure

The client data query information includes: calculating the grid where the query point is located according to the position of the query point, generating the corresponding quadratic residue query according to that grid, and finally sending the query, the grid division size and the selected parallel strategy to the server. The specific steps are as follows:

1: calculating the grid G_{a,b} where the query point is located according to the position of the query point;

2: generating the query Q = (y₁, y₂, …, y_{g_x}), where subscript b corresponds to y_b ∈ QNR and the remaining subscripts correspond to y_i ∈ QR;

3: sending the query Q, the grid division numbers g_x and g_y, and the parallel strategy to the server;

4: waiting for the server side to return a query result;
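The client steps above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the patent's code: the primes 1009 and 1013 stand in for the large secret primes of a real modulus (typically 1024 bits), indices are zero-based rather than one-based, the mapping of a to rows and b to columns is assumed, and the names `locate_grid`, `is_qr` and `make_query` are hypothetical.

```python
import math
import random


def locate_grid(x, y, x_max, y_max, g_x, g_y):
    """Return (a, b): zero-based row and column of the grid cell containing (x, y)."""
    a = min(int(y * g_y / y_max), g_y - 1)
    b = min(int(x * g_x / x_max), g_x - 1)
    return a, b


def is_qr(v, p, q):
    """True if v is a quadratic residue modulo both primes (Euler's criterion)."""
    return pow(v, (p - 1) // 2, p) == 1 and pow(v, (q - 1) // 2, q) == 1


def make_query(b, g_x, p, q, rng=random):
    """Build Q = (y_0 .. y_{g_x - 1}): a non-residue at column b, residues elsewhere."""
    n = p * q
    query = []
    for i in range(g_x):
        while True:
            r = rng.randrange(2, n)
            if math.gcd(r, n) != 1:
                continue
            if i == b:
                # Non-residue modulo both primes: Jacobi symbol +1, yet not a square mod n.
                if pow(r, (p - 1) // 2, p) == p - 1 and pow(r, (q - 1) // 2, q) == q - 1:
                    query.append(r)
                    break
            else:
                query.append(r * r % n)  # squaring always yields a quadratic residue
                break
    return n, query


a, b = locate_grid(25, 75, 100, 100, 5, 5)
n, Q = make_query(b, 5, 1009, 1013)
print((a, b), n, Q)
```

Only the client knows p and q, so only the client can tell which entry of Q is the non-residue; to the server all entries look alike under the quadratic residuosity assumption.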

The method specifically comprises the following steps:

1: the server acquires the CPIR matrix data, the character code lengths and the maximum value according to the grid division numbers g_x and g_y and caches them in the RDD;

2: if the parallel strategy is Row, grouping CPIR matrix data in the RDD according to rows;

3: if the parallel strategy is Bit, acquiring the number k of CPUs allocated to the current cluster, and dividing each row of data into k groups;

4: if the parallel strategy is not matched, the default is Row, and CPIR matrix data in the RDD is grouped according to the rows;

5: Spark performs the CPIR calculation on each group according to the query Q to obtain the query result Z;

6: Spark aggregates the result Z, and then sends the query result and the character code lengths to the client;
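The server steps above can be sketched in plain Python. This sketch makes several assumptions: it uses the standard quadratic-residue CPIR answer computation (multiply y_j where the stored bit is 1 and y_j squared where it is 0), it interprets Bit-level grouping as splitting the bit positions of each row into k groups, and it replaces Spark tasks with an ordinary loop, so no Spark API appears. The names `cpir_answer` and `server_answer` are hypothetical.

```python
def cpir_answer(row_bits, query, n):
    """One CPIR answer: multiply y_j where the bit is 1 and y_j squared where it is 0."""
    z = 1
    for bit, y in zip(row_bits, query):
        z = z * (y if bit else y * y) % n
    return z


def server_answer(matrix, query, n, strategy="Row", cpus=4):
    """matrix[r][j] is the padded bit string of grid (r, j); returns one answer per bit."""
    width = len(matrix[0][0])
    if strategy == "Bit":
        # Split the bit positions of every row into `cpus` groups.
        tasks = [(r, list(range(g, width, cpus)))
                 for r in range(len(matrix)) for g in range(cpus)]
    else:
        # "Row", and also the default when the strategy is not recognised.
        tasks = [(r, list(range(width))) for r in range(len(matrix))]
    result = [[None] * width for _ in matrix]
    for r, positions in tasks:  # stand-in for tasks that Spark would run in parallel
        for k in positions:
            bits = [int(cell[k]) for cell in matrix[r]]
            result[r][k] = cpir_answer(bits, query, n)  # aggregation step
    return result


matrix = [["101", "010", "110", "001"],
          ["000", "111", "011", "100"]]
n = 1009 * 1013          # toy modulus; a real one is far larger
query = [4, 9, 16, 25]   # placeholder values, not a valid private query
print(server_answer(matrix, query, n, "Bit", cpus=2))
```

Both strategies produce the same answers; they differ only in how the work is cut into independent tasks, which is why the server can choose the grouping without affecting what the client decodes.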

5) the client parses the query result to obtain the value of the queried bits, and decompresses that value to obtain the query information. Specifically, the client receives the result and the character code lengths returned by the server, performs the quadratic residue calculation on the result to obtain the value of the queried bits, and decompresses that value to obtain the correct query result.
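The quadratic residue calculation on the client can be sketched as below, assuming the standard decoding rule of quadratic-residue CPIR: an answer z is a quadratic residue exactly when the queried bit is 0. The function name `recover_bits` and the toy primes are assumptions; the subsequent Huffman decompression of the recovered bit string is not shown.

```python
def recover_bits(answers, p, q):
    """Map each server answer z to a bit: residue -> '0', non-residue -> '1'."""
    bits = []
    for z in answers:
        # Euler's criterion modulo each secret prime factor of the modulus.
        is_residue = pow(z, (p - 1) // 2, p) == 1 and pow(z, (q - 1) // 2, q) == 1
        bits.append("0" if is_residue else "1")
    return "".join(bits)


p, q = 1009, 1013  # toy primes; only the client knows the factorisation
print(recover_bits([4, 9, 25], p, q))  # prints 000: perfect squares are residues
```

The recovered bit string is the zero-padded Huffman code of the requested matrix element, so the client then decodes it up to the end character using the code table derived from the character code lengths.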

Comparison of experimental results:

The cloud computing service platform used is IBM System x3650 M4, with 5 cluster nodes; the detailed configuration of each node is as follows:

CPU: 2 × Xeon E5-2620 (6 cores each, 2 threads per core);

Memory: 32 GB;

Hard disk: 5 TB, 10000 rpm, RAID 5;

Operating system: CentOS 6.4;

Development tools: GNU Toolkits (G++, GDB), Make, Vim, JDK, etc.

The development languages used for the experiments are standard C++, Java and Scala.

Three data sets of two kinds are used: synthetic data sets and a real data set. The experimental results are analyzed to evaluate the query performance of the algorithm. The real data set is Sequoia, drawn from California. The synthetic data sets follow a uniform distribution and a Gaussian distribution, where the Gaussian data set follows (X, Y) ~ N(1, 1, 0, 0, 1). The data range is 1046435 × 1929615; note that the abscissa x and the ordinate y are of type int. The big-number library used in C++ is GMP, and in Java the BigInteger class of the JDK itself is used. Note the rule for choosing the threshold θ involved in the quadratic residue calculation: θ is the smallest number of large integers whose multiplication takes longer than an in-memory table lookup. Experiments showed that when θ is 3 the time consumed by multiplying θ large integers exceeds the time of an in-memory table lookup, so θ is set to 3. The parameters used in the experiments are shown in Table 2.

TABLE 2 Experimental data parameters

Parameter name	Range of variation	Default value
Data type	Real data (62K), Gaussian distribution (100K), uniform distribution (100K)	Uniform distribution (100K)
Mesh partitioning	10*10, 20*20, 50*50, 100*100, 200*200, 400*400	100*100
Modulus k	128, 256, 512, 1024	512
Coverage of range query	1, 5, 10, 15	1
Query result calculation method	Equation 2.10, equation 2.11	Equation 2.10
Number of CPU cores	1, 10, 20, 40, 60	60

Data compression experimental results:

TABLE 3 Range query Algorithm data compression contrast

In the data preprocessing stage the algorithm compresses the data in the grids using Huffman coding; the comparison of the range query data before and after compression is shown in Table 3. Table 3 shows that the data size of the range query is reduced by nearly half after Huffman coding compression, with a compression ratio close to 55%. The reduction in data volume generally means that the CPIR calculation amount on the server side is also reduced by about half, which reduces the server-side calculation time.

TABLE 4 CPIR-V Algorithm data compression contrast

Table 4 compares, for the CPIR-V algorithm, the size of the maximum value in the nearest neighbor matrix before compression with its size after the data is compressed by Huffman coding in the data preprocessing stage. As can be seen from Table 4, Huffman coding compression reduces the size of the maximum value in the matrix by approximately 1/3. The CPIR-V algorithm finds the maximum size of the elements in the nearest neighbor matrix and then pads the remaining elements to it. If the nearest neighbor matrix is N × N and the maximum size is m, the big-number calculation amount on the server side is m × N; after Huffman coding compression the maximum size in the matrix is (2/3)m, so the big-number calculation amount on the server side becomes (2/3) m × N, an overall reduction of (1/3) m × N.

FIG. 3 compares the server-side computation time under different grid divisions for the Spark-based parallel CPIR-V algorithm (PCPIR-V) and for the Row and Bit parallel strategies of the parallel CPIR-V algorithm based on Spark and Huffman coding (H-PCPIR-V). In the figure, H-PCPIR-V-R denotes the Row-based parallel strategy and H-PCPIR-V-B denotes the Bit-based parallel strategy. As the figure shows, the server-side computation time of all three algorithms generally grows as the grid division number increases, because a larger grid division number yields a larger CPIR-V computation matrix, which requires more large-integer computation. The figure shows the server-side computation time for the three data sets under different grids. The computation time of H-PCPIR-V-R and H-PCPIR-V-B on the server side is generally shorter than that of PCPIR-V, and the gap is more pronounced on the Gaussian distributed data set and the real data set. This is because H-PCPIR-V-R and H-PCPIR-V-B compress the data in the matrix in the data preprocessing stage, which reduces the number of bits on which large-integer computation is performed and thus the server-side computation time. The gap is more pronounced on the Gaussian and real data sets than on the uniformly distributed data set because the matrix data of the uniformly distributed data set is more even, so compression has little effect, whereas in the other two data sets the element sizes in the matrix vary and the maximum and minimum differ greatly, so the number of bits to be computed changes considerably after compression, which affects the result.

Through comparison experiments and experimental result analysis, the calculation cost of the H-PCPIR-V algorithm at the server side is reduced by about 30% compared with that of the PCPIR-V algorithm, the calculation cost of the client side is reduced by about 10%, and the communication cost is reduced by about 40%.

It will be appreciated by those skilled in the art that the foregoing embodiments are merely preferred embodiments of the invention, and thus, modifications, variations and equivalents of the parts of the invention may be made by those skilled in the art, which are still within the spirit of the invention and which are intended to be within the scope of the invention.

Claims

1. A CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding is characterized by comprising the following steps:

5) the client analyzes the query result to obtain the value of the query bit, and decompresses the value of the query bit to obtain query information;

dividing a Voronoi diagram according to interest points of spatial data in a file, then dividing the spatial data through the Voronoi diagram to obtain Voronoi grids, then carrying out grid division on the Voronoi grids, counting the number of potential nearest neighbors of the grids, and finally obtaining a nearest neighbor matrix of the grids;

the step 2) specifically comprises the following steps:

2.6, complementing each element in the coding chain table with insufficient Bit digits according to the maximum Bit digit;

the step 3) specifically comprises the following steps:

2. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding as claimed in claim 1, wherein said step 2.6 comprises first appending the end character to the element to be padded and then filling all remaining bits with zeros.

3. The CPIR-V nearest neighbor privacy preserving query method based on Spark and Huffman coding as claimed in claim 1, wherein the client data query information comprises:

4. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, wherein the step 4) comprises:

5. The CPIR-V nearest neighbor privacy protection lookup method based on Spark and Huffman coding as claimed in claim 4, wherein the grouping the CPIR nearest neighbor matrices in the RDD according to the parallel policy sent by the client comprises Row-level based grouping and Bit-level based grouping:

6. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, wherein the step 5) comprises: receiving the result and the character code lengths returned by the server, performing the quadratic residue calculation on the result to obtain the value of the queried bits, and decompressing that value to obtain the correct query result.