CN107291935A - CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding - Google Patents

CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding

Info

Publication number
CN107291935A
Authority
CN
China
Prior art keywords
nearest neighbor
cpir
data
spark
huffman
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710536073.6A
Other languages
Chinese (zh)
Other versions
CN107291935B (en)
Inventor
王波涛
王国仁
陈月梅
李昂
岳春成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201710536073.6A
Publication of CN107291935A
Application granted
Publication of CN107291935B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention discloses a CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding. The data of the nearest-neighbor matrix is compressed with Huffman coding, which reduces the number of bits of the data in each grid cell. The compressed data, the character code lengths, and the element maximum are then stored in the spatial database HBase. The server reads the data from the HBase database, caches it in an RDD of the Spark parallel framework, and groups the CPIR nearest-neighbor matrix in the RDD according to a parallel strategy. After grouping, the Spark server performs the CPIR computation in parallel according to the query information, aggregates the results of the groups, and sends the query result and the character code lengths to the client. The client parses the query result to obtain the values of the queried bits and decompresses those values to obtain the query information. Being based on Spark parallelization and Huffman coding, the privacy-preserving query algorithm of the invention protects the user's query privacy in big-data application scenarios and improves query efficiency while preserving the original query quality.

Description

CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding
Technical field
The present invention relates to the technical field of communication networks, and more particularly to a CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding.
Background
With the continuous development of mobile devices, the emergence of various positioning methods and communication modes, and the spread of mobile terminals, mobile applications represented by location-based services (LBS) have entered the era of mobile big data. The computing power of existing PC and server architectures alone cannot handle the ever-growing data volume, while raising computing power by upgrading hardware wastes considerable money and materials and still does not deliver effective horizontal scalability or maintainability. Much research has therefore been devoted to saving cost and improving horizontal scalability and maintainability, and Google first proposed the concept of "cloud computing" at the Search Engine Strategies conference (SES San Jose 2006). Cloud computing is a form of parallel computing: computation is distributed over a large number of distributed computers rather than performed on a local computer or a remote server, so that the operation of an enterprise data center becomes more like that of the Internet. This lets an enterprise switch resources to the applications that need them and access computers and storage systems on demand. Cloud computing is another major change following the shift from mainframe computers to the client-server model in the 1980s.
Cloud platforms are well suited to processing mobile big data. Moving traditional LBS applications and LBS privacy-protection techniques onto cloud platforms is the development trend of both technologies and has become one of the current research focuses. In the big-data era, analyzing, summarizing, and mining big data yields latent information that can bring enterprises and merchants large returns, for example by adjusting marketing policy, reducing and avoiding risk, and making rational decisions in the face of market changes. However, as big-data mining techniques keep emerging and maturing, mining latent information may also leak individual privacy and thereby seriously threaten personal information security, enterprise trade secrets, national security secrets, and so on. As big-data applications develop and spread, the protection of personal privacy becomes particularly important and a serious challenge.
Current privacy-protection research falls into three main classes: techniques based on generalization, techniques based on encryption, and techniques based on perturbation. The main representative of the encryption-based class is computational private information retrieval (CPIR). CPIR rests on the hardness of the quadratic residuosity problem: in modular arithmetic with a large composite modulus (typically 1024 bits), distinguishing quadratic residues from quadratic non-residues is computationally hard. CPIR algorithms greatly reduce communication complexity and guarantee the strongest degree of privacy protection, but at the price of higher computational complexity. LBS privacy protection, moreover, involves a large number of computations and complex map operations, and a CPIR algorithm has to scan the entire data space during computation, so the computation is heavy and slow. As a result, the computing power of traditional computing platforms cannot meet existing demands.
Summary of the invention
In view of the above problems, the object of the invention is to provide a CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding that reduces the computational cost of CPIR and further improves performance.
To solve the problems described in the background, the technical solution of the invention is as follows:
A CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding, comprising:
1) processing a file to obtain a grid, and reading the nearest-neighbor matrix data of the grid in the file;
2) compressing the elements of the nearest-neighbor matrix data with Huffman coding to reduce the number of bits of each element;
3) storing the encoded nearest-neighbor matrix data in the spatial database HBase;
4) after the client's query information is received, the server reads the corresponding data from the HBase database according to the query information and stores it in an RDD of the Spark parallel framework, and groups the CPIR nearest-neighbor matrix in the RDD according to the parallel strategy; Spark performs the CPIR computation in parallel according to the query information, aggregates the results of the groups, and then sends the query result and the character code lengths to the client;
5) the client parses the query result to obtain the values of the queried bits and decompresses those values to obtain the query information.
In step 1), processing the file to obtain the grid and reading the nearest-neighbor matrix data of the grid comprises:
constructing a Voronoi diagram from the points of interest of the spatial data in the file, dividing the spatial data by the Voronoi diagram to obtain Voronoi cells, partitioning the Voronoi cells with a grid, counting the potential nearest neighbors of each grid cell, and finally obtaining the nearest-neighbor matrix of the grid.
Step 2) specifically comprises:
2.1 creating a one-dimensional integer array, reading the nearest-neighbor matrix character by character, counting the frequency of each character, and storing the frequencies in the array, the frequency of the terminator character being the total number of matrix elements;
2.2 computing the frequency of each character and building a priority queue ordered by character frequency from smallest to largest;
2.3 building a Huffman tree from the priority queue, encoding the characters in the Huffman tree, and storing the code lengths in an array;
2.4 re-encoding each element of the nearest-neighbor matrix, appending the code of the terminator character after the encoded element, storing the result in a linked list of codes, and recording the number of bits of each encoded element in an array;
2.5 obtaining the maximum number of bits from the number of bits of each encoded element;
2.6 padding each element in the linked list of codes up to the maximum number of bits.
Step 2.6 comprises first filling the terminator character into the element to be padded and then padding the rest with zeros.
Step 3) specifically comprises:
3.1 storing the compressed nearest-neighbor matrix data in a two-dimensional byte array, where the first dimension is the total number of elements of the nearest-neighbor matrix and the second dimension is the maximum byte size of an element;
3.2 designing the RowKey of the HBase database: the reversed row number of each row of the nearest-neighbor matrix is used as the HBase RowKey, so that the encoded nearest-neighbor matrix data is evenly distributed over the HBase HRegionServers;
3.3 storing the columns by column number, the value being the element at the corresponding row and column of the grid, and storing the code lengths after character compression in the database.
The client's query information is produced as follows:
the client computes the grid cell containing the query point from the position of the query point, generates the corresponding quadratic-residue query for that grid cell, and finally sends the query together with the grid partition size and the selected parallel strategy to the server.
Step 4) comprises:
receiving the query, the grid partition numbers, and the parallel strategy sent by the client; reading the corresponding CPIR nearest-neighbor matrix, character code lengths, and maximum from the HBase database according to the grid partition and storing them in a Spark RDD; grouping the CPIR nearest-neighbor matrix in the RDD according to the parallel strategy sent by the client; after grouping, Spark performs the CPIR computation in parallel according to query Q and obtains the results; Spark then aggregates the results of the groups and sends the query result and the character code lengths to the client.
Grouping the CPIR nearest-neighbor matrix in the RDD according to the parallel strategy sent by the client includes Row-level grouping and Bit-level grouping:
Row-level grouping groups the CPIR matrix by row; Bit-level grouping first obtains the number k of CPUs currently allocated to the cluster and then groups the data of each row of the CPIR matrix according to the number of CPUs.
Step 5) comprises: receiving the result and the character code lengths returned by the server, performing the quadratic-residue computation on the result to obtain the values of the queried bits, and decompressing those values to obtain the correct query result.
Compared with the prior art, the beneficial effects of the invention are:
The invention provides a CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding. Compressing the data with Huffman coding reduces the data volume and therefore the amount of CPIR computation. When the server performs the CPIR computation, it computes in parallel on the Spark framework to shorten the computation time, which solves the problem of long computation times. Being based on Spark parallelization and Huffman coding, the privacy-preserving query algorithm of the invention protects the user's query privacy in big-data application scenarios and improves query efficiency while preserving the original query quality.
Brief description of the drawings
Fig. 1 is a flow chart of the CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding according to the invention;
Fig. 2 is a schematic diagram of the grid partition of a Voronoi diagram according to the invention;
Fig. 3 shows the average server computation time for different grid partitions according to the invention, where (a) is for the uniformly distributed data set, (b) is for the Gaussian-distributed data set, and (c) is for the real data set.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides a CPIR-V nearest-neighbor privacy-preserving query method based on Spark and Huffman coding, comprising:
1) processing a file to obtain a grid, and reading the nearest-neighbor matrix data of the grid in the file;
A Voronoi diagram is constructed from the points of interest of the spatial data in the file, the spatial data is divided by the Voronoi diagram to obtain Voronoi cells, the Voronoi cells are partitioned with a grid, the potential nearest neighbors of each grid cell are counted, and the nearest-neighbor matrix of the grid is finally obtained. A Voronoi diagram expresses the neighbor topology between spatial objects through a partition of space. Each polygon in the diagram is called a Voronoi cell, and the edges of a Voronoi cell are the perpendicular bisectors between adjacent spatial objects.
The size of the grid cells matters when the Voronoi cells are partitioned. If the grid cells are too small, the nearest-neighbor matrix holds too much data and the client's cost of parsing the query result is high. If the grid cells are too large, the nearest-neighbor matrix holds too little data and compression is ineffective. The grid therefore has to be chosen sensibly according to the density of the points of interest. For example, the whole space is divided into G*G grid cells, and each grid cell either intersects several Voronoi cells or is contained in one of them. As shown in Fig. 2, where p denotes a point of interest and q the query point, the whole space is divided into 5*5 grid cells. Grid cell (1, 1) intersects the Voronoi cells of points of interest p1 and p2, and grid cell (2, 1) is contained in the Voronoi cell of point of interest p1. In Fig. 2 the grid cell containing query point q intersects the Voronoi cells of points of interest p1 and p2, so the nearest neighbor of q may be p1 or p2, and the set {p1, p2} formed by p1 and p2 is called the potential nearest-neighbor set of grid 2,1. It follows that the number of nearest-neighbor points a grid cell has is not fixed; it depends on the distribution of the points and on the size of the grid cells. The CPIR-V algorithm converts the nearest-neighbor relation of each grid cell in Fig. 2 into a nearest-neighbor storage matrix, in which the potential nearest-neighbor points of each grid cell are stored. Note that every element of the storage matrix has the same size. CPIR-V first finds the maximum p_max of the number of potential nearest neighbors over all grid cells and then pads the grid cells that have fewer with a default value. Here the maximum number of potential nearest neighbors of a grid cell is 3, so the grid cells with fewer are padded. The potential nearest-neighbor relations of all grid cells are stored in one matrix, which is the basis and core of the CPIR-V nearest-neighbor query algorithm.
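The construction of the padded nearest-neighbor matrix can be sketched as follows. This is a minimal illustration, not the patent's procedure: instead of intersecting grid cells with exact Voronoi cells, it approximates each cell's potential nearest-neighbor set by sampling points inside the cell. The function name `potential_nn_matrix`, the sampling density, and the padding value 0 are assumptions made for the example.

```python
import math


def potential_nn_matrix(pois, extent, g, samples=8, pad=0):
    """Approximate the potential nearest-neighbour set of every grid cell.

    pois    -- list of (x, y) points of interest
    extent  -- side length of the square space
    g       -- the space is split into g*g grid cells
    samples -- sampling density per cell side (assumption: sampling replaces
               an exact Voronoi/grid intersection test)
    pad     -- default value used to pad short rows (assumption)
    """
    cell = extent / g
    sets = []
    for row in range(g):
        for col in range(g):
            found = set()
            for i in range(samples + 1):
                for j in range(samples + 1):
                    point = ((col + i / samples) * cell, (row + j / samples) * cell)
                    nearest = min(range(len(pois)),
                                  key=lambda k: math.dist(point, pois[k]))
                    found.add(nearest + 1)  # ids start at 1; 0 is the padding value
            sets.append(sorted(found))
    p_max = max(len(s) for s in sets)  # largest potential nearest-neighbour set
    matrix = [s + [pad] * (p_max - len(s)) for s in sets]  # equal-size rows
    return matrix, p_max


matrix, p_max = potential_nn_matrix([(1, 1), (9, 9)], extent=10, g=2)
```

Every row of `matrix` has exactly `p_max` entries, which mirrors the requirement that all elements of the storage matrix have the same size.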
2) compressing the elements of the nearest-neighbor matrix data with Huffman coding to reduce the number of bits of each element;
Huffman coding is a lossless variable-length coding (VLC) scheme. It constructs a unique code for each character according to the probability with which the character occurs in the file to be encoded, and it guarantees that the average code length is the shortest. The code is built from an optimal binary tree and is therefore sometimes called optimal coding. Because Huffman coding is variable-length, characters with a higher probability of occurrence receive shorter codes and characters with a lower probability receive longer codes, so the total encoded length of the whole text is generally smaller than that of the original encoding.
Specifically, this step comprises:
2.1 creating a one-dimensional integer array, reading the nearest-neighbor matrix character by character, counting the frequency of each character, and storing the frequencies in the array, the frequency of the terminator character being the total number of matrix elements;
2.2 computing the frequency of each character and building a priority queue ordered by character frequency from smallest to largest;
2.3 building a Huffman tree from the priority queue, encoding the characters in the Huffman tree, and storing the code lengths in an array;
2.4 re-encoding each element of the nearest-neighbor matrix, appending the code of the terminator character after the encoded element, storing the result in a linked list of codes, and recording the number of bits of each encoded element in an array;
2.5 obtaining the maximum number of bits from the number of bits of each encoded element;
2.6 padding each element in the linked list of codes up to the maximum number of bits. Specifically, the terminator character is first filled into the element to be padded, and the rest is then padded with zeros.
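Steps 2.1 to 2.6 can be sketched as below. This is a minimal illustration under stated assumptions: matrix elements are treated as decimal strings whose characters are the digits, the terminator symbol `END = "$"` is a hypothetical choice, and plain Python lists stand in for the arrays and the linked list of codes described in the text.

```python
import heapq
from collections import Counter
from itertools import count

END = "$"  # hypothetical terminator symbol (assumption)


def build_codes(freq):
    """Build a Huffman code table from a {symbol: frequency} mapping."""
    tie = count()  # tie-breaker so heap entries never compare symbols or subtrees
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)  # priority queue ordered by ascending frequency
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def compress(matrix):
    """Encode every element, append the terminator, pad to the maximum width."""
    elements = [str(v) for row in matrix for v in row]
    freq = Counter(ch for e in elements for ch in e)
    freq[END] = len(elements)  # one terminator per matrix element
    codes = build_codes(freq)
    encoded = ["".join(codes[ch] for ch in e) + codes[END] for e in elements]
    width = max(len(bits) for bits in encoded)  # maximum number of bits
    return [bits.ljust(width, "0") for bits in encoded], codes, width


def decompress(bits, codes):
    """Decode one padded element; zero padding after the terminator is ignored."""
    inverse = {code: sym for sym, code in codes.items()}
    digits, current = "", ""
    for bit in bits:
        current += bit
        if current in inverse:
            if inverse[current] == END:
                return int(digits)
            digits += inverse[current]
            current = ""
    raise ValueError("terminator not found")


padded, codes, width = compress([[12, 7, 0], [12, 12, 305]])
restored = [decompress(bits, codes) for bits in padded]
```

Because the terminator ends every element, the decoder can stop before the zero padding, which is why all elements can safely share one fixed width.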
3) storing the encoded nearest-neighbor matrix data in the spatial database HBase;
HBase is an open-source, distributed, column-oriented key-value database: a highly reliable, high-performance, column-oriented, scalable distributed storage system. It uses Hadoop HDFS as its file storage system and is modeled on Google's Bigtable database, which is built on the Google File System, providing the functionality of Bigtable. With HBase, a large-scale structured storage cluster can be built on inexpensive PC servers. Because HBase is a key-value database, it is well suited to storing unstructured data.
The nearest-neighbor matrix is stored in the spatial database HBase, and the compressed matrix is written to HBase as follows. The RowKey of the HBase database is designed so that the data is evenly distributed over the HBase HRegionServers. The main idea of the RowKey design is to use the reversed row number of each row of the nearest-neighbor matrix as the HBase RowKey, so that no single HRegionServer stores too much data and every HRegionServer stores some. Columns are stored by column number, the value being the element at the corresponding row and column of the grid. Finally, the code lengths after character compression are stored in the database.
Specifically, this step comprises:
3.1 storing the compressed nearest-neighbor matrix data in a two-dimensional byte array, where the first dimension is the total number of elements of the nearest-neighbor matrix and the second dimension is the maximum byte size of an element;
3.2 designing the RowKey of the HBase database: the reversed row number of each row of the nearest-neighbor matrix is used as the HBase RowKey, so that the encoded nearest-neighbor matrix data is evenly distributed over the HBase HRegionServers;
3.3 storing the columns by column number, the value being the element at the corresponding row and column of the grid, and storing the code lengths after character compression in the database.
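The effect of the reversed-row-number RowKey in step 3.2 can be illustrated without an HBase cluster. The sketch below assumes zero-padded decimal row numbers of fixed width and a hypothetical cluster whose regions are split on the first character of the key; neither detail is stated in the text.

```python
from collections import Counter


def reversed_rowkey(row_number, width=6):
    """Zero-pad the row number (assumed width) and reverse its digits."""
    return str(row_number).zfill(width)[::-1]


rows = range(100)
plain_keys = [str(r).zfill(6) for r in rows]
reversed_keys = [reversed_rowkey(r) for r in rows]

# Hypothetical region assignment: one region per leading key character.
plain_regions = Counter(key[0] for key in plain_keys)
reversed_regions = Counter(key[0] for key in reversed_keys)
```

With sequential row numbers, every plain key starts with the same character and lands in one region, while the reversed keys start with the fast-changing last digit and spread evenly over ten regions.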
The specific storage format is shown in Table 1.
Table 1: H-PCPIR-V information table structure
4) after the client's query information is received, the server reads the corresponding data from the HBase database according to the query information and stores it in an RDD of the Spark parallel framework, and groups the CPIR nearest-neighbor matrix in the RDD according to the parallel strategy; Spark performs the CPIR computation in parallel according to the query information, aggregates the results of the groups, and then sends the query result and the character code lengths to the client;
The client's query information is produced as follows: the client computes the grid cell containing the query point from the position of the query point, generates the corresponding quadratic-residue query for that grid cell, and finally sends the query together with the grid partition size and the selected parallel strategy to the server. The steps are:
1: Compute the grid cell G_{a,b} containing the query point from the position of the query point;
2: Generate the query Q(y_1, y_2, ..., y_{g_x}), where y_i = QNR for the index i = b and y_i = QR for every other index;
3: Send the query Q, the grid partition numbers g_x and g_y, and the parallel strategy "strategy" to the server;
4: Wait for the server to return the query result;
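Step 2 of the client procedure can be sketched as follows, assuming the standard quadratic-residue construction: the client knows the prime factors p and q of the modulus n = p*q, draws quadratic residues (QR) as squares, and draws the single quadratic non-residue (QNR) so that it is a non-residue modulo both primes. The function names and the toy primes 1019 and 1031 are assumptions for illustration; a real deployment would use a modulus of the size given in the text.

```python
import math
import random


def is_qr(y, p, q):
    """Euler's criterion modulo both prime factors (needs the secret p, q)."""
    return pow(y, (p - 1) // 2, p) == 1 and pow(y, (q - 1) // 2, q) == 1


def random_qr(n, rng):
    """A random quadratic residue modulo n: the square of a unit."""
    while True:
        r = rng.randrange(2, n)
        if math.gcd(r, n) == 1:
            return r * r % n


def random_qnr(p, q, rng):
    """A value that is a non-residue modulo p and modulo q."""
    while True:
        y = rng.randrange(2, p * q)
        if pow(y, (p - 1) // 2, p) == p - 1 and pow(y, (q - 1) // 2, q) == q - 1:
            return y


def make_query(b, g_x, p, q, seed=0):
    """Q = (y_1, ..., y_{g_x}) with a QNR at column b and QRs elsewhere."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    n = p * q
    return [random_qnr(p, q, rng) if i == b else random_qr(n, rng)
            for i in range(g_x)]


P, Q = 1019, 1031  # toy primes, far too small for real use
query = make_query(b=2, g_x=5, p=P, q=Q)
```

Only the holder of p and q can run `is_qr`, so the server, which sees just n and the values y_i, cannot tell which column the client is interested in.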
The server receives the query, the grid partition numbers, and the parallel strategy sent by the client, reads the corresponding CPIR nearest-neighbor matrix, character code lengths, and maximum from the HBase database according to the grid partition, and stores them in a Spark RDD. It then groups the CPIR nearest-neighbor matrix in the RDD according to the parallel strategy sent by the client. After grouping, Spark performs the CPIR computation in parallel according to query Q and obtains the results. Spark then aggregates the results of the groups and sends the query result and the character code lengths to the client.
Grouping the CPIR nearest-neighbor matrix in the RDD according to the parallel strategy sent by the client includes Row-level grouping and Bit-level grouping:
Row-level grouping groups the CPIR matrix by row; Bit-level grouping first obtains the number k of CPUs currently allocated to the cluster and then groups the data of each row of the CPIR matrix according to the number of CPUs.
Specifically:
1: The server obtains the CPIR matrix data, the character code lengths, and the maximum according to the grid partition g_x, g_y and caches them in the RDD;
2: If the parallel strategy "strategy" is Row, the CPIR matrix data in the RDD is grouped by row;
3: If the parallel strategy "strategy" is Bit, the number k of CPUs currently allocated to the cluster is obtained and the data of each row is divided into k groups;
4: If the parallel strategy "strategy" matches neither, it defaults to Row and the CPIR matrix data in the RDD is grouped by row;
5: Spark performs the CPIR computation on each group according to query Q and obtains the query result Z;
6: Spark aggregates the results Z and then sends the query result and the character code lengths to the client;
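The two grouping strategies can be sketched in plain Python, with ordinary loops standing in for Spark tasks. The per-row formula used here, multiplying y_j when the matrix bit is 1 and y_j squared when it is 0, is the standard quadratic-residue CPIR answer and is an assumption; the text only refers to its own formulas 2.10 and 2.11. All function names are hypothetical.

```python
from functools import reduce


def row_product(bits, ys, n):
    """Multiply y_j (bit 1) or y_j^2 (bit 0) over one row slice, modulo n."""
    return reduce(lambda acc, t: acc * t % n,
                  (y % n if bit else y * y % n for bit, y in zip(bits, ys)), 1)


def row_strategy(matrix, query, n, groups):
    """Row-level grouping: whole rows are dealt round-robin to the groups."""
    parts = [range(g, len(matrix), groups) for g in range(groups)]
    z = [None] * len(matrix)
    for part in parts:  # each part would be one parallel task
        for r in part:
            z[r] = row_product(matrix[r], query, n)
    return z


def bit_strategy(matrix, query, n, k):
    """Bit-level grouping: each row is cut into k chunks, then aggregated."""
    size = -(-len(query) // k)  # ceiling division
    z = []
    for bits in matrix:
        partials = [row_product(bits[s:s + size], query[s:s + size], n)
                    for s in range(0, len(query), size)]  # k parallel tasks
        z.append(reduce(lambda a, b: a * b % n, partials, 1))  # aggregation
    return z


def cpir_answer(matrix, query, n, strategy="Row", k=2):
    """Dispatch on the client's strategy; anything unrecognised means Row."""
    if strategy == "Bit":
        return bit_strategy(matrix, query, n, k)
    return row_strategy(matrix, query, n, k)


N = 1019 * 1031  # toy modulus
MATRIX = [[1, 0, 1, 1, 0], [0, 0, 1, 0, 1], [1, 1, 1, 0, 0]]
QUERY = [5, 7, 11, 13, 17]  # placeholder values, not a real QR/QNR query
by_row = cpir_answer(MATRIX, QUERY, N, "Row", k=2)
by_bit = cpir_answer(MATRIX, QUERY, N, "Bit", k=3)
```

Both strategies return the same values because modular multiplication is associative and commutative; they differ only in how the work is cut into parallel pieces.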
5) the client parses the query result to obtain the values of the queried bits and decompresses those values to obtain the query information. This comprises: receiving the result and the character code lengths returned by the server, performing the quadratic-residue computation on the result to obtain the values of the queried bits, and decompressing those values to obtain the correct query result.
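The quadratic-residue computation on the client can be sketched with a toy round trip. It assumes the standard construction in which the server multiplies y_j or y_j squared per matrix bit, so that the answer for a row is a quadratic residue exactly when the bit in the queried column is 0. The modulus 77 = 7 * 11 and the query values are chosen by hand for illustration, and the subsequent Huffman decompression of the recovered bits is not shown.

```python
def is_quadratic_residue(z, p, q):
    """Euler's criterion modulo both secret prime factors."""
    return pow(z, (p - 1) // 2, p) == 1 and pow(z, (q - 1) // 2, q) == 1


def server_answer(matrix, query, n):
    """One value per row: product of y_j (bit 1) or y_j^2 (bit 0) modulo n."""
    answers = []
    for bits in matrix:
        z = 1
        for bit, y in zip(bits, query):
            z = z * (y if bit else y * y) % n
        answers.append(z)
    return answers


def client_decode(answers, p, q):
    """A residue means the queried bit is 0, a non-residue means it is 1."""
    return [0 if is_quadratic_residue(z, p, q) else 1 for z in answers]


P, Q = 7, 11                 # toy primes known only to the client
QUERY = [4, 9, 6, 16]        # 4, 9, 16 are squares; 6 is a QNR mod 7 and mod 11
MATRIX = [[1, 0, 1, 1],
          [0, 1, 0, 1],
          [1, 1, 1, 0]]
ANSWERS = server_answer(MATRIX, QUERY, P * Q)
DECODED = client_decode(ANSWERS, P, Q)  # bits of column 2, one per row
```

Here the non-residue sits at column index 2, so `DECODED` reproduces the third column of `MATRIX` although the server never learns which column was requested.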
Comparison of experimental results:
The cloud computing platform used is an IBM xSeries System 3650 M4 cluster with 5 nodes. Each node is configured as follows:
CPU: 2 × Xeon E5-2620 (6 cores × 2 threads each);
Memory: 32 GB;
Hard disk: 5 TB, 10000 rpm, RAID 5;
Operating system: CentOS 6.4;
Development tools: GNU toolchain (G++, GDB), Make, Vim, JDK, etc.
The experiments were developed in standard C++, Java, and Scala.
The experimental results are analyzed mainly on three data sets, two synthetic and one real, and the query performance of the algorithm is further analyzed through these results. The real data set is Sequoia, taken from California. The synthetic data sets follow a uniform distribution and a Gaussian distribution respectively, the Gaussian data set obeying (X, Y) ~ N(1, 1, 0, 0, 1). The data range is 1046435*1929615; note that the abscissa x and the ordinate y are of type int. For big-number arithmetic, the C++ code uses the GMP library and the Java code uses the BigInteger class that ships with the JDK. Note that the threshold θ used in the quadratic-residue computation is chosen as the smallest value for which the time of θ big-number multiplications exceeds the time of an in-memory table lookup. The experiments show that at θ = 3 the big-number multiplications take longer than the table lookup, so θ is set to 3. The parameters used in the experiments are listed in Table 2.
Table 2: Experimental data parameters
Parameter name | Range | Default value
Data type | real data (62K), Gaussian distribution (100K), uniform distribution (100K) | uniform distribution (100K)
Grid partition | 10*10, 20*20, 50*50, 100*100, 200*200, 400*400 | 100*100
Modulus k | 128, 256, 512, 1024 | 512
Range query coverage rate | 1, 5, 10, 15 | 1
Query result computation method | Formula 2.10, Formula 2.11 | Formula 2.10
Number of CPU cores | 1, 10, 20, 40, 60 | 60
Data compression results:
Table 3: Data compression comparison for the range query algorithm
In the data preprocessing phase the algorithm compresses the data in the grid with Huffman coding. Table 3 compares the range-query data before and after compression. As Table 3 shows, Huffman compression reduces the range-query data volume by nearly half, with a compression ratio close to 55%. A smaller data volume generally means that the server's CPIR computation is also roughly halved, which lowers the server's computation time.
Table 4: Data compression comparison for the CPIR-V algorithm
Table 4 shows, for the CPIR-V algorithm, the size of the maximum element in the matrix after the nearest-neighbor matrix data has been compressed with Huffman coding in the data preprocessing phase, compared with the size of the maximum element before compression. As Table 4 shows, Huffman compression reduces the size of the maximum element in the matrix by nearly 1/3. CPIR-V first finds the maximum element size in the nearest-neighbor matrix and then pads the remaining elements to that size. Assume the nearest-neighbor matrix is N*N and the maximum is m; the server's big-number computation is then m*N*N. After Huffman compression the maximum in the matrix is (2/3)m, so the server's big-number computation is (2/3)m*N*N. Overall, the server's computation is reduced by (1/3)m*N*N.
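The saving stated above can be written out as a short calculation. The symbols C_before and C_after and the numeric values N = 100 and m = 30 are hypothetical, chosen only to make the arithmetic concrete; they are not figures from the experiments.

```latex
\begin{align*}
C_{\text{before}} &= m N^2, \\
C_{\text{after}}  &= \tfrac{2}{3}\, m N^2, \\
C_{\text{before}} - C_{\text{after}} &= \tfrac{1}{3}\, m N^2 . \\[4pt]
\text{For } N = 100,\ m = 30:\quad
C_{\text{before}} &= 300\,000,\quad
C_{\text{after}} = 200\,000,\quad
C_{\text{before}} - C_{\text{after}} = 100\,000 .
\end{align*}
```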
Fig. 3 compares the server computation time under different grid partitions for the Spark-based parallel CPIR-V algorithm (PCPIR-V) and for the parallel CPIR-V algorithm based on Spark and Huffman coding (H-PCPIR-V) with the Row parallel strategy and with the Bit parallel strategy. In the figure, H-PCPIR-V-R denotes the Row parallel strategy and H-PCPIR-V-B denotes the Bit parallel strategy. For all three algorithms, the server computation time generally grows with the grid partition, because a larger grid partition produces a larger CPIR-V computation matrix and therefore more big-integer computations. The figure shows the server computation time for the three data sets under different grids. H-PCPIR-V-R and H-PCPIR-V-B generally need less server computation time than PCPIR-V, and the difference is more pronounced on the Gaussian-distributed data set and the real data set. The reason is that H-PCPIR-V-R and H-PCPIR-V-B compress the frequently occurring data in the matrix during the data preprocessing stage, which reduces the number of bits involved in the big-integer computations and thus further reduces the server computation time. The gap in server computation time is larger for the Gaussian-distributed and real data sets than for the uniformly distributed data set because the matrix data of the uniform data set is fairly even, so compression has little effect on it. In the other two data sets the element sizes vary widely and the gap between the largest and smallest elements is large, so the number of bits to be computed differs considerably after compression, which affects the results.
Comparative experiments and analysis of the results show that, relative to the PCPIR-V algorithm, the H-PCPIR-V algorithm reduces the server's computation cost by about 30%, the client's computation cost by about 10%, and the communication cost by about 40%.
It will be apparent to those skilled in the art that the specific embodiments above are preferred schemes of the present invention. Improvements and variations that those skilled in the art may make to parts of the invention still embody its principle and serve its purpose, and fall within the scope of protection of the present invention.

Claims (9)

1. A CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding, characterized by comprising:
1) processing a file to obtain a grid, and reading the nearest neighbor matrix data of the grid in the file;
2) compressing the elements of the nearest neighbor matrix data with Huffman coding to reduce the number of bits of each element;
3) storing the encoded nearest neighbor matrix data in the spatial database HBase;
4) after data query information is received from a client, the server reads the corresponding query information from the HBase database according to the data query information and stores it in an RDD of the Spark parallel framework, and groups the CPIR nearest neighbor matrix in the RDD according to the parallel strategy; Spark performs the CPIR parallel computation according to the query information, aggregates the computation results of the groups, and then sends the query result and the character code lengths to the client;
5) the client parses the query result to obtain the values of the query bits, and decompresses those values to obtain the query information.
2. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, characterized in that step 1), processing a file to obtain a grid and reading the nearest neighbor matrix data of the grid in the file, comprises:
constructing a Voronoi diagram from the points of interest in the spatial data of the file, dividing the spatial data into Voronoi cells with the Voronoi diagram, then applying a grid partition to the Voronoi cells, counting the number of potential nearest neighbors of each grid cell, and finally obtaining the nearest neighbor matrix of the grid.
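The construction in claim 2 can be illustrated with a minimal Python sketch. It is an approximation under stated assumptions rather than the patented procedure: instead of intersecting exact Voronoi cells with grid cells, it probes a lattice of sample points inside each grid cell and records which point of interest is nearest to each probe. The function names, the unit-square domain and the sampling density are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def nearest_poi(p: Point, pois: List[Point]) -> int:
    """Index of the point of interest closest to p (ties go to the lower index)."""
    return min(range(len(pois)),
               key=lambda i: (pois[i][0] - p[0]) ** 2 + (pois[i][1] - p[1]) ** 2)


def nn_matrix(pois: List[Point], grid_n: int, samples: int = 5) -> List[List[List[int]]]:
    """Approximate potential nearest neighbors for a grid_n x grid_n grid.

    Each cell of the unit square is probed on a samples x samples lattice
    (borders included); the POIs found are the cell's potential nearest
    neighbors. This sampling stands in for an exact Voronoi overlap test.
    """
    step = 1.0 / grid_n
    matrix = []
    for row in range(grid_n):
        matrix_row = []
        for col in range(grid_n):
            found = set()
            for a in range(samples):
                for b in range(samples):
                    x = (col + a / (samples - 1)) * step
                    y = (row + b / (samples - 1)) * step
                    found.add(nearest_poi((x, y), pois))
            matrix_row.append(sorted(found))
        matrix.append(matrix_row)
    return matrix


# Two hypothetical points of interest in the unit square.
pois = [(0.25, 0.5), (0.75, 0.5)]
coarse = nn_matrix(pois, 1)   # one cell covering both Voronoi regions
fine = nn_matrix(pois, 4)     # 4 x 4 grid
```

In this toy setting a single coarse cell sees both points of interest, while the cells of the finer grid near the left and right borders see only one, which is why the number of potential nearest neighbors per cell has to be counted for each grid partition.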
3. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, characterized in that step 2) specifically comprises:
2.1 creating a one-dimensional integer array, reading the nearest neighbor matrix character by character, counting the characters that occur and their frequencies and storing them in the array, the frequency of the termination character being the total number of matrix elements;
2.2 computing the frequency of each character and constructing a priority queue ordered by character frequency from smallest to largest;
2.3 constructing a Huffman tree from the priority queue, and storing the code and code length of each character in the Huffman tree in an array;
2.4 re-encoding each element of the nearest neighbor matrix, appending the termination character code after the element code and storing the result in an encoding linked list, and recording the number of bits of each encoded element in an array;
2.5 obtaining the maximum number of bits from the bit counts of the encoded elements;
2.6 padding each element in the encoding linked list that falls short of the maximum number of bits up to that maximum.
4. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 3, characterized in that step 2.6 comprises first filling the termination character into the element to be padded and then filling the remainder with zeros.
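Steps 2.1 to 2.6 and the padding rule of claim 4 can be sketched in Python as follows. This is a minimal illustration, assuming each matrix element is a short character string; the terminator symbol `$`, the function names and the sample elements are hypothetical, and Python's `heapq` stands in for the priority queue.

```python
import heapq
from collections import Counter
from itertools import count
from typing import Dict, List, Tuple

END = "$"  # hypothetical termination character, assumed absent from the data


def huffman_codes(freq: Dict[str, int]) -> Dict[str, str]:
    """Build a prefix code from character frequencies with a min-heap."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {ch: ""}) for ch, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {ch: "0" for ch in freq}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def compress(elements: List[str]) -> Tuple[List[str], Dict[str, str], int]:
    """Encode every element, append the terminator code, zero-pad to max width."""
    freq = Counter(ch for element in elements for ch in element)
    freq[END] = len(elements)  # terminator frequency = number of elements
    codes = huffman_codes(freq)
    encoded = ["".join(codes[ch] for ch in e) + codes[END] for e in elements]
    width = max(len(bits) for bits in encoded)
    padded = [bits.ljust(width, "0") for bits in encoded]
    return padded, codes, width


def decompress(bits: str, codes: Dict[str, str]) -> str:
    """Read codes until the terminator; the zero padding after it is ignored."""
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            if inverse[current] == END:
                return "".join(out)
            out.append(inverse[current])
            current = ""
    raise ValueError("terminator code not found")


elements = ["12", "7", "123", "12"]  # hypothetical matrix elements
padded, codes, width = compress(elements)
restored = [decompress(bits, codes) for bits in padded]
```

Because the code is prefix-free, the terminator code marks unambiguously where an element ends, so the zeros added to reach the common width cannot be mistaken for data when the element is decoded.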
5. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, characterized in that step 3) specifically comprises:
3.1 storing the compressed nearest neighbor matrix data in a two-dimensional byte array, in which the first dimension represents the total number of nearest neighbor matrix elements and the second dimension represents the maximum byte size of an element;
3.2 designing the RowKey of the HBase database, using the reversed row number of each row of the nearest neighbor matrix as the HBase RowKey, so that the encoded nearest neighbor matrix data is evenly distributed across the HBase HRegionServers;
3.3 storing the columns by column number, the value of each column being the element of the grid at the corresponding column number in that row, and storing the code lengths of the compressed characters in the database.
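The reversed row number of step 3.2 can be illustrated with a few lines of plain Python. No HBase client is used here; the zero-padded key width and the function name `row_key` are assumptions made for the example.

```python
def row_key(row: int, width: int) -> str:
    """Zero-pad the row number to a fixed width, then reverse its digits."""
    return str(row).zfill(width)[::-1]


# Keys for the first ten rows of a hypothetical matrix with up to 9999 rows.
keys = [row_key(r, 4) for r in range(10)]
leading = {key[0] for key in keys}
```

Consecutive row numbers differ in their last digit, so after reversal they differ in the first character of the key. Since HBase sorts rows by key, neighboring matrix rows land in different key ranges instead of piling onto one region, which is the even distribution that step 3.2 aims for.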
6. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, characterized in that the client data query information comprises:
computing the grid cell that contains the query point from the position of the query point, generating the corresponding quadratic residue query from that grid cell, and finally sending the query, the grid partition size and the selected parallel strategy to the server.
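The query generation of claim 6 can be sketched with the textbook quadratic-residuosity construction for CPIR. This is an assumption about how the quadratic residue query is formed, not a detail confirmed by the claim: every column receives a quadratic residue modulo N = p*q except the column of the query cell, which receives a non-residue with Jacobi symbol +1. The toy primes, the unit-square domain and all names are hypothetical, and real deployments would use far larger primes.

```python
import math
import random
from typing import List, Tuple


def cell_of(x: float, y: float, grid_n: int) -> Tuple[int, int]:
    """(row, col) of the grid cell that contains a point of the unit square."""
    return min(int(y * grid_n), grid_n - 1), min(int(x * grid_n), grid_n - 1)


def is_square_mod(x: int, prime: int) -> bool:
    """Euler's criterion: True when x is a nonzero square modulo an odd prime."""
    return pow(x, (prime - 1) // 2, prime) == 1


def make_query(target_col: int, n_cols: int, p: int, q: int,
               rng: random.Random) -> Tuple[int, List[int]]:
    """One number per column: a non-residue (Jacobi +1) at target_col, residues elsewhere."""
    n = p * q
    query = []
    for col in range(n_cols):
        while True:  # rejection sampling
            x = rng.randrange(2, n)
            if math.gcd(x, n) != 1:
                continue
            sq_p, sq_q = is_square_mod(x, p), is_square_mod(x, q)
            if col == target_col and not sq_p and not sq_q:
                break
            if col != target_col and sq_p and sq_q:
                break
        query.append(x)
    return n, query


P, Q = 499, 547                       # toy primes kept secret by the client
row, col = cell_of(0.35, 0.85, 10)    # hypothetical query point, 10*10 grid
modulus, query = make_query(col, 10, P, Q, random.Random(0))
```

Only the client knows p and q, so only the client can tell which entry of the query is the non-residue; the server sees the modulus and the ten numbers but cannot identify the queried column without factoring the modulus.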
7. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, characterized in that step 4) comprises:
receiving the query, the grid partition number and the parallel strategy sent by the client; reading the corresponding CPIR nearest neighbor matrix, the character code lengths and the maximum value from the HBase database according to the grid partition data and storing them in a Spark RDD; grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client; after grouping, Spark performs the CPIR parallel computation according to the query Q and obtains the computation results; Spark aggregates the computation results of the groups and then sends the query result and the character code lengths to the client.
8. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 7, characterized in that grouping the CPIR nearest neighbor matrix in the RDD according to the parallel strategy sent by the client comprises Row-level grouping and Bit-level grouping:
Row-level grouping groups the CPIR matrix by row; Bit-level grouping first obtains the number k of CPUs currently allocated to the cluster and then groups the data of each row of the CPIR matrix according to the number of CPUs.
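The two grouping strategies can be compared with a small Python sketch that emulates the groups with plain lists; in Spark each group would correspond to a partition of the RDD. The server computation shown is the textbook quadratic-residuosity product, and the reading of Bit-level grouping as "split each row into k chunks and multiply the partial products" is an assumption. All names and the toy values are hypothetical.

```python
from functools import reduce
from typing import List


def row_answer(bits: List[int], query: List[int], n: int) -> int:
    """Product over columns of y (bit = 1) or y*y (bit = 0), modulo n."""
    acc = 1
    for bit, y in zip(bits, query):
        acc = acc * (y if bit else y * y) % n
    return acc


def row_level(matrix: List[List[int]], query: List[int], n: int) -> List[int]:
    """Row-level grouping: each row is one independent unit of work."""
    return [row_answer(row, query, n) for row in matrix]


def bit_level(matrix: List[List[int]], query: List[int], n: int, k: int) -> List[int]:
    """Bit-level grouping (assumed reading): split each row into k chunks,
    compute a partial product per chunk, then multiply the partials."""
    answers = []
    for row in matrix:
        size = -(-len(row) // k)  # ceiling division
        partials = [row_answer(row[i:i + size], query[i:i + size], n)
                    for i in range(0, len(row), size)]
        answers.append(reduce(lambda a, b: a * b % n, partials, 1))
    return answers


matrix = [[1, 0, 1, 1], [0, 0, 1, 0]]   # one bit per cell, toy values
query = [3, 5, 7, 11]                   # toy query numbers
n = 221                                 # toy modulus 13 * 17
by_row = row_level(matrix, query, n)
by_bit = bit_level(matrix, query, n, k=2)
```

Modular multiplication is associative and commutative, so the partial products can be combined in any order and both strategies return the same answers. The strategies differ only in how the work is spread over the available CPUs.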
9. The CPIR-V nearest neighbor privacy protection query method based on Spark and Huffman coding according to claim 1, characterized in that step 5) comprises: receiving the result returned by the server and the character code lengths, performing the quadratic residue computation on the result to obtain the values of the query bits, and decompressing the values of the query bits to obtain the correct query result.
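The client-side decoding of claim 9 can be sketched end to end on a toy example. The sketch assumes the textbook quadratic-residuosity scheme: a server response that is a quadratic residue modulo both secret primes decodes to bit 0, anything else to bit 1. The code table, the terminator `$`, the primes and the stored bit strings are all hypothetical, and the server side is simulated inline only to produce responses to decode.

```python
from typing import Dict, List


def is_residue(z: int, p: int, q: int) -> bool:
    """Euler's criterion modulo both secret primes."""
    return pow(z, (p - 1) // 2, p) == 1 and pow(z, (q - 1) // 2, q) == 1


def recover_bits(responses: List[int], p: int, q: int) -> str:
    """One server response per bit position: residue -> 0, non-residue -> 1."""
    return "".join("0" if is_residue(z, p, q) else "1" for z in responses)


def decode_element(bits: str, codes: Dict[str, str], end: str) -> str:
    """Huffman-decode until the terminator; trailing zero padding is ignored."""
    inverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            if inverse[current] == end:
                return "".join(out)
            out.append(inverse[current])
            current = ""
    raise ValueError("terminator code not found")


# Toy round trip; every value below is a hypothetical example.
P, Q = 499, 547
N = P * Q
CODES = {"1": "0", "2": "10", "$": "11"}   # prefix code with terminator "$"
non_residue = next(x for x in range(2, N)
                   if pow(x, (P - 1) // 2, P) == P - 1
                   and pow(x, (Q - 1) // 2, Q) == Q - 1)
query = [4, non_residue, 9]                # the client asks for column 1
columns = ["0110100", "0101100", "1111111"]  # padded codes stored in one row

responses = []
for t in range(7):                         # simulated server, one bit position at a time
    z = 1
    for y, column in zip(query, columns):
        z = z * (y if column[t] == "1" else y * y) % N
    responses.append(z)

bits = recover_bits(responses, P, Q)
element = decode_element(bits, CODES, "$")
```

The recovered bit string is the padded code of the element in the queried column; decoding stops at the terminator, so the trailing zeros are discarded and the original element is returned.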
CN201710536073.6A 2017-07-04 2017-07-04 Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method Active CN107291935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710536073.6A CN107291935B (en) 2017-07-04 2017-07-04 Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method

Publications (2)

Publication Number Publication Date
CN107291935A true CN107291935A (en) 2017-10-24
CN107291935B CN107291935B (en) 2020-09-29

Family

ID=60098630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710536073.6A Active CN107291935B (en) 2017-07-04 2017-07-04 Spark and Huffman coding based CPIR-V nearest neighbor privacy protection query method

Country Status (1)

Country Link
CN (1) CN107291935B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073689A (en) * 2010-12-27 2011-05-25 东北大学 Dynamic nearest neighbour inquiry method on basis of regional coverage
CN102708191A (en) * 2012-05-15 2012-10-03 通唐软件技术(湖南)有限公司 Word stock coding and decoding method capable of saving memory
CN104268210A (en) * 2014-09-12 2015-01-07 东北大学 CPIR-V nearest neighbor privacy protection querying method based on local super-set
CN104392318A (en) * 2014-11-24 2015-03-04 蔡志明 Medical data storing and inquiring method based on cloud platform
CN104486434A (en) * 2014-12-23 2015-04-01 深圳供电局有限公司 Mobile terminal and file upload and download methods of mobile terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
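Qu Zhijian (屈志坚): "Intelligent-scheduling lossless cluster compression technique for the Hadoop cloud architecture", Automation of Electric Power Systems (《电力系统自动化》) *
Deng Shizhuo (邓诗卓): "PCPIR-V: a Spark-based parallel privacy-preserving nearest neighbor query algorithm", Chinese Journal of Network and Information Security (《网络与信息安全学报》) *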

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108200027A (en) * 2017-12-27 2018-06-22 东南大学 Location privacy protection neighbor query method based on feedback angle
CN108200027B (en) * 2017-12-27 2020-11-03 东南大学 Location privacy protection neighbor query method based on feedback angle
CN109190809A (en) * 2018-08-15 2019-01-11 中国石油化工股份有限公司江汉油田分公司勘探开发研究院 Multivariable coding method and device for oilfield development plans
CN112527951A (en) * 2021-02-09 2021-03-19 北京微步在线科技有限公司 Byte array-based integer variable-length ordered coding method and device and storage medium
CN114968404A (en) * 2022-05-24 2022-08-30 武汉大学 Distributed offloading method for computing tasks with location privacy protection
CN114968404B (en) * 2022-05-24 2023-11-17 武汉大学 Distributed offloading method for computing tasks with location privacy protection

Also Published As

Publication number Publication date
CN107291935B (en) 2020-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant