CN115982752A

CN115982752A - K domination privacy protection method based on approximate semantic query

Info

Publication number: CN115982752A
Application number: CN202211496552.7A
Authority: CN
Inventors: 李松; 吴楠; 曹文琪
Original assignee: Harbin University of Science and Technology
Current assignee: Harbin University of Science and Technology
Priority date: 2022-11-25
Filing date: 2022-11-25
Publication date: 2023-04-18
Anticipated expiration: 2042-11-25
Also published as: CN115982752B

Abstract

The invention discloses a K domination privacy protection method based on approximate semantic query. First, given data, a position data set within a rectangular area containing the real position is obtained from the data, and cluster center points in the data set are obtained through the MCA algorithm. A multi-center data processing algorithm based on the maximum-minimum distance is then applied to the data point set produced by the MCA clustering, selecting position points that are as far apart as possible, to generate a processed candidate set. Second, the semantic similarity between any two positions in the candidate set is obtained by calculating the distance between the position information of different names, and the k-1 positions with the minimum semantic similarity are selected as the virtual result set in combination with a dummy method. Experimental results show that the method can ensure the physical dispersion and semantic diversity of the positions and improve the efficiency of dummy generation, while achieving a balance between privacy protection and query service quality.

Description

K domination privacy protection method based on approximate semantic query

Technical Field

The invention relates to the field of privacy protection processing in data query, in particular to a K domination privacy protection method based on approximate semantic query.

Background

Background and significance of the main innovative point

With the development of mobile positioning technology and wireless communication technology, a large number of mobile devices on the market are capable of accurate GPS positioning, and Location Based Services (LBS) have developed rapidly as a result. However, while LBS brings convenience and great benefit to society, the leakage of sensitive information is receiving increasing attention. Because a user's location is shared among different location service providers, untrusted third parties can easily steal the user's private information by analyzing and comparing the location information. For example, by capturing a user's recent trajectory, an adversary can infer information such as the user's home address, workplace and health condition.

Therefore, it is necessary to ensure the security of the privacy of the user location, and at present, many different methods are proposed to prevent the disclosure of private information, including mainly fuzzy methods, encryption methods and policy-based methods. Spatial anonymity methods typically require the assistance of a fully Trusted Third Party (TTP). When the location query service is needed, the mobile user firstly sends a query request to the TTP, and the TTP generates a K domination area containing the user location and then sends the K domination area to the LBS server for querying. In this method, if the area of the K dominating region is too large, not only is more time consumed, but also the accuracy of the query result is reduced. At the same time, TTP is likely to become a bottleneck of the system. However, in privacy protection based on virtual locations, which are generated by mobile clients, TTPs and anonymous areas are not required. Therefore, it can well compensate for the above-mentioned disadvantages of the spatial anonymity method.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide a K domination privacy protection method based on approximate semantic query, which combines the K domination technology and the semantic similarity correlation technology in the traditional calculation to improve the privacy protection degree of the query, and the algorithm can further improve the query efficiency.

The technical scheme adopted by the invention for solving the technical problems is as follows: a K domination privacy protection method based on approximate semantic query mainly comprises the following steps:

1. firstly, giving data and obtaining a position data set in a rectangular area containing real positions from the data, calculating and generating a plurality of clustering centers by an MCA center clustering method so as to form a candidate data set, then selecting some positions by adopting a multi-center data processing method based on the maximum and minimum distances, ensuring the farthest distance between the positions, and generating a group of processed fake data points;

2. secondly, the semantic similarity between any two positions in the candidate set is obtained by calculating the distance between the position information of different names, and the k-1 positions with the minimum semantic similarity are selected as virtual data points.

Furthermore, the MCA algorithm is adopted so that a plurality of cluster centers can be generated at the mobile client. Because these positions are as far apart from one another as possible, the dummy data set can be generated from them.

Further, semantic similarity is calculated over the position information of the candidate set, the k-1 positions with the minimum semantic similarity are selected as virtual positions, and the k-1 virtual positions together with the real position are sent to the LBS server for querying. The dummy set is generated in combination with the proposed dummy generation method, in which the dummy candidate data set is produced by the clustering calculation of Algorithm 1.

The beneficial effects of the invention are as follows: by adopting an algorithm that combines K domination with semantic similarity, the invention further protects the user's position information during queries, reduces the time overhead of the query, and further ensures the user's query privacy.

Drawings

Fig. 1 is an abstract drawing of the K domination privacy protection method based on approximate semantic query according to the invention.

Fig. 2 is a graph comparing the time overhead of the three methods presented by the present invention as the value of K increases.

Fig. 3 is an exemplary diagram of an MCA algorithm presented in the present invention.

Fig. 4 is a graph comparing the operating efficiency of the present invention and MaxMinDistDS as provided by the present invention.

FIG. 5 is a graph comparing the operating efficiency of the present invention and SimPMaxMinDistDS as provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described in detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of implementation examples of the present invention, and not all embodiments. Further, it should be understood that various modifications and changes may occur to those skilled in the art after reading the present disclosure, and that such equivalents fall within the scope of the appended claims.

The invention discloses a K domination privacy protection method based on approximate semantic query, which comprises the following specific operation processes:

Step (1): The MCA algorithm is used to compute cluster centers for the geographic coordinates of the positions in the square area, yielding a plurality of cluster centers, which are selected as the virtual candidate set. The MCA algorithm is a heuristic clustering algorithm that, based on Euclidean distance, takes objects that are as far apart as possible as cluster centers. One sample object is first taken as the first cluster center, and the sample farthest from the first cluster center is then selected as the second cluster center. Further cluster centers are then determined until no new cluster center is found. After all cluster centers have been determined, the m cluster sample sets are used as the virtual position candidate set. The result is shown in Fig. 3. According to Algorithm 1, $l_1$ is selected as the first cluster center, $l_5$ as the second cluster center, and $l_9$ is determined as the third cluster center. Three cluster centers are thus obtained by the clustering calculation, generating the virtual position candidate set.

When determining the cluster centers, the actual position is used as the initial (first) cluster center; for a position $l_i$ to be selected as a further cluster center, the following conditions must be satisfied:

(1) $D_i > \gamma \cdot D_{12}$, where $i \in \{1, \ldots, n\}$;

(2) $D_i = \max\{\min(D_{i1}, D_{i2})\}$, where $i \in \{1, \ldots, n\}$ and $D_{12} = |Z_2 - Z_1|$;

(3) $\gamma$ is a test parameter of the algorithm, with value range $0.5 < \gamma < 1$.

The steps of the MCA algorithm are as follows:

Algorithm 1.

Input: a position data set $S_n$ and a demand parameter m.

Output: a virtual position data set $S_1$.

1. Set the value of $\gamma$, ensuring that $0 < \gamma < 1$.

2. Take the true position $l_{re}$ as the first cluster center $Z_1$.

3. Find the position farthest from $Z_1$ and take it as the second cluster center $Z_2$.

4. For each remaining object $l_i$ of $S_n$, calculate its distances $D_{i1}$ and $D_{i2}$ to $Z_1$ and $Z_2$. Let $D_{12}$ be the distance between $Z_1$ and $Z_2$. If $D_i = \max\{\min(D_{i1}, D_{i2})\}$, where $i \in \{1, \ldots, n\}$, and $D_i > \gamma \cdot D_{12}$, then $l_i$ is taken as the third cluster center $Z_3$.

5. By analogy, obtain all v cluster centers that meet the conditions. When the maximum-minimum distance is less than $\gamma \cdot D_{12}$, the search for cluster centers ends.

6. Let v denote the number of cluster centers calculated, and judge which of the following conditions is met:

(1) If v ≥ m, the algorithm ends;

(2) If v < m, the value of $\gamma$ is reselected and the algorithm is re-executed from step 1.

7. Generate the candidate set $S_1$.
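The steps of Algorithm 1 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the function names `mca_centers` and `candidate_centers`, the representation of positions as (x, y) tuples, the generalization of the min over two centers to a min over all centers found so far, and the rule of shrinking γ by a fixed factor when fewer than m centers are found are all chosen for illustration (the text only says the value is reselected).

```python
import math


def mca_centers(points, real_index, gamma):
    """Return indices of cluster centers chosen by the max-min distance rule.

    points     : list of (x, y) tuples, at least two entries (assumed format)
    real_index : index of the real position, used as the first center Z1
    gamma      : test parameter, 0 < gamma < 1
    """
    if len(points) < 2:
        raise ValueError("need at least two positions")
    z1 = real_index
    # Z2 is the position farthest from Z1.
    z2 = max(range(len(points)), key=lambda i: math.dist(points[i], points[z1]))
    centers = [z1, z2]
    d12 = math.dist(points[z1], points[z2])

    while True:
        best, best_d = None, -1.0
        for i in range(len(points)):
            if i in centers:
                continue
            # Distance from candidate i to its nearest existing center.
            d_min = min(math.dist(points[i], points[c]) for c in centers)
            if d_min > best_d:
                best, best_d = i, d_min
        # Stop when no candidate exceeds the threshold gamma * D12.
        if best is None or best_d <= gamma * d12:
            break
        centers.append(best)
    return centers


def candidate_centers(points, real_index, m, gamma=0.9, shrink=0.8):
    """Re-run the search with a smaller gamma until at least m centers exist.

    The shrink factor and the lower bound on gamma are hypothetical choices.
    """
    while True:
        centers = mca_centers(points, real_index, gamma)
        if len(centers) >= m or gamma < 1e-3:
            return centers
        gamma *= shrink
```

With a larger γ fewer centers pass the threshold, so lowering γ is one plausible way to reach the required number m; other reselection strategies would fit the text equally well.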

Step (2): The semantic similarity of the position information in the dummy candidate set is calculated. First, the common prefix of the place names is removed, according to the characteristics of position information. The semantic similarity of the remaining strings is then calculated from their distance, which improves both calculation efficiency and accuracy. For example, "Harbin second school" and "Harbin first school" are two Chinese place-name strings. The prefix "Harbin" contributes nothing to the semantic similarity calculation for the two strings and would also affect the accuracy of the result. The prefix "Harbin" is therefore excluded from the calculation.
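The prefix removal described here can be illustrated with a short Python sketch. The helper name `strip_common_prefix` is hypothetical, and matching is done character by character on the translated English strings purely for illustration; the original method operates on Chinese place names.

```python
import os


def strip_common_prefix(a, b):
    """Remove the longest shared leading substring from both place names."""
    # os.path.commonprefix compares the strings character by character.
    prefix = os.path.commonprefix([a, b])
    return a[len(prefix):], b[len(prefix):]


rest_a, rest_b = strip_common_prefix("Harbin second school", "Harbin first school")
```

Here `rest_a` and `rest_b` hold the parts of the two names that actually differ, which are the strings passed on to the distance calculation.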

D[i,j] is a dynamic programming distance matrix in which each editing operation has a cost between 0 and 1; different values may be set as required. Here the cost is set to 0 or 1: if $a_i = b_j$, the substitution cost is 0; otherwise every operation costs 1. In this example, D is the dynamic programming matrix representing the distance between the string A = "second middle school" and the string B = "first middle school".

The distance between the two strings is obtained by calculation, i.e. $D[i,j] = D[4,4] = 2$. The similarity matching index between the strings, namely the semantic similarity, is then calculated from this distance and equals 0.5 in this example.

Here $|A|$ and $|B|$ denote the lengths of the two strings, and the larger of the two lengths is used in calculating the semantic similarity $S$. Finally, according to the formula $\arg\min S(l_i, l_j)$, the k positions with the smallest semantic similarity, including the true position, are obtained.
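The similarity formula itself is not reproduced in this text. As a hedged reading only, the quoted values (distance 2, string length 4, similarity 0.5) and the use of the larger string length are consistent with a length-normalized edit-distance similarity of the following form; this specific form is an assumption and is not taken from the source:

$$S(A,B) = 1 - \frac{D[i,j]}{\max(|A|,\,|B|)}, \qquad S = 1 - \frac{2}{4} = 0.5$$

Under this reading, a smaller similarity corresponds to a larger edit distance, which agrees with selecting the positions of minimum similarity in order to obtain semantic diversity.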

The semantic similarity algorithm is as follows:

Algorithm 2. Calculating semantic similarity to obtain a virtual position result set.

Input: a position candidate set $S_1$ and a semantic diversity threshold parameter l.

Output: a position result set $S_2$.

1. Match the characters of the place-name information in order and ignore the prefix characters that match. Two new strings A and B are thereby obtained.

2. Suppose string A contains i characters, denoted $A = a_1 a_2 a_3 \ldots a_i$, and string B contains j characters, denoted $B = b_1 b_2 b_3 \ldots b_j$.

3. Construct a dynamic programming matrix D with i+1 columns and j+1 rows. Its last element D[i,j] is the edit distance ed(A,B).

4. If j = 0, return i and exit; if i = 0, return j and exit.

5. Initialize the first row to (0, 1, ..., i) and the first column to (0, 1, ..., j).

6. Assign a value to each element of the matrix:

If $a_i = b_j$, then D[i,j] = D[i-1,j-1];

If $a_i \neq b_j$, then D[i,j] = 1 + min{D[i-1,j-1], D[i-1,j], D[i,j-1]}.

7. Repeat step 6 until all values in the matrix are obtained, finally yielding the distance D[i,j].

8. Calculate the similarity matching index S(A,B), i.e. the semantic similarity, from D[i,j].

9. Select the k-1 positions with the minimum semantic similarity to generate the virtual result set $S_2$.
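Steps 3 to 9 of Algorithm 2 can be sketched in Python as below. This is an illustrative sketch under explicit assumptions rather than the patented implementation: the function names are hypothetical, the similarity is assumed to be 1 minus the edit distance divided by the longer string length (the source does not reproduce its formula), similarity is measured against the real position's name, and the names are assumed to have had any common prefix removed already.

```python
def edit_distance(a, b):
    """Edit distance via a (len(a)+1) x (len(b)+1) dynamic programming matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    # First column and first row: cost of deleting or inserting every character.
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]          # matching characters cost 0
            else:
                d[i][j] = 1 + min(d[i - 1][j - 1],  # substitution
                                  d[i - 1][j],      # deletion
                                  d[i][j - 1])      # insertion
    return d[-1][-1]


def similarity(a, b):
    """Assumed similarity index: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


def select_dummies(real_name, candidate_names, k):
    """Return the k-1 candidates least similar to the real position's name."""
    ranked = sorted(candidate_names, key=lambda name: similarity(real_name, name))
    return ranked[:k - 1]
```

Sorting by ascending similarity and taking the first k-1 entries is one straightforward way to realize the arg min selection; a pairwise criterion over all candidates would be an alternative interpretation of the same step.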

Finally, the effectiveness of the method is verified through experiments. Among dummy position selection methods that consider semantic similarity, the average execution times of MaxMinDistDS, SimPMaxMinDistDS and the proposed method are compared. The average dummy-generation execution time of the three methods is shown in Fig. 2, and the comparisons of dummy-generation efficiency between the proposed method and MaxMinDistDS and SimPMaxMinDistDS are shown in Fig. 4 and Fig. 5 respectively. As shown in Fig. 2, as k increases, the MaxMinDistDS algorithm takes more time than the proposed method. As shown in Fig. 5, when k is less than 5 the average execution time of the SimPMaxMinDistDS algorithm is slightly greater than that of the proposed method, and when k is greater than or equal to 5 it is much greater. As can also be seen from Fig. 5, the advantage of the proposed method over the other two algorithms grows as k increases.

Theoretical and experimental results show that the algorithm can ensure the physical dispersity and semantic diversity of the position, effectively protect the position privacy of the user, reduce the time for generating the dummy and effectively improve the query efficiency.

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. A K domination privacy protection method based on approximate semantic query is characterized by comprising the following steps:

the method comprises the steps that firstly, a position data set in a rectangular area containing real positions is obtained, a plurality of positions are selected through an MCA algorithm and a multi-center clustering method with the largest and smallest distances, and then a candidate data set is generated after processing through a dummy method;

and step two, after the distance between the geographical position information is calculated, calculating to obtain the semantic similarity between any two positions in the candidate set, and selecting k-1 geographical positions with the minimum semantic similarity as virtual positions.

2. The method of claim 1, wherein the K-dominant privacy protection method based on approximate semantic query is characterized in that: an MCA method for processing data is provided to achieve the purpose of acquiring a cluster center point set.

3. The method of claim 1, wherein the K-dominant privacy protection method based on approximate semantic query is characterized in that: and a candidate virtual model set is generated by adopting a multi-center clustering algorithm based on a maximum and minimum distance method, so that the physical dispersity of the virtual model is ensured.

4. The method of claim 1, wherein the K-dominant privacy protection method based on approximate semantic query is characterized in that: in the aspect of processing the semantic similarity, the geographical position with the minimum semantic similarity is selected as the virtual place name, so that the semantic diversity of the virtual place name is ensured.

5. The method of claim 4, wherein the K-dominant privacy protection method based on approximate semantic query is characterized in that: a dummy processing method is provided for processing and selecting the candidate set in the approximate semantic query process, and the average execution time for selecting the candidate set can be well reduced.