CN109918939B

CN109918939B - HMM-based user query risk assessment and privacy protection method

Info

Publication number: CN109918939B
Application number: CN201910072616.2A
Authority: CN
Inventors: 徐光伟; 马永东; 王文涛; 史春红; 赖淼麟
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2019-01-25
Filing date: 2019-01-25
Publication date: 2023-08-11
Anticipated expiration: 2039-01-25
Also published as: CN109918939A

Abstract

The application provides a user query risk assessment and privacy protection method based on a hidden Markov model (Hidden Markov Model, HMM). Analyzing the characteristics of the user query, taking the index obtained by analysis as a quantitative index of the HMM, and establishing a user query risk assessment model; initializing model parameters; training the model according to the visible state sequence and the real state of the system; and finally, inquiring risk calculation and risk grade evaluation for the user when the user inquires. The application evaluates the security risk of user inquiry by using the HMM model, considers the dynamic property of each stage and reflects the risk state in real time. The high-intensity differential noise is adopted for high-risk query to reduce the risk of user query, and the low-intensity differential noise is adopted for low-risk query to protect, so that the risk of privacy disclosure during user query is effectively solved, and the privacy protection cost is saved. Meanwhile, the model has strong expansibility and can be applied to various online query services.

Description

HMM-based user query risk assessment and privacy protection method

Technical Field

The application relates to a user query risk assessment and privacy protection method based on an HMM model, and belongs to the field of WEB data query and privacy protection.

Background

In recent years, online query services bring great convenience to information retrieval of people, but also bring various privacy disclosure problems. When users use various online inquiry services, a series of digital marks containing personal information, interests and inquiry intentions are left, and the digital marks contain abundant sensitive information of the users, so that once the users are leaked, serious harm is caused to the users. Such as an attacker (an untrusted service provider or a third party marketer) deducing the user's true query intent by analyzing the user data traces. It is inferred what the user is looking for, when and under what circumstances the user initiates a query operation in order to provide more relevant and customized motivational content or advertising to induce the user to consume blindly or to deceptively. This leaves the user uncontrollable "curiosity" systems for misusing their personal information for targeted advertising and digital discrimination, which has raised serious public concern about privacy infringement.

In user inquiry, the traditional privacy protection method mainly focuses on identifiable aspects of privacy, namely sensitive information deletion, secure communication, anonymous inquiry and data confusion, so as to improve the privacy protection problem when the user inquires online. Although some work has been done, there still exist serious privacy disclosure problems, such as high encryption cost and poor flexibility, so that the above method is not widely used.

The existing privacy protection method mainly comprises query confusion and query coverage solutions. Query confusion is the prevention of accurate inferences about a user's query by a search server by generating a virtual query to be sent to the service provider along with the user's real query. An overlay query is an original query that is hidden from the user by employing a method of latent semantic indexing to generate a masked query. However, the problem of distinguishing the query risk during the query of the user is ignored in the above methods. In an actual query scene, each query of a user does not relate to privacy, namely, each query is not at high risk, and if the same privacy protection method is adopted, the problems of low accuracy, low query efficiency and the like of the query of the user caused by too high privacy protection intensity are easily caused.

The HMM model is an important probability model for sequence data processing and statistical learning, and has the characteristics of simple modeling, small data calculation amount, high running speed, high recognition rate and the like. HMM has been widely used in pattern recognition, part-of-speech tagging and information extraction, well combines qualitative and quantitative methods, and has relatively accurate evaluation accuracy.

Disclosure of Invention

The purpose of the application is that: the HMM is applied to user inquiry risk assessment and privacy protection, so that the problem that the user privacy is revealed when the user inquires is effectively solved, and the high-risk inquiry user is protected by adopting a privacy protection strategy.

In order to achieve the above object, the technical solution of the present application is to provide a method for evaluating risk and protecting privacy of user query based on HMM, which is characterized by comprising the following steps:

step 1, a user initiates a query request, and query feature analysis is performed according to query content contained in the user query request to obtain user query features;

step 2, establishing an HMM model based on user query characteristics;

initializing model parameters, and training the HMM model according to a visible state sequence of the HMM model and the real state of the system;

step 4, performing risk assessment and risk value calculation on query contents contained in a query request initiated by a user in real time by using the trained HMM model, and determining a query risk level;

step 5, adopting different privacy protection measures aiming at different inquiry risk levels: when the query risk level is high-risk query, reducing the query risk by adopting high-strength differential privacy noise; when the query risk level is low-risk query, low-intensity differential privacy noise is adopted to realize protection;

step 6, sending the privacy-protected result to a service provider, and inquiring the result by the service provider according to the inquiring requirement of the user;

and 7, returning the queried result by the service provider, and performing the result ranking operation on the user side to perform privacy protection again.

Preferably, in step 1, each time a user initiates a query request, there is a progressive and co-occurrence feature at the time of the user's query.

Preferably, the method for establishing the HMM model in the step 2 includes the following steps:

step 201, determining a visible state when a user inquires;

step 202, a hidden Markov quintuple parameter model is built, wherein the hidden Markov quintuple parameter model comprises a state transition probability matrix, a probability matrix of an observation vector, an initial state probability distribution vector, a state number and an observation symbol number.

Preferably, in step 201, the visible state contains all information of the system, and the observation in the current state is independent, and the query content of the user is related to the previous state only.

Preferably, in step 202, the safe state probability distribution of each link is the initial state probability distribution of the next link.

Preferably, in step 4, the user query sequence is set as x= (X) ₁ ,x ₂ ,…,x _n ) Wherein x is _i Representing the ith node, and each node has a transition probability P, then on a hidden markov model M, the probability that a query sequence X is observed is the sum of the probabilities on all possible paths:

where P (X|M) represents the joint distribution of the user query sequence and observations, q ₁ ,…,q _n Representing observation nodes, Q ^l Representing a set of observation nodes, P (q _k-1 →q _k ) Represents the ratio q _k-1 To q _k Transition probability of P (x) _k |q _k ) Representing for passing through x _k Query q _k Is a probability of (2).

Preferably, step 5 adopts different privacy protection methods for different query risks, and when the query of the user is a high-risk query, high-intensity differential privacy noise is adopted to reduce the risk of the query of the user. When the query of the user is a low-risk query, the low-intensity differential random noise is used for protection.

The application has the following advantages:

(1) The application provides a method for dynamically evaluating the risk of user inquiry, which is different from the existing static risk evaluation method in that the threat generated by user inquiry content is dynamically evaluated, and the size of the inquiry risk is different along with the change of the user inquiry time, the inquiry times and the like.

(2) In the privacy protection, privacy protection strategies with different intensities are adopted for protection according to different risks, so that the defect that the user inquiry risks are not distinguished and the same privacy protection strategy is adopted in the existing privacy protection method is overcome;

(3) Compared with the prior art, the risk assessment method and the system utilize the HMM model to assess the risk of the user query, consider the dynamics of each stage and obtain the risk state of the user query in real time, and the risk assessment model has strong expansibility and can be applied to various fields of online service.

Drawings

FIG. 1 is a flowchart of a method for risk assessment and privacy protection for user queries based on an HMM model;

fig. 2 is a user query risk assessment and privacy protection method model based on an HMM model.

Detailed Description

The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

1. User query model overview: as shown in FIG. 2, the method is a user query risk assessment and privacy protection method model based on an HMM model, and the whole process comprises three stages, namely user query and risk assessment, query privacy protection and query result return and re-ranking. (1) In the risk assessment stage of user inquiry, a user inputs keywords or contents to be inquired through a user terminal, and a client (privacy protection part) carries out risk assessment on the user inquiry to determine the risk involved in the user inquiry; (2) In the inquiry privacy protection stage, a high-strength privacy protection strategy is adopted for high-risk inquiry to reduce the user inquiry risk, and a mode of adding random differential noise is adopted for low risk, so that the user inquiry after privacy protection is transmitted to the SP through the internet; (3) And in the query result returning and renaming stage, the SP performs content retrieval according to the query keywords of the user and returns the query result. In order to prevent the unreliable SP from analyzing the user click, a user terminal is combined with the original query (the real query of the user) to perform a renaming operation on the new query (the query after privacy protection) result of the user, and the final user obtains the real query result.

2. Hidden Markov Model (HMM) overview: HMMs were developed based on markov chains. The events observed in HMM are random functions of states, so the model is a double random process, i.e. an observed state, a hidden state. HMMs have been widely used in pattern recognition, part-of-speech tagging, and information extraction.

3. HMM definition and analysis: for the sake of more clear description, HMM can be expressed in the form of 5-membered progenitor, e.g<N,M,π,A,B>Where N is the number of states in the model, the state set can be represented as s= { S ₁ ,…,S _N -a }; m is the number of observer symbols, and the observation set can be expressed as O= { O ₁ ,…,O _M Observation indicates the number of results that each state may output; pi represents an initial distribution state; a is a state transition probability matrix, i.e. a= (a) _ij ) Wherein a in A _ij Representing the time t from the state S _i Transition to state S _j Probability of (2); b represents the output probability matrix, i.e. b= { B _j (O _k )}，b _j (O _k )＝P(O _k |S _j ) Representing the state S at time t _j Output observed value O _k Is a probability of (2). The following is given as follows<N,M,π,A,B>Is expressed by the formula:

wherein pi represents the initial distribution state:

a is a state transition probability matrix, i.e

B is the probability matrix of the observation vector, i.e. b= (B) _ij ) _N×M (1≤j≤N,1≤k≤M)

b _jk Representing the probability of state k occurring in the case of state j, namely:

b _jk ＝p(V _k |a _j ),1≤j≤N,1≤k≤M

wherein V is _k Representing the observation state k, a _j The transition probability of a j state is represented, and p (·) represents the probability of a state occurring.

4. User query risk analysis: when the HMM model is established, the progressive and co-occurrence query characteristics of the user are combined, and analysis is carried out from the angles of transition probability and observation probability.

(1) The transition probability refers to the conditional probability that after the user gives the previous query data sequence, another query data of the user is obtained after a plurality of times, such as two connected nodes q ₁ ,q ₂ There is a transition probability P (q ₁ →q ₂ ). Since the risk that the user's query data is distinguishable in the continuous query scenario depends on the user's previous query data, if the previous data in the same topic is considered, the information gain of the data becomes high. Set X _t For personally identifiable sensitive information or sensitive subject matter in HMM, X _X The transition probability between nodes is p (X _t |X _t-1 ) The number of transitions that have occurred between nodes can be weighted, i.eThen, privacy risk, namely alpha-p (X, in the user query is calculated according to the method of the weighted transition probability _t |X _t-1 )。

(2) The observation probability refers to the query behavior that may occur at a node, such as user u _i The probability of e being queried through q is P (e|q). The value is obtained by analysis and calculation based on historical query data of the user, and each node comprises a group of nodes with observation probabilityThese observation probabilities are modeled as the observations of different users in the previous data (p (u) _i |X _t ) Given data X found in the database) _t Is a probability of (2). The more data a user queries for a particular topic, the higher the accuracy of the inference of the user's interest data, and the higher the risk of the query. Similarly, the query risk is determined by adopting a weighted counting mode, namelyBecause the more uniform the users, the higher the query privacy risk, i.e. p (u) _i |X _t )。

In the scene of continuous inquiry of the user, the current inquiry of the user is set as X _T Only with the previous query X _T-1 And (5) correlation. If the current query result does not meet the query requirement of the user when the user queries a complex problem (medical knowledge, etc.), the user adds or deletes the previous query content, so that the first-order markov property is met.

5. User query risk assessment:

let user u _i Query sequence X ₁ ,X ₂ ,…,X _n Then the observation result output for the user query sequence is Y ₁ ,Y ₂ ,…,Y _n . User u is available _i The joint distribution of query sequences and observations is:

user u can be calculated _i Is a query sequence (X) ₁ →X ₂ →…→X _n ) The overall privacy risk generated is:

wherein (HMM|u) _i ) Representing user u _i Privacy probability lists for all paths of (a), including nodes with user observation probability greater than 0, and finally, the user can query the sequence asX ₁ ,X ₂ ,…,X _n The query risk at the time is p (X ₁ ,…,X _n |u _i )。

6. User query risk classification: we assume that the user queries five states, A1-A5, where A1 represents a normal safe state, A5 represents a significant dangerous condition, and A2, A3, A4 represent progressively deeper levels of danger. If represented by probabilities, the probabilities of the risk levels represented by the respective states are shown in Table 1.

Table 1 risk ranking of user queries

The query risk classification of the application meets the requirements of GB/T33132-2016 information security technology and information security risk processing implementation guidelines and GB/T31722-2015 information technology, security technology and information security risk management.

7. Query risk privacy protection:

the higher the query risk level of the user, the stronger the privacy protection strength should be. In inquiry privacy protection, the application adopts a differential privacy protection technology to add random noise to the inquiry of the user so as to realize privacy protection. Differential privacy has good effect in resisting background attacks and the like. Let Q be a set of query functions, epsilon-differential privacy can be achieved by adding random noise r, i.eWhere r is random noise. The magnitude of the privacy protection intensity is realized by r, and the larger r represents the more random noise is added, and the higher the privacy protection intensity is. When the user's query is a high risk query, more random noise needs to be added, and if the user's query is a low query, less query noise needs to be added.

8. Query result return and user side re-ranking:

and returning the query result to the user side through the Internet according to the query request SP of the user. The returned result of the SP query contains both the real query and the false query (noise query) of the user, so that the user side performs the renaming operation on the returned result. The application combines the results returned by the query with the ranked search results in the client user profile. Specifically, ranking scores corresponding to locations where candidate documents appear in each of the ranker's ranking list are assigned, and the candidates are ranked according to the composite ranking score:

wherein R is _i (d) Representing a ranking of i for data d, α controls the weight of each query result. In the ranking stage, query results are ranked based on the original user query.

In the renaming stage, the application sets the occurrence frequency threshold beta and the false access mechanism to mask the real inquiry of the user besides considering the weight of each page and the similarity problem with the real page. The occurrence threshold β represents that the actual queries that occur in the results page cannot exceed β pieces of information, and this mechanism is known only to the user, who would know that the results are actual results in the results return page when accessing the final results. While in order to confuse an attacker (an untrusted service provider or a third party marketer), the system of the present application will randomly click on a fake query in the returned results, but this process does not require user involvement. Finally, the purpose of dual privacy protection of user inquiry is achieved.

While the basic principle, main features and embodiments of the present application have been described above, the present application is not limited to the above-described embodiments, and various changes and modifications may be made without departing from the spirit and scope of the present application. Accordingly, unless such changes and modifications depart from the scope of the present application, they should be construed as being included therein.

Claims

1. The user query risk assessment and privacy protection method based on the HMM is characterized by comprising the following steps of:

step 2, based on user query characteristics, establishing an HMM model, wherein the method for establishing the HMM model comprises the following steps:

step 201, determining a visible state of a user when inquiring, wherein the visible state contains all information of a system, the observation under the current state is independent, and the inquiring content of the user is only related to the previous state;

step 202, establishing a hidden Markov quintuple parameter model, wherein the hidden Markov quintuple parameter model comprises a state transition probability matrix, a probability matrix of an observation vector, an initial state probability distribution vector, a state number and an observation symbol number, and the safety state probability distribution of each link is the initial state probability distribution of the next link;

step 4, performing risk assessment and risk value calculation on query contents contained in a query request initiated by a user in real time by using the trained HMM model, and determining a query risk level, wherein:

when the HMM model is established, the progressive and co-occurrence query characteristics of the user are combined, and analysis is carried out from the angles of transition probability and observation probability:

the transition probability refers to the conditional probability that after the user gives the previous query data sequence, the user obtains another query data of the user after a plurality of times, and two nodes q are connected ₁ ,q ₂ There is a transition probability P (q ₁ →q ₂ ) The method comprises the steps of carrying out a first treatment on the surface of the Set X _t For personally identifiable sensitive information in HMM, then X _t The transition probability between nodes is p (X _t |X _t-1 ) Count (X) of the number of transitions that have occurred between nodes _t |X _t-1 ) Calculated to obtainAlpha is a weight; and then transferred according to the weightingProbability method calculates privacy risk in user query, i.e. alpha p (X _t |X _t-1 )；

The observation probability refers to the query behavior that may occur at a node, such as user u _i The probability of e being queried over q is P (e|q), which is analyzed and calculated based on the historical query data of the user, each node contains a set of observations with observation probabilities, which are modeled as the probability of the observation of the different user in the previous data (P (u) _i |X _t ) Given data X found in the database) _t The more data a user queries for a particular topic, the higher the accuracy of inference on the user's interest data, the higher the risk of the query; determining query risk β p (u) _i |X _t ) Wherein, beta is the weight,

in the scene of continuous inquiry of the user, the current inquiry of the user is set as X _T Only with the previous query X _T-1 Related, if the user queries a complex problem, and the current query result does not meet the query requirement of the user, the user can add or delete the previous query content, so that the first-order Markov property is met;

user query risk assessment:

let user u _i Query sequence X ₁ ,X ₂ ,…,X _n Then the observation result output for the user query sequence is Y ₁ ,Y ₂ ,…,Y _n User u _i The joint distribution of query sequences and observations is:

calculate user u _i Is a query sequence (X) ₁ →X ₂ →…→X _n ) The overall privacy risk generated is:

wherein, (HMM|u) _i ) Representing user u _i Privacy probability lists for all paths of (a), including nodes with user observation probability greater than 0, and finally, the user can query sequence X ₁ ,X ₂ ,…,X _n The query risk at the time is p (X ₁ ,…,X _n |u _i )；

User query risk classification: the user inquires five states, A1-A5, wherein A1 represents a normal safety state, A5 represents a serious dangerous condition, and A2, A3 and A4 represent gradual deepening of dangerous levels; expressed in terms of probabilities, the probability of the risk level expressed by each state is as follows: