CN111292008A - Privacy protection data release risk assessment method based on knowledge graph

Info

Publication number: CN111292008A
Application number: CN202010139728.8A
Authority: CN (China)
Legal status: Pending at publication; the application was later rejected after publication (RJ01)
Inventors: 王瑞锦, 张凤荔, 何兴高, 张巍琦, 唐榆程, 郭鹏宇, 谭琪
Assignee (original and current): University of Electronic Science and Technology of China
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2020-06-16
Other languages: Chinese (zh)
Prior art keywords: risk, data, user, risk assessment, attribute


Classifications

    • G06Q10/0635 - Risk analysis of enterprise or organisation activities
    • G06F16/284 - Relational databases
    • G06F16/367 - Creation of semantic tools; Ontology
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention discloses a privacy protection data release risk assessment method based on a knowledge graph, which comprises the following steps: acquiring the information submitted by a data applicant and judging whether the basic information meets the specification; mapping the data applicant's information into an RDF data set and then converting the RDF data set into graph data in a knowledge graph; on the basis of the knowledge graph, completing the basic information risk assessment, identity anomaly risk assessment, group fraud risk assessment and individual credit risk assessment of the data applicant with the corresponding algorithms; combining all risk assessment data to construct a risk model and computing a risk score for the data applicant; and labeling the comprehensive risk assessment score to obtain a risk assessment conclusion and the assessment result of each specific risk item. The scheme automatically extracts the data applicant's information and analyzes its risk, actively protects the privacy protection data release process, greatly reduces the workload of manual review, and describes the risk of privacy protection data release more intuitively.

Description

Privacy protection data release risk assessment method based on knowledge graph
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a privacy protection data release risk assessment method based on a knowledge graph.
Background
With the widespread use of big data technology, data has become one of the most important assets of many companies. Existing privacy protection data publishing systems only evaluate the privacy disclosure risk of the desensitized data itself and neglect to actively assess the risk posed by the data applicant. Although a malicious attacker may be unable to steal private data by technical means, the attacker can still obtain it through social engineering: for example, by acquiring different data sets through user impersonation, group fraud and similar means, and then recovering the private data through data analysis. The knowledge graph is a graph-structure-based technology that can rapidly analyze the relationships between its nodes. Mapping data applicants into a knowledge graph and analyzing their implicit relationships on that basis can therefore effectively prevent private data from being stolen through social engineering.
Disclosure of Invention
The invention aims to provide a privacy protection data release risk assessment method based on a knowledge graph that can effectively prevent private data from being stolen through fraud. The purpose of the invention is achieved through the following technical scheme:
a privacy protection data release risk assessment method based on a knowledge graph comprises the following steps:
S1, acquiring the information of a data applicant, mapping the acquired information into an RDF data set, and converting the RDF data set into graph data in a knowledge graph;
s2, detecting the basic information of the data applicant based on the knowledge graph to finish the basic information risk assessment;
s3, performing identity anomaly detection on the data applicant by using an anomaly detection algorithm based on the knowledge graph to finish identity anomaly risk assessment;
S4, carrying out community division on the data applicant groups by using a community discovery algorithm based on the knowledge graph, calculating group fraud risk, and finishing group fraud risk assessment of the data applicant;
s5, carrying out individual credit calculation analysis on the data applicant by using an improved personalized PageRank algorithm based on the knowledge graph to finish individual credit risk assessment of the data applicant;
s6, constructing a risk model by combining all risk evaluation data, and carrying out risk scoring on the data applicant according to the evaluation standard to complete the comprehensive risk evaluation of the data applicant;
and S7, processing the scores of the comprehensive risk assessment by adopting a hierarchical labeling method, and summarizing to obtain a risk assessment conclusion and a specific risk item assessment result.
Further, the step S1 includes the following sub-steps:
s101, generating a mapping file according to a logic table of a relational database;
s102, analyzing the mapping file to obtain mapping elements contained in the mapping file;
S103, analyzing the mapping elements to obtain their sub-elements, the logic table and the mapping rules for the attribute columns of the logic table;
s104, obtaining tuples in the logic table from the relational database, and mapping corresponding attribute columns in the tuples into RDF terms according to mapping rules;
and S105, combining the obtained RDF terms into RDF triples and outputting the RDF triples to an RDF data set.
Further, the identity anomaly risk assessment in step S3 includes the following sub-steps:
S301, a target user to be detected u_test = {p_1^test, p_2^test, ..., p_n^test} is given, where p_i^test is the i-th attribute of the target user;
S302, a normal user set U = {u_1, u_2, ..., u_m} is given, and the k-th attribute of each normal user is extracted to obtain the attribute set P_k = {p_k^1, p_k^2, ..., p_k^m}, where p_k^j denotes the k-th attribute of the j-th user;
S303, l attributes are extracted from each normal user to form the multi-user multi-attribute set Muti_UP = {P_1, P_2, ..., P_l}, and the corresponding attributes of the target user to be detected are extracted to form the attribute set to be detected P_Test = {p_1, p_2, ..., p_l};
S304, the multi-user multi-attribute set Muti_UP is mapped into an l-dimensional clustering space and clustered; the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same clustering space, and the anomaly detection result is calculated with an anomaly detection algorithm to finish the identity anomaly risk assessment.
Further, the group fraud risk assessment in step S4 includes the following sub-steps:
S401, a fraudulent user sample set U_F = {u_1^F, u_2^F, ..., u_n^F} is given, where u_i^F is a fraudulent user sample whose attribute set is P^F = {p_1^F, p_2^F, ..., p_m^F}, and p_j^F is the j-th attribute of the fraudulent user sample;
S402, the fraud group set is initialized to empty, i.e. Group = ∅;
S403, l attributes are selected from the m attributes of the fraudulent users to form an attribute subset P' = {p_1^F, p_2^F, ..., p_l^F};
S404, all fraudulent users are classified with a community discovery algorithm according to the l attributes, fraudulent users with similar characteristics are grouped into one class, and the user classification set U' = {U_1, U_2, ..., U_p} is finally obtained, in which each element represents one type of fraud group; the different types of fraud groups are added as elements to the fraud group set, giving Group = {U_1, U_2, ..., U_p} and finishing the group fraud risk assessment.
Further, the individual credit risk assessment in step S5 includes the following sub-steps:
S501, a user relationship network U = <G_U, V_U> is given, where G_U is the set of user nodes in the relationship network and V_U is the set of edges in the relationship network;
S502, a user node u with risk weight w is assumed, and the n user nodes connected to the user node u are U = {u_1, u_2, ..., u_n};
S503, assuming that an adverse credit event occurs at the user node u, a time correlation function δ(u, t) propagates the risk weight of node u to the nodes connected to u;
S504, all nodes are traversed with an improved personalized PageRank algorithm to complete the risk weight propagation calculation for adverse credit events, and all users in the user relationship network are finally ranked by risk weight to obtain the user risk ranking set, finishing the individual credit risk assessment.
Further, the risk model constructed in step S6 is:
score(u) = μ(B, F)
wherein μ is a risk scoring function, B represents the basic information of the data user, and F represents the assessment results of the data user's identity anomaly risk, group fraud risk and individual credit risk.
The invention has the beneficial effects that:
(1) the work of manually reviewing the data applicant's information is greatly reduced, and active protection is provided for the privacy protection data release process;
(2) the background information, association relations and other information of the data applicant are extracted automatically and the risk is analyzed;
(3) a quantitative and qualitative risk assessment scheme is provided, so that the risk of privacy protection data release is described more intuitively;
(4) the complexity of verifying a data applicant's identity in a complex relationship network is reduced, and the risk assessment result is finally expressed semantically through labeling, making it intuitive and easy to understand.
Drawings
FIG. 1 is a diagram of the method steps of the present invention.
FIG. 2 is a diagram of the hierarchical tagging methodology of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In an embodiment of the present invention, as shown in fig. 1, a privacy-preserving data publishing risk assessment method based on a knowledge graph includes the following steps: acquiring information of a data applicant, mapping the acquired information into an RDF data set, and converting the RDF data set into graph data in a knowledge graph; detecting the basic information of the data applicant based on the knowledge graph to finish the basic information risk assessment; performing identity anomaly detection on the data applicant by using an anomaly detection algorithm based on the knowledge graph to finish identity anomaly risk assessment; carrying out community division on the data applicant groups by using a community discovery algorithm based on a knowledge graph, calculating group fraud risk, and finishing group fraud risk evaluation of the data applicant; carrying out individual credit calculation analysis on the data applicant by using an improved personalized PageRank algorithm based on a knowledge graph to finish individual credit risk assessment of the data applicant; combining all risk evaluation data, constructing a risk model, and carrying out risk scoring on the data applicant according to an evaluation standard to complete comprehensive risk evaluation of the data applicant; and processing the scores of the comprehensive risk assessment by adopting a hierarchical labeling method, and summarizing to obtain a risk assessment conclusion and a specific risk item assessment result.
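By way of orientation, the following is a minimal runnable Python sketch of how the seven steps might be wired together; every function here is a dummy stand-in for the module developed in the corresponding subsection below, and none of the names or numeric values come from the patent itself.

```python
# A toy skeleton of steps S1-S7; every function body is a dummy
# stand-in for the module described in the corresponding subsection.

def map_to_knowledge_graph(info):          # S1: info -> RDF -> graph data (stub)
    return [("applicant:u1", key, str(value)) for key, value in info.items()]

def basic_info_risk(kg):                   # S2: basic information check (stub)
    return 8.0                             # 10-point scale

def identity_anomaly_risk(kg):             # S3: LOF-based in the full method (stub)
    return 7.5

def group_fraud_risk(kg):                  # S4: community discovery (stub)
    return 9.0

def individual_credit_risk(kg):            # S5: personalized PageRank (stub)
    return 6.5

def comprehensive_score(b, f_parts):       # S6: score(u) = mu(B, F); mu is a mean here
    return (b + sum(f_parts) / len(f_parts)) / 2

kg = map_to_knowledge_graph({"name": "Li Lei", "purpose": "research"})
f_parts = (identity_anomaly_risk(kg), group_fraud_risk(kg), individual_credit_risk(kg))
total = comprehensive_score(basic_info_risk(kg), f_parts)
print("composite risk score:", total)      # S7 labels this score (see Fig. 2)
```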
Further, the step of obtaining the data applicant's information and converting it into the knowledge graph also includes judging whether the basic information submitted by the data applicant meets the specification.
The specific process of mapping the data applicant's information into the knowledge graph is as follows:
Data applicant information is generally stored as structured data and as unstructured text, a form that is not conducive to exploring the deep information and implicit relationships between data applicants. The invention therefore maps the data applicant's information into an RDF data set and then converts the RDF data set into graph data in a knowledge graph. The mapping process is described as follows:
(1) generating a mapping file according to a logic table of the relational database;
(2) analyzing the mapping file to obtain mapping elements contained in the mapping file;
(3) analyzing the mapping elements to obtain their sub-elements, the logic table and the mapping rules for the attribute columns of the logic table;
(4) obtaining tuples in the logic table from a relational database, and mapping corresponding attribute columns in the tuples into RDF terms according to a mapping rule;
(5) and combining the obtained RDF terms into RDF triples and outputting the RDF triples to the RDF data set.
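As an illustration of steps (1)-(5), the following Python sketch maps one tuple of a relational logic table into RDF triples with the rdflib library; the logic table, the namespace and the column-to-predicate mapping rules are invented for the example and are not specified by the patent.

```python
# A minimal sketch of the relational-to-RDF mapping; the "applicant"
# logic table, the namespace and the mapping rules are illustrative.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/applicant/")      # hypothetical namespace

# one tuple of the logic table, plus a rule mapping columns to predicates
row = {"id": "u001", "name": "Zhang San", "org": "Example Lab"}
mapping_rules = {"name": EX.name, "org": EX.organization}

g = Graph()
subject = EX[row["id"]]                              # RDF term for the tuple
for column, predicate in mapping_rules.items():
    # map each attribute column to an RDF term and emit one triple
    g.add((subject, predicate, Literal(row[column])))

print(g.serialize(format="turtle"))                  # the resulting RDF data set
```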
The specific process of the identity anomaly risk assessment of the data applicant's information is as follows:
Anomaly detection is a representative unsupervised learning method whose goal is to find the points, or sets of points, in the data that have anomalous properties. In the data applicant identity anomaly risk assessment, anomaly detection is used to identify illegal users who attempt to disguise themselves as normal users with seemingly legitimate information. The identity anomaly risk assessment task is described as follows:
(1) a target user to be detected u_test = {p_1^test, p_2^test, ..., p_n^test} is given, where p_i^test is the i-th attribute of the target user;
(2) a normal user set U = {u_1, u_2, ..., u_m} is given, and the k-th attribute of each normal user is extracted to obtain the attribute set P_k = {p_k^1, p_k^2, ..., p_k^m}, where p_k^j denotes the k-th attribute of the j-th user;
(3) l attributes are extracted from each normal user to form the multi-user multi-attribute set Muti_UP = {P_1, P_2, ..., P_l}, and the corresponding attributes of the target user to be detected are extracted to form the attribute set to be detected P_Test = {p_1, p_2, ..., p_l};
(4) the multi-user multi-attribute set Muti_UP is mapped into an l-dimensional clustering space and clustered; the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same clustering space, and the anomaly detection result is calculated with an anomaly detection algorithm.
As the above task shows, the anomaly detection algorithm is the key part of the whole user identity anomaly detection and directly determines the detection result; the Local Outlier Factor (LOF) algorithm is therefore selected for the anomaly detection.
The basic idea of the LOF algorithm is to calculate the ratio of the average density around a target sample point's neighbours to the density at the target sample point itself. The ratio is compared against 1: when it is greater than 1, the larger the value, the lower the density at the target sample point relative to its neighbours, and the more likely the target sample point is an outlier.
The LOF algorithm is defined as follows:
(1) Distance between two points: let p, o be two points in a given set C, with the distance between them denoted d(p, o);
(2) k-th distance (k-distance): the k-th distance of point p, denoted d_k(p), is the distance d(p, o) to a point o such that at least k points o' ∈ C \ {p} satisfy d(p, o') ≤ d(p, o), and at most k-1 points o' ∈ C \ {p} satisfy d(p, o') < d(p, o);
(3) k-distance neighborhood (k-distance neighborhood of p): the k-distance neighborhood of point p, denoted N_k(p), contains all points whose distance from p does not exceed the k-th distance (including points at exactly the k-th distance), so the number of points in it satisfies |N_k(p)| ≥ k;
(4) Reachability distance (rd): the k-th reachability distance from point p to point o is:
rd_k(p, o) = max{ d_k(o), d(p, o) }
that is, the reachability distance from p to o is at least the k-th distance of o: for the k points nearest to o, the reachability distance from them to o equals d_k(o);
(5) Local reachability density (lrd): the inverse of the average reachability distance from point p to the points in its k-distance neighborhood, i.e.:
lrd_k(p) = |N_k(p)| / ( Σ_{o ∈ N_k(p)} rd_k(p, o) )
If point p and its surrounding neighborhood points belong to the same cluster, the reachability distances are more likely to take the small value d_k(o), the sum of the reachability distances is small, and the local reachability density is high; if point p is far from its neighborhood points, the reachability distances are more likely to take the larger value d(p, o), the sum is large, the local reachability density is low, and p is more likely an outlier;
(6) Local outlier factor (lof): the average ratio of the local reachability density of the neighborhood points N_k(p) to the local reachability density of point p itself, i.e.:
LOF_k(p) = ( Σ_{o ∈ N_k(p)} ( lrd_k(o) / lrd_k(p) ) ) / |N_k(p)|
If LOF_k(p) ≈ 1, point p likely belongs to the same cluster as its neighborhood; if LOF_k(p) < 1, the density at p is higher than that of its neighborhood; if LOF_k(p) > 1, the larger the value, the lower the density at p relative to its neighborhood and the more likely p is an outlier.
In summary, densities are obtained from the distances between points, and each point p is judged to be an outlier or not by comparing its density with that of its neighboring points: the lower the density at p, the more likely it is an outlier. When the user identity anomaly detection is performed, the multi-user multi-attribute set Muti_UP is mapped into the clustering space, and the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same space and used as the point p in the LOF algorithm.
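For illustration, the following sketch runs such an LOF computation with scikit-learn's LocalOutlierFactor; the two-attribute toy data are assumptions, and novelty=True is used so that a model fitted on the normal users can score the separately supplied target point, matching the task description above.

```python
# A sketch of the identity anomaly check, assuming l=2 numeric
# attributes per user; all values are toy data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Muti_UP: two attributes for each of five normal users, i.e. the
# multi-user multi-attribute set mapped into a 2-dimensional space
normal_users = np.array([[0.9, 1.1], [1.0, 1.0], [1.1, 0.9],
                         [0.95, 1.05], [1.05, 0.95]])
# P_Test: the same attributes extracted from the target user
target_user = np.array([[5.0, 5.0]])

# novelty=True lets a model fitted on normal users score new points
lof = LocalOutlierFactor(n_neighbors=3, novelty=True).fit(normal_users)

# score_samples returns the negative LOF; LOF values well above 1 mean
# the target's local density is far below its neighbours' density
lof_value = -lof.score_samples(target_user)[0]
print("LOF =", lof_value, "->", "anomalous" if lof_value > 1.5 else "normal")
```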
The specific process of the group fraud risk assessment of the data applicant's information is as follows:
A group is a collection of individuals that are closely connected to each other and show some similarity in behavior and attributes; it is characterized by tight relationships between individuals within a community and sparse relationships between communities. In the group fraud risk assessment of data applicants, a community discovery algorithm is used to identify fraud groups among the data applicants. The invention describes the group fraud risk assessment task as follows:
(1) a fraudulent user sample set U_F = {u_1^F, u_2^F, ..., u_n^F} is given, where u_i^F is a fraudulent user sample whose attribute set is P^F = {p_1^F, p_2^F, ..., p_m^F}, and p_j^F is the j-th attribute of the fraudulent user sample;
(2) the fraud group set is initialized to empty, i.e. Group = ∅;
(3) l attributes are selected from the m attributes of the fraudulent users to form an attribute subset P' = {p_1^F, p_2^F, ..., p_l^F};
(4) all fraudulent users are classified according to the above attributes, fraudulent users with similar characteristics are grouped into one class, and the user classification set U' = {U_1, U_2, ..., U_p} is finally obtained, in which each element represents one type of fraud group; the different types of fraud groups are added as elements to the fraud group set, giving Group = {U_1, U_2, ..., U_p}.
As the above task shows, the discovery of fraud groups is its key part. Community discovery algorithms use the information contained in the graph topology to extract the modular community structure of a complex network; studying this problem helps to investigate the modules, functions and evolution of the whole network in a divide-and-conquer manner and to understand the organizing principles, topology and dynamics of complex systems more accurately. Such algorithms are commonly used to identify fraud groups, so the invention uses a modularity quantification algorithm to discover them.
The modularity quantification community discovery algorithm quantifies the characteristics of communities and divides them by comparing the quantified results. It performs well, can process large-scale networks, can discover communities of different granularities and, most importantly, discovers communities automatically without the number of communities being specified in advance. The basic idea of the algorithm is: starting from the modularity, observe the effect of moving nodes in the network on the modularity gain, and merge the communities with the largest effect on the modularity increment until the modularity no longer increases, which yields the final community division.
The modularity quantification community discovery algorithm is defined as follows:
(1) Modularity (modularity): the modularity of a network is written as:
Q = (1 / 2m) Σ_{i,j} [ A_ij - k_i k_j / 2m ] δ(c_i, c_j)
where m is the number of edges in the entire network, A_ij is the weight between node i and node j, k_i and k_j are the sums of the weights of the edges attached to node i and node j respectively, and c_i denotes the community to which node i belongs; if i and j belong to the same community then δ(c_i, c_j) = 1, otherwise δ(c_i, c_j) = 0.
(2) Modularity gain: assuming that there are N nodes in the network, each node is first assigned its own community, giving N communities. Then, for each node i and each of its neighboring nodes j, the modularity increment obtained by moving node i from its own community into the community of its neighbor j is:
ΔQ = [ (Σ_in + k_i,in) / 2m - ( (Σ_tot + k_i) / 2m )^2 ] - [ Σ_in / 2m - ( Σ_tot / 2m )^2 - ( k_i / 2m )^2 ]
where m is the sum of the weights of all edges in the network, Σ_in is the sum of the weights of all edges between nodes within the community, Σ_tot is the sum of the weights of all edges incident to nodes in the community, k_i is the sum of the weights of the edges incident to node i, and k_i,in is the sum of the weights of the edges from node i to the nodes in the community.
(3) Dividing communities by maximizing the modularity gain: node i is moved into the community of the node j with the largest modularity gain; if no community with a positive modularity gain can be found, node i stays in its original community. This process is repeated until no move of any node increases the modularity.
(4) Iteratively dividing communities: the communities obtained in step (3) are taken as the nodes of a new network, in which the weight between two new nodes is the sum of the original weights between the nodes they contain; the division of step (3) is then applied iteratively. When the maximum modularity is reached or the network no longer changes, the optimal division of the network has been obtained and the iteration stops.
In summary, when the group fraud risk assessment is performed, the users in the given user set are taken as the nodes of the modularity quantification group discovery algorithm, attributes such as business connections and exchanges of interest in the attribute set are taken as the edges between nodes, and the community division is then carried out, finally yielding the fraudulent user division set Group = {U_1, U_2, ..., U_p} and finishing the group fraud risk assessment of the data applicants.
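As a concrete example, the following sketch performs the modularity-based community division with NetworkX's Louvain implementation on a toy applicant network; the nodes, edge weights (standing in for business-connection and interest-exchange attributes) and the seed are invented for the example.

```python
# A sketch of the group division step: applicants are nodes, shared
# business/interest links are weighted edges; the graph is a toy one.
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.Graph()
# two tightly connected triangles joined by a single weak link
G.add_weighted_edges_from([("u1", "u2", 1.0), ("u2", "u3", 1.0), ("u1", "u3", 1.0),
                           ("u4", "u5", 1.0), ("u5", "u6", 1.0), ("u4", "u6", 1.0),
                           ("u3", "u4", 0.1)])

# Louvain: repeatedly move each node to the neighbouring community with
# the largest modularity gain, then contract communities and iterate
groups = louvain_communities(G, weight="weight", seed=42)
print("Group =", [sorted(c) for c in groups])
print("Q =", modularity(G, groups, weight="weight"))
```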
The specific process of the individual credit risk assessment of the data applicant's information is as follows:
An individual's credit changes with the events associated with the individual: if an adverse credit event happens to the individual, or the individual is affected by one, the individual's credit decreases, and vice versa; meanwhile, the effect of an adverse event on the individual gradually weakens over time. In the individual credit risk assessment process, the individual credit risk analysis task is described as follows:
(1) a user relationship network U = <G_U, V_U> is given, where G_U is the set of user nodes in the relationship network and V_U is the set of edges in the relationship network;
(2) a user node u with risk weight w is assumed, and the n user nodes connected to the user node u are U = {u_1, u_2, ..., u_n};
(3) assuming that an adverse credit event occurs at the user node u, a time correlation function δ(u, t) propagates the risk weight of node u to the nodes connected to u;
(4) all nodes are traversed while the risk weight propagation for adverse credit events is completed, and all users in the user relationship network are finally ranked by risk weight to obtain the user risk ranking set.
As the above task shows, the calculation of node risk is its key part, and the invention selects an improved personalized PageRank algorithm to calculate the individual credit risk of the data applicant. The basic idea of the traditional PageRank algorithm is: in a directed graph, a user starts visiting from an arbitrary node; when jumping to the next node, the user either follows one of the outgoing edges of the current node, chosen at random, with probability (1 - c), or jumps to an arbitrary node with probability c and starts a new round of the random walk. This process is repeated until the probability of the user staying at each node becomes stable; for a node p, the more nodes point to p, the larger its weight, namely:
r=(1-c)Mr+cu
where r is the PageRank vector, representing the probability that each node is visited, c is the probability of restarting the random walk, u gives the probability of each node being selected when the random walk restarts (in PageRank every node is selected with equal probability), and M is the normalized adjacency matrix.
However, the traditional PageRank algorithm does not match reality: in actual usage scenarios, the nodes of the network are not selected with equal probability when the random walk restarts; the selection is biased by the user's preferences. The PageRank algorithm was therefore improved, yielding the personalized PageRank algorithm.
The personalized PageRank algorithm assumes that, when the random walk restarts, the jump target is not selected uniformly at random but is drawn from a specific set of nodes; likewise, when the node weights are initialized, the nodes in this specific set are treated differently from the others. When the steady state is computed, the nodes preferred by the user, and the nodes related to them, obtain larger weights. For a node p, the personalized PageRank is computed as:
r=(1-c)Mr+cv
where v is the user's preference vector, expressing the importance of each node in the relationship network under the given preferences.
However, in the individual credit risk analysis problem, the weight influence between two nodes depends not only on the adverse events themselves but also on when they occurred: by common knowledge, the more recently an adverse event occurred, the larger its influence on the current situation, and vice versa. The personalized PageRank algorithm does not consider the influence of time on node weights.
To address this, the personalized PageRank algorithm is further improved by adding a time decay factor to the adjacency matrix. Let δ be an exponential time decay function:
δ(t)=e^(-βt)
where β is a decay constant indicating the rate at which the influence of past information decreases, and t is the interval between the time the adverse event occurred and the current time (t = 0 means the adverse event is occurring now). The original adjacency matrix M is transformed through the exponential time decay function into a weight matrix W with time decay; the PageRank value is then computed as:
r=(1-c)Wr+cv
Meanwhile, in this task the degree of a node should be independent of the weight assigned to it. In the personalized PageRank, the term (1-c)Wr expresses that the weight influence caused by a node's adverse event is spread over its neighbor nodes; with equal weights, however, a high-degree node propagates a smaller weight influence to each neighbor, and a low-degree node a larger one. The improved personalized PageRank algorithm therefore amplifies the weight influence of high-degree nodes, ensuring that the neighbors of nodes of different degrees receive weight influence on the same scale during propagation. The PageRank value is then computed as:
r=(1-c)Wr+cz'
where v is the user's preference vector, d is the vector of node degrees, z is the element-wise product of v and d, and z' is obtained by normalizing z.
The improved personalized PageRank algorithm is executed iteratively on all nodes of the relationship network, finally yielding a user risk ranking based on the influence of adverse events and finishing the individual credit risk assessment.
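The following numpy sketch illustrates the improved iteration r=(1-c)Wr+cz' on a toy four-node network; the adjacency matrix, event ages, decay constant and restart probability are assumptions, while the time decay and the degree-scaled preference vector z' follow the definitions given above.

```python
# A sketch of the time-decayed personalized PageRank r=(1-c)Wr+cz'
# on a toy 4-node user network; all numeric values are assumptions.
import numpy as np

c, beta = 0.15, 0.5                        # restart probability, decay constant
A = np.array([[0, 1, 1, 0],                # adjacency matrix of the user network
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
t = np.array([[0, 2, 1, 0],                # time since the adverse event on each edge
              [2, 0, 3, 0],
              [1, 3, 0, 1],
              [0, 0, 1, 0]], dtype=float)

W = A * np.exp(-beta * t)                  # apply delta(t) = e^(-beta*t) to the weights
W = W / W.sum(axis=0, keepdims=True)       # column-normalize the decayed matrix

v = np.array([1.0, 0.0, 0.0, 0.0])         # preference: the adverse event is at node 0
d = A.sum(axis=1)                          # node degrees
z = v * d                                  # scale the preference by degree ...
z_prime = z / z.sum()                      # ... and normalize to obtain z'

r = np.full(4, 1 / 4)                      # start from the uniform distribution
for _ in range(100):                       # power iteration until (near) convergence
    r = (1 - c) * W @ r + c * z_prime
print("risk ranking:", np.argsort(-r))     # user indices sorted by risk weight
```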
The specific process of the comprehensive risk assessment of the data applicant's information is as follows:
To assess the comprehensive risk of the data applicant, a risk model is constructed by combining the preceding data:
score(u)=μ(B,F)
where μ is a risk scoring function, B represents the basic information of the data user, and F represents the assessment results of the data user's identity anomaly risk, group fraud risk and individual credit risk. The model expresses that the risk score of a data user u depends on the basic information, the identity anomaly risk, the group fraud risk and the individual credit risk. The invention scores each item on a 10-point scale and performs the comprehensive risk assessment of the data applicant.
First, the basic information B of the data applicant is submitted actively by the data applicant and needs to be evaluated in terms of authenticity, completeness and objectivity; the invention provides the evaluation standard shown in Table 1.
Table 1: Evaluation standard for the data applicant's basic information
(The body of Table 1 is published as images in the original document; it lists the scoring criteria for the authenticity, completeness and objectivity of the basic information.)
For the identity information F of the data user, the score of F is derived from the identity detection results and consists of three parts: the anomaly detection score F_LOF, the group fraud risk score F_Fraud and the individual risk score F_PR.
For the local outlier factor LOF, given a user u and the set of other normal users U = {u_1, u_2, ..., u_n}, the closer LOF(u) is to 1, the more likely u is a normal user and the closer F_LOF is to 10 points.
For the group fraud detection, given a user u, a known fraud group a and a known normal group b, the user u and the two groups are repeatedly divided with the community discovery algorithm. Let f_b denote the frequency with which u is divided into one group with the normal group b, and f_a the frequency with which u is divided into one group with the fraud group a; the probability that the user is a member of the fraud group is then f = f_a / (f_a + f_b). The operation is repeated to obtain the probabilities of all users; the resulting probabilities are ranked from high to low and divided evenly into 10 levels, each corresponding to one score from 1 to 10. The higher the probability that the users of a level are individuals of a fraud group, the lower the score of that level, and vice versa. For the current data applicant, the probability of belonging to a fraud group is computed and the corresponding score F_Fraud is obtained.
Similarly, for the individual risk assessment, the PageRank values of all users are computed, ranked from high to low and divided evenly into 10 levels, each corresponding to one score from 1 to 10: the more adverse events a user has had or been affected by, the higher the individual risk and the lower the score, and vice versa.
Finally, the identity detection score F of the data applicant is the average of the anomaly detection score F_LOF, the group fraud risk score F_Fraud and the individual risk score F_PR.
Finally, the comprehensive risk assessment score is labeled to obtain the risk assessment conclusion. The invention applies the hierarchical labeling shown in Fig. 2 to the comprehensive score, so as to obtain the overall risk assessment conclusion and the assessment results of the specific risk items, completing the risk assessment of the data applicant in the privacy protection data release system.
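To make the scoring concrete, the following sketch averages the three identity sub-scores into F, combines F with the basic information score B through a simple mean standing in for the unspecified scoring function μ, and applies an assumed three-tier labeling; the weights, thresholds and tier wording are illustrative assumptions, not taken from the patent.

```python
# A sketch of the composite scoring and hierarchical labeling; the
# stand-in for mu (a plain mean) and the tier thresholds are assumed.

def identity_score(f_lof, f_fraud, f_pr):
    # F is the average of the three identity sub-scores (each on 1-10)
    return (f_lof + f_fraud + f_pr) / 3

def score(b, f):
    # score(u) = mu(B, F); mu is illustrated here as a simple mean
    return (b + f) / 2

def label(s):
    # hypothetical hierarchical labeling of the 10-point composite score
    if s >= 8:
        return "low risk: release may proceed"
    if s >= 5:
        return "medium risk: manual review advised"
    return "high risk: reject the application"

f = identity_score(f_lof=9.2, f_fraud=7.8, f_pr=6.0)   # toy sub-scores
s = score(b=8.5, f=f)
print(round(s, 2), "->", label(s))
```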
The method automatically extracts the background information, association relations and other information of the data applicant and analyzes the risk; it provides active protection for the privacy protection data release process and greatly reduces the work of manually reviewing the data applicant's information. By providing a quantitative and qualitative risk assessment scheme it describes the risk of privacy protection data release more intuitively, reduces the complexity of verifying a data applicant's identity in a complex relationship network, and finally expresses the risk assessment result semantically through labeling, making it intuitive and easy to understand.
The foregoing shows and describes the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A privacy protection data release risk assessment method based on a knowledge graph is characterized by comprising the following steps:
s1, acquiring information of a data applicant, mapping the acquired information into an RDF data set, and converting the RDF data set into graph data in a knowledge graph;
s2, detecting the basic information of the data applicant based on the knowledge graph to finish the basic information risk assessment;
s3, performing identity anomaly detection on the data applicant by using an anomaly detection algorithm based on the knowledge graph to finish identity anomaly risk assessment;
S4, carrying out community division on the data applicant groups by using a community discovery algorithm based on the knowledge graph, calculating group fraud risk, and finishing group fraud risk assessment of the data applicant;
s5, carrying out individual credit calculation analysis on the data applicant by using an improved personalized PageRank algorithm based on the knowledge graph to finish individual credit risk assessment of the data applicant;
s6, constructing a risk model by combining all risk evaluation data, and carrying out risk scoring on the data applicant according to the evaluation standard to complete the comprehensive risk evaluation of the data applicant;
and S7, processing the scores of the comprehensive risk assessment by adopting a hierarchical labeling method, and summarizing to obtain a risk assessment conclusion and a specific risk item assessment result.
2. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein said step S1 comprises the following sub-steps:
s101, generating a mapping file according to a logic table of a relational database;
s102, analyzing the mapping file to obtain mapping elements contained in the mapping file;
S103, analyzing the mapping elements to obtain their sub-elements, the logic table and the mapping rules for the attribute columns of the logic table;
s104, obtaining tuples in the logic table from the relational database, and mapping corresponding attribute columns in the tuples into RDF terms according to mapping rules;
and S105, combining the obtained RDF terms into RDF triples and outputting the RDF triples to an RDF data set.
3. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the identity anomaly risk assessment in step S3 comprises the following sub-steps:
S301, a target user to be detected u_test = {p_1^test, p_2^test, ..., p_n^test} is given, where p_i^test is the i-th attribute of the target user;
S302, a normal user set U = {u_1, u_2, ..., u_m} is given, and the k-th attribute of each normal user is extracted to obtain the attribute set P_k = {p_k^1, p_k^2, ..., p_k^m}, where p_k^j denotes the k-th attribute of the j-th user;
S303, l attributes are extracted from each normal user to form the multi-user multi-attribute set Muti_UP = {P_1, P_2, ..., P_l}, and the corresponding attributes of the target user to be detected are extracted to form the attribute set to be detected P_Test = {p_1, p_2, ..., p_l};
S304, the multi-user multi-attribute set Muti_UP is mapped into an l-dimensional clustering space and clustered; the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same clustering space, and the anomaly detection result is calculated with an anomaly detection algorithm to finish the identity anomaly risk assessment.
4. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the group fraud risk assessment in step S4 comprises the following sub-steps:
S401, a fraudulent user sample set U_F = {u_1^F, u_2^F, ..., u_n^F} is given, where u_i^F is a fraudulent user sample whose attribute set is P^F = {p_1^F, p_2^F, ..., p_m^F}, and p_j^F is the j-th attribute of the fraudulent user sample;
S402, the fraud group set is initialized to empty, i.e. Group = ∅;
S403, l attributes are selected from the m attributes of the fraudulent users to form an attribute subset P' = {p_1^F, p_2^F, ..., p_l^F};
S404, all fraudulent users are classified with a community discovery algorithm according to the l attributes, fraudulent users with similar characteristics are grouped into one class, and the user classification set U' = {U_1, U_2, ..., U_p} is finally obtained, in which each element represents one type of fraud group; the different types of fraud groups are added as elements to the fraud group set, giving Group = {U_1, U_2, ..., U_p} and finishing the group fraud risk assessment.
5. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the individual credit risk assessment in step S5 comprises the following sub-steps:
S501, a user relationship network U = <G_U, V_U> is given, where G_U is the set of user nodes in the relationship network and V_U is the set of edges in the relationship network;
S502, a user node u with risk weight w is assumed, and the n user nodes connected to the user node u are U = {u_1, u_2, ..., u_n};
S503, assuming that an adverse credit event occurs at the user node u, a time correlation function δ(u, t) propagates the risk weight of node u to the nodes connected to u;
S504, all nodes are traversed with an improved personalized PageRank algorithm to complete the risk weight propagation calculation for adverse credit events, and all users in the user relationship network are finally ranked by risk weight to obtain the user risk ranking set, finishing the individual credit risk assessment.
6. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the risk model constructed in step S6 is:
score(u) = μ(B, F)
wherein μ is a risk scoring function, B represents the basic information of the data user, and F represents the assessment results of the data user's identity anomaly risk, group fraud risk and individual credit risk.
7. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein step S1 further comprises judging whether the basic information submitted by the data applicant meets the specification.