CN111292008A - Privacy protection data release risk assessment method based on knowledge graph

Info

Publication number: CN111292008A
Application number: CN202010139728.8A
Authority: CN (China)
Legal status: Pending at publication; the application was later rejected after publication (RJ01)
Inventors: 王瑞锦, 张凤荔, 何兴高, 张巍琦, 唐榆程, 郭鹏宇, 谭琪
Assignee (original and current): University of Electronic Science and Technology of China
Priority date: 2020-03-03
Filing date: 2020-03-03
Publication date: 2020-06-16
Other languages: Chinese (zh)
Prior art keywords: risk, data, user, risk assessment, attribute


Classifications

    • G06Q10/0635 - Risk analysis of enterprise or organisation activities
    • G06F16/284 - Relational databases
    • G06F16/367 - Creation of semantic tools; Ontology
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes

Abstract

The invention discloses a privacy protection data release risk assessment method based on a knowledge graph, which comprises the following steps: acquiring the information submitted by a data applicant and judging whether the basic information meets the specification; mapping the data applicant's information into an RDF data set and then converting the RDF data set into graph data in a knowledge graph; on the basis of the knowledge graph, completing the basic information risk assessment, identity anomaly risk assessment, group fraud risk assessment and individual credit risk assessment of the data applicant with the corresponding algorithms; combining all risk assessment data to construct a risk model and computing a risk score for the data applicant; and labeling the comprehensive risk assessment score to obtain a risk assessment conclusion and the assessment result of each specific risk item. The scheme automatically extracts the data applicant's information and analyzes its risk, actively protects the privacy protection data release process, greatly reduces the workload of manual review, and describes the risk of privacy protection data release more intuitively.

Description

Privacy protection data release risk assessment method based on knowledge graph
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a privacy protection data release risk assessment method based on a knowledge graph.
Background
With the widespread use of big data technology, data has become one of the most important assets of many companies. Existing privacy protection data publishing systems only evaluate the privacy disclosure risk of the desensitized data itself and neglect to actively assess the risk posed by the data applicant. Although a malicious attacker may be unable to steal private data by technical means, the attacker can still obtain it through social engineering: for example, by acquiring different data sets through user impersonation, group fraud and similar means, and then recovering the private data through data analysis. The knowledge graph is a graph-structure-based technology that can rapidly analyze the relationships between its nodes. Mapping data applicants into a knowledge graph and analyzing their implicit relationships on that basis can therefore effectively prevent private data from being stolen through social engineering.
Disclosure of Invention
The invention aims to provide a privacy protection data release risk assessment method based on a knowledge graph that can effectively prevent private data from being stolen through fraud. The purpose of the invention is achieved through the following technical scheme:
a privacy protection data release risk assessment method based on a knowledge graph comprises the following steps:
S1, acquiring the information of a data applicant, mapping the acquired information into an RDF data set, and converting the RDF data set into graph data in a knowledge graph;
s2, detecting the basic information of the data applicant based on the knowledge graph to finish the basic information risk assessment;
s3, performing identity anomaly detection on the data applicant by using an anomaly detection algorithm based on the knowledge graph to finish identity anomaly risk assessment;
S4, carrying out community division on the data applicant groups by using a community discovery algorithm based on the knowledge graph, calculating group fraud risk, and finishing group fraud risk assessment of the data applicant;
s5, carrying out individual credit calculation analysis on the data applicant by using an improved personalized PageRank algorithm based on the knowledge graph to finish individual credit risk assessment of the data applicant;
s6, constructing a risk model by combining all risk evaluation data, and carrying out risk scoring on the data applicant according to the evaluation standard to complete the comprehensive risk evaluation of the data applicant;
and S7, processing the scores of the comprehensive risk assessment by adopting a hierarchical labeling method, and summarizing to obtain a risk assessment conclusion and a specific risk item assessment result.
Further, the step S1 includes the following sub-steps:
s101, generating a mapping file according to a logic table of a relational database;
s102, analyzing the mapping file to obtain mapping elements contained in the mapping file;
S103, analyzing the mapping elements to obtain their sub-elements, the logic table and the mapping rules for the attribute columns of the logic table;
s104, obtaining tuples in the logic table from the relational database, and mapping corresponding attribute columns in the tuples into RDF terms according to mapping rules;
and S105, combining the obtained RDF terms into RDF triples and outputting the RDF triples to an RDF data set.
Further, the identity anomaly risk assessment in step S3 includes the following sub-steps:
S301, a target user to be detected u_test = {p_1^test, p_2^test, ..., p_n^test} is given, where p_i^test is the i-th attribute of the target user;
S302, a normal user set U = {u_1, u_2, ..., u_m} is given, and the k-th attribute of each normal user is extracted to obtain the attribute set P_k = {p_k^1, p_k^2, ..., p_k^m}, where p_k^j denotes the k-th attribute of the j-th user;
S303, l attributes are extracted from each normal user to form the multi-user multi-attribute set Muti_UP = {P_1, P_2, ..., P_l}, and the corresponding attributes of the target user to be detected are extracted to form the attribute set to be detected P_Test = {p_1, p_2, ..., p_l};
S304, the multi-user multi-attribute set Muti_UP is mapped into an l-dimensional clustering space and clustered; the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same clustering space, and the anomaly detection result is calculated with an anomaly detection algorithm to finish the identity anomaly risk assessment.
Further, the group fraud risk assessment in step S4 includes the following sub-steps:
S401, a fraudulent user sample set U_F = {u_1^F, u_2^F, ..., u_n^F} is given, where u_i^F is a fraudulent user sample whose attribute set is P^F = {p_1^F, p_2^F, ..., p_m^F}, and p_j^F is the j-th attribute of the fraudulent user sample;
S402, the fraud group set is initialized to empty, i.e. Group = ∅;
S403, l attributes are selected from the m attributes of the fraudulent users to form an attribute subset P' = {p_1^F, p_2^F, ..., p_l^F};
S404, all fraudulent users are classified with a community discovery algorithm according to the l attributes, fraudulent users with similar characteristics are grouped into one class, and the user classification set U' = {U_1, U_2, ..., U_p} is finally obtained, in which each element represents one type of fraud group; the different types of fraud groups are added as elements to the fraud group set, giving Group = {U_1, U_2, ..., U_p} and finishing the group fraud risk assessment.
Further, the individual credit risk assessment in step S5 includes the following sub-steps:
S501, a user relationship network U = <G_U, V_U> is given, where G_U is the set of user nodes in the relationship network and V_U is the set of edges in the relationship network;
S502, a user node u with risk weight w is assumed, and the n user nodes connected to the user node u are U = {u_1, u_2, ..., u_n};
S503, assuming that an adverse credit event occurs at the user node u, a time correlation function δ(u, t) propagates the risk weight of node u to the nodes connected to u;
S504, all nodes are traversed with an improved personalized PageRank algorithm to complete the risk weight propagation calculation for adverse credit events, and all users in the user relationship network are finally ranked by risk weight to obtain the user risk ranking set, finishing the individual credit risk assessment.
Further, the risk model constructed in step S6 is:
score(u) = μ(B, F)
wherein μ is a risk scoring function, B represents the basic information of the data user, and F represents the assessment results of the data user's identity anomaly risk, group fraud risk and individual credit risk.
The invention has the beneficial effects that:
(1) the work of manually reviewing the data applicant's information is greatly reduced, and active protection is provided for the privacy protection data release process;
(2) the background information, association relations and other information of the data applicant are extracted automatically and the risk is analyzed;
(3) a quantitative and qualitative risk assessment scheme is provided, so that the risk of privacy protection data release is described more intuitively;
(4) the complexity of verifying a data applicant's identity in a complex relationship network is reduced, and the risk assessment result is finally expressed semantically through labeling, making it intuitive and easy to understand.
Drawings
FIG. 1 is a diagram of the method steps of the present invention.
FIG. 2 is a diagram of the hierarchical tagging methodology of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In an embodiment of the present invention, as shown in fig. 1, a privacy-preserving data publishing risk assessment method based on a knowledge graph includes the following steps: acquiring information of a data applicant, mapping the acquired information into an RDF data set, and converting the RDF data set into graph data in a knowledge graph; detecting the basic information of the data applicant based on the knowledge graph to finish the basic information risk assessment; performing identity anomaly detection on the data applicant by using an anomaly detection algorithm based on the knowledge graph to finish identity anomaly risk assessment; carrying out community division on the data applicant groups by using a community discovery algorithm based on a knowledge graph, calculating group fraud risk, and finishing group fraud risk evaluation of the data applicant; carrying out individual credit calculation analysis on the data applicant by using an improved personalized PageRank algorithm based on a knowledge graph to finish individual credit risk assessment of the data applicant; combining all risk evaluation data, constructing a risk model, and carrying out risk scoring on the data applicant according to an evaluation standard to complete comprehensive risk evaluation of the data applicant; and processing the scores of the comprehensive risk assessment by adopting a hierarchical labeling method, and summarizing to obtain a risk assessment conclusion and a specific risk item assessment result.
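By way of orientation, the following is a minimal runnable Python sketch of how the seven steps might be wired together; every function here is a dummy stand-in for the module developed in the corresponding subsection below, and none of the names or numeric values come from the patent itself.

```python
# A toy skeleton of steps S1-S7; every function body is a dummy
# stand-in for the module described in the corresponding subsection.

def map_to_knowledge_graph(info):          # S1: info -> RDF -> graph data (stub)
    return [("applicant:u1", key, str(value)) for key, value in info.items()]

def basic_info_risk(kg):                   # S2: basic information check (stub)
    return 8.0                             # 10-point scale

def identity_anomaly_risk(kg):             # S3: LOF-based in the full method (stub)
    return 7.5

def group_fraud_risk(kg):                  # S4: community discovery (stub)
    return 9.0

def individual_credit_risk(kg):            # S5: personalized PageRank (stub)
    return 6.5

def comprehensive_score(b, f_parts):       # S6: score(u) = mu(B, F); mu is a mean here
    return (b + sum(f_parts) / len(f_parts)) / 2

kg = map_to_knowledge_graph({"name": "Li Lei", "purpose": "research"})
f_parts = (identity_anomaly_risk(kg), group_fraud_risk(kg), individual_credit_risk(kg))
total = comprehensive_score(basic_info_risk(kg), f_parts)
print("composite risk score:", total)      # S7 labels this score (see Fig. 2)
```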
Further, the step of obtaining the data applicant's information and converting it into the knowledge graph also includes judging whether the basic information submitted by the data applicant meets the specification.
The specific process of mapping the data applicant's information into the knowledge graph is as follows:
Data applicant information is generally stored as structured data and as unstructured text, a form that is not conducive to exploring the deep information and implicit relationships between data applicants. The invention therefore maps the data applicant's information into an RDF data set and then converts the RDF data set into graph data in a knowledge graph. The mapping process is described as follows:
(1) generating a mapping file according to a logic table of the relational database;
(2) analyzing the mapping file to obtain mapping elements contained in the mapping file;
(3) analyzing the mapping elements to obtain their sub-elements, the logic table and the mapping rules for the attribute columns of the logic table;
(4) obtaining tuples in the logic table from a relational database, and mapping corresponding attribute columns in the tuples into RDF terms according to a mapping rule;
(5) and combining the obtained RDF terms into RDF triples and outputting the RDF triples to the RDF data set.
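As an illustration of steps (1)-(5), the following Python sketch maps one tuple of a relational logic table into RDF triples with the rdflib library; the logic table, the namespace and the column-to-predicate mapping rules are invented for the example and are not specified by the patent.

```python
# A minimal sketch of the relational-to-RDF mapping; the "applicant"
# logic table, the namespace and the mapping rules are illustrative.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/applicant/")      # hypothetical namespace

# one tuple of the logic table, plus a rule mapping columns to predicates
row = {"id": "u001", "name": "Zhang San", "org": "Example Lab"}
mapping_rules = {"name": EX.name, "org": EX.organization}

g = Graph()
subject = EX[row["id"]]                              # RDF term for the tuple
for column, predicate in mapping_rules.items():
    # map each attribute column to an RDF term and emit one triple
    g.add((subject, predicate, Literal(row[column])))

print(g.serialize(format="turtle"))                  # the resulting RDF data set
```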
The specific process of the identity anomaly risk assessment of the data applicant's information is as follows:
Anomaly detection is a representative unsupervised learning method whose goal is to find the points, or sets of points, in the data that have anomalous properties. In the data applicant identity anomaly risk assessment, anomaly detection is used to identify illegal users who attempt to disguise themselves as normal users with seemingly legitimate information. The identity anomaly risk assessment task is described as follows:
(1) a target user to be detected u_test = {p_1^test, p_2^test, ..., p_n^test} is given, where p_i^test is the i-th attribute of the target user;
(2) a normal user set U = {u_1, u_2, ..., u_m} is given, and the k-th attribute of each normal user is extracted to obtain the attribute set P_k = {p_k^1, p_k^2, ..., p_k^m}, where p_k^j denotes the k-th attribute of the j-th user;
(3) l attributes are extracted from each normal user to form the multi-user multi-attribute set Muti_UP = {P_1, P_2, ..., P_l}, and the corresponding attributes of the target user to be detected are extracted to form the attribute set to be detected P_Test = {p_1, p_2, ..., p_l};
(4) the multi-user multi-attribute set Muti_UP is mapped into an l-dimensional clustering space and clustered; the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same clustering space, and the anomaly detection result is calculated with an anomaly detection algorithm.
As the above task shows, the anomaly detection algorithm is the key part of the whole user identity anomaly detection and directly determines the detection result; the Local Outlier Factor (LOF) algorithm is therefore selected for the anomaly detection.
The basic idea of the LOF algorithm is to calculate the ratio of the average density around a target sample point's neighbours to the density at the target sample point itself. The ratio is compared against 1: when it is greater than 1, the larger the value, the lower the density at the target sample point relative to its neighbours, and the more likely the target sample point is an outlier.
The LOF algorithm is defined as follows:
(1) Distance between two points: let p, o be two points in a given set C, with the distance between them denoted d(p, o);
(2) k-th distance (k-distance): the k-th distance of point p, denoted d_k(p), is the distance d(p, o) to a point o such that at least k points o' ∈ C \ {p} satisfy d(p, o') ≤ d(p, o), and at most k-1 points o' ∈ C \ {p} satisfy d(p, o') < d(p, o);
(3) k-distance neighborhood (k-distance neighborhood of p): the k-distance neighborhood of point p, denoted N_k(p), contains all points whose distance from p does not exceed the k-th distance (including points at exactly the k-th distance), so the number of points in it satisfies |N_k(p)| ≥ k;
(4) Reachability distance (rd): the k-th reachability distance from point p to point o is:
rd_k(p, o) = max{ d_k(o), d(p, o) }
that is, the reachability distance from p to o is at least the k-th distance of o: for the k points nearest to o, the reachability distance from them to o equals d_k(o);
(5) Local reachability density (lrd): the inverse of the average reachability distance from point p to the points in its k-distance neighborhood, i.e.:
lrd_k(p) = |N_k(p)| / ( Σ_{o ∈ N_k(p)} rd_k(p, o) )
If point p and its surrounding neighborhood points belong to the same cluster, the reachability distances are more likely to take the small value d_k(o), the sum of the reachability distances is small, and the local reachability density is high; if point p is far from its neighborhood points, the reachability distances are more likely to take the larger value d(p, o), the sum is large, the local reachability density is low, and p is more likely an outlier;
(6) Local outlier factor (lof): the average ratio of the local reachability density of the neighborhood points N_k(p) to the local reachability density of point p itself, i.e.:
LOF_k(p) = ( Σ_{o ∈ N_k(p)} ( lrd_k(o) / lrd_k(p) ) ) / |N_k(p)|
If LOF_k(p) ≈ 1, point p likely belongs to the same cluster as its neighborhood; if LOF_k(p) < 1, the density at p is higher than that of its neighborhood; if LOF_k(p) > 1, the larger the value, the lower the density at p relative to its neighborhood and the more likely p is an outlier.
In summary, densities are obtained from the distances between points, and each point p is judged to be an outlier or not by comparing its density with that of its neighboring points: the lower the density at p, the more likely it is an outlier. When the user identity anomaly detection is performed, the multi-user multi-attribute set Muti_UP is mapped into the clustering space, and the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same space and used as the point p in the LOF algorithm.
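For illustration, the following sketch runs such an LOF computation with scikit-learn's LocalOutlierFactor; the two-attribute toy data are assumptions, and novelty=True is used so that a model fitted on the normal users can score the separately supplied target point, matching the task description above.

```python
# A sketch of the identity anomaly check, assuming l=2 numeric
# attributes per user; all values are toy data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Muti_UP: two attributes for each of five normal users, i.e. the
# multi-user multi-attribute set mapped into a 2-dimensional space
normal_users = np.array([[0.9, 1.1], [1.0, 1.0], [1.1, 0.9],
                         [0.95, 1.05], [1.05, 0.95]])
# P_Test: the same attributes extracted from the target user
target_user = np.array([[5.0, 5.0]])

# novelty=True lets a model fitted on normal users score new points
lof = LocalOutlierFactor(n_neighbors=3, novelty=True).fit(normal_users)

# score_samples returns the negative LOF; LOF values well above 1 mean
# the target's local density is far below its neighbours' density
lof_value = -lof.score_samples(target_user)[0]
print("LOF =", lof_value, "->", "anomalous" if lof_value > 1.5 else "normal")
```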
The specific process of the group fraud risk assessment of the data applicant's information is as follows:
A group is a collection of individuals that are closely connected to each other and show some similarity in behavior and attributes; it is characterized by tight relationships between individuals within a community and sparse relationships between communities. In the group fraud risk assessment of data applicants, a community discovery algorithm is used to identify fraud groups among the data applicants. The invention describes the group fraud risk assessment task as follows:
(1) a fraudulent user sample set U_F = {u_1^F, u_2^F, ..., u_n^F} is given, where u_i^F is a fraudulent user sample whose attribute set is P^F = {p_1^F, p_2^F, ..., p_m^F}, and p_j^F is the j-th attribute of the fraudulent user sample;
(2) the fraud group set is initialized to empty, i.e. Group = ∅;
(3) l attributes are selected from the m attributes of the fraudulent users to form an attribute subset P' = {p_1^F, p_2^F, ..., p_l^F};
(4) all fraudulent users are classified according to the above attributes, fraudulent users with similar characteristics are grouped into one class, and the user classification set U' = {U_1, U_2, ..., U_p} is finally obtained, in which each element represents one type of fraud group; the different types of fraud groups are added as elements to the fraud group set, giving Group = {U_1, U_2, ..., U_p}.
As the above task shows, the discovery of fraud groups is its key part. Community discovery algorithms use the information contained in the graph topology to extract the modular community structure of a complex network; studying this problem helps to investigate the modules, functions and evolution of the whole network in a divide-and-conquer manner and to understand the organizing principles, topology and dynamics of complex systems more accurately. Such algorithms are commonly used to identify fraud groups, so the invention uses a modularity quantification algorithm to discover them.
The modularity quantification community discovery algorithm quantifies the characteristics of communities and divides them by comparing the quantified results. It performs well, can process large-scale networks, can discover communities of different granularities and, most importantly, discovers communities automatically without the number of communities being specified in advance. The basic idea of the algorithm is: starting from the modularity, observe the effect of moving nodes in the network on the modularity gain, and merge the communities with the largest effect on the modularity increment until the modularity no longer increases, which yields the final community division.
The modularity quantification community discovery algorithm is defined as follows:
(1) Modularity (modularity): the modularity of a network is written as:
Q = (1 / 2m) Σ_{i,j} [ A_ij - k_i k_j / 2m ] δ(c_i, c_j)
where m is the number of edges in the entire network, A_ij is the weight between node i and node j, k_i and k_j are the sums of the weights of the edges attached to node i and node j respectively, and c_i denotes the community to which node i belongs; if i and j belong to the same community then δ(c_i, c_j) = 1, otherwise δ(c_i, c_j) = 0.
(2) Modularity gain: assuming that there are N nodes in the network, each node is first assigned its own community, giving N communities. Then, for each node i and each of its neighboring nodes j, the modularity increment obtained by moving node i from its own community into the community of its neighbor j is:
ΔQ = [ (Σ_in + k_i,in) / 2m - ( (Σ_tot + k_i) / 2m )^2 ] - [ Σ_in / 2m - ( Σ_tot / 2m )^2 - ( k_i / 2m )^2 ]
where m is the sum of the weights of all edges in the network, Σ_in is the sum of the weights of all edges between nodes within the community, Σ_tot is the sum of the weights of all edges incident to nodes in the community, k_i is the sum of the weights of the edges incident to node i, and k_i,in is the sum of the weights of the edges from node i to the nodes in the community.
(3) Dividing communities by maximizing the modularity gain: node i is moved into the community of the node j with the largest modularity gain; if no community with a positive modularity gain can be found, node i stays in its original community. This process is repeated until no move of any node increases the modularity.
(4) Iteratively dividing communities: the communities obtained in step (3) are taken as the nodes of a new network, in which the weight between two new nodes is the sum of the original weights between the nodes they contain; the division of step (3) is then applied iteratively. When the maximum modularity is reached or the network no longer changes, the optimal division of the network has been obtained and the iteration stops.
In summary, when the group fraud risk assessment is performed, the users in the given user set are taken as the nodes of the modularity quantification group discovery algorithm, attributes such as business connections and exchanges of interest in the attribute set are taken as the edges between nodes, and the community division is then carried out, finally yielding the fraudulent user division set Group = {U_1, U_2, ..., U_p} and finishing the group fraud risk assessment of the data applicants.
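As a concrete example, the following sketch performs the modularity-based community division with NetworkX's Louvain implementation on a toy applicant network; the nodes, edge weights (standing in for business-connection and interest-exchange attributes) and the seed are invented for the example.

```python
# A sketch of the group division step: applicants are nodes, shared
# business/interest links are weighted edges; the graph is a toy one.
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.Graph()
# two tightly connected triangles joined by a single weak link
G.add_weighted_edges_from([("u1", "u2", 1.0), ("u2", "u3", 1.0), ("u1", "u3", 1.0),
                           ("u4", "u5", 1.0), ("u5", "u6", 1.0), ("u4", "u6", 1.0),
                           ("u3", "u4", 0.1)])

# Louvain: repeatedly move each node to the neighbouring community with
# the largest modularity gain, then contract communities and iterate
groups = louvain_communities(G, weight="weight", seed=42)
print("Group =", [sorted(c) for c in groups])
print("Q =", modularity(G, groups, weight="weight"))
```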
The specific process of the individual credit risk assessment of the data applicant's information is as follows:
An individual's credit changes with the events associated with the individual: if an adverse credit event happens to the individual, or the individual is affected by one, the individual's credit decreases, and vice versa; meanwhile, the effect of an adverse event on the individual gradually weakens over time. In the individual credit risk assessment process, the individual credit risk analysis task is described as follows:
(1) a user relationship network U = <G_U, V_U> is given, where G_U is the set of user nodes in the relationship network and V_U is the set of edges in the relationship network;
(2) a user node u with risk weight w is assumed, and the n user nodes connected to the user node u are U = {u_1, u_2, ..., u_n};
(3) assuming that an adverse credit event occurs at the user node u, a time correlation function δ(u, t) propagates the risk weight of node u to the nodes connected to u;
(4) all nodes are traversed while the risk weight propagation for adverse credit events is completed, and all users in the user relationship network are finally ranked by risk weight to obtain the user risk ranking set.
As the above task shows, the calculation of node risk is its key part, and the invention selects an improved personalized PageRank algorithm to calculate the individual credit risk of the data applicant. The basic idea of the traditional PageRank algorithm is: in a directed graph, a user starts visiting from an arbitrary node; when jumping to the next node, the user either follows one of the outgoing edges of the current node, chosen at random, with probability (1 - c), or jumps to an arbitrary node with probability c and starts a new round of the random walk. This process is repeated until the probability of the user staying at each node becomes stable; for a node p, the more nodes point to p, the larger its weight, namely:
r=(1-c)Mr+cu
where r is the PageRank vector, representing the probability that each node is visited, c is the probability of restarting the random walk, u gives the probability of each node being selected when the random walk restarts (in PageRank every node is selected with equal probability), and M is the normalized adjacency matrix.
However, the traditional PageRank algorithm does not match reality: in actual usage scenarios, the nodes of the network are not selected with equal probability when the random walk restarts; the selection is biased by the user's preferences. The PageRank algorithm was therefore improved, yielding the personalized PageRank algorithm.
The personalized PageRank algorithm assumes that, when the random walk restarts, the jump target is not selected uniformly at random but is drawn from a specific set of nodes; likewise, when the node weights are initialized, the nodes in this specific set are treated differently from the others. When the steady state is computed, the nodes preferred by the user, and the nodes related to them, obtain larger weights. For a node p, the personalized PageRank is computed as:
r=(1-c)Mr+cv
where v is the user's preference vector, expressing the importance of each node in the relationship network under the given preferences.
However, in the individual credit risk analysis problem, the weight influence between two nodes depends not only on the adverse events themselves but also on when they occurred: by common knowledge, the more recently an adverse event occurred, the larger its influence on the current situation, and vice versa. The personalized PageRank algorithm does not consider the influence of time on node weights.
To address this, the personalized PageRank algorithm is further improved by adding a time decay factor to the adjacency matrix. Let δ be an exponential time decay function:
δ(t)=e^(-βt)
where β is a decay constant indicating the rate at which the influence of past information decreases, and t is the interval between the time the adverse event occurred and the current time (t = 0 means the adverse event is occurring now). The original adjacency matrix M is transformed through the exponential time decay function into a weight matrix W with time decay; the PageRank value is then computed as:
r=(1-c)Wr+cv
Meanwhile, in this task the degree of a node should be independent of the weight assigned to it. In the personalized PageRank, the term (1-c)Wr expresses that the weight influence caused by a node's adverse event is spread over its neighbor nodes; with equal weights, however, a high-degree node propagates a smaller weight influence to each neighbor, and a low-degree node a larger one. The improved personalized PageRank algorithm therefore amplifies the weight influence of high-degree nodes, ensuring that the neighbors of nodes of different degrees receive weight influence on the same scale during propagation. The PageRank value is then computed as:
r=(1-c)Wr+cz'
where v is the user's preference vector, d is the vector of node degrees, z is the element-wise product of v and d, and z' is obtained by normalizing z.
The improved personalized PageRank algorithm is executed iteratively on all nodes of the relationship network, finally yielding a user risk ranking based on the influence of adverse events and finishing the individual credit risk assessment.
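The following numpy sketch illustrates the improved iteration r=(1-c)Wr+cz' on a toy four-node network; the adjacency matrix, event ages, decay constant and restart probability are assumptions, while the time decay and the degree-scaled preference vector z' follow the definitions given above.

```python
# A sketch of the time-decayed personalized PageRank r=(1-c)Wr+cz'
# on a toy 4-node user network; all numeric values are assumptions.
import numpy as np

c, beta = 0.15, 0.5                        # restart probability, decay constant
A = np.array([[0, 1, 1, 0],                # adjacency matrix of the user network
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
t = np.array([[0, 2, 1, 0],                # time since the adverse event on each edge
              [2, 0, 3, 0],
              [1, 3, 0, 1],
              [0, 0, 1, 0]], dtype=float)

W = A * np.exp(-beta * t)                  # apply delta(t) = e^(-beta*t) to the weights
W = W / W.sum(axis=0, keepdims=True)       # column-normalize the decayed matrix

v = np.array([1.0, 0.0, 0.0, 0.0])         # preference: the adverse event is at node 0
d = A.sum(axis=1)                          # node degrees
z = v * d                                  # scale the preference by degree ...
z_prime = z / z.sum()                      # ... and normalize to obtain z'

r = np.full(4, 1 / 4)                      # start from the uniform distribution
for _ in range(100):                       # power iteration until (near) convergence
    r = (1 - c) * W @ r + c * z_prime
print("risk ranking:", np.argsort(-r))     # user indices sorted by risk weight
```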
The specific process of the comprehensive risk assessment of the data applicant's information is as follows:
To assess the comprehensive risk of the data applicant, a risk model is constructed by combining the preceding data:
score(u)=μ(B,F)
where μ is a risk scoring function, B represents the basic information of the data user, and F represents the assessment results of the data user's identity anomaly risk, group fraud risk and individual credit risk. The model expresses that the risk score of a data user u depends on the basic information, the identity anomaly risk, the group fraud risk and the individual credit risk. The invention scores each item on a 10-point scale and performs the comprehensive risk assessment of the data applicant.
First, the basic information B of the data applicant is submitted actively by the data applicant and needs to be evaluated in terms of authenticity, completeness and objectivity; the invention provides the evaluation standard shown in Table 1.
Table 1: Evaluation standard for the data applicant's basic information
(The body of Table 1 is published as images in the original document; it lists the scoring criteria for the authenticity, completeness and objectivity of the basic information.)
For the identity information F of the data user, the score of F is derived from the identity detection results and consists of three parts: the anomaly detection score F_LOF, the group fraud risk score F_Fraud and the individual risk score F_PR.
For the local outlier factor LOF, given a user u and the set of other normal users U = {u_1, u_2, ..., u_n}, the closer LOF(u) is to 1, the more likely u is a normal user and the closer F_LOF is to 10 points.
For the group fraud detection, given a user u, a known fraud group a and a known normal group b, the user u and the two groups are repeatedly divided with the community discovery algorithm. Let f_b denote the frequency with which u is divided into one group with the normal group b, and f_a the frequency with which u is divided into one group with the fraud group a; the probability that the user is a member of the fraud group is then f = f_a / (f_a + f_b). The operation is repeated to obtain the probabilities of all users; the resulting probabilities are ranked from high to low and divided evenly into 10 levels, each corresponding to one score from 1 to 10. The higher the probability that the users of a level are individuals of a fraud group, the lower the score of that level, and vice versa. For the current data applicant, the probability of belonging to a fraud group is computed and the corresponding score F_Fraud is obtained.
Similarly, for the individual risk assessment, the PageRank values of all users are computed, ranked from high to low and divided evenly into 10 levels, each corresponding to one score from 1 to 10: the more adverse events a user has had or been affected by, the higher the individual risk and the lower the score, and vice versa.
Finally, the identity detection score F of the data applicant is the average of the anomaly detection score F_LOF, the group fraud risk score F_Fraud and the individual risk score F_PR.
Finally, the comprehensive risk assessment score is labeled to obtain the risk assessment conclusion. The invention applies the hierarchical labeling shown in Fig. 2 to the comprehensive score, so as to obtain the overall risk assessment conclusion and the assessment results of the specific risk items, completing the risk assessment of the data applicant in the privacy protection data release system.
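To make the scoring concrete, the following sketch averages the three identity sub-scores into F, combines F with the basic information score B through a simple mean standing in for the unspecified scoring function μ, and applies an assumed three-tier labeling; the weights, thresholds and tier wording are illustrative assumptions, not taken from the patent.

```python
# A sketch of the composite scoring and hierarchical labeling; the
# stand-in for mu (a plain mean) and the tier thresholds are assumed.

def identity_score(f_lof, f_fraud, f_pr):
    # F is the average of the three identity sub-scores (each on 1-10)
    return (f_lof + f_fraud + f_pr) / 3

def score(b, f):
    # score(u) = mu(B, F); mu is illustrated here as a simple mean
    return (b + f) / 2

def label(s):
    # hypothetical hierarchical labeling of the 10-point composite score
    if s >= 8:
        return "low risk: release may proceed"
    if s >= 5:
        return "medium risk: manual review advised"
    return "high risk: reject the application"

f = identity_score(f_lof=9.2, f_fraud=7.8, f_pr=6.0)   # toy sub-scores
s = score(b=8.5, f=f)
print(round(s, 2), "->", label(s))
```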
The method automatically extracts the background information, association relations and other information of the data applicant and analyzes the risk; it provides active protection for the privacy protection data release process and greatly reduces the work of manually reviewing the data applicant's information. By providing a quantitative and qualitative risk assessment scheme it describes the risk of privacy protection data release more intuitively, reduces the complexity of verifying a data applicant's identity in a complex relationship network, and finally expresses the risk assessment result semantically through labeling, making it intuitive and easy to understand.
The foregoing shows and describes the basic principles, main features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description only illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A privacy protection data release risk assessment method based on a knowledge graph is characterized by comprising the following steps:
s1, acquiring information of a data applicant, mapping the acquired information into an RDF data set, and converting the RDF data set into graph data in a knowledge graph;
s2, detecting the basic information of the data applicant based on the knowledge graph to finish the basic information risk assessment;
s3, performing identity anomaly detection on the data applicant by using an anomaly detection algorithm based on the knowledge graph to finish identity anomaly risk assessment;
S4, carrying out community division on the data applicant groups by using a community discovery algorithm based on the knowledge graph, calculating group fraud risk, and finishing group fraud risk assessment of the data applicant;
s5, carrying out individual credit calculation analysis on the data applicant by using an improved personalized PageRank algorithm based on the knowledge graph to finish individual credit risk assessment of the data applicant;
s6, constructing a risk model by combining all risk evaluation data, and carrying out risk scoring on the data applicant according to the evaluation standard to complete the comprehensive risk evaluation of the data applicant;
and S7, processing the scores of the comprehensive risk assessment by adopting a hierarchical labeling method, and summarizing to obtain a risk assessment conclusion and a specific risk item assessment result.
2. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein said step S1 comprises the following sub-steps:
s101, generating a mapping file according to a logic table of a relational database;
s102, analyzing the mapping file to obtain mapping elements contained in the mapping file;
S103, analyzing the mapping elements to obtain their sub-elements, the logic table and the mapping rules for the attribute columns of the logic table;
s104, obtaining tuples in the logic table from the relational database, and mapping corresponding attribute columns in the tuples into RDF terms according to mapping rules;
and S105, combining the obtained RDF terms into RDF triples and outputting the RDF triples to an RDF data set.
3. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the identity anomaly risk assessment in step S3 comprises the following sub-steps:
S301, a target user to be detected u_test = {p_1^test, p_2^test, ..., p_n^test} is given, where p_i^test is the i-th attribute of the target user;
S302, a normal user set U = {u_1, u_2, ..., u_m} is given, and the k-th attribute of each normal user is extracted to obtain the attribute set P_k = {p_k^1, p_k^2, ..., p_k^m}, where p_k^j denotes the k-th attribute of the j-th user;
S303, l attributes are extracted from each normal user to form the multi-user multi-attribute set Muti_UP = {P_1, P_2, ..., P_l}, and the corresponding attributes of the target user to be detected are extracted to form the attribute set to be detected P_Test = {p_1, p_2, ..., p_l};
S304, the multi-user multi-attribute set Muti_UP is mapped into an l-dimensional clustering space and clustered; the attribute set to be detected P_Test = {p_1, p_2, ..., p_l} is then mapped into the same clustering space, and the anomaly detection result is calculated with an anomaly detection algorithm to finish the identity anomaly risk assessment.
4. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the group fraud risk assessment in step S4 comprises the following sub-steps:
S401, a fraudulent user sample set U_F = {u_1^F, u_2^F, ..., u_n^F} is given, where u_i^F is a fraudulent user sample whose attribute set is P^F = {p_1^F, p_2^F, ..., p_m^F}, and p_j^F is the j-th attribute of the fraudulent user sample;
S402, the fraud group set is initialized to empty, i.e. Group = ∅;
S403, l attributes are selected from the m attributes of the fraudulent users to form an attribute subset P' = {p_1^F, p_2^F, ..., p_l^F};
S404, all fraudulent users are classified with a community discovery algorithm according to the l attributes, fraudulent users with similar characteristics are grouped into one class, and the user classification set U' = {U_1, U_2, ..., U_p} is finally obtained, in which each element represents one type of fraud group; the different types of fraud groups are added as elements to the fraud group set, giving Group = {U_1, U_2, ..., U_p} and finishing the group fraud risk assessment.
5. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the individual credit risk assessment in step S5 comprises the following sub-steps:
S501, a user relationship network U = <G_U, V_U> is given, where G_U is the set of user nodes in the relationship network and V_U is the set of edges in the relationship network;
S502, a user node u with risk weight w is assumed, and the n user nodes connected to the user node u are U = {u_1, u_2, ..., u_n};
S503, assuming that an adverse credit event occurs at the user node u, a time correlation function δ(u, t) propagates the risk weight of node u to the nodes connected to u;
S504, all nodes are traversed with an improved personalized PageRank algorithm to complete the risk weight propagation calculation for adverse credit events, and all users in the user relationship network are finally ranked by risk weight to obtain the user risk ranking set, finishing the individual credit risk assessment.
6. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein the risk model constructed in step S6 is:
score(u) = μ(B, F)
wherein μ is a risk scoring function, B represents the basic information of the data user, and F represents the assessment results of the data user's identity anomaly risk, group fraud risk and individual credit risk.
7. The privacy protection data release risk assessment method based on a knowledge graph as claimed in claim 1, wherein step S1 further comprises judging whether the basic information submitted by the data applicant meets the specification.