Disclosure of Invention
In view of this, embodiments of the present application provide a method and an apparatus for measuring risk of information disclosure, so as to improve protection of private information.
In a first aspect, an embodiment of the present application provides a method for measuring risk of information leakage, applied to a target system, the method including: constructing a privacy information ontology tree comprising a plurality of nodes; determining known privacy information and unknown privacy information; respectively mapping the known privacy information and the unknown privacy information onto nodes in the privacy information ontology tree; selecting a node mapped with unknown privacy information from the privacy information ontology tree as a target node; acquiring, according to the target node, a node having a parent-child relationship with the target node as a current node; if the privacy leakage value of the current node is known, calculating the privacy leakage value of the target node according to the privacy leakage value of the current node; and determining the leakage risk degree of the unknown privacy information according to the privacy leakage value of the target node.
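By way of non-limiting illustration, the overall flow of the first aspect can be sketched as follows. This is a minimal Python skeleton; the helper names (build_ontology_tree, split_known_unknown, map_information, compute_leakage) are invented for illustration only and are fleshed out as sketches in the detailed description below.

```python
# Hypothetical skeleton of the measurement flow; all helper functions
# are illustrative names only, not part of the claimed method.
def measure_leakage_risk(system_info, threshold=1.0):
    tree = build_ontology_tree()                        # step S101
    known, unknown = split_known_unknown(system_info)   # step S102
    mapping = map_information(tree, known, unknown)     # step S103
    result = {}
    for info in unknown:
        target = mapping[info]                          # step S104
        value = compute_leakage(tree, known, target)    # rules of fig. 4
        result[info] = ('high leakage risk'             # step S105
                        if value >= threshold else 'low leakage risk')
    return result
```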
With reference to the first aspect, an embodiment of the present application provides a first possible implementation manner of the first aspect, where the constructing of the privacy information ontology tree comprising a plurality of nodes comprises: acquiring, from a dictionary, nouns and adjectives that conform to superior-subordinate relations, instance relations, part-whole relations and attribute relations; and constructing the privacy information ontology tree from the nouns and the adjectives.
With reference to the first aspect, an embodiment of the present application provides a second possible implementation manner of the first aspect, where the mapping of the known privacy information and the unknown privacy information onto nodes in the privacy information ontology tree comprises: calculating, using a vector space model (VSM) algorithm, the semantic similarity between the known privacy information and the unknown privacy information, respectively, and the keyword corresponding to each node; if the semantic similarity is not zero, mapping the known privacy information and the unknown privacy information to the corresponding nodes; if the semantic similarity is zero, calculating, using a cosine similarity algorithm, the word similarity between the known privacy information and the unknown privacy information to be mapped and the keyword corresponding to each node; and if the word similarity is greater than a preset threshold, mapping the known privacy information and the unknown privacy information to the corresponding nodes.
With reference to the first aspect, an embodiment of the present application provides a third possible implementation manner of the first aspect, where the calculating of the privacy leakage value of the target node according to the privacy leakage value of the current node comprises: starting from the target node, finding the corresponding current node according to the parent-child relationship stored in the target node; and calculating the privacy leakage value of the target node according to the privacy leakage value of the current node.
With reference to the third possible implementation manner of the first aspect, an embodiment of the present application provides a fourth possible implementation manner of the first aspect, where the calculating of the privacy leakage value of the target node comprises: if the relationship between the target node and the current node is a superior-subordinate relationship or an instance relationship and the target node is a child node of the current node, the privacy leakage value of the target node is the same as the privacy leakage value of the current node; if the relationship between the target node and the current node is a superior-subordinate relationship or an instance relationship and the target node is a parent node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node; if the relationship between the target node and the current node is a part-whole relationship or an attribute relationship and the target node is a child node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node; and if the relationship between the target node and the current node is a part-whole relationship or an attribute relationship and the target node is a parent node of the current node, the privacy leakage value of the target node is the same as that of the current node.
With reference to the first aspect, an embodiment of the present application provides a fifth possible implementation manner of the first aspect, where the determining of the leakage risk degree of the unknown privacy information according to the privacy leakage value of the target node comprises: if the privacy leakage value reaches a preset threshold, determining the unknown privacy information to be high-leakage-risk information or leaked information; and if the privacy leakage value does not reach the preset threshold, determining the unknown privacy information to be low-leakage-risk information or non-leaked information.
In a second aspect, an embodiment of the present application provides an apparatus for measuring risk of information leakage, applied to a target system, the apparatus including: a building module, configured to construct a privacy information ontology tree comprising a plurality of nodes; a privacy information determining module, configured to determine known privacy information and unknown privacy information; a mapping module, configured to map the known privacy information and the unknown privacy information respectively onto nodes in the privacy information ontology tree; a calculation module, configured to select a node mapped with unknown privacy information from the privacy information ontology tree as a target node, acquire, according to the target node, a node having a parent-child relationship with the target node as a current node, and, if the privacy leakage value of the current node is known, calculate the privacy leakage value of the target node according to the privacy leakage value of the current node; and a leakage risk degree determining module, configured to determine the leakage risk degree of the unknown privacy information according to the privacy leakage value of the target node.
With reference to the second aspect, an embodiment of the present application provides a first possible implementation manner of the second aspect, where the calculation module is further configured to: starting from the target node, find the corresponding current node according to the parent-child relationship stored in the target node; and calculate the privacy leakage value of the target node according to the privacy leakage value of the current node.
With reference to the first possible implementation manner of the second aspect, an embodiment of the present application provides a second possible implementation manner of the second aspect, where the calculation module is further configured such that: if the relationship between the target node and the current node is a superior-subordinate relationship or an instance relationship and the target node is a child node of the current node, the privacy leakage value of the target node is the same as the privacy leakage value of the current node; if the relationship between the target node and the current node is a superior-subordinate relationship or an instance relationship and the target node is a parent node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node; if the relationship between the target node and the current node is a part-whole relationship or an attribute relationship and the target node is a child node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node; and if the relationship between the target node and the current node is a part-whole relationship or an attribute relationship and the target node is a parent node of the current node, the privacy leakage value of the target node is the same as that of the current node.
In combination with the second aspect, an embodiment of the present application provides a third possible implementation manner of the second aspect, where the leakage risk degree determining module is further configured to: if the privacy leakage value reaches a preset threshold, determine the unknown privacy information to be high-leakage-risk information or leaked information; and if the privacy leakage value does not reach the preset threshold, determine the unknown privacy information to be low-leakage-risk information or non-leaked information.
With this scheme, the privacy leakage value of the target node is calculated from the privacy leakage value of a node whose privacy information is known, and the leakage risk degree of the unknown privacy information is determined according to the privacy leakage value of the target node. In this manner, the leakage degree of unknown privacy information is calculated from the leakage degree of known privacy information, so that the protection strength applied to the unknown privacy information can be increased according to its leakage degree. This solves the problem in privacy protection that, because inference among pieces of information is not considered, the degree of protection is insufficient and information that ought to be kept secret can in practice be deduced from the published information.
Additional features and advantages of the disclosure will be set forth in the description which follows, or in part may be learned by practice of the disclosure.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Privacy protection technology in the prior art does not address the association between pieces of information, that is, information inference. Information inference includes methods such as association inference, synthetic inference, causal inference and hypothesis inference, by which an information acquirer can obtain further information on the basis of the part already known. Research in this area is limited: no universal privacy propagation model has gained acceptance, determining the type of privacy information requires semantic analysis, and there is as yet no authoritative standard for how the analyzed information gives rise to inference relationships. In 2002, the World Wide Web Consortium introduced the P3P privacy preference platform, which protects privacy information from a new perspective: instead of passively letting software identify private data, the user actively supplies his own privacy preferences, and the service party matches its policy against them, thereby achieving privacy protection. Because the user can make rules himself on the P3P platform, some simple inference can be taken into account by the user; for example, a user who does not want to reveal his birthday will certainly not allow his identity card number to be published, because the birthday can easily be deduced from the identity card number. However, the inference a user can consider is shallow and incomplete, so privacy information still cannot be well protected; the difficulty that inference poses for privacy protection can only be resolved by introducing complete semantic inference.
Based on this, embodiments of the present application provide a method and an apparatus for measuring risk of information leakage, so as to improve the protection of private information.
The embodiments of the present application do not limit specific application scenarios, and any method using the embodiments of the present application is within the scope of the present application. The following is a detailed description by way of specific examples.
Referring to fig. 1, which shows a flow chart of a method for measuring risk of information leakage, the present embodiment provides a method for measuring risk of information leakage, applied to a target system and comprising the following steps:
Step S101, constructing a privacy information ontology tree comprising a plurality of nodes;
Each kind of privacy information is regarded as one information point. An information point can be described by a specific word, such as "name" or "house number"; the specific information it contains differs in each run of the system and is related to the user using the system. In real life, people can often reason from information they know to obtain more information; since information in general has this inferential character, so does privacy information. Referring to fig. 2, a diagram of inference over privacy information is shown, where PA represents a known data set and dm represents a new data set inferred from the known data set.
In the PA set, p1, p2, p3, p4 and p5 are known data; d1 is new data inferred from p1; d2 is new data inferred from p2; d3 is new data inferred from p3; d4 is new data inferred from p4; and d5 is new data inferred from p5.
Taking age information as an example, information systems often treat "employment age" and "working age" as two unrelated fields, yet as long as a person's age at the time of employment is known, his retirement age can easily be inferred from the working age. In the semantic inference model, the employment age and the working age are PA, the retirement age is dm, and dm can in turn be used to infer still more information.
Step S102, determining known privacy information and unknown privacy information;
Here, the known and unknown privacy information in the target system is determined. The target system is any system that operates on data, such as a data processing system, an access control system or a data distribution system. Known privacy information is information that is freely accessible or published within the data processing system; this part of the information can be captured by any role in the system and used for inference. Inference means obtaining more information, on the basis of existing information, through the associations between pieces of information; for example, knowing a person's age and working age, the retirement time can be roughly inferred. The present invention relies mainly on the inference relationships among privacy information: measurement is carried out on the premise that some information has been leaked or actively published by its owner, and the leakage risk of other information is calculated from it. For example, in an access control system, known privacy information is information that has been accessed, i.e., a person has made an access request and the system has returned an "allow" decision.
The unknown privacy information, i.e., the information to be calculated, is likewise determined in the target system. The information to be calculated is the privacy information whose leakage degree a user needs to know; in a data processing system the user usually has no way to obtain it directly, that is, its content is left blank, or the data owner deliberately hides it. For example, if a person sets "detailed address" as inaccessible in the access control system, the detailed address is one piece of information to be calculated. Once the known information is available, it is also necessary to know which information is to be calculated; this step merely fixes the target, so as to reduce the running time of the algorithm and improve efficiency. In theory, the scope of the information to be calculated can be very wide, even extending to all information; that is, the algorithm may have a definite calculation target or no specific target at all, merely measuring what information has already been leaked under the current information background.
Step S103, respectively mapping the known privacy information and the unknown privacy information to each node in the privacy information ontology tree;
step S104, selecting a node mapped with unknown privacy information from the privacy information ontology tree as a target node;
acquiring, according to the target node, a node having a parent-child relationship with the target node as a current node;
if the privacy leakage value of the current node is known, calculating the privacy leakage value of the target node according to the privacy leakage value of the current node;
Here, the privacy leakage value is expressed as a specific numerical value between 0 and 1. If the privacy leakage value of the current node is known, the privacy leakage value of the target node is calculated from it; if the privacy leakage value of the current node is unknown, the current node is treated as a new target node, the traversal continues, and nodes having a parent-child relationship with it are searched in turn until a node whose privacy leakage value is known is found.
And step S105, determining the leakage risk degree of unknown privacy information according to the privacy leakage value of the target node.
According to the method, the privacy leakage value of the target node is calculated from the privacy leakage value of a node whose privacy information is known, and the leakage risk degree of the unknown privacy information is determined according to the privacy leakage value of the target node. In this manner, the leakage degree of unknown privacy information is calculated from the leakage degree of known privacy information, so that the protection strength applied to the unknown privacy information can be increased according to its leakage degree. This solves the problem in privacy protection that, because inference among pieces of information is not considered, the degree of protection is insufficient and information that ought to be kept secret can in practice be deduced from the published information.
In a possible implementation manner, step S101 of constructing a privacy information ontology tree comprising a plurality of nodes includes: acquiring, from a dictionary, nouns and adjectives that conform to superior-subordinate relations, instance relations, part-whole relations and attribute relations; and constructing the privacy information ontology tree from the nouns and adjectives.
Determining the privacy information ontology tree thus involves two parts: determining the relations and determining the parts of speech. The inference process over privacy information depends on such relations, for example "working age" and "retirement age" are temporally related, and the privacy information itself also needs to be described by words of specific parts of speech.
The WordNet ontology library provides many different relations; the present invention selects only the four most closely associated with privacy information, namely hyponym-of, instance-of, part-of and attribute-of. hyponym-of is a superior-subordinate relation, such as "job → teacher, employee"; instance-of is an instance relation, such as "emperor → Alexander"; part-of is a part-whole relation, such as "car → wheel, engine"; and attribute-of is an attribute relation, such as "beauty → neat, beautiful". These four relations are extracted from the WordNet ontology library to serve as the skeleton of the privacy information ontology tree in the present invention.
For privacy information, two parts of speech suffice: nouns and adjectives. Nouns describe specific objects and are the most common in privacy information, such as "occupation", "age" and "address", while adjectives describe the nature of an object: a scientist may carry labels such as "intelligent" and "smart", and an actress labels such as "beautiful" and "elegant". Words of these two parts of speech are extracted from the WordNet ontology library and, together with the four relations, form the privacy information ontology tree.
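By way of non-limiting illustration, the following sketch extracts the four relations and the two parts of speech from WordNet using the NLTK library; the use of NLTK, the seed synset and the depth limit are all assumptions, since the patent does not prescribe a particular toolkit.

```python
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

# The four relations kept from WordNet, as named above.
RELATIONS = {
    'hyponym-of':   lambda s: s.hyponyms(),
    'instance-of':  lambda s: s.instance_hyponyms(),
    'part-of':      lambda s: s.part_meronyms(),
    'attribute-of': lambda s: s.attributes(),  # links nouns to adjectives
}

def build_ontology_tree(seed='person.n.01', max_depth=3):
    """Collect (parent, relation, child) edges outward from a seed
    synset, keeping only nouns ('n') and adjectives ('a'/'s')."""
    frontier = [(wn.synset(seed), 0)]
    seen = {frontier[0][0]}
    edges = []
    while frontier:
        synset, depth = frontier.pop()
        if depth >= max_depth:
            continue
        for name, expand in RELATIONS.items():
            for child in expand(synset):
                if child.pos() in ('n', 'a', 's'):
                    edges.append((synset.name(), name, child.name()))
                    if child not in seen:
                        seen.add(child)
                        frontier.append((child, depth + 1))
    return edges
```

Note that the attribute-of relation is what ties the adjective labels to their noun nodes, so both parts of speech end up in one tree.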
In one possible implementation, step S103 of mapping the known privacy information and the unknown privacy information respectively onto nodes in the privacy information ontology tree includes:
calculating, using a vector space model (VSM) algorithm, the semantic similarity between the known privacy information and the unknown privacy information, respectively, and the keyword corresponding to each node;
In the privacy information ontology tree, each node represents a specific piece of privacy information, represented by a keyword; the keywords inherit the vocabulary of WordNet and carry corresponding sense definitions. The principle of the vector space model (VSM) algorithm is to calculate the semantic similarity of two words from the proportion of words repeated between their definitions. The calculation in this step is decisive for the mapping.
If the semantic similarity is not zero, mapping the known privacy information and the unknown privacy information to the corresponding nodes;
When the semantic similarity of two words is not 0, a mapping relation is established between them and they are considered to express the same privacy information. For example, in the target system, "home" and "address" both denote the user's address; when the attribute of some information in the target system is "home", the algorithm maps that information to the "address" node of the privacy information ontology tree, enabling the subsequent node traversal and calculation.
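A minimal sketch of such a gloss-based VSM similarity follows, assuming NLTK's WordNet interface and a plain bag-of-words cosine; both are assumptions, since the patent does not fix the exact formulation.

```python
import math
from collections import Counter
from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def gloss_vector(word):
    """Bag-of-words vector built from all WordNet glosses of `word`."""
    tokens = []
    for synset in wn.synsets(word):
        tokens.extend(synset.definition().lower().split())
    return Counter(tokens)

def vsm_similarity(word1, word2):
    """Cosine of the two gloss vectors: nonzero exactly when the
    definitions of the two words share at least one word."""
    v1, v2 = gloss_vector(word1), gloss_vector(word2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0
```

Under this sketch, vsm_similarity('home', 'address') is nonzero because the two glosses share words, so the mapping in the example above would be established.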
If the semantic similarity is zero, calculating, using a cosine similarity algorithm, the word similarity between the known privacy information and the unknown privacy information to be mapped and the keyword corresponding to each node;
the VSM algorithm is an algorithm produced on the basis of a WordNet ontology library, can only calculate words in the WordNet ontology library, but cannot correctly measure the similarity between abbreviations and the words not in the WordNet, so a cosine similarity calculation method is introduced. Referring to the application diagram of the vector space model VSM algorithm and the cosine similarity algorithm shown in FIG. 3; inputting the attribute of certain information in a target system and the key words of nodes into a Vector Space Model (VSM) algorithm and a cosine similarity algorithm; obtaining semantic similarity between information and keywords through a Vector Space Model (VSM) algorithm; obtaining word similarity between the information and the keywords through a cosine similarity algorithm; the result can be obtained according to the semantic similarity or the word similarity. The meaning of the arrow labeled "decide" in fig. 3 is: if the cosine similarity algorithm can convert the mapped words into the existing words in WordNet, then VSM is further used for semantic similarity calculation. After the computation is finished, all the known information in the information system is mapped to the nodes of the privacy information ontology tree.
If the word similarity is greater than a preset threshold, the known privacy information and the unknown privacy information are mapped to the corresponding nodes.
For example, when the cosine similarity exceeds 0.7, the word to be mapped is regarded as an abbreviation of the node keyword, so that a pair such as "addr" and "address" is also given a mapping relation. The cosine similarity algorithm can also expand abbreviations into their full forms, thereby providing candidates for the VSM algorithm.
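A character-level cosine, combined with the fallback logic of fig. 3 and the 0.7 threshold mentioned above, might be sketched as follows; the character-frequency vectorization and the helper names are assumptions, and vsm_similarity is the sketch given earlier.

```python
import math
from collections import Counter

def char_cosine(word1, word2):
    """Character-frequency cosine similarity, usable for abbreviations
    and for words that WordNet does not contain (the exact cosine
    variant is not specified in the text; this choice is an assumption)."""
    v1, v2 = Counter(word1.lower()), Counter(word2.lower())
    dot = sum(v1[c] * v2[c] for c in v1)
    norm = (math.sqrt(sum(x * x for x in v1.values()))
            * math.sqrt(sum(x * x for x in v2.values())))
    return dot / norm if norm else 0.0

def maps_to(word, keyword, threshold=0.7):
    """Combined decision of fig. 3: VSM first, character cosine as the
    fallback when the semantic similarity is zero."""
    if vsm_similarity(word, keyword) > 0:
        return True
    return char_cosine(word, keyword) > threshold
```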
In a possible implementation manner, reference is made to fig. 4, a schematic diagram of the calculation rules used when the privacy leakage value is computed by traversal in the privacy leakage risk measurement algorithm based on the ontology and the information inference relationships.
Calculating the privacy leakage value of the target node according to the privacy leakage value of the current node includes: starting from the target node, finding the corresponding current node according to the parent-child relationship stored in the target node; and calculating the privacy leakage value of the target node according to the privacy leakage value of the current node.
If a plurality of current nodes have a parent-child relationship with the target node, a privacy leakage value of the target node is calculated from the privacy leakage value of each current node, and these per-node contributions are accumulated and summed to obtain the privacy leakage value of the target node.
In one possible embodiment, the step of calculating the privacy leakage value of the target node includes: if the relationship between the target node and the current node is a superior-subordinate relationship hyponym-of or an instance relationship instance-of, and the target node is a child node of the current node, the privacy leakage value of the target node is the same as the privacy leakage value of the current node;
if the relationship between the target node and the current node is a superior-subordinate relationship hyponym-of or an instance relationship instance-of, and the target node is a parent node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node;
if the relationship between the target node and the current node is a part-whole relationship part-of or an attribute relationship attribute-of, and the target node is a child node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node;
and if the relationship between the target node and the current node is a part-whole relationship part-of or an attribute relationship attribute-of, and the target node is a parent node of the current node, the privacy leakage value of the target node is the same as that of the current node.
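These four rules transcribe directly into code. In the following sketch, the meaning of n is taken as the number of child nodes of the parent in the pair; this is an assumption, since the text only states 1/n.

```python
def propagate(value, relation, direction, n):
    """The four rules above. `direction` says where the target node
    sits relative to the current node ('to_child' means the target is
    a child of the current node); `n` is assumed to be the number of
    child nodes involved."""
    if relation in ('hyponym-of', 'instance-of'):
        # same value downward, diluted to 1/n upward
        return value if direction == 'to_child' else value / n
    if relation in ('part-of', 'attribute-of'):
        # diluted to 1/n downward, same value upward
        return value / n if direction == 'to_child' else value
    raise ValueError(f'unknown relation: {relation}')
```

When several current nodes with known values surround the target, the contributions computed by this function are accumulated and summed, capped at 1, as described above.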
In a specific implementation, the traversal starts outward from a node mapped with unknown privacy information, and leakage values are calculated according to the above rules; if the leakage value of a node is still unknown after it has been passed, the traversal continues until a node mapped with known privacy information is encountered. The leakage value of the node with unknown privacy information is then calculated from the leakage value of the node whose privacy information is known.
Traversal assignments follow three principles:
1. Nodes that have already been computed are not computed again, unless a path intersection is involved;
2. Structurally, the traversal is breadth-first; relationally, priority is given to relations that do not change the leakage value, for example the hyponym-of relation from parent to child and the part-of relation from child to parent, which improves the efficiency of the algorithm so that it finishes in as short a time as possible;
3. When paths cross, the leakage value at the crossing point takes the larger of the values obtained along the two paths.
When the privacy leakage value of a target node reaches 1, the traversal beyond the current node is stopped immediately and the algorithm returns to the previous node to look for a new traversal path, since further traversal would be meaningless. The algorithm terminates as a whole when the leakage values of the current points on all paths have reached 1, or when no new node can be traversed. If a threshold is set, the traversal can be controlled by the threshold of the specific information, likewise stopping when the privacy leakage value of the target node reaches that threshold.
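Putting the rules and the three principles together, a breadth-first sketch might look as follows. The adjacency representation and the direction encoding are assumptions, and propagate is the rule function sketched above; propagating from the known nodes toward the target is equivalent to traversing outward from the target until known nodes are met, since the rules are applied per edge.

```python
from collections import deque

def compute_leakage(graph, known, target, stop=1.0):
    """Sketch of the traversal. `graph` is assumed to map each node to
    a list of (neighbor, relation, direction, n) tuples, where
    `direction` is 'to_child' if the neighbor is a child of the node;
    `known` maps nodes with known privacy information to their values."""
    values = dict(known)       # node -> current leakage value
    queue = deque(known)       # breadth-first (principle 2)
    while queue:
        node = queue.popleft()
        for neighbor, relation, direction, n in graph.get(node, []):
            candidate = min(propagate(values[node], relation,
                                      direction, n), 1.0)
            # Principle 3: at a path crossing keep the larger value;
            # principle 1: a node is re-expanded only if its value grew.
            if candidate > values.get(neighbor, 0.0):
                values[neighbor] = candidate
                if candidate < stop:        # past a fully leaked node,
                    queue.append(neighbor)  # further traversal adds nothing
    return values.get(target, 0.0)
```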
In one possible implementation manner, determining the leakage risk degree of the unknown privacy information according to the privacy leakage value of the target node includes: if the privacy leakage value reaches a preset threshold, determining the unknown privacy information to be high-leakage-risk information or leaked information; and if the privacy leakage value does not reach the preset threshold, determining the unknown privacy information to be low-leakage-risk information or non-leaked information.
In a specific implementation, the results are then classified. Information represented by a node whose leakage value has reached 1 (or the threshold) is judged to be "leaked information"; information represented by a node whose leakage value has not reached 1 (or the threshold) is judged to be "non-leaked information"; and information represented by a node whose leakage value has not reached 1 (or the threshold) but is close to it is judged to be "information with a higher leakage risk". In this step, all calculated privacy leakage values are stored on the corresponding nodes, so that when the set of known information is unchanged and only the calculation target changes, the stored values can be used directly in the next calculation, greatly optimizing the efficiency of the algorithm. When the set of known information changes, the leakage value of each node must be recalculated.
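The classification itself is a simple comparison; in sketch form, with the width of the "close to the threshold" band as an assumption, since the text does not quantify it:

```python
def classify(value, threshold=1.0, margin=0.1):
    """Sorting of results as described above; `margin`, the width of
    the 'close to the threshold' band, is an assumed parameter."""
    if value >= threshold:
        return 'leaked information'
    if value >= threshold - margin:
        return 'information with a higher leakage risk'
    return 'non-leaked information'
```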
Finally, the determined result is transmitted to the corresponding module in the target system or, in a simpler system, presented directly to the user.
Corresponding to the above method, the present application further provides an apparatus for measuring risk of information leakage, applied to a target system; referring to fig. 5, a schematic diagram of the apparatus for measuring risk of information leakage, the apparatus includes:
a building module 51, configured to build a privacy information ontology tree including a plurality of nodes;
a privacy information determination module 52 for determining known privacy information and unknown privacy information;
the mapping module 53 is configured to map the known privacy information and the unknown privacy information to each node in the privacy information ontology tree respectively;
the calculation module 54 is configured to select a node mapped with unknown privacy information from the privacy information ontology tree as a target node;
acquire, according to the target node, a node having a parent-child relationship with the target node as a current node;
and, if the privacy leakage value of the current node is known, calculate the privacy leakage value of the target node according to the privacy leakage value of the current node;
and a leakage risk degree determining module 55, configured to determine the leakage risk degree of the unknown privacy information according to the privacy leakage value of the target node.
In the apparatus, the calculation module 54 calculates the privacy leakage value of the target node from the privacy leakage value of a node whose privacy information is known, and the leakage risk degree determining module 55 determines the leakage risk degree of the unknown privacy information according to the privacy leakage value of the target node. In this manner, the leakage degree of unknown privacy information is calculated from the leakage degree of known privacy information, so that the protection strength applied to the unknown privacy information can be increased according to its leakage degree.
In one possible implementation, the calculation module 54 is further configured to: starting from the target node, find the corresponding current node according to the parent-child relationship stored in the target node; and calculate the privacy leakage value of the target node according to the privacy leakage value of the current node.
In one possible implementation, the calculation module 54 is further configured such that: if the relationship between the target node and the current node is a superior-subordinate relationship or an instance relationship and the target node is a child node of the current node, the privacy leakage value of the target node is the same as the privacy leakage value of the current node;
if the relationship between the target node and the current node is a superior-subordinate relationship or an instance relationship and the target node is a parent node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node;
if the relationship between the target node and the current node is a part-whole relationship or an attribute relationship and the target node is a child node of the current node, the privacy leakage value of the target node is 1/n of the privacy leakage value of the current node;
and if the relationship between the target node and the current node is a part-whole relationship or an attribute relationship and the target node is a parent node of the current node, the privacy leakage value of the target node is the same as that of the current node.
In a possible embodiment, the leakage risk degree determining module 55 is further configured to: if the privacy leakage value reaches a preset threshold, determine the unknown privacy information to be high-leakage-risk information or leaked information; and if the privacy leakage value does not reach the preset threshold, determine the unknown privacy information to be low-leakage-risk information or non-leaked information.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.