WO2023165702A1 - Data management device and data management method - Google Patents

Data management device and data management method

Info

Publication number
WO2023165702A1
Authority
WO
WIPO (PCT)
Prior art keywords
pii
node
graph
elements
data management
Prior art date
Application number
PCT/EP2022/055453
Other languages
English (en)
Inventor
Shahar SALZMAN
Idan Zach
Assaf Natanzon
Elizabeth FIRMAN
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/055453
Publication of WO2023165702A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/358 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Definitions

  • the present disclosure relates generally to the field of data management; and more specifically, to a data management device and a computer-implemented method of data management.
  • GDPR General Data Protection Regulation
  • HIPAA Health Insurance Portability and Accountability Act
  • various organizations e.g., private organizations, or government organizations
  • the potential customers may also be referred to as subjects.
  • the data saved by various organizations may be distributed among different storage systems and storage tiers (e.g., some information stored in on-prem datacenters and other information stored in cloud datacenters). Therefore, getting information about a specific subject may be a complicated and time-consuming task, and may sometimes require manual effort.
  • different organizations have used catalog solutions which store metadata about either all or some of the data saved by an organization.
  • the catalog solutions allow retrieval of information about subjects, which may be used to answer queries (e.g., data subject access request, DSAR) from regulatory bodies.
  • queries e.g., data subject access request, DSAR
  • data is constantly flowing into a typical data storage system and therefore, information is required to be constantly indexed. That is, the correlation between subject information and the subject must be maintained constantly, which is not a trivial task.
  • Another issue regarding correlation of data to the subject is that the data is not always structured, and even in structured data, correlation between data items is not an easy task.
  • Another implementation issue is that the subject is required to specify some form of identification, and in response to this, the organization is required to provide all the relevant information collectively about the subject.
  • the present disclosure provides a data management device and a computer-implemented method of data management.
  • the present disclosure provides a solution to the existing problem of inefficiently identifying the relevant information related to a data subject due to inadequate correlation between various PII elements of the data subject.
  • An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provide an improved data management device and an improved computer-implemented method of data management.
  • the present disclosure provides a data management device, comprising an input unit configured to receive at least one document.
  • the data management device further comprises an identification unit configured to identify one or more personally identifiable information, PII, elements in the received document and a relation unit configured to identify one or more relations between pairs of PII elements identified in the received document.
  • the data management device further comprises a mapping unit configured to generate a graph by adding each identified PII element as a node, adding each identified relation as an edge, assigning an accuracy score and a uniqueness score to each node, and assigning a relation accuracy score to each edge.
  • the data management device efficiently identifies the relevant information related to a data subject due to adequate correlation between the one or more PII elements identified in a data storage system and the at least one PII element specified in the identity request. Moreover, the data management device identifies the one or more PII elements related to the data subject not only from a single document but from the multiple documents received by the input unit and saved in the data storage system. The data management device uses the weighting factor in order to locate a cluster of such PII elements in the graph representing the data subject while removing irrelevant information.
  • the data management device further comprises a reporting module, including a request input unit configured to receive a request specifying at least one PII element, and a discovery unit configured to traverse the graph starting from the specified PII element and generate a list including each traversed PII element, where the traversal is limited by a weighting factor based on the assigned scores.
  • At least one PII element specified in the request is correlated more accurately with each of the one or more PII elements of the data subject, which are defined as nodes in the graph. By virtue of the discovery unit, the graph is efficiently traversed, returning the PII elements that are closest to the PII element specified in the request.
  • each node of the graph includes information on at least one received document related to the PII element, and where the discovery unit is configured to include each related document in the list of traversed PII elements.
  • the discovery unit is configured to traverse the graph using a breadth first search.
  • the weighting factor is calculated for each node by multiplying the accuracy score of the node with a path weight, where the path weight is the product of the path weight of the preceding node, the uniqueness score of the preceding node and the accuracy score of the relation between the two nodes. Computing the weighting factor for each node in this way makes it possible to correlate the one or more PII elements related to the data subject efficiently, with more accuracy and reliability.
  • the next node to be searched is determined to be the node with the highest value of path weight multiplied by uniqueness score.
  • the discovery unit is configured to stop traversing the graph if the weighting factor falls below a predefined threshold.
  • the use of the predefined threshold eliminates one or more PII elements of different subjects, while including all the PII elements of the data subject for which the search is performed.
  • the threshold is adjusted if a plurality of PII elements sharing a common type is found for the same subject.
  • the present disclosure provides a computer-implemented method of data management, comprising receiving, by an input unit, at least one document.
  • the computer-implemented method further comprises identifying, by an identification unit, one or more personally identifiable information, PII, elements in the received document and identifying, by a relation unit, one or more relations between pairs of PII elements identified in the received document.
  • the computer-implemented method further comprises generating, by a mapping unit, a graph by adding each identified PII element as a node, adding each identified relation as an edge, assigning an accuracy score and a uniqueness score to each node, and assigning a relation accuracy score to each edge.
  • the present disclosure provides a computer-readable medium comprising instructions which, when executed by a processor, cause the processor to perform the method.
  • the processor e.g., processor of a device or a system
  • FIG. 1 is a block diagram that illustrates various exemplary components of a data management device, in accordance with an embodiment of the present disclosure
  • FIG. 2 illustrates different types of connections between a plurality of personally identifiable information (PII) elements related to one or more data subjects, in accordance with an embodiment of the present disclosure
  • FIG. 3 illustrates graph traversal directions in a weighted graph, in accordance with an embodiment of the present disclosure
  • FIG. 4 is an exemplary scenario that illustrates filtering of PII elements of different subjects, while including all the PII elements of same subject, in accordance with an embodiment of the present disclosure.
  • FIG. 5 is a flowchart of a computer-implemented method of data management, in accordance with an embodiment of the present disclosure.
  • an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
  • a non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
  • FIG. 1 is a block diagram that illustrates various exemplary components of a data management device, in accordance with an embodiment of the present disclosure.
  • a block diagram 100 of a data management device 102 that includes an input unit 104, an identification unit 106, a relation unit 108, a mapping unit 110, a reporting module 112, a memory 114 and a processor 116.
  • the reporting module 112 includes a request input unit 112A and a discovery unit 112B.
  • the data management device 102 may include suitable logic, circuitry, interfaces, or code that is configured to correlate all the information related to a data subject stored in a data storage system based on a weighted graph representation of the connections between personally identifiable information (PII) elements.
  • PII personally identifiable information
  • the PII elements are present in one or more documents received from the data subject and saved in the data storage system.
  • the data management device 102 is further configured to use the weights in order to locate a cluster of such PII elements in a graph representing the data subject, while removing irrelevant information.
  • the data subject may be a customer or a potential customer of an organization.
  • the data subject may be either a medical exam or a medical practitioner, without limiting the scope of the disclosure.
  • the input unit 104 may include suitable logic, circuitry, interfaces, or code that is configured to receive at least one document. Examples of the input unit 104 may include, but are not limited to, a data terminal, a receiver, a receiving unit, a transceiver, a facsimile machine, a virtual server, and the like.
  • the identification unit 106 may include suitable logic, circuitry, interfaces, or code that is configured to identify one or more personally identifiable information (PII) elements in the received document.
  • the identification unit 106 may also be referred to as a detector.
  • the relation unit 108 may include suitable logic, circuitry, interfaces, or code that is configured to identify one or more relations between pairs of PII elements identified in the received document. Examples of the relation unit 108 may include, but are not limited to, a correlator, auto correlator, cross correlator, and the like.
  • the mapping unit 110 may include suitable logic, circuitry, interfaces, or code that is configured to generate a graph (or a weighted graph) by considering each of the identified one or more PII elements as a node and relation in between the identified one or more PII elements as the edge between two nodes.
  • the reporting module 112 may include suitable logic, circuitry, interfaces, or code that is configured to receive a request related to a query subject and traverse through the graph in order to provide relevant information about the query subject. Moreover, the reporting module 112 reports both the relations and the one or more documents from which the one or more PII elements are identified.
  • the memory 114 may include suitable logic, circuitry, interfaces, or code that is configured to store data and instructions executable by the processor 116. Examples of implementation of the memory 114 may include, but are not limited to, an Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), or CPU cache memory.
  • the memory 114 may store an operating system or other program products (including one or more operation algorithms) to operate the data management device 102.
  • the processor 116 may include suitable logic, circuitry, interfaces, or code that is configured to execute the instructions stored in the memory 114.
  • the processor 116 may be a general-purpose processor.
  • Other examples of the processor 116 may include, but are not limited to, a control unit, a central processing unit (CPU), a digital signal processor (DSP), a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set computing (RISC) processor, a very long instruction word (VLIW) processor, a state machine, a data processing unit, a graphics processing unit (GPU), and other processors or control circuitry.
  • the processor 116 may refer to one or more individual processors, processing devices, or a processing unit that is part of a machine, such as the data management device 102.
  • the data management device 102 comprising the input unit 104 is configured to receive at least one document.
  • the input unit 104 is configured to receive one or more documents related to a data subject.
  • the received one or more documents include information about the data subject.
  • the one or more documents are sent from the data subject to an organization.
  • a medical report may be received by the input unit 104 which includes the details about subject of an exam as well as details of a medical practitioner performing the exam.
  • the data management device 102 further comprises the identification unit 106 configured to identify one or more personally identifiable information, PII, elements in the received document.
  • the PII elements may include name of the data subject, social security number (SSN) of the data subject, phone number of the data subject, credit card number of the data subject, and the like.
  • the PII elements include name of the exam, identity (ID) of the exam and details of the examination as well as name of the medical practitioner, ID of the medical practitioner and medical license of the medical practitioner performing the exam.
  • the data management device 102 further comprises the relation unit 108 configured to identify one or more relations between pairs of PII elements identified in the received document.
  • the relation unit 108 is configured to identify the one or more relations (i.e., a correlation) between various PII elements, such as the name, SSN, phone number, and credit card number of the data subject.
  • the relation unit 108 is configured to identify the correlation between pairs of the PII elements, such as the name, ID and details of the examination as well as between the name, ID and medical license of the medical practitioner performing the exam.
  • the relation unit 108 may be configured to group together the subject information (i.e., the exam information) and medical practitioner information without mixing details of one (i.e., PII elements of the exam) with the other (i.e., PII elements of the medical practitioner).
  • the data management device 102 further comprises the mapping unit 110 configured to generate a graph by adding each identified PII element as a node, adding each identified relation as an edge, assigning an accuracy score and a uniqueness score to each node, and assigning a relation accuracy score to each edge.
  • the mapping unit 110 is configured to generate the graph (or a weighted graph) by adding each identified PII element as the node of the graph and adding each identified correlation between the PII elements as the edge.
  • the mapping unit 110 may be configured to map the various PII elements, such as name of the data subject, SSN of the data subject, phone number of the data subject, credit card number of the data subject as the node of the graph.
  • the mapping unit 110 may be further configured to map the correlation between each PII element as the edge of the graph. Thereafter, the mapping unit 110 may be further configured to assign the accuracy score and uniqueness score to each PII element (i.e., the node in the graph).
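The graph generation described for the mapping unit 110 can be illustrated with a minimal Python sketch. The class names `PiiNode` and `PiiGraph`, the key format, and all score values are hypothetical and chosen only for illustration; the disclosure does not prescribe a particular data layout.

```python
from dataclasses import dataclass, field


@dataclass
class PiiNode:
    """One identified PII element (hypothetical layout)."""
    pii_type: str
    value: str
    accuracy: float     # detection accuracy score, between 0 and 1
    uniqueness: float   # uniqueness score, between 0 and 1
    documents: set = field(default_factory=set)  # documents the element was found in


class PiiGraph:
    """Undirected weighted graph: nodes are PII elements, edges are relations."""

    def __init__(self):
        self.nodes = {}   # key -> PiiNode
        self.edges = {}   # key -> {neighbor key: relation accuracy score}

    def add_node(self, key, node):
        for score in (node.accuracy, node.uniqueness):
            if not 0.0 <= score <= 1.0:
                raise ValueError("scores must lie between 0 and 1")
        self.nodes[key] = node
        self.edges.setdefault(key, {})

    def add_relation(self, key_a, key_b, relation_accuracy):
        if not 0.0 <= relation_accuracy <= 1.0:
            raise ValueError("relation accuracy must lie between 0 and 1")
        # Store the edge in both directions; the graph itself is not directional.
        self.edges[key_a][key_b] = relation_accuracy
        self.edges[key_b][key_a] = relation_accuracy


# Illustrative values only.
graph = PiiGraph()
graph.add_node("name:John Smith", PiiNode("name", "John Smith", 0.9, 0.6))
graph.add_node("ssn:123456789", PiiNode("ssn", "123456789", 0.95, 1.0))
graph.add_relation("name:John Smith", "ssn:123456789", 0.95)
```

Keeping the scores on the nodes and edges means new documents can add nodes and relations without touching entries that already exist.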
  • the accuracy score describes the accuracy of the identification performed by the identification unit 106 (or the underlying detector). The accuracy score lies in a range of 0 to 1.
  • the accuracy score may also be referred to as either a detection score or a PII detection score.
  • an Israeli ID is an eight-digit number followed by a validation digit. The validation digit can be calculated from the other eight digits.
  • the identification unit 106 i.e., the detector
  • when the identification unit 106 identifies a nine-digit number whose last digit matches the validation of the first eight digits, there is a high chance that this number is an Israeli ID. However, since the identification unit 106 (i.e., the detector) cannot validate that this ID actually matches a person in formal government documents, the accuracy score is less than 100%. The reason is that the eight-digit number with one validation digit at the end may be a different type of PII element, in which case the validation match is coincidental.
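The detector behaviour described above can be sketched as follows. The disclosure does not spell out how the validation digit is computed; the sketch assumes the commonly documented Luhn-style rule for Israeli ID numbers (alternating weights 1 and 2, digit sums of the products, total divisible by 10). The function names and the score of 0.9 for a passing candidate are hypothetical.

```python
def israeli_id_check_digit(first_eight: str) -> int:
    """Compute the validation digit for the first eight digits (assumed Luhn-style rule)."""
    total = 0
    for index, char in enumerate(first_eight):
        product = int(char) * (1 if index % 2 == 0 else 2)
        # A two-digit product contributes the sum of its digits.
        total += product if product < 10 else product - 9
    return (10 - total % 10) % 10


def detection_accuracy(candidate: str, score_if_valid: float = 0.9) -> float:
    """Return a hypothetical accuracy score for a candidate Israeli ID."""
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return 0.0
    expected = israeli_id_check_digit(candidate[:8])
    # Kept below 1.0: a matching digit cannot prove the number belongs to a real person.
    return score_if_valid if int(candidate[8]) == expected else 0.0
```

A candidate that passes the check still receives a score below 1, reflecting that the match may be coincidental.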
  • the uniqueness score describes how unique the PII element is. The value of the uniqueness score lies between 0 and 1.
  • the unique PII element may be defined as a PII element which is unique by law, with a uniqueness score equal to 1, such as a social security number (SSN) or passport number (PPN).
  • SSN social security number
  • PPN passport number
  • other PII elements, such as home address, phone number, credit card number, and the like, are assigned a value less than 1, where the higher the value, the more unique the PII element is. Since such PII elements are domain specific, their values need to be re-assigned once a better sample of the population is available, as described later with reference to FIG. 1.
  • the mapping unit 110 is configured to assign the relation accuracy score to each edge in the graph.
  • the relation accuracy score (may also be named as PII relation accuracy score) describes the accuracy of the relation identified by the relation unit 108.
  • the value of the relation accuracy score lies between 0 and 1.
  • the relation unit 108 derives the relation between the name and ID, although the address is not related. Since the PII elements appear in the same sentence, they are somewhat related, so the relation unit 108 may generate a relation between the Name/ID and address, but this relation should be scored lower than the relation between the Name and ID. In such scenarios, the relation accuracy score may be less than unity.
  • the data management device 102 further comprises the reporting module 112, including the request input unit 112A configured to receive a request specifying at least one PII element, and the discovery unit 112B configured to traverse the graph starting from the specified PII element and generate a list including each traversed PII element, where the traversal is limited by a weighting factor based on the assigned scores.
  • the request input unit 112A is configured to receive the request (also named as identity request) which includes either one PII element or a group of PII elements that is sufficient to identify the data subject.
  • the request (i.e., the identity request) is used to describe any query of a system that should result in all the relevant information about the query subject, no matter which regulation this request is used to fulfill, for example, a data subject access request (DSAR).
  • After the request (i.e., the identity request) about the query subject is received by the request input unit 112A, the discovery unit 112B is configured to gather the relevant information about the query subject by use of the graph traversal. The graph traversal starts from the PII element or the group of PII elements specified by the request (i.e., the identity request). In order to gather the relevant information about the query subject, the discovery unit 112B is further configured to generate the list including each traversed PII element.
  • the graph traversal is limited by the weighting factor which is computed based on the accuracy and uniqueness score assigned to each node and the relation accuracy score assigned to each edge.
  • the graph traversal is limited only to those PII elements that are closely related to one PII element or the group of PII elements specified in the received request.
  • the discovery unit 112B is configured to traverse the graph using a breadth first search.
  • the graph traversal is performed using the breadth first search (BFS) algorithm which uses the weighting factor in order to limit the graph traversal only to those PII elements that are closely related to one PII element or the group of PII elements specified in the received request.
  • BFS breadth first search
  • the weighting factor is calculated for each node by multiplying the accuracy score of the node with a path weight, where the path weight is the product of the path weight of the preceding node, the uniqueness score of the preceding node and the accuracy score of the relation between the two nodes.
  • each node is assigned the weighting factor which is calculated by multiplying the accuracy score (i.e., the PII accuracy score) of the node with the path weight (may also be referred to as a path product).
  • the accuracy score of the node in the file.relations links the path weight (i.e., the path product) to a specific data instance.
  • the path weight is defined as the product, taken over each relation in the path, $\prod_{\text{each relation in the path}} R \cdot U$, where R is the relation accuracy score between the two nodes and U is the PII uniqueness score of the preceding node.
  • the weighting factor is a diminishing product because all the scores, that is, the accuracy score, the uniqueness score and the relation accuracy score, lie between 0 and 1.
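The path weight and weighting factor defined above can be computed as in the following minimal Python sketch. It assumes the path weight at the starting node is 1, which the text does not state explicitly; the function names and the example scores are hypothetical.

```python
def path_weight(steps):
    """Product of U * R over each relation in the path.

    steps: list of (uniqueness of the preceding node U, relation accuracy R),
    one entry per relation followed. An empty path has weight 1 (assumption).
    """
    weight = 1.0
    for uniqueness, relation_accuracy in steps:
        weight *= uniqueness * relation_accuracy
    return weight


def weighting_factor(node_accuracy, steps):
    """Accuracy score of the reached node multiplied by the path weight."""
    return node_accuracy * path_weight(steps)


# Two relations followed: from a unique node (U = 1.0, R = 0.9),
# then from a shared node (U = 0.5, R = 0.8). Values are illustrative.
example_steps = [(1.0, 0.9), (0.5, 0.8)]
example_weight = path_weight(example_steps)             # approximately 0.36
example_factor = weighting_factor(0.9, example_steps)   # approximately 0.324
```

Because every factor lies between 0 and 1, each additional relation can only keep the product the same or shrink it.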
  • the next node to be searched is determined to be the node with the highest value of path weight multiplied by the uniqueness score.
  • multiple nodes that are yet to be searched may be available in the graph.
  • the BFS algorithm uses a concept of “search front” to describe the node (i.e., the next node) that will be searched.
  • the BFS algorithm is required to sort the nodes according to the path weight multiplied by the node uniqueness score, so that the node having the highest value of multiplication of the path weight with the node uniqueness score is selected for next search.
  • the discovery unit 112B is configured to stop traversing the graph if the weighting factor falls below a predefined threshold.
  • the predefined threshold may either be controlled by a user issuing the request (i.e., the identity request), or by the data storage system using heuristics, such as location of duplicate PII types for the same user, that are not acceptable by law, for example, finding two social security numbers for the same user.
  • the provided results are sorted by the path weight (i.e., the path product), so that either manual or computational review of the results is focused on those that are as close as possible to the PII elements specified in the request (i.e., the identity request).
  • path weight i.e., the path product
  • traversal threshold i.e., the predefined threshold
  • the path threshold is used to filter out the results in the same way that the path depth is used in the typical BFS algorithm. If the path weight (i.e., the path product) is smaller than the traversal threshold (i.e., the predefined threshold), then the node is removed from the search front, and additional traversal through this path is stopped.
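The search front and threshold described above can be sketched with a priority queue. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `discover`, the input layout, the choice to always include the starting node, and all score values are hypothetical. The node names follow FIG. 3.

```python
import heapq


def discover(nodes, edges, start, threshold):
    """Return (key, path weight) pairs reached from start, strongest first.

    nodes: key -> (accuracy, uniqueness)
    edges: key -> {neighbor key: relation accuracy}
    """
    # Search front ordered by path weight x uniqueness (negated: heapq is a min-heap).
    front = [(-nodes[start][1], start, 1.0)]
    reached = {}
    while front:
        _, key, weight = heapq.heappop(front)
        if key in reached:
            continue
        accuracy, uniqueness = nodes[key]
        # Weighting factor = node accuracy x path weight; stop this path below the threshold.
        if key != start and accuracy * weight < threshold:
            continue
        reached[key] = weight
        for neighbor, relation_accuracy in edges.get(key, {}).items():
            if neighbor not in reached:
                next_weight = weight * uniqueness * relation_accuracy
                priority = next_weight * nodes[neighbor][1]
                heapq.heappush(front, (-priority, neighbor, next_weight))
    # Results sorted by path weight so review focuses on the closest elements.
    return sorted(reached.items(), key=lambda item: item[1], reverse=True)


# Illustrative scores only.
nodes = {
    "A_SSN": (0.95, 1.0), "A_Name": (0.9, 0.7), "AB_Address": (0.95, 0.3),
    "B_Name": (0.9, 0.7), "B_SSN": (0.95, 1.0),
}
edges = {
    "A_SSN": {"A_Name": 0.95, "AB_Address": 0.9},
    "A_Name": {"A_SSN": 0.95},
    "AB_Address": {"A_SSN": 0.9, "B_Name": 0.9},
    "B_Name": {"AB_Address": 0.9, "B_SSN": 0.95},
    "B_SSN": {"B_Name": 0.95},
}
result = discover(nodes, edges, "A_SSN", threshold=0.8)
```

With these example scores, the shared address is reached, but the low uniqueness of the address pushes the weighting factor of B_Name below the threshold, so the elements of subject B are not returned.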
  • the threshold is adjusted if a plurality of PII elements sharing a common type is found for the same subject.
  • the predefined threshold is adjusted if the plurality of PII elements of the common type is found for the same subject.
  • the plurality of PII elements of the common type may exist in the same traversal path of the graph.
  • each node of the graph includes information on at least one received document related to the PII element, and where the discovery unit 112B is configured to include each related document in the list of traversed PII elements.
  • Each node of the graph includes the information available either on the at least one document or the one or more documents related to the one or more PII elements.
  • the discovery unit 112B is configured to include each related document from the one or more received documents in the list of traversed PII elements.
  • two types of relations are used, namely pii.relations and file.relations.
  • the pii.relations are relations between a PII element and another PII element.
  • the file.relations are relations between a PII element and a file or a database (DB) in the data storage system.
  • DB database
  • <Name-John Smith> is located in file /tmp/john_smith_passport.jpeg.
  • representing relations in this manner allows correlations to be built during system operation, and allows relations to be added to the graph without re-calculating old relations.
  • Different connections, such as strong and weak connections between various PII elements are described in detail, for example, in FIG. 2.
  • the accuracy score (i.e., the PII accuracy score) is assigned to the file.relations, and the uniqueness score (i.e., the PII uniqueness score) and the relation accuracy score (i.e., the PII relation accuracy score) are assigned to the pii.relations.
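One possible way to hold the two relation types is as append-only records, sketched below in Python. The record classes, field names, helper function and score values are hypothetical; only the file path and the example PII values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRelation:
    """Relation between a PII element and a file or database location."""
    pii: str
    location: str
    accuracy: float            # PII accuracy score for this data instance


@dataclass(frozen=True)
class PiiRelation:
    """Relation between a PII element and another PII element."""
    source: str
    target: str
    source_uniqueness: float   # PII uniqueness score of the preceding node
    relation_accuracy: float   # PII relation accuracy score


# Append-only lists: adding a relation never requires re-calculating old ones.
file_relations = []
pii_relations = []


def locations_of(pii):
    """List every location in which the given PII element was found."""
    return [relation.location for relation in file_relations if relation.pii == pii]


# Illustrative scores only.
file_relations.append(
    FileRelation("Name-John Smith", "/tmp/john_smith_passport.jpeg", 0.9)
)
pii_relations.append(
    PiiRelation("SSN-123456789", "Name-John Smith", 1.0, 0.95)
)
```

Keeping the file relation separate from the PII relation lets a report list both the related PII elements and the documents in which each was identified.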
  • the mapping unit 110 of the data management device 102 may perform adaptive assignment and re-assignment of the uniqueness score to each PII element.
  • the one or more PII elements are not required to be unique by law.
  • An initial value can be given by a system (e.g., a data storage system) to the one or more PII elements and stored in a database (DB) by a service performing the identity request.
  • a type of each PII element, which is a unique identifier of the PII element, is specified in the system. This allows the value of the uniqueness score to be adaptively adjusted if the specific population deviates from the initial pre-defined values.
  • home address uniqueness is defined by how many subjects share the same home address.
  • An initial value may be given according to subject age ranges (can be taken from organization data), and population data (can be taken from government population statistics). Once a large enough population is detected in the organization data, the value may be re-assigned for an improved representation of data, e.g., only a single subject is defined per home address, since the service defines a username that is sufficient for the entire home address.
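The adaptive re-assignment can be illustrated with a small Python sketch. The text does not give a formula; the sketch assumes, purely for illustration, that uniqueness is estimated as the inverse of the average number of subjects sharing a value, and that re-assignment happens only once a minimum number of samples has been observed. The function name and the `min_samples` parameter are hypothetical.

```python
def reestimate_uniqueness(initial, subjects_per_value, min_samples=100):
    """Return an updated uniqueness score for one PII type.

    initial: the initial uniqueness value given by the system
    subjects_per_value: for each observed value (e.g., each home address),
        how many subjects share it in the organization data
    """
    if len(subjects_per_value) < min_samples:
        # Population not yet large enough: keep the initial value.
        return initial
    average = sum(subjects_per_value) / len(subjects_per_value)
    # Assumed estimate: one subject per value gives 1.0, two per value gives 0.5.
    return min(1.0, 1.0 / max(average, 1.0))
```

For a service in which every home address maps to a single username, such an estimate would move the home-address uniqueness towards 1, matching the example given above.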
  • the data management device 102 efficiently identifies the relevant information related to the data subject due to adequate correlation between the one or more PII elements identified in a data storage system and the at least one PII element specified in the identity request.
  • the data management device 102 uses the path weight (i.e., the path product) on the weighted PII or file relation graph in order to better identify the one or more PII elements related to the data subject in the data storage system with non-trivial correlation between the one or more PII elements.
  • the identification unit 106 enables the data management device 102 to identify the one or more PII elements related to the data subject not only from a single document but from the multiple documents received by the input unit 104 and saved in the data storage system.
  • the relation unit 108 enables the data management device 102 to identify the relation between information from the multiple documents in contrast to a conventional method in which a single document is considered in order to identify the PII element while not addressing the multiple documents for identification of the one or more PII elements and finding adequate correlation between the identified PII elements.
  • the data management device 102 uses the weighting factor in order to locate a cluster of such PII elements in the graph representing the data subject while removing irrelevant information.
  • FIG. 2 illustrates different types of connections between a plurality of personally identifiable information (PII) elements related to one or more data subjects, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is described in conjunction with elements from FIG. 1.
  • With reference to FIG. 2, there is shown an exemplary scenario 200 that illustrates a graph 202 representing the relations between a plurality of PII elements related to one or more data subjects.
  • the graph 202 is represented by a dashed box, which is used for illustration purpose only.
  • the plurality of PII elements includes name, social security number (SSN), address and another name.
  • the graph 202 represents the relations between John Smith, SSN - 123456789, address - 19657 B Street, New York, and Jane Smith.
  • John Smith and Jane Smith are considered as different data subjects sharing a common PII element (e.g., home address).
  • the identified relations are termed as strong and weak connections.
  • the exemplary scenario 200 may be an example of PII elements saved by an organization with both John Smith and Jane Smith, different data subjects (or users) which share the same home address.
  • Upon an identity request for John Smith, SSN - 123456789, the organization should not provide the name and other details of Jane Smith, even though performing a traversal through the graph 202 will provide Jane Smith and her details.
  • PII elements are shared by nature, for example, home address, phone number, etc.
  • a weighting factor (described earlier in FIG. 1) is used distinguish between “strong” connections, which have a high probability of belonging to the same data subject and “weak” connections, which have a low probability of belonging to the same data subject.
  • the SSN - 123456789 has high probability of belonging to John Smith
  • home address - 19657 B Street, New York has low probability of belonging to Jon Smith.
  • the graph 202 representation eases the requirement to process data as the data flows in data storage system.
  • FIG. 3 illustrates graph traversal directions in a weighted graph, in accordance with an embodiment of the present disclosure.
  • FIG. 3 is described in conjunction with elements from FIGs. 1 and 2.
  • With reference to FIG. 3, there is shown an exemplary scenario 300 that illustrates graph traversal directions in a weighted graph 302.
  • the weighted graph 302 includes a first portion 302A and a second portion 302B.
  • the weighted graph 302 is represented by a dashed box, which is used for illustration purpose only.
  • the first portion 302A represents the relations between PII elements of subject A, such as A_ID, A_Social Security Number (SSN), A_Credit_Card, and A_Name.
  • the second portion 302B represents the relations between PII elements of subject B, such as B_ID, B_SSN, and B_Name, where AB_Address is shared between the first portion 302A and the second portion 302B.
  • a predefined threshold value of 0.8 will include the shared PII element, that is, AB_Address, but not the nodes connected to it that belong to the other subject. This means that the other two arrows connected to the shared PII element AB_Address will not be followed.
  • the weighted graph 302 is not required to be directional; however, traversal of the weighted graph 302 using the BFS algorithm has an implicit direction dictated by the way the path product is calculated. Since the calculation uses the PII score of the originating node in the traversal, if the originating node and the target node have different uniqueness scores, traversal from one node to the other results in a different path product than traversal in the opposite direction. This is advantageous, since the shared nodes (representing PII elements shared by multiple subjects), which by definition have a low uniqueness score, act as “sinks”: the low uniqueness score multiplied by the path product significantly reduces the weighting factor used in the BFS algorithm.
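  • The implicit direction can be illustrated with a minimal sketch. The function name `path_weight`, all score values, and the starting path weight of 1 are assumptions made for illustration; only the multiplication rule and the threshold of 0.8 come from the description.

```python
def path_weight(prev_path_weight, prev_uniqueness, relation_accuracy):
    """Path weight of a node reached from a preceding (originating) node."""
    return prev_path_weight * prev_uniqueness * relation_accuracy

# Hypothetical scores, chosen only for illustration.
SSN_UNIQUENESS = 1.0       # an SSN identifies a single subject
ADDRESS_UNIQUENESS = 0.3   # a shared address is a weak identifier
RELATION_ACCURACY = 0.9    # the same edge score is used in both directions
START_PATH_WEIGHT = 1.0    # assumption: traversal starts with a path weight of 1
THRESHOLD = 0.8

# Same edge, opposite directions: the originating node's uniqueness differs.
ssn_to_address = path_weight(START_PATH_WEIGHT, SSN_UNIQUENESS, RELATION_ACCURACY)
address_to_ssn = path_weight(START_PATH_WEIGHT, ADDRESS_UNIQUENESS, RELATION_ACCURACY)

# One hop further, leaving the shared address towards another subject's node.
beyond_address = path_weight(ssn_to_address, ADDRESS_UNIQUENESS, RELATION_ACCURACY)

print(ssn_to_address >= THRESHOLD)   # True: the shared address is included
print(beyond_address >= THRESHOLD)   # False: nodes behind the address are not followed
```

  • With these assumed values, the shared address is reached with a path weight of 0.9, while any node behind it drops to roughly 0.243, which is how a shared node behaves as a “sink”.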
  • FIG. 4 is an exemplary scenario that illustrates filtering of PII elements of different subjects, while including all the PII elements of same subject, in accordance with an embodiment of the present disclosure.
  • FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3.
  • With reference to FIG. 4, there is shown an exemplary scenario 400 that illustrates a weighted graph 402 including a first portion 402A, a second portion 402B, a third portion 402C, and a fourth portion 402D.
  • the weighted graph 402 is represented by a dashed box, which is used for illustration purpose only.
  • the first portion 402A represents the relations between PII elements of subject A, such as A PPT, A_Social Security Number (SSN), A_Credit_Card, A_Name, A_emailID, A_address, and the like.
  • each of the second portion 402B, the third portion 402C and the fourth portion 402D represents PII elements and their relation for different subjects (or users).
  • An organization receives an identity request for “John Smith, SSN 123456”.
  • the BFS algorithm starts with an initial sorted search front of:
  • the detection scores may modify the order of search, for example, if the relation between the SSN and the PPT scores low compared to the relation between the SSN and the Name, but any score must be equal to or lower than the above scores.
  • Any additional path from “John Smith” / “1876 Washington st. Baltimore” has a path product of at most 0.63 for the address (0.9, as stated in path #7, multiplied by 0.7) and 0.4 for the name (1, as stated in path #3, multiplied by 0.4). Using a path threshold above 0.63 therefore eliminates any PII elements from other subjects.
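  • The arithmetic behind this bound can be written out as follows. The symbol T for the path threshold is an assumption introduced here; the factors 0.9, 0.7, 1 and 0.4 are taken directly from the text, without interpreting which scores they are composed of.

```latex
\[
0.9 \times 0.7 = 0.63, \qquad 1 \times 0.4 = 0.4, \qquad
\max(0.63,\ 0.4) = 0.63 < T
\]
```

  • Any threshold T chosen above 0.63 therefore lies above both path products, so neither continuation is followed.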
  • a “good” path threshold may be defined as one that filters out PII elements of different subjects, but includes all the PII elements of the subject for which the search is performed.
  • a fixed value can be used as the path threshold, which is a “good enough” value, but this value can be tweaked during the operation of the BFS algorithm using a heuristic to signal when a path has reached irrelevant PII elements.
  • unique PII elements define a data subject by law; however, duplicate unique PII elements of the same type can exist. For example, two values may be designated as a UK Passport. In such a case, it can be deduced that the second value belongs to another subject and should not be included in the search results. Moreover, it can also be deduced that the predefined threshold value used for this iteration is too low. In that case, the search results are backtracked in an attempt to filter out irrelevant results.
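  • One possible reading of this heuristic is sketched below. The description does not specify how the threshold is raised or how backtracking is carried out, so the function `backtrack_on_duplicates`, the tuple layout of the results, and the rule of cutting at the weaker duplicate's weighting factor are all assumptions.

```python
from collections import defaultdict

def backtrack_on_duplicates(results, unique_types):
    """Filter results when a unique PII type appears with more than one value.

    results: list of (pii_type, value, weighting_factor) tuples.
    unique_types: PII types expected to occur once per subject.
    Returns (filtered_results, cutoff); cutoff is None if nothing was filtered.
    """
    # Best weighting factor seen for each distinct value of each unique type.
    by_type = defaultdict(dict)
    for pii_type, value, factor in results:
        if pii_type in unique_types:
            previous = by_type[pii_type].get(value, 0.0)
            by_type[pii_type][value] = max(previous, factor)

    # A second value of a unique type is presumed to belong to another subject,
    # so its weighting factor marks a threshold that was too low.
    cutoff = None
    for values in by_type.values():
        if len(values) > 1:
            runner_up = sorted(values.values(), reverse=True)[1]
            cutoff = runner_up if cutoff is None else max(cutoff, runner_up)

    if cutoff is None:
        return results, None
    # Keep only results that score strictly above the raised threshold.
    return [r for r in results if r[2] > cutoff], cutoff

# Hypothetical search results for illustration.
found = [
    ("Name", "John Smith", 1.0),
    ("UK Passport", "111", 0.9),
    ("Address", "1876 Washington st. Baltimore", 0.7),
    ("UK Passport", "222", 0.5),
    ("Name", "Jane Smith", 0.45),
]
kept, new_threshold = backtrack_on_duplicates(found, {"UK Passport"})
```

  • In this sketch the second passport value and everything scoring at or below it are dropped. If two duplicates tie on the weighting factor, both are dropped, which a fuller implementation would need to handle explicitly.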
  • FIG. 5 is a flowchart of a computer-implemented method of data management, in accordance with an embodiment of the present disclosure.
  • FIG. 5 is described in conjunction with elements from FIG. 1.
  • With reference to FIG. 5, there is shown a computer-implemented method 500 of data management.
  • the computer-implemented method 500 includes steps 502 to 508.
  • the step 508 includes sub-steps 508A to 508D.
  • the computer-implemented method 500 is executed by the data management device 102 (of FIG. 1).
  • the computer-implemented method 500 of data management is used to provide all the relevant information related to a data subject stored in one or more data storage systems using a weighted graph representation of personally identifiable information (PII) elements of the data subject and the correlation in between the PII elements.
  • the computer-implemented method 500 comprises receiving, by an input unit (e.g., the input unit 104, of FIG. 1), at least one document. In an implementation, only one document may be received by the input unit 104. In another implementation, more than one document may be received by the input unit 104.
  • the received document includes information about a data subject (e.g., a person, an exam, a service, and the like).
  • the computer-implemented method 500 further comprises identifying, by an identification unit (e.g., the identification unit 106), one or more personally identifiable information, PII, elements in the received document.
  • the one or more PII elements related to the data subject are identified by the identification unit 106.
  • the identified one or more PII elements may include name, identity, address, phone number, and the like.
  • the computer-implemented method 500 further comprises identifying, by a relation unit (e.g., the relation unit 108), one or more relations between pairs of PII elements identified in the received document.
  • the relation unit 108 is configured to identify the relation (or correlation) between the identified one or more PII elements.
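  • The description does not state here how relations are detected. As a minimal sketch, assuming that co-occurrence of two PII elements in the same document makes them a candidate relation, the pairs could be enumerated as follows; the function name `candidate_relations` is hypothetical.

```python
from itertools import combinations

def candidate_relations(pii_elements):
    """Return every unordered pair of distinct PII elements found in one document."""
    unique = sorted(set(pii_elements))  # drop repeats, fix a stable order
    return list(combinations(unique, 2))

# PII elements assumed to have been identified in a single received document.
pairs = candidate_relations(
    ["John Smith", "123456789", "19657 B Street, New York", "John Smith"]
)
for a, b in pairs:
    print(a, "<->", b)
```

  • Each candidate pair would still need a relation accuracy score before it is useful for traversal; this sketch only enumerates the pairs.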
  • the computer-implemented method 500 further comprises generating, by a mapping unit (e.g., the mapping unit 110), a graph.
  • the mapping unit 110 is configured to generate the graph, such as the weighted graph 202 of FIG. 2, the weighted graph 302 of FIG. 3 or the weighted graph 402 of FIG. 4, and the like.
  • the step 508 comprises adding each identified PII element as a node.
  • each identified PII element is considered as a node of the graph, as shown in FIGs. 2, 3, and 4.
  • the step 508 further comprises adding each identified relation as an edge.
  • the relation (or the correlation) between the identified PII elements is considered as an edge between the nodes of the graph, as shown in FIGs. 2, 3, and 4.
  • the step 508 further comprises assigning an accuracy score and a uniqueness score to each node.
  • the accuracy score (i.e., PII accuracy score) and the uniqueness score (i.e., PII uniqueness score) are assigned to each node in the graph.
  • the step 508 further comprises assigning a relation accuracy score to each edge.
  • the relation accuracy score (i.e., PII relation accuracy score) is assigned to each edge in the graph, as has been described in detail, for example, in FIG. 1.
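  • The sub-steps of step 508 can be sketched as a small data structure. The class names `PIINode` and `PIIGraph`, the field names, and every score value are hypothetical; the description only requires that nodes carry an accuracy score and a uniqueness score and that edges carry a relation accuracy score.

```python
from dataclasses import dataclass, field

@dataclass
class PIINode:
    pii_type: str
    accuracy: float     # PII accuracy score
    uniqueness: float   # PII uniqueness score

@dataclass
class PIIGraph:
    nodes: dict = field(default_factory=dict)   # PII value -> PIINode
    edges: dict = field(default_factory=dict)   # frozenset({a, b}) -> relation accuracy score

    def add_node(self, value, pii_type, accuracy, uniqueness):
        """Add an identified PII element as a node with its two scores."""
        self.nodes[value] = PIINode(pii_type, accuracy, uniqueness)

    def add_edge(self, a, b, relation_accuracy):
        """Add an identified relation as an undirected, scored edge."""
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both PII elements must be added as nodes first")
        self.edges[frozenset((a, b))] = relation_accuracy

# Illustrative values only.
graph = PIIGraph()
graph.add_node("John Smith", "name", accuracy=0.9, uniqueness=0.6)
graph.add_node("123456789", "ssn", accuracy=1.0, uniqueness=1.0)
graph.add_node("19657 B Street, New York", "address", accuracy=0.9, uniqueness=0.3)
graph.add_edge("John Smith", "123456789", relation_accuracy=1.0)
graph.add_edge("John Smith", "19657 B Street, New York", relation_accuracy=0.9)
```

  • Storing edges under a `frozenset` keeps the graph undirected, which matches the statement that the weighted graph is not required to be directional.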
  • the computer-implemented method 500 further comprises receiving, by a request input unit (e.g., the request input unit 112A), a request specifying at least one PII element and traversing, by a discovery unit (e.g., the discovery unit 112B), the graph starting from the specified PII element and generating a list including each traversed PII element, where the traversal is limited by a weighting factor based on the assigned scores.
  • the request (i.e., identity request) is used to describe a query to an organization specifying one or more PII elements related to a query subject, and results in all relevant information about the query subject.
  • the request is received by the request input unit 112A.
  • the discovery unit 112B is configured to traverse the graph starting from the PII element specified in the request, and to generate the list of each traversed PII element.
  • the graph traversal is limited by the weighting factor which is based on the accuracy score, the uniqueness score and the relation accuracy score.
  • traversing the graph includes using a breadth first search.
  • the graph traversal is performed using the breadth first search (BFS) algorithm which uses the weighting factor in order to limit the graph traversal only to those PII elements that are closely related to one PII element or the group of PII elements specified in the received request.
  • the weighting factor is calculated for each node by multiplying the accuracy score of the node with a path weight, where the path weight is the product of the path weight of the preceding node, the uniqueness score of the preceding node and the accuracy score of the relation between the two nodes.
  • each node is assigned the weighting factor which is calculated by multiplying the accuracy score of the node with the path weight.
  • the path weight depends on multiplication of the path weight of the preceding node, the uniqueness score of the preceding node and the relation accuracy score between the two nodes, described in detail, for example, in FIG. 1.
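  • The two definitions above can be written compactly. The symbols are assumptions introduced here: A(n) is the accuracy score of node n, U(m) the uniqueness score of the preceding node m, R(m, n) the relation accuracy score of the edge between them, P the path weight, and W the weighting factor. The starting value P(s) = 1 for the node s specified in the request is also an assumption, since the text above does not state it.

```latex
\[
P(n) = P(m) \cdot U(m) \cdot R(m, n), \qquad P(s) = 1 \ \text{(assumed)}
\]
\[
W(n) = A(n) \cdot P(n)
\]
```

  • Because every score lies between 0 and 1, P(n) can only stay equal or decrease along a path, so W(n) shrinks as the traversal moves away from the requested PII element.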
  • the next node to be searched is determined to be the node with the highest value of path weight multiplied by the uniqueness score.
  • at any point during the traversal, multiple nodes in the graph may be available to be searched next.
  • the BFS algorithm uses a concept of “search front” to describe the node (i.e., the next node) that will be searched. The concept of “search front” is described earlier, for example, in FIG. 1.
  • traversing the graph includes stopping if the weighting factor falls below a predefined threshold.
  • the predefined threshold used for stopping the graph traversal has been described in detail, for example, in FIGs. 1 and 4.
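  • The traversal described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function `discover`, the dictionary layout, the starting path weight of 1, and all scores are assumptions, and a priority queue is used to model the sorted “search front”.

```python
import heapq

def discover(nodes, edges, start, threshold):
    """Traverse from `start`, skipping nodes whose weighting factor is below `threshold`.

    nodes: {name: (accuracy, uniqueness)}
    edges: {(a, b): relation_accuracy}, treated as undirected
    """
    neighbors = {}
    for (a, b), rel in edges.items():
        neighbors.setdefault(a, []).append((b, rel))
        neighbors.setdefault(b, []).append((a, rel))

    best_path = {start: 1.0}                  # assumption: start path weight is 1
    # Search front ordered by path weight * uniqueness (negated for a max-heap).
    front = [(-1.0 * nodes[start][1], start)]
    seen, visited = set(), []

    while front:
        _, current = heapq.heappop(front)
        if current in seen:
            continue
        seen.add(current)
        visited.append(current)
        uniqueness = nodes[current][1]
        for nxt, rel in neighbors.get(current, []):
            if nxt in seen:
                continue
            path = best_path[current] * uniqueness * rel
            factor = nodes[nxt][0] * path     # weighting factor of the next node
            if factor < threshold:
                continue                      # stop following this path
            if path > best_path.get(nxt, 0.0):
                best_path[nxt] = path
                heapq.heappush(front, (-path * nodes[nxt][1], nxt))
    return visited

# Hypothetical scores loosely modelled on FIG. 3.
nodes = {
    "A_SSN": (1.0, 1.0), "A_Name": (1.0, 0.6), "AB_Address": (1.0, 0.3),
    "B_Name": (1.0, 0.6), "B_SSN": (1.0, 1.0),
}
edges = {
    ("A_SSN", "A_Name"): 1.0, ("A_SSN", "AB_Address"): 0.9,
    ("AB_Address", "B_Name"): 0.9, ("B_Name", "B_SSN"): 1.0,
}
print(discover(nodes, edges, "A_SSN", threshold=0.8))
```

  • With these assumed scores and a threshold of 0.8, the traversal starting at A_SSN reaches A_Name and the shared AB_Address but none of subject B's nodes.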
  • the threshold is adjusted if a plurality of PII elements sharing a common type is found for the same subject.
  • the predefined threshold is adjusted if the plurality of PII elements is shared among multiple subjects.
  • each node of the graph includes information on at least one received document related to the PII element, and generating the list includes adding each related document to the list of traversed PII elements.
  • Each node of the graph includes the information available either on the at least one document or the one or more documents related to the one or more PII elements.
  • the discovery unit 112B is configured to include each related document from the one or more received documents in the list of traversed PII elements.
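  • A minimal sketch of attaching the related documents to the traversal result is given below. The function `build_result_list`, the mapping `node_documents`, and the file names are hypothetical and serve only to illustrate the idea.

```python
def build_result_list(traversed, node_documents):
    """Pair each traversed PII element with the documents recorded on its node."""
    return [
        {"pii": pii, "documents": sorted(node_documents.get(pii, ()))}
        for pii in traversed
    ]

# Hypothetical document references stored per node.
node_documents = {
    "John Smith": {"contract.pdf", "invoice_17.txt"},
    "123456789": {"contract.pdf"},
}
traversed = ["123456789", "John Smith", "19657 B Street, New York"]

for entry in build_result_list(traversed, node_documents):
    print(entry["pii"], entry["documents"])
```

  • Elements with no recorded document are still listed, with an empty document list, so the output keeps the traversal order.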
  • the computer-implemented method 500 enables an efficient identification of the relevant information related to the data subject due to adequate correlation between the one or more PII elements identified in a data storage system and the at least one PII element specified in the identity request.
  • the computer-implemented method 500 uses the path weight (i.e., the path product) on the weighted PII or file relation graph in order to better identify the one or more PII elements related to the data subject in the data storage system with non-trivial correlation between the one or more PII elements.
  • a computer-readable medium is provided, comprising instructions which, when executed by a processor (e.g., the processor 116 of the data management device 102), cause the processor to perform the computer-implemented method 500.
  • the instructions may be implemented on computer-readable media which include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory.
  • the instructions are generated by a computer program, which is implemented in view of the computer-implemented method 500, and for use in implementing the computer-implemented method 500 on one or more processors, such as the processor 116 of the data management device 102.

Abstract

A data management device, comprising an input unit configured to receive at least one document. The data management device further comprises an identification unit configured to identify one or more personally identifiable information (PII) elements in the received document and a relation unit configured to identify one or more relations between pairs of PII elements identified in the received document. The data management device further comprises a mapping unit configured to generate a graph by adding each identified PII element as a node, adding each identified relation as an edge, assigning an accuracy score and a uniqueness score to each node, and assigning a relation accuracy score to each edge. The data management device efficiently identifies the relevant information related to a data subject due to adequate correlation between the one or more PII elements identified in a data storage system and the PII element specified in an identity request.
PCT/EP2022/055453 2022-03-03 2022-03-03 Data management device and data management method WO2023165702A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/055453 WO2023165702A1 (fr) 2022-03-03 2022-03-03 Data management device and data management method

Publications (1)

Publication Number Publication Date
WO2023165702A1 (fr) 2023-09-07

Family

ID=80953325

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/055453 WO2023165702A1 (fr) Data management device and data management method 2022-03-03 2022-03-03

Country Status (1)

Country Link
WO (1) WO2023165702A1 (fr)

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22713368

Country of ref document: EP

Kind code of ref document: A1