CN114969819A - Data asset risk discovery method and device - Google Patents

Info

Publication number
CN114969819A
Authority
CN
China
Prior art keywords
data
node
metadata
newly added
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210620381.8A
Other languages
Chinese (zh)
Inventor
郝泳栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ant Blockchain Technology Shanghai Co Ltd
Original Assignee
Ant Blockchain Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ant Blockchain Technology Shanghai Co Ltd
Priority to CN202210620381.8A
Publication of CN114969819A
Priority to PCT/CN2022/135312 (WO2023231341A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of this specification provides a data asset risk discovery method and device. The method comprises: acquiring newly added data for metadata and operation data in a target data asset, where the metadata comprises description data of a data storage unit of the target data asset and the operation data is access behavior data for the data storage unit; acquiring a pre-established data lineage graph corresponding to the target data asset, where the data lineage graph is established based on historical data of the metadata and the operation data, the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data and reflect the association relationships among the nodes, and the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata; updating the data lineage graph according to the newly added data; and determining, according to the updated data lineage graph, an attribute value of a node related to the newly added data, and determining risk information of the newly added data according to the attribute value. The efficiency of risk discovery can thereby be improved.

Description

Data asset risk discovery method and device
Technical Field
One or more embodiments of the present description relate to the field of computers, and more particularly, to a data asset risk discovery method and apparatus.
Background
As enterprises become more aware of data security, solutions for data asset risk discovery are urgently needed. Risk discovery generally includes identifying sensitive data in data assets so that the identified sensitive data can be handled to prevent the risk of its leakage.
Sensitive data, also called private data or confidential data, refers to information that its owner does not want others or unrelated persons to know. From the perspective of the privacy owner, private data can be divided into individual private data and common private data. Individual private data includes information that can be used to locate or identify an individual (such as a phone number, address, or credit card number) and sensitive information (such as personal health conditions, financial information, or important company documents). Common private data mainly concerns family privacy, such as a family's financial situation. The disclosure and abuse of private data can easily lead to various personal and public security problems.
In traditional data asset risk discovery schemes, a server mostly performs a full traversal of the data assets under certain judgment rules to discover or identify the sensitive data in them. As the data volume grows, more servers are needed to maintain a given discovery efficiency, and the number of servers usually grows in positive correlation with the volume of data assets. Adding servers raises cost, yet discovery efficiency does not rise to the same extent and instead grows only slowly.
Disclosure of Invention
One or more embodiments of the present specification describe a method and an apparatus for risk discovery of data assets, where a server does not perform full traversal on data assets, but performs risk discovery according to an association relationship between data, so that on the premise of not depending on the increase of the number of servers, the efficiency of risk discovery can be improved.
In a first aspect, a data asset risk discovery method is provided, and the method includes:
acquiring newly added data for metadata and operation data in a target data asset; the metadata includes description data of a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit;
acquiring a pre-established data lineage graph corresponding to the target data asset; the data lineage graph is established based on historical data of the metadata and the operation data; the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, and the connecting edges are determined based on the operation data and reflect the association relationships between the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata;
updating the data lineage graph according to the newly added data;
and determining, according to the updated data lineage graph, an attribute value of a node related to the newly added data, and determining risk information of the newly added data according to the attribute value.
In one possible implementation, the target data asset is structured data whose data storage units are identified by database, data table, and data column.
Further, the nodes correspond to data columns.
Further, the association relationship comprises a generation relationship between nodes, the generation relationship indicating that a first node is generated based on a second node.
In one possible implementation, the risk information of the metadata includes risk classification information indicating whether data in the corresponding data storage unit is sensitive data, and/or risk grading information indicating the level of the sensitive data.
In one possible implementation, acquiring the newly added data for the metadata and the operation data in the target data asset includes:
obtaining a Structured Query Language (SQL) statement operating on the target data asset;
and parsing the SQL statement, and determining the newly added data from the metadata and operation data involved in it.
In one possible implementation, the newly added data includes first metadata and first operation data for a first storage unit; updating the data lineage graph includes:
if the data lineage graph does not contain a node corresponding to the first metadata, adding a first node corresponding to the first metadata to the data lineage graph;
determining, according to the first operation data, second metadata having an association relationship with the first metadata, and a first type of the association relationship;
and establishing a connecting edge of the first type between the first node and a second node corresponding to the second metadata.
In one possible implementation, the newly added data includes first metadata for a newly added first storage unit; determining the attribute value of the node related to the newly added data according to the updated data lineage graph includes:
taking the node corresponding to the first metadata in the updated data lineage graph as an initial node, and searching, starting from the initial node, for a target node having a preset association relationship with the initial node;
and if the target node is found, taking the attribute value of the target node as the attribute value of the initial node.
Further, the preset association relationship comprises a generation relationship between nodes, and the initial node is generated based on the target node.
Further, determining the attribute value of the node related to the newly added data according to the updated data lineage graph further includes:
if the target node is not found, acquiring a judgment rule for risk information;
sampling from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;
identifying the risk information of each piece of sampled data using the judgment rule, so as to comprehensively determine the risk information of the newly added data;
and determining the attribute value of the initial node according to the risk information of the newly added data.
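The sampling fallback in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the judgment rule is a hypothetical regular expression for an 11-digit number starting with 1, "comprehensively determine" is interpreted here as a hit-ratio threshold over the sample, and the names `judge_by_sampling`, `sample_size`, and `threshold` do not come from the specification.

```python
import random
import re

# Hypothetical judgment rule: an 11-digit number starting with 1.
PHONE_RULE = re.compile(r"^1\d{10}$")


def judge_by_sampling(values, rule=PHONE_RULE, sample_size=100,
                      threshold=0.5, seed=0):
    """Return True if the sampled values look sensitive under `rule`.

    Assumption: the column is treated as sensitive when at least
    `threshold` of the sampled values match the rule.
    """
    if not values:
        return False
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(values, min(sample_size, len(values)))
    hits = sum(1 for v in sample if rule.match(str(v)))
    return hits / len(sample) >= threshold


# The result would then be stored as the attribute value of the initial node.
is_sensitive = judge_by_sampling(["13800000000", "13900000001", "n/a"])
```

Only a sample of the column is inspected, so the cost of this fallback does not grow with the full size of the data asset.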
In a second aspect, an apparatus for discovering risk of data assets is provided, the apparatus comprising:
a first acquisition unit, configured to acquire newly added data for metadata and operation data in a target data asset; the metadata includes description data of a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit;
a second acquisition unit, configured to acquire a pre-established data lineage graph corresponding to the target data asset; the data lineage graph is established based on historical data of the metadata and the operation data; the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, and the connecting edges are determined based on the operation data and reflect the association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata;
an updating unit, configured to update the data lineage graph acquired by the second acquisition unit according to the newly added data acquired by the first acquisition unit;
and a determining unit, configured to determine, according to the updated data lineage graph obtained by the updating unit, the attribute value of a node related to the newly added data, and to determine the risk information of the newly added data according to the attribute value.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
With the method and apparatus provided by the embodiments of this specification, newly added data for metadata and operation data in the target data asset is first acquired; the metadata includes description data of a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit. A pre-established data lineage graph corresponding to the target data asset is then acquired; the data lineage graph is established based on historical data of the metadata and the operation data; the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, and the connecting edges are determined based on the operation data and reflect the association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata. The data lineage graph is then updated according to the newly added data. Finally, according to the updated data lineage graph, the attribute value of a node related to the newly added data is determined, and the risk information of the newly added data is determined according to the attribute value. As can be seen from the above, the embodiments of this specification perform data asset risk discovery based on graph computation and depict, in the form of a graph, the flow behavior and flow characteristics of data over its full life cycle. The risk information of newly added data can therefore be determined quickly, for example by judging data sensitivity and behavior sensitivity in real time, which supports real-time decisions on and handling of sensitive information. The efficiency of risk discovery can thus be improved without relying on an increase in the number of servers.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a data asset risk discovery method according to one embodiment;
FIG. 3 shows a schematic diagram of the composition of a data lineage graph according to one embodiment;
FIG. 4 illustrates a system architecture diagram of data asset risk discovery according to one embodiment;
FIG. 5 shows a schematic block diagram of a data asset risk discovery apparatus according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves data asset risk discovery, which generally includes identifying sensitive data in a data asset so that the identified sensitive data can be handled to prevent the risk of its leakage, and optionally also to prevent the risk of abnormal operations on the data asset. Abnormal operations, also called risky behaviors, easily lead to consequences such as sensitive data leakage and system failure; they generally include collecting sensitive data by bypassing the platform's desensitization mechanism, illegal credential stuffing, and the like. Data assets are typically structured data stored in a database. The database includes a plurality of data tables, each data table includes a plurality of fields, and each field corresponds to a column. The primary task of risk discovery is to determine whether each field contains sensitive data.
Referring to fig. 1, table one is an original data table in a database, and field 1 of table one is sensitive data. One operation on the database creates table two, in which field 1 and field 2 each have a truncation relationship with field 1 of table one; that is, field 1 and field 2 of table two are obtained by extracting substrings of the character string in field 1 of table one. For example, if the character string in field 1 of table one is abcd, field 1 of table two holds the substring ab and field 2 of table two holds the substring cd. Field 1 and field 2 of table two therefore have an association relationship with field 1 of table one.
Still referring to fig. 1, another operation on the database creates table three, in which field 1 has a copy relationship with field 1 of table one; that is, field 1 of table three is identical to field 1 of table one. For example, if the character string in field 1 of table one is abcd, the character string in field 1 of table three is also abcd. Field 1 of table three therefore has an association relationship with field 1 of table one.
This association relationship may be referred to as a lineage relationship (blood relationship), which describes the upstream-downstream relationship between data. Relationships between fields generally include copying, truncation, concatenation, conversion, and the like, each of which indicates that the data of one field is processed to obtain the data of another field.
In the embodiments of this specification, the metadata and operation data related to a data asset can be extracted automatically by parsing the statements that operate on the data asset. The metadata comprises description data of the data storage units of the data asset, and the operation data is access behavior data for the data storage units and reflects the association relationships between different data storage units. Based on these association relationships, the risk information of newly added data can be determined quickly, for example by judging data sensitivity and behavior sensitivity in real time, which supports real-time decisions on and handling of sensitive information. The efficiency of risk discovery can thus be improved without relying on an increase in the number of servers.
FIG. 2 illustrates a flow diagram of a data asset risk discovery method according to one embodiment, which may be based on the implementation scenario illustrated in FIG. 1. As shown in fig. 2, the data asset risk discovery method in this embodiment includes the following steps: step 21, acquiring newly added data for metadata and operation data in the target data asset, where the metadata includes description data of a data storage unit of the target data asset and the operation data is access behavior data for the data storage unit; step 22, acquiring a pre-established data lineage graph corresponding to the target data asset, where the data lineage graph is established based on historical data of the metadata and the operation data, the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data and reflect the association relationships among the nodes, and the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata; step 23, updating the data lineage graph according to the newly added data; and step 24, determining, according to the updated data lineage graph, the attribute value of a node related to the newly added data, and determining the risk information of the newly added data according to the attribute value. The specific execution of each step is described below.
First, in step 21, newly added data for metadata and operation data in the target data asset is acquired; the metadata includes description data of a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit. It is understood that the objects of the operations are data storage units, and the access behaviors can include, but are not limited to, creating a data table, adding fields to an existing data table, and the like.
In one example, the target data asset belongs to structured data whose data storage units are identified by a database, a data table, and a data column.
It is understood that a database typically includes a plurality of data tables, and a data table includes a plurality of data columns; when a data column is described, the data table and database to which it belongs are usually also indicated. For example, a Globally Unique Identifier (GUID) is used as the identifier of a data column, in the form project_name.table_name.column_name, where project_name represents the database, table_name represents the data table, and column_name represents the data column. The metadata may be used to describe the data columns in the target data asset.
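As a minimal sketch of such an identifier, the following assumes the three parts are joined with dots, as in identifiers like p1.t2.c1 used elsewhere in this description. The class name `ColumnId` and its `parse` method are hypothetical and are not part of the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnId:
    """Identifier of one data column: database, data table, data column."""
    project_name: str  # the database
    table_name: str    # the data table
    column_name: str   # the data column

    @classmethod
    def parse(cls, guid: str) -> "ColumnId":
        # Assumption: the parts are dot-separated and none of them is empty.
        parts = guid.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected project.table.column, got {guid!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.project_name}.{self.table_name}.{self.column_name}"


col = ColumnId.parse("p1.t2.c1")
```

Because the dataclass is frozen, instances are hashable and can serve directly as node keys in a graph structure.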
In one example, acquiring the newly added data for the metadata and the operation data in the target data asset includes:
obtaining a Structured Query Language (SQL) statement operating on the target data asset;
and parsing the SQL statement, and determining the newly added data from the metadata and operation data involved in it.
In this example, various operations on the database may be carried out by executing SQL statements. Parsing an SQL statement yields the data storage units involved in the operation and the association relationships among them, where a single data storage unit is a data column or a data table. It is understood that the association relationships may include relationships between fields, between a field and a data table, and between data tables; the data storage units correspond to the metadata, and the association relationships correspond to the operation data.
Parsing the SQL statement may be referred to as SQL parsing. SQL parsing can extract the metadata and operation data involved in a statement. The metadata involved may be an original data column in the database or a newly added data column, and correspondingly, the operation data may be embodied as an association relationship between the original data column and the newly added data column. The newly added data may include two parts: newly added metadata and newly added operation data.
For example, the following SQL:
create table p1.t2 as (select c1 from p1.t1);
through SQL parsing, the following results can be obtained:
the data column p1.t2.c1 in the data table t2 is created based on the data column p1.t1.c1 in the data table t1, wherein the data column p1.t2.c1 in the data table t2 corresponds to the newly added metadata, and the operation data for the metadata is the newly added operation data.
Mature third-party libraries can currently be used for SQL parsing, and their principles are not described here.
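To make the parsing result concrete, the following toy sketch handles only the single statement shape used in the example above. It is not a general SQL parser and does not stand in for any third-party library. The function name `parse_lineage` and the choice to label a plainly selected column as a "copy" relationship are assumptions made for illustration.

```python
import re

# Matches only: create table <target> as|from (select <cols> from <source>);
# Anything else is rejected rather than guessed at.
_PATTERN = re.compile(
    r"create\s+table\s+(?P<target>[\w.]+)\s+(?:as|from)\s*"
    r"\(\s*select\s+(?P<cols>[\w\s,]+?)\s+from\s+(?P<source>[\w.]+)\s*\)"
    r"\s*;?\s*$",
    re.IGNORECASE,
)


def parse_lineage(sql):
    """Return (source_column, new_column, relation) tuples for one statement."""
    m = _PATTERN.match(sql.strip())
    if m is None:
        raise ValueError("statement shape not supported by this sketch")
    cols = [c.strip() for c in m.group("cols").split(",")]
    # Assumption: a column selected without any function is a plain copy.
    return [
        (f"{m.group('source')}.{c}", f"{m.group('target')}.{c}", "copy")
        for c in cols
    ]


result = parse_lineage("create table p1.t2 as (select c1 from p1.t1);")
# result == [("p1.t1.c1", "p1.t2.c1", "copy")]
```

The second element of each tuple plays the role of the newly added metadata, and the tuple as a whole plays the role of the newly added operation data.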
Then, in step 22, a pre-established data lineage graph corresponding to the target data asset is acquired; the data lineage graph is established based on historical data of the metadata and the operation data; the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, and the connecting edges are determined based on the operation data and reflect the association relationships between the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata. It is understood that the risk information may specifically indicate whether the data storage unit holds sensitive data, that is, whether there is a risk of sensitive data leakage.
In one example, the target data asset is structured data whose data storage units are identified by database, data table, and data column, and the nodes correspond to data columns.
Further, the association relationship comprises a generation relationship between nodes, the generation relationship indicating that a first node is generated based on a second node.
In the embodiments of this specification, a first node being generated based on a second node may mean that the data storage unit corresponding to the first node is generated by copying the data stored in the data storage unit corresponding to the second node, or that it is generated by truncating that data.
In the embodiments of this specification, the data lineage graph may also include other types of nodes, for example nodes corresponding to data tables; in this case, the association relationships further include the attribution relationship between a data column and a data table.
In one example, the risk information of the metadata includes risk classification information indicating whether data in the corresponding data storage unit is sensitive data, and/or risk grading information indicating the level of the sensitive data.
In this example, a plurality of levels, for example, three levels, i.e., high, medium, and low, may be pre-divided, and sensitive data at each level is only accessible to users with a corresponding permission level, but is not accessible to other users.
In the embodiments of this specification, the data lineage graph may be established based on SQL parsing, which is the cornerstone of constructing data lineage relationships. SQL parsing mainly extracts the relationships described in SQL between fields and tables, between fields, and between tables. In general, the relationships between fields may include copy (copy), truncation (substr), concatenation (concat), and the like; the relationship between tables is dependency (depend); and the relationship between a field and a table is ownership (belonged). A parsed lineage relationship may be represented by a triple (source_node, target_node, relation), where source_node is the identifier of the source node, target_node is the identifier of the target node, and relation is the relationship between the two nodes.
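A small sketch of this triple representation follows. The field names mirror the triple described above; the class name `LineageTriple`, the helper `downstream_of`, and the sample node identifiers are hypothetical.

```python
from typing import NamedTuple

# Relations between fields named in the description.
FIELD_RELATIONS = {"copy", "substr", "concat"}


class LineageTriple(NamedTuple):
    source_node: str  # identifier of the source node
    target_node: str  # identifier of the target node
    relation: str     # relationship between the two nodes


triples = [
    LineageTriple("p1.t1.c1", "p1.t2.c1", "substr"),
    LineageTriple("p1.t1.c1", "p1.t2.c2", "substr"),
    LineageTriple("p1.t1.c1", "p1.t3.c1", "copy"),
    LineageTriple("p1.t1", "p1.t2", "depend"),
]


def downstream_of(node, triples):
    """List the fields derived directly from `node` via a field relation."""
    return [
        t.target_node
        for t in triples
        if t.source_node == node and t.relation in FIELD_RELATIONS
    ]


derived = downstream_of("p1.t1.c1", triples)
```

Keeping the relation as plain data makes it straightforward to filter edges by type when traversing the graph.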
FIG. 3 shows a schematic diagram of the composition of a data lineage graph according to one embodiment. Referring to FIG. 3, the data lineage graph includes two types of nodes: one type represents data tables, e.g., node t1 and node t2; the other type represents data columns, e.g., node c1, node c2, …, node c7. The data lineage graph also includes two types of connecting edges. One type connects a node representing a data column and a node representing a data table and represents the attribution relationship between the two, e.g., the connecting edge between node c3 and node t1. The other type connects two nodes representing data columns and represents a generation relationship between the two, e.g., the connecting edge between node c1 and node c7, whose direction from node c1 to node c7 indicates that node c7 is generated from node c1. The attribute value of node c1 identifies that the data of the corresponding data storage unit is sensitive data and that the level of the sensitive data is high.
Then, in step 23, the data lineage graph is updated according to the newly added data. It is understood that the newly added data may include newly added metadata and/or newly added operation data; accordingly, updating the data lineage graph may specifically include adding nodes and/or connecting edges to the data lineage graph.
In one example, the new addition data includes first metadata and first operation data for a first storage unit; the updating the data consanguinity map comprises:
if the data blood relationship graph does not comprise the node corresponding to the first metadata, adding a first node corresponding to the first metadata in the data blood relationship graph;
determining second metadata having an association relation with the first metadata and a first type of the association relation according to the first operation data;
and establishing a first type of connection edge between the first node and a second node corresponding to the second metadata.
In this example, updating the data kindred map includes adding both nodes and connecting edges in the data kindred map.
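The update steps of this example can be sketched as follows. The dictionary keys (`column`, `source_column`, `relation`) and the function name are hypothetical, chosen only for illustration; creating a placeholder for a missing second node is likewise an assumption, since the text does not say what happens in that case.

```python
def update_lineage_graph(nodes, edges, first_metadata, first_operation):
    """Add the first node if missing, then connect it to the second node.

    nodes: dict mapping node name -> attribute value (None if undetermined)
    edges: set of (source, destination, edge_type) tuples
    """
    first = first_metadata["column"]
    # Add a node for the first metadata only if the graph lacks one.
    if first not in nodes:
        nodes[first] = None

    # The operation data tells us which metadata is associated and how.
    second = first_operation["source_column"]
    edge_type = first_operation["relation"]

    # Assumption: create a placeholder if the second node is absent.
    nodes.setdefault(second, None)

    # Edge direction: from the source column to the column generated from it.
    edges.add((second, first, edge_type))
    return first


nodes = {"c1": {"sensitive": True, "level": "high"}}
edges = set()
update_lineage_graph(
    nodes,
    edges,
    first_metadata={"column": "c7"},
    first_operation={"source_column": "c1", "relation": "generates"},
)
```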
Finally, in step 24, an attribute value of a node related to the newly added data is determined according to the updated data lineage graph, and risk information of the newly added data is determined according to the attribute value. It can be understood that, because the attribute value of a node identifies the risk information of the data storage unit corresponding to that node's metadata, once a node related to the newly added data is found, the risk information identified by the attribute value of that related node can be used as the risk information of the newly added data.
In one example, the newly added data includes first metadata for a newly added first storage unit; determining the attribute value of the node related to the newly added data according to the updated data lineage graph includes:
in the updated data lineage graph, taking the node corresponding to the first metadata as an initial node and, starting from the initial node, searching for a target node that has a preset association relationship with the initial node;
and if the target node is found, taking the attribute value of the target node as the attribute value of the initial node.
Further, the preset association relationship includes a generation relationship between nodes, and the initial node is generated based on the target node.
It can be understood that the association relationship between nodes can be embodied by the attribute information of the connecting edge, which can identify whether the nodes are in a generation relationship. That the initial node is generated based on the target node means that the target node is a parent node and the initial node is a child node; equivalently, the target node is an upstream node of the initial node and the initial node is a downstream node of the target node. The upstream-downstream relationship between nodes can be embodied by the direction of the connecting edge.
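The search for an upstream target node can be sketched as a breadth-first walk against the direction of the generation edges. The text does not specify what to do when several upstream nodes exist; taking the nearest upstream node whose attribute value is already determined is an assumption made for this sketch, as are the function name and the `"generates"` edge label.

```python
from collections import deque


def inherit_attribute(nodes, edges, initial):
    """Copy the attribute value of an upstream node onto the initial node.

    nodes: dict mapping node name -> attribute dict, or None if undetermined
    edges: set of (source, destination, edge_type) tuples
    Returns the name of the target node, or None if no target was found.
    """
    # Index parents: for each node, the nodes it was generated from.
    parents = {}
    for src, dst, edge_type in edges:
        if edge_type == "generates":
            parents.setdefault(dst, []).append(src)

    queue = deque(parents.get(initial, []))
    seen = {initial}
    while queue:
        candidate = queue.popleft()
        if candidate in seen:
            continue
        seen.add(candidate)
        value = nodes.get(candidate)
        if value is not None:
            # Target found: reuse its attribute value for the initial node.
            nodes[initial] = dict(value)
            return candidate
        # Otherwise keep walking further upstream.
        queue.extend(parents.get(candidate, []))
    return None


nodes = {"c1": {"sensitive": True, "level": "high"}, "c7": None}
edges = {("c1", "c7", "generates")}
target = inherit_attribute(nodes, edges, "c7")
```

Because the lookup touches only the ancestors of the new node, its cost depends on the local shape of the graph rather than on the number of rows stored in the column.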
Further, determining the attribute value of the node related to the newly added data according to the updated data lineage graph further includes:
if the target node is not found, acquiring a judgment rule for risk information;
sampling from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;
identifying the risk information of each piece of sampled data using the judgment rule, so as to comprehensively determine the risk information of the newly added data;
and determining the attribute value of the initial node according to the risk information of the newly added data.
It can be understood that, if no target node having the preset association relationship with the initial node is found, the data stored in the data storage unit corresponding to the initial node needs to be identified. For example, if the data storage unit is a data column, the data of the column may be sampled to obtain a plurality of pieces of sampled data, and the identification result for the column is determined from the identification results of the individual samples. For example, if a data column has 1000 pieces of data, 20 pieces can be sampled from it and each of the 20 checked for whether it is sensitive data; if the proportion of samples identified as sensitive exceeds a preset proportion, the data column is determined to be sensitive data.
The judgment rule may be a rule that identifies the risk information directly, for example, determining whether data is sensitive by specifying the number of characters, the character types, and the like of a character string; the judgment rule may also be a rule that identifies the risk information indirectly, for example, determining whether data is sensitive by means of a specified neural network model.
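The sampling fallback can be sketched as below. The rule `looks_like_phone_number` (a string of exactly 11 digits) is a hypothetical example of a rule based on character count and character type; the default sample size of 20 follows the example in the text, while the threshold of 0.5 is an assumed value for the "preset proportion".

```python
import random
import re


def looks_like_phone_number(value):
    """Hypothetical direct rule: exactly 11 characters, all digits."""
    return re.fullmatch(r"\d{11}", str(value)) is not None


def column_is_sensitive(column_values, rule, sample_size=20, threshold=0.5, seed=None):
    """Sample the column and apply the rule to each sampled value.

    The column is treated as sensitive when the share of sampled values
    matching the rule exceeds the threshold.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(column_values))
    if k == 0:
        return False  # nothing to inspect (assumption: treat as not sensitive)
    sample = rng.sample(column_values, k)
    hits = sum(1 for value in sample if rule(value))
    return hits / k > threshold


# A column of 1000 values, from which 20 are sampled.
phone_column = [str(13800000000 + i) for i in range(1000)]
text_column = ["note %d" % i for i in range(1000)]
phone_result = column_is_sensitive(phone_column, looks_like_phone_number)
text_result = column_is_sensitive(text_column, looks_like_phone_number)
```

An indirect rule, such as a classifier, could be passed in the same way as long as it is wrapped in a callable that returns a boolean per value.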
With the method provided by the embodiments of this specification, newly added data for the metadata and operation data in the target data asset is first acquired; the metadata includes description data for a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit. A pre-established data lineage graph corresponding to the target data asset is then acquired. The data lineage graph is established based on historical data of the metadata and operation data; it comprises nodes and connecting edges, the nodes being determined based on the metadata and the connecting edges being determined based on the operation data and embodying the association relationships between the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata. The data lineage graph is then updated according to the newly added data. Finally, an attribute value of a node related to the newly added data is determined according to the updated data lineage graph, and risk information of the newly added data is determined according to the attribute value. As can be seen from the above, in the embodiments of this specification, data asset risk discovery is performed based on graph computing, and the flow behavior and flow characteristics of data over its full life cycle are depicted as a graph. Risk information of newly added data can therefore be determined quickly, for example, data sensitivity and behavior sensitivity can be determined in real time, and real-time decisions on and processing of sensitive information are supported, so that the efficiency of risk discovery can be improved without relying on an increase in the number of servers.
FIG. 4 illustrates a system architecture diagram for data asset risk discovery according to one embodiment. Referring to FIG. 4, the system may be divided into a data processing service unit 41, a data protection service unit 42, a data lineage service unit 43, and a graph computing engine 44. The data processing service unit 41 provides storage and computation services for data assets; the data assets are structured data, and the metadata and operation data corresponding to the data assets can be provided to the data protection service unit 42 and the data lineage service unit 43. The data protection service unit 42 provides a risk discovery service for the data assets; risk discovery generally includes identifying sensitive data in the data assets so that the identified sensitive data can be processed and the risk of its leakage prevented. The data lineage service unit 43 periodically generates data lineage relationships and incrementally synchronizes them to a graph database, which may be referred to as the data lineage graph; it can be understood that the lineage relationships include relationships between fields, between fields and tables, and between tables. The graph computing engine 44 performs the query computations related to the data lineage graph. When the data protection service unit 42 performs risk discovery, it can query the data lineage graph from the data lineage service unit 43 and perform risk discovery according to the lineage relationships between nodes embodied in the graph, which effectively improves risk discovery efficiency.
Further, the data lineage service unit 43 acquires the operation data from the data processing service unit 41 and the metadata from the data protection service unit 42. It can be understood that, after the data protection service unit 42 acquires the metadata and the operation data from the data processing service unit 41, it may cache the metadata and provide it to the data lineage service unit 43.
In the embodiments of this specification, sensitivity is propagated through the association relationships among data assets based on graph computing technology. Data discovery no longer depends on queue traversal but is performed through the data lineage graph; where strong relationships exist among the data assets, growth in data volume does not bring a proportional growth in server resources. Through efficient graph traversal, the offline limitation of the traditional architecture is overcome, near-line and even online sensitive data discovery and real-time early warning are achieved, and data asset security is ensured efficiently. Meanwhile, based on the data lineage relationships, behavior sensitivity can be mined on top of data sensitivity, and sensitive behavior can then be predicted from historical suspicious sensitive behavior, thereby achieving full-scenario risk control before, during, and after an event.
According to an embodiment of another aspect, a data asset risk discovery apparatus is also provided, which is used for executing the method provided by the embodiments of this specification. FIG. 5 shows a schematic block diagram of a data asset risk discovery apparatus according to one embodiment. As shown in FIG. 5, the apparatus 500 includes:
a first obtaining unit 51, configured to obtain newly added data for the metadata and operation data in a target data asset; the metadata includes description data for a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit;
a second obtaining unit 52, configured to obtain a pre-established data lineage graph corresponding to the target data asset; the data lineage graph is established based on historical data of the metadata and operation data; the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data, and the connecting edges embody association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata;
an updating unit 53, configured to update the data lineage graph obtained by the second obtaining unit 52 according to the newly added data obtained by the first obtaining unit 51;
a determining unit 54, configured to determine an attribute value of a node related to the newly added data according to the updated data lineage graph obtained by the updating unit 53, and to determine risk information of the newly added data according to the attribute value.
Optionally, as an embodiment, the target data asset is structured data, and its data storage units are identified by a database, a data table, and a data column.
Further, the nodes correspond to data columns.
Further, the association relationship includes a generation relationship between nodes, the generation relationship being that a first node is generated based on a second node.
Optionally, as an embodiment, the risk information of the metadata includes risk classification information and/or risk level information; the risk classification information indicates whether data in the corresponding data storage unit is sensitive data, and the risk level information indicates the level of the sensitive data.
Optionally, as an embodiment, the first obtaining unit 51 includes:
an acquisition subunit, configured to acquire a Structured Query Language (SQL) statement that operates on a target data asset;
and an analysis subunit, configured to parse the SQL statement acquired by the acquisition subunit and determine the newly added data according to the metadata and operation data involved in the statement.
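What the analysis subunit derives from a statement can be illustrated with a deliberately narrow sketch. It handles only the form `INSERT INTO <table> (<columns>) SELECT <columns> FROM <table>` using a regular expression; this is an assumption made for brevity, and a practical implementation would more plausibly rely on a full SQL parser. The function name, the output keys, and the table and column names in the usage line are all hypothetical.

```python
import re

# Matches: INSERT INTO target (a, b) SELECT x, y FROM source
_INSERT_SELECT = re.compile(
    r"insert\s+into\s+(\w+)\s*\(([^)]*)\)\s*select\s+(.+?)\s+from\s+(\w+)",
    re.IGNORECASE | re.DOTALL,
)


def parse_insert_select(sql):
    """Return one lineage record per (target column, source column) pair.

    Each record names the newly written column (metadata) and the column
    it was generated from (derived from the operation in the statement).
    Returns an empty list for statements this sketch does not understand.
    """
    match = _INSERT_SELECT.search(sql)
    if match is None:
        return []
    target_table, target_cols, source_cols, source_table = match.groups()
    targets = [c.strip() for c in target_cols.split(",")]
    sources = [c.strip() for c in source_cols.split(",")]
    if len(targets) != len(sources):
        return []  # column lists do not line up; give up rather than guess
    return [
        {
            "column": f"{target_table}.{t}",
            "source_column": f"{source_table}.{s}",
            "relation": "generates",
        }
        for t, s in zip(targets, sources)
    ]


records = parse_insert_select(
    "INSERT INTO user_backup (phone_copy) SELECT phone FROM user_info"
)
```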
Optionally, as an embodiment, the newly added data includes first metadata and first operation data for a first storage unit; the updating unit 53 includes:
a node adding subunit, configured to add, if the data lineage graph does not include a node corresponding to the first metadata, a first node corresponding to the first metadata to the data lineage graph;
a determining subunit, configured to determine, according to the first operation data, second metadata that has an association relationship with the first metadata, and a first type of that association relationship;
and an edge establishing subunit, configured to establish a connecting edge of the first type between the first node added by the node adding subunit and a second node corresponding to the second metadata determined by the determining subunit.
Optionally, as an embodiment, the newly added data includes first metadata for a newly added first storage unit; the determining unit 54 includes:
a searching subunit, configured to take, in the updated data lineage graph, the node corresponding to the first metadata as an initial node and, starting from the initial node, search for a target node that has a preset association relationship with the initial node;
a first determining subunit, configured to take, if the target node is found by the searching subunit, the attribute value of the target node as the attribute value of the initial node.
Further, the preset association relationship includes a generation relationship between nodes, and the initial node is generated based on the target node.
Further, the determining unit 54 further includes:
an obtaining subunit, configured to obtain a judgment rule for risk information if the target node is not found by the searching subunit;
a sampling subunit, configured to sample from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;
an identification subunit, configured to identify the risk information of each piece of sampled data obtained by the sampling subunit using the judgment rule obtained by the obtaining subunit, so as to comprehensively determine the risk information of the newly added data;
and a second determining subunit, configured to determine the attribute value of the initial node according to the risk information of the newly added data obtained by the identification subunit.
With the apparatus provided in the embodiments of this specification, the first obtaining unit 51 first obtains newly added data for the metadata and operation data in a target data asset; the metadata includes description data for a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit. The second obtaining unit 52 then obtains a pre-established data lineage graph corresponding to the target data asset. The data lineage graph is established based on historical data of the metadata and operation data; it comprises nodes and connecting edges, the nodes being determined based on the metadata, the connecting edges being determined based on the operation data and embodying the association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata. The updating unit 53 then updates the data lineage graph according to the newly added data. Finally, the determining unit 54 determines an attribute value of a node related to the newly added data according to the updated data lineage graph, and determines risk information of the newly added data according to the attribute value. As can be seen from the above, in the embodiments of this specification, data asset risk discovery is performed based on graph computing, and the flow behavior and flow characteristics of data over its full life cycle are depicted as a graph. Risk information of newly added data can therefore be determined quickly, for example, data sensitivity and behavior sensitivity can be determined in real time, and real-time decisions on and processing of sensitive information are supported, so that the efficiency of risk discovery can be improved without relying on an increase in the number of servers.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall be included in the scope of the present invention.

Claims (22)

1. A data asset risk discovery method, the method comprising:
acquiring newly added data for metadata and operation data in a target data asset; the metadata includes description data for a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit;
acquiring a pre-established data lineage graph corresponding to the target data asset; the data lineage graph is established based on historical data of the metadata and operation data; the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data, and the connecting edges embody association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata;
updating the data lineage graph according to the newly added data;
and determining an attribute value of a node related to the newly added data according to the updated data lineage graph, and determining risk information of the newly added data according to the attribute value.
2. The method of claim 1, wherein the target data asset is structured data, and its data storage units are identified by a database, a data table, and a data column.
3. The method of claim 2, wherein the nodes correspond to data columns.
4. The method of claim 3, wherein the association relationship comprises a generation relationship between nodes, the generation relationship being that a first node is generated based on a second node.
5. The method of claim 1, wherein the risk information of the metadata includes risk classification information and/or risk level information, the risk classification information indicating whether data in the corresponding data storage unit is sensitive data, and the risk level information indicating a level of the sensitive data.
6. The method of claim 1, wherein acquiring newly added data for metadata and operation data in the target data asset comprises:
acquiring a Structured Query Language (SQL) statement that operates on the target data asset;
and parsing the SQL statement, and determining the newly added data according to the metadata and operation data involved in the statement.
7. The method of claim 1, wherein the newly added data includes first metadata and first operation data for a first storage unit; updating the data lineage graph comprises:
if the data lineage graph does not include a node corresponding to the first metadata, adding a first node corresponding to the first metadata to the data lineage graph;
determining, according to the first operation data, second metadata that has an association relationship with the first metadata, and a first type of that association relationship;
and establishing a connecting edge of the first type between the first node and a second node corresponding to the second metadata.
8. The method of claim 1, wherein the newly added data includes first metadata for a newly added first storage unit; determining an attribute value of a node related to the newly added data according to the updated data lineage graph comprises:
in the updated data lineage graph, taking the node corresponding to the first metadata as an initial node and, starting from the initial node, searching for a target node that has a preset association relationship with the initial node;
and if the target node is found, taking the attribute value of the target node as the attribute value of the initial node.
9. The method of claim 8, wherein the preset association relationship includes a generation relationship between nodes, and the initial node is generated based on the target node.
10. The method of claim 8, wherein determining an attribute value of a node related to the newly added data according to the updated data lineage graph further comprises:
if the target node is not found, acquiring a judgment rule for risk information;
sampling from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;
identifying the risk information of each piece of sampled data using the judgment rule, so as to comprehensively determine the risk information of the newly added data;
and determining the attribute value of the initial node according to the risk information of the newly added data.
11. A data asset risk discovery apparatus, the apparatus comprising:
a first obtaining unit, configured to obtain newly added data for metadata and operation data in a target data asset; the metadata includes description data for a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit;
a second obtaining unit, configured to obtain a pre-established data lineage graph corresponding to the target data asset; the data lineage graph is established based on historical data of the metadata and operation data; the data lineage graph comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data, and the connecting edges embody association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to that node's metadata;
an updating unit, configured to update the data lineage graph obtained by the second obtaining unit according to the newly added data obtained by the first obtaining unit;
and a determining unit, configured to determine an attribute value of a node related to the newly added data according to the updated data lineage graph obtained by the updating unit, and to determine risk information of the newly added data according to the attribute value.
12. The apparatus of claim 11, wherein the target data asset is structured data, and its data storage units are identified by a database, a data table, and a data column.
13. The apparatus of claim 12, wherein the nodes correspond to data columns.
14. The apparatus of claim 13, wherein the association relationship comprises a generation relationship between nodes, the generation relationship being that a first node is generated based on a second node.
15. The apparatus of claim 11, wherein the risk information of the metadata comprises risk classification information and/or risk level information, the risk classification information indicating whether data in the corresponding data storage unit is sensitive data, and the risk level information indicating a level of the sensitive data.
16. The apparatus of claim 11, wherein the first obtaining unit comprises:
an acquisition subunit, configured to acquire a Structured Query Language (SQL) statement that operates on a target data asset;
and an analysis subunit, configured to parse the SQL statement acquired by the acquisition subunit and determine the newly added data according to the metadata and operation data involved in the statement.
17. The apparatus of claim 11, wherein the newly added data includes first metadata and first operation data for a first storage unit; the updating unit includes:
a node adding subunit, configured to add, if the data lineage graph does not include a node corresponding to the first metadata, a first node corresponding to the first metadata to the data lineage graph;
a determining subunit, configured to determine, according to the first operation data, second metadata that has an association relationship with the first metadata, and a first type of that association relationship;
and an edge establishing subunit, configured to establish a connecting edge of the first type between the first node added by the node adding subunit and a second node corresponding to the second metadata determined by the determining subunit.
18. The apparatus of claim 11, wherein the newly added data includes first metadata for a newly added first storage unit; the determining unit includes:
a searching subunit, configured to take, in the updated data lineage graph, the node corresponding to the first metadata as an initial node and, starting from the initial node, search for a target node that has a preset association relationship with the initial node;
a first determining subunit, configured to take, if the target node is found by the searching subunit, the attribute value of the target node as the attribute value of the initial node.
19. The apparatus of claim 18, wherein the preset association relationship comprises a generation relationship between nodes, and the initial node is generated based on the target node.
20. The apparatus of claim 18, wherein the determining unit further comprises:
an obtaining subunit, configured to obtain a judgment rule for risk information if the target node is not found by the searching subunit;
a sampling subunit, configured to sample from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;
an identification subunit, configured to identify the risk information of each piece of sampled data obtained by the sampling subunit using the judgment rule obtained by the obtaining subunit, so as to comprehensively determine the risk information of the newly added data;
and a second determining subunit, configured to determine the attribute value of the initial node according to the risk information of the newly added data obtained by the identification subunit.
21. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-10.
22. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-10.
CN202210620381.8A 2022-06-02 2022-06-02 Data asset risk discovery method and device Pending CN114969819A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210620381.8A CN114969819A (en) 2022-06-02 2022-06-02 Data asset risk discovery method and device
PCT/CN2022/135312 WO2023231341A1 (en) 2022-06-02 2022-11-30 Method and apparatus for discovering data asset risk

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210620381.8A CN114969819A (en) 2022-06-02 2022-06-02 Data asset risk discovery method and device

Publications (1)

Publication Number Publication Date
CN114969819A true CN114969819A (en) 2022-08-30

Family

ID=82960474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210620381.8A Pending CN114969819A (en) 2022-06-02 2022-06-02 Data asset risk discovery method and device

Country Status (2)

Country Link
CN (1) CN114969819A (en)
WO (1) WO2023231341A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891979B (en) * 2024-03-15 2024-05-17 中信证券股份有限公司 Method and device for constructing blood margin map, electronic equipment and readable medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111694858A (en) * 2020-04-28 2020-09-22 平安科技(深圳)有限公司 Data blood margin analysis method, device, equipment and computer readable storage medium
CN113486008A (en) * 2021-06-30 2021-10-08 平安信托有限责任公司 Data blood margin analysis method, device, equipment and storage medium
CN113672653A (en) * 2021-08-09 2021-11-19 支付宝(杭州)信息技术有限公司 Method and device for identifying private data in database
CN113672977A (en) * 2021-08-13 2021-11-19 支付宝(杭州)信息技术有限公司 Private data processing method and device
CN114969819A (en) * 2022-06-02 2022-08-30 蚂蚁区块链科技(上海)有限公司 Data asset risk discovery method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023231341A1 (en) * 2022-06-02 2023-12-07 蚂蚁区块链科技(上海)有限公司 Method and apparatus for discovering data asset risk
CN117874686A (en) * 2024-03-11 2024-04-12 中信证券股份有限公司 Abnormal data positioning method, device, electronic equipment and computer readable medium
CN117874686B (en) * 2024-03-11 2024-05-10 中信证券股份有限公司 Abnormal data positioning method, device, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
WO2023231341A1 (en) 2023-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination