CN114969819A - Data asset risk discovery method and device

Info

Publication number: CN114969819A
Application number: CN202210620381.8A
Authority: CN
Inventors: 郝泳栋
Original assignee: Ant Blockchain Technology Shanghai Co Ltd
Current assignee: Ant Blockchain Technology Shanghai Co Ltd
Priority date: 2022-06-02
Filing date: 2022-06-02
Publication date: 2022-08-30
Also published as: WO2023231341A1

Abstract

An embodiment of this specification provides a data asset risk discovery method and device. The method comprises: acquiring newly added data for the metadata and operation data of a target data asset, where the metadata comprises description data of data storage units and the operation data is access behavior data for those data storage units; acquiring a pre-established data lineage graph (data blood-relationship graph) corresponding to the target data asset, where the graph is established based on historical data of the metadata and operation data and comprises nodes and connecting edges, the nodes being determined based on the metadata, the connecting edges being determined based on the operation data and reflecting the association relationships among the nodes, and the attribute value of a node identifying risk information of the data storage unit described by the corresponding metadata; updating the data lineage graph according to the newly added data; and determining, according to the updated data lineage graph, an attribute value of a node related to the newly added data, and determining risk information of the newly added data according to the attribute value. The efficiency of risk discovery can thereby be improved.

Description

Data asset risk discovery method and device

Technical Field

One or more embodiments of the present description relate to the field of computers, and more particularly, to a data asset risk discovery method and apparatus.

Background

As enterprises become more aware of data security, solutions for data asset risk discovery are urgently needed. Risk discovery generally involves identifying the sensitive data in a data asset so that the identified data can be handled in a way that prevents it from being leaked.

Sensitive data, also called private data, is confidential information that its owner does not want other or unrelated persons to know. From the perspective of the privacy owner, private data can be divided into individual private data and common private data. Individual private data includes information that can be used to locate or identify an individual (such as a phone number, address, or credit card number) and sensitive information (such as personal health conditions, financial information, or important company documents). Common private data mainly concerns family privacy, such as a family's financial situation. The disclosure and abuse of private data can easily lead to a variety of personal and public security problems.

In traditional data asset risk discovery schemes, a server usually traverses the entire data asset under a given judgment rule in order to discover or identify the sensitive data in it. As the data volume grows, more servers must be added to maintain a given discovery efficiency, and the number of servers typically grows in positive correlation with the volume of the data asset. Adding servers raises cost, yet discovery efficiency does not rise to the same extent; it increases only slowly.

Disclosure of Invention

One or more embodiments of this specification describe a data asset risk discovery method and apparatus in which a server does not traverse the entire data asset but instead performs risk discovery according to the association relationships between data, so that the efficiency of risk discovery can be improved without relying on an increase in the number of servers.

In a first aspect, a data asset risk discovery method is provided, the method comprising:

acquiring newly added data for the metadata and operation data of a target data asset, where the metadata includes description data for data storage units of the target data asset, and the operation data is access behavior data for the data storage units;

acquiring a pre-established data lineage graph corresponding to the target data asset, where the data lineage graph is established based on historical data of the metadata and operation data; the data lineage graph comprises nodes and connecting edges, the nodes being determined based on the metadata, and the connecting edges being determined based on the operation data and reflecting the association relationships between the nodes; and the attribute value of a node identifies risk information of the data storage unit described by the corresponding metadata;

updating the data lineage graph according to the newly added data; and

determining, according to the updated data lineage graph, an attribute value of a node related to the newly added data, and determining risk information of the newly added data according to the attribute value.

In one possible implementation, the target data asset is structured data whose data storage units are identified by a database, a data table, and a data column.

Further, each node corresponds to a data column.

Further, the association relationships include a generation relationship between nodes, in which a first node is generated based on a second node.

In one possible implementation, the risk information of the metadata includes risk classification information indicating whether the data in the corresponding data storage unit is sensitive data, and/or risk grading information indicating the level of the sensitive data.

In one possible implementation, acquiring the newly added data for the metadata and operation data of the target data asset includes:

obtaining a Structured Query Language (SQL) statement that operates on the target data asset; and

parsing the SQL statement, and determining the newly added data according to the metadata and operation data that the statement involves.

In one possible implementation, the newly added data includes first metadata and first operation data for a first storage unit, and updating the data lineage graph includes:

if the data lineage graph does not contain a node corresponding to the first metadata, adding a first node corresponding to the first metadata to the data lineage graph;

determining, according to the first operation data, second metadata that has an association relationship with the first metadata, and a first type of that association relationship; and

establishing a connecting edge of the first type between the first node and a second node corresponding to the second metadata.

In one possible implementation, the newly added data includes first metadata for a newly added first storage unit, and determining the attribute value of the node related to the newly added data according to the updated data lineage graph includes:

taking the node corresponding to the first metadata as an initial node in the updated data lineage graph, and searching, starting from the initial node, for a target node that has a preset association relationship with the initial node; and

if the target node is found, taking the attribute value of the target node as the attribute value of the initial node.

Further, the preset association relationship includes a generation relationship between nodes, in which the initial node is generated based on the target node.

Further, determining the attribute value of the node related to the newly added data according to the updated data lineage graph further includes:

if the target node is not found, acquiring a judgment rule for risk information;

sampling from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;

identifying the risk information of each piece of sampled data using the judgment rule, and determining the risk information of the newly added data from the combined results; and

determining the attribute value of the initial node according to the risk information of the newly added data.

In a second aspect, a data asset risk discovery apparatus is provided, the apparatus comprising:

a first acquisition unit, configured to acquire newly added data for the metadata and operation data of a target data asset, where the metadata includes description data for data storage units of the target data asset, and the operation data is access behavior data for the data storage units;

a second acquisition unit, configured to acquire a pre-established data lineage graph corresponding to the target data asset, where the data lineage graph is established based on historical data of the metadata and operation data; the data lineage graph comprises nodes and connecting edges, the nodes being determined based on the metadata, and the connecting edges being determined based on the operation data and reflecting the association relationships between the nodes; and the attribute value of a node identifies risk information of the data storage unit described by the corresponding metadata;

an updating unit, configured to update the data lineage graph acquired by the second acquisition unit according to the newly added data acquired by the first acquisition unit; and

a determining unit, configured to determine, according to the updated data lineage graph obtained by the updating unit, an attribute value of a node related to the newly added data, and to determine risk information of the newly added data according to the attribute value.

In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.

In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.

With the method and device provided by the embodiments of this specification, newly added data for the metadata and operation data of the target data asset is first acquired, where the metadata includes description data for data storage units of the target data asset and the operation data is access behavior data for the data storage units. A pre-established data lineage graph corresponding to the target data asset is then acquired; the graph is established based on historical data of the metadata and operation data and comprises nodes and connecting edges, the nodes being determined based on the metadata, the connecting edges being determined based on the operation data and reflecting the association relationships between the nodes, and the attribute value of a node identifying risk information of the data storage unit described by the corresponding metadata. Next, the data lineage graph is updated according to the newly added data. Finally, an attribute value of a node related to the newly added data is determined according to the updated data lineage graph, and risk information of the newly added data is determined according to the attribute value. As can be seen from the above, the embodiments of this specification perform data asset risk discovery based on graph computation: the flow behavior and flow characteristics of data over its full life cycle are depicted as a graph, so that the risk information of newly added data can be determined quickly. For example, data sensitivity and behavior sensitivity can be judged in real time, supporting real-time decisions on and handling of sensitive information. The efficiency of risk discovery can therefore be improved without relying on an increase in the number of servers.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;

FIG. 2 illustrates a flow diagram of a data asset risk discovery method according to one embodiment;

FIG. 3 shows a schematic diagram of the composition of a data lineage graph according to one embodiment;

FIG. 4 illustrates a system architecture diagram of data asset risk discovery according to one embodiment;

FIG. 5 shows a schematic block diagram of a data asset risk discovery apparatus according to one embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The scenario involves data asset risk discovery, which generally includes identifying the sensitive data in a data asset so that it can be handled in a way that prevents leakage, and optionally also guarding against abnormal operations on the data asset. Abnormal operations, also called risky behaviors, can easily lead to sensitive data leakage, system failure, and other consequences; they typically include collecting sensitive data by bypassing the platform's desensitization mechanism, credential stuffing, and the like. A data asset is typically structured data stored in a database. The database includes a plurality of data tables, and each data table includes a plurality of fields, where a field corresponds to a column. The primary task of risk discovery is to determine whether each field is sensitive data.

Referring to FIG. 1, table one is an original data table in a database, and field 1 of table one is sensitive data. One operation on the database creates table two, in which both field 1 and field 2 have a truncation relationship with field 1 of table one; that is, field 1 and field 2 of table two are each obtained by extracting a substring of the string in field 1 of table one. For example, if the string in field 1 of table one is abcd, then field 1 of table two holds the substring ab and field 2 of table two holds the substring cd. Field 1 and field 2 of table two thus have an association relationship with field 1 of table one.

Still referring to FIG. 1, another operation on the database creates table three, in which field 1 has a copy relationship with field 1 of table one; that is, field 1 of table three is identical to field 1 of table one. For example, if the string in field 1 of table one is abcd, the string in field 1 of table three is also abcd. Field 1 of table three thus has an association relationship with field 1 of table one.

This association relationship may be referred to as a lineage (blood) relationship, which describes the upstream-downstream relationship between data. Relationships between fields generally include copying, truncation, splicing (concatenation), conversion, and the like, each indicating that the data of one field is processed to obtain the data of another field.
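As a small illustration of these field-level relationships, the sketch below derives downstream field values from the upstream string abcd used in the FIG. 1 examples. The function names are hypothetical and chosen only for this illustration; they are not part of the described method.

```python
def copy_field(value: str) -> str:
    """Copy relationship: the downstream field is identical to the upstream field."""
    return value


def truncate_field(value: str, start: int, length: int) -> str:
    """Truncation relationship: the downstream field is a substring of the upstream field."""
    return value[start:start + length]


def concat_fields(*values: str) -> str:
    """Splicing relationship: the downstream field joins several upstream fields."""
    return "".join(values)


source = "abcd"                                    # field 1 of table one
table_two_field_1 = truncate_field(source, 0, 2)   # first two characters
table_two_field_2 = truncate_field(source, 2, 2)   # last two characters
table_three_field_1 = copy_field(source)           # unchanged copy
rejoined = concat_fields(table_two_field_1, table_two_field_2)
```

In every case the downstream value is fully determined by the upstream value, which is why the sensitivity of an upstream field is informative about the fields derived from it.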

In the embodiments of this specification, the metadata and operation data involved in a data asset can be obtained automatically by parsing the statements that operate on the data asset. The metadata includes description data of the data storage units of the data asset, and the operation data is access behavior data for those data storage units, which reflects the association relationships between different data storage units. Based on these association relationships, the risk information of newly added data can be determined quickly. For example, data sensitivity and behavior sensitivity can be judged in real time, supporting real-time decisions on and handling of sensitive information, so that the efficiency of risk discovery can be improved without relying on an increase in the number of servers.

FIG. 2 illustrates a flow diagram of a data asset risk discovery method according to one embodiment, which may be based on the implementation scenario illustrated in FIG. 1. As shown in FIG. 2, the data asset risk discovery method in this embodiment includes the following steps: step 21, acquiring newly added data for the metadata and operation data of a target data asset, where the metadata includes description data for data storage units of the target data asset and the operation data is access behavior data for the data storage units; step 22, acquiring a pre-established data lineage graph corresponding to the target data asset, where the data lineage graph is established based on historical data of the metadata and operation data and comprises nodes and connecting edges, the nodes being determined based on the metadata, the connecting edges being determined based on the operation data and reflecting the association relationships between the nodes, and the attribute value of a node identifying risk information of the data storage unit described by the corresponding metadata; step 23, updating the data lineage graph according to the newly added data; and step 24, determining, according to the updated data lineage graph, an attribute value of a node related to the newly added data, and determining the risk information of the newly added data according to the attribute value. Specific ways of performing these steps are described below.

First, in step 21, newly added data for the metadata and operation data of the target data asset is acquired; the metadata includes description data for data storage units of the target data asset, and the operation data is access behavior data for the data storage units. It is understood that the object of an operation is a data storage unit, and that the access behavior may include, but is not limited to, creating a data table, adding fields to an existing data table, and the like.

In one example, the target data asset belongs to structured data whose data storage units are identified by a database, a data table, and a data column.

It is understood that a database typically includes a plurality of data tables, and a data table includes a plurality of data columns; when a data column is described, the data table and database to which it belongs are usually indicated as well. For example, a globally unique identifier (GUID) of the form project_name.table_name.column_name may be used to identify a data column, where project_name represents the database, table_name represents the data table, and column_name represents the data column. The metadata may be used to describe the data columns in the target data asset.
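The identifier scheme can be sketched as a small value type. This is a minimal sketch that assumes the three parts are joined by dots, following identifiers such as p1.t2.c1 that appear later in this description; the class name ColumnId and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnId:
    """Identifier of a data column: database, data table, and data column."""
    project_name: str
    table_name: str
    column_name: str

    @classmethod
    def parse(cls, guid: str) -> "ColumnId":
        # Assumption: the identifier has exactly three dot-separated, non-empty parts.
        parts = guid.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected project.table.column, got {guid!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.project_name}.{self.table_name}.{self.column_name}"


col = ColumnId.parse("p1.t2.c1")
```

Because the type is frozen, instances are hashable and can be used directly as node keys in a graph structure.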

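In one example, acquiring the newly added data for the metadata and operation data of the target data asset includes:

obtaining a Structured Query Language (SQL) statement that operates on the target data asset; and

parsing the SQL statement, and determining the newly added data according to the metadata and operation data that the statement involves.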

In this example, the various operations on the database may be carried out by executing SQL statements. Parsing an SQL statement yields the data storage units involved in the operation and the association relationships among them, where a single data storage unit is a data column or a data table. It is understood that the association relationships may include relationships between fields, relationships between a field and a data table, and relationships between data tables; the data storage units correspond to the metadata, and the association relationships correspond to the operation data.

Parsing an SQL statement may be referred to as SQL parsing. SQL parsing yields the metadata and operation data that a statement involves. The metadata involved may describe an original data column in the database or a newly added data column, and correspondingly the operation data may be embodied as an association relationship between an original data column and a newly added data column. The newly added data may include two parts: newly added metadata and newly added operation data.

For example, the following SQL:

create table p1.t2 from (select c1 from p1.t1);

through SQL parsing, the following results can be obtained:

the data column p1.t2.c1 in the data table t2 is created based on the data column p1.t1.c1 in the data table t1, wherein the data column p1.t2.c1 in the data table t2 corresponds to the newly added metadata, and the operation data for the metadata is the newly added operation data.

Mature third-party libraries are currently available for SQL parsing, so its principles are not described further here.
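For illustration only, the parsing result described above can be reproduced with a toy regular-expression parser. This is a sketch, not a substitute for a real SQL parsing library: it handles only statements of the exact shape shown in the example, the function name parse_lineage is hypothetical, and labelling a plain column selection as a copy relationship is an assumption of this sketch.

```python
import re

# Matches only: create table <target> from (select <columns> from <source>);
_CREATE_FROM_SELECT = re.compile(
    r"create\s+table\s+(?P<target>[\w.]+)\s+from\s*\(\s*select\s+"
    r"(?P<cols>[\w\s,]+?)\s+from\s+(?P<source>[\w.]+)\s*\)\s*;?\s*$",
    re.IGNORECASE,
)


def parse_lineage(sql: str) -> list[tuple[str, str, str]]:
    """Return (source column, target column, relation) triples for one statement."""
    match = _CREATE_FROM_SELECT.match(sql.strip())
    if match is None:
        raise ValueError("statement shape not supported by this toy parser")
    target, source = match["target"], match["source"]
    columns = [c.strip() for c in match["cols"].split(",")]
    # Assumption: a column selected without transformation is a copy.
    return [(f"{source}.{c}", f"{target}.{c}", "copy") for c in columns]


triples = parse_lineage("create table p1.t2 from (select c1 from p1.t1);")
```

Here the target column p1.t2.c1 plays the role of the newly added metadata, and the triple linking it to p1.t1.c1 plays the role of the newly added operation data.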

Then, in step 22, a pre-established data lineage graph corresponding to the target data asset is acquired. The data lineage graph is established based on historical data of the metadata and operation data; it comprises nodes and connecting edges, the nodes being determined based on the metadata, and the connecting edges being determined based on the operation data and reflecting the association relationships between the nodes; and the attribute value of a node identifies risk information of the data storage unit described by the corresponding metadata. It is understood that the risk information may specifically indicate whether the data storage unit holds sensitive data, that is, whether there is a risk of sensitive data leakage.

In one example, the target data asset is structured data whose data storage units are identified by a database, a data table, and a data column, and each node corresponds to a data column.

Further, the association relationships include a generation relationship between nodes, in which a first node is generated based on a second node. In the embodiments of this specification, a first node being generated based on a second node may mean that the data storage unit corresponding to the first node is produced by copying the data stored in the data storage unit corresponding to the second node, or by truncating that data.

The embodiments of this specification do not exclude the data lineage graph also including other types of nodes, for example nodes corresponding to data tables; in that case, the association relationships further include the attribution relationship between a data column and a data table.

In one example, the risk information of the metadata includes risk classification information indicating whether the data in the corresponding data storage unit is sensitive data, and/or risk grading information indicating the level of the sensitive data.

In this example, a plurality of levels, for example, three levels, i.e., high, medium, and low, may be pre-divided, and sensitive data at each level is only accessible to users with a corresponding permission level, but is not accessible to other users.
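The level scheme can be sketched as an ordered enumeration with a permission check. This sketch assumes that a "corresponding" permission level means a level equal to or higher than the level of the data; the description does not spell this out, and the names Level and can_access are hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    """Sensitivity levels, ordered from least to most sensitive."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def can_access(user_level: Level, data_level: Level) -> bool:
    # Assumption: a user may read data at or below the user's own permission level.
    return user_level >= data_level


analyst_may_read_high = can_access(Level.MEDIUM, Level.HIGH)
admin_may_read_medium = can_access(Level.HIGH, Level.MEDIUM)
```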

In the embodiments of this specification, the data lineage graph may be established based on SQL parsing. SQL parsing is the cornerstone of constructing data lineage: it mainly analyzes the inheritance relationships described in SQL between fields and tables, between fields, and between tables. Generally, relationships between fields may include copy (copy), truncation (substr), concatenation (concat), and the like; the relationship between tables is dependency (depend); and the relationship between a field and a table is belonging (belong). A parsed lineage relationship may be represented by a triple (source_node, target_node, relation), where source_node is the identifier of the source node, target_node is the identifier of the target node, and relation is the relationship between the two nodes.
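The triple representation can be sketched with a named tuple and an enumeration of the relationship types listed above. The type names, the sample identifiers, and the direction chosen for the table-level and field-to-table triples are assumptions made for this illustration.

```python
from enum import Enum
from typing import NamedTuple


class Relation(Enum):
    COPY = "copy"        # field to field
    SUBSTR = "substr"    # field to field (truncation)
    CONCAT = "concat"    # field to field (concatenation)
    DEPEND = "depend"    # table to table
    BELONG = "belong"    # field to table


class LineageTriple(NamedTuple):
    source_node: str
    target_node: str
    relation: Relation


FIELD_LEVEL = {Relation.COPY, Relation.SUBSTR, Relation.CONCAT}

triples = [
    LineageTriple("p1.t1.c1", "p1.t2.c1", Relation.SUBSTR),
    LineageTriple("p1.t1.c1", "p1.t2.c2", Relation.SUBSTR),
    LineageTriple("p1.t1.c1", "p1.t3.c1", Relation.COPY),
    LineageTriple("p1.t2.c1", "p1.t2", Relation.BELONG),   # assumed direction
    LineageTriple("p1.t1", "p1.t2", Relation.DEPEND),      # assumed direction
]

# Field-to-field triples are the ones that carry data from one column to another.
field_triples = [t for t in triples if t.relation in FIELD_LEVEL]
```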

FIG. 3 shows a schematic diagram of the composition of a data lineage graph according to one embodiment. Referring to FIG. 3, the data lineage graph includes two types of nodes: one type represents data tables, for example node t1 and node t2; the other type represents data columns, for example node c1, node c2, …, node c7. The graph also includes two types of connecting edges. One type connects a node representing a data column with a node representing a data table and represents the attribution relationship between the two, for example the connecting edge between node c3 and node t1. The other type connects two nodes representing data columns and represents a generation relationship between them, for example the connecting edge between node c1 and node c7, whose direction from node c1 to node c7 indicates that node c7 is generated from node c1. The attribute value of node c1 identifies that the data in the data storage unit described by the corresponding metadata is sensitive data and that its sensitivity level is high.
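A graph of this shape can be sketched with two small data classes. The sketch includes only the nodes and edges that the text names explicitly (t1, t2, c1, c3, c7, the edge between c3 and t1, and the edge from c1 to c7); the class names, the relation labels "belong" and "generate", and the use of a string for the risk attribute are assumptions of this illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: str
    kind: str                   # "table" or "column"
    risk: Optional[str] = None  # attribute value; None means not yet determined


@dataclass
class LineageGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (source_id, target_id, relation)

    def add_node(self, node_id: str, kind: str, risk: Optional[str] = None) -> None:
        self.nodes[node_id] = Node(node_id, kind, risk)

    def add_edge(self, source_id: str, target_id: str, relation: str) -> None:
        self.edges.append((source_id, target_id, relation))

    def downstream(self, node_id: str, relation: str = "generate") -> list:
        """Nodes reached from node_id by following edges of the given relation."""
        return [t for s, t, r in self.edges if s == node_id and r == relation]


graph = LineageGraph()
graph.add_node("t1", "table")
graph.add_node("t2", "table")
graph.add_node("c1", "column", risk="high")  # sensitive data, level high
graph.add_node("c3", "column")
graph.add_node("c7", "column")
graph.add_edge("c3", "t1", "belong")         # attribution edge
graph.add_edge("c1", "c7", "generate")       # c7 is generated from c1
```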

Then, in step 23, the data lineage graph is updated according to the newly added data. It is understood that the newly added data may include newly added metadata and/or newly added operation data; accordingly, updating the data lineage graph may specifically include adding nodes and/or connecting edges to it.

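In one example, the newly added data includes first metadata and first operation data for a first storage unit, and updating the data lineage graph includes:

if the data lineage graph does not contain a node corresponding to the first metadata, adding a first node corresponding to the first metadata to the data lineage graph;

determining, according to the first operation data, second metadata that has an association relationship with the first metadata, and a first type of that association relationship; and

establishing a connecting edge of the first type between the first node and a second node corresponding to the second metadata.

In this example, updating the data lineage graph includes adding both a node and a connecting edge to the graph.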
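The update in this example can be sketched as follows. The graph is held as a dictionary from node identifier to risk attribute (None when unknown) plus a set of edge triples; the shape of the operation record, with "source" and "relation" keys, is a hypothetical format chosen for this sketch.

```python
def update_graph(nodes: dict, edges: set, first_metadata: str, first_operation: dict) -> None:
    """Add the node and connecting edge implied by one piece of newly added data."""
    # Add the first node only if the graph does not contain it yet.
    if first_metadata not in nodes:
        nodes[first_metadata] = None  # risk attribute not yet determined

    # Determine the associated (second) metadata and the relationship type.
    second_metadata = first_operation["source"]
    relation_type = first_operation["relation"]

    # Assumption of this sketch: tolerate a second node that is missing from the graph.
    nodes.setdefault(second_metadata, None)

    # Connect the second node to the first node with an edge of that type.
    edges.add((second_metadata, first_metadata, relation_type))


nodes = {"p1.t1.c1": "high"}
edges = set()
update_graph(nodes, edges, "p1.t2.c1", {"source": "p1.t1.c1", "relation": "copy"})
```

Storing edges in a set makes the update idempotent, so replaying the same operation record does not create duplicate edges.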

Finally, in step 24, an attribute value of a node related to the newly added data is determined according to the updated data lineage graph, and the risk information of the newly added data is determined according to the attribute value. It is understood that, because the attribute value of a node identifies the risk information of the data storage unit described by the corresponding metadata, once a node related to the newly added data has been found, the risk information identified by that node's attribute value can be used as the risk information of the newly added data.

In one example, the newly added data includes first metadata for a newly added first storage unit, and determining the attribute value of the node related to the newly added data according to the updated data lineage graph includes:

taking the node corresponding to the first metadata as an initial node in the updated data lineage graph, and searching, starting from the initial node, for a target node that has a preset association relationship with the initial node; and

if the target node is found, taking the attribute value of the target node as the attribute value of the initial node.

Further, the preset association relationship comprises a generation relationship between nodes, and the initial node is generated based on the target node.

It is understood that the association relationship between nodes can be expressed by the attribute information of the connecting edge, which can identify whether the nodes are in a generation relationship. The initial node being generated based on the target node means that the target node is the parent node and the initial node is the child node; in other words, the target node is upstream of the initial node and the initial node is downstream of the target node. The upstream-downstream relationship between nodes can be expressed by the direction of the connecting edge.
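The search for a target node can be sketched as a scan over the edges that point into the initial node. This sketch looks only at direct upstream nodes and takes the first one whose attribute value is known; how to combine several upstream nodes is not stated in the description, so that choice, the function name, and the set of relation labels treated as generation relationships are assumptions.

```python
from typing import Optional

# Assumption: these edge labels count as generation relationships.
GENERATION_RELATIONS = {"copy", "substr", "concat"}


def inherit_risk(nodes: dict, edges: list, initial: str) -> Optional[str]:
    """Copy the risk attribute of an upstream (parent) node onto the initial node.

    nodes maps node id -> risk attribute (None when unknown);
    edges holds (source, target, relation) triples directed from upstream to downstream.
    Returns the inherited attribute, or None if no target node is found.
    """
    for source, target, relation in edges:
        if (target == initial
                and relation in GENERATION_RELATIONS
                and nodes.get(source) is not None):
            nodes[initial] = nodes[source]
            return nodes[initial]
    return None


nodes = {"p1.t1.c1": "high", "p1.t2.c1": None, "p1.t9.c1": None}
edges = [("p1.t1.c1", "p1.t2.c1", "copy")]
found = inherit_risk(nodes, edges, "p1.t2.c1")
not_found = inherit_risk(nodes, edges, "p1.t9.c1")
```

When the function returns None, no upstream node with a known attribute exists, and another way of identifying the data is needed.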

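Further, determining the attribute value of the node related to the newly added data according to the updated data lineage graph further includes:

if the target node is not found, acquiring a judgment rule for risk information;

sampling from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;

identifying the risk information of each piece of sampled data using the judgment rule, and determining the risk information of the newly added data from the combined results; and

determining the attribute value of the initial node according to the risk information of the newly added data.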

It is understood that, if no target node having the preset association relationship with the initial node is found, the data stored in the data storage unit corresponding to the initial node needs to be identified directly. For example, if the data storage unit is a data column, the data in the column may be sampled to obtain a plurality of pieces of sampled data, and the identification result for the column is determined from the identification results of the individual samples. For example, if a data column holds 1000 pieces of data, 20 pieces can be sampled from it and each checked for sensitivity; if the proportion of samples identified as sensitive exceeds a preset proportion, the data column is determined to be sensitive data.
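The sampling step can be sketched as below, using the figures from the example (1000 values, 20 samples). The threshold of 0.5 stands in for the preset proportion, which the description does not specify, and the function name and the digit-count rule are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def column_is_sensitive(
    values: Sequence[str],
    rule: Callable[[str], bool],
    sample_size: int = 20,
    threshold: float = 0.5,   # assumed value for the preset proportion
    seed: Optional[int] = None,
) -> bool:
    """Judge a whole column from a random sample of its values."""
    if not values:
        return False
    rng = random.Random(seed)
    sample = rng.sample(list(values), min(sample_size, len(values)))
    hits = sum(1 for value in sample if rule(value))
    return hits / len(sample) > threshold


def looks_like_phone(value: str) -> bool:
    # Hypothetical judgment rule: exactly eleven digits.
    return len(value) == 11 and value.isdigit()


phone_column = [f"138{i:08d}" for i in range(1000)]
text_column = ["hello"] * 1000
phone_result = column_is_sensitive(phone_column, looks_like_phone, seed=0)
text_result = column_is_sensitive(text_column, looks_like_phone, seed=0)
```

Only the sampled values are inspected, so the cost of judging a column is bounded by the sample size rather than by the number of rows.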

The judgment rule may be a rule that identifies the risk information directly, for example by checking the number of characters and the character types of a string to determine whether it is sensitive data. The judgment rule may also be a rule that identifies the risk information indirectly, for example by using a specified neural network model to determine whether the data is sensitive data.
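Both kinds of rule can be sketched behind one callable interface. The direct rule checks character count and character type as described; the indirect rule wraps a scoring function, and the scorer used here is a trivial stand-in for a trained model, not a real neural network. All names and the cutoff value are hypothetical.

```python
from typing import Callable

Rule = Callable[[str], bool]


def make_format_rule(length: int, char_check: Callable[[str], bool]) -> Rule:
    """Direct rule: a fixed number of characters of a given character type."""
    def rule(value: str) -> bool:
        return len(value) == length and char_check(value)
    return rule


def make_model_rule(score: Callable[[str], float], cutoff: float) -> Rule:
    """Indirect rule: delegate to a scoring model and compare against a cutoff."""
    def rule(value: str) -> bool:
        return score(value) >= cutoff
    return rule


eleven_digits = make_format_rule(11, str.isdigit)


def toy_score(value: str) -> float:
    # Stand-in for a model's sensitivity score; assumption for illustration only.
    return 0.9 if "@" in value else 0.1


model_rule = make_model_rule(toy_score, cutoff=0.5)
```

Because both kinds of rule share the same signature, the code that applies a rule to sampled data does not need to know which kind it was given.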

With the method provided by the embodiments of this specification, newly added data for the metadata and operation data of the target data asset is first acquired, where the metadata includes description data for data storage units of the target data asset and the operation data is access behavior data for the data storage units. A pre-established data lineage graph corresponding to the target data asset is then acquired; the graph is established based on historical data of the metadata and operation data and comprises nodes and connecting edges, the nodes being determined based on the metadata, the connecting edges being determined based on the operation data and reflecting the association relationships between the nodes, and the attribute value of a node identifying risk information of the data storage unit described by the corresponding metadata. Next, the data lineage graph is updated according to the newly added data. Finally, an attribute value of a node related to the newly added data is determined according to the updated data lineage graph, and risk information of the newly added data is determined according to the attribute value. As can be seen from the above, the embodiments of this specification perform data asset risk discovery based on graph computation: the flow behavior and flow characteristics of data over its full life cycle are depicted as a graph, so that the risk information of newly added data can be determined quickly. For example, data sensitivity and behavior sensitivity can be judged in real time, supporting real-time decisions on and handling of sensitive information. The efficiency of risk discovery can therefore be improved without relying on an increase in the number of servers.

FIG. 4 illustrates a system architecture diagram for data asset risk discovery, according to one embodiment. Referring to FIG. 4, the system may be divided into a data processing service unit 41, a data protection service unit 42, a data consanguinity service unit 43, and a graph calculation engine 44. The data processing service unit 41 provides storage and computation services for the data assets, which are structured data, and can provide the metadata and operation data corresponding to the data assets to the data protection service unit 42 and the data consanguinity service unit 43. The data protection service unit 42 provides a risk discovery service for the data assets, where risk discovery generally comprises identifying sensitive data in the data assets so that the identified sensitive data can be processed to prevent the risk of its leakage. The data consanguinity service unit 43 periodically generates data blood relationships and incrementally synchronizes them to a graph database; the result may be referred to as a data blood relationship graph. It is understood that the blood relationships include relationships between fields, between fields and tables, and between tables. The graph calculation engine 44 completes the query and computation work related to the data blood relationship graph. When the data protection service unit 42 performs risk discovery, it can query the data blood relationship graph from the data consanguinity service unit 43 and perform risk discovery according to the blood relationships between the nodes embodied by that graph, which effectively improves risk discovery efficiency.

Further, the data consanguinity service unit 43 acquires the operation data from the data processing service unit 41 and the metadata from the data protection service unit 42. It is understood that, after the data protection service unit 42 acquires the metadata and the operation data from the data processing service unit 41, it may buffer the metadata and provide it to the data consanguinity service unit 43.

In the embodiments of this specification, based on graph computation technology, sensitivity is propagated through the association relationships among data assets. Data discovery no longer depends on queue traversal but is performed through the data blood relationship graph; if strong relationships exist among the data assets, growth in data volume does not bring a proportional growth in server resources. The efficient graph traversal capability breaks through the offline limitation of the traditional architecture, realizes near-line and even online sensitive data discovery and real-time early warning, and efficiently ensures data asset security. Meanwhile, based on the data blood relationships, behavior sensitivity underlying data sensitivity can be mined, and sensitive behavior can then be predicted from historical suspicious sensitive behavior, thereby realizing full-scenario risk control before, during, and after an event.
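
The idea of spreading sensitivity along association relationships can be illustrated with a short breadth-first walk. The sketch below is only an assumption about how such propagation might look, not the procedure of the embodiment: the function `propagate_sensitivity`, the `downstream` adjacency mapping, and the column names are hypothetical.

```python
from collections import deque


def propagate_sensitivity(downstream, labels, start):
    """Copy the label of `start` to every column derived from it, directly or transitively.

    downstream: dict mapping a column to the columns generated from it.
    labels:     dict mapping a column to its known risk label; updated in place.
    """
    queue = deque([start])
    visited = {start}
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in visited:
                visited.add(child)
                # Keep an existing label if the child already has one.
                labels.setdefault(child, labels[start])
                queue.append(child)
    return labels


# Hypothetical columns, written as database.table.column.
downstream = {
    "ods.users.phone": ["dw.orders.contact"],
    "dw.orders.contact": ["rpt.daily.contact"],
    "ods.items.sku": ["dw.items.sku"],
}
labels = propagate_sensitivity(
    downstream, {"ods.users.phone": "sensitive"}, "ods.users.phone"
)
```

The walk visits only the nodes reachable from the starting column, so its cost follows the size of the related subgraph rather than the total volume of stored data.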

According to an embodiment of another aspect, a data asset risk discovery apparatus is also provided, which is used for executing the method provided by the embodiments of this specification. FIG. 5 shows a schematic block diagram of a data asset risk discovery apparatus according to one embodiment. As shown in FIG. 5, the apparatus 500 includes:

a first obtaining unit 51, configured to obtain newly added data for the metadata and the operation data in the target data asset; the metadata includes description data for a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit;

a second obtaining unit 52, configured to obtain a pre-established data blood relationship graph corresponding to the target data asset; the data blood relationship graph is established based on historical data of the metadata and operation data; the data blood relationship graph comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data, and the connecting edges embody association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to the respective metadata;

an updating unit 53, configured to update the data blood relationship graph obtained by the second obtaining unit 52 according to the newly added data obtained by the first obtaining unit 51;

a determining unit 54, configured to determine an attribute value of a node related to the newly added data according to the updated data blood relationship graph obtained by the updating unit 53, and determine risk information of the newly added data according to the attribute value.

Optionally, as an embodiment, the target data asset belongs to structured data, and its data storage unit is identified by a database, a data table, and a data column.

Further, the nodes correspond to data columns.

Optionally, as an embodiment, the risk information of the metadata includes risk classification information and/or risk grading information, where the risk classification information is used to indicate whether data in a corresponding data storage unit belongs to sensitive data, and the risk grading information is used to indicate a level of the sensitive data.

Optionally, as an embodiment, the first obtaining unit 51 includes:

an acquisition subunit, configured to acquire a Structured Query Language (SQL) statement that operates on a target data asset;

a parsing subunit, configured to parse the SQL statement acquired by the acquisition subunit and determine the newly added data according to the metadata and operation data involved in the statement.
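
How metadata and operation data might be derived from an SQL statement can be sketched as below. This is a deliberately narrow illustration under stated assumptions: it handles only the single statement shape `INSERT INTO t (cols) SELECT cols FROM s`, uses a regular expression instead of a real SQL parser, and the function name `parse_insert_select`, the edge type `generated_from`, and the table names are hypothetical.

```python
import re

# Matches only: INSERT INTO <target> (<cols>) SELECT <cols> FROM <source>
INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<target_cols>[^)]*)\)\s*"
    r"SELECT\s+(?P<source_cols>.+?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)


def parse_insert_select(sql):
    """Return column metadata and column-level operation data, or None if unsupported."""
    match = INSERT_SELECT.search(sql)
    if match is None:
        return None
    target_cols = [c.strip() for c in match["target_cols"].split(",")]
    source_cols = [c.strip() for c in match["source_cols"].split(",")]
    if len(target_cols) != len(source_cols):
        return None
    target, source = match["target"], match["source"]
    # Metadata: the written columns, identified as database.table.column.
    metadata = [f"{target}.{col}" for col in target_cols]
    # Operation data: each target column is generated from the source column
    # in the same position.
    operations = [
        {"type": "generated_from", "target": f"{target}.{t}", "source": f"{source}.{s}"}
        for t, s in zip(target_cols, source_cols)
    ]
    return {"metadata": metadata, "operations": operations}


parsed = parse_insert_select(
    "INSERT INTO dw.orders (contact, buyer) SELECT phone, name FROM ods.users"
)
```

A production system would more plausibly rely on the parser of the database engine itself, since expressions, joins, and subqueries are outside what this pattern can recognize.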

Optionally, as an embodiment, the newly added data includes first metadata and first operation data for a first storage unit; the updating unit 53 includes:

a node adding subunit, configured to add a first node corresponding to the first metadata to the data blood relationship graph if the data blood relationship graph does not include a node corresponding to the first metadata;

a determining subunit, configured to determine, according to the first operation data, second metadata having an association relationship with the first metadata and a first type of the association relationship;

an edge establishing subunit, configured to establish a connecting edge of the first type between the first node added by the node adding subunit and a second node corresponding to the second metadata determined by the determining subunit.
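
The three subunits above can be sketched together as a single update routine. The following is a minimal sketch under assumptions, not the apparatus itself: the class `RelationshipGraph`, the function `update_graph`, the dictionary layout of the operation data, and the decision to create the second node when it is missing are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RelationshipGraph:
    # Node id -> risk attribute value; None means "not yet determined".
    attributes: dict = field(default_factory=dict)
    # Connecting edges stored as (first_node, second_node, edge_type).
    edges: set = field(default_factory=set)

    def add_node_if_missing(self, node_id):
        """Add a node without an attribute value; return True if it was new."""
        if node_id in self.attributes:
            return False
        self.attributes[node_id] = None
        return True


def update_graph(graph, first_metadata, first_operation):
    """Apply one piece of newly added data to the graph."""
    # Node adding: create the first node only if the graph lacks it.
    graph.add_node_if_missing(first_metadata)
    # Determining: read the associated metadata and the edge type from the
    # operation data (the dictionary keys are an assumption of this sketch).
    second_metadata = first_operation["source"]
    edge_type = first_operation["type"]
    # Assumption: the second node is also created if it is not present yet.
    graph.add_node_if_missing(second_metadata)
    # Edge establishing: connect the two nodes with an edge of that type.
    graph.edges.add((first_metadata, second_metadata, edge_type))
    return graph


graph = RelationshipGraph(attributes={"ods.users.phone": "sensitive"})
update_graph(
    graph,
    "dw.orders.contact",
    {"type": "generated_from", "source": "ods.users.phone"},
)
```

Storing edges in a set makes repeated application of the same operation data idempotent, which suits incremental synchronization.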

Optionally, as an embodiment, the newly added data includes first metadata for a newly added first storage unit; the determining unit 54 includes:

a searching subunit, configured to, in the updated data blood relationship graph, take the node corresponding to the first metadata as an initial node and, starting from the initial node, search for a target node having a preset association relationship with the initial node;

a first determining subunit, configured to, if the target node is found by the searching subunit, take the attribute value of the target node as the attribute value of the initial node.
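
The search-and-inherit step can be sketched as a walk over edges of the preset type. This sketch rests on assumptions that the text does not state: the function `inherit_attribute` is hypothetical, the preset association relationship is taken to be `generated_from`, and the walk continues transitively through nodes whose attribute value is still unknown.

```python
from collections import deque


def inherit_attribute(attributes, edges, initial, preset_type="generated_from"):
    """Find a target node reachable from `initial` and copy its attribute value.

    attributes: dict of node id -> attribute value (None when unknown).
    edges:      list of (from_node, to_node, edge_type) tuples.
    Returns the target node id, or None when no target node is found.
    """
    queue = deque([initial])
    visited = {initial}
    while queue:
        node = queue.popleft()
        for src, dst, edge_type in edges:
            if src != node or edge_type != preset_type or dst in visited:
                continue
            if attributes.get(dst) is not None:
                # Target node found: the initial node takes over its value.
                attributes[initial] = attributes[dst]
                return dst
            visited.add(dst)
            queue.append(dst)
    return None


attributes = {"rpt.daily.contact": None, "dw.orders.contact": None, "ods.users.phone": "sensitive"}
edges = [
    ("rpt.daily.contact", "dw.orders.contact", "generated_from"),
    ("dw.orders.contact", "ods.users.phone", "generated_from"),
]
target = inherit_attribute(attributes, edges, "rpt.daily.contact")
```

When the function returns `None`, no target node was found and the attribute value of the initial node has to be determined in another way.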

Further, the determining unit 54 further includes:

an obtaining subunit, configured to obtain a decision rule of the risk information if the target node is not found by the searching subunit;

a sampling subunit, configured to sample from the target data asset according to the newly added data to obtain a plurality of pieces of sampled data;

an identifying subunit, configured to identify the risk information of each of the plurality of pieces of sampled data by using the decision rule obtained by the obtaining subunit, so as to comprehensively determine the risk information of the newly added data;

a second determining subunit, configured to determine the attribute value of the initial node according to the risk information of the newly added data obtained by the identifying subunit.
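
The fallback path of sampling and rule-based identification can be sketched as follows. The text does not say how the per-sample results are combined, so the ratio threshold used here is an assumption, as are the function `identify_by_sampling`, its default parameters, and the example rule.

```python
import random


def identify_by_sampling(values, rule, sample_size=100, threshold=0.8, seed=0):
    """Decide whether a storage unit is sensitive from a random sample of its values.

    values: list of stored values for the unit, for example one data column.
    rule:   callable returning True when a single value looks sensitive.
    The unit is treated as sensitive when the share of matching samples
    reaches `threshold` (an assumption of this sketch).
    """
    if not values:
        return False
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(values, min(sample_size, len(values)))
    hits = sum(1 for value in sample if rule(value))
    return hits / len(sample) >= threshold


def looks_like_phone(value):
    """Hypothetical direct rule: exactly 11 ASCII digits."""
    return len(value) == 11 and value.isascii() and value.isdigit()


column_values = ["13812345678"] * 9 + ["n/a"]
is_sensitive = identify_by_sampling(column_values, looks_like_phone, sample_size=10)
```

The result could then be written to the initial node as its attribute value, so that later lookups through the graph no longer need to sample the underlying data.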

With the apparatus provided by the embodiments of this specification, the first obtaining unit 51 first obtains newly added data for the metadata and operation data in the target data asset, where the metadata includes description data for a data storage unit of the target data asset and the operation data is access behavior data for the data storage unit. The second obtaining unit 52 then obtains a pre-established data blood relationship graph corresponding to the target data asset. The data blood relationship graph is established based on historical data of the metadata and operation data; it comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data and embody the association relationships among the nodes, and the attribute value of a node identifies risk information of the data storage unit corresponding to the respective metadata. The updating unit 53 then updates the data blood relationship graph according to the newly added data. Finally, the determining unit 54 determines the attribute value of the node related to the newly added data according to the updated data blood relationship graph, and determines the risk information of the newly added data according to the attribute value. As can be seen from the above, in the embodiments of this specification, data asset risk discovery is performed based on graph computation, and the flow behavior and flow characteristics of data over its full life cycle are depicted as a graph. Risk information of newly added data can therefore be determined quickly, for example, data sensitivity and behavior sensitivity can be determined in real time, which supports real-time decisions on and processing of sensitive information. The efficiency of risk discovery can thus be improved without relying on an increase in the number of servers.

According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.

According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.

The above embodiments further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the above embodiments are only exemplary embodiments of the present invention and are not intended to limit its scope of protection; any modifications, equivalent substitutions, improvements, and the like made on the basis of the technical solutions of the present invention shall be included in the scope of protection of the present invention.

Claims

1. A data asset risk discovery method, the method comprising:

acquiring a pre-established data blood relationship graph corresponding to the target data asset; the data blood relationship graph is established based on historical data of the metadata and operation data; the data blood relationship graph comprises nodes and connecting edges, the nodes are determined based on the metadata, the connecting edges are determined based on the operation data, and the connecting edges embody association relationships among the nodes; the attribute value of a node identifies risk information of the data storage unit corresponding to the respective metadata;

updating the data blood relationship graph according to the newly added data;

2. The method of claim 1, wherein the target data asset belongs to structured data, and its data storage units are identified by a database, a data table, and a data column.

3. The method of claim 2, wherein the nodes correspond to columns of data.

4. The method of claim 3, wherein the association relationship comprises a generating relationship between nodes, the generating relationship being that a first node is generated based on a second node.

5. The method of claim 1, wherein the risk information of the metadata includes risk classification information and/or risk grading information, the risk classification information indicating whether data in a corresponding data storage unit belongs to sensitive data, and the risk grading information indicating a level of the sensitive data.

6. The method of claim 1, wherein the acquiring newly added data for metadata and operation data in the target data asset comprises:

acquiring a Structured Query Language (SQL) statement operating on a target data asset;

7. The method of claim 1, wherein the newly added data includes first metadata and first operation data for a first storage unit; the updating the data blood relationship graph comprises:

8. The method of claim 1, wherein the newly added data includes first metadata for a newly added first storage unit; the determining an attribute value of a node related to the newly added data according to the updated data blood relationship graph comprises:

9. The method of claim 8, wherein the preset associative relationship includes a generated relationship between nodes, and the initial node is generated based on the target node.

10. The method of claim 8, wherein the determining an attribute value of a node related to the newly added data according to the updated data blood relationship graph further comprises:

if the target node is not found, acquiring a decision rule of the risk information;

11. A data asset risk discovery apparatus, the apparatus comprising:

a first obtaining unit, configured to obtain newly added data for the metadata and the operation data in the target data asset; the metadata includes description data for a data storage unit of the target data asset, and the operation data is access behavior data for the data storage unit;

12. The apparatus of claim 11, wherein the target data asset belongs to structured data, and its data storage units are identified by a database, a data table, and a data column.

13. The apparatus of claim 12, wherein the nodes correspond to columns of data.

14. The apparatus of claim 13, wherein the associative relationship comprises a generating relationship between nodes, the generating relationship being that a first node is generated based on a second node.

15. The apparatus of claim 11, wherein the risk information of the metadata comprises risk classification information and/or risk grading information, the risk classification information indicating whether data in a corresponding data storage unit belongs to sensitive data, and the risk grading information indicating a level of the sensitive data.

16. The apparatus of claim 11, wherein the first obtaining unit comprises:

17. The apparatus of claim 11, wherein the newly added data includes first metadata and first operation data for a first storage unit; the updating unit includes:

18. The apparatus of claim 11, wherein the newly added data includes first metadata for a newly added first storage unit; the determining unit includes:

a searching subunit, configured to, in the updated data blood relationship graph, take the node corresponding to the first metadata as an initial node and, starting from the initial node, search for a target node having a preset association relationship with the initial node;

19. The apparatus of claim 18, wherein the preset associative relationship comprises a generated relationship between nodes, and the initial node is generated based on the target node.

20. The apparatus of claim 18, wherein the determining unit further comprises:

21. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-10.

22. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-10.