CN116541887A - Data security protection method for big data platform - Google Patents

Publication number
CN116541887A
Authority
CN
China
Prior art keywords
data
security protection
relation
field
storing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310831904.8A
Other languages
Chinese (zh)
Other versions
CN116541887B (en)
Inventor
胡琦
严鹤
王俊
杨权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunqi Intelligent Technology Co ltd
Original Assignee
Yunqi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunqi Intelligent Technology Co ltd filed Critical Yunqi Intelligent Technology Co ltd
Priority to CN202310831904.8A priority Critical patent/CN116541887B/en
Publication of CN116541887A publication Critical patent/CN116541887A/en
Application granted granted Critical
Publication of CN116541887B publication Critical patent/CN116541887B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Computational Linguistics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data security protection method for a big data platform, relating to the technical field of computers. The method comprises the following steps: the big data platform gathers all data of the service systems and, during data development and governance, stores all data tables in a data warehouse by category; data lineage ("blood") relationships among the data tables are automatically captured from the ETL scheduling job dependencies of the data management platform, a relationship graph of the data tables and their lineage relationships is formed, and the relationship graph is stored in a metadata database; security protection measures are adopted according to different service requirements, the service requirements and the corresponding security protection measures form a plurality of data security protection strategies, and the strategies are stored in a data security management platform; the user inputs the current data and the current service requirement, the security protection measure for the current data is queried according to the current service requirement, and security protection is executed on the current data based on that measure. By using data lineage, the invention identifies data quickly and greatly improves the efficiency of data identification.

Description

Data security protection method for big data platform
Technical Field
The invention relates to the technical field of computers, in particular to a data security protection method for a big data platform.
Background
A database can, in short, be regarded as an electronic filing cabinet. In the prior art, metadata is a very important class of data generated during database management. Metadata, also called intermediate data or relay data, is data that describes data, or structured data that provides information about a resource. Metadata mainly describes data attributes and supports functions such as indicating storage locations, historical data, resource lookup and file records. In terms of data structure, metadata is an electronic catalog: to serve as a catalog, it must describe and collect the content or characteristics of the data, and it thereby assists data retrieval.
Data warehouses in big data platforms are typically managed in layers, with different data layers storing sensitive data. A large number of new data tables are generated in each data layer during data acquisition, data development and data governance. These data tables contain sensitive data, and many methods for protecting sensitive data exist. Chinese patent application No. 201511026582.1 discloses a sensitive data protection system and method for data circulation and transactions on a big data platform; it protects sensitive data across the whole data circulation link and also provides an automatic sensitive data discovery method, based on an expert system and natural language processing, that can effectively verify the correctness and authenticity of desensitization results. However, data security protection in the prior art relies heavily on manual labor and is inefficient.
Disclosure of Invention
In view of the above, the invention provides a data security protection method for a big data platform. The method combines data lineage ("blood") relationships with the data tables to form a relationship graph and uses the graph to mark and protect sensitive data in batches, thereby greatly improving data identification efficiency and reducing errors and omissions.
The technical scheme of the invention is realized as follows: the invention provides a data security protection method for a big data platform, which comprises the following steps:
S1, acquiring all data tables in the big data platform, and storing all the data tables in a data warehouse by category, wherein the data warehouse comprises a plurality of data layers, and the data tables in one data layer have the same category;
S2, automatically capturing the data lineage ("blood") relationships among the data tables according to the ETL scheduling job dependencies of the data management platform, forming a relationship graph of the data tables and the data lineage relationships, and storing the relationship graph in a metadata database;
S3, adopting security protection measures according to different service requirements, forming a plurality of data security protection strategies from the service requirements and the corresponding security protection measures, and storing the data security protection strategies in a data security management platform;
S4, the user inputs the current data and the current service requirement, the security protection measure for the current data is queried according to the current service requirement, and security protection is executed on the current data based on the security protection measure.
On the basis of the above technical solution, preferably, in step S2, the process of forming the relationship diagram includes:
performing SQL statement parsing on the header of each data table to obtain a syntax tree of the header, determining semantic information of the header according to the syntax tree, and taking the semantic information as the table name information of the header;
performing SQL statement parsing on each field of each data table to obtain a syntax tree of each field, determining semantic information of each field according to the syntax tree, and taking the semantic information as the field information of the field;
linking each piece of field information with the corresponding table name information to obtain a table field, and taking the table field as a node of the relationship graph;
and storing the data lineage relationships between the data tables as the edges of the relationship graph, wherein each data lineage relationship is a directed relationship between table fields, and each directed relationship divides the corresponding table fields into an upstream table field and a downstream table field.
On the basis of the above technical solution, preferably, step S3 includes:
assigning data security levels to the data in the data tables according to the security management specification, wherein the data security levels are divided into a plurality of security levels;
dividing service requirements into data access and service operations;
determining the security protection measures to adopt according to the service requirement, the data layer where the data is located and the data security level of the data;
and constructing each data security protection strategy as a one-to-one correspondence of data, service requirement, data security level, data layer and security protection measure, and storing the data security protection strategies in a data security management platform.
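As an illustration only, the one-to-one correspondence described above could be held in a keyed store that rejects conflicting entries. This is a minimal sketch; the names `StrategyKey` and `StrategyStore`, the layer name `dwd` and the measure names are hypothetical and do not come from the source.

```python
from typing import Dict, NamedTuple, Optional


class StrategyKey(NamedTuple):
    """Hypothetical key: data, service requirement, security level, data layer."""
    table_field: str
    requirement: str      # "data_access" or "service_operation"
    security_level: int   # e.g. 1..3
    data_layer: str       # hypothetical layer name


class StrategyStore:
    """Stand-in for the data security management platform."""

    def __init__(self) -> None:
        self._measures: Dict[StrategyKey, str] = {}

    def add(self, key: StrategyKey, measure: str) -> None:
        # Enforce the one-to-one correspondence: one measure per key.
        existing = self._measures.get(key)
        if existing is not None and existing != measure:
            raise ValueError(f"{key} is already bound to measure {existing!r}")
        self._measures[key] = measure

    def measure_for(self, key: StrategyKey) -> Optional[str]:
        # Returns None when no strategy has been stored for the key.
        return self._measures.get(key)


store = StrategyStore()
key = StrategyKey("orders-customer_phone", "data_access", 3, "dwd")
store.add(key, "mask")
```

Rejecting a second, different measure for the same key is one way to keep the mapping one-to-one; the source does not say how conflicts are handled.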
Still more preferably, step S3 further includes:
and identifying the data security protection strategy and the corresponding data in the big data platform by adopting an identification method based on the relationship graph, linking the identified process and result with the corresponding data security protection strategy, and storing the linked result in the data security management platform.
Still further preferably, the identification method includes:
step one, an expert randomly selects data in the big data platform as target data and extracts the target table field and the data security level of the target data; the expert judges the sensitivity of the target data and, if the target data is sensitive, specifies a corresponding desensitization algorithm; the data security level, the sensitivity judgment result and the desensitization algorithm of the target data are marked to obtain the marking result of the target data;
step two, taking the node corresponding to the target table field as the starting point in the relationship graph, recursively traversing the relationship graph from the starting point along the directed relationships using a depth-first algorithm, searching for the downstream table fields related to the starting point, and storing the search results in a first list;
step three, taking the node corresponding to the target table field as the starting point in the relationship graph, recursively traversing the relationship graph from the starting point along the directed relationships using a depth-first algorithm, searching for the upstream table fields related to the starting point, and storing the search results in the first list;
step four, collating the table fields in the first list to obtain the associated data of the target data; the expert manually identifies the associated data, and the data security level, the sensitivity judgment result and the desensitization algorithm of the associated data are marked to obtain the marking result of the associated data;
step five, repeating steps one to four until all the data in the big data platform are marked, and storing the final marking results of the target data and the associated data in the data security management platform.
Still further preferably, the current data is access data, and the current service requirement is data access, and step S4 includes:
the user executes a data access operation and inputs the access data, the access data being sensitive data;
the desensitization algorithm for the access data is called from the data security management platform;
the desensitization algorithm is executed on the access data.
Still further preferably, the current data is service data, the current service requirement is service operation, and step S4 includes:
the user executes the business operation and inputs business data;
calling the data security level of the service data from the data security management platform;
querying a data layer of service data from a metadata database;
inquiring security protection measures of the service data from a data security management platform according to the service operation, the data layer of the service data and the data security level of the service data;
executing the security protection measures on the service data.
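The lookup sequence above can be sketched as three dictionary lookups. The dictionaries stand in for the data security management platform and the metadata database; every field name, layer name and measure below is a hypothetical placeholder, not something specified in the source.

```python
from typing import Dict, Optional, Tuple

# Stand-in for the data security management platform: table field -> security level.
SECURITY_LEVELS: Dict[str, int] = {"orders-customer_phone": 3}

# Stand-in for the metadata database: table field -> data layer.
DATA_LAYERS: Dict[str, str] = {"orders-customer_phone": "dwd"}

# Stored strategies: (service requirement, data layer, security level) -> measure.
MEASURES: Dict[Tuple[str, str, int], str] = {
    ("service_operation", "dwd", 3): "mask",
}


def measure_for_service_data(table_field: str) -> Optional[str]:
    """Resolve the measure for service data, or None if any lookup fails."""
    level = SECURITY_LEVELS.get(table_field)   # from the security platform
    layer = DATA_LAYERS.get(table_field)       # from the metadata database
    if level is None or layer is None:
        return None
    return MEASURES.get(("service_operation", layer, level))
```

The caller would then apply the returned measure to the service data; returning `None` for unknown fields is a choice made for this sketch.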
Still further preferably, the method further comprises:
when the big data platform detects that the relation diagram is updated, the data security protection strategy in the data security management platform is automatically identified, and the result is updated and stored in the data security management platform.
Still further preferably, the automatically identifying the data security protection policy in the data security management platform includes:
step one, searching for and traversing the updated data lineage relationships in the updated relationship graph, comparing them with the original relationship graph to obtain a plurality of target data tables that are directly or indirectly linked to the updated data lineage relationships, and storing the target data tables in a second list;
step two, traversing each target data table in the second list, obtaining all table fields of each target data table in the updated relationship graph by graph query, taking these table fields as a first table field set, and storing them in a third list;
step three, traversing the third list, determining the directed relationships between the first table fields according to the updated data lineage relationships, forming a plurality of updated paths from the updated data lineage relationships and the first table fields, searching for the most upstream table field in each updated path based on the directed relationships between the first table fields, taking the most upstream table fields as second table fields, and storing the second table fields in a fourth list;
step four, traversing the fourth list, and sequentially querying the data security level and the marking result of each second table field in the data security management platform;
step five, traversing the fourth list, recursively searching for all downstream table fields of each second table field in the updated relationship graph to obtain a third table field set for each second table field, and storing each second table field, its third table field set, its data security level and its marking result in a fifth list;
step six, traversing the fifth list, automatically assigning the data security level and the marking result of each second table field to its third table field set until all table fields in the fifth list carry a data security level and a marking result, and storing the traversed fifth list in the data security management platform.
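A simplified sketch of this propagation follows. It compresses the six steps into one function and makes several assumptions not stated in the source: lineage is given as `(upstream, downstream)` pairs, the "most upstream" fields are taken to be sources of changed edges that are not themselves targets of a changed edge, traversal uses an explicit stack rather than recursion, and when two roots reach the same field the root processed later (in sorted order) wins.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]    # (upstream table field, downstream table field)
Label = Tuple[int, str]   # (data security level, marking result)


def propagate_labels(edges: Iterable[Edge],
                     changed_edges: Iterable[Edge],
                     labels: Dict[str, Label]) -> Dict[str, Label]:
    """Copy each most-upstream field's label to all of its downstream fields."""
    changed = list(changed_edges)
    downstream = defaultdict(set)
    for up, down in edges:
        downstream[up].add(down)

    # Most-upstream fields of the updated paths (simplifying assumption).
    changed_targets = {down for _, down in changed}
    roots = sorted({up for up, _ in changed} - changed_targets)

    result = dict(labels)
    for root in roots:
        if root not in labels:
            continue  # nothing to inherit; left for manual identification
        stack, seen = [root], {root}
        while stack:  # iterative depth-first walk over downstream fields
            field = stack.pop()
            for child in downstream.get(field, ()):
                if child not in seen:
                    seen.add(child)
                    result[child] = labels[root]
                    stack.append(child)
    return result
```

For example, with edges `A-f1 -> D-f1 -> E-f1`, a changed edge `A-f1 -> D-f1` and a label on `A-f1`, both `D-f1` and `E-f1` inherit that label, while unrelated fields are left untouched.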
Still more preferably, the desensitization algorithm is a method for hiding sensitive information, and includes a mask type desensitization algorithm, a hash type desensitization algorithm, a truncated type desensitization algorithm, and a symmetric encryption type desensitization algorithm.
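Three of the four algorithm families named above can be sketched with the Python standard library. The function names, default parameters and the choice of SHA-256 are assumptions made for illustration; the symmetric-encryption variant is omitted because it would need a third-party cryptography library.

```python
import hashlib


def mask(value: str, keep_head: int = 3, keep_tail: int = 4, fill: str = "*") -> str:
    """Mask-type: keep the first and last characters, hide the middle."""
    if len(value) <= keep_head + keep_tail:
        return fill * len(value)  # too short to keep anything visible
    hidden = len(value) - keep_head - keep_tail
    return value[:keep_head] + fill * hidden + value[len(value) - keep_tail:]


def hash_value(value: str, salt: str = "") -> str:
    """Hash-type: replace the value with a salted SHA-256 hex digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def truncate(value: str, keep: int = 3) -> str:
    """Truncation-type: keep only the leading characters."""
    return value[:keep]


# Hypothetical registry so a stored algorithm name can be resolved at access time.
DESENSITIZERS = {"mask": mask, "hash": hash_value, "truncate": truncate}


def desensitize(value: str, algorithm: str) -> str:
    return DESENSITIZERS[algorithm](value)
```

With these defaults, `desensitize("13812345678", "mask")` keeps the first three and last four characters and replaces the four in between with `*`.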
Compared with the prior art, the method has the following beneficial effects:
(1) The data tables and the data lineage relationships are analyzed to form a relationship graph, so that the data is managed and used at a deeper level and data utilization is greatly increased;
(2) By exploiting the graph structure of the relationship graph, sensitive data that is identified manually can be marked and protected in batches, which improves both the efficiency of data identification and the security protection;
(3) An automatic update and verification mechanism is provided: after the relationship graph is updated, the data in the platform undergoes the relevant security protection checks based on the data lineage relationships, ensuring that the security of sensitive data is not compromised.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a directed relationship according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a manual identification method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the execution of data access according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the operation of a business according to another embodiment of the present invention;
FIG. 6 is a schematic diagram of an automatic identification method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an architecture according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will clearly and fully describe the technical aspects of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
As shown in fig. 1, the present invention provides a data security protection method for a big data platform, including:
S1, acquiring all data tables in the big data platform, and storing all the data tables in a data warehouse by category, wherein the data warehouse comprises a plurality of data layers, and the data tables in one data layer have the same category;
S2, automatically capturing the data lineage ("blood") relationships among the data tables according to the ETL scheduling job dependencies of the data management platform, forming a relationship graph of the data tables and the data lineage relationships, and storing the relationship graph in a metadata database;
S3, adopting security protection measures according to different service requirements, forming a plurality of data security protection strategies from the service requirements and the corresponding security protection measures, and storing the data security protection strategies in a data security management platform;
S4, the user inputs the current data and the current service requirement, the security protection measure for the current data is queried according to the current service requirement, and security protection is executed on the current data based on the security protection measure.
Specifically, in an embodiment of the present invention, step S1 includes:
The big data platform comprises a data management platform, a data warehouse and a data security management platform. All data in the big data platform is stored in the data warehouse in layers, and the layering of the data warehouse is determined by the specific data content. Data is stored in different data layers according to its category, that is, the data in each data layer is of the same category. In data governance and data development, new data tables are created in each data layer.
Specifically, in an embodiment of the present invention, step S2 includes:
The data management platform comprises a metadata management module. Metadata is data that describes information resources, data or other objects; it mainly describes data attributes and supports functions such as indicating storage locations, historical data, resource lookup and file records. In this embodiment, the metadata management module includes a metadata database, which holds the relationship graph (that is, the data lineage relationships) and the metadata. The metadata in this embodiment is diverse. When a data table is created, the corresponding metadata describes the table name, field information, field types, field lengths and so on of the data table, as well as its storage location, that is, which data layer of the data warehouse the table is in. After the relationship graph is formed, all of its nodes and edges, that is, the table fields and the relationships between them, are described by metadata. New data tables, new data, updated relationship graphs and the like produced in the subsequent data identification process are likewise described by corresponding metadata.
In the embodiment of the invention, the data lineage ("blood") relationships are automatically captured according to the ETL scheduling job dependencies of the data management platform; in a specific example, the automatic capture can be implemented by presetting a lineage hook function. A data lineage relationship is a directed relationship, formed during data processing on the big data platform, from a field of one data table to a field of another data table. After the data lineage relationships are obtained, the metadata management module stores them in the metadata database.
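One way to picture a preset hook function is a scheduler that calls registered callbacks after each ETL job and passes them the job's field mappings. The sketch below is purely illustrative: `Scheduler`, `LineageStore` and the shape of the mappings are hypothetical, and the source does not describe the hook interface.

```python
from typing import Callable, List, Tuple

Edge = Tuple[str, str]  # (upstream table field, downstream table field)


class LineageStore:
    """Stand-in for the metadata database; keeps each directed edge once."""

    def __init__(self) -> None:
        self.edges: List[Edge] = []

    def record(self, job_name: str, mappings: List[Edge]) -> None:
        # The hook itself: called by the scheduler after a job finishes.
        for edge in mappings:
            if edge not in self.edges:
                self.edges.append(edge)


class Scheduler:
    """Hypothetical ETL scheduler that invokes registered hooks per job."""

    def __init__(self) -> None:
        self._hooks: List[Callable[[str, List[Edge]], None]] = []

    def register_hook(self, hook: Callable[[str, List[Edge]], None]) -> None:
        self._hooks.append(hook)

    def run_job(self, job_name: str, mappings: List[Edge]) -> None:
        # A real scheduler would execute the ETL job here before notifying hooks.
        for hook in self._hooks:
            hook(job_name, mappings)


store = LineageStore()
scheduler = Scheduler()
scheduler.register_hook(store.record)
scheduler.run_job("load_table_d", [("A-field1", "D-field1")])
```

Because the hook deduplicates edges, re-running the same job leaves the stored lineage unchanged.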
In one embodiment of the present invention, the process of forming the relationship graph includes:
performing SQL statement parsing on the header of each data table to obtain a syntax tree of the header, determining semantic information of the header according to the syntax tree, and taking the semantic information as the table name information of the header;
performing SQL statement parsing on each field of each data table to obtain a syntax tree of each field, determining semantic information of each field according to the syntax tree, and taking the semantic information as the field information of the field;
linking each piece of field information with the corresponding table name information to obtain a table field, and taking the table field as a node of the relationship graph;
and storing the data lineage relationships between the data tables as the edges of the relationship graph, wherein each data lineage relationship is a directed relationship between table fields, and each directed relationship divides the corresponding table fields into an upstream table field and a downstream table field.
The syntax tree captures the grammar of an SQL statement by converting the character strings in the statement into a structured representation, so that a computer can more easily interpret their meaning. In a specific implementation, an SQL statement parser may be used to parse each SQL statement in the SQL statement set separately to obtain the syntax tree of each statement; for example, a guide or other parser may be used to parse the SQL statements.
After the syntax tree of each SQL statement is obtained, the syntax tree is traversed to obtain the field information and table name information involved in the statement. The field information and table name information extracted from each syntax tree are linked first; the table name information of a data table is linked once to the field information of each of its data columns. For example, if the table name information of a data table is Y1 and the data table has three pieces of field information w1, w2 and w3, the table name information and field information are linked into the table fields Y1-w1, Y1-w2 and Y1-w3 before subsequent operations are executed. The table fields are then used as the nodes of the relationship graph and are marked with directed relationships according to the captured data lineage relationships. Each directed relationship divides two associated table fields into an upstream table field and a downstream table field and is stored as an edge of the relationship graph. The relationship graph between the data is formed from the nodes and the directed relationships and is stored in the metadata database. Referring to FIG. 2, FIG. 2 is a simplified diagram showing the directed relationships between table fields of several data tables in one embodiment of the present invention, to aid understanding of the directed relationships described here. In FIG. 2, there is a directed relationship from field 1 of table A to field 1 of table D, where the upstream table field is field 1 of table A and the downstream table field is field 1 of table D.
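The node and edge construction described above can be sketched with plain dictionaries. SQL parsing is left out; the sketch assumes table names, field names and lineage pairs have already been extracted, and the `RelationGraph` class and its method names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def table_fields(table_name: str, field_names: Iterable[str]) -> List[str]:
    """Link table name information with each field, e.g. Y1 + w1 -> 'Y1-w1'."""
    return [f"{table_name}-{field}" for field in field_names]


class RelationGraph:
    """Table fields are nodes; lineage relationships are directed edges."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        self.downstream: Dict[str, Set[str]] = {}  # node -> downstream nodes
        self.upstream: Dict[str, Set[str]] = {}    # node -> upstream nodes

    def add_table(self, table_name: str, field_names: Iterable[str]) -> None:
        self.nodes.update(table_fields(table_name, field_names))

    def add_lineage(self, upstream_field: str, downstream_field: str) -> None:
        # Store the edge in both directions so either side can be traversed.
        self.nodes.update((upstream_field, downstream_field))
        self.downstream.setdefault(upstream_field, set()).add(downstream_field)
        self.upstream.setdefault(downstream_field, set()).add(upstream_field)


graph = RelationGraph()
graph.add_table("Y1", ["w1", "w2", "w3"])
graph.add_lineage("A-field1", "D-field1")  # the FIG. 2 example edge
```

Keeping both an upstream and a downstream index is a design choice for this sketch; it makes the later upstream and downstream searches symmetric.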
Specifically, in an embodiment of the present invention, step S3 includes:
assigning data security levels to the data in the data tables according to the security management specification, wherein the data security levels are divided into a plurality of security levels;
dividing service requirements into data access and service operations;
determining the security protection measures to adopt according to the service requirement, the data layer where the data is located and the data security level of the data;
and constructing each data security protection strategy as a one-to-one correspondence of data, service requirement, data security level, data layer and security protection measure, and storing the data security protection strategies in a data security management platform.
The data security level is an identification for classifying and grading the data according to the security management standards, and the number of the security levels is different according to different data contents. In this embodiment, the data security level includes 3 security levels.
The importance and the privacy information of the data are analyzed to determine which security level the data corresponds to, and different security levels correspond to different security protection measures. Therefore, in the embodiment of the invention, the three factors of service requirement, data layer where the data is located and data security level of the data are considered together to determine which security protection measure applies to the data.
Specifically, in an embodiment of the present invention, after a data security protection policy is set, an identification method is used to identify the data security protection policy and its corresponding data in a large data platform based on a relational graph, and the identified process and result are linked with the corresponding data security protection policy and then stored in a data security management platform.
Referring to fig. 3, the identification method includes:
step one, an expert randomly selects data in the big data platform as target data, and the target table field and the data security level of the target data are extracted; the expert judges the sensitivity of the target data and, if the target data is sensitive data, gives a corresponding desensitization algorithm; the data security level of the target data, the sensitivity judgment result of the target data and the desensitization algorithm are marked to obtain the marking result of the target data;
step two, taking the node corresponding to the target table field as a starting point in the relation graph, recursively traversing the relation graph from the starting point along the directed relations according to a depth-first algorithm, searching for the downstream table fields related to the starting point, and storing the search results in a first list;
step three, taking the node corresponding to the target table field as a starting point in the relation graph, recursively traversing the relation graph from the starting point along the directed relations according to a depth-first algorithm, searching for the upstream table fields related to the starting point, and storing the search results in the first list;
step four, the table fields in the first list are collated to obtain the associated data of the target data; the expert manually identifies the associated data, and the data security level of the associated data, the sensitivity judgment result of the associated data and the desensitization algorithm are marked to obtain the marking result of the associated data;
step five, steps one to four are repeated until all the data in the big data platform has been marked, and the final marking results of the target data and the associated data are stored in the data security management platform.
The identification process is a manual marking process. It makes full use of the data blood-edge relationships between the data, so that sensitive data in the big data platform is identified quickly and its security protection is determined, which greatly improves the efficiency of data identification and contributes substantially to the security protection of the data.
In step four of the identification method, the expert's manual identification of the associated data is a very fast process: because the target data and the associated data are connected by data blood-edge relationships, represented as unidirectional arrows, their security level, sensitivity judgment result and desensitization algorithm are the same. The expert can therefore mark the associated data quickly and in batches, which greatly reduces the time required for identification.
It should be understood that, in the manual identification method of the present invention, the expert randomly selects one piece of data as the target data at the beginning and then uses the data blood-edge relationships to accelerate the identification process, which reduces the expert's time; this is a preferred embodiment of the present invention. Alternatively, the expert may first classify the data in the big data platform, pick out the suspected sensitive data, and then randomly select one piece of the preselected data as the target data.
In this embodiment, the desensitization algorithm is a method for hiding sensitive information, and includes mask-type, hash-type, truncation-type and symmetric-encryption-type desensitization algorithms. Specifically: the mask type masks sensitive information such as names, identity numbers and phone numbers. The hash type desensitizes sensitive information using SM3/MD5/SHA-1. The truncation type truncates data such as dates and numerical values. The symmetric encryption type desensitizes data using SM4/DES/AES.
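As an illustration of three of these classes, the sketch below uses only the Python standard library. The function names, the default number of visible characters and the choice of month-level truncation are assumptions made for the example. SHA-1 is used for the hash type because it is one of the algorithms named above and is available in `hashlib`; SM3 and the symmetric-encryption class are omitted because they typically require additional libraries.

```python
import hashlib
from datetime import date


def mask(value: str, keep_head: int = 3, keep_tail: int = 4, fill: str = "*") -> str:
    """Mask type: hide the middle of a string, keeping a few leading and trailing characters."""
    if len(value) <= keep_head + keep_tail:
        return fill * len(value)  # too short to keep anything visible
    hidden = len(value) - keep_head - keep_tail
    return value[:keep_head] + fill * hidden + value[len(value) - keep_tail:]


def hash_digest(value: str) -> str:
    """Hash type: replace the value with its SHA-1 hex digest."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


def truncate_date(value: date) -> str:
    """Truncation type: keep only the year and month of a date."""
    return f"{value.year:04d}-{value.month:02d}"


print(mask("13812345678"))
print(hash_digest("13812345678"))
print(truncate_date(date(2023, 7, 7)))
```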
The identification method will be described with a specific example:
The big data platform contains a large amount of data, some of which is sensitive. In the manual identification mode adopted by the embodiment of the invention, an expert randomly selects one piece of data as the target data, and the target table field in the relation graph is determined from the target data.
Taking the target table field as the starting point, the following two operations are performed:
1. According to the position of the starting point in the relation graph, the starting point is treated as an upstream table field, and a first direction is determined from the directed relation; the first direction is the direction of the downstream table fields of the starting point. The relation graph is traversed with a depth-first algorithm using the first direction as the search direction, which yields a first search path. The nodes on the first search path are directly or indirectly associated with the starting point, that is, they have a blood-edge relationship with it. The nodes on the first search path are saved to the first list. There may be multiple first directions: for example, when the starting point, as an upstream table field, has three downstream table fields, there are three first directions; during the search, each first direction is selected in turn for a depth-first search, which finally yields three first search paths, all of which are saved to the first list. Because a depth-first algorithm follows one route until it can go no deeper, then returns to an earlier node and continues searching downward, a first search path is generally a tree-like path that has, besides its deepest trunk path, several branch paths of different depths.
2. According to the position of the starting point in the relation graph, the starting point is treated as a downstream table field, and a second direction is determined from the directed relation; the second direction is the direction of the upstream table fields of the starting point. The relation graph is traversed with a depth-first algorithm using the second direction as the search direction, which yields a second search path. The nodes on the second search path are directly or indirectly associated with the starting point, that is, they have a blood-edge relationship with it. The nodes on the second search path are saved to the first list. There may be multiple second directions: for example, when the starting point, as a downstream table field, has five upstream table fields, there are five second directions; during the search, each second direction is selected in turn for a depth-first search, which finally yields five second search paths, all of which are saved to the first list. A second search path is likewise a tree-like path that has, besides its deepest trunk path, several branch paths of different depths.
All table fields in the first list are then collated, that is, the trees formed by all the first search paths and second search paths are gathered, to obtain the associated data of the target data. The expert marks the associated data in batches, taking the data security level of the associated data, the sensitivity judgment result of the associated data and the desensitization algorithm as the marking result of the associated data. The marking result of the target data and the marking result of the associated data are then stored in the data security management platform.
Specifically, the manual identification method is performed a plurality of times: the expert subsequently selects further data at random as target data and uses the structure of the relation graph for batch marking, which greatly improves the efficiency of data security protection.
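The two traversals and the batch marking in this example can be sketched as follows. This is an illustrative sketch, not the platform's implementation: the relation graph is assumed to be a list of (upstream table field, downstream table field) pairs, and the field names and the label dictionary are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Index the directed relations in both directions."""
    downstream = defaultdict(list)
    upstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)
        upstream[target].append(source)
    return downstream, upstream


def depth_first(start, neighbours):
    """Recursively collect every node reachable from start, in depth-first order."""
    seen = {start}
    order = []

    def visit(node):
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                visit(nxt)

    visit(start)
    return order


def associated_fields(start, edges):
    """First list: downstream fields followed by upstream fields, without duplicates."""
    downstream, upstream = build_adjacency(edges)
    first_list = depth_first(start, downstream) + depth_first(start, upstream)
    return list(dict.fromkeys(first_list))


# Hypothetical relation graph: (upstream table field, downstream table field).
EDGES = [
    ("src.crm.phone", "ods.user.phone"),
    ("ods.user.phone", "dwd.user.phone"),
    ("dwd.user.phone", "ads.report.phone"),
    ("ods.user.phone", "dwd.contact.phone"),
]

# Batch marking: the associated fields receive the target's marking result.
target_label = {"level": 3, "sensitive": True, "algorithm": "mask"}
labels = {field: target_label for field in associated_fields("ods.user.phone", EDGES)}
print(sorted(labels))
```

The `seen` set keeps the traversal finite even if the graph contains a cycle; for very deep graphs an explicit stack would avoid Python's recursion limit.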
Specifically, referring to fig. 4, in an embodiment of the present invention, the current data is access data, the current service requirement is data access, the data access includes data query, data open API service, and data batch exchange service, and step S4 includes:
the user performs a data access operation and inputs access data, the access data being sensitive data;
calling the desensitization algorithm of the access data from the data security management platform;
executing the desensitization algorithm on the access data.
Specifically, the content of the access data input by the user is analyzed, and the desensitization algorithm determined for the access data is looked up in the data security management platform. For example, if the access data is level-3 sensitive data whose corresponding desensitization algorithm is a hash-type desensitization algorithm, the data access operation executes SM3/MD5/SHA-1 to desensitize the access data.
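A minimal sketch of this data access flow is given below. It assumes that the marking results are available as a dictionary keyed by table field; the field names, the `MARKING_RESULTS` store and the two toy algorithms are hypothetical stand-ins for the data security management platform.

```python
import hashlib

# Hypothetical registry of desensitization algorithms, keyed by algorithm name.
ALGORITHMS = {
    "hash": lambda v: hashlib.sha1(v.encode("utf-8")).hexdigest(),
    "mask": lambda v: v[:1] + "*" * (len(v) - 1),
}

# Hypothetical marking results, standing in for the data security management platform.
MARKING_RESULTS = {
    "user.id_number": {"level": 3, "sensitive": True, "algorithm": "hash"},
    "user.name": {"level": 2, "sensitive": True, "algorithm": "mask"},
    "user.city": {"level": 1, "sensitive": False, "algorithm": None},
}


def access(field: str, value: str) -> str:
    """Look up the marked algorithm for the field and apply it before returning the value."""
    marking = MARKING_RESULTS[field]
    if not marking["sensitive"]:
        return value  # non-sensitive data is returned unchanged
    return ALGORITHMS[marking["algorithm"]](value)


print(access("user.name", "Alice"))
print(access("user.city", "Hangzhou"))
```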
Specifically, referring to fig. 5, in an embodiment of the present invention, the current data is service data, the current service requirement is service operation, the service operation includes a data resource application, and step S4 includes:
the user performs a service operation and inputs service data;
calling the data security level of the service data from the data security management platform;
querying the data layer of the service data from the metadata database;
querying the security protection measures of the service data from the data security management platform according to the service operation, the data layer of the service data and the data security level of the service data;
executing the security protection measures on the service data.
It should be noted that, in the above two embodiments, data access and service operations are executed differently. When data access is executed, the emphasis is on querying the data, so the data content of the access data is determined first, the desensitization algorithm marked for that data is found in the data security management platform according to the data content, and the sensitive information in the access data is then desensitized and displayed to the user. When a service operation is executed, the emphasis is on using the data, so after the security level of the data is determined, the corresponding metadata is also used to address the metadata database and determine which data layer the service data is located in; after the security protection measures are executed to desensitize the sensitive information in the service data, a download service is provided to the user based on the addressing.
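The service operation flow differs from data access mainly in its extra lookups: security level first, then data layer, then the measure keyed by all three factors. The sketch below shows only that ordering; the three dictionaries are hypothetical stand-ins for the data security management platform and the metadata database, and `mask_tail` is an invented example measure.

```python
# Hypothetical stores; in the embodiment these live in the data security
# management platform (levels, measures) and the metadata database (layers).
SECURITY_LEVELS = {"orders.customer_phone": 3}
DATA_LAYERS = {"orders.customer_phone": "standard"}
MEASURES = {("service_operation", "standard", 3): "mask"}


def mask_tail(value: str, visible: int = 4) -> str:
    """Example measure: hide everything except the last few characters."""
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]


APPLY = {"mask": mask_tail, "none": lambda v: v}


def run_service_operation(field: str, value: str) -> str:
    level = SECURITY_LEVELS[field]                            # 1. data security level
    layer = DATA_LAYERS[field]                                # 2. data layer from metadata
    measure = MEASURES[("service_operation", layer, level)]   # 3. protection measure
    return APPLY[measure](value)                              # 4. execute the measure


print(run_service_operation("orders.customer_phone", "13812345678"))
```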
Specifically, in one embodiment of the present invention, when the metadata management module updates the relation graph, the platform automatically identifies the data security protection policies; referring to fig. 6, the automatic identification process includes:
step one, searching the updated relation graph for the updated data blood-edge relationships, comparing them with the original relation graph to obtain a plurality of target data tables that have a direct or indirect link with the updated data blood-edge relationships, and storing the target data tables in a second list;
step two, traversing each target data table in the second list, obtaining all table fields of each target data table in the updated relation graph by graph query, taking these table fields as a first table field set, and storing them in a third list;
step three, traversing the third list, determining the directed relations between the first table fields according to the updated data blood-edge relationships, forming a plurality of updated paths from the updated data blood-edge relationships and the first table fields, finding the most upstream table field in each updated path based on the directed relations between the first table fields, taking the most upstream table field as a second table field, and storing the second table field in a fourth list;
step four, traversing the fourth list, and querying in turn the data security level and the marking result of each second table field in the data security management platform;
step five, traversing the fourth list, recursively searching all downstream table fields of each second table field in the updated relation graph to obtain a third table field set for each second table field, and storing the second table field, the corresponding third table field set, the corresponding data security level and the corresponding marking result in a fifth list;
step six, traversing the fifth list, automatically assigning the data security level and the marking result of each second table field to the corresponding third table field set until all the table fields in the fifth list carry a data security level and a marking result, and storing the traversed fifth list in the data security management platform.
In the embodiment of the invention, the relation graph is updated when, for example, new data is added to the platform, the original data is corrected, or the blood-edge relationships between the original data are corrected.
The above automatic identification process will be described with a specific example:
The change analysis between the original relation graph and the updated relation graph can be performed with a change detection graph model, which identifies and extracts the changed data blood-edge relationships to obtain the updated data blood-edge relationships; an updated data blood-edge relationship may be a newly added one or a corrected one.
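The text does not specify the change detection graph model. As a simple stand-in, and only as an assumption for illustration, the changed relationships can be found by a set difference over the edges of the two graphs; a corrected relationship then appears as one removed edge plus one added edge. The "table.field" naming convention and the helper names below are hypothetical, and only tables touched directly by a changed edge are returned.

```python
def diff_edges(original, updated):
    """Return (added, removed) directed relations between two versions of the graph."""
    added = set(updated) - set(original)
    removed = set(original) - set(updated)
    return added, removed


def table_of(field: str) -> str:
    """Assumed naming convention: everything before the last dot is the table name."""
    return field.rsplit(".", 1)[0]


def directly_affected_tables(changed_edges):
    """Tables whose fields appear at either end of a changed relation."""
    return sorted({table_of(field) for edge in changed_edges for field in edge})


original = {("ods.user.phone", "dwd.user.phone")}
updated = {
    ("ods.user.phone", "dwd.user.phone"),
    ("dwd.user.phone", "ads.report.phone"),
}
added, removed = diff_edges(original, updated)
print(directly_affected_tables(added | removed))
```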
The target data tables affected by the updated data blood-edge relationships are then searched for; the field contents of these target data tables have a direct or indirect relation with the updated data blood-edge relationships, and the target data tables are stored in the second list.
Table field information is extracted from each target data table in the second list, the table fields are located in the updated relation graph by graph query, and the table fields are stored in the third list as the first table field set.
The updated data blood-edge relationships affect all the first table fields, so the directed relations among the first table fields can be determined from them. When the third list is traversed, only the first table fields that are related to the updated data blood-edge relationships in the relation graph are searched; that is, a plurality of updated paths are formed from the updated data blood-edge relationships and the first table fields, each updated path comprising the directed relations between consecutive first table fields. The most upstream table field in each updated path is taken as a second table field and stored in the fourth list.
The data security protection policy corresponding to each second table field is queried in the data security management platform, and its data security level and marking result are determined.
The fourth list is traversed, taking each second table field in turn as a starting point. According to the position of the starting point in the updated relation graph, the starting point is treated as an upstream table field and the travelling direction is determined from the directed relation, the travelling direction being the direction of the downstream table fields of the starting point. The updated relation graph is traversed with a depth-first algorithm using the travelling direction as the search direction, which yields a third search path, and the nodes on the third search path form the third table field set of the starting point. A third table field set is thus obtained for each second table field.
The second table field, its data security level and marking result, and its third table field set are stored in the fifth list.
The fifth list is traversed. Table fields that share the same data blood-edge source have the same properties, that is, a second table field and its third table field set have the same security level, sensitivity judgment result and desensitization algorithm, so the data security level and marking result of the second table field are automatically assigned in batches to the corresponding third table field set. This achieves fast batch identification and improves the efficiency of data identification.
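The last three steps of this example (finding the most upstream table fields, collecting their downstream table fields, and assigning the marking results in batches) can be sketched as follows. The sketch assumes the updated relationships and the updated relation graph are lists of (upstream, downstream) pairs and that a marking result is a (level, sensitive, algorithm) tuple; all names are hypothetical.

```python
from collections import defaultdict


def most_upstream(updated_edges):
    """Second table fields: sources of updated relations that are never a target."""
    sources = {source for source, _ in updated_edges}
    targets = {target for _, target in updated_edges}
    return sorted(sources - targets)


def downstream_of(start, graph_edges):
    """Third table field set: every field reachable downstream, depth-first."""
    children = defaultdict(list)
    for source, target in graph_edges:
        children[source].append(target)
    seen, stack, order = {start}, [start], []
    while stack:
        node = stack.pop()
        for nxt in children.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                stack.append(nxt)
    return order


def propagate(updated_edges, graph_edges, labels):
    """Copy each second table field's marking result to all of its downstream fields."""
    result = dict(labels)
    for root in most_upstream(updated_edges):
        if root not in labels:
            continue  # nothing to propagate for an unmarked field
        for field in downstream_of(root, graph_edges):
            result[field] = labels[root]
    return result


graph = [
    ("ods.user.phone", "dwd.user.phone"),
    ("dwd.user.phone", "ads.report.phone"),
    ("dwd.user.phone", "ads.export.phone"),
]
updated = [
    ("dwd.user.phone", "ads.report.phone"),
    ("dwd.user.phone", "ads.export.phone"),
]
labels = {"dwd.user.phone": (3, True, "mask")}
print(propagate(updated, graph, labels))
```

Skipping unmarked roots is a choice of this sketch; the text does not say how the platform treats a most upstream field that has no marking result yet.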
Referring to fig. 7, an architecture diagram of an embodiment of the present invention illustrates the big data platform:
In fig. 7, the big data platform includes a data security management platform, a data management platform and a data warehouse. The requirements served by the big data platform are data services, which fall into two types: data access, which includes data query, data open API service, data batch exchange service and the like; and service operations, which include data resource applications and the like.
In addition to the initially set data security protection policies, namely the data security levels and security protection measures, and the data security protection policies obtained through manual identification and automatic identification during data processing, the data security management platform also includes data security level management, desensitization algorithm management and data identification.
The data security level management module stores the data security levels marked while the big data platform executes its various processing operations, the data security level being an identifier used to classify and grade data according to the security management specification. In a specific embodiment, the desensitization algorithm is a method for hiding sensitive information, and the desensitization algorithm management module in the data security management platform stores the desensitization algorithms marked during the manual identification and automatic identification processes. Data identification refers to identifying and judging the data content and marking the data security level and desensitization algorithm.
In addition to the metadata management module, the data management platform includes data acquisition, data standard, master data, data quality and data asset modules. The data acquisition module acquires data from multiple sources and transmits it as original data to the original library of the data warehouse for storage; the data standard module standardizes the original data and transmits the standard data to the standard library of the data warehouse for storage; the master data module classifies the original data or standard data by subject and transmits the classified data to the subject database of the data warehouse for storage; the data quality module detects and records the quality of the data; the data asset module records and displays all data in the big data platform; the metadata management module constructs the relation graph and stores the data blood-edge relationships and metadata to form the metadata database. The metadata database provides technical support for the data identification process of the data security management platform, the automatic identification process of the platform, and the execution of users' requirements and access processes, thereby realizing the data security protection method.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. The data security protection method for the big data platform is characterized by comprising the following steps of:
S1, acquiring all data tables in the big data platform, and storing all the data tables in a data warehouse according to categories, wherein the data warehouse comprises a plurality of data layers, and the data tables in one data layer have the same category;
S2, automatically capturing data blood-edge relationships among the data tables according to the ETL scheduling job dependencies of the data management platform, forming a relation graph from the data tables and the data blood-edge relationships, and storing the relation graph in a metadata database;
S3, adopting security protection measures according to different service requirements, forming a plurality of data security protection policies from the service requirements and the corresponding security protection measures, and storing the data security protection policies in a data security management platform;
S4, the user inputs current data and a current service requirement, the security protection measures of the current data are queried according to the current service requirement, and security protection is executed on the current data based on the security protection measures.
2. The method of claim 1, wherein in step S2, the process of forming the relation graph includes:
performing SQL statement analysis on the header of a data table to obtain a syntax tree of the header, determining semantic information of the header according to the syntax tree, and taking the semantic information as the table name information of the header;
performing SQL statement analysis on each field of the data table to obtain a syntax tree of each field, determining semantic information of each field according to the syntax tree, and taking the semantic information as the field information of the field;
linking each piece of field information with the corresponding table name information to obtain a table field, and taking the table field as a node of the relation graph;
and storing the data blood-edge relationships between the data tables as the edges of the relation graph, wherein a data blood-edge relationship is a directed relation between table fields, and each directed relation divides the corresponding table fields into an upstream table field and a downstream table field.
3. The method of claim 2, wherein step S3 comprises:
formulating corresponding data security levels for the data in the data tables according to the security management specification, wherein the data security levels are divided into a plurality of security levels;
dividing service requirements into data access and service operations;
determining the security protection measures to adopt according to the service requirement, the data layer where the data is located and the data security level of the data;
and constructing data security protection policies in which the data, the service requirement, the data security level, the data layer and the security protection measures are in one-to-one correspondence, and storing the data security protection policies in the data security management platform.
4. The method of claim 3, wherein step S3 further comprises:
identifying the data security protection policies and the corresponding data in the big data platform by an identification method based on the relation graph, linking the identification process and results with the corresponding data security protection policies, and storing the linked results in the data security management platform.
5. The method of claim 4, wherein the identifying method comprises:
step one, an expert randomly selects data in the big data platform as target data, and the target table field and the data security level of the target data are extracted; the expert judges the sensitivity of the target data and, if the target data is sensitive data, gives a corresponding desensitization algorithm; the data security level of the target data, the sensitivity judgment result of the target data and the desensitization algorithm are marked to obtain the marking result of the target data;
step two, taking the node corresponding to the target table field as a starting point in the relation graph, recursively traversing the relation graph from the starting point along the directed relations according to a depth-first algorithm, searching for the downstream table fields related to the starting point, and storing the search results in a first list;
step three, taking the node corresponding to the target table field as a starting point in the relation graph, recursively traversing the relation graph from the starting point along the directed relations according to a depth-first algorithm, searching for the upstream table fields related to the starting point, and storing the search results in the first list;
step four, the table fields in the first list are collated to obtain the associated data of the target data; the expert manually identifies the associated data, and the data security level of the associated data, the sensitivity judgment result of the associated data and the desensitization algorithm are marked to obtain the marking result of the associated data;
step five, steps one to four are repeated until all the data in the big data platform has been marked, and the final marking results of the target data and the associated data are stored in the data security management platform.
6. The method of claim 5, wherein the current data is access data and the current business requirement is data access, and step S4 comprises:
the user performs a data access operation and inputs access data, the access data being sensitive data;
calling the desensitization algorithm of the access data from the data security management platform;
executing the desensitization algorithm on the access data.
7. The method of claim 5, wherein the current data is service data and the current service requirement is a service operation, and step S4 comprises:
the user performs a service operation and inputs service data;
calling the data security level of the service data from the data security management platform;
querying the data layer of the service data from the metadata database;
querying the security protection measures of the service data from the data security management platform according to the service operation, the data layer of the service data and the data security level of the service data;
executing the security protection measures on the service data.
8. The method of claim 5, wherein the method further comprises:
when the big data platform detects that the relation graph has been updated, the data security protection policies in the data security management platform are automatically identified, and the result is updated and stored in the data security management platform.
9. The method of claim 8, wherein automatically identifying the data security protection policy in the data security management platform comprises:
step one, searching the updated relation graph for the updated data blood-edge relationships, comparing them with the original relation graph to obtain a plurality of target data tables that have a direct or indirect link with the updated data blood-edge relationships, and storing the target data tables in a second list;
step two, traversing each target data table in the second list, obtaining all table fields of each target data table in the updated relation graph by graph query, taking these table fields as a first table field set, and storing them in a third list;
step three, traversing the third list, determining the directed relations between the first table fields according to the updated data blood-edge relationships, forming a plurality of updated paths from the updated data blood-edge relationships and the first table fields, finding the most upstream table field in each updated path based on the directed relations between the first table fields, taking the most upstream table field as a second table field, and storing the second table field in a fourth list;
step four, traversing the fourth list, and querying in turn the data security level and the marking result of each second table field in the data security management platform;
step five, traversing the fourth list, recursively searching all downstream table fields of each second table field in the updated relation graph to obtain a third table field set for each second table field, and storing the second table field, the corresponding third table field set, the corresponding data security level and the corresponding marking result in a fifth list;
step six, traversing the fifth list, automatically assigning the data security level and the marking result of each second table field to the corresponding third table field set until all the table fields in the fifth list carry a data security level and a marking result, and storing the traversed fifth list in the data security management platform.
10. The method of claim 5, wherein the desensitizing algorithm is a method of hiding sensitive information, including a mask-type desensitizing algorithm, a hash-type desensitizing algorithm, a truncated-type desensitizing algorithm, a symmetric encryption-type desensitizing algorithm.
CN202310831904.8A 2023-07-07 2023-07-07 Data security protection method for big data platform Active CN116541887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310831904.8A CN116541887B (en) 2023-07-07 2023-07-07 Data security protection method for big data platform

Publications (2)

Publication Number Publication Date
CN116541887A true CN116541887A (en) 2023-08-04
CN116541887B CN116541887B (en) 2023-09-15

Family

ID=87444025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310831904.8A Active CN116541887B (en) 2023-07-07 2023-07-07 Data security protection method for big data platform

Country Status (1)

Country Link
CN (1) CN116541887B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290355A (en) * 2023-08-29 2023-12-26 云启智慧科技有限公司 Metadata map construction system

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU725098B2 (en) * 1995-07-14 2000-10-05 Christopher Nathan Drake Computer software authentication, protection, and security system
US11392586B2 (en) * 2018-05-02 2022-07-19 Zte Corporation Data protection method and device and storage medium
CN108959564B (en) * 2018-07-04 2020-11-27 玖富金科控股集团有限责任公司 Data warehouse metadata management method, readable storage medium and computer device
CN109739893B (en) * 2018-12-28 2022-04-22 上海尚往网络科技有限公司 Metadata management method, equipment and computer readable medium
CN110704873B (en) * 2019-09-25 2021-05-25 全球能源互联网研究院有限公司 Method and system for preventing sensitive data from being leaked
CN111192015A (en) * 2019-12-30 2020-05-22 上海数熙科技有限公司 Integrated data management system based on core object
CN111694858A (en) * 2020-04-28 2020-09-22 平安科技(深圳)有限公司 Data lineage analysis method, device, equipment and computer readable storage medium
CN112256721B (en) * 2020-10-21 2021-08-17 平安科技(深圳)有限公司 SQL statement parsing method, system, computer device and storage medium
CN112527816B (en) * 2020-12-03 2023-06-02 平安科技(深圳)有限公司 Data lineage analysis method, system, computer equipment and storage medium
CN114691786A (en) * 2020-12-30 2022-07-01 中兴通讯股份有限公司 Method and device for determining data lineage, storage medium and electronic device
CN112860662B (en) * 2021-01-22 2023-10-17 平安科技(深圳)有限公司 Automatic production data lineage establishment method, device, computer equipment and storage medium
CN112925914B (en) * 2021-03-31 2024-03-15 携程旅游网络技术(上海)有限公司 Data security grading method, system, equipment and storage medium
CN113360488A (en) * 2021-06-01 2021-09-07 深圳市酷开网络科技股份有限公司 Data lineage management system and method based on data warehouse
CN113742368A (en) * 2021-09-16 2021-12-03 北京航空航天大学 Data lineage analysis method
CN114036130A (en) * 2021-11-09 2022-02-11 中国建设银行股份有限公司 Metadata analysis processing method and device
CN114238390A (en) * 2021-12-14 2022-03-25 东软集团股份有限公司 Data warehouse optimization method, device, equipment and storage medium
CN114218218A (en) * 2021-12-16 2022-03-22 新奥数能科技有限公司 Data processing method, device and equipment based on data warehouse and storage medium
CN114428822B (en) * 2022-01-27 2022-07-29 云启智慧科技有限公司 Data processing method and device, electronic equipment and storage medium
CN114528313A (en) * 2022-03-15 2022-05-24 北京金山云网络技术有限公司 Data processing method and device and electronic equipment
CN114925042A (en) * 2022-06-21 2022-08-19 正数网络技术有限公司 Method for constructing metadata relation based on graphic database
CN114896295B (en) * 2022-07-12 2022-10-04 云启智慧科技有限公司 Data desensitization method, desensitization device and desensitization system in big data scene
CN116226159A (en) * 2022-11-18 2023-06-06 中广核风电有限公司 Metadata lineage relationship analysis method, system, equipment and storage medium
CN116094696A (en) * 2022-12-28 2023-05-09 国科量子通信网络有限公司 Data security protection method, data security management platform, system and storage medium
CN115795400B (en) * 2023-02-07 2023-05-09 云启智慧科技有限公司 Application fusion system oriented to big data analysis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290355A (en) * 2023-08-29 2023-12-26 云启智慧科技有限公司 Metadata map construction system
CN117290355B (en) * 2023-08-29 2024-05-14 云启智慧科技有限公司 Metadata map construction system

Also Published As

Publication number Publication date
CN116541887B (en) 2023-09-15

Similar Documents

Publication Publication Date Title
CN109697162B (en) Software defect automatic detection method based on open source code library
US11704494B2 (en) Discovering a semantic meaning of data fields from profile data of the data fields
US10671750B2 (en) System and method for data classification centric sensitive data discovery
CN109063421B (en) Open source license compliance analysis and conflict detection method
US20040249796A1 (en) Query classification
CN106874764B (en) Method for automatically generating Android application callback sequences based on callback function modeling
JP2011509472A (en) Data clustering method, system, apparatus, and computer program for applying the method
CN109344230A (en) Code library file generation, code search, connection, optimization and transplantation method
CN116541887B (en) Data security protection method for big data platform
CN116209997A (en) System and method for classifying software vulnerabilities
CN106339313B (en) Automatic detection method for inconsistencies between Java API program exceptions and their document descriptions
US7159171B2 (en) Structured document management system, structured document management method, search device and search method
CN116975881A (en) LLVM (LLVM) -based vulnerability fine-granularity positioning method
CN116186759A (en) Sensitive data identification and desensitization method for privacy calculation
CN115033894A (en) Software component supply chain safety detection method and device based on knowledge graph
CN113626558B (en) Intelligent recommendation-based field standardization method and system
CN110990834A (en) Static detection method, system and medium for android malicious software
US20070156712A1 (en) Semantic grammar and engine framework
CN116821903A (en) Detection rule determination and malicious binary file detection method, device and medium
CN114490673B (en) Data information processing method and device, electronic equipment and storage medium
CN115438341A (en) Method and device for extracting code loop counter, storage medium and electronic equipment
CN114090076A (en) Method and device for judging compliance of application program
CN113051253A (en) Method and device for constructing tag database
CN115510446A (en) Vulnerability repair information retrieval method and electronic equipment
CN109408713A (en) Software requirement retrieval system based on field feedback

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A data security protection method for big data platforms

Granted publication date: 20230915

Pledgee: China Postal Savings Bank Co.,Ltd. Wuhan Branch

Pledgor: Yunqi Intelligent Technology Co.,Ltd.

Registration number: Y2024980029917