CN113672653A - Method and device for identifying private data in database - Google Patents

Info

Publication number
CN113672653A
CN113672653A CN202110909377.9A CN202110909377A CN113672653A CN 113672653 A CN113672653 A CN 113672653A CN 202110909377 A CN202110909377 A CN 202110909377A CN 113672653 A CN113672653 A CN 113672653A
Authority
CN
China
Prior art keywords
field
data
private data
identification
identification result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110909377.9A
Other languages
Chinese (zh)
Inventor
刘佳伟
鲍梦瑶
章鹏
张谦
殷雪梅
刘新源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Ant Blockchain Technology Shanghai Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Ant Blockchain Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd and Ant Blockchain Technology Shanghai Co Ltd
Priority to CN202110909377.9A
Publication of CN113672653A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioethics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification provide a method and a device for identifying private data in a database. The method comprises: forming a queue from the fields of the data tables included in the database; and performing a processing operation on the current first field, taking the fields in queue order, the processing operation comprising: when the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field; if the first identification result indicates that the first field belongs to private data, searching for a second field that has a preset relationship with the first field; and identifying whether the second field belongs to private data in a manner corresponding to the preset relationship to obtain a second identification result, and using the second identification result as the identification result label of the second field. This can improve the efficiency of identifying private data in a database.

Description

Method and device for identifying private data in database
Technical Field
One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for identifying private data in a database.
Background
Private data, i.e., confidential data, refers to information that its owner does not want others or unrelated persons to know. From the perspective of the privacy owner, private data can be divided into individual private data and shared private data. Individual private data includes information that can be used to locate or identify an individual (e.g., phone number, address, credit card number) and sensitive information (e.g., personal health, financial information, critical company documents). Shared private data mainly concerns family privacy, such as annual household income. Disclosure and abuse of private data is highly likely to cause various personal and public security problems. To protect private data, the fields that belong to private data need to be identified in the database, which typically comprises a large number of data tables with, on average, tens of fields per table.
In the prior art, private data in a database is identified essentially by checking, one by one, whether each field of each data table belongs to private data. With small data volumes the performance problem is not obvious, but with massive data (for example, hundreds of millions of tables and hundreds of millions of fields) it becomes pronounced: the table scan cannot be completed within the specified time, which degrades the customer experience.
Accordingly, improved solutions are desired that can improve the efficiency of identifying private data in a database.
Disclosure of Invention
One or more embodiments of the present specification describe a method and apparatus for identifying private data in a database, which can improve the efficiency of identifying private data in the database.
In a first aspect, there is provided a method of identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the method comprising:
forming a queue by using each field in each data table included in the database;
performing a processing operation on the current first field, taking the fields in queue order, the processing operation comprising:
when the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field;
if the first identification result indicates that the first field belongs to private data, searching for a second field having a preset relationship with the first field;
and identifying whether the second field belongs to private data in a manner corresponding to the preset relationship to obtain a second identification result, and using the second identification result as the identification result label of the second field.
In a possible implementation, the forming a queue from the fields of the data tables included in the database includes:
parsing the field names of the fields from a metadata table in the database, and ordering the field names to form the queue.
Further, the identifying whether the first field belongs to private data includes:
acquiring sample data corresponding to the field name of the first field from the database;
and inputting the sample data into a private data identification model to obtain the first identification result.
Further, the private data identification model comprises at least one of the following identification logics:
regular expressions, language models, check rules, multi-classification models.
In a possible embodiment, the searching for the second field having the preset relationship with the first field includes:
searching for the second field having the preset relationship with the first field in a pre-established data relationship graph; the data relationship graph comprises nodes corresponding to the fields, and the connecting edges between nodes correspond to the relationships between fields.
Further, the data relationship graph further includes nodes corresponding to the data tables, and the connecting edges between nodes further correspond to the relationships between data tables and fields and the relationships between data tables.
Further, the data relationship graph is obtained by parsing Structured Query Language (SQL) statements corresponding to the database.
Further, the searching for the second field having the preset relationship with the first field in the pre-established data relationship graph includes:
starting from the node corresponding to the first field, searching for nodes whose connecting edges correspond to the preset relationship, until the relationship of a connecting edge is not the preset relationship, and taking the fields corresponding to the nodes found as the second fields.
In one possible embodiment, the preset relationship is copy;
the identifying whether the second field belongs to private data in a manner corresponding to the preset relationship comprises:
directly determining, as the second identification result, that the second field belongs to private data.
In a possible embodiment, the preset relationship is truncation;
the identifying whether the first field belongs to private data comprises:
identifying whether the first field belongs to private data with each identification model in a first identification model set to obtain respective first identification sub-results, and determining the first identification result from the first identification sub-results taken together;
the identifying whether the second field belongs to private data in a manner corresponding to the preset relationship comprises:
identifying whether the second field belongs to private data with at least one identification model in a second identification model set to obtain the second identification result; the second identification model set is a subset of the first identification model set.
In a possible embodiment, the first identification result and/or the second identification result includes:
whether the field belongs to private data, and the type of private data when it does.
In a second aspect, there is provided an apparatus for identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus comprising:
a queue forming unit, configured to form a queue for each field in each data table included in the database;
a first identification unit, configured to perform a processing operation on the current first field, taking the fields in the order of the queue obtained by the queue forming unit, where the processing operation includes:
when the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field;
a searching unit, configured to search for a second field having a preset relationship with the first field if the first identification result obtained by the first identification unit indicates that the first field belongs to private data;
and a second identification unit, configured to identify whether the second field found by the searching unit belongs to private data in a manner corresponding to the preset relationship to obtain a second identification result, and to use the second identification result as the identification result label of the second field.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
According to the method and device provided by the embodiments of this specification, a queue is first formed from the fields of the data tables included in the database. Then, taking the fields in queue order, a processing operation is performed on the current first field: when the first field has no identification result label, whether the first field belongs to private data is identified to obtain a first identification result, which is used as the identification result label of the first field. If the first identification result indicates that the first field belongs to private data, a second field having a preset relationship with the first field is searched for. Finally, whether the second field belongs to private data is identified in a manner corresponding to the preset relationship to obtain a second identification result, which is used as the identification result label of the second field. Thus, by exploiting the relationships between fields, as soon as a first field belonging to private data is encountered while the fields are being identified in sequence, the second fields having the preset relationship with it are looked up and identified in a manner corresponding to that relationship, which reduces the amount of computation and can improve the efficiency of identifying private data in the database.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram illustrating an implementation scenario of an embodiment disclosed herein;
FIG. 2 illustrates a flow diagram of a method of identifying private data in a database, according to one embodiment;
FIG. 3 illustrates a system architecture diagram for identifying private data in a database, according to one embodiment;
FIG. 4 illustrates a schematic diagram of a fast private data scanning method based on data lineage relationships, according to one embodiment;
FIG. 5 shows a schematic block diagram of an apparatus for identifying private data in a database, according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves identifying private data in a database that comprises a plurality of data tables, each comprising a plurality of fields, where a field corresponds to a column. Referring to fig. 1, the database includes n data tables, denoted table 1, table 2, …, table n, where table 1 includes i columns, table 2 includes j columns, …, and table n includes k columns. The database further includes a metadata table that records information about each data table, for example the field name and storage location corresponding to each column.
Generally, when identifying private data in a database, the metadata table is read first to obtain the information of each data table. Then, based on this information, the data of one table at a time is fetched through the underlying database interface, and private data identification is performed on each column of that table to determine whether the column belongs to private data. Because there are usually dozens of private data types, and some private data identification may rely on a deep learning model with complex computation, it is difficult to scan the whole database within an acceptable time when the data volume is very large.
According to the embodiment of the specification, the private data is identified by using the relation between the fields, so that the calculation amount can be effectively reduced, and the efficiency of identifying the private data in the database can be greatly improved.
The relationship between fields may be a data lineage relationship (literally, a "blood relationship"), which describes the upstream-downstream relationship between data and generally includes copying, truncation, concatenation, conversion, and the like; it indicates that the data of one field is obtained by processing the data of another field.
Fig. 2 shows a flow diagram of a method of identifying private data in a database according to one embodiment; the database comprises a plurality of data tables, each data table comprises a plurality of fields, and the method may be based on the implementation scenario shown in fig. 1. As shown in fig. 2, the method in this embodiment includes the following steps: step 21, forming a queue from the fields of the data tables included in the database; step 22, performing a processing operation on the current first field, taking the fields in queue order, where the processing operation includes: when the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field; step 23, if the first identification result indicates that the first field belongs to private data, searching for a second field having a preset relationship with the first field; and step 24, identifying whether the second field belongs to private data in a manner corresponding to the preset relationship to obtain a second identification result, and using the second identification result as the identification result label of the second field. The specific execution of each step is described below.
First, in step 21, a queue is formed from the fields of the data tables included in the database. The fields in the queue have a definite order: the fields of the data tables may be placed in arbitrary order, or the fields of the same data table may be placed in adjacent positions.
In one example, the forming a queue from the fields of the data tables included in the database includes:
parsing the field names of the fields from a metadata table in the database, and ordering the field names to form the queue.
It is understood that a field name can uniquely identify a field; for example, a Globally Unique Identifier (GUID) is used as the field name, specifically in the form of project_name.
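Step 21 can be sketched as follows. This is a minimal illustration, not the patented implementation: the metadata rows, the `table.field` identifier format, and the function name `build_field_queue` are assumptions made for the example, and a real system would read the metadata table through the database's own interface.

```python
from collections import deque

# Hypothetical metadata rows (table name, field name); not taken from the source.
metadata_rows = [
    ("table2", "mobile_no"),
    ("table1", "identity_no"),
    ("table2", "identity_no"),
    ("table1", "mobile_no"),
]


def build_field_queue(rows):
    """Build unique field identifiers and return them as a sorted queue."""
    # Assumed identifier format: "<table>.<field>".
    names = {f"{table}.{field}" for table, field in rows}
    # Sorting keeps the fields of the same table adjacent in the queue.
    return deque(sorted(names))


field_queue = build_field_queue(metadata_rows)
print(list(field_queue))
```

Sorting by identifier is only one possible ordering; the text allows any definite order.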
Then, in step 22, a processing operation is performed on the current first field, taking the fields in queue order, where the processing operation includes: when the first field has no identification result label, identifying whether the first field belongs to private data to obtain a first identification result, and using the first identification result as the identification result label of the first field. It will be appreciated that each field needs to be identified only once, and a field that already has an identification result label does not need to be identified again.
In one example, the identifying whether the first field belongs to private data includes:
acquiring sample data corresponding to the field name of the first field from the database;
and inputting the sample data into a private data identification model to obtain the first identification result.
It is understood that the sample data is the column of data corresponding to the field name of the first field, or part of that column. The columns of a data table are shown in table one.
Table one: data table
         Column 1    Column 2    Column 3
Row 1    Xiaohong    Female      15 years old
Row 2    Xiaoming    Male        16 years old
Row 3    Xiaogang    Male        17 years old
Row 4    Xiaolan     Female      14 years old
Table one is a data table with 4 rows and 3 columns. If the field name of the first field corresponds to column 1, all the data in column 1 may be used as sample data, in which case the sample data includes Xiaohong, Xiaoming, Xiaogang, and Xiaolan; alternatively, part of column 1 may be used as sample data, for example only Xiaohong.
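The choice between a full column and a partial sample can be sketched as below. The rows mirror table one with romanized names; the function `sample_column`, its `max_samples` parameter, and the use of random sampling are assumptions for illustration, since the text does not say how a partial sample is chosen.

```python
import random

# Rows of table one (names romanized for this example).
rows = [
    ("Xiaohong", "Female", "15 years old"),
    ("Xiaoming", "Male", "16 years old"),
    ("Xiaogang", "Male", "17 years old"),
    ("Xiaolan", "Female", "14 years old"),
]


def sample_column(rows, column_index, max_samples=None, seed=0):
    """Return the whole column, or a random subset of at most max_samples values."""
    values = [row[column_index] for row in rows]
    if max_samples is None or max_samples >= len(values):
        return values
    # A fixed seed keeps the sample reproducible across scans (an assumption).
    return random.Random(seed).sample(values, max_samples)


full_column = sample_column(rows, 0)
partial_column = sample_column(rows, 0, max_samples=2)
```

Sampling fewer values lowers the cost of each identification, at the risk of missing rows that would have revealed the field's type.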
Further, the private data identification model comprises at least one of the following identification logics:
regular expressions, language models, check rules, multi-classification models.
It is understood that a regular expression uses a single pattern string, written according to an agreed syntax, to describe and match a set of strings that conform to a given rule.
A language model is a mathematical model that uses a probability distribution to describe the probability of a word sequence or character string.
There may be multiple check rules; for example, private data may be identified by checking whether all sample data consist of digits and whether the number of digits equals a predetermined length.
The multi-classification model can be obtained through machine learning, for example as a neural network model or deep learning model.
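Two of the listed identification logics, a regular expression and a check rule, can be combined as in the following sketch. The mobile-number pattern (11 digits starting with 1), the 0.8 match threshold, the type name `mobile_no`, and all function names are assumptions for illustration and are not specified in the source.

```python
import re

# Assumed pattern for a mobile number: 11 digits starting with 1.
MOBILE_RE = re.compile(r"1\d{10}")


def regex_rule(samples, threshold=0.8):
    """Regular expression logic: enough samples must fully match the pattern."""
    if not samples:
        return False
    hits = sum(1 for s in samples if MOBILE_RE.fullmatch(s))
    return hits / len(samples) >= threshold


def digit_length_rule(samples, length=11):
    """Check rule: every sample is all digits and has the predetermined length."""
    return bool(samples) and all(s.isdigit() and len(s) == length for s in samples)


def identify(samples):
    """Return (is_private, private_type) for one field's sample data."""
    # The cheap check rule runs first; the regex only runs if it passes.
    if digit_length_rule(samples) and regex_rule(samples):
        return True, "mobile_no"
    return False, None


print(identify(["13800000000", "13912345678"]))
print(identify(["Xiaohong", "Xiaoming"]))
```

The returned pair matches the form of identification result described in the text: whether the field is private and, if so, which type.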
Then, in step 23, if the first identification result indicates that the first field belongs to private data, a second field having a preset relationship with the first field is searched for. It will be appreciated that if the first field belongs to private data, a second field having the preset relationship with it either necessarily belongs to private data or belongs to private data with high probability.
In one example, the finding a second field having a preset relationship with the first field comprises:
searching for the second field having the preset relationship with the first field in a pre-established data relationship graph; the data relationship graph comprises nodes corresponding to the fields, and the connecting edges between nodes correspond to the relationships between fields.
In the embodiments of this specification, the data relationship graph may be stored in a graph database. A graph database is a non-relational database that uses graph theory to store relationship information between entities. Compared with a relational database, a graph database supports convenient and fast relationship queries and can be used for various computations and inference.
Further, the data relationship graph further includes nodes corresponding to the data tables, and the connection edges between the nodes further correspond to the relationship between the data tables and the fields, and the relationship between the data tables and the data tables.
Further, the data relationship map is obtained by analyzing a Structured Query Language (SQL) statement corresponding to the database.
SQL parsing is the cornerstone of constructing data lineage relationships. It mainly parses the lineage relationships described in the SQL between fields and tables, between fields, and between tables. Generally, the relationship between fields may include copy (copy), truncation (substr), concatenation (concat), and the like; the relationship between tables is dependency (depended); and the relationship between a field and a table is membership (belong). The parsed lineage relationships can be represented as triples (source_node, target_node, relation), where source_node is the identifier of the source node, target_node is the identifier of the target node, and relation is the relationship between the nodes. For example, for the following SQL:
CREATE TABLE Table1 AS
SELECT identity_no, mobile_no
FROM Table2;
the lineage relationships obtained by SQL parsing include:
(Table2.identity_no, Table1.identity_no, copy), indicating that the identity-number field in Table1 is a copy of the identity-number field in Table2;
(Table2.mobile_no, Table1.mobile_no, copy), indicating that the phone-number field in Table1 is a copy of the phone-number field in Table2;
(Table2, Table1, depended), indicating that Table1 depends on Table2;
(Table1.identity_no, Table1, belong), indicating that the identity-number field in Table1 belongs to Table1;
(Table1.mobile_no, Table1, belong), indicating that the phone-number field in Table1 belongs to Table1;
(Table2.mobile_no, Table2, belong), indicating that the phone-number field in Table2 belongs to Table2;
(Table2.identity_no, Table2, belong), indicating that the identity-number field in Table2 belongs to Table2.
Mature third-party libraries are currently available for SQL parsing, so the principle is not described further here.
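The triples from this example can be loaded into a simple in-memory graph as sketched below. The triples are written out by hand rather than produced by an SQL parser, and the adjacency-list layout and the name `build_graph` are assumptions for illustration; the text itself suggests a graph database for storage.

```python
from collections import defaultdict

# Lineage triples (source_node, target_node, relation) from the example SQL.
triples = [
    ("Table2.identity_no", "Table1.identity_no", "copy"),
    ("Table2.mobile_no", "Table1.mobile_no", "copy"),
    ("Table2", "Table1", "depended"),
    ("Table1.identity_no", "Table1", "belong"),
    ("Table1.mobile_no", "Table1", "belong"),
    ("Table2.mobile_no", "Table2", "belong"),
    ("Table2.identity_no", "Table2", "belong"),
]


def build_graph(triples):
    """Map each source node to its outgoing (target_node, relation) edges."""
    graph = defaultdict(list)
    for source, target, relation in triples:
        graph[source].append((target, relation))
    return graph


graph = build_graph(triples)
print(graph["Table2.identity_no"])
```

Keeping the relation on each edge lets a later lookup follow only the edges that match the preset relationship.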
Further, the searching for the second field having the preset relationship with the first field from the pre-established data relationship map includes:
starting from the node corresponding to the first field, searching for nodes whose connecting edges correspond to the preset relationship, until the relationship of a connecting edge is not the preset relationship, and taking the fields corresponding to the nodes found as the second fields.
In the embodiments of this specification, the second field may be found by depth-first search. Depth-first search starts from a vertex v in the graph and visits v; it then performs depth-first traversal from each unvisited adjacent vertex of v in turn, until every vertex that has a path to v has been visited; if unvisited vertices remain in the graph, the traversal restarts from one of them, until all vertices have been visited. Here, the vertex v corresponds to the first field.
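The lookup of second fields can be sketched as a stack-based depth-first traversal that follows only edges carrying the preset relationship. The graph contents, the field names, and the decision to follow edges only from source to target are assumptions for illustration; the source does not state the traversal direction.

```python
# Hypothetical lineage graph: node -> list of (target_node, relation).
graph = {
    "t1.id": [("t2.id", "copy"), ("t1", "belong")],
    "t2.id": [("t3.id", "copy"), ("t3.id_prefix", "substr")],
    "t3.id": [],
}


def find_related_fields(graph, first_field, relation):
    """Depth-first search from first_field along edges labelled `relation`."""
    found = []
    visited = {first_field}
    stack = [first_field]
    while stack:
        node = stack.pop()
        for target, edge_relation in graph.get(node, []):
            # Stop expanding a branch as soon as the edge is not the preset relationship.
            if edge_relation == relation and target not in visited:
                visited.add(target)
                found.append(target)
                stack.append(target)
    return found


print(find_related_fields(graph, "t1.id", "copy"))
```

Because the search stops at the first edge of a different relationship, a field reached through copy followed by truncation is not returned as a copy of the first field.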
Finally, in step 24, whether the second field belongs to private data is identified in a manner corresponding to the preset relationship to obtain a second identification result, which is used as the identification result label of the second field. The second field is thus identified differently from the first field: because the preset relationship is taken into account, the amount of computation can be effectively reduced and the efficiency of identifying private data greatly improved.
In one example, the preset relationship is copy;
the identifying whether the second field belongs to private data in a manner corresponding to the preset relationship comprises:
directly determining, as the second identification result, that the second field belongs to private data.
It can be understood that if the second field is a copy of the first field, then once the first field has been identified as private data, the second field necessarily belongs to private data as well. No separate identification needs to be run on the second field, which improves identification efficiency.
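For the copy relationship, labelling the second fields amounts to reusing the first field's result, as in this sketch. The label format `(is_private, private_type)`, the function name, and the choice to leave already-labelled fields untouched are assumptions for illustration.

```python
def propagate_copy_labels(labels, first_field, copied_fields):
    """Give every copy of a private first field the same identification result label."""
    result = labels[first_field]
    for field in copied_fields:
        # Assumption: a field that already has a label keeps it.
        labels.setdefault(field, result)
    return labels


labels = {"Table2.mobile_no": (True, "mobile_no")}
propagate_copy_labels(labels, "Table2.mobile_no", ["Table1.mobile_no"])
print(labels)
```

No sample data is fetched and no model is run for the copied field, which is where the saving comes from.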
In one example, the preset relationship is truncation;
the identifying whether the first field belongs to private data comprises:
respectively identifying whether the first field belongs to the private data or not by utilizing each identification model in a first identification model set to obtain each first identification sub-result, and comprehensively determining the first identification result according to each first identification sub-result;
the identifying whether the second field belongs to private data by using the mode corresponding to the preset relationship comprises:
identifying whether the second field belongs to private data or not by using at least one identification model in a second identification model set to obtain a second identification result; the second set of recognition models is a subset of the first set of recognition models.
It can be understood that, if the first field and the second field are in a truncation relationship, that is, the second field is a substring of the first field, then once the first field has been identified as private data the second field has a relatively high probability of belonging to private data. The scope of private data identification for the second field can therefore be narrowed, so that the amount of calculation is smaller than that required to identify the first field, which improves identification efficiency.
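The two examples above (replication and truncation) can be illustrated with a short sketch. It is not the implementation of the specification: the two toy models, the type names `numeric_id` and `email`, and in particular the rule that the second model set consists of the model that matched the first field are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Optional

Model = Callable[[List[str]], bool]


def is_numeric_id(samples: List[str]) -> bool:
    return bool(samples) and all(re.fullmatch(r"\d{4,}", s) for s in samples)


def is_email(samples: List[str]) -> bool:
    return bool(samples) and all(re.fullmatch(r"[^@\s]+@[^@\s]+", s) for s in samples)


# Hypothetical first identification model set: type name -> model.
FIRST_MODEL_SET: Dict[str, Model] = {"numeric_id": is_numeric_id, "email": is_email}


def identify_first(samples: List[str]) -> Optional[str]:
    """Run every model in the first set, then combine the sub-results."""
    sub_results = {name: model(samples) for name, model in FIRST_MODEL_SET.items()}
    hits = [name for name, hit in sub_results.items() if hit]
    return hits[0] if hits else None


def identify_second(relation: str, first_type: str, samples: List[str]) -> Optional[str]:
    """Identify a related field in the manner that corresponds to `relation`."""
    if relation == "copy":
        # Replication: inherit the label directly; no model is run.
        return first_type
    if relation == "truncate":
        # Truncation: run only a subset of the first model set. Using the model
        # that matched the first field is an assumption of this sketch.
        second_model_set = {first_type: FIRST_MODEL_SET[first_type]}
        hits = [name for name, model in second_model_set.items() if model(samples)]
        return hits[0] if hits else None
    raise ValueError(f"unsupported relation: {relation}")


first_type = identify_first(["13800138000", "13900139000"])
print(first_type)                                          # numeric_id
print(identify_second("copy", first_type, []))             # numeric_id
print(identify_second("truncate", first_type, ["8000"]))   # numeric_id
```

In the replication branch no sample data is needed at all, while the truncation branch runs one model instead of two, which is where the reduction in calculation comes from.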
In one example, the first recognition result and/or the second recognition result includes:
whether the field belongs to private data, and the type of private data when it belongs to private data.
In the embodiments of the present specification, private data comes in a large number of types, which is one reason why private data identification is generally inefficient. Common types of private data are shown in table two.
Table two: common private data types
[Table two is provided as images in the original publication (Figures BDA0003202925020000081 and BDA0003202925020000091) and is not reproduced here.]
As table two shows, private data comes in many types, so its identification is complex and generally requires a large amount of calculation.
According to the method provided by the embodiment of the specification, firstly, each field in each data table included in the database forms a queue; then, according to the sequence of each field in the queue, processing operation is sequentially performed on the current first field, and the processing operation includes: under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field; then if the first identification result indicates that the first field belongs to private data, searching a second field having a preset relationship with the first field; and finally, identifying whether the second field belongs to the private data or not by using a mode corresponding to the preset relation to obtain a second identification result, and taking the second identification result as an identification result label of the second field. As can be seen from the above, in the embodiments of the present specification, by using the relationship between fields, in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately queried, and whether the second field belongs to private data is identified by using a manner corresponding to the preset relationship.
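The overall flow summarized above can be sketched as a single loop over the queue. This is a hedged illustration: the function `scan`, its three callback parameters, and the sample field names are hypothetical, and a label of `None` is used here to mean "not private data".

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Label = Optional[str]  # private data type name, or None for "not private data"


def scan(
    fields: Iterable[str],
    identify: Callable[[str], Label],
    find_related: Callable[[str], List[Tuple[str, str]]],
    identify_related: Callable[[str, str, str], Label],
) -> Dict[str, Label]:
    """Process fields in queue order, labelling related fields as soon as a private field is found."""
    labels: Dict[str, Label] = {}
    queue = deque(fields)
    while queue:
        first = queue.popleft()
        if first in labels:
            continue  # already has an identification result label
        first_result = identify(first)
        labels[first] = first_result
        if first_result is None:
            continue
        for second, relation in find_related(first):
            labels[second] = identify_related(second, relation, first_result)
    return labels


# Toy setup: record which fields go through the (expensive) full identification.
calls: List[str] = []


def identify(field: str) -> Label:
    calls.append(field)
    return "phone" if field.endswith("phone") else None


lineage = {"t1.phone": [("t2.phone", "copy"), ("t3.phone", "copy")]}

labels = scan(
    ["t1.id", "t1.phone", "t2.phone", "t3.phone"],
    identify,
    lambda field: lineage.get(field, []),
    # Only the copy relationship is handled in this toy callback.
    lambda second, relation, first_type: first_type if relation == "copy" else None,
)
print(labels)
print(calls)  # ['t1.id', 't1.phone']
```

Four fields receive labels, but the full identification runs on only two of them, because the two copies are labelled when `t1.phone` is processed and are skipped when they are later taken from the queue.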
FIG. 3 illustrates a system architecture diagram for identifying private data in a database, according to one embodiment. Referring to fig. 3, the metadata parsing module 31 first reads and parses the relevant data from a metadata table in the database and places the parsing results in a queue. The scanner 32, assuming that one thread is used for processing, reads one element from the queue at a time, retrieves the corresponding sample data from the database based on that element, and then identifies the data using a built-in private data identification module. If the identification result is that the data is not sensitive data, the scanner continues to read and consume the next element from the queue. If the identification result is that the data is sensitive data, the scanner uses a depth-first search algorithm to find, in the data blood relationship (lineage) map, the upstream and downstream elements that have a copy relationship, and then labels the relevant elements in the database as sensitive data.
Database decoupling unit: client data is stored in different types of databases, such as MySQL and Oracle; this unit shields the differences between databases behind a uniform interface.
Built-in private data identification module: stores the various logics for identifying sensitive data types, including regular expressions, language models, check rules, multi-classification models and the like.
Scanning logic unit: executes the scanning logic, which is a hybrid of linear scanning and tree scanning; scanning is sequential by default and switches to tree scanning when sensitive data is identified.
Data sampling logic: samples data from the database according to the metadata, providing the data basis for subsequent identification by the scanner.
It is understood that the sensitive data is private data.
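As an illustration of the kind of logic the built-in private data identification module might hold, the sketch below combines regular expressions with one check rule. Everything specific in it is an assumption: the two types (`email`, `card_number`), the patterns, the use of the Luhn checksum as the check rule, and the 0.8 agreement threshold over the sample values are chosen for illustration and are not taken from the specification.

```python
import re
from typing import Dict, List, Optional

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\d{12,19}")


def passes_luhn(number: str) -> bool:
    """Check rule: Luhn checksum over an ASCII digit string."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def classify_value(value: str) -> Optional[str]:
    """Return a hypothetical sensitive data type for one sampled value."""
    if EMAIL_RE.fullmatch(value):
        return "email"
    if CARD_RE.fullmatch(value) and passes_luhn(value):
        return "card_number"
    return None


def classify_field(samples: List[str], threshold: float = 0.8) -> Optional[str]:
    """Label a field with a type when enough of its sampled values agree."""
    if not samples:
        return None
    counts: Dict[str, int] = {}
    for value in samples:
        kind = classify_value(value)
        if kind is not None:
            counts[kind] = counts.get(kind, 0) + 1
    if not counts:
        return None
    best = max(counts, key=lambda kind: counts[kind])
    return best if counts[best] / len(samples) >= threshold else None


print(classify_field(["a@example.com", "b@example.org"]))  # email
print(classify_field(["4111111111111111"]))                # card_number
print(classify_field(["hello", "world"]))                  # None
```

Language models and multi-classification models, which the module also lists, would plug in as additional branches of `classify_value`; they are omitted here to keep the sketch self-contained.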
FIG. 4 shows a schematic diagram of a fast private data scanning method based on blood relationships (data lineage) according to one embodiment. Referring to fig. 4, in the embodiments of the present specification, sequential scanning is used as the cold-start entry: each column of each table is scanned in turn from top to bottom according to the metadata table. Once a column of some sensitive data type is found, the scan immediately enters the data blood relationship map and proceeds by depth-first search. Two conditions must be satisfied during the search: first, if the dependency relationship of an edge is a copy relationship, the search continues downward, and when an edge whose dependency relationship is not a copy relationship is reached, the search backtracks upward; second, each searched node must be connected to the original node, otherwise the search stops.
As shown in fig. 4, when column i in table1 is found to be sensitive data during the sequential scan, a depth-first search is immediately performed in the data blood relationship map. The search finds that column 1 in table2 and column 1 in table n each have a copy relationship with column i in table1, so these columns are directly set to the same sensitive data type as column i in table1. This reduces the amount of calculation performed by the scanner's built-in private data identification module and improves identification efficiency.
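The search in this example can be sketched as a depth-first traversal that follows only copy edges. The graph below is loosely modelled on the fig. 4 description; its exact edges (including the non-copy `aggregate` edge and the chaining of table n behind table2), the node names, and the type name `phone_number` are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lineage graph: node -> list of (neighbour, edge relation).
LINEAGE: Dict[str, List[Tuple[str, str]]] = {
    "table1.column_i": [("table2.column_1", "copy"), ("table3.column_2", "aggregate")],
    "table2.column_1": [("table1.column_i", "copy"), ("table_n.column_1", "copy")],
    "table_n.column_1": [("table2.column_1", "copy")],
    "table3.column_2": [("table1.column_i", "aggregate")],
}


def copy_reachable(start: str) -> List[str]:
    """Depth-first search from `start` that only follows copy edges."""
    visited: Set[str] = set()
    found: List[str] = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if node != start:
            found.append(node)
        # Push in reverse so neighbours are explored in their listed order.
        for neighbour, relation in reversed(LINEAGE.get(node, [])):
            if relation == "copy" and neighbour not in visited:
                stack.append(neighbour)
    return found


# Column i of table1 has just been identified; propagate its type to the copies.
labels = {"table1.column_i": "phone_number"}
for field in copy_reachable("table1.column_i"):
    labels[field] = labels["table1.column_i"]

print(labels)
```

The column behind the `aggregate` edge receives no label from the propagation, so it would still be examined by the identification module when the sequential scan reaches it.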
It should be noted that, when querying the data blood relationship map, the same effect can be achieved by traversing the connected graph with a breadth-first search algorithm. In addition, the data relationships may be stored in a general relational database instead of a graph database.
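Both alternatives can be combined in one small sketch: the relationships are kept in an ordinary relational table and traversed breadth-first. SQLite (through Python's standard `sqlite3` module) is used only as a convenient stand-in for a general relational database; the table name `lineage_edge`, its columns, and the sample rows are hypothetical.

```python
import sqlite3
from collections import deque
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage_edge (src TEXT, dst TEXT, relation TEXT)")
conn.executemany(
    "INSERT INTO lineage_edge VALUES (?, ?, ?)",
    [
        ("table1.column_i", "table2.column_1", "copy"),
        ("table1.column_i", "table_n.column_1", "copy"),
        ("table2.column_1", "table4.column_3", "copy"),
        ("table1.column_i", "table3.column_2", "aggregate"),
    ],
)


def copy_neighbours(field: str) -> List[str]:
    """Upstream and downstream fields joined to `field` by a copy edge."""
    rows = conn.execute(
        "SELECT dst FROM lineage_edge WHERE src = ? AND relation = 'copy' "
        "UNION "
        "SELECT src FROM lineage_edge WHERE dst = ? AND relation = 'copy'",
        (field, field),
    ).fetchall()
    return sorted(row[0] for row in rows)


def breadth_first_copies(start: str) -> List[str]:
    """Breadth-first traversal over copy edges, level by level from `start`."""
    visited = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in copy_neighbours(node):
            if neighbour not in visited:
                visited.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


print(breadth_first_copies("table1.column_i"))
```

The set of fields found is the same as a depth-first search would produce; only the visiting order differs, which does not matter when every reached field simply receives the same label.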
According to an embodiment of another aspect, there is also provided an apparatus for identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus being configured to perform the method provided by the embodiments of the present specification. Fig. 5 shows a schematic block diagram of an apparatus for identifying private data in a database according to one embodiment. As shown in fig. 5, the apparatus 500 includes:
a queue forming unit 51, configured to form a queue for each field in each data table included in the database;
a first identifying unit 52, configured to sequentially perform, according to the sequence of each field in the queue obtained by the queue forming unit 51, a processing operation on a current first field, where the processing operation includes:
under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field;
a searching unit 53, configured to search, if the first identification result obtained by the first identifying unit 52 indicates that the first field belongs to private data, a second field having a preset relationship with the first field;
the second identifying unit 54 is configured to identify, by using a manner corresponding to the preset relationship, whether the second field found by the searching unit 53 belongs to the private data, to obtain a second identification result, and use the second identification result as an identification result tag of the second field.
Optionally, as an embodiment, the queue forming unit 51 is specifically configured to parse a metadata table in the database to obtain field names of the fields, and sort the field names to form the queue.
Further, the first identifying unit 52 includes:
the acquisition subunit is configured to acquire, from the database, sample data corresponding to the field name of the first field;
the identification subunit is configured to input the sample data acquired by the acquisition subunit into a private data identification model to obtain the first identification result.
Further, the private data recognition model comprises at least one of the following recognition logic:
regular expressions, language models, verification rules, multi-classification models.
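The queue formation and sample acquisition described for units 51 and 52 can be sketched as follows. SQLite is again used only as a stand-in, with its `sqlite_master` catalog and `PRAGMA table_info` playing the role of the metadata table; the function names, the sample tables, and the sampling limit of 100 rows are assumptions of this sketch.

```python
import sqlite3
from collections import deque
from typing import Deque, List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, phone TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, contact_phone TEXT)")
conn.execute("INSERT INTO users VALUES (1, '13800138000')")


def build_field_queue(conn: sqlite3.Connection) -> Deque[Tuple[str, str]]:
    """Parse the catalog into (table, field name) pairs and sort them into a queue."""
    tables = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    fields: List[Tuple[str, str]] = []
    for table in tables:
        for column in conn.execute(f'PRAGMA table_info("{table}")'):
            fields.append((table, column[1]))  # column[1] is the column name
    return deque(sorted(fields))


def sample_values(
    conn: sqlite3.Connection, table: str, column: str, limit: int = 100
) -> List[str]:
    """Fetch up to `limit` non-null values of one field as sample data."""
    rows = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL LIMIT ?',
        (limit,),
    )
    return [str(row[0]) for row in rows]


queue = build_field_queue(conn)
print(list(queue))
print(sample_values(conn, "users", "phone"))  # ['13800138000']
```

The table and column names are interpolated into the SQL text because identifiers cannot be bound as parameters; in this sketch they come from the catalog itself rather than from user input.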
Optionally, as an embodiment, the searching unit 53 is specifically configured to search, from a data relationship map established in advance, a second field having a preset relationship with the first field; the data relation graph comprises nodes corresponding to the fields, and connecting edges among the nodes correspond to relations among the fields.
Further, the data relationship graph further includes nodes corresponding to the data tables, and the connection edges between the nodes further correspond to the relationship between the data tables and the fields, and the relationship between the data tables and the data tables.
Further, the data relation map is obtained by analyzing a Structured Query Language (SQL) statement corresponding to the database.
Further, the searching unit 53 is specifically configured to search, starting from the node corresponding to the first field, for a node whose relation corresponding to the connection edge is the preset relation until the relation of the connection edge is not the preset relation, and use a field corresponding to the searched node as the second field.
Optionally, as an embodiment, the preset relationship is replication;
the second identifying unit 54 is specifically configured to directly determine that the second identification result is that the second field belongs to private data.
Optionally, as an embodiment, the preset relationship is truncation;
the first identifying unit 52 is specifically configured to respectively identify whether the first field belongs to the private data by using each identification model in the first identification model set, to obtain each first identification sub-result, and comprehensively determine the first identification result according to each first identification sub-result;
the second identifying unit 54 is specifically configured to identify, by using at least one identification model in a second identification model set, whether the second field belongs to the private data, so as to obtain the second identification result; the second set of recognition models is a subset of the first set of recognition models.
Optionally, as an embodiment, the first recognition result and/or the second recognition result includes:
whether the field belongs to private data, and the type of private data when it belongs to private data.
With the apparatus provided in this specification, first, the queue forming unit 51 forms a queue for each field in each data table included in the database; then, the first identifying unit 52 performs processing operations on the current first field in sequence according to the sorting of the fields in the queue, where the processing operations include: under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field; then, when the first identification result indicates that the first field belongs to private data, the searching unit 53 searches for a second field having a preset relationship with the first field; finally, the second identifying unit 54 identifies whether the second field belongs to the private data by using a manner corresponding to the preset relationship, so as to obtain a second identification result, and uses the second identification result as an identification result tag of the second field. As can be seen from the above, in the embodiments of the present specification, by using the relationship between fields, in the process of sequentially identifying whether each field belongs to private data, once a first field belonging to private data is encountered, a second field having a preset relationship with the first field is immediately queried, and whether the second field belongs to private data is identified by using a manner corresponding to the preset relationship.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above are only exemplary embodiments of the present invention and are not intended to limit its scope; any modification, equivalent substitution, improvement and the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims (24)

1. A method of identifying private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the method comprising:
forming a queue by using each field in each data table included in the database;
sequentially performing a processing operation on a current first field according to the order of the fields in the queue, the processing operation comprising:
under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field;
if the first identification result indicates that the first field belongs to private data, searching a second field having a preset relationship with the first field;
and identifying whether the second field belongs to the private data by using a mode corresponding to the preset relation to obtain a second identification result, and using the second identification result as an identification result label of the second field.
2. The method of claim 1, wherein said forming a queue of fields in respective data tables comprised by said database comprises:
parsing a metadata table in the database to obtain the field names of the fields, and sorting the field names to form the queue.
3. The method of claim 2, wherein the identifying whether the first field belongs to private data comprises:
acquiring sample data corresponding to the field name of the first field from the database;
and inputting the sample data into a private data identification model to obtain the first identification result.
4. The method of claim 3, wherein the private data recognition model comprises at least one of the following recognition logic:
regular expressions, language models, verification rules, multi-classification models.
5. The method of claim 1, wherein said finding a second field having a preset relationship with said first field comprises:
searching a second field having a preset relation with the first field from a pre-established data relation map; the data relation graph comprises nodes corresponding to the fields, and connecting edges among the nodes correspond to relations among the fields.
6. The method of claim 5, wherein the data relationship graph further comprises nodes corresponding to data tables, and the connecting edges between the nodes further correspond to relationships between data tables and fields, and relationships between data tables and data tables.
7. The method of claim 5 or 6, wherein the data relationship graph is derived from parsing a Structured Query Language (SQL) statement corresponding to the database.
8. The method of claim 5, wherein said finding a second field having a preset relationship with said first field from a pre-established data relationship graph comprises:
starting from the node corresponding to the first field, searching for nodes whose connecting edges correspond to the preset relationship, until the relationship of a connecting edge is not the preset relationship, and taking the fields corresponding to the searched nodes as the second field.
9. The method of claim 1, wherein the preset relationship is replication;
the identifying whether the second field belongs to private data by using the mode corresponding to the preset relationship comprises:
and directly determining that the second identification result is that the second field belongs to private data.
10. The method of claim 1, wherein the preset relationship is truncation;
the identifying whether the first field belongs to private data comprises:
respectively identifying whether the first field belongs to the private data or not by utilizing each identification model in a first identification model set to obtain each first identification sub-result, and comprehensively determining the first identification result according to each first identification sub-result;
the identifying whether the second field belongs to private data by using the mode corresponding to the preset relationship comprises:
identifying whether the second field belongs to private data or not by using at least one identification model in a second identification model set to obtain a second identification result; the second set of recognition models is a subset of the first set of recognition models.
11. The method of claim 1, wherein the first recognition result and/or the second recognition result comprises:
whether the field belongs to private data, and the type of private data when it belongs to private data.
12. An apparatus to identify private data in a database, the database comprising a plurality of data tables, each data table comprising a plurality of fields, the apparatus comprising:
a queue forming unit, configured to form a queue for each field in each data table included in the database;
a first identifying unit, configured to perform processing operations on a current first field in sequence according to the sequence of each field in the queue obtained by the queue forming unit, where the processing operations include:
under the condition that the first field does not have the identification result label, identifying whether the first field belongs to private data or not to obtain a first identification result, and taking the first identification result as the identification result label of the first field;
the searching unit is used for searching a second field having a preset relation with the first field if the first identification result obtained by the first identification unit indicates that the first field belongs to private data;
and the second identification unit is used for identifying whether the second field searched by the search unit belongs to the private data or not by using a mode corresponding to the preset relationship to obtain a second identification result, and the second identification result is used as an identification result label of the second field.
13. The apparatus according to claim 12, wherein the queue forming unit is specifically configured to parse a metadata table in the database to obtain field names of the fields, and form the queue after sorting the field names.
14. The apparatus of claim 13, wherein the first identifying unit comprises:
the acquisition subunit is configured to acquire, from the database, sample data corresponding to the field name of the first field;
and the identification subunit is configured to input the sample data acquired by the acquisition subunit into a private data identification model to obtain the first identification result.
15. The apparatus of claim 14, wherein the private data recognition model comprises at least one of the following recognition logic:
regular expressions, language models, verification rules, multi-classification models.
16. The apparatus according to claim 12, wherein the searching unit is specifically configured to search a second field having a preset relationship with the first field from a pre-established data relationship map; the data relation graph comprises nodes corresponding to the fields, and connecting edges among the nodes correspond to relations among the fields.
17. The apparatus of claim 16, wherein the data relationship graph further comprises nodes corresponding to data tables, and the connecting edges between the nodes further correspond to relationships between data tables and fields, and relationships between data tables and data tables.
18. The apparatus of claim 16 or 17, wherein the data relationship graph is derived from parsing a Structured Query Language (SQL) statement corresponding to the database.
19. The apparatus according to claim 16, wherein the searching unit is specifically configured to search, starting from the node corresponding to the first field, for a node whose relation to the connection edge is the preset relation until the relation to the connection edge is not the preset relation, and use a field corresponding to the searched node as the second field.
20. The apparatus of claim 12, wherein the preset relationship is replication;
the second identifying unit is specifically configured to directly determine that the second identification result is that the second field belongs to private data.
21. The apparatus of claim 12, wherein the preset relationship is truncation;
the first identification unit is specifically configured to respectively identify whether the first field belongs to the private data by using each identification model in a first identification model set, obtain each first identification sub-result, and comprehensively determine the first identification result according to each first identification sub-result;
the second identification unit is specifically configured to identify whether the second field belongs to the private data by using at least one identification model in a second identification model set, so as to obtain a second identification result; the second set of recognition models is a subset of the first set of recognition models.
22. The apparatus of claim 12, wherein the first recognition result and/or the second recognition result comprises:
whether the field belongs to private data, and the type of private data when it belongs to private data.
23. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-11.
24. A computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of any of claims 1-11.
CN202110909377.9A 2021-08-09 2021-08-09 Method and device for identifying private data in database Pending CN113672653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110909377.9A CN113672653A (en) 2021-08-09 2021-08-09 Method and device for identifying private data in database

Publications (1)

Publication Number Publication Date
CN113672653A true CN113672653A (en) 2021-11-19

Family

ID=78541889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110909377.9A Pending CN113672653A (en) 2021-08-09 2021-08-09 Method and device for identifying private data in database

Country Status (1)

Country Link
CN (1) CN113672653A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780551A (en) * 2022-05-16 2022-07-22 北京火山引擎科技有限公司 Method and device for identifying specific type of data
WO2023231341A1 (en) * 2022-06-02 2023-12-07 蚂蚁区块链科技(上海)有限公司 Method and apparatus for discovering data asset risk

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731976A (en) * 2015-04-14 2015-06-24 海量云图(北京)数据技术有限公司 Method for finding and sorting private data in data table
CN110457405A (en) * 2019-08-20 2019-11-15 上海观安信息技术股份有限公司 A kind of database audit method based on genetic connection
CN110543586A (en) * 2019-09-04 2019-12-06 北京百度网讯科技有限公司 Multi-user identity fusion method, device, equipment and storage medium
CN110704873A (en) * 2019-09-25 2020-01-17 全球能源互联网研究院有限公司 Method and system for preventing sensitive data from being leaked
CN110781520A (en) * 2019-10-30 2020-02-11 上海观安信息技术股份有限公司 Sensitive table group discovery method and system
CN111046242A (en) * 2019-11-27 2020-04-21 支付宝(杭州)信息技术有限公司 Data processing method, device, equipment and medium
CN111310232A (en) * 2020-03-17 2020-06-19 杭州数梦工场科技有限公司 Data desensitization method and device, electronic equipment and storage medium
CN111539021A (en) * 2020-04-26 2020-08-14 支付宝(杭州)信息技术有限公司 Data privacy type identification method, device and equipment
CN111881302A (en) * 2020-07-23 2020-11-03 民生科技有限责任公司 Bank public opinion analysis method and system based on knowledge graph
CN112132238A (en) * 2020-11-23 2020-12-25 支付宝(杭州)信息技术有限公司 Method, device, equipment and readable medium for identifying private data
CN112528315A (en) * 2019-09-19 2021-03-19 华为技术有限公司 Method and device for identifying sensitive data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination