CN111914130A - Sensitive data detection method and device - Google Patents

Sensitive data detection method and device


Publication number
CN111914130A
Authority
CN
China
Prior art keywords
data
column
detected
detection
sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010767486.7A
Other languages
Chinese (zh)
Inventor
赵正邦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010767486.7A
Publication of CN111914130A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Telephone Function (AREA)

Abstract

The specification discloses a sensitive data detection method and device. The method performs sensitive data detection on any column in a data table of source data, where the source data comprises at least one data table, and detection execution conditions and corresponding to-be-detected column determination strategies are pre-configured. The method comprises the following steps: determining that the source data meets any detection execution condition, wherein the detection execution conditions comprise: a data table being newly added, a column being newly added to any data table, or the name of any column in any data table being changed; determining the to-be-detected columns in the data table of the source data according to the to-be-detected column determination strategy corresponding to the satisfied detection execution condition, and sampling any to-be-detected column; and detecting the sampling result of the to-be-detected column based on any sensitive data detection algorithm, and taking that detection result as the detection result of the to-be-detected column.

Description

Sensitive data detection method and device
Technical Field
The embodiment of the specification relates to the technical field of computer application, in particular to a sensitive data detection method and device.
Background
At present, in a big data scene, in consideration of privacy protection, data security and the like, sensitive data needs to be detected for massive data, and non-sensitive data and sensitive data need to be determined, so that authority management and control and audit are performed on the detected sensitive data.
Sensitive data detection usually relies on manual inspection. However, in a big data scenario the volume of data is large, so relying on manual inspection alone is very inefficient; moreover, the data changes at any time, so changed data is difficult to detect promptly.
Disclosure of Invention
In order to improve detection efficiency and detect changed data in time, the specification provides a sensitive data detection method and a sensitive data detection device. The technical scheme is as follows.
A sensitive data detection method is used for performing sensitive data detection on any column in a data table of source data, wherein the source data comprises at least one data table, and detection execution conditions and corresponding to-be-detected column determination strategies are preset; the method comprises the following steps:
determining that the source data meets any detection execution condition, wherein the detection execution conditions comprise: a data table being newly added, a column being newly added to any data table, or the name of any column in any data table being changed;
determining the to-be-detected columns in the data table of the source data according to the to-be-detected column determination strategy corresponding to the satisfied detection execution condition, and sampling any to-be-detected column;
and detecting the sampling result of the to-be-detected column based on any sensitive data detection algorithm, and taking that detection result as the detection result of the to-be-detected column.
A sensitive data detection device is used for performing sensitive data detection on any column in a data table of source data, wherein the source data comprises at least one data table, and detection execution conditions and corresponding to-be-detected column determination strategies are preset; the device comprises:
a determination unit, configured to determine that the source data meets any detection execution condition, wherein the detection execution conditions comprise: a data table being newly added, a column being newly added to any data table, or the name of any column in any data table being changed;
a sampling unit, configured to determine the to-be-detected columns in the data table of the source data according to the to-be-detected column determination strategy corresponding to the satisfied detection execution condition, and to sample any to-be-detected column;
a detection unit, configured to detect the sampling result of the to-be-detected column based on any sensitive data detection algorithm, and to take that detection result as the detection result of the to-be-detected column.
With the above technical solution, a machine can detect the sampled data using a sensitive data detection algorithm. Sampling the data in scenarios with a large data magnitude reduces the amount of data to be detected while preserving detection accuracy, which improves detection efficiency. Changes in the source data can also be sensed promptly, so that newly added data is detected in time.
Drawings
In order to illustrate the embodiments of the present specification or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments of the present specification, and those skilled in the art can obtain other drawings from them.
FIG. 1 is a schematic flow chart diagram illustrating a method for detecting sensitive data according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart illustrating an example of an application of a sensitive data detection method provided in an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a sensitive data detection apparatus provided in an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a sampling unit provided in an embodiment of the present specification;
FIG. 5 is a schematic structural diagram of a device for configuring the method according to an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the embodiments of the present specification, the technical solutions in the embodiments of the present specification will be described in detail below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of protection.
At present, in a big data scene, in consideration of privacy protection, data security and the like, sensitive data needs to be detected for massive data, and non-sensitive data and sensitive data need to be determined, so that authority management and control and audit are performed on the detected sensitive data.
Sensitive data may include business-related data, personal privacy data, company confidential data, and the like. Examples include personal information such as identification numbers, mobile phone numbers, and home addresses, or a company's customer details, business secrets, and so on.
Illegal modification or leakage of sensitive data may cause losses to stakeholders. For example, leaked personal information may lead to telephone harassment, theft of bank assets, fraudulent business transactions, and other unpredictable losses.
In order to protect the sensitive data, the detected sensitive data can be managed and controlled by permissions such as viewing permissions, modification permissions, output permissions and the like, and can be audited after illegal modification or leakage occurs to determine and remedy the vulnerability in data security management.
Of course, sensitive data may be further divided into different sensitivity levels to facilitate more detailed entitlement management and auditing.
For convenience of description, data that needs to be subjected to sensitive data detection in this specification is referred to as source data. The source data may be stored in the form of database tables, and may include at least 1 data table.
In the data table of the source data, a single row of data may represent one data entity, one data entity may include at least 1 field, and a single column of data may belong to the same field of all data entities in the data table.
Because the data magnitude is large (for example, the source data may include tens of data tables, each containing millions of rows), even if it were possible to detect whether each field value in each row is sensitive data, it would be difficult to subsequently perform authority control and audit on each individual field value that is sensitive data.
In practice, sensitive data detection usually relies on manual inspection of each field value of each row of data in each data table in the source data. This approach is clearly inefficient.
In addition, business processing may require the data in the source data to be added, deleted, modified, and queried. That is, the source data may change, where the changes include at least data being added, deleted, or modified. The changed data then needs to be confirmed and detected manually, but it is difficult to notice changes in the source data promptly by hand, so efficient detection is difficult and the detection result is hard to obtain in time.
In view of the above three technical problems, the present specification provides a sensitive data detection method. The following explains features of a sensitive data detection method provided in the present specification that respectively solve the above three problems.
1) Regarding the difficulty of subsequent authority control and audit: first, in a scenario with a large data magnitude, it is difficult to perform authority control and audit at the fine granularity of a single field value, so the granularity of authority control and audit, that is, the granularity of the detected sensitive data, needs to be increased (made coarser). Second, a single column of data may belong to the same field of all data entities in the data table, and sensitive data usually corresponds not to some of the data entities but to some of the fields of the data entities. For example, an individual as a data entity may include fields such as name, gender, and identification number, where the identification number is sensitive data and the name and gender are non-sensitive data.
In a sensitive data detection method provided in this specification, detection of sensitive data may be performed on a single column of data in any data table of source data, and a detection result may include: a single column of data belongs to sensitive data or a single column of data belongs to non-sensitive data.
Sensitive data detection is carried out on single-column data, the granularity of the detected sensitive data can be increased, and subsequent authority control and audit are facilitated under the scene that the data magnitude of source data is large.
Note that although a single column of data belongs to the same field, its values may have different content meanings. For example, for an account field, the corresponding single column of data may include mobile phone numbers, mailbox addresses, user names, and the like, where a mobile phone number may be sensitive data while a user name and a mailbox address are not. That is, not every content meaning present in a single column of data is necessarily sensitive, and the final detection result of the column may be determined from the detection results of the individual data in the column.
2) In order to solve the problem of low manual detection efficiency, the method for detecting the sensitive data provided by the specification can be used for detecting the sensitive data by using a machine based on any sensitive data detection algorithm, so that the detection efficiency is improved.
3) To solve the problem that a detection result is difficult to obtain in time, in the sensitive data detection method provided in this specification, a machine may monitor whether the source data changes and perform sensitive data detection on the changed data. Because machine monitoring and detection are highly efficient, the detection result can be obtained in time.
In summary of the above description, the present specification provides a sensitive data detection method for performing sensitive detection on any column in a data table of source data, which may include: when any column in any data table in the source data is monitored to be changed, sampling the data of any column which is changed; and detecting the sampling result of the column based on any sensitive data detection method, and taking the detection result as the detection result of the column.
The data magnitude of the source data is large, so that when any column of data is detected, the data of the column is sampled, and the detection result of the sampling result is used as the detection result of the column, so that the data amount to be detected can be reduced on the premise of ensuring the correctness of the detection result, and the detection efficiency is further improved.
To explain the sensitive data detection method provided in this specification in more detail, FIG. 1 shows a schematic flow chart of the method. The method is used for performing sensitive data detection on any column in a data table of source data, and the source data may include at least one data table.
Detection execution conditions and corresponding to-be-detected column determination strategies are pre-configured.
The detection execution condition at least comprises the change of the source data, and can be used for executing the sensitive data detection when the detection execution condition is met.
The columns to be detected at least include the corresponding columns which are changed under the condition that the source data meet the detection execution condition. The determined column to be detected can be subjected to sensitive data detection in a later step.
The correspondence between the detection execution conditions and the determination policy of the columns to be detected can be shown in table 1 below. Of course, the following table includes only some examples of the corresponding relationship for illustrative purposes, and does not limit the scope of the disclosure.
[Table 1 is provided as an image in the original publication and is not reproduced here.]
Table 1. Correspondence between detection execution conditions and to-be-detected column determination policies
The method shown in fig. 1 may include at least the following steps.
S101: Determine that the source data satisfies any of the detection execution conditions.
S102: Determine the to-be-detected columns in the data table of the source data according to the to-be-detected column determination strategy corresponding to the satisfied detection execution condition, and sample any to-be-detected column.
S103: Detect the sampling result of the to-be-detected column based on any sensitive data detection algorithm, and take that detection result as the detection result of the to-be-detected column.
In S101, the source data may be monitored in real time, so that the subsequent steps are triggered as soon as the source data satisfies any one of the detection execution conditions.
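Steps S102 and S103 can be sketched roughly as follows. This is only an illustrative sketch: the function names, the sample size of 100, the 30% threshold, and the single phone-number pattern used as the detection algorithm are all assumptions chosen for the example, not the configuration prescribed by the specification.

```python
import random
import re

# Hypothetical detection rule: a single mobile-phone pattern stands in for
# "any sensitive data detection algorithm".
PHONE = re.compile(r"^1(3|4|5|7|8)\d{9}$")

def sample_column(values, k=100, seed=0):
    """S102: draw a random sample of at most k values from one column."""
    values = list(values)
    if len(values) <= k:
        return values
    return random.Random(seed).sample(values, k)

def detect_column(sampled, threshold=0.3):
    """S103: treat the column as sensitive if the share of sensitive values exceeds the threshold."""
    if not sampled:
        return False
    hits = sum(1 for v in sampled if PHONE.fullmatch(v))
    return hits / len(sampled) > threshold

def run_detection(tables, columns_to_detect):
    """tables: {table: {column: [values]}}; columns_to_detect: [(table, column)] chosen by the policy."""
    return {
        (t, c): detect_column(sample_column(tables[t][c]))
        for t, c in columns_to_detect
    }
```

The result for the sample is simply reported as the result for the whole column, which is what keeps the amount of inspected data small.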
Of course, the source data may simultaneously satisfy at least 2 detection execution conditions, and correspondingly, the to-be-detected column may be determined according to at least 2 corresponding to-be-detected column determination policies.
This specification does not limit how to monitor in real time whether the source data meets any detection execution condition. The following three examples, each corresponding to some of the detection execution conditions in Table 1 above, are given for illustration only.
a) The three detection execution conditions, namely, a newly added data table, a newly added column of any data table, or a name change of any column in any data table, belong to the structure change in the source data, and can realize real-time monitoring by synchronizing with the structure of the source data.
In particular, real-time synchronization may be combined with timed synchronization. Real-time synchronization may include: when the structure of the source data changes, a notification is issued in real time, so that it is determined that the source data meets a detection execution condition. Timed synchronization may include: checking at preset periodic time points whether the structure of the source data has changed, and if so, determining that the source data meets a detection execution condition.
By combining the two synchronization modes, the real-time monitoring can be realized, and the condition of real-time monitoring failure caused by real-time synchronization failure can be avoided.
b) For the detection execution condition of updating the version of any sensitive data detection algorithm, whether the version of the algorithm is updated or not can be monitored in real time. A method combining real-time synchronization and timing synchronization may also be used, and will not be described herein.
c) For the two detection execution conditions that the duration between the current time point and the last detection time point is greater than a preset duration, or that the current time point coincides with a preset time point for periodically executing detection, a timer can be set. When sensitive data detection has not been executed for a long time, or a preset periodic time point is reached, it is determined that the source data meets a detection execution condition.
These two detection execution conditions target non-structural changes in the source data. Non-structural changes may include the addition and deletion of data, for example, adding data to a single column or adding data entities to a data table. Since a non-structural change may also change the sensitive data detection result of a single column of data, it is necessary to check periodically whether such a change has altered the detection result. How a non-structural change alters the sensitive data detection result is explained under S103.
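The structural and timer-based conditions described above might be checked along the following lines. This is a sketch under stated assumptions: schema snapshots are assumed to map each table to `{column_id: column_name}` so that a rename can be told apart from a drop-and-add, and `diff_schema` and `timer_due` are hypothetical helper names. The timed-synchronization path would compare the previous and current snapshots periodically.

```python
def diff_schema(old, new):
    """Compare two schema snapshots of the form {table: {column_id: column_name}}.

    Returns a list of (condition, table, column_name) tuples, one per satisfied
    structural detection execution condition.
    """
    events = []
    for table, cols in new.items():
        if table not in old:
            events.append(("new_table", table, None))
            continue
        for col_id, name in cols.items():
            if col_id not in old[table]:
                events.append(("new_column", table, name))
            elif old[table][col_id] != name:
                events.append(("renamed_column", table, name))
    return events

def timer_due(now, last_detection, max_gap, next_scheduled):
    """True if detection has not run for longer than max_gap,
    or the preset periodic time point has been reached.

    Assumption: "reached" is tested with >= rather than exact equality,
    so a slightly late timer tick still triggers detection.
    """
    return (now - last_detection) > max_gap or now >= next_scheduled
```

Several conditions can hold at once, in which case each returned event would be mapped to its own to-be-detected column determination policy.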
In S102, before sampling, it may be determined whether the name of the column to be detected exists in a pre-stored sensitive data name set for any column to be detected.
If the name of the column to be detected exists in the pre-stored sensitive data name set, the detection result of the column is determined directly (the column belongs to sensitive data); if the name does not exist in the pre-stored sensitive data name set, the column to be detected is sampled.
For example, the pre-stored set of sensitive data names may include: identification number, mobile phone number, bank password, etc. When the name of any column to be detected is the identification number, the meaning of the content of the data of the column to be detected is the identification number, and the column to be detected can be directly determined to belong to sensitive data without subsequent sampling and sensitive data detection.
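The name-based shortcut can be illustrated with a small sketch. The set contents follow the example above; the function name and the case/whitespace normalization are assumptions added for the illustration.

```python
# Pre-stored sensitive data name set (contents taken from the example above).
SENSITIVE_NAMES = {"identification number", "mobile phone number", "bank password"}

def precheck_by_name(column_name, sensitive_names=SENSITIVE_NAMES):
    """Return True if the column can be marked sensitive from its name alone,
    or None if the column still has to be sampled and detected.

    Assumption: names are compared after trimming whitespace and lowercasing.
    """
    normalized = column_name.strip().lower()
    return True if normalized in sensitive_names else None
```

Returning `None` rather than `False` keeps "not decided yet" distinct from "not sensitive", since a column whose name is not in the set may still turn out to be sensitive after sampling.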
The sampled data can be stored by using the column dimension, which is convenient for the subsequent processing of S103. Specifically, the data obtained by sampling a single column to be detected may be stored using a dynamic array, and the obtained array is processed in subsequent S103.
The specific sampling method is described after the explanation of each step.
In S102, when the source data meets a detection execution condition, only the to-be-detected columns corresponding to that condition are sampled and subjected to the subsequent sensitive data detection; columns that are not to-be-detected columns are neither sampled nor detected. In addition, to-be-detected columns that belong to sensitive data can be identified in advance through the pre-stored sensitive data name set without sampling or subsequent detection. Both measures further improve detection efficiency and allow the detection result to be obtained sooner.
For S103, explanation is made below in terms of the sensitive data detection algorithm, the specific detection flow, and the algorithm version update, respectively.
1. Sensitive data detection algorithm.
The sensitive data detection algorithm may include a data content recognition algorithm, which may specifically include a regular expression algorithm, Drools rules, the XGBoost algorithm, and the like. After the content of the data is identified, whether the identification result belongs to sensitive data can be determined according to a preset sensitive data content set.
In the specific process of identifying the data content, the data and a plurality of pre-stored data content forms can be matched one by one, and at least 1 data content form which is successfully matched is determined.
The regular expression algorithm may match the data with a plurality of pre-stored regular expressions one by one, and if the matching is successful, it may be determined that the data may contain the content meaning corresponding to the successfully matched regular expression.
For example, the regular expression ^1(3|4|5|7|8)\d{9}$ corresponds to the content meaning of a mobile phone number: the data must consist of exactly 11 digits, where the first digit is 1, the second digit is 3, 4, 5, 7, or 8, and each of the following 9 digits is any digit from 0 to 9. If a piece of data matches this regular expression, its content meaning may be a mobile phone number, and the data is determined to be sensitive data.
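As a quick illustration, the mobile phone pattern from this example can be applied with Python's standard `re` module. The helper name `looks_like_mobile` is hypothetical; `fullmatch` and the `re.ASCII` flag are used here so that a trailing newline or non-ASCII digits do not count as a match, which is a choice made for this sketch rather than something the specification states.

```python
import re

# 11 digits: first digit 1, second digit in {3, 4, 5, 7, 8}, then nine more digits.
MOBILE_RE = re.compile(r"^1(3|4|5|7|8)\d{9}$", re.ASCII)

def looks_like_mobile(value):
    """Return True if the whole string matches the mobile phone number pattern."""
    return MOBILE_RE.fullmatch(value) is not None
```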
It should be noted that, since different data content forms may match the same data at the same time, the same data may contain different content meanings and may be detected as sensitive data and non-sensitive data at the same time.
For example, for a string "wanx", the content form of a name (regular expression) and the content form of a user name (regular expression) can be matched, and thus "wanx" can contain both content meanings of a name and a user name. In the case where the name belongs to sensitive data and the username belongs to non-sensitive data, "wanx" may be detected as both sensitive and non-sensitive data.
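The situation where one value matches several content forms can be sketched as follows. Both patterns are made-up placeholders for a "name" form and a "user name" form, and the sensitivity flags follow the example above; none of them come from the specification.

```python
import re

# Hypothetical content forms: content meaning -> (pattern, is_sensitive).
CONTENT_FORMS = {
    "name": (re.compile(r"[A-Za-z]{2,20}"), True),
    "user name": (re.compile(r"[A-Za-z0-9_]{3,16}"), False),
}

def classify(value):
    """Return the set of sensitivity verdicts over every content form the value matches."""
    return {
        sensitive
        for pattern, sensitive in CONTENT_FORMS.values()
        if pattern.fullmatch(value)
    }
```

With these placeholder patterns, `classify("wanx")` yields both `True` and `False`, that is, the value is detected as sensitive and as non-sensitive at the same time.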
2. Specific detection process.
The specific detection process is explained below in three parts: pre-recognition before sensitive data detection; how to obtain the detection result of the sampling result after each datum in the sampling result has been detected with the above sensitive data detection algorithm; and how to calculate the proportion of sensitive data.
1) Pre-recognition.
Before the sensitive data detection is carried out on the sampling result, the pre-recognition can be carried out firstly, and the characteristics of each data in the sampling result are determined, so that the detection times can be reduced, the detection speed is accelerated, and the detection efficiency is improved in the subsequent sensitive data detection algorithm.
For example, when pre-recognition finds that a datum in the sampling result contains 15 digits, the datum cannot be a mobile phone number, so the mobile phone number check can be skipped in the subsequent sensitive data detection algorithm. This speeds up detection and improves detection efficiency.
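Pre-recognition can be pictured as extracting a few cheap features and using them to decide which content checks are worth running. The feature names and the two preconditions below are illustrative assumptions only.

```python
def pre_recognize(value):
    """Extract cheap features of a value before any content recognition runs."""
    return {
        "length": len(value),
        "all_digits": value.isdigit(),
        "has_at": "@" in value,
    }

# Hypothetical preconditions: a check is only attempted if its precondition holds.
PRECONDITIONS = {
    "mobile phone number": lambda f: f["all_digits"] and f["length"] == 11,
    "mailbox address": lambda f: f["has_at"],
}

def candidate_checks(features):
    """Return the names of the content checks that still need to be run."""
    return [name for name, ok in PRECONDITIONS.items() if ok(features)]
```

A 15-digit value fails the mobile phone precondition, so the corresponding pattern match is never attempted for it.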
2) How to obtain the detection result of the sampling result.
The detection is performed based on any sensitive data detection algorithm, specifically, a data content identification algorithm is used for identifying the data content of each data in the sampling results of the column to be detected, and whether the data is sensitive data is determined according to the identification result; and determining the detection result of the column sampling result to be detected according to the proportion of the sensitive data in the column sampling result to be detected.
For example, when the proportion of the sensitive data exceeds a preset threshold, it may be determined that the sampling result of the to-be-detected column belongs to the sensitive data, and then it is determined that the to-be-detected column belongs to the sensitive data.
Corresponding to the above explanation of S101 that "the non-structural change may also cause the sensitive data detection result of a single column of data to change", the non-structural change of the source data may also cause the sensitive data detection result of a column to be detected to change. When data in a single column to be detected includes sensitive data and non-sensitive data, the proportion of the sensitive data may change with the addition and deletion of data in the column, so that the detection result of the sensitive data in the column changes.
As a specific example, the column to be detected corresponding to the account field may include mobile phone numbers, mailbox addresses, and user names. Mobile phone numbers are sensitive data, while mailbox addresses and user names are non-sensitive data. When the proportion of mobile phone numbers among the data in the account column exceeds 30%, the detection result is that the account column belongs to sensitive data; otherwise, the account column belongs to non-sensitive data.
To handle such changes, two of the detection execution conditions provide timed monitoring and sensitive data detection: the duration between the current time point and the last detection time point being greater than a preset duration, or the current time point coinciding with a preset time point for periodically executing detection. This ensures that the detection result is updated promptly as non-structural changes occur in the source data.
3) How to calculate the proportion of sensitive data.
It should be noted that the same data may be detected as sensitive data and non-sensitive data at the same time. When determining the proportion of sensitive data in a single column of data, the specification does not limit the calculation method of the proportion, and the following 2 examples are only exemplary.
a) When determining the proportion of sensitive data, data that is detected as both sensitive data and non-sensitive data is not considered.
b) When determining the proportion of sensitive data, the total number of detection results of all data is taken as the denominator, and the number of detection results that are sensitive data is taken as the numerator.
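The two example calculations can be written out as below. Each value's detection outcome is represented as a non-empty set of booleans (`{True}`, `{False}`, or `{True, False}` when the value is detected as both); this representation, the function names, and the reading of method a) as dropping ambiguous values from both numerator and denominator are assumptions made for the sketch.

```python
def ratio_excluding_ambiguous(verdicts):
    """Method a): ignore values detected as both sensitive and non-sensitive."""
    clear = [v for v in verdicts if len(v) == 1]
    if not clear:
        return 0.0
    return sum(1 for v in clear if True in v) / len(clear)

def ratio_over_all_results(verdicts):
    """Method b): sensitive detection results divided by all detection results of all values."""
    total = sum(len(v) for v in verdicts)
    if total == 0:
        return 0.0
    return sum(1 for v in verdicts if True in v) / total
```

For four sampled values with outcomes `{True}`, `{True}`, `{False}`, `{True, False}`, method a) gives 2/3 (the ambiguous value is dropped) and method b) gives 3/5 (five results in total, three of them sensitive). Either ratio would then be compared against the preset threshold.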
3. Algorithm version update.
It should additionally be noted that the sensitive data detection algorithm may be updated to a new version. For example, if a business change means that user names now also belong to sensitive data, the version of the sensitive data detection algorithm needs to be updated and the rules in the algorithm redefined. When detection is performed based on any sensitive data detection algorithm, the latest version of that algorithm can then be used.
In Table 1 above, the detection execution conditions also include the version of any sensitive data detection algorithm being updated. When the version is updated, the single-column data previously detected with the old version can be detected again with the latest version. This ensures that sensitive data detection results are updated promptly whenever the sensitive data detection algorithm is updated.
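One way to picture re-detection after a version update is to record, for each column, which algorithm version produced its current result, and to select the columns whose recorded version is stale. The log structure and the function name are hypothetical.

```python
def columns_needing_redetection(detection_log, current_version):
    """detection_log: {(table, column): algorithm_version_used_for_last_detection}.

    Returns the columns whose last result came from a version other than the current one.
    """
    return sorted(
        col for col, version in detection_log.items() if version != current_version
    )
```

The selected columns would then go through sampling and detection again with the latest version.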
S102 and S103 describe performing sensitive data detection on any column to be detected, and it can be understood that sensitive data detection can be performed in the same manner for all columns to be detected determined in S102.
Having explained the individual steps, the sampling method in S102 is described below.
The sensitive data detection method provided in this specification performs sensitive data detection on single-column data, and the data in a single column may belong to the same field.
It can be understood that single-column data belonging to the same field may have the same content meaning. For example, the single-column data corresponding to an identification field may all be identification numbers, and the single-column data corresponding to a contact information field may all be mobile phone numbers. For single-column data with the same content meaning, the sample count does not affect the sensitive data detection result, so the sample count can be reduced as far as possible.
Single-column data belonging to the same field may also have different content meanings. For example, the single-column data corresponding to a contact information field may include mobile phone numbers and email addresses, and the single-column data corresponding to an account field may include mobile phone numbers, email addresses, and user names.
However, because the source data is very large, it is difficult to determine in advance whether all data in a single column have the same meaning. For example, a single column corresponding to a contact information field may contain millions of data items, and it is difficult to determine whether they all have the same meaning.
This specification does not limit the specific implementation of the sampling method; the following two examples are merely illustrative.
a) For single-column data, first perform random sampling with a fixed sample count or a fixed sampling ratio, and then identify the sampled data.
The specific identification method is not limited in this specification; it may be manual identification or identification based on a data content identification algorithm.
If the identification result is that the sampled data of the column all have the same content meaning, the column is considered to have a single content meaning, and sensitive data detection is performed on the sampling result. If the identification result is that the sampled data of the column have different content meanings, the column can be sampled again, and sensitive data detection is performed on the new sampling result.
The resampling method is not limited in this specification.
b) For data of any content meaning, the preset requirement is as follows: if the actual proportion of data with that content meaning exceeds a first preset threshold, the probability that the sampling result contains no data with that content meaning must be smaller than a second preset threshold. Based on this requirement, the sample count can be determined from the number of data items in the single column, and random sampling is performed with that sample count.
A specific example formula may be as follows.
[Formula image BDA0002615246590000121 is not reproduced in this text.]
For data of any content meaning, p is the probability that the sampling result contains no data with that content meaning, i.e. the second preset threshold; m is the number of data items in the column; n is the sample count for the column; and x is the first preset threshold.
Once the first preset threshold, the second preset threshold, and the number of data items in the column are determined, the sample count can be obtained.
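Because the formula itself is only available as an image, the following is a sketch under an explicit assumption rather than the patent's own formula: it assumes random sampling without replacement and that exactly ceil(x * m) items carry the content meaning, so that the probability of missing them all in n draws is the product of (m - k - i) / (m - i) for i = 0 .. n - 1. The function name `min_sample_count` is hypothetical.

```python
import math


def min_sample_count(m: int, x: float, p: float) -> int:
    """Smallest n such that a sample of n items out of m misses every one of the
    ceil(x * m) items of a given content meaning with probability below p.

    Assumes sampling without replacement (an assumption, not from the source).
    """
    k = math.ceil(x * m)  # assumed number of items carrying the content meaning
    miss = 1.0            # probability that the first n draws all miss
    for n in range(1, m + 1):
        other_left = (m - k) - (n - 1)  # items without the meaning still undrawn
        if other_left <= 0:
            return n  # too few other items remain, so a hit is now certain
        miss *= other_left / (m - (n - 1))
        if miss < p:
            return n
    return m
```

For example, with m = 100, x = 0.5 and p = 0.01, the miss probability first drops below 0.01 at n = 7.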
With this sampling method, as little data as possible is sampled for subsequent sensitive data detection while the correctness of the detection result is preserved, which improves detection efficiency.
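Sampling method a) above can be sketched as follows. The regular expressions, the meaning labels, and the choice to resample with a larger fixed count are all assumptions for illustration; the specification leaves both the identification method and the resampling method open.

```python
import random
import re
from typing import List

# Hypothetical content-meaning patterns, for illustration only.
PATTERNS = {
    "mobile_phone": re.compile(r"1\d{10}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def identify_meaning(value: str) -> str:
    """Return the first content meaning whose pattern fully matches the value."""
    for meaning, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return meaning
    return "unknown"


def sample_column(column: List[str], first_n: int, resample_n: int,
                  rng: random.Random) -> List[str]:
    """Sample a fixed count first; resample if the sample mixes content meanings."""
    first = rng.sample(column, min(first_n, len(column)))
    meanings = {identify_meaning(v) for v in first}
    if len(meanings) <= 1:
        return first  # one content meaning: detect on this sample
    # Mixed meanings: sample again (here simply with a larger fixed count).
    return rng.sample(column, min(resample_n, len(column)))
```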
With the above method, a machine can detect the sampled data using a sensitive data detection algorithm. In scenarios with a large data magnitude, sampling reduces the amount of data that needs to be detected while preserving detection accuracy, which improves detection efficiency. Changes in the source data can also be monitored promptly, so that changed data is detected in time.
For ease of understanding, this specification further provides an application example of the sensitive data detection method. Fig. 2 is a flow diagram of this application example, which may specifically include the following steps.
S201: A data table is newly created in the source data; the data entities in the table comprise names, identity card numbers, and contact information.
S202: Determine that the source data meets the detection execution condition "newly added data table".
S203: Based on the correspondence in Table 1, determine all columns in the newly added data table as columns to be detected; then, from the names of the three fields in the data table and a preset sensitive data name set, determine that the identity card number column to be detected belongs to sensitive data.
S204: For the name column to be detected, randomly sample 500 data items; based on a regular expression algorithm, determine that the data in the sampling result all have the same content meaning, each matching the regular expression corresponding to names; then execute S205.
S205: Based on the preset sensitive data content set, determine that names are not in the set, and therefore that the name column to be detected belongs to non-sensitive data.
S206: For the contact information column to be detected, randomly sample 500 data items; based on the regular expression algorithm, determine that the data in the sampling result have different content meanings, with some data matching the regular expression corresponding to mobile phone numbers and some matching the regular expression corresponding to email addresses; then execute S207.
S207: Further, from the 500,000 data items in the contact information column to be detected, randomly sample 10,000 data items. Based on a sensitive data detection algorithm, determine that mobile phone numbers (which belong to the sensitive data content set) account for 60% of the 10,000 sampled items; that is, the proportion of sensitive data in the contact information column is greater than the preset threshold of 30%, so the contact information column to be detected is determined to belong to sensitive data.
S204 and S206 may be performed in parallel.
Finally, the identity card number and contact information columns belong to sensitive data, and the name column belongs to non-sensitive data.
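The column-level decision in this application example can be sketched as below. The column names, the regular expressions, and the function name are hypothetical; only the 30% threshold and the two-stage check (name set first, then content proportion) come from the example above.

```python
import re
from typing import Dict, List, Pattern, Set

# Hypothetical patterns for content meanings; not taken from the specification.
CONTENT_PATTERNS: Dict[str, Pattern[str]] = {
    "mobile_phone": re.compile(r"1\d{10}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def is_sensitive_column(column_name: str, sample: List[str],
                        sensitive_names: Set[str],
                        sensitive_contents: Set[str],
                        threshold: float = 0.30) -> bool:
    """Decide by column name first (as in S203), then by content proportion (as in S207)."""
    if column_name in sensitive_names:
        return True
    if not sample:
        return False
    hits = 0
    for value in sample:
        # A value is counted once if it matches any sensitive content meaning.
        if any(meaning in sensitive_contents and pattern.fullmatch(value)
               for meaning, pattern in CONTENT_PATTERNS.items()):
            hits += 1
    return hits / len(sample) > threshold
```

A contact column whose sample is 60% mobile phone numbers exceeds the 30% threshold and is reported as sensitive, while a name column with no matching content is not.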
In addition to the above method embodiments, the present specification also provides a sensitive data detection apparatus.
Fig. 3 is a schematic structural diagram of a sensitive data detection device provided in this specification. The device is used to perform sensitive data detection on any column in a data table of source data, where the source data may include at least one data table. Detection execution conditions and the corresponding to-be-detected column determination strategies are configured in advance; see Table 1 above for details.
The device shown in Fig. 3 may include the following three units.
The determination unit 301: determines that the source data satisfies any of the detection execution conditions.
The sampling unit 302: determines the columns to be detected in the data table of the source data according to the to-be-detected column determination strategy corresponding to the satisfied detection execution condition, and samples any column to be detected.
The detection unit 303: detects the sampling result of the column to be detected based on any sensitive data detection algorithm, and takes the detection result as the detection result of the column to be detected.
The detection unit 303 may comprise the following two subunits.
Identification subunit 303a: identifies the data content of each data item in the sampling result of the column to be detected using a data content identification algorithm, and determines whether the data item is sensitive data according to the identification result.
Determination subunit 303b: determines the detection result of the sampling result of the column to be detected according to the proportion of sensitive data in that sampling result.
The sampling unit 302 may include the following two subunits.
Pre-detection subunit 302a: for any column to be detected, if the name of the column exists in the pre-stored sensitive data name set, determines the detection result of the column to be detected.
First judgment subunit 302b: if the name of the column to be detected does not exist in the pre-stored sensitive data name set, samples the column to be detected.
The sampling unit 302 may also include the following two subunits.
Random sampling subunit 302c: performs random sampling with a fixed sample count or a fixed sampling ratio on any column to be detected, and then identifies the data in the sampling result.
Second judgment subunit 302d: if the identification result is that the data in the sampling result have different content meanings, re-samples the column to be detected.
Fig. 4 is a schematic structural diagram of a sampling unit provided in this specification, in which the sampling unit 302 includes all four subunits: the pre-detection subunit 302a, the first judgment subunit 302b, the random sampling subunit 302c, and the second judgment subunit 302d. In this way, the sensitive data detection result can be determined in advance from the pre-stored sensitive data name set, and single-column data that requires sensitive data detection can be sampled differently depending on whether it contains the same or different content meanings.
Embodiments of the present specification further provide a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements a sensitive data detection method as shown in fig. 1 when executing the program.
Fig. 5 is a schematic diagram illustrating a more specific hardware structure of a computer device according to an embodiment of the present disclosure, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component within the device (not shown in the drawings) or may be externally connected to the device to provide the corresponding functions. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Embodiments of the present description also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a sensitive data detection method as shown in fig. 1.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, and the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more software and/or hardware when implementing the embodiments of the present disclosure. And part or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing is only a detailed description of the embodiments of the present disclosure. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the embodiments of the present disclosure, and these improvements and refinements should also be regarded as falling within the protection scope of the embodiments of the present disclosure.

Claims (13)

1. A sensitive data detection method for performing sensitive data detection on any column in a data table of source data, wherein the source data comprises at least one data table, and detection execution conditions and corresponding to-be-detected column determination strategies are preset; the method comprising:
determining that the source data meets any detection execution condition, wherein the detection execution conditions at least comprise: a data table being newly added, a column being newly added to any data table, or the name of any column in any data table being changed;
determining the columns to be detected in the data table of the source data according to the to-be-detected column determination strategy corresponding to the satisfied detection execution condition, and sampling any column to be detected;
detecting the sampling result of the column to be detected based on any sensitive data detection algorithm, and determining the detection result as the detection result of the column to be detected.
2. The method of claim 1, the detection execution conditions further comprising: a version update of any sensitive data detection algorithm;
the to-be-detected column determination strategy corresponding to this detection execution condition comprising: determining columns in any data table that were detected using the old version of the sensitive data detection algorithm as columns to be detected.
3. The method of claim 1, the detection execution conditions further comprising: the time length between the current time point and the last detection time point being greater than a preset time length, or the current time point being the same as a preset time point for periodically executing detection.
4. The method of claim 1, the sensitive data detection algorithm comprising a data content identification algorithm; the detecting the sampling result of the column to be detected based on any sensitive data detection algorithm comprising:
performing data content identification on each data item in the sampling result of the column to be detected using the data content identification algorithm, and determining whether the data item is sensitive data according to the identification result;
determining the detection result of the sampling result of the column to be detected according to the proportion of sensitive data in that sampling result.
5. The method of claim 1, the sampling any column to be detected comprising:
for any column to be detected, if the name of the column to be detected exists in a pre-stored sensitive data name set, determining the detection result of the column to be detected; and if the name of the column to be detected does not exist in the pre-stored sensitive data name set, sampling the column to be detected.
6. The method of claim 1, the sampling any column to be detected comprising:
performing random sampling with a fixed sample count or a fixed sampling ratio on any column to be detected, and then identifying the data in the sampling result; and if the identification result is that the data in the sampling result have different content meanings, re-sampling the column to be detected.
7. A sensitive data detection device for performing sensitive data detection on any column in a data table of source data, wherein the source data comprises at least one data table, and detection execution conditions and corresponding to-be-detected column determination strategies are preset; the device comprising:
a determination unit: determining that the source data meets any detection execution condition, wherein the detection execution conditions at least comprise: a data table being newly added, a column being newly added to any data table, or the name of any column in any data table being changed;
a sampling unit: determining the columns to be detected in the data table of the source data according to the to-be-detected column determination strategy corresponding to the satisfied detection execution condition, and sampling any column to be detected;
a detection unit: detecting the sampling result of the column to be detected based on any sensitive data detection algorithm, and determining the detection result as the detection result of the column to be detected.
8. The apparatus of claim 7, the detection execution conditions further comprising: a version update of any sensitive data detection algorithm;
the to-be-detected column determination strategy corresponding to this detection execution condition comprising: determining columns in any data table that were detected using the old version of the sensitive data detection algorithm as columns to be detected.
9. The apparatus of claim 7, the detection execution conditions further comprising: the time length between the current time point and the last detection time point being greater than a preset time length, or the current time point being the same as a preset time point for periodically executing detection.
10. The apparatus of claim 7, the sensitive data detection algorithm comprising a data content identification algorithm; the detection unit comprising:
an identification subunit: performing data content identification on each data item in the sampling result of the column to be detected using the data content identification algorithm, and determining whether the data item is sensitive data according to the identification result;
a determination subunit: determining the detection result of the sampling result of the column to be detected according to the proportion of sensitive data in that sampling result.
11. The apparatus of claim 7, the sampling unit comprising:
a pre-detection subunit: for any column to be detected, if the name of the column to be detected exists in a pre-stored sensitive data name set, determining the detection result of the column to be detected;
a first judgment subunit: if the name of the column to be detected does not exist in the pre-stored sensitive data name set, sampling the column to be detected.
12. The apparatus of claim 7, the sampling unit comprising:
a random sampling subunit: performing random sampling with a fixed sample count or a fixed sampling ratio on any column to be detected, and then identifying the data in the sampling result;
a second judgment subunit: if the identification result is that the data in the sampling result have different content meanings, re-sampling the column to be detected.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 6 when executing the program.
CN202010767486.7A 2020-08-03 2020-08-03 Sensitive data detection method and device Pending CN111914130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010767486.7A CN111914130A (en) 2020-08-03 2020-08-03 Sensitive data detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010767486.7A CN111914130A (en) 2020-08-03 2020-08-03 Sensitive data detection method and device

Publications (1)

Publication Number Publication Date
CN111914130A true CN111914130A (en) 2020-11-10

Family

ID=73287024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010767486.7A Pending CN111914130A (en) 2020-08-03 2020-08-03 Sensitive data detection method and device

Country Status (1)

Country Link
CN (1) CN111914130A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination