CN116894073A

CN116894073A - Sensitive data identification method, device and storage medium

Info

Publication number: CN116894073A
Application number: CN202310833297.9A
Authority: CN
Inventors: 王铮
Original assignee: China Telecom Technology Innovation Center; China Telecom Corp Ltd
Current assignee: China Telecom Technology Innovation Center; China Telecom Corp Ltd
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-10-17

Abstract

The application discloses a sensitive data identification method, a sensitive data identification device and a storage medium. In the method, an electronic device obtains a first data set and a second data set, wherein the first data set comprises sensitive data of N fields and the second data set comprises data to be detected of M fields; the first data set is a pre-stored, labeled sensitive data set, and the second data set is an acquired, unlabeled data set. The electronic device merges the data of the fields having the same character type in the first data set and the second data set to obtain a third data set, which comprises data of S fields. The electronic device clusters the data in the third data set to obtain R classes of data. The electronic device determines the distribution difference, across the R classes of data, between the sensitive data and the data to be detected in the third data set; if the distribution difference is smaller than a first preset threshold, the data in the third data set is determined to be sensitive data. The method can improve the efficiency of sensitive data identification.

Description

Sensitive data identification method, device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and apparatus for identifying sensitive data, and a storage medium.

Background

Structured data refers to data stored in databases or tabular text files (e.g., Excel, CSV). Structured data can generally be identified by its fields, and whether the data under a field is sensitive can be judged from the field. For example, if a bank database stores numerical data under the field "amount", the data in that field can be determined to be sensitive user data. However, in some cases a data field is missing, so whether the data is sensitive must be judged from the data itself. At present, sensitive data identification is mainly carried out manually, which is inefficient.

Disclosure of Invention

The application provides a sensitive data identification method, a device and a storage medium, which are used for solving the problem of low efficiency of sensitive data identification.

In a first aspect, the present application provides a method for identifying sensitive data. The method is applicable to an electronic device with processing capability and comprises the following steps. The electronic device obtains a first data set and a second data set, wherein the first data set comprises sensitive data of N fields, and the second data set comprises data to be detected of M fields; the first data set is a pre-stored, labeled sensitive data set, the second data set is an acquired, unlabeled data set, and M and N are positive integers. The electronic device merges the data of the fields having the same character type in the first data set and the second data set to obtain a third data set, wherein the third data set comprises data of S fields, and S is a positive integer. The electronic device clusters the data in the third data set to obtain R classes of data, wherein R is a positive integer. The electronic device determines the distribution difference, across the R classes of data, between the sensitive data and the data to be detected in the third data set, and if the distribution difference is smaller than a first preset threshold, the data in the third data set is determined to be sensitive data.

In the embodiment of the application, the fields among the M fields of the second data set that have the same character type (for example, floating point numbers or letters) as one of the N fields of the first data set are merged with those fields to obtain a third data set. This preliminarily screens out the fields in the second data set that obviously do not contain sensitive data, which improves identification efficiency. The electronic device then clusters the data in the third data set to obtain R classes of data, and determines whether the data in the third data set is sensitive data according to the distribution difference, across the R classes, between the sensitive data and the data to be detected. Because the identification result is obtained from the distribution of the data after clustering, sensitive data is identified more efficiently.

Optionally, the electronic device merges the data of the fields with the same character type in the first data set and the second data set, including: the electronic device samples a first field of the N fields of the first data set and a second field of the M fields of the second data set, respectively, to obtain a plurality of first sampling data and a plurality of second sampling data. The electronic device determines whether the character types of the plurality of first sampling data and the plurality of second sampling data are the same. If the character types are the same, the electronic device merges the data of the first field with the data of the second field, and the merged data is the data of one field in the third data set.

In the embodiment of the application, whether the character types of the two fields of the first data set and the second data set are the same or not is compared through sampling, and if the character types of the two fields are the same, the data of the two fields are combined into the data of one field in the third data set. The efficiency of merging can be improved by sampling.

Optionally, the electronic device determines whether the character types of the plurality of first sampled data and the plurality of second sampled data are the same, including: the electronic device respectively determines statistical parameters of the first sampling data and the second sampling data, wherein the statistical parameters comprise a mean value and/or a variance. The electronic device determines whether the character types of the first sampling data and the second sampling data are the same according to whether the difference of the statistical parameters of the first sampling data and the second sampling data is smaller than a second preset threshold value.

In the embodiment of the application, the electronic device can judge the character type only with a certain accuracy, so when judging whether the character types of the plurality of first sampling data and the plurality of second sampling data are the same, it can use the statistical parameters of the sampled data. If the difference between the statistical parameters is smaller than the second preset threshold, the statistical distributions of the two samples can be considered approximately the same, and their character types can be determined to be the same.

Optionally, the electronic device determines the distribution difference, across the R classes of data, between the sensitive data and the data to be detected in the third data set, including: the electronic device determines a first set according to the proportion of the sensitive data falling in each of the R classes of data, the first set representing the distribution of the sensitive data over the R classes. The electronic device determines a second set according to the proportion of the data to be detected falling in each of the R classes of data, the second set representing the distribution of the data to be detected over the R classes. The electronic device determines the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected over the R classes of data.

In the embodiment of the application, the first set corresponding to the sensitive data and the second set of the data to be detected are obtained by clustering the data in the third data set. Whether the data to be detected in the third data set is sensitive data or not can be determined according to the Euclidean distance between the first set and the second set, and the efficiency is high.
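As a worked form of the comparison described above, the distribution difference can be written out as follows. This is a sketch of the formula implied by the text: the element names a_i and b_i follow the notation used later in the description for the first and second sets, while the symbols d and T_1 (for the distribution difference and the first preset threshold) are assumed here for illustration only.

```latex
d = \sqrt{\sum_{i=1}^{R} \left(a_i - b_i\right)^2},
\qquad
\text{data to be detected is treated as sensitive if } d < T_1 .
```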

Optionally, before the electronic device clusters the data in the third data set, the method further includes: the electronic device converts character-type data in the third data set into numeric-type data.

In the embodiment of the application, the character type data in the third data set is converted into the numerical value type data, so that the processing efficiency of the electronic equipment can be improved.

In a second aspect, the present application provides a sensitive data identification apparatus. The apparatus comprises an acquisition module, a merging module, a clustering module and a determining module. The acquisition module is used for acquiring a first data set and a second data set, wherein the first data set comprises sensitive data of N fields, and the second data set comprises data to be detected of M fields; the first data set is a pre-stored, labeled sensitive data set, the second data set is an acquired, unlabeled data set, and M and N are positive integers. The merging module is used for merging the data of the fields having the same character type in the first data set and the second data set to obtain a third data set, wherein the third data set comprises data of S fields, and S is a positive integer. The clustering module is used for clustering the data in the third data set to obtain R classes of data, wherein R is a positive integer. The determining module is used for determining the distribution difference, across the R classes of data, between the sensitive data and the data to be detected in the third data set, and if the distribution difference is smaller than a first preset threshold, determining the data in the third data set to be sensitive data.

Optionally, the merging module is specifically configured to: sample a first field of the N fields of the first data set and a second field of the M fields of the second data set, respectively, to obtain a plurality of first sampling data and a plurality of second sampling data; determine whether the character types of the plurality of first sampling data and the plurality of second sampling data are the same; and, if the character types are the same, merge the data of the first field with the data of the second field, the merged data being the data of one field in the third data set.

Optionally, the merging module is specifically configured to: determine statistical parameters of the first sampling data and of the second sampling data, respectively, wherein the statistical parameters comprise a mean value and/or a variance; and determine whether the character types of the first sampling data and the second sampling data are the same according to whether the difference between their statistical parameters is smaller than a second preset threshold.

Optionally, the determining module is specifically configured to: determine a first set according to the proportion of the sensitive data falling in each of the R classes of data, the first set representing the distribution of the sensitive data over the R classes; determine a second set according to the proportion of the data to be detected falling in each of the R classes of data, the second set representing the distribution of the data to be detected over the R classes; and determine the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected over the R classes of data.

Optionally, the clustering module is further configured to: the character type data in the third data set is converted into numerical type data.

In a third aspect, an embodiment of the present application provides an electronic device including a processor and a memory communicatively coupled to the processor. Wherein the memory stores computer-executable instructions that are executed by the processor to enable the processor to perform the method of any one of the first aspects.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform the method of any one of the first aspects.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program stored in a computer-readable storage medium, from which a processor can read the computer program, the processor implementing the method according to any one of the first aspects above when executing the computer program.

Drawings

FIG. 1 is a flow chart of a method for identifying sensitive data according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a sensitive data identification device according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to better understand the solution provided by the embodiments of the present application, some technical concepts related to the embodiments of the present application are first described. In the technical scheme of the application, the data acquisition, transmission, use and the like all meet the requirements of national relevant laws and regulations.

As previously described, whether structured data in a database table is sensitive can be determined from the field of the data. However, in some cases a data field is missing, and the data must then be identified from the data itself. Tables 1 and 2 below are used as examples.

TABLE 1

TABLE 2

Table 1 above shows how data is stored in a database table under normal conditions. The data fields in Table 1 are clear, so whether the data is sensitive can be known from its field. For example, the "name" field contains user names, among them the sensitive data "hu_jingtao" (primordial name). For another example, the data in the "amount" field is sensitive because it relates to user privacy. Whether data is sensitive can thus be quickly known from the field it belongs to. As shown in Table 2, when a data field is missing (the missing field is indicated by a placeholder in the table), the meaning of the data cannot be read from the field. For example, the meaning represented by the data "4256.27", "54212.65", "1684.74" in Table 2 cannot be intuitively determined.

The simplest approach is to identify the data item by item manually, but manual identification is inefficient when the data volume is large. Some machine learning algorithms are currently used to identify sensitive data, but they all require the data to be identified to be input into a model and identified one item at a time, so the identification efficiency is still low.

In view of this, the embodiment of the application provides a sensitive data identification method that merges the data to be detected with known sensitive data, clusters the merged data, and determines whether the data to be detected is sensitive data according to the distribution difference between the sensitive data and the data to be detected after clustering. Because the sensitive data is identified from the distribution of the data, the identification efficiency is higher.

Referring to fig. 1, a flow chart of a sensitive data identification method according to an embodiment of the present application is shown. The flow shown in fig. 1 is exemplified by the electronic device performing sensitive data identification. The electronic device may be any device having a data storage function, for example, the electronic device may be a terminal, such as a smart phone, a tablet computer, a desktop computer, or the like.

For ease of description, in the following description, a field-clear dataset may be distinguished from a field-missing dataset, and the field-clear dataset is referred to as a first dataset, as in table 1. The data set with the missing field is referred to as the second data set, as in table 2. That is, the data in the second data set needs to be identified. It should be understood that references herein to "first" and "second" are used merely to distinguish between data sets and are not intended to limit the size, content, order, timing, priority or importance of the data sets, etc.

S101, the electronic device acquires a first data set and a second data set, wherein the first data set comprises sensitive data of N fields, and the second data set comprises data to be detected of M fields; the first data set is a pre-stored, labeled sensitive data set, the second data set is an acquired, unlabeled data set, and M and N are positive integers.

The electronic device obtains a first data set and a second data set. The data in the first data set is sensitive data with clear field identification. The first data set may be a data set acquired in advance by the electronic device, in which the sensitive data has been manually identified and its fields labeled. After the first data set is labeled, sensitive data of N fields is obtained, where N is a positive integer. The electronic device may pre-store the labeled first data set.

The second data set is a data set acquired by the electronic device but not marked, and the fields in the second data set may be partially or completely missing. Thus, the data in the second data set may have both sensitive and non-sensitive data, and the sensitive data in the second data set needs to be identified. For ease of description, the data in the second data set will be referred to as data to be detected. The number of fields in the second data set is M, that is, the second data set includes M fields of data to be detected.

S102, the electronic device merges the data of the fields having the same character type in the first data set and the second data set to obtain a third data set, wherein the third data set comprises data of S fields, and S is a positive integer.

In the embodiment of the application, identifying the data in the second data set directly from that data alone is inefficient, so the first data set, which is known labeled sensitive data, can be used to help identify the second data set. The first data set comprises data of N fields, all of which is sensitive data; the second data set comprises data to be detected of M fields, not all of which is sensitive data. The character type of data in the second data set may be completely different from that of the data in the first data set, or may be the same as a character type in the first data set even though the data is not sensitive. Therefore, when identifying the data to be detected in the second data set, the electronic device can disregard the fields that obviously do not contain sensitive data, which reduces the amount of data to be identified and improves detection efficiency. For example, if an unlabeled field in the second data set contains the data "10:34", "09:42", "16:23", …, the field can be judged to be "time" from the character type of the data, and the data under the field is non-sensitive. For another example, if an unlabeled field in the second data set contains the data "SAFGJAIOJIAJDO", "OPQJTEQIOTAD", "POQTEUPODFDAF", …, the data in that field is also clearly non-sensitive.

To identify whether the data character type of the second data set is the same as the character type of the first data set, the electronic device may compare one of the N fields of the first data set (e.g., the first field) with one of the M fields of the second data set (e.g., the second field) one by sampling.

The electronic device may sample the first data set to obtain a plurality of first sampled data and sample the second data set to obtain a plurality of second sampled data. If the field to which the plurality of second sampled data belongs is not a sensitive-data field, its character type may be completely different from that of the plurality of first sampled data. For example, if the plurality of second sampled data are English letter strings such as "SAFGJAIOJIAJDO" and "OPQJTEQIOTAD", while the plurality of first sampled data are floating point numbers such as "4256.27" and "54212.65", it can be determined that the character types of the two are not the same. It should be understood that the number of sampled data should be within a reasonable range: too few samples may not be representative, and too many may reduce the processing efficiency of the electronic device. The specific number may be determined according to actual requirements and is not particularly limited by the embodiment of the present application.
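The sampling comparison above can be sketched as follows. This is a minimal illustration, not the method's actual implementation: the coarse categories ("integer", "float", "time", "letters", "other"), the regular expressions, and the helper names `char_type` and `sampled_types` are all assumptions introduced here.

```python
import random
import re


def char_type(value):
    """Classify one value into a coarse character type (hypothetical categories)."""
    if re.fullmatch(r"[+-]?\d+", value):
        return "integer"
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return "float"
    if re.fullmatch(r"\d{1,2}:\d{2}", value):
        return "time"
    if re.fullmatch(r"[A-Za-z_]+", value):
        return "letters"
    return "other"


def sampled_types(column, k=3, seed=0):
    """Sample up to k values from a column and return the set of their types."""
    rng = random.Random(seed)
    sample = rng.sample(column, min(k, len(column)))
    return {char_type(v) for v in sample}


# Example values taken from the description.
first_field = ["4256.27", "54212.65", "1684.74"]
second_field = ["SAFGJAIOJIAJDO", "OPQJTEQIOTAD", "POQTEUPODFDAF"]

# The two fields are merge candidates only if their sampled types agree.
same = sampled_types(first_field) == sampled_types(second_field)
```

The sample size `k` is the tunable quantity the text refers to: a small `k` is cheap but may miss mixed content in a column, while a large `k` costs more per field pair.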

Specifically, the electronic device may determine whether the character types of the plurality of first sampled data and the plurality of second sampled data are the same from their statistical parameters, where the statistical parameters include a mean and/or a variance. That is, the electronic device may calculate the mean and/or variance of the plurality of first sampled data and the mean and/or variance of the plurality of second sampled data. Since the statistical parameters characterize the distribution of the data, the more similar the distributions of the two samples are, the higher the probability that their character types are the same. For example, if the first sampled data and the second sampled data are both "height" values, the means and/or variances calculated by the electronic device should be similar and lie within the height range of a normal person, and the electronic device can determine that the character types of the two are the same. For another example, if the first sampled data and the second sampled data are both letter data such as "names", the electronic device may vectorize them using word2vec and calculate the mean and/or variance of the vectorized first sampled data and of the vectorized second sampled data; if the two are similar, the electronic device may determine that the character types of the two samples are the same.
On the other hand, if the plurality of first sampled data are floating point numbers and the plurality of second sampled data are strings of English letters, the statistical parameters of the two are obviously different, and the electronic device can determine that the character types of the plurality of first sampled data and the plurality of second sampled data are different.
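For numeric fields, the statistical parameters described above could be computed as in the sketch below. The height values are made-up illustration data, the helper name `sample_stats` is hypothetical, and the use of population variance (`statistics.pvariance`) rather than sample variance is an assumption, since the text does not say which is meant.

```python
import random
import statistics


def sample_stats(column, k, seed=0):
    """Sample up to k values from a numeric column; return (mean, variance)."""
    rng = random.Random(seed)
    sample = rng.sample(column, min(k, len(column)))
    # Population variance is an assumption; the text only says "variance".
    return statistics.mean(sample), statistics.pvariance(sample)


# Made-up "height" values in centimetres for the two fields being compared.
heights_first = [172.0, 165.5, 180.2, 158.9, 175.3]
heights_second = [170.1, 168.4, 177.8, 160.2, 173.6]

mean_1, var_1 = sample_stats(heights_first, k=5)
mean_2, var_2 = sample_stats(heights_second, k=5)
```

Both means land near 170, which is the kind of agreement the "height" example in the text relies on.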

The statistical parameters characterize how close the distributions of the plurality of first sampled data and the plurality of second sampled data are, and the two distributions may differ. The electronic device may therefore set a second preset threshold for the statistical parameters: when the difference between the statistical parameters of the two samples is smaller than the second preset threshold, their distributions are similar enough, and their character types may be considered the same. The second preset threshold is a preset threshold on the difference in mean and/or variance of the sampled data; as an example, the second preset threshold for the mean may be 2 and that for the variance may be 3. That is, if the difference between the means of the plurality of first sampled data and the plurality of second sampled data is less than 2 and the difference between their variances is less than 3, the character types of the two may be regarded as the same. Conversely, if the difference between the means is greater than or equal to 2, or the difference between the variances is greater than or equal to 3, the character types may be considered different, and the data corresponding to the second field may be disregarded.
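The threshold rule in this paragraph can be expressed as a small predicate. This is a sketch under the example values from the text (2 for the mean, 3 for the variance); the function name `same_character_type` and the `(mean, variance)` tuple layout are assumptions made for illustration.

```python
def same_character_type(stats_1, stats_2, mean_threshold=2.0, variance_threshold=3.0):
    """Return True when both the mean gap and the variance gap are under threshold.

    stats_1 and stats_2 are (mean, variance) pairs for the two samples.
    The default thresholds are the example values given in the text.
    """
    mean_1, var_1 = stats_1
    mean_2, var_2 = stats_2
    return (abs(mean_1 - mean_2) < mean_threshold
            and abs(var_1 - var_2) < variance_threshold)


# Close means and variances: treated as the same character type.
close = same_character_type((170.4, 50.0), (170.0, 51.5))
# Mean gap far above 2: treated as different, so the second field is disregarded.
far_mean = same_character_type((170.4, 50.0), (65.0, 51.0))
# Variance gap of 10 exceeds 3: also treated as different.
far_var = same_character_type((170.0, 50.0), (170.5, 60.0))
```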

The electronic device may compare each field of the second data set with each field of the first data set one by one, and merge the fields whose data has the same character type. Whether the character types are the same is determined from the sampled data, that is, from the plurality of first sampled data and the plurality of second sampled data; the merge itself, however, covers all the data of the first field from which the first sampled data was drawn and all the data of the second field from which the second sampled data was drawn. The data obtained by merging the first field with the second field is the data of one field in the third data set. By merging the data of the N fields in the first data set with the data of those of the M fields in the second data set that have the same character type, the electronic device obtains a third data set, whose number of fields can be recorded as S, where S is a positive integer. It should be appreciated that not all data in the third data set is sensitive data, so the data in the third data set requires further determination.
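The pairwise merge described above might look like the following sketch. The dictionary layout, the field names `amount`, `col_0` and `col_1`, and the stand-in check `both_numeric` are all hypothetical; in particular, `both_numeric` inspects whole columns for brevity, whereas the text decides on sampled data.

```python
def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def both_numeric(col_1, col_2):
    """Stand-in character-type check: both columns hold only numeric strings."""
    return all(is_number(v) for v in col_1 + col_2)


def merge_fields(first_set, second_set, same_type):
    """Compare every field pair and merge all data of the matching pairs."""
    third_set = []
    for name_1, col_1 in first_set.items():
        for name_2, col_2 in second_set.items():
            if same_type(col_1, col_2):
                third_set.append({
                    "first_field": name_1,
                    "second_field": name_2,
                    # Origin tags are kept so the later distribution
                    # comparison can tell the two sources apart.
                    "rows": [(v, "sensitive") for v in col_1]
                            + [(v, "to_detect") for v in col_2],
                })
    return third_set


first_set = {"amount": ["4256.27", "54212.65"]}
second_set = {
    "col_0": ["1684.74", "99.50"],
    "col_1": ["SAFGJAIOJIAJDO", "OPQJTEQIOTAD"],
}
third_set = merge_fields(first_set, second_set, both_numeric)
```

Here the letter-string field `col_1` is dropped at the merge stage, so only one merged field reaches clustering.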

S103, the electronic device clusters the data in the third data set to obtain R classes of data, wherein R is a positive integer.

The electronic device may employ a density-based clustering algorithm (DBSCAN) to cluster data in the third dataset. To facilitate understanding of the present solution, a brief description of the DBSCAN algorithm follows.

The DBSCAN algorithm is a density-based clustering algorithm that generally assumes that a class can be determined by how tightly its samples are distributed. The DBSCAN algorithm requires a distance metric to be selected. For the data set to be clustered, the distance between any two points reflects the density between them, and whether two points can be gathered into the same class is judged from that distance; that is, if the distance between two points meets a certain condition, the two points can be considered closely connected and classified into the same class. The DBSCAN algorithm requires the user to input two parameters. One is the radius (EPS), which defines the extent of the circular neighborhood centered at a given point P. The other is the minimum number of points (MinPts) within the neighborhood centered at point P. If the number of points in the neighborhood centered at P with radius EPS is not less than MinPts, P is called a core point. Starting from P, points within radius EPS are found repeatedly until the above condition can no longer be satisfied, and the points found are clustered into the same class, that is, one cluster.
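To make the core-point and expansion rules concrete, here is a minimal pure-Python DBSCAN for one-dimensional data. It is a teaching sketch, not the implementation used in the application: the function names are hypothetical, absolute difference is assumed as the distance metric, and a point is counted as a member of its own neighborhood (a common convention that the text does not specify).

```python
def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]


def dbscan_1d(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # not a core point; may later become a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


points = [1.0, 1.2, 1.1, 8.0, 8.3, 8.1, 50.0]
labels = dbscan_1d(points, eps=0.5, min_pts=3)
```

With these inputs the two tight groups become two clusters and the isolated value 50.0 is labelled as noise.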

The electronic device can calculate the distances between samples in the third data set with the DBSCAN algorithm and, by setting the radius parameter EPS, determine which samples in the third data set are closely connected and place them in one class, thereby obtaining one cluster. Grouping every set of closely connected samples into its own class yields the final clustering result. By clustering the third data set in this way, the electronic device obtains R clusters, that is, R classes of data. Before clustering the data in the third data set, the electronic device may convert the character-type data in the third data set into numerical data, so as to improve the efficiency with which it clusters the third data set.
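The character-to-numeric conversion mentioned above is not specified in detail here, so the following is only one possible stand-in. The function `encode` and its features (string length and mean code point) are assumptions chosen for simplicity; the description elsewhere mentions word2vec for letter data, which this sketch does not attempt to reproduce.

```python
def encode(value):
    """Turn one string value into a short list of numbers (hypothetical encoding)."""
    try:
        # Numeric strings are used directly as a one-element vector.
        return [float(value)]
    except ValueError:
        if not value:
            return [0.0, 0.0]
        codes = [ord(ch) for ch in value]
        # Non-numeric strings: length and mean code point as crude features.
        return [float(len(codes)), sum(codes) / len(codes)]


numeric_vec = encode("4256.27")
letter_vec = encode("AB")
```

Because every field of the third data set holds data of a single character type, all values in a given field are encoded to vectors of the same length, which is what a distance-based clustering step needs.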

S104, the electronic device determines the distribution difference, across the R classes of data, between the sensitive data and the data to be detected in the third data set, and if the distribution difference is smaller than a first preset threshold, the data in the third data set is determined to be sensitive data.

The data in any one of the S fields of the third data set is merely data of the same character type; although the character types are the same, the quantities the data characterizes are not necessarily the same. That is, non-sensitive data from the second data set may have been merged into the third data set, so a given field among the S fields may contain a considerable amount of non-sensitive data. For example, when the first data set and the second data set are merged in step S102 on the floating point character type, one of the S fields of the resulting third data set may contain data such as "amount", "height" and "weight" from the first data set and the second data set. Therefore, the data in the third data set can be further identified using the R classes of data obtained by clustering the third data set.

Because the third data set contains both the sensitive data from the first data set and the data to be detected from the second data set, after the data in the third data set is clustered, the sensitive data and the data to be detected may be distributed differently over the R classes of data. For example, if fields of non-sensitive data in the second data set were merged in, the data to be detected contains a larger share of non-sensitive data. During clustering, this non-sensitive data forms clusters far from the other data, that is, relatively independent clusters, which makes the distribution difference between the sensitive data and the data to be detected larger. Therefore, the electronic device may determine the distribution difference between the sensitive data and the data to be detected over the R classes of data of the third data set; if the distribution difference is too large, the data to be detected can be considered to include a non-sensitive field, and not all of the data to be detected is sensitive data.

Because the electronic device clusters the third data set into R-class data, the distribution of data across the R-class data can be determined from the proportion of the data that falls in each class. The proportion may be expressed as a decimal fraction or as a percentage. Taking decimal fractions as an example, the electronic device obtains R fractions representing the proportions, which may be written as $(n_1, n_2, \dots, n_R)$. The set formed by these R fractions characterizes the distribution of data in the R-class data. It should be understood that the elements of the set sum to 1, that is, $\sum_{i=1}^{R} n_i = 1$.
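As a small illustration of this proportion set, the sketch below computes the fraction of items that fall in each of the R classes from a list of cluster labels. The function name `class_proportions` and the label values are hypothetical, and classes are assumed to be numbered 0 to R-1.

```python
from collections import Counter
from typing import List, Sequence


def class_proportions(labels: Sequence[int], r: int) -> List[float]:
    """Return (n_1, ..., n_R): the fraction of items in each of the r classes."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    # Classes with no members get proportion 0.0, so the result always has r entries.
    return [counts.get(c, 0) / total for c in range(r)]


# Hypothetical cluster labels for eight data items and R = 3 classes.
props = class_proportions([0, 0, 1, 2, 2, 2, 1, 0], r=3)
```

Every label contributes to exactly one class, so the returned values sum to 1.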

The electronic device may determine a first set according to the proportion of the sensitive data that falls in each class of the R-class data, to characterize the distribution of the sensitive data in the R-class data, where the first set may be written as $(a_1, a_2, \dots, a_R)$. Likewise, the electronic device may determine a second set according to the proportion of the data to be detected that falls in each class of the R-class data, to characterize the distribution of the data to be detected in the R-class data, where the second set may be written as $(b_1, b_2, \dots, b_R)$. The electronic device may use the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected in the R-class data. The electronic device may set a first preset threshold to determine whether the distribution difference between the first set and the second set is too large. For example, the electronic device may set the first preset threshold to 4. When the distribution difference between the first set and the second set is greater than or equal to the first preset threshold, the difference between the data to be detected and the sensitive data is considered large, and the data to be detected is considered not to be sensitive data. Correspondingly, when the distribution difference between the first set and the second set is smaller than the first preset threshold, the difference between the data to be detected and the sensitive data is considered small enough, and the data to be detected is considered to be sensitive data.
It should be understood that the embodiments of the present application do not limit the specific value of the first preset threshold. The electronic device may set the first preset threshold small enough that the distribution of the data to be detected is as consistent as possible with that of the sensitive data, which helps ensure the accuracy of identifying the data to be detected as sensitive data.
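The comparison of the first set and the second set can be sketched as below. The function names, the example proportions and the threshold value of 0.2 are all hypothetical and chosen only for illustration; the text leaves the threshold value open.

```python
import math
from typing import Sequence


def distribution_difference(first_set: Sequence[float],
                            second_set: Sequence[float]) -> float:
    """Euclidean distance between the two proportion sets (a_i) and (b_i)."""
    if len(first_set) != len(second_set):
        raise ValueError("both sets must cover the same R classes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first_set, second_set)))


def is_sensitive(first_set: Sequence[float],
                 second_set: Sequence[float],
                 first_threshold: float) -> bool:
    """Treat the data to be detected as sensitive only if the difference is below the threshold."""
    return distribution_difference(first_set, second_set) < first_threshold


# Hypothetical proportions over R = 3 classes.
a = [0.5, 0.3, 0.2]           # sensitive data
b_close = [0.45, 0.35, 0.2]   # data to be detected, similar distribution
b_far = [0.0, 0.0, 1.0]       # data to be detected, concentrated in one class

THRESHOLD = 0.2  # assumed value, for illustration only
close_result = is_sensitive(a, b_close, THRESHOLD)
far_result = is_sensitive(a, b_far, THRESHOLD)
```

Because both sets consist of non-negative proportions that sum to 1, their Euclidean distance cannot exceed √2, so a threshold used with this sketch would be chosen below that value.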

Referring to fig. 2, based on the same inventive concept, an embodiment of the present application provides a sensitive data identification apparatus 200. The apparatus 200 comprises an acquisition module 201, a merging module 202, a clustering module 203 and a determining module 204. The acquisition module 201 is configured to acquire a first data set and a second data set, where the first data set includes sensitive data of N fields, and the second data set includes data to be detected of M fields; the first data set is a pre-stored marked sensitive data set, the second data set is an acquired unmarked data set, and M and N are positive integers. The merging module 202 is configured to merge the data of the fields with the same character type in the first data set and the second data set to obtain a third data set, where the third data set includes data of S fields, and S is a positive integer. The clustering module 203 is configured to cluster the data in the third data set to obtain R-class data, where R is a positive integer. The determining module 204 is configured to determine a distribution difference between the sensitive data and the data to be detected in the R-class data of the third data set, and to determine that the data in the third data set is sensitive data if the distribution difference is smaller than a first preset threshold.

Optionally, the merging module 202 is specifically configured to: sample a first field of the N fields of the first data set and a second field of the M fields of the second data set respectively, to obtain a plurality of first sampling data and a plurality of second sampling data; determine whether the character types of the plurality of first sampling data and the plurality of second sampling data are the same; and, if the character types are the same, merge the data of the first field with the data of the second field, where the merged data is the data of a field in the third data set.
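A minimal sketch of this sample-then-compare step is given below. It assumes that raw field values arrive as strings and that the character types of interest are integer, floating-point and string; these categories, the sample size and every function name are illustrative assumptions rather than details taken from the text.

```python
import random
from typing import Sequence, Set


def character_type(value: str) -> str:
    """Classify one raw value as 'integer', 'float' or 'string' (assumed categories)."""
    text = value.strip()
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "string"


def sampled_types(field: Sequence[str], k: int, rng: random.Random) -> Set[str]:
    """Draw up to k values from the field and collect their character types."""
    sample = rng.sample(list(field), min(k, len(field)))
    return {character_type(v) for v in sample}


def same_character_type(first_field: Sequence[str],
                        second_field: Sequence[str],
                        k: int = 5, seed: int = 0) -> bool:
    """True if both samples consist of one and the same character type."""
    rng = random.Random(seed)
    first = sampled_types(first_field, k, rng)
    second = sampled_types(second_field, k, rng)
    return len(first) == 1 and first == second


# Hypothetical fields: one labelled sensitive field, two unlabelled fields.
amounts = ["1200.50", "980.00", "1100.25"]
heights = ["1.62", "1.75", "1.80"]
names = ["Alice", "Bob", "Carol"]

merge_heights = same_character_type(amounts, heights)
merge_names = same_character_type(amounts, names)
```

Here `amounts` and `heights` share the floating-point type and would be merged into one field of the third data set, whereas `names` would not.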

Optionally, the merging module 202 is specifically configured to: determine statistical parameters of the plurality of first sampling data and the plurality of second sampling data respectively, where the statistical parameters include a mean and/or a variance; and determine whether the character types of the plurality of first sampling data and the plurality of second sampling data are the same according to whether the difference between the statistical parameters of the plurality of first sampling data and the plurality of second sampling data is smaller than a second preset threshold.
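The statistical comparison can be sketched as follows. The text does not specify how the difference between statistical parameters is measured, so this sketch assumes absolute differences of the mean and of the population variance, both checked against one shared threshold; the function name and the sample values are hypothetical.

```python
from statistics import mean, pvariance
from typing import Sequence


def statistics_match(first_sample: Sequence[float],
                     second_sample: Sequence[float],
                     second_threshold: float) -> bool:
    """Compare mean and variance of two samples against the second preset threshold.

    Using absolute differences and a single shared threshold is an assumption.
    """
    mean_gap = abs(mean(first_sample) - mean(second_sample))
    variance_gap = abs(pvariance(first_sample) - pvariance(second_sample))
    return mean_gap < second_threshold and variance_gap < second_threshold


# Hypothetical numeric samples.
first = [10.0, 12.0, 11.0]
similar = [10.5, 11.5, 11.0]
different = [1.6, 1.7, 1.8]

similar_result = statistics_match(first, similar, second_threshold=1.0)
different_result = statistics_match(first, different, second_threshold=1.0)
```

Since the text allows the mean "and/or" the variance, a variant that checks only one of the two parameters would fit the description equally well.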

Optionally, the determining module 204 is specifically configured to: determine a first set according to the proportion of the sensitive data that falls in each class of the R-class data, where the first set is used to characterize the distribution of the sensitive data in the R-class data; determine a second set according to the proportion of the data to be detected that falls in each class of the R-class data, where the second set is used to characterize the distribution of the data to be detected in the R-class data; and determine the Euclidean distance between the first set and the second set as the distribution difference between the sensitive data and the data to be detected in the R-class data.

Optionally, the clustering module 203 is further configured to convert the character-type data in the third data set into numerical data.
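One possible way to perform such a conversion is sketched below. The passage does not describe the encoding, so the feature choice here (total length plus counts of digits, letters and other characters) and the function name `encode_text` are purely illustrative assumptions.

```python
from typing import List


def encode_text(value: str) -> List[float]:
    """Map a character string to a small numeric feature vector (assumed encoding)."""
    digits = sum(ch.isdigit() for ch in value)
    letters = sum(ch.isalpha() for ch in value)
    others = len(value) - digits - letters
    return [float(len(value)), float(digits), float(letters), float(others)]


# Hypothetical character-type values from a merged field.
phone_vec = encode_text("13800138000")
mail_vec = encode_text("a@b.cn")
```

Vectors of this kind give a clustering algorithm numeric inputs to work with, although any encoding that preserves the distinguishing features of the data could be substituted.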

Referring to fig. 3, based on the same inventive concept, an embodiment of the present application provides an electronic device 300, including: at least one processor 301, at least one memory 302, and computer program instructions stored in the memory, which when executed by the processor implement a sensitive data identification method as described above.

Optionally, the processor 301 may be a central processing unit, an application-specific integrated circuit (ASIC), one or more integrated circuits for controlling program execution, a hardware circuit developed using a field-programmable gate array (FPGA), or a baseband processor.

Optionally, the electronic device 300 further includes a memory 302 connected to the at least one processor 301, where the memory 302 may include a read-only memory (ROM), a random access memory (RAM), and a disk memory. The memory 302 is used to store data required by the processor 301 when it is running. The number of memories 302 is one or more. The memory 302 is shown in fig. 3; however, it should be noted that the memory 302 is not an essential functional block and is therefore drawn with a broken line in fig. 3.

Based on the same inventive concept, the embodiments of the present application also provide a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the sensitive data identification method as described above.

In a specific implementation, the computer-readable storage medium includes any medium that can store program code, such as a universal serial bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above. The specific working processes of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which are not described herein.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the modules or units is merely a logical functional division, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a universal serial bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method for identifying sensitive data, comprising:

acquiring a first data set and a second data set, wherein the first data set comprises sensitive data of N fields, and the second data set comprises data to be detected of M fields; the first data set is a pre-stored marked sensitive data set, the second data set is an acquired unmarked data set, and M and N are positive integers;

combining the data of the fields with the same character type in the first data set and the second data set to obtain a third data set, wherein the third data set comprises data of S fields, and S is a positive integer;

clustering the data in the third data set to obtain R-class data, wherein R is a positive integer;

determining the distribution difference of the sensitive data and the data to be detected in the R-class data of the third data set, and determining the data in the third data set as sensitive data if the distribution difference is smaller than a first preset threshold value.

2. The method of claim 1, wherein merging data of fields of the same character type in the first data set and the second data set comprises:

sampling a first field of the N fields of the first data set and a second field of the M fields of the second data set respectively to obtain a plurality of first sampling data and a plurality of second sampling data;

determining whether character types of the plurality of first sample data and the plurality of second sample data are the same;

if the character types are the same, merging the data of the first field with the data of the second field, wherein the merged data is the data of any field in the third data set.

3. The method of claim 2, wherein determining whether the character types of the plurality of first sample data and the plurality of second sample data are the same comprises:

respectively determining statistical parameters of the first sampling data and the second sampling data, wherein the statistical parameters comprise a mean value and/or a variance;

and determining whether the character types of the plurality of first sampling data and the plurality of second sampling data are the same according to whether the difference of the statistical parameters of the plurality of first sampling data and the plurality of second sampling data is smaller than a second preset threshold value.

4. The method of claim 1, wherein determining a distribution difference of the sensitive data of the third data set and the data to be detected in the R-class data comprises:

determining a first set according to the proportion of the sensitive data in each class of the R-class data, wherein the first set is used for representing the distribution condition of the sensitive data in the R-class data;

determining a second set according to the proportion of the data to be detected in each class of the R-class data, wherein the second set is used for representing the distribution condition of the data to be detected in the R-class data;

and determining the Euclidean distance between the first set and the second set as the distribution difference of the sensitive data and the data to be detected in the R-class data.

5. The method of claim 2, wherein prior to clustering the data in the third dataset, the method further comprises:

and converting the character type data in the third data set into numerical type data.

6. A sensitive data identification device, comprising:

the acquisition module is used for acquiring a first data set and a second data set, wherein the first data set comprises sensitive data of N fields, and the second data set comprises data to be detected of M fields; the first data set is a pre-stored marked sensitive data set, the second data set is an acquired unmarked data set, and M and N are positive integers;

the merging module is used for merging the data of the fields with the same character type in the first data set and the second data set to obtain a third data set, wherein the third data set comprises S fields of data, and S is a positive integer;

the clustering module is used for clustering the data in the third data set to obtain R-class data, wherein R is a positive integer;

the determining module is configured to determine a distribution difference between the sensitive data and the data to be detected in the R-type data of the third data set, and if the distribution difference is smaller than a first preset threshold value, determine that the data in the third data set is sensitive data.

7. The apparatus of claim 6, wherein the combining module is specifically configured to:

8. The apparatus of claim 7, wherein the combining module is specifically configured to:

9. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;

the memory stores computer-executable instructions;

the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1-5.

10. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1-5.