CN116340387A

CN116340387A - Statistical analysis method and system for personal information disclosure condition of data table

Info

Publication number: CN116340387A
Application number: CN202310257243.2A
Authority: CN
Inventors: 廖佳纯; 陈海粟; 姚思诚; 焦文品; 张磊
Original assignee: Nanhu Laboratory
Current assignee: Nanhu Laboratory
Priority date: 2023-03-17
Filing date: 2023-03-17
Publication date: 2023-06-27

Abstract

This scheme discloses a statistical analysis method and system for the personal information disclosure condition of data tables, and provides a novel data processing method. A data catalog is created for the data tables, the data tables related to personal information are preliminarily labeled and classified based on the data catalog, and the field contents of the screened personal-information-related data tables are then comprehensively identified, so that field identifiers are labeled accurately and efficiently. On this basis, the data tables are divided according to whether they contain record rows with direct identifier information, and are split and reorganized accordingly, which effectively improves the efficiency of subsequent processing, analysis and statistics. The data tables are then analyzed with a layer-by-layer classification method, statistical analysis is carried out along multiple dimensions, and a personal information disclosure report is generated automatically, so that the personal information disclosure of the platform is described comprehensively and completely.

Description

Statistical analysis method and system for personal information disclosure condition of data table

Technical Field

The scheme belongs to the technical field of personal information security, and provides a statistical analysis method and a statistical analysis system for personal information disclosure conditions of a data table.

Background

According to the Information Security Technology: Guide for De-identifying Personal Information, personal information refers to various information, recorded electronically or otherwise, that can identify a specific natural person or reflect the activities of a specific natural person, either alone or in combination with other information. The natural person identified by personal information is referred to as the personal information subject. Microdata refers to a structured data table in which each record (row) corresponds to one personal information subject and each field (column) of the record corresponds to an attribute. An identifier is one or more attributes in the microdata that can uniquely identify a personal information subject; identifiers are divided into direct identifiers and quasi-identifiers. A direct identifier is a microdata attribute that can identify a personal information subject on its own in a specific environment; common direct identifiers include names, ID card numbers and mobile phone numbers. A quasi-identifier is a microdata attribute that cannot identify an individual on its own but can uniquely identify a personal information subject in combination with other attributes; common quasi-identifiers include gender, occupation and educational background. A data platform usually de-identifies the contents of a data table before releasing it. De-identification is the processing of personal information so that the personal information subject cannot be identified without the use of additional information.
In a data table that discloses no direct identifiers, the record rows whose quasi-identifiers all take the same values form an equivalence class. The equivalence class size is the number of record rows sharing the same values for all quasi-identifiers, and it determines the risk that a record row is linked to an individual and re-identified. The equivalence class dimension is the number of quasi-identifiers that form the equivalence class; it measures how many categories of personal information a record row contains, and the higher the dimension, the more personal information can be revealed. Re-identification is the process of re-associating a de-identified data table with the original personal information subject or with a group of personal information subjects. In a de-identified data table, the record rows within one equivalence class are indistinguishable from each other, so the probability that the individual corresponding to any record row in the class is re-identified, namely the re-identification risk of the record row, is the reciprocal of the equivalence class size.

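$$\theta_j = \frac{1}{f_j} \qquad \text{(Equation 1)}$$

$$R_b = \max_{j \in J} \theta_j \qquad \text{(Equation 2)}$$

where $j$ is an equivalence class, $j \in J$; $f_j$ is the size of equivalence class $j$; $\theta_j$ is the re-identification probability of a record row in equivalence class $j$; $|J|$ is the number of equivalence classes in the data table; and $R_b$ is the maximum re-identification probability over the data table.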
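The two quantities defined above can be illustrated with a minimal Python sketch. It assumes a table is represented as a list of dictionaries and that the quasi-identifier fields are already known; the function name `reidentification_risk` and the sample field names are illustrative assumptions, not part of the scheme.

```python
from collections import Counter


def reidentification_risk(rows, quasi_identifiers):
    """Return (per-row risks, table maximum R_b) for a table without direct identifiers."""
    # Rows with identical quasi-identifier values fall into the same equivalence class.
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    class_sizes = Counter(keys)                      # f_j for each equivalence class j
    risks = [1 / class_sizes[key] for key in keys]   # theta_j = 1 / f_j for each row
    r_b = max(risks) if risks else 0.0               # R_b = max over all classes
    return risks, r_b


# Hypothetical sample data: two rows share one class, the third is unique.
sample_rows = [
    {"gender": "F", "occupation": "teacher"},
    {"gender": "F", "occupation": "teacher"},
    {"gender": "M", "occupation": "doctor"},
]
sample_risks, sample_r_b = reidentification_risk(sample_rows, ["gender", "occupation"])
```

A row that is alone in its equivalence class has risk 1, which is why a single unique combination of quasi-identifiers is enough to push the table maximum to 1.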

Analyzing the personal information disclosure of the data tables published by a platform generally involves at least two parts, data classification and data content identification:

1) Data classification. Data with similar attributes or characteristics are grouped, according to certain rules and methods, along dimensions chosen for a specific management or business purpose. A data classification task classifies a set of data tables; along the dimension of the individual citizen, data are classified into personal information and non-personal information. At present, data classification tasks mainly rely on automated tools or manual labeling. Manual labeling is accurate but significantly lengthens the task, while fully automatic labeling places high demands on data table quality that real-world data tables often cannot meet.

2) Data content identification. The information types involved in the data fields of a data table are identified, mainly the sensitive information in the table. Traditional data content identification relies on two approaches: manual definition and regular expressions. The manual definition approach builds a sensitive-word lexicon by hand and uses keyword matching to identify information at the metadata level; it also places high demands on data table quality, and if fields are misaligned or field names do not match field contents, the credibility of its results is limited. The regular expression approach suits data with structural characteristics, such as mobile phone numbers, ID card numbers and bank card numbers, but it cannot identify sensitive information such as names in unstructured text, nor can it accurately identify and extract information with a relatively simple composition pattern, such as mobile phone numbers, when it is mixed into long unstructured text.

In summary, the prior art has the following drawbacks:

In real scenarios, data tables suffer from complex data formats, unreliable fields, unstructured content randomly nested in structured tables, and other non-compliant conditions, which makes classification and identification difficult. With the continuous development of information processing and storage technology, the abuse of personal information in China is becoming increasingly serious. When a data resource platform publishes data, personal information is often disclosed because the data tables were not adequately anonymized. The data tables come from diverse sources and lack unified standards; field contents and field catalogs are often unverified, field names may not match field contents, and information may be mixed across fields. As a result, traditional classification and identification tasks are difficult to carry out directly.

Disclosure of Invention

The purpose of the present invention is to provide a statistical analysis method and system for the personal information disclosure condition of data tables, so that the manager of a data resource platform can use the method and system to learn, efficiently and accurately, how personal privacy information is disclosed in the resources published on the platform, the potential re-identification risk of the data tables, and the re-identification that can actually be achieved.

A statistical analysis method for the personal information disclosure condition of data tables, comprising:

S1, acquiring the data tables to be analyzed;

S2, cleaning the data tables to be analyzed and creating a metadata catalog for the cleaned data tables;

each entry of the metadata catalog corresponds to one data table to be analyzed, and comprises the field name set of that data table and a mapping code pointing to it; taking a catalog table as an example, one row corresponds to one entry, and when there is only one data table to be analyzed, the metadata catalog table has only one row;

S3, matching the data table of each entry based on the mapping code; completing, for the field name set of each entry of the metadata catalog, a preliminary classification labeling of the field identifier type of each field based on the field values of the corresponding data table, and screening out the entries of the metadata catalog that are related to personal information;

S4, extracting the corresponding personal-information-related data tables based on the screening result of step S3, and comprehensively identifying the field values of each field in those data tables;

S5, dividing the personal-information-related data tables into type-one data tables and type-two data tables according to whether they contain record rows with direct identifier information;

S6, splitting and reorganizing each type-one data table, according to whether each record row contains identifiable direct identifier information, into a class A data table consisting entirely of record rows that contain identifiable direct identifier information and a class B data table whose record rows contain none;

assigning each type-two data table to class B;

S7, performing statistical analysis on the class A data tables and the class B data tables respectively, so as to statistically analyze the personal information disclosure condition of the data tables to be analyzed.
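Steps S5 and S6 can be sketched as follows. This is a minimal illustration that assumes each table is a list of row dictionaries and that a caller-supplied predicate, here the hypothetical `has_direct_identifier`, reports whether a row contains identifiable direct identifier information; it is not the scheme's actual implementation.

```python
def split_tables(tables, has_direct_identifier):
    """Split tables (name -> list of row dicts) into class A and class B tables."""
    class_a, class_b = {}, {}
    for name, rows in tables.items():
        direct = [row for row in rows if has_direct_identifier(row)]
        rest = [row for row in rows if not has_direct_identifier(row)]
        if direct:
            # Type-one table: regroup rows with direct identifiers into a class A table,
            # and the remaining rows (if any) into a class B table.
            class_a[name] = direct
            if rest:
                class_b[name] = rest
        else:
            # Type-two table: no row carries a direct identifier, so it is class B as a whole.
            class_b[name] = rest
    return class_a, class_b


# Hypothetical example: a row counts as directly identifying if it has a phone value.
example_tables = {
    "t1": [{"phone": "13812345678", "gender": "F"}, {"phone": "", "gender": "M"}],
    "t2": [{"phone": "", "gender": "F"}],
}
example_a, example_b = split_tables(example_tables, lambda row: bool(row.get("phone")))
```

Keeping the original table name as the key for both resulting tables makes it possible to trace every class A or class B table back to its catalog entry.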

In the above method, in step S2, cleaning the data tables to be analyzed comprises any one or more of field misalignment correction, field name completion, field name conversion and special character processing, so that every field of a cleaned data table has a field name that corresponds to its field values, the field names are mainly Chinese characters, and special characters are removed from the field names; the special characters mainly include spaces, line breaks and the like;

each entry in the metadata catalog further comprises any one or more of the data table title, the web page link, the data table file name, and data-table-related information including domain labeling information of the corresponding data table;

the mapping code, field name set, data table title, web page link, data table file name and data-table-related information of each data table to be analyzed are mapped to one another and integrated into the metadata catalog.

In the above method, step S3 specifically comprises:

S31, acquiring the field value features of each field in the field name set of each entry, the field value features comprising the unique-value ratio of the field, the distribution of character type ratios in a sample of field values, and the field data type;

S32, vectorizing the text of the data table title and the data table domain label of the entry in which the field name is located, vectorizing the field value features of the field, and merging the vector features;

S33, inputting the merged vectorized field features into a trained machine learning classification model, such as a decision tree classification model, which outputs an identifier type label for each field;

S34, judging from the identifier type labels whether the corresponding entry is related to personal information, so as to screen out the personal-information-related entries of the metadata catalog.

In the above method, in step S33, the identifier types that can be labeled include direct identifier, quasi-identifier and non-identifier;

in step S34, an entry is judged to be related to personal information when any of its fields is labeled as a direct identifier, a quasi-identifier or a de-identified identifier.
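The screening rule of step S34 can be sketched in a few lines of Python. The sketch assumes that the classifier output is available as a dictionary from mapping code to per-field labels, and that labels are the plain strings shown below; the label strings and the function name `screen_catalog` are assumptions made for illustration.

```python
# Assumed label strings for identifier types that mark an entry as personal-information-related.
PERSONAL_LABELS = {"direct", "quasi", "de-identified"}


def screen_catalog(catalog, field_labels):
    """Keep the catalog entries that have at least one field with a personal identifier label.

    catalog: list of entries, each a dict with a "filecode" mapping code.
    field_labels: dict mapping filecode -> {field name: identifier type label}.
    """
    related = []
    for entry in catalog:
        labels = field_labels.get(entry["filecode"], {})
        if any(label in PERSONAL_LABELS for label in labels.values()):
            related.append(entry)
    return related


# Hypothetical example: only the first entry has a field labelled as an identifier.
example_catalog = [{"filecode": "001"}, {"filecode": "002"}]
example_labels = {
    "001": {"name": "direct", "amount": "non-identifier"},
    "002": {"amount": "non-identifier"},
}
example_related = screen_catalog(example_catalog, example_labels)
```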

In the above method, in step S4, the comprehensive identification is performed as follows:

for direct identifiers, information that strictly follows a fixed composition pattern, such as mobile phone numbers, ID card numbers, bank card numbers and license plate numbers, is identified by its pattern; information in descriptive text that has no strict composition pattern, such as names, is identified and extracted with the deep-learning-based named entity recognition method of the LAC lexical analysis tool;

for quasi-identifiers, such as gender, educational background and employer, identification and matching are performed with a keyword-lexicon-based metadata identification technique, with reference to personal information reference documents such as Table B.1 in Annex B of the Cybersecurity Standard Practice Guide: Network Data Classification and Grading Guidelines;

for de-identified identifiers, the degree of de-identification is detected; for example, the presence of special characters in a field value is taken as the sign of de-identification, and the proportion of field values containing them indicates the degree of de-identification. A de-identified identifier may have been a direct identifier or a quasi-identifier before it was de-identified.
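Two of the identification steps above lend themselves to a short sketch: pattern-based extraction of a direct identifier and detection of the degree of de-identification. The regular expression for mobile phone numbers and the choice of `*` as the masking character are assumptions made for this example; name extraction with the LAC tool is not shown.

```python
import re

# Assumed pattern: 11 digits starting with 1 and a second digit 3-9, not embedded in a longer number.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_mobiles(text):
    """Return all substrings of text that look like mobile phone numbers."""
    return MOBILE_RE.findall(text)


def masked_ratio(values, mask_chars="*"):
    """Share of non-empty field values that contain a masking character."""
    non_empty = [str(v) for v in values if v not in (None, "")]
    if not non_empty:
        return 0.0
    masked = sum(1 for v in non_empty if any(ch in mask_chars for ch in v))
    return masked / len(non_empty)


example_text = "Contact Zhang San, tel 13812345678, staff no. 20230001"
example_mobiles = extract_mobiles(example_text)
example_ratio = masked_ratio(["138****5678", "13912345678", "", None])
```

A ratio close to 1 would suggest that the field has been masked throughout, while a low ratio would suggest that most values are still published in full.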

In the above method, for a class A data table, the individual corresponding to each record row in the table is directly disclosed. The distribution of class A data tables across domains can be counted, to show intuitively how individuals are directly disclosed in each domain of the platform; the coverage of the different types of personal information by the class A data tables of a specific domain can also be counted, to show which information about directly disclosed individuals is leaked in each domain. Because the number of individuals involved differs greatly between data tables, the method also aggregates, for each data table, the number of records of each personal information type involved: for direct identifiers, the number of identification results per row is taken as the indicator of the number of people involved; for matched quasi-identifier fields, the number of record rows is taken as the indicator of the number of people involved. Individuals with the same direct identifier information across data tables are de-duplicated to give a reference indicator of the number of people involved in the data tables. In this way the individuals disclosed by the platform in the form of single data tables, and their personal information, are depicted comprehensively and displayed quantitatively.

For a class B data table, the individual corresponding to each record row in the table is at potential risk of being identified. First, the re-identification risk of the record rows of each data table is calculated by the method described in the Background; the maximum re-identification risk of a data table is the largest re-identification risk among all of its record rows, and the number of data tables at each maximum re-identification risk is counted. For the data tables whose maximum re-identification risk is greater than or equal to a set threshold, the distribution across domains and, within a specific domain, the coverage of the different types of personal information can be counted, to show where high-risk data tables exist on the platform and which information they disclose. All record rows in class B data tables whose re-identification risk is greater than or equal to the set threshold can be aggregated as an indicator of the number of individuals involved, so that the personal information disclosed on the platform by data tables whose maximum re-identification risk reaches the threshold is depicted comprehensively and displayed quantitatively. The set threshold may be 1/2, 1/3, 1 and so on, and is preferably 1, since in practice the re-identification risk is at most 1.
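The class A and class B statistics described above can be sketched as two small helpers. The sketch assumes that class A tables are lists of row dictionaries sharing one direct identifier field, and that the per-row re-identification risks of each class B table have already been computed; the function names and the input layout are illustrative assumptions.

```python
from collections import Counter


def people_involved(class_a_tables, direct_field):
    """Distinct direct-identifier values per class A table, and de-duplicated across tables."""
    per_table, seen = {}, set()
    for name, rows in class_a_tables.items():
        ids = {row[direct_field] for row in rows if row.get(direct_field)}
        per_table[name] = len(ids)
        seen |= ids
    return per_table, len(seen)


def summarize_class_b(row_risks, threshold=1.0):
    """Count tables per maximum risk and high-risk rows per table.

    row_risks: dict mapping table name -> list of per-row re-identification risks.
    """
    max_risk = {name: max(risks) for name, risks in row_risks.items() if risks}
    tables_per_max_risk = Counter(max_risk.values())
    high_risk_rows = {
        name: sum(1 for risk in row_risks[name] if risk >= threshold)
        for name, table_max in max_risk.items()
        if table_max >= threshold
    }
    return tables_per_max_risk, high_risk_rows


example_a_tables = {
    "a1": [{"phone": "1"}, {"phone": "2"}],
    "a2": [{"phone": "2"}, {"phone": "3"}, {"phone": ""}],
}
example_per_table, example_total = people_involved(example_a_tables, "phone")
example_counts, example_high = summarize_class_b(
    {"t1": [1.0, 0.5, 0.5], "t2": [0.5, 0.5], "t3": [1.0, 1.0]}
)
```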

In the above method, step S7 further comprises an association analysis of the class A data tables and the class B data tables:

S71, taking the class B data tables whose re-identification risk is greater than or equal to the set threshold as the class B data tables available for association;

S72, matching and associating the record rows of an available class B data table whose re-identification risk is greater than or equal to the set threshold with a class A data table:

S721, acquiring the quasi-identifier field sets of the two data tables, and matching fields that contain the same personal information type and have the same values, to obtain all matchable field pairs of the two data tables;

S722, comparing the record rows of the two data tables one by one on the field pairs determined in S721; when the values of all quasi-identifier field pairs are identical and the direct identifier information agrees with the remaining information of any de-identified identifier field, the record rows are judged to correspond to the same person;

S73, measuring the credibility that matched record rows correspond to the same individual by the number of successfully matched quasi-identifiers; for a paired data table A_i and data table B_j, the amount of information by which A_i is extended is the number of quasi-identifiers in B_j minus the number of matchable field pairs between the two data tables;

for a paired data table A_i and B_j, the certainty of the personal information extended to A_i is 1/n, where n is the number of record rows of B_j, with re-identification risk greater than or equal to the set threshold, that one record row of A_i matches;

S74, counting, from the association matching results, the number of re-identified records at each credibility and each certainty;

and performing on the association matching results the same kind of statistical analysis as for the class A data tables.
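The association matching of steps S72 to S73 can be sketched as follows. The sketch assumes that the class B rows passed in have already been filtered to those at or above the risk threshold, that the matchable field pairs are given, and that the extended information amount is read as the number of quasi-identifiers in B_j minus the number of matched pairs; comparison against partially masked (de-identified) fields is omitted. All names are illustrative.

```python
def associate(a_rows, b_rows, field_pairs):
    """Match class A rows to class B rows on quasi-identifier field pairs.

    field_pairs: list of (a_field, b_field) pairs holding the same personal information type.
    b_rows: class B rows assumed to be already filtered to risk >= threshold.
    """
    results = []
    for a_row in a_rows:
        matches = [
            b_row for b_row in b_rows
            if all(a_row[fa] == b_row[fb] for fa, fb in field_pairs)
        ]
        if matches:
            results.append({
                "a_row": a_row,
                "matches": matches,
                "credibility": len(field_pairs),   # number of matched quasi-identifiers
                "certainty": 1 / len(matches),     # 1/n for n matching class B rows
            })
    return results


def extended_information(b_quasi_fields, field_pairs):
    """Quasi-identifiers of the class B table that are new to the class A table (assumed reading)."""
    return len(b_quasi_fields) - len(field_pairs)


example_a_rows = [{"name": "Li", "sex": "F", "job": "nurse"}]
example_b_rows = [
    {"gender": "F", "occupation": "nurse", "district": "X"},
    {"gender": "F", "occupation": "nurse", "district": "Y"},
    {"gender": "M", "occupation": "nurse", "district": "Z"},
]
example_pairs = [("sex", "gender"), ("job", "occupation")]
example_matches = associate(example_a_rows, example_b_rows, example_pairs)
example_extended = extended_information(["gender", "occupation", "district"], example_pairs)
```

In this example one class A row matches two class B rows, so the extended information (the district) is attributed to the individual with certainty 1/2.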

In the above method, the statistical analysis results of step S7 are visualized:

S81, for the statistical analysis of the class A and class B data tables, from the two perspectives of the number of data tables and the number of people involved, drawing a heat map with the personal information type on the X axis, the domain label on the Y axis, and the number of data tables or the number of people involved as the cell value, to show how the data tables under each domain label disclose each type of personal information;

S82, for the statistical analysis of the association matching between class A and class B data tables, designing visualizations from different perspectives:

(1) drawing a multi-dimensional clustered column chart with the personal information type on the X axis, the number of people on the Y axis and the credibility as the grouping dimension, to show how the successfully matched record rows are distributed over the personal information types at each credibility;

(2) drawing a multi-dimensional clustered column chart with the amount of information extended by association matching on the X axis, the number of people on the Y axis and the certainty range as the grouping dimension, to show how much information is gained by re-identifying record rows at each certainty.
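The data behind the heat map of S81 can be prepared with a small aggregation step, sketched below. It assumes each analyzed table is summarized as a dictionary with a `domain` label and a set of `info_types`; these keys and the function name are assumptions, and the actual plotting call is left to whichever charting library is used.

```python
from collections import defaultdict


def heatmap_counts(tables):
    """Build a domain x information-type matrix of data table counts."""
    counts = defaultdict(int)
    for table in tables:
        for info_type in table["info_types"]:
            counts[(table["domain"], info_type)] += 1
    domains = sorted({domain for domain, _ in counts})            # Y axis labels
    info_types = sorted({info_type for _, info_type in counts})   # X axis labels
    matrix = [[counts.get((d, i), 0) for i in info_types] for d in domains]
    return domains, info_types, matrix


example_summary = [
    {"domain": "health", "info_types": {"name", "phone"}},
    {"domain": "health", "info_types": {"name"}},
    {"domain": "education", "info_types": {"phone"}},
]
example_domains, example_types, example_matrix = heatmap_counts(example_summary)
```

The same function can produce the "number of people involved" view if each table contributes its head count instead of 1.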

In the above method, in step S3, after the classification labeling is completed, the user performs a manual verification;

in step S4, after the quasi-identifier, direct identifier and de-identified fields have been identified, the user verifies the identification results with the aid of the data table title, samples of field values and the unique-value ratio of each field.

In step S5, a type-one data table contains record rows with identification information that can identify an individual on its own, and the corresponding individuals are directly disclosed; all record rows in a type-two data table expose personal information, and the corresponding individuals are at potential risk of being identified.

A personal information disclosure statistical analysis system for executing the personal information disclosure statistical analysis method for a data table.

The advantage of this scheme lies in:

the requirement on data quality is very low: personal information disclosure can be statistically analyzed even for data tables of poor quality, such as structured data tables randomly nested with unstructured data, which reduces the dependence on manual work and improves the reliability of the statistical analysis;

the scheme provides a new data processing method: a data catalog is created for the data tables, the data tables related to personal information are preliminarily labeled and classified based on the data catalog, and the field contents of the screened personal-information-related data tables are then comprehensively identified, so that field identifiers are labeled accurately and efficiently; on this basis, the data tables are divided according to whether they contain record rows with direct identifier information, and are split and reorganized accordingly, which effectively improves the efficiency of subsequent processing, analysis and statistics; the data tables are then analyzed with a layer-by-layer classification method, statistical analysis is carried out along multiple dimensions, and a personal information disclosure report is generated automatically, so that the personal information disclosure of the platform is described comprehensively and completely;

in addition, the scheme adopts the concept of associated disclosure to analyze the possibility of using class A data tables to re-identify class B data tables, so that the personal information disclosure of multiple associated data tables, and of the entire data table set of a platform, can be statistically analyzed.

Drawings

FIG. 1 is a general flow chart of a statistical analysis method for personal information disclosure of a data sheet according to the present invention;

FIG. 2 is a raw data table of an example of the present invention;

FIG. 3 is a first sub-graph and a second sub-graph of the original data table of FIG. 2 after data cleansing;

FIG. 4 is a schematic diagram of a data table catalog;

FIG. 5 is a flow chart of a personal information data table reorganization module according to the present invention;

FIG. 6 is a flowchart of a personal information risk statistics analysis module according to the present invention;

FIG. 7 is a flow chart of the analysis result visualization module of the present invention;

FIG. 8 is a flow chart of the data catalog creation, data sheet classification, identification and labeling module of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the drawings and detailed description.

FIG. 1 is the overall flow chart for applying the invention to the statistical analysis of personal information disclosure. The invention provides an operational framework, implemented in the Python language, that uses classification, identification, association and related procedures to statistically analyze, and produce visual charts of, the personal information disclosure in single data tables and in data table sets. First, a data table set is obtained from a data resource platform and mapped to create a metadata catalog; the data table set is then classified and screened to select the data tables potentially related to personal information. Next, the private information in the field contents of every data table labeled as related to personal information is comprehensively identified. The data tables are then divided according to the identification results and extracted information, that is, according to whether direct identifiers were identified in them; the rows in which direct identifiers were identified are split from the original data table and recombined into a new data table. Based on the resulting data tables, a layer-by-layer classification approach is used to analyze the data tables of the actual platform, statistical analysis is carried out along multiple dimensions, and a personal information disclosure report is generated automatically, so that the personal information disclosure of the platform is described comprehensively and completely. The method specifically comprises the following steps:

Step one, creating the data catalog. The data table set acquired from a data resource platform is cleaned. The field names are cleaned so that they are in Chinese wherever possible: abbreviated or English field names are remapped to Chinese according to the platform information, and special characters such as line breaks and spaces are removed from the field names. This ensures that every field of a data table has a field name, that the field names are mainly Chinese characters, and that they contain no special characters such as spaces.
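The field name cleaning described in step one can be sketched as follows. The mapping from English or abbreviated names to Chinese names would come from the platform's metadata; here it is a hypothetical dictionary, and the `field_<index>` placeholder for missing names is likewise an assumption of this example.

```python
import re


def clean_field_names(names, name_map):
    """Strip whitespace from field names, remap known names, and fill in missing ones."""
    cleaned = []
    for index, name in enumerate(names):
        name = re.sub(r"\s+", "", name or "")   # remove spaces, tabs and line breaks
        name = name_map.get(name, name)         # remap English or abbreviated names
        cleaned.append(name or f"field_{index}")
    return cleaned


# Hypothetical mapping taken from platform metadata.
example_map = {"tel": "电话"}
example_names = clean_field_names([" 姓 名\n", "tel", None], example_map)
```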

Taking an extreme case as an example, FIG. 2 shows an original data table in XLS format as published by a data platform. The table lacks field names and its columns are misaligned, a problem caused by an operating error when the table was exported from the original database for publication. The data platform also publishes the same data in XML format, which has a tree structure and contains the complete data; each data element carries a tag, and the tag is the field name. When the XML is converted to a standard data frame, the tags can be extracted as the field names needed for the XLS table, and the original data table is cleaned accordingly. Converting the data step by step yields a more standardized final data set, shown as the first and second sub-graphs of FIG. 3. The first sub-graph is the data table obtained by converting the original data table through the XML and XLS structures; the second sub-graph is the cleaned data table obtained by converting the English and abbreviated field names of the first sub-graph using the data table metadata provided on the data platform.
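Recovering field names from XML tags can be sketched with the standard library. The sketch assumes a simple layout in which the root element contains one child element per record and each record contains one child element per field; the platform's actual XML layout is not given in the text, so the sample document below is hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_to_rows(xml_text):
    """Parse records from XML, using element tags as field names."""
    root = ET.fromstring(xml_text)
    rows = [
        {field.tag: (field.text or "").strip() for field in record}
        for record in root
    ]
    # Collect field names in first-seen order so the column order is stable.
    field_names = []
    for row in rows:
        for tag in row:
            if tag not in field_names:
                field_names.append(tag)
    return field_names, rows


example_xml = (
    "<table>"
    "<row><name>Zhang</name><tel>13812345678</tel></row>"
    "<row><name>Li</name><tel/></row>"
    "</table>"
)
example_fields, example_rows = xml_to_rows(example_xml)
```

The resulting rows can then be loaded into a data frame with the recovered field names as column headers.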

According to the actual files of the data table set provided by the data resource platform, the mapping between data table file names and data table names is determined first, so that a mapping code uniquely locates a specific data table and yields its correct name. Mapping codes are usually provided by the data resource platform and can be obtained together with the data table set; they are recorded in the metadata catalog, and a program can use them to read the specific data table file and obtain the data table information. If the platform does not provide mapping codes, a unique mapping is established between each data table and the file name it points to, and a mapping code is generated.

The cleaned data tables are then traversed. For each data table, its file name, field name set, data table title and the related information given by the data resource platform (such as the domain label) are acquired, and a mapping is established among the file name, field name set, mapping code, title and related information, which are integrated into the data table catalog. Each row in the table catalog corresponds to a field, and all record rows are the set of all fields of the table set.

FIG. 4 is a specific example of a mapped data table catalog, where title is the data table title, domain is the domain labeling information, url is the web page link of the data table, filecode is the mapping code, filenames is the data table file name, and columns is the cleaned set of field names.
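A catalog with the columns named in FIG. 4 can be sketched as a list of dictionaries. The keys `title`, `domain`, `url`, `filecode`, `filenames` and `columns` follow the figure description; deriving a missing mapping code from a hash of the file name is an assumption of this sketch, since the text only requires that the generated code be unique.

```python
import hashlib


def make_filecode(filename):
    """Generate a mapping code from the file name (assumed scheme: truncated SHA-1)."""
    return hashlib.sha1(filename.encode("utf-8")).hexdigest()[:12]


def build_catalog(tables):
    """Build one catalog entry per data table, generating a mapping code when none is given."""
    catalog = []
    for table in tables:
        catalog.append({
            "title": table["title"],
            "domain": table.get("domain", ""),
            "url": table.get("url", ""),
            "filecode": table.get("filecode") or make_filecode(table["filenames"]),
            "filenames": table["filenames"],
            "columns": list(table["columns"]),
        })
    return catalog


# Hypothetical input: the first table has a platform-provided code, the second does not.
example_catalog = build_catalog([
    {"title": "Clinic staff", "domain": "health", "filecode": "P-001",
     "filenames": "clinic.xls", "columns": ["姓名", "电话"]},
    {"title": "School list", "filenames": "school.xls", "columns": ["学校"]},
])
```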

Step two, data table classification and labeling. Using the cleaned data table set and the established data catalog, each data table is read through its mapping and preprocessed. Feature information is extracted from the field contents: the unique value ratio of the field, the distribution of character type ratios (Chinese characters, digits, English letters, and so on) in the sampled values, and the field data type. These features are used to vectorize the field. The data table title and the domain label of the entry to which the field name belongs are vectorized as text, and the vectors are merged to obtain the vectorized field features. The merged features are input into a trained decision tree model, which predicts a classification label for the field identifier type. Each field of the data table thus receives a preliminary label as a direct identifier, de-identified identifier, quasi-identifier, or non-identifier. The field labels within each data table are then summarized to preliminarily confirm whether the data table is related to personal information.

The field features include the title of the data table containing the field, the domain label of that data table, the field name, the field data type, the character type ratios of the sampled values of the field, the unique value ratio of the field, and the like.

Possible field data types include int, float, object, date, and bool. A plain text field is of type object. A field containing only mobile phone numbers is of type int, although it may be mistakenly stored as float, in which case it must be cleaned and converted in advance. A field whose values mix digits and text is of type object, and a date field is of type date.

The unique value ratio of a field is calculated as follows:

For a field j in a given data table K, the number of elements in field j after null values are removed is recorded as all_length_j, and the number of elements after null values are removed and duplicates are further removed with unique() is recorded as unique_length_j. The two lengths are substituted into the following equation:

$$\text{unique\_ratio}_j = \frac{\text{unique\_length}_j}{\text{all\_length}_j}$$

The unique value ratio is calculated so that a single index represents the value distribution of a field and can be input into the decision tree model as a feature variable. For example, a data set summarizing teacher information has a field showing names and a field showing unit names. Because a name is a direct identifier that can uniquely determine a personal information subject in a specific context, the unique value ratio of the name field is generally above 0.95 (the exceptions being people who share a name and duplicated records). A unit name is a quasi-identifier, and different personal information subjects may belong to the same unit, so the unique value ratio of the unit name field is generally not as high, though typically above 0.5. The index can therefore distinguish direct identifiers from quasi-identifiers to some extent and serves as a feature variable for labeling data set fields, which helps classify the data set.
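The unique value ratio can be sketched in a few lines of plain Python. The function name `unique_value_ratio` is hypothetical, and treating empty strings as null values is an assumption; the sample values are invented.

```python
def unique_value_ratio(values):
    """Ratio of distinct non-null values to all non-null values in a field."""
    non_null = [v for v in values if v is not None and v != ""]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)

# Invented samples: a name field and a unit name field.
names = ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", None]
units = ["School A", "School A", "School B", "School C", "School B"]

name_ratio = unique_value_ratio(names)   # 4 distinct / 4 non-null
unit_ratio = unique_value_ratio(units)   # 3 distinct / 5 non-null
```

With these samples the name field scores higher than the unit name field, which is the contrast the index is meant to capture.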

The character type ratios of the sampled values of a field are explained with three field examples:

The first field is a name field in a standard data set. The field data type is object, and sampling yields ten value samples such as Zhang San, Li Si, and Wang Wu, in which every character is a Chinese character. The Chinese character ratio of each sample is therefore 1, and the ratio of every other character type, such as English letters, digits, and separator characters (spaces, underscores, periods, commas), is 0. The ratios for the entire field are estimated by extending the calculation over the ten samples.

The second field is an ID card number field in a standard data set. Because the final check character of an ID card number may be X, a number may consist entirely of digits or of digits mixed with a letter, and the field data type is object. Sampling yields ten value samples. For numbers consisting entirely of digits, the digit ratio is 1 and all other character type ratios are 0; for numbers containing X, the digit ratio is 17/18 ≈ 0.94, the English letter ratio is 1/18 ≈ 0.06, and the remaining ratios are 0.

The third field is a name field in an anonymized data set. The field data type is object, and sampling yields ten samples such as Zhang*San, Li*Si, and Wang*, that is, three-character names with the middle character masked and two-character names with the second character masked. The field values consist of Chinese characters and the special character "*", so the Chinese character ratio of a sample is 0.66 or 0.5 and the special character ratio is 0.33 or 0.5. The character ratios calculated from the sampled values are then extended to the whole field.
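The three examples above can be reproduced with a small character classifier. This is a sketch under stated assumptions: the category names, the Unicode range used for Chinese characters, and averaging per-sample ratios to estimate the whole field are illustrative choices, and the ID card number is invented.

```python
def char_type_ratios(sample: str):
    """Share of Chinese, digit, English, separator and special characters in one value."""
    counts = {"chinese": 0, "digit": 0, "english": 0, "separator": 0, "special": 0}
    for ch in sample:
        if "\u4e00" <= ch <= "\u9fff":          # common CJK unified ideographs
            counts["chinese"] += 1
        elif ch.isascii() and ch.isdigit():
            counts["digit"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["english"] += 1
        elif ch in " _.,":                      # space, underscore, period, comma
            counts["separator"] += 1
        else:
            counts["special"] += 1
    total = len(sample) or 1
    return {k: v / total for k, v in counts.items()}

def field_char_ratios(samples):
    """Average the per-sample ratios as an estimate for the whole field (assumed method)."""
    per_sample = [char_type_ratios(s) for s in samples]
    return {k: sum(r[k] for r in per_sample) / len(per_sample) for k in per_sample[0]}

plain_name = char_type_ratios("张三")                  # all Chinese characters
id_with_x = char_type_ratios("11010519491231002X")    # invented number, 17 digits + X
masked_name = char_type_ratios("王*")                  # one Chinese, one special
```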

Take as an example the name field of a data set listing the excellent class teachers of a certain city, published under the "education and culture" domain. After ten samples are drawn, the unique value ratio is found to be low and the special character ratio high, which indicates that the field has been de-identified, so the classification model labels the field as a de-identified identifier.

For the "serial number" field in the same data table, although the samples show a high unique value ratio and a high digit or English letter ratio, the field name "serial number" is usually labeled as a non-identifier in the pre-trained classification model, so the model labels the field as a non-identifier.

The main purpose of classification labeling is to determine in advance whether the fields of a data table are related to personal information, in support of the subsequent private information identification and detection task and the analysis and statistics tasks on single data tables and on multiple associated data tables.

Step three, field information identification and labeling. After the data tables related to personal information are obtained through classification labeling, the content of each of these data tables is comprehensively identified: every structured and unstructured field in the data table is traversed and screened with identification algorithms. The identification algorithms include tools such as regular expressions, named entity recognition, and keyword matching. Character strings with clear composition rules, such as mobile phone numbers, ID card numbers, bank card numbers, and license plate numbers, are identified with regular expressions; for named entity recognition, the LAC lexical analysis tool is used to recognize and extract names from text.
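The regular expression part of this step might look like the sketch below. The two patterns are simplified assumptions (an 11-digit mobile number starting with 1 and an 18-character ID card number ending in a digit or X) and are not the patterns used by the scheme; the sample text and its numbers are invented. Named entity recognition with LAC is not shown.

```python
import re

# Simplified, assumed patterns; lookarounds avoid matching inside longer digit runs.
PATTERNS = {
    "mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}

def extract_direct_identifiers(text: str):
    """Return every match of each pattern found in a piece of free text."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}

text = "Contact 13812345678, ID 11010519491231002X, case no. 2023-0001."
found = extract_direct_identifiers(text)
```

A production pattern would usually also validate the ID card check character and region code, which this sketch omits.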

De-identified identifier fields are identified at the same time. Whether a field is a de-identified identifier is preliminarily judged from the proportion of field values containing common masking characters relative to the length of the field. The presence of such special characters serves as the basis for determining that de-identification has been applied, and the proportion of such characters within a field value is used to confirm the degree of de-identification.
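A minimal sketch of the two ratios involved in this check follows. The set of masking characters, the 0.8 threshold, and all function names are assumptions for illustration; the scheme names neither specific characters nor a threshold.

```python
MASK_CHARS = set("*#")  # assumed set of common masking characters

def masked_value_share(values, mask_chars=MASK_CHARS):
    """Share of non-empty field values that contain at least one masking character."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    masked = sum(1 for v in non_empty if any(ch in mask_chars for ch in v))
    return masked / len(non_empty)

def masking_degree(value, mask_chars=MASK_CHARS):
    """Share of characters within one value that are masking characters."""
    return sum(ch in mask_chars for ch in value) / len(value) if value else 0.0

def is_deidentified(values, threshold=0.8):
    """Preliminary judgement: the field counts as de-identified above the assumed threshold."""
    return masked_value_share(values) >= threshold

values = ["张*三", "李*四", "王*", "赵六"]
share = masked_value_share(values)
```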

For the remaining fields in the data table, such as quasi-identifier information of complex and varied types including gender, education level, and work unit, a lexicon is used, for example the personal information classification reference examples in Table B.1 of Annex B of the Network Security Standard Practice Guide: Network Data Classification and Grading Guidelines (hereinafter referred to as the guideline). Quasi-identifier fields are labeled by matching keywords in the field names and checking sampled field values.

This step yields a catalog in which identifier labeling of the fields is complete, recording for example whether direct identifier information is contained, whether an identifier field has been de-identified, and which fields are quasi-identifier fields.
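Keyword matching of field names against a lexicon can be sketched as below. The lexicon contents and the type names are hypothetical and only loosely modelled on the categories mentioned in this description; they are not the contents of Table B.1. The check against sampled field values is omitted.

```python
# Hypothetical lexicon: personal information type -> field name keywords.
QUASI_LEXICON = {
    "personal basic information": ["gender", "age", "ethnicity"],
    "personal education and work information": ["education", "degree", "work unit"],
}

def match_quasi_identifier(field_name, lexicon=QUASI_LEXICON):
    """Return the first information type whose keyword occurs in the field name, else None."""
    name = field_name.lower()
    for info_type, keywords in lexicon.items():
        if any(keyword in name for keyword in keywords):
            return info_type
    return None

labels = {f: match_quasi_identifier(f) for f in ["Gender", "Work unit name", "serial number"]}
```

Plain substring matching is deliberately naive: a keyword such as "age" would also match "message", which is one reason the description pairs keyword matching with a check of field value samples.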

Step four. According to whether direct identifier information exists in the identification results, the data tables related to personal information in the data resource platform can be divided into two types: (1) data tables that contain record rows with identifying information capable of identifying an individual on its own, so that the corresponding individuals are directly disclosed; (2) data tables in which all record rows disclose personal information but the corresponding individuals face only a potential risk of being identified. The disclosure of personal information by the data resource platform can therefore be divided into two levels, direct disclosure of individuals and re-identification risk of individuals, and direct disclosure of individuals by the platform can occur in two different ways: direct disclosure by a single data table and disclosure through the association of multiple data tables.

Following the flow of FIG. 5, this scheme first divides the relevant data table set into type (one) and type (two) data tables according to whether a data table contains direct identifier information. The type (one) data tables are then split at the record row level according to whether each record row contains identifiable information. The record rows containing identifiable information are aggregated into a new data table, the class A data table. The record rows not containing identifiable information are aggregated into another new data table; since none of its rows contains identifiable information and the corresponding individuals face only a potential risk of being identified, it is a class B data table, as are the type (two) data tables. Identifiable information means information containing a direct identifier; information with only quasi-identifiers or de-identified identifiers is non-identifiable. The five types of direct identifier information detected during identification are added to the class A data table, aligned with the record rows from which they were obtained, to facilitate subsequent analysis and statistics. For example, a "case description" field consisting of descriptive text may contain ID card numbers, names, telephone numbers, and so on, but this information is embedded in unstructured text. This scheme extracts such information during identification and adds it to the original data table as a series of structured fields.
Aligning the added information with the record rows means that the names and ID card numbers extracted from the text in one row of the "case description" field are placed in the added structured fields at the same row index, indicating that they were extracted from that row.
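The row-level split and the row-aligned structured field can be sketched as follows. The field name `extracted_mobile`, the simplified mobile number pattern, and the sample rows are assumptions; only one of the direct identifier types is extracted here to keep the example short.

```python
import re

MOBILE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # simplified, assumed pattern

def add_extracted_fields(rows, text_field):
    """Add a structured 'extracted_mobile' value to each row, taken from that row's own text."""
    out = []
    for row in rows:
        hits = MOBILE.findall(row.get(text_field) or "")
        out.append({**row, "extracted_mobile": hits[0] if hits else None})
    return out

def split_table(rows, direct_fields):
    """Split rows into class A (some direct identifier present) and class B (none present)."""
    class_a, class_b = [], []
    for row in rows:
        has_direct = any(row.get(f) not in (None, "") for f in direct_fields)
        (class_a if has_direct else class_b).append(row)
    return class_a, class_b

# Invented rows of a type (one) data table.
rows = [
    {"case description": "Caller 13812345678 reported a dispute.", "gender": "F"},
    {"case description": "Anonymous report.", "gender": "M"},
]
class_a, class_b = split_table(add_extracted_fields(rows, "case description"),
                               direct_fields=["extracted_mobile"])
```

Because the extracted value is written into the same row dictionary it came from, the row alignment described above holds by construction.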

For any platform scenario, the number of class A data tables equals the number of type (one) data tables, and the number of class B data tables is not less than the number of type (two) data tables. The disclosure of personal information by the record rows in the class A and class B data tables corresponds, respectively, to the two levels "direct disclosure of individuals" and "re-identification risk of individuals" in FIG. 1.

Step five. According to the domain labels assigned to the data tables by the data resource platform and the classification of personal information in the guideline, statistical analysis with different flows is carried out on the class A data tables, the class B data tables, and the association results, as shown in FIG. 6:

1) The individual corresponding to each record row in a class A data table is directly disclosed. The field names of direct identifier information and quasi-identifier information in the class A data tables are classified by information type through an information type dictionary built according to Table B.1 of Annex B of the guideline. The types of personal information involved in conventional data tables are mainly personal basic information, personal identity information, personal health and physiological information, personal education and work information, personal property information, and other information. Counting the distribution of data tables across domains shows intuitively how individuals are directly disclosed in each domain of the platform scenario; counting the coverage of different personal information types by the data tables in a specific domain shows which kinds of information are disclosed for directly disclosed individuals in different domains. Because the number of individuals involved differs greatly between data tables, this embodiment uses, for each data table, the total number of records of each relevant personal information type as an indication of the number of persons involved: for identified direct identifier information, the number of identification results of each direct identifier in each row is used; for matched quasi-identifier fields, the total number of records is used.

In the classification of personal information types, one type may include both direct identifiers and quasi-identifiers; for example, personal basic information includes name, gender, age, ethnicity, and the like. When the number of persons involved is counted, the number of records of the relevant personal information type is taken from the identification results of each direct identifier for direct identifiers identified in the data table, and from the number of valid records under each quasi-identifier field for matched quasi-identifier fields. Finally, individuals with the same direct identifier information are de-duplicated across data tables, and the number of data tables and the number of persons involved for each personal information type are then counted.

This statistical analysis method gives a comprehensive, quantitative picture of the individuals and the personal information disclosed by single data tables in a specific platform scenario.

2) For class B data tables, the individual corresponding to each record row faces a potential risk of being identified. First, all quasi-identifier fields in each data table are combined to form equivalence classes, and the re-identification risk of each record row is calculated according to the re-identification risk calculation method described in the background art. The record rows that are unique under the equivalence class screening condition are counted and extracted and are recorded as having a re-identification risk of 1. Data tables whose maximum re-identification risk is 1 are counted in the same way as class A data tables, except that, because these data tables contain no direct identifier, only the number of valid record rows under the matched quasi-identifier fields is used as an indication of the number of individuals involved when counting the persons involved for each personal information type. Counting the distribution of these data tables across domains and their coverage of different personal information types in specific domains shows where high-risk data tables exist and what information they disclose in different domains. All record rows with a re-identification risk of 1 in the class B data tables are summarized as an indication of the number of individuals involved, giving a comprehensive, quantitative picture of the personal information disclosed by data tables whose maximum re-identification risk is 1 in the specific platform scenario.
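A sketch of the equivalence class calculation follows. It assumes the commonly used definition in which the re-identification risk of a record row is the reciprocal of the size of its equivalence class, which agrees with the statement that rows unique within their class have a risk of 1; the calculation method from the background art is not reproduced in this section, so this is an assumption. The field names and rows are invented.

```python
from collections import Counter

def reidentification_risks(rows, quasi_fields):
    """Risk per row = 1 / size of the equivalence class formed by all quasi-identifier values."""
    keys = [tuple(row.get(f) for f in quasi_fields) for row in rows]
    class_sizes = Counter(keys)
    return [1 / class_sizes[key] for key in keys]

# Invented class B rows.
rows = [
    {"gender": "F", "age": 30, "unit": "School A"},
    {"gender": "F", "age": 30, "unit": "School A"},
    {"gender": "M", "age": 45, "unit": "School B"},
]
risks = reidentification_risks(rows, ["gender", "age", "unit"])
max_risk = max(risks)                                       # risk of the table as a whole
high_risk_rows = [i for i, r in enumerate(risks) if r == 1]  # rows unique in their class
```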

3) The class A data tables are associated with the record rows whose re-identification risk is 1 in the class B data tables. Following the step-by-step matching method for quasi-identifier field information, all matching quasi-identifier field pairs that share common values in the two data tables are first confirmed; quasi-identifier matching is then judged row by row on the values of these field pairs. If two record rows match on the values of all quasi-identifier field pairs, and the direct identifier information also matches the residual information of the de-identified identifier fields, the two record rows are considered a match and are recorded in the association matching result. The association matching result is then counted by number of persons involved in a manner similar to that used for class A data tables.

After record row association matching is completed, the number of successfully matched quasi-identifiers can be used to measure the credibility that the matched record rows correspond to the same individual. For paired data tables A_i and B_j, the amount of information by which data table A_i is extended equals the number of quasi-identifiers in data table B_j minus the number of field pairs that can be matched between the two data tables. Furthermore, one record row in data table A_i may, after the above matching steps, match n record rows in data table B_j on the values of all matching quasi-identifier field pairs; in that case the certainty of the personal information extended to data table A_i after association matching is defined as 1/n. Finally, the number of records re-identified at different credibility and certainty levels is counted from the association matching result.

This scheme adopts step-by-step matching: all matching quasi-identifier field pairs with common values are first confirmed at the level of the two data tables, and row-by-row matching is then judged on the values of those field pairs. Matching on quasi-identifiers allows the data tables to be sliced and reduced before traversal, so that the association analysis is achieved with greatly reduced computing cost.
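The association step, together with the credibility, certainty, and extended information amount defined above, can be sketched as follows. The field pairs are assumed to have been confirmed already at the table level; the additional check of direct identifiers against the residual information of de-identified fields is omitted; all names and sample rows are hypothetical.

```python
def associate(a_rows, b_rows, field_pairs):
    """Match class A rows to class B rows on every (A field, B field) quasi-identifier pair."""
    index = {}  # group B rows by their values on the paired fields
    for j, b in enumerate(b_rows):
        key = tuple(b.get(fb) for _, fb in field_pairs)
        index.setdefault(key, []).append(j)
    results = []
    for i, a in enumerate(a_rows):
        key = tuple(a.get(fa) for fa, _ in field_pairs)
        matches = index.get(key, [])
        for j in matches:
            results.append({
                "a_row": i,
                "b_row": j,
                "credibility": len(field_pairs),   # number of matched quasi-identifiers
                "certainty": 1 / len(matches),     # 1/n for n matching B rows
            })
    return results

def extended_info_count(b_quasi_fields, field_pairs):
    """Quasi-identifiers in B_j minus the field pairs matched between A_i and B_j."""
    return len(b_quasi_fields) - len(field_pairs)

a_rows = [{"name": "Zhang San", "sex": "F", "birth_year": 1990}]
b_rows = [
    {"gender": "F", "year": 1990, "district": "East"},
    {"gender": "F", "year": 1990, "district": "West"},
    {"gender": "M", "year": 1985, "district": "East"},
]
field_pairs = [("sex", "gender"), ("birth_year", "year")]
results = associate(a_rows, b_rows, field_pairs)
extra = extended_info_count(["gender", "year", "district"], field_pairs)
```

Grouping the class B rows by their paired values before traversing class A is one way to avoid comparing every pair of rows; whether it corresponds to the slicing described in the scheme is an assumption.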

Step six. The statistical results obtained in step five are visualized according to the flow in FIG. 7. For the class A data tables, the class B data tables, and the association results, heat maps of the distribution of persons involved are drawn with the personal information type as the X-axis label and the domain label of the data table as the Y-axis label; the color of each block indicates the number of data tables or the number of persons involved for the corresponding X and Y labels. The heat maps show how the data tables under each domain label disclose each type of personal information.
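Before a heat map can be drawn, the statistics have to be arranged as a matrix with domain labels on the Y axis and personal information types on the X axis. The sketch below shows only that aggregation; the record layout, the function name `heatmap_matrix`, and the sample numbers are assumptions.

```python
from collections import defaultdict

def heatmap_matrix(records):
    """Aggregate (domain, info_type, people) records into a Y-by-X matrix of people counts."""
    totals = defaultdict(int)
    for domain, info_type, people in records:
        totals[(domain, info_type)] += people
    y_labels = sorted({domain for domain, _ in totals})
    x_labels = sorted({info_type for _, info_type in totals})
    matrix = [[totals.get((d, t), 0) for t in x_labels] for d in y_labels]
    return x_labels, y_labels, matrix

# Invented per-table statistics: (domain label, personal information type, persons involved).
records = [
    ("education", "personal basic information", 120),
    ("education", "personal basic information", 30),
    ("health", "personal health and physiological information", 15),
]
x_labels, y_labels, matrix = heatmap_matrix(records)
```

The resulting matrix can be handed to any plotting library that renders a two-dimensional array as colored blocks.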

The credibility and certainty of the association matching records are calculated, and the number of record pairs is summarized along the relevant dimensions as a reference for the number of persons involved. The multi-dimensional clustered bar chart of information disclosure through association matching uses the personal information type as the X-axis label, the number of persons involved as the Y-axis label, and credibility as the grouping dimension of the bars; the multi-dimensional clustered bar chart of information extension through association matching uses the amount of extended personal information as the X-axis label, the number of persons involved as the Y-axis label, and the certainty range as the grouping dimension. The clustered bar charts show how the amount of information is extended after record rows are re-identified at different levels of certainty.

In another embodiment, as shown in FIG. 8, after the preliminary confirmation of whether a data table is related to personal information, a human operator performs only negative screening of the classification labels, that is, manual verification of the data tables labeled as personal information data tables. In step three, the identified direct identifiers and de-identified identifiers can likewise be verified with manual assistance to ensure labeling accuracy as far as possible. The remaining fields other than direct identifier fields and de-identified identifier fields can be labeled by keyword matching combined with manual verification. In this embodiment, human-machine interaction is used in the identification process, so that direct identifier information is identified efficiently while quasi-identifier information is identified accurately, and misjudgment is reduced as far as possible.

Building on related technologies such as data classification and grading and data content identification, and addressing the fact that poor data quality in data resource platform scenarios makes conventional identification, analysis, and statistics tasks difficult to complete, this scheme provides a statistical analysis method and system, together with a visualization tool, implemented in the Python language and integrating data cleaning, classification, identification, and association. The framework provides a method for dividing, splitting, and reorganizing the original data table set according to the identification results, as well as a step-by-step matching process, and defines the concepts of credibility and certainty for use in visualization. A manager can thus statistically analyze personal information disclosure in single data tables and in the data table set of an actual data resource platform scenario, and can evaluate and verify the credibility of the statistical results. The manager of a data resource platform using the invention can therefore efficiently and accurately understand the disclosure of individuals' private information, the potential re-identification risk of the data tables, and the re-identification that can actually be achieved for the resources published in the current platform scenario.

The specific embodiments described herein are offered by way of example only to illustrate the spirit of the present solution. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions in a similar manner without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims

1. A personal information disclosure statistical analysis method for a data sheet, comprising:

S1, acquiring a data table to be analyzed;

each entry of the metadata directory corresponds to a data table to be analyzed, and comprises a field name set of the corresponding data table to be analyzed and a mapping code for pointing to the corresponding data table to be analyzed;

S3, matching the data table to be analyzed of each entry based on the mapping code;

finishing preliminary classification marking of the field identifier type by taking the value information of each field of the corresponding data table as a feature vector for the field name set of each item, and screening items related to personal information in the metadata catalogue;

S5, classifying the personal information related data tables into type one data tables and type two data tables according to whether a data table has a record row containing direct identifier information;

S6, splitting and reorganizing each type one data table, according to whether each record row in the data table contains direct identifier information, into a class A data table composed entirely of record rows containing direct identifier information and a class B data table containing no direct identifier information;

classifying the type two data tables as class B data tables;

2. The statistical analysis method for the personal information disclosure situation of the data sheet according to claim 1, wherein in step S2, the cleaning of the data sheet to be analyzed includes any one or more of field misalignment correction, field name completion, field name conversion and special character processing, so that in the cleaned data sheet every field name exists and corresponds to its field values, the field names consist mainly of Chinese characters, and special characters in the fields are removed;

each entry in the metadata catalog also comprises any one or more of a data table title, a web page link, a data table file name and data table related information containing field labeling information of the corresponding data table to be analyzed; and mapping the mapping code, the field name set, the data table title, the webpage link, the data table file name and the data table related information of each data table to be analyzed to integrate the metadata catalogue.

3. The statistical analysis method for personal information disclosure of a data sheet according to claim 1, wherein step S3 specifically comprises:

S31, acquiring field value characteristics of each field name set in each item;

S33, inputting the combined vectorization characteristics into a trained machine learning classification model, and outputting identifier type labels of all fields by the model;

4. The statistical analysis method for personal information disclosure of data sheet according to claim 3, wherein in step S33, the types of identifiers that can be marked are direct identifier, de-identified identifier, quasi-identifier, and non-identifier;

in step S34, when one of the direct identifier, the quasi identifier and the de-identified identifier exists in the field in one of the entries, it is determined that the data table pointed to by the entry is related to the personal information.

5. The statistical analysis method for personal information disclosure of data sheet according to claim 4, wherein in step S4, the overall recognition is as follows:

For the direct identifier, identifying information strictly following a certain composition pattern by using a regular expression; identifying and extracting information which does not have strict constitution modes in the descriptive text by using a named entity identification method;

for the quasi identifier, according to the personal information reference file, using a metadata identification technology based on a keyword lexicon to carry out identification matching;

for the de-identified identifier, its de-identification degree is detected.

6. The statistical analysis method for personal information disclosure of data sheets according to claim 1, wherein for a class a data sheet, statistical analysis for information disclosure is performed according to settings;

for the B-class data table, respectively calculating the re-identification risks of the record rows in each data table according to a re-identification risk calculation method, wherein the maximum re-identification risk of the data table is the maximum re-identification risk in all the record rows in the data table; and carrying out statistical analysis on information disclosure conditions according to the setting on a data table with the maximum re-identification risk being greater than or equal to the set threshold.

7. The statistical analysis method for personal information disclosure of data sheet according to claim 6, wherein in step S7, further comprising a correlation analysis method for a class a data sheet and a class B data sheet:

S71, using a B-type data table with the re-identification risk being greater than or equal to a set threshold value as a data table for association;

S73, measuring the credibility that the matched record rows correspond to the same individual by the number of successfully matched quasi-identifiers, wherein, for paired data tables A_i and B_j, the amount of information by which data table A_i is extended is the number of quasi-identifiers in data table B_j minus the number of all field pairs that can be matched between the two data tables;

8. The statistical analysis method for personal information disclosure of data sheet according to claim 7, further comprising the step of visualizing the statistical analysis result of step S7:

S81, for the statistical analysis of the class A data tables and the class B data tables, from the two perspectives of the number of data tables and the number of persons involved, taking the personal information type as the X axis and the domain label as the Y axis, using color to indicate the number of data tables or the number of persons involved, and drawing a heat map to show how the data tables under each domain label disclose each type of personal information;

9. The statistical analysis method for personal information disclosure of data sheet according to claim 1, wherein in step S3, after classification labeling is completed, a manual verification is performed by a user;

in step S4, after the quasi-identifier, direct identifier and de-identified identifier fields are identified, an auxiliary verification is performed by the user.

10. A personal information disclosure statistics analysis system for performing the personal information disclosure statistics analysis method for a data sheet as claimed in any one of claims 1 to 9.