CN113488127B - Sensitivity processing method and system for population health data set

Info

Publication number: CN113488127B
Application number: CN202110856219.1A
Authority: CN
Inventors: 吴思竹; 邬金鸣; 钱庆; 修晓蕾; 钟明
Original assignee: Institute of Medical Information CAMS
Current assignee: Institute of Medical Information CAMS
Priority date: 2021-07-28
Filing date: 2021-07-28
Publication date: 2023-10-20
Anticipated expiration: 2041-07-28
Also published as: CN113488127A

Abstract

The invention discloses a sensitivity processing method and a sensitivity processing system for population health data sets, wherein the sensitivity processing method comprises the following steps: acquiring a population health data set to be evaluated; carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features; analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature; calculating based on the analysis result corresponding to each sensitive feature to obtain a sensitivity comprehensive evaluation result of the population health data set; and generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result. The invention realizes the discovery, identification, analysis and processing of the sensitive information, meets the application requirements of the sensitivity evaluation of the population health data set through multidimensional analysis, and improves the efficiency and the safety of the subsequent population health data application.

Description

Sensitivity processing method and system for population health data set

Technical Field

The invention relates to the technical field of data processing, in particular to a sensitivity processing method and system for a population health data set.

Background

Sharing population health data plays an important role in improving the quality and effectiveness of medical care, supporting medical management decisions, raising the level and transparency of medical research, and controlling medical costs. However, population health data contain highly sensitive information about large numbers of individuals or groups, such as identity information, health information and genetic information. If this information leaks during data sharing, it can cause security risks and property losses of varying degrees to countries, society or individuals. Data sensitivity processing is therefore an important basis and prerequisite for desensitizing population health data and for collecting and sharing them.

Existing research focuses on techniques and schemes for identifying and removing sensitive information from population health data. Research on data sensitivity evaluation methods is still insufficient: no effective system of evaluation methods, no comprehensive data sensitivity evaluation workflow, and no systematic and complete classification of sensitive information in population health data have been formed. Moreover, existing sensitivity evaluation studies mostly measure sensitive information at the level of data records, attribute classes, attributes and attribute values, and few evaluate sensitivity at the level of the whole data set. Furthermore, existing methods only calculate a sensitivity level or risk score; the calculation process and results have low interpretability, are hard to understand, and lack a form that both humans and machines can understand. Existing sensitivity processing therefore cannot meet actual requirements for population health data sensitivity assessment, which reduces the security and data processing efficiency of subsequent population health data sharing.

Disclosure of Invention

In view of the above problems, the invention provides a sensitivity processing method and system for a population health data set, which can meet actual population health data evaluation requirements and improve the efficiency and safety of data application.

In order to achieve the above object, the present invention provides the following technical solutions:

a method of sensitivity processing of a population health dataset, comprising:

acquiring a population health data set to be evaluated;

carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features;

analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature;

calculating based on the analysis result corresponding to each sensitive feature to obtain a sensitivity comprehensive evaluation result of the population health data set;

and generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result.

Optionally, the identifying the sensitive information of each feature of the population health data set to obtain a sensitive feature corresponding to each feature includes:

acquiring each feature dimension of the population health data set, wherein the feature dimensions comprise: metadata features, data item features, and data value features;

performing sensitive information identification on the metadata features based on target judgment rules to obtain the metadata sensitive features, wherein the target judgment rules are determined based on marked metadata;

based on the sensitive information type dictionary, determining whether the data item features comprise sensitive information type item words, and if so, obtaining the data item sensitive features;

and identifying the data value of the data item sensitive characteristic, and if the data value obtained by identification meets the identification condition corresponding to the sensitive information value, obtaining the data value sensitive characteristic.

Optionally, the analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature includes:

analyzing the data quantity, the time span, the object characteristics, the theme type and the main body quantity of the metadata sensitive characteristics to obtain metadata sensitive characteristic analysis results;

analyzing the sensitive information type characteristics and the sensitive information quantity characteristics of the sensitive characteristics of the data items to obtain a data item sensitive characteristic analysis result;

and analyzing the value quantity characteristic, the value distribution characteristic and the value precision degree characteristic of the data value sensitive characteristic to obtain a data value sensitive characteristic analysis result.

Optionally, the calculating based on the analysis result corresponding to each sensitive feature, to obtain a comprehensive sensitivity evaluation result of the population health dataset includes:

calculating and obtaining a leakage loss degree value based on the metadata sensitive characteristic analysis result;

calculating to obtain an identification degree value based on the data item sensitive characteristic analysis result and the data value sensitive characteristic analysis result;

and calculating to obtain a sensitivity comprehensive evaluation result of the population health data set based on the leakage loss degree value and the identification degree value.

Optionally, the generating a sensitivity evaluation report of the population health dataset based on the sensitivity comprehensive evaluation result includes:

determining basic information of the population health data set;

determining the information to be displayed for the sensitivity evaluation result based on the sensitivity comprehensive evaluation result, wherein the information to be displayed comprises a data set identification degree, a data set leakage loss degree, a data set sensitivity reference value and sensitive characteristic data;

determining sensitive characteristic mark information based on the sensitivity comprehensive evaluation result;

and generating a sensitivity evaluation report of the population health data set according to the basic information, the information to be displayed and the sensitive characteristic mark information.

A population health dataset sensitivity processing system, comprising:

an acquisition unit for acquiring a population health dataset to be evaluated;

the identification unit is used for carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features;

the analysis unit is used for analyzing each sensitive characteristic to obtain an analysis result corresponding to each sensitive characteristic;

the computing unit is used for computing based on the analysis result corresponding to each sensitive characteristic to obtain a sensitivity comprehensive evaluation result of the population health data set;

and the generating unit is used for generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result.

Optionally, the identifying unit includes:

a first obtaining subunit, configured to obtain each feature dimension of the population health data, where the feature dimensions include: metadata features, data item features, and data value features;

the first identification subunit is used for carrying out sensitive information identification on the metadata characteristics based on target judgment rules, so as to obtain the metadata sensitive characteristics, wherein the target judgment rules are determined based on marked metadata;

a first determining subunit, configured to determine, based on a sensitive information type dictionary, whether the data item feature includes a sensitive information type item word, and if so, obtain a data item sensitive feature;

and the second identification subunit is used for identifying the data value of the data item sensitive characteristic, and if the data value obtained by identification meets the identification condition corresponding to the sensitive information value, the data value sensitive characteristic is obtained.

Optionally, the analysis unit includes:

the first analysis subunit is used for analyzing the data quantity, the time span, the object characteristics, the theme type and the main body quantity of the metadata sensitive characteristics to obtain metadata sensitive characteristic analysis results;

the second analysis subunit is used for analyzing the sensitive information type characteristics and the sensitive information quantity characteristics of the sensitive characteristics of the data items to obtain the analysis result of the sensitive characteristics of the data items;

and the third analysis subunit is used for analyzing the value quantity characteristic, the value distribution characteristic and the value precision degree characteristic of the data value sensitive characteristic to obtain a data value sensitive characteristic analysis result.

Optionally, the computing unit includes:

the first calculating subunit is used for calculating and obtaining a leakage loss degree value based on the metadata sensitive characteristic analysis result;

the second calculating subunit is used for calculating to obtain an identification degree value based on the data item sensitive characteristic analysis result and the data value sensitive characteristic analysis result;

and the third calculation subunit is used for calculating and obtaining the comprehensive sensitivity evaluation result of the population health data set based on the leakage loss degree value and the identification degree value.

Optionally, the generating unit includes:

a second determination subunit configured to determine basic information of the population health dataset;

the third determining subunit is used for determining the information to be displayed for the sensitivity evaluation result based on the sensitivity comprehensive evaluation result, wherein the information to be displayed comprises a data set identification degree, a data set leakage loss degree, a data set sensitivity reference value and sensitive characteristic data;

a fourth determination subunit, configured to determine sensitive feature tag information based on the sensitivity comprehensive evaluation result;

and the generation subunit is used for generating a sensitivity evaluation report of the population health data set according to the basic information, the information to be displayed and the sensitive characteristic mark information.

Compared with the prior art, the invention provides a sensitivity processing method and a sensitivity processing system for a population health data set, wherein the sensitivity processing method comprises the following steps: acquiring a population health data set to be evaluated; carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features; analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature; calculating based on the analysis result corresponding to each sensitive feature to obtain a sensitivity comprehensive evaluation result of the population health data set; and generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result. The invention realizes the discovery, identification, analysis and processing of the sensitive information, meets the application requirements of the sensitivity evaluation of the population health data set through multidimensional analysis, and improves the efficiency and the safety of the subsequent population health data application.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for sensitivity processing of a population health data set according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a method for evaluating sensitivity of population health data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of metadata feature information according to an embodiment of the present invention;

FIG. 4 is a diagram of a data sensitivity evaluation dimension according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a sensitive data calculation flow provided in an embodiment of the present invention;

FIGS. 6 (a) -6 (c) are diagrams illustrating a data sensitivity assessment report according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a sensitivity processing system for a population health data set according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terms first and second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to the listed steps or elements but may include steps or elements not expressly listed.

The embodiment of the invention provides a population health data set sensitivity processing method, which is a data set sensitivity evaluation method facing population health data sharing requirements, provides support for the discovery, identification, analysis and processing of sensitive information of population health data sharing, and effectively implements population health data hierarchical management and safe sharing through computer assistance.

Referring to FIG. 1, a flow chart of a method for processing sensitivity of a population health data set according to an embodiment of the present invention is provided, where the method includes:

S101, acquiring a population health data set to be evaluated.

The population health data set to be evaluated is the original population health data set, which has not undergone any processing. Population health data involve highly sensitive information about large numbers of individuals or groups, such as identity information, health information and genetic information; if this information leaks during data sharing, it brings security risks and property losses of varying degrees to countries, society or individuals. Therefore, sensitivity processing is required in the embodiment of the invention to ensure the safety of subsequent applications of the population health data set, such as data sharing.

S102, identifying sensitive information of each feature of the population health data set, and obtaining sensitive features corresponding to each feature.

S103, analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature.

In the embodiment of the invention, the sensitive information is identified and analyzed from multiple dimensions of the population health data set, and the corresponding multiple dimensions are embodied in various characteristics of the population health data set, wherein the characteristics comprise metadata characteristics, data item characteristics and data value characteristics.

Sensitive information identification for a population health data set is mainly based on the organizational structure of the data set and covers sensitive information features at three levels: the metadata describing the data, the data items, and the data values. In the embodiment of the invention, based on the defined categories of sensitive information in population health data, it is detected whether the data set contains sensitive information, which types it contains, which sensitive features it has, and where the sensitive information is located. Different identification processes are employed for the metadata, data items and data values of a data set.

In one embodiment of the invention, 12 metadata features to be identified are determined. Identification of data item features means detecting which data items in the tabular data set corresponding to the population health data set belong to the sensitive information types defined by the invention; it mainly targets relational data tables, that is, tables in which no cell contains a large block of text. Identification of data value features has two aspects. On the one hand, for the data columns in a relational data table that have been identified as a sensitive information type, it determines the number of non-null values, the distribution of the data values, the precision of certain special data types, and so on. On the other hand, a table may contain unstructured text, in which specific sensitive information values are identified and then analyzed statistically. It should be noted that the kinds and types of metadata features, and the specific identification of data items and data values, are to be determined according to actual requirements, which is not limited by the present invention.

The sensitive features of the invention are a set of interrelated indicators that divide the factors influencing data sensitivity in a scientific, specific and fine-grained way. They lay a foundation for data sensitivity assessment and are used to describe and reveal the sensitive information in a data set, helping data submitters, data managers and data users understand the sensitivity level of a population health data set more objectively and fairly. Sensitive feature analysis starts from the 3 dimensions of metadata, data items and data values, and covers the major factors of sensitive information types, sensitive information values, data subjects (subject type and number of subjects) and overall descriptive information (data volume, data time, data subject type, etc.). Specific processing is described in subsequent embodiments of the invention.

S104, calculating based on the analysis result corresponding to each sensitive characteristic to obtain the comprehensive sensitivity evaluation result of the population health data set.

The sensitivity of a data set is determined by two major factors, the degree of identification and the degree of leakage loss, where the degree of leakage loss is determined by the data topic type and the data volume. Both the degree of identification and the degree of leakage loss are positively correlated with data sensitivity. The data sensitivity calculation takes the results of sensitive information identification and feature analysis as input and consists of 3 stages: identification degree evaluation (determining the identification degree value), leakage loss degree evaluation (determining the leakage loss degree value) and comprehensive sensitivity calculation. It finally yields a sensitivity reference value representing the level of sensitive information content in the data set, that is, the sensitivity comprehensive evaluation result of the population health data set.
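The third stage can be illustrated with a minimal sketch. The passage above states only that both factors are positively correlated with sensitivity; it does not give the combining formula. The product used below, the [0, 1] value range and the function name are therefore assumptions chosen purely for illustration.

```python
def sensitivity_reference_value(identification_degree: float,
                                leakage_loss_degree: float) -> float:
    """Combine two scores in [0, 1] into one sensitivity value in [0, 1].

    ASSUMPTION: the product is used only because it never decreases when
    either input increases. The actual formula is not stated in the text.
    """
    inputs = (("identification_degree", identification_degree),
              ("leakage_loss_degree", leakage_loss_degree))
    for name, value in inputs:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return identification_degree * leakage_loss_degree
```

Any other function that is non-decreasing in both arguments, such as a weighted sum, would satisfy the stated positive correlation equally well.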

S105, generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result.

The sensitivity evaluation report generally needs to include a unique detection number, so that the authenticity of the report can be verified, and the time of the sensitivity detection. Its contents may include basic information of the population health data set, the data sensitivity evaluation result, the sensitive feature analysis of the data set, sensitive information position marks, and so on. The specific contents may be determined based on actual requirements, which is not limited by the present invention.
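A report with these parts might be represented as in the sketch below. The class name, the field names, the use of a random UUID as the detection number and the UTC ISO 8601 timestamp are all illustrative assumptions; the text requires only that the number be unique and that the detection time be recorded.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SensitivityReport:
    """Hypothetical container for the report contents listed above."""
    dataset_info: dict          # basic information of the data set
    evaluation_result: dict     # data sensitivity evaluation result
    feature_analysis: dict      # sensitive feature analysis of the data set
    sensitive_positions: list   # sensitive information position marks
    # ASSUMPTION: a random UUID serves as the unique detection number.
    detection_number: str = field(default_factory=lambda: uuid.uuid4().hex)
    # ASSUMPTION: detection time is stored as a UTC ISO 8601 string.
    detected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


report = SensitivityReport(
    dataset_info={"name": "example data set"},
    evaluation_result={"sensitivity_reference_value": 0.25},
    feature_analysis={},
    sensitive_positions=[("column", "phone")],
)
```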

The embodiment of the invention provides a sensitivity processing method for a population health data set, which comprises the following steps: acquiring a population health data set to be evaluated; carrying out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features; analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature; calculating based on the analysis result corresponding to each sensitive feature to obtain a sensitivity comprehensive evaluation result of the population health data set; and generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result. The invention realizes the discovery, identification, analysis and processing of the sensitive information, meets the application requirements of the sensitivity evaluation of the population health data set through multidimensional analysis, and improves the efficiency and the safety of the subsequent population health data application.

Referring to FIG. 2, a schematic diagram of the framework of a population health data sensitivity evaluation method according to an embodiment of the present invention is provided. The framework mainly comprises sensitive information identification, multidimensional sensitive feature analysis, sensitivity evaluation calculation and evaluation report generation; that is, a population health data set to be evaluated is input into the framework, and a sensitivity evaluation report for that data set is output.

Sensitive information identification of population health datasets is primarily based on dataset organization structures. The data set includes metadata describing the data, data items, and sensitive information feature identification at different levels of data values. The invention detects whether the data set contains sensitive information, which types are contained, which sensitive features are contained, and the location of the sensitive information based on the category of sensitive information in the defined population health data. Different sensitive information identification methods are employed for different data set metadata, data items and data values.

Correspondingly, in one embodiment of the present invention, the identifying the sensitive information of each feature of the population health data set to obtain a sensitive feature corresponding to each feature includes:

The metadata features, the data item features and the data value features are identified in different ways. Specifically, the metadata features are identified through target judgment rules, that is, rules that need to be set; the data item features are identified through the sensitive information type dictionary; and after the data item sensitive features are identified, the corresponding data values are identified to judge whether they are sensitive information.

Specifically, in the embodiment of the present invention, 12 metadata features to be identified are determined. Referring to FIG. 3, a schematic diagram of the metadata feature information provided in the embodiment of the present invention, metadata feature identification is divided into two parts. One part can be obtained directly from the metadata annotations of the data set: the 4 features of record count, time span, time novelty and whether the data are human genetic resources are obtained from the number of data records, the time range and the human genetic resources entry in the basic information of the population health data set.

The other part comprises 8 features: "age characteristics of the data subjects", "aggregate or individual data", "whether biometric data is included", "whether clinical records are involved", "whether sensitive clinical records are involved", "whether the data is finance-related", "number of data subjects" and "number of sensitive information types". These are determined by rules that need to be set. The rule judgment relies mainly on information such as the data set name, keywords and data description. The number of data subjects and the number of sensitive information types cannot be read from this information alone; they are calculated from the data items and data values after those have been identified.

Identification of data item features means detecting which data items in the tabular data set corresponding to the population health data set belong to the sensitive information types defined in this study. It mainly targets relational data tables, that is, tables in which no cell contains a large block of text. Based on the characteristics of population health data sets, this study provisionally selects 44 sensitive information types for identification.
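Rule-based judgment of this kind can be sketched as keyword rules applied to the data set name, keywords and description. The rule table, its trigger words and all names below are hypothetical; the source does not list the actual rules.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    keywords: list[str] = field(default_factory=list)
    description: str = ""


# HYPOTHETICAL rules: feature name -> trigger words searched in the metadata.
RULES = {
    "involves_clinical_records": ("clinical", "medical record", "inpatient"),
    "contains_biometric_data": ("fingerprint", "iris", "genome"),
    "finance_related": ("cost", "payment", "insurance"),
}


def judge_metadata_features(meta: DatasetMetadata) -> dict[str, bool]:
    """Return, per feature, whether any trigger word occurs in the metadata."""
    text = " ".join([meta.name, *meta.keywords, meta.description]).lower()
    return {feature: any(word in text for word in triggers)
            for feature, triggers in RULES.items()}


example = DatasetMetadata(
    name="Inpatient cost survey",
    keywords=["insurance"],
    description="Annual hospital expenditure by department.",
)
features = judge_metadata_features(example)
```

Plain substring matching is the simplest possible rule form; real rules would likely need synonym lists and negation handling.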

For relational table data, sensitive item identification considers two factors, data item features and data value features, and proceeds in two modes, top-down and bottom-up. In the top-down mode, the name, meaning/comment, data dictionary and similar information of a data item (field/variable) are examined first, and the sensitive information type dictionary is used to judge whether the item is a sensitive information type item and, if so, which type. For example, the item words for the sensitive information type "name" include personal name, patient name, direct relative name, family member name, unit contact name, medical history narrator, name, patient, participant, XM, XRXM and so on. The first 50 instances in the column are then read, and the "sensitive information value dictionary" or the "regular expression rule base" is used to verify whether they conform to the value characteristics of the judged sensitive information type; if they do, processing continues, and if not, the column is marked as "pending". In the bottom-up mode, if neither the field name nor its comment contains a corresponding entry term, the first 50 instances are read, and the sensitive information value dictionary or the regular expression rule base is used to judge whether they are values of a sensitive information type and which one. The sensitive information type of the column is then induced upward from the values, and the field name of the column is stored in the sensitive information type dictionary, which thereby expands the dictionary.

For example, when the sensitive information type dictionary contains the item words A, B and C, fields named A, B or C can be identified directly. When the regular expression rule base shows that a field named D also holds sensitive values, D is added to the sensitive information type dictionary, and fields named D can subsequently be identified by the expanded dictionary.
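The two modes can be sketched as follows. Only the 50-instance sample size comes from the text. The two value patterns (simplified shapes for a mainland China mobile number and a 6-digit postal code), the 0.8 match ratio, the status labels and the function names are assumptions made for illustration.

```python
import re

SAMPLE_SIZE = 50     # the text reads the first 50 instances of a column
MATCH_RATIO = 0.8    # ASSUMPTION: share of samples that must match

# ASSUMPTION: simplified stand-ins for the regular expression rule base.
VALUE_PATTERNS = {
    "phone": re.compile(r"1[3-9]\d{9}"),
    "postal_code": re.compile(r"\d{6}"),
}


def value_type_of(samples):
    """Return the type whose pattern matches enough samples, else None."""
    non_empty = [s.strip() for s in samples[:SAMPLE_SIZE] if s and s.strip()]
    if not non_empty:
        return None
    for type_name, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for s in non_empty if pattern.fullmatch(s))
        if hits / len(non_empty) >= MATCH_RATIO:
            return type_name
    return None


def identify_column(field_name, samples, type_dictionary):
    """Classify one column; may add field_name to type_dictionary."""
    key = field_name.strip().lower()
    # Top-down: look the field name up, then verify against the values.
    for type_name, item_words in type_dictionary.items():
        if key in item_words:
            verified = value_type_of(samples) == type_name
            return type_name, ("confirmed" if verified else "pending")
    # Bottom-up: infer the type from the values and expand the dictionary.
    inferred = value_type_of(samples)
    if inferred is not None:
        type_dictionary.setdefault(inferred, set()).add(key)
        return inferred, "inferred"
    return None, "not_sensitive"
```

After a bottom-up hit, the same field name is resolved top-down on the next call, which mirrors the dictionary expansion described above.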

On the one hand, the identification of the data value characteristics, on the other hand, the non-null value number, the distribution characteristics of the data value, the accuracy degree of certain special data types and the like in the data columns which are identified as the sensitive information types in the relational data table are judged. If the time information is to be judged to be accurate to the degree of years, months, days and the like; the location information is accurate to what extent province, city, county (district), street, etc.; the age information is accurate to a specific age range such as 14 years old or just 10-20 years old; the disease information is used for judging whether the disease is a sensitive disease or not and which sensitive disease category belongs to mental disorder, sexual and fertility, genetic and social sensitive infectious diseases. On the other hand, for non-relational data tables, unstructured text information may be included in the tables, in which specific sensitive information values are identified and subsequently statistically analyzed.
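The value features for a date column could be computed along the lines of the sketch below. It assumes dates are written as YYYY, YYYY-MM or YYYY-MM-DD; the source does not specify the formats handled, and the function names and result keys are illustrative.

```python
import re
from collections import Counter

# ASSUMPTION: ISO-like date strings only; most precise pattern first.
_PRECISION_PATTERNS = [
    ("day", re.compile(r"\d{4}-\d{1,2}-\d{1,2}")),
    ("month", re.compile(r"\d{4}-\d{1,2}")),
    ("year", re.compile(r"\d{4}")),
]


def time_precision(value):
    """Return 'day', 'month' or 'year', or None if the value fits no pattern."""
    text = value.strip()
    for label, pattern in _PRECISION_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return None


def column_value_features(values):
    """Summarise non-null count, distinct count and precision distribution."""
    non_null = [v.strip() for v in values if v is not None and v.strip()]
    precisions = Counter(time_precision(v) for v in non_null)
    return {
        "non_null_count": len(non_null),
        "distinct_count": len(set(non_null)),
        "precision_counts": dict(precisions),
    }


summary = column_value_features(["2021-07-28", "", None, "2021", "2021"])
```

Values that fit none of the patterns, such as relative times, are counted under the key `None` in `precision_counts`.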

Sensitive information can be divided into 3 major categories according to data type: numeric types, such as identification card numbers, telephone numbers and postal codes; date types, such as birth dates and medical action dates; and named entity types, such as names, addresses and medical institutions. Regular expression rules are used to identify numeric type and date type sensitive information, and examples of the rules are shown in table 1. Although information such as an identification card number can be easily identified with a regular expression, it is easily confused with, for example, the image code of an electrocardiogram. To ensure the accuracy of numeric type sensitive information identification, a candidate set is first screened out with the regular expressions, and candidates that are not protected health information are then removed through semantic judgment of the context; for example, if words such as "image" or "medicine" appear in the field item description, the candidate is deleted from the candidate result set. In addition, relative times such as "last month" or "winter" may also occur in population health data; such information is not processed here, because an attacker cannot obtain a specific date from a relative time.
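
The candidate screening followed by context filtering can be sketched as below. The two patterns are simplified illustrations and are not the rules of table 1; the exclusion word list and the function names are likewise assumptions.

```python
import re

# Illustrative patterns only (assumptions, not the rules of table 1).
ID_CANDIDATE = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
DATE = re.compile(
    r"(?<!\d)(?:19|20)\d{2}[-/.](?:0?[1-9]|1[0-2])[-/.](?:0?[1-9]|[12]\d|3[01])(?!\d)"
)
# Hypothetical context words that indicate a non-identity code.
EXCLUDE_WORDS = ("image", "medicine")


def numeric_candidates(field_description, values):
    """Screen ID-like values, then drop them if the field context rules them out."""
    candidates = [v for v in values if ID_CANDIDATE.search(v)]
    description = field_description.lower()
    if any(word in description for word in EXCLUDE_WORDS):
        return []
    return candidates


def find_dates(text):
    """Return absolute dates found in free text; relative times are ignored."""
    return [m.group(0) for m in DATE.finditer(text)]
```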

Table 1 sensitive information identification regular expression example

Data of named entity types, such as addresses and medical institution names, are identified in a dictionary-based manner, optimized with an AC automaton (Aho-Corasick automaton). For the identification of disease sensitive information, a sensitive information dictionary is built by reusing the Chinese version of the International Classification of Diseases (ICD) and the Chinese Medical Subject Headings (CMesh); it covers 4 categories of sensitive diseases, namely mental disorders, sexual and reproductive diseases, genetic diseases and socially sensitive infectious diseases, currently contains 405 subject headings and their corresponding entry terms, and can be expanded later. For the identification of data subject names, a list of common Chinese surnames is applied. For the identification of ethnicity sensitive information, a list of Chinese ethnic groups and a dictionary of their abbreviations are constructed. Dictionaries are also constructed for marital status, occupation, religious belief and educational background respectively. For address identification, the "national level 5 administrative division table", which contains 747748 records, is applied. For the identification of medical institution names, a dictionary of medical institutions in China is applied, which includes 49582 medical institutions of various grades in China.
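
For readers unfamiliar with the AC automaton, the following is a minimal, generic Aho-Corasick sketch for matching many dictionary words in one pass over a text. It is a textbook-style illustration and not the implementation of the invention; the class and method names are hypothetical.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-pattern dictionary matching."""

    def __init__(self, words):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # dictionary words recognized at each state
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 states keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit the words recognized at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def search(self, text):
        """Return (start, end_exclusive, word) for every dictionary hit."""
        hits, state = [], 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                hits.append((i - len(word) + 1, i + 1, word))
        return hits
```

With a dictionary built once from, say, administrative division names, search() scans each text in time roughly linear in its length plus the number of hits, which is the usual reason for preferring this structure over testing each dictionary entry separately.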

In one embodiment of the present invention, the analyzing each sensitive feature to obtain an analysis result corresponding to each sensitive feature includes:

In the invention, the sensitive features refer to a set of interrelated indicators obtained by dividing the factors that influence data sensitivity in a scientific, specific and fine-grained way. They lay the foundation for data sensitivity assessment and are used to describe and reveal the state of sensitive information in a data set, so that data submitters, data managers and data users can understand the sensitivity level of a population health data set more objectively and fairly. Referring to fig. 4, a schematic diagram of data sensitivity evaluation dimensions provided in an embodiment of the present invention, the sensitive feature analysis starts from the 3 dimensions of metadata, data items and data values, and covers 4 major factors: sensitive information types, sensitive information values, data subjects (subject types, subject numbers), and overall description information (data amount, data time, data subject type, etc.).

The metadata dimension analyzes 12 metadata characteristics, specifically: the number of data set records, time span, temporal novelty, age characteristics of the data subjects, whether the data are aggregate or individual type data, whether human genetic resources are involved, whether biometric data are involved, whether clinical records are involved, whether sensitive clinical records are involved, whether financial related data are involved, the number of data subjects involved, and the number of sensitive information types. Data sensitivity is a measure of the sensitive information content of a data set, and quantity is a direct reflection of that content; the quantity factors include the number of data records and the number of data subjects involved. Time is another factor affecting data sensitivity and includes time span and temporal novelty. The time span is similar to the number of records: the latter reflects the sensitive information content in the quantity dimension, and the former in the time dimension. Data are time-sensitive, and their sensitivity differs across time periods: in general, the older the data, the lower the sensitivity, and the more recent the data, the higher the sensitivity; for example, some data sets are allowed to be disclosed after two or more years, once their sensitivity has decreased. The latest year covered by the data set is taken as its temporal novelty, which is positively correlated with data sensitivity. Furthermore, sensitivity is defined with respect to people, so the data subjects involved in a data set are an important factor affecting its sensitivity. The data subject (Data Subject) refers to a natural person whose personal information is disclosed, explicitly or implicitly, as personal data on a network.
The objects that data describe include individual type objects such as data subjects (individual type data) and group type objects (aggregate data); data describing individuals are more sensitive than data describing groups. In addition, different types of data subjects have different sensitivities. For example, the national Personal Information Protection Law requires strengthened protection of the information of minors; other special data subject types include the elderly, pregnant women and the like. For example, MIMIC transforms the ages of elderly patients to reduce data sensitivity.
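
A few of the quantity and time characteristics above can be computed as in the following sketch. The record layout, the key names, and the choice to count the time span as latest year minus earliest year plus one are assumptions made for illustration; the description itself only states that the latest year is taken as the temporal novelty.

```python
def metadata_features(records, year_key="year", subject_key="subject_id"):
    """Compute record count, subject count, time span and temporal novelty.

    `records` is assumed to be a list of dicts; both key names are hypothetical.
    """
    years = [r[year_key] for r in records if r.get(year_key) is not None]
    subjects = {r[subject_key] for r in records if r.get(subject_key) is not None}
    return {
        "record_count": len(records),
        "subject_count": len(subjects),
        # Assumed definition: inclusive number of calendar years covered.
        "time_span_years": (max(years) - min(years) + 1) if years else 0,
        # The latest year covered by the data set.
        "time_novelty": max(years) if years else None,
    }
```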

The data item dimension analyzes 9 data item characteristics, specifically: the number of direct identification information types, the number of indirect identification information types, the number of network identification information types, the number of communication information types, the number of location information types, the number of time information types, the number of sensitive disease types, the number of medical insurance payment information types, and the number of health medical record identification information types. The characteristics in the data item dimension mainly take the form "number of types of a certain kind of sensitive information"; for example, the "number of direct identification information types" refers to how many columns of specific direct identification information, such as name, identification card number, birth certificate number, social security number and passport number, appear in the data set.

The data item dimension reflects the content framework of the data set, that is, it reflects at the content level which sensitive information types the data set contains. According to content, the sensitive information related to population health data can be classified into identity information (direct identity information, indirect identity information, network identity information), biometric information, communication information, location information, time information, health and medical information and the like. Whether a data set contains a certain kind of sensitive information, and which specific types of that kind it contains, are important factors in measuring the sensitivity of the data set. Identity information is strongly identifying and can link the various information in a data set to its data subjects. Biometric information includes genetic data, facial recognition features, fingerprints, palm prints, iris information and the like. With the wide use of next-generation sequencing technologies in medical research and the rapid development of clinical genomics, it has become possible to infer the appearance features of individuals and their families from genome sequencing results, so the genetic information of individuals and their families has become sensitive information that requires attention in research, medical treatment and data sharing.

The data value dimension analyzes 14 data value characteristics, specifically including the number of direct identification information values, the number of indirect identification information values, the number of network identification information values, the degree of masking of communication information, the number of location information values, the degree of accuracy of location information, the number of birth time values, the degree of accuracy of birth time, the number of other behavior time values, the degree of accuracy of other behavior times, the number of sensitive disease values, the number of medical insurance payment information values and the like. The "number of values of a certain kind of information" refers to the number of specific values of that information type, which can be calculated from the sensitive information identification result; the "degree of accuracy of a certain kind of information", such as the degree of accuracy of the date of birth, refers to whether the value is accurate to the year, month or day, etc.

Sensitive feature analysis is also performed in the data value dimension because two data sets may have the same quantity, time, subject and type features at the metadata level and contain the same sensitive information types at the data item level, yet still be filled to different degrees at the instance level. For example, the proportion of missing values under the same information type item may differ, the degree of accuracy of specific information type values such as dates and addresses may differ, and the number of data subjects related to each sensitive information type may differ, so the sensitivity of the two data sets may differ.

In another embodiment of the present invention, the calculating based on the analysis result corresponding to each sensitive feature, to obtain the comprehensive sensitivity evaluation result of the population health dataset includes:

The sensitivity of a data set is determined by two major factors, namely, the degree of identification and the degree of leakage loss, where the degree of leakage loss is determined by the type of data topic and the amount of data. Both the extent of identification and the extent of leakage loss are positively correlated with data sensitivity. The data sensitivity calculation receives the sensitive information identification and characteristic analysis results, and is mainly divided into 3 links of identification degree evaluation (identification degree value determination), leakage loss degree evaluation (leakage loss degree value determination) and sensitivity comprehensive calculation, and finally a sensitivity reference value representing the sensitive information content level of the data set is obtained.

Referring to fig. 5, a schematic diagram of the data sensitivity calculation flow provided in an embodiment of the present invention, the flow includes performing identification degree calculation and leakage loss degree calculation based on the received sensitive feature analysis result. The identification degree calculation includes: judging whether the data are aggregated data and, if so, assigning identification level 1; if not, judging whether a direct identifier is contained and, if so, assigning identification level 4; if not, quantitatively calculating the re-identification risk and determining whether the risk is greater than a threshold value, assigning identification level 3 if it is and identification level 2 if it is not; the identification degree score is then obtained from the identification level. The leakage loss degree calculation includes: judging whether the data are biometric data and, if so, directly performing the combined quantity calculation; if not, judging whether the data are financial related data and, if so, directly performing the combined quantity calculation; if not, judging whether the data are clinical data and, further, whether they are special clinical data, and then performing the combined quantity calculation; the leakage loss degree score is obtained from the result of the combined quantity calculation. Finally, the comprehensive sensitivity score of the population health data set is calculated based on the identification degree score and the leakage loss degree score.
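
The identification level decision described above can be written as a short function. The function name and parameters are hypothetical, and the default threshold of 0.5 is an assumption borrowed from the application example later in the description, where a re-identification risk below 0.5 yields level 2.

```python
def identification_level(is_aggregated, has_direct_identifier, reid_risk, threshold=0.5):
    """Return the identification level (1-4) following the decision flow of fig. 5.

    reid_risk is the quantitatively calculated re-identification risk; the
    default threshold is an assumption, not a value fixed by the description.
    """
    if is_aggregated:
        return 1          # aggregated data
    if has_direct_identifier:
        return 4          # individual data containing a direct identifier
    return 3 if reid_risk > threshold else 2
```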

In one implementation of the embodiment of the present invention, the generating the sensitivity evaluation report of the population health dataset based on the sensitivity comprehensive evaluation result includes:

determining basic information of the population health data set;

Specifically, the data sensitivity evaluation report is designed to carry a unique detection number to prevent forgery and is marked with the detection time. The main content comprises 4 parts:

(1) Basic information of the detected data set. This includes the English name, scientific resource identifier, data resource creation institution and data resource creator, together with a brief introduction to the main content and sources of the data set.

(2) Data sensitivity evaluation. The sensitivity of the data set is mainly determined by its identification degree and its leakage loss degree. The report shows the identification level of the data set and the corresponding identification degree value, shows the topic type, the data volume and the corresponding leakage loss degree value of the data set, and shows the complete evaluation reference table, so that a report reader can intuitively and comprehensively understand the identification degree and the leakage loss degree of the data set. Finally, the report gives a specific sensitivity reference value for the data set and notes that the range of the reference value is [0.05, 1]; the report reader can judge the sensitivity of the data set from the position of the reference value within this range, in combination with manual review. In addition, the report also shows the annotation originally made by the data submitter as to whether the data set contains sensitive information, so that the correctness of the annotation can be verified.

(3) Data set sensitive feature analysis. The sensitive features of the data set cover the three dimensions of metadata, data items and data values. The metadata dimension comprises 12 items such as the number of records, the time span and the number of data subjects involved; the data item and data value dimensions are merged for display and mainly comprise 23 sensitive features, such as the number of types of each kind of sensitive information, the number of values of each kind of sensitive information, and the degree of accuracy of each special sensitive information type. The report is intended to allow data submitters and data managers to understand and ascertain the state of sensitive information in the data set through interpretation of the features described above.

(4) Data information location markers. This section primarily identifies the location, manner of presentation, recognition pattern, and degree of accuracy of sensitive information in the dataset, etc., in an effort to provide a reference for possible subsequent desensitization processing operations of the dataset.

The embodiment of the invention provides a sensitivity processing method for population health data sets that is oriented to population health data sharing requirements. The constructed sensitivity evaluation method can measure, at the data set level, the types, characteristics, distribution and content of the sensitive information in the data set itself, and can serve as a measurement standard to guide the discovery, identification, detection and hierarchical management of sensitive information in data sharing and to support subsequent processing. The invention provides a complete data set sensitivity evaluation method covering four key links: sensitive information identification, sensitive feature analysis, sensitivity calculation and evaluation report generation. It designs a dictionary base and a rule base oriented to sensitive information identification requirements, automatically scans the data set to generate position marks for the sensitive information, performs feature analysis, calculates and evaluates sensitivity based on the feature scanning results, and provides computer-aided support for the detection, auditing and hierarchical management of sensitive information in data.

A specific application example is described below. The population health data set to be evaluated is the "XXX psychiatric case data set"; because data of this type are not completely public, "XXX" is used in place of the related information in the name for convenience of description. This data set is not a typical relational data table but contains unstructured text information, and it is used as an example to show the sensitivity evaluation report for this type of data set. The data set itself is not shown as sample data, because the sensitive information it contains is not suitable for disclosure. The data set includes the chief complaints, present medical history, past history, diagnosis, treatment plan and other information of 270 psychiatric patients.

The data sensitivity evaluation report of the XXX psychiatric case data set is shown in FIGS. 6(a)-6(c), with part of the information masked in the presentation. The data set contains no direct identifiers but contains quasi-identifiers such as gender, age, marital status and nationality, and its re-identification risk is less than 0.5, so its identification degree is judged to be level 2, with an identification degree value of 0.3. The data subjects of the data set include minors and elderly people, and the data set belongs to the topic of mental disorder diseases; combined with its data volume of 270 records (fewer than 500) and evaluated against the leakage loss degree evaluation reference table, the leakage loss degree value is assigned as 0.5. Finally, the identification degree value and the leakage loss degree value are added and normalized to obtain a sensitivity reference value of 0.4. On average, a single record of the data set contains a large amount of sensitive information, but the total number of records is small, so the sensitivity reference value, which represents the sensitive information content of the whole data set, is not high.
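
The arithmetic of this example can be reproduced with the sketch below. The description states only that the two values are added and normalized; dividing the sum by 2 is an assumption that happens to reproduce the reported result, (0.3 + 0.5) / 2 = 0.4, and the function name is hypothetical.

```python
def sensitivity_reference(identification_value, leakage_loss_value):
    """Combine the two degree values into one sensitivity reference value.

    Assumption: "added and normalized" means the arithmetic mean of the two
    values, each of which is expected to lie between 0 and 1.
    """
    for value in (identification_value, leakage_loss_value):
        if not 0 <= value <= 1:
            raise ValueError("degree values are expected to lie in [0, 1]")
    return (identification_value + leakage_loss_value) / 2


# Values from the application example: level 2 (0.3) and leakage loss 0.5.
example = sensitivity_reference(0.3, 0.5)
```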

For such a data set, the sensitive information position marking part is somewhat special. Because the data set is not a typical relational data table, a cell located by its record row number and column may contain a large block of text. Therefore, when marking sensitive information positions, the data record row number, the field name, and the start and end positions (for example, start: 468, end: 477) of the sensitive information (for example, the discharge date 2019-12-14) within the text are recorded.
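
Position marking for text-heavy cells can be sketched as follows. The record layout and the output keys are hypothetical, and only a simple date pattern is searched here. The end offset is treated as inclusive, which is an assumption consistent with the example above, where a 10-character date spans positions 468 to 477.

```python
import re

# Simplified absolute date pattern (assumption for illustration).
DATE = re.compile(r"(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)")


def mark_positions(rows):
    """Record row number, field name and character span of each date found.

    `rows` is assumed to be a list of dicts mapping field names to text.
    """
    marks = []
    for row_number, row in enumerate(rows, start=1):
        for field, text in row.items():
            for m in DATE.finditer(text or ""):
                marks.append({
                    "row": row_number,
                    "field": field,
                    "start": m.start(),
                    "end": m.end() - 1,   # inclusive end offset
                    "value": m.group(0),
                })
    return marks
```

Marks of this form could later drive a desensitization step, since each one identifies exactly which characters of which cell would need to be masked.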

Based on the foregoing embodiments, embodiments of the present invention further provide a population health dataset sensitivity processing system, see fig. 7, including:

an acquisition unit 10 for acquiring a population health dataset to be evaluated;

the identifying unit 20 is configured to identify sensitive information of each feature of the population health dataset, and obtain a sensitive feature corresponding to each feature, where the features include metadata features, data item features and data value features;

an analysis unit 30, configured to analyze each of the sensitive features to obtain an analysis result corresponding to each of the sensitive features;

a calculating unit 40, configured to calculate based on the analysis result corresponding to each sensitive feature, and obtain a sensitivity comprehensive evaluation result of the population health dataset;

a generating unit 50 for generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result.

Further, the identification unit includes:

Further, the analysis unit includes:

Further, the computing unit includes:

Further, the generating unit includes:

The embodiment of the invention provides a population health data set sensitivity processing system, in which: the acquisition unit acquires a population health data set to be evaluated; the identification unit carries out sensitive information identification on each feature of the population health data set to obtain a sensitive feature corresponding to each feature, wherein the features comprise metadata features, data item features and data value features; the analysis unit analyzes each sensitive feature to obtain an analysis result corresponding to each sensitive feature; the calculation unit calculates based on the analysis result corresponding to each sensitive feature to obtain the comprehensive sensitivity evaluation result of the population health data set; and the generating unit generates a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result. The invention realizes the discovery, identification, analysis and processing of sensitive information, meets the application requirements of sensitivity evaluation of population health data sets through multidimensional analysis, and improves the efficiency and the safety of subsequent population health data applications.

Based on the foregoing embodiments, embodiments of the present application provide a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps of the population health dataset sensitivity processing method as set forth in any one of the above.

Based on the foregoing embodiments, an embodiment of the present application further provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the population health dataset sensitivity processing method.

In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical and similar parts the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for sensitivity processing of a population health dataset, comprising:

acquiring a population health data set to be evaluated;

generating a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result;

the identifying the sensitive information of each feature of the population health data set to obtain the sensitive feature corresponding to each feature includes:

performing sensitive information identification on the metadata features based on target judgment rules to obtain metadata sensitive features, wherein the target judgment rules are determined based on marked metadata;

2. The method of claim 1, wherein analyzing each of the sensitive features to obtain an analysis result corresponding to each of the sensitive features comprises:

3. The method of claim 2, wherein the calculating based on the analysis result corresponding to each sensitive feature to obtain the comprehensive sensitivity assessment result of the population health dataset comprises:

4. The method of claim 1, wherein the generating a sensitivity assessment report of the population health dataset based on the sensitivity comprehensive assessment results comprises:

determining basic information of the population health data set;

5. A population health dataset sensitivity processing system, comprising:

an acquisition unit for acquiring a population health dataset to be evaluated;

a generating unit, configured to generate a sensitivity evaluation report of the population health data set based on the sensitivity comprehensive evaluation result;

wherein the identification unit includes:

the first identification subunit is used for carrying out sensitive information identification on the metadata characteristics based on target judgment rules to obtain metadata sensitive characteristics, wherein the target judgment rules are determined based on marked metadata;

6. The system of claim 5, wherein the analysis unit comprises:

7. The system of claim 6, wherein the computing unit comprises:

8. The system of claim 5, wherein the generating unit comprises: