CN111382457A - Data risk assessment method and device - Google Patents

Data risk assessment method and device

Info

Publication number
CN111382457A
Authority
CN
China
Prior art keywords
data
evaluation
assessment
desensitization
determining
Prior art date
Legal status
Granted
Application number
CN201811627005.1A
Other languages
Chinese (zh)
Other versions
CN111382457B (en)
Inventor
史文钊
弓孟春
王乐子
Current Assignee
Digital China Health Technologies Co., Ltd.
Original Assignee
Digital China Health Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Digital China Health Technologies Co., Ltd.
Priority to CN201811627005.1A
Publication of CN111382457A
Application granted
Publication of CN111382457B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F 21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing


Abstract

The present application provides a data risk assessment method and device. The method includes: obtaining desensitized data; screening the desensitized data against a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier; determining an evaluation data set corresponding to each evaluation identifier combination, where an evaluation identifier combination is one result of combining different evaluation identifiers among the plurality; determining a risk assessment influence factor for each evaluation data set; and determining, based on the risk assessment influence factors, whether the acquired desensitized data carries a risk of privacy disclosure. By evaluating data that has already been desensitized and judging whether it still carries a privacy disclosure risk, desensitized data can be effectively quantitatively assessed and controlled, disclosure of users' personal privacy can be effectively prevented, and user privacy is better protected.

Description

Data risk assessment method and device
Technical Field
The present application relates to the technical field of data risk assessment, and in particular, to a method and an apparatus for data risk assessment.
Background
With the rapid growth of medical data, using large-sample data for medical research has become a clear trend, and many hospitals and doctors now need large-sample medical data for their research. Medical data, however, is private: it carries a degree of confidentiality, and disclosure of a patient's privacy poses a serious risk. When hospitals or doctors exchange or otherwise process data, the data is usually desensitized first, that is, sensitive identifiers are removed from the private data to form a de-identified data set, so as to protect users' data privacy.
However, desensitization is currently performed in many different ways, and there is no unified desensitization method or standard. As a result, different desensitization methods applied to different data sources yield different effects, and for data that has already been desensitized there is no way to evaluate whether desensitization succeeded or whether user privacy is actually protected.
Disclosure of Invention
In view of this, the present application provides a data risk assessment method and device that can quantitatively assess and control desensitized data, effectively prevent disclosure of users' personal privacy, and better protect user privacy.
An embodiment of the present application provides a data risk assessment method, which includes the following steps:
obtaining desensitized data;
screening the desensitized data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier;
determining an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers, where an evaluation identifier combination is one result of combining different evaluation identifiers among the plurality of evaluation identifiers;
determining a risk assessment influence factor for each evaluation data set;
and determining, based on the risk assessment influence factors, whether the acquired desensitized data carries a risk of privacy disclosure.
Further, the plurality of evaluation identifiers includes a basic item evaluation identifier; or the plurality of evaluation identifiers includes a basic item evaluation identifier and at least one of an occupation evaluation identifier, a marital status evaluation identifier, and an ethnicity evaluation identifier; the basic item evaluation identifier includes a gender evaluation identifier, a birth date evaluation identifier, and an address evaluation identifier.
Further, screening the desensitized data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier includes:
deleting, based on the plurality of preset evaluation identifiers, the data groups in the desensitized data whose data do not conform to the preset identifier content;
normalizing the desensitized data based on the plurality of preset evaluation identifiers, where the normalization includes unifying data formats;
encoding the normalized desensitized data based on the plurality of preset evaluation identifiers;
and determining, from the processed data, the evaluation data matching each evaluation identifier.
Further, when the plurality of evaluation identifiers includes an ethnicity evaluation identifier, normalizing the desensitized data based on the plurality of preset evaluation identifiers includes:
determining, in the desensitized data, the number of data groups for each ethnic group other than Han under the ethnicity evaluation identifier;
and if that number of data groups is smaller than a preset number, changing the ethnicity evaluation identifier of the corresponding data groups to "ethnic minority".
Further, determining a risk assessment influence factor for each evaluation data set includes:
analyzing the data in each evaluation data set;
determining, based on the analysis results, the number of occurrences of the target data groups in each evaluation data set, where a target data group occurs fewer times than the other data groups in the evaluation data set;
determining a calculation data group from among the plurality of target data groups, the calculation data group being the target data group that occurs fewer times than the other target data groups;
and determining the weight ratio of the number of occurrences of the calculation data groups to the total number of data groups in the desensitized data, where each data group among the total data groups represents the data of one user.
Further, determining, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk includes:
determining, based on the number of occurrences of the calculation data group and the weight ratio, whether the acquired desensitized data carries a privacy disclosure risk.
Further, determining, based on the number of occurrences of the calculation data group and the weight ratio, whether the acquired desensitized data carries a privacy disclosure risk includes:
judging whether the number of occurrences of the calculation data group is greater than a first preset occurrence threshold;
and if so, determining that the acquired desensitized data has strong privacy protection.
Further, determining, based on the number of occurrences of the calculation data group and the weight ratio, whether the acquired desensitized data carries a privacy disclosure risk includes:
judging whether the number of occurrences of the calculation data group is smaller than a second preset occurrence threshold;
if so, determining whether the weight ratio is greater than a preset weight threshold;
and if the weight ratio is greater than the preset weight threshold, determining that the acquired data carries a privacy disclosure risk.
An embodiment of the present application further provides a data risk assessment apparatus, which includes:
an acquisition module, configured to obtain desensitized data;
a screening module, configured to screen the desensitized data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier;
a combination module, configured to determine an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers, where an evaluation identifier combination is one result of combining different evaluation identifiers among the plurality of evaluation identifiers;
a determining module, configured to determine a risk assessment influence factor for each evaluation data set;
and an evaluation module, configured to determine, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk.
Further, the plurality of evaluation identifiers includes a basic item evaluation identifier; or the plurality of evaluation identifiers includes a basic item evaluation identifier and at least one of an occupation evaluation identifier, a marital status evaluation identifier, and an ethnicity evaluation identifier; the basic item evaluation identifier includes a gender evaluation identifier, a birth date evaluation identifier, and an address evaluation identifier.
Further, the screening module is further configured to:
delete, based on the plurality of preset evaluation identifiers, the data groups in the desensitized data whose data do not conform to the preset identifier content;
normalize the desensitized data based on the plurality of preset evaluation identifiers, where the normalization includes unifying data formats;
encode the normalized desensitized data based on the plurality of preset evaluation identifiers;
and determine, from the processed data, the evaluation data matching each evaluation identifier.
Further, the screening module is further configured to:
determine, in the desensitized data, the number of data groups for each ethnic group other than Han under the ethnicity evaluation identifier;
and if that number of data groups is smaller than a preset number, change the ethnicity evaluation identifier of the corresponding data groups to "ethnic minority".
Further, the determining module is further configured to:
analyze the data in each evaluation data set;
determine, based on the analysis results, the number of occurrences of the target data groups in each evaluation data set, where a target data group occurs fewer times than the other data groups in the evaluation data set;
determine a calculation data group from among the plurality of target data groups, the calculation data group being the target data group that occurs fewer times than the other target data groups;
and determine the weight ratio of the number of occurrences of the calculation data groups to the total number of data groups in the desensitized data, where each data group among the total data groups represents the data of one user.
Further, the evaluation module is further configured to:
determine, based on the number of occurrences of the calculation data group and the weight ratio, whether the acquired desensitized data carries a privacy disclosure risk.
Further, the evaluation module is further configured to:
judge whether the number of occurrences of the calculation data group is greater than a first preset occurrence threshold;
and if so, determine that the acquired desensitized data has strong privacy protection.
Further, the evaluation module is further configured to:
judge whether the number of occurrences of the calculation data group is smaller than a second preset occurrence threshold;
if so, determine whether the weight ratio is greater than a preset weight threshold;
and if the weight ratio is greater than the preset weight threshold, determine that the acquired data carries a privacy disclosure risk.
An embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the data risk assessment method as described above.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the data risk assessment method as described above.
The data risk assessment method and device provided by the embodiments of the present application obtain desensitized data; screen the desensitized data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier; determine an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers, where an evaluation identifier combination is one result of combining different evaluation identifiers among the plurality; determine a risk assessment influence factor for each evaluation data set; and determine, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk. By evaluating data that has already been desensitized and judging whether it still carries a privacy disclosure risk, desensitized data can be effectively quantitatively assessed and controlled, disclosure of users' personal privacy can be effectively prevented, and user privacy is better protected.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 is a diagram of a system architecture in one possible application scenario;
FIG. 2 is a flowchart of a data risk assessment method according to an embodiment of the present application;
FIG. 3 is a flowchart of a data risk assessment method according to another embodiment of the present application;
FIG. 4 is a block diagram of a data risk assessment apparatus according to an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
First, an application scenario to which the present application applies is described. The present application can be applied in the technical field of data risk assessment to quantitatively assess and control desensitized data, effectively prevent disclosure of users' personal privacy, and better protect user privacy. Referring to FIG. 1, FIG. 1 is a system diagram for this application scenario. As shown in FIG. 1, the system includes a data risk assessment apparatus, an application server, and a plurality of desensitized data sources. The data risk assessment apparatus is connected to the application server and can retrieve various data from it, such as data that has already undergone desensitization processing; the application server can obtain desensitized data from the data sources, or obtain raw data and then desensitize it. In this example the data risk assessment apparatus obtains desensitized data through the application server, but this is not limiting: in other examples, the data risk assessment apparatus may connect directly to a data source and obtain desensitized data from it.
Research shows that data desensitization currently lacks a unified method and standard, so the desensitization effects obtained with different methods differ across data sources; accordingly, for data that has already been desensitized, there is no way to evaluate whether desensitization succeeded or whether user privacy is actually protected.
On this basis, the embodiments of the present application provide a data risk assessment method and device: by evaluating data that has already undergone desensitization processing and judging whether it still carries a privacy disclosure risk, desensitized data can be effectively quantitatively assessed and controlled, disclosure of users' personal privacy can be effectively prevented, and user privacy is better protected.
Referring to fig. 2, fig. 2 is a flowchart of a data risk assessment method according to an embodiment of the present application. As shown in fig. 2, the data risk assessment method provided in the embodiment of the present application includes:
step 201, desensitization data is acquired.
In this step, the desensitized data acquired by the data risk assessment apparatus may be large-scale patient data originating from hospitals in a province or region; these data have already undergone desensitization processing.
Step 202, screening the desensitized data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier.
In this step, the data risk assessment apparatus first determines the plurality of preset evaluation identifiers and then screens the acquired desensitized data against them: it removes data that do not meet the requirements, such as blank entries, irregularly filled entries, zero entries, and entry errors, and tidies up missing or irregularly formatted data, so as to obtain, for each evaluation identifier, a matching evaluation data set that meets the requirements and reaches the preset number of data groups.
The plurality of evaluation identifiers includes a basic item evaluation identifier; or the plurality of evaluation identifiers includes a basic item evaluation identifier and at least one of an occupation evaluation identifier, a marital status evaluation identifier, and an ethnicity evaluation identifier; the basic item evaluation identifier includes a gender evaluation identifier, a birth date evaluation identifier, and an address evaluation identifier.
Specifically, the preset plurality of evaluation identifiers may be: gender, birth date, and address; gender, birth date, address, and occupation; gender, birth date, address, and ethnicity; gender, birth date, address, and marital status; gender, birth date, address, ethnicity, and marital status; gender, birth date, address, ethnicity, and occupation; gender, birth date, address, occupation, and marital status; or gender, birth date, address, occupation, marital status, and ethnicity.
In this embodiment, the specific evaluation identifiers and the combinations of different evaluation identifiers are illustrated only for ease of understanding and are not limiting. In other embodiments, the evaluation identifiers may further include others, such as a height evaluation identifier or a score evaluation identifier, or may omit some of those listed, such as the birth date or address evaluation identifier; no limitation is imposed. Each evaluation identifier may carry specific identifier content, which can be set when performing risk assessment for different requirements on evaluation data of different natures.
Step 203, determining an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers, where an evaluation identifier combination is one result of combining different evaluation identifiers among the plurality of evaluation identifiers.
In this step, the data risk assessment apparatus combines the different evaluation identifiers into different evaluation identifier combinations, and then determines, for each combination, the evaluation data corresponding to the identifiers it contains, thereby obtaining the evaluation data set for each combination.
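As an illustration, the following Python sketch enumerates the eight evaluation identifier combinations described above: the three basic items alone, plus each non-empty subset of occupation, marital status, and ethnicity added to them. The column names are hypothetical; the patent fixes only the identifier categories, not their storage names.

from itertools import combinations

# Hypothetical identifier names; only the categories come from the text.
BASE_IDENTIFIERS = ["gender", "birth_date", "address"]
OPTIONAL_IDENTIFIERS = ["occupation", "marital_status", "ethnicity"]

def identifier_combinations():
    """Yield the basic items alone, then the basic items extended by each
    non-empty subset of the optional identifiers (7 subsets), for 8 in all."""
    yield list(BASE_IDENTIFIERS)
    for r in range(1, len(OPTIONAL_IDENTIFIERS) + 1):
        for extra in combinations(OPTIONAL_IDENTIFIERS, r):
            yield BASE_IDENTIFIERS + list(extra)

for combo in identifier_combinations():
    print(combo)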
Step 204, determining a risk assessment influence factor for each evaluation data set.
In this step, the data risk assessment apparatus determines, for each evaluation data set, the risk assessment influence factors that bear on privacy disclosure.
Step 205, determining, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk.
In this step, the data risk assessment apparatus determines from the obtained risk assessment influence factors whether the acquired desensitized data carries a risk, and thereby whether the desensitization method applied to the data set is effective and whether the privacy of patient users is truly protected.
The data risk assessment method provided by this embodiment obtains desensitized data; screens it based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier; determines an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers; determines a risk assessment influence factor for each evaluation data set; and determines, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk. By evaluating the desensitized data and judging whether it still carries a privacy disclosure risk, desensitized data can be effectively quantitatively assessed and controlled, disclosure of users' personal privacy can be effectively prevented, and user privacy is better protected.
Referring to fig. 3, fig. 3 is a flowchart of a data risk assessment method according to another embodiment of the present application. As shown in fig. 3, the data risk assessment method provided in the embodiment of the present application includes:
Step 301, obtaining desensitized data.
Step 302, deleting, based on a plurality of preset evaluation identifiers, the data groups in the desensitized data whose data do not conform to the preset identifier content.
In this step, while screening the desensitized data, the data risk assessment apparatus deletes, according to the plurality of preset evaluation identifiers, the data groups whose data do not conform to the preset identifier content, and retains the data groups whose data do conform, so as to obtain valid data groups.
Here, data that do not conform to the preset identifier content are mainly blank entries and entries that do not match the respective identifier content.
Specifically, if the evaluation identifier is the gender evaluation identifier, its content includes male and female; if the gender entry in the desensitized data is blank, or is neither male nor female, the user data group containing that entry is deleted, and data groups with male or female content are retained.
If the evaluation identifier is the marital status evaluation identifier, its content includes married, unmarried, divorced, and widowed; if the marital status entry is blank, or is none of these, the corresponding user data group is deleted, and data groups with married, unmarried, divorced, or widowed content are retained. For example, if a patient is married and "married" is noted in the marital status column of his data, that data group can be retained.
If the evaluation identifier is the birth date evaluation identifier, its content includes year, month, and day; if the birth date entry is blank, lacks a complete year, month, and day, cannot be fully determined, or is later than the current year, the corresponding user data groups are deleted, and data groups carrying all three of year, month, and day are retained.
If the evaluation identifier is the ethnicity evaluation identifier, its content may be any of the fifty-six ethnic groups, such as Han, Miao, or Mongol; if the ethnicity entry is blank, or is not one of the fifty-six ethnic groups, the corresponding user data group is deleted, and data groups carrying one of the fifty-six ethnic groups are retained.
The totals for all ethnic groups are then counted, and the ethnicity identifier of any group whose total is below the preset number is changed to "ethnic minority".
If the evaluation identifier is the address evaluation identifier, its content includes the provinces, cities, and districts or counties of China; if the address entry is blank, is not a Chinese address, or gives at most a province or city without allowing the user's district or county to be determined, the corresponding user data group is deleted; if the address includes district- or county-level data, it is retained.
If the evaluation identifier is the occupation evaluation identifier, its content is occupation information; if the occupation entry is blank or is not occupation information, the corresponding user data group is deleted.
The screened data set must finally reach the preset number of data groups; if the number of screened desensitized data groups falls below the preset number, data groups that do not conform to the occupation identifier content are retained so that the desensitized data reaches the preset number.
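The following is a minimal screening sketch in Python with pandas, under assumed column names and English value sets; the actual identifier contents are configured per evaluation requirement and data source.

import pandas as pd

VALID_GENDER = {"male", "female"}
VALID_MARITAL = {"married", "unmarried", "divorced", "widowed"}

def screen(df: pd.DataFrame, current_year: int = 2018) -> pd.DataFrame:
    """Delete user data groups (rows) that do not conform to the preset
    identifier content: blanks, invalid values, impossible birth dates."""
    df = df.dropna(subset=["gender", "birth_date", "address"])
    df = df[df["gender"].isin(VALID_GENDER)]
    if "marital_status" in df.columns:
        df = df[df["marital_status"].isin(VALID_MARITAL)]
    # Birth dates must parse completely and not lie after the current year.
    parsed = pd.to_datetime(df["birth_date"], errors="coerce")
    return df[parsed.notna() & (parsed.dt.year <= current_year)]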
Step 303, normalizing the desensitized data based on the plurality of preset evaluation identifiers, where the normalization includes unifying data formats.
In this step, the data risk assessment apparatus normalizes the desensitized data, which includes unifying the formats of the data in the valid data set.
Specifically, the data associated with each evaluation identifier is normalized. If the evaluation identifier is the birth date evaluation identifier, whose content includes year, month, and day, and the data may appear as day-month-year, year-month-day, or year/month/day, all entries are converted to a single month/day/year format; that is, the birth date data is normalized.
If the evaluation identifier is the address evaluation identifier and the province is empty while the city and district or county are not, the empty part is derived upward from the non-empty parts: the city and province are derived from the district or county, and the province from the city. If the match succeeds, the corresponding province, city, and district or county are taken as the recorded result values; that is, the address data is normalized. If the match fails, for example because a county name is duplicated and the specific address cannot be determined, the user data corresponding to that record is deleted.
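A sketch of both normalizations follows, using the same hypothetical column names; the lookup tables are stand-ins for the dictionary library described below.

import pandas as pd

def normalize_birth_date(df: pd.DataFrame) -> pd.DataFrame:
    """Unify mixed date layouts into a single month/day/year format."""
    parsed = pd.to_datetime(df["birth_date"], errors="coerce")
    df = df[parsed.notna()].copy()
    df["birth_date"] = parsed.dt.strftime("%m/%d/%Y")
    return df

# Hypothetical fragments of a dictionary library: lower level to upper levels.
COUNTY_TO_CITY_PROVINCE = {"Haidian District": ("Beijing City", "Beijing")}
CITY_TO_PROVINCE = {"Beijing City": "Beijing"}

def infer_address(province, city, county):
    """Derive empty upper levels from non-empty lower ones; return None
    (record to be deleted) when the match fails, e.g. a duplicated name."""
    if not city and county in COUNTY_TO_CITY_PROVINCE:
        city, province = COUNTY_TO_CITY_PROVINCE[county]
    if not province and city in CITY_TO_PROVINCE:
        province = CITY_TO_PROVINCE[city]
    return (province, city, county) if province and city else None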
Step 304, encoding the normalized desensitized data based on the plurality of preset evaluation identifiers.
In this step, after normalizing the valid data set, the data risk assessment apparatus encodes the data: the content of each evaluation identifier is compiled into a predetermined code, and the data corresponding to each evaluation identifier in the normalized data set is replaced by the code for that content.
For example: if the evaluation identifier is the gender evaluation identifier, whose content includes male and female, the number 1 corresponds to male and the number 2 to female; in the normalized data group, a male gender entry becomes 1 and a female entry becomes 2.
If the evaluation identifier is the address evaluation identifier, then to obtain an accurate mapping between each patient and the corresponding address code, the patient's address is first split into three parts (province, city, and district or county) using a natural language processing method. The new address is encoded as nine digits: the first two digits, the middle three digits, and the last four digits represent the patient's province, city, and district or county of residence, respectively. A dictionary library is used for matching; if the corresponding position matches successfully, the value is the code of the corresponding province, city, or district or county.
For incomplete addresses, if only a city or a district or county is present, the missing province or city needs to be supplemented, as follows. First, an address standard table can be established containing the correspondences among standard province, standard city, standard district or county, longitude, latitude, length, province, city, and district or county. The standard words in this table are then matched against the content of the address evaluation identifier; on success, the matched values become the result values for province, city, and district or county. Next, within the result of the same record, if the province result is empty while the city or district result is not, the comparison relations among the non-empty values and the provinces, cities, and districts in the dictionary table allow upward derivation: the standard province is derived from the standard city, and the standard city and standard province from the standard district or county. For records whose result is still empty, the first two characters of the standard city and the first three characters of the standard district or county in the dictionary table are matched against the content of the address evaluation identifier; on success, the corresponding standard word becomes the record's result value, and provinces and cities are derived again from that result.
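A sketch of the encoding step follows. The 1/2 gender coding comes from the text; the 2-3-4 split of the nine-digit address code is the consistent reading of the description, and the code values themselves are invented for illustration.

GENDER_CODE = {"male": "1", "female": "2"}

# Invented dictionary-library fragments: name to digits of the address code.
PROVINCE_CODE = {"Beijing": "11"}            # 2 digits
CITY_CODE = {"Beijing City": "101"}          # 3 digits
COUNTY_CODE = {"Haidian District": "0108"}   # 4 digits

def encode_address(province: str, city: str, county: str) -> str:
    """Nine-digit address code: province (2) + city (3) + district/county (4)."""
    return PROVINCE_CODE[province] + CITY_CODE[city] + COUNTY_CODE[county]

assert encode_address("Beijing", "Beijing City", "Haidian District") == "111010108"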
Step 305, determining, from the processed data, the evaluation data matching each evaluation identifier.
In this step, after the acquired desensitized data has been processed by deletion, normalization, and encoding, the processed data can be regarded as usable data that meets the processing requirements, and the evaluation data matching each evaluation identifier can then be determined from it.
Step 306, determining an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers, where an evaluation identifier combination is one result of combining different evaluation identifiers among the plurality of evaluation identifiers.
Step 307, determining a risk assessment influence factor for each evaluation data set.
Step 308, determining, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk.
For steps 301 and 306 to 308, refer to the descriptions of steps 201 and 203 to 205; the same technical effects can be achieved and are not repeated here.
Optionally, the plurality of evaluation identifiers includes a basic item evaluation identifier; or the plurality of evaluation identifiers includes a basic item evaluation identifier and at least one of an occupation evaluation identifier, a marital status evaluation identifier, and an ethnicity evaluation identifier; the basic item evaluation identifier includes a gender evaluation identifier, a birth date evaluation identifier, and an address evaluation identifier.
Optionally, when the plurality of evaluation identifiers includes an ethnicity evaluation identifier, normalizing the desensitized data based on the plurality of preset evaluation identifiers includes:
determining, in the desensitized data, the number of data groups for each ethnic group other than Han under the ethnicity evaluation identifier;
and if that number of data groups is smaller than a preset number, changing the ethnicity evaluation identifier of the corresponding data groups to "ethnic minority".
In this step, when the plurality of evaluation identifiers includes an ethnicity evaluation identifier and the data risk assessment apparatus normalizes the desensitized data, it may identify the ethnic group of each data group in the desensitized data, for example Han, Miao, or Mongol. After the ethnic groups are identified, the number of data groups for each group can be counted; if the count for a group other than Han is below the preset total, that group's data can be considered scarce, and for ease of processing its ethnicity evaluation identifier can be changed to "ethnic minority".
Optionally, determining a risk assessment influence factor for each evaluation data set includes:
analyzing the data in each evaluation data set;
determining, based on the analysis results, the number of occurrences of the target data groups in each evaluation data set, where a target data group occurs fewer times than the other data groups in the evaluation data set;
determining a calculation data group from among the plurality of target data groups, the calculation data group being the target data group that occurs fewer times than the other target data groups;
and determining the weight ratio of the number of occurrences of the calculation data groups to the total number of data groups in the desensitized data, where each data group among the total data groups represents the data of one user.
In this step, after determining the evaluation data set corresponding to each evaluation identifier combination, the data risk assessment apparatus first analyzes the data in each evaluation data set, and then determines from the analysis results the number of occurrences of the target data groups in each set.
Here, a target data group occurs fewer times than the other data groups in the evaluation data set, and multiple target data groups may exist in each evaluation data set.
The occurrence counts of the target data groups across the evaluation data sets are compared to determine the calculation data groups: a calculation data group is a target data group occurring fewer times than the other target data groups, and multiple calculation data groups may appear in each evaluation data set.
The weight ratio of the occurrences of all calculation data groups in an evaluation data set to the total number of data groups in the desensitized data is then determined, where each data group among the total data groups represents the data of one user.
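A sketch of computing these factors for one evaluation data set, under the same assumed column names; the least frequent identifier-value combinations play the role of the calculation data groups.

import pandas as pd

def risk_factors(df: pd.DataFrame, identifiers: list):
    """Return the occurrence count of the calculation data groups (the least
    frequent identifier-value combinations), their weight ratio against the
    total number of data groups (one row per user), and the groups themselves."""
    counts = df.groupby(identifiers).size()   # occurrences of each data group
    min_count = int(counts.min())             # calculation data groups' count
    calc_groups = counts[counts == min_count]
    weight_ratio = calc_groups.sum() / len(df)
    return min_count, weight_ratio, calc_groups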
For example: on the basis of the three fields gender, birth date, and address, the three fields occupation, marital status, and ethnicity are added to study how strongly each field combination affects the re-identification of private medical data. That is, occupation, marital status, and ethnicity are combined in turn on top of gender, birth date, and address: occupation; marital status; ethnicity; occupation and marital status; occupation and ethnicity; marital status and ethnicity; and occupation, marital status, and ethnicity. A model for assessing patient privacy risk is used to measure the degree to which a patient record is uniquely determinable. Its main purpose is to count the scarcity of individual records in a database and expose the potential risks they carry; it quantifies the uniqueness of individual records in the database. The database here is the desensitized data, and the model can be written as follows:
[Formula rendered as an image in the original: RE-GDA0001979525370000161]
where:
g is the model parameter, i.e., the number of occurrences of a calculation data group in the evaluation data set as described above;
bin(i) denotes a subset of i identical records, i.e., the number of all calculation data groups in the evaluation data set above;
BIN(i) is the total number of subsets satisfying i identical records, i.e., the total number of occurrences of all calculation data groups in the evaluation data set above.
This measure indicates the risk for populations stratified by different feature combinations, so as to assess the privacy risk of desensitized data under different de-identification methods. On the basis of gender, birth date, and address, seven re-identification items are added in turn: occupation; marital status; ethnicity; occupation and marital status; occupation and ethnicity; marital status and ethnicity; and occupation, marital status, and ethnicity. The influence of each re-identification item on the data is then calculated separately.
When the patient data is processed, the number of times the gender, birth date, address, and re-identification item occur together is counted for each group of patient data. If they occur together exactly once, the patient can be identified through his or her gender, birth date, address, and re-identification item; if they occur together more than once, the patient's information has been successfully desensitized. However, since desensitization comes in degrees, the values g = 1, 3, 5, and 10 are usually selected as the boundary points between strong and weak desensitization, where g = 1 is the most sensitive and 3, 5, and 10 are chosen as required.
After these two steps, the re-identification rate of the data is judged through the g value so as to ensure the safe use of the medical data.
In addition, retaining the address at different levels can be tried, such as: province only; province and city; or province, city, and district or county. Different retention levels can likewise be considered for dates, such as: year only; year and month; or year, month, and day. The influence of these fields on the re-identification rate is judged through the different desensitization levels.
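The exact formula appears only as an image in the source, but the counting procedure it summarizes is described in full above. The sketch below implements that procedure under the same assumed column names: for each g it reports the share of records whose identifier-value combination occurs at most g times, a count of 1 meaning the patient is uniquely re-identifiable.

import pandas as pd

G_VALUES = (1, 3, 5, 10)  # the boundary points named in the text

def reidentification_profile(df: pd.DataFrame, identifiers: list) -> dict:
    """Share of patient records whose identifier-value combination occurs
    at most g times, for each boundary value g."""
    counts = df.groupby(identifiers)[identifiers[0]].transform("size")
    return {g: float((counts <= g).mean()) for g in G_VALUES}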
Optionally, determining, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk includes:
determining, based on the number of occurrences of the calculation data group and the weight ratio, whether the acquired desensitized data carries a privacy disclosure risk.
In this step, the data risk assessment apparatus determines whether the acquired desensitized data carries a privacy disclosure risk from the number of occurrences of the calculation data groups determined across the evaluation data sets and the weight ratio of all calculation data groups within their evaluation data set.
Optionally, determining, based on the number of occurrences of the calculation data group and the weight ratio, whether the acquired desensitized data carries a privacy disclosure risk includes:
judging whether the number of occurrences of the calculation data group is greater than a first preset occurrence threshold;
and if so, determining that the acquired desensitized data has strong privacy protection.
In this step, after determining the number of occurrences and the weight ratio of the calculation data group, the data risk assessment apparatus judges whether the number of occurrences is greater than the first preset occurrence threshold; if so, it determines that the acquired desensitized data has strong privacy protection.
Specifically, if the calculation data group occurs 11 times and the first preset occurrence threshold is 10, then since 11 is greater than 10, the privacy protection of the acquired desensitized data is determined to be strong.
Optionally, determining, based on the number of occurrences of the calculation data group and the weight ratio, whether the acquired desensitized data carries a privacy disclosure risk includes:
judging whether the number of occurrences of the calculation data group is smaller than a second preset occurrence threshold;
if so, determining whether the weight ratio is greater than a preset weight threshold;
and if the weight ratio is greater than the preset weight threshold, determining that the acquired data carries a privacy disclosure risk.
In this step, after determining the number of occurrences and the weight ratio of the calculation data group, the data risk assessment apparatus judges whether the number of occurrences is smaller than the second preset occurrence threshold, and if so, whether the weight ratio is greater than the preset weight threshold; if the weight ratio is greater than the preset weight threshold, it determines that the acquired data carries a privacy disclosure risk.
Specifically, if the calculation data group occurs once, the second preset occurrence threshold is 2, and 1 is smaller than 2; the preset weight threshold is five ten-thousandths, the weight ratio is determined to be seven ten-thousandths, and seven ten-thousandths is greater than five ten-thousandths. In that case, the acquired desensitized data is determined to carry a privacy disclosure risk.
In addition, when the number of occurrences of the calculation data group is smaller than the first preset occurrence threshold but greater than or equal to the second threshold, the privacy protection of the acquired desensitized data is determined to be weak.
When the number of occurrences of the calculation data group is smaller than the second threshold but the weight ratio is smaller than or equal to the preset weight threshold, the privacy protection of the acquired desensitized data is likewise determined to be weak.
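Pulling the threshold logic together, a minimal decision sketch; the default threshold values are those from the examples above and would be configurable in practice.

def assess(occurrences: int, weight_ratio: float,
           first_threshold: int = 10, second_threshold: int = 2,
           weight_threshold: float = 0.0005) -> str:
    """Classify desensitized data from a calculation data group's occurrence
    count and weight ratio, per the threshold rules described above."""
    if occurrences > first_threshold:
        return "strong privacy protection"
    if occurrences < second_threshold and weight_ratio > weight_threshold:
        return "privacy disclosure risk"
    return "weak privacy protection"

print(assess(11, 0.0))     # strong privacy protection
print(assess(1, 0.0007))   # privacy disclosure risk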
The data risk assessment method provided by this embodiment obtains desensitized data; deletes, based on a plurality of preset evaluation identifiers, the data groups whose data do not conform to the preset identifier content; normalizes the screened desensitized data, including unifying data formats; encodes the normalized desensitized data; determines, from the processed data, the evaluation data matching each evaluation identifier; determines an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers; determines a risk assessment influence factor for each evaluation data set; and determines, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk. By evaluating the desensitized data and judging whether it carries a privacy disclosure risk, desensitized data can be effectively quantitatively assessed and controlled, disclosure of users' personal privacy can be effectively prevented, and user privacy is better protected.
Referring to FIG. 4, FIG. 4 is a block diagram of a data risk assessment apparatus according to an embodiment of the present application. As shown in FIG. 4, the data risk assessment apparatus 400 includes:
an acquisition module 410, configured to obtain desensitized data;
a screening module 420, configured to screen the desensitized data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier;
a combination module 430, configured to determine an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers, where an evaluation identifier combination is one result of combining different evaluation identifiers among the plurality of evaluation identifiers;
a determining module 440, configured to determine a risk assessment influence factor for each evaluation data set;
and an evaluation module 450, configured to determine, based on the risk assessment influence factors, whether the acquired desensitized data carries a privacy disclosure risk.
Further, the plurality of assessment indicators comprises a base item assessment indicator; or the plurality of assessment identifications comprise a base item assessment identification and at least one of a professional assessment identification, a marital assessment identification and a national assessment identification; the basic item evaluation identifier comprises a gender evaluation identifier, a birth date evaluation identifier and an address evaluation identifier.
Further, the screening module 420 is specifically configured to:
delete, from the desensitization data and based on the plurality of preset evaluation identifiers, the data groups whose data do not match the preset identifier content;
standardize the desensitization data based on the plurality of preset evaluation identifiers, the standardization including unifying data formats;
encode the standardized desensitization data based on the plurality of preset evaluation identifiers;
and determine, from the processed data, the evaluation data matching each evaluation identifier.
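A minimal sketch of these three screening steps, assuming dictionary records and invented validation patterns for the expected identifier content:

import re
from collections import defaultdict

# Expected content per identifier (illustrative patterns, not from the patent).
VALIDATORS = {
    "gender": re.compile(r"^(male|female)$", re.IGNORECASE),
    "birth_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def screen(records):
    # Step 1: delete any data group whose values fail the expected content.
    kept = [r for r in records
            if all(p.match(str(r.get(f, ""))) for f, p in VALIDATORS.items())]
    # Step 2: standardize, i.e. unify formats (here, one gender spelling).
    for r in kept:
        r["gender"] = r["gender"].strip().lower()
    # Step 3: encode standardized values as integer codes for later counting.
    codebook = defaultdict(dict)
    return [{f: codebook[f].setdefault(v, len(codebook[f])) for f, v in r.items()}
            for r in kept]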
Further, the screening module 420 is specifically configured to:
determine, in the desensitization data, the number of data groups for each ethnic group other than Han under the ethnicity evaluation identifier;
and if that number of data groups is smaller than a preset number, change the corresponding ethnicity value under the ethnicity evaluation identifier to a generic minority label.
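A sketch of this generalization step; the cut-off of 20 data groups is an assumed placeholder:

from collections import Counter

def generalize_ethnicity(records, preset_number=20):
    # Count the data groups belonging to each ethnic group other than Han.
    counts = Counter(r["ethnicity"] for r in records)
    for r in records:
        # Rare non-Han groups are recoded to one generic minority label,
        # so no single user is exposed by an unusual ethnicity value.
        if r["ethnicity"] != "Han" and counts[r["ethnicity"]] < preset_number:
            r["ethnicity"] = "minority"
    return records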
Further, the determining module 440 is specifically configured to:
analyze the data in each evaluation data set;
determine, based on the result of the data analysis, the occurrence count of the target data group in each evaluation data set, where the occurrence count of the target data group is smaller than the occurrence counts of the other data groups in that evaluation data set;
determine a calculation data group from the plurality of target data groups, where the calculation data group is the target data group whose occurrence count is smaller than those of the other target data groups;
and determine the weight ratio of the occurrence count of the calculation data group to the total number of data groups in the desensitization data, where each group of data in the total data groups represents the data of one user.
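A sketch of these two risk assessment influence factors for a single evaluation identifier combination, again assuming dictionary records:

from collections import Counter

def risk_factors(records, combo):
    # Group records by their values under one evaluation identifier combination.
    groups = Counter(tuple(r[f] for f in combo) for r in records)
    occurrence_count = min(groups.values())         # the calculation data group
    weight_ratio = occurrence_count / len(records)  # one data group per user
    return occurrence_count, weight_ratio

For example, risk_factors(records, ("gender", "birth_date")) returns the size of the rarest gender/birth-date group together with its share of all records.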
Further, the evaluation module 450 is specifically configured to:
determine, based on the occurrence count of the calculation data group and the weight ratio, whether the acquired desensitization data carry a privacy disclosure risk.
Further, the evaluation module 450 is specifically configured to:
judge whether the occurrence count of the calculation data group is greater than a first preset count threshold;
and if the occurrence count of the calculation data group is greater than the first preset count threshold, determine that the acquired desensitization data enjoy strong privacy protection.
Further, the evaluation module 450 is specifically configured to:
judge whether the occurrence count of the calculation data group is smaller than a second preset count threshold;
if the occurrence count of the calculation data group is smaller than the second preset count threshold, determine whether the value of the weight ratio is greater than a preset weight threshold;
and if the value of the weight ratio is greater than the preset weight threshold, determine that the acquired desensitization data carry a privacy disclosure risk.
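Putting these branches together with the weak-protection cases described earlier, the decision rule can be sketched as follows; all three threshold values are placeholders, not values taken from the patent:

def classify(occurrence_count, weight_ratio,
             first_threshold=10, second_threshold=3, weight_threshold=0.001):
    if occurrence_count > first_threshold:
        return "strong privacy protection"
    if occurrence_count >= second_threshold:
        # Below the first threshold but at or above the second one.
        return "weak privacy protection"
    # Below the second threshold: the weight ratio decides.
    if weight_ratio > weight_threshold:
        return "privacy disclosure risk"
    return "weak privacy protection"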
The data risk assessment apparatus 400 in this embodiment can implement all the method steps of the data risk assessment method in the embodiments shown in fig. 2 and fig. 3 and achieve the same effects, which are not repeated here.
According to the data risk assessment apparatus provided by the embodiments of the application, desensitization data are acquired; the desensitization data are screened based on a plurality of preset evaluation identifiers to obtain the evaluation data matching each evaluation identifier; an evaluation data set corresponding to each evaluation identifier combination is determined based on combinations of the plurality of evaluation identifiers, each evaluation identifier combination being a combination of different evaluation identifiers drawn from the plurality of evaluation identifiers; a risk assessment influence factor is determined for each evaluation data set; and whether the acquired desensitization data carry a privacy disclosure risk is determined based on the risk assessment influence factors. By evaluating the desensitized data in this way, whether the desensitized data carry a privacy disclosure risk can be quantitatively evaluated and controlled, leakage of users' personal privacy is effectively prevented, and user privacy is better protected.
Referring to fig. 5, fig. 5 is a structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510. When the electronic device 500 runs, the processor 510 communicates with the memory 520 through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the data risk assessment method in the method embodiments shown in fig. 2 and fig. 3 may be performed.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the data risk assessment method in the method embodiments shown in fig. 2 and fig. 3 may be performed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile, processor-executable computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific implementations of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present application and shall all be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for data risk assessment, the method comprising:
acquiring desensitization data;
screening the desensitization data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier;
determining an evaluation data set corresponding to each evaluation identifier combination based on combinations of the plurality of evaluation identifiers, wherein each evaluation identifier combination is a combination of different evaluation identifiers drawn from the plurality of evaluation identifiers;
determining a risk assessment influence factor for each evaluation data set;
and determining whether the acquired desensitization data carry a privacy disclosure risk based on the risk assessment influence factors.
2. The method of claim 1, wherein the plurality of evaluation identifiers comprises a base item evaluation identifier; or
the plurality of evaluation identifiers comprises a base item evaluation identifier and at least one of an occupation evaluation identifier, a marital status evaluation identifier and an ethnicity evaluation identifier;
the base item evaluation identifier comprises a gender evaluation identifier, a birth date evaluation identifier and an address evaluation identifier.
3. The method of claim 1, wherein screening the desensitization data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier comprises:
deleting, from the desensitization data and based on the plurality of preset evaluation identifiers, the data groups whose data do not match the preset identifier content;
standardizing the desensitization data based on the plurality of preset evaluation identifiers, the standardization including unifying data formats;
encoding the standardized desensitization data based on the plurality of preset evaluation identifiers;
and determining, from the processed data, the evaluation data matching each evaluation identifier.
4. The method according to claim 3, wherein, when the plurality of evaluation identifiers comprises an ethnicity evaluation identifier, standardizing the desensitization data based on the plurality of preset evaluation identifiers, the standardization including unifying data formats, comprises:
determining, in the desensitization data, the number of data groups for each ethnic group other than Han under the ethnicity evaluation identifier;
and if that number of data groups is smaller than a preset number, changing the corresponding ethnicity value under the ethnicity evaluation identifier to a generic minority label.
5. The method of claim 1, wherein determining a risk assessment influence factor for each evaluation data set comprises:
analyzing the data in each evaluation data set;
determining, based on the result of the data analysis, the occurrence count of the target data group in each evaluation data set, wherein the occurrence count of the target data group is smaller than the occurrence counts of the other data groups in that evaluation data set;
determining a calculation data group from the plurality of target data groups, wherein the calculation data group is the target data group whose occurrence count is smaller than those of the other target data groups;
and determining the weight ratio of the occurrence count of the calculation data group to the total number of data groups in the desensitization data, wherein each group of data in the total data groups represents the data of one user.
6. The method of claim 5, wherein determining whether the acquired desensitization data carry a privacy disclosure risk based on the risk assessment influence factors comprises:
determining, based on the occurrence count of the calculation data group and the weight ratio, whether the acquired desensitization data carry a privacy disclosure risk.
7. The method according to claim 6, wherein determining whether the acquired desensitization data carry a privacy disclosure risk based on the occurrence count of the calculation data group and the weight ratio comprises:
judging whether the occurrence count of the calculation data group is greater than a first preset count threshold;
and if the occurrence count of the calculation data group is greater than the first preset count threshold, determining that the acquired desensitization data enjoy strong privacy protection.
8. The method according to claim 6, wherein determining whether the acquired desensitization data carry a privacy disclosure risk based on the occurrence count of the calculation data group and the weight ratio comprises:
judging whether the occurrence count of the calculation data group is smaller than a second preset count threshold;
if the occurrence count of the calculation data group is smaller than the second preset count threshold, determining whether the value of the weight ratio is greater than a preset weight threshold;
and if the value of the weight ratio is greater than the preset weight threshold, determining that the acquired desensitization data carry a privacy disclosure risk.
9. A data risk assessment apparatus, characterized in that the data risk assessment apparatus comprises:
an acquisition module, configured to acquire desensitization data;
a screening module, configured to screen the desensitization data based on a plurality of preset evaluation identifiers to obtain evaluation data matching each evaluation identifier;
a combination module, configured to determine, based on combinations of the plurality of evaluation identifiers, the evaluation data set corresponding to each evaluation identifier combination, wherein each evaluation identifier combination is a combination of different evaluation identifiers drawn from the plurality of evaluation identifiers;
a determining module, configured to determine a risk assessment influence factor for each evaluation data set;
and an evaluation module, configured to determine whether the acquired desensitization data carry a privacy disclosure risk based on the risk assessment influence factors.
10. The apparatus of claim 9, wherein the determining module is configured to:
analyze the data in each evaluation data set;
determine, based on the result of the data analysis, the occurrence count of the target data group in each evaluation data set, wherein the occurrence count of the target data group is smaller than the occurrence counts of the other data groups in that evaluation data set;
determine a calculation data group from the plurality of target data groups, wherein the calculation data group is the target data group whose occurrence count is smaller than those of the other target data groups;
and determine the weight ratio of the occurrence count of the calculation data group to the total number of data groups in the desensitization data, wherein each group of data in the total data groups represents the data of one user.
CN201811627005.1A 2018-12-28 2018-12-28 Data risk assessment method and device Active CN111382457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811627005.1A CN111382457B (en) 2018-12-28 2018-12-28 Data risk assessment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811627005.1A CN111382457B (en) 2018-12-28 2018-12-28 Data risk assessment method and device

Publications (2)

Publication Number Publication Date
CN111382457A true CN111382457A (en) 2020-07-07
CN111382457B CN111382457B (en) 2023-08-18

Family

ID=71216467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811627005.1A Active CN111382457B (en) 2018-12-28 2018-12-28 Data risk assessment method and device

Country Status (1)

Country Link
CN (1) CN111382457B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731976A (en) * 2015-04-14 2015-06-24 海量云图(北京)数据技术有限公司 Method for finding and sorting private data in data table
CN106339396A (en) * 2015-07-10 2017-01-18 上海贝尔股份有限公司 Privacy risk assessment method and device for user generated content
WO2017088683A1 (en) * 2015-11-24 2017-06-01 阿里巴巴集团控股有限公司 Data desensitization method and system
CN106951796A (en) * 2016-01-07 2017-07-14 阿里巴巴集团控股有限公司 A kind of desensitization method and its device of data-privacy protection
US20170287034A1 (en) * 2016-04-01 2017-10-05 OneTrust, LLC Data processing systems and communication systems and methods for the efficient generation of privacy risk assessments
CN108009435A (en) * 2017-12-18 2018-05-08 网智天元科技集团股份有限公司 Data desensitization method, device and storage medium
CN108776762A (en) * 2018-06-08 2018-11-09 北京中电普华信息技术有限公司 A kind of processing method and processing device of data desensitization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115065509A (en) * 2022-05-27 2022-09-16 中电长城网际系统应用有限公司 Method and device for identifying risk of statistical inference attack based on deviation function
CN115065509B (en) * 2022-05-27 2024-04-02 中电长城网际系统应用有限公司 Risk identification method and device for statistical inference attack based on deviation function

Also Published As

Publication number Publication date
CN111382457B (en) 2023-08-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant