CN115859372B

CN115859372B - Medical data desensitization method and system

Info

Publication number: CN115859372B
Application number: CN202310199626.9A
Authority: CN
Inventors: 李睿; 胡其桐; 刘瑞华; 郑名扬; 唐学文
Original assignee: Chengdu Angels Biomedical Technology Co ltd
Current assignee: Chengdu Angels Biomedical Technology Co ltd
Priority date: 2023-03-04
Filing date: 2023-03-04
Publication date: 2023-04-25
Anticipated expiration: 2043-03-04
Also published as: CN115859372A

Abstract

The invention belongs to the technical field of data processing and discloses a medical data desensitization method and system. The medical data desensitization method comprises the following steps: classifying the acquired medical data into text data and non-text data; extracting keywords from the text data, retaining the original text of non-keywords, and taking the extracted keywords and the non-text data as the data to be desensitized; classifying the data to be desensitized into personal identity information, personal medical information, date information, address information and other information; and desensitizing the classified information. The medical data desensitization system comprises a data classification module, a sensitive word extraction module, a field classification module and a data desensitization module. The method and system provided by the invention can desensitize medical data fully automatically, and the user only needs to input the medical fields contained in the medical data; desensitization can be performed on medical data of mixed types (both text and non-text data).

Description

Medical data desensitization method and system

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a medical data desensitizing method and system.

Background

Medical data contain a large amount of information tied to the personal characteristics of patients, such as name, contact phone number, place of birth, movement history and illness description. This information needs privacy protection because patients can be harmed once it is leaked. Existing text desensitization algorithms rely on a pattern matching mechanism: keywords in the text are matched statically and removed. This leads to three problems:

1. Privacy information with weak regularity, such as names, cannot be matched accurately. For example, once the surname character "Zhang" (张) appears in the text, that character and the two characters after it are deleted as a person's name. However, if the text reads "varicose vein phenomenon is obvious" (the Chinese term for varicose veins contains the character 张), the method incorrectly treats that character and the two characters following it as a person's name;

2. Sensitive data cannot be judged dynamically from context. For example, the phrase "South County People's Hospital, Sichuan Province" may appear in a patient's illness description, and existing methods mask "South County" as sensitive information, leaving only "People's Hospital, Sichuan Province". However, "South County People's Hospital, Sichuan Province" does not reveal the patient's place of birth and does not need desensitization. In addition, the remaining "People's Hospital, Sichuan Province" is ambiguous and can be confused with any of the many people's hospitals in Sichuan Province;

3. Static matching rules require the person performing desensitization to enumerate in advance every possible data format of the sensitive information, so items are easily missed in text whose form varies. For example, an illness description may give the patient's visit date as "ju-6 of 2023", which matches none of the standard formats (2023, 5, 6; 2023/5/6; 2023/05/06) and is therefore difficult to match statically.
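The third problem can be illustrated with a minimal sketch. The two regular expressions and the sample sentences below are hypothetical and are not taken from any existing desensitization tool; they only show how a fixed rule set misses a date written in a form nobody listed in advance.

```python
import re

# Hypothetical static rules covering only "standard" numeric date layouts.
STATIC_DATE_PATTERNS = [
    re.compile(r"\b\d{4}/\d{1,2}/\d{1,2}\b"),  # e.g. 2023/5/6 or 2023/05/06
    re.compile(r"\b\d{4}-\d{1,2}-\d{1,2}\b"),  # e.g. 2023-5-6
]

def find_dates(text):
    """Return every substring matched by any of the static date patterns."""
    hits = []
    for pattern in STATIC_DATE_PATTERNS:
        hits.extend(pattern.findall(text))
    return hits

# A date in a listed format is found; an informally written date is missed.
standard = find_dates("Patient visited on 2023/05/06.")
informal = find_dates("Patient visited on the 6th of May, 2023.")
```

Adding more patterns narrows the gap but never closes it, which is the motivation for the context-based extraction described later in the document.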

Disclosure of Invention

The present invention aims to solve the above technical problems at least to some extent. To this end, the present invention aims to provide a method and a system for desensitizing medical data.

The technical scheme adopted by the invention is as follows:

A method of desensitizing medical data, comprising the following steps:

S1, classifying the acquired medical data into text data and non-text data;

S2, extracting keywords from the text data, retaining the original text of non-keywords, and taking the extracted keywords and the non-text data as the data to be desensitized;

S3, classifying the data to be desensitized into personal identity information, personal medical information, date information, address information and other information;

S4, desensitizing the classified information: the personal identity information is encrypted, the personal medical information and the date information are blurred, the address information is masked to obtain a text description, and the other information is retained as original text.

Preferably, in step S2, the Pointer Network model improves the Attention mechanism of the BERT model based on the Transformer framework to obtain a BERT-Pointer Network model; the BERT-Pointer Network model converts text information into word vectors based on context information and extracts keywords from the text data.

Preferably, step S3 includes: the BERT model converts text information into word vectors based on context information; the PCA model carries out principal component decomposition on the output result of the BERT model, combines similar medical fields, and deletes irrelevant medical fields; and clustering the output result of the PCA model by using a clustering algorithm.

Preferably, the cosine distance between the new data and the four types of data including personal identity information, personal medical information, date information and address information is judged through a clustering algorithm, and if the distance between the new data and one type of data is nearest and is lower than a preset threshold value, the new data is distributed to the type of data; and if the distances between the new data and the four types of data are larger than the preset threshold value, marking the new data as other information.

Preferably, in step S1, classification of text data and non-text data is performed according to each field name of medical data.

A medical data desensitization system, comprising:

the data classification module is used for classifying the acquired medical data into text data and non-text data;

the sensitive word extraction module is used for extracting keywords in the text data, sending the extracted keywords into the field classification module and retaining the original text of the non-keywords;

the field classification module is used for classifying the non-text data and the keywords into personal identity information, personal medical information, date information, address information and other information;

the data desensitization module is used for carrying out desensitization treatment on the information classified by the field classification module: the personal identity information is encrypted, the personal medical information and the date information are subjected to blurring processing, the address information is subjected to mask covering processing to obtain text description, and other information is subjected to original text retaining processing.

Preferably, the sensitive word extraction module includes:

the Pointer Network model is used for improving the Attention mechanism of the BERT model based on the Transformer framework to obtain a BERT-Pointer Network model;

the BERT-Pointer Network model is used for converting text information into word vectors and extracting keywords based on context information.

Preferably, the field classification module includes:

the BERT model is used for converting text information into word vectors based on the context information;

the PCA model is used for carrying out principal component decomposition on the output result of the BERT model, merging similar medical fields and deleting irrelevant medical fields;

and the clustering model is used for clustering the output result of the PCA model.

Preferably, the medical data desensitization system further comprises: and the output module is used for outputting the desensitized information to the original position of the medical data.

The beneficial effects of the invention are as follows:

1. The medical data desensitization method and system provided by the invention can desensitize medical data fully automatically, and the user only needs to input the medical fields contained in the medical data; desensitization can be performed on medical data of mixed types (including both text and non-text data).

2. The BERT-Pointer Network model adopted by the invention extracts the sensitive keywords of medical text data for desensitization. The model optimizes the BERT model and extracts sensitive keywords by making better use of contextual medical information. Compared with the traditional pattern recognition algorithm, recognition accuracy is improved by 81% and recognition speed is improved 13-fold; compared with the BERT model, recognition accuracy is improved by 22%.

Drawings

FIG. 1 is a schematic block diagram of a medical data desensitization system of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.

It should also be appreciated that, in some embodiments, the functions/acts may occur in an order different from that shown in the figures. For example, two steps shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending on the functionality/acts involved.

As shown in fig. 1, a method for desensitizing medical data according to the present embodiment includes the steps of:

S1, classifying the acquired medical data into text data and non-text data according to the field names. For the medical data shown in Table 1, the fields named "name", "gender", "birth date", "identification card number", "body temperature" and "blood pressure" are non-text data, and the field named "illness description" is text data.

Table 1 raw medical data record table
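Step S1 can be sketched as a lookup on field names. This is only an illustration under stated assumptions: the set `TEXT_FIELD_NAMES`, the function `split_by_field_name` and the sample record are hypothetical, and the record values are made up rather than taken from Table 1.

```python
# Hypothetical set of field names treated as free text. Per the description,
# the user supplies the medical fields contained in the data.
TEXT_FIELD_NAMES = {"illness description"}

def split_by_field_name(record, text_fields=TEXT_FIELD_NAMES):
    """Split one record into (text_data, non_text_data) using field names only."""
    text_data, non_text_data = {}, {}
    for field, value in record.items():
        if field in text_fields:
            text_data[field] = value
        else:
            non_text_data[field] = value
    return text_data, non_text_data

# Made-up values for illustration only.
record = {
    "name": "Patient A",
    "gender": "male",
    "body temperature": 36.8,
    "illness description": "Admitted with fever; discharged one week later.",
}
text_data, non_text_data = split_by_field_name(record)
```

The text part goes on to keyword extraction (S2), while the non-text part goes directly to field classification (S3).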

S2, converting text information into word vectors based on context information by using a BERT-Pointer Network model, extracting keywords from the text data, retaining the original text of non-keywords, and taking the extracted keywords and the non-text data as the data to be desensitized. For the data in Table 1, the extracted keywords are "Zhang Jiang", "March 14, 2018" and "March 21".

In particular, some text in medical data can be segmented in more than one way. For example, "South County People's Hospital, Sichuan Province" can be split into "South County, Sichuan Province" and "People's Hospital" and coded as two parts, or it can be coded directly as a whole. To better process medical text with a complex structure, the BERT model is optimized by extending it with a Pointer Network model. The BERT model is mainly based on an Attention mechanism; the Pointer Network model improves this Attention mechanism on the basis of the Transformer framework to obtain the BERT-Pointer Network model. This overcomes the limitation of the traditional seq2seq framework, which cannot handle an output sequence whose size changes with the length of the input sequence, so the new Attention mechanism can encode with better use of context and can resolve the label overlapping problem.

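The conventional Attention architecture is as follows:

$$u^{i}_{j} = v^{T}\tanh\left(W_{1}e_{j} + W_{2}d_{i}\right), \quad j \in (1, \dots, n)$$

$$a^{i}_{j} = \mathrm{softmax}\left(u^{i}_{j}\right), \quad j \in (1, \dots, n)$$

$$d'_{i} = \sum_{j=1}^{n} a^{i}_{j}\, e_{j}$$

where $e_j$ and $d_i$ are state quantities, $v$, $W_1$ and $W_2$ are learnable parameters of the model, $n$ is the length of the input sequence, and $a^{i}_{j}$ is the weight. The Pointer Network model simplifies this architecture: it discards the third-layer weighting and lets the softmax result itself act as a pointer to a specific element of the input sequence.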

The improved Attention mechanism formula is as follows:

$$u^{i}_{j} = v^{T}\tanh\left(W_{1}e_{j} + W_{2}d_{i}\right), \quad j \in (1, \dots, n)$$

$$p\left(C_{i} \mid C_{1}, \dots, C_{i-1}\right) = \mathrm{softmax}\left(u^{i}\right)$$

The output sequence of the Pointer Network model is drawn from the input sequence; $i$ ranges over the length of the output sequence, whose maximum is preset. The entry $u^{i}_{j}$ of the vector $u^{i}$ represents the Attention mask of the $j$-th input vector; $T$ denotes matrix transposition; $e_j$ and $d_i$ are state quantities; $v$, $W_1$ and $W_2$ are learnable parameters; $C_1$, $C_{i-1}$ and $C_i$ are random variables, each representing an item of the input sequence; $p$ is a hyperparameter representing the joint probability distribution; and $p(C_{i} \mid C_{1}, \dots, C_{i-1})$ is the conditional probability that the $i$-th item occurs given the preceding $i-1$ items.

The BERT-Pointer Network model can encode at two levels: "South County People's Hospital, Sichuan Province" as a whole, and "South County, Sichuan Province" and "People's Hospital" separately. This resolves the segmentation ambiguity of medical text data.
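The pointer step can be sketched in plain Python. This is a minimal illustration that assumes the standard Pointer Network scoring u_j = v^T tanh(W1 e_j + W2 d_i) followed by a softmax over input positions, which is what the symbols defined above suggest. The function names and all numeric parameters are hypothetical, untrained toy values, and the sketch does not include BERT itself.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Score each input position with v^T tanh(W1 e_j + W2 d_i), then softmax."""
    projected_d = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        projected_e = matvec(W1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_e, projected_d)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    return softmax(scores)

# Toy, untrained parameters chosen only for illustration.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
encoder_states = [[0.1, 0.2], [0.9, 0.8], [-0.5, 0.0]]
decoder_state = [0.5, 0.5]

probs = pointer_distribution(encoder_states, decoder_state, W1, W2, v)
# The softmax output is used directly as a pointer to one input position.
pointed_index = max(range(len(probs)), key=probs.__getitem__)
```

Because the output is a distribution over input positions rather than over a fixed vocabulary, the number of candidate outputs grows with the input length, which is the property the description relies on.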

S3, classifying the data to be desensitized into personal identity information (such as name and ID card number), personal medical information (such as age), date information, address information and other information.

Specifically, the BERT model converts text information into word vectors based on context information, after a certain initialization process. Because the BERT model is mainly based on an Attention mechanism, it can run in parallel and has access to global information. The Attention function is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$

where Q (Query) represents the input information, that is, the information present in the input text; K (Key) represents the content information, that is, the semantic information, so that Attention(Q, K) expresses the degree of match between Query and Key; V (Value) represents the information itself, which is weighted by that degree of match; and $d_k$ is the dimension of the Key vectors. The BERT model also takes position information into account and fully considers the influence of context on the result; the output of the BERT model contains a probability distribution over text information of the same type.
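As a rough illustration of how the matching degree of Query and Key weights the values, the sketch below implements standard scaled dot-product attention in plain Python. The scaling by the square root of the key dimension is the usual Transformer convention and is an assumption here, as are the function names and the toy matrices.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """For each query row: softmax(q . k / sqrt(d_k)) used as weights over V."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, V))
            for c in range(len(V[0]))
        ])
    return outputs

# Toy matrices: one query, two keys, two values (made up for illustration).
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [4.0, 2.0]]
context = attention(Q, K, V)
```

Each output row is a weighted average of the value rows, so a query that matches one key more strongly pulls the result toward that key's value.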

The PCA model carries out principal component decomposition on the output result of the BERT model, combines similar medical fields, and deletes irrelevant medical fields; for example, the output result obtained after the BERT model processing includes: age, date of birth, phone number, time of patient visit, etc. Then, after the PCA model processing, the "phone number" is deleted as noise by the PCA module because it is irrelevant to the medical service, and the "age" and "date of birth" are automatically merged by the PCA module as similar medical fields.

The principal component decomposition of the PCA model seeks a set of orthonormal basis vectors such that, after the data points are projected onto the subspace spanned by that basis, the spread of the data is maximized:

$$\max_{W}\ \mathrm{tr}\left(W^{T} X X^{T} W\right) \quad \text{s.t.} \quad W^{T}W = I$$

where $X = (x_1, \dots, x_n)$ is the centered data. Applying the Lagrange multiplier method to this problem, the maximum is attained when

$$X X^{T} w_{i} = \lambda_{i} w_{i}$$

where $\lambda_1 \ge \lambda_2 \ge \dots$ are the eigenvalues of the covariance matrix sorted in descending order. The projection matrix formed by the eigenvectors corresponding to the first k eigenvalues is selected, and the principal components of the word vectors are extracted with it.
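The idea of projecting onto the leading eigenvector of the covariance matrix can be sketched as follows. This is a toy illustration, not the described PCA module: it uses power iteration as a swapped-in technique to approximate only the first principal component, and the function names and sample vectors are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Covariance matrix of the rows after subtracting the per-column mean."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[c] for row in rows) / n for c in range(dim)]
    centered = [[row[c] - means[c] for c in range(dim)] for row in rows]
    return [[sum(x[a] * x[b] for x in centered) / n for b in range(dim)]
            for a in range(dim)]

def leading_component(cov, iterations=200):
    """Power iteration: approximate the eigenvector with the largest eigenvalue."""
    vec = [1.0] + [0.5] * (len(cov) - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(c * x for c, x in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the matching eigenvalue.
    eigenvalue = sum(v * sum(c * x for c, x in zip(row, vec))
                     for v, row in zip(vec, cov))
    return vec, eigenvalue

# Toy 2-D "word vectors" lying on the line y = x (made up for illustration).
vectors = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
component, eigenvalue = leading_component(covariance_matrix(vectors))
# One coordinate per vector along the first principal direction.
projected = [sum(x * w for x, w in zip(row, component)) for row in vectors]
```

A full implementation would keep the first k eigenvectors rather than one, typically through a linear algebra library; the sketch assumes the data have non-zero variance so that the normalization step is well defined.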

The output of the PCA model is then clustered with the Single-Pass online text clustering algorithm. First, training data are provided for each of the four categories of medical data (personal identity information, personal medical information, date information and address information); after the training data are processed by the BERT-PCA model, the space vector representation of each item and the category to which it belongs are obtained. The Single-Pass algorithm then computes the cosine distance between new data and each of the four categories. If the new data is nearest to one category and that distance is below a preset threshold, the new data is assigned to that category; if its distances to all four categories are greater than the preset threshold, the new data is marked as other information.

Specifically, for four classifications: personal identification information, personal medical information, date information, address information, respectively providing corresponding medical field training data, such as: { name }, { age }, { date of visit }, { pharmacy address }.

The training data are put into the BERT-PCA model, which performs text vectorization and principal component normalization on them, so that several reference vectors are obtained for each category of data, such as: {1 0 1 0 1 0 1 0}, {0 1 0 1 0 1 0 1}, {0 0 0 0 1 1 1 1}, {1 1 1 1 0 0 0 0}. These four sets of vectors are the reference vectors for the four categories of data.

When a new medical field is added, such as "date of surgery", it is first put into the BERT-PCA model to be vectorized, for example: {1 1 1 0 1 0 0 0}. Next, the new vector is compared with the sets of reference vectors obtained above by calculating their cosine distances; it is found to be very close to the date reference vector {1 1 1 1 0 0 0 0}, so "date of surgery" is marked as a medical field of the "date information" category.
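The nearest-category rule with a threshold can be sketched as below. The reference vectors, the threshold value 0.3 and the function names are hypothetical choices for illustration; they are not the vectors or the threshold used in the embodiment, and a real system would obtain the vectors from the BERT-PCA model.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical reference vectors, one per category, for illustration only.
REFERENCES = {
    "personal identity information": [1.0, 0.0, 0.0, 0.0],
    "personal medical information": [0.0, 1.0, 0.0, 0.0],
    "date information": [0.0, 0.0, 1.0, 0.0],
    "address information": [0.0, 0.0, 0.0, 1.0],
}

def classify(vector, references=REFERENCES, threshold=0.3):
    """Assign the nearest category if it is close enough, else 'other information'."""
    best_label, best_distance = None, float("inf")
    for label, reference in references.items():
        distance = cosine_distance(vector, reference)
        if distance < best_distance:
            best_label, best_distance = label, distance
    if best_distance < threshold:
        return best_label
    return "other information"

near_date = classify([0.1, 0.0, 0.9, 0.1])   # close to the date reference
ambiguous = classify([1.0, 1.0, 1.0, 1.0])   # equally far from every reference
```

The threshold controls how readily a new field falls into "other information" and is therefore left unmodified by desensitization, so in practice it would need to be tuned on labeled fields.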

S4, desensitizing the classified information. Personal identity information is encrypted with an encryption algorithm to prevent leakage of personal privacy information. Personal medical information is blurred with a purpose-designed randomization algorithm, which protects personal privacy while keeping the data usable for intelligent medical services. For example, for a patient's age, the system adds random noise of plus or minus 5 percent of the age to the patient's real age, which hides the real age without letting the processed data deviate too far from the real data.

The date information includes the date of a patient's visit, the date of an operation, the date of a CT scan, and the like. Dates are blurred according to the applicable legal and regulatory requirements. For example, only the year and month of the original data are retained and the day is randomized within that month, so a visit date of March 14, 2018 may be blurred to March 11, 2018.

The address information includes the patient's address, the address of the pharmacy where medicine was bought, and the like. It is masked according to the applicable legal and regulatory requirements. For example, only the province-level and city-level information in the original data is retained, and information below city level (county level, district level and so on) is masked. Thus, if the patient bought medicine at the large-heart pharmacy at "Hebei Jizhuang Yuhua Yu Hua Ouyu Xiang Jielian 9 lane 29", the address of the pharmacy is masked to "Hebei Jizhuang".

Other information is retained as original text. The medical data of Table 1 after desensitization are shown in Table 2.

Table 2 medical data recording table after desensitization treatment
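The blurring and masking rules of step S4 can be sketched as three small functions. The function names, the uniform noise distribution, the `***` mask token and the assumption that an address is already split into administrative levels are all hypothetical; encryption of personal identity information is not shown.

```python
import calendar
import random

def blur_age(age, rng, max_fraction=0.05):
    """Add uniform random noise of at most plus or minus 5% of the true age."""
    return age + rng.uniform(-max_fraction, max_fraction) * age

def blur_date(year, month, rng):
    """Keep year and month; draw the day uniformly within that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return year, month, rng.randint(1, days_in_month)

def mask_address(levels, keep=2, mask="***"):
    """Keep the first `keep` administrative levels and mask everything below."""
    kept = list(levels[:keep])
    if len(levels) > keep:
        kept.append(mask)
    return kept

rng = random.Random(0)  # seeded only to make the illustration repeatable
blurred_age = blur_age(40, rng)
blurred_date = blur_date(2018, 3, rng)
masked = mask_address(["Province P", "City C", "District D", "Lane 9 No. 29"])
```

Passing the random generator in explicitly keeps the functions testable; a production system would more likely draw from a cryptographically stronger source and would need a real address parser.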

The final processed medical data is written into a desensitized medical health database for access by intelligent medical developers, wherein the database does not contain privacy information of patients and doctors.

The embodiment also provides a medical data desensitization system that uses the above medical data desensitization method. The system comprises an acquisition module, a data classification module, a sensitive word extraction module, a field classification module, a data desensitization module and an output module. The acquisition module is used for acquiring medical data. The data classification module is used for classifying the acquired medical data into text data and non-text data. The sensitive word extraction module is used for extracting keywords from the text data, sending the extracted keywords to the field classification module, and retaining the original text of non-keywords. The field classification module is used for classifying the non-text data and the keywords into personal identity information, personal medical information, date information, address information and other information. The data desensitization module is used for desensitizing the information classified by the field classification module: the personal identity information is encrypted, the personal medical information and the date information are blurred, the address information is masked to obtain a text description, and the other information is retained as original text. The output module is used for outputting the desensitized information to its original position in the medical data.

The sensitive word extraction module comprises a Pointer Network model and a BERT-Pointer Network model. The Pointer Network model improves the Attention mechanism of the BERT model based on the Transformer framework to obtain the BERT-Pointer Network model; the BERT-Pointer Network model is used to convert text information into word vectors based on context information and to extract keywords.

The field classification module comprises a BERT model, a PCA model and a clustering model, wherein the BERT model is used for converting text information into word vectors based on context information; the PCA model is used for carrying out principal component decomposition on the output result of the BERT model, merging similar medical fields and deleting irrelevant medical fields; the clustering model is used for clustering the output result of the PCA model.
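The write-back performed by the output module, in which desensitized keywords return to their original positions while non-keyword text is kept, can be sketched as a span replacement. The function name, the span representation as character offsets and the sample sentence are assumptions for illustration.

```python
def write_back(text, replacements):
    """Replace non-overlapping (start, end, new_value) spans in `text`.

    Spans are applied from the end of the string backwards so that earlier
    offsets stay valid even when a replacement changes the length.
    """
    result = text
    for start, end, new_value in sorted(replacements, reverse=True):
        result = result[:start] + new_value + result[end:]
    return result

# Made-up sentence and keyword spans for illustration.
original = "Seen on 2018-03-14 by Dr X."
date_start = original.index("2018-03-14")
name_start = original.index("X")
desensitized = write_back(original, [
    (date_start, date_start + len("2018-03-14"), "2018-03-11"),
    (name_start, name_start + 1, "[MASKED]"),
])
```

Everything outside the listed spans is copied unchanged, which corresponds to retaining the original text of non-keywords.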

The invention is not limited to the alternative embodiments described above. Anyone may derive products in various other forms in light of the present invention; however, regardless of any change in shape or structure, every technical solution that falls within the scope defined by the claims of the present invention falls within the scope of protection of the present invention.

Claims

1. A method of desensitizing medical data comprising the steps of:

S1, classifying acquired medical data into text data and non-text data;

S4, desensitizing the classified information: encrypting personal identity information, blurring personal medical information and date information, masking address information to obtain text description, and preserving other information;

in step S1, classifying text data and non-text data according to the names of the fields of the medical data;

in step S2, the Pointer Network model improves an Attention mechanism of the BERT model based on a Transformer framework to obtain a BERT-Pointer Network model; the BERT-Pointer Network model converts text information into word vectors based on the context information and extracts keywords in the text data;

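the improved Attention mechanism formula is as follows:

$$u^{i}_{j} = v^{T}\tanh\left(W_{1}e_{j} + W_{2}d_{i}\right), \quad j \in (1, \dots, n)$$

$$p\left(C_{i} \mid C_{1}, \dots, C_{i-1}\right) = \mathrm{softmax}\left(u^{i}\right)$$

where $p(C_{i} \mid C_{1}, \dots, C_{i-1})$ is the conditional probability that the $i$-th item occurs given the preceding $i-1$ items.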

2. The method of desensitizing medical data according to claim 1, wherein step S3 comprises: the BERT model converts text information into word vectors based on context information; the PCA model carries out principal component decomposition on the output result of the BERT model, combines similar medical fields, and deletes irrelevant medical fields; and clustering the output result of the PCA model by using a clustering algorithm.

3. The method of desensitizing medical data according to claim 2, comprising: judging cosine distances between the new data and four types of data, namely personal identity information, personal medical information, date information and address information, through a clustering algorithm, and if the distance between the new data and one type of data is nearest and is lower than a preset threshold value, distributing the new data into the type of data; and if the distances between the new data and the four types of data are larger than the preset threshold value, marking the new data as other information.

4. A medical data desensitization system, comprising:

the data desensitization module is used for carrying out desensitization treatment on the information classified by the field classification module: encrypting personal identity information, blurring personal medical information and date information, masking address information to obtain text description, and preserving other information;

the sensitive word extraction module comprises:

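the Pointer Network model is used for improving the Attention mechanism of the BERT model based on the Transformer framework to obtain a BERT-Pointer Network model; the improved Attention mechanism formula is as follows:

$$u^{i}_{j} = v^{T}\tanh\left(W_{1}e_{j} + W_{2}d_{i}\right), \quad j \in (1, \dots, n)$$

$$p\left(C_{i} \mid C_{1}, \dots, C_{i-1}\right) = \mathrm{softmax}\left(u^{i}\right)$$

where $p(C_{i} \mid C_{1}, \dots, C_{i-1})$ is the conditional probability that the $i$-th item occurs given the preceding $i-1$ items;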

5. The medical data desensitization system according to claim 4, wherein the field classification module comprises:

6. The medical data desensitization system according to claim 4, further comprising: and the output module is used for outputting the desensitized information to the original position of the medical data.