CN115859372A

CN115859372A - Medical data desensitization method and system

Info

Publication number: CN115859372A
Application number: CN202310199626.9A
Authority: CN
Inventors: 李睿; 胡其桐; 刘瑞华; 郑名扬; 唐学文
Original assignee: Chengdu Angels Biomedical Technology Co ltd
Current assignee: Chengdu Angels Biomedical Technology Co ltd
Priority date: 2023-03-04
Filing date: 2023-03-04
Publication date: 2023-03-28
Anticipated expiration: 2043-03-04
Also published as: CN115859372B

Abstract

The invention belongs to the technical field of data processing and discloses a medical data desensitization method and system. The method comprises the following steps: classifying the acquired medical data into text data and non-text data; extracting keywords from the text data, retaining the original text without the keywords, and taking the extracted keywords and the non-text data as the data to be desensitized; classifying the data to be desensitized into personal identity information, personal medical information, date information, address information and other information; and desensitizing the classified information. The system comprises a data classification module, a sensitive word extraction module, a field classification module and a data desensitization module. The method and system can desensitize medical data fully automatically, with the user only needing to input the medical fields contained in the medical data, and can handle diverse medical data.

Description

Medical data desensitization method and system

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a medical data desensitization method and system.

Background

Medical data contains a large amount of information about the individual characteristics of a patient, such as the patient's name, contact phone number, place of birth, life trajectory, and description of illness. Because disclosure of this information could harm the patient, it must be protected as private. Existing text data desensitization algorithms are based on a pattern matching mechanism and process a text by statically matching keywords in it. This leads to three problems:

1. Private information with weak regularity, such as names, cannot be matched accurately. For example, a static rule may state that once the surname keyword "Zhang" (张) appears in the text, that character and the two characters after it are deleted as a person's name. However, if the text reads "varicose vein phenomenon is obvious", in which the Chinese term for "varicose vein" (静脉曲张) ends with the same character, the method may erroneously treat part of that phrase as a person's name;

2. Sensitive data cannot be judged dynamically based on context information. For example, the patient's description of illness may contain the phrase "People's Hospital of South County, Sichuan Province". The existing method treats "South County" as sensitive information and masks it, so that only "People's Hospital of Sichuan Province" is left. However, "People's Hospital of South County, Sichuan Province" does not refer to the patient's place of birth and does not require desensitization. Moreover, the resulting "People's Hospital of Sichuan Province" has an unclear referent and can be confused with any of the many people's hospitals in Sichuan Province;

3. Static matching rules require the person performing desensitization to enumerate all possible data formats of sensitive information in advance, and omissions remain possible when the text data takes diverse forms. For example, the patient's visit date may appear in the description of illness as "May 6 d 2023", which is not in a standard format such as 2023/05/06, and is therefore difficult to match statically.
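The third problem can be illustrated with a small sketch. The regular expression and both sample sentences below are hypothetical and are not taken from any existing desensitization tool; they only show how a static rule written for one date format silently misses a date written in another form.

```python
import re

# Hypothetical static rule: dates written as YYYY/MM/DD or YYYY-MM-DD.
STANDARD_DATE = re.compile(r"\b\d{4}[/-]\d{1,2}[/-]\d{1,2}\b")


def static_mask_dates(text: str) -> str:
    """Replace every substring that matches the static date rule with a mask."""
    return STANDARD_DATE.sub("[DATE]", text)


# Illustrative sentences, not taken from the patent's tables.
standard = "Patient visited on 2023/05/06 with a fever."
nonstandard = "Patient visited on the sixth of May, 2023 with a fever."

print(static_mask_dates(standard))     # the standard-format date is masked
print(static_mask_dates(nonstandard))  # the free-form date passes through unmasked
```

Adding more patterns narrows the gap but never closes it, which is the motivation given above for a context-based extractor.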

Disclosure of Invention

The present invention aims to solve the above technical problem at least to some extent. Therefore, the invention aims to provide a medical data desensitization method and system.

The technical scheme adopted by the invention is as follows:

A method of desensitizing medical data, comprising the following steps:

S1, classifying the acquired medical data into text data and non-text data;

S2, extracting keywords from the text data, retaining the original text without the keywords, and taking the extracted keywords and the non-text data as the data to be desensitized;

S3, classifying the data to be desensitized into personal identity information, personal medical information, date information, address information and other information;

S4, desensitizing the classified information: the personal identity information is encrypted, the personal medical information and the date information are fuzzified, the address information is masked so that only a coarse text description remains, and other information is kept as the original text.

Preferably, in the step S2, the Pointer Network model improves an Attention mechanism of the BERT model based on a Transformer framework to obtain a BERT-Pointer Network model; the BERT-Pointer Network model converts text information into word vectors based on context information and extracts keywords in the text data.

Preferably, step S3 comprises: the BERT model converts text information into word vectors based on context information; the PCA model carries out principal component decomposition on the output result of the BERT model, combines similar medical fields and deletes irrelevant medical fields; and clustering the output result of the PCA model by using a clustering algorithm.

Preferably, a clustering algorithm computes the cosine distance between new data and each of the four categories of personal identity information, personal medical information, date information and address information. If the new data is closest to one category and that distance is below a preset threshold, the new data is assigned to that category; if the distances to all four categories are greater than the preset threshold, the new data is marked as other information.

Preferably, in step S1, the text data and the non-text data are classified according to the field names of the medical data.

A medical data desensitization system, comprising:

the data classification module is used for classifying the acquired medical data into text data and non-text data;

the sensitive word extraction module is used for extracting keywords from the text data, sending the extracted keywords to the field classification module, and retaining the original text without the keywords;

the field classification module is used for classifying the non-text data and the keywords into personal identity information, personal medical information, date information, address information and other information;

the data desensitization module is used for desensitizing the information classified by the field classification module: the personal identity information is encrypted, the personal medical information and the date information are fuzzified, the address information is masked so that only a coarse text description remains, and other information is kept as the original text.

Preferably, the sensitive word extraction module includes:

a Pointer Network model, used for improving the Attention mechanism of the BERT model based on the Transformer framework to obtain a BERT-Pointer Network model;

and the BERT-Pointer Network model, used for converting text information into word vectors based on context information and extracting keywords.

Preferably, the field classification module comprises:

the BERT model is used for converting text information into word vectors based on context information;

the PCA model is used for carrying out principal component decomposition on the output result of the BERT model, combining similar medical fields and deleting irrelevant medical fields;

and the clustering model is used for clustering the output result of the PCA model.

Preferably, the medical data desensitization system further comprises an output module, which is used for outputting the desensitized information to the original position in the medical data.

The invention has the beneficial effects that:

1. The medical data desensitization method and system can desensitize medical data fully automatically; the user only needs to input the medical fields contained in the medical data. Desensitization can be performed on diverse medical data (including text data and non-text data).

2. The BERT-Pointer Network model adopted by the invention extracts sensitive keywords of the medical text data for desensitization. The model optimizes the BERT model, and can better combine context medical information to extract sensitive keywords. Compared with the traditional pattern recognition algorithm, the recognition accuracy is improved by 81%, and the recognition speed is improved by 13 times; compared with the BERT model, the recognition accuracy is improved by 22%.

Drawings

FIG. 1 is a functional block diagram of a medical data desensitization system of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should also be noted that, in some embodiments, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

As shown in fig. 1, a medical data desensitization method of the present embodiment includes the following steps:

S1, classifying the acquired medical data into text data and non-text data according to the name of each field. As shown in Table 1, the fields of the medical data named "name", "sex", "date of birth", "identification number", "body temperature" and "blood pressure" are non-text data, and the field named "medical condition description" is text data.
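Step S1 can be sketched as a lookup on field names. The set `TEXT_FIELD_NAMES`, the function name and the placeholder values below are assumptions made for illustration; the text only states that the split is made by field name.

```python
# Assumed set of field names whose values are free text (illustrative only).
TEXT_FIELD_NAMES = {"medical condition description"}


def split_by_field_name(record: dict) -> tuple[dict, dict]:
    """Split one record into (text_fields, non_text_fields) using field names only."""
    text_fields, non_text_fields = {}, {}
    for name, value in record.items():
        target = text_fields if name in TEXT_FIELD_NAMES else non_text_fields
        target[name] = value
    return text_fields, non_text_fields


# Placeholder values: the contents of Table 1 are not reproduced here.
record = {
    "name": "<name>",
    "sex": "<sex>",
    "date of birth": "<date of birth>",
    "identification number": "<identification number>",
    "body temperature": "<body temperature>",
    "blood pressure": "<blood pressure>",
    "medical condition description": "<free-text description>",
}

text_fields, non_text_fields = split_by_field_name(record)
print(sorted(text_fields))      # field names routed to keyword extraction
print(sorted(non_text_fields))  # field names passed on directly as data to be desensitized
```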

TABLE 1 original medical data record sheet

S2, using a BERT-Pointer Network model to convert text information into word vectors based on context information, extracting keywords from the text data, retaining the original text without the keywords, and taking the extracted keywords and the non-text data as the data to be desensitized. For the data in Table 1, the extracted keywords are "Zhang Qiang", "March 14, 2018" and "March 21".
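The separation of keywords from the remaining text in step S2 can be sketched as follows. The sentence is hypothetical (Table 1 is not reproduced), the numbered placeholder format is an assumption, and the keyword spans are located with a plain string search as a stand-in for the BERT-Pointer Network model.

```python
def split_keywords(text, spans, placeholder="[KW{}]"):
    """Return (remainder, keywords).

    remainder: the text with each (start, end) span replaced by a numbered placeholder.
    keywords:  the extracted substrings, in order of appearance.
    """
    keywords, pieces, cursor = [], [], 0
    for index, (start, end) in enumerate(sorted(spans)):
        pieces.append(text[cursor:start])
        pieces.append(placeholder.format(index))
        keywords.append(text[start:end])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), keywords


# Hypothetical sentence used only for illustration.
text = "Zhang Qiang was admitted on March 14, 2018 and discharged on March 21."

# Stand-in for the model output: spans found by exact string search.
found = ["Zhang Qiang", "March 14, 2018", "March 21"]
spans = [(text.index(k), text.index(k) + len(k)) for k in found]

remainder, keywords = split_keywords(text, spans)
print(remainder)
print(keywords)
```

Keeping numbered placeholders in the remainder is one way to let a later output step write each desensitized keyword back to its original position.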

In particular, some medical text can be segmented in several ways. For example, "People's Hospital of South County, Sichuan Province" can be split into "South County, Sichuan Province" and "People's Hospital" for encoding, or it can be encoded directly as a whole. To better process medical text with a complex structure, the BERT model is optimized by extending it with a Pointer Network model. The BERT model is mainly based on an Attention mechanism, and the Pointer Network model improves this Attention mechanism on the basis of the Transformer framework to obtain the BERT-Pointer Network model. This addresses the limitation of the traditional seq2seq framework, which cannot handle an output sequence that changes with the length of the input sequence; the new Attention mechanism also combines context better during encoding and can handle overlapping labels.

The conventional Attention architecture is as follows:

$$u^i_j = v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \dots, n);$$

$$a^i_j = \mathrm{softmax}(u^i_j), \quad j \in (1, \dots, n);$$

$$d'_i = \sum_{j=1}^{n} a^i_j e_j;$$

where $e_j$ and $d_i$ are state quantities, $v$, $W_1$ and $W_2$ are learnable parameters, and $a^i_j$ are weights. The Pointer Network model simplifies this architecture: the third layer (the weighted sum) is discarded, and the result of the softmax acts as a pointer to a specific element of the input sequence.

The improved Attention mechanism formula is as follows:

$$u^i_j = v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \dots, n);$$

$$p(C_i \mid C_1, \dots, C_{i-1}) = \mathrm{softmax}(u^i);$$

The output sequence of the Pointer Network model is drawn from the input sequence. The index $i$ ranges up to a preset maximum, which is the length of the output sequence. The vector $u^i_j$ represents the Attention mask of the $j$-th input vector; $T$ denotes matrix transposition; $e_j$ and $d_i$ are state quantities; $v$, $W_1$ and $W_2$ are learnable parameters; $C_1$, $C_{i-1}$ and $C_i$ are random variables, each representing an item of the input sequence; $p$ is a hyper-parameter representing the joint probability distribution; and $p(C_i \mid C_1, \dots, C_{i-1})$ is the conditional probability of the $i$-th item given that the first $i-1$ items are known.

The BERT-Pointer Network model can encode at two levels: in one, "People's Hospital of South County, Sichuan Province" is encoded as a whole; in the other, "South County, Sichuan Province" and "People's Hospital" are encoded separately. This resolves the segmentation ambiguity of medical text data.
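The pointer idea described above (a softmax over input positions used directly as the output, with no weighted-sum layer) can be sketched in plain Python. The scoring function used here is the standard Pointer Network form, u_j = v^T tanh(W1 e_j + W2 d_i), which is assumed to be what the symbols e_j, d_i, v, W1 and W2 in the text refer to. All dimensions and the randomly initialised parameters are illustrative; a trained model would supply them.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-lists matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distribution(encoder_states, decoder_state, v, W1, W2):
    """Score each input position with v^T tanh(W1 e_j + W2 d_i) and normalise.

    The softmax result is returned as-is: it is the pointer over input positions.
    """
    projected_d = matvec(W2, decoder_state)
    scores = []
    for e_j in encoder_states:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, e_j), projected_d)]
        scores.append(sum(v_k * h_k for v_k, h_k in zip(v, hidden)))
    return softmax(scores)


# Toy values only (assumed sizes, random parameters).
rng = random.Random(0)
DIM, N_INPUTS = 4, 5


def random_vector():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def random_matrix():
    return [random_vector() for _ in range(DIM)]


encoder_states = [random_vector() for _ in range(N_INPUTS)]  # e_1 .. e_n
decoder_state = random_vector()                              # d_i
v, W1, W2 = random_vector(), random_matrix(), random_matrix()

probs = pointer_distribution(encoder_states, decoder_state, v, W1, W2)
pointed_index = max(range(N_INPUTS), key=lambda j: probs[j])
print(pointed_index, [round(p, 3) for p in probs])
```

Because the output is an index into the input rather than a token from a fixed vocabulary, the number of possible outputs grows with the input length, which is the property the text attributes to the Pointer Network.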

S3, classifying the data to be desensitized into personal identity information (such as name and identification card number), personal medical information (such as age), date information, address information and other information.

Specifically, the BERT model converts text information into word vectors based on context information, after certain initialization processing. Because the BERT model is mainly based on an Attention mechanism, it can run in parallel and has access to global information. The Attention function is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V;$$

where $Q$ (Query) represents the input information, that is, the information present in the input text; $K$ (Key) represents content information, namely semantic information; the softmax term gives the matching degree of Query and Key; $V$ (Value) represents the information itself, which is weighted by the matching degree; and $d_k$ is the dimension of the Key vectors. The BERT model also takes position information into account, so the influence of context on the result is fully considered, and the output of the BERT model contains a probability distribution over text information of the same type.
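The Attention function used by BERT is the scaled dot-product attention of the Transformer, softmax(Q K^T / sqrt(d_k)) V. A minimal sketch in plain Python follows; the tiny matrices are invented for illustration, and a real BERT model applies this per head to learned projections of the token embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for list-of-lists matrices.

    Each row of Q is matched against every row of K; the resulting weights
    are used to average the rows of V.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # matching degree of this query with each key
        output.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return output


# Two identical keys receive equal weight, so the output is the mean of the values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0, 0.0], [4.0, 0.0]]
result = attention(Q, K, V)
print(result)
```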

The PCA model performs principal component decomposition on the output of the BERT model, merging similar medical fields and deleting irrelevant ones. For example, the output obtained after BERT processing may include age, date of birth, cell phone number, time of patient visit, and so on. After PCA processing, "cell phone number" is deleted as noise unrelated to medical services, and "age" and "date of birth" are automatically merged as similar medical fields.

The principal component decomposition of the PCA model aims to find a set of orthonormal bases such that the spread of the data is maximal after the data points are projected onto the subspace spanned by these bases:

$$\max_{W} \; \mathrm{tr}\left(W^T X X^T W\right) \quad \text{s.t.} \quad W^T W = I;$$

where $X$ is the data and the columns $w_i$ of $W$ are the orthonormal basis vectors. Solving the dual problem with the Lagrange multiplier method gives, at the maximum:

$$X X^T w_i = \lambda_i w_i;$$

where $\lambda_i$ are the eigenvalues of the covariance matrix, sorted in descending order. The projection matrix is formed from the eigenvectors corresponding to the first $k$ eigenvalues, and it is used to extract the principal components of the word vectors.
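The eigenvector selection can be sketched without external libraries by using power iteration, which converges to the eigenvector of the largest eigenvalue of the covariance matrix, that is, the first principal component. Power iteration is a technique swapped in here for brevity; the text does not say how the eigen-decomposition is computed. The two-dimensional sample vectors are invented stand-ins for word vectors.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    centered = [[r[c] - means[c] for c in range(d)] for r in rows]
    return [
        [sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def top_component(cov, iterations=200):
    """Power iteration: returns (unit eigenvector, eigenvalue) of the largest eigenvalue."""
    d = len(cov)
    w = [1.0] + [0.5] * (d - 1)  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    eigenvalue = sum(
        w[a] * sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)
    )
    return w, eigenvalue


# Invented 2-D vectors lying roughly along the diagonal.
vectors = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
component, eigenvalue = top_component(covariance(vectors))

# Project each vector onto the first principal component (k = 1).
projected = [sum(x * w for x, w in zip(vec, component)) for vec in vectors]
print(component, eigenvalue)
```

For k greater than 1, the same procedure can be repeated after removing the found component from the covariance matrix, or a library eigen-solver can be used instead.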

The output of the PCA model is clustered with the Single-Pass online text clustering algorithm. First, four categories of medical data are defined: personal identity information, personal medical information, date information and address information. Training data are provided for each category and processed with the BERT-PCA model to obtain the space vector representation of the data together with its category. The Single-Pass algorithm then computes the cosine distance between new data and each of the four categories. If the new data is closest to one category and that distance is below a preset threshold, the new data is assigned to that category; if the distances to all four categories are greater than the preset threshold, the new data is marked as other information.

Specifically, for four categories: personal identification information, personal medical information, date information, address information, respectively, providing corresponding medical field training data, such as: name, age, date of visit, and pharmacy address.

The training data are put into the BERT-PCA model, which performs text vectorization and principal component normalization on them to obtain reference vectors for each category, such as: {1 0 1 0 1 0 1 0}, {0 1 0 1 0 1 0 1}, {0 0 0 0 1 1 1 1}, {1 1 1 1 0 0 0 0}. These four vectors are the reference vectors of the four categories.

When a new medical field is added, such as "date of surgery", it is first put into the BERT-PCA model to be vectorized, giving for example {1 1 1 0 1 0 0 0}. The cosine distance between this new vector and each of the reference vectors is then calculated. The new vector is found to be very close to the date reference vector {1 1 1 1 0 0 0 0}, so "date of surgery" is marked as a medical field of the "date information" category.
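The assignment rule (nearest category by cosine distance, accepted only below a threshold, otherwise "other information") can be sketched as follows. The reference vectors, the input vectors and the threshold value are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the vectors quoted in the text.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def assign_category(vector, references, threshold):
    """Return the nearest category if it is close enough, else 'other information'."""
    category, distance = min(
        ((name, cosine_distance(vector, ref)) for name, ref in references.items()),
        key=lambda pair: pair[1],
    )
    return category if distance < threshold else "other information"


# Hypothetical reference vectors, one per category.
REFERENCES = {
    "personal identity information": [1.0, 0.0, 0.0, 0.0],
    "personal medical information": [0.0, 1.0, 0.0, 0.0],
    "date information": [0.0, 0.0, 1.0, 0.0],
    "address information": [0.0, 0.0, 0.0, 1.0],
}
THRESHOLD = 0.3  # assumed preset threshold

surgery_date = [0.1, 0.1, 0.9, 0.0]  # points mostly along the date reference
unrelated = [0.5, 0.5, 0.5, 0.5]     # equally far from every reference

print(assign_category(surgery_date, REFERENCES, THRESHOLD))
print(assign_category(unrelated, REFERENCES, THRESHOLD))
```

The threshold controls how readily unfamiliar fields fall into "other information", which is kept as original text, so in practice it would need to be chosen conservatively.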

S4, desensitizing the classified information. The personal identity information is encrypted using an encryption algorithm, which prevents leakage of personal private information. For the personal medical information, a randomization algorithm fuzzifies the data, so that personal privacy is protected while the data remains usable for subsequent intelligent medical services. Taking the patient's age as an example, the system can add random noise of up to ±5% of the age to the real age, so that the real age is concealed while the processed data does not deviate too far from the real data;
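These two treatments can be sketched as follows, under stated assumptions. The text does not name the encryption algorithm, and the Python standard library has no symmetric cipher, so a keyed hash (HMAC-SHA256) is swapped in below as a stand-in; it is one-way pseudonymization, not reversible encryption. The noise is drawn uniformly, which is also an assumption. The function names and the key are hypothetical.

```python
import hashlib
import hmac
import random


def pseudonymize(value: str, key: bytes) -> str:
    """Stand-in for encryption: keyed one-way hash of an identity value."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def fuzz_age(age: float, rng: random.Random, relative_noise: float = 0.05) -> float:
    """Add uniform noise within +/- relative_noise of the true age."""
    bound = relative_noise * age
    return age + rng.uniform(-bound, bound)


key = b"demo-key-not-for-production"  # hypothetical key
rng = random.Random(42)

token = pseudonymize("Zhang Qiang", key)
noisy_age = fuzz_age(40.0, rng)
print(token[:16], round(noisy_age, 2))
```

A deployment that must recover the original identity later would need a real cipher with managed keys instead of the keyed hash shown here.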

The date information includes the patient's visit date, operation date, CT scan date, and so on. Dates are fuzzified according to the applicable legal and regulatory requirements. For example, only the year and month of the original data are retained and the day is randomized within that month, so that a visit date of March 14, 2018 may be fuzzified to March 11, 2018;
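A minimal sketch of this date rule follows: the year and month are kept and the day is redrawn uniformly from the days of that month. The function name and the seeded random generator are illustrative assumptions.

```python
import calendar
import datetime
import random


def fuzz_day(date: datetime.date, rng: random.Random) -> datetime.date:
    """Keep year and month; replace the day with a random valid day of that month."""
    days_in_month = calendar.monthrange(date.year, date.month)[1]
    return date.replace(day=rng.randint(1, days_in_month))


rng = random.Random(0)
visit = datetime.date(2018, 3, 14)
fuzzed = fuzz_day(visit, rng)
print(fuzzed.isoformat())
```

Using `calendar.monthrange` keeps the result valid for short months and leap years.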

The address information includes the patient's home address, the address of the pharmacy where medicine was bought, and so on. Masking is performed according to the applicable legal and regulatory requirements. For example, only province-level and city-level information in the original data is retained, and information below the city level (county level, district level, etc.) is masked. Thus, if a patient buys medicine at No. 29, Lane 9, Yongtai Road, South Village, Yuxiang Street, Yuhua District, Shijiazhuang City, Hebei Province, the pharmacy address is masked as "Shijiazhuang City, Hebei Province".
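The address rule can be sketched under the assumption that the address has already been split into administrative levels; how that split is obtained is not described in the text, and the level names below are hypothetical.

```python
# Levels that survive masking, in output order (assumed policy: city and province).
KEPT_LEVELS = ("city", "province")


def mask_address(levels: dict) -> str:
    """Keep only the city-level and province-level parts of a structured address."""
    return ", ".join(levels[level] for level in KEPT_LEVELS if level in levels)


# Hypothetical structured form of the pharmacy address in the example.
address = {
    "province": "Hebei Province",
    "city": "Shijiazhuang City",
    "district": "Yuhua District",
    "street": "Yuxiang Street",
    "detail": "No. 29, Lane 9, Yongtai Road",
}

masked = mask_address(address)
print(masked)
```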

Other information is kept as the original text. The medical data of Table 1 after desensitization is shown in Table 2.

TABLE 2 medical data record sheet after desensitization treatment

Finally, the processed medical data can be written into a desensitization medical health database for access of intelligent medical developers, and the database does not contain privacy information of patients and doctors.

This embodiment also provides a medical data desensitization system that uses the medical data desensitization method described above. The system comprises an acquisition module, a data classification module, a sensitive word extraction module, a field classification module, a data desensitization module and an output module. The acquisition module is used for acquiring medical data. The data classification module is used for classifying the acquired medical data into text data and non-text data. The sensitive word extraction module is used for extracting keywords from the text data, sending the extracted keywords to the field classification module, and retaining the original text without the keywords. The field classification module is used for classifying the non-text data and the keywords into personal identity information, personal medical information, date information, address information and other information. The data desensitization module is used for desensitizing the information classified by the field classification module: the personal identity information is encrypted, the personal medical information and the date information are fuzzified, the address information is masked so that only a coarse text description remains, and other information is kept as the original text. The output module is used for outputting the desensitized information to the original position in the medical data.

The sensitive word extraction module comprises a Pointer Network model and a BERT-Pointer Network model, wherein the Pointer Network model is used for improving an Attention mechanism of the BERT model based on a Transformer frame to obtain the BERT-Pointer Network model; the BERT-Pointer Network model is used for converting text information into word vectors based on context information and extracting keywords.

The field classification module comprises a BERT model, a PCA model and a clustering model, wherein the BERT model is used for converting text information into word vectors based on context information; the PCA model is used for carrying out principal component decomposition on an output result of the BERT model, combining similar medical fields and deleting irrelevant medical fields; the clustering model is used for clustering the output result of the PCA model.
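The way the modules hand data to one another can be sketched as a small pipeline. This is only a wiring sketch: the class name, the callable signatures and the trivial rule-based stand-ins passed in at the bottom are all hypothetical, and the real modules would be the BERT-Pointer Network extractor, the BERT-PCA-clustering classifier and the desensitization routines described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DesensitizationPipeline:
    is_text_field: Callable[[str], bool]           # data classification module
    extract_keywords: Callable[[str], list[str]]   # sensitive word extraction module
    classify: Callable[[str, str], str]            # field classification: (field, value) -> category
    desensitize: Callable[[str, str], str]         # data desensitization: (category, value) -> value

    def run(self, record: dict[str, str]) -> dict[str, str]:
        """Process one record and write each result back under its original field."""
        output = {}
        for name, value in record.items():
            if self.is_text_field(name):
                # Only the extracted keywords are desensitized; the rest of the text is kept.
                for keyword in self.extract_keywords(value):
                    category = self.classify(name, keyword)
                    value = value.replace(keyword, self.desensitize(category, keyword))
                output[name] = value
            else:
                output[name] = self.desensitize(self.classify(name, value), value)
        return output


# Trivial stand-ins, for illustration only.
pipeline = DesensitizationPipeline(
    is_text_field=lambda name: name == "medical condition description",
    extract_keywords=lambda text: [w for w in ("Zhang Qiang",) if w in text],
    classify=lambda name, value: (
        "personal identity information"
        if name == "name" or value == "Zhang Qiang"
        else "other information"
    ),
    desensitize=lambda category, value: (
        "[ENCRYPTED]" if category == "personal identity information" else value
    ),
)

result = pipeline.run(
    {
        "name": "Zhang Qiang",
        "body temperature": "37.2",
        "medical condition description": "Zhang Qiang reports a cough.",
    }
)
print(result)
```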

The invention is not limited to the above alternative embodiments. Anyone may derive products in various other forms in light of the present invention; however, regardless of any change in shape or structure, all technical solutions that fall within the scope defined by the claims fall within the protection scope of the present invention.

Claims

1. A method of desensitizing medical data, comprising the steps of:

s1, classifying the acquired medical data into text data and non-text data;

2. The medical data desensitization method according to claim 1, wherein in step S2, the Pointer Network model improves the Attention mechanism of the BERT model based on the Transformer framework to obtain a BERT-Pointer Network model; the BERT-Pointer Network model converts text information into word vectors based on context information and extracts keywords from the text data.

3. A method of desensitizing medical data according to claim 2, wherein: the improved Attention mechanism formula is as follows:

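$$u^i_j = v^T \tanh(W_1 e_j + W_2 d_i), \quad j \in (1, \dots, n);$$

$$p(C_i \mid C_1, \dots, C_{i-1}) = \mathrm{softmax}(u^i);$$

where the vector $u^i_j$ represents the Attention mask of the $j$-th input vector; $T$ denotes matrix transposition; $e_j$ and $d_i$ are state quantities; $v$, $W_1$ and $W_2$ are learnable parameters; $C_1$, $C_{i-1}$ and $C_i$ are random variables, each representing an item of the input sequence; $p$ is a hyper-parameter representing the joint probability distribution; and $p(C_i \mid C_1, \dots, C_{i-1})$ is the conditional probability of the $i$-th item given that the first $i-1$ items are known.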

4. A method of desensitizing medical data according to claim 1, wherein step S3 comprises: the BERT model converts text information into word vectors based on context information; the PCA model carries out principal component decomposition on the output result of the BERT model, combines similar medical fields and deletes irrelevant medical fields; and clustering the output result of the PCA model by using a clustering algorithm.

5. A method of desensitizing medical data according to claim 4, wherein: a clustering algorithm computes the cosine distance between new data and each of the four categories of personal identity information, personal medical information, date information and address information; if the new data is closest to one category and that distance is below a preset threshold, the new data is assigned to that category; and if the distances to all four categories are greater than the preset threshold, the new data is marked as other information.

6. A method of desensitizing medical data according to claim 1, wherein in step S1, textual data and non-textual data are classified according to the name of each field of the medical data.

7. A medical data desensitization system, comprising:

8. The medical data desensitization system according to claim 7, wherein the sensitive word extraction module includes:

9. The medical data desensitization system according to claim 7, wherein the field classification module includes:

10. The medical data desensitization system according to claim 7, further comprising an output module, which is used for outputting the desensitized information to the original position in the medical data.