CN115438379A - Electronic medical record data desensitization method and system based on FLAT - Google Patents

Electronic medical record data desensitization method and system based on FLAT

Info

Publication number
CN115438379A
CN115438379A (application number CN202211116144.4A)
Authority
CN
China
Prior art keywords
medical record
entity
electronic medical
data
flat
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211116144.4A
Other languages
Chinese (zh)
Inventor
桑波
王文谦
靳恩朝
张述睿
王建坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Msunhealth Technology Group Co Ltd
Original Assignee
Shandong Msunhealth Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Msunhealth Technology Group Co Ltd filed Critical Shandong Msunhealth Technology Group Co Ltd
Priority to CN202211116144.4A priority Critical patent/CN115438379A/en
Publication of CN115438379A publication Critical patent/CN115438379A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • G06F 21/6254 - Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 - Computing arrangements using knowledge-based models
    • G06N 5/04 - Inference or reasoning models
    • G06N 5/041 - Abduction
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H 10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Abstract

The invention provides an electronic medical record data desensitization method and system based on FLAT, relating to the technical field of data desensitization. Electronic medical record text data are collected and subjected to data generalization and knowledge embedding to obtain a character fragment sequence sample set; an entity recognition model built on FLAT and CRF is trained with the sample set; the electronic medical record text to be desensitized is fed into the trained entity recognition model to obtain the sensitive entities of the record and their entity types; and each sensitive entity is then given a type-specific desensitization treatment. The scheme centers on a FLAT-CRF entity recognition model: the labeled entities are augmented through a generalization strategy of randomly substituting entities of the same type, knowledge representations of the entities are embedded into both the character vectors and the word vectors, and the recognized entities are desensitized by category, which improves both the accuracy and the inference speed of data desensitization.

Description

Electronic medical record data desensitization method and system based on FLAT
Technical Field
The invention belongs to the technical field of data desensitization, and particularly relates to an electronic medical record data desensitization method and system based on FLAT.
Background
With the popularization of medical informatization, electronic medical records have become the standard way for hospitals to record medical information. Analysis of the data in electronic medical records is of great significance for making medical services more intelligent, improving service quality and reducing response time. Because of patient privacy requirements, patient-related information such as names, dates, addresses, institution names, contact details and various important numbers must be desensitized before such data can be used.
Because of characteristics of electronic medical records such as their data structure and the locality of their sources, the data desensitization task is very challenging.
(1) In an electronic medical record, structured text accounts for only a small part of the content and its structure varies unpredictably, so it is often handled with manual rules; beyond that, extracting sensitive information from the large amount of unstructured text is the main difficulty.
Extraction of sensitive information from unstructured text falls within the field of named entity recognition, where accuracy and inference speed are the main practical concerns. With the rapid development of neural networks, LSTM-CRF, GRU-CRF and IDCNN-CRF models based on static word vectors gradually became the mainstream frameworks for named entity recognition. With the development of dynamic word vectors, pre-trained models based on the Transformer architecture have become dominant; with simple fine-tuning they achieve higher accuracy than the earlier network models, but their huge parameter counts make inference slower.
(2) The data sources are often concentrated in specific regions; for example, most of the data may come from a single province, so the place names and organization names in the collected corpus have strong regional characteristics.
(3) The data were collected over the last two to three years, so only dates from that period appear in the corpus.
(4) Because of differences in the population, surnames appear in the data with very different frequencies.
In addition, addresses in electronic medical records are often written in shorthand, abbreviated or colloquial forms; for example, "Ningjin County, Xingtai City, Hebei Province" is often written in a shortened form such as "Hebei Xingtai Ningjin". Dates, addresses and organization names are abbreviated in similar ways.
Therefore, existing electronic medical record data desensitization schemes suffer from limited data samples, difficulty in recognizing colloquial entities, low accuracy and slow model inference, and further research is needed.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an electronic medical record data desensitization method and system based on FLAT. An entity recognition scheme centered on a FLAT-CRF model is adopted; the labeled entities are augmented by a generalization strategy of randomly substituting entities of the same type; knowledge representations of the entities are embedded into both the character vectors and the word vectors; and the recognized entities are desensitized by replacing them with special characters, which improves both the accuracy and the inference speed of data desensitization.
In order to realize the purpose, the invention provides the following technical scheme:
collecting electronic medical record text data, and carrying out data generalization and knowledge embedding processing on the text data to obtain a character fragment sequence sample set;
training an entity recognition model constructed based on FLAT and CRF by using a character fragment sequence sample set;
inputting the electronic medical record text to be desensitized into the trained entity recognition model for inference to obtain the sensitive entities and entity types of the electronic medical record;
specific desensitization treatments are performed on sensitive entities according to entity type.
Further, a text segment is a collective term for a character or a word.
Further, the specific steps of obtaining the sample set are as follows:
according to the special characters, punctuations and the set maximum length of the sentence, the text is divided into sentences;
manually marking entities and entity types in sentences, and carrying out data generalization processing on the marked entities according to the entity types;
constructing character vectors and word vectors to which a knowledge-embedding representation of surnames and addresses is added;
and segmenting the truncated sentences to obtain the character fragment sequence of each sentence, wherein the character fragments together with their position information form the Flat-lattice data structure units required by the model;
performing character vectorization and word vectorization on the characters and words in the character fragment sequence to obtain a character fragment sequence matrix for each sentence;
and constructing a relative position coding matrix of the character fragment sequence matrix.
Further, the data generalization includes surname generalization of person names, address and organization name generalization, and date generalization.
Further, constructing the character vectors and word vectors to which the knowledge-embedding representation of surnames and addresses is added specifically comprises:
constructing character vectors and word vectors from a social-science character vector dictionary and word vector dictionary;
and adding the knowledge-embedding representation of surnames and addresses to the constructed character vectors and word vectors.
Further, the relative position coding matrix is composed of the relative position codes between every two character segments in the character segment sequence matrix, and the relative position codes are calculated as follows:
simulating the relative positional relation between two different character segments with dense vectors to obtain four distances: head-to-head, head-to-tail, tail-to-head and tail-to-tail;
and concatenating the four distances and applying a non-linear transformation to obtain the relative position code of the character segment sequence.
Further, the entity recognition model comprises a multi-head self-attention layer, a feed-forward network layer and a CRF layer, and the specific steps are as follows:
in the multi-head self-attention layer, performing relative position encoding on the character segment sequence matrix and its corresponding relative position coding matrix, and computing multi-head self-attention over the character segment matrix on the basis of this position encoding;
in the feed-forward network layer, performing residual connection and layer normalization to obtain the encoded representation of the character segments;
and in the CRF layer, computing the highest-scoring label sequence of the character segments to obtain the entity labels.
The invention provides an electronic medical record data desensitization system based on FLAT in a second aspect.
An electronic medical record data desensitization system based on FLAT comprises a sample set construction module, a model training module, an entity identification module and a desensitization processing module;
a sample set construction module configured to: collecting electronic medical record text data, and performing data generalization and knowledge embedding processing on the text data to obtain a character fragment sequence sample set;
a model training module configured to: training an entity recognition model constructed based on FLAT and CRF by using a character fragment sequence sample set;
an entity identification module configured to: inputting an electronic medical record text to be desensitized into a trained entity recognition model to obtain a sensitive entity and an entity type of the electronic medical record;
a desensitization processing module configured to: specific desensitization treatments are performed on sensitive entities according to entity type.
A third aspect of the present invention provides a computer readable storage medium, on which a program is stored, which program, when being executed by a processor, carries out the steps of a method for desensitizing electronic medical record data based on FLAT according to the first aspect of the present invention.
A fourth aspect of the present invention provides an electronic device, which includes a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for desensitizing electronic medical record data based on FLAT according to the first aspect of the present invention.
The above one or more technical solutions have the following beneficial effects:
the invention adopts an entity identification scheme taking a FLAT-CRF model as a main part, and utilizes the FLAT technology to fuse static word vectors and word vectors, thereby obtaining higher accuracy than that of the traditional main flow model, and simultaneously, a pre-training model is not applied for feature extraction, thereby ensuring better reasoning speed.
To help the model recognize the features of colloquial entities, the invention adds a knowledge-embedding representation of surnames and addresses to the character vectors and word vectors, so that the model automatically learns the colloquial structure of entities in the corpus, improving its accuracy.
To address the regional and temporal limitations of the corpus, the data are augmented with surname, address and date information through a generalization strategy of randomly replacing entities of the same type, which reduces the limitations of the data samples, greatly mitigates overfitting of the model and widens its usable range.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of the method of the first embodiment.
Fig. 2 is a data structure diagram of the Flat-Lattice in the first embodiment.
Fig. 3 is a structural diagram of an encoder in the first embodiment.
Fig. 4 is a system configuration diagram of the second embodiment.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
The embodiment discloses an electronic medical record data desensitization method based on FLAT;
as shown in fig. 1, a method for desensitizing electronic medical record data based on FLAT includes:
s1, collecting electronic medical record text data, and performing data generalization and knowledge embedding processing on the text data to obtain a character fragment sequence sample set;
the character fragment sequence sample set consists of a character fragment sequence matrix and a relative position coding matrix, and the specific steps of obtaining the character fragment sequence sample set are as follows:
s1-1, dividing a text into sentences according to special characters, punctuations and set maximum length of the sentence;
sentence division is carried out on the collected electronic medical record text data according to special characters (n, r, and the like); aiming at the structured data with complexity, variability and small quantity in the electronic medical record, the structured data is uniformly treated as unstructured data, and the structural symbols (\\ n, \ r, spaces and the like) are treated as specific characters.
Carrying out sentence cutting processing on the sentences according to the set maximum sentence length; if the marked entity appears in the sentence length larger than the maximum set sentence length and the entity mark exists outside the maximum set sentence length, the sentence is cut at the punctuation mark position within the maximum sentence length, and the specific rule is as follows:
rule 1: when the number of words in a sentence is more than 90, a punctuation mark (",", ",") is encountered, the truncation is carried out at the punctuation mark, a new sentence is formed before the truncation, and the rest sentences are continued to be regularly truncated.
Rule 2: when the number of words in the sentence is more than 120, the punctuation symbol (",") is encountered, the truncation is carried out at the punctuation symbol, so that a new sentence is formed before the truncation, and the rest sentences continue to be subjected to the regular truncation.
Rule 3: when the number of words in the sentence is more than 150 and 150 is not the target entity, the truncation is directly carried out at the position; if 150 is part of the target entity, a clause is made at the closest punctuation preceding the target entity.
For example, in the example sentence "(2), the patient finds out liver space occupation by ultrasonic examination in 2019-07-19, takes the liver cancer into consideration by performing intensive CT, performs liver puncture in tumor hospital in shandong province in 2019-07-24, pathologically shows intrahepatic bile duct cancer (the pathology number is 2019-511880), and gives 'hepatic artery chemoembolization' in the hospital interventional department in 2019-07-29 and 2019-09-11 respectively, and the specific medicines are as follows: 150mg of oxaliplatin, 1.4g of gemcitabine, 12ml of iodized oil and 15ml of iodized oil are adopted, ascites is formed before 2 months, abdominal catheterization ascites drainage treatment is carried out in Shandong province tumor hospitals before 1 month, and ascites is drained automatically at home. If the comma before "medication is specified" meets rule 1, a sentence is divided at the comma.
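The following Python sketch illustrates one possible reading of the splitting and truncation rules above; the regular expression, the punctuation set and the handling of entity spans are simplifying assumptions for illustration, not the exact procedure of the invention.

```python
import re

# Illustrative sketch only: the thresholds follow rules 1-3 above, but the
# exact punctuation sets and entity-span handling are assumptions.

def split_on_special_chars(text):
    """Split raw EMR text at structural symbols (\n, \r, spaces) and sentence-ending punctuation."""
    parts = re.split(r"[\n\r ]+|(?<=[。；！？])", text)
    return [p.strip() for p in parts if p.strip()]

def shift(spans, offset):
    """Re-index labelled entity spans after a cut."""
    return [(b - offset, e - offset) for b, e in spans if e > offset]

def truncate(sentence, entity_spans=()):
    """Cut an over-long sentence at a comma (rules 1 and 2) or at a hard limit (rule 3)."""
    for i, ch in enumerate(sentence):
        if i >= 90 and ch in "，,、":          # rules 1 and 2: cut at the next comma
            return [sentence[:i + 1]] + truncate(sentence[i + 1:], shift(entity_spans, i + 1))
        if i >= 150 and not any(b <= i < e for b, e in entity_spans):  # rule 3: hard cut outside entities
            return [sentence[:i]] + truncate(sentence[i:], shift(entity_spans, i))
    return [sentence]
```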
S1-2, manually marking entities and entity types in sentences, and performing data generalization on the marked entities according to their entity types;
After the entity labeling is finished, the labeled entities are generalized according to their entity type, namely surname, address, organization name and date, in order to augment the data, as follows:
surname generalization of names
1) Prepare a relatively complete surname word dictionary and extract its single characters to form a surname character dictionary; entries in the surname word dictionary may be longer than one character (for example compound surnames), whereas entries in the surname character dictionary have a length of 1.
2) For each person name in a labeled sentence, the surname is extracted by code and replaced with a surname randomly selected from the surname dictionary. For example, for the name "Li Qiang" in a sentence, the surname "Li" is extracted, surnames such as "Wang", "Zhang" or "Ouyang" are randomly selected from the surname dictionary, and the generalized names "Wang Qiang", "Zhang Qiang" and "Ouyang Qiang" are produced.
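As an illustration of this surname generalization step, the sketch below uses a tiny stand-in surname dictionary; the dictionary contents and helper names are assumptions.

```python
import random

SURNAME_WORDS = ["王", "张", "李", "刘", "欧阳", "司马"]   # stand-in surname word dictionary

def generalize_name(name, n_variants=3):
    """Replace the surname of a labelled person name with randomly chosen surnames."""
    # Longest-match the surname so compound surnames such as 欧阳 are handled.
    surname = next((s for s in sorted(SURNAME_WORDS, key=len, reverse=True)
                    if name.startswith(s)), name[0])
    given = name[len(surname):]
    candidates = [s for s in SURNAME_WORDS if s != surname]
    return [random.choice(candidates) + given for _ in range(n_variants)]

# e.g. generalize_name("李强") may return ["王强", "张强", "欧阳强"]
```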
Generalization of address and organization name
1) Prepare a relatively complete address dictionary. Following the administrative divisions of the People's Republic of China, an address dictionary generally contains five levels of units: province level (province, municipality directly under the central government), prefecture level (prefecture-level city), county level (county), township level (township, town) and village level (village committee, residents' committee). Non-address expressions are cleaned out of it, for example organizational suffixes such as "committee" are removed from village- and community-level entries; here only the expressions at the province, prefecture and county levels are kept.
2) For the province-level, prefecture-level and county-level word dictionaries, extract their single characters to form corresponding province-level, prefecture-level and county-level character dictionaries; entries in these character dictionaries have a length of 1.
3) For the addresses and organization names in the labeled sentences, the unit expression at each level is extracted by code and a replacement is randomly selected from the address dictionary of the corresponding level. For example, "Zhao County, Shijiazhuang City, Hebei Province" can be generalized to "Cao County, Jinan City, Shandong Province" or "Furong District, Changsha City, Hunan Province"; "Heze Women and Children" can be generalized to "Zhangye Women and Children" or "Cao County Women and Children".
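The address and organization-name generalization can be sketched as level-wise random replacement, as below; the level names, dictionary contents and the assumption that the entity has already been segmented into administrative units are illustrative.

```python
import random

ADDRESS_DICT = {                                   # stand-in level dictionaries
    "province":   ["河北省", "山东省", "湖南省"],
    "prefecture": ["石家庄市", "济南市", "长沙市"],
    "county":     ["赵县", "曹县", "芙蓉区"],
}

def generalize_address(units):
    """units: (level, text) pairs extracted from a labelled address entity.
    Each unit is replaced by a random entry of the same level, so the result
    may be a synthetic combination that does not exist on any real map."""
    return "".join(random.choice(ADDRESS_DICT[level]) for level, _ in units)

# e.g. generalize_address([("province", "河北省"), ("prefecture", "石家庄市"), ("county", "赵县")])
# may return "山东省济南市曹县" or "湖南省长沙市芙蓉区"
```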
Generalization of date
The year of each date in the labeled sentences is extracted by code and randomly replaced with a year from the last five years, keeping the date format (Chinese characters or digits) unchanged. For example, "June 2021" in a sentence can be generalized to "June 2022", "June 2025" or "June 2026"; "2021.07" can be generalized to "2023.07", "2025.07", and so on.
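A minimal sketch of this date generalization, assuming the four-digit year can be located with a regular expression; the year window and the regex are illustrative assumptions.

```python
import random
import re

def generalize_date(date_str, base_year=2022, window=5):
    """Swap the 4-digit year in a date string for a random nearby year, keeping the format."""
    return re.sub(r"(19|20)\d{2}",
                  lambda m: str(base_year + random.randint(-window, window)),
                  date_str, count=1)

# e.g. generalize_date("2021年6月") -> "2025年6月";  generalize_date("2021.07") -> "2023.07"
```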
S1-3, constructing character vectors and word vectors to which the knowledge-embedding representation of surnames and addresses is added, comprising the following steps:
1) Prepare a social-science character vector dictionary and word vector dictionary; because the entities to be recognized belong to the social-science domain, these dictionaries give more accurate representations. The character vectors and word vectors used in this embodiment are both 50-dimensional; a character vector is denoted by c and a word vector by w:
the character vector is c_k ∈ R^50, where k denotes the position at which the character occurs in the sentence;
the word vector is w_k ∈ R^50, where k denotes the position at which the word occurs in the sentence.
2) After the knowledge-embedding representation of surnames and addresses (province, prefecture and county level) is added, the character vector becomes
c_k' = [c_k, surname_k, province_k, prefecture_k, county_k]
where surname_k is 0/1 and indicates whether character k appears in the surname dictionary; province_k is 0/1 and indicates whether character k appears in the province-level address dictionary; prefecture_k is 0/1 and indicates whether character k appears in the prefecture-level address dictionary; and county_k is 0/1 and indicates whether character k appears in the county-level address dictionary.
3) After the knowledge-embedding representation of surnames and addresses (province, prefecture and county level) is added, the word vector becomes
w_k' = [w_k, surname_k, province_k, prefecture_k, county_k]
where surname_k is 0/1 and indicates whether word k appears in the surname dictionary; province_k is 0/1 and indicates whether word k appears in the province-level address dictionary; prefecture_k is 0/1 and indicates whether word k appears in the prefecture-level address dictionary; and county_k is 0/1 and indicates whether word k appears in the county-level address dictionary.
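The knowledge-embedding step can thus be pictured as appending four 0/1 dictionary-membership flags to each 50-dimensional pretrained vector, giving the 54-dimensional representation used in the following steps; the dictionary contents in the sketch below are illustrative stubs.

```python
import numpy as np

SURNAME    = {"李", "王", "张"}            # stand-in dictionaries
PROVINCE   = {"河北省", "山东省"}
PREFECTURE = {"石家庄市", "济南市"}
COUNTY     = {"赵县", "曹县"}

def with_knowledge_flags(token, base_vector):
    """Append the surname / province / prefecture / county membership flags."""
    flags = np.array([float(token in SURNAME),
                      float(token in PROVINCE),
                      float(token in PREFECTURE),
                      float(token in COUNTY)])
    return np.concatenate([base_vector, flags])     # 50 + 4 = 54 dimensions

vec = with_knowledge_flags("李", np.random.randn(50))
assert vec.shape == (54,)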
S1-4, performing word segmentation on the truncated sentences and representing the characters and words of each sentence as vectors to obtain the character fragment sequence matrix of the sentence;
1) The truncated sentence is segmented, and its characters and words are concatenated to obtain the character fragment representation of the sentence:
X_u = {c_1, c_2, …, c_i, …, c_n, w_1, w_2, …, w_j, …, w_m}
where c_i is the vector representation of the i-th character in the sentence, w_j is the vector representation of the j-th word, and X_u denotes the u-th sentence in the corpus.
Note that characters and words are described collectively here as character fragments, so a sentence can be written directly in terms of character fragments:
X_u = {x_1, x_2, …, x_k, …, x_(n+m)}
where x_k is the k-th character fragment of the u-th sentence in the corpus.
2) The character fragment sequence X_u of a sentence can be expanded into a flat-lattice data structure, as shown in Fig. 2. The flat-lattice is a collection of sequence spans, each span consisting of a token, a head and a tail; the token is a character or word of the text, and the head and tail are the positions in the original sequence of the first and last characters of the token.
Finally, each sentence can be vectorized into a character fragment sequence matrix E_X, whose rows are the d_model-dimensional vectors of the character fragments; d_model is the dimension of the character vectors and word vectors, here 54.
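The flat-lattice construction can be sketched as follows: every character becomes a span of length one, and every lexicon word matched in the sentence becomes a span recording the positions of its first and last character (token, head, tail). The tiny lexicon and example sentence below are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Span:
    token: str   # the character or word
    head: int    # index of its first character in the sentence
    tail: int    # index of its last character in the sentence

LEXICON = {"重庆", "人和药店", "药店"}        # stand-in word list

def build_flat_lattice(sentence, lexicon=LEXICON, max_word_len=6):
    spans = [Span(ch, i, i) for i, ch in enumerate(sentence)]           # character spans
    for i in range(len(sentence)):                                      # matched word spans
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                spans.append(Span(sentence[i:j], i, j - 1))
    return spans

# build_flat_lattice("重庆人和药店") yields the six single-character spans plus
# Span("重庆", 0, 1), Span("人和药店", 2, 5) and Span("药店", 4, 5).
```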
S1-5, constructing a relative position coding matrix of a character segment sequence matrix;
the relative position coding matrix is composed of relative position codes of every two character segments in the character segment sequence matrix, and the calculation method of the relative position codes comprises the following steps:
1) In the lattice structure, two different character segments (characters/words) x_i and x_j can be related in three ways: they intersect, one contains the other, or they are disjoint. These relations are modeled with dense vectors built from four relative distances:
d_ij^(hh) = head[i] − head[j]
d_ij^(ht) = head[i] − tail[j]
d_ij^(th) = tail[i] − head[j]
d_ij^(tt) = tail[i] − tail[j]
where head[i] and tail[i] denote the head and tail positions of x_i, d_ij^(hh) denotes the distance between the head of x_i and the head of x_j, and the other distances are defined analogously.
2) The four distances are concatenated and passed through a non-linear transformation to obtain the relative position code of the character segment sequence:
R_ij = ReLU( W_r (p_(d_ij^(hh)) ⊕ p_(d_ij^(th)) ⊕ p_(d_ij^(ht)) ⊕ p_(d_ij^(tt))) )
where W_r is a learnable parameter, ⊕ denotes the concatenation operation, and p_d is the sinusoidal encoding
p_d^(2k) = sin( d / 10000^(2k / d_model) ),  p_d^(2k+1) = cos( d / 10000^(2k / d_model) )
where d is one of the four distances above and k indexes the dimensions of the character segment vector, ranging over [0, d_model / 2].
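A minimal numeric sketch of this relative position code: the four head/tail distances are sinusoidally encoded, concatenated and passed through ReLU(W_r · ). Here W_r is a random stand-in for the learnable parameter and the helper names are assumptions.

```python
import numpy as np

D_MODEL = 54

def sinusoid(d, d_model=D_MODEL):
    """p_d: sinusoidal encoding of a (possibly negative) distance d."""
    k = np.arange(d_model // 2)
    angles = d / np.power(10000.0, 2 * k / d_model)
    enc = np.zeros(d_model)
    enc[0::2], enc[1::2] = np.sin(angles), np.cos(angles)
    return enc

def relative_position_code(head_i, tail_i, head_j, tail_j, W_r):
    """R_ij = ReLU(W_r [p_hh ; p_th ; p_ht ; p_tt])."""
    hh, ht = head_i - head_j, head_i - tail_j
    th, tt = tail_i - head_j, tail_i - tail_j
    concat = np.concatenate([sinusoid(d) for d in (hh, th, ht, tt)])
    return np.maximum(0.0, W_r @ concat)

W_r = 0.02 * np.random.randn(D_MODEL, 4 * D_MODEL)    # learnable in the real model
R_01 = relative_position_code(0, 0, 1, 2, W_r)        # e.g. a character span vs a word span
assert R_01.shape == (D_MODEL,)
```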
S2, training the entity recognition model constructed based on FLAT and CRF by using the sample set;
The entity recognition model is formed by stacking encoders. The structure of each encoder is shown in Fig. 3 and mainly consists of a multi-head self-attention layer and a feed-forward network layer, with residual connections and layer normalization applied throughout. To keep inference efficient, only a single-layer Transformer structure is used as the encoder.
The specific steps of each layer of encoder are as follows:
1) In the multi-head self-attention layer, relative position encoding is performed using the head and tail information carried by the spans of the flat-lattice data structure, and multi-head self-attention is then computed over the sequence vectors on the basis of this position encoding.
The training sentences are fed into the encoder in fixed-size batches. The attention score between character segments i and j is computed as
A*_ij = W_q^T E_(x_i)^T E_(x_j) W_(k,E) + W_q^T E_(x_i)^T R_ij W_(k,R) + u^T E_(x_j) W_(k,E) + v^T R_ij W_(k,R)
where W_q, W_(k,E), W_(k,R), u and v are learnable parameters, E_(x_i) and E_(x_j) are the vector representations of the i-th and j-th character segments of a sentence in the batch, and R_ij is the relative position code from step S1-5. A*_ij is a variant of the self-attention module and is approximately equivalent to the vanilla score
A_ij = Q_i K_j^T
Substituting A* for A in the formula
Att(A, V) = softmax(A) * V
gives the attention computation for the batch, where Q_i, K_i and V_i are the query, key and value vectors obtained from the inputs, i.e. [Q_i, K_i, V_i] = E_(x_i) [W_q, W_k, W_v], and the self-attention is performed by this module.
The attention results of the different heads of the multi-head attention mechanism are concatenated to obtain the final output sequence vector (a code sketch of this attention computation is given after step 3 below).
2) In the feed-forward network layer, residual connection and layer normalization are applied to the output of the multi-head self-attention layer to obtain the encoded representation of the character segments.
3) The CRF layer computes the highest-scoring label sequence over the character segments to obtain the entity labels.
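To make the attention step above concrete, the sketch below computes the span-pair score A*_ij for a single head with random stand-ins for the learnable parameters. It follows the relative-attention form used by FLAT and is illustrative, not the exact implementation of the invention.

```python
import numpy as np

d, n = 54, 8                                   # d_model and number of character segments
rng = np.random.default_rng(0)

E = rng.standard_normal((n, d))                # segment embeddings E_x
R = rng.standard_normal((n, n, d))             # relative position codes R_ij (from step S1-5)
W_q, W_kE, W_kR, W_v = (0.02 * rng.standard_normal((d, d)) for _ in range(4))
u, v = rng.standard_normal(d), rng.standard_normal(d)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# A*_ij = (E_i W_q)·(E_j W_kE) + (E_i W_q)·(R_ij W_kR) + u·(E_j W_kE) + v·(R_ij W_kR)
Q, K = E @ W_q, E @ W_kE
A_star = np.empty((n, n))
for i in range(n):
    for j in range(n):
        r = R[i, j] @ W_kR
        A_star[i, j] = Q[i] @ K[j] + Q[i] @ r + u @ K[j] + v @ r

out = softmax(A_star / np.sqrt(d)) @ (E @ W_v)   # Att(A*, V) = softmax(A*) V, one head
```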
S3, inputting the electronic medical record text to be desensitized into the trained entity recognition model to obtain the sensitive entities of the electronic medical record and their entity types;
S4, performing type-specific desensitization on the sensitive entities, namely replacing them with special character strings: person names are replaced with "#person_name#"; specific dates with "#date#"; addresses with "#location#"; organization names with "#organization#"; contact details with "#telephone#"; and various important numbers with "#ID#". Here "#" is a special character chosen to make the replaced entities easy to extract.
The data selected for desensitization in this embodiment come from several general hospitals, and the data to be labeled were selected by keyword search, with emphasis on the diversity of the labeled content; the number of entities of each type was determined by the richness of that entity type. The numbers of entities in the selected samples are shown in Table 1:
TABLE 1 entity quantity chart of selected samples
The data obtained after generalization are shown in table 2:
TABLE 2 Experimental data constitution
Training set sentences: 118998
Validation set sentences: 14927
Test set sentences: 14878
The effect of several models on data desensitization was experimentally compared, and specific data are shown in table 3:
TABLE 3 comparison of the accuracy of the different protocols
As can be seen from Table 3, the FLAT model improves accuracy by 3% over the BiLSTM model; after FLAT is chosen as the baseline model and knowledge embedding is added to the character vectors and word vectors, accuracy improves by a further 1%.
After data generalization and knowledge embedding are applied, the accuracy of BiLSTM+CRF also improves by about 4%, which demonstrates the effectiveness of data generalization and knowledge embedding.
Example two
The embodiment discloses an electronic medical record data desensitization system based on FLAT;
As shown in Fig. 4, an electronic medical record data desensitization system based on FLAT includes a sample set construction module, a model training module, an entity recognition module and a desensitization processing module;
a sample set construction module configured to: collecting electronic medical record text data, and carrying out data generalization and knowledge embedding processing on the text data to obtain a character fragment sequence sample set;
a model training module configured to: training an entity recognition model constructed based on FLAT and CRF by using a text fragment sequence sample set;
an entity identification module configured to: inputting an electronic medical record text to be desensitized into a trained entity recognition model to obtain a sensitive entity and an entity type of the electronic medical record;
a desensitization processing module configured to: specific desensitization treatments are performed on sensitive entities according to entity type.
EXAMPLE III
An object of the present embodiments is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in a method for desensitizing electronic medical record data based on FLAT according to embodiment 1 of the present disclosure.
Example four
An object of the present embodiment is to provide an electronic apparatus.
Electronic equipment, comprising a memory, a processor and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method for desensitizing electronic medical record data based on FLAT according to embodiment 1 of the present disclosure.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for desensitizing electronic medical record data based on FLAT is characterized by comprising the following steps:
collecting electronic medical record text data, and carrying out data generalization and knowledge embedding processing on the text data to obtain a character fragment sequence sample set;
training an entity recognition model constructed based on FLAT and CRF by using a character fragment sequence sample set;
inputting an electronic medical record text to be desensitized into a trained entity recognition model for inference to obtain a sensitive entity and an entity type of the electronic medical record;
specific desensitization treatments are performed on sensitive entities according to entity type.
2. The method as claimed in claim 1, wherein a text segment is a collective term for a character or a word.
3. The method for desensitizing electronic medical record data based on FLAT as set forth in claim 1, wherein the specific steps for obtaining the sample set are:
according to the special characters, punctuations and the set maximum length of the sentence, the text is divided into sentences;
manually marking entities and entity types in sentences, and carrying out data generalization processing on the marked entities according to the entity types;
constructing character vectors and word vectors to which a knowledge embedding representation of surnames and addresses is added;
and segmenting the intercepted sentences to obtain a character fragment sequence of each sentence, wherein the character fragments and the position information of the character fragments form a Flat-lattice data structure unit required by the model.
Performing character vectorization and word vectorization on characters and words in the character fragment sequence to obtain a character fragment sequence matrix of each sentence;
and constructing a relative position coding matrix of the text segment sequence matrix.
4. The method as set forth in claim 3, wherein the data generalization includes surname generalization of person names, address generalization, organization name generalization, and date generalization.
5. The method for desensitizing electronic medical record data based on FLAT as claimed in claim 3, wherein constructing the character vectors and word vectors to which the knowledge embedding representation of surnames and addresses is added specifically comprises:
constructing character vectors and word vectors according to a social-science character vector dictionary and word vector dictionary;
and adding the knowledge embedding representation of surnames and addresses to the constructed character vectors and word vectors.
6. The method for desensitizing electronic medical record data based on FLAT as claimed in claim 3, wherein the relative position coding matrix is composed of the relative position codes between every two character segments in the character segment sequence matrix, and the relative position codes are calculated as follows:
simulating the relative positional relation between two different character segments with dense vectors to obtain four distances: head-to-head, head-to-tail, tail-to-head and tail-to-tail;
and concatenating the four distances and applying a non-linear transformation to obtain the relative position code of the character segment sequence.
7. The method for desensitizing electronic medical record data based on FLAT according to claim 1, wherein the entity recognition model comprises a multi-head self-attention layer, a feed-forward network layer and a CRF layer, and the method comprises the following steps:
in the multi-head self-attention layer, performing relative position encoding on the character segment sequence matrix and its corresponding relative position coding matrix, and computing multi-head self-attention over the character segment matrix on the basis of this position encoding;
in the feed-forward network layer, performing residual connection and layer normalization to obtain the encoded representation of the character segments;
and in the CRF layer, computing the highest-scoring label sequence of the character segments to obtain the entity labels.
8. An electronic medical record data desensitization system based on FLAT is characterized by comprising a sample set construction module, a model training module, an entity identification module and a desensitization processing module;
a sample set construction module configured to: collecting electronic medical record text data, and performing data generalization and knowledge embedding processing on the text data to obtain a character fragment sequence sample set;
a model training module configured to: training an entity recognition model constructed based on FLAT and CRF by using a character fragment sequence sample set;
an entity identification module configured to: inputting the electronic medical record text to be desensitized into the trained entity recognition model for inference to obtain the sensitive entities and entity types of the electronic medical record;
a desensitization processing module configured to: specific desensitization treatment is performed on sensitive entities according to entity types.
9. A computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, carries out the steps of the method for electronic medical record data desensitization based on FLAT according to any one of claims 1-7.
10. Electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor executes the program to perform the steps of a method for FLAT-based electronic medical record data desensitization according to any of claims 1-7.
CN202211116144.4A 2022-09-14 2022-09-14 Electronic medical record data desensitization method and system based on FLAT Pending CN115438379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211116144.4A CN115438379A (en) 2022-09-14 2022-09-14 Electronic medical record data desensitization method and system based on FLAT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211116144.4A CN115438379A (en) 2022-09-14 2022-09-14 Electronic medical record data desensitization method and system based on FLAT

Publications (1)

Publication Number Publication Date
CN115438379A true CN115438379A (en) 2022-12-06

Family

ID=84246278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211116144.4A Pending CN115438379A (en) 2022-09-14 2022-09-14 Electronic medical record data desensitization method and system based on FLAT

Country Status (1)

Country Link
CN (1) CN115438379A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688611A (en) * 2024-01-30 2024-03-12 深圳昂楷科技有限公司 Electronic medical record desensitizing method and system, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
Alvarado et al. Domain adaption of named entity recognition to support credit risk assessment
Szarvas et al. State-of-the-art anonymization of medical records using an iterative machine learning framework
CN108009182B (en) Information extraction method and device
CN111966917B (en) Event detection and summarization method based on pre-training language model
CN109670179B (en) Medical record text named entity identification method based on iterative expansion convolutional neural network
Campos et al. Biomedical named entity recognition: a survey of machine-learning tools
Dehghan et al. Combining knowledge-and data-driven methods for de-identification of clinical narratives
CN108959566B (en) A kind of medical text based on Stacking integrated study goes privacy methods and system
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN106844351B (en) Medical institution organization entity identification method and device oriented to multiple data sources
CN111274806A (en) Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
CN111291568B (en) Automatic entity relationship labeling method applied to medical texts
Chan et al. Reproducible extraction of cross-lingual topics (rectr)
Trienes et al. Comparing rule-based, feature-based and deep neural methods for de-identification of dutch medical records
CN111325018B (en) Domain dictionary construction method based on web retrieval and new word discovery
Du et al. A machine learning based approach to identify protected health information in Chinese clinical text
CN112487202A (en) Chinese medical named entity recognition method and device fusing knowledge map and BERT
CN113468887A (en) Student information relation extraction method and system based on boundary and segment classification
CN115438379A (en) Electronic medical record data desensitization method and system based on FLAT
CN116049354A (en) Multi-table retrieval method and device based on natural language
Ahamed et al. Spell corrector for Bangla language using Norvig’s algorithm and Jaro-Winkler distance
CN112215007B (en) Organization named entity normalization method and system based on LEAM model
Alipour et al. Learning bilingual word embedding mappings with similar words in related languages using GAN
CN113254651A (en) Method and device for analyzing referee document, computer equipment and storage medium
Khan et al. Enhancement of text analysis using context-aware normalization of social media informal text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination