Electronic medical record data set analysis method and system based on the ERNIE model
Technical Field
The invention relates to the field of natural language processing, and in particular to an electronic medical record data set analysis method and system based on the ERNIE model.
Background
An electronic medical record is a complete and detailed clinical information resource generated and recorded while a person receives treatment at a medical institution, and it is a main component of current medical data. However, current electronic medical records are mainly in text form and cannot be used directly for analysis and research. Therefore, how to parse electronic medical records accurately and efficiently, and extract the content of their data sets for analysis and research, is a problem to be solved in medical data management.
At present, the commonly used method for analyzing a data set combines keyword extraction with regular-expression matching, and specifically comprises the following steps:
first, determining the position of the data set to be extracted according to keywords in the electronic medical record;
then, extracting the content of the data set by regular expressions or other rule-matching methods.
For example, to extract the chief complaint from the text of an admission record: first, the position of this data set in the admission record text is determined from the keyword 'chief complaint'; then the chief complaint content is extracted up to a separator such as a carriage return or a period.
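The keyword-plus-rule baseline described above can be sketched in a few lines. This is only an illustration of the prior-art approach, not code from the source: the function name `extract_by_keyword`, the sample record and the choice of terminators (ASCII or full-width period, carriage return, line feed) are assumptions made for the example.

```python
import re
from typing import Optional


def extract_by_keyword(text: str, keyword: str) -> Optional[str]:
    """Return the text that follows `keyword` up to the first separator.

    Assumed rule: the keyword may be followed by an ASCII or full-width
    colon, and the content ends at the first period or line break.
    """
    pattern = re.escape(keyword) + r"\s*[::]?\s*([^。.\r\n]*)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


# Hypothetical admission record text used only for this illustration.
record = (
    "Chief complaint: cough and fever for three days.\n"
    "Present illness: onset three days ago."
)

with_keyword = extract_by_keyword(record, "Chief complaint")
# The same content without the keyword cannot be located at all.
without_keyword = extract_by_keyword(
    "Cough and fever for three days.", "Chief complaint"
)
```

The second call returns `None`, which illustrates the dependence on keywords: when the keyword is absent, the rule has nothing to anchor on.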
Although this method can analyze the data sets of an electronic medical record quickly, the electronic medical record is semi-structured content in which a large number of paragraphs are filled in as free text, and the electronic medical record templates of different vendors and different hospitals differ. The method therefore has the following problems:
(1) when keywords are determined and extraction rules are formulated, they must be formulated one by one for the medical records of each vendor and each type, so the universality is poor;
(2) during extraction, the rules need to be refined repeatedly according to the actual content, so the accuracy is low;
(3) once a vendor upgrades or replaces its system, the keywords and extraction rules must be determined again, so the universality is poor;
(4) the method cannot handle text in which the keywords are missing, and depends strongly on keywords.
In summary, how to overcome the repeated updating of extraction rules and the inability to analyze text without keywords, both of which arise because the extraction of electronic medical record data sets depends on keywords and rules, and thereby effectively reduce the analysis cost, is a problem urgently to be solved in current medical data management.
Disclosure of Invention
The invention aims to provide an electronic medical record data set analysis method and system based on the ERNIE model, which solve the problems that extraction rules are repeatedly updated and that text without keywords cannot be analyzed because the extraction of electronic medical record data sets depends on keywords and rules, and which effectively reduce the analysis cost.
The technical task of the invention is achieved as follows: an electronic medical record data set analysis method based on the ERNIE model judges the data set according to the meaning of each sentence in the electronic medical record, thereby overcoming the dependence on keywords and rules in the analysis of electronic medical records; the method comprises the following steps:
S1, determining the data sets of different types of text: determining the data sets to be extracted according to the different types of electronic medical records, and mapping or fine-tuning the data sets according to the electronic medical record texts of different vendors;
S2, extracting and labeling data set samples: after the electronic medical record data sets to be extracted from the different types of documents are determined, collecting and labeling samples to construct a sample set;
S3, retraining a text classification model based on the ERNIE pre-trained model: performing model training separately on each of the M sub-sample sets in the sample set;
S4, extracting the content of the data sets: extracting the content of the corresponding data set using the model trained in step S3.
Preferably, the extraction and labeling of data set samples in step S2 is specifically as follows:
S201, randomly extracting N texts from each type of sample to be analyzed;
S202, selecting suitable separators according to the actual text (generally periods or carriage returns; several separators may also be used together) and splitting the text into blocks;
S203, removing dirty characters from each text, where dirty characters are characters that affect semantic judgment;
S204, manually labeling the blocks according to the data sets determined in step S1.
Preferably, the sample set is constructed as follows:
(1) extracting N documents from each type of document;
(2) determining the categories of the data groups in light of the actual data to be analyzed;
(3) labeling the data groups in the N documents manually or with a labeling platform;
(4) composing the total sample set according to the sample structure of Equation 1 and Equation 2, as follows:
$S = \{s_1, s_2, s_3, \ldots, s_M\}$; (Equation 1)
$s_i = \{n_{i1}, n_{i2}, n_{i3}, \ldots, n_{id}, n_{i(d+1)}\}$; (Equation 2)
Here $S$ denotes the total sample set, which is composed of the sub-sample sets $s_i$ of the M types of documents to be analyzed. Each sub-sample set $s_i$ contains d subcategories, d being the number of data group categories contained in documents of that type. Although each sub-sample set is obtained by extracting and labeling N randomly drawn texts, the actual text of each data group is split into blocks when the sample set is formed, in order to remove the model's dependence on keywords; the numbers of samples in the d subcategories are therefore unequal, with a minimum of N samples. In addition, a parsed document typically contains many template sentences that belong to no data group, so an additional 'other' category $n_{i(d+1)}$ is added to each sub-sample set to distinguish such text.
More preferably, the following points should be observed when constructing the sample set:
(1) during sampling, samples should be drawn at random from the whole collection, to ensure that the samples are comprehensive;
(2) the complete text of each original data set is split into blocks before being put into the sample set, so that the model is freed from dependence on keywords and the content of the sample set is as diverse as possible.
Preferably, in step S3, three parameters of the model, namely the maximum sequence length (Maximum Sequence Length), the batch size (Batch Size) and the learning rate (Learning Rate), are tuned while retraining the text classification model based on the ERNIE pre-trained model; specifically:
S301, selecting max_len_num search values for the maximum sequence length, batch_size_num search values for the batch size and learn_rate_num search values for the learning rate, and combining them into max_len_num × batch_size_num × learn_rate_num parameter groups;
S302, selecting one parameter combination from step S301, evaluating the model by leave-one-out cross-validation, and calculating the average recognition rate of the model;
S303, repeating step S302 until all parameter groups have been processed, selecting the group with the highest average recognition rate as the optimal parameters of the model, and outputting the model trained with the optimal parameters as the optimal model;
S304, training each of the M sub-sample sets through steps S301 to S303 to obtain M sub-models.
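As a reading aid, the search in steps S301 to S303 can be written compactly. The notation is introduced here for illustration and does not appear in the source: $G$ is the set of parameter groups, $n$ is the number of labeled blocks in one sub-sample set, $(x_k, y_k)$ is the k-th block and its label, and $f_{\theta}^{(-k)}$ is the classifier trained with parameters $\theta$ on all blocks except the k-th. Two assumptions are made: the parameter groups form a full grid over the three lists of search values, and the "average recognition rate" is read as the mean leave-one-out accuracy.

```latex
\begin{aligned}
|G| &= \text{max\_len\_num} \times \text{batch\_size\_num} \times \text{learn\_rate\_num} \\
\bar{a}(\theta) &= \frac{1}{n} \sum_{k=1}^{n} \mathbf{1}\!\left[ f_{\theta}^{(-k)}(x_k) = y_k \right] \\
\theta^{*} &= \arg\max_{\theta \in G} \bar{a}(\theta)
\end{aligned}
```

Under this reading, each parameter group requires $n$ training runs, so the total cost for one sub-sample set is $|G| \cdot n$ runs; this is affordable only because the manually labeled sample sets are small.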
Preferably, the extraction of data set content in step S4 is specifically as follows:
S401, performing dirty-character removal and block splitting on all texts to be analyzed;
S402, inputting the text blocks into the corresponding text model to classify each block;
S403, combining the blocks in input order, where the combined text content assigned to each data set, excluding the 'other' category, is taken as the data set extraction result of the document.
More preferably, the block splitting in step S401 proceeds as follows:
S40101, splitting the document to be analyzed into blocks at periods or carriage returns;
S40102, removing the dirty characters from each sentence, where dirty characters are characters that affect semantic judgment;
S40103, inputting the sentence blocks in order into the sub-model of the corresponding document type to determine the category of each text block;
S40104, recombining the classification results of the document in analysis order, joining the blocks with carriage returns or spaces.
An electronic medical record data set analysis system based on the ERNIE model comprises:
a data set determining unit, configured to determine the data sets to be extracted according to the different types of electronic medical records, and then to map or fine-tune the data sets according to the electronic medical record texts of different vendors;
a data set sample extraction and labeling unit, configured to collect and label samples to construct a sample set after the electronic medical record data sets to be extracted from the different types of documents are determined; the data set sample extraction and labeling unit comprises:
a text random extraction module, configured to randomly extract N texts from each type of sample to be analyzed;
a text blocking module, configured to select suitable separators according to the actual text (generally periods or carriage returns; several separators may also be used together) and split the text into blocks;
a dirty character removing module, configured to remove dirty characters from each text, where dirty characters are characters that affect semantic judgment;
a manual labeling module, configured to manually label the blocks according to the data sets determined by the data set determining unit;
a text classification model retraining unit, configured to perform model training separately on each of the M sub-sample sets in the sample set; the text classification model retraining unit comprises:
a combination module, configured to select max_len_num search values for the maximum sequence length, batch_size_num search values for the batch size and learn_rate_num search values for the learning rate, and to combine them into max_len_num × batch_size_num × learn_rate_num parameter groups;
an average recognition rate calculation module, configured to select one parameter combination from the combination module, evaluate the model by leave-one-out cross-validation, and calculate the average recognition rate of the model;
an optimal model output module, configured to invoke the average recognition rate calculation module repeatedly until all parameter groups have been processed, select the group with the highest average recognition rate as the optimal parameters of the model, and output the model trained with the optimal parameters as the optimal model;
a sub-model acquisition module, configured to train each of the M sub-sample sets through the combination module, the average recognition rate calculation module and the optimal model output module to obtain M sub-models;
a data set content extraction unit, configured to extract the content of the corresponding data set using the trained model; the data set content extraction unit comprises:
a dirty character removing and blocking module, configured to perform dirty-character removal and block splitting on all texts to be analyzed;
a block text classification module, configured to input the text blocks into the corresponding text model to classify each block;
and a data set result extraction module, configured to combine the blocks in input order, where the combined text content assigned to each data set, excluding the 'other' category, is taken as the data set extraction result of the document.
A storage medium stores a plurality of instructions, the instructions being loaded by a processor to perform the steps of the electronic medical record data set analysis method based on the ERNIE model described above.
An electronic device, the electronic device comprising:
the storage medium described above; and
a processor for executing the instructions in the storage medium.
The electronic medical record data set analysis method and system based on the ERNIE model have the following advantages:
(1) the invention judges the data group according to the meaning of each sentence in the electronic medical record, overcomes the dependence on keywords and rules in the analysis of electronic medical records, solves the problems of repeatedly updating rules and of being unable to analyze text without keywords, and reduces the analysis cost;
(2) the invention removes the excessive dependence on keywords and rules in the analysis of electronic medical record data sets and saves the time spent tuning and updating keywords; compared with the traditional electronic medical record data set analysis method, it is more universal;
(3) the invention uses a semantic analysis technique, namely a text classification model, which is the ERNIE model in the PaddlePaddle framework, so that the data group of each text in the electronic medical record is classified without relying on keywords;
(4) when the text is split into blocks, the separators are customized according to the actual text content, which meets the splitting requirements of different texts and ensures accuracy.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of the electronic medical record data set analysis method based on the ERNIE model;
FIG. 2 is a block diagram of the electronic medical record data set analysis system based on the ERNIE model.
Detailed Description
The electronic medical record data set analysis method and system based on the ERNIE model of the invention are described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1:
As shown in FIG. 1, the electronic medical record data set analysis method based on the ERNIE model judges the data set according to the meaning of each sentence in the electronic medical record, thereby overcoming the dependence on keywords and rules in the analysis of electronic medical records; the method comprises the following steps:
S1, determining the data sets of different types of text: according to the description of electronic medical record data groups and data elements, a data group (Data Group) is a composite data structure formed by aggregating related information items. Different types of electronic medical record text contain different data sets, and the content of the data sets differs slightly between the electronic medical record texts of different vendors and hospitals; the data sets are therefore determined as follows:
S101, determining the data sets to be extracted according to the different types of electronic medical records;
S102, mapping or fine-tuning the data sets according to the electronic medical record texts of different vendors;
S2, extracting and labeling data set samples: after the electronic medical record data sets to be extracted from the different types of documents are determined, collecting and labeling samples to construct a sample set; specifically:
S201, randomly extracting N texts from each type of sample to be analyzed;
S202, selecting suitable separators according to the actual text (generally periods or carriage returns; several separators may also be used together) and splitting the text into blocks;
S203, removing dirty characters from each text, where dirty characters are characters that affect semantic judgment;
S204, manually labeling the blocks according to the data sets determined in step S1.
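Steps S201 to S203 can be sketched as follows, leaving the manual labeling of S204 to a person or a labeling platform. This is a minimal sketch under stated assumptions, not the implementation of the invention: the separator set (ASCII and full-width periods, carriage returns, line feeds), the definition of "dirty" characters (control characters, zero-width space, byte-order mark, redundant whitespace) and all function names are chosen here for illustration.

```python
import random
import re
from typing import List

# Assumed separators: ASCII/full-width periods and line breaks.
SEPARATORS = re.compile(r"[。.\r\n]+")
# Assumed dirty characters: control characters (except tab and line breaks),
# zero-width space and byte-order mark.
DIRTY = re.compile(r"[\u200b\ufeff\x00-\x08\x0b\x0c\x0e-\x1f]")


def split_into_blocks(text: str) -> List[str]:
    """S202: split a text at the separators and drop empty pieces."""
    return [piece for piece in SEPARATORS.split(text) if piece.strip()]


def remove_dirty_chars(block: str) -> str:
    """S203: delete dirty characters and collapse runs of whitespace."""
    block = DIRTY.sub("", block)
    return re.sub(r"\s+", " ", block).strip()


def prepare_blocks(documents: List[str], n: int, seed: int = 0) -> List[str]:
    """S201 to S203: sample up to n documents, then split and clean them."""
    rng = random.Random(seed)
    sampled = rng.sample(documents, min(n, len(documents)))
    blocks = []
    for document in sampled:
        for piece in split_into_blocks(document):
            cleaned = remove_dirty_chars(piece)
            if cleaned:
                blocks.append(cleaned)
    return blocks


# Hypothetical admission record text used only for this illustration.
doc = (
    "Chief complaint: cough for 3 days.\n"
    "Present illness:\u200b fever since   Monday.\r\n"
)
blocks = prepare_blocks([doc], n=1)
```

Note that splitting on the ASCII period also splits decimal values such as a temperature of 38.5, so a real separator set would have to be chosen with the actual text in mind, as S202 requires.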
The sample set is constructed as follows:
(1) extracting N documents from each type of document;
(2) determining the categories of the data groups in light of the actual data to be analyzed;
(3) labeling the data groups in the N documents manually or with a labeling platform;
(4) composing the total sample set according to the sample structure of Equation 1 and Equation 2, as follows:
$S = \{s_1, s_2, s_3, \ldots, s_M\}$; (Equation 1)
$s_i = \{n_{i1}, n_{i2}, n_{i3}, \ldots, n_{id}, n_{i(d+1)}\}$; (Equation 2)
Here $S$ denotes the total sample set, which is composed of the sub-sample sets $s_i$ of the M types of documents to be analyzed. Each sub-sample set $s_i$ contains d subcategories, d being the number of data group categories contained in documents of that type. Although each sub-sample set is obtained by extracting and labeling N randomly drawn texts, the actual text of each data group is split into blocks when the sample set is formed, in order to remove the model's dependence on keywords; the numbers of samples in the d subcategories are therefore unequal, with a minimum of N samples. In addition, a parsed document typically contains many template sentences that belong to no data group, so an additional 'other' category $n_{i(d+1)}$ is added to each sub-sample set to distinguish such text.
The following points should be observed when constructing the sample set:
(1) during sampling, samples should be drawn at random from the whole collection, to ensure that the samples are comprehensive;
(2) the complete text of each original data set is split into blocks before being put into the sample set, so that the model is freed from dependence on keywords and the content of the sample set is as diverse as possible.
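The nested structure of Equation 1 and Equation 2 maps naturally onto a dictionary of dictionaries: the outer keys are the M document types and the inner keys are the d data group categories plus the extra 'other' category. The sketch below assumes the labeled blocks arrive as `(document type, block text, label)` triples, with `None` as the label of template sentences; this input format and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

OTHER = "other"  # the extra category added to every sub-sample set

LabeledBlock = Tuple[str, str, Optional[str]]  # (doc type, block text, label)
SampleSet = Dict[str, Dict[str, List[str]]]


def build_sample_set(labeled_blocks: Iterable[LabeledBlock]) -> SampleSet:
    """Group labeled blocks into S = {s_1, ..., s_M}, one s_i per doc type."""
    total = defaultdict(lambda: defaultdict(list))
    for doc_type, block, label in labeled_blocks:
        # Blocks that belong to no data group go into the 'other' category.
        total[doc_type][label or OTHER].append(block)
    return {doc_type: dict(groups) for doc_type, groups in total.items()}


# Hypothetical annotations used only for this illustration.
annotations = [
    ("admission_record", "Cough for 3 days", "chief_complaint"),
    ("admission_record", "Fever since Monday", "present_illness"),
    ("admission_record", "Admission Record", None),
    ("discharge_summary", "Discharged in stable condition", "discharge_status"),
]
S = build_sample_set(annotations)
```

Because every block is stored separately rather than as one paragraph per data group, the categories naturally end up with different numbers of samples, as described above.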
S3, retraining a text classification model based on the ERNIE pre-trained model: the ERNIE pre-trained model is the most typical semantic model in PaddleNLP and is pre-trained on multiple NLP tasks. The ERNIE model therefore has the advantages that it can be trained on small samples and that its preprocessing is simple. Because the early samples are all manually labeled and the sample size is small, the ERNIE model, with its strong semantic capability, is selected as the pre-trained model for text classification. Model training is performed separately on each of the M sub-sample sets in the sample set; while retraining the text classification model based on the ERNIE pre-trained model, three parameters of the model, namely the maximum sequence length (Maximum Sequence Length), the batch size (Batch Size) and the learning rate (Learning Rate), are tuned; specifically:
S301, selecting max_len_num search values for the maximum sequence length, batch_size_num search values for the batch size and learn_rate_num search values for the learning rate, and combining them into max_len_num × batch_size_num × learn_rate_num parameter groups;
S302, selecting one parameter combination from step S301, evaluating the model by leave-one-out cross-validation, and calculating the average recognition rate of the model;
S303, repeating step S302 until all parameter groups have been processed, selecting the group with the highest average recognition rate as the optimal parameters of the model, and outputting the model trained with the optimal parameters as the optimal model;
S304, training each of the M sub-sample sets through steps S301 to S303 to obtain M sub-models.
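The control flow of steps S301 to S303 can be sketched independently of the actual model. In the sketch below, `toy_train` is a deliberately trivial stand-in (nearest neighbour on word overlap) so that the example runs without any deep-learning dependency; it is not ERNIE and ignores the batch size and learning rate. The function names, the parameter names and the reading of "average recognition rate" as mean leave-one-out accuracy are all assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, str]            # (block text, label)
Predictor = Callable[[str], str]
Trainer = Callable[..., Predictor]  # train_fn(samples, **params) -> predictor


def leave_one_out_accuracy(samples: List[Sample], train_fn: Trainer,
                           params: Dict) -> float:
    """S302: hold out each sample once, train on the rest, average the hits."""
    hits = 0
    for k, (text, label) in enumerate(samples):
        predictor = train_fn(samples[:k] + samples[k + 1:], **params)
        hits += predictor(text) == label
    return hits / len(samples)


def grid_search(samples: List[Sample], train_fn: Trainer,
                max_lens: Sequence[int], batch_sizes: Sequence[int],
                learning_rates: Sequence[float]):
    """S301 and S303: score every parameter group and keep the best one."""
    best_params, best_score = None, -1.0
    for max_len, batch, rate in product(max_lens, batch_sizes, learning_rates):
        params = {"max_seq_len": max_len, "batch_size": batch,
                  "learning_rate": rate}
        score = leave_one_out_accuracy(samples, train_fn, params)
        if score > best_score:
            best_params, best_score = params, score
    # Retrain on all samples with the best parameters (the "optimal model").
    return train_fn(samples, **best_params), best_params, best_score


def toy_train(train: List[Sample], max_seq_len: int, batch_size: int,
              learning_rate: float) -> Predictor:
    """Stand-in for fine-tuning: 1-nearest-neighbour on word overlap.

    Only max_seq_len has an effect (texts are truncated to that many
    characters); batch_size and learning_rate are accepted but unused.
    """
    def words(text: str) -> set:
        return set(text[:max_seq_len].lower().split())

    def predict(text: str) -> str:
        query = words(text)
        return max(train, key=lambda s: len(query & words(s[0])))[1]

    return predict


# Hypothetical labeled blocks for one document type.
samples = [
    ("cough for three days", "chief_complaint"),
    ("cough and fever for two days", "chief_complaint"),
    ("denies history of hypertension", "past_history"),
    ("history of diabetes for ten years", "past_history"),
]
model, best_params, best_score = grid_search(
    samples, toy_train,
    max_lens=[4, 64], batch_sizes=[16, 32], learning_rates=[2e-5, 5e-5],
)
```

Step S304 then amounts to calling `grid_search` once per sub-sample set and storing the M resulting models under their document types. With a real fine-tuned model, `train_fn` would wrap the training routine of the chosen framework, and each call would be far more expensive than in this toy.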
S4, extracting the content of the data sets: extracting the content of the corresponding data set using the model trained in step S3; specifically:
S401, performing dirty-character removal and block splitting on all texts to be analyzed; the block splitting proceeds as follows:
S40101, splitting the document to be analyzed into blocks at periods or carriage returns;
S40102, removing the dirty characters from each sentence, where dirty characters are characters that affect semantic judgment;
S40103, inputting the sentence blocks in order into the sub-model of the corresponding document type to determine the category of each text block;
S40104, recombining the classification results of the document in analysis order, joining the blocks with carriage returns or spaces.
S402, inputting the text blocks into the corresponding text model to classify each block;
S403, combining the blocks in input order, where the combined text content assigned to each data set, excluding the 'other' category, is taken as the data set extraction result of the document.
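The extraction in steps S401 to S403 can be sketched as a short function that splits, classifies and regroups. The classifier is passed in as a plain callable; `toy_classify` below is a hypothetical placeholder that stands in for a trained sub-model so the example runs on its own, and it should not be read as how the trained model decides. The separator set and the label names are likewise assumptions.

```python
import re
from typing import Callable, Dict


def extract_data_groups(text: str, classify: Callable[[str], str],
                        other_label: str = "other",
                        joiner: str = "\n") -> Dict[str, str]:
    """Split a document, classify each block, and regroup by data group.

    Blocks keep their input order inside each group, and blocks classified
    as `other_label` are left out of the result.
    """
    blocks = [b.strip() for b in re.split(r"[。.\r\n]+", text) if b.strip()]
    grouped: Dict[str, list] = {}
    for block in blocks:
        label = classify(block)
        if label != other_label:
            grouped.setdefault(label, []).append(block)
    return {label: joiner.join(parts) for label, parts in grouped.items()}


def toy_classify(block: str) -> str:
    """Placeholder for a trained sub-model (illustration only)."""
    lowered = block.lower()
    if "cough" in lowered or "fever" in lowered:
        return "chief_complaint"
    if "history" in lowered:
        return "past_history"
    return "other"


# Hypothetical document: no section keywords, only content sentences.
document = (
    "Admission Record\n"
    "Cough for three days. Fever since Monday.\n"
    "Denies history of hypertension."
)
result = extract_data_groups(document, toy_classify)
```

Here the title line falls into the 'other' category and is dropped, while the two symptom sentences are joined with a carriage return into one chief complaint entry even though the text contains no 'chief complaint' keyword.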
Example 2:
The electronic medical record data set analysis system based on the ERNIE model of the invention comprises:
a data set determining unit, configured to determine the data sets to be extracted according to the different types of electronic medical records, and then to map or fine-tune the data sets according to the electronic medical record texts of different vendors;
a data set sample extraction and labeling unit, configured to collect and label samples to construct a sample set after the electronic medical record data sets to be extracted from the different types of documents are determined; the data set sample extraction and labeling unit comprises:
a text random extraction module, configured to randomly extract N texts from each type of sample to be analyzed;
a text blocking module, configured to select suitable separators according to the actual text (generally periods or carriage returns; several separators may also be used together) and split the text into blocks;
a dirty character removing module, configured to remove dirty characters from each text, where dirty characters are characters that affect semantic judgment;
a manual labeling module, configured to manually label the blocks according to the data sets determined by the data set determining unit;
a text classification model retraining unit, configured to perform model training separately on each of the M sub-sample sets in the sample set; the text classification model retraining unit comprises:
a combination module, configured to select max_len_num search values for the maximum sequence length, batch_size_num search values for the batch size and learn_rate_num search values for the learning rate, and to combine them into max_len_num × batch_size_num × learn_rate_num parameter groups;
an average recognition rate calculation module, configured to select one parameter combination from the combination module, evaluate the model by leave-one-out cross-validation, and calculate the average recognition rate of the model;
an optimal model output module, configured to invoke the average recognition rate calculation module repeatedly until all parameter groups have been processed, select the group with the highest average recognition rate as the optimal parameters of the model, and output the model trained with the optimal parameters as the optimal model;
a sub-model acquisition module, configured to train each of the M sub-sample sets through the combination module, the average recognition rate calculation module and the optimal model output module to obtain M sub-models;
a data set content extraction unit, configured to extract the content of the corresponding data set using the trained model; the data set content extraction unit comprises:
a dirty character removing and blocking module, configured to perform dirty-character removal and block splitting on all texts to be analyzed;
a block text classification module, configured to input the text blocks into the corresponding text model to classify each block;
and a data set result extraction module, configured to combine the blocks in input order, where the combined text content assigned to each data set, excluding the 'other' category, is taken as the data set extraction result of the document.
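One way to picture how the units fit together at extraction time is a small container that holds one sub-model per document type and routes each document to the matching one. The class below is a hypothetical sketch, not the structure of the claimed system: the class name `ParsingSystem`, its fields and the lambda stand-ins for the blocking module, the dirty character removing module and the trained sub-model are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Classifier = Callable[[str], str]


@dataclass
class ParsingSystem:
    splitter: Callable[[str], List[str]]  # blocking module
    cleaner: Callable[[str], str]         # dirty character removing module
    sub_models: Dict[str, Classifier] = field(default_factory=dict)
    other_label: str = "other"

    def register(self, doc_type: str, model: Classifier) -> None:
        """Store the sub-model trained for one document type."""
        self.sub_models[doc_type] = model

    def parse(self, doc_type: str, text: str) -> Dict[str, str]:
        """Classify each cleaned block with the sub-model of `doc_type`."""
        model = self.sub_models[doc_type]  # KeyError if no sub-model exists
        grouped: Dict[str, List[str]] = {}
        for block in self.splitter(text):
            block = self.cleaner(block)
            if not block:
                continue
            label = model(block)
            if label != self.other_label:
                grouped.setdefault(label, []).append(block)
        return {label: "\n".join(parts) for label, parts in grouped.items()}


# Hypothetical wiring with trivial stand-ins for the real modules.
system = ParsingSystem(splitter=lambda t: t.split("\n"), cleaner=str.strip)
system.register(
    "admission_record",
    lambda block: "chief_complaint" if "cough" in block.lower() else "other",
)
parsed = system.parse("admission_record", "Admission Record\n Cough for three days ")
```

Keeping the splitter, the cleaner and the sub-models as separate, replaceable parts mirrors the division into modules above, and it lets the same cleaning and splitting be reused for training and for extraction.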
Example 3:
The storage medium of the present invention stores a plurality of instructions, the instructions being loaded by a processor to perform the steps of the electronic medical record data set analysis method based on the ERNIE model of Example 1.
Example 4:
the electronic device of the present invention includes:
the storage medium of Example 3; and
a processor configured to execute the instructions in the storage medium of Example 3.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, and that some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.