CN111341404B - Electronic medical record data set analysis method and system based on ernie model - Google Patents

Publication number
CN111341404B
Authority
CN
China
Prior art keywords
data set
model
text
electronic medical
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010118524.6A
Other languages
Chinese (zh)
Other versions
CN111341404A (en)
Inventor
刘文丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Langchao Intelligent Medical Technology Co ltd
Tianjin Health Care Big Data Co ltd
Original Assignee
Shandong Langchao Intelligent Medical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Langchao Intelligent Medical Technology Co ltd
Priority to CN202010118524.6A
Publication of CN111341404A
Application granted
Publication of CN111341404B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses an electronic medical record data set analysis method and system based on an ernie model, belonging to the field of natural language processing. The technical problem to be solved is that extraction of electronic medical record data sets depends on keywords and rules, so that extraction rules must be updated repeatedly and text without keywords cannot be analyzed. The technical scheme is as follows: the method determines the data set to which each sentence belongs according to the meaning of each sentence in the electronic medical record, which removes the dependence on keywords and rules in the analysis process. The method comprises the following steps: S1, determining the text data sets of different types: determining the data sets to be extracted according to the type of electronic medical record; S2, extracting and labeling data set samples: after the data sets to be extracted from each type of document are determined, collecting and labeling samples to construct a sample set; S3, retraining a text classification model based on the ernie pre-training model; S4, extracting the content of the data sets: extracting the content of the corresponding data set using the model trained in step S3.

Description

Electronic medical record data set analysis method and system based on ernie model
Technical Field
The invention relates to the field of natural language processing, in particular to an electronic medical record data set analysis method and system based on an ernie model.
Background
The electronic medical record is a complete and detailed clinical information resource generated and recorded while a person is treated at a medical institution, and it is a main component of current medical data. However, current electronic medical records are mainly in text form and cannot be used directly for analysis and research. Therefore, how to analyze the electronic medical record accurately and effectively, and extract the content of its data sets for analysis and research, is a problem to be solved in medical data governance.
At present, a commonly used method for analyzing a data set is a method for extracting keywords and matching regular expressions, and the method specifically comprises the following steps:
firstly, determining the position of an extracted data set according to keywords in an electronic medical record;
and then, extracting the content of the data group by using a regular expression and other rule matching modes.
For example, to extract the chief complaint from the text of an admission record: first, the position of this data set in the admission record text is located by the keyword "chief complaint"; then the chief complaint content is extracted up to a separator such as a carriage return or a period.
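The keyword-plus-rule approach described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function name `extract_by_keyword`, the English keyword, and the exact regular expression are all assumptions. The second call shows the weakness the patent goes on to describe, where text without the keyword yields nothing.

```python
import re
from typing import Optional

def extract_by_keyword(text: str, keyword: str = "Chief complaint") -> Optional[str]:
    """Locate a keyword, then capture the content up to the first period or line break."""
    # Hypothetical rule: keyword, optional colon (ASCII or full-width), then the content.
    pattern = re.escape(keyword) + r"\s*[:：]?\s*([^\n\r。.]+)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

record = "Chief complaint: cough and fever for three days.\nPresent illness: ..."
with_keyword = extract_by_keyword(record)
without_keyword = extract_by_keyword("Cough and fever for three days.")
print(with_keyword)
print(without_keyword)
```

Every new template or vendor format would need its own keyword and pattern, which is the maintenance burden the method is meant to avoid.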
Although this method can analyze the data sets of an electronic medical record quickly, the electronic medical record is semi-structured content in which a large number of paragraphs are filled in freely, and the electronic medical record templates of different vendors and different hospitals differ. The following problems therefore arise:
(1) when keywords are determined and extraction rules are formulated, they must be written one by one for the medical records of each vendor and each type, so generality is poor;
(2) during extraction, the rules must be refined repeatedly against the actual content, so accuracy is low;
(3) once a system is upgraded or the vendor is replaced, the keywords and extraction rules must be determined again, so generality is poor;
(4) the method cannot handle text in which the keywords are missing, so it depends strongly on keywords.
In summary, medical data governance urgently needs a way to overcome the repeated updating of extraction rules and the inability to analyze text without keywords, both caused by the dependence of data set extraction on keywords and rules, and to reduce the analysis cost effectively.
Disclosure of Invention
The invention aims to provide an electronic medical record data set analysis method and system based on an ernie model, to solve the problems that extraction rules must be updated repeatedly and that text without keywords cannot be analyzed, both of which arise because extraction of electronic medical record data sets depends on keywords and rules, and thereby to reduce the analysis cost effectively.
The technical task of the invention is achieved as follows: an electronic medical record data set analysis method based on an ernie model determines the data set to which each sentence belongs according to the meaning of each sentence in the electronic medical record, which removes the dependence on keywords and rules in the analysis process. The method comprises the following steps:
S1, determining the text data sets of different types: determining the data sets to be extracted according to the type of electronic medical record, and mapping or fine-tuning the data sets according to the electronic medical record text of different vendors;
S2, extracting and labeling data set samples: after the data sets to be extracted from each type of document are determined, collecting and labeling samples to construct a sample set;
S3, retraining a text classification model based on the ernie pre-training model: training a model on each of the M sub-sample sets in the sample set;
S4, extracting the content of the data sets: extracting the content of the corresponding data set using the model trained in step S3.
Preferably, extracting and labeling the data set samples in step S2 is specifically as follows:
S201, randomly extracting N texts from each type of sample to be analyzed;
S202, selecting suitable separators according to the actual text (generally periods or carriage returns; several separators may be used together) and splitting the text into blocks;
S203, removing dirty characters from each text, where dirty characters are characters that affect semantic judgment;
S204, manually labeling the blocks according to the data sets determined in step S1.
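Steps S202 and S203 can be sketched as below. The patent leaves both the separator set and the definition of dirty characters to the implementer, so `SEPARATORS`, `DIRTY_CHARS`, and the function names are hypothetical choices for illustration only.

```python
import re

# Assumed separators: Chinese and ASCII periods plus line breaks.
# Note: splitting on "." also splits decimals such as 38.5; a real
# separator set would need to be chosen with the actual text in mind.
SEPARATORS = r"[。.\n\r]+"

# Assumed "dirty characters": tabs, full-width and non-breaking spaces,
# and a few layout symbols that carry no meaning.
DIRTY_CHARS = re.compile(r"[\t\u3000\u00a0*#_]+")

def split_into_blocks(text, separators=SEPARATORS):
    """S202: split the text into blocks on the chosen separators."""
    return [block for block in re.split(separators, text) if block.strip()]

def remove_dirty_chars(block):
    """S203: replace dirty characters with a space and collapse repeated spaces."""
    cleaned = DIRTY_CHARS.sub(" ", block)
    return re.sub(r" {2,}", " ", cleaned).strip()

def prepare_blocks(text):
    """Split first, then clean each block; blocks that end up empty are dropped."""
    cleaned = (remove_dirty_chars(block) for block in split_into_blocks(text))
    return [block for block in cleaned if block]

sample = "Chief complaint: cough for 3 days.\n\u3000\u3000Denies chest pain。***"
blocks = prepare_blocks(sample)
print(blocks)
```

The resulting blocks are what an annotator would label in step S204.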
Preferably, the sample set is constructed as follows:
(1) Extracting N documents from each type of document respectively;
(2) Determining the category of the data group by combining the actual data to be analyzed;
(3) Manually or by means of a labeling platform, labeling the data groups in the N documents;
(4) The total sample set is composed according to the sample structure of Equation 1 and Equation 2:
$S = \{s_1, s_2, s_3, \ldots, s_M\}$ (Equation 1)
$s_i = \{n_{i1}, n_{i2}, n_{i3}, \ldots, n_{id}, n_{i,d+1}\}$ (Equation 2)
S denotes the total sample set, which is composed of the sub-sample sets $s_i$ of the M types of documents to be analyzed. Each sub-sample set $s_i$ contains d subcategories, where d is the number of data set categories contained in documents of that type. Each sub-sample set is built by extracting and labeling N texts drawn at random during sample collection. However, to remove the model's dependence on keywords, the actual text of each data set is split into blocks when the sample set is formed, so the d subcategories contain unequal numbers of samples, with a minimum of N samples each. In addition, a parsed document usually contains many template sentences that belong to no data set, so an "other" category $n_{i,d+1}$ is added to each sub-sample set to separate such text.
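The structure in Equation 1 and Equation 2 maps naturally onto a nested dictionary: one entry per document type (a sub-sample set), and inside it one list of labeled blocks per data set category plus the extra "other" category. The sketch below assumes a hypothetical input format of `(document_type, label_or_None, block_text)` tuples; the document type and label names are made up for illustration.

```python
from collections import defaultdict

OTHER = "other"  # plays the role of the extra category n_{i,d+1}

def build_sample_set(labeled_blocks):
    """Group labeled blocks into S = {s_1, ..., s_M}, keyed by document type.

    labeled_blocks: iterable of (document_type, data_set_label_or_None, block_text).
    Blocks without a label (template sentences) go to the "other" category.
    """
    sample_set = defaultdict(lambda: defaultdict(list))
    for doc_type, label, block in labeled_blocks:
        sample_set[doc_type][label or OTHER].append(block)
    # Convert to plain dicts so missing keys raise instead of silently appearing.
    return {doc_type: dict(groups) for doc_type, groups in sample_set.items()}

labeled = [
    ("admission_record", "chief_complaint", "Cough for three days"),
    ("admission_record", "chief_complaint", "Fever since yesterday"),
    ("admission_record", None, "Physician signature"),
    ("discharge_summary", "discharge_diagnosis", "Pneumonia"),
]
S = build_sample_set(labeled)
print({doc_type: {k: len(v) for k, v in groups.items()} for doc_type, groups in S.items()})
```

Here M is `len(S)`, and each value of `S` is one sub-sample set used to train one sub-model.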
More preferably, the following points are observed when constructing the sample set:
(1) during sampling, samples are drawn at random from the whole collection, so that the samples are comprehensive;
(2) the complete text of each original data set is split into blocks before being placed in the sample set, so that the model does not depend on keywords and the content of the sample set is as diverse as possible.
Preferably, in step S3, three parameters of the model are tuned while retraining the text classification model based on the ernie pre-training model: the maximum sequence length (Maximum Sequence Length), the batch size (Batch Size) and the learning rate (Learning Rate). The procedure is as follows:
S301, selecting max_len_num maximum sequence length search values, batch_size_num batch size search values and learn_rate_num learning rate search values, and combining them into max_len_num × batch_size_num × learn_rate_num parameter groups;
S302, selecting one parameter group from step S301, evaluating the model by leave-one-out cross-validation, and calculating the average recognition rate of the model;
S303, repeating step S302 until all parameter groups have been processed, selecting the group with the highest average recognition rate as the optimal parameters of the model, and outputting the model trained with the optimal parameters as the optimal model;
S304, training each of the M sub-sample sets through steps S301 to S303 to obtain M sub-models.
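The search in steps S301 to S303 is a grid search scored by leave-one-out cross-validation. The sketch below shows the control flow only, under stated assumptions: "recognition rate" is taken to mean classification accuracy, and `toy_train_fn` is a stand-in nearest-neighbour classifier so the example runs without any deep learning framework. In a real system that function would fine-tune the ernie model with the given parameters; all names here are hypothetical.

```python
from itertools import product

def leave_one_out_accuracy(samples, params, train_fn):
    """S302: hold out each sample once, train on the rest, and average the hits."""
    correct = 0
    for i, (text, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        predict = train_fn(training, params)
        correct += predict(text) == label
    return correct / len(samples)

def grid_search(samples, max_lens, batch_sizes, learning_rates, train_fn):
    """S301 and S303: try every parameter group and keep the best-scoring one."""
    best_params, best_score = None, -1.0
    for max_len, batch_size, lr in product(max_lens, batch_sizes, learning_rates):
        params = {"max_seq_len": max_len, "batch_size": batch_size, "learning_rate": lr}
        score = leave_one_out_accuracy(samples, params, train_fn)
        if score > best_score:
            best_params, best_score = params, score
    # Retrain on all samples with the best parameters to get the final model.
    return best_params, best_score, train_fn(samples, best_params)

def toy_train_fn(training, params):
    """Stand-in for fine-tuning: 1-nearest-neighbour on word overlap.

    Only max_seq_len is used (as a word-count truncation); batch_size and
    learning_rate are ignored by this toy model.
    """
    n = params["max_seq_len"]
    memory = [(set(text.lower().split()[:n]), label) for text, label in training]
    def predict(text):
        words = set(text.lower().split()[:n])
        return max(memory, key=lambda item: len(words & item[0]))[1]
    return predict

samples = [
    ("cough and fever for three days", "chief_complaint"),
    ("cough with sputum for one week", "chief_complaint"),
    ("history of hypertension for ten years", "past_history"),
    ("history of diabetes for five years", "past_history"),
]
best_params, best_score, model = grid_search(
    samples, max_lens=[1, 8], batch_sizes=[16], learning_rates=[5e-5], train_fn=toy_train_fn
)
print(best_params, best_score, model("cough for two days"))
```

Step S304 would call `grid_search` once per sub-sample set, giving one tuned sub-model per document type. Leave-one-out retrains the model once per sample for every parameter group, which is affordable only because the labeled sets are small.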
Preferably, extracting the content of the data sets in step S4 is specifically as follows:
S401, removing dirty characters from all texts to be tested and splitting them into blocks;
S402, inputting the blocks into the corresponding text model to classify each block;
S403, combining the classified blocks in input order: the text collected for each data set is the extraction result of that data set, and the combined results of all categories except the "other" category form the data set extraction result of the document.
More preferably, the block splitting in step S401 proceeds as follows:
S40101, splitting the document to be analyzed into blocks at periods or carriage returns;
S40102, removing dirty characters from each sentence, where dirty characters are characters that affect semantic judgment;
S40103, inputting the resulting sentences in order into the sub-model for the corresponding document type, and determining the category of each text block;
S40104, recombining the classified blocks of each category in their original parsing order, joining them with carriage returns or spaces.
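The classify-and-recombine part of step S4 can be sketched as below. The classifier is passed in as a plain function so the example stays self-contained; `stub_classify` is a hypothetical placeholder that uses simple string checks purely so the code runs, whereas the method itself would call the trained sub-model here.

```python
def extract_data_sets(blocks, classify, other="other", joiner="\n"):
    """Classify each block, drop the "other" category, and join per data set.

    blocks must already be in document order, so each data set's text is
    reassembled in the order it appeared.
    """
    grouped = {}
    for block in blocks:
        label = classify(block)
        if label == other:
            continue  # template sentences are not part of any data set
        grouped.setdefault(label, []).append(block)
    return {label: joiner.join(parts) for label, parts in grouped.items()}

def stub_classify(block):
    """Placeholder for the trained sub-model (assumption, for illustration only)."""
    text = block.lower()
    if "cough" in text or "fever" in text:
        return "chief_complaint"
    if "history" in text:
        return "past_history"
    return "other"

blocks = [
    "Cough for three days",
    "Physician signature",
    "Fever up to 39 degrees",
    "History of hypertension",
]
result = extract_data_sets(blocks, stub_classify)
print(result)
```

Because blocks are grouped by predicted label rather than by position, sentences of one data set that are separated by template text still end up together.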
An electronic medical record data set analysis system based on an ernie model comprises:
a data set determining unit, used for determining the data sets to be extracted according to the type of electronic medical record, and then mapping or fine-tuning the data sets according to the electronic medical record text of different vendors;
a data set sample extraction and labeling unit, used for collecting and labeling samples to construct a sample set after the data sets to be extracted from each type of document are determined; the data set sample extraction and labeling unit comprises:
a text random extraction module, used for randomly extracting N texts from each type of sample to be analyzed;
a text blocking module, used for selecting suitable separators according to the actual text (generally periods or carriage returns; several separators may be used together) and splitting the text into blocks;
a dirty character removing module, used for removing dirty characters from each text, where dirty characters are characters that affect semantic judgment;
a manual labeling module, used for manually labeling the blocks according to the data sets determined by the data set determining unit;
a text classification model retraining unit, used for training a model on each of the M sub-sample sets in the sample set; the text classification model retraining unit comprises:
a combination module, used for selecting max_len_num maximum sequence length search values, batch_size_num batch size search values and learn_rate_num learning rate search values, and combining them into max_len_num × batch_size_num × learn_rate_num parameter groups;
an average recognition rate calculation module, used for selecting one parameter group from the combination module, evaluating the model by leave-one-out cross-validation, and calculating the average recognition rate of the model;
an optimal model output module, used for repeatedly invoking the average recognition rate calculation module until all parameter groups have been processed, selecting the group with the highest average recognition rate as the optimal parameters of the model, and outputting the model trained with the optimal parameters as the optimal model;
a sub-model acquisition module, used for training each of the M sub-sample sets through the combination module, the average recognition rate calculation module and the optimal model output module to obtain M sub-models;
a data set content extraction unit, used for extracting the content of the corresponding data set using the trained model; the data set content extraction unit comprises:
a dirty character removing and blocking module, used for removing dirty characters from all texts to be tested and splitting them into blocks;
a block text classification module, used for inputting the blocks into the corresponding text model to classify each block;
a data set result extraction module, used for combining the classified blocks in input order, where the text collected for each data set is the extraction result of that data set, and the combined results of all categories except the "other" category form the data set extraction result of the document.
A storage medium having stored therein a plurality of instructions for loading by a processor for performing the steps of the electronic medical record data set parsing method described above based on the ernie model.
An electronic device, the electronic device comprising:
the storage medium described above; and
and a processor for executing the instructions in the storage medium.
The electronic medical record data set analysis method and system based on the ernie model have the following advantages:
(1) the invention determines the data set according to the meaning of each sentence in the electronic medical record, which removes the dependence on keywords and rules in the analysis process, solves the problems that rules must be updated repeatedly and that text without keywords cannot be analyzed, and reduces the analysis cost;
(2) the invention removes the excessive dependence on keywords and rules when analyzing electronic medical record data sets and saves the time spent tuning and updating keywords; compared with traditional electronic medical record data set analysis methods, it is more general;
(3) the invention uses a semantic analysis technique, namely a text classification model built on the ernie model in the PaddlePaddle framework, so that each block of text in the electronic medical record is assigned to a data set without relying on keywords;
(4) when the text is split into blocks, the separators are customized according to the actual text content, which meets the splitting needs of different texts and safeguards accuracy.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of an electronic medical record data set analysis method based on an ernie model;
FIG. 2 is a block diagram of an electronic medical record data set analysis system based on an ernie model.
Detailed Description
The invention relates to an electronic medical record data set analysis method and system based on an ernie model, which are described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1:
As shown in FIG. 1, the electronic medical record data set analysis method based on an ernie model determines the data set to which each sentence belongs according to the meaning of each sentence in the electronic medical record, which removes the dependence on keywords and rules in the analysis process. The method comprises the following steps:
S1, determining the text data sets of different types: according to the description of electronic medical record data groups and data elements, a data set (Data Group) is a composite data structure formed by aggregating related information items. Different types of electronic medical record text contain different data sets, and the content of the data sets differs slightly between the electronic medical record texts of different vendors and hospitals. The data sets are therefore determined as follows:
S101, determining the data sets to be extracted according to the type of electronic medical record;
S102, mapping or fine-tuning the data sets according to the electronic medical record text of different vendors;
S2, extracting and labeling data set samples: after the data sets to be extracted from each type of document are determined, collecting and labeling samples to construct a sample set. The procedure is as follows:
S201, randomly extracting N texts from each type of sample to be analyzed;
S202, selecting suitable separators according to the actual text (generally periods or carriage returns; several separators may be used together) and splitting the text into blocks;
S203, removing dirty characters from each text, where dirty characters are characters that affect semantic judgment;
S204, manually labeling the blocks according to the data sets determined in step S1.
The sample set is constructed as follows:
(1) Extracting N documents from each type of document respectively;
(2) Determining the category of the data group by combining the actual data to be analyzed;
(3) Manually or by means of a labeling platform, labeling the data groups in the N documents;
(4) The total sample set is composed according to the sample structure of Equation 1 and Equation 2:
$S = \{s_1, s_2, s_3, \ldots, s_M\}$ (Equation 1)
$s_i = \{n_{i1}, n_{i2}, n_{i3}, \ldots, n_{id}, n_{i,d+1}\}$ (Equation 2)
S denotes the total sample set, which is composed of the sub-sample sets $s_i$ of the M types of documents to be analyzed. Each sub-sample set $s_i$ contains d subcategories, where d is the number of data set categories contained in documents of that type. Each sub-sample set is built by extracting and labeling N texts drawn at random during sample collection. However, to remove the model's dependence on keywords, the actual text of each data set is split into blocks when the sample set is formed, so the d subcategories contain unequal numbers of samples, with a minimum of N samples each. In addition, a parsed document usually contains many template sentences that belong to no data set, so an "other" category $n_{i,d+1}$ is added to each sub-sample set to separate such text.
The following points are observed when constructing the sample set:
(1) during sampling, samples are drawn at random from the whole collection, so that the samples are comprehensive;
(2) the complete text of each original data set is split into blocks before being placed in the sample set, so that the model does not depend on keywords and the content of the sample set is as diverse as possible.
S3, retraining a text classification model based on the ernie pre-training model: the ernie pre-training model is the most typical semantic model in PaddleNLP and has been pre-trained on multiple NLP tasks. It can therefore be trained on small samples, and its preprocessing is simple. Because the initial samples are all manually labeled and the sample size is small, the ernie model, with its strong semantic capability, is selected as the pre-training model for text classification. A model is trained on each of the M sub-sample sets in the sample set. During retraining, three parameters of the model are tuned: the maximum sequence length (Maximum Sequence Length), the batch size (Batch Size) and the learning rate (Learning Rate). The procedure is as follows:
S301, selecting max_len_num maximum sequence length search values, batch_size_num batch size search values and learn_rate_num learning rate search values, and combining them into max_len_num × batch_size_num × learn_rate_num parameter groups;
S302, selecting one parameter group from step S301, evaluating the model by leave-one-out cross-validation, and calculating the average recognition rate of the model;
S303, repeating step S302 until all parameter groups have been processed, selecting the group with the highest average recognition rate as the optimal parameters of the model, and outputting the model trained with the optimal parameters as the optimal model;
S304, training each of the M sub-sample sets through steps S301 to S303 to obtain M sub-models.
S4, extracting the content of the data sets: extracting the content of the corresponding data set using the model trained in step S3. The procedure is as follows:
S401, removing dirty characters from all texts to be tested and splitting them into blocks; the block splitting proceeds as follows:
S40101, splitting the document to be analyzed into blocks at periods or carriage returns;
S40102, removing dirty characters from each sentence, where dirty characters are characters that affect semantic judgment;
S40103, inputting the resulting sentences in order into the sub-model for the corresponding document type, and determining the category of each text block;
S40104, recombining the classified blocks of each category in their original parsing order, joining them with carriage returns or spaces.
S402, inputting the blocks into the corresponding text model to classify each block;
S403, combining the classified blocks in input order: the text collected for each data set is the extraction result of that data set, and the combined results of all categories except the "other" category form the data set extraction result of the document.
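The parameter selection in steps S301 to S303 can also be written compactly. The following is one reading, assuming that "recognition rate" means classification accuracy; the symbols $\Theta$, $\theta$, $f$, $n$, $x_j$ and $y_j$ do not appear in the patent and are introduced here only for illustration. $\Theta$ is the set of parameter groups, $n$ is the number of labeled samples in one sub-sample set, and $f_{\theta}^{(-j)}$ is the model trained with parameters $\theta$ on all samples except the $j$-th.

```latex
\begin{aligned}
|\Theta| &= \text{max\_len\_num} \times \text{batch\_size\_num} \times \text{learn\_rate\_num} \\
\bar{r}(\theta) &= \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\!\left[ f_{\theta}^{(-j)}(x_j) = y_j \right] \\
\theta^{*} &= \arg\max_{\theta \in \Theta} \bar{r}(\theta)
\end{aligned}
```

The optimal model for a sub-sample set is then the one trained with $\theta^{*}$, and repeating this for each of the M sub-sample sets gives the M sub-models.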
Example 2:
the invention relates to an electronic medical record data set analysis system based on an ernie model, which comprises,
the data set determining unit is used for determining and extracting data sets according to different types of electronic medical records and then carrying out data set mapping or fine adjustment according to the text conditions of the electronic medical records of different factories;
the data set sample extraction and marking unit is used for collecting and marking samples to construct a sample set after determining electronic medical record data sets to be extracted of different types of documents; the data set sample extraction and marking unit comprises,
the text random extraction module is used for randomly extracting N texts from various samples to be analyzed respectively;
the text block module is used for selecting reasonable separators (generally periods or carriage returns or used by combining multiple separators) to block the text according to the actual text condition;
the dirty character removing module is used for removing dirty characters in each text, wherein the dirty characters are characters affecting semantic judgment;
the manual labeling module is used for manually labeling the data set determined in the data set determining module;
the text classification model retraining unit is used for respectively carrying out model training on M sub-sample sets in the sample set; the text classification model is included in the training unit,
the combination module is used for selecting the maximum sequence length search values max_len_num, the batch size search values batch_size_num and the learning rate search values learn_rate_num, and combining the maximum sequence length search values max_len_num, the batch size search values batch_size_num and the learning rate search values learn_rate_num into groups of max_len_num;
the average recognition rate calculation module is used for selecting a parameter combination from the combination module, adopting a leave-one-out method cross verification model, and calculating the average recognition rate of the model;
the optimal model output module is used for circulating the average recognition rate calculation module until all the groups of parameters are processed, selecting a group of parameters with the highest average recognition rate as optimal parameters of the model, and outputting the model trained by the optimal parameters as an optimal model;
the sub-model acquisition module is used for training the M sub-sample sets through the combination module, the average recognition rate calculation module and the optimal model output module respectively to obtain M sub-models;
a data set content extraction unit that extracts the content of the corresponding data set using the trained model; the data set content extraction unit comprises a data set extraction unit,
the dirty character removing and blocking processing module is used for carrying out dirty character removing and blocking processing on all texts to be detected;
the block text classification module is used for inputting the block texts into the corresponding text model to classify each block of text;
and the data set result extraction module is used for combining the classified blocks according to the input sequence, wherein the text content assigned to each data set, excluding the "other" category, is used as the data set extraction result of the document.
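The combination module, the average recognition rate calculation module and the optimal model output module described above can be sketched roughly as follows. This is only an illustrative outline under stated assumptions: the function names, `toy_train`, `toy_predict` and the sample data are hypothetical stand-ins, and a real system would replace the toy functions with retraining and inference of the ernie-based text classification model.

```python
from itertools import product
from statistics import mean


def build_parameter_groups(max_lens, batch_sizes, learn_rates):
    """Combine the search values into every (max_len, batch_size, learn_rate) group."""
    return list(product(max_lens, batch_sizes, learn_rates))


def leave_one_out_rate(samples, params, train_fn, predict_fn):
    """Average recognition rate: hold out each sample once and train on the rest."""
    hits = []
    for i, (text, label) in enumerate(samples):
        training_part = samples[:i] + samples[i + 1:]
        model = train_fn(training_part, params)
        hits.append(1.0 if predict_fn(model, text) == label else 0.0)
    return mean(hits)


def select_optimal(samples, groups, train_fn, predict_fn):
    """Return the group with the highest average rate and a model trained with it."""
    best = max(
        groups,
        key=lambda g: leave_one_out_rate(samples, g, train_fn, predict_fn),
    )
    return best, train_fn(samples, best)


# Hypothetical stand-ins for model training and prediction (NOT the ernie model):
# the "model" only remembers the first word of each truncated training text.
def toy_train(samples, params):
    max_len, _batch_size, _learn_rate = params
    table = {text[:max_len].split()[0]: label for text, label in samples}
    return max_len, table


def toy_predict(model, text):
    max_len, table = model
    return table.get(text[:max_len].split()[0], "other")


# Hypothetical labeled blocks for one sub-sample set.
TOY_SAMPLES = [
    ("cough for three days", "chief_complaint"),
    ("cough worse at night", "chief_complaint"),
    ("history of diabetes", "past_history"),
    ("history of hypertension", "past_history"),
]

groups = build_parameter_groups([128, 256], [16, 32], [2e-5, 5e-5])
best_params, best_model = select_optimal(TOY_SAMPLES, groups, toy_train, toy_predict)
```

Running the same selection once per sub-sample set would yield the M sub-models. Note that leave-one-out validation trains one model per held-out sample for every parameter group, so its cost grows quickly with the size of the sample set.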
Example 3:
the storage medium of the present invention has a plurality of instructions stored therein, the instructions being loaded by a processor, to perform the steps of the electronic medical record data set parsing method based on the ernie model of embodiment 1.
Example 4:
the electronic device of the present invention includes:
a storage medium based on embodiment 3; and
a processor configured to execute the instructions in the storage medium of embodiment 3.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (9)

1. An electronic medical record data set analysis method based on an ernie model is characterized in that the method is used for judging the data set according to the meaning of each sentence in the electronic medical record and overcoming the dependence on keywords and rules in the analysis process of the electronic medical record; the method comprises the following steps:
s1, determining different types of text data sets: determining the data sets to be extracted according to the different types of electronic medical records, and mapping or fine-tuning the data sets according to the text conditions of the electronic medical records of different vendors;
s2, extracting and marking a data set sample: after determining electronic medical record data sets to be extracted of different types of documents, collecting and labeling samples to construct a sample set;
s3, retraining a text classification model based on an ernie pre-training model: performing model training on each of the M sub-sample sets in the sample set; in the process of retraining the text classification model based on the ernie pre-training model, three parameters of the model are tuned, namely the maximum sequence length, the batch size and the learning rate; the specific steps are as follows:
s301, selecting max_len_num maximum sequence length search values, batch_size_num batch size search values and learn_rate_num learning rate search values, and combining them into max_len_num × batch_size_num × learn_rate_num groups of parameters;
s302, selecting one parameter combination from step S301, validating the model by leave-one-out cross-validation, and calculating the average recognition rate of the model;
s303, repeating step S302 until all groups of parameters have been processed, selecting the group of parameters with the highest average recognition rate as the optimal parameters of the model, and outputting the model trained with the optimal parameters as the optimal model;
s304, training M sub-sample sets through the steps S301 to S303 respectively to obtain M sub-models;
s4, extracting the content of the data set: the content of the corresponding data set is extracted using the model trained in step S3.
2. The method for analyzing an electronic medical record data set based on an ernie model according to claim 1, wherein the extracting and marking the data set sample in the step S2 is specifically as follows:
s201, randomly extracting N texts from various samples to be analyzed respectively;
s202, selecting reasonable separators to perform text blocking according to actual text conditions;
s203, removing dirty characters in each text, wherein the dirty characters refer to characters affecting semantic judgment;
s204, manually labeling according to the data set determined in the step S1.
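The blocking and dirty-character removal of steps S202 and S203 might look like the sketch below. The claim does not enumerate the separators or the dirty characters, so both regular expressions are assumptions made for illustration: blocks are split on the Chinese full stop and line breaks, and zero-width characters, control characters and redundant whitespace are treated as dirty. The function names are hypothetical.

```python
import re

# Assumed separators: Chinese full stop, carriage return, line feed.
# An ASCII "." is deliberately left out so that values such as "38.5" stay intact.
SEPARATORS = re.compile(r"[。\r\n]+")

# Assumed dirty characters: zero-width space, byte-order mark, control characters.
DIRTY = re.compile(r"[\u200b\ufeff\x00-\x08\x0b\x0c\x0e-\x1f]")


def remove_dirty_characters(block):
    """Drop assumed dirty characters and collapse runs of whitespace."""
    cleaned = DIRTY.sub("", block)
    return re.sub(r"\s+", " ", cleaned).strip()


def preprocess(text):
    """Split a document into blocks, clean each block, and drop empty blocks."""
    blocks = (remove_dirty_characters(part) for part in SEPARATORS.split(text))
    return [block for block in blocks if block]


sample_blocks = preprocess("主诉：咳嗽3天。\u200b既往史：  糖尿病\n")
```

The resulting blocks are what would then be labeled manually in step S204.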
3. The method for analyzing an electronic medical record data set based on an ernie model according to claim 1, wherein constructing the sample set is specifically as follows:
(1) Extracting N documents from each type of document respectively;
(2) Determining the categories of the data sets by combining the actual data to be analyzed;
(3) Labeling the data sets in the N documents, either manually or by means of a labeling platform;
(4) Composing the total sample set according to the sample model structure of formula 1 and formula 2, specifically:
Figure QLYQS_1
wherein S denotes the total sample set, which is composed of the sub-sample sets s of the M types of documents to be analyzed; each sub-sample set s contains d sub-categories, d being the number of data set categories contained in the i-th type of document; an "other" class
Figure QLYQS_2
is added to each sub-sample set to distinguish text that does not belong to any of the data sets.
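As a rough illustration of the sample structure described in this claim, the total sample set can be held as a nested mapping: one sub-sample set per document type, each with its own data set categories plus the added "other" class. The names below (`build_total_sample_set`, the document types and the category labels) are hypothetical, and routing unknown labels into "other" is a simplifying assumption rather than something the claim specifies.

```python
OTHER = "other"


def build_total_sample_set(labeled_blocks, categories_by_type):
    """Group labeled text blocks into one sub-sample set per document type.

    labeled_blocks: iterable of (document_type, label, text) tuples.
    categories_by_type: mapping document_type -> list of data set categories.
    """
    # Every sub-sample set gets its own categories plus the "other" class.
    total = {
        doc_type: {category: [] for category in [*categories, OTHER]}
        for doc_type, categories in categories_by_type.items()
    }
    for doc_type, label, text in labeled_blocks:
        sub_sample_set = total[doc_type]
        # Assumption: any label outside the declared categories counts as "other".
        key = label if label in sub_sample_set else OTHER
        sub_sample_set[key].append(text)
    return total


# Hypothetical document types and categories, for illustration only.
categories = {
    "admission_record": ["chief_complaint", "past_history"],
    "discharge_summary": ["discharge_diagnosis"],
}
blocks = [
    ("admission_record", "chief_complaint", "cough for three days"),
    ("admission_record", "signature", "signed by the attending physician"),
    ("discharge_summary", "discharge_diagnosis", "pneumonia"),
]
total_sample_set = build_total_sample_set(blocks, categories)
```

Here the number of keys in `total_sample_set` corresponds to M, and each sub-sample set holds d + 1 classes once "other" is included.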
4. The method for analyzing an electronic medical record data set based on an ernie model according to claim 3, wherein the following is taken into consideration when constructing the sample set:
(1) in the sampling process, the whole sample set is randomly sampled, so that the comprehensiveness of the samples is ensured;
(2) the complete text of the original data set is partitioned into blocks before being put into the sample set, so that the model is freed from dependence on keywords and the diversity of the content of the sample set is ensured as far as possible.
5. The method for analyzing the electronic medical record data set based on the ernie model according to claim 1, wherein extracting the content of the data set in step S4 is specifically as follows:
s401, performing dirty character removal and block division processing on all texts to be tested;
s402, inputting the segmented text into a corresponding text model to classify each text;
s403, combining the classified text blocks according to the input sequence, wherein the text content assigned to each data set is used as the data set extraction result of the document.
6. The electronic medical record data set analysis method based on the ernie model according to claim 5, wherein the specific process of the partitioning in step S401 is as follows:
s40101, performing blocking processing on the document to be analyzed by using a period or a carriage return character;
s40102, removing dirty characters of each sentence, wherein the dirty characters are characters affecting semantic judgment;
s40103, inputting the blocked sentences, in sequence, into the sub-model of the corresponding document type, and judging the category of each text block;
s40104, recombining the classification results of each document category according to the analysis sequence, the combined text being joined by carriage returns or spaces.
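The classification and recombination of steps S40103 and S40104 can be sketched as below. The sub-model is represented by an arbitrary `classify` callable; `toy_classify` is a hypothetical keyword-based stand-in used only to make the example runnable and does not represent the trained ernie sub-model, which is intended to avoid keyword dependence. Dropping the "other" category follows the description of the data set result extraction module.

```python
def extract_data_sets(blocks, classify, joiner="\n", other_label="other"):
    """Classify each block, then join the blocks of each category in input order."""
    grouped = {}
    for block in blocks:
        label = classify(block)
        if label == other_label:
            continue  # text outside every data set is not part of the result
        grouped.setdefault(label, []).append(block)
    return {label: joiner.join(parts) for label, parts in grouped.items()}


# Hypothetical stand-in for the sub-model of one document type.
def toy_classify(block):
    if block.startswith("cough"):
        return "chief_complaint"
    if block.startswith("history"):
        return "past_history"
    return "other"


result = extract_data_sets(
    [
        "cough for three days",
        "seen at the front desk",
        "history of diabetes",
        "cough worse at night",
    ],
    toy_classify,
)
```

Passing `joiner=" "` instead would join the blocks of each data set with spaces rather than line breaks.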
7. An electronic medical record data set analysis system based on an ernie model is characterized in that the system comprises,
the data set determining unit is used for determining the data sets to be extracted according to the different types of electronic medical records and then carrying out data set mapping or fine-tuning according to the text conditions of the electronic medical records of different vendors;
the data set sample extraction and marking unit is used for collecting and marking samples to construct a sample set after determining electronic medical record data sets to be extracted of different types of documents; the data set sample extraction and marking unit comprises,
the text random extraction module is used for randomly extracting N texts from various samples to be analyzed respectively;
the text block module is used for selecting reasonable separators to block the text according to the actual text condition;
the dirty character removing module is used for removing dirty characters in each text, wherein the dirty characters are characters affecting semantic judgment;
the manual labeling module is used for manually labeling the data set determined in the data set determining module;
the text classification model retraining unit is used for performing model training on each of the M sub-sample sets in the sample set; the text classification model retraining unit comprises,
the combination module is used for selecting max_len_num maximum sequence length search values, batch_size_num batch size search values and learn_rate_num learning rate search values, and combining them into max_len_num × batch_size_num × learn_rate_num groups of parameters;
the average recognition rate calculation module is used for selecting one parameter combination from the combination module, validating the model by leave-one-out cross-validation, and calculating the average recognition rate of the model;
the optimal model output module is used for repeating the average recognition rate calculation module until all groups of parameters have been processed, selecting the group of parameters with the highest average recognition rate as the optimal parameters of the model, and outputting the model trained with the optimal parameters as the optimal model;
the sub-model acquisition module is used for training the M sub-sample sets through the combination module, the average recognition rate calculation module and the optimal model output module respectively to obtain M sub-models;
a data set content extraction unit that extracts the content of the corresponding data set using the trained model; the data set content extraction unit comprises,
the dirty character removing and blocking processing module is used for carrying out dirty character removal and blocking processing on all texts to be tested;
the block text classification module is used for inputting the blocked texts into the corresponding text model to classify each block of text;
and the data set result extraction module is used for combining the classified blocks according to the input sequence, wherein the text content assigned to each data set is used as the data set extraction result of the document.
8. A storage medium having stored therein a plurality of instructions, the instructions being loaded by a processor to perform the steps of the ernie model-based electronic medical record data set parsing method of any of claims 1-6.
9. An electronic device, the electronic device comprising:
the storage medium of claim 8; and
a processor for executing the instructions in the storage medium.
CN202010118524.6A 2020-02-26 2020-02-26 Electronic medical record data set analysis method and system based on ernie model Active CN111341404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010118524.6A CN111341404B (en) 2020-02-26 2020-02-26 Electronic medical record data set analysis method and system based on ernie model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010118524.6A CN111341404B (en) 2020-02-26 2020-02-26 Electronic medical record data set analysis method and system based on ernie model

Publications (2)

Publication Number Publication Date
CN111341404A (en) 2020-06-26
CN111341404B (en) 2023-07-14

Family

ID=71183709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010118524.6A Active CN111341404B (en) 2020-02-26 2020-02-26 Electronic medical record data set analysis method and system based on ernie model

Country Status (1)

Country Link
CN (1) CN111341404B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488126A (en) * 2021-07-27 2021-10-08 心医国际数字医疗系统(大连)有限公司 Information processing method, information processing device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110309267A (en) * 2019-07-08 2019-10-08 哈尔滨工业大学 Semantic retrieving method and system based on pre-training model
CN110517788A (en) * 2019-08-30 2019-11-29 山东健康医疗大数据有限公司 A kind of method of Chinese electronic health record information extraction
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10133847B2 (en) * 2014-06-10 2018-11-20 International Business Machines Corporation Automated medical problem list generation from electronic medical record

Also Published As

Publication number Publication date
CN111341404A (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN111898366B (en) Document subject word aggregation method and device, computer equipment and readable storage medium
CN106886580B (en) Image emotion polarity analysis method based on deep learning
CN106095753B (en) A kind of financial field term recognition methods based on comentropy and term confidence level
CN107506389B (en) Method and device for extracting job skill requirements
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN111581956B (en) Sensitive information identification method and system based on BERT model and K nearest neighbor
CN112307741B (en) Insurance industry document intelligent analysis method and device
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN113486189A (en) Open knowledge graph mining method and system
CN111597356A (en) Intelligent education knowledge map construction system and method
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN111310467A (en) Topic extraction method and system combining semantic inference in long text
CN111341404B (en) Electronic medical record data set analysis method and system based on ernie model
CN103034657B (en) Documentation summary generates method and apparatus
CN111859955A (en) Public opinion data analysis model based on deep learning
EP3640861A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN108733733B (en) Biomedical text classification method, system and storage medium based on machine learning
CN113761104A (en) Method and device for detecting entity relationship in knowledge graph and electronic equipment
CN115481240A (en) Data asset quality detection method and detection device
CN114117057A (en) Keyword extraction method of product feedback information and terminal equipment
CN113722421A (en) Contract auditing method and system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230619

Address after: 250100 room 3108, 31 / F, building S02, Langchao Science Park, No. 1036 Langchao Road, Jinan area, China (Shandong) pilot Free Trade Zone, Jinan, Shandong

Applicant after: Shandong Langchao Intelligent Medical Technology Co.,Ltd.

Address before: Room 215, east block, Xiyuan building, intersection of Shun'an Road, Yantai Road, Huaiyin District, Jinan City, Shandong Province

Applicant before: SHANDONG HEALTH MEDICAL BIG DATA Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240531

Address after: 250100 room 3108, 31 / F, building S02, Langchao Science Park, No. 1036 Langchao Road, Jinan area, China (Shandong) pilot Free Trade Zone, Jinan, Shandong

Patentee after: Shandong Langchao Intelligent Medical Technology Co.,Ltd.

Country or region after: China

Patentee after: Tianjin health care big data Co.,Ltd.

Address before: 250100 room 3108, 31 / F, building S02, Langchao Science Park, No. 1036 Langchao Road, Jinan area, China (Shandong) pilot Free Trade Zone, Jinan, Shandong

Patentee before: Shandong Langchao Intelligent Medical Technology Co.,Ltd.

Country or region before: China
