CN116721779A - Medical data preprocessing method and system - Google Patents
- Publication number
- CN116721779A
- Authority
- CN
- China
- Prior art keywords
- medical
- medical data
- label
- data
- word segmentation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of data processing and discloses a medical data preprocessing method and system. The medical data preprocessing method comprises the following steps: removing irrelevant symbols from the medical data and performing error correction on the text data in the medical data; segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result; constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data; and labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain completely labeled medical data. The invention automatically cleans, extracts and labels medical data so that the data is processed into the format and content required for medical AI model training, and uses group intelligence technology to optimize the data labeling workflow.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a medical data preprocessing method and system.
Background
With the development of artificial intelligence technology, and especially the technological breakthroughs in general artificial intelligence, people have begun to use medical artificial intelligence models to analyze and process medical data, for example to provide intelligent inquiry services for patients, automatic analysis of drug mechanisms of action for pharmaceutical companies, and claim settlement robots serving different patients for insurance companies. However, training current medical artificial intelligence models requires a significant amount of time and resources, because:
1. Medical data contains many input errors, including text input errors in medical text data (for example, in "the patient has a history of diabetes and has been treated for high blood", "hypertension" was wrongly entered as "high blood") and non-uniform data formats (for example, an operation date recorded in various formats such as "January 1, 2000", "2000.01.01" and "01/01/2020"). Conventional computer algorithms handle such problems poorly, so a statistical analyst has to analyze and correct each medical data set and manually design rules for the new types of error that may occur in it. This consumes a great deal of manpower and time and slows project development and progress.
2. Medical data contains a large amount of specialized data, and the training data required by different medical application projects differs greatly in format (for example, a medical keyword extraction model requires labels of the form 'medical noun + type', while disease judgment requires labels of the form 'disease description + negative/positive'). The data therefore cannot be used directly to train a large language model; it must be separated and given medical labels manually. Labeling for medical projects usually also requires some medical knowledge, so qualified data preprocessing personnel are harder to recruit than for a general AI project.
3. The manpower required for data labeling varies greatly over the project cycle, which makes it difficult to allocate personnel reasonably. In the early stage of AI model training, a large amount of medical data needs to be labeled, so a large amount of manpower must be allocated. In the later stage of the project, the work consists mainly of fine-tuning data labels and needs only a small amount of manpower.
4. Because the volume of medical data is very large and labeling requires medical expertise, and because annotators differ in their command of that expertise and in their working state, processing errors and low-quality results are difficult to avoid when data is labeled manually, and these affect the quality of the final medical AI model.
Disclosure of Invention
The present invention aims to solve the above technical problems at least to some extent. Therefore, the invention aims to provide a medical data preprocessing method and a medical data preprocessing system.
The technical scheme adopted by the invention is as follows:
A medical data preprocessing method, comprising the following steps:
S1, removing irrelevant symbols from the medical data, and performing error correction on the text data in the medical data;
S2, segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain completely labeled medical data.
Preferably, the step S4 includes:
S41, shuffling the unlabeled fields in the partially labeled medical data, then packaging and distributing them for manual labeling;
S42, acquiring the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.
Preferably, the correction of the label result is achieved by:
performing intra-group competition, in which the staff in a group who process the same piece of data vote, the majority wins and the minority is regarded as erroneous; as results accumulate over time, counting the working accuracy of each staff member, and performing weighted voting on the label results according to the reliability of each staff member;
or, comparing the output of a label correction model with the original labels, and removing labels with an excessively high error rate;
or, setting a plurality of secondary word segmenters, obtaining label results by using the secondary word segmenters, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model with the existing results, comparing the training results with the existing labels, and removing labels with an excessively high error rate;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels with an excessively high error rate.
Preferably, the step S4 includes a step S43: the corrected label results are used to retrain a label correction model, an advanced deep learning model, and an unsupervised training model.
Preferably, the secondary word segmenter is obtained by training on the word segmentation results produced by the medical word segmentation models.
Preferably, the step S3 further includes: punctuation marks in medical data are labeled with separator labels.
Preferably, the step S1 is preceded by a step S0: translating foreign-language medical data into Chinese medical data.
A medical data preprocessing system, comprising:
a data cleaning module, used for removing irrelevant symbols from the medical data and performing error correction on the text data in the medical data;
a word segmentation module, used for segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
a label generation module, used for constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
and a group intelligence module, used for labeling the unlabeled fields in the medical data obtained by the label generation module to obtain completely labeled medical data.
Preferably, the group intelligence module is used for shuffling the unlabeled fields in the partially labeled medical data, and packaging and distributing them for manual labeling; and for acquiring the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.
Preferably, the correction of the label result is achieved by:
performing intra-group competition, in which the staff in a group who process the same piece of data vote, the majority wins and the minority is regarded as erroneous; as results accumulate over time, counting the working accuracy of each staff member, and performing weighted voting on the label results according to the reliability of each staff member;
or, comparing the output of a label correction model with the original labels, and removing labels with an excessively high error rate;
or, setting a plurality of secondary word segmenters, obtaining label results by using the secondary word segmenters, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model with the existing results, comparing the training results with the existing labels, and removing labels with an excessively high error rate;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels with an excessively high error rate.
The beneficial effects of the invention are as follows:
The medical data preprocessing method provided by the invention can automatically complete simple tasks such as data translation, data cleaning and automatic spell checking, and uses the medical knowledge graph for semi-automatic labeling, so that nearly 50% of the data labeling work can be completed automatically; it reduces the proportion of low-quality samples, removes the difficulty of labeling across countries and languages, handles entry errors in medical data records, and shortens the preprocessing time of medical data by more than 70%;
ensemble learning is used to complete the word segmentation task for medical data, so that existing open-source medical word segmentation models can be fully utilized, and the final word segmentation accuracy is ensured by training the secondary word segmenter;
the medical knowledge graph is used for semi-automatic labeling of medical data, so that labeling can be completed automatically for approximately 50% of the data, reducing the manual labeling workload;
the final labeling of medical data is carried out with the group intelligence model, which fully considers the differences between people and the structural properties of the data, so that the final labeling result can be improved without new human resource input. This is important for training the final deep learning model (training on a moderate number of high-quality labels gives better results than training on a large number of poor-quality labels).
Drawings
Fig. 1 is a flow chart of a medical data preprocessing method of the present invention.
Fig. 2 is a flow chart of step S2 of the present invention.
Fig. 3 is a schematic diagram of the voting-based group intelligence of the present invention.
Fig. 4 is a schematic diagram of the group intelligence of the present invention based on voting and employee reliability.
Fig. 5 is a schematic diagram of the group intelligence of the present invention based on task features.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
It should also be appreciated that, in some embodiments, the functions/acts may occur in an order different from that shown in the figures. For example, two steps shown in succession may in fact be executed substantially concurrently or sometimes in the reverse order, depending on the functionality/acts involved.
As shown in fig. 1 to 5, a medical data preprocessing method of the present embodiment includes the following steps:
S0, acquiring medical data and, if the medical data is in a foreign language (such as English), translating the foreign-language medical data into Chinese medical data by using a translation program; the translated medical data includes text data and non-text data.
S1, data cleaning: removing irrelevant symbols (such as "1 a"), blank characters and the like from the medical data; in addition, to address entry errors in the text data, performing automatic spell checking and error correction on the text data by using an automatic error correction model.
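The symbol-removal part of step S1 can be sketched as follows. This is a minimal illustration only: the regular expression for list markers such as "1 a" and the function name `clean_text` are assumptions rather than the patent's rules, and the automatic error correction model is not shown.

```python
import re

# Hypothetical pattern for stray list markers such as "1 a" or "2b" at the
# start of a line. A real rule set would be tuned to the data, since this
# pattern would also strip a legitimate leading phrase such as "3 a day".
LIST_MARKER = re.compile(r"^\s*\d+\s?[a-z]\b[.)]?\s*", flags=re.MULTILINE)
WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Remove list markers and collapse runs of blank characters."""
    text = LIST_MARKER.sub("", raw)
    text = WHITESPACE.sub(" ", text)
    return text.strip()


print(clean_text("1 a  Patient has   headache.\n\t2b Nausea reported. "))
```

Dosage strings such as "5 mg" are left alone because the pattern requires a single letter followed by a word boundary.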
S2, using stacked generalization (stacking), an ensemble learning technique, to segment the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain the final word segmentation result.
Each segmented field should keep medical nouns intact. For example, the raw data "The patient's main symptoms are headache and nausea, and the diagnosis is migraine. It is recommended to take aspirin at onset." should be split into: "patient, chief complaints, headache, and, nausea, diagnosis, migraine, advice, at the time of onset, taking, aspirin". Using several medical word segmenters together allows the final word segmentation to be completed more accurately.
Specifically, as shown in Fig. 2, the Chinese training data is input into four existing medical word segmenters (the Jieba word segmenter, the bastrubertert Chinese part-of-speech tagger, the nestedNER Chinese medical noun recognizer and the RANER Chinese medical noun recognizer) to obtain the corresponding word segmentation results. These four medical word segmenters are open-source models that are widely used at present; they segment medical data reasonably well, but still make segmentation errors.
The outputs of the four medical word segmenters are then taken as new input to train a secondary word segmenter, which produces the final word segmentation result. The secondary word segmenter comprises trained deep learning models (such as a BERT-LSTM-CRF model or a Struct-LSTM-CRF model) and classical computer models (such as regular expressions). Training the secondary word segmenter on top of several open-source medical word segmentation models uses ensemble learning to ensure final accuracy while making use of existing models. When the secondary word segmenter is trained, the results for the training data are labeled manually, which further ensures its accuracy.
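The stacking idea in step S2 can be illustrated with a deliberately simple stand-in for the secondary word segmenter. The sketch below is not the BERT-LSTM-CRF model mentioned above: it replaces the trained model with accuracy-weighted voting on character boundaries, and the base segmenter outputs, function names and the 0.5 threshold are all hypothetical.

```python
from typing import List, Sequence


def boundaries(tokens: Sequence[str]) -> List[int]:
    """Return one 0/1 flag per gap between adjacent characters (1 = split)."""
    flags: List[int] = []
    for tok in tokens:  # tokens are assumed to be non-empty
        flags.extend([0] * (len(tok) - 1))
        flags.append(1)
    return flags[:-1]  # the end of the text is not a gap


def fit_weights(base_flags: List[List[int]], gold_flags: List[int]) -> List[float]:
    """Weight each base segmenter by its boundary accuracy on annotated data."""
    return [
        sum(f == g for f, g in zip(flags, gold_flags)) / len(gold_flags)
        for flags in base_flags
    ]


def combine(base_flags: List[List[int]], weights: List[float], text: str) -> List[str]:
    """Split wherever the weighted share of votes for a boundary is >= 0.5."""
    total = sum(weights)
    tokens, start = [], 0
    for gap in range(len(text) - 1):
        score = sum(w * flags[gap] for w, flags in zip(weights, base_flags))
        if score / total >= 0.5:
            tokens.append(text[start:gap + 1])
            start = gap + 1
    tokens.append(text[start:])
    return tokens


text = "偏头痛患者"  # "migraine patient"
base_outputs = [
    ["偏头痛", "患者"],      # hypothetical output of base segmenter A
    ["偏", "头痛", "患者"],  # hypothetical output of base segmenter B (splits the noun)
    ["偏头痛", "患者"],      # hypothetical output of base segmenter C
]
gold = ["偏头痛", "患者"]    # manually annotated segmentation
base_flags = [boundaries(t) for t in base_outputs]
weights = fit_weights(base_flags, boundaries(gold))
print(combine(base_flags, weights, text))
```

For brevity the weights are fitted and applied on the same sentence; a real secondary segmenter would be trained on annotated data and evaluated on held-out text.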
S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph. For the raw data example above, the medical fields are automatically labeled as shown in Table 1:
Table 1 Medical field labels
Medical field | Label |
Headache | SYMPTOM |
Nausea | SYMPTOM |
Migraine | DISEASE |
Aspirin | MEDICINE |
Meanwhile, punctuation marks in the medical data are labeled with a separator label (SEP).
After the above steps, the medical data is split into different fields, and some of the medical fields are labeled with corresponding labels, so as to obtain partially labeled medical data, as shown in table 2:
Table 2 Partially labeled medical data
Field | Label |
Patient | |
Main symptoms | |
Is | |
Headache | SYMPTOM |
And | |
Nausea | SYMPTOM |
， | SEP |
Diagnosis | |
Is | |
Migraine | DISEASE |
。 | SEP |
Recommend | |
At | |
Onset | |
Time | |
Take | |
Aspirin | MEDICINE |
。 | SEP |
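The partial labeling of step S3 can be sketched as a lookup against the knowledge graph. In this minimal illustration a plain dictionary stands in for the medical knowledge graph; the dictionary contents, the punctuation set and the function name `label_fields` are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a medical knowledge graph: entity name -> entity type.
KNOWLEDGE_GRAPH: Dict[str, str] = {
    "headache": "SYMPTOM",
    "nausea": "SYMPTOM",
    "migraine": "DISEASE",
    "aspirin": "MEDICINE",
}
PUNCTUATION = {",", ".", ";", "，", "。", "；"}


def label_fields(fields: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Label known medical fields and punctuation; leave other fields as None."""
    labeled: List[Tuple[str, Optional[str]]] = []
    for field in fields:
        if field in PUNCTUATION:
            labeled.append((field, "SEP"))
        else:
            labeled.append((field, KNOWLEDGE_GRAPH.get(field.lower())))
    return labeled


for field, label in label_fields(["Patient", "Headache", "，", "Aspirin", "。"]):
    print(f"{field} | {label or ''} |")
```

Fields that come back as `None` are the unlabeled fields that step S4 hands to human annotators.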
S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain completely labeled medical data. Using the group intelligence model improves the accuracy of the whole model without new human resource input.
With the spread of artificial intelligence, deep learning demands ever larger volumes of data, so a working model in which a small number of people do the labeling no longer suits current project scales. At the same time, because the labeling workload is extremely unbalanced between the early and later stages of a project, assigning a large amount of manpower for the whole project is a great waste of human resources. The invention draws on the way knowledge graphs are built: websites such as Wikipedia and Baidu Baike use the collective wisdom of all people to contribute knowledge and set up supervision to verify the contributed knowledge, thereby training their own knowledge graph models. Accordingly, the labeling work here draws on group intelligence: the work is broken up into small pieces to speed up progress, and the submitted labels are judged by using the existing deep learning model, the classical computer model and the unsupervised training model. Step S4 specifically comprises the following steps:
S41, the unlabeled fields in the partially labeled medical data are shuffled, packaged and distributed to labeling staff for manual labeling.
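Step S41 can be sketched as a shuffle followed by splitting into fixed-size packages. The function name `package_fields`, the package size and the seed are hypothetical; how packages are assigned to staff (for example, giving the same package to several annotators so that their labels can later be compared by voting) is left out.

```python
import random
from typing import List, Optional, Sequence


def package_fields(
    unlabeled: Sequence[str], package_size: int, seed: Optional[int] = None
) -> List[List[str]]:
    """Shuffle a copy of the unlabeled fields and split it into packages."""
    rng = random.Random(seed)  # a seed makes the shuffle reproducible
    shuffled = list(unlabeled)  # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    return [
        shuffled[i:i + package_size]
        for i in range(0, len(shuffled), package_size)
    ]


fields = ["patient", "main symptoms", "is", "and", "diagnosis", "recommend", "take"]
for number, package in enumerate(package_fields(fields, package_size=3, seed=42), 1):
    print(number, package)
```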
S42, the labels are collected to obtain the label results after manual labeling, the label results are corrected, and low-quality labels are removed from the label results.
The correction of the label result in step S42 is achieved by one or more of the following four ways:
1. Intra-group competition is performed: the staff in a group who process the same piece of data vote, the majority wins, and the minority is regarded as erroneous, as shown in Fig. 3. As results accumulate over time, the working accuracy of each employee is counted, and weighted voting is performed on the label results according to the reliability of each employee, as shown in Fig. 4;
2. The output of the existing label correction model is compared with the original labels, and labels with a high error rate are removed;
3. A plurality of the secondary word segmenters from step S2 are set up and used to obtain label results; the weight of each secondary word segmenter is taken as a trainable parameter of an advanced deep learning model, the advanced deep learning model is trained with the existing results, the training results are compared with the existing labels, and labels with a high error rate are removed;
4. The structural properties of the original data are taken into account by performing unsupervised clustering. For example, the words contained in two sentences are analyzed; if the words are highly similar and the semantics are close, the labels of the two sentences should be the same or similar. An unsupervised training model is used to make the labels of data in the same class consistent, and labels with an excessively high error rate are removed, as shown in Fig. 5.
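The first correction approach, majority voting followed by reliability-weighted voting, can be sketched as below. The vote tuples, the employee identifiers and the default weight of 0.5 for an employee with no history are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Vote = Tuple[str, str, str]  # (item_id, employee_id, label)


def majority_vote(votes: List[Vote]) -> Dict[str, str]:
    """Per item, the label chosen by most employees wins."""
    by_item: Dict[str, List[str]] = defaultdict(list)
    for item, _employee, label in votes:
        by_item[item].append(label)
    return {item: Counter(labels).most_common(1)[0][0] for item, labels in by_item.items()}


def reliability(votes: List[Vote], consensus: Dict[str, str]) -> Dict[str, float]:
    """Share of each employee's past votes that matched the consensus."""
    agree: Counter = Counter()
    total: Counter = Counter()
    for item, employee, label in votes:
        total[employee] += 1
        agree[employee] += int(label == consensus[item])
    return {employee: agree[employee] / total[employee] for employee in total}


def weighted_vote(votes: List[Vote], weights: Dict[str, float]) -> Dict[str, str]:
    """Per item, sum employee reliabilities per label and pick the largest."""
    scores: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for item, employee, label in votes:
        scores[item][label] += weights.get(employee, 0.5)  # 0.5: no history yet
    return {item: max(s, key=s.get) for item, s in scores.items()}


history = [
    ("h1", "a", "SYMPTOM"), ("h1", "b", "SYMPTOM"), ("h1", "c", "DISEASE"),
    ("h2", "a", "DISEASE"), ("h2", "b", "DISEASE"), ("h2", "c", "DISEASE"),
    ("h3", "a", "MEDICINE"), ("h3", "b", "MEDICINE"), ("h3", "c", "SYMPTOM"),
]
weights = reliability(history, majority_vote(history))
new_votes = [("n1", "a", "SYMPTOM"), ("n1", "c", "DISEASE")]
print(weighted_vote(new_votes, weights))
```

In the example, a plain vote on item `n1` would be tied one to one; weighting by past accuracy lets the more reliable employee decide.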
S43: the corrected label results are used to retrain a label correction model, an advanced deep learning model, and an unsupervised training model.
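The fourth correction approach in step S42, keeping labels consistent among similar data, can be illustrated with a very small stand-in for the unsupervised model. The sketch groups sentences by Jaccard similarity of their word sets in a single greedy pass and flags labels that disagree with the majority of their group; the similarity measure, the 0.5 threshold and the function names are assumptions, not the patent's model.

```python
from collections import Counter
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-set overlap between two sentences, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def flag_inconsistent(
    samples: List[Tuple[List[str], str]], threshold: float = 0.5
) -> List[int]:
    """Return indices of samples whose label differs from their cluster majority."""
    clusters: List[List[int]] = []
    for i, (tokens, _label) in enumerate(samples):
        for cluster in clusters:
            representative = samples[cluster[0]][0]  # first member of the cluster
            if jaccard(set(tokens), set(representative)) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    flagged: List[int] = []
    for cluster in clusters:
        majority = Counter(samples[i][1] for i in cluster).most_common(1)[0][0]
        flagged.extend(i for i in cluster if samples[i][1] != majority)
    return flagged


samples = [
    (["patient", "has", "severe", "headache"], "positive"),
    (["patient", "has", "mild", "headache"], "positive"),
    (["patient", "has", "severe", "headache", "today"], "negative"),
    (["no", "fever", "reported"], "negative"),
]
print(flag_inconsistent(samples))
```

Flagged samples are candidates for removal or re-labeling; a cluster with a single member can never be flagged by this check.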
This embodiment also provides a medical data preprocessing system, comprising:
a translation module, used for translating foreign-language medical data into Chinese medical data;
a data cleaning module, used for removing irrelevant symbols from the medical data and performing error correction on the text data in the medical data; the data cleaning module comprises an automatic error correction model for correcting the text data;
a word segmentation module, used for segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
a label generation module, used for constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
and a group intelligence module, used for labeling the unlabeled fields in the medical data obtained by the label generation module to obtain completely labeled medical data. Specifically, the group intelligence module is used for shuffling the unlabeled fields in the partially labeled medical data, and packaging and distributing them for manual labeling; and for acquiring the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.
The group intelligence module corrects the label results in the following ways:
performing intra-group competition, in which the staff in a group who process the same piece of data vote, the majority wins and the minority is regarded as erroneous; as results accumulate over time, counting the working accuracy of each staff member, and performing weighted voting on the label results according to the reliability of each staff member;
or, comparing the output of a label correction model with the original labels, and removing labels with an excessively high error rate;
or, setting a plurality of secondary word segmenters, obtaining label results by using the secondary word segmenters, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model with the existing results, comparing the training results with the existing labels, and removing labels with an excessively high error rate;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels with an excessively high error rate.
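The module structure of the system can be sketched as a pipeline of interchangeable callables. The class name `PreprocessingPipeline`, the field names and the trivial stand-in functions in the usage example (including the `OTHER` label) are hypothetical; each stand-in would be replaced by the corresponding module described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Labeled = List[Tuple[str, Optional[str]]]


@dataclass
class PreprocessingPipeline:
    translate: Callable[[str], str]                # translation module
    clean: Callable[[str], str]                    # data cleaning module
    segment: Callable[[str], List[str]]            # word segmentation module
    auto_label: Callable[[List[str]], Labeled]     # label generation module
    crowd_label: Callable[[Labeled], Labeled]      # group intelligence module

    def run(self, raw: str) -> Labeled:
        """Pass raw medical text through the five modules in order."""
        text = self.clean(self.translate(raw))
        fields = self.segment(text)
        partially_labeled = self.auto_label(fields)
        return self.crowd_label(partially_labeled)


# Trivial stand-ins, only to show how the modules are wired together.
pipeline = PreprocessingPipeline(
    translate=lambda s: s,
    clean=lambda s: " ".join(s.split()),
    segment=lambda s: s.split(" "),
    auto_label=lambda fields: [(f, "MEDICINE" if f == "aspirin" else None) for f in fields],
    crowd_label=lambda partial: [(f, label or "OTHER") for f, label in partial],
)
print(pipeline.run("  take   aspirin "))
```

Keeping each module behind a plain function signature makes it possible to swap, for example, one segmenter ensemble for another without touching the rest of the pipeline.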
The invention is not limited to the alternative embodiments described above. Anyone may derive products in various other forms in light of the present invention; however, regardless of any change in shape or structure, all technical solutions that fall within the scope defined by the claims of the present invention fall within the scope of protection of the present invention.
Claims (10)
1. A medical data preprocessing method, characterized by comprising the steps of:
S1, removing irrelevant symbols from the medical data, and performing error correction on the text data in the medical data;
S2, segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain completely labeled medical data.
2. The medical data preprocessing method according to claim 1, characterized in that: the step S4 includes:
S41, shuffling the unlabeled fields in the partially labeled medical data, then packaging and distributing them for manual labeling;
S42, acquiring the label results after manual labeling, and correcting the label results.
3. The medical data preprocessing method according to claim 2, characterized in that: correcting the label result is achieved by:
performing intra-group competition, in which the staff in a group who process the same piece of data vote, the majority wins and the minority is regarded as erroneous; as results accumulate over time, counting the working accuracy of each staff member, and performing weighted voting on the label results according to the reliability of each staff member;
or, comparing the output of a label correction model with the original labels, and removing labels with an excessively high error rate;
or, setting a plurality of secondary word segmenters, obtaining label results by using the secondary word segmenters, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model with the existing results, comparing the training results with the existing labels, and removing labels with an excessively high error rate;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels with an excessively high error rate.
4. A medical data preprocessing method according to claim 3, characterized in that: the step S4 includes a step S43: the corrected label results are used to retrain a label correction model, an advanced deep learning model, and an unsupervised training model.
5. The medical data preprocessing method according to claim 1, characterized in that: the secondary word segmenter is obtained by training on the word segmentation results produced by the medical word segmentation models.
6. The medical data preprocessing method according to claim 1, characterized in that step S3 further includes: labeling punctuation marks in the medical data with separator labels.
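A minimal sketch of the punctuation labeling in claim 6, assuming the text has already been segmented into tokens. The label name `SEP` and the use of Unicode general categories to detect punctuation are assumptions for illustration; the claim does not say how punctuation is recognized.

```python
import unicodedata

SEPARATOR_LABEL = "SEP"  # hypothetical label name, not taken from the patent


def label_punctuation(tokens):
    """Attach SEPARATOR_LABEL to tokens made up only of punctuation
    characters (Unicode categories starting with 'P'); other tokens keep
    the label None so later steps can fill them in."""
    labeled = []
    for token in tokens:
        is_punct = bool(token) and all(
            unicodedata.category(ch).startswith("P") for ch in token
        )
        labeled.append((token, SEPARATOR_LABEL if is_punct else None))
    return labeled
```

Because the check relies on Unicode categories, it covers both ASCII punctuation such as "," and full-width Chinese punctuation such as "。".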
7. The medical data preprocessing method according to claim 1, characterized in that step S1 is preceded by a step S0: translating foreign-language medical data into Chinese medical data.
8. A medical data preprocessing system, comprising:
a data cleaning module, used for removing irrelevant symbols from the medical data and performing error correction on the text data in the medical data;
a word segmentation module, used for segmenting the text data into different fields through a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into the secondary word segmenter to obtain a final word segmentation result;
a label generation module, used for constructing a medical knowledge graph, and labeling medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
and a group intelligence module, used for labeling the unlabeled fields in the medical data obtained by the label generation module to obtain fully labeled medical data.
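The hand-off between the label generation module and the group intelligence module can be sketched as below. The knowledge graph is reduced here to a plain dictionary from entity name to entity type, which is a strong simplifying assumption; the function name and the example entries are hypothetical.

```python
def label_with_knowledge_graph(tokens, entity_types):
    """Label tokens found in the (simplified) knowledge graph and collect
    the remaining tokens for manual labeling.

    entity_types: dict mapping an entity name to its type, standing in
    for a lookup against a real medical knowledge graph.
    """
    labeled = [(token, entity_types.get(token)) for token in tokens]
    unlabeled = [token for token, label in labeled if label is None]
    return labeled, unlabeled


# Hypothetical graph entries and an already segmented sentence.
entity_types = {"aspirin": "drug", "headache": "symptom"}
tokens = ["patient", "took", "aspirin", "for", "headache"]
labeled, unlabeled = label_with_knowledge_graph(tokens, entity_types)
```

Here `labeled` plays the role of the partially labeled medical data, and `unlabeled` holds the fields that would be passed on for manual labeling.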
9. The medical data preprocessing system of claim 8, wherein: the group intelligence module is used for shuffling the unlabeled fields in the partially labeled medical data, packaging them and distributing them for manual labeling; and for acquiring the label result after manual labeling and correcting the label result.
10. The medical data preprocessing system of claim 9, wherein correcting the label result is achieved by:
performing intra-group competition, in which the staff in a group processing the same piece of data vote, the majority label wins and the minority labels are regarded as errors; accumulating these outcomes over time to compute the working accuracy of each staff member; and performing weighted voting on the label result according to the reliability of each staff member;
or, comparing the output of a label correction model with the original labels, and removing labels with an excessively high error rate;
or, setting a plurality of secondary word segmenters, obtaining a label result by using the secondary word segmenters, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model with the existing results, comparing the training result with the existing labels, and removing labels with an excessively high error rate;
or, using an unsupervised training model so that data in the same class are labeled consistently, and removing labels with an excessively high error rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311002583.7A CN116721779B (en) | 2023-08-10 | 2023-08-10 | Medical data preprocessing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311002583.7A CN116721779B (en) | 2023-08-10 | 2023-08-10 | Medical data preprocessing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116721779A | 2023-09-08 |
CN116721779B | 2023-11-24 |
Family
ID=87870155
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311002583.7A (Active), CN116721779B (en) | 2023-08-10 | 2023-08-10 | Medical data preprocessing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116721779B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018081751A1 (en) * | 2016-10-28 | 2018-05-03 | Vilynx, Inc. | Video tagging system and method |
CN111046272A (en) * | 2019-10-31 | 2020-04-21 | 九次方大数据信息集团有限公司 | Intelligent question-answering system based on medical knowledge map |
CN112199511A (en) * | 2020-09-28 | 2021-01-08 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Cross-language multi-source vertical domain knowledge graph construction method |
CN112542223A (en) * | 2020-12-21 | 2021-03-23 | 西南科技大学 | Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record |
CN113901224A (en) * | 2021-11-22 | 2022-01-07 | 国家电网有限公司信息通信分公司 | Knowledge distillation-based secret-related text recognition model training method, system and device |
CN114091449A (en) * | 2021-11-12 | 2022-02-25 | 昆明理工大学 | Chinese word segmentation method and Chinese word segmentation device in medical field |
CN114492444A (en) * | 2022-02-10 | 2022-05-13 | 北京工业大学 | Chinese electronic medical case medical entity part-of-speech tagging method |
CN115796177A (en) * | 2022-11-28 | 2023-03-14 | 竹间智能科技(上海)有限公司 | Method, medium and electronic device for realizing Chinese word segmentation and part-of-speech tagging |
CN116108000A (en) * | 2023-04-14 | 2023-05-12 | 成都安哲斯生物医药科技有限公司 | Medical data management query method |
2023-08-10: CN application CN202311002583.7A, patent CN116721779B (en), status Active
Non-Patent Citations (2)
Title |
---|
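WANG, YU et al.: "Named entity recognition in Chinese medical literature using pretraining models", SCIENTIFIC PROGRAMMING, vol. 2020, pages 1 - 6 * |
Si Nianwen: "Research on sentence-level text processing technology for the military domain", China Master's Theses Full-text Database (Engineering Science and Technology II), no. 01, pages 032 - 14 * |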
Also Published As
Publication number | Publication date |
---|---|
CN116721779B (en) | 2023-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107731269B (en) | Disease coding method and system based on original diagnosis data and medical record file data | |
US10818397B2 (en) | Clinical content analytics engine | |
CN107705839B (en) | Disease automatic coding method and system | |
Sager et al. | Natural language processing and the representation of clinical data | |
CN106682397B (en) | Knowledge-based electronic medical record quality control method | |
US8612261B1 (en) | Automated learning for medical data processing system | |
CN106874643A (en) | Build the method and system that knowledge base realizes assisting in diagnosis and treatment automatically based on term vector | |
CN113806563B (en) | Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material | |
WO2020172446A1 (en) | Automated generation of structured patient data record | |
CN112541066B (en) | Text-structured-based medical and technical report detection method and related equipment | |
CN109686443A (en) | A kind of clinical diagnosis aid decision-making system and medical knowledge map accumulative means | |
CN111128388A (en) | Value domain data matching method and device and related products | |
Chandra et al. | Natural language Processing and Ontology based Decision Support System for Diabetic Patients | |
CN114420233A (en) | Method for extracting post-structured information of Chinese electronic medical record | |
CN110597760A (en) | Intelligent method for judging compliance of electronic document | |
Kim et al. | Information Extraction from Patient Care Reports for Intelligent Emergency Medical Services | |
CN113360643A (en) | Electronic medical record data quality evaluation method based on short text classification | |
US11727685B2 (en) | System and method for generation of process graphs from multi-media narratives | |
CN116721779B (en) | Medical data preprocessing method and system | |
CN109036506A (en) | Monitoring and managing method, electronic device and the readable storage medium storing program for executing of internet medical treatment interrogation | |
CN116913548A (en) | Adverse reaction data analysis method, device, electronic equipment and storage medium | |
Hao et al. | Extracting and normalizing temporal expressions in clinical data requests from researchers | |
CN108831560B (en) | Method and device for determining medical data attribute data | |
CN105956362B (en) | A kind of believable case history structural method and system | |
Patrick et al. | Developing SNOMED CT subsets from clinical notes for intensive care service |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||