CN116721779A - Medical data preprocessing method and system - Google Patents


Info

Publication number
CN116721779A
Authority
CN
China
Prior art keywords
medical
medical data
label
data
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311002583.7A
Other languages
Chinese (zh)
Other versions
CN116721779B (en)
Inventor
李睿
胡其桐
邢沛瑶
刘瑞华
徐浩
郑名扬
邢天奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Angels Biomedical Technology Co ltd
Original Assignee
Chengdu Angels Biomedical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Angels Biomedical Technology Co ltd
Priority to CN202311002583.7A
Publication of CN116721779A
Application granted
Publication of CN116721779B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of data processing and discloses a medical data preprocessing method and system. The method comprises the following steps: removing irrelevant symbols from the medical data and correcting errors in the text data of the medical data; segmenting the text data into fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting those results into a secondary word segmenter to obtain a final word segmentation result; constructing a medical knowledge graph and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data; and labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain fully labeled medical data. The invention automatically cleans, extracts and labels medical data so that the data are processed into the format and content required for training medical AI models, and it optimizes the data labeling workflow by using group intelligence techniques.

Description

Medical data preprocessing method and system
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a medical data preprocessing method and system.
Background
With the development of artificial intelligence technology, and especially the recent breakthroughs in general artificial intelligence, medical artificial intelligence models are increasingly used to analyze and process medical data, for example to provide intelligent consultation services for patients, automated analysis of drug mechanisms of action for pharmaceutical companies, and claims-settlement robots for insurance companies. However, training current medical artificial intelligence models requires a significant amount of time and resources, for the following reasons:
1. Medical data contains many entry errors, including typing errors in medical text (for example, in "the patient has a history of diabetes and has been treated for high blood", "hypertension" was wrongly entered as "high blood") and non-uniform data formats (for example, an operation date recorded in formats as varied as "January 1, 2000", "2000.01.01" and "01/01/2020"). Conventional computer algorithms handle such problems poorly, so a statistical analyst must analyze and correct the data and manually design rules for each new type of error that may appear in each medical dataset. This consumes a great deal of labor and time and slows project development.
2. Medical data contains a large amount of specialized content, and the training data required by different medical applications differs greatly in format (for example, a medical keyword extraction model requires labels of the form "medical noun + type", while a disease judgment model requires labels of the form "disease description + negative/positive"). The data therefore cannot be used directly to train a large language model; it must be manually segmented and labeled. Labeling for medical projects usually requires some medical knowledge, so qualified data preprocessing personnel are harder to recruit than for a general AI project.
3. The manpower required for data labeling varies greatly over the project cycle, which makes staffing difficult to plan. In the early stage of AI model training, a large amount of medical data must be labeled, so a large workforce is needed. In the later stage, the work consists mainly of fine-tuning data labels, and only a small workforce is needed.
4. Because the volume of medical data is very large and labeling requires medical expertise, manual preprocessing inevitably produces errors and low-quality labels, given that annotators differ in their command of the subject and in their working condition. These errors affect the quality of the final medical AI model.
Disclosure of Invention
The present invention aims to solve the above technical problems at least to some extent. Therefore, the invention aims to provide a medical data preprocessing method and a medical data preprocessing system.
The technical scheme adopted by the invention is as follows:
A medical data preprocessing method, comprising the following steps:
S1, removing irrelevant symbols from the medical data, and correcting errors in the text data of the medical data;
S2, segmenting the text data into fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain fully labeled medical data.
Preferably, step S4 includes:
S41, shuffling the unlabeled fields in the partially labeled medical data, packaging them, and distributing them for manual labeling;
S42, collecting the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.
Preferably, the label results are corrected in one of the following ways:
conducting an intra-group competition, in which the staff in a group who process the same piece of data vote, the majority answer wins, and the minority answers are treated as errors; as results accumulate over time, the working accuracy of each staff member is computed, and the label results are decided by a vote weighted by each staff member's reliability;
or, comparing the output of a label correction model with the original labels, and removing labels whose error rate is too high;
or, setting up a plurality of secondary word segmenters, obtaining label results with them, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model on the existing results, comparing the training results with the existing labels, and removing labels whose error rate is too high;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels whose error rate is too high.
Preferably, step S4 further includes step S43: using the corrected label results to retrain the label correction model, the advanced deep learning model and the unsupervised training model.
Preferably, the secondary word segmenter is obtained by training on the word segmentation results produced by the medical word segmentation models.
Preferably, step S3 further includes: labeling punctuation marks in the medical data with a separator label.
Preferably, step S1 is preceded by step S0: translating foreign-language medical data into Chinese medical data.
A medical data preprocessing system, comprising:
a data cleaning module, used for removing irrelevant symbols from the medical data and correcting errors in the text data of the medical data;
a word segmentation module, used for segmenting the text data into fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
a label generation module, used for constructing a medical knowledge graph and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
a group intelligence module, used for labeling the unlabeled fields in the medical data obtained by the label generation module to obtain fully labeled medical data.
Preferably, the group intelligence module is used for shuffling the unlabeled fields in the partially labeled medical data, packaging them, and distributing them for manual labeling; and for collecting the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.
Preferably, the label results are corrected in one of the following ways:
conducting an intra-group competition, in which the staff in a group who process the same piece of data vote, the majority answer wins, and the minority answers are treated as errors; as results accumulate over time, the working accuracy of each staff member is computed, and the label results are decided by a vote weighted by each staff member's reliability;
or, comparing the output of a label correction model with the original labels, and removing labels whose error rate is too high;
or, setting up a plurality of secondary word segmenters, obtaining label results with them, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model on the existing results, comparing the training results with the existing labels, and removing labels whose error rate is too high;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels whose error rate is too high.
The beneficial effects of the invention are as follows:
The medical data preprocessing method provided by the invention automatically completes simple tasks such as data translation, data cleaning and automatic spell checking, and uses the medical knowledge graph for semi-automatic labeling, so that nearly 50% of the data labeling work is completed automatically. It reduces the proportion of poor-quality samples, removes the difficulty of labeling across countries and languages, handles entry errors in medical records, and shortens medical data preprocessing time by more than 70%.
Ensemble learning is used for the word segmentation task, which makes full use of existing open-source medical word segmentation models, and training the secondary word segmenter ensures the final segmentation accuracy.
The medical knowledge graph is used to label medical data semi-automatically, so that labeling is completed automatically for approximately 50% of the data and the manual labeling workload is reduced.
The final labeling of medical data is carried out with the group intelligence model, which takes full account of the differences between annotators and the structure of the data, so that the quality of the final labels improves without any additional human resources. This matters for training the final deep learning model, because training on a moderate number of high-quality labels gives better results than training on a large number of poor-quality labels.
Drawings
FIG. 1 is a flow chart of the medical data preprocessing method of the present invention.
FIG. 2 is a flow chart of step S2 of the present invention.
FIG. 3 is a schematic diagram of voting-based group intelligence according to the present invention.
FIG. 4 is a schematic diagram of group intelligence based on voting and employee reliability according to the present invention.
FIG. 5 is a schematic diagram of group intelligence based on task features according to the present invention.
Detailed Description
The embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
It should also be appreciated that, in the embodiments, the functions/acts may occur in a different order than shown in the figures. For example, two steps shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality/acts involved.
As shown in fig. 1 to 5, a medical data preprocessing method of the present embodiment includes the following steps:
S0, acquiring medical data and, if the medical data is in a foreign language (such as English), translating it into Chinese medical data with a translation program; the translated medical data comprises text data and non-text data.
S1, data cleaning: removing irrelevant symbols (such as '1 a'), blank characters and the like from the medical data and, to address entry errors in the text data, performing automatic spell checking and error correction on the text data with an automatic error correction model.
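The cleaning in step S1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the set of characters treated as "irrelevant", the `clean_text` helper, and the small `CORRECTIONS` table, which only stands in for the automatic error correction model that the embodiment actually uses.

```python
import re

# Hypothetical correction table. The embodiment uses an automatic error
# correction model; this lookup is only a stand-in for illustration.
CORRECTIONS = {"high blood": "hypertension"}

# Assumed definition of "irrelevant symbols": anything that is not a word
# character, whitespace, or common ASCII/Chinese punctuation.
IRRELEVANT = re.compile(r"[^\w\s.,;:!?()/%\-，。；：！？（）、]")


def clean_text(text: str) -> str:
    """Remove irrelevant symbols, collapse blank characters, fix known typos."""
    text = IRRELEVANT.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    for wrong, right in CORRECTIONS.items():
        # Replace the truncated phrase only when "pressure" does not follow it.
        text = re.sub(rf"\b{re.escape(wrong)}\b(?! pressure)", right, text)
    return text


print(clean_text("history of  diabetes ★ and\t high blood "))
```

A lookup table cannot anticipate new error types, which is the limitation of hand-written rules that the Background describes; a trained correction model would replace it in practice.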
S2, using the stacked generalization (Stacking) technique of ensemble learning (Ensemble Learning), segmenting the text data into fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result.
Each segmented field should preserve the integrity of medical nouns. For example, the raw data "the patient's chief complaints are headache and nausea, and migraine is diagnosed. It is recommended to take aspirin at the time of onset" should be split into: "patient, chief complaints, headache, and, nausea, diagnosis, migraine, advice, at the time of onset, taking, aspirin". Combining several medical word segmenters allows the final segmentation to be completed more accurately.
Specifically, as shown in fig. 2, the Chinese training data is input into four existing medical word segmenters (the Jieba word segmenter, the BAStructBERT Chinese part-of-speech tagger, the nestedNER Chinese medical noun recognizer and the RANER Chinese medical noun recognizer) to obtain the corresponding word segmentation results. These four are widely used open-source medical word segmentation models that segment medical data well but still make segmentation errors.
The outputs of the four medical word segmenters are then taken as new input to train the secondary word segmenter and obtain the final word segmentation result. The secondary word segmenter comprises a trained deep learning model (such as a BERT-LSTM-CRF model or a Struct-LSTM-CRF model) together with classical computational methods (such as regular expressions). Training the secondary word segmenter on top of several open-source medical word segmentation models uses ensemble learning to ensure final accuracy while reusing existing models. When the secondary word segmenter is trained, the expected results for the training data are labeled manually, which further ensures its accuracy.
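The idea of feeding several segmenters' outputs into a second-stage combiner can be sketched as follows. This is not the trained secondary word segmenter of the embodiment: as a simplified stand-in, each first-stage segmentation is turned into a set of cut positions and the positions are combined by a weighted vote. The function names, the weights and the threshold are all illustrative assumptions; in a real stacking setup the weights would be learned from manually labeled data.

```python
from typing import Dict, List, Sequence, Set


def boundaries(tokens: Sequence[str]) -> Set[int]:
    """Character offsets at which a token ends (the end of the text excluded)."""
    cuts, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def combine(text: str, segmentations: List[List[str]],
            weights: List[float], threshold: float = 0.5) -> List[str]:
    """Keep a cut when the weighted share of segmenters proposing it exceeds threshold."""
    total = sum(weights)
    scores: Dict[int, float] = {}
    for tokens, weight in zip(segmentations, weights):
        if "".join(tokens) != text:
            raise ValueError("each segmentation must cover the text exactly")
        for cut in boundaries(tokens):
            scores[cut] = scores.get(cut, 0.0) + weight
    kept = sorted(c for c, s in scores.items() if s / total > threshold)
    fields, start = [], 0
    for cut in kept + [len(text)]:
        fields.append(text[start:cut])
        start = cut
    return fields


text = "偏头痛患者"  # "migraine patient"
outputs = [
    ["偏头痛", "患者"],      # segmenter A keeps the medical noun whole
    ["偏", "头痛", "患者"],  # segmenter B splits "migraine" incorrectly
    ["偏头痛", "患者"],      # segmenter C agrees with A
]
print(combine(text, outputs, weights=[1.0, 1.0, 1.0]))
```

With equal weights the erroneous cut proposed by a single segmenter is outvoted, so the medical noun stays intact; a learned combiner could additionally weight segmenters differently by context.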
S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph. For the raw data in the example above, the medical fields are labeled automatically as shown in Table 1:
Table 1 Medical field labels
Medical field    Label
Headache         SYMPTOM
Nausea           SYMPTOM
Migraine         DISEASE
Aspirin          MEDICINE
Meanwhile, punctuation marks in the medical data are labeled with the separator label (SEP).
After the above steps, the medical data has been split into fields and some of the medical fields carry labels, giving the partially labeled medical data shown in Table 2 (a blank label means the field is still unlabeled; the rows labeled SEP are punctuation marks):
Table 2 Partially labeled medical data
Field            Label
Patient
Chief complaints
Is
Headache         SYMPTOM
And
Nausea           SYMPTOM
(punctuation)    SEP
Diagnosis
Is
Migraine         DISEASE
(punctuation)    SEP
Advice
At
Onset
Time
Take
Aspirin          MEDICINE
(punctuation)    SEP
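The partial labeling of step S3 can be sketched as a lookup against the knowledge graph. The sketch below assumes the graph has already been reduced to a flat mapping from entity name to entity type; `KNOWLEDGE_GRAPH`, `is_punctuation` and `label_fields` are hypothetical names, and the four entries simply mirror Table 1.

```python
import unicodedata
from typing import List, Optional, Tuple

# Hypothetical slice of a medical knowledge graph, flattened to
# entity -> entity type. The entries mirror Table 1.
KNOWLEDGE_GRAPH = {
    "headache": "SYMPTOM",
    "nausea": "SYMPTOM",
    "migraine": "DISEASE",
    "aspirin": "MEDICINE",
}


def is_punctuation(field: str) -> bool:
    """True when every character belongs to a Unicode punctuation category."""
    return bool(field) and all(
        unicodedata.category(ch).startswith("P") for ch in field
    )


def label_fields(fields: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Label known medical fields and punctuation; leave the rest as None."""
    labeled = []
    for field in fields:
        if is_punctuation(field):
            label = "SEP"
        else:
            # None marks a field that is left for the manual labeling in step S4.
            label = KNOWLEDGE_GRAPH.get(field.lower())
        labeled.append((field, label))
    return labeled


for field, label in label_fields(["patient", "headache", ",", "Aspirin", "."]):
    print(field, label or "")
```

Fields that come back with no label are exactly the ones handed to the group intelligence stage, which is why this step is described as semi-automatic.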
S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain fully labeled medical data. Using the group intelligence model improves the accuracy of the overall result without any additional human resources.
With the spread of artificial intelligence, deep learning demands ever larger volumes of data, so a working model in which a small number of people do the labeling no longer suits the current scale of projects. Yet because the labeling workload is extremely unbalanced between the early and late stages of a project, hiring a large workforce is a great waste of human resources. The invention borrows from the way knowledge bases are built: websites such as Wikipedia and Baidu Encyclopedia draw on the wisdom of the whole population to contribute knowledge and set up supervision to verify what is entered, and thereby build their own knowledge graphs. Accordingly, the labeling work draws on the intelligence of the group: the work is broken into small pieces to speed up its progress, and the resulting labels are checked with existing deep learning models, classical computational models and unsupervised training models. Step S4 specifically comprises the following steps:
S41, shuffling the unlabeled fields in the partially labeled medical data, packaging them, and distributing them to labeling staff for manual labeling.
S42, collecting the labels: obtaining the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.
The correction of the label results in step S42 is achieved in one or more of the following four ways:
1. Conducting an intra-group competition: the staff in a group who process the same piece of data vote, the majority answer wins, and the minority answers are treated as errors, as shown in fig. 3. As results accumulate over time, the working accuracy of each employee is computed, and the label results are decided by a vote weighted by each employee's reliability, as shown in fig. 4;
2. Comparing the output of an existing label correction model with the original labels, and removing labels whose error rate is too high;
3. Setting up a plurality of the secondary word segmenters of step S2, obtaining label results with them, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model on the existing results, comparing the training results with the existing labels, and removing labels whose error rate is too high;
4. Taking the structure of the original data into account and clustering by unsupervised learning. For example, the words contained in two sentences are analyzed; if the words are highly similar and the semantics are close, the labels of the two sentences should be the same or similar. An unsupervised training model is used to make the labels of data in the same class consistent, and labels whose error rate is too high are removed, as shown in fig. 5.
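The first correction method, majority voting followed by reliability-weighted voting, can be sketched as follows. The data layout (a dict of employee to label per item, and a running `[correct, total]` record per employee) and the default weight for employees without history are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def majority_vote(votes: Dict[str, str]) -> str:
    """Return the label chosen by the most employees for one item."""
    return Counter(votes.values()).most_common(1)[0][0]


def update_accuracy(history: Dict[str, List[int]],
                    votes: Dict[str, str], winner: str) -> None:
    """Accumulate [correct, total] per employee; minority answers count as errors."""
    for employee, label in votes.items():
        record = history.setdefault(employee, [0, 0])
        record[0] += int(label == winner)
        record[1] += 1


def weighted_vote(votes: Dict[str, str], history: Dict[str, List[int]],
                  default: float = 0.5) -> str:
    """Weight each vote by the employee's accumulated accuracy."""
    scores = defaultdict(float)
    for employee, label in votes.items():
        correct, total = history.get(employee, (0, 0))
        # Employees with no history get an assumed neutral weight.
        scores[label] += correct / total if total else default
    return max(scores, key=scores.get)


history: Dict[str, List[int]] = {}
past_rounds = [
    {"a": "SYMPTOM", "b": "SYMPTOM", "c": "DISEASE"},
    {"a": "DISEASE", "b": "DISEASE", "c": "SYMPTOM"},
]
for votes in past_rounds:
    update_accuracy(history, votes, majority_vote(votes))

new_votes = {"a": "MEDICINE", "c": "SYMPTOM", "d": "SYMPTOM"}
print(majority_vote(new_votes), weighted_vote(new_votes, history))
```

In the last round a plain majority picks the label given by two employees, while the weighted vote follows the single employee whose past answers always matched the majority. Ties are not handled specially in this sketch and would need an explicit rule.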
S43: the corrected label results are used to retrain a label correction model, an advanced deep learning model, and an unsupervised training model.
The embodiment also provides a medical data preprocessing system, which comprises the following modules:
a translation module, used for translating foreign-language medical data into Chinese medical data;
a data cleaning module, used for removing irrelevant symbols from the medical data and correcting errors in the text data of the medical data; the data cleaning module comprises an automatic error correction model for correcting the text data;
a word segmentation module, used for segmenting the text data into fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
a label generation module, used for constructing a medical knowledge graph and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
a group intelligence module, used for labeling the unlabeled fields in the medical data obtained by the label generation module to obtain fully labeled medical data. Specifically, the group intelligence module is used for shuffling the unlabeled fields in the partially labeled medical data, packaging them, and distributing them for manual labeling; and for collecting the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.
The group intelligence module corrects the label results in one of the following ways:
conducting an intra-group competition, in which the staff in a group who process the same piece of data vote, the majority answer wins, and the minority answers are treated as errors; as results accumulate over time, the working accuracy of each staff member is computed, and the label results are decided by a vote weighted by each staff member's reliability;
or, comparing the output of a label correction model with the original labels, and removing labels whose error rate is too high;
or, setting up a plurality of secondary word segmenters, obtaining label results with them, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model on the existing results, comparing the training results with the existing labels, and removing labels whose error rate is too high;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels whose error rate is too high.
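The shuffling, packaging and distribution performed by this module can be sketched as below. The text does not specify batch sizes or how many people receive the same data, so `batch_size`, `redundancy`, the round-robin assignment and the function name `package_tasks` are all assumptions; the only requirement taken from the text is that several staff members process the same piece of data so that their answers can later be compared.

```python
import random
from typing import Dict, List


def package_tasks(fields: List[str], annotators: List[str], batch_size: int,
                  redundancy: int = 3, seed: int = 0) -> Dict[str, List[List[str]]]:
    """Shuffle fields, cut them into batches, give each batch to several annotators."""
    if redundancy > len(annotators):
        raise ValueError("redundancy cannot exceed the number of annotators")
    rng = random.Random(seed)
    shuffled = list(fields)
    rng.shuffle(shuffled)
    batches = [shuffled[i:i + batch_size]
               for i in range(0, len(shuffled), batch_size)]
    assignments: Dict[str, List[List[str]]] = {name: [] for name in annotators}
    for index, batch in enumerate(batches):
        # Round-robin: consecutive positions modulo the team size are distinct,
        # so each batch reaches `redundancy` different annotators.
        for k in range(redundancy):
            name = annotators[(index * redundancy + k) % len(annotators)]
            assignments[name].append(batch)
    return assignments


unlabeled = ["patient", "chief complaints", "is", "and", "diagnosis", "advice", "take"]
plan = package_tasks(unlabeled, ["a", "b", "c", "d"], batch_size=3)
for name, batches in plan.items():
    print(name, batches)
```

Shuffling before batching means no annotator sees a record in its original order, and the round-robin keeps the workload roughly even across the team.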
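The last of these methods, enforcing label consistency within clusters of similar data, can be sketched with a deliberately simple similarity measure. The sketch uses word-overlap (Jaccard) similarity and a greedy single-pass clustering as stand-ins for the unsupervised training model; it ignores semantic similarity, and the threshold, minimum cluster size and function names are assumptions.

```python
from collections import Counter
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cluster(sentences: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose seed sentence is similar enough."""
    clusters: List[List[int]] = []
    for i, sentence in enumerate(sentences):
        for members in clusters:
            if jaccard(sentences[members[0]], sentence) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def flag_inconsistent(sentences: List[str], labels: List[str],
                      threshold: float = 0.5, min_size: int = 3) -> List[int]:
    """Indices whose label disagrees with the majority label of their cluster."""
    flagged: List[int] = []
    for members in cluster(sentences, threshold):
        if len(members) < min_size:
            continue  # too few neighbours to judge consistency
        majority = Counter(labels[i] for i in members).most_common(1)[0][0]
        flagged.extend(i for i in members if labels[i] != majority)
    return sorted(flagged)


sentences = [
    "patient reports severe headache",
    "patient reports mild headache",
    "patient reports severe headache today",
    "take aspirin after meals",
]
labels = ["positive", "positive", "negative", "positive"]
print(flag_inconsistent(sentences, labels))
```

Flagged items are candidates for removal or re-labeling rather than proven errors, since similar wording does not always imply the same label.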
The invention is not limited to the alternative embodiments described above. Anyone may derive products in various other forms in light of the present invention; however, regardless of any change in shape or structure, all technical solutions that fall within the scope defined by the claims of the present invention fall within its scope of protection.

Claims (10)

1. A medical data preprocessing method, characterized by comprising the following steps:
S1, removing irrelevant symbols from the medical data, and correcting errors in the text data of the medical data;
S2, segmenting the text data into fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain a final word segmentation result;
S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph to obtain partially labeled medical data;
S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain fully labeled medical data.
2. The medical data preprocessing method according to claim 1, characterized in that step S4 includes:
S41, shuffling the unlabeled fields in the partially labeled medical data, packaging them, and distributing them for manual labeling;
S42, collecting the label results after manual labeling, and correcting the label results.
3. The medical data preprocessing method according to claim 2, characterized in that the label results are corrected in one of the following ways:
conducting an intra-group competition, in which the staff in a group who process the same piece of data vote, the majority answer wins, and the minority answers are treated as errors; as results accumulate over time, the working accuracy of each staff member is computed, and the label results are decided by a vote weighted by each staff member's reliability;
or, comparing the output of a label correction model with the original labels, and removing labels whose error rate is too high;
or, setting up a plurality of secondary word segmenters, obtaining label results with them, taking the weight of each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model on the existing results, comparing the training results with the existing labels, and removing labels whose error rate is too high;
or, using an unsupervised training model to make the labels of data in the same class consistent, and removing labels whose error rate is too high.
4. The medical data preprocessing method according to claim 3, characterized in that step S4 further includes step S43: using the corrected label results to retrain the label correction model, the advanced deep learning model and the unsupervised training model.
5. The medical data preprocessing method according to claim 1, characterized in that the secondary word segmenter is obtained by training on the word segmentation results produced by the medical word segmentation models.
6. The medical data preprocessing method according to claim 1, characterized in that step S3 further includes: labeling punctuation marks in the medical data with a separator label.
7. The medical data preprocessing method according to claim 1, characterized in that step S1 is preceded by step S0: translating foreign-language medical data into Chinese medical data.
8. A medical data preprocessing system, comprising:
the data cleaning module is used for removing irrelevant symbols in the medical data and correcting the text data in the medical data in an error correction way;
the word segmentation module is used for segmenting the text data into different fields through a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into the secondary word segmenters to obtain final word segmentation results;
the label generation module is used for constructing a medical knowledge graph, and labeling medical fields in the final word segmentation result based on the medical knowledge graph to obtain part of labeled medical data;
and the group intelligent module is used for marking the unlabeled fields in the medical data obtained by the label generating module to obtain the completely labeled medical data.
9. The medical data preprocessing system of claim 8, wherein: the group intelligent module is used for disturbing unlabeled fields in the part of labeled medical data, packaging and distributing the labeled fields for manual labeling; and acquiring a label result after manual labeling, and correcting the label result.
10. The medical data preprocessing system of claim 9, wherein: correcting the label result is achieved by:
performing intra-group competition, in which the staff members of a group processing the same piece of data vote on it, the majority result wins and the minority results are regarded as errors; accumulating these outcomes over time to count the working accuracy of each staff member; and performing weighted voting on the label result according to the reliability of each staff member;
or, comparing the output of the label correction model with the original labels, and removing labels with an excessively high error rate;
or, setting a plurality of secondary word segmenters, obtaining label results using the secondary word segmenters, taking the weight of each secondary word segmenter as a trainable parameter of the advanced deep learning model, training the advanced deep learning model on the existing results, comparing the training result with the existing labels, and removing labels with an excessively high error rate;
or, using an unsupervised training model to make the labels of data in the same class consistent during label processing, and removing labels with an excessively high error rate.
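The first alternative in claim 10 (intra-group competition followed by reliability-weighted voting) can be sketched as follows. This is an illustrative reading, not the patented implementation: the class name `ReliabilityTracker`, the smoothed accuracy formula, and the tie-breaking behavior are assumptions that the claim does not specify.

```python
from collections import Counter, defaultdict


class ReliabilityTracker:
    """Tracks how often each staff member agreed with the group majority."""

    def __init__(self) -> None:
        self.correct: dict[str, int] = defaultdict(int)
        self.total: dict[str, int] = defaultdict(int)

    def reliability(self, worker: str) -> float:
        # Smoothed accuracy (an assumption): a new worker starts at 0.5.
        return (self.correct[worker] + 1) / (self.total[worker] + 2)

    def majority_round(self, votes: dict[str, str]) -> str:
        """Unweighted vote: the majority label wins, minority votes count as errors."""
        # On a tie, the label that was voted for first wins (an assumption).
        winner, _ = Counter(votes.values()).most_common(1)[0]
        for worker, label in votes.items():
            self.total[worker] += 1
            if label == winner:
                self.correct[worker] += 1
        return winner

    def weighted_vote(self, votes: dict[str, str]) -> str:
        """Vote in which each worker's ballot is weighted by their reliability."""
        scores: dict[str, float] = defaultdict(float)
        for worker, label in votes.items():
            scores[label] += self.reliability(worker)
        return max(scores, key=scores.get)


tracker = ReliabilityTracker()
for _ in range(3):
    tracker.majority_round({"a": "DRUG", "b": "DRUG", "c": "SYMPTOM"})

# "a" has been consistently right, "c" consistently wrong, "d" is new.
result = tracker.weighted_vote({"a": "EXAM", "c": "DRUG", "d": "DRUG"})
```

In this run a plain majority would pick "DRUG", but the weighted vote picks "EXAM" because the accumulated reliability of "a" outweighs that of "c" and the untested "d" combined.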

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311002583.7A CN116721779B (en) 2023-08-10 2023-08-10 Medical data preprocessing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311002583.7A CN116721779B (en) 2023-08-10 2023-08-10 Medical data preprocessing method and system

Publications (2)

Publication Number Publication Date
CN116721779A CN116721779A (en) 2023-09-08
CN116721779B CN116721779B (en) 2023-11-24

Family

ID=87870155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311002583.7A Active CN116721779B (en) 2023-08-10 2023-08-10 Medical data preprocessing method and system

Country Status (1)

Country Link
CN (1) CN116721779B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018081751A1 (en) * 2016-10-28 2018-05-03 Vilynx, Inc. Video tagging system and method
CN111046272A (en) * 2019-10-31 2020-04-21 九次方大数据信息集团有限公司 Intelligent question-answering system based on medical knowledge map
CN112199511A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Cross-language multi-source vertical domain knowledge graph construction method
CN112542223A (en) * 2020-12-21 2021-03-23 西南科技大学 Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record
CN114091449A (en) * 2021-11-12 2022-02-25 昆明理工大学 Chinese word segmentation method and Chinese word segmentation device in medical field
CN113901224A (en) * 2021-11-22 2022-01-07 国家电网有限公司信息通信分公司 Knowledge distillation-based secret-related text recognition model training method, system and device
CN114492444A (en) * 2022-02-10 2022-05-13 北京工业大学 Chinese electronic medical case medical entity part-of-speech tagging method
CN115796177A (en) * 2022-11-28 2023-03-14 竹间智能科技(上海)有限公司 Method, medium and electronic device for realizing Chinese word segmentation and part-of-speech tagging
CN116108000A (en) * 2023-04-14 2023-05-12 成都安哲斯生物医药科技有限公司 Medical data management query method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
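WANG, YU et al.: "Named entity recognition in Chinese medical literature using pretraining models", SCIENTIFIC PROGRAMMING, vol. 2020, pages 1 - 6 *
SI, NIANWEN: "Research on sentence-level text processing technology for the military domain", China Master's Theses Full-text Database (Engineering Science and Technology II), no. 01, pages 032 - 14 *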

Also Published As

Publication number Publication date
CN116721779B (en) 2023-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant