CN116721779A

CN116721779A - Medical data preprocessing method and system

Info

Publication number: CN116721779A
Application number: CN202311002583.7A
Authority: CN
Inventors: 李睿; 胡其桐; 邢沛瑶; 刘瑞华; 徐浩; 郑名扬; 邢天奇
Original assignee: Chengdu Angels Biomedical Technology Co ltd
Current assignee: Chengdu Angels Biomedical Technology Co ltd
Priority date: 2023-08-10
Filing date: 2023-08-10
Publication date: 2023-09-08
Anticipated expiration: 2043-08-10
Also published as: CN116721779B

Abstract

The invention belongs to the technical field of data processing and discloses a medical data preprocessing method and system. The method comprises the following steps: removing irrelevant symbols from the medical data and correcting errors in the text data of the medical data; segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting those results into a secondary word segmenter to obtain final word segmentation results; constructing a medical knowledge graph and labeling the medical fields in the final word segmentation results based on the medical knowledge graph to obtain partially labeled medical data; and labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain completely labeled medical data. The invention automatically cleans, extracts and labels medical data so that the data is processed into the format and content required for training medical AI models, and it optimizes the data labeling workflow by using group intelligence techniques.

Description

Medical data preprocessing method and system

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a medical data preprocessing method and system.

Background

With the development of artificial intelligence technology, and especially the breakthroughs in general artificial intelligence, people have begun to use medical artificial intelligence models to analyze and process medical data, for example to provide intelligent consultation services for patients, automatic analysis of drug mechanisms of action for pharmaceutical manufacturers, and claim settlement robots for insurance companies serving different patients. However, training current medical artificial intelligence models requires a significant amount of time and resources, because:

1. Medical data contains many entry errors, including typing errors in medical text data (for example, in "the patient has a history of diabetes and has been treated with high blood", "hypertension" was wrongly entered as "high blood") and non-uniform data formats (for example, operation dates are recorded in various formats such as "month 1 of 2000", "2000.01.01" and "01/01/2020"). Conventional computer algorithms handle this kind of processing poorly, so a statistical analyst has to analyze and correct each medical data set and manually design rules for any new type of error that may occur in it. This consumes a great deal of manpower and time and slows down project development and progress.

2. Medical data contains a large amount of specialized data, and the training data required by different medical application projects differs greatly in format (for example, a medical keyword extraction model requires labels of the form "medical term + type", whereas a disease judgment model requires labels of the form "disease description + negative/positive"). The data therefore cannot be used directly to train a large language model; it must first be manually separated and given medical labels. Labeling work for medical projects can usually be done only by people with some medical knowledge, so qualified data preprocessing personnel are harder to recruit than for a general AI project.

3. The manpower required for data labeling varies greatly over the project cycle, which makes it difficult to allocate personnel reasonably. In the early stage of AI model training, a large amount of medical data needs to be labeled, so a large amount of manpower must be allocated. In the later stage of the project, however, the work consists mainly of fine-tuning data labels, and only a small amount of manpower is needed.

4. Because the volume of medical data is very large and medical expertise is required, and because annotators differ in their command of that expertise and in their working condition, labeling errors and low-quality labels are difficult to avoid when data is labeled manually, and they affect the quality of the final medical AI model.

Disclosure of Invention

The present invention aims to solve the above technical problems at least to some extent. Therefore, the invention aims to provide a medical data preprocessing method and a medical data preprocessing system.

The technical scheme adopted by the invention is as follows:

A medical data preprocessing method, comprising the following steps:

S1, removing irrelevant symbols from the medical data, and correcting errors in the text data of the medical data;

S2, segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain final word segmentation results;

S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation results based on the medical knowledge graph to obtain partially labeled medical data;

S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain completely labeled medical data.

Preferably, step S4 includes:

S41, shuffling the unlabeled fields in the partially labeled medical data, packaging them, and distributing them for manual labeling;

S42, acquiring the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.

Preferably, the label results are corrected in one of the following ways:

holding an intra-group competition, in which the staff in a group who process the same piece of data vote, the majority wins, and the minority's labels are regarded as errors; the working accuracy of each staff member is accumulated over time, and weighted voting is performed on the label results according to the reliability of each staff member;

or comparing the output of a label correction model with the original labels, and removing labels with an excessively high error rate;

or setting up a plurality of secondary word segmenters, obtaining label results with them, taking the weight given to each secondary word segmenter as a trainable parameter of an advanced deep learning model, training the advanced deep learning model on the existing results, comparing the training results with the existing labels, and removing labels with an excessively high error rate;

or using an unsupervised training model to make the labels of data in the same class consistent, and removing labels with an excessively high error rate.

Preferably, step S4 further includes step S43: the corrected label results are used to retrain the label correction model, the advanced deep learning model, and the unsupervised training model.

Preferably, the secondary word segmenter is obtained by training on the word segmentation results produced by the medical word segmentation models.

Preferably, step S3 further includes: labeling punctuation marks in the medical data with separator labels.

Preferably, step S0 is further included before step S1: translating foreign-language medical data into Chinese medical data.

A medical data preprocessing system, comprising:

a data cleaning module, used for removing irrelevant symbols from the medical data and correcting errors in the text data of the medical data;

a word segmentation module, used for segmenting the text data into different fields with a plurality of medical word segmenters to obtain word segmentation results, and inputting the word segmentation results into a secondary word segmenter to obtain final word segmentation results;

a label generation module, used for constructing a medical knowledge graph and labeling the medical fields in the final word segmentation results based on the medical knowledge graph to obtain partially labeled medical data;

and a group intelligence module, used for labeling the unlabeled fields in the medical data obtained by the label generation module to obtain completely labeled medical data.

Preferably, the group intelligence module is used for shuffling the unlabeled fields in the partially labeled medical data, packaging them and distributing them for manual labeling; and for acquiring the label results after manual labeling, correcting the label results, and removing low-quality labels from the label results.

Preferably, the correction of the label result is achieved by:

The beneficial effects of the invention are as follows:

the medical data preprocessing method provided by the invention can automatically complete simple tasks such as data translation, data cleaning and automatic spell checking, and also uses the medical knowledge graph for semi-automatic labeling, so that nearly 50% of the data labeling work can be completed automatically; it reduces the proportion of low-quality samples, overcomes the difficulty of carrying out labeling work across countries and languages, handles entry errors in medical data records, and shortens the preprocessing time of medical data by more than 70%;

ensemble learning is used to complete the word segmentation of the medical data, which makes full use of existing open-source medical word segmentation models, and the final word segmentation accuracy is ensured by training the secondary word segmenter;

the medical knowledge graph is used for semi-automatic labeling of the medical data, so that labeling can be completed automatically for approximately 50% of the data, reducing the manual labeling workload;

the group intelligence model is used for the final labeling of the medical data and fully takes into account the differences between people and the structural properties of the data, so that the final quality of the labeling work can be improved without additional human resources. This is important for training the final deep learning model (training on a moderate number of high-quality labels gives better results than training on a large number of poor-quality labels).

Drawings

Fig. 1 is a flow chart of a medical data preprocessing method of the present invention.

Fig. 2 is a flow chart of step S2 of the present invention.

Fig. 3 is a schematic diagram of the voting-based group intelligence of the present invention.

Fig. 4 is a schematic diagram of the group intelligence of the present invention based on voting and staff reliability.

Fig. 5 is a schematic diagram of the task-feature-based group intelligence of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.

It should also be appreciated that in the embodiments, the functions/acts may occur in an order different from that shown in the figures. For example, two steps shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

As shown in fig. 1 to 5, a medical data preprocessing method of the present embodiment includes the following steps:

S0, acquiring medical data; if the medical data is in a foreign language (such as English), a translation program is used to translate it into Chinese medical data. The translated medical data includes text data and non-text data.

S1, data cleaning: irrelevant symbols (such as '1 a'), blank characters and the like are removed from the medical data; in addition, to address entry errors in the text data, an automatic error correction model performs automatic spell checking and error correction on the text data.
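The rule-based part of step S1 can be illustrated with a minimal sketch. The symbol allow-list, the date layouts and the function name `clean_text` are assumptions made for illustration only; the patent does not specify them, and the spell checking performed by the automatic error correction model is not reproduced here.

```python
import re

# Assumption: anything that is not a word character, whitespace, or one of a
# few allowed punctuation marks counts as an "irrelevant symbol".
IRRELEVANT = re.compile(r"[^\w\s.,;:/%()\-，。；：、]")
WHITESPACE = re.compile(r"\s+")
# Assumption: year-first dates such as 2000.01.01, 2000-1-1 or 2000/01/01.
YEAR_FIRST_DATE = re.compile(r"\b(\d{4})[./-](\d{1,2})[./-](\d{1,2})\b")


def clean_text(text: str) -> str:
    """Drop irrelevant symbols, collapse whitespace, and unify year-first dates."""
    text = IRRELEVANT.sub("", text)
    text = WHITESPACE.sub(" ", text).strip()
    # Rewrite every recognized date as YYYY-MM-DD.
    return YEAR_FIRST_DATE.sub(
        lambda m: f"{int(m[1]):04d}-{int(m[2]):02d}-{int(m[3]):02d}", text
    )
```

Day-first or month-first layouts such as "01/01/2020" are deliberately left untouched in this sketch, because they cannot be disambiguated without knowing the recording convention of the source system.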

S2, using stacked generalization (Stacking), an ensemble learning (Ensemble Learning) technique, the text data is segmented into different fields by a plurality of medical word segmenters to obtain word segmentation results, and the word segmentation results are input into a secondary word segmenter to obtain final word segmentation results.

Each segmented field should preserve the integrity of medical terms. For example, the raw data "The patient's main symptoms are headache and nausea, and the diagnosis is migraine. It is recommended to take aspirin at onset." should be split into: "patient, main symptoms, headache, and, nausea, diagnosis, migraine, advice, at the time of onset, taking, aspirin". Combining several medical word segmenters allows the final word segmentation to be completed more accurately.

Specifically, as shown in Fig. 2, the Chinese training data is input into four existing medical word segmenters (the Jieba word segmenter, the bastrubertert Chinese part-of-speech tagger, the nestedNER Chinese medical term recognizer and the RANER Chinese medical term recognizer) to obtain the corresponding word segmentation results. These four medical word segmenters are open-source medical word segmentation models in wide use today; they segment medical data fairly well, but they still make segmentation errors.

The outputs of the four medical word segmenters are then used as new input to train a secondary word segmenter, which yields the final word segmentation result. The secondary word segmenter comprises trained deep learning models (such as a BERT-LSTM-CRF model or a Struct-LSTM-CRF model) and classical computer models (such as regular expressions). Training the secondary word segmenter on top of several open-source medical word segmentation models is a form of ensemble learning that builds on existing models while ensuring the final accuracy. When the secondary word segmenter is trained, the results for the training data are labeled manually, which further ensures its accuracy.
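The two-level structure of step S2 can be sketched as follows. This is only an illustration under stated assumptions: the base segmenters are passed in as plain functions, and the trained secondary word segmenter (a BERT-LSTM-CRF model or similar in the text) is replaced by a fixed weighted vote on cut positions. The names `boundaries` and `stack_segment`, the weights and the threshold are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set

Segmenter = Callable[[str], List[str]]


def boundaries(tokens: Sequence[str]) -> Set[int]:
    """Return the character offsets at which tokens end."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def stack_segment(text: str, segmenters: Dict[str, Segmenter],
                  weights: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Combine base segmentations by a weighted vote on each cut position."""
    total = sum(weights[name] for name in segmenters)
    score = [0.0] * (len(text) + 1)
    for name, seg in segmenters.items():
        tokens = seg(text)
        # Assumption: every base segmenter returns tokens that rebuild the text.
        if "".join(tokens) != text:
            raise ValueError(f"segmenter {name!r} did not preserve the text")
        for end in boundaries(tokens):
            score[end] += weights[name]
    result, start = [], 0
    for pos in range(1, len(text) + 1):
        # Always cut at the end of the text; elsewhere require enough votes.
        if pos == len(text) or score[pos] / total > threshold:
            result.append(text[start:pos])
            start = pos
    return result
```

In a setup closer to the description, the per-segmenter weights would be learned from manually labeled training data rather than fixed by hand.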

S3, constructing a medical knowledge graph, and labeling the medical fields in the final word segmentation result based on the medical knowledge graph. For the raw data in the example above, the medical fields are automatically labeled as shown in Table 1:

Table 1. Medical field labels

Medical field	Label
Headache	SYMPTOM
Nausea	SYMPTOM
Migraine	DISEASE
Aspirin	MEDICINE

Meanwhile, punctuation marks in the medical data are labeled as separators (SEP).

After the above steps, the medical data has been split into different fields, and some of the medical fields carry the corresponding labels, giving partially labeled medical data, as shown in Table 2:

Table 2. Partially labeled medical data

Field	Label
Patient
Main symptoms
are
Headache	SYMPTOM
and
Nausea	SYMPTOM
，	SEP
Diagnosis
is
Migraine	DISEASE
。	SEP
Advice
at
onset
time
taking
Aspirin	MEDICINE
。	SEP
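The labeling in step S3 can be pictured as a lookup of each segmented field against the knowledge graph. In the sketch below the graph is flattened into a hypothetical dictionary `KNOWLEDGE_GRAPH` containing only the four entities from Table 1, and the function name `label_fields` is likewise an assumption; a real medical knowledge graph would hold entities, types and relations at a much larger scale.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical flattened knowledge graph: entity name -> entity type.
KNOWLEDGE_GRAPH: Dict[str, str] = {
    "headache": "SYMPTOM",
    "nausea": "SYMPTOM",
    "migraine": "DISEASE",
    "aspirin": "MEDICINE",
}
# Punctuation marks that receive the separator label.
PUNCTUATION = set(",.;:!?，。；：！？、")


def label_fields(fields: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Label known medical fields and punctuation; leave the rest as None."""
    labeled = []
    for field in fields:
        if field in PUNCTUATION:
            labeled.append((field, "SEP"))
        else:
            # Fields missing from the graph stay unlabeled for step S4.
            labeled.append((field, KNOWLEDGE_GRAPH.get(field.lower())))
    return labeled
```

Fields whose label is `None` correspond to the rows of Table 2 with an empty label column, which are passed on for manual labeling.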

S4, labeling the unlabeled fields in the partially labeled medical data based on group intelligence to obtain completely labeled medical data. Using the group intelligence model improves the accuracy of the overall model without additional human resources.

With the spread of artificial intelligence, deep learning demands ever larger volumes of data, so a working model in which a small number of people do the labeling no longer suits the scale of current projects. At the same time, because the labeling workload is extremely unbalanced between the early and late stages of a project, employing a large labeling workforce would be a great waste of human resources. The invention borrows from the way knowledge graphs are built: websites such as Wikipedia and Baidu Encyclopedia draw on the wisdom of all people to contribute knowledge and set up supervision to verify the knowledge entered, thereby training their own knowledge graph models. Accordingly, the labeling work draws on the intelligence of the group: it is broken up into small parts to speed up its progress, and the submitted labels are judged using existing deep learning models, classical computer models and unsupervised training models. Step S4 specifically includes the following steps:

S41, the unlabeled fields in the partially labeled medical data are shuffled, packaged, and distributed to the labeling staff for manual labeling.
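A minimal sketch of the shuffle-and-distribute step S41 is given below. It assumes that each field is sent to several staff members so that their answers can later be compared by voting; the redundancy level, the round-robin assignment and the name `distribute` are illustrative assumptions rather than details taken from the description.

```python
import random
from typing import Dict, List, Sequence


def distribute(fields: Sequence[str], staff: Sequence[str],
               redundancy: int = 3, seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle unlabeled fields and give each one to `redundancy` distinct staff."""
    if redundancy > len(staff):
        raise ValueError("redundancy cannot exceed the number of staff")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(fields)
    rng.shuffle(shuffled)
    packages: Dict[str, List[str]] = {name: [] for name in staff}
    for i, field in enumerate(shuffled):
        # Rotate through the staff list so the load stays balanced.
        for k in range(redundancy):
            packages[staff[(i + k) % len(staff)]].append(field)
    return packages
```

Shuffling before packaging means that no single staff member sees a whole record in order, and the overlap between packages is what makes the later intra-group vote possible.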

S42, the labels are collected to obtain the label results after manual labeling, the label results are corrected, and low-quality labels (that is, labels of poor quality) are removed from the label results.

The correction of the label results in step S42 is achieved in one or more of the following four ways:

1. An intra-group competition is held: the staff in a group who process the same piece of data vote, the majority wins, and the minority's labels are regarded as errors, as shown in Fig. 3. Over time, the working accuracy of each staff member is tallied, and weighted voting is performed on the label results according to each staff member's reliability, as shown in Fig. 4;

2. The output of the existing label correction model is compared with the original labels, and labels with a high error rate are removed;

3. A plurality of the secondary word segmenters from step S2 are set up and used to obtain label results; the weight given to each secondary word segmenter is taken as a trainable parameter of an advanced deep learning model, the advanced deep learning model is trained on the existing results, the training results are compared with the existing labels, and labels with a high error rate are removed;

4. Taking the structural properties of the raw data into account, unsupervised clustering is performed. For example, the words contained in two sentences are analyzed; if the words are highly similar and the semantics are close, the labels of the two sentences should be the same or similar. An unsupervised training model is used to make the labels of data in the same class consistent, and labels with an excessively high error rate are removed, as shown in Fig. 5.
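The first correction approach (majority voting followed by reliability-weighted voting) can be sketched as follows. The data layout `Votes`, the function names, and the choice to measure reliability as the fraction of a staff member's labels that agree with the majority are assumptions for illustration; the description does not fix a particular formula.

```python
from collections import Counter, defaultdict
from typing import Dict

Votes = Dict[str, Dict[str, str]]  # item -> {staff member -> label}


def majority_vote(votes: Votes) -> Dict[str, str]:
    """Pick the most frequent label for each item."""
    return {item: Counter(v.values()).most_common(1)[0][0]
            for item, v in votes.items()}


def reliability(votes: Votes, consensus: Dict[str, str]) -> Dict[str, float]:
    """Fraction of each staff member's labels that matched the consensus."""
    hits, total = defaultdict(int), defaultdict(int)
    for item, v in votes.items():
        for staff, label in v.items():
            total[staff] += 1
            hits[staff] += label == consensus[item]
    return {s: hits[s] / total[s] for s in total}


def weighted_vote(votes: Votes, weight: Dict[str, float]) -> Dict[str, str]:
    """Pick, for each item, the label with the largest total staff weight."""
    result = {}
    for item, v in votes.items():
        score = defaultdict(float)
        for staff, label in v.items():
            # Staff without a track record contribute no weight.
            score[label] += weight.get(staff, 0.0)
        result[item] = max(score, key=score.get)
    return result
```

A typical use would run `majority_vote` on past batches, derive weights with `reliability`, and then apply `weighted_vote` to new batches, so that a consistently accurate staff member can outweigh a less reliable one when only two people disagree.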

S43: the corrected label results are used to retrain a label correction model, an advanced deep learning model, and an unsupervised training model.

The embodiment also provides a medical data preprocessing system, which comprises:

the translation module, used for translating foreign-language medical data into Chinese medical data;

the data cleaning module, used for removing irrelevant symbols from the medical data and correcting errors in the text data of the medical data; the data cleaning module comprises an automatic error correction model for correcting the text data;

and the group intelligence module, used for labeling the unlabeled fields in the medical data obtained by the label generation module to obtain completely labeled medical data. Specifically, the group intelligence module shuffles the unlabeled fields in the partially labeled medical data, packages them and distributes them for manual labeling; it then acquires the label results after manual labeling, corrects the label results, and removes low-quality labels (that is, labels of poor quality) from the label results.

The group intelligent module corrects the label result by the following modes:

The invention is not limited to the optional embodiments described above. Anyone may derive products in various other forms in light of the present invention; however, regardless of any changes in shape or structure, all technical solutions falling within the scope defined by the claims of the present invention fall within the scope of protection of the present invention.

Claims

1. A medical data preprocessing method, characterized by comprising the steps of:

2. The medical data preprocessing method according to claim 1, characterized in that: step S4 includes:

S42, acquiring the label results after manual labeling, and correcting the label results.

3. The medical data preprocessing method according to claim 2, characterized in that: correcting the label result is achieved by:

4. The medical data preprocessing method according to claim 3, characterized in that: step S4 further includes step S43: the corrected label results are used to retrain the label correction model, the advanced deep learning model, and the unsupervised training model.

5. The medical data preprocessing method according to claim 1, characterized in that: the secondary word segmenter is obtained by training on the word segmentation results produced by the medical word segmentation models.

6. The medical data preprocessing method according to claim 1, characterized in that: step S3 further includes: labeling punctuation marks in the medical data with separator labels.

7. The medical data preprocessing method according to claim 1, characterized in that: step S0 is further included before step S1: translating foreign-language medical data into Chinese medical data.

8. A medical data preprocessing system, comprising:

9. The medical data preprocessing system of claim 8, wherein: the group intelligence module is used for shuffling the unlabeled fields in the partially labeled medical data, packaging them and distributing them for manual labeling; and for acquiring the label results after manual labeling and correcting the label results.

10. The medical data preprocessing system of claim 9, wherein: correcting the label result is achieved by: