CN112768059A

CN112768059A - Method for standardizing grade data in medical data

Info

Publication number: CN112768059A
Application number: CN202110097944.5A
Authority: CN
Inventors: 李红良; 秦娟娟; 张晓晶
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2021-05-07
Anticipated expiration: 2041-01-25
Also published as: CN112768059B

Abstract

The invention discloses a method for standardizing grade data in medical data, comprising the following steps: acquiring original physical examination data columns from different data source units, standardizing the column names through a standard glossary, and determining the grading rule; dividing the data into two types according to whether they are purely numerical; automatically converting the purely numerical content of a data column into the corresponding grade form according to the index reference range (class A mapping rule replacement); replacing the non-purely-numerical content of the data column with the corresponding grade form through a standard mapping library (class B mapping rule replacement); merging the cleaning results of the class A and class B rules, generating frequency statistics of the cleaned grade forms, and performing quality control on the cleaning results; and correcting conflicting items after the results are merged. The invention converts grade data recorded in different forms into a regular grade form, thereby facilitating subsequent mining and analysis.

Description

Method for standardizing grade data in medical data

Technical Field

The invention relates to the technical field of medical big data, in particular to a method for standardizing grade data in medical data.

Background

In recent years, big data science in China has developed rapidly, yet many technical bottlenecks remain in the field of medical and health big data. One urgent problem is how to manage massive volumes of health data effectively so that useful information can be mined to benefit human health. Physical examination data is an important source of medical and health data and covers a very wide population. Effectively cleaning and mining physical examination data can provide an important scientific reference for fields such as chronic disease prevention and control in China.

Physical examination data mainly comprises three data types: text data, measurement data and grade data. Grade data refers to data with an inherent ordering, for example clinical efficacy graded as cured, markedly effective, improved and ineffective; clinical test results graded with symbols such as -, ±, +, ++ and +++; and the severity of symptoms such as pain graded as 0 (no pain), 1 (mild), 2 (moderate) and 3 (severe). Because different units use different standards and description styles, grade data is very disorderly. For example, the same grade-type index may be recorded as "-, ±, +, ++", as "negative, weak positive, strong positive", or as "0.00(-), 10 (weak positive), 500(+), >10000", among other forms, which makes it difficult to turn the data into valuable information through analysis. The present invention addresses this problem.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method for standardizing grade data in medical data, aiming at the defects in the prior art.

The technical scheme adopted by the invention for solving the technical problems is as follows:

the invention provides a method for standardizing grade data in medical data, which comprises the following steps:

Step 1: acquiring original physical examination data columns from different data source units, and standardizing the column names through a standard glossary to obtain standardized grade data columns;

Step 2: determining the grade data column to be cleaned and its grading rule;

Step 3: dividing the data in the grade data column into two types according to whether they are purely numerical, cleaning the purely numerical data according to the class A mapping rule, and cleaning the non-purely-numerical data according to the class B mapping rule;

Step 4: class A mapping rule: automatically converting the purely numerical content of the data column into the corresponding grade form according to the index reference range;

Step 5: class B mapping rule: replacing the non-purely-numerical content of the data column with the corresponding grade form through a standard mapping library;

Step 6: after cleaning with the class A and class B mapping rules, merging the cleaning results, computing frequency statistics of the grade forms, and performing quality control on the cleaning results;

Step 7: merging the grade replacement results, correcting the conflicting items after merging, and outputting the corrected standardized data.

Further, the specific method of the column name standardization in step 1 of the present invention is:

column name standardization matches each data column with a corresponding standard term, and the data types of the standard terms comprise text data standard terms, measurement data standard terms and grade data standard terms.

Further, the specific method of step 2 of the present invention is:

data columns standardized to grade data terms enter the grade data cleaning process; the standard glossary sets the grading standard corresponding to each grade data term, and the grading standard expresses the content of the grade data as numbers, so that grade data in various forms can be standardized under one unified numerical standard.

Further, the specific method of step 4 of the present invention is:

the class A mapping rule uses an algorithm to convert the normal reference range [a, b] of the index, as given by the data source unit, into a uniform interval form: grade form 1: (-∞, a) | grade form 2: [a, b] | grade form 3: (b, +∞); based on this rule, the purely numerical content of the grade data column is replaced with grades by the class A mapping rule algorithm.
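The three-interval form of the class A rule can be illustrated with a minimal Python sketch. The function name `three_interval_rule` and the reference range used in the usage lines are hypothetical and chosen only for illustration; the source does not prescribe an implementation.

```python
def three_interval_rule(a, b):
    """Build a class A style rule from a normal reference range [a, b].

    Returns a function mapping a numeric value to:
      1 for (-inf, a), 2 for [a, b], 3 for (b, +inf).
    """
    if a > b:
        raise ValueError("reference range must satisfy a <= b")

    def grade(x):
        if x < a:
            return 1  # below the normal reference range
        if x <= b:
            return 2  # inside the normal reference range, bounds included
        return 3      # above the normal reference range

    return grade


# Hypothetical reference range, for illustration only.
rule = three_interval_rule(3.9, 6.1)
grades = [rule(v) for v in (3.0, 3.9, 6.1, 7.2)]
```

Both bounds of [a, b] fall in grade form 2, matching the closed interval in the rule above.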

Further, the specific method of step 5 of the present invention is:

the class B mapping rule is a professional database compiled according to the national clinical laboratory guidelines, and its basic structure is: standard term name - grading rule - original form - corresponding grade replacement form; based on this rule, the non-purely-numerical content is replaced with grades by the class B mapping rule algorithm.

Further, the specific method of step 6 of the present invention is:

an algorithm counts the frequency of each grade form in each data column under each standard term and generates a grade form frequency statistics table with the structure: standard term name - data source unit/data column - grade form frequency - grade form percentage. Quality control of the grade cleaning results is achieved by checking whether the distribution of grade forms in any data column under the same standard term is abnormal.

Further, the specific method of step 7 of the present invention is:

all data columns under the same standard term are merged; where the same patient has different grade forms in two or more data columns under the same standard term, the record is marked as a merge conflict, and the single correct grade form is finally selected from the conflicting forms.

The invention has the following beneficial effects: the method standardizes grade-type physical examination data to produce orderly, uniform numerical results, thereby greatly improving the orderliness and minability of grade-type physical examination data.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of level data cleaning according to an embodiment of the present invention;

FIG. 2 is a flow chart of an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, the purely numerical content of a grade data column is automatically converted into the corresponding grade form through the index reference range (class A mapping rule replacement); the non-purely-numerical content of the data column is replaced with the corresponding grade form through the standard mapping library (class B mapping rule replacement); the grade replacement results are then merged, quality control is performed, and conflicting items in the merged results are corrected.

In the example shown in FIG. 2, the column of grade data has the standard term stool analysis-red blood cells, and its original forms include symbolic forms (-, ±, + and multi-plus forms such as ++ and +++), textual forms (negative, weak positive, strong positive) and the numeric values 0, 2, 4, 12 and 18. Through the grade cleaning process, every original form is finally replaced by the numerical standard grade form.

The method comprises the following steps:

Step 1: column name standardization is performed on the original data columns through the standard glossary. Column name standardization matches each data column with the most appropriate standard term. The data types of the standard terms include text data standard terms, measurement data standard terms and grade data standard terms.
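A minimal sketch of column name standardization follows, assuming the standard glossary can be represented as a lookup from raw column names to a standard term and its data type. The raw names, the non-grade standard terms and the helper names are hypothetical; how the "most appropriate" term is actually chosen is not specified in the source, so exact matching after whitespace and case normalization is used here as a stand-in.

```python
# Hypothetical glossary: normalized raw column name -> (standard term, data type).
STANDARD_GLOSSARY = {
    "stool rbc": ("stool analysis-red blood cells", "grade"),
    "fecal red blood cell": ("stool analysis-red blood cells", "grade"),
    "sbp": ("systolic blood pressure", "measurement"),
    "ultrasound conclusion": ("abdominal ultrasound-conclusion", "text"),
}


def standardize_column_name(raw_name):
    """Return (standard_term, data_type), or None when no term matches."""
    key = " ".join(raw_name.strip().lower().split())
    return STANDARD_GLOSSARY.get(key)


def select_grade_columns(raw_names):
    """Keep only the columns whose standard term is a grade data term."""
    selected = {}
    for name in raw_names:
        match = standardize_column_name(name)
        if match is not None and match[1] == "grade":
            selected[name] = match[0]
    return selected


grade_columns = select_grade_columns(["SBP", "Fecal Red Blood Cell", "unknown"])
```

Columns that match no glossary entry return `None`, so they can be set aside for manual review instead of being forced onto a term.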

Step 2: the standard glossary also defines the grading standard corresponding to each grade data term. The grading standard expresses the content of the grade data as numbers, so that grade data in various forms can be standardized under one unified numerical standard. The standard term for this data column is stool analysis-red blood cells, a grade data term whose grading standard is: 1: negative (-), 2: weak positive (±), 3: positive (+), 4: strong positive (++), 5: strong positive (+++), 6: strong positive (++++).
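The grading standard for a grade data term could be stored, for instance, as a mapping from numeric grade to its description. The sketch below is an assumption about representation only; the symbol shown for grade 6 (++++) is assumed from the ordering of the other grades, and the helper names are hypothetical.

```python
# Grading standards keyed by standard term; each maps numeric grade -> description.
GRADING_STANDARDS = {
    "stool analysis-red blood cells": {
        1: "negative (-)",
        2: "weak positive (±)",
        3: "positive (+)",
        4: "strong positive (++)",
        5: "strong positive (+++)",
        6: "strong positive (++++)",  # symbol assumed, see lead-in
    },
}


def valid_grades(standard_term):
    """Return the sorted numeric grades defined for a grade data term."""
    return sorted(GRADING_STANDARDS[standard_term])


def is_valid_grade(standard_term, grade):
    """Check that a cleaned value is one of the grades the term allows."""
    return grade in GRADING_STANDARDS.get(standard_term, {})
```

A check such as `is_valid_grade` can be run on every cleaned value so that anything outside the numeric standard is caught before merging.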

Step 3: the data in the column are divided into two types according to whether they are purely numerical: (1) purely numerical: 0, 2, 4, 12 and 18, which are replaced with grades under the class A rule; (2) non-purely-numerical: the symbolic forms (-, ±, + and multi-plus forms) and the textual forms negative, weak positive and strong positive, which are replaced with grades under the class B rule. Purely numerical and non-purely-numerical content have different characteristics, so cleaning them under different rules improves both efficiency and accuracy.
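One way to separate the two types is a strict pattern test, sketched below. The regular expression and function names are assumptions; the source does not define "pure numerical" formally, so here a value counts as purely numerical only if the whole string is an optionally signed decimal number.

```python
import re

# Whole-string match for an optionally signed decimal number, e.g. "12", "0.00", "-3.5".
_PURE_NUMERIC = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def is_pure_numeric(value):
    """True only when the entire stripped string is a number."""
    return _PURE_NUMERIC.fullmatch(value.strip()) is not None


def split_by_type(values):
    """Return (numeric_values, non_numeric_values), preserving input order."""
    numeric, non_numeric = [], []
    for v in values:
        (numeric if is_pure_numeric(v) else non_numeric).append(v)
    return numeric, non_numeric


numeric, non_numeric = split_by_type(
    ["-", "±", "+", "++", "negative", "0", "2", "4", "12", "18", "0.00(-)"]
)
```

Because the match must cover the whole string, symbols such as "-" and "+" and mixed forms such as "0.00(-)" are routed to the class B rule.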

Step 4: the computer program generates the class A mapping rule from the index reference ranges given by the data source unit (usually the hospital examination center). The class A mapping rule converts the reference ranges given by the data source unit into a uniform interval form expressed in a machine-readable way, and then uses it to replace the purely numerical content of the grade data column. If the data source unit gives the following reference ranges for the data column: -: 0-3; ±: 3-5; +: 5-10; ++: 10-15; +++: 15-20; ++++: 20-∞, then the automatically generated class A mapping table records the rule as 1: [0, 3); 2: [3, 5); 3: [5, 10); 4: [10, 15); 5: [15, 20); 6: [20, +∞). Under this rule, 0 and 2 are replaced by grade 1, 4 by grade 2, 12 by grade 4, and 18 by grade 5.
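The interval lookup in this example can be sketched with the standard-library `bisect` module. The half-open intervals [0, 3), [3, 5), [5, 10), [10, 15), [15, 20), [20, +∞) follow the example; the function name and the handling of values below the lowest bound are assumptions.

```python
from bisect import bisect_right

# Lower bound of each half-open interval; grade i covers [bound i, bound i+1).
LOWER_BOUNDS = [0, 3, 5, 10, 15, 20]
GRADES = [1, 2, 3, 4, 5, 6]


def class_a_grade(value):
    """Map a purely numerical string or number to its grade.

    Returns None for values below the lowest bound (assumed here to
    need manual review, since the example defines no grade for them).
    """
    x = float(value)
    if x < LOWER_BOUNDS[0]:
        return None
    # bisect_right counts the bounds that are <= x; the last one is the interval.
    return GRADES[bisect_right(LOWER_BOUNDS, x) - 1]


cleaned = [class_a_grade(v) for v in ("0", "2", "4", "12", "18")]
```

Applied to the example values, this reproduces the replacements described in the text: 0 and 2 become 1, 4 becomes 2, 12 becomes 4, and 18 becomes 5.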

Step 5: the non-purely-numerical content of the grade data column is replaced under the class B mapping rule. The class B mapping rule is a professional database compiled according to the national clinical laboratory guidelines, and its basic structure is: standard term name - grading rule - original form - grade replacement form. For example, for the standard term stool analysis-red blood cells, the class B mapping rule records the grading standard together with the correspondence between each original form and its grade replacement form: forms such as "-", "negative" and "(-)" correspond to grade form "1"; forms such as "±", "weak positive" and "(±)" correspond to grade form "2"; and so on. Using the class B mapping table, the program recognizes the original forms in the data to be cleaned and converts them into the corresponding grade forms. In this example, the class B mapping table replaces the negative forms with 1, the weak positive forms with 2, the positive forms with 3, and the strong positive forms with 4, 5 and 6 according to the number of plus signs.

TABLE 1 class B mapping table
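As a rough illustration of how such a mapping table might be applied, the sketch below stores the original-form-to-grade correspondence in a nested dictionary. The entries are a small hypothetical subset, not the actual mapping library; the textual form "strong positive" is left out on purpose because the grading standard uses that label for more than one grade, and the source does not say which grade it maps to.

```python
# Hypothetical subset of a class B mapping table:
# standard term -> {normalized original form: grade}.
CLASS_B_MAPPING = {
    "stool analysis-red blood cells": {
        "-": 1, "(-)": 1, "negative": 1,
        "±": 2, "(±)": 2, "weak positive": 2,
        "+": 3, "(+)": 3, "positive": 3,
        "++": 4,
        "+++": 5,
        "++++": 6,
    },
}


def normalize_form(raw):
    """Trim, lowercase and collapse internal whitespace before lookup."""
    return " ".join(raw.strip().lower().split())


def class_b_grade(standard_term, raw):
    """Return the grade for an original form, or None if it is not in the table."""
    return CLASS_B_MAPPING.get(standard_term, {}).get(normalize_form(raw))


term = "stool analysis-red blood cells"
cleaned = [class_b_grade(term, v) for v in ("-", " Negative ", "±", "++", "trace")]
```

Forms that are missing from the table come back as `None`, which makes unmapped original forms easy to list and add to the mapping library.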

Step 6: after the class A and class B mapping replacements are complete, the program merges the data produced by the two cleaning rules and counts the frequency of each grade form in each data column under each standard term, generating a grade form frequency statistics table. Using this table, quality control of the grade replacement results is achieved by checking whether the distribution of grade forms in any data column under the same standard term is abnormal.

TABLE 2 frequency statistics table for grade morphology
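A frequency table with the structure standard term name - data source unit/data column - grade form frequency - grade form percentage could be produced along the following lines. The column names and grade values are invented for illustration, and the row layout is one possible choice.

```python
from collections import Counter


def grade_frequency_table(standard_term, columns):
    """Build rows of (term, source/column, grade, frequency, percentage).

    `columns` maps a "source unit/column" label to its list of cleaned
    integer grades. Percentages are rounded to one decimal place.
    """
    rows = []
    for column_name, grades in columns.items():
        counts = Counter(grades)
        total = sum(counts.values())
        for grade in sorted(counts):
            pct = round(100.0 * counts[grade] / total, 1)
            rows.append((standard_term, column_name, grade, counts[grade], pct))
    return rows


# Hypothetical cleaned data from two source columns.
term = "stool analysis-red blood cells"
table = grade_frequency_table(term, {
    "hospital_a/stool_rbc": [1, 1, 1, 2],
    "hospital_b/rbc": [1, 3],
})
```

Reading the percentages side by side per column is what supports the quality-control step: a column whose grade distribution differs sharply from the others under the same standard term is a candidate for a wrong reference range or a bad mapping entry.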

Step 7: there may be multiple data columns under the same standard term (stool analysis-red blood cells), and data columns standardized to the same standard term are merged. If, after merging, the grade forms under the same standard term are inconsistent for the same patient ID, the record is marked as a merge conflict, and the single correct grade form is finally selected from the conflicting forms.
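Conflict detection during merging can be sketched as grouping cleaned grades by patient ID and standard term. The record layout and function name are assumptions, and the sketch only flags conflicts; it does not pick the correct grade, since the source does not specify how that choice is made.

```python
from collections import defaultdict


def merge_columns(records):
    """Merge cleaned grades from several columns of the same standard term.

    `records` is an iterable of (patient_id, standard_term, grade) tuples.
    Returns (merged, conflicts):
      merged    maps (patient_id, term) -> the single agreed grade
      conflicts maps (patient_id, term) -> sorted list of differing grades
    """
    seen = defaultdict(set)
    for patient_id, term, grade in records:
        if grade is not None:  # skip values that were not cleaned to a grade
            seen[(patient_id, term)].add(grade)

    merged, conflicts = {}, {}
    for key, grades in seen.items():
        if len(grades) == 1:
            merged[key] = next(iter(grades))
        else:
            conflicts[key] = sorted(grades)
    return merged, conflicts


# Hypothetical records drawn from two columns under the same standard term.
t = "stool analysis-red blood cells"
merged, conflicts = merge_columns([
    ("p1", t, 1), ("p1", t, 1),
    ("p2", t, 2), ("p2", t, 4),
    ("p3", t, None),
])
```

Here patient `p1` merges cleanly to grade 1, while `p2` is reported as a merge conflict with grades 2 and 4 for later correction.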

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. A method for normalizing grade data in medical data, the method comprising the steps of:

Step 5: class B mapping rule: replacing the non-purely-numerical content of the data column with the corresponding grade form through a standard mapping library;

Step 6: after cleaning with the class A and class B mapping rules, merging the cleaning results to generate a grade form frequency table, and performing quality control on the cleaning results;

2. The method for normalizing grade data in medical data according to claim 1, wherein the specific method of the column name standardization in step 1 is as follows:

3. The method for normalizing grade data in medical data according to claim 1, wherein the specific method of step 2 is as follows:

4. The method for normalizing grade data in medical data according to claim 1, wherein the specific method of step 4 is as follows:

5. The method for normalizing grade data in medical data according to claim 1, wherein the specific method of step 5 is as follows:

6. The method for normalizing grade data in medical data according to claim 1, wherein the specific method of step 6 is as follows:

an algorithm counts the frequency of each grade form in each data column under each standard term and generates a grade form frequency statistics table with the structure: standard term name - data source unit/data column - grade form frequency - grade form percentage; and, based on the grading standard of the standard glossary and on manual labeling, an algorithm judges whether the distribution of grade forms in any data column under the same standard term is abnormal, thereby achieving quality control of the grade cleaning results.

7. The method for normalizing grade data in medical data according to claim 1, wherein the step 7 is performed by:

merging all data columns under the same standard term through an algorithm, marking as a merge conflict any case in which the same patient has different grade forms in two or more data columns under the same standard term, and finally selecting the single correct grade form from the conflicting forms.