Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects in the prior art, a method for cleaning illegal words in text data of health examination big data.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention provides a method for cleaning illegal words in text data of health examination big data, which comprises the following steps:
step 1, collecting original physical examination data, obtaining the illegal data of different physical examination items through manual labeling and machine training and mining, and establishing a text data illegal lexicon;
step 2, inputting health examination text data to be cleaned, which contains illegal words, according to a specified data structure, wherein the health examination text data is in a list structure, the first column is a physical examination number, and the subsequent columns are corresponding physical examination index items and corresponding physical examination description results;
step 3, traversing the content of each row of physical examination index items, looking up the illegal words for each physical examination index item in the text data illegal lexicon, performing algorithm matching on the health examination text fields according to the different types of illegal forms, and judging whether each health examination text field is in an illegal form;
step 4, deleting the matched illegal words from the health examination text data by using an algorithm to obtain clean health examination text data, and storing the physical examination index items of the health examination data, the matched illegal words and the corresponding physical examination numbers as quality control file 1 for subsequent text source tracing; meanwhile, comparing the data statistics before and after the cleaning of the illegal words to obtain quality control file 2 for quality control of subsequent text cleaning;
step 5, checking whether the output health examination text data is correct or not according to the cleaned health examination data, the quality control file 1 and the quality control file 2;
step 6, checking the results in the output quality control file 1 of the health examination text data, and updating the text data illegal lexicon through manual labeling and machine identification;
step 7, completing the cleaning of the illegal words.
Further, the data structure of the health examination text data in step 2 of the present invention includes: the physical examination number, the text data standard term and the text data physical examination result; the data are divided into a plurality of parts by using a function and are then processed in a multi-threaded manner.
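The splitting and multi-threaded processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the row layout (physical examination number, standard term, result) follows the text, while the helper names `split_rows`, `clean_chunk` and `process_in_threads`, and the placeholder cleaning rule, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Each row: (physical examination number, standard term, examination result).
# The layout follows the text; all helper names are hypothetical.

def split_rows(rows, parts):
    """Split rows into at most `parts` contiguous chunks of near-equal size."""
    parts = max(1, min(parts, len(rows) or 1))
    size, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

def clean_chunk(chunk):
    """Placeholder cleaner: blanks results that are exactly 'not done'."""
    return [(pid, term, "" if result == "not done" else result)
            for pid, term, result in chunk]

def process_in_threads(rows, parts=4):
    """Clean each chunk in its own worker and merge results in input order."""
    chunks = split_rows(rows, parts)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        cleaned = pool.map(clean_chunk, chunks)  # map preserves chunk order
        return [row for chunk in cleaned for row in chunk]

rows = [
    ("P001", "abdominal ultrasound-liver", "not done"),
    ("P002", "abdominal ultrasound-liver", "fatty liver"),
    ("P003", "head CT", "no abnormality"),
]
merged = process_in_threads(rows, parts=2)
```

Because `pool.map` yields results in the order of its inputs, the merged output keeps the original row order regardless of which worker finishes first.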
Further, the specific method for performing algorithm matching on the health examination text data by using the existing text data illegal lexicon in step 3 of the invention is as follows:
matching the general illegal forms corresponding to the index classification against the data, wherein the text data illegal lexicon comprises: index classification, general illegal forms, exclusion words and matching rules; wherein:
the index classification contains 105 index classifications, including: imaging examination-ultrasound, imaging examination-CT, imaging examination-X-ray examination, physical examination-chest, physical examination-head and neck, inquiry;
the general illegal forms comprise no fewer than 200 illegal forms, including: not detected, not done, see report;
an exclusion word is text that is meaningful but partially coincides with a general illegal form;
the matching rules include: 1. a special form; 2. regular matching; 3. the full fields are completely matched; wherein:
"1" represents data in a particular form, i.e., a pure numerical form, containing only numerical values;
"2" represents a regular match, i.e., the data contains common illegal words;
"3" represents a full field match, i.e., the data content is completely consistent with the general illegal word;
if the matching rule is met, judging the form to be illegal, and if the matching rule is not met, judging the form to be non-illegal; matching by adopting a text illegal word matching algorithm based on a composite rule, which comprises the following specific steps:
algorithm matching rule 1 matches data that contains only numerical values, and the specific matching mode is [0-9];
algorithm matching rule 2 matches data that contains a general illegal word; the illegal word rules are specific character strings, the character string rule set is built in advance as a set of regular expressions, and the specific matching mode is: [.*illegal word.*];
algorithm matching rule 3 matches the full field, and the specific matching mode is: [illegal word].
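The three matching rules above can be illustrated with a short sketch. It is an assumption-laden illustration rather than the algorithm of the invention: rule 1 is read here as "the whole field consists of digits" (decimals and units are not treated as pure numerical forms), and the constant and function names are hypothetical.

```python
import re

SPECIAL_FORM = 1   # rule 1: the field contains only digits
REGEX_MATCH = 2    # rule 2: the field contains a general illegal word
FULL_MATCH = 3     # rule 3: the whole field equals a general illegal word

def match_rule(text, illegal_word=None, rule=SPECIAL_FORM):
    """Return True if `text` is an illegal form under the given rule."""
    value = text.strip()
    if rule == SPECIAL_FORM:
        return re.fullmatch(r"[0-9]+", value) is not None
    if illegal_word is None:
        raise ValueError("rules 2 and 3 need an illegal word")
    if rule == REGEX_MATCH:
        # re.escape keeps lexicon entries from being read as regex syntax
        return re.search(re.escape(illegal_word), value) is not None
    if rule == FULL_MATCH:
        return value == illegal_word
    raise ValueError(f"unknown rule: {rule}")
```

For instance, `match_rule("112221345")` is true under rule 1, while `match_rule("not done today", "not done", FULL_MATCH)` is false because the field contains more than the illegal word.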
Further, the specific method for deleting the matched illegal words by using the algorithm in the step 4 of the present invention is as follows:
deleting the matched illegal words by using an algorithm, and outputting the deleted illegal words to a quality control file 1 and a quality control file 2 for a medical team to check whether the cleaned text data is correct;
quality control file 1 comprises a plurality of columns of data with the following data structure:
data source: the serial number of the organization carrying out the cleaning of illegal words in the text;
column number: the number of the column of the health examination text data that contains the general illegal form;
standard term: the standard term of the column of the health examination text data that contains the general illegal form;
index classification: the index classification of the standard term of the column of the health examination text data that contains the general illegal form;
illegal value: the general illegal form in the text data illegal lexicon;
whether an exclusion word: whether the text is an exclusion word for the illegal word in the text data illegal lexicon, an exclusion word being text that is meaningful but partially overlaps a general illegal form;
illegal form category: the matching rule, the matching rules comprising: 1. a special form; 2. regular matching; 3. the full fields are completely matched; wherein: "1" represents data in a special form, i.e., a pure numerical form containing only numerical values; "2" represents a regular match, i.e., the data contains a general illegal word; "3" represents a full field match, i.e., the data content is completely consistent with the general illegal word; if the matching rule is met, the form is judged to be illegal, and if the matching rule is not met, the form is judged to be non-illegal;
frequency of occurrence: the number of times the general illegal form appears;
physical examination number: the physical examination numbers under which the general illegal form appears in the column;
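One record of quality control file 1 could be represented as sketched here. The English field names, the `QcRecord1` class and the `add_hit` helper are hypothetical renderings of the fields listed above, and keying records by column number and illegal value is an assumption made for the illustration.

```python
from dataclasses import dataclass, field

@dataclass
class QcRecord1:
    data_source: str           # serial number of the cleaning organization
    column_number: int         # column in which the illegal form occurred
    standard_term: str         # standard term of that column
    index_classification: str  # index classification of the standard term
    illegal_value: str         # general illegal form from the lexicon
    is_exclusion_word: bool    # True if the text is an exclusion word
    rule: int                  # matching rule: 1, 2 or 3
    frequency: int = 0         # how often the illegal form occurred
    exam_numbers: list = field(default_factory=list)

def add_hit(records, key_fields, exam_number):
    """Count one hit, creating the record the first time it is seen."""
    key = (key_fields["column_number"], key_fields["illegal_value"])
    rec = records.get(key)
    if rec is None:
        rec = records[key] = QcRecord1(**key_fields)
    rec.frequency += 1
    rec.exam_numbers.append(exam_number)
    return rec

records = {}
hit = dict(data_source="01", column_number=2,
           standard_term="abdominal ultrasound-liver",
           index_classification="imaging examination-ultrasound",
           illegal_value="not done", is_exclusion_word=False, rule=3)
add_hit(records, hit, "P001")
add_hit(records, hit, "P007")
```

Keeping the physical examination numbers alongside each illegal value is what allows a deleted value to be traced back to the examinee it came from.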
quality control file 2 comprises a plurality of columns of data with the following data structure:
data source: the serial number of the organization carrying out the cleaning of illegal words in the text;
column number: the column number of the health examination text data;
standard term: the standard term corresponding to each column of the health examination text data;
index null count: the number of null values in each column of the health examination text data before the illegal words are removed;
index non-null count: the number of non-null values in each column of the health examination text data before the illegal words are removed;
non-null count after removal of illegal values: the number of non-null values in each column of the health examination text data after the illegal words are removed;
illegal count: the number of illegal words removed from each column of the health examination text data;
highest-frequency illegal form: the highest-frequency illegal word in each column of the health examination text data;
highest-frequency illegal form count: the word frequency of the highest-frequency illegal word in each column of the health examination text data.
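The per-column statistics of quality control file 2 can be computed along the following lines. This is a sketch under assumptions: a null is taken to be `None` or an empty string, only cells that were cleaned to empty are counted as removed illegal values, and the function name and dictionary keys are hypothetical.

```python
from collections import Counter

def column_stats(before, after):
    """Compare one column before and after cleaning.

    `before` and `after` are equal-length lists of cell values;
    an empty string or None counts as null.
    """
    def is_null(v):
        return v is None or v == ""

    # Values that were present before cleaning and are empty afterwards.
    removed = Counter(b for b, a in zip(before, after)
                      if not is_null(b) and is_null(a))
    top_word, top_freq = removed.most_common(1)[0] if removed else (None, 0)
    return {
        "null_before": sum(is_null(v) for v in before),
        "non_null_before": sum(not is_null(v) for v in before),
        "non_null_after": sum(not is_null(v) for v in after),
        "illegal_count": sum(removed.values()),
        "top_illegal_form": top_word,
        "top_illegal_frequency": top_freq,
    }
```

A sudden drop from the non-null count before cleaning to the non-null count after cleaning flags a column in which the lexicon may be deleting too much.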
Further, step 4 of the present invention includes:
outputting the health examination text data from which the illegal words have been removed, the results of the multi-threaded operations being merged by using an algorithm and then output, wherein the data from which the illegal words have been removed comprises: the physical examination number, the text data standard term and the text data physical examination result, the result no longer containing illegal forms.
Further, the specific method for checking whether the output health examination text data is correct in the step 5 of the present invention is as follows:
checking quality control file 1 and quality control file 2 of the illegal word cleaning, and restoring meaningful data to the data from which the illegal words have been removed according to the physical examination number; and checking each column of the cleaned text data for residual illegal words and mixed-in information, and directly modifying the cleaned text data.
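The two checking operations above, restoring a wrongly removed value by physical examination number and looking for residual illegal words, can be sketched as follows. The dictionary layout of the cleaned data and the function names are assumptions made for the illustration; in the method itself the decisions are made by reviewers.

```python
def restore_value(cleaned_rows, exam_number, column, original_text):
    """Put a wrongly removed value back into the cleaned data.

    `cleaned_rows` maps a physical examination number to a dict of
    column name -> text. Returns True if a row was updated.
    """
    row = cleaned_rows.get(exam_number)
    if row is None or column not in row:
        return False
    row[column] = original_text
    return True

def find_residuals(cleaned_rows, column, illegal_words):
    """List exam numbers whose cleaned text still contains an illegal word."""
    return [num for num, row in cleaned_rows.items()
            if any(word in row.get(column, "") for word in illegal_words)]

cleaned = {
    "P001": {"abdominal ultrasound-liver": ""},
    "P002": {"abdominal ultrasound-liver": "unchecked"},
}
```

Here `find_residuals(cleaned, "abdominal ultrasound-liver", ["unchecked"])` points the reviewer at P002, and `restore_value` writes a value recorded in quality control file 1 back under its physical examination number.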
Further, the specific method for supplementing the illegal lexicon of the text data in the step 6 of the invention comprises the following steps:
supplementing the text data illegal lexicon by using the check results of the output text, the specific contents comprising the index classification, illegal values and occurrence frequency in quality control file 1.
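Supplementing the lexicon from checked quality control records might look like the sketch below. The nested-dictionary lexicon layout, the tuple layout of a reviewed record and the `confirmed` flag (standing for the outcome of the manual check) are all assumptions introduced for illustration.

```python
def update_lexicon(lexicon, reviewed_records):
    """Add reviewer-confirmed illegal values to the lexicon.

    `lexicon` maps index classification -> {illegal value: frequency}.
    Each reviewed record is a tuple:
    (index classification, illegal value, frequency, confirmed).
    """
    for classification, value, frequency, confirmed in reviewed_records:
        if not confirmed:
            continue  # reviewers judged this text meaningful; do not learn it
        entries = lexicon.setdefault(classification, {})
        entries[value] = entries.get(value, 0) + frequency
    return lexicon
```

Accumulating the occurrence frequency per illegal value keeps a record of which forms are most common in each index classification.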
The invention has the following beneficial effects: the method for cleaning illegal words in text data of health examination big data provides a text illegal word cleaning method that is systematic and standardized, accurate, efficient, time-saving and promptly traceable. Specifically, matching with a text illegal word matching algorithm based on a composite rule raises computational efficiency, which speeds up data cleaning, improves data quality quickly and reduces manual labor; and quality control of the text data before and after cleaning ensures that the data cleaning can be traced and monitored.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the method for cleaning illegal words in the text data of the health examination big data according to the embodiment of the present invention includes the following steps:
step 1, collecting original physical examination data, obtaining the illegal data of different physical examination items through manual labeling and machine training and mining, and establishing a text data illegal lexicon;
the illegal data include: 1. special forms, e.g. fields containing only numerical values; 2. full field matching (regular matching), e.g. "unchecked", "rejected"; 3. the full field is completely consistent.
Step 2, inputting health examination text data to be cleaned, which contains illegal words, according to a specified data structure, wherein the health examination text data is in a list structure, the first column is a physical examination number, and the subsequent columns are corresponding physical examination index items and corresponding physical examination description results;
step 3, traversing the content of each row of physical examination index items, looking up the illegal words for each physical examination index item in the text data illegal lexicon, performing algorithm matching on the health examination text fields according to the different types of illegal forms, and judging whether each health examination text field is in an illegal form;
the algorithm matching of the health examination text field according to the different types of illegal forms comprises: 1. special form (value judgment), for example a field containing only numerical values, such as the examination content "112221345" for "head CT"; 2. full field matching (regular matching), for example "not examined", "not checked" or "refused examination", such as the examination content "eyesight refused examination" for "ophthalmic examination"; 3. the full field is completely consistent (exact match), for example the examination content "not reported, and the selection is recommended to be perfect" for "physical examination".
Step 4, deleting the matched illegal words from the health examination text data by using an algorithm to obtain clean health examination text data, and storing the physical examination index items of the health examination data, the matched illegal words and the corresponding physical examination numbers as quality control file 1 for subsequent text source tracing; meanwhile, comparing the data statistics before and after the cleaning of the illegal words to obtain quality control file 2 for quality control of subsequent text cleaning;
step 5, checking whether the output health examination text data is correct or not according to the cleaned health examination data, the quality control file 1 and the quality control file 2; examples of the quality control file 1 and the quality control file 2 are shown in fig. 2 and 3.
Step 6, checking the results in the output quality control file 1 of the health examination text data, and updating the text data illegal lexicon through manual labeling and machine identification;
step 7, completing the cleaning of the illegal words.
In another embodiment of the invention:
according to the method for cleaning illegal words in text data of health examination big data, the illegal words in the health examination text data are cleaned by using the text data illegal lexicon, so that the illegal words in the text are removed, the data are normalized and the data quality is improved. The method specifically comprises the following steps:
step one, collecting original physical examination data, obtaining the illegal data of different physical examination items (1. special form [e.g. containing only numerical values]; 2. full field matching [e.g. not examined, not checked, refused examination]; 3. the full field is completely consistent) through manual labeling and machine training and mining, and establishing a text data illegal lexicon;
Step two, inputting the text data to be cleaned of illegal words, wherein the text data comprises a physical examination number, text data standard terms and text data physical examination results, and dividing the data into a plurality of parts by using a function so that the parts can be processed by multiple threads, which improves the processing efficiency;
step three, performing algorithm matching on the text data by using the text data illegal lexicon, and judging whether the text data is in an illegal form. The method specifically comprises the following steps:
The index classification corresponding to the column name of the data to be cleaned is used to find the corresponding index classification in the illegal lexicon, and the general illegal forms of that index classification are then matched against the data.
The text data illegal lexicon comprises: index classification, general illegal forms, exclusion words and matching rules.
The index classification includes 105 index classifications, such as imaging examination-ultrasound, imaging examination-CT, imaging examination-X-ray examination, physical examination-chest, physical examination-head and neck, and inquiry.
The general illegal forms include more than 200 illegal forms, such as "not detected", "not done" and "see report".
An exclusion word is text that is meaningful but partially coincides with a general illegal form, e.g., "not tested" is a general illegal form and "not tested and" is an exclusion word. The matching rules include: 1. a special form; 2. regular matching; 3. the full fields match completely. Rule 1 represents data in a special form, i.e., a pure numerical form containing only numerical values; if the pure numerical form "2222" appears in the index classification "imaging examination-ultrasound", it is judged to be an illegal form containing only numerical values. Rule 2 represents a regular match, i.e., the data contains a general illegal word. Rule 3 represents a full field match, i.e., the data content is completely consistent with the general illegal word.
If the matching rule is met, the form is judged to be illegal; if the matching rule is not met, the form is judged not to be illegal. The text data illegal lexicon is comprehensive and accurate: its index classifications cover the classification of all physical examination data; the general illegal forms are classification-specific, that is, the general illegal forms of different index classifications are not all the same; and the exclusion words prevent data from being deleted by mistake;
step four, deleting the matched illegal words by using an algorithm. Deleting the illegal words ensures the integrity and validity of the data. The deleted illegal words are output to a quality check table so that a medical team can check whether the cleaned text data is correct. The quality check table comprises: data source, serial number, standard term, index classification, matched general illegal value, matching rule, illegal form, frequency of occurrence and physical examination number;
step five, outputting the text data from which the illegal words have been removed. The results of the multi-threaded operations are merged by using an algorithm and output. The data from which the illegal words have been removed comprises: the physical examination number, the text data standard term and the text data physical examination result, but the result no longer contains illegal forms such as "undetected" and "result self-fetching";
step six, checking whether the output text data is correct. The specific judgment method is as follows: the medical team checks the quality control file (quality check table) of the illegal word cleaning and, according to the physical examination number, restores meaningful data to the data from which the illegal words have been removed. Each column of the cleaned text data is checked for residual illegal words and mixed-in information, and the cleaned text data is modified directly;
the check results of the output text are used to supplement the text data illegal lexicon. The specific contents comprise: index classification, general illegal forms, exclusion words and matching rules.
Step seven, the cleaning of the illegal words is complete.
In this embodiment, the data to be cleaned of illegal words, "abdominal ultrasound-liver", is input in step two;
in this embodiment, in step three, "abdominal ultrasound-liver" is assigned to the index classification "imaging examination-ultrasound" by using the text data illegal lexicon, and the general illegal forms of "imaging examination-ultrasound" are matched against the data; for example, the data matches "not done";
in this embodiment, in step four, the matched general illegal words, such as "not done", are deleted by using an algorithm;
in this embodiment, in step five, the "abdominal ultrasound-liver" text data with the illegal forms removed is output;
in this embodiment, in step six, the data is checked for remaining illegal words, such as "unchecked", which are deleted manually. If data that was deleted by the algorithm and stored in the quality check table is meaningful, it is manually restored into the data.
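The "abdominal ultrasound-liver" example of this embodiment can be traced in a short sketch that looks up the index classification, honours exclusion words and deletes a matched general illegal form. The lexicon fragment, the column mapping, the exclusion word "not done due to pregnancy" and the function name `clean_field` are invented for illustration and are not taken from the actual lexicon.

```python
# Hypothetical lexicon fragment; a real lexicon would hold all index classifications.
LEXICON = {
    "imaging examination-ultrasound": {
        "illegal_forms": ["not done", "see report"],
        "exclusion_words": ["not done due to pregnancy"],  # invented example
    },
}
# Hypothetical mapping from column name (standard term) to index classification.
COLUMN_CLASSIFICATION = {
    "abdominal ultrasound-liver": "imaging examination-ultrasound",
}

def clean_field(column_name, text):
    """Return (cleaned text, matched illegal form or None) for one field."""
    entry = LEXICON.get(COLUMN_CLASSIFICATION.get(column_name))
    if entry is None:
        return text, None  # no lexicon entry for this column: leave untouched
    if any(word in text for word in entry["exclusion_words"]):
        return text, None  # meaningful text that only resembles an illegal form
    for form in entry["illegal_forms"]:
        if form in text:
            return text.replace(form, "").strip(), form
    return text, None
```

Checking the exclusion words before the illegal forms is the design choice that keeps meaningful text from being deleted by mistake; the returned illegal form is the value that would be written to the quality check table.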
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.