CN112765964A - Method for cleaning illegal words of text data of big data for health examination - Google Patents

Info

Publication number
CN112765964A
Authority
CN
China
Prior art keywords
illegal
data
text data
words
examination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110087779.5A
Other languages
Chinese (zh)
Other versions
CN112765964B (en)
Inventor
李红良
雷昉
杨慧琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202110087779.5A
Publication of CN112765964A
Application granted
Publication of CN112765964B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method for cleaning illegal words in text data of health examination big data, which comprises the following steps: step 1, collecting original physical examination data and, through manual labeling and machine training and mining, acquiring illegal data of different physical examination items and establishing an illegal word lexicon of the text data; step 2, inputting the health examination text data to be cleaned according to a specified data structure; step 3, performing algorithm matching on the health examination text data by using the illegal word lexicon, and judging whether the health examination text data is in an illegal form; step 4, deleting the matched illegal words by using an algorithm; step 5, checking whether the output health examination text data is correct; step 6, checking the results of the output health examination text data and supplementing the illegal word lexicon; and step 7, completing the cleaning of the illegal words. The invention provides a standardized and reasonable method for cleaning illegal words in text, and the method has high algorithmic precision and high computational efficiency.

Description

Method for cleaning illegal words of text data of big data for health examination
Technical Field
The invention relates to the technical field of medical big data, in particular to a method for cleaning illegal words of text data of big data for health examination.
Background
With the development of the medical field and the economy, and with growing public concern for health, the number of people undergoing health examinations increases year by year. A large amount of health examination data goes without full analysis and use every year, and the information systems, execution standards and recording standards of different hospitals differ slightly, so the data of different medical institutions are heterogeneous. At present there is no complete, reasonable and standardized data management method for health big data, so health big data cannot be well managed, stored, shared and analyzed. Text data is a common data type in health examination data; it is often found in inquiry, imaging examination, special examination, physical examination and the like. Health examination data contain a large number of illegal words in the text data, such as "unchecked", "substitute exam", "examined" and "xx physician's exam". The existence of such invalid data affects the authenticity of the data and creates large barriers to data normalization, data mining and statistical analysis. At present there is no complete, reasonable and standardized data cleaning method for cleaning illegal words in text data. The invention provides a standardized and reasonable method for cleaning illegal words in text.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a method for cleaning illegal words in text data of health examination big data.
The technical scheme adopted by the invention to solve this technical problem is as follows:
the invention provides a method for cleaning illegal words in text data of health examination big data, which comprises the following steps:
step 1, collecting original physical examination data and, through manual labeling and machine training and mining, acquiring illegal data of different physical examination items and establishing an illegal word lexicon of the text data;
step 2, inputting, according to a specified data structure, the health examination text data to be cleaned, which contains illegal words, wherein the health examination text data is in a list structure, the first column is the physical examination number, and the subsequent columns are the physical examination index items with their corresponding physical examination description results;
step 3, traversing the content of each row of physical examination index items, looking up the illegal words for each physical examination index item in the illegal word lexicon, performing algorithm matching on the health examination text fields according to the different types of illegal forms, and judging whether each health examination text field is in an illegal form;
step 4, deleting the matched illegal words from the health examination text data by using an algorithm to obtain clean health examination text data, and storing the physical examination index item, the matched illegal words and the corresponding physical examination number as quality control file 1 for subsequent tracing of the text to its source; meanwhile, comparing the data statistics before and after the cleaning of the illegal words to obtain quality control file 2 for quality control of the subsequent text cleaning;
step 5, checking whether the output health examination text data is correct according to the cleaned health examination data, quality control file 1 and quality control file 2;
step 6, checking the results in quality control file 1 of the output health examination text data, and updating the illegal word lexicon through manual labeling and machine identification;
step 7, completing the cleaning of the illegal words.
Further, the data structure of the health examination text data in step 2 of the invention includes: the physical examination number, the text data standard term and the text data physical examination result; the data are divided into a plurality of parts by using a function and are then processed in a multithreaded manner.
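The splitting and multithreaded processing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `split_into_parts`, `clean_part` and `clean_in_parallel` are hypothetical, rows are assumed to be tuples of (physical examination number, standard term, result), and `clean_part` is a placeholder that only trims whitespace.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_parts(rows, n_parts):
    """Split rows into at most n_parts contiguous chunks of near-equal size."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def clean_part(part):
    """Placeholder for the per-chunk cleaning; here it only trims whitespace."""
    return [(exam_no, term, result.strip()) for exam_no, term, result in part]

def clean_in_parallel(rows, n_parts=4):
    """Clean each chunk in its own worker thread and merge the results."""
    parts = split_into_parts(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        # map() yields results in chunk order, so the merged output
        # keeps the order of the input rows
        cleaned_parts = list(pool.map(clean_part, parts))
    return [row for part in cleaned_parts for row in part]
```

In CPython, threads mainly help when the per-chunk work waits on I/O; for CPU-bound matching, a process pool is one possible alternative.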
Further, the specific method for performing algorithm matching on the health examination text data by using the existing illegal word lexicon in step 3 of the invention is as follows:
matching, in the data, the general illegal forms corresponding to the index classification, wherein the illegal word lexicon comprises: index classification, general illegal forms, exclusion words and matching rules; wherein:
the index classification contains 105 index classifications, including: imaging examination-ultrasound, imaging examination-CT, imaging examination-X-ray examination, physical examination-chest, physical examination-head and neck, inquiry;
the general illegal forms contain no fewer than 200 illegal forms, including: undetected, not done, see report;
an exclusion word is text that is meaningful but partially overlaps with a general illegal form;
the matching rules include: 1. special form; 2. regular-expression matching; 3. full-field exact matching; wherein:
"1" represents data in a special form, i.e., a purely numerical form containing only numerical values;
"2" represents a regular-expression match, i.e., the data contains a general illegal word;
"3" represents a full-field exact match, i.e., the data content is completely consistent with a general illegal word;
if a matching rule is met, the data is judged to be in an illegal form, and if no matching rule is met, the data is judged not to be in an illegal form; matching is performed by a text illegal word matching algorithm based on composite rules, specifically:
algorithm matching rule 1 matches data that contains only numerical values, and the specific matching mode is [0-9];
algorithm matching rule 2 matches general illegal words by means of specific illegal word rule strings, where the rule string set is a regular expression set constructed in advance, and the specific matching mode is: [..] illegal word;
algorithm matching rule 3 represents full-field matching, and the specific matching mode is: [illegal word].
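The three matching rules and the exclusion words can be sketched as below. This is an illustration under stated assumptions, not the patented algorithm: the lexicon layout, the function name `match_illegal` and the sample exclusion word are hypothetical, and rule 1 is written as the anchored pattern `[0-9]+` (the anchoring and the `+` are an assumption; the text gives only `[0-9]`).

```python
import re

# Hypothetical lexicon rows: (illegal form, matching rule, exclusion words)
LEXICON = {
    "imaging examination-ultrasound": [
        ("not done", 3, []),
        ("see report", 2, ["see report: normal"]),  # exclusion word is invented
    ],
}

def match_illegal(text, index_classification, lexicon=LEXICON):
    """Return (rule, matched form) if text is an illegal form, else None."""
    value = text.strip()
    # Rule 1: special form, the field holds digits only
    if re.fullmatch(r"[0-9]+", value):
        return (1, value)
    for form, rule, exclusions in lexicon.get(index_classification, []):
        # Exclusion words are meaningful text that overlaps an illegal form
        if any(ex in value for ex in exclusions):
            continue
        # Rule 3: the whole field equals the illegal form
        if rule == 3 and value == form:
            return (3, form)
        # Rule 2: the field contains the illegal form
        if rule == 2 and re.search(re.escape(form), value):
            return (2, form)
    return None
```

Checking exclusion words before the illegal forms is one way to keep meaningful text that merely overlaps an illegal form from being flagged.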
Further, the specific method for deleting the matched illegal words by using the algorithm in step 4 of the invention is as follows:
deleting the matched illegal words by using an algorithm, and outputting the deleted illegal words to quality control file 1 and quality control file 2 so that a medical team can check whether the cleaned text data is correct;
quality control file 1 comprises a plurality of columns of data, and each column of data comprises the following data structure:
data source: the serial number of the organization carrying out the cleaning of the illegal words in the text;
column number: the column number of the column in the health examination text data that contains a general illegal form;
standard term: the standard term corresponding to the column in the health examination text data that contains a general illegal form;
index classification: the index classification of the standard term corresponding to the column in the health examination text data that contains a general illegal form;
illegal value: the general illegal form in the illegal word lexicon;
whether it is an exclusion word: the exclusion word corresponding to the illegal word in the illegal word lexicon, where an exclusion word is text that is meaningful but partially overlaps with a general illegal form;
illegal form category: the matching rule, where the matching rules include: 1. special form; 2. regular-expression matching; 3. full-field exact matching; "1" represents data in a special form, i.e., a purely numerical form containing only numerical values; "2" represents a regular-expression match, i.e., the data contains a general illegal word; "3" represents a full-field exact match, i.e., the data content is completely consistent with a general illegal word; if a matching rule is met, the data is judged to be in an illegal form, and if no matching rule is met, the data is judged not to be in an illegal form;
frequency of occurrence: the frequency with which the general illegal form appears;
physical examination number: the physical examination numbers for which the general illegal form appears in the column;
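A minimal sketch of how records with the fields listed for quality control file 1 might be assembled. The function name `build_qc_file_1`, the dictionary keys, the default data source "ORG-01" and the choice to group by (column number, illegal value) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_qc_file_1(matches, data_source="ORG-01"):
    """Aggregate per-cell matches into one record per (column, illegal value).

    matches: iterable of dicts with keys exam_no, column_no, standard_term,
    index_classification, illegal_value, rule.
    """
    grouped = defaultdict(list)
    for m in matches:
        grouped[(m["column_no"], m["illegal_value"])].append(m)
    records = []
    for (column_no, illegal_value), group in sorted(grouped.items()):
        first = group[0]
        records.append({
            "data_source": data_source,
            "column_no": column_no,
            "standard_term": first["standard_term"],
            "index_classification": first["index_classification"],
            "illegal_value": illegal_value,
            # Assumption: only deleted, non-excluded forms are logged here
            "is_exclusion_word": False,
            "illegal_form_category": first["rule"],
            "frequency": len(group),
            "exam_numbers": [m["exam_no"] for m in group],
        })
    return records
```

Keeping the physical examination numbers in each record is what allows a deleted value to be traced back to its source row later.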
quality control file 2 comprises a plurality of columns of data, and each column of data comprises the following data structure:
data source: the serial number of the organization carrying out the cleaning of the illegal words in the text;
column number: the column number in the health examination text data;
standard term: the standard term corresponding to each column of the health examination text data;
index null count: the number of null values in each column of the health examination text data before the illegal words are removed;
index non-null count: the number of non-null values in each column of the health examination text data before the illegal words are removed;
non-null count after removal of illegal values: the number of non-null values in each column of the health examination text data after the illegal words are removed;
illegal count: the number of illegal words removed from each column of the health examination text data;
highest-frequency illegal form: the most frequent illegal word in each column of the health examination text data;
highest illegal frequency: the word frequency of the most frequent illegal word in each column of the health examination text data.
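The per-column statistics of quality control file 2 can be sketched as follows. The names `is_null` and `build_qc_file_2`, the dictionary keys and the rule that an empty or whitespace-only string counts as null are assumptions for illustration.

```python
from collections import Counter

def is_null(value):
    """Treat None and empty or whitespace-only strings as null (assumption)."""
    return value is None or value.strip() == ""

def build_qc_file_2(column_no, standard_term, before, after, removed,
                    data_source="ORG-01"):
    """Summarize one column.

    before/after: the column's values before and after cleaning;
    removed: the illegal words deleted from this column.
    """
    top = Counter(removed).most_common(1)
    return {
        "data_source": data_source,
        "column_no": column_no,
        "standard_term": standard_term,
        "null_count": sum(is_null(v) for v in before),
        "non_null_count": sum(not is_null(v) for v in before),
        "non_null_count_after": sum(not is_null(v) for v in after),
        "illegal_count": len(removed),
        "top_illegal_form": top[0][0] if top else None,
        "top_illegal_frequency": top[0][1] if top else 0,
    }
```

Comparing `non_null_count` with `non_null_count_after` shows at a glance how many fields a cleaning run emptied in each column.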
Further, step 4 of the invention includes:
outputting the health examination text data from which the illegal words have been removed, the results of the multithreaded operations being merged and output by using an algorithm, wherein the data after removal of the illegal words comprise: the physical examination number, the text data standard terms and the text data physical examination results, but the results no longer contain illegal forms.
Further, the specific method for checking whether the output health examination text data is correct in step 5 of the invention is as follows:
reviewing the illegal word cleaning quality control file 1 and quality control file 2, and restoring meaningful data to the data from which the illegal words were removed according to the physical examination number; and checking whether any column of the cleaned text data still contains residual illegal words or mixed-in information, and directly correcting the cleaned text data.
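Restoring meaningful data by physical examination number might look like the sketch below. The function name `restore_values` and both data layouts are hypothetical; the decision about which deleted values are meaningful is assumed to come from the human review, not from code.

```python
def restore_values(cleaned, restorations):
    """Put reviewer-approved original values back into the cleaned data.

    cleaned: {exam_no: {standard_term: text}}
    restorations: list of (exam_no, standard_term, original_text) that
    reviewers flagged as meaningful after reading the quality control files.
    """
    # Copy so the cleaned data passed in stays untouched
    restored = {exam_no: dict(fields) for exam_no, fields in cleaned.items()}
    for exam_no, term, original_text in restorations:
        # Ignore entries that do not correspond to a known row and column
        if exam_no in restored and term in restored[exam_no]:
            restored[exam_no][term] = original_text
    return restored
```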
Further, the specific method for supplementing the illegal word lexicon of the text data in step 6 of the invention is as follows:
supplementing the illegal word lexicon by using the check results of the output text, wherein the specific contents comprise the index classification, illegal values and frequency of occurrence in quality control file 1.
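One way the lexicon supplement could work is sketched below, assuming reviewer-confirmed records that carry the index classification, illegal value and frequency of occurrence. The function name `supplement_lexicon`, the set-based lexicon layout and the `min_frequency` threshold are hypothetical and are not stated in the text.

```python
def supplement_lexicon(lexicon, confirmed_records, min_frequency=1):
    """Add reviewer-confirmed illegal values to the lexicon in place.

    lexicon: {index_classification: set of illegal forms}
    confirmed_records: dicts with index_classification, illegal_value, frequency
    Returns the (classification, illegal value) pairs that were newly added.
    """
    added = []
    for rec in confirmed_records:
        # Skip rare values; the threshold is an illustrative assumption
        if rec["frequency"] < min_frequency:
            continue
        forms = lexicon.setdefault(rec["index_classification"], set())
        if rec["illegal_value"] not in forms:
            forms.add(rec["illegal_value"])
            added.append((rec["index_classification"], rec["illegal_value"]))
    return added
```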
The invention has the following beneficial effects: the method for cleaning illegal words in text data of health examination big data provides a way of cleaning illegal words in text that is systematic and standardized, accurate, efficient and time-saving, and that allows timely tracing to the source. Specifically, matching is performed by a text illegal word matching algorithm based on composite rules, which raises computational efficiency, speeds up data cleaning, rapidly improves data quality and reduces manual effort; and quality control is applied to the data before and after the text data is cleaned, which ensures that the data cleaning can be traced and monitored.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of a method for cleaning illegal words in text data of health examination big data according to an embodiment of the present invention;
FIG. 2 is quality control file 1 according to an embodiment of the present invention;
FIG. 3 is quality control file 2 according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, the method for cleaning illegal words in the text data of the health examination big data according to the embodiment of the present invention includes the following steps:
step 1, collecting original physical examination data and, through manual labeling and machine training and mining, acquiring illegal data of different physical examination items and establishing an illegal word lexicon of the text data;
the illegal data include: 1. special forms, e.g. fields containing only numerical values; 2. regular matching, e.g. "unchecked", "refused"; 3. full fields that are completely consistent with an illegal word;
step 2, inputting, according to a specified data structure, the health examination text data to be cleaned, which contains illegal words, wherein the health examination text data is in a list structure, the first column is the physical examination number, and the subsequent columns are the physical examination index items with their corresponding physical examination description results;
step 3, traversing the content of each row of physical examination index items, looking up the illegal words for each physical examination index item in the illegal word lexicon, performing algorithm matching on the health examination text fields according to the different types of illegal forms, and judging whether each health examination text field is in an illegal form;
the algorithm matching of the health examination text fields according to the different types of illegal forms comprises: 1. special form (judged by the form of the value), for example a field containing only numerical values, such as the examination content "112221345" for "head CT"; 2. regular matching, for example "not checked", "not examined" and "refused", such as the examination content "vision examination refused" for "ophthalmic examination"; 3. full field completely consistent (exact match), for example the examination content "not reported, and the selection is recommended to be perfect" for "physical examination";
step 4, deleting the matched illegal words from the health examination text data by using an algorithm to obtain clean health examination text data, and storing the physical examination index item, the matched illegal words and the corresponding physical examination number as quality control file 1 for subsequent tracing of the text to its source; meanwhile, comparing the data statistics before and after the cleaning of the illegal words to obtain quality control file 2 for quality control of the subsequent text cleaning;
step 5, checking whether the output health examination text data is correct according to the cleaned health examination data, quality control file 1 and quality control file 2; examples of quality control file 1 and quality control file 2 are shown in FIG. 2 and FIG. 3;
step 6, checking the results in quality control file 1 of the output health examination text data, and updating the illegal word lexicon through manual labeling and machine identification;
step 7, completing the cleaning of the illegal words.
In another embodiment of the invention:
According to the method for cleaning illegal words in health examination big data text data, the illegal words in the health examination text data are cleaned by using the illegal word lexicon of the health examination text data, so that the illegal words in the text are removed, the data are normalized, and the data quality is improved. The method specifically comprises the following steps:
step one, collecting original physical examination data and, through manual labeling and machine training and mining, acquiring illegal data of different physical examination items (1. special form, e.g. containing only numerical values; 2. regular matching, e.g. "not checked", "not examined", "refused"; 3. full field completely consistent), and establishing the illegal word lexicon of the text data;
step two, inputting the text data whose illegal words are to be cleaned, wherein the text data comprises the physical examination number, the text data standard terms and the text data physical examination results, and dividing the data into a plurality of parts by using a function so that multithreaded processing can be performed and the processing efficiency improved;
step three, performing algorithm matching on the text data by using the illegal word lexicon, and judging whether the text data is in an illegal form. Specifically:
the index classification corresponding to the column name of the data to be cleaned is used to find the corresponding index classification in the illegal word lexicon, and the general illegal forms of that index classification are then matched in the data.
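The two-stage lookup just described (column name to index classification, then index classification to general illegal forms) can be sketched as two dictionary lookups. The table contents are illustrative: only the pairing of "abdominal ultrasound-liver" with "imaging examination-ultrasound" comes from the embodiment; the "head CT" mapping and the function name `forms_for_column` are assumptions.

```python
# Column name (standard term) -> index classification
TERM_TO_CLASSIFICATION = {
    "abdominal ultrasound-liver": "imaging examination-ultrasound",  # from the embodiment
    "head CT": "imaging examination-CT",  # hypothetical mapping
}

# Index classification -> general illegal forms (small illustrative subset)
ILLEGAL_FORMS = {
    "imaging examination-ultrasound": ["not done", "undetected", "see report"],
}

def forms_for_column(column_name):
    """Return (index classification, illegal forms) for a column name."""
    classification = TERM_TO_CLASSIFICATION.get(column_name)
    if classification is None:
        return None, []
    return classification, ILLEGAL_FORMS.get(classification, [])
```

Because the illegal forms are keyed by classification, a word that is illegal for one kind of examination is not applied to columns of another kind.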
The illegal word lexicon of the text data comprises: index classification, general illegal forms, exclusion words and matching rules.
The index classification includes 105 index classifications, such as imaging examination-ultrasound, imaging examination-CT, imaging examination-X-ray examination, physical examination-chest, physical examination-head and neck, and inquiry.
The general illegal forms include more than 200 illegal forms, such as "undetected", "not done" and "see report".
An exclusion word is text that is meaningful but partially overlaps with a general illegal form, e.g., "not tested" is a general illegal form and "not tested and" is an exclusion word. The matching rules include: 1. special form; 2. regular-expression matching; 3. full-field exact matching. Rule 1 represents data in a special form, i.e., a purely numerical form containing only numerical values; for example, if the purely numerical form "2222" appears in the index classification "imaging examination-ultrasound", it is judged to be an illegal form containing only numerical values. Rule 2 represents a regular-expression match, i.e., the data contains a general illegal word. Rule 3 represents a full-field exact match, i.e., the data content is completely consistent with a general illegal word.
If a matching rule is met, the data is judged to be in an illegal form; if no matching rule is met, it is judged not to be in an illegal form. The illegal word lexicon is comprehensive and accurate: its index classification covers the classification of all physical examination data; the general illegal forms are classification-specific, i.e., the general illegal forms of different index classifications are not all the same; and the exclusion words prevent data from being deleted by mistake;
step four, deleting the matched illegal words by using an algorithm. Deleting the illegal words ensures the integrity and validity of the data. The deleted illegal words are output to a quality check table so that a medical team can check whether the cleaned text data is correct. The quality check table comprises: data source, serial number, standardized terms, index classification, the general illegal values matched, matching rules, illegal forms, frequency of occurrence and physical examination number;
step five, outputting the text data from which the illegal words have been removed. The results of the multithreaded operations are merged and output by using an algorithm. The data after removal of the illegal words comprise: the physical examination number, the text data standard terms and the text data physical examination results, but the results no longer contain illegal forms such as "undetected" and "result self-fetching";
step six, checking whether the output text data is correct. The specific method is: the medical team reviews the illegal word cleaning quality control file (quality check table) and, according to the physical examination number, restores meaningful data to the data from which the illegal words were removed; the team also checks whether any column of the cleaned text data still contains residual illegal words or mixed-in information, and directly corrects the cleaned text data;
the check results of the output text are used to supplement the illegal word lexicon of the text data. The specific content comprises: index classification, general illegal forms, exclusion words and matching rules.
step seven, completing the cleaning of the illegal words.
In this embodiment, the data to be cleaned of illegal words, "abdominal ultrasound-liver", is input in step two;
in step three, "abdominal ultrasound-liver" is assigned to the index classification "imaging examination-ultrasound" by using the illegal word lexicon, and the general illegal forms under "imaging examination-ultrasound" are matched in the data, for example matching "not done";
in step four, the matched general illegal words, such as "not done", are deleted by using an algorithm;
in step five, the "abdominal ultrasound-liver" text data with the illegal forms removed is output;
in step six, the data is checked for residual illegal words, such as "unchecked", which are deleted manually; and if meaningful data deleted by the algorithm is found in the quality check table, it is manually restored into the data.
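The deletion in step four of this embodiment can be illustrated with a short sketch for a single column. It is a simplification under stated assumptions: the function name `clean_column` and the (physical examination number, text) row layout are hypothetical, every listed form is deleted wherever it occurs, and exclusion words are ignored for brevity.

```python
import re

def clean_column(rows, illegal_forms):
    """Delete illegal forms from one column.

    rows: list of (exam_no, text).
    Returns (cleaned rows, deletion log of (exam_no, deleted word)).
    """
    if not illegal_forms:
        return list(rows), []
    # Longer forms first, so a short form cannot pre-empt a longer one
    ordered = sorted(illegal_forms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(f) for f in ordered))
    cleaned, log = [], []
    for exam_no, text in rows:
        for word in pattern.findall(text):
            log.append((exam_no, word))
        cleaned.append((exam_no, pattern.sub("", text).strip()))
    return cleaned, log
```

The deletion log pairs each removed word with its physical examination number, which is the information a reviewer would need in step six to restore a value that was deleted by mistake.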
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (7)

1. A method for cleaning illegal words in text data of health examination big data, characterized by comprising the following steps:
step 1, collecting original physical examination data and, through manual labeling and machine training and mining, acquiring illegal data of different physical examination items and establishing an illegal word lexicon of the text data;
step 2, inputting, according to a specified data structure, the health examination text data to be cleaned, which contains illegal words, wherein the health examination text data is in a list structure, the first column is the physical examination number, and the subsequent columns are the physical examination index items with their corresponding physical examination description results;
step 3, traversing the content of each row of physical examination index items, looking up the illegal words for each physical examination index item in the illegal word lexicon, performing algorithm matching on the health examination text fields according to the different types of illegal forms, and judging whether each health examination text field is in an illegal form;
step 4, deleting the matched illegal words from the health examination text data by using an algorithm to obtain clean health examination text data, and storing the physical examination index item, the matched illegal words and the corresponding physical examination number as quality control file 1 for subsequent tracing of the text to its source; meanwhile, comparing the data statistics before and after the cleaning of the illegal words to obtain quality control file 2 for quality control of the subsequent text cleaning;
step 5, checking whether the output health examination text data is correct according to the cleaned health examination data, quality control file 1 and quality control file 2;
step 6, checking the results in quality control file 1 of the output health examination text data, and updating the illegal word lexicon through manual labeling and machine identification;
step 7, completing the cleaning of the illegal words.
2. The method for cleaning illegal words in text data of health examination big data as claimed in claim 1, wherein the data structure of the health examination text data in step 2 comprises: the physical examination number, the text data standard term and the text data physical examination result; the data are divided into a plurality of parts by using a function and are then processed in a multithreaded manner.
3. The method for cleaning illegal words in text data of health examination big data as claimed in claim 1, wherein the specific method for performing algorithm matching on the health examination text data by using the existing illegal word lexicon in step 3 is as follows:
matching, in the data, the general illegal forms corresponding to the index classification, wherein the illegal word lexicon comprises: index classification, general illegal forms, exclusion words and matching rules; wherein:
the index classification contains 105 index classifications, including: imaging examination-ultrasound, imaging examination-CT, imaging examination-X-ray examination, physical examination-chest, physical examination-head and neck, inquiry;
the general illegal forms contain no fewer than 200 illegal forms, including: undetected, not done, see report;
an exclusion word is text that is meaningful but partially overlaps with a general illegal form;
the matching rules include: 1. special form; 2. regular-expression matching; 3. full-field exact matching; wherein:
"1" represents data in a special form, i.e., a purely numerical form containing only numerical values;
"2" represents a regular-expression match, i.e., the data contains a general illegal word;
"3" represents a full-field exact match, i.e., the data content is completely consistent with a general illegal word;
if a matching rule is met, the data is judged to be in an illegal form, and if no matching rule is met, the data is judged not to be in an illegal form; matching is performed by a text illegal word matching algorithm based on composite rules, specifically:
algorithm matching rule 1 matches data that contains only numerical values, and the specific matching mode is [0-9];
algorithm matching rule 2 matches general illegal words by means of specific illegal word rule strings, where the rule string set is a regular expression set constructed in advance, and the specific matching mode is: [..] illegal word;
algorithm matching rule 3 represents full-field matching, and the specific matching mode is: [illegal word].
4. The method for cleaning the illegal words in the health examination big data text data as claimed in claim 3, wherein the specific method for deleting the matched illegal words by using the algorithm in the step 4 is as follows:
deleting the matched illegal words by using the algorithm, and outputting the deleted illegal words to a quality control file 1 and a quality control file 2 so that a medical team can check whether the cleaned text data is correct;
the quality control file 1 comprises a plurality of columns of data with the following data structure:
data source: the serial number of the institution for which text illegal word cleaning is carried out;
column number: the number of the column of the health examination text data that contains the general illegal form;
standard term: the standard term corresponding to the column of the health examination text data that contains the general illegal form;
index classification: the index classification of the standard term corresponding to the column of the health examination text data that contains the general illegal form;
illegal value: the general illegal form in the text data illegal lexicon;
whether an exclusion word: whether the value is an exclusion word corresponding to an illegal word in the text data illegal lexicon, an exclusion word being text that is meaningful but partially overlaps a general illegal form;
illegal form category: the matching rule, the matching rules comprising: 1. special form; 2. regular matching; 3. full-field exact matching; wherein: "1" represents data in a special form, namely a purely numerical form containing only numerical values; "2" represents a regular match, namely the data contains a general illegal word; "3" represents a full-field match, namely the data content is exactly identical to a general illegal word; data that meets a matching rule is judged to be an illegal form, and data that meets no matching rule is judged not to be illegal;
occurrence frequency: the frequency with which the general illegal form appears;
physical examination number: the physical examination number under which the general illegal form appears in the column;
the quality control file 2 comprises a plurality of columns of data with the following data structure:
data source: the serial number of the institution for which text illegal word cleaning is carried out;
column number: the column number in the health examination text data;
standard term: the standard term corresponding to each column of the health examination text data;
index null count: the number of null values in each column of the health examination text data before the illegal words are removed;
index non-null count: the number of non-null values in each column of the health examination text data before the illegal words are removed;
non-null count after illegal value removal: the number of non-null values in each column of the health examination text data after the illegal words are removed;
illegal count: the number of illegal words removed from each column of the health examination text data;
highest-frequency illegal form: the most frequent illegal word in each column of the health examination text data;
highest illegal frequency: the word frequency of the most frequent illegal word in each column of the health examination text data.
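The per-column statistics of quality control file 2 can be sketched as below. This is an illustrative sketch only: the function name `column_summary`, the dictionary keys, and the `is_illegal` predicate are hypothetical, and it assumes that a value judged illegal is removed as a whole field and therefore becomes null after cleaning.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def column_summary(values: List[Optional[str]],
                   is_illegal: Callable[[str], bool]) -> Dict[str, object]:
    """Summarise one column of examination text before and after cleaning."""
    # Treat None and blank strings as null values.
    non_null = [v for v in values if v is not None and v.strip() != ""]
    illegal = [v for v in non_null if is_illegal(v)]
    top = Counter(illegal).most_common(1)
    return {
        "null_count": len(values) - len(non_null),
        "non_null_count": len(non_null),
        # Assumption: each illegal value is deleted as a whole field.
        "non_null_after_removal": len(non_null) - len(illegal),
        "illegal_count": len(illegal),
        "top_illegal_form": top[0][0] if top else None,
        "top_illegal_frequency": top[0][1] if top else 0,
    }
```

Calling the function once per column and writing one result per column would give a table with the fields listed for quality control file 2, apart from the data source, column number, and standard term, which come from the input metadata.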
5. The method for cleaning illegal words in the text data of the health examination big data as claimed in claim 4, wherein the step 4 comprises:
outputting the health examination text data from which the illegal words have been removed, the results of the multithreaded operations being merged and output by the algorithm, wherein the data without illegal words comprises: the physical examination number, the standard terms of the text data, and the physical examination results of the text data, the results no longer containing illegal forms.
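One way to merge the results of multithreaded cleaning is sketched below using Python's standard `concurrent.futures` module. The claim does not say how work is divided among threads, so the chunking scheme, the record keys (`exam_no`, `standard_term`, `result`), and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]  # hypothetical keys: exam_no, standard_term, result


def clean_chunk(chunk: List[Record],
                is_illegal: Callable[[str], bool]) -> List[Record]:
    """Return copies of the records with illegal results blanked out."""
    cleaned = []
    for record in chunk:
        result = record["result"]
        if result is not None and is_illegal(result):
            result = None
        cleaned.append({**record, "result": result})
    return cleaned


def clean_in_threads(records: List[Record],
                     is_illegal: Callable[[str], bool],
                     workers: int = 4,
                     chunk_size: int = 1000) -> List[Record]:
    """Clean fixed-size chunks in worker threads and merge them in order."""
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        parts = pool.map(lambda c: clean_chunk(c, is_illegal), chunks)
        return [record for part in parts for record in part]
```

Because `Executor.map` preserves submission order, the merged output keeps the original record order, so each physical examination number stays aligned with its standard term and result.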
6. The method for cleaning the illegal words in the text data of the health examination big data as claimed in claim 3, wherein the specific method for checking whether the output health examination text data is correct in the step 5 is as follows:
checking the illegal word cleaning quality control file 1 and quality control file 2, and restoring any meaningful data to the data without illegal words according to the physical examination number; and checking each column of the text data without illegal words for residual illegal words or mixed-in information, and correcting the text data without illegal words directly.
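The restoration step can be sketched as a lookup keyed by physical examination number. The claim does not specify how reviewers record their decisions, so the `meaningful` flag, the row keys, and the function name below are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (physical examination number, standard term)


def restore_meaningful(cleaned: Dict[Key, Optional[str]],
                       reviewed_rows: List[dict]) -> Dict[Key, Optional[str]]:
    """Put back removed values that reviewers marked as meaningful."""
    restored = dict(cleaned)  # leave the input mapping untouched
    for row in reviewed_rows:
        if row["meaningful"]:
            key = (row["exam_no"], row["standard_term"])
            restored[key] = row["illegal_value"]
    return restored
```

Each reviewed row here corresponds to one entry of quality control file 1 with an added reviewer decision; rows not marked meaningful leave the cleaned data unchanged.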
7. The method for cleaning the illegal words in the text data of the health examination big data as claimed in claim 4, wherein the specific method for supplementing the text data illegal lexicon in the step 6 is as follows:
supplementing the text data illegal lexicon using the inspection results of the output text, the specific content comprising the index classification, the illegal values, and the occurrence frequencies in the quality control file 1.
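Supplementing the lexicon from quality control file 1 could look like the sketch below. The lexicon layout (a mapping from index classification and illegal value to an accumulated frequency), the row keys, and the function name are all assumptions for illustration; the claim names only the three fields that are carried over.

```python
from typing import Dict, List, Tuple

LexiconKey = Tuple[str, str]  # (index classification, illegal value)


def supplement_lexicon(lexicon: Dict[LexiconKey, int],
                       qc1_rows: List[dict]) -> Dict[LexiconKey, int]:
    """Add confirmed illegal values to the lexicon, accumulating frequencies."""
    updated = dict(lexicon)  # leave the input lexicon untouched
    for row in qc1_rows:
        key = (row["index_classification"], row["illegal_value"])
        updated[key] = updated.get(key, 0) + row["frequency"]
    return updated
```

Only rows that the medical team has confirmed as illegal should be passed in, so that meaningful text restored during checking does not enter the lexicon.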
CN202110087779.5A 2021-01-22 2021-01-22 Method for cleaning illegal words of text data of health examination big data Active CN112765964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110087779.5A CN112765964B (en) 2021-01-22 2021-01-22 Method for cleaning illegal words of text data of health examination big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110087779.5A CN112765964B (en) 2021-01-22 2021-01-22 Method for cleaning illegal words of text data of health examination big data

Publications (2)

Publication Number Publication Date
CN112765964A true CN112765964A (en) 2021-05-07
CN112765964B CN112765964B (en) 2023-10-03

Family

ID=75705643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110087779.5A Active CN112765964B (en) 2021-01-22 2021-01-22 Method for cleaning illegal words of text data of health examination big data

Country Status (1)

Country Link
CN (1) CN112765964B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080047853A (en) * 2006-11-27 2008-05-30 감문호 The studying system using on computer and thereof studying method
GB201416792D0 (en) * 2014-09-23 2014-11-05 Lepeltier Marie Therese Methods and systems of handling patent cliams
CN110597964A (en) * 2019-09-27 2019-12-20 神州数码融信软件有限公司 Double-record quality inspection semantic analysis method and device and double-record quality inspection system
CN110895961A (en) * 2019-10-29 2020-03-20 泰康保险集团股份有限公司 Text matching method and device in medical data

Also Published As

Publication number Publication date
CN112765964B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
Basu How to conduct meta-analysis: a basic tutorial
CN109346145B (en) Method and system for actively monitoring adverse drug reactions
Liao et al. Outlier impact and accommodation methods: Multiple comparisons of Type I error rates
CN112380345B (en) COVID-19 scientific literature fine-grained classification method based on GNN
CN106445906A (en) Generation method and apparatus for medium-and-long phrase in domain lexicon
Mahapatra et al. Concept of outlier study: The management of outlier handling with significance in inclusive education setting
Sauvayre Types of errors hiding in Google Scholar Data
Kicsi et al. Large scale evaluation of natural language processing based test-to-code traceability approaches
CN114005530A (en) Intelligent reminding and monitoring method and system for medical repeated examination and inspection in area
CN113918705A (en) Contribution auditing method and system with early warning and recommendation functions
Williams et al. txttool: Utilities for text analysis in Stata
CN112768058A (en) Method and device for processing medical data of metering information type
CN112765964A (en) Method for cleaning illegal words of text data of big data for health examination
CN116864050A (en) Clinical trial quality control method and equipment for scheme deviation semi-quantitative evaluation
Barnett Automated detection of over-and under-dispersion in baseline tables in randomised controlled trials
CN113361585A (en) Method for optimizing and screening clues based on supervised learning algorithm
Lafia et al. A natural language processing pipeline for detecting informal data references in academic literature
KR20130045514A (en) Future technology value appraisal system and method
Plachouras et al. Information extraction of regulatory enforcement actions: From anti-money laundering compliance to countering terrorism finance
CN112802584A (en) Medical ultrasonic examination data classification method and device based on classifier
CN112802585B (en) Optimized medical X-ray examination data classification method and device based on classifier
CN116151249B (en) Impulse and graceful language detection method based on difficult sample screening
EP1182578A1 (en) System, method and computer program for patent and technology related information management and processing
Gao et al. Food Safety Regulatory Enforcement in China: A Data-Driven Approach
CN116186235A (en) Knowledge graph-based monitoring abnormal data identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant