CN109830272B - Data standardization method and device, computer equipment and storage medium

Info

Publication number: CN109830272B
Application number: CN201910011828.XA
Authority: CN
Inventors: 金晓辉; 阮晓雯; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2022-08-30
Anticipated expiration: 2039-01-07
Also published as: CN109830272A

Abstract

The embodiment of the application provides a data standardization method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring data to be standardized in a physical examination report; determining the data type corresponding to the specific value of the data; and carrying out standardization processing on the data according to the determined data type, wherein the standardization processing modes corresponding to different data types are different. According to the embodiment of the application, different standardized processing modes are adopted for data of different data types, so that the physical examination report can be comprehensively subjected to standardized processing, and the precision and accuracy of standardized processing of the physical examination report are improved; meanwhile, the data after standardized processing can be further used for model learning, and the consistency and accuracy of the data of the model learning are improved.

Description

Data standardization method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a data normalization method and apparatus, a computer device, and a storage medium.

Background

Electronic physical examination reports usually contain a large amount of information covering many physical examination items, which makes them inconvenient to process. As a result, conventional methods for recognizing electronic physical examination reports are relatively coarse: most of them directly identify and match physical examination results by data format, then screen out the recognizable data for storage and standardization so that it can be used for model learning at a later stage. However, across different physical examination reports, results for the same item of data that express the same meaning may be written in completely different ways, and results for different items of data also differ greatly from one another. A coarse recognition method therefore cannot fully recognize the physical examination results, the recognition of physical examination items is also incomplete, and the recognized data is of little benefit to later model learning.

Disclosure of Invention

The embodiment of the application provides a data standardization method and device, computer equipment and a storage medium, which can improve the precision and accuracy of data standardization processing.

In a first aspect, an embodiment of the present application provides a data normalization method, where the method includes:

acquiring data to be standardized in a physical examination report; determining the data type corresponding to the specific value of the data; and carrying out standardization processing on the data according to the determined data type, wherein the standardization processing modes corresponding to different data types are different.

In a second aspect, an embodiment of the present invention provides a data normalization apparatus, which includes corresponding units for performing the method described in the first aspect.

In a third aspect, an embodiment of the present invention provides a computer device, where the computer device includes a memory and a processor connected to the memory;

the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory to perform the method of the first aspect.

In a fourth aspect, the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method according to the first aspect.

According to the embodiment of the application, the data type corresponding to the specific value of one item of data is identified, and different standardization processing is performed on the specific value of the item of data according to different data types. According to the embodiment of the application, different standardized processing modes are adopted for data of different data types, so that the physical examination report can be comprehensively subjected to standardized processing, important physical examination indexes or character characteristics are prevented from being omitted, and the precision and accuracy of standardized processing of the physical examination report are improved; meanwhile, the data after standardized processing can be further used for model learning, and the consistency and accuracy of the data of the model learning are improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart diagram illustrating a data normalization method provided by an embodiment of the present application;

FIG. 2 is a sub-flow diagram of a data normalization method provided by an embodiment of the present application;

FIG. 3 is a sub-flow diagram of a data normalization method provided by an embodiment of the present application;

FIG. 4 is a sub-flow diagram of FIG. 3 provided by an embodiment of the present application;

FIG. 5 is a schematic illustration of a sub-flow chart of FIG. 3 provided by an embodiment of the present application;

FIG. 6 is a schematic block diagram of a data normalization apparatus provided by an embodiment of the present application;

FIG. 7 is a schematic block diagram of a type determining unit provided in an embodiment of the present application;

FIG. 8 is a schematic block diagram of a normalization unit provided by an embodiment of the application;

FIG. 9 is a schematic block diagram of a regular matching unit provided by an embodiment of the application;

FIG. 10 is a schematic block diagram of a natural language processing unit provided by an embodiment of the present application;

FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

The data referred to in the examples of the present application are described by taking the data in the physical examination report as an example. It is understood that the solution of the present application can also be applied to other scenarios and also other data that is not of the physical examination report type.

Fig. 1 is a schematic flow chart of a data normalization method provided in an embodiment of the present application. As shown in fig. 1, the method includes S101-S103.

S101, acquiring data to be standardized in the physical examination report.

There may be one physical examination report or several; in this embodiment there are several. A physical examination report contains many items of data to be standardized, such as weight, heart rate, liver color Doppler ultrasound, eyesight and the like. Each item of data includes a data item and a specific value, for example the data item height with the specific value 176 cm. If there are several physical examination reports, the same item of data to be standardized, such as the weight and its corresponding specific value, is acquired from each of them; if there is only one physical examination report, one item of data to be standardized is acquired from that report. It is understood that the physical examination results of different people may be written by different doctors. Because each doctor has different habits, the result for the same item of data may take several different values that all express the same meaning, so the physical examination results need to be standardized.

S102, determining a data type corresponding to a specific value of the item of data, wherein the data type comprises a numerical type, an enumeration type, a simple mixed type and a complex mixed type.

In this embodiment, the data types in the physical examination report include numerical type, enumerated type, simple mixed type, and complex mixed type. It is understood that the data types in the physical examination reports are divided into four types, and the four types can basically cover all the physical examination results in the physical examination reports.

A numerical-type value is a concrete number, such as 175 cm or 50 kg. Enumerated-type values include "negative", "normal", "undetected", "+", and so on. A simple mixed-type value, such as "> 100 beats/minute, sinus tachycardia", is dominated by its numerical part. A complex mixed-type value, such as "multiple hypoechoic nodules in the thyroid gland, with a maximum of about 14mm x 8mm in the right lobe and peri-nodal vascular encirclement", may be pure text or a mixture of text and numbers; it is relatively complex and may itself contain enumerated and simple mixed-type parts.

In one embodiment, as shown in FIG. 2, step S102 includes the following steps S201-S206.

S201, acquiring a specific value of the data and detecting the acquired specific value of the data.

For example, if the data item is height and the specific value is 176 cm, the specific value 176 cm is acquired. The specific value is then examined to determine the data type it corresponds to.

S202, if the specific value of the item of data is a number or a combination of the number and a unit, determining that the data type corresponding to the specific value of the item of data is a numerical type.

For example, for the data item age with the specific value 28, the specific value is a number, so the corresponding data type is determined to be numerical. For the data item hemoglobin with the specific value 135g/L, the specific value is a combination of a number and a unit, so the corresponding data type is likewise determined to be numerical.

S203, if the specific value of the item of data is one of the preset enumeration values, determining that the data type corresponding to the specific value of the item of data is an enumeration type.

The preset enumeration values include "normal", "no obvious abnormality", "negative", "undetected", "no congestion", "no swelling", "no special", "positive", "abnormal", and the like. Grades are also treated as enumerations: for example, the preset enumeration values for grades include "-", "+", "++", "+++" (as in urine sugar grading) and "I", "II", "III" (as in cleanliness grading).

S204, if the specific value of the data includes both characters and numbers, judging whether the number of the characters does not exceed a first preset number and whether the number of times of the numbers does not exceed a second preset number.

For example, suppose the specific value of an item of data is: "the morphology and size of the two kidneys were mapped, and a strong echo with an acoustic shadow of about 4 × 3 mm was observed in the left kidney". This specific value includes both text and numbers. The number of words in the text and the number of times numbers occur in the specific value are counted, and it is judged whether the word count exceeds the first preset number and whether the number count exceeds the second preset number. The first preset number may be 20 and the second preset number may be 2; other values may also be used.

And S205, if the number of words of the characters in the specific value of the item of data does not exceed a first preset number and the number of times of occurrence of the numbers does not exceed a second preset number, determining that the data type corresponding to the specific value of the item of data is a simple mixing type.

S206, if the number of words of the characters in the specific value of the item of data exceeds the first preset number, or the number of times of occurrence of the numbers exceeds the second preset number, or both, determining that the data type corresponding to the specific value of the item of data is a complex mixed type.

It should be noted that the above scheme for determining the data type of the data is not limited to this, and in other embodiments, other schemes may be used to determine the data type.
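The decision flow of steps S201-S206 can be illustrated with a minimal sketch. It is not the patented implementation: the names `DataType`, `PRESET_ENUM_VALUES`, and `classify` are hypothetical, words are counted as runs of Latin letters (the original text would count Chinese characters), and values that contain no numbers and are not preset enumeration values are assumed to fall into the complex mixed type.

```python
import re
from enum import Enum


class DataType(Enum):
    NUMERIC = "numeric"
    ENUMERATED = "enumerated"
    SIMPLE_MIXED = "simple_mixed"
    COMPLEX_MIXED = "complex_mixed"


# Hypothetical preset enumeration values, taken from the examples in the text.
PRESET_ENUM_VALUES = {
    "normal", "no obvious abnormality", "negative", "undetected",
    "positive", "abnormal", "-", "+", "++", "+++", "I", "II", "III",
}
MAX_WORDS = 20    # first preset number (example value from the text)
MAX_NUMBERS = 2   # second preset number (example value from the text)

NUMBER = r"\d+(?:\.\d+)?"
# S202: a number, optionally followed by a unit such as "cm" or "g/L".
NUMERIC_RE = re.compile(rf"{NUMBER}\s*[A-Za-z%/]*")


def classify(value: str) -> DataType:
    """Return the data type of a specific value, following S202-S206."""
    text = value.strip()
    if NUMERIC_RE.fullmatch(text):
        return DataType.NUMERIC                      # S202
    if text in PRESET_ENUM_VALUES:
        return DataType.ENUMERATED                   # S203
    numbers = re.findall(NUMBER, text)
    words = re.findall(r"[A-Za-z]+", re.sub(NUMBER, " ", text))
    # S204/S205: text and numbers are both present and both counts are small.
    if numbers and words and len(words) <= MAX_WORDS and len(numbers) <= MAX_NUMBERS:
        return DataType.SIMPLE_MIXED
    # S206, plus the assumption that pure free text is also complex mixed.
    return DataType.COMPLEX_MIXED
```

For instance, `classify("135g/L")` yields the numerical type and `classify("> 100 beats/minute, sinus tachycardia")` yields the simple mixed type under these assumptions.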

And S103, carrying out standardization processing on the data according to the determined data type, wherein the standardization processing modes corresponding to different data types are different.

The data is processed in different ways according to different data types.

In one embodiment, as shown in FIG. 3, step S103 includes the following steps S301-S305.

S301, acquiring the data type corresponding to the determined specific value of the data.

S302, if the data type corresponding to the specific value of the data is numerical type, the specific value of the data in the physical examination report is processed to unify the data unit of the data.

For example, if height is recorded as 168cm in one place and 1.68m in another, both values are converted to centimeters (168cm and 168cm) or both to meters (1.68m and 1.68m), so that the data units of the item of data are uniform. If there are multiple physical examination reports, the specific values of the item of data in all of them need to be converted.
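A minimal sketch of the unit unification in S302, assuming centimeters are chosen as the target unit for height. The conversion table `TO_CM` and the function `to_centimetres` are hypothetical names introduced for illustration only.

```python
import re

# Hypothetical conversion factors from a recorded unit to centimeters.
TO_CM = {"cm": 1.0, "m": 100.0, "mm": 0.1}

# A number followed by a unit, e.g. "168cm" or "1.68 m".
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def to_centimetres(value: str) -> float:
    """Convert a height such as '1.68m' to centimeters."""
    match = VALUE_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a number with a unit: {value!r}")
    number, unit = match.groups()
    factor = TO_CM.get(unit.lower())
    if factor is None:
        raise ValueError(f"unknown unit: {unit!r}")
    # Round to suppress floating-point noise from the multiplication.
    return round(float(number) * factor, 6)
```

With this table, both "168cm" and "1.68m" are converted to the same value, so heights from different reports become directly comparable.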

And S303, if the data type corresponding to the specific value of the data is enumerated, unifying characters in the specific value of the data, or matching and mapping the specific value of the data and a preset numerical value.

For example, "normal", "no obvious abnormality", "negative", "not detected", "no special", and the like all express the same meaning and are all unified as "normal". As an example of matching and mapping the specific value to a preset numerical value, "normal" and "abnormal" for a general physical examination item are mapped to 0 and 1 respectively, where 0 and 1 are the preset numerical values for that item of data. Graded marks, such as the urine sugar marks "-", "+", "++", "+++", are mapped to 0, 1, 2, 3, and so on, where 0, 1, 2, 3 are the preset numerical values for that item of data.
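The two options in S303 (unifying the wording, or mapping to a preset number) can be sketched with lookup tables. The table contents below are assumptions drawn from the examples in the text; the grade marks are assumed to be "-", "+", "++", "+++", and the names `SYNONYMS`, `unify_enum`, and `map_enum` are hypothetical.

```python
from typing import Optional

# Hypothetical synonym table: wordings that mean the same are unified.
SYNONYMS = {
    "normal": "normal",
    "no obvious abnormality": "normal",
    "negative": "normal",
    "not detected": "normal",
    "no special": "normal",
}
# Hypothetical preset numerical values.
STATUS_TO_NUMBER = {"normal": 0, "abnormal": 1}
GRADE_TO_NUMBER = {"-": 0, "+": 1, "++": 2, "+++": 3}


def unify_enum(value: str) -> str:
    """Unify the wording of an enumerated value (first option in S303)."""
    text = value.strip().lower()
    return SYNONYMS.get(text, text)


def map_enum(value: str) -> Optional[int]:
    """Map an enumerated value to its preset number (second option in S303)."""
    text = value.strip()
    if text in GRADE_TO_NUMBER:
        return GRADE_TO_NUMBER[text]
    # Returns None when the value has no preset number.
    return STATUS_TO_NUMBER.get(unify_enum(text))
```

Keeping the tables as plain data makes it straightforward to maintain a separate table per data item, which the text implies when it speaks of preset values "of the data".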

And S304, if the data type corresponding to the specific value of the item of data is a simple mixed type, standardizing by adopting a regular expression matching mode.

A regular expression describes a pattern, or rule, for matching strings: text is matched against a predefined combination of specific characters. Standardizing by regular expression matching means first matching the text with the regular expression and then standardizing the matched text.

S305, if the data type corresponding to the specific value of the item of data is a complex hybrid type, performing normalization by using a natural language processing method.

Natural Language Processing (NLP) standardizes the data by "understanding" the natural language in it.

The embodiment shown in fig. 3 is used to perform normalization processing using different normalization processing methods according to different data types, such as numeric type, enumerated type, simple hybrid type, complex hybrid type, and the like.

In one embodiment, as shown in FIG. 4, step S304 includes the following steps S401-S405.

S401, according to the data item of the data item, acquiring a preset regular expression corresponding to the specific value of the data item.

For example, for the data item heart rate, the preset regular expression is: sinus. The specific values of heart rate in different physical examination reports may be, for example: "heart rate 80 beats/minute, sinus heart rate"; "sinus heartbeat, 80 beats/minute"; "80 beats/minute, sinus tachycardia"; and so on. Although the descriptions in the different physical examination reports are inconsistent, "sinus" appears in all of them, so matching with the preset regular expression sinus easily finds the item of data. It should be noted that there may be multiple preset regular expressions corresponding to the same data item.

S402, judging whether the preset regular expression is matched with the specific value of the data.

For example, if the preset regular expression is sinus and the specific value of the item of data contains "sinus", the preset regular expression matches the specific value of the item of data; otherwise it does not match. If the result is no match, a prompt is issued.

And S403, if the preset regular expression is matched with the specific value of the data, judging whether the specific value of the data has a symbol and a number.

There are some descriptions of specific values of data items with symbols, such as "< 30 times", etc.

S404, if the specific value of the data has a symbol and a number, extracting the feature corresponding to the specific value of the data according to a preset format to obtain a standardized result.

The preset format may be: number, symbol, unit. For "<30 times", the features extracted according to the preset format are: 30, <, times; for "< 3.12 mmol/L", the features extracted according to the preset format are: 3.12, <, mmol/L. The features extracted according to the preset format are taken as the standardized result.

And S405, if the specific value of the item of data has no sign but has a number, extracting the number in the specific value of the item of data, and taking the extracted number as a standardized result.

If the specific value of the item of data has no symbol but has numbers, the extraction is either single-number extraction or multi-number extraction, depending on whether the specific value contains one number or several. If there is only one number, that number is extracted; for example, 80 is extracted from "heart rate 80/min, sinus heart rate". If there are several numbers, they are all extracted as several result values of the item of data; for example, from the visual acuity result "left eye vision 4.1 and right eye vision 4.3", 4.1 and 4.3 are extracted, corresponding to left eye vision and right eye vision respectively.

This embodiment defines a standardized processing manner for data of a simple hybrid type.
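Steps S401-S405 can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the function name `normalize_simple_mixed`, the set of comparison symbols, and the rule that a unit is the run of non-space characters after the number are all hypothetical choices.

```python
import re
from typing import List, Optional, Tuple, Union

NUMBER = r"\d+(?:\.\d+)?"
# A comparison symbol, a number, then an optional unit (S403/S404).
SYMBOL_VALUE_RE = re.compile(rf"([<>≤≥]=?)\s*({NUMBER})\s*([^\s,;]*)")

Result = Union[Tuple[float, str, str], List[float]]


def normalize_simple_mixed(value: str, item_pattern: str) -> Optional[Result]:
    """Normalize a simple mixed-type value.

    item_pattern is the preset regular expression for the data item
    (S401), for example "sinus" for heart rate.
    """
    # S402: no match means the caller should issue a prompt.
    if re.search(item_pattern, value) is None:
        return None
    # S403/S404: symbol and number present, so extract (number, symbol, unit).
    match = SYMBOL_VALUE_RE.search(value)
    if match is not None:
        symbol, number, unit = match.groups()
        return (float(number), symbol, unit)
    # S405: no symbol, so extract every number (one or several).
    return [float(n) for n in re.findall(NUMBER, value)]
```

Under these assumptions, "< 3.12 mmol/L" yields the number 3.12, the symbol "<", and the unit "mmol/L", while "left eye vision 4.1 and right eye vision 4.3" yields both numbers in reading order.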

In one embodiment, as shown in FIG. 5, step S305 includes the following steps S501-S506.

S501, calling a recursive grouping interface to perform sentence-breaking grouping on the text corresponding to the specific value of the item of data.

The recursive grouping interface may be an interface provided in the Chinese lexical analysis toolkit THULAC, and is used to perform sentence-breaking grouping on the text corresponding to the specific value of the item of data. THULAC is a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab of Tsinghua University, with functions such as Chinese word segmentation and part-of-speech tagging. It is understood that the text corresponding to the specific value of the item of data may contain long sentences, and a long sentence may in turn contain short sentences, text inside and outside parentheses, and the like. Calling the recursive grouping interface groups the text by sentence breaks at several levels: a large group (paragraph) contains middle groups (sentences), and a middle group contains small groups (short sentences or words), and so on.
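The nesting of large, middle, and small groups can be illustrated with a small recursive splitter. This sketch is a regex-based stand-in and does not call THULAC; the delimiter sets in `LEVELS` and the function name `group_text` are assumptions made for illustration.

```python
import re
from typing import List, Union

# Delimiters ordered from coarsest to finest (all hypothetical choices).
LEVELS = [
    r"\n+",                          # large groups: paragraphs
    r"(?:[;!?。；！？]|\.(?!\d))+",   # middle groups: sentences ("." inside 1.4 is kept)
    r"[,，、()（）]+",                # small groups: short sentences, parentheses
]

Group = Union[str, List["Group"]]


def group_text(text: str, level: int = 0) -> Group:
    """Recursively split text into nested groups, one list level per delimiter set."""
    if level == len(LEVELS):
        return text
    parts = [part.strip() for part in re.split(LEVELS[level], text)]
    return [group_text(part, level + 1) for part in parts if part]
```

Each short sentence at the innermost level can then be classified on its own, which is what steps S502-S504 rely on.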

S502, judging whether the data type of the text after the sentence break grouping belongs to a numerical type, an enumeration type or a simple mixed type. Namely, the data type of the short sentences after the sentence break grouping is judged.

And S503, if the data type of the text after the sentence break grouping belongs to a numerical type, an enumeration type or a simple mixed type, performing standardization processing by using a standardization processing mode corresponding to the numerical type, the enumeration type or the simple mixed type.

S504, if the data type of the text after the sentence segmentation grouping does not belong to a numerical type, an enumeration type or a simple mixed type, a word segmentation and part of speech tagging interface is called, word segmentation and part of speech tagging are carried out on the text after the sentence segmentation grouping, and analysis is carried out to obtain a first result.

Specifically, the short sentences obtained from the sentence-breaking grouping are acquired, the word segmentation and part-of-speech tagging interface is called to segment each short sentence into words, and the part of speech of each word is determined. The short sentences are then analyzed according to certain rules, based on the parts of speech, to obtain the first result. The parts of speech include nouns, adjectives, and the like; in general, a noun is the core word. The word segmentation and part-of-speech tagging interface may be an interface provided in the Chinese lexical analysis toolkit THULAC, used for word segmentation, part-of-speech tagging, parsing, and the like; interfaces provided by other word segmentation tools can also be used to complete the corresponding functions. As an example of such a rule, a short sentence can be regarded as three parts: 1) which organ, 2) what finding, 3) the specific value; for example 1) thyroid, 2) nodule, 3) 2 cm. It should be noted that when the interface is called for analysis in this step, it is mainly the short sentences containing numerical values that are analyzed, and the numerical features corresponding to the core words are extracted. If a short sentence has no numerical feature, the first result is null.
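The three-part rule (organ, finding, specific value) can be sketched on input that has already been segmented and tagged. The sketch does not call any segmentation library; it assumes the tagger returns (word, tag) pairs with "n" for nouns, "m" for numerals, and "q" for measure words, and the rule of taking the first noun as the organ and the last noun as the finding is a hypothetical simplification of the "certain rules" mentioned in the text.

```python
from typing import Dict, List, Optional, Tuple


def parse_phrase(tagged: List[Tuple[str, str]]) -> Optional[Dict[str, str]]:
    """Extract organ, finding, and value from a tagged short sentence.

    Returns None (a null first result) when there is no numeral or no noun.
    """
    nouns = [word for word, tag in tagged if tag == "n"]
    value = None
    for index, (word, tag) in enumerate(tagged):
        if tag == "m":
            # Attach the following measure word, if any, as the unit.
            has_unit = index + 1 < len(tagged) and tagged[index + 1][1] == "q"
            value = word + (tagged[index + 1][0] if has_unit else "")
            break
    if value is None or not nouns:
        return None
    return {"organ": nouns[0], "finding": nouns[-1], "value": value}
```

For the tagged phrase thyroid/n, nodule/n, 2/m, cm/q, this returns the organ "thyroid", the finding "nodule", and the value "2cm"; a phrase with no numeral returns None, matching the null first result described above.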

S505, calling a keyword extraction algorithm, counting the short sentences after the sentence break grouping to obtain a first frequency of the candidate keywords and a second frequency of the candidate keywords in the multi-copy physical examination report document where the data is located, extracting a group of keywords in specific values of the data from the candidate keywords according to the first frequency and the second frequency, and taking the extracted keywords as a second result.

The keyword extraction algorithm may be the TF-IDF algorithm. TF (Term Frequency) is the frequency with which a keyword occurs; it is used as the first frequency, that is, the frequency with which a (candidate) keyword occurs in the specific value of the item of data. IDF (Inverse Document Frequency) measures how widely a word occurs across the whole document collection; the inverse document frequency is used as the second frequency, which is computed from the occurrence of the (candidate) keyword in the multiple physical examination report documents in which the item of data is located. Extracting a group of keywords in the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency proceeds as follows: multiply the first frequency and the second frequency of each candidate keyword to obtain a product; sort the products in descending order; take the first group of candidate keywords in the sorted order; and use this first group of candidate keywords as the desired group of keywords in the specific value of the item of data. The group of keywords is taken as the feature corresponding to the item of data, and this feature is the second result.

For example, if the data item (physical examination item) is lung, the extracted keywords (features) may be: inflammation, calcification, and the like, that is, the lungs show inflammation and calcification.
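The ranking described for S505 can be sketched with a small TF-IDF scorer. The sketch assumes the text has already been segmented into tokens, uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common variant and not necessarily the one intended by the text, and introduces the hypothetical name `top_keywords`.

```python
import math
from collections import Counter
from typing import List


def top_keywords(doc_tokens: List[str], corpus: List[List[str]], k: int = 2) -> List[str]:
    """Return the k candidate keywords with the highest TF x IDF product."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / total                                # first frequency
        df = sum(1 for doc in corpus if term in doc)      # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # second frequency (smoothed)
        scores[term] = tf * idf
    # Descending by score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]
```

A word such as "lung" that occurs in every report receives a low IDF and drops down the ranking, while words specific to one report, such as "inflammation", rise to the top.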

S506, taking the first result and the second result as a normalized result corresponding to the specific value of the data.

In one embodiment, before the word segmentation and part-of-speech tagging interface is called, the method further includes step S503a.

S503a, detecting whether a number is present in the text after the sentence-breaking grouping. If a number is present in the text after the sentence-breaking grouping, the step of calling the word segmentation and part-of-speech tagging interface is executed; if no number is present, step S505 is executed.

In this embodiment, the step of calling the word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after the sentence-breaking grouping, and analyzing it is mainly useful when numbers are present. If the text after the sentence-breaking grouping contains no number, this step is skipped, which reduces the amount of computation and saves time in standardization.

In an embodiment, after step S506, the method further comprises S506a, S506b, S506 c.

S506a, acquiring the preset characteristics and characteristic identifications corresponding to the specific values of the data.

For example, for the data item lung, the preset features are "normal", "inflammation", "calcification", and the like. The corresponding feature identifications are, respectively, "0, 1" (0 indicates normal; 1 indicates abnormal), "0, 1" (0 indicates the feature is absent, i.e. no inflammation; 1 indicates the feature is present, i.e. inflammation), and "0, 1" (0 indicates the feature is absent, i.e. no calcification; 1 indicates the feature is present, i.e. calcification).

S506b, matching the normalized result corresponding to the specific value of the data with the preset characteristics to obtain a matching result.

For example, if the normalized result is "abnormal" and "inflammation", the matching result obtained by matching against the preset features is "abnormal" and "inflammation".

S506c, according to the matching result, marking the standardization result by using the corresponding characteristic mark.

If the matching result is "abnormal" and "inflammation", the results of marking with the corresponding feature identifications are "1", "1", and "0" respectively; if the matching result is "abnormal", "inflammation", and "calcification", the results of marking are "1", "1", and "1" respectively.

This embodiment further labels the normalized results to quantify the normalized results for analysis and statistics of the model.
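Steps S506a-S506c amount to turning the matched features into a fixed-order vector of 0/1 flags. A minimal sketch follows, assuming the lung features from the example; `PRESET_FEATURES` and `label_features` are hypothetical names, and the first flag is modeled as "abnormal" so that 0 means normal and 1 means abnormal.

```python
from typing import Dict, Iterable, Sequence

# Hypothetical preset features for the data item "lung".
PRESET_FEATURES = ("abnormal", "inflammation", "calcification")


def label_features(
    normalized_result: Iterable[str],
    preset: Sequence[str] = PRESET_FEATURES,
) -> Dict[str, int]:
    """Mark each preset feature with 1 if it was matched, otherwise 0."""
    matched = set(normalized_result)
    return {feature: int(feature in matched) for feature in preset}
```

Because the preset feature order is fixed, the flags for every report line up column by column, which is what makes the result usable as model input.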

The data in the physical examination report are classified in a targeted manner, for example, the data types are divided into four different types, and the data in the physical examination report are respectively subjected to different standardized processing according to different classified results, so that the physical examination report can be comprehensively subjected to standardized processing, important physical examination indexes or character characteristics are prevented from being omitted, and the precision and the accuracy of the standardized processing of the physical examination report are improved. The data after the standardization processing can be further used for model learning, and the consistency and the accuracy of the data of the model learning are improved.

Fig. 6 is a schematic block diagram of a data normalization apparatus provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes a corresponding unit for performing the data normalization method. Specifically, as shown in fig. 6, the apparatus 60 includes an obtaining unit 601, a type determining unit 602, and a normalizing unit 603.

The acquisition unit 601 is configured to acquire one item of data to be standardized in the physical examination report.

A type determining unit 602, configured to determine a data type corresponding to a specific value of the item of data, where the data type includes a numerical type, an enumerated type, a simple mixed type, and a complex mixed type.

In one embodiment, as shown in fig. 7, the type determining unit 602 includes an acquisition detecting unit 701, a numerical type determining unit 702, an enumeration type determining unit 703, a quantity judging unit 704, and a mixed type determining unit 705. The acquisition detecting unit 701 is configured to acquire a specific value of the item of data and detect the acquired specific value of the item of data. The numerical type determining unit 702 is configured to determine that the data type corresponding to the specific value of the item of data is a numerical type if the specific value of the item of data is a number or a combination of a number and a unit. The enumeration type determining unit 703 is configured to determine that the data type corresponding to the specific value of the item of data is an enumeration type if the specific value of the item of data is one of the preset enumeration values. The quantity judging unit 704 is configured to judge, if the specific value of the item of data includes both text and numbers, whether the number of words in the text does not exceed a first preset number and whether the number of times numbers occur does not exceed a second preset number. The mixed type determining unit 705 is configured to determine that the data type corresponding to the specific value of the item of data is a simple mixed type if the number of words of the text in the specific value of the item of data does not exceed the first preset number and the number of times numbers occur does not exceed the second preset number; otherwise, to determine that the data type corresponding to the specific value of the item of data is a complex mixed type.

A normalizing unit 603, configured to perform normalization processing on the item of data according to the determined data type, where normalization processing manners corresponding to different data types are different.

In one embodiment, as shown in fig. 8, the normalization unit 603 includes a type acquisition unit 801, a numerical processing unit 802, an enumeration processing unit 803, a regular matching unit 804, and a natural language processing unit 805. The type acquisition unit 801 is configured to acquire the data type determined for the specific value of the item of data. The numerical processing unit 802 is configured to, if the data type corresponding to the specific value of the item of data is a numerical type, process the specific value of the item of data in the physical examination report to unify the data units of the item of data. The enumeration processing unit 803 is configured to, if the data type corresponding to the specific value of the item of data is an enumeration type, unify the characters in the specific value of the item of data, or perform matching mapping between the specific value of the item of data and a preset numerical value. The regular matching unit 804 is configured to, if the data type corresponding to the specific value of the item of data is a simple mixed type, perform normalization by using a regular expression matching method. The natural language processing unit 805 is configured to, if the data type corresponding to the specific value of the item of data is a complex mixed type, perform normalization by using a natural language processing method.

In one embodiment, as shown in fig. 9, the regular matching unit 804 includes an expression obtaining unit 901, a matching judgment unit 902, a symbol number judgment unit 903, a first extraction unit 904, and a second extraction unit 905. The expression obtaining unit 901 is configured to obtain a preset regular expression corresponding to the specific value of the item of data according to the data item of the item of data. The matching judgment unit 902 is configured to judge whether the preset regular expression matches the specific value of the item of data. The symbol number judgment unit 903 is configured to judge whether a symbol and a number exist in the specific value of the item of data if the preset regular expression matches the specific value of the item of data. The first extraction unit 904 is configured to, if both a symbol and a number exist in the specific value of the item of data, extract the features corresponding to the specific value according to a preset format to obtain a normalized result. The second extraction unit 905 is configured to, if no symbol but a number exists in the specific value of the item of data, extract the number in the specific value and use the extracted number as the normalized result.
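The flow through units 901 to 905 might look like the following sketch. The item name `urine_protein`, its pattern, the set of characters treated as symbols, and the `symbol|number` output format are assumptions made for illustration; the description leaves the preset regular expression and the preset format unspecified.

```python
import re
from typing import Optional

# Hypothetical preset regular expression per data item: exactly one number,
# optionally surrounded by non-digit text.
ITEM_PATTERNS = {
    "urine_protein": re.compile(r"^[^\d]*\d+(?:\.\d+)?[^\d]*$"),
}
SYMBOL_RE = re.compile(r"[+\-±]+")       # assumed set of result symbols
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def normalize_simple_mixed(item: str, value: str) -> Optional[str]:
    """Return the normalized string, or None if the preset pattern does not match."""
    pattern = ITEM_PATTERNS.get(item)
    if pattern is None or pattern.search(value) is None:
        return None
    number = NUMBER_RE.search(value)
    if number is None:
        return None
    symbol = SYMBOL_RE.search(value)
    if symbol is not None:
        # Symbol and number both present: emit them in the assumed preset format.
        return f"{symbol.group()}|{number.group()}"
    # Number only: the extracted number is the normalized result.
    return number.group()
```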

In one embodiment, as shown in fig. 10, the natural language processing unit 805 includes a sentence-breaking unit 101, a text type judging unit 102, a part-of-speech analysis unit 103, a keyword extraction unit 104, and a result determining unit 105. The sentence-breaking unit 101 is configured to call a recursive grouping interface and perform sentence-break grouping on the text corresponding to the specific value of the item of data. The text type judging unit 102 is configured to judge whether the data type of the text after the sentence-break grouping belongs to a numerical type, an enumeration type, or a simple mixed type; if it does, the numerical processing unit, the enumeration processing unit, or the regular matching unit is triggered accordingly. The part-of-speech analysis unit 103 is configured to, if the data type of the text after the sentence-break grouping does not belong to a numerical type, an enumeration type, or a simple mixed type, call a word segmentation and part-of-speech tagging interface, perform word segmentation and part-of-speech tagging on the text after the sentence-break grouping, and analyze the text to obtain a first result. The keyword extraction unit 104 is configured to call a keyword extraction algorithm, count the short sentences after the sentence-break grouping to obtain a first frequency with which each candidate keyword occurs and a second frequency with which the candidate keyword occurs in the multiple physical examination report documents in which the item of data is located, extract a group of keywords in the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency, and use the extracted keywords as a second result. The result determining unit 105 is configured to use the first result and the second result as the normalized result corresponding to the specific value of the item of data.
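The keyword extraction step can be illustrated with a TF-IDF-style score, which is one plausible reading of "first frequency" (occurrences within the short sentences) and "second frequency" (occurrences across report documents). This section does not name the algorithm, so the scoring formula, the function name `extract_keywords`, and the assumption that text has already been segmented into tokens are all choices made for this sketch.

```python
import math
from collections import Counter
from typing import List, Set


def extract_keywords(phrases: List[List[str]],
                     corpus: List[Set[str]],
                     top_k: int = 3) -> List[str]:
    """Rank candidate keywords by term frequency times smoothed inverse document frequency.

    phrases: token lists for the short sentences of one specific value.
    corpus:  one token set per physical examination report document.
    """
    term_counts = Counter(tok for phrase in phrases for tok in phrase)
    total = sum(term_counts.values())
    n_docs = len(corpus)
    scores = {}
    for tok, count in term_counts.items():
        doc_freq = sum(1 for doc in corpus if tok in doc)        # second frequency
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0      # smoothed IDF
        scores[tok] = (count / total) * idf                       # first frequency * IDF
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


phrases = [["liver", "cyst"], ["liver", "echo", "normal"]]
corpus = [{"liver", "normal"}, {"normal", "heart"}, {"liver", "cyst"}, {"normal"}]
keywords = extract_keywords(phrases, corpus, top_k=2)
```

Words that appear in nearly every report, such as "normal" in this toy corpus, score low, while words that are frequent in the value but rare elsewhere rise to the top.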

In one embodiment, the natural language processing unit 805 further includes a number detecting unit 102a. The number detecting unit 102a is configured to detect whether a number exists in the text after the sentence-break grouping if the data type of the text after the sentence-break grouping does not belong to a numerical type, an enumeration type, or a simple mixed type. If a number exists, the part-of-speech analysis unit 103 is triggered. If no number exists, the keyword extraction unit 104 is triggered.
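A possible shape for the sentence-break grouping and the subsequent number check is sketched below. The "recursive grouping interface" is not specified in the text, so the two delimiter levels, the length limit, and the names `split_recursive` and `route_fragment` are hypothetical.

```python
import re
from typing import List

# Assumed delimiter hierarchy: strong breaks first, then weak breaks.
DELIMITER_LEVELS = [";；。", ",，、"]
MAX_FRAGMENT_CHARS = 12   # hypothetical length below which no further split is tried


def split_recursive(text: str, level: int = 0) -> List[str]:
    """Split text on progressively weaker delimiters until fragments are short."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= MAX_FRAGMENT_CHARS or level >= len(DELIMITER_LEVELS):
        return [text]
    delimiter_class = "[" + re.escape(DELIMITER_LEVELS[level]) + "]"
    fragments: List[str] = []
    for part in re.split(delimiter_class, text):
        fragments.extend(split_recursive(part, level + 1))
    return fragments


def route_fragment(fragment: str) -> str:
    """Fragments containing a digit go to part-of-speech analysis, others to keyword extraction."""
    return "pos_tagging" if re.search(r"\d", fragment) else "keyword_extraction"


fragments = split_recursive(
    "Liver normal; gallbladder wall rough, 2 polyps, largest 5 mm")
```

In a full pipeline each fragment would first be re-checked against the numerical, enumeration, and simple mixed types, and only the remainder would reach `route_fragment`.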

In an embodiment, the natural language processing unit 805 further includes a feature identifier acquiring unit 105a, a feature matching unit 105b, and a marking unit 105c. The feature identifier acquiring unit 105a is configured to acquire the preset features and feature identifiers corresponding to the specific value of the item of data. The feature matching unit 105b is configured to match the normalized result corresponding to the specific value of the item of data against the preset features to obtain a matching result. The marking unit 105c is configured to mark the normalized result with the corresponding feature identifier according to the matching result.
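The marking step performed by units 105a to 105c could be sketched as a lookup of preset features in the normalized result. The feature table, the identifier format, and the substring-matching rule are assumptions for illustration only.

```python
from typing import Dict, List

# Hypothetical preset features and their identifiers.
PRESET_FEATURES: Dict[str, str] = {
    "fatty liver": "F001",
    "liver cyst": "F002",
}


def tag_result(normalized_result: str) -> Dict[str, object]:
    """Attach the identifiers of all preset features found in the normalized result."""
    lowered = normalized_result.lower()
    feature_ids: List[str] = [
        feature_id
        for feature, feature_id in PRESET_FEATURES.items()
        if feature in lowered
    ]
    return {"result": normalized_result, "feature_ids": feature_ids}
```

Marking results with stable identifiers lets downstream model training treat differently worded findings as the same feature.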

It should be noted that, as will be clear to those skilled in the art, specific implementation processes of the above apparatus and each unit may refer to corresponding descriptions in the foregoing method embodiments, and for convenience and brevity of description, no further description is provided herein.

The above-described apparatus may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 11.

Fig. 11 is a schematic block diagram of a computer device according to an embodiment of the present application. The device is a terminal such as a mobile terminal, a PC, or an iPad. The device 110 includes a processor 112, a memory, and a network interface 113 connected by a system bus 111, where the memory may include a non-volatile storage medium 114 and an internal memory 115.

The non-volatile storage medium 114 may store an operating system 1141 and a computer program 1142. The computer program 1142 stored in the non-volatile storage medium 114, when executed by the processor 112, may implement the data normalization method described above. The processor 112 provides computing and control capabilities to support the operation of the entire device. The internal memory 115 provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor 112, the processor 112 performs the data normalization method described above. The network interface 113 is used for network communication. Those skilled in the art will appreciate that the configuration shown in fig. 11 is a block diagram of only the portion of the configuration relevant to the present application and does not limit the devices to which the present application applies; a particular device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

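The processor 112 is configured to run the computer program stored in the memory to implement the following steps: acquiring data to be standardized in a physical examination report; determining the data type corresponding to the specific value of the item of data; and performing normalization processing on the item of data according to the determined data type, wherein the normalization processing modes corresponding to different data types are different.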

In an embodiment, the data types include a numerical type, an enumeration type, a simple hybrid type, and a complex hybrid type, and when the processor 112 executes the step of determining the data type corresponding to the specific value of the item of data, the following steps are specifically implemented:

acquiring a specific value of the item of data and detecting the acquired specific value of the item of data; if the specific value of the item of data is a number or a combination of a number and a unit, determining that the data type corresponding to the specific value of the item of data is a numerical type; if the specific value of the item of data is one of preset enumeration values, determining that the data type corresponding to the specific value of the item of data is an enumeration type; if the specific value of the item of data includes both text and numbers, judging whether the word count of the text does not exceed a first preset number and whether the number of occurrences of numbers does not exceed a second preset number; if the word count of the text in the specific value of the item of data does not exceed the first preset number and the number of occurrences of numbers does not exceed the second preset number, determining that the data type corresponding to the specific value of the item of data is a simple mixed type; otherwise, determining that the data type corresponding to the specific value of the item of data is a complex mixed type.

In an embodiment, the data types include a numerical type, an enumeration type, a simple hybrid type, and a complex hybrid type, and when the processor 112 performs the step of performing the normalization processing on the item of data according to the determined data type, the following steps are specifically implemented:

if the data type corresponding to the specific value of the data is numerical type, processing the specific value of the data in the physical examination report to unify the data units of the data; if the data type corresponding to the specific value of the data is enumerated, unifying characters in the specific value of the data, or matching and mapping the specific value of the data and a preset numerical value; if the data type corresponding to the specific value of the data is a simple mixed type, standardizing by adopting a regular expression matching mode; if the data type corresponding to the specific value of the item of data is a complex mixed type, a natural language processing method is adopted for standardization.

In an embodiment, when the processor 112 executes the step of performing normalization by using a regular expression matching method if the data type corresponding to the specific value of the item of data is a simple hybrid type, the following steps are specifically implemented:

acquiring a preset regular expression corresponding to the specific value of the item of data according to the data item of the item of data; judging whether the preset regular expression matches the specific value of the item of data; if the preset regular expression matches the specific value of the item of data, judging whether a symbol and a number exist in the specific value of the item of data; if both a symbol and a number exist in the specific value of the item of data, extracting the features corresponding to the specific value of the item of data according to a preset format to obtain a normalized result; if no symbol but a number exists in the specific value of the item of data, extracting the number in the specific value of the item of data and taking the extracted number as the normalized result.

In an embodiment, when the processor 112 executes the step of performing normalization by using a natural language processing method if the data type corresponding to the specific value of the item of data is a complex hybrid type, the following steps are specifically implemented:

calling a recursive grouping interface to perform sentence-break grouping on the text corresponding to the specific value of the item of data; judging whether the data type of the text after the sentence-break grouping belongs to a numerical type, an enumeration type or a simple mixed type; if the data type of the text after the sentence-break grouping belongs to a numerical type, an enumeration type or a simple mixed type, performing normalization processing using the normalization processing mode corresponding to the numerical type, the enumeration type or the simple mixed type; if the data type of the text after the sentence-break grouping does not belong to a numerical type, an enumeration type or a simple mixed type, calling a word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after the sentence-break grouping, and analyzing to obtain a first result; calling a keyword extraction algorithm, counting the short sentences after the sentence-break grouping to obtain a first frequency with which each candidate keyword occurs and a second frequency with which the candidate keyword occurs in the multiple physical examination report documents in which the item of data is located, extracting a group of keywords in the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency, and taking the extracted keywords as a second result; and taking the first result and the second result as the normalized result corresponding to the specific value of the item of data.

In an embodiment, after the processor 112 performs the step of taking the first result and the second result as the normalized result corresponding to the specific value of the data, the following steps are further implemented:

acquiring the preset characteristics and characteristic identifications corresponding to the specific values of the data; matching the standardized result corresponding to the specific data value with the preset characteristics to obtain a matching result; and marking the standardized result by using the corresponding characteristic identifier according to the matching result.

In an embodiment, before executing the step of calling a word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after the sentence-break grouping, and analyzing to obtain the first result, the processor 112 further implements the following steps:

detecting whether numbers exist in the text after the sentence breaks are grouped; and if the text after the sentence segmentation grouping has numbers, executing a step of calling a word segmentation and part of speech tagging interface, performing word segmentation and part of speech tagging on the text after the sentence segmentation grouping, and analyzing to obtain a first result.

It should be understood that in the embodiments of the present application, the processor 112 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing relevant hardware. The computer program may be stored in a storage medium, which may be a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.

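Accordingly, the present application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, implements the following steps: acquiring data to be standardized in a physical examination report; determining the data type corresponding to the specific value of the item of data; and performing normalization processing on the item of data according to the determined data type, wherein the normalization processing modes corresponding to different data types are different.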

In an embodiment, the data types include a numerical type, an enumeration type, a simple hybrid type, and a complex hybrid type, and when the processor executes the step of determining the data type corresponding to the specific value of the item of data, the following steps are specifically implemented:

acquiring a specific value of the item of data and detecting the acquired specific value of the item of data; if the specific value of the item of data is a number or a combination of a number and a unit, determining that the data type corresponding to the specific value of the item of data is a numerical type; if the specific value of the item of data is one of preset enumeration values, determining that the data type corresponding to the specific value of the item of data is an enumeration type; if the specific value of the item of data includes both text and numbers, judging whether the word count of the text does not exceed a first preset number and whether the number of occurrences of numbers does not exceed a second preset number; if the word count of the text in the specific value of the item of data does not exceed the first preset number and the number of occurrences of numbers does not exceed the second preset number, determining that the data type corresponding to the specific value of the item of data is a simple mixed type; otherwise, determining that the data type corresponding to the specific value of the item of data is a complex mixed type.

In an embodiment, the data types include a numerical type, an enumeration type, a simple hybrid type, and a complex hybrid type, and when the processor performs the step of normalizing the item of data according to the determined data type, the following steps are specifically implemented:

if the data type corresponding to the specific value of the data is numerical type, processing the specific value of the data in the physical examination report to unify the data units of the data; if the data type corresponding to the specific value of the data is enumerated, unifying characters in the specific value of the data, or matching and mapping the specific value of the data and a preset numerical value; if the data type corresponding to the specific value of the data is a simple mixed type, standardizing by adopting a regular expression matching mode; if the data type corresponding to the specific value of the data is a complex mixed type, a natural language processing method is adopted for standardization.

In an embodiment, when the processor executes the step of performing normalization by using a regular expression matching method if the data type corresponding to the specific value of the item of data is a simple mixed type, the following steps are specifically implemented:

acquiring a preset regular expression corresponding to the specific value of the item of data according to the data item of the item of data; judging whether the preset regular expression matches the specific value of the item of data; if the preset regular expression matches the specific value of the item of data, judging whether a symbol and a number exist in the specific value of the item of data; if both a symbol and a number exist in the specific value of the item of data, extracting the features corresponding to the specific value of the item of data according to a preset format to obtain a normalized result; if no symbol but a number exists in the specific value of the item of data, extracting the number in the specific value of the item of data and taking the extracted number as the normalized result.

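In an embodiment, when the processor executes the step of performing normalization by using a natural language processing method if the data type corresponding to the specific value of the item of data is a complex mixed type, the following steps are specifically implemented:

calling a recursive grouping interface to perform sentence-break grouping on the text corresponding to the specific value of the item of data; judging whether the data type of the text after the sentence-break grouping belongs to a numerical type, an enumeration type or a simple mixed type; if the data type of the text after the sentence-break grouping belongs to a numerical type, an enumeration type or a simple mixed type, performing normalization processing using the normalization processing mode corresponding to the numerical type, the enumeration type or the simple mixed type; if the data type of the text after the sentence-break grouping does not belong to a numerical type, an enumeration type or a simple mixed type, calling a word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after the sentence-break grouping, and analyzing to obtain a first result; calling a keyword extraction algorithm, counting the short sentences after the sentence-break grouping to obtain a first frequency with which each candidate keyword occurs and a second frequency with which the candidate keyword occurs in the multiple physical examination report documents in which the item of data is located, extracting a group of keywords in the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency, and taking the extracted keywords as a second result; and taking the first result and the second result as the normalized result corresponding to the specific value of the item of data.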

In an embodiment, after the step of taking the first result and the second result as the normalized results corresponding to the specific values of the item of data, the processor further implements the following steps:

acquiring the preset characteristics and characteristic identifications corresponding to the specific values of the data; matching the standardized result corresponding to the specific value of the data with the preset characteristics to obtain a matching result; and marking the standardized result by using the corresponding feature identifier according to the matching result.

In an embodiment, before executing the step of calling a word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after the sentence-break grouping, and analyzing to obtain the first result, the processor further implements the following steps:

detecting whether numbers exist in the text after the sentence breaks are grouped; and if the text after the sentence segmentation grouping has numbers, executing a step of calling a segmentation and part-of-speech tagging interface, performing segmentation and part-of-speech tagging on the text after the sentence segmentation grouping, and analyzing to obtain a first result.

The storage medium may be any computer-readable storage medium that can store program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and the division of the units is only a division by logical function; other divisions are possible in actual implementation. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of data normalization, the method comprising:

acquiring data to be standardized in a physical examination report;

acquiring a specific value of the data and detecting the acquired specific value of the data;

if the specific value of the item of data is a number or a combination of a number and a unit, determining that the data type corresponding to the specific value of the item of data is a numerical type;

if the specific value of the item of data is one of preset enumeration values, determining that the data type corresponding to the specific value of the item of data is an enumeration type;

if the specific value of the data comprises characters and numbers, judging whether the number of the characters does not exceed a first preset number and whether the number of times of the numbers does not exceed a second preset number;

if the number of words of characters in the specific value of the item of data does not exceed a first preset number and the number of times of occurrence of the numbers does not exceed a second preset number, determining that the data type corresponding to the specific value of the item of data is a simple mixed type; otherwise, determining the data type corresponding to the specific value of the data as a complex mixed type;

if the data type corresponding to the specific value of the data is numerical type, processing the specific value of the data in the physical examination report to unify the data units of the data;

if the data type corresponding to the specific value of the data is enumerated, unifying characters in the specific value of the data, or matching and mapping the specific value of the data and a preset numerical value;

if the data type corresponding to the specific value of the data is a simple mixed type, standardizing by adopting a regular expression matching mode;

if the data type corresponding to the specific value of the data is a complex mixed type, adopting a natural language processing method to carry out standardization;

wherein, if the data type corresponding to the specific value of the data is a complex mixed type, the step of standardizing by adopting a natural language processing method specifically comprises:

calling a recursive grouping interface to perform sentence-breaking grouping on the text corresponding to the specific value of the data;

judging whether the data type of the text after the sentence break grouping belongs to a numerical type, an enumeration type or a simple mixed type;

if the data type of the text after the sentence break grouping belongs to a numerical type, an enumeration type or a simple mixed type, carrying out standardization processing by using a standardization processing mode corresponding to the numerical type, the enumeration type or the simple mixed type;

if the data type of the text after the sentence segmentation grouping does not belong to a numerical type, an enumeration type or a simple mixed type, calling a segmentation and part-of-speech tagging interface, performing segmentation and part-of-speech tagging on the text after the sentence segmentation grouping, and analyzing to obtain a first result;

calling a keyword extraction algorithm, counting the short sentences after the sentence break grouping to obtain a first frequency of the occurrence of candidate keywords and a second frequency of the occurrence of the candidate keywords in the multiple physical examination report documents in which the data is located, extracting a group of keywords in the specific values of the data from the candidate keywords according to the first frequency and the second frequency, and taking the extracted keywords as a second result;

and taking the first result and the second result as the standardized result corresponding to the specific data value.

2. The method of claim 1, wherein if the data type corresponding to the specific value of the item of data is a simple mixed type, performing normalization by using a regular expression matching method includes:

acquiring a preset regular expression corresponding to a specific value of the data according to the data item of the data;

judging whether the preset regular expression is matched with the specific value of the data;

if the preset regular expression is matched with the specific value of the data, judging whether the specific value of the data has symbols and numbers;

if the specific value of the data has a symbol and a number, extracting the characteristics corresponding to the specific value of the data according to a preset format to obtain a standardized result;

if there is no symbol but a number in the specific value of the item of data, the number in the specific value of the item of data is extracted, and the extracted number is taken as the normalized result.

3. The method of claim 1, further comprising:

acquiring the preset characteristics and characteristic identifications corresponding to the specific values of the data;

matching the standardized result corresponding to the specific value of the data with the preset characteristics to obtain a matching result;

and marking the standardized result by using the corresponding feature identifier according to the matching result.

4. The method of claim 1, wherein before the invoking of the segmentation and part-of-speech tagging interface, the segmentation and part-of-speech tagging of the text after the sentence break grouping, and the analyzing to obtain the first result, the method further comprises:

detecting whether numbers exist in the text after the sentence breaks are grouped;

and if the numbers exist, executing a step of calling a segmentation and part-of-speech tagging interface, performing segmentation and part-of-speech tagging on the text after the sentence break grouping, and analyzing to obtain a first result.

5. A data normalization apparatus, characterized in that the data normalization apparatus comprises:

the acquisition unit is used for acquiring data to be standardized in the physical examination report;

the acquisition detection unit is used for acquiring the specific value of the item of data and detecting the acquired specific value of the item of data;

a numerical type determining unit, configured to determine that a data type corresponding to a specific value of the item of data is a numerical type if the specific value of the item of data is a number, or a combination of a number and a unit;

an enumeration type determining unit, configured to determine that a data type corresponding to a specific value of the item of data is an enumeration type if the specific value of the item of data is one of preset enumeration values;

the quantity judging unit is used for judging whether the word number of the words does not exceed a first preset quantity and whether the number of times of the occurrence of the numbers does not exceed a second preset quantity if the specific value of the data comprises both the words and the numbers;

a mixed type determining unit, configured to determine that a data type corresponding to the specific value of the item of data is a simple mixed type if the number of words of the text in the specific value of the item of data does not exceed a first preset number and the number of times of occurrence of the number does not exceed a second preset number; otherwise, determining the data type corresponding to the specific value of the data is a complex mixed type;

the numerical processing unit is used for processing the specific value of the data in the physical examination report to unify the data unit of the data if the data type corresponding to the specific value of the data is a numerical type;

the enumeration processing unit is used for unifying characters in the specific values of the data or matching and mapping the specific values of the data with preset numerical values if the data type corresponding to the specific values of the data is an enumeration type;

the regular matching unit is used for adopting a regular expression matching mode to carry out standardization if the data type corresponding to the specific value of the item of data is a simple mixed type;

the natural language processing unit is used for adopting a natural language processing method to carry out standardization if the data type corresponding to the specific value of the data is a complex mixed type;

wherein the natural language processing unit includes:

a sentence-breaking unit, which is used for calling a recursive grouping interface and performing sentence-break grouping on the text corresponding to the specific value of the data;

the text type judging unit is used for judging whether the data type of the text after the sentence break grouping belongs to a numerical type, an enumeration type or a simple mixed type; if the data type of the text after the sentence break grouping belongs to a numerical type, an enumeration type or a simple mixed type, carrying out standardization processing by using a standardization processing mode corresponding to the numerical type, the enumeration type or the simple mixed type;

the part of speech analysis unit is used for calling word segmentation and part of speech tagging interfaces if the data type of the text after the sentence segmentation grouping does not belong to a numerical type, an enumeration type or a simple mixed type, performing word segmentation and part of speech tagging on the text after the sentence segmentation grouping, and analyzing to obtain a first result;

the keyword extraction unit is used for calling a keyword extraction algorithm, counting the short sentences after the sentence break grouping to obtain a first frequency of the occurrence of the candidate keywords and a second frequency of the occurrence of the candidate keywords in the multiple physical examination report documents where the data is located, extracting a group of keywords in specific values of the data from the candidate keywords according to the first frequency and the second frequency, and taking the extracted keywords as a second result;

and the result determining unit is used for taking the first result and the second result as the normalized result corresponding to the specific value of the data.

6. A computer device, comprising a memory, and a processor coupled to the memory;

the memory is used for storing a computer program; the processor is configured to execute a computer program stored in the memory to perform the method of any of claims 1-4.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-4.