CN109830272A

CN109830272A - Data normalization method, apparatus, computer equipment and storage medium

Info

Publication number: CN109830272A
Application number: CN201910011828.XA
Authority: CN
Inventors: 金晓辉; 阮晓雯; 徐亮
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2019-05-31
Anticipated expiration: 2039-01-07
Also published as: CN109830272B

Abstract

Embodiments of the present application provide a data normalization method, apparatus, computer device, and storage medium. The method includes: obtaining an item of data to be standardized from a physical examination report; determining the data type corresponding to the specific value of the item of data; and standardizing the item of data according to the determined data type, where different data types are standardized in different ways. Because different standardization modes are applied to data of different types, physical examination reports can be standardized comprehensively, which improves the precision and accuracy of standardizing physical examination reports. The standardized data can also be used for model learning, which improves the consistency and accuracy of the data used for model learning.

Description

Data normalization method, apparatus, computer equipment and storage medium

Technical field

The present application relates to the technical field of data processing, and in particular to a data normalization method, apparatus, computer device, and storage medium.

Background

Electronic physical examination reports generally contain a large amount of information and cover many kinds of examination items, which makes them hard to process. As a result, the recognition methods commonly used for electronic physical examination reports are rather coarse: most of them identify the corresponding examination results directly by matching data formats, screen out the data that can be recognized, store and standardize it, and use it for later model learning. However, across different physical examination reports, the results for the same item of data may express the same meaning while being written in entirely different ways, and the results of different items also differ greatly. Such coarse recognition methods cannot recognize all of these results, the recognition of examination items is incomplete, and the recognized data are poorly suited to later model learning.

Summary of the invention

Embodiments of the present application provide a data normalization method, apparatus, computer device, and storage medium that can improve the precision and accuracy of data standardization.

In a first aspect, an embodiment of the present application provides a data normalization method, which includes:

obtaining an item of data to be standardized from a physical examination report; determining the data type corresponding to the specific value of the item of data; and standardizing the item of data according to the determined data type, where different data types are standardized in different ways.

In a second aspect, an embodiment of the present application provides a data normalization apparatus, which includes units for executing the method described in the first aspect.

In a third aspect, an embodiment of the present application provides a computer device, which includes a memory and a processor connected to the memory.

The memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory so as to execute the method described in the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the method described in the first aspect is implemented.

The embodiments of the present application identify the data type corresponding to the specific value of an item of data and standardize that value differently depending on the data type. Because different standardization modes are applied to different data types, physical examination reports can be standardized comprehensively, the omission of important examination indicators or textual features is avoided, and the precision and accuracy of standardizing physical examination reports are improved. The standardized data can also be used for model learning, which improves the consistency and accuracy of the data used for model learning.

Brief description of the drawings

To describe the technical solutions of the embodiments more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the data normalization method provided by an embodiment of the present application.

Fig. 2 is a schematic sub-flowchart of the data normalization method provided by an embodiment of the present application.

Fig. 3 is a schematic sub-flowchart of the data normalization method provided by an embodiment of the present application.

Fig. 4 is a schematic sub-flowchart of Fig. 3 provided by an embodiment of the present application.

Fig. 5 is a schematic sub-flowchart of Fig. 3 provided by an embodiment of the present application.

Fig. 6 is a schematic block diagram of the data normalization apparatus provided by an embodiment of the present application.

Fig. 7 is a schematic block diagram of the type determining unit provided by an embodiment of the present application.

Fig. 8 is a schematic block diagram of the standardization unit provided by an embodiment of the present application.

Fig. 9 is a schematic block diagram of the regular-expression matching unit provided by an embodiment of the present application.

Fig. 10 is a schematic block diagram of the natural language processing unit provided by an embodiment of the present application.

Fig. 11 is a schematic block diagram of the computer device provided by an embodiment of the present application.

Detailed description of the embodiments

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present application.

The embodiments are described using data in physical examination reports as an example. It should be understood that the solution of the present application can also be applied to other scenarios and to data other than physical examination reports.

Fig. 1 is a schematic flowchart of the data normalization method provided by an embodiment of the present application. As shown in Fig. 1, the method includes S101-S103.

S101: Obtain an item of data to be standardized from a physical examination report.

There may be one physical examination report or several; in this embodiment there are several. A physical examination report contains multiple items of data to be standardized, such as weight, heart rate, liver color ultrasound, and vision. Each item of data includes a data item and a specific value, for example data item: height, specific value: 176cm. If there are several physical examination reports, one item of data to be standardized is obtained from all of them, such as height and the specific values recorded for height; if there is only one report, the item of data to be standardized is obtained from that report. It should be understood that the examination results of different people may be written by different doctors. Because each doctor has different habits, the results for the same item of data may take several different values that all express the same meaning, so the examination results need to be standardized.

S102: Determine the data type corresponding to the specific value of the item of data, where the data types include numeric, enumerated, simple mixed, and complex mixed.

In this embodiment, the data in a physical examination report are divided into four data types: numeric, enumerated, simple mixed, and complex mixed. These four types can cover essentially all examination results in a physical examination report.

Numeric type: the specific value of the item of data is a concrete number, such as 175cm or 50kg. Enumerated type: for example "negative", "normal", "not detected", "+", "++". Simple mixed type: for example "> 100 beats/min, sinus tachycardia"; this type is dominated by the numerical value. Complex mixed type: for example "Multiple hypoechoic nodules are seen in the thyroid; the largest, in the right lobe, measures about 14mm × 8mm, with a vascular ring around the nodule". Such a value may be pure text or a mix of text and numbers; it is relatively complicated and may contain enumerated and simple mixed content.

In one embodiment, as shown in Fig. 2, step S102 includes the following steps S201-S206.

S201: Obtain the specific value of the item of data and inspect the obtained specific value.

For example, if the data item is height and the specific value is 176cm, the specific value 176cm is obtained. The specific value is then inspected to determine its data type.

S202: If the specific value of the item of data is a number, or a combination of a number and a unit, determine that the data type corresponding to the specific value is numeric.

For example, for the data item age with the specific value 28, the value is a number, so the data type is numeric. For the data item hemoglobin with the specific value 135g/L, the value is a combination of a number and a unit, so the data type is also numeric.

S203: If the specific value of the item of data is one of the preset enumerated values, determine that the data type corresponding to the specific value is enumerated.

The preset enumerated values include "normal", "no obvious abnormality", "no obvious abnormality seen", "negative", "not detected", "no congestion", "no enlargement", "nothing special", "positive", "abnormal", and so on. Graded values are also treated as enumerated: the preset enumerated values include grades such as "-", "+", "++", "+++" (for example, the grade of urine glucose) and "grade I", "grade II", "grade III" (for example, the grade of cleanliness).

S204: If the specific value of the item of data contains both text and numbers, judge whether the number of text characters does not exceed a first preset quantity and whether the number of times numbers occur does not exceed a second preset quantity.

For example, the specific value of an item of data may be: "Both kidneys are normal in shape, size, and position; a strong echo with acoustic shadow is visible in the left kidney, about 4*3mm in size." This value contains both text and numbers. The number of text characters and the number of times numbers occur in the value are counted and compared with the first preset quantity and the second preset quantity respectively. The first preset quantity may be 20 and the second preset quantity may be 2; other values may also be used.

S205: If the number of text characters in the specific value does not exceed the first preset quantity and the number of times numbers occur does not exceed the second preset quantity, determine that the data type corresponding to the specific value is simple mixed.

S206: If the number of text characters in the specific value exceeds the first preset quantity, or the number of times numbers occur exceeds the second preset quantity, or both, determine that the data type corresponding to the specific value is complex mixed.

It should be noted that the data type can be determined in other ways; in other embodiments, other schemes may be used.
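The decision sequence S201-S206 can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the application: the function name `classify_value`, the small `ENUM_VALUES` set, and the regular expressions are assumptions made for the example. Text length is counted here as the number of alphabetic characters, which is only a stand-in for the character count of the original Chinese text, so the thresholds 20 and 2 (the example values given above) are illustrative.

```python
import re

# Hypothetical subset of the preset enumerated values (English stand-ins).
ENUM_VALUES = {"normal", "abnormal", "negative", "positive", "not detected",
               "-", "+", "++", "+++"}
FIRST_PRESET = 20   # maximum text characters for the simple mixed type
SECOND_PRESET = 2   # maximum number occurrences for the simple mixed type

# A number, optionally followed by a unit such as "cm", "g/L" or "beats/min".
NUMERIC_RE = re.compile(r"^\s*\d+(?:\.\d+)?\s*[A-Za-z/%μ]*\s*$")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def classify_value(value: str) -> str:
    """Return 'numeric', 'enumerated', 'simple_mixed' or 'complex_mixed'."""
    v = value.strip()
    if NUMERIC_RE.match(v):                      # S202: number or number + unit
        return "numeric"
    if v.lower() in ENUM_VALUES:                 # S203: preset enumerated value
        return "enumerated"
    numbers = NUMBER_RE.findall(v)               # S204: count text and numbers
    text_chars = sum(ch.isalpha() for ch in v)
    if numbers and text_chars <= FIRST_PRESET and len(numbers) <= SECOND_PRESET:
        return "simple_mixed"                    # S205
    return "complex_mixed"                       # S206 (also pure free text)
```

For instance, `classify_value("135g/L")` falls into the numeric branch, while a long ultrasound description exceeds the text threshold and is treated as complex mixed.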

S103: Standardize the item of data according to the determined data type, where different data types are standardized in different ways.

The way the item of data is processed depends on its data type.

In one embodiment, as shown in Fig. 3, step S103 includes the following steps S301-S305.

S301: Obtain the determined data type corresponding to the specific value of the item of data.

S302: If the data type corresponding to the specific value of the item of data is numeric, process the specific value of the item of data in the physical examination report so as to unify the data unit of the item of data.

For example, if height is recorded as 168cm in one place and as 1.68m in another, the values are either all converted to centimetres (168cm, 168cm) or all converted to metres (1.68m, 1.68m), so that the data unit of the item is unified. If there are several physical examination reports, the specific value of the item must be converted in every report.
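A small sketch of the unit unification in S302 for the height example, assuming centimetres are chosen as the target unit. The conversion table `TO_CM` and the function name `to_centimetres` are hypothetical names introduced for illustration only.

```python
import re

# Hypothetical conversion factors to the target unit (centimetres).
TO_CM = {"cm": 1.0, "m": 100.0, "mm": 0.1}
VALUE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def to_centimetres(value: str) -> str:
    """Convert a numeric-type height such as '1.68m' to a string in cm."""
    match = VALUE_RE.match(value)
    if match is None:
        raise ValueError(f"not a number with a unit: {value!r}")
    number, unit = float(match.group(1)), match.group(2).lower()
    if unit not in TO_CM:
        raise ValueError(f"unknown unit: {unit!r}")
    # ':g' drops trailing zeros and hides tiny floating-point residue.
    return f"{number * TO_CM[unit]:g}cm"
```

Applied to every report that contains the item, this gives all recorded heights the same unit before they are compared or fed to a model.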

S303: If the data type corresponding to the specific value of the item of data is enumerated, unify the text in the specific value, or map the specific value to a preset numerical value.

For example, "normal", "no obvious abnormality", "no obvious abnormality seen", "negative", "not detected", and "nothing special" all express the same meaning and are therefore all unified as "normal". As an example of mapping specific values to preset numerical values, "normal" and "abnormal" for an examination item are mapped to 0 and 1 respectively, where 0 and 1 are the preset values for that item; graded values such as "-", "+", "++", "+++" for urine glucose are mapped to 0, 1, 2, 3 respectively, where 0, 1, 2, 3 are the preset values for that item.
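The two options in S303 (unifying synonymous text, or mapping to preset numbers) might look like the sketch below. The synonym set and code tables are assumptions built from the examples in the text, and the function names are hypothetical.

```python
# Hypothetical synonym set and code tables based on the examples in the text.
NORMAL_SYNONYMS = {"normal", "no obvious abnormality", "negative",
                   "not detected", "nothing special"}
BINARY_CODES = {"normal": 0, "abnormal": 1}
GRADE_CODES = {"-": 0, "+": 1, "++": 2, "+++": 3}


def unify_enum(value: str) -> str:
    """Collapse all wordings that mean 'normal' into the single label 'normal'."""
    v = value.strip().lower()
    return "normal" if v in NORMAL_SYNONYMS else v


def encode_enum(value: str) -> int:
    """Map an enumerated value to its preset number; raises KeyError if unknown."""
    v = value.strip()
    if v in GRADE_CODES:                 # graded results such as urine glucose
        return GRADE_CODES[v]
    return BINARY_CODES[unify_enum(v)]   # normal / abnormal results
```

Raising `KeyError` for an unmapped value is a design choice of this sketch; a real pipeline might instead log the value for manual review.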

S304: If the data type corresponding to the specific value of the item of data is simple mixed, standardize it by regular-expression matching.

A regular expression describes a pattern, or rule, for matching strings: text is matched against a combination of predefined special characters (the rule). To standardize by regular-expression matching, the text is first matched with the regular expression, and the matched text is then standardized.

S305: If the data type corresponding to the specific value of the item of data is complex mixed, standardize it using natural language processing.

Natural language processing (NLP) standardizes the value by "understanding" the natural language.

The embodiment shown in Fig. 3 applies a different standardization method to each data type (numeric, enumerated, simple mixed, complex mixed).

In one embodiment, as shown in Fig. 4, step S304 includes the following steps S401-S405.

S401: Obtain the preset regular expression corresponding to the specific value of the item of data according to the data item of the item of data.

For example, for the data item heart rate, the preset regular expression is "sinus". The specific values in different physical examination reports may be, for example: "heart rate 80 beats/min, sinus rhythm"; "sinus rhythm, 80 beats/min"; "80 beats/min, sinus heart rate". Although the descriptions differ between reports, "sinus" appears in all of them, so matching with the preset regular expression "sinus" easily identifies the data item. Note that one data item may have several preset regular expressions.

S402: Judge whether the preset regular expression matches the specific value of the item of data.

For example, if the preset regular expression is "sinus" and "sinus" appears in the specific value of the item of data, the preset regular expression is determined to match the specific value; otherwise it is determined not to match. If there is no match, a prompt is issued.

S403: If the preset regular expression matches the specific value of the item of data, judge whether the specific value contains both a symbol and a number.

The specific values of some data items contain a symbol, for example "< 30 times".

S404: If the specific value of the item of data contains both a symbol and a number, extract the features corresponding to the specific value according to a preset format to obtain the standardization result.

The preset format may be: number, symbol, unit. For "< 30 times", the features extracted according to the preset format are: 30, <, times. For "< 3.12mmol/L", the features extracted are: 3.12, <, mmol/L. The features extracted according to the preset format are taken as the standardization result.

S405: If the specific value of the item of data contains numbers but no symbol, extract the numbers from the specific value and take the extracted numbers as the standardization result.

If the specific value contains numbers but no symbol, the extraction is either a single-number extraction or a multiple-number extraction, depending on whether the value contains one number or several (the count of numbers does not exceed the second preset quantity). If there is only one number, that number is extracted, for example 80 from "heart rate 80 beats/min, sinus rhythm". If there are several numbers, all of them are extracted and used as multiple result values of the item of data; for example, for the vision test result "left eye vision 4.1, right eye vision 4.3", 4.1 and 4.3 are extracted and correspond to the left eye and the right eye respectively.

This embodiment defines how data of the simple mixed type are standardized.
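Steps S401-S405 can be sketched as below. The pattern table `PRESET_PATTERNS`, the set of comparison symbols, and the dictionary layout of the result are all assumptions chosen for the example; the text only specifies the order "number, symbol, unit".

```python
import re
from typing import Optional

# Hypothetical preset patterns per data item (one item may have several).
PRESET_PATTERNS = {
    "heart rate": [re.compile(r"sinus")],
    "vision": [re.compile(r"vision")],
}
# Comparison symbol, number, then an optional unit up to the next separator.
SYMBOL_NUMBER_UNIT = re.compile(r"([<>≤≥]=?)\s*(\d+(?:\.\d+)?)\s*([^\s,;]*)")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def standardize_simple_mixed(item: str, value: str) -> Optional[dict]:
    """Return extracted features, or None if no preset pattern matches."""
    patterns = PRESET_PATTERNS.get(item, [])
    if not any(p.search(value) for p in patterns):     # S402: no match
        return None                                    # caller issues the prompt
    match = SYMBOL_NUMBER_UNIT.search(value)           # S403: symbol and number?
    if match:                                          # S404: number, symbol, unit
        symbol, number, unit = match.groups()
        return {"number": number, "symbol": symbol, "unit": unit}
    return {"numbers": NUMBER.findall(value)}          # S405: numbers only
```

With this layout, a vision result yields two numbers in reading order, which the caller can pair with the left and right eye.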

In one embodiment, as shown in Fig. 5, step S305 includes the following steps S501-S506.

S501: Call a recursive grouping interface to split the text corresponding to the specific value of the item of data into groups at sentence breaks.

The recursive grouping interface may be an interface provided in the Chinese lexical analysis toolkit THULAC, used to split the text corresponding to the specific value into groups at sentence breaks. THULAC is a Chinese lexical analysis toolkit developed and released by the Natural Language Processing and Computational Social Science Laboratory of Tsinghua University; it provides functions such as Chinese word segmentation and part-of-speech tagging. It should be understood that the text corresponding to the specific value may contain long sentences, short sentences nested inside long sentences, text inside and outside brackets, and so on. Calling the recursive grouping interface splits the text hierarchically: a large group (section) contains middle groups (sentences), and a middle group contains small groups (short sentences or words).
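The nested grouping described in S501 can be illustrated without any NLP toolkit. The sketch below does not use THULAC or reproduce its interface; it simply splits on punctuation with Python's `re` module, and the function name `group_text` is hypothetical. It produces two levels (sentences, then short clauses) and avoids splitting on the decimal point inside numbers such as 4.1.

```python
import re

# Sentence breaks: semicolons, Chinese full stops, or a period that is not
# followed by a digit (so "4.1" is left intact).
SENTENCE_BREAK = re.compile(r"(?:[;。；]|\.(?!\d))\s*")
CLAUSE_BREAK = re.compile(r"[,，]\s*")


def group_text(text: str) -> list:
    """Split text into sentences, and each sentence into short clauses."""
    sentences = [s.strip() for s in SENTENCE_BREAK.split(text) if s.strip()]
    return [[c.strip() for c in CLAUSE_BREAK.split(s) if c.strip()]
            for s in sentences]
```

Each innermost clause can then be checked against the numeric, enumerated, and simple mixed types before any heavier analysis is attempted.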

S502: Judge whether the data type of the text after sentence-break grouping is numeric, enumerated, or simple mixed. That is, the data type of each short sentence obtained by grouping is determined.

S503: If the data type of the text after sentence-break grouping is numeric, enumerated, or simple mixed, standardize it using the standardization mode corresponding to the numeric, enumerated, or simple mixed type.

S504: If the data type of the text after sentence-break grouping is not numeric, enumerated, or simple mixed, call a word segmentation and part-of-speech tagging interface to segment and tag the text, and analyze it to obtain a first result.

Specifically, each short sentence obtained by grouping is segmented by calling the word segmentation and part-of-speech tagging interface, and the part of speech of each segmented word is determined. Based on the parts of speech, the short sentence is analyzed according to certain rules to obtain the first result. Parts of speech include nouns, adjectives, and so on; nouns are generally the core words. The interface may be the one provided in the Chinese lexical analysis toolkit THULAC for word segmentation, part-of-speech tagging, and syntactic analysis; interfaces provided by other segmentation tools may also be used. As an example of such rules, a short sentence can be regarded as three parts: 1) which organ, 2) what is wrong, 3) the specific numerical value; for example, 1) thyroid, 2) nodule, 3) 2cm. Note that in this step the analysis mainly targets short sentences that contain numerical values and extracts the numerical feature corresponding to the core word. If the sentence has no numerical feature, the first result is empty.

S505: Call a keyword extraction algorithm to compute statistics on the short sentences obtained by grouping, so as to obtain a first frequency with which each candidate keyword occurs and a second frequency with which the candidate keyword occurs in the multiple physical examination report documents containing the item of data; extract a group of keywords for the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency, and take the extracted keywords as a second result.

The keyword extraction algorithm may be TF-IDF. TF (term frequency) is the frequency with which a keyword occurs; it is used as the first frequency, that is, the frequency with which a candidate keyword occurs in the specific value of the item of data. IDF (inverse document frequency) is a measure derived from how often a word occurs across the whole corpus; it is referred to as the second frequency and is computed from the occurrences of the candidate keyword in the multiple physical examination report documents containing the item of data. The group of keywords is extracted as follows: multiply the first frequency and the second frequency of each candidate keyword to obtain a product; sort the products in descending order; take the first group of candidate keywords after sorting; and use this group as the keywords of the specific value of the item of data. This group of keywords is taken as the features corresponding to the item of data, and these features are the second result.

For example, if the data item (examination item) is lung and the extracted keywords (features) are inflammation, calcification, and so on, this indicates that the lung shows inflammation and calcification.
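A compact sketch of the multiply-sort-select procedure in S505 follows. The text does not give an exact formula, so this example assumes the conventional TF-IDF form with a smoothed logarithmic IDF; the function name `top_keywords`, the parameter `k`, and the pre-tokenized input are likewise assumptions (for Chinese text the tokens would come from a word segmenter).

```python
import math
from collections import Counter


def top_keywords(tokens: list, corpus: list, k: int = 2) -> list:
    """Rank the tokens of one value by TF-IDF against a corpus of token lists."""
    counts = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / len(tokens)                       # first frequency
        df = sum(1 for doc in corpus if term in doc)   # documents containing term
        idf = math.log((1 + n_docs) / (1 + df)) + 1    # smoothed inverse doc freq
        scores[term] = tf * idf                        # product of the two
    # Descending by score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

In the lung example, a word such as "lung" that appears in every report receives a low weight, while "inflammation" and "calcification" rank first.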

S506: Take the first result and the second result as the standardization result corresponding to the specific value of the item of data.

In one embodiment, before the word segmentation and part-of-speech tagging interface is called, the method further includes S503a.

S503a: Detect whether the text after sentence-break grouping contains a number. If it does, execute the step of calling the word segmentation and part-of-speech tagging interface; if it does not, execute step S505.

In this embodiment, the step of calling the word segmentation and part-of-speech tagging interface to segment, tag, and analyze the grouped text is aimed mainly at text that contains numbers. If the grouped text contains no number, this step is skipped, which reduces the computation required for standardization and saves time.

In one embodiment, after step S506, the method further includes S506a, S506b, and S506c.

S506a: Obtain the preset features and feature identifiers corresponding to the specific value of the item of data.

For example, for the data item lung, the preset features are "normal", "inflammation", "calcification", and so on. The corresponding feature identifiers are "0, 1" (0 means normal; 1 means abnormal), "0, 1" (0 means the feature is absent, that is, no inflammation; 1 means the feature is present, that is, inflammation), and "0, 1" (0 means the feature is absent, that is, no calcification; 1 means the feature is present, that is, calcification).

S506b: Match the standardization result corresponding to the specific value of the item of data against the preset features to obtain a matching result.

If the standardization result is "abnormal", "inflammation", the matching result obtained after matching against the preset features is "abnormal", "inflammation".

S506c: Mark the standardization result with the corresponding feature identifiers according to the matching result.

If the matching result is "abnormal", "inflammation", the result marked with the corresponding feature identifiers is "1", "1", "0". If the matching result is "abnormal", "inflammation", "calcification", the marked result is "1", "1", "1".

This embodiment further marks the standardization result so that it is expressed numerically, which facilitates analysis and statistics by a model.
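The marking in S506a-S506c amounts to turning the matched features into a fixed-length vector of 0/1 flags. In this sketch the table `LUNG_FEATURES` is a hypothetical encoding of the lung example: each preset feature is paired with the term whose presence sets its flag to 1.

```python
# Hypothetical preset features for the data item "lung": feature name mapped
# to the term that sets its flag to 1 (for "normal", 1 means abnormal).
LUNG_FEATURES = {
    "normal": "abnormal",
    "inflammation": "inflammation",
    "calcification": "calcification",
}


def mark_features(result_terms: list, features: dict) -> list:
    """Return one 0/1 flag per preset feature, in the order of `features`."""
    found = set(result_terms)
    return [1 if trigger in found else 0 for trigger in features.values()]
```

Because every report yields a vector of the same length and order, the flags can be used directly as model input columns.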

The method embodiments above classify the data in a physical examination report in a targeted way, for example into four different data types, and standardize the data differently according to the classification result. Physical examination reports can therefore be standardized comprehensively, the omission of important examination indicators or textual features is avoided, and the precision and accuracy of standardizing physical examination reports are improved. The standardized data can also be used for model learning, which improves the consistency and accuracy of the data used for model learning.

Fig. 6 is a schematic block diagram of the data normalization apparatus provided by an embodiment of the present application. As shown in Fig. 6, the apparatus includes units for executing the data normalization method described above. Specifically, the apparatus 60 includes an acquiring unit 601, a type determining unit 602, and a standardization unit 603.

The acquiring unit 601 is configured to obtain an item of data to be standardized from a physical examination report.

The type determining unit 602 is configured to determine the data type corresponding to the specific value of the item of data, where the data types include numeric, enumerated, simple mixed, and complex mixed.

In one embodiment, as shown in Fig. 7, the type determining unit 602 includes an obtaining and detection unit 701, a numeric type determining unit 702, an enumerated type determining unit 703, a quantity judging unit 704, and a mixed type determining unit 705. The obtaining and detection unit 701 is configured to obtain the specific value of the item of data and inspect the obtained specific value. The numeric type determining unit 702 is configured to determine that the data type corresponding to the specific value is numeric if the specific value is a number or a combination of a number and a unit. The enumerated type determining unit 703 is configured to determine that the data type corresponding to the specific value is enumerated if the specific value is one of the preset enumerated values. The quantity judging unit 704 is configured to judge, if the specific value contains both text and numbers, whether the number of text characters does not exceed the first preset quantity and whether the number of times numbers occur does not exceed the second preset quantity. The mixed type determining unit 705 is configured to determine that the data type corresponding to the specific value is simple mixed if the number of text characters does not exceed the first preset quantity and the number of times numbers occur does not exceed the second preset quantity, and otherwise to determine that the data type is complex mixed.

The standardization unit 603 is configured to standardize the item of data according to the determined data type, where different data types are standardized in different ways.

In one embodiment, as shown in Fig. 8, the standardization unit 603 includes a type acquiring unit 801, a numeric processing unit 802, an enumeration processing unit 803, a regular-expression matching unit 804, and a natural language processing unit 805. The type acquiring unit 801 is configured to obtain the determined data type corresponding to the specific value of the item of data. The numeric processing unit 802 is configured to process the specific value of the item of data in the physical examination report so as to unify the data unit of the item of data if the data type is numeric. The enumeration processing unit 803 is configured to unify the text in the specific value, or to map the specific value to a preset numerical value, if the data type is enumerated. The regular-expression matching unit 804 is configured to standardize by regular-expression matching if the data type is simple mixed. The natural language processing unit 805 is configured to standardize using natural language processing if the data type is complex mixed.

In one embodiment, as shown in Fig. 9, the regular-expression matching unit 804 includes an expression obtaining unit 901, a matching judging unit 902, a symbol-and-number judging unit 903, a first extraction unit 904, and a second extraction unit 905. The expression obtaining unit 901 is configured to obtain, according to the data item to which the item of data belongs, the preset regular expression corresponding to the specific value of the item of data. The matching judging unit 902 is configured to judge whether the preset regular expression matches the specific value of the item of data. The symbol-and-number judging unit 903 is configured to judge, if the preset regular expression matches the specific value, whether the specific value contains both symbols and numbers. The first extraction unit 904 is configured to extract, if the specific value contains both symbols and numbers, the features corresponding to the specific value according to a preset format, so as to obtain the standardization result. The second extraction unit 905 is configured to extract, if the specific value contains no symbols but contains numbers, the numbers in the specific value and to take the extracted numbers as the standardization result.
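The regular-expression branch described above might look like the following sketch. The per-item patterns, the output formats, and the set of characters treated as "symbols" are assumptions for illustration; the text only states that they are preset.

```python
import re
from typing import Optional

# Hypothetical presets: item name -> (regular expression, output format).
PRESETS = {
    "blood_pressure": (re.compile(r"(\d+)\s*/\s*(\d+)"), "{0}/{1}"),
    "heart_rate": (re.compile(r"(\d+(?:\.\d+)?)"), "{0}"),
}
SYMBOLS = set("/<>+-")  # assumed set of symbols
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def standardize_simple_mixed(item_name: str, value: str) -> Optional[str]:
    """Standardize a simple mixed value, or return None if nothing matches."""
    preset = PRESETS.get(item_name)
    if preset is None:
        return None
    pattern, output_format = preset
    match = pattern.search(value)
    if match is None:
        return None
    has_symbol = any(ch in SYMBOLS for ch in value)
    numbers = NUMBER.findall(value)
    if has_symbol and numbers:
        # Symbols and numbers: extract the features in the preset format.
        return output_format.format(*match.groups())
    if numbers:
        # No symbols, but numbers: the extracted number is the result.
        return numbers[0]
    return None
```

For example, under these assumptions `"120/80mmHg"` for the blood pressure item yields `"120/80"`, and `"about 72 bpm"` for the heart rate item yields `"72"`.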

In one embodiment, as shown in Fig. 10, the natural language processing unit 805 includes a sentence segmentation unit 101, a text type judging unit 102, a part-of-speech analysis unit 103, a keyword extraction unit 104, and a result determining unit 105. The sentence segmentation unit 101 is configured to call a recursive grouping interface to segment the text corresponding to the specific value of the item of data into sentences and group them. The text type judging unit 102 is configured to judge whether the data type of the text after sentence segmentation and grouping belongs to the numeric type, the enumeration type, or the simple mixed type. If the data type of the segmented text belongs to the numeric type, the enumeration type, or the simple mixed type, the numeric processing unit, the enumeration processing unit, or the regular-expression matching unit is triggered accordingly. The part-of-speech analysis unit 103 is configured, if the data type of the segmented text does not belong to the numeric type, the enumeration type, or the simple mixed type, to call a word segmentation and part-of-speech tagging interface, perform word segmentation and part-of-speech tagging on the segmented text, and analyze the result to obtain a first result. The keyword extraction unit 104 is configured to call a keyword extraction algorithm and compute statistics over the short sentences obtained by sentence segmentation and grouping, so as to obtain a first frequency at which each candidate keyword occurs and a second frequency at which the candidate keyword occurs in the multiple physical examination report files containing the item of data, to extract a group of keywords for the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency, and to take the extracted keywords as a second result. The result determining unit 105 is configured to take the first result and the second result as the standardization result corresponding to the specific value of the item of data.
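The keyword extraction step above combines two frequencies but does not state how. The sketch below assumes a TF-IDF-style score, in which the first frequency is the count of a candidate in the short sentences and the second frequency is the number of report files containing it. Whitespace tokenization stands in for a real word segmenter, and all names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def extract_keywords(short_sentences: List[str],
                     report_files: List[str],
                     top_k: int = 3) -> List[str]:
    """Rank candidate keywords by an assumed TF-IDF-style score."""
    # First frequency: occurrences of each candidate in the short sentences.
    tokens = [tok for s in short_sentences for tok in s.lower().split()]
    first_freq = Counter(tokens)

    # Second frequency: number of report files in which the candidate occurs.
    report_tokens = [set(doc.lower().split()) for doc in report_files]
    n_reports = len(report_tokens)

    scores = {}
    for tok, tf in first_freq.items():
        df = sum(1 for toks in report_tokens if tok in toks)
        # Candidates that appear in most reports are down-weighted.
        scores[tok] = tf * math.log((1 + n_reports) / (1 + df))

    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]
```

Under this scoring, a word that occurs in nearly every report (such as the organ name) ranks below words that are specific to the finding, which is the usual motivation for weighting by document frequency.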

In one embodiment, the natural language processing unit 805 further includes a number detection unit 102a. The number detection unit 102a is configured to detect, if the data type of the text after sentence segmentation and grouping does not belong to the numeric type, the enumeration type, or the simple mixed type, whether numbers exist in the segmented text. If numbers exist, the part-of-speech analysis unit 103 is triggered. If no numbers exist, the keyword extraction unit 104 is triggered.
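The routing performed by the number detection unit can be sketched as a single check. The returned labels are hypothetical names for the two downstream units.

```python
import re


def route_segment(segment: str) -> str:
    """Choose the downstream step for a segment that is not numeric,
    enumeration, or simple mixed."""
    if re.search(r"\d", segment):
        return "pos_tagging"         # numbers present: part-of-speech analysis
    return "keyword_extraction"      # no numbers: keyword extraction
```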

In one embodiment, the natural language processing unit 805 further includes a feature identifier obtaining unit 105a, a feature matching unit 105b, and a marking unit 105c. The feature identifier obtaining unit 105a is configured to obtain the preset features and feature identifiers corresponding to the specific value of the item of data. The feature matching unit 105b is configured to match the standardization result corresponding to the specific value of the item of data against the preset features to obtain a matching result. The marking unit 105c is configured to mark the standardization result with the corresponding feature identifier according to the matching result.
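A minimal sketch of the matching and marking described above follows, assuming that the preset features are short phrases, that a match means the phrase occurs in the standardization result, and that each feature has a hypothetical identifier such as `F01`.

```python
from typing import List, Optional, Tuple

# Hypothetical preset features and their feature identifiers.
PRESET_FEATURES = {"fatty liver": "F01", "echo enhanced": "F02"}


def mark_results(results: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each standardization result with the identifier of the first
    preset feature it contains, or None if no feature matches."""
    marked = []
    for result in results:
        identifier = None
        for feature, feature_id in PRESET_FEATURES.items():
            if feature in result.lower():
                identifier = feature_id
                break
        marked.append((result, identifier))
    return marked
```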

It should be noted that, as those skilled in the art will clearly understand, for the specific implementation of the above apparatus and of each unit, reference may be made to the corresponding description in the foregoing method embodiments; for convenience and brevity of description, it is not repeated here.

The above apparatus may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in Fig. 11.

Fig. 11 is a schematic block diagram of a computer device provided by an embodiment of the present application. The device is a device such as a terminal, for example a mobile terminal, a PC terminal, or an iPad. The device 110 includes a processor 112, a memory, and a network interface 113 connected through a system bus 111, wherein the memory may include a non-volatile storage medium 114 and an internal memory 115.

The non-volatile storage medium 114 can store an operating system 1141 and a computer program 1142. When the computer program 1142 stored in the non-volatile storage medium is executed by the processor 112, the data normalization method described above can be implemented. The processor 112 provides computing and control capabilities and supports the operation of the entire device. The internal memory 115 provides an environment for running the computer program stored in the non-volatile storage medium; when the computer program is executed by the processor 112, it causes the processor 112 to perform the data normalization method described above. The network interface 113 is used for network communication. Those skilled in the art will understand that the structure shown in Fig. 11 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the device to which the solution is applied; a specific device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

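The processor 112 is configured to run the computer program stored in the memory to implement the following steps: obtaining an item of data to be standardized from a physical examination report; determining the data type corresponding to the specific value of the item of data; and standardizing the item of data according to the determined data type, wherein different data types correspond to different standardization methods.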

In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When performing the step of determining the data type corresponding to the specific value of the item of data, the processor 112 specifically implements the following steps:

obtaining the specific value of the item of data and detecting the obtained specific value; if the specific value of the item of data is a number, or a combination of a number and a unit, determining that the data type corresponding to the specific value is the numeric type; if the specific value is one of the preset enumerated values, determining that the data type corresponding to the specific value is the enumeration type; if the specific value includes both text and numbers, judging whether the number of characters of the text is less than a first preset quantity and whether the number of times numbers occur is less than a second preset quantity; if the number of characters of the text in the specific value is less than the first preset quantity and the number of times numbers occur is less than the second preset quantity, determining that the data type corresponding to the specific value is the simple mixed type; otherwise, determining that the data type corresponding to the specific value is the complex mixed type.

In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When performing the step of standardizing the item of data according to the determined data type, the processor 112 specifically implements the following steps:

if the data type corresponding to the specific value of the item of data is the numeric type, processing the specific value of the item of data in the physical examination report so as to unify the data unit of the item of data; if the data type is the enumeration type, unifying the text in the specific value of the item of data, or matching and mapping the specific value to a preset numerical value; if the data type is the simple mixed type, performing standardization by means of regular expression matching; if the data type is the complex mixed type, performing standardization by means of natural language processing.
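One possible way to organize the per-type processing above is a dispatch table keyed by data type. In the sketch below the four strategy functions are placeholders that only report which strategy was chosen; the type names and function names are hypothetical.

```python
from typing import Callable, Dict, Tuple


# Placeholder strategies: each returns the strategy name and the trimmed value.
def _unify_units(value: str) -> Tuple[str, str]:
    return ("unit_unification", value.strip())


def _map_enumeration(value: str) -> Tuple[str, str]:
    return ("enumeration_mapping", value.strip())


def _match_regex(value: str) -> Tuple[str, str]:
    return ("regex_matching", value.strip())


def _apply_nlp(value: str) -> Tuple[str, str]:
    return ("natural_language_processing", value.strip())


STRATEGIES: Dict[str, Callable[[str], Tuple[str, str]]] = {
    "numeric": _unify_units,
    "enum": _map_enumeration,
    "simple_mixed": _match_regex,
    "complex_mixed": _apply_nlp,
}


def standardize(value: str, data_type: str) -> Tuple[str, str]:
    """Select the standardization strategy for the determined data type."""
    strategy = STRATEGIES.get(data_type)
    if strategy is None:
        raise ValueError(f"unknown data type: {data_type!r}")
    return strategy(value)
```

Keeping the mapping in one table means that supporting an additional data type only requires registering one more function.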

In one embodiment, when performing the step of performing standardization by means of regular expression matching if the data type corresponding to the specific value of the item of data is the simple mixed type, the processor 112 specifically implements the following steps:

obtaining, according to the data item to which the item of data belongs, the preset regular expression corresponding to the specific value of the item of data; judging whether the preset regular expression matches the specific value of the item of data; if the preset regular expression matches the specific value, judging whether the specific value contains both symbols and numbers; if the specific value contains both symbols and numbers, extracting the features corresponding to the specific value according to a preset format to obtain the standardization result; if the specific value contains no symbols but contains numbers, extracting the numbers in the specific value and taking the extracted numbers as the standardization result.

In one embodiment, when performing the step of performing standardization by means of natural language processing if the data type corresponding to the specific value of the item of data is the complex mixed type, the processor 112 specifically implements the following steps:

calling a recursive grouping interface to segment the text corresponding to the specific value of the item of data into sentences and group them; judging whether the data type of the text after sentence segmentation and grouping belongs to the numeric type, the enumeration type, or the simple mixed type; if the data type of the segmented text belongs to the numeric type, the enumeration type, or the simple mixed type, performing standardization using the standardization method corresponding to the numeric type, the enumeration type, or the simple mixed type; if the data type of the segmented text does not belong to the numeric type, the enumeration type, or the simple mixed type, calling a word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the segmented text, and analyzing the result to obtain a first result; calling a keyword extraction algorithm and computing statistics over the short sentences obtained by sentence segmentation and grouping, so as to obtain a first frequency at which each candidate keyword occurs and a second frequency at which the candidate keyword occurs in the multiple physical examination report files containing the item of data, extracting a group of keywords for the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency, and taking the extracted keywords as a second result; taking the first result and the second result as the standardization result corresponding to the specific value of the item of data.
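The behaviour of the grouping interface is not specified in the text. As a stand-in, the sketch below splits a free-text value into short sentences at Chinese and Western punctuation using a plain regular expression. The ASCII period is deliberately not treated as a boundary, so that decimals such as `24.5` stay intact; this choice is an assumption.

```python
import re
from typing import List

# Boundaries: Chinese and Western commas, semicolons, exclamation and
# question marks, the Chinese full stop, and newlines.
_BOUNDARIES = re.compile(r"[，,。；;！!？?\n]+")


def split_segments(text: str) -> List[str]:
    """Split a free-text value into non-empty, trimmed short sentences."""
    return [part.strip() for part in _BOUNDARIES.split(text) if part.strip()]
```

Each resulting short sentence can then be classified again on its own, so that a segment that turns out to be numeric, enumerated, or simple mixed is handled by the corresponding simpler method.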

In one embodiment, after performing the step of taking the first result and the second result as the standardization result corresponding to the specific value of the item of data, the processor 112 further implements the following steps:

obtaining the preset features and feature identifiers corresponding to the specific value of the item of data; matching the standardization result corresponding to the specific value of the item of data against the preset features to obtain a matching result; marking the standardization result with the corresponding feature identifier according to the matching result.

In one embodiment, before performing the step of calling the word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after sentence segmentation and grouping, and analyzing the result to obtain the first result, the processor 112 further implements the following steps:

detecting whether numbers exist in the text after sentence segmentation and grouping; if numbers exist in the segmented text, performing the step of calling the word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the segmented text, and analyzing the result to obtain the first result.

It should be understood that, in the embodiments of the present application, the processor 112 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a storage medium, and the storage medium may be a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the above method embodiments.

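Therefore, the present application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, implements the following steps: obtaining an item of data to be standardized from a physical examination report; determining the data type corresponding to the specific value of the item of data; and standardizing the item of data according to the determined data type, wherein different data types correspond to different standardization methods.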

In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When performing the step of determining the data type corresponding to the specific value of the item of data, the processor specifically implements the following steps:

In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When performing the step of standardizing the item of data according to the determined data type, the processor specifically implements the following steps:

In one embodiment, when performing the step of performing standardization by means of regular expression matching if the data type corresponding to the specific value of the item of data is the simple mixed type, the processor specifically implements the following steps:

In one embodiment, when performing the step of performing standardization by means of natural language processing if the data type corresponding to the specific value of the item of data is the complex mixed type, the processor specifically implements the following steps:

In one embodiment, after performing the step of taking the first result and the second result as the standardization result corresponding to the specific value of the item of data, the processor further implements the following steps:

In one embodiment, before performing the step of calling the word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after sentence segmentation and grouping, and analyzing the result to obtain the first result, the processor further implements the following steps:

The storage medium may be any of various computer-readable storage media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, or an optical disc.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation. Those skilled in the art will clearly understand that, for convenience and brevity of description, for the specific working processes of the apparatus, device, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.

The above are merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A data normalization method, characterized in that the method comprises:

obtaining an item of data to be standardized from a physical examination report;

determining the data type corresponding to the specific value of the item of data;

standardizing the item of data according to the determined data type, wherein different data types correspond to different standardization methods.

2. The method according to claim 1, characterized in that the data types include a numeric type, an enumeration type, a simple mixed type, and a complex mixed type, and determining the data type corresponding to the specific value of the item of data comprises:

obtaining the specific value of the item of data and detecting the obtained specific value;

if the specific value of the item of data is a number, or a combination of a number and a unit, determining that the data type corresponding to the specific value of the item of data is the numeric type;

if the specific value of the item of data is one of the preset enumerated values, determining that the data type corresponding to the specific value of the item of data is the enumeration type;

if the specific value of the item of data includes both text and numbers, judging whether the number of characters of the text is less than a first preset quantity and whether the number of times numbers occur is less than a second preset quantity;

if the number of characters of the text in the specific value of the item of data is less than the first preset quantity and the number of times numbers occur is less than the second preset quantity, determining that the data type corresponding to the specific value of the item of data is the simple mixed type; otherwise, determining that the data type corresponding to the specific value of the item of data is the complex mixed type.

3. The method according to claim 1, characterized in that the data types include a numeric type, an enumeration type, a simple mixed type, and a complex mixed type, and standardizing the item of data according to the determined data type comprises:

if the data type corresponding to the specific value of the item of data is the numeric type, processing the specific value of the item of data in the physical examination report so as to unify the data unit of the item of data;

if the data type corresponding to the specific value of the item of data is the enumeration type, unifying the text in the specific value of the item of data, or matching and mapping the specific value of the item of data to a preset numerical value;

if the data type corresponding to the specific value of the item of data is the simple mixed type, performing standardization by means of regular expression matching;

if the data type corresponding to the specific value of the item of data is the complex mixed type, performing standardization by means of natural language processing.

4. The method according to claim 3, characterized in that performing standardization by means of regular expression matching if the data type corresponding to the specific value of the item of data is the simple mixed type comprises:

obtaining, according to the data item to which the item of data belongs, the preset regular expression corresponding to the specific value of the item of data;

judging whether the preset regular expression matches the specific value of the item of data;

if the preset regular expression matches the specific value of the item of data, judging whether the specific value of the item of data contains both symbols and numbers;

if the specific value of the item of data contains both symbols and numbers, extracting the features corresponding to the specific value of the item of data according to a preset format to obtain the standardization result;

if the specific value of the item of data contains no symbols but contains numbers, extracting the numbers in the specific value of the item of data and taking the extracted numbers as the standardization result.

5. The method according to claim 3, characterized in that performing standardization by means of natural language processing if the data type corresponding to the specific value of the item of data is the complex mixed type comprises:

calling a recursive grouping interface to segment the text corresponding to the specific value of the item of data into sentences and group them;

judging whether the data type of the text after sentence segmentation and grouping belongs to the numeric type, the enumeration type, or the simple mixed type;

if the data type of the text after sentence segmentation and grouping belongs to the numeric type, the enumeration type, or the simple mixed type, performing standardization using the standardization method corresponding to the numeric type, the enumeration type, or the simple mixed type;

if the data type of the text after sentence segmentation and grouping does not belong to the numeric type, the enumeration type, or the simple mixed type, calling a word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the segmented text, and analyzing the result to obtain a first result;

calling a keyword extraction algorithm and computing statistics over the short sentences obtained by sentence segmentation and grouping, so as to obtain a first frequency at which each candidate keyword occurs and a second frequency at which the candidate keyword occurs in the multiple physical examination report files containing the item of data, extracting a group of keywords for the specific value of the item of data from the candidate keywords according to the first frequency and the second frequency, and taking the extracted keywords as a second result;

taking the first result and the second result as the standardization result corresponding to the specific value of the item of data.

6. The method according to claim 5, characterized in that the method further comprises:

obtaining the preset features and feature identifiers corresponding to the specific value of the item of data;

matching the standardization result corresponding to the specific value of the item of data against the preset features to obtain a matching result;

marking the standardization result with the corresponding feature identifier according to the matching result.

7. The method according to claim 5, characterized in that, before calling the word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after sentence segmentation and grouping, and analyzing the result to obtain the first result, the method further comprises:

detecting whether numbers exist in the text after sentence segmentation and grouping;

if numbers exist, performing the step of calling the word segmentation and part-of-speech tagging interface, performing word segmentation and part-of-speech tagging on the text after sentence segmentation and grouping, and analyzing the result to obtain the first result.

8. A data normalization apparatus, characterized in that the data normalization apparatus comprises:

an obtaining unit, configured to obtain an item of data to be standardized from a physical examination report;

a type determining unit, configured to determine the data type corresponding to the specific value of the item of data;

a standardization unit, configured to standardize the item of data according to the determined data type, wherein different data types correspond to different standardization methods.

9. A computer device, characterized in that the computer device comprises a memory and a processor connected to the memory;

the memory is configured to store a computer program; the processor is configured to run the computer program stored in the memory so as to perform the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.