CN103530334A

CN103530334A - System and method for data matching based on comparison module

Info

Publication number: CN103530334A
Application number: CN201310456767.0A
Authority: CN
Inventors: 龚健; 张应才; 张恒; 李登高
Original assignee: Founder International Co Ltd; Founder International Beijing Co Ltd
Current assignee: Peking University Medical Information Technology Co ltd
Priority date: 2013-09-29
Filing date: 2013-09-29
Publication date: 2014-01-22
Anticipated expiration: 2033-09-29
Also published as: CN103530334B

Abstract

The invention provides a system and method for data matching based on a comparison module. The system comprises a blocking unit, a comparing unit and a classification unit. The blocking unit receives data from different domains and partitions the data into blocks according to a set index entry, the index entry comprising one or more fields of the data. The comparing unit obtains matching pairs for each data block, each matching pair comprising two data records, and calculates a similarity for each matching pair according to the rules of the comparison module. The classification unit determines the matching relationship between the two data records of a matching pair according to preset similarity thresholds. With this technical scheme, associated data can be identified quickly within a large data volume, the similarity of the associated data can be calculated, and whether the associated data represent the same object can be judged from that similarity, so that different systems can communicate smoothly with one another.

Description

Data matching system and method based on a comparison template

Technical field

The present invention relates to the field of computer technology, and in particular to a data matching system based on a comparison template and a data matching method based on a comparison template.

Background technology

At present, medical information in China exists in many coexisting forms and is being gradually improved, with the ultimate goal of making medical information available across society. Within a medical institution, the individual systems are independent of one another, for example the outpatient and emergency system, the inpatient system, the physical examination system and the imaging center. Some of these systems place low requirements on patient information, so the data entered is incomplete. The business systems follow inconsistent standards and use inconsistent business fields, so patient information is not associated and each system's information stands alone. Only some fields of the patient data are valid and a unique identifier is missing, so a patient cannot be uniquely confirmed. In other words, the same patient has multiple pieces of information and the patient's data cannot be uniquely determined, which hinders communication between systems. In addition, the data volume inside each system is large, and once the data of multiple systems is aggregated the volume becomes even larger. Identifying whether two records represent the same patient within such a volume is difficult, and no good solution currently exists.

Therefore, a data matching scheme is needed that can quickly match the records representing the same person across the data of different systems, so that data exchange between systems becomes smoother.

Summary of the invention

In view of the above problems, the present invention proposes a data matching scheme that can quickly match the records representing the same person across the data of different systems, so that data exchange between systems becomes smoother.

According to one aspect of the present invention, a data matching system based on a comparison template is proposed, comprising: a blocking unit, configured to receive data from different domains and partition the data into blocks according to a set index entry, the index entry comprising one or more fields of the data; a comparing unit, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and a classification unit, configured to determine the matching relationship between the two data records of a matching pair according to a preset similarity threshold.

The data in a system has multiple fields, which together represent one piece of user identity data. A large data volume can be partitioned according to one or several of these fields, so that it is divided into several data blocks. Within each data block, records whose index entry values (that is, field values) are identical are combined into matching pairs. Because the large data volume is partitioned in advance and only records with the same field value are paired, the amount of computation is greatly reduced and the burden on the system is lightened.
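The blocking and pairing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary record layout and the choice of `birth_date` as the index entry are all assumptions made for the example, and the sketch keys each block directly on the index entry value.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, index_fields):
    """Group records into blocks keyed by the values of the index entry fields."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in index_fields)
        blocks[key].append(record)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield every pair of records that share a block (same index entry value)."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical sample data from two domains.
records = [
    {"id": 1, "name": "Zhang Wei", "birth_date": "1980-01-01", "domain": "outpatient"},
    {"id": 2, "name": "Zhang Wei", "birth_date": "1980-01-01", "domain": "inpatient"},
    {"id": 3, "name": "Li Na", "birth_date": "1975-05-05", "domain": "outpatient"},
]

blocks = block_records(records, index_fields=("birth_date",))
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
print(pairs)  # [(1, 2)]
```

Only records 1 and 2 share an index entry value, so only one matching pair is formed instead of the three pairs an exhaustive comparison would produce.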

After the matching pairs are obtained, the similarity of the two records in each matching pair must be determined, and the relationship between the two records is then determined by comparing that similarity with the similarity threshold.

In the above technical scheme, preferably, the system further comprises: a data cleaning unit, configured to process the data according to a preset data format so that it conforms to a predetermined format. The comparing unit comprises: an obtaining subunit, configured to form, for each data block, matching pairs from the data whose index entry values are identical.

Data from different domains differs in format: the fields may differ, the way content is expressed may differ, or field values may be wrong. The data cleaning unit can identify invalid or non-conforming data and clean the domain data coming from different systems, thereby standardizing the data and facilitating the subsequent association calculation.

In the above technical scheme, preferably, the comparing unit comprises a calculation subunit, configured to calculate, for a field common to the two data records, the similarity value of the content corresponding to that field in the two records, and to determine the similarity according to that similarity value.

In the above technical scheme, preferably, the calculation subunit is further configured to, when the two data records have multiple fields in common, take the sum of the similarity values of the common fields as the similarity of the two data records. Because each record may comprise multiple fields, a comparison is needed for each field: the similarity value between the field values of each common field is calculated, and the similarity between the records is then determined from these similarity values.
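The summation over common fields can be illustrated with a short sketch. The per-field measure used here, `difflib.SequenceMatcher.ratio()` from the Python standard library, is only a stand-in chosen for the example; the patent leaves the per-field calculation to the comparison template. The function names and the record layout are likewise assumptions.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Stand-in per-field similarity value in [0, 1] (an assumption, not the patent's measure)."""
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(left, right, fields):
    """Sum the similarity values of the fields that both records contain."""
    common = [f for f in fields if f in left and f in right]
    return sum(field_similarity(str(left[f]), str(right[f])) for f in common)


left = {"name": "Zhang Wei", "phone": "13800000000"}
right = {"name": "Zhang Wei", "phone": "13800000001", "address": "Beijing"}

# "address" is skipped because only one of the two records has it.
score = record_similarity(left, right, fields=("name", "phone", "address"))
print(score)
```

With this stand-in, two identical records that share k fields score exactly k, so thresholds have to be chosen relative to the number of compared fields.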

In the above technical scheme, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relationship between the two data records is a non-matching relationship.

Two bounds are set for the similarity: the first threshold is the upper bound and the second threshold is the lower bound. If the calculated similarity reaches or exceeds the first threshold, the two records very probably represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower bounds, the two records may represent the same object but the probability is not high, and whether they do must be decided manually. If the calculated similarity is at or below the lower bound, the two records cannot represent the same object, and they can be determined to be in a non-matching relationship.
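The two-threshold rule can be written down directly. The sketch below follows the boundary conditions stated above (greater than or equal to the first threshold, less than or equal to the second); the relation labels, the function names and the use of `uuid` to generate the unique identifier are assumptions made for illustration.

```python
import uuid


def classify_pair(similarity, first_threshold, second_threshold):
    """Return 'match', 'suspected' or 'non-match' for one matching pair."""
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    if similarity >= first_threshold:
        return "match"
    if similarity > second_threshold:
        return "suspected"  # left for manual review
    return "non-match"


def classify_and_link(left_id, right_id, similarity, first_threshold, second_threshold):
    """Classify a pair and, for a match, attach a shared identifier (hypothetical scheme)."""
    relation = classify_pair(similarity, first_threshold, second_threshold)
    link_id = uuid.uuid4().hex if relation == "match" else None
    return {"left": left_id, "right": right_id, "relation": relation, "link_id": link_id}


print(classify_pair(0.90, 0.85, 0.60))  # match
print(classify_pair(0.70, 0.85, 0.60))  # suspected
print(classify_pair(0.60, 0.85, 0.60))  # non-match
```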

According to another aspect of the present invention, a data matching method based on a comparison template is also proposed, comprising: receiving data from different domains and partitioning the data into blocks according to a set index entry, the index entry comprising one or more fields of the data; obtaining matching pairs for each data block, each matching pair comprising two data records; and calculating a similarity for each matching pair according to the rules of the comparison template and determining the matching relationship between the two data records of the matching pair according to a preset similarity threshold.

In the above technical scheme, preferably, before the data is partitioned, the method further comprises: processing the data according to a preset data format so that it conforms to a predetermined format. The step of obtaining matching pairs for each data block specifically comprises: for each data block, forming matching pairs from the data whose index entry values are identical.

In the above technical scheme, preferably, the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for a field common to the two data records, calculating the similarity value of the content corresponding to that field in the two records, and determining the similarity according to that similarity value.

In the above technical scheme, preferably, when the two data records have multiple fields in common, the sum of the similarity values of the common fields is taken as the similarity of the two data records. Because each record may comprise multiple fields, a comparison is needed for each field: the similarity value between the field values of each common field is calculated, and the similarity between the records is then determined from these similarity values.

In any of the above technical schemes, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. When the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier for associating the two data records is generated; when the similarity of the two data records is less than the first threshold and greater than the second threshold, the relationship between the two data records is determined to be a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, the relationship between the two data records is determined to be a non-matching relationship.

Because systems are independent of one another, their data formats, field definitions and expressed content are inconsistent, which hinders communication between systems and prevents the same object from being uniquely determined. The present invention therefore proposes a new data matching scheme based on a comparison template, which can quickly identify matching pairs within a large data volume and quickly match the different representations of the same object, so that the data of different systems can be associated with one another.

Brief description of the drawings

Fig. 1 shows a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention;

Fig. 2 shows a flow chart of a data matching method based on a comparison template according to an embodiment of the present invention;

Fig. 3 shows a detailed flow chart of a data matching method based on a comparison template according to another embodiment of the present invention.

Detailed description of the embodiments

In order that the above objects, features and advantages of the present invention may be understood more clearly, the present invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, provided they do not conflict, the embodiments of the present application and the features in the embodiments may be combined with one another.

Many specific details are set forth in the following description to provide a full understanding of the present invention. However, the present invention can also be implemented in ways other than those described here, and its scope of protection is therefore not limited by the specific embodiments disclosed below.

Fig. 1 shows a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention.

As shown in Fig. 1, a data matching system 100 based on a comparison template according to an embodiment of the present invention comprises: a blocking unit 102, configured to receive data from different domains and partition the data into blocks according to a set index entry, the index entry comprising one or more fields of the data; a comparing unit 104, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and a classification unit 106, configured to determine the matching relationship between the two data records of a matching pair according to a preset similarity threshold.

After the matching pairs are obtained, the similarity of the two records in each matching pair must be determined, and the relationship between the two records is then determined by comparing that similarity with the similarity threshold. For example, the information of one matching pair is shown in Table 1:

ID	A	B	C	D
1	a1	b1	c1	d1
2	a2	b2	c2	d2

Table 1

Here, the record with ID = 1 is the left node of the matching pair and the record with ID = 2 is the right node. The matching identification items are {A, [A+C], [B], [C+D]} and the matching weight set is {0.9, 0.92, 0.5, 0.945}. Applying the rules of the comparison template (for example, the comparison template for A uses phonetic similarity × 1.2 + character similarity × 1.8) gives identification item similarities of {0.8, 0.4, 0.9, 0.5}. The score of this matching pair is then f1(0.8, 0.9) + f2(0.4, 0.92) + f3(0.9, 0.5) + f4(0.5, 0.945), where each function fn can be selected from different calculation functions according to the size of the matching threshold of the individual identification item.
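To make the arithmetic of the Table 1 example concrete, the sketch below evaluates the score under one assumed choice of fn. The patent does not define the functions fn; taking each fn to be the product of the identification item similarity and its weight is purely a hypothetical choice for illustration, as are the names `weighted_product` and `pair_score`.

```python
# Similarities and weights are taken from the Table 1 example.
similarities = [0.8, 0.4, 0.9, 0.5]
weights = [0.9, 0.92, 0.5, 0.945]


def weighted_product(similarity, weight):
    """Hypothetical choice for fn: scale the similarity by its weight."""
    return similarity * weight


def pair_score(similarities, weights, functions):
    """Sum fn(similarity, weight) over the identification items."""
    return sum(fn(s, w) for fn, s, w in zip(functions, similarities, weights))


# The same function is used for all four items here; a real template
# could supply a different fn per identification item.
score = pair_score(similarities, weights, [weighted_product] * 4)
print(round(score, 4))  # 2.0105 under the product assumption
```

Under this assumption the four terms are 0.72, 0.368, 0.45 and 0.4725, which sum to 2.0105.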

In the above technical scheme, preferably, the system further comprises: a data cleaning unit 108, configured to process the data according to a preset data format so that it conforms to a predetermined format. The comparing unit 104 comprises: an obtaining subunit 1042, configured to form, for each data block, matching pairs from the data whose index entry values are identical.

In the above technical scheme, preferably, the comparing unit 104 comprises a calculation subunit 1044, configured to calculate, for a field common to the two data records, the similarity value of the content corresponding to that field in the two records, and to determine the similarity according to that similarity value. For example, for the "name" field, the similarity value can be calculated with the formula phonetic similarity × 1.2 + character similarity × 1.8.
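The "name" rule can be sketched as a small function. Only the weights 1.2 and 1.8 come from the text; how the phonetic and character similarities themselves are computed is not specified there. In this sketch the character similarity uses `difflib.SequenceMatcher.ratio()` as a stand-in, and the phonetic similarity is passed in by the caller; `toy_phonetic_similarity` is a deliberately crude placeholder, not a real phonetic algorithm.

```python
from difflib import SequenceMatcher


def character_similarity(a, b):
    """Stand-in character similarity in [0, 1] (an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def toy_phonetic_similarity(a, b):
    """Placeholder only: 1.0 when the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def name_similarity_value(a, b, phonetic_similarity,
                          phonetic_weight=1.2, character_weight=1.8):
    """Comparison rule for 'name': phonetic similarity * 1.2 + character similarity * 1.8."""
    return (phonetic_weight * phonetic_similarity(a, b)
            + character_weight * character_similarity(a, b))


value = name_similarity_value("张伟", "张伟", toy_phonetic_similarity)
print(value)
```

With both component similarities bounded by 1, the similarity value for the name field ranges from 0 to 3.0, so it is not a normalized probability.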

In the above technical scheme, preferably, the calculation subunit 1044 is further configured to, when the two data records have multiple fields in common, take the sum of the similarity values of the common fields as the similarity of the two data records. Because each record may comprise multiple fields, a comparison is needed for each field: the similarity value between the field values of each common field is calculated, and the similarity between the records is then determined from these similarity values.

In the above technical scheme, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit 106 is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relationship between the two data records is a non-matching relationship.

Fig. 2 shows a flow chart of a data matching method based on a comparison template according to an embodiment of the present invention.

As shown in Fig. 2, a data matching method based on a comparison template according to an embodiment of the present invention may comprise the following steps: step 202, receiving data from different domains and partitioning the data into blocks according to a set index entry, the index entry comprising one or more fields of the data; step 204, obtaining matching pairs for each data block, each matching pair comprising two data records; and step 206, calculating a similarity for each matching pair according to the rules of the comparison template and determining the matching relationship between the two data records of the matching pair according to a preset similarity threshold.

The data matching method based on a comparison template according to the present invention is described in detail below with reference to Fig. 3.

As shown in Fig. 3, in step 302 the device receives domain data from different domains (for example, the outpatient system is one domain and the inpatient system is another). Because different systems define data fields differently, or differ in the content expressed and in its format, this domain data needs to be cleaned according to preset rules.

For example, if the master database of the management server expresses a date as 2012-12-12 while other domain data expresses it as December 12, 2012 or 2012.12.12, the dates in these formats can be unified to 2012-12-12. As another example, if names in the master database contain no spaces between characters but names in the domain data do, those spaces can be deleted so that the names conform to the format used in the master database.
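The two cleaning rules in this example can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the date pattern only handles year-month-day inputs separated by non-digit characters (such as `2012.12.12`, `2012/12/12` or the Chinese form `2012年12月12日`), and anything else is rejected rather than guessed.

```python
import re
from datetime import date

# Year, month and day separated by any non-digit characters.
_DATE_PATTERN = re.compile(r"^\s*(\d{4})\D+(\d{1,2})\D+(\d{1,2})\D*$")


def normalize_date(text):
    """Convert a year-month-day string to the YYYY-MM-DD master format."""
    match = _DATE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized date format: {text!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() also rejects impossible dates such as month 13.
    return date(year, month, day).isoformat()


def normalize_name(text):
    """Remove all whitespace inside a name to match the master format."""
    return re.sub(r"\s+", "", text)


print(normalize_date("2012.12.12"))      # 2012-12-12
print(normalize_date("2012年12月12日"))  # 2012-12-12
print(normalize_name("张 伟"))           # 张伟
```

Records that raise `ValueError` here would correspond to the invalid or non-conforming data that the cleaning step is meant to identify.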

In step 304, because the data volume of each system is large and the total becomes even larger once the data of multiple systems is aggregated, the domain data needs to be partitioned into blocks and processed block by block.

In step 306, matching pairs are obtained for each data block and the similarity of each matching pair is calculated.

After the data is partitioned, comparison pairs, that is, matching pairs, must be formed before any comparison can be made. If all the data in a block were compared exhaustively, on the order of n² comparisons would be formed and the similarity calculation would waste a great deal of time. The formation of matching pairs is therefore controlled so that pairs not worth comparing are removed, which reduces the amount of computation, increases processing speed and lowers resource usage. When calculating the similarity of the two data records in a matching pair, the similarity value of each common field can be calculated first, and the similarity of the two records is then obtained from the similarity values of all fields. The comparison template provides different similarity calculation methods, and the similarity of a matching pair is calculated according to the comparison template selected.
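The saving from restricting comparisons can be estimated with a short calculation. The sketch below counts unordered pairs, n(n − 1)/2, which grows on the order of n²; the function names and the sample sizes are assumptions chosen for illustration.

```python
from collections import Counter


def full_comparison_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_comparison_count(index_values):
    """Number of pairs when only records with equal index entry values are paired."""
    group_sizes = Counter(index_values).values()
    return sum(full_comparison_count(size) for size in group_sizes)


# Hypothetical case: 1000 records spread evenly over 100 index entry values.
index_values = [i % 100 for i in range(1000)]

print(full_comparison_count(len(index_values)))  # 499500
print(blocked_comparison_count(index_values))    # 4500
```

In this hypothetical case the number of similarity calculations falls by a factor of more than one hundred; the actual saving depends on how evenly the chosen index entry spreads the records.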

In step 308, the matching pairs are classified according to the similarity thresholds set in advance. Two similarity thresholds can be set: the first threshold is the upper bound and the second threshold is the lower bound. If the calculated similarity reaches or exceeds the first threshold, the two records very probably represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower bounds, the two records may represent the same object but the probability is not high, and whether they do must be decided manually. If the calculated similarity is at or below the lower bound, the two records cannot represent the same object, and they can be determined to be in a non-matching relationship.

In step 310, the classification results are reviewed. For example, matching pairs in a suspected relationship are judged manually, and if a classification result is considered inaccurate, the matching pair can be relabeled. For example, the similarity thresholds can be adjusted and the matching pairs classified again according to the adjusted thresholds.

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data matching system based on a comparison template, characterized by comprising:

a blocking unit, configured to receive data from different domains and partition the data into blocks according to a set index entry, the index entry comprising one or more fields of the data;

a comparing unit, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and

a classification unit, configured to determine the matching relationship between the two data records of a matching pair according to a preset similarity threshold.

2. The data matching system based on a comparison template according to claim 1, characterized by further comprising: a data cleaning unit, configured to process the data according to a preset data format so that it conforms to a predetermined format;

wherein the comparing unit comprises: an obtaining subunit, configured to form, for each data block, matching pairs from the data whose index entry values are identical.

3. The data matching system based on a comparison template according to claim 1 or 2, characterized in that the comparing unit comprises a calculation subunit, configured to calculate, for a field common to the two data records, the similarity value of the content corresponding to that field in the two data records, and to determine the similarity according to that similarity value.

4. The data matching system based on a comparison template according to claim 3, characterized in that the calculation subunit is further configured to, when the two data records have multiple fields in common, take the sum of the similarity values of the common fields as the similarity of the two data records.

5. The data matching system based on a comparison template according to claim 1, characterized in that the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold;

the classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relationship between the two data records is a non-matching relationship.

6. A data matching method based on a comparison template, characterized by comprising:

receiving data from different domains and partitioning the data into blocks according to a set index entry, the index entry comprising one or more fields of the data;

obtaining matching pairs for each data block, each matching pair comprising two data records; and

calculating a similarity for each matching pair according to the rules of the comparison template, and determining the matching relationship between the two data records of the matching pair according to a preset similarity threshold.

7. The data matching method based on a comparison template according to claim 6, characterized in that, before the data is partitioned, the method further comprises: processing the data according to a preset data format so that it conforms to a predetermined format;

and the step of obtaining matching pairs for each data block specifically comprises: for each data block, forming matching pairs from the data whose index entry values are identical.

8. The data matching method based on a comparison template according to claim 6 or 7, characterized in that the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for a field common to the two data records, calculating the similarity value of the content corresponding to that field in the two data records, and determining the similarity according to that similarity value.

9. The data matching method based on a comparison template according to claim 8, characterized in that, when the two data records have multiple fields in common, the sum of the similarity values of the common fields is taken as the similarity of the two data records.

10. The data matching method based on a comparison template according to claim 8, characterized in that the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold;

when the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier for associating the two data records is generated; when the similarity of the two data records is less than the first threshold and greater than the second threshold, the relationship between the two data records is determined to be a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, the relationship between the two data records is determined to be a non-matching relationship.