CN103530334B - Data matching system and method based on a comparison template

Info

Publication number: CN103530334B
Application number: CN201310456767.0A
Authority: CN
Inventors: 龚健; 张应才; 张恒; 李登高
Original assignee: Medical Information Technology Co Ltd Of Beijing University
Current assignee: Peking University Medical Information Technology Co ltd
Priority date: 2013-09-29
Filing date: 2013-09-29
Publication date: 2018-01-23
Anticipated expiration: 2033-09-29
Also published as: CN103530334A

Abstract

The invention provides a data matching system based on a comparison template and a data matching method based on a comparison template. The data matching system comprises: a blocking unit for receiving data from different domains and partitioning the data into blocks according to a set index item, the index item comprising one or more fields of the data; a comparison unit for obtaining match pairs for each data block, each match pair comprising two data records, and for calculating a similarity for each match pair according to the rules of the comparison template; and a classification unit for determining the matching relationship between the two data records of a match pair according to a preset similarity threshold. With this technical solution, associated data can be identified quickly within a large volume of data, the similarity of the associated data can be calculated, and whether the associated data represent the same object can be determined from the similarity, so that different systems can communicate smoothly.

Description

Data matching system and method based on a comparison template

Technical field

The present invention relates to the field of computer technology, and in particular to a data matching system based on a comparison template and a data matching method based on a comparison template.

Background art

At present, medical information systems in China exist in many forms side by side and are gradually being improved, with the ultimate goal of making medical information available across society. The systems within the medical sector, such as outpatient and emergency systems, inpatient systems, physical examination systems and imaging centers, are independent of one another. Some systems place low requirements on patient information, so the data entered are incomplete. The business systems follow inconsistent standards and use inconsistent business fields, so patient information is not associated across systems and each system's information stands alone. Because only some fields of a patient record are valid, a patient cannot be uniquely confirmed and identifiers are missing. In other words, the same patient has multiple sets of information and the patient's data cannot be uniquely determined, which hinders communication between systems. In addition, the data volume of each system is large, and it becomes larger still once the data of multiple systems are aggregated. Identifying whether two records represent the same patient in such a volume of data is correspondingly difficult, and no good solution currently exists.

Therefore, a data matching scheme is needed that can quickly match the records representing the same person in the data of different systems, so that data exchange between systems becomes smoother.

Summary of the invention

In view of the above problem, the present invention proposes a data matching scheme that can quickly match the records representing the same person in the data of different systems, making data exchange between systems smoother.

According to one aspect of the present invention, a data matching system based on a comparison template is proposed, comprising: a blocking unit for receiving data from different domains and partitioning the data into blocks according to a set index item, the index item comprising one or more fields of the data; a comparison unit for obtaining match pairs for each data block, each match pair comprising two data records, and for calculating a similarity for each match pair according to the rules of the comparison template; and a classification unit for determining the matching relationship between the two data records of a match pair according to a preset similarity threshold.

A piece of data in a system has multiple fields and represents the identity data of one user. A large volume of data can be partitioned according to one or several fields, so that it is divided into a number of data blocks. Within each data block, the records whose index-item values, i.e. field values, are identical are combined into match pairs. Because the large volume of data is partitioned in advance and only records with the same field value are paired for matching, the amount of calculation is greatly reduced and the load on the system is relieved.
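The blocking and pairing described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the record layout, the field names `name` and `birth_date`, and the helper names `block_records` and `match_pairs` are all assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, index_fields):
    """Group records into blocks keyed by the values of the index fields."""
    blocks = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in index_fields)
        # Assumption: a record missing an index value is left out of blocking.
        if any(value is None for value in key):
            continue
        blocks[key].append(record)
    return blocks


def match_pairs(blocks):
    """Yield every pair of records that share a block (same index-item value)."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical sample data; "birth_date" plays the role of the index item.
records = [
    {"id": 1, "name": "Zhang San", "birth_date": "1980-01-01"},
    {"id": 2, "name": "Zhang  San", "birth_date": "1980-01-01"},
    {"id": 3, "name": "Li Si", "birth_date": "1975-06-30"},
]
blocks = block_records(records, index_fields=["birth_date"])
pairs = [(left["id"], right["id"]) for left, right in match_pairs(blocks)]
print(pairs)
```

Only records 1 and 2 share an index value, so a single match pair is produced and record 3 is never compared against them.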

After the match pairs are obtained, the similarity of the two records in each match pair needs to be further determined, and the relationship between the two records is determined by comparing their similarity with the similarity threshold.

In the above technical solution, preferably, the system may further comprise: a data cleaning unit for processing the data according to a preset data format so that the data conform to a predetermined format; and the comparison unit comprises: an obtaining subunit for combining, for each data block, the data whose index-item values are identical into match pairs.

Data from different domains differ in format: the fields may differ, the way values are expressed may differ, or field values may be wrong. The data cleaning unit can identify invalid and non-conforming data and clean the domain data from different systems, thereby standardizing the data and facilitating the subsequent association calculation.

In the above technical solution, preferably, the comparison unit comprises a calculation subunit for calculating, for each identical field of the two data records, a similarity value of the contents corresponding to that field in the two data records, and for determining the similarity according to the similarity values of the contents corresponding to the identical fields.

In the above technical solution, preferably, the calculation subunit is further configured to, when the two data records have multiple identical fields, take the sum of the similarity values corresponding to the identical fields as the similarity of the two data records. Because each record may include multiple fields, each field needs to be compared: the similarity value between the field values of each identical field of the two records is calculated, and the similarity between the records is then determined from the similarity values of the field values.

In the above technical solution, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relationship between the two data records is a non-matching relationship.

Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity is greater than or equal to the first threshold, the two records are very likely to represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object but the likelihood is not high, so a person needs to determine whether they represent the same object. If the calculated similarity is less than or equal to the lower threshold, the two records are taken not to represent the same object, and they can be determined not to be in a matching relationship.
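The two-threshold classification can be sketched as follows. The boundary handling follows the preferred solution above (greater than or equal to the first threshold is a match; less than or equal to the second is a non-match). The function names, the relation labels, the concrete threshold values and the use of a random UUID as the unique identifier are assumptions made for illustration only.

```python
import uuid


def classify_pair(similarity, first_threshold, second_threshold):
    """Return 'match', 'suspected' or 'non-match' for one match pair."""
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must be greater than second_threshold")
    if similarity >= first_threshold:
        return "match"
    if similarity > second_threshold:
        return "suspected"  # left for manual review
    return "non-match"


def link_id_for(relation):
    """Generate an identifier associating the two records of a matched pair."""
    # Assumption: a random UUID stands in for the unique identifier.
    return uuid.uuid4().hex if relation == "match" else None


# Hypothetical thresholds; the description does not give concrete values.
FIRST_THRESHOLD, SECOND_THRESHOLD = 2.5, 1.5
for similarity in (3.6, 2.0, 1.5):
    relation = classify_pair(similarity, FIRST_THRESHOLD, SECOND_THRESHOLD)
    print(similarity, relation, link_id_for(relation))
```

Because the similarity is a sum of field-level similarity values, the thresholds need not lie between 0 and 1.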

According to another aspect of the present invention, a data matching method based on a comparison template is also proposed, comprising: receiving data from different domains and partitioning the data into blocks according to a set index item, the index item comprising one or more fields of the data; obtaining match pairs for each data block, each match pair comprising two data records; and calculating a similarity for each match pair according to the rules of the comparison template, and determining the matching relationship between the two data records of the match pair according to a preset similarity threshold.

A piece of data in a system has multiple fields and represents the identity data of one user. A large volume of data can be partitioned according to one or several fields, so that it is divided into a number of data blocks. Within each data block, the records whose index-item values, i.e. field values, are identical are combined into match pairs. Because the large volume of data is partitioned in advance and only records with the same field value are paired for matching, the amount of calculation is greatly reduced and the load on the system is relieved.

In the above technical solution, preferably, before the data are partitioned, the method further comprises: processing the data according to a preset data format so that the data conform to a predetermined format. The step of obtaining match pairs for each data block specifically comprises: for each data block, combining the data whose index-item values are identical into match pairs.

In the above technical solution, preferably, the step of calculating a similarity for each match pair according to the rules of the comparison template specifically comprises: for each identical field of the two data records, calculating a similarity value of the contents corresponding to that field in the two data records, and determining the similarity according to the similarity values of the contents corresponding to the identical fields.

In the above technical solution, preferably, when the two data records have multiple identical fields, the sum of the similarity values corresponding to the identical fields is taken as the similarity of the two data records. Because each record may include multiple fields, each field needs to be compared: the similarity value between the field values of each identical field of the two records is calculated, and the similarity between the records is then determined from the similarity values of the field values.

In any of the above technical solutions, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. When the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier for associating the two data records is generated; when the similarity of the two data records is less than the first threshold and greater than the second threshold, the relationship between the two data records is determined to be a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, the relationship between the two data records is determined to be a non-matching relationship.

Because the systems are independent of one another and their data formats, field definitions and expressed contents are inconsistent, communication between systems is hindered and the same object cannot be uniquely determined. The present invention therefore proposes a new data matching scheme based on a comparison template, which can quickly identify match pairs in a large volume of data and quickly match the different representations of the same object, so that the data of different systems can be associated with one another.

Brief description of the drawings

Fig. 1 shows a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention;

Fig. 2 shows a flow chart of a data matching method based on a comparison template according to an embodiment of the present invention;

Fig. 3 shows a detailed flow chart of a data matching method based on a comparison template according to another embodiment of the present invention.

Detailed description of the embodiments

In order that the above objects, features and advantages of the present invention may be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another.

Many specific details are set forth in the following description to facilitate a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, and the scope of protection of the present invention is therefore not limited by the specific embodiments described below.

Fig. 1 shows a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention.

As shown in Fig. 1, a data matching system 100 based on a comparison template according to an embodiment of the present invention comprises: a blocking unit 102 for receiving data from different domains and partitioning the data into blocks according to a set index item, the index item comprising one or more fields of the data; a comparison unit 104 for obtaining match pairs for each data block, each match pair comprising two data records, and for calculating a similarity for each match pair according to the rules of the comparison template; and a classification unit 106 for determining the matching relationship between the two data records of a match pair according to a preset similarity threshold.

After the match pairs are obtained, the similarity of the two records in each match pair needs to be further determined, and the relationship between the two records is determined by comparing their similarity with the similarity threshold. For example, the information of one match pair is shown in Table 1:

ID	A	B	C	D
1	a1	b1	c1	d1
2	a2	b2	c2	d2

Table 1

Here, the record with ID=1 is the left node of the match pair and the record with ID=2 is the right node. The matching identification items are {A, [A+C], [B], [C+D]} and the corresponding matching weights are {0.9, 0.92, 0.5, 0.945}. If the identification item similarities calculated according to the rules of the comparison template (for example, the comparison template for A uses phonetic similarity × 1.2 + character similarity × 1.8) are {0.8, 0.4, 0.9, 0.5}, then the score of the match pair is f1(0.8, 0.9) + f2(0.4, 0.92) + f3(0.9, 0.5) + f4(0.5, 0.945), where each function fn() may be a different calculation function, selected according to the threshold for matching on the individual identification item.
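The score above can be worked through numerically once a concrete form is chosen for fn(). The description leaves fn() open, so the sketch below assumes, purely for illustration, that every fn multiplies the item similarity by its matching weight; `weighted_product` is a hypothetical name, not a function defined by the patent.

```python
def weighted_product(similarity, weight):
    """Hypothetical f_n: scale an item similarity by its matching weight."""
    return similarity * weight


# Values taken from the example in the description.
items = ["A", "A+C", "B", "C+D"]
weights = [0.9, 0.92, 0.5, 0.945]
similarities = [0.8, 0.4, 0.9, 0.5]

# Assumption: the same function is used for every identification item.
scoring_functions = [weighted_product] * len(items)

score = sum(
    f(similarity, weight)
    for f, similarity, weight in zip(scoring_functions, similarities, weights)
)
print(round(score, 4))
```

Under this assumed fn, the four terms are 0.72, 0.368, 0.45 and 0.4725. A different fn per item, as the description allows, would only require changing the entries of `scoring_functions`.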

In the above technical solution, preferably, the system may further comprise: a data cleaning unit 108 for processing the data according to a preset data format so that the data conform to a predetermined format; and the comparison unit 104 comprises: an obtaining subunit 1042 for combining, for each data block, the data whose index-item values are identical into match pairs.

In the above technical solution, preferably, the comparison unit 104 comprises a calculation subunit 1044 for calculating, for each identical field of the two data records, a similarity value of the contents corresponding to that field in the two data records, and for determining the similarity according to the similarity values of the contents corresponding to the identical fields. For example, for the "name" field, the similarity value can be calculated with the formula phonetic similarity × 1.2 + character similarity × 1.8.

In the above technical solution, preferably, the calculation subunit 1044 is further configured to, when the two data records have multiple identical fields, take the sum of the similarity values corresponding to the identical fields as the similarity of the two data records. Because each record may include multiple fields, each field needs to be compared: the similarity value between the field values of each identical field of the two records is calculated, and the similarity between the records is then determined from the similarity values of the field values.
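A comparison template of this kind can be sketched as a mapping from field names to similarity functions, with the record similarity taken as the sum of the per-field similarity values. Everything below is an assumption made for illustration: the template layout, the field names, and in particular `placeholder_phonetic_similarity`, which is only a stand-in because the description does not say how phonetic similarity is computed. Character similarity here uses Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def character_similarity(left, right):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, left, right).ratio()


def _fold(text):
    return "".join(text.lower().split())


def placeholder_phonetic_similarity(left, right):
    # Stand-in only: a real system would compare pronunciations (e.g. pinyin).
    return 1.0 if _fold(left) == _fold(right) else 0.0


def name_similar_value(left, right):
    """Name template from the description: phonetic * 1.2 + character * 1.8."""
    return (1.2 * placeholder_phonetic_similarity(left, right)
            + 1.8 * character_similarity(left, right))


def exact_similar_value(left, right):
    return 1.0 if left == right else 0.0


# Hypothetical comparison template: one similarity rule per field.
COMPARISON_TEMPLATE = {
    "name": name_similar_value,
    "birth_date": exact_similar_value,
}


def record_similarity(left, right, template):
    """Sum the similarity values of the fields present in both records."""
    shared = [field for field in template if field in left and field in right]
    return sum(template[field](left[field], right[field]) for field in shared)


a = {"name": "Zhang San", "birth_date": "1980-01-01"}
b = {"name": "zhang san", "birth_date": "1980-01-01"}
print(record_similarity(a, a, COMPARISON_TEMPLATE))
print(record_similarity(a, b, COMPARISON_TEMPLATE))
```

Keeping the rules in a mapping means a different template can be selected without changing `record_similarity`.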

In the above technical solution, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit 106 is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relationship between the two data records is a non-matching relationship.

Fig. 2 shows a flow chart of a data matching method based on a comparison template according to an embodiment of the present invention.

As shown in Fig. 2, a data matching method based on a comparison template according to an embodiment of the present invention may comprise the following steps. Step 202: receive data from different domains and partition the data into blocks according to a set index item, the index item comprising one or more fields of the data. Step 204: obtain match pairs for each data block, each match pair comprising two data records. Step 206: calculate a similarity for each match pair according to the rules of the comparison template, and determine the matching relationship between the two data records of the match pair according to a preset similarity threshold.

The data matching method based on a comparison template according to the present invention is described in detail below with reference to Fig. 3.

As shown in Fig. 3, in step 302 the device receives domain data from different domains (for example, the outpatient system is one domain and the inpatient system is another). Because the field definitions, expressed contents and expression formats of the data differ between systems, the domain data need to be cleaned according to preset rules.

For example, if times in the master database of the management server are written as 2012-12-12 while other domain data write them as December 12, 2012 or 2012.12.12, the time data in these formats can all be unified as 2012-12-12. As another example, if names in the master database have no spaces between characters but names in the domain data do, the spaces can be deleted so that the names are consistent with the format used in the master database.
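The two cleaning rules in this example can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the set of accepted date spellings (hyphen, dot, slash, and the Chinese year/month/day form that the translated date presumably corresponds to) is an assumption rather than something the description enumerates.

```python
import re
from datetime import date

# Assumption: dates arrive as YYYY-MM-DD, YYYY.MM.DD, YYYY/MM/DD or YYYY年MM月DD日.
_DATE_PATTERN = re.compile(
    r"^\s*(\d{4})\s*[-./年]\s*(\d{1,2})\s*[-./月]\s*(\d{1,2})\s*日?\s*$"
)


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if it cannot be recognized."""
    match = _DATE_PATTERN.match(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # Out-of-range values such as month 13 are treated as invalid data.
        return None


def normalize_name(text):
    """Delete all whitespace so names match the master database format."""
    return "".join(text.split())


for raw in ("2012-12-12", "2012.12.12", "2012年12月12日", "2012-13-40"):
    print(raw, "->", normalize_date(raw))
print(normalize_name("张 三"))
```

Returning `None` for unrecognized or out-of-range values gives the cleaning step a simple way to flag invalid data instead of silently passing it on.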

In step 304, because the data volume of each system is very large and the total becomes even larger once the data of multiple systems are aggregated, the domain data need to be partitioned into blocks and processed block by block.

In step 306, match pairs are obtained for each data block and the similarity of each match pair is calculated.

After the data are partitioned, comparison items, i.e. match pairs, must first be formed before the data can be compared. If all the data in a block were compared against one another, n² comparison items would be formed for n records, and the similarity calculation would waste a great deal of time. The formation of match pairs therefore needs to be controlled so that pairs that do not need to be compared are removed, which reduces the amount of calculation, increases processing speed and reduces resource usage. When calculating the similarity of the two data records in a match pair, the similarity value of each identical field can be calculated first, and the similarity of the two data records is then obtained from the similarity values of all fields. Different similarity calculation methods are provided in the comparison templates, and the similarity of a match pair can be calculated according to the selected comparison template.
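The saving from restricting comparisons can be made concrete by counting pairs. The sketch below is illustrative only: it counts unordered pairs, so an exhaustive comparison of n records gives n(n-1)/2 pairs, which grows on the order of n², and it assumes that pairs are formed only between records with the same index value. The function name and sample values are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(index_values):
    """Return (pairs with no restriction, pairs among equal index values)."""
    total_pairs = comb(len(index_values), 2)
    block_sizes = Counter(index_values)
    blocked_pairs = sum(comb(size, 2) for size in block_sizes.values())
    return total_pairs, blocked_pairs


# Hypothetical index values (e.g. birth dates) for six records.
index_values = ["1980-01-01"] * 3 + ["1975-06-30"] * 2 + ["1990-10-10"]
total_pairs, blocked_pairs = pair_counts(index_values)
print(total_pairs, blocked_pairs)
```

For these six records the exhaustive comparison needs 15 pairs, while pairing only equal index values needs 4; the gap widens quickly as the record count grows.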

In step 308, the match pairs are classified according to the preset similarity thresholds. Two similarity thresholds can be set: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity is greater than or equal to the first threshold, the two records are very likely to represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object but the likelihood is not high, so a person needs to determine whether they represent the same object. If the calculated similarity is less than or equal to the lower threshold, the two records are taken not to represent the same object, and they can be determined not to be in a matching relationship.

In step 310, the classification results are reviewed. For example, match pairs classified as suspected are judged manually; if a classification result is considered inaccurate, the match pair can be relabeled. The similarity thresholds can also be adjusted, and the match pairs reclassified according to the adjusted thresholds.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A data matching system based on a comparison template, characterized by comprising:

a blocking unit for receiving data from different domains and partitioning the data into blocks according to a set index item, the index item comprising one or more fields of the data;

a comparison unit for obtaining match pairs for each data block, each match pair comprising two data records, and for calculating a similarity for each match pair according to the rules of the comparison template;

a classification unit for determining the matching relationship between the two data records of a match pair according to a preset similarity threshold; wherein the comparison unit comprises: an obtaining subunit for combining, for each data block, the data whose index-item values are identical into match pairs;

the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold;

the classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relationship between the two data records is a non-matching relationship.
2. The data matching system based on a comparison template according to claim 1, characterized by further comprising: a data cleaning unit for processing the data according to a preset data format so that the data conform to a predetermined format.
3. The data matching system based on a comparison template according to claim 1 or 2, characterized in that the comparison unit comprises a calculation subunit for calculating, for each identical field of the two data records, a similarity value of the contents corresponding to that field in the two data records, and for determining the similarity according to the similarity values of the contents corresponding to the identical fields.
4. The data matching system based on a comparison template according to claim 3, characterized in that the calculation subunit is further configured to, when the two data records have multiple identical fields, take the sum of the similarity values corresponding to the identical fields as the similarity of the two data records.
5. A data matching method based on a comparison template, characterized by comprising:

receiving data from different domains and partitioning the data into blocks according to a set index item, the index item comprising one or more fields of the data;

obtaining match pairs for each data block, each match pair comprising two data records;

calculating a similarity for each match pair according to the rules of the comparison template, and determining the matching relationship between the two data records of the match pair according to a preset similarity threshold; wherein the step of obtaining match pairs for each data block specifically comprises: for each data block, combining the data whose index-item values are identical into match pairs;

the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold;

when the similarity of the two data records is greater than or equal to the first threshold, determining that the relationship between the two data records is a matching relationship and generating a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determining that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determining that the relationship between the two data records is a non-matching relationship.
6. The data matching method based on a comparison template according to claim 5, characterized in that, before the data are partitioned, the method further comprises: processing the data according to a preset data format so that the data conform to a predetermined format.
7. The data matching method based on a comparison template according to claim 5 or 6, characterized in that the step of calculating a similarity for each match pair according to the rules of the comparison template specifically comprises: for each identical field of the two data records, calculating a similarity value of the contents corresponding to that field in the two data records, and determining the similarity according to the similarity values of the contents corresponding to the identical fields.
8. The data matching method based on a comparison template according to claim 7, characterized in that, when the two data records have multiple identical fields, the sum of the similarity values corresponding to the identical fields is taken as the similarity of the two data records.