CN103530334B - Data matching system and method based on a comparison template - Google Patents

Data matching system and method based on a comparison template

Info

Publication number
CN103530334B
Authority
CN
China
Prior art keywords
data
matching
threshold
similarity
data records
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310456767.0A
Other languages
Chinese (zh)
Other versions
CN103530334A (en
Inventor
龚健
张应才
张恒
李登高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Medical Information Technology Co ltd
Original Assignee
Medical Information Technology Co Ltd Of Beijing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Medical Information Technology Co Ltd Of Beijing University
Priority to CN201310456767.0A
Publication of CN103530334A
Application granted
Publication of CN103530334B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/22 Social work

Abstract

The invention provides a data matching system based on a comparison template and a data matching method based on a comparison template. The data matching system comprises: a blocking unit, configured to receive data from different domains and to partition the data into blocks according to a configured index entry, the index entry comprising one or more fields of the data; a comparison unit, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and a classification unit, configured to determine the matching relationship between the two data records of a matching pair according to a preset similarity threshold. With this technical solution, associated data can be identified quickly in a large volume of data, the similarity of the associated data can be calculated, and whether the associated data represent the same object can be determined from the similarity, so that different systems can communicate smoothly.

Description

Data matching system and method based on a comparison template
Technical field
The present invention relates to the field of computer technology, and in particular to a data matching system based on a comparison template and a data matching method based on a comparison template.
Background
At present, medical informatization in China takes many coexisting forms and is gradually maturing; the ultimate goal is society-wide medical informatization. The systems within the medical field, such as the emergency system, the inpatient system, the physical examination system, and the imaging center, are independent of one another, and some of them place low requirements on patient information, so the data are entered incompletely. The business systems follow inconsistent standards and use inconsistent business fields, so patient information is not associated and each system's information stands alone. Only some fields of the patient data are valid, so a patient cannot be uniquely confirmed and identifiers are missing. In other words, the same patient has multiple sets of information and cannot be uniquely determined from the patient data, which hinders communication between systems. In addition, the data volume of each system is large and becomes larger still once the data of several systems are gathered. Identifying whether two records represent the same patient in such a volume is correspondingly difficult, and no good solution exists at present.
Therefore, a data matching scheme is needed that can quickly match the records representing the same person in the data across systems, making data interaction between systems smoother.
Summary of the invention
In view of the above problem, the present invention proposes a data matching scheme that can quickly match the records representing the same person in the data across systems, making data interaction between systems smoother.
According to one aspect of the present invention, a data matching system based on a comparison template is proposed, comprising: a blocking unit, configured to receive data from different domains and to partition the data into blocks according to a configured index entry, the index entry comprising one or more fields of the data; a comparison unit, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and a classification unit, configured to determine the matching relationship between the two data records of a matching pair according to a preset similarity threshold.
A data record in a system has multiple fields and represents one user's identity data. A large volume of data can be partitioned according to one or several of these fields, so that it is divided into several data blocks. Within each data block, records whose index entry values, that is, the values of the index fields, are identical are combined into matching pairs. Because the data are partitioned in advance and only records with the same field values are paired, the amount of computation is greatly reduced and the system load is relieved.
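The blocking and pairing described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the record layout, the field names (`name`, `birth_date`), and the helper names `block_records` and `candidate_pairs` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, index_fields):
    """Group records by the values of the index-entry fields (hypothetical helper)."""
    blocks = defaultdict(list)
    for record in records:
        # The block key is the tuple of index-field values for this record.
        key = tuple(record.get(field) for field in index_fields)
        blocks[key].append(record)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield every unordered pair of records that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Illustrative records; the fields and values are invented for this sketch.
records = [
    {"id": 1, "name": "Zhang Wei", "birth_date": "1980-01-01"},
    {"id": 2, "name": "Zhang  Wei", "birth_date": "1980-01-01"},
    {"id": 3, "name": "Li Na", "birth_date": "1975-05-20"},
]

blocks = block_records(records, ["birth_date"])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
```

Only records 1 and 2 share an index value here, so only that pair is passed on to the similarity calculation; record 3 is never compared with the others.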
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is then decided by comparing that similarity with the similarity threshold.
In the above technical solution, preferably, the system may further comprise: a data cleansing unit, configured to process the data according to a preset data format so that the data conform to the predetermined format; and the comparison unit comprises: an obtaining subunit, configured to combine, for each data block, the data whose index entry values are identical into matching pairs.
Data from different domains differ in format: the fields may differ, the way values are expressed may differ, or field values may be wrong. The data cleansing unit can identify invalid data and data that do not meet the requirements, and can cleanse the domain data from different systems to standardize them, which facilitates the subsequent association calculation.
In the above technical solution, preferably, the comparison unit comprises a calculation subunit, configured to calculate, for each field that the two data records have in common, the similarity value of the corresponding field contents of the two data records, and to determine the similarity according to these similarity values.
In the above technical solution, preferably, the calculation subunit is further configured, when the two data records have multiple fields in common, to take the sum of the similarity values of the common fields as the similarity of the two data records. Because each record may contain multiple fields, each field has to be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is determined from these field-level similarity values.
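As a rough sketch of summing field-level similarity values, the example below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the field comparison. The source does not say how individual field values are compared at this point, so that choice, the `ignore` parameter, and the function names are assumptions.

```python
from difflib import SequenceMatcher


def field_similarity(left, right):
    """Similarity value of two field contents in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, str(left), str(right)).ratio()


def record_similarity(rec_a, rec_b, ignore=("id",)):
    """Sum the similarity values of every field the two records share."""
    shared = [f for f in rec_a if f in rec_b and f not in ignore]
    return sum(field_similarity(rec_a[f], rec_b[f]) for f in shared)


left = {"id": 1, "name": "Li Na", "sex": "F"}
right = {"id": 2, "name": "Li Na", "sex": "F"}
total = record_similarity(left, right)  # two identical shared fields
```

Because the result is a plain sum, its upper bound grows with the number of shared fields, so thresholds applied to it would have to be chosen with that range in mind.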
In the above technical solution, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity is less than the first threshold and greater than the second threshold, determine that the relationship is a suspected relationship; and when the similarity is less than or equal to the second threshold, determine that the relationship is a mismatch relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the first threshold, the two records very probably represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the probability is not high, so whether they represent the same object must be decided manually. If the calculated similarity does not exceed the lower threshold, the two records are considered not to represent the same object, and they can be determined not to be in a matching relationship.
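The three-way decision can be written down directly. The sketch below follows the boundary conditions stated above (greater than or equal to the first threshold, strictly between the two, less than or equal to the second); the function name, the relation labels, and the use of `uuid.uuid4` to produce the unique identifier are assumptions for illustration.

```python
import uuid


def classify_pair(similarity, first_threshold, second_threshold):
    """Return (relation, link_id) for one matching pair."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must exceed the second threshold")
    if similarity >= first_threshold:
        # Matching relationship: create an identifier that links both records.
        return "match", uuid.uuid4().hex
    if similarity > second_threshold:
        # Suspected relationship: left for manual review.
        return "suspected", None
    # Mismatch relationship: the records are treated as different objects.
    return "mismatch", None


relation, link_id = classify_pair(0.7, first_threshold=0.85, second_threshold=0.6)
```

With the illustrative thresholds 0.85 and 0.6, a similarity of 0.7 falls in the suspected band and produces no identifier.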
According to another aspect of the present invention, a data matching method based on a comparison template is also proposed, comprising: receiving data from different domains and partitioning the data into blocks according to a configured index entry, the index entry comprising one or more fields of the data; obtaining matching pairs for each data block, each matching pair comprising two data records; and calculating a similarity for each matching pair according to the rules of the comparison template, and determining the matching relationship between the two data records of a matching pair according to a preset similarity threshold.
A data record in a system has multiple fields and represents one user's identity data. A large volume of data can be partitioned according to one or several of these fields, so that it is divided into several data blocks. Within each data block, records whose index entry values, that is, the values of the index fields, are identical are combined into matching pairs. Because the data are partitioned in advance and only records with the same field values are paired, the amount of computation is greatly reduced and the system load is relieved.
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is then decided by comparing that similarity with the similarity threshold.
In the above technical solution, preferably, before the data are partitioned, the method further comprises: processing the data according to a preset data format so that the data conform to the predetermined format; and the step of obtaining matching pairs for each data block specifically comprises: for each data block, combining the data whose index entry values are identical into matching pairs.
Data from different domains differ in format: the fields may differ, the way values are expressed may differ, or field values may be wrong. Data cleansing can identify invalid data and data that do not meet the requirements, and can cleanse the domain data from different systems to standardize them, which facilitates the subsequent association calculation.
In the above technical solution, preferably, the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for each field that the two data records have in common, calculating the similarity value of the corresponding field contents of the two data records, and determining the similarity according to these similarity values.
In the above technical solution, preferably, when the two data records have multiple fields in common, the sum of the similarity values of the common fields is taken as the similarity of the two data records. Because each record may contain multiple fields, each field has to be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is determined from these field-level similarity values.
In any of the above technical solutions, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. When the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier for associating the two data records is generated; when the similarity is less than the first threshold and greater than the second threshold, the relationship is determined to be a suspected relationship; and when the similarity is less than or equal to the second threshold, the relationship is determined to be a mismatch relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the first threshold, the two records very probably represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the probability is not high, so whether they represent the same object must be decided manually. If the calculated similarity does not exceed the lower threshold, the two records are considered not to represent the same object, and they can be determined not to be in a matching relationship.
Because the systems are independent of one another and their data formats, field definitions, and expressed content are inconsistent, communication between systems is hindered and the same object cannot be uniquely determined. The present invention therefore proposes a new data matching scheme based on a comparison template, which can quickly identify matching pairs in a large volume of data and quickly match the different representations of the same object, so that the data of different systems can be associated with one another.
Brief description of the drawings
Fig. 1 is a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention;
Fig. 2 is a flow chart of a data matching method based on a comparison template according to an embodiment of the present invention;
Fig. 3 is a detailed flow chart of a data matching method based on a comparison template according to another embodiment of the present invention.
Detailed description of the embodiments
So that the above objects, features, and advantages of the present invention can be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, so its scope of protection is not limited by the specific embodiments described below.
Fig. 1 is a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention.
As shown in Fig. 1, a data matching system 100 based on a comparison template according to an embodiment of the present invention comprises: a blocking unit 102, configured to receive data from different domains and to partition the data into blocks according to a configured index entry, the index entry comprising one or more fields of the data; a comparison unit 104, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and a classification unit 106, configured to determine the matching relationship between the two data records of a matching pair according to a preset similarity threshold.
A data record in a system has multiple fields and represents one user's identity data. A large volume of data can be partitioned according to one or several of these fields, so that it is divided into several data blocks. Within each data block, records whose index entry values, that is, the values of the index fields, are identical are combined into matching pairs. Because the data are partitioned in advance and only records with the same field values are paired, the amount of computation is greatly reduced and the system load is relieved.
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is then decided by comparing that similarity with the similarity threshold. For example, the information of one matching pair is shown in Table 1:
ID   A    B    C    D
1    a1   b1   c1   d1
2    a2   b2   c2   d2
Table 1
Here, the record with id=1 is the left node of the matching pair and the record with id=2 is the right node. The matching identifier items are {A, [A+C], [B], [C+D]} and the matching weights are {0.9, 0.92, 0.5, 0.945}. Using the rules of the comparison template (for example, the comparison template for A uses phonetic similarity * 1.2 + character similarity * 1.8), the similarities of the identifier items are calculated as {0.8, 0.4, 0.9, 0.5}. The score of the matching pair is then f1(0.8, 0.9) + f2(0.4, 0.92) + f3(0.9, 0.5) + f4(0.5, 0.945), where each function fn() can be chosen as a different calculation function according to the threshold of the corresponding single identifier match.
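The score in this example can be evaluated once a concrete fn() is chosen. The text leaves fn() open, so the sketch below assumes, purely for illustration, that every fn is the product of the item similarity and its weight; `weighted_product` and `pair_score` are hypothetical names.

```python
# Values taken from the example above; order: A, [A+C], [B], [C+D].
similarities = [0.8, 0.4, 0.9, 0.5]
weights = [0.9, 0.92, 0.5, 0.945]


def weighted_product(similarity, weight):
    """Hypothetical f_n: the source does not specify the actual function."""
    return similarity * weight


def pair_score(similarities, weights, functions=None):
    """Sum f_n(similarity_n, weight_n) over all identifier items."""
    if functions is None:
        # Assumption: use the same function for every identifier item.
        functions = [weighted_product] * len(similarities)
    return sum(f(s, w) for f, s, w in zip(functions, similarities, weights))


score = pair_score(similarities, weights)
```

Under that assumption the score is 0.72 + 0.368 + 0.45 + 0.4725, about 2.0105. Passing a different list of `functions` models the case where each identifier item uses its own calculation function.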
In the above technical solution, preferably, the system may further comprise: a data cleansing unit 108, configured to process the data according to a preset data format so that the data conform to the predetermined format; and the comparison unit 104 comprises: an obtaining subunit 1042, configured to combine, for each data block, the data whose index entry values are identical into matching pairs.
Data from different domains differ in format: the fields may differ, the way values are expressed may differ, or field values may be wrong. The data cleansing unit can identify invalid data and data that do not meet the requirements, and can cleanse the domain data from different systems to standardize them, which facilitates the subsequent association calculation.
In the above technical solution, preferably, the comparison unit 104 comprises a calculation subunit 1044, configured to calculate, for each field that the two data records have in common, the similarity value of the corresponding field contents of the two data records, and to determine the similarity according to these similarity values. For example, for the "name" field, the similarity value can be calculated with the formula phonetic similarity * 1.2 + character similarity * 1.8.
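A comparison template for the "name" field along these lines might look like the sketch below. Only the weights 1.2 and 1.8 come from the text. The two component measures are assumptions: character similarity uses `difflib.SequenceMatcher`, and phonetic similarity compares a deliberately crude, hypothetical phonetic key that a real system would replace with a proper phonetic encoding such as pinyin.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def crude_phonetic_key(text):
    """Hypothetical stand-in for a phonetic code: lowercase letters only,
    vowels after the first letter dropped, repeated letters collapsed."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    key = []
    for position, ch in enumerate(letters):
        if position > 0 and ch in "aeiou":
            continue
        if key and key[-1] == ch:
            continue
        key.append(ch)
    return "".join(key)


def phonetic_similarity(a, b):
    """Similarity of the two crude phonetic keys in [0, 1]."""
    return SequenceMatcher(None, crude_phonetic_key(a), crude_phonetic_key(b)).ratio()


def name_template(a, b, phonetic_weight=1.2, char_weight=1.8):
    """Name similarity value: phonetic * 1.2 + character * 1.8."""
    return phonetic_weight * phonetic_similarity(a, b) + char_weight * char_similarity(a, b)


value = name_template("zhang", "zhong")
```

With these weights the value ranges from 0 to 3, so it is not a normalized probability, and any threshold applied to it would need to be set on the same scale.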
In the above technical solution, preferably, the calculation subunit 1044 is further configured, when the two data records have multiple fields in common, to take the sum of the similarity values of the common fields as the similarity of the two data records. Because each record may contain multiple fields, each field has to be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is determined from these field-level similarity values.
In the above technical solution, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit 106 is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity is less than the first threshold and greater than the second threshold, determine that the relationship is a suspected relationship; and when the similarity is less than or equal to the second threshold, determine that the relationship is a mismatch relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the first threshold, the two records very probably represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the probability is not high, so whether they represent the same object must be decided manually. If the calculated similarity does not exceed the lower threshold, the two records are considered not to represent the same object, and they can be determined not to be in a matching relationship.
Fig. 2 is a flow chart of a data matching method based on a comparison template according to an embodiment of the present invention.
As shown in Fig. 2, a data matching method based on a comparison template according to an embodiment of the present invention may comprise the following steps. Step 202: receive data from different domains and partition the data into blocks according to a configured index entry, the index entry comprising one or more fields of the data. Step 204: obtain matching pairs for each data block, each matching pair comprising two data records. Step 206: calculate a similarity for each matching pair according to the rules of the comparison template, and determine the matching relationship between the two data records of a matching pair according to a preset similarity threshold.
A data record in a system has multiple fields and represents one user's identity data. A large volume of data can be partitioned according to one or several of these fields, so that it is divided into several data blocks. Within each data block, records whose index entry values, that is, the values of the index fields, are identical are combined into matching pairs. Because the data are partitioned in advance and only records with the same field values are paired, the amount of computation is greatly reduced and the system load is relieved.
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is then decided by comparing that similarity with the similarity threshold.
In the above technical solution, preferably, before the data are partitioned, the method further comprises: processing the data according to a preset data format so that the data conform to the predetermined format; and the step of obtaining matching pairs for each data block specifically comprises: for each data block, combining the data whose index entry values are identical into matching pairs.
Data from different domains differ in format: the fields may differ, the way values are expressed may differ, or field values may be wrong. Data cleansing can identify invalid data and data that do not meet the requirements, and can cleanse the domain data from different systems to standardize them, which facilitates the subsequent association calculation.
In the above technical solution, preferably, the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for each field that the two data records have in common, calculating the similarity value of the corresponding field contents of the two data records, and determining the similarity according to these similarity values.
In the above technical solution, preferably, when the two data records have multiple fields in common, the sum of the similarity values of the common fields is taken as the similarity of the two data records. Because each record may contain multiple fields, each field has to be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is determined from these field-level similarity values.
In any of the above technical solutions, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. When the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier for associating the two data records is generated; when the similarity is less than the first threshold and greater than the second threshold, the relationship is determined to be a suspected relationship; and when the similarity is less than or equal to the second threshold, the relationship is determined to be a mismatch relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the first threshold, the two records very probably represent the same object, and they can be determined to be in a matching relationship. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the probability is not high, so whether they represent the same object must be decided manually. If the calculated similarity does not exceed the lower threshold, the two records are considered not to represent the same object, and they can be determined not to be in a matching relationship.
The data matching method based on a comparison template according to the present invention is described in detail below with reference to Fig. 3.
As shown in Fig. 3, in step 302 the device receives domain data from different domains (for example, the outpatient system is one domain and the inpatient system is another). Because different systems define data fields, express content, and format values differently, the domain data need to be cleansed according to preset rules.
For example, suppose times are expressed as 2012-12-12 in the master database of the management server, while other domain data express them as December 12, 2012 or 2012.12.12. The times in these formats can then be unified to 2012-12-12. As another example, if names in the master database contain no spaces between characters but names in the domain data do, the spaces can be deleted so that the names agree with the format in the master database.
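The two cleansing rules in this example can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the date rule only handles inputs written as numeric year, month, day with arbitrary separators (such as 2012.12.12 or 2012年12月12日); spelled-out month names would need an extra rule.

```python
import re
from datetime import date

# Four-digit year, then month and day, separated by any non-digit characters.
_DATE_PATTERN = re.compile(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if the value is not a valid date."""
    match = _DATE_PATTERN.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # Impossible dates such as month 13 are flagged as invalid data.
        return None


def normalize_name(text):
    """Delete all whitespace so names agree with the master-database format."""
    return re.sub(r"\s+", "", text)


cleaned = normalize_date("2012.12.12")
```

Returning `None` for unparseable values gives the cleansing stage a simple way to flag invalid data instead of passing it on to blocking.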
In step 304, because the data volume of each system is large and the total becomes even larger once the data of several systems are gathered, the domain data need to be partitioned into blocks and processed block by block.
In step 306, matching pairs are obtained for each data block and the similarity of each matching pair is calculated.
After the data are partitioned, comparison items, that is, matching pairs, must first be formed before the data can be compared. If all the data in a block were compared with one another, on the order of n² comparison items would be formed, and much time would be wasted on similarity calculation. The formation of matching pairs therefore needs to be controlled so that pairs that do not need to be compared are removed, which reduces the amount of computation, increases speed, and lowers resource usage. When the similarity of the two data records in a matching pair is calculated, the similarity value of each common field can be calculated first, and the similarity of the two records is then obtained from the similarity values of all the fields. Different similarity calculation methods are provided in the comparison templates, and the similarity of a matching pair can be calculated according to the selected comparison template.
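The saving from restricting comparisons can be made concrete by counting unordered pairs: n records compared exhaustively give n(n-1)/2 pairs, which grows on the order of n², whereas pairing only within groups gives the sum of the per-group counts. The group sizes below are invented for illustration.

```python
from math import comb


def full_pair_count(n):
    """Number of unordered pairs when all n records are compared with each other."""
    return comb(n, 2)


def grouped_pair_count(group_sizes):
    """Number of unordered pairs when records are only paired within their group."""
    return sum(comb(size, 2) for size in group_sizes)


exhaustive = full_pair_count(10_000)          # every record against every other
grouped = grouped_pair_count([100] * 100)     # 100 groups of 100 records each
```

For these hypothetical sizes the exhaustive count is 49,995,000 pairs and the grouped count is 495,000, roughly a hundredfold reduction.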
In step 308, according to the similarity threshold pre-set come to matching to classifying.Can be that similarity is set Two similarity thresholds, first threshold are high threshold, and Second Threshold is lower limit.If the similarity calculated is higher than the first threshold Value, illustrate that the two records represent that the possibility of same object is very big, then can determine that the two records are matching relationships; If the similarity calculated is between high threshold and lower limit, then illustrate that the two records may represent same object, can Energy property is not very big, it is necessary to manually be determined that the two records indicate whether same object;If the similarity calculated exists Less than lower limit, then illustrate that the two records can not possibly represent same object, it may be determined that the two records are not that matching is closed System.
In step 310, the classification results are reviewed; for example, matching pairs in a doubtful relation are judged manually. If the classification results are considered inaccurate, the matching pairs can be labelled again. For example, the similarity thresholds can be adjusted, and the matching pairs are then classified again using the adjusted similarity thresholds.
Because different systems are independent of one another, their data formats, field definitions, and ways of expressing content are inconsistent. This hinders communication between systems and makes it impossible to uniquely identify the same object. The present invention therefore proposes a new data matching scheme based on comparison templates, which can quickly identify matching pairs in large volumes of data and quickly match the different representations of the same object, so that the data of different systems can be associated with one another.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the invention shall be included within its scope of protection.

Claims (8)

  1. A data matching system based on comparison templates, characterized by comprising:
    a blocking unit, configured to receive data from different domains and to partition the data into blocks according to a set index entry, the index entry comprising one or more fields of the data;
    a comparing unit, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of a comparison template;
    a classification unit, configured to determine the matching relationship of the two data records in a matching pair according to preset similarity thresholds; the comparing unit comprises: an obtaining subunit, configured to form, for each data block, matching pairs from the data having the same value of the index entry;
    the similarity thresholds comprise a first threshold and a second threshold, the first threshold being greater than the second threshold;
    the classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relation of the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relation of the two data records is a doubtful relation; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relation of the two data records is a mismatch relation.
  2. The data matching system based on comparison templates according to claim 1, characterized by further comprising: a data cleaning unit, configured to process the data according to a preset data format so that the data conforms to a predetermined format.
  3. The data matching system based on comparison templates according to claim 1 or 2, characterized in that the comparing unit comprises a computation subunit, configured to calculate, for each same field of the two data records, the similarity value of the content corresponding to that field in the two data records, and to determine the similarity according to the similarity values of the content corresponding to the same fields.
  4. The data matching system based on comparison templates according to claim 3, characterized in that the computation subunit is further configured to, when the two data records have multiple same fields, take the sum of the similarity values corresponding to the same fields as the similarity of the two data records.
  5. A data matching method based on comparison templates, characterized by comprising:
    receiving data from different domains, and partitioning the data into blocks according to a set index entry, the index entry comprising one or more fields of the data;
    obtaining matching pairs for each data block, each matching pair comprising two data records;
    calculating a similarity for each matching pair according to the rules of a comparison template, and determining the matching relationship of the two data records in a matching pair according to preset similarity thresholds; the step of obtaining matching pairs for each data block specifically comprises: for each data block, forming matching pairs from the data having the same value of the index entry;
    the similarity thresholds comprise a first threshold and a second threshold, the first threshold being greater than the second threshold;
    when the similarity of the two data records is greater than or equal to the first threshold, determining that the relation of the two data records is a matching relationship and generating a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determining that the relation of the two data records is a doubtful relation; and when the similarity of the two data records is less than or equal to the second threshold, determining that the relation of the two data records is a mismatch relation.
  6. The data matching method based on comparison templates according to claim 5, characterized in that, before the data is partitioned into blocks, the method further comprises: processing the data according to a preset data format so that the data conforms to a predetermined format.
  7. The data matching method based on comparison templates according to claim 5 or 6, characterized in that the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for each same field of the two data records, calculating the similarity value of the content corresponding to that field in the two data records, and determining the similarity according to the similarity values of the content corresponding to the same fields.
  8. The data matching method based on comparison templates according to claim 7, characterized in that, when the two data records have multiple same fields, the sum of the similarity values corresponding to the same fields is taken as the similarity of the two data records.
CN201310456767.0A 2013-09-29 2013-09-29 Based on the data matching system and method for comparing template Active CN103530334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310456767.0A CN103530334B (en) 2013-09-29 2013-09-29 Based on the data matching system and method for comparing template

Publications (2)

Publication Number Publication Date
CN103530334A CN103530334A (en) 2014-01-22
CN103530334B true CN103530334B (en) 2018-01-23

Family

ID=49932343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310456767.0A Active CN103530334B (en) 2013-09-29 2013-09-29 Based on the data matching system and method for comparing template

Country Status (1)

Country Link
CN (1) CN103530334B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809141A (en) * 2014-01-29 2015-07-29 携程计算机技术(上海)有限公司 Matching system and method of hotel data
CN105096028A (en) * 2014-11-20 2015-11-25 北京航天金盾科技有限公司 Intelligent matching method of population data
CN106681524A (en) * 2015-11-10 2017-05-17 阿里巴巴集团控股有限公司 Method and device for processing information
CN107291672B (en) * 2016-03-31 2020-11-20 阿里巴巴集团控股有限公司 Data table processing method and device
CN106021526B (en) * 2016-05-25 2019-09-27 东软集团股份有限公司 News category method and device
CN108572947B (en) * 2017-03-13 2019-11-19 腾讯科技(深圳)有限公司 A kind of data fusion method and device
CN108664497B (en) * 2017-03-30 2020-11-03 大有秦鼎(北京)科技有限公司 Data matching method and device
CN107193860B (en) * 2017-03-31 2021-03-02 苏州艾隆信息技术有限公司 Medicine information multidimensional identification method and system
CN107203686B (en) * 2017-03-31 2021-04-20 苏州艾隆信息技术有限公司 Medicine information difference processing method and system
CN107103048B (en) * 2017-03-31 2021-04-20 苏州艾隆信息技术有限公司 Medicine information matching method and system
CN108038504B (en) * 2017-12-11 2019-12-27 深圳房讯通信息技术有限公司 Method for analyzing content of house property certificate photo
CN108920601B (en) * 2018-06-27 2020-12-01 中国联合网络通信集团有限公司 Data matching method and device
CN109063178B (en) * 2018-08-22 2019-12-24 四川新网银行股份有限公司 Method and device for automatically expanding self-help analysis report
CN113535943A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Medical record classification method and device and data record classification method and device
CN111737533B (en) * 2020-06-19 2024-02-09 东软集团股份有限公司 Method, device, storage medium and equipment for processing inspection items
CN112732703B (en) * 2021-03-23 2022-04-12 中国信息通信研究院 Metadata processing method, metadata processing apparatus, and readable storage medium
CN113434584B (en) * 2021-06-28 2022-10-14 国网北京市电力公司 Data processing method and device for power equipment and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186427A (en) * 2011-12-31 2013-07-03 中国银联股份有限公司 System and method for analyzing data record set
CN103257961A (en) * 2012-02-15 2013-08-21 北大方正集团有限公司 Method, device and system of bibliography repeat removal

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739414A (en) * 2008-11-25 2010-06-16 华中师范大学 Ontological concept mapping method
US9069850B2 (en) * 2011-11-08 2015-06-30 Comcast Cable Communications, Llc Content descriptor
CN102542262B (en) * 2012-01-04 2013-07-31 东南大学 Waveform identification method based on operating-characteristic working condition waveform library of high-speed rail

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"不同应用系统相关数据的匹配检测与借用";齐为华;《2007年CAD/CAM学术交流会议论文集》;20070531;第180-181页 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: PKU HEALTHCARE IT CO., LTD.

Free format text: FORMER OWNER: FOUNDER INTERNATIONAL CO., LTD.

Effective date: 20150203

Free format text: FORMER OWNER: FOUNDER INTERNATIONAL (BEIJING) CO., LTD.

Effective date: 20150203

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 215123 SUZHOU, JIANGSU PROVINCE TO: 100080 HAIDIAN, BEIJING

TA01 Transfer of patent application right

Effective date of registration: 20150203

Address after: 100080, No. 19, No. 52 West Fourth Ring Road, Beijing, Haidian District

Applicant after: Peking University Medical Information Technology Co.,Ltd.

Address before: Suzhou City, Jiangsu Province, Suzhou Industrial Park 215123 Xinghu Street No. 328 Creative Industry Park founder International Building

Applicant before: FOUNDER INTERNATIONAL Co.,Ltd.

Applicant before: Founder International Co.,Ltd. (Beijing)

GR01 Patent grant
PP01 Preservation of patent right

Effective date of registration: 20240202

Granted publication date: 20180123