CN103530334A - System and method for data matching based on comparison module - Google Patents


Info

Publication number
CN103530334A
Authority
CN
China
Prior art keywords
data
threshold
similarity
data recording
coupling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310456767.0A
Other languages
Chinese (zh)
Other versions
CN103530334B (en)
Inventor
龚健
张应才
张恒
李登高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Medical Information Technology Co ltd
Original Assignee
Founder International Co Ltd
Founder International Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Founder International Co Ltd, Founder International Beijing Co Ltd
Priority to CN201310456767.0A
Publication of CN103530334A
Application granted
Publication of CN103530334B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/22 Social work or social welfare, e.g. community support activities or counselling services

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a system and method for data matching based on a comparison module. The system comprises a blocking unit, a comparing unit and a classification unit. The blocking unit receives data from different domains and partitions the data into blocks according to configured index items, where the index items comprise one or more fields of the data. The comparing unit obtains matching pairs for each data block, each matching pair comprising two data records, and calculates a similarity for each matching pair according to the rules of the comparison module. The classification unit determines the matching relationship of the two data records in a matching pair according to preset similarity thresholds. With this technical scheme, associated data can be recognized quickly within a large data volume, the similarity of the associated data can be calculated, and whether the associated data represent the same object can be judged from that similarity, so that different systems can communicate smoothly with one another.

Description

Data matching system and method based on a comparison template
Technical field
The present invention relates to the field of computer technology, and in particular to a data matching system based on a comparison template and a data matching method based on a comparison template.
Background
At present, medical information systems in China exist in many forms side by side and are being improved step by step; the ultimate goal is the socialization of medical information. Within the medical system, the individual systems are independent of one another, for example the outpatient and emergency system, the inpatient system, the physical examination system, and the imaging center. Some of these systems place low demands on patient information, so the data entered is incomplete. The standards and business fields of the individual business systems are inconsistent, so patient information is not linked and the information in each system stands alone. Only some fields of the patient data are valid, so a patient cannot be uniquely confirmed and identifiers are missing. In other words, the same patient has multiple sets of information and the patient's data cannot be uniquely determined, which hinders communication between systems. In addition, the data volume within each system is large, and once the data of multiple systems has been aggregated the volume is larger still. Identifying whether two records represent the same patient in such a volume is correspondingly difficult, and no good solution currently exists.
Therefore, a data matching scheme is needed that can quickly find, among the data of different systems, the records that represent the same person, so that data exchange between the systems becomes smoother.
Summary of the invention
In view of the above problems, the present invention proposes a data matching scheme that can quickly find, among the data of different systems, the records that represent the same person, so that data exchange between the systems becomes smoother.
According to one aspect of the present invention, a data matching system based on a comparison template is proposed, comprising: a blocking unit, configured to receive data from different domains and to partition the data into blocks according to a configured index item, the index item comprising one or more fields of the data; a comparing unit, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and a classification unit, configured to determine the matching relationship of the two data records in a matching pair according to a preset similarity threshold.
The data in a system has multiple fields, which together represent a user's identity data. A large volume of data can be partitioned according to one or several fields, so that it is divided into a number of data blocks. Within each data block, the records are numbered, and records whose index item values (that is, field values) are identical are combined into matching pairs. Because the data is partitioned in advance and only records with identical field values are paired, the amount of computation is greatly reduced and the load on the system is eased.
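The blocking and pairing described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names (`name`, `birth`), the sample records, and the helper names `block_records` and `candidate_pairs` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, index_fields):
    """Group records by the values of the index-item fields (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in index_fields)
        blocks[key].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Yield every unordered pair of records that falls into the same block."""
    for recs in blocks.values():
        yield from combinations(recs, 2)

# Illustrative records only; the field names are assumptions.
records = [
    {"id": 1, "name": "Zhang Wei", "birth": "1980-01-02"},
    {"id": 2, "name": "Zhang  Wei", "birth": "1980-01-02"},
    {"id": 3, "name": "Li Na", "birth": "1975-06-30"},
    {"id": 4, "name": "Li Na", "birth": "1980-01-02"},
]
blocks = block_records(records, ["birth"])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
print(pairs)
```

Only records sharing the index-item value are paired, so record 3 is never compared with the others.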
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is determined by comparing that similarity with the similarity threshold.
In the above technical solution, preferably, the system may further comprise: a data cleansing unit, configured to process the data according to a preset data format so that it conforms to a predetermined format; and the comparing unit comprises: an obtaining subunit, configured to combine, for each data block, the data whose index item values are identical into matching pairs.
Data from different domains differs in format: the fields may differ, the form of expression may differ, or field values may be wrong. The data cleansing unit can identify invalid or nonconforming data and clean the domain data from different systems, standardizing the data and facilitating the subsequent association calculation.
In the above technical solution, preferably, the comparing unit comprises a calculation subunit, configured to calculate, for each field that the two data records have in common, a similarity value for the contents of that field in the two records, and to determine the similarity according to those similarity values.
In the above technical solution, preferably, the calculation subunit is further configured to use, when the two data records have multiple fields in common, the sum of the similarity values of the common fields as the similarity of the two data records. Because each record may contain multiple fields, each field must be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is then determined from those values.
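The summation of per-field similarity values can be sketched as below. The source does not fix a per-field measure at this point, so `difflib.SequenceMatcher.ratio()` is used purely as a stand-in; `field_similarity`, `record_similarity`, and the sample fields are hypothetical names chosen for the example.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Stand-in similarity value in [0, 1] for one field (assumption: difflib ratio)."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, str(a), str(b)).ratio()

def record_similarity(left, right, skip=("id",)):
    """Sum the similarity values over every field the two records have in common."""
    shared = [f for f in left if f in right and f not in skip]
    return sum(field_similarity(left[f], right[f]) for f in shared)

left = {"id": 1, "name": "abc", "sex": "F"}
right = {"id": 2, "name": "abc", "sex": "F"}
print(record_similarity(left, right))  # two identical shared fields
```

Because the result is a sum, its range grows with the number of shared fields, so any thresholds applied to it have to be chosen with that scale in mind.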
In the above technical solution, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier that associates the two data records; when the similarity is less than the first threshold and greater than the second threshold, determine that the relationship is a suspected (possible-match) relationship; and when the similarity is less than or equal to the second threshold, determine that the relationship is a non-matching relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the upper threshold, the two records very probably represent the same object, and they can be determined to be a match. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the likelihood is not high enough, so a manual decision is needed on whether they represent the same object. If the calculated similarity does not exceed the lower threshold, the two records cannot be taken to represent the same object, and they can be determined to be a non-match.
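The two-threshold decision can be sketched as a small function. The boundary handling follows the text (greater than or equal to the first threshold is a match, less than or equal to the second is a non-match); the function name `classify_pair`, the label strings, and the use of `uuid4` for the unique identifier are assumptions of this sketch.

```python
import uuid

def classify_pair(left_id, right_id, similarity, high, low):
    """Classify one matching pair against the first (high) and second (low) thresholds."""
    if high <= low:
        raise ValueError("the first threshold must be greater than the second")
    if similarity >= high:
        # A match receives a unique identifier linking both records
        # (uuid4 is an assumption; the source does not say how it is generated).
        relation, link_id = "match", uuid.uuid4().hex
    elif similarity > low:
        relation, link_id = "suspected", None  # left for manual review
    else:
        relation, link_id = "non-match", None
    return {"ids": (left_id, right_id), "relation": relation, "link_id": link_id}

result = classify_pair(1, 2, similarity=2.6, high=2.5, low=1.0)
print(result["relation"])
```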
According to another aspect of the present invention, a data matching method based on a comparison template is also proposed, comprising: receiving data from different domains and partitioning the data into blocks according to a configured index item, the index item comprising one or more fields of the data; obtaining matching pairs for each data block, each matching pair comprising two data records; and calculating a similarity for each matching pair according to the rules of the comparison template, and determining the matching relationship of the two data records in the matching pair according to a preset similarity threshold.
The data in a system has multiple fields, which together represent a user's identity data. A large volume of data can be partitioned according to one or several fields, so that it is divided into a number of data blocks. Within each data block, the records are numbered, and records whose index item values (that is, field values) are identical are combined into matching pairs. Because the data is partitioned in advance and only records with identical field values are paired, the amount of computation is greatly reduced and the load on the system is eased.
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is determined by comparing that similarity with the similarity threshold.
In the above technical solution, preferably, before the data is partitioned, the method further comprises: processing the data according to a preset data format so that it conforms to a predetermined format; and the step of obtaining matching pairs for each data block specifically comprises: for each data block, combining the data whose index item values are identical into matching pairs.
Data from different domains differs in format: the fields may differ, the form of expression may differ, or field values may be wrong. Data cleansing can identify invalid or nonconforming data and clean the domain data from different systems, standardizing the data and facilitating the subsequent association calculation.
In the above technical solution, preferably, the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for each field that the two data records have in common, calculating a similarity value for the contents of that field in the two records, and determining the similarity according to those similarity values.
In the above technical solution, preferably, when the two data records have multiple fields in common, the sum of the similarity values of the common fields is used as the similarity of the two data records. Because each record may contain multiple fields, each field must be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is then determined from those values.
In any of the above technical solutions, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. When the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier that associates the two data records is generated; when the similarity is less than the first threshold and greater than the second threshold, the relationship is determined to be a suspected (possible-match) relationship; and when the similarity is less than or equal to the second threshold, the relationship is determined to be a non-matching relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the upper threshold, the two records very probably represent the same object, and they can be determined to be a match. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the likelihood is not high enough, so a manual decision is needed on whether they represent the same object. If the calculated similarity does not exceed the lower threshold, the two records cannot be taken to represent the same object, and they can be determined to be a non-match.
Because systems are independent of one another, the data formats, field definitions, and content of different systems are inconsistent, which hinders communication between systems and makes it impossible to uniquely identify the same object. The present invention therefore proposes a new data matching scheme based on a comparison template, which can quickly identify matching pairs in a large data volume and quickly match the different representations of the same object, so that the data of different systems can be associated with one another.
Brief description of the drawings
Fig. 1 shows a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention;
Fig. 2 shows a flowchart of a data matching method based on a comparison template according to an embodiment of the present invention;
Fig. 3 shows a detailed flowchart of a data matching method based on a comparison template according to another embodiment of the present invention.
Detailed description of the embodiments
So that the above objects, features, and advantages of the present invention can be understood more clearly, the present invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, provided they do not conflict, the embodiments of the present application and the features of those embodiments may be combined with one another.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention can also be implemented in ways other than those described here, and the scope of protection of the present invention is therefore not limited by the specific embodiments disclosed below.
Fig. 1 shows a schematic diagram of a data matching system based on a comparison template according to an embodiment of the present invention.
As shown in Fig. 1, a data matching system 100 based on a comparison template according to an embodiment of the present invention comprises: a blocking unit 102, configured to receive data from different domains and to partition the data into blocks according to a configured index item, the index item comprising one or more fields of the data; a comparing unit 104, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template; and a classification unit 106, configured to determine the matching relationship of the two data records in a matching pair according to a preset similarity threshold.
The data in a system has multiple fields, which together represent a user's identity data. A large volume of data can be partitioned according to one or several fields, so that it is divided into a number of data blocks. Within each data block, the records are numbered, and records whose index item values (that is, field values) are identical are combined into matching pairs. Because the data is partitioned in advance and only records with identical field values are paired, the amount of computation is greatly reduced and the load on the system is eased.
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is determined by comparing that similarity with the similarity threshold. For example, the information of one matching pair is shown in Table 1:
ID    A     B     C     D
1     a1    b1    c1    d1
2     a2    b2    c2    d2
Table 1
Here, the record with id=1 is the left node of the matching pair and the record with id=2 is the right node. The matching identifier items are {A, [A+c], [b], [c+d]}, and the matching weight group is {0.9, 0.92, 0.5, 0.945}. Applying the rules of the comparison template (for example, the comparison template for A uses phonetic similarity × 1.2 + character similarity × 1.8) gives identifier item similarities of {0.8, 0.4, 0.9, 0.5}. The score of this matching pair is then f1(0.8, 0.9) + f2(0.4, 0.92) + f3(0.9, 0.5) + f4(0.5, 0.945), where each function fn() can be bound to a different calculation function depending on the size of the matching threshold of the individual identifier item.
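The score for the Table 1 example can be worked through numerically. The source leaves the functions fn() unspecified, so this sketch assumes, purely for illustration, that every fn multiplies an item's similarity by its weight; the name `weighted` and the (similarity, weight) argument order are assumptions.

```python
def weighted(similarity, weight):
    """Assumed form of f_n: the item similarity scaled by its weight."""
    return similarity * weight

# Values taken from the example; one entry per matching identifier item.
items = ["A", "[A+c]", "[b]", "[c+d]"]
weights = [0.9, 0.92, 0.5, 0.945]
similarities = [0.8, 0.4, 0.9, 0.5]
functions = [weighted, weighted, weighted, weighted]  # f1..f4, all assumed identical

def pair_score(similarities, weights, functions):
    """Sum f_n(similarity_n, weight_n) over all identifier items."""
    return sum(f(s, w) for f, s, w in zip(functions, similarities, weights))

score = pair_score(similarities, weights, functions)
print(round(score, 4))  # 0.72 + 0.368 + 0.45 + 0.4725
```

Keeping `functions` as a list makes it straightforward to bind a different calculation function to an individual item, which is the flexibility the text attributes to fn().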
In the above technical solution, preferably, the system may further comprise: a data cleansing unit 108, configured to process the data according to a preset data format so that it conforms to a predetermined format; and the comparing unit 104 comprises: an obtaining subunit 1042, configured to combine, for each data block, the data whose index item values are identical into matching pairs.
Data from different domains differs in format: the fields may differ, the form of expression may differ, or field values may be wrong. The data cleansing unit can identify invalid or nonconforming data and clean the domain data from different systems, standardizing the data and facilitating the subsequent association calculation.
In the above technical solution, preferably, the comparing unit 104 comprises a calculation subunit 1044, configured to calculate, for each field that the two data records have in common, a similarity value for the contents of that field in the two records, and to determine the similarity according to those similarity values. For the "name" field, for example, the similarity value can be calculated with the formula phonetic similarity × 1.2 + character similarity × 1.8.
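The name-field formula can be sketched as follows. Only the coefficients 1.2 and 1.8 come from the text; how the phonetic and character similarities are themselves computed is not stated, so both are approximated here with `difflib.SequenceMatcher.ratio()`, and the phonetic key defaults to simple lowercasing as a placeholder (a real system for Chinese names would more plausibly compare pinyin, which is an assumption).

```python
from difflib import SequenceMatcher

def character_similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()

def phonetic_similarity(a, b, phonetic_key):
    """Similarity of the phonetic keys of two names, in [0, 1]."""
    return SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()

def name_similarity(a, b, phonetic_key=str.lower):
    """Similarity value for the name field: phonetic * 1.2 + character * 1.8."""
    return (1.2 * phonetic_similarity(a, b, phonetic_key)
            + 1.8 * character_similarity(a, b))

print(name_similarity("Zhang Wei", "Zhang Wei"))
print(name_similarity("Zhang Wei", "zhang wei"))
```

With these coefficients the value for one field ranges from 0 to 3, so identical names contribute 3 to the record similarity rather than 1.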
In the above technical solution, preferably, the calculation subunit 1044 is further configured to use, when the two data records have multiple fields in common, the sum of the similarity values of the common fields as the similarity of the two data records. Because each record may contain multiple fields, each field must be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is then determined from those values.
In the above technical solution, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. The classification unit 106 is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier that associates the two data records; when the similarity is less than the first threshold and greater than the second threshold, determine that the relationship is a suspected (possible-match) relationship; and when the similarity is less than or equal to the second threshold, determine that the relationship is a non-matching relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the upper threshold, the two records very probably represent the same object, and they can be determined to be a match. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the likelihood is not high enough, so a manual decision is needed on whether they represent the same object. If the calculated similarity does not exceed the lower threshold, the two records cannot be taken to represent the same object, and they can be determined to be a non-match.
Fig. 2 shows a flowchart of a data matching method based on a comparison template according to an embodiment of the present invention.
As shown in Fig. 2, a data matching method based on a comparison template according to an embodiment of the present invention may comprise the following steps: step 202, receiving data from different domains and partitioning the data into blocks according to a configured index item, the index item comprising one or more fields of the data; step 204, obtaining matching pairs for each data block, each matching pair comprising two data records; and step 206, calculating a similarity for each matching pair according to the rules of the comparison template, and determining the matching relationship of the two data records in the matching pair according to a preset similarity threshold.
The data in a system has multiple fields, which together represent a user's identity data. A large volume of data can be partitioned according to one or several fields, so that it is divided into a number of data blocks. Within each data block, the records are numbered, and records whose index item values (that is, field values) are identical are combined into matching pairs. Because the data is partitioned in advance and only records with identical field values are paired, the amount of computation is greatly reduced and the load on the system is eased.
After the matching pairs are obtained, the similarity of the two records in each pair must be determined, and the relationship between the two records is determined by comparing that similarity with the similarity threshold.
In the above technical solution, preferably, before the data is partitioned, the method further comprises: processing the data according to a preset data format so that it conforms to a predetermined format; and the step of obtaining matching pairs for each data block specifically comprises: for each data block, combining the data whose index item values are identical into matching pairs.
Data from different domains differs in format: the fields may differ, the form of expression may differ, or field values may be wrong. Data cleansing can identify invalid or nonconforming data and clean the domain data from different systems, standardizing the data and facilitating the subsequent association calculation.
In the above technical solution, preferably, the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for each field that the two data records have in common, calculating a similarity value for the contents of that field in the two records, and determining the similarity according to those similarity values.
In the above technical solution, preferably, when the two data records have multiple fields in common, the sum of the similarity values of the common fields is used as the similarity of the two data records. Because each record may contain multiple fields, each field must be compared: the similarity value between the field values of each common field of the two records is calculated, and the similarity between the records is then determined from those values.
In any of the above technical solutions, preferably, the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold. When the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier that associates the two data records is generated; when the similarity is less than the first threshold and greater than the second threshold, the relationship is determined to be a suspected (possible-match) relationship; and when the similarity is less than or equal to the second threshold, the relationship is determined to be a non-matching relationship.
Two bounds are set for the similarity: the first threshold is the upper threshold and the second threshold is the lower threshold. If the calculated similarity reaches the upper threshold, the two records very probably represent the same object, and they can be determined to be a match. If the calculated similarity lies between the upper and lower thresholds, the two records may represent the same object, but the likelihood is not high enough, so a manual decision is needed on whether they represent the same object. If the calculated similarity does not exceed the lower threshold, the two records cannot be taken to represent the same object, and they can be determined to be a non-match.
The data matching method based on a comparison template according to the present invention is described in detail below with reference to Fig. 3.
As shown in Fig. 3, in step 302, domain data is received from different domains (for example, the outpatient system is one domain and the inpatient system is another). Because the different systems differ in their field definitions, content, and formats, the domain data must be cleaned according to preset rules.
For example, in the master database of the management server, dates are expressed as 2012-12-12, while other domain data expresses the same date as December 12, 2012 or 2012.12.12; dates in these formats can be unified to 2012-12-12. As another example, in the master database there are no spaces between the characters of a name, but the names in the domain data contain spaces; these spaces can be deleted so that the names conform to the format used in the master database.
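The two cleansing rules in this example can be sketched as follows. The sketch assumes that incoming dates are written as year, month, day in digits with arbitrary separators (for instance `2012.12.12` or the Chinese form `2012年12月12日`); dates written with month names would need an additional rule. The function names are hypothetical.

```python
import re
from datetime import date

# Year, month, day as digit groups separated by any non-digit characters.
_DATE_RE = re.compile(r"(\d{4})\D+(\d{1,2})\D+(\d{1,2})")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if it cannot be parsed or is invalid."""
    match = _DATE_RE.search(text)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:  # e.g. month 13: flag as invalid rather than guess
        return None

def normalize_name(text):
    """Delete all whitespace inside a name to match the master database format."""
    return "".join(text.split())

print(normalize_date("2012.12.12"), normalize_name("张 三"))
```

Returning `None` for unparseable values lets the caller treat them as invalid data instead of silently passing them on to blocking.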
In step 304, because the data volume of each system is very large, and the total is larger still once the data of multiple systems has been aggregated, the domain data must be partitioned into blocks and processed block by block.
In step 306, matching pairs are obtained for each data block and the similarity of each matching pair is calculated.
After the data has been partitioned, comparison pairs, that is, matching pairs, must first be formed before any comparison can be made. If all the data in a block were compared against each other, on the order of n² comparisons would be formed for n records, and the similarity calculation would waste a great deal of time. The formation of matching pairs is therefore controlled so that pairs that need no comparison are removed, which reduces the amount of computation, improves processing speed, and lowers resource usage. When the similarity of the two data records in a matching pair is calculated, the similarity value of each common field can be calculated first, and the similarity of the two data records is then obtained from the similarity values of all the fields. Different similarity calculation methods are configured in the comparison template, and the similarity of a matching pair is calculated according to the comparison template selected.
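The saving from restricting pairs to records with the same index item value can be illustrated by counting comparisons. This is only a counting sketch with made-up sizes: 1,000 records spread evenly over 10 index values; `full_pair_count` and `blocked_pair_count` are hypothetical helper names.

```python
from collections import Counter
from math import comb

def full_pair_count(n):
    """Number of unordered pairs when every record is compared with every other."""
    return comb(n, 2)

def blocked_pair_count(index_values):
    """Number of pairs when only records sharing an index value are compared."""
    return sum(comb(size, 2) for size in Counter(index_values).values())

index_values = [i % 10 for i in range(1000)]  # 10 groups of 100 records each
print(full_pair_count(len(index_values)), blocked_pair_count(index_values))
```

In this idealized case the number of comparisons falls by roughly a factor of ten; in practice the gain depends on how evenly the index item spreads the records.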
In step 308, according to the similarity threshold setting in advance to coupling to classifying.Can be for similarity arranges two similarity thresholds, first threshold is high threshold, Second Threshold is lower limit.If the similarity calculating higher than first threshold, illustrates that the possibility of these two the same objects of record expression is very large, can determine that so these two records are matching relationships; If the similarity calculating between high threshold and lower limit, illustrates so these two records and may represent same object, possibility is not very large, need to manually determine whether these two records represent same object; If the similarity calculating, lower than lower limit, illustrates that these two records can not represent same object, can determine that these two records are not matching relationships so.
In step 310, the classification results are reviewed. For example, matching pairs in a suspected relationship are judged manually; if a classification result is considered inaccurate, the matching pair can be relabeled. For example, the similarity thresholds can be adjusted, and the matching pairs are then classified again according to the adjusted thresholds.
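One possible way to picture the review in step 310 is to rerun the classification with adjusted thresholds and let manual decisions override the automatic labels. The sketch below is an assumption about how this could be organized, not the patent's implementation; `reclassify`, `manual_labels`, the pair identifiers, and all numeric values are hypothetical.

```python
def classify(similarity, high, low):
    """Two-threshold classification of a single similarity value."""
    if similarity >= high:
        return "match"
    if similarity > low:
        return "suspected"
    return "non-match"


def reclassify(scores, high, low, manual_labels=None):
    """Classify every scored pair; manual labels take precedence."""
    manual_labels = manual_labels or {}
    return {
        pair_id: manual_labels.get(pair_id, classify(similarity, high, low))
        for pair_id, similarity in scores.items()
    }


# Hypothetical similarity scores keyed by a pair of record identifiers.
scores = {("a1", "b1"): 1.9, ("a2", "b2"): 1.5, ("a3", "b3"): 0.4}

first_pass = reclassify(scores, high=1.8, low=1.2)
# After review the upper threshold is lowered, so the second pair now matches.
second_pass = reclassify(scores, high=1.4, low=1.2)
# Alternatively, a reviewer rejects the suspected pair by hand.
reviewed = reclassify(scores, high=1.8, low=1.2,
                      manual_labels={("a2", "b2"): "non-match"})
```

Keeping the similarity scores separate from the labels means the thresholds can be tuned without recomputing any similarity.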
Because systems are independent of one another, data formats, field definitions, and the way content is expressed differ between systems. Communication between systems is therefore hampered, and the same object cannot be uniquely identified. The present invention accordingly proposes a new data matching scheme based on a comparison template, which can quickly identify matching pairs in large volumes of data and quickly match the different representations of the same object, so that data in different systems can be associated with one another.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A data matching system based on a comparison template, characterized by comprising:
A blocking unit, configured to receive data from different domains and to partition the data into blocks according to a set index entry, the index entry comprising one or more fields of the data;
A comparison unit, configured to obtain matching pairs for each data block, each matching pair comprising two data records, and to calculate a similarity for each matching pair according to the rules of the comparison template;
A classification unit, configured to determine the matching relationship of the two data records in a matching pair according to a preset similarity threshold.
2. The data matching system based on a comparison template according to claim 1, characterized by further comprising: a data cleansing unit, configured to process the data according to a preset data format so that the data conforms to a predetermined format;
The comparison unit comprises: an obtaining subunit, configured to form, for each data block, matching pairs from the data having the same value of the index entry.
3. The data matching system based on a comparison template according to claim 1 or 2, characterized in that the comparison unit comprises a calculation subunit, configured to calculate, for a common field of the two data records, the similarity value of the content corresponding to that common field in the two data records, and to determine the similarity according to the similarity value of the content corresponding to the common field.
4. The data matching system based on a comparison template according to claim 3, characterized in that the calculation subunit is further configured to, when the two data records have a plurality of common fields, take the sum of the similarity values corresponding to the common fields as the similarity of the two data records.
5. The data matching system based on a comparison template according to claim 1, characterized in that the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold;
The classification unit is further configured to: when the similarity of the two data records is greater than or equal to the first threshold, determine that the relationship between the two data records is a matching relationship and generate a unique identifier for associating the two data records; when the similarity of the two data records is less than the first threshold and greater than the second threshold, determine that the relationship between the two data records is a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, determine that the relationship between the two data records is a non-matching relationship.
6. A data matching method based on a comparison template, characterized by comprising:
Receiving data from different domains, and partitioning the data into blocks according to a set index entry, the index entry comprising one or more fields of the data;
Obtaining matching pairs for each data block, each matching pair comprising two data records;
Calculating a similarity for each matching pair according to the rules of the comparison template, and determining the matching relationship of the two data records in a matching pair according to a preset similarity threshold.
7. The data matching method based on a comparison template according to claim 6, characterized by further comprising, before the data is partitioned into blocks: processing the data according to a preset data format so that the data conforms to a predetermined format;
The step of obtaining matching pairs for each data block specifically comprises: for each data block, forming matching pairs from the data having the same value of the index entry.
8. The data matching method based on a comparison template according to claim 6 or 7, characterized in that the step of calculating a similarity for each matching pair according to the rules of the comparison template specifically comprises: for a common field of the two data records, calculating the similarity value of the content corresponding to that common field in the two data records, and determining the similarity according to the similarity value of the content corresponding to the common field.
9. The data matching method based on a comparison template according to claim 8, characterized in that, when the two data records have a plurality of common fields, the sum of the similarity values corresponding to the common fields is taken as the similarity of the two data records.
10. The data matching method based on a comparison template according to claim 8, characterized in that the similarity threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold;
When the similarity of the two data records is greater than or equal to the first threshold, the relationship between the two data records is determined to be a matching relationship and a unique identifier for associating the two data records is generated; when the similarity of the two data records is less than the first threshold and greater than the second threshold, the relationship between the two data records is determined to be a suspected relationship; and when the similarity of the two data records is less than or equal to the second threshold, the relationship between the two data records is determined to be a non-matching relationship.
CN201310456767.0A 2013-09-29 2013-09-29 Data matching system and method based on a comparison template Active CN103530334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310456767.0A CN103530334B (en) 2013-09-29 2013-09-29 Data matching system and method based on a comparison template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310456767.0A CN103530334B (en) 2013-09-29 2013-09-29 Data matching system and method based on a comparison template

Publications (2)

Publication Number Publication Date
CN103530334A true CN103530334A (en) 2014-01-22
CN103530334B CN103530334B (en) 2018-01-23

Family

ID=49932343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310456767.0A Active CN103530334B (en) 2013-09-29 2013-09-29 Data matching system and method based on a comparison template

Country Status (1)

Country Link
CN (1) CN103530334B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809141A (en) * 2014-01-29 2015-07-29 携程计算机技术(上海)有限公司 Matching system and method of hotel data
CN105096028A (en) * 2014-11-20 2015-11-25 北京航天金盾科技有限公司 Intelligent matching method of population data
CN106021526A (en) * 2016-05-25 2016-10-12 东软集团股份有限公司 News classification method and device
CN106681524A (en) * 2015-11-10 2017-05-17 阿里巴巴集团控股有限公司 Method and device for processing information
CN107103048A (en) * 2017-03-31 2017-08-29 苏州艾隆信息技术有限公司 Medicine information matching process and system
CN107193860A (en) * 2017-03-31 2017-09-22 苏州艾隆信息技术有限公司 Medicine information multidimensional identification method and system
CN107203686A (en) * 2017-03-31 2017-09-26 苏州艾隆信息技术有限公司 medicine information difference processing method and system
CN107291672A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 The treating method and apparatus of tables of data
CN108038504A (en) * 2017-12-11 2018-05-15 深圳房讯通信息技术有限公司 A kind of method for parsing property ownership certificate photo content
WO2018166343A1 (en) * 2017-03-13 2018-09-20 腾讯科技(深圳)有限公司 Data fusion method and device, storage medium and electronic device
CN108664497A (en) * 2017-03-30 2018-10-16 大有秦鼎(北京)科技有限公司 The method and apparatus of Data Matching
CN108920601A (en) * 2018-06-27 2018-11-30 中国联合网络通信集团有限公司 A kind of data matching method and device
CN109063178A (en) * 2018-08-22 2018-12-21 四川新网银行股份有限公司 A kind of method and device of the self-service analytical statement extended automatically
CN111737533A (en) * 2020-06-19 2020-10-02 东软集团股份有限公司 Processing method and device for inspection items, storage medium and equipment
CN112732703A (en) * 2021-03-23 2021-04-30 中国信息通信研究院 Metadata processing method, metadata processing apparatus, and readable storage medium
CN113434584A (en) * 2021-06-28 2021-09-24 国网北京市电力公司 Data processing method and device for power equipment and electronic equipment
CN113535943A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Medical record classification method and device and data record classification method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739414A (en) * 2008-11-25 2010-06-16 华中师范大学 Ontological concept mapping method
CN102542262A (en) * 2012-01-04 2012-07-04 东南大学 Waveform identification method based on operating-characteristic working condition waveform library of high-speed rail
CN103186427A (en) * 2011-12-31 2013-07-03 中国银联股份有限公司 System and method for analyzing data record set
EP2592575A3 (en) * 2011-11-08 2013-07-31 Comcast Cable Communications, LLC Content descriptor
CN103257961A (en) * 2012-02-15 2013-08-21 北大方正集团有限公司 Method, device and system of bibliography repeat removal

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739414A (en) * 2008-11-25 2010-06-16 华中师范大学 Ontological concept mapping method
EP2592575A3 (en) * 2011-11-08 2013-07-31 Comcast Cable Communications, LLC Content descriptor
CN103186427A (en) * 2011-12-31 2013-07-03 中国银联股份有限公司 System and method for analyzing data record set
CN102542262A (en) * 2012-01-04 2012-07-04 东南大学 Waveform identification method based on operating-characteristic working condition waveform library of high-speed rail
CN103257961A (en) * 2012-02-15 2013-08-21 北大方正集团有限公司 Method, device and system of bibliography repeat removal

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
THINKPHPER: ""大数据量的分表方法"", 《BLOG.SINA.COM.CN/S/BLOG_64492FE10100QI3I.HTML》 *
ZHAO HAO ET AL.: ""Adaptive threshold backtracking matching pursuit for compressive sensing"", 《IET INTERNATIONAL RADAR CONFERENCE 2013》 *
洪圆等: ""一种使用双阀值的数据仓库环境下重复记录消除算法"", 《计算机工程与应用》 *
陈波: ""征信系统中实体匹配方法及应用研究"", 《中国博士学位论文全文数据库 经济与管理科学辑》 *
齐为华: ""不同应用系统相关数据的匹配检测与借用"", 《2007年CAD/CAM学术交流会议论文集》 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809141A (en) * 2014-01-29 2015-07-29 携程计算机技术(上海)有限公司 Matching system and method of hotel data
CN105096028A (en) * 2014-11-20 2015-11-25 北京航天金盾科技有限公司 Intelligent matching method of population data
CN106681524A (en) * 2015-11-10 2017-05-17 阿里巴巴集团控股有限公司 Method and device for processing information
CN107291672A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 The treating method and apparatus of tables of data
CN106021526A (en) * 2016-05-25 2016-10-12 东软集团股份有限公司 News classification method and device
CN106021526B (en) * 2016-05-25 2019-09-27 东软集团股份有限公司 News category method and device
WO2018166343A1 (en) * 2017-03-13 2018-09-20 腾讯科技(深圳)有限公司 Data fusion method and device, storage medium and electronic device
CN108664497B (en) * 2017-03-30 2020-11-03 大有秦鼎(北京)科技有限公司 Data matching method and device
CN108664497A (en) * 2017-03-30 2018-10-16 大有秦鼎(北京)科技有限公司 The method and apparatus of Data Matching
CN107193860B (en) * 2017-03-31 2021-03-02 苏州艾隆信息技术有限公司 Medicine information multidimensional identification method and system
CN107193860A (en) * 2017-03-31 2017-09-22 苏州艾隆信息技术有限公司 Medicine information multidimensional identification method and system
CN107103048B (en) * 2017-03-31 2021-04-20 苏州艾隆信息技术有限公司 Medicine information matching method and system
CN107103048A (en) * 2017-03-31 2017-08-29 苏州艾隆信息技术有限公司 Medicine information matching process and system
CN107203686A (en) * 2017-03-31 2017-09-26 苏州艾隆信息技术有限公司 medicine information difference processing method and system
CN108038504A (en) * 2017-12-11 2018-05-15 深圳房讯通信息技术有限公司 A kind of method for parsing property ownership certificate photo content
CN108920601B (en) * 2018-06-27 2020-12-01 中国联合网络通信集团有限公司 Data matching method and device
CN108920601A (en) * 2018-06-27 2018-11-30 中国联合网络通信集团有限公司 A kind of data matching method and device
CN109063178B (en) * 2018-08-22 2019-12-24 四川新网银行股份有限公司 Method and device for automatically expanding self-help analysis report
CN109063178A (en) * 2018-08-22 2018-12-21 四川新网银行股份有限公司 A kind of method and device of the self-service analytical statement extended automatically
CN113535943A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Medical record classification method and device and data record classification method and device
CN111737533A (en) * 2020-06-19 2020-10-02 东软集团股份有限公司 Processing method and device for inspection items, storage medium and equipment
CN111737533B (en) * 2020-06-19 2024-02-09 东软集团股份有限公司 Method, device, storage medium and equipment for processing inspection items
CN112732703A (en) * 2021-03-23 2021-04-30 中国信息通信研究院 Metadata processing method, metadata processing apparatus, and readable storage medium
CN113434584A (en) * 2021-06-28 2021-09-24 国网北京市电力公司 Data processing method and device for power equipment and electronic equipment
CN113434584B (en) * 2021-06-28 2022-10-14 国网北京市电力公司 Data processing method and device for power equipment and electronic equipment

Also Published As

Publication number Publication date
CN103530334B (en) 2018-01-23

Similar Documents

Publication Publication Date Title
CN103530334A (en) System and method for data matching based on comparison module
CN103473375A (en) Data cleaning method and data cleaning system
CN103473373A (en) Threshold matching model-based similarity analysis system and threshold matching model-based similarity analysis method
US20150356128A1 (en) Index key generating device, index key generating method, and search method
US9177020B2 (en) Gathering index statistics using sampling
JP2013536492A (en) Data analysis using multiple systems
CN108038130A (en) Automatic cleaning method, device, equipment and the storage medium of fictitious users
CN103714086A (en) Method and device used for generating non-relational data base module
WO2022222942A1 (en) Method and apparatus for generating question and answer record, electronic device, and storage medium
Ji et al. Anthropometry and classification of auricular concha for the ergonomic design of earphones
US20190108270A1 (en) Data convergence
CN110909168A (en) Knowledge graph updating method and device, storage medium and electronic device
CN111160855A (en) Report sheet automatic auditing method, device, equipment and storage medium
CN113111063A (en) Medical patient main index discovery method applied to multiple data sources
CN113743477A (en) Histogram data publishing method based on differential privacy
CN111640517B (en) Medical record coding method and device, storage medium and electronic equipment
CN109346146A (en) Checking prescription distribution method, device, electronic equipment and storage medium
CN106961508A (en) Communication means and device based on Sex criminals
CN116150632A (en) Internet of things equipment identification method based on local sensitive hash in intelligent home
CN112163127B (en) Relationship graph construction method and device, electronic equipment and storage medium
CN110175220B (en) Document similarity measurement method and system based on keyword position structure distribution
CN112991131A (en) Government affair data processing method suitable for electronic government affair platform
CN108846543B (en) Computing method and device for non-overlapping community set quality metric index
CN111209284A (en) Metadata-based table dividing method and device
CN113449102A (en) Text clustering method, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: PKU HEALTHCARE IT CO., LTD.

Free format text: FORMER OWNER: FOUNDER INTERNATIONAL CO., LTD.

Effective date: 20150203

Free format text: FORMER OWNER: FOUNDER INTERNATIONAL (BEIJING) CO., LTD.

Effective date: 20150203

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 215123 SUZHOU, JIANGSU PROVINCE TO: 100080 HAIDIAN, BEIJING

TA01 Transfer of patent application right

Effective date of registration: 20150203

Address after: 100080, No. 19, No. 52 West Fourth Ring Road, Beijing, Haidian District

Applicant after: Peking University Medical Information Technology Co.,Ltd.

Address before: Suzhou City, Jiangsu Province, Suzhou Industrial Park 215123 Xinghu Street No. 328 Creative Industry Park founder International Building

Applicant before: FOUNDER INTERNATIONAL Co.,Ltd.

Applicant before: Founder International Co.,Ltd. (Beijing)

GR01 Patent grant
PP01 Preservation of patent right

Effective date of registration: 20240202

Granted publication date: 20180123

PP01 Preservation of patent right