WO2017113886A1 - Data cleaning method and apparatus - Google Patents

Data cleaning method and apparatus

Info

Publication number
WO2017113886A1
Authority
WO
WIPO (PCT)
Prior art keywords
title
field
fields
similarity
sim
Prior art date
Application number
PCT/CN2016/098771
Other languages
English (en)
French (fr)
Inventor
蒋瑜
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2017113886A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • the present invention relates to data cleaning technologies, and in particular, to a data cleaning method and apparatus.
  • the accuracy of the data is the basic condition for various data analysis.
  • the purpose of data cleansing is to detect erroneous data in the data, and to eliminate or correct the erroneous data to improve the accuracy and quality of the data.
  • Common data errors include null values, values out of bounds, and so on.
  • A common data cleaning method is programmed data cleaning based on a domain-specific language. Specifically, each time data cleaning is performed on a form, a developer formulates cleaning rules for the erroneous data in that form, determines a specific cleaning algorithm according to those rules, and writes a data cleaning program that implements the algorithm; the program then detects and corrects the data automatically.
  • The embodiments of the invention provide a data cleaning method and apparatus that overcome the inefficiency, lack of universality, and poor ease of use of existing data cleaning methods.
  • An embodiment of the present invention provides a data cleaning method, including:
  • selecting, in the history form library, a history form having the same description object as the current form, where the current form contains m title fields, the history form contains n title fields, and m and n are positive integers;
  • performing data cleaning on the data that does not meet the constraint condition among the data corresponding to the i-th title field.
  • The constraint condition of the title field of the history form is adaptively applied to the title field of the current form, and data cleaning is performed on the data corresponding to the title field of the current form based on that constraint condition.
  • Developers therefore do not need to write and maintain cleaning-algorithm code each time data is cleaned, which lowers the threshold of use, gives the method wide applicability, and reduces the intensity of manual data cleaning.
  • The method also enables automatic cleaning of big data in a database, which improves the efficiency and accuracy of data cleaning and the accuracy and reliability of the data source.
  • determining that the i-th title field and the j-th title field match according to a preset matching rule includes:
  • if the similarity SIM(i, j) is not greater than the first preset value and m and n are greater than 1, k title fields are determined according to the preset field association relationship of the current form, and for the s-th title field among the k title fields, the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields is determined;
  • if the corrected similarity SIM_0(i, j) is greater than the first preset value, it is determined that the i-th title field among the m title fields matches the j-th title field among the n title fields.
  • determining that the i-th title field and the j-th title field match according to a preset matching rule includes:
  • if the similarity SIM(i, j) is not greater than the first preset value and m and n are greater than 1, k title fields are determined according to the preset field association relationship of the current form, and for the s-th title field among the k title fields, the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields is determined;
  • if the corrected similarity SIM_0(i, j) is not greater than the first preset value, SIM_0(i, j) is corrected Y more times in succession to obtain SIM_{0+Y}(i, j); if SIM_{0+Y}(i, j) is greater than the first preset value, it is determined that the i-th title field among the m title fields matches the j-th title field among the n title fields;
  • where SIM_{0+y-1}(i, j) is corrected according to the similarity SIM_s by the second preset algorithm to obtain SIM_{0+y}(i, j), y takes every positive integer not greater than Y, and Y is a preset value.
  • the second preset algorithm is a formula as shown below:
  • SIM is the similarity to be corrected
  • SIM* is the corrected similarity
  • a is the preset weight coefficient
  • Correcting the similarity between title fields yields a more accurate similarity, so that more matching title fields can be determined, more constraint conditions can be obtained, and the efficiency of data cleaning is improved.
  • calculating a similarity between each of the m title fields and each of the n title fields according to the first preset algorithm includes:
  • obtaining, according to the third-party knowledge base, the similarity between each of the m title fields in the current form and each of the n title fields in the history form.
  • obtaining the similarity between each of the m title fields and each of the n title fields.
  • max 1 represents the maximum value of the i-th title field
  • min 1 represents the minimum value of the i-th title field
  • max 2 represents the maximum value of the j-th title field
  • min 2 represents the minimum value of the j-th title field.
  • the method further includes:
  • a data cleaning apparatus including:
  • the history form obtaining module is configured to select a history form having the same description object as the current form in the history form library, where the current form contains m title fields, and the history form contains n title fields, wherein m and n are positive integers;
  • a similarity calculation module configured to calculate, according to the first preset algorithm, a similarity between each of the m title fields acquired by the history form acquisition module and each of the n title fields;
  • a matching module configured to, for any similarity SIM(i, j) calculated by the similarity calculation module, obtain the constraint condition of the j-th title field if it is determined according to a preset matching rule that the i-th title field and the j-th title field match; where i denotes the i-th title field among the m title fields, j denotes the j-th title field among the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n;
  • the data cleaning module is configured to perform data cleaning on data that does not meet the constraint condition obtained by the matching module in the data corresponding to the i-th title field.
  • the matching module is specifically configured to:
  • for any similarity SIM(i, j) calculated by the similarity calculation module, if SIM(i, j) is greater than the first preset value, determine that the i-th title field among the m title fields matches the j-th title field among the n title fields, and obtain the constraint condition of the j-th title field.
  • the matching module is specifically configured to:
  • for any similarity SIM(i, j) calculated by the similarity calculation module, if SIM(i, j) is not greater than the first preset value and m and n are greater than 1, determine k title fields according to the preset field association relationship of the current form;
  • for the s-th title field among the k title fields, determine the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field as determined by the preset field association relationship of the current form, and k is less than m;
  • if the corrected similarity SIM_0(i, j) is greater than the first preset value, determine that the i-th title field among the m title fields matches the j-th title field among the n title fields.
  • the matching module is specifically configured to:
  • for any similarity SIM(i, j) calculated by the similarity calculation module, if SIM(i, j) is not greater than the first preset value and m and n are greater than 1, determine k title fields according to the preset field association relationship of the current form;
  • for the s-th title field among the k title fields, determine the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field as determined by the preset field association relationship of the current form, and k is less than m;
  • if the corrected similarity SIM_0(i, j) is not greater than the first preset value, correct SIM_0(i, j) Y more times in succession to obtain SIM_{0+Y}(i, j); if SIM_{0+Y}(i, j) is greater than the first preset value, determine that the i-th title field among the m title fields matches the j-th title field among the n title fields;
  • where SIM_{0+y-1}(i, j) is corrected according to the similarity SIM_s by the second preset algorithm to obtain SIM_{0+y}(i, j), y takes every positive integer not greater than Y, and Y is a preset value.
  • the second preset algorithm is a formula as shown below:
  • SIM is the similarity to be corrected
  • SIM* is the corrected similarity
  • a is the preset weight coefficient
  • the similarity calculation module includes:
  • a first calculating unit configured to obtain the similarity between each of the m title fields and each of the n title fields according to the degree of coincidence between the title field names of the m title fields in the current form and the title field names of the n title fields in the history form; or
  • a second calculating unit configured to obtain the similarity between each of the m title fields and each of the n title fields from the title field names of the m title fields in the current form and the title field names of the n title fields in the history form, according to the third-party knowledge base; or
  • a third calculating unit configured to obtain the similarity between each of the m title fields and each of the n title fields according to the degree of coincidence between the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form.
  • the third calculating unit is specifically configured to:
  • max 1 represents the maximum value of the i-th title field
  • min 1 represents the minimum value of the i-th title field
  • max 2 represents the maximum value of the j-th title field
  • min 2 represents the minimum value of the j-th title field.
  • the device further includes:
  • a storage module that stores the current form and the constraints of the current form into the history form library.
  • Another aspect of the embodiment of the present invention further provides a data cleaning apparatus, including: a memory, a processor, and a bus, wherein the memory and the processor are respectively connected to the bus, wherein:
  • the memory is used to store data and store program code
  • FIG. 1 is a schematic flowchart of Embodiment 1 of a data cleaning method according to the present invention.
  • FIG. 2 is a schematic diagram of a form scenario of a second embodiment of a data cleaning method according to the present invention.
  • FIG. 3 is a schematic structural diagram of Embodiment 1 of a data cleaning device according to the present invention.
  • FIG. 4 is a schematic structural view of a data cleaning device of the present invention.
  • The embodiments of the present invention take into account that the form library stores a plurality of history forms on which data cleaning has already been performed. When the current form to be cleaned is consistent in content with a history form, the current form can be cleaned by referring to the data cleaning constraints of that history form. On this basis, a data cleaning method is proposed that can be widely applied to data stored in the form of forms.
  • the embodiment of the invention provides a data cleaning method for automatically cleaning data for a large amount of data stored in a database, discovering possible erroneous data and eliminating or correcting erroneous data.
  • A form stores multiple columns of data, and a column usually consists of a title field and the data corresponding to that title field.
  • Table 1 shows a common form.
  • the form shown in Table 1 includes a plurality of title fields whose title field names are "name", “ID”, "gender”, and the like. Each title field has different attributes according to the corresponding data, and there are different associations between the title fields.
  • the constraint condition includes the attribute of the title field or the association relationship between the title fields.
  • For example, the data corresponding to the title field "ID" is unique, that is, the duplicate value "4" should not appear among the data corresponding to the title field "ID";
  • the title field "city" has a specific value range, that is, the value "degree" should not appear in the title field "city" because it lies outside that range;
  • the title field "city" has a one-to-one correspondence with the title field "area code", so the pairing in Table 1 of "Chengdu" in the title field "city" with "029" in the title field "area code" is incorrect.
  • Table 1 exemplarily identifies several possible erroneous data.
  • the embodiment of the present invention provides a data cleaning method for the problem of erroneous data in a large amount of data as described above.
  • The method is used for data cleaning of a form, and the form to be cleaned is referred to as the current form.
  • The method first obtains a history form similar to the current form and determines which title field in the history form matches a title field of the current form.
  • It then performs data cleaning on the matching title field of the current form according to the constraint condition of the matched title field in the history form, removing data that does not meet the constraint condition.
  • The method aims to solve the problem that prior-art data cleaning methods lack universality and ease of use.
  • FIG. 1 is a schematic flowchart diagram of Embodiment 1 of a data cleaning method according to the present invention.
  • the execution body of this embodiment is a data cleaning device, and the device can be disposed in the processor.
  • the method in this embodiment may include:
  • Step 101 Select a history form having the same description object as the current form in the history form library, where the current form contains m title fields, and the history form contains n title fields, where m and n are positive integers;
  • Step 102 Calculate a similarity between each of the m title fields and each of the n title fields according to the first preset algorithm.
  • Step 103 For any similarity SIM(i, j), if it is determined according to a preset matching rule that the i-th title field and the j-th title field match, the constraint condition of the j-th title field is obtained; where i denotes the i-th title field among the m title fields, j denotes the j-th title field among the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n;
  • Step 104 Perform data cleaning on data that does not meet the constraint condition in the data corresponding to the i-th title field.
  • Any one of the m title fields in the current form is denoted as the i-th title field, and any one of the n title fields in the history form is denoted as the j-th title field, where m and n are positive integers, i takes every natural number not greater than m, and j takes every natural number not greater than n.
  • The similarity SIM(i, j) denotes the similarity between the i-th title field and the j-th title field.
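  • Steps 102 to 104 can be sketched as a small loop, assuming step 101 has already selected the history form. This is only an illustration: the `Form` and `Constraint` types, the function `clean_form`, the default threshold and the sample data (including the misspelling "femal") are hypothetical and do not come from the application.

```python
from typing import Callable, Dict, List

# Hypothetical types: a form maps each title field name to its column of values;
# a constraint is a predicate that returns True for acceptable values.
Form = Dict[str, List[object]]
Constraint = Callable[[object], bool]


def clean_form(current: Form,
               history_constraints: Dict[str, Constraint],
               similarity: Callable[[str, str], float],
               first_preset_value: float = 0.9) -> Dict[str, List[object]]:
    """Return, per current title field, the values that violate a borrowed constraint."""
    violations: Dict[str, List[object]] = {}
    for field_i, values in current.items():                       # i over the m fields
        for field_j, constraint in history_constraints.items():   # j over the n fields
            # Steps 102/103: compute SIM(i, j) and apply the simplest matching rule.
            if similarity(field_i, field_j) > first_preset_value:
                # Step 104: collect data that does not meet the constraint.
                bad = [v for v in values if not constraint(v)]
                if bad:
                    violations.setdefault(field_i, []).extend(bad)
    return violations


# Made-up example data for illustration only.
current_form = {"gender": ["male", "femal", "unknown"]}
constraints = {"gender": lambda v: v in {"male", "female", "unknown"}}
exact_name = lambda a, b: 1.0 if a == b else 0.0
result = clean_form(current_form, constraints, exact_name)
```

  • The sketch only reports violating values; whether they are deleted or corrected is left to the caller.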
  • In step 101, for the current form, a history form having the same description object as the current form is first obtained, by comparison or query, from the plurality of history forms stored in the history form library.
  • The history form library may contain a history form that has the same description object as the current form, that is, one that contains the same title fields. The more title fields the two forms share and the fewer title fields that differ, the more similar the current form is to the history form. Based on this similarity principle, the history form having the same description object as the current form can be filtered out.
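  • One way to turn this similarity principle into a score is the share of common title fields among all distinct title fields of the two forms. The sketch below assumes this Jaccard-style score; the function names and the sample library are hypothetical, and the application does not prescribe a particular measure.

```python
from typing import Dict, Optional, Set


def form_overlap(current_fields: Set[str], history_fields: Set[str]) -> float:
    """Shared title fields divided by all distinct title fields of both forms."""
    union = current_fields | history_fields
    return len(current_fields & history_fields) / len(union) if union else 0.0


def select_history_form(current_fields: Set[str],
                        library: Dict[str, Set[str]]) -> Optional[str]:
    """Return the name of the history form with the highest overlap, or None."""
    best = max(library,
               key=lambda name: form_overlap(current_fields, library[name]),
               default=None)
    if best is None or form_overlap(current_fields, library[best]) == 0.0:
        return None
    return best


# Made-up history form library for illustration only.
library = {
    "employees": {"name", "ID", "gender", "city", "area code"},
    "products": {"sku", "price"},
}
chosen = select_history_form({"name", "ID", "gender", "monthly income"}, library)
```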
  • In step 102, the similarity between each of the m title fields in the current form and each of the n title fields in the filtered history form is calculated.
  • The similarity SIM(i, j) may be obtained by any one of the following implementation manners or by a combination of several of them.
  • In the first implementation manner, the similarity between each of the m title fields and each of the n title fields is obtained according to the degree of coincidence between the title field names in the current form and the title field names in the history form.
  • When the title field names of two title fields are identical, the two title fields may be considered to coincide completely, and their similarity is 1.
  • When the name of a title field in the current form has no overlap with the name of the title field "monthly income" in the history form, the similarity of the two title fields can be considered to be 0.
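  • The application does not say how the degree of coincidence between names is measured. As a minimal sketch, assuming word-level overlap of the two names (a hypothetical choice), identical names score 1 and names with no word in common score 0:

```python
def name_similarity(name_a: str, name_b: str) -> float:
    """Overlap of the word sets of two title field names, in [0, 1]."""
    words_a = set(name_a.lower().split())
    words_b = set(name_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

  • For names written without word separators, the same idea could be applied to characters instead of words.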
  • In the second implementation manner, the similarity between each of the m title fields in the current form and each of the n title fields in the history form is obtained from their title field names according to the third-party knowledge base.
  • For example, if the third-party knowledge base stores the field names "name" and "name" as synonyms, the field with the field name "name" in the current form and the field with the field name "name" in the history form can be considered to have a similarity of 1.
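  • A knowledge-base lookup of this kind can be sketched with an in-memory list of synonym groups. The groups below are a hypothetical stand-in for a third-party knowledge base, and the 0-or-1 scoring for non-synonyms is an assumption.

```python
from typing import FrozenSet, List

# Hypothetical stand-in for a third-party knowledge base of synonymous field names.
SYNONYM_GROUPS: List[FrozenSet[str]] = [
    frozenset({"career", "occupation", "profession"}),
    frozenset({"work city", "work place"}),
]


def synonym_similarity(name_a: str, name_b: str) -> float:
    """Return 1.0 for identical or synonymous field names, otherwise 0.0."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    if a == b:
        return 1.0
    return 1.0 if any(a in group and b in group for group in SYNONYM_GROUPS) else 0.0
```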
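  • In the third implementation manner, the similarity between each of the m title fields and each of the n title fields is obtained according to the degree of coincidence between the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form.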
  • the feasible implementation manner includes the following two cases:
  • In the first case, both title fields are discrete, and the similarity is calculated with formula (1). A discrete title field in the current form and a non-discrete title field in the history form can be considered to have a similarity of 0 without using formula (1). According to formula (1), the more field values the two title fields have in common, the higher the similarity.
  • For example, common discrete title fields include "City" and "Education".
  • For example, the value set of the title field "work city" includes Beijing, Shanghai, and Shenzhen,
  • and the value set of another title field "work place" includes Beijing, Shanghai, Shenzhen, and Tianjin; the similarity of the two title fields is calculated by formula (1) above.
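  • Formula (1) is not reproduced in this text. The sketch below therefore uses the Jaccard index of the two value sets as an assumed stand-in that has the stated property (more shared values give a higher similarity); the value it produces need not equal the result of formula (1).

```python
from typing import Iterable


def discrete_similarity(values_a: Iterable[str], values_b: Iterable[str]) -> float:
    """Assumed stand-in for formula (1): shared values over all distinct values."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


work_city = ["Beijing", "Shanghai", "Shenzhen"]
work_place = ["Beijing", "Shanghai", "Shenzhen", "Tianjin"]
sim = discrete_similarity(work_city, work_place)  # 3 shared values, 4 distinct values
```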
  • max 1 represents the maximum value of the i-th title field
  • min 1 represents the minimum value of the i-th title field
  • max 2 represents the maximum value of the j-th title field
  • min 2 represents the minimum value of the j-th title field.
  • For example, common continuous title fields include "age" and "salary".
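  • Formula (2) for continuous title fields is likewise not reproduced here. As an assumed stand-in built from the same four quantities (max 1, min 1, max 2, min 2), the sketch below divides the length of the overlap of the two value ranges by the length of their combined range.

```python
def continuous_similarity(min1: float, max1: float,
                          min2: float, max2: float) -> float:
    """Assumed stand-in for formula (2): range overlap over combined range."""
    overlap = min(max1, max2) - max(min1, min2)
    span = max(max1, max2) - min(min1, min2)
    if span <= 0:
        # Both ranges collapse to the same single value.
        return 1.0
    return max(overlap, 0.0) / span
```

  • With this stand-in, identical ranges give 1 and disjoint ranges give 0, which keeps the result within the 0 to 1 range.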
  • the similarity ranges from 0 to 1.
  • A similarity of 0 indicates that the two title fields, one from the current form and one from the history form, share no attributes or association relationships; such a similarity can be regarded as invalid.
  • The similarities obtained by the different implementation manners may be combined according to a preset weight for each implementation manner to obtain a more accurate similarity.
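  • A weighted combination of this kind can be sketched as a normalized weighted average. The normalization by the sum of the weights is an assumption; the application only states that preset weights are used.

```python
from typing import Sequence


def combined_similarity(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of the similarities from several implementation manners."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and of equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total
```

  • Normalizing keeps the result between 0 and 1 whenever every input similarity is in that range and the weights are non-negative.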
  • In step 103, for any similarity SIM(i, j) among the similarities obtained in step 102, if it is determined according to the preset matching rule that the i-th title field and the j-th title field match, the constraint condition of the j-th title field is obtained.
  • A constraint can be constructed for a form; it usually includes title field attributes and association relationships between title fields.
  • the title field attribute may be one or more of the following: reliability, uniqueness, label, field synonym, range of values, and the like.
  • the association relationship may also be one or more of the following: correlation, order preservation, one-to-one mapping, and the like.
  • the constraint mainly defines the range of values of each title field in the form, and the relationship between the data corresponding to multiple title fields. The data that does not meet the constraint condition can be regarded as the error data that needs to be cleaned up.
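  • Three of the constraint kinds mentioned above (uniqueness, value range, one-to-one mapping) can be checked with a few lines each. The sketch below is illustrative only: the function names and the sample rows are made up in the spirit of Table 1, which is not reproduced in this text.

```python
from collections import Counter
from typing import Dict, List, Set


def violates_unique(values: List[str]) -> List[int]:
    """Row indices whose value occurs more than once in the column."""
    counts = Counter(values)
    return [i for i, v in enumerate(values) if counts[v] > 1]


def violates_range(values: List[str], allowed: Set[str]) -> List[int]:
    """Row indices whose value lies outside the allowed value range."""
    return [i for i, v in enumerate(values) if v not in allowed]


def violates_mapping(keys: List[str], vals: List[str],
                     mapping: Dict[str, str]) -> List[int]:
    """Row indices where a known key is paired with the wrong value."""
    return [i for i, (k, v) in enumerate(zip(keys, vals))
            if k in mapping and mapping[k] != v]


# Made-up rows for illustration only.
ids = ["1", "2", "4", "4"]
cities = ["Beijing", "Chengdu", "degree", "Shanghai"]
area_codes = ["010", "029", "", "021"]
city_range = {"Beijing", "Chengdu", "Shanghai"}
city_to_code = {"Beijing": "010", "Chengdu": "028", "Shanghai": "021"}
```

  • Each function returns the offending row positions, so a caller can decide whether to delete the value or propose a correction.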
  • The history forms in the history form library each have their own constraints.
  • If two title fields, one from the history form and one from the current form, are matching title fields, they should be considered to have the same or similar title field attributes and association relationships between title fields.
  • In other words, the data corresponding to the two title fields satisfy the same constraint condition. Therefore, after it is determined that the i-th title field and the j-th title field match, the constraint condition of the j-th title field can be obtained.
  • In step 104, since the i-th title field and the j-th title field match, the i-th title field is considered to conform to the constraint condition of the j-th title field, so the constraint condition of the j-th title field can be used directly to check the data corresponding to the i-th title field.
  • Among all the data corresponding to the i-th title field, the data that does not meet the constraint condition is identified and cleaned. The more pairs of title fields are determined to match, the more thoroughly the title fields in the current form can be cleaned.
  • the cleaning process includes deleting the error value and providing the correction value.
  • For example, the value range of a title field may be "male, female, unknown";
  • data outside this value range is considered erroneous and needs to be cleaned up.
  • When data is modified or replaced during the data cleaning of step 104, the source data to be modified and the proposed replacement data may be shown to the operator on a display screen; the correction is made once the operator confirms it, and no correction is made if the operator rejects it. Adding this confirmation step can improve the accuracy of data cleaning.
  • A third-party knowledge base or an expert knowledge base may be invoked to query synonyms and association extensions of the title field and of the data corresponding to the title field.
  • The title field attributes and the association relationships between title fields, that is, the constraint conditions, may be entered directly by hand;
  • or the data cleaning device may automatically match title fields against the preset title field attributes and association relationships already stored in the expert knowledge base, configure the title fields of the current form with the constraint conditions stored in the expert knowledge base, and clean the data.
  • the data cleaning method provided by the embodiment of the present invention further includes:
  • storing the cleaned current form and the constraint conditions of the current form in the history form library, thereby expanding the history form library for subsequent data cleaning.
  • The data cleaning method provided by the embodiments of the present invention uses a history form in the history form library that has the same description object as the current form, and applies the constraint conditions of the title fields of that history form to the title fields of the current form.
  • Data cleaning of the data corresponding to the title fields of the current form is performed based on those constraint conditions, so developers do not need to write and maintain cleaning-algorithm code each time data is cleaned. This lowers the threshold of use, gives the method wide applicability, and reduces the intensity of manual data cleaning.
  • The method also enables automatic cleaning of big data in a database, which improves the efficiency and accuracy of data cleaning and the accuracy and reliability of the data source.
  • FIG. 2 is a schematic diagram of a form scenario of a second embodiment of a data cleaning method according to the present invention.
  • a current form to which the data cleaning method of the present invention is applied and a history form having the same description object that has been filtered are schematically shown in FIG. 2, and the similarity between the partial title fields in the current form and the history form is indicated.
  • FIG. 2 also shows the relationship between the partial preset title fields of the current form and the relationship between the partial title fields of the history form.
  • the specific implementation manner of the preset matching rule may be any one of the following implementation manners.
  • In feasible implementation manner 1, if the similarity SIM(i, j) is greater than the first preset value, it is determined that the i-th title field matches the j-th title field. For example, the similarity between the "career" title field in the current form and the "career" title field in the history form is 1, and the first preset value is 0.9. Since the similarity 1 is greater than the first preset value 0.9, the two title fields are determined to be matching title fields with similar title field attributes and association relationships between title fields; that is, the "career" title field in the current form matches the "career" title field in the history form, and both should have the same constraints.
  • For example, when the "career" title field in the history form has the similar title field "work", the "career" title field in the current form also has the similar title field "work".
  • Data cleaning of the current form is achieved by applying the constraints of the "career" title field of the history form directly to the "career" title field in the current form.
  • The first preset value ranges from 0 to 1; it can be set in advance, or it can be modified as appropriate during the matching process.
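  • Feasible implementation manner 1 amounts to a threshold filter over the similarity table. In the sketch below, the table layout (a dictionary keyed by field-name pairs) and the 0.8 sample value are illustrative assumptions; only the 0.9 threshold and the similarity of 1 come from the example above.

```python
from typing import Dict, List, Tuple


def direct_matches(sim: Dict[Tuple[str, str], float],
                   first_preset_value: float = 0.9) -> List[Tuple[str, str]]:
    """Pairs (current field, history field) whose similarity exceeds the threshold."""
    return sorted(pair for pair, s in sim.items() if s > first_preset_value)


# Made-up similarity table for illustration only.
similarities = {
    ("career", "career"): 1.0,
    ("monthly income", "monthly salary"): 0.8,
}
matched = direct_matches(similarities)
```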
  • Feasible implementation manner 2 is as follows:
  • if the similarity SIM(i, j) is not greater than the first preset value and m and n are greater than 1, k title fields are determined according to the preset field association relationship of the current form, and for the s-th title field among the k title fields, the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields is determined;
  • if the corrected similarity SIM_0(i, j) is greater than the first preset value, it is determined that the i-th title field among the m title fields matches the j-th title field among the n title fields.
  • When the similarity SIM(i, j) is determined to be smaller than the first preset value, the i-th title field and the j-th title field cannot be directly determined to match.
  • When m and n are greater than 1, that is, when both the current form and the history form contain multiple title fields, the similarities of other title fields that have an association relationship with the title fields corresponding to SIM(i, j) may be used
  • to correct the similarity SIM(i, j) and obtain a more accurate similarity.
  • If the corrected similarity SIM_0(i, j) is greater than the first preset value, the i-th title field and the j-th title field match.
  • The first preset value may be a fixed value, or it may be changed to another value as appropriate while the similarity is being corrected.
  • Before the similarity SIM(i, j) is compared with the first preset value, the method further includes first determining that SIM(i, j) is greater than 0.
  • For an i-th title field and a j-th title field whose similarity SIM(i, j) is 0, it can be directly determined that the two title fields do not match, and no correction is needed.
  • the method for correcting the similarity SIM(i, j) is:
  • the second preset algorithm is the formula (3) shown below:
  • SIM is the similarity to be corrected
  • SIM* is the corrected similarity
  • a is the preset weight coefficient
  • All similarities SIM(i, j) smaller than the first preset value may be sorted and corrected one after another in descending order of similarity.
  • Obtaining the corrected similarities raises the proportion of title fields in the current form that match title fields in the history form, so that the current form is cleaned more thoroughly.
  • For example, the corrected similarity is 0.817.
  • If the first preset value is left unchanged, the title field "monthly income" and the title field "monthly salary" are still not considered to match.
  • If, during the correction process, the first preset value is modified to 0.81, the title field "monthly income" and the title field "monthly salary" are considered to match, and the constraints related to the title field "monthly salary" in the history form can be applied to the current form for data cleaning.
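  • The correction step can be sketched in two parts: computing SIM_s for each associated title field, then applying a correction rule. Formula (3) is not reproduced in this text, so `corrected` below is a hypothetical stand-in (a blend of SIM with the mean of the SIM_s values, weighted by a); it is not the formula of the application, and it does not reproduce the 0.817 of the example. The sample numbers are made up.

```python
from typing import Dict, List


def max_similarities(associated: List[str],
                     sim_table: Dict[str, Dict[str, float]]) -> List[float]:
    """SIM_s for each associated field s: its best similarity over the n history fields."""
    return [max(sim_table[s].values()) for s in associated]


def corrected(sim_ij: float, sim_s: List[float], a: float = 0.5) -> float:
    """Hypothetical stand-in for formula (3): a*SIM + (1 - a)*mean(SIM_s)."""
    if not sim_s:
        return sim_ij
    return a * sim_ij + (1 - a) * sum(sim_s) / len(sim_s)


# Made-up similarities of the fields associated with "monthly income".
sim_table = {
    "career": {"career": 1.0, "monthly salary": 0.1},
    "city": {"career": 0.0, "work place": 0.9},
}
sim_s = max_similarities(["career", "city"], sim_table)
sim_0 = corrected(0.7, sim_s, a=0.5)
```

  • The intuition is that strong matches among the associated title fields lend support to a borderline pair; any rule with that property could be substituted for the stand-in.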
  • Feasible implementation manner 3 is as follows:
  • if the similarity SIM(i, j) is not greater than the first preset value and m and n are greater than 1, k title fields are determined according to the preset field association relationship of the current form, and for the s-th title field among the k title fields, the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields is determined;
  • if the corrected similarity SIM_0(i, j) is not greater than the first preset value, SIM_0(i, j) is corrected Y more times in succession to obtain SIM_{0+Y}(i, j); if SIM_{0+Y}(i, j) is greater than the first preset value, it is determined that the i-th title field among the m title fields matches the j-th title field among the n title fields;
  • where SIM_{0+y-1}(i, j) is corrected according to the similarity SIM_s by the second preset algorithm to obtain SIM_{0+y}(i, j), y takes every positive integer not greater than Y, and Y is a preset value.
  • Any similarity SIM(i, j) may first be corrected once to obtain the corrected similarity SIM_0(i, j); the specific correction method is the same as in feasible implementation manner 2 and is not repeated here.
  • When the corrected similarity SIM_0(i, j) is greater than the first preset value, it is determined that the i-th title field and the j-th title field match; when SIM_0(i, j) is still smaller than the first preset value, SIM_0(i, j) can be corrected several more times in succession.
  • In each round, the similarity obtained by the previous correction is corrected again according to the similarity SIM_s by the second preset algorithm.
  • That is, SIM_{0+y-1}(i, j) is corrected according to the similarity SIM_s by the second preset algorithm to obtain SIM_{0+y}(i, j),
  • where y takes every positive integer not greater than Y, and Y is a preset value.
  • If SIM_{0+Y}(i, j), obtained after the Y-th correction, is greater than the first preset value, it is determined that the i-th title field and the j-th title field match.
  • If a value SIM_{0+X}(i, j) greater than the first preset value is obtained with X less than Y, the iterative correction is stopped.
  • The second preset algorithm is the same as in feasible implementation manner 2 and is not described again.
  • Optionally, during the repeated corrections, a corrected similarity may replace the similarity before correction to improve correction efficiency.
  • The replacement can follow either of the two modes below.
  • In one possible mode, before the y-th correction, all similarities SIM0+y-1(i, j) greater than the first preset value are identified and used to replace the similarities before correction, and all similarities SIM0+y-1(i, j) smaller than the first preset value are collected into a similarity set E. The similarities in E are corrected for the y-th time in order from high to low, and every resulting SIM0+y(i, j) greater than the first preset value replaces the corresponding SIM0+y-1(i, j).
  • In the other possible mode, before any similarity SIM(i, j) is corrected, all similarities SIM(i, j) smaller than the first preset value are collected into a similarity set F.
  • The similarities in F are corrected in order from high to low. Each time a corrected similarity SIM0(i, j) is obtained, it is checked against the first preset value; if it is greater, the corrected similarity replaces the similarity before correction. This continues until every similarity in F has been corrected and, where applicable, replaced.
  • The corrected similarities SIM0(i, j) that are still smaller than the first preset value then form a similarity set F*, which is corrected and replaced in the same way as in the first correction, until the Y successive corrections are complete.
  • The data cleaning method provided by the embodiments of the present invention uses a history form in the history form library that has the same description object, adaptively applies the constraints of the history form's title fields to the title fields of the current form, and cleans the data corresponding to those title fields based on the constraints.
  • Developers therefore do not need to write and maintain cleaning algorithm code each time data is cleaned, which lowers the barrier to use, gives the method wide applicability, and reduces the workload of manual data cleaning.
  • The method also enables automatic cleaning of big data in a database, improving the efficiency and accuracy of data cleaning and the accuracy and reliability of the data source.
  • FIG. 3 is a schematic structural diagram of Embodiment 1 of a data cleaning device according to the present invention. As shown in FIG. 3, the apparatus of this embodiment may include:
  • the history form obtaining module 301 is configured to select, in the history form library, a history form having the same description object as the current form, where the current form contains m title fields, the history form contains n title fields, and m and n are positive integers;
  • the similarity calculation module 302 is configured to calculate, according to the first preset algorithm, the similarity between each of the m title fields and each of the n title fields obtained by the history form obtaining module 301;
  • the matching module 303 is configured to, for any similarity SIM(i, j) calculated by the similarity calculation module 302, obtain the constraint of the j-th title field if the i-th title field and the j-th title field are determined to match according to the preset matching rule,
  • where i denotes the i-th of the m title fields, j denotes the j-th of the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n;
  • the data cleaning module 304 is configured to clean the data corresponding to the i-th title field that does not satisfy the constraint obtained by the matching module 303.
  • the device in this embodiment may be used to implement the technical solution of the method embodiment shown in FIG. 1 , and the implementation principle and technical effects are similar, and details are not described herein again.
  • the matching module 303 is specifically configured to:
  • for any similarity SIM(i, j) calculated by the similarity calculation module, if SIM(i, j) is greater than the first preset value, determine that the i-th title field of the m title fields matches the j-th title field of the n title fields, and obtain the constraint of the j-th title field.
  • the matching module 303 is specifically configured to:
  • for any similarity SIM(i, j) calculated by the similarity calculation module, if SIM(i, j) is not greater than the first preset value and m and n are greater than 1, determine k title fields according to the preset field association relationship of the current form;
  • for the s-th of the k title fields, determine the maximum similarity SIMs among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field according to the preset field association relationship of the current form, and k is less than m;
  • correct SIM(i, j) according to the similarities SIMs by the second preset algorithm to obtain the corrected similarity SIM0(i, j); if SIM0(i, j) is greater than the first preset value, determine that the i-th title field of the m title fields matches the j-th title field of the n title fields.
  • the matching module 303 is specifically configured to:
  • for any similarity SIM(i, j) calculated by the similarity calculation module, if SIM(i, j) is not greater than the first preset value and m and n are greater than 1, determine k title fields according to the preset field association relationship of the current form;
  • for the s-th of the k title fields, determine the maximum similarity SIMs among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field according to the preset field association relationship of the current form, and k is less than m;
  • correct SIM(i, j) according to the similarities SIMs by the second preset algorithm to obtain SIM0(i, j); if SIM0(i, j) is not greater than the first preset value, correct it a further Y times to obtain SIM0+Y(i, j), and if SIM0+Y(i, j) is greater than the first preset value, determine that the i-th title field of the m title fields matches the j-th title field of the n title fields;
  • in the y-th correction, SIM0+y-1(i, j) is corrected according to the similarities SIMs by the second preset algorithm to obtain SIM0+y(i, j).
  • y takes positive integer values not greater than Y.
  • the second preset algorithm is a formula as shown below:
  • SIM is the similarity to be corrected
  • SIM * is the corrected similarity
  • a is the preset weight coefficient
  • the similarity calculation module 302 includes:
  • a first calculating unit configured to obtain the similarity between each of the m title fields and each of the n title fields according to the degree of overlap between the title field names of the m title fields in the current form and the title field names of the n title fields in the history form; or
  • a second calculating unit configured to obtain the similarity between each of the m title fields and each of the n title fields from a third-party knowledge base, based on the title field names of the m title fields in the current form and the title field names of the n title fields in the history form; or
  • a third calculating unit configured to obtain the similarity between each of the m title fields and each of the n title fields according to the degree of overlap between the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form.
  • the third calculating unit is specifically configured to:
  • max 1 represents the maximum value of the i-th title field
  • min 1 represents the minimum value of the i-th title field
  • max 2 represents the maximum value of the j-th title field
  • min 2 represents the minimum value of the j-th title field.
  • the device further includes:
  • a storage module that stores the current form and the constraints of the current form into the history form library.
  • FIG. 4 is a schematic structural view of a data cleaning device of the present invention.
  • the apparatus can be used to perform the data cleaning method as described in the above embodiments.
  • the apparatus includes a processor 401, a memory 402, and a bus 405.
  • the processor 401 and the memory 402 are respectively connected to the bus 405, wherein:
  • the memory 402 is configured to store data and store program code
  • the processor 401 is configured to read the program code stored in the memory 402 and execute a data cleaning method.
  • the memory 402 stores a large amount of data and program code, the data being stored as forms, and the processor 401 cleans and corrects erroneous data that may exist in the memory 402 by carrying out the data cleaning method of the present invention;
  • optionally, the user equipment further includes a display 403 for displaying the cleaning and correction results of the processor 401, and also for displaying its intermediate processing.
  • optionally, the user equipment further includes a memory 404.
  • the memory 404 stores preset data such as a third-party database, a history form library, and an expert knowledge base, which the processor 401 can call when carrying out the data cleaning method of the present invention.
  • optionally, preset data such as the third-party database, the history form library, and the expert knowledge base may also be stored in the memory 402.
  • the third-party knowledge base includes a preset synonym library for title fields, and the history form library contains history forms that have already been cleaned.
  • the data in these history forms is highly accurate, and each history form has its own constraints.
  • the expert knowledge base contains constraints preset on the basis of expert knowledge.
  • the above device modules in FIG. 4 may be integrated in the same computer or may be connected only through a network.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • in actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • if the functions are implemented as software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied as a software product that is stored in a storage medium and includes instructions causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention.
  • the foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)

Abstract

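A data cleaning method and apparatus. The method includes: selecting, from a history form library, a history form that has the same description object as a current form, where the current form contains m title fields and the history form contains n title fields, m and n being positive integers (101); calculating, according to a first preset algorithm, the similarity between each of the m title fields and each of the n title fields (102); for any similarity SIM(i,j), if the i-th title field and the j-th title field are determined to match according to a preset matching rule, obtaining the constraint of the j-th title field, where i denotes the i-th of the m title fields, j denotes the j-th of the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n (103); and cleaning the data corresponding to the i-th title field that does not satisfy the constraint (104). The method and apparatus improve the efficiency and accuracy of data cleaning and the accuracy and reliability of the data source.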

Description

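Data cleaning method and apparatus
Technical Field
The present invention relates to data cleaning technology, and in particular to a data cleaning method and apparatus.
Background
With the rapid development of information technology and the arrival of the big data era, every industry has begun to build information systems and accumulate large amounts of data. Accurate data is a basic condition for any data analysis. In practice, however, accuracy problems are widespread for many reasons arising during collection, transmission, storage, and processing. The purpose of data cleaning is to detect erroneous data and remove or correct it, so as to improve the accuracy and quality of the data.
Common data errors include null values and out-of-range values. In the prior art, the usual way to remove or correct erroneous data is programmed data cleaning based on a domain-specific language. Each time a form is cleaned, developers define cleaning rules for the erroneous data in that form, determine a specific cleaning algorithm from the rules, write a data cleaning program implementing the algorithm, and finally use that program to detect and correct the data automatically.
Although a data cleaning program does detect and correct data automatically, this approach requires developers to write or modify the program every time data is cleaned. It places high demands on developers and is inefficient, so prior-art data cleaning methods lack general applicability and ease of use.
Summary
The embodiments of the present invention provide a data cleaning method and apparatus to overcome the low efficiency and the lack of general applicability and ease of use of existing data cleaning methods.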
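In one aspect, an embodiment of the present invention provides a data cleaning method, including:
selecting, from a history form library, a history form that has the same description object as a current form, where the current form contains m title fields, the history form contains n title fields, and m and n are positive integers;
calculating, according to a first preset algorithm, the similarity between each of the m title fields and each of the n title fields;
for any similarity SIM(i,j), if the i-th title field and the j-th title field are determined to match according to a preset matching rule, obtaining the constraint of the j-th title field, where i denotes the i-th of the m title fields, j denotes the j-th of the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n;
cleaning the data corresponding to the i-th title field that does not satisfy the constraint.
By using a history form in the history form library that has the same description object, the constraints of the history form's title fields are adaptively applied to the title fields of the current form, and the data corresponding to those title fields is cleaned based on the constraints. Developers do not need to write and maintain cleaning algorithm code each time data is cleaned, which lowers the barrier to use, gives the method wide applicability, and reduces the workload of manual data cleaning. The method also enables automatic cleaning of big data in a database, improving the efficiency and accuracy of data cleaning and the accuracy and reliability of the data source.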
进一步地,针对任一相似度SIM(i,j),按照预设匹配规则判定出第i标题字段和第j标题字段匹配包括:
若相似度SIM(i,j)大于第一预设值,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配。
进一步地,针对任一相似度SIM(i,j),按照预设匹配规则判定出第i标题字段和第j标题字段匹配包括:
若相似度SIM(i,j)不大于第一预设值且m、n大于1时,依据当前表单的预设字段关联关系确定出k个标题字段,对k个标题字段中的第s标题字段,在第s标题字段与m个标题字段中的每个标题字段之间的相似度中,确定最大的相似度SIMs,其中s的取值包括不大于k的所有自然数,k为依据当前表单的预设字段关联关系确定出的与第i标题字段关联的标题字段的总个数,其中k小于m;
根据相似度SIMs通过第二预设算法对SIM(i,j)进行修正,得到修正后的相似度SIM0(i,j);
若SIM0(i,j)大于第一预设值,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配。
进一步地,针对任一相似度SIM(i,j),按照预设匹配规则判定出第i 标题字段和第j标题字段匹配包括:
若相似度SIM(i,j)不大于第一预设值且m、n大于1时,依据当前表单的预设字段关联关系确定出k个标题字段,对k个标题字段中的第s标题字段,在第s标题字段与n个标题字段中的每个标题字段之间的相似度中,确定最大的相似度SIMs,其中s的取值包括不大于k的所有自然数,k为依据当前表单的预设字段关联关系确定出的与第i标题字段关联的标题字段的总个数,其中k小于m;
根据相似度SIMs通过第二预设算法对SIM(i,j)进行修正,得到修正后的相似度SIM0(i,j);
若SIM0(i,j)不大于第一预设值,则对SIM0(i,j)进行持续修正Y次后,得到SIM0+Y(i,j),若SIM0+Y(i,j)大于第一预设值时,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配;
其中,在第y次修正中,根据相似度SIMs通过第二预设算法对SIM0+y-1(i,j)进行修正,得到SIM0+y(i,j),y的取值包括不大于Y的正整数。
进一步地,第二预设算法为如下所示的公式:
Figure PCTCN2016098771-appb-000001
其中,SIM为待修正的相似度,SIM*为修正后的相似度,a为预设权重系数。
在确定匹配标题字段的过程中,通过利用表单中预设的标题字段间关联关系,对标题字段的相似度进行修正,得到修正后的更为准确的相似度,可确定出更多匹配标题字段,获得更多的约束条件,提高了数据清理的效率。
进一步地,按照第一预设算法计算m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度,包括:
按照当前表单中的m个标题字段的标题字段名称与历史表单中的n个标题字段的标题字段名称的重合度,获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度;或者
按照当前表单中的m个标题字段的标题字段名称与历史表单中的n个标题字段的标题字段名称,根据第三方知识库获取m个标题字段中每 个标题字段与n个标题字段中的每个标题字段之间的相似度;或者
按照当前表单中的m个标题字段的字段取值集与历史表单中的n个标题字段的字段取值集的重合度,获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度。
进一步地,按照当前表单中的m个标题字段的字段取值集与历史表单中的n个标题字段的字段取值集的重合度,获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度,包括:
当当前表单中的m个标题字段的字段取值集为离散型时,通过如下公式确定相似度:
Figure PCTCN2016098771-appb-000002
其中,
Figure PCTCN2016098771-appb-000003
表示第i标题字段的字段取值集,
Figure PCTCN2016098771-appb-000004
表示第j标题字段的字段取值集;或者
当当前表单中的m个标题字段的字段取值集为连续型时,通过如下公式确定相似度:
Figure PCTCN2016098771-appb-000005
其中,max1表示第i标题字段的最大值,min1表示第i标题字段的最小值,max2表示第j标题字段的最大值,min2表示第j标题字段的最小值。
进一步地,对当前表单进行数据清理之后,还包括:
将当前表单以及当前表单的约束条件存储至历史表单库中。
下面介绍本发明实施例提供的一种数据清理装置,该装置与方法一一对应,用以实现上述实施例中的数据清理方法,具有相同的技术特征和技术效果,本发明对此不再赘述。
本发明实施例另一方面提供一种数据清理装置,包括:
历史表单获取模块,用于在历史表单库中选取与当前表单具有相同描述对象的历史表单,当前表单中含有m个标题字段,历史表单中含有n个标题字段,其中m和n为正整数;
相似度计算模块,用于按照第一预设算法计算历史表单获取模块获取的m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度;
匹配模块,用于针对相似度计算模块计算得到的任一相似度SIM(i,j),若按照预设匹配规则判定出第i标题字段和第j标题字段匹配,则获取第j标题字段的约束条件;其中i表示m个标题字段中的第i标题字段,j表示n个标题字段中的第j标题字段,i的取值包括不大于m的所有自然数,j的取值包括不大于n的所有自然数;
数据清理模块,用于对第i标题字段对应的数据中不符合匹配模块获取的约束条件的数据进行数据清理。
进一步的,匹配模块具体用于:
针对相似度计算模块计算得到的任一相似度SIM(i,j),若相似度SIM(i,j)大于第一预设值,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配,则获取第j标题字段的约束条件。
进一步的,匹配模块具体用于:
针对相似度计算模块计算得到的任一相似度SIM(i,j),若相似度SIM(i,j)不大于第一预设值且m、n大于1时,依据当前表单的预设字段关联关系确定出k个标题字段,对k个标题字段中的第s标题字段,在第s标题字段与n个标题字段中的每个标题字段之间的相似度中,确定最大的相似度SIMs,其中s的取值包括不大于k的所有自然数,k为依据当前表单的预设字段关联关系确定出的与第i标题字段关联的标题字段的总个数,其中k小于m;
根据相似度SIMs通过第二预设算法对SIM(i,j)进行修正,得到修正后的相似度SIM0(i,j);
若SIM0(i,j)大于第一预设值,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配。
进一步的,匹配模块具体用于:
针对相似度计算模块计算得到的任一相似度SIM(i,j),若相似度SIM(i,j)不大于第一预设值且m、n大于1时,依据当前表单的预设字段关联关系确定出k个标题字段,对k个标题字段中的第s标题字段,在第s标题字段与n个标题字段中的每个标题字段之间的相似度中,确定最大的相似度SIMs,其中s的取值包括不大于k的所有自然数,k为依据当前表单的预设字段关联关系确定出的与第i标题字段关联的标题字段的总个 数,其中k小于m;
根据相似度SIMs通过第二预设算法对SIM(i,j)进行修正,得到修正后的相似度SIM0(i,j);
若SIM0(i,j)不大于第一预设值,则对SIM0(i,j)进行持续修正Y次后,得到SIM0+Y(i,j),若SIM0+Y(i,j)大于第一预设值时,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配;
其中,在第y次修正中,根据相似度SIMs通过第二预设算法对SIM0+y-1(i,j)进行修正,得到SIM0+y(i,j),y的取值包括不大于Y的正整数。
进一步地,第二预设算法为如下所示的公式:
Figure PCTCN2016098771-appb-000006
其中,SIM为待修正的相似度,SIM*为修正后的相似度,a为预设权重系数。
进一步的,相似度计算模块包括:
第一计算单元,用于按照当前表单中的m个标题字段的标题字段名称与历史表单中的n个标题字段的标题字段名称的重合度,获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度;或者
第二计算单元,用于按照当前表单中的m个标题字段的标题字段名称与历史表单中的n个标题字段的标题字段名称,根据第三方知识库获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度;或者
第三计算单元,用于按照当前表单中的m个标题字段的字段取值集与历史表单中的n个标题字段的字段取值集的重合度,获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度。
进一步地,第三计算单元具体用于:
当当前表单中的m个标题字段的字段取值集为离散型时,通过如下公式确定相似度:
Figure PCTCN2016098771-appb-000007
其中,
Figure PCTCN2016098771-appb-000008
表示第i标题字段的字段取值集,
Figure PCTCN2016098771-appb-000009
表示第j标题字段的字 段取值集;或者
当当前表单中的m个标题字段的字段取值集为连续型时,通过如下公式确定相似度:
Figure PCTCN2016098771-appb-000010
其中,max1表示第i标题字段的最大值,min1表示第i标题字段的最小值,max2表示第j标题字段的最大值,min2表示第j标题字段的最小值。
进一步地,在上述任一装置实施例的基础上,该装置还包括:
存储模块,用于将当前表单以及当前表单的约束条件存储至历史表单库中。
本发明实施例另一方面还提供一种数据清理装置,包括:存储器、处理器以及总线,存储器以及处理器分别与总线连接,其中:
存储器用于存储数据和存储程序代码;
处理器,用于读取存储器中存储的程序代码,执行如上所述的数据清理方法。
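Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of Embodiment 1 of the data cleaning method of the present invention;
FIG. 2 is a schematic diagram of a form scenario for Embodiment 2 of the data cleaning method of the present invention;
FIG. 3 is a schematic structural diagram of Embodiment 1 of the data cleaning apparatus of the present invention;
FIG. 4 is a schematic structural diagram of the data cleaning apparatus of the present invention.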
具体实施方式
为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造 性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
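In data storage, a great deal of statistical data is usually stored as forms, such as job type and content forms, basic employee information forms, and migrant population information forms. As forms and the data in them grow, erroneous data may appear as a result of collection, transmission, storage, and processing, which in turn affects later form-based queries and analysis. To handle such errors, the embodiments of the present invention take into account that the form library already stores multiple history forms whose data has been cleaned: when the content of the current form to be cleaned is consistent with that of a history form, the current form can be cleaned by reference to the history form's data-cleaning constraints. The resulting data cleaning method is broadly applicable to cleaning data stored as forms.
An embodiment of the present invention provides a data cleaning method for automatically cleaning the large amounts of data stored in a database, finding possible erroneous data and removing or correcting it.
Large volumes of data are usually stored in a database as forms, and a database holds many forms. A form stores multiple columns of data, each column consisting of a title field and the data corresponding to that title field. Table 1 below shows a typical form, which includes title fields named "name" (姓名), "ID", "gender" (性别), and so on. Title fields have different attributes depending on the data they hold, and different association relationships exist among them. Data under a title field that does not satisfy the constraints is erroneous data; the constraints include the attributes of title fields and the association relationships between title fields. For example, the data under the title field "ID" is unique, so the repeated value "4" should not appear among the data under "ID"; the title field "city" (城市) has a specific value range, so it should not contain "程度", a value outside that range; and the title fields "city" and "area code" (区号) are in one-to-one correspondence, so the pairing of "成都" (Chengdu) under "city" with "029" under "area code" in Table 1 is wrong. Table 1 underlines several such possible errors as examples.
Table 1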
Figure PCTCN2016098771-appb-000012
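The three kinds of violation described for Table 1 (a duplicated ID, a city outside the allowed value range, and a city/area-code pair that breaks a one-to-one mapping) can be checked mechanically. The following is a minimal sketch under stated assumptions: the function names, the sample rows, and the reference mapping are hypothetical illustrations and are not taken from the patent.

```python
from collections import Counter

def find_duplicates(values):
    """Return the values that occur more than once (uniqueness violations)."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c > 1)

def find_out_of_range(values, allowed):
    """Return the values that are not in the allowed value set."""
    return [v for v in values if v not in allowed]

def find_mapping_violations(pairs, mapping):
    """Return the (key, value) pairs that disagree with a one-to-one reference mapping."""
    return [(k, v) for k, v in pairs if mapping.get(k) != v]

# Hypothetical sample data loosely modelled on the errors described for Table 1.
ids = [1, 2, 3, 4, 4]
cities = ["北京", "程度", "成都"]
allowed_cities = {"北京", "成都"}
city_codes = [("北京", "010"), ("成都", "029")]
reference = {"北京": "010", "成都": "028"}  # illustrative reference mapping

print(find_duplicates(ids))
print(find_out_of_range(cities, allowed_cities))
print(find_mapping_violations(city_codes, reference))
```

Each check corresponds to one constraint type (a field attribute or an association between fields), so a flagged value is a candidate for removal or correction rather than a confirmed error.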
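To address the erroneous data described above, the embodiments of the present invention propose a data cleaning method for cleaning forms. For ease of description, the form to be cleaned is called the current form. The method first obtains a history form similar to the current form, identifies the title fields in the history form that match title fields of the current form, and then uses the constraints of those matching history title fields to clean the corresponding title fields of the current form, removing data that does not satisfy the constraints. The method aims to solve the lack of general applicability and ease of use in prior-art data cleaning methods.
FIG. 1 is a schematic flowchart of Embodiment 1 of the data cleaning method of the present invention. The method is executed by a data cleaning apparatus, which may be provided in a processor. As shown in FIG. 1, the method of this embodiment may include:
Step 101: select, from a history form library, a history form that has the same description object as the current form, where the current form contains m title fields, the history form contains n title fields, and m and n are positive integers;
Step 102: calculate, according to a first preset algorithm, the similarity between each of the m title fields and each of the n title fields;
Step 103: for any similarity SIM(i,j), if the i-th title field and the j-th title field are determined to match according to a preset matching rule, obtain the constraint of the j-th title field, where i denotes the i-th of the m title fields, j denotes the j-th of the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n;
Step 104: clean the data corresponding to the i-th title field that does not satisfy the constraint.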
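Steps 102 to 104 amount to looking up, for each field of the current form, the constraints of every history field it matches. The sketch below illustrates step 103 only, assuming the similarities have already been computed and using the simple threshold rule; the dictionary layout, the function name `collect_constraints`, and the sample constraint contents are hypothetical.

```python
def collect_constraints(similarities, history_constraints, threshold=0.9):
    """Map each current-form field to the constraints of matching history fields.

    similarities: dict mapping (current_field, history_field) -> similarity in [0, 1]
    history_constraints: dict mapping history_field -> its constraint object
    A pair matches when its similarity is strictly greater than the threshold.
    """
    matched = {}
    for (current_field, history_field), sim in similarities.items():
        if sim > threshold and history_field in history_constraints:
            matched.setdefault(current_field, []).append(
                history_constraints[history_field]
            )
    return matched

# Hypothetical inputs: field names and constraint contents are for illustration.
sims = {
    ("occupation", "occupation"): 1.0,
    ("monthly income", "monthly salary"): 0.7,
}
constraints = {
    "occupation": {"allowed": {"civil servant", "programmer", "none"}},
    "monthly salary": {"min": 0},
}
print(collect_constraints(sims, constraints))
```

Only the "occupation" pair exceeds the threshold here, so only its constraint is carried over to the current form; the "monthly income" pair would need the correction procedure described later before it could match.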
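To distinguish title fields of the history form from those of the current form, this embodiment calls any one of the m title fields of the current form the i-th title field and any one of the n title fields of the history form the j-th title field, where m and n are positive integers, i takes every natural number not greater than m, and j takes every natural number not greater than n. The similarity SIM(i,j) is the similarity between the i-th title field and the j-th title field.
Specifically, in step 101, the history form that has the same description object as the current form is obtained by comparing or querying the multiple history forms stored in the history form library.
A history form in the library has the same description object as the current form if it contains the same title fields. The more title fields the two forms share and the fewer they do not share, the more similar the current form is to that history form. This similarity principle can be used to select, from the history form library, the history form that has the same description object as the current form.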
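The selection principle above (more shared title fields and fewer differing ones means a more similar form) can be sketched as a set-overlap score. This is an assumption-laden illustration: the patent does not state a scoring formula for form selection, so the ratio of shared to total distinct fields used here, along with the names `form_overlap` and `select_history_form`, is a hypothetical choice.

```python
def form_overlap(current_fields, history_fields):
    """Share of distinct title fields that the two forms have in common."""
    current, history = set(current_fields), set(history_fields)
    union = current | history
    return len(current & history) / len(union) if union else 0.0

def select_history_form(current_fields, library):
    """Return the name of the history form with the highest overlap, or None.

    library: dict mapping a form name to its list of title fields.
    """
    if not library:
        return None
    return max(library, key=lambda name: form_overlap(current_fields, library[name]))

# Hypothetical library of history forms.
library = {
    "employees": ["name", "ID", "city"],
    "jobs": ["job type", "content"],
}
print(select_history_form(["name", "ID", "gender"], library))
```

The score rises with each shared field and falls with each unshared one, which is the behaviour the paragraph describes; a production system would likely also require a minimum score before accepting a history form.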
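Specifically, in step 102, the similarities between the m title fields of the current form and the n title fields of the selected history form are calculated. For example, all title fields of the current form are traversed according to the first preset algorithm, and the similarity between each of them and every title field of the history form is obtained as SIM(i,j), where i denotes the i-th of the m title fields, j denotes the j-th of the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n.
In a specific implementation, SIM(i,j) may be obtained by any one of the following implementations or by a combination of several of them.
In one feasible implementation, the similarity between each of the m title fields and each of the n title fields is obtained according to the degree of overlap between the title field names of the m title fields in the current form and the title field names of the n title fields in the history form.
For example, if a title field of the current form is named "姓名" (name) and the history form also has a title field named "姓名", the two title fields coincide completely and their similarity is 1. The overlap between the current form's "姓名" field and a history form field named "月收入" (monthly income) is 0, so their similarity can be taken as 0.
In another feasible implementation, the similarity between each of the m title fields and each of the n title fields is obtained from a third-party knowledge base, based on the title field names of the m title fields in the current form and the title field names of the n title fields in the history form.
For example, if the third-party knowledge base records that the field names "姓名" and "name" are synonyms, the similarity between the current form's field named "姓名" and the history form's field named "name" can be taken as 1.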
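The two name-based implementations above can be combined into one small function. This is a simplified sketch: it only distinguishes identical names, names listed as synonyms, and everything else, whereas the patent speaks of a degree of overlap without giving a formula. The function name and the synonym-group representation of the third-party knowledge base are assumptions.

```python
def name_similarity(name_a, name_b, synonym_groups=()):
    """Return 1.0 for identical or synonymous field names, otherwise 0.0.

    synonym_groups: iterable of sets, each holding names considered synonyms
    (a hypothetical stand-in for the third-party knowledge base).
    """
    if name_a == name_b:
        return 1.0
    for group in synonym_groups:
        if name_a in group and name_b in group:
            return 1.0
    return 0.0

knowledge_base = [{"姓名", "name"}]
print(name_similarity("姓名", "姓名"))
print(name_similarity("姓名", "name", knowledge_base))
print(name_similarity("姓名", "月收入", knowledge_base))
```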
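In yet another feasible implementation, the similarity between each of the m title fields and each of the n title fields is obtained according to the degree of overlap between the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form.
Specifically, depending on the type of a title field's value set, this implementation covers the following two cases:
Case 1: when the value set of a title field is discrete, the similarity between the discrete title field of the current form and each title field of the history form is obtained according to formula (1) below;
Formula (1) is:
Figure PCTCN2016098771-appb-000013
where
Figure PCTCN2016098771-appb-000014
denotes the field value set of the i-th title field, and
Figure PCTCN2016098771-appb-000015
denotes the field value set of the j-th title field.
When this approach is used, the similarity between a discrete title field of the current form and a non-discrete title field of the history form can be taken as 0 without evaluating formula (1). As formula (1) shows, the more values the two field value sets share, the higher the similarity.
For example, common discrete title fields include "city" and "education". If the value set of a "work city" title field is {Beijing, Shanghai, Shenzhen} and the value set of another title field, "work location", is {Beijing, Shanghai, Shenzhen, Tianjin}, formula (1) gives a similarity of {Beijing, Shanghai, Shenzhen}/{Beijing, Shanghai, Shenzhen, Tianjin} = 75%.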
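The worked example for discrete fields (three shared cities out of four distinct cities giving 75%) is consistent with dividing the size of the intersection of the two value sets by the size of their union. Formula (1) itself is only available as an image in this text, so the sketch below assumes that intersection-over-union form; the function name is hypothetical.

```python
def discrete_similarity(values_i, values_j):
    """Intersection size over union size of two field value sets.

    Assumed form of formula (1); it reproduces the 75% worked example.
    """
    set_i, set_j = set(values_i), set(values_j)
    union = set_i | set_j
    if not union:
        return 0.0  # two empty value sets carry no evidence of similarity
    return len(set_i & set_j) / len(union)

work_city = ["Beijing", "Shanghai", "Shenzhen"]
work_location = ["Beijing", "Shanghai", "Shenzhen", "Tianjin"]
print(discrete_similarity(work_city, work_location))  # 0.75
```

Other ratios, such as dividing by the larger set, would give the same 75% on this example, so the choice of denominator here remains an assumption.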
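Case 2: when the value set of a title field is continuous, the similarity between the continuous title field of the current form and every title field of the history form is obtained according to formula (2) below;
Formula (2) is:
Figure PCTCN2016098771-appb-000016
where max1 denotes the maximum value of the i-th title field, min1 denotes the minimum value of the i-th title field, max2 denotes the maximum value of the j-th title field, and min2 denotes the minimum value of the j-th title field. Common continuous title fields include "age" and "salary".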
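Formula (2) is available only as an image here, so its exact form cannot be confirmed from this text. As an illustration of how a similarity could be built from max1, min1, max2, and min2, the sketch below assumes a ratio of the overlapping range to the total covered range. That choice, and the function name `range_similarity`, are assumptions and may differ from the patented formula.

```python
def range_similarity(min1, max1, min2, max2):
    """Overlap of two closed intervals divided by the span they cover together.

    Assumed stand-in for formula (2); returns a value in [0, 1].
    """
    overlap = min(max1, max2) - max(min1, min2)
    span = max(max1, max2) - min(min1, min2)
    if span == 0:
        return 1.0  # both fields hold the same single value
    return max(0.0, overlap / span)

print(range_similarity(18, 60, 18, 60))   # identical ranges
print(range_similarity(0, 10, 20, 30))    # disjoint ranges
print(range_similarity(0, 10, 5, 15))     # partial overlap
```

Under this assumption, identical ranges score 1, disjoint ranges score 0, and partially overlapping ranges fall in between, which matches the stated 0-to-1 range of similarities.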
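Whichever implementation is used, the similarity between each title field of the current form and each title field of the history form is a real number from 0 to 1. A similarity of 0 indicates that the two title fields, one from the current form and one from the history form, share no attribute or association relationship, and it can be regarded as an invalid similarity. In practice, among the similarities between any one title field of the current form and every title field of the history form, usually at most one is a valid similarity greater than 0, and the rest are 0.
Further, a more accurate similarity can be obtained from the similarities produced by the different implementations above together with a preset weight for each implementation's similarity.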
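Combining the similarities from several implementations with preset weights can be sketched as a weighted average. The patent does not give the combination formula, so normalising by the sum of the weights (to keep the result between 0 and 1) is an assumption, as is the function name.

```python
def combine_similarities(similarities, weights):
    """Weighted average of similarities from different implementations.

    Assumes non-negative weights with a positive sum; normalising by the
    weight total keeps the result in [0, 1] when each input is in [0, 1].
    """
    if len(similarities) != len(weights):
        raise ValueError("similarities and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(s * w for s, w in zip(similarities, weights)) / total

# Hypothetical case: name-based similarity 1.0 and value-set similarity 0.75,
# weighted equally.
print(combine_similarities([1.0, 0.75], [1, 1]))  # 0.875
```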
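Specifically, in step 103, for any similarity SIM(i,j) among all the similarities obtained in step 102, if the i-th title field and the j-th title field are determined to match according to the preset matching rule, the constraint of the j-th title field is obtained.
For any form, constraints can be built from the value set of each title field and the relationships among multiple title fields. Constraints usually include title field attributes and association relationships between title fields.
Optionally, a title field attribute may be one or more of the following: reliability, uniqueness, label, field synonyms, value range, and so on. Optionally, an association relationship may be one or more of the following: correlation, order preservation, one-to-one mapping, and so on. Constraints mainly limit the value range of each title field in the form and the relationships among the data of multiple title fields; data that does not satisfy the constraints can be regarded as erroneous data to be cleaned.
Each history form in the history form library has its own constraints. When a title field of a history form and a title field of the current form match, the two can be assumed to have the same or similar title field attributes and associations, meaning that the data under both title fields satisfies the same constraints. Therefore, once the i-th title field and the j-th title field are determined to match, the constraint of the j-th title field can be obtained.
Specifically, in step 104, because the i-th title field matches the j-th title field, the i-th title field is assumed to satisfy the constraint of the j-th title field. That constraint can be used directly to check the data under the i-th title field, identify all data that does not satisfy it, and clean that data. The more pairs of title fields that are determined to match, the more thoroughly the title fields of the current form can be cleaned.
When values of the current form that violate a title field attribute or an association between title fields are cleaned according to the constraints, the cleaning process includes deleting the erroneous value and supplying a corrected value. For example, suppose the constraints state that the value range of the title field "gender" (性别) is "男" (male), "女" (female), or "未知" (unknown). If a value under the "gender" title field of the current form is "北京" (Beijing), it is treated as erroneous data to be cleaned: after "北京" is deleted, "未知" can be used as the corrected value according to the field's value range, which completes the cleaning. Likewise, if a value under "gender" is "male", the field's value range and synonyms show that "male" is a synonym of "男", so "male" can be corrected to "男".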
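The "gender" example above (replace an out-of-range value with "未知", and map a synonym such as "male" to "男") can be sketched as follows. The function name, the argument layout, and the choice to fall back to a single default value are assumptions made for illustration.

```python
def clean_column(values, allowed, synonyms=None, default="未知"):
    """Clean one column against a value-range constraint.

    values:   data under the title field
    allowed:  the permitted value set from the matched constraint
    synonyms: optional dict mapping a synonym to its canonical allowed value
    default:  replacement used when a value is neither allowed nor a synonym
    """
    synonyms = synonyms or {}
    cleaned = []
    for value in values:
        if value in allowed:
            cleaned.append(value)
        elif synonyms.get(value) in allowed:
            cleaned.append(synonyms[value])
        else:
            cleaned.append(default)
    return cleaned

gender_values = ["男", "北京", "male", "女"]
print(clean_column(gender_values, {"男", "女", "未知"}, {"male": "男"}))
```

The text also describes an optional confirmation step in which an operator approves each replacement; in this sketch that would mean returning the proposed changes for review instead of applying them directly.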
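Further, when data is modified or replaced during the cleaning in step 104, the source data to be modified and its replacement may be shown to an operator on a display. The correction is applied after the operator confirms it and skipped if the operator rejects it. Adding this confirmation step can improve the accuracy of data cleaning.
During data cleaning, a third-party knowledge base or an expert knowledge base may be called to look up synonyms, near-synonyms, and associated extension words for title fields and for the data under them.
Optionally, when no history form with the same description object can be obtained in step 101, or even when one is obtained, constraints such as title field attributes and relationships between title fields may be entered manually. Alternatively, the data cleaning apparatus may automatically match title fields against the preset title field attributes and relationships already stored in the expert knowledge base, assign the constraints stored in the expert knowledge base to the title fields of the current form, and clean the data.
Further, on the basis of the above embodiment, the data cleaning method provided by the embodiments of the present invention also includes:
storing the current form and the constraints of the current form in the history form library.
Storing each cleaned current form and its constraints in the history form library expands the library and makes it easier to apply the data cleaning method of the embodiments of the present invention again later.
The data cleaning method provided by the embodiments of the present invention uses a history form in the history form library that has the same description object, adaptively applies the constraints of the history form's title fields to the title fields of the current form, and cleans the data corresponding to those title fields based on the constraints. Developers do not need to write and maintain cleaning algorithm code each time data is cleaned, which lowers the barrier to use, gives the method wide applicability, and reduces the workload of manual data cleaning. The method also enables automatic cleaning of big data in a database, improving the efficiency and accuracy of data cleaning and the accuracy and reliability of the data source.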
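Building on the above embodiment, the following describes in detail, with a specific embodiment, how two title fields are determined to match according to the preset matching rule.
FIG. 2 is a schematic diagram of a form scenario for Embodiment 2 of the data cleaning method of the present invention. FIG. 2 shows a current form to which the method is applied and an already selected history form with the same description object, and marks the similarities between some title fields of the two forms. FIG. 2 also shows some of the preset associations between title fields of the current form and some of the associations between title fields of the history form.
After the similarity SIM(i,j) between each title field of the current form and each title field of the history form has been obtained, the preset matching rule is applied to each SIM(i,j) to determine whether the corresponding i-th and j-th title fields match.
Specifically, the preset matching rule may be implemented in any one of the following ways.
Feasible implementation 1:
If the similarity SIM(i,j) is greater than the first preset value, the i-th title field of the m title fields is determined to match the j-th title field of the n title fields.
For example, with reference to FIG. 2, the similarity between the "occupation" title field of the current form and the "occupation" title field of the history form is 1. Since 1 is greater than the first preset value 0.9, the i-th and j-th title fields are determined to match: they have similar title field attributes and associations, so the two "occupation" fields should share the same constraints. For example, if the "occupation" title field of the history form has "job" (工作) as a similar title field, the "occupation" field of the current form likewise has "job" as a similar title field. Applying the constraints of the history form's "occupation" field directly to the current form's "occupation" field cleans the current form. The first preset value is a real number from 0 to 1; it may be preset in advance or adjusted as appropriate during matching.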
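Feasible implementation 2:
If the similarity SIM(i,j) is not greater than the first preset value and m and n are greater than 1, k title fields are determined according to the preset field association relationship of the current form. For the s-th of the k title fields, the maximum similarity SIMs is determined among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field according to the preset field association relationship of the current form, and k is less than m;
SIM(i,j) is corrected according to the similarities SIMs by the second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i,j) is greater than the first preset value, the i-th title field of the m title fields is determined to match the j-th title field of the n title fields.
Specifically, when SIM(i,j) is not greater than the first preset value, the i-th and j-th title fields cannot be determined to match directly. When m and n are greater than 1, both forms contain multiple title fields, so SIM(i,j) can be corrected using the similarities of other title fields that are associated with the title fields behind SIM(i,j), giving a more accurate similarity. If the corrected similarity SIM0(i,j) is greater than the first preset value, the i-th and j-th title fields are likewise determined to match.
Optionally, the first preset value may be fixed or may be changed adaptively when similarities are corrected. Optionally, before SIM(i,j) is compared with the first preset value, it is first confirmed that SIM(i,j) is greater than 0; if SIM(i,j) is 0, the i-th and j-th title fields are determined not to match, and no correction is needed.
For example, SIM(i,j) is corrected as follows:
According to the preset field association relationship of the current form, the k title fields associated with the i-th title field are determined, where k is the total number of such title fields and k is less than m. For the s-th of the k title fields, the maximum similarity SIMs is determined among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k. The preset field association relationship thus yields k similarities SIM1, SIM2, …, SIMk-1, SIMk, which are used by the second preset algorithm to correct SIM(i,j) and obtain the corrected similarity SIM0(i,j).
Specifically, the second preset algorithm used for the correction is formula (3) below: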
Figure PCTCN2016098771-appb-000017
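where SIM is the similarity to be corrected, SIM* is the corrected similarity, and a is a preset weight coefficient.
When similarities are corrected, all similarities SIM(i,j) smaller than the first preset value may be sorted and corrected in order from high to low. Correcting all such similarities raises the match rate between title fields of the current form and the history form and allows the current form to be cleaned more thoroughly.
With reference to FIG. 2 and the similarities already calculated there between the current form and the history form: the similarity between the "occupation" title field of the current form and the "occupation" title field of the history form is 1, which is greater than the first preset value 0.9, so the two title fields are determined to match directly, and the "occupation" field of the current form can be cleaned according to the constraints of the history form's "occupation" field. For example, if the data under the current form's "occupation" field includes values such as "Beijing", "male", and "2000", and the constraints of the history form's "occupation" field limit the value set to "civil servant, programmer, none", those values are erroneous and need to be modified. The similarity between "monthly income" in the current form and "monthly salary" in the history form is 0.7, and the similarity between "education" in the current form and "highest education" in the history form is 0.8. In addition, "monthly income" in the current form is associated with "education" and "occupation", and "monthly salary" in the history form is associated with "highest education" and "occupation". The corrected similarity SIM0(i,j) of "monthly income" and "monthly salary" can then be determined by the following formula: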
Figure PCTCN2016098771-appb-000018
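When a is 0.4, the corrected similarity is 0.817. If the first preset value is still 0.9, the title fields "monthly income" and "monthly salary" are considered not to match. If the first preset value is changed to 0.81 during correction, the two title fields are considered to match, and all constraints related to the title field "monthly salary" in the history form can be applied to the current form for its data cleaning.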
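The outcome of the worked example depends only on comparing the corrected similarity with the first preset value. The sketch below replays that comparison with the figures from the text (1, 0.817, 0.9, and 0.81); the function name `is_match` is hypothetical, and the correction formula itself is not reproduced because it is available only as an image.

```python
def is_match(similarity, first_preset_value=0.9):
    """A pair matches when its similarity is strictly greater than the threshold."""
    return similarity > first_preset_value

print(is_match(1.0))                             # "occupation" pair
print(is_match(0.817))                           # corrected pair, threshold 0.9
print(is_match(0.817, first_preset_value=0.81))  # corrected pair, threshold 0.81
```

Lowering the threshold is what turns the "monthly income" and "monthly salary" pair into a match, so in practice the first preset value trades match coverage against the risk of applying constraints from an unrelated field.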
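Feasible implementation 3:
If the similarity SIM(i,j) is not greater than the first preset value and m and n are greater than 1, k title fields are determined according to the preset field association relationship of the current form. For the s-th of the k title fields, the maximum similarity SIMs is determined among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field according to the preset field association relationship of the current form, and k is less than m;
SIM(i,j) is corrected according to the similarities SIMs by the second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i,j) is not greater than the first preset value, SIM0(i,j) is corrected a further Y times to obtain SIM0+Y(i,j); if SIM0+Y(i,j) is greater than the first preset value, the i-th title field of the m title fields is determined to match the j-th title field of the n title fields;
In the y-th correction, SIM0+y-1(i,j) is corrected according to the similarities SIMs by the second preset algorithm to obtain SIM0+y(i,j), where y takes positive integer values not greater than Y.
Specifically, as in feasible implementation 2, any similarity SIM(i,j) may first be corrected to obtain the corrected similarity SIM0(i,j); the correction method is not described again here.
When the corrected similarity SIM0(i,j) is greater than the first preset value, the i-th and j-th title fields are determined to match. When SIM0(i,j) is still smaller than the first preset value, it can be corrected several more times. Optionally, in each of these corrections, the similarity obtained by the previous correction is iteratively corrected according to the similarities SIMs by the second preset algorithm. In the y-th correction, SIM0+y-1(i,j) is corrected to obtain SIM0+y(i,j), where y takes positive integer values not greater than Y and Y is a preset value. When SIM0+Y(i,j) after Y corrections is greater than the first preset value, the i-th and j-th title fields are determined to match. Optionally, if an iteration yields SIM0+X(i,j) greater than the first preset value with X less than Y, the iterative correction stops. The second preset algorithm is the same as in implementation 2 and is not described again.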
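The iterative procedure of implementation 3 (correct once to get SIM0, then up to Y more times, stopping early as soon as the first preset value is exceeded) can be sketched with the correction step passed in as a function. Because the second preset algorithm is available only as an image in this text, `toy_correct` below is a hypothetical stand-in chosen for illustration; it is not the patented formula, and the other names are assumptions as well.

```python
def iterate_correction(sim, related_max_sims, correct, threshold, max_rounds):
    """Apply one initial correction, then up to max_rounds (Y) further ones.

    correct: function(similarity, related_max_sims) -> corrected similarity
    Returns (matched, final_similarity, further_rounds_used).
    """
    current = correct(sim, related_max_sims)      # SIM0(i, j)
    rounds = 0
    while current <= threshold and rounds < max_rounds:
        current = correct(current, related_max_sims)  # SIM0+y(i, j)
        rounds += 1
    return current > threshold, current, rounds

def toy_correct(sim, related_max_sims, a=0.25):
    """Hypothetical correction: move a fraction a toward the best related similarity."""
    return sim + a * (max(related_max_sims) - sim)

print(iterate_correction(0.5, [1.0], toy_correct, threshold=0.75, max_rounds=5))
print(iterate_correction(0.5, [1.0], toy_correct, threshold=0.75, max_rounds=1))
```

Passing the correction as a parameter keeps the early-stop and round-limit logic separate from the formula, so the real second preset algorithm could be substituted without changing the loop.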
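Optionally, during the repeated corrections, a corrected similarity may replace the similarity before correction to improve correction efficiency. The replacement can follow either of the two modes below.
One possible replacement mode:
For all similarities, before the y-th correction, all similarities SIM0+y-1(i,j) greater than the first preset value are identified and used to replace the similarities before correction, and all similarities SIM0+y-1(i,j) smaller than the first preset value are collected into a similarity set E. The similarities in E are corrected for the y-th time in order from high to low to obtain SIM0+y(i,j). All SIM0+y(i,j) greater than the first preset value are then identified and used to replace the corresponding SIM0+y-1(i,j). Replacing similarities with their corrected values before the y-th correction of the remaining similarities improves correction efficiency.
Another possible replacement mode:
For all similarities, before any SIM(i,j) is corrected, all similarities SIM(i,j) smaller than the first preset value are collected into a similarity set F. The similarities in F are corrected in order from high to low. Each time a corrected similarity SIM0(i,j) is obtained, it is checked against the first preset value; if it is greater, the corrected similarity replaces the similarity before correction. This continues until every similarity in F has been corrected and, where applicable, replaced. The corrected similarities SIM0(i,j) that are still smaller than the first preset value then form a similarity set F*, which is corrected and replaced in the same way as in the first correction, until the Y successive corrections are complete.
The data cleaning method provided by the embodiments of the present invention uses a history form in the history form library that has the same description object, adaptively applies the constraints of the history form's title fields to the title fields of the current form, and cleans the data corresponding to those title fields based on the constraints. Developers do not need to write and maintain cleaning algorithm code each time data is cleaned, which lowers the barrier to use, gives the method wide applicability, and reduces the workload of manual data cleaning. The method also enables automatic cleaning of big data in a database, improving the efficiency and accuracy of data cleaning and the accuracy and reliability of the data source.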
图3为本发明数据清理装置实施例一的结构示意图。如图3所示,本实施例的装置可以包括:
历史表单获取模块301,用于在历史表单库中选取与当前表单具有相同描述对象的历史表单,当前表单中含有m个标题字段,历史表单中含有n个标题字段,其中m和n为正整数;
相似度计算模块302,用于按照第一预设算法计算历史表单获取模块301获取的m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度;
匹配模块303,用于针对相似度计算模块302计算得到的任一相似度SIM(i,j),若按照预设匹配规则判定出第i标题字段和第j标题字段匹配,则获取第j标题字段的约束条件;其中i表示m个标题字段中的第i标题字段,j表示n个标题字段中的第j标题字段,i的取值包括不大于m的所有自然数,j的取值包括不大于n的所有自然数;
数据清理模块304,用于对第i标题字段对应的数据中不符合匹配模块303获取的约束条件的数据进行数据清理。
本实施例的装置,可以用于执行图1所示方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
进一步的,在上述实施例的基础上,匹配模块303具体用于:
针对相似度计算模块计算得到的任一相似度SIM(i,j),若相似度 SIM(i,j)大于第一预设值,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配,则获取第j标题字段的约束条件。
进一步的,在上述实施例的基础上,匹配模块303具体用于:
针对相似度计算模块计算得到的任一相似度SIM(i,j),若相似度SIM(i,j)不大于第一预设值且m、n大于1时,依据当前表单的预设字段关联关系确定出k个标题字段,对k个标题字段中的第s标题字段,在第s标题字段与n个标题字段中的每个标题字段之间的相似度中,确定最大的相似度SIMs,其中s的取值包括不大于k的所有自然数,k为依据当前表单的预设字段关联关系确定出的与第i标题字段关联的标题字段的总个数,其中k小于m;
根据相似度SIMs通过第二预设算法对SIM(i,j)进行修正,得到修正后的相似度SIM0(i,j);
若SIM0(i,j)大于第一预设值,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配。
进一步的,在上述实施例的基础上,匹配模块303具体用于:
针对相似度计算模块计算得到的任一相似度SIM(i,j),若相似度SIM(i,j)不大于第一预设值且m、n大于1时,依据当前表单的预设字段关联关系确定出k个标题字段,对k个标题字段中的第s标题字段,在第s标题字段与n个标题字段中的每个标题字段之间的相似度中,确定最大的相似度SIMs,其中s的取值包括不大于k的所有自然数,k为依据当前表单的预设字段关联关系确定出的与第i标题字段关联的标题字段的总个数,其中k小于m;
根据相似度SIMs通过第二预设算法对SIM(i,j)进行修正,得到修正后的相似度SIM0(i,j);
若SIM0(i,j)不大于第一预设值,则对SIM0(i,j)进行持续修正Y次后,得到SIM0+Y(i,j),若SIM0+Y(i,j)大于第一预设值时,则判定出m个标题字段中的第i标题字段和n个标题字段中的第j标题字段匹配;
其中,在第y次修正中,根据相似度SIMs通过第二预设算法对SIM0+y-1(i,j)进行修正,得到SIM0+y(i,j),y的取值包括不大于Y的正整数。
进一步地,第二预设算法为如下所示的公式:
Figure PCTCN2016098771-appb-000019
其中,SIM为待修正的相似度,SIM*为修正后的相似度,a为预设权重系数。
进一步的,在上述任一装置实施例的基础上,相似度计算模块302包括:
第一计算单元,用于按照当前表单中的m个标题字段的标题字段名称与历史表单中的n个标题字段的标题字段名称的重合度,获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度;或者
第二计算单元,用于按照当前表单中的m个标题字段的标题字段名称与历史表单中的n个标题字段的标题字段名称,根据第三方知识库获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度;或者
第三计算单元,用于按照当前表单中的m个标题字段的字段取值集与历史表单中的n个标题字段的字段取值集的重合度,获取m个标题字段中每个标题字段与n个标题字段中的每个标题字段之间的相似度。
进一步地,在上述实施例的基础上,第三计算单元具体用于:
当当前表单中的m个标题字段的字段取值集为离散型时,通过如下公式确定相似度:
Figure PCTCN2016098771-appb-000020
其中,
Figure PCTCN2016098771-appb-000021
表示第i标题字段的字段取值集,
Figure PCTCN2016098771-appb-000022
表示第j标题字段的字段取值集;或者
当当前表单中的m个标题字段的字段取值集为连续型时,通过如下公式确定相似度:
Figure PCTCN2016098771-appb-000023
其中,max1表示第i标题字段的最大值,min1表示第i标题字段的最小值,max2表示第j标题字段的最大值,min2表示第j标题字段的最小值。
进一步地,在上述任一装置实施例的基础上,该装置还包括:
存储模块,用于将当前表单以及当前表单的约束条件存储至历史表单库中。
图4为本发明数据清理装置的结构示意图。该装置可用于执行如上述实施例所述的数据清理方法。如图4所示,该装置包括:处理器401、存储器402以及总线405,处理器401以及存储器402分别与总线405连接,其中:
存储器402用于存储数据和存储程序代码;
处理器401,用于读取存储器402中存储的程序代码,执行数据清理方法。
具体的,存储器402中,存储有大量数据和程序代码,该数据以表单形式存储,处理器401通过实施本发明的数据清理方法对存储器402中可能存在的错误数据进行清理和修正;可选的,用户设备中还包括显示器403,显示器403用于将处理器401的清理和修正结果进行显示,也可用于将处理器401的中间处理过程进行显示;可选的,用户设备中还包括存储器404,存储器404中存储有第三方数据库、历史表单库、专家知识库等预设数据,便于处理器401实施本发明的数据清理方法时调用,可选的,第三方数据库、历史表单库、专家知识库等预设数据也可存储在存储器402中。其中,第三方知识库中包含预设的标题字段的近义词库,历史表单库中包含已经进行过数据清理的历史表单,该些历史表单中的数据准确性高,且该些历史表单对应有各自的约束条件,专家知识库中包含基于专家知识预设的约束条件。图4中上述各装置模块可以为集成在同一计算机中,也可仅通过网络连接。
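A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The terms "first", "second", "third", "fourth", and the like (if present) in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described herein can, for example, be implemented in orders other than those illustrated or described herein. Furthermore, the terms "include" and "have", and any variations of them, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing unit, may each exist physically on their own, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.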

Claims (15)

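  1. A data cleaning method, characterized in that the method comprises:
    selecting, from a historical form library, a historical form that has the same description object as a current form, the current form containing m title fields and the historical form containing n title fields, where m and n are positive integers;
    computing, according to a first preset algorithm, the similarity between each of the m title fields and each of the n title fields;
    for any similarity SIM(i,j), if the i-th title field and the j-th title field are determined to match according to a preset matching rule, obtaining the constraint of the j-th title field; where i denotes the i-th of the m title fields, j denotes the j-th of the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n;
    performing data cleaning on the data corresponding to the i-th title field that does not satisfy the constraint.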
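  2. The method according to claim 1, characterized in that determining, for any similarity SIM(i,j), that the i-th title field and the j-th title field match according to the preset matching rule comprises:
    if the similarity SIM(i,j) is greater than a first preset value, determining that the i-th of the m title fields matches the j-th of the n title fields.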
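  3. The method according to claim 1, characterized in that determining, for any similarity SIM(i,j), that the i-th title field and the j-th title field match according to the preset matching rule comprises:
    if the similarity SIM(i,j) is not greater than a first preset value and m and n are both greater than 1, determining k title fields according to a preset field association relationship of the current form, and, for the s-th of the k title fields, determining the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field as determined according to the preset field association relationship of the current form, and k is less than m;
    correcting SIM(i,j) with a second preset algorithm according to the similarities SIM_s to obtain the corrected similarity SIM_0(i,j);
    if SIM_0(i,j) is greater than the first preset value, determining that the i-th of the m title fields matches the j-th of the n title fields.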
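  4. The method according to claim 1, characterized in that determining, for any similarity SIM(i,j), that the i-th title field and the j-th title field match according to the preset matching rule comprises:
    if the similarity SIM(i,j) is not greater than a first preset value and m and n are both greater than 1, determining k title fields according to a preset field association relationship of the current form, and, for the s-th of the k title fields, determining the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field as determined according to the preset field association relationship of the current form, and k is less than m;
    correcting SIM(i,j) with a second preset algorithm according to the similarities SIM_s to obtain the corrected similarity SIM_0(i,j);
    if SIM_0(i,j) is not greater than the first preset value, continuing to correct SIM_0(i,j) Y more times to obtain SIM_{0+Y}(i,j), and if SIM_{0+Y}(i,j) is greater than the first preset value, determining that the i-th of the m title fields matches the j-th of the n title fields;
    where, in the y-th correction, SIM_{0+y-1}(i,j) is corrected with the second preset algorithm according to the similarities SIM_s to obtain SIM_{0+y}(i,j), and y takes every positive integer not greater than Y.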
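  5. The method according to claim 3 or 4, characterized in that the second preset algorithm is Formula 1 as shown below:
    Figure PCTCN2016098771-appb-100001
    where SIM is the similarity to be corrected, SIM* is the corrected similarity, and a is a preset weight coefficient.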
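  6. The method according to claim 1, characterized in that computing, according to the first preset algorithm, the similarity between each of the m title fields and each of the n title fields comprises:
    obtaining the similarity between each of the m title fields and each of the n title fields according to the degree of overlap between the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form; or
    obtaining, from a third-party knowledge base, the similarity between each of the m title fields and each of the n title fields according to the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form; or
    obtaining the similarity between each of the m title fields and each of the n title fields according to the degree of overlap between the field value sets of the m title fields in the current form and the field value sets of the n title fields in the historical form.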
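  7. The method according to claim 6, characterized in that obtaining the similarity between each of the m title fields and each of the n title fields according to the degree of overlap between the field value sets of the m title fields in the current form and the field value sets of the n title fields in the historical form comprises:
    when the field value sets of the m title fields in the current form are discrete, determining the similarity by Formula 2 below:
    Figure PCTCN2016098771-appb-100002
    where
    Figure PCTCN2016098771-appb-100003
    denotes the field value set of the i-th title field, and
    Figure PCTCN2016098771-appb-100004
    denotes the field value set of the j-th title field; or
    when the field value sets of the m title fields in the current form are continuous, determining the similarity by Formula 3 below:
    Figure PCTCN2016098771-appb-100005
    where max1 denotes the maximum value of the i-th title field, min1 denotes the minimum value of the i-th title field, max2 denotes the maximum value of the j-th title field, and min2 denotes the minimum value of the j-th title field.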
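  8. A data cleaning apparatus, characterized in that the apparatus comprises:
    a historical form acquisition module, configured to select, from a historical form library, a historical form that has the same description object as a current form, the current form containing m title fields and the historical form containing n title fields, where m and n are positive integers;
    a similarity calculation module, configured to compute, according to a first preset algorithm, the similarity between each of the m title fields and each of the n title fields obtained by the historical form acquisition module;
    a matching module, configured to, for any similarity SIM(i,j) computed by the similarity calculation module, obtain the constraint of the j-th title field if the i-th title field and the j-th title field are determined to match according to a preset matching rule; where i denotes the i-th of the m title fields, j denotes the j-th of the n title fields, i takes every natural number not greater than m, and j takes every natural number not greater than n;
    a data cleaning module, configured to perform data cleaning on the data corresponding to the i-th title field that does not satisfy the constraint obtained by the matching module.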
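  9. The apparatus according to claim 8, characterized in that the matching module is specifically configured to:
    for any similarity SIM(i,j) computed by the similarity calculation module, if the similarity SIM(i,j) is greater than a first preset value, determine that the i-th of the m title fields matches the j-th of the n title fields, and then obtain the constraint of the j-th title field.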
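  10. The apparatus according to claim 8, characterized in that the matching module is specifically configured to:
    for any similarity SIM(i,j) computed by the similarity calculation module, if the similarity SIM(i,j) is not greater than a first preset value and m and n are both greater than 1, determine k title fields according to a preset field association relationship of the current form, and, for the s-th of the k title fields, determine the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field as determined according to the preset field association relationship of the current form, and k is less than m;
    correct SIM(i,j) with a second preset algorithm according to the similarities SIM_s to obtain the corrected similarity SIM_0(i,j);
    if SIM_0(i,j) is greater than the first preset value, determine that the i-th of the m title fields matches the j-th of the n title fields.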
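  11. The apparatus according to claim 8, characterized in that the matching module is specifically configured to:
    for any similarity SIM(i,j) computed by the similarity calculation module, if the similarity SIM(i,j) is not greater than a first preset value and m and n are both greater than 1, determine k title fields according to a preset field association relationship of the current form, and, for the s-th of the k title fields, determine the maximum similarity SIM_s among the similarities between the s-th title field and each of the n title fields, where s takes every natural number not greater than k, k is the total number of title fields associated with the i-th title field as determined according to the preset field association relationship of the current form, and k is less than m;
    correct SIM(i,j) with a second preset algorithm according to the similarities SIM_s to obtain the corrected similarity SIM_0(i,j);
    if SIM_0(i,j) is not greater than the first preset value, continue correcting SIM_0(i,j) Y more times to obtain SIM_{0+Y}(i,j), and if SIM_{0+Y}(i,j) is greater than the first preset value, determine that the i-th of the m title fields matches the j-th of the n title fields;
    where, in the y-th correction, SIM_{0+y-1}(i,j) is corrected with the second preset algorithm according to the similarities SIM_s to obtain SIM_{0+y}(i,j), and y takes every positive integer not greater than Y.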
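  12. The apparatus according to claim 10 or 11, characterized in that the second preset algorithm is Formula 1 as shown below:
    Figure PCTCN2016098771-appb-100006
    where SIM is the similarity to be corrected, SIM* is the corrected similarity, and a is a preset weight coefficient.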
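  13. The apparatus according to claim 8, characterized in that the similarity calculation module comprises:
    a first calculation unit, configured to obtain the similarity between each of the m title fields and each of the n title fields according to the degree of overlap between the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form; or
    a second calculation unit, configured to obtain, from a third-party knowledge base, the similarity between each of the m title fields and each of the n title fields according to the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form; or
    a third calculation unit, configured to obtain the similarity between each of the m title fields and each of the n title fields according to the degree of overlap between the field value sets of the m title fields in the current form and the field value sets of the n title fields in the historical form.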
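  14. The apparatus according to claim 13, characterized in that the third calculation unit is specifically configured to:
    when the field value sets of the m title fields in the current form are discrete, determine the similarity by Formula 2 below:
    Figure PCTCN2016098771-appb-100007
    where
    Figure PCTCN2016098771-appb-100008
    denotes the field value set of the i-th title field, and
    Figure PCTCN2016098771-appb-100009
    denotes the field value set of the j-th title field; or
    when the field value sets of the m title fields in the current form are continuous, determine the similarity by Formula 3 below:
    Figure PCTCN2016098771-appb-100010
    where max1 denotes the maximum value of the i-th title field, min1 denotes the minimum value of the i-th title field, max2 denotes the maximum value of the j-th title field, and min2 denotes the minimum value of the j-th title field.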
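  15. A data cleaning apparatus, characterized by comprising a memory, a processor, and a bus, the memory and the processor each being connected to the bus, wherein:
    the memory is configured to store data and program code;
    the processor is configured to read the program code stored in the memory and perform the data cleaning method according to any one of claims 1 to 7.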
PCT/CN2016/098771 2015-12-30 2016-09-12 Data cleaning method and apparatus WO2017113886A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201511022880.3 2015-12-30
CN201511022880.3A CN106933863B (zh) 2015-12-30 2015-12-30 Data cleaning method and apparatus

Publications (1)

Publication Number Publication Date
WO2017113886A1 true WO2017113886A1 (zh) 2017-07-06

Family

ID=59224469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/098771 WO2017113886A1 (zh) 2015-12-30 2016-09-12 Data cleaning method and apparatus

Country Status (2)

Country Link
CN (1) CN106933863B (zh)
WO (1) WO2017113886A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
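CN111639077A (zh) * 2020-05-15 2020-09-08 杭州数梦工场科技有限公司 Data governance method and apparatus, electronic device, and storage medium
CN112036144A (zh) * 2020-09-03 2020-12-04 广联达科技股份有限公司 Data parsing method and apparatus, computer device, and readable storage medium
CN113010517A (zh) * 2021-03-08 2021-06-22 中国工商银行股份有限公司 Data table management method and apparatus
CN114462736A (zh) * 2020-11-09 2022-05-10 中核核电运行管理有限公司 Intelligent experience-feedback recommendation method for nuclear power plant radiation work permit applications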

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
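CN108984708B (zh) * 2018-07-06 2022-02-01 蔚来(安徽)控股有限公司 Dirty data identification method and apparatus, data cleaning method and apparatus, and controller
CN110399463A (zh) * 2019-07-29 2019-11-01 国网河北省电力有限公司 Similarity matching method and apparatus for work tickets
CN111258968B (zh) * 2019-12-30 2020-09-11 广州博士信息技术研究院有限公司 Enterprise redundant data cleaning method and apparatus, and big data platform
CN111538464B (zh) * 2020-05-10 2021-05-07 浙江智飨科技有限公司 Data cleaning method and apparatus based on an Internet of Things platform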

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080005106A1 (en) * 2006-06-02 2008-01-03 Scott Schumacher System and method for automatic weight generation for probabilistic matching
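CN101739414A (zh) * 2008-11-25 2010-06-16 华中师范大学 Ontology concept mapping method
CN103257961A (zh) * 2012-02-15 2013-08-21 北大方正集团有限公司 Method, apparatus, and system for bibliographic deduplication
CN103473373A (zh) * 2013-09-29 2013-12-25 方正国际软件有限公司 Similarity analysis system and method based on a threshold matching model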

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324617A (zh) * 2012-03-20 2013-09-25 腾讯科技(深圳)有限公司 Method and system for identifying historical spam messages
CN104239304B (zh) * 2013-06-07 2018-08-21 华为技术有限公司 Data processing method, apparatus, and device
CN104021160B (zh) * 2014-05-26 2018-06-01 北京金山安全软件有限公司 Client data cleaning method and apparatus

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
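CN111639077A (zh) * 2020-05-15 2020-09-08 杭州数梦工场科技有限公司 Data governance method and apparatus, electronic device, and storage medium
CN111639077B (zh) * 2020-05-15 2024-03-22 杭州数梦工场科技有限公司 Data governance method and apparatus, electronic device, and storage medium
CN112036144A (zh) * 2020-09-03 2020-12-04 广联达科技股份有限公司 Data parsing method and apparatus, computer device, and readable storage medium
CN112036144B (zh) * 2020-09-03 2024-04-02 广联达科技股份有限公司 Data parsing method and apparatus, computer device, and readable storage medium
CN114462736A (zh) * 2020-11-09 2022-05-10 中核核电运行管理有限公司 Intelligent experience-feedback recommendation method for nuclear power plant radiation work permit applications
CN113010517A (zh) * 2021-03-08 2021-06-22 中国工商银行股份有限公司 Data table management method and apparatus
CN113010517B (zh) * 2021-03-08 2024-02-09 中国工商银行股份有限公司 Data table management method and apparatus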

Also Published As

Publication number Publication date
CN106933863A (zh) 2017-07-07
CN106933863B (zh) 2019-04-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16880679

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16880679

Country of ref document: EP

Kind code of ref document: A1