CN113837278B

CN113837278B - Method and device for detecting dirty data

Info

Publication number: CN113837278B
Application number: CN202111123840.3A
Authority: CN
Inventors: 林文楷; 连志阳; 陈文艺; 鄢小征; 魏超; 蓝坤宏
Original assignee: Xiamen Meiya Pico Information Co Ltd
Current assignee: Xiamen Meiya Pico Information Co Ltd
Priority date: 2021-09-24
Filing date: 2021-09-24
Publication date: 2022-06-28
Anticipated expiration: 2041-09-24
Also published as: CN113837278A

Abstract

The invention provides a method and a device for detecting dirty data. The attribute types of the original data are normalized and the attribute features are then analyzed, so that original data items of definite type are distinguished from those of undefined type and each is matched with a suitable dirty data detection scheme. In addition, the original data are classified in several different ways; after detection with the matched scheme, the dirty data proportion of each class is counted, the detection scheme is adjusted according to that proportion, and the proportion of each class is counted again. For the same data item, the scheme that yields the highest dirty data proportion is selected as the scheme to be executed preferentially. The method can quickly and accurately identify dirty data in massive original data, greatly improve the analysis and utilization value of big data, and reduce the construction cost of a big data system.

Description

Method and device for detecting dirty data

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for detecting dirty data.

Background

Every day a big data system must ingest massive amounts of raw data of many types with weak causal relationships, including network logs, pictures, geographic locations and the like. This raw data is generated very quickly and contains a large amount of dirty data, while the big data system places strict requirements on processing speed. Traditional dirty data detection can only preset detection rules manually for raw data of known type. Raw data, however, is poorly correlated and often has undefined attribute types, so when it is parsed and put into storage, much of it matches no detection rule. A large amount of dirty data therefore enters the big data system and seriously degrades the quality of the services built on it. How to purify the raw data quickly and accurately, and so reduce the dirty data proportion of the final data assets, is therefore key to supporting big data business efficiently.

To meet this practical need, the method and device for dirty data detection use an intelligent identification algorithm to attach labels of different dimensions to original data items of both definite and undefined type, and dynamically adjust the dirty data detection rules according to those labels, preventing dirty data from entering the big data system. A detection scheme adjustment algorithm samples and analyzes the dirty data proportion of each original data item and dynamically adjusts the item's detection rules according to that proportion, which reduces unnecessary detection links, improves the warehousing efficiency of the original data, and improves the service support capability of big data.

Because the original data entering a big data system is poorly correlated and often has undefined attribute types, and the dirty data detection products currently on the market preset detection rules manually for original data of known type, these technologies have the following defects:

1) The dirty data detection mode is single: matching can be performed only through a single template or regular expression; the meaning of an original data item's attributes cannot be analyzed automatically, nor can a corresponding detection rule be configured automatically.

2) The range of dirty data detection is small: detection rules can be preset only for original data items of definite type, and items of undefined type cannot be detected, so that a large amount of dirty data enters the big data system and degrades the quality of big data services.

In view of these problems, the invention provides a method and a device for detecting dirty data, which mainly use an intelligent identification algorithm and a detection scheme adjustment algorithm to improve the accuracy and efficiency of dirty data detection in massive data, reduce the dirty data entering the big data system, and improve the quality of big data services.

Disclosure of Invention

The present invention provides a method and apparatus for dirty data detection, which solves the above mentioned drawbacks of the prior art.

In one aspect, the present invention provides a method of dirty data detection, the method comprising:

S1: performing attribute normalization on historical original data that has entered a big data system, and, according to the dirty data detection schemes the big data system applied to that historical original data, constructing a feature detection rule base that stores the correspondence between data items of different standard fields and the dirty data detection schemes matched with them;

the attribute normalization comprises: extracting field attributes and the specific information of the field attributes, and normalizing fields that describe the same attribute type under the same standard field name;

S2: when original data to be detected enters the big data system, performing attribute normalization on the original data to be detected, and dividing it into data items of definite type and data items of undefined type;

S3: selecting, according to the feature detection rule base, a dirty data detection scheme matched with the data items of definite type;

finding the feature rule of the data items of undefined type by analyzing their attribute semantics, attribute types and value distribution, matching a dirty data detection scheme to them according to the feature rule, and adding the association between the data items of undefined type and the matched dirty data detection scheme to the feature detection rule base;

S4: selecting, according to the feature detection rule base, a matched dirty data detection scheme for each data item in the original data to be detected as its recommended dirty data detection scheme, and detecting each data item with its recommended scheme; classifying the data items in the original data to be detected according to different classification modes, and calculating the dirty data proportion of each class of data items under each classification mode, wherein the dirty data proportion is the proportion of data items detected as dirty data;

S5: under each classification mode, adjusting the recommended dirty data detection scheme of each class of data items according to the dirty data proportion of that class, so that each class obtains a different adjusted dirty data detection scheme for each classification mode; detecting each class of data items again with the adjusted schemes; and taking the scheme that yields the highest dirty data proportion as the dirty data detection scheme to be executed preferentially.

According to the method, when original data is accessed by the big data system, its attribute types are normalized and attribute feature analysis is then performed on the normalized data, so that original data items of definite type are distinguished from those of undefined type and the original data is matched with a suitable dirty data detection scheme according to the result, which effectively prevents dirty data from entering the big data system. In addition, the original data are classified in several different ways; after detection with the matched scheme, the dirty data proportion of each class is counted, the detection scheme is adjusted according to that proportion, and the dirty data proportion of each class is counted again; finally, for the same data item, the scheme that yields the highest dirty data proportion is selected as the scheme to be executed preferentially. Because the detection rules for the data items are adjusted dynamically according to the dirty data proportion, unnecessary detection links are reduced and the warehousing efficiency of the original data is improved.

In a specific embodiment, classifying the data items in the original data to be detected according to different classification modes, and calculating the dirty data proportion of each class of data items under each classification mode, comprises classifying according to attribute type and classifying according to the historical and current dirty data proportions, specifically as follows:

classification according to attribute type: the data items are classified by attribute type, and the dirty data proportion within each attribute type is counted to obtain a first dirty data proportion;

classification according to the historical and current dirty data proportions: the historical dirty data proportion and the current dirty data proportion of the original data to be detected are counted separately, and the larger of the two is taken as a second dirty data proportion. Dynamically adjusting the dirty data detection scheme on the basis of these different classification modes can improve processing efficiency and reduce the error probability of dirty data detection.
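As a non-limiting illustration, the two proportions described above might be computed as in the following sketch. The record shape, the function names firstDirtyProportions and secondDirtyProportion, and the sample values are assumptions made for this example; they do not come from the original disclosure.

```javascript
// Assumed record shape: { attributeType: string, dirty: boolean }, where `dirty`
// is the verdict already produced by the recommended detection scheme.

// First dirty data proportion: one ratio per attribute type.
function firstDirtyProportions(items) {
  const counts = new Map();
  for (const { attributeType, dirty } of items) {
    const c = counts.get(attributeType) || { total: 0, dirty: 0 };
    c.total += 1;
    if (dirty) c.dirty += 1;
    counts.set(attributeType, c);
  }
  const ratios = {};
  for (const [type, c] of counts) ratios[type] = c.dirty / c.total;
  return ratios;
}

// Second dirty data proportion: the larger of the historical and current ratios.
function secondDirtyProportion(historicalRatio, items) {
  const dirtyCount = items.filter((item) => item.dirty).length;
  const currentRatio = items.length === 0 ? 0 : dirtyCount / items.length;
  return Math.max(historicalRatio, currentRatio);
}

// Made-up batch: half of the phone items are dirty, no email item is.
const batch = [
  { attributeType: "phone", dirty: true },
  { attributeType: "phone", dirty: false },
  { attributeType: "email", dirty: false },
  { attributeType: "email", dirty: false },
];
const first = firstDirtyProportions(batch);
const second = secondDirtyProportion(0.1, batch);
```

Here the historical ratio is passed in as a plain number; how that figure is stored and retrieved is left open by this sketch.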

In a specific embodiment, the specific step of S5 includes:

step one: according to the range in which the first dirty data proportion falls, selecting a matching adjustment rule for the dirty data detection scheme that the feature detection rule base matches to the data items of each attribute type, and adjusting the scheme by that rule to obtain an adjusted dirty data detection scheme, denoted the first detection scheme;

according to the range in which the second dirty data proportion falls, selecting a matching adjustment rule for the dirty data detection scheme that the feature detection rule base matches to the data items in the original data to be detected, and adjusting the scheme by that rule to obtain an adjusted dirty data detection scheme, denoted the second detection scheme;

step two: performing dirty data detection on the original data to be detected with the first detection scheme and with the second detection scheme, and, by calculation and comparison, selecting the scheme that yields the highest dirty data proportion as the dirty data detection scheme to be executed preferentially. These steps can improve detection efficiency and reduce the consumption of computing resources.
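The comparison in step two can be sketched as follows, purely as an illustration. A detection scheme is modelled here as a predicate that returns true for a dirty value; that modelling, the names dirtyRatio and pickPreferredScheme, and the two stand-in schemes are assumptions of this example and are not taken from the original disclosure.

```javascript
// Proportion of values that a scheme flags as dirty.
function dirtyRatio(scheme, values) {
  if (values.length === 0) return 0;
  return values.filter((v) => scheme(v)).length / values.length;
}

// Run every candidate scheme and keep the one with the highest dirty data proportion.
// On a tie the earlier candidate is kept (an arbitrary choice of this sketch).
function pickPreferredScheme(candidates, values) {
  let best = null;
  for (const [name, scheme] of Object.entries(candidates)) {
    const ratio = dirtyRatio(scheme, values);
    if (best === null || ratio > best.ratio) best = { name, ratio };
  }
  return best;
}

// Hypothetical stand-ins for the first and second detection schemes.
const candidates = {
  firstScheme: (v) => v.trim() === "",             // flags empty values only
  secondScheme: (v) => !/^\d{11}$/.test(v.trim()), // flags anything that is not 11 digits
};
const values = ["13812345678", "", "abc", "15912345678"];
const preferred = pickPreferredScheme(candidates, values);
```

With these made-up inputs the second scheme flags two of four values and the first flags one, so the second scheme would be the one executed preferentially.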

In a specific embodiment, constructing the feature detection rule base, which stores the correspondence between data items of different standard fields and the dirty data detection schemes matched with them, according to the dirty data detection schemes applied to the historical original data by the big data system comprises:

acquiring, from the historical information of the big data system, the dirty data detection schemes used to detect the historical original data, and organizing and storing them together with the standard fields of the historical original data and the specific contents of each standard field, thereby constructing the feature detection rule base.

In a specific embodiment, the field attribute specifically includes: data source, resource name, field naming, comments, type, and length.
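By way of illustration only, attribute normalization over these field attributes might look like the sketch below. The alias table STANDARD_FIELDS, the function normalizeField and all sample field names are hypothetical; the disclosure does not specify how the mapping to standard fields is stored.

```javascript
// Hypothetical alias table: source-specific field names -> standard field.
const STANDARD_FIELDS = {
  phone: ["phone", "mobile", "tel", "sjhm"],
  idCard: ["id_card", "idno", "sfzh"],
};

// Attach a standard field to one field-attribute record, or null when none is known.
function normalizeField(field) {
  const name = field.fieldName.toLowerCase();
  for (const [standard, aliases] of Object.entries(STANDARD_FIELDS)) {
    if (aliases.includes(name)) return { ...field, standardField: standard };
  }
  return { ...field, standardField: null };
}

// Field attribute set with the six attributes listed above (values are made up).
const fieldAttributes = [
  { dataSource: "crm", resourceName: "customers", fieldName: "Mobile", comment: "contact number", type: "varchar", length: 11 },
  { dataSource: "tickets", resourceName: "bookings", fieldName: "SFZH", comment: "", type: "varchar", length: 18 },
  { dataSource: "tickets", resourceName: "bookings", fieldName: "remark", comment: "", type: "text", length: 255 },
];
const normalized = fieldAttributes.map(normalizeField);
```

Fields from different sources that describe the same attribute ("Mobile", "sjhm") end up under one standard field, which is the point of the normalization step.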

In a specific embodiment, the feature detection rule base stores data items in the form of a database or a data table, and specifically includes the following data:

an attribute type of the data item;

the standard field of a data item;

a set of dirty data detection schemes to which the data items match;

a data source for the data item.
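One possible in-memory shape for such a rule base is sketched below for illustration. The property names, the sample rows and the helper functions lookupSchemes and addAssociation are assumptions of this example; the disclosure only requires that the four kinds of data listed above be stored.

```javascript
// Hypothetical rows of a feature detection rule base.
const featureRuleBase = [
  { attributeType: "contact", standardField: "phone", schemes: ["checkPhone"], dataSource: "tickets" },
  { attributeType: "contact", standardField: "email", schemes: ["checkEmail"], dataSource: "tickets" },
];

// Return the schemes matched to a standard field from a given source (empty if unknown).
function lookupSchemes(ruleBase, standardField, dataSource) {
  const row = ruleBase.find(
    (r) => r.standardField === standardField && r.dataSource === dataSource
  );
  return row ? row.schemes : [];
}

// Record a new association, for example once a scheme has been matched
// to a data item whose type was initially undefined.
function addAssociation(ruleBase, row) {
  ruleBase.push(row);
}
```

A lookup that returns an empty list would signal that the data item has no matched scheme yet and needs further analysis.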

In a specific embodiment, dividing the original data to be detected into data items of definite type and data items of undefined type specifically includes: matching the data items in the original data to be detected against regular expressions, and dividing them into data items of definite type and data items of undefined type according to the result.

In a specific embodiment, the different classification modes further include: classification according to the data source of the data item.

In a specific embodiment, the recommended dirty data detection scheme corresponding to each class of data items is adjusted according to the dirty data proportion of that class on the basis of a preset detection scheme adjustment rule base, which is a database representing the correspondence between data items of different attribute types and the adjustment rules matched with them, and which specifically includes the following data:

an attribute type of the data item;

the standard field of a data item;

a data source of the data item;

adjustment rules corresponding to the data items, each comprising a range of dirty data proportion and the execution scheme corresponding to that range, wherein the execution scheme is one of: skipping the detection link, executing the detection link, and stopping entry into the big data system.
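As a hedged illustration, selecting an execution scheme from the range in which a dirty data proportion falls might be sketched as follows. The thresholds, the half-open ranges, the action labels and the fallback to "detect" are all assumptions of this example; the disclosure does not give concrete values.

```javascript
// Hypothetical adjustment rules for one standard field; ranges are [start, end).
const adjustmentRules = [
  { standardField: "phone", start: 0.0, end: 0.01, action: "skip" },   // skip the detection link
  { standardField: "phone", start: 0.01, end: 0.5, action: "detect" }, // execute the detection link
  { standardField: "phone", start: 0.5, end: 1.01, action: "block" },  // stop entry into the system
];

// Pick the execution scheme whose range contains the given dirty data proportion.
function selectAction(rules, standardField, dirtyProportion) {
  const rule = rules.find(
    (r) =>
      r.standardField === standardField &&
      dirtyProportion >= r.start &&
      dirtyProportion < r.end
  );
  // Falling back to "detect" when no rule matches is an assumption of this sketch.
  return rule ? rule.action : "detect";
}
```

For example, selectAction(adjustmentRules, "phone", 0.005) selects "skip", so a field that is almost never dirty would bypass detection, while a proportion of 0.8 selects "block".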

In a specific embodiment, the S4 further includes:

sampling the various data items, detecting the sampled various data items and calculating the dirty data proportion;

wherein the sampling comprises: intercepting data streams of a certain length from the front and from the back of each class of data items, extracting the complete records in those data streams to form a sample data set, classifying the elements of the sample data set according to the different classification modes, and building from them a data block table that stores the field attribute name, total count, dirty data proportion and data content of the elements.
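The sampling step might be sketched as below, as an illustration only. The functions sampleHeadTail and buildDataBlock, the choice of taking the same number of records from both ends, and the stand-in dirty check are assumptions of this example.

```javascript
// Take `n` records from the front and `n` from the back of a batch, without overlap.
function sampleHeadTail(records, n) {
  if (records.length <= 2 * n) return records.slice();
  return [...records.slice(0, n), ...records.slice(records.length - n)];
}

// Build one data-block row for a field: name, total, dirty proportion and content.
function buildDataBlock(fieldName, sample, isDirty) {
  const dirtyCount = sample.filter((v) => isDirty(v)).length;
  return {
    fieldName,
    total: sample.length,
    dirtyProportion: sample.length === 0 ? 0 : dirtyCount / sample.length,
    content: sample,
  };
}

// Made-up stream of values for one field; empty strings stand in for dirty data.
const stream = ["a1", "", "a3", "a4", "a5", "a6", "", "a8"];
const sample = sampleHeadTail(stream, 2);
const block = buildDataBlock("remark", sample, (v) => v === "");
```

Sampling only the two ends keeps the cost of estimating the dirty data proportion independent of the batch size, at the price of missing dirty data concentrated in the middle of the stream.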

According to a second aspect of the present invention, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a computer processor, carries out the above-mentioned method.

According to a third aspect of the present invention, a system for dirty data detection is provided, the system comprising:

an attribute normalization module, configured to perform attribute normalization on historical original data that has entered a big data system, and to construct, according to the dirty data detection schemes applied to the historical original data by the big data system, a feature detection rule base storing the correspondence between data items of different standard fields and the dirty data detection schemes matched with them;

the attribute normalization comprises: extracting field attributes and the specific information of the field attributes, and normalizing fields that describe the same attribute type under the same standard field name;

an attribute feature analysis module, configured to perform attribute normalization on original data to be detected when it enters the big data system, and then to divide it into data items of definite type and data items of undefined type;

a detection scheme matching module, configured to select, according to the feature detection rule base, a dirty data detection scheme matched with the data items of definite type;

and to find the feature rule of the data items of undefined type by analyzing their attribute semantics, attribute types and value distribution, match a dirty data detection scheme to them according to the feature rule, and add the association between the data items of undefined type and the matched dirty data detection scheme to the feature detection rule base;

a dirty data proportion calculation module, configured to select, according to the feature detection rule base, a matched dirty data detection scheme for each data item in the original data to be detected as its recommended dirty data detection scheme, and to detect each data item with its recommended scheme; and to classify the data items in the original data to be detected according to different classification modes and calculate the dirty data proportion of each class of data items under each classification mode, wherein the dirty data proportion is the proportion of data items detected as dirty data;

a detection scheme adjustment module, configured to adjust, under each classification mode, the recommended dirty data detection scheme of each class of data items according to the dirty data proportion of that class, so that each class obtains a different adjusted dirty data detection scheme for each classification mode; to detect each class of data items again with the adjusted schemes; and to take the scheme that yields the highest dirty data proportion as the dirty data detection scheme to be executed preferentially.

When the big data system accesses original data, the invention normalizes the attribute types of the original data and then analyzes the attribute features of the normalized data, so that original data items of definite type are distinguished from those of undefined type and the original data is matched with a suitable dirty data detection scheme according to the result, which effectively prevents dirty data from entering the big data system. In addition, the original data are classified in several different ways; after detection with the matched scheme, the dirty data proportion of each class is counted, the detection scheme is adjusted according to that proportion, and the dirty data proportion of each class is counted again; finally, for the same data item, the scheme that yields the highest dirty data proportion is selected as the scheme to be executed preferentially. Because the detection rules for the data items are adjusted dynamically according to the dirty data proportion, unnecessary detection links are reduced and the warehousing efficiency of the original data is improved.

Drawings

The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the invention. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram to which the present application may be applied;

FIG. 2 is a flow diagram of a method of dirty data detection according to an embodiment of the present invention;

FIG. 3 is a feature detection rule base of a specific embodiment of the present invention;

FIG. 4 is a detection scheme adjustment rule base according to an embodiment of the present invention;

FIG. 5 is a flow diagram for intelligent identification of attributes of data items in accordance with a specific embodiment of the present invention;

FIG. 6 is a flow chart of detection scheme adjustment for a specific embodiment of the present invention;

FIG. 7 is a block diagram of a system for dirty data detection according to an embodiment of the present invention;

FIG. 8 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Fig. 1 illustrates an exemplary system architecture 100 to which a method of dirty data detection of an embodiment of the present application may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various applications, such as a data processing application, a data visualization application, a web browser application, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.

The server 105 may be a server that provides various services, such as a background information processing server that provides support for raw data presented on the terminal devices 101, 102, 103. The background information processing server may process the acquired adjustment scheme and generate a processing result (e.g., an adjusted dirty data detection scheme).

It should be noted that the method provided in the embodiment of the present application may be executed by the server 105 or by the terminal devices 101, 102, 103, and the corresponding apparatus is generally disposed in the server 105, or may be disposed in the terminal devices 101, 102, 103.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 shows a flowchart of a method of dirty data detection according to an embodiment of the present invention. As shown in fig. 2, the method comprises the steps of:

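S1: performing attribute normalization on historical original data that has entered the big data system, and constructing, according to the dirty data detection schemes applied to the historical original data by the big data system, a feature detection rule base that stores the correspondence between data items of different standard fields and the dirty data detection schemes matched with them.

In a specific embodiment, constructing the feature detection rule base comprises: acquiring, from the historical information of the big data system, the dirty data detection schemes used to detect the historical original data, and organizing and storing them together with the standard fields of the historical original data and the specific contents of each standard field.

The attribute normalization comprises: extracting field attributes and the specific information of the field attributes, and normalizing fields that describe the same attribute type under the same standard field name, which resolves the processing differences caused by the different naming conventions of different sources. In a specific embodiment, the field attributes specifically include: data source, resource name, field naming, comments, type, and length.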

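In a specific embodiment, the feature detection rule base stores data items in the form of a database or a data table, and specifically includes the following data:

an attribute type of the data item;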

the standard field of a data item;

a set of dirty data detection schemes to which the data items match;

a data source for the data item.

Fig. 3 shows a feature detection rule base according to a specific embodiment of the present invention, constructed by the method described in S1; an example of the construction code is as follows:

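// Rule base entry from the source, pairing "phone" with the checkPhone scheme.
const phoneRule = { LCBC: "phone", clyc: checkPhone };

function checkPhone() {
  // callback() with no argument means the value passed.
  const validatePhone = (rule, value, callback) => {
    const phone = value.replace(/\s/g, ""); // strip whitespace
    // Prefixes 13x, 170-171, 176-178, 15x except 154, 18x, then 8 digits.
    const regs = /^((13[0-9])|(17[0-16-8])|(15[0-35-9])|(18[0-9]))\d{8}$/;
    if (value.length === 0) {
      callback(); // empty values are not checked here
    } else if (!regs.test(phone)) {
      callback([new Error("error")]);
    } else {
      callback();
    }
  };
  return validatePhone;
}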

S2: when original data to be detected enters the big data system, performing attribute normalization on the original data to be detected, and then dividing it into data items of definite type and data items of undefined type.

In a specific embodiment, dividing the original data to be detected into data items of definite type and data items of undefined type specifically includes: matching the data items in the original data to be detected against regular expressions. For example, a regular rule base is preset and the data items are matched against its regular expressions: data items conforming to /\w+([-+]\w+)*@\w+([-]\w+)*/ are e-mail address data items, that is, identified data items of definite type, and data items that match nothing in the regular rule base are data items of undefined type.
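A minimal sketch of this division is given below for illustration. The two patterns are commonly used illustrative expressions rather than the ones in the disclosure, and the rule that an item is of definite type only when every non-empty sampled value matches a single pattern is an assumption of this example, as are the names regexRuleBase and classifyItem.

```javascript
// Hypothetical regular rule base; patterns are illustrative only.
const regexRuleBase = [
  { type: "email", pattern: /^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$/ },
  { type: "phone", pattern: /^1[3-9]\d{9}$/ },
];

// Classify one data item from its sampled values.
function classifyItem(values) {
  const nonEmpty = values.filter((v) => v !== "");
  for (const { type, pattern } of regexRuleBase) {
    if (nonEmpty.length > 0 && nonEmpty.every((v) => pattern.test(v))) {
      return { definite: true, type };
    }
  }
  // No pattern in the rule base fits: the item is of undefined type.
  return { definite: false, type: null };
}
```

For example, classifyItem(["a.b@example.com", "x@y.org"]) is classified as a definite "email" item, while an item with free-text values falls through to the undefined type.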

S3: selecting, according to the feature detection rule base, a dirty data detection scheme matched with the data items of definite type; and finding the feature rule of the data items of undefined type by analyzing their attribute semantics, attribute types and value distribution, matching a dirty data detection scheme to them according to the feature rule, and adding the association between the data items of undefined type and the matched dirty data detection scheme to the feature detection rule base.

For example, suppose the data items 'identity card number' and 'contact way' of a train ticket booking information table are being put into storage. 'Identity card number' is an attribute of definite type, so the dirty data detection engine corresponding to it in the feature detection rule base is called directly. 'Contact way' is an attribute of undefined type; semantic analysis yields the keyword 'contact', from which its candidate detection schemes are obtained, namely 'mobile phone number / fixed phone / mailbox' and the like.
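The keyword step of this example can be sketched as follows. This is a deliberately crude stand-in for semantic analysis: the keyword table keywordSchemes, the scheme names and the function candidateSchemes are hypothetical, and a real implementation would also weigh the attribute type and the value distribution mentioned above.

```javascript
// Hypothetical keyword table linking attribute semantics to candidate schemes.
const keywordSchemes = {
  contact: ["checkPhone", "checkLandline", "checkEmail"],
  identity: ["checkIdCard"],
};

// Look for a known keyword in the field name or comment and return its candidate schemes.
function candidateSchemes(field) {
  const text = `${field.fieldName} ${field.comment}`.toLowerCase();
  for (const [keyword, schemes] of Object.entries(keywordSchemes)) {
    if (text.includes(keyword)) return schemes;
  }
  return []; // no keyword recognised
}
```

A field named "contact_way" would receive the three contact-related candidates, which could then be narrowed by testing each candidate against the sampled values.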

S4: selecting a matched dirty data detection scheme as a recommended dirty data detection scheme for each data item in the original data to be detected according to the characteristic detection rule base, and detecting each data item in the original data to be detected by using the recommended dirty data detection scheme; classifying each data item in the original data to be detected based on different classification modes, and calculating the dirty data proportion of each type of data item under each different classification mode, wherein the dirty data proportion is the proportion of the data item detected as dirty data.

classification according to the historical and current dirty data proportions: the historical dirty data proportion and the current dirty data proportion of the original data to be detected are counted separately, and the larger of the two is taken as a second dirty data proportion.

In a specific embodiment, the S4 further includes:

sampling the various data items, detecting the various sampled data items and calculating a dirty data proportion;

S5: under each classification mode, adjusting the recommended dirty data detection scheme of each class of data items according to the dirty data proportion of that class, so that each class obtains a different adjusted dirty data detection scheme for each classification mode; detecting each class of data items again with the adjusted schemes; and taking the scheme that yields the highest dirty data proportion as the dirty data detection scheme to be executed preferentially.

In a specific embodiment, the specific step of S5 includes:

Step two: and performing dirty data detection on the original data to be detected by using the first detection scheme and the second detection scheme respectively, and selecting a scheme used when the dirty data proportion is the highest as a dirty data detection scheme to be executed preferentially through calculation and comparison.

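In a specific embodiment, the adjustment is based on a preset detection scheme adjustment rule base, which specifically includes the following data:

an attribute type of the data item;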

the standard field of a data item;

a data source of the data item;

adjustment rules corresponding to the data items, each comprising a range of dirty data proportion and the execution scheme selected for that range, wherein the execution scheme is one of: skipping the detection link, executing the detection link, and stopping entry into the big data system.

Fig. 4 shows a detection scheme adjustment rule base of a specific embodiment of the present invention, and as shown in the figure, the detection scheme adjustment rule base records the file Id of the data item, the data source, the standard field, the start proportion and the end proportion of the adjustment rule range, the execution scheme corresponding to the adjustment rule, and the available status.

Based on the feature detection rule base shown in fig. 3 and the detection scheme adjustment rule base shown in fig. 4, a complete process of the present invention can be divided into two parts, namely: intelligent identification of data item attributes and adjustment of detection schemes, fig. 5 shows a flow chart of intelligent identification of data item attributes according to a specific embodiment of the present invention, fig. 6 shows a flow chart of adjustment of detection schemes according to a specific embodiment of the present invention, and a specific embodiment of the present invention is described in detail below based on fig. 5 and fig. 6.

The steps of intelligent identification of data item attributes shown in fig. 5 are as follows:

Attribute normalization 501: when the original data is put into storage, the field attribute information corresponding to the original data is extracted to form a field attribute set P to be analyzed, whose elements are the data source, resource name, field naming, comment, type and length; the set P is traversed and the field naming is standardized, for example, the standard field for a name is DE 00002.

Attribute feature analysis 502: the attribute feature analysis algorithm uses regular expressions to analyze the original data items and divides them into two types, namely a definite type and an indefinite type, wherein data items of the definite type can be matched with a preset regular expression, and all other data items, which cannot be matched with any preset regular expression, belong to the indefinite type;

for the definite type: a dirty data detection scheme matching the data items of the definite type is selected according to the feature detection rule base; for the indefinite type: the feature rule of the data items of the indefinite type is found by analyzing their attribute semantics, attribute type and value distribution, a dirty data detection scheme is matched to them according to the feature rule, and the association between the data items of the indefinite type and the matched dirty data detection scheme is added to the feature detection rule base.
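The split into definite and indefinite types can be illustrated with a minimal Python sketch. The pattern names, the patterns themselves and the `min_ratio` threshold are assumptions made for illustration only; the patent does not list its preset regular expressions or say how many sampled values must match.

```python
import re

# Hypothetical preset patterns (not taken from the patent).
PRESET_PATTERNS = {
    "id_card_18": re.compile(r"\d{17}[\dXx]"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}"),
}


def classify_item(values, min_ratio=0.8):
    """Classify one data item from a sample of its values.

    Returns ("definite", pattern_name) when at least min_ratio of the
    non-empty samples fully match one preset pattern, otherwise
    ("indefinite", None). The threshold is an assumption of this sketch.
    """
    samples = [v for v in values if v]
    if not samples:
        return "indefinite", None
    for name, pattern in PRESET_PATTERNS.items():
        hits = sum(1 for v in samples if pattern.fullmatch(v))
        if hits / len(samples) >= min_ratio:
            return "definite", name
    return "indefinite", None
```

In this sketch an item classified as indefinite would then go through the semantic analysis described above, which is not modelled here.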

The specific algorithm of the attribute feature analysis 502 is described as follows:

Traverse the field attribute set P in combination with the feature detection rule base { ① if (feature detection rule base.standard field == [P].field naming) { obtain the dirty data detection engine of the data item: [P].clyc = feature detection rule base.processing engine }; ② if (feature detection rule base.standard field != [P].field naming) { extract the key noun GJC appearing in the field naming by means of the NLP engine, then verify whether the extracted key noun is correct in combination with [P].attribute type and length; obtain the dirty data detection engine of the data item: [P].clyc = feature detection rule base.processing engine } }.

Save detection scheme 503: a set P of data item dirty data detection schemes is saved.

The quality of original data items from the same source is often stable and changes in stages. For example, the data item "identity card number of the ticket booker" in a train ticket booking information table is normally either entirely correct or entirely wrong, or starts to contain errors from a certain point in time. The invention therefore not only provides an intelligent attribute identification algorithm for the incoming original data, but also, on the basis of the dirty data detection scheme generated by that algorithm, obtains a final dirty data detection scheme by analyzing the proportion of dirty data, recommending a suitable detection scheme and adjusting the detection scheme, thereby improving the detection efficiency of dirty data in massive data and reducing the construction cost of a big data system. Based on the above concept, the steps of the detection scheme adjustment shown in fig. 6 are as follows:

Calculating dirty data proportion 601: intercept the first 1M and the last 1M of content of the original data stream (if the data stream is smaller than 2M, take the whole data) to form a sample data set S, wherein S comprises n subsets {S1, S2, …, Sn}, and store the field attribute name, the total number, the correct number (namely the number of entries that are not dirty data), the correct rate (namely the proportion of data that is not dirty data) and the content in a data block table; traverse S { ① obtain [P].processing engine according to [S].field attribute name = [P].field attribute name; ② if ([S].content is not null) { execute [P].processing engine on [S].content }; ③ if the detection is passed, [S].correct number = [S].correct number + 1 }; after the traversal detection is finished, calculate the correct rate of each data item, CurZZQL = [S].correct number / [S].total number, and store it in the set S. For example, if Email = {lwk@163.com, test@qq.com, 1356@sohu.com, 112222, 5555}, then CurZZQL is 60%.
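The correct rate calculation in step 601 can be reproduced with a short Python sketch. The function names and the email pattern are hypothetical stand-ins for the detection engine; empty content is counted in the total but not as correct, which is an assumption, since the text only says that the engine is executed when the content is not null.

```python
import re

# Hypothetical stand-in for the email detection engine.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")


def email_engine(value):
    """Return True when the value looks like an email address."""
    return EMAIL_RE.fullmatch(value) is not None


def correct_rate(contents, engine):
    """Return CurZZQL = correct number / total number for one data item."""
    total = len(contents)
    if total == 0:
        return 0.0  # assumption: an empty sample has a correct rate of 0
    # The engine runs only on non-empty content, as in step ② above.
    correct = sum(1 for c in contents if c and engine(c))
    return correct / total


emails = ["lwk@163.com", "test@qq.com", "1356@sohu.com", "112222", "5555"]
print(f"{correct_rate(emails, email_engine):.0%}")  # prints 60%
```

Three of the five sample values pass the check, which gives the 60% stated in the example.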

Adjusting the detection scheme 602: the algorithm has two core ideas. First, a suitable dirty data detection scheme is recommended according to the attribute type, in combination with the range in which the accuracy falls; second, the historical accuracy and the current accuracy of the data item are considered together, and a suitable dirty data detection scheme is recommended using the lower of the two values as the reference.

According to different attribute types and in combination with the range in which the accuracy falls, a suitable dirty data detection scheme is recommended. A specific example is as follows: for the data item "identification card number" in the train ticket booking information table, the accuracy ranges are set as: above 98%, the adjustment rule is "detection not required"; 50% to 98%, the adjustment rule is "execute detection"; below 50%, the adjustment rule is "stop warehousing".

Meanwhile, the historical accuracy and the current accuracy of the data item are considered together, and a suitable dirty data detection scheme is recommended using the lower of the two values as the reference. A specific example is as follows: the current accuracy is 99% and the historical accuracy is 79%; the lower value of 79% is used to adjust the detection scheme, so the resulting adjustment rule is "execute detection".

The specific algorithm for adjusting the detection scheme 602 is described as follows: traverse the set S { ① according to feature rule table.standard field BZZD = [S].field attribute name, obtain the historical accuracy PreZZQL = feature rule table.ZZQL; ② obtain the final accuracy ZZQL = Min(PreZZQL, [S].CurZZQL); ③ according to detection scheme adjustment rule base.standard field = [S].field attribute name, with ZZQL falling between the start proportion and the end proportion of the detection scheme adjustment rule base, obtain the execution scheme ZCHA = detection scheme adjustment rule base.execution scheme, and store it in the set S }.
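A minimal Python sketch of this adjustment step follows. The `AdjustRule` record, the field name `id_card_number` and the fallback scheme are hypothetical; the thresholds follow the identification card number example above, and the order of the rules is an assumption used to decide which range owns the 50% and 98% boundaries.

```python
from dataclasses import dataclass


@dataclass
class AdjustRule:
    """One row of a hypothetical detection scheme adjustment rule base."""
    standard_field: str
    start: float  # start proportion of the accuracy range
    end: float    # end proportion of the accuracy range
    scheme: str   # execution scheme selected inside the range


# The first matching rule wins, so the boundary values 50% and 98%
# fall into "execute detection" (an assumption of this sketch).
RULES = [
    AdjustRule("id_card_number", 0.50, 0.98, "execute detection"),
    AdjustRule("id_card_number", 0.98, 1.00, "detection not required"),
    AdjustRule("id_card_number", 0.00, 0.50, "stop warehousing"),
]


def adjust_scheme(field, pre_zzql, cur_zzql, rules=RULES):
    """Pick the execution scheme from the lower of the two accuracies."""
    zzql = min(pre_zzql, cur_zzql)  # step ②: final accuracy
    for rule in rules:              # step ③: range lookup
        if rule.standard_field == field and rule.start <= zzql <= rule.end:
            return rule.scheme
    return "execute detection"  # assumption: default when no rule matches


print(adjust_scheme("id_card_number", pre_zzql=0.79, cur_zzql=0.99))
# prints: execute detection
```

With a historical accuracy of 79% and a current accuracy of 99%, the lower value of 79% selects "execute detection", as in the example above.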

Save detection scheme 603: the core of the algorithm is to execute the detection scheme of each data item according to the set S, and to write the analysis results collected in S back to the feature rule base so as to dynamically adjust the execution order of the dirty data detection engines of the data item. For example, the identification card number has two detection engines (for 15-digit and 18-digit identification card numbers respectively); if the final detection result shows that 18-digit identification card numbers account for the larger share of the correct data, the 18-digit detection engine is adjusted to execute first, so that the detection efficiency is improved and the consumption of computing resources is reduced.
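The reordering of detection engines can be sketched in Python as follows. The engine names and patterns are hypothetical and check only the length and character format of the number, not its checksum; the sample values are made up for illustration.

```python
import re

# Hypothetical detection engines, listed in their initial execution order.
ENGINES = {
    "id_card_15": re.compile(r"\d{15}"),
    "id_card_18": re.compile(r"\d{17}[\dXx]"),
}


def reorder_engines(values, engines=ENGINES):
    """Return engine names ordered by how many values each one accepts.

    Each value is credited to the first engine that accepts it. The sort
    is stable, so engines with equal counts keep their original order.
    """
    hits = {name: 0 for name in engines}
    for value in values:
        for name, pattern in engines.items():
            if pattern.fullmatch(value):
                hits[name] += 1
                break
    return sorted(engines, key=lambda name: hits[name], reverse=True)


sample = ["11010519491231002X", "110105194912310021", "110105491231002"]
print(reorder_engines(sample))  # prints ['id_card_18', 'id_card_15']
```

Running the most frequently matching engine first means that most values are accepted on the first attempt, which is the efficiency gain the text describes.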

FIG. 7 illustrates a block diagram of a system for dirty data detection, in accordance with an embodiment of the present invention. The system comprises an attribute normalization module 701, an attribute feature analysis module 702, a detection scheme matching module 703, a dirty data proportion calculation module 704 and a detection scheme adjustment module 705.

In a specific embodiment, the attribute normalization module 701 is configured to perform attribute normalization on historical original data that has entered a big data system, and then construct a feature detection rule base for storing a correspondence between data items of different standard fields and dirty data detection schemes that are matched with the data items according to the dirty data detection schemes that are applied to the historical original data by the big data system;

the attribute feature analysis module 702 is configured to perform the attribute normalization on the original data to be detected when the original data to be detected enters a big data system, and then divide the original data to be detected into data items of a definite type and data items of an indefinite type;

the detection scheme matching module 703 is configured to select a dirty data detection scheme matching the data items of the definite type according to the feature detection rule base;

the dirty data proportion calculation module 704 is configured to select a matched dirty data detection scheme for each data item in the raw data to be detected according to the feature detection rule base as a recommended dirty data detection scheme, and then detect each data item in the raw data to be detected by using the recommended dirty data detection scheme; classifying each data item in the original data to be detected based on different classification modes, and calculating the dirty data proportion of each type of data item in each different classification mode, wherein the dirty data proportion is the proportion of the data item detected as dirty data;

The detection scheme adjusting module 705 is configured to, in each of the different classification manners, adjust the recommended dirty data detection scheme corresponding to each type of data item according to a dirty data proportion in the type of data item, so that the type of data item obtains different adjusted dirty data detection schemes according to the different classification manners, and re-detect the type of data item by using the different adjusted dirty data detection schemes, where a scheme used when a dirty data proportion is the highest is a dirty data detection scheme that is preferentially executed.

When the big data system accesses the original data, the system normalizes the attribute type of the original data and then performs attribute feature analysis on the normalized original data, so as to distinguish the original data items of a definite type from those of an indefinite type, and matches the original data with a suitable dirty data detection scheme according to the result, thereby effectively preventing dirty data from entering the big data system. In addition, the original data are classified based on different classification modes, the dirty data proportion of each class is counted after detection with the matched dirty data detection scheme, the dirty data detection scheme in use is adjusted according to the obtained dirty data proportion, and the dirty data proportion of each class is counted again; finally, for the same data item, the detection scheme used when the dirty data proportion is the highest is selected as the dirty data detection scheme to be executed preferentially. According to this scheme, the data item detection rule of the original data is dynamically adjusted according to the dirty data proportion, unnecessary detection links are reduced, and the warehousing efficiency of the original data is improved.

No similar optimization algorithm is currently available on the market, and this algorithm has already been implemented and integrated in our product. The algorithm is designed for massive data scenarios, adapts to the characteristics of different types of original data items and automatically matches the corresponding dirty data detection rules, which greatly improves identification accuracy and efficiency. Actual measurement and calculation show that, with a data volume in the billions, it can improve the identification rate of dirty data by 50% and reduce computing resources by 10% compared with similar products on the market.

Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, a mouse, and the like; an output portion 807 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is mounted on the storage section 808 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 801. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present application may be implemented by software or hardware. The described modules may also be provided in a processor, and the names of the modules do not in some cases constitute limitations on the modules themselves.

Embodiments of the invention also relate to a computer-readable storage medium having stored thereon a computer program which, when executed by a computer processor, implements the method above. The computer program comprises program code for performing the method shown in the flow chart. Note that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two.

The foregoing description is only exemplary of the preferred embodiments of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements in which any combination of the features described above or their equivalents does not depart from the spirit of the invention disclosed above. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A method of dirty data detection, comprising the steps of:

the attribute normalization includes: extracting field attributes and specific information of the field attributes, and normalizing and naming fields used for describing the same attribute type in the field attributes as the same standard field;

S2: when original data to be detected enter a big data system, performing attribute normalization on the original data to be detected, and dividing the original data to be detected into data items of definite types and data items of indefinite types;

2. The method according to claim 1, wherein the classifying is performed on each data item in the raw data to be tested based on different classification manners, and the dirty data proportion of each data item is calculated in each of the different classification manners, and the above steps include classifying according to different attribute types and classifying according to historical dirty data proportion and current dirty data proportion, and the specific steps are as follows:

The classification is carried out according to the historical dirty data proportion and the current dirty data proportion: and respectively counting the historical dirty data proportion and the current dirty data proportion of the original data to be measured, and taking the maximum value of the historical dirty data proportion and the current dirty data proportion as a second dirty data proportion.

3. The method according to claim 2, wherein the specific step of S5 includes:

step one: selecting an adjusting rule matched with the dirty data detection scheme matched with the data items in each attribute type in the feature detection rule base according to the range of the first dirty data proportion, and obtaining an adjusted dirty data detection scheme through adjustment of the adjusting rule, wherein the adjusted dirty data detection scheme is marked as a first detection scheme;

step two: and performing dirty data detection on the original data to be detected by using the first detection scheme and the second detection scheme respectively, and selecting the scheme used when the dirty data proportion is the highest as a dirty data detection scheme which is executed preferentially through calculation and comparison.

4. The method according to claim 1, wherein the step of constructing a feature detection rule base for storing the corresponding relationship between the data items of different standard fields and the dirty data detection schemes matched with the data items according to the dirty data detection schemes applied by the big data system to the historical raw data comprises:

5. The method according to claim 1, wherein the field attributes specifically include: data source, resource name, field naming, comments, type, and length.

6. The method of claim 1, wherein the feature detection rule base stores data items in the form of a database or a data table, and the feature detection rule base specifically comprises the following data:

an attribute type of the data item;

the standard field of a data item;

a set of dirty data detection schemes to which the data items match;

A data source for the data item.

7. The method according to claim 1, wherein the method of separating the raw data to be tested into data items of a well-defined type and data items of an undefined type comprises: and matching the data items in the original data to be detected by using a regular expression so as to divide the data items into the data items with definite types and the data items with unclear types.

8. The method of claim 2, wherein the different classification further comprises: the classification is based on the data source of the data item.

9. The method according to claim 1, wherein the adjusting the recommended dirty data detection schemes corresponding to the various types of data items according to the dirty data proportion in the various types of data items is based on a preset detection scheme adjustment rule base, wherein the detection scheme adjustment rule base is a database for characterizing a correspondence relationship between data items of different attribute types and adjustment rules matched with the data items, and the database specifically includes the following data:

an attribute type of the data item;

the standard field of a data item;

a data source of the data item;

10. The method according to claim 1, wherein the S4 further comprises:

11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a computer processor, carries out the method of any one of claims 1 to 10.

12. An apparatus for dirty data detection, comprising:

an attribute normalization module: configured to perform attribute normalization on historical original data which has entered a big data system, and then construct, according to the dirty data detection schemes applied to the historical original data by the big data system, a feature detection rule base for storing the corresponding relation between data items of different standard fields and the dirty data detection schemes matched with the data items;

an attribute feature analysis module: configured to perform the attribute normalization on original data to be detected when the original data to be detected enters the big data system, and then divide the original data to be detected into data items of a definite type and data items of an indefinite type;

a detection scheme matching module: configured to select a dirty data detection scheme matching the data items of the definite type according to the feature detection rule base;