CN106708909B - Data quality detection method and device

Info

Publication number: CN106708909B
Application number: CN201510796894.4A
Authority: CN
Inventors: 曲丹鹤
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2015-11-18
Filing date: 2015-11-18
Publication date: 2020-12-08
Anticipated expiration: 2035-11-18
Also published as: CN106708909A

Abstract

The application provides a method and a device for detecting data quality, wherein the method for detecting the data quality comprises the following steps: acquiring metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and the data type of each field, and N is a positive integer; searching at least one detection rule corresponding to each field in a pre-established detection rule base according to a preset matching strategy and the data type of each field; and respectively carrying out quality detection on each field according to at least one detection rule corresponding to each field to obtain and display the detection results of the N fields. The data quality detection method can effectively improve the coverage rate and the universality of detection, improve the detection efficiency and reduce the labor cost and the time cost.

Description

Data quality detection method and device

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for detecting data quality.

Background

With the advent of the data age, data is growing rapidly in both volume and variety. Data anomalies can cause serious losses or accidents, so data quality detection is a very important part of a data development system.

At present, data quality detection usually requires a tester to construct comprehensive test cases for each field according to the business logic; that is, the tester writes the code and execution logic of each test case and judges whether the execution result is abnormal and whether it meets the business requirements. In addition, once abnormal data found by manual testing has been repaired, regression testing must be performed manually on the repaired data, so test cases are written and executed repeatedly. Because this approach relies on manual work, abnormal data may be missed and test coverage is low; and because test cases must be executed repeatedly for different fields and for repaired data, efficiency is low and cost is high.

Disclosure of Invention

The present application aims to address the above technical problem, at least to some extent.

To this end, a first object of the present application is to propose a method for detecting data quality.

A second object of the present application is to provide a data quality detection apparatus.

In order to achieve the above object, according to a first aspect of the present application, a method for detecting data quality is provided, including the following steps: acquiring metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and a data type of each field, and N is a positive integer; searching at least one detection rule corresponding to each field in a pre-established detection rule base according to a preset matching strategy and the data type of each field; and respectively carrying out quality detection on each field according to at least one detection rule corresponding to each field to obtain and display the detection results of the N fields.

According to the data quality detection method of the embodiments of the present application, the detection rules corresponding to each field can be found in the detection rule base according to the preset matching strategy and the data type of each field in the metadata of the data to be detected, and quality detection is performed on each field of the data to be detected to obtain the quality detection result. The detection rules in the detection rule base are applicable to different databases and fields and therefore provide wider coverage, and both rule matching and quality detection can be completed automatically. Compared with the traditional approach of manually writing test cases, the method can therefore effectively improve the coverage and universality of detection, improve detection efficiency, and reduce labor and time costs.

An embodiment of the second aspect of the present application provides a data quality detection device, including: an acquisition module, configured to acquire metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and the data type of each field, and N is a positive integer; a search module, configured to search for at least one detection rule corresponding to each field in a pre-established detection rule base according to a preset matching strategy and the data type of each field; and a detection module, configured to perform quality detection on each field according to the at least one detection rule corresponding to that field, so as to obtain and display the detection results of the N fields.

The data quality detection device according to the embodiments of the present application can find the detection rules corresponding to each field in the detection rule base according to the preset matching strategy and the data type of each field in the metadata of the data to be detected, and perform quality detection on each field of the data to be detected to obtain the quality detection result. The detection rules in the detection rule base are applicable to different databases and fields and therefore provide wider coverage, and both rule matching and quality detection can be completed automatically. Compared with the traditional approach of manually writing test cases, the device can therefore effectively improve the coverage and universality of detection, improve detection efficiency, and reduce labor and time costs.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow chart of a method of detecting data quality according to one embodiment of the present application;

FIG. 2 is a flow chart of a method for detecting data quality according to an embodiment of the present application;

FIG. 3 is a flow chart of a method for data quality detection according to another embodiment of the present application;

FIG. 4a is a schematic diagram of a data quality detection method according to an embodiment of the present application;

FIG. 4b is a graph of a gender enumeration distribution of the test results of the embodiment of FIG. 4a according to the present application;

FIG. 4c is a graph showing the distribution of age intervals according to the detection results of the embodiment shown in FIG. 4a;

Fig. 5 is a schematic structural diagram of a data quality detection apparatus according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

A method and apparatus for detecting data quality according to an embodiment of the present application are described below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a method for detecting data quality according to an embodiment of the present application.

As shown in fig. 1, a method for detecting data quality according to an embodiment of the present application includes the following steps.

S101, obtaining metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and a data type of each field, and N is a positive integer.

Specifically, a table name input by a user can be received, and a data table corresponding to the table name is used as data to be detected.

Furthermore, the partition selected by the user for detection can be received, and the partition table corresponding to the partition is used as the data to be detected. The partition table is a subset of a data table, that is, a large data table can be divided into a plurality of partitions, and each partition is a partition table.

Metadata is data that describes data (data about data). It mainly describes the properties of data and supports functions such as indicating storage location, historical data, resource searching, and file recording. That is, the metadata of the data to be detected includes the fields of the data to be detected and corresponding information such as the field type, key value, and creation time.

The field of the data to be detected is a basic storage unit of the data to be detected. For example, one field may be a column in the data to be detected, and the value of N depends on the total number of columns in the data to be detected.

Taking the syntax of a distributed cluster as an example, the data types of the fields may include the Boolean type, the Double (double-precision floating point) type, the Bigint (integer) type, the Decimal (exact numeric) type, the String (character string) type, the Datetime (date) type, and so on. For example, the data type of the field corresponding to gender may be Boolean, the data type of the field corresponding to age may be Bigint, and the data type of the field corresponding to login time may be Datetime.
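A minimal sketch of the metadata obtained in S101 may help, assuming it is available as plain (column name, data type) pairs. The names `FieldMeta`, `TableMeta` and `get_metadata` are hypothetical and stand in for whatever catalog lookup a real system uses; only the table name, partition and field names come from the examples in this description.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FieldMeta:
    """One field (column) of the data to be detected and its data type."""
    name: str
    data_type: str  # lower-cased type name, e.g. "boolean", "double", "string"


@dataclass
class TableMeta:
    """Hypothetical container for the metadata obtained in S101."""
    table_name: str
    fields: List[FieldMeta] = field(default_factory=list)
    partition: Optional[str] = None    # partition selected by the user, if any
    primary_key: Optional[str] = None  # primary key information, if any

    @property
    def n(self) -> int:
        # N is the number of fields (columns) in the data to be detected.
        return len(self.fields)


def get_metadata(table_name: str,
                 columns: Sequence[Tuple[str, str]],
                 partition: Optional[str] = None,
                 primary_key: Optional[str] = None) -> TableMeta:
    """Build metadata from (name, type) pairs; a stand-in for a catalog query."""
    if not columns:
        raise ValueError("N must be a positive integer: at least one field is required")
    metas = [FieldMeta(name, dtype.lower()) for name, dtype in columns]
    return TableMeta(table_name, metas, partition, primary_key)


meta = get_metadata(
    "a_user",
    [("userid", "String"), ("gender", "Boolean"), ("score", "Double")],
    partition="pt=20150801",
    primary_key="userid",
)
print(meta.n, [f.data_type for f in meta.fields])
```

Normalizing the type names to lower case here is a design choice of the sketch, so that later rule matching does not depend on how the source system spells its types.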

S102, searching at least one detection rule corresponding to each field in a pre-established detection rule base according to a preset matching strategy and the data type of each field.

In an embodiment of the application, a plurality of detection rules can be extracted from the test cases corresponding to each of the different data types, and the detection rule base is established from the extracted detection rules. Specifically, the core calculation logic of common data quality test cases can be abstracted into rule templates in advance, and the general expected value of each test case is used as the expected value of the corresponding rule template, thereby establishing the detection rule base. The detection rule base comprises a plurality of detection rules, and each detection rule comprises a rule template and an expected value.

It will be appreciated that some probe-type detection rules have no expected value, so their expected value may be a default value, e.g., N/A. For example, there is no particular requirement on the number of repetitions of the age field in a data table, so the expected value of the repetition-count detection rule for that field may be set to N/A.

The preset matching strategy is obtained in advance by statistical analysis of common data quality test cases. Specifically, for each detection rule, statistics can be gathered on which data types use that rule during quality detection, and a label for each such data type is added to the detection rule.
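One way to picture a detection rule as "rule template plus expected value" is sketched below. It assumes, purely for illustration, that a rule template is a SQL aggregate expression with a `{col}` placeholder and that an expected value is a predicate; the description does not fix either representation, and `DetectionRule` and the two sample rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DetectionRule:
    """A detection rule: a rule template plus an expected value."""
    name: str
    template: str                                # assumed form: SQL with a {col} placeholder
    expected: Optional[Callable[[float], bool]]  # None stands for an expected value of N/A
    expected_text: str = "N/A"

    def render(self, column: str) -> str:
        """Fill the column name into the rule template."""
        return self.template.format(col=column)

    def is_met(self, value) -> bool:
        """Probe-type rules (expected value N/A) never mark a value as abnormal."""
        return True if self.expected is None else self.expected(value)


# A rule with a general expected value: the null value count should be 0.
NULL_COUNT = DetectionRule(
    "null_count",
    "SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END)",
    lambda v: v == 0,
    "= 0",
)

# A probe-type rule: the repetition count is reported but has no expectation.
REPEAT_COUNT = DetectionRule(
    "repeat_count",
    "COUNT({col}) - COUNT(DISTINCT {col})",
    None,
)

print(NULL_COUNT.render("age"), "| expected", NULL_COUNT.expected_text)
print(REPEAT_COUNT.render("age"), "| expected", REPEAT_COUNT.expected_text)
```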

Therefore, after the data type of a field is acquired, all detection rules with the data type label can be found in the detection rule base according to the data type.
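The label-based lookup can be sketched as follows. The rule names and the lower-case type labels are illustrative assumptions; the description only requires that each detection rule carries labels for the data types it applies to.

```python
from typing import Dict, FrozenSet, List

# Hypothetical rule base index: rule name -> data type labels attached to the rule.
ALL_TYPES = frozenset({"boolean", "double", "decimal", "string"})
NUMERIC = frozenset({"double", "decimal"})

RULE_LABELS: Dict[str, FrozenSet[str]] = {
    "null_count": ALL_TYPES,
    "valid_count": ALL_TYPES,
    "null_rate": ALL_TYPES,
    "max_value": NUMERIC,
    "min_value": NUMERIC,
    "negative_count": NUMERIC,
    "enum_count": frozenset({"string"}),
    "max_length": frozenset({"string"}),
    "min_length": frozenset({"string"}),
}


def find_rules(data_type: str) -> List[str]:
    """Return every rule in the rule base that carries the label of data_type."""
    label = data_type.lower()
    return [name for name, labels in RULE_LABELS.items() if label in labels]


print(find_rules("Boolean"))
print(find_rules("String"))
```

A field whose data type has no label in the rule base simply receives an empty rule list in this sketch; a real system might instead report the unmatched type.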

For example, for a Boolean field, such as a field representing gender, the corresponding detection rules may include: null value count (expected 0), valid value count (expected >0), null rate (expected 0%), and the like.

For numeric fields of the Double, Bigint, and Decimal types, such as a field representing age, the corresponding detection rules may include: null value count (expected 0), valid value count (expected >0), null rate (expected 0%), maximum value (expected >0), minimum value (expected <0), average value, percentile (expected to be a preset quantile), negative value count (expected 0), zero value count, and the like.

For a String field, such as a field representing a user ID, the corresponding detection rules may include: null value count (expected 0), valid value count (expected >0), null rate (expected 0%), number of enumerated values, maximum field length, minimum field length, and the like.

For a Datetime field, such as a field representing the login time of a user, the corresponding detection rules may include: null value count (expected 0), valid value count (expected >0), null rate (expected 0%), number of values satisfying a preset date format (e.g., the yyyymmdd format), and the like.

It should be understood that the expected value of the detection rule may be a default value in the detection rule base, or may be set by the detection personnel as needed.
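As an illustration of one of the rules listed above, the following sketch counts the values that satisfy a preset date format. It assumes the field values arrive as strings and that the preset format is yyyymmdd; the function name and the sample values are hypothetical.

```python
from datetime import datetime


def count_matching_date_format(values, fmt="%Y%m%d", length=8):
    """Count the non-null values that are valid dates in the preset format."""
    count = 0
    for value in values:
        # strptime tolerates missing leading zeros, so the length and the
        # digits are checked explicitly before parsing.
        if value is None or len(value) != length or not value.isdigit():
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue  # right shape, but not a real calendar date
        count += 1
    return count


login_dates = ["20150801", "2015-08-01", "20150230", None, "abcdefgh"]
matching = count_matching_date_format(login_dates)
print(matching)
```

Only the first sample value is counted: the second uses a different format, the third is not a real calendar date, the fourth is null, and the fifth is not numeric.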

S103, respectively carrying out quality detection on each field according to at least one detection rule corresponding to each field to obtain and display the detection results of the N fields.

Specifically, for each field, the detection value of the field under each of its detection rules can be calculated and compared with the expected value of that rule. If every detection value meets its expected value, the field is not abnormal; otherwise, the field is abnormal. Likewise, if no field in the data to be detected is abnormal, the data to be detected is not abnormal; otherwise, the data to be detected is abnormal.

For example, the data to be detected is abnormal if any of the following problems exists: a duplicate key exists, i.e., the primary key is not unique; a core field contains null values or its null rate exceeds a preset range; a metric field (such as amount or age) contains negative values; a logical inconsistency exists between fields (e.g., page views < unique visitors); a field length exceeds the business expectation (e.g., an identity number longer than 18 digits); an enumerated field (such as gender) contains an abnormal enumerated value or its enumerated values have an abnormal distribution ratio; a field format does not conform to the preset format; and the like.
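The comparison logic of S103 can be sketched in a few lines. The sketch assumes the data is held in memory as a mapping from field name to a list of values, and the three rules and their expected-value predicates are illustrative only.

```python
def null_count(values):
    return sum(v is None for v in values)


def valid_count(values):
    return sum(v is not None for v in values)


def negative_count(values):
    return sum(v is not None and v < 0 for v in values)


# Hypothetical rule base: rule name -> (calculation, expected-value predicate).
RULES = {
    "null_count": (null_count, lambda v: v == 0),
    "valid_count": (valid_count, lambda v: v > 0),
    "negative_count": (negative_count, lambda v: v == 0),
}


def check_field(values, rule_names):
    """Return {rule name: (detection value, meets expected value)} for one field."""
    results = {}
    for name in rule_names:
        compute, expected = RULES[name]
        value = compute(values)
        results[name] = (value, expected(value))
    return results


def check_data(table, rules_by_field):
    """A field is abnormal if any detection value misses its expected value."""
    report = {f: check_field(table[f], names) for f, names in rules_by_field.items()}
    abnormal = [f for f, res in report.items()
                if not all(ok for _, ok in res.values())]
    return report, abnormal


table = {"gender": [True, False, True], "age": [23, -1, None]}
rules_by_field = {
    "gender": ["null_count", "valid_count"],
    "age": ["null_count", "valid_count", "negative_count"],
}
report, abnormal_fields = check_data(table, rules_by_field)
print(abnormal_fields)               # fields with at least one failed rule
print(len(abnormal_fields) == 0)     # True only if the whole data set is normal
```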

In one embodiment of the application, when an anomaly is detected, the corresponding field and the corresponding detection value can be recorded and provided to a detection person. The detection results of each field can be displayed in diversified forms such as tables, distribution maps and the like, and abnormal fields and problems corresponding to the abnormal fields can be marked by highlighting or other special marks, so that detection personnel can find and process the abnormal fields in time.
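A plain-text rendering of such a result table might look like the sketch below, where an arrow marker stands in for highlighting. The result structure, the marker and the choice of which sample values are flagged are assumptions made for illustration.

```python
def format_report(results):
    """Render {field: {rule: (value, ok)}} as text, marking abnormal values."""
    lines = [f"{'field':<10} {'rule':<16} {'value':>6}"]
    for field_name, rules in results.items():
        for rule, (value, ok) in rules.items():
            mark = "" if ok else "  <-- abnormal"
            lines.append(f"{field_name:<10} {rule:<16} {value!s:>6}{mark}")
    return "\n".join(lines)


results = {
    "userid": {"null_count": (100, False), "valid_count": (102, True)},
    "age": {"null_count": (0, True), "min_value": (-1, False)},
}
report_text = format_report(results)
print(report_text)
```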

It should be understood that after the abnormal data is processed by the inspector, the above-mentioned S101-S103 may be automatically repeated to inspect the processed data again.

In another embodiment of the present application, to further improve the detection efficiency, when performing quality detection on the data to be detected according to the detection rule, merging detection may be performed, and parallel detection may be performed on the fields according to the merged detection rule. Specifically, fig. 2 is a flowchart of a method for detecting data quality according to an embodiment of the present application. As shown in fig. 2, the method for detecting data quality in the embodiment of the present application includes the following steps:

S201, obtaining metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and a data type of each field, and N is a positive integer.

S202, at least one detection rule corresponding to each field is searched in a pre-established detection rule base according to a preset matching strategy and the data type of each field.

S201-S202 are the same as S101-S102, and are not described herein again.

S203, merging the rule templates of the detection rules corresponding to each field to obtain a merging rule template.

It should be understood that the application is not limited to the merge rules of the rule templates. In the embodiment of the application, a plurality of detection rule templates corresponding to one field may be combined into one combination rule template, or all detection rules corresponding to all fields in the data to be detected may be combined into one combination rule template.

Specifically, when the detection rules corresponding to all the fields are merged, the fields need to be distinguished. For example, if the detection rule templates corresponding to column1 are rule1(), rule2(), and rule3(), and the detection rule templates corresponding to column2 are rule1(), rule3(), rule4(), and rule5(), then column1 and column2 must be distinguished when the templates are merged, yielding the merged rule template rule1(column1), rule2(column1), rule3(column1), rule1(column2), rule3(column2), rule4(column2), rule5(column2).
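The merge step can be sketched as string assembly, assuming each rule template is a SQL aggregate expression with a `{col}` placeholder and that the merged rule template is a single SELECT statement. Both assumptions, the template texts and the `column__rule` alias scheme are illustrative; the description leaves the merge rules open.

```python
# Hypothetical rule templates keyed by rule name.
RULE_TEMPLATES = {
    "null_count": "SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END)",
    "valid_count": "COUNT({col})",
    "max_value": "MAX({col})",
    "min_value": "MIN({col})",
}


def merge_templates(table, rules_by_column):
    """Merge the rule templates of all columns into one query over the table."""
    select_items = []
    for column, rule_names in rules_by_column.items():
        for rule in rule_names:
            expr = RULE_TEMPLATES[rule].format(col=column)
            # The alias records which column and which rule each result belongs
            # to, so results for different fields remain distinguishable.
            select_items.append(f"{expr} AS {column}__{rule}")
    return f"SELECT {', '.join(select_items)} FROM {table}"


merged = merge_templates(
    "t",
    {
        "column1": ["null_count", "valid_count"],
        "column2": ["valid_count", "max_value", "min_value"],
    },
)
print(merged)
```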

S204, detecting the N fields by using the merging rule template to obtain a plurality of detection results, wherein the plurality of detection results correspond to the detection rules corresponding to the N fields.

Specifically, when the merged rule template is used for detection, the rule templates contained in it can be processed in parallel, so that quality detection of a field, or even of the entire data to be detected, can be completed in a short time.

S205, respectively comparing whether the detection results are consistent with the expected values of the corresponding detection rules.

S206, if the detection result is consistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is normal.

S207, if the detection result is inconsistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is abnormal.

The data quality detection method can combine a plurality of detection rule templates and use the combination rule templates for quality detection, so that detection values of a plurality of detection rules can be obtained in one-time calculation through combination calculation, detection results can be obtained more quickly, and detection efficiency is further improved.
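Steps S204 to S207 can be illustrated end to end with an in-memory SQLite table standing in for the data to be detected. The table contents, the alias scheme and the expected-value predicates are assumptions of the sketch; the point is only that one scan of the table yields the detection values of several rules at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_user (userid TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO a_user VALUES (?, ?)",
    [("u1", 30), ("u2", -1), ("u3", 45)],
)

# (alias, merged expression, expected-value predicate); all hypothetical.
CHECKS = [
    ("userid__null_count", "SUM(CASE WHEN userid IS NULL THEN 1 ELSE 0 END)", lambda v: v == 0),
    ("userid__valid_count", "COUNT(userid)", lambda v: v > 0),
    ("age__null_count", "SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END)", lambda v: v == 0),
    ("age__negative_count", "SUM(CASE WHEN age < 0 THEN 1 ELSE 0 END)", lambda v: v == 0),
]

# S204: one query computes the detection values of every merged rule.
sql = "SELECT " + ", ".join(f"{expr} AS {alias}" for alias, expr, _ in CHECKS) + " FROM a_user"
row = conn.execute(sql).fetchone()

# S205-S207: compare each detection value with its expected value per field.
field_is_normal = {}
for (alias, _, expected), value in zip(CHECKS, row):
    column = alias.split("__")[0]
    field_is_normal[column] = field_is_normal.get(column, True) and expected(value)

print(row)
print(field_is_normal)
conn.close()
```

Here `userid` passes both of its rules, while `age` is judged abnormal because of the single negative value.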

Fig. 3 is a flow chart of a method for detecting data quality according to another embodiment of the present application.

As shown in fig. 3, the method for detecting data quality in the embodiment of the present application includes the following steps:

S301, obtaining metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and the data type of each field, and N is a positive integer.

S302, at least one detection rule corresponding to each field is searched in a pre-established detection rule base according to a preset matching strategy and the data type of each field.

S303, respectively carrying out quality detection on each field according to at least one detection rule corresponding to each field.

Wherein, S301-S303 are the same as S101-S103, and are not described herein again.

S304, if the detection results of the N fields include a detection result for the number of enumerated values, judging whether the number of enumerated values meets a first preset condition.

The number of enumerated values is the number of distinct field values, i.e., the number of field values remaining after deduplication.

The first preset condition may be: the number of enumerated values is less than a preset number, and the ratio of the number of enumerated values to the total number of field values is less than a preset ratio. The preset number and the preset ratio may be adjusted according to the business; for example, the preset number may be 100 and the preset ratio may be 0.01.

If the number of enumeration values of a field meets a first preset condition, the field is represented as an enumerable non-discrete field, so that enumeration value distribution detection can be carried out on the field.

S305, if yes, obtaining the distribution proportion of the enumeration values according to the number of the enumeration values.

Specifically, the frequency of occurrence of each enumerated value, i.e., the number of occurrences of the enumerated value divided by the total number of field values, may be calculated to obtain the enumerated value distribution ratio.

S306, if the distribution proportion of the enumeration values does not meet the preset proportion condition, judging that the field corresponding to the detection result containing the number of the enumeration values is abnormal.

In the embodiment of the present application, the preset proportion condition may be determined according to a commonly used evaluation criterion, or may be set by a detection person. Of course, in the embodiment of the present application, the obtained enumeration value distribution ratio may also be displayed to a detection person, so that the detection person determines whether the data is abnormal according to the enumeration value distribution ratio.
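Steps S304 to S306 can be sketched as below, using the example thresholds given above (preset number 100, preset ratio 0.01). The preset proportion condition is not specified in the description, so the `max_share` cap used here is purely an assumed example, as are the function names and the sample data.

```python
from collections import Counter


def is_enumerable(values, preset_number=100, preset_ratio=0.01):
    """First preset condition: few distinct values relative to the total."""
    total = len(values)
    distinct = len(set(values))
    return total > 0 and distinct < preset_number and distinct / total < preset_ratio


def enum_distribution(values):
    """Share of each enumerated value: occurrences / total number of field values."""
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}


def distribution_is_abnormal(distribution, max_share=0.75):
    """Assumed proportion condition: no single value may exceed max_share."""
    return any(share > max_share for share in distribution.values())


genders = ["male"] * 800 + ["female"] * 200

if is_enumerable(genders):
    distribution = enum_distribution(genders)
    print(distribution)
    print(distribution_is_abnormal(distribution))
```

In practice the distribution may simply be shown to the inspector instead of being judged automatically, as noted above.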

S307, if the detection results of the N fields comprise the detection results of the maximum value and the minimum value, judging whether the difference value of the maximum value and the minimum value meets a second preset condition.

The second preset condition may be set according to business requirements; for example, the second preset condition may be that the difference is <= 100.

S308, if yes, the field values of the fields corresponding to the detection results containing the maximum value and the minimum value are obtained and counted according to the intervals, and interval distribution data of the fields corresponding to the detection results containing the maximum value and the minimum value are obtained.

Specifically, the range between the minimum value and the maximum value may be divided equally into a preset number of intervals, and the number of field values falling in each interval is counted, thereby obtaining the interval distribution data of the field.

S309, if the interval distribution data does not meet the preset distribution, judging that the field corresponding to the detection result containing the maximum value and the minimum value is abnormal.

In the embodiment of the present application, the preset distribution may be determined according to a commonly used evaluation criterion, or may be set by a detection person. Of course, in the embodiment of the present application, the obtained interval distribution data may also be displayed to the detecting person, so that the detecting person can determine whether the data is abnormal according to the interval distribution data.
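Steps S307 and S308 amount to building an equal-width histogram when the value range is small enough. A minimal sketch follows, assuming numeric values without nulls, the example condition "difference <= 100", and 10 intervals; the function names and sample ages are hypothetical.

```python
def range_is_small(values, max_span=100):
    """Second preset condition (example): maximum minus minimum is <= max_span."""
    return max(values) - min(values) <= max_span


def interval_distribution(values, bins=10):
    """Count the values in each of `bins` equal-width intervals of [min, max]."""
    low, high = min(values), max(values)
    if high == low:
        return [len(values)]  # every value is identical: a single interval
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value is placed in the last interval instead of overflowing.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


ages = [0, 5, 18, 25, 33, 47, 52, 55, 58, 61, 79, 100]

if range_is_small(ages):
    counts = interval_distribution(ages)
    print(counts)
```

With these sample ages the fullest interval is the one starting at 50, which is the kind of observation the inspector would then assess against the business context.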

The data quality detection method according to the embodiment of the present application will be described with reference to fig. 1 and 4a to 4 c.

Fig. 4a is a schematic diagram of a data quality detection method according to an embodiment of the present application.

Table 1 is a table of field test results of the data quality testing method according to an embodiment of the present application.

FIG. 4b is a gender enumeration distribution graph of the test results of the embodiment shown in FIG. 4a according to the present application.

FIG. 4c is a graph of age interval distribution of the test results of the embodiment shown in FIG. 4a according to the present application.

As shown in fig. 4a, the user selects the partition pt=20150801 of the user table a_user as the data to be detected. The data table includes fields such as userid, nick, age, score, and logintime. The corresponding detection rules are matched automatically according to the data types of the fields (only the detection rules matched for userid, age, and logintime are shown in the figure), and detection is then performed according to the matched rules to obtain the detection results.

Then, a field test result table shown in table 1, a gender enumeration distribution chart shown in fig. 4b, and an age interval distribution chart shown in fig. 4c are generated based on the test results.

Table 1 lists the abnormal fields userid, age, and logintime, wherein the values in bold italics are the abnormal detection values, i.e., the detection values that do not meet the corresponding expected values.

TABLE 1

Field	Number of null values	Number of valid values	Maximum value	Minimum value	Number of repetitions
userid	100	102	N/A	N/A	10
age	0	102	10	-1	N/A
logintime	1	101	N/A	N/A	50

As shown in fig. 4b, males and females account for 80% and 20% respectively. The proportion of males is very high, so the inspector can judge whether this is reasonable for the specific business.

As shown in fig. 4c, the age interval [0, 100] can be divided into 10 equal intervals and the amount of data in each age interval counted, yielding the age interval distribution graph shown in fig. 4c. It can be seen that the 50-60 age group contains the most people, and the inspector can judge whether this result is reasonable for the specific business.

Therefore, after a preliminary detection result is obtained, further detection can be performed depending on whether a field is enumerable, whether it can be examined by interval, and so on, to obtain an enumeration distribution graph or an interval distribution result. Data quality can thus be judged more intuitively from the distribution of the data, and data problems can be uncovered in greater depth.

It should be understood that, in the embodiment of the present application, steps S304-S306 and steps S307-S309 are optional and may be performed in any order relative to each other.

In an embodiment of the present application, the metadata may further include primary key information of the data to be detected, and optionally, after performing quality detection on each field according to at least one detection rule corresponding to each field, the method may further include:

S310, determining a field corresponding to the primary key information.

The primary key information is information for identifying which field in the data table is the primary key. The metadata of the data table includes primary key information, and thus, a corresponding field can be determined based on the primary key information.

S311, a result of detecting the number of repetitions of the field corresponding to the primary key information is obtained.

The number of repetitions is used to detect whether any field value of the detected field is repeated. Taking the user ID field userid as an example, if two or more users have the ID xiaoming, a repeated field value exists.

S312, if the field corresponding to the primary key information has the repeated field value, judging that the primary key information or the field corresponding to the primary key information is abnormal.

Since the primary key is used to uniquely identify a record in the data table, if there is a duplicate value, the record cannot be uniquely identified, and thus, it is possible to judge that the data is abnormal.
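Steps S310 to S312 can be sketched as a duplicate check on the primary key column. The sketch assumes the table is held in memory as a mapping from field name to values and counts every occurrence beyond the first as one repetition; that counting convention and the function names are assumptions.

```python
from collections import Counter


def duplicated_values(values):
    """Return {value: occurrences} for every value that appears more than once."""
    return {value: count for value, count in Counter(values).items() if count > 1}


def check_primary_key(table, primary_key):
    """Judge the primary key field abnormal if any of its values is repeated."""
    duplicates = duplicated_values(table[primary_key])
    repetitions = sum(count - 1 for count in duplicates.values())
    return {
        "field": primary_key,
        "repetitions": repetitions,
        "abnormal": repetitions > 0,
        "duplicates": duplicates,
    }


table = {"userid": ["xiaoming", "xiaohong", "xiaoming"], "age": [20, 31, 20]}
result = check_primary_key(table, "userid")
print(result)
```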

Therefore, the primary key of the data to be detected can be further checked by means of the primary key information, so that data problems can be found more comprehensively.

It should be understood that, in the embodiment of the present application, steps S310-S312, steps S304-S306, and steps S307-S309 may be executed in any order.

In order to implement the above embodiments, the present application further provides a data quality detection apparatus.

As shown in fig. 5, the apparatus for detecting data quality according to the embodiment of the present application includes: an acquisition module 10, a lookup module 20, and a detection module 30.

Specifically, the obtaining module 10 is configured to obtain metadata of data to be detected, where the metadata includes N fields of the data to be detected and a data type of each field, where N is a positive integer.

More specifically, the obtaining module 10 may receive a table name input by a user, and use a data table corresponding to the table name as data to be detected.

Further, the obtaining module 10 may also receive a partition selected by the user for detection, and use a partition table corresponding to the partition as data to be detected. The partition table is a subset of a data table, that is, a large data table can be divided into a plurality of partitions, and each partition is a partition table.

The searching module 20 is configured to search for at least one detection rule corresponding to each field in a pre-established detection rule base according to a preset matching policy and a data type of each field.

In an embodiment of the present application, the search module 20 may extract a plurality of detection rules from the test cases corresponding to each of the different data types, and establish the detection rule base from the extracted detection rules. Specifically, the core calculation logic of common data quality test cases can be abstracted into rule templates in advance, and the general expected value of each test case is used as the expected value of the corresponding rule template, thereby establishing the detection rule base. The detection rule base comprises a plurality of detection rules, and each detection rule comprises a rule template and an expected value.

It will be appreciated that some probe-type detection rules have no expected value, so their expected value may be a default value, e.g., N/A.

The detection module 30 is configured to perform quality detection on each field according to at least one detection rule corresponding to each field, so as to obtain and display detection results of the N fields.

Specifically, for each field, the detection module 30 may calculate the detection value of the field under each of its detection rules and compare it with the expected value of that rule. If every detection value meets its expected value, the field is not abnormal; otherwise, the field is abnormal. Likewise, if no field in the data to be detected is abnormal, the data to be detected is not abnormal; otherwise, the data to be detected is abnormal.

In one embodiment of the present application, when an anomaly is detected, the detection module 30 may record the corresponding field and the corresponding detection value, and provide them to the detection personnel. The detection module 30 can display the detection result of each field in various forms, such as a table or a distribution diagram, and can mark the abnormal field and the corresponding problem with highlighting or other special marks, so that the detection personnel can find and handle the abnormality in time.

In another embodiment of the present application, in order to further improve the detection efficiency, the detection module 30 may perform merging detection when performing quality detection on the data to be detected according to the detection rule, and perform parallel detection on the fields according to the merged detection rule.

In particular, the detection module 30 may be specifically configured to: merging the rule templates of the detection rules corresponding to each field to obtain a merged rule template; detecting the N fields by using the merging rule template to obtain a plurality of detection results, wherein the plurality of detection results correspond to the detection rules corresponding to the N fields; respectively comparing whether the detection results are consistent with the expected values of the corresponding detection rules; if the detection result is consistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is normal; and if the detection result is inconsistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is abnormal.

It should be understood that the application does not limit the manner in which the rule templates are merged. In the embodiment of the application, the plurality of detection rule templates corresponding to one field may be merged into one merged rule template, or all detection rules corresponding to all fields in the data to be detected may be merged into one merged rule template. When the merged rule template is used for detection, the plurality of rule templates contained in the merged rule template can be processed in parallel, so that quality detection of one field or even one data packet can be completed in a short time.
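One possible way to realize a merged rule template is sketched below, under the assumption that each rule template is an SQL aggregate expression, so that all templates for all fields can be combined into a single query and evaluated in one pass over the table. The application does not state that the templates are SQL; the template strings, the `{field}` placeholder, the function names, and the use of Python's `sqlite3` module as a stand-in data store are illustrative assumptions.

```python
import sqlite3
from typing import Any, Dict, List, Tuple

# Hypothetical rule templates; {field} is an assumed placeholder for the column name.
TEMPLATES = {
    "null_count": "SUM(CASE WHEN {field} IS NULL THEN 1 ELSE 0 END)",
    "enum_count": "COUNT(DISTINCT {field})",
    "max_value": "MAX({field})",
}

Key = Tuple[str, str]  # (field, rule name)


def merge_templates(table: str, rules_by_field: Dict[str, List[str]]) -> Tuple[str, List[Key]]:
    """Combine every (field, rule) template into one SELECT statement.

    Table and field names are assumed to come from trusted metadata,
    because they are interpolated directly into the SQL text.
    """
    keys: List[Key] = []
    exprs: List[str] = []
    for field, rule_names in rules_by_field.items():
        for rule in rule_names:
            keys.append((field, rule))
            exprs.append(TEMPLATES[rule].format(field=field))
    return f"SELECT {', '.join(exprs)} FROM {table}", keys


def run_merged(conn: sqlite3.Connection, table: str,
               rules_by_field: Dict[str, List[str]]) -> Dict[Key, Any]:
    """Run the merged query once and map each result back to its (field, rule)."""
    sql, keys = merge_templates(table, rules_by_field)
    row = conn.execute(sql).fetchone()
    return dict(zip(keys, row))
```

Each entry of the returned mapping can then be compared with the expected value of its detection rule, so that one scan of the data yields the detection results for every field.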

Further, in an embodiment of the present application, the detection module 30 is further configured to: after quality detection is respectively carried out on each field according to at least one detection rule corresponding to each field, if the detection result of the number of enumerated values is contained in the detection result of the N fields, whether the number of the enumerated values meets a first preset condition is judged; if yes, acquiring the distribution proportion of the enumeration values according to the number of the enumeration values; and if the distribution proportion of the enumeration values does not meet the preset proportion condition, judging that the field corresponding to the detection result containing the number of the enumeration values is abnormal.

The first preset condition may be: the number of enumerated values is less than a preset number, and the ratio of the number of enumerated values to the total number of field values is less than a preset proportion. The preset number and the preset proportion may be adjusted according to the service; for example, the preset number may be 100, and the preset proportion may be 0.01. The preset proportion condition can be determined according to common judgment criteria, or can be set by a detection person.

If the number of enumeration values of a field meets the first preset condition, this indicates that the field is an enumerable non-discrete field, so that enumeration value distribution detection can be carried out on the field. Specifically, the frequency of occurrence of each enumerated value, i.e., the number of occurrences of each enumerated value divided by the total number of field values, may be calculated to obtain the enumeration value distribution proportion.

In the embodiment of the application, the obtained distribution ratio of the enumeration values can be displayed to a detection person, so that the detection person can judge whether the data is abnormal according to the distribution ratio of the enumeration values.
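The enumeration value distribution detection described above can be sketched as follows, using the example thresholds from the text (preset number 100, preset proportion 0.01). The function names are hypothetical, and the check of the preset proportion condition is left out because the application leaves that condition to common criteria or to the detection person.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence

PRESET_NUMBER = 100  # example preset number from the text
PRESET_RATIO = 0.01  # example preset proportion from the text


def is_enumerable(values: Sequence[Hashable],
                  preset_number: int = PRESET_NUMBER,
                  preset_ratio: float = PRESET_RATIO) -> bool:
    """First preset condition: few distinct values, absolutely and relative to the row count."""
    if not values:
        return False
    distinct = len(set(values))
    return distinct < preset_number and distinct / len(values) < preset_ratio


def enum_distribution(values: Sequence[Hashable]) -> Optional[Dict[Hashable, float]]:
    """Return the share of each enumerated value, or None if the field is not enumerable."""
    if not is_enumerable(values):
        return None
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}
```

For a field holding 600 occurrences of one value and 400 of another, the number of enumerated values is 2 and the ratio is 2/1000, so the first preset condition is met and the distribution proportions 0.6 and 0.4 are returned for display or for comparison against a preset proportion condition.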

In one embodiment of the present application, the detection module 30 is further configured to: after quality detection is respectively carried out on each field according to at least one detection rule corresponding to each field, if the detection results of the N fields comprise the detection results of the maximum value and the minimum value, whether the difference value between the maximum value and the minimum value meets a second preset condition is judged; if yes, obtaining field values of fields corresponding to the detection results containing the maximum value and the minimum value, and carrying out statistics according to intervals to obtain interval distribution data of the fields corresponding to the detection results containing the maximum value and the minimum value; and if the interval distribution data does not meet the preset distribution, judging that the field corresponding to the detection result containing the maximum value and the minimum value is abnormal.

The second preset condition may be set according to a service requirement; for example, the second preset condition may be that the difference is <= 100. The preset distribution can be determined according to common judgment criteria, or can be set by a detection person.

In the embodiment of the application, the obtained interval distribution data can be displayed to a detection person, so that the detection person can judge whether the data is abnormal or not according to the interval distribution data.
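The interval statistics described above can be sketched as follows, using the example second preset condition from the text (difference between maximum and minimum <= 100). The interval width of 10, the half-open interval convention, and the function name are assumptions made for illustration; the application does not specify how the intervals are chosen.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple

MAX_RANGE = 100    # example second preset condition from the text: max - min <= 100
BUCKET_WIDTH = 10  # hypothetical interval width


def interval_distribution(values: Sequence[int],
                          max_range: int = MAX_RANGE,
                          width: int = BUCKET_WIDTH) -> Optional[Dict[Tuple[int, int], int]]:
    """Count values per half-open interval [low, low + width).

    Returns None when the field is empty or when max - min exceeds max_range,
    i.e. when the second preset condition is not met.
    """
    if not values:
        return None
    lo, hi = min(values), max(values)
    if hi - lo > max_range:
        return None
    counts = Counter((v - lo) // width for v in values)
    return {(lo + i * width, lo + (i + 1) * width): counts[i] for i in sorted(counts)}
```

The resulting counts per interval can be shown to the detection person or compared against a preset distribution; intervals that contain no values are simply omitted from the result in this sketch.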

In an embodiment of the present application, the metadata includes primary key information of the data to be detected, and the detection module 30 is further configured to: after quality detection is respectively carried out on each field according to at least one detection rule corresponding to each field, determining a field corresponding to the primary key information; acquiring a repeated number detection result of a field corresponding to the primary key information; and if the field corresponding to the primary key information has the repeated field value, judging that the primary key information or the field corresponding to the primary key information is abnormal.

The primary key information is information for identifying which field in the data table is the primary key. The metadata of the data table includes the primary key information, and thus the corresponding field can be determined based on the primary key information. Detecting the number of repetitions means detecting whether any field value of the field is repeated. Taking a user ID field userid as an example, if two or more users both have the ID xiaoming, a repeated field value exists.
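The repetition check on the primary key field can be sketched as follows. The row representation (a list of dictionaries) and the function name are assumptions made for the example; the field name `userid` and the value `xiaoming` follow the example in the text.

```python
from collections import Counter
from typing import Any, Dict, Hashable, List


def duplicated_key_values(rows: List[Dict[str, Any]], primary_key: str) -> Dict[Hashable, int]:
    """Return each primary key value that occurs more than once, with its count."""
    counts = Counter(row[primary_key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


def primary_key_is_abnormal(rows: List[Dict[str, Any]], primary_key: str) -> bool:
    """The primary key (or its field) is abnormal if any key value is repeated."""
    return bool(duplicated_key_values(rows, primary_key))
```

A non-empty result means that either the primary key information in the metadata is wrong or the field itself contains bad data, which matches the two possible conclusions named in the text.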

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the application, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for detecting data quality is characterized by comprising the following steps:

acquiring metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and a data type of each field, and N is a positive integer; the metadata is information used for describing data attributes of the data to be detected;

searching at least one detection rule corresponding to each field in a pre-established detection rule base according to a preset matching strategy and the data type of each field; the detection rules in the detection rule base comprise rule names, rule templates and rule expected values;

respectively carrying out quality detection on each field according to at least one detection rule corresponding to each field to obtain and display the detection results of the N fields; the quality detection of each field according to the at least one detection rule corresponding to each field specifically includes: merging the rule templates of the detection rules corresponding to each field to obtain a merged rule template; detecting the N fields by using the merging rule template to obtain a plurality of detection results, wherein the plurality of detection results correspond to the detection rules corresponding to the N fields; respectively comparing whether the detection results are consistent with the expected values of the corresponding detection rules; if the detection result is consistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is normal; and if the detection result is inconsistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is abnormal.

2. The method for detecting data quality according to claim 1, wherein after the step of performing quality detection on each field according to the at least one detection rule corresponding to each field, the method further comprises:

if the detection result of the number of enumerated values is contained in the detection result of the N fields, judging whether the number of the enumerated values meets a first preset condition or not;

if yes, acquiring an enumeration value distribution proportion according to the enumeration value number;

and if the distribution proportion of the enumeration values does not meet the preset proportion condition, judging that the field corresponding to the detection result containing the number of the enumeration values is abnormal.

3. The method for detecting data quality according to claim 1, wherein after the step of performing quality detection on each field according to the at least one detection rule corresponding to each field, the method further comprises:

if the detection results of the N fields comprise the detection results of the maximum value and the minimum value, judging whether the difference value of the maximum value and the minimum value meets a second preset condition;

if yes, obtaining field values of fields corresponding to the detection results containing the maximum value and the minimum value, and carrying out statistics according to intervals to obtain interval distribution data of the fields corresponding to the detection results containing the maximum value and the minimum value;

and if the interval distribution data does not meet the preset distribution, judging that the field corresponding to the detection result containing the maximum value and the minimum value is abnormal.

4. The method according to claim 1, wherein the metadata includes primary key information of the data to be detected, and after the quality detection is performed on each field according to the at least one detection rule corresponding to each field, the method further includes:

determining a field corresponding to the primary key information;

acquiring a repeated number detection result of a field corresponding to the primary key information;

and if the field corresponding to the primary key information has repeated field values, judging that the primary key information or the field corresponding to the primary key information is abnormal.

5. The method for detecting data quality according to claim 1, wherein the data types include: Boolean type, String type, date Datetime type, double-precision floating-point Double type, integer Bigint type, or precise numeric Decimal type.

6. The method according to claim 1, wherein a plurality of detection rules are extracted according to the test cases corresponding to each data type in different data types, and the detection rule base is established according to the plurality of detection rules.

7. An apparatus for detecting data quality, comprising:

an acquisition module, configured to acquire metadata of data to be detected, wherein the metadata comprises N fields of the data to be detected and the data type of each field, and N is a positive integer; the metadata is information used for describing data attributes of the data to be detected;

the search module is used for searching at least one detection rule corresponding to each field in a pre-established detection rule base according to a preset matching strategy and the data type of each field; the detection rules in the detection rule base comprise rule names, rule templates and rule expected values;

the detection module is used for respectively carrying out quality detection on each field according to at least one detection rule corresponding to each field so as to obtain and display the detection results of the N fields; the detection module is specifically configured to: merging the rule templates of the detection rules corresponding to each field to obtain a merged rule template; detecting the N fields by using the merging rule template to obtain a plurality of detection results, wherein the plurality of detection results correspond to the detection rules corresponding to the N fields; respectively comparing whether the detection results are consistent with the expected values of the corresponding detection rules; if the detection result is consistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is normal; and if the detection result is inconsistent with the expected value of the corresponding detection rule, judging that the field corresponding to the detection result is abnormal.

8. The apparatus for detecting data quality as claimed in claim 7, wherein the detection module is further configured to:

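after the quality detection is respectively carried out on each field according to the at least one detection rule corresponding to each field, if the detection result of the number of enumerated values is contained in the detection results of the N fields, judging whether the number of the enumerated values meets a first preset condition;

if yes, acquiring an enumeration value distribution proportion according to the number of the enumerated values;

and if the enumeration value distribution proportion does not meet the preset proportion condition, judging that the field corresponding to the detection result containing the number of the enumerated values is abnormal.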

9. The apparatus for detecting data quality as claimed in claim 7, wherein the detection module is further configured to:

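after the quality detection is respectively carried out on each field according to the at least one detection rule corresponding to each field, if the detection results of the N fields comprise the detection results of the maximum value and the minimum value, judging whether the difference value between the maximum value and the minimum value meets a second preset condition;

if yes, obtaining field values of the fields corresponding to the detection results containing the maximum value and the minimum value, and carrying out statistics according to intervals to obtain interval distribution data of the fields corresponding to the detection results containing the maximum value and the minimum value;

and if the interval distribution data does not meet the preset distribution, judging that the field corresponding to the detection result containing the maximum value and the minimum value is abnormal.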

10. The apparatus for detecting data quality as claimed in claim 7, wherein the metadata includes primary key information of the data to be detected, and the detecting module is further configured to:

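after the quality detection is respectively carried out on each field according to the at least one detection rule corresponding to each field, determining the field corresponding to the primary key information;

acquiring a repeated number detection result of the field corresponding to the primary key information;

and if the field corresponding to the primary key information has repeated field values, judging that the primary key information or the field corresponding to the primary key information is abnormal.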

11. The apparatus for detecting data quality according to claim 7, wherein the data types include: Boolean type, String type, date Datetime type, double-precision floating-point Double type, integer Bigint type, or precise numeric Decimal type.

12. The apparatus according to claim 7, wherein a plurality of detection rules are extracted according to the test cases corresponding to each data type in different data types, and the detection rule base is established according to the plurality of detection rules.